INF442 - Density Estimation
- Download and unzip the archive `INF442-td5-1-handin.zip`. It contains several source files to get you started, that is a `Makefile` to compile the various tests, a `quiz` folder with a Python starter-kit script to do the quiz, as well as various datasets (`.csv` files in the `csv` folder).
- To get full credit, you need to solve and submit all exercises of Section 2. Section 3 is optional.
Context
- This TD corresponds to lecture 5.
- As in the last TD, and going forward, we structure the project into separate compilation units, organized into folders, e.g., `point/`.
- We will implement three different kernel density estimators and integrate them into a common class structure based on inheritance.
- Then, we will introduce the mean shift technique, which can help us improve the quality of the estimates.
Structure
The TD is structured as follows:
- Section 0 describes how to use `assert` in C++;
- Section 1 describes how to run kernel density estimation algorithms using scikit-learn to analyze the data for the quiz;
- Section 2 is the main section of the TD: we will implement the flat, Gaussian, and \(k\)-nn kernels;
- Section 3 is optional: you can choose to implement the mean shift algorithm to de-noise a dataset prior to applying kernel density estimators.
0. Assertions in C++
Testing C++ code
One way of writing safe code is to test some fundamental invariants or assumptions (e.g., that an index does not exceed the length of an array). One standard approach to this is via the function-like macro `assert`. You can read all about it in its official documentation.
Its basic usage is `assert(expr)`, where `expr` is some C++ expression that evaluates to either true or false, like `1 + 1 == 2`. If the expression evaluates to true, then the `assert` does nothing. If the expression evaluates to false, then the program is terminated with an error message when the `assert` is evaluated. This error message includes the expression that evaluated to false. Thus, we can use assertions to check for errors in the code during execution. You are free to add your own assertions to your code to find bugs.
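For instance (a minimal standalone illustration; the array and index are made up for the example):

```cpp
#include <cassert>

int main() {
    const int length = 3;
    double values[length] = {1.0, 2.0, 3.0};
    int i = 2;
    assert(i >= 0 && i < length);  // true: execution continues normally
    assert(values[i] == 3.0);      // also true
    // assert(1 + 1 == 3);         // false: would terminate with an error message
    return 0;
}
```

Note that assertions are compiled out entirely when the program is built with `-DNDEBUG`, so they cost nothing in release builds.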
1. Data Analysis
Density estimation with Python
We will explore several generated 2D datasets:
- `double_spiral.csv` (necessary for the quiz);
- `normal.csv` (bonus);
- `galaxies.csv` (bonus);
- `mystery.csv` (bonus);
- `galaxies_3D.csv` (for Section 3);
as well as the famous MNIST dataset of handwritten digits (and yes, it’s normal that you don’t have it in `csv/`; it will be downloaded on the fly):
![](resources/mnist.jpg)
First, install the necessary libraries:
$ pip3 install --user --only-binary=:all: numpy pandas scikit-learn matplotlib
$ pip3 install --user loguru
Then use the Python script `kernel_density.py` in the `quiz/` directory to analyze the datasets and answer the questions in the quiz for TD5 on Moodle. The script takes three arguments: `csv_file`, `kernel` and `bandwidth`. The first argument determines which `.csv` file to open, in our case `../csv/double_spiral.csv`. For the MNIST dataset, just replace this argument by `digits`. In the second argument, `kernel`, you can give any kernel supported by scikit-learn, although:
- some won’t work with `digits` (see below), since the `sample` method is implemented only for `gaussian` and `tophat`;
- we will stick to `gaussian` and `tophat` (corresponding to `flat` in your lectures and in what follows) to answer the quiz.
The third argument, `bandwidth`, defines the bandwidth of your chosen kernel. Example usage:
$ python3 kernel_density.py ../csv/double_spiral.csv tophat 1
This should save the following plot to a file named `csv_file_kernel_bandwidth.png`:
![](resources/spiral_tophat_1.png)
Under the hood, we use `KernelDensity` from scikit-learn as simply as:
kde = KernelDensity(kernel=kernel, bandwidth=bandwidth).fit(data)
The first plot is just a scatter plot of the points in the csv file:
ax[0].scatter(data[:, 0], data[:, 1], s=0.3)
The second plot is a heatmap of the estimated density, evaluated on 10,000 new points spanning the original support of the data:
xx, yy = np.mgrid[data[:, 0].min():data[:, 0].max():100j,
                  data[:, 1].min():data[:, 1].max():100j]
For the MNIST dataset, we take another approach. MNIST handwritten digits are images composed of \(8 \times 8\) pixels = 64 “features” (recall that the csv files used above have only 2 features). To visually represent the quality of the estimated density, we will simply plot examples from each digit alongside examples from the estimated density. To “classify” the drawn images in digits, we rely on 1-Nearest Neighbor:
knn_classif = KNeighborsClassifier(n_neighbors=1).fit(data, np.eye(10)[digits.target])
predictions = np.argmax(knn_classif.predict(new_data), axis=1)
Using
$ python3 kernel_density.py digits tophat 1
you should obtain the following plot:
![](resources/mnist_tophat_1.png)
2. Kernel Density Estimators
The code archive you have downloaded contains, among other files, `kernel.hpp` and `kernel.cpp` in the `kernel/` folder, which describe the base class for kernel estimators.
Kernel estimators are non-parametric density estimators (see the Data Science booklet for additional details). Recall that, given a “training” dataset of \(n\) samples \((p_i)_{i=1}^n\) drawn from the unknown distribution \(f\), the estimated density is given by:
\[\hat{f}_n(x) = \dfrac{c_{k, d}}{n \sigma^d} \sum_{i=1}^n k \left( \dfrac{||x - p_i||_2^2}{\sigma^2} \right).\]
This base class has the pure virtual function `double density(const point &p)`, which should return the estimated density \(\hat{f}_n(p)\) at the given point `p`.
We will implement three different kernels:
- flat
- Gaussian
- \(k\)-nearest neighbors
The first two (flat and Gaussian) are instances of radial kernels, i.e., they only depend on the distance to a single data point at a time. The third (\(k\)-nearest neighbors) is not radial as it depends on several (i.e., \(k\)) nearest data points.
Hence, we will group the flat and Gaussian kernels as derived classes
of an intermediate class radial
. The class diagram is as
follows:
             flat
            /
      radial
     /      \
kernel       gaussian
     \
      knn
In the provided code, this hierarchy is also represented by the folder structure.
The base class, `kernel`, has already been implemented for you (see `kernel/kernel.hpp` and `kernel/kernel.cpp`).
The header files for `radial`, `flat`, `gaussian` and `knn` are provided in the subfolders of `kernel/`.
To handle our samples \((p_i)_{i=1}^n\), we will reuse the classes `point` and `cloud` from the previous TDs (see the eponymous folders and files).
2.1 The `radial` class
The `radial` class needs a constructor. In particular, it needs to set the data member `double bandwidth` and its `cloud` of points, namely `*data`, as suggested by the inheritance from `kernel` (hint: you can use `kernel`’s constructor, as suggested on p. 79 of your Cpp booklet). Implement `radial`’s constructor. `grader` will not compile until you have done so.
Still in the `radial` class, watch for the declaration of two pure virtual member functions (see `radial.hpp` and the relevant section in your Cpp booklet):
- `double volume()`: returns the volume of the unit ball weighted with the kernel. This will be \(\dfrac{\pi^{d/2}}{\Gamma(d/2 + 1)}\) for the flat kernel and \((2\pi)^{d/2}\) for the Gaussian kernel. This volume is equal to \(\dfrac{1}{c_{k, d}}\).
- `double profile(double t)`: returns the kernel profile function \(k\) evaluated at the given real number \(t\). For the flat kernel, this will be \(1\) if \(t \leq 1\) and \(0\) otherwise. For the Gaussian kernel, this will be \(e^{-t/2}\).
The use of pure virtual member functions makes `radial` an abstract class. In order to be able to instantiate an actual object, we must have derived classes which “implement” (override) these functions, in this case `flat` and `gaussian`. We will implement these classes, and in particular their methods `volume` and `profile`, in Sections 2.1.1 and 2.1.2 respectively.
Notice that in `kernel.hpp` the `density` member function is a pure virtual function. Assuming that `volume` and `profile` are already implemented, implement `density` in `radial.cpp`. Recall that the density for radial kernels is defined as \[\hat{f}_n(x) = \dfrac{c_{k,d}}{n\sigma^d} \sum_{i=1}^n k\left(\dfrac{\lVert x - p_i\rVert_2^2}{\sigma^2} \right)\] where the \(p_i\) are the data points, \(k\) is the kernel profile function, \(c_{k,d}\) is the inverse of the `volume`, and \(\sigma\) is the chosen `bandwidth`.
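To make the formula concrete, here is a standalone sketch that computes \(\hat{f}_n(x)\) for the Gaussian profile over plain vectors; in `radial::density` itself, the virtual calls `profile(...)` and `volume()` play the roles of the hard-coded exponential and \((2\pi)^{d/2}\) below, and the points come from the `cloud` member:

```cpp
#include <cmath>
#include <vector>

// Standalone illustration of the radial KDE formula with the Gaussian profile
// k(t) = exp(-t/2) and volume 1/c_{k,d} = (2*pi)^(d/2).
double radial_density_gaussian(const std::vector<std::vector<double>> &pts,
                               const std::vector<double> &x, double sigma) {
    const double pi = 3.14159265358979323846;
    const std::size_t n = pts.size(), d = x.size();
    double sum = 0.0;
    for (const auto &p : pts) {
        double sq = 0.0;  // squared Euclidean distance ||x - p_i||^2
        for (std::size_t j = 0; j < d; j++)
            sq += (x[j] - p[j]) * (x[j] - p[j]);
        sum += std::exp(-sq / (2.0 * sigma * sigma));  // profile(||x - p_i||^2 / sigma^2)
    }
    const double vol = std::pow(2.0 * pi, d / 2.0);    // 1 / c_{k,d}
    return sum / (vol * n * std::pow(sigma, static_cast<double>(d)));
}
```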
You can test your class by doing the usual `make grader` and executing `./grader 1` in the terminal (although passing the tests does not absolutely guarantee correctness: recall that we add supplementary tests after the deadline).
Upload your file `radial.cpp` here:
2.1.1 Flat Kernel
Implement the class `flat`, derived from `radial`, by implementing the member functions `volume` and `profile` in `flat.cpp`.
Recall that `volume` should return \(\dfrac{\pi^{d/2}}{\Gamma(d/2 + 1)}\), and `profile` should return \(1\) if \(t \leq 1\) and \(0\) otherwise.
The \(\Gamma\) function can be computed knowing a few tricks, or `std::tgamma` can be used.
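For instance, a minimal standalone sketch of that volume computation, with `d` the dimension of the data:

```cpp
#include <cmath>

// Volume of the d-dimensional unit ball: pi^(d/2) / Gamma(d/2 + 1).
double unit_ball_volume(int d) {
    const double pi = 3.14159265358979323846;
    return std::pow(pi, d / 2.0) / std::tgamma(d / 2.0 + 1.0);
}
```

As a sanity check, this gives \(\pi\) for \(d = 2\) and \(\frac{4}{3}\pi\) for \(d = 3\).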
You can test your class by doing the usual `make grader` and executing `./grader 2`.
Upload your file `flat.cpp` here:
2.1.2 Gaussian Kernel
Implement the class `gaussian`, derived from `radial`, by implementing the member functions `volume` and `profile`.
Recall that `volume` should return \((2\pi)^{d/2}\), and `profile` should return \(e^{-t/2}\) (you can use `std::exp(...)`).
Implement a member function `void guess_bandwidth()` which sets the bandwidth equal to \[1.06 \cdot \hat{\sigma} / n^{1/5},\] where \(\hat{\sigma}\) is the standard deviation of the dataset, which you can access via the newly added member function `standard_deviation` in the class `cloud` (see `cloud/cloud.cpp`).
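This is Silverman’s rule of thumb; as a standalone sketch (inside `guess_bandwidth` you would assign the result to the `bandwidth` member, with \(\hat{\sigma}\) coming from `standard_deviation` and \(n\) from the cloud):

```cpp
#include <cmath>

// Silverman's rule of thumb: 1.06 * sigma_hat / n^(1/5).
double rule_of_thumb_bandwidth(double sigma_hat, int n) {
    return 1.06 * sigma_hat / std::pow(n, 0.2);
}
```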
You can test your class by doing the usual `make grader` and executing `./grader 3`.
Upload your file `gaussian.cpp` here:
2.2 \(k\)-Nearest Neighbor Kernel
The return value of the `density` function of the `knn` kernel should be \[\hat{f}_n(x) = \frac{k}{2nVd_k(x)},\] where \(d_k(x)\) is the distance of the given point \(x\) to its \(k\)-th nearest neighbor, and \(k\) and \(V\) are two parameters set in `knn`’s constructor.
A convenient way of identifying the distance to the \(k\)-th nearest neighbor of a point \(x\) is to add a function `k_dist_knn` to the class `cloud` which finds the \(k\) nearest neighbors of \(x\) and returns \(d_k(x)\). Implement `k_dist_knn` in `cloud.cpp`.
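Once the distances from \(x\) to all points of the cloud are known, \(d_k(x)\) is just the \(k\)-th smallest of them. A standalone sketch using a partial sort (the vector interface is illustrative; your method will first compute the distances from its `point` argument to the cloud’s points):

```cpp
#include <algorithm>
#include <vector>

// d_k(x): the k-th smallest value among dists, where dists[i] = ||x - p_i||.
// k is 1-based (k = 1 gives the distance to the nearest neighbor).
double k_dist(std::vector<double> dists, int k) {
    // After nth_element, position k-1 holds the k-th smallest distance.
    std::nth_element(dists.begin(), dists.begin() + (k - 1), dists.end());
    return dists[k - 1];
}
```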
If you want to do the optional Section 3 about mean shift, you should instead implement a `knn` method which returns a pointer to an array of points (the \(k\) nearest neighbors in the `cloud` to a point given as input), and call `knn` in `k_dist_knn`.
You can test your class by doing the usual `make grader` and executing `./grader 4`.
Upload your file `cloud.cpp` here:
Implement the class `knn`, derived from `kernel`, for the \(k\)-nearest neighbor kernel by implementing the member functions `volume` and `density`.
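For `density`, here is a sketch of what the computation boils down to, assuming data members named `k`, `V` and `data` and the `k_dist_knn` method from the previous exercise (all of these names and signatures are assumptions; match them to the provided headers):

```cpp
// Sketch only: k, V, data, get_n() and k_dist_knn(point, int) are assumed names.
double knn::density(const point &p) {
    double dk = data->k_dist_knn(p, k);         // d_k(x): distance to k-th nearest neighbor
    return k / (2.0 * data->get_n() * V * dk);  // k / (2 n V d_k(x))
}
```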
You can test your class by doing the usual `make grader` and executing `./grader 5`.
Upload your file `knn.cpp` here:
Note: we call the parameter \(V\) the total volume, which we set manually for simplicity. For \(V=1\), the kernel’s density is not strictly speaking a density because its integral is not necessarily equal to \(1\).
3. Mean Shift (Optional)
Mean Shift
It is often necessary and desirable to remove noise from a data set. For that, multiple possibilities exist. We will implement one of them, a mean shift algorithm based on the \(k\)-nearest neighbor kernel.
One iteration of the algorithm updates the position of every data point to the mean of its \(k\) nearest neighbors.
Complete the `shift` method in `cloud.cpp` which implements one iteration of this algorithm and returns the new coordinates of the points in the cloud as a pointer to an array of `point`s (see how it is used in `cloud::meanshift`).
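As a standalone sketch of what one iteration computes (your `shift` will do the same through `cloud`’s points and the `knn` method; whether a point counts among its own nearest neighbors should match your `knn` implementation):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

using Point = std::vector<double>;

// One mean-shift iteration: each point moves to the mean of its k nearest
// neighbors, all distances being measured in the original (pre-iteration) cloud.
std::vector<Point> shift_once(const std::vector<Point> &pts, std::size_t k) {
    if (pts.empty()) return {};
    const std::size_t n = pts.size(), d = pts[0].size();
    std::vector<Point> out(n, Point(d, 0.0));
    for (std::size_t i = 0; i < n; i++) {
        // squared distances from pts[i] to every point, paired with indices
        std::vector<std::pair<double, std::size_t>> dist(n);
        for (std::size_t j = 0; j < n; j++) {
            double sq = 0.0;
            for (std::size_t c = 0; c < d; c++)
                sq += (pts[i][c] - pts[j][c]) * (pts[i][c] - pts[j][c]);
            dist[j] = {sq, j};
        }
        std::sort(dist.begin(), dist.end());   // nearest first
        const std::size_t m = std::min(k, n);  // the k nearest (here a point is its own neighbor)
        for (std::size_t a = 0; a < m; a++)
            for (std::size_t c = 0; c < d; c++)
                out[i][c] += pts[dist[a].second][c] / static_cast<double>(m);
    }
    return out;
}
```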
You can test your implementation by doing the usual `make grader` and executing `./grader 6`.
This runs one iteration of the mean shift algorithm on the \(2\)-dimensional dataset `normal.csv` with \(k = 3\) (see `grading.cpp`).
You are free to change `grading.cpp` to print the new coordinates to a file, e.g., using standard redirection (`>`) and setting `verbose` to `true`.
For reference, the first 10 data points of `normal.csv` after running one iteration with \(k = 3\) should be equal (or close) to:
-0.742035 1.09605
0.025417 0.1229
0.845654 -0.97052
0.04139 0.921662
0.00173367 -0.848124
0.095338 1.4491
2.13119 -0.484897
-0.302607 0.184375
1.26503 -0.312559
-0.467785 0.744896
Upload your file `cloud.cpp` here:
Use the estimators you implemented on the datasets in conjunction with the mean shift algorithm with different parameters and compare the results.
The mean shift algorithm can be particularly useful for higher-dimensional datasets. To experiment with this, you can use the 3D dataset `galaxies_3D.csv`. For visualization of 3D data, use the program MeshLab, which is installed on the computers in the lab.
After opening the dataset `galaxies_3D.csv`, you will see something like this:
Try using the mean shift algorithm on this dataset and compare the resulting set of data points with the original. A few iterations of the algorithm with a parameter of \(k = 30\) should yield noticeable results.