
INF442 - Density Estimation

Structure

The TD is structured as follows:

  0. Assertions in C++
  1. Data Analysis
  2. Kernel Density Estimators
  3. Mean Shift (Optional)

0. Assertions in C++

Testing C++ code

One way of writing safe code is to test some fundamental invariants or assumptions (e.g., that an index does not exceed the length of the array). One standard approach to this is via the function-like macro assert. You can read all about it in its official documentation.

Its basic usage is assert(expr), where expr is a C++ expression that evaluates to either true or false, such as 1 + 1 == 2. If the expression evaluates to true, the assert does nothing. If it evaluates to false, the program is terminated with an error message that includes the failing expression. Assertions thus let us check for errors in the code during execution. You are free to add your own assertions to your code to help find bugs.
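For illustration, here is a minimal, self-contained example (the array and index are ours, not taken from the TD code):

#include <cassert>

int main() {
    int values[3] = {1, 2, 3};
    int index = 2;
    assert(index >= 0 && index < 3);  // true: nothing happens, execution continues
    assert(values[index] == 3);       // also true
    // assert(index < 2);             // false: would abort and print the failing expression
    return 0;
}

Note that assertions are disabled entirely when compiling with -DNDEBUG, so they should complement, not replace, proper error handling.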

1. Data Analysis

Density estimation with Python

We will explore several generated data sets (all 2D except the last):

  • double_spiral.csv (necessary for the quiz);
  • normal.csv (bonus);
  • galaxies.csv (bonus);
  • mystery.csv (bonus);
  • galaxies_3D.csv (for Section 3).

as well as the famous MNIST dataset of handwritten digits (and yes, it's normal that you don't have it in csv/; it will be downloaded on the fly).

First, install the necessary libraries:

$ pip3 install --user --only-binary=:all: numpy pandas scikit-learn matplotlib
$ pip3 install --user loguru

Then use the Python script kernel_density.py in the quiz/ directory to analyze the datasets and answer the questions in the quiz for TD5 on Moodle. The script takes three arguments: csv_file, kernel, and bandwidth. The first argument determines which .csv file to open, in our case ../csv/double_spiral.csv; for the MNIST dataset, just replace this argument by digits. For the second argument, kernel, you can give any kernel supported by scikit-learn.

The third argument, bandwidth, defines the bandwidth of your chosen kernel. Example usage:

$ python3 kernel_density.py ../csv/double_spiral.csv tophat 1

This should save the resulting figure to a file named csv_file_kernel_bandwidth.png, where the three parts of the name are the given arguments.

Under the hood, we use KernelDensity from scikit-learn as simply as:

kde = KernelDensity(kernel=kernel, bandwidth=bandwidth).fit(data)

The first plot is just a scatter plot of the points in the csv file:

ax[0].scatter(data[:, 0], data[:, 1], s=0.3)

The second plot is a heatmap of the estimated density, evaluated on 10,000 new points spanning the original support of the data:

# build a 100x100 grid of evaluation points (10,000 in total) spanning the data range
xx, yy = np.mgrid[data[:, 0].min():data[:, 0].max():100j,
                  data[:, 1].min():data[:, 1].max():100j]

For the MNIST dataset, we take another approach. MNIST handwritten digits are images of \(8 \times 8\) pixels = 64 “features” (recall that the csv files used above have only 2 features). To visually assess the quality of the estimated density, we simply plot examples of each digit alongside samples drawn from the estimated density. To “classify” the sampled images as digits, we rely on a 1-nearest-neighbor classifier:

# 1-NN classifier fitted on one-hot encoded labels;
# argmax over the predicted one-hot vectors recovers the digit
knn_classif = KNeighborsClassifier(n_neighbors=1).fit(data, np.eye(10)[digits.target])
predictions = np.argmax(knn_classif.predict(new_data), axis=1)

Using

$ python3 kernel_density.py digits tophat 1

you should obtain the analogous figure for the MNIST digits.

2. Kernel Density Estimators

The code archive you have downloaded contains, among other files, kernel.hpp and kernel.cpp in the kernel/ folder, which describe the base class for kernel estimators.

Kernel estimators are non-parametric density estimators (see the Data Science booklet for additional details). Recall that, given a “training” dataset of \(n\) samples \((p_i)_{i=1}^n\) drawn from the unknown distribution \(f\), the estimated density is given by:

\[\hat{f}_n(x) = \dfrac{c_{k, d}}{n \sigma^d} \sum_{i=1}^n k \left( \dfrac{||x - p_i||_2^2}{\sigma^2} \right).\]

This base class has the pure virtual member function double density(const point &p), which should return the estimated density \(\hat{f}_n(p)\) at the query point p.

We will implement three different kernels:

  1. flat
  2. Gaussian
  3. \(k\)-nearest neighbors

The first two (flat and Gaussian) are instances of radial kernels, i.e., they only depend on the distance to a single data point at a time. The third (\(k\)-nearest neighbors) is not radial as it depends on several (i.e., \(k\)) nearest data points.

Hence, we will group the flat and Gaussian kernels as derived classes of an intermediate class radial. The class diagram is as follows:

                  flat
                /
         radial
       /        \
kernel            gaussian
       \
         knn

In the provided code, this hierarchy is also represented by the folder structure.

The base class, kernel, has already been implemented for you (see kernel/kernel.hpp and kernel/kernel.cpp).

The header files for radial, flat, gaussian and knn are provided in the subfolders of kernel/.

To handle our samples \((p_i)_{i=1}^n\), we will reuse the classes point and cloud from the previous TDs (see the eponymous folders and files).

2.1 The radial class

The radial class needs a constructor. In particular, it needs to set the data member double bandwidth as well as the cloud of points *data inherited from kernel (hint: you can use kernel’s constructor, as suggested on p. 79 of your Cpp booklet). Implement radial’s constructor; the grader will not compile until you have done so.
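A minimal sketch of what this could look like, assuming kernel exposes a constructor taking a cloud* (check the exact signatures in kernel.hpp and radial.hpp, which may differ):

radial::radial(cloud *data, double bandwidth)
    : kernel(data), bandwidth(bandwidth) {
    // kernel's constructor stores the pointer to the data cloud;
    // we only have to initialize our own bandwidth member
}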

Still in the radial class, note the declaration of two pure virtual member functions, volume and profile (see radial.hpp and the relevant section in your Cpp booklet).

The use of pure virtual member functions makes radial an abstract class. In order to instantiate an actual object, we must have derived classes which “implement” (override) these functions, in this case flat and gaussian. We will implement these classes, and in particular their methods volume and profile, in Sections 2.1.1 and 2.1.2 respectively.

Notice that in kernel.hpp the density member function is a pure virtual function. Assuming that volume and profile are already implemented, implement density in radial.cpp. Recall that the density for radial kernels is defined as \[\hat{f}_n(x) = \dfrac{c_{k,d}}{n\sigma^d} \sum_{i=1}^n k\left(\dfrac{\lVert x - p_i\rVert_2^2}{\sigma^2} \right)\] where the \(p_i\) are the data points, \(k\) is the kernel profile function, \(c_{k,d}\) is the inverse of the volume, and \(\sigma\) is the chosen bandwidth.
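The following sketch illustrates one way to write it, assuming cloud exposes get_n() and get_point(i) and point exposes dist(...) and get_dim(), as in previous TDs; these accessor names are assumptions, so adapt them to your actual headers:

#include <cmath>

double radial::density(const point &p) const {
    double sum = 0.0;
    for (int i = 0; i < data->get_n(); i++) {
        double dist = p.dist(data->get_point(i));             // ||x - p_i||_2
        sum += profile((dist * dist) / (bandwidth * bandwidth));
    }
    // c_{k,d} is the inverse of volume(), and sigma is the bandwidth
    return sum / (volume() * data->get_n() * std::pow(bandwidth, p.get_dim()));
}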

You can test your class by doing the usual make grader and executing ./grader 1 in the terminal (although passing the tests will not absolutely guarantee correctness - recall that we add supplementary tests after the deadline).

Upload your file radial.cpp here:


2.1.1 Flat Kernel

Implement the class flat derived from radial by implementing the member functions volume and profile in flat.cpp.

Recall that volume should return \(\dfrac{\pi^{d/2}}{\Gamma(d/2 + 1)}\) (the volume of the unit ball in dimension \(d\)), and profile should return \(1\) if \(t \leq 1\) and \(0\) otherwise.

The \(\Gamma\) function can be computed knowing a few tricks, or std::tgamma can be used.
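A possible sketch (the dimension accessor is a guess, and M_PI is not strictly standard C++, so define the constant yourself if needed):

#include <cmath>

double flat::volume() const {
    double d = data->get_point(0).get_dim();                  // assumed accessor for d
    return std::pow(M_PI, d / 2.0) / std::tgamma(d / 2.0 + 1.0);
}

double flat::profile(double t) const {
    return (t <= 1.0) ? 1.0 : 0.0;
}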

You can test your class by doing the usual make grader and executing ./grader 2.

Upload your file flat.cpp here:


2.1.2 Gaussian Kernel

Implement the class gaussian derived from radial by implementing the member functions volume and profile.

Recall that volume should return \((2\pi)^{d/2}\), and profile should return \(e^{-t/2}\) (you can use std::exp(...)).

Implement a member function void guess_bandwidth() which sets the bandwidth equal to \[1.06 \cdot \hat{\sigma} / n^{1/5}\] where \(\hat{\sigma}\) is the standard deviation of the dataset, which you can access via the newly added member function standard_deviation in the class cloud (see cloud/cloud.cpp).
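A corresponding sketch, with the same assumed accessors as above (standard_deviation is the cloud member function mentioned in the text; get_n and get_dim are assumptions):

#include <cmath>

double gaussian::volume() const {
    double d = data->get_point(0).get_dim();                  // assumed accessor for d
    return std::pow(2.0 * M_PI, d / 2.0);
}

double gaussian::profile(double t) const {
    return std::exp(-t / 2.0);
}

void gaussian::guess_bandwidth() {
    // Silverman's rule of thumb: 1.06 * sigma_hat * n^(-1/5)
    bandwidth = 1.06 * data->standard_deviation() / std::pow(data->get_n(), 0.2);
}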

You can test your class by doing the usual make grader and executing ./grader 3.

Upload your file gaussian.cpp here:


2.2 \(k\)-Nearest Neighbor Kernel

The return value of the density function of the knn kernel should be \[\hat{f}_n(x) = \frac{k}{2nVd_k(x)}\] where \(d_k(x)\) is the distance of the given point \(x\) to its \(k\)-th nearest neighbor, and \(k\) and \(V\) are two parameters set in knn’s constructor.

A convenient way of identifying the distance to the \(k\)-th nearest neighbor of point \(x\) is to add a function k_dist_knn to the class cloud which finds the \(k\) nearest neighbors of \(x\) and returns \(d_k(x)\). Implement k_dist_knn in cloud.cpp.
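For instance, a simple O(n log n) version collects all distances and sorts them (n and points are assumed internal members of cloud, so adapt the names to your class):

#include <algorithm>
#include <vector>

double cloud::k_dist_knn(const point &p, int k) const {
    std::vector<double> dists;
    for (int i = 0; i < n; i++)              // n: assumed number of points
        dists.push_back(p.dist(points[i]));  // points: assumed internal array
    std::sort(dists.begin(), dists.end());
    return dists[k - 1];                     // k-th smallest distance
}

Note that if p itself belongs to the cloud, its zero distance to itself is counted among the k; check which convention the grader expects.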

If you want to do the optional Section 3 about Meanshift, you should rather implement the knn method which returns a pointer to an array of points (the k nearest neighbors in the cloud to a point given as input), and call knn in k_dist_knn.

You can test your class by doing the usual make grader and executing ./grader 4.

Upload your file cloud.cpp here:


Implement the class knn, derived from kernel, for the \(k\)-nearest neighbor kernel by implementing the member functions volume and density.
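A sketch of density following the formula above (k and V are assumed to be data members set in knn’s constructor; get_n and k_dist_knn as before):

double knn::density(const point &p) const {
    // f_hat(x) = k / (2 * n * V * d_k(x))
    return k / (2.0 * data->get_n() * V * data->k_dist_knn(p, k));
}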

You can test your class by doing the usual make grader and executing ./grader 5.

Upload your file knn.cpp here:


Note: we call the parameter \(V\) the total volume, which we set manually for simplicity. For \(V=1\), the kernel’s density is not strictly speaking a density because its integral is not necessarily equal to \(1\).

3. Mean Shift (Optional)

Mean Shift

It is often necessary and desirable to remove noise from a data set. For that, multiple possibilities exist. We will implement one of them, a mean shift algorithm based on the \(k\)-nearest neighbor kernel.

One iteration of the algorithm updates the position of every data point to the mean of its \(k\) nearest neighbors.

Complete the shift method in cloud.cpp, which implements one iteration of this algorithm and returns the new coordinates of the points in the cloud as a pointer to an array of points (see how it is used in cloud::meanshift).
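A sketch under the same assumptions as before (the knn helper from Section 2.2 and hypothetical coordinate accessors); since every shifted position must be computed from the old coordinates, the new positions are written to a fresh array:

point *cloud::shift(int k) const {
    point *result = new point[n];                    // n, points, dim: assumed members
    for (int i = 0; i < n; i++) {
        point *neighbors = knn(points[i], k);        // k nearest neighbors of point i
        for (int j = 0; j < dim; j++) {
            double mean = 0.0;
            for (int m = 0; m < k; m++)
                mean += neighbors[m].get_coord(j);   // assumed accessor
            result[i].set_coord(j, mean / k);        // assumed accessor
        }
        delete[] neighbors;
    }
    return result;
}

Depending on your point class, you may need to initialize the dimension of each point in result before setting its coordinates.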

You can test your implementation by doing the usual make grader and executing ./grader 6.

This runs one iteration of the mean shift algorithm on the \(2\)-dimensional dataset normal.csv with \(k = 3\) (see grading.cpp).

You are free to change grading.cpp to print the new coordinates (e.g., by setting verbose to true) and save them to a file using standard output redirection (>).

For reference, the first 10 data points of normal.csv after running one iteration with \(k = 3\) should be equal (or close) to:

-0.742035 1.09605 
0.025417 0.1229 
0.845654 -0.97052 
0.04139 0.921662 
0.00173367 -0.848124 
0.095338 1.4491 
2.13119 -0.484897 
-0.302607 0.184375 
1.26503 -0.312559 
-0.467785 0.744896 

Upload your file cloud.cpp here:


Use the estimators you implemented, in combination with the mean shift algorithm, on the datasets with different parameters, and compare the results.

The mean shift algorithm can be particularly useful for higher-dimensional datasets. To experiment with this, you can use the 3D dataset galaxies_3D.csv. For visualization of 3D data, use the program MeshLab, which is installed on the computers in the lab.

After opening the dataset galaxies_3D.csv in MeshLab, you can inspect the original point cloud.

Try using the mean shift algorithm on this dataset and compare the resulting set of data points with the original. A few iterations of the algorithm with parameter \(k=30\) should yield noticeable results.