
INF442 - Density Estimation

Structure

The TD is structured as follows:

  0. Assertions in C++
  1. Data Analysis
  2. Kernel Density Estimators
  3. Mean Shift (Optional)

0. Assertions in C++

Testing C++ code

One way of writing safe code is to test some fundamental invariants or assumptions (e.g., that an index does not exceed the length of the array). One standard approach to this is via the function-like macro assert. You can read all about it in its official documentation.

Its basic usage is assert(expr), where expr is a C++ expression that evaluates to either true or false, such as 1 + 1 == 2. If the expression evaluates to true, the assert does nothing. If it evaluates to false, the program is terminated with an error message that includes the failing expression. Assertions thus let us check for errors in the code during execution. You are free to add your own assertions to your code to help find bugs.
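For illustration, here is a minimal, self-contained example (the array and index are ours, not taken from the TD code):

#include <cassert>

int main() {
    int values[3] = {1, 2, 3};
    int index = 2;
    assert(index >= 0 && index < 3);  // true: nothing happens, execution continues
    assert(values[index] == 3);       // also true
    // assert(index < 2);             // false: would abort and print the failing expression
    return 0;
}

Note that assertions are disabled entirely when compiling with -DNDEBUG, so they should complement, not replace, proper error handling.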

1. Data Analysis

Density estimation with Python

We will explore several generated data sets (all 2D except the last):

  • double_spiral.csv (necessary for the quiz);
  • normal.csv (bonus);
  • galaxies.csv (bonus);
  • mystery.csv (bonus);
  • galaxies_3D.csv (for Section 3).

as well as the famous MNIST dataset of handwritten digits (and yes, it's normal that you don't have it in csv/; it will be downloaded on the fly).

First, install the necessary libraries:

$ pip3 install --user --only-binary=:all: numpy pandas scikit-learn matplotlib
$ pip3 install --user loguru

Then use the Python script kernel_density.py in the quiz/ directory to analyze the datasets and answer the questions in the quiz for TD5 on Moodle. The script takes three arguments: csv_file, kernel, and bandwidth. The first argument determines which .csv file to open, in our case ../csv/double_spiral.csv; for the MNIST dataset, just replace this argument by digits. For the second argument, kernel, you can give any kernel supported by scikit-learn.

The third argument, bandwidth, defines the bandwidth of your chosen kernel. Example usage:

$ python3 kernel_density.py ../csv/double_spiral.csv tophat 1

This should save the resulting figure to a file named csv_file_kernel_bandwidth.png, where the three parts of the name are the given arguments.

Under the hood, we use KernelDensity from scikit-learn as simply as:

kde = KernelDensity(kernel=kernel, bandwidth=bandwidth).fit(data)

The first plot is just a scatter plot of the points in the csv file:

ax[0].scatter(data[:, 0], data[:, 1], s=0.3)

The second plot is a heatmap of the estimated density, evaluated on 10,000 new points spanning the original support of the data:

# build a 100x100 grid of evaluation points (10,000 in total) spanning the data range
xx, yy = np.mgrid[data[:, 0].min():data[:, 0].max():100j,
                  data[:, 1].min():data[:, 1].max():100j]

For the MNIST dataset, we take another approach. MNIST handwritten digits are images of \(8 \times 8\) pixels = 64 “features” (recall that the csv files used above have only 2 features). To visually assess the quality of the estimated density, we simply plot examples of each digit alongside samples drawn from the estimated density. To “classify” the sampled images as digits, we rely on a 1-nearest-neighbor classifier:

# 1-NN classifier fitted on one-hot encoded labels;
# argmax over the predicted one-hot vectors recovers the digit
knn_classif = KNeighborsClassifier(n_neighbors=1).fit(data, np.eye(10)[digits.target])
predictions = np.argmax(knn_classif.predict(new_data), axis=1)

Using

$ python3 kernel_density.py digits tophat 1

you should obtain the analogous figure for the MNIST digits.

2. Kernel Density Estimators

The code archive you have downloaded contains, among other files, kernel.hpp and kernel.cpp in the kernel/ folder, which describe the base class for kernel estimators.

Kernel estimators are non-parametric density estimators (see the Data Science booklet for additional details). Recall that, given a “training” dataset of \(n\) samples \((p_i)_{i=1}^n\) drawn from the unknown distribution \(f\), the estimated density is given by:

\[\hat{f}_n(x) = \dfrac{c_{k, d}}{n \sigma^d} \sum_{i=1}^n k \left( \dfrac{||x - p_i||_2^2}{\sigma^2} \right).\]

This base class has the pure virtual member function double density(const point &p), which should return the estimated density \(\hat{f}_n(p)\) at the query point p.

We will implement three different kernels:

  1. flat
  2. Gaussian
  3. \(k\)-nearest neighbors

The first two (flat and Gaussian) are instances of radial kernels, i.e., they only depend on the distance to a single data point at a time. The third (\(k\)-nearest neighbors) is not radial as it depends on several (i.e., \(k\)) nearest data points.

Hence, we will group the flat and Gaussian kernels as derived classes of an intermediate class radial. The class diagram is as follows:

                  flat
                /
         radial
       /        \
kernel            gaussian
       \
         knn

In the provided code, this hierarchy is also represented by the folder structure.

The base class, kernel, has already been implemented for you (see kernel/kernel.hpp and kernel/kernel.cpp).

The header files for radial, flat, gaussian and knn are provided in the subfolders of kernel/.

To handle our samples \((p_i)_{i=1}^n\), we will reuse the classes point and cloud from the previous TDs (see the eponymous folders and files).

2.1 The radial class

The radial class needs a constructor. In particular, it needs to set the data member double bandwidth as well as the cloud of points *data inherited from kernel (hint: you can use kernel’s constructor, as suggested on p. 79 of your Cpp booklet). Implement radial’s constructor; the grader will not compile until you have done so.
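A minimal sketch of what this could look like, assuming kernel exposes a constructor taking a cloud* (check the exact signatures in kernel.hpp and radial.hpp, which may differ):

radial::radial(cloud *data, double bandwidth)
    : kernel(data), bandwidth(bandwidth) {
    // kernel's constructor stores the pointer to the data cloud;
    // we only have to initialize our own bandwidth member
}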

Still in the radial class, note the declaration of two pure virtual member functions, volume and profile (see radial.hpp and the relevant section in your Cpp booklet).

The use of pure virtual member functions makes radial an abstract class. In order to instantiate an actual object, we must have derived classes which “implement” (override) these functions, in this case flat and gaussian. We will implement these classes, and in particular their methods volume and profile, in Sections 2.1.1 and 2.1.2 respectively.

Notice that in kernel.hpp the density member function is a pure virtual function. Assuming that volume and profile are already implemented, implement density in radial.cpp. Recall that the density for radial kernels is defined as \[\hat{f}_n(x) = \dfrac{c_{k,d}}{n\sigma^d} \sum_{i=1}^n k\left(\dfrac{\lVert x - p_i\rVert_2^2}{\sigma^2} \right)\] where the \(p_i\) are the data points, \(k\) is the kernel profile function, \(c_{k,d}\) is the inverse of the volume, and \(\sigma\) is the chosen bandwidth.
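The following sketch illustrates one way to write it, assuming cloud exposes get_n() and get_point(i) and point exposes dist(...) and get_dim(), as in previous TDs; these accessor names are assumptions, so adapt them to your actual headers:

#include <cmath>

double radial::density(const point &p) const {
    double sum = 0.0;
    for (int i = 0; i < data->get_n(); i++) {
        double dist = p.dist(data->get_point(i));             // ||x - p_i||_2
        sum += profile((dist * dist) / (bandwidth * bandwidth));
    }
    // c_{k,d} is the inverse of volume(), and sigma is the bandwidth
    return sum / (volume() * data->get_n() * std::pow(bandwidth, p.get_dim()));
}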

You can test your class by doing the usual make grader and executing ./grader 1 in the terminal (although passing the tests will not absolutely guarantee correctness - recall that we add supplementary tests after the deadline).

Upload your file radial.cpp here:


2.1.1 Flat Kernel

Implement the class flat derived from radial by implementing the member functions volume and profile in flat.cpp.

Recall that volume should return \(\dfrac{\pi^{d/2}}{\Gamma(d/2 + 1)}\) (the volume of the unit ball in dimension \(d\)), and profile should return \(1\) if \(t \leq 1\) and \(0\) otherwise.

The \(\Gamma\) function can be computed knowing a few tricks, or std::tgamma can be used.
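A possible sketch (the dimension accessor is a guess, and M_PI is not strictly standard C++, so define the constant yourself if needed):

#include <cmath>

double flat::volume() const {
    double d = data->get_point(0).get_dim();                  // assumed accessor for d
    return std::pow(M_PI, d / 2.0) / std::tgamma(d / 2.0 + 1.0);
}

double flat::profile(double t) const {
    return (t <= 1.0) ? 1.0 : 0.0;
}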

You can test your class by doing the usual make grader and executing ./grader 2.

Upload your file flat.cpp here:


2.1.2 Gaussian Kernel

Implement the class gaussian derived from radial by implementing the member functions volume and profile.

Recall that volume should return \((2\pi)^{d/2}\), and profile should return \(e^{-t/2}\) (you can use std::exp(...)).

Implement a member function void guess_bandwidth() which sets the bandwidth equal to \[1.06 \cdot \hat{\sigma} / n^{1/5}\] where \(\hat{\sigma}\) is the standard deviation of the dataset, which you can access via the newly added member function standard_deviation in the class cloud (see cloud/cloud.cpp).
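A corresponding sketch, with the same assumed accessors as above (standard_deviation is the cloud member function mentioned in the text; get_n and get_dim are assumptions):

#include <cmath>

double gaussian::volume() const {
    double d = data->get_point(0).get_dim();                  // assumed accessor for d
    return std::pow(2.0 * M_PI, d / 2.0);
}

double gaussian::profile(double t) const {
    return std::exp(-t / 2.0);
}

void gaussian::guess_bandwidth() {
    // Silverman's rule of thumb: 1.06 * sigma_hat * n^(-1/5)
    bandwidth = 1.06 * data->standard_deviation() / std::pow(data->get_n(), 0.2);
}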

You can test your class by doing the usual make grader and executing ./grader 3.

Upload your file gaussian.cpp here:


2.2 \(k\)-Nearest Neighbor Kernel

The return value of the density function of the knn kernel should be \[\hat{f}_n(x) = \frac{k}{2nVd_k(x)}\] where \(d_k(x)\) is the distance of the given point \(x\) to its \(k\)-th nearest neighbor, and \(k\) and \(V\) are two parameters set in knn’s constructor.

A convenient way of identifying the distance to the \(k\)-th nearest neighbor of point \(x\) is to add a function k_dist_knn to the class cloud which finds the \(k\) nearest neighbors of \(x\) and returns \(d_k(x)\). Implement k_dist_knn in cloud.cpp.
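For instance, a simple O(n log n) version collects all distances and sorts them (n and points are assumed internal members of cloud, so adapt the names to your class):

#include <algorithm>
#include <vector>

double cloud::k_dist_knn(const point &p, int k) const {
    std::vector<double> dists;
    for (int i = 0; i < n; i++)              // n: assumed number of points
        dists.push_back(p.dist(points[i]));  // points: assumed internal array
    std::sort(dists.begin(), dists.end());
    return dists[k - 1];                     // k-th smallest distance
}

Note that if p itself belongs to the cloud, its zero distance to itself is counted among the k; check which convention the grader expects.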

If you want to do the optional Section 3 about Meanshift, you should rather implement the knn method which returns a pointer to an array of points (the k nearest neighbors in the cloud to a point given as input), and call knn in k_dist_knn.

You can test your class by doing the usual make grader and executing ./grader 4.

Upload your file cloud.cpp here:


Implement the class knn, derived from kernel, for the \(k\)-nearest neighbor kernel by implementing the member functions volume and density.
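A sketch of density following the formula above (k and V are assumed to be data members set in knn’s constructor; get_n and k_dist_knn as before):

double knn::density(const point &p) const {
    // f_hat(x) = k / (2 * n * V * d_k(x))
    return k / (2.0 * data->get_n() * V * data->k_dist_knn(p, k));
}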

You can test your class by doing the usual make grader and executing ./grader 5.

Upload your file knn.cpp here:


Note: we call the parameter \(V\) the total volume, which we set manually for simplicity. For \(V=1\), the kernel’s density is not strictly speaking a density because its integral is not necessarily equal to \(1\).

3. Mean Shift (Optional)

Mean Shift

It is often necessary and desirable to remove noise from a data set. For that, multiple possibilities exist. We will implement one of them, a mean shift algorithm based on the \(k\)-nearest neighbor kernel.

One iteration of the algorithm updates the position of every data point to the mean of its \(k\) nearest neighbors.

Complete the shift method in cloud.cpp, which implements one iteration of this algorithm and returns the new coordinates of the points in the cloud as a pointer to an array of points (see how it is used in cloud::meanshift).
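A sketch under the same assumptions as before (the knn helper from Section 2.2 and hypothetical coordinate accessors); since every shifted position must be computed from the old coordinates, the new positions are written to a fresh array:

point *cloud::shift(int k) const {
    point *result = new point[n];                    // n, points, dim: assumed members
    for (int i = 0; i < n; i++) {
        point *neighbors = knn(points[i], k);        // k nearest neighbors of point i
        for (int j = 0; j < dim; j++) {
            double mean = 0.0;
            for (int m = 0; m < k; m++)
                mean += neighbors[m].get_coord(j);   // assumed accessor
            result[i].set_coord(j, mean / k);        // assumed accessor
        }
        delete[] neighbors;
    }
    return result;
}

Depending on your point class, you may need to initialize the dimension of each point in result before setting its coordinates.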

You can test your implementation by doing the usual make grader and executing ./grader 6.

This runs one iteration of the mean shift algorithm on the \(2\)-dimensional dataset normal.csv with \(k = 3\) (see grading.cpp).

You are free to change grading.cpp to print the new coordinates (e.g., by setting verbose to true) and save them to a file using standard output redirection (>).

For reference, the first 10 data points of normal.csv after running one iteration with \(k = 3\) should be equal (or close) to:

-0.742035 1.09605 
0.025417 0.1229 
0.845654 -0.97052 
0.04139 0.921662 
0.00173367 -0.848124 
0.095338 1.4491 
2.13119 -0.484897 
-0.302607 0.184375 
1.26503 -0.312559 
-0.467785 0.744896 

Upload your file cloud.cpp here:


Use the estimators you implemented, in combination with the mean shift algorithm, on the datasets with different parameters, and compare the results.

The mean shift algorithm can be particularly useful for higher-dimensional datasets. To experiment with this, you can use the 3D dataset galaxies_3D.csv. For visualization of 3D data, use the program MeshLab, which is installed on the computers in the lab.

After opening the dataset galaxies_3D.csv in MeshLab, you can inspect the original point cloud.

Try using the mean shift algorithm on this dataset and compare the resulting set of data points with the original. A few iterations of the algorithm with parameter \(k=30\) should yield noticeable results.