This assignment will be closed on April 08, 2025 (23:59:59).

CSC_4302_EP - Density Estimation

The TD archive contains in particular:

  1. the file kernels.py, which you will complete and upload
  2. the grader script grader.py
  3. the helper scripts galaxy_plot.py and estimate_TV.py
  4. the directory csv containing the datasets



Structure

The TD is structured as follows:

  1. Kernel Density Estimators
  2. Mean Shift
  3. Putting it to a test: total variation distance

First, if not already done, install the necessary libraries:

$ pip3 install --user --only-binary=:all: numpy pandas scikit-learn matplotlib

As usual, the goal is to implement basic versions of the algorithms by ourselves, so we will not rely on existing libraries such as scikit-learn in our implementations. There will be one exception to this: we will rely on scikit-learn to compute \(k\)-nearest neighbors, because we have already studied this question in a previous TD.

1. Kernel Density Estimators

We will represent our samples \((p_i)_{i=1}^n\) by a list data of numpy arrays (each element of the list is a numpy array representing one given point).

Kernel estimators are non-parametric density estimators. Recall that, given a “training” dataset of \(n\) samples \((p_i)_{i=1}^n\) drawn from an unknown density \(f\), the estimated density has the form:

\[\hat{f}_n(x) = \dfrac{c_{k, d}}{n \sigma^d} \sum\limits_{i=1}^n k \left( \dfrac{||x - p_i||_2^2}{\sigma^2} \right).\]

We will define a base class Kernel with a method density(x: np.ndarray) -> float which should return the estimated density \(\hat{f}_n(x)\) at point x.

We will implement three different kernels:

  1. flat
  2. Gaussian
  3. \(k\)-nearest neighbors

The first two (flat and Gaussian) are instances of radial kernels, i.e., they only depend on the distance to a single data point at a time. The third (\(k\)-nearest neighbors) is not radial as it depends on several (i.e., \(k\)) nearest data points.

Hence, we will group the flat and Gaussian kernels as derived classes of an intermediate class Radial. The class diagram is as follows:

                  Flat
                /
         Radial
       /        \
Kernel            Gaussian
       \
         Knn

To keep things simple, we will use a single Python file for all these classes. Open the file kernels.py in the directory TD and copy the following code into it:

import numpy as np
from typing import Self
from abc import ABC, abstractmethod

class Kernel(ABC):
    """A class for kernel density estimation, which also stores the cloud of points
    Attributes:
        d: int                 -- dimension of the ambient space
        data: list[np.ndarray] -- list of coordinates of the points (each of dimension self.d)
    """
    def __init__(self: Self, d: int, data: list[np.ndarray]):
        self.data = data
        self.d = d

    @abstractmethod
    def density(self: Self, x: np.ndarray) -> float:
        pass

The class Kernel is abstract: we will not implement its density method. The @abstractmethod decorator (together with the ABC base class) means that each class derived from Kernel must provide a concrete implementation of density (unless it is also abstract).

1.1 The Radial class

We will now define the Radial class inherited from Kernel. Objects of the Radial class have an additional attribute self.bandwidth: float, and they need an initializer taking three arguments: d, data and bandwidth. Implement a first version of the Radial class containing only this initializer (__init__). We recommend that you call Kernel’s initializer inside Radial’s initializer, to limit code duplication.

Then, in the Radial class, add two abstract methods:

  1. volume(self: Self) -> float, which returns the volume constant of the kernel (the inverse of \(c_{k,d}\))
  2. profile(self: Self, t: float) -> float, which returns the kernel profile function \(k(t)\)

As in Kernel, the use of abstract member methods makes Radial an abstract class. In order to be able to instantiate an actual object, we will need derived classes which “implement” (override) these methods: in this case, Flat and Gaussian. We will implement these classes, and in particular their methods volume and profile, in Sections 1.1.1 and 1.1.2 respectively.

Assuming that volume and profile are already implemented, implement density in Radial. Recall that the density for radial kernels is defined as \[\hat{f}_n(x) = \dfrac{c_{k,d}}{n\sigma^d} \sum\limits_{i=1}^n k\left(\dfrac{\lVert x - p_i\rVert_2^2}{\sigma^2} \right)\] where the \(p_i\) are the data points, \(k\) is the kernel profile function, \(c_{k,d}\) is the inverse of the volume, and \(\sigma\) is the chosen bandwidth.
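For reference, here is a minimal sketch of what Radial could look like (it assumes the imports already at the top of kernels.py; density simply translates the formula above, using that \(c_{k,d}\) is the inverse of the volume):

class Radial(Kernel):
    """A class for radial kernel density estimation, derived from Kernel
    Attributes not already in Kernel:
        bandwidth: float -- the parameter sigma in the density formula
    """
    def __init__(self: Self, d: int, data: list[np.ndarray], bandwidth: float):
        super().__init__(d, data)  # reuse Kernel's initializer
        self.bandwidth = bandwidth

    @abstractmethod
    def volume(self: Self) -> float:
        pass

    @abstractmethod
    def profile(self: Self, t: float) -> float:
        pass

    def density(self: Self, x: np.ndarray) -> float:
        # f_n(x) = c_{k,d} / (n * sigma^d) * sum_i k(||x - p_i||^2 / sigma^2)
        n = len(self.data)
        total = sum(self.profile(np.linalg.norm(x - p) ** 2 / self.bandwidth ** 2)
                    for p in self.data)
        return total / (self.volume() * n * self.bandwidth ** self.d)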

Test your implementation by running python grader.py 1 (or an equivalent command depending on your operating system and Python installation).

Upload your file kernels.py here:

1.1.1 Flat Kernel

Implement the class Flat derived from Radial by implementing the member methods volume and profile.

Recall that volume should return \(\dfrac{\pi^{d/2}}{\Gamma(d/2 + 1)}\), the volume of the unit ball in dimension \(d\), and profile should return \(1\) if \(t \leq 1\) and \(0\) otherwise.

To compute the \(\Gamma\) function, you can use a few tricks, or you can just import math and use math.gamma.
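As a hedged sketch, directly transcribing the two formulas above (using math.gamma as suggested):

import math

class Flat(Radial):
    """The flat kernel, derived from Radial."""
    def volume(self: Self) -> float:
        # volume of the unit ball in dimension d
        return math.pi ** (self.d / 2) / math.gamma(self.d / 2 + 1)

    def profile(self: Self, t: float) -> float:
        return 1.0 if t <= 1 else 0.0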

Test your implementation by running python grader.py 2.

Upload your file kernels.py here:

1.1.2 Gaussian Kernel

Implement the class Gaussian derived from Radial by implementing the member methods volume and profile.

Recall that volume should return \((2\pi)^{d/2}\), and profile should return \(e^{-t/2}\) (you can use from math import exp).

Implement a member function guess_bandwidth(self: Self) -> None which sets the bandwidth according to the so-called Silverman rule of thumb, \[\mathrm{guessed~bandwidth} = \left(\frac{n (d+2)}{4}\right)^{-1/(d+4)} \cdot \hat{\sigma},\] where \(\hat{\sigma}\) is the sample standard deviation of the dataset, namely \(\hat\sigma = \sqrt{\dfrac{1}{n-1}\sum\limits_{i=1}^{n}\lVert p_i - \bar{p}\rVert^2}\) with \(\bar{p}\) the sample mean.
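A possible sketch, again transcribing the formulas (it assumes data is a list of points, as specified in Kernel):

from math import exp

class Gaussian(Radial):
    """The Gaussian kernel, derived from Radial."""
    def volume(self: Self) -> float:
        return (2 * np.pi) ** (self.d / 2)

    def profile(self: Self, t: float) -> float:
        return exp(-t / 2)

    def guess_bandwidth(self: Self) -> None:
        # Silverman's rule of thumb
        n = len(self.data)
        mean = np.mean(self.data, axis=0)
        sigma_hat = np.sqrt(sum(np.linalg.norm(p - mean) ** 2 for p in self.data) / (n - 1))
        self.bandwidth = (n * (self.d + 2) / 4) ** (-1 / (self.d + 4)) * sigma_hat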

Test your implementation by running python grader.py 3.

Upload your file kernels.py here:

1.2 \(k\)-Nearest Neighbor Kernel

The return value of the density function of the Knn kernel should be \[\hat{f}_n(x) = \frac{k}{2nVd_k(x)}\] where \(d_k(x)\) is the distance of the given point \(x\) to its \(k\)-th nearest neighbor, and \(k\) and \(V\) are two parameters set in Knn’s initializer.

Because \(k\)-NN is not our main focus today, we will rely on scikit-learn. We have prepared a class Knn for this purpose:

from sklearn.neighbors import NearestNeighbors
class Knn(Kernel):
    """A class for kernel density estimation with k-Nearest Neighbors
       derived from Kernel
    Attributes not already in Kernel:
        k: int      -- parameter for k-NN
        V: float    -- "volume" constant appearing in density
        neigh:    sklearn.neighbors.NearestNeighbors   -- data structure and methods for efficient k-NN computations
    """
    def __init__(self: Self, d: int, data: list[np.ndarray], k: int, V: float):
        super().__init__(d,data)
        self.k, self.V = k, V
        self.neigh = NearestNeighbors(n_neighbors=self.k)
        #....

    def fit_knn(self):
        """Computes the inner data structure acccording to the data points."""
        self.neigh.fit(np.array(self.data))

    def knn(self, x: np.ndarray, vk: int):
        """The vk nearest neighbors (vk can be different from self.k)."""
        return [np.array(self.data[i]) for i in self.neigh.kneighbors([x], n_neighbors=vk)[1][0]]

    def k_dist_knn(self, x: np.ndarray, vk: int) -> float:
        """The distance to vk-th nearest-neighbor."""
        return self.neigh.kneighbors([x], n_neighbors=vk)[0][0][vk-1]

Complete the class Knn by filling in the missing line(s) in its initializer and by implementing the member function density(self: Self, x: np.ndarray) -> float for the \(k\)-nearest neighbor kernel.
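For density, a minimal sketch that transcribes the formula above (it assumes the neighbor structure has been fitted beforehand, for instance via fit_knn, which could be part of the initializer's missing line(s)):

    def density(self: Self, x: np.ndarray) -> float:
        # f_n(x) = k / (2 * n * V * d_k(x))
        n = len(self.data)
        return self.k / (2 * n * self.V * self.k_dist_knn(x, self.k))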

Test your implementation by running python grader.py 4.

Upload your file kernels.py here:


Note: we call the parameter \(V\) the total volume, which we set manually for simplicity. For a given \(V\) (for example \(V=1\)), the kernel’s density is not strictly speaking a (probability) density, because its integral is not necessarily equal to \(1\). Depending on the application, it may or may not be necessary to rescale \(V\) in order to work with a valid probability density.

1.3 Visualization

We recommend you try to visualize the outputs of your estimators on real data. The following code loads the double_spiral dataset from the archive you unpacked, and plots the estimated density. You can modify this code to test other datasets, kernels, and parameters.

mydata = np.loadtxt("csv/double_spiral.csv", delimiter=" ", dtype=float)
vk = 80
myknn = Knn(2, mydata, vk, 1.0)
grid = 100
print_2D_density(myknn.density, -np.ceil(-mydata.min(axis=0)).astype(int), np.ceil(mydata.max(axis=0)).astype(int), grid,  f"Knn k={vk} grid={grid}")

The code uses the plotting function print_2D_density below, which relies on matplotlib. The parameter grid: int indicates the number of rows/columns of the underlying plotting grid.

import matplotlib.pyplot as plt

def print_2D_density(density, min_xy, max_xy, grid: int, title: str):
    x_range,y_range = np.linspace(min_xy[0], max_xy[0], grid), np.linspace(min_xy[1], max_xy[1], grid)
    X, Y = np.meshgrid(x_range, y_range)
    Z = np.array([[density(np.array([x, y])) for x in x_range] for y in y_range])
    plt.figure(figsize=(10, 5))
    plt.contourf(X, Y, Z, cmap="viridis")
    plt.title(title)
    plt.show()  ### Use plt.savefig(f'{title}.png') to save to a file instead of displaying

You can run the script galaxy_plot.py to obtain such a density plot (feel free to try different datasets and parameters!).

2. Mean Shift

It is often desirable to remove noise from a data set, and multiple approaches exist. We will implement one of them: a mean shift algorithm based on the \(k\)-nearest neighbor kernel. Beyond noise removal, mean shift is strongly related to density estimation, since the displacement of points in the mean shift algorithm is a way to estimate the gradient of the density function. In particular, mean shift can be useful to detect the modes of a distribution.

One iteration of the algorithm updates the position of every data point to the mean of its \(k\) nearest neighbors.

Add a method meanshift(self: Self, k: int) -> None to the class Knn which implements one iteration of this algorithm, updating the coordinates of all the points.
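A minimal sketch, assuming every point is moved to the mean of its \(k\) nearest neighbors computed from the positions at the start of the iteration:

    def meanshift(self: Self, k: int) -> None:
        # one iteration: refit the neighbor structure on the current positions,
        # then replace each point by the mean of its k nearest neighbors
        self.fit_knn()
        self.data = [np.mean(self.knn(p, k), axis=0) for p in self.data]

With this convention, applying four iterations with \(k = 10\) amounts to calling meanshift(10) four times in a row.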

Test your implementation by running python grader.py 5.

For reference, the first 10 data points of normal.csv after running one iteration with \(k = 3\) should be equal (or close) to:

-0.742035 1.09605
0.025417 0.1229
0.845654 -0.97052
0.04139 0.921662
0.00173367 -0.848124
0.095338 1.4491
2.13119 -0.484897
-0.302607 0.184375
1.26503 -0.312559
-0.467785 0.744896

Upload your file kernels.py here:


You can combine the estimators you implemented with the mean shift algorithm on the various datasets, trying different parameters, and compare the results.

The mean shift algorithm can be particularly useful for higher-dimensional datasets. To experiment with this, you can use the 3D dataset galaxies_3D.csv. For visualization of 3D data, use the program MeshLab. Note that with some versions of MeshLab you should rename the file from .csv to .xyz so that MeshLab opens it. After opening the dataset galaxies_3D.xyz, you will see the cloud of points.

For shorter computation time, we recommend that you use the dataset galaxies_3D-short.xyz instead, which contains far fewer points. Try the mean shift algorithm on this dataset and compare the resulting set of points with the original. Apply four iterations of the algorithm with the parameter \(k=10\) and save the results to a new .xyz file. Then open both files simultaneously in MeshLab and display the two point clouds in different colors. The effect should be clearly visible. How do the points distribute in space after performing this operation?

-> Please answer this question on Moodle (in the quiz named “TD4”).

Note: in order to load/save the data from/to .xyz files, you can use code such as this:

mydata = np.loadtxt("csv/galaxies_3D-short.xyz", delimiter=" ", dtype=float)
myknn=Knn(3,mydata,10,1.0)
....
np.savetxt("csv/galaxies_3D-short-afterMeanshift.xyz", np.array(myknn.data), delimiter=" ")

3. Putting it to a test: total variation distance

The total variation distance \(d_{TV}(p,q)\) between two probability densities \(p\) and \(q\) is defined as half of their \(L^1\) distance, namely \[ d_{TV}(p,q) = \frac{1}{2} \int |p(x)-q(x)| \, dx,\] where the integral is taken over the whole (\(d\)-dimensional) space. It is a number between zero and one, which is zero if and only if \(p\equiv q\) as probability measures.
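To make the definition concrete, here is a rough illustration of how \(d_{TV}\) could be approximated numerically in dimension 2 with a Riemann sum on a grid (a sketch for intuition only; the provided estimate_TV.py may proceed differently):

def estimate_tv_2d(p, q, min_xy, max_xy, grid: int = 100) -> float:
    # approximate 0.5 * integral of |p - q| over the rectangle [min_xy, max_xy]
    xs = np.linspace(min_xy[0], max_xy[0], grid)
    ys = np.linspace(min_xy[1], max_xy[1], grid)
    cell = (xs[1] - xs[0]) * (ys[1] - ys[0])  # area of one grid cell
    total = sum(abs(p(np.array([x, y])) - q(np.array([x, y])))
                for x in xs for y in ys)
    return 0.5 * total * cell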

The attached file estimate_TV.py generates sample points from a certain 2-dimensional density (known to the program; in fact it is a multinormal distribution), then computes the total variation distance between this known density and the estimated density. Running the program with the parameter job=1 (you can change this parameter somewhere inside the code, look for it) displays several density plots side by side.

For the central plot, we used the built-in density estimator from the sklearn library, as you can see from the code lines:

kde = KernelDensity(bandwidth='silverman', kernel='gaussian')
kde.fit(data)
estimated_density_sk = lambda x, y: np.exp(kde.score_samples(np.array([[x, y]])))[0]

For the rightmost plot, we use our naive implementation of the Gaussian kernel, which gives a much worse total variation distance (around 0.4). The main reason is that our “rule of thumb” for bandwidth estimation does not perform well on this distribution.

Add the method guess_bandwidth_challenge() to the class Gaussian in such a way that it performs better on the multi-normal examples provided by estimate_TV.py than the previous guess_bandwidth(). You are free to use your own heuristics. You can then run estimate_TV.py with each value of job=1,2,3,4 to see how your heuristic performs. You should not modify estimate_TV.py except for the job number and for the lines:

#### Comment the following line and uncomment the next one:
kg.guess_bandwidth()
#kg.guess_bandwidth_challenge()

You will be graded according to the ratio between the TV distance you reach and the one reached by scikit-learn, on the following scale (this scale gives you an idea of your grade for this question only):

To obtain a given grade you need to reach it on at least three of the four jobs, except for A and A+, where you need it on all four. Test your implementation by running python grader.py 6 to test the four jobs at once. Note: the grader on the server has a 45-second time limit, so the test has to pass within that time.

Hint: try different bandwidths, estimate the TV distance for each of them, and try to find a better rule for bandwidth estimation. Once this is done, implement this heuristic in guess_bandwidth_challenge(). To get A+, you might need to do more and modify the Gaussian class to perform (for example) some preprocessing.
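As one possible (hypothetical) starting point, you could select the bandwidth by maximizing a leave-one-out log-likelihood over a grid of candidates around the Silverman value. A sketch, assuming data is a list of points as elsewhere in kernels.py:

    def guess_bandwidth_challenge(self: Self) -> None:
        # heuristic sketch: among candidate bandwidths around the rule-of-thumb
        # value, keep the one maximizing the leave-one-out log-likelihood
        self.guess_bandwidth()
        base = self.bandwidth
        best, best_ll = base, -float("inf")
        for factor in (0.25, 0.5, 0.75, 1.0, 1.5, 2.0):
            candidate = factor * base
            ll = 0.0
            for i, p in enumerate(self.data):
                loo = Gaussian(self.d, self.data[:i] + self.data[i + 1:], candidate)
                ll += np.log(max(loo.density(p), 1e-300))  # avoid log(0)
            if ll > best_ll:
                best, best_ll = candidate, ll
        self.bandwidth = best

Note that this brute-force version is quadratic in the number of points for each candidate bandwidth; you may need to subsample the data or vectorize density to stay within the grader's 45-second limit.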

Upload your file kernels.py here:
