This assignment has been closed on April 16, 2025.

You must be authenticated to submit your files

CSC_43042_EP – \(k\)-NN for classification

Context
Structure
Datasets
Using external libraries: SciPy and Matplotlib
\(k\)-NN classification
Speeding up by using random projections
ROC curves
Putting it to the test

Download and unzip the archive CSC43042EP-td5-1-handin.zip.

It contains in particular:

the file grader.py, which is used for the various test scripts,
the files TD/classification.py, TD/confusion_matrix.py, TD/dataset.py, TD/knn_classification.py, and TD/random_projection.py, which interact with each other and which form the code base of this TD, some of the files need to be edited,
the file TD/knn_application.py, which is used only for the very last exercise,
Python files starting with test_, which are used to test certain features of your code that are not graded,
a file knn_performance_check.py, which is used for the final coding challenge, as well as
different datasets used for some additional tests, in the folder csv/.

In the following, we assume that the directory TD/ is the root for running the scripts. In particular, we omit the prefix TD/, and when you are asked to create new files, please add them to TD/, if not instructed otherwise.

Context

You will implement the supervised learning algorithm \(k\)-nearest neighbors (\(k\)-NN) together with some performance metrics (such as confusion matrix).
You will learn how to speed up the computation dramatically by using random projections.

Structure

This TD is structured as follows:

The first section describes the datasets we work with in this TD.
The second section describes the external libraries that we use in this TD.
The third section contains the programming exercises that implement a \(k\)-NN classifier.
The fourth section explains how employing random projections speeds up the computation.
The fifth section shows how to put together the implemented functionality in order to plot a ROC curve.
The sixth section tasks you to come up with a good parametrization for an unknown dataset.

Datasets

For most of the programming part of the TD, we use an e-mail dataset, which consists of e-mails classified as spam and not spam (referred to as ham). The text of each e-mail has been transformed to a Boolean vector, based on a dictionary of common words among the e-mails. That is, each e-mail is represented as a binary vector of the same length as the dictionary, where the \(i\)-th element indicates whether the the \(i\)-th word in the dictionary is present (in the e-mail) or not. More details on this transformation can be found on the dedicated page.

Using external libraries: SciPy and Matplotlib

The scikit-learn library is required for this TD (as it simplifies certain calculations). Please follow the installation instructions here for your system. (Please recall that we recommend version 1.4.1.) We will make use of the class KDTree, whose documentation is here.

In addition, we use the Matplotlib library for crude visualizations. Please follow its installation instructions here. (The latest version should be fine.)

\(k\)-NN classification

On a separate page, we provide a small refresher on \(k\)-NNs, another refresher on confusion matrices, and an introduction to random projections.

To help you navigate through the code, we provide on a separate page a small explanation of the classes Dataset, Classification, KnnClassification, ConfusionMatrix, and RandomProjection.

Exercise 1: The KnnClassification class

Open the file knn_classification.py and edit KnnClassification, derived from the class Classification. Augment the initializer (__init__) such that it initializes k and kd_tree. Note that the initializer should create the KD-tree corresponding to the subset \(X\) of the associated dataset (i.e., all \(d\) dimensions in the dataset except for the one that we aim to predict).

Test your code by running python ./grader.py 1.

Upload your file knn_classification.py here:

Upload form is only available when connected

Exercise 2: \(\boldsymbol{k}\)-nearest neighbors and classification

Implement the method estimate(self, x: np.ndarray, threshold: float = 0.5) -> int. This method returns the (via \(k\)-NN classification) estimated value for the \(Y\)-component of the vector with \(X\)-component x.

Given the \(d\)-dimensional sample x, this method should compute the \(k\)-nearest neighbors of x in the subset \(X\) of the training dataset \(D\). To this end, you can use the query() function of KDTree.

Once the \(k\)-nearest neighbors of x are determined, its label is determined as explained in the refresher.

Bonus: If you have spare time, consider implementing the KnnClassification weighted by distances, i.e., \(\hat{p}(y=1 \mid x) = \frac{1}{k} \sum_{i=1}^{k} \frac{y_{nn_i}}{w_i}\), where \(w_i\) is the distance between \(x\) and \(nn_i\). See an example here.

Remark: There is no guarantee that it will actually perform better.

Test your code by running python ./grader.py 2.

Upload your file knn_classification.py here:

Upload form is only available when connected

Exercise 3: Evaluating the performance

Open the file confusion_matrix.py. The class ConfusionMatrix shows us the performance of a given classifier. Note the usage of format strings in the method print_evaluation.

Implement the initializer of the class, which creates a 2-by-2 ndarray (initialized with 0.0s).

Implement the method add_prediction, which simply increases the corresponding entry in the confusion matrix by 1. For example, if true_label is 0 and predicted_label is 1, the counter in the \((0,1)\) coordinate of the confusion matrix is incremented.

Implement the methods f_score, precision, error_rate, detection_rate, and false_alarm_rate according to the formulas in the course or in this refresher.

Test your code by running python ./grader.py 3.

Upload your file confusion_matrix.py here:

Upload form is only available when connected

Putting everything together

You’re now all set to run \(k\)-NN classification on a pair of training/test files and store results in a confusion matrix. First, have a look test_knn.py (not in TD/!) to understand how we run the test. Then, run test_knn.py via

  $ python ./test_knn.py 10 csv/mail_train.csv csv/mail_test.csv

  No column specified for classification, assuming first column of dataset (0).
  Dataset with 2251 samples and 1900 dimensions.
  k-NN classification (k = 10, classification over column 0) ...
  Prediction and Confusion Matrix filling

  Execution time: 3044 ms

		Predicted
                0       1
Actual  0       264     62
        1       15      127

Error rate              0.16452991452991453
False-alarm rate        0.1901840490797546
Detection rate          0.8943661971830986
F-score                 0.7673716012084593
Precision               0.671957671957672

Quite impressive, isn’t it? The only problem: it took around 3 seconds to predict on 1000 new e-mails. If you’re part of Google’s Gmail team, this is probably way too long. This is where random projections are useful.

Speeding up by using random projections

The basic idea is to drastically reduce the dimension of the problem (1899 as of now) to reduce the computing time and the memory footprint of \(k\)-NN. Roughly, this amounts to compressing the data, which results in an information loss. The trade-off between the dimensionality and the information loss as well as the rationale behind random projections are discussed in this separate page.

Exercise 4: Sampling

Open the file random_projection.py. Implement the method random_gaussian_matrix(d: int, projection_dim: int) -> np.ndarray, which takes two parameters: an integer d, which represents the dimension of the original data, and an integer projection_dim, which represents the projection dimension (denoted here by \(\ell\)). It returns a d × \(\ell\) matrix whose entries are sampled from \(\mathrm{N}(0,1/\ell)\) (beware: \(1/\ell\) denotes the variance!). You can check numpy to find out how to draw such samples.

You can also take inspiration from the implementation of random_rademacher_matrix (in random_projection.py). Its constructor fills the matrix with 0s, sqrt(3.0 / projection_dim)s, and −sqrt(3.0 / projection_dim)s such that each entry is non-zero with probability 1/3 and its sign is chosen uniformly at random.

Test your code by running python ./grader.py 4.

Upload your file random_projection.py here:

Upload form is only available when connected

Exercise 5: RandomProjection

First, implement the initializer of RandomProjection (in random_projection.py). Initialize all attributes except the matrix projection with the corresponding arguments. Initialize projection depending on the value of the argument type_sample (which is "Gaussian" or something else). Please use the Rademacher matrix if type_sample is not "Gaussian".

Now, implement the method projection_quality(self, dataset: Dataset) -> tuple[float, float], which computes the average distance between all pairs of points in the original dataset (given as a parameter) and its projection. The method returns a tuple whose first component is the average distance in the original dataset, and whose second component is the average distance in the projection.

Use the provided method project to compute the projection. Note that the result of project contains the column to classify as its last column, which must be ignored when computing the quality. (Similarly, the original dataset also still contains the column to predict (at col_class), which must also be ignored.)

Test your code by running python ./grader.py 5. Note that these tests can take quite some time (as in, several minutes), since the dataset is not that small.

Upload your file random_projection.py here:

Upload form is only available when connected

Random Projections meet classifier performance and memory/time complexity

Now run test_random_projection.py (again, please have a look at its code beforehand; it is not in TD/). The program compares timings and also the prediction quality before and after projection. Let’s run it for \(\ell = 10\) and a Gaussian random matrix:

$ python ./test_random_projection.py 3 10 csv/mail_train.csv csv/mail_test.csv Gaussian

No column specified for classification, assuming first column of dataset (0).
Dataset with 2251 samples and 1900 dimensions.

Create a random projection.
Execution time: 15 ms

Training k-NN classification on the original data.
Execution time: 547 ms

Training k-NN classification on the projected data.
Execution time: 41 ms

Predicting k-NN on original data.
Execution time: 2870 ms

		Predicted
		0	1
Actual	0	255	71
    	1	14	128

Error rate		0.18162393162393162
False-alarm rate	0.21779141104294478
Detection rate		0.9014084507042254
F-score			0.7507331378299121
Precision		0.6432160804020101


Predicting k-NN on projected data.
Execution time: 61 ms

		Predicted
		0	1
Actual	0	288	38
    	1	59	83

Error rate		0.20726495726495728
False-alarm rate	0.1165644171779141
Detection rate		0.5845070422535211
F-score			0.6311787072243347
Precision		0.6859504132231405


Speedup on training: 13
Speedup on testing: 47

The speedup is huge, but the performance decreased. Try \(\ell = 60\) instead.

Still a tremendous speedup for approximately the same performance!

For Rademacher, the results are even better.

ROC curves

Thanks to the method estimate of KnnClassification, we can run the \(k\)-NN algorithm with different threshold values. This allows us to draw a ROC (receiver operating characteristic) curve, which represents the detection rate as a function of the false-alarm rate (with the original data, or the projected data, or both!).

In order to achieve this, complete the method create_roc_plot() -> None in test_roc_curve.py (not in TD/!) by looping over the thresholds to apply the prediction of \(k\)-NN on the test set, given the current threshold, and storing the results in the corresponding confusion matrix.

Moreover, tell the user the area under the ROC curve (AUC score). To this end, you can use the numpy method trapz.

Once this is implemented, plot the ROC curve by running test_roc_curve.py as shown below, where CHOOSE_WISELY is a number that determines the granularity of our approach. Figure out which value makes sense. Note that we now use the roc_data dataset. Please use the number that corresponds to your number determined by the Moodle quiz for this TD. For the example below, we assume that this number is \(i \in [0 ..5]\). (Warning: If you run the following command for the mail dataset, this can take a long time!)

  $ python ./test_roc_curve.py 10 csv/roc_data_i_train.csv csv/roc_data_i_test.csv CHOOSE_WISELY 7

Run tests for the dataset roc_data (and the correct value of \(i\)) with 2, 6, and 10 nearest neighbors on normalized data. Check out MinMaxScaler from sklearn.preprocessing. Which of these classifiers has the highest (test) AUC score? (In case of a tie, go with the largest number.) Does this classifier also have the best (maximum) F-score? (In case of a tie for the first place, we say yes.) Answer these two questions in the quiz on Moodle.

Putting it to the test

Last, we check how well our classifier performs on unknown data. Have a look at the file knn_performance_check.py (outside of TD/), which shows you roughly the script that is run on the server to estimate the performance of your classifier. This test makes use of the dataset samples (split into train and test), which you should download here. Note that the test data is not the full file that is on the server!

This script makes use of the file TD/knn_application.py. This file is intended for communicating a good parametrization of your classifier to the test script. Edit TD/knn_application.py meaningfully.

You need to choose good parameters by yourself. The variable seed is the seed used for the randomness involved in the random projection used by the classifier.

Your grade is determined by both how much faster your classifier is in comparison to the one that does not use a random projection, and by how large the F-score of your classifier is. You want both numbers to be as large as possible. These two quantities are combined into a single score, as detailed in knn_performance_check.py (not in TD/!) in line 71. In a nutshell, the speed-up is very important, but the F-score covers the final gap. Note that below this formula, you also see at which thresholds you get which grade. Good luck!

Attention: Since the speed-up is subject to randomness itself in how quickly the computations are performed, it may make sense to repeat an experiments of a promising parametrization, in order to get a good run. However, we note that the random fluctuation on the full test dataset is not that large, as the dataset itself is pretty huge.

Upload your file knn_application.py here:

Upload form is only available when connected