CSC_43042_EP – \(k\)-NN for classification
Download and unzip the archive CSC43042EP-td5-1-handin.zip
.
It contains in particular:
- the file
grader.py
, which is used for the various test scripts, - the files
TD/classification.py
,TD/confusion_matrix.py
,TD/dataset.py
,TD/knn_classification.py
, andTD/random_projection.py
, which interact with each other and which form the code base of this TD, some of the files need to be edited, - the file
TD/knn_application.py
, which is used only for the very last exercise, - Python files starting with
test_
, which are used to test certain features of your code that are not graded, - a file
knn_performance_check.py
, which is used for the final coding challenge, as well as - different datasets used for some additional tests, in the folder
csv/
.
In the following, we assume that the directory TD/
is
the root for running the scripts. In particular, we omit the prefix
TD/
, and when you are asked to create new files, please add
them to TD/
, if not instructed otherwise.
Log in to be able to upload your files.
Context
- You will implement the supervised learning algorithm \(k\)-nearest neighbors (\(k\)-NN) together with some performance metrics (such as confusion matrix).
- You will learn how to speed up the computation dramatically by using random projections.
Structure
This TD is structured as follows:
- The first section describes the datasets we work with in this TD.
- The second section describes the external libraries that we use in this TD.
- The third section contains the programming exercises that implement a \(k\)-NN classifier.
- The fourth section explains how employing random projections speeds up the computation.
- The fifth section shows how to put together the implemented functionality in order to plot a ROC curve.
- The sixth section tasks you to come up with a good parametrization for an unknown dataset.
Datasets
For most of the programming part of the TD, we use an e-mail dataset, which consists of e-mails classified as spam and not spam (referred to as ham). The text of each e-mail has been transformed to a Boolean vector, based on a dictionary of common words among the e-mails. That is, each e-mail is represented as a binary vector of the same length as the dictionary, where the \(i\)-th element indicates whether the the \(i\)-th word in the dictionary is present (in the e-mail) or not. More details on this transformation can be found on the dedicated page.
Using external libraries: SciPy and Matplotlib
The scikit-learn library is required for this TD (as it simplifies certain calculations). Please follow the installation instructions here for your system. (Please recall that we recommend version 1.4.1.) We will make use of the class KDTree, whose documentation is here.
In addition, we use the Matplotlib library for crude visualizations. Please follow its installation instructions here. (The latest version should be fine.)
\(k\)-NN classification
On a separate page, we provide a small refresher on \(k\)-NNs, another refresher on confusion matrices, and an introduction to random projections.
To help you navigate through the code, we provide on a separate page a small explanation
of the classes Dataset
, Classification
,
KnnClassification
, ConfusionMatrix
, and
RandomProjection
.
Exercise 1: The KnnClassification class
Open the file knn_classification.py
and edit KnnClassification
, derived from
the class Classification
. Augment the
initializer (__init__
) such that it initializes
k
and kd_tree
. Note that the initializer
should create the KD-tree corresponding to the subset \(X\) of the associated dataset (i.e., all
\(d\) dimensions in the dataset except
for the one that we aim to predict).
Test your code by running
python ./grader.py 1
.
Upload your file knn_classification.py
here:
Exercise 2: \(\boldsymbol{k}\)-nearest neighbors and classification
Implement the method
estimate(self, x: np.ndarray, threshold: float = 0.5) -> int
.
This method returns the (via \(k\)-NN
classification) estimated value for the \(Y\)-component of the vector with \(X\)-component x
.
Given the \(d\)-dimensional sample
x
, this method should compute the \(k\)-nearest neighbors of x
in
the subset \(X\) of the training
dataset \(D\). To this end, you can use
the query()
function of KDTree
.
Once the \(k\)-nearest neighbors of
x
are determined, its label is determined as explained in
the refresher.
Bonus: If you have spare time, consider implementing
the KnnClassification
weighted by distances, i.e., \(\hat{p}(y=1 \mid x) = \frac{1}{k} \sum_{i=1}^{k}
\frac{y_{nn_i}}{w_i}\), where \(w_i\) is the distance between \(x\) and \(nn_i\). See an example here.
Remark: There is no guarantee that it will actually perform better.
Test your code by running
python ./grader.py 2
.
Upload your file knn_classification.py
here:
Exercise 3: Evaluating the performance
Open the file confusion_matrix.py
. The
class ConfusionMatrix
shows us the performance of a given
classifier. Note the usage of format strings in the method
print_evaluation
.
Implement the initializer of the class, which
creates a 2-by-2 ndarray
(initialized with
0.0
s).
Implement the method add_prediction
,
which simply increases the corresponding entry in the confusion matrix
by 1. For example, if true_label
is 0 and
predicted_label
is 1, the counter in the \((0,1)\) coordinate of the confusion matrix
is incremented.
Implement the methods f_score
,
precision
, error_rate
,
detection_rate
, and false_alarm_rate
according
to the formulas in the course or in this refresher.
Test your code by running
python ./grader.py 3
.
Upload your file confusion_matrix.py
here:
Putting everything together
You’re now all set to run \(k\)-NN
classification on a pair of training/test files and store results in a
confusion matrix. First, have a look test_knn.py
(not in
TD/
!) to understand how we run the test. Then, run
test_knn.py
via
$ python ./test_knn.py 10 csv/mail_train.csv csv/mail_test.csv
No column specified for classification, assuming first column of dataset (0).
Dataset with 2251 samples and 1900 dimensions.
k-NN classification (k = 10, classification over column 0) ...
Prediction and Confusion Matrix filling
Execution time: 3044 ms
Predicted
0 1
Actual 0 264 62
1 15 127
Error rate 0.16452991452991453
False-alarm rate 0.1901840490797546
Detection rate 0.8943661971830986
F-score 0.7673716012084593
Precision 0.671957671957672
Quite impressive, isn’t it? The only problem: it took around 3 seconds to predict on 1000 new e-mails. If you’re part of Google’s Gmail team, this is probably way too long. This is where random projections are useful.
Speeding up by using random projections
The basic idea is to drastically reduce the dimension of the problem (1899 as of now) to reduce the computing time and the memory footprint of \(k\)-NN. Roughly, this amounts to compressing the data, which results in an information loss. The trade-off between the dimensionality and the information loss as well as the rationale behind random projections are discussed in this separate page.
Exercise 4: Sampling
Open the file random_projection.py
.
Implement the method
random_gaussian_matrix(d: int, projection_dim: int) -> np.ndarray
,
which takes two parameters: an integer d
, which represents
the dimension of the original data, and an integer
projection_dim
, which represents the projection dimension
(denoted here by \(\ell\)). It returns a
d
× \(\ell\) matrix whose
entries are sampled from \(\mathrm{N}(0,1/\ell)\) (beware: \(1/\ell\) denotes the variance!). You can
check numpy to find out how to draw such samples.
You can also take inspiration from the implementation of
random_rademacher_matrix
(in
random_projection.py
). Its constructor fills the matrix
with 0
s, sqrt(3.0 / projection_dim)
s, and
−sqrt(3.0 / projection_dim)
s such that each entry is
non-zero with probability 1/3 and its sign is chosen uniformly at
random.
Test your code by running
python ./grader.py 4
.
Upload your file random_projection.py
here:
Exercise 5: RandomProjection
First, implement the initializer of
RandomProjection
(in random_projection.py
).
Initialize all attributes except the matrix projection
with
the corresponding arguments. Initialize projection
depending on the value of the argument type_sample
(which
is "Gaussian"
or something else). Please use the Rademacher
matrix if type_sample
is not "Gaussian"
.
Now, implement the method
projection_quality(self, dataset: Dataset) -> tuple[float, float]
,
which computes the average distance between all pairs of points in the
original dataset (given as a parameter) and its projection. The method
returns a tuple whose first component is the average distance in the
original dataset, and whose second component is the average distance in
the projection.
Use the provided method project
to compute the
projection. Note that the result of project
contains the
column to classify as its last column, which must be ignored when
computing the quality. (Similarly, the original dataset also still
contains the column to predict (at col_class
), which must
also be ignored.)
Test your code by running
python ./grader.py 5
. Note that these tests can take quite
some time (as in, several minutes), since the dataset is not that
small.
Upload your file random_projection.py
here:
Random Projections meet classifier performance and memory/time complexity
Now run test_random_projection.py
(again, please have a look at its code beforehand; it is not in
TD/
). The program compares timings and also the prediction
quality before and after projection. Let’s run it for \(\ell = 10\) and a Gaussian random
matrix:
$ python ./test_random_projection.py 3 10 csv/mail_train.csv csv/mail_test.csv Gaussian
No column specified for classification, assuming first column of dataset (0).
Dataset with 2251 samples and 1900 dimensions.
Create a random projection.
Execution time: 15 ms
Training k-NN classification on the original data.
Execution time: 547 ms
Training k-NN classification on the projected data.
Execution time: 41 ms
Predicting k-NN on original data.
Execution time: 2870 ms
Predicted
0 1
Actual 0 255 71
1 14 128
Error rate 0.18162393162393162
False-alarm rate 0.21779141104294478
Detection rate 0.9014084507042254
F-score 0.7507331378299121
Precision 0.6432160804020101
Predicting k-NN on projected data.
Execution time: 61 ms
Predicted
0 1
Actual 0 288 38
1 59 83
Error rate 0.20726495726495728
False-alarm rate 0.1165644171779141
Detection rate 0.5845070422535211
F-score 0.6311787072243347
Precision 0.6859504132231405
Speedup on training: 13
Speedup on testing: 47
The speedup is huge, but the performance decreased. Try \(\ell = 60\) instead.
Still a tremendous speedup for approximately the same performance!
For Rademacher, the results are even better.
ROC curves
Thanks to the method estimate
of
KnnClassification
, we can run the \(k\)-NN algorithm with different threshold
values. This allows us to draw a ROC (receiver operating characteristic)
curve, which represents the detection rate as a function of the
false-alarm rate (with the original data, or the projected data, or
both!).
In order to achieve this, complete the method
create_roc_plot() -> None
in
test_roc_curve.py
(not in TD/
!) by looping
over the thresholds to apply the prediction of \(k\)-NN on the test set, given the current
threshold, and storing the results in the corresponding confusion
matrix.
Moreover, tell the user the area under the ROC curve
(AUC score). To this end, you can use the numpy
method trapz
.
Once this is implemented, plot the ROC curve by
running test_roc_curve.py
as shown below, where
CHOOSE_WISELY
is a number that determines the granularity
of our approach. Figure out which value makes sense. Note that we now
use the roc_data
dataset. Please use the number that
corresponds to your number determined by the Moodle quiz for this TD.
For the example below, we assume that this number is \(i \in [0 ..5]\). (Warning: If you run the
following command for the mail
dataset, this can take a
long time!)
$ python ./test_roc_curve.py 10 csv/roc_data_i_train.csv csv/roc_data_i_test.csv CHOOSE_WISELY 7
Run tests for the dataset roc_data
(and
the correct value of \(i\)) with 2, 6,
and 10 nearest neighbors on normalized data. Check out
MinMaxScaler
from sklearn.preprocessing
.
Which of these classifiers has the highest (test) AUC
score? (In case of a tie, go with the largest number.)
Does this classifier also have the best (maximum)
F-score? (In case of a tie for the first place, we say yes.)
Answer these two questions in the quiz on Moodle.
Putting it to the test
Last, we check how well our classifier performs on unknown data.
Have a look at the file
knn_performance_check.py
(outside of TD/
),
which shows you roughly the script that is run on the server to estimate
the performance of your classifier. This test makes use of the dataset
samples
(split into train and test), which you should download
here. Note that the test data is not the full file that is
on the server!
This script makes use of the file TD/knn_application.py
.
This file is intended for communicating a good parametrization of your
classifier to the test script. Edit
TD/knn_application.py
meaningfully.
You need to choose good parameters by yourself. The variable
seed
is the seed used for the randomness involved in the
random projection used by the classifier.
Your grade is determined by both how much faster your classifier is
in comparison to the one that does not use a random projection, and by
how large the F-score of your classifier is. You want both numbers to be
as large as possible. These two quantities are combined into a single
score, as detailed in knn_performance_check.py
(not in
TD/
!) in line 71. In a nutshell, the speed-up is very
important, but the F-score covers the final gap. Note that below this
formula, you also see at which thresholds you get which grade. Good
luck!
Attention: Since the speed-up is subject to randomness itself in how quickly the computations are performed, it may make sense to repeat an experiments of a promising parametrization, in order to get a good run. However, we note that the random fluctuation on the full test dataset is not that large, as the dataset itself is pretty huge.
Upload your file knn_application.py
here: