This page familiarizes you with the proposed implementation of \(k\)-NN classification, which forms the basis of the exercises. Some of the files need to be created by you; you will be asked to do so in the TD.
Dataset class

The class Dataset is provided in the file dataset.py. It is mainly a wrapper around a numpy ndarray, giving easier direct access to the dimension and the number of samples. Moreover, its initializer creates the dataset either by reading a comma-separated file or by wrapping an existing ndarray.
import numpy as np


class Dataset:
    """A dataset of points of the same dimension.

    Attributes:
        dim: int -- dimension of the ambient space
        nsamples: int -- the number of points of the dataset
        instances: np.ndarray -- an array of all the points of the dataset
    """

    def __init__(self, file_path: str = "", dataset: np.ndarray = np.array([])):
        if file_path != "":
            self.instances = np.genfromtxt(file_path, delimiter=",")
        else:
            self.instances = dataset
        shape = np.shape(self.instances)
        self.nsamples, self.dim = shape
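For instance, a dataset can be built either way (a small usage sketch; the CSV path is the training file used later on this page):

import numpy as np

from dataset import Dataset

# Read a comma-separated file:
train = Dataset(file_path="../csv/mail_train.csv")
print(train.nsamples, train.dim)

# Wrap an existing ndarray:
points = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
ds = Dataset(dataset=points)
assert ds.nsamples == 3 and ds.dim == 2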
Classification class

The abstract class Classification is provided in the file Classification.py:
from abc import ABC, abstractmethod

import numpy as np

from dataset import Dataset


class Classification(ABC):
    """An abstract class for defining classifiers.

    Attributes:
        dataset: Dataset -- the dataset classified by the classifier
        col_class: int -- the index of the column to classify
    """

    def __init__(self, dataset: Dataset, col_class: int):
        super().__init__()
        self.dataset = dataset
        self.col_class = col_class

    @abstractmethod
    def estimate(self, x: np.ndarray, threshold: float = 0.5) -> int:
        """Classify data point x for the given threshold."""
        pass
This class provides access to a training dataset, dataset, and to the index of the label column, col_class (with \(0 \leq\) col_class \(\leq d\)).
For convenience, let \(\mathcal{X} \subset \mathbb{R}^{d}\) denote the (\(d\)-dimensional) subset resulting from excluding the classification dimension, and let \(\mathcal{Y} = \{0,1\}\) denote the (1-dimensional) subset for the classification dimension. Note that samples in the dataset are thus referred to as \(\boldsymbol{s} = (\boldsymbol{x}, y)\), with \(\boldsymbol{x} \in \mathcal{X}\), \(y \in \mathcal{Y}\). Given these conditions, the goal of classification is the estimation, for a sample \(\boldsymbol{s}=(\boldsymbol{x}, \cdot)\) and a training dataset \(D\), of \(\hat{y} = f(\boldsymbol{x}, D)\).
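Read in these terms, the threshold argument of estimate implements a thresholded decision (a sketch of the usual convention, to be confirmed against the refresher linked below): the classifier computes a score \(\hat{p}(\boldsymbol{x}) \in [0, 1]\) and predicts
\[
\hat{y} = f(\boldsymbol{x}, D) =
\begin{cases}
1 & \text{if } \hat{p}(\boldsymbol{x}) \geq \text{threshold}, \\
0 & \text{otherwise.}
\end{cases}
\]
For \(k\)-NN, \(\hat{p}(\boldsymbol{x})\) is the fraction of positive labels among the \(k\) nearest neighbors of \(\boldsymbol{x}\) in \(D\).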
KnnClassification class

For a refresher on the principle of \(k\)-NN classification, refer to this separate site.
Our \(k\)-NN classification is implemented through the class
KnnClassification, which is declared as follows:
class KnnClassification(Classification):
    """A k-NN classifier.

    Attributes:
        k: int -- the number of nearest neighbors to use for classification
        kd_tree: KDTree -- the kd-tree used for computing shortest distances quickly
    """

    def __init__(self, k: int, dataset: Dataset, col_class: int):
        super().__init__(dataset, col_class)
        pass  # TODO: store k and build the kd-tree over the predictor columns

    def estimate(self, x: np.ndarray, threshold: float = 0.5) -> int:
        """Classify data point x for the given threshold."""
        pass  # TODO: vote among the k nearest neighbors of x
You have to complete the implementation of the class KnnClassification, derived from the abstract class Classification.
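As an illustration, here is a minimal sketch of one possible completion, assuming scipy.spatial.KDTree is the KDTree referred to in the docstring (a hypothetical sketch, not the reference solution):

import numpy as np
from scipy.spatial import KDTree  # assumed kd-tree implementation

from Classification import Classification
from dataset import Dataset


class KnnClassification(Classification):
    def __init__(self, k: int, dataset: Dataset, col_class: int):
        super().__init__(dataset, col_class)
        self.k = k
        # Build the kd-tree on the predictors only (label column removed).
        predictors = np.delete(dataset.instances, [col_class], axis=1)
        self.kd_tree = KDTree(predictors)

    def estimate(self, x: np.ndarray, threshold: float = 0.5) -> int:
        """Classify data point x for the given threshold."""
        # Indices of the k nearest training points to x.
        _, idx = self.kd_tree.query(x, k=self.k)
        labels = self.dataset.instances[np.atleast_1d(idx), self.col_class]
        # Thresholded majority vote over the neighbors' labels.
        return int(labels.mean() >= threshold)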
Once this class is fully implemented, along with Dataset and ConfusionMatrix, you can test its performance with the test program test_knn.py, via
$ python ./test_knn.py k train_file test_file [ column_for_classification ]
An example of its output is provided:
$ python ./test_knn.py 10 ../csv/mail_train.csv ../csv/mail_test.csv
No column specified for classification, assuming first column of dataset (0).
Dataset with 4000 samples and 1900 dimensions.
Computing k-NN classification (k = 10, classification over column 0) ...
Prediction and Confusion Matrix filling
Execution time: 9646 ms
Predicted
0 1
Actual 0 570 122
1 23 285
Error rate 0.145
False-alarm rate 0.17630057803468208
Detection rate 0.9253246753246753
F-score 0.7972027972027973
Precision 0.7002457002457002
Notice that the error rate is low and the detection rate is high, but the precision, and hence the F-score, are noticeably lower.
RandomProjection class

The RandomProjection class makes heavier use of numpy. The basic idea is that it stores the original dimension (denoted by \(d\) above), the projection dimension (denoted by \(\ell\) above), and the generated projection matrix (denoted by \(R\) above), and it can perform either Gaussian or Rademacher random projection. Its prototype is as follows:
from math import sqrt

import numpy as np

from dataset import Dataset


class RandomProjection:
    """A random projection for downsampling the data.

    Attributes:
        original_dimension: int -- the dimension of the dataset before the projection
        col_class: int -- the index of the column to classify
        projection_dim: int -- the dimension of the dataset after the projection
        type_sample: str -- the type of the projection (either "Gaussian" or "Rademacher")
        projection: np.ndarray -- the matrix of the projection itself
    """

    def __init__(
        self,
        original_dimension: int,
        col_class: int,
        projection_dim: int,
        type_sample: str,
    ):
        pass  # TODO: store the attributes and generate the projection matrix

    @staticmethod
    def random_gaussian_matrix(d: int, projection_dim: int) -> np.ndarray:
        """Creates a random Gaussian matrix."""
        pass  # TODO

    @staticmethod
    def random_rademacher_matrix(d: int, projection_dim: int) -> np.ndarray:
        """Creates a random Rademacher matrix."""
        return np.random.choice(
            a=[sqrt(3.0 / projection_dim) * v for v in [-1.0, 0.0, 1.0]],
            size=(d, projection_dim),
            replace=True,
            p=[1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0],
        )

    def projection_quality(self, dataset: Dataset) -> tuple[float, float]:
        """Computes the quality of the projection."""
        pass  # TODO

    def project(self, dataset: Dataset) -> Dataset:
        """Projects a dataset to a lower dimension."""
        assert (
            dataset.dim - 1 >= self.projection_dim
        ), "Impossible to project to higher dimensions!"
        ds_wo_col_class = np.delete(dataset.instances, [self.col_class], axis=1)
        minor_projected_data = ds_wo_col_class.dot(self.projection)
        # Append the column to predict to the end
        projected_data = np.c_[
            minor_projected_data, dataset.instances[:, self.col_class]
        ]
        return Dataset(dataset=projected_data)
Method projection_quality calculates the mean distance between pairs of points in the original data and in the projected data. Recall that these two quantities should be \(\varepsilon\)-close.
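This is the usual Johnson–Lindenstrauss style guarantee (stated here as a reminder, assuming it is what the recalled result refers to): for points \(\boldsymbol{x}, \boldsymbol{y}\) of the dataset, with high probability
\[
(1 - \varepsilon)\,\lVert \boldsymbol{x} - \boldsymbol{y} \rVert^{2} \;\leq\; \lVert \boldsymbol{x}R - \boldsymbol{y}R \rVert^{2} \;\leq\; (1 + \varepsilon)\,\lVert \boldsymbol{x} - \boldsymbol{y} \rVert^{2},
\]
where points are rows, so that projecting amounts to right-multiplication by \(R\), as in project below.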
Method project performs the projection: it multiplies the sub-matrix of predictors (i.e., without the labels) by the projection matrix, appends the labels as the last column of the resulting matrix, and returns the result as a Dataset object.
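For the Gaussian variant, one possible completion of random_gaussian_matrix (a sketch, not the reference solution) draws i.i.d. entries from \(\mathcal{N}(0, 1/\ell)\), which is exactly the variance of the Rademacher entries above, so that squared distances are preserved in expectation:

import numpy as np

def random_gaussian_matrix(d: int, projection_dim: int) -> np.ndarray:
    """Creates a random Gaussian matrix (sketch)."""
    # i.i.d. entries N(0, 1/projection_dim): same variance as the
    # Rademacher entries above.
    return np.random.normal(
        loc=0.0, scale=1.0 / np.sqrt(projection_dim), size=(d, projection_dim)
    )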
ConfusionMatrix class

The ConfusionMatrix class implements and prints all metrics presented in the refresher on confusion matrices. Its prototype is as follows:
import numpy as np


class ConfusionMatrix:
    """A confusion matrix.

    Attributes:
        confusion_matrix: np.ndarray -- The actual 2×2 confusion matrix
    """

    def __init__(self) -> None:
        pass  # TODO: initialize the 2×2 matrix

    def add_prediction(self, true_label: int, predicted_label: int) -> None:
        """Add a labeled point to the matrix."""
        pass  # TODO

    @property
    def tp(self) -> int:
        """Return the number of true positives."""
        return int(self.confusion_matrix[1, 1])

    @property
    def tn(self) -> int:
        """Return the number of true negatives."""
        return int(self.confusion_matrix[0, 0])

    @property
    def fp(self) -> int:
        """Return the number of false positives."""
        return int(self.confusion_matrix[0, 1])

    @property
    def fn(self) -> int:
        """Return the number of false negatives."""
        return int(self.confusion_matrix[1, 0])

    def f_score(self) -> float:
        """Compute the F-score."""
        pass  # TODO

    def precision(self) -> float:
        """Compute the precision."""
        pass  # TODO

    def error_rate(self) -> float:
        """Compute the error rate."""
        pass  # TODO

    def detection_rate(self) -> float:
        """Compute the detection rate."""
        pass  # TODO

    def false_alarm_rate(self) -> float:
        """Compute the false-alarm rate."""
        pass  # TODO

    def print_evaluation(self) -> None:
        """Print a summary of the values of the matrix."""
        print("\t\tPredicted")
        print("\t\t0\t1")
        print(f"Actual\t0\t{self.tn}\t{self.fp}")
        print(f"\t1\t{self.fn}\t{self.tp}\n")
        print(f"Error rate\t\t{self.error_rate()}")
        print(f"False-alarm rate\t{self.false_alarm_rate()}")
        print(f"Detection rate\t\t{self.detection_rate()}")
        print(f"F-score\t\t\t{self.f_score()}")
        print(f"Precision\t\t{self.precision()}")
The print_evaluation method prints TN, FN, TP, FP, the error rate, the false-alarm rate, the detection rate, the F-score, and the precision. It is used mainly by test_knn.py.
The add_prediction method takes two arguments: true_label (the true label of a sample) and predicted_label (the label predicted by KnnClassification for that sample). It adds the result of the prediction to the appropriate cell of the ConfusionMatrix.
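For illustration, here is a minimal sketch of add_prediction and the metrics (hypothetical, assuming the usual definitions, which are consistent with the sample output above; the reference definitions are those of the refresher on confusion matrices):

import numpy as np


class ConfusionMatrix:
    def __init__(self) -> None:
        # Rows index the true label, columns the predicted label.
        self.confusion_matrix = np.zeros((2, 2), dtype=np.int64)

    def add_prediction(self, true_label: int, predicted_label: int) -> None:
        # Increment the cell (true, predicted).
        self.confusion_matrix[true_label, predicted_label] += 1

    def error_rate(self) -> float:
        # Misclassified samples (FP + FN) over all samples.
        cm = self.confusion_matrix
        return float((cm[0, 1] + cm[1, 0]) / cm.sum())

    def detection_rate(self) -> float:
        # Recall: TP / (TP + FN).
        cm = self.confusion_matrix
        return float(cm[1, 1] / (cm[1, 1] + cm[1, 0]))

    def false_alarm_rate(self) -> float:
        # FP / (FP + TN).
        cm = self.confusion_matrix
        return float(cm[0, 1] / (cm[0, 1] + cm[0, 0]))

    def precision(self) -> float:
        # TP / (TP + FP).
        cm = self.confusion_matrix
        return float(cm[1, 1] / (cm[1, 1] + cm[0, 1]))

    def f_score(self) -> float:
        # Harmonic mean of precision and detection rate.
        p, r = self.precision(), self.detection_rate()
        return 2.0 * p * r / (p + r)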