This page is meant to familiarize you with the proposed implementation of \(k\)-NN classification, which forms the basis of the exercises. Some of the files have to be created by you; the TD will ask you to do so.
Dataset class
The class Dataset is provided in the file dataset.py.
It is mainly a wrapper around an ndarray from numpy, giving you easier, direct access to the dimension and the number of samples.
Moreover, its initializer creates the dataset either by reading a comma-separated file or by taking an existing ndarray.
import numpy as np


class Dataset:
    """A dataset of points of the same dimension.

    Attributes:
        dim: int -- dimension of the ambient space
        nsamples: int -- the number of points of the dataset
        instances: np.ndarray -- an array of all the points of the dataset
    """

    def __init__(self, file_path: str = "", dataset: np.ndarray = np.array([])):
        if file_path != "":
            # Read the points from a comma-separated file
            self.instances = np.genfromtxt(file_path, delimiter=",")
        else:
            # Wrap an existing array
            self.instances = dataset
        shape = np.shape(self.instances)
        self.nsamples, self.dim = shape
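As a quick illustration of the second way of building a Dataset (from an existing ndarray), here is a minimal sketch. The class is copied in a shortened form so that the snippet runs on its own, and the three sample points are made up for the example. Note that dim counts every column of the array, including the one that holds the labels.

```python
import numpy as np


class Dataset:
    """Shortened copy of the class above, so that the example is self-contained."""

    def __init__(self, file_path: str = "", dataset: np.ndarray = np.array([])):
        if file_path != "":
            self.instances = np.genfromtxt(file_path, delimiter=",")
        else:
            self.instances = dataset
        self.nsamples, self.dim = np.shape(self.instances)


# Three hypothetical samples: column 0 is the label, columns 1-2 are features.
points = np.array([[1.0, 0.5, 2.0], [0.0, 1.5, 0.1], [1.0, 0.7, 1.9]])
ds = Dataset(dataset=points)
print(ds.nsamples, ds.dim)  # 3 3 (the label column is counted in dim)
```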
Classification class
The abstract class Classification is provided in file Classification.py:
from abc import ABC, abstractmethod

import numpy as np

from dataset import Dataset  # module name follows the file name given above


class Classification(ABC):
    """An abstract class for defining classifiers.

    Attributes:
        dataset: Dataset -- the dataset classified by the classifier
        col_class: int -- the index of the column to classify
    """

    def __init__(self, dataset: Dataset, col_class: int):
        super().__init__()
        self.dataset = dataset
        self.col_class = col_class

    @abstractmethod
    def estimate(self, x: np.ndarray, threshold: float = 0.5) -> int:
        """Classify data point x for the given threshold."""
        pass
This class provides access to a training dataset, dataset, and to col_class, the index of the column holding the labels (\(0 \leq\) col_class \(\leq d\)).
For convenience, let \(\mathcal{X} \subset \mathbb{R}^{d}\) denote the (\(d\)-dimensional) subset resulting from excluding the classification dimension, and let \(\mathcal{Y} = \{0,1\}\) denote the (1-dimensional) subset for the classification dimension. Note that samples in the dataset are thus referred to as \(\boldsymbol{s} = (\boldsymbol{x}, y)\), with \(\boldsymbol{x} \in \mathcal{X}\), \(y \in \mathcal{Y}\). Given these conditions, the goal of classification is the estimation, for a sample \(\boldsymbol{s}=(\boldsymbol{x}, \cdot)\) and a training dataset \(D\), of \(\hat{y} = f(\boldsymbol{x}, D)\).
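To make the notation \(\boldsymbol{s} = (\boldsymbol{x}, y)\) concrete, the following sketch splits one row of a dataset into its feature part and its label. The helper name split_sample and the sample values are hypothetical and only serve the illustration; they are not part of the provided files.

```python
import numpy as np


def split_sample(s: np.ndarray, col_class: int) -> tuple[np.ndarray, int]:
    """Split a row s into (x, y): the features and the label at index col_class."""
    x = np.delete(s, col_class)  # drop the classification dimension
    y = int(s[col_class])        # the label, expected to be 0 or 1
    return x, y


s = np.array([1.0, 0.5, 2.0])    # hypothetical sample with the label in column 0
x, y = split_sample(s, col_class=0)
print(x, y)
```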
KnnClassification class
For a refresher on the principle of \(k\)-NN classification, refer to this separate site.
Our \(k\)-NN classification is implemented through the class KnnClassification, which is declared as follows:
import numpy as np

from Classification import Classification  # module names follow the file names given above
from dataset import Dataset


class KnnClassification(Classification):
    """A k-NN classifier.

    Attributes:
        k: int -- the number of nearest neighbors to use for classification
        kd_tree: KDTree -- the kd-tree used for computing shortest distances quickly
    """

    def __init__(self, k: int, dataset: Dataset, col_class: int):
        super().__init__(dataset, col_class)
        pass

    def estimate(self, x: np.ndarray, threshold: float = 0.5) -> int:
        """Classify data point x for the given threshold."""
        pass
You have to complete the implementation of the class KnnClassification, derived from the abstract class Classification.
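To fix ideas before completing the class, here is a rough brute-force sketch of the decision rule. It is not the kd-tree implementation the class is meant to have: it scans all training points instead. The function name knn_estimate_brute_force, the toy training set, and the reading of threshold (predict 1 when the fraction of neighbors labeled 1 exceeds it) are assumptions made for this illustration.

```python
import numpy as np


def knn_estimate_brute_force(
    x: np.ndarray,
    instances: np.ndarray,
    col_class: int,
    k: int,
    threshold: float = 0.5,
) -> int:
    """Classify x by a vote among its k nearest training points (linear scan)."""
    labels = instances[:, col_class]
    features = np.delete(instances, col_class, axis=1)
    # Euclidean distance from x to every training point
    distances = np.linalg.norm(features - x, axis=1)
    nearest = np.argsort(distances)[:k]
    # Assumed rule: fraction of neighbors labeled 1, compared with the threshold
    return int(labels[nearest].mean() > threshold)


# Hypothetical training set: label in column 0, two features.
train = np.array([
    [0, 0.0, 0.0], [0, 0.1, 0.0], [0, 0.0, 0.1],
    [1, 1.0, 1.0], [1, 0.9, 1.0], [1, 1.0, 0.9],
])
print(knn_estimate_brute_force(np.array([0.95, 0.95]), train, col_class=0, k=3))
print(knn_estimate_brute_force(np.array([0.05, 0.05]), train, col_class=0, k=3))
```

A kd-tree returns the same neighbors but avoids computing the distance to every training point, which matters for datasets of the size used in the tests.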
Once this class is fully implemented, as well as Dataset and ConfusionMatrix, you can test its performance with the test program test_knn.py, via
$ python ./test_knn.py k train_file test_file [ column_for_classification ]
An example of its output is provided:
$ python ./test_knn.py 10 ../csv/mail_train.csv ../csv/mail_test.csv
No column specified for classification, assuming first column of dataset (0).
Dataset with 4000 samples and 1900 dimensions.
Computing k-NN classification (k = 10, classification over column 0) ...
Prediction and Confusion Matrix filling
Execution time: 9646 ms
                Predicted
                0       1
Actual  0       570     122
        1       23      285

Error rate              0.145
False-alarm rate        0.17630057803468208
Detection rate          0.9253246753246753
F-score                 0.7972027972027973
Precision               0.7002457002457002
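Notice that the error rate is low and the detection rate is high, whereas the precision, and with it the F-score, is noticeably lower: the 122 false positives far outnumber the 23 false negatives.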
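The metrics in this sample output can be recomputed by hand from the four cells of the matrix. The sketch below uses the usual definitions of these rates (see the refresher on confusion matrices for the ones used in the course) and the counts printed above; the variable names are chosen for the example.

```python
# Counts read off the sample output: rows are actual labels, columns are predictions.
tn, fp = 570, 122
fn, tp = 23, 285

total = tn + fp + fn + tp
error_rate = (fp + fn) / total          # misclassified samples over all samples
false_alarm_rate = fp / (fp + tn)       # negatives wrongly flagged as positive
detection_rate = tp / (tp + fn)         # positives correctly detected (recall)
precision = tp / (tp + fp)              # predicted positives that are correct
f_score = 2 * precision * detection_rate / (precision + detection_rate)

print(error_rate, false_alarm_rate, detection_rate, f_score, precision)
```

The values agree with the program output up to floating-point rounding.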
RandomProjection class
The RandomProjection class makes heavier use of numpy. The basic idea is that it stores the original dimension (denoted by \(d\) above), the projection dimension (denoted by \(\ell\) above), and the generated projection matrix (denoted by \(R\) above), and that it can perform either a Gaussian or a Rademacher random projection.
Its prototype is as follows:
from math import sqrt

import numpy as np

from dataset import Dataset  # module name follows the file name given above


class RandomProjection:
    """A random projection for downsampling the data.

    Attributes:
        original_dimension: int -- the dimension of the dataset before the projection
        col_class: int -- the index of the column to classify
        projection_dim: int -- the dimension of the dataset after the projection
        type_sample: str -- the type of the projection (either "Gaussian" or "Rademacher")
        projection: np.ndarray -- the matrix of the projection itself
    """

    def __init__(
        self,
        original_dimension: int,
        col_class: int,
        projection_dim: int,
        type_sample: str,
    ):
        pass

    @staticmethod
    def random_gaussian_matrix(d: int, projection_dim: int) -> np.ndarray:
        """Creates a random Gaussian matrix."""
        pass

    @staticmethod
    def random_rademacher_matrix(d: int, projection_dim: int) -> np.ndarray:
        """Creates a random Rademacher matrix."""
        # Entries are -c, 0, +c with probabilities 1/6, 2/3, 1/6,
        # where c = sqrt(3 / projection_dim)
        return np.random.choice(
            a=[sqrt(3.0 / projection_dim) * v for v in [-1.0, 0.0, 1.0]],
            size=(d, projection_dim),
            replace=True,
            p=[1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0],
        )

    def projection_quality(self, dataset: Dataset) -> tuple[float, float]:
        """Computes the quality of the projection."""
        pass

    def project(self, dataset: Dataset) -> Dataset:
        """Projects a dataset to a lower dimension."""
        assert (
            dataset.dim - 1 >= self.projection_dim
        ), "Impossible to project to higher dimensions!"
        ds_wo_col_class = np.delete(dataset.instances, [self.col_class], axis=1)
        minor_projected_data = ds_wo_col_class.dot(self.projection)
        # Append the column to predict to the end
        projected_data = np.c_[
            minor_projected_data, dataset.instances[:, self.col_class]
        ]
        return Dataset(dataset=projected_data)
Method projection_quality computes the mean distance between pairs of points, once in the original data and once in the projected data. Recall that these two distances should be \(\varepsilon\)-close.
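The comparison made by projection_quality can be sketched outside the class as follows. This is only an illustration under stated assumptions: the helper mean_pairwise_distance, the random test points, and the Gaussian matrix with entries of variance \(1/\ell\) are choices made for this example, and the scaling expected in your implementation is the one given in the course material.

```python
import numpy as np


def mean_pairwise_distance(points: np.ndarray) -> float:
    """Mean Euclidean distance over all unordered pairs of rows of points."""
    n = points.shape[0]
    diffs = points[:, None, :] - points[None, :, :]   # shape (n, n, dim)
    distances = np.linalg.norm(diffs, axis=2)         # shape (n, n)
    upper = np.triu_indices(n, k=1)                   # each pair counted once
    return float(distances[upper].mean())


rng = np.random.default_rng(0)
n, d, ell = 30, 200, 50                               # hypothetical sizes
points = rng.normal(size=(n, d))
# Assumed scaling: entries drawn from N(0, 1/ell)
R = rng.normal(scale=1.0 / np.sqrt(ell), size=(d, ell))

before = mean_pairwise_distance(points)
after = mean_pairwise_distance(points @ R)
print(before, after)
```

With this scaling the two means should come out close to each other; how close depends on \(\ell\) and on the random draw.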
Method project performs the projection, that is, it multiplies the sub-matrix of predictors (i.e., without the labels) by the projection matrix, appends the labels as the last column of the resulting matrix, and returns the result as a Dataset object.
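The three steps of project can be followed on a tiny hand-made example. The data and the \(3 \times 2\) matrix below are hypothetical and not random, so that the result can be checked by hand; the numpy calls are the same ones used in the method.

```python
import numpy as np

# Two hypothetical samples: label in column 0, three predictors.
instances = np.array([
    [1.0, 2.0, 0.0, 4.0],
    [0.0, 1.0, 3.0, 5.0],
])
col_class = 0

# Hand-picked 3x2 matrix standing in for the random projection matrix.
projection = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])

predictors = np.delete(instances, [col_class], axis=1)     # shape (2, 3)
projected_predictors = predictors.dot(projection)          # shape (2, 2)
projected = np.c_[projected_predictors, instances[:, col_class]]
print(projected)
```

One consequence worth keeping in mind: after the projection the labels sit in the last column, whatever the value of col_class was in the original dataset.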
ConfusionMatrix class
The ConfusionMatrix class implements and prints all metrics presented in the refresher on confusion matrices.
Its prototype is as follows:
class ConfusionMatrix:
    """A confusion matrix

    Attributes:
        confusion_matrix: np.ndarray -- The actual 2×2 confusion matrix
    """

    def __init__(self) -> None:
        pass

    def add_prediction(self, true_label: int, predicted_label: int) -> None:
        """Add a labeled point to the matrix."""
        pass

    @property
    def tp(self) -> int:
        """Return the number of true positives."""
        return int(self.confusion_matrix[1, 1])

    @property
    def tn(self) -> int:
        """Return the number of true negatives."""
        return int(self.confusion_matrix[0, 0])

    @property
    def fp(self) -> int:
        """Return the number of false positives."""
        return int(self.confusion_matrix[0, 1])

    @property
    def fn(self) -> int:
        """Return the number of false negatives."""
        return int(self.confusion_matrix[1, 0])

    def f_score(self) -> float:
        """Compute the F-score."""
        pass

    def precision(self) -> float:
        """Compute the precision."""
        pass

    def error_rate(self) -> float:
        """Compute the error rate."""
        pass

    def detection_rate(self) -> float:
        """Compute the detection rate."""
        pass

    def false_alarm_rate(self) -> float:
        """Compute the false-alarm rate."""
        pass

    def print_evaluation(self) -> None:
        """Print a summary of the values of the matrix."""
        print("\t\tPredicted")
        print("\t\t0\t1")
        print(f"Actual\t0\t{self.tn}\t{self.fp}")
        print(f"\t1\t{self.fn}\t{self.tp}\n")
        print(f"Error rate\t\t{self.error_rate()}")
        print(f"False-alarm rate\t{self.false_alarm_rate()}")
        print(f"Detection rate\t\t{self.detection_rate()}")
        print(f"F-score\t\t\t{self.f_score()}")
        print(f"Precision\t\t{self.precision()}")
The print_evaluation method prints TN, FP, FN, TP, the error rate, the false-alarm rate, the detection rate, the F-score, and the precision. It is used mainly by test_knn.py.
The add_prediction method takes two arguments: true_label, the true label of a sample, and predicted_label, the label predicted for that sample (by KnnClassification). It adds the result of the prediction to the appropriate cell of the ConfusionMatrix.
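The bookkeeping behind add_prediction can be sketched with a plain \(2 \times 2\) array. The indexing convention below (row = true label, column = predicted label) is the one implied by the tp, tn, fp and fn properties of the prototype; the free function and the list of predictions are hypothetical and only illustrate the idea.

```python
import numpy as np

# Rows are true labels, columns are predicted labels, as in the properties above.
confusion_matrix = np.zeros((2, 2), dtype=int)


def add_prediction(matrix: np.ndarray, true_label: int, predicted_label: int) -> None:
    """Count one prediction in the cell (true_label, predicted_label)."""
    matrix[true_label, predicted_label] += 1


# Hypothetical (true, predicted) pairs.
for true_label, predicted_label in [(1, 1), (0, 0), (0, 1), (1, 0), (1, 1)]:
    add_prediction(confusion_matrix, true_label, predicted_label)

tn, fp = confusion_matrix[0]
fn, tp = confusion_matrix[1]
print(tn, fp, fn, tp)
```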