This page introduces the proposed implementation of \(k\)-NN classification, which forms the basis of the exercises. Some of the files must be created by you; the TD will tell you when to do so.
Dataset class
The class Dataset is provided in the file dataset.py.
    It is mainly a wrapper around a numpy ndarray, giving you easier direct access to
    the dimension and the number of samples.
    Moreover, its initializer creates the dataset either by reading a comma-separated file or by wrapping an existing
    ndarray.
  
import numpy as np

class Dataset:
    """A dataset of points of the same dimension.
    Attributes:
        dim: int              -- dimension of the ambient space
        nsamples: int         -- the number of points of the dataset
        instances: np.ndarray -- an array of all the points of the dataset
    """
    def __init__(self, file_path: str = "", dataset: np.ndarray = np.array([])):
        if file_path != "":
            self.instances = np.genfromtxt(file_path, delimiter=",")
        else:
            self.instances = dataset
        shape = np.shape(self.instances)
        self.nsamples, self.dim = shape
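
As a quick illustration of the wrapper, here is a minimal sketch that builds a Dataset from an in-memory array. The class body is copied from the listing above so that the snippet runs on its own; the toy array is made up for the example. Note that dim counts every column of the array, including the label column.

```python
import numpy as np


class Dataset:
    """Copy of the Dataset wrapper shown above, repeated so the example is self-contained."""

    def __init__(self, file_path: str = "", dataset: np.ndarray = np.array([])):
        if file_path != "":
            self.instances = np.genfromtxt(file_path, delimiter=",")
        else:
            self.instances = dataset
        self.nsamples, self.dim = np.shape(self.instances)


# Hypothetical toy data: 3 samples, label in column 0 followed by 2 features.
toy = np.array([[0.0, 1.5, 2.0], [1.0, 0.3, 0.7], [0.0, 2.2, 1.1]])
ds = Dataset(dataset=toy)
print(ds.nsamples, ds.dim)  # 3 rows, 3 columns (label column included)
```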
  
Classification class
The abstract class Classification is provided in the file Classification.py:
from abc import ABC, abstractmethod
import numpy as np
from dataset import Dataset

class Classification(ABC):
    """An abstract class for defining classifiers.
    Attributes:
        dataset: Dataset -- the dataset classified by the classifier
        col_class: int   -- the index of the column to classify
    """
    def __init__(self, dataset: Dataset, col_class: int):
        super().__init__()
        self.dataset = dataset
        self.col_class = col_class
    @abstractmethod
    def estimate(self, x: np.ndarray, threshold: float = 0.5) -> int:
        """Classify data point x for the given threshold."""
        pass
  This class provides access to a training dataset, dataset, and to the index of the label column,
    col_class (\(0 \leq\) col_class \(\leq d\)).
For convenience, let \(\mathcal{X} \subset \mathbb{R}^{d}\) denote the (\(d\)-dimensional) subset resulting from excluding the classification dimension, and let \(\mathcal{Y} = \{0,1\}\) denote the (1-dimensional) subset for the classification dimension. Note that samples in the dataset are thus referred to as \(\boldsymbol{s} = (\boldsymbol{x}, y)\), with \(\boldsymbol{x} \in \mathcal{X}\), \(y \in \mathcal{Y}\). Given these conditions, the goal of classification is the estimation, for a sample \(\boldsymbol{s}=(\boldsymbol{x}, \cdot)\) and a training dataset \(D\), of \(\hat{y} = f(\boldsymbol{x}, D)\).
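
To make the notation concrete, the following sketch separates an array of samples into the features \(\boldsymbol{x}\) and the labels \(y\) for a given col_class. The helper name split_features_labels and the toy data are assumptions made for this illustration; they are not part of the provided files.

```python
import numpy as np


def split_features_labels(instances: np.ndarray, col_class: int):
    """Return (X, y): X is instances without column col_class, y is that column."""
    y = instances[:, col_class]
    X = np.delete(instances, col_class, axis=1)
    return X, y


# Hypothetical toy data: 2 samples, label in column 0, d = 2 features.
data = np.array([[1.0, 0.2, 0.4], [0.0, 0.9, 0.1]])
X, y = split_features_labels(data, col_class=0)
print(X.shape)  # (2, 2)
print(y)
```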
KnnClassification class
For a refresher on the principle of \(k\)-NN classification, refer to this separate site.
Our \(k\)-NN classification is implemented through the class
    KnnClassification, which is declared as follows:
  
class KnnClassification(Classification):
    """A k-NN classifier.
    Attributes:
        k: int          -- the number of nearest neighbors to use for classification
        kd_tree: KDTree -- the kd-tree used for computing shortest distances quickly
    """
    def __init__(self, k: int, dataset: Dataset, col_class: int):
        super().__init__(dataset, col_class)
        pass
    def estimate(self, x: np.ndarray, threshold: float = 0.5) -> int:
        """Classify data point x for the given threshold."""
        pass
  You have to complete the implementation of the class KnnClassification, derived from the abstract
    class Classification.
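
To fix ideas before you start, here is a brute-force sketch of the decision rule, written as a standalone function rather than as the class you have to complete. It does not use a kd-tree, and it assumes that the threshold is compared with the fraction of the \(k\) nearest neighbors labeled 1; both the function name and that reading of threshold are assumptions, so check them against the TD.

```python
import numpy as np


def knn_estimate_brute_force(
    train: np.ndarray, col_class: int, k: int, x: np.ndarray, threshold: float = 0.5
) -> int:
    """Classify x by a vote among its k nearest training points (no kd-tree)."""
    labels = train[:, col_class]
    points = np.delete(train, col_class, axis=1)
    # Euclidean distance from x to every training point.
    dists = np.linalg.norm(points - x, axis=1)
    nearest = np.argsort(dists)[:k]
    # Assumed rule: predict 1 when the fraction of 1-labels exceeds the threshold.
    p_hat = labels[nearest].mean()
    return int(p_hat > threshold)


# Hypothetical toy training set: label in column 0, two well-separated clusters.
train = np.array(
    [
        [0, 0.0, 0.0],
        [0, 0.1, 0.2],
        [0, 0.2, 0.1],
        [1, 5.0, 5.0],
        [1, 5.1, 4.9],
        [1, 4.9, 5.2],
    ]
)
print(knn_estimate_brute_force(train, 0, 3, np.array([0.1, 0.1])))  # 0
print(knn_estimate_brute_force(train, 0, 3, np.array([5.0, 5.0])))  # 1
```

A brute-force scan costs one distance computation per training point and per query, which is what the kd-tree of the actual class is meant to avoid.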
Once this class, as well as Dataset and ConfusionMatrix, is fully implemented, you can test
    its performance with
    the test program test_knn.py:
$ python ./test_knn.py k train_file test_file [ column_for_classification ]
An example of its output is provided:
  $ python ./test_knn.py 10 ../csv/mail_train.csv ../csv/mail_test.csv
  No column specified for classification, assuming first column of dataset (0).
  Dataset with 4000 samples and 1900 dimensions.
  Computing k-NN classification (k = 10, classification over column 0) ...
  Prediction and Confusion Matrix filling
  Execution time: 9646 ms
                  Predicted
                  0       1
  Actual  0       570     122
          1       23      285
  Error rate              0.145
  False-alarm rate        0.17630057803468208
  Detection rate          0.9253246753246753
  F-score                 0.7972027972027973
  Precision               0.7002457002457002
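  Notice that the error rate is fairly low and the detection rate is high, but the precision and the F-score are markedly lower: about 30% of the samples predicted as positive are false alarms.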
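
The figures in this sample output can be recomputed by hand from the four cells of the matrix. The sketch below uses the usual definitions of the metrics, which reproduce the printed values; the variable names are chosen for the example.

```python
# Cells read from the sample output above (rows: actual, columns: predicted).
tn, fp = 570, 122
fn, tp = 23, 285

total = tn + fp + fn + tp
error_rate = (fp + fn) / total            # misclassified / all samples
false_alarm_rate = fp / (fp + tn)         # negatives wrongly flagged as positive
detection_rate = tp / (tp + fn)           # positives correctly detected (recall)
precision = tp / (tp + fp)                # predicted positives that are correct
f_score = 2 * precision * detection_rate / (precision + detection_rate)

print(f"Error rate        {error_rate}")
print(f"False-alarm rate  {false_alarm_rate}")
print(f"Detection rate    {detection_rate}")
print(f"F-score           {f_score}")
print(f"Precision         {precision}")
```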
RandomProjection class
The RandomProjection class makes heavier use of numpy. It stores the original
    dimension (denoted by \(d\) above), the projection dimension (denoted by \(\ell\) above), and the generated projection matrix (denoted by \(R\) above), and it can perform either a Gaussian or a Rademacher
    random projection.
Its prototype is as follows:
  from math import sqrt
  import numpy as np

  class RandomProjection:
      """A random projection for downsampling the data.
      Attributes:
          original_dimension: int -- the dimension of the dataset before the projection
          col_class: int          -- the index of the column to classify
          projection_dim: int     -- the dimension of the dataset after the projection
          type_sample: str        -- the type of the projection (either "Gaussian" or "Rademacher")
          projection: np.ndarray  -- the matrix of the projection itself
      """
      def __init__(
          self,
          original_dimension: int,
          col_class: int,
          projection_dim: int,
          type_sample: str,
      ):
          pass
      @staticmethod
      def random_gaussian_matrix(d: int, projection_dim: int) -> np.ndarray:
          """Creates a random Gaussian matrix."""
          pass
      @staticmethod
      def random_rademacher_matrix(d: int, projection_dim: int) -> np.ndarray:
          """Creates a random Rademacher matrix."""
          return np.random.choice(
              a=[sqrt(3.0 / projection_dim) * v for v in [-1.0, 0.0, 1.0]],
              size=(d, projection_dim),
              replace=True,
              p=[1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0],
          )
      def projection_quality(self, dataset: Dataset) -> tuple[float, float]:
          """Computes the quality of the projection."""
          pass
      def project(self, dataset: Dataset) -> Dataset:
          """Projects a dataset to a lower dimension."""
          assert (
              dataset.dim - 1 >= self.projection_dim
          ), "Impossible to project to higher dimensions!"
          ds_wo_col_class = np.delete(dataset.instances, [self.col_class], axis=1)
          minor_projected_data = ds_wo_col_class.dot(self.projection)
          # Append the column to predict to the end
          projected_data = np.c_[
              minor_projected_data, dataset.instances[:, self.col_class]
          ]
          return Dataset(dataset=projected_data)
  Method projection_quality calculates the mean distance between pairs of points, once in the original data and once in the
    projected data. Recall that these two distances should be \(\varepsilon\)-close.
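
The comparison can be sketched as follows on synthetic data. The helper mean_pairwise_distance is a name chosen for this example, and the Gaussian matrix is drawn with entries of variance \(1/\ell\), an assumption made here so that its scaling matches the variance of the Rademacher matrix above; check both against the TD before reusing them.

```python
import numpy as np


def mean_pairwise_distance(points: np.ndarray) -> float:
    """Mean Euclidean distance over all unordered pairs of rows."""
    n = points.shape[0]
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    upper = np.triu_indices(n, k=1)  # each pair counted once, no self-pairs
    return float(dists[upper].mean())


rng = np.random.default_rng(0)
n, d, ell = 30, 200, 50
X = rng.normal(size=(n, d))  # synthetic features, no label column

# Assumed scaling: entries drawn from N(0, 1/ell).
R = rng.normal(scale=1.0 / np.sqrt(ell), size=(d, ell))

before = mean_pairwise_distance(X)
after = mean_pairwise_distance(X @ R)
print(before, after, after / before)
```

With this scaling the ratio of the two means should stay close to 1; how close depends on \(\ell\).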
  
Method project performs the projection, that is, it multiplies the sub-matrix of predictors (i.e.,
    without
    the labels) by the projection matrix, appends the labels as the last column of the resulting matrix, and returns the result as
    a Dataset object.
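
The same three steps can be traced on a small array. This sketch works on plain ndarrays instead of Dataset objects, and the sizes and random data are made up for the example; the matrix is drawn with the same values and probabilities as random_rademacher_matrix above.

```python
from math import sqrt

import numpy as np

rng = np.random.default_rng(1)
col_class, d, ell = 0, 6, 3

# Hypothetical data: 5 samples, binary label in column 0, then d features.
data = np.c_[rng.integers(0, 2, size=5), rng.normal(size=(5, d))]

# Same distribution as random_rademacher_matrix: values scaled by sqrt(3 / ell).
R = rng.choice(
    [sqrt(3.0 / ell) * v for v in (-1.0, 0.0, 1.0)],
    size=(d, ell),
    p=[1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0],
)

features = np.delete(data, [col_class], axis=1)          # drop the label column
projected = np.c_[features.dot(R), data[:, col_class]]   # labels go last
print(projected.shape)  # (5, 4)
```

After projection the labels sit in the last column, at index projection_dim, whatever the original col_class was, so a classifier built on the projected dataset must use that index.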
ConfusionMatrix class
The ConfusionMatrix class implements and prints all metrics presented in the refresher on confusion matrices.
Its prototype is as follows:
class ConfusionMatrix:
    """A confusion matrix
    Attributes:
        confusion_matrix: np.ndarray -- The actual 2×2 confusion matrix
    """
    def __init__(self) -> None:
        pass
    def add_prediction(self, true_label: int, predicted_label: int) -> None:
        """Add a labeled point to the matrix."""
        pass
    @property
    def tp(self) -> int:
        """Return the number of true positives."""
        return int(self.confusion_matrix[1, 1])
    @property
    def tn(self) -> int:
        """Return the number of true negatives."""
        return int(self.confusion_matrix[0, 0])
    @property
    def fp(self) -> int:
        """Return the number of false positives."""
        return int(self.confusion_matrix[0, 1])
    @property
    def fn(self) -> int:
        """Return the number of false negatives."""
        return int(self.confusion_matrix[1, 0])
    def f_score(self) -> float:
        """Compute the F-score."""
        pass
    def precision(self) -> float:
        """Compute the precision."""
        pass
    def error_rate(self) -> float:
        """Compute the error rate."""
        pass
    def detection_rate(self) -> float:
        """Compute the detection rate."""
        pass
    def false_alarm_rate(self) -> float:
        """Compute the false-alarm rate."""
        pass
    def print_evaluation(self) -> None:
        """Print a summary of the values of the matrix."""
        print("\t\tPredicted")
        print("\t\t0\t1")
        print(f"Actual\t0\t{self.tn}\t{self.fp}")
        print(f"\t1\t{self.fn}\t{self.tp}\n")
        print(f"Error rate\t\t{self.error_rate()}")
        print(f"False-alarm rate\t{self.false_alarm_rate()}")
        print(f"Detection rate\t\t{self.detection_rate()}")
        print(f"F-score\t\t\t{self.f_score()}")
        print(f"Precision\t\t{self.precision()}")
  The print_evaluation method prints TN, FN, TP, FP, the error rate, the false-alarm rate, the detection
    rate, the F-score, and the precision. It is used mainly by test_knn.py.
The add_prediction method takes two arguments: true_label, the true label of a sample,
    and predicted_label, the label predicted for that sample by KnnClassification. It adds
    the result of the prediction to the appropriate cell of the ConfusionMatrix.
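
The cell layout implied by the properties tp, tn, fp and fn is: row index = true label, column index = predicted label. A minimal sketch of the update, written as a free function on a module-level array rather than as the method you have to write:

```python
import numpy as np

# 2x2 counts; rows are true labels, columns are predicted labels.
confusion_matrix = np.zeros((2, 2), dtype=int)


def add_prediction(true_label: int, predicted_label: int) -> None:
    """Increment the cell for one (true, predicted) pair."""
    confusion_matrix[true_label, predicted_label] += 1


# Hypothetical predictions: (true label, predicted label).
for true_label, predicted_label in [(1, 1), (0, 0), (0, 1), (1, 0), (1, 1)]:
    add_prediction(true_label, predicted_label)

print(confusion_matrix)
# [[1 1]
#  [1 2]]
```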