Classes of TD6

This page is here to familiarize you with the proposed implementation of \(k\)-NN classification, which forms the basis of the exercises. Some of the files need to be created by you, and you will be asked to do so in the TD.

The Dataset class

The class Dataset is provided in the file dataset.py. It is mainly a wrapper around a numpy ndarray that gives you easier, direct access to the dimension and the number of samples. Its initializer creates the dataset either by reading a comma-separated file or by taking an existing ndarray.

class Dataset:
    """A dataset of points of the same dimension.

    Attributes:
        dim: int              -- dimension of the ambient space
        nsamples: int         -- the number of points of the dataset
        instances: np.ndarray -- an array of all the points of the dataset
    """

    def __init__(self, file_path: str = "", dataset: np.ndarray = np.array([])):
        if file_path != "":
            self.instances = np.genfromtxt(file_path, delimiter=",")
        else:
            self.instances = dataset
        shape = np.shape(self.instances)
        self.nsamples, self.dim = shape
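
As a quick illustration of how the wrapper behaves, here is a minimal sketch that repeats the class above (so that the snippet runs on its own) and builds a Dataset from an in-memory array. The array `points` is hypothetical toy data, not one of the TD files.

```python
import numpy as np


class Dataset:
    """Copy of the class shown above, repeated so the example runs on its own."""

    def __init__(self, file_path: str = "", dataset: np.ndarray = np.array([])):
        if file_path != "":
            self.instances = np.genfromtxt(file_path, delimiter=",")
        else:
            self.instances = dataset
        shape = np.shape(self.instances)
        self.nsamples, self.dim = shape


# Hypothetical toy data: 3 samples, a label in column 0 followed by 2 features.
points = np.array([[0.0, 1.5, 2.0], [1.0, 0.5, 0.1], [0.0, 1.2, 2.2]])
ds = Dataset(dataset=points)
print(ds.nsamples, ds.dim)  # 3 3
```

Note that dim counts every column of the array, including the label column. Also, calling Dataset() with no argument raises a ValueError, because the default empty array has shape (0,), which cannot be unpacked into two values.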
  

The Classification class

The abstract class Classification is provided in the file Classification.py:

class Classification(ABC):
    """An abstract class for defining classifiers.

    Attributes:
        dataset: Dataset -- the dataset classified by the classifier
        col_class: int   -- the index of the column to classify
    """

    def __init__(self, dataset: Dataset, col_class: int):
        super().__init__()
        self.dataset = dataset
        self.col_class = col_class

    @abstractmethod
    def estimate(self, x: np.ndarray, threshold: float = 0.5) -> int:
        """Classify data point x for the given threshold."""
        pass

This class gives access to a training dataset, dataset, and to col_class, the index of the column that holds the labels (\(0 \leq\) col_class \(\leq d\)).

For convenience, let \(\mathcal{X} \subset \mathbb{R}^{d}\) denote the (\(d\)-dimensional) subset resulting from excluding the classification dimension, and let \(\mathcal{Y} = \{0,1\}\) denote the (1-dimensional) subset for the classification dimension. Note that samples in the dataset are thus referred to as \(\boldsymbol{s} = (\boldsymbol{x}, y)\), with \(\boldsymbol{x} \in \mathcal{X}\), \(y \in \mathcal{Y}\). Given these conditions, the goal of classification is the estimation, for a sample \(\boldsymbol{s}=(\boldsymbol{x}, \cdot)\) and a training dataset \(D\), of \(\hat{y} = f(\boldsymbol{x}, D)\).
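
The split of a sample \(\boldsymbol{s}\) into \((\boldsymbol{x}, y)\) can be sketched with plain numpy. The array `instances` below is hypothetical toy data with \(d = 2\) and the labels in column 0. The names `X` and `y` are chosen for this illustration only.

```python
import numpy as np

# Hypothetical toy dataset: 4 samples, d = 2 features, label stored in column 0.
instances = np.array(
    [
        [1.0, 0.2, 0.7],
        [0.0, 0.9, 0.1],
        [1.0, 0.3, 0.8],
        [0.0, 0.8, 0.2],
    ]
)
col_class = 0

# y: the classification column; X: all remaining columns (the points x of the text).
y = instances[:, col_class].astype(int)
X = np.delete(instances, col_class, axis=1)

print(X.shape)  # (4, 2)
print(y)        # [1 0 1 0]
```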

The KnnClassification class

For a refresher on the principle of \(k\)-NN classification, refer to this separate site.

Our \(k\)-NN classification is implemented through the class KnnClassification, which is declared as follows:

class KnnClassification(Classification):
    """A k-NN classifier.

    Attributes:
        k: int          -- the number of nearest neighbors to use for classification
        kd_tree: KDTree -- the kd-tree used for computing shortest distances quickly
    """

    def __init__(self, k: int, dataset: Dataset, col_class: int):
        super().__init__(dataset, col_class)
        pass

    def estimate(self, x: np.ndarray, threshold: float = 0.5) -> int:
        """Classify data point x for the given threshold."""
        pass

You have to complete the implementation of the class KnnClassification, derived from the abstract class Classification.
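
To fix ideas before writing the kd-tree-based version, here is a brute-force sketch of the same decision. It is not the implementation expected in the TD: the function name `knn_estimate_bruteforce` is hypothetical, `x` is assumed to hold only the \(d\) feature coordinates, and the decision rule (predict 1 when the fraction of neighbors labeled 1 exceeds threshold) is an assumption about how threshold is meant to be used.

```python
import numpy as np


def knn_estimate_bruteforce(
    instances: np.ndarray,
    col_class: int,
    x: np.ndarray,
    k: int,
    threshold: float = 0.5,
) -> int:
    """Classify x from its k nearest training points (hypothetical helper)."""
    labels = instances[:, col_class]
    features = np.delete(instances, col_class, axis=1)
    # Euclidean distance from x to every training point.
    distances = np.linalg.norm(features - x, axis=1)
    # Indices of the k smallest distances (their relative order is irrelevant).
    nearest = np.argpartition(distances, k - 1)[:k]
    # Assumed rule: predict 1 when the share of neighbors labeled 1 exceeds threshold.
    return int(labels[nearest].mean() > threshold)


# Hypothetical training data: label in column 0, two well-separated clusters.
train = np.array(
    [
        [0, 0.0, 0.0],
        [0, 0.1, 0.2],
        [0, 0.2, 0.1],
        [1, 1.0, 1.0],
        [1, 0.9, 1.1],
        [1, 1.1, 0.9],
    ]
)
print(knn_estimate_bruteforce(train, 0, np.array([0.95, 1.0]), k=3))  # 1
print(knn_estimate_bruteforce(train, 0, np.array([0.05, 0.1]), k=3))  # 0
```

This scans all training points for every query, which is exactly the cost that the kd_tree attribute is meant to avoid.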

Once this class is fully implemented, along with Dataset and ConfusionMatrix, you can test its performance with the test program test_knn.py, via

  $ python ./test_knn.py k train_file test_file [ column_for_classification ]

An example of its output is provided:

  $ python ./test_knn.py 10 ../csv/mail_train.csv ../csv/mail_test.csv

  No column specified for classification, assuming first column of dataset (0).
  Dataset with 4000 samples and 1900 dimensions.
  Computing k-NN classification (k = 10, classification over column 0) ...
  Prediction and Confusion Matrix filling

  Execution time: 9646 ms

                  Predicted
                  0       1
  Actual  0       570     122
          1       23      285

  Error rate              0.145
  False-alarm rate        0.17630057803468208
  Detection rate          0.9253246753246753
  F-score                 0.7972027972027973
  Precision               0.7002457002457002

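Notice that the error rate is fairly low and the detection rate is high, but the precision is noticeably lower, which pulls the F-score down: the 122 false positives are large compared with the 285 true positives.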
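
The printed metrics can be recomputed by hand from the four cells of the matrix. The sketch below assumes the usual definitions of these rates, which do reproduce the values shown in the output above.

```python
# Counts read off the confusion matrix printed above.
tn, fp, fn, tp = 570, 122, 23, 285

total = tn + fp + fn + tp
error_rate = (fp + fn) / total          # share of misclassified samples
false_alarm_rate = fp / (fp + tn)       # negatives wrongly flagged as positive
detection_rate = tp / (tp + fn)         # positives correctly detected
precision = tp / (tp + fp)              # predicted positives that are correct
f_score = 2 * precision * detection_rate / (precision + detection_rate)

print(error_rate)                  # 0.145
print(round(false_alarm_rate, 4))  # 0.1763
print(round(detection_rate, 4))    # 0.9253
print(round(precision, 4))         # 0.7002
print(round(f_score, 4))           # 0.7972
```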

The RandomProjection class

The RandomProjection class makes heavier use of numpy. It stores the original dimension (denoted by \(d\) above), the projection dimension (denoted by \(\ell\)), and the generated projection matrix (denoted by \(R\)), and it can perform either a Gaussian or a Rademacher random projection.

Its prototype is as follows:

  class RandomProjection:
      """A random projection for downsampling the data.

      Attributes:
          original_dimension: int -- the dimension of the dataset before the projection
          col_class: int          -- the index of the column to classify
          projection_dim: int     -- the dimension of the dataset after the projection
          type_sample: str        -- the type of the projection (either "Gaussian" or "Rademacher")
          projection: np.ndarray  -- the matrix of the projection itself
      """

      def __init__(
          self,
          original_dimension: int,
          col_class: int,
          projection_dim: int,
          type_sample: str,
      ):
          pass

      @staticmethod
      def random_gaussian_matrix(d: int, projection_dim: int) -> np.ndarray:
          """Creates a random Gaussian matrix."""
          pass

      @staticmethod
      def random_rademacher_matrix(d: int, projection_dim: int) -> np.ndarray:
          """Creates a random Rademacher matrix."""
          return np.random.choice(
              a=[sqrt(3.0 / projection_dim) * v for v in [-1.0, 0.0, 1.0]],
              size=(d, projection_dim),
              replace=True,
              p=[1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0],
          )

      def projection_quality(self, dataset: Dataset) -> tuple[float, float]:
          """Computes the quality of the projection."""
          pass

      def project(self, dataset: Dataset) -> Dataset:
          """Projects a dataset to a lower dimension."""
          assert (
              dataset.dim - 1 >= self.projection_dim
          ), "Impossible to project to higher dimensions!"

          ds_wo_col_class = np.delete(dataset.instances, [self.col_class], axis=1)
          minor_projected_data = ds_wo_col_class.dot(self.projection)
          # Append the column to predict to the end
          projected_data = np.c_[
              minor_projected_data, dataset.instances[:, self.col_class]
          ]

          return Dataset(dataset=projected_data)

Method projection_quality computes the mean pairwise distance between points, once in the original data and once in the projected data, and returns both values. Recall that these two distances should be \(\varepsilon\)-close.
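
One way to picture this comparison is the sketch below. It is an illustration under assumptions, not the expected solution. The helper names `gaussian_matrix` and `mean_pairwise_distance` are hypothetical, and the Gaussian scaling (entries drawn from \(N(0, 1/\ell)\)) is an assumption chosen to match the Rademacher matrix above, whose entries also have mean 0 and variance \(\frac{1}{3} \cdot \frac{3}{\ell} = \frac{1}{\ell}\).

```python
from math import sqrt

import numpy as np

rng = np.random.default_rng(0)  # fixed seed so that the run is repeatable


def gaussian_matrix(d: int, projection_dim: int) -> np.ndarray:
    """Assumed scaling: i.i.d. entries with mean 0 and variance 1/projection_dim."""
    return rng.normal(0.0, 1.0 / sqrt(projection_dim), size=(d, projection_dim))


def mean_pairwise_distance(points: np.ndarray) -> float:
    """Mean Euclidean distance over all unordered pairs of rows."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    i, j = np.triu_indices(len(points), k=1)  # count each pair once
    return float(dists[i, j].mean())


X = rng.normal(size=(40, 200))  # hypothetical data: 40 points in dimension 200
R = gaussian_matrix(200, 50)    # project from dimension 200 down to 50

before = mean_pairwise_distance(X)
after = mean_pairwise_distance(X @ R)
print(before, after)  # the two means should be close to each other
```

Computing all pairs costs memory quadratic in the number of points, so on a dataset the size of the mail example it may be preferable to average over a random subset of pairs.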

Method project performs the projection, that is, it multiplies the sub-matrix of predictors (i.e., without the labels) by the projection matrix, adds the labels as the last column of the resulting matrix, and stores it as a Dataset object.
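
The three steps of project can be traced on a tiny array. In this sketch the data are hypothetical, and a fixed matrix stands in for the random projection so that the result is easy to check by hand.

```python
import numpy as np

# Hypothetical toy dataset: label in column 0, followed by 3 features.
instances = np.array([[1.0, 2.0, 0.0, 4.0], [0.0, 1.0, 3.0, 5.0]])
col_class = 0
# Fixed 3x2 matrix used in place of a random one: it keeps the first two features.
projection = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])

features = np.delete(instances, [col_class], axis=1)  # shape (2, 3), labels removed
projected = features.dot(projection)                  # shape (2, 2)
result = np.c_[projected, instances[:, col_class]]    # labels appended as last column

print(result)
```

The labels started in column 0 but end up in column 2 of the result. In general, the label column of a projected dataset has index projection_dim, whatever col_class was in the original data.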

The ConfusionMatrix class

The ConfusionMatrix class implements and prints all metrics presented in the refresher on confusion matrices.

Its prototype is as follows:

class ConfusionMatrix:
    """A confusion matrix

    Attributes:
        confusion_matrix: np.ndarray -- The actual 2×2 confusion matrix
    """

    def __init__(self) -> None:
        pass

    def add_prediction(self, true_label: int, predicted_label: int) -> None:
        """Add a labeled point to the matrix."""
        pass

    @property
    def tp(self) -> int:
        """Return the number of true positives."""
        return int(self.confusion_matrix[1, 1])

    @property
    def tn(self) -> int:
        """Return the number of true negatives."""
        return int(self.confusion_matrix[0, 0])

    @property
    def fp(self) -> int:
        """Return the number of false positives."""
        return int(self.confusion_matrix[0, 1])

    @property
    def fn(self) -> int:
        """Return the number of false negatives."""
        return int(self.confusion_matrix[1, 0])

    def f_score(self) -> float:
        """Compute the F-score."""
        pass

    def precision(self) -> float:
        """Compute the precision."""
        pass

    def error_rate(self) -> float:
        """Compute the error rate."""
        pass

    def detection_rate(self) -> float:
        """Compute the detection rate."""
        pass

    def false_alarm_rate(self) -> float:
        """Compute the false-alarm rate."""
        pass

    def print_evaluation(self) -> None:
        """Print a summary of the values of the matrix."""
        print("\t\tPredicted")
        print("\t\t0\t1")
        print(f"Actual\t0\t{self.tn}\t{self.fp}")
        print(f"\t1\t{self.fn}\t{self.tp}\n")

        print(f"Error rate\t\t{self.error_rate()}")
        print(f"False-alarm rate\t{self.false_alarm_rate()}")
        print(f"Detection rate\t\t{self.detection_rate()}")
        print(f"F-score\t\t\t{self.f_score()}")
        print(f"Precision\t\t{self.precision()}")

The print_evaluation method prints TN, FN, TP, FP, the error rate, the false-alarm rate, the detection rate, the F-score, and the precision. It is used mainly by test_knn.py.

The add_prediction method takes two arguments: true_label, the true label of a sample, and predicted_label, the label predicted for that sample by KnnClassification. It adds the result of the prediction to the appropriate cell of the ConfusionMatrix.
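
The tp, tn, fp and fn properties above imply that rows of confusion_matrix are indexed by the true label and columns by the predicted label. Under that convention, the counting step could look like the sketch below. `TinyConfusionMatrix` is a hypothetical stand-in written for this illustration, and the zero-initialized integer array is an assumption about what the initializer does.

```python
import numpy as np


class TinyConfusionMatrix:
    """Hypothetical stand-in for ConfusionMatrix, limited to counting."""

    def __init__(self) -> None:
        # Rows are true labels, columns are predicted labels.
        self.confusion_matrix = np.zeros((2, 2), dtype=int)

    def add_prediction(self, true_label: int, predicted_label: int) -> None:
        """Increment the cell matching (true label, predicted label)."""
        self.confusion_matrix[true_label, predicted_label] += 1


cm = TinyConfusionMatrix()
# Hypothetical (true, predicted) pairs: 2 TP, 1 FP, 1 TN, 1 FN.
for true_label, predicted_label in [(1, 1), (0, 1), (0, 0), (1, 0), (1, 1)]:
    cm.add_prediction(true_label, predicted_label)

print(cm.confusion_matrix)
```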