Classes of TD7

We provide a short description of the classes used in TD7 to implement the linear and \(k\)-NN regressors.

The Dataset class

The implementation of the class Dataset is in the files Dataset.hpp and Dataset.cpp.

Remark: It is (almost) identical to the Dataset of TD6.

Its declaration is as follows:

class Dataset {
    public:
        Dataset(const char* file);
        
        ~Dataset();
        
        void show(bool verbose);
        std::vector<double> get_instance(int i);
        int get_nbr_samples();
        int get_dim();

    private:
        int m_dim;
        int m_nsamples;
        std::vector<std::vector<double>> m_instances;
};

You do not need to implement any functionality for this class.

The Regression class

The implementation of the abstract class Regression is in the files Regression.hpp and Regression.cpp. Its declaration is as follows:

class Regression{
    protected:
        Dataset* m_dataset;
        int m_col_regr;
    public:
        Regression(Dataset* dataset, int col_regr);
        virtual double estimate( const Eigen::VectorXd & x ) = 0;
        int get_col_regr();
        Dataset* get_dataset();
};

This class provides a pointer to a training dataset, m_dataset, and the index of the regression dimension (\(0 \leq\) m_col_regr \(< d+1\)).

For convenience, we will call \(\mathcal{X} \subset \mathbb{R}^{d}\) the (\(d\)-dimensional) subset resulting from excluding the regression dimension, and \(\mathcal{Y} \subset \mathbb{R}\) the (1-dimensional) subset for the regression dimension. Samples in the dataset will thus be referred to as \(\mathbf{s} = (\mathbf{x}, y)\), with \(\mathbf{x} \in \mathcal{X}\), \(y \in \mathcal{Y}\). Under these conditions, the goal of regression is to estimate, for a sample \(\mathbf{s}=(\mathbf{x}, y)\) and a training dataset \(D\), the value \(\hat{y} = f(\mathbf{x}, D)\).

The LinearRegression class

A \(d\)-multivariate linear regression algorithm provides a set of coefficients \(\{ \beta_0, \beta_1, ..., \beta_{d} \}\), such that \(\beta_0 + \sum_{i=1}^{d} \beta_i x_i = (\beta_0, \beta_1, ..., \beta_d) \cdot (1, x_1, x_2, ..., x_d)^T\) is the BLU (best linear unbiased) estimator for \(y\).

These coefficients are determined so as to minimize the estimation error over a training dataset. There are several methods for computing them; in this TD we will use the Ordinary Least Squares (OLS) estimator, as indicated in the slides of the course.

Linear regression, based on a training dataset, is implemented through a class LinearRegression, derived from Regression. The declaration of the class is as follows:
class LinearRegression : public Regression {
    private:
        Eigen::VectorXd* m_beta;
    public:
        LinearRegression(Dataset* dataset, int col_regr);
        ~LinearRegression();
        Eigen::MatrixXd construct_matrix();
        Eigen::VectorXd construct_y();
        void set_coefficients();
        Eigen::VectorXd get_coefficients();
        void show_coefficients();
        void print_raw_coefficients();
        void sums_of_squares(Dataset* dataset, double& ess, double& rss, double& tss);
        double estimate(const Eigen::VectorXd & x);
};

In Part A, you will complete the implementation of the class LinearRegression by implementing the member functions set_coefficients, sums_of_squares, and estimate.

The KnnRegression class

Remark: This is similar to the KnnClassification class in TD6.

The principle of \(k\)-NN regression is very simple. Given two subsets \(X \subset \mathbb{R}^d\) (for the dimensions not involved in the regression) and \(Y \subset \mathbb{R}\) (the regression dimension), and a training dataset \(D\) with instances \(\mathbf{z} = (\mathbf{z}_X, z_Y)\), \(\mathbf{z}_X \in X\), \(z_Y \in Y\), the regression value for an instance \(\mathbf{u} \in \mathbb{R}^{d}\) is computed as a function of the \(Y\)-values of the \(k\) training instances in \(D\) whose \(X\)-components are nearest to \(\mathbf{u}\).

We implement the \(k\)-NN regression through class KnnRegression, the declaration of which is as follows:
class KnnRegression : public Regression {
    private:
        int m_k;
        ANNkd_tree* m_kdTree;
    public:
        KnnRegression(int k, Dataset* dataset, int col_regr);
        ~KnnRegression();
        double estimate(const Eigen::VectorXd & x);
        int get_k();
        ANNkd_tree* get_kdTree();
};

In Part B, you will complete the implementation of the class KnnRegression by implementing its constructor, its destructor, and the member function estimate.