INF442 - TD7: \(k\) Nearest Neighbors (\(k\)-NN) Regression vs. Linear Regression
Before you start
Download and unzip the archive INF442-td7-1-handin.zip. It contains the classes Dataset, Regression, LinearRegression, and KnnRegression, as well as various test files. To get full credit, you need to solve and submit exercises 1 to 6. Exercise 7 is optional.
Introduction
In this TD, you will implement two regressors, the \(k\)-NN regressor and the linear regressor, and compare their performance on several real datasets.
You are expected to submit your work as you complete it.
Datasets
The following datasets are considered (slightly post-processed for convenience):
- Introductory examples
- Wine quality (white and red), from https://archive.ics.uci.edu/ml/datasets/wine+quality
- Boston house-price data, from http://lib.stat.cmu.edu/datasets/boston
- Quiz (described in the quiz section)
- Salary
- Maison
(The csv files are in the csv/ directory.)
We consider (training) datasets with \((d+1)\) dimensions and \(n\) samples, i.e., we can represent each dataset using a matrix \(\mathbf{A} \in \mathbb{R}^{(d+1) \times n}\).
Quiz
See the quiz on Moodle. There is also a dedicated webpage that provides more details.
External libraries: Eigen and ANN
If everything went well in TD6, you may skip this section.
The ANN library is required for the \(k\)-NN exercises (5 onwards); the Eigen library is used throughout this TD. On the computers of the salles informatiques, both are already installed; their paths, /usr/local/eigen-3.3.7 and /usr/local/ann-1.1.2, are already specified in the Makefile.
If you are compiling on your own computer and have not installed these libraries yet, please do so now (see the Installation instructions). You will then need to update the paths for Eigen and ANN in the Makefile accordingly.
You will need to explore their documentation: see here for Eigen and here for ANN (suggestion: take a look at the sample program in Section 2.1.4 of the ANN manual).
Classes used in this TD
To help you navigate through the code, we provide a small explanation of the classes on a separate page. In this TD, you will implement the basic functionality of Linear Regression and KNN Regression. The source code involves the following classes: Dataset, Regression (similar to Classification from TD6), LinearRegression, and KnnRegression (similar to KnnClassification from TD6), for which you need to implement some of their basic functionality.
(A) Exercises on Linear Regression with OLS
In these exercises you will implement the functionality of Linear Regression with Ordinary Least Squares (OLS).
Your code will be tested on the following data files, which you can find in the csv directory:
- train_boston_housing.csv and regr_boston_housing.csv
- train_winequality-red.csv and regr_winequality-red.csv
- train_winequality-white.csv and regr_winequality-white.csv
Exercise 1: Set up the Linear Regression (1/2)
Implement the member functions construct_matrix() and construct_y() of the LinearRegression class, which construct (and return) the matrix X and the vector y from the data. We will use these functions to compute the coefficients of the regression. Notice that the first column of X should contain only ones.
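As a rough illustration, here is a minimal sketch of these two functions. The accessors get_nbr_samples(), get_dim(), and get_instance(i), and the members m_dataset and m_col_regr, are assumptions based on TD6; adapt them to the provided headers.

// Sketch only (LinearRegression.cpp): accessor and member names are
// assumptions to be checked against the provided headers.
Eigen::MatrixXd LinearRegression::construct_matrix() {
    int n = m_dataset->get_nbr_samples();
    int d = m_dataset->get_dim() - 1;        // predictors: all columns but one
    Eigen::MatrixXd X(n, d + 1);
    X.col(0) = Eigen::VectorXd::Ones(n);     // first column: intercept term
    for (int i = 0; i < n; i++) {
        int col = 1;
        for (int j = 0; j < m_dataset->get_dim(); j++) {
            if (j == m_col_regr) continue;   // skip the regressed column
            X(i, col++) = m_dataset->get_instance(i)[j];
        }
    }
    return X;
}

Eigen::VectorXd LinearRegression::construct_y() {
    int n = m_dataset->get_nbr_samples();
    Eigen::VectorXd y(n);
    for (int i = 0; i < n; i++)
        y(i) = m_dataset->get_instance(i)[m_col_regr];  // exact Y values
    return y;
}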
Some hints: Eigen methods transpose, setZero, Ones, rows, cols, …
You can test your implementation by typing the commands make grader and then ./grader 1 in the terminal. The command make grader launches the compilation, using the Makefile that we have provided; the command ./grader 1 runs the tests for the exercise.
Once your program passes all (or at least some) of the tests, upload your file LinearRegression.cpp.
Exercise 2: Set up the Linear Regression (2/2)
Use the two member functions from Exercise 1 to implement the member function void set_coefficients() of the LinearRegression class. This member function computes the values of the coefficients \(\{\beta_0, \beta_1, ..., \beta_d\}\) of the linear regression and stores them in the vector *m_beta.
Some hints: Eigen methods transpose, setZero, solve, Ones, rows, cols, … Note that we recommend the method solve, not inverse. In the lecture, the formula for the coefficients was \((X^TX)^{-1}X^Ty\), so it is tempting to use matrix inversion. That would be correct, but you can write more efficient code by reformulating this computation in terms of matrix products, transpositions, and linear-system solving.
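For concreteness, here is a minimal sketch of this approach, using Eigen's ldlt() factorization to solve the normal equations \((X^TX)\,\beta = X^Ty\) (it assumes m_beta was allocated beforehand, e.g., in the constructor).

void LinearRegression::set_coefficients() {
    Eigen::MatrixXd X = construct_matrix();
    Eigen::VectorXd y = construct_y();
    // Solve (X^T X) beta = X^T y without forming the inverse explicitly;
    // ldlt() is well suited to the symmetric positive semi-definite X^T X.
    *m_beta = (X.transpose() * X).ldlt().solve(X.transpose() * y);
}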
You can test your implementation by typing the commands make grader and then ./grader 2 in the terminal. The command make grader launches the compilation, using the Makefile that we have provided; the command ./grader 2 runs the tests for the exercise.
Once your program passes all (or at least some) of the tests, upload your file LinearRegression.cpp.
Exercise 3: Estimate
Implement the member function double estimate(const Eigen::VectorXd & x). This member function returns the estimated \(Y\)-value for the vector whose \(X\)-component is x; in other words, it returns the prediction \(\beta_0 + \mathbf{x} \cdot (\beta_1, ..., \beta_d)^T\).
Some hints: Eigen methods transpose, tail, rows, cols, …
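A minimal sketch, assuming m_beta holds the \(d+1\) coefficients \(\{\beta_0, \beta_1, ..., \beta_d\}\) computed in Exercise 2:

double LinearRegression::estimate(const Eigen::VectorXd & x) {
    // tail(d) drops the intercept beta_0, leaving (beta_1, ..., beta_d).
    return (*m_beta)(0) + x.dot(m_beta->tail(m_beta->size() - 1));
}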
You can test your implementation by typing the commands make grader and then ./grader 3 in the terminal. The command make grader launches the compilation, using the Makefile that we have provided; the command ./grader 3 runs the tests for the exercise.
Once your program passes all (or at least some) of the tests, upload your file LinearRegression.cpp.
Exercise 4: Linear Regression errors on datasets
Implement the member function void sums_of_squares(Dataset* dataset, double& ess, double& rss, double& tss). Given a dataset of the same dimensions as the training dataset m_dataset, this member function should:
- Extract from dataset the matrix \(X\), of size \(m \times d\), containing the dimensions involved in the prediction, and the column vector \(Y\), of size \(m\), containing the exact values of the regressed feature (to be compared with the estimation). You can use Eigen's classes MatrixXd and VectorXd for this.
- Compute the prediction \(\beta_0 + X \cdot (\beta_1, ..., \beta_d)^T\), where \(\beta\) (m_beta) is the vector of coefficients computed by set_coefficients().
- Compute the values of the Explained Sum of Squares (ESS), the Residual Sum of Squares (RSS), and the Total Sum of Squares (TSS), and return them through the pass-by-reference arguments ess, rss, and tss.
Note that this member function will be useful for computing the error of a set of regression coefficients against the training dataset, as well as against any other dataset used for performance evaluation. For the training dataset, under the least-squares model, the identity \(TSS = ESS + RSS\) holds (and the coefficient of determination can be computed as \(R^2 = \frac{ESS}{TSS}\)).
Some hints: Eigen methods transpose, tail, Constant, Ones, rows, cols, …
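One possible shape for this function is sketched below; it reuses estimate() from Exercise 3 per sample instead of building the full \(m \times d\) matrix (both approaches are valid), and it relies on the same assumed Dataset accessors as the sketch in Exercise 1.

void LinearRegression::sums_of_squares(Dataset* dataset,
                                       double& ess, double& rss, double& tss) {
    int m = dataset->get_nbr_samples();
    Eigen::VectorXd y(m), y_hat(m);
    for (int i = 0; i < m; i++) {
        Eigen::VectorXd x(dataset->get_dim() - 1);   // predictors of sample i
        int col = 0;
        for (int j = 0; j < dataset->get_dim(); j++) {
            if (j == m_col_regr) continue;
            x(col++) = dataset->get_instance(i)[j];
        }
        y(i) = dataset->get_instance(i)[m_col_regr]; // exact value
        y_hat(i) = estimate(x);                      // prediction
    }
    double y_mean = y.mean();
    ess = (y_hat.array() - y_mean).square().sum();   // explained sum of squares
    rss = (y - y_hat).squaredNorm();                 // residual sum of squares
    tss = (y.array() - y_mean).square().sum();       // total sum of squares
}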
You can test your implementation by typing the commands make grader and then ./grader 4 in the terminal. The command make grader launches the compilation, using the Makefile that we have provided; the command ./grader 4 runs the tests for the exercise.
Once your program passes all (or at least some) of the tests, upload your file LinearRegression.cpp.
(B) Exercises on Knn Regression
In these exercises you will implement the functionality of Knn Regression.
You can test your implementation using the corresponding test programs and the following data files, which you can find in the csv directory:
- train_boston_housing.csv and regr_boston_housing.csv
- train_winequality-red.csv and regr_winequality-red.csv
- train_winequality-white.csv and regr_winequality-white.csv
Exercise 5: Knn Regression constructor/destructor
Implement the constructor and destructor of the KnnRegression class.
Note that the constructor should build the \(k\)d-tree corresponding to the subset \(X\) (i.e., the \(d\) dimensions other than the regressed one) of the associated dataset; you can use the primitives of the ANN library for that, e.g., ANNkd_tree and ANNpointArray.
Note: you can take inspiration from TD6 / Exercise 1…
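A minimal sketch, assuming member names m_k, m_dataPts, and m_kdTree and a Regression(Dataset*, int) base-class constructor (check the provided header for the actual names and signature):

// Sketch only (KnnRegression.cpp): member names are assumptions.
KnnRegression::KnnRegression(int k, Dataset* dataset, int col_regr)
    : Regression(dataset, col_regr), m_k(k) {
    int n = dataset->get_nbr_samples();
    int d = dataset->get_dim() - 1;
    m_dataPts = annAllocPts(n, d);              // ANN-managed array of points
    for (int i = 0; i < n; i++) {
        int col = 0;
        for (int j = 0; j < dataset->get_dim(); j++) {
            if (j == col_regr) continue;        // exclude the regressed column
            m_dataPts[i][col++] = dataset->get_instance(i)[j];
        }
    }
    m_kdTree = new ANNkd_tree(m_dataPts, n, d); // kd-tree over the X subset
}

KnnRegression::~KnnRegression() {
    delete m_kdTree;
    annDeallocPts(m_dataPts);
    annClose();  // releases memory shared by all ANN structures
}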
You can test your implementation by typing the commands make grader and then ./grader 5 in the terminal. The command make grader launches the compilation, using the Makefile that we have provided; the command ./grader 5 runs the tests for the exercise.
Once your program passes all (or at least some) of the tests, upload your file KnnRegression.cpp.
Exercise 6: \(k\)-nearest neighbors and regression
Implement the member function double estimate(const Eigen::VectorXd & x). This member function returns the value \(\hat{y}\) estimated (through \(k\)-NN regression) for the regressed feature \(y\) of the vector x. Given the \(d\)-dimensional sample x, it should compute the \(k\) nearest neighbors of x in the subset \(X\) of the training dataset \(D\); you can use the annkSearch() function of the ANN library for that.
Once the \(k\) nearest neighbors \(\{nn_1, nn_2, ..., nn_k\}\) of x are determined, the regression value for this sample is estimated as the mean of the \(Y\)-components of the identified nearest neighbors, i.e., \(\hat{y} = \frac{1}{k}\sum_{i=1}^k y_{nn_i}\).
Note: you can take inspiration from TD6 / Exercise 2…
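A minimal sketch using annkSearch() (the last argument, eps = 0.0, requests exact nearest neighbors); member names follow the constructor sketch of Exercise 5:

double KnnRegression::estimate(const Eigen::VectorXd & x) {
    int d = x.size();
    ANNpoint query = annAllocPt(d);         // copy x into an ANN point
    for (int j = 0; j < d; j++) query[j] = x(j);
    ANNidxArray nn_idx = new ANNidx[m_k];   // indices of the k neighbors
    ANNdistArray dists = new ANNdist[m_k];  // squared distances (unused here)
    m_kdTree->annkSearch(query, m_k, nn_idx, dists, 0.0);
    double sum = 0.0;
    for (int i = 0; i < m_k; i++)           // mean of the neighbors' Y values
        sum += m_dataset->get_instance(nn_idx[i])[m_col_regr];
    annDeallocPt(query);
    delete[] nn_idx;
    delete[] dists;
    return sum / m_k;
}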
You can test your implementation by typing the commands make grader and then ./grader 6 in the terminal. The command make grader launches the compilation, using the Makefile that we have provided; the command ./grader 6 runs the tests for the exercise.
Once your program passes all (or at least some) of the tests, upload your file KnnRegression.cpp.
(Optional) Exercise 7: Impact of \(k\) on errors
By increasing the number of nearest neighbors \(k\), you could expect to achieve more accurate regressions on the explored dataset, at the cost of more computation per estimation.
You should complete the test program test_k_knn so that it has the following syntax:
./test_k_knn train_file test_file [ column_for_regression ]
You can compile it with the command make test_k_knn.
This program should generate, for a pair of training and test datasets, a data file (e.g., output.txt, or via standard-output redirection) mapping each value of \(k\) in the range [1,100] to the Mean Square Error on the explored dataset. You can use the test program test_knn for inspiration.
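Below is a sketch of what the core of test_k_knn could look like; the Dataset filename constructor, the header name, and the accessors are assumptions carried over from the earlier sketches.

#include <cstdlib>
#include <fstream>
#include <iostream>
#include "KnnRegression.hpp"  // assumed header; also brings in Dataset

int main(int argc, char* argv[]) {
    if (argc < 3) {
        std::cerr << "usage: ./test_k_knn train_file test_file"
                     " [ column_for_regression ]" << std::endl;
        return 1;
    }
    Dataset train(argv[1]);
    Dataset test(argv[2]);
    int col_regr = (argc > 3) ? std::atoi(argv[3]) : train.get_dim() - 1;
    std::ofstream out("output.txt");
    for (int k = 1; k <= 100; k++) {
        KnnRegression regr(k, &train, col_regr);
        double mse = 0.0;
        for (int i = 0; i < test.get_nbr_samples(); i++) {
            Eigen::VectorXd x(test.get_dim() - 1);  // query: all columns but col_regr
            int col = 0;
            for (int j = 0; j < test.get_dim(); j++) {
                if (j == col_regr) continue;
                x(col++) = test.get_instance(i)[j];
            }
            double err = regr.estimate(x) - test.get_instance(i)[col_regr];
            mse += err * err;                        // accumulate squared error
        }
        out << k << " " << mse / test.get_nbr_samples() << std::endl;
    }
    return 0;
}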
A few example outputs are provided below. You can observe that the evolution of the MSE with respect to \(k\) is dataset-specific and not necessarily monotonic in \(k\).
$ ./test_k_knn ./csv/train_winequality-white.csv ./csv/regr_winequality-white.csv 11
Dataset with 1999 samples, and 12 dimensions.
$ cat output.txt
1 1.360594
2 1.089717
3 0.963231
4 0.892102
5 0.844210
6 0.805661
7 0.780235
8 0.757624
9 0.747783
10 0.734569
...
20 0.682038
...
30 0.667373
...
40 0.661300
...
50 0.663417
...
60 0.663084
...
70 0.662470
...
80 0.661089
...
90 0.658104
...
100 0.655846
$ ./test_k_knn ./csv/train_winequality-red.csv ./csv/regr_winequality-red.csv 11
Dataset with 598 samples, and 12 dimensions.
$ cat output.txt
1 0.970030
2 0.813437
3 0.708514
4 0.664148
5 0.647433
6 0.652542
7 0.643642
8 0.643154
9 0.639065
10 0.641149
...
20 0.632710
...
30 0.635504
...
40 0.629290
...
50 0.625778
...
60 0.630816
...
70 0.634937
...
80 0.636074
...
90 0.634904
...
100 0.636513