INF442 - TD7: \(k\) Nearest Neighbors (\(k\)-NN) Regression vs. Linear Regression
Before you start
Download and unzip the archive INF442-td7-1-handin.zip. It contains the classes Dataset, Regression, LinearRegression, and KnnRegression, as well as various test files. To get full credit, you need to solve and submit exercises 1 to 6. Exercise 7 is optional.
Introduction
In this TD, you will implement two regressors, the \(k\)-NN regressor and the linear regressor, and compare their performance on several real datasets.
You are expected to submit your work as you complete it.
Datasets
The following datasets are considered (slightly post-processed for convenience):
- Introductory examples
- Wine quality (white and red), from https://archive.ics.uci.edu/ml/datasets/wine+quality
- Boston house-price data, from http://lib.stat.cmu.edu/datasets/boston
- Quiz (described in the quiz section)
- Salary
- Maison
(The csv files are in the csv/ directory.)
We consider (training) datasets with \((d+1)\) dimensions and \(n\) samples, i.e., we can represent each dataset using a matrix \(\mathbf{A} \in \mathbb{R}^{(d+1) \times n}\).
Quiz
See the quiz on Moodle. There is also a dedicated webpage that provides more details.
External libraries: Eigen and ANN
If everything went well in TD6, you may skip this section.
The ANN library is required for the \(k\)-NN exercises (5 onwards); the Eigen library is used throughout this TD. On the computers of the salles informatiques, both are already installed; their paths, /usr/local/eigen-3.3.7 and /usr/local/ann-1.1.2, are already specified in the Makefile.
If you are compiling on your own computer and have not installed these libraries yet, please do so now (see the Installation instructions). You will then need to update the paths for Eigen and ANN in the Makefile accordingly.
You will need to explore their documentation: see here for Eigen and here for ANN (suggestion: take a look at the sample program in Section 2.1.4 of the ANN manual).
Classes used in this TD
To help you navigate through the code, we provide a small explanation of the classes on a separate page. In this TD, you will implement the basic functionality of Linear Regression and KNN Regression. The source code involves the following classes: Dataset, Regression (similar to Classification from TD6), LinearRegression, and KnnRegression (similar to KnnClassification from TD6), for which you need to implement some of their basic functionality.
(A) Exercises on Linear Regression with OLS
In these exercises you will implement the functionality of Linear Regression with Ordinary Least Squares (OLS).
Your code will be tested on the following data files, which you can find in the csv directory:
- train_boston_housing.csv and regr_boston_housing.csv
- train_winequality-red.csv and regr_winequality-red.csv
- train_winequality-white.csv and regr_winequality-white.csv
Exercise 1: Set up the Linear Regression (1/2)
Implement the member functions construct_matrix() and construct_y() of the LinearRegression class, which construct (and return) the matrix X and the vector y from the data. We will use these functions to compute the coefficients of the regression. Notice that the first column of X should contain only ones.
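As a rough illustration, here is a minimal sketch of these two functions. The accessors get_nbr_samples(), get_dim(), and get_instance(i), and the members m_dataset and m_col_regr, are assumptions based on TD6; adapt them to the provided headers.

// Sketch only (LinearRegression.cpp): accessor and member names are
// assumptions to be checked against the provided headers.
Eigen::MatrixXd LinearRegression::construct_matrix() {
    int n = m_dataset->get_nbr_samples();
    int d = m_dataset->get_dim() - 1;        // predictors: all columns but one
    Eigen::MatrixXd X(n, d + 1);
    X.col(0) = Eigen::VectorXd::Ones(n);     // first column: intercept term
    for (int i = 0; i < n; i++) {
        int col = 1;
        for (int j = 0; j < m_dataset->get_dim(); j++) {
            if (j == m_col_regr) continue;   // skip the regressed column
            X(i, col++) = m_dataset->get_instance(i)[j];
        }
    }
    return X;
}

Eigen::VectorXd LinearRegression::construct_y() {
    int n = m_dataset->get_nbr_samples();
    Eigen::VectorXd y(n);
    for (int i = 0; i < n; i++)
        y(i) = m_dataset->get_instance(i)[m_col_regr];  // exact Y values
    return y;
}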
Some hints: Eigen methods transpose, setZero, Ones, rows, cols, …
You can test your implementation by typing the commands make grader and then ./grader 1 in the terminal. The command make grader launches the compilation, using the Makefile that we have provided; the command ./grader 1 runs the tests for the exercise.
Once your program passes all (or at least some) of the tests, upload your file LinearRegression.cpp.
Exercise 2: Set up the Linear Regression (2/2)
Use the two member functions from Exercise 1 to implement the member function void set_coefficients() of the LinearRegression class. This member function computes the values of the coefficients \(\{\beta_0, \beta_1, ..., \beta_d\}\) of the linear regression and stores them in the vector *m_beta.
Some hints: Eigen methods transpose, setZero, solve, Ones, rows, cols, … Note that we recommend the method solve, not inverse. In the lecture, the formula for the coefficients was \((X^TX)^{-1}X^Ty\), so it is tempting to use matrix inversion. That would be correct, but you can write more efficient code by reformulating this computation in terms of matrix products, transpositions, and linear-system solving.
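For concreteness, here is a minimal sketch of this approach, using Eigen's ldlt() factorization to solve the normal equations \((X^TX)\,\beta = X^Ty\) (it assumes m_beta was allocated beforehand, e.g., in the constructor).

void LinearRegression::set_coefficients() {
    Eigen::MatrixXd X = construct_matrix();
    Eigen::VectorXd y = construct_y();
    // Solve (X^T X) beta = X^T y without forming the inverse explicitly;
    // ldlt() is well suited to the symmetric positive semi-definite X^T X.
    *m_beta = (X.transpose() * X).ldlt().solve(X.transpose() * y);
}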
You can test your implementation by typing the commands make grader and then ./grader 2 in the terminal. The command make grader launches the compilation, using the Makefile that we have provided; the command ./grader 2 runs the tests for the exercise.
Once your program passes all (or at least some) of the tests, upload your file LinearRegression.cpp.
Exercise 3: Estimate
Implement the member function double estimate(const Eigen::VectorXd & x). This member function returns the estimated \(Y\)-value for the vector whose \(X\)-component is x; in other words, it returns the prediction \(\beta_0 + \mathbf{x} \cdot (\beta_1, ..., \beta_d)^T\).
Some hints: Eigen methods transpose, tail, rows, cols, …
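A minimal sketch, assuming m_beta holds the \(d+1\) coefficients \(\{\beta_0, \beta_1, ..., \beta_d\}\) computed in Exercise 2:

double LinearRegression::estimate(const Eigen::VectorXd & x) {
    // tail(d) drops the intercept beta_0, leaving (beta_1, ..., beta_d).
    return (*m_beta)(0) + x.dot(m_beta->tail(m_beta->size() - 1));
}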
You can test your implementation by typing the commands make grader and then ./grader 3 in the terminal. The command make grader launches the compilation, using the Makefile that we have provided; the command ./grader 3 runs the tests for the exercise.
Once your program passes all (or at least some) of the tests, upload your file LinearRegression.cpp.
Exercise 4: Linear Regression errors on datasets
Implement the member function void sums_of_squares(Dataset* dataset, double& ess, double& rss, double& tss). Given a dataset of the same dimensions as the training dataset m_dataset, this member function should:
- Extract from dataset the matrix \(X\), of size \(m \times d\), containing the dimensions involved in the prediction, and the column vector \(Y\), of size \(m\), containing the exact values of the regressed feature (to be compared with the estimation). You can use Eigen's classes MatrixXd and VectorXd for this.
- Compute the prediction \(\beta_0 + X \cdot (\beta_1, ..., \beta_d)^T\), where \(\beta\) (m_beta) is the vector of coefficients computed by set_coefficients().
- Compute the values of the Explained Sum of Squares (ESS), the Residual Sum of Squares (RSS), and the Total Sum of Squares (TSS), and return them through the pass-by-reference arguments ess, rss, and tss.
Note that this member function will be useful for computing the error of a set of regression coefficients against the training dataset, as well as against any other dataset used for performance evaluation. For the training dataset, under the least-squares model, the identity \(TSS = ESS + RSS\) holds (and the coefficient of determination can be computed as \(R^2 = \frac{ESS}{TSS}\)).
Some hints: Eigen methods transpose, tail, Constant, Ones, rows, cols, …
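One possible shape for this function is sketched below; it reuses estimate() from Exercise 3 per sample instead of building the full \(m \times d\) matrix (both approaches are valid), and it relies on the same assumed Dataset accessors as the sketch in Exercise 1.

void LinearRegression::sums_of_squares(Dataset* dataset,
                                       double& ess, double& rss, double& tss) {
    int m = dataset->get_nbr_samples();
    Eigen::VectorXd y(m), y_hat(m);
    for (int i = 0; i < m; i++) {
        Eigen::VectorXd x(dataset->get_dim() - 1);   // predictors of sample i
        int col = 0;
        for (int j = 0; j < dataset->get_dim(); j++) {
            if (j == m_col_regr) continue;
            x(col++) = dataset->get_instance(i)[j];
        }
        y(i) = dataset->get_instance(i)[m_col_regr]; // exact value
        y_hat(i) = estimate(x);                      // prediction
    }
    double y_mean = y.mean();
    ess = (y_hat.array() - y_mean).square().sum();   // explained sum of squares
    rss = (y - y_hat).squaredNorm();                 // residual sum of squares
    tss = (y.array() - y_mean).square().sum();       // total sum of squares
}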
You can test your implementation by typing the commands make grader and then ./grader 4 in the terminal. The command make grader launches the compilation, using the Makefile that we have provided; the command ./grader 4 runs the tests for the exercise.
Once your program passes all (or at least some) of the tests, upload your file LinearRegression.cpp.
(B) Exercises on Knn Regression
In these exercises you will implement the functionality of Knn Regression.
You can test your implementation using the corresponding test programs and the following data files, which you can find in the csv directory:
- train_boston_housing.csv and regr_boston_housing.csv
- train_winequality-red.csv and regr_winequality-red.csv
- train_winequality-white.csv and regr_winequality-white.csv
Exercise 5: Knn Regression constructor/destructor
Implement the constructor and destructor of the KnnRegression class.
Note that the constructor should build the \(k\)d-tree corresponding to the subset \(X\) (i.e., the \(d\) dimensions other than the regressed one) of the associated dataset; you can use the primitives of the ANN library for that, e.g., ANNkd_tree and ANNpointArray.
Note: you can take inspiration from TD6 / Exercise 1…
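A minimal sketch, assuming member names m_k, m_dataPts, and m_kdTree and a Regression(Dataset*, int) base-class constructor (check the provided header for the actual names and signature):

// Sketch only (KnnRegression.cpp): member names are assumptions.
KnnRegression::KnnRegression(int k, Dataset* dataset, int col_regr)
    : Regression(dataset, col_regr), m_k(k) {
    int n = dataset->get_nbr_samples();
    int d = dataset->get_dim() - 1;
    m_dataPts = annAllocPts(n, d);              // ANN-managed array of points
    for (int i = 0; i < n; i++) {
        int col = 0;
        for (int j = 0; j < dataset->get_dim(); j++) {
            if (j == col_regr) continue;        // exclude the regressed column
            m_dataPts[i][col++] = dataset->get_instance(i)[j];
        }
    }
    m_kdTree = new ANNkd_tree(m_dataPts, n, d); // kd-tree over the X subset
}

KnnRegression::~KnnRegression() {
    delete m_kdTree;
    annDeallocPts(m_dataPts);
    annClose();  // releases memory shared by all ANN structures
}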
You can test your implementation by typing the commands make grader and then ./grader 5 in the terminal. The command make grader launches the compilation, using the Makefile that we have provided; the command ./grader 5 runs the tests for the exercise.
Once your program passes all (or at least some) of the tests, upload your file KnnRegression.cpp.
Exercise 6: \(k\)-nearest neighbors and regression
Implement the member function double estimate(const Eigen::VectorXd & x). This member function returns the value \(\hat{y}\) estimated (through \(k\)-NN regression) for the regressed feature \(y\) of the vector x. Given the \(d\)-dimensional sample x, it should compute the \(k\) nearest neighbors of x in the subset \(X\) of the training dataset \(D\); you can use the annkSearch() function of the ANN library for that.
Once the \(k\) nearest neighbors \(\{nn_1, nn_2, ..., nn_k\}\) of x are determined, the regression value for this sample is estimated as the mean of the \(Y\)-components of the identified nearest neighbors, i.e., \(\hat{y} = \frac{1}{k}\sum_{i=1}^k y_{nn_i}\).
Note: you can take inspiration from TD6 / Exercise 2…
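A minimal sketch using annkSearch() (the last argument, eps = 0.0, requests exact nearest neighbors); member names follow the constructor sketch of Exercise 5:

double KnnRegression::estimate(const Eigen::VectorXd & x) {
    int d = x.size();
    ANNpoint query = annAllocPt(d);         // copy x into an ANN point
    for (int j = 0; j < d; j++) query[j] = x(j);
    ANNidxArray nn_idx = new ANNidx[m_k];   // indices of the k neighbors
    ANNdistArray dists = new ANNdist[m_k];  // squared distances (unused here)
    m_kdTree->annkSearch(query, m_k, nn_idx, dists, 0.0);
    double sum = 0.0;
    for (int i = 0; i < m_k; i++)           // mean of the neighbors' Y values
        sum += m_dataset->get_instance(nn_idx[i])[m_col_regr];
    annDeallocPt(query);
    delete[] nn_idx;
    delete[] dists;
    return sum / m_k;
}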
You can test your implementation by typing the commands make grader and then ./grader 6 in the terminal. The command make grader launches the compilation, using the Makefile that we have provided; the command ./grader 6 runs the tests for the exercise.
Once your program passes all (or at least some) of the tests, upload your file KnnRegression.cpp.
(Optional) Exercise 7: Impact of \(k\) on errors
By increasing the number of nearest neighbors \(k\), you could expect to achieve more accurate regressions on the explored dataset, at the cost of more computation per estimation.
You should complete the test program test_k_knn so that it has the following syntax:
./test_k_knn train_file test_file [ column_for_regression ]
You can compile it with the command make test_k_knn.
This program should generate, for a pair of training and test datasets, a data file (e.g., output.txt, or via standard-output redirection) mapping each value of \(k\) in the range [1,100] to the Mean Square Error on the explored dataset. You can use the test program test_knn for inspiration.
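Below is a sketch of what the core of test_k_knn could look like; the Dataset filename constructor, the header name, and the accessors are assumptions carried over from the earlier sketches.

#include <cstdlib>
#include <fstream>
#include <iostream>
#include "KnnRegression.hpp"  // assumed header; also brings in Dataset

int main(int argc, char* argv[]) {
    if (argc < 3) {
        std::cerr << "usage: ./test_k_knn train_file test_file"
                     " [ column_for_regression ]" << std::endl;
        return 1;
    }
    Dataset train(argv[1]);
    Dataset test(argv[2]);
    int col_regr = (argc > 3) ? std::atoi(argv[3]) : train.get_dim() - 1;
    std::ofstream out("output.txt");
    for (int k = 1; k <= 100; k++) {
        KnnRegression regr(k, &train, col_regr);
        double mse = 0.0;
        for (int i = 0; i < test.get_nbr_samples(); i++) {
            Eigen::VectorXd x(test.get_dim() - 1);  // query: all columns but col_regr
            int col = 0;
            for (int j = 0; j < test.get_dim(); j++) {
                if (j == col_regr) continue;
                x(col++) = test.get_instance(i)[j];
            }
            double err = regr.estimate(x) - test.get_instance(i)[col_regr];
            mse += err * err;                        // accumulate squared error
        }
        out << k << " " << mse / test.get_nbr_samples() << std::endl;
    }
    return 0;
}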
A few example outputs are provided below. You can observe that the evolution of the MSE with respect to \(k\) is dataset-specific and not necessarily monotonic in \(k\).
$ ./test_k_knn ./csv/train_winequality-white.csv ./csv/regr_winequality-white.csv 11
Dataset with 1999 samples, and 12 dimensions.
$ cat output.txt
1 1.360594
2 1.089717
3 0.963231
4 0.892102
5 0.844210
6 0.805661
7 0.780235
8 0.757624
9 0.747783
10 0.734569
...
20 0.682038
...
30 0.667373
...
40 0.661300
...
50 0.663417
...
60 0.663084
...
70 0.662470
...
80 0.661089
...
90 0.658104
...
100 0.655846
$ ./test_k_knn ./csv/train_winequality-red.csv ./csv/regr_winequality-red.csv 11
Dataset with 598 samples, and 12 dimensions.
$ cat output.txt
1 0.970030
2 0.813437
3 0.708514
4 0.664148
5 0.647433
6 0.652542
7 0.643642
8 0.643154
9 0.639065
10 0.641149
...
20 0.632710
...
30 0.635504
...
40 0.629290
...
50 0.625778
...
60 0.630816
...
70 0.634937
...
80 0.636074
...
90 0.634904
...
100 0.636513