CSC_43042_EP - TD6: \(k\) Nearest Neighbors (\(k\)-NN) Regression vs. Linear Regression
This document details the TD exercises while also reviewing key theoretical concepts from the lecture “Linear Models for Regression”. Each section now includes additional explanations along with figure placeholders to help visualize the concepts.
Before you begin
Download and unzip the archive CSC43042EP-td6-1-handin.zip.
Overview and Instructions
TD Objectives: You will implement two regression methods—linear regression (using the Ordinary Least Squares (OLS) estimator) and \(k\)-Nearest Neighbors regression—and compare their performance on real datasets.
Datasets: The CSV files (e.g., train_boston_housing.csv, winequality-red.csv, etc.) are located in the csv/ directory. These datasets include:
- Boston Housing data
- Wine quality data (red and white)
- Salary and Maison datasets
Submission: Complete Questions 1-5 in the quiz on Moodle using quiz/quiz_regression.py, and exercises 1 to 6. Upload your updated files as you complete each exercise.
Environment: The quiz requires scikit-learn, which you will need to install. The exercises use NumPy and SciPy (for the KDTree in \(k\)-NN regression). If you are using pip to manage your Python installation, you can get the required libraries with the following command in a terminal:
$ pip3 install --user --only-binary=:all: numpy scipy scikit-learn
Quiz
Complete Questions 1-5 in the quiz on Moodle using quiz_regression.py.
Theoretical Background
This section provides a brief review of key theoretical concepts that underlie the regression methods you will implement. Figures are included as placeholders to help you visualize the ideas.
Supervised Regression
Supervised regression involves learning a function \(f : X \to Y\) from a dataset of observations \(\{(x_i, y_i)\}_{i=1}^n\) where:
- \(x_i \in \mathbb{R}^d\) (features)
- \(y_i \in \mathbb{R}\) (continuous response)
The goal is to minimize the expected prediction error on new data, typically measured by a loss function \(L(y, f(x))\). In practice, we minimize the empirical risk over our training set.
Linear Models and the OLS Estimator
A linear model assumes that the response \(Y\) depends linearly on the features \(X\) plus some noise:
\[ Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_d x_d + \varepsilon \]
This can be written in matrix form as:
\[ y = X\beta + \varepsilon \]
where \(X\) is the design matrix (with a column of ones if an intercept is included).
The OLS estimator minimizes the Residual Sum of Squares (RSS):
\[ \hat{\beta} = \arg\min_{\beta} \| y - X\beta \|_2^2 \]
Under the assumption that \(X\) has full column rank, the unique solution is given by:
\[ \hat{\beta} = \left(X^T X\right)^{-1}X^T y \]

Figure 2: The orthogonal projection of \(y\) onto the column space of \(X\) minimizes the squared error.
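To make the closed form concrete, here is a minimal standalone sketch (synthetic data and illustrative variable names, not part of the TD code) comparing the normal-equation solution with np.linalg.lstsq:
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # design matrix with an intercept column
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)               # noisy linear response

beta_closed = np.linalg.inv(X.T @ X) @ X.T @ y             # closed form (X^T X)^{-1} X^T y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)         # numerically safer least-squares solve
print(beta_closed)
print(beta_lstsq)                                          # both are close to beta_true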
Optimality and Practical Evaluation
Once the OLS estimator is obtained, we evaluate the fit using:
- Total Sum of Squares (TSS): \[\mathrm{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2\]
- Explained Sum of Squares (ESS): \[\mathrm{ESS} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2\]
- Residual Sum of Squares (RSS): \[\mathrm{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2\]
These quantities satisfy the relationship:
\[ \mathrm{TSS} = \mathrm{ESS} + \mathrm{RSS} \]
and the coefficient of determination \(R^2\) is given by:
\[ R^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} \]
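As a quick numerical check of these formulas, the following standalone sketch (synthetic data; the numbers are illustrative only) verifies the decomposition and computes \(R^2\):
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 2))
y = 3.0 + x @ np.array([1.5, -2.0]) + 0.2 * rng.normal(size=200)

X = np.hstack([np.ones((200, 1)), x])             # design matrix with an intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

tss = float(np.sum((y - y.mean()) ** 2))          # total variation
ess = float(np.sum((y_hat - y.mean()) ** 2))      # variation explained by the model
rss = float(np.sum((y - y_hat) ** 2))             # residual (unexplained) variation
print(round(tss, 6), round(ess + rss, 6))         # equal up to rounding when an intercept is fitted
print("R^2 =", 1.0 - rss / tss)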
Degenerate Settings and Regularization
In cases where the design matrix \(X\) does not have full column rank (e.g., due to collinearity or when \(n < d\)), the OLS problem has multiple solutions. Regularization techniques can be applied to obtain a unique and stable solution:
- Ridge Regression: Adds an \(\ell_2\) penalty: \[ \hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \left\{ \| y - X\beta \|_2^2 + \lambda \| \beta \|_2^2 \right\} \] with solution: \[ \hat{\beta}_{\text{ridge}} = \left(X^T X + \lambda I\right)^{-1}X^T y \]
- Lasso Regression: Uses an \(\ell_1\) penalty to promote sparsity.

Figure 4: Regularization can shrink coefficient estimates and stabilize the solution.
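The ridge closed form can be applied directly; the sketch below is a minimal illustration on an artificial, nearly collinear design (the data and the value of \(\lambda\) are arbitrary choices for the example):
import numpy as np

rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)               # almost perfectly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + 0.1 * rng.normal(size=n)

lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print(beta_ridge)                                 # shrunk, stable coefficients despite collinearity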
Non-parametric Regression using Kernels
Non-parametric regression methods, such as kernel regression, do not assume a fixed parametric form for \(f(x)\). Instead, they express the predictor as a weighted combination of kernel functions centered at the training data:
\[ \hat{f}(x) = \sum_{j=1}^n \alpha_j k(x, x_j) \]
where \(k(\cdot, \cdot)\) is a kernel function (e.g., the Gaussian kernel).
The Representer Theorem ensures that the solution to the regularized empirical risk minimization problem in a Reproducing Kernel Hilbert Space (RKHS) has this form.
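As an illustration of this form, here is a standalone sketch of kernel ridge regression with a Gaussian kernel (the kernel width, regularization strength, and data are arbitrary choices; this is not required by the TD):
import numpy as np

def gaussian_kernel(a, b, sigma=0.5):
    # Pairwise Gaussian kernel matrix between the rows of a and b.
    d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-d2 / (2.0 * sigma**2))

rng = np.random.default_rng(3)
x_train = rng.uniform(-3, 3, size=(80, 1))
y_train = np.sin(x_train[:, 0]) + 0.1 * rng.normal(size=80)

lam = 1e-2
K = gaussian_kernel(x_train, x_train)
alpha = np.linalg.solve(K + lam * np.eye(80), y_train)   # dual coefficients alpha_j

x_new = np.array([[0.5]])
y_pred = gaussian_kernel(x_new, x_train) @ alpha         # sum_j alpha_j k(x_new, x_j)
print(y_pred)                                            # close to sin(0.5)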
Parametric Non-linear Regression using Basis Functions
Another approach to capture non-linearity is to transform the inputs using basis functions (e.g., polynomials) and then apply linear regression in the transformed space:
\[ f(x) = \beta_0 + \beta_1 \phi_1(x) + \cdots + \beta_p \phi_p(x) \]
For example, using a polynomial basis:
\[ \phi_1(x) = x,\quad \phi_2(x) = x^2,\quad \ldots,\quad \phi_p(x) = x^p \]

Figure 6: Transforming the input data via basis functions can capture non-linear relationships while still using linear regression techniques.
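For example, a short standalone sketch (synthetic data, illustrative degree \(p = 3\)) that builds the polynomial features with np.vander and then solves an ordinary least-squares problem:
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, size=100)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.2 * rng.normal(size=100)

p = 3
Phi = np.vander(x, N=p + 1, increasing=True)      # columns: 1, x, x^2, x^3
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(beta)                                       # approximately [1.0, -2.0, 0.0, 0.5]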
TD Exercises
(A) Exercises on Linear Regression with OLS
Exercise 1A: Constructing the Design Matrix
Theoretical Background:
In a linear regression model we assume that
\[ y = X\beta + \varepsilon, \]
where:
- \(X\) is the design matrix containing the features of all samples.
- \(\beta\) is the vector of coefficients (including an intercept if needed).
- \(\varepsilon\) is the error term.
When including an intercept, the design matrix \(X\) should have a first column of ones. The remaining columns are filled with the feature values from the dataset (excluding the target column).
Task:
Implement the member function construct_matrix() in the LinearRegression class. Your implementation should:
- Verify that the dataset has more than one dimension.
- Build an \(n \times (d-1)\) matrix (where \(n\) is the number of samples and \(d\) is the total number of columns) with the features (excluding the target).
- If fit_intercept is set to True, add a column of ones as the first column of \(X\).
Provided Code:
import numpy as np
from TD.regression import Regression
from TD.dataset import Dataset

class LinearRegression(Regression):
    """
    Linear Regression class using the least squares method.
    """

    def __init__(self, dataset: Dataset, col_regr: int, fit_intercept: bool = True) -> None:
        """
        Initialize the LinearRegression instance.

        Parameters:
            dataset (Dataset): The dataset containing the samples.
            col_regr (int): The index of the target (regression) column.
            fit_intercept (bool): Whether to include an intercept term.
        """
        super().__init__(dataset, col_regr)
        self.m_beta = None                  # Regression coefficients
        self.fit_intercept = fit_intercept  # Flag for intercept term
        # Compute the coefficients using the implemented methods.
        self.set_coefficients()

    def __del__(self):
        if self.m_beta is not None:
            self.m_beta = None

    def construct_matrix(self) -> np.ndarray:
        """
        Construct the design matrix X for the regression model.

        Returns:
            np.ndarray: A 2D array of shape (n_samples, n_features) containing
            the feature values (excluding the target). If fit_intercept is True,
            the returned matrix will have a first column of ones.
        """
        # Ensure the dataset has more than one dimension
Exercise 1A:
- Read the theoretical background on the design matrix.
- Review the provided construct_matrix() method.
- Implement (or verify) your solution according to the guidelines and note the expected return type (np.ndarray with a column of ones if intercept is included).
Test your implementation by running
$ python ./grader.py 1
When done, upload your file linear_regression.py.
Exercise 1B: Constructing the Target Vector
Theoretical Background:
The target vector \(y\) holds the response values for each sample in your dataset. In the regression model
\[ y = X\beta + \varepsilon, \]
\(y\) is obtained by selecting the target column from the dataset. This vector is used together with the design matrix \(X\) to compute the regression coefficients using the OLS estimator.
Task:
Implement the member function construct_y() in the LinearRegression class. Your implementation should:
- Verify that the dataset has more than one dimension.
- Create a vector \(y\) of length \(n\) (number of samples) by extracting the target values from the dataset.
Provided Code:
    def construct_y(self) -> np.ndarray:
        """
        Construct the target vector y from the dataset.

        Returns:
            np.ndarray: A 1D array of length n_samples containing the target values.
        """
        # Ensure the dataset has more than one dimension
Exercise 1B:
- Read the theoretical background on the target vector.
- Review the provided construct_y() method.
- Implement (or verify) your solution according to the guidelines and note that it should return a 1D np.ndarray of target values.
After completing both parts, your Linear Regression model will correctly generate the design matrix \(X\) and the target vector \(y\) from the dataset—key steps in computing the OLS solution.
Test your implementation by running
$ python ./grader.py 2
When done, upload your file linear_regression.py.
Exercise 2: Setup the Linear Regression
Theoretical Background:
The Ordinary Least Squares (OLS) estimator computes the regression coefficients by minimizing the residual sum of squares between the observed responses and the responses predicted by the linear model. Mathematically, the OLS solution is given by
\[ \hat{\beta} = \operatorname*{arg\,min}_{\beta} \| y - X\beta \|_2^2, \]
which, under the assumption that \(X\) has full column rank, has the closed-form solution
\[ \hat{\beta} = (X^T X)^{-1}X^T y. \]
In practice, a more numerically stable approach is to use the np.linalg.lstsq function in Python to compute \(\hat{\beta}\).
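Note that np.linalg.lstsq returns a 4-tuple (solution, residuals, rank, singular values), and only the first element is the coefficient vector; a tiny toy example (not the TD skeleton):
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # toy design matrix with an intercept column
y = np.array([1.0, 3.0, 5.0])

beta, residuals, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)
print(beta)                                          # approximately [1., 2.]: intercept 1, slope 2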
Task:
Using the functions from Exercise 1A (construct_matrix()) and Exercise 1B (construct_y()), implement the member function set_coefficients() in the LinearRegression class to compute the regression coefficients \(\hat{\beta}\) using the least squares solution.
Provided Code Skeleton:
    def set_coefficients(self) -> None:
        """
        Compute and set the regression coefficients using the least squares solution.

        This function should:
        - Verify that the dataset has more than one dimension.
        - Construct the design matrix X using construct_matrix().
        - Construct the target vector y using construct_y().
        - Compute the least squares solution for the coefficients using np.linalg.lstsq.
        - Set the computed coefficients to self.m_beta.

        Returns:
            None
        """
Review the Code from Exercise 1: Make sure you have correctly implemented construct_matrix() and construct_y(). These functions build the \(X\) and \(y\) needed for computing the coefficients.
Implement set_coefficients():
- Use np.linalg.lstsq to compute the regression coefficients.
- Store the result in self.m_beta.
Test your implementation by running
$ python ./grader.py 3
When done, upload your file linear_regression.py.
Exercise 3: Implement the estimate(x) Function
Theoretical Background:
Once the regression coefficients \(\hat{\beta}\) have been computed, you can make predictions on new input data. For a given feature vector \(x\) (which should exclude the target value), the prediction is computed as follows:
- If an intercept is included, the prediction is given by:
\[ \hat{y} = \beta_0 + x \cdot \beta_{1:} \]
where \(\beta_0\) is the intercept term and \(\beta_{1:}\) are the remaining coefficients.
- If no intercept is included, the prediction is simply:
\[ \hat{y} = x \cdot \beta. \]
The function must also verify that the input vector \(x\) has the correct number of features (i.e., the total number of features in the dataset minus one for the target).
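The arithmetic of the two cases is shown in the following standalone sketch (hypothetical coefficient and feature values, not the TD class):
import numpy as np

beta_with_intercept = np.array([2.0, 0.5, -1.0])     # [beta_0, beta_1, beta_2]
beta_no_intercept = np.array([0.5, -1.0])
x = np.array([4.0, 3.0])                             # feature vector, target column excluded

y_hat_intercept = beta_with_intercept[0] + x @ beta_with_intercept[1:]   # beta_0 + x . beta_{1:}
y_hat_plain = x @ beta_no_intercept                                      # x . beta
print(float(y_hat_intercept), float(y_hat_plain))    # 1.0 and -1.0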
Task:
Implement the member function estimate(x) in the LinearRegression class to compute and return the predicted target value \(\hat{y}\) for a given input feature vector \(x\).
Provided Code Skeleton:
    def estimate(self, x: np.ndarray) -> float:
        """
        Estimate the target value for a given input feature vector.

        Parameters:
            x (np.ndarray): Feature vector (excluding the target column).

        Returns:
            float: The predicted target value.
        """
Review the Theoretical Background: Understand how predictions are computed using the regression coefficients. Note the difference in the computation when an intercept is included versus when it is not.
Implement the estimate(x) Method:
- Verify that the input vector \(x\) has the correct dimension (i.e., x.size should equal dataset.get_dim() - 1).
- Check that the regression coefficients (\(\beta\)) have been computed (i.e., self.m_beta is not None).
- Check whether fit_intercept is True or False.
- Return the prediction as a float.
Test your implementation by running
$ python ./grader.py 4
When done, upload your file linear_regression.py.
Exercise 4: Linear Regression Errors in Datasets
Theoretical Background:
In regression analysis, the performance of a model is often evaluated using error metrics computed from the dataset. Three key metrics are:
Total Sum of Squares (TSS): Measures the total variation in the observed target values.
\[ \text{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \]
Explained Sum of Squares (ESS): Measures the variation explained by the regression model.
\[ \text{ESS} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \]
Residual Sum of Squares (RSS): Measures the variation that is not explained by the model (i.e., the errors). \[ \text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
These quantities satisfy the relationship:
\[ \text{TSS} = \text{ESS} + \text{RSS}. \]
Note that this decomposition \(\text{TSS} = \text{ESS} + \text{RSS}\) only holds under certain conditions:
- When the model includes an intercept term (so \(\bar{\hat{y}} = \bar{y}\))
- When evaluating on the training set.
Computing these metrics allows you to assess how well your linear regression model fits the data.
Task:
Implement a function to compute TSS, ESS, and RSS for a given dataset. Your implementation should:
- Verify that the dataset has the same dimensions as the training dataset.
- Construct the design matrix \(X\) for the provided dataset (using the same approach as in construct_matrix()).
- Construct the target vector \(y\) by extracting the target column.
- Compute the predicted values \(\hat{y}\) using the computed coefficients.
- Compute TSS, ESS, and RSS using the definitions above.
- Return a tuple (tss, ess, rss) with type (float, float, float).
Provided Code Skeleton:
    def sum_of_squares(self, dataset) -> tuple:
        """
        Compute the Total Sum of Squares (TSS), Explained Sum of Squares (ESS),
        and Residual Sum of Squares (RSS) for a given dataset.

        Parameters:
            dataset (Dataset): The dataset on which to evaluate the regression errors.

        Returns:
            tuple: A tuple (tss, ess, rss) where:
                - tss (float): Total Sum of Squares.
                - ess (float): Explained Sum of Squares.
                - rss (float): Residual Sum of Squares.
        """
        tss = 0.0
        ess = 0.0
        rss = 0.0
        # Return the computed error metrics.
        return tss, ess, rss
Review the Theoretical Background: Understand the definitions of TSS, ESS, and RSS, and how they relate to the performance of the regression model.
Implement the Function:
- Verify that the dataset passed to the function matches the dimensions of the training dataset.
- Reuse the logic from your construct_matrix() and construct_y() methods to build \(X\) and \(y\).
- Compute the predicted values \(\hat{y}\) using your previously computed coefficients.
- Compute TSS, ESS, and RSS using the formulas provided.
- Return a tuple (tss, ess, rss) with appropriate types.
Testing: Once implemented, test your function to ensure that it correctly computes the error metrics for your regression model.
Test your implementation by running
$ python ./grader.py 5
When done, upload your file linear_regression.py.
Exercise 5: \(k\)-NN Regression Constructor
Theoretical Background:
In \(k\)-Nearest Neighbors (k-NN) regression, the predicted value for a new input is computed based on the average (or weighted average) of the target values of its \(k\) nearest neighbors in the training data. To efficiently locate these neighbors in multi-dimensional space, a KDTree is used. A KDTree is a data structure that allows for fast nearest-neighbor queries, which is especially beneficial when working with high-dimensional data.
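A minimal standalone sketch of scipy.spatial.KDTree usage on synthetic points (the sizes are arbitrary), independent of the TD classes:
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(5)
points = rng.uniform(size=(1000, 4))          # 1000 training points in R^4
tree = KDTree(points)                         # build once, query many times

query = rng.uniform(size=4)
distances, indices = tree.query(query, k=5)   # the 5 nearest neighbors of `query`
print(indices)                                # row indices into `points`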
Task:
Implement the constructor of the KnnRegression class. Your implementation should:
- Fill in the rest of the constructor to store \(k\) (the number of neighbors to use).
Provided Code Skeleton:
import numpy as np
from scipy.spatial import KDTree
from TD.regression import Regression

class KnnRegression(Regression):
    def __init__(self, k: int, dataset, col_regr: int) -> None:
        """
        Constructor for the KnnRegression class.

        Parameters:
            k (int): Number of nearest neighbors to use for regression.
            dataset (Dataset): The dataset containing the training data.
            col_regr (int): The column index of the target variable in the dataset.

        Returns:
            None

        Task:
            Build a KDTree from the dataset's features (excluding the target column).
        """
        # Initialize the parent Regression class
        super().__init__(dataset, col_regr)

    def _build_kd_tree(self) -> KDTree:
        """
        Build a KDTree from the dataset's features (excluding the target column).

        Returns:
            KDTree: A KDTree built from the feature vectors.
        """
        n_samples: int = self.m_dataset.get_nbr_samples()
        n_features: int = self.m_dataset.get_dim() - 1  # Exclude target column

        # Initialize an array to store feature vectors.
        data_pts = np.zeros((n_samples, n_features))

        # Populate data_pts by iterating through all samples.
        for i in range(n_samples):
            j2 = 0  # Tracks original column indices while skipping target column
            for j in range(n_features):
                if j2 == self.m_col_regr:
                    j2 += 1  # Skip the target column in the original dataset
                data_pts[i, j] = self.m_dataset.get_instance(i)[j2]
                j2 += 1

        return KDTree(data_pts)
Implement the Constructor:
- Fill in the missing elements.
Test your implementation by running
$ python ./grader.py 6
When done, upload your file knn_regression.py.
Exercise 6: \(k\)-Nearest Neighbors and Regression
Theoretical Background:
In \(k\)-Nearest Neighbors (k-NN) regression, the prediction for a new input is computed based on the target values of the \(k\) nearest neighbors in the training data. The idea is to find the \(k\) samples in the training set that are closest to the given feature vector \(x\) (using a KDTree for efficient searching), and then compute the mean of their target values. This average is used as the predicted value \(\hat{y}\).
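The prediction rule itself is short; here is a standalone sketch on synthetic arrays (hypothetical data, not the TD's Dataset class):
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(6)
features = rng.uniform(-3, 3, size=(500, 2))
targets = np.sin(features[:, 0]) + features[:, 1] + 0.1 * rng.normal(size=500)

tree = KDTree(features)
k = 5
x = np.array([0.5, -1.0])
_, idx = tree.query(x, k=k)                # indices of the k nearest training points
y_hat = float(np.mean(targets[idx]))       # k-NN regression estimate: mean of neighbor targets
print(y_hat)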
Task:
Implement the estimate(x) method in the KnnRegression class. Your function should:
- Verify that the input feature vector \(x\) has the correct number of features (i.e., equal to dataset.get_dim() - 1).
- Use the KDTree to query and find the \(k\) nearest neighbors of \(x\).
- Retrieve the target values corresponding to these neighbors from the dataset.
- Compute and return the mean of these target values as the estimated value.
Provided Code Skeleton:
    def estimate(self, x: np.ndarray) -> float:
        """
        Estimate the target value for a given input feature vector using k-NN regression.

        Parameters:
            x (np.ndarray): Feature vector (excluding the target column).
                Expected shape: (n_features,).

        Returns:
            float: The estimated target value, computed as the mean of the target values
            of the k nearest neighbors.
        """
Review the Theoretical Background: Understand how k-NN regression uses the average of the \(k\) nearest neighbors’ target values to make a prediction.
Implement the estimate(x) Method:
- Verify that the dimensionality of \(x\) matches the expected number of features.
- Query the KDTree using \(x\) to obtain the \(k\) nearest neighbor indices.
- Retrieve the corresponding target values from the dataset.
- Compute the average (mean) of these target values.
- Return the computed mean as a float.
Testing: Once implemented, test your function with various feature vectors to ensure it correctly computes the predictions.
Test your implementation by running
$ python ./grader.py 7
When done, upload your file knn_regression.py.