CSC_43042_EP - TD6: \(k\) Nearest Neighbors (\(k\)-NN) Regression vs. Linear Regression
This document details the TD exercises while also reviewing key theoretical concepts from the lecture “Linear Models for Regression”. Each section now includes additional explanations along with figure placeholders to help visualize the concepts.
Before you begin
Download and unzip the archive CSC43042EP-td6-1-handin.zip.
Overview and Instructions
TD Objectives: You will implement two regression methods—linear regression (using the Ordinary Least Squares (OLS) estimator) and \(k\)-Nearest Neighbors regression—and compare their performance on real datasets.
Datasets: The CSV files (e.g., train_boston_housing.csv, winequality-red.csv, etc.) are located in the csv/ directory. These datasets include:
- Boston Housing data
- Wine quality data (red and white)
- Salary and Maison datasets
Submission: Complete Questions 1-5 in the quiz on Moodle using quiz/quiz_regression.py, and exercises 1 to 6. Upload your updated files as you complete each exercise.
Environment: The quiz requires scikit-learn, which you will need to install. The exercises use NumPy and SciPy (for the KDTree in \(k\)-NN regression). If you are using pip to manage your Python installation, you can get the required libraries with the following command in a terminal:
$ pip3 install --user --only-binary=:all: numpy scipy scikit-learn
Quiz
Complete Questions 1-5 in the quiz on Moodle using quiz_regression.py.
Theoretical Background
This section provides a brief review of key theoretical concepts that underlie the regression methods you will implement. Figures are included as placeholders to help you visualize the ideas.
Supervised Regression
Supervised regression involves learning a function \(f : X \to Y\) from a dataset of observations \(\{(x_i, y_i)\}_{i=1}^n\) where:
- \(x_i \in \mathbb{R}^d\) (features)
- \(y_i \in \mathbb{R}\) (continuous response)
The goal is to minimize the expected prediction error on new data, typically measured by a loss function \(L(y, f(x))\). In practice, we minimize the empirical risk over our training set.
Linear Models and the OLS Estimator
A linear model assumes that the response \(Y\) depends linearly on the features \(X\) plus some noise:
\[ Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_d x_d + \varepsilon \]
This can be written in matrix form as:
\[ y = X\beta + \varepsilon \]
where \(X\) is the design matrix (with a column of ones if an intercept is included).
The OLS estimator minimizes the Residual Sum of Squares (RSS):
\[ \hat{\beta} = \arg\min_{\beta} \| y - X\beta \|_2^2 \]
Under the assumption that \(X\) has full column rank, the unique solution is given by:
\[ \hat{\beta} = \left(X^T X\right)^{-1}X^T y \]

Figure 2: The orthogonal projection of \(y\) onto the column space of \(X\) minimizes the squared error.
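To make the closed form concrete, here is a minimal standalone sketch (synthetic data and illustrative variable names, not part of the TD code) comparing the normal-equation solution with np.linalg.lstsq:
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # design matrix with an intercept column
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)               # noisy linear response

beta_closed = np.linalg.inv(X.T @ X) @ X.T @ y             # closed form (X^T X)^{-1} X^T y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)         # numerically safer least-squares solve
print(beta_closed)
print(beta_lstsq)                                          # both are close to beta_true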
Optimality and Practical Evaluation
Once the OLS estimator is obtained, we evaluate the fit using:
- Total Sum of Squares (TSS): \[\mathrm{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2\]
- Explained Sum of Squares (ESS): \[\mathrm{ESS} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2\]
- Residual Sum of Squares (RSS): \[\mathrm{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2\]
These quantities satisfy the relationship:
\[ \mathrm{TSS} = \mathrm{ESS} + \mathrm{RSS} \]
and the coefficient of determination \(R^2\) is given by:
\[ R^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} \]
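As a quick numerical check of these formulas, the following standalone sketch (synthetic data; the numbers are illustrative only) verifies the decomposition and computes \(R^2\):
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 2))
y = 3.0 + x @ np.array([1.5, -2.0]) + 0.2 * rng.normal(size=200)

X = np.hstack([np.ones((200, 1)), x])             # design matrix with an intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

tss = float(np.sum((y - y.mean()) ** 2))          # total variation
ess = float(np.sum((y_hat - y.mean()) ** 2))      # variation explained by the model
rss = float(np.sum((y - y_hat) ** 2))             # residual (unexplained) variation
print(round(tss, 6), round(ess + rss, 6))         # equal up to rounding when an intercept is fitted
print("R^2 =", 1.0 - rss / tss)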
Degenerate Settings and Regularization
In cases where the design matrix \(X\) does not have full column rank (e.g., due to collinearity or when \(n < d\)), the OLS problem has multiple solutions. Regularization techniques can be applied to obtain a unique and stable solution:
- Ridge Regression: Adds an \(\ell_2\) penalty: \[ \hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \left\{ \| y - X\beta \|_2^2 + \lambda \| \beta \|_2^2 \right\} \] with solution: \[ \hat{\beta}_{\text{ridge}} = \left(X^T X + \lambda I\right)^{-1}X^T y \]
- Lasso Regression: Uses an \(\ell_1\) penalty to promote sparsity.

Figure 4: Regularization can shrink coefficient estimates and stabilize the solution.
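The ridge closed form can be applied directly; the sketch below is a minimal illustration on an artificial, nearly collinear design (the data and the value of \(\lambda\) are arbitrary choices for the example):
import numpy as np

rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)               # almost perfectly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + 0.1 * rng.normal(size=n)

lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print(beta_ridge)                                 # shrunk, stable coefficients despite collinearity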
Non-parametric Regression using Kernels
Non-parametric regression methods, such as kernel regression, do not assume a fixed parametric form for \(f(x)\). Instead, they express the predictor as a weighted combination of kernel functions centered at the training data:
\[ \hat{f}(x) = \sum_{j=1}^n \alpha_j k(x, x_j) \]
where \(k(\cdot, \cdot)\) is a kernel function (e.g., the Gaussian kernel).
The Representer Theorem ensures that the solution to the regularized empirical risk minimization problem in a Reproducing Kernel Hilbert Space (RKHS) has this form.
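As an illustration of this form, here is a standalone sketch of kernel ridge regression with a Gaussian kernel (the kernel width, regularization strength, and data are arbitrary choices; this is not required by the TD):
import numpy as np

def gaussian_kernel(a, b, sigma=0.5):
    # Pairwise Gaussian kernel matrix between the rows of a and b.
    d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-d2 / (2.0 * sigma**2))

rng = np.random.default_rng(3)
x_train = rng.uniform(-3, 3, size=(80, 1))
y_train = np.sin(x_train[:, 0]) + 0.1 * rng.normal(size=80)

lam = 1e-2
K = gaussian_kernel(x_train, x_train)
alpha = np.linalg.solve(K + lam * np.eye(80), y_train)   # dual coefficients alpha_j

x_new = np.array([[0.5]])
y_pred = gaussian_kernel(x_new, x_train) @ alpha         # sum_j alpha_j k(x_new, x_j)
print(y_pred)                                            # close to sin(0.5)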
Parametric Non-linear Regression using Basis Functions
Another approach to capture non-linearity is to transform the inputs using basis functions (e.g., polynomials) and then apply linear regression in the transformed space:
\[ f(x) = \beta_0 + \beta_1 \phi_1(x) + \cdots + \beta_p \phi_p(x) \]
For example, using a polynomial basis:
\[ \phi_1(x) = x,\quad \phi_2(x) = x^2,\quad \ldots,\quad \phi_p(x) = x^p \]

Figure 6: Transforming the input data via basis functions can capture non-linear relationships while still using linear regression techniques.
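For example, a short standalone sketch (synthetic data, illustrative degree \(p = 3\)) that builds the polynomial features with np.vander and then solves an ordinary least-squares problem:
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, size=100)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.2 * rng.normal(size=100)

p = 3
Phi = np.vander(x, N=p + 1, increasing=True)      # columns: 1, x, x^2, x^3
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(beta)                                       # approximately [1.0, -2.0, 0.0, 0.5]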
TD Exercises
(A) Exercises on Linear Regression with OLS
Exercise 1A: Constructing the Design Matrix
Theoretical Background:
In a linear regression model we assume that
\[ y = X\beta + \varepsilon, \]
where:
- \(X\) is the design matrix containing the features of all samples.
- \(\beta\) is the vector of coefficients (including an intercept if needed).
- \(\varepsilon\) is the error term.
When including an intercept, the design matrix \(X\) should have a first column of ones. The remaining columns are filled with the feature values from the dataset (excluding the target column).
Task:
Implement the member function construct_matrix() in the LinearRegression class. Your implementation should:
- Verify that the dataset has more than one dimension.
- Build an \(n \times (d-1)\) matrix (where \(n\) is the number of samples and \(d\) is the total number of columns) with the features (excluding the target).
- If fit_intercept is set to True, add a column of ones as the first column of \(X\).
Provided Code:
import numpy as np
from TD.regression import Regression
from TD.dataset import Dataset

class LinearRegression(Regression):
    """
    Linear Regression class using the least squares method.
    """

    def __init__(self, dataset: Dataset, col_regr: int, fit_intercept: bool = True) -> None:
        """
        Initialize the LinearRegression instance.

        Parameters:
            dataset (Dataset): The dataset containing the samples.
            col_regr (int): The index of the target (regression) column.
            fit_intercept (bool): Whether to include an intercept term.
        """
        super().__init__(dataset, col_regr)
        self.m_beta = None                  # Regression coefficients
        self.fit_intercept = fit_intercept  # Flag for intercept term
        # Compute the coefficients using the implemented methods.
        self.set_coefficients()

    def __del__(self):
        if self.m_beta is not None:
            self.m_beta = None

    def construct_matrix(self) -> np.ndarray:
        """
        Construct the design matrix X for the regression model.

        Returns:
            np.ndarray: A 2D array of shape (n_samples, n_features) containing
            the feature values (excluding the target). If fit_intercept is True,
            the returned matrix will have a first column of ones.
        """
        # Ensure the dataset has more than one dimension
Exercise 1A:
- Read the theoretical background on the design matrix.
- Review the provided construct_matrix() method.
- Implement (or verify) your solution according to the guidelines and note the expected return type (np.ndarray with a column of ones if intercept is included).
Test your implementation by running
$ python ./grader.py 1
When done, upload your file linear_regression.py.
Exercise 1B: Constructing the Target Vector
Theoretical Background:
The target vector \(y\) holds the response values for each sample in your dataset. In the regression model
\[ y = X\beta + \varepsilon, \]
\(y\) is obtained by selecting the target column from the dataset. This vector is used together with the design matrix \(X\) to compute the regression coefficients using the OLS estimator.
Task:
Implement the member function construct_y() in the LinearRegression class. Your implementation should:
- Verify that the dataset has more than one dimension.
- Create a vector \(y\) of length \(n\) (number of samples) by extracting the target values from the dataset.
Provided Code:
    def construct_y(self) -> np.ndarray:
        """
        Construct the target vector y from the dataset.

        Returns:
            np.ndarray: A 1D array of length n_samples containing the target values.
        """
        # Ensure the dataset has more than one dimension
Exercise 1B:
- Read the theoretical background on the target vector.
- Review the provided construct_y() method.
- Implement (or verify) your solution according to the guidelines and note that it should return a 1D np.ndarray of target values.
After completing both parts, your Linear Regression model will correctly generate the design matrix \(X\) and the target vector \(y\) from the dataset—key steps in computing the OLS solution.
Test your implementation by running
$ python ./grader.py 2
When done, upload your file linear_regression.py.
Exercise 2: Setup the Linear Regression
Theoretical Background:
The Ordinary Least Squares (OLS) estimator computes the regression coefficients by minimizing the residual sum of squares between the observed responses and the responses predicted by the linear model. Mathematically, the OLS solution is given by
\[ \hat{\beta} = \operatorname*{arg\,min}_{\beta} \| y - X\beta \|_2^2, \]
which, under the assumption that \(X\) has full column rank, has the closed-form solution
\[ \hat{\beta} = (X^T X)^{-1}X^T y. \]
In practice, a more numerically stable approach is to use the np.linalg.lstsq function in Python to compute \(\hat{\beta}\).
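Note that np.linalg.lstsq returns a 4-tuple (solution, residuals, rank, singular values), and only the first element is the coefficient vector; a tiny toy example (not the TD skeleton):
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # toy design matrix with an intercept column
y = np.array([1.0, 3.0, 5.0])

beta, residuals, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)
print(beta)                                          # approximately [1., 2.]: intercept 1, slope 2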
Task:
Using the functions from Exercise 1A (construct_matrix()) and Exercise 1B (construct_y()), implement the member function set_coefficients() in the LinearRegression class to compute the regression coefficients \(\hat{\beta}\) using the least squares solution.
Provided Code Skeleton:
    def set_coefficients(self) -> None:
        """
        Compute and set the regression coefficients using the least squares solution.

        This function should:
        - Verify that the dataset has more than one dimension.
        - Construct the design matrix X using construct_matrix().
        - Construct the target vector y using construct_y().
        - Compute the least squares solution for the coefficients using np.linalg.lstsq.
        - Set the computed coefficients to self.m_beta.

        Returns:
            None
        """
Review the Code from Exercise 1: Make sure you have correctly implemented construct_matrix() and construct_y(). These functions build the \(X\) and \(y\) needed for computing the coefficients.
Implement set_coefficients():
- Use np.linalg.lstsq to compute the regression coefficients.
- Store the result in self.m_beta.
Test your implementation by running
$ python ./grader.py 3
When done, upload your file linear_regression.py.
Exercise 3: Implement the estimate(x) Function
Theoretical Background:
Once the regression coefficients \(\hat{\beta}\) have been computed, you can make predictions on new input data. For a given feature vector \(x\) (which should exclude the target value), the prediction is computed as follows:
- If an intercept is included, the prediction is given by:
\[ \hat{y} = \beta_0 + x \cdot \beta_{1:} \]
where \(\beta_0\) is the intercept term and \(\beta_{1:}\) are the remaining coefficients.
- If no intercept is included, the prediction is simply:
\[ \hat{y} = x \cdot \beta. \]
The function must also verify that the input vector \(x\) has the correct number of features (i.e., the total number of features in the dataset minus one for the target).
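The arithmetic of the two cases is shown in the following standalone sketch (hypothetical coefficient and feature values, not the TD class):
import numpy as np

beta_with_intercept = np.array([2.0, 0.5, -1.0])     # [beta_0, beta_1, beta_2]
beta_no_intercept = np.array([0.5, -1.0])
x = np.array([4.0, 3.0])                             # feature vector, target column excluded

y_hat_intercept = beta_with_intercept[0] + x @ beta_with_intercept[1:]   # beta_0 + x . beta_{1:}
y_hat_plain = x @ beta_no_intercept                                      # x . beta
print(float(y_hat_intercept), float(y_hat_plain))    # 1.0 and -1.0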
Task:
Implement the member function estimate(x) in the LinearRegression class to compute and return the predicted target value \(\hat{y}\) for a given input feature vector \(x\).
Provided Code Skeleton:
    def estimate(self, x: np.ndarray) -> float:
        """
        Estimate the target value for a given input feature vector.

        Parameters:
            x (np.ndarray): Feature vector (excluding the target column).

        Returns:
            float: The predicted target value.
        """
Review the Theoretical Background: Understand how predictions are computed using the regression coefficients. Note the difference in the computation when an intercept is included versus when it is not.
Implement the estimate(x) Method:
- Verify that the input vector \(x\) has the correct dimension (i.e., x.size should equal dataset.get_dim() - 1).
- Check that the regression coefficients (\(\beta\)) have been computed (i.e., self.m_beta is not None).
- Check whether fit_intercept is True or False.
- Return the prediction as a float.
Test your implementation by running
$ python ./grader.py 4
When done, upload your file linear_regression.py.
Exercise 4: Linear Regression Errors in Datasets
Theoretical Background:
In regression analysis, the performance of a model is often evaluated using error metrics computed from the dataset. Three key metrics are:
Total Sum of Squares (TSS): Measures the total variation in the observed target values.
\[ \text{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \]
Explained Sum of Squares (ESS): Measures the variation explained by the regression model.
\[ \text{ESS} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \]
Residual Sum of Squares (RSS): Measures the variation that is not explained by the model (i.e., the errors). \[ \text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
These quantities satisfy the relationship:
\[ \text{TSS} = \text{ESS} + \text{RSS}. \]
Note that this decomposition \(\text{TSS} = \text{ESS} + \text{RSS}\) only holds under certain conditions:
- When the model includes an intercept term (so \(\bar{\hat{y}} = \bar{y}\))
- When evaluating on the training set.
Computing these metrics allows you to assess how well your linear regression model fits the data.
Task:
Implement a function to compute TSS, ESS, and RSS for a given dataset. Your implementation should:
- Verify that the dataset has the same dimensions as the training dataset.
- Construct the design matrix \(X\) for the provided dataset (using the same approach as in construct_matrix()).
- Construct the target vector \(y\) by extracting the target column.
- Compute the predicted values \(\hat{y}\) using the computed coefficients.
- Compute TSS, ESS, and RSS using the definitions above.
- Return a tuple (tss, ess, rss) with type (float, float, float).
Provided Code Skeleton:
    def sum_of_squares(self, dataset) -> tuple:
        """
        Compute the Total Sum of Squares (TSS), Explained Sum of Squares (ESS),
        and Residual Sum of Squares (RSS) for a given dataset.

        Parameters:
            dataset (Dataset): The dataset on which to evaluate the regression errors.

        Returns:
            tuple: A tuple (tss, ess, rss) where:
                - tss (float): Total Sum of Squares.
                - ess (float): Explained Sum of Squares.
                - rss (float): Residual Sum of Squares.
        """
        tss = 0.0
        ess = 0.0
        rss = 0.0
        # Return the computed error metrics.
        return tss, ess, rss
Review the Theoretical Background: Understand the definitions of TSS, ESS, and RSS, and how they relate to the performance of the regression model.
Implement the Function:
- Verify that the dataset passed to the function matches the dimensions of the training dataset.
- Reuse the logic from your construct_matrix() and construct_y() methods to build \(X\) and \(y\).
- Compute the predicted values \(\hat{y}\) using your previously computed coefficients.
- Compute TSS, ESS, and RSS using the formulas provided.
- Return a tuple (tss, ess, rss) with appropriate types.
Testing: Once implemented, test your function to ensure that it correctly computes the error metrics for your regression model.
Test your implementation by running
$ python ./grader.py 5
When done, upload your file linear_regression.py.
Exercise 5: \(k\)-NN Regression Constructor
Theoretical Background:
In \(k\)-Nearest Neighbors (k-NN) regression, the predicted value for a new input is computed based on the average (or weighted average) of the target values of its \(k\) nearest neighbors in the training data. To efficiently locate these neighbors in multi-dimensional space, a KDTree is used. A KDTree is a data structure that allows for fast nearest-neighbor queries, which is especially beneficial when working with high-dimensional data.
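A minimal standalone sketch of scipy.spatial.KDTree usage on synthetic points (the sizes are arbitrary), independent of the TD classes:
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(5)
points = rng.uniform(size=(1000, 4))          # 1000 training points in R^4
tree = KDTree(points)                         # build once, query many times

query = rng.uniform(size=4)
distances, indices = tree.query(query, k=5)   # the 5 nearest neighbors of `query`
print(indices)                                # row indices into `points`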
Task:
Implement the constructor of the KnnRegression class. Your implementation should:
- Fill in the rest of the constructor to store \(k\) (the number of neighbors to use).
Provided Code Skeleton:
import numpy as np
from scipy.spatial import KDTree
from TD.regression import Regression

class KnnRegression(Regression):
    def __init__(self, k: int, dataset, col_regr: int) -> None:
        """
        Constructor for the KnnRegression class.

        Parameters:
            k (int): Number of nearest neighbors to use for regression.
            dataset (Dataset): The dataset containing the training data.
            col_regr (int): The column index of the target variable in the dataset.

        Returns:
            None

        Task:
            Build a KDTree from the dataset's features (excluding the target column).
        """
        # Initialize the parent Regression class
        super().__init__(dataset, col_regr)

    def _build_kd_tree(self) -> KDTree:
        """
        Build a KDTree from the dataset's features (excluding the target column).

        Returns:
            KDTree: A KDTree built from the feature vectors.
        """
        n_samples: int = self.m_dataset.get_nbr_samples()
        n_features: int = self.m_dataset.get_dim() - 1  # Exclude target column

        # Initialize an array to store feature vectors.
        data_pts = np.zeros((n_samples, n_features))

        # Populate data_pts by iterating through all samples.
        for i in range(n_samples):
            j2 = 0  # Tracks original column indices while skipping target column
            for j in range(n_features):
                if j2 == self.m_col_regr:
                    j2 += 1  # Skip the target column in the original dataset
                data_pts[i, j] = self.m_dataset.get_instance(i)[j2]
                j2 += 1

        return KDTree(data_pts)
Implement the Constructor:
- Fill in the missing elements.
Test your implementation by running
$ python ./grader.py 6
When done, upload your file knn_regression.py.
Exercise 6: \(k\)-Nearest Neighbors and Regression
Theoretical Background:
In \(k\)-Nearest Neighbors (k-NN) regression, the prediction for a new input is computed based on the target values of the \(k\) nearest neighbors in the training data. The idea is to find the \(k\) samples in the training set that are closest to the given feature vector \(x\) (using a KDTree for efficient searching), and then compute the mean of their target values. This average is used as the predicted value \(\hat{y}\).
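The prediction rule itself is short; here is a standalone sketch on synthetic arrays (hypothetical data, not the TD's Dataset class):
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(6)
features = rng.uniform(-3, 3, size=(500, 2))
targets = np.sin(features[:, 0]) + features[:, 1] + 0.1 * rng.normal(size=500)

tree = KDTree(features)
k = 5
x = np.array([0.5, -1.0])
_, idx = tree.query(x, k=k)                # indices of the k nearest training points
y_hat = float(np.mean(targets[idx]))       # k-NN regression estimate: mean of neighbor targets
print(y_hat)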
Task:
Implement the estimate(x) method in the KnnRegression class. Your function should:
- Verify that the input feature vector \(x\) has the correct number of features (i.e., equal to dataset.get_dim() - 1).
- Use the KDTree to query and find the \(k\) nearest neighbors of \(x\).
- Retrieve the target values corresponding to these neighbors from the dataset.
- Compute and return the mean of these target values as the estimated value.
Provided Code Skeleton:
    def estimate(self, x: np.ndarray) -> float:
        """
        Estimate the target value for a given input feature vector using k-NN regression.

        Parameters:
            x (np.ndarray): Feature vector (excluding the target column).
                Expected shape: (n_features,).

        Returns:
            float: The estimated target value, computed as the mean of the target values
            of the k nearest neighbors.
        """
Review the Theoretical Background: Understand how k-NN regression uses the average of the \(k\) nearest neighbors’ target values to make a prediction.
Implement the estimate(x) Method:
- Verify that the dimensionality of \(x\) matches the expected number of features.
- Query the KDTree using \(x\) to obtain the \(k\) nearest neighbor indices.
- Retrieve the corresponding target values from the dataset.
- Compute the average (mean) of these target values.
- Return the computed mean as a float.
Testing: Once implemented, test your function with various feature vectors to ensure it correctly computes the predictions.
Test your implementation by running
$ python ./grader.py 7
When done, upload your file knn_regression.py.