This assignment will be closed on April 29, 2025 (23:59:59).

CSC_43042_EP - TD6: \(k\) Nearest Neighbors (\(k\)-NN) Regression vs. Linear Regression

This document details the TD exercises while also reviewing key theoretical concepts from the lecture “Linear Models for Regression”. Each section includes additional explanations along with figure placeholders to help visualize the concepts.

Before you begin

Download and unzip the archive CSC43042EP-td6-1-handin.zip.

Overview and Instructions


Quiz

Complete Questions 1-5 in the quiz on Moodle using quiz_regression.py.

Theoretical Background

This section provides a brief review of key theoretical concepts that underlie the regression methods you will implement. Figures are included as placeholders to help you visualize the ideas.

Supervised Regression

Supervised regression involves learning a function \(f : X \to Y\) from a dataset of observations \(\{(x_i, y_i)\}_{i=1}^n\) where:

    • \(x_i \in X \subseteq \mathbb{R}^d\) is the feature vector describing the \(i\)-th sample;
    • \(y_i \in Y \subseteq \mathbb{R}\) is the corresponding real-valued target.

The goal is to minimize the expected prediction error on new data, typically measured by a loss function \(L(y, f(x))\). In practice, we minimize the empirical risk over our training set.
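
For example, with the squared loss \(L(y, f(x)) = (y - f(x))^2\), the empirical risk over the training set is

\[ \hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n \big(y_i - f(x_i)\big)^2 . \]

OLS linear regression minimizes exactly this quantity (up to the factor \(1/n\)); the regularized methods reviewed below minimize a penalized version of it.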

Linear Models and the OLS Estimator

A linear model assumes that the response \(Y\) depends linearly on the features \(X\) plus some noise:

\[ Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_d x_d + \varepsilon \]

This can be written in matrix form as:

\[ y = X\beta + \varepsilon \]

where \(X\) is the design matrix (with a column of ones if an intercept is included).

The OLS estimator minimizes the Residual Sum of Squares (RSS):

\[ \hat{\beta} = \arg\min_{\beta} \| y - X\beta \|_2^2 \]

Under the assumption that \(X\) has full column rank, the unique solution is given by:

\[ \hat{\beta} = \left(X^T X\right)^{-1}X^T y \]

Figure 2: The orthogonal projection of \(y\) onto the column space of \(X\) minimizes the squared error.
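
As a minimal sketch on synthetic data (not part of the TD code), the closed-form solution and np.linalg.lstsq agree when \(X\) has full column rank:

import numpy as np

rng = np.random.default_rng(0)
n = 50
X_feat = rng.normal(size=(n, 2))                     # two features
X = np.hstack([np.ones((n, 1)), X_feat])             # design matrix with intercept column
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)         # noisy linear response

beta_closed = np.linalg.inv(X.T @ X) @ X.T @ y       # (X^T X)^{-1} X^T y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # numerically more stable

print(np.allclose(beta_closed, beta_lstsq))          # True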

Optimality and Practical Evaluation

Once the OLS estimator is obtained, we evaluate the fit using:

    • the Total Sum of Squares, \(\mathrm{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2\);
    • the Explained Sum of Squares, \(\mathrm{ESS} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2\);
    • the Residual Sum of Squares, \(\mathrm{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2\).

These quantities satisfy the relationship:

\[ \mathrm{TSS} = \mathrm{ESS} + \mathrm{RSS} \]

and the coefficient of determination \(R^2\) is given by:

\[ R^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} \]
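
As a small worked example (made-up numbers): for inputs \(x = (1, 2, 3)\) and responses \(y = (1, 2, 2)\), the OLS fit with intercept gives \(\hat{\beta}_0 = 2/3\) and \(\hat{\beta}_1 = 1/2\), hence fitted values \(\hat{y} = (7/6,\ 5/3,\ 13/6)\) and \(\bar{y} = 5/3\). Then

\[ \mathrm{TSS} = \tfrac{2}{3}, \quad \mathrm{ESS} = \tfrac{1}{2}, \quad \mathrm{RSS} = \tfrac{1}{6}, \quad \mathrm{ESS} + \mathrm{RSS} = \mathrm{TSS}, \quad R^2 = \tfrac{\mathrm{ESS}}{\mathrm{TSS}} = \tfrac{3}{4}. \]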

Degenerate Settings and Regularization

In cases where the design matrix \(X\) does not have full column rank (e.g., due to collinearity or when \(n < d\)), the OLS problem has multiple solutions. Regularization techniques can be applied to obtain a unique and stable solution:

    • Ridge regression adds an \(\ell_2\) penalty and minimizes \(\| y - X\beta \|_2^2 + \lambda \|\beta\|_2^2\);
    • the Lasso adds an \(\ell_1\) penalty and minimizes \(\| y - X\beta \|_2^2 + \lambda \|\beta\|_1\).

Figure 4: Regularization can shrink coefficient estimates and stabilize the solution.
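
For instance, for any \(\lambda > 0\) the ridge objective is strictly convex and has the unique closed-form solution

\[ \hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \left( \| y - X\beta \|_2^2 + \lambda \|\beta\|_2^2 \right) = \left(X^T X + \lambda I\right)^{-1} X^T y, \]

since \(X^T X + \lambda I\) is always invertible.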

Non-parametric Regression using Kernels

Non-parametric regression methods, such as kernel regression, do not assume a fixed parametric form for \(f(x)\). Instead, they express the predictor as a weighted combination of kernel functions centered at the training data:

\[ \hat{f}(x) = \sum_{j=1}^n \alpha_j k(x, x_j) \]

where \(k(\cdot, \cdot)\) is a kernel function (e.g., the Gaussian kernel).

The Representer Theorem ensures that the solution to the regularized empirical risk minimization problem in a Reproducing Kernel Hilbert Space (RKHS) has this form.
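
A minimal sketch of this idea on toy data (kernel ridge regression with a Gaussian kernel; an illustration, not part of the TD code): the coefficients solve \((K + \lambda I)\alpha = y\), where \(K_{ij} = k(x_i, x_j)\).

import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(40, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=40)

lam = 1e-2
K = gaussian_kernel(X_train, X_train)
alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)   # (K + lambda I) alpha = y

x_new = np.array([[0.5]])
y_hat = gaussian_kernel(x_new, X_train) @ alpha   # f_hat(x) = sum_j alpha_j k(x, x_j)
print(y_hat[0])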

Parametric Non-linear Regression using Basis Functions

Another approach to capture non-linearity is to transform the inputs using basis functions (e.g., polynomials) and then apply linear regression in the transformed space:

\[ f(x) = \beta_0 + \beta_1 \phi_1(x) + \cdots + \beta_p \phi_p(x) \]

For example, using a polynomial basis:

\[ \phi_1(x) = x,\quad \phi_2(x) = x^2,\quad \ldots,\quad \phi_p(x) = x^p \]

Figure 6: Transforming the input data via basis functions can capture non-linear relationships while still using linear regression techniques.
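
A minimal sketch on toy data (not part of the TD code): expand a scalar input into the polynomial basis and solve an ordinary least-squares problem in the transformed space.

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=100)
y = 1.0 - 2.0 * x + 3.0 * x ** 2 + 0.1 * rng.normal(size=100)   # quadratic ground truth

p = 3   # degree of the polynomial basis
# Design matrix [1, x, x^2, ..., x^p] built from the basis functions phi_j(x) = x^j
Phi = np.vander(x, N=p + 1, increasing=True)

beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # linear regression in the transformed space
print(beta)   # approximately [1, -2, 3, 0]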


TD Exercises

(A) Exercises on Linear Regression with OLS
Exercise 1A: Constructing the Design Matrix

Theoretical Background:

In a linear regression model we assume that

\[ y = X\beta + \varepsilon, \]

where:

    • \(y \in \mathbb{R}^n\) is the vector of observed target values;
    • \(X\) is the design matrix whose rows contain the feature values of each sample;
    • \(\beta\) is the vector of regression coefficients;
    • \(\varepsilon\) is a vector of noise terms.

When including an intercept, the design matrix \(X\) should have a first column of ones. The remaining columns are filled with the feature values from the dataset (excluding the target column).
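
For illustration only (using a plain NumPy array rather than the Dataset class of the TD), dropping the target column and prepending a column of ones could look like this:

import numpy as np

data = np.array([[3.0, 10.0, 1.5],
                 [4.0, 12.0, 2.0],
                 [5.0, 15.0, 2.5]])   # toy samples; suppose column 1 is the target
col_regr = 1

features = np.delete(data, col_regr, axis=1)                  # drop the target column
X = np.hstack([np.ones((features.shape[0], 1)), features])    # prepend the intercept column
print(X)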

Task:

Implement the member function construct_matrix() in the LinearRegression class. Your implementation should:

    • verify that the dataset has more than one dimension (i.e., contains features besides the target);
    • collect, for every sample, all feature values except those of the target column self.m_col_regr;
    • return them as a 2D NumPy array of shape (n_samples, n_features);
    • if self.fit_intercept is True, prepend a column of ones to the returned matrix.

Provided Code:

import numpy as np
from TD.regression import Regression
from TD.dataset import Dataset

class LinearRegression(Regression):
    """
    Linear Regression class using the least squares method.
    """

    def __init__(self, dataset: Dataset, col_regr: int, fit_intercept: bool = True) -> None:
        """
        Initialize the LinearRegression instance.

        Parameters:
            dataset (Dataset): The dataset containing the samples.
            col_regr (int): The index of the target (regression) column.
            fit_intercept (bool): Whether to include an intercept term.
        """
        super().__init__(dataset, col_regr)
        self.m_beta = None  # Regression coefficients
        self.fit_intercept = fit_intercept   # Flag for intercept term
        # Compute the coefficients using the implemented methods.
        self.set_coefficients()

    def __del__(self):
        if self.m_beta is not None:
            self.m_beta = None

    def construct_matrix(self) -> np.ndarray:
        """
        Construct the design matrix X for the regression model.

        Returns:
            np.ndarray: A 2D array of shape (n_samples, n_features) containing
                        the feature values (excluding the target). If fit_intercept is True,
                        the returned matrix will have a first column of ones.
        """
        # Ensure the dataset has more than one dimension

Exercise 1A:

Test your implementation by running

$ python ./grader.py 1

When done, upload your file linear_regression.py.


Exercise 1B: Constructing the Target Vector

Theoretical Background:

The target vector \(y\) holds the response values for each sample in your dataset. In the regression model

\[ y = X\beta + \varepsilon, \]

\(y\) is obtained by selecting the target column from the dataset. This vector is used together with the design matrix \(X\) to compute the regression coefficients using the OLS estimator.
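
For illustration only (again on a plain NumPy array rather than the Dataset class), the target vector is simply the target column of the data:

import numpy as np

data = np.array([[3.0, 10.0, 1.5],
                 [4.0, 12.0, 2.0],
                 [5.0, 15.0, 2.5]])   # toy samples; suppose column 1 is the target
col_regr = 1

y = data[:, col_regr]   # 1D array of target values, one per sample
print(y)                # [10. 12. 15.]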

Task:

Implement the member function construct_y() in the LinearRegression class. Your implementation should:

    • verify that the dataset has more than one dimension;
    • extract the values of the target column (index self.m_col_regr) for every sample;
    • return them as a 1D NumPy array of length n_samples.

Provided Code:

    def construct_y(self) -> np.ndarray:
        """
        Construct the target vector y from the dataset.

        Returns:
            np.ndarray: A 1D array of length n_samples containing the target values.
        """
        # Ensure the dataset has more than one dimension

Exercise 1B:

After completing both parts, your Linear Regression model will correctly generate the design matrix \(X\) and the target vector \(y\) from the dataset—key steps in computing the OLS solution.

Test your implementation by running

$ python ./grader.py 2

When done, upload your file linear_regression.py.

Exercise 2: Setup the Linear Regression

Theoretical Background:

The Ordinary Least Squares (OLS) estimator computes the regression coefficients by minimizing the residual sum of squares between the observed responses and the responses predicted by the linear model. Mathematically, the OLS solution is given by

\[ \hat{\beta} = \operatorname*{arg\,min}_{\beta} \| y - X\beta \|_2^2, \]

which, under the assumption that \(X\) has full column rank, has the closed-form solution

\[ \hat{\beta} = (X^T X)^{-1}X^T y. \]

In practice, a more numerically stable approach is to use the np.linalg.lstsq function in Python to compute \(\hat{\beta}\).
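
For reference, np.linalg.lstsq returns a tuple (solution, residuals, rank, singular values); a generic usage example (not the TD solution):

import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # toy design matrix with intercept column
y = np.array([1.0, 2.0, 2.0])

beta, residuals, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)
print(beta)   # least-squares coefficients, here approximately [2/3, 1/2]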

Task:

Using the functions from Exercise 1A (construct_matrix()) and Exercise 1B (construct_y()), implement the member function set_coefficients() in the LinearRegression class to compute the regression coefficients \(\hat{\beta}\) using the least squares solution.

Provided Code Skeleton:

    def set_coefficients(self) -> None:
        """
        Compute and set the regression coefficients using the least squares solution.

        This function should:
          - Verify that the dataset has more than one dimension.
          - Construct the design matrix X using construct_matrix().
          - Construct the target vector y using construct_y().
          - Compute the least squares solution for the coefficients using np.linalg.lstsq.
          - Set the computed coefficients to self.m_beta.

        Returns:
            None
        """

Test your implementation by running

$ python ./grader.py 3

When done, upload your file linear_regression.py.

Exercise 3: Implement the estimate(x) Function

Theoretical Background:

Once the regression coefficients \(\hat{\beta}\) have been computed, you can make predictions on new input data. For a given feature vector \(x\) (which should exclude the target value), the prediction when an intercept is included is

\[ \hat{y} = \beta_0 + x \cdot \beta_{1:} \]

where \(\beta_0\) is the intercept term and \(\beta_{1:}\) are the remaining coefficients. When the model is fitted without an intercept, the prediction is simply

\[ \hat{y} = x \cdot \beta. \]

The function must also verify that the input vector \(x\) has the correct number of features (i.e., the total number of features in the dataset minus one for the target).
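
A minimal sketch of the prediction step on toy arrays (the dimension check and the use of self.m_beta are left to your implementation):

import numpy as np

beta = np.array([0.5, 2.0, -1.0])    # toy coefficients: [intercept, beta_1, beta_2]
x = np.array([3.0, 4.0])             # feature vector (target column excluded)

fit_intercept = True
if fit_intercept:
    y_hat = beta[0] + x @ beta[1:]   # beta_0 + x . beta_{1:}
else:
    # without an intercept, beta would have exactly x.size entries
    y_hat = x @ beta

print(float(y_hat))                  # 0.5 + 3*2.0 + 4*(-1.0) = 2.5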

Task:

Implement the member function estimate(x) in the LinearRegression class to compute and return the predicted target value \(\hat{y}\) for a given input feature vector \(x\).

Provided Code Skeleton:

    def estimate(self, x: np.ndarray) -> float:
        """
        Estimate the target value for a given input feature vector.

        Parameters:
            x (np.ndarray): Feature vector (excluding the target column).

        Returns:
            float: The predicted target value.
        """

  1. Review the Theoretical Background: Understand how predictions are computed using the regression coefficients. Note the difference in the computation when an intercept is included versus when it is not.

  2. Implement the estimate(x) Method:

    • Verify that the input vector \(x\) has the correct dimension (i.e., x.size should equal dataset.get_dim() - 1).
    • Check that the regression coefficients (\(\beta\)) have been computed (i.e., self.m_beta is not None).
    • Handle both cases of fit_intercept: with an intercept, use \(\hat{y} = \beta_0 + x \cdot \beta_{1:}\); without, use \(\hat{y} = x \cdot \beta\).
    • Return the prediction as a float.

Test your implementation by running

$ python ./grader.py 4

When done, upload your file linear_regression.py.

Exercise 4: Linear Regression Errors in Datasets

Theoretical Background:

In regression analysis, the performance of a model is often evaluated using error metrics computed from the dataset. Three key metrics are:

    • the Total Sum of Squares, \(\text{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2\), which measures the total variability of the target values;
    • the Explained Sum of Squares, \(\text{ESS} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2\), which measures the variability captured by the model;
    • the Residual Sum of Squares, \(\text{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2\), which measures the variability left unexplained.

These quantities satisfy the relationship:

\[ \text{TSS} = \text{ESS} + \text{RSS}. \]

Note that this decomposition \(\text{TSS} = \text{ESS} + \text{RSS}\) only holds under certain conditions:

    • the fitted values \(\hat{y}_i\) come from an OLS fit computed on the same data;
    • the model includes an intercept term, so that the residuals have zero mean and are orthogonal to the fitted values.

Computing these metrics allows you to assess how well your linear regression model fits the data.
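
On plain NumPy arrays (toy values reused from the worked example in the theoretical background, independent of the Dataset class), the three quantities can be computed in a few lines:

import numpy as np

y = np.array([1.0, 2.0, 2.0])           # observed targets
y_hat = np.array([7/6, 5/3, 13/6])      # fitted values from an OLS fit with intercept
y_bar = y.mean()

tss = float(np.sum((y - y_bar) ** 2))       # total sum of squares
ess = float(np.sum((y_hat - y_bar) ** 2))   # explained sum of squares
rss = float(np.sum((y - y_hat) ** 2))       # residual sum of squares

print(tss, ess, rss)                 # 0.666..., 0.5, 0.166...
print(np.isclose(tss, ess + rss))    # True for an OLS fit with intercept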

Task:

Implement a function to compute TSS, ESS, and RSS for a given dataset. Your implementation should follow the steps listed after the code skeleton below.

Provided Code Skeleton:

    def sum_of_squares(self, dataset) -> tuple:
        """
        Compute the Total Sum of Squares (TSS), Explained Sum of Squares (ESS),
        and Residual Sum of Squares (RSS) for a given dataset.

        Parameters:
            dataset (Dataset): The dataset on which to evaluate the regression errors.

        Returns:
            tuple: A tuple (tss, ess, rss) where:
                   - tss (float): Total Sum of Squares.
                   - ess (float): Explained Sum of Squares.
                   - rss (float): Residual Sum of Squares.
        """

        tss = 0.0
        ess = 0.0
        rss = 0.0

        # Return the computed error metrics.
        return tss, ess, rss

  1. Review the Theoretical Background: Understand the definitions of TSS, ESS, and RSS, and how they relate to the performance of the regression model.

  2. Implement the Function:

    • Verify that the dataset passed to the function matches the dimensions of the training dataset.
    • Reuse the logic from your construct_matrix() and construct_y() methods to build \(X\) and \(y\).
    • Compute the predicted values \(\hat{y}\) using your previously computed coefficients.
    • Compute TSS, ESS, and RSS using the formulas provided.
    • Return a tuple (tss, ess, rss) with appropriate types.
  3. Testing: Once implemented, test your function to ensure that it correctly computes the error metrics for your regression model.

Test your implementation by running

$ python ./grader.py 5

When done, upload your file linear_regression.py.

Exercise 5: \(k\)-NN Regression Constructor

Theoretical Background:

In \(k\)-Nearest Neighbors (\(k\)-NN) regression, the predicted value for a new input is computed as the average (or weighted average) of the target values of its \(k\) nearest neighbors in the training data. To efficiently locate these neighbors in multi-dimensional space, a KDTree is used. A KDTree is a data structure that allows for fast nearest-neighbor queries, which is especially beneficial when the training set is large (its advantage over brute-force search diminishes in very high-dimensional spaces).
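
For reference, a scipy.spatial.KDTree is built from an array of points and queried with its query() method, which returns distances and indices of the nearest neighbors (a generic example, unrelated to the TD classes):

import numpy as np
from scipy.spatial import KDTree

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
tree = KDTree(points)

dist, idx = tree.query(np.array([0.2, 0.1]), k=2)   # the two nearest neighbors
print(dist)   # distances to the two closest points
print(idx)    # their row indices in points, here [0, 1]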

Task:

Implement the constructor of the KnnRegression class. Your implementation should:

    • store the number of neighbors \(k\) as an attribute;
    • call the parent Regression constructor with the dataset and the target column index;
    • build a KDTree from the dataset's features (excluding the target column), e.g. with the provided _build_kd_tree() helper, and keep it as an attribute for later queries.

Provided Code Skeleton:

import numpy as np
from scipy.spatial import KDTree
from TD.regression import Regression

class KnnRegression(Regression):
    def __init__(self, k: int, dataset, col_regr: int) -> None:
        """
        Constructor for the KnnRegression class.

        Parameters:
            k (int): Number of nearest neighbors to use for regression.
            dataset (Dataset): The dataset containing the training data.
            col_regr (int): The column index of the target variable in the dataset.

        Returns:
            None

        Task:
            Build a KDTree from the dataset's features (excluding the target column).
        """
        # Initialize the parent Regression class
        super().__init__(dataset, col_regr)

    def _build_kd_tree(self) -> KDTree:
        """
        Build a KDTree from the dataset's features (excluding the target column).

        Returns:
            KDTree: A KDTree built from the feature vectors.
        """
        n_samples: int = self.m_dataset.get_nbr_samples()
        n_features: int = self.m_dataset.get_dim() - 1  # Exclude target column

        # Initialize an array to store feature vectors.
        data_pts = np.zeros((n_samples, n_features))

        # Populate data_pts by iterating through all samples.
        for i in range(n_samples):
            j2 = 0  # Tracks original column indices while skipping target column
            for j in range(n_features):
                if j2 == self.m_col_regr:
                    j2 += 1  # Skip the target column in the original dataset
                data_pts[i, j] = self.m_dataset.get_instance(i)[j2]
                j2 += 1

        return KDTree(data_pts)

  1. Implement the Constructor:

    • Fill in the missing elements: store the number of neighbors \(k\) and build the KDTree (e.g., with the provided _build_kd_tree() helper), keeping both as attributes.

Test your implementation by running

$ python ./grader.py 6

When done, upload your file knn_regression.py.

Exercise 6: \(k\)-Nearest Neighbors and Regression

Theoretical Background:

In \(k\)-Nearest Neighbors (k-NN) regression, the prediction for a new input is computed based on the target values of the \(k\) nearest neighbors in the training data. The idea is to find the \(k\) samples in the training set that are closest to the given feature vector \(x\) (using a KDTree for efficient searching), and then compute the mean of their target values. This average is used as the predicted value \(\hat{y}\).
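
A minimal sketch of the query-and-average step on toy arrays (your method should instead use the KDTree and dataset stored by the constructor):

import numpy as np
from scipy.spatial import KDTree

features = np.array([[0.0], [1.0], [2.0], [3.0]])   # training features (target excluded)
targets = np.array([0.0, 1.0, 4.0, 9.0])            # corresponding target values
tree = KDTree(features)

k = 2
x = np.array([1.6])
_, idx = tree.query(x, k=k)          # indices of the k nearest neighbors
idx = np.atleast_1d(idx)             # query returns a scalar index when k == 1
y_hat = float(targets[idx].mean())   # k-NN prediction: mean of the neighbors' targets
print(y_hat)                         # neighbors are x=2 and x=1, so (4 + 1) / 2 = 2.5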

Task:

Implement the estimate(x) method in the KnnRegression class. Your function should follow the steps listed after the code skeleton below.

Provided Code Skeleton:

    def estimate(self, x: np.ndarray) -> float:
        """
        Estimate the target value for a given input feature vector using k-NN regression.

        Parameters:
            x (np.ndarray): Feature vector (excluding the target column).
                            Expected shape: (n_features,).

        Returns:
            float: The estimated target value, computed as the mean of the target values
                   of the k nearest neighbors.
        """

  1. Review the Theoretical Background: Understand how k-NN regression uses the average of the \(k\) nearest neighbors’ target values to make a prediction.

  2. Implement the estimate(x) Method:

    • Verify the dimensionality of \(x\) matches the expected number of features.
    • Query the KDTree using \(x\) to obtain the \(k\) nearest neighbor indices.
    • Retrieve the corresponding target values from the dataset.
    • Compute the average (mean) of these target values.
    • Return the computed mean as a float.
  3. Testing: Once implemented, test your function with various feature vectors to ensure it correctly computes the predictions.

Test your implementation by running

$ python ./grader.py 7

When done, upload your file knn_regression.py.
