CSC_43042_EP - TD9: Neural networks from scratch
Context
In this TD you will learn to create a simple multilayer perceptron (MLP) neural network from scratch using only the NumPy package for matrix-vector operations. In order to train the network on several datasets, you will need to implement the back-propagation algorithm with gradient descent.
Before getting started, let’s review the notations we will use in this lab and recall the stochastic gradient descent algorithm.
Notations
READ THIS CAREFULLY BEFORE YOU ATTEMPT ANY EXERCISE
We will consider simple fully-connected MLP networks that can be described by the following recursive relationship:
\[ \begin{align} &z^l = a^{l-1} W^l + b^l,\\ &a^l = \sigma^l(z^l), \end{align} \]
where \(a^l\) is the output (activation) of layer \(l\) which is a nonlinear function \(\sigma^l\) of a linear transformation of the previous layer’s output. The linear transformation is performed using the weight matrix \(W^l\) and bias vector \(b^l\) associated with the layer \(l\). We will denote the last layer in the network with a capital \(L\) superscript. The recursion is stopped by setting \(a^0 = x\), where \(x\) is the input vector to our network. Note that in the recursive expressions above, we implicitly assume that our input/output vectors are row vectors. The reason for this will be apparent later.
Taking this notation into account, we see that a network with \(L-1\) hidden layers is fully expressed by its \(L\) weight matrices, bias vectors, and activation functions. We can denote the set of trainable parameters in our network by \(\theta = \{ W^1, \dots, W^L, b^1, \dots, b^L \}\).
Beware that in Python, the indices of the weights, biases, and activations will start at 0.
In this lab, we are only concerned with supervised learning tasks. Recall that in supervised learning, we have a dataset represented by a list of \((x, y)\) pairs where \(x\) is the input to our model and \(y\) is the desired output.
- For regression problems where we want to fit a function \(y = f(x)\), \(x\) is the independent variable vector, and \(y\) is the function value.
- In classification problems, \(x\) will correspond to a set of attributes and \(y\) the corresponding label.
The goal of supervised learning is to “train” our network by adjusting its parameters in order to minimize a cost function over the entire training set:
\[ \min_\theta \mathcal{L} = \min_\theta \sum_{i=1}^N \ell^{(i)} \]
where \(\ell^{(i)} = \ell(\sigma^L(\sigma^{L-1}(\dots(x^{(i)}W^1 + b^1)\dots)W^L + b^L), y^{(i)})\) and \(\ell(\hat{y}, y)\) denotes the particular form of the loss function being considered. In this lab, we will use 2 different loss functions:
- Quadratic Loss: \(\ell(\hat{y}, y) = \|\hat{y} - y\|^2\)
- Cross-entropy Loss: \(\ell(\hat{y}, y) = -[ y \ln\hat{y} + (1-y)\ln(1-\hat{y})]\)
WARNING: we consider only 2D arrays; vectors will have one dimension of size 1, e.g. for a particular sample, \(a^l\) defined above will have shape \((1, d)\), not \((d,)\). This allows us to generalize easily to feeding the neural network several samples at a time, s.t. nothing in the code needs to change for it to work with arbitrary batches of input data: \(a^l\) becomes \((n, d)\), the loss \(\mathcal{L}\) is the sum of the loss over the row axis, and the gradients can be computed sample-wise and summed to be back-propagated (more on that later).
Before you begin
Download and unzip the archive CSC43042EP-td9-1-handin.zip. It contains:
- source files in a TD package, in particular the files/modules activations.py, losses.py and network.py,
- a grader.py script.
Structure
This TD is structured as follows:
In part 1, we implement activation functions and the feed-forward pass, s.t. with fixed weights and biases we are able to produce an estimate \(\hat{y}\).
In part 2, we implement loss functions and the back-propagation pass, which updates the weights s.t. the predictions improve.
In part 3, we try this network on the MNIST dataset.
1. Build an un-trainable network: feed-forward
1.1 Activations
We wish to implement the activation functions (the \(\sigma^l\) in the notations above) as classes, so that we can also incorporate their derivatives (we will need them for back-propagation).
However, we don’t want multiple instances of activation function objects (the activation is always the same), hence we rely on static methods (these are “class-wide methods” that do NOT depend on a particular object’s values; they can be called using the class name).
Finally, we want to be able to call the activation function directly by its class name, like a function. For all these reasons, we rely on the base class Activation at the top of activations.py.
It seems complicated, but it’s really not; here’s why:
- __call__ methods are invoked when using parentheses on an instantiated class object. However, instantiation has precedence over __call__, so calling a class like another static method of that class or like a function will not work (see e.g. here).
- __init__ (typically used to customize the instantiated objects of a class) cannot return anything. The trick is thus to use it in conjunction with __new__ (which creates the object and passes it to __init__).
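To make this concrete, here is a minimal sketch of how such a base class can be written; the actual Activation class provided at the top of activations.py may differ in its details:

```python
class Activation:
    """Sketch of the base-class trick described above (illustrative only)."""

    def __new__(cls, z):
        # Instantiation (e.g. Sigmoid(z)) triggers __new__ before __init__.
        # By returning the result of the static __call__ instead of a new
        # object, subclasses can be used like plain functions.
        return cls.__call__(z)

    @staticmethod
    def __call__(z):
        raise NotImplementedError

    @staticmethod
    def prime(z):
        raise NotImplementedError
```

Since __new__ returns a NumPy array rather than an instance of the class, Python never calls __init__ on the result, which sidesteps the “__init__ cannot return anything” limitation.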
Notice the plot method, which will allow us to plot the different activation functions by running the activations.py module (see __main__) like so: python -m TD activations. This will work (and indeed plot something) when the derived classes (see the next two sections) are implemented, and only if you run this command from the directory that contains the TD package (see some explanation of how PYTHONPATH works).
1.2 Identity activation
The easiest activation we can implement is the identity function, which simply returns its input unchanged. Implement it below in the class template Identity. The prime function should implement the derivative of the activation.
Python tricks: Identity inherits from Activation; thus, calling Identity(some_array) will call Identity.__new__(Identity, some_array), which in turn will call the static method Identity.__call__(some_array).
Python hint: see np.ones_like.
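Putting these pieces together, a possible sketch of the class (to be filled into the template in activations.py) is:

```python
import numpy as np

class Identity(Activation):
    @staticmethod
    def __call__(z: np.ndarray) -> np.ndarray:
        # The identity activation returns its input unchanged.
        return z

    @staticmethod
    def prime(z: np.ndarray) -> np.ndarray:
        # The derivative of the identity is 1 everywhere, same shape as z.
        return np.ones_like(z)
```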
1.3 Sigmoid activation
Recall that the sigmoid function is given by \(\sigma(x) = \dfrac{1}{1 + \exp(-x)}\). Calculate its derivative. Recall that the derivative of \(\dfrac{1}{f(x)}\) is \(-\dfrac{f'(x)}{f(x)^2}\). Rearrange the resulting expression so that only \(\sigma\) appears.
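To check your algebra: applying that rule to \(f(x) = 1 + \exp(-x)\) and rearranging so that only \(\sigma\) appears should give
\[ \sigma'(x) = \frac{\exp(-x)}{(1 + \exp(-x))^2} = \sigma(x)\,\frac{\exp(-x)}{1 + \exp(-x)} = \sigma(x)\,\big(1 - \sigma(x)\big). \]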
Plot both activation functions using python -m TD activations as suggested above.
Test your work with python grader.py 1 and upload it using the box below:
1.4 Feed-forward
1.4.1 Network initialization
From the recursion formulas above, figure out the shapes of the weight matrices and bias vectors, given the number of neurons \(d^l\) in layer \(l\). In network.py’s Network class, you are given the arguments sizes: List[int], corresponding to \((d^0, \dots, d^L)\) (where \(d^0\) is the number of features in the input and \(d^L\) is the dimension of the output, which must match the dimension of \(y\)), and sigmas: List[Type[Activation]], corresponding to \((\sigma^1, \dots, \sigma^L)\); these are stored in the attributes sizes and sigmas:
- Construct a list of weight matrices (in a weights attribute) and a list of bias vectors (in a biases attribute), randomly initialized from the standard normal distribution (see np.random.randn); pay attention to the size of each list and the shape of each element, which must be consistent with sizes.
- Implement the num_params function, which counts the total number of trainable parameters (\(\theta\) in the notations above). You may use your knowledge of the shapes of the weights and biases given the sizes argument.
Python hints: array[1:] gives you all the array except the first entry; array[:-1] gives you all the array except the last entry; for x, y in zip(a, b) lets you iterate “simultaneously” through a pair of objects of the same length.
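Putting the hints together, a minimal sketch of the initialization and of num_params could look as follows; the exact Network skeleton in network.py may organize things differently:

```python
import numpy as np
from typing import List, Type

class Network:
    def __init__(self, sizes: List[int], sigmas: List[Type["Activation"]]):
        self.sizes = sizes
        self.sigmas = sigmas
        # W^l has shape (d^{l-1}, d^l): z^l = a^{l-1} W^l + b^l maps
        # a (1, d^{l-1}) row vector to a (1, d^l) row vector.
        self.weights = [np.random.randn(d_in, d_out)
                        for d_in, d_out in zip(sizes[:-1], sizes[1:])]
        # b^l is a row vector of shape (1, d^l), per the 2D-array convention.
        self.biases = [np.random.randn(1, d_out) for d_out in sizes[1:]]

    def num_params(self) -> int:
        # Total number of scalars in theta = {W^1, ..., W^L, b^1, ..., b^L}.
        return sum(W.size for W in self.weights) + sum(b.size for b in self.biases)
```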
Test your work with python grader.py 2 and upload it using the box below:
1.4.2 Feed-forward pass
Use the recurrence relations discussed in the Notation section, in particular:
\[z^l = a^{l-1} W^l + b^l,\\ a^l = \sigma^l(z^l).\]
Return:
- the output of the network, i.e. \(a^L\),
- the list of the \(a^l\) (i.e. \([a^0, \dots, a^L]\)),
- the list of linear combinations \(z^l\) (i.e. \([z^1, \dots, z^L]\)).
The last two lists returned will serve for the back-propagation implementation.
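One possible sketch of this pass, following the recursions above (the method name feed_forward and its exact signature are assumptions; check network.py):

```python
import numpy as np

def feed_forward(self, x: np.ndarray):
    a = x                 # a^0 = x, shape (n, d^0)
    activations = [a]     # will collect [a^0, ..., a^L]
    zs = []               # will collect [z^1, ..., z^L]
    for W, b, sigma in zip(self.weights, self.biases, self.sigmas):
        z = a @ W + b     # z^l = a^{l-1} W^l + b^l; b broadcasts over the n rows
        a = sigma(z)      # a^l = sigma^l(z^l)
        zs.append(z)
        activations.append(a)
    return a, activations, zs
```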
Test your work with python grader.py 3 and upload it using the box below:
2. Actually learning something: back-propagation
READ THIS SECTION CAREFULLY BEFORE YOU ATTEMPT ANY EXERCISE
(Stochastic) Gradient Descent
Recall from the beginning of the TD that we want to train the network parameters by minimizing a given loss function over an entire dataset. One method of doing this is using the gradient descent (GD) algorithm which you have already seen in the lectures; recall the update rule:
\[ \theta = \theta - \eta \nabla_\theta \mathcal{L}, \]
where \(\eta\) is the learning rate, \(\theta\) denotes any learnable parameter in the network, and \(\mathcal{L}\) denotes the sum of the loss over all samples considered (see Section Notations). The update rule above is called GD because the direction of change in the network parameters is the opposite of the gradient of the loss function with respect to those parameters. You can think of this as a ball rolling down a hill to find the lowest point of the landscape; only in this case, the ball is massless because it has no momentum. Note that, contrary to linear and logistic regression, the loss / error function might not be convex, hence the existence of a unique minimizer isn’t guaranteed, nor is it guaranteed that gradient descent will find one.
When the update rule is applied to a random subset of the total dataset, it is called stochastic gradient descent (SGD). SGD is far more efficient (computationally) than GD when the batch size is large enough to approximate the true gradients while being significantly smaller than the full dataset. Running over the entire dataset with SGD once is called an “epoch” (of training).
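To make the batching concrete, here is a hypothetical sketch of one SGD epoch; sgd_epoch and the update_step signature are illustrative, not part of the handout’s API:

```python
import numpy as np

def sgd_epoch(network, X, Y, loss, eta, batch_size):
    # Visit the whole dataset once, in random order, one mini-batch at a time.
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        # Each call applies theta <- theta - eta * grad on this mini-batch only.
        network.update_step(X[batch], Y[batch], loss, eta)
```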
From the GD update rule above, it is clear we will need to compute the gradients of the cost function with respect to the network parameters. This is exactly what back-propagation does, and it is thus a crucial component of almost all neural network learning algorithms. In what follows, we derive the 4 equations of back-propagation. You may want to verify these equations, which can relatively easily be derived using the chain rule for differentiation, or look at the “proofs” below.
Detail of computation of gradients of parameters
Before we start, it is convenient to define the following variable:
\[ \delta^l \equiv \frac{\partial \ell_p}{\partial z^l}. \]
In other words, \(\delta^l\) is the gradient of the loss function for a point \(p\) with respect to the input of the activation function at layer \(l\) of our network. Recall that we must work our way backwards. Thus, let’s start by expressing \(\delta^L = \frac{\partial \ell_p}{\partial z^L}\) in terms of \(a^L\) and \(\sigma^L\).
\[\delta^L = \frac{\partial \ell_p}{\partial z^L}, \textrm{ by definition}.\]
\[\frac{\partial \ell_p}{\partial z^L}= \frac{\partial \ell_p}{\partial a^L} \frac{\partial a^L}{\partial z^L}, \textrm{ by the chain rule}. \]
Since \(a^L = \sigma^L (z^L)\), we have that
\[\frac{\partial a^L}{\partial z^L} = \sigma'^L (z^L).\]
Thus, we are able to conclude that, \[\delta^L = \frac{\partial \ell_p}{\partial a^L} \sigma'^L (z^L) = \nabla_{a^L} \ell_p \cdot \sigma'^L (z^L).\]
Now that we know how to get \(\delta^L\), let’s continue our journey backward and express \(\delta^l\) in terms of \(\delta^{l+1}\):
\[\delta^{l} = \frac{\partial \ell_p}{\partial z^{l}}, \textrm{ by definition.}\]
\[= \frac{\partial \ell_p}{\partial z^{l+1}} \cdot \frac{\partial z^{l+1}}{\partial a^{l}} \cdot \frac{\partial a^{l}}{\partial z^{l}}, \textrm{ by the chain rule.} \]
\[= \delta^{l+1} \cdot (W^{l+1})^T \cdot \sigma'^{l} (z^{l}) .\]
Recall that our end goal is to update parameters \(\theta = \{ W^1, \dots, W^L, b^1, \dots, b^L \}\) in the opposite direction of their respective gradients \(\nabla_{W^1} \ell_p, ..., \nabla_{W^L} \ell_p, \nabla_{b^1} \ell_p, ..., \nabla_{b^L} \ell_p\).
Denote by \((\nabla_{W^l}\ell_p)_{i,j}\) the \((i,j)^{\text{th}}\) entry of \(\nabla_{W^l}\ell_p\).
We have: \[(\nabla_{W^l}\ell_p)_{i,j} = \frac{\partial \ell_p}{\partial W^{l}_{i,j}} = \frac{\partial \ell_p}{\partial z^{l}_{j}} \frac{\partial z^{l}_{j}}{\partial W^{l}_{i,j}} = \delta^l_j \frac{\partial \left( \sum_k \left[a^{l-1}_k W^{l}_{k,j}\right] + b^l_j\right)}{\partial W^{l}_{i,j}}.\]
For \(k \neq i\), \(\dfrac{\partial W^{l}_{k,j}}{\partial W^{l}_{i,j}}=0\), such that \(\dfrac{\partial \left( \sum_k \left[a^{l-1}_k W^{l}_{k,j}\right] + b^l_j\right)}{\partial W^{l}_{i,j}} = a^{l-1}_i.\)
We get: \[(\nabla_{W^l}\ell_p)_{i,j} = a^{l-1}_i \delta^l_j ,\]
hence \[\nabla_{W^l}\ell_p = (a^{l-1})^T \delta^l.\]
The same reasoning for biases yields:
\[(\nabla_{b^l}\ell_p)_j = \frac{\partial \ell_p}{\partial z^l_j} \frac{\partial z^l_j}{\partial b^l_j} = \delta^l_j \frac{\partial \left( \sum_k \left[a^{l-1}_k W^{l}_{k,j}\right] + b^l_j\right)}{\partial b^{l}_{j}} = \delta^l_j.\]
To sum up, the equations that we will need to implement are:
\[ \begin{align} \delta^L &= \nabla_{a^L} \ell_p \cdot \sigma'^L (z^L)\\ \delta^{l} &= \delta^{l+1} \cdot (W^{l+1})^T \cdot \sigma'^{l} (z^{l})\\ \nabla_{W^l}\ell_p &= (a^{l-1})^T \delta^l\\ \nabla_{b^l}\ell_p &= \delta^l \end{align} \]
2.1 Losses
To make good use of these equations to back-propagate the error, we
need a loss function to compute \(\ell_p\). To do that, we provide the
losses.py
module, which is quite similar to
activations.py
in that it provides a base abstract class
Loss
which takes two 2D arrays (thus assuming there could
be several targets) y
and y_hat
for both its
__call__
and prime
class methods, which
respectively implement the sum of the losses over the samples (\(\mathcal{L}\)) and possibly over several
targets, and the derivative of the loss w.r.t. each sample and
target.
Notice the plot method, which will allow us to plot the different loss functions by running the losses.py module (see __main__) like so: python -m TD losses. This will work (and indeed plot something) when the derived classes (see the next two sections) are implemented, and only if you run this command from the directory that contains the TD package (see some explanation of how PYTHONPATH works).
2.1.1 Quadratic
Implement Quadratic s.t. Quadratic(y_hat, y) returns the sum of squared errors over all samples and (possibly) all targets, and its prime class method returns its derivative w.r.t. each sample and each target.
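A minimal sketch, assuming the Loss base class dispatches __call__ through __new__ the same way Activation does:

```python
import numpy as np

class Quadratic(Loss):
    @staticmethod
    def __call__(y_hat: np.ndarray, y: np.ndarray) -> float:
        # Sum of squared errors over all samples (rows) and targets (columns).
        return np.sum((y_hat - y) ** 2)

    @staticmethod
    def prime(y_hat: np.ndarray, y: np.ndarray) -> np.ndarray:
        # Element-wise derivative w.r.t. y_hat: one entry per sample and target.
        return 2.0 * (y_hat - y)
```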
2.1.2 CrossEntropy
Implement CrossEntropy s.t. CrossEntropy(y_hat, y) returns the sum of log losses over all samples and (possibly) all targets, and its prime class method returns its derivative w.r.t. each sample and each target.
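Under the same assumption as for Quadratic, a sketch could be:

```python
import numpy as np

class CrossEntropy(Loss):
    @staticmethod
    def __call__(y_hat: np.ndarray, y: np.ndarray) -> float:
        # Sum of binary log losses over all samples and targets.
        return -np.sum(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

    @staticmethod
    def prime(y_hat: np.ndarray, y: np.ndarray) -> np.ndarray:
        # Derivative of -[y ln(y_hat) + (1 - y) ln(1 - y_hat)] w.r.t. y_hat.
        return -y / y_hat + (1.0 - y) / (1.0 - y_hat)
```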
Plot both loss functions using python -m TD losses as suggested above.
Test your work with python grader.py 4 and upload it using the box below:
2.2 Back-propagation
With these losses defined, allowing the Network class to train boils down to implementing backpropagation (i.e. applying the equations given above to compute the gradients), update_step, which calls backpropagation and applies the GD rule, and train, which runs update_step in a loop (this last step is given; you have to implement the first two in what follows).
2.2.1 Back-propagation pass to compute gradients
The first is the backpropagation method, which can take training sample(s) of arbitrary batch size and a Loss class like the ones above, and computes the gradient of the loss function w.r.t. all the weights and biases using the back-propagation algorithm, summed over all provided samples. Use the formulas above. It should return the lists of \(\nabla_{W^l}\ell\) and \(\nabla_{b^l}\ell\) for \(1 \leq l \leq L\).
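As a starting point, the four equations can translate into NumPy as below, with Python’s 0-indexed lists (a sketch only; adapt it to the actual skeleton in network.py, and note it reuses the feed_forward sketch from part 1):

```python
import numpy as np

def backpropagation(self, x, y, loss):
    y_hat, activations, zs = self.feed_forward(x)   # [a^0..a^L] and [z^1..z^L]
    # delta^L = nabla_{a^L} ell * sigma'^L(z^L)  (element-wise product)
    delta = loss.prime(y_hat, y) * self.sigmas[-1].prime(zs[-1])
    grads_W = [None] * len(self.weights)
    grads_b = [None] * len(self.biases)
    for l in range(len(self.weights) - 1, -1, -1):
        # nabla_{W^l} = (a^{l-1})^T delta^l; the matrix product also sums over samples.
        grads_W[l] = activations[l].T @ delta
        # nabla_{b^l} = delta^l, summed over the sample axis (kept as a (1, d^l) array).
        grads_b[l] = delta.sum(axis=0, keepdims=True)
        if l > 0:
            # delta^l = (delta^{l+1} (W^{l+1})^T) * sigma'^l(z^l)
            delta = (delta @ self.weights[l].T) * self.sigmas[l - 1].prime(zs[l - 1])
    return grads_W, grads_b
```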
Test your work with python grader.py 5 and upload it using the box below:
2.2.2 Update parameters
Next, implement a single epoch of training using the GD update rule to update the weights and biases in update_step, given the gradients of each weight and bias provided by backpropagation.
The update_step method is used successively in a loop by the fit method, as you can see in the Network class implementation. Assuming you have implemented the backpropagation and update_step functions correctly, you should now be able to train the network on real training data.
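For reference, a possible sketch of the GD step (the parameter names, in particular the learning rate eta, are assumptions; check the skeleton):

```python
def update_step(self, x, y, loss, eta):
    grads_W, grads_b = self.backpropagation(x, y, loss)
    # theta <- theta - eta * nabla_theta L, for every weight matrix and bias vector.
    self.weights = [W - eta * gW for W, gW in zip(self.weights, grads_W)]
    self.biases = [b - eta * gb for b, gb in zip(self.biases, grads_b)]
```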
Test your work with python grader.py 6 and upload it using the box below:
You may now play with your implementation to train on some generated sinusoidal data by running the network.py module, e.g. with python -m TD sine. This will work (and indeed plot something) once the classes above are implemented, and only if you run this command from the directory that contains the TD package (see some explanation of how PYTHONPATH works).
3. Xaggle competition
Now that we have a full-fledged neural network implementation, and given that you already know how to do text classification (recall TD5 and TD7), let’s try our luck on image classification.
Grayscale images are 2D matrices where each entry corresponds to the intensity/brightness (0 is black and 255 is white). Similarly, color images are typically 3D arrays where each 2D matrix holds the intensity of one color channel (the 3rd dimension), typically Red-Green-Blue (RGB). Color videos, in turn, are typically 4D arrays: sequences of timestamped color images (e.g. at 24 images per second). If you work at your preferred automaker on autonomous driving, you now have 5D datasets: probably millions of samples (videos) to train on. Say you have 1 million 1-minute samples of color video at 24 Full-HD images per second: that is 1 000 000 samples x 60 seconds x 24 images x 3 colors x 1920 x 1080 pixels, roughly 9e15 values…
Let’s come back down to earth (academia) and work on grayscale images of hand-written digits in mnist.py and network.py (see e.g. get_raw_data and get_xaggle_champion respectively). This is briefly explained hereafter.
Have a look at the data by executing python -m TD mnist, which downloads the MNIST dataset and plots 5 samples per digit. Have a look at the script. It roughly does the following (in particular, see get_raw_data): recall that we assumed the inputs of our network were 2D arrays (#samples, #features). To make grayscale images fit into our framework, we flatten them (either stacking one column after the other, or each row…). Also, the labels are integers from 0 to 9, but we know how to deal with binary classification with the Sigmoid activation function, which outputs a probability, so we make the output a vector of size 10 whose \(i\)th entry represents the sample belonging to class \(i\) (i.e. a handwritten 1 would be encoded as \((0, 1, 0, \dots, 0)\)). Now we can train our network to recognize them! Feel free to manipulate the provided dataset, tune your network’s hyperparameters, come up with good evaluation metrics…
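As an illustration of the flattening and one-hot encoding just described (the arrays below are made up; get_raw_data does the real work):

```python
import numpy as np

images = np.random.randint(0, 256, size=(5, 28, 28))  # fake grayscale "digits"
labels = np.array([3, 0, 7, 1, 9])

X = images.reshape(len(images), -1)       # (#samples, 784): one flattened row per image
Y = np.zeros((len(labels), 10))
Y[np.arange(len(labels)), labels] = 1.0   # one-hot targets of shape (#samples, 10)
```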
Now that we have appropriate 2D arrays ((#samples, #features), (#samples, #targets)), we can use our neural net! Have a look at network.py, in particular hyperparameters and get_xaggle_champion. You must tweak hyperparameters s.t. python grader.py 7 yields the best score on the private test set (beware that this won’t work locally since you don’t have the private test set). To achieve this and train some model locally before submitting your champion, you can make use of python -m TD xaggle, which instantiates and fits a neural network, given your choice of hyperparameters, on the MNIST train data fetched above, and computes an accuracy on a public validation set. You can assume (at your own risk!) that maximizing the accuracy on the public validation set will maximize the accuracy on the private test set.
Recall that python -m TD mnist and python -m TD network (sine|xaggle) will only work if you run them from the directory that contains the TD package (see some explanation of how PYTHONPATH works).