CSC_43042_EP - TD9: Neural networks from scratch
Context
In this TD you will learn to create a simple multilayer perceptron (MLP) neural network from scratch using only the NumPy package for matrix-vector operations. In order to train the network on several datasets, you will need to implement the back-propagation algorithm with gradient descent.
Before getting started, let’s review the notations we will use in this lab and recall the stochastic gradient descent algorithm.
Notations
READ THIS CAREFULLY BEFORE YOU ATTEMPT ANY EXERCISE
We will consider simple fully-connected MLP networks that can be described by the following recursive relationship:
\[ \begin{align} &z^l = a^{l-1} W^l + b^l,\\ &a^l = \sigma^l(z^l), \end{align} \]
where \(a^l\) is the output (activation) of layer \(l\) which is a nonlinear function \(\sigma^l\) of a linear transformation of the previous layer’s output. The linear transformation is performed using the weight matrix \(W^l\) and bias vector \(b^l\) associated with the layer \(l\). We will denote the last layer in the network with a capital \(L\) superscript. The recursion is stopped by setting \(a^0 = x\), where \(x\) is the input vector to our network. Note that in the recursive expressions above, we implicitly assume that our input/output vectors are row vectors. The reason for this will be apparent later.
Taking this notation into account, we see that a network with \(L-1\) hidden layers is fully expressed by its \(L\) weight matrices, bias vectors, and activation functions. We can denote the set of trainable parameters in our network by \(\theta = \{ W^1, \dots, W^L, b^1, \dots, b^L \}\).
Beware that in Python, the indices of the weights, biases, and activations will start at 0.
In this lab, we are only concerned with supervised learning tasks. Recall that in supervised learning, we have a dataset represented by a list of \((x, y)\) pairs where \(x\) is the input to our model and \(y\) is the desired output.
- For regression problems where we want to fit a function \(y = f(x)\), \(x\) is the independent variable vector, and \(y\) is the function value.
- In classification problems, \(x\) will correspond to a set of attributes and \(y\) the corresponding label.
The goal of supervised learning is to “train” our network by adjusting its parameters in order to minimize a cost function over the entire training set:
\[ \min_\theta \mathcal{L} = \min_\theta \sum_{i=1}^N \ell^{(i)} \]
where \(\ell^{(i)} = \ell(\sigma^L(\sigma^{L-1}(\dots(x^{(i)}W^1 + b^1)\dots)W^L + b^L), y^{(i)})\) and \(\ell(\hat{y}, y)\) denotes the particular form of the loss function being considered. In this lab, we will use 2 different loss functions:
- Quadratic Loss: \(\ell(\hat{y}, y) = \|\hat{y} - y\|^2\)
- Cross-entropy Loss: \(\ell(\hat{y}, y) = -[ y \ln\hat{y} + (1-y)\ln(1-\hat{y})]\)
WARNING: we consider only 2D arrays; vectors will have one dimension of size 1, e.g. for a particular sample, \(a^l\) defined above will have shape \((1, d)\), not \((d,)\). This allows us to generalize easily to feeding the neural network several samples at a time, s.t. nothing in the code needs to change for it to work with arbitrary batches of input data: \(a^l\) becomes \((n, d)\), the loss \(\mathcal{L}\) is the sum of the loss over the row axis, and the gradients can be computed sample-wise and summed to be back-propagated (more on that later).
Before you begin
Download and unzip the archive CSC43042EP-td9-1-handin.zip. It contains:
- source files in a TD package, in particular the files/modules activations.py, losses.py and network.py,
- a grader.py script.
Structure
This TD is structured as follows:
In part 1, we implement activation functions and the feed-forward pass, s.t. with fixed weights and biases we are able to produce an estimate \(\hat{y}\).
In part 2, we implement loss functions and the back-propagation pass, which updates the weights s.t. the predictions improve.
In part 3, we try this network on the MNIST dataset.
1. Build an un-trainable network: feed-forward
1.1 Activations
We wish to implement the activation functions (the \(\sigma^l\) in the notations above) as classes, so that we can also incorporate their derivatives (we will need them for back-propagation).
However, we don’t want multiple instances of activation function objects (the activation is always the same), hence we rely on static methods (these are “class-wide methods” that do NOT depend on a particular object’s values; they can be called using the class name).
Finally, we want to be able to call the activation function directly by its class name, like a function. For all these reasons, we rely on the base class Activation at the top of activations.py.
It seems complicated, but it’s really not; here’s why:
- __call__ methods are invoked when using parentheses on an instantiated class object. However, instantiation has precedence over __call__, so calling a class like another static method of that class or like a function will not work (see e.g. here).
- __init__ (typically used to customize the instantiated objects of a class) cannot return anything. The trick is thus to use it in conjunction with __new__ (which creates the object and passes it to __init__).
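To make this concrete, here is a minimal sketch of how such a base class can be written; the actual Activation class provided at the top of activations.py may differ in its details:

```python
class Activation:
    """Sketch of the base-class trick described above (illustrative only)."""

    def __new__(cls, z):
        # Instantiation (e.g. Sigmoid(z)) triggers __new__ before __init__.
        # By returning the result of the static __call__ instead of a new
        # object, subclasses can be used like plain functions.
        return cls.__call__(z)

    @staticmethod
    def __call__(z):
        raise NotImplementedError

    @staticmethod
    def prime(z):
        raise NotImplementedError
```

Since __new__ returns a NumPy array rather than an instance of the class, Python never calls __init__ on the result, which sidesteps the “__init__ cannot return anything” limitation.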
Notice the plot method, which will allow us to plot the different activation functions by running the activations.py module (see __main__) like so: python -m TD activations. This will work (and indeed plot something) when the derived classes (see the next two sections) are implemented, and only if you run this command from the directory that contains the TD package (see some explanation of how PYTHONPATH works).
1.2 Identity activation
The easiest activation we can implement is the identity function, which simply returns its input unchanged. Implement it below in the class template Identity. The prime function should implement the derivative of the activation.
Python tricks: Identity inherits from Activation; thus, calling Identity(some_array) will call Identity.__new__(Identity, some_array), which in turn will call the static method Identity.__call__(some_array).
Python hint: see np.ones_like.
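Putting these pieces together, a possible sketch of the class (to be filled into the template in activations.py) is:

```python
import numpy as np

class Identity(Activation):
    @staticmethod
    def __call__(z: np.ndarray) -> np.ndarray:
        # The identity activation returns its input unchanged.
        return z

    @staticmethod
    def prime(z: np.ndarray) -> np.ndarray:
        # The derivative of the identity is 1 everywhere, same shape as z.
        return np.ones_like(z)
```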
1.3 Sigmoid activation
Recall that the sigmoid function is given by \(\sigma(x) = \dfrac{1}{1 + \exp(-x)}\). Calculate its derivative. Recall that the derivative of \(\dfrac{1}{f(x)}\) is \(-\dfrac{f'(x)}{f(x)^2}\). Rearrange the resulting expression so that only \(\sigma\) appears.
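To check your algebra: applying that rule to \(f(x) = 1 + \exp(-x)\) and rearranging so that only \(\sigma\) appears should give
\[ \sigma'(x) = \frac{\exp(-x)}{(1 + \exp(-x))^2} = \sigma(x)\,\frac{\exp(-x)}{1 + \exp(-x)} = \sigma(x)\,\big(1 - \sigma(x)\big). \]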
Plot both activation functions using python -m TD activations as suggested above.
Test your work with python grader.py 1 and upload it using the box below:
1.4 Feed-forward
1.4.1 Network initialization
From the recursion formulas above, figure out the shapes of the weight matrices and bias vectors, given the number of neurons \(d^l\) in layer \(l\). In network.py’s Network class, you are given the arguments sizes: List[int], corresponding to \((d^0, \dots, d^L)\) (where \(d^0\) is the number of features in the input and \(d^L\) is the dimension of the output, which must match the dimension of \(y\)), and sigmas: List[Type[Activation]], corresponding to \((\sigma^1, \dots, \sigma^L)\); these are stored in the attributes sizes and sigmas:
- Construct a list of weight matrices (in a weights attribute) and a list of bias vectors (in a biases attribute), randomly initialized from the standard normal distribution (see np.random.randn); pay attention to the size of each list and the shape of each element, which must be consistent with sizes.
- Implement the num_params function, which counts the total number of trainable parameters (\(\theta\) in the notations above). You may use your knowledge of the shapes of the weights and biases given the sizes argument.
Python hints: array[1:] gives you all the array except the first entry; array[:-1] gives you all the array except the last entry; for x, y in zip(a, b) lets you iterate “simultaneously” through a pair of objects of the same length.
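Putting the hints together, a minimal sketch of the initialization and of num_params could look as follows; the exact Network skeleton in network.py may organize things differently:

```python
import numpy as np
from typing import List, Type

class Network:
    def __init__(self, sizes: List[int], sigmas: List[Type["Activation"]]):
        self.sizes = sizes
        self.sigmas = sigmas
        # W^l has shape (d^{l-1}, d^l): z^l = a^{l-1} W^l + b^l maps
        # a (1, d^{l-1}) row vector to a (1, d^l) row vector.
        self.weights = [np.random.randn(d_in, d_out)
                        for d_in, d_out in zip(sizes[:-1], sizes[1:])]
        # b^l is a row vector of shape (1, d^l), per the 2D-array convention.
        self.biases = [np.random.randn(1, d_out) for d_out in sizes[1:]]

    def num_params(self) -> int:
        # Total number of scalars in theta = {W^1, ..., W^L, b^1, ..., b^L}.
        return sum(W.size for W in self.weights) + sum(b.size for b in self.biases)
```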
Test your work with python grader.py 2 and upload it using the box below:
1.4.2 Feed-forward pass
Use the recurrence relations discussed in the Notation section, in particular:
\[z^l = a^{l-1} W^l + b^l,\\ a^l = \sigma^l(z^l).\]
Return:
- the output of the network, i.e. \(a^L\),
- the list of the \(a^l\) (i.e. \([a^0, \dots, a^L]\)),
- the list of linear combinations \(z^l\) (i.e. \([z^1, \dots, z^L]\)).
The last two lists returned will serve for the back-propagation implementation.
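One possible sketch of this pass, following the recursions above (the method name feed_forward and its exact signature are assumptions; check network.py):

```python
import numpy as np

def feed_forward(self, x: np.ndarray):
    a = x                 # a^0 = x, shape (n, d^0)
    activations = [a]     # will collect [a^0, ..., a^L]
    zs = []               # will collect [z^1, ..., z^L]
    for W, b, sigma in zip(self.weights, self.biases, self.sigmas):
        z = a @ W + b     # z^l = a^{l-1} W^l + b^l; b broadcasts over the n rows
        a = sigma(z)      # a^l = sigma^l(z^l)
        zs.append(z)
        activations.append(a)
    return a, activations, zs
```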
Test your work with python grader.py 3 and upload it using the box below:
2. Actually learning something: back-propagation
READ THIS SECTION CAREFULLY BEFORE YOU ATTEMPT ANY EXERCISE
(Stochastic) Gradient Descent
Recall from the beginning of the TD that we want to train the network parameters by minimizing a given loss function over an entire dataset. One method of doing this is using the gradient descent (GD) algorithm which you have already seen in the lectures; recall the update rule:
\[ \theta = \theta - \eta \nabla_\theta \mathcal{L}, \]
where \(\eta\) is the learning rate, \(\theta\) denotes any learnable parameter in the network, and \(\mathcal{L}\) denotes the sum of the loss over all samples considered (see Section Notations). The update rule above is called GD because the direction of change in the network parameters is the opposite of the gradient of the loss function with respect to those parameters. You can think of this as a ball rolling down a hill to find the lowest point of the landscape; only in this case, the ball is massless because it has no momentum. Note that, contrary to linear and logistic regression, the loss / error function might not be convex, hence the existence of a unique minimizer isn’t guaranteed, nor is it guaranteed that gradient descent will find one.
When the update rule is applied to a random subset of the total dataset, it is called stochastic gradient descent (SGD). SGD is far more efficient (computationally) than GD when the batch size is large enough to approximate the true gradients while being significantly smaller than the full dataset. Running over the entire dataset with SGD once is called an “epoch” (of training).
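To make the batching concrete, here is a hypothetical sketch of one SGD epoch; sgd_epoch and the update_step signature are illustrative, not part of the handout’s API:

```python
import numpy as np

def sgd_epoch(network, X, Y, loss, eta, batch_size):
    # Visit the whole dataset once, in random order, one mini-batch at a time.
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        # Each call applies theta <- theta - eta * grad on this mini-batch only.
        network.update_step(X[batch], Y[batch], loss, eta)
```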
From the GD update rule above, it is clear we will need to compute the gradients of the cost function with respect to the network parameters. This is exactly what back-propagation does, and it is thus a crucial component of almost all neural network learning algorithms. In what follows, we derive the 4 equations of back-propagation. You may want to verify these equations, which can relatively easily be derived using the chain rule for differentiation, or look at the “proofs” below.
Detail of computation of gradients of parameters
Before we start, it is convenient to define the following variable:
\[ \delta^l \equiv \frac{\partial \ell_p}{\partial z^l}. \]
In other words, \(\delta^l\) is the gradient of the loss function for a point \(p\) with respect to the input of the activation function at layer \(l\) of our network. Recall that we must work our way backwards. Thus, let’s start by expressing \(\delta^L = \frac{\partial \ell_p}{\partial z^L}\) in terms of \(a^L\) and \(\sigma^L\).
\[\delta^L = \frac{\partial \ell_p}{\partial z^L}, \textrm{ by definition}.\]
\[\frac{\partial \ell_p}{\partial z^L}= \frac{\partial \ell_p}{\partial a^L} \frac{\partial a^L}{\partial z^L}, \textrm{ by the chain rule}. \]
Since \(a^L = \sigma^L (z^L)\), we have that
\[\frac{\partial a^L}{\partial z^L} = \sigma'^L (z^L).\]
Thus, we are able to conclude that, \[\delta^L = \frac{\partial \ell_p}{\partial a^L} \sigma'^L (z^L) = \nabla_{a^L} \ell_p \cdot \sigma'^L (z^L).\]
Now that we know how to get \(\delta^L\), let’s continue our journey backward and express \(\delta^l\) in terms of \(\delta^{l+1}\):
\[\delta^{l} = \frac{\partial \ell_p}{\partial z^{l}}, \textrm{ by definition.}\]
\[= \frac{\partial \ell_p}{\partial z^{l+1}} \cdot \frac{\partial z^{l+1}}{\partial a^{l}} \cdot \frac{\partial a^{l}}{\partial z^{l}}, \textrm{ by the chain rule.} \]
\[= \delta^{l+1} \cdot (W^{l+1})^T \cdot \sigma'^{l} (z^{l}) .\]
Recall that our end goal is to update parameters \(\theta = \{ W^1, \dots, W^L, b^1, \dots, b^L \}\) in the opposite direction of their respective gradients \(\nabla_{W^1} \ell_p, ..., \nabla_{W^L} \ell_p, \nabla_{b^1} \ell_p, ..., \nabla_{b^L} \ell_p\).
Denote by \((\nabla_{W^l}\ell_p)_{i,j}\) the \((i,j)^{\text{th}}\) entry of \(\nabla_{W^l}\ell_p\).
We have: \[(\nabla_{W^l}\ell_p)_{i,j} = \frac{\partial \ell_p}{\partial W^{l}_{i,j}} = \frac{\partial \ell_p}{\partial z^{l}_{j}} \frac{\partial z^{l}_{j}}{\partial W^{l}_{i,j}} = \delta^l_j \frac{\partial \left( \sum_k \left[a^{l-1}_k W^{l}_{k,j}\right] + b^l_j\right)}{\partial W^{l}_{i,j}}.\]
For \(k \neq i\), \(\dfrac{\partial W^{l}_{k,j}}{\partial W^{l}_{i,j}}=0\), such that \(\dfrac{\partial \left( \sum_k \left[a^{l-1}_k W^{l}_{k,j}\right] + b^l_j\right)}{\partial W^{l}_{i,j}} = a^{l-1}_i.\)
We get: \[(\nabla_{W^l}\ell_p)_{i,j} = a^{l-1}_i \delta^l_j ,\]
hence \[\nabla_{W^l}\ell_p = (a^{l-1})^T \delta^l.\]
The same reasoning for biases yields:
\[(\nabla_{b^l}\ell_p)_j = \frac{\partial \ell_p}{\partial z^l_j} \frac{\partial z^l_j}{\partial b^l_j} = \delta^l_j \frac{\partial \left( \sum_k \left[a^{l-1}_k W^{l}_{k,j}\right] + b^l_j\right)}{\partial b^{l}_{j}} = \delta^l_j.\]
To sum up, the equations that we will need to implement are:
\[ \begin{align} \delta^L &= \nabla_{a^L} \ell_p \cdot \sigma'^L (z^L)\\ \delta^{l} &= \delta^{l+1} \cdot (W^{l+1})^T \cdot \sigma'^{l} (z^{l})\\ \nabla_{W^l}\ell_p &= (a^{l-1})^T \delta^l\\ \nabla_{b^l}\ell_p &= \delta^l \end{align} \]
2.1 Losses
To make good use of these equations to back-propagate the error, we
need a loss function to compute \(\ell_p\). To do that, we provide the
losses.py
module, which is quite similar to
activations.py
in that it provides a base abstract class
Loss
which takes two 2D arrays (thus assuming there could
be several targets) y
and y_hat
for both its
__call__
and prime
class methods, which
respectively implement the sum of the losses over the samples (\(\mathcal{L}\)) and possibly over several
targets, and the derivative of the loss w.r.t. each sample and
target.
Notice the plot method, which will allow us to plot the different loss functions by running the losses.py module (see __main__) like so: python -m TD losses. This will work (and indeed plot something) when the derived classes (see the next two sections) are implemented, and only if you run this command from the directory that contains the TD package (see some explanation of how PYTHONPATH works).
2.1.1 Quadratic
Implement Quadratic s.t. Quadratic(y_hat, y) returns the sum of squared errors over all samples and (possibly) all targets, and its prime class method returns its derivative w.r.t. each sample and each target.
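A minimal sketch, assuming the Loss base class dispatches __call__ through __new__ the same way Activation does:

```python
import numpy as np

class Quadratic(Loss):
    @staticmethod
    def __call__(y_hat: np.ndarray, y: np.ndarray) -> float:
        # Sum of squared errors over all samples (rows) and targets (columns).
        return np.sum((y_hat - y) ** 2)

    @staticmethod
    def prime(y_hat: np.ndarray, y: np.ndarray) -> np.ndarray:
        # Element-wise derivative w.r.t. y_hat: one entry per sample and target.
        return 2.0 * (y_hat - y)
```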
2.1.2 CrossEntropy
Implement CrossEntropy s.t. CrossEntropy(y_hat, y) returns the sum of log losses over all samples and (possibly) all targets, and its prime class method returns its derivative w.r.t. each sample and each target.
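Under the same assumption as for Quadratic, a sketch could be:

```python
import numpy as np

class CrossEntropy(Loss):
    @staticmethod
    def __call__(y_hat: np.ndarray, y: np.ndarray) -> float:
        # Sum of binary log losses over all samples and targets.
        return -np.sum(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

    @staticmethod
    def prime(y_hat: np.ndarray, y: np.ndarray) -> np.ndarray:
        # Derivative of -[y ln(y_hat) + (1 - y) ln(1 - y_hat)] w.r.t. y_hat.
        return -y / y_hat + (1.0 - y) / (1.0 - y_hat)
```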
Plot both loss functions using python -m TD losses as suggested above.
Test your work with python grader.py 4 and upload it using the box below:
2.2 Back-propagation
With these losses defined, allowing the Network class to train boils down to implementing backpropagation (i.e. applying the equations given above to compute the gradients), update_step, which calls backpropagation and applies the GD rule, and train, which runs update_step in a loop (this last step is given; you have to implement the first two in what follows).
2.2.1 Back-propagation pass to compute gradients
The first is the backpropagation method, which can take training sample(s) of arbitrary batch size and a Loss class like the ones above, and computes the gradient of the loss function w.r.t. all the weights and biases using the back-propagation algorithm, summed over all provided samples. Use the formulas above. It should return the lists of \(\nabla_{W^l}\ell\) and \(\nabla_{b^l}\ell\) for \(1 \leq l \leq L\).
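As a starting point, the four equations can translate into NumPy as below, with Python’s 0-indexed lists (a sketch only; adapt it to the actual skeleton in network.py, and note it reuses the feed_forward sketch from part 1):

```python
import numpy as np

def backpropagation(self, x, y, loss):
    y_hat, activations, zs = self.feed_forward(x)   # [a^0..a^L] and [z^1..z^L]
    # delta^L = nabla_{a^L} ell * sigma'^L(z^L)  (element-wise product)
    delta = loss.prime(y_hat, y) * self.sigmas[-1].prime(zs[-1])
    grads_W = [None] * len(self.weights)
    grads_b = [None] * len(self.biases)
    for l in range(len(self.weights) - 1, -1, -1):
        # nabla_{W^l} = (a^{l-1})^T delta^l; the matrix product also sums over samples.
        grads_W[l] = activations[l].T @ delta
        # nabla_{b^l} = delta^l, summed over the sample axis (kept as a (1, d^l) array).
        grads_b[l] = delta.sum(axis=0, keepdims=True)
        if l > 0:
            # delta^l = (delta^{l+1} (W^{l+1})^T) * sigma'^l(z^l)
            delta = (delta @ self.weights[l].T) * self.sigmas[l - 1].prime(zs[l - 1])
    return grads_W, grads_b
```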
Test your work with python grader.py 5 and upload it using the box below:
2.2.2 Update parameters
Next, implement a single epoch of training using the GD update rule to update the weights and biases in update_step, given the gradients of each weight and bias provided by backpropagation.
The update_step method is used successively in a loop by the fit method, as you can see in the Network class implementation. Assuming you have implemented the backpropagation and update_step functions correctly, you should now be able to train the network on real training data.
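For reference, a possible sketch of the GD step (the parameter names, in particular the learning rate eta, are assumptions; check the skeleton):

```python
def update_step(self, x, y, loss, eta):
    grads_W, grads_b = self.backpropagation(x, y, loss)
    # theta <- theta - eta * nabla_theta L, for every weight matrix and bias vector.
    self.weights = [W - eta * gW for W, gW in zip(self.weights, grads_W)]
    self.biases = [b - eta * gb for b, gb in zip(self.biases, grads_b)]
```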
Test your work with python grader.py 6 and upload it using the box below:
You may now play with your implementation to train on some generated sinusoidal data by running the network.py module, e.g. with python -m TD sine. This will work (and indeed plot something) once the classes above are implemented, and only if you run this command from the directory that contains the TD package (see some explanation of how PYTHONPATH works).
3. Xaggle competition
Now that we have a full-fledged neural network implementation, and given that you already know how to do text classification (recall TD5 and TD7), let’s try our luck on image classification.
Grayscale images are 2D matrices where each entry corresponds to the intensity/brightness (0 is black and 255 is white). Similarly, color images are typically 3D arrays where each 2D matrix holds the intensity of one color channel (the 3rd dimension), typically Red-Green-Blue (RGB). Color videos, in turn, are typically 4D arrays: sequences of timestamped color images (e.g. at 24 images per second). If you work at your preferred automaker on autonomous driving, you now have 5D datasets: probably millions of samples (videos) to train on. Say you have 1 million 1-minute samples of color video at 24 Full-HD images per second: that is 1 000 000 samples x 60 seconds x 24 images x 3 colors x 1920 x 1080 pixels, roughly 9e15 values…
Let’s come back down to earth (academia) and work on grayscale images of hand-written digits in mnist.py and network.py (see e.g. get_raw_data and get_xaggle_champion respectively). This is briefly explained hereafter.
Have a look at the data by executing python -m TD mnist, which downloads the MNIST dataset and plots 5 samples per digit. Have a look at the script. It roughly does the following (in particular, see get_raw_data): recall that we assumed the inputs of our network were 2D arrays (#samples, #features). To make grayscale images fit into our framework, we flatten them (either stacking one column after the other, or each row…). Also, the labels are integers from 0 to 9, but we know how to deal with binary classification with the Sigmoid activation function, which outputs a probability, so we make the output a vector of size 10 whose \(i\)th entry represents the sample belonging to class \(i\) (i.e. a handwritten 1 would be encoded as \((0, 1, 0, \dots, 0)\)). Now we can train our network to recognize them! Feel free to manipulate the provided dataset, tune your network’s hyperparameters, come up with good evaluation metrics…
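As an illustration of the flattening and one-hot encoding just described (the arrays below are made up; get_raw_data does the real work):

```python
import numpy as np

images = np.random.randint(0, 256, size=(5, 28, 28))  # fake grayscale "digits"
labels = np.array([3, 0, 7, 1, 9])

X = images.reshape(len(images), -1)       # (#samples, 784): one flattened row per image
Y = np.zeros((len(labels), 10))
Y[np.arange(len(labels)), labels] = 1.0   # one-hot targets of shape (#samples, 10)
```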
Now that we have appropriate 2D arrays ((#samples, #features), (#samples, #targets)), we can use our neural net! Have a look at network.py, in particular hyperparameters and get_xaggle_champion. You must tweak hyperparameters s.t. python grader.py 7 yields the best score on the private test set (beware that this won’t work locally since you don’t have the private test set). To achieve this and train some model locally before submitting your champion, you can make use of python -m TD xaggle, which instantiates and fits a neural network, given your choice of hyperparameters, on the MNIST train data fetched above, and computes an accuracy on a public validation set. You can assume (at your own risk!) that maximizing the accuracy on the public validation set will maximize the accuracy on the private test set.
Recall that python -m TD mnist and python -m TD network (sine|xaggle) will only work if you run them from the directory that contains the TD package (see some explanation of how PYTHONPATH works).