CSC_4302_EP - Random Forests and Ensemble Methods
Download and unzip the archive CSC43042EP-td8-1-handin.zip. It contains in particular:
- the file grader.py, which is used for the various test scripts;
- various CSV files that we will use.
Log in to be able to upload your files.
Context
- This TD corresponds to lecture 8.
- We will implement Decision Trees as a method for supervised classification.
- Then, we will use the Random Forest technique, which consists in taking a majority vote among several randomly constructed decision trees. This is an example of an ensemble method.
Structure
The TD is structured as follows:
- In Part I we will implement a version of decision trees, with a splitting based on the best entropy gain;
- In Part II we will implement the random forest strategy;
- Part III describes how to perform similar tasks with the existing library scikit-learn.
We will start (again!) on the example of the IRIS dataset, but we will try something new (Fashion MNIST) near the end.
Loading the datasets
Iris
For this TD we will work with two datasets. To start with (and keep computations light), we will work once again with the iris flower database, which is a database of 150 iris flowers measured according to four features, and already labelled into three categories.
import matplotlib.pyplot as plt
from sklearn import datasets
iris = datasets.load_iris()
X, y = iris.data, iris.target
print(X.shape, y.shape)
Here is a visualisation of the dataset according to the first two features:

Plotting code
_, ax = plt.subplots()
scatter = ax.scatter(iris.data[:, 0], iris.data[:, 1], c=iris.target)
ax.set(xlabel=iris.feature_names[0], ylabel=iris.feature_names[1])
_ = ax.legend(
    scatter.legend_elements()[0], iris.target_names, loc="lower right", title="Classes"
)
We will split the dataset into a train set (for learning) and a test set (to measure accuracy) deterministically:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=571)
(Please note the seed value 571
which ensures we
will all work with exactly the same data.)
Fashion MNIST
We will also work with the “Fashion MNIST” dataset, which is a set of 70,000 grayscale images of clothes (28x28 resolution) from the Zalando commercial site. The images are classified into ten categories:

For this TD, to keep computation time shorter, we have prepared a simplified version keeping only four categories (0,1,3,9) and many fewer samples. We have already split the data into a train set (2000 samples) and a test set (500 samples). Use the following code to load the dataset:
=np.loadtxt("csv/fashion-mnist_2000_X_train.csv", delimiter=",", dtype=float)
X_train=np.loadtxt("csv/fashion-mnist_2000_y_train.csv", delimiter=",", dtype=int)
y_train=np.loadtxt("csv/fashion-mnist_500_X_test.csv", delimiter=",", dtype=float)
X_test=np.loadtxt("csv/fashion-mnist_500_y_test.csv", delimiter=",", dtype=int)
y_testprint(f"Shapes {X_train.shape, y_train.shape, X_test.shape, y_test.shape}")
You can of course download the full Fashion MNIST dataset online and try your code on it for your own experiments (we encourage you to do so!).
Part I: Decision trees
Our version of decision trees will split the data at each node according to a single feature, depending on whether it is above or below a certain threshold. Therefore, each node of a decision tree must contain the following information:
- its left and right subtrees,
- the index of the feature used to split at this node, and
- the threshold value used for splitting.
We will also set
- a maximum depth for the tree that will be passed down to nodes, and
- a value for the node, that will be equal to the predicted class label when the node is a leaf.
Open the file DecisionTree.py
and copy
the following code into it:
class DecisionTree:
    max_depth: int
    feature_index: int
    threshold: float
    value: int
    left: 'DecisionTree'
    right: 'DecisionTree'
Add an initializer
__init__(self: Self, max_depth:int)
to the class
DecisionTree
. This initializer should set the value of
self.max_depth
, and initialize all other member variables
to None
.
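For reference, a minimal sketch of such an initializer, repeated here together with the member declarations so that it runs on its own (the Self annotation assumes Python 3.11+ or typing_extensions):

from typing import Self  # needed for the `Self` annotation (Python 3.11+)


class DecisionTree:
    max_depth: int
    feature_index: int
    threshold: float
    value: int
    left: 'DecisionTree'
    right: 'DecisionTree'

    def __init__(self: Self, max_depth: int):
        self.max_depth = max_depth
        # All other members start as None; they will be filled in by `fit`.
        self.feature_index = None
        self.threshold = None
        self.value = None
        self.left = None
        self.right = None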
1. Splitting based on entropy gain
Given a list \(y=(y_1,\dots,y_\ell)\) of class labels \(y_i\in\{1,\dots,d\}\), the entropy of \(y\) is defined to be \[ H(y) = - \sum_{i=1}^{d} p_i \log_2 p_i \] where \(p_i\) is the proportion of elements of \(y\) which are equal to \(i\) (of course the function \(p\mapsto p\log_2 p\) is extended by continuity at zero, even if this will not be useful today).
Implement a static method
entropy(y:np.ndarray)->float
in the
DecisionTree
class. This method should return the entropy
of the vector \(y\). As usual we assume
here that numpy
is imported as np
. You
might find np.unique
useful here.
Your method entropy (and best_split and best_split_pair below) is static: that is, the method belongs to the class DecisionTree, but does not depend on a particular instance (object) of the class. In particular, you will notice that it has no self parameter to refer to the active object. We declare entropy to be static explicitly using the @staticmethod decorator, as follows:

@staticmethod
def entropy(y: np.array) -> float:
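As a hint, here is one possible sketch using np.unique with return_counts=True (assuming the labels form a 1-D integer array; adapt it as you see fit):

# Inside the DecisionTree class:
@staticmethod
def entropy(y: np.ndarray) -> float:
    """Shannon entropy (base 2) of the label vector y."""
    # Count occurrences of each distinct label and turn them into proportions p_i.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    # Labels absent from y do not appear in `counts`, so p contains no zeros.
    return float(-np.sum(p * np.log2(p)))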
Test your implementation by running python ./grader.py 1.
Upload your DecisionTree.py
file
here:
2. Best splits
Now consider a list of vectors of the form \(X=(x_1, \dots, x_\ell)\), with \(x_i\in \mathbb{R}^d\), represented as a
matrix of shape \((\ell,d)\). Let \(y\) be the list of class labels for the
vectors in \(X\). Fix an index \(s\in \{1,\dots,d\}\) (represented by a
feature_index: int
) and a threshold value \(\theta \in \mathbb{R}\). We let \(X_-\) and \(X_+\) be the sublists (submatrices) of
\(X\) consisting of vectors whose \(s\)-th coordinate (feature) is at most
\(\theta\) (for \(X_-\)) or more than \(\theta\) (for \(X_+\)), and let \(y_+\) and \(y_-\) be the lists of class labels
corresponding to \(X_+\) and \(X_-\), respectively. The entropy
gain is defined to be \[
\mathrm{gain}(s, \theta) = H(y)- \frac{\ell_-}{\ell} H(y_-)-
\frac{\ell_+}{\ell}H(y_+),
\] where \(\ell\), \(\ell_-\), and \(\ell_+\) are the lengths of \(y\), \(y_-\), and \(y_+\), respectively.
Implement two static methods in the
DecisionTree
class:
- best_split(X:np.ndarray, y:np.ndarray, feature_index: int) -> (float, float) returns the pair of values \((\mathrm{gain}(s,\theta), \theta)\) for which the entropy gain is maximal with the given feature index. To keep our code easily testable, we will restrict to values of \(\theta\) which appear among the \(s\)-th coordinates of elements of \(X\).
- best_split_pair(X:np.ndarray, y:np.ndarray) -> (float, int, float) returns the triple of values \((\mathrm{gain}(s, \theta), s, \theta)\) for which the entropy gain \(\mathrm{gain}(s,\theta)\) is maximal.
Like entropy
above, best_split
and
best_split_pair
must be static methods,
declared using the @staticmethod
decorator as explained
above.
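One possible (unoptimized) sketch, built directly from the definition of the entropy gain above; it simply tries every observed value of the chosen feature as a threshold. Tie-breaking and degenerate splits (where one side is empty) are handled here in one arbitrary way, and may need adjusting to match the grader exactly:

# Inside the DecisionTree class (assumes the `entropy` method from question 1):
@staticmethod
def best_split(X: np.ndarray, y: np.ndarray, feature_index: int) -> (float, float):
    """Best (gain, threshold) for one feature, trying every observed value as threshold."""
    h = DecisionTree.entropy(y)
    n = len(y)
    best_gain, best_theta = 0.0, None
    for theta in np.unique(X[:, feature_index]):        # candidate thresholds
        mask = X[:, feature_index] <= theta             # defines X_- (True) and X_+ (False)
        y_minus, y_plus = y[mask], y[~mask]
        if len(y_minus) == 0 or len(y_plus) == 0:       # degenerate split: gain is 0
            continue
        gain = (h
                - len(y_minus) / n * DecisionTree.entropy(y_minus)
                - len(y_plus) / n * DecisionTree.entropy(y_plus))
        if best_theta is None or gain > best_gain:
            best_gain, best_theta = gain, theta
    return best_gain, best_theta

@staticmethod
def best_split_pair(X: np.ndarray, y: np.ndarray) -> (float, int, float):
    """Best (gain, feature_index, threshold) over all features."""
    best = (-1.0, None, None)
    for s in range(X.shape[1]):
        gain, theta = DecisionTree.best_split(X, y, s)
        if gain > best[0]:
            best = (gain, s, theta)
    return best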
Check that calling
DecisionTree.best_split
on the iris dataset with
feature_index
equal to 0
or 1
returns respectively threshold values of 5.5
and
3.3
. Intuitively, these correspond to the best separating
vertical line and horizontal line on the picture
above. But this is not the best possible split: indeed,
check that when calling
DecisionTree.best_split_pair
, the third coordinate (feature
2
) gives a better entropy gain.
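For instance, a quick sanity check along those lines (with X, y being the full iris data loaded in the introduction) might look like:

gain0, theta0 = DecisionTree.best_split(X, y, feature_index=0)
gain1, theta1 = DecisionTree.best_split(X, y, feature_index=1)
print(theta0, theta1)                    # expected thresholds from the text above: 5.5 and 3.3

gain, s, theta = DecisionTree.best_split_pair(X, y)
print(gain, s, theta)                    # s should be 2, with a larger entropy gain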
Test your implementation further by running
python ./grader.py 2
.
Upload your DecisionTree.py
file
here:
3. Constructing the tree
Now add a method
fit(self:Self, X:np.ndarray, y:np.ndarray, depth:int=0)->None
to the class DecisionTree
. Your fit
method
should do the following:
- if depth is equal to the maximum depth, or if y contains a unique (possibly repeated) value, we create a leaf node, and we assign its value according to a majority vote;
- otherwise, we select the feature \(s\) (and associated threshold \(\theta\)) for which the entropy gain is the largest possible. We then split the data according to this split and create a left and right subtree recursively.
Notice that the fit
method is not
static.
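A possible recursive sketch of fit, following the two cases above (the majority vote and the tie-breaking conventions are one choice among several; adapt as needed to reproduce the grader's output):

# Inside the DecisionTree class:
def fit(self: Self, X: np.ndarray, y: np.ndarray, depth: int = 0) -> None:
    # Leaf case: maximum depth reached, or only one distinct label left.
    if depth == self.max_depth or len(np.unique(y)) == 1:
        values, counts = np.unique(y, return_counts=True)
        self.value = values[np.argmax(counts)]           # majority vote
        return
    # Internal node: split on the (feature, threshold) pair with the best entropy gain.
    _, self.feature_index, self.threshold = DecisionTree.best_split_pair(X, y)
    mask = X[:, self.feature_index] <= self.threshold
    self.left = DecisionTree(self.max_depth)
    self.right = DecisionTree(self.max_depth)
    self.left.fit(X[mask], y[mask], depth + 1)
    self.right.fit(X[~mask], y[~mask], depth + 1)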
Now test your code on the iris dataset (see the introduction to load it) using the following basic tree-printing function:
def print_tree(node: DecisionTree, depth=0):
    if node.value is not None:
        print((' ' * 2 * depth) + f"- class={node.value}")
    else:
        print((' ' * 2 * depth) + f"- X[{node.feature_index}] <= {node.threshold}")
        print_tree(node.left, depth + 1)
        print_tree(node.right, depth + 1)

mytree = DecisionTree(max_depth=3)
mytree.fit(iris.data, iris.target)
print_tree(mytree)
Make sure that you get the following output:
- X[2] <= 1.9
  - class=0
  - X[3] <= 1.7
    - X[2] <= 4.9
      - class=1
      - class=2
    - X[2] <= 4.8
      - class=2
      - class=2
Test your implementation further by running
python ./grader.py 3
.
Upload your DecisionTree.py
file
here:
4. The classifier
Now that the tree has been constructed, we can use it to predict the class of incoming new data.
Implement a method
predict(self:Self, X:np.ndarray) -> int
in the
class DecisionTree
. This method should predict the class of
the given vector \(X\) of features.
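A minimal sketch, walking down the tree from the root (assuming X is a single 1-D feature vector, as in the check below):

# Inside the DecisionTree class:
def predict(self: Self, X: np.ndarray) -> int:
    node = self
    while node.value is None:                 # value is set only on leaf nodes
        if X[node.feature_index] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.value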
Test your method with this simple check: on the iris
dataset with max_depth=3
, your predict
method
should return the class 1
for the vector
[1.,1.,2.,1.7]
.
Test your implementation further by running
python ./grader.py 4
.
Upload your DecisionTree.py
file
here:
Part II: Ensemble method: Random forest
We now implement the Random Forest strategy, which consists in building nbtrees decision trees from our data and taking a majority vote among the obtained classifiers.
5. Building the random forest
Each decision tree in the forest is constructed from a subset of the
train set obtained by picking elements uniformly at random,
independently and with repetition. The number of elements
picked in each subset is a proportion ratio: float
of the
total number of samples.
Add the class RandomForest
below to
your DecisionTree
file, and complete the
fit
method:
class RandomForest:
    """
    Attributes:
    - trees: np.array
    """
    def __init__(self: Self, nbtrees=1, max_depth=1):
        self.trees = np.array([DecisionTree(max_depth) for _ in range(nbtrees)])

    def fit(self: Self, X: np.array, y: np.array, ratio=0.3) -> None:
        """Build the decision trees in the `trees` array,
        each using a proportion `ratio` of the data.
        """
        for tree in self.trees:
            ...
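One possible way to complete fit is sketched below: draw a bootstrap sample of indices with np.random.randint (uniform, independent, with repetition) and fit each tree on it. Treat this as a sketch, not the required solution:

# Inside the RandomForest class: one possible completion of fit.
def fit(self: Self, X: np.array, y: np.array, ratio=0.3) -> None:
    """Build the decision trees in the `trees` array,
    each using a proportion `ratio` of the data.
    """
    n_samples = int(ratio * len(X))                        # size of each bootstrap sample
    for tree in self.trees:
        # Pick indices uniformly at random, independently and with repetition.
        idx = np.random.randint(0, len(X), size=n_samples)
        tree.fit(X[idx], y[idx])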
Test your implementation of fit
by
running python ./grader.py 5
6. Predictions
Add a method
def predict(self:Self, X:np.ndarray) -> int
to
the RandomForest
class. Your predict
method
should use a majority rule among its trees.
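A minimal sketch (assuming, as before, that X is a single feature vector; ties between classes are broken arbitrarily by np.argmax):

# Inside the RandomForest class:
def predict(self: Self, X: np.ndarray) -> int:
    # Ask each tree for its class, then return the most frequent one (majority vote).
    votes = np.array([tree.predict(X) for tree in self.trees])
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]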
Test your code on the train/test datasets loaded in Loading the datasets above:
- on the IRIS dataset, you must obtain an accuracy of (at least) 93% with (nb_trees, depth, ratio)=(50,7,0.3). By tuning these parameters you should be able to reach more than 96% in a reproducible way.
- on (our shortened version of) the MNIST Fashion dataset, you should be able to obtain an accuracy of (at least) 90%, depending on parameters and the computational power used.
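For example, the accuracy on the iris test split from the introduction can be measured with a few lines like these (the exact number depends on the random bootstrap samples):

forest = RandomForest(nbtrees=50, max_depth=7)
forest.fit(X_train, y_train, ratio=0.3)                      # iris train split from the introduction
predictions = np.array([forest.predict(x) for x in X_test])
accuracy = np.mean(predictions == y_test)
print(f"Accuracy: {accuracy:.3f}")                           # expect at least 0.93 on iris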
Test your implementation further by running
python ./grader.py 6
.
Upload your DecisionTree.py
file
here:
Part III: Parameter tuning challenge
To go further, you can now try to use better parameter values:
- What seems to be the best ratio and number of trees for each of the two training sets? What accuracy can you reach?
- Can you answer the previous question (trying to determine the best parameters) using only the training set (and not the test set)?
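One common approach to the second question, sketched below, is to hold out a validation subset of the training data (or to cross-validate) and compare parameter settings on it; the candidate settings, split size, and seed here are arbitrary placeholders:

from sklearn.model_selection import train_test_split

# Hold out part of the *training* data as a validation set.
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)

best = (0.0, None)
for nbtrees, depth, ratio in [(50, 7, 0.3), (100, 10, 0.5), (200, 12, 0.7)]:  # placeholder settings
    forest = RandomForest(nbtrees=nbtrees, max_depth=depth)
    forest.fit(X_tr, y_tr, ratio=ratio)
    preds = np.array([forest.predict(x) for x in X_val])
    acc = np.mean(preds == y_val)
    if acc > best[0]:
        best = (acc, (nbtrees, depth, ratio))
print("Best parameters on the validation split:", best)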
This question will be graded according to the accuracy you reach on the MNIST Fashion (reduced) dataset:
- less than 90%: D
- 90%-92%: C
- 92%-93%: B
- 93%-94.5%: A
- more than 94.5%: A+
Once you are done, copy the following method into
your RandomForest
class, and modify the three numerical
values nb_trees
, max_depth
, and
ratio
to match your results above for the MNIST Fashion
set. (These values will be used by the grader as parameters for testing
your RandomForest
implementation.)
def my_MNIST_Fashion_parameters() -> (int, int, float):
    # Modify RHS in the next line:
    nb_trees, max_depth, ratio = 10, 2, 0.01
    return (nb_trees, max_depth, ratio)
Test your implementation further by running python ./grader.py 7. Be careful: the full test has to run below the grader's time limit (120 seconds, otherwise you get a json-related error), which sets a hard computational limit on what you can afford to do!
Upload your DecisionTree.py
file
here:
For the record: random forests using existing libraries
The scikit-learn library is already equipped with efficient implementations of Random Forests. For example, the following code performs essentially what we have been doing on the iris dataset:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X = iris.data
y = iris.target

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=571)

forest = RandomForestClassifier(n_estimators=100, max_depth=10)
forest.fit(Xtrain, ytrain)
yGuessed = forest.predict(Xtest)
accuracy = accuracy_score(ytest, yGuessed)
print("Accuracy:", accuracy)
Feel free to compare the performance (accuracy/speed) of these functions with your own code!