CSC_4302_EP - Random Forests and Ensemble Methods
Download and unzip the archive CSC43042EP-td8-1-handin.zip. It contains in particular:
- the file grader.py, which is used for the various test scripts;
- various CSV files that we will use.
Log in to be able to upload your files.
Context
- This TD corresponds to lecture 8.
- We will implement Decision Trees as a method for supervised classification.
- Then, we will use the Random Forest technique, which consists in taking a majority vote among several randomly constructed decision trees. This is an example of an ensemble method.
Structure
The TD is structured as follows:
- In Part I we will implement a version of decision trees, with a splitting based on the best entropy gain;
- In Part II we will implement the random forest strategy;
- Part III describes how to perform similar tasks with the existing library scikit-learn.
We will start (again!) on the example of the IRIS dataset, but we will try something new (Fashion MNIST) near the end.
Loading the datasets
Iris
For this TD we will work with two datasets. To start with (and keep computations light), we will work once again with the iris flower database, which is a database of 150 iris flowers measured according to four features, and already labelled into three categories.
import matplotlib.pyplot as plt
from sklearn import datasets
iris = datasets.load_iris()
X, y = iris.data, iris.target
print(X.shape, y.shape)
Here is a visualisation of the dataset according to the first two features:

Plotting code
_, ax = plt.subplots()
scatter = ax.scatter(iris.data[:, 0], iris.data[:, 1], c=iris.target)
ax.set(xlabel=iris.feature_names[0], ylabel=iris.feature_names[1])
_ = ax.legend(
    scatter.legend_elements()[0], iris.target_names, loc="lower right", title="Classes"
)
We will split the dataset into a train set (for learning) and a test set (to measure accuracy) deterministically:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=571)
(Please note the seed value 571
which ensures we
will all work with exactly the same data.)
Fashion MNIST
We will also work with the “Fashion MNIST” dataset, which is a set of 70,000 grayscale images of clothes (28x28 resolution) from the Zalando commercial site. The images are classified into ten categories:

For this TD, to keep computation time shorter, we have prepared a simplified version keeping only four categories (0,1,3,9) and many fewer samples. We have already split the data into a train set (2000 samples) and a test set (500 samples). Use the following code to load the dataset:
=np.loadtxt("csv/fashion-mnist_2000_X_train.csv", delimiter=",", dtype=float)
X_train=np.loadtxt("csv/fashion-mnist_2000_y_train.csv", delimiter=",", dtype=int)
y_train=np.loadtxt("csv/fashion-mnist_500_X_test.csv", delimiter=",", dtype=float)
X_test=np.loadtxt("csv/fashion-mnist_500_y_test.csv", delimiter=",", dtype=int)
y_testprint(f"Shapes {X_train.shape, y_train.shape, X_test.shape, y_test.shape}")
You can of course download the full Fashion MNIST dataset online and try your code on it for your own experiments (we encourage you to do so!).
Part I: Decision trees
Our version of decision trees will split the data at each node according to a single feature, depending on whether it is above or below a certain threshold. Therefore, each node of a decision tree must contain the following information:
- its left and right subtrees,
- the index of the feature used to split at this node, and
- the threshold value used for splitting.
We will also set
- a maximum depth for the tree that will be passed down to nodes, and
- a value for the node, that will be equal to the predicted class label when the node is a leaf.
Open the file DecisionTree.py
and copy
the following code into it:
class DecisionTree:
    max_depth: int
    feature_index: int
    threshold: float
    value: int
    left: 'DecisionTree'
    right: 'DecisionTree'
Add an initializer
__init__(self: Self, max_depth:int)
to the class
DecisionTree
. This initializer should set the value of
self.max_depth
, and initialize all other member variables
to None
.
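For reference, a minimal sketch of such an initializer, repeated here together with the member declarations so that it runs on its own (the Self annotation assumes Python 3.11+ or typing_extensions):

from typing import Self  # needed for the `Self` annotation (Python 3.11+)


class DecisionTree:
    max_depth: int
    feature_index: int
    threshold: float
    value: int
    left: 'DecisionTree'
    right: 'DecisionTree'

    def __init__(self: Self, max_depth: int):
        self.max_depth = max_depth
        # All other members start as None; they will be filled in by `fit`.
        self.feature_index = None
        self.threshold = None
        self.value = None
        self.left = None
        self.right = None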
1. Splitting based on entropy gain
Given a list \(y=(y_1,\dots,y_\ell)\) of class labels \(y_i\in\{1,\dots,d\}\), the entropy of \(y\) is defined to be \[ H(y) = - \sum_{i=1}^{d} p_i \log_2 p_i \] where \(p_i\) is the proportion of elements of \(y\) which are equal to \(i\) (of course the function \(p\mapsto p\log_2 p\) is extended by continuity at zero, even if this will not be useful today).
Implement a static method
entropy(y:np.ndarray)->float
in the
DecisionTree
class. This method should return the entropy
of the vector \(y\). As usual we assume
here that numpy
is imported as np
. You
might find np.unique
useful here.
Your method entropy (and best_split and best_split_pair below) is static: that is, the method belongs to the class DecisionTree, but does not depend on a particular instance (object) of the class. In particular, you will notice that it has no self parameter to refer to the active object. We declare entropy to be static explicitly using the @staticmethod decorator, as follows:

@staticmethod
def entropy(y: np.array) -> float:
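As a hint, here is one possible sketch using np.unique with return_counts=True (assuming the labels form a 1-D integer array; adapt it as you see fit):

# Inside the DecisionTree class:
@staticmethod
def entropy(y: np.ndarray) -> float:
    """Shannon entropy (base 2) of the label vector y."""
    # Count occurrences of each distinct label and turn them into proportions p_i.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    # Labels absent from y do not appear in `counts`, so p contains no zeros.
    return float(-np.sum(p * np.log2(p)))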
Test your implementation by running python ./grader.py 1.
Upload your DecisionTree.py
file
here:
2. Best splits
Now consider a list of vectors of the form \(X=(x_1, \dots, x_\ell)\), with \(x_i\in \mathbb{R}^d\), represented as a
matrix of shape \((\ell,d)\). Let \(y\) be the list of class labels for the
vectors in \(X\). Fix an index \(s\in \{1,\dots,d\}\) (represented by a
feature_index: int
) and a threshold value \(\theta \in \mathbb{R}\). We let \(X_-\) and \(X_+\) be the sublists (submatrices) of
\(X\) consisting of vectors whose \(s\)-th coordinate (feature) is at most
\(\theta\) (for \(X_-\)) or more than \(\theta\) (for \(X_+\)), and let \(y_+\) and \(y_-\) be the lists of class labels
corresponding to \(X_+\) and \(X_-\), respectively. The entropy
gain is defined to be \[
\mathrm{gain}(s, \theta) = H(y)- \frac{\ell_-}{\ell} H(y_-)-
\frac{\ell_+}{\ell}H(y_+),
\] where \(\ell\), \(\ell_-\), and \(\ell_+\) are the lengths of \(y\), \(y_-\), and \(y_+\), respectively.
Implement two static methods in the
DecisionTree
class:
- best_split(X:np.ndarray, y:np.ndarray, feature_index: int) -> (float, float) returns the pair of values \((\mathrm{gain}(s,\theta), \theta)\) for which the entropy gain is maximal with the given feature index. To keep our code easily testable, we will restrict to values of \(\theta\) which appear among the \(s\)-th coordinates of elements of \(X\).
- best_split_pair(X:np.ndarray, y:np.ndarray) -> (float, int, float) returns the triple of values \((\mathrm{gain}(s, \theta), s, \theta)\) for which the entropy gain \(\mathrm{gain}(s,\theta)\) is maximal.
Like entropy
above, best_split
and
best_split_pair
must be static methods,
declared using the @staticmethod
decorator as explained
above.
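One possible (unoptimized) sketch, built directly from the definition of the entropy gain above; it simply tries every observed value of the chosen feature as a threshold. Tie-breaking and degenerate splits (where one side is empty) are handled here in one arbitrary way, and may need adjusting to match the grader exactly:

# Inside the DecisionTree class (assumes the `entropy` method from question 1):
@staticmethod
def best_split(X: np.ndarray, y: np.ndarray, feature_index: int) -> (float, float):
    """Best (gain, threshold) for one feature, trying every observed value as threshold."""
    h = DecisionTree.entropy(y)
    n = len(y)
    best_gain, best_theta = 0.0, None
    for theta in np.unique(X[:, feature_index]):        # candidate thresholds
        mask = X[:, feature_index] <= theta             # defines X_- (True) and X_+ (False)
        y_minus, y_plus = y[mask], y[~mask]
        if len(y_minus) == 0 or len(y_plus) == 0:       # degenerate split: gain is 0
            continue
        gain = (h
                - len(y_minus) / n * DecisionTree.entropy(y_minus)
                - len(y_plus) / n * DecisionTree.entropy(y_plus))
        if best_theta is None or gain > best_gain:
            best_gain, best_theta = gain, theta
    return best_gain, best_theta

@staticmethod
def best_split_pair(X: np.ndarray, y: np.ndarray) -> (float, int, float):
    """Best (gain, feature_index, threshold) over all features."""
    best = (-1.0, None, None)
    for s in range(X.shape[1]):
        gain, theta = DecisionTree.best_split(X, y, s)
        if gain > best[0]:
            best = (gain, s, theta)
    return best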
Check that calling
DecisionTree.best_split
on the iris dataset with
feature_index
equal to 0
or 1
returns respectively threshold values of 5.5
and
3.3
. Intuitively, these correspond to the best separating
vertical line and horizontal line on the picture
above. But this is not the best possible split: indeed,
check that when calling
DecisionTree.best_split_pair
, the third coordinate (feature
2
) gives a better entropy gain.
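For instance, a quick sanity check along those lines (with X, y being the full iris data loaded in the introduction) might look like:

gain0, theta0 = DecisionTree.best_split(X, y, feature_index=0)
gain1, theta1 = DecisionTree.best_split(X, y, feature_index=1)
print(theta0, theta1)                    # expected thresholds from the text above: 5.5 and 3.3

gain, s, theta = DecisionTree.best_split_pair(X, y)
print(gain, s, theta)                    # s should be 2, with a larger entropy gain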
Test your implementation further by running
python ./grader.py 2
.
Upload your DecisionTree.py
file
here:
3. Constructing the tree
Now add a method
fit(self:Self, X:np.ndarray, y:np.ndarray, depth:int=0)->None
to the class DecisionTree
. Your fit
method
should do the following:
- if depth is equal to the maximum depth, or if y contains a unique (possibly repeated) value, we create a leaf node, and we assign its value according to a majority vote;
- otherwise, we select the feature \(s\) (and associated threshold \(\theta\)) for which the entropy gain is the largest possible. We then split the data according to this split and create a left and right subtree recursively.
Notice that the fit
method is not
static.
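A possible recursive sketch of fit, following the two cases above (the majority vote and the tie-breaking conventions are one choice among several; adapt as needed to reproduce the grader's output):

# Inside the DecisionTree class:
def fit(self: Self, X: np.ndarray, y: np.ndarray, depth: int = 0) -> None:
    # Leaf case: maximum depth reached, or only one distinct label left.
    if depth == self.max_depth or len(np.unique(y)) == 1:
        values, counts = np.unique(y, return_counts=True)
        self.value = values[np.argmax(counts)]           # majority vote
        return
    # Internal node: split on the (feature, threshold) pair with the best entropy gain.
    _, self.feature_index, self.threshold = DecisionTree.best_split_pair(X, y)
    mask = X[:, self.feature_index] <= self.threshold
    self.left = DecisionTree(self.max_depth)
    self.right = DecisionTree(self.max_depth)
    self.left.fit(X[mask], y[mask], depth + 1)
    self.right.fit(X[~mask], y[~mask], depth + 1)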
Now test your code on the iris dataset (see the introduction to load it) using the following basic tree-printing function:
def print_tree(node: DecisionTree, depth=0):
    if node.value is not None:
        print((' ' * 2 * depth) + f"- class={node.value}")
    else:
        print((' ' * 2 * depth) + f"- X[{node.feature_index}] <= {node.threshold}")
        print_tree(node.left, depth + 1)
        print_tree(node.right, depth + 1)

mytree = DecisionTree(max_depth=3)
mytree.fit(iris.data, iris.target)
print_tree(mytree)
Make sure that you get the following output:
- X[2] <= 1.9
  - class=0
  - X[3] <= 1.7
    - X[2] <= 4.9
      - class=1
      - class=2
    - X[2] <= 4.8
      - class=2
      - class=2
Test your implementation further by running
python ./grader.py 3
.
Upload your DecisionTree.py
file
here:
4. The classifier
Now that the tree has been constructed, we can use it to predict the class of incoming new data.
Implement a method
predict(self:Self, X:np.ndarray) -> int
in the
class DecisionTree
. This method should predict the class of
the given vector \(X\) of features.
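A minimal sketch, walking down the tree from the root (assuming X is a single 1-D feature vector, as in the check below):

# Inside the DecisionTree class:
def predict(self: Self, X: np.ndarray) -> int:
    node = self
    while node.value is None:                 # value is set only on leaf nodes
        if X[node.feature_index] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.value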
Test your method with this simple check: on the iris
dataset with max_depth=3
, your predict
method
should return the class 1
for the vector
[1.,1.,2.,1.7]
.
Test your implementation further by running
python ./grader.py 4
.
Upload your DecisionTree.py
file
here:
Part II: Ensemble method: Random forest
We now implement the Random Forest strategy, which consists in building nbtrees decision trees from our data and taking a majority vote among the obtained classifiers.
5. Building the random forest
Each decision tree in the forest is constructed from a subset of the
train set obtained by picking elements uniformly at random,
independently and with repetition. The number of elements
picked in each subset is a proportion ratio: float
of the
total number of samples.
Add the class RandomForest
below to
your DecisionTree
file, and complete the
fit
method:
class RandomForest:
    """
    Attributes:
    - trees: np.array
    """
    def __init__(self: Self, nbtrees=1, max_depth=1):
        self.trees = np.array([DecisionTree(max_depth) for _ in range(nbtrees)])

    def fit(self: Self, X: np.array, y: np.array, ratio=0.3) -> None:
        """Build the decision trees in the `trees` array,
        each using a proportion `ratio` of the data.
        """
        for tree in self.trees:
            ...
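One possible way to complete fit is sketched below: draw a bootstrap sample of indices with np.random.randint (uniform, independent, with repetition) and fit each tree on it. Treat this as a sketch, not the required solution:

# Inside the RandomForest class: one possible completion of fit.
def fit(self: Self, X: np.array, y: np.array, ratio=0.3) -> None:
    """Build the decision trees in the `trees` array,
    each using a proportion `ratio` of the data.
    """
    n_samples = int(ratio * len(X))                        # size of each bootstrap sample
    for tree in self.trees:
        # Pick indices uniformly at random, independently and with repetition.
        idx = np.random.randint(0, len(X), size=n_samples)
        tree.fit(X[idx], y[idx])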
Test your implementation of fit
by
running python ./grader.py 5
6. Predictions
Add a method
def predict(self:Self, X:np.ndarray) -> int
to
the RandomForest
class. Your predict
method
should use a majority rule among its trees.
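A minimal sketch (assuming, as before, that X is a single feature vector; ties between classes are broken arbitrarily by np.argmax):

# Inside the RandomForest class:
def predict(self: Self, X: np.ndarray) -> int:
    # Ask each tree for its class, then return the most frequent one (majority vote).
    votes = np.array([tree.predict(X) for tree in self.trees])
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]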
Test your code on the train/test datasets loaded in Loading the datasets above:
- on the IRIS dataset, you must obtain an accuracy of (at least) 93% with (nb_trees, depth, ratio)=(50,7,0.3). By tuning these parameters you should be able to reach more than 96% in a reproducible way.
- on (our shortened version of) the MNIST Fashion dataset, you should be able to obtain an accuracy of (at least) 90%, depending on parameters and the computational power used.
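For example, the accuracy on the iris test split from the introduction can be measured with a few lines like these (the exact number depends on the random bootstrap samples):

forest = RandomForest(nbtrees=50, max_depth=7)
forest.fit(X_train, y_train, ratio=0.3)                      # iris train split from the introduction
predictions = np.array([forest.predict(x) for x in X_test])
accuracy = np.mean(predictions == y_test)
print(f"Accuracy: {accuracy:.3f}")                           # expect at least 0.93 on iris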
Test your implementation further by running
python ./grader.py 6
.
Upload your DecisionTree.py
file
here:
Part III: Parameter tuning challenge
To go further, you can now try to use better parameter values:
- What seems to be the best ratio and number of trees for each of the two training sets? What accuracy can you reach?
- Can you answer the previous question (trying to determine the best parameters) using only the training set (and not the test set)?
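One common approach to the second question, sketched below, is to hold out a validation subset of the training data (or to cross-validate) and compare parameter settings on it; the candidate settings, split size, and seed here are arbitrary placeholders:

from sklearn.model_selection import train_test_split

# Hold out part of the *training* data as a validation set.
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)

best = (0.0, None)
for nbtrees, depth, ratio in [(50, 7, 0.3), (100, 10, 0.5), (200, 12, 0.7)]:  # placeholder settings
    forest = RandomForest(nbtrees=nbtrees, max_depth=depth)
    forest.fit(X_tr, y_tr, ratio=ratio)
    preds = np.array([forest.predict(x) for x in X_val])
    acc = np.mean(preds == y_val)
    if acc > best[0]:
        best = (acc, (nbtrees, depth, ratio))
print("Best parameters on the validation split:", best)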
This question will be graded according to the accuracy you reach on the MNIST Fashion (reduced) dataset:
- less than 90%: D
- 90%-92%: C
- 92%-93%: B
- 93%-94.5%: A
- more than 94.5%: A+
Once you are done, copy the following method into
your RandomForest
class, and modify the three numerical
values nb_trees
, max_depth
, and
ratio
to match your results above for the MNIST Fashion
set. (These values will be used by the grader as parameters for testing
your RandomForest
implementation.)
def my_MNIST_Fashion_parameters() -> (int, int, float):
    # Modify RHS in the next line:
    nb_trees, max_depth, ratio = 10, 2, 0.01
    return (nb_trees, max_depth, ratio)
Test your implementation further by running python ./grader.py 7. Be careful: the full test has to run below the grader's time limit (120 seconds, otherwise you get a json-related error), which sets a hard computational limit on what you can afford to do!
Upload your DecisionTree.py
file
here:
For the record: random forests using existing libraries
The scikit-learn library is already equipped with efficient implementations of Random Forests. For example, the following code performs essentially what we have been doing on the iris dataset:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X = iris.data
y = iris.target

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=571)

forest = RandomForestClassifier(n_estimators=100, max_depth=10)
forest.fit(Xtrain, ytrain)
yGuessed = forest.predict(Xtest)
accuracy = accuracy_score(ytest, yGuessed)
print("Accuracy:", accuracy)
Feel free to compare the performance (accuracy/speed) of these functions with your own code!