
INF442 - Support Vector Machines

Before you start

Context

Structure

This TD is structured in three parts: 1. Quiz; 2. Theory; 3. The real deal: spam detector with SVM.

1. Quiz

Doing the quiz with C++

LibSVM is a library and set of tools for Support Vector Machines. Among its many functionalities (see the documentation for an exhaustive description), LibSVM provides support for different SVM methods (C-SVM, nu-SVM, etc.), binary and non-binary classification (one-vs-one strategy), different kernels (linear, polynomial, RBF, sigmoid or precomputed), and implements cross-validation for optimal parameter selection.

In your working folder, go into the libsvm-3.23/ subfolder and type make in the terminal to compile the source code.

LibSVM has three main tools, which you should find in libsvm-3.23/ after compilation:

  • svm-scale, for preparing data in the specified format and scaling them, if necessary.
  • svm-train, for training the SVM over a training set, to create a prediction model.
  • svm-predict, for using the model to predict labels on a testing dataset.
Basic usage

By default (i.e., if no options are specified), svm-train uses C-SVM for multi-class classification, with parameter \(C=1\), with an RBF kernel:

\[K(x,y) = \exp{(-\gamma \|x-y\|^2)} ,\]

where \(\gamma=\frac{1}{d}\), \(d\) denoting the number of features.

Dataset file format

LibSVM expects datasets in the following (sparse) format:

[label] [index1]:[value1] [index2]:[value2] ...
[label] [index1]:[value1] [index2]:[value2] ...

with a newline character ('\n') at the end of each line.

Each line corresponds to a different point, \(p_i\). The first column corresponds to the point label (i.e., \(y_i\)). The other columns describe the individual non-zero features of the point (i.e., \(x_{\textrm{index}^{(i)}} = \textrm{value}^{(i)}\)). Features not explicitly listed are assumed to be zero.
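
For instance, a point with label 1 whose only non-zero features are \(x_3 = 2.5\) and \(x_7 = 1\) (values purely illustrative) would be encoded as:

1 3:2.5 7:1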

See csv/synt* files for examples.

Working with the command line through the quiz

We will first use LibSVM to predict a few attributes:

  • Gender is indicated in the 2nd column (0 male, 1 female).
  • Age interval (identified by their lower bound) is indicated in the 3rd column.
  • Engine size (of the preferred scooter) is indicated in the 5th column.

For your convenience, to generate libsvm-format datasets, you can either use libsvm-3.23/tools/convert.c, which you need to adapt (selecting columns, field separator), compile (g++ -o convert convert.c), and run (./convert <file_to_read> <file_to_write>), or use the provided Python script quiz/preformat-libsvm.py. The syntax of this script is as follows:

python3 preformat-libsvm.py <file_to_read> <file_to_write> [<id_label> <rest_of_cols> '<char_sep>']

where:

  • <id_label> is the id of the column (the first column has id 1) selected as label (2 for the gender, 3 for the age, 5 for the engine size).
  • <rest_of_cols> lists the columns to be included in the output file (e.g., 1,2,3), starting from 1. You can use “-” to include all columns (i.e., to exclude no column of the original dataset).
Predicting labels

Typical use of SVM consists in training a model over the training dataset and evaluating its performance (the quality of its predictions) over the testing dataset, for which labels are also already known. Running the model over the testing set makes it possible to compute the testing accuracy of the model.

You can use svm-train and svm-predict with the default values (type the name of each program, with no arguments, to get a description of all available options and default values). Use either convert or preformat-libsvm.py to obtain a libSVM-compatible version of the training and the testing datasets, for prediction of gender (column 2), in files scooter-train-gender.svm and scooter-test-gender.svm. Then use svm-train to generate an SVM model (with default parameters, i.e. RBF kernel). Test this model over the testing dataset:

# libSVM format
python3 preformat-libsvm.py ../csv/scooter-train.csv ../csv/scooter-train-gender.svm 2 - ';'
python3 preformat-libsvm.py ../csv/scooter-test.csv ../csv/scooter-test-gender.svm 2 - ';'

# train and store coefficients
./svm-train ../csv/scooter-train-gender.svm ../csv/scooter-model-gender.svm

# predict
./svm-predict ../csv/scooter-test-gender.svm ../csv/scooter-model-gender.svm ../csv/scooter-gender.output

For multi-class classification, LibSVM allows the use of C-SVC and nu-SVC (option -s). These options entail slightly different variants of the same optimization objective, displayed in the next section, with parameter \(c \in [0, +\infty)\) (for C-SVC) or \(\nu \in [0,1]\) (for nu-SVC). In this section, we will use the C-SVC formulation, which svm-train enables by default. Note that the \(c\) parameter in LibSVM corresponds to the tolerance to misclassifications in the formulation recalled in the theory section below.

  • The -c option in svm-train sets the cost value \(c\) explicitly, which controls the tolerance to misclassified items in the training set: for higher values of \(c\), accuracy over the training set will be higher (at the expense of possible overfitting). By default, \(c=1\).

  • The -g option in svm-train sets the \(\gamma\) kernel parameter (if no kernel is specified, svm-train assumes an RBF kernel). Its impact on the predictor thus depends on the kernel used. In general, smaller values of \(\gamma\) give each point a larger reach in the determination of the boundary between classes (and conversely). By default, \(\gamma = \frac{1}{d}\), where \(d\) is the number of features.
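
For instance, to train the gender model with an explicit cost and kernel coefficient (the values below are purely illustrative):

./svm-train -c 100 -g 0.01 ../csv/scooter-train-gender.svm ../csv/scooter-model-gender.svm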

Doing the quiz with Python

You can use the provided Python script svm_with_sklearn.py in the quiz/ folder, which can read standard CSV files as well as libSVM-formatted files and perform the training. Note that it uses libSVM under the hood: you should get exactly the same results as with the C++ tools.

Typing python3 svm_with_sklearn.py should give you an idea of how to use this script:

Syntax: python3 svm_with_sklearn.py <dataset_train> <dataset_test> [<id_label> <rest_of_cols> ['<char_sep>']]

<rest_of_cols> : columns to include in output file (ex: 1,2,3), starting at 0. Use - for including all.

For some of the questions, you should modify the parameters \(C\) and \(\gamma\), as well as the argument to MinMaxScaler, directly in the Python script. Beware that the first column of the CSV files is used as the index (see ident for the scooter dataset, the only dataset loaded as a CSV).

Supplementary information about the scooter dataset

Consider the dataset scooter.csv in the csv folder. It collects the preferences and usage patterns of different scooter users, as well as the users' characteristics (age, gender, level of studies).

We work with a pre-processed version of this dataset, scooter.csv, in which each qualitative characteristic is decomposed into a binary vector (with as many dimensions as there are possible categorical values). The dataset is split into a training sub-dataset, scooter-train.csv, and a testing sub-dataset, scooter-test.csv.

  • For example: type-utilisation, which can be domicile-travail, perso-loisirs or professionnel, is decomposed into a 3-dimensional vector, where \((1,0,0)\) corresponds to domicile-travail, \((0,1,0)\) to perso-loisirs, and \((0,0,1)\) to professionnel.

You can find an exhaustive description of the range and format of the dataset and of its different features in the first lines of the file (those starting with #; use grep '#' scooter.csv to display them in a terminal):

# ident;sexe;age;CSP;type-cylindree;[type-utilisation-a;b;c];[critere-esthetique-a;b;c;d;e;f;g];
  note-satisfaction;imp-magasin;imp-marque;imp-esthetique;imp-prix;imp-confort-pilote;
    imp-confort-passager;imp-dimensions;imp-freinage;imp-cylindree;imp-antivol;imp-tableau-de-bord;
  imp-accessoires;imp-rangement;imp-propulsion;imp-refroidissement;imp-tablier-avant;imp-feux;
  imp-fiabilite-moteur
# sexe: 0 homme, 1 femme
# type-utilisation-a;b;c; : 1;0;0; Domicile-travail | 0;1;0; Perso&loisirs | 0;0;1; Professionnel
# critere-esthetique-a;b;c;d;e;f;g; : 1000000 a la mode | innovant nouveau | sportif | 
  discrete / sobre | classique | retro | ressemble a une moto 
# CSP: 0 inactif, 1 ouvrier, 2 employe, 3 classe moyenne, 4 classe sup moyenne, 5 classe sup
Supplementary information about the synt3 dataset

Consider the synthetic dataset synt3. It consists of 4000 points, each with 2 features, \(x_1\) and \(x_2\), and one label, \(y \in \{0, 1, 2\}\) (see Fig. 1). It has been split between the files synt3-train.svm (for training purposes, 500 points) and synt3-test.svm (for testing purposes, 3500 points). (For your convenience, these datasets are already in LibSVM format.)

SVM prediction relies on the minimization of Euclidean distances between points. If the ranges of the different features are too different, raw distances may artificially penalize features with smaller ranges. To avoid this effect, datasets whose features have very different ranges should be scaled (normalized to a uniform range) before training.

Fig. 1. Dataset synt3.

If you use C++ to solve the quiz, LibSVM can normalize the scale of all features in a dataset. When there are separate training and testing datasets, this can be done with the tool svm-scale in the libsvm-3.23 folder, as follows:

svm-scale -s range_sc.txt training_dataset.dat > training_dataset.scaled.dat
svm-scale -r range_sc.txt testing_dataset.dat > testing_dataset.scaled.dat

Note that the file range_sc.txt makes it possible to apply the same scaling factors to the training and testing datasets; otherwise the scaling might not be consistent. You can use the -l x -u y options to scale to \([x, y]\) (by default: \([-1, 1]\)).

If you use Python to solve the quiz, see scaler in the provided script.

Supplementary information about the synt1 dataset

The synthetic dataset synt1 contains 4000 points of 2 dimensions \((x_1, x_2)\), with labels \(y \in \{0, 1, 2\}\) (see Fig. 2). Like the previously considered synt3, synt1 has also been split between the files synt1-train.svm (for training purposes, 500 points) and synt1-test.svm (for testing purposes, 3500 points). (Note that the synt1 features are already scaled.)

Fig. 2. Dataset synt1.

2. Theory

Given a training set of \(n\) \(d\)-dimensional points \(p_i = (\boldsymbol{x}_i, y_i)\), where \(\boldsymbol{x}_i= (x_{i,1}, ..., x_{i,d})\) and \(y_i \in Y = \{-1, +1\}\) (binary classification), SVMs build the hyperplane that best separates the points of each class (denoted, for convenience, \(y = -1\) and \(y = +1\)). This hyperplane is then used to classify (label) further points into the categories \(\{-1, +1 \}\).

The resulting separating hyperplane is of the form \(\boldsymbol{x}^T \beta - \beta_0 = 0\), such that:

(Eq. 1) \[\beta, \beta_0, (\xi_i)_{i=1}^n = {\arg\!\min}_{\beta, \beta_0, (\xi_i)_{i=1}^n} \frac{1}{n} \sum_{i=1}^n \xi_i + \lambda ||\beta||^2\]

Subject to:

\[\forall i, \xi_i \geq 0 \text{ and } y_i(\boldsymbol{x}_i^T \beta - \beta_0) \geq 1 - \xi_i\]

where \(\xi_i = \max\{ 0, 1 - y_i(\boldsymbol{x}_i^T \beta - \beta_0) \}\) are the slack variables.

The first term of the optimization objective minimizes the training losses, i.e. it penalizes training points misclassified by the hyperplane. The second term maximizes the margin between the closest support vectors of each category.

The \(\lambda\) parameter defines the SVM tradeoff between hyperplane margins and training classification accuracy: higher values of \(\lambda\) lead to hyperplanes with larger margins, even if that increases the number of misclassifications; lower values reduce the number of misclassifications in the training dataset, at the cost of lower margins between predicted classes and at the risk of overfitting.

Setting \(\lambda = \frac{1}{2nC}\) and multiplying the objective by \(nC\), we get:

(Eq. 2) \[\beta, \beta_0, (\xi_i)_{i=1}^n = {\arg\!\min}_{\beta, \beta_0, (\xi_i)_{i=1}^n} C \sum_{i=1}^n \xi_i + \frac{1}{2} ||\beta||^2\]

with the same conditions. The \(C\) parameter, referred to as “cost”, is inversely proportional to \(\lambda\) but is “more traditionally” used in libraries (libSVM and sklearn).

Non-binary classification

The previous mechanism can easily be extended to non-binary classification (i.e., to label sets with \(|Y|>2\)) through different strategies: one-vs-all (one classifier per class, with the samples of all other classes grouped into a single superclass) and one-vs-one (one classifier per pair of classes, the predicted label being the one with the most “votes”; this is the default in libSVM).

Kernels

Kernel functions make it possible to embed the \(d\)-dimensional points of the dataset into a higher-dimensional space (of dimension \(d' \gg d\)), in which separability between classes becomes easier. In particular, this proves useful when the dataset is not linearly separable in its original space. SVM is then used in the higher-dimensional space, i.e. a \((d' - 1)\)-dimensional hyperplane is determined to separate the embedded points of each class (for binary classification).

\(\Phi: \mathbb{R}^d \to \mathbb{R}^{d'}\) denotes the embedding mapping into the higher-dimensional space \(\mathbb{R}^{d'}\). Since SVM optimization relies only on inner products between points, and not on their explicit representation, it does not require the explicit expression of the embedding \(\Phi\): only the inner product in \(\mathbb{R}^{d'}\), i.e. \(K(\boldsymbol{x}, \boldsymbol{y}) = \Phi(\boldsymbol{x})^T \Phi(\boldsymbol{y})\), with \(\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^d\) (the kernel trick). Kernel functions \(K(\cdot, \cdot)\) thus provide a measure of point similarity in the higher-dimensional space in which the dataset is embedded, without our ever needing an explicit representation of the points in that space.
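
As a classical illustration of the kernel trick: in dimension \(d = 2\), the homogeneous polynomial kernel \(K(\boldsymbol{x}, \boldsymbol{y}) = (\boldsymbol{x}^T \boldsymbol{y})^2\) corresponds to the explicit embedding

\[\Phi(\boldsymbol{x}) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \in \mathbb{R}^3,\]

since \(\Phi(\boldsymbol{x})^T \Phi(\boldsymbol{y}) = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x_1 y_1 + x_2 y_2)^2\); the kernel can be evaluated in \(\mathbb{R}^2\) without ever forming \(\Phi(\boldsymbol{x})\).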

Some well-known kernels include the linear, polynomial, RBF and sigmoid kernels; their expressions are given in Section 3.1 below.

Dual problem

In kernel SVM, \(\xi_i\) in (Eq. 2) becomes:

\[ \max \{0, 1 - y_i(\Phi(\boldsymbol{x}_i)^T \beta - \beta_0)\}.\]

To solve (Eq. 2), we can rely on the following Lagrange (primal) function:

\[ L_P = \frac{1}{2} ||\beta||^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i (y_i(\Phi(x_i)^T \beta - \beta_0) - (1 - \xi_i)) - \sum_{i=1}^n \mu_i \xi_i,\]

which we minimize w.r.t. \(\beta\), \(\beta_0\) and \(\xi_i\). Setting the respective derivatives to zero, we get:

\[\beta = \sum_{i=1}^n \alpha_i y_i \Phi(\boldsymbol{x}_i)\] \[\sum_{i=1}^n \alpha_i y_i = 0\] \[\alpha_i = C - \mu_i, \forall i\]

By substituting, we obtain the Lagrangian dual objective function:

(Eq. 3) \[ L_D = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \underbrace{\Phi(\boldsymbol{x}_i)^T \Phi(\boldsymbol{x}_j)}_{K(\boldsymbol{x}_i, \boldsymbol{x}_j)},\]

which gives a lower bound on the objective function (Eq. 2) for any feasible point. We maximize \(L_D\) subject to \(0 \leq \alpha_i \leq C\) and \(\sum_{i=1}^n \alpha_i y_i = 0\). Some additional conditions, outside the scope of the present TD, ensure that the solutions to the primal and dual problems match.

Projected gradient ascent

An “easy” way to maximize (Eq. 3) is to resort to gradient ascent, i.e. to compute the derivative of the objective function w.r.t. each parameter and to update each parameter in the direction of the gradient.

This yields:

(Eq. 4) \[ \frac{\partial L_D}{\partial \alpha_i} = 1 - \sum_{j=1}^n \alpha_j y_i y_j K(\boldsymbol{x}_i, \boldsymbol{x}_j). \]

For some learning rate \(\eta\), we update \(\alpha_i\) at step \((t)\) such that:

(Eq. 5) \[ \tilde{\alpha}_i^{(t+1)} = \alpha_i^{(t)} + \eta \frac{\partial L_D}{\partial \alpha_i}. \]

We then project the update step into the feasible solution set by clipping the \(\alpha_i\):

(Eq. 6) \[ \alpha_i^{(t+1)} = \max (0, \min(C, \tilde{\alpha}_i^{(t+1)})). \]

We stop whenever all \(\alpha_i\) are clipped and/or all partial derivatives are small enough.

The intercept term \(\beta_0\) is defined as:

\[ \beta_0 = \frac{1}{n_S} \sum_{s \in S} \left( \sum_{i=1}^n \alpha_i y_i K(\boldsymbol{x}_i, \boldsymbol{x}_s) - y_s \right), \]

where \(S=\{s \mid 0 < \alpha_s < C \}\) denotes the set of support vectors.

3. The real deal: spam detector with SVM

We can use SVM to classify mails into spam / ham (binary classification). We already had a look at this (binary) mail classification problem in TD6. We recall the relevant definitions:
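
In particular, precision and recall are expressed from the entries of the confusion matrix as:

\[\text{precision} = \frac{TP}{TP+FP}, \qquad \text{recall} = \frac{TP}{TP+FN},\]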

where \(TP\), \(FP\), \(TN\), and \(FN\) stand for the numbers of true positive, false positive, true negative, and false negative classifications, respectively.

To evaluate the performance of SVM on the problem of classifying mails into the spam and ham categories, we consider (as we did in TD6) the dataset mail_train.csv for training and the dataset mail_test.csv for testing the prediction performance.

3.1 Implementing kernels

The linear kernel can be computed as follows:

\[K(x_1, x_2) = x_1^T x_2.\]

The polynomial kernel can be computed as follows:

\[K(x_1, x_2) = (\gamma x_1^T x_2 + \text{coef0})^\text{degree}.\]

The radial basis function kernel can be computed as follows:

\[K(x_1, x_2) = \exp \left( - \gamma || x_1 - x_2 ||_2^2 \right).\]

The sigmoid kernel can be computed as follows:

\[K(x_1, x_2) = \tanh(\gamma x_1^T x_2 + \text{coef0}). \]

The rational-quadratic kernel can be computed as follows:

\[K(x_1, x_2) = 1 - \frac{||x_1 - x_2||_2^2}{||x_1-x_2||_2^2 + {\text{coef0}}} = \frac{\text{coef0}}{||x_1-x_2||_2^2 + {\text{coef0}}}.\]

Implement the linear, polynomial, rbf, sigmoid and rational-quadratic kernels in the eponymous methods of class Kernel in Kernel/Kernel.cpp. You can test your implementation by running make grader and then ./grader 1.
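
As an illustration, here is a minimal sketch of the rbf kernel written as a free function over std::vector<double>; the point representation and the signature are assumptions for the example, so adapt them to the actual interface of class Kernel:

#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of the RBF kernel K(x1, x2) = exp(-gamma * ||x1 - x2||^2).
// The std::vector<double> representation is an assumption; the methods
// of class Kernel may use a different point type.
double rbf(const std::vector<double> &x1,
           const std::vector<double> &x2,
           double gamma) {
    double sq_dist = 0.0;
    for (std::size_t k = 0; k < x1.size(); ++k) {
        const double diff = x1[k] - x2[k];
        sq_dist += diff * diff;
    }
    return std::exp(-gamma * sq_dist);
}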

Upload your file Kernel.cpp here:


3.2 SVM constructor

Implement class SVM’s constructor in Svm/Svm.cpp. Most of it is already done, but you still need to populate train_features and train_labels from the provided dataset (yes, again!). However, remember that labels are most often given in 0/1 form, while SVM assumes -1/+1 labels. Hence, you also need to transform the labels from 0/1 to -1/+1 in the constructor.

Implement compute_kernel, which should populate the attribute computed_kernel with \(y_i y_j K(\boldsymbol{x}_i, \boldsymbol{x}_j)\) (as can be seen in 2. Theory, this term appears both in the objective and in the gradient; note that it is symmetric in \(i\) and \(j\)).
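
A minimal sketch of the corresponding double loop, exploiting the symmetry (the names n, train_labels, computed_kernel and the kernel evaluation k(i, j) are assumptions; use the actual attributes of class SVM):

// Sketch: computed_kernel[i][j] = y_i * y_j * K(x_i, x_j).
// All names below (n, train_labels, computed_kernel, k) are assumptions.
for (int i = 0; i < n; ++i) {
    for (int j = i; j < n; ++j) {  // symmetric in i and j
        const double v = train_labels[i] * train_labels[j] * k(i, j);
        computed_kernel[i][j] = v;
        computed_kernel[j][i] = v;
    }
}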

You may test your code with ./grader 2. Upload your file Svm.cpp here:


3.3 Compute beta_0

Assuming \(\alpha = (\alpha_i)_1^n\) has been correctly estimated, compute the intercept term in compute_beta_0:

\[ \beta_0 = \frac{1}{n_S} \sum_{s \in S} \left( \sum_{i=1}^n \alpha_i y_i K(\boldsymbol{x}_i, \boldsymbol{x}_s) - y_s \right). \]
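
A direct translation could look as follows; the names (n, C, alpha, train_labels, computed_kernel) are assumptions. Since computed_kernel[i][s] stores \(y_i y_s K(\boldsymbol{x}_i, \boldsymbol{x}_s)\) and \(y_s^2 = 1\), we have \(\alpha_i y_i K(\boldsymbol{x}_i, \boldsymbol{x}_s) = \alpha_i y_s \cdot{}\) computed_kernel[i][s]:

// Sketch of compute_beta_0; all member names are assumptions.
double beta_0 = 0.0;
int n_support = 0;
for (int s = 0; s < n; ++s) {
    if (alpha[s] > 0.0 && alpha[s] < C) {  // s belongs to S
        double inner = 0.0;
        for (int i = 0; i < n; ++i)
            inner += alpha[i] * train_labels[s] * computed_kernel[i][s];
        beta_0 += inner - train_labels[s];
        ++n_support;
    }
}
if (n_support > 0)
    beta_0 /= n_support;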

You may test your code with ./grader 3. Upload your file Svm.cpp here:


3.4 Training the SVM

Implement projected gradient ascent on \(\alpha\) by completing the method train in class SVM. The pseudo-code is the following:

  1. stop := False
  2. While not stop do
    1. stop := True
    2. For i from 1 to n do
      1. Compute the gradient of the loss w.r.t. \(\alpha_i\) (Eq. 4)
      2. Update \(\alpha_i\) (Eq. 5 - \(\eta\) is the learning rate lr)
      3. Clip \(\alpha_i\) if necessary (Eq. 6)
      4. If clipping was not necessary and |gradient| > stopping_criterion, stop := False
  3. Compute \(\beta_0\)
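
A minimal sketch of this pseudo-code (lr, stopping_criterion, alpha, computed_kernel and compute_beta_0 are assumed member names; assumes #include <cmath> for std::fabs):

// Sketch of projected gradient ascent; member names are assumptions.
bool stop = false;
while (!stop) {
    stop = true;
    for (int i = 0; i < n; ++i) {
        // (Eq. 4): recall computed_kernel[i][j] = y_i * y_j * K(x_i, x_j)
        double gradient = 1.0;
        for (int j = 0; j < n; ++j)
            gradient -= alpha[j] * computed_kernel[i][j];
        alpha[i] += lr * gradient;  // (Eq. 5) ascent step
        bool clipped = false;
        if (alpha[i] < 0.0) { alpha[i] = 0.0; clipped = true; }  // (Eq. 6)
        if (alpha[i] > C)   { alpha[i] = C;   clipped = true; }
        if (!clipped && std::fabs(gradient) > stopping_criterion)
            stop = false;
    }
}
compute_beta_0();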

Do not get stuck on this exercise: if ./grader 4 fails, move on and implement the subsequent exercises. The current tests rely on your implementation of train, but we may change this behaviour to allow the tests for f_hat and test to succeed irrespective of train.

You may test your code with ./grader 4. Upload your file Svm.cpp here:


3.5 Predict from a (trained) SVM

For a new query point \(\boldsymbol{x}\), the class prediction of SVM is given by:

\[ \hat{f}(\boldsymbol{x}) = \text{sign} \left[ \Phi(\boldsymbol{x})^T \beta - \beta_0 \right] = \text{sign} \left[ \sum_{i=1}^n \alpha_i y_i K(\boldsymbol{x}_i, \boldsymbol{x}) - \beta_0 \right] \]
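
A minimal sketch of this prediction for a query point \(\boldsymbol{x}\) (the kernel evaluation k(train_features[i], x) and the member names are assumptions):

// Sketch of f_hat; names are assumptions. Returns a label in {-1, +1}.
int f_hat(const std::vector<double> &x) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += alpha[i] * train_labels[i] * k(train_features[i], x);
    return (sum - beta_0 >= 0.0) ? 1 : -1;
}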

Implement the above formula in f_hat. You may test your code with ./grader 5. Upload your file Svm.cpp here:


3.6 Test the SVM

We provide the ConfusionMatrix that you hopefully managed to implement during TD6. Given a test dataset, extract (yet again!) the relevant columns into test_labels and test_features, use f_hat to predict the label of each test sample, and fill the ConfusionMatrix.
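
The prediction loop could look as follows; the ConfusionMatrix method AddPrediction(true_label, predicted_label) on 0/1 labels is an assumption taken from TD6, so adapt it to the actual interface (and remember that f_hat returns -1/+1 labels):

// Sketch of test; the ConfusionMatrix interface is an assumption.
for (int t = 0; t < n_test; ++t) {
    const int predicted = f_hat(test_features[t]);  // in {-1, +1}
    const int predicted01 = (predicted + 1) / 2;    // back to {0, 1}
    // Assumes test_labels were converted to -1/+1; adapt otherwise.
    const int true01 = (test_labels[t] + 1) / 2;
    confusion_matrix.AddPrediction(true01, predicted01);
}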

You may test your code with ./grader 6. Upload your file Svm.cpp here:


In TD6, you explored mail classification with \(k\)-NN. You can compare the performance of your \(k\)-NN implementation with that of SVM. Did we improve it further? You can play around in grading.cpp and with ./grader 7 (not tested on the server; run it locally) to test it out!

References

[1] Chih-Chung Chang, Chih-Jen Lin (2001): LIBSVM, a library for Support Vector Machines. (Last update Nov 2019)

[2] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin (2003): A Practical Guide to Support Vector Classification. (Last update May 2016)