INF442 - Support Vector Machines
Before you start
- Download and unzip the archive `INF442-td8-1-handin.zip`. It contains several source files to get you started, a `Makefile` to compile the various tests, a `quiz` folder with a Python starter-kit script for the quiz, as well as datasets (`.csv` files in the `csv` folder).
Context
- This TD corresponds to Lecture 8: Linear Models for Classification.
- We will focus on Support Vector Machines (SVM).
- You will implement Soft Margin Kernel SVMs by solving the dual problem via a projected gradient ascent method (you may want to fasten your seatbelt).
Structure
This TD is structured as follows:
- Section 1 is about the quiz, the datasets and the libraries used in the TD;
- Section 2 provides some explanations on the theory behind SVMs;
- Section 3 is where you have to submit your code.
1. Quiz
If you want to do the quiz with C++
LibSVM is a library and set of tools for Support Vector Machines. Among its many functionalities (see the documentation for an exhaustive description), LibSVM provides support for different SVM methods (C-SVM, nu-SVM, etc.), binary and non-binary classification (one-vs-one strategy), different kernels (linear, polynomial, RBF, sigmoid or precomputed), and implements cross-validation for optimal parameter selection.
In your working folder, go into the `libsvm-3.23/` subfolder and type `make` in the terminal to compile the source code.
LibSVM has three main tools, which you should find in `libsvm-3.23/` after compilation:
- `svm-scale`, for preparing data in the specified format and scaling them, if necessary.
- `svm-train`, for training the SVM over a training set, to create a prediction model.
- `svm-predict`, for using the model to predict labels on a testing dataset.
Basic usage
By default (i.e., if no options are specified), `svm-train` uses C-SVM for multi-class classification, with parameter \(C=1\) and an RBF kernel:
\[K(x,y) = \exp{(-\gamma ||x-y||^2)},\]
where \(\gamma=\frac{1}{d}\), \(d\) denoting the number of features.
Dataset file format
LibSVM expects datasets in the following (sparse) format:

[label] [index1]:[value1] [index2]:[value2] ...
[label] [index1]:[value1] [index2]:[value2] ...

with a return character (`'\n'`) at the end of each line.
Each line corresponds to a different point, \(p_i\). The first column corresponds to the point label (i.e., \(y_i\)). The other columns describe the individual non-zero features of the point (i.e., \(x_{\textrm{index}^{(i)}} = \textrm{value}^{(i)}\)). Features not explicitly listed are assumed to be zero.
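For instance (a made-up line, following the format above), `1 1:0.5 3:1.2` encodes a point with label \(y_i = 1\), whose first feature is \(0.5\), whose third feature is \(1.2\), and whose remaining features are zero.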
See the `csv/synt*` files for examples.
Working with the command line through the quiz
We will first use LibSVM to predict a few attributes:
- Gender is indicated in the 2nd column (0 male, 1 female).
- Age interval (identified by their lower bound) is indicated in the 3rd column.
- Engine size (of the preferred scooter) is indicated in the 5th column.
For your convenience, to generate libsvm-format datasets, you can either use `libsvm-3.23/tools/convert.c`, which you need to adapt (selecting columns, field separator), compile (`g++ -o convert convert.c`), and run (`./convert <file_to_read> <file_to_write>`), or you can use the provided Python script `quiz/preformat-libsvm.py`. The syntax for this script is as follows:
python3 preformat-libsvm.py <file_to_read> <file_to_write> [id_label rest_of_cols 'char_sep']
where:
- `<id_label>` is the id of the column (the first column has id 1) selected as the label (2 for the gender, 3 for the age, 5 for the engine size).
- `<rest_of_cols>` are the columns to be included in the output file (e.g., 1,2,3), starting from 1. You can use “-” to include all columns (i.e., not to exclude any column of the original dataset).
Predicting labels
Typical use of SVM consists in training a model over the training dataset and then evaluating its performance (the quality of its predictions) over the testing dataset, for which labels are also already known. Running the model over the testing set allows you to compute the testing accuracy of the model.
You can use `svm-train` and `svm-predict` with the default values (type the name of each program, with no arguments, to get a description of all available options and default values). Use either `convert` or `preformat-libsvm.py` to obtain a libSVM-compatible version of the training and testing datasets, for the prediction of gender (column 2), in files `scooter-train-gender.svm` and `scooter-test-gender.svm`. Then use `svm-train` to generate an SVM model (with default parameters, i.e., RBF kernel). Test this model over the testing dataset:
# libSVM format
python3 preformat-libsvm.py ../csv/scooter-train.csv ../csv/scooter-train-gender.svm 2 - ';'
python3 preformat-libsvm.py ../csv/scooter-test.csv ../csv/scooter-test-gender.svm 2 - ';'
# train and store coefficients
./svm-train ../csv/scooter-train-gender.svm ../csv/scooter-model-gender.svm
# predict
./svm-predict ../csv/scooter-test-gender.svm ../csv/scooter-model-gender.svm ../csv/scooter-gender.output
For multi-class classification, LibSVM allows the use of C-SVC and nu-SVC (option `-s`). These options entail slightly different variants of the same optimization objective displayed in the next section, with parameters \(c \in [0, +\infty)\) (for C-SVC) or \(\nu \in [0,1]\) (for nu-SVC). In this section, we will use the C-SVC formulation, enabled by default in `svm-train`. Note that the \(c\) parameter in LibSVM corresponds to the misclassification tolerance \(C\) in that formulation.
- The `-c` option in `svm-train` allows you to set explicitly the cost value \(c\), which controls the tolerance to misclassified items in the training set: for higher values of `c`, accuracy over the training set will be higher (at the expense of possible overfitting). By default, \(c=1\).
- The `-g` option in `svm-train` corresponds to the \(\gamma\) kernel parameter (if no kernel is specified, `svm-train` assumes the RBF kernel). Its impact on the predictor thus depends on the kernel used. In general, smaller values of \(\gamma\) give each point a larger reach in the determination of the boundary between classes (and conversely). By default, this coefficient has value \(\gamma = \frac{1}{d}\).
If you want to do the quiz with Python
You can use the provided Python script `svm_with_sklearn.py` in the `quiz/` folder, which can read standard CSV files as well as libSVM-formatted files and perform training. Note that it uses libSVM under the hood: you should get the exact same results as when using the C++ tools. Typing `python3 svm_with_sklearn.py` should give you an idea of how to use this script:
Syntax: python3 svm_with_sklearn.py <dataset_train> <dataset_test> [<id_label> <rest_of_cols> ['<char_sep>']]
<rest_of_cols> : columns to include in output file (ex: 1,2,3), starting at 0. Use - for including all.
For some of the questions, you should modify the parameters \(C\) and \(\gamma\) as well as the argument to `MinMaxScaler` directly in the Python script. Beware that the first column of CSV files is used as the index, see `ident` for the `scooter` dataset (the only dataset loaded as a CSV).
Supplementary information about the `scooter` dataset
Consider the `scooter` dataset in the `csv` folder. It collects the preferences and usage patterns of different scooter users, as well as user characteristics (age, gender, level of education). We work with a pre-processed version of this dataset, `scooter.csv`, in which each qualitative characteristic is decomposed into a binary vector (with as many dimensions as there are possible categorical values). The dataset is split into a training sub-dataset, `scooter-train.csv`, and a testing sub-dataset, `scooter-test.csv`.
- For example: type-utilisation, which can be domicile-travail, perso-loisirs, or professionnel, is decomposed into a 3-dimensional vector, where \((1,0,0)\) corresponds to domicile-travail, \((0,1,0)\) to perso-loisirs, and \((0,0,1)\) to professionnel.
You can find an exhaustive description of the range and format of the dataset, and of its different features, in the first lines of the file (those starting with `#`; use `grep '#' scooter.csv` to display their content in a terminal):
# ident;sexe;age;CSP;type-cylindree;[type-utilisation-a;b;c];[critere-esthetique-a;b;c;d;e;f;g];note-satisfaction;imp-magasin;imp-marque;imp-esthetique;imp-prix;imp-confort-pilote;imp-confort-passager;imp-dimensions;imp-freinage;imp-cylindree;imp-antivol;imp-tableau-de-bord;imp-accessoires;imp-rangement;imp-propulsion;imp-refroidissement;imp-tablier-avant;imp-feux;imp-fiabilite-moteur
# sexe: 0 homme, 1 femme
# type-utilisation-a;b;c; : 1;0;0; Domicile-travail | 0;1;0; Perso&loisirs | 0;0;1; Professionnel
# critere-esthetique-a;b;c;d;e;f;g; : 1000000 a la mode | innovant nouveau | sportif | discrete / sobre | classique | retro | ressemble a une moto
# CSP: 0 inactif, 1 ouvrier, 2 employe, 3 classe moyenne, 4 classe sup moyenne, 5 classe sup
Supplementary information about the `synt3` dataset
Consider the synthetic dataset `synt3`. It consists of 4000 points, each with 2 features, \(x_1\) and \(x_2\), and one label, \(y \in \{0, 1, 2\}\) (see Fig. 1). It has been split between the file `synt3-train.svm` (for training purposes, 500 points) and `synt3-test.svm` (for testing purposes, 3500 points). (For your convenience, these datasets are already in LibSVM format.)
SVM prediction relies on the minimization of Euclidean distances between points. If the ranges of the different features are too different, raw distances may artificially penalize features with smaller ranges. To avoid this effect, datasets whose features have very different ranges should be scaled (normalized to a uniform range) before training.
![Fig. 1. Dataset synt3.](resources/synt3.png)
If you use C++ to solve the quiz, LibSVM allows you to normalize the scale of all features in a dataset. When there are training and testing datasets, this can be done with the tool `svm-scale` in the `libsvm-3.23` folder, as follows:
svm-scale -s range_sc.txt training_dataset.dat > training_dataset.scaled.dat
svm-scale -r range_sc.txt testing_dataset.dat > testing_dataset.scaled.dat
Note that the use of the file `range_sc.txt` allows you to apply the same scaling factors to the training and testing datasets; otherwise, the scaling might not be consistent. You can use the `-l x -u y` parameters to scale to \([x, y]\) (by default: \([-1, 1]\)).
If you use Python to solve the quiz, see `scaler` in the provided script.
Supplementary information about the `synt1` dataset
The synthetic dataset `synt1` contains 4000 points of 2 dimensions \((x_1, x_2)\), with labels \(y \in \{0, 1, 2\}\) (see Fig. 2). Like the previously considered `synt3`, `synt1` has also been split between the file `synt1-train.svm` (for training purposes, 500 points) and `synt1-test.svm` (for testing purposes, 3500 points). (Note that the `synt1` features are already scaled.)
![Fig. 2. Dataset synt1.](resources/synt1.png)
2. Theory
Given a training set of \(n\) \(d\)-dimensional points \(p_i = (\boldsymbol{x}_i, y_i)\), where \(\boldsymbol{x}_i= (x_{i,1}, ..., x_{i,d})\) and \(y_i \in Y = \{-1, +1\}\) (binary classification), SVMs build the hyperplane that best separates the points of each class (denoted, for convenience, \(y = -1\) and \(y = +1\)). This hyperplane is then used to classify (label) further points into the categories \(\{-1, +1\}\).
The result is a hyperplane of the form \(\boldsymbol{x}^T \beta - \beta_0 = 0\), such that:
(Eq. 1) \[\beta, \beta_0, (\xi_i)_{i=1}^n = {\arg\!\min}_{\beta, \beta_0, (\xi_i)_{i=1}^n} \frac{1}{n} \sum_{i=1}^n \xi_i + \lambda ||\beta||^2\]
Subject to:
\[\forall i, \xi_i \geq 0 \text{ and } y_i(\boldsymbol{x}_i^T \beta - \beta_0) \geq 1 - \xi_i\]
where \(\xi_i = \max\{ 0, 1 - y_i(\boldsymbol{x}_i^T \beta - \beta_0) \}\) are the slack variables.
The first term of the optimization objective minimizes the training loss, i.e., it penalizes training points misclassified by the hyperplane. The second term maximizes the margin between the closest support vectors of each class.
The \(\lambda\) parameter defines the SVM tradeoff between hyperplane margins and training classification accuracy: higher values of \(\lambda\) lead to hyperplanes with larger margins, even if that increases the number of misclassifications; lower values reduce the number of misclassifications in the training dataset, at the cost of lower margins between predicted classes and at the risk of overfitting.
Setting \(\lambda = \frac{1}{2nC}\) and multiplying by \(nC\), we get:
(Eq. 2) \[\beta, \beta_0, (\xi_i)_{i=1}^n = {\arg\!\min}_{\beta, \beta_0, (\xi_i)_{i=1}^n} C \sum_{i=1}^n \xi_i + \frac{1}{2} ||\beta||^2\]
with the same constraints. The \(C\) parameter, referred to as the “cost”, is inversely proportional to \(\lambda\) and is the parameter “more traditionally” exposed by libraries (libSVM and sklearn).
Non-binary classification
The previous mechanism can easily be extended to non-binary classification (i.e., to label sets with \(|Y|>2\)) through different strategies: one-vs-all (one classifier per class, considering samples of all the other classes as belonging to a single superclass), and one-vs-one (one classifier per pair of classes, the predicted label being the one with the most “votes”; this is the default in libSVM).
Kernels
The use of kernel functions makes it possible to embed the \(d\)-dimensional points of the dataset into a higher-dimensional space (of dimension \(d' \gg d\)), so that separability between classes becomes easier. This proves particularly useful when the dataset is not linearly separable in its original space. SVM is then applied in the higher-dimensional space (of dimension \(d'\)), i.e., a \((d' - 1)\)-hyperplane is determined to separate the embedded points of each class (for binary classification).
\(\Phi: \mathbb{R}^d \to \mathbb{R}^{d'}\) denotes the embedding into the higher (\(d'\))-dimensional space. SVM optimization, which relies only on the relative distances between points and not on their representation, does not require the explicit expression of the embedding \(\Phi\), only the inner product in \(\mathbb{R}^{d'}\), i.e., \(K(\boldsymbol{x}, \boldsymbol{y}) = \Phi(\boldsymbol{x})^T \Phi(\boldsymbol{y})\), with \(\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^d\) (the kernel trick). Kernel functions \(K(\cdot, \cdot)\) thus provide a measure of point similarity in the higher-dimensional space into which the dataset is embedded (without even needing to know that space or the representation of the points in it).
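As a standard illustration of the kernel trick (not specific to this TD): in \(d = 2\), the polynomial kernel \(K(\boldsymbol{x}, \boldsymbol{y}) = (\boldsymbol{x}^T \boldsymbol{y})^2\) corresponds to the explicit embedding
\[\Phi(x_1, x_2) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \in \mathbb{R}^3,\]
since \(\Phi(\boldsymbol{x})^T \Phi(\boldsymbol{y}) = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x_1 y_1 + x_2 y_2)^2\): the kernel is evaluated in the original space, while the separation happens in \(\mathbb{R}^3\).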
Some well-known kernels include:
- Linear kernel, if no embedding into a higher-dimensional space is performed: \(K(\boldsymbol{x}, \boldsymbol{y}) = \langle \boldsymbol{x}, \boldsymbol{y} \rangle\) (note that this is the regular dot product in the original space)
- Radial basis function (RBF) kernel: \(K(\boldsymbol{x}, \boldsymbol{y}) = \exp{(- \gamma ||\boldsymbol{x} - \boldsymbol{y}||_2^2)}\) (default in sklearn and libSVM)
- Polynomial kernel: \(K(\boldsymbol{x}, \boldsymbol{y}) = (\gamma \boldsymbol{x}^T \boldsymbol{y} + \text{coef0})^{\textrm{degree}}\)
- …
Dual problem
In kernel SVM, \(\xi_i\) in (Eq. 2) becomes:
\[ \max \{0, 1 - y_i(\Phi(\boldsymbol{x}_i)^T \beta - \beta_0)\}.\]
To solve (Eq. 2), we can rely on the following Lagrange (primal) function:
\[ L_P = \frac{1}{2} ||\beta||^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i (y_i(\Phi(x_i)^T \beta - \beta_0) - (1 - \xi_i)) - \sum_{i=1}^n \mu_i \xi_i,\]
which we minimize w.r.t. \(\beta\), \(\beta_0\), and \(\xi_i\). Setting the respective derivatives to zero, we get:
\[\beta = \sum_{i=1}^n \alpha_i y_i \Phi(\boldsymbol{x}_i)\] \[\sum_{i=1}^n \alpha_i y_i = 0\] \[\alpha_i = C - \mu_i, \; \forall i\]
By substituting, we obtain the Lagrangian dual objective function:
(Eq. 3) \[ L_D = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \underbrace{\Phi(\boldsymbol{x}_i)^T \Phi(\boldsymbol{x}_j)}_{K(\boldsymbol{x}_i, \boldsymbol{x}_j)},\]
which gives a lower bound on the objective function (Eq. 2) for any feasible point. We maximize \(L_D\) subject to \(0 \leq \alpha_i \leq C\) and \(\sum_{i=1}^n \alpha_i y_i = 0\). Some additional conditions, outside the scope of the present TD, ensure that the solutions to the primal and dual problems match.
Projected gradient ascent
An “easy” way to maximize (Eq. 3) is to resort to gradient ascent, i.e., compute the derivative of the objective function w.r.t. each parameter and update each parameter in the direction of the gradient.
This yields:
(Eq. 4) \[ \frac{\partial L_D}{\partial \alpha_i} = 1 - \sum_{j=1}^n \alpha_j y_i y_j K(\boldsymbol{x}_i, \boldsymbol{x}_j). \]
For some learning rate \(\eta\), we update \(\alpha_i\) at step \((t)\) such that:
(Eq. 5) \[ \tilde{\alpha}_i^{(t+1)} = \alpha_i^{(t)} + \eta \frac{\partial L_D}{\partial \alpha_i}. \]
We then project the update step into the feasible solution set by clipping the \(\alpha_i\):
(Eq. 6) \[ \alpha_i^{(t+1)} = \max (0, \min(C, \tilde{\alpha}_i^{(t+1)})). \]
We stop whenever all \(\alpha_i\) are clipped and / or all partial derivatives are small enough.
The intercept term \(\beta_0\) is defined as:
\[ \beta_0 = \frac{1}{n_S} \sum_{s \in S} \left( \sum_{i=1}^n \alpha_i y_i K(\boldsymbol{x}_i, \boldsymbol{x}_s) - y_s \right), \]
where \(S=\{s \mid 0 < \alpha_s < C \}\) denotes the set of support vectors.
3. The real deal: spam detector with SVM
We can use SVM to classify mails into spam / ham (binary classification). We already had a look at this problem of (binary) mail classification in TD6. We recall the definitions:
\(\mbox{Detection rate} = \mbox{Recall} = \frac{TP}{TP + FN}\)
\(\mbox{False alarm rate} = \frac{FP}{FP + TN}\)
\(\mbox{Precision} = \frac{TP}{TP + FP}\)
\(\mbox{F-score} = \frac{2 \cdot \mbox{Precision} \cdot \mbox{Recall}}{\mbox{Precision} + \mbox{Recall}}\)
where \(TP\), \(FP\), \(TN\), and \(FN\) stand for the numbers of true positive, false positive, true negative, and false negative classifications, respectively.
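As a quick sanity check with made-up numbers: for \(TP = 90\), \(FN = 10\), \(FP = 30\), and \(TN = 870\), we get recall \(= 90/100 = 0.9\), false alarm rate \(= 30/900 \approx 0.033\), precision \(= 90/120 = 0.75\), and F-score \(= \frac{2 \cdot 0.75 \cdot 0.9}{0.75 + 0.9} \approx 0.82\).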
To evaluate the performance of SVM on the problem of classifying mail into the spam and ham categories, we consider (as we did in TD6) the dataset `mail_train.csv` for training and `mail_test.csv` for testing the prediction performance.
3.1 Implementing kernels
The linear kernel can be computed as follows:
\[K(x_1, x_2) = x_1^T x_2.\]
The polynomial kernel can be computed as follows:
\[K(x_1, x_2) = (\gamma x_1^T x_2 + \text{coef0})^\text{degree}.\]
The radial basis function kernel can be computed as follows:
\[K(x_1, x_2) = \exp \left( - \gamma || x_1 - x_2 ||_2^2 \right).\]
The sigmoid kernel can be computed as follows:
\[K(x_1, x_2) = \tanh(\gamma x_1^T x_2 + \text{coef0}). \]
The rational-quadratic kernel can be computed as follows:
\[K(x_1, x_2) = 1 - \frac{||x_1 - x_2||_2^2}{||x_1-x_2||_2^2 + {\text{coef0}}} = \frac{\text{coef0}}{||x_1-x_2||_2^2 + {\text{coef0}}}.\]
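For reference, here is a minimal, self-contained sketch of these formulas over `std::vector<double>` points. The free-function names, signatures, and the explicit `gamma`/`coef0`/`degree` parameters are illustrative only: class `Kernel` in `Kernel/Kernel.cpp` has its own interface, which you should follow.

```cpp
// Hedged sketch of the five kernels; adapt names and signatures to class Kernel.
#include <cmath>
#include <cstddef>
#include <vector>

// Dot product x1^T x2, shared by the linear, polynomial and sigmoid kernels.
double dot(const std::vector<double>& x1, const std::vector<double>& x2) {
    double s = 0.0;
    for (std::size_t k = 0; k < x1.size(); ++k) s += x1[k] * x2[k];
    return s;
}

// Squared Euclidean distance ||x1 - x2||_2^2, shared by rbf and rational-quadratic.
double sq_dist(const std::vector<double>& x1, const std::vector<double>& x2) {
    double s = 0.0;
    for (std::size_t k = 0; k < x1.size(); ++k) s += (x1[k] - x2[k]) * (x1[k] - x2[k]);
    return s;
}

double linear(const std::vector<double>& x1, const std::vector<double>& x2) {
    return dot(x1, x2);
}

double polynomial(const std::vector<double>& x1, const std::vector<double>& x2,
                  double gamma, double coef0, int degree) {
    return std::pow(gamma * dot(x1, x2) + coef0, degree);
}

double rbf(const std::vector<double>& x1, const std::vector<double>& x2,
           double gamma) {
    return std::exp(-gamma * sq_dist(x1, x2));
}

double sigmoid(const std::vector<double>& x1, const std::vector<double>& x2,
               double gamma, double coef0) {
    return std::tanh(gamma * dot(x1, x2) + coef0);
}

double rational_quadratic(const std::vector<double>& x1, const std::vector<double>& x2,
                          double coef0) {
    return coef0 / (sq_dist(x1, x2) + coef0);
}
```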
Implement the linear, polynomial, rbf, sigmoid and rational-quadratic kernels in the eponymous methods of class `Kernel` in `Kernel/Kernel.cpp`. You can test your implementation with `make grader` and `./grader 1`.
Upload your file `Kernel.cpp` here:
3.2 SVM constructor
Implement class `SVM`’s constructor in `Svm/Svm.cpp`. Most of it is already done, but you still need to populate `train_features` and `train_labels` from the provided dataset (yes, again!). However, remember that labels are usually given in 0/1 form whereas SVM assumes -1/+1 labels; hence, you also need to transform the labels from 0/1 to -1/+1 in the constructor.
Implement `compute_kernel`, which should populate the attribute `computed_kernel` with \(y_i y_j K(\boldsymbol{x}_i, \boldsymbol{x}_j)\) (as can be seen in Section 2 (Theory), this term appears both in the objective and in the gradient; note that it is symmetric w.r.t. \(i\) and \(j\)).
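As a hint, here is a hedged sketch of this computation written as a free function over plain `std::vector` containers; in the handout, this logic belongs in the `compute_kernel` method and should use the actual attributes of class `SVM`.

```cpp
#include <cstddef>
#include <vector>

// Illustrative stand-in for compute_kernel: fills Kmat[i][j] = y_i y_j K(x_i, x_j).
std::vector<std::vector<double>> compute_kernel_matrix(
        const std::vector<std::vector<double>>& X,   // stand-in for train_features
        const std::vector<int>& y,                   // stand-in for train_labels, in {-1,+1}
        double (*k)(const std::vector<double>&, const std::vector<double>&)) {
    std::size_t n = X.size();
    std::vector<std::vector<double>> Kmat(n, std::vector<double>(n));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i; j < n; ++j) {  // exploit the symmetry in i and j
            double v = y[i] * y[j] * k(X[i], X[j]);
            Kmat[i][j] = v;
            Kmat[j][i] = v;
        }
    }
    return Kmat;
}
```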
You may test your code with `./grader 2`. Upload your file `Svm.cpp` here:
3.3 Compute beta_0
Assuming \(\alpha = (\alpha_i)_{i=1}^n\) has been correctly estimated, compute the intercept term in `compute_beta_0`:
\[ \beta_0 = \frac{1}{n_S} \sum_{s \in S} \left( \sum_{i=1}^n \alpha_i y_i K(\boldsymbol{x}_i, \boldsymbol{x}_s) - y_s \right). \]
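A possible implementation sketch, under the same illustrative representation as above. It reuses `computed_kernel[i][s]` \(= y_i y_s K(\boldsymbol{x}_i, \boldsymbol{x}_s)\), from which \(\alpha_i y_i K(\boldsymbol{x}_i, \boldsymbol{x}_s)\) is recovered by multiplying by \(y_s\) (since \(y_s^2 = 1\)).

```cpp
#include <cstddef>
#include <vector>

// Illustrative stand-in for the compute_beta_0 method of class SVM.
double compute_beta_0(const std::vector<double>& alpha,
                      const std::vector<int>& y,   // labels in {-1,+1}
                      const std::vector<std::vector<double>>& computed_kernel,
                      double C) {
    double beta_0 = 0.0;
    int n_support = 0;
    std::size_t n = alpha.size();
    for (std::size_t s = 0; s < n; ++s) {
        if (alpha[s] <= 0.0 || alpha[s] >= C) continue;  // keep S = {s : 0 < alpha_s < C}
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            sum += alpha[i] * y[s] * computed_kernel[i][s];  // = alpha_i y_i K(x_i, x_s)
        beta_0 += sum - y[s];
        ++n_support;
    }
    return n_support > 0 ? beta_0 / n_support : 0.0;  // average over support vectors
}
```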
You may test your code with `./grader 3`. Upload your file `Svm.cpp` here:
3.4 Training the SVM
Implement projected gradient ascent on \(\alpha\) by completing the method `train` in class `SVM`. The pseudo-code is the following (a hedged C++ sketch is given right after it):
- `stop` := False
- While not `stop` do:
  - `stop` := True
  - For i from 1 to n do:
    - Compute the `gradient` of the loss w.r.t. \(\alpha_i\) (Eq. 4)
    - Update \(\alpha_i\) (Eq. 5; \(\eta\) is the learning rate `lr`)
    - Clip \(\alpha_i\) if necessary (Eq. 6)
    - If clipping was not necessary and |`gradient`| > `stopping_criterion`, `stop` := False
- Compute \(\beta_0\)
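As announced, here is a minimal C++ sketch of this loop, using the same illustrative names as in the previous sketches (`alpha`, `computed_kernel`, learning rate `lr`, `stopping_criterion`); adapt it to the actual attributes of class `SVM`.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative stand-in for the train method: projected gradient ascent on alpha.
void train_pga(std::vector<double>& alpha,
               const std::vector<std::vector<double>>& computed_kernel,  // y_i y_j K(x_i, x_j)
               double C, double lr, double stopping_criterion) {
    std::size_t n = alpha.size();
    bool stop = false;
    while (!stop) {
        stop = true;
        for (std::size_t i = 0; i < n; ++i) {
            // Gradient of L_D w.r.t. alpha_i (Eq. 4).
            double gradient = 1.0;
            for (std::size_t j = 0; j < n; ++j)
                gradient -= alpha[j] * computed_kernel[i][j];
            // Ascent step (Eq. 5).
            double updated = alpha[i] + lr * gradient;
            // Projection onto [0, C] by clipping (Eq. 6).
            double clipped = std::max(0.0, std::min(C, updated));
            alpha[i] = clipped;
            // Keep iterating while some unclipped coordinate still moves.
            if (clipped == updated && std::abs(gradient) > stopping_criterion)
                stop = false;
        }
    }
    // Then compute beta_0 from the final alpha (Section 3.3).
}
```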
Do not get stuck on this exercise: if `./grader 4` is failing, implement the subsequent exercises.
The current tests rely on your implementation of `train`, but we may change this behaviour to allow `f_hat` and `test` to be tested successfully irrespective of `train`.
You may test your code with `./grader 4`. Upload your file `Svm.cpp` here:
3.5 Predict from a (trained) SVM
For a new query point \(\boldsymbol{x}\), the class prediction of SVM is given by:
\[ \hat{f}(\boldsymbol{x}) = \text{sign} \left[ \Phi(\boldsymbol{x})^T \beta - \beta_0 \right] = \text{sign} \left[ \sum_{i=1}^n \alpha_i y_i K(\boldsymbol{x}_i, \boldsymbol{x}) - \beta_0 \right] \]
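A hedged free-function sketch of this prediction, with the same illustrative conventions as before (in the handout, `f_hat` is a method of class `SVM` and uses its attributes):

```cpp
#include <cstddef>
#include <vector>

// Illustrative stand-in for f_hat: sign of the kernel decision function.
int f_hat(const std::vector<double>& x,
          const std::vector<double>& alpha,
          const std::vector<int>& y,                       // train_labels in {-1,+1}
          const std::vector<std::vector<double>>& X,       // train_features
          double beta_0,
          double (*k)(const std::vector<double>&, const std::vector<double>&)) {
    double sum = -beta_0;
    for (std::size_t i = 0; i < alpha.size(); ++i)
        sum += alpha[i] * y[i] * k(X[i], x);
    return sum >= 0.0 ? +1 : -1;  // sign[...]; ties mapped to +1 by convention
}
```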
Implement this formula in `f_hat`. You may test your code with `./grader 5`. Upload your file `Svm.cpp` here:
3.6 Test the SVM
We provide the `ConfusionMatrix` that you hopefully managed to implement during TD6. Given a test dataset, extract (yet again!) the relevant columns into `test_labels` and `test_features`, use `f_hat` to predict the labels of the test samples, and fill the `ConfusionMatrix`.
You may test your code with `./grader 6`. Upload your file `Svm.cpp` here:
In TD6, you explored mail classification with \(k\)-NN. You can compare the performance of your \(k\)-NN implementation with the performance of SVM. Did we improve it further? You can play around in `grading.cpp` and with `./grader 7` (not tested on the server, run locally) to find out!
References
[1] Chih-Chung Chang and Chih-Jen Lin (2001): LIBSVM: a library for Support Vector Machines. (Last updated November 2019.)
[2] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin (2003): A Practical Guide to Support Vector Classification. (Last updated May 2016.)