INF442 - Support Vector Machines
Before you start
- Download and unzip the archive `INF442-td8-1-handin.zip`. It contains several source files to get you started, a `Makefile` to compile the various tests, a `quiz` folder with a Python starter-kit script for the quiz, as well as datasets (`.csv` files in the `csv` folder).
Context
- This TD corresponds to Lecture 8: Linear Models for Classification.
- We will focus on Support Vector Machines (SVM).
- You will implement Soft Margin Kernel SVMs by solving the dual problem via a projected gradient ascent method (you may want to fasten your seatbelt).
Structure
This TD is structured as follows:
- Section 1 is about the quiz, the datasets and the libraries used in the TD;
- Section 2 provides some explanations on the theory behind SVMs;
- Section 3 is where you have to submit your code.
1. Quiz
If you want to do the quiz with C++
LibSVM is a library and set of tools for Support Vector Machines. Among its many functionalities (see the documentation for an exhaustive description), LibSVM provides support for different SVM methods (C-SVM, nu-SVM, etc.), binary and non-binary classification (one-vs-one strategy), different kernels (linear, polynomial, RBF, sigmoid or precomputed), and implements cross-validation for optimal parameter selection.
In your working folder, go into the `libsvm-3.23/` subfolder and type `make` in the terminal to compile the source code.
LibSVM has three main tools, which you should find in `libsvm-3.23/` after compilation:
- `svm-scale`, for preparing data in the specified format and scaling them, if necessary.
- `svm-train`, for training the SVM over a training set, to create a prediction model.
- `svm-predict`, for using the model to predict labels on a testing dataset.
Basic usage
By default (i.e., if no options are specified), `svm-train` uses C-SVM for multi-class classification, with parameter \(C=1\) and an RBF kernel:
\[K(x,y) = \exp{(-\gamma ||x-y||^2)},\]
where \(\gamma=\frac{1}{d}\), \(d\) denoting the number of features.
Dataset file format
LibSVM expects datasets in the following (sparse) format:

[label] [index1]:[value1] [index2]:[value2] ...
[label] [index1]:[value1] [index2]:[value2] ...

with a return character (`'\n'`) at the end of each line.
Each line corresponds to a different point, \(p_i\). The first column corresponds to the point label (i.e., \(y_i\)). The other columns describe the individual non-zero features of the point (i.e., \(x_{\textrm{index}^{(i)}} = \textrm{value}^{(i)}\)). Features not explicitly listed are assumed to be zero.
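For instance (a made-up line, following the format above), `1 1:0.5 3:1.2` encodes a point with label \(y_i = 1\), whose first feature is \(0.5\), whose third feature is \(1.2\), and whose remaining features are zero.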
See the `csv/synt*` files for examples.
Working with the command line through the quiz
We will first use LibSVM to predict a few attributes:
- Gender is indicated in the 2nd column (0 male, 1 female).
- Age interval (identified by their lower bound) is indicated in the 3rd column.
- Engine size (of the preferred scooter) is indicated in the 5th column.
For your convenience, to generate libsvm-format datasets, you can either use `libsvm-3.23/tools/convert.c`, which you need to adapt (selecting columns, field separator), compile (`g++ -o convert convert.c`), and run (`./convert <file_to_read> <file_to_write>`), or you can use the provided Python script `quiz/preformat-libsvm.py`. The syntax for this script is as follows:
python3 preformat-libsvm.py <file_to_read> <file_to_write> [id_label rest_of_cols 'char_sep']
where:
- `<id_label>` is the id of the column (the first column has id 1) selected as the label (2 for the gender, 3 for the age, 5 for the engine size).
- `<rest_of_cols>` are the columns to be included in the output file (e.g., 1,2,3), starting from 1. You can use “-” to include all columns (i.e., not to exclude any column of the original dataset).
Predicting labels
Typical use of SVM consists in training a model over the training dataset and then evaluating its performance (the quality of its predictions) over the testing dataset, for which labels are also already known. Running the model over the testing set allows you to compute the testing accuracy of the model.
You can use `svm-train` and `svm-predict` with the default values (type the name of each program, with no arguments, to get a description of all available options and default values). Use either `convert` or `preformat-libsvm.py` to obtain a libSVM-compatible version of the training and testing datasets, for the prediction of gender (column 2), in files `scooter-train-gender.svm` and `scooter-test-gender.svm`. Then use `svm-train` to generate an SVM model (with default parameters, i.e., RBF kernel). Test this model over the testing dataset:
# libSVM format
python3 preformat-libsvm.py ../csv/scooter-train.csv ../csv/scooter-train-gender.svm 2 - ';'
python3 preformat-libsvm.py ../csv/scooter-test.csv ../csv/scooter-test-gender.svm 2 - ';'
# train and store coefficients
./svm-train ../csv/scooter-train-gender.svm ../csv/scooter-model-gender.svm
# predict
./svm-predict ../csv/scooter-test-gender.svm ../csv/scooter-model-gender.svm ../csv/scooter-gender.output
For multi-class classification, LibSVM allows the use of C-SVC and nu-SVC (option `-s`). These options entail slightly different variants of the same optimization objective displayed in the next section, with parameters \(c \in [0, +\infty)\) (for C-SVC) or \(\nu \in [0,1]\) (for nu-SVC). In this section, we will use the C-SVC formulation, enabled by default in `svm-train`. Note that the \(c\) parameter in LibSVM corresponds to the misclassification tolerance \(C\) in that formulation.
- The `-c` option in `svm-train` allows you to set explicitly the cost value \(c\), which controls the tolerance to misclassified items in the training set: for higher values of `c`, accuracy over the training set will be higher (at the expense of possible overfitting). By default, \(c=1\).
- The `-g` option in `svm-train` corresponds to the \(\gamma\) kernel parameter (if no kernel is specified, `svm-train` assumes the RBF kernel). Its impact on the predictor thus depends on the kernel used. In general, smaller values of \(\gamma\) give each point a larger reach in the determination of the boundary between classes (and conversely). By default, this coefficient has value \(\gamma = \frac{1}{d}\).
If you want to do the quiz with Python
You can use the provided Python script `svm_with_sklearn.py` in the `quiz/` folder, which can read standard CSV files as well as libSVM-formatted files and perform training. Note that it uses libSVM under the hood: you should get the exact same results as when using the C++ tools. Typing `python3 svm_with_sklearn.py` should give you an idea of how to use this script:
Syntax: python3 svm_with_sklearn.py <dataset_train> <dataset_test> [<id_label> <rest_of_cols> ['<char_sep>']]
<rest_of_cols> : columns to include in output file (ex: 1,2,3), starting at 0. Use - for including all.
For some of the questions, you should modify the parameters \(C\) and \(\gamma\) as well as the argument to `MinMaxScaler` directly in the Python script. Beware that the first column of CSV files is used as the index, see `ident` for the `scooter` dataset (the only dataset loaded as a CSV).
Supplementary information about the `scooter` dataset
Consider the `scooter` dataset in the `csv` folder. It collects the preferences and usage patterns of different scooter users, as well as user characteristics (age, gender, level of education). We work with a pre-processed version of this dataset, `scooter.csv`, in which each qualitative characteristic is decomposed into a binary vector (with as many dimensions as there are possible categorical values). The dataset is split into a training sub-dataset, `scooter-train.csv`, and a testing sub-dataset, `scooter-test.csv`.
- For example: type-utilisation, which can be domicile-travail, perso-loisirs, or professionnel, is decomposed into a 3-dimensional vector, where \((1,0,0)\) corresponds to domicile-travail, \((0,1,0)\) to perso-loisirs, and \((0,0,1)\) to professionnel.
You can find an exhaustive description of the range and format of the dataset, and of its different features, in the first lines of the file (those starting with `#`; use `grep '#' scooter.csv` to display their content in a terminal):
# ident;sexe;age;CSP;type-cylindree;[type-utilisation-a;b;c];[critere-esthetique-a;b;c;d;e;f;g];note-satisfaction;imp-magasin;imp-marque;imp-esthetique;imp-prix;imp-confort-pilote;imp-confort-passager;imp-dimensions;imp-freinage;imp-cylindree;imp-antivol;imp-tableau-de-bord;imp-accessoires;imp-rangement;imp-propulsion;imp-refroidissement;imp-tablier-avant;imp-feux;imp-fiabilite-moteur
# sexe: 0 homme, 1 femme
# type-utilisation-a;b;c; : 1;0;0; Domicile-travail | 0;1;0; Perso&loisirs | 0;0;1; Professionnel
# critere-esthetique-a;b;c;d;e;f;g; : 1000000 a la mode | innovant nouveau | sportif | discrete / sobre | classique | retro | ressemble a une moto
# CSP: 0 inactif, 1 ouvrier, 2 employe, 3 classe moyenne, 4 classe sup moyenne, 5 classe sup
Supplementary information about the `synt3` dataset
Consider the synthetic dataset `synt3`. It consists of 4000 points, each with 2 features, \(x_1\) and \(x_2\), and one label, \(y \in \{0, 1, 2\}\) (see Fig. 1). It has been split between the file `synt3-train.svm` (for training purposes, 500 points) and `synt3-test.svm` (for testing purposes, 3500 points). (For your convenience, these datasets are already in LibSVM format.)
SVM prediction relies on the minimization of Euclidean distances between points. If the ranges of the different features are too different, raw distances may artificially penalize features with smaller ranges. To avoid this effect, datasets whose features have very different ranges should be scaled (normalized to a uniform range) before training.
![Fig. 1. Dataset synt3.](resources/synt3.png)
If you use C++ to solve the quiz, LibSVM allows you to normalize the scale of all features in a dataset. When there are training and testing datasets, this can be done with the tool `svm-scale` in the `libsvm-3.23` folder, as follows:
svm-scale -s range_sc.txt training_dataset.dat > training_dataset.scaled.dat
svm-scale -r range_sc.txt testing_dataset.dat > testing_dataset.scaled.dat
Note that the use of the file `range_sc.txt` allows you to apply the same scaling factors to the training and testing datasets; otherwise, the scaling might not be consistent. You can use the `-l x -u y` parameters to scale to \([x, y]\) (by default: \([-1, 1]\)).
If you use Python to solve the quiz, see `scaler` in the provided script.
Supplementary information about the `synt1` dataset
The synthetic dataset `synt1` contains 4000 points of 2 dimensions \((x_1, x_2)\), with labels \(y \in \{0, 1, 2\}\) (see Fig. 2). Like the previously considered `synt3`, `synt1` has also been split between the file `synt1-train.svm` (for training purposes, 500 points) and `synt1-test.svm` (for testing purposes, 3500 points). (Note that the `synt1` features are already scaled.)
![Fig. 2. Dataset synt1.](resources/synt1.png)
2. Theory
Given a training set of \(n\) \(d\)-dimensional points \(p_i = (\boldsymbol{x}_i, y_i)\), where \(\boldsymbol{x}_i= (x_{i,1}, ..., x_{i,d})\) and \(y_i \in Y = \{-1, +1\}\) (binary classification), SVMs build the hyperplane that best separates the points of each class (denoted, for convenience, \(y = -1\) and \(y = +1\)). This hyperplane is then used to classify (label) further points into the categories \(\{-1, +1\}\).
The result is a hyperplane of the form \(\boldsymbol{x}^T \beta - \beta_0 = 0\), such that:
(Eq. 1) \[\beta, \beta_0, (\xi_i)_{i=1}^n = {\arg\!\min}_{\beta, \beta_0, (\xi_i)_{i=1}^n} \frac{1}{n} \sum_{i=1}^n \xi_i + \lambda ||\beta||^2\]
Subject to:
\[\forall i, \xi_i \geq 0 \text{ and } y_i(\boldsymbol{x}_i^T \beta - \beta_0) \geq 1 - \xi_i\]
where \(\xi_i = \max\{ 0, 1 - y_i(\boldsymbol{x}_i^T \beta - \beta_0) \}\) are the slack variables.
The first term of the optimization objective minimizes the training loss, i.e., it penalizes training points misclassified by the hyperplane. The second term maximizes the margin between the closest support vectors of each class.
The \(\lambda\) parameter defines the SVM tradeoff between hyperplane margins and training classification accuracy: higher values of \(\lambda\) lead to hyperplanes with larger margins, even if that increases the number of misclassifications; lower values reduce the number of misclassifications in the training dataset, at the cost of lower margins between predicted classes and at the risk of overfitting.
Setting \(\lambda = \frac{1}{2nC}\) and multiplying by \(nC\), we get:
(Eq. 2) \[\beta, \beta_0, (\xi_i)_{i=1}^n = {\arg\!\min}_{\beta, \beta_0, (\xi_i)_{i=1}^n} C \sum_{i=1}^n \xi_i + \frac{1}{2} ||\beta||^2\]
with the same constraints. The \(C\) parameter, referred to as the “cost”, is inversely proportional to \(\lambda\) and is the parameter “more traditionally” exposed by libraries (libSVM and sklearn).
Non-binary classification
The previous mechanism can easily be extended to non-binary classification (i.e., to label sets with \(|Y|>2\)) through different strategies: one-vs-all (one classifier per class, considering samples of all the other classes as belonging to a single superclass), and one-vs-one (one classifier per pair of classes, the predicted label being the one with the most “votes”; this is the default in libSVM).
Kernels
The use of kernel functions makes it possible to embed the \(d\)-dimensional points of the dataset into a higher-dimensional space (of dimension \(d' \gg d\)), so that separability between classes becomes easier. This proves particularly useful when the dataset is not linearly separable in its original space. SVM is then applied in the higher-dimensional space (of dimension \(d'\)), i.e., a \((d' - 1)\)-hyperplane is determined to separate the embedded points of each class (for binary classification).
\(\Phi: \mathbb{R}^d \to \mathbb{R}^{d'}\) denotes the embedding into the higher (\(d'\))-dimensional space. SVM optimization, which relies only on the relative distances between points and not on their representation, does not require the explicit expression of the embedding \(\Phi\), only the inner product in \(\mathbb{R}^{d'}\), i.e., \(K(\boldsymbol{x}, \boldsymbol{y}) = \Phi(\boldsymbol{x})^T \Phi(\boldsymbol{y})\), with \(\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^d\) (the kernel trick). Kernel functions \(K(\cdot, \cdot)\) thus provide a measure of point similarity in the higher-dimensional space into which the dataset is embedded (without even needing to know that space or the representation of the points in it).
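As a standard illustration of the kernel trick (not specific to this TD): in \(d = 2\), the polynomial kernel \(K(\boldsymbol{x}, \boldsymbol{y}) = (\boldsymbol{x}^T \boldsymbol{y})^2\) corresponds to the explicit embedding
\[\Phi(x_1, x_2) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \in \mathbb{R}^3,\]
since \(\Phi(\boldsymbol{x})^T \Phi(\boldsymbol{y}) = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x_1 y_1 + x_2 y_2)^2\): the kernel is evaluated in the original space, while the separation happens in \(\mathbb{R}^3\).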
Some well-known kernels include:
- Linear kernel, if no embedding into a higher-dimensional space is performed: \(K(\boldsymbol{x}, \boldsymbol{y}) = \langle \boldsymbol{x}, \boldsymbol{y} \rangle\) (note that this is the regular dot product in the original space)
- Radial basis function (RBF) kernel: \(K(\boldsymbol{x}, \boldsymbol{y}) = \exp{(- \gamma ||\boldsymbol{x} - \boldsymbol{y}||_2^2)}\) (default in sklearn and libSVM)
- Polynomial kernel: \(K(\boldsymbol{x}, \boldsymbol{y}) = (\gamma \boldsymbol{x}^T \boldsymbol{y} + \text{coef0})^{\textrm{degree}}\)
- …
Dual problem
In kernel SVM, \(\xi_i\) in (Eq. 2) becomes:
\[ \max \{0, 1 - y_i(\Phi(\boldsymbol{x}_i)^T \beta - \beta_0)\}.\]
To solve (Eq. 2), we can rely on the following Lagrange (primal) function:
\[ L_P = \frac{1}{2} ||\beta||^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i (y_i(\Phi(x_i)^T \beta - \beta_0) - (1 - \xi_i)) - \sum_{i=1}^n \mu_i \xi_i,\]
which we minimize w.r.t. \(\beta\), \(\beta_0\), and \(\xi_i\). Setting the respective derivatives to zero, we get:
\[\beta = \sum_{i=1}^n \alpha_i y_i \Phi(\boldsymbol{x}_i)\] \[\sum_{i=1}^n \alpha_i y_i = 0\] \[\alpha_i = C - \mu_i, \; \forall i\]
By substituting, we obtain the Lagrangian dual objective function:
(Eq. 3) \[ L_D = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \underbrace{\Phi(\boldsymbol{x}_i)^T \Phi(\boldsymbol{x}_j)}_{K(\boldsymbol{x}_i, \boldsymbol{x}_j)},\]
which gives a lower bound on the objective function (Eq. 2) for any feasible point. We maximize \(L_D\) subject to \(0 \leq \alpha_i \leq C\) and \(\sum_{i=1}^n \alpha_i y_i = 0\). Some additional conditions, outside the scope of the present TD, ensure that the solutions to the primal and dual problems match.
Projected gradient ascent
An “easy” way to maximize (Eq. 3) is to resort to gradient ascent, i.e., compute the derivative of the objective function w.r.t. each parameter and update each parameter in the direction of the gradient.
This yields:
(Eq. 4) \[ \frac{\partial L_D}{\partial \alpha_i} = 1 - \sum_{j=1}^n \alpha_j y_i y_j K(\boldsymbol{x}_i, \boldsymbol{x}_j). \]
For some learning rate \(\eta\), we update \(\alpha_i\) at step \((t)\) such that:
(Eq. 5) \[ \tilde{\alpha}_i^{(t+1)} = \alpha_i^{(t)} + \eta \frac{\partial L_D}{\partial \alpha_i}. \]
We then project the update step into the feasible solution set by clipping the \(\alpha_i\):
(Eq. 6) \[ \alpha_i^{(t+1)} = \max (0, \min(C, \tilde{\alpha}_i^{(t+1)})). \]
We stop whenever all \(\alpha_i\) are clipped and / or all partial derivatives are small enough.
The intercept term \(\beta_0\) is defined as:
\[ \beta_0 = \frac{1}{n_S} \sum_{s \in S} \left( \sum_{i=1}^n \alpha_i y_i K(\boldsymbol{x}_i, \boldsymbol{x}_s) - y_s \right), \]
where \(S=\{s \mid 0 < \alpha_s < C \}\) denotes the set of support vectors.
3. The real deal: spam detector with SVM
We can use SVM to classify mails into spam / ham (binary classification). We already had a look at this problem of (binary) mail classification in TD6. We recall the definitions:
\(\mbox{Detection rate} = \mbox{Recall} = \frac{TP}{TP + FN}\)
\(\mbox{False alarm rate} = \frac{FP}{FP + TN}\)
\(\mbox{Precision} = \frac{TP}{TP + FP}\)
\(\mbox{F-score} = \frac{2 \cdot \mbox{Precision} \cdot \mbox{Recall}}{\mbox{Precision} + \mbox{Recall}}\)
where \(TP\), \(FP\), \(TN\), and \(FN\) stand for the numbers of true positive, false positive, true negative, and false negative classifications, respectively.
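As a quick sanity check with made-up numbers: for \(TP = 90\), \(FN = 10\), \(FP = 30\), and \(TN = 870\), we get recall \(= 90/100 = 0.9\), false alarm rate \(= 30/900 \approx 0.033\), precision \(= 90/120 = 0.75\), and F-score \(= \frac{2 \cdot 0.75 \cdot 0.9}{0.75 + 0.9} \approx 0.82\).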
To evaluate the performance of SVM on the problem of classifying mail into the spam and ham categories, we consider (as we did in TD6) the dataset `mail_train.csv` for training and `mail_test.csv` for testing the prediction performance.
3.1 Implementing kernels
The linear kernel can be computed as follows:
\[K(x_1, x_2) = x_1^T x_2.\]
The polynomial kernel can be computed as follows:
\[K(x_1, x_2) = (\gamma x_1^T x_2 + \text{coef0})^\text{degree}.\]
The radial basis function kernel can be computed as follows:
\[K(x_1, x_2) = \exp \left( - \gamma || x_1 - x_2 ||_2^2 \right).\]
The sigmoid kernel can be computed as follows:
\[K(x_1, x_2) = \tanh(\gamma x_1^T x_2 + \text{coef0}). \]
The rational-quadratic kernel can be computed as follows:
\[K(x_1, x_2) = 1 - \frac{||x_1 - x_2||_2^2}{||x_1-x_2||_2^2 + {\text{coef0}}} = \frac{\text{coef0}}{||x_1-x_2||_2^2 + {\text{coef0}}}.\]
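For reference, here is a minimal, self-contained sketch of these formulas over `std::vector<double>` points. The free-function names, signatures, and the explicit `gamma`/`coef0`/`degree` parameters are illustrative only: class `Kernel` in `Kernel/Kernel.cpp` has its own interface, which you should follow.

```cpp
// Hedged sketch of the five kernels; adapt names and signatures to class Kernel.
#include <cmath>
#include <cstddef>
#include <vector>

// Dot product x1^T x2, shared by the linear, polynomial and sigmoid kernels.
double dot(const std::vector<double>& x1, const std::vector<double>& x2) {
    double s = 0.0;
    for (std::size_t k = 0; k < x1.size(); ++k) s += x1[k] * x2[k];
    return s;
}

// Squared Euclidean distance ||x1 - x2||_2^2, shared by rbf and rational-quadratic.
double sq_dist(const std::vector<double>& x1, const std::vector<double>& x2) {
    double s = 0.0;
    for (std::size_t k = 0; k < x1.size(); ++k) s += (x1[k] - x2[k]) * (x1[k] - x2[k]);
    return s;
}

double linear(const std::vector<double>& x1, const std::vector<double>& x2) {
    return dot(x1, x2);
}

double polynomial(const std::vector<double>& x1, const std::vector<double>& x2,
                  double gamma, double coef0, int degree) {
    return std::pow(gamma * dot(x1, x2) + coef0, degree);
}

double rbf(const std::vector<double>& x1, const std::vector<double>& x2,
           double gamma) {
    return std::exp(-gamma * sq_dist(x1, x2));
}

double sigmoid(const std::vector<double>& x1, const std::vector<double>& x2,
               double gamma, double coef0) {
    return std::tanh(gamma * dot(x1, x2) + coef0);
}

double rational_quadratic(const std::vector<double>& x1, const std::vector<double>& x2,
                          double coef0) {
    return coef0 / (sq_dist(x1, x2) + coef0);
}
```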
Implement the linear, polynomial, rbf, sigmoid and rational-quadratic kernels in the eponymous methods of class `Kernel` in `Kernel/Kernel.cpp`. You can test your implementation with `make grader` and `./grader 1`.
Upload your file `Kernel.cpp` here:
3.2 SVM constructor
Implement class `SVM`’s constructor in `Svm/Svm.cpp`. Most of it is already done, but you still need to populate `train_features` and `train_labels` from the provided dataset (yes, again!). However, remember that labels are usually given in 0/1 form whereas SVM assumes -1/+1 labels; hence, you also need to transform the labels from 0/1 to -1/+1 in the constructor.
Implement `compute_kernel`, which should populate the attribute `computed_kernel` with \(y_i y_j K(\boldsymbol{x}_i, \boldsymbol{x}_j)\) (as can be seen in Section 2 (Theory), this term appears both in the objective and in the gradient; note that it is symmetric w.r.t. \(i\) and \(j\)).
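As a hint, here is a hedged sketch of this computation written as a free function over plain `std::vector` containers; in the handout, this logic belongs in the `compute_kernel` method and should use the actual attributes of class `SVM`.

```cpp
#include <cstddef>
#include <vector>

// Illustrative stand-in for compute_kernel: fills Kmat[i][j] = y_i y_j K(x_i, x_j).
std::vector<std::vector<double>> compute_kernel_matrix(
        const std::vector<std::vector<double>>& X,   // stand-in for train_features
        const std::vector<int>& y,                   // stand-in for train_labels, in {-1,+1}
        double (*k)(const std::vector<double>&, const std::vector<double>&)) {
    std::size_t n = X.size();
    std::vector<std::vector<double>> Kmat(n, std::vector<double>(n));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i; j < n; ++j) {  // exploit the symmetry in i and j
            double v = y[i] * y[j] * k(X[i], X[j]);
            Kmat[i][j] = v;
            Kmat[j][i] = v;
        }
    }
    return Kmat;
}
```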
You may test your code with `./grader 2`. Upload your file `Svm.cpp` here:
3.3 Compute beta_0
Assuming \(\alpha = (\alpha_i)_{i=1}^n\) has been correctly estimated, compute the intercept term in `compute_beta_0`:
\[ \beta_0 = \frac{1}{n_S} \sum_{s \in S} \left( \sum_{i=1}^n \alpha_i y_i K(\boldsymbol{x}_i, \boldsymbol{x}_s) - y_s \right). \]
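A possible implementation sketch, under the same illustrative representation as above. It reuses `computed_kernel[i][s]` \(= y_i y_s K(\boldsymbol{x}_i, \boldsymbol{x}_s)\), from which \(\alpha_i y_i K(\boldsymbol{x}_i, \boldsymbol{x}_s)\) is recovered by multiplying by \(y_s\) (since \(y_s^2 = 1\)).

```cpp
#include <cstddef>
#include <vector>

// Illustrative stand-in for the compute_beta_0 method of class SVM.
double compute_beta_0(const std::vector<double>& alpha,
                      const std::vector<int>& y,   // labels in {-1,+1}
                      const std::vector<std::vector<double>>& computed_kernel,
                      double C) {
    double beta_0 = 0.0;
    int n_support = 0;
    std::size_t n = alpha.size();
    for (std::size_t s = 0; s < n; ++s) {
        if (alpha[s] <= 0.0 || alpha[s] >= C) continue;  // keep S = {s : 0 < alpha_s < C}
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            sum += alpha[i] * y[s] * computed_kernel[i][s];  // = alpha_i y_i K(x_i, x_s)
        beta_0 += sum - y[s];
        ++n_support;
    }
    return n_support > 0 ? beta_0 / n_support : 0.0;  // average over support vectors
}
```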
You may test your code with `./grader 3`. Upload your file `Svm.cpp` here:
3.4 Training the SVM
Implement projected gradient ascent on \(\alpha\) by completing the method `train` in class `SVM`. The pseudo-code is the following (a hedged C++ sketch is given right after it):
- `stop` := False
- While not `stop` do:
  - `stop` := True
  - For i from 1 to n do:
    - Compute the `gradient` of the loss w.r.t. \(\alpha_i\) (Eq. 4)
    - Update \(\alpha_i\) (Eq. 5; \(\eta\) is the learning rate `lr`)
    - Clip \(\alpha_i\) if necessary (Eq. 6)
    - If clipping was not necessary and |`gradient`| > `stopping_criterion`, `stop` := False
- Compute \(\beta_0\)
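As announced, here is a minimal C++ sketch of this loop, using the same illustrative names as in the previous sketches (`alpha`, `computed_kernel`, learning rate `lr`, `stopping_criterion`); adapt it to the actual attributes of class `SVM`.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative stand-in for the train method: projected gradient ascent on alpha.
void train_pga(std::vector<double>& alpha,
               const std::vector<std::vector<double>>& computed_kernel,  // y_i y_j K(x_i, x_j)
               double C, double lr, double stopping_criterion) {
    std::size_t n = alpha.size();
    bool stop = false;
    while (!stop) {
        stop = true;
        for (std::size_t i = 0; i < n; ++i) {
            // Gradient of L_D w.r.t. alpha_i (Eq. 4).
            double gradient = 1.0;
            for (std::size_t j = 0; j < n; ++j)
                gradient -= alpha[j] * computed_kernel[i][j];
            // Ascent step (Eq. 5).
            double updated = alpha[i] + lr * gradient;
            // Projection onto [0, C] by clipping (Eq. 6).
            double clipped = std::max(0.0, std::min(C, updated));
            alpha[i] = clipped;
            // Keep iterating while some unclipped coordinate still moves.
            if (clipped == updated && std::abs(gradient) > stopping_criterion)
                stop = false;
        }
    }
    // Then compute beta_0 from the final alpha (Section 3.3).
}
```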
Do not get stuck on this exercise: if `./grader 4` is failing, implement the subsequent exercises.
The current tests rely on your implementation of `train`, but we may change this behaviour to allow `f_hat` and `test` to be tested successfully irrespective of `train`.
You may test your code with `./grader 4`. Upload your file `Svm.cpp` here:
3.5 Predict from a (trained) SVM
For a new query point \(\boldsymbol{x}\), the class prediction of SVM is given by:
\[ \hat{f}(\boldsymbol{x}) = \text{sign} \left[ \Phi(\boldsymbol{x})^T \beta - \beta_0 \right] = \text{sign} \left[ \sum_{i=1}^n \alpha_i y_i K(\boldsymbol{x}_i, \boldsymbol{x}) - \beta_0 \right] \]
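A hedged free-function sketch of this prediction, with the same illustrative conventions as before (in the handout, `f_hat` is a method of class `SVM` and uses its attributes):

```cpp
#include <cstddef>
#include <vector>

// Illustrative stand-in for f_hat: sign of the kernel decision function.
int f_hat(const std::vector<double>& x,
          const std::vector<double>& alpha,
          const std::vector<int>& y,                       // train_labels in {-1,+1}
          const std::vector<std::vector<double>>& X,       // train_features
          double beta_0,
          double (*k)(const std::vector<double>&, const std::vector<double>&)) {
    double sum = -beta_0;
    for (std::size_t i = 0; i < alpha.size(); ++i)
        sum += alpha[i] * y[i] * k(X[i], x);
    return sum >= 0.0 ? +1 : -1;  // sign[...]; ties mapped to +1 by convention
}
```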
Implement this formula in `f_hat`. You may test your code with `./grader 5`. Upload your file `Svm.cpp` here:
3.6 Test the SVM
We provide the `ConfusionMatrix` that you hopefully managed to implement during TD6. Given a test dataset, extract (yet again!) the relevant columns into `test_labels` and `test_features`, use `f_hat` to predict the labels of the test samples, and fill the `ConfusionMatrix`.
You may test your code with `./grader 6`. Upload your file `Svm.cpp` here:
In TD6, you explored mail classification with \(k\)-NN. You can compare the performance of your \(k\)-NN implementation with the performance of SVM. Did we improve it further? You can play around in `grading.cpp` and with `./grader 7` (not tested on the server, run locally) to find out!
References
[1] Chih-Chung Chang and Chih-Jen Lin (2001): LIBSVM: a library for Support Vector Machines. (Last updated November 2019.)
[2] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin (2003): A Practical Guide to Support Vector Classification. (Last updated May 2016.)