Support Vector Machine, a Gentle Mathematical Introduction

Posted by louishenrifranc on April 22, 2017

As I was talking with a friend, he explained how a Support Vector Machine had performed really well on his latest biological data set compared to a neural network approach. His data set was obviously not large enough to fit the framework of data-hungry algorithms such as neural networks, so he settled on an SVM.

Apart from using them through scikit-learn, I never had the opportunity to learn how they behave. Having spent the weekend reading about SVMs, here is my attempt to fill that hole.

Margin: geometric and functional

Introduction to the problem

Given \(x\) our attribute vector, and \(w\), \(b\) the learned weights of our algorithm, we want to predict a label \(y \in \{-1, 1\}\). A simple classifier is defined as \(h(x) = g(w^Tx + b)\), with \(g(z) = 1\) if \(z \geq 0\) and \(g(z) = -1\) otherwise.
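As a quick illustration, here is a minimal sketch of this classifier in numpy; the weights and the input below are illustrative placeholders, not values from the post.

```python
import numpy as np

def predict(x, w, b):
    # h(x) = g(w^T x + b), with g(z) = +1 if z >= 0 and -1 otherwise
    return 1 if np.dot(w, x) + b >= 0 else -1

# Illustrative weights and input
print(predict(np.array([2.0, 1.0]), w=np.array([1.0, -1.0]), b=0.5))  # prints 1
```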

Functional Margin

The functional margin is defined as \(\hat{\gamma}^{i} = y^{i}(w^Tx_i + b)\), where \(y^{i} \in \{-1, 1\}\). We want this margin to be as large as possible for every example \((x_i, y_i)\) in our training set: it is easy to see that if \(\hat{\gamma}^i\) is positive, then \(y^i\) and \(w^Tx_i + b\) have the same sign, and if furthermore \(\hat{\gamma}^i\) is large, then the algorithm is confident about its prediction.

Notes

The prediction \(g(w^Tx + b)\) is unchanged if \(w\) and \(b\) are rescaled, while the functional margin scales with them. It might not seem fascinating, but it will help us define our optimization problem.

The functional margin of the whole training set is then defined as:

\[\hat{\gamma} = \min_{i} \hat{\gamma}^{i}\]

where the minimum is taken over all \(i\) pairs of labeled training data.
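As a tiny worked example (the numbers are purely illustrative): take \(w = (3, 4)^T\), \(b = -1\), and the positive example \(x = (1, 1)^T\), \(y = 1\). Then

\[\hat{\gamma} = y\,(w^Tx + b) = 1 \cdot (3 + 4 - 1) = 6.\]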

Geometric Margin

Now consider a linearly separable dataset and a linear boundary (\(H\)) between the two classes. Let \(A\) be a point labelled \(+1\).

Notes

Let's first start from our functional margin definition and prove the orthogonality between \(w\) and (\(H\)).
Given two points \(x_1, x_2\) both on the separating line, they satisfy \(w^Tx_1 + b = 0\) and \(w^Tx_2 + b = 0\) respectively.

If we take the difference between these two equations, we come to the conclusion:

\[w^T(x_1 - x_2) = 0 \tag{1}\]

Hence \(w\) is orthogonal to any segment of the separating line.

  • Now, let's define the geometric margin \(\gamma^i\): it is the distance between the point \(x_i\) and the linear boundary, and the geometric margin of the dataset is the smallest such distance over all points.

Great, but how can we define it geometrically?

  1. Let's start by computing the orthogonal projection of any point \(x_i\) of our dataset onto (\(H\)); this projected point is \(x_i - \gamma^i \frac{w}{\lVert w \rVert}\).

  2. Since the projected point lies on (\(H\)), we can inject it into the boundary equation: \(w^T\left(x_i - \gamma^i \frac{w}{\lVert w \rVert}\right) + b = 0\).

  3. Solving for \(\gamma^i\) gives us: \(\gamma^i = \frac{w^T x_i + b}{\lVert w \rVert}\).

Note that this definition holds only if our training examples have all been correctly separated. More generally, we define for any pair \((x_i, y_i)\) the margin as: \(\gamma^i = y^i\left(\frac{w^T x_i + b}{\lVert w \rVert}\right)\).

The geometric margin of the whole training set is then defined as:

\[\gamma = \min_{i} \gamma^{i}\]

where the minimum is taken over all \(i\) pairs of labeled training data.
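To make the two notions concrete, here is a minimal numpy sketch that computes both margins for a toy dataset; the weights and points are illustrative, not taken from the post.

```python
import numpy as np

# Illustrative toy data and an (already chosen) separating hyperplane
X = np.array([[1.0, 1.0], [2.0, 3.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0])
w = np.array([3.0, 4.0])
b = -1.0

functional = y * (X @ w + b)                # gamma_hat^i = y^i (w^T x_i + b)
geometric = functional / np.linalg.norm(w)  # gamma^i = gamma_hat^i / ||w||

print(functional.min(), geometric.min())    # margins of the whole set: 6.0 1.2
```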

Finding the optimal classifier

First formulation

Nowadays, most machine learning models are defined as optimization problems, and the Support Vector Machine is one of them.

The simplest way to formulate the problem is to maximize the geometric margin directly, which in mathematical symbols is written as:

\[\max_{\gamma, w, b} \; \gamma \quad \text{s.t.} \quad y^i(w^T x_i + b) \geq \gamma \;\; \forall i, \qquad \lVert w \rVert = 1\]

Cool, we're done. Our problem is defined, let's throw it at any optimizer program… If only it were that simple. Let's be honest, this formulation is terrible: the set of points satisfying the constraint \(\lVert w \rVert = 1\) is not convex (even though the norm function itself is convex), which breaks the convexity of the problem.

Our problem is not convex, and that's bad, because we can't use convex optimization. Convex optimization is very robust, and on a side note, I think deep learning research should look more at deep convex neural networks [1].

Hold on, if you don’t trust me, at least trust the Maths:

Both \(w\) and \(-w\) satisfy the norm constraint, since \(\lVert w \rVert = \lVert -w \rVert = 1\); however, their convex combination \(\frac{1}{2}(w + (-w)) = 0\) has norm \(0\), so it does not lie on the unit sphere.

Second formulation

The second formulation is simply a rewriting in terms of the functional margin (i.e. \(\gamma = \frac{\hat{\gamma}}{\lVert w \rVert}\)):

\[\max_{\hat{\gamma}, w, b} \; \frac{\hat{\gamma}}{\lVert w \rVert} \quad \text{s.t.} \quad y^i(w^T x_i + b) \geq \hat{\gamma} \;\; \forall i\]

The problem is now convex in its constraints, but we have a non-convex objective function. Let's dive deeper.

Third formulation

Do you remember when I said that rescaling \(w\) and \(b\) changes neither the prediction nor the geometric margin…

“Don’t speak all together”.

Since \(w\) and \(b\) can be rescaled freely, we can impose \(\hat{\gamma} = 1\). Hence

in our problem we are now looking for \(\displaystyle{\max_{w, b} \frac{1}{\lVert w \rVert}}\) such that, for all \(i\): \(y^i(w^T x_i + b) \geq 1\).

Maximizing \(\frac{1}{\lVert w \rVert}\) is equivalent to minimizing \(\lVert w \rVert\), and we can square the norm without changing the set of solutions. Hence we turn our initial non-convex maximization problem into a quadratic minimization problem.

To keep the derivative clean, we can also add a factor \(\frac{1}{2}\).

And here is our final optimization formulation:

\[\min_{w, b} \; \frac{1}{2}\lVert w \rVert^2 \quad \text{s.t.} \quad y^i(w^T x_i + b) \geq 1 \;\; \forall i\]

This problem is convex in its constraints, and the objective function is also convex. Even though we could be happy with this solution, we can still be smarter, avoid some computations, and extend this formulation to non-separable data. Here lies all the magic of SVM.
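As a sanity check, this final formulation is a standard quadratic program, so any off-the-shelf QP solver can handle it. Below is a minimal sketch using cvxpy on an illustrative toy dataset (cvxpy and the data are my own choices, not part of the post).

```python
import cvxpy as cp
import numpy as np

# Illustrative, linearly separable toy data; labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# min (1/2) ||w||^2   s.t.   y_i (w^T x_i + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(objective, constraints)
problem.solve()

print("w =", w.value, "b =", b.value)
```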

Passing to the dual form

We will use the Lagrangian and Lagrange multipliers. If you don't know what the Lagrangian and the dual form are, I'll write an article about them soon. Let's first rewrite our inequality constraints as functions \(g_i(w) = -\left(y^i(w^T x_i + b) - 1\right) \leq 0\) for all \(i\) pairs of examples.

We can then write the Lagrangian as \( \displaystyle{L(w, b, \alpha) = \frac{1}{2}{\lVert w \rVert}^2 - \sum_{i=1}^n \alpha_i \left[ y^i(w^T x_i + b) - 1 \right] }\), where the \(\alpha_i \geq 0\) are the Lagrange multipliers.

To find an extremum, we compute the partial derivatives with respect to the weights and the bias and set them to zero:

\[\nabla_w L = w - \sum_{i=1}^n \alpha_i y^i x_i = 0 \;\Rightarrow\; w = \sum_{i=1}^n \alpha_i y^i x_i, \qquad \frac{\partial L}{\partial b} = -\sum_{i=1}^n \alpha_i y^i = 0\]

Plugging this back into \(L\), we arrive at:

\[W(\alpha) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n y^i y^j \alpha_i \alpha_j \langle x_i, x_j \rangle\]

This is the function we will be maximizing in our dual formulation, under the constraints \(\alpha_i \geq 0\) and \(\sum_{i=1}^n \alpha_i y^i = 0\).
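For reference, putting the objective and the constraints together, the complete dual problem reads:

\[\max_{\alpha} \; \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n y^i y^j \alpha_i \alpha_j \langle x_i, x_j \rangle \quad \text{s.t.} \quad \alpha_i \geq 0 \;\; \forall i, \qquad \sum_{i=1}^n \alpha_i y^i = 0\]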

Side note

By definition, support vectors are the points closest to the linear boundary. For those points \(x_i\), the functional margin is equal to 1, hence \(g_i(w) = 0\) and the Lagrange multiplier \(\alpha_i\) is allowed to be non-zero; for every other point, complementary slackness forces \(\alpha_i = 0\).

Kernel

Definition

Kernels are magic. Kernels are life. Kernel trick.

Features are computed by transforming our initial attribute vector \(x\) into a space of different dimension. A feature mapping (for example \(\theta(x)\) with components \(\theta(x)_{ij} = x_i x_j\), all pairwise products of attributes) carries the attributes into this new space.

A kernel is defined as an inner product between feature mappings: \(K(x, z) = \theta(x)^T \theta(z)\).

How to use them in the context of SVM

If we look at the formulation of our dual problem, we see an inner product between attribute vectors. With our kernel definition, we can replace this inner product \(\langle x_i, x_j \rangle\) by an inner product of feature mappings \(\langle \theta(x_i), \theta(x_j) \rangle\). The math doesn't change, the convexity properties still hold; the initial data is simply mapped into a new space.
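Concretely, the dual only ever touches the data through the Gram matrix of pairwise (kernel) products, so swapping the plain inner product for a kernel is a one-line change. A minimal sketch (the kernel choice and data are illustrative):

```python
import numpy as np

def quadratic_kernel(x, z):
    # K(x, z) = (x^T z)^2
    return (x @ z) ** 2

# Illustrative data: the dual optimizer only ever sees K[i, j] = K(x_i, x_j)
X = np.array([[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]])
n = len(X)
K = np.array([[quadratic_kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
print(K)
```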

THE Kernel trick

Now the trick about kernels is that the feature mapping can be computationally expensive to compute. For example, in linear regression, we sometimes transform our attributes by replacing \(x\) with all its monomials up to a certain degree. For three attributes \(A = \{a_1, a_2, a_3\}\) and degree 2, we get a vector of dimension \(|A|^2\):

\[\theta(x) = (a_1a_1,\; a_1a_2,\; a_1a_3,\; a_2a_1,\; a_2a_2,\; a_2a_3,\; a_3a_1,\; a_3a_2,\; a_3a_3)^T\]

And that's only degree 2. However, if we can find a kernel function whose underlying feature mapping is exactly this \(\theta\), then the computation only costs \(O(|A|)\). With some simple math, we can find such a kernel \(K\): \(K(x,z) = (x^Tz)^2 = \displaystyle{\sum_{i,j = 1}^{|A|}(x_ix_j)(z_iz_j)}\). And here is the magic: we only need to compute a function which "under the hood" is an inner product of feature mappings, without ever having to build this potentially large vector.
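A quick numerical check of this identity (the vectors are illustrative):

```python
import numpy as np

def theta(x):
    # Explicit degree-2 feature map: all pairwise products x_i * x_j (|A|^2 entries)
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

lhs = np.dot(x, z) ** 2              # kernel trick: O(|A|) work
rhs = np.dot(theta(x), theta(z))     # explicit mapping: O(|A|^2) work

print(lhs, rhs)  # both equal 20.25
```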

Some famous kernels exist in the literature; the Gaussian kernel, \(K(x, z) = \exp\left(-\frac{\lVert x - z \rVert^2}{2\sigma^2}\right)\), is one of them. It has been proved to correspond to a feature mapping into an infinite-dimensional space.

Sometimes it's enough to know that a function is a kernel, without exhibiting the feature mapping whose inner product it computes. Mercer's theorem states that:

Theorem: If, for any finite set of points, the matrix \(K\) of pairwise values \(K(x_i, x_j)\) is symmetric positive semi-definite, then \(K\) is a valid kernel, i.e. there exists a feature mapping \(\theta\) such that \(K(x, z) = \theta(x)^T\theta(z)\).

What if the optimal margin were not the best?

In machine learning, in the context of an optimization problem, perfectly fitting the training data is sometimes not necessary and can lead to bad generalization on unseen data. Hence it is possible to add a penalty term \(\displaystyle{C\sum_{i=1}^n \epsilon_i}\) to the primal formulation, where the slack variables \(\epsilon_i\) relax our \(\hat{\gamma} = 1\) requirement.

It allows some pairs \(i\) to have a functional margin smaller than 1, but not too often, nor "too much": \(y^i(w^T x_i + b) \geq 1 - \epsilon_i\), with \(\epsilon_i \geq 0\).
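Putting the penalty and the relaxed constraints together, the soft-margin primal reads:

\[\min_{w, b, \epsilon} \; \frac{1}{2}\lVert w \rVert^2 + C\sum_{i=1}^n \epsilon_i \quad \text{s.t.} \quad y^i(w^T x_i + b) \geq 1 - \epsilon_i, \;\; \epsilon_i \geq 0 \;\; \forall i\]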

In the dual formulation, this adds a constraint on the \(\alpha_i\), so that \(0 \leq \alpha_i \leq C\).
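In practice this is exactly the C parameter exposed by scikit-learn's SVC; here is a minimal sketch on an illustrative toy dataset (the value of C and the data are my own choices).

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative toy data; the last point sits close to the boundary
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0], [0.5, 0.4]])
y = np.array([1, 1, -1, -1, 1])

# Smaller C tolerates larger slacks (more margin violations allowed);
# larger C approaches the hard-margin behaviour.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.coef_, clf.intercept_)  # learned w and b
print(clf.dual_coef_)             # y_i * alpha_i for the support vectors
```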

Bibliography

  1. Convex Neural Networks, by Yoshua Bengio, Nicolas Le Roux, Pascal Vincent, Olivier Delalleau, and Patrice Marcotte