Solve and implement Support Vector Machine (Part 1)
Table of contents:
1. Introduction
2. Formalize the SVM problem
3. Find solution for SVM
4. Implement in Python
5. Evaluation
1. Introduction
Support vector machines (SVMs) are supervised learning models for classification and regression analysis. SVM was developed by Vladimir Vapnik and his colleagues at AT&T Bell Labs. The development of SVMs started in 1962, and they were first published in 1964 [1]. SVMs and kernel methods were extremely popular from the 1990s until the early 2000s.
2. Formalize the SVM problem
Recall the distance from a point to a line in two-dimensional space and from a point to a plane in three-dimensional space. Generalizing to m-dimensional space, we get the distance from a point to a hyperplane.
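In the usual notation, with w as the coefficient (normal) vector and b as the offset, the same w and b used in the rest of this post, these distances are:

For the line w_1 x + w_2 y + b = 0 and a point (x_0, y_0):

d = \frac{|w_1 x_0 + w_2 y_0 + b|}{\sqrt{w_1^2 + w_2^2}}

For the plane w_1 x + w_2 y + w_3 z + b = 0 and a point (x_0, y_0, z_0):

d = \frac{|w_1 x_0 + w_2 y_0 + w_3 z_0 + b|}{\sqrt{w_1^2 + w_2^2 + w_3^2}}

For the hyperplane \mathbf{w}^\top \mathbf{x} + b = 0 in m dimensions and a point \mathbf{x}_0:

d = \frac{|\mathbf{w}^\top \mathbf{x}_0 + b|}{\lVert \mathbf{w} \rVert}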
Suppose that we are working on binary classification. The positive label is +1 and the negative label is -1. The idea of SVM is to find an optimal hyperplane that separates the two classes.
In the formula for the distance from a point to a hyperplane, if we drop the absolute value, the sign of w^T x_0 + b tells us which side of the hyperplane the point lies on.
Consider n datapoints x1, …, xn with corresponding labels y1, …, yn. Suppose that the positive group lies on the positive side and the negative group lies on the negative side of the hyperplane. Then we always have:
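(this is the sign observation above, written once per datapoint)

y_i(\mathbf{w}^\top \mathbf{x}_i + b) > 0 \quad \text{for all } i = 1, \dots, n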
If the two groups are placed on the wrong sides, we can multiply w and b by -1; the hyperplane itself doesn't change.
According to SVM, the optimal hyperplane is the one that maximizes the margin. The margin is the distance from the closest points to the hyperplane.
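Using the distance formula above (with the absolute value replaced by y_i(...), since every point is on its correct side), the margin of a separating hyperplane can be written as:

\text{margin}(\mathbf{w}, b) = \min_{i} \frac{y_i(\mathbf{w}^\top \mathbf{x}_i + b)}{\lVert \mathbf{w} \rVert}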
Notice that if we replace w with kw and b with kb for any k > 0, the hyperplane doesn't change (and neither does the margin).
Therefore, we can assume that, for the points closest to the hyperplane:
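(i.e. we rescale w and b so that the closest points achieve exactly 1)

\min_{i}\ y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1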
Then the objective function becomes:
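(with this normalization, maximizing the margin 1/‖w‖ is equivalent to the standard quadratic form)

\min_{\mathbf{w},\, b}\ \frac{1}{2}\lVert\mathbf{w}\rVert^2 \quad \text{subject to} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1, \quad i = 1, \dots, n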
3. Find solution for SVM
Recall the objective function above. This is a quadratic program, so we could solve it with a quadratic solver like cvxopt. However, it doesn't directly match the standard form provided by cvxopt.
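For reference, cvxopt.solvers.qp solves quadratic programs of the form:

\min_{\mathbf{x}}\ \frac{1}{2}\mathbf{x}^\top P \mathbf{x} + \mathbf{q}^\top \mathbf{x} \quad \text{subject to} \quad G\mathbf{x} \preceq \mathbf{h}, \quad A\mathbf{x} = \mathbf{b}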
Let's consider its dual problem.
Firstly, the objective function ½‖w‖² is strictly convex in w: its Hessian is the identity matrix, which is positive definite.
Secondly, we check Slater's condition. The reason is to make sure that strong duality holds, so that solving the dual problem also gives us the solution of the primal problem.
Consider a convex optimization problem with objective f_0, inequality constraints f_i(x) ≤ 0, and affine equality constraints Ax = b. Slater's condition states that strong duality holds if there exists a strictly feasible point:
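(the generic convex problem and the Slater point, stated in textbook notation; D denotes the common domain of the functions f_i)

\min_{\mathbf{x}}\ f_0(\mathbf{x}) \quad \text{subject to} \quad f_i(\mathbf{x}) \le 0,\ i = 1, \dots, k, \qquad A\mathbf{x} = \mathbf{b}

\exists\, \mathbf{x} \in \operatorname{relint}(D): \quad f_i(\mathbf{x}) < 0 \ \text{for all } i \quad \text{and} \quad A\mathbf{x} = \mathbf{b}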
If strong duality holds, the duality gap is zero: from the solution of the dual problem we can recover the solution of the primal problem, and since the objective is strictly convex in w, this solution is unique.
Back to the primal problem: the domain D is the whole space, so relint(D) is the whole space as well. Assume that the data is linearly separable; then there is always a hyperplane that satisfies:
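(a separating hyperplane gives y_i(w^T x_i + b) > 0 for all i; scaling w and b by a large enough constant then makes every constraint strictly negative)

y_i(\mathbf{w}^\top \mathbf{x}_i + b) > 1 \quad \text{for all } i, \qquad \text{i.e.} \qquad 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b) < 0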
Hence, Slater's condition is satisfied, and strong duality holds.
The Lagrange function is:
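(introducing one multiplier λ_i ≥ 0 per constraint)

L(\mathbf{w}, b, \boldsymbol{\lambda}) = \frac{1}{2}\lVert\mathbf{w}\rVert^2 + \sum_{i=1}^{n} \lambda_i \left(1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)\right)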
The dual function is:
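(the infimum of the Lagrangian over the primal variables w and b)

g(\boldsymbol{\lambda}) = \inf_{\mathbf{w},\, b} L(\mathbf{w}, b, \boldsymbol{\lambda})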
Consider the Karush–Kuhn–Tucker (KKT) conditions, each spelled out right after this list:
- Stationarity:
- Primal feasibility:
- Dual feasibility:
- Complementary slackness:
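(for our problem, in the notation above, these four conditions read)

Stationarity: \frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{n} \lambda_i y_i \mathbf{x}_i = 0, \qquad \frac{\partial L}{\partial b} = -\sum_{i=1}^{n} \lambda_i y_i = 0

Primal feasibility: 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b) \le 0 \quad \text{for all } i

Dual feasibility: \lambda_i \ge 0 \quad \text{for all } i

Complementary slackness: \lambda_i \left(1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)\right) = 0 \quad \text{for all } i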
Plugging the two stationarity equations into the dual function, we have:
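(substituting w = Σ λ_i y_i x_i and Σ λ_i y_i = 0 into the Lagrangian)

g(\boldsymbol{\lambda}) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j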
The Lagrange dual problem is:
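(maximize the dual function subject to the remaining constraints on λ)

\max_{\boldsymbol{\lambda}}\ \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j \quad \text{subject to} \quad \lambda_i \ge 0, \quad \sum_{i=1}^{n} \lambda_i y_i = 0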
After flipping the sign to turn the maximization into a minimization, the objective is convex and its form matches the standard form provided by cvxopt.
4. Implement in Python
We create a class for SVM. This class inherits from the class BaseModel, which provides two abstract methods, train() and predict().
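A minimal sketch of what such a base class could look like (the actual BaseModel in the repository may differ):

from abc import ABC, abstractmethod


class BaseModel(ABC):
    """Base class sketch: concrete models must implement train() and predict()."""

    @abstractmethod
    def train(self, X, y):
        """Fit the model on features X of shape (n_samples, n_features) and labels y."""

    @abstractmethod
    def predict(self, X):
        """Return predicted labels for X."""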
In the function train(), we find the solution for w and b. Recall the Lagrange dual problem and the standard form of quadratic programming provided by cvxopt. Matching the two (with lambda playing the role of the variable x), the parameters P, q, G, h, A, b are:
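(this is the mapping implied by the code below; here b is the scalar of the equality constraint, not the hyperplane offset)

P_{ij} = y_i y_j \mathbf{x}_i^\top \mathbf{x}_j, \qquad \mathbf{q} = -\mathbf{1}, \qquad G = -I, \qquad \mathbf{h} = \mathbf{0}, \qquad A = \mathbf{y}^\top, \qquad b = 0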
We pass these parameters to the solver and get lambda back:
import numpy as np
import cvxopt

n_samples, _ = X.shape
# column i of K is y_i * x_i, so (K.T).dot(K) has entries y_i * y_j * x_i^T x_j
K = (X * y[:, np.newaxis]).T
P = cvxopt.matrix(K.T.dot(K)) # P has shape n*n
q = cvxopt.matrix(-1 * np.ones(n_samples)) # q has shape n*1
G = cvxopt.matrix(-1 * np.identity(n_samples)) # G = -I and h = 0 encode lambda >= 0
h = cvxopt.matrix(np.zeros(n_samples))
A = cvxopt.matrix(1.0 * y, (1, n_samples)) # A = y^T encodes sum_i lambda_i * y_i = 0
b = cvxopt.matrix(0.0)
# solve the quadratic program
cvxopt.solvers.options['show_progress'] = False
solution = cvxopt.solvers.qp(P, q, G, h, A, b)
_lambda = np.ravel(solution['x'])
Having lambda, we find w and b using the stationarity and complementary slackness conditions:
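(w comes from stationarity; b comes from complementary slackness applied to any support vector x_s, averaged in the code over all support vectors for numerical stability)

\mathbf{w} = \sum_{i:\ \lambda_i > 0} \lambda_i y_i \mathbf{x}_i, \qquad b = y_s - \mathbf{w}^\top \mathbf{x}_s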
We are interested in the datapoints whose lambda is strictly greater than zero. By complementary slackness, these are the datapoints that lie exactly on the margin (the dashed lines in the plot). They are called support vectors, since they are the only points needed to determine w and b. The implementation follows:
S = np.where(_lambda > 1e-10)[0] # indices where lambda is not (numerically) zero
self._w = K[:, S].dot(_lambda[S]) # w = sum over support vectors of lambda_i * y_i * x_i
self._b = np.mean(y[S] - X[S, :].dot(self._w)) # average of y_s - w^T x_s over support vectors
For the function predict(), we compute the inner product of w and x, add b, and take the sign:
results = np.sign(X.dot(self._w) + self._b)
results[results == 0] = 1 # points exactly on the hyperplane are assigned to the positive class
5. Evaluation
We generate a test dataset with the function make_blobs of sklearn.
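A sketch of how such a dataset could be generated; the parameter values here are illustrative guesses, not necessarily the ones used in the notebook:

import numpy as np
from sklearn.datasets import make_blobs

# two well-separated clusters, since the hard-margin SVM needs linearly separable data
X, y = make_blobs(
    n_samples=100,
    centers=[[100, 100], [300, 300]],  # illustrative centers
    cluster_std=30,
    random_state=42,
)
y = np.where(y == 0, -1.0, 1.0)  # the solver expects labels in {-1, +1}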
We fit the model and plot the result:
classifier = SVM()
classifier.train(X, y)
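One possible way to plot the data together with the decision boundary and the margins, using the _w and _b attributes set in train() (again a sketch, not the notebook's exact code, and assuming w[1] != 0):

import matplotlib.pyplot as plt

w, b = classifier._w, classifier._b
xs = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
plt.scatter(X[:, 0], X[:, 1], c=y)
# decision boundary w^T x + b = 0 and margins w^T x + b = +/- 1, solved for the second coordinate
plt.plot(xs, -(w[0] * xs + b) / w[1], 'k-')
plt.plot(xs, -(w[0] * xs + b - 1) / w[1], 'k--')
plt.plot(xs, -(w[0] * xs + b + 1) / w[1], 'k--')
plt.show()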
There are two support vectors near the orange region and two support vectors near the turquoise region.
Now we test the model with two new datapoints:
X_new = np.array([[300, 300], [150, 200]])
y_new = classifier.predict(X_new)
My code is accessible in this repository: https://github.com/nguyenhaidang94/ml-from-scratch
The SVM model: https://github.com/nguyenhaidang94/ml-from-scratch/blob/main/src/models/svm.py
The notebook to evaluate the result: https://github.com/nguyenhaidang94/ml-from-scratch/blob/main/notebooks/svm.ipynb
[1] Schölkopf, Bernhard, Zhiyuan Luo, and Vladimir Vovk, eds. Empirical inference: Festschrift in honor of Vladimir N. Vapnik. Springer Science & Business Media, 2013.