Implementing the Multiclass SVM with CVXOPT

CVXOPT is a popular Python library for convex optimization, which forms the basis of Support Vector Machines (SVMs). However converting mathematical expressions into the format required by CVXOPT can be non-trivial even for a binary-class SVM. Recently, I had to implement a multiclass SVM using CVXOPT for my Machine Learning course at the Indian Institute of Science. In this post, I will share the formulation and code that I used to implement the multiclass SVM. I am aware of alternative techniques such as one-vs-all and one-vs-one which utilize binary-class SVMs for multiclass classification, but I was required to implement the multiclass SVM directly during my course and that is what we will be discussing here. The direct formulation provides better performance and is less sensitive to issues related to class imbalance, but I haven’t yet looked into whether it is really a better option in practice. The reason for me to write this post is that there are very few resources available online for the direct formulation, because it seems that most people prefer the one-vs-all or one-vs-one approaches.

This formulation is based on On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines by Kolby Crammer and Yoram Singer (2002) in the Journal of Machine Learning Research. The first section of this post might seem a bit mathematical, but I have tried to keep it as simple and explanatory as possible. The implementation is in the second section.

Problem Formulation

The formulation of the multiclass SVM problem is very similar to the binary-class SVM, except that we now have multiple hyperplanes given by the vectors \(w_1, w_2, \ldots, w_K\) - one for each class. The classifier, \(f\) is then given by the hyperplane that maximizes the positive margin for a given input \(x\).

\[f(x) = \arg \max_{k \in [K]} w_k^T x\]

Suppose that our training data is given by \(\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n\). Then the multiclass SVM problem can be formulated as follows.

\[\min_{w_1, \ldots, w_K, \xi} \frac{1}{2} \sum_{k=1}^K ||w_k||^2 + C \sum_{i=1}^n \xi_i\]

Here \(\xi_i\) are slack variables, and \(C\) is a hyperparameter. Let \(\delta\) denote the Kronecker delta function. Then the constraints \(w_{y_i}^T x_i - w_k^T x_i \ge 1 - \xi_i\) for all \(i \in [n]\) and \(k \in [K] \setminus \{y_i\}\) and \(\xi_i \ge 0\) for all \(i \in [n]\) can be written succinctly as

\[w_{y_i}^T x_i - w_k^T x_i + \delta_{y_i,k} \ge 1 - \xi_i \quad \forall i \in [n], k \in [K]\]

We first write the Lagrangian for the primal problem, by introducing the dual variables $\alpha = (\alpha_1, \ldots, \alpha_n)$. Let $w$ denote $(w_1, \ldots, w_K)$.

\[\begin{align*} \mathcal{L}(w, \xi, \alpha) &= \frac{1}{2} \sum_{k=1}^K \|w_k\|^2 + C \sum_{i=1}^n \xi_i + \sum_{i=1}^n \sum_{k=1}^K \alpha_{ik} (1 - \xi_i + (w_k - w_{y_i})^T x_i - \delta_{y_i,k}) \end{align*}\]

Here, $\alpha_{ij} \ge 0$ for all $i \in [n]$ and $j \in [K]$.

Taking the partial derivative with respect to $\xi_i$ and setting it to zero, we get that for all $i \in [n]$,

\[\begin{align*} \frac{\partial \mathcal{L}}{\partial \xi_i} &= C - \sum_{k=1}^K \alpha_{ik} = 0 \\ C &= \sum_{k=1}^K \alpha_{ik} \end{align*}\]

Now, taking the gradient with respect to $w_j$ and setting it to zero, we get that for any $j \in [K]$,

\[\begin{align*} \nabla_{w_j} \mathcal{L} &= w_j - \sum_{y_i = j} \sum_{\substack{k=1 \\ k\neq j}}^K \alpha_{ik} x_i + \sum_{y_i \neq j} \alpha_{ij} x_i = 0 \\ 0 &= w_j - \sum_{y_i = j} \sum_{k=1}^K \alpha_{ik} x_i + \sum_{i = 1}^n \alpha_{ij} x_i \end{align*}\] \[\begin{align*} w_j &= \sum_{y_i = j} \sum_{k=1}^K \alpha_{ik} x_i - \sum_{i=1}^n \alpha_{ij} x_i = \sum_{y_i = j} x_i \left( \sum_{k=1}^K \alpha_{ik}\right) - \sum_{i=1}^n \alpha_{ij} x_i \\ &= C \sum_{y_i = j} x_i - \sum_{i=1}^n \alpha_{ij} x_i = C \sum_{i=1}^n \delta_{y_i,j} x_i - \sum_{i=1}^n \alpha_{ij} x_i \\ w_j &= \sum_{i=1}^n (C \delta_{y_i,j} - \alpha_{ij}) x_i \end{align*}\]

Substituting these values back into the Lagrangian, we get

\[\begin{align*} \mathcal{L} &= \frac{1}{2} \sum_{k=1}^K \sum_{i, j = 1}^n (C \delta_{y_i,k} - \alpha_{ik}) (C \delta_{y_j,k} - \alpha_{jk}) x_i^T x_j + \sum_{i=1}^n \xi_i\left(C - \sum_{k=1}^K \alpha_{ik}\right) \\ &+ \sum_{i=1}^n \sum_{k=1}^K \alpha_{ik} + \sum_{i=1}^n \sum_{k=1}^K \alpha_{ik} (w_k - w_{y_i})^T x_i - \sum_{i=1}^n \sum_{k=1}^K \alpha_{ik} \delta_{y_i,k} \\ &= \frac{1}{2} \sum_{k=1}^K \sum_{i, j = 1}^n (C \delta_{y_i,k} - \alpha_{ik}) (C \delta_{y_j,k} - \alpha_{jk}) x_i^T x_j + nC + \sum_{i=1}^n \sum_{k=1}^K \alpha_{ik} (w_k - w_{y_i})^T x_i \\ &- \sum_{i=1}^n \sum_{k=1}^K \alpha_{ik} \delta_{y_i,k} \end{align*}\]

Note that

\[\begin{align*} \sum_{i=1}^n \sum_{k=1}^K \alpha_{ik} w_k^T x_i &= \sum_{i=1}^n \sum_{k=1}^K \alpha_{ik} \left( \sum_{j=1}^n (C \delta_{y_j,k} - \alpha_{jk}) x_j \right)^T x_i \\ &= \sum_{i, j = 1}^n x_j^T x_i \left( \sum_{k=1}^K \alpha_{ik} (C \delta_{y_j,k} - \alpha_{jk}) \right) \\ \sum_{i=1}^n \sum_{k=1}^K \alpha_{ik} w_{y_i}^T x_i &= \sum_{i=1}^n \sum_{k=1}^K \alpha_{ik} \left( \sum_{j=1}^n (C \delta_{y_j,y_i} - \alpha_{jy_i}) x_j \right)^T x_i \\ &= \sum_{i, j = 1}^n x_j^T x_i \left( \sum_{k=1}^K \alpha_{ik} (C \delta_{y_j,y_i} - \alpha_{jy_i}) \right) \\ &= \sum_{i, j=1}^n x_i^T x_i (C (C \delta_{y_i,y_i} - \alpha_{jy_i})) \\ &= \sum_{i, j=1}^n x_i^T x_i \left( \sum_{k=1}^K C\delta_{y_i,k}(C\delta_{y_i,k} - \alpha_{jk}) \right) \end{align*}\]

and therefore,

\[\begin{align*} \mathcal{L} &= \frac{1}{2} \sum_{k=1}^K \sum_{i, j = 1}^n (C \delta_{y_i,k} - \alpha_{ik}) (C \delta_{y_j,k} - \alpha_{jk}) x_i^T x_j + nC \\ &- \sum_{i, j=1}^n x_i^T x_j \left( \sum_{k=1}^K (C \delta_{y_i,k} - \alpha_{ik}) (C \delta_{y_j,k} - \alpha_{jk}) \right) - \sum_{i=1}^n \sum_{k=1}^K \alpha_{ik} \delta_{y_i,k} \\ &= -\frac{1}{2} \sum_{k=1}^K \sum_{i, j = 1}^n (C \delta_{y_i,k} - \alpha_{ik}) (C \delta_{y_j,k} - \alpha_{jk}) x_i^T x_j + nC - \sum_{i=1}^n \sum_{k=1}^K \alpha_{ik} \delta_{y_i,k} \end{align*}\]

Let $e_i = (\delta_{i,1}, \ldots, \delta_{i,K})$ for any $i \in [K]$, and let $e = (1, 1, \ldots, 1) \in \mathbb{R}^K$. Then the above equation can be written as

\[\begin{align*} \mathcal{L} &= -\frac{1}{2} \sum_{i, j = 1}^n (Ce_{y_i} - \alpha_i)^T (Ce_{y_j} - \alpha_j) x_i^T x_j + nC - \sum_{i=1}^n \alpha_i^T e_{y_i} \end{align*}\]

Therefore, the dual problem is

\[\begin{align*} \max_{\alpha} &\quad -\frac{1}{2} \sum_{i, j = 1}^n (Ce_{y_i} - \alpha_i)^T (Ce_{y_j} - \alpha_j) x_i^T x_j - \sum_{i=1}^n \alpha_i^T e_{y_i} + nC\\ \text{or} \quad \min_{\alpha} &\quad \frac{1}{2} \sum_{i, j = 1}^n (Ce_{y_i} - \alpha_i)^T (Ce_{y_j} - \alpha_j) x_i^T x_j + \sum_{i=1}^n \alpha_i^T e_{y_i} - nC \\ \text{such that} &\quad \alpha_{ij} \ge 0 \quad \forall i \in [n], j \in [K], \quad \alpha_i^T e = C \quad \forall i \in [n] \end{align*}\]

Let $\lambda_i = Ce_{y_i} - \alpha_i$ for all $i \in [n]$. Note that

\[\begin{align*} \sum_{i=1}^n \alpha_i^T e_{y_i} &= \sum_{i=1}^n (Ce_{y_i} - \lambda_i)^T e_{y_i} = \sum_{i=1}^n Ce_{y_i}^T e_{y_i} - \sum_{i=1}^n \lambda_i^T e_{y_i} \\ &= nC - \sum_{i=1}^n \lambda_i^T e_{y_i} \end{align*}\]

Since $\alpha_i \ge 0$ for all $i \in [n]$, we have that $\lambda_i \le Ce_{y_i}$. Similarly, since $\alpha_i^T e = C$ for all $i \in [n]$, we have that $\lambda_i^T e = 0$. Substituting the values of $\lambda_i$, the dual problem can be written as

\[\begin{align*} \min_{\lambda} &\quad \frac{1}{2} \sum_{i, j = 1}^n (\lambda_i^T \lambda_j) (x_i^T x_j) - \sum_{i=1}^n \lambda_i^T e_{y_i} \\ \text{such that} &\quad \lambda_i \le Ce_{y_i}, \quad \lambda_i^T e = 0 \quad \forall i \in [n] \end{align*}\]

Note that on solving the dual problem, we get the values of $w_j$ as

\[\begin{align*} w_j &= \sum_{i=1}^n \lambda_{ij} x_i \end{align*}\]

for all $j \in [K]$.

This final expression is much simpler than not only the original primal problem, but also most formulations of the binary-class SVM.

Implementation with CVXOPT

CVXOPT provides a simple interface for quadratic programming. We will convert the dual problem into the standard form required by CVXOPT and feed it to the solver. We want a problem of the form

\[\begin{align*} \min_{x} &\quad \frac{1}{2} x^T P x + q^T x \\ \text{such that} &\quad Gx \preccurlyeq h, \quad Ax = b \end{align*}\]

I am providing the $P, q, G, h, A, b$ matrices below and will leave the proof of correctness as an exercise for the reader (it is a simple exercise in linear algebra). Note that I have replaced the inner product $x_i^T x_j$ with the kernel $k(x_i, x_j)$.

\[\begin{align*} P &= \begin{bmatrix} k(x_1, x_1) I_{K} & \cdots & k(x_1, x_n) I_{K} \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) I_{K} & \cdots & k(x_n, x_n) I_{K} \end{bmatrix} \\ q &= \begin{bmatrix} -e_{y_1} \\ \vdots \\ -e_{y_n} \end{bmatrix} \\ G &= -I_{nK} \\ h &= C \begin{bmatrix} e_{y_1} \\ \vdots \\ e_{y_n} \end{bmatrix} \\ A &= \begin{bmatrix} 1_{1 \times K} & 0_{1 \times K} & \cdots & 0_{1 \times K} \\ 0_{1 \times K} & 1_{1 \times K} & \cdots & 0_{1 \times K} \\ \vdots & \vdots & \ddots & \vdots \\ 0_{1 \times K} & 0_{1 \times K} & \cdots & 1_{1 \times K} \end{bmatrix} \\ b &= 0_{n \times 1} \end{align*}\]

The implementation of our Mutliclass SVM is as follows.

import numpy as np

class MultiClassSVM:
    A Support Vector Machine for Multi-Class Classification

    def __init__(self, labels: np.ndarray, train_x: np.ndarray, train_y: np.ndarray):
        Constructor for the MultiClassSVM class that initializes the training data and labels
        :param labels: the labels of the classes
        :param train_x: the training data
        :param train_y: the training labels
        self.labels = labels
        self.train_x = train_x
        self.train_y = train_y

    def train(
        kernel: Callable = Kernels.linear(),
        slack_weight: float = 0,
    ) -> None:
        Train the Multi-Class SVM
        :param kernel: the kernel function to use if provided
        :param slack_weight: the weight of the slack variables (0 for hard margin)
        n = self.train_x.shape[0]
        k = len(self.labels)

        e = np.zeros(n * k)
        for i in range(n):
            e[i * k + list(self.labels).index(self.train_y[i])] = 1

        P = np.zeros((n * k, n * k))
        for i in range(n):
            for j in range(n):
                P[i * k : (i + 1) * k, j * k : (j + 1) * k] = (
                    kernel(self.train_x[i], self.train_x[j]) + 1
                ) * np.eye(k)

        q = -e

        G = np.eye(n * k)
        h = slack_weight * e

        A = np.zeros((n, n * k))
        for i in range(n):
            A[i, i * k : (i + 1) * k] = 1

        b = np.zeros(n)

        sol = solvers.qp(
            matrix(A, (n, n * k), "d"),

        # construct the weight vectors
        W = [None for _ in range(k)]
        for j in range(k):
            W[j] = lambda x, j=j: sum(
                sol["x"][i * k + j] * (kernel(x, self.train_x[i]) + 1)
                for i in range(n)

        # construct the classifier
        def classifier(x):
            res = [W[j](x) for j in range(k)]
            return self.labels[res.index(max(res))]

        return Classifier(classifier)

The Kernel functions can be easily swapped in and out by passing them as arguments to the train function. An example implementation of the Kernels class is as follows.

class Kernels:
    A class for different kernel functions

    def rbf(sigma: float) -> float:
        Radial Basis Function (RBF) kernel
        :param sigma: the width of the kernel
        :return: the kernel function
        return lambda x, y: np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma**2))

    def linear() -> float:
        Linear kernel
        :return: the kernel function
        return lambda x, y: x @ y

The Classifier class simply provides a wrapper around the classifier function.

class Classifier:
    A class for a classifier generated by the SVM

    def __init__(self, classifier: Callable):
        Constructor for the Classifier class
        :param classifier: the classifier function
        self.classifier = classifier

    def predict(self, x: np.ndarray) -> float:
        Predict the class of a data point
        :param x: the data point
        :return: the predicted class
        return self.classifier(x)

    def test(self, test_x: np.ndarray, test_y: np.ndarray) -> float:
        Test the classifier on a test set
        :param test_x: the test data
        :param test_y: the test labels
        :return: the accuracy of the classifier
        correct = 0
        for i in range(test_x.shape[0]):
            if self.predict(test_x[i]) == test_y[i]:
                correct += 1
        return correct / test_x.shape[0]