This formulation is based on On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines by Kolby Crammer and Yoram Singer (2002) in the *Journal of Machine Learning Research*. The first section of this post might seem a bit mathematical, but I have tried to keep it as simple and explanatory as possible. The implementation is in the second section.

The formulation of the multiclass SVM problem is very similar to the binary-class SVM, except that we now have multiple hyperplanes given by the vectors \(w_1, w_2, \ldots, w_K\) - one for each class. The classifier, \(f\) is then given by the hyperplane that maximizes the positive margin for a given input \(x\).

\[f(x) = \arg \max_{k \in [K]} w_k^T x\]Suppose that our training data is given by \(\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n\). Then the multiclass SVM problem can be formulated as follows.

\[\min_{w_1, \ldots, w_K, \xi} \frac{1}{2} \sum_{k=1}^K ||w_k||^2 + C \sum_{i=1}^n \xi_i\]Here \(\xi_i\) are slack variables, and \(C\) is a hyperparameter. Let \(\delta\) denote the Kronecker delta function. Then the constraints \(w_{y_i}^T x_i - w_k^T x_i \ge 1 - \xi_i\) for all \(i \in [n]\) and \(k \in [K] \setminus \{y_i\}\) and \(\xi_i \ge 0\) for all \(i \in [n]\) can be written succinctly as

\[w_{y_i}^T x_i - w_k^T x_i + \delta_{y_i,k} \ge 1 - \xi_i \quad \forall i \in [n], k \in [K]\]We first write the Lagrangian for the primal problem, by introducing the dual variables $\alpha = (\alpha_1, \ldots, \alpha_n)$. Let $w$ denote $(w_1, \ldots, w_K)$.

\[\begin{align*} \mathcal{L}(w, \xi, \alpha) &= \frac{1}{2} \sum_{k=1}^K \|w_k\|^2 + C \sum_{i=1}^n \xi_i + \sum_{i=1}^n \sum_{k=1}^K \alpha_{ik} (1 - \xi_i + (w_k - w_{y_i})^T x_i - \delta_{y_i,k}) \end{align*}\]Here, $\alpha_{ij} \ge 0$ for all $i \in [n]$ and $j \in [K]$.

Taking the partial derivative with respect to $\xi_i$ and setting it to zero, we get that for all $i \in [n]$,

\[\begin{align*} \frac{\partial \mathcal{L}}{\partial \xi_i} &= C - \sum_{k=1}^K \alpha_{ik} = 0 \\ C &= \sum_{k=1}^K \alpha_{ik} \end{align*}\]Now, taking the gradient with respect to $w_j$ and setting it to zero, we get that for any $j \in [K]$,

\[\begin{align*} \nabla_{w_j} \mathcal{L} &= w_j - \sum_{y_i = j} \sum_{\substack{k=1 \\ k\neq j}}^K \alpha_{ik} x_i + \sum_{y_i \neq j} \alpha_{ij} x_i = 0 \\ 0 &= w_j - \sum_{y_i = j} \sum_{k=1}^K \alpha_{ik} x_i + \sum_{i = 1}^n \alpha_{ij} x_i \end{align*}\] \[\begin{align*} w_j &= \sum_{y_i = j} \sum_{k=1}^K \alpha_{ik} x_i - \sum_{i=1}^n \alpha_{ij} x_i = \sum_{y_i = j} x_i \left( \sum_{k=1}^K \alpha_{ik}\right) - \sum_{i=1}^n \alpha_{ij} x_i \\ &= C \sum_{y_i = j} x_i - \sum_{i=1}^n \alpha_{ij} x_i = C \sum_{i=1}^n \delta_{y_i,j} x_i - \sum_{i=1}^n \alpha_{ij} x_i \\ w_j &= \sum_{i=1}^n (C \delta_{y_i,j} - \alpha_{ij}) x_i \end{align*}\]Substituting these values back into the Lagrangian, we get

\[\begin{align*} \mathcal{L} &= \frac{1}{2} \sum_{k=1}^K \sum_{i, j = 1}^n (C \delta_{y_i,k} - \alpha_{ik}) (C \delta_{y_j,k} - \alpha_{jk}) x_i^T x_j + \sum_{i=1}^n \xi_i\left(C - \sum_{k=1}^K \alpha_{ik}\right) \\ &+ \sum_{i=1}^n \sum_{k=1}^K \alpha_{ik} + \sum_{i=1}^n \sum_{k=1}^K \alpha_{ik} (w_k - w_{y_i})^T x_i - \sum_{i=1}^n \sum_{k=1}^K \alpha_{ik} \delta_{y_i,k} \\ &= \frac{1}{2} \sum_{k=1}^K \sum_{i, j = 1}^n (C \delta_{y_i,k} - \alpha_{ik}) (C \delta_{y_j,k} - \alpha_{jk}) x_i^T x_j + nC + \sum_{i=1}^n \sum_{k=1}^K \alpha_{ik} (w_k - w_{y_i})^T x_i \\ &- \sum_{i=1}^n \sum_{k=1}^K \alpha_{ik} \delta_{y_i,k} \end{align*}\]Note that

\[\begin{align*} \sum_{i=1}^n \sum_{k=1}^K \alpha_{ik} w_k^T x_i &= \sum_{i=1}^n \sum_{k=1}^K \alpha_{ik} \left( \sum_{j=1}^n (C \delta_{y_j,k} - \alpha_{jk}) x_j \right)^T x_i \\ &= \sum_{i, j = 1}^n x_j^T x_i \left( \sum_{k=1}^K \alpha_{ik} (C \delta_{y_j,k} - \alpha_{jk}) \right) \\ \sum_{i=1}^n \sum_{k=1}^K \alpha_{ik} w_{y_i}^T x_i &= \sum_{i=1}^n \sum_{k=1}^K \alpha_{ik} \left( \sum_{j=1}^n (C \delta_{y_j,y_i} - \alpha_{jy_i}) x_j \right)^T x_i \\ &= \sum_{i, j = 1}^n x_j^T x_i \left( \sum_{k=1}^K \alpha_{ik} (C \delta_{y_j,y_i} - \alpha_{jy_i}) \right) \\ &= \sum_{i, j=1}^n x_i^T x_i (C (C \delta_{y_i,y_i} - \alpha_{jy_i})) \\ &= \sum_{i, j=1}^n x_i^T x_i \left( \sum_{k=1}^K C\delta_{y_i,k}(C\delta_{y_i,k} - \alpha_{jk}) \right) \end{align*}\]and therefore,

\[\begin{align*} \mathcal{L} &= \frac{1}{2} \sum_{k=1}^K \sum_{i, j = 1}^n (C \delta_{y_i,k} - \alpha_{ik}) (C \delta_{y_j,k} - \alpha_{jk}) x_i^T x_j + nC \\ &- \sum_{i, j=1}^n x_i^T x_j \left( \sum_{k=1}^K (C \delta_{y_i,k} - \alpha_{ik}) (C \delta_{y_j,k} - \alpha_{jk}) \right) - \sum_{i=1}^n \sum_{k=1}^K \alpha_{ik} \delta_{y_i,k} \\ &= -\frac{1}{2} \sum_{k=1}^K \sum_{i, j = 1}^n (C \delta_{y_i,k} - \alpha_{ik}) (C \delta_{y_j,k} - \alpha_{jk}) x_i^T x_j + nC - \sum_{i=1}^n \sum_{k=1}^K \alpha_{ik} \delta_{y_i,k} \end{align*}\]Let $e_i = (\delta_{i,1}, \ldots, \delta_{i,K})$ for any $i \in [K]$, and let $e = (1, 1, \ldots, 1) \in \mathbb{R}^K$. Then the above equation can be written as

\[\begin{align*} \mathcal{L} &= -\frac{1}{2} \sum_{i, j = 1}^n (Ce_{y_i} - \alpha_i)^T (Ce_{y_j} - \alpha_j) x_i^T x_j + nC - \sum_{i=1}^n \alpha_i^T e_{y_i} \end{align*}\]Therefore, the dual problem is

\[\begin{align*} \max_{\alpha} &\quad -\frac{1}{2} \sum_{i, j = 1}^n (Ce_{y_i} - \alpha_i)^T (Ce_{y_j} - \alpha_j) x_i^T x_j - \sum_{i=1}^n \alpha_i^T e_{y_i} + nC\\ \text{or} \quad \min_{\alpha} &\quad \frac{1}{2} \sum_{i, j = 1}^n (Ce_{y_i} - \alpha_i)^T (Ce_{y_j} - \alpha_j) x_i^T x_j + \sum_{i=1}^n \alpha_i^T e_{y_i} - nC \\ \text{such that} &\quad \alpha_{ij} \ge 0 \quad \forall i \in [n], j \in [K], \quad \alpha_i^T e = C \quad \forall i \in [n] \end{align*}\]Let $\lambda_i = Ce_{y_i} - \alpha_i$ for all $i \in [n]$. Note that

\[\begin{align*} \sum_{i=1}^n \alpha_i^T e_{y_i} &= \sum_{i=1}^n (Ce_{y_i} - \lambda_i)^T e_{y_i} = \sum_{i=1}^n Ce_{y_i}^T e_{y_i} - \sum_{i=1}^n \lambda_i^T e_{y_i} \\ &= nC - \sum_{i=1}^n \lambda_i^T e_{y_i} \end{align*}\]Since $\alpha_i \ge 0$ for all $i \in [n]$, we have that $\lambda_i \le Ce_{y_i}$. Similarly, since $\alpha_i^T e = C$ for all $i \in [n]$, we have that $\lambda_i^T e = 0$. Substituting the values of $\lambda_i$, the dual problem can be written as

\[\begin{align*} \min_{\lambda} &\quad \frac{1}{2} \sum_{i, j = 1}^n (\lambda_i^T \lambda_j) (x_i^T x_j) - \sum_{i=1}^n \lambda_i^T e_{y_i} \\ \text{such that} &\quad \lambda_i \le Ce_{y_i}, \quad \lambda_i^T e = 0 \quad \forall i \in [n] \end{align*}\]Note that on solving the dual problem, we get the values of $w_j$ as

\[\begin{align*} w_j &= \sum_{i=1}^n \lambda_{ij} x_i \end{align*}\]for all $j \in [K]$.

This final expression is much simpler than not only the original primal problem, but also most formulations of the binary-class SVM.

CVXOPT provides a simple interface for quadratic programming. We will convert the dual problem into the standard form required by CVXOPT and feed it to the solver. We want a problem of the form

\[\begin{align*} \min_{x} &\quad \frac{1}{2} x^T P x + q^T x \\ \text{such that} &\quad Gx \preccurlyeq h, \quad Ax = b \end{align*}\]I am providing the $P, q, G, h, A, b$ matrices below and will leave the proof of correctness as an exercise for the reader (it is a simple exercise in linear algebra). Note that I have replaced the inner product $x_i^T x_j$ with the kernel $k(x_i, x_j)$.

\[\begin{align*} P &= \begin{bmatrix} k(x_1, x_1) I_{K} & \cdots & k(x_1, x_n) I_{K} \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) I_{K} & \cdots & k(x_n, x_n) I_{K} \end{bmatrix} \\ q &= \begin{bmatrix} -e_{y_1} \\ \vdots \\ -e_{y_n} \end{bmatrix} \\ G &= -I_{nK} \\ h &= C \begin{bmatrix} e_{y_1} \\ \vdots \\ e_{y_n} \end{bmatrix} \\ A &= \begin{bmatrix} 1_{1 \times K} & 0_{1 \times K} & \cdots & 0_{1 \times K} \\ 0_{1 \times K} & 1_{1 \times K} & \cdots & 0_{1 \times K} \\ \vdots & \vdots & \ddots & \vdots \\ 0_{1 \times K} & 0_{1 \times K} & \cdots & 1_{1 \times K} \end{bmatrix} \\ b &= 0_{n \times 1} \end{align*}\]The implementation of our Mutliclass SVM is as follows.

```
import numpy as np
class MultiClassSVM:
"""
A Support Vector Machine for Multi-Class Classification
"""
def __init__(self, labels: np.ndarray, train_x: np.ndarray, train_y: np.ndarray):
"""
Constructor for the MultiClassSVM class that initializes the training data and labels
:param labels: the labels of the classes
:param train_x: the training data
:param train_y: the training labels
"""
self.labels = labels
self.train_x = train_x
self.train_y = train_y
def train(
self,
kernel: Callable = Kernels.linear(),
slack_weight: float = 0,
) -> None:
"""
Train the Multi-Class SVM
:param kernel: the kernel function to use if provided
:param slack_weight: the weight of the slack variables (0 for hard margin)
"""
n = self.train_x.shape[0]
k = len(self.labels)
e = np.zeros(n * k)
for i in range(n):
e[i * k + list(self.labels).index(self.train_y[i])] = 1
P = np.zeros((n * k, n * k))
for i in range(n):
for j in range(n):
P[i * k : (i + 1) * k, j * k : (j + 1) * k] = (
kernel(self.train_x[i], self.train_x[j]) + 1
) * np.eye(k)
q = -e
G = np.eye(n * k)
h = slack_weight * e
A = np.zeros((n, n * k))
for i in range(n):
A[i, i * k : (i + 1) * k] = 1
b = np.zeros(n)
sol = solvers.qp(
matrix(P),
matrix(q),
matrix(G),
matrix(h),
matrix(A, (n, n * k), "d"),
matrix(b),
)
# construct the weight vectors
W = [None for _ in range(k)]
for j in range(k):
W[j] = lambda x, j=j: sum(
sol["x"][i * k + j] * (kernel(x, self.train_x[i]) + 1)
for i in range(n)
)
# construct the classifier
def classifier(x):
res = [W[j](x) for j in range(k)]
return self.labels[res.index(max(res))]
return Classifier(classifier)
```

The Kernel functions can be easily swapped in and out by passing them as arguments to the `train`

function. An example implementation of the `Kernels`

class is as follows.

```
class Kernels:
"""
A class for different kernel functions
"""
@staticmethod
def rbf(sigma: float) -> float:
"""
Radial Basis Function (RBF) kernel
:param sigma: the width of the kernel
:return: the kernel function
"""
return lambda x, y: np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma**2))
@staticmethod
def linear() -> float:
"""
Linear kernel
:return: the kernel function
"""
return lambda x, y: x @ y
```

The `Classifier`

class simply provides a wrapper around the classifier function.

```
class Classifier:
"""
A class for a classifier generated by the SVM
"""
def __init__(self, classifier: Callable):
"""
Constructor for the Classifier class
:param classifier: the classifier function
"""
self.classifier = classifier
def predict(self, x: np.ndarray) -> float:
"""
Predict the class of a data point
:param x: the data point
:return: the predicted class
"""
return self.classifier(x)
def test(self, test_x: np.ndarray, test_y: np.ndarray) -> float:
"""
Test the classifier on a test set
:param test_x: the test data
:param test_y: the test labels
:return: the accuracy of the classifier
"""
correct = 0
for i in range(test_x.shape[0]):
if self.predict(test_x[i]) == test_y[i]:
correct += 1
return correct / test_x.shape[0]
```

and I didn’t seem quite right.

I supposed I need to move nearer

or maybe just turn on the tubelight.

I realised the mirror is stained

and so I better wipe it clean.

Those spots however seemed ingrained

and so I replaced it with one that was pristine.

But now the mirror appeared warped

as the reflection was mine but not really.

“My nose looks a bit too long,” I harped

and replaced it again, though now wearily.

And just like that, it so happened

that every mirror I brought was flawed.

Mirror after mirror, I was rather maddended

but who knew I was hitting the wrong chord.

The distortions are my own

from the stains to the contortions.

Afterall mirrors are just shiny stone

that occasionally teach me a little precaution.