4.1 Discriminant Functions

A discriminant is a function that takes the input vector $X$ and assigns it to one of the classes $C_k$. When the decision surfaces are hyperplanes, the discriminants are called linear discriminants.

4.1.1 Two Classes

The simplest representation of a linear discriminant function is

$$\begin{align} y(X) = W^TX + W_0 \end{align}$$

where $W$ is called the weight vector and $W_0$ is the bias. The negative of the bias is called the threshold. The input vector is assigned to class $C_1$ if $y(X) \geq 0$ and to $C_2$ otherwise. The decision boundary is defined by the relation $y(X) = 0$. Let $X_A, X_B$ be two points lying on the decision surface. Then $y(X_A) = y(X_B) = 0$ and hence $W^T(X_A - X_B) = 0$. This means that the weight vector $W$ is orthogonal to every vector lying in the decision surface, i.e. $W$ determines the orientation of the decision surface.

Now let $X$ be the point on the decision surface closest to the origin. This point can be written as $X = \alpha W$, since the shortest vector from the origin to the hyperplane points along the perpendicular direction, which is the direction of $W$. As $X$ lies on the hyperplane, $y(X) = W^TX + W_0 = \alpha W^TW + W_0 = 0$ and hence $\alpha = -W_0/||W||^2$. The distance of this point from the origin is $||X|| = ||\alpha W|| = \frac{-W_0}{||W||^2}||W|| = \frac{-W_0}{||W||}$. Hence, the normal distance of the hyperplane from the origin is given by $\frac{-W_0}{||W||}$, which means that the bias parameter $W_0$ determines the location of the decision surface.
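As a quick numerical illustration, here is a minimal NumPy sketch of the two-class decision rule and the normal distance of the hyperplane from the origin. The weight vector, bias, and test points are arbitrary values chosen for the example, not taken from any dataset.

```python
import numpy as np

# Arbitrary example parameters (not from the text): a 2-D weight vector and bias.
W = np.array([2.0, 1.0])
W0 = -4.0

def y(X):
    """Two-class linear discriminant y(X) = W^T X + W_0."""
    return W @ X + W0

def classify(X):
    """Assign X to C1 if y(X) >= 0, otherwise to C2."""
    return "C1" if y(X) >= 0 else "C2"

# Normal (signed) distance of the decision surface from the origin: -W_0 / ||W||.
origin_distance = -W0 / np.linalg.norm(W)

print(classify(np.array([3.0, 1.0])))   # y = 3 >= 0  -> C1
print(classify(np.array([0.0, 1.0])))   # y = -3 < 0  -> C2
print(origin_distance)                  # 4 / sqrt(5) ≈ 1.789
```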

Another thing to note is that the value of $y(X)$ gives a signed measure of the perpendicular distance $r$ of the point $X$ from the decision surface. Let $X_{\perp}$ be the orthogonal projection of $X$ onto the decision surface; then

$$\begin{align} X = X_{\perp} + r\frac{W}{||W||} \end{align}$$

as $\frac{W}{||W||}$ is the unit vector which is perpendicular to the decision surface. Multiplying both sides by $W^T$ and adding $W_0$, we have

$$\begin{align} W^TX + W_0 = W^TX_{\perp} + rW^T\frac{W}{||W||} + W_0 \end{align}$$

$$\begin{align} y(X) = y(X_{\perp}) + rW^T\frac{W}{||W||} = rW^T\frac{W}{||W||} = r\frac{||W||^2}{||W||} = r||W|| \end{align}$$

$$\begin{align} r = \frac{y(X)}{||W||} \end{align}$$

The geometrical representation of this result is shown in the figure below.
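The signed distance $r = y(X)/||W||$ is easy to verify numerically. The sketch below, reusing the same hypothetical $W$ and $W_0$ as in the earlier snippet, recovers the projection $X_{\perp}$ and checks that it lies on the decision surface.

```python
import numpy as np

W = np.array([2.0, 1.0])   # same illustrative parameters as before
W0 = -4.0

X = np.array([3.0, 1.0])
y_X = W @ X + W0
r = y_X / np.linalg.norm(W)               # signed perpendicular distance to the surface

# Orthogonal projection of X onto the decision surface: X_perp = X - r * W / ||W||.
X_perp = X - r * W / np.linalg.norm(W)

print(r)                                  # 3 / sqrt(5) ≈ 1.342
print(W @ X_perp + W0)                    # ≈ 0, i.e. X_perp lies on y(X) = 0
```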

We can absorb the bias into the weight vector by defining a new weight vector $\tilde{W} = (W_0, W)$ and an augmented input vector $\tilde{X} = (1,X)$, which gives the updated equation

$$\begin{align} y(X) = \tilde{W}^T\tilde{X} \end{align}$$
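A small sketch of the augmented notation (again with made-up numbers) confirms that the two forms give identical values:

```python
import numpy as np

W = np.array([2.0, 1.0])
W0 = -4.0
X = np.array([3.0, 1.0])

# Augmented vectors: prepend the bias to W and a constant 1 to X.
W_tilde = np.concatenate(([W0], W))   # (W_0, W)
X_tilde = np.concatenate(([1.0], X))  # (1, X)

print(W_tilde @ X_tilde)              # 3.0
print(W @ X + W0)                     # 3.0, identical to the non-augmented form
```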

4.1.2 Multiple Classes

A $K$-class discriminant ($K>2$) can be built by combining multiple two-class discriminant functions. However, this approach has some limitations. Consider a one-versus-the-rest classifier which separates a particular class $C_k$ from the points not in that class. For a $K$-class problem, we need a total of $K-1$ classifiers of this kind. The result of this kind of classifier is shown in the figure below. As shown in the left image, the green region is classified both as class $C_1$ and class $C_2$, and hence this region is ambiguously classified.
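The ambiguity can be seen directly in code. The sketch below uses two made-up one-versus-the-rest discriminants for a three-class problem (the weights are arbitrary illustrations, not read off the figure) and shows a point for which both discriminants fire, so the classification is ambiguous.

```python
import numpy as np

# Hypothetical one-versus-the-rest discriminants for C1 and C2 in a
# three-class problem (K - 1 = 2 classifiers; "neither" means C3).
W = np.array([[1.0, 0.0],    # separates C1 from not-C1
              [0.0, 1.0]])   # separates C2 from not-C2
W0 = np.array([-1.0, -1.0])

def ovr_votes(X):
    """Return the classes whose one-vs-rest discriminant is positive."""
    scores = W @ X + W0
    return {f"C{k+1}" for k, s in enumerate(scores) if s > 0}

print(ovr_votes(np.array([2.0, 0.0])))   # {'C1'}        -> unambiguous
print(ovr_votes(np.array([2.0, 2.0])))   # {'C1', 'C2'}  -> ambiguously classified
print(ovr_votes(np.array([0.0, 0.0])))   # set()         -> falls to the remaining class C3
```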

Another approach is to use $K(K-1)/2$ discriminant functions, one for each pair of classes. This is known as a one-versus-one classifier. Each point is then classified according to a majority vote amongst the discriminant functions. The right-hand image in the figure above shows this approach: the green region receives one vote for each of the classes $C_1, C_2$ and $C_3$, and is hence ambiguously classified.
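A similar sketch for the one-versus-one approach, with hypothetical pairwise discriminants, shows a point where the three classes receive one vote each and the majority vote is inconclusive.

```python
import numpy as np
from collections import Counter

# Hypothetical pairwise (one-versus-one) linear discriminants for K = 3 classes:
# for the pair (k, j), a positive value of w^T X + w0 votes for C_k, otherwise for C_j.
pairwise = {
    (1, 2): (np.array([1.0, -1.0]),  0.0),   # C1 vs C2
    (1, 3): (np.array([1.0,  0.0]), -1.0),   # C1 vs C3
    (2, 3): (np.array([0.0, -1.0]),  1.0),   # C2 vs C3
}

def ovo_votes(X):
    """Count the one-versus-one votes received by each class."""
    votes = Counter()
    for (k, j), (w, w0) in pairwise.items():
        votes[k if w @ X + w0 > 0 else j] += 1
    return dict(votes)

print(ovo_votes(np.array([3.0, 0.0])))   # {1: 2, 2: 1}        -> C1 wins the majority vote
print(ovo_votes(np.array([1.5, 2.0])))   # {2: 1, 1: 1, 3: 1}  -> one vote each, ambiguous
```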

These ambiguous classification regions can be avoided by using a single $K$-class discriminant comprising $K$ linear functions of the form

$$\begin{align} y_k(X) = W_{k}^TX + W_{k0} \end{align}$$

The class assignment is then done as follows: assign a point $X$ to class $C_k$ if $y_k(X) > y_j(X)$ for all $j \neq k$. The decision boundary between classes $C_k$ and $C_j$ is given by $y_k(X) = y_j(X)$ and corresponds to a $(D-1)$-dimensional hyperplane defined by

$$\begin{align} (W_{k} - W_{j})^TX + (W_{k0} - W_{j0}) = 0 \end{align}$$
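In code, this assignment rule is simply an argmax over the $K$ linear functions. The sketch below uses arbitrary illustrative weights for $K = 3$ classes in $D = 2$ dimensions.

```python
import numpy as np

# Hypothetical weights for a K = 3 class discriminant in D = 2 dimensions:
# row k holds W_k, and W0[k] holds the corresponding bias W_k0.
W = np.array([[ 1.0,  0.0],
              [-1.0,  0.0],
              [ 0.0,  1.0]])
W0 = np.array([0.0, 0.0, 0.0])

def classify(X):
    """Assign X to the class C_k with the largest y_k(X) = W_k^T X + W_k0."""
    scores = W @ X + W0
    return int(np.argmax(scores)) + 1   # class index k (1-based)

print(classify(np.array([2.0, 1.0])))   # y = (2, -2, 1)  -> class 1
print(classify(np.array([-1.0, 3.0])))  # y = (-1, 1, 3)  -> class 3
```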

Such a decision boundary has the same form as in the two-class case. The decision regions of such a discriminant are always singly connected and convex. Consider two points $X_A, X_B$ which lie in the decision region $R_k$. Any point $\bar{X}$ lying on the line segment connecting $X_A$ and $X_B$ can be expressed in the form

$$\begin{align} \bar{X} = \lambda X_A + (1-\lambda)X_B \end{align}$$

where $0 \leq \lambda \leq 1$. From the linearity of the discriminant functions,

$$\begin{align} y_k(\bar{X}) = \lambda y_k(X_A) + (1-\lambda) y_k(X_B) \end{align}$$

As both $X_A, X_B$ lie inside $R_k$, it follows that $y_k(X_A) > y_j(X_A)$ and $y_k(X_B) > y_j(X_B)$ for all $j \neq k$. From this, we get $y_k(\bar{X}) > y_j(\bar{X})$ for all $j \neq k$, and hence $\bar{X}$ also lies in the region $R_k$. This means that the region $R_k$ is singly connected and convex.
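The convexity argument can also be checked numerically: take two points that fall in the same decision region and verify that every point on the segment joining them receives the same label. The weights below reuse the hypothetical three-class discriminant from the previous sketch.

```python
import numpy as np

W = np.array([[ 1.0,  0.0],
              [-1.0,  0.0],
              [ 0.0,  1.0]])
W0 = np.array([0.0, 0.0, 0.0])

def classify(X):
    return int(np.argmax(W @ X + W0)) + 1

X_A = np.array([3.0, 1.0])    # both points chosen so that they fall in region R_1
X_B = np.array([2.0, -1.0])
assert classify(X_A) == classify(X_B) == 1

# Every convex combination X_bar = lam * X_A + (1 - lam) * X_B stays in R_1.
for lam in np.linspace(0.0, 1.0, 11):
    X_bar = lam * X_A + (1 - lam) * X_B
    assert classify(X_bar) == 1
print("all points on the segment lie in R_1")
```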