
Linear Models for Regression - Linear Basis Function Models : Part 1

Posted by Amit Rajan on Monday, June 20, 2022

Linear regression models are linear functions of their adjustable parameters. We can add more complexity to linear regression models by taking linear combinations of a fixed set of nonlinear functions of the input variables, known as basis functions. In the modeling process, given $x$, we have to predict $t$, and the prediction is denoted as $y(x)$. From a probabilistic perspective, we aim to model the predictive distribution $p(t|x)$, as this expresses the uncertainty about the value of $t$ for each value of $x$. Linear models have significant limitations as practical techniques for pattern recognition, particularly for problems involving input spaces of high dimensionality.

3.1 Linear Basis Function Models

In the simplest linear regression model, the output is a linear combination of the input variables

$$y(X,W) = w_0 + w_1x_1 + w_2x_2 + \dots + w_Dx_D$$

Extending it to include the nonlinear combination of input variables, we get

$$y(X,W) = w_0 + \sum_{j=1}^{M-1} w_j\phi_j(X)$$

where $\phi_j(X)$ is called a basis function. The total number of parameters in this model is $M$. $w_0$ is called the bias parameter. If we define a dummy basis function for the bias as $\phi_0(X) = 1$, we have

$$y(X,W) = \sum_{j=0}^{M-1} w_j\phi_j(X) = W^T\phi(X)$$

where $W = (w_0, w_1, \dots, w_{M-1})^T$ and $\phi = (\phi_0, \phi_1, \dots, \phi_{M-1})^T$. The basis functions can take any form. For example, a polynomial basis function takes the form $\phi_j(x) = x^j$. We can also have piecewise polynomial functions, called spline functions, where we fit different polynomials in different regions of the input space. Another example is the Gaussian basis function, which takes the form

$$\phi_j(x) = \exp\left[-\frac{(x-\mu_j)^2}{2s^2}\right]$$

where $\mu_j$ governs the location and $s$ governs the spatial scale. We can also have a sigmoidal basis function, which is defined as

$$\phi_j(x) = \frac{1}{1 + \exp\left(-\frac{x-\mu_j}{s}\right)}$$
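
To make these choices concrete, here is a minimal NumPy sketch of the three basis function families discussed above; the function names and the grid of centres $\mu_j$ are illustrative choices rather than anything fixed by the text.

```python
import numpy as np

def polynomial_basis(x, M):
    """Polynomial basis: phi_j(x) = x^j for j = 0, ..., M-1 (phi_0 = 1 is the bias)."""
    return np.stack([x**j for j in range(M)], axis=1)

def gaussian_basis(x, mu, s):
    """Gaussian basis: phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), plus a bias column of ones."""
    phi = np.exp(-(x[:, None] - mu[None, :])**2 / (2 * s**2))
    return np.hstack([np.ones((x.shape[0], 1)), phi])

def sigmoidal_basis(x, mu, s):
    """Sigmoidal basis: phi_j(x) = 1 / (1 + exp(-(x - mu_j) / s)), plus a bias column."""
    phi = 1.0 / (1.0 + np.exp(-(x[:, None] - mu[None, :]) / s))
    return np.hstack([np.ones((x.shape[0], 1)), phi])

# Example: design matrices for N = 5 inputs and M = 4 basis functions
x = np.linspace(0, 1, 5)
mu = np.linspace(0, 1, 3)                    # centres mu_j (illustrative choice)
print(polynomial_basis(x, 4).shape)          # (5, 4)
print(gaussian_basis(x, mu, s=0.2).shape)    # (5, 4)
print(sigmoidal_basis(x, mu, s=0.2).shape)   # (5, 4)
```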

3.1.1 Maximum Likelihood and Least Squares

The target variable $t$ is predicted as $y(X,W)$. The observed target will deviate from $y(X,W)$ by some additive noise, which we assume to be Gaussian. Then,

$$t = y(X,W) + \epsilon$$

where $\epsilon$ is a zero-mean Gaussian random variable with precision $\beta$. Hence,

$$p(t|X,W,\beta) = \mathcal{N}(t|y(X,W), \beta^{-1})$$

For a squared loss function, the optimal prediction, for a new value of X, will be given by the conditional mean of the target variable, i.e.

$$E[t|X] = \int t \, p(t|X) \, dt = y(X,W)$$
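
As a brief aside (this step is the standard decision-theoretic argument and is not spelled out above), the conditional mean arises by minimizing the expected squared loss with respect to the prediction function:

$$E[L] = \iint \left(y(X) - t\right)^2 p(X,t) \, dX \, dt$$

$$\frac{\delta E[L]}{\delta y(X)} = 2\int \left(y(X) - t\right) p(X,t) \, dt = 0 \implies y(X) = \frac{\int t \, p(X,t) \, dt}{p(X)} = \int t \, p(t|X) \, dt = E[t|X]$$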

Let us consider a dataset of inputs $X = \{X_1, X_2, \dots, X_N\}$ with the target variables $t = \{t_1, t_2, \dots, t_N\}$. Assuming that these data points are drawn independently, the likelihood function is given as

$$p(t|X,W,\beta) = \prod_{n=1}^{N} \mathcal{N}(t_n|W^T\phi(X_n), \beta^{-1})$$

The log likelihood is given as

$$\ln p(t|X,W,\beta) = \sum_{n=1}^{N} \ln \mathcal{N}(t_n|W^T\phi(X_n), \beta^{-1})$$

$$= \sum_{n=1}^{N} \ln\left[\frac{1}{(2\pi\beta^{-1})^{1/2}} \exp\left(-\frac{\beta}{2}\left(t_n - W^T\phi(X_n)\right)^2\right)\right]$$

$$= \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \frac{\beta}{2}\sum_{n=1}^{N}\left(t_n - W^T\phi(X_n)\right)^2$$

$$= \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(W)$$

where

$$E_D(W) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - W^T\phi(X_n)\right)^2$$
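
As a quick sanity check on this decomposition, the following NumPy sketch (with synthetic, placeholder data and parameters) confirms that $\frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(W)$ equals the direct sum of Gaussian log densities.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Synthetic data and a fixed parameter vector (placeholders for illustration)
N, M, beta = 50, 4, 25.0
Phi = rng.normal(size=(N, M))          # design matrix, rows are phi(X_n)^T
W = rng.normal(size=M)
t = Phi @ W + rng.normal(scale=beta**-0.5, size=N)

# Sum-of-squares error E_D(W)
E_D = 0.5 * np.sum((t - Phi @ W)**2)

# Closed-form log likelihood: N/2 ln(beta) - N/2 ln(2*pi) - beta * E_D(W)
ll_closed = 0.5 * N * np.log(beta) - 0.5 * N * np.log(2 * np.pi) - beta * E_D

# Direct sum of Gaussian log densities N(t_n | W^T phi(X_n), beta^{-1})
ll_direct = norm.logpdf(t, loc=Phi @ W, scale=beta**-0.5).sum()

print(np.allclose(ll_closed, ll_direct))  # True
```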

Taking the derivative with respect to $W$ and setting it to 0, we get

$$\nabla_W \ln p(t|X,W,\beta) \propto \sum_{n=1}^{N}\left[t_n - W^T\phi(X_n)\right]\phi(X_n)^T = 0$$

$$\sum_{n=1}^{N} t_n\phi(X_n)^T - W^T\left(\sum_{n=1}^{N}\phi(X_n)\phi(X_n)^T\right) = 0$$

Converting it into matrix form and solving we get,

$$W_{ML} = (\Phi^T\Phi)^{-1}\Phi^T t$$

where $\Phi$ is the $N \times M$ design matrix whose $n$th row is the basis vector $\phi(X_n)^T$. The above equation is called the normal equation for the least squares problem. Taking the derivative with respect to $\beta$ and equating it to 0, we get

$$\frac{\partial}{\partial \beta}\ln p(t|X,W,\beta) = \frac{N}{2\beta} - E_D(W) = 0$$

$$\frac{1}{\beta_{ML}} = \frac{2}{N}E_D(W_{ML}) = \frac{1}{N}\sum_{n=1}^{N}\left(t_n - W_{ML}^T\phi(X_n)\right)^2$$
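
These two maximum-likelihood estimates translate directly into a few lines of NumPy. The sketch below solves the normal equation explicitly for clarity (in practice np.linalg.lstsq is the numerically safer route); the data are synthetic and the helper name fit_ml is just for illustration.

```python
import numpy as np

def fit_ml(Phi, t):
    """Maximum-likelihood fit of a linear basis function model.

    Phi : (N, M) design matrix whose n-th row is phi(X_n)^T
    t   : (N,) target vector
    Returns (W_ML, beta_ML).
    """
    # Normal equation: W_ML = (Phi^T Phi)^{-1} Phi^T t
    W_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
    # 1 / beta_ML = (1/N) * sum_n (t_n - W_ML^T phi(X_n))^2
    residuals = t - Phi @ W_ml
    beta_ml = len(t) / np.sum(residuals**2)
    return W_ml, beta_ml

# Example with synthetic data (illustrative only)
rng = np.random.default_rng(1)
N = 100
x = rng.uniform(0, 1, N)
Phi = np.stack([np.ones(N), x, x**2], axis=1)   # polynomial basis, M = 3
t = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=N)

W_ml, beta_ml = fit_ml(Phi, t)
print(W_ml)              # approximately [1, 2, -3]
print(beta_ml**-0.5)     # approximately the noise standard deviation 0.1
```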

The role of the bias parameter $W_0$ can be analyzed by setting the derivative of $\ln p(t|X,W,\beta)$ with respect to $W_0$ to 0. As $W_0$ appears only in $E_D(W)$, making it explicit in $E_D(W)$, we get

$$E_D(W) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - W_0 - \sum_{j=1}^{M-1}W_j\phi_j(X_n)\right)^2$$

$$\frac{\partial}{\partial W_0}\ln p(t|X,W,\beta) \propto \sum_{n=1}^{N}\left(t_n - W_0 - \sum_{j=1}^{M-1}W_j\phi_j(X_n)\right) = 0$$

$$N W_0 = \sum_{n=1}^{N}\left(t_n - \sum_{j=1}^{M-1}W_j\phi_j(X_n)\right)$$

$$W_0 = \frac{1}{N}\sum_{n=1}^{N}t_n - \sum_{j=1}^{M-1}W_j\left(\frac{1}{N}\sum_{n=1}^{N}\phi_j(X_n)\right)$$

$$W_0 = \bar{t} - \sum_{j=1}^{M-1}W_j\bar{\phi}_j$$

where

$$\bar{t} = \frac{1}{N}\sum_{n=1}^{N}t_n$$

$$\bar{\phi}_j = \frac{1}{N}\sum_{n=1}^{N}\phi_j(X_n)$$

Hence the bias compensates for the difference between the average of the target values (over the training set) and the weighted sum of the averages of the basis function values.
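
This relationship is straightforward to verify numerically; the standalone sketch below (with its own synthetic data and an explicit bias column $\phi_0 = 1$) checks that the first component of $W_{ML}$ equals $\bar{t} - \sum_j W_j \bar{\phi}_j$.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
x = rng.uniform(-1, 1, N)

# Design matrix with an explicit bias column phi_0 = 1 and two Gaussian basis functions
mu, s = np.array([-0.5, 0.5]), 0.3
Phi = np.hstack([np.ones((N, 1)), np.exp(-(x[:, None] - mu)**2 / (2 * s**2))])
t = 0.5 + np.sin(np.pi * x) + rng.normal(scale=0.1, size=N)

# Maximum-likelihood weights via the normal equation
W_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# W_0 should equal t_bar - sum_j W_j * phi_bar_j
t_bar = t.mean()
phi_bar = Phi[:, 1:].mean(axis=0)
print(np.isclose(W_ml[0], t_bar - W_ml[1:] @ phi_bar))  # True
```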