ISLR Chapter 6: Linear Model Selection and Regularization (Part 5: Exercises - Applied)

Posted by Amit Rajan on Saturday, May 26, 2018

Applied

Q8. In this exercise, we will generate simulated data, and will then use this data to perform best subset selection.

(a) Use the rnorm() function to generate a predictor X of length n = 100, as well as a noise vector $\epsilon$ of length n = 100.

import numpy as np

np.random.seed(5)
X = np.random.normal(0, 1, 100)
e = np.random.normal(0, 1, 100)

(b) Generate a response vector Y of length n = 100 according to the model $Y = β_0 + β_1X + β_2X^2 + β_3X^3 + \epsilon$, where $β_0, β_1, β_2,$ and $β_3$ are constants of your choice.

beta0 = 5
beta1 = 2
beta2 = 3
beta3 = 4

y = beta0 + beta1*X + beta2*(X**2) + beta3*(X**3) + e

(c) Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors $X,X^2, . . ., X^{10}$. What is the best model obtained according to $C_p$, BIC, and adjusted $R^2$? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained. Note you will need to use the data.frame() function to create a single data set containing both X and Y .

Sol: The question asks for $C_p$, BIC, and adjusted $R^2$; the solution below instead uses a validation-set approach (a sketch for the three criteria follows the results). The test MSE is minimized by the model of size 3 with predictors [0, 1, 2], i.e. $X, X^2, X^3$, which matches the generated data exactly.

import itertools as it
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def select_subset_sizeK(X_, y_, k):
    # Return the best subset of k predictors (highest R^2 on the training data)
    model = LinearRegression()
    best_score = 0.0
    M_k = []
    for combo in it.combinations(range(X_.shape[1]), k):
        X_sub = X_[:, list(combo)]
        model.fit(X_sub, y_)
        s = model.score(X_sub, y_)
        if s > best_score:
            M_k = list(combo)
            best_score = s
    return M_k

def subset_selection(X_, y_):
    # Fit model with intercept only (null model)
    train_MSE = {}
    model_cols = {}
    y_pred = np.mean(y_)
    train_MSE[0] = np.sum((y_ - y_pred)**2) / len(y_)
    for s in range(1, X_.shape[1] + 1): # include the full model as well
        cols = select_subset_sizeK(X_, y_, s)
        X_sub = X_[:, cols]
        model = LinearRegression()
        model.fit(X_sub, y_)
        y_pred = model.predict(X_sub)
        train_MSE[s] = mean_squared_error(y_, y_pred)
        model_cols[s] = cols
    return (model_cols, train_MSE)
from sklearn.model_selection import train_test_split

X_ = np.vstack((X, X**2, X**3, X**4, X**5, X**6, X**7, X**8, X**9, X**10)).T
X_train, X_test, y_train, y_test = train_test_split(X_, y, test_size=0.1)

t = subset_selection(X_train, y_train)
models = t[0]
train_MSE = t[1]

fig = plt.figure(figsize=(15, 8))

lists = sorted(train_MSE.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples
ax = fig.add_subplot(121)
plt.plot(x, y, color='r')
plt.grid()
ax.set_xlabel('Model Size')
ax.set_ylabel('Training MSE')
ax.set_title('Training MSE vs Model Size')

test_MSE = {}
for size, cols in models.items():
    if size == 0: # null model: predict the training mean
        test_MSE[size] = np.sum((y_test - np.mean(y_train))**2) / len(y_test)
    else:
        model = LinearRegression()
        model.fit(X_train[:, cols], y_train)
        y_pred = model.predict(X_test[:, cols])
        test_MSE[size] = mean_squared_error(y_test, y_pred)

lists = sorted(test_MSE.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples
ax = fig.add_subplot(122)
plt.plot(x, y, color='g')
plt.grid()
ax.set_xlabel('Model Size')
ax.set_ylabel('Test MSE')
ax.set_title('Test MSE vs Model Size')
print("Test MSE is minimum for model size: " +str(min(test_MSE, key=test_MSE.get)))
cols = models.get(min(test_MSE, key=test_MSE.get))
print("Columns used in the model: " +str(cols))
model = LinearRegression()
model.fit(X_train[:, cols], y_train)
print("Model Coefficients: " +str(model.coef_))
Test MSE is minimum for model size: 3
Columns used in the model: [0, 1, 2]
Model Coefficients: [2.00715291 3.06050471 4.07996844]
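
For completeness, the criteria the question actually asks for can be computed directly from each model's RSS, using the ISLR formulas $C_p = \frac{1}{n}(RSS + 2d\hat{\sigma}^2)$, $BIC = \frac{1}{n}(RSS + \log(n)\,d\,\hat{\sigma}^2)$ (up to scaling), and adjusted $R^2 = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}$. A minimal sketch, assuming models, X_train and y_train from above and estimating $\hat{\sigma}^2$ from the residuals of the full model:

# C_p, BIC and adjusted R^2 for the best model of each size
n = len(y_train)
full = LinearRegression().fit(X_train, y_train)
sigma2 = np.sum((y_train - full.predict(X_train))**2) / (n - X_train.shape[1] - 1)
TSS = np.sum((y_train - np.mean(y_train))**2)

for d, cols in sorted(models.items()):
    m = LinearRegression().fit(X_train[:, cols], y_train)
    RSS = np.sum((y_train - m.predict(X_train[:, cols]))**2)
    Cp = (RSS + 2 * d * sigma2) / n
    BIC = (RSS + np.log(n) * d * sigma2) / n
    adjR2 = 1 - (RSS / (n - d - 1)) / (TSS / (n - 1))
    print("size %d: Cp=%.3f, BIC=%.3f, adj R^2=%.3f" % (d, Cp, BIC, adjR2))

One would expect all three criteria to again favor the size-3 model.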

(e) Now fit a lasso model to the simulated data, again using $X,X^2, . . ., X^{10}$ as predictors. Use cross-validation to select the optimal value of λ. Report the resulting coefficient estimates, and discuss the results obtained.

Sol: The lasso assigns large coefficients to predictors 0, 1, and 2 (i.e. $X, X^2, X^3$), small coefficients to predictors 3 and 4, and negligible values to all of the rest.

from sklearn.linear_model import LassoCV

n_alphas = 200
alphas = np.logspace(-10, 2, n_alphas)
# cv=None uses LassoCV's default k-fold cross-validation
model = LassoCV(alphas=alphas, fit_intercept=True, cv=None)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("Test Error: " +str(mean_squared_error(y_test, predictions)))
print("Model coefficients: " + str(model.coef_))
Test Error: 4687.204168578645
Model coefficients: [ 1.83007496e+00  2.54077338e+00  3.91756182e+00  2.22585449e-01
  2.89062042e-01 -3.33340397e-02 -2.88216147e-02 -3.80527416e-03
 -9.90808115e-03  3.31479566e-03]


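The question also asks to report the optimal λ selected by cross-validation. In scikit-learn, λ is called alpha, and both LassoCV and RidgeCV expose the selected value through the alpha_ attribute:

# the lambda (alpha) value selected by cross-validation
print("Selected alpha: " + str(model.alpha_))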

Q9. In this exercise, we will predict the number of applications received using the other variables in the College data set.

import pandas as pd

college = pd.read_csv("data/College.csv")
college.head()

| | Unnamed: 0 | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abilene Christian University | Yes | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | 3300 | 450 | 2200 | 70 | 78 | 18.1 | 12 | 7041 | 60 |
| 1 | Adelphi University | Yes | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | 6450 | 750 | 1500 | 29 | 30 | 12.2 | 16 | 10527 | 56 |
| 2 | Adrian College | Yes | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | 3750 | 400 | 1165 | 53 | 66 | 12.9 | 30 | 8735 | 54 |
| 3 | Agnes Scott College | Yes | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | 5450 | 450 | 875 | 92 | 97 | 7.7 | 37 | 19016 | 59 |
| 4 | Alaska Pacific University | Yes | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | 4120 | 800 | 1500 | 76 | 72 | 11.9 | 2 | 10922 | 15 |

(a) Split the data set into a training set and a test set.

college = college.rename(columns={'Unnamed: 0': 'Name'})
college['Private'] = college['Private'].map({'Yes': 1, 'No': 0})
msk = np.random.rand(len(college)) < 0.8 # roughly 80/20 train-test split
train = college[msk]
test = college[~msk]

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

model = LinearRegression(fit_intercept=True)
model.fit(train.drop(['Name', 'Apps'], axis=1), train['Apps'])
predictions = model.predict(test.drop(['Name', 'Apps'], axis=1))

print("Test Error: " +str(mean_squared_error(test['Apps'], predictions)))
Test Error: 1178528.8421813124

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

from sklearn.linear_model import RidgeCV

n_alphas = 200
alphas = np.logspace(-10, 2, n_alphas)
# With cv=None, RidgeCV performs efficient leave-one-out cross-validation
model = RidgeCV(alphas=alphas, fit_intercept=True, cv=None, store_cv_values=True)
model.fit(train.drop(['Name', 'Apps'], axis=1), train['Apps'])

predictions = model.predict(test.drop(['Name', 'Apps'], axis=1))
print("Test Error: " +str(mean_squared_error(test['Apps'], predictions)))
Test Error: 1188298.4979038904

(d) Fit a lasso model on the training set, with λ chosen by crossvalidation. Report the test error obtained, along with the number of non-zero coefficient estimates.

from sklearn.linear_model import LassoCV

# cv=None uses LassoCV's default k-fold cross-validation
model = LassoCV(alphas=alphas, fit_intercept=True, cv=None)
model.fit(train.drop(['Name', 'Apps'], axis=1), train['Apps'])

predictions = model.predict(test.drop(['Name', 'Apps'], axis=1))
print("Test Error: " +str(mean_squared_error(test['Apps'], predictions)))
print("Number of Non-zero coefficients: " + str(len(model.coef_)))
Test Error: 1207847.543688796
Number of Non-zero coefficients: 17



(e) Fit a PCR model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.

from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def PCR_CV(X_train, Y_train, X_test, Y_test, M):
    # Scale using statistics learned from the training set only
    scaler = preprocessing.StandardScaler().fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    MSE = {}
    test_MSE = {}

    for m in M: # Iterate over number of principal components
        pca = PCA(n_components=m)
        X_train_reduced = pca.fit_transform(X_train_scaled)
        X_test_reduced = pca.transform(X_test_scaled) # transform only; PCA is fit on the training data

        mse = 0
        loo = LeaveOneOut() # Leave one out cross-validation
        for train_index, test_index in loo.split(X_train_reduced):
            X, X_CV = X_train_reduced[train_index], X_train_reduced[test_index]
            Y, Y_CV = Y_train[train_index], Y_train[test_index]
            model = LinearRegression(fit_intercept=True)
            model.fit(X, Y)
            p = model.predict(X_CV)
            mse += mean_squared_error(Y_CV, p)
        MSE[m] = mse/len(X_train_reduced)

        # Compute test MSE for the model
        model = LinearRegression(fit_intercept=True)
        model.fit(X_train_reduced, Y_train)
        p = model.predict(X_test_reduced)
        test_MSE[m] = mean_squared_error(Y_test, p)

    # Report the M selected by cross-validation, along with its test error
    best_m = min(MSE, key=MSE.get)
    print("M selected by cross-validation: " + str(best_m))
    print("Corresponding test MSE: " + str(test_MSE[best_m]))

    # Plot validation MSE
    lists = sorted(MSE.items()) # sorted by key, return a list of tuples
    x, y = zip(*lists) # unpack a list of pairs into two tuples
    fig = plt.figure(figsize=(15, 8))
    ax = fig.add_subplot(121)
    plt.plot(x, y, color='r')
    plt.grid()
    ax.set_xlabel('Number of Principal Components')
    ax.set_ylabel('Validation MSE')
    ax.set_title('Validation MSE vs Principal Components')

    lists = sorted(test_MSE.items()) # sorted by key, return a list of tuples
    x, y = zip(*lists) # unpack a list of pairs into two tuples
    ax = fig.add_subplot(122)
    plt.plot(x, y, color='g')
    plt.grid()
    ax.set_xlabel('Number of Principal Components')
    ax.set_ylabel('Test MSE')
    ax.set_title('Test MSE vs Principal Components')
    plt.show()

M = np.arange(1, 17, 1) # Principal components
PCR_CV(train.drop(['Name', 'Apps'], axis=1), train['Apps'].values, test.drop(['Name', 'Apps'], axis=1),
       test['Apps'], M)
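
As an aside, the same PCR procedure can be written more compactly with scikit-learn's Pipeline, which also keeps the scaling and PCA fitted on the training folds only. A minimal sketch (the helper name pcr_best_m is hypothetical, and k-fold CV is used instead of leave-one-out for speed):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# hypothetical helper: choose the number of components M by k-fold CV
def pcr_best_m(X, y, max_m, cv=5):
    from sklearn.model_selection import cross_val_score
    best_m, best_mse = None, np.inf
    for m in range(1, max_m + 1):
        pipe = Pipeline([('scale', StandardScaler()),
                         ('pca', PCA(n_components=m)),
                         ('ols', LinearRegression())])
        # scoring returns negated MSE, so flip the sign
        mse = -cross_val_score(pipe, X, y, cv=cv,
                               scoring='neg_mean_squared_error').mean()
        if mse < best_mse:
            best_m, best_mse = m, mse
    return best_m, best_mse

The same pattern works for PLS in part (f) by replacing the PCA and LinearRegression steps with a single PLSRegression step.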

(f) Fit a PLS model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.

from sklearn.cross_decomposition import PLSRegression

def PLS_CV(X_train, Y_train, X_test, Y_test, M):
    # Scale using statistics learned from the training set only
    scaler = preprocessing.StandardScaler().fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    MSE = {}
    test_MSE = {}

    for m in M: # Iterate over number of PLS components
        mse = 0
        loo = LeaveOneOut() # Leave one out cross-validation
        for train_index, test_index in loo.split(X_train_scaled):
            X, X_CV = X_train_scaled[train_index], X_train_scaled[test_index]
            Y, Y_CV = Y_train[train_index], Y_train[test_index]
            model = PLSRegression(n_components=m)
            model.fit(X, Y)
            p = model.predict(X_CV)
            mse += mean_squared_error(Y_CV, p)
        MSE[m] = mse/len(X_train_scaled)

        # Compute test MSE for the model
        model = PLSRegression(n_components=m)
        model.fit(X_train_scaled, Y_train)
        p = model.predict(X_test_scaled)
        test_MSE[m] = mean_squared_error(Y_test, p)

    # Report the M selected by cross-validation, along with its test error
    best_m = min(MSE, key=MSE.get)
    print("M selected by cross-validation: " + str(best_m))
    print("Corresponding test MSE: " + str(test_MSE[best_m]))

    # Plot validation MSE
    lists = sorted(MSE.items()) # sorted by key, return a list of tuples
    x, y = zip(*lists) # unpack a list of pairs into two tuples
    fig = plt.figure(figsize=(15, 8))
    ax = fig.add_subplot(121)
    plt.plot(x, y, color='r')
    plt.grid()
    ax.set_xlabel('M')
    ax.set_ylabel('Validation MSE')
    ax.set_title('Validation MSE vs M')

    lists = sorted(test_MSE.items()) # sorted by key, return a list of tuples
    x, y = zip(*lists) # unpack a list of pairs into two tuples
    ax = fig.add_subplot(122)
    plt.plot(x, y, color='g')
    plt.grid()
    ax.set_xlabel('M')
    ax.set_ylabel('Test MSE')
    ax.set_title('Test MSE vs M')
    plt.show()

M = np.arange(1, 17, 1) # Candidate numbers of PLS components
PLS_CV(train.drop(['Name', 'Apps'], axis=1), train['Apps'].values, test.drop(['Name', 'Apps'], axis=1),
       test['Apps'], M)

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

Sol: The test errors (in units of $10^7$) for the various methods are as follows:

  • Least squares linear model: 0.118

  • Ridge regression model: 0.119

  • The lasso: 0.121

  • PCR: 0.67

  • PLS: 0.11

It can be concluded that all of the methods except PCR perform comparably well; PCR has a markedly higher test error.
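
To quantify how accurately applications can be predicted, the test-set $R^2$ can be computed as well; a minimal sketch, assuming predictions holds the test-set predictions of whichever model was fit last:

from sklearn.metrics import r2_score

# fraction of variance in the test-set application counts explained by the model
print("Test R^2: " + str(r2_score(test['Apps'], predictions)))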

Q10. We have seen that as the number of features used in a model increases, the training error will necessarily decrease, but the test error may not. We will now explore this in a simulated data set.

(a) Generate a data set with $p = 15$ features, $n = 1,000$ observations, and an associated quantitative response vector generated according to the model $$Y = Xβ + \epsilon$$ where β has some elements that are exactly equal to zero.

X = np.random.normal(size=(1000, 15))
beta = np.random.normal(size=15)
# set some coefficients exactly to zero, as the exercise requires
beta[3] = 0
beta[5] = 0
beta[9] = 0
e = np.random.normal(size=1000)
y = np.dot(X, beta) + e

(b) Split your data set into a training set containing 100 observations and a test set containing 900 observations.

from sklearn.model_selection import train_test_split
# training set of 100 observations, test set of 900, as the exercise requires
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.9)

(c) Perform best subset selection on the training set, and plot the training set MSE associated with the best model of each size.

(d) Plot the test set MSE associated with the best model of each size.

import itertools as it
from sklearn.linear_model import LinearRegression

def select_subset_sizeK(X_, y_, k):
    # Return the best subset of k predictors (highest R^2 on the training data)
    model = LinearRegression()
    best_score = 0.0
    M_k = []
    for combo in it.combinations(range(X_.shape[1]), k):
        X_sub = X_[:, list(combo)]
        model.fit(X_sub, y_)
        s = model.score(X_sub, y_)
        if s > best_score:
            M_k = list(combo)
            best_score = s
    return M_k

def subset_selection(X_, y_):
    # Fit model with intercept only (null model)
    train_MSE = {}
    model_cols = {}
    y_pred = np.mean(y_)
    train_MSE[0] = np.sum((y_ - y_pred)**2) / len(y_)
    for s in range(1, X_.shape[1] + 1): # include the full model as well
        cols = select_subset_sizeK(X_, y_, s)
        X_sub = X_[:, cols]
        model = LinearRegression()
        model.fit(X_sub, y_)
        y_pred = model.predict(X_sub)
        train_MSE[s] = mean_squared_error(y_, y_pred)
        model_cols[s] = cols
    return (model_cols, train_MSE)

t = subset_selection(X_train, y_train)
models = t[0]
train_MSE = t[1]

fig = plt.figure(figsize=(15, 8))

lists = sorted(train_MSE.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples
ax = fig.add_subplot(121)
plt.plot(x, y, color='r')
plt.grid()
ax.set_xlabel('Model Size')
ax.set_ylabel('Training MSE')
ax.set_title('Training MSE vs Model Size')

test_MSE = {}
for size, cols in models.items():
    if size == 0: # null model: predict the training mean
        test_MSE[size] = np.sum((y_test - np.mean(y_train))**2) / len(y_test)
    else:
        model = LinearRegression()
        model.fit(X_train[:, cols], y_train)
        y_pred = model.predict(X_test[:, cols])
        test_MSE[size] = mean_squared_error(y_test, y_pred)

lists = sorted(test_MSE.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples
ax = fig.add_subplot(122)
plt.plot(x, y, color='g')
plt.grid()
ax.set_xlabel('Model Size')
ax.set_ylabel('Test MSE')
ax.set_title('Test MSE vs Model Size')

(e) For which model size does the test set MSE take on its minimum value? Comment on your results. If it takes on its minimum value for a model containing only an intercept or a model containing all of the features, then play around with the way that you are generating the data in (a) until you come up with a scenario in which the test set MSE is minimized for an intermediate model size.

print("Test MSE is minimum for model size: " +str(min(test_MSE, key=test_MSE.get)))
Test MSE is minimum for model size: 13

(f) How does the model at which the test set MSE is minimized compare to the true model used to generate the data? Comment on the coefficient values.

Sol: The selected model agrees well with the data-generating process. The columns excluded from the model are 5 and 9. While generating the data, we set the coefficients of features 3, 5, and 9 to 0, so the model captures two of these exactly; moreover, the estimated coefficient of feature 3 is -0.07353929, which is close to zero as well.

cols = models.get(min(test_MSE, key=test_MSE.get))
print("Columns used in the model: " +str(cols))
model = LinearRegression()
model.fit(X_train[:, cols], y_train)
print("Model Coefficients: " +str(model.coef_))
Columns used in the model: [0, 1, 2, 3, 4, 6, 7, 8, 10, 11, 12, 13, 14]
Model Coefficients: [-2.50852562 -0.43695144  1.40013156 -0.07353929 -0.85895357 -1.89061122
 -0.30136561  1.12543061  0.09474982 -0.70489182 -0.6278358  -1.40983561
  0.17529716]
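
To make the comparison explicit, the true coefficients chosen in (a) can be printed next to the estimates:

# side-by-side comparison of the true and estimated coefficients
for c, est in zip(cols, model.coef_):
    print("feature %2d: true beta = %8.4f, estimate = %8.4f" % (c, beta[c], est))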

Q11. We will now try to predict per capita crime rate in the Boston data set.

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

boston = pd.read_csv("data/Boston.csv")
boston.dropna(inplace=True)
np.random.seed(5)
X_train, X_test, y_train, y_test = train_test_split(boston.iloc[:,1:], boston['crim'], test_size=0.1)

# The lasso
n_alphas = 200
alphas = np.logspace(-10, 2, n_alphas)
# cv=None uses LassoCV's default k-fold cross-validation
model = LassoCV(alphas=alphas, fit_intercept=True, cv=None)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("Test Error: " +str(mean_squared_error(y_test, predictions)))
print("Model coefficients: " + str(model.coef_))
Test Error: 54.95744135110768
Model coefficients: [ 0.04198738 -0.07640927 -0.         -0.          0.32159534 -0.00722933
 -0.67844741  0.5220804  -0.0022908  -0.09016879 -0.00174827  0.13527718
 -0.15964042]
# Ridge Regression
model = RidgeCV(alphas=alphas, fit_intercept=True, cv=None)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("Test Error: " +str(mean_squared_error(y_test, predictions)))
print("Model coefficients: " + str(model.coef_))
Test Error: 55.55453079915659
Model coefficients: [ 0.0406467  -0.08234663 -0.20301002 -0.09532246  0.55364958 -0.00933135
 -0.70209263  0.52848777 -0.00235438 -0.14191045 -0.00127377  0.14595535
 -0.17415457]
# PCR
M = np.arange(1, 14, 1) # Principal components
PCR_CV(X_train, y_train.values, X_test, y_test.values, M)
# PLS
M = np.arange(1, 14, 1) # Candidate numbers of PLS components
PLS_CV(X_train, y_train.values, X_test, y_test.values, M)

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.

Sol: Except for PCR, all of the models (the lasso, PLS, and ridge regression) perform decently well.

(c) Does your chosen model involve all of the features in the data set? Why or why not?

Sol: Looking at the lasso model with test MSE of 54.95744, the coefficients of chas and nox are exactly 0, and those of age, tax, and black are very small. The chosen model therefore does not involve all of the features: the lasso drops, or nearly drops, the predictors that contribute little.
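
A quick way to list the surviving predictors by name (assuming model still refers to the LassoCV fit from part (a)):

# predictors with non-zero lasso coefficients
kept = [col for col, c in zip(X_train.columns, model.coef_) if c != 0]
print("Features kept by the lasso: " + str(kept))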