Applied
Q8. In this exercise, we will generate simulated data, and will then use this data to perform best subset selection.
(a) Use the rnorm() function to generate a predictor X of length n = 100, as well as a noise vector of length n = 100.
import numpy as np
np.random.seed(5)
X = np.random.normal(0, 1, 100)
e = np.random.normal(0, 1, 100)
(b) Generate a response vector Y of length n = 100 according to the model $Y = β_0 + β_1X + β_2X^2 + β_3X^3 + \epsilon$, where $β_0, β_1, β_2,$ and $β_3$ are constants of your choice.
beta0 = 5
beta1 = 2
beta2 = 3
beta3 = 4
y = beta0 + beta1*X + beta2*(X**2) + beta3*(X**3) + e
(c) Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors $X, X^2, \ldots, X^{10}$. What is the best model obtained according to $C_p$, BIC, and adjusted $R^2$? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained. Note you will need to use the data.frame() function to create a single data set containing both X and Y.
Sol: Instead of $C_p$, BIC, and adjusted $R^2$, a validation-set approach is used here: the test MSE is minimized by the model of size 3 containing predictors [0, 1, 2], i.e. $X$, $X^2$ and $X^3$, which matches the form of the generated data exactly.
import itertools as it
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
def select_subset_sizeK(X_, y_, k):
model = LinearRegression()
best_score = 0.0
M_k = []
for combo in it.combinations(range(X_.shape[1]), k):
X = X_[:, list(combo)]
model.fit(X, y_)
s = model.score(X, y_)
if s > best_score:
M_k = list(combo)
best_score = s
return M_k
def subset_selection(X_, y_):
# Fit model with intercept only (Null model)
train_MSE = {}
model_cols = {}
y_pred = np.mean(y_)
train_MSE[0] = np.sum((y_ - y_pred)**2) / len(y_)
    for s in range(1, X_.shape[1] + 1): # consider every model size from 1 to p
cols = select_subset_sizeK(X_, y_, s)
X = X_[:, cols]
model = LinearRegression()
model.fit(X, y_)
y_pred = model.predict(X)
train_MSE[s] = mean_squared_error(y_pred, y_)
model_cols[s] = cols
return (model_cols, train_MSE)
from sklearn.model_selection import train_test_split
X_ = np.vstack((X, X**2, X**3, X**4, X**5, X**6, X**7, X**8, X**9, X**10)).T
X_train, X_test, y_train, y_test = train_test_split(X_, y, test_size=0.1)
t = subset_selection(X_train, y_train)
models = t[0]
train_MSE = t[1]
fig = plt.figure(figsize=(15, 8))
lists = sorted(train_MSE.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples
ax = fig.add_subplot(121)
plt.plot(x, y, color='r')
plt.grid()
ax.set_xlabel('Model Size')
ax.set_ylabel('Training MSE')
ax.set_title('Training MSE vs Model Size')
test_MSE = {}
for size, cols in models.items():
    if size == 0:
        # Null model: predict the training-set mean
        test_MSE[size] = np.sum((y_test - np.mean(y_train))**2) / len(y_test)
else:
model = LinearRegression()
model.fit(X_train[:, cols], y_train)
y_pred = model.predict(X_test[:, cols])
test_MSE[size] = mean_squared_error(y_pred, y_test)
lists = sorted(test_MSE.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples
ax = fig.add_subplot(122)
plt.plot(x, y, color='g')
plt.grid()
ax.set_xlabel('Model Size')
ax.set_ylabel('Test MSE')
ax.set_title('Test MSE vs Model Size')
print("Test MSE is minimum for model size: " +str(min(test_MSE, key=test_MSE.get)))
cols = models.get(min(test_MSE, key=test_MSE.get))
print("Columns used in the model: " +str(cols))
model = LinearRegression()
model.fit(X_train[:, cols], y_train)
print("Model Coefficients: " +str(model.coef_))
Test MSE is minimum for model size: 3
Columns used in the model: [0, 1, 2]
Model Coefficients: [2.00715291 3.06050471 4.07996844]
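The question asks for $C_p$, BIC, and adjusted $R^2$, whereas the code above relies on a validation-set estimate of the test MSE. As a complement, the following is a minimal sketch (not part of the original solution; the names n, full_fit, sigma2, tss, cp, bic and adj_r2 are introduced here for illustration) that computes the three criteria for the best model of each size, estimating $\sigma^2$ from the full ten-predictor fit:
# Sketch: C_p, BIC and adjusted R^2 for the best model of each size found above
n = len(y_train)
full_fit = LinearRegression().fit(X_train, y_train)
sigma2 = np.sum((y_train - full_fit.predict(X_train))**2) / (n - X_train.shape[1] - 1)
tss = np.sum((y_train - np.mean(y_train))**2)
for size, cols in sorted(models.items()):
    fit = LinearRegression().fit(X_train[:, cols], y_train)
    rss = np.sum((y_train - fit.predict(X_train[:, cols]))**2)
    d = len(cols)  # number of predictors in the model
    cp = (rss + 2 * d * sigma2) / n
    bic = (rss + np.log(n) * d * sigma2) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    print(size, round(cp, 3), round(bic, 3), round(adj_r2, 3))
All three criteria would be expected to favour the three-variable model here, in line with the validation-set result above.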
(e) Now fit a lasso model to the simulated data, again using $X, X^2, \ldots, X^{10}$ as predictors. Use cross-validation to select the optimal value of λ. Report the resulting coefficient estimates, and discuss the results obtained.
Sol: The lasso model has relatively large coefficients for predictors 0, 1, 2, 3 and 4, with the first three (corresponding to $X$, $X^2$ and $X^3$) dominating; all of the remaining coefficients are negligible.
from sklearn.linear_model import LassoCV
n_alphas = 200
alphas = np.logspace(-10, 2, n_alphas)
# cv=None uses LassoCV's default k-fold cross-validation
model = LassoCV(alphas=alphas, fit_intercept=True, cv=None)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print("Test Error: " +str(mean_squared_error(y_test, predictions)))
print("Model coefficients: " + str(model.coef_))
Test Error: 4687.204168578645
Model coefficients: [ 1.83007496e+00 2.54077338e+00 3.91756182e+00 2.22585449e-01
2.89062042e-01 -3.33340397e-02 -2.88216147e-02 -3.80527416e-03
-9.90808115e-03 3.31479566e-03]
(sklearn raises a ConvergenceWarning here: the objective did not converge; fitting with very small alpha values may cause precision problems.)
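The convergence warnings arise because the lasso is fit on the raw polynomial features, whose scales differ by several orders of magnitude. One possible remedy, sketched below (not part of the original run; lasso_pipe is an illustrative name, and the reported coefficients would be on the standardized-feature scale), is to standardize the features inside a pipeline and allow more coordinate-descent iterations:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Sketch: standardize the polynomial features before the lasso and raise max_iter
lasso_pipe = make_pipeline(StandardScaler(),
                           LassoCV(alphas=alphas, cv=5, max_iter=100000))
lasso_pipe.fit(X_train, y_train)
print("Test Error: " + str(mean_squared_error(y_test, lasso_pipe.predict(X_test))))
print("Model coefficients: " + str(lasso_pipe.named_steps['lassocv'].coef_))  # standardized scale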
Q9. In this exercise, we will predict the number of applications received using the other variables in the College data set.
import pandas as pd
college = pd.read_csv("data/College.csv")
college.head()
| | Unnamed: 0 | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Abilene Christian University | Yes | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | 3300 | 450 | 2200 | 70 | 78 | 18.1 | 12 | 7041 | 60 |
| 1 | Adelphi University | Yes | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | 6450 | 750 | 1500 | 29 | 30 | 12.2 | 16 | 10527 | 56 |
| 2 | Adrian College | Yes | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | 3750 | 400 | 1165 | 53 | 66 | 12.9 | 30 | 8735 | 54 |
| 3 | Agnes Scott College | Yes | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | 5450 | 450 | 875 | 92 | 97 | 7.7 | 37 | 19016 | 59 |
| 4 | Alaska Pacific University | Yes | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | 4120 | 800 | 1500 | 76 | 72 | 11.9 | 2 | 10922 | 15 |
(a) Split the data set into a training set and a test set.
college = college.rename(columns={'Unnamed: 0': 'Name'})
college['Private'] = college['Private'].map({'Yes': 1, 'No': 0})
msk = np.random.rand(len(college)) < 0.8
train = college[msk]
test = college[~msk]
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
model = LinearRegression(fit_intercept=True)
model.fit(train.drop(['Name', 'Apps'], axis=1), train['Apps'])
predictions = model.predict(test.drop(['Name', 'Apps'], axis=1))
print("Test Error: " +str(mean_squared_error(test['Apps'], predictions)))
Test Error: 1178528.8421813124
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
from sklearn.linear_model import RidgeCV
n_alphas = 200
alphas = np.logspace(-10, 2, n_alphas)
# Leave one out cross-validation
model = RidgeCV(alphas=alphas, fit_intercept=True, cv=None, store_cv_values=True)
model.fit(train.drop(['Name', 'Apps'], axis=1), train['Apps'])
predictions = model.predict(test.drop(['Name', 'Apps'], axis=1))
print("Test Error: " +str(mean_squared_error(test['Apps'], predictions)))
Test Error: 1188298.4979038904
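The question asks for λ to be chosen by cross-validation; the selected value can be read off the fitted model's alpha_ attribute (a small check that is not in the original output):
print("Selected lambda: " + str(model.alpha_))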
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
from sklearn.linear_model import LassoCV
# cv=None uses LassoCV's default k-fold cross-validation
model = LassoCV(alphas=alphas, fit_intercept=True, cv=None)
model.fit(train.drop(['Name', 'Apps'], axis=1), train['Apps'])
predictions = model.predict(test.drop(['Name', 'Apps'], axis=1))
print("Test Error: " +str(mean_squared_error(test['Apps'], predictions)))
print("Number of Non-zero coefficients: " + str(len(model.coef_)))
Test Error: 1207847.543688796
Number of Non-zero coefficients: 17
(sklearn again raises a ConvergenceWarning: the objective did not converge; fitting with very small alpha may cause precision problems.)
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
def PCR_CV(X_train, Y_train, X_test, Y_test, M):
    # Scale the features using statistics computed on the training set only
    scaler = preprocessing.StandardScaler().fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    MSE = {}
    test_MSE = {}
    for m in M: # Iterate over number of principal components
        pca = PCA(n_components=m)
        X_train_reduced = pca.fit_transform(X_train_scaled)
        X_test_reduced = pca.transform(X_test_scaled) # reuse the PCA loadings fit on the training data
mse = 0
test_mse = 0
loo = LeaveOneOut() # Leave one out cross-validation
for train_index, test_index in loo.split(X_train_reduced):
X, X_CV = X_train_reduced[train_index], X_train_reduced[test_index]
Y, Y_CV = Y_train[train_index], Y_train[test_index]
model = LinearRegression(fit_intercept=True)
model.fit(X, Y)
p = model.predict(X_CV)
mse += mean_squared_error(p, Y_CV)
MSE[m] = mse/len(X_train_reduced)
# Compute test MSE for the model
model = LinearRegression(fit_intercept=True)
model.fit(X_train_reduced, Y_train)
p = model.predict(X_test_reduced)
test_MSE[m] = mean_squared_error(p, Y_test)
# Plot validation MSE
lists = sorted(MSE.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples
fig = plt.figure(figsize=(15, 8))
ax = fig.add_subplot(121)
plt.plot(x, y, color='r')
plt.grid()
ax.set_xlabel('Number of Principal Components')
ax.set_ylabel('Validation MSE')
ax.set_title('Validation MSE vs Principal Components')
lists = sorted(test_MSE.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples
ax = fig.add_subplot(122)
plt.plot(x, y, color='g')
plt.grid()
ax.set_xlabel('Number of Principal Components')
ax.set_ylabel('Test MSE')
ax.set_title('Test MSE vs Principal Components')
plt.show()
M = np.arange(1, 17, 1) # Principal components
PCR_CV(train.drop(['Name', 'Apps'], axis=1), train['Apps'].values, test.drop(['Name', 'Apps'], axis=1),
test['Apps'], M)
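The validation and test curves above are produced as plots only; the sketch below (not part of the original solution; X_tr, y_tr, X_te, y_te, cv_mse and best_M are illustrative names) selects M by 10-fold cross-validation with a scikit-learn pipeline and reports the corresponding test error explicitly:
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
X_tr, y_tr = train.drop(['Name', 'Apps'], axis=1), train['Apps']
X_te, y_te = test.drop(['Name', 'Apps'], axis=1), test['Apps']
cv_mse = {}
for m in M:  # 10-fold CV error for each number of principal components
    pcr = make_pipeline(StandardScaler(), PCA(n_components=m), LinearRegression())
    scores = cross_val_score(pcr, X_tr, y_tr, cv=10, scoring='neg_mean_squared_error')
    cv_mse[m] = -scores.mean()
best_M = min(cv_mse, key=cv_mse.get)
pcr = make_pipeline(StandardScaler(), PCA(n_components=best_M), LinearRegression())
pcr.fit(X_tr, y_tr)
print("M selected by CV: " + str(best_M))
print("Test Error: " + str(mean_squared_error(y_te, pcr.predict(X_te))))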
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
from sklearn.cross_decomposition import PLSRegression
def PLS_CV(X_train, Y_train, X_test, Y_test, M):
    # Scale the features using statistics computed on the training set only
    scaler = preprocessing.StandardScaler().fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)
MSE = {}
test_MSE = {}
    for m in M: # Iterate over the number of PLS components
mse = 0
test_mse = 0
loo = LeaveOneOut() # Leave one out cross-validation
for train_index, test_index in loo.split(X_train_scaled):
X, X_CV = X_train_scaled[train_index], X_train_scaled[test_index]
Y, Y_CV = Y_train[train_index], Y_train[test_index]
model = PLSRegression(n_components=m)
model.fit(X, Y)
p = model.predict(X_CV)
mse += mean_squared_error(p, Y_CV)
MSE[m] = mse/len(X_train_scaled)
# Compute test MSE for the model
model = PLSRegression(n_components=m)
model.fit(X_train_scaled, Y_train)
p = model.predict(X_test_scaled)
test_MSE[m] = mean_squared_error(p, Y_test)
# Plot validation MSE
lists = sorted(MSE.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples
fig = plt.figure(figsize=(15, 8))
ax = fig.add_subplot(121)
plt.plot(x, y, color='r')
plt.grid()
ax.set_xlabel('M')
ax.set_ylabel('Validation MSE')
ax.set_title('Validation MSE vs M')
lists = sorted(test_MSE.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples
ax = fig.add_subplot(122)
plt.plot(x, y, color='g')
plt.grid()
ax.set_xlabel('M')
ax.set_ylabel('Test MSE')
ax.set_title('Test MSE vs M')
plt.show()
M = np.arange(1, 17, 1) # Number of PLS components (M)
PLS_CV(train.drop(['Name', 'Apps'], axis=1), train['Apps'].values, test.drop(['Name', 'Apps'], axis=1),
test['Apps'], M)
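An analogous sketch for PLS (same assumptions and illustrative names as in the PCR sketch above), again selecting M by 10-fold cross-validation and reporting the resulting test error:
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
cv_mse = {}
for m in M:  # 10-fold CV error for each number of PLS components
    pls = make_pipeline(StandardScaler(), PLSRegression(n_components=m))
    scores = cross_val_score(pls, train.drop(['Name', 'Apps'], axis=1), train['Apps'],
                             cv=10, scoring='neg_mean_squared_error')
    cv_mse[m] = -scores.mean()
best_M = min(cv_mse, key=cv_mse.get)
pls = make_pipeline(StandardScaler(), PLSRegression(n_components=best_M))
pls.fit(train.drop(['Name', 'Apps'], axis=1), train['Apps'])
pred = pls.predict(test.drop(['Name', 'Apps'], axis=1))
print("M selected by CV: " + str(best_M))
print("Test Error: " + str(mean_squared_error(test['Apps'], pred)))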
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
Sol: The test errors (in units of $10^7$) for the various methods are as follows:
- Least squares linear model: 0.118
- Ridge regression: 0.119
- The lasso: 0.121
- PCR: 0.67
- PLS: 0.11
All of the methods except PCR predict the number of applications with similar accuracy; PCR performs noticeably worse.
Q10. We have seen that as the number of features used in a model increases, the training error will necessarily decrease, but the test error may not. We will now explore this in a simulated data set.
(a) Generate a data set with $p = 15$ features, $n = 1,000$ observations, and an associated quantitative response vector generated according to the model $$Y = Xβ + \epsilon$$ where β has some elements that are exactly equal to zero.
X = np.random.normal(size=(1000, 15))
beta = np.random.normal(size=15)
beta[3] = 0
beta[5] = 0
beta[9] = 0
e = np.random.normal(size=1000)
y = np.dot(X, beta) + e
(b) Split your data set into a training set containing 100 observations and a test set containing 900 observations.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=100, test_size=900)
(c) Perform best subset selection on the training set, and plot the training set MSE associated with the best model of each size.
(d) Plot the test set MSE associated with the best model of each size.
import itertools as it
from sklearn.linear_model import LinearRegression
def select_subset_sizeK(X_, y_, k):
model = LinearRegression()
best_score = 0.0
M_k = []
for combo in it.combinations(range(X_.shape[1]), k):
X = X_[:, list(combo)]
model.fit(X, y_)
s = model.score(X, y_)
if s > best_score:
M_k = list(combo)
best_score = s
return M_k
def subset_selection(X_, y_):
# Fit model with intercept only (Null model)
train_MSE = {}
model_cols = {}
y_pred = np.mean(y_)
train_MSE[0] = np.sum((y_ - y_pred)**2) / len(y_)
    for s in range(1, X_.shape[1] + 1): # consider every model size from 1 to p
cols = select_subset_sizeK(X_, y_, s)
X = X_[:, cols]
model = LinearRegression()
model.fit(X, y_)
y_pred = model.predict(X)
train_MSE[s] = mean_squared_error(y_pred, y_)
model_cols[s] = cols
return (model_cols, train_MSE)
t = subset_selection(X_train, y_train)
models = t[0]
train_MSE = t[1]
fig = plt.figure(figsize=(15, 8))
lists = sorted(train_MSE.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples
ax = fig.add_subplot(121)
plt.plot(x, y, color='r')
plt.grid()
ax.set_xlabel('Model Size')
ax.set_ylabel('Training MSE')
ax.set_title('Training MSE vs Model Size')
test_MSE = {}
for size, cols in models.items():
    if size == 0:
        # Null model: predict the training-set mean
        test_MSE[size] = np.sum((y_test - np.mean(y_train))**2) / len(y_test)
else:
model = LinearRegression()
model.fit(X_train[:, cols], y_train)
y_pred = model.predict(X_test[:, cols])
test_MSE[size] = mean_squared_error(y_pred, y_test)
lists = sorted(test_MSE.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples
ax = fig.add_subplot(122)
plt.plot(x, y, color='g')
plt.grid()
ax.set_xlabel('Model Size')
ax.set_ylabel('Test MSE')
ax.set_title('Test MSE vs Model Size')
(e) For which model size does the test set MSE take on its minimum value? Comment on your results. If it takes on its minimum value for a model containing only an intercept or a model containing all of the features, then play around with the way that you are generating the data in (a) until you come up with a scenario in which the test set MSE is minimized for an intermediate model size.
print("Test MSE is minimum for model size: " +str(min(test_MSE, key=test_MSE.get)))
Test MSE is minimum for model size: 13
(f) How does the model at which the test set MSE is minimized compare to the true model used to generate the data? Comment on the coefficient values.
Sol: The selected model agrees well with the way the data was generated. The columns excluded from the chosen model are 5 and 9; while generating the data, the coefficients of features 3, 5, and 9 were set to 0, so two of the three zero coefficients are dropped exactly. In addition, the estimated coefficient of feature 3 is -0.07353929, which is very close to zero.
cols = models.get(min(test_MSE, key=test_MSE.get))
print("Columns used in the model: " +str(cols))
model = LinearRegression()
model.fit(X_train[:, cols], y_train)
print("Model Coefficients: " +str(model.coef_))
Columns used in the model: [0, 1, 2, 3, 4, 6, 7, 8, 10, 11, 12, 13, 14]
Model Coefficients: [-2.50852562 -0.43695144 1.40013156 -0.07353929 -0.85895357 -1.89061122
-0.30136561 1.12543061 0.09474982 -0.70489182 -0.6278358 -1.40983561
0.17529716]
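For a direct comparison with the generating model, the true values of β for the selected columns could also be printed (a small check that is not in the original output):
print("True coefficients: " + str(beta[cols]))  # compare against the fitted coefficients above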
Q11. We will now try to predict per capita crime rate in the Boston data set.
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
boston = pd.read_csv("data/Boston.csv")
boston.dropna(inplace=True)
np.random.seed(5)
X_train, X_test, y_train, y_test = train_test_split(boston.iloc[:,1:], boston['crim'], test_size=0.1)
# The lasso
n_alphas = 200
alphas = np.logspace(-10, 2, n_alphas)
# cv=None uses LassoCV's default k-fold cross-validation
model = LassoCV(alphas=alphas, fit_intercept=True, cv=None)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print("Test Error: " +str(mean_squared_error(y_test, predictions)))
print("Model coefficients: " + str(model.coef_))
Test Error: 54.95744135110768
Model coefficients: [ 0.04198738 -0.07640927 -0. -0. 0.32159534 -0.00722933
-0.67844741 0.5220804 -0.0022908 -0.09016879 -0.00174827 0.13527718
-0.15964042]
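To see which Boston predictors the lasso drops or heavily shrinks (this is what the discussion in parts (b) and (c) below relies on), the coefficients can be mapped to the column names; a small sketch, not part of the original output (coef is an illustrative name):
coef = pd.Series(model.coef_, index=X_train.columns)
print("Zero coefficients: " + str(list(coef[coef == 0].index)))
print("Coefficients with magnitude below 0.01: " + str(list(coef[(coef != 0) & (coef.abs() < 0.01)].index)))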
# Ridge Regression
model = RidgeCV(alphas=alphas, fit_intercept=True, cv=None)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print("Test Error: " +str(mean_squared_error(y_test, predictions)))
print("Model coefficients: " + str(model.coef_))
Test Error: 55.55453079915659
Model coefficients: [ 0.0406467 -0.08234663 -0.20301002 -0.09532246 0.55364958 -0.00933135
-0.70209263 0.52848777 -0.00235438 -0.14191045 -0.00127377 0.14595535
-0.17415457]
# PCR
M = np.arange(1, 14, 1) # Principal components
PCR_CV(X_train, y_train.values, X_test, y_test.values, M)
# PLS
M = np.arange(1, 14, 1) # Number of PLS components (M)
PLS_CV(X_train, y_train.values, X_test, y_test.values, M)
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation-set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
Sol: Except for PCR, the methods considered (the lasso, PLS, and ridge regression) all perform decently well.
(c) Does your chosen model involve all of the features in the data set? Why or why not?
Sol: Not entirely. In the lasso model with test MSE 54.957, the coefficients of chas and nox are exactly 0, and those of age, tax, and black are very small, so the chosen model effectively excludes or heavily downweights several of the features.