Applied
Q8. In this exercise, we will generate simulated data, and will then use this data to perform best subset selection.
(a) Use the rnorm() function to generate a predictor X of length n = 100, as well as a noise vector of length n = 100.
import numpy as np
np.random.seed(5)
X = np.random.normal(0, 1, 100)
e = np.random.normal(0, 1, 100)
(b) Generate a response vector Y of length n = 100 according to the model $Y = β_0 + β_1X + β_2X^2 + β_3X^3 + \epsilon$, where $β_0, β_1, β_2,$ and $β_3$ are constants of your choice.
beta0 = 5
beta1 = 2
beta2 = 3
beta3 = 4
y = beta0 + beta1*X + beta2*(X**2) + beta3*(X**3) + e
(c) Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors $X, X^2, \ldots, X^{10}$. What is the best model obtained according to $C_p$, BIC, and adjusted $R^2$? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained. Note you will need to use the data.frame() function to create a single data set containing both X and Y.
Sol: Instead of $C_p$, BIC, and adjusted $R^2$, a validation-set approach is used here: the test MSE is minimized by the model of size 3 containing predictors [0, 1, 2], i.e. $X$, $X^2$ and $X^3$, which matches the form of the generated data exactly.
import itertools as it
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
def select_subset_sizeK(X_, y_, k):
model = LinearRegression()
best_score = 0.0
M_k = []
for combo in it.combinations(range(X_.shape[1]), k):
X = X_[:, list(combo)]
model.fit(X, y_)
s = model.score(X, y_)
if s > best_score:
M_k = list(combo)
best_score = s
return M_k
def subset_selection(X_, y_):
# Fit model with intercept only (Null model)
train_MSE = {}
model_cols = {}
y_pred = np.mean(y_)
train_MSE[0] = np.sum((y_ - y_pred)**2) / len(y_)
    for s in range(1, X_.shape[1] + 1): # consider every model size from 1 to p
cols = select_subset_sizeK(X_, y_, s)
X = X_[:, cols]
model = LinearRegression()
model.fit(X, y_)
y_pred = model.predict(X)
train_MSE[s] = mean_squared_error(y_pred, y_)
model_cols[s] = cols
return (model_cols, train_MSE)
from sklearn.model_selection import train_test_split
X_ = np.vstack((X, X**2, X**3, X**4, X**5, X**6, X**7, X**8, X**9, X**10)).T
X_train, X_test, y_train, y_test = train_test_split(X_, y, test_size=0.1)
t = subset_selection(X_train, y_train)
models = t[0]
train_MSE = t[1]
fig = plt.figure(figsize=(15, 8))
lists = sorted(train_MSE.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples
ax = fig.add_subplot(121)
plt.plot(x, y, color='r')
plt.grid()
ax.set_xlabel('Model Size')
ax.set_ylabel('Training MSE')
ax.set_title('Training MSE vs Model Size')
test_MSE = {}
for size, cols in models.items():
    if size == 0:
        # Null model: predict the training-set mean
        test_MSE[size] = np.sum((y_test - np.mean(y_train))**2) / len(y_test)
else:
model = LinearRegression()
model.fit(X_train[:, cols], y_train)
y_pred = model.predict(X_test[:, cols])
test_MSE[size] = mean_squared_error(y_pred, y_test)
lists = sorted(test_MSE.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples
ax = fig.add_subplot(122)
plt.plot(x, y, color='g')
plt.grid()
ax.set_xlabel('Model Size')
ax.set_ylabel('Test MSE')
ax.set_title('Test MSE vs Model Size')
print("Test MSE is minimum for model size: " +str(min(test_MSE, key=test_MSE.get)))
cols = models.get(min(test_MSE, key=test_MSE.get))
print("Columns used in the model: " +str(cols))
model = LinearRegression()
model.fit(X_train[:, cols], y_train)
print("Model Coefficients: " +str(model.coef_))
Test MSE is minimum for model size: 3
Columns used in the model: [0, 1, 2]
Model Coefficients: [2.00715291 3.06050471 4.07996844]
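The question asks for $C_p$, BIC, and adjusted $R^2$, whereas the code above relies on a validation-set estimate of the test MSE. As a complement, the following is a minimal sketch (not part of the original solution; the names n, full_fit, sigma2, tss, cp, bic and adj_r2 are introduced here for illustration) that computes the three criteria for the best model of each size, estimating $\sigma^2$ from the full ten-predictor fit:
# Sketch: C_p, BIC and adjusted R^2 for the best model of each size found above
n = len(y_train)
full_fit = LinearRegression().fit(X_train, y_train)
sigma2 = np.sum((y_train - full_fit.predict(X_train))**2) / (n - X_train.shape[1] - 1)
tss = np.sum((y_train - np.mean(y_train))**2)
for size, cols in sorted(models.items()):
    fit = LinearRegression().fit(X_train[:, cols], y_train)
    rss = np.sum((y_train - fit.predict(X_train[:, cols]))**2)
    d = len(cols)  # number of predictors in the model
    cp = (rss + 2 * d * sigma2) / n
    bic = (rss + np.log(n) * d * sigma2) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    print(size, round(cp, 3), round(bic, 3), round(adj_r2, 3))
All three criteria would be expected to favour the three-variable model here, in line with the validation-set result above.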
(e) Now fit a lasso model to the simulated data, again using $X, X^2, \ldots, X^{10}$ as predictors. Use cross-validation to select the optimal value of λ. Report the resulting coefficient estimates, and discuss the results obtained.
Sol: The lasso model has relatively large coefficients for predictors 0, 1, 2, 3 and 4, with the first three (corresponding to $X$, $X^2$ and $X^3$) dominating; all of the remaining coefficients are negligible.
from sklearn.linear_model import LassoCV
n_alphas = 200
alphas = np.logspace(-10, 2, n_alphas)
# cv=None uses LassoCV's default k-fold cross-validation
model = LassoCV(alphas=alphas, fit_intercept=True, cv=None)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print("Test Error: " +str(mean_squared_error(y_test, predictions)))
print("Model coefficients: " + str(model.coef_))
Test Error: 4687.204168578645
Model coefficients: [ 1.83007496e+00 2.54077338e+00 3.91756182e+00 2.22585449e-01
2.89062042e-01 -3.33340397e-02 -2.88216147e-02 -3.80527416e-03
-9.90808115e-03 3.31479566e-03]
(sklearn raises a ConvergenceWarning here: the objective did not converge; fitting with very small alpha values may cause precision problems.)
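The convergence warnings arise because the lasso is fit on the raw polynomial features, whose scales differ by several orders of magnitude. One possible remedy, sketched below (not part of the original run; lasso_pipe is an illustrative name, and the reported coefficients would be on the standardized-feature scale), is to standardize the features inside a pipeline and allow more coordinate-descent iterations:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Sketch: standardize the polynomial features before the lasso and raise max_iter
lasso_pipe = make_pipeline(StandardScaler(),
                           LassoCV(alphas=alphas, cv=5, max_iter=100000))
lasso_pipe.fit(X_train, y_train)
print("Test Error: " + str(mean_squared_error(y_test, lasso_pipe.predict(X_test))))
print("Model coefficients: " + str(lasso_pipe.named_steps['lassocv'].coef_))  # standardized scale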
Q9. In this exercise, we will predict the number of applications received using the other variables in the College data set.
import pandas as pd
college = pd.read_csv("data/College.csv")
college.head()
| | Unnamed: 0 | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Abilene Christian University | Yes | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | 3300 | 450 | 2200 | 70 | 78 | 18.1 | 12 | 7041 | 60 |
| 1 | Adelphi University | Yes | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | 6450 | 750 | 1500 | 29 | 30 | 12.2 | 16 | 10527 | 56 |
| 2 | Adrian College | Yes | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | 3750 | 400 | 1165 | 53 | 66 | 12.9 | 30 | 8735 | 54 |
| 3 | Agnes Scott College | Yes | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | 5450 | 450 | 875 | 92 | 97 | 7.7 | 37 | 19016 | 59 |
| 4 | Alaska Pacific University | Yes | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | 4120 | 800 | 1500 | 76 | 72 | 11.9 | 2 | 10922 | 15 |
(a) Split the data set into a training set and a test set.
college = college.rename(columns={'Unnamed: 0': 'Name'})
college['Private'] = college['Private'].map({'Yes': 1, 'No': 0})
msk = np.random.rand(len(college)) < 0.8
train = college[msk]
test = college[~msk]
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
model = LinearRegression(fit_intercept=True)
model.fit(train.drop(['Name', 'Apps'], axis=1), train['Apps'])
predictions = model.predict(test.drop(['Name', 'Apps'], axis=1))
print("Test Error: " +str(mean_squared_error(test['Apps'], predictions)))
Test Error: 1178528.8421813124
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
from sklearn.linear_model import RidgeCV
n_alphas = 200
alphas = np.logspace(-10, 2, n_alphas)
# Leave one out cross-validation
model = RidgeCV(alphas=alphas, fit_intercept=True, cv=None, store_cv_values=True)
model.fit(train.drop(['Name', 'Apps'], axis=1), train['Apps'])
predictions = model.predict(test.drop(['Name', 'Apps'], axis=1))
print("Test Error: " +str(mean_squared_error(test['Apps'], predictions)))
Test Error: 1188298.4979038904
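The question asks for λ to be chosen by cross-validation; the selected value can be read off the fitted model's alpha_ attribute (a small check that is not in the original output):
print("Selected lambda: " + str(model.alpha_))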
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
from sklearn.linear_model import LassoCV
# cv=None uses LassoCV's default k-fold cross-validation
model = LassoCV(alphas=alphas, fit_intercept=True, cv=None)
model.fit(train.drop(['Name', 'Apps'], axis=1), train['Apps'])
predictions = model.predict(test.drop(['Name', 'Apps'], axis=1))
print("Test Error: " +str(mean_squared_error(test['Apps'], predictions)))
print("Number of Non-zero coefficients: " + str(len(model.coef_)))
Test Error: 1207847.543688796
Number of Non-zero coefficients: 17
(sklearn again raises a ConvergenceWarning: the objective did not converge; fitting with very small alpha may cause precision problems.)
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
def PCR_CV(X_train, Y_train, X_test, Y_test, M):
    # Scale the features using statistics computed on the training set only
    scaler = preprocessing.StandardScaler().fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    MSE = {}
    test_MSE = {}
    for m in M: # Iterate over number of principal components
        pca = PCA(n_components=m)
        X_train_reduced = pca.fit_transform(X_train_scaled)
        X_test_reduced = pca.transform(X_test_scaled) # reuse the PCA loadings fit on the training data
mse = 0
test_mse = 0
loo = LeaveOneOut() # Leave one out cross-validation
for train_index, test_index in loo.split(X_train_reduced):
X, X_CV = X_train_reduced[train_index], X_train_reduced[test_index]
Y, Y_CV = Y_train[train_index], Y_train[test_index]
model = LinearRegression(fit_intercept=True)
model.fit(X, Y)
p = model.predict(X_CV)
mse += mean_squared_error(p, Y_CV)
MSE[m] = mse/len(X_train_reduced)
# Compute test MSE for the model
model = LinearRegression(fit_intercept=True)
model.fit(X_train_reduced, Y_train)
p = model.predict(X_test_reduced)
test_MSE[m] = mean_squared_error(p, Y_test)
# Plot validation MSE
lists = sorted(MSE.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples
fig = plt.figure(figsize=(15, 8))
ax = fig.add_subplot(121)
plt.plot(x, y, color='r')
plt.grid()
ax.set_xlabel('Number of Principal Components')
ax.set_ylabel('Validation MSE')
ax.set_title('Validation MSE vs Principal Components')
lists = sorted(test_MSE.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples
ax = fig.add_subplot(122)
plt.plot(x, y, color='g')
plt.grid()
ax.set_xlabel('Number of Principal Components')
ax.set_ylabel('Test MSE')
ax.set_title('Test MSE vs Principal Components')
plt.show()
M = np.arange(1, 17, 1) # Principal components
PCR_CV(train.drop(['Name', 'Apps'], axis=1), train['Apps'].values, test.drop(['Name', 'Apps'], axis=1),
test['Apps'], M)
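The validation and test curves above are produced as plots only; the sketch below (not part of the original solution; X_tr, y_tr, X_te, y_te, cv_mse and best_M are illustrative names) selects M by 10-fold cross-validation with a scikit-learn pipeline and reports the corresponding test error explicitly:
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
X_tr, y_tr = train.drop(['Name', 'Apps'], axis=1), train['Apps']
X_te, y_te = test.drop(['Name', 'Apps'], axis=1), test['Apps']
cv_mse = {}
for m in M:  # 10-fold CV error for each number of principal components
    pcr = make_pipeline(StandardScaler(), PCA(n_components=m), LinearRegression())
    scores = cross_val_score(pcr, X_tr, y_tr, cv=10, scoring='neg_mean_squared_error')
    cv_mse[m] = -scores.mean()
best_M = min(cv_mse, key=cv_mse.get)
pcr = make_pipeline(StandardScaler(), PCA(n_components=best_M), LinearRegression())
pcr.fit(X_tr, y_tr)
print("M selected by CV: " + str(best_M))
print("Test Error: " + str(mean_squared_error(y_te, pcr.predict(X_te))))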
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
from sklearn.cross_decomposition import PLSRegression
def PLS_CV(X_train, Y_train, X_test, Y_test, M):
    # Scale the features using statistics computed on the training set only
    scaler = preprocessing.StandardScaler().fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)
MSE = {}
test_MSE = {}
    for m in M: # Iterate over the number of PLS components
mse = 0
test_mse = 0
loo = LeaveOneOut() # Leave one out cross-validation
for train_index, test_index in loo.split(X_train_scaled):
X, X_CV = X_train_scaled[train_index], X_train_scaled[test_index]
Y, Y_CV = Y_train[train_index], Y_train[test_index]
model = PLSRegression(n_components=m)
model.fit(X, Y)
p = model.predict(X_CV)
mse += mean_squared_error(p, Y_CV)
MSE[m] = mse/len(X_train_scaled)
# Compute test MSE for the model
model = PLSRegression(n_components=m)
model.fit(X_train_scaled, Y_train)
p = model.predict(X_test_scaled)
test_MSE[m] = mean_squared_error(p, Y_test)
# Plot validation MSE
lists = sorted(MSE.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples
fig = plt.figure(figsize=(15, 8))
ax = fig.add_subplot(121)
plt.plot(x, y, color='r')
plt.grid()
ax.set_xlabel('M')
ax.set_ylabel('Validation MSE')
ax.set_title('Validation MSE vs M')
lists = sorted(test_MSE.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples
ax = fig.add_subplot(122)
plt.plot(x, y, color='g')
plt.grid()
ax.set_xlabel('M')
ax.set_ylabel('Test MSE')
ax.set_title('Test MSE vs M')
plt.show()
M = np.arange(1, 17, 1) # Number of PLS components (M)
PLS_CV(train.drop(['Name', 'Apps'], axis=1), train['Apps'].values, test.drop(['Name', 'Apps'], axis=1),
test['Apps'], M)
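An analogous sketch for PLS (same assumptions and illustrative names as in the PCR sketch above), again selecting M by 10-fold cross-validation and reporting the resulting test error:
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
cv_mse = {}
for m in M:  # 10-fold CV error for each number of PLS components
    pls = make_pipeline(StandardScaler(), PLSRegression(n_components=m))
    scores = cross_val_score(pls, train.drop(['Name', 'Apps'], axis=1), train['Apps'],
                             cv=10, scoring='neg_mean_squared_error')
    cv_mse[m] = -scores.mean()
best_M = min(cv_mse, key=cv_mse.get)
pls = make_pipeline(StandardScaler(), PLSRegression(n_components=best_M))
pls.fit(train.drop(['Name', 'Apps'], axis=1), train['Apps'])
pred = pls.predict(test.drop(['Name', 'Apps'], axis=1))
print("M selected by CV: " + str(best_M))
print("Test Error: " + str(mean_squared_error(test['Apps'], pred)))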
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
Sol: The test errors (in units of $10^7$) for the various methods are as follows:
- Least squares linear model: 0.118
- Ridge regression: 0.119
- The lasso: 0.121
- PCR: 0.67
- PLS: 0.11
All of the methods except PCR predict the number of applications with similar accuracy; PCR performs noticeably worse.
Q10. We have seen that as the number of features used in a model increases, the training error will necessarily decrease, but the test error may not. We will now explore this in a simulated data set.
(a) Generate a data set with $p = 15$ features, $n = 1,000$ observations, and an associated quantitative response vector generated according to the model $$Y = Xβ + \epsilon$$ where β has some elements that are exactly equal to zero.
X = np.random.normal(size=(1000, 15))
beta = np.random.normal(size=15)
beta[3] = 0
beta[5] = 0
beta[9] = 0
e = np.random.normal(size=1000)
y = np.dot(X, beta) + e
(b) Split your data set into a training set containing 100 observations and a test set containing 900 observations.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=100, test_size=900)
(c) Perform best subset selection on the training set, and plot the training set MSE associated with the best model of each size.
(d) Plot the test set MSE associated with the best model of each size.
import itertools as it
from sklearn.linear_model import LinearRegression
def select_subset_sizeK(X_, y_, k):
model = LinearRegression()
best_score = 0.0
M_k = []
for combo in it.combinations(range(X_.shape[1]), k):
X = X_[:, list(combo)]
model.fit(X, y_)
s = model.score(X, y_)
if s > best_score:
M_k = list(combo)
best_score = s
return M_k
def subset_selection(X_, y_):
# Fit model with intercept only (Null model)
train_MSE = {}
model_cols = {}
y_pred = np.mean(y_)
train_MSE[0] = np.sum((y_ - y_pred)**2) / len(y_)
    for s in range(1, X_.shape[1] + 1): # consider every model size from 1 to p
cols = select_subset_sizeK(X_, y_, s)
X = X_[:, cols]
model = LinearRegression()
model.fit(X, y_)
y_pred = model.predict(X)
train_MSE[s] = mean_squared_error(y_pred, y_)
model_cols[s] = cols
return (model_cols, train_MSE)
t = subset_selection(X_train, y_train)
models = t[0]
train_MSE = t[1]
fig = plt.figure(figsize=(15, 8))
lists = sorted(train_MSE.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples
ax = fig.add_subplot(121)
plt.plot(x, y, color='r')
plt.grid()
ax.set_xlabel('Model Size')
ax.set_ylabel('Training MSE')
ax.set_title('Training MSE vs Model Size')
test_MSE = {}
for size, cols in models.items():
    if size == 0:
        # Null model: predict the training-set mean
        test_MSE[size] = np.sum((y_test - np.mean(y_train))**2) / len(y_test)
else:
model = LinearRegression()
model.fit(X_train[:, cols], y_train)
y_pred = model.predict(X_test[:, cols])
test_MSE[size] = mean_squared_error(y_pred, y_test)
lists = sorted(test_MSE.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples
ax = fig.add_subplot(122)
plt.plot(x, y, color='g')
plt.grid()
ax.set_xlabel('Model Size')
ax.set_ylabel('Test MSE')
ax.set_title('Test MSE vs Model Size')
(e) For which model size does the test set MSE take on its minimum value? Comment on your results. If it takes on its minimum value for a model containing only an intercept or a model containing all of the features, then play around with the way that you are generating the data in (a) until you come up with a scenario in which the test set MSE is minimized for an intermediate model size.
print("Test MSE is minimum for model size: " +str(min(test_MSE, key=test_MSE.get)))
Test MSE is minimum for model size: 13
(f) How does the model at which the test set MSE is minimized compare to the true model used to generate the data? Comment on the coefficient values.
Sol: The selected model agrees well with the way the data was generated. The columns excluded from the chosen model are 5 and 9; while generating the data, the coefficients of features 3, 5, and 9 were set to 0, so two of the three zero coefficients are dropped exactly. In addition, the estimated coefficient of feature 3 is -0.07353929, which is very close to zero.
cols = models.get(min(test_MSE, key=test_MSE.get))
print("Columns used in the model: " +str(cols))
model = LinearRegression()
model.fit(X_train[:, cols], y_train)
print("Model Coefficients: " +str(model.coef_))
Columns used in the model: [0, 1, 2, 3, 4, 6, 7, 8, 10, 11, 12, 13, 14]
Model Coefficients: [-2.50852562 -0.43695144 1.40013156 -0.07353929 -0.85895357 -1.89061122
-0.30136561 1.12543061 0.09474982 -0.70489182 -0.6278358 -1.40983561
0.17529716]
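For a direct comparison with the generating model, the true values of β for the selected columns could also be printed (a small check that is not in the original output):
print("True coefficients: " + str(beta[cols]))  # compare against the fitted coefficients above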
Q11. We will now try to predict per capita crime rate in the Boston data set.
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
boston = pd.read_csv("data/Boston.csv")
boston.dropna(inplace=True)
np.random.seed(5)
X_train, X_test, y_train, y_test = train_test_split(boston.iloc[:,1:], boston['crim'], test_size=0.1)
# The lasso
n_alphas = 200
alphas = np.logspace(-10, 2, n_alphas)
# cv=None uses LassoCV's default k-fold cross-validation
model = LassoCV(alphas=alphas, fit_intercept=True, cv=None)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print("Test Error: " +str(mean_squared_error(y_test, predictions)))
print("Model coefficients: " + str(model.coef_))
Test Error: 54.95744135110768
Model coefficients: [ 0.04198738 -0.07640927 -0. -0. 0.32159534 -0.00722933
-0.67844741 0.5220804 -0.0022908 -0.09016879 -0.00174827 0.13527718
-0.15964042]
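To see which Boston predictors the lasso drops or heavily shrinks (this is what the discussion in parts (b) and (c) below relies on), the coefficients can be mapped to the column names; a small sketch, not part of the original output (coef is an illustrative name):
coef = pd.Series(model.coef_, index=X_train.columns)
print("Zero coefficients: " + str(list(coef[coef == 0].index)))
print("Coefficients with magnitude below 0.01: " + str(list(coef[(coef != 0) & (coef.abs() < 0.01)].index)))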
# Ridge Regression
model = RidgeCV(alphas=alphas, fit_intercept=True, cv=None)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print("Test Error: " +str(mean_squared_error(y_test, predictions)))
print("Model coefficients: " + str(model.coef_))
Test Error: 55.55453079915659
Model coefficients: [ 0.0406467 -0.08234663 -0.20301002 -0.09532246 0.55364958 -0.00933135
-0.70209263 0.52848777 -0.00235438 -0.14191045 -0.00127377 0.14595535
-0.17415457]
# PCR
M = np.arange(1, 14, 1) # Principal components
PCR_CV(X_train, y_train.values, X_test, y_test.values, M)
# PLS
M = np.arange(1, 14, 1) # Number of PLS components (M)
PLS_CV(X_train, y_train.values, X_test, y_test.values, M)
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation-set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
Sol: Except for PCR, the methods considered (the lasso, PLS, and ridge regression) all perform decently well.
(c) Does your chosen model involve all of the features in the data set? Why or why not?
Sol: Not entirely. In the lasso model with test MSE 54.957, the coefficients of chas and nox are exactly 0, and those of age, tax, and black are very small, so the chosen model effectively excludes or heavily downweights several of the features.