This exercise is about applying regularization to mitigate the effects of overfitting. It assumes that you have read the tutorial on cross-validation.
Make a copy of the tutorial and make edits in the copy.
In the tutorial, go to the "Hold-out validation" section and add a for loop that runs the cell for at least 10 iterations. That is, in each iteration: create a new random train/validation split, fit the model on the training set, and record the $R^2$ score on the validation set.
Inspect the minimum and maximum $R^2$ scores and calculate their mean and variance. What does this indicate about the influence of the training set on model predictions?
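A minimal sketch of how such a loop could look, assuming the tutorial's linear regression model and an 80/20 split (the variable names here are illustrative, not part of the tutorial):
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
dataset = fetch_california_housing(as_frame=True)
X, y = dataset.data, dataset.target
r2_scores = []
for i in range(10):
    # A fresh random split each iteration, so each model sees different training data
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    reg = LinearRegression().fit(X_train, y_train)
    r2_scores.append(reg.score(X_val, y_val))  # R^2 on the held-out data
print(f"min={min(r2_scores):.3f} max={max(r2_scores):.3f} "
      f"mean={np.mean(r2_scores):.3f} var={np.var(r2_scores):.5f}")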
Go to the "Effects of polynomials on model fit" section and implement 10 fold cross validation to train the models with 3rd, 4th, and 5th order polynomials. Does this affect the fit of the models?
# Add your reflections here
In the cross-validation tutorial, it was observed that adding third- or higher-order polynomial terms results in overfitting of the regression model. In the following steps, a model pipeline similar to the one from the tutorial will be built, this time using ridge regression.
# Write your reflection here
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold, RepeatedKFold, cross_validate, train_test_split
from sklearn.linear_model import LinearRegression, Ridge # Ridge added for regularization
from sklearn.preprocessing import PolynomialFeatures, Normalizer
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_california_housing
np.random.seed(99) # seed for reproducibility
dataset = fetch_california_housing(as_frame=True)
df = dataset.frame # This is the dataframe (a table)
X = dataset.data # These are the input features (anything but the house price)
y = dataset.target # This is the target: the median house value
Build a model pipeline similar to the one in the tutorial, replacing the linear regression model with the Ridge class from scikit-learn. Use the np.geomspace function to create an array, regularization_params, with values exponentially spaced between $10^{-10}$ and $10^2$. These values will be used to vary the regularization parameter: train one model for each value in regularization_params.
Note: the regularization parameter $\lambda$ is called alpha in scikit-learn.
Assess the performance of the models on the validation set by calculating the $R^2$ scores and store them in scores.
Run the cell below to plot the $R^2$ scores for each model (each regularization value). What does the plot reveal about the effect of the regularization parameter on the performance of the model on the testing set?
# Write your solution here
model = Pipeline([
("features", PolynomialFeatures(3)), # Calculates the design matrix for a third order polynomial
("normalization", Normalizer()), # Normalizes the features to a (0, 1) range.
("model", Ridge(alpha=1)), # The regression model and regularization parameter value
])
regularization_params = np.geomspace(1e-10, 1e2, 20)
scores = []
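# A minimal sketch of the missing training loop (the 80/20 split below is
# an assumption; the exercise only specifies varying alpha over
# regularization_params):
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
for alpha in regularization_params:
    model.set_params(model__alpha=alpha)  # update the Ridge step's alpha
    model.fit(X_train, y_train)
    scores.append(model.score(X_val, y_val))  # R^2 on the validation set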
plt.plot(regularization_params, scores)
plt.xscale('log')
plt.title('R-squared Scores')
plt.show()
This task is about evaluating the effect of the regularization parameter on the mean and variance of the $R^2$ scores across folds.
# Write your solution and reflections here
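# A minimal sketch of one way to obtain mean_scores and variance_scores,
# assuming repeated k-fold cross-validation (the RepeatedKFold settings
# below are assumptions):
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=99)
mean_scores, variance_scores = [], []
for alpha in regularization_params:
    model.set_params(model__alpha=alpha)
    result = cross_validate(model, X, y, cv=cv, scoring="r2")
    mean_scores.append(result["test_score"].mean())     # average R^2 across folds
    variance_scores.append(result["test_score"].var())  # spread across folds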
# Plot the mean and variance R-squared scores
plt.figure(figsize=(10, 5)) # Set the figure size
plt.subplot(1, 2, 1) # Subplot 1 for Mean R-squared
plt.plot(regularization_params, mean_scores, label='Mean R-squared')
plt.xscale('log')
plt.title('Mean R-squared Scores')
plt.subplot(1, 2, 2) # Subplot 2 for Variance
plt.plot(regularization_params, variance_scores, label='Variance')
plt.xscale('log')
plt.title('Variance of R-squared Scores')
plt.tight_layout() # Ensure proper spacing between subplots
plt.show()
This task investigates model generalization using k-fold cross-validation.
# Write your solution here
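# A minimal sketch: compare training and validation R^2 across folds to
# judge generalization (the fold settings and the reuse of `model` from
# the pipeline above are assumptions):
kfold = KFold(n_splits=10, shuffle=True, random_state=99)
res = cross_validate(model, X, y, cv=kfold, scoring="r2", return_train_score=True)
print(f"train R^2:      {res['train_score'].mean():.3f}")
print(f"validation R^2: {res['test_score'].mean():.3f}")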
# Write your reflections here