This exercise is about applying regularization to mitigate the effects of overfitting. It assumes that you have read the tutorial on cross-validation.
Make a copy of the tutorial and make edits in the copy.
In the tutorial, go to the "Hold-out validation" section and add a for loop that runs the cell for at least 10 iterations. That is, in each iteration: create a new random train/validation split, fit the model on the training set, and record the $R^2$ score on the validation set.
Inspect the minimum and maximum $R^2$ scores and calculate their mean and variance. What does this indicate about the influence of the training set on model predictions?
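A minimal sketch of how such a loop could look, assuming the tutorial's linear regression model and an 80/20 split (the variable names here are illustrative, not part of the tutorial):
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
dataset = fetch_california_housing(as_frame=True)
X, y = dataset.data, dataset.target
r2_scores = []
for i in range(10):
    # A fresh random split each iteration, so each model sees different training data
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    reg = LinearRegression().fit(X_train, y_train)
    r2_scores.append(reg.score(X_val, y_val))  # R^2 on the held-out data
print(f"min={min(r2_scores):.3f} max={max(r2_scores):.3f} "
      f"mean={np.mean(r2_scores):.3f} var={np.var(r2_scores):.5f}")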
Go to the "Effects of polynomials on model fit" section and implement 10 fold cross validation to train the models with 3rd, 4th, and 5th order polynomials. Does this affect the fit of the models?
# Add your reflections here
In the cross-validation tutorial, it was observed that adding third- or higher-order polynomial terms results in overfitting of the regression model. In the following steps, a model pipeline similar to the one from the tutorial will be built, this time using ridge regression.
# Write your reflection here
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold, RepeatedKFold, cross_validate, train_test_split
from sklearn.linear_model import LinearRegression, Ridge # Ridge added for regularization
from sklearn.preprocessing import PolynomialFeatures, Normalizer
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_california_housing
np.random.seed(99) # seed for reproducibility
dataset = fetch_california_housing(as_frame=True)
df = dataset.frame # This is the dataframe (a table)
X = dataset.data # These are the input features (anything but the house price)
y = dataset.target # This is the target: the median house value
Build a model pipeline similar to the one in the tutorial, replacing the linear regression model with the Ridge class from scikit-learn. Use the np.geomspace function to create an array, regularization_params, with values exponentially spaced between $10^{-10}$ and $10^2$. These values will be used to vary the regularization parameter: train one model for each value in regularization_params.
Note: the regularization parameter $\lambda$ is called alpha in scikit-learn.
Assess the performance of the models on the validation set by calculating the $R^2$ scores and store them in scores.
Run the cell below to plot the $R^2$ scores for each model (each regularization value). What does the plot reveal about the effect of the regularization parameter on the performance of the model on the testing set?
# Write your solution here
model = Pipeline([
("features", PolynomialFeatures(3)), # Calculates the design matrix for a third order polynomial
("normalization", Normalizer()), # Normalizes the features to a (0, 1) range.
("model", Ridge(alpha=1)), # The regression model and regularization parameter value
])
regularization_params = np.geomspace(1e-10, 1e2, 20)
scores = []
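# A minimal sketch of the missing training loop (the 80/20 split below is
# an assumption; the exercise only specifies varying alpha over
# regularization_params):
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
for alpha in regularization_params:
    model.set_params(model__alpha=alpha)  # update the Ridge step's alpha
    model.fit(X_train, y_train)
    scores.append(model.score(X_val, y_val))  # R^2 on the validation set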
plt.plot(regularization_params, scores)
plt.xscale('log')
plt.title('R-squared Scores')
plt.show()
This task is about evaluating the effect of the regularization parameter on the mean and variance of the $R^2$ scores across folds.
# Write your solution and reflections here
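# A minimal sketch of one way to obtain mean_scores and variance_scores,
# assuming repeated k-fold cross-validation (the RepeatedKFold settings
# below are assumptions):
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=99)
mean_scores, variance_scores = [], []
for alpha in regularization_params:
    model.set_params(model__alpha=alpha)
    result = cross_validate(model, X, y, cv=cv, scoring="r2")
    mean_scores.append(result["test_score"].mean())     # average R^2 across folds
    variance_scores.append(result["test_score"].var())  # spread across folds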
# Plot the mean and variance R-squared scores
plt.figure(figsize=(10, 5)) # Set the figure size
plt.subplot(1, 2, 1) # Subplot 1 for Mean R-squared
plt.plot(regularization_params, mean_scores, label='Mean R-squared')
plt.xscale('log')
plt.title('Mean R-squared Scores')
plt.subplot(1, 2, 2) # Subplot 2 for Variance
plt.plot(regularization_params, variance_scores, label='Variance')
plt.xscale('log')
plt.title('Variance of R-squared Scores')
plt.tight_layout() # Ensure proper spacing between subplots
plt.show()
This task investigates model generalization using k-fold cross-validation.
# Write your solution here
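# A minimal sketch: compare training and validation R^2 across folds to
# judge generalization (the fold settings and the reuse of `model` from
# the pipeline above are assumptions):
kfold = KFold(n_splits=10, shuffle=True, random_state=99)
res = cross_validate(model, X, y, cv=kfold, scoring="r2", return_train_score=True)
print(f"train R^2:      {res['train_score'].mean():.3f}")
print(f"validation R^2: {res['test_score'].mean():.3f}")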
# Write your reflections here