Bias-variance and regularization

This exercise is about applying regularization to mitigate the effects of overfitting. It assumes that you have read the tutorial about cross-validation.

List of tasks
  • Task 1: Tutorial review
  • Task 2: Reflections on regularization
  • Task 3: Loading the dataset
  • Task 4: Implementing regularization
  • Task 5: Evaluating models
  • Task 6: Cross validation
  • Task 7: Reflection on results

Reflection on the tutorial

Task 1: Tutorial review
  1. Make a copy of the tutorial and make edits in the copy.

  2. In the tutorial, go to the "Hold-out validation" section and add a for loop that runs the cell for at least 10 iterations. That is, in each iteration:

    • Run the hold-out train-validation split.
    • Fit the model on the training set.
    • Compute and store the $R^2$ scores on the validation set.
  3. Inspect the minimum and maximum $R^2$ scores and calculate their mean and variance. What does this indicate about the influence of the training set on model predictions?

  4. Go to the "Effects of polynomials on model fit" section and implement 10-fold cross-validation to train the models with 3rd, 4th, and 5th order polynomials. Does this affect the fit of the models?
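
The repeated hold-out loop described in steps 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the tutorial's code: the data is synthetic (`X_demo`, `y_demo` are made-up names) and a plain `LinearRegression` stands in for whichever model the tutorial uses.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): a noisy linear relationship.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

r2_scores = []
for i in range(10):
    # A different random_state produces a different 80-20 split in each iteration.
    X_train, X_val, y_train, y_val = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=i
    )
    model_demo = LinearRegression().fit(X_train, y_train)
    # For regressors, score() returns the R^2 on the data it is given.
    r2_scores.append(model_demo.score(X_val, y_val))

r2_scores = np.array(r2_scores)
print(f"min={r2_scores.min():.3f}, max={r2_scores.max():.3f}")
print(f"mean={r2_scores.mean():.3f}, variance={r2_scores.var():.5f}")
```

The spread between the minimum and maximum score comes only from which samples end up in the training and validation sets, since the model and the data are otherwise unchanged.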

# Add your reflections here

Regularization

In the cross-validation tutorial, adding third- or higher-order polynomial terms was observed to overfit the regression model. The following steps build a model pipeline similar to the one from the tutorial, this time using ridge regression.

Task 2: Reflections on regularization
  1. Define the loss function used in ridge regression.
  2. What is the importance of the regularization parameter $\lambda$?
  3. What influence does $\lambda$ have when it becomes:
    • 0?
    • 1?
    • Large?
# Write your reflection here
Task 3: Loading the dataset
  1. Run the cell below to import libraries and set up the dataset.
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import KFold, RepeatedKFold, cross_validate
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, Normalizer
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge # additional import for regularization

np.random.seed(99) # seed for randomization 

dataset = fetch_california_housing(as_frame=True)

df = dataset.frame # This is the dataframe (a table)

X = dataset.data # These are the input features (anything but the house price)
y = dataset.target # This is the regression target (the median house value)
Task 4: Implementing regularization
  1. Run the cell below to:
    • create a third-order polynomial model with ridge regression using the Ridge class from scikit-learn.
    • use the np.geomspace function to create an array, regularization_params, with values exponentially spaced between $10^{-10}$ and $10^2$. These values will be used to vary the regularization parameter.
  2. In the cell below, divide the dataset into an 80-20 training-validation split and use the training set to train third-order ridge regression models with different regularization parameters $\lambda_i$ by iterating over the elements in regularization_params.

Note: the regularization parameter $\lambda$ is called alpha in scikit-learn.

  3. Assess the performance of the models on the validation set by calculating the $R^2$ scores and store them in scores.

  4. Run the cell below to plot the $R^2$ scores for each model (each regularization value). What does the plot reveal about the effect of the regularization parameter on the performance of the model on the validation set?
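
Two mechanics from the steps above can be illustrated on a small scale: how `np.geomspace` spaces its values, and how the `alpha` of a `Ridge` step inside a `Pipeline` can be changed after construction. The names `demo_params` and `demo_pipeline` are made up for this sketch, and the range and polynomial degree are deliberately smaller than in the exercise.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Five values between 1e-4 and 1e0, evenly spaced on a logarithmic scale.
demo_params = np.geomspace(1e-4, 1e0, 5)
print(demo_params)

# Consecutive values share a constant ratio (a factor of 10 here).
ratios = demo_params[1:] / demo_params[:-1]
print(ratios)

demo_pipeline = Pipeline([
    ("features", PolynomialFeatures(2)),
    ("model", Ridge(alpha=1)),
])

# "<step name>__<parameter>" addresses a parameter of a named pipeline step.
demo_pipeline.set_params(model__alpha=demo_params[0])
print(demo_pipeline.named_steps["model"].alpha)
```

Because the values are spaced by a constant factor rather than a constant difference, a logarithmic x-axis is the natural choice when plotting scores against them.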

# Write your solution here

model = Pipeline([
    ("features", PolynomialFeatures(3)), # Calculates the design matrix for a third order polynomial
    ("normalization", Normalizer()), # Rescales each sample (row) to unit norm (L2 by default)
    ("model", Ridge(alpha=1)), # The regression model and regularization parameter value
])

regularization_params = np.geomspace(1e-10, 1e2, 20)

scores = []

plt.plot(regularization_params, scores)
plt.xscale('log')
plt.title('R-squared Scores')
plt.show()
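
As a side note on the `Normalizer` step in the pipeline above: with its default settings it rescales each sample (row) to unit L2 norm, rather than scaling each feature (column) to a fixed range. A small sketch with made-up values (`rows` is a hypothetical array, not part of the dataset) shows the behaviour:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Three made-up samples with two features each.
rows = np.array([[3.0, 4.0], [1.0, 0.0], [-6.0, 8.0]])

normalized = Normalizer().fit_transform(rows)
print(normalized)

# Every row now has Euclidean length 1; signs are preserved.
print(np.linalg.norm(normalized, axis=1))
```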
Task 5: Evaluating models

This task is about evaluating the effect of the regularization parameter.

  1. In the cell above, add a for-loop to rerun the cell 20 times and store the $R^2$ results from each iteration. The loop should repeat the 80-20 hold-out train-validation split each time as in Task 1.
  2. Calculate the mean and variance of the $R^2$ scores for each regularization value, then run the cell below to plot the results.
  3. Based on the plots, which regularization parameter value gives the best results and why? Note down your observations and reflections in the cell below, as they will be used in the next task.
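
One way to organize the results from step 1 is a 2D array with one row per repetition and one column per regularization value, so that the statistics in step 2 reduce to an aggregation along the first axis. The sketch below uses made-up numbers (`all_scores_demo` is hypothetical and does not contain results from the dataset) purely to show the shapes involved.

```python
import numpy as np

# Hypothetical scores: 3 repetitions (rows) x 4 regularization values (columns).
all_scores_demo = np.array([
    [0.60, 0.62, 0.58, 0.40],
    [0.55, 0.61, 0.59, 0.42],
    [0.65, 0.63, 0.57, 0.38],
])

# axis=0 aggregates over repetitions, leaving one value per regularization value.
mean_demo = all_scores_demo.mean(axis=0)
variance_demo = all_scores_demo.var(axis=0)

print(mean_demo.shape, variance_demo.shape)
```

With 20 repetitions and the 20 values in regularization_params, the same pattern would give two arrays of length 20 that can be passed to the plotting cell.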
# Write your solution and reflections here


# Plot the mean and variance R-squared scores
plt.figure(figsize=(10, 5))  # Set the figure size
plt.subplot(1, 2, 1)  # Subplot 1 for Mean R-squared
plt.plot(regularization_params, mean_scores, label='Mean R-squared')
plt.xscale('log')
plt.title('Mean R-squared Scores')

plt.subplot(1, 2, 2)  # Subplot 2 for Variance
plt.plot(regularization_params, variance_scores, label='Variance')
plt.xscale('log')
plt.title('Variance of R-squared Scores')

plt.tight_layout()  # Ensure proper spacing between subplots
plt.show()

Cross-validation

Task 6: Cross validation

This task investigates model generalization using k-fold cross validation.

  1. Construct a new model with the same setup as before, using the optimal regularization parameter found in the previous task.
  2. Train the model using k-fold cross-validation. Set the number of folds to 2.
  3. Vary the number of folds from 2 to 20 and calculate the mean and the standard deviation of the $R^2$ scores for each number of folds.
  4. Plot the mean and the standard deviation of the $R^2$ scores as a function of the number of folds.
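
The mechanics of steps 2 and 3 can be sketched with `KFold` and `cross_validate`, both of which are imported in Task 3. This is an illustration under assumptions, not a solution: the data is synthetic, a bare `Ridge` replaces the polynomial pipeline, `alpha=1.0` is a placeholder rather than the optimal value from Task 5, and only 2 to 6 folds are tried to keep it short.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (assumption).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=300)

fold_counts = range(2, 7)
cv_means, cv_stds = [], []
for k in fold_counts:
    # Shuffle before splitting so folds do not depend on the row order.
    cv = KFold(n_splits=k, shuffle=True, random_state=0)
    results = cross_validate(Ridge(alpha=1.0), X_demo, y_demo, cv=cv, scoring="r2")
    # results["test_score"] holds one R^2 value per fold.
    cv_means.append(results["test_score"].mean())
    cv_stds.append(results["test_score"].std())

for k, m, s in zip(fold_counts, cv_means, cv_stds):
    print(f"k={k}: mean R^2={m:.3f}, std={s:.3f}")
```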
# Write your solution here
Task 7: Reflection on results
  1. Use the plotted mean and standard deviation to argue about the model performance.
  2. List reasons for the variability in model performance.
  3. Compare the variability in model performance observed in the tutorial with the results of the current exercise.
  4. Argue how the regularized model performs compared to the standard linear regression implemented in the tutorial. Print the model parameters and use them to argue for differences between the linear model and the regularized model.
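
For the last point, the fitted parameters of scikit-learn linear models are available as `coef_` and `intercept_`; for a step named "model" inside a pipeline they can be reached through `named_steps["model"]`. The sketch below compares the two model types on synthetic data (an assumption, as is the deliberately large `alpha=100.0`) to show what to look for when printing parameters.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic stand-in data (assumption).
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(50, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=50)

ols = LinearRegression().fit(X_demo, y_demo)
ridge = Ridge(alpha=100.0).fit(X_demo, y_demo)

print("Linear regression coefficients:", np.round(ols.coef_, 3))
print("Ridge coefficients:            ", np.round(ridge.coef_, 3))

# The ridge penalty shrinks the coefficient vector towards zero.
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print("Coefficient norms:", ols_norm, ridge_norm)
```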
# write your reflections here