Model Complexity and Overfitting

Overview

In this exercise you will experiment with the impact of model complexity (higher-order polynomials) and how it relates to Occam's razor.

List of tasks
  • Task 1: Polynomial regression
  • Task 2: Train and evaluate linear models with polynomial features
  • Task 3: Plot the polynomials (models)
  • Task 4: Reflection
  • Task 5: Changing the data generating function

This exercise is about making a regression model to predict the growth of Thuja Green Giant trees. You have to help the scientists decide which polynomial order best represents the training data, so it can be used to estimate future growth. To determine the optimal fit (model parameters), another group of researchers has provided you with observations of the heights of their Thuja Green Giant trees from years later than those currently observed by your team (X_test and y_test). You will use these observations to choose the model that best represents the growth of the Thuja Green Giant.

Data

The following cell constructs and shows the data for the exercise. The data simulates the yearly growth (in meters) of one of the fastest-growing trees, the Thuja Green Giant. Scientists have observed and reported the growth of the tree for 7 years (X_train and y_train), and now want to predict its future growth.

The objective is to assist in making predictions based on this data. Additional data from another group has been provided to validate the hypothesis.

The scientists assume a polynomial relationship.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# Set a random seed for reproducibility
np.random.seed(42)

# Generate synthetic data
n_samples = 100
X = np.linspace(0, 10, n_samples).reshape(-1, 1)
y_true = 1.5 * X.ravel() + 0.2
noise = np.random.normal(0, 1, n_samples)
y = y_true + noise

# Split the data into training and test sets
split_index = int(0.7 * n_samples)
X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]


# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, color='blue', label='Training data')
plt.scatter(X_test, y_test, color='red', label='Test data')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
Task 1: Polynomial regression
  1. In this exercise you may reuse your least-squares polynomial regression from the previous exercise, or use the PolynomialFeatures class from the scikit-learn library, to implement the polynomial_regression() function in the cell below.
def polynomial_regression(X, y, degree): """ Create and train a model of desired order and use it to predict the growth of the trees. :param X: Vector of combined observed years). :param y: Vector of combined observed height. :param degree: Degree of the model. :return: Vector containing prediction for training data, vector containing prediction for test data. """ #write code/solution here ...
def polynomial_regression(X, y, degree):
    """
    Create and train a model of desired order and use it to predict the growth of the trees.

    :param X: Vector of combined observed years (training and test).
    :param y: Vector of combined observed heights.
    :param degree: Degree of the model.
    
    :return: Vector containing prediction for training data, vector containing prediction for test data.
    """
#write code/solution here ...
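A minimal sketch of one possible solution is shown below (not the only valid approach). It uses the PolynomialFeatures and LinearRegression classes imported in the data cell, and assumes, as in that cell, that the first 70% of the samples form the training set:

def polynomial_regression(X, y, degree):
    # Reuse the 70/30 split from the data cell: the first 70% of the
    # samples are the observed training years (assumption).
    split = int(0.7 * len(X))
    # Expand X with polynomial feature columns up to the requested degree
    X_poly = PolynomialFeatures(degree=degree).fit_transform(X)
    # Fit the linear model on the training portion only
    model = LinearRegression()
    model.fit(X_poly[:split], y[:split])
    # Predict for both the training and test portions
    return model.predict(X_poly[:split]), model.predict(X_poly[split:])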
Task 2: Train and evaluate linear models with polynomial features
  1. Use the function polynomial_regression to perform polynomial regression for each order defined in the degrees variable and predict the outcome for both the test and training data.
  2. Implement the function compute_mse that, based on the predictions of a model and the ground-truth targets, returns the mean squared error (MSE):
$$ MSE = \frac{1}{m}\sum_{i=1}^{m}(f_{\mathbf{w}}(x_{i})-y_{i})^2 $$
Hint

You may save some time by modifying the implementation of the rmse function from the previous exercise.

  3. For each polynomial model, calculate the mean squared error for both the training and test data (use polynomial_regression and compute_mse).
def compute_mse(y_true, y_pred): """Compute Mean Squared Error between true and predicted values.""" #write code/solution here ... # Train and evaluate linear models with different polynomial features degrees = [1, 2, 3, 4, 5, 6] train_pred = [] test_pred = [] train_error = [] test_error = [] #write code/solution here ...
def compute_mse(y_true, y_pred):
    """Compute Mean Squared Error between true and predicted values."""
    #write code/solution here ... 

# Train and evaluate linear models with different polynomial features

degrees = [1, 2, 3, 4, 5, 6]
train_pred = []
test_pred = []
train_error = []
test_error = []

#write code/solution here ...
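A minimal sketch of one possible solution, assuming the polynomial_regression sketch from Task 1: compute_mse implements the MSE formula above, and the loop fills the four lists defined in this cell.

def compute_mse(y_true, y_pred):
    """Compute Mean Squared Error between true and predicted values."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

for degree in degrees:
    # Fit a model of this degree and predict on both data splits
    pred_train, pred_test = polynomial_regression(X, y, degree)
    train_pred.append(pred_train)
    test_pred.append(pred_test)
    # Record the error on each split for comparison in Task 3
    train_error.append(compute_mse(y_train, pred_train))
    test_error.append(compute_mse(y_test, pred_test))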
Task 3: Plot the polynomials (models)

Run the cell below to:

  1. Plot the data so that training and test data have different colors.
  2. Plot the predictions of the polynomial models over the scatter plot showing the given data. Perform this for both the training and test sets using X as input.
# Plot the results

plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, color='blue', label='Training data')
plt.scatter(X_test, y_test, color='red', label='Test data')

for i, degree in enumerate(degrees):
    plt.plot(X, np.concatenate((train_pred[i], test_pred[i])), label=f'Degree {degree}, MSE Train: {train_error[i]:.2f}, MSE Test: {test_error[i]:.2f}')
plt.xlabel('X')
plt.ylabel('y')
plt.ylim(0,22)
plt.legend()
plt.title('Linear Models with Different Polynomial Features')
plt.show()
# Insert code for question 1
# The following line keeps the y-axis range fixed in the plot
plt.ylim(0,30)
# Insert code for question 2
Task 4: Reflection

Reflect on:

  1. Which model had the best performance on the training data?
  2. Which model had the best performance on the test data?
  3. How does the complexity (degree) of the model affect the performance on the training and test data?
  4. Which model(s) shows signs of overfitting? How can you tell?
# Write reflection here
Task 5: Changing the data generating function

How do the results change if the underlying function generating the data is changed to a second-order polynomial, so that it simulates, for example, bacterial growth instead?

  1. Re-generate the data by replacing y_true with $y=f(x)=x^2+1.5x-3$ in the data-generation step, and rerun the other code blocks (a sketch of the change is shown after this list).
  2. Does it still make sense to follow the strategy of Occam's razor?
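For item 1, a minimal sketch of the change: replace the line defining y_true in the data cell with the quadratic below, then rerun the remaining cells.

# Replace  y_true = 1.5 * X.ravel() + 0.2  with:
x = X.ravel()
y_true = x**2 + 1.5 * x - 3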
# Write reflection