Overview
In this exercise you will experiment with the impact of model complexity (the order of a polynomial) and how it relates to Occam's Razor.
This exercise is about building a regression model to predict the growth of Thuja Green Giant trees. You will help the scientists decide which polynomial order best represents the training data for estimating future growth. To determine the optimal fit (model parameters), another group of researchers has provided you with observations of the height of their Thuja Green Giant trees from later years than those observed by your team (X_test and y_test). You will use these observations to choose the model that best represents the growth of the Thuja Green Giant.
The following cell constructs and shows the data for the exercise. The data simulates the yearly growth (in meters) of one of the fastest-growing trees, the Thuja Green Giant. Scientists have observed and reported the growth of the tree for 7 years (X_train and y_train), and now want to predict its future growth.
The objective is to assist in making predictions based on this data. Additional data from another group has been provided to validate the hypothesis.
The scientists assume a polynomial relationship.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# Set a random seed for reproducibility
np.random.seed(42)
# Generate synthetic data
n_samples = 100
X = np.linspace(0, 10, n_samples).reshape(-1, 1)
y_true = 1.5 * X.ravel() + 0.2
noise = np.random.normal(0, 1, n_samples)
y = y_true + noise
# Split the data into training and test sets
split_index = int(0.7 * n_samples)
X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]
# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, color='blue', label='Training data')
plt.scatter(X_test, y_test, color='red', label='Test data')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
Use the PolynomialFeatures() class from the scikit-learn library to implement the polynomial_regression() function in the cell below.
def polynomial_regression(X, y, degree):
    """
    Create and train a model of the desired order and use it to predict the growth of the trees.
    :param X: Vector of combined observed years.
    :param y: Vector of combined observed heights.
    :param degree: Degree of the model.
    :return: Vector containing predictions for the training data, vector containing predictions for the test data.
    """
    # write code/solution here ...
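As a rough illustration of how PolynomialFeatures and LinearRegression can be combined, here is a minimal sketch. The function name polynomial_regression_sketch, the explicit split_index argument, and the noise-free demo data are assumptions made for this example; they are not part of the exercise, and this is one possible approach rather than the intended solution.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures


def polynomial_regression_sketch(X, y, degree, split_index):
    """Fit on the first split_index rows and predict on both parts (illustrative only)."""
    X_train, X_test = X[:split_index], X[split_index:]
    y_train = y[:split_index]

    # Expand x into the columns [1, x, x^2, ..., x^degree].
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)

    # Ordinary least squares on the expanded features. The bias column
    # already acts as the intercept, so the model's own intercept is disabled.
    model = LinearRegression(fit_intercept=False)
    model.fit(X_train_poly, y_train)
    return model.predict(X_train_poly), model.predict(X_test_poly)


# Hypothetical noise-free data: a degree-2 fit should reproduce a quadratic.
X_demo = np.linspace(0, 10, 20).reshape(-1, 1)
y_demo = X_demo.ravel() ** 2 + 1.5 * X_demo.ravel() - 3
demo_train_pred, demo_test_pred = polynomial_regression_sketch(X_demo, y_demo, 2, 14)
```

The feature expansion is fitted on the training rows only and then reused for the test rows, so the test data never influences the model.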
Use polynomial_regression to perform polynomial regression for each order defined in the degrees variable and predict the outcome for both the test and training data.
Implement a function compute_mse that, based on the predictions of a model and the ground-truth targets, returns the mean squared error. You may save some time by modifying the implementation of the rmse function from the previous exercise.
Fill the train_pred, test_pred, train_error, and test_error lists in the cells below (using polynomial_regression and compute_mse).
def compute_mse(y_true, y_pred):
    """Compute Mean Squared Error between true and predicted values."""
    # write code/solution here ...
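For reference, the mean squared error is the average of the squared differences between targets and predictions, $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, and the RMSE is its square root. A minimal NumPy sketch follows; the name mse_sketch and the toy values are assumptions for illustration only.

```python
import numpy as np


def mse_sketch(y_true, y_pred):
    """Average squared difference between targets and predictions (illustrative)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


# Toy values: the errors are 0, 0 and 2, so the squared errors average to 4/3.
example_mse = mse_sketch([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```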
# Train and evaluate linear models with different polynomial features
degrees = [1, 2, 3, 4, 5, 6]
train_pred = []
test_pred = []
train_error = []
test_error = []
#write code/solution here ...
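The loop over degrees could be structured roughly as below. To keep the sketch self-contained, np.polyfit and np.polyval stand in for polynomial_regression, and the MSE is computed inline; both are substitutions made for illustration, not the functions the exercise asks for.

```python
import numpy as np

# Recreate data like that in the data-generation cell (linear trend plus noise).
np.random.seed(42)
X = np.linspace(0, 10, 100)
y = 1.5 * X + 0.2 + np.random.normal(0, 1, X.size)
split_index = int(0.7 * X.size)
X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]

degrees = [1, 2, 3, 4, 5, 6]
train_error, test_error = [], []
for degree in degrees:
    # Stand-in for polynomial_regression: least-squares polynomial fit.
    coeffs = np.polyfit(X_train, y_train, degree)
    # Stand-in for compute_mse: mean of the squared residuals.
    train_error.append(np.mean((y_train - np.polyval(coeffs, X_train)) ** 2))
    test_error.append(np.mean((y_test - np.polyval(coeffs, X_test)) ** 2))

for degree, tr, te in zip(degrees, train_error, test_error):
    print(f"degree {degree}: train MSE {tr:.2f}, test MSE {te:.2f}")
```

Because each higher-degree model contains the lower-degree ones as special cases, the training error cannot increase with the degree. The test error carries no such guarantee, which is the comparison the exercise asks you to examine.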
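Run the cell below to plot the training and test data together with each model's predictions and its training and test MSE.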
# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, color='blue', label='Training data')
plt.scatter(X_test, y_test, color='red', label='Test data')
for i, degree in enumerate(degrees):
    plt.plot(X, np.concatenate((train_pred[i], test_pred[i])), label=f'Degree {degree}, MSE Train: {train_error[i]:.2f}, MSE Test: {test_error[i]:.2f}')
plt.xlabel('X')
plt.ylabel('y')
plt.ylim(0,22)
plt.legend()
plt.title('Linear Models with Different Polynomial Features')
plt.show()
# Insert code for question 1
# The following line keeps the y-axis limits fixed in the plot
plt.ylim(0,30)
# Insert code for question 2
Reflect on:
# Write reflection here
How do the results change if the underlying function generating the data is changed to a second-order polynomial, so that it simulates, e.g., bacterial growth instead?
Replace y_true with $y=f(x)=x^2+1.5x-3$ in the data generation step, and rerun the other code blocks.
# Write reflection
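The change to the data-generation step described above could look like the following sketch. It reuses the seed, sample count, and noise level from the original cell; the helper variable x is an assumption introduced for readability.

```python
import numpy as np

# Same seed and sampling grid as the original data-generation cell.
np.random.seed(42)
n_samples = 100
X = np.linspace(0, 10, n_samples).reshape(-1, 1)
x = X.ravel()  # hypothetical helper: 1-D view of the inputs

# Second-order polynomial replacing the linear y_true.
y_true = x**2 + 1.5 * x - 3
noise = np.random.normal(0, 1, n_samples)
y = y_true + noise
```

With this function, y_true runs from -3 at x = 0 to 112 at x = 10, so the fixed y-axis limits in the plotting cells would likely need to be widened to show the whole curve.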