Info
If you're short on time, leave this exercise for later and prioritize the next exercises.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
In this exercise, suppose you want to buy a house in the City of Windsor, Canada. You contact a real-estate salesperson to get information about current house prices and receive details on 546 properties sold in Windsor in the last two years. You would like to figure out the expected cost of a house given only its lot size. The dataset has an independent variable, lotsize, specifying the lot size of a property, and a dependent variable, price, the sale price of a house. Assume an $N$th-order polynomial relation between price and lotsize.
The goal is to estimate the best model (in a least-squares sense) that predicts the house price from the lot size.
You will implement a method to estimate the model parameters of $N$th-order polynomials and use the model to predict the price of a house (in Canadian dollars) based on its lot size (in square feet).
A polynomial model of order $N$ is defined by:
$$ f_\mathbf{w}(x) = \mathbf{w}_0 + \mathbf{w}_1 x + \mathbf{w}_2 x^2 + \dots + \mathbf{w}_N x^N, $$
in which the coefficients $\mathbf{w}_i$ are the model parameters and $x$ is the lotsize.
Note that $f_\mathbf{w}$ is linear in the model parameters $\mathbf{w}$. Solving for the model parameters can be done by setting up the linear system of equations $A \mathbf{w} = y$, where
$$ \underbrace{\begin{bmatrix} 1 & x_1 & x_1^2 & x_1^3 & \dots & x_1^N \\ 1 & x_2 & x_2^2 & x_2^3 & \dots & x_2^N \\ 1 & x_3 & x_3^2 & x_3^3 & \dots & x_3^N \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_m & x_m^2 & x_m^3 & \dots & x_m^N \end{bmatrix}}_A \times \underbrace{\begin{bmatrix} \mathbf{w}_0 \\ \mathbf{w}_1 \\ \mathbf{w}_2 \\ \mathbf{w}_3 \\ \vdots \\ \mathbf{w}_N \end{bmatrix}}_\mathbf{w} = \underbrace{\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_m \end{bmatrix}}_y. $$
Define the loss $\mathcal{L}$ for a single prediction as the squared error
$$ \mathcal{L}(\hat{y}_i, y_i) = (\hat{y}_i-y_{i})^2, $$
where $\hat{y}_i=f_{\mathbf{w}}(x_i)$ is the prediction and $y_i$ is the label.
The linear least squares method minimizes the sum of squared errors. In other words, the parameters $\mathbf{w}$ can be learned by solving the following optimization problem:
$$ \mathbf{w} = \underset{\mathbf{w}}{\operatorname{argmin}} \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}_i, y_i) \quad\quad \text{(1)} $$
Recall that projecting the vector of labels $\mathbf{y} = \begin{bmatrix} y_1\\y_2\\\vdots\\y_m \end{bmatrix}$ onto the column space of the design matrix $A$ is equivalent to minimizing the mean squared error in Equation 1.
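As a concrete illustration, the cell below solves a tiny least-squares problem with np.linalg.lstsq, which computes exactly this projection. The data points are made up for the example:
# Toy example (made-up data): fit a line y = w0 + w1*x to three points.
x_toy = np.array([0.0, 1.0, 2.0])
y_toy = np.array([1.0, 3.0, 5.2])
# Design matrix for a 1st-order polynomial: columns [1, x].
A_toy = np.stack([np.ones_like(x_toy), x_toy], axis=1)
# np.linalg.lstsq returns the w minimizing ||A w - y||^2,
# i.e. the projection of y onto the column space of A.
w_toy, *_ = np.linalg.lstsq(A_toy, y_toy, rcond=None)
print(w_toy)  # approximately [0.97, 2.1]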
The following cell loads the dataset and visualizes the data:
filename = "./data/simple_windsor.csv"
names = ["lotsize", "price"]
dataset = np.loadtxt(filename, delimiter=',').astype(np.int64)
X_full, y_full = dataset.T
plt.scatter(X_full, y_full)
plt.xlabel('Lot size')
plt.ylabel('House price')
## List reasons here
The following cell splits the dataset into $80\%$ training data and $20\%$ test data using the scikit-learn library:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_full, y_full, test_size=0.2, random_state=42)
The following exercise guides you through the steps (1-4) for learning the polynomial model, starting with the design matrix (see the get_design_matrix function below).
The function get_design_matrix (defined in the cell below) creates a design matrix for a polynomial of order $N$.
def get_design_matrix(x, order=1):
    """
    Build the design matrix for a polynomial of order N.

    :param x: 1-D numpy array of m input values.
    :param order: Order of the polynomial.
    :return: Design matrix of shape (m, order + 1), or x unchanged if the input is invalid.
    """
    if order < 1 or x.ndim != 1:
        return x
    count = x.shape[0]
    matrix = np.ones((count, order + 1), np.float64)
    # Column i holds x raised to the i-th power; column 0 stays all ones.
    for i in range(1, order + 1):
        matrix[:, i] = x**i
    return matrix
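For example, a quick sanity check (the values here are arbitrary, not from the dataset):
# Design matrix for x = [1, 2, 3] and a 2nd-order polynomial.
x_example = np.array([1.0, 2.0, 3.0])
print(get_design_matrix(x_example, order=2))
# [[1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]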
Implement the function train(X, y, order)
in the cell below to learn the model parameters. Use get_design_matrix(X, order)
to create the design matrix.
def train(X, y, order):
"""
:param X: Input vector.
:param y: Training data values.
:param order: Order of the model to estimate.
:return: Parameters of model.
"""
...
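One possible implementation, shown here as a sketch rather than the definitive solution, builds the design matrix and solves the least-squares problem with np.linalg.lstsq:
def train(X, y, order):
    # Build the design matrix A for the given polynomial order.
    A = get_design_matrix(X, order)
    # Solve argmin_w ||A w - y||^2; lstsq also handles rank-deficient A.
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w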
Use the learned model parameters to predict house prices given an input vector $X$ of lot sizes. Implement the prediction function predict(X, w) in the cell below.
def predict(X, w):
"""
:param X: Input vector.
:param w: Estimated parameters.
:return: Predicted y-values.
"""
...
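A matching sketch for predict: the polynomial order can be recovered from the number of parameters, and the prediction is the matrix-vector product $A\mathbf{w}$:
def predict(X, w):
    # The parameter vector has order + 1 entries, so order = len(w) - 1
    # (this sketch assumes order >= 1).
    A = get_design_matrix(X, len(w) - 1)
    return A @ w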
In this task you will use the learned model parameters to make predictions of house prices given lot sizes. Implement the following steps (marked by #) in the code cell below:
1. Learn the model parameters from X_train and y_train.
2. Evaluate the model (predict the y-values) given the lot sizes defined in the values variable.
3. Plot the predicted values.
values = np.linspace(X_full.min(), X_full.max(), 50)
# (1) Learn model parameters
# (2) Evaluate model
# (3) Plot predicted values
plt.scatter(X_train, y_train)
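One way to fill in the three steps is sketched below; the 2nd-order polynomial is an arbitrary choice for illustration:
order = 2  # arbitrary choice for this illustration
# (1) Learn model parameters from the training data
w = train(X_train, y_train, order)
# (2) Evaluate the model on the evenly spaced lot sizes in values
y_values = predict(values, w)
# (3) Plot predicted values on top of the training data
plt.scatter(X_train, y_train)
plt.plot(values, y_values, "r")
plt.show()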
In this task you will experiment with the order of the polynomial model to investigate performance.
Observe that the predictions deviate drastically from the actual house prices for the $7$th-order polynomial and above. This happens because raising lot sizes of several thousand square feet to high powers produces enormous values, making the least-squares problem numerically ill-conditioned.
This problem can be solved by normalizing the input vectors. Normalization transforms the input values to the interval $[0, 1]$ by scaling and translating the inputs using the minimum and maximum values. The cell below provides functions for normalizing and denormalizing (the inverse transformation) input vectors:
def normalized(X):
    # Min-max normalization: map lot sizes to [0, 1] using the
    # minimum and maximum of the full dataset.
    return (X - np.min(X_full)) / (np.max(X_full) - np.min(X_full))
def denormalized(X):
    # Inverse transformation: map normalized values back to lot sizes.
    return X * (np.max(X_full) - np.min(X_full)) + np.min(X_full)
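A quick sanity check (illustrative only) confirms the two functions are inverses and that the normalized values span $[0, 1]$:
n = normalized(X_full)
print(n.min(), n.max())                      # 0.0 and 1.0
print(np.allclose(denormalized(n), X_full))  # True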
In this task you will redo Task 4 using normalization. Write your solution in the cell below:
1. Normalize X_train and X_test using the function normalized.
2. Learn the parameters and predict the y-values for X_test.
3. Plot the predicted values using plt.plot.
# (1) Normalize the inputs
# (2) Learn parameters and predict y-values
# Sort X_test and corresponding y_predicted for a smooth plot
sorted_indices = np.argsort(X_test)
X_test_sorted = X_test[sorted_indices]
y_predicted_sorted = y_predicted[sorted_indices]
# (3) Plot predicted values
plt.scatter(X_test, y_test, c="g")
plt.plot(X_test_sorted, y_predicted_sorted, "r") # Sorted for a clean plot
plt.show()
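For reference, steps (1) and (2) could look like the following sketch; these lines belong before the sorting and plotting code above, and the order 9 is an arbitrary high order chosen to show that normalization stabilizes the fit:
order = 9  # arbitrary high order for illustration
# (1) Normalize the inputs
X_train_n = normalized(X_train)
X_test_n = normalized(X_test)
# (2) Learn parameters on normalized training data, predict on normalized test data
w = train(X_train_n, y_train, order)
y_predicted = predict(X_test_n, w)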
# Write your answer here
In the following steps you will evaluate the models using the root mean squared error (RMSE) on unseen data (test data).
The root mean squared error is defined as: $$ \sqrt{\frac{1}{m}\sum_{i=1}^{m}(f_{\mathbf{w}}(x_{i})-y_{i})^2} $$
which measures the typical prediction error in the same units as the house prices (Canadian dollars).
The code cell below provides an implementation of the RMSE:
def rmse(X, y, w):
    # The models are trained on normalized inputs, so normalize X before predicting.
    X = normalized(X)
    ym = predict(X, w)
    # Root mean squared error between predictions and labels.
    return np.sqrt(np.mean((y - ym)**2))
In this task you will implement the function evaluate_models to evaluate polynomial models of order 1 to 19 using the root mean squared error.
For each model order:
1. Estimate the model parameters using the train function.
2. Calculate the RMSE for both the training data and the test data.
def evaluate_models():
"""Calculates the RMS error for both training and test data for models with polynomial orders
from 1 to 19.
Returns: (train losses, test losses)
"""
losses_train = []
losses_test = []
for order in range(1, 20):  # orders 1 through 19, matching the docstring
# Add code here
# first, estimate parameters
rmse_train = ...
rmse_test = ...
losses_train.append(rmse_train)
losses_test.append(rmse_test)
return losses_train, losses_test
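One possible loop body, shown as a sketch; it assumes X_train, y_train, X_test, and y_test from the earlier cells are in scope, and trains on normalized inputs since rmse() normalizes internally:
# first, estimate parameters on the normalized training data
w = train(normalized(X_train), y_train, order)
# then compute the RMSE on training and test data
rmse_train = rmse(X_train, y_train, w)
rmse_test = rmse(X_test, y_test, w)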
# Write your solution here
# Write your answers here