Info
If you're short on time, leave this exercise for later and prioritize the next exercises.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
In this exercise, suppose you want to buy a house in the City of Windsor, Canada. You contact a real-estate salesperson to get information about current house prices and receive details on 546 properties sold in Windsor in the last two years. You would like to figure out the expected cost of a house given only its lot size. The dataset has an independent variable, lotsize, specifying the lot size of a property, and a dependent variable, price, the sale price of a house. Assume an $N$th-order polynomial relation between price and lotsize.
The goal is to estimate the best model (in a least-squares sense) that predicts the house price from the lot size.
You will implement a method to estimate the model parameters of $N$th-order polynomials and use the model to predict the price of a house (in Canadian dollars) based on its lot size (in square feet).
A polynomial model of order $N$ is defined by:
$$ f_\mathbf{w}(x) = \mathbf{w}_0 + \mathbf{w}_1 x + \mathbf{w}_2 x^2 + \dots + \mathbf{w}_N x^N, $$
in which the coefficients $\mathbf{w}_i$ are the model parameters and $x$ is the lotsize.
Note that $f_\mathbf{w}$ is linear in the model parameters $\mathbf{w}$. Solving for the model parameters can be done by setting up the linear system of equations $A \mathbf{w} = y$, where
$$ \underbrace{\begin{bmatrix} 1 & x_1 & x_1^2 & x_1^3 & \dots & x_1^N \\ 1 & x_2 & x_2^2 & x_2^3 & \dots & x_2^N \\ 1 & x_3 & x_3^2 & x_3^3 & \dots & x_3^N \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_m & x_m^2 & x_m^3 & \dots & x_m^N \end{bmatrix}}_A \times \underbrace{\begin{bmatrix} \mathbf{w}_0 \\ \mathbf{w}_1 \\ \mathbf{w}_2 \\ \mathbf{w}_3 \\ \vdots \\ \mathbf{w}_N \end{bmatrix}}_\mathbf{w} = \underbrace{\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_m \end{bmatrix}}_y. $$
Define the loss $\mathcal{L}$ for a single prediction as the squared error
$$ \mathcal{L}(\hat{y}_i, y_i) = (\hat{y}_i-y_{i})^2, $$
where $\hat{y}_i=f_{\mathbf{w}}(x_i)$ is the prediction and $y_i$ is the label.
The linear least squares method minimizes the sum of squared errors. In other words, the parameters $\mathbf{w}$ can be learned by solving the following optimization problem:
$$ \mathbf{w} = \underset{\mathbf{w}}{\operatorname{argmin}} \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}_i, y_i) \quad\quad \text{(1)} $$
Recall that projecting the vector of labels $\mathbf{y} = \begin{bmatrix} y_1\\y_2\\\vdots\\y_m \end{bmatrix}$ onto the column space of the design matrix $A$ is equivalent to minimizing the mean squared error in Equation 1.
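As a concrete illustration, the cell below solves a tiny least-squares problem with np.linalg.lstsq, which computes exactly this projection. The data points are made up for the example:
# Toy example (made-up data): fit a line y = w0 + w1*x to three points.
x_toy = np.array([0.0, 1.0, 2.0])
y_toy = np.array([1.0, 3.0, 5.2])
# Design matrix for a 1st-order polynomial: columns [1, x].
A_toy = np.stack([np.ones_like(x_toy), x_toy], axis=1)
# np.linalg.lstsq returns the w minimizing ||A w - y||^2,
# i.e. the projection of y onto the column space of A.
w_toy, *_ = np.linalg.lstsq(A_toy, y_toy, rcond=None)
print(w_toy)  # approximately [0.97, 2.1]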
The following cell loads the dataset and visualizes the data:
filename = "./data/simple_windsor.csv"
names = ["lotsize", "price"]
dataset = np.loadtxt(filename, delimiter=',').astype(np.int64)
X_full, y_full = dataset.T
plt.scatter(X_full, y_full)
plt.xlabel('Lot size')
plt.ylabel('House price')
## List reasons here
The following cell splits the dataset into $80\%$ training data and $20\%$ test data using the scikit-learn library:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_full, y_full, test_size=0.2, random_state=42)
The following exercise guides you through the steps (1-4) for learning the polynomial model, starting with the design matrix (see the get_design_matrix function below).
The function get_design_matrix (defined in the cell below) creates a design matrix for a polynomial of order $N$.
def get_design_matrix(x, order=1):
    """
    Build the design matrix for a polynomial of order N.

    :param x: 1-D numpy array of m input values.
    :param order: Order of the polynomial.
    :return: Design matrix of shape (m, order + 1), or x unchanged if the input is invalid.
    """
    if order < 1 or x.ndim != 1:
        return x
    count = x.shape[0]
    matrix = np.ones((count, order + 1), np.float64)
    # Column i holds x raised to the i-th power; column 0 stays all ones.
    for i in range(1, order + 1):
        matrix[:, i] = x**i
    return matrix
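For example, a quick sanity check (the values here are arbitrary, not from the dataset):
# Design matrix for x = [1, 2, 3] and a 2nd-order polynomial.
x_example = np.array([1.0, 2.0, 3.0])
print(get_design_matrix(x_example, order=2))
# [[1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]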
Implement the function train(X, y, order)
in the cell below to learn the model parameters. Use get_design_matrix(X, order)
to create the design matrix.
def train(X, y, order):
"""
:param X: Input vector.
:param y: Training data values.
:param order: Order of the model to estimate.
:return: Parameters of model.
"""
...
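One possible implementation, shown here as a sketch rather than the definitive solution, builds the design matrix and solves the least-squares problem with np.linalg.lstsq:
def train(X, y, order):
    # Build the design matrix A for the given polynomial order.
    A = get_design_matrix(X, order)
    # Solve argmin_w ||A w - y||^2; lstsq also handles rank-deficient A.
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w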
Use the learned model parameters to predict house prices given an input vector $X$ of lot sizes. Implement the prediction function predict(X, w) in the cell below.
def predict(X, w):
"""
:param X: Input vector.
:param w: Estimated parameters.
:return: Predicted y-values.
"""
...
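A matching sketch for predict: the polynomial order can be recovered from the number of parameters, and the prediction is the matrix-vector product $A\mathbf{w}$:
def predict(X, w):
    # The parameter vector has order + 1 entries, so order = len(w) - 1
    # (this sketch assumes order >= 1).
    A = get_design_matrix(X, len(w) - 1)
    return A @ w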
In this task you will use the learned model parameters to make predictions of house prices given lot sizes. Implement the following steps (marked by #) in the code cell below:
1. Learn the model parameters from X_train and y_train.
2. Evaluate the model (predict the y-values) given the lot sizes defined in the values variable.
3. Plot the predicted values.
values = np.linspace(X_full.min(), X_full.max(), 50)
# (1) Learn model parameters
# (2) Evaluate model
# (3) Plot predicted values
plt.scatter(X_train, y_train)
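One way to fill in the three steps is sketched below; the 2nd-order polynomial is an arbitrary choice for illustration:
order = 2  # arbitrary choice for this illustration
# (1) Learn model parameters from the training data
w = train(X_train, y_train, order)
# (2) Evaluate the model on the evenly spaced lot sizes in values
y_values = predict(values, w)
# (3) Plot predicted values on top of the training data
plt.scatter(X_train, y_train)
plt.plot(values, y_values, "r")
plt.show()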
In this task you will experiment with the order of the polynomial model to investigate performance.
Observe that the predictions deviate drastically from the actual house prices for the $7$th-order polynomial and above. This happens because raising lot sizes of several thousand square feet to high powers produces enormous values, making the least-squares problem numerically ill-conditioned.
This problem can be solved by normalizing the input vectors. Normalization transforms the input values to the interval $[0, 1]$ by scaling and translating the inputs using the minimum and maximum values. The cell below provides functions for normalizing and denormalizing (the inverse transformation) input vectors:
def normalized(X):
    # Min-max normalization: map lot sizes to [0, 1] using the
    # minimum and maximum of the full dataset.
    return (X - np.min(X_full)) / (np.max(X_full) - np.min(X_full))
def denormalized(X):
    # Inverse transformation: map normalized values back to lot sizes.
    return X * (np.max(X_full) - np.min(X_full)) + np.min(X_full)
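A quick sanity check (illustrative only) confirms the two functions are inverses and that the normalized values span $[0, 1]$:
n = normalized(X_full)
print(n.min(), n.max())                      # 0.0 and 1.0
print(np.allclose(denormalized(n), X_full))  # True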
In this task you will redo Task 4 using normalization. Write your solution in the cell below:
1. Normalize X_train and X_test using the function normalized.
2. Learn the parameters and predict the y-values for X_test.
3. Plot the predicted values using plt.plot.
# (1) Normalize the inputs
# (2) Learn parameters and predict y-values
# Sort X_test and corresponding y_predicted for a smooth plot
sorted_indices = np.argsort(X_test)
X_test_sorted = X_test[sorted_indices]
y_predicted_sorted = y_predicted[sorted_indices]
# (3) Plot predicted values
plt.scatter(X_test, y_test, c="g")
plt.plot(X_test_sorted, y_predicted_sorted, "r") # Sorted for a clean plot
plt.show()
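For reference, steps (1) and (2) could look like the following sketch; these lines belong before the sorting and plotting code above, and the order 9 is an arbitrary high order chosen to show that normalization stabilizes the fit:
order = 9  # arbitrary high order for illustration
# (1) Normalize the inputs
X_train_n = normalized(X_train)
X_test_n = normalized(X_test)
# (2) Learn parameters on normalized training data, predict on normalized test data
w = train(X_train_n, y_train, order)
y_predicted = predict(X_test_n, w)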
# Write your answer here
In the following steps you will evaluate the models using the root mean squared error (RMSE) on unseen data (test data).
The root mean squared error is defined as: $$ \sqrt{\frac{1}{m}\sum_{i=1}^{m}(f_{\mathbf{w}}(x_{i})-y_{i})^2} $$
which measures the typical prediction error in the same units as the house prices (Canadian dollars).
The code cell below provides an implementation of the RMSE:
def rmse(X, y, w):
    # The models are trained on normalized inputs, so normalize X before predicting.
    X = normalized(X)
    ym = predict(X, w)
    # Root mean squared error between predictions and labels.
    return np.sqrt(np.mean((y - ym)**2))
In this task you will implement the function evaluate_models to evaluate polynomial models of order 1 to 19 using the root mean squared error.
For each model order:
1. Estimate the model parameters using the train function.
2. Calculate the RMSE for both the training data and the test data.
def evaluate_models():
"""Calculates the RMS error for both training and test data for models with polynomial orders
from 1 to 19.
Returns: (train losses, test losses)
"""
losses_train = []
losses_test = []
for order in range(1, 20):  # orders 1 through 19, matching the docstring
# Add code here
# first, estimate parameters
rmse_train = ...
rmse_test = ...
losses_train.append(rmse_train)
losses_test.append(rmse_test)
return losses_train, losses_test
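One possible loop body, shown as a sketch; it assumes X_train, y_train, X_test, and y_test from the earlier cells are in scope, and trains on normalized inputs since rmse() normalizes internally:
# first, estimate parameters on the normalized training data
w = train(normalized(X_train), y_train, order)
# then compute the RMSE on training and test data
rmse_train = rmse(X_train, y_train, w)
rmse_test = rmse(X_test, y_test, w)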
# Write your solution here
# Write your answers here