Principal Component Analysis (PCA)

In this in-class exercise you will be guided through the steps necessary for implementing PCA on a sequence of human poses. You will work with the poses data, which was used for the exercises in week 2. The dataset has a shape of $(1403, 100, 25*2)$. This means that there are 1403 pose sequences. Each sequence is a 100-frame time series capturing human poses. Each pose consists of 25 skeletal joints, where each joint has an x and a y coordinate ($25*2$). For this exercise, you will use a single pose sequence of 100 frames and apply dimensionality reduction to the selected sequence.

The following cells load the libraries and the dataset, and provide functions for plotting the poses:

import numpy as np
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
from pca_utils import *

# Suppress all warnings to keep the notebook output clean
warnings.filterwarnings("ignore")

1. Data inspection

The cell below:

Loads the data and constructs the data matrix.
Reshapes the data into a $100 \times 50$ data matrix where each row contains the flattened vector of a pose.
Selects the first 40 frames from a single pose sequence and plots them.

Task 1: Loading and inspecting the data

Run the code cell below.
Change the code to display sequences 4, 5, and 7 and visually observe how these sequences vary.

data = np.load('poses_norm.npy')
print(data.shape)
N,T,D,C = data.shape
reshaped_data = data.reshape(N,T,D*C)
dataset = reshaped_data[19]
print(reshaped_data.shape)

# Keep only the first 40 frames; each frame has 50 values (target shape (40, 50))
nr_of_frames=40
new_shape = (nr_of_frames, 50)
# Copy the first nr_of_frames rows of the sequence into a new array of that shape
reshaped_data2 = np.empty(new_shape)  # Create an empty array with the new shape
reshaped_data2[:] = dataset[:new_shape[0], :]
print(dataset.shape)

plot_single_sequence(reshaped_data2,pose_name='Pose',color='blue')

(1403, 100, 25, 2)
(1403, 100, 50)
(100, 50)

2. Covariance matrix

In the following tasks you will construct and inspect the covariance matrix for a given pose sequence.

Task 2: Covariance matrix - NumPy method

Run the cell below to obtain and plot the covariance matrix.
What does the heatmap tell us about the relationship between the variables (skeletal joint coordinates)?

# Calculate the covariance matrix of the selected pose sequence (rowvar=False: columns are variables)
cov_matrix = np.cov(dataset, rowvar=False)
# Plotting
sns.heatmap(cov_matrix, cmap='coolwarm')

Currently, the dataset is organized by frames, with each frame having alternating x and y coordinates in the order: $[x_1, y_1, x_2, y_2, \dots, x_{25}, y_{25}]$. The cell below rearranges the data for illustrative purposes, grouping all x-coordinates first, followed by all y-coordinates: $[x_1, x_2, \dots, x_{25}, y_1, y_2, \dots, y_{25}]$.

# Get the number of rows and columns in the dataset
num_rows, num_columns = dataset.shape

# Separate even and odd columns
even_indexes = np.arange(0, num_columns, 2)  # Even indexes (0, 2, 4, ...)
odd_indexes = np.arange(1, num_columns, 2)   # Odd indexes (1, 3, 5, ...)

# Rearrange the dataset
rearranged_dataset = dataset[:, np.concatenate((even_indexes, odd_indexes))]
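The effect of the rearrangement is easier to see on a small array. The sketch below uses a made-up toy matrix (3 frames, 4 joints) rather than the pose data, and the names `toy`, `x_cols`, `y_cols`, and `grouped` are illustrative only. It shows that strided slicing gives the same grouped layout as indexing with the concatenated even and odd indices.

```python
import numpy as np

# Toy stand-in for a frame-by-coordinate matrix (assumed sizes: 3 frames, 4 joints).
num_frames, num_joints = 3, 4
toy = np.arange(num_frames * num_joints * 2).reshape(num_frames, num_joints * 2)

# Interleaved layout: columns 0, 2, 4, ... hold x; columns 1, 3, 5, ... hold y.
x_cols = toy[:, 0::2]
y_cols = toy[:, 1::2]

# Grouped layout [x_1, ..., x_J, y_1, ..., y_J].
grouped = np.hstack((x_cols, y_cols))

print(toy[0])      # first frame, interleaved
print(grouped[0])  # first frame, all x values first, then all y values
```

Only the column order changes; the values in each frame and the number of columns stay the same.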

Task 3: Covariance matrix - rearranged data

Run the cell below to obtain and plot the covariance matrix of the rearranged data.

cov_matrix_rearranged = np.cov(rearranged_dataset, rowvar=False)
sns.heatmap(cov_matrix_rearranged, cmap='coolwarm')

Task 4: Implement your own covariance matrix (optional)

The following task should only be completed if you have extra time and want to try constructing the covariance matrix yourself. Use the rearranged_dataset to:

Construct the covariance matrix $\mathbf{C}$:

$$ \mathbf{C} = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{x}_i - \boldsymbol{\bar{x}})(\mathbf{x}_i - \boldsymbol{\bar{x}})^\top $$

where $\mathbf{x}_i$ represents the $i$-th pose vector (row) in the dataset, $N$ is the number of poses (frames), and $\boldsymbol{\bar{x}}$ is the mean vector obtained by averaging each coordinate over all poses: $\boldsymbol{\bar{x}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i$.

Hint

To center the data first calculate the mean vector, then subtract it from each data point of the pose sequence.

Create a heatmap of the covariance matrix.
Compare the covariance matrix obtained in this task to the one obtained in the previous task. How and why are they similar/different?

# write your solution here
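One possible sketch of the construction, run on synthetic data because the pose file is not assumed to be available here (the array `X` stands in for `rearranged_dataset`). It also compares the result with `np.cov`, whose default normalisation is $\frac{1}{N-1}$ rather than the $\frac{1}{N}$ used in the formula above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))  # synthetic stand-in for rearranged_dataset

N = X.shape[0]
mean_vector = X.mean(axis=0)    # one mean per column (coordinate)
centered = X - mean_vector      # broadcasting subtracts the mean from every row

# Sum of outer products written as a single matrix product, scaled by 1/N.
C_manual = centered.T @ centered / N

C_numpy_default = np.cov(X, rowvar=False)            # divides by N - 1
C_numpy_biased = np.cov(X, rowvar=False, bias=True)  # divides by N

print(np.allclose(C_manual, C_numpy_biased))
print(np.allclose(C_manual * N / (N - 1), C_numpy_default))
```

The two matrices differ only by the constant factor $\frac{N}{N-1}$, so their heatmaps look the same apart from a small change in scale.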

Task 5: Reflection (optional)

How would you change the above pipeline for obtaining the covariance matrix for all of the 1403 pose sequences?
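One option, sketched here under the assumption that every frame of every sequence is treated as an independent observation, is to stack all sequences into a single matrix before calling `np.cov`. The array `all_sequences` is a small synthetic stand-in for `reshaped_data`; other choices, such as one covariance matrix per sequence, are equally possible.

```python
import numpy as np

rng = np.random.default_rng(1)
all_sequences = rng.normal(size=(20, 100, 50))  # stand-in for an (N, T, D*C) array

# Merge the sequence and frame axes: every row is one pose.
stacked = all_sequences.reshape(-1, all_sequences.shape[-1])

cov_all = np.cov(stacked, rowvar=False)
print(stacked.shape, cov_all.shape)
```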

3. Eigenvalues and eigenvectors

The following steps involve implementing the eigen decomposition of the covariance matrix.

Task 6: Eigen decomposition

Run the cell below to find the eigenvalues and eigenvectors of the covariance matrix.
Plot the eigenvalues as in the plot below.

eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# write your solution here

Task 7: Properties of eigenvalues and eigenvectors

Determine whether all of the eigenvalues are non-negative (greater than or equal to 0)
Verify that the obtained eigenvectors are orthogonal. An efficient way is to use the definition of an orthonormal matrix ($A^\top A = I$). Alternatively, you can verify them pairwise.

Hint

Notice that the values may be slightly imprecise due to the finite precision of numerical representations. You can use np.isclose to check whether two values are close to each other or not.

What is the total variance of the dataset?

Hint

The sum of all eigenvalues should equal the total variance in the original data; however, due to numerical imprecision you might get slightly different values.

# Write your solution here

- Are all eigenvalues greater than or equal to 0: True
- Orthogonal eigenvectors: True
- Sum of eigenvalues: 0.56
- Total variance: 0.56
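The three checks can be sketched as follows on synthetic data (the matrix `X` and the tolerance `tol` are assumptions for illustration, not part of the exercise). The total variance is taken here as the trace of the covariance matrix, that is, the sum of the per-variable variances.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data whose columns have decreasing spread.
X = rng.normal(size=(100, 50)) * np.linspace(2.0, 0.1, 50)
cov = np.cov(X, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

tol = 1e-10  # allow tiny negative values caused by rounding
non_negative = bool(np.all(eigenvalues >= -tol))
orthonormal = bool(np.allclose(eigenvectors.T @ eigenvectors, np.eye(cov.shape[0])))
total_variance = np.trace(cov)

print(non_negative, orthonormal, np.isclose(eigenvalues.sum(), total_variance))
```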

Task 8: Sorting Eigenvalues and Eigenvectors

Use np.argsort to get the list of permutation indices of the eigenvalues in descending order, then sort them.
Use the list of indices to sort the eigenvectors based on the eigenvalues.
Plot the sorted eigenvalues. The plot should have a similar shape as the figure below.

# Write your solution here
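A minimal sketch of the sorting step on a small synthetic covariance matrix (the data in `A` is made up). The point to note is that eigenvectors are stored as columns, so the permutation is applied to the column axis.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(100, 6))  # synthetic data with 6 variables
cov = np.cov(A, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# argsort is ascending; reversing it gives indices for descending order.
order = np.argsort(eigenvalues)[::-1]
sorted_eigenvalues = eigenvalues[order]
sorted_eigenvectors = eigenvectors[:, order]  # reorder columns, not rows

print(sorted_eigenvalues)
```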

Info

We can choose to retain a certain percentage of the total variation by selecting the number of principal components for which the sum of the eigenvalues corresponds to the desired variance. It is convenient to calculate the cumulative sum of the eigenvalues (sum of the variances) to easily determine the number of components needed to retain a certain percentage of the total variance. This can be achieved using the cumsum function.

Task 9: Retain variance

Run the cell below to calculate the normalized cumulative explained variance.

Plot the cumulative variance as in the figure below.
How many components are needed to retain $50$%, $80$%, $90$%, and $95$% of the variation?
For the following tasks, select $k$ such that $95$% of the variation is retained.

cumulative_variance_ratio = np.cumsum(sorted_eigenvalues) / np.sum(sorted_eigenvalues)

# Write your solution here

9 Components keep 95% of the variance
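One way to read the number of components off the cumulative ratio is `np.searchsorted`, which returns the first index at which a threshold is reached. The eigenvalues below are made up for illustration and are not those of the pose data.

```python
import numpy as np

# Made-up eigenvalues, already sorted in descending order.
sorted_eigenvalues = np.array([4.0, 3.0, 1.5, 0.8, 0.5, 0.2])
cumulative_variance_ratio = np.cumsum(sorted_eigenvalues) / np.sum(sorted_eigenvalues)

components_needed = {}
for target in (0.50, 0.80, 0.90, 0.95):
    # First index whose cumulative ratio is >= target; +1 converts index to count.
    k = int(np.searchsorted(cumulative_variance_ratio, target)) + 1
    components_needed[target] = k
    print(f"{target:.0%} of the variance needs {k} component(s)")
```

For these toy values the cumulative ratios are 0.40, 0.70, 0.85, 0.93, 0.98, and 1.00, so the four thresholds need 2, 3, 4, and 5 components respectively.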


4. Mixing parameters

The following section describes how much each variable (x or y coordinates of a shape) contributes to the selected principal components:

Task 10: Mixing parameters

Change the cell below to construct the orthonormal matrix $\Phi$ containing the first $k = 9$ eigenvectors, where:

$$ {\Phi} = \begin{bmatrix} | & | & \cdots & | \\ \Phi_1 & \Phi_2 & \cdots & \Phi_9 \\ | & | & \cdots & | \end{bmatrix} $$

Define the mixing parameters $m_i = \sqrt{\lambda_i} \Phi_{i} $, where $\Phi_{i}$ represents the $i$-th column of $\Phi$ and $\lambda_i$ represents the corresponding eigenvalue.

# Write your solution here


print(mixing_params.shape)

(50, 9)
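A possible sketch of the construction on synthetic data (`X` is a random stand-in for the pose sequence). Multiplying the matrix by a vector of length $k$ relies on NumPy broadcasting, which scales column $i$ by $\sqrt{\lambda_i}$.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 50))  # synthetic stand-in for the pose sequence
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))

# Sort in descending order of eigenvalue.
order = np.argsort(eigenvalues)[::-1]
sorted_eigenvalues = eigenvalues[order]
sorted_eigenvectors = eigenvectors[:, order]

k = 9
phi = sorted_eigenvectors[:, :k]                       # (50, k), one eigenvector per column
mixing_params = phi * np.sqrt(sorted_eigenvalues[:k])  # column i scaled by sqrt(lambda_i)

print(mixing_params.shape)
```

Because each eigenvector has unit length, the length of the $i$-th mixing vector is $\sqrt{\lambda_i}$, the standard deviation of the data along that component.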

Task 11: Plot the mixing parameters

Run the cell below to make a plot that illustrates how much each $x,y$ variable of the shape contributes to the principal components.
Use the plot to describe how each of the principal components makes use of the different variables of the pose coordinates in the original data.
What do positive and negative values in the principal components indicate, and how do they relate to the original data?

num_variables = dataset.shape[1]
bar_width = 0.4  # Width of the bars

# Plotting only the first two principal components
plot_pca_loadings(
    mixing_params=mixing_params,
    dataset=dataset,
    bar_width=0.4,
    colors=['skyblue', 'salmon'],  
    component_indices=[1, 2],      
    show_individual=True,
    show_combined=True,
    save_plots=False               
)

plot_pose_with_numbers(dataset[12])

# Write your solution here

5. Generative process - Projecting to latent space and back

The pose data can be mapped to the latent space spanned by the principal components (eigenvectors) by: $$ \Phi^\top(x-\mu) $$ where

$$ {\Phi} = \begin{bmatrix} | & | & \cdots & | \\ \Phi_1 & \Phi_2 & \cdots & \Phi_9 \\ | & | & \cdots & | \end{bmatrix} $$

and $\Phi_i$ are the eigenvectors.

The following steps will implement this process.

Task 12: Project to subspace

Run the cell below to center the data. Use the centered data to:

Project the original data onto the selected eigenvectors using $\Phi^\top(x-\mu)$.
Plot the projected data.

# Calculate the mean vector
mean_vector = np.mean(dataset, axis=0)

# Subtract the mean from each data point
centered_data = dataset - mean_vector

# Write your solution here


plot_pca_pairwise_scatter(
    dataset=dataset,
    phi=phi,
    num_components=9,                
    figsize=(12, 12),
    marker='.',
    xlim=(-1, 1),
    ylim=(-1, 1)                         
)

print(projected_data.T.shape)

(100, 9)
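The projection can be sketched as a single matrix product over all frames. The example uses synthetic data, and it stores the result with one column per frame so that its transpose has one row per frame, which is an assumption chosen to match the printed shape above.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 50))  # synthetic stand-in for the pose sequence
mean_vector = X.mean(axis=0)
centered_data = X - mean_vector

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
# eigh sorts ascending, so reverse the columns and keep the first 9.
phi = eigenvectors[:, ::-1][:, :9]

# Phi^T (x - mu) applied to every frame at once.
projected_data = phi.T @ centered_data.T  # shape (9, 100)

print(projected_data.T.shape)
```

A useful sanity check is that the latent coordinates are uncorrelated and that their variances equal the nine largest eigenvalues.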

Task 13: Re-project from latent space to original data space

Project the data from the latent space back to the original data space using $\Phi x + \mu$, where $x$ here denotes the projected (latent) data from Task 12.

# Write your solution here


print(reconstructed_data.shape)

(100, 50)
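A sketch of the round trip on synthetic data that is deliberately close to low rank (3 hidden factors plus a little noise, an assumption made so that a few components suffice). The root-mean-square error between the original and the reconstruction measures what is lost by dropping the remaining components.

```python
import numpy as np

rng = np.random.default_rng(6)
# Synthetic data: 3 latent factors mixed into 50 columns, plus small noise.
latent = rng.normal(size=(100, 3))
X = latent @ rng.normal(size=(3, 50)) + 0.01 * rng.normal(size=(100, 50))

mean_vector = X.mean(axis=0)
centered = X - mean_vector
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
phi = eigenvectors[:, ::-1][:, :3]  # top 3 eigenvectors as columns

projected_data = phi.T @ centered.T                          # latent space, (3, 100)
reconstructed_data = (phi @ projected_data).T + mean_vector  # back to (100, 50)

rmse = np.sqrt(np.mean((X - reconstructed_data) ** 2))
print(reconstructed_data.shape, rmse)
```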

Task 14: Plotting original and reconstructed data

Run the cell below to plot the first nr_of_frames from the original and the reconstructed data.

# Write your solution here


# Define the new shape you want (40, 50)
nr_of_frames=40

reshaped_original = dataset[:nr_of_frames, :].reshape(nr_of_frames, -1)
reshaped_reconstructed = reconstructed_data[:nr_of_frames, :].reshape(nr_of_frames, -1)
plot_single_sequence(reshaped_original, pose_name='Original', color='blue')

# Plot reconstructed sequence
plot_single_sequence(reshaped_reconstructed, pose_name='Reconstructed', color='red')

Task 15: Changing the number of components

Use the pca_reconstruction function below to rerun the analysis with $k = 1, 2, 4, 40$ components.

# function for PCA
pca_reconstruction(dataset=dataset, k=9, variance_retained=0.95, num_frames=40)

Number of Components (k): 9
Variance Explained by 9 components: 95.77%
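To see the effect of $k$ numerically, the sketch below computes the reconstruction error for several values of $k$ on synthetic data. The helper `reconstruction_rmse` is hypothetical and written only for this illustration; it is not the `pca_reconstruction` function from `pca_utils`.

```python
import numpy as np

def reconstruction_rmse(X, k):
    """Hypothetical helper: RMSE after keeping the top-k principal components."""
    mean_vector = X.mean(axis=0)
    centered = X - mean_vector
    eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
    phi = eigenvectors[:, ::-1][:, :k]  # top-k eigenvectors as columns
    # Project onto the k components and map back to the data space.
    reconstructed = centered @ phi @ phi.T + mean_vector
    return float(np.sqrt(np.mean((X - reconstructed) ** 2)))

rng = np.random.default_rng(7)
# Synthetic data whose columns have decreasing spread.
X = rng.normal(size=(100, 50)) * np.linspace(2.0, 0.1, 50)

errors = {k: reconstruction_rmse(X, k) for k in (1, 2, 4, 40)}
for k, err in errors.items():
    print(f"k={k:2d}  RMSE={err:.4f}")
```

The error cannot increase as $k$ grows, and it vanishes (up to rounding) when all 50 components are kept.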