Important
Most of the code is provided, with only a few adjustments left to complete. It is, however, important for you to reflect on the outcomes and relate them to the theory.
This exercise is about activation and loss functions for neural architectures. The architecture of an NN offers endless variations and customization possibilities for defining (families of) prediction functions, making it impractical to exhaustively test every possible option for a given problem. Consequently, developing an intuition for how different architectural choices, parameters, and hyperparameters impact the performance of the network is essential.
This exercise is about the importance of selecting an appropriate activation function. Activation functions are crucial for introducing non-linearity into neural architectures. Each activation function has its unique characteristics and trade-offs and can significantly influence the model’s predictive capability, impacting its performance, convergence behavior, and the complexity of tasks it can effectively address.
This exercise explores activation functions for:
Classification:
Sigmoid: Commonly used in binary classification tasks, the sigmoid function maps input values to a number between 0 and 1. However, it can suffer from the so-called vanishing gradient problem, which occurs in deeper networks.
Hyperbolic Tangent (tanh): maps inputs to the range between -1 and 1 and addresses some of the shortcomings of the sigmoid function in terms of vanishing gradients.
Softmax: is a generalization of the logistic function, commonly used for the output layer of multi-class classification networks, converting raw inputs into probabilities across multiple classes.
Regression:
Rectified Linear Unit (ReLU): ReLU introduces non-linearity while maintaining computational efficiency. It helps mitigate vanishing gradient issues but may encounter "dead neurons" due to zero gradients for negative inputs.
Leaky ReLU: addresses the "dying ReLU" problem by allowing a small, non-zero gradient for negative inputs, keeping neurons active during training.
Exponential Linear Unit (ELU): similar to ReLU for positive inputs, ELU applies an exponential function to negative values, ensuring smoother gradients and reducing the risk of “dead neurons” during training.
Use the torch library to implement the activation functions defined in the cell below.
Linear
$$ f(x) = x $$
Sigmoid
$$ f(x) = \frac{1}{1 + e^{-x}} $$
ReLU
$$ f(x) = \max(0, x) $$
Leaky ReLU
$$ f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases} $$
Tanh
$$ f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $$
ELU
$$ f(x) = \begin{cases} x & \text{for } x \geq 0 \\ \alpha \left( e^x - 1 \right) & \text{for } x < 0 \end{cases} $$
import torch
def linear(x):
return None # Replace
def sigmoid(x):
return None # Replace
def relu(x):
return None # Replace
def leaky_relu(x, alpha=0.01):
return None # Replace
def tanh(x):
return None # Replace
def elu(x, alpha=1.0):
return None # Replace
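For reference, the definitions above map fairly directly onto elementary torch operations. The sketch below shows one possible set of implementations (the _ref suffix is only used here to avoid clashing with the functions you implement yourself; torch.nn.functional also provides built-in versions of all of these). Treat it as a way to sanity-check your own solution rather than as the prescribed one.
import torch
# One possible set of implementations (sketch, for reference only).
def linear_ref(x):
    return x                                                   # identity
def sigmoid_ref(x):
    return 1 / (1 + torch.exp(-x))                             # 1 / (1 + e^{-x})
def relu_ref(x):
    return torch.clamp(x, min=0)                               # max(0, x)
def leaky_relu_ref(x, alpha=0.01):
    return torch.where(x > 0, x, alpha * x)                    # small slope for x <= 0
def tanh_ref(x):
    return torch.tanh(x)
def elu_ref(x, alpha=1.0):
    return torch.where(x >= 0, x, alpha * (torch.exp(x) - 1))  # exponential branch for x < 0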
The make_moons function from the Scikit-Learn library is used to generate synthetic data. The function generates two classes to be separated, as shown in Figure 1. The following tasks apply the activation functions to this classification problem.
The NN defined in the cell below uses functions and classes defined in the following files:
networks.py: contains the SimpleNN network implementation.
trainers.py: contains the train function used for training.
metrics.py: contains the evaluateNN function for evaluation and visualization.
Examine the files and gain an overview of the architecture of the NN and the training loop.
Run the cell below to train and visualize the performance of the models with the different activation functions.
from torch import optim
from torchvision import transforms
from trainers import *
from networks import *
from metrics import *
X_train_tensor, X_test_tensor, y_train_tensor, y_test_tensor,X_train, X_test, y_train, y_test, X, y = get_data()
# Activation functions to test
activation_functions = {
'linear': linear,
'sigmoid': sigmoid,
'relu': relu,
'tanh': tanh,
'leaky_relu': leaky_relu,
'elu': elu
}
results = {}
for name, activation in activation_functions.items():
model, train_losses, accuracy, training_time, decision_threshold = train(SimpleNN(activation=activation), name, X_train_tensor, X_test_tensor, y_train_tensor, y_test_tensor, X_train, X_test, y_train, y_test, epoch=300 )
# Store results
results[name] = {
'model': model,
'train_losses': train_losses,
'accuracy': accuracy,
'training_time': training_time
}
evaluateNN(results, X, y)
#Write your reflections here...
The loss function evaluates how closely the model’s predictions match the true labels and guides the adjustment of model parameters during training. Different types of problems require specific loss functions. Therefore, understanding the data and the problem is crucial for selecting or designing the most suitable loss function for training the network.
This exercise explores the impact of the following loss functions:
Classification: Binary Cross-Entropy (BCE)
Regression: Mean Squared Error (MSE) and Mean Absolute Error (MAE)
Implement the loss functions defined below using the torch library. Let $N$ be the number of samples in the training set and $y_i$ the true label of the $i$-th sample. Define:
Mean Squared Error (MSE)
$$ \text{MSE}(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 $$where $ \hat{y}_i $ is the predicted value of the $i$-th sample in the training set.
Mean Absolute Error
$$ \text{MAE}(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| $$where $ \hat{y}_i $ is the predicted value of the $i$-th sample in the training set.
Binary Cross-Entropy Loss (BCE)
$$ \text{BCE}(y, \hat{y}) = - \frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] $$where $ \hat{y}_i $ is the predicted probability of the $i$-th sample in the training set.
# Custom Binary Cross-Entropy Loss Function
class BCE_Loss(nn.Module):
def __init__(self):
super(BCE_Loss, self).__init__()
def forward(self, outputs, targets):
epsilon = 1e-12
outputs = torch.clamp(outputs, min=epsilon, max=1-epsilon)
return None #Write your solution here
# Custom Mean Squared Error Loss Function
class MSE_Loss(nn.Module):
def __init__(self):
super(MSE_Loss, self).__init__()
def forward(self, outputs, targets):
return None #Write your solution here
# Custom Mean Absolute Error Loss Function
class MAE_Loss(nn.Module):
def __init__(self):
super(MAE_Loss, self).__init__()
def forward(self, outputs, targets):
return None #Write your solution here
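For reference, the three forward passes reduce to a handful of torch operations. The sketch below expresses them as plain functions (the _ref names are illustrative); the class versions above would simply return the corresponding expression.
import torch
# One possible set of loss expressions (sketch, for reference only).
def bce_ref(outputs, targets, epsilon=1e-12):
    # Clamp as in the class above to avoid log(0), then average the negative log-likelihood.
    outputs = torch.clamp(outputs, min=epsilon, max=1 - epsilon)
    return -torch.mean(targets * torch.log(outputs) + (1 - targets) * torch.log(1 - outputs))
def mse_ref(outputs, targets):
    return torch.mean((outputs - targets) ** 2)              # average squared error
def mae_ref(outputs, targets):
    return torch.mean(torch.abs(outputs - targets))          # average absolute error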
Compare the BCE, MSE and MAE loss functions when applied to a synthetic binary classification problem. Use the implemented loss functions, the true labels and the linspace of predictions in a for loop to plot the loss functions. Discuss why BCE is preferred for classification problems.
import seaborn as sns
import numpy as np               # used for np.linspace below
import matplotlib.pyplot as plt  # used for the figure below
# Instantiate the loss functions
#Write your solution here...
# True label
y_true = torch.tensor([0.0, 1.0]) # True label
# Range of predictions
predictions = torch.tensor(np.linspace(0.01, 0.99, 100))
# Plot using Seaborn
plt.figure(figsize=(8, 4))
for i in y_true:
# Compute BCE and MSE using the defined classes
#bce_values = ...
#mse_values = ...
#mae_values = ...
# Set Seaborn style
sns.set(style="whitegrid")
# Create a DataFrame for plotting
import pandas as pd
data = pd.DataFrame({
'Prediction': predictions,
'Binary Cross-Entropy (BCE)': bce_values,
'Mean Squared Error (MSE)': mse_values,
'Mean Absolute Error (MAE)': mae_values
})
# Plot BCE
sns.lineplot(data=data, x='Prediction', y='Binary Cross-Entropy (BCE)', color='blue', label='Binary Cross-Entropy (BCE)', linewidth=2.5, alpha=0.5)
# Plot MSE
sns.lineplot(data=data, x='Prediction', y='Mean Squared Error (MSE)', color='green', label='Mean Squared Error (MSE)', linewidth=2.5, alpha=0.75)
# Plot MAE
sns.lineplot(data=data, x='Prediction', y='Mean Absolute Error (MAE)', color='orange', label='Mean Absolute Error (MAE)', linewidth=2.5, alpha=0.75)
# Add labels, title, and legend
plt.title('Comparison of BCE, MSE and MAE', fontsize=8)
plt.xlabel('Prediction', fontsize=8)
plt.ylabel('Loss', fontsize=8)
plt.axhline(0, color='black', linewidth=0.5)
plt.axvline(0.5, color='red', linestyle='--', linewidth=1.0, label='Prediction = 0.5')
plt.legend(fontsize=8)
plt.grid(True)
plt.tight_layout()
plt.show()
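For reference, the placeholder lines inside the for loop above could be completed roughly as follows. This is a sketch only: it assumes the three loss classes have been implemented, and the instance names bce_loss, mse_loss and mae_loss are illustrative rather than prescribed.
# Sketch of one possible loop body for the plotting cell above.
bce_loss, mse_loss, mae_loss = BCE_Loss(), MSE_Loss(), MAE_Loss()
# For the current true label i, evaluate each loss per prediction value,
# so that a full curve (rather than a single mean over a batch) can be plotted.
target = torch.full_like(predictions, i.item())
bce_values = [bce_loss(p.unsqueeze(0), t.unsqueeze(0)).item() for p, t in zip(predictions, target)]
mse_values = [mse_loss(p.unsqueeze(0), t.unsqueeze(0)).item() for p, t in zip(predictions, target)]
mae_values = [mae_loss(p.unsqueeze(0), t.unsqueeze(0)).item() for p, t in zip(predictions, target)]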
#Write your reflections here...
The next step is to use the loss functions to train a neural network on the same classification task as before and to show their impact on performance metrics such as accuracy, precision, recall and F1 score.
For this task, the evaluateNN2 function in the metrics.py file will be used. Note the noise parameter in the get_data function and the decision_threshold parameter in the train function.
from torch import optim
from torchvision import transforms
from trainers import *
from networks import *
from metrics import *
X_train_tensor, X_test_tensor, y_train_tensor, y_test_tensor,X_train, X_test, y_train, y_test, X, y = get_data(0.2)
# Define loss functions to test
loss_functions = {
'Binary Cross-Entropy': BCE_Loss(),
'Mean Squared Error': MSE_Loss(),
'Mean Absolute Error': MAE_Loss()
}
results = {}
for name, loss in loss_functions.items():
model, train_losses, accuracy, training_time, decision_threshold = train(SimpleNN(activation=relu), name, X_train_tensor, X_test_tensor, y_train_tensor, y_test_tensor, X_train, X_test, y_train, y_test, 500, loss=loss, decision_threshold=0.1 )
# Store results
results[name] = {
'model': model,
'train_losses': train_losses,
'accuracy': accuracy,
'training_time': training_time,
'decision_threshold': decision_threshold
}
evaluateNN2(results, X, y, X_test_tensor, y_test)
Use the plots to evaluate the performance of the different loss functions and incorporate theoretical concepts to interpret the results.
Explain why MSE and MAE losses may achieve lower final loss values, while BCE delivers comparable or superior accuracy for the classification task. Discuss the distinct characteristics of each loss function and how they relate to classification performance.
Experiment with the noise parameter and explain its impact on accuracy. Relate your explanation to the findings from Task 5.
Modify the decision threshold and analyze its impact on the results (a possible setup for both experiments is sketched below).
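A possible way to set up both experiments is sketched below; it simply re-runs the calls from the training cell above over a few candidate values (the specific noise levels and thresholds are illustrative).
# Sketch only: re-run the training for a few noise levels and decision thresholds.
for noise in [0.1, 0.2, 0.4]:                      # noise level passed to get_data
    (X_train_tensor, X_test_tensor, y_train_tensor, y_test_tensor,
     X_train, X_test, y_train, y_test, X, y) = get_data(noise)
    for threshold in [0.1, 0.3, 0.5]:              # decision threshold passed to train
        model, train_losses, accuracy, training_time, decision_threshold = train(
            SimpleNN(activation=relu), f'BCE, noise={noise}, threshold={threshold}',
            X_train_tensor, X_test_tensor, y_train_tensor, y_test_tensor,
            X_train, X_test, y_train, y_test, 500,
            loss=BCE_Loss(), decision_threshold=threshold)
        print(f'noise={noise}, threshold={threshold}: accuracy={accuracy}')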
#Write your reflections...
Consider the case of iris codes derived from an individual’s iris pattern. The codes are represented as binary vectors, which are matched against a database of authorized codes to determine access. The Hamming Distance (HD) is commonly used as a similarity measure.
$$ HD(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} \begin{cases} 1 & \text{if } x_i \neq y_i \\ 0 & \text{if } x_i = y_i \end{cases} $$
It calculates the number of positions with different values.
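For example, for $\mathbf{x} = (1, 0, 1, 1, 0)$ and $\mathbf{y} = (1, 0, 0, 1, 1)$ the vectors differ in the third and fifth positions, so $HD(\mathbf{x}, \mathbf{y}) = 2$.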
A neural network could potentially be trained to generate synthetic eye images embedding specific iris codes, for example to gain unauthorized access to a system. It may seem logical to design a custom loss function based on the Hamming Distance, encouraging the network to generate patterns that closely match a target iris code. However, this approach is not valid for a loss function in the context of neural network training.
Incorporate the formula of the HD and the nature of Gradient Descent in your discussion.
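To see where the problem arises, consider the small illustration below (not part of the exercise solution). Turning continuous network outputs into bits requires a thresholding step, which is piecewise constant, so the resulting Hamming distance provides no gradient with respect to the outputs.
import torch
# Illustration only: a Hamming-distance "loss" on thresholded outputs yields no gradient.
outputs = torch.tensor([0.3, 0.8, 0.4], requires_grad=True)  # continuous network outputs
target = torch.tensor([0.0, 1.0, 1.0])                       # target iris-code bits
bits = (outputs > 0.5).float()        # hard thresholding: a step function
hd = (bits != target).float().sum()   # Hamming distance between generated and target code
print(hd)                 # tensor(1.) -- one mismatching bit
print(hd.requires_grad)   # False: the comparisons cut the autograd graph,
                          # so gradient descent receives no learning signal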
#Write your reflections here...