Important
Most of the code is provided, with only a few adjustments left to complete. It is, however, important for you to reflect on the outcomes and relate them to the theory.
This exercise is about activation and loss functions for neural architectures. The architecture of an NN offers endless variations and customization possibilities for defining (families of) prediction functions, making it impractical to exhaustively test every possible option for a given problem. Consequently, developing an intuition for how different architectural choices, parameters, and hyperparameters impact the performance of the network is essential.
This exercise is about the importance of selecting an appropriate activation function. Activation functions are crucial for introducing non-linearity into neural architectures. Each activation function has its unique characteristics and trade-offs and can significantly influence the model’s predictive capability, impacting its performance, convergence behavior, and the complexity of tasks it can effectively address.
This exercise explores activation functions for:
Classification:
Sigmoid: Commonly used in binary classification tasks, the sigmoid function maps input values to a number between 0 and 1. However, it can suffer from the so-called vanishing gradient problem, which occurs in deeper networks.
Hyperbolic Tangent (tanh): maps inputs to the range between -1 and 1 and addresses some of the shortcomings of the sigmoid function in terms of vanishing gradients.
Softmax: is a generalization of the logistic function, commonly used for the output layer of multi-class classification networks, converting raw inputs into probabilities across multiple classes.
Regression:
Rectified Linear Unit (ReLU): ReLU introduces non-linearity while maintaining computational efficiency. It helps mitigate vanishing gradient issues but may encounter "dead neurons" due to zero gradients for negative inputs.
Leaky ReLU: addresses the "dying ReLU" problem by allowing a small, non-zero gradient for negative inputs, keeping neurons active during training.
Exponential Linear Unit (ELU): similar to ReLU for positive inputs, ELU applies an exponential function to negative values, ensuring smoother gradients and reducing the risk of “dead neurons” during training.
Use the torch library to implement the activation functions defined in the cell below.
Linear
$$ f(x) = x $$
Sigmoid
$$ f(x) = \frac{1}{1 + e^{-x}} $$
ReLU
$$ f(x) = \max(0, x) $$
Leaky ReLU
$$ f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases} $$
Tanh
$$ f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $$
ELU
$$ f(x) = \begin{cases} x & \text{for } x \geq 0 \\ \alpha \left( e^x - 1 \right) & \text{for } x < 0 \end{cases} $$
import torch
def linear(x):
return None # Replace
def sigmoid(x):
return None # Replace
def relu(x):
return None # Replace
def leaky_relu(x, alpha=0.01):
return None # Replace
def tanh(x):
return None # Replace
def elu(x, alpha=1.0):
return None # Replace
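For reference, the definitions above map fairly directly onto elementary torch operations. The sketch below shows one possible set of implementations (the _ref suffix is only used here to avoid clashing with the functions you implement yourself; torch.nn.functional also provides built-in versions of all of these). Treat it as a way to sanity-check your own solution rather than as the prescribed one.
import torch
# One possible set of implementations (sketch, for reference only).
def linear_ref(x):
    return x                                                   # identity
def sigmoid_ref(x):
    return 1 / (1 + torch.exp(-x))                             # 1 / (1 + e^{-x})
def relu_ref(x):
    return torch.clamp(x, min=0)                               # max(0, x)
def leaky_relu_ref(x, alpha=0.01):
    return torch.where(x > 0, x, alpha * x)                    # small slope for x <= 0
def tanh_ref(x):
    return torch.tanh(x)
def elu_ref(x, alpha=1.0):
    return torch.where(x >= 0, x, alpha * (torch.exp(x) - 1))  # exponential branch for x < 0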
The make_moons function from the Scikit-Learn library is used to generate synthetic data. The function generates two classes to be separated, as shown in Figure 1. The following tasks apply the activation functions to this classification problem.
The NN defined in the cell below uses functions and classes defined in the following files:
networks.py: contains the SimpleNN network implementation.
trainers.py: contains the train function used for training.
metrics.py: contains the evaluateNN function for evaluation and visualization.
Examine the files and gain an overview of the architecture of the NN and the training loop.
Run the cell below to train and visualize the performance of the models with the different activation functions.
from torch import optim
from torchvision import transforms
from trainers import *
from networks import *
from metrics import *
X_train_tensor, X_test_tensor, y_train_tensor, y_test_tensor,X_train, X_test, y_train, y_test, X, y = get_data()
# Activation functions to test
activation_functions = {
'linear': linear,
'sigmoid': sigmoid,
'relu': relu,
'tanh': tanh,
'leaky_relu': leaky_relu,
'elu': elu
}
results = {}
for name, activation in activation_functions.items():
model, train_losses, accuracy, training_time, decision_threshold = train(SimpleNN(activation=activation), name, X_train_tensor, X_test_tensor, y_train_tensor, y_test_tensor, X_train, X_test, y_train, y_test, epoch=300 )
# Store results
results[name] = {
'model': model,
'train_losses': train_losses,
'accuracy': accuracy,
'training_time': training_time
}
evaluateNN(results, X, y)
#Write your reflections here...
The loss function evaluates how closely the model’s predictions match the true labels and guides the adjustment of model parameters during training. Different types of problems require specific loss functions. Therefore, understanding the data and the problem is crucial for selecting or designing the most suitable loss function for training the network.
This exercise explores the impact of the following loss functions:
Classification: Binary Cross-Entropy (BCE)
Regression: Mean Squared Error (MSE) and Mean Absolute Error (MAE)
Implement the loss functions defined below using the torch library. Let $N$ be the number of samples in the training set and $y_i$ the true label of the $i$-th sample. Define:
Mean Squared Error (MSE)
$$ \text{MSE}(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 $$where $ \hat{y}_i $ is the predicted value of the $i$-th sample in the training set.
Mean Absolute Error
$$ \text{MAE}(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| $$where $ \hat{y}_i $ is the predicted value of the $i$-th sample in the training set.
Binary Cross-Entropy Loss (BCE)
$$ \text{BCE}(y, \hat{y}) = - \frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] $$where $ \hat{y}_i $ is the predicted probability of the $i$-th sample in the training set.
# Custom Binary Cross-Entropy Loss Function
class BCE_Loss(nn.Module):
def __init__(self):
super(BCE_Loss, self).__init__()
def forward(self, outputs, targets):
epsilon = 1e-12
outputs = torch.clamp(outputs, min=epsilon, max=1-epsilon)
return None #Write your solution here
# Custom Mean Squared Error Loss Function
class MSE_Loss(nn.Module):
def __init__(self):
super(MSE_Loss, self).__init__()
def forward(self, outputs, targets):
return None #Write your solution here
# Custom Mean Absolute Error Loss Function
class MAE_Loss(nn.Module):
def __init__(self):
super(MAE_Loss, self).__init__()
def forward(self, outputs, targets):
return None #Write your solution here
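For reference, the three forward passes reduce to a handful of torch operations. The sketch below expresses them as plain functions (the _ref names are illustrative); the class versions above would simply return the corresponding expression.
import torch
# One possible set of loss expressions (sketch, for reference only).
def bce_ref(outputs, targets, epsilon=1e-12):
    # Clamp as in the class above to avoid log(0), then average the negative log-likelihood.
    outputs = torch.clamp(outputs, min=epsilon, max=1 - epsilon)
    return -torch.mean(targets * torch.log(outputs) + (1 - targets) * torch.log(1 - outputs))
def mse_ref(outputs, targets):
    return torch.mean((outputs - targets) ** 2)              # average squared error
def mae_ref(outputs, targets):
    return torch.mean(torch.abs(outputs - targets))          # average absolute error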
Compare the BCE, MSE and MAE loss functions when applied to a synthetic binary classification problem. Use the implemented loss functions, the true labels and the linspace of predictions in a for loop to plot the loss functions. Discuss why BCE is preferred for classification problems.
import seaborn as sns
import numpy as np               # used for np.linspace below
import matplotlib.pyplot as plt  # used for the figure below
# Instantiate the loss functions
#Write your solution here...
# True label
y_true = torch.tensor([0.0, 1.0]) # True label
# Range of predictions
predictions = torch.tensor(np.linspace(0.01, 0.99, 100))
# Plot using Seaborn
plt.figure(figsize=(8, 4))
for i in y_true:
# Compute BCE and MSE using the defined classes
#bce_values = ...
#mse_values = ...
#mae_values = ...
# Set Seaborn style
sns.set(style="whitegrid")
# Create a DataFrame for plotting
import pandas as pd
data = pd.DataFrame({
'Prediction': predictions,
'Binary Cross-Entropy (BCE)': bce_values,
'Mean Squared Error (MSE)': mse_values,
'Mean Absolute Error (MAE)': mae_values
})
# Plot BCE
sns.lineplot(data=data, x='Prediction', y='Binary Cross-Entropy (BCE)', color='blue', label='Binary Cross-Entropy (BCE)', linewidth=2.5, alpha=0.5)
# Plot MSE
sns.lineplot(data=data, x='Prediction', y='Mean Squared Error (MSE)', color='green', label='Mean Squared Error (MSE)', linewidth=2.5, alpha=0.75)
# Plot MAE
sns.lineplot(data=data, x='Prediction', y='Mean Absolute Error (MAE)', color='orange', label='Mean Absolute Error (MAE)', linewidth=2.5, alpha=0.75)
# Add labels, title, and legend
plt.title('Comparison of BCE, MSE and MAE', fontsize=8)
plt.xlabel('Prediction', fontsize=8)
plt.ylabel('Loss', fontsize=8)
plt.axhline(0, color='black', linewidth=0.5)
plt.axvline(0.5, color='red', linestyle='--', linewidth=1.0, label='Prediction = 0.5')
plt.legend(fontsize=8)
plt.grid(True)
plt.tight_layout()
plt.show()
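For reference, the placeholder lines inside the for loop above could be completed roughly as follows. This is a sketch only: it assumes the three loss classes have been implemented, and the instance names bce_loss, mse_loss and mae_loss are illustrative rather than prescribed.
# Sketch of one possible loop body for the plotting cell above.
bce_loss, mse_loss, mae_loss = BCE_Loss(), MSE_Loss(), MAE_Loss()
# For the current true label i, evaluate each loss per prediction value,
# so that a full curve (rather than a single mean over a batch) can be plotted.
target = torch.full_like(predictions, i.item())
bce_values = [bce_loss(p.unsqueeze(0), t.unsqueeze(0)).item() for p, t in zip(predictions, target)]
mse_values = [mse_loss(p.unsqueeze(0), t.unsqueeze(0)).item() for p, t in zip(predictions, target)]
mae_values = [mae_loss(p.unsqueeze(0), t.unsqueeze(0)).item() for p, t in zip(predictions, target)]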
#Write your reflections here...
The next step is to use the loss functions to train a neural network on the same classification task as before and to show their impact on performance metrics such as accuracy, precision, recall and F1 score.
For this task, the evaluateNN2 function in the metrics.py file will be used. Note the noise parameter in the get_data function and the decision_threshold parameter in the train function.
from torch import optim
from torchvision import transforms
from trainers import *
from networks import *
from metrics import *
X_train_tensor, X_test_tensor, y_train_tensor, y_test_tensor,X_train, X_test, y_train, y_test, X, y = get_data(0.2)
# Define loss functions to test
loss_functions = {
'Binary Cross-Entropy': BCE_Loss(),
'Mean Squared Error': MSE_Loss(),
'Mean Absolute Error': MAE_Loss()
}
results = {}
for name, loss in loss_functions.items():
model, train_losses, accuracy, training_time, decision_threshold = train(SimpleNN(activation=relu), name, X_train_tensor, X_test_tensor, y_train_tensor, y_test_tensor, X_train, X_test, y_train, y_test, 500, loss=loss, decision_threshold=0.1 )
# Store results
results[name] = {
'model': model,
'train_losses': train_losses,
'accuracy': accuracy,
'training_time': training_time,
'decision_threshold': decision_threshold
}
evaluateNN2(results, X, y, X_test_tensor, y_test)
Use the plots to evaluate the performance of the different loss functions and incorporate theoretical concepts to interpret the results.
Explain why MSE and MAE losses may achieve lower final loss values, while BCE delivers comparable or superior accuracy for the classification task. Discuss the distinct characteristics of each loss function and how they relate to classification performance.
Experiment with the noise parameter and explain its impact on accuracy. Relate your explanation to the findings from Task 5.
Modify the decision threshold and analyze its impact on the results (a possible setup for both experiments is sketched below).
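A possible way to set up both experiments is sketched below; it simply re-runs the calls from the training cell above over a few candidate values (the specific noise levels and thresholds are illustrative).
# Sketch only: re-run the training for a few noise levels and decision thresholds.
for noise in [0.1, 0.2, 0.4]:                      # noise level passed to get_data
    (X_train_tensor, X_test_tensor, y_train_tensor, y_test_tensor,
     X_train, X_test, y_train, y_test, X, y) = get_data(noise)
    for threshold in [0.1, 0.3, 0.5]:              # decision threshold passed to train
        model, train_losses, accuracy, training_time, decision_threshold = train(
            SimpleNN(activation=relu), f'BCE, noise={noise}, threshold={threshold}',
            X_train_tensor, X_test_tensor, y_train_tensor, y_test_tensor,
            X_train, X_test, y_train, y_test, 500,
            loss=BCE_Loss(), decision_threshold=threshold)
        print(f'noise={noise}, threshold={threshold}: accuracy={accuracy}')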
#Write your reflections...
Consider the case of iris codes derived from an individual’s iris pattern. The codes are represented as binary vectors, which are matched against a database of authorized codes to determine access. The Hamming Distance (HD) is commonly used as a similarity measure.
$$ HD(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} \begin{cases} 1 & \text{if } x_i \neq y_i \\ 0 & \text{if } x_i = y_i \end{cases} $$
It calculates the number of positions with different values.
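For example, for $\mathbf{x} = (1, 0, 1, 1, 0)$ and $\mathbf{y} = (1, 0, 0, 1, 1)$ the vectors differ in the third and fifth positions, so $HD(\mathbf{x}, \mathbf{y}) = 2$.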
A neural network could potentially be trained to generate synthetic eye images embedding specific iris codes, for example to gain unauthorized access to a system. It may seem logical to design a custom loss function based on the Hamming Distance, encouraging the network to generate patterns that closely match a target iris code. However, this approach is not valid for a loss function in the context of neural network training.
Incorporate the formula of the HD and the nature of Gradient Descent in your discussion.
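To see where the problem arises, consider the small illustration below (not part of the exercise solution). Turning continuous network outputs into bits requires a thresholding step, which is piecewise constant, so the resulting Hamming distance provides no gradient with respect to the outputs.
import torch
# Illustration only: a Hamming-distance "loss" on thresholded outputs yields no gradient.
outputs = torch.tensor([0.3, 0.8, 0.4], requires_grad=True)  # continuous network outputs
target = torch.tensor([0.0, 1.0, 1.0])                       # target iris-code bits
bits = (outputs > 0.5).float()        # hard thresholding: a step function
hd = (bits != target).float().sum()   # Hamming distance between generated and target code
print(hd)                 # tensor(1.) -- one mismatching bit
print(hd.requires_grad)   # False: the comparisons cut the autograd graph,
                          # so gradient descent receives no learning signal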
#Write your reflections here...