🧠 Intro to Deep Learning 1: Loss Functions, Optimizers, Underfitting, Overfitting, Dropout and Batch Normalization

date

May 10, 2023

0) Loss Functions

When we talk about giving knowledge to a Deep Learning model, we are talking about giving the model the ability to change its weights and biases, nothing more and nothing less. With this in mind, the first step is to give the model a metric that measures how good its current knowledge is, so that training knows in which direction and by how much the weights and biases should change. For this task, we use Loss Functions.

Loss Functions are functions that compare the outputs generated by the model with the real outputs observed in the dataset. The bigger the difference between them, the bigger the loss and the stronger the adjustment applied to the weights and biases; when the loss is already close to its minimum, the weights and biases barely change.

Nowadays, there is a considerable number of Loss Functions for the most varied problems, so, for this notebook, we will see the three main ones used to solve Regression Problems: Mean Absolute Error (MAE), Mean Squared Error (MSE) and Huber Loss Function. Let's go deeper into each one!!

- Mean Absolute Error (MAE)

MAE calculates the sum of the absolute values of the differences between the predicted outputs and the real ones, then divides the result by the total number of predictions. Equation:

MAE = sum(abs(predicted_y - real_y)) / number_of_predictions

- Mean Squared Error (MSE)

MSE calculates the sum of the squared differences between the predicted outputs and the real ones, then divides the result by the total number of predictions. It's quite similar to MAE; the difference is that the absolute value is replaced by the power of 2. Equation:

MSE = sum((predicted_y - real_y)**2) / number_of_predictions
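To make the two equations concrete, here is a minimal pure-Python sketch. The function names and the sample values are made up for illustration; in practice Keras computes these losses for you.

```python
def mean_absolute_error(predicted, real):
    """Average of the absolute differences between predictions and targets."""
    return sum(abs(p - r) for p, r in zip(predicted, real)) / len(real)


def mean_squared_error(predicted, real):
    """Average of the squared differences between predictions and targets."""
    return sum((p - r) ** 2 for p, r in zip(predicted, real)) / len(real)


# Made-up values, for illustration only
predicted_y = [2.5, 0.0, 2.0, 8.0]
real_y = [3.0, -0.5, 2.0, 7.0]

print(mean_absolute_error(predicted_y, real_y))  # 0.5
print(mean_squared_error(predicted_y, real_y))   # 0.375
```

The individual errors here are -0.5, 0.5, 0 and 1, so MAE averages 0.5 + 0.5 + 0 + 1 and MSE averages 0.25 + 0.25 + 0 + 1.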

- Huber Loss Function

Huber Loss Function has two equations, chosen for each prediction depending on how its absolute error, abs(predicted_y - real_y), compares to a threshold called delta (a hyperparameter you choose; Keras uses delta = 1.0 by default). Being:

for abs(predicted_y - real_y) <= delta: the prediction contributes 1/2 times the square of the difference between the predicted output and the real one.

loss_i = 1/2 * (predicted_y - real_y)**2

for abs(predicted_y - real_y) > delta: the prediction contributes delta times the absolute difference between the predicted output and the real one minus 1/2 times delta.

loss_i = delta * (abs(predicted_y - real_y) - 1/2 * delta)

After that, the contributions of all predictions are summed and divided by the total number of predictions.

Huber Loss = sum(loss_i) / number_of_predictions

OBS.: observing the equations for a while, it may hit you that the Huber Loss combines MSE and MAE: it behaves like MSE in the first case (small errors) and like MAE in the second case (large errors).
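Here is a minimal pure-Python sketch of the usual Huber definition, assuming a per-prediction threshold named delta (set to 1.0 here). The function name and sample values are made up for illustration.

```python
def huber_loss(predicted, real, delta=1.0):
    """Quadratic penalty for small errors, linear penalty for large ones."""
    total = 0.0
    for p, r in zip(predicted, real):
        error = abs(p - r)
        if error <= delta:
            # small error: behaves like (half of) the squared error
            total += 0.5 * error ** 2
        else:
            # large error: grows linearly, like the absolute error
            total += delta * (error - 0.5 * delta)
    return total / len(real)


# Made-up values; the last prediction is off by 5 on purpose
predicted_y = [2.5, 0.0, 2.0, 12.0]
real_y = [3.0, -0.5, 2.0, 7.0]

print(huber_loss(predicted_y, real_y))  # 1.1875
```

The three small errors contribute 0.125, 0.125 and 0, while the large one contributes 1 * (5 - 0.5) = 4.5 instead of the 12.5 it would get from a purely squared penalty.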

Looking at these equations, you may be like: "What the hell is this? And how am I supposed to memorize them all and know when to use each one?". Don't worry!! The good news is that you don't have to memorize the equations; knowing what they are and what they do is enough. Now, about when to use them, here's a cheat sheet for ya!

- Mean Absolute Error (MAE) and Huber Loss: best when you can tolerate a few significant errors and don't want outliers to dominate the training, like a Deep Learning Model that predicts house prices where the final user always checks the predicted price before taking an action.

- Mean Squared Error (MSE): best when you want to reduce the chance of getting significant errors, since squaring penalizes large errors much more than small ones, like a Deep Learning Model that predicts stock prices where the final user can take huge losses when a prediction goes wrong.
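A quick way to feel the difference in the cheat sheet is to compare both losses on the same errors with and without one outlier. This is only a sketch with made-up error values.

```python
def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)


def mse(errors):
    return sum(e ** 2 for e in errors) / len(errors)


# Made-up prediction errors: the second list swaps one error for an outlier
typical = [1.0, -1.0, 1.0, -1.0]
with_outlier = [1.0, -1.0, 1.0, -10.0]

print(mae(typical), mse(typical))            # 1.0 1.0
print(mae(with_outlier), mse(with_outlier))  # 3.25 25.75
```

A single large error multiplies the MSE by about 26 but the MAE by only about 3, which is why MSE pushes the model harder to avoid big mistakes.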

1) Optimizers

While Loss Functions evaluate our model's results and measure how far off they are, Optimizers tell the model HOW to update the weights and biases!

For this task, the dataset is split up into batches that are processed multiple times in cycles called epochs. Each epoch corresponds to one full pass over all the dataset batches, and the weights and biases are updated after each batch, guided by the loss. This loop goes on until the defined number of epochs is done.
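As a small arithmetic sketch: Keras applies one weight update per batch, so the number of updates follows from the dataset size, the batch size and the number of epochs. The numbers below are the ones used later in this notebook (1119 training rows, batch_size=256, epochs=10); the helper name is made up.

```python
import math


def count_updates(n_rows, batch_size, epochs):
    """Return (batches per epoch, total weight updates)."""
    # the last batch may be smaller than batch_size, hence the ceil
    batches_per_epoch = math.ceil(n_rows / batch_size)
    return batches_per_epoch, batches_per_epoch * epochs


batches_per_epoch, total_updates = count_updates(n_rows=1119, batch_size=256, epochs=10)
print(batches_per_epoch)  # 5
print(total_updates)      # 50
```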

The Optimizer's main goal, in a nutshell, is to minimize the loss, improving the fitted line until it matches the outputs as well as possible, and to reach a point where the weights and biases stay approximately the same after each update.

The image above shows an animation in which, after each update of the fitted line, the loss decreases and the weights and biases go a step further toward the best match!

Another thing to have in mind is the Learning Rate. It's a value that controls the size of each update to the weights and biases: with a small learning rate the model takes small, careful steps and needs more of them to learn, whereas a learning rate that is too large can make the updates overshoot, so the loss jumps around instead of decreasing.
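To see the learning rate at work, here is a minimal sketch of plain gradient descent fitting a line y = w * x + b with MSE, in pure Python. It uses the whole dataset as a single batch for simplicity, and the data, function name and learning rates are made up for illustration.

```python
def train_linear_model(xs, ys, learning_rate, epochs):
    """Fit y = w * x + b by gradient descent on the MSE loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    losses = []
    for _ in range(epochs):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        losses.append(sum(e ** 2 for e in errors) / n)
        # gradients of the MSE with respect to w and b
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # the learning rate scales how far each update moves
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, losses


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]  # the "true" line is y = 2x + 1

w, b, losses = train_linear_model(xs, ys, learning_rate=0.05, epochs=2000)
print(round(w, 3), round(b, 3))  # 2.0 1.0

# A learning rate that is too large makes the loss grow instead of shrink
_, _, diverging = train_linear_model(xs, ys, learning_rate=0.2, epochs=20)
print(diverging[-1] > diverging[0])  # True
```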

I know that all of this is new information for you and it can be confusing at first, but don't worry: take your time, read it as many times as you need, and aim to understand the content rather than to memorize it.

Oh! And I have more great news for ya! In this notebook we will be using one of the best Optimizers, one that handles the majority of problems: the Adam Optimizer. With this one, we usually don't need to tune the Learning Rate by hand, because it adapts the step size of each weight during training and its default learning rate (0.001 in Keras) works well for most problems. Cool, isn't it?
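For the curious, here is a rough pure-Python sketch of the Adam update rule on a single weight. It is not Keras's implementation, just the textbook update with the same default values for beta_1, beta_2 and epsilon; the function name and the toy loss (w - 3)**2 are made up for illustration.

```python
import math


def adam_minimize(grad_fn, w, learning_rate=0.001, beta_1=0.9,
                  beta_2=0.999, epsilon=1e-7, steps=1000):
    """Minimize a one-variable function given its gradient, using Adam."""
    m, v = 0.0, 0.0  # running averages of the gradient and squared gradient
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta_1 * m + (1 - beta_1) * g
        v = beta_2 * v + (1 - beta_2) * g * g
        # bias correction, because m and v start at zero
        m_hat = m / (1 - beta_1 ** t)
        v_hat = v / (1 - beta_2 ** t)
        # the step is scaled by the recent gradient magnitude
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w


# Toy loss (w - 3)**2, whose gradient is 2 * (w - 3); the minimum is at w = 3
w = adam_minimize(lambda w: 2 * (w - 3), w=0.0, learning_rate=0.1, steps=500)
print(round(w, 2))
```

Dividing by the square root of v_hat is what makes the step size adapt: weights with consistently large gradients take relatively smaller steps, and vice versa.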

Yeah my friend, and you thought that you would spend more time programming here, right? You're not wrong: once you get all the basic content, you'll have all the time you want to program. Oh, and talking about programming time, we are gonna see how to create a Deep Learning model and assign it a Loss Function and an Optimizer!

# ---- Importing Libraries and Creating the Model ----

# pip install tensorflow
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # hidden layers >> ReLU as Activation Function
    layers.Dense(512, activation='relu', input_shape=[11]),
    layers.Dense(512, activation='relu'),
    layers.Dense(512, activation='relu'),
    
    # output layer >> Linear as Activation Function
    layers.Dense(1)
])

# ---- Adding a Loss Function and an Optimizer ----
model.compile(optimizer='adam', loss='mse')

# ---- Displaying a Summary of our Model ----
model.summary()

Output:
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 dense (Dense)               (None, 512)               6144

 dense_1 (Dense)             (None, 512)               262656

 dense_2 (Dense)             (None, 512)               262656

 dense_3 (Dense)             (None, 1)                 513

=================================================================
Total params: 531,969
Trainable params: 531,969
Non-trainable params: 0

Our model has 531,969 parameters. Yes, this simple model has half a million parameters! Now you can understand why some AI companies say that their models have millions of params.
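The Param # column can be checked by hand: each Dense layer stores one weight per (input, neuron) pair plus one bias per neuron. A small sketch, with a made-up helper name:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus biases (n_units) of a Dense layer."""
    return n_inputs * n_units + n_units


# 11 input features, three hidden layers of 512 neurons, 1 output neuron
layer_sizes = [11, 512, 512, 512, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(per_layer)       # [6144, 262656, 262656, 513]
print(sum(per_layer))  # 531969
```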

Now, let's read the red-wine.csv dataset located in the datasets folder, train our model with it and display the loss plot!

import pandas as pd
import matplotlib.pyplot as plt

# ---- Reading Dataset ----
red_wine = pd.read_csv('../datasets/red-wine.csv')

# ---- Create Training and Validation Splits ----
df_train = red_wine.sample(frac=0.7, random_state=0)
df_valid = red_wine.drop(df_train.index)

# ---- Scaling to [0, 1] ----
max_ = df_train.max(axis=0)
min_ = df_train.min(axis=0)
df_train = (df_train - min_) / (max_ - min_)
df_valid = (df_valid - min_) / (max_ - min_)

# ---- Split Features and Target ----
X_train = df_train.drop('quality', axis=1)
X_valid = df_valid.drop('quality', axis=1)
y_train = df_train['quality']
y_valid = df_valid['quality']

# ---- Displaying Dataset Info and Head ----
print(f'# of Rows: {df_train.shape[0]}')
print(f'# of Columns: {df_train.shape[1]}')
print('----')
df_train.head()

# of Rows: 1119

# of Columns: 12
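A note on the scaling step above: the minimum and maximum come from the training split only and are then reused for the validation split, so no information from the validation data leaks into training. A tiny sketch with made-up numbers and helper names shows the consequence:

```python
def fit_min_max(values):
    """Learn the scaling statistics from the training values only."""
    return min(values), max(values)


def scale(values, min_, max_):
    """Apply min-max scaling using previously learned statistics."""
    return [(v - min_) / (max_ - min_) for v in values]


# Made-up feature values
train = [4.0, 6.0, 8.0]
valid = [5.0, 9.0]

min_, max_ = fit_min_max(train)
print(scale(train, min_, max_))  # [0.0, 0.5, 1.0]
print(scale(valid, min_, max_))  # [0.25, 1.25]
```

Validation values can land slightly outside [0, 1], and that is fine: the model sees them on the same scale it was trained on.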

# ---- Training the Model ----
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=10,
)

# ---- Plotting the Training Loss ----
history_df = pd.DataFrame(history.history)
history_df['loss'].plot()
history_df['val_loss'].plot()
# history_df.loc[:, ['loss', 'val_loss']].plot();

plt.title('Losses per Epochs')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend(['loss', 'val_loss'])
plt.show();

Notice that the more epochs the model trains, the lower the loss gets. Besides, observe that the loss values between the 6th and the 8th epochs are kind of similar, so it wouldn't be a problem if our model stopped training after the 6th epoch. In reality, it would even be better, because it could decrease the odds of overfitting, the topic we will cover now together with underfitting.

3) Underfitting and Overfitting

When we are training our model, it can learn two things to adjust the weights and biases: signal and noise. Signal is the set of general patterns that really help our model gain good knowledge and make good adjustments, whereas noise is the set of patterns that only exist in the training dataset and, consequently, lead our model to bad knowledge and bad adjustments.

These knowledge problems are called Underfitting and Overfitting, being:

Underfitting - the model does not learn enough signals and both training and validation predictions are poor;

Overfitting - the model learns too much noise, making the training predictions quite good but the validation ones poor.

The image above shows, in a nutshell, what these two problems are. In the Underfitting area, both the training and the validation loss values are high, whereas in the Overfitting area, the training loss is small but the validation loss is too big.

The main goal when training a Deep Learning Model, in order to avoid Underfitting and Overfitting, is to minimize the amount of noise and maximize the amount of signal learned by the model. That is, reaching a point where the loss values for the training and validation steps, and the gap between them, are all small, like the Early Stopping area in the image.
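As a rough rule of thumb, the two situations can be told apart from the final training and validation losses. The sketch below is only illustrative: the function name and both thresholds (high_loss and max_gap) are hypothetical and would have to be chosen for your own problem and loss scale.

```python
def diagnose_fit(train_loss, val_loss, high_loss=0.5, max_gap=0.1):
    """Very rough diagnosis; thresholds are arbitrary, illustration only."""
    if train_loss > high_loss and val_loss > high_loss:
        # both losses are high: not enough signal was learned
        return "underfitting"
    if val_loss - train_loss > max_gap:
        # training looks good but validation lags far behind
        return "overfitting"
    return "good fit"


print(diagnose_fit(train_loss=0.80, val_loss=0.90))  # underfitting
print(diagnose_fit(train_loss=0.05, val_loss=0.60))  # overfitting
print(diagnose_fit(train_loss=0.10, val_loss=0.12))  # good fit
```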

And guess what we are gonna see now? Yeah, techniques to avoid under and overfitting!

- Capacity

Capacity refers to the size and complexity of patterns that the model is able to learn, that is, the number of neurons and hidden layers.

When our model is Underfitting, we increase its capacity, that is, we increase the number of neurons or hidden layers. On the other hand, when our model is Overfitting, we decrease its capacity by decreasing the number of neurons or hidden layers.

Underfitting - increase the number of neurons or the number of hidden layers;

Overfitting - decrease the number of neurons or the number of hidden layers.

Usually, we adjust the number of neurons (making the network wider or narrower) when the relationships between variables are closer to linear, and the number of hidden layers (making the network deeper or shallower) when the relationships are more non-linear.

Adjust Number of Neurons - for Linear Relationships between variables;

Adjust Number of Hidden Layers - for Non-Linear Relationships between variables.

Let's see what this technique would look like in code:

# ---- Simple Model ----
model = keras.Sequential([
    # hidden layers
    layers.Dense(units=512, activation='relu', input_shape=[11])
    , layers.Dense(units=512, activation='relu')
    
    # output layer
    , layers.Dense(units=1)
])

# ---- Capacity Adjustment to Avoid Overfitting ----

# - Linear Relationships between Variables
#
# \ decreased the number of neurons from 512 to 256 in each hidden layer
#
model = keras.Sequential([
    # hidden layers
    layers.Dense(units=256, activation='relu', input_shape=[11])
    , layers.Dense(units=256, activation='relu')

    # output layer
    , layers.Dense(units=1)
])

# - Non-Linear Relationships between Variables
#
# \ removed one hidden layer
#
model = keras.Sequential([
    # hidden layers
    layers.Dense(units=512, activation='relu', input_shape=[11])

    # output layer
    , layers.Dense(units=1)
])

# ---- Capacity Adjustment to Avoid Underfitting ----

# - Linear Relationships between Variables
#
# \ increased the number of neurons from 512 to 1024 in each hidden layer
#
model = keras.Sequential([
    # hidden layers
    layers.Dense(units=1024, activation='relu', input_shape=[11])
    , layers.Dense(units=1024, activation='relu')
    
    # output layer
    , layers.Dense(units=1)
])

# - Non-Linear Relationships between Variables
#
# \ added one more hidden layer
#
model = keras.Sequential([
    # hidden layers
    layers.Dense(units=512, activation='relu', input_shape=[11])
    , layers.Dense(units=512, activation='relu')
    , layers.Dense(units=512, activation='relu')
    
    # output layer
    , layers.Dense(units=1)
])

- Early Stopping

Early Stopping is another technique, but designed to avoid Overfitting only. It makes the model stop training before all epochs are processed, consequently preventing the model from learning noise rather than signal.

But this stop is not random: the technique considers two parameters, min_delta and patience. The first one tells the minimum improvement in the monitored metric (the validation loss by default) that an epoch must bring to count as the model still learning signal, whereas the second one tells how many epochs in a row the model can go without reaching that minimum improvement. When that number of epochs has passed without enough improvement, the training step is stopped early!

Also, there is a third parameter of interest: restore_best_weights. This one tells which weights and biases to keep when the Early Stopping is triggered. When this parameter is True, our model will keep the weights and biases from the epoch with the best monitored value. When this parameter is False, our model will keep the weights and biases from the last epoch of the training step.

Restore Best Weights: True - the model keeps the weights and biases from the best epoch of the training step;

Restore Best Weights: False - the model keeps the weights and biases from the last epoch of the training step.
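To make the interplay between min_delta and patience concrete, here is a simplified pure-Python sketch of the stopping logic. It is not Keras's actual implementation, only the idea behind it, and the function name and validation losses are made up.

```python
def simulate_early_stopping(val_losses, min_delta=0.001, patience=3):
    """Return (epoch where training stops, epoch with the best loss)."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # improved by more than min_delta: remember it and reset the counter
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Made-up validation losses, one per epoch (epochs are counted from 0)
val_losses = [0.50, 0.40, 0.35, 0.3495, 0.36, 0.37, 0.30]
stopped_at, best_epoch = simulate_early_stopping(val_losses, min_delta=0.001, patience=3)
print(stopped_at, best_epoch)  # 5 2
```

Epoch 3 improves by only 0.0005, which is below min_delta, so it counts as no improvement. After three such epochs in a row the training stops at epoch 5, and with restore_best_weights=True the model would go back to the weights of epoch 2. The last value (0.30) is never reached, which is the price of a small patience.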

Let's see how to create an Early Stopping.

# ---- Creating an Early Stopping ----
#
# \ min_delta: minimum improvement in the monitored metric (val_loss by default) to count as the model still learning;
# \ patience: number of epochs without that minimum improvement tolerated before stopping;
# \ restore_best_weights: whether or not to restore the weights from the best epoch when training stops.
#
from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    min_delta=0.001
    , patience=20
    , restore_best_weights=True
)

Now, let's create another model to train with the red-wine.csv dataset and check whether we get better loss values than before.

# ---- Creating the Model, defining an Optimizer and a Loss Function and adding an Early Stopping ----
model = keras.Sequential([
    # hidden layers >> ReLU as Activation Function
    layers.Dense(units=512, activation='relu', input_shape=[11])
    , layers.Dense(units=512, activation='relu')
    , layers.Dense(units=512, activation='relu')
    
    # output layer >> Linear as Activation Function
    , layers.Dense(units=1)
])

model.compile(optimizer='adam', loss='mse')

early_stopping = EarlyStopping(
    min_delta=0.001
    , patience=20
    , restore_best_weights=True
)

# ---- Summarizing the Model ----
model.summary()

Output:
Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 dense_19 (Dense)            (None, 512)               6144

 dense_20 (Dense)            (None, 512)               262656

 dense_21 (Dense)            (None, 512)               262656

 dense_22 (Dense)            (None, 1)                 513

=================================================================
Total params: 531,969
Trainable params: 531,969
Non-trainable params: 0

# ---- Training the Model with Early Stopping and Plotting the Results ----
history = model.fit(
    X_train, y_train
    , validation_data=(X_valid, y_valid)
    , batch_size=256
    , epochs=500
    , callbacks=[early_stopping]
    , verbose=0 # don't log the training steps
)

history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot()
# history_df['loss'].plot()
# history_df['val_loss'].plot()

plt.title('Losses per Epochs')
plt.xlabel('Epochs')
plt.ylabel('Loss')
#plt.legend(['loss', 'val_loss'])
plt.show();

4) Dropout and Batch Normalization Layers

So far we have only seen how to create Deep Learning Models using the Dense Layer; however, and fortunately, there are tons of other layers designed for specific tasks. We will now cover two new layers that will help you to avoid both under and overfitting: Dropout and Batch Normalization.

About the other layers, you can check them out at Keras Documentation.

- Dropout Layer

In the previous topic we talked about how overfitting is caused by the network learning spurious patterns in the training data, that is, learning more noise than signal. To recognize these spurious patterns a network will often rely on very specific combinations of weights, a kind of "conspiracy" of weights. Being so specific, they tend to be fragile: remove one and the conspiracy falls apart.

As a solution to this very problem we have Dropout Layers, which minimize the risk of the model learning noise and make it focus on signal. To do this, they randomly drop out (turn off) a fraction of a layer's neurons at each training step, making it much harder for the network to learn those spurious patterns in the training data!!

The GIF above shows how the drop out process works: a fraction of a layer's neurons (50% in this case) is randomly dropped out at each training step, that is, in a step S1 we can have the neurons X and Y working while the neurons W and Z are dropped out; and in a step S2 we can have the neurons W and Y working while X and Z are dropped out.

Let's see how to implement that in Python!

# ---- Creating a Model with Drop Out Layer ----
#
# - add the Dropout Layer before the layer you want to apply it to
# - the 'rate' param goes from 0 to 1 and sets the fraction of neurons that
# will be randomly dropped out at each training step
#
model = keras.Sequential([
    # hidden layers
    # input_shape goes on the first layer so Keras can build the model right away
    layers.Dropout(rate=0.3, input_shape=[11])
    , layers.Dense(units=512, activation='relu')
    
    , layers.Dropout(rate=0.3)
    , layers.Dense(units=512, activation='relu')
    
    , layers.Dropout(rate=0.3)
    , layers.Dense(units=512, activation='relu')
    
    # output layer
    , layers.Dense(units=1)
])
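To see what a Dropout layer does to the values flowing through it, here is a minimal pure-Python sketch of the "inverted dropout" idea: during training each value is zeroed with probability rate and the survivors are scaled up by 1 / (1 - rate), while at prediction time nothing is dropped. The function name and the activations are made up, and this is a simplification of what Keras does internally.

```python
import random


def dropout(values, rate, training=True, rng=random):
    """Zero each value with probability `rate`; scale the rest to compensate."""
    if not training or rate == 0.0:
        # at prediction time every neuron is kept as-is
        return list(values)
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]


rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [1.0, 2.0, 3.0, 4.0]

step_1 = dropout(activations, rate=0.5, rng=rng)  # one training step
step_2 = dropout(activations, rate=0.5, rng=rng)  # another step, another mask
inference = dropout(activations, rate=0.5, training=False)

print(step_1)
print(step_2)
print(inference)  # [1.0, 2.0, 3.0, 4.0]
```

Each training step draws a fresh random mask, so the network cannot rely on any single neuron always being present.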

- Batch Normalization Layer

During the Data Preprocessing step, you will probably scale your dataset using a scaling strategy, such as StandardScaler, RobustScaler, MinMaxScaler or Normalizer, right? The Batch Normalization layer, AKA batchnorm, does a similar thing, however, inside the network, that is, during the training step.

In a nutshell, this layer takes and normalizes the output of the previous layer before passing it as input to the next one. It can also work as a preprocessor when added as the first layer of a network.
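The core computation can be sketched in a few lines of pure Python: subtract the batch mean, divide by the batch standard deviation, then rescale with two values the layer learns during training (called gamma and beta here). This is a simplified training-time view for a single feature; the function name and the batch values are made up, and at prediction time the real layer uses running averages instead of the current batch statistics.

```python
import math


def batch_normalize(batch, gamma=1.0, beta=0.0, epsilon=1e-3):
    """Normalize one feature across a batch, then scale and shift it."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    # epsilon avoids a division by zero when the variance is tiny
    return [gamma * (x - mean) / math.sqrt(variance + epsilon) + beta
            for x in batch]


# Made-up outputs of the previous layer for a batch of four rows
batch = [10.0, 20.0, 30.0, 40.0]
normalized = batch_normalize(batch)

print([round(x, 3) for x in normalized])
```

After this step the values are centered around 0 with a spread close to 1, whatever scale the previous layer produced.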

Let's take a peek at how to add Batch Normalization layers to a model!

# ---- Creating a Model with Batch Normalization Layer ----
#
# - when added as the first layer, it works as a preprocessor
#
model = keras.Sequential([
    # hidden layer
    # input_shape goes on the first layer so Keras can build the model right away
    layers.BatchNormalization(input_shape=[11])
    , layers.Dense(units=512, activation='relu')
        
    , layers.Dropout(rate=0.3)
    , layers.Dense(units=512, activation='relu')
    
    # output layer
    , layers.Dense(units=1)
])

# ---- Creating a Model with Batch Normalization Layer ----
#
# - when added before a hidden or output layer, it works as a scaler for that very layer's input
#
model = keras.Sequential([
    # hidden layer
    layers.Dense(units=512, activation='relu', input_shape=[11])
        
    , layers.Dropout(rate=0.3)
    , layers.BatchNormalization()
    , layers.Dense(units=512, activation='relu')
    
    # output layer
    , layers.Dense(units=1)
])

# ---- Creating a Model with Batch Normalization Layer ----
#
# - when added between a layer and its activation function, it works as a scaler before the activation step
# - yes, you can define an activation function outside the layer, but it must be added after the layer
#
model = keras.Sequential([
    # hidden layer
    layers.Dense(units=512, activation='relu', input_shape=[11])
        
    , layers.Dropout(rate=0.3)
    , layers.Dense(units=512)
    , layers.BatchNormalization()
    , layers.Activation('relu')
    
    # output layer
    , layers.Dense(units=1)
])