🧠 Intro to Deep Learning 3: Exploring More Activation Functions and Layers

date: May 12, 2023
slug: intro-to-deep-learning-exploring-more-activation-functions-and-layers
status: Published
tags: Deep Learning
summary: Let's add more options to our Activation Functions and Layers Menu!
type: Post
Last updated: May 12, 2023 11:08 PM
👋 What's shaking, guys? In this notebook we will see other examples of Activation Functions and Layers in Keras. Here is the lesson's summary for today:
 
Activation Functions:
- Sigmoid (Logistic);
- Hyperbolic Tangent (Tanh);
- Rectified Linear Unit (ReLU);
- Leaky ReLU;
- Parametric Leaky ReLU (PReLU);
- Exponential Linear Units (ELU);
- Scaled Exponential Linear Unit (SELU).
 
Layers:
- Dense Layer;
- Dropout Layer;
- Batch Normalization Layer (batchnorm);
- Flatten Layer;
- Reshape Layer;
- Permute Layer;
- RepeatVector Layer;
- Lambda Layer;
- Pooling Layer;
- Locally Connected Layer.
 
Let's go!
 

 
# pip install tensorflow
from tensorflow import keras
from tensorflow.keras import layers
 

 

0) Activation Functions

 
Before diving into the Activation Functions, let's go over a few essential terms for better understanding:
 
Activation Functions - functions applied to a neuron's weighted sum (inputs times weights, plus bias) to produce its output, introducing non-linearity into the network;
 
Activation Function Derivatives - the slope of the activation function at a given input; backpropagation multiplies by these slopes, so they control how strongly the weights and biases get adjusted;
 
Convergence Rate - how quickly training approaches a good solution (not to be confused with the Learning Rate, which is the step size of each weight update);
 
Gradient - the slope of the loss with respect to each weight and bias; Optimizers, such as Adam, follow it to adjust the weights and biases. A minimal sketch of a single update step follows below.
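To make the last two terms concrete, here is a minimal sketch (plain Python, with a made-up toy loss just for illustration) of one gradient descent step on a single weight: the gradient is the slope of the loss at the current weight, and the learning rate controls how big a step the optimizer takes along it.
 
# ---- One gradient descent step (illustrative sketch) ----
w = 3.0                      # current weight
learning_rate = 0.1          # step size of each update

# toy loss: L(w) = (w - 1)**2, whose gradient is dL/dw = 2 * (w - 1)
gradient = 2 * (w - 1)       # slope of the loss at w = 3.0  ->  4.0

w = w - learning_rate * gradient   # the optimizer moves w downhill
print(w)                     # 2.6, one step closer to the minimum at w = 1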
 

 
- Sigmoid (Logistic)
 
As seen in the previous notebook, the Sigmoid (Logistic) Activation Function is used for Classification Problems, squashing the outputs into a scale from 0 to 1. The equation is given below (consider e as Euler's number, approximately equal to 2.718, and x as the input):
 
sigmoid(x) = 1 / (1 + e**-x)
 
[Image: plot of the Sigmoid function and its derivative]
 
Some problems in using this Activation Function:
 
Vanishing Gradient - looking at the function plot, you can see that when inputs become very negative or very positive, the function saturates at 0 or 1, with a derivative extremely close to 0, so the weights and biases may barely be adjusted anymore. There is almost no gradient left to propagate back through the network, so almost nothing reaches the lower layers;
 
Computationally Expensive - the function has an exponential operation, so the larger the dataset, the longer the training step takes;
 
The Output is not Zero Centered - the sigmoid's output is always positive and centered around 0.5 (sigmoid(0) = 0.5) rather than around 0, which can slow down training.
 
Code:
 
# ---- Sigmoid Function Declaration ----
model = keras.Sequential([
    layers.Dense(units=512, activation='sigmoid', input_shape=[11])
])
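To see the vanishing gradient in numbers, here is a small NumPy sketch (assuming NumPy is installed) that evaluates the sigmoid and its derivative: for large positive or negative inputs the derivative is essentially 0, so almost no gradient flows back.
 
# ---- Sigmoid saturation sketch (NumPy) ----
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

for x in [-10, -2, 0, 2, 10]:
    print(x, round(sigmoid(x), 4), round(sigmoid_derivative(x), 4))
# the derivative peaks at 0.25 when x = 0 and is ~0 at x = ±10, so the gradient vanishes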
 

 
- Hyperbolic Tangent (Tanh)
 
Hyperbolic Tangent, AKA Tanh, is applied in Classification Problems, transforming the outputs into a scale from -1 to 1, and it is a good alternative to the Sigmoid Function in some cases. Its equation is given below:
 
tanh(x) = (e**x - e**-x) / (e**x + e**-x)
 
[Image: plot of the Tanh function and its derivative]
 
Some problems in using this Activation Function - since it is similar to the Sigmoid one, the problems are the same, except that the output is zero centered:
 
Vanishing Gradient - looking at the function plot, you can see that when inputs become very negative or very positive, the function saturates at -1 or 1, with a derivative extremely close to 0, so the weights and biases may barely be adjusted anymore. There is almost no gradient left to propagate back through the network, so almost nothing reaches the lower layers;
 
Computationally Expensive - the function has exponential operations, so the larger the dataset, the longer the training step takes.
 
Code:
 
# ---- Tanh Function Declaration ----
model = keras.Sequential([
    layers.Dense(units=512, activation='tanh', input_shape=[11])
])
 

 
OBS.: when choosing between the Sigmoid and Tanh Functions, note that the gradient of Tanh is four times greater than Sigmoid's (at x = 0, Tanh's derivative is 1 while Sigmoid's is 0.25). This means that using Tanh results in larger gradients during training and larger updates to the weights and biases of our model. So, if you want strong gradients and big learning steps, you should use the Tanh Activation Function; otherwise, Sigmoid would be the perfect match!
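You can check that factor of four directly. A quick NumPy sketch, using the known derivatives tanh'(x) = 1 - tanh(x)**2 and sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)):
 
# ---- Tanh vs Sigmoid gradient at x = 0 (NumPy) ----
import numpy as np

x = 0.0
sig = 1 / (1 + np.exp(-x))

tanh_grad = 1 - np.tanh(x) ** 2      # 1.0
sigmoid_grad = sig * (1 - sig)       # 0.25

print(tanh_grad, sigmoid_grad, tanh_grad / sigmoid_grad)   # 1.0  0.25  4.0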
 

 
- Rectified Linear Unit (ReLU)
 
You've already seen this function, right? ReLU is the most popular Activation Function and it works like this: when the input is negative, it returns 0, and when the input is positive, it returns the input value. That's it, simple! Equation:
 
relu(x) = 0 if x <= 0 else x
 
[Image: plot of the ReLU function and its derivative]
 
Some problems in using this Activation Function (yeah, even though it works great in most applications, it's not as perfect as we might think 😥):
 
Dying ReLU - during training, some neurons effectively die, meaning they stop outputting anything other than 0. In some cases, you may find that half of your network's neurons are dead, especially if you used a large Learning Rate. A neuron dies when its weights and bias get tweaked in such a way that the weighted sum of its inputs is negative for all instances in the training set. When this happens, it just keeps outputting 0s, and Gradient Descent does not affect it anymore, since the gradient of the ReLU function is 0 when its input is negative (Hands-on Machine Learning [2], page 329). A small sketch after the code below illustrates this.
 
Code:
 
# ---- ReLU Function Declaration ----
model = keras.Sequential([
    layers.Dense(units=512, activation='relu', input_shape=[11])
])
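Here is a tiny NumPy sketch of why a neuron "dies": once its weighted sum is negative for every training example, both the output and the gradient of ReLU are 0, so Gradient Descent has nothing left to update with.
 
# ---- Dying ReLU sketch (NumPy) ----
import numpy as np

relu = lambda z: np.maximum(0, z)
relu_grad = lambda z: (z > 0).astype(float)

# weighted sums of a "dead" neuron: negative for every example in the batch
z = np.array([-3.2, -0.7, -5.1, -0.1])

print(relu(z))        # [0. 0. 0. 0.] -> the neuron only outputs zeros
print(relu_grad(z))   # [0. 0. 0. 0.] -> no gradient, so its weights never recover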
 

 
- Leaky ReLU
 
Now here's something quite interesting: when Data Scientists stumbled upon Dying ReLU, they wondered: "By any chance, is there any way to avoid it for good?", and they came up with a yes!! to the question. Yeah, this is the short story of how Leaky ReLU was created. It works like ReLU, with the difference that Dying ReLU can never happen with it. Equation:
 
leaky_relu(x) = max(alpha*x, x)
 
Where α (alpha) is the hyperparameter that controls how much the function leaks, that is, it's the slope of the function for x <= 0 and is typically set to 0.01. This small slope ensures that the function never dies.
 
[Image: plot of the Leaky ReLU function and its derivative]
 
Some problems:
 
Hyperparameter α (Alpha) - since the α hyperparameter must be set by the Data Scientist, it may take several tries to find the best value.
 
Code:
 
# ---- Leaky ReLU Function Declaration ----
from tensorflow.keras.layers import LeakyReLU

leaky_relu = LeakyReLU(alpha=0.01)

model = keras.Sequential([
    layers.Dense(units=512, activation=leaky_relu, input_shape=[11])
])
 

 
- Parametric Leaky ReLU (PReLU)
 
The Parametric Leaky ReLU (PReLU) is similar to Leaky ReLU, the hyperparameter alpha being the only difference. While in Leaky ReLU the Data Scientist must set its value, in PReLU there's no need for that, because the network learns alpha along with the weights and biases.
 
With this in mind, you'll notice that the equation, function plot and derivative plot are the same as Leaky ReLU's.
 
Some problems:
 
Longer Training Step - since alpha is learned along with the weights and biases, the training step takes more time!
 
Code:
 
# ---- PReLU Function Declaration ----
from tensorflow.keras.layers import PReLU

prelu = PReLU()

model = keras.Sequential([
    layers.Dense(units=512, activation=prelu, input_shape=[11])
])
 

 
- Exponential Linear Unit (ELU)
 
The Exponential Linear Unit (ELU) is a variation of ReLU with a smooth, non-zero output for x <= 0. Equation:
 
elu(x) = alpha * (e**x - 1) if x <= 0 else x
 
[Image: plot of the ELU function and its derivative]
 
Problems:
 
Slower Computation - ELU is slower to compute than ReLU and its variants PReLU and Leaky ReLU due to the use of the exponential function, but during training this is compensated by the faster convergence. However, at test time, an ELU network will be slower than a ReLU network.
 
Code:
 
# ---- ELU Function Declaration ----
model = keras.Sequential([
    layers.Dense(units=512, activation='elu', input_shape=[11])
])
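As a reference, here is a direct NumPy translation of the equation above, with the default alpha = 1.0 that Keras uses for 'elu':
 
# ---- ELU as a NumPy function ----
import numpy as np

def elu(x, alpha=1.0):
    # x for positive inputs, a smooth exponential curve towards -alpha for negative inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

print(elu(np.array([-10.0, -1.0, 0.0, 2.0])).round(3))   # [-1.    -0.632  0.     2.   ]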
 

 
- Scaled Exponential Linear Unit (SELU)
 
The Scaled Exponential Linear Unit (SELU) is another variation of ReLU that is built on the following premise: "if you build a neural network composed exclusively of a stack of Dense Layers and if all hidden layers use the SELU Activation Function, then the network will self-normalize (the output of each layer will tend to preserve mean 0 and standard deviation 1 during training, which resolves the vanishing/exploding gradient problem).".
 
In a nutshell, if you use SELU as the Activation Function in the hidden Dense Layers of your model, you likely won't need Batch Normalization Layers, because the outputs will already be normalized by the layers themselves.
 
Besides, SELU often outperforms the other ReLU variants.
Equation:
 
selu(x) = scale * x if x > 0 else scale * alpha * (e**x - 1)
 
[Image: plot of the SELU function and its derivative]
 
Some Problems:
 
Dense Layers Exclusivity - the self-normalization guarantee only holds for a neural network composed exclusively of a stack of Dense Layers. It might not work for Convolutional Neural Networks;
 
LeCun Normal Initialization - every hidden layer's weights must also be initialized using LeCun Normal Initialization (kernel_initializer='lecun_normal');
 
Input Features Standardization - the input features must be standardized with mean 0 and standard deviation 1. For this, you can use StandardScaler (see the sketch after the code below).
 
Code:
 
# ---- SELU Function Declaration ----
model = keras.Sequential([
    layers.Dense(units=512, activation='selu', kernel_initializer='lecun_normal', input_shape=[11])
])
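For that third requirement, here is a minimal sketch of standardizing the input features with scikit-learn's StandardScaler before feeding them to the SELU network (it assumes scikit-learn is installed and that X_train / X_valid are your own feature arrays):
 
# ---- Standardizing inputs for a SELU network (scikit-learn) ----
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit the mean/std on the training set only
X_valid_scaled = scaler.transform(X_valid)       # reuse the same mean/std for the validation set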
 

 
After reading about all these seven Activation Functions, you may be wondering: "How the hell am I supposed to know which Activation Function to use in each case?". Here's a cheatsheet from the Hands-on Machine Learning book, page 332:
 
In general choose SELU > ELU > Leaky ReLU (and its variants) > ReLU > Tanh > Sigmoid/Logistic;
 
If the network’s architecture prevents it from self-normalizing, then ELU may perform better than SELU (since SELU is not smooth at x = 0);
 
If you care a lot about runtime latency, then you may prefer Leaky ReLU;
 
If you don’t want to tweak yet another hyperparameter, you may just use the default α values used by Keras (e.g., 0.3 for the leaky ReLU);
 
If you have spare time and computing power, you can use cross-validation to evaluate other activation functions, in particular, RReLU if your network is overfitting, or PReLU if you have a huge training set.
 
Also, remember:
 
Tanh and Logistic/Sigmoid Activation Functions are used for Binary Classification Problems;
 
Use the Tanh Function rather than Sigmoid when you want stronger gradients and big learning steps.
 

 

1) Layers

 
Now, let's see the Layers!!
 

 
- Dense Layer
 
Dense Layers are the most common layers used in Deep Learning Models. Each of their neurons receives as input the outputs from the previous layer and performs a dot product of all input values with its weights (plus a bias) to obtain its output; a small NumPy sketch after the code below shows this computation.
 
[Image: basic structure of a Dense Layer]
 
Code:
 
# ---- Dense Layer Declaration ----
model = keras.Sequential([
    layers.Dense(units=512, activation='selu', kernel_initializer='lecun_normal', input_shape=[11])
])
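Here is the dot product written out by hand: a minimal NumPy sketch of what a single Dense layer computes for one batch (the shapes and the ReLU activation are just for illustration).
 
# ---- What a Dense Layer computes (NumPy sketch) ----
import numpy as np

x = np.random.rand(4, 11)     # batch of 4 examples with 11 features each
W = np.random.rand(11, 512)   # one weight column per neuron
b = np.zeros(512)             # one bias per neuron

relu = lambda z: np.maximum(0, z)
output = relu(x @ W + b)      # dot product of inputs and weights, plus bias, then the activation
print(output.shape)           # (4, 512)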
 

 
- Dropout Layer
 
Dropout Layers, as seen previously, help to avoid Overfitting by dropping out (turning off) a fraction of the units of a layer at each training step.
 
Also, in Keras the Dropout layer is placed before the layer whose inputs you want to drop.
 
[Image: basic structure of a Dropout Layer]
 
Code:
 
# ---- Dropout Layer Declaration ----
#
# \ rate >> fraction of the input units to be dropped out at each training step;
# \ seed >> random_state for reproducibility
#
model = keras.Sequential([
    layers.Dropout(rate=0.3, seed=5296, input_shape=[11])
    , layers.Dense(units=512, activation='relu')
])
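A quick way to see dropout in action is to call a Dropout layer directly: during training roughly rate of the values are zeroed (and the survivors are scaled up to compensate), while at inference time the input passes through unchanged.
 
# ---- Dropout in action ----
import numpy as np

dropout = layers.Dropout(rate=0.3, seed=5296)
x = np.ones((1, 10), dtype="float32")

print(dropout(x, training=True).numpy())    # ~30% of the values become 0, the rest are scaled by 1/0.7
print(dropout(x, training=False).numpy())   # all ones: dropout is disabled at inference time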
 

 
- Batch Normalization Layer (batchnorm)
 
Batch Normalization Layer (batchnorm) is used to normalize the data, much like the scalers used during preprocessing, but inside the model during the training step.
 
In a nutshell, this layer takes the output of the previous layer and normalizes it before passing it as input to the next one. When added as the first layer of a network, it works as a preprocessor.
 
[Image: basic structure of a Batch Normalization Layer]
 
Code:
 
# ---- Batch Normalization Layer Declaration ----
model = keras.Sequential([
    # batchnorm on the raw inputs (works as a preprocessor)
    layers.BatchNormalization(input_shape=[11])
    , layers.Dense(units=512, activation='relu')

    # batchnorm on the hidden layer's output
    , layers.Dense(units=512, activation='relu')
    , layers.BatchNormalization()

    # output layer
    , layers.Dense(units=1)
])
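To see the normalization itself, you can call a BatchNormalization layer directly on a batch of random data (a small sketch): with training=True it uses the batch statistics, so the outputs end up with roughly mean 0 and standard deviation 1.
 
# ---- Batch Normalization in action ----
import numpy as np

batchnorm = layers.BatchNormalization()
x = np.random.normal(loc=10.0, scale=5.0, size=(256, 11)).astype("float32")

y = batchnorm(x, training=True).numpy()
print(x.mean().round(2), x.std().round(2))   # roughly 10 and 5
print(y.mean().round(2), y.std().round(2))   # roughly 0 and 1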
 

 
- Flatten Layer
 
Flatten Layers change the shape of the input into a single dimension (per example). For instance, consider an input of shape [3, 3]: after being processed by the layer, its shape will be [9].
 
[Image: basic structure of a Flatten Layer]
 
Code:
 
# ---- Flatten Layer Declaration ----
model = keras.Sequential([
    layers.Flatten(input_shape=[3, 3])   # each [3, 3] input becomes a flat vector of 9 values
    
    , layers.Dense(units=512, activation='relu')
    , layers.Dense(units=1)
])
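You can verify the [3, 3] → [9] example by calling a Flatten layer directly:
 
# ---- Flatten shape check ----
import numpy as np

flatten = layers.Flatten()
print(flatten(np.ones((1, 3, 3))).shape)   # (1, 9): each [3, 3] input becomes a flat vector of 9 values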
 

 
- Reshape Layer
 
Like Flatten Layers, Reshape Layers change the shape of the input, but instead of flattening it into one dimension, they transform it into the target shape you pass as a parameter (the total number of values must stay the same).
 
 
Code:
 
# ---- Reshape Layer Declaration ----
model = keras.Sequential([
    layers.Dense(units=16, activation='selu', kernel_initializer='lecun_normal', input_shape=[8, 8])   # output shape: [8, 16]
    
    , layers.Reshape([16, 8])   # same 128 values, rearranged into shape [16, 8]
    , layers.Dense(units=1)
])
 

 
- Permute Layer
Permute Layers also rearrange the input, but instead of passing a new shape, you pass the new order of the dimensions (a permutation of the dimension indices). For example, Permute([2, 1]) swaps the first and second dimensions (the batch dimension is left untouched).
 
 
Code:
 
# ---- Permute Layer Declaration ----
model = keras.Sequential([
    layers.Dense(units=16, activation='selu', kernel_initializer='lecun_normal', input_shape=[4, 2])   # output shape: [4, 16]
    
    , layers.Permute([2, 1])   # swaps the two dimensions: [4, 16] -> [16, 4]
    , layers.Dense(units=1)
])
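A quick shape check makes the reordering clear:
 
# ---- Permute shape check ----
import numpy as np

permute = layers.Permute([2, 1])
print(permute(np.ones((1, 4, 2))).shape)   # (1, 2, 4): the two non-batch dimensions are swapped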
 

 
- Repeat Vector Layer
Repeat Vector Layers repeat their input n times, where n is the integer passed as the parameter, turning an output of shape [features] into one of shape [n, features].
 
 
Code:
 
# ---- Repeat Vector Layer Declaration ----
model = keras.Sequential([
    layers.Dense(units=16, activation='selu', kernel_initializer='lecun_normal', input_shape=[4])   # output shape: [16]
    
    , layers.RepeatVector(3)   # repeats the vector 3 times: [16] -> [3, 16]
    , layers.Dense(units=1)
])
 

 
- Lambda Layer
 
Lambda Layers transform the inputs by applying an arbitrary function to them. These layers can take four parameters:
 
Function - the function that will be applied to the input (required);
 
Output Shape - the shape of the transformed input (optional);
 
Mask - the mask to be applied (optional);
 
Arguments - optional keyword arguments to pass to the function, as a dictionary (optional).
 
 
Code:
 
# ---- Lambda Layer Declaration ----
model = keras.Sequential([
    layers.Dense(units=512, activation='relu', input_shape=[2, 2])
    
    , layers.Lambda(lambda x: x**2)
    , layers.Dense(units=1)
])
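The Arguments parameter from the list above can be used to pass extra values into the function. A small sketch (the 'power' name is just an example):
 
# ---- Lambda Layer with the arguments parameter ----
square = layers.Lambda(lambda x, power: x ** power, arguments={'power': 2})

model = keras.Sequential([
    layers.Dense(units=512, activation='relu', input_shape=[2, 2])
    , square
    , layers.Dense(units=1)
])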
 

 
- Pooling Layer
Since we'll see this layer in the Computer Vision notebooks, we won't go deeper into it here. Just keep in mind that (1D) Pooling Layers apply Max Pooling operations on temporal data and take two main parameters:
 
Pool Size - the size of the max pooling window;
 
Strides - the step between successive pooling windows (the downscaling factor).
 
Code:
 
# ---- Pooling Layer Declaration ----
model = keras.Sequential([
    layers.Dense(units=512, activation='relu', input_shape=[2, 2])
    
    , layers.MaxPooling1D(
        pool_size=2
        , strides=None
        , padding='valid'
        , data_format='channels_last'
    )
    
    , layers.Dense(units=1)
])
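A tiny sketch of what pool_size=2 does on temporal data: the layer slides a window of 2 time steps over the sequence and keeps only the maximum of each window.
 
# ---- MaxPooling1D in action ----
import numpy as np

x = np.array([[[1.], [3.], [2.], [5.]]], dtype="float32")   # shape (batch=1, steps=4, features=1)
pooling = layers.MaxPooling1D(pool_size=2)

print(pooling(x).numpy().ravel())   # [3. 5.]: the max of each window of 2 time steps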
 

 
- Locally Connected Layer
 
This one will also be seen in the Computer Vision notebooks, so just keep in mind that Locally Connected Layers have similar functionality to Conv1D Layers; the difference lies in how the weights are used. In Conv1D Layers the weights are shared across positions, whereas in Locally Connected Layers they are not:
 
Conv1D Layers - weights are shared;
 
Locally Connected Layers - weights are not shared.
 
Code:
 
# ---- Locally Connected Layer Declaration ----
model = keras.Sequential([
    layers.LocallyConnected1D(16, 3, input_shape=[10,8])
    , layers.LocallyConnected1D(8, 3)
    , layers.Dense(units=1)
])
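One way to see the shared vs. unshared weights difference is to compare trainable parameter counts. A short sketch (it assumes your TensorFlow version still ships LocallyConnected1D, which may not be the case in newer releases):
 
# ---- Conv1D vs LocallyConnected1D parameter counts ----
conv_model = keras.Sequential([layers.Conv1D(16, 3, input_shape=[10, 8])])
local_model = keras.Sequential([layers.LocallyConnected1D(16, 3, input_shape=[10, 8])])

print(conv_model.count_params())    # shared weights: one kernel reused at every position
print(local_model.count_params())   # unshared weights: a separate kernel per output position, so many more parameters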