What are activation functions and why do we need them?
- By Lakshay Wadhwa, CoffeeBeans Consulting
Activation functions are functions used in artificial neural networks to capture the complexities in the data. A neural network without activation functions is essentially a linear regression model, because a stack of purely linear layers collapses into a single linear transformation. The activation function applies a non-linear transformation to its input, which lets the network learn and perform more complex tasks. We introduce non-linearity in each layer through activation functions.
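To make the "just a linear model" point concrete, here is a minimal pure-Python sketch. The weights, the input and the helper names (`matmul`, `relu_matrix`) are made up for illustration, and biases are left out for brevity.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def relu_matrix(m):
    """Apply ReLU element-wise to a matrix."""
    return [[max(0.0, v) for v in row] for row in m]

# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output.
W1 = [[1.0, -2.0],
      [0.5, 1.0]]
W2 = [[1.0, 1.0]]
x = [[1.0], [2.0]]  # input as a column vector

# Without an activation, W2 (W1 x) equals (W2 W1) x: a single linear layer.
two_layer_linear = matmul(W2, matmul(W1, x))
collapsed = matmul(matmul(W2, W1), x)

# With ReLU between the layers, the result differs from the collapsed form.
two_layer_relu = matmul(W2, relu_matrix(matmul(W1, x)))

print(two_layer_linear, collapsed, two_layer_relu)
```

The first two results agree, while the ReLU version does not, which is the extra expressive power the activation provides.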
Let us assume there are 3 hidden layers, 1 input and 1 output layer.
W1-Weight matrix between Input layer and first hidden layer
W2-Weight matrix between first hidden layer and second hidden layer
W3-Weight matrix between second hidden layer and third hidden layer
W4-Weight matrix between third hidden layer and output layer
The equations below represent a feedforward neural network.
If we stack multiple layers, we can write the output layer as a composite function:
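One plausible way to write these equations is sketched below. It assumes an input vector $x$, a single activation function $f$ applied element-wise in every layer, hidden outputs $h_1, h_2, h_3$, a network output $y$, and no bias terms. These symbols are assumptions made for illustration; only $W_1, \dots, W_4$ are defined above.

```latex
\begin{aligned}
h_1 &= f(W_1 x) \\
h_2 &= f(W_2 h_1) \\
h_3 &= f(W_3 h_2) \\
y   &= f(W_4 h_3) \\[4pt]
y   &= f\bigl(W_4 \, f\bigl(W_3 \, f\bigl(W_2 \, f(W_1 x)\bigr)\bigr)\bigr)
\end{aligned}
```

If $f$ were the identity function, the last line would reduce to $y = W_4 W_3 W_2 W_1 x = W x$ for a single matrix $W$, which is a linear model.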
What are the ideal qualities of an activation function?
1. Non-linearity:
The activation function generally introduces non-linearity in the network to capture the complex relations between input features and output variable/class.
2. Continuously differentiable:
The activation function needs to be differentiable because neural networks are generally trained with gradient descent or other gradient-based optimization methods, which require the derivative of each layer's output.
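As a small illustration of what differentiability means in practice, the sketch below compares one-sided finite-difference slopes at zero for sigmoid and ReLU. The helper `one_sided_slopes` and the step size `h` are illustrative choices, not part of any library.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def one_sided_slopes(f, x, h=1e-6):
    """Return the (left, right) finite-difference slopes of f at x."""
    left = (f(x) - f(x - h)) / h
    right = (f(x + h) - f(x)) / h
    return left, right

sig_left, sig_right = one_sided_slopes(sigmoid, 0.0)
relu_left, relu_right = one_sided_slopes(relu, 0.0)

# Sigmoid: both slopes agree, so the derivative exists at 0.
print(f"sigmoid slopes at 0: {sig_left:.4f}, {sig_right:.4f}")
# ReLU: the slopes disagree, so there is a kink at 0.
print(f"relu slopes at 0:    {relu_left:.4f}, {relu_right:.4f}")
```

ReLU is therefore not differentiable at exactly zero. In practice a fixed value (usually 0) is used for the gradient at that point, which is why it still works with gradient-based training.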
3. Zero centered:
Zero-centered activation functions make sure that the mean activation value is around 0. This is important because convergence is usually faster on normalized data. Of the commonly used activation functions explained below, some are zero-centered and some are not. When an activation function is not zero-centered, we usually add normalization layers such as batch normalization to mitigate the issue.
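A quick way to see the difference is to average the activations over inputs that are symmetric around zero. The input values below are an arbitrary illustrative choice, not real data.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Inputs symmetric around zero (illustrative values only).
inputs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]

sigmoid_mean = sum(sigmoid(v) for v in inputs) / len(inputs)
tanh_mean = sum(math.tanh(v) for v in inputs) / len(inputs)
relu_mean = sum(max(0.0, v) for v in inputs) / len(inputs)

print(f"sigmoid mean: {sigmoid_mean:.3f}")  # close to 0.5, not zero-centered
print(f"tanh mean:    {tanh_mean:.3f}")     # close to 0, zero-centered
print(f"relu mean:    {relu_mean:.3f}")     # positive, not zero-centered
```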
4. Computational expense should be low:
Activation functions are used in every layer of the network and are evaluated many times during training and inference, so they should be cheap to compute.
5. Should not kill gradients:
Activation functions like sigmoid have a saturation problem: the output barely changes for large negative and large positive inputs.
In those regions the derivative of the sigmoid function becomes very small, which prevents the weights in the early layers from being updated during backpropagation, so the network does not learn effectively. An ideal activation function should not suffer from this issue.
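The saturation effect can be checked numerically. The sketch below uses the standard identity that the derivative of the sigmoid is sigmoid(x) * (1 - sigmoid(x)). The five-layer bound ignores the weights, which is a simplifying assumption made only for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-10.0, -5.0, 0.0, 5.0, 10.0]:
    print(f"x = {x:6.1f}   sigmoid'(x) = {sigmoid_grad(x):.6f}")

# The derivative peaks at x = 0 with value 0.25, so chaining five sigmoid
# layers scales the gradient by at most 0.25 ** 5 (weights ignored).
max_grad = sigmoid_grad(0.0)
bound_after_5_layers = max_grad ** 5
print(f"upper bound after 5 layers: {bound_after_5_layers:.6f}")
```

Even in the best case the gradient shrinks quickly with depth, and for saturated inputs it is far smaller still.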
Most commonly used activation functions:
In this section, we will go over different activation functions.
1. Sigmoid function-
The sigmoid function is defined as: $\sigma(x) = \frac{1}{1 + e^{-x}}$
The sigmoid function has a characteristic “S”-shaped curve, with a domain of all real numbers and an output between 0 and 1. An undesirable property is that the activation of the neuron saturates at either 0 or 1 when its input is a large negative or large positive value. It is also not zero-centered, which makes learning more difficult. In most cases, it is better to use the tanh activation function instead of the sigmoid activation function.
2. Tanh function-
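The tanh function is defined as $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$. Tanh has one main advantage over the sigmoid function: it is zero-centered, with values bounded between -1 and 1. Like sigmoid, it still saturates for large positive and large negative inputs.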
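Tanh and sigmoid are closely related: tanh is a rescaled and shifted sigmoid, tanh(x) = 2 * sigmoid(2x) - 1. The sketch below checks this identity numerically. The function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    """tanh written as a rescaled, shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(f"x = {x:5.1f}  tanh = {math.tanh(x):+.6f}  "
          f"via sigmoid = {tanh_via_sigmoid(x):+.6f}")
```

Because the shapes are the same, tanh inherits the saturation behaviour: `tanh_grad` is 1 at zero but shrinks towards 0 for large positive or negative inputs.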
3. ReLU (Rectified Linear Unit)-
ReLU, defined as $f(x) = \max(0, x)$, is one of many activation functions that are not zero-centered, yet despite this disadvantage it is still widely used because of its advantages. It is computationally very inexpensive and does not saturate for positive inputs, so it is much less prone to the vanishing gradient problem. The ReLU function has no upper limit, so activations can grow very large. For negative inputs, on the other hand, the activation is 0, so it completely ignores nodes with negative values. As a result, it suffers from the “dying ReLU” problem.
Dying ReLU problem: During backpropagation, the weights and biases of some neurons are not updated, because both the activation and its gradient are zero for negative inputs. This can create dead neurons that never get activated.
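The sketch below illustrates a dead neuron with a single hypothetical unit z = w * x + b. The weight, bias and inputs are made-up values chosen so that z is negative for every input.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Common convention: the gradient at exactly 0 is taken to be 0.
    return 1.0 if x > 0 else 0.0

# Hypothetical neuron whose bias has drifted to a large negative value.
w, b = 0.5, -10.0
inputs = [0.0, 1.0, 2.0, 5.0, 10.0]

pre_activations = [w * x + b for x in inputs]
outputs = [relu(z) for z in pre_activations]

# The gradient of the output with respect to w is relu'(z) * x.
# It is zero for every input, so gradient descent never moves w.
weight_grads = [relu_grad(z) * x for z, x in zip(pre_activations, inputs)]

print(outputs)
print(weight_grads)
```

Every output and every weight gradient is zero, so no update can bring this neuron back to life for these inputs.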
4. Leaky ReLU-
Leaky ReLU is an activation function based on ReLU, with a small slope for negative values instead of zero: $f(x) = x$ for $x > 0$ and $f(x) = \alpha x$ otherwise.
Here, alpha is generally set to 0.01. The small negative slope solves the “dying ReLU” problem. Alpha is kept small and is not set close to 1, because with alpha equal to 1 the function would simply be linear.
If alpha is learned during training as a parameter, rather than fixed as a hyperparameter, the function becomes PReLU, or parametric ReLU.
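A minimal sketch of Leaky ReLU and its gradient follows. The default `alpha=0.01` matches the value mentioned above. The helper `prelu_alpha_grad` is an illustrative name for the derivative with respect to alpha, which is what a PReLU-style layer would use to learn alpha.

```python
def leaky_relu(x, alpha=0.01):
    """x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Gradient with respect to x: never exactly zero for alpha > 0."""
    return 1.0 if x > 0 else alpha

def prelu_alpha_grad(x):
    """Gradient of the output with respect to alpha (used when alpha is learned)."""
    return 0.0 if x > 0 else x

for x in [-5.0, -1.0, 0.5, 3.0]:
    print(f"x = {x:5.1f}  f(x) = {leaky_relu(x):+.3f}  f'(x) = {leaky_relu_grad(x)}")
```

Because the gradient on the negative side is alpha rather than 0, a neuron with negative pre-activations still receives a small update.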
5. Clipped ReLU-
This version of the ReLU function is basically a ReLU that is capped on the positive side. ReLU6, which caps the output at 6, is a common example.
Capping keeps the activation bounded for large positive inputs and so prevents it from growing without limit.
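A capped ReLU can be sketched in one line. The name `clipped_relu` and the default cap of 6.0 (the value used by the ReLU6 variant) are assumptions for illustration, since no specific limit is given here.

```python
def clipped_relu(x, cap=6.0):
    """ReLU with its output limited to the range [0, cap]."""
    return min(max(0.0, x), cap)

for x in [-1.0, 3.0, 6.0, 100.0]:
    print(f"x = {x:6.1f}  clipped_relu(x) = {clipped_relu(x)}")
```

Above the cap the output is constant, so the gradient there is zero. The function trades unbounded activations for saturation on the positive side.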
6. Exponential Linear Unit (ELU) function-
The Exponential Linear Unit is also a version of ReLU that modifies the negative part of the function: $f(x) = x$ for $x > 0$ and $f(x) = \alpha (e^{x} - 1)$ otherwise.
This activation function also avoids the dying ReLU problem, but like ReLU it places no constraint on the activations for large positive inputs, so they can still grow very large.
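A minimal sketch of ELU and its gradient, using the usual definition with a scale parameter alpha. The default `alpha=1.0` is a common choice but is an assumption here.

```python
import math

def elu(x, alpha=1.0):
    """x for positive inputs, alpha * (exp(x) - 1) otherwise."""
    # math.expm1(x) computes exp(x) - 1 accurately for small x.
    return x if x > 0 else alpha * math.expm1(x)

def elu_grad(x, alpha=1.0):
    """Gradient with respect to x: alpha * exp(x) on the negative side."""
    return 1.0 if x > 0 else alpha * math.exp(x)

for x in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"x = {x:5.1f}  elu(x) = {elu(x):+.4f}  elu'(x) = {elu_grad(x):.4f}")
```

On the negative side the output levels off towards -alpha and the gradient stays positive, though it becomes small for very negative inputs. On the positive side the function is unbounded, exactly like ReLU.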
7. Softmax activation function-
Softmax is often used as the activation of the last layer of a neural network. It normalizes the raw outputs $z_1, \dots, z_K$ of the network into a probability distribution over the $K$ classes, $\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$, so that each output can be read as the probability that the input belongs to the corresponding class. It is popularly used for multi-class classification problems.
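A minimal pure-Python sketch of softmax is shown below. Subtracting the maximum before exponentiating is a standard trick for numerical stability and does not change the result. The example scores are made up.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(logits)  # shift by the max so exp() cannot overflow
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))

print([round(p, 3) for p in probs])
print("predicted class:", predicted_class)
```

The class with the largest raw score always gets the largest probability, so the prediction is simply the index of the maximum.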
I hope you enjoyed reading this. I have tried to cover many of the activation functions that are commonly used in neural networks. If you find any mistake, please feel free to mention it and the blog will be corrected.