Definition of the Problem
In deep learning (DL), the vanishing gradient problem is the phenomenon where the gradients at some layers of a network become very small, or even zero, during backpropagation. Why does a small gradient matter? Recall how a DL model learns: it updates its weights with the gradient descent (GD) algorithm. In GD, the parameter $\theta$ is updated by the formula:
$$ \begin{aligned} \theta = \theta - \eta \nabla_{\theta} J(\theta) \end{aligned} $$
where $\eta$ is the learning rate and $\nabla_{\theta} J(\theta)$ is the gradient of the loss function at the current parameters. To update the model's parameters, GD must therefore compute the gradient of the loss function. The goal of training is to minimize the error between the model's prediction and the ground truth. Recall that moving in the direction of the gradient (gradient ascent) leads towards a maximum. For the loss function we want a minimum instead, so that the model predicts as accurately as possible. That is why the GD formula has a minus sign: we move against the gradient to approach the minimum. The gradient itself is the rate of change of a real-valued function at a given point. For a function $f(x)$, the derivative $\frac{df}{dx}$ tells us how much the value of $f$ changes as $x$ changes, and whether that change is large or small. Consider a concrete example:
$$ \begin{equation} f(x) = 5x \end{equation} $$
Then the derivative of $f(x)$ with respect to $x$ is calculated as follows:
$$ \begin{aligned} & \nabla f(x) = (5x)' \\ \Leftrightarrow & \nabla f(x) = 5 \end{aligned} $$
This shows that when $x$ changes by some amount, the value of $f(x)$ changes by 5 times that amount. Derivatives are commonly used to find local maxima or minima of functions: moving in the direction in which the function increases leads towards a maximum, and moving in the opposite direction leads towards a minimum. So what effect does a very small gradient at the layers have on a DL network? When $\nabla_{\theta} J(\theta)$ is very small, the update term $\eta \nabla_{\theta} J(\theta)$ is close to zero, so $\theta$ barely changes and the model stops learning. This is dangerous because the model may never reach a minimum of the loss, which leaves its predictions poor.
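To make the effect of a tiny gradient concrete, here is a minimal sketch of the GD update rule. The loss $J(\theta) = \theta^2$, the learning rate, the step count, and the factor `1e-6` used to mimic a vanished gradient are all assumptions chosen for illustration, not values from a real network.

```python
def gradient_descent(grad_fn, theta, lr=0.1, steps=50):
    """Apply theta <- theta - lr * grad_fn(theta) for a fixed number of steps."""
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)
    return theta


def healthy_grad(theta):
    # Gradient of the illustrative loss J(theta) = theta**2.
    return 2.0 * theta


def vanished_grad(theta):
    # Same gradient scaled by 1e-6 to mimic a vanished gradient (assumption).
    return 1e-6 * 2.0 * theta


start = 3.0
theta_healthy = gradient_descent(healthy_grad, start)
theta_vanished = gradient_descent(vanished_grad, start)
print(theta_healthy)   # ends close to the minimum at 0
print(theta_vanished)  # barely moves away from the starting value 3.0
```

With the same learning rate and number of steps, the parameter that receives the scaled-down gradient has effectively not been trained at all.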
Causes of the Vanishing Gradient
We already know what the vanishing gradient is, but how does it happen? The small gradient at the layers is only the symptom; what we care about is the root cause. Based on my own knowledge and the sources I consulted while studying, two main factors can be identified: the influence of activation functions and the way the parameters of the DL network are initialized.
Cause related to Activation Functions
Non-linear activation functions are what allow a deep learning network to learn complex representations rather than only linear mappings. Historically, the Sigmoid function was often used in the hidden layers of deep learning networks. Its formula is as follows:
$$ sigmoid(x) = \sigma(x) =\frac{1}{1 + e^{-x}}, $$
and the derivative of the Sigmoid is:
$$ sigmoid'(x) = \sigma'(x) = \sigma(x) \cdot (1 - \sigma(x)) $$
You may ask why this matters. For any value of $x$, the output of Sigmoid lies in the open interval $(0, 1)$, which suits problems where the output represents a probability. So why is Sigmoid no longer used in the hidden layers of deep learning networks?

Fig. 1. Sigmoid Activation Function and Derivative
We can see that if we use Sigmoid in hidden layers, the gradient at these layers vanishes when the input to the activation function is a large positive or large negative number. The derivative peaks at only 0.25 (at $x = 0$) and is already below 0.01 once $|x| > 5$. These flat regions are called the “saturation ranges” of the activation function: there the gradient is so small that it does little to update the weights of the corresponding layers.
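The shape of the Sigmoid derivative can be checked numerically. The sketch below uses only the two Sigmoid formulas given earlier; the sample points and the depth of 10 layers are arbitrary choices for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-10.0, -5.0, 0.0, 5.0, 10.0):
    print(f"x = {x:6.1f}  sigmoid'(x) = {sigmoid_grad(x):.6f}")

# Even in the best case (derivative 0.25 at every layer), the product of the
# local derivatives over 10 layers is already below 1e-6.
best_case_product = 0.25 ** 10
print(best_case_product)
```

Because backpropagation multiplies these local derivatives together layer by layer, a stack of Sigmoid layers shrinks the gradient even when no unit is saturated.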
Cause related to Parameter Initialization Issues
Previously, practitioners often initialized the parameters of neural networks from a standard normal distribution, that is, zero-centered with a standard deviation of 1. This means most parameters (about 68%) fall in the range $[-1; 1]$. Does this affect computation? The answer is yes, very much. Imagine your neural network has $L$ layers, and you initialize the parameters of each layer as follows:
$$ W^{l} = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix} $$
Now run one forward pass to compute the predicted value. Ignoring the bias $b$ and the activation functions, for an input $x$ we get:
$$ \hat{y} = \left( \prod^{L}_{l=1} W^{l} \right) x $$
When the $W$'s are multiplied together repeatedly, the result becomes very small because each $W$ shrinks whatever it multiplies; with the matrix above, the product is $0.5^{L}$ times the identity matrix. It is just like raising a number less than 1 to a high power: $0.9^{40} \approx 0.015$, which is already close to 0. Backpropagation multiplies by these same weight matrices, so the gradient shrinks at every layer it passes through on the way back towards the input, and the earliest layers receive almost no gradient. This is the vanishing gradient.
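The repeated multiplication can be reproduced in a few lines. This is a minimal sketch using the 2x2 matrix from the example; the helper names `matmul` and `product_of_layers` are hypothetical and exist only for this illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def product_of_layers(w, num_layers):
    """Multiply the same weight matrix w together num_layers times."""
    result = [[1.0, 0.0], [0.0, 1.0]]  # 2x2 identity
    for _ in range(num_layers):
        result = matmul(w, result)
    return result


W = [[0.5, 0.0], [0.0, 0.5]]
for num_layers in (5, 10, 20, 40):
    print(num_layers, product_of_layers(W, num_layers)[0][0])

print(0.9 ** 40)  # roughly 0.0148
```

The diagonal entry halves with every extra layer, so after 40 layers it is below $10^{-12}$.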
Resolutions
One solution to mitigate vanishing gradients is to change the choice of activation function. Two activation functions are used very often in hidden layers: tanh and ReLU.
Tanh activation function
The tanh function is represented by the following formula:
$$ tanh(x)= \frac {e^x - e^{-x}} {e^x + e^{-x}} $$
If you have ever seen the graph of the tanh function, it has the same S shape as Sigmoid, but its output lies in $(-1, 1)$ instead of $(0, 1)$, which makes it zero-centered. In fact, $tanh(x) = 2\sigma(2x) - 1$, so tanh is a rescaled and shifted Sigmoid.

Fig. 2. Tanh Activation Function and Derivative
Being zero-centered makes neural networks that use tanh easier to optimize during backpropagation. However, tanh still saturates for inputs of large magnitude, so the vanishing gradient problem is not completely solved.
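A short numerical check shows both sides of this trade-off. The sketch relies on the standard identity $tanh'(x) = 1 - tanh^2(x)$, which is not stated in the text above; the sample points are arbitrary.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh_grad(x):
    # Derivative of tanh: 1 - tanh(x)^2
    return 1.0 - math.tanh(x) ** 2


for x in (0.0, 1.0, 2.5, 5.0):
    print(f"x = {x:4.1f}  tanh(x) = {math.tanh(x):.6f}  tanh'(x) = {tanh_grad(x):.6f}")

# tanh is a rescaled, shifted Sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
print(abs(math.tanh(1.3) - (2.0 * sigmoid(2.6) - 1.0)))  # difference is negligible
```

The derivative reaches 1 at $x = 0$, four times the Sigmoid peak of 0.25, but it still collapses towards 0 as $|x|$ grows.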
ReLU activation function
ReLU (Rectified Linear Unit) is one of the most widely used activation functions for hidden layers today. Its formula is very simple:
$$ ReLU(x) = relu(x) = \max(0, x) $$
This means that for every $x > 0$, the output of ReLU equals the input. In all other cases the output is 0.

Fig. 3. ReLU Activation Function and Derivative
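Looking at the graph, we can clearly see that the derivative of ReLU has no saturation range for positive inputs: it is exactly 1 for every $x > 0$, so the gradient passes through unchanged. This makes weight updates in the network much more effective than with Sigmoid or tanh. For negative inputs the derivative is 0, so a unit that only ever receives negative inputs gets no gradient at all. ReLU is also simpler and cheaper to compute.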
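The difference becomes clear when the local derivatives of many layers are multiplied together, as backpropagation does. The sketch below ignores the weights and assumes a hypothetical pre-activation value of 2.0 at each of 20 layers; both numbers are illustrative only.

```python
import math


def relu_grad(x):
    # Derivative of max(0, x): 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0


def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def chained_gradient(grad_fn, pre_activations):
    """Product of the local activation derivatives along a chain of layers."""
    product = 1.0
    for z in pre_activations:
        product *= grad_fn(z)
    return product


pre_activations = [2.0] * 20  # hypothetical values, one per layer
print(chained_gradient(relu_grad, pre_activations))     # 1.0
print(chained_gradient(sigmoid_grad, pre_activations))  # far below 1e-15
```

Note that a single non-positive pre-activation anywhere in the chain makes the ReLU product exactly 0.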
In addition to choosing activation functions, proper parameter initialization is also a way to minimize the vanishing gradient phenomenon.
Batch normalization
Batch normalization is a very effective way to reduce vanishing gradients. It normalizes the inputs to each layer (the activations coming from the previous layer) over the current mini-batch to zero mean and unit variance, then applies a learned scale and shift. This keeps the distribution of activations similar from layer to layer, which stabilizes training and prevents values from becoming too large or too small and pushing the activation function into its saturation range.
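As a rough illustration, the normalization step for a single feature over one mini-batch can be sketched as follows. This is a simplified training-time version under stated assumptions: `gamma` and `beta` stand for the learned scale and shift and are fixed here, `eps` is a small constant for numerical stability, and the running statistics that real implementations keep for inference are omitted.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift it."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


activations = [10.0, 12.0, 14.0, 40.0]  # made-up activations for one feature
normalized = batch_norm(activations)
print(normalized)

mean_after = sum(normalized) / len(normalized)
var_after = sum((v - mean_after) ** 2 for v in normalized) / len(normalized)
print(mean_after, var_after)  # close to 0 and 1
```

After normalization the values sit near zero, which is where Sigmoid and tanh have their largest derivatives.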