Introduction to Recurrent Neural Networks

Esma Bozkurt
9 min read · Apr 14, 2023

I have previously mentioned that Convolutional Neural Networks (CNNs), a type of artificial neural network, consist of layers and that learning occurs through a forward/backward propagation mechanism across those layers. In this article, I will talk about Recurrent Neural Networks (RNNs), another type of neural network, and how they work.

We can say that RNNs produce solutions to problems expressed as time series. They are especially used in audio processing (speech-to-text, text-to-speech, music generation), because sound is continuous data over time. They are also used in natural language processing (especially sentiment analysis of text and positive/negative classification), in machine translation, and in text search and analytics. RNN variants (such as LSTM) are also frequently used in medical applications such as predictions over DNA sequences, and in the modeling and design of drugs and chemicals. Although we use CNNs for image processing, we can also use RNNs for video, where the images extend over a time series. RNNs can also be used to decide whether a machine is due for maintenance or whether it is likely to fail (predictive maintenance), and even in finance. So, let's make an introduction to RNNs, which have a very wide range of applications.

In classical neural networks, the inputs flow to the outputs through layers, and when feeding forward, all inputs affect all outputs:

Here we can encounter situations where the input length Tx and the output length Ty are not equal to each other. Because of the effect of time and ordering, the different sequences encountered throughout a text would not be learned well by such a model. This is why we use RNNs.

Here is how words are represented according to the working principle of an RNN: each word is expressed as a vector in which the position of that word is set to 1 and all other positions are set to 0 (one-hot encoding). For example, in the sentences above, we see that the proper nouns are marked with 1 and the other positions with 0. Each word takes its place in the dictionary (vector) in turn. In the text example above, the vector length is set to 10,000 based on the number of words in the dictionary. Then, at the position of the given word the value is 1 and all other values are 0, and learning proceeds sequentially. Learning takes place iteratively in this way, and each input has a corresponding output. In this way, the model of the artificial neural network we have created changes:
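As a minimal sketch of this one-hot encoding (the tiny vocabulary and the chosen word below are made up for illustration; the example above uses a 10,000-word dictionary), it could look like this in Python:

```python
import numpy as np

# Hypothetical toy vocabulary; a real dictionary would hold ~10,000 words.
vocab = ["a", "aaron", "and", "harry", "potter", "zulu"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector with 1 at the word's dictionary index and 0 everywhere else."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("harry"))  # [0. 0. 0. 1. 0. 0.]
```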

Here a<0> is the initial activation vector, and to calculate the estimate y^<1> I need to know the first input x<1> and a<0>. In the next step, I obtain y^<2> and a<2> by processing a<1> and x<2>. Each x input represents a word. Let's not forget that these are not layers, but a progression over time. The weights are the same at every time step, and the value y^<Ty> at the last step depends on every operation before it. So I need to know every earlier x input to produce an output; for example, to calculate y^<3>, I need to know x<3>, x<2>, x<1>, a<1>, and a<0>.

Note: In a CNN, the y outputs did not depend on every x input, and the weights were different in each layer.

The biggest disadvantage of this RNN is that estimates always depend only on previous information. In some cases we will need to take future information into account, and that becomes a problem.

To decide what "Candy" refers to, as in this example, we should look at the words after it, not just the words before it. For this, we will need to take information about the future into account.

RNN Calculations

First, let’s look at the processing of a<0> and x<1> and the calculation of y^<1> and a<1>:

First, the new a value is found by multiplying the previous a value and the x input by the W weights and adding the bias. When calculating the y value, the new a value we found is weighted again and a bias is added. Here "g" is an activation function such as ReLU. Although ReLU is often preferred in CNNs, it is not the usual choice in RNNs; hyperbolic tangent (tanh) is generally preferred. The same or different activation functions can be used when calculating a and y. If we generalize the operation we have described, we get the following expressions:

Therefore, when calculating y^<t>, I use the value of a<t> that I have just calculated. Generally, hyperbolic tangent (tanh), and more rarely ReLU, is preferred when calculating a, while the sigmoid activation function is preferred when calculating the y outputs. Softmax can also be used for the output, depending on the problem.

Here, Waa is a 100x100 matrix and a is a vector of length 100. Considering that the x input is a vector of length 10,000 (the number of words in the dictionary), Wax is a 100x10000 matrix.
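Putting the forward step together with these shapes, a minimal NumPy sketch might look as follows (the weights are random placeholders, not trained values, and the word index is arbitrary):

```python
import numpy as np

n_a, n_x, n_y = 100, 10000, 10000   # activation size, input vocabulary size, output size

# Randomly initialized weights, used only to illustrate the shapes.
Waa = np.random.randn(n_a, n_a) * 0.01    # 100 x 100
Wax = np.random.randn(n_a, n_x) * 0.01    # 100 x 10000
Wya = np.random.randn(n_y, n_a) * 0.01    # 10000 x 100
ba, by = np.zeros((n_a, 1)), np.zeros((n_y, 1))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum(axis=0)

def rnn_step(x_t, a_prev):
    """a<t> = tanh(Waa a<t-1> + Wax x<t> + ba), y^<t> = softmax(Wya a<t> + by)."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    y_hat_t = softmax(Wya @ a_t + by)
    return a_t, y_hat_t

a0 = np.zeros((n_a, 1))               # initial activation a<0>
x1 = np.zeros((n_x, 1)); x1[42] = 1   # one-hot word (index 42 chosen arbitrarily)
a1, y_hat1 = rnn_step(x1, a0)
```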

We can also combine the weights into a single matrix Wa and express the general a expression more compactly, as follows:

We will use more heavily indexed expressions later on, and writing out a and y in the long form would become cumbersome, so we simplify the notation in this way.
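As a quick check of this simplification (tiny random values, just to show the algebra): stacking Waa and Wax side by side into a single Wa and concatenating a<t-1> with x<t> gives exactly the same activation.

```python
import numpy as np

n_a, n_x = 3, 5   # tiny sizes, only to verify the identity
Waa, Wax = np.random.randn(n_a, n_a), np.random.randn(n_a, n_x)
ba = np.random.randn(n_a, 1)
a_prev, x_t = np.random.randn(n_a, 1), np.random.randn(n_x, 1)

Wa = np.concatenate([Waa, Wax], axis=1)       # Wa = [Waa | Wax]
ax = np.concatenate([a_prev, x_t], axis=0)    # [a<t-1>; x<t>] stacked vertically

# The compact form matches the original two-matrix form.
assert np.allclose(np.tanh(Wa @ ax + ba),
                   np.tanh(Waa @ a_prev + Wax @ x_t + ba))
```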

We can think of these calculations, which progress from left to right, as forward propagation. Now let's examine how the backpropagation algorithm works in an RNN.

First of all, we need the loss value and the current values of the weights for backpropagation. With a loss function (Euclidean distance, mean squared error, cross entropy, etc.), we can calculate the loss between the estimated y^ and the real y values. So how do we update the weights?

In forward propagation, we calculated the a and y values, and from them the loss L, using the W weights. Now we need to update these weights. For this, we use optimization algorithms such as gradient descent.

In backpropagation, we move backwards from the loss, computing gradients with respect to the y and a values in turn. The result is a set of updated Wa and Wy values. Forward propagation then runs again using these updated Wa and Wy values.
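The update itself is plain gradient descent. A small sketch, assuming the gradients have already been computed by backpropagation through time (the dictionary keys and shapes below are illustrative only):

```python
import numpy as np

def gradient_descent_step(params, grads, learning_rate=0.01):
    """Move each weight a small step against its gradient: W := W - lr * dW."""
    for name in params:
        params[name] -= learning_rate * grads[name]
    return params

# Placeholder weights and gradients; in practice grads would come from
# backpropagation through time over the whole sequence.
params = {"Waa": np.random.randn(4, 4), "Wax": np.random.randn(4, 6)}
grads  = {"Waa": np.random.randn(4, 4), "Wax": np.random.randn(4, 6)}
params = gradient_descent_step(params, grads)
```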

Recurrent neural networks may need different designs depending on where they are used.

For example, there can be a single input and multiple outputs, as in music generation; multiple inputs and a single output, as in sentiment analysis; or multiple inputs and multiple outputs, as in machine translation. In this way, it is possible to design different RNNs for inputs and outputs of different types, as sketched below.
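A rough sketch of how the loop structure differs between a many-to-one and a one-to-many design (the step function here is a simplified stand-in for the calculation described above, and the shapes are assumed to line up so an output can be fed back in as the next input):

```python
import numpy as np

def rnn_step(x_t, a_prev, Wa, ba):
    """Simplified single step: new activation from the previous activation and the input."""
    return np.tanh(Wa @ np.concatenate([a_prev, x_t]) + ba)

def many_to_one(xs, a0, Wa, ba, Wy, by):
    """E.g. sentiment analysis: read the whole sequence, emit one output at the end."""
    a = a0
    for x_t in xs:
        a = rnn_step(x_t, a, Wa, ba)
    return Wy @ a + by

def one_to_many(x1, a0, Wa, ba, Wy, by, steps):
    """E.g. music generation: one seed input, then each output is fed back as the
    next input (this assumes the output has the same dimension as the input)."""
    a, x_t, ys = a0, x1, []
    for _ in range(steps):
        a = rnn_step(x_t, a, Wa, ba)
        y = Wy @ a + by
        ys.append(y)
        x_t = y
    return ys
```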

Language Model and Sequence Generation with RNN

While reading, we may encounter words and sentences that sound very similar but have quite different meanings, for example words such as "red" and "read":

We know that we will encounter “read” more often here, so its probability will be high.

The same is true for sequences of words. For example, after the phrase "hard to", the word above is more common:

At the same time, we need to assign a token to indicate that the sentence has finished. In other words, while we get a y output for each word, we should also get a y output for the sentence-ending token. I will use <EOS> (end of sentence) to express this:

We may also encounter "unknown" words that are not in the dictionary. In this example there are two words denoting dog breeds. I have to represent them as <UNK>:

Now let's examine the RNN model for the sentence "Dogs live for about 14 years <EOS>". Here we can consider the entire dictionary as a pool of possibilities; the pool contains the probabilities of all words, from P(a) to P(<EOS>):

Here, the first word of the sentence is "Dogs", and there is no word before it, so x<1> = 0 (a zero vector). Then we make the estimate y^<1>. The estimated word becomes the input for the next step, so x<2> = y^<1> ("dogs"). After "Dogs" comes "live", and x<3> = y^<2> ("live"). We see that each word depends on the previous words. Finally, <EOS> comes after "years".

Here we see that the probability of y<2> is conditioned on the y<1> before it, and the probability of y<3> on the y<1> and y<2> before it:
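In other words, the probability of the whole sentence is the product of these conditional probabilities. A tiny illustration with made-up numbers:

```python
# P(y<1>, y<2>, y<3>) = P(y<1>) * P(y<2> | y<1>) * P(y<3> | y<1>, y<2>)
p_y1 = 0.03            # e.g. P("dogs")                 -- made-up value
p_y2_given_y1 = 0.12   # e.g. P("live" | "dogs")        -- made-up value
p_y3_given_y12 = 0.08  # e.g. P("for"  | "dogs live")   -- made-up value

sentence_probability = p_y1 * p_y2_given_y1 * p_y3_given_y12
print(sentence_probability)  # 0.000288
```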

With this model, we learn new words from left to right, and at the end of the model we apply a softmax output together with a loss function (e.g. cross entropy). With this loss we measure the difference between the real y values and the estimated y^ values. The loss function is expressed as:

For example, we can use the Cross entropy loss function here:
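A minimal sketch of that cross-entropy loss, summed over the time steps of one sequence (y_true is the one-hot target at each step and y_hat is the softmax output):

```python
import numpy as np

def cross_entropy_loss(y_hat_seq, y_true_seq, eps=1e-12):
    """Sum of -log P(correct word) over all time steps of a sequence."""
    loss = 0.0
    for y_hat, y_true in zip(y_hat_seq, y_true_seq):
        loss -= np.sum(y_true * np.log(y_hat + eps))
    return loss

# Toy example: 4-word vocabulary, 2 time steps.
y_hat_seq  = [np.array([0.1, 0.7, 0.1, 0.1]), np.array([0.25, 0.25, 0.4, 0.1])]
y_true_seq = [np.array([0, 1, 0, 0]),          np.array([0, 0, 1, 0])]
print(cross_entropy_loss(y_hat_seq, y_true_seq))  # -log(0.7) - log(0.4) ≈ 1.27
```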

Character-Level Language Model

A word-level language model has difficulty predicting the next word when it encounters "unknown" words, and this causes errors in the calculations. We use a character-level language model for this. For example, the two words here are "unknown" words indicating dog breeds:

Here the x inputs will be characters (letters), not words. We often use this model to handle <UNK> words. However, it requires us to process much longer sequences, since every character has to be represented, which increases the computational cost. For this reason the word-level language model usually achieves better results and is used more often.
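To make the contrast concrete, here is a small sketch of how the same phrase splits at the word level versus the character level (the breed names and the tiny vocabulary are made up for illustration):

```python
phrase = "a Pomeranian and a Shiba Inu"

# Word level: breed names missing from the dictionary fall back to <UNK>.
vocabulary = {"a", "and", "dog", "lives"}   # hypothetical word dictionary
word_tokens = [w if w.lower() in vocabulary else "<UNK>" for w in phrase.split()]
print(word_tokens)            # ['a', '<UNK>', 'and', 'a', '<UNK>', '<UNK>']

# Character level: every word can be spelled out, but the sequence is much longer.
char_tokens = list(phrase)
print(len(word_tokens), len(char_tokens))   # 6 word tokens vs 28 characters
```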

Vanishing Gradients

Wherever the derivative of the activation function is 0 during backpropagation, the weights are updated by 0, causing a loss of information; no learning takes place in those parts (vanishing gradients). We can encounter this problem in the RNN model as well.
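A tiny numeric illustration of why this hurts RNNs over long sequences: the gradient that reaches an early time step is a product of one factor per step, and if each factor is below 1 the product shrinks toward 0 (the 0.8 here is purely illustrative):

```python
# Each backward step through time multiplies the gradient by roughly the
# derivative of the activation times a weight term; assume that factor is 0.8.
per_step_factor = 0.8
for steps in (1, 5, 20, 50):
    print(steps, per_step_factor ** steps)
# 1   0.8
# 5   0.32768
# 20  ~0.0115
# 50  ~1.4e-05
```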

In an RNN, to obtain information about a word, we look at the previous word or words, that is, words as close to it as possible. However, this is not always enough.

For example, in order to predict "French" in this sentence, we must have information about the word "France", but this information was given at the beginning of the passage. There may be many sentences in between, and different city names within those sentences.

As here, in order to decide whether to use "was" or "were", I need to be able to decide whether the first word is singular or plural. An RNN is not very good at retrieving information from the distant past, so it may not be successful with long sentences. So how can we learn this?

Actually, there are many methods we use for this problem. These methods are:

  • Gated Recurrent Units (GRU)
  • Long Short Term Memory (LSTM)
  • Bidirectional RNN
  • Deep RNN

I will cover these methods in my next article. For now, don't forget to follow me, and see you soon!

Reference: deeplearning.ai
