How Do Artificial Neural Networks Learn in Deep Learning?

Esma Bozkurt · Published in CodeX · Feb 24, 2023

Deep learning is a sub-branch of machine learning based on how the human brain processes data. The aim is to model the neural networks in the brain and to produce human-like solutions to problems. In this way, it allows machines to learn without human supervision. It uses artificial neural networks for this purpose. Neural networks can be single-layer or multi-layer; deep learning models are based on multi-layer neural networks.

To understand how neural networks work, we can think of impulse transmission in neurons. Data received at the dendrites is weighted in the nucleus (multiplied by w) and transmitted along the axon to another nerve cell. The outputs (x’s) leaving one neuron become the inputs of the next, and in this way transmission between nerve cells is ensured. To model and train this on a computer, we need to know the algorithm of this operation; that is, we need to be able to obtain an output by applying a series of operations to the inputs. So how can we express this mathematically?

The values [1] and [2] in the picture denote the layer numbers: at the top you can see a single-layer network structure, and at the bottom a simple multi-layer structure (it becomes a 2-layer network by adding one hidden layer). It is useful to examine both side by side so that the operations are easy to follow.

The x input values are multiplied by the w weights (these weights reflect the importance of the inputs and are given random values at the start), and the bias is added (an offset term that, like the weights, is adjusted during training to reduce the gap between the predicted data and the actual data). This gives the z value held at the node (neuron). Passing z through the activation function (sigmoid) produces the a value, which becomes our output. This a value and the desired y value are then fed into a loss function (which gives information about the error rate of the model). At first this loss will be high, because the system is just starting to learn.
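As a rough sketch of this forward step in code (the toy numbers, the variable names, and the cross-entropy loss below are illustrative assumptions, not values from the article’s figures):

```python
import numpy as np

def sigmoid(z):
    # Squashes z into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])       # input values (arbitrary toy numbers)
w = np.random.randn(3) * 0.01        # weights start as small random values
b = 0.0                              # bias
y = 1.0                              # desired output

z = np.dot(w, x) + b                 # weighted sum of the inputs plus the bias
a = sigmoid(z)                       # activation -> the model's output
loss = -(y * np.log(a) + (1 - y) * np.log(1 - a))   # cross-entropy loss (one common choice)
print(loss)                          # high at first, since the weights are still random
```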

Then we have to go back and repeat the work in the opposite direction. We move backwards by taking derivatives (back propagation) and note down the values obtained. The new weight values matter here, because we will use them as we move forward again. We continue this process, which we call the feed forward / back propagation algorithm, until the loss converges to its minimum and we find the best model.
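Put together, one version of this loop might look like the following sketch, continuing the toy single-neuron example from above (the shortcut dL/dz = a - y holds for the sigmoid + cross-entropy pair assumed here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = np.array([0.5, -1.2, 3.0]), 1.0     # same toy example as above
w, b = np.random.randn(3) * 0.01, 0.0
alpha = 0.1                                # learning rate (step size)

for step in range(100):
    # Feed forward: compute the prediction and the loss
    a = sigmoid(np.dot(w, x) + b)
    loss = -(y * np.log(a) + (1 - y) * np.log(1 - a))
    # Back propagation: for this loss/activation pair, dL/dz simplifies to (a - y)
    dz = a - y
    dw, db = dz * x, dz
    # Update the weights and bias, then repeat until the loss stops decreasing
    w -= alpha * dw
    b -= alpha * db
```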

In the picture above, you can see a 2-layer neural network: a hidden layer with 4 neurons and an output layer with a single neuron. The input layer (layer 0) does not count towards the number of layers. The parameters are expressed as w(number of neurons in its layer, number of inputs) and b(number of neurons in its layer, 1). The inputs of the hidden layer are the x values, and the inputs of the output layer are the a values:
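As a sketch of that shape convention (assuming, for illustration, 3 input values; the names W1, b1, W2, b2 are not from the original):

```python
import numpy as np

n_x = 3   # number of input values (chosen here just for illustration)
n_h = 4   # neurons in the hidden layer (layer 1)
n_y = 1   # neurons in the output layer (layer 2)

W1 = np.random.randn(n_h, n_x) * 0.01   # w[1]: (neurons in layer 1, number of inputs)
b1 = np.zeros((n_h, 1))                 # b[1]: (neurons in layer 1, 1)
W2 = np.random.randn(n_y, n_h) * 0.01   # w[2]: (neurons in layer 2, neurons in layer 1)
b2 = np.zeros((n_y, 1))                 # b[2]: (neurons in layer 2, 1)
```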

Instead of the sigmoid function, other activation functions such as the hyperbolic tangent, ReLU, and Leaky ReLU are also available. These functions are used in the layers and must be differentiable, because we update the weights by taking derivatives during back propagation.

The hyperbolic tangent (tanh) is essentially the sigmoid rescaled to the range -1 to +1. Since its output is centred on 0, it generally performs better than the sigmoid function, whose output is centred on 0.5. Like the sigmoid, it is based on the exponential function.

It usually makes more sense to use tanh in the hidden layers and sigmoid in the output layer. In particular, if we want a value in the range 0–1 at the output, we should use sigmoid, because its output lies in that range.

Recently, ReLU has become the most used activation function in deep learning. Since the gradient of ReLU is 0 for inputs less than 0, those parts do not contribute to the weight updates and no learning occurs there. In Leaky ReLU, inputs less than 0 also have a small non-zero gradient, so there is a result for every value and learning takes place in any case.
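For reference, the activation functions mentioned above and their derivatives can be sketched as follows (the 0.01 slope for Leaky ReLU is a typical choice, not a fixed rule):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # output in (0, 1), centred on 0.5

def tanh(z):
    return np.tanh(z)                       # output in (-1, 1), centred on 0

def relu(z):
    return np.maximum(0, z)                 # gradient is 0 for z < 0

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)    # a small gradient survives for z < 0

# Derivatives used during back propagation
def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)

def tanh_grad(z):
    return 1 - np.tanh(z) ** 2

def relu_grad(z):
    return (z > 0).astype(float)

def leaky_relu_grad(z, slope=0.01):
    return np.where(z > 0, 1.0, slope)
```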

The expression above is called the cost function: it is the average, over all training examples, of the loss between the desired value y and the value ŷ that the model produces.
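Written out (the figure itself is not reproduced here), the usual averaging form is, with m the number of training examples and L the per-example loss:

```latex
J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(\hat{y}^{(i)},\, y^{(i)}\right)
```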

The expression above is called the gradient descent algorithm. Its purpose is to minimize the cost; there are other optimization methods as well. Converging to the minimum is important for optimizing our model. The alpha value here is the learning rate.
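The repeated update step can be written as follows, with alpha as the learning rate:

```latex
w := w - \alpha \, \frac{\partial J(w, b)}{\partial w}
\qquad
b := b - \alpha \, \frac{\partial J(w, b)}{\partial b}
```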

Backpropagation Algorithm & Chain Rule

We said that we move backwards by taking derivatives in back propagation. Let’s examine this through an example:

Let’s go step by step, assuming the sigmoid function is used in the example and that the gradient of f at the output is seeded with 1. Going backwards, we start by taking the derivative of the last 1/x operation:

First, apply d(1/x)/dx = -1/x²: substituting x = 1.37 and multiplying by the upstream gradient of 1 (shown in red) gives -0.53:

We move to the previous operation. The derivative of the sum in the +1 step is d(c+x)/dx = 1, and multiplying it by the previous value of -0.53 gives -0.53:

In the next operation, the derivative of e^x is itself, so e^(-1.00) · (-0.53) = -0.20. We then multiply this -0.20 by the derivative of the preceding multiplication, d(ax)/dx = a = -1, and the next value is 0.20. If we carry out all the backward operations in this way, we obtain the gradients used to calculate the new weights:
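The same backward sweep can be reproduced numerically. The sketch below covers only the last four operations of the example graph, taking 1.00 as the value that enters the multiplication by -1 (consistent with the forward values quoted above):

```python
import numpy as np

# Forward pass through the last four operations of the example graph
v0 = 1.00                 # value entering the (* -1) gate
v1 = -1.0 * v0            # multiply by -1  -> -1.00
v2 = np.exp(v1)           # e^x             ->  0.37
v3 = v2 + 1.0             # add 1           ->  1.37
v4 = 1.0 / v3             # 1/x             ->  0.73 (the sigmoid output)

# Backward pass: seed the output gradient with 1 and apply the chain rule
d4 = 1.0
d3 = (-1.0 / v3 ** 2) * d4    # d(1/x)/dx = -1/x^2   -> -0.53
d2 = 1.0 * d3                 # d(x + 1)/dx = 1      -> -0.53
d1 = np.exp(v1) * d2          # d(e^x)/dx = e^x      -> -0.20
d0 = -1.0 * d1                # d(-1 * x)/dx = -1    ->  0.20

print(round(d3, 2), round(d2, 2), round(d1, 2), round(d0, 2))
```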

We can also express this process as follows:

In multi-layer neural networks, the initial weights cannot all be chosen as 0, as they can in a single-layer network: with zero weights, every hidden neuron computes the same output and receives the same gradient, so the neurons never become different from one another and learning does not take place. Therefore, we should initially choose small random values.
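A quick way to see the problem (a toy check with arbitrary data): with all-zero weights, every hidden neuron produces exactly the same activations, so in back propagation they would all receive the same gradient and stay identical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.random.randn(3, 5)              # 3 input features, 5 examples (arbitrary)
W1, b1 = np.zeros((4, 3)), np.zeros((4, 1))   # all-zero initialization

A1 = sigmoid(W1 @ X + b1)
print(np.allclose(A1, A1[0]))          # True: all 4 hidden neurons output the same values
```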

Finally, you can review the Python code of a two-layer network model below. See you in my next post!
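The original embedded snippet is not reproduced here; what follows is a minimal sketch of such a two-layer model, consistent with the steps described in this post (tanh in the hidden layer, sigmoid at the output, cross-entropy cost, gradient descent). The function names and the toy data are chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def initialize_parameters(n_x, n_h, n_y):
    # Small random weights break the symmetry between hidden neurons
    return {
        "W1": np.random.randn(n_h, n_x) * 0.01, "b1": np.zeros((n_h, 1)),
        "W2": np.random.randn(n_y, n_h) * 0.01, "b2": np.zeros((n_y, 1)),
    }

def forward(X, p):
    # Feed forward: tanh in the hidden layer, sigmoid in the output layer
    Z1 = p["W1"] @ X + p["b1"]
    A1 = np.tanh(Z1)
    Z2 = p["W2"] @ A1 + p["b2"]
    A2 = sigmoid(Z2)
    return A1, A2

def cost(A2, Y):
    # Cross-entropy loss averaged over the m examples
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m)

def backward(X, Y, A1, A2, p):
    # Back propagation: chain rule applied layer by layer, output to input
    m = X.shape[1]
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (p["W2"].T @ dZ2) * (1 - A1 ** 2)    # tanh'(z) = 1 - tanh(z)^2
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}

def train(X, Y, n_h=4, alpha=1.0, iterations=5000):
    p = initialize_parameters(X.shape[0], n_h, Y.shape[0])
    for i in range(iterations):
        A1, A2 = forward(X, p)
        grads = backward(X, Y, A1, A2, p)
        for k in p:                      # gradient descent update
            p[k] -= alpha * grads[k]
    return p

# Tiny usage example with made-up data (2 inputs, 4 examples, XOR-like labels)
X = np.array([[0.0, 0.0, 1.0, 1.0], [0.0, 1.0, 0.0, 1.0]])
Y = np.array([[0.0, 1.0, 1.0, 0.0]])
params = train(X, Y)
_, predictions = forward(X, params)
print(np.round(predictions, 2))
```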

Reference: deeplearning.ai
