# Back Propagation: A Mathematical and Intuitive Explanation

**Author: Oluwole Oyetoke ***(18th August, 2017)*

## 𐐒AƆK PROPAGATION

Often times we hear about the backpropagation technique used during the training process of ANNs. At the snap of a finger, we can utilize backpropagation for Neural Nets Training on tools such as TesnorFlow, MATCOVNET, Caffe etc. For me, I decided to go a little bit deeper to understand what really goes on underneath.

**Diagram 1:**Back Propagation At A Glance

The backward propagation of errors or backpropagation, is a common method of training artificial neural networks and used in conjunction with an optimization method such as gradient descent. After the pioneering work of Rosenblatt and others, no efficient learning algorithm for multilayer or arbitrary feedforward neural networks was known. This led some to the premature conclusion that the whole field of ANN had reached a dead-end. The rediscovery of the backpropagation algorithm in the 1980s, together with the development of alternative network topologies, led to the intense outburst of activity which put neural computing back into the mainstream of computer science. The backpropagation algorithm was a major milestone in machine learning because, before it was discovered, optimization methods were extremely unsatisfactory. One popular method was to perturb (adjust) the weights in a random, uninformed direction (increase or decrease) and see if the performance of the ANN improved. If it did not, one would attempt to either go in the other direction, educe the perturbation size or a combination of both. Another attempt was to use Genetic Algorithms to evolve the performance of the neural networks. Unfortunately, in both cases, without (analytically) being informed on the correct direction, results and efficiency were suboptimal. This was where the backpropagation algorithm came into play by repeating a two-phase cycle (propagation and weight update) through which the network is able to learn on its own with less human trials and error approaches. Essentially, the two phases of the back propagation can be intuitively viewed as the steps below.

- Feed-forward computation
- Back propagation to the output layer
- Back propagation to the hidden layer
- Weight updates

When input data is presented to a Neural Network, it is propagated layer-by-layer through the network up-to the output layer. This is called forward propagation. Using a cost function, the output of the network is then compared to the desired output and an error value is calculated for each of the neurons in the output layer. The error values are then propagated backwards, and used to adjust the weight value of each of the neurons which make up the layers of the neural network. Backpropagation uses these error values (from each of the output neurons) to compute the gradient of the cost function with respect to all the weights in the network. With this, as the network is trained, the neurons in the intermediate layers organize themselves in such a way that the different neurons learn to recognize different characteristics of the total input space. After the training process, when a previously unseen input pattern is presented to the network, neurons in the hidden layer of the network will respond with an active output if the input contains already learned features. It is important to note that backpropagation requires a known output for every input value to the network in order to calculate the cost function’s gradient. This explains why it is usually referred to as a supervised learning method.

The diagram 2 this post shows a graphical representation of the gradient descent operation on the cost function during the back-propagation process. In summary, with the backpropagation methodology, the loss function calculates the difference between the output of the network after being fed with input training samples against the network’s expected output. Note that these outputs are numeric values and are mainly dependent on the various mathematical computations within the various layers of the NN.

## ANN Weight Update Computation (cost function)

In the ANN (supervised learning), the output of the network’s computation is compared to the desired output and by using a desired cost (loss) function an error value is calculated for each of the neurons in the output layer. There are different kinds of cost functions that can be used, some of which are listed below

- Mean Squared Error (MSE) Cost Function
- Cross Entropy Cost Function
- Kullback-Leibler Divergence
- Exponential Cost
- Hellinger Distance
- Itakura–Saito distance

Cost function as the name implies are functions which help us determine the level of divergence an output is compared to a desired output, consequently helping us draw a line of best fit through a stream of data. The cost function on its own is a mathematical hypothesis which minimizes the error between a set of given input and outputs. (1) below shows the hypothesis formula for a linear regression with one variable which predicts that output y is a linear function of input x.

$$h_0(x) = {\theta}_0 + {\theta}_1x \dots (1)$$

Where: $$h_0(x) = Hypothesis$$

$$\theta_i = Parameters of the Model$$

$$x = Input of the Model$$

(2) below describes the hypothesis for data set with more than one input variable

$$ h_0(x^i) = {\theta}_0 + {\theta}_1x^i \dots(2) $$

The cost function seeks to find a valid set of 𝜃 so that h_{0}(x) is close to output y for every piece of input x supplied to the system i.e. in a supervised learning environment. Therefore, minimizing the error between the hypothesis and the reality by choosing suitable 𝜃. This can be achieved by (3) below

$$ \min\limits_{\theta_0 \theta_1} = J(\theta_0, \theta_1) = {1 \over 2} \sum_{i=1}^m {(h_\theta (x^i) - y^i)^2} \dots (3)$$

$$M=Number-of-training-set$$

Thinking intuitively through the algorithm of this function, different values of 𝜃 which make up the hypothesis are chosen and tested in equation 14 above to derive different cost values. This explains why it is also called the squared error function. This square isn’t there for no reason, as it allows the result to be quadratic. We know that a quadratic function when plotted will always have a visible ‘u’ shape making it convex, having only one minimum, unlike non-quadratic graphs which could have local and global minimums. This feature will come in handy should the gradient descent algorithm need to be used. For the ANN, the hypothesis h_{ 𝜃 }and output y are represented by the actual output of the Neural Network (NN) and the targeted output of the network. The entire error calculation for one sweep of training data across the network (1-Epoch) is realized by analysing the error/difference at the output nodes. i.e. comparing the difference in the networks output value for each of the output node(s) n against the targeted/expected output. This error is mathematically represented by the (4) below

$$E_p = {1 \over 2} \sum_1^n (T_n-N_n)^2 \dots (4)$$

$$N_n = Network Values$$

$$T_n = Target Values$$

P represents the sweep number e.g. at p=1, this represents the first sweep of training data across the NN. In other words, to derive the total error over the entire training iterations, equation r below can be used

$$ E = \sum_1^p E_p = {1 \over 2} \sum_1^p \sum_1^n ({T_n}^p - {N_n}^p)^2 \dots (5) $$

E is total error, and p represents all forward propagation iteration and n = number of output nodes. A normalized version of the total error is given by the MSE in (6) below

$$E = \sum_1^p E_p = {1 \over 2pn} \sum_1^p \sum_1^n ({N_n}^p-{T_n}^p)^2 \dots (6)$$

By computing the partial derivative of the total loss with respect to the individual weights at the connection between each of the individual neurons that make up the system, we can obtain how much a change in that weight value affects the total error E. The goal of backpropagation is to update each of the weights in the network so that they cause the actual output to be closer the target output, thereby minimizing the error for each output neuron and the network as a whole. However, to get the most minimal error, it is wise to also get the most minimum value from all the possible cost functions generated for a set of data. This can be gotten by using gradient descent which is a more general algorithm which can be used to minimize cost functions and even other functions. There exist also other algorithms which can be used in place of the gradient descent which are

The objective of this method is to find better training directions by using the second derivatives of the loss function.__Newton's Method:__This method build up an approximation to only information on the first derivatives of the loss function__Quasi-Newton Method:__The conjugate gradient method can be regarded as something intermediate between gradient descent and Newton's method. Here, the training directions are conjugated with respect to the used Hessian matrix__Conjugate Gradient:__Also known as the damped least-squares method. It works with the gradient vector and the Jacobian matrix.__Levenberg – Marquardt algorithm::__

The slowest training algorithm is usually gradient descent, but it is the one requiring less memory. On the contrary, the fastest one might be the Levenberg-Marquardt algorithm, but usually requires a lot of memory. A good compromise might be the quasi-Newton method

Using the gradient descent, maximally minimizes the total networks output error, every time we want to update our weights, we subtract the derivative of the cost function w.r.t. the weight itself, scaled by a learning rate from the current weight. In the case where there is more than one neural connection to the node to deal with, it means the weight values have to be simultaneously updated.

$$w := w - \alpha{\partial E \over \partial w} \dots (7)$$

Where: $$ w = weight $$

$$ E = cost function$$

$$ \alpha = learning rate$$

**Diagram 2:**Back Propagation Per Neuron

For an arbitrary node as the diagram above which has an input connection weight w, net input ‘in’ an output of ‘o’, with an expected target output of t and constitutes the part of a network with an MSE of E, the deferential of the MSE with respect to its input connection weight w is given by (8) below based on chains rule

$$ {\partial E \over \partial w}={\partial E \over \partial o} * {\partial o \over \partial in} * {\partial in \over \partial w} \dots (8) $$

In words, this can be read out as:

# “

How much does the total error change with respect to the output * How much does the output ‘o’ change with respect to its total net input * how much does the total net input of in change with respect to the input weight w

# "

By using the quotient rule given in (9) below, the individual differentials in equation t above are reduced as described in (10), (11) and (12) below

$$ {d{ \begin{Bmatrix} f(x) \over g(x) \end{Bmatrix}}} = {{g(x)f^|(x) - f(x)g^|(x)} \over {{(g(x))}^2}} \dots (9)$$

$$ {\partial E \over \partial o} = {-(t-o)} \dots (10)$$

$$ {\partial o \over \partial in} = {o(1-o)} \dots (11)$$

$${\partial in \over \partial w} = in \dots (12)$$

(8) above can be re-written as (13) below

$$ {\partial E \over \partial w} = {-(t-o) * o(1-o) * in} \dots (13)$$

Therefore to decrement the error and adjust the weight for every node, the new weight w(+1) is given by (14) below

$$w(+1) = w - \alpha (-(t-o) * o(1-o) * in) \dots (14) $$

Once a complete back propagation has been carried out and the weight adjusted, a new sample is fed to the network and its output is supervised. Effort is made to correct errors detected using the same back propagation technique until the near perfect classifier is generated. As we know, a supervised learning algorithm analyses the training data and produces an inferred function, which is called a classifier. This inferred function (made up of the network’s weights) should predict the correct output value for any valid input object.

By iterating through the network multiple times and minimizing the error effect of each input weight on the total error of the network, the learning algorithm gradually descends towards the least possible error value until convergence is reached. The learning rate controls how steep/fast the function steps down the gradients surface. It might be very difficult for the system to converge when the learning rate chosen is too large, however, on the other hand, choosing an extremely minimal learning rate would make the system take longer time to converge.

## Intuitive Analysis of The Gradient Descent Operation

As was highlighted in the sections above, the MSE formulae is designed with a squared in it (which is later cancelled out by the multiplication by 0.5) so that the result of the function is quadratic. It is known that a quadratic function when plotted will always have a visible ‘u’ shape making it convex, having only one minimum, unlike non-quadratic graphs which could have local and global minimums. As the gradient descent function adjust the weight values through the network, it systematically converges towards a set of weight values which will generate the least error difference. Note that the least error difference graphically can be found on the points corresponding to the lowest point of the ‘u’ shaped graph. What the derivate in the gradient descent formulae does is simply pick a point on the MSE plot based on the selected weight value and look for the slope of the line tangent to that point. In other words, (7) above can be re-written as (15) below.

$$ w := w- \alpha * (positive slope) \dots (15)$$

Therefore, at every iteration, the value of w reduces towards the minimum. At the local/global minimum, the derived slope will be zero, therefore bringing the system to convergence. There is no need to decrease the learning rate over time, because as one approaches the local minimum, the slope approaches 0 and the step becomes smaller.

**Diagram 3:**Gradient Descent

It is important to note that there are many laws (algorithms) used to implement the adaptive feedback required to adjust the weights during training. Backpropagation just happens to be one of the most commonly used and known. Training the ANN requires a conscious analysis, to ensure that the network is not over trained and when finally, the system has been correctly trained, and no further learning is needed, the weights can, if desired, be "frozen." In some systems, this finalized network is then turned into hardware so that it can be fast. Other systems don't lock themselves in but continue to learn while in production use.

**Diagram 4:**Supervised Training Using Back Propagation