Demystifying Neural Networks: A Mathematical Approach (Part 2)
My notes from the book ‘Make your own Neural Network’ by Tariq Rashid.
The neural network is a kind of technology that is not an algorithm. It is a network that has weights on it, and you can adjust these weights so that it learns. You teach it through trials — Howard Rheingold
This is the second article in the series Demystifying Neural Networks: A Mathematical Approach. Part 1 was an introduction to the world of neural networks and covered the basics needed to understand how a neural network works. If you haven’t read the first part, I highly recommend doing so, since it serves as a primer and a foundation for this second part.
In this article, we move further into the amazing world of neural networks. We shall work with examples and basic maths to get a strong foothold on concepts like feeding signals forward, backpropagation and gradient descent. By the end of the article, you will have a decent understanding of how a signal passes from the input to the output nodes of a neural network through the hidden layers, and how the network learns from its outputs. So let’s get started without much ado.
The flow of the article:
- Following Signals Through A Two-Layer Neural Network
- Following Signals Through A Three-Layer Neural Network
- Learning Weights From More Than One Node
- Backpropagating errors
- Updating the Weights using Gradient Descent.
Following Signals Through A Two-Layer Neural Network
To see how a neural network transmits a signal, let us work through an example. We will be working with a small neural network with only 2 layers, each consisting of 2 neurons. The process remains the same for networks with more layers and neurons, although the calculations get more complicated.
In the above figure we have:
- Two inputs entering the network with values 1.0 and 0.5. The layer where the inputs enter is called the input layer. The input layer doesn’t apply activation functions to the incoming signals; it only represents the inputs.
- Each node in the second layer transforms its inputs into an output using the sigmoid activation function (refer to Part 1 to learn more about activation functions).
- The second layer is where the actual calculations take place, i.e. the raw inputs from the connected nodes of the previous layer, moderated by the link weights, enter this layer.
Calculations at different layers:
Layer 1 (input layer): No calculations take place here.
Layer 2: The combined input from each node of the previous layer enters this layer. What are these combined inputs? Recall the formula for the sigmoid activation function,
y = 1 / (1 + e^(-x))
where x = combined input into a node. Thus, the representation at each node would be (if there were three input nodes):
- Calculations at Layer 2, node 1
x = (output from first node * link weight) + (output from second node * link weight)
x = (1.0 * 0.9) + (0.5 * 0.3)
x = 1.05
Let’s now calculate the output of the first node of layer 2 by applying the sigmoid function to this combined moderated input, i.e. y = 1 / (1 + e^(-1.05)). That is quite simple and comes out to be:
y = 1 / (1 + 0.3499) = 1 / 1.3499
y = 0.7408
- Calculations at Layer 2, node 2
Following the same principle as above, x for node 2 comes out to be 0.6 and y = 0.6457. Marked in the diagram, the results look like this:
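The node-by-node arithmetic above can be checked with a few lines of Python. This is only a sketch: the weights 0.9 and 0.3 into node 1 come from the text, while the weights 0.2 and 0.8 into node 2 are an assumption, chosen because they reproduce the stated combined input of 0.6. The helper names `sigmoid` and `node_output` are illustrative.

```python
import math

def sigmoid(x):
    """Logistic activation: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def node_output(inputs, weights):
    """Return (combined moderated input x, activated output y) for one node."""
    x = sum(i * w for i, w in zip(inputs, weights))
    return x, sigmoid(x)

inputs = [1.0, 0.5]          # signals from the two input nodes

weights_node1 = [0.9, 0.3]   # weights used in the text for node 1
weights_node2 = [0.2, 0.8]   # assumed weights for node 2 (they give x = 0.6)

x1, y1 = node_output(inputs, weights_node1)
x2, y2 = node_output(inputs, weights_node2)

print(f"node 1: x = {x1:.4f}, y = {y1:.4f}")
print(f"node 2: x = {x2:.4f}, y = {y2:.4f}")
```

Rounded to four decimal places, the outputs agree with the values 0.7408 and 0.6457 quoted above.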
Our calculations were fairly simple since we were only dealing with 2 layers. Imagine the numbers if there were hundreds of these layers. Fortunately, mathematics offers us a concise approach to simplify the calculations and that approach is called matrices.
It is assumed that readers have a general idea of matrices and matrix multiplication. In case you want to brush up your skills, I would recommend either the Khan Academy or the 3Blue1Brown videos on the topic.
The process to multiply two matrices is as follows:
Let’s now replace the above letters with our neural network calculations.
- The first matrix consists of weights in between the nodes of two layers.
- The second matrix consists of the signals of the first (input) layer.
- The third/output matrix is the combined moderated signal which travels to the nodes of the second layer.
Visually the above equation will make more sense.
Phew… a lot of calculations. Interestingly, however, it can all be compactly written as:
X = W.I
where
W = weights matrix
I = inputs matrix
X = resultant matrix of combined moderated inputs into layer 2
As for the activation function, we can simply apply the sigmoid function to every element of the matrix X. The final output from layer 2 would now be:
O = sigmoid(X)
where O is the matrix which contains all the outputs from the final layer of the neural network concerned.
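As a rough sketch, the same two-layer calculation can be written in matrix form using plain Python lists. In practice a numerical library would usually be used for this. The helper `matvec` is a hypothetical name, and the second row of `W` (0.2, 0.8) is an assumption consistent with the value x = 0.6 quoted earlier.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Multiply matrix W (a list of rows) by the column vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

# Row r holds the weights on the links arriving at node r of layer 2.
W = [[0.9, 0.3],
     [0.2, 0.8]]   # second row is assumed, see the note above
I = [1.0, 0.5]     # input signals

X = matvec(W, I)               # combined moderated inputs into layer 2
O = [sigmoid(x) for x in X]    # sigmoid applied to every element of X

print("X =", [round(x, 4) for x in X])
print("O =", [round(o, 4) for o in O])
```

Each row of the weight matrix produces one entry of X, so adding nodes only adds rows and columns; the expression itself does not change.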
Following Signals Through A Three-Layer Neural Network
We have understood the concept of layers, weights and the flow of signals in a two-layered network. Let us apply the principles learnt in the previous section to a three-layered neural network. The diagram below depicts a neural network consisting of 3 layers, each having 3 nodes.
Please note that not all weights are marked in the diagram, for visual clarity.
Let us first understand the terminologies used here.
- Input Layer: The first layer is called the input layer (we already know that from the previous section).
- Output Layer: The final layer is called the output layer.
- Hidden Layer: The middle layer is called the hidden layer.
How does an input signal flow in a three-layered network? We will work through the network shown in figure 2.
The input can be represented by the matrix I and is equal to:
As for the middle hidden layer, we need to calculate the combined moderated signals into each node of this layer, given by X = W.I. Here I represents the input layer matrix and W represents the matrix of weights. In figure 2 we only showed some of the weights for simplicity, but here we have the complete matrix. These are just random weights chosen for the example.
W(input-hidden) are the weights between the input and hidden layer, hence the notation.
Likewise, W(1,1) will be the weight between the first input node and the first node of the hidden layer and so on.
Just like the W(input-hidden) matrix, we need another matrix that consists of the weights for the links between the hidden layer and the output layer. Keeping in line with the notation, we shall name it W(hidden-output).
Having worked out the weight matrices, it is time to calculate the combined moderated inputs to the hidden layer. Let us call this input X(hidden), which is given by the formula X(hidden) = W(input-hidden).I
Putting in the values and then visualising them, we obtain the following results:
So, we have calculated the moderated inputs to the nodes of the hidden layer. These nodes, as we know, apply a sigmoid activation function to the inputs and generate an output. Let us denote the output signals from the hidden layer by O(hidden), which is calculated as O(hidden) = sigmoid(X(hidden))
Now, applying the sigmoid function to each element of the X(hidden) matrix:
where sigmoid = 1/(1+e^-x)
Notice how all the values of O(hidden) are between 0 and 1, a property of the sigmoid function.
So far, we have followed the signal through the middle hidden layer. It’s now time to work with the third layer of the network, the output layer.
The approach is the same as for the previous layer; in fact, it is the same for any number of layers. For each layer, we combine the incoming signals, use the link weights to moderate them and then apply the activation function to obtain the output of that layer. It’s that simple.
For the Output layer:
- Inputs = Outputs from the hidden layer i.e O(hidden)
- Weights = weights for links between second and third layer i.e. W(hidden-output)
Thus, putting the appropriate values into the general equation X = W.I, we get X(output) = W(hidden-output).O(hidden), i.e. the matrix of combined inputs to the final layer.
In the final leg of the calculations, let’s apply the activation function to X(output):
The final output of the three layered neural network is as follows:
We have followed the signal from initial input into the first layer, through the hidden layers and finally out of the final layer.
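The layer-by-layer recipe above (combine with X = W.I, then apply the sigmoid) can be sketched as a small loop. The input values and both weight matrices below are illustrative assumptions, since the figures carrying the article's actual numbers are not reproduced here; `feed_forward` and `matvec` are hypothetical helper names.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Multiply matrix W (a list of rows) by the column vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def feed_forward(inputs, weight_matrices):
    """Pass a signal through successive layers: X = W.I, then O = sigmoid(X)."""
    signal = inputs
    for W in weight_matrices:
        signal = [sigmoid(x) for x in matvec(W, signal)]
    return signal

# Illustrative values only (assumed for this sketch).
I = [0.9, 0.1, 0.8]
W_input_hidden = [[0.9, 0.3, 0.4],
                  [0.2, 0.8, 0.2],
                  [0.1, 0.5, 0.6]]
W_hidden_output = [[0.3, 0.7, 0.5],
                   [0.6, 0.5, 0.2],
                   [0.8, 0.1, 0.9]]

O_hidden = feed_forward(I, [W_input_hidden])
O_output = feed_forward(I, [W_input_hidden, W_hidden_output])

print("O(hidden) =", [round(o, 3) for o in O_hidden])
print("O(output) =", [round(o, 3) for o in O_output])
```

Because every layer follows the same two steps, supporting a deeper network only means passing a longer list of weight matrices.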
Learning Weights From More Than One Node
What do we do after obtaining an output from the neural network? Well, the next step is to compare that output with the training example to calculate the difference, i.e. the error. It is this error that will help us refine the network and in turn improve the outputs. In this section, we will try to understand this concept in detail.
In Part 1 of this series, we discussed in detail how we could refine a simple classifier by using the error as feedback. The idea was to minimise the error. That was an easy task due to the simplicity of the network design, as there was only one input node. But how do we refine the weights if we have more than one input node?
To solve the problem in figure 3, we can consider three scenarios:
- Scenario 1
Use the error to update only one weight. This doesn’t seem plausible as both the weights have contributed to the error.
- Scenario 2
Split the error equally amongst all the contributing nodes. This will amount to something like this:
- Scenario 3
An even better option would be to split the error in proportion to the value of link weights.
The third scenario appears to be most accurate since it is giving correct weightage to the contributing link weights. So let us stick with this idea and use it for training our network. We will see more on this in the next section.
Weights are used in 2 ways in a network:
- To propagate the signals forward from input to the output layers
- To propagate the error backwards from the output back into the network
The process of distributing the error calculated at the output backwards into the network, for the purpose of refining the weights, is called Backpropagation. It is shorthand for “the backward propagation of errors,” since an error is computed at the output and distributed backwards throughout the network’s layers.
Backpropagating Errors from more than one Output Node
Let us add another output node to the network in figure 3 to understand the concept of backpropagation clearly.
In the above diagram, we find that the output nodes have an error of e1 and e2 respectively. We will split the output node’s error in proportion to the value of link weights associated with them.
- e1 and e2 are the errors at the first and the second output nodes respectively, and can be written mathematically as e1 = t1 - o1 and e2 = t2 - o2, where t is the target value and o is the actual output of the node.
It can be clearly seen from the figure above that e1 should be split in proportion to the connected links, which have weights w11 and w21 respectively. The same holds true for e2, with weights w12 and w22.
Calculating the proportions to split the errors:
- The fraction of e1 used to update w11: w11 / (w11 + w21)
- The fraction of e1 used to update w21: w21 / (w11 + w21)
We can reinforce the above idea with a simple example:
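A minimal sketch of this proportional split follows. The error value and the two weights are hypothetical numbers chosen for illustration, and `split_error` is an assumed helper name. The sketch also assumes the weights do not sum to zero.

```python
def split_error(error, weights):
    """Share one output node's error across its incoming links,
    in proportion to the link weights."""
    total = sum(weights)
    return [error * w / total for w in weights]

# Hypothetical numbers for illustration only.
e1 = 0.9
w11, w21 = 6.0, 3.0

share_w11, share_w21 = split_error(e1, [w11, w21])
print("share of e1 for w11:", share_w11)
print("share of e1 for w21:", share_w21)
```

Here w11 is twice as large as w21, so it receives two thirds of the error and w21 receives the remaining third.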
So now we know how to use errors to refine the parameters of a network when there is more than one output node. But what do we do when we have more than two layers in our network?
Backpropagating Errors to more than one layer.
Let us add a hidden layer to figure 6 and analyse it. We now have a three-layered neural network consisting of an input, hidden and an output layer.
Elaborating on the notations:
We will simply take the errors associated with the output of the hidden layer nodes e (hidden), and split those again proportionately across the preceding links between the input and hidden layers w (ih).
However, there is a catch. We only have target values for the output layer nodes, which come from the training examples. We have no target values for the hidden layer nodes, yet we need an error for them too in order to update the weights in the previous layer. We can’t say the error is the difference between the desired target output of those nodes and their actual outputs, because our training examples only give us targets for the very final output nodes. Let’s now work on this problem to find a possible solution.
One way could be to recombine the split errors for the links using the error backpropagation we just saw earlier. So the error in the first hidden node is the sum of the split errors in all the links connecting forward from the same node. We can visualise this idea as:
Writing this down:
Backpropagating these errors back in the network.
Let’s follow one error back. You can see the error 0.5 at the second output layer node being split proportionately into 0.1 and 0.4 across the two connected links which have weights 1.0 and 4.0. You can also see that the recombined error at the second hidden layer node is the sum of the connected split errors, which here are 0.9 and 0.4, to give 1.3.
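The split-and-recombine step can be sketched as below. The second output node (error 0.5, weights 1.0 and 4.0) is taken from the text. The first output node's error of 1.5 and weights of 2.0 and 3.0 are assumptions, chosen because they reproduce the split error of 0.9 mentioned above.

```python
def split_error(error, weights):
    """Share an output node's error across its incoming links by weight."""
    total = sum(weights)
    return [error * w / total for w in weights]

# incoming[k] lists the weights on the links arriving at output node k,
# ordered by the hidden node they come from.
incoming = [[2.0, 3.0],      # assumed weights into output node 1
            [1.0, 4.0]]      # weights into output node 2 (from the text)
output_errors = [1.5, 0.5]   # e1 is assumed; e2 = 0.5 is from the text

splits = [split_error(e, ws) for e, ws in zip(output_errors, incoming)]

# Recombine: hidden node j collects its share from every output node.
hidden_errors = [sum(s[j] for s in splits) for j in range(2)]

print("split errors :", splits)
print("hidden errors:", hidden_errors)
```

Under these assumptions the second hidden node receives 0.9 + 0.4 = 1.3, matching the figure quoted in the text.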
Similarly, applying the same idea to the previous hidden layer, we get:
Backpropagation expressed as Matrix
The important thing to note here is the multiplication of the output errors with the linked weights. The larger the weight, the more of the output error is carried back to the hidden layer. That’s the important bit. The denominators of those fractions are a kind of normalising factor. If we ignored that factor, we’d only lose the scaling of the errors being fed back.
The weight matrix obtained in the above equation is the transposed form of the actual weight matrix.
Writing this in the form of an equation, we finally get e(hidden) = W(hidden-output)^T . e(output)
The errors being fed back respect the strength of the link weights, because that is the best indication we have of sharing the blame for the error.
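A short sketch of the matrix form, in which the normalising denominators are dropped and the hidden errors are obtained by multiplying the transposed weight matrix with the output errors. The weights and errors are illustrative assumptions, and `transpose` and `matvec` are hypothetical helper names.

```python
def transpose(M):
    """Swap the rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]

def matvec(M, v):
    """Multiply matrix M (a list of rows) by the column vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Row k holds the weights on the links arriving at output node k.
# Illustrative values, assumed for this sketch.
W_hidden_output = [[2.0, 3.0],
                   [1.0, 4.0]]
e_output = [1.5, 0.5]

# e(hidden) = W(hidden-output)^T . e(output), without normalisation
e_hidden = matvec(transpose(W_hidden_output), e_output)
print("e(hidden) =", e_hidden)
```

The values are larger than the normalised ones because the scaling has been dropped, but the hidden node attached to the stronger links still receives the larger share of the blame.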
Updating the Weights
Let us just summarize in points what we have learnt so far:
1. We learnt how to feed forward an input signal in a neural network
2. We also learnt that a neural network learns by refining the link weights which is guided by the error between predicted and actual values. The network propagates errors backwards from output into the network, constituting the backpropagation mechanism.
3. The error at the output layers is split in proportion to the size of the connected weights and is then recombined at each internal node.
However, we have so far sidestepped a vital question:
How do we actually update the link weights in a neural network?
This means that we need to devise a mathematical relationship between the weights and the error, so that we know how changing one changes the other. We are in fact trying to minimise the neural network’s error, and the parameters we are trying to refine are the network’s link weights, of which there can be many. To achieve this, we are going to use a commonly used optimisation algorithm called Gradient Descent.
Gradient Descent: A brief intro
We will not be going into detail of this algorithm since it is a whole article in itself. But we will scratch its surface so that its importance in the current scenario is understood.
Gradient descent is an iterative optimisation algorithm used to find the minimum value for a function. The function to be minimised in this case is the error of the network, and by minimising it, we will be improving upon the output. We shall try and understand this concept with a small example.
Consider a function y, where
Let’s say, y is the error and we want to find that value of x that minimizes it.
How does Gradient Descent work?
- It randomly chooses a starting point like the red dot in the graph
- The slope at that point (drawn as a tangent to the graph) is marked, and is negative in this scenario.
- Since the slope is negative, moving downhill means moving to the right along the x-axis. This means we are increasing x a little.
- In this way, we have taken a step closer to the actual minimum. The size of the step is an important factor in reaching the minimum and is called the learning rate. If the learning rate is large, we might reach the minimum in a few steps, but there is a chance of overshooting. On the contrary, if the learning rate is too small, it might take forever. Thus, it is a good idea to moderate the step size as the function gradient gets smaller, which is a good indicator of how close we are to a minimum.
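The steps above can be sketched in a few lines. The curve y = (x - 1)^2 + 1 is a hypothetical stand-in with its minimum at x = 1, the starting point x = -1 is fixed rather than random so that the run is repeatable, and the function names are assumptions made for this sketch.

```python
def y(x):
    # Hypothetical error curve with its minimum at x = 1.
    return (x - 1.0) ** 2 + 1.0

def dy_dx(x):
    # Slope of the curve above at the point x.
    return 2.0 * (x - 1.0)

def gradient_descent(x_start, learning_rate, steps):
    """Repeatedly step against the slope, scaled by the learning rate."""
    x = x_start
    for _ in range(steps):
        x -= learning_rate * dy_dx(x)   # negative slope => x increases
    return x

# A moderate learning rate settles close to the minimum.
x_min = gradient_descent(x_start=-1.0, learning_rate=0.1, steps=100)

# Too large a learning rate overshoots further on every step.
x_overshoot = gradient_descent(x_start=-1.0, learning_rate=1.1, steps=20)

print("moderate rate ends near x =", round(x_min, 6))
print("large rate ends at x =", round(x_overshoot, 2))
```

At the starting point the slope is negative, so the first update increases x, as described in the steps above. With the larger learning rate each step lands further from the minimum than the last.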
Enough theory, let us get into some calculations. Look at the table below, which contains the actual vs the predicted values for three output nodes:
We have squared the error term for several possible reasons:
- We will be working with a derivative of the error function, and it is easy to work with a square term in derivatives.
- Taking the square also makes the error function continuous and smooth.
The aim of the network is to minimize the error by tweaking the weight parameter. Let us picture these two entities in the form of a graph.
The graph is similar to figure 8, the only difference being that the function to be minimised here is the Error and the parameter to be refined is the link weight. So this means we can use Gradient Descent to minimise the error.
- The figure above is the actual representation of the problem at hand, and we shall be referring to it quite often. How does error E change as the weight changes i.e. what is the slope of the error function that we want to move towards a minimum?
- Expanding the error function (the sum of the squared differences between the target and actual values) over the n output nodes:
- Looking at figure 10 again, we notice that output at any node n is o(n) and is only dependent on the links connected to it. This means that for a node k, the output o(k) would only depend on the weights w(jk) and not on weights w(ij) as they aren’t linked to it. But how does this realisation help us? Well, this means we do not need to consider the entire output for all the nodes but only for interconnected ones. This simplifies our equation:
- In the above equation, the part t(k) is constant as the target values remain unchanged with weights meaning t(k) is not a function of w(jk). So solving the above equation using calculus:
- Output (if we recall) is the sigmoid function applied to the weighted sum of the connected inputs.
- Differentiating the sigmoid function and putting the value in the previous equation
We have successfully described the slope of the error with respect to the link weights between the hidden and the output layers. This is the mathematical relationship that we wanted. We have got rid of the 2 at the front, since we are only interested in the direction of the slope and a constant factor such as 2 or 3 does not change it.
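For reference, the chain of steps described above can be written out in one place. This is a reconstruction from the surrounding text, using t for targets, o for outputs and w(jk) for the weight between hidden node j and output node k. The symbol σ for the sigmoid function is notation assumed here.

```latex
\begin{align*}
E &= \sum_{n} (t_n - o_n)^2 \\
\frac{\partial E}{\partial w_{jk}}
  &= \frac{\partial}{\partial w_{jk}} (t_k - o_k)^2
   = -2\,(t_k - o_k)\,\frac{\partial o_k}{\partial w_{jk}} \\
o_k &= \sigma\Big(\sum_j w_{jk}\, o_j\Big),
  \qquad \sigma'(x) = \sigma(x)\,\big(1 - \sigma(x)\big) \\
\frac{\partial E}{\partial w_{jk}}
  &= -2\,(t_k - o_k)\;
     \sigma\Big(\sum_j w_{jk}\, o_j\Big)
     \Big(1 - \sigma\Big(\sum_j w_{jk}\, o_j\Big)\Big)\; o_j
\end{align*}
```

Dropping the constant factor 2 leaves the three-part expression: the error, the sigmoid term and the output of the previous layer.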
This expression as it turns out, is quite simple to understand if broken down in parts. I have colour coded it to differentiate the equation in three distinct parts.
- Red: it is merely the error, i.e. target - output.
- Green: The expression inside the sigmoid is the signal fed into the final output layer node before the activation function is applied. It could also have been written as i(k).
- Blue: it is the output from the previous hidden layer node j.
Please note that we are referring to the terminology used in figure 10 here.
The expression that we have obtained is for refining the weights between the hidden and the output layers. But we also need to find the expression for the weights between the input and the hidden layers.
This shouldn’t be much of a problem, since we have already done much of the hard work and it is only a matter of substitution now. For the new expression:
- First part would be the recombined back propagated error out of the hidden nodes. We shall call it e(j)
- In the Second part, the sum expressions inside refer to the preceding layers, so the sum is over all the inputs moderated by the weights into a hidden node j. We could call this i (j).
- The last part is now the output of the first layer of nodes o (i), which happen to be the input signals.
Thus, the slope of the error function for the weights between the input and hidden layers is:
Having obtained these expressions, we can now update the weights after each training example: the new weight is the old weight minus alpha times the slope of the error with respect to that weight.
The symbol alpha is a factor that moderates the strength of the changes so that we do not overshoot the minimum. It is called the Learning Rate.
We can also perform these calculations using matrices, and that should make our lives much simpler.
The learning rate has been omitted from the matrices above since it is a constant and wouldn’t change depending upon our matrix’s configuration.
The matrix form of these weight update matrices is as follows and we can readily implement it in any programming language of our choice.
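As a hedged sketch of how such a matrix update might be implemented with plain Python lists: each weight W[k][j] is changed by alpha * e(k) * o(k) * (1 - o(k)) * o(j), which is the negative of the slope scaled by the learning rate. The function name `weight_update` and all the numbers below are hypothetical.

```python
def weight_update(W, errors, outputs, prev_outputs, learning_rate):
    """Return new weights where
    W[k][j] += learning_rate * e_k * o_k * (1 - o_k) * o_j.

    W[k][j] is the weight on the link from node j of the previous layer
    to node k of the current layer.
    """
    return [
        [
            W[k][j]
            + learning_rate * errors[k] * outputs[k] * (1.0 - outputs[k]) * prev_outputs[j]
            for j in range(len(prev_outputs))
        ]
        for k in range(len(errors))
    ]

# Hypothetical values for illustration only.
W = [[0.5, -0.2],
     [0.1, 0.3]]
errors = [0.2, -0.1]         # target - output at each node of this layer
outputs = [0.6, 0.7]         # sigmoid outputs of this layer
prev_outputs = [0.4, 0.9]    # outputs of the previous layer
alpha = 0.1

new_W = weight_update(W, errors, outputs, prev_outputs, alpha)
print(new_W)
```

A positive error nudges the incoming weights up and a negative error nudges them down, each in proportion to the signal that travelled along the link.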
Updating weight with an example
The theory is all well and good, but let’s see a real demonstration of the equations derived above. We will again be working with our favourite three-layered network. The weights have been assigned random values for the purposes of the example.
Aim: To update the weight w(11) between the first hidden layer node and the first output layer node, which currently has the value 2.0
Writing the expression for the slope of the error again and plugging in the values:
- Red part: error e(1) = 1.5
- Green part: the sum inside the sigmoid function is (2.0*0.4)+(4.0*0.5) = 2.8 and the value of the sigmoid is 1/(1+e^(-2.8)) = 0.943. So the middle expression is 0.943*(1 - 0.943) = 0.054
- Blue part: o(j) where j = 1, i.e. the output of the first hidden layer node. It is equal to 0.4.
Multiplying all the values together, and keeping the minus sign at the front of the expression, gives -(1.5 * 0.054 * 0.4) = -0.0324.
Now if we have a learning rate of 0.1, that gives us a change of -(0.1 * -0.0324) = +0.00324. So the new w(11) is the original 2.0 plus 0.00324 = 2.00324.
This is quite a small change, but over many hundreds or thousands of iterations, the weights will eventually assume such a value that the outputs from the network reflect the training examples.
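The worked example can be reproduced with a short script. It uses the values listed above (error 1.5, hidden outputs 0.4 and 0.5, weights 2.0 and 4.0, learning rate 0.1). The variable names are illustrative. Multiplying the three parts, 1.5 × 0.054 × 0.4, the sketch obtains a slope of roughly -0.0324.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

learning_rate = 0.1
w11 = 2.0               # weight being updated
w21 = 4.0               # other weight feeding the same output node
o_hidden = [0.4, 0.5]   # outputs of the two hidden layer nodes
e1 = 1.5                # error at the first output node (target - output)

x = w11 * o_hidden[0] + w21 * o_hidden[1]   # sum inside the sigmoid
s = sigmoid(x)

# slope = -(error) * sigmoid(x) * (1 - sigmoid(x)) * o(j)
slope = -e1 * s * (1.0 - s) * o_hidden[0]

# Move against the slope, scaled by the learning rate.
new_w11 = w11 - learning_rate * slope

print(f"sum = {x:.1f}, sigmoid = {s:.3f}")
print(f"slope = {slope:.4f}, new w11 = {new_w11:.5f}")
```

The slope is negative, so the update increases the weight slightly, as expected when the target is larger than the actual output.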
Phew! That was a lot of calculation, but the best part is that we managed to demystify the neural network from start to finish. You will not be doing any of these calculations by hand when solving a real problem with neural networks; there are state-of-the-art libraries and packages for that. However, it is good to know what is taking place under the hood, especially when it comes at the cost of only a few mathematical calculations.