Gradient Descent Nodes

As a loyal reader, you know that one of my major goals for the year is to work on my Deep Learning understanding. It’s going slowly but pretty well. One of the things that I have been wrestling with is understanding what is going on within a node, specifically Gradient Descent nodes.

Descending nodes down a gradient what?

Sorry, I seem to have gotten the cart before the horse. I am digging into regression deep learning problems. One of the ways to do this is using gradient descent. Technically I am using Stochastic Gradient Descent, but anyways. I am learning what is going on within the nodes themselves and how they interact with each other. Currently, I am perfectly capable of building a basic neural network. I want to build a better neural network. So here I am.

Got it, Gradient Descent Nodes, continue.

With that out of the way, let's focus on what I actually want to talk about. Remember, I am learning this, so my understanding may be incomplete or straight up wrong. I don't think it is, but I know I am not an expert, and you should know that too.

For each of the Gradient Descent nodes, here are the basics of a single node.

weight
input
goal
alpha

for x to m
    prediction = weight * input
    delta = prediction - goal
    mse = delta^2
    weightedDelta = delta * input
    alphaDelta = weightedDelta * alpha
    weight -= alphaDelta

Variables

So we have some definitions to go over. Weight is not the thing that is on my belly and my booty but the relative importance of the input, stored in list form.

Which brings us to input, which is a list of values used to do the training. The two lists should be the same size since each input has one weight. In this case we are doing a single-input node, so the size is 1.

For example, let's say you have three little pigs. Each pig has an input of the kind of house they want: straw, stick, and stone. Let's assign them number values to make it easy on the machine, so 0, 1, 2. We are a big bad wolf that would love some BBQ, Memphis, not KC style, so with dry rub in hand we need to get some pork. We need a pig.

So each building material has a chance of beating back hungry BBQ-loving, Memphis, not Carolina style, wolves. Since we are only looking at one input, we need to choose a building material. In this case, let's look at sticks. So we will give the machine some weights. Let's say .40. Why .40? Well, I am stretching this example already to its breaking point, so don't ask too many questions. With that settled, we have a weight and an input value.

Now we have a goal, and this goal is what percentage correct we want. Let's say we would like our house's building material to be perfect, but perfect is not a thing, so let's say 60%, making the goal .60. Put your hands down, you will break the theme of this example if you think too hard.

Finally we have the alpha. It is a big honking Jake brake on the adjustments made. This is done to prevent divergence, which is where the iterative step between each loop of the weight is so large that it causes an overcompensation. Because of how much the weight changes, it ends up in a death spiral, getting bigger and bigger because of that overcompensation. Use a tiny value for this to slow down the learning.
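To see the Jake brake in action, here is a little sketch of my own (the numbers and the `train` helper are made up for illustration) running the same update rule with a sane alpha and an oversized one:

```python
# Toy demo of why alpha matters: same node, two different alphas.
def train(weight, input_value, goal, alpha, iterations):
    for _ in range(iterations):
        prediction = weight * input_value
        delta = prediction - goal
        weight -= alpha * (delta * input_value)  # the alphaDelta step
    return weight

# Small alpha: the weight creeps toward the goal of 0.6.
print(train(0.4, 1, 0.6, 0.1, 50))

# Huge alpha: every step overshoots more than the last one did,
# so the weight spirals away from the goal instead of toward it.
print(train(0.4, 1, 0.6, 2.5, 10))
```

With alpha at .1 the error shrinks every pass; at 2.5 each correction is bigger than the miss it was fixing, which is the death spiral described above.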

Loop

This for loop has x, which is what iteration we are on, going to m, which is just a really big number. Just any number you think is really big. In a real problem you would not use a really big number; you would use a goal, like stopping once mse, the mean squared error, drops below some value. We are keeping it simple, stupid, so pucker up.
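That real-world stopping rule might look something like this sketch, where the loop runs until the mse falls below a cutoff instead of counting up to some big m (the cutoff here is a value I made up):

```python
# Loop until the mean squared error is small enough, instead of
# iterating a fixed "really big number" of times.
weight = 0.4
input_value = 1
goal = 0.6
alpha = 0.1

mse = float("inf")
iterations = 0
while mse > 1e-6:  # made-up cutoff; pick whatever "good enough" means to you
    prediction = weight * input_value
    delta = prediction - goal
    mse = delta ** 2
    weight -= alpha * (delta * input_value)
    iterations += 1

print(iterations, weight)  # stops on its own once the error is tiny
```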

The code

So now we are in the guts of our Gradient Descent node's code. The prediction is nothing more than the input value, say 1 since it is stick, times its weight, .40, which gives .40.

Delta is the error. Nothing more than your prediction, .40, minus .60, the goal, or -.20. We missed.

Since regression uses mean squared error, we need to get it. It is the delta squared. This is done in order to force it to a positive number. You do this because, let's say the correct value is 0 and the error is +-2, so the range is -2 to 2. When taking the average error you could calculate a -2 and a 2, which would give you zero when averaged. It looks like the correct answer, so you continue with a machine with a massive error that you thought was accurate because its average said no error. Square each value, which gives you 4 and 4, then average, and you get 4. Now you see the error. In our case it's 0.04.
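Here is that averaging trap as a quick snippet, using the same -2 and 2 numbers from above:

```python
# Two errors of -2 and 2. Averaging the raw deltas cancels them out,
# which makes a badly wrong model look perfect.
deltas = [-2, 2]

raw_average = sum(deltas) / len(deltas)
print(raw_average)   # 0.0 -- looks like no error at all

# Square first, then average: the misses can't cancel anymore.
mse = sum(d ** 2 for d in deltas) / len(deltas)
print(mse)           # 4.0 -- the real size of the miss
```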

If you read my article on the deep learning problem I was amused by, you will start seeing some common things. What I was talking about then is what I am talking about now. We want the delta to be weighted, and we can use the input to achieve this. In our case the input is 1, which gives us a weightedDelta of -.2.

Then we want to throw the brake on with the alpha, so we calculate the alphaDelta, which is the weightedDelta, -0.2, times the alpha, say .1, which is -.02.

Finally, we adjust the weight by subtracting the alphaDelta from it and go back to the top of the iteration. So our new weight is .4 - (-.02), or .42. Then we do it all over again with .42 instead of .4.
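Walking those same numbers through one pass by hand, as a sanity check (the names mirror the pseudocode above):

```python
# One full iteration of the update, with the article's numbers.
weight = 0.4
input = 1    # shadows Python's built-in input(); fine for a toy script
goal = 0.6
alpha = 0.1

prediction = weight * input         # 0.4
delta = prediction - goal           # -0.2 (give or take float fuzz)
mse = delta ** 2                    # 0.04
weightedDelta = delta * input       # -0.2
alphaDelta = alpha * weightedDelta  # -0.02
weight = weight - alphaDelta        # 0.4 - (-0.02)
print(weight)                       # 0.42000000000000004
```

That trailing float fuzz is why the printed weights later in this article carry all those extra digits.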

Wrapping Up

So that is kind of the basis for what I have been battling with over the last couple of days. I don’t know if this example is actually possible. So let’s find out.

Here is the code to drop the example into a Jupyter Notebook.

weight = .4
input = 1    # shadows Python's built-in input(); fine for a toy script
goal = .60
alpha = .10

for i in range(325):
  prediction = weight * input
  delta = prediction - goal
  mse = delta ** 2                  # tracked but not used to stop the loop
  weightedDelta = delta * input
  alphaDelta = alpha * weightedDelta
  weight = weight - alphaDelta
  print(weight)

It looks like it stalls out at 0.5999999999999994 around the 320 mark. If you run it 100,000 times you get the same thing, so I think we hit it as close as it can get hit. We can see it driving toward the correct number with every iteration. Here are the first 10 weights.

0.42000000000000004 
0.43800000000000006
0.45420000000000005
0.46878000000000003
0.48190200000000005
0.49371180000000003
0.50434062
0.513906558
0.5225159022
0.53026431198

It will keep moving toward that 0.5999999 value. You can do it with multiple inputs. It is the same thing, only you take the sum of each input times its weight to get your prediction, and then subtract each weight's own alphaDelta from it. At least that is what I think will happen.
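To test that hunch, here is a sketch of the multi-input version. The starting weights and inputs are numbers I made up; the structure is what matters: one shared prediction and delta, then a per-weight weightedDelta.

```python
# Multi-input node: the prediction is the sum of each input times its
# weight, there is one shared delta, and each weight gets its own
# weightedDelta based on its own input.
weights = [0.4, 0.2, 0.9]   # made-up starting weights
inputs = [1, 0, 2]          # made-up inputs, one per weight
goal = 0.6
alpha = 0.1

for _ in range(500):
    prediction = sum(w * i for w, i in zip(weights, inputs))
    delta = prediction - goal
    for j in range(len(weights)):
        weightedDelta = delta * inputs[j]
        weights[j] -= alpha * weightedDelta

prediction = sum(w * i for w, i in zip(weights, inputs))
print(prediction)   # settles at (or very near) the goal of 0.6
```

One nice detail that falls out: the weight paired with the 0 input never moves, because its weightedDelta is always zero.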

If you ever wondered what happened in those Gradient Descent nodes, now the great mystery has been solved. Now, I want some BBQ.
