00:00:00.000 Hello, my name is Andrej and I've been training deep neural networks for a bit more than a decade.
00:00:04.800 And in this lecture I'd like to show you what neural network training looks like under the hood.
00:00:09.520 So in particular we are going to start with a blank Jupyter notebook and by the end of this
00:00:13.920 lecture we will define and train a neural net in order to see everything that goes on under the
00:00:18.800 hood and exactly sort of how that works on an intuitive level. Now specifically what I would
00:00:23.600 like to do is I would like to take you through building of micrograd. Now micrograd is this
00:00:29.440 library that I released on github about two years ago but at the time I only uploaded this source
00:00:34.160 code and you'd have to go in by yourself and really figure out how it works. So in this lecture
00:00:40.160 I will take you through it step by step and kind of comment on all the pieces of it. So what's
00:00:44.560 micrograd and why is it interesting? Okay. Micrograd is basically an autograd engine. Autograd is
00:00:52.560 short for automatic gradient and really what it does is it implements backpropagation. Now
00:00:57.680 backpropagation is this algorithm that allows you to efficiently evaluate the gradient of
00:01:02.480 some kind of a loss function with respect to the weights of a neural network and what that
00:01:08.400 allows us to do then is we can iteratively tune the weights of that neural network to minimize the
00:01:13.120 loss function and therefore improve the accuracy of the network. So backpropagation would be at the
00:01:18.320 mathematical core of any modern deep neural network library like say PyTorch or jax. So the
00:01:24.240 functionality of micrograd is I think best illustrated by an example. So if we just scroll down here
00:01:28.640 you'll see that micrograd basically allows you to build out mathematical expressions and here
00:01:35.440 what we are doing is we have an expression that we're building out where you have two inputs A and B
00:01:39.920 and you'll see that A and B are negative four and two but we are wrapping those values into this
00:01:47.520 value object that we are going to build out as part of micrograd. So this value object will wrap
00:01:53.360 the numbers themselves and then we are going to build out a mathematical expression here where
00:01:58.000 A and B are transformed into C, D and eventually E, F and G and I'm showing some of the function
00:02:05.600 some of the functionality of micrograd and the operations that it supports. So you can add two
00:02:09.840 value objects, you can multiply them, you can raise them to a constant power, you can offset by one,
00:02:15.760 negate, squash at zero, square, divide by a constant, divide by it, etc. And so we're building out an
00:02:23.920 expression graph with these two inputs A and B and we're creating an output value of G and
00:02:30.720 micrograd will in the background build out this entire mathematical expression. So it will for
00:02:35.680 example know that C is also a value C was a result of an addition operation and the child
00:02:43.280 nodes of C are A and B, because C will maintain pointers to the A and B value objects.
00:02:49.840 So we'll basically know exactly how all of this is laid out and then not only can we do what we
00:02:55.280 call the forward pass where we actually look at the value of G of course that's pretty straightforward
00:02:59.840 we will access that using the dot data attribute and so the output of the forward pass the value
00:03:06.080 of G is 24.7 it turns out but the big deal is that we can also take this G value object and we can
00:03:13.040 call dot backward and this will basically initialize back propagation at the node G
00:03:18.480 and what back propagation is going to do is it's going to start at G and it's going to go
00:03:24.160 backwards through that expression graph and it's going to recursively apply the chain rule from
00:03:28.800 calculus and what that allows us to do then is we're going to evaluate basically the derivative
00:03:34.800 of G with respect to all the internal nodes like E, D and C but also with respect to the inputs
00:03:41.840 A and B and then we can actually query this derivative of G with respect to A for example that's A dot
00:03:49.120 grad in this case it happens to be 138 and the derivative of G with respect to B which also
00:03:54.800 happens to be here 645 and this derivative we'll see soon is very important information because it's
00:04:01.360 telling us how A and B are affecting G through this mathematical expression so in particular A
00:04:08.480 that grad is 138 so if we slightly nudge A and make it slightly larger 138 is telling us that G
00:04:16.800 will grow and the slope of that growth is going to be 138 and the slope of growth of B is going to
00:04:22.960 be 645 so that's going to tell us about how G will respond if A and B get tweaked a tiny amount
00:04:29.200 in a positive direction.
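(For reference, the usage being described here is essentially the example from the micrograd README; the snippet below is a close reconstruction of it rather than a verbatim copy.)

```python
from micrograd.engine import Value

a = Value(-4.0)
b = Value(2.0)
c = a + b
d = a * b + b**3
c += c + 1
c += 1 + c + (-a)
d += d * 2 + (b + a).relu()
d += 3 * d + (b - a).relu()
e = c - d
f = e**2
g = f / 2.0
g += 10.0 / f
print(f'{g.data:.4f}')  # the outcome of the forward pass (24.7041 in the README)
g.backward()            # run backpropagation starting at g
print(f'{a.grad:.4f}')  # dg/da (about 138.8 in the README)
print(f'{b.grad:.4f}')  # dg/db (about 645.6 in the README)
```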
00:04:36.400 Okay, now you might be confused about what this expression is that we built out here. This expression, by the way, is completely meaningless, I just made it up, I'm
00:04:40.880 just flexing about the kinds of operations that are supported by micrograd, and what we actually
00:04:45.520 really care about are neural networks but it turns out that neural networks are just mathematical
00:04:49.760 expressions just like this one, but actually slightly less crazy even. Neural networks are just a
00:04:55.760 mathematical expression: they take the input data as an input and they take the weights of a neural
00:05:00.960 network as an input, and it's a mathematical expression, and the output are the predictions of your neural
00:05:06.000 net or the loss function we'll see this in a bit but basically neural networks just happen to be a
00:05:11.360 certain class of mathematical expressions but the back propagation is actually significantly more
00:05:16.160 general, it doesn't actually care about neural networks at all, it only deals with arbitrary
00:05:20.720 mathematical expressions and then we happen to use that machinery for training of neural
00:05:25.600 networks. Now one more note I would like to make at this stage is that, as you see here, micrograd is a
00:05:30.240 scalar valued autograd engine so it's working on the you know level of individual scalars like
00:05:35.280 negative four and two and we're taking neural nets and we're breaking them down all the way to these
00:05:39.600 atoms of individual scalars and all the little pluses and times and it's just excessive and so
00:05:45.040 obviously you would never be doing any of this in production; it's really just done for pedagogical
00:05:49.440 reasons because it allows us to not have to deal with these n-dimensional tensors that you would
00:05:54.240 use in a modern deep neural network library. So this is really done so that you understand
00:06:00.400 backpropagation and the chain rule and what goes into neural network training, and then if you actually want
00:06:06.400 to train bigger networks you have to be using these tensors but none of the math changes this is
00:06:10.480 done purely for efficiency. We are basically taking all the scalar values, we're packaging
00:06:16.000 them up into tensors which are just arrays of these scalars and then because we have these large arrays
00:06:21.840 we're making operations on those large arrays that allows us to take advantage of the parallelism
00:06:26.640 in a computer and all those operations can be done in parallel and then the whole thing runs
00:06:31.440 faster but really none of the math changes and that's done purely for efficiency so I don't think
00:06:36.000 that it's pedagogically useful to be dealing with tensors from scratch, and I think that's why
00:06:40.720 I fundamentally wrote micrograd because you can understand how things work at the fundamental level
00:06:45.760 and then you can speed it up later okay so here's the fun part my claim is that micrograd is what
00:06:51.120 you need to train neural networks and everything else is just efficiency so you'd think that micrograd
00:06:55.920 would be a very complex piece of code and that turns out to not be the case so if we just go to
00:07:02.000 micrograd and you will see that there's only two files here in micrograd this is the actual engine
00:07:08.240 it doesn't know anything about neural nets, and this is the entire neural nets library on top of
00:07:13.040 micrograd, so engine.py and nn.py. So the actual backpropagation autograd engine that gives
00:07:21.600 you the power of neural networks is literally a hundred lines of code of like very simple python
00:07:28.880 which we'll understand by the end of this lecture, and then nn.py, this neural network library
00:07:35.120 built on top of the autograd engine, is like a joke. It's like, we have to define what is a neuron,
00:07:42.640 and then we have to define what is a layer of neurons, and then we define what is a multi-layer
00:07:46.640 perceptron, which is just a sequence of layers of neurons, and so it's just a total joke. So basically
00:07:52.720 there's a lot of power that comes from only 150 lines of code, and that's all you need to understand
00:07:58.960 to understand neural network training, and everything else is just efficiency. And of course there's a
00:08:03.760 lot to efficiency, but fundamentally that's all that's happening. Okay, so now let's dive right in
00:08:09.280 and implement micrograd step by step. The first thing I'd like to do is I'd like to make sure
00:08:13.120 that you have a very good understanding intuitively of what a derivative is and exactly what information
00:08:18.640 it gives you. So let's start with some basic imports that are copy-pasted into every Jupyter
00:08:23.520 notebook always, and let's define a scalar valued function f of x as follows.
00:08:30.800 So I just made this up randomly, I just wanted a scalar valued function that takes a single scalar
00:08:35.840 x and returns a single scalar y, and we can call this function of course, so we can pass in say 3.0
00:08:41.920 and get 20 back now we can also plot this function to get a sense of its shape you can tell from the
00:08:47.840 mathematical expression that this is probably a parabola it's a quadratic and so if we just create
00:08:54.160 a set of scalar values that we can feed in using for example a range from negative 5 to 5 and
00:09:01.840 steps of 0.25 so x is just from negative 5 to 5 not including 5 in steps of 0.25 and we can
00:09:11.760 actually call this function on this numpy array as well so we get a set of y's if we call f on x's
00:09:16.800 and these y's are basically the function applied to every one of these elements independently,
00:09:25.360 and we can plot this using matplotlib, so plt.plot of xs and ys, and we get a nice parabola.
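A minimal sketch of the cell being described, assuming the usual numpy/matplotlib imports:

```python
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return 3*x**2 - 4*x + 5

print(f(3.0))                # 20.0

xs = np.arange(-5, 5, 0.25)  # inputs from -5 up to (not including) 5, in steps of 0.25
ys = f(xs)                   # numpy applies f to every element independently
plt.plot(xs, ys)             # a parabola
```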
00:09:32.880 So previously here we fed in 3.0, somewhere here, and we received 20 back, which is here the y coordinate. So now I'd
00:09:39.360 like to think through what is the derivative of this function at any single input point x
00:09:44.800 right so what is the derivative at different points x of this function now if you remember
00:09:50.320 back to your calculus class you've probably derived derivatives so we take this mathematical
00:09:54.640 expression 3x squared minus 4x plus 5 and you would write it out on a piece of paper, and you would,
00:09:59.280 you know, apply the product rule and all the other rules and derive the mathematical expression
00:10:03.040 of the derivative of the original function, and then you could plug in different
00:10:07.040 x's and see what the derivative is. We're not going to actually do that, because no one in neural
00:10:13.200 networks actually writes out the expression for neural net it would be a massive expression
00:10:17.120 it would be, you know, tens of thousands of terms; no one actually derives the derivative,
00:10:22.560 of course and so we're not going to take this kind of symbolic approach instead what i'd like to do
00:10:27.040 is i'd like to look at the definition of derivative and just make sure that we really understand
00:10:31.120 what the derivative is measuring and what it's telling you about the function. And so if we just look
00:10:36.000 up derivative we see that um okay so this is not a very good definition of derivative this is a
00:10:46.080 definition of what it means to be differentiable but if you remember from your calculus it is the
00:10:50.560 limit as h goes to 0 of f of x plus h minus f of x over h. So basically what it's saying is: if you
00:10:58.400 slightly bump up, at some point x that you're interested in, if you slightly
00:11:03.440 increase it by a small number h, how does the function respond,
00:11:09.440 with what sensitivity does it respond what is the slope at that point does the function go up or
00:11:14.080 does it go down, and by how much, and that's the slope of that function, the slope of that response
00:11:20.080 at that point and so we can basically evaluate um the derivative here numerically by taking a
00:11:26.720 very small h. Of course the definition would ask us to take h to 0; we're just going to pick a very
00:11:31.760 small h, 0.001, and let's say we're interested in x equals 3.0, so we can look at f of x, which of course is 20.
00:11:39.040 and now f of x plus h so if we slightly nudge x in a positive direction how is the function
00:11:44.240 going to respond and just looking at this do you expect do you expect f of x plus h to be slightly
00:11:49.600 greater than 20 or do you expect to be slightly lower than 20 and since this 3 is here and this
00:11:56.320 is 20 if we slightly go positively the function will respond positively so you'd expect this to be
00:12:02.160 slightly greater than 20 and by how much is telling you the sort of the strength of that slope right
00:12:09.280 the size of the slope. So f of x plus h minus f of x, this is how much the function responded
00:12:15.280 in the positive direction and we have to normalize by the run so we have the rise over run to get the
00:12:22.480 slope so this of course is just a numerical approximation of the slope because we have to make a very
00:12:28.640 very small to converge to the exact amount. Now if I add too many zeros, at some point
00:12:36.160 I'm going to get an incorrect answer, because we're using floating point arithmetic and the
00:12:41.120 representations of all these numbers in computer memory is finite and at some point we get into
00:12:45.600 trouble so we can converge towards the right answer with this approach but basically at 3
00:12:52.560 the slope is 14, and you can see that by taking 3x squared minus 4x plus 5 and differentiating it
00:12:59.280 in our head: the derivative of 3x squared is 6x, minus 4, and then we plug in x equals 3, so that's 18
00:13:07.600 minus 4, which is 14. So this is correct, so that's at 3. Now how about the slope at say negative 3,
00:13:17.360 would you expect what would you expect for the slope now telling the exact value is really hard
00:13:22.640 but what is the sign of that slope so at negative 3 if we slightly go in the positive direction
00:13:29.120 at x the function would actually go down and so that tells you that the slope would be negative
00:13:33.840 so f of x plus h will come out slightly below f of x, and so if we take the slope we expect something
00:13:39.840 negative, negative 22. Okay, and at some point here of course the slope would be 0. Now for this
00:13:47.600 specific function I looked it up previously, and it's at x equals 2 over 3, so at roughly 2 over 3,
00:13:53.760 that's somewhere here um this this derivative would be 0 so basically at that precise point
00:14:01.120 yeah, at that precise point, if we nudge in a positive direction the function doesn't respond,
00:14:08.160 it stays the same almost, and so that's why the slope is 0. Okay, now let's look at a bit more
00:14:12.560 complex case so we're going to start you know complexifying a bit so now we have a function
00:14:18.320 here with output variable d that is a function of three scalar inputs a, b and c. So a, b and c are
00:14:27.280 some specific values, three inputs into our expression graph, and a single output d. And so
00:14:33.440 if we just print d we get four and now what i like to do is i'd like to again look at the
00:14:38.800 derivatives of d with respect to a, b and c, and think through again just the intuition of what
00:14:45.600 this derivative is telling us so in order to evaluate this derivative we're going to get a
00:14:51.040 bit hacky here we're going to again have a very small value of h and then we're gonna fix the inputs
00:14:57.360 at some values that we're interested in. So this is the point a, b, c at which we're going
00:15:03.600 to be evaluating the derivative of d with respect to a, b and c at that point.
00:15:08.800 so there are three inputs and now we have d1 is that expression and then we're going to for
00:15:14.560 example look at the derivative of d with respect to a so we'll take a and we'll bump it by h and
00:15:20.080 then we'll get d2 to be the exact same function, and now we're going to print, you know, d1 is d1,
00:15:30.000 d2 is d2, and print the slope. So the derivative, or slope, here will be of course d2 minus d1
00:15:42.880 divided by h. So d2 minus d1 is how much the function increased when we bumped the specific input that
00:15:52.960 we're interested in by a tiny amount, and this is then normalized by h to get the slope.
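The staging cell being described here looks roughly like this:

```python
h = 0.001

# inputs at the point where we evaluate the derivatives
a = 2.0
b = -3.0
c = 10.0
d1 = a*b + c       # the original output, 4.0

a += h             # bump a by a tiny amount
d2 = a*b + c       # the same expression, re-evaluated

print('d1', d1)
print('d2', d2)
print('slope', (d2 - d1) / h)   # rise over run; comes out to about -3 (= b)
```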
00:16:06.320 So from this we're going to print d1, which we know is four. Now in d2, a
00:16:17.760 will be bumped by h, so let's just think through a little bit what will be printed out here.
00:16:26.960 In particular, d1 will be four; will d2 be a number slightly greater than four or slightly lower than
00:16:35.200 four, and it's going to tell us the sign of the derivative. So we're bumping a by h;
00:16:44.000 b is minus three c is 10 so you can just intuitively think through this derivative and what it's
00:16:51.040 doing: a will be slightly more positive, but b is a negative number, so if a is slightly more
00:16:58.800 positive, because b is negative three, we're actually going to be adding less to d,
00:17:06.480 so you'd actually expect that the value of the function will go down so let's just see this
00:17:14.720 yeah, and so we went from four to 3.9996, and that tells you that the slope will be negative,
00:17:24.880 it will be a negative number because we went down, and the exact number for the
00:17:31.760 slope is negative three. And you can also convince yourself that negative three is the right answer
00:17:36.400 mathematically and analytically, because if you have a times b plus c and you know your
00:17:42.080 calculus, then differentiating a times b plus c with respect to a gives you just b, and indeed
00:17:49.120 the value of b is negative three which is the derivative that we have so you can tell that that's correct
00:17:54.480 so now if we do this with b so if we bump b by a little bit in a positive direction we'd get
00:18:01.440 different slopes so what is the influence of b on the output d so if we bump b by tiny amount
00:18:08.000 in a positive direction then because a is positive we'll be adding more to d right so and now what
00:18:16.240 is the what is the sensitivity what is the slope of that addition and it might not surprise you that
00:18:21.200 this should be two. And why is it two? Because dd by db, the derivative of d with respect to b,
00:18:30.080 would give us a, and the value of a is two, so that's also working well. And then if c gets
00:18:36.320 bumped by a tiny amount h, then of course a and b are unaffected and now c becomes slightly
00:18:43.280 higher. What does that do to the function? It makes it slightly higher, because we're
00:18:47.440 simply adding c, and it makes it higher by the exact same amount that we added to c,
00:18:53.120 and so that tells you that the slope is one. That will be the rate at which d will increase
00:19:02.320 as we scale c okay so we now have some intuitive sense of what this derivative is telling you
00:19:08.240 about the function and we'd like to move to neural networks now as i mentioned neural networks will
00:19:12.240 be pretty massive expressions mathematical expressions so we need some data structures
00:19:16.320 that maintain these expressions and that's what we're going to start to build out now
00:19:19.600 so we're going to build out this value object that i showed you in the read me page of micrograd
00:19:26.880 so let me copy paste a skeleton of the first very simple value object so class value takes a single
00:19:35.520 scalar value that it wraps and keeps track of and that's it so we can for example do value of 2.0
00:19:43.520 and then we can look at its content, and python will internally use the repr function
00:19:51.840 to return this string like that so this is a value object with data equals 2 that we're creating
00:20:02.160 here. Now what we'd like to do is we'd like to be able to have not just, like, two values;
00:20:10.080 we'd like to do a plus b, right, we'd like to add them. So currently you'd get an error, because
00:20:15.840 python doesn't know how to add two value objects. So we have to tell it; so here's addition:
00:20:22.960 so you have to basically use these special double underscore methods in python to define
00:20:30.560 these operators for these objects. So if we use this plus operator, python will
00:20:39.760 internally call a dot add of b; that's what will happen internally, and so b will be the other
00:20:47.280 and uh self will be a and so we see that what we're going to return is a new value object and it's
00:20:54.400 just going to be wrapping the plus of their data. But remember, now, because data is the actual,
00:21:02.400 like, python number, this operator here is just the typical floating point plus, addition;
00:21:08.720 it's not an addition of value objects, and we'll return a new value. So now a plus b should work,
00:21:15.360 and it should print value of negative one, because that's two plus minus three. There we go. Okay, let's
00:21:22.160 now implement multiply just so we can recreate this expression here so multiply i think it won't
00:21:28.080 surprise you, will be fairly similar, so instead of add we're going to be using mul, and then here
00:21:34.880 of course we want to do times. And so now we can create a c value object, which will be 10.0, and now
00:21:40.960 we should be able to do a times b well let's just do a times b first um that's value of negative six
00:21:49.680 now. And by the way, I skipped over this a little bit: suppose that I didn't have the repr
00:21:54.160 function here, then you'd just get some kind of an ugly expression. So what repr is
00:21:59.680 doing is it's providing us a way to print out a nicer looking expression in python,
00:22:05.440 so we don't just have something cryptic, we actually see, you know, it's value of negative six;
00:22:11.120 so this gives us a times b, and then we should now be able to add c to it, because we've defined
00:22:17.920 and told python how to do mul and add, and so this will basically be equivalent
00:22:23.280 to a dot mul of b, and then this new value object will be dot add of c, and so let's see if that worked.
00:22:34.640 yep so that worked well that gave us four which is what we expect from before and
00:22:39.440 I believe you can also call them manually as well, there we go. So yeah, okay, so now what we are missing
00:22:46.480 is the connective tissue of this expression. As I mentioned, we want to keep these expression
00:22:50.720 graphs so we need to know and keep pointers about what values produce what other values
00:22:55.840 so here for example we are going to introduce a new variable, which we'll call underscore children, and by
00:23:01.520 default it will be an empty tuple, and then we're actually going to keep a slightly different
00:23:05.360 variable in the class, which we'll call underscore prev, which will be the set of children.
00:23:10.320 This is how I did it in the original micrograd, looking at my code here. I can't remember
00:23:16.400 exactly the reason, I believe it was efficiency, but this underscore children will be a tuple for
00:23:20.960 convenience, but then when we actually maintain it in the class it will be just this set, yes, I believe
00:23:25.920 for efficiency. So now when we are creating a value like this with the constructor, children will be
00:23:33.440 empty and prev will be the empty set, but when we are creating a value through addition or
00:23:37.920 multiplication we're going to feed in the children of this value which in this case is self and other
00:23:44.640 so those are their children here. So now we can do d dot prev, and we'll see that the children of
00:23:55.840 d, we now know, are this value of negative six and value of ten, and this of course is the
00:24:01.440 value resulting from a times b and the c value which is ten now the last piece of information
00:24:08.400 we don't know: we now know the children of every single value, but we don't know what
00:24:12.400 operation created this value. So we need one more element here, let's call it underscore op,
00:24:19.120 and by default this is the empty string for leaves, and then we'll just maintain it here,
00:24:23.680 and now the operation will be just a simple string: in the case of addition it's plus,
00:24:30.800 in the case of multiplication it's times. So now we not just have d dot prev, we also have a d dot
00:24:37.840 op and we know that d was produced by an addition of those two values and so now we have the full
00:24:43.440 mathematical expression, and we're building out this data structure, and we know exactly how each
00:24:48.480 value came to be, by what expression and from what other values.
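Putting the pieces so far together, the Value class at this point of the lecture looks roughly like this (with _op defaulting to the empty string, as in the released code):

```python
class Value:

    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self._prev = set(_children)  # the value objects that produced this one
        self._op = _op               # the operation that produced this one ('' for leaves)

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')


a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)
d = a*b + c
print(d._prev, d._op)  # the children of d, and the op that created it ('+')
```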
00:24:52.000 now because these expressions are about to get quite a bit larger we'd like a way to
00:24:58.720 nicely visualize these expressions that we're building out so for that i'm going to copy
00:25:03.200 paste a bunch of slightly scary code that's going to visualize this these expression graphs for us
00:25:09.040 so here's the code and i'll explain it in a bit but first let me just show you what this code does
00:25:14.880 basically what it does is it creates a new function draw dot that we can call on some root node
00:25:19.920 and then it's going to visualize it so if we call draw dot on d which is this final value here that
00:25:26.640 is a times b plus c it creates something like this so this is d and you see that this is a times b
00:25:34.000 creating an intermediate value, plus c gives us this output node d. So that's draw dot of d, and I'm not
00:25:43.040 going to go through this in complete detail, you can take a look at graphviz and its API.
00:25:47.440 Graphviz is an open source graph visualization software, and what we're doing here is we're
00:25:52.480 building out this graph in the graphviz API, and you can basically see that trace is this helper function
00:25:59.600 that enumerates all the nodes and edges in the graph so that just builds a set of all the nodes
00:26:04.480 and edges and then we iterate for all the nodes and we create special node objects for them in
00:26:11.360 using dot dot node, and then we also create edges using dot dot edge, and the only thing that's like
00:26:17.760 slightly tricky here is you'll notice that i basically add these fake nodes which are these
00:26:22.720 operation nodes so for example this node here is just like a plus node and i create these
00:26:29.920 special op nodes here and i connect them accordingly so these nodes of course are not actual
00:26:39.520 nodes in the original graph they're not actually a value object the only value objects here are
00:26:45.600 the things in squares those are actual value objects or representations thereof and these
00:26:50.800 op nodes are just created in this draw dot routine so that it looks nice. Let's also add labels to
00:26:56.960 these graphs, just so we know what variables are where. So let's create a special underscore label,
00:27:04.000 or let's just do label equals the empty string by default, and save it to each node,
00:27:08.880 and then here we're going to do a dot label is a, b dot label is b, c dot label is c,
00:27:18.640 and then let's create a special e equals a times b,
00:27:30.720 and e dot label will be e, I'm kind of naming it after itself, and then d will be e plus c and d dot label will be d.
00:27:40.960 Okay, so nothing really changes, I just added this new e variable, and then here,
00:27:49.440 when we are printing this, I'm going to print the label here, so this will be a percent s, bar,
00:27:56.800 and this will be n dot label, and so now we have the label on the left here, so it says a times b creating
00:28:06.240 e and then e plus c creates d just like we have it here and finally let's make this expression just
00:28:12.560 one layer deeper so d will not be the final output node instead after d we are going to create a new
00:28:20.160 value object called f we're going to start running out of variables soon f will be negative 2.0
00:28:26.480 and its label of course is just f, and then l, capital L, will be the output of our graph,
00:28:34.800 and L will be d times f. Okay, so L will be negative eight; that's the output. So
00:28:44.720 now we don't just draw d, we draw L. Okay, and somehow the label of L was undefined;
00:28:54.960 oops, L's label has to be explicitly given to it. There; L is the output.
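So the inputs and the expression being drawn at this point look roughly like this (label added to the constructor as just described, and draw_dot defined by the visualization code above):

```python
a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a*b;   e.label = 'e'
d = e + c; d.label = 'd'
f = Value(-2.0, label='f')
L = d * f; L.label = 'L'   # L = -8.0, the output of the forward pass
draw_dot(L)
```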
00:29:02.320 So let's quickly recap what we've done so far. We are able to build out mathematical expressions using only plus
00:29:07.280 and times so far they are scalar valued along the way and we can do this forward pass and build out
00:29:14.800 a mathematical expression so we have multiple inputs here a b c n f going into a mathematical
00:29:20.960 expression that produces a single output L, and this here is visualizing the forward pass. So the
00:29:27.600 output of the forward pass is negative eight that's the value now what we'd like to do next is we'd
00:29:33.280 like to run back propagation and in back propagation we are going to start here at the end and we're
00:29:38.800 going to reverse and calculate the gradient along all these intermediate values, and
00:29:45.600 really what we're computing for every single value here we're going to compute the derivative
00:29:50.880 of that node with respect to l so the derivative of l with respect to l is just one and then we're
00:30:00.880 going to derive what is the derivative of l with respect to f with respect to d with respect to c
00:30:06.160 with respect to e with respect to b and with respect to a and in neural network setting
00:30:11.920 you'd be very interested in the derivative of basically this loss function l with respect to
00:30:17.600 the weights of a neural network and here of course we have just these variables a b c and f
00:30:22.400 but some of these will eventually represent the weights of a neural network and so we'll need to
00:30:26.480 know how those weights are impacting the loss function so we'll be interested basically in the
00:30:31.520 derivative of the output with respect to some of its leaf nodes and those leaf nodes will be the
00:30:36.480 weights of the neural network and the other leaf nodes of course will be the data itself but usually
00:30:41.360 we will not want or use the derivative of the loss function with respect to data because the data
00:30:46.400 is fixed but the weights will be iterated on using the gradient information so next we are going to
00:30:53.040 create a variable inside the value class that maintains the derivative of l with respect to that
00:30:59.840 value, and we will call this variable grad. So there's a self dot data and there's a self dot grad,
00:31:06.640 and initially it will be zero, and remember that zero basically means no effect. So at initialization
00:31:14.160 we're assuming that every value does not affect the output, right, because if the
00:31:20.560 gradient is zero that means that changing this variable is not changing the loss function
00:31:24.720 so by default we assume that the gradient is zero and then now that we have grad and it's 0.0
00:31:34.160 we are going to be able to visualize it here after data, so here grad will be a percent .4f,
00:31:41.360 and this will be n dot grad, and now we are going to be showing both the data and the grad,
00:31:50.000 initialized at zero, and we are just about getting ready to calculate the backpropagation.
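So the two changes being described are, roughly, a grad attribute initialized to zero in the constructor, and the grad shown next to data in each node's label inside draw_dot:

```python
class Value:

    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0              # derivative of the final output with respect to this value
        self._prev = set(_children)
        self._op = _op
        self.label = label

    # (__add__, __mul__, __repr__ as before)

# and inside draw_dot, each node's record label becomes something like:
# "{ %s | data %.4f | grad %.4f }" % (n.label, n.data, n.grad)
```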
00:31:56.800 and of course this grad again as I mentioned is representing the derivative of the output in this
00:32:02.240 case l with respect to this value so with respect to so this is the derivative of l with respect to f
00:32:08.480 with respect to d and so on so let's now fill in those gradients and actually do back propagation
00:32:13.440 manually so let's start filling in these gradients and start all the way at the end as I mentioned
00:32:17.600 here first we are interested to fill in this gradient here so what is the derivative of l
00:32:23.360 with respect to l in other words if I change l by a tiny amount h how much does l change
00:32:31.360 it changes by h, so it's proportional, and therefore the derivative will be 1. We can of course measure
00:32:38.560 these or estimate these numerical gradients numerically just like we've seen before so if I
00:32:43.840 take this expression and I create a function, lol, here, and put this in there. Now the reason
00:32:51.040 I'm creating a function lol here is because I don't want to pollute or mess up the global scope
00:32:56.160 here. This is just kind of like a little staging area, and as you know, in python all of these will
00:33:00.400 be local variables to this function so I'm not changing any of the global scope here so here l1
00:33:06.720 will be l and then copy-pasting this expression we're going to add a small amount h
00:33:14.880 in for example a right and this would be measuring the derivative of l with respect to a so here
00:33:26.000 this will be l2 and then we want to print that derivative so print l2 minus l1 which is how much
00:33:33.840 l changed and then normalize it by h so this is the rise over run and we have to be careful
00:33:40.400 because l is a value node so we actually want its data so that these are floats divided by h
00:33:48.960 and this should print the derivative l with respect to a because a is the one that we bumped
00:33:53.920 a little bit by h so what is the derivative of l with respect to a it's six okay and obviously
00:34:03.520 if we change l by h then that would be here effectively this looks really awkward but
00:34:13.520 changing l by h you see the derivative here is one
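A sketch of that staging-area check, assuming the Value class and the name lol used in the lecture (the helper is a throwaway function so its local variables don't leak into the global scope):

```python
def lol():

    h = 0.001

    a = Value(2.0, label='a')
    b = Value(-3.0, label='b')
    c = Value(10.0, label='c')
    e = a*b;   e.label = 'e'
    d = e + c; d.label = 'd'
    f = Value(-2.0, label='f')
    L = d * f; L.label = 'L'
    L1 = L.data

    a = Value(2.0 + h, label='a')   # bump the input whose derivative we are measuring
    b = Value(-3.0, label='b')
    c = Value(10.0, label='c')
    e = a*b;   e.label = 'e'
    d = e + c; d.label = 'd'
    f = Value(-2.0, label='f')
    L = d * f; L.label = 'L'
    L2 = L.data

    print((L2 - L1) / h)   # numerical estimate of dL/da, about 6.0

lol()
```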
00:34:17.680 that's kind of like the base case of what we are doing here. So basically we can now come up here
00:34:26.480 and we can manually set l.grad to one; this is our manual backpropagation.
00:34:31.920 l.grad is one and let's redraw and we'll see that we filled in grad is one for l we're now going to
00:34:39.600 continue the backpropagation. So let's here look at the derivatives of l with respect to d and f;
00:34:44.640 let's do d first. So what we are interested in, if I turn this into markdown here, is: we'd like to
00:34:51.120 know basically we have that l is d times f and we'd like to know what is d l by d d what is that
00:35:01.680 and if you know you're a calculus l is d times f so what is d l by d d it would be f and if you
00:35:08.880 don't believe me we can also just derive it because the proof would be fairly straightforward we go to
00:35:14.720 the definition of the derivative which is f of x plus h minus f of x divided h as a limit limit
00:35:23.280 of h goes to zero of this kind of expression so when we have l is d times f then increasing d by h
00:35:31.600 would give us the output of d plus h times f; that's basically f of x plus h, right, minus d times f,
00:35:40.400 and then divide h and symbolically expanding out here we would have basically d times f plus h
00:35:48.800 times f minus d times f divided h and then you see how the d f minus d f cancels so you're left
00:35:55.360 with h times f divided h which is f so in the limit as h goes to zero of you know derivative
00:36:04.960 definition, we just get f, in the case of d times f. So symmetrically, dl by df will just be d.
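Written out, the little derivation being done here is:

```latex
\frac{\partial L}{\partial d}
  = \lim_{h \to 0} \frac{(d+h)f - df}{h}
  = \lim_{h \to 0} \frac{df + hf - df}{h}
  = \lim_{h \to 0} \frac{hf}{h}
  = f
\qquad\text{and symmetrically}\qquad
\frac{\partial L}{\partial f} = d .
```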
00:36:18.560 so what we have is that f dot grad we see now is just the value of d which is four
00:36:25.200 and we see that d dot grad is just the value of f
00:36:33.600 and so the value of f is negative two so we'll set those manually
00:36:45.120 let me erase this markdown cell and then let's redraw what we have.
00:36:48.400 okay and let's just make sure that these were correct so we seem to think that dl by dd is
00:36:56.800 negative two so let's double check let me erase this plus h from before and now we want to derivative
00:37:03.360 with respect to f so let's just come here when i create f and let's do a plus h here and this
00:37:08.960 should print a derivative of l with respect to f so we expect to see four yeah and this is four
00:37:15.360 up to floating point funkiness, and then dl by dd should be f, which is negative two; grad is negative
00:37:25.760 two. So if we again come here and we change d: d dot data plus equals h right here, so we've
00:37:36.320 added a little h, and then we see how l changed, and we expect to print negative two.
00:37:43.200 there we go so we've numerically verified what we're doing here is kind of like an in-line gradient
00:37:51.760 check. A gradient check is when we are deriving this, like, backpropagation, and getting the derivative
00:37:57.200 with respect to all the intermediate results, and then the numerical gradient is just, you know,
00:38:03.200 estimating it using a small step size. Now we're getting to the crux of backpropagation. So this will be the
00:38:10.000 most important node to understand because if you understand the gradient for this node
00:38:15.040 you understand all of back propagation and all of training of neural nets basically
00:38:18.880 so we need to derive dl by dc in other words the derivative l with respect to c because we've
00:38:26.960 computed all these other gradients already now we're coming here and we're continuing the back
00:38:31.280 propagation manually so we want dl by dc and then we'll also derive dl by de now here's the problem
00:38:39.200 how do we derive dl by dc we actually know the derivative l with respect to d so we know how
00:38:48.080 l is sensitive to d but how is l sensitive to c so if we wiggle c how does that impact l through d
00:38:58.000 so we know dl by dd, and we also here know how c impacts d, and so just very intuitively, if you
00:39:06.080 know the impact that c is having on d and the impact that d is having on l then you should be
00:39:11.920 able to somehow put that information together to figure out how c impacts l and indeed this is
00:39:17.200 what we can actually do so in particular we know just concentrating on d first let's look at how
00:39:23.600 what is the derivative basically of d with respect to c so in other words what is dd by dc
00:39:28.320 so here we know that d is c plus e, that's what we know, and now we're interested in dd by dc.
00:39:38.880 if you just know your calculus again and you remember that the differentiating c plus e with
00:39:43.920 respect to c, you know that that gives you 1.0, and we can also go back to the basics and derive
00:39:50.240 this because again we can go to our f of x plus h minus f of x divided by h that's the definition
00:39:57.280 of the derivative as h goes to 0 and so here focusing on c and its effect on d we can basically
00:40:05.120 do the f of x plus h will be c is incremented by h plus e that's the first evaluation of our
00:40:12.160 function minus c plus e and then divide h and so what is this just expanding this out this will be
00:40:21.840 c plus h plus e minus c minus e divide h and then you see here how c minus c cancels e minus e cancels
00:40:30.160 we're left with h over h which is 1.0 and so by symmetry also d d by d e will be 1.0 as well
00:40:41.600 so basically the derivative of a sum expression is very simple and this is the local derivative
00:40:47.680 so i call this the local derivative because we have the final output value all the way at the
00:40:52.480 end of this graph and we're now like a small node here and this is a little plus node and
00:40:58.320 the little plus node doesn't know anything about the rest of the graph that it's embedded in
00:41:03.520 all it knows is that it did it plus it took a c and an e added them and created d and this plus
00:41:09.840 node also knows the local influence of c on d, or rather, the derivative of d with respect to
00:41:15.440 c and it also knows the derivative of d with respect to e but that's not what we want that's
00:41:21.280 just a local derivative. What we actually want is dl by dc, and l is here just one step away,
00:41:29.200 but in the general case this little plus node could be embedded in, like, a massive graph.
00:41:33.760 so again we know how l impacts d and now we know how c and e impact d how do we put that
00:41:41.440 information together to write dl by dc and the answer of course is the chain rule in calculus
00:41:47.760 and so I pulled up the chain rule here from Wikipedia, and I'm going to go through this very briefly.
00:41:55.520 So the chain rule, when you first see it in calculus, can be very confusing; and calculus
00:42:00.320 can be very confusing; like, this is the way I learned the chain rule and it was very confusing, like
00:42:06.640 what is happening, it's just complicated. So I like this expression much better:
00:42:12.960 if a variable z depends on a variable y which itself depends on a variable x then z depends on
00:42:18.960 x as well obviously through the intermediate variable y and in this case the chain rule is
00:42:23.840 expressed as if you want dz by dx then you take the dz by dy and you multiply it by dy by dx
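In symbols, the statement being quoted is:

```latex
\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}
\qquad\text{which for our graph reads}\qquad
\frac{dL}{dc} = \frac{dL}{dd} \cdot \frac{dd}{dc} = (-2) \cdot 1.0 = -2 .
```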
00:42:32.880 so the chain rule fundamentally is telling you how we chain these derivatives together
00:42:41.520 correctly so to differentiate through a function composition we have to apply a multiplication
00:42:48.160 of those derivatives so that's really what chain rule is telling us and there's a nice little
00:42:56.000 intuitive explanation here which I also think is kind of cute the chain rule says that knowing
00:43:00.160 the instantaneous rate of change of z with respect to y and y relative to x allows one to calculate
00:43:04.720 the instantaneous rate of change of z relative to x as a product of those two rates of change
00:43:10.400 simply the product of those two so here's a good one if a car travels twice as fast as bicycle
00:43:16.720 and the bicycle is four times as fast as a walking man, then the car travels two times four, eight times,
00:43:23.280 as fast as the man. And so this makes it very clear that the correct thing to do, sort of,
00:43:28.560 is to multiply: the car is twice as fast as the bicycle, and the bicycle is four times as fast as the man,
00:43:36.480 so the car will be eight times as fast as the man. And so we can take these intermediate
00:43:42.720 rates of change if you will and multiply them together and that justifies the chain rule
00:43:49.120 intuitively. So have a look at the chain rule here; really what it means for us is there's a very
00:43:54.400 simple recipe for deriving what we want, which is dl by dc, and what we have so far is: we know
00:44:03.600 what is the impact of d on l, so we know dl by dd, the derivative of
00:44:14.160 l with respect to d; we know that that's negative two. And now because of this local
00:44:18.800 reasoning that we've done here we know dd by dc so how does c impact d and in particular this
00:44:28.160 is a plus node so the local derivative is simply 1.0 it's very simple and so the chain rule tells
00:44:35.200 us that dl by dc going through this intermediate variable will just be simply dl by dd times
00:44:49.200 dd by dc that's chain rule so this is identical to what's happening here except
00:44:57.120 z is our l, y is our d and x is our c, so we literally just have to multiply these, and because
00:45:07.440 these local derivatives, like dd by dc, are just one, we basically just copy over dl by dd, because
00:45:17.680 this is just times one so what is it so because dl by dd is negative two what is dl by dc
00:45:24.640 well it's the local gradient 1.0 times dl by dd which is negative two so literally what a plus
00:45:32.880 node does you can look at it that way is it literally just routes the gradient because the
00:45:38.000 plus nodes local derivatives are just one and so in the chain rule one times dl by dd is
00:45:47.520 just dl by dd, and so that derivative just gets routed to both c and to e in this case.
00:45:54.560 So basically we have that e dot grad, or let's start with c since that's the one we looked at,
00:46:01.760 is negative two times one negative two and in the same way by symmetry e dot grad will be negative
00:46:12.080 two, that's the claim. So we can set those, we can redraw, and you see how we just assigned negative two
00:46:21.040 and negative two. So this backpropagating signal, which is carrying the information of, like, what is the
00:46:25.840 derivative of l with respect to all the intermediate nodes we can imagine it almost like flowing
00:46:30.640 backwards through the graph and a plus node will simply distribute the derivative to all the leaf
00:46:35.520 nodes sorry to all the children nodes of it so this is the claim and now let's verify it so let
00:46:42.240 me remove the plus h here from before and now instead what we're going to do is we want to
00:46:46.800 increment c; so c dot data will be incremented by h, and when I run this we expect to see negative
00:46:52.880 two; negative two. And then of course for e: so e dot data plus equals h, and we expect to see
00:47:01.680 negative two here as well; simple. So those are the derivatives of these internal nodes, and now we're going to
00:47:12.320 recurse our way backwards again and we're again going to apply the chain rule so here we go our
00:47:18.880 second application of chain rule and we will apply it all the way through the graph which just
00:47:22.880 happened to only have one more node remaining we have that dl by de as we have just calculated
00:47:29.760 is negative two so we know that so we know the derivative of l with respect to e
00:47:34.560 and now we want dl by da right and the chain rule is telling us that that's just dl by d e
00:47:46.240 negative two times the local gradient so what is the local gradient basically de by da we have to
00:47:57.760 look at that so i'm a little times node inside a massive graph and i only know that i did a times
00:48:06.400 b and I produced an e. So now, what is de by da and de by db? That's the only thing that I sort of know
00:48:14.240 about, that's my local gradient. So because we have that e is a times b, we're asking what is de by da,
00:48:24.080 and of course we just did that here: we had a times b, so I'm not going to rederive it, but if you
00:48:30.880 want to differentiate this with respect to a you'll just get b right the value of b which in this
00:48:37.200 case is negative three point zero so basically we have that dl by da well let me just do it right
00:48:46.400 here we have that a dot grad and we are applying chain rule here is dl by d e which we see here is
00:48:54.240 negative two times what is de by da it's the value of b which is negative three that's it
00:49:05.280 and then we have b dot grad is again dl by d e which is negative two just the same way times
00:49:15.600 what is de by db: it's the value of a, which is 2.0, that's the value of a.
00:49:24.480 so these are our claimed derivatives let's redraw and we see here that a dot grad turns out to be
00:49:35.280 six because that is negative two times negative three and b dot grad is negative four times sorry
00:49:42.240 is negative two times two which is negative four so those are our claims let's delete this and
00:49:47.760 let's verify them we have a here a dot data plus equals h so the claim is that a dot grad is six
00:50:01.760 let's verify six and we have b dot data plus equals h so nudging b by h and looking at what
00:50:12.000 happens we claim it's negative four and indeed it's negative four plus minus again float oddness
00:50:19.520 and that's it this that was the manual back propagation all the way from here to all the
00:50:30.400 leaf nodes and we've done it piece by piece and really all we've done is as you saw we iterated
00:50:35.840 through all the nodes one by one and locally applied the chain rule we always know what is
00:50:41.040 the derivative of l with respect to this little output and then we look at how this output was
00:50:45.600 produced this output was produced through some operation and we have the pointers to the
00:50:50.400 children nodes of this operation and so in this little operation we know what the local derivatives
00:50:55.760 are and we just multiply them onto the derivative always so we just go through and recursively
00:51:01.760 multiply on the local derivatives and that's what back propagation is is just a recursive
00:51:06.800 application of chain rule backwards through the computation graph let's see this power in action
00:51:12.320 just very briefly what we're going to do is we're going to uh nudge our inputs to try to make
00:51:18.000 l go up so in particular what we're doing is we want a dot data we're going to change it
00:51:23.520 and if you want l to go up that means we just have to go in the direction of the gradient so
00:51:29.040 a should increase in the direction of gradient by like some small step amount this is the step size
00:51:36.560 and we don't just want this for a but also for b, also for c, also for f; those are the leaf nodes
00:51:48.000 which we usually have control over and if we nudge in direction of the gradient we expect a positive
00:51:54.480 influence on l so we expect l to go up positively so it should become less negative it should go up
00:52:02.320 to say negative you know six or something like that uh it's hard to tell exactly and we'd have to
00:52:08.320 rerun the forward pass so let me just um do that here um this would be the forward pass
00:52:17.680 F would be unchanged this is effectively the forward pass and now if we print l dot data
00:52:23.840 we expect because we nudged all the values all the inputs in the direction of gradient we expected
00:52:29.760 less negative l we expected to go up so maybe it's negative six or so let's see what happens
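A sketch of this nudge-and-rerun step, using the gradients we just filled in manually and a step size of 0.01 (any small number works):

```python
step = 0.01   # step size

# nudge every leaf node in the direction of its gradient
a.data += step * a.grad
b.data += step * b.grad
c.data += step * c.grad
f.data += step * f.grad

# rerun the forward pass with the nudged inputs
e = a * b
d = e + c
L = d * f
print(L.data)   # expected to be less negative than -8
```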
00:52:35.520 okay negative seven and this is basically one step of an optimization that will end up running
00:52:43.760 and really this gradient just give us some power because we know how to influence the final outcome
00:52:49.280 and this will be extremely useful for training neural nets, as we'll see. So now I would like to do
00:52:54.080 one more example of manual back propagation using a bit more complex and useful example
00:53:01.680 we are going to back propagate through a neuron so we want to eventually build out neural networks
00:53:09.760 and in the simplest case these are multi-layer perceptrons, as they're called. So this is a two-
00:53:13.760 layer neural net and it's got these hidden layers made up of neurons and these neurons are fully
00:53:18.560 connected to each other now biologically neurons are very complicated devices but we have very
00:53:23.600 simple mathematical models of them and so this is a very simple mathematical model of a neuron
00:53:29.360 you have some inputs x's and then you have these synapses that have weights on them so
00:53:35.600 the W's are weights and then the synapse interacts with the input to this neuron
00:53:43.360 multiplicatively so what flows to the cell body of this neuron is W times x but there's multiple
00:53:50.720 inputs so there's many W times x's flowing into the cell body the cell body then has also like
00:53:56.640 some bias, so this is kind of like the innate sort of trigger happiness of this neuron.
00:54:03.120 so this bias can make it a bit more trigger happy or a bit less trigger happy regardless of the
00:54:07.520 input but basically we're taking all the W times x of all the inputs adding the bias and then we
00:54:14.400 take it through an activation function and this activation function is usually some kind of a
00:54:18.880 squashing function like a sigmoid or tanh or something like that. So as an example we're going to use
00:54:25.360 tanh in this example; numpy has an np.tanh, so we can call it on a range and we can plot it.
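A quick sketch of that plot:

```python
import numpy as np
import matplotlib.pyplot as plt

xs = np.arange(-5, 5, 0.2)
plt.plot(xs, np.tanh(xs))   # squashes inputs smoothly into (-1, 1)
plt.grid()
```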
00:54:35.760 This is the tanh function, and you see that the inputs, as they come in, get squashed on the y
00:54:42.800 coordinate here. So right at zero we're going to get exactly zero, and then as you go more positive
00:54:49.520 in the input then you'll see that the function will only go up to one and then plateau out
00:54:54.800 and so if you pass in very positive inputs we're going to cap it smoothly at one and on a negative
00:55:01.600 side we're going to cap it smoothly to negative one. So that's tanh, and that's the squashing function,
00:55:08.320 or an activation function and what comes out of this neuron is just the activation function
00:55:12.960 applied to the dot product of the weights and the inputs so let's write one out
00:55:20.160 I'm going to copy paste because I don't want to type too much but okay so here we have the inputs
00:55:30.480 x1, x2, so this is a two-dimensional neuron, so two inputs are going to come in. Then we have
00:55:36.560 the weights of this neuron, weights w1, w2, and these weights again are the synaptic
00:55:42.880 strengths for each input, and this is the bias of the neuron, b. And now what we want to do is,
00:55:50.800 according to this model we need to multiply x1 times w1 and x2 times w2 and then we need to
00:55:59.120 add bias on top of it and it gets a little messy here but all we are trying to do is x1w1 plus x2w2
00:56:06.880 plus B and these are multiply here except I'm doing it in small steps so that we actually
00:56:12.720 have pointers to all these intermediate nodes. So we have an x1 times w1 variable and an x2 times w2 variable,
00:56:19.280 and I'm also labeling them. So n is now the cell body's raw activation, without the activation
00:56:28.960 function, for now, and this should be enough to basically plot it.
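The cell being described, roughly, assuming the Value class and draw_dot from above, with the input and weight values used in the lecture (x1=2, x2=0, w1=-3, w2=1; the bias below is the number settled on later in the lecture so the backprop numbers come out nice):

```python
# inputs x1, x2
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
# weights w1, w2 (the synaptic strengths for each input)
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
# bias of the neuron
b = Value(6.8813735870195432, label='b')
# x1*w1 + x2*w2 + b, done in small steps so we keep pointers to every intermediate node
x1w1 = x1 * w1; x1w1.label = 'x1*w1'
x2w2 = x2 * w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'   # the cell body's raw activation, pre-activation-function
draw_dot(n)
```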
00:56:34.560 So draw dot of n gives us x1 times w1 and x2 times w2 being added, then the bias gets added on top of this, and this
00:56:46.480 n is this sum, so we're now going to take it through an activation function, and let's say we use
00:56:53.440 tanh, so that we produce the output. So what we'd like to do here is we'd like to take the output,
00:56:59.600 and I'll call it o, which is n dot tanh. Okay, but we haven't yet written the tanh. Now the reason that we need
00:57:09.280 to implement another tanh function here is that tanh is a hyperbolic function, and we've only so far
00:57:16.880 implemented plus and times, and you can't make a tanh out of just pluses and times, you also need
00:57:22.720 exponentiation. So tanh is this kind of a formula here, you can use either one of these, and you see
00:57:28.880 that there is exponentiation involved, which we have not implemented yet for our little value
00:57:33.600 node here, so we're not going to be able to produce tanh yet, and we have to go back up and
00:57:37.120 implement something like it now one option here is we could actually implement exponentiation
00:57:46.480 right, and we could return the exp of a value instead of a tanh of a value, because if we had
00:57:53.120 exp then we have everything else that we need, because we know how to add and
00:57:59.280 we know how to multiply, so we'd be able to create tanh if we knew how to exp,
00:58:06.320 but for the purposes of this example I specifically wanted to show you that we don't necessarily need
00:58:12.720 to have the most atomic pieces in this value object we can actually like create functions at arbitrary
00:58:21.200 points of abstraction they can be complicated functions but they can be also very very simple
00:58:27.360 functions like a plus and it's totally up to us the only thing that matters is that we know how
00:58:31.920 to differentiate through any one function so we take some inputs and we make an output the only
00:58:36.880 thing that matters it can be arbitrarily complex function as long as you know how to create the
00:58:41.760 local derivative if you know the local derivative of how the inputs impact the output then that's
00:58:46.240 all you need so we're going to cluster up all of this expression and we're not going to break
00:58:51.760 it down to its atomic pieces we're just going to directly implement tanh so let's do that
00:59:11.120 let's grab n which is self dot data and then this I believe is the tanh
00:59:20.000 math dot exp of 2n minus 1 over math dot exp of 2n plus 1 and maybe I can call this x just so that it matches exactly
00:59:35.680 okay and now this will be t and the children of this node are just one child and I'm wrapping it
00:59:45.280 in a tuple so this is a tuple of one object just self and here the name of this operation will be
00:59:50.880 tanh and we're going to return that okay so putting that together the method looks roughly like this
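(a minimal sketch of the tanh method as just described, assuming the Value constructor built earlier in the lecture with data, grad, _prev and _op; the backward pass for tanh comes later)

    import math

    class Value:
        # assumed constructor shape from earlier in the lecture
        def __init__(self, data, _children=(), _op='', label=''):
            self.data = data
            self.grad = 0.0
            self._prev = set(_children)
            self._op = _op
            self.label = label

        def tanh(self):
            x = self.data
            t = (math.exp(2*x) - 1) / (math.exp(2*x) + 1)  # tanh(x)
            out = Value(t, (self,), 'tanh')
            return out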
01:00:02.480 so now Value should be implementing tanh and now we can scroll all the way down here and we can actually do n dot tanh and that's going to return the tanh
01:00:08.480 output of n and now we should be able to draw dot of o not of n so let's see how that worked
01:00:15.840 there we go n went through tanh to produce this output so now tanh is sort of our little micro
01:00:28.560 grad supported node here as an operation and as long as we know the derivative of tanh
01:00:35.600 then we'll be able to back propagate through it now let's see this tanh in action currently it's
01:00:40.800 not squashing too much because the input to it is pretty low but if the bias is increased to say 8
01:00:47.200 then we'll see that what's flowing into the tanh now is 2 and tanh is squashing it to 0.96
01:00:56.960 so we're already hitting the tail of this tanh and it will sort of smoothly go up to 1 and then
01:01:01.760 plateau out over there okay so now i'm going to do something slightly strange i'm going to change
01:01:06.560 this bias from 8 to this number 6.88 etc and i'm going to do this for specific reasons because
01:01:14.560 we're about to start back propagation and i want to make sure that our numbers come out nice
01:01:20.160 they're not like very crazy numbers they're nice numbers that we can sort of understand in our head
01:01:24.640 let me also add a label o short for output here so that's o okay so
01:01:32.000 0.88 flows into tanh and comes out 0.7 and so on so now we're going to do back propagation and
01:01:37.680 we're going to fill in all the gradients so what is the derivative of o with respect to all the
01:01:44.400 inputs here and of course in a typical neural network setting what we really care about the most
01:01:49.520 is the derivative of these neurons on the weights specifically the w2 and w1 because those are the
01:01:56.560 weights that we're going to be changing part of the optimization and the other thing that we have
01:02:00.480 to remember is here we have only a single neuron but in the neural net you typically have many
01:02:04.400 neurons and they're connected so this is only like a one small neuron a piece of a much bigger puzzle
01:02:10.480 and eventually there's a loss function that sort of measures the accuracy of the neural net and we're
01:02:14.640 back propagating with respect to that accuracy and trying to increase it so let's start off
01:02:20.320 back propagation here and what is the derivative of oh with respect to oh the base case sort of
01:02:26.480 we know always is that the gradient is just 1.0 so let me fill it in and then let me split out
01:02:37.120 the drawing function into its own cell here and clear this output okay so now when we draw
01:02:51.040 o we'll see that o's grad is 1 so now we're going to back propagate through the tanh
01:02:56.000 so to back propagate through tanh we need to know the local derivative of tanh so if we have that
01:03:03.680 o is tanh of n then what is d o by d n now what you could do is you could come here and you could
01:03:14.240 take this expression and you could do your calculus derivative taking and that would work
01:03:20.720 but we can also just scroll down here on the Wikipedia page into a section that hopefully tells us that the
01:03:27.120 derivative d by dx of tanh of x is any of these I like this one 1 minus tanh squared of x so this is
01:03:35.840 1 minus tanh of x squared so basically what this is saying is that d o by d n is 1 minus tanh of n
01:03:47.920 squared and we already have tanh of n it's just o so it's 1 minus o squared
01:03:56.480 so oh is the output here so the output is this number
01:04:00.400 o dot data is this number and then what this is saying is that d o by d n is 1 minus this squared
01:04:12.400 so 1 minus o dot data squared is 0.5 conveniently so the local derivative of this tanh operation here
01:04:22.400 is 0.5 and so that would be d o by d n so we can fill in that n dot grad
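(just to make that arithmetic explicit; 0.7071 is the rounded output value read off the drawing above)

    o_data = 0.7071          # the tanh output we saw in the forward pass
    print(1 - o_data**2)     # ≈ 0.5, the local derivative do/dn = 1 - tanh(n)**2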
01:04:35.200 so this is exactly 0.5 one half so now we're going to continue the back propagation
01:04:49.280 this is 0.5 and this is a plus node so what is backprop going to do here
01:04:55.920 and if you remember our previous example a plus is just a distributor of gradient so this gradient
01:05:02.560 will simply flow to both of these equally and that's because the local derivative of this operation
01:05:07.440 is one for every one of its nodes so 1 times 0.5 is 0.5 so therefore we know that this node here
01:05:15.840 which we called this its grad is just 0.5 and we know that b dot grad is also 0.5
01:05:26.400 so those are 0.5 continuing we have another plus 0.5 again we'll just distribute
01:05:34.240 so 0.5 will flow to both of these so we can set theirs
01:05:48.960 pluses are my favorite operations to back propagate through because it's very simple
01:05:53.840 so now it's flowing into these expressions this 0.5 and so really again keep in mind what
01:05:59.040 derivative is telling us at every point in time along here this is saying that if we want the
01:06:05.040 output of this neuron to increase then the influence of these expressions on the output is positive
01:06:12.240 both of them are a positive contribution to the output
01:06:18.080 so now back propagating to x2w2 first this is a times node so we know that the local
01:06:26.560 derivative is you know the other term so if we want to calculate x2 dot grad then can you think
01:06:34.000 through what it's going to be so x2 dot grad will be w2 dot data times this x2w2 dot grad right
01:06:49.200 and w2 dot grad will be x2 dot data times x2w2 dot grad
01:07:04.160 let's set them and let's redraw so here we see that the gradient on our weight
01:07:11.840 w2 is 0 because x2's data was 0 right but x2 will have the gradient 0.5 because the data here was 1
01:07:19.920 and so what's interesting here right is because the input x2 was 0 and because of the way the
01:07:26.720 times works of course this gradient will be 0 and to think about intuitively why that is
01:07:31.840 derivative always tells us the influence of this on the final output if I wiggle w2 how is the output
01:07:40.560 changing it's not changing because we're multiplying by 0 so because it's not changing there is no
01:07:46.080 derivative and 0 is the correct answer because we're multiplying by that 0 and let's do it here
01:07:53.360 0.5 should come here and flow through this times and so we'll have that x1 dot grad is
01:08:00.480 can you think through a little bit what what this should be
01:08:04.960 the local derivative of times with respect to x1 is going to be w1 so w1's data times
01:08:15.280 x1w1 dot grad and w1 dot grad will be x1 dot data times x1w1 dot grad
01:08:25.760 let's see what those came out to be so this is 0.5 so this would be negative 1.5 and this would be
01:08:33.360 1 and we back propagate it through this expression these are the actual final derivatives so if we
01:08:39.360 want this neuron's output to increase we know that well w2 has no gradient so
01:08:48.560 w2 doesn't actually matter to this neuron right now but this weight w1 should go up
01:08:54.480 so if this weight goes up then this neuron's output would also go up and proportionally
01:09:00.320 because the gradient is 1 okay so doing the back propagation manually is obviously ridiculous
01:09:05.040 so we are now going to put an end to this suffering and we're going to see how we can implement
01:09:10.080 the backward pass a bit more automatically we're not going to be doing all of it manually out here
01:09:14.320 it's now pretty obvious to us by example how these pluses and times backpropagate gradients
01:09:19.920 so let's go up to the value object and we're going to start codifying what we've seen
01:09:26.080 in the examples below so we're going to do this by storing a special self dot underscore backward
01:09:34.800 and this underscore backward will be a function which is going to do that little piece of chain
01:09:40.240 rule at each little node that took inputs and produced output we're going to store
01:09:46.000 how we are going to chain the outputs gradient into the inputs gradients so by default
01:09:53.040 this will be a function that doesn't do anything and you can also see that here in the value
01:10:01.440 in micrograd so this underscore backward function by default doesn't do anything it's an empty
01:10:09.040 function and that would be sort of the case for example for a leaf node for a leaf node there's nothing
01:10:13.600 to do but now if when we're creating these out values these out values are an addition of self
01:10:22.640 and other and so we're going to want to set out's underscore backward to be the function that propagates
01:10:31.120 the gradient so let's define what should happen
01:10:37.200 and we're going to store it in a closure let's define what should happen when we call
01:10:44.320 out's grad for addition our job is to take out's grad and propagate it into self's grad and other's
01:10:56.080 grad so basically we want to set self dot grad to something and we want to set other dot grad
01:11:02.720 to something okay and the way we saw below how chain rule works we want to take the local derivative
01:11:10.720 times the sort of global derivative i should call it which is the derivative of the final output of
01:11:16.560 the expression with respect to outs data with respect to out so the local derivative of self
01:11:26.960 in an addition is 1.0 so it's just 1.0 times out's grad that's the chain rule and other's grad
01:11:36.960 will be 1.0 times out's grad and basically what you're seeing here is that out's grad will
01:11:42.960 simply be copied onto self's grad and other's grad as we saw happens for an addition operation
01:11:48.960 so we're going to later call this function to propagate the gradient having done an addition
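(a sketch of the addition method with its _backward closure, as a method of the Value class we're building; note the plain assignment here gets changed into an accumulation a bit later in the lecture)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')

        def _backward():
            # local derivative of a sum is 1.0 for each input, chained with out's grad
            self.grad = 1.0 * out.grad
            other.grad = 1.0 * out.grad

        out._backward = _backward
        return out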
01:11:54.480 let's now do multiplication we're going to also define that underscore backward
01:11:59.360 and we're going to set out's underscore backward to be this backward
01:12:07.840 and we want to chain out's grad into self's grad and other's grad
01:12:15.520 and this will be a little piece of chain rule formal application so we'll have so what should
01:12:22.160 it be can you think through so what is the local derivative here the local derivative was other
01:12:32.720 dot data so it's other dot data times out dot grad that's chain rule
01:12:40.960 and here we have self dot data times out dot grad that's what we've been doing
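(and the multiplication method sketched the same way, again as a method of the Value class being built)

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            # local derivative of a product is the other factor, chained with out's grad
            self.grad = other.data * out.grad
            other.grad = self.data * out.grad

        out._backward = _backward
        return out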
01:12:46.480 and finally here for tanh's backward we want to set out's underscore backward to be just backward
01:13:00.560 and here we need to back propagate we have out dot grad and we want to chain it into self dot grad
01:13:07.440 and self dot grad will be the local derivative of this operation that we've done here which is tanh
01:13:15.360 and so we saw that the local gradient is 1 minus the tanh of x squared which here is t
01:13:22.080 that's the local derivative because t is the output of this tanh so 1 minus t squared is the
01:13:28.800 local derivative and then gradients get multiplied because of the chain rule so out's grad is chained
01:13:36.080 through the local gradient into self dot grad and that should basically be it
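(so tanh with its _backward closure looks roughly like this, again slotting into the Value class)

    def tanh(self):
        x = self.data
        t = (math.exp(2*x) - 1) / (math.exp(2*x) + 1)
        out = Value(t, (self,), 'tanh')

        def _backward():
            # local derivative of tanh is 1 - tanh(x)**2, i.e. 1 - t**2
            self.grad = (1 - t**2) * out.grad

        out._backward = _backward
        return out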
01:13:42.720 so we're going to redefine our value node we're going to swing all the way down here and we're going to redefine
01:13:50.560 our expression make sure that all the grads are zero okay but now we don't have to do this manually
01:13:58.080 anymore we are going to be basically calling the dot backward in the right order so first we want
01:14:05.920 to call o dot backward so o was the outcome of tanh right so calling o dot underscore
01:14:20.480 backward will be this function this is what it will do now we have to be careful because there's a
01:14:29.760 times out dot grad in there and out dot grad remember is initialized to zero so here we see grad zero
01:14:40.480 so as a base case we need to set o dot grad to 1.0 to initialize this with 1
01:14:48.000 and then once this is 1 we can call o dot backward and what that should do is it should propagate
01:14:59.840 this grad through tanh so the local derivative times the global derivative which is initialized at 1 so
01:15:11.440 uh oh so I thought about redoing it but I figured I should just leave the error in here because
01:15:21.040 it's pretty funny why is NoneType object not callable it's because I screwed up we're trying to save
01:15:28.880 these functions so this is correct this here we don't want to call the function because that
01:15:35.040 returns none these functions return none we just want to store the function so let me redefine
01:15:40.560 the value object and then we're going to come back and redefine the expression draw dot
01:15:45.680 everything is great o dot grad is 1 and now this should work of course
01:15:54.480 okay so o dot underscore backward and this grad should now be 0.5 if we redraw and if everything
01:16:01.920 went correctly 0.5 yay okay so now we need to call n dot grad
01:16:07.760 n dot underscore backward sorry n dot underscore backward so that seems to have worked
01:16:15.520 so n dot underscore backward routed the gradient to both of these so this is looking great
01:16:24.560 now we could of course call b dot underscore backward what's going to happen
01:16:30.880 well b doesn't have a backward b's backward because b is a leaf node
01:16:36.800 b's backward is by initialization the empty function so nothing would happen but we can call it on
01:16:44.800 it but when we call this one's underscore backward then we expect this 0.5 to get further routed
01:16:56.560 right so there we go 0.5 and 0.5 and then finally we want to call it here on x2w2
01:17:10.320 and on x1w1 let's do both of those and there we go so we get 0.5 negative 1.5 and 1 exactly as we
01:17:24.400 did before but now we've done it by calling underscore backward sort of manually so we have the
01:17:32.880 last one last piece to get rid of which is us calling underscore backward manually so let's
01:17:38.400 think through what we are actually doing we've laid out a mathematical expression and now we're
01:17:43.600 trying to go backwards through that expression so going backwards through the expression just
01:17:49.440 means that we never want to call a dot backward for any node before we've done sort of everything
01:17:58.400 after it so we have to do everything after it before ever going to call dot backward on any one
01:18:03.680 node we have to get all of its full dependencies everything that it depends on has to propagate
01:18:08.960 to it before we can continue back propagation so this ordering of graphs can be achieved using
01:18:15.760 something called topological sort so topological sort is basically a laying out of a graph such
01:18:23.360 that all the edges go only from left to right basically so here we have a graph a DAG
01:18:29.280 actually and these are two different topological orders of it I believe
01:18:35.600 where basically you'll see that it's a laying out of the nodes such that all the edges go only
01:18:39.360 one way from left to right and implementing topological sort you can look in Wikipedia and so on I'm
01:18:46.160 not going to go through it in detail but basically this is what builds a topological graph we maintain
01:18:54.800 a set of visited nodes and then we are going through starting at some root node which for us is oh
01:19:03.120 that's what I want to start a topological sort and starting at oh we go through all of its children
01:19:08.800 and we need to lay them out from left to right and basically this starts at oh if it's not visited
01:19:16.080 then it marks it as visited and then it iterates through all of its children and calls built topological
01:19:22.800 on them and then after it's gone through all the children it adds itself so basically this node
01:19:30.800 that we're going to call it on like say oh it's only going to add itself to the topological list
01:19:36.480 after all of the children have been processed and that's how this function is guaranteeing that
01:19:42.160 you're only going to be in the list once all your children are in the list and that's the invariant
01:19:46.480 that is being maintained so if we call build topo on o and then inspect this list we're going to see that
01:19:53.120 it ordered our value objects and the last one is the value of 0.707 which is the output so this is
01:20:02.400 oh and then this is n and then all the other nodes get laid out before it so that built the topological
01:20:11.120 graph and really what we're doing now is we're just calling dot underscore backward on all of the
01:20:17.280 nodes in a topological order so if we just reset the gradients they're all zero what did we do
01:20:24.080 we started by setting o dot grad to be one that's the base case then we built the topological order
01:20:38.400 and then we went for node in reversed of topo now in the reverse order because this list goes from
01:20:50.080 the children up to the output so we need to go through it in reversed order so starting at o node dot underscore backward and
01:20:58.800 this should be it there we go those are the correct derivatives finally we are going to hide this
01:21:08.880 functionality so i'm going to copy this and we're going to hide it inside the value class because
01:21:15.200 we don't want to have all that code lying around so instead of an underscore backward we're now going
01:21:20.320 to define an actual backward so that backward without the underscore and that's going to do all
01:21:27.120 the stuff that we just derived so let me just clean this up a little bit so we're first going to
01:21:33.360 build the topological graph starting at self so build topo of self will populate
01:21:45.280 the topological order into the topo list which is a local variable then we set self dot grad to be
01:21:51.200 one and then for each node in the reversed list so starting at us and going to all the children we call
01:21:58.160 underscore backward and that should be it
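(so the backward method that ends up inside the Value class looks roughly like this)

    def backward(self):
        # build a topological ordering of all nodes in the expression graph
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)

        # base case, then apply the chain rule one node at a time, output first
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()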
01:22:09.920 so we save come down here and redefine okay all the grads are zero and now what we can do is o dot backward without the underscore and
01:22:17.920 there we go and that's back propagation at least for one neuron now we shouldn't be too
01:22:29.440 happy with ourselves actually because we have a bad bug and we have not surfaced the bug because
01:22:35.360 of some specific conditions that we have to think about right now so here's the simplest case
01:22:41.040 that shows the bug say I create a single node a and then I create a b that is a plus a and then
01:22:51.840 I call backward so what's going to happen is a is three and then a is b is a plus a so there's
01:22:59.680 two arrows on top of each other here then we can see that b is of course the forward pass works
01:23:06.880 b is just a plus a which is six but the gradient here that we calculate automatically is not
01:23:13.440 actually correct and that's because of course just doing calculus in your head the derivative of b
01:23:23.040 with respect to a should be two one plus one it's not one intuitively what's happening here right so
01:23:32.560 b is the result of a plus a and then we call backward on it so let's go up and see what that does
01:23:39.040 b is a result of addition so out is b and then when we call backward what happened is
01:23:50.240 self dot grad was set to one and then other dot grad was set to one but because we're doing a
01:23:58.800 plus a self and other are actually the exact same object so we are overriding the gradient we are
01:24:06.000 setting it to one and then we are setting it again to one and that's why it stays at one so that's
01:24:13.440 a problem there's another way to see this in a little bit more complicated expression
01:24:21.520 so here we have a and b and then d will be the multiplication of the two and e will be the
01:24:30.320 addition of the two and then we multiply times d to get f and then we call f that backward
01:24:36.640 and these gradients if you check will be incorrect so fundamentally what's happening here again is
01:24:43.120 basically we're going to see an issue anytime we use a variable more than once
01:24:49.200 until now in these expressions above every variable is used exactly once so we didn't see the issue
01:24:54.080 but here if a variable is used more than once what's going to happen during backward pass
01:24:58.480 we're back propagating from f to e to d so far so good but now e calls it backward and it deposits
01:25:05.680 its gradients to a and b but then we come back to d and call backward and it overrides those gradients
01:25:12.240 at a and b so that's obviously a problem and the solution here if you look at the multivariate
01:25:21.120 case of the chain rule and its generalization there the solution there is basically that we
01:25:26.320 have to accumulate these gradients these gradients add and so instead of setting those gradients
01:25:32.720 we can simply do plus equals we need to accumulate those gradients plus equals plus equals
01:25:41.920 plus equals plus equals and this will be okay remember because we are initializing them at zero
01:25:49.920 so they start at zero and then any contribution that flows backwards will simply add so now if we
01:25:59.600 redefine this one then because of the plus equals this now works because a dot grad started at zero
01:26:07.920 and we call b dot backward we deposit one and then we deposit one again and now this is two which is
01:26:13.920 correct and here this will also work and we'll get correct gradients because when we call it
01:26:19.360 backward we will deposit the gradients from this branch and then we get to back to d that backward
01:26:24.880 it will deposit its own gradients and then those gradients simply add on top of each other and so
01:26:30.560 we just accumulate those gradients and that fixes the issue
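(so for example the addition method ends up with += in its closure, and the same change goes into __mul__ and tanh; a quick check on the a + a case, assuming the backward method from above)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += 1.0 * out.grad     # accumulate instead of overwrite
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

    a = Value(3.0, label='a')
    b = a + a; b.label = 'b'
    b.backward()
    print(a.grad)   # 2.0, as the calculus says, instead of the overwritten 1.0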
01:26:35.200 okay now before we move on let me actually do a bit of cleanup here and delete some of this intermediate work so
01:26:40.320 i'm not going to need any of this now that we've derived all of it um we are going to keep this
01:26:47.520 because i want to come back to it delete the tanh delete our example and leave this step
01:26:54.640 delete this keep the code that draws and then delete this example and leave behind only the
01:27:03.280 definition of value and now let's come back to this nonlinearity here that we implemented the tanh
01:27:08.720 now i told you that we could have broken down tanh into its explicit atoms in terms of other
01:27:15.280 expressions if we had the exp function so if you remember tanh is defined like this and we chose to
01:27:20.960 develop tanh as a single function and we can do that because we know its derivative and we can
01:27:25.440 back propagate through it but we can also break down tanh into an explicit function of exp and
01:27:31.360 i would like to do that now because i want to prove to you that you get all the same results
01:27:34.800 and all those same gradients but also because it forces us to implement a few more expressions
01:27:39.840 it forces us to do exponentiation addition subtraction division and things like that and i think it's
01:27:45.520 a good exercise to go through a few more of these okay so let's scroll up to the definition of value
01:27:51.200 and here one thing that we currently can't do is we can do like a value of say 2.0 but we can't do
01:27:59.840 you know a plus 1 here for example to add a constant 1 and we can't do something like this
01:28:03.920 and we can't do it because it says int object has no attribute data that's because a plus one
01:28:10.160 comes right here to add and then other is the integer one and then here python is trying to
01:28:16.160 access one dot data and that's not a thing that's because basically one is not a value object and
01:28:21.440 we only have addition for value objects so as a matter of convenience so that we can create
01:28:26.800 expressions like this and make them make sense we can simply do something like this
01:28:30.640 basically we let other alone if other is an instance of value but if it's not an instance of
01:28:38.160 value we're going to assume that it's a number like an integer or a float and we're going to simply
01:28:41.840 wrap it in in value and then other will just become value of other and then other will have a data
01:28:47.280 attribute and this should work so if I just redefine value then this should work
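(roughly, the addition method now starts by coercing plain numbers; a small usage check follows)

    def __add__(self, other):
        # let plain ints and floats participate by wrapping them in a Value
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

    a = Value(2.0)
    print((a + 1).data)   # 3.0, because 1 gets wrapped into Value(1)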
01:28:53.200 there we go okay now let's do the exact same thing for multiply because we can't do something like this
01:28:58.000 again for the exact same reason so we just have to go to mul and if other is not a value then
01:29:05.840 let's wrap it in value let's redefine value and now this works now here's a kind of unfortunate
01:29:12.080 and not obvious part a times two works we saw that but two times a is that gonna work
01:29:19.680 you'd expect it to right but actually it will not and the reason it won't is because python doesn't
01:29:25.200 know like when you do a times two basically um so for a times two python will go and it will
01:29:32.080 basically do something like a dot mul of two that's basically what it will call but two times a
01:29:38.960 is the same as two dot mul of a and two can't multiply a value and so it's really confused
01:29:46.400 about that so instead what happens is in python the way this works is you are free to define
01:29:51.200 something called rmul and rmul is kind of like a fallback so if python can't do two
01:29:58.960 times a it will check if by any chance a knows how to multiply two and that will be
01:30:06.400 called into rmul so because python can't do two times a it will check is there an rmul in value
01:30:13.520 and because there is it will now call that and what we'll do here is we will swap the order of the
01:30:19.600 operands so basically two times a will redirect to rmul and rmul will basically call a times
01:30:25.280 two and that's how that will work so redefining now with rmul two times a becomes four
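(the fallback is tiny; a sketch, assuming __mul__ also wraps plain numbers as just described for __add__)

    def __rmul__(self, other):   # other * self, e.g. 2 * a
        return self * other

    a = Value(2.0)
    print((a * 2).data)   # 4.0
    print((2 * a).data)   # 4.0, python falls back to a.__rmul__(2)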
01:30:32.880 okay now looking at the other elements that we still need we need to know how to exponentiate and how to divide
01:30:37.040 so let's first do the exponentiation part we're going to introduce a single function
01:30:43.440 exp here and exp is going to mirror tanh in the sense that it's a simple single function that
01:30:49.840 transforms a single scalar value and outputs a single scalar value so we pop out the python number
01:30:55.040 we use math dot exp to exponentiate it create a new value object everything that we've seen before
01:31:00.320 tricky part of course is how do you back propagate through e to the x and uh so here you can
01:31:06.800 potentially pause the video and think about what should go here
01:31:10.000 okay so basically i'm going to need to know what is the local derivative of e to the x so d by d x
01:31:19.280 of e to the x is famously just e to the x and we've already just calculated e to the x and it's inside
01:31:25.520 out dot data so we can do out dot data times out dot grad that's the chain rule so we're
01:31:32.480 just chaining on to the current running grad and this is what the expression looks like it looks a
01:31:37.520 little confusing but this is what it is and that's the exponentiation so redefining we should now be
01:31:43.440 able to call a dot exp and hopefully the backward pass works as well
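(a sketch of the exp method as just described, again slotting into the Value class)

    def exp(self):
        x = self.data
        out = Value(math.exp(x), (self,), 'exp')

        def _backward():
            # d/dx of e**x is e**x, which we already computed into out.data
            self.grad += out.data * out.grad

        out._backward = _backward
        return out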
01:31:49.440 okay and the last thing we'd like to do of course is we'd like to be able to divide now i actually will implement something
01:31:54.480 slightly more powerful than division because division is just a special case of something a bit
01:31:58.800 more powerful so in particular just by rearranging if we have some kind of a b equals value of 4.0 here
01:32:06.880 we'd like to basically be able to do a divided b and we'd like this to be able to give us 0.5
01:32:10.720 now division actually can be reshuffled as follows if we have a divided b that's actually the same as
01:32:17.840 a multiplying one over b and that's the same as a multiplying b to the power of negative one
01:32:24.320 and so what i'd like to do instead is i basically like to implement the operation of x to the k
01:32:29.200 for some constant k so it's an integer or a float and we would like to be able to differentiate
01:32:35.760 this and then as a special case negative one will be division and so i'm doing that just because
01:32:42.480 it's more general and yeah you might as well do it that way so basically what i'm saying is we can
01:32:47.840 redefine division which we will put here somewhere you know we can put this here somewhere what i'm
01:32:56.480 saying is that we can redefine division so self-divide other can actually be rewritten as self times
01:33:03.120 other to the power of negative one and now value raised to the power of negative one we have to now
01:33:10.000 define that so we need to implement the power function where am i going to put the power
01:33:17.280 function maybe here somewhere let's just call it pow so this function will be called when we try
01:33:24.560 to raise a value to some power and other will be that power now i'd like to make sure that other is
01:33:30.560 only an int or a float usually other is some kind of a different value object but here other will be
01:33:36.480 forced to be an int or a float otherwise the math won't work for what we're trying to achieve in
01:33:43.280 this specific case it would be a different derivative expression if we wanted other to be a value
01:33:48.560 so here we create the output value which is just you know this data raised to the power of other
01:33:54.640 and other here could be for example negative one that's what we are hoping to achieve
01:33:58.240 and then this is the backward stub and this is the fun part which is what is the chain rule
01:34:05.840 expression here for back propagating through the power function where we raise to
01:34:13.440 the power of some kind of a constant so this is the exercise and maybe pause the video here and
01:34:18.000 see if you can figure it out yourself as to what we should put here
01:34:20.720 okay so you can actually go here and look at derivative rules as an example and we see lots of
01:34:33.360 derivatives that you can hopefully know from calculus in particular what we're looking for is
01:34:37.600 the power rule because that's telling us that if we're trying to take d by dx of x to the n
01:34:42.720 which is what we're doing here then that is just n times x to the n minus one right okay so that's
01:34:51.840 telling us about the local derivative of this power operation so all we want here basically n
01:34:59.280 is now other and self dot data is x and so this now becomes other which is n times
01:35:07.680 self dot data which is now a python int or a float it's not a value object we're accessing
01:35:14.720 the data attribute raised to the power of other minus one or n minus one i can put brackets around
01:35:22.080 this but this doesn't matter because power takes precedence over multiply in python so that would
01:35:28.320 have been okay and that's the local derivative only but now we have to chain it and we chain it
01:35:33.280 simply by multiplying by out dot grad that's chain rule and this should technically work
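(a sketch of the power method and the division that rides on top of it, as methods of the Value class)

    def __pow__(self, other):
        assert isinstance(other, (int, float)), "only supporting int/float powers for now"
        out = Value(self.data**other, (self,), f'**{other}')

        def _backward():
            # power rule n * x**(n-1), chained with out's grad
            self.grad += other * (self.data ** (other - 1)) * out.grad

        out._backward = _backward
        return out

    def __truediv__(self, other):   # self / other
        return self * other**-1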
01:35:39.120 and we're gonna find out soon but now if we do this this should now work and we get point five so
01:35:47.920 the forward pass works but does the backward pass work and i realized that we actually also have to
01:35:52.800 know how to subtract so right now a minus b will not work to make it work we need one more piece of
01:36:00.480 code here and basically this is the subtraction and the way we're going to implement subtraction
01:36:08.080 is we're going to implement it by addition of a negation and then to implement negation we're
01:36:12.160 going to multiply by negative one so just again using the stuff we've already built and just
01:36:16.320 expressing it in terms of what we have and now a minus b works okay
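(the two tiny methods just described, sketched out)

    def __neg__(self):            # -self
        return self * -1

    def __sub__(self, other):     # self - other, via addition of the negation
        return self + (-other)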
01:36:22.080 so now let's scroll again to this expression here for this neuron and let's just compute the backward pass here once
01:36:28.480 we've defined o and let's draw it so here's the gradients for all these leaf nodes for this two
01:36:34.720 dimensional neuron that has a tanh that we've seen before so now what i'd like to do is i'd like to
01:36:40.240 break up this tanh into this expression here so let me copy paste this here and now instead
01:36:48.720 we'll preserve the label and we will change how we define o so in particular we're going to
01:36:55.040 implement this formula here so we need e to the 2x minus 1 over e to the 2x plus 1 so e to the 2x
01:37:02.640 we need to take 2 times n and we need to exponentiate it that's e to the 2x and then because we're using
01:37:08.800 it twice let's create an intermediate variable e and then define o as e minus 1 over e plus 1
01:37:17.440 so e minus 1 over e plus 1 and that should be it
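(spelled out, the replacement definition of o looks roughly like this, where n is the pre-activation Value from the neuron cell above)

    e = (2 * n).exp()            # e**(2n)
    o = (e - 1) / (e + 1)        # tanh(n) = (e**(2n) - 1) / (e**(2n) + 1)
    o.label = 'o'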
01:37:25.600 and then we should be able to draw dot of o so now before i run this what do we expect to see number one we're expecting to see a
01:37:32.400 much longer graph here because we've broken up tanh into a bunch of other operations but those
01:37:38.080 operations are mathematically equivalent and so what we're expecting to see is number one the
01:37:42.880 same result here so the forward pass works and number two because of that mathematical
01:37:47.760 equivalence we expect to see the same backward pass and the same gradients on these leaf nodes
01:37:52.800 so these gradients should be identical so let's run this so number one let's verify that
01:38:00.000 instead of a single tanh node we have now exp and we have plus we have times negative one this is
01:38:07.360 the division and we end up with the same forward pass here and then the gradients we have to be
01:38:13.120 careful because they're in slightly different order potentially the gradients for w2 and x2 should be
01:38:17.920 0 and 0.5 and the gradients for w1 and x1 should be
01:38:23.200 1 and negative 1.5 so that means that both our forward passes and backward passes were correct
01:38:31.040 because this turned out to be equivalent to tanh before and so the reason I wanted to go through
01:38:37.280 this exercise is number one we got to practice a few more operations and writing more backwards
01:38:42.400 passes and number two I wanted to illustrate the point that the the level at which you implement
01:38:49.360 your operations is totally up to you you can implement backward passes for tiny expressions
01:38:53.920 like a single individual plus or a single times or you can implement them for say tanh
01:39:00.000 which is a kind of a potentially you can see it as a composite operation because it's made up of all
01:39:03.920 these more atomic operations but really all of this is kind of like a fake concept all that
01:39:08.800 matters is we have some kind of inputs and some kind of an output and this output is a function of
01:39:12.480 the inputs in some way and as long as you can do forward pass and the backward pass of that
01:39:17.040 little operation it doesn't matter what that operation is and how composite it is if you can
01:39:23.360 write the local gradients you can chain the gradient and you can continue back propagation
01:39:27.360 so the design of what those functions are is completely up to you so now I would like to show
01:39:32.960 you how you can do the exact same thing but using a modern deep neural network library like for
01:39:37.360 example PyTorch which I've roughly modeled micro grad by and so PyTorch is something you would use
01:39:44.880 in production and I'll show you how you can do the exact same thing but in PyTorch API so I'm just
01:39:49.920 going to copy paste it in and walk you through it a little bit this is what it looks like
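(roughly what the pasted cell contains; the input values mirror the neuron from before, x1=2, x2=0, w1=-3, w2=1, and the long decimal is the "6.88..." bias written out here as an assumption; the printed gradients match the ones derived above)

    import torch

    x1 = torch.Tensor([2.0]).double();   x1.requires_grad = True
    x2 = torch.Tensor([0.0]).double();   x2.requires_grad = True
    w1 = torch.Tensor([-3.0]).double();  w1.requires_grad = True
    w2 = torch.Tensor([1.0]).double();   w2.requires_grad = True
    b  = torch.Tensor([6.8813735870195432]).double(); b.requires_grad = True

    n = x1*w1 + x2*w2 + b
    o = torch.tanh(n)

    print(o.data.item())          # forward pass, ~0.7071
    o.backward()
    print('x2', x2.grad.item())   # 0.5
    print('w2', w2.grad.item())   # 0.0
    print('x1', x1.grad.item())   # -1.5
    print('w1', w1.grad.item())   # 1.0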
01:39:54.720 so we're going to import PyTorch and then we need to define these value objects like we have here
01:40:01.120 now micro grad is a scalar valued engine so we only have scalar values like 2.0 but in PyTorch
01:40:09.600 everything is based around tensors and like I mentioned tensors are just end dimensional arrays
01:40:14.320 of scalars so that's why things get a little bit more complicated here I just need a scalar
01:40:20.240 valued tensor a tensor with just a single element but by default when you work with
01:40:25.280 PyTorch you would use more complicated tensors like this so if I import PyTorch
01:40:31.600 then I can create tensors like this and this tensor for example is a 2x3 array of scalars
01:40:41.600 in a single compact representation so you can check its shape we see that it's a 2x3 array
01:40:48.000 and so on so this is usually what you would work with in the actual libraries so here I'm creating
01:40:54.560 a tensor that has only a single element 2.0 and then I'm casting it to be double because Python
01:41:04.480 is by default using double precision for its floating point numbers so I'd like everything to
01:41:08.720 be identical by default the data type of these tensors will be float 32 so it's only using a
01:41:15.280 single precision float so I'm casting it to double so that we have float 64 just like in Python
01:41:21.600 so I'm casting to double and then we get something similar to value of two the next thing I have to
01:41:28.720 do is because these are leaf nodes by default PyTorch assumes that they do not require gradients
01:41:33.760 so I need to explicitly say that all of these nodes require gradients okay so this is going
01:41:38.960 to construct scalar valued one element tensors make sure that PyTorch knows that they require
01:41:44.880 gradients now by default these are set to false by the way because of efficiency reasons because
01:41:50.320 usually you would not want gradients for leaf nodes like the inputs to the network and this is
01:41:55.840 just trying to be efficient in the most common cases so once we've defined all of our values
01:42:01.440 in PyTorch land we can perform arithmetic just like we can here in micrograd land so this would
01:42:06.240 just work and then there's a torch dot tanh also and what we get back is a tensor again and we can just
01:42:13.760 like in micrograd it's got a data attribute and it's got grad attributes so these tensor objects
01:42:19.520 just like in micrograd have a dot data and a dot grad and the only difference here is that
01:42:24.800 we need to call a dot item because otherwise PyTorch dot item basically takes a single tensor of one
01:42:33.440 element and it just returns that element stripping out the tensor so let me just run this and hopefully
01:42:39.440 we are going to get this is going to print the forward pass which is 0.707 and this will be the
01:42:45.760 gradients which hopefully are 0.50 negative 1.5 and 1 so if we just run this there we go 0.7 so
01:42:55.600 the forward pass agrees and then 0.50 negative 1.5 and 1 so PyTorch agrees with us and just to
01:43:03.120 show you here basically o here is a tensor with a single element and it's a double and we can call
01:43:10.320 dot item on it to just get the single number out so that's what item does and o is a tensor object
01:43:17.520 like i mentioned and it's got a backward function just like we've implemented and then all of these
01:43:23.120 also have a dot grad so like x2 for example has a grad and it's a tensor and we can pop out the
01:43:28.160 individual number with dot item so basically torch can do what we did in micrograd as a special
01:43:36.320 case when your tensors are all single element tensors but the big deal with PyTorch is that
01:43:42.240 everything is significantly more efficient because we are working with these tensor objects and we
01:43:46.800 can do lots of operations in parallel on all of these tensors but otherwise what we've built very
01:43:53.360 much agrees with the API of PyTorch okay so now that we have some machinery to build out pretty
01:43:57.840 complicated mathematical expressions we can also start building up neural nets and as i
01:44:02.240 mentioned neural nets are just a specific class of mathematical expressions so we're going to start
01:44:07.680 building out a neural net piece by piece and eventually we'll build out a two layer multi-layer
01:44:12.160 perceptron as it's called and i'll show you exactly what that means let's start with a single
01:44:16.720 individual neuron we've implemented one here but here i'm going to implement one that also subscribes
01:44:22.160 to the PyTorch API and how it designs its neural network modules so just like we saw that we can
01:44:28.480 like match the API of PyTorch on the autograd side we're going to try to do that on the neural
01:44:34.480 network modules so here's class neuron and just for the sake of efficiency i'm going to copy
01:44:41.520 paste some sections that are relatively straightforward so the constructor will take number of inputs
01:44:48.480 to this neuron which is how many inputs come to a neuron so this one for example is three inputs
01:44:54.000 and then it's going to create a weight that is some random number between negative one and one
01:44:59.440 for every one of those inputs and a bias that controls the overall trigger happiness of this
01:45:04.640 neuron and then we're going to implement a def underscore underscore call of self and x
01:45:12.720 some input x and really what we want to do here is w times x plus b where w times x here is a dot
01:45:19.120 product specifically now let me just return 0.0 here for now the way this
01:45:27.040 works now is we can have an x which is say like 2.0 3.0 then we can initialize a neuron that is
01:45:32.480 two-dimensional because these are two numbers and then we can feed those two numbers into that neuron
01:45:37.600 to get an output and so when you use this notation n of x python will use call so currently call
01:45:45.920 just return 0.0 now we'd like to actually do the forward pass of this neuron instead so we're going
01:45:55.280 to do here first is we need to basically multiply all of the elements of w with all of the elements
01:46:01.360 of x pairwise we need to multiply them so the first thing we're going to do is we're going to zip up
01:46:06.960 self dot w and x and in python zip takes two iterators and it creates a new iterator that
01:46:14.480 iterates over the tuples of their corresponding entries so for example just to show you we can
01:46:20.000 print this list and still return 0.0 here so we see that these w's are paired up
01:46:36.240 with the x's w with x and now what we're going to do is
01:46:42.640 for wi xi in the zip we want to multiply wi times xi and then we want to sum all of that together
01:46:56.960 to come up with an activation and also add self dot b on top so that's the raw activation and then of
01:47:04.320 course we need to pass that through a nonlinearity so what we're going to be returning is act dot
01:47:08.640 tanh and we return the out so now we see that we are getting some outputs and we get a different output from
01:47:16.560 neuron each time because we are initializing different weights and biases and then to be a
01:47:21.680 bit more efficient here actually sum by the way takes a second optional parameter which is the
01:47:27.840 start and by default the start is 0 so these elements of this sum will be added on top of 0
01:47:34.720 to begin with but actually we can just start with self dot b and then we just have an expression like this
01:47:40.000 and then the generator expression here must be parenthesized in python there we go
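(piecing those snippets together, the Neuron class described here looks roughly like this, assuming the Value class from above)

    import random

    class Neuron:
        def __init__(self, nin):
            # one random weight per input plus a bias, all Value objects
            self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
            self.b = Value(random.uniform(-1, 1))

        def __call__(self, x):
            # w · x + b, passed through the tanh nonlinearity
            act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
            return act.tanh()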
01:47:53.680 yep so now we can forward a single neuron next up we're going to define a layer of neurons
01:47:59.280 so here we have a schematic for a mlp so we see that these mlps each layer this is one layer
01:48:06.240 has actually a number of neurons and they're not connected to each other but all of them are fully
01:48:09.760 connected to the input so what is a layer of neurons it's just it's just a set of neurons evaluated
01:48:15.040 independently so in the interest of time i'm going to do something fairly straightforward here
01:48:23.040 it's um literally a layer is just a list of neurons and then how many neurons do we have we take that
01:48:31.200 as an input argument here how many neurons do you want in your layer number of outputs in this layer
01:48:35.680 and so we just initialize completely independent neurons with this given dimensionality and when
01:48:41.520 we call on it we just independently evaluate them so now instead of a neuron we can make a layer
01:48:48.800 of neurons they are two dimensional neurons and let's have three of them and now we see that we
01:48:53.200 have three independent evaluations of three different neurons right
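(a Layer along these lines; the single-output convenience in the return is something that gets added a little later in the lecture)

    class Layer:
        def __init__(self, nin, nout):
            # nout independent neurons, each looking at the same nin inputs
            self.neurons = [Neuron(nin) for _ in range(nout)]

        def __call__(self, x):
            outs = [n(x) for n in self.neurons]
            return outs[0] if len(outs) == 1 else outs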
01:49:00.320 okay finally let's complete this picture and define an entire multi-layer perceptron or mlp and as we can see here in an mlp these
01:49:06.640 layers just feed into each other sequentially so let's come here and i'm just going to copy the
01:49:12.080 code here in the interest of time so an mlp is very similar we're taking the number of inputs
01:49:18.560 as before but now instead of taking a single n out which is number of neurons in a single layer
01:49:23.760 we're going to take a list of n outs and this list defines the sizes of all the layers that we want
01:49:28.800 in our mlp so here we just put them all together and then iterate over consecutive pairs of these
01:49:35.040 sizes and create layer objects for them and then in the call function we are just calling them
01:49:39.680 sequentially so that's an mlp really and let's actually reimplement this picture so we want
01:49:45.280 three input neurons and then two layers of four and an output unit so we want
01:49:51.040 three dimensional input say this is an example input we want three inputs into two layers of four
01:49:58.880 and one output and this of course is an mlp and there we go that's a forward pass of an mlp
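(an MLP sketch along the lines just described, with an example three-dimensional input)

    class MLP:
        def __init__(self, nin, nouts):
            # nouts lists the layer sizes, e.g. MLP(3, [4, 4, 1]) for this picture
            sz = [nin] + nouts
            self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]

        def __call__(self, x):
            for layer in self.layers:
                x = layer(x)
            return x

    x = [2.0, 3.0, -1.0]      # an example input
    n = MLP(3, [4, 4, 1])
    print(n(x))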
01:50:05.920 to make this a little bit nicer you see how we have just a single element but it's wrapped
01:50:10.320 in the list because layer always returns lists so for convenience return outs at zero if the
01:50:17.520 length of outs is exactly one else return the full list and this will allow us to just get a
01:50:23.680 single value out at the last layer that only has a single neuron and finally we should be able to
01:50:29.360 draw dot of n of x and as you might imagine these expressions are now getting relatively involved
01:50:38.480 so this is an entire mlp that we're defining now
01:50:40.880 all the way until a single output okay and so obviously you would never differentiate on
01:50:51.760 pen and paper these expressions but with micro grad we will be able to back propagate all the way
01:50:56.720 through this and back propagate into these weights of all these neurons so let's see how that works
01:51:04.080 okay so let's create ourselves a very simple example data set here so this data set has four
01:51:09.920 examples and so we have four possible inputs into the neural net and we have four desired targets
01:51:17.440 so we'd like the neural net to assign or output 1.0 when it's fed this example negative one when
01:51:24.880 it's fed these examples and one when it's fed this example so it's a very simple binary classifier
01:51:29.600 neural net basically that we would like here now let's think what the neural net currently thinks
01:51:34.320 about these four examples we can just get their predictions um basically we can just call n of x
01:51:40.160 four x in x's and then we can print so these are the outputs of the neural net on those four examples
01:51:48.160 so the first one is 0.91 but we like it to be one so we should push this one higher this one we want
01:51:56.720 to be higher this one says 0.88 and we want this to be negative one this is 0.8 we want it to be
01:52:04.000 negative one and this one is 0.8 we want it to be one so how do we make the neural net and how do we
01:52:10.480 tune the weights to better predict the desired targets and the trick used in deep learning to
01:52:18.480 achieve this is to calculate a single number that somehow measures the total performance of your
01:52:24.400 neural net and we call the single number the loss so the loss first is a single number that we're
01:52:32.400 going to define that basically measures how well the neural net is performing right now we have the
01:52:36.640 intuitive sense that it's not performing very well because we're not very much close to this
01:52:40.320 so the loss will be high and we'll want to minimize the loss so in particular in this case what we're
01:52:46.080 going to do is we're going to implement the mean squared error loss so what this is doing is if
01:52:50.880 we're going to basically iterate for y ground truth and y output in zip of ys and ypred so
01:53:01.360 we're going to pair up the ground truths with the predictions and the zip iterates over tuples of them
01:53:08.080 and for each y ground truth and y output we're going to subtract them
01:53:16.960 and square them so let's first see what these losses are these are individual loss components
01:53:21.760 and so basically for each one of the four we are taking the prediction and the ground truth
01:53:29.440 we are subtracting them and squaring them so because this one is so close to its target
01:53:36.240 0.91 is almost one subtracting them gives a very small number so here we would get like a negative
01:53:43.440 point one and then squaring it just makes sure that regardless of whether we are more negative or
01:53:50.240 more positive we always get a positive number instead of squaring we could also take for example
01:53:56.400 the absolute value we need to discard the sign and so you see that the expression is arranged so that
01:54:01.920 you only get 0 exactly when y out is equal to y ground truth when those two are equal so your
01:54:07.760 prediction is exactly the target you are going to get 0 and if your prediction is not the target
01:54:12.880 you are going to get some other number so here for example we are way off and so that's why the
01:54:17.760 loss is quite high and the more off we are the greater the loss will be so we don't want high
01:54:25.760 loss we want low loss and so the final loss here will be just the sum of all of these numbers so
01:54:34.400 you see that this should be roughly 0 plus roughly 0 plus the two bigger ones so the loss should be about 7 here and now
01:54:45.440 we want to minimize the loss we want the loss to be low because if loss is low then every one of the
01:54:52.960 predictions is equal to its target so the loss the lowest it can be is 0 and the greater it is
01:55:00.320 the worse off the neural net is predicting so now of course if we do loss that backward
01:55:06.080 something magical happened when I hit enter and the magical thing of course that happened is that
01:55:12.880 we can look at n dot layers and then say the first layer's neurons at
01:55:19.680 0 because remember that MLP has the layers which is a list and each layer has neurons which is a
01:55:28.080 list and that gives us individual neuron and then it's got some weights and so we can for example
01:55:34.000 look at the weights at 0 oops it's not called weights it's called w and that's a value but now
01:55:53.200 this value also has a grad because of the backward pass and so we see that because this gradient here
01:55:53.200 on this particular weight of this particular neuron of this particular layer is negative
01:55:57.120 we see that its influence on the loss is also negative so slightly increasing this particular
01:56:03.280 weight of this neuron of this layer would make the loss go down and we actually have this information
01:56:10.160 for every single one of our neurons and all their parameters actually it's worth looking at also
01:56:14.880 the draw dot of loss by the way so previously we looked at the draw dot of a single
01:56:20.720 neuron's forward pass and that was already a large expression but what is this expression we actually
01:56:25.920 forwarded every one of those four examples and then we have the loss on top of them with the mean
01:56:31.600 squared error and so this is a really massive graph because this graph that we've built up now
01:56:38.400 oh my gosh this graph that we've built up now which is kind of excessive it's excessive because
01:56:45.440 it has four forward passes of a neural net for every one of the examples and then it has the loss
01:56:50.480 on top and it ends with the value of the loss which was 7.12 and this loss will now back propagate
01:57:01.760 through all the forward passes all the way through just every single intermediate value
01:57:01.760 of the neural net all the way back to of course the parameters of the weights which are at the input
01:57:07.120 so these weight parameters here are inputs to this neural net and these numbers here these scalars
01:57:14.960 are inputs to the neural net so if we went around here we will probably find some of these examples
01:57:21.760 this 1.0 potentially maybe this 1.0 or you know some of the others and you'll see that they all
01:57:27.280 have gradients as well the thing is these gradients on the input data are not that useful to us
01:57:32.880 and that's because the input data seems to be not changeable it's it's a given to the problem
01:57:39.840 and so it's a fixed input we're not going to be changing it or messing with it even though we do
01:57:43.520 have gradients for it but some of these gradients here will be for the neural network parameters
01:57:51.200 the w's and the b's and those we of course do want to change okay so now we're going to want
01:57:57.840 some convenience code to gather up all of the parameters of the neural net so that we can operate
01:58:02.320 on all of them simultaneously and every one of them we will nudge a tiny amount based on the
01:58:09.120 gradient information so let's collect the parameters of the neural net all in one array so let's create
01:58:15.840 a parameters of self that just returns self dot w which is a list concatenated with a list of
01:58:24.880 self dot b so this will just return a list list plus list just you know gives you a list so that's
01:58:32.560 parameters of neuron and i'm calling it this way because also PyTorch has a parameters on every
01:58:38.480 single nn module and it does exactly what we're doing here it just returns the
01:58:43.200 parameter tensors for them for us these are the parameter scalars now layer is also a module so it will have parameters
01:58:51.520 self and basically what we want to do here is something like this like
01:59:00.240 params is here and then for neuron in self dot neurons we want to get neuron dot parameters
01:59:09.040 and we want to params dot extend all right so these are the parameters of this neuron and then
01:59:17.040 we want to put them on top of params so params dot extend of ps and then we want to return
01:59:23.520 params so this there's way too much code so actually there's a way to simplify this
01:59:29.360 which is return p for neuron in self dot neurons for p in neuron dot parameters
01:59:44.000 so it's a single list comprehension in python you can sort of nest them like this and you can
01:59:51.520 then create the desired array so this is these are identical we can take this out
02:00:01.360 that parameters self and return a parameter for layer in self dot layers for p in layer dot
02:00:16.880 parameters and that should be good now let me pop out this so we don't re-initialize our network
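(a minimal sketch of the three parameters methods just dictated, assuming the Neuron, Layer and MLP classes built earlier in the lecture with self.w, self.b, self.neurons and self.layers attributes:)

    class Neuron:
        # ... __init__ and __call__ as defined earlier in the lecture ...
        def parameters(self):
            # the weights plus the bias, as one flat list of Value objects
            return self.w + [self.b]

    class Layer:
        # ... __init__ and __call__ as defined earlier in the lecture ...
        def parameters(self):
            # gather the parameters of every neuron in this layer
            return [p for neuron in self.neurons for p in neuron.parameters()]

    class MLP:
        # ... __init__ and __call__ as defined earlier in the lecture ...
        def parameters(self):
            # gather the parameters of every layer in the network
            return [p for layer in self.layers for p in layer.parameters()]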
02:00:28.960 now ideally i would pop this cell out so we don't re-initialize our network but okay so unfortunately we will have to
02:00:37.600 re-initialize the network because we just added new functionality to this class and if
02:00:42.880 i want to get all of n.parameters that's not going to work because n is an instance of the old class
02:00:48.240 okay so unfortunately we do have to re-initialize the network which will change some of the numbers
02:00:54.480 but let me do that so that we pick up the new api we can now do n.parameters
02:00:59.280 and these are all the weights and biases inside the entire neural net so in total this mlp has 41
02:01:08.720 parameters and now we'll be able to change them if we recalculate the loss here we see that
02:01:18.560 unfortunately we have slightly different predictions and slightly different loss
02:01:24.080 but that's okay okay so we see that this neuron's gradient is slightly negative we can also look
02:01:33.600 at its data right now which is 0.85 so this is the current value of this neuron and this is its
02:01:40.560 gradient on the loss so what we want to do now is we want to iterate for every p in n dot
02:01:48.160 parameters so for all the 41 parameters of this neural net we actually want to change p dot data
02:01:54.000 slightly according to the gradient information okay so i'll leave a little to-do here but this will be
02:02:02.960 basically a tiny update in this gradient descent scheme in gradient descent we are thinking of the
02:02:10.480 gradient as a vector pointing in the direction of increased loss and so in gradient descent we are
02:02:21.280 modifying p dot data by a small step size in the direction of the gradient so the step size as an
02:02:28.400 example could be a very small number like 0.01 as the step size times p dot grad right but we have
02:02:36.800 to think through some of the signs here so in particular working with this specific example here
02:02:43.680 we see that if we just left it like this then this neuron's value would currently be increased by a
02:02:50.800 tiny amount of the gradient the gradient is negative so the value of this neuron would go
02:02:57.040 slightly down it would become like 0.84 or something like that but if this neuron's
02:03:03.840 value goes lower that would actually increase the loss because the derivative of the loss with respect to this neuron
02:03:14.000 is negative so increasing this makes the loss go down so increasing it is what we want to do
02:03:20.720 instead of decreasing it so basically what we're missing here is we're actually missing a negative
02:03:25.120 sign and there's this other interpretation too that's because we want to minimize the loss we don't
02:03:30.880 want to maximize the loss we want to decrease it and the other interpretation as i mentioned is
02:03:35.760 you can think of the gradient vector so basically just the vector of all the gradients as pointing
02:03:41.280 in the direction of increasing the loss but then we want to decrease it so we actually want to go
02:03:47.200 in the opposite direction and so you can convince yourself that the negative sign is the right thing
02:03:51.920 here because we want to minimize the loss
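(a minimal sketch of this update, assuming the MLP instance n from earlier and using the 0.01 step size mentioned above:)

    # gradient descent update: step against the gradient to decrease the loss
    for p in n.parameters():
        p.data += -0.01 * p.grad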
02:03:57.120 so if we nudge all the parameters by a tiny amount then we'll see that this data will have changed a little bit so now this neuron has a
02:04:06.640 slightly greater value it went from 0.854 to 0.857 and that's a good thing because slightly increasing
02:04:16.480 this neuron's data makes the loss go down according to the gradient and so the correct thing has
02:04:23.840 happened signwise and so now what we would expect of course is that because we've changed all these
02:04:30.320 parameters we expect that the loss should have gone down a bit so we want to reevaluate the loss
02:04:36.880 let me basically this is just a data definition that hasn't changed but the forward pass here
02:04:44.800 of the network we can recalculate and actually let me do it outside here so that we can compare
02:04:52.320 the two loss values so here if I recalculate the loss we'd expect the new loss now to be slightly
02:04:59.920 lower than this number so hopefully what we're getting now is a tiny bit lower than 4.84
02:05:04.960 4.36 okay and remember the way we've arranged this is that low loss means that our predictions
02:05:13.440 are matching the targets so our predictions now are probably slightly closer to the targets
02:05:18.800 and now all we have to do is we have to iterate this process so again we've done the forward pass
02:05:26.400 and this is the loss now we can call loss.backward let me take these out and we can do a step size
02:05:32.720 and now we should have a slightly lower loss 4.36 goes to 3.9 and okay so we've done the forward
02:05:42.320 pass here's the backward pass nudge and now the loss is 3.66 3.47 and you get the idea we just
02:05:53.840 continue doing this and this is gradient descent we're just iteratively doing forward pass backward
02:05:59.440 pass update forward pass backward pass update and the neural net is improving its predictions
02:06:05.600 so here if we look at y-pred now y-pred we see that this value should be getting closer to 1
02:06:16.160 so this value should be getting more positive these should be getting more negative and this one
02:06:20.160 should be also getting more positive so if we just iterate this a few more times
02:06:24.560 actually we'll be able to afford to go a bit faster let's try a slightly higher learning rate
02:06:34.240 oops okay there we go so now we're at 0.31 if you go too fast by the way if you try to take too
02:06:41.840 big of a step you may actually overstep it's like overconfidence because again remember we don't
02:06:49.280 actually know exactly about the loss function the loss function has all kinds of structure and we
02:06:53.840 only know about the very local dependence of all these parameters on the loss but if we step too far
02:06:59.360 we may step into you know a part of the loss that is completely different and that can destabilize
02:07:04.400 training and make your loss actually blow up even so the loss is now 0.04 so actually the predictions
02:07:11.600 should be really quite close let's take a look so you see how this is almost one almost negative
02:07:17.760 one almost one we can continue going so yep backward update oops there we go so we went way too
02:07:27.520 fast and we actually overstepped so we got too eager where are we now oops okay
02:07:36.560 7e-9 so this is very very low loss and the predictions are basically perfect
02:07:44.000 so somehow we were basically doing way too big updates and we briefly exploded but then
02:07:50.560 somehow we ended up getting into a really good spot so usually this learning rate and the tuning
02:07:55.920 of it is a subtle art you want to set your learning rate carefully if it's too low you're going to take
02:08:00.720 way too long to converge but if it's too high the whole thing gets unstable and you might actually
02:08:05.040 even explode the loss depending on your loss function so finding the step size to be just right it's
02:08:11.760 it's a pretty subtle art sometimes when you're using sort of vanilla gradient descent but we
02:08:16.240 happen to get into a good spot we can look at n.parameters so this is the setting of
02:08:24.800 weights and biases that makes our network predict the desired targets very very close
02:08:32.640 and basically we've successfully trained a neural net okay let's make this a tiny bit more
02:08:40.160 respectable and implement an actual training loop and what that looks like so this is the data
02:08:44.560 definition that stays this is the forward pass so for k in range you know we're going to
02:08:53.920 take a bunch of steps first you do the forward pass we evaluate the loss
02:09:00.960 let's reinitialize the neural net from scratch and here's the data and we first do the forward pass
02:09:10.800 then we do the backward pass and then we do an update that's gradient descent
02:09:21.840 and then we should be able to iterate this and we should be able to print the current step
02:09:29.600 the current loss let's just print the step number and the value of the loss and that should be it
02:09:39.040 and then the learning rate 0.01 is a little too small 0.1 we saw is a little bit
02:09:44.640 dangerous if it's too high let's go somewhere in between and we'll optimize this for not 10 steps
02:09:50.880 but let's go for say 20 steps let me erase all of this junk and let's run the optimization
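(a minimal sketch of the loop as dictated at this point, assuming the xs, ys data lists and the MLP instance n from earlier in the lecture, and picking 0.05 as an in-between step size; note it still lacks the zero-grad that gets added a bit further below:)

    for k in range(20):
        # forward pass: predictions for all examples, then the squared-error loss
        ypred = [n(x) for x in xs]
        loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
        # backward pass
        loss.backward()
        # update: nudge every parameter against its gradient
        for p in n.parameters():
            p.data += -0.05 * p.grad
        print(k, loss.data)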
02:10:00.800 and you see how we've actually converged slower in a more controlled manner and got to a loss that
02:10:09.280 is very low so I expect y-pred to be quite good there we go
02:10:16.240 and that's it okay so this is kind of embarrassing but we actually have a really terrible bug
02:10:27.520 in here and it's a subtle bug and it's a very common bug and I can't believe I've done it for
02:10:33.840 the 20th time in my life especially on camera and I could have re-shot the whole thing but I think
02:10:39.600 it's pretty funny and you know you get to appreciate a bit what working with neural nets maybe is
02:10:45.920 like sometimes we are guilty of a common bug I've actually tweeted the most common neural net
02:10:53.760 mistakes a long time ago now and I'm not really gonna explain any of these except for we are
02:11:01.840 guilty of number three you forgot to zero grad before dot backward what is that
02:11:06.800 basically what's happening and it's a subtle bug and I'm not sure if you saw it is that all of
02:11:14.640 these weights here have a dot data and a dot grad and dot grad starts at zero and then we do
02:11:23.280 backward and we fill in the gradients and then we do an update on the data but we don't flush the
02:11:28.880 grad it stays there so when we do the second forward pass and we do backward again remember that all
02:11:36.160 the backward operations do a plus equals on the grad and so these gradients just add up and they
02:11:42.320 never get reset to zero so basically we didn't zero grad so here's how we zero grad before
02:11:49.280 backward we need to iterate over all the parameters and we need to make sure that p dot grad is set to
02:11:56.960 zero we need to reset it to zero just like it is in the constructor so remember all the way here
02:12:04.080 for all these value nodes grad is reset to zero and then all these backward passes do a plus equals
02:12:09.840 from that grad but we need to make sure that we reset these grads to zero so that when we do backward
02:12:17.280 all of them start at zero and the actual backward pass accumulates the loss derivatives into the
02:12:24.320 grads so this is zero grad in pytorch and we will get a slightly different optimization
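(the zero-grad fix just described, as it would slot into the training loop sketch above:)

    # zero out the gradients first, otherwise the backward passes keep accumulating via +=
    for p in n.parameters():
        p.grad = 0.0
    # backward pass
    loss.backward()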
02:12:33.120 let's reset the neural net the data is the same this is now i think correct and we get a
02:12:39.440 much slower descent we still end up with pretty good results and we can
02:12:46.640 continue this a bit more to get down lower and lower and lower yeah so the only reason that
02:12:57.040 the previous thing worked it's extremely buggy the only reason that worked is that
02:13:01.520 this is a very very simple problem and it's very easy for this neural net to fit this data
02:13:08.640 and so the grads ended up accumulating and it effectively gave us a massive step size
02:13:14.720 and it made us converge extremely fast but basically now we have to do more steps
02:13:21.600 to get to very low values of loss and get y-pred to be really good we can try to take slightly bigger steps
02:13:29.040 yeah we're going to get closer and closer to one minus one and one so the point is working with
02:13:40.240 neural nets is sometimes tricky because you may have lots of bugs in the code and your network
02:13:48.800 might actually work just like ours worked but chances are that if we had a more complex problem
02:13:54.160 then actually this bug would have made us not optimize the loss very well and we were only able
02:13:58.960 to get away with it because the problem is very simple so let's now bring everything together
02:14:05.040 and summarize what we learned what are neural nets neural nets are these mathematical expressions
02:14:10.480 fairly simple mathematical expressions in the case of a multi-layer perceptron that take
02:14:16.000 the data as input and they take as input the weights the parameters of the neural net
02:14:21.600 it's a mathematical expression for the forward pass followed by a loss function and the loss function
02:14:26.640 tries to measure the accuracy of the predictions and usually the loss will be low when your predictions
02:14:32.400 are matching your targets or when the network is basically behaving well so we manipulate
02:14:37.760 the loss function so that when the loss is low the network is doing what you wanted to do on your
02:14:42.400 problem and then we backward the loss use back propagation to get the gradient and then we know
02:14:49.280 how to tune all the parameters to decrease the loss locally but then we have to iterate that
02:14:53.920 process many times in what's called gradient descent so we simply follow the gradient information
02:14:59.200 and that minimizes the loss and the loss is arranged so that when the loss is minimized
02:15:03.680 the network is doing what you want it to do and yeah so we just have a blob of neural stuff and
02:15:10.400 we can make it do arbitrary things and that's what gives neural nets their power it's you know this
02:15:16.080 is a very tiny network with 41 parameters but you can build significantly more complicated neural
02:15:21.600 nets with billions at this point almost trillions of parameters and it's a massive blob of neural
02:15:28.480 tissue simulated neural tissue roughly speaking and you can make it do extremely complex problems
02:15:35.680 and these neural nets then have all kinds of very fascinating emergent properties when you try
02:15:42.080 to make them do significantly hard problems as in the case of GPT for example we have massive
02:15:48.880 amounts of text from the internet and we're trying to get a neural net to take like a
02:15:53.200 few words and predict the next word in a sequence that's the learning problem and it turns
02:15:58.080 out that when you train this on all of internet the neural net actually has like really remarkable
02:16:02.480 emergent properties but that neural net would have hundreds of billions of parameters
02:16:06.240 but it works on fundamentally these exact same principles the neural net of course will be a
02:16:12.000 bit more complex but otherwise evaluating the gradient is there and it would be identical
02:16:18.800 and the gradient descent would be there and would be basically identical but people usually use
02:16:23.440 slightly different updates this is a very simple stochastic gradient descent update
02:16:27.520 and the loss function would not be mean squared error they would be using something called
02:16:32.560 the cross entropy loss for predicting the next token so there's a few more details but fundamentally
02:16:37.520 the neural network setup and neural network training is identical and pervasive and now you
02:16:42.480 understand intuitively how that works under the hood in the beginning of this video i told you
02:16:47.280 that by the end of it you would understand everything in micrograd and that we would slowly
02:16:51.040 build it up let me briefly prove that to you so i'm going to step through all the code that is in
02:16:55.600 micrograd as of today actually potentially some of the code will change by the time you watch this
02:17:00.320 video because i intend to continue developing micrograd but let's look at what we have so far
02:17:04.880 at least init.py is empty when you go to engine.py that has the Value class everything here you should
02:17:10.880 mostly recognize so we have the data and the grad attributes we have the backward function
02:17:16.080 we have the previous set of children and the operation that produced this value we have addition
02:17:21.520 multiplication and raising to a scalar power we have the relu nonlinearity which is a slightly
02:17:27.360 different type of nonlinearity than tanh that we used in this video both of them are nonlinearities
02:17:32.560 and notably tanh is not actually present in micrograd as of right now but i intend to add it later
02:17:37.760 with the backward which is identical and then all of these other operations which are built up
02:17:43.520 on top of the operations here so the Value class should be very recognizable except for the nonlinearity used in
02:17:48.480 this video there's no massive difference between relu and tanh and sigmoid and these other nonlinearities
02:17:55.200 they're all roughly equivalent and can be used in MLPs so i used tanh because it's a bit smoother
02:18:00.240 and because it's a little bit more complicated than relu and therefore it stressed a little bit more
02:18:04.880 the local gradients and working with those derivatives which i thought would be useful
02:18:10.720 and then nn.py is the neural networks library as i mentioned so you should recognize identical
02:18:15.520 implementations of Neuron Layer and MLP notably or not so much we have a class Module here which is
02:18:22.080 a parent class of all these modules i did that because there's an nn.Module class in PyTorch
02:18:27.840 and so this exactly matches that api and nn.Module in PyTorch also has a zero_grad which
02:18:32.960 i refactored out here so that's the end of micrograd really then there's a test which you'll see
02:18:41.440 basically creates two chunks of code one in micrograd and one in pytorch and we'll make sure
02:18:47.520 that the forward and the backward passes agree identically for a slightly less complicated
02:18:51.920 expression and slightly more complicated expression everything agrees so we agree with pytorch and
02:18:57.280 all of these operations and finally there's a demo.ipynb here and it's a bit more
02:19:02.320 complicated binary classification demo than the one i covered in this lecture so we only had a
02:19:07.040 tiny dataset of four examples here we have a bit more complicated example with lots of blue
02:19:12.560 points and lots of red points and we're trying to again build a binary classifier to distinguish
02:19:17.840 two-dimensional points as red or blue the mlp here is a bit more complicated it's a bigger
02:19:23.440 mlp the loss is a bit more complicated because it supports batches so because our dataset was so
02:19:30.560 tiny we always did a forward pass on the entire dataset of four examples but when your dataset
02:19:35.520 is like a million examples what we usually do in practice is we basically pick out some random
02:19:41.200 subset we call that a batch and then we only process the batch forward backward and update
02:19:46.480 so we don't have to forward the entire training set so this demo supports batching because there are a lot more examples here
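(a minimal sketch of that batching idea, with hypothetical names X, y and batch_size that are not from the lecture:)

    import random

    batch_size = 32  # hypothetical batch size
    # pick a random subset of the training examples for this step
    idx = random.sample(range(len(X)), batch_size)
    xb = [X[i] for i in idx]
    yb = [y[i] for i in idx]
    # the forward pass, backward pass and update then only touch xb and yb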
02:19:51.440 we do a forward pass on the batch the loss is slightly different this is a max
02:19:57.600 margin loss that i implement here the one that we used was the mean squared error loss because
02:20:02.960 it's the simplest one there's also the binary cross entropy loss all of them can be used for
02:20:07.760 binary classification and don't make too much of a difference in the simple examples that we looked at so far
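(a hedged sketch of a max-margin style loss in the micrograd setting, assuming targets yb in {-1, +1} and raw scores produced by the model; the names here are illustrative and this is not necessarily the exact demo code:)

    # hinge / max-margin loss: penalize predictions that are not confidently on the correct side
    losses = [(1 + -yi * scorei).relu() for yi, scorei in zip(yb, scores)]
    data_loss = sum(losses) * (1.0 / len(losses))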
02:20:11.920 there's also something called l2 regularization used here this has to do with the generalization of
02:20:18.560 the neural net and controls overfitting in a machine learning setting but i did not cover these concepts in this video potentially later
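(a hedged sketch of adding an l2 penalty on top of a data loss, with a hypothetical alpha and model name, building on the data_loss sketched just above:)

    # l2 regularization: penalize large weights to reduce overfitting
    alpha = 1e-4  # hypothetical regularization strength
    reg_loss = alpha * sum((p * p for p in model.parameters()))
    total_loss = data_loss + reg_loss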
02:20:23.360 and the training loop you should recognize it's forward
02:20:29.440 backward with zero grad and update and so on you'll notice that in the update here the learning rate
02:20:36.640 is scaled as a function of number of iterations and it shrinks and this is something called learning
02:20:42.960 rate decay so in the beginning you have a high learning rate and as the network sort of stabilizes
02:20:48.000 near the end you bring down the learning rate to get some of the fine details right at the end
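(a hedged sketch of such a learning rate decay schedule inside a training loop, with hypothetical numbers and a hypothetical model name, not necessarily the demo's exact schedule:)

    for k in range(100):
        # ... forward pass, zero grad, backward pass ...
        # learning rate shrinks linearly as training proceeds
        learning_rate = 1.0 - 0.9 * k / 100
        for p in model.parameters():
            p.data -= learning_rate * p.grad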
02:20:53.840 and in the end we see the decision surface of the neural net and we see that it learned to separate out the red
02:20:58.480 and the blue area based on the data points so that's the slightly more complicated example and
02:21:03.920 then that's the demo.ipynb that you're free to go over but yeah as of today that is micrograd
02:21:09.280 i also wanted to show you a little bit of real stuff so that you get to see how this is
02:21:13.520 actually implemented in a production-grade library like pytorch so in particular i
02:21:18.480 wanted to find and show you the backward pass for tanh in pytorch so here in micrograd we see that
02:21:24.720 the backward pass for tanh is one minus t squared where t is the output of the tanh of x times
02:21:33.760 out.grad which is the chain rule so we're looking for something that looks like this
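(for reference, a sketch of the micrograd tanh as written earlier in the lecture, assuming the Value class constructor from before with data, children and op arguments; this method lives on the Value class:)

    import math

    def tanh(self):
        x = self.data
        t = (math.exp(2 * x) - 1) / (math.exp(2 * x) + 1)
        out = Value(t, (self,), 'tanh')

        def _backward():
            # chain rule: local derivative of tanh is (1 - t**2), scaled by out.grad
            self.grad += (1 - t**2) * out.grad
        out._backward = _backward

        return out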
02:21:39.920 now i went to pytorch which has an open source github code base and i looked through a lot of its code
02:21:46.240 and honestly i spent about 15 minutes and i couldn't find tanh and that's because these libraries
02:21:53.280 unfortunately they grow in size and entropy and if you just search for tanh you get apparently
02:21:58.800 2,800 results across 406 files so i don't know what these files are doing honestly
02:22:07.760 and why there are so many mentions of tanh but unfortunately these libraries are quite complex
02:22:12.000 they're meant to be used not really inspected eventually i did stumble on someone who tries to
02:22:19.840 change the tanh backward code for some reason and someone here pointed to the cpu kernel and the
02:22:24.880 cuda kernel for tanh backward so this basically depends on if you're using pytorch on a cpu device
02:22:31.440 or on a gpu which are different devices and i haven't covered this but this is the tanh
02:22:36.080 backward kernel for cpu and the reason it's so large is that number one this is like if you're
02:22:45.200 using a complex type which we haven't even talked about and if you're using a specific data type of
02:22:49.440 bfloat16 which we haven't talked about and then if you're not then this is the kernel and
02:22:55.920 deep here we see something that resembles our backward pass so they have a times 1 minus b squared
02:23:03.760 so this b here must be the output of the tanh and this a is the out.grad so here we found it
02:23:10.400 deep inside pytorch at this location for some reason inside the binary ops kernel even though tanh is not
02:23:18.880 actually a binary op and then this is the gpu kernel we're not complex we're here and here we go
02:23:28.880 it's one line of code so we did find it but basically unfortunately these code bases are very large and
02:23:35.440 micrograd is very very simple but if you actually want to use real stuff finding the code for it
02:23:41.360 you'll actually find that difficult i also wanted to show you a little example here where pytorch is
02:23:47.520 showing you how you can register a new type of function that you want to add to pytorch as a
02:23:52.080 lego building block so here if you want to for example add a Legendre polynomial 3
02:23:57.600 here's how you could do it you will register it as a class that subclasses torch.autograd.
02:24:05.200 Function and then you have to tell pytorch how to forward your new function and how to backward
02:24:11.680 through it so as long as you can do the forward pass of this little function piece that you want
02:24:15.760 to add and as long as you know the local derivative the local gradients which are implemented in
02:24:20.480 the backward pytorch will be able to back propagate through your function and then you can use this
02:24:24.960 as a lego block in a larger lego castle of all the different lego blocks that pytorch already has
02:24:30.320 and so that's the only thing you have to tell pytorch and everything would just work and you can
02:24:34.640 register new types of functions in this way following this example
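(a sketch of what such a registration can look like, roughly along the lines of the official pytorch Legendre polynomial example; the class and variable names here are illustrative:)

    import torch

    class LegendrePolynomial3(torch.autograd.Function):
        @staticmethod
        def forward(ctx, input):
            # P3(x) = 0.5 * (5x^3 - 3x); save the input for the backward pass
            ctx.save_for_backward(input)
            return 0.5 * (5 * input ** 3 - 3 * input)

        @staticmethod
        def backward(ctx, grad_output):
            # chain rule: local derivative P3'(x) = 1.5 * (5x^2 - 1) times the incoming gradient
            (input,) = ctx.saved_tensors
            return grad_output * 1.5 * (5 * input ** 2 - 1)

    # usage: apply the custom function like any other pytorch op
    x = torch.linspace(-1, 1, steps=5, requires_grad=True)
    y = LegendrePolynomial3.apply(x)
    y.sum().backward()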
02:24:40.000 and that is everything that i wanted to cover in this lecture so i hope you enjoyed building out micrograd with me i hope you found it
02:24:44.400 interesting and insightful and yeah i will post a lot of the links that are related to this video in the
02:24:51.680 video description below i will also probably post a link to a discussion forum or discussion group
02:24:57.280 where you can ask questions related to this video and then i can answer or someone else can answer
02:25:02.240 your questions and i may also do a follow-up video that answers some of the most common questions
02:25:07.120 but for now that's it i hope you enjoyed it if you did then please like and subscribe so that
02:25:12.880 youtube knows to feature this video to more people and that's it for now i'll see you later
02:25:17.360 now here's the problem we know dl by wait what is the problem
02:25:29.520 and that's everything i wanted to cover in this lecture so i hope um you enjoyed us building
02:25:37.040 out micrograd micrograd okay now let's do the exact same thing for multiply because we can't do
02:25:44.880 something like eight times two oops i know what happened there