00:00:01.200 Today we are continuing our implementation of Makemore.
00:00:04.200 Now in the last lecture, we implemented the Multilayer
00:00:06.080 Perceptron along the lines of Bengio et al. 2003.
00:00:10.720 So we followed this paper, took in a few characters
00:00:13.160 in the past, and used an MLP to predict the next character
00:00:18.520 like to move on to more complex and larger neural networks,
00:00:21.480 like recurrent neural networks, and their variations
00:00:26.440 Now before we do that though, we have to stick around the level of the multilayer perceptron for a bit longer.
00:00:31.600 And I'd like to do this because I would like us
00:00:35.400 of the activations in the neural net during training,
00:00:38.360 and especially the gradients that are flowing backwards,
00:00:48.040 Because we'll see that recurrent neural networks,
00:00:52.400 are a universal approximator, and can in principle implement
00:00:55.840 all the algorithms, we'll see that they are not
00:00:58.760 very easily optimizable with the first-order gradient-based
00:01:06.920 optimizable easily is to understand the activations
00:01:10.520 and the gradients and how they behave during training.
00:01:24.280 is largely the code from before, but I've cleaned it up
00:01:28.320 So you'll see that we are importing all the torch and matplotlib
00:01:39.320 Here's a vocabulary of all the lowercase letters
00:01:44.360 Here we are reading the data set and processing it
00:01:47.760 and creating three splits, the train, dev, and the test
00:01:56.400 except you see that I removed a bunch of magic numbers
00:02:03.520 and the number of hidden units in the hidden layer.
00:02:07.880 so that we don't have to go and change all these magic numbers
00:02:11.680 We have the same neural net with 11,000 parameters
00:02:21.320 a little bit, but there are no functional changes.
00:02:23.640 I just created a few extra variables, a few more comments,
00:02:32.040 Then when we optimize, we saw that our loss looked
00:02:36.000 We saw that the train and val loss were about 2.16
00:02:50.120 And then here, depending on train, val or test,
00:02:55.560 And then this is the forward pass of the network
00:03:07.640 which you can also look up and read documentation of.
00:03:11.360 Basically what this decorator does on top of a function
00:03:17.640 is seen by torch to never require a gradient.
00:03:24.240 that it does to keep track of all the gradients
00:03:29.640 It's almost as if all the tensors that get created here
00:03:34.520 And so it just makes everything much more efficient
00:03:36.440 because you're telling torch that I will not call
00:03:40.680 and you don't need to maintain the graph under the hood.
00:03:48.520 with torch dot no grad and you can look those up.
00:03:54.960 Just as before, just a forward pass of the neural net,
00:04:06.840 much nicer looking words, sampled from the model.
00:04:09.800 It's still not amazing, and they're still not fully name-like.
00:04:22.640 I can tell that our network is very improperly configured
00:04:26.160 at initialization and there's multiple things wrong with it
00:04:31.200 Look here at the zeroth iteration, the very first iteration.
00:04:34.880 We are recording a loss of 27 and this rapidly comes down
00:04:40.320 So I can tell that the initialization is all messed up
00:04:44.400 In training of neural nets, it is almost always the case
00:04:46.920 that you will have a rough idea for what loss to expect
00:04:49.400 at initialization and that just depends on the loss function
00:04:57.120 I expect a much lower number and we can calculate it together.
00:05:00.600 Basically at initialization, what we'd like is that
00:05:09.040 At initialization, we have no reason to believe any characters
00:05:13.680 And so we'd expect that the probability distribution
00:05:15.800 that comes out initially is a uniform distribution
00:05:19.160 assigning about equal probability to all the 27 characters.
00:05:25.720 for any character would be roughly one over 27.
00:05:33.880 and then the loss is the negative log probability.
00:05:49.920 And so what's happening right now is that at initialization,
00:05:52.880 the neural net is creating probability distributions
00:06:05.240 and that's what makes it record very high loss.
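As a quick sanity check of that expected number (a small sketch, not a cell from the notebook):

```python
import torch

# With a uniform distribution over the 27 characters, the expected
# negative log likelihood at initialization is -log(1/27):
expected_loss = -torch.tensor(1.0 / 27).log()
print(expected_loss)  # ~3.2958
```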
00:06:10.240 So here's a smaller four dimensional example of the issue.
00:06:15.960 and then we have logits that come out of the neural net
00:06:23.800 we get probabilities that are a diffuse distribution.
00:06:31.120 And then in this case, if the label is say two,
00:06:33.760 it doesn't actually matter if the label is two or three
00:06:37.160 or one or zero because it's a uniform distribution,
00:06:43.160 So this is the loss we would expect for a four dimensional example.
00:06:55.800 this could be a very high number like five or something like that.
00:06:59.320 Then in that case, we'll record a very low loss
00:07:02.840 at initialization by chance to the correct label.
00:07:06.720 Much more likely it is that some other dimension
00:07:14.040 And then what will happen is we start to record
00:07:17.160 And what can happen is basically the logits come out
00:07:26.560 For example, if we have torch.randn of four,
00:07:31.680 so these are normally distributed numbers, four of them.
00:07:49.680 for the most part, the loss that comes out is okay.
00:07:55.680 You see how because these are more extreme values,
00:08:00.680 it's very unlikely that you're going to be guessing
00:08:09.720 If your logits are coming up even more extreme,
00:08:14.680 you might get losses like infinity even at initialization.
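Here is a minimal sketch of that four-dimensional example, using F.cross_entropy on made-up logits (the exact numbers vary with the random draw):

```python
import torch
import torch.nn.functional as F

label = torch.tensor([2])
logits_uniform = torch.zeros(1, 4)        # all-equal logits -> diffuse, uniform probabilities
logits_extreme = torch.randn(1, 4) * 10   # large random logits -> usually confidently wrong

print(F.cross_entropy(logits_uniform, label))  # ~1.386 = -log(1/4)
print(F.cross_entropy(logits_extreme, label))  # typically much higher, varies run to run
```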
00:08:27.880 In fact, the logits don't have to be just zero,
00:08:34.160 then because of the normalization inside the softmax,
00:08:38.560 But by symmetry, we don't want it to be any arbitrary,
00:08:43.600 and record the loss that we expect at initialization.
00:08:53.520 And here let me break after the very first iteration.
00:09:01.400 And intuitively, now we can inspect the variables involved.
00:09:10.960 we see that the logits take on quite extreme values.
00:09:22.120 So these logits should be much, much closer to zero.
00:09:37.680 So first of all, currently we're initializing B2
00:09:46.680 we don't actually want to be adding a bias of random numbers.
00:10:03.000 then we would be multiplying W2 and making that smaller.
00:10:14.440 you see that we are getting much closer to what we expect.
00:10:17.360 So roughly what we want is about 3.29, this is 4.2.
00:10:32.680 Then we get, of course, exactly what we're looking for
00:10:42.360 and I'll show you in a second why you don't want to be
00:10:44.880 setting W's or weights of a neural net exactly to zero.
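The change being described looks roughly like this (a sketch; n_hidden, vocab_size, and the generator seed are assumed from this lecture's notebook):

```python
import torch

n_hidden, vocab_size = 200, 27  # sizes assumed from this lecture's MLP
g = torch.Generator().manual_seed(2147483647)

W2 = torch.randn((n_hidden, vocab_size), generator=g) * 0.01  # small, but not exactly zero
b2 = torch.randn(vocab_size, generator=g) * 0                 # no bias offset at initialization
```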
00:10:57.320 but I'll show you in a second where things go wrong
00:11:24.960 we can run the optimization with this new initialization,
00:11:42.240 because basically what's happening in the hockey stick,
00:11:50.000 is the optimization is just squashing down the logits,
00:11:58.680 where the weights were just being shrunk down.
00:12:01.840 And so therefore, we don't get these easy gains
00:12:11.440 So good things are happening in that both, number one,
00:12:20.640 and this is true for any neural net you might train,
00:12:29.560 Unfortunately, I erased what we had here before.
00:12:49.000 several thousand iterations, probably just squashing down the weights,
00:12:56.840 So something to look out for, and that's number one.
00:13:14.160 lurking inside this neural net and its initialization.
00:13:25.360 Now if we just visualize this vector, sorry, this tensor h,
00:13:31.720 is you see how many of the elements are one or negative one.
00:13:41.760 and it squashes them into a range of negative one and one,
00:14:13.680 And then we can pass this into plt.hist for histogram,
00:14:20.040 and a semicolon to suppress a bunch of output we don't want.
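Something along these lines (a sketch with a stand-in h, since the real h comes from the forward pass):

```python
import torch
import matplotlib.pyplot as plt

# stand-in for the lecture's h: tanh of overly broad pre-activations
h = torch.tanh(torch.randn(32, 200) * 5)
plt.hist(h.view(-1).tolist(), 50);  # most of the mass piles up near -1 and +1
```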
00:14:55.680 and lots of numbers here take on very extreme values.
00:15:07.840 of how these gradients flow through a neural net,
00:15:16.320 We have to keep in mind that during backpropagation,
00:15:19.880 we are doing backward pass starting at the loss
00:15:36.560 So let's look at what happens in tanh in the backward pass.
00:15:39.760 We can actually go back to our previous micro grad code
00:15:42.920 in the very first lecture and see how we implemented tanh.
00:15:49.200 and then we calculate t, which is the tanh of x.
00:15:52.320 So that's t, and t is between negative one and one.
00:16:03.920 this is the chain rule with the local gradient,
00:16:16.000 you're going to get zero multiplying out.grad,
00:16:22.880 and we're stopping effectively the backward propagation
00:16:48.000 is not going to impact the output of the tanh too much,
00:16:55.680 And so therefore, there's no impact on the loss.
00:17:02.480 along with this tanh neuron do not impact the loss,
00:17:17.200 the gradient would be basically zero, it vanishes.
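Numerically, the local gradient 1 - t**2 looks like this (a small illustration, not code from the lecture):

```python
import torch

x = torch.tensor([-3.0, -0.5, 0.0, 0.5, 3.0])
t = torch.tanh(x)
local_grad = 1 - t**2   # the factor that multiplies out.grad in the chain rule
print(t)                # ~[-0.995, -0.462, 0.000, 0.462, 0.995]
print(local_grad)       # ~[ 0.010,  0.786, 1.000, 0.786, 0.010] -> near zero in the tails
```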
00:17:27.320 So when the tanh takes on exactly the value of zero,
00:17:49.480 So in fact, you'll see that the gradient flowing
00:18:07.040 And so the concern here is that if all of these outputs h
00:18:12.040 are in the flat regions of negative one and one,
00:18:15.000 then the gradients that are flowing through the network
00:18:32.920 take the absolute value and see how often it is
00:18:52.400 And so basically what we have here is the 32 examples
00:19:22.520 if it was the case that the entire column is white.
00:19:26.200 Because in that case we have what's called a dead neuron,
00:19:30.040 where the initialization of the weights and the biases
00:19:34.400 ever activates this tanh in the sort of active part
00:19:46.560 And so just scrutinizing this and looking for columns
00:19:50.360 of completely white, we see that this is not the case.
00:19:54.160 So I don't see a single neuron that is all of white.
00:19:59.400 And so therefore it is the case that for every one of these
00:20:05.400 that activate them in the active part of the tanh.
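The visualization being described is roughly this (a sketch with a stand-in h):

```python
import torch
import matplotlib.pyplot as plt

h = torch.tanh(torch.randn(32, 200) * 5)  # stand-in activations, 32 examples x 200 neurons

plt.figure(figsize=(20, 10))
# white = saturated (|tanh| > 0.99); a completely white column would be a dead neuron
plt.imshow(h.abs() > 0.99, cmap='gray', interpolation='nearest');
```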
00:20:24.320 when no matter what inputs you plug in from your data set,
00:20:33.400 because all the gradients will just be zeroed out.
00:20:50.000 but basically the same will actually apply to sigmoid.
00:20:59.120 So relu has a completely flat region here below zero.
00:21:17.200 this would be exactly zeroing out the gradient.
00:21:19.880 Like all of the gradients would be set exactly to zero
00:21:28.400 And so you can get, for example, a dead relu neuron
00:21:31.480 and a dead relu neuron would basically look like.
00:21:41.120 so for any examples that you plug in in the dataset,
00:21:44.000 it never turns on, it's always in this flat region,
00:21:55.720 And this can sometimes happen at initialization
00:21:57.960 because the weights and the biases just make it
00:21:59.600 so that by chance some neurons are just forever dead.
00:22:04.880 If you have like too high of a learning rate, for example,
00:22:10.360 and they get knocked off of the data manifold.
00:22:25.400 if your learning rate is very high, for example,
00:22:29.680 you train the neural net and you get some final loss.
00:22:46.400 And usually what happens is that during training,
00:22:48.400 these relu neurons are changing, moving, et cetera.
00:22:50.680 And then because of a high gradient somewhere by chance,
00:22:53.080 they get knocked off and then nothing ever activates them.
00:22:59.000 So that's kind of like a permanent brain damage
00:23:07.360 because you can see that it doesn't have flat tails.
00:23:24.160 And in this case, we have way too many activations H
00:23:52.760 it's creating a distribution that is too saturated
00:24:01.280 for these neurons because they update less frequently.
00:24:21.560 So we want this pre-activation to be closer to zero,
00:24:27.360 So here we want actually something very, very similar.
00:24:30.600 Now it's okay to set the biases to a very small number.
00:24:41.680 just so that there's like a little bit of variation
00:24:53.720 And then the weights we can also just like squash.
00:25:05.440 You see now, because we multiplied W1 by 0.1,
00:25:19.880 So basically that's because there are no neurons
00:25:46.440 So maybe this is what our initialization should be.
00:25:56.640 let me run the full optimization without the break.
00:26:12.400 So we see that we actually do get an improvement here.
00:26:16.400 we started off with a validation loss of 2.17.
00:26:24.080 And by fixing the tanh layer being way too saturated,
00:26:31.920 And so we're spending more time being productive training
00:26:42.480 like the overconfidence of the softmax in the beginning.
00:26:49.000 So this is illustrating basically initialization
00:26:55.720 just by being aware of the internals of these neural nets
00:27:02.920 This is just a one-hidden-layer multilayer perceptron.
00:27:07.760 the optimization problem is actually quite easy.
00:27:11.520 So even though our initialization was terrible,
00:27:19.440 Once we actually start working with much deeper networks
00:27:33.040 where the network is basically not training at all
00:27:37.320 And the deeper your network is and the more complex it is,
00:27:39.960 the less forgiving it is to some of these errors.
00:27:55.720 But what we have here now is all these magic numbers
00:27:58.280 like point two, like where do I come up with this?
00:28:02.040 if I have a large neural net with lots and lots of layers?
00:28:07.560 There's actually some relatively principled ways
00:28:09.560 of setting these scales that I would like to introduce to you now.
00:28:18.480 So what I'm doing here is we have some random input here,
00:28:25.320 And there's 1000 examples that are 10 dimensional.
00:28:34.760 And these neurons in the hidden layer look at 10 inputs
00:28:44.280 in this case, the multiplication, X multiplied by W
00:28:57.160 If I do X times W and we forget for now the bias
00:29:13.600 And the standard deviation again is just the measure
00:29:17.200 But then once we multiply here and we look at the
00:29:21.200 histogram of Y, we see that the mean, of course,
00:29:39.800 And so we're expanding this Gaussian from the input.
00:29:45.720 And we don't want that, we want most of the neural nets
00:29:50.520 So unit Gaussian roughly throughout the neural net.
00:29:56.240 to preserve the distribution to remain a Gaussian?
00:30:06.720 these elements of W by a large number, let's say by five,
00:30:10.440 then this Gaussian grows and grows in standard deviation.
00:30:17.360 So basically these numbers here in the output Y
00:30:25.200 then conversely, this Gaussian is getting smaller
00:30:31.080 And you can see that the standard deviation is 0.6.
00:30:33.920 And so the question is what do I multiply by here
00:30:36.640 to exactly preserve the standard deviation to be one?
00:30:40.960 And it turns out that the correct answer mathematically
00:30:44.840 of this multiplication here is that you are supposed
00:30:58.040 So we are supposed to divide by the square root of 10.
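That experiment, roughly (a sketch; exact numbers vary with the random seed):

```python
import torch

x = torch.randn(1000, 10)            # 1000 examples, 10-dimensional inputs
w = torch.randn(10, 200) / 10**0.5   # divide by sqrt(fan_in) = sqrt(10)
y = x @ w
print(x.std(), y.std())              # both roughly 1: the Gaussian is preserved
```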
00:31:17.440 Now unsurprisingly, a number of papers have looked
00:31:19.760 into how to best initialize neural networks.
00:31:29.120 And we want to make sure that the activations are well behaved
00:31:31.840 and they don't expand to infinity or shrink all the way to 0.
00:31:35.240 And the question is how do we initialize the weights
00:31:37.000 so that these activations take on reasonable values
00:31:40.920 Now, one paper that has studied this in quite a bit of detail
00:31:43.720 that is often referenced is this paper by Kaiming He et al.
00:31:49.360 Now in this case, they actually study convolutional neural networks
00:31:52.360 and they study especially the relu nonlinearity
00:32:06.000 the relu nonlinearity that they care about quite a bit here
00:32:09.280 is a squashing function where all the negative numbers
00:32:24.640 they find in their analysis of the forward activations
00:32:27.440 in the neural net that you have to compensate for that
00:32:30.800 And so here, they find that basically when they initialize
00:32:36.680 their weights, they have to do it with a zero-mean Gaussian
00:32:39.200 whose standard deviation is the square root of two over the fan-in.
00:32:43.440 What we have here is we are initializing the Gaussian
00:32:50.520 So what we have is the square root of one over the fan-in
00:33:02.680 half of the distribution and clamps it at zero.
00:33:07.960 Now, in addition to that, this paper also studies
00:33:10.680 not just the sort of behavior of the activations
00:33:17.720 And we have to make sure that the gradients also
00:33:20.920 And so, because ultimately they end up updating our parameters.
00:33:25.800 And what they find here through a lot of the analysis
00:33:36.120 the backward pass is also approximately initialized
00:33:39.280 up to a constant factor that has to do with the size
00:33:42.640 of the number of hidden neurons in an early and late layer.
00:33:51.080 that this is not a choice that matters too much.
00:33:54.000 Now, this Kaiming initialization is also implemented
00:34:02.520 And in my opinion, this is probably the most common way
00:34:25.120 most of the people just leave it as the default,
00:34:28.120 And then second, pass in the nonlinearity that you are using
00:34:33.520 we need to calculate a slightly different gain.
00:34:56.360 And the reason it's a square root is because in this paper,
00:34:59.280 you see how the two is inside of the square root.
00:35:13.840 In the case of tanh, which is what we're using here,
00:35:27.480 So what that means is you're taking the output distribution
00:35:33.680 Now ReLU squashes it by taking everything below zero
00:35:37.800 tanh also squashes it because it's a contractive operation.
00:35:40.600 It will take the tails and it will squeeze them in.
00:35:49.040 so that we renormalize everything back to standard,
00:35:53.440 So that's why there's a little bit of a gain that comes out.
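For reference, this is how those gains show up in PyTorch (real torch.nn.init calls; the example weight shape is arbitrary):

```python
import torch

print(torch.nn.init.calculate_gain('linear'))  # 1.0
print(torch.nn.init.calculate_gain('relu'))    # sqrt(2) ~= 1.414
print(torch.nn.init.calculate_gain('tanh'))    # 5/3 ~= 1.667

# kaiming_normal_ expects the weight laid out like nn.Linear's: (fan_out, fan_in)
w = torch.empty(200, 10)
torch.nn.init.kaiming_normal_(w, mode='fan_in', nonlinearity='relu')
print(w.std())  # roughly sqrt(2 / 10) ~= 0.45
```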
00:35:56.600 Now I'm skipping through this section a little bit quickly
00:36:03.200 about seven years ago when this paper was written,
00:36:14.280 and the scrutinizing of the nonlinearities used, and so on.
00:36:17.040 And everything was very finicky and very fragile
00:36:19.560 and very properly arranged for the neural network to train,
00:36:22.560 especially if your neural network was very deep.
00:36:26.320 that have made everything significantly more stable
00:36:33.960 And some of those modern innovations for example
00:36:35.760 are residual connections which we will cover in the future.
00:36:52.440 the simple optimizer we're basically using here,
00:37:01.080 make it less important for you to precisely calibrate
00:37:06.080 All that being said in practice, what should we do?
00:37:09.680 In practice when I initialize these neural nets,
00:37:15.880 So basically, roughly what we did here is what I do.
00:37:34.040 So to set the standard deviation of our weights,
00:37:43.720 and let's say I just create a thousand numbers,
00:37:47.400 And of course that's one, that's the amount of spread.
00:37:50.080 Let's make this a bit bigger so it's closer to one.
00:37:52.440 So that's the spread of the Gaussian of zero mean
00:38:04.480 and that makes its standard deviation point two.
00:38:07.080 So basically the number that you multiply by here
00:38:09.040 ends up being the standard deviation of this Gaussian.
00:38:12.200 So here this is a standard deviation point two Gaussian here
00:38:20.960 to the gain over the square root of fan_mode, which here is the fan-in.
00:38:55.880 the fan-in for W1 is actually n_embd times block_size,
00:39:01.840 And that's because each character is 10 dimensional,
00:39:03.960 but then we have three of them and we concatenate them.
00:39:14.560 This is what we want our standard deviation to be.
00:39:22.520 and making sure it looks okay, we came up with point two.
00:39:41.200 And these brackets here are not that necessary,
00:39:52.200 And this is how we would initialize the neural net.
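Concretely, the initialization being described is along these lines (a sketch; the sizes and seed are assumed from this lecture):

```python
import torch

n_embd, block_size, n_hidden = 10, 3, 200   # sizes assumed from this lecture
g = torch.Generator().manual_seed(2147483647)

fan_in = n_embd * block_size                 # 30: three concatenated 10-d character embeddings
W1 = torch.randn((fan_in, n_hidden), generator=g) * (5/3) / fan_in**0.5  # tanh gain of 5/3
b1 = torch.randn(n_hidden, generator=g) * 0.01
print(W1.std())  # ~0.3, close to the hand-tuned 0.2 magic number
```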
00:40:05.160 and then we can train the neural net and see what we got.
00:40:18.720 but that's just the randomness of the process, I suspect.
00:40:21.520 But the big deal of course is we get to the same spot,
00:40:24.400 but we did not have to introduce any magic numbers
00:40:37.160 and something that we can sort of use as a guide.
00:40:41.640 of these initializations is not as important today
00:40:57.840 because it made it possible to train very deep neural nets
00:41:07.480 Basically we have these hidden states, HP, right?
00:41:15.520 these pre-activation states to be way too small
00:41:27.520 In fact, we want them to be roughly, roughly Gaussian.
00:41:30.480 So zero mean and a unit or a one standard deviation,
00:41:36.040 So the insight from the batch normalization paper is,
00:41:47.840 And it sounds kind of crazy, but you can just do that
00:41:56.800 is a perfectly differentiable operation as we'll see.
00:41:59.680 And so that was kind of like the big insight in this paper.
00:42:04.520 because you can just normalize these hidden states
00:42:06.280 and if you'd like unit Gaussian states in your network,
00:42:09.840 at least initialization, you can just normalize them
00:42:16.520 So we're going to scroll to our pre-activations here
00:42:24.960 and that's because if these are way too small numbers,
00:42:32.920 then the tanh is way too saturated and gradients will not flow.
00:42:41.520 is that we can just standardize these activations.
00:42:56.040 So basically what we can do is we can take hpreact,
00:43:00.200 and the mean we want to calculate across the zeroth dimension
00:43:20.880 And similarly, we can calculate the standard deviation
00:43:29.480 Now in this paper, they have the sort of prescription here.
00:43:43.840 And then the standard deviation is basically kind of like
00:43:46.440 the measure of the spread that we've been using,
00:43:50.200 which is the distance of every one of these values
00:43:53.840 away from the mean and that squared and averaged.
00:44:01.600 And then if you want to take the standard deviation,
00:44:10.080 And now we're going to normalize or standardize
00:44:32.640 This is exactly what these two STD and mean are calculating.
00:44:43.080 You see how the sigma is the standard deviation usually.
00:44:46.800 which is the variance; the variance is the square of the standard deviation.
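Put together, the normalization step is roughly this (a sketch with a stand-in hpreact):

```python
import torch

hpreact = torch.randn(32, 200) * 4 + 1.0   # stand-in pre-activations, 32 examples x 200 neurons

bnmeani = hpreact.mean(0, keepdim=True)    # mean over the batch dimension, shape (1, 200)
bnstdi = hpreact.std(0, keepdim=True)      # std over the batch dimension, shape (1, 200)
hpreact = (hpreact - bnmeani) / bnstdi     # each neuron is now unit Gaussian over this batch
print(hpreact.mean(0).abs().max(), hpreact.std(0).mean())  # ~0 and ~1
```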
00:44:53.160 And what this will do is that every single neuron now
00:44:55.880 and its firing rate will be exactly unit Gaussian
00:45:09.560 Notice that calculating the mean and their standard deviation,
00:45:29.720 But we don't want these to be forced to be Gaussian always.
00:45:34.120 We'd like to allow the neural net to move this around
00:45:40.560 to make some tanh neurons maybe more trigger-happy
00:45:52.360 And so in addition to this idea of standardizing
00:45:59.320 we have to also introduce this additional component
00:46:01.600 in the paper here described as scale and shift.
00:46:09.240 and we are additionally scaling them by some gain
00:46:20.440 We are going to allow a batch normalization gain
00:46:27.520 and the ones will be in the shape of one by n_hidden.
00:46:37.760 And it will also be of the shape one by n_hidden.
00:46:49.760 So because this is initialized to one and this to zero,
00:47:03.560 no matter what the distribution of the H-preact is coming in,
00:47:07.160 coming out it will be unit Gaussian for each neuron.
00:47:09.760 And that's roughly what we want, at least at initialization.
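With the scale and shift included, the whole batch norm step looks roughly like this (a sketch; bngain and bnbias follow the notebook's naming, and hpreact is a stand-in):

```python
import torch

n_hidden = 200
bngain = torch.ones((1, n_hidden))    # trainable scale, initialized to ones
bnbias = torch.zeros((1, n_hidden))   # trainable shift, initialized to zeros

hpreact = torch.randn(32, n_hidden) * 4   # stand-in pre-activations
hpreact = bngain * (hpreact - hpreact.mean(0, keepdim=True)) / hpreact.std(0, keepdim=True) + bnbias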
00:47:24.840 Here we just have to make sure that we include these
00:47:32.160 because they will be trained with back propagation.
00:47:54.880 and we're also going to do the exact same thing
00:47:58.400 So similar to train time, we're going to normalize
00:48:09.320 And we'll see in a second that we're actually going
00:48:15.600 So I'm just going to wait for this to converge.
00:48:17.640 Okay, so I allowed the neural net to converge here.
00:48:20.040 And when we scroll down, we see that our validation loss here
00:48:26.320 And we see that this is actually kind of comparable
00:48:28.160 to some of the results that we've achieved previously.
00:48:31.200 Now, I'm not actually expecting an improvement in this case.
00:48:34.800 And that's because we are dealing with a very simple neural net
00:48:39.360 So in fact, in this very simple case of just one hidden layer,
00:48:43.040 we were able to actually calculate what the scale of W
00:48:45.480 should be to make these pre-activations already
00:48:53.240 But you might imagine that once you have a much deeper
00:48:55.520 neural net that has lots of different types of operations,
00:48:59.000 and there's also, for example, residual connections
00:49:08.800 such that all the activations throughout the neural net
00:49:12.840 And so that's going to become very quickly intractable.
00:49:15.920 But compared to that, it's going to be much, much easier
00:49:22.120 So in particular, it's common to look at every single
00:49:26.960 This is a linear layer multiplied by weight matrices
00:49:32.280 which we'll cover later and also perform basically
00:49:42.400 or convolutional layer and append a batch normalization layer
00:49:46.360 right after it to control the scale of these activations
00:49:53.440 throughout the neural net, and then this controls
00:49:55.760 the scale of these activations throughout the neural net.
00:49:58.560 It doesn't require us to do perfect mathematics
00:50:06.520 lego building blocks that you might want to introduce
00:50:14.840 Now, the stability offered by batch normalization
00:50:18.880 And that cost is that if you think about what's happening here,
00:50:22.320 something terribly strange and unnatural is happening.
00:50:30.320 and then we calculate its activations and its logits.
00:50:42.320 we suddenly started to use batches of examples.
00:50:44.800 But those batches of examples were processed independently,
00:50:51.520 because of the normalization through the batch,
00:51:05.520 are not just a function of that example and its input,
00:51:08.320 but they're also a function of all the other examples
00:51:17.760 when you look at hpreact that's going to feed into h,
00:51:30.320 And depending on what other examples happen to come for a ride,
00:51:34.080 H is going to change suddenly and is going to like jitter,
00:51:39.520 because the statistics of the mean and standard deviation
00:51:48.560 And you think that this would be a bug or something undesirable,
00:51:54.360 this actually turns out to be good in neural network training.
00:51:59.640 And the reason for that is that you can think of this
00:52:03.920 Because what's happening is you have your input and you get your H.
00:52:09.880 And so what that does is that it's effectively padding out
00:52:18.680 it's actually kind of like a form of a data augmentation,
00:52:22.960 And it's kind of like augmenting the input a little bit
00:52:26.840 And that makes it harder for the neural nets to overfit
00:52:50.360 that the examples in the batch are coupled mathematically
00:53:10.240 of batch normalization and move to other normalization
00:53:12.400 techniques that do not couple the examples of a batch.
00:53:16.880 instance normalization, group normalization, and so on.
00:53:25.480 batch normalization was the first kind of normalization layer
00:53:38.200 and move to some of the other normalization techniques.
00:53:40.920 But it's been hard because it just works quite well.
00:53:44.320 And some of the reason that it works quite well
00:53:50.600 at controlling the activations and their distributions.
00:53:54.520 So that's kind of like the brief story of batch normalization.
00:53:57.480 And I'd like to show you one of the other weird sort
00:54:07.680 When I was evaluating the loss on the validation set.
00:54:13.280 we'd like to deploy it in some kind of a setting.
00:54:27.960 the neural net expects batches as an input now.
00:54:34.400 And so the proposal in the batch normalization paper
00:54:38.960 What we would like to do here is we would like to basically
00:54:41.800 have a step after training that calculates and sets
00:54:52.280 And so I wrote this code here in the interest of time.
00:54:55.360 And we're going to call what's called calibrating the batch norm statistics.
00:54:59.160 And basically what we do is, with torch.no_grad,
00:55:10.720 get the pre-activations for every single training example.
00:55:13.640 And then one single time estimate the mean and standard
00:55:33.440 And here we're just going to use bnmean and bnstd.
00:55:37.160 And so at test time, we are going to fix these,
00:55:43.200 And now you see that we get basically identical result.
00:55:50.760 is that we can now also forward a single example
00:55:59.440 this mean and standard deviation as a second stage
00:56:09.600 which is that we can estimate the mean and standard deviation
00:56:17.040 And then we can simply just have a single stage of training
00:56:21.720 we are estimating the running mean and standard deviation.
00:56:30.120 And let me call this bnmeani, on the i-th iteration.
00:56:47.120 And the mean comes here and the STD comes here.
00:56:54.200 I've just moved around and I created these extra variables
00:57:01.720 But what we're going to do now is we're going to keep
00:57:03.520 a running mean of both of these values during training.
00:57:06.760 So let me swing up here and let me create a bnmean_running
00:57:40.400 these mean and standard deviation that are running,
00:57:45.400 they're not actually part of the gradient based optimization.
00:57:47.760 We're never going to derive gradients with respect to them.
00:57:53.640 And so what we're going to do here is we're going to say
00:57:58.080 telling PyTorch that the update here is not supposed
00:58:05.400 But this running mean is basically going to be 0.999 times the current running mean,
00:58:13.600 plus 0.001 times this value, this new mean.
00:58:26.000 But it will receive a small update in the direction
00:58:41.280 And it's simply being updated not using gradient descent.
00:58:43.840 It's just being updated using a janky like smooth,
00:59:02.440 and standard deviation and estimating them once.
00:59:09.360 now I'm keeping track of this in a running manner.
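In code, the running update is roughly this (a sketch; the bnmean_running/bnstd_running names and the 0.999/0.001 momentum follow the notebook, and the batch statistics here are placeholders):

```python
import torch

n_hidden = 200
bnmean_running = torch.zeros((1, n_hidden))
bnstd_running = torch.ones((1, n_hidden))

# inside the training loop, after computing this batch's statistics:
bnmeani = torch.randn(1, n_hidden) * 0.1   # placeholder batch mean for this sketch
bnstdi = torch.ones(1, n_hidden)           # placeholder batch std for this sketch

with torch.no_grad():  # these buffers are not part of the gradient-based optimization
    bnmean_running = 0.999 * bnmean_running + 0.001 * bnmeani
    bnstd_running = 0.999 * bnstd_running + 0.001 * bnstdi
```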
00:59:35.000 So during training, the exact same thing will happen.
00:59:46.880 So let's wait for the optimization to converge.
00:59:50.160 And hopefully the running mean and standard deviation
01:00:03.840 and then the bnmean from the explicit estimation is here.
01:00:35.960 Instead of bnstd, we can use bnstd_running.
01:00:54.200 Okay, so we're almost done with batch normalization.
01:00:56.080 There are only two more notes that I'd like to make.
01:01:02.200 This epsilon is usually like some small fixed number.
01:01:15.880 In that case, here we normally have a division by zero,
01:01:25.600 So feel free to also add a plus epsilon here of a very small number.
01:01:29.120 It doesn't actually substantially change the result.
01:01:41.280 but right here where we are adding the bias into hpreact,
01:01:52.920 for every one of these neurons and subtracting it.
01:02:04.600 and they don't impact the rest of the calculation.
01:02:07.280 So if you look at b1.grad, it's actually going to be zero
01:02:13.600 And so whenever you're using batch normalization layers,
01:02:17.680 with a layer before them like a linear or a conv or something like that,
01:02:20.600 you're better off coming here and just not using a bias.
01:02:30.640 Instead, we have this batch normalization bias here,
01:02:33.720 and that batch normalization bias is now in charge of
01:02:38.920 instead of this b1 that we had here originally.
01:02:42.200 And so basically, the batch normalization layer has its own bias
01:02:45.920 and there's no need to have a bias in the layer before it
01:02:49.360 because that bias is going to be subtracted out anyway.
01:02:52.000 So that's the other small detail to be careful with.
01:02:53.880 Sometimes it's not going to do anything catastrophic.
01:03:03.120 but it doesn't actually really impact anything otherwise.
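In PyTorch terms, the pattern is simply this (a sketch of the recommended wiring, not the lecture's exact code):

```python
import torch.nn as nn

# any bias added right before a batch norm is subtracted out by the mean-centering,
# so it is common to disable it and let batch norm's own shift do that job:
block = nn.Sequential(
    nn.Linear(30, 200, bias=False),
    nn.BatchNorm1d(200),
    nn.Tanh(),
)
```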
01:03:07.200 Okay, so I rearranged the code a little bit with comments
01:03:13.760 We are using batch normalization to control the statistics
01:03:22.160 across the neural net and usually we will place it
01:03:27.760 like for example, a linear layer or a convolutional layer
01:03:32.760 Now, the batch normalization layer internally has parameters
01:03:37.680 for the gain and the bias and these are trained
01:03:44.480 The buffers are the mean and the standard deviation,
01:03:47.120 the running mean and the running standard deviation.
01:03:51.000 And these are not trained using back propagation.
01:04:05.240 And then really what it's doing is it's calculating
01:04:07.400 the mean and the standard deviation of the activations
01:04:10.520 that are feeding into the batch norm layer over that batch.
01:04:14.000 Then it's centering that batch to be unit Gaussian
01:04:26.040 of the mean and standard deviation of the inputs.
01:04:28.920 And it's maintaining this running mean and standard deviation.
01:04:40.520 to basically forward individual examples at test time.
01:04:50.440 Now, I wanted to show you a little bit of a real example.
01:05:02.120 And of course, we haven't covered ResNets in detail.
01:05:04.680 So I'm not going to explain all the pieces of it.
01:05:12.200 And there's many, many layers with repeating structure
01:05:15.120 all the way to predictions of what's inside that image.
01:05:18.320 This repeating structure is made up of these blocks.
01:05:20.880 And these blocks are just sequentially stacked up
01:05:25.640 Now, the code for this, the block basically that's used
01:05:40.240 but I want to point out some small pieces of it.
01:05:43.160 Here in the init is where we initialize the neural net.
01:05:53.000 how the neural net acts once you actually have the input.
01:05:55.720 So this code here is along the lines of what we're doing here.
01:05:59.120 And now these blocks are replicated and stacked up
01:06:04.720 serially, and that's what a residual network would be.
01:06:26.560 And basically, this linear multiplication and bias
01:06:28.920 offset are done on patches instead of the full input.
01:06:34.760 So because these images have spatial structure,
01:06:40.840 but they do it on overlapping patches of the input.
01:06:46.720 Then we have the norm layer, which by default here
01:07:04.520 and you can just use them relatively interchangeably
01:07:07.360 for very deep networks relu's typically, empirically,
01:07:14.120 We have convolution, batch normalization, relu--
01:07:23.000 But basically, that's the exact same pattern we have here.
01:07:35.560 But basically, a weight layer, a normalization layer,
01:07:39.560 And that's the motif that you would be stacking up
01:07:46.960 is that here when they are initializing the conv layers,
01:07:50.240 like conv one by one; the definition for that is right here.
01:08:04.760 The bias equals false is exactly for the same reason
01:08:16.760 And the batch normalization subtracts that bias
01:08:20.640 So there's no need to introduce these spurious parameters.
01:08:23.160 It wouldn't hurt performance, it's just useless.
01:08:33.400 So by the way, this example here is very easy to find,
01:08:41.760 So this is kind of like the stock implementation
01:08:48.200 But of course, I haven't covered many of these parts yet.
01:09:17.000 So the calculation is Wx + b, very much like we did here.
01:09:21.520 To initialize this layer, you need to know the fan in,
01:09:24.160 the fan out, so that they can initialize this W.
01:09:32.000 So they know how big the weight matrix should be.
01:09:35.040 You need to also pass in whether or not you want a bias.
01:09:43.480 And you may want to do that exactly like in our case,
01:09:47.120 if your layer is followed by a normalization layer,
01:09:50.640 So this allows you to basically disable a bias.
01:10:05.880 In the same way, they have a weight and a bias.
01:10:08.600 And they're talking about how they initialize it by default.
01:10:11.720 So by default, pytorch will initialize your weights
01:10:14.280 by taking the fan-in and then doing one over the square root of the fan-in.
01:10:27.920 but they are using a one instead of five over three.
01:10:33.680 but otherwise it's exactly one over the square root of fan in,
01:10:39.040 So one over the square root of K is the scale of the weights,
01:10:48.760 they're using a uniform distribution by default.
01:10:51.400 And so they draw uniformly from negative square root of K
01:10:56.080 But it's the exact same thing and the same motivation
01:10:58.640 from, with respect to what we've seen in this lecture.
01:11:11.760 And you basically achieve that by scaling the weights
01:11:19.920 And then the second thing is the batch normalization layer.
01:11:23.200 So let's look at what that looks like in PyTorch.
01:11:26.160 So here we have a one-dimensional batch normalization layer,
01:11:45.920 Then they need to know the value of epsilon here.
01:12:05.120 The momentum we are using here, in this example, is 0.001.
01:12:08.560 And basically you may want to change this sometimes.
01:12:13.720 And roughly speaking, if you have a very large batch size,
01:12:19.160 when you estimate the mean and standard deviation,
01:12:21.640 for every single batch size, if it's large enough,
01:12:26.120 And so therefore you can use slightly higher momentum,
01:12:46.160 that might not be good enough for this value to settle
01:12:49.680 and converge to the actual mean and standard deviation
01:12:55.160 And so basically if your batch size is very small,
01:12:59.800 and it might make it so that the running mean and standard deviation
01:13:20.720 I'm not actually sure why you would want to change this
01:13:24.120 Then track running stats is determining whether or not
01:13:29.360 batch normalization layer of PyTorch will be doing this.
01:13:32.800 And one reason you may want to skip the running stats
01:13:39.200 estimate them at the end as a stage two like this.
01:13:44.520 normalization layer to be doing all this extra compute
01:13:48.880 And finally, we need to know which device we're going to run
01:14:06.160 and everything is the same exactly as we've done here.
01:14:09.600 Okay, so that's everything that I wanted to cover
01:14:13.840 Really what I wanted to talk about is the importance
01:14:15.840 of understanding the activations and the gradients
01:14:27.400 at the output layer, and we saw that if you have
01:14:30.160 too confident mispredictions, because the activations
01:14:44.000 Then we also saw that we need to control the activations.
01:14:50.720 And because that you can run into a lot of trouble
01:14:52.920 with all of these nonlinearities in these neural nets.
01:14:55.840 And basically, you want everything to be fairly homogeneous
01:15:02.560 Then I talked about, okay, if we want roughly Gaussian
01:15:05.720 activations, how do we scale these weight matrices
01:15:08.720 and biases during initialization of the neural net
01:15:11.320 so that we don't get, so everything is as controlled
01:15:20.000 And then I talked about how that strategy is not actually
01:15:32.480 it becomes really, really hard to precisely set the weights
01:15:35.800 and the biases in such a way that the activations
01:15:41.360 So then I introduced the notion of the normalization layer.
01:16:03.000 This is a layer that you can sprinkle throughout
01:16:06.320 And the basic idea is if you want roughly Gaussian
01:16:16.640 And you can do that because the centering operation
01:16:28.640 Because now we're centering the data, that's great.
01:16:35.280 And then because we are coupling all the training examples,
01:16:38.640 now suddenly the question is, how do you do the inference?
01:16:41.160 Or to do the inference, we need to now estimate
01:16:57.800 and try to estimate these in the running manner
01:17:02.760 And that gives us the batch normalization layer.
01:17:12.560 And intuitively it's because it is coupling examples
01:17:18.800 And I've shot myself in the foot with this layer
01:17:28.360 So basically try to avoid it as much as possible.
01:17:33.760 are for example group normalization or layer normalization.
01:17:36.640 And those have become more common in more recent deep learning.
01:17:44.440 was very influential at the time when it came out
01:17:50.400 that you could train reliably much deeper neural nets.
01:17:59.080 at controlling the statistics of the activations
01:18:18.720 And when you actually optimize these neural nets.
01:18:28.080 normalization layers will become very, very important
01:18:36.480 I would like us to do one more summary here as a bonus.
01:18:39.200 And I think it's useful as to have one more summary
01:18:43.920 But also I would like us to start by torch-ifying our code
01:18:47.240 So it looks much more like what you would encounter in PyTorch.
01:18:52.120 into these modules, like a linear module and a batch norm module,
01:19:01.400 so that we can construct neural networks very much like we
01:19:08.800 Then we will do the optimization loop as we did before.
01:19:12.240 And then the one more thing that I want to do here
01:19:16.160 both in the forward pass and in the backward pass.
01:19:19.320 And then here we have the evaluation and sampling
01:19:22.960 So let me rewind all the way up here and go a little bit slower.
01:19:29.280 You'll notice that torch.nn has lots of different types
01:19:35.120 torch.nn.Linear takes a number of input features, output
01:19:39.920 And then the device that we want to place this layer on
01:19:43.920 So I will omit these two, but otherwise we have the exact
01:19:48.360 We have the fan in, which is the number of inputs.
01:19:55.320 And internally, inside this layer, there's a weight and a bias.
01:19:59.880 It is typical to initialize the weight using, say, random
01:20:05.960 And then here's the Kaiming initialization that we've
01:20:12.000 And also the default that I believe PyTorch uses.
01:20:14.720 And by default, the bias is usually initialized to zeros.
01:20:18.360 Now, when you call this module, this will basically
01:20:24.880 And then when you also call that parameters on this module,
01:20:27.440 it will return the tensors that are the parameters of this layer.
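A minimal Linear module along these lines (a sketch that mirrors the lecture's class, not a verbatim copy):

```python
import torch

class Linear:
    def __init__(self, fan_in, fan_out, bias=True):
        self.weight = torch.randn((fan_in, fan_out)) / fan_in**0.5  # Kaiming-style scaling
        self.bias = torch.zeros(fan_out) if bias else None

    def __call__(self, x):
        self.out = x @ self.weight          # keep .out around for the statistics plots
        if self.bias is not None:
            self.out += self.bias
        return self.out

    def parameters(self):
        return [self.weight] + ([] if self.bias is None else [self.bias])
```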
01:20:32.200 Now, next we have the batch normalization layer.
01:20:37.000 And this is very similar to PyTorch's nn.BatchNorm1d layer,
01:20:44.400 So I'm kind of taking these three parameters here, the
01:20:48.120 dimensionality, the epsilon that we'll use in the division,
01:20:51.480 and the momentum that we will use in keeping track of these
01:20:54.320 running stats, the running mean and the running variance.
01:20:58.160 Now, PyTorch actually takes quite a few more things, but
01:21:03.920 That means that we will be using a gamma and beta after the normalization.
01:21:09.560 So we will be keeping track of the running mean and the running variance.
01:21:17.080 And the data type by default is float, float 32.
01:21:23.440 Otherwise, we are taking all the same parameters in this
01:21:31.080 There's a dot training, which by default is true.
01:21:33.560 And PyTorch nn modules also have this attribute,
01:21:39.480 Batch norm is included in that: such modules have a different behavior, whether you
01:21:43.560 are training your neural net or whether you are running it in an
01:21:46.360 evaluation mode and calculating your evaluation loss or using
01:21:53.560 Batch norm is an example of this, because when we are training, we
01:21:56.160 are going to be using the mean and the variance
01:21:59.680 But during inference, we are using the running mean and
01:22:03.960 And so also, if we are training, we are updating mean
01:22:07.760 But if we are testing, then these are not being
01:22:16.320 Now, the parameters of BatchNorm1d are the gamma and the
01:22:21.680 And then the running mean and running variance are called
01:22:27.480 And these buffers are trained using exponential moving
01:22:33.360 And they are not part of the backpropagation and stochastic gradient descent.
01:22:37.000 So they are not sort of parameters of this layer.
01:22:39.800 And that's why when we have parameters here, we only
01:22:46.600 This is trained internally here, every forward pass, using
01:22:56.880 Now, in a forward pass, if we are training, then we use the
01:23:08.800 Now, up above, I was estimating the standard deviation and
01:23:12.280 keeping track of the standard deviation here in the running
01:23:15.680 standard deviation instead of running variance.
01:23:20.120 Here, they calculate the variance, which is the standard deviation squared.
01:23:23.800 And that's what's kept track of in the running variance
01:23:29.720 But those two would be very, very similar, I believe.
01:23:33.840 If we are not training, then we use running mean and
01:23:39.040 And then here, I am calculating the output of this layer.
01:23:42.040 And I'm also assigning it to an attribute called dot out.
01:23:45.480 Now, dot out is something that I'm using in our modules
01:23:53.240 I'm creating a dot out because I would like to very easily
01:23:57.400 maintain all those variables so that we can create statistics
01:24:01.360 But PyTorch and modules will not have a dot out attribute.
01:24:05.440 And finally, here, we are updating the buffers using,
01:24:07.880 again, as I mentioned, exponential moving average,
01:24:13.040 And importantly, you'll notice that I'm using the torch.no_grad context manager.
01:24:17.360 And I'm doing this because if we don't use this,
01:24:22.120 computational graph out of these tensors because it is
01:24:25.120 expecting that we will eventually call dot backward.
01:24:28.040 But we are never going to be calling dot backward on
01:24:29.960 anything that includes running mean and running variance.
01:24:32.600 So that's why we need to use this context manager so that
01:24:35.400 we are not maintaining them using all this additional memory.
01:24:41.880 And it's just telling PyTorch that while we know
01:25:05.200 But because these are layers, it now becomes very easy to
01:25:08.800 sort of stack them up into basically just a list.
01:25:13.280 And we can do all the initialization that we're used to.
01:25:16.280 So we have the initial sort of embedding matrix.
01:25:19.480 We have our layers, and we can call them sequentially.
01:25:22.240 And then again, with torch.no_grad, there's some
01:25:26.160 So we want to make the output softmax a bit less
01:25:30.360 And in addition to that, because we are using a six layer
01:25:33.320 multi layer perceptron here, so you see how I'm stacking
01:25:36.200 linear, tanh, linear, tanh, et cetera, I'm going to be using
01:25:41.320 And I'm going to play with this in a second, so you'll see
01:25:43.400 how when we change this, what happens to this statistics.
01:25:47.280 Finally, the parameters are basically the embedding matrix
01:25:52.440 And notice here, I'm using a double list comprehension, if
01:25:56.120 But for every layer in layers and for every parameter in each
01:25:59.880 of those layers, we are just stacking up all those
01:26:09.400 And I'm telling PyTorch that all of them require gradient.
01:26:12.320 Then here, we have everything here we are actually mostly
01:26:23.520 The forward pass now is just the linear application of all
01:26:25.760 the layers in order, followed by the cross entropy.
01:26:29.400 And then in the backward pass, you'll notice that for
01:26:31.240 every single layer, I now iterate over all the outputs.
01:26:34.160 And I'm telling PyTorch to retain the gradient of them.
01:26:37.440 And then here, we are already used to all the gradients
01:26:48.760 And then I am going to break after a single iteration.
01:26:52.040 Now here in this cell, in this diagram, I'm visualizing the
01:26:54.960 histogram, the histograms of the forward pass activations.
01:26:58.680 And I'm specifically doing it at the tanh layers.
01:27:01.800 So iterating over all the layers, except for the very last
01:27:05.320 one, which is basically just the softmax layer.
01:27:10.160 If it is a tanh layer -- and I'm using a tanh layer
01:27:12.760 just because they have a finite output, negative 1 to 1.
01:27:21.720 I take the out tensor from that layer into T. And then I'm
01:27:25.840 calculating the mean, the standard deviation, and the
01:27:28.360 percent saturation of t. And we define the percent
01:27:31.760 saturation as t.abs() being greater than 0.97.
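For a single tanh layer's output t, that statistic is roughly this (a sketch with a stand-in t):

```python
import torch

t = torch.tanh(torch.randn(32, 100) * 2)   # stand-in for one tanh layer's .out tensor

saturated = (t.abs() > 0.97).float().mean() * 100   # percent of values in the tanh tails
print(f'mean {t.mean():+.2f}, std {t.std():.2f}, saturated: {saturated:.2f}%')
```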
01:27:35.440 So that means we are here at the tails of the tanh.
01:27:38.600 And remember that when we are in the tails of the tanh,
01:27:51.200 So basically what this is doing is that every different
01:27:53.320 type of layer-- and they all have a different color--
01:27:55.320 we are looking at how many values in these tensors take on
01:28:04.120 So the first layer is fairly saturated here at 20%.
01:28:12.440 And if we had more layers here, it would actually just
01:28:14.480 stabilize it around the standard deviation of about 0.65.
01:28:20.600 And the reason that this stabilizes and gives us a nice
01:28:23.160 distribution here is because gain is set to 5/3.
01:28:27.680 Now here, this gain, you see that by default we initialize
01:28:35.320 But then here during initialization, I come in and,
01:28:38.720 if it's a linear layer, I boost that by the gain.
01:28:44.560 so basically if we just do not use a gain, then what happens?
01:28:48.760 If I redraw this, you will see that the standard deviation
01:28:53.000 is shrinking and the saturation is coming to 0.
01:28:57.000 And basically what's happening is the first layer is pretty
01:29:00.840 But then further layers are just kind of like shrinking
01:29:04.920 And it's happening slowly, but it's shrinking to 0.
01:29:07.600 And the reason for that is when you just have a sandwich
01:29:11.160 of linear layers alone, then initializing our weights
01:29:16.760 in this manner, we saw previously would have conserved
01:29:22.080 But because we have these interspersed tanh layers
01:29:24.920 in there, these tanh layers are squashing functions.
01:29:32.880 And so some gain is necessary to keep expanding it,
01:29:43.480 So if we have something too small, like 1, we saw that things
01:30:07.000 So 3 would create way too saturated activations.
01:30:10.880 So 5/3 is a good setting for a sandwich of linear layers
01:30:17.840 And it roughly stabilizes the standard deviation
01:30:22.600 Now, honestly, I have no idea where 5/3 came from in PyTorch
01:30:27.240 when we were looking at the Kaiming initialization.
01:30:30.040 I see empirically that it stabilizes this sandwich
01:30:32.840 of linear and tanh, and that the saturation is in a good range.
01:30:36.720 But I don't actually know if this came out of some math formula.
01:30:39.560 I tried searching briefly for where this comes from,
01:30:47.440 Our saturation is roughly 5%, which is a pretty good number.
01:30:50.920 And this is a good setting of the gain in this context.
01:30:55.160 Similarly, we can do the exact same thing with the gradients.
01:31:01.480 but instead of taking the layer dot out, I'm taking the grad.
01:31:04.440 And then I'm also showing the mean and standard deviation
01:31:07.160 and I'm plotting the histogram of these values.
01:31:10.040 And so you'll see that the gradient distribution
01:31:14.920 is that all the different layers in this sandwich
01:31:34.320 but also the gradients are doing something weird.
01:31:55.480 And in this case, we saw that without the use of batch norm,
01:32:03.240 to get nice activations in both the forward pass
01:32:10.160 I would also like to take a look at what happens
01:32:24.360 As we saw before, the correct gain here is one.
01:32:27.480 That is the standard deviation preserving gain.
01:32:33.680 And so what's gonna happen now is the following.
01:32:39.000 So we are, because there are no more tanh layers.
01:32:45.080 So what we're seeing is the activations started out
01:32:50.320 on the blue and have by layer four become very diffuse.
01:32:55.120 So what's happening to the activations is this.
01:33:00.640 the activation, the gradient statistics are the purple,
01:33:04.400 and then they diminish as you go down deeper than the layers.
01:33:07.720 And so basically you have an asymmetry like in the neural net.
01:33:10.880 And you might imagine that if you have very deep neural networks,
01:33:27.480 And if it's too little of a gain, then this happens.
01:33:38.440 depending on which direction you look at it from.
01:33:44.200 And in this case, the correct setting of the gain
01:33:46.040 is exactly one, just like we're doing at initialization.
01:33:50.240 And then we see that the statistics for the forward
01:34:00.000 but basically, getting neural nets to train
01:34:04.280 And before the use of advanced optimizers like Adam,
01:34:15.000 You have to make sure that everything is precisely
01:34:17.480 orchestrated and you have to care about the activations
01:34:23.520 but it was basically impossible to train very deep networks.
01:34:28.040 You'd have to be very, very careful with your initialization.
01:34:40.760 Why do we include them and then have to worry about the gain?
01:34:45.040 is that if you just have a stack of linear layers,
01:34:55.920 to a single linear layer in terms of its representation power.
01:34:59.680 So if you were to plot the output as a function
01:35:02.200 and the input, you're just getting a linear function.
01:35:06.520 you still just end up with a linear transformation.
01:35:09.040 All the WX plus B's just collapse into a large WX plus B
01:35:13.960 with a slightly different W and a slightly different b.
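A tiny demonstration of that collapse (an illustration, not from the notebook):

```python
import torch

x = torch.randn(5, 10)
W1 = torch.randn(10, 20)
W2 = torch.randn(20, 30)

y_stacked = (x @ W1) @ W2     # two linear layers with no nonlinearity in between
y_collapsed = x @ (W1 @ W2)   # a single equivalent linear layer
print(torch.allclose(y_stacked, y_collapsed, atol=1e-4))  # True
```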
01:35:16.080 But interestingly, even though the forward pass collapses
01:35:28.620 You actually end up with all kinds of interesting
01:35:34.760 because of the way the chain rule is calculating it.
01:35:43.880 In both cases, those are just a linear transformation
01:35:50.680 in fact, like infinitely layered linear layers and so on.
01:36:33.760 when you're training your neural nets and to consider.
01:36:37.880 we're updating the parameters of the neural net.
01:36:40.160 So we care about the parameters and their values
01:36:44.480 So here what I'm doing is I'm actually iterating
01:36:52.360 which are basically the weights of these linear layers.
01:36:56.320 and I'm skipping the gammas and the betas in the batch norm,
01:37:08.000 So here we have all the different weights, their shapes.
01:37:12.840 So this is the embedding layer, the first linear layer,
01:37:18.600 the standard deviation of all these parameters,
01:37:22.920 And you can see that it actually doesn't look that amazing.
01:37:32.080 And the last thing here is the gradient to data ratio.
01:37:45.640 And this is important because we're going to end up
01:37:49.160 that is the learning rate times the gradient onto the data.
01:37:54.080 And so the gradient has too large of a magnitude.
01:37:58.280 compared to the numbers in data, then you'd be in trouble.
01:38:01.720 But in this case, the gradient to data is our low numbers.
01:38:05.280 So the values inside grad are 1,000 times smaller
01:38:09.240 than the values inside data in these weights, most of them.
01:38:13.840 Now, notably that is not true about the last layer.
01:38:22.520 because you can see that the last layer here in pink
01:38:37.520 1e-3 throughout, except for the last layer,
01:38:45.760 And so the gradients on the last layer are currently
01:38:48.400 about 100 times greater, sorry, 10 times greater,
01:38:52.280 than all the other weights inside the neural net.
01:38:55.880 And so that's problematic because in the simplest
01:39:00.200 you would be training this last layer about 10 times faster
01:39:07.120 Now this actually kind of fixes itself a little bit
01:39:16.200 Let me re-initialize and then let me do it 1,000 steps.
01:39:20.000 And after 1,000 steps, we can look at the forward pass.
01:39:31.240 They're about equal and there's no shrinking to zero
01:39:42.760 are actually coming in during the optimization.
01:39:46.280 But certainly this is like a little bit troubling,
01:39:48.760 especially if you are using a very simple update rule
01:39:56.640 that I usually look at when I train neural networks.
01:40:08.400 Because that is the amount by which we will actually
01:40:14.920 is I'd like to introduce a new update to data ratio.
01:40:19.760 It's going to be a list, and we're going to build it out
01:40:28.920 So without any gradients, I'm comparing the update,
01:40:44.520 And then I'm taking the basically standard deviation
01:40:52.520 the data of that parameter and its standard deviation.
01:40:58.720 are the updates to the values in these tensors.
01:41:20.680 for all the parameters and adding it to this ud list.
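The ratio being tracked is roughly this (a sketch; lr, parameters, and the ud list mirror the notebook's names, with a placeholder parameter and loss here):

```python
import torch

lr = 0.1
parameters = [torch.randn(30, 200, requires_grad=True)]   # placeholder parameter list
loss = (parameters[0] ** 2).sum()
loss.backward()

ud = []
with torch.no_grad():
    ud.append([((lr * p.grad).std() / p.data.std()).log10().item() for p in parameters])
print(ud[-1])  # aim for roughly -3: updates about 1/1000 the scale of the data
```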
01:41:24.200 So now let me re-inertilize and run a thousand iterations.
01:41:34.280 But now I have one more plot here to introduce.
01:41:36.560 Now what's happening here is we're iterating over all the parameters, restricting to just the weights,
01:41:43.760 so the number of dimensions in these tensors is two.
01:41:47.800 And then I'm basically plotting all of these update ratios over time.
01:41:59.440 Some of them take on certain values during initialization.
01:42:09.360 I'm also drawing a black horizontal line that is a rough guide for what this ratio roughly should be,
01:42:12.920 and it should be roughly 1e-3.
01:42:15.480 And so that means that basically there's some values
01:42:21.880 and the updates to them at every single iteration
01:42:24.280 are no more than roughly 1,000th of the actual magnitude
01:42:37.720 this is actually updating those values quite a lot.
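A sketch of the plot being described: one curve per two-dimensional parameter (the weights), showing the logged update-to-data ratio over the training steps, plus the black guide line at -3, i.e. a ratio of 1e-3 (again assuming the `parameters` and `ud` names from above):

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 4))
legends = []
for i, p in enumerate(parameters):
    if p.ndim == 2:  # only the weight matrices
        plt.plot([ud[j][i] for j in range(len(ud))])
        legends.append('param %d %s' % (i, tuple(p.shape)))
plt.plot([0, len(ud)], [-3, -3], 'k')  # rough target for log10(update/data)
plt.legend(legends)
```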
01:42:42.240 But the reason that the final layer here is an outlier
01:42:46.600 is because this layer was artificially shrunk down
01:42:53.400 So here, you see how we multiply the weight by 0.1
01:42:58.320 in the initialization to make the last layer predictions less confident.
01:43:09.280 And that's why we're getting temporarily a very high ratio,
01:43:19.680 I usually like to keep track of the evolution of this update ratio for all my parameters.
01:43:34.160 If this ratio is much below the 1e-3 guide, it usually means that the parameters are not training fast enough.
01:43:40.360 Let's re-initialize, and then let's actually try a learning rate that is way too low.
01:44:00.320 So now the size of the update is basically 10,000 times smaller than the data.
01:44:13.120 So this is another way to sometimes set the learning rate
01:44:16.880 and to get a sense of what that learning rate should be.
01:44:19.200 And ultimately this is something that you would keep track of.
01:44:29.520 because you see that we're above the black line of negative three.
01:44:36.000 It's like, okay, but everything is like somewhat stabilizing.
01:44:48.440 So for example, everything looks pretty well behaved, right?
01:44:56.440 Let me come up here, and let's say, for example,
01:45:01.760 that we forgot to apply this fan-in normalization.
01:45:07.200 So the weights inside the linear layers are now just sampled from a plain Gaussian at all these stages.
01:45:10.720 What happens then? How do we notice that something's off?
01:45:21.480 The histograms for these weights are gonna be all messed up as well.
01:45:28.200 I suspect it's all gonna be also pretty messed up.
01:45:30.680 So you see there's a lot of discrepancy in how fast these layers are learning,
01:45:41.400 those are very large numbers in terms of this ratio.
01:45:44.240 Again, you should be somewhere around 1e-3 for this ratio,
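For a sense of scale, here is a tiny standalone illustration (layer sizes made up) of what forgetting the fan-in normalization does at initialization; the scaled version uses 1/sqrt(fan_in) together with the 5/3 tanh gain, which is the value torch.nn.init.calculate_gain('tanh') returns:

```python
import torch

fan_in, fan_out = 200, 200
x = torch.randn(1000, fan_in)                                   # roughly unit-Gaussian inputs

w_plain  = torch.randn(fan_in, fan_out)                         # "forgot" the normalization
w_scaled = torch.randn(fan_in, fan_out) * (5/3) / fan_in**0.5   # tanh gain / sqrt(fan_in)

print((x @ w_plain).std())    # ~sqrt(200) ≈ 14: the pre-activations blow up
print((x @ w_scaled).std())   # ~1.67: the scale is roughly preserved layer to layer
```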
01:45:47.640 So this is how mis-calibrations of your neural nets show up,
01:45:54.320 and these plots are a good way of bringing those mis-calibrations to your attention.
01:46:10.200 With a precisely calibrated gain we can make the activations, the gradients and the parameters somewhat well behaved.
01:46:15.880 But it definitely feels a little bit like a balancing act,
01:46:21.240 and that's because this gain has to be set very precisely.
01:46:25.920 So now let's introduce batch normalization layers into this.
01:46:32.920 So here, I'm going to take the BatchNorm1d class
01:46:47.040 So right after it, but before the non-linearity.
01:46:59.260 It's totally fine to also place it at the end
01:47:01.740 after the last linear layer and before the loss function.
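So the stack of layers ends up looking roughly like this; a sketch that assumes the Linear, BatchNorm1d and Tanh classes built earlier in the lecture, and hyperparameter names (n_embd, block_size, n_hidden, vocab_size) matching that setup:

```python
layers = [
    Linear(n_embd * block_size, n_hidden), BatchNorm1d(n_hidden), Tanh(),
    Linear(n_hidden, n_hidden),            BatchNorm1d(n_hidden), Tanh(),
    Linear(n_hidden, n_hidden),            BatchNorm1d(n_hidden), Tanh(),
    Linear(n_hidden, vocab_size),          BatchNorm1d(vocab_size),  # also fine right before the loss
]
```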
01:47:25.540 The batch norm gain (gamma) is the variable that multiplicatively interacts
01:47:35.820 We can train and we can see that the activations
01:47:49.260 So, unsurprisingly, this all looks pretty good.
01:47:53.020 The activations have a standard deviation of roughly 0.65,
01:48:09.220 And then the updates also look pretty reasonable.
01:48:24.660 But now what we've gained is we are going to be slightly less brittle with respect to the gain of these layers.
01:48:34.100 So for example, I can make the gain be, say, 0.2 here,
01:48:39.100 which is much lower than what we had with the tanh.
01:48:42.860 But as we'll see, the activations will actually remain well behaved,
01:48:46.740 and that's because, again, this explicit normalization forces them to be unit Gaussian.
01:48:56.980 And so even though the forward and backward pass
01:48:59.980 to a very large extent look OK, because of the backward
01:49:02.500 pass of the batch norm and how the scale of the incoming
01:49:05.380 activations interacts in the batch norm and its backward pass,
01:49:16.340 So the gradients of these weights are affected.
01:49:24.980 But everything else is significantly more robust
01:49:28.700 in terms of the forward, backward, and the weight gradients.
01:49:32.900 It's just that you may have to retune your learning rate
01:49:35.500 if you are changing sufficiently the scale of the activations
01:49:47.180 And we're seeing that the updates are coming out lower
01:49:51.700 And then finally, we can also, if we are using batch
01:49:59.180 We don't necessarily even have to normalize by fan-in sometimes.
01:50:03.540 So if I take out the fan-in-- so these are just now
01:50:06.460 random Gaussian-- we'll see that because of batch
01:50:09.260 norm, this will actually be relatively well behaved.
01:50:11.780 So here, of course, the activations in the forward pass look good.
01:50:19.860 The weight updates look OK, a little bit of fat tails
01:50:29.300 But as you can see, we're significantly below negative 3,
01:50:33.540 so we'd have to bump up the learning rate of this batch norm
01:50:41.100 looks like we have to 10x the learning rate to get
01:50:52.940 then we'll see that everything still, of course, looks good.
01:51:07.140 So long story short, we are significantly more robust
01:51:16.300 But we actually do have to worry a little bit about the update
01:51:19.780 scales and making sure that the learning rate is properly
01:51:24.060 But the activations of the forward backward pass
01:51:26.700 and the updates are all looking significantly more
01:51:29.540 well behaved, except for the global scale that is potentially
01:51:44.300 that we're looking into that helped stabilize very deep networks.
01:51:49.660 And I hope you understand how batch normalization works.
01:51:56.100 Number two, I was hoping to pytorchify some of our code
01:52:14.980 And if you import torch.nn, then you can actually--
01:52:17.940 the way I've constructed it, you can simply just
01:52:32.740 And the implementation also is basically, as far as I'm aware, identical.
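For example, a minimal sketch of the same kind of stack written with the built-in torch.nn modules (the layer sizes here are illustrative, not necessarily the lecture's exact values):

```python
import torch.nn as nn

n_embd, block_size, n_hidden, vocab_size = 10, 3, 100, 27  # illustrative hyperparameters

model = nn.Sequential(
    nn.Linear(n_embd * block_size, n_hidden), nn.BatchNorm1d(n_hidden), nn.Tanh(),
    nn.Linear(n_hidden, n_hidden),            nn.BatchNorm1d(n_hidden), nn.Tanh(),
    nn.Linear(n_hidden, vocab_size),
)
print(sum(p.numel() for p in model.parameters()))  # quick parameter count
```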
01:52:42.460 Number three, I tried to introduce you to the diagnostic tools that you would use to understand whether your neural network is in a good state.
01:52:46.300 So we are looking at the statistics and histograms
01:52:55.740 that are going to be updated as part of stochastic gradient descent.
01:52:58.980 And we're looking at their means, standard deviations,
01:53:21.580 And in particular, I said that 1e-3, or thereabouts,
01:53:27.540 is a good rough heuristic for what you want this ratio to be.
01:53:31.820 And if it's way too high, then probably the learning rate is too high.
01:53:36.540 And if it's way too small, then the learning rate is probably too low.
01:53:40.820 So this is one more knob that you may want to play with when you try to get your neural net to train well.
01:53:46.540 Now, there's a number of things I did not try to achieve.
01:53:49.260 I did not try to beat our previous performance, as an example,
01:53:54.060 Actually, I did try: I used the learning
01:53:57.620 rate finding mechanism that I've described before,
01:53:59.940 I tried to train a batch norm layer, a batch norm neural net,
01:54:03.260 and I actually ended up with results that are very, very similar to what we obtained before.
01:54:08.260 And that's because our performance now is not bottlenecked
01:54:11.460 by the optimization, which is what batch norm is helping with.
01:54:15.140 The performance at this stage is bottlenecked by--
01:54:17.420 what I suspect is the context length of our context.
01:54:26.180 And we need to look at more powerful architectures
01:54:28.380 like recurrent neural networks and transformers
01:54:30.900 in order to further push the log probabilities that we're
01:54:36.500 And I also did not try to have a full explanation of all
01:54:40.940 of these activations, the gradients, and the backward pass
01:54:45.540 And so you may have found some of the parts here unintuitive.
01:54:47.940 And maybe you were slightly confused about, OK,
01:54:55.540 because you'd have to actually look at the backward pass
01:55:05.740 to the diagnostic tools and what they look like.
01:55:09.860 on the intuitive level to understand the initialization,
01:55:15.780 But you shouldn't feel too bad because, honestly, we
01:55:18.340 are getting to the cutting edge of where the field is.
01:55:22.900 We certainly haven't, I would say, solved initialization.
01:55:28.180 And these are still very much an active area of research.
01:55:31.940 what's the best way to initialize these networks, what
01:55:38.860 And we don't really have all the answers to all these cases.
01:55:48.260 whether or not things are on the right track for now.
01:55:51.700 So I think we've made positive progress in this lecture.