00:00:00.000 Hi everyone. Today we are continuing our implementation of Makemore. Now in the
00:00:05.680 last lecture we implemented the bigram language model and we implemented it
00:00:09.080 both using counts and also using a super simple neural network that has a
00:00:13.240 single linear layer. Now this is the Jupyter notebook that we built out last
00:00:19.240 lecture and we saw that the way we approached this is that we looked at only
00:00:23.160 the single previous character and we predicted the distribution for the
00:00:26.560 character that would come next in the sequence and we did that by taking counts
00:00:30.720 and normalizing them into probabilities so that each row here sums to one. Now
00:00:36.920 this is all well and good if you only have one character of previous context and
00:00:41.080 this works and it's approachable. The problem with this model of course is that
00:00:46.160 the predictions from this model are not very good because you only take one
00:00:50.120 character of context so the model didn't produce very name like sounding
00:00:54.440 things. Now the problem with this approach though is that if we are to take more
00:00:59.920 context into account when predicting the next character in a sequence things
00:01:03.720 quickly blow up and this table the size of this table grows and in fact it grows
00:01:08.320 exponentially with the length of the context because if we only take a single
00:01:12.360 character at a time that's 27 possibilities of context but if we take two
00:01:16.720 characters in the past and try to predict the third one suddenly the number of rows
00:01:21.120 in this matrix you can look at it that way is 27 times 27 so there's 729
00:01:26.640 possibilities for what could have come in the context. If we take three
00:01:31.080 characters as the context suddenly we have roughly 20,000 possibilities of context (27 times 27 times 27 is 19,683) and
00:01:37.000 so there's just way too many rows of this matrix it's way too few counts for
00:01:44.080 each possibility and the whole thing just kind of explodes and doesn't work
00:01:47.360 very well. So that's why today we're going to move on to this bullet point here and
00:01:52.320 we're going to implement a multi layer perceptron model to predict the next
00:01:56.240 character in a sequence and this modeling approach that we're going to
00:02:00.320 adopt follows this paper, Bengio et al. 2003. So I have the paper pulled up here
00:02:05.720 now this isn't the very first paper that proposed the use of multi layer
00:02:10.120 perceptrons or neural networks to predict the next character or token in a
00:02:13.760 sequence but it's definitely one that was very influential around that time
00:02:17.800 it is very often cited to stand in for this idea and I think it's a very nice
00:02:21.920 write-up and so this is the paper that we're going to first look at and then
00:02:25.720 implement. Now this paper has 19 pages so we don't have time to go into the full
00:02:30.640 detail of this paper but I invite you to read it it's very readable interesting
00:02:34.520 and has a lot of interesting ideas in it as well. In the introduction they
00:02:38.040 describe the exact same problem I just described and then to address it they
00:02:41.960 propose the following model. Now keep in mind that we are building a
00:02:46.600 character level language model so we're working on the level of characters in
00:02:49.720 this paper they have a vocabulary of 17,000 possible words and they instead
00:02:54.800 build a word level language model but we're going to still stick with the
00:02:58.160 characters but we'll take the same modeling approach. Now what they do is
00:03:02.520 basically they propose to take every one of these words 17,000 words and
00:03:06.520 they're going to associate to each word a say 30 dimensional feature vector.
00:03:11.760 So every word is now embedded into a 30 dimensional space you can think of it
00:03:18.360 that way. So we have 17,000 points or vectors in a 30 dimensional space and
00:03:23.480 that's you might imagine that's very crowded that's a lot of points for a very
00:03:27.520 small space. Now in the beginning these words are initialized completely
00:03:32.080 randomly so they're spread out at random but then we're going to tune these
00:03:36.520 embeddings of these words using back propagation. So during the course of
00:03:40.680 training of this neural network these points or vectors are going to basically
00:03:43.920 move around in this space and you might imagine that for example words that have
00:03:48.360 very similar meanings, or are indeed synonyms of each other, might end
00:03:52.040 up in a very similar part of the space and conversely words that mean very
00:03:55.680 different things would go somewhere else in this space. Now their modeling
00:04:00.160 approach otherwise is identical to ours. They are using a multi-layer neural
00:04:04.160 network to predict the next word given the previous words and to train the
00:04:08.320 neural network they are maximizing the log likelihood of the training data just
00:04:11.680 like we did. So the modeling approach itself is identical. Now here they have a
00:04:16.360 concrete example of this intuition. Why does it work? Basically suppose that for
00:04:21.520 example you are trying to predict a dog was running in a blank. Now suppose that
00:04:26.360 the exact phrase a dog was running in a has never occurred in the training data
00:04:31.200 and here you are at sort of test time later when the model is deployed somewhere
00:04:35.720 and it's trying to make a sentence and it's saying a dog was running in a blank
00:04:40.680 and because it's never encountered this exact phrase in the training set you're
00:04:45.360 out of distribution as we say like you don't have fundamentally any reason to
00:04:50.040 suspect what might come next but this approach actually allows you to get
00:04:55.960 around that because maybe you didn't see the exact phrase a dog was running in a
00:04:59.560 something but maybe you've seen similar phrases maybe you've seen the phrase the
00:05:03.800 dog was running in a blank and maybe your network has learned that a and the
00:05:08.360 are, like, frequently interchangeable with each other, and so maybe it took the
00:05:13.040 embedding for a and the embedding for the and it actually put them like nearby
00:05:17.000 each other in the space and so you can transfer knowledge through that embedding
00:05:21.040 and you can generalize in that way. Similarly the network could know that
00:05:24.920 cats and dogs are animals and they co-occur in lots of very similar
00:05:28.560 contexts and so even though you haven't seen this exact phrase or if you haven't
00:05:33.200 seen exactly walking or running you can through the embedding space transfer
00:05:38.040 knowledge and you can generalize to novel scenarios. So let's now scroll down to
00:05:43.200 the diagram of the neural network they have a nice diagram here and in this
00:05:48.000 example we are taking three previous words and we are trying to predict the
00:05:52.840 fourth word in a sequence. Now these three previous words as I mentioned we
00:05:59.000 have a vocabulary of 17,000 possible words so every one of these basically
00:06:05.320 is the index of the incoming word, and because there are 17,000 words this is
00:06:11.960 an integer between 0 and 16,999. Now there's also a lookup table that they
00:06:19.600 call C. This lookup table is a matrix that is 17,000 by say 30 and basically
00:06:26.560 what we're doing here is we're treating this as a lookup table and so every
00:06:30.240 index is plucking out a row of this embedding matrix so that each index is
00:06:36.640 converted to the 30-dimensional vector that corresponds to the embedding
00:06:40.440 vector for that word. So here we have the input layer of 30 neurons for three
00:06:46.960 words making up 90 neurons in total and here they're saying that this matrix C
00:06:52.360 is shared across all the words so we're always indexing into the same
00:06:56.360 matrix C over and over for each one of these words. Next up is the hidden layer
00:07:03.160 of this neural network. The size of this hidden layer of this neural net is
00:07:07.640 a hyperparameter, so we use the word hyperparameter when it's kind of like a
00:07:11.240 design choice up to the designer of the neural net and this can be as large as
00:07:15.320 you'd like or as small as you'd like so for example the size could be 100 and
00:07:18.760 we are going to go over multiple choices of the size of this hidden layer and we're
00:07:23.560 going to evaluate how well they work. So say there were 100 neurons here all of
00:07:28.520 them would be fully connected to the 90 numbers that make up these
00:07:34.520 three words. So this is a fully connected layer, then there's a tanh
00:07:39.000 nonlinearity and then there's this output layer and because there are 17,000
00:07:43.680 possible words that could come next this layer has 17,000 neurons and all of
00:07:49.160 them are fully connected to all of these neurons in the hidden layer. So there's
00:07:55.160 a lot of parameters here because there's a lot of words so most computation is
00:07:59.280 here this is the expensive layer. Now there are 17,000 logits here so on top of
00:08:05.120 that we have the softmax layer which we've seen in our previous video as well.
00:08:08.840 So every one of these logits is exponentiated and then everything is
00:08:12.760 normalized to sum to 1 so that we have a nice probability distribution for the
00:08:17.280 next word in the sequence. Now of course during training we actually have the
00:08:22.040 label we have the identity of the next word in the sequence that word or its
00:08:27.480 index is used to pluck out the probability of that word and then we are
00:08:33.320 maximizing the probability of that word with respect to the parameters of this
00:08:38.720 neural net. So the parameters are the weights and biases of this output layer
00:08:43.480 the weights and biases of the hidden layer and the embedding lookup table C
00:08:48.760 and all of that is optimized using back propagation and these dashed arrows
00:08:54.160 ignore those, they represent a variation of the neural net that we are not going to
00:08:58.400 explore in this video. So that's the setup and now let's implement it. Okay so I
00:09:02.920 started a brand new notebook for this lecture we are importing PyTorch and we
00:09:07.680 are importing Matplotlib so we can create figures. Then I am reading all the
00:09:11.880 names into a list of words like I did before and I'm showing the first eight
00:09:16.160 right here. Keep in mind that we have 32,000 in total, these are just the first
00:09:21.760 eight and then here I'm building out the vocabulary of characters and all the
00:09:25.680 mappings from the characters as strings to integers and vice versa. Now the first
00:09:31.800 thing we want to do is we want to compile the dataset for the neural network and I
00:09:35.880 had to rewrite this code I'll show you in a second what it looks like. So this is
00:09:41.520 the code that I created for the dataset creation so let me first run it and then
00:09:45.520 I'll briefly explain how this works. So first we're going to define something
00:09:50.160 called block size and this is basically the context length of how many characters
00:09:54.680 do we take to predict the next one. So here in this example we're taking three
00:09:58.560 characters to predict the fourth one so we have a block size of three that's the
00:10:02.720 size of the block that supports the prediction. Then here I'm building out the
00:10:08.200 x and y. The x are the input to the neural net and the y are the labels for each
00:10:14.640 example inside x. Then I'm iterating over the first five words. I'm doing the first five
00:10:20.800 just for efficiency while we are developing all the code but then later
00:10:24.480 we're going to come here and erase this so that we use the entire training set.
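To make this concrete, here is a minimal sketch of the kind of dataset-building code being described (hedged: it assumes `words` is the list of names and `stoi` is the character-to-integer mapping built above; names and step count are illustrative):

```python
import torch

block_size = 3  # context length: how many characters we take to predict the next one

X, Y = [], []
for w in words[:5]:                    # first five words only while developing
    context = [0] * block_size         # start with a padded context of '.' tokens (index 0)
    for ch in w + '.':
        ix = stoi[ch]
        X.append(context)              # the running context is the input
        Y.append(ix)                   # the current character is the label
        context = context[1:] + [ix]   # crop and append: a rolling window of context

X = torch.tensor(X)                    # shape (32, 3) for the first five words
Y = torch.tensor(Y)                    # shape (32,)
```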
00:10:28.400 So here I'm printing the word Emma and here I'm basically showing the examples
00:10:34.480 that we can generate the five examples that we can generate out of the single
00:10:37.960 sort of word Emma. So when we are given the context of just dot dot dot, the
00:10:45.280 label is E, the first character in the sequence. When the context is dot dot E, the
00:10:50.760 label is M, and so forth. And so the way I build this out is
00:10:55.720 first I start with a padded context of just zero tokens. Then I iterate over all
00:11:00.720 the characters. I get the character in the sequence and I basically build out
00:11:05.640 the array y of this current character and the array x which stores the current
00:11:10.360 running context. And then here, see, I print everything, and here I crop the
00:11:16.360 context and enter the new character into the sequence. So this is kind of like a
00:11:19.440 rolling window of context. Now we can change the block size here to for example
00:11:24.720 four and in that case we would be predicting the fifth character given the
00:11:29.040 previous four, or it can be five and then it would look like this, or it can be say
00:11:34.840 ten and then it would look something like this. We're taking ten characters to
00:11:39.160 predict the eleventh one and we're always padding with dots. So let me bring
00:11:43.960 this back to three just so that we have what we have here in the paper. And
00:11:49.880 finally the data set right now looks as follows. From these five words we have
00:11:55.240 created a data set of 32 examples and each input of the neural net is three
00:12:00.000 integers and we have a label that is also an integer y. So x looks like this.
00:12:06.000 These are the individual examples and then y are the labels. So given this let's
00:12:15.640 now write a neural network that takes these x's and predicts the y's. First
00:12:19.520 let's build the embedding lookup table C. So we have 27 possible characters and
00:12:25.280 we're going to embed them in a lower dimensional space. In the paper they have
00:12:29.280 17,000 words and they embed them in spaces as small dimensional as 30. So they
00:12:36.120 cram 17,000 words into 30 dimensional space. In our case we have only 27
00:12:42.080 possible characters. So let's cram them into something as small, to start with, as
00:12:46.240 for example a two-dimensional space. So this lookup table will be random numbers
00:12:50.400 and we'll have 27 rows and we'll have two columns. Right, so each one of the
00:12:58.080 27 characters will have a two-dimensional embedding. So that's our matrix C of
00:13:04.560 embeddings in the beginning initialized randomly. Now before we embed all of the
00:13:09.600 integers inside the input x using this lookup table C, let me actually just try
00:13:15.320 to embed a single individual integer like say 5. So we get a sense of how this
00:13:20.640 works. Now one way this works of course is we can just take the C and we can
00:13:25.640 index into row 5 and that gives us a vector the fifth row of C and this is one
00:13:33.600 way to do it. The other way that I presented in the previous lecture is
00:13:36.960 actually seemingly different but actually identical. So in the previous
00:13:41.440 lecture what we did is we took these integers and we used the one hot encoding
00:13:45.000 to first encode them. So with F.one_hot we want to encode the integer 5 and we want
00:13:51.320 to tell it that the number of classes is 27. So that's the 27-dimensional vector
00:13:55.760 of all zeros except the fifth bit is turned on. Now this actually doesn't work.
00:14:02.360 The reason is that this input actually must be a torch.tensor and I'm
00:14:08.320 making some of these errors intentionally just so you get to see some errors and
00:14:11.320 how to fix them. So this must be a tensor, not an int, fairly straightforward to
00:14:15.280 fix. We get a one-hot vector the fifth dimension is 1 and the shape of this is
00:14:20.480 27. And now notice that just as I briefly alluded to in a previous video if we
00:14:26.560 take this one hot vector and we multiply it by C then what would you expect? Well
00:14:37.680 number one first you'd expect an error because expected scalar type long but
00:14:45.080 found float. So a little bit confusing but the problem here is that one hot the
00:14:50.240 data type of it is long it's a 64-bit integer but this is a float tensor and so
00:14:57.920 PyTorch doesn't know how to multiply an int with a float and that's why we had
00:15:02.440 to explicitly cast this to a float so that we can multiply. Now the output
00:15:07.560 here is actually identical, and it's identical because of the way the matrix
00:15:13.440 multiplication here works. We have the one hot vector multiplying columns of C and
00:15:19.640 because of all the zeros they actually end up masking out everything in C
00:15:23.920 except for the fifth row, which is plucked out, and so we actually arrive at
00:15:28.520 the same result and that tells you that here we can interpret this first piece
00:15:33.760 here this embedding of the integer we can either think of it as the integer
00:15:37.560 indexing into a lookup table C but equivalently we can also think of this
00:15:42.040 little piece here as a first layer of this bigger neural net. This layer here has
00:15:48.040 neurons that have no nonlinearity there's no tanh they're just linear neurons and
00:15:51.960 their weight matrix is C and then we are encoding integers into one hot and
00:15:58.440 feeding those into a neural net and this first layer basically embeds them. So
00:16:02.920 those are two equivalent ways of doing the same thing we're just going to
00:16:06.160 index because it's much much faster and we're going to discard this
00:16:09.360 interpretation of one hot inputs into neural nets and we're just going to
00:16:14.080 index integers and create and use embedding tables. Now embedding a single
00:16:18.400 integer like 5 is easy enough we can simply ask PyTorch to retrieve the
00:16:23.080 fifth row of C or the row index 5 of C but how do we simultaneously embed all
00:16:30.160 of these 32 by 3 integers stored in array X? Luckily PyTorch indexing is
00:16:36.040 fairly flexible and quite powerful so it doesn't just work to ask for a
00:16:41.960 single element 5 like this you can actually index using lists so for
00:16:46.680 example we can get the rows 5, 6 and 7 and this will just work like this we can
00:16:51.880 index with a list. It doesn't just have to be a list, it can actually also be a
00:16:56.240 tensor of integers and we can index with that, so this is an integer tensor 5, 6, 7
00:17:02.960 and this will just work as well. In fact we can also for example repeat row 7 and
00:17:09.160 retrieve it multiple times and that same index will just get embedded multiple
00:17:14.640 times here. So here we are indexing with a one-dimensional tensor of integers but
00:17:20.960 it turns out that you can also index with multi-dimensional tensors of
00:17:24.480 integers. Here we have a two-dimensional tensor of integers so we can simply just
00:17:30.000 do C at X and this just works and the shape of this is 32 by 3 which is the
00:17:39.040 original shape and now for every one of those 32 by 3 integers we've retrieved
00:17:43.000 the embedding vector here. So basically we have that, as an example, at
00:17:51.400 example index 13, position 2, the integer is 1, and so
00:17:59.080 here if we do C of X, which gives us that array, and then we index into [13, 2] of
00:18:05.600 that array, then we get the embedding here, and you can verify that C at 1, which is
00:18:14.720 the integer at that location is indeed equal to this. You see they're equal. So
00:18:21.920 basically long story short pytorch indexing is awesome and to embed
00:18:26.160 simultaneously all of the integers in X we can simply do C of X and that is our
00:18:31.960 embedding and that just works. Now let's construct this layer here the hidden layer.
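Before that, here is a compact sketch pulling together the embedding ideas just covered: direct indexing into C, the equivalent one-hot route, and indexing with the whole multi-dimensional X at once (assuming C and X as defined above; variable names are illustrative):

```python
import torch
import torch.nn.functional as F

C = torch.randn((27, 2))       # embedding table: 27 characters, 2-dimensional embeddings

# Way 1: pluck out a row by indexing
e1 = C[5]                                                      # embedding of character index 5

# Way 2: one-hot encode, cast to float, and matrix-multiply -- identical result
e2 = F.one_hot(torch.tensor(5), num_classes=27).float() @ C    # the one-hot masks out all rows but the fifth

# Indexing also works with lists, integer tensors, and multi-dimensional tensors
print(C[[5, 6, 7]].shape)      # torch.Size([3, 2])
emb = C[X]                     # X is (32, 3), so this is (32, 3, 2): one embedding per input integer
```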
00:18:38.080 So we have that W1 as I'll call it are these weights which we will initialize
00:18:44.840 randomly. Now the number of inputs to this layer is going to be three times two
00:18:50.440 right because we have two dimensional embeddings and we have three of them so
00:18:54.000 the number of inputs is six and the number of neurons in this layer is a
00:18:58.880 variable up to us. Let's use 100 neurons as an example and then biases will be
00:19:05.120 also initialized randomly, and we just need 100 of them.
00:19:10.400 Now there's a problem with this. Normally we would take the input,
00:19:15.400 in this case that's the embedding, and we'd like to multiply it with these weights
00:19:19.840 and then we would like to add the bias. This is roughly what we want to do but
00:19:24.680 the problem here is that these embeddings are stacked up in the dimensions of
00:19:28.440 this input tensor. So this will not work this matrix multiplication because this
00:19:32.480 is a shape 32 by 3 by 2 and I can't multiply that by 6 by 100. So somehow we
00:19:37.920 need to concatenate these inputs here together so that we can do something
00:19:42.280 along these lines which currently does not work. So how do we transform this
00:19:46.240 32 by 3 by 2 into a 32 by 6 so that we can actually perform this multiplication
00:19:53.480 over here. I'd like to show you that there are usually many ways of
00:19:57.120 implementing what we'd like to do in Torch and some of them will be faster,
00:20:01.760 better, shorter etc. And that's because Torch is a very large library and it's
00:20:07.160 got lots and lots of functions. So if we just go to the documentation and click
00:20:10.320 on Torch you'll see that my slider here is very tiny and that's because there
00:20:14.880 are so many functions that you can call on these tensors to transform them, create
00:20:19.080 them, multiply them, add them, perform all kinds of different operations on them.
00:20:23.760 And so this is kind of like the space of possibility if you will. Now one of the
00:20:31.760 things that you can do is we can control here control f for concatenate and we
00:20:36.080 see that there's a function torch.cat, short for concatenate, and this concatenates
00:20:41.440 a given sequence of tensors in a given dimension, and these tensors must have
00:20:46.440 the same shape etc. So we can use the concatenate operation to in a naive way
00:20:51.160 concatenate these three embeddings for each input. So in this case we have emb
00:20:58.280 of this shape, and really what we want to do is we want to retrieve these three
00:21:03.040 parts and concatenate them. So we want to grab all the examples we want to grab
00:21:09.640 first the zeroth index and then all of this. So this plucks out the 32 by 2
00:21:21.680 embeddings of just the first word here. And so basically we want this guy we want
00:21:29.040 the first dimension and we want the second dimension and these are the three
00:21:33.760 pieces individually and then we want to treat this as a sequence and we want to
00:21:39.120 torch.cat on that sequence. So this is the list torch.cat takes a sequence of
00:21:46.400 tensors and then we have to tell it along which dimension to concatenate. So in
00:21:51.680 this case all these are 32 by 2 and we want to concatenate not across dimension
00:21:55.680 zero but across dimension one. So passing in one gives us the result. The shape of
00:22:02.520 this is 32 by 6 exactly as we'd like. So that basically took the 32 by 3 by 2 and,
00:22:08.240 by concatenating, squashed it into 32 by 6. Now this is kind of ugly because this
00:22:13.520 code would not generalize if we want to later change the block size. Right now
00:22:17.600 we have three inputs, three words but what if we had five then here we would
00:22:23.080 have to change the code because I'm indexing directly. Well, torch comes to the
00:22:26.760 rescue again because there turns out to be a function called unbind, and it removes
00:22:32.320 a tensor dimension. So removes a tensor dimension returns a tuple of all slices
00:22:38.800 along a given dimension without it. So this is exactly what we need and
00:22:43.800 basically when we call torch.unbind of emb and pass in dimension
00:22:54.960 index one, this gives us a tuple of tensors exactly equivalent to
00:23:01.640 this. So running this gives us a length three and it's exactly this list and so
00:23:09.320 we can call torch.cat on it and along the first dimension and this works and this
00:23:17.240 shape is the same but now this is it doesn't matter if we have block size three
00:23:22.200 or five or ten this will just work. So this is one way to do it but it turns out
00:23:27.120 that in this case there's actually a significantly better and more efficient
00:23:30.720 way and this gives me an opportunity to hint at some of the internals of torch.tensor.
00:23:35.560 So let's create an array here of elements from zero to 17 and the shape of
00:23:43.280 this is just 18. It's a single vector of 18 numbers. It turns out that we can very
00:23:49.600 quickly re-represent this as different sized and dimensional tensors. We do
00:23:54.920 this by calling .view, and we can say that actually this is not a single
00:23:59.800 vector of 18. This is a 2 by 9 tensor or alternatively this is a 9 by 2 tensor or
00:24:08.200 this is actually a 3 by 3 by 2 tensor. As long as the total number of elements
00:24:13.760 here multiply to be the same this will just work and in pytorch this operation
00:24:20.840 calling.view is extremely efficient and the reason for that is that in each
00:24:26.440 tensor there's something called the underlying storage and the storage is
00:24:31.440 just the numbers always as a one-dimensional vector and this is how this
00:24:36.000 tensor is represented in the computer memory. It's always a one-dimensional
00:24:39.400 vector but when we call .view we are manipulating some of the attributes of that
00:24:46.280 tensor that dictate how this one-dimensional sequence is interpreted to
00:24:51.200 be an n-dimensional tensor and so what's happening here is that no memory is
00:24:55.840 being changed, copied, moved or created when we call .view. The storage is
00:25:00.640 identical but when you call .view some of the internal attributes of the
00:25:06.560 view of this tensor are being manipulated and changed. In particular,
00:25:10.000 there's something called the storage offset, strides and shapes, and
00:25:13.680 those are manipulated so that this one-dimensional sequence of bytes is seen
00:25:17.960 as different n-dimensional arrays. There's a blog post here from Eric called
00:25:23.000 PyTorch internals where he goes into some of this with respect to tensor and how
00:25:27.920 the view of a tensor is represented and this is really just like a logical
00:25:32.000 construct of representing the physical memory and so this is a pretty good
00:25:37.520 blog post that you can go into. I might also create an entire video on the
00:25:41.560 internals of torch.Tensor and how this works. For here we just note that this is
00:25:46.120 an extremely efficient operation.
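A tiny demonstration of this, as a sketch (the expected outputs in the comments are what I'd anticipate from PyTorch, not copied from the notebook):

```python
import torch

a = torch.arange(18)              # underlying storage: 18 numbers in one flat vector
print(a.view(2, 9).shape)         # torch.Size([2, 9])
print(a.view(9, 2).shape)         # torch.Size([9, 2])
print(a.view(3, 3, 2).shape)      # torch.Size([3, 3, 2])
# .view only changes the shape/stride metadata; no memory is copied
print(a.view(3, 3, 2).data_ptr() == a.data_ptr())   # True: same underlying storage
```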
00:25:52.720 So if I delete this and come back to our emb, we see that the shape of our emb is 32 by 3 by 2, but we can simply ask PyTorch
00:25:59.200 to view this instead as a 32 by 6, and the way it gets flattened into a 32 by
00:26:06.000 6 array just happens to be that these two get stacked up in a single row, and so that's
00:26:14.200 basically the concatenation operation that we're after, and you can verify that
00:26:18.200 this actually gives the exact same result as what we had before. So this is an
00:26:22.840 element-wise equals, and you can see that all the elements of these two tensors are
00:26:26.080 the same and so we get the exact same result. So long story short we can actually
00:26:32.960 just come here and if we just view this as a 32 by 6 instead then this
00:26:40.040 multiplication will work and give us the hidden states that we're after. So if
00:26:44.640 this is h, then h.shape is now the 100-dimensional activations for every
00:26:51.360 one of our 32 examples and this gives the desired result. Let me do two things
00:26:56.240 here. Number one, let's not use 32, we can for example do something like emb dot
00:27:03.200 shape at 0 so that we don't hardcode these numbers, and this would work for any
00:27:08.440 size of this emb, or alternatively we can also do negative 1. When we do negative
00:27:13.280 1 pytorch will infer what this should be because the number of elements must be
00:27:18.000 the same and we're saying that this is 6 pytorch will derive that this must be
00:27:21.920 32 or whatever else it is if emb is of a different size. The other thing
00:27:28.400 I'd like to point out is that here, when we do the concatenation, this
00:27:35.400 is actually much less efficient, because the concatenation would create a whole
00:27:40.400 new tensor with a whole new storage, so new memory is being created, because
00:27:44.320 there's no way to concatenate tensors just by manipulating the view attributes.
00:27:47.920 So this is inefficient and creates all kinds of new memory.
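To summarize the three ways of flattening the embeddings we just went through (a sketch, assuming `emb` is the (32, 3, 2) tensor C[X] from above):

```python
import torch

# 1) torch.cat on hand-picked slices: works, but hardcodes block_size = 3
flat1 = torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], dim=1)   # (32, 6)

# 2) torch.cat on torch.unbind: generalizes to any block size, but still copies memory
flat2 = torch.cat(torch.unbind(emb, dim=1), dim=1)                     # (32, 6)

# 3) .view: no new memory at all, just reinterprets the same storage
flat3 = emb.view(emb.shape[0], -1)                                     # (32, 6)

print((flat1 == flat3).all().item())   # True: all three give the same result
```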
00:27:53.640 So let me delete this now, we don't need it, and here, to calculate h, we also want to take torch.tanh of
00:28:01.440 this to get our h. So these are now numbers between negative 1 and 1
00:28:09.000 because of the tanh, and we have that the shape is 32 by 100, and that is
00:28:14.560 basically this hidden layer of activations here for every one of our
00:28:18.360 32 examples. Now there's one more thing I glossed over that we have to be very
00:28:22.720 careful with, and that's this plus here. In particular we want to
00:28:27.160 make sure that the broadcasting will do what we like. The shape of this is 32 by
00:28:32.240 100 and the one's shape is 100. So we see that the addition here will broadcast
00:28:38.160 these two and in particular we have 32 by 100 broadcasting to 100. So
00:28:44.440 broadcasting will align on the right create a fake dimension here so this
00:28:49.400 will become a 1 by 100 row vector and then it will copy vertically for every
00:28:55.280 one of these rows of 32 and do an element-wise addition. So in this case the
00:29:00.000 correct thing will be happening, because the same bias vector will be added to all
00:29:04.480 the rows of this matrix. So that is correct, that's what we'd like, and it's
00:29:10.680 always good practice to just make sure, so that you don't shoot yourself in the
00:29:14.160 foot.
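Here is the shape bookkeeping for this hidden layer in one place, as a sketch (assuming emb, W1 and b1 as defined above):

```python
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
# (32, 6) @ (6, 100) -> (32, 100); the (100,) bias broadcasts as a (1, 100) row
# vector that is copied down across all 32 rows, which is exactly what we want
print(h.shape)   # torch.Size([32, 100])
```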
00:29:22.880 And finally, let's create the final layer here. So let's create W2 and B2. The input now is 100 and the output number of neurons will be 27 for us
00:29:28.040 because we have 27 possible characters that come next. So the biases will be
00:29:32.800 27 as well. So therefore the logits which are the outputs of this neural net are
00:29:38.400 going to be h multiplied by W2 plus B2. logits.shape is 32 by 27 and the
00:29:50.880 logits look good. Now exactly as we saw in the previous video we want to take
00:29:55.560 these logits and we want to first exponentiate them to get our fake
00:29:59.000 counts and then we want to normalize them into a probability. So prob is counts
00:30:04.400 divided by counts.sum along the first dimension, with keepdim set to True,
00:30:11.720 exactly as in the previous video. And so prob.shape now is 32 by 27 and you'll
00:30:20.440 see that every row of prob sums to one so it's normalized. So that gives us the
00:30:27.360 probabilities. Now of course we have the actual letter that comes next and that
00:30:31.560 comes from this array y which we created during the dataset creation. So y
00:30:38.040 is this last piece here which is the identity of the next character in a
00:30:41.680 sequence that we'd like to now predict. So what we'd like to do now is just as
00:30:46.720 in the previous video we'd like to index into the rows of prob and in each row we'd
00:30:51.520 like to pluck out the probability assigned to the correct character as
00:30:55.040 given here. So first we have torch.arange of 32, which is kind of like an
00:31:01.440 iterator over numbers from 0 to 31, and then we can index into prob in the
00:31:07.600 following way: prob at torch.arange of 32, which iterates over the
00:31:12.800 rows, and then in each row we'd like to grab this column as given by y. So this
00:31:19.720 gives the current probabilities as assigned by this neural network with this
00:31:23.360 setting of its weights to the correct character in the sequence. And you can
00:31:28.160 see here that this looks okay for some of these characters like this is
00:31:31.320 basically 0.2 but it doesn't look very good at all for many other
00:31:34.800 characters. Like this one is 0.0 with seven zeros and then a 1, a tiny probability, and so the network thinks
00:31:40.760 that some of these are extremely unlikely. But of course we haven't trained the
00:31:44.080 neural network yet so this will improve and ideally all of these numbers here
00:31:49.680 of course are 1 because then we are correctly predicting the next character.
00:31:53.520 Now just as in the previous video, we want to take these probabilities, we want to
00:31:57.640 look at the log probability and then we want to look at the average log probability
00:32:01.520 and then negative of it to create the negative log likelihood loss. So the loss
00:32:08.320 here is 17 and this is the loss that we'd like to minimize to get the network to
00:32:13.440 predict the correct character in the sequence.
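Putting the whole forward pass and this manual loss together, roughly as developed so far (a sketch; C, W1, b1, W2, b2, X and Y are the tensors defined above, with the 32 development examples):

```python
emb = C[X]                                        # (32, 3, 2)
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)         # (32, 100)
logits = h @ W2 + b2                              # (32, 27)
counts = logits.exp()                             # "fake counts", as in the bigram model
prob = counts / counts.sum(1, keepdim=True)       # each row sums to 1
loss = -prob[torch.arange(32), Y].log().mean()    # average negative log likelihood
print(loss.item())                                # about 17 here, before any training
```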
00:32:18.000 Okay, so I rewrote everything here and made it a bit more respectable. So here's our dataset. Here's all the
00:32:22.520 parameters that we defined. I'm now using a generator to make it reproducible. I
00:32:27.280 clustered all the parameters into a single list of parameters so that for
00:32:31.480 example it's easy to count them and see that in total we currently have about
00:32:34.880 3,400 parameters. And this is the forward pass as we developed it and we
00:32:40.000 arrive at a single number here the loss that is currently expressing how well
00:32:44.520 this neural network works with the current setting of parameters. Now I would like
00:32:49.360 to make it even more respectable. So in particular, see these lines here where we
00:32:53.480 take the logits and we calculate a loss. We're not actually reinventing the wheel
00:32:59.440 here. This is just classification and many people use classification and that's
00:33:04.680 why there is a functional.cross entropy function in PyTorch to
00:33:08.600 calculate this much more efficiently. So we could just simply call f.cross entropy
00:33:12.800 and we can pass in the logits and we can pass in the array of targets, y, and
00:33:17.720 this calculates the exact same loss. So in fact we can simply put this here and
00:33:24.800 erase these three lines and we're going to get the exact same result. Now there
00:33:30.160 are actually many good reasons to prefer f.cross entropy over rolling your own
00:33:34.360 implementation like this. I did this for educational reasons but you'd never use
00:33:38.160 this in practice. Why is that? Number one when you use f.cross entropy PyTorch will
00:33:43.480 not actually create all these intermediate tensors because these are
00:33:47.080 all new tensors in memory and all this is fairly inefficient to run like this.
00:33:51.560 Instead PyTorch will cluster up all these operations and very often create
00:33:56.760 fused kernels that very efficiently evaluate these expressions that
00:34:01.120 are sort of like clustered mathematical operations. Number two the backward
00:34:05.720 pass can be made much more efficient and not just because it's a fused kernel but
00:34:09.960 also analytically and mathematically it's often a much simpler
00:34:15.120 backward pass to implement. We actually saw this with micrograd. You see here
00:34:20.160 when we implemented tanh, the forward pass of this operation to calculate the
00:34:24.500 tanh was actually a fairly complicated mathematical expression, but because
00:34:28.800 it's a clustered mathematical expression when we did the backward pass we
00:34:32.600 didn't individually backward through the exp and the two times and the minus one and
00:34:36.800 division etc. We just said it's 1 minus t squared and that's a much simpler
00:34:41.560 mathematical expression and we were able to do this because we're able to
00:34:45.320 reuse calculations and because we are able to mathematically and analytically
00:34:48.880 derive the derivative and often that expression simplifies mathematically
00:34:53.040 and so there's much less to implement. So not only can it be made more efficient
00:34:58.240 because it runs in a fused kernel but also because the expressions can take a
00:35:02.160 much simpler form mathematically. So that's number one. Number two under the
00:35:09.000 hood F dot cross entropy can also be significantly more numerically well
00:35:14.440 behaved. Let me show you an example of how this works. Suppose we have a
00:35:19.960 set of logits: negative 2, 3, negative 3, 0, and 5, and then we are
00:35:24.600 taking the exponent of it and normalizing it to sum to one. So when
00:35:28.620 logit's take on this values everything is well and good and we get a nice
00:35:31.720 probability distribution. Now consider what happens when some of these logits
00:35:35.760 take on more extreme values and that can happen during optimization of a
00:35:39.200 neural network. Suppose that some of these numbers grow very negative like
00:35:43.480 say negative 100, then actually everything will come out fine. We still
00:35:48.000 get probabilities that, you know, are well behaved and they sum to one and
00:35:52.880 everything is great, but because of the way the exp works, if you have very positive
00:35:57.960 logits let's say positive 100 in here you actually start to run into trouble and
00:36:02.360 we get not a number here, and the reason for that is that these counts have an
00:36:09.240 inf here. So if you pass in a very negative number to exp, you just get a very
00:36:14.400 small number, very near zero, and that's
00:36:18.960 fine but you pass in a very positive number suddenly we run out of range in
00:36:23.400 our floating point number that represents these counts. So basically we're taking
00:36:29.080 e and we're raising it to the power of 100 and that gives us inf because we run
00:36:33.840 out of dynamic range on this floating point number that is count and so we
00:36:38.560 cannot pass very large logits through this expression. Now let me reset these
00:36:44.840 numbers to something reasonable. The way PyTorch solved this is that you see
00:36:50.640 how we have a really well behaved result here. It turns out that because of the
00:36:54.800 normalization here you can actually offset logits by any arbitrary constant
00:36:59.680 value that you want. So if I add one here you actually get the exact same
00:37:03.960 result or if I add two or if I subtract three any offset will produce the exact
00:37:10.400 same probabilities. So because negative numbers are okay but positive numbers
00:37:15.840 can actually overflow this exp what PyTorch does is it internally calculates
00:37:20.680 the maximum value that occurs in the logits and it subtracts it so in this case
00:37:25.080 it would subtract five and so therefore the greatest number in logits will become
00:37:29.680 zero and all the other numbers will become some negative numbers and then
00:37:33.520 the result of this is always well behaved. So even if we have a hundred here,
00:37:37.600 which previously was not good, because PyTorch will subtract the hundred, this will
00:37:42.800 work.
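Here is a small sketch of the numerical issue and the fix just described (illustrative numbers; subtracting the maximum logit is the kind of offset a numerically stable implementation applies internally):

```python
import torch

def naive_softmax(logits):
    counts = logits.exp()
    return counts / counts.sum()

logits = torch.tensor([-2.0, 3.0, -3.0, 0.0, 5.0])
print(naive_softmax(logits))                 # well behaved
print(naive_softmax(logits - 3))             # any constant offset gives the same probabilities

big = torch.tensor([-2.0, 3.0, -3.0, 0.0, 100.0])
print(naive_softmax(big))                    # exp(100) overflows float32 to inf, so we get nan
print(naive_softmax(big - big.max()))        # subtracting the max keeps the largest logit at 0
```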
00:37:49.040 And so there are many good reasons to call F.cross_entropy: number one, the forward pass can be much more efficient, the backward pass can be much more
00:37:52.120 efficient, and also things can be much more numerically well behaved. Okay, so
00:37:56.920 let's now set up the training of this neural net. We have the forward pass.
00:38:02.000 We don't need these, because we have that loss is equal to
00:38:07.240 F.cross_entropy. That's the forward pass. Then we need the backward pass. First we
00:38:12.200 want to set the gradients to be zero. So for p in parameters, we want to make sure
00:38:17.080 that p.grad is None, which is the same as setting it to zero in PyTorch, and then
00:38:21.360 loss.backward() to populate those gradients. Once we have the gradients we
00:38:25.520 can do the parameter update. So for p in parameters, we want to take all the
00:38:29.440 data and we want to nudge it by negative learning rate times p.grad, and then we want to
00:38:38.000 repeat this a few times. And let's print the loss here as well. Now this won't
00:38:49.520 suffice, and it will create an error, because we also have to go for p in
00:38:52.760 parameters and we have to make sure that p.requires_grad is set to True in
00:38:57.760 PyTorch. And this should just work.
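The training loop being described, sketched end to end (assuming the tensors and data from above; 0.1 is the guessed learning rate and the step count is illustrative):

```python
import torch
import torch.nn.functional as F

parameters = [C, W1, b1, W2, b2]
for p in parameters:
    p.requires_grad = True

for _ in range(100):
    # forward pass
    emb = C[X]
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Y)
    # backward pass
    for p in parameters:
        p.grad = None              # same as zeroing the gradients
    loss.backward()
    # update
    for p in parameters:
        p.data += -0.1 * p.grad
    print(loss.item())
```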
00:39:05.960 Okay, so we started off with a loss of 17 and we're decreasing it. Let's run longer and you see how the loss decreases a lot
00:39:13.200 here. So if we just run for a thousand times we get a very very low loss and
00:39:21.320 that means that we're making very good predictions. Now the reason that this is
00:39:24.920 so straightforward right now is because we're only overfitting 32 examples. So
00:39:32.520 we only have 32 examples of the first five words and therefore it's very easy to
00:39:37.800 make this neural net fit only these 32 examples, because we have 3,400
00:39:43.120 parameters and only 32 examples. So we're doing what's called overfitting a
00:39:47.760 single batch of the data and getting a very low loss and good predictions but
00:39:54.320 that's just because we have so many parameters for so few examples. So it's
00:39:57.000 easy to make this be very low. Now we're not able to achieve exactly 0.
00:40:02.120 And the reason for that is we can for example look at logits which are being
00:40:06.440 predicted. And we can look at the max along the first dimension and in PyTorch
00:40:14.840 max reports both the actual values that take on the maximum number but also the
00:40:20.400 indices of these. And you'll see that the indices are very close to the labels but
00:40:26.560 in some cases they differ. For example in this very first example the predicted
00:40:31.840 index is 19 but the label is 5. And we're not able to make loss be 0 and
00:40:37.160 fundamentally that's because here the very first or the 0th index is the
00:40:43.160 example where dot dot dot is supposed to predict e but you see how dot dot
00:40:46.560 dot is also supposed to predict an o and dot dot dot is also supposed to predict
00:40:50.840 an a, an i, and an s as well. And so basically e, o, a, i, or s are all possible outcomes in
00:40:57.800 the training set for the exact same input. So we're not able to completely overfit
00:41:02.120 and and make the loss be exactly 0. But we're getting very close in the cases
00:41:08.840 where there's a unique input for a unique output. In those cases we do what's called
00:41:13.840 overfit and we basically get the exact correct result. So now
00:41:19.480 all we have to do is we just need to make sure that we read in the full data
00:41:23.480 set and optimize the neural net. Okay so let's swing back up to where we created the
00:41:27.920 data set and we see that here we only use the first five words so let me now
00:41:32.200 erase this and let me erase the print statements otherwise we'd be printing
00:41:36.240 way too much. And so when we process the full data set of all the words we now
00:41:41.440 have 228,000 examples instead of just 32. So let's now scroll back down, this
00:41:47.960 is much larger. We initialize the weights the same number of parameters they all
00:41:52.720 require gradients, and then let's push this print of loss.item() to be here
00:41:57.960 and let's just see how the optimization goes if we run this. Okay so we started
00:42:05.240 with a fairly high loss and then as we're optimizing the loss is coming down.
00:42:10.600 But you'll notice that it takes quite a bit of time for every single iteration.
00:42:15.520 So let's actually address that because we're doing way too much work forwarding
00:42:19.600 228,000 examples. In practice what people usually do is they perform
00:42:25.240 a forward pass, backward pass, and update on mini batches of the data. So what we will
00:42:30.480 want to do is we want to randomly select some portion of the data set and
00:42:34.080 that's a mini batch and then only forward backward and update on that little
00:42:37.800 mini batch, and then we iterate on those mini batches. So in PyTorch we can
00:42:43.120 for example use torch.randint. We can generate numbers between 0 and 5 and make
00:42:47.960 32 of them. I believe the size has to be a tuple in PyTorch. So we can have a
00:42:58.520 tuple of 32 numbers between 0 and 5, but actually we want X.shape at 0 here.
00:43:04.360 And so this creates integers that index into our data set and there's 32 of them.
00:43:10.440 So if our mini batch size is 32 then we can come here and we can first do mini
00:43:16.520 batch construction. So the integers that we want to optimize in this single
00:43:24.660 iteration are in IX and then we want to index into x with IX to only grab those
00:43:33.160 rows. So we're only getting 32 rows of x and therefore embeddings will again be
00:43:38.480 32 by 3 by 2 not 200,000 by 3 by 2 and then this IX has to be used not just to
00:43:45.480 index into X but also to index into Y. And now this should be mini batches and
00:43:52.520 this should be much much faster. So it's instant almost. So this way we can run
00:43:59.280 many many examples, nearly instantly and decrease the loss much much faster.
00:44:04.840 Now because we're only dealing with mini batches the quality of our gradient is
00:44:09.400 lower so the direction is not as reliable. It's not the actual gradient
00:44:13.480 direction but the gradient direction is good enough even when it's estimating on
00:44:18.280 only 32 examples that it is useful. And so it's much better to have an
00:44:24.040 approximate gradient and just make more steps than it is to evaluate the exact
00:44:28.120 gradient and take fewer steps. So that's why in practice this works quite well.
00:44:34.120 So let's now continue the optimization. Let me take out this loss.item() from here
00:44:41.240 and place it over here at the end. Okay, so we're hovering around 2.5 or so.
00:44:49.240 However this is only the loss for that mini batch. So let's actually evaluate
00:44:53.320 the loss here for all of X and for all of Y just so we have a full sense of
00:45:01.800 exactly how well the model is doing right now. So right now we're at about 2.7 on
00:45:07.080 the entire training set. So let's run the optimization for a while.
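By the way, that evaluation of the loss on the entire training set is just the same forward pass run on all of X and Y, sketched here (torch.no_grad() is optional, it just avoids building a graph we don't need):

```python
with torch.no_grad():
    emb = C[X]                                    # all ~228,000 examples at once
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Y)
print(loss.item())                                # about 2.7 at this point
```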
00:45:20.360 Okay so one issue of course is we don't know if we're stepping too slow or too
00:45:27.080 fast. So this point one I just guessed it. So one
00:45:31.720 question is how do you determine this learning rate?
00:45:34.840 And how do we gain confidence that we're stepping at the right sort of speed?
00:45:40.360 So I'll show you one way to determine a reasonable learning rate.
00:45:43.480 It works as follows. Let's reset our parameters to the initial
00:45:49.800 settings. And now let's print in every step. But let's only do 10 steps or so
00:45:58.760 or maybe 100 steps. We want to find like a very reasonable
00:46:02.760 search range, if you will. So for example this is like very low.
00:46:07.960 Then we see that the loss is barely decreasing. So that's
00:46:13.320 like too low basically. So let's try this one.
00:46:18.760 Okay so we're decreasing the loss but like not very quickly. So that's a pretty
00:46:21.800 good low range. Now let's reset it again. And now let's try to find the place at
00:46:27.320 which the loss kind of explodes. So maybe at negative one.
00:46:33.080 Okay we see that we're minimizing the loss but you see how it's kind of unstable.
00:46:37.400 It goes up and down quite a bit. So negative one is probably like a fast
00:46:42.280 learning rate. Let's try negative 10. Okay so this
00:46:46.920 isn't optimizing. This is not working very well. So negative 10 is way too big.
00:46:50.920 Negative one was already kind of big. So therefore
00:46:56.440 negative one was like somewhat reasonable if I reset.
00:47:00.440 So I'm thinking that the right learning rate is somewhere between
00:47:03.720 negative 0.001 and negative 1. So the way we can do this here is we can use
00:47:10.680 torch.linspace. And we want to basically do something like this
00:47:14.760 between 0.001 and 1, but the number of steps is one more
00:47:21.080 parameter that's required. Let's do a thousand steps.
00:47:23.880 This creates 1,000 numbers between 0.001 and 1.
00:47:29.800 But it doesn't really make sense to step between these linearly. So instead let me
00:47:33.640 create learning rate exponent. And instead of 0.001 this will be a
00:47:39.240 negative 3 and this will be a 0. And then the actual
00:47:42.680 learning rates (lrs) that we want to search over are going to be 10 to the power of lre.
00:47:48.360 So now what we're doing is we're stepping linearly between the exponents
00:47:51.400 of these learning rates. This is 0.001 and this is 1
00:47:55.640 because 10 to the power of 0 is 1. And therefore we are spaced
00:48:00.120 exponentially in this interval. So these are the candidate learning rates
00:48:04.760 that we want to sort of like search over roughly.
00:48:07.800 So now what we're going to do is here we are going to run the optimization for
00:48:13.160 1,000 steps. And instead of using a fixed number
00:48:16.760 we are going to use the learning rate indexed from here, lrs at i,
00:48:22.520 and make this i. So basically let me reset this to be again starting from random,
00:48:33.400 stepping between 0.001 and 1 but exponentially. And here what we're doing is
00:48:41.240 we're iterating a thousand times we're going to use the learning rate
00:48:46.360 that's in the beginning very very low in the beginning it's going to be 0.001
00:48:50.520 but by the end it's going to be 1. And then we're going to step with that
00:48:55.080 learning rate. And now what we want to do is we want to keep track of all
00:49:00.520 the learning rates that we used, and we want to look at the losses that resulted.
00:49:14.120 So lri.append(lr) and lossi.append(loss.item()).
00:49:30.440 And so basically we started with a very low learning rate and we went all the way
00:49:33.240 up to a learning rate of negative 1. And now what we can do is we can
00:49:37.240 plt.plot and we can plot the two. So we can plot the
00:49:41.480 learning rates on the x-axis and the losses we saw on the y-axis.
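Here is the whole learning-rate search, sketched in one place (assuming freshly re-initialized parameters and the minibatch setup from above; tracking the exponent, as discussed next, usually gives a nicer x-axis):

```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

lre = torch.linspace(-3, 0, 1000)     # exponents, stepped linearly
lrs = 10 ** lre                       # candidate rates, spaced exponentially from 0.001 to 1

lri, lossi = [], []
for i in range(1000):
    # minibatch forward/backward, exactly as before
    ix = torch.randint(0, X.shape[0], (32,))
    emb = C[X[ix]]
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Y[ix])
    for p in parameters:
        p.grad = None
    loss.backward()
    # update with the i-th candidate learning rate, and track it along with the loss
    lr = lrs[i]
    for p in parameters:
        p.data += -lr * p.grad
    lri.append(lre[i].item())         # the exponent of the learning rate
    lossi.append(loss.item())

plt.plot(lri, lossi)                  # the good setting sits in the valley before the loss blows up
```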
00:49:46.200 And often you're going to find that your plot looks something like this
00:49:50.120 where in the beginning you have very low learning rates and basically
00:49:55.000 barely anything happened. Then we got to like a nice spot here
00:50:00.040 and then as we increased the learning rate enough
00:50:03.000 we basically started to be kind of unstable here.
00:50:05.880 So a good learning rate turns out to be somewhere around here.
00:50:10.600 And because we have lri here we actually may want to
00:50:18.600 track not lr, the learning rate, but the exponent. So that would be the
00:50:23.640 lre at i is maybe what we want to log. So let me reset this and redo that
00:50:32.520 exponent of the learning rate. And so we can see the
00:50:36.600 exponent of the learning rate that is good to use. It would be sort of like
00:50:39.320 roughly in the valley here because here the learning rates are just way too
00:50:43.000 low. And then here we expect relatively good learning
00:50:46.040 rates somewhere here. And then here things are starting to explode.
00:50:49.160 So somewhere around negative one for the exponent of the learning rate is a
00:50:52.760 pretty good setting. And 10 to the negative one is 0.1, which is right
00:51:01.000 around here. And that's what we had in the initial setting.
00:51:05.400 But that's roughly how you would determine it. And so here now we can
00:51:09.240 take out the tracking of these. And we can just simply set a lr to be
00:51:15.400 10 to the negative one or basically otherwise 0.1 as it was before.
00:51:20.760 And now we have some confidence that this is actually a fairly good learning
00:51:23.400 rate. And so now what we can do is we can crank up the
00:51:26.280 iterations. We can reset our optimization. And we can run for a pretty long time
00:51:34.360 using this learning rate. Oops. And we don't want to print. It's way too much
00:51:39.000 printing. So let me again reset and run 10,000 steps.
00:51:45.320 Okay, so we're at 2.48 roughly. Let's run another 10,000 steps.
00:51:55.240 2.46. And now let's do one learning rate decay. What this means is we're going to
00:52:03.560 take our learning rate and we're going to 10x lower it.
00:52:06.840 And so we're at the late stages of training potentially. And we may want to go
00:52:11.080 a bit slower. Let's do one more run actually, at 0.1,
00:52:14.120 just to see if we're making a dent here. Okay, we're still making a dent.
00:52:20.280 And by the way, the bigram loss that we achieved last video was 2.45.
00:52:25.720 So we've already surpassed the bigram level. And once I get a sense that this
00:52:30.360 is actually kind of starting to plateau off, people like to do as I mentioned
00:52:33.720 this learning rate decay. So let's try to decay the loss, the learning rate I mean.
00:52:38.600 And we achieve about 2.3 now. Obviously, this is janky and not exactly how you would
00:52:48.440 train it in production. But this is roughly what you're going through.
00:52:51.800 You first find a decent learning rate using the approach that I showed you.
00:52:54.920 Then you start with that learning rate and you train for a while.
00:52:58.680 And then at the end, people like to do a learning rate decay, where you decay the
00:53:02.200 learning rate by say a factor of 10, and you do a few more steps.
00:53:05.320 And then you get a trained network roughly speaking.
00:53:08.600 So we've achieved 2.3 and dramatically improved on the
00:53:12.440 bigram language model using this simple neural net as described here.
00:53:16.040 Using these 3,400 parameters. Now, there's something we have to be careful with.
00:53:21.400 I said that we have a better model because we are achieving a lower loss,
00:53:26.280 2.3, much lower than the 2.45 with the bigram model previously.
00:53:30.680 Now, that's not exactly true. And the reason that's not true is that
00:53:34.440 this is actually fairly small model, but these models can get larger and larger if you keep
00:53:41.480 adding neurons and parameters. So you can imagine that instead of a few
00:53:45.480 thousand parameters we could have 10,000 or 100,000 or millions of parameters.
00:53:49.240 And as the capacity of the neural network grows, it becomes more and more capable of
00:53:54.920 overfitting your training set. What that means is that the loss on the training set,
00:54:00.040 on the data that you're training on, will become very, very low as low as zero.
00:54:04.200 But all that the model is doing is memorizing your training set verbatim.
00:54:08.920 So if you take that model and it looks like it's working really well,
00:54:11.720 but you try to sample from it, you will basically only get examples exactly as they
00:54:16.040 are in the training set. You won't get any new data.
00:54:18.440 In addition to that, if you try to evaluate the loss on some withheld names or other words,
00:54:24.760 you will actually see that the loss on those can be very high.
00:54:28.440 And so basically it's not a good model. So the standard in the field is to split up
00:54:33.080 your data set into three splits, as we call them. We have the training split,
00:54:37.480 the dev split or the validation split, and the test split. So training split,
00:54:43.880 test or sorry, dev or validation split and test split. And typically, this would be, say,
00:54:53.480 80% of your data set. This could be 10% and this 10% roughly.
00:54:58.280 So you have these three splits of the data. Now, this 80% of your data set,
00:55:04.120 the training set, is used to optimize the parameters of the model, just like we're doing here,
00:55:08.600 using gradient descent. These 10% of the examples, the dev or validation split,
00:55:14.680 they're used for development over all the hyper parameters of your model.
00:55:18.920 So hyper parameters are, for example, the size of this hidden layer, the size of the embedding.
00:55:24.120 So this is 100 or a two for us, but we could try different things. The strength of the
00:55:28.600 realization, which we aren't using yet so far. So there's lots of different hyper parameters and
00:55:33.800 settings that go into defining your neural net. And you can try many different variations of them
00:55:38.600 and see whichever one works best on your validation split. So this is used to train the parameters.
00:55:45.160 This is used to train the hyperparameters, and the test split is used to evaluate basically the
00:55:52.200 performance of the model at the end. So we're only evaluating the loss on the test split very,
00:55:56.600 very sparingly and very few times, because every single time you evaluate your test loss and you
00:56:02.280 learn something from it, you are basically starting to also train on the test split. So you are
00:56:08.760 only allowed to test the loss on the test set very, very few times, otherwise you risk overfitting to
00:56:16.520 it as well as you experiment on your model. So let's also split up our training data into train,
00:56:23.000 dev, and test. And then we are going to train on train and only evaluate on test very, very
00:56:28.280 sparingly. Okay, so here we go. Here is where we took all the words and put them into x and y tensors.
00:56:35.320 So instead, let me create a new cell here. And let me just copy paste some code here,
00:56:40.840 because I don't think it's that complex, but we're going to try to save a little bit of time.
00:56:46.760 I'm converting this to be a function now. And this function takes some list of words and builds
00:56:52.840 the arrays x and y for those words only. And then here, I am shuffling up all the words. So
00:57:00.280 these are the input words that we get. We are randomly shuffling them all up. And then
00:57:06.680 we're going to set n1 to be the index at 80% of the words and n2 to be the index at 90%
00:57:14.600 of the words. So basically, if len(words) is about 30,000, then, well, I should
00:57:22.280 probably run this: n1 is 25,000 and n2 is 28,000. And so here we see that I'm calling
00:57:30.600 build_dataset to build a training set X and Y by indexing into the words up to n1. So we're going to have only 25,000
00:57:38.120 training words. And then we're going to have roughly n2 minus n1, about 3,000 validation
00:57:47.960 examples, or dev examples. And we're going to have len(words) minus n2,
00:57:57.320 or 3,204 examples here for the test set. So now we have Xs and Ys for all those three splits.
00:58:08.440 Oh, yeah, I'm printing their size here inside the function as well.
00:58:15.480 But here we don't have words, but these are already the individual examples made from those words.
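For reference, here is a minimal sketch of that splitting code (build_dataset, block_size, and the stoi character-to-integer mapping follow the notebook; treat the exact details, including the shuffle seed, as an approximation):

    import random
    import torch

    block_size = 3  # context length: how many characters we take to predict the next one

    def build_dataset(words):
        X, Y = [], []
        for w in words:
            context = [0] * block_size
            for ch in w + '.':
                ix = stoi[ch]                     # stoi maps characters to integers, defined earlier
                X.append(context)
                Y.append(ix)
                context = context[1:] + [ix]      # crop the context and append the new character
        return torch.tensor(X), torch.tensor(Y)

    random.seed(42)                               # the exact seed is an assumption
    random.shuffle(words)
    n1 = int(0.8 * len(words))                    # end of the training split
    n2 = int(0.9 * len(words))                    # end of the dev split

    Xtr,  Ytr  = build_dataset(words[:n1])        # ~80% training split
    Xdev, Ydev = build_dataset(words[n1:n2])      # ~10% dev / validation split
    Xte,  Yte  = build_dataset(words[n2:])        # ~10% test split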
00:58:25.240 So let's now scroll down here. And the data set now for training is more like this. And then when we
00:58:34.360 reset the network, when we're training, we're only going to be training using Xtr
00:58:43.800 and Ytr. So that's the only thing we're training on.
00:58:49.480 Let's see where we are on a single batch. Let's now train maybe a few more steps.
00:59:04.200 Training neural networks can take a while. Usually you don't do it inline; you launch a bunch of
00:59:11.880 jobs and you wait for them to finish, which can take multiple days and so on. Luckily, this is a
00:59:17.320 very small network. Okay, so the loss is pretty good. Oh, we accidentally used a learning rate
00:59:25.720 that is way too low. So let me actually come back. We used the decayed learning rate of 0.01.
00:59:32.440 So this will train much faster. And then here, when we evaluate, let's use the dev set,
00:59:41.720 Xdev and Ydev, to evaluate the loss. Okay, and let's now decay the learning rate and only do, say, 10,000 steps.
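That evaluation is just a forward pass over the whole dev split, something like this (C, W1, b1, W2, b2 and F.cross_entropy are the names from the cells above):

    emb = C[Xdev]                                  # (N_dev, 3, 2)
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1)      # hidden activations for every dev example
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Ydev)
    print(loss.item())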
00:59:52.600 And let's evaluate the dev loss once here. Okay, so we're getting about 2.3 on dev. And so the
01:00:01.880 neural network, while it was training, did not see these dev examples. It hasn't optimized on them.
01:00:06.920 And yet, when we evaluate the loss on this dev set, we actually get a pretty decent loss. And so
01:00:13.080 we can also look at what the loss is on all of the training set. Oops. And so we see that the training
01:00:22.040 and the dev loss are about equal. So we're not overfitting. This model is not powerful enough
01:00:28.200 to just be purely memorizing the data. And so far, we are what's called underfitting,
01:00:33.800 because the training loss and the dev or test losses are roughly equal. So what that typically
01:00:38.840 means is that our network is very tiny, very small. And we expect to make performance improvements
01:00:45.320 by scaling up the size of this neural net. So let's do that now. So let's come over here.
01:00:49.640 And let's increase the size of the neural net. The easiest way to do this is we can come here
01:00:54.440 to the hidden layer, which currently has 100 neurons. And let's just bump this up. So let's do 300
01:00:59.000 neurons. And then this is also 300 biases. And here we have 300 inputs into the final layer.
01:01:06.120 So let's initialize our neural net. We now have 10,000 parameters instead of 3,000 parameters.
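As a rough sketch, the re-initialization with the bigger hidden layer might look like this (2-dimensional embeddings and a 3-character context give 6 inputs into the hidden layer; the seeded generator is an assumption carried over from the earlier cells):

    g = torch.Generator().manual_seed(2147483647)   # for reproducibility; seed is an assumption
    C  = torch.randn((27, 2),   generator=g)        # character embeddings
    W1 = torch.randn((6, 300),  generator=g)        # 3 characters * 2 dims = 6 inputs, 300 hidden neurons
    b1 = torch.randn(300,       generator=g)
    W2 = torch.randn((300, 27), generator=g)        # 300 hidden units into 27 output logits
    b2 = torch.randn(27,        generator=g)
    parameters = [C, W1, b1, W2, b2]
    print(sum(p.nelement() for p in parameters))    # roughly 10,000 parameters
    for p in parameters:
        p.requires_grad = True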
01:01:14.040 And then we're not using this. And then here, what I'd like to do is I'd like to actually
01:01:20.840 keep track of the steps. Okay, let's just do this. Let's keep stats again. And here, when we're keeping
01:01:32.040 track of the loss, let's just also keep track of the steps. And let's just have an i here.
01:01:39.880 And let's train for, say, 30,000 steps. And we are at a learning rate of 0.1.
01:01:49.640 And we should run this and optimize the neural net. And then here, basically, I want to plt.plot
01:02:00.280 the steps against the loss. So these are the x's and the y's. And this is the loss function
01:02:14.680 and how it's being optimized. Now, you see that there's quite a bit of thickness to this. And that's
01:02:19.480 because we are optimizing over these mini batches. And the mini batches create a little bit of noise
01:02:23.880 in this. Where are we on the dev set? We are at 2.5. So we still haven't optimized this
01:02:30.600 neural net very well. And that's probably because we made it bigger; it might take longer for this
01:02:34.840 neural net to converge. And so let's continue training. Yeah, let's just continue training.
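For reference, the bookkeeping and plotting described above look roughly like this (the loop body mirrors the training code from earlier; torch, F, plt, the parameters, and Xtr/Ytr are assumed from previous cells):

    stepi, lossi = [], []

    for i in range(30000):
        # minibatch construct
        ix = torch.randint(0, Xtr.shape[0], (32,))

        # forward pass
        emb = C[Xtr[ix]]                              # (32, 3, 2)
        h = torch.tanh(emb.view(-1, 6) @ W1 + b1)     # (32, 300)
        logits = h @ W2 + b2                          # (32, 27)
        loss = F.cross_entropy(logits, Ytr[ix])

        # backward pass
        for p in parameters:
            p.grad = None
        loss.backward()

        # update at learning rate 0.1
        for p in parameters:
            p.data += -0.1 * p.grad

        # track stats
        stepi.append(i)
        lossi.append(loss.item())

    plt.plot(stepi, lossi)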
01:02:44.600 One possibility is that the batch size is so low that we just have way too much noise in the training.
01:02:52.120 And we may want to increase the batch size so that we have a bit more correct gradient. And we're
01:02:57.320 not thrashing too much. And we can actually like optimize more properly.
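If we wanted to try that, the only change is in the minibatch construction; for example (64 here is just an arbitrary larger choice, not a value from the lecture):

    ix = torch.randint(0, Xtr.shape[0], (64,))   # draw 64 examples per minibatch instead of 32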
01:03:01.400 Okay, this will now become meaningless because we've re-initialized these. So,
01:03:13.000 yeah, this does not look pleasing right now. The problem is, look, there's a tiny improvement, but it's so
01:03:18.280 hard to tell. Let's go again. 2.52. Let's try to decrease the learning rate by a factor of 2.
01:04:05.640 We basically expect to see a lower loss than what we had before, because now we have a much,
01:04:09.320 much bigger model. And we were underfitting. So we'd expect that increasing the size of the model
01:04:13.960 should help the neural net. 2.32. Okay, so that's not happening too well. Now, one other concern is
01:04:20.760 that even though we've made the tanh layer here, or the hidden layer, much, much bigger, it could be
01:04:25.800 that the bottleneck of the network right now are these embeddings that are two-dimensional.
01:04:30.200 It can be that we're just cramming way too many characters into just two dimensions,
01:04:33.960 and the neural net is not able to really use that space effectively. And that is sort of like
01:04:39.160 the bottleneck to our network's performance. Okay, 2.23. So just by decreasing the learning rate,
01:04:45.560 I was able to make quite a bit of progress. Let's run this one more time. And then evaluate the
01:04:52.520 training and the dev loss. Now, one more thing after training that I'd like to do is I'd like to
01:04:59.640 visualize the embedding vectors for these characters before we scale up the embedding size from 2,
01:05:09.080 because we'd like to make this bottleneck potentially go away. But once I make this
01:05:14.520 greater than 2, we won't be able to visualize them. So here, okay, we're at 2.23 and 2.24.
01:05:20.840 So we're not improving much more. And maybe the bottleneck now is the character embedding size,
01:05:26.520 which is two. So here I have a bunch of code that will create a figure. And then we're going to
01:05:32.200 visualize the embeddings that were trained by the neural net on these characters, because right
01:05:38.360 now the embedding size is just two. So we can visualize all the characters with the x and the y
01:05:42.680 coordinates as the two embedding locations for each of these characters. And so here are the
01:05:49.000 x coordinates and the y coordinates, which are the columns of C. And then for each one, I also
01:05:54.920 include the text of the little character. So here what we see is actually kind of interesting.
01:06:00.280 The network has basically learned to separate out the characters and cluster them a little bit.
01:06:07.560 So for example, you see how the vowels, A, E, I, O, U are clustered up here. So what that's
01:06:13.240 telling us is that the neural net treats these as very similar, right? Because when they feed into
01:06:17.320 the neural net, the embedding for all these characters is very similar. And so the neural
01:06:23.000 net thinks that they're very similar and kind of like interchangeable, and that makes sense.
01:06:26.520 Then the points that are like really far away are for example, Q. Q is kind of treated as an
01:06:33.960 exception. And Q has a very special embedding vector, so to speak. Similarly, dot, which is a
01:06:40.120 special character is all the way out here. And a lot of the other letters are sort of clustered up
01:06:45.400 here. And so it's kind of interesting that there's a little bit of structure here,
01:06:49.000 after the training. And it's definitely not random. And these embeddings make sense.
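For reference, a sketch of the plotting code described above (assuming matplotlib imported as plt and the itos integer-to-character mapping from earlier in the notebook):

    plt.figure(figsize=(8, 8))
    plt.scatter(C[:, 0].data, C[:, 1].data, s=200)
    for i in range(C.shape[0]):
        # put each character's string at its learned 2D embedding location
        plt.text(C[i, 0].item(), C[i, 1].item(), itos[i], ha="center", va="center", color="white")
    plt.grid(True)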
01:06:55.080 So we're now going to scale up the embedding size and won't be able to visualize it directly.
01:07:00.200 But we expect that because we're underfitting, and we made this layer much bigger and did not
01:07:06.440 sufficiently improve the loss, we're thinking that the constraint to better performance right
01:07:12.600 now could be these embedding vectors. So let's make them bigger. Okay, so let's scroll up here.
01:07:17.240 And now we don't have two dimensional embeddings. We are going to have, say, 10 dimensional embeddings
01:07:23.080 for each character. Then this layer will receive three times 10, so 30 inputs will go into
01:07:32.200 the hidden layer. Let's also make the layer a bit smaller. So instead of 300, let's just do 200
01:07:39.320 neurons in that hidden layer. So now the total number of parameters will be slightly bigger, at about 11,000.
01:07:45.720 And then here we have to be a bit careful because, okay, the learning rate, we set to 0.1.
01:07:53.160 Here we hard-coded in a six. And obviously, if you're working in production, you don't want
01:08:01.400 to be hard-coding magic numbers. But instead of six, this should now be 30.
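A sketch of just the lines that change for the 10-dimensional embeddings (the rest of the initialization and the generator g stay as before):

    C  = torch.randn((27, 10),  generator=g)   # embeddings are now 10-dimensional
    W1 = torch.randn((30, 200), generator=g)   # 3 characters * 10 dims = 30 inputs, 200 hidden neurons
    b1 = torch.randn(200,       generator=g)
    W2 = torch.randn((200, 27), generator=g)
    b2 = torch.randn(27,        generator=g)
    # ...and in the forward pass, emb.view(-1, 6) becomes emb.view(-1, 30)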
01:08:01.400 And let's run for 50,000 iterations. And let me split out the initialization here outside
01:08:09.720 so that when we run this multiple times, it's not going to wipe out our loss.
01:08:17.480 In addition to that, here, instead of logging the loss directly, let's actually log,
01:08:24.120 let's do log10 of the loss, I believe that's the function.
01:08:30.680 And I'll show you why in a second. Let's optimize this.
01:08:34.920 Basically, I'd like to plot the log loss instead of the loss, because when you plot the loss,
01:08:41.400 many times it can have this hockey stick appearance, and the log squashes it in. So it just kind of
01:08:48.040 looks nicer. So the x-axis is step i and the y-axis will be the loss i.
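Concretely, the only change to the bookkeeping is logging the log10 of the loss (stepi and lossi are the lists from before):

    stepi.append(i)
    lossi.append(loss.log10().item())   # log10 squashes the hockey-stick shape of the raw loss curve
    # and later, as before:
    plt.plot(stepi, lossi)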
01:08:53.400 And then here, this is 30. Ideally, we wouldn't be hard-coding these
01:09:08.760 because let's look at the loss. Okay, it's again very thick because the mini batch size is very small,
01:09:15.080 but the total loss on the training set is 2.3 and on the dev set it is 2.38 as well.
01:09:21.160 So so far so good. Let's try to now decrease the learning rate by a factor of 10
01:09:38.840 But again, we're just kind of like doing this very haphazardly. So I don't actually have confidence
01:09:47.720 that our learning rate is set very well, that our learning rate decay, which we just do at random,
01:09:53.560 is set very well. And so the optimization here is kind of suspect, to be honest.
01:09:58.760 And this is not how you would do it typically in production. In production, you would create
01:10:02.360 parameters or hyper parameters out of all these settings, and then you would run lots of experiments
01:10:07.080 and see whichever ones are working well for you. Okay, so we have 2.17 now and 2.2. Okay, so you see
01:10:17.000 how the training and the evaluation performance are starting to slowly depart. So maybe
01:10:23.800 we're getting the sense that the neural net is getting good enough, or that the number of parameters
01:10:29.560 are large enough that we are slowly starting to overfit. Let's maybe run one more iteration of this
01:10:36.600 and see where we get. But yeah, basically, you would be running lots of experiments,
01:10:44.360 and then you are slowly scrutinizing whichever ones give you the best dev performance.
01:10:48.360 And then once you find all the hyperparameters that make your dev performance good,
01:10:53.000 you take that model and you evaluate the test set performance a single time.
01:10:56.760 And that's the number that you report in your paper or wherever else you want to talk about
01:11:00.840 and brag about your model. So let's then rerun the plot and rerun the train and dev evaluation.
01:11:08.920 And because we're getting lower loss now, it is the case that the embedding size of these was
01:11:15.720 holding us back very likely. Okay, so 2.16, 2.19 is what we're roughly getting. So there's many ways
01:11:25.320 to go from here. We can continue tuning the optimization. We can continue,
01:11:31.320 for example, playing with the size of the neural net, or we can increase the number of words or
01:11:37.000 characters in our case that we are taking as an input. So instead of just three characters,
01:11:40.600 we could be taking more characters than as an input. And that could further improve the loss.
01:11:45.720 Okay, so I changed the code slightly. So we have here 200,000 steps of the optimization.
01:11:51.560 And in the first 100,000, we're using a learning rate of 0.1. And then in the next 100,000, we're
01:11:56.040 using a learning rate of 0.01. This is the loss that I achieve. And these are the performance on
01:12:01.800 the training and validation loss. And in particular, the best validation loss I've been able to obtain
01:12:06.840 in the last 30 minutes or so is 2.17. So now I invite you to beat this number. And you have quite a
01:12:13.400 few knobs available to you to I think surpass this number. So number one, you can of course change the
01:12:18.840 number of neurons in the hidden layer of this model. You can change the dimensionality of the
01:12:23.320 embedding lookup table. You can change the number of characters that are feeding in as an input,
01:12:28.600 as the context into this model. And then of course, you can change the details of the optimization.
01:12:35.000 How long are we running? What is the learning rate? How does it change over time? How does it decay?
01:12:40.120 You can change the batch size and you may be able to actually achieve a much better
01:12:44.840 convergence speed in terms of how many seconds or minutes it takes to train the model and
01:12:50.600 get your result in terms of really good loss. And then of course, I actually invite you to
01:12:57.400 read this paper. It is 19 pages, but at this point, you should actually be able to read a good chunk
01:13:02.280 of this paper and understand pretty good chunks of it. And this paper also has quite a few ideas
01:13:08.440 for improvements that you can play with. So all of those are knobs available to you,
01:13:13.080 and you should be able to beat this number. I'm leaving that as an exercise to the reader.
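For the stepped learning-rate schedule mentioned a moment ago (0.1 for the first 100,000 steps, then 0.01), the relevant lines inside the training loop look something like this, with i as the step index:

    # 200,000 total steps: learning rate 0.1 for the first 100,000, then 0.01 for the rest
    lr = 0.1 if i < 100000 else 0.01
    for p in parameters:
        p.data += -lr * p.grad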
01:13:17.400 And that's it for now. And I'll see you next time.
01:13:19.640 Before we wrap up, I also wanted to show how you would sample from the model. So we're going to
01:13:29.160 generate 20 samples. At first, we begin with all dots. So that's the context. And then until we
01:13:36.360 generate the 0th character, the '.' end token, again, we're going to embed the current context using the embedding
01:13:44.440 table C. Now, usually here, the first dimension was the size of the training set. But here,
01:13:50.840 we're only working with a single example that we're generating. So this is just dimension one,
01:13:55.400 just for simplicity. And so this embedding then gets projected into the hidden state, and
01:14:02.120 you get the logits. Now we calculate the probabilities. For that, you can use F.softmax
01:14:07.400 of the logits. And that just basically exponentiates the logits and makes them sum to one.
01:14:12.920 And similarly to cross entropy, it is careful that there are no overflows.
01:14:17.480 Once we have the probabilities, we sample from them using torch.multinomial to get our next index.
01:14:23.640 And then we shift the context window to append the index and record it. And then we can just
01:14:29.720 decode all the integers to strings and print them out. And so these are some example samples.
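The sampling loop just described looks roughly like this (block_size, C, W1, b1, W2, b2, itos and F are assumed from the cells above; the seeded generator is an assumption):

    g = torch.Generator().manual_seed(2147483647 + 10)

    for _ in range(20):
        out = []
        context = [0] * block_size                           # start with all '.' tokens
        while True:
            emb = C[torch.tensor([context])]                 # (1, block_size, d)
            h = torch.tanh(emb.view(1, -1) @ W1 + b1)
            logits = h @ W2 + b2
            probs = F.softmax(logits, dim=1)                 # exponentiate and normalize, safely
            ix = torch.multinomial(probs, num_samples=1, generator=g).item()
            context = context[1:] + [ix]                     # shift the context window
            out.append(ix)
            if ix == 0:                                      # sampling '.' ends the name
                break
        print(''.join(itos[i] for i in out))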
01:14:35.000 And you can see that the model now works much better. So the words here are much more word-
01:14:39.880 like, or name-like. So we have things like 'ham', 'joes', 'lilla'. You know, it's starting to sound a
01:14:49.320 little bit more name-like. So we're definitely making progress. But we can still improve on this
01:14:53.640 model quite a lot. Okay, sorry, there's some bonus content. I wanted to mention that I want
01:14:59.160 to make these notebooks more accessible. And so I don't want you to have to like
01:15:03.080 install Jupyter notebooks and torch and everything else. So I will be sharing a link to a Google
01:15:07.720 Colab. And Google Colab will look like a notebook in your browser. And you can just go to the
01:15:13.640 URL, and you'll be able to execute all of the code that you saw in the Google Colab. And so this
01:15:19.720 is me executing the code in this lecture. And I shortened it a little bit. But basically,
01:15:24.760 you're able to train the exact same network, and then plot and sample from the model. And everything
01:15:29.560 is ready for you to like tinker with the numbers right there in your browser, no installation necessary.
01:15:34.120 So I just wanted to point that out. And the link to this will be in the video description.