00:00:02.000 And next up, what I'd like to do is I'd like to build out MakeMore.
00:00:06.000 Like Micrograd before it, MakeMore is a repository that I have on my GitHub webpage.
00:00:11.000 You can look at it. But just like with Micrograd, I'm going to build it out step by step
00:00:16.000 and I'm going to spell everything out. So we're going to build it out slowly and together.
00:00:22.000 MakeMore, as the name suggests, makes more of things that you give it.
00:00:27.000 So here's an example. names.txt is an example dataset for MakeMore.
00:00:32.000 And when you look at names.txt, you'll find that it's a very large dataset of names.
00:00:38.000 So here's lots of different types of names. In fact, I believe there are 32,000 names
00:00:44.000 that I've sort of found randomly on the government website.
00:00:47.000 And if you train MakeMore on this dataset, it will learn to make more of things like this.
00:00:54.000 And in particular, in this case, that will mean more things that sound name-like,
00:01:00.000 but are actually unique names. And maybe if you have a baby and you're trying to assign a name,
00:01:05.000 maybe you're looking for a cool new sounding unique name, MakeMore might help you.
00:01:09.000 So here are some example generations from the neural network once we train it on our dataset.
00:01:16.000 So here's some example unique names that it will generate.
00:01:25.000 And so all these sort of sound name-like, but they're not, of course, names.
00:01:30.000 So under the hood, MakeMore is a character-level language model.
00:01:34.000 So what that means is that it is treating every single line here as an example.
00:01:39.000 And within each example, it's treating them all as sequences of individual characters.
00:01:45.000 So R-E-E-S-E is this example. And that's the sequence of characters.
00:01:51.000 And that's the level in which we are building out MakeMore.
00:01:54.000 And what it means to be a character-level language model then is that it's just sort of modeling those sequences of characters
00:02:00.000 and it knows how to predict the next character in the sequence.
00:02:03.000 Now, we're actually going to implement a large number of character-level language models
00:02:08.000 in terms of the neural networks that are involved in predicting the next character in a sequence.
00:01:12.000 So very simple bigram and bag-of-words models, multilayer perceptrons,
00:01:17.000 recurrent neural networks, all the way to modern transformers.
00:02:21.000 In fact, a transformer that we will build will be basically the equivalent transformer to GPT-2, if you have heard of GPT.
00:02:28.000 So that's kind of a big deal. It's a modern network.
00:02:31.000 And by the end of the series, you will actually understand how that works on the level of characters.
00:02:36.000 Now, to give you a sense of the extensions here, after characters,
00:02:40.000 we will probably spend some time on the word level so that we can generate documents of words,
00:02:44.000 not just little segments of characters, but we can generate entire large, much larger documents.
00:02:50.000 And then we're probably going to go into images and image-text networks, such as DALL-E, Stable Diffusion, and so on.
00:02:57.000 But for now, we have to start here, character-level language modeling. Let's go.
00:03:03.000 So like before, we are starting with a completely blank Jupyter notebook page.
00:03:07.000 The first thing is I would like to basically load up the dataset, names.txt.
00:03:11.000 So we're going to open up names.txt for reading.
00:03:15.000 And we're going to read in everything into a massive string.
00:03:19.000 And then, because it's a massive string, we only want the individual words, and we want to put them in a list.
00:03:24.000 So let's call split lines on that string to get all of our words as a Python list of strings.
00:03:31.000 So basically, we can look at, for example, the first 10 words.
00:03:35.000 And we have that it's a list of Emma, Olivia, Ava, and so on.
00:03:41.000 And if we look at the top of the page here, that is indeed what we see.
00:03:49.000 This list actually makes me feel that this is probably sorted by frequency.
00:03:54.000 But, okay, so these are the words. Now, we'd like to actually learn a little bit more about this dataset.
00:04:02.000 Let's look at the total number of words. We expect this to be roughly 32,000.
00:04:06.000 And then let's look at, for example, the shortest word: so min of len of w for w in words.
00:04:14.000 So the shortest word will be length two, and max of len of w for w in words gives us the longest word.
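(For reference, the loading and inspection steps just described look roughly like this in code; a minimal sketch, assuming names.txt sits next to the notebook.)

    words = open('names.txt', 'r').read().splitlines()  # one name per line -> a Python list of strings
    print(words[:10])                      # the first 10 names
    print(len(words))                      # roughly 32,000 names
    print(min(len(w) for w in words))      # length of the shortest name
    print(max(len(w) for w in words))      # length of the longest name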
00:04:25.000 So let's now think through our very first language model.
00:04:27.000 As I mentioned, a character level language model is predicting the next character in a sequence,
00:04:32.000 given already some concrete sequence of characters before it.
00:04:36.000 Now, what we have to realize here is that every single word here, like isabella, is actually quite a few examples packed into that single word.
00:04:45.000 Because what is the existence of a word like isabella in the dataset really telling us?
00:04:50.000 It's saying that the character i is a very likely character to come first in the sequence of a name.
00:05:07.000 The character b is very likely to come after 'isa'.
00:05:10.000 And so on, all the way to 'a' following 'isabell', and then there's one more example actually packed in here.
00:05:17.000 And that is that after 'isabella', the word is very likely to end.
00:05:23.000 So that's one more sort of explicit piece of information that we have here, that we have to be careful with.
00:05:29.000 And so there's a lot packed into a single individual word in terms of the statistical structure of what's likely to follow in these character sequences.
00:05:38.000 And then of course we don't have just an individual word.
00:05:40.000 We actually have 32,000 of these, and so there's a lot of structure here to model.
00:05:44.000 Now in the beginning, what I'd like to start with is I'd like to start with building a bigram language model.
00:05:51.000 Now in a bigram language model, we're always working with just two characters at a time.
00:05:56.000 So we're only looking at one character that we are given, and we're trying to predict the next character in the sequence.
00:06:03.000 So what characters are likely to follow r, what characters are likely to follow a, and so on.
00:06:09.000 And we're just modeling that kind of a little local structure.
00:06:13.000 And we're forgetting the fact that we may have a lot more information.
00:06:17.000 We're always just looking at the previous character to predict the next one.
00:06:20.000 So it's a very simple and weak language model, but I think it's a great place to start.
00:06:24.000 So now let's begin by looking at these bigrams in our dataset and what they look like.
00:06:28.000 And these bigrams again are just two characters in a row.
00:06:31.000 So: for w in words. Each w here is an individual word, a string.
00:06:36.000 We want to iterate over this word with consecutive characters.
00:06:43.000 So two characters at a time sliding it through the word.
00:06:46.000 Now, an interesting nice way, cute way to do this in Python, by the way, is doing something like this.
00:06:52.000 For ch1, ch2 in zip of w and w[1:].
00:07:07.000 And I'm going to show you in a second how this works.
00:07:10.000 But for now, basically as an example, let's just do the very first word alone, Emma.
00:07:18.000 And this will just print e m, m m, m a.
00:07:21.000 And the reason this works is because w is the string 'emma'.
00:07:25.000 w[1:] is the string 'mma', and zip takes two iterators and pairs them up and then creates an iterator over the tuples of their consecutive entries.
00:07:37.000 And if any one of these lists is shorter than the other, then it will just halt and return.
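(A small sketch of the zip trick being described, using the first word as the example.)

    w = 'emma'
    for ch1, ch2 in zip(w, w[1:]):   # w[1:] is 'mma'; zip stops at the shorter iterator
        print(ch1, ch2)              # prints: e m, then m m, then m a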
00:07:43.640 So basically, that's why we return (e, m), (m, m), (m, a), but then because this iterator, the second one
00:07:52.280 here, runs out of elements, zip just ends. And that's why we only get these tuples. So pretty
00:07:58.200 cute. So these are the consecutive pairs of elements in the first word. Now we have to be careful because
00:08:04.120 we actually have more information here than just these three examples. As I mentioned,
00:08:08.920 we know that e is very likely to come first. And we know that a in this case is coming
00:08:14.440 last. So what we're going to do is basically, we're going to create a special list here of
00:08:20.280 characters. And we're going to hallucinate a special start token here. I'm going to
00:08:28.920 call it <S>, for special start. So this is a list of one element, plus w, and then plus
00:08:38.760 a special end character, <E>. And the reason I'm wrapping w in a list here is because w is a string,
00:08:45.000 'emma', and list of w will just give us the individual characters in a list. And then doing this again
00:08:52.520 now, but not iterating over w but over these characters, will give us something like this.
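(A sketch of the loop with the hallucinated start and end tokens; the list name chs is an assumption for illustration.)

    for w in words[:1]:                       # just the first word, 'emma'
        chs = ['<S>'] + list(w) + ['<E>']     # wrap the word with the special start/end tokens
        for ch1, ch2 in zip(chs, chs[1:]):
            print(ch1, ch2)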
00:08:59.080 So this is a bigram of the start character and e. And this is a bigram of
00:09:06.360 a and the special end character. And now we can look at, for example, what this looks like for
00:09:11.160 Olivia or Ava. And indeed, we can actually potentially do this for the entire dataset,
00:09:18.040 but we won't print that. It's going to be too much. But these are the individual
00:09:22.120 character bi-grams, and we can print them. Now, in order to learn the statistics about which
00:09:27.000 characters are likely to follow other characters, the simplest way in the bigram language model
00:09:32.440 is to simply do it by counting. So we're basically just going to count how often any one of these
00:09:37.800 combinations occurs in the training set in these words. So we're going to need some kind of a
00:09:43.160 dictionary that's going to maintain some counts for every one of these bigrams. So let's use
00:09:47.960 a dictionary B. And this will map these bigrams. So a bigram is a tuple of character one and character
00:09:54.840 two. And then B at bigram will be B dot get of bigram, which is basically the same as B at
00:10:03.160 bigram, but in the case that bigram is not in the dictionary B, we would like to
00:10:09.240 by default return zero; and then plus one. So this will basically add up all the bigrams and count how
00:10:16.840 often they occur. Let's get rid of printing. Or rather, let's keep the printing and let's just
00:10:24.040 inspect what B is in this case. And we see that many bi-grams occur just a single time. This
00:10:30.440 one allegedly occurred three times. So a was an ending character three times. And that's true for
00:10:36.440 all of these words: all of Emma, Olivia and Ava end with a. So that's why this occurred three times.
00:10:46.600 Now let's do it for all the words. Oops, I should not have printed it. I'm going to erase that.
00:10:56.040 Let's kill this. Let's just run. And now B will have the statistics of the entire data set.
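(The counting just described, as a sketch over the full dataset.)

    b = {}                                    # dictionary mapping bigram tuples to counts
    for w in words:
        chs = ['<S>'] + list(w) + ['<E>']
        for ch1, ch2 in zip(chs, chs[1:]):
            bigram = (ch1, ch2)
            b[bigram] = b.get(bigram, 0) + 1  # default to 0 if this bigram was not seen yet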
00:11:03.160 So these are the counts across all the words of the individual bi-grams. And we could, for example,
00:11:09.480 look at some of the most common ones and least common ones. There are a few ways to do this in Python, but the
00:11:14.920 way to do this, the simplest way I like, is we just use B dot items. B dot items returns the
00:11:21.960 tuples of key value. In this case, the keys are the character bi-grams and the values are the counts.
00:11:30.040 And so then what we want to do is we want to do a sorted of this. But by default, sort is on the first
00:11:43.480 item of a tuple, and we want to sort by the values, which are the second element of each
00:11:48.440 key-value tuple. So we want to use key equals a lambda that takes the key-value pair
00:11:56.440 and returns the key-value pair at one, not at zero but at one, which is the count. So we want to sort
00:12:05.000 by the count of these elements. And actually we want to go backwards. So here we have that the
00:12:14.760 bi-gram Q and R occurs only a single time. DZ occurred only a single time. And when we sort
00:12:21.240 this the other way around, we're going to see the most likely bi-grams. So we see that N was
00:12:27.800 very often an ending character, many, many times. And apparently N almost always follows an A.
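(The sorting being described can be written roughly like this.)

    # sort the (bigram, count) pairs by count; negating the count puts the most common first
    sorted(b.items(), key=lambda kv: -kv[1])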
00:12:34.360 And that's a very likely combination as well. So this is kind of the individual counts that we
00:12:42.280 achieve over the entire dataset. Now it's actually going to be significantly more convenient for us
00:12:47.640 to keep this information in a two dimensional array instead of a Python dictionary. So we're
00:12:54.680 going to store this information in a 2D array. And the rows are going to be the first character
00:13:01.960 of the bi-gram. And the columns are going to be the second character. And each entry in the
00:13:06.040 two-dimensional array will tell us how often the second character follows the first character
00:13:10.520 in the dataset. So in particular, the array representation that we're going to use, or the library,
00:13:17.160 is that of PyTorch. And PyTorch is a deep learning neural network framework. But part of it is also
00:13:23.400 this torch.tensor, which allows us to create multi-dimensional arrays and manipulate them very
00:13:28.280 efficiently. So let's import PyTorch, which you can do by import torch. And then we can create
00:13:35.720 arrays. So let's create an array of zeros. And we give it the size of this array. Let's create a
00:13:44.360 three by five array as an example. And this is a three by five array of zeros. And by default,
00:13:52.360 you'll notice a dot dtype, which is short for data type, is float32. So these are single
00:13:57.400 precision floating point numbers. Because we are going to represent counts, let's actually use
00:14:02.680 dtype as torch dot int32. So these are 32-bit integers. So now you see that we have integer
00:14:12.280 data inside this tensor. Now tensors allow us to really manipulate all the individual entries and
00:14:19.080 do it very efficiently. So for example, if we want to change this bit, we have to index into the
00:14:24.840 tensor. And in particular here, because it's zero-indexed, this is
00:14:33.640 row index one and column index three (counting zero, one, two, three). So at one comma three, we can set that to one. And then a
00:14:44.600 will have a one over there. We can of course also do things like this, so now a will be two over there,
00:14:52.440 or three. And also we can, for example, say, a zero zero is five. And then a will have a five over here.
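(A quick sketch of the tensor creation and indexing just walked through.)

    import torch
    a = torch.zeros((3, 5), dtype=torch.int32)  # a 3 by 5 array of 32-bit integer zeros
    a[1, 3] = 1      # row index 1, column index 3
    a[1, 3] += 1     # now 2 at that position
    a[0, 0] = 5
    print(a)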
00:14:59.320 So that's how we can index into the arrays. Now, of course, the array that we are interested in
00:15:04.840 is much, much bigger. So for our purposes, we have 26 letters of the alphabet. And then we have two
00:15:10.840 special characters, <S> and <E>. So we want a 26 plus two, or 28 by 28, array. And let's call it capital
00:15:20.440 n, because it's going to represent sort of the counts. Let me erase this stuff. So that's the
00:15:27.240 array that starts at zeros 28 by 28. And now let's copy paste this here. But instead of having a
00:15:35.400 dictionary B, which we're going to erase, we have an n. Now the problem here is that we have these
00:15:42.600 characters, which are strings, but we have to now basically index into an array. And we have to
00:15:49.800 index using integers. So we need some kind of a lookup table from characters to integers. So
00:15:55.400 let's construct such a character array. And the way we're going to do this is we're going to take
00:16:00.120 all the words, which is a list of strings, we're going to concatenate all of it into a massive
00:16:05.000 string. So this is just simply the entire data set as a single string. We're going to pass this
00:16:10.120 to the set constructor, which takes this massive string and throws out duplicates, because sets
00:16:17.000 do not allow duplicates. So set of this will just be the set of all the lowercase characters.
00:16:22.760 And there should be a total of 26 of them. And now we actually don't want to set we want a list.
00:16:30.600 But we don't want a list sorted in some weird arbitrary way we want it to be sorted
00:16:36.600 from a to z. So sorted list. So those are our characters.
00:16:45.560 Now what we want is this lookup table, as I mentioned. So let's create a special stoi;
00:16:51.080 I'll call it that, where s is string or character and i is integer. And this will be an s-to-i mapping: s to i for
00:16:59.000 i, s in enumerate of these characters. So enumerate basically gives us this iterator over the integer
00:17:08.600 index and the actual element of the list. And then we are mapping the character to the integer.
00:17:15.160 So stoi is a mapping from a to zero, b to one, etc., all the way from z to 25.
00:17:21.480 And that's going to be useful here, but we actually also have to specifically set that s will be 26
00:17:29.000 and s to I at e will be 27, right, because z was 25. So those are the lookups. And now we can come
00:17:38.200 here and we can map both character one and character two to their integers. So ix1 will be
00:17:43.320 stoi of character one, and ix2 will be stoi of character two. And now we should be able to
00:17:50.440 do this line, but using our array. So N at ix1, ix2, which is the two-dimensional array indexing
00:17:59.240 I showed you before, and then just plus equals one, because everything starts at zero.
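(Putting the lookup table and the counting together, roughly; this assumes words and torch from the earlier snippets, and uses <S> and <E> before the later switch to a single dot token.)

    N = torch.zeros((28, 28), dtype=torch.int32)
    chars = sorted(list(set(''.join(words))))    # the 26 lowercase characters, sorted a to z
    stoi = {s: i for i, s in enumerate(chars)}   # a -> 0, b -> 1, ..., z -> 25
    stoi['<S>'] = 26
    stoi['<E>'] = 27
    for w in words:
        chs = ['<S>'] + list(w) + ['<E>']
        for ch1, ch2 in zip(chs, chs[1:]):
            ix1 = stoi[ch1]
            ix2 = stoi[ch2]
            N[ix1, ix2] += 1                     # count this bigram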
00:18:06.120 So this should work and give us a large 28 by 28 array of all these counts. So if we print N,
00:18:15.880 this is the array, but of course it looks ugly. So let's erase this ugly mess and let's try to
00:18:22.200 visualize it a bit nicer. So for that, we're going to use a library called matplotlib.
00:18:27.800 So matplotlib allows us to create figures. So we can do things like plt dot imshow of the count
00:18:34.040 array. So this is the 28 by 28 array, and it has some structure. But even this, I would say,
00:18:42.280 is still pretty ugly. So we're going to try to create a much nicer visualization of it. And I
00:18:47.240 wrote a bunch of code for that. The first thing we're going to need is we're going to need to
00:18:52.840 invert this array here, this dictionary. So stoi is a mapping from s to i. And in itos,
00:19:00.920 we're going to reverse this dictionary. So we iterate over all the items and just reverse that mapping.
00:19:05.880 So itos maps inversely: zero to a, one to b, etc. So we'll need that. And then here's the
00:19:15.400 code that I came up with to try to make this a little bit nicer. We create a figure, we plot
00:19:22.600 the array, and then we visualize a bunch of things on top of it. Let me just run it so you get a
00:19:28.360 sense of what it is. Okay. So you see here that we have the array spaced out. And every one of
00:19:38.360 these is basically like B follows G zero times B follows H 41 times. So a follows J 175 times.
00:19:47.160 And so what you can see that I'm doing here is first I show that entire array. And then I
00:19:53.240 iterate over all the individual little cells here. And I create a character string here,
00:19:59.160 which is the inverse mapping, itos of the integer i and itos of the integer j. So those are the
00:20:05.320 bigrams in a character representation. And then I plot just the bigram text. And then I plot the
00:20:13.160 number of times that this bigram occurs. Now the reason that there's a dot item here is because
00:20:19.160 when you index into these arrays, these are torch tensors, you see that we still get a tensor back.
00:20:25.240 So the type of this thing, you think it would be just an integer 149, but it's actually a torch
00:20:30.440 tensor. And so if you do dot item, then it will pop out that individual integer. So it'll just be
00:20:39.080 149. So that's what's happening there. And these are just some options to make it look nice.
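(The visualization code being described is roughly along these lines; a sketch, with the figure size and colors as assumptions.)

    import matplotlib.pyplot as plt
    itos = {i: s for s, i in stoi.items()}   # invert the string-to-integer mapping
    plt.figure(figsize=(16, 16))
    plt.imshow(N, cmap='Blues')
    for i in range(28):
        for j in range(28):
            chstr = itos[i] + itos[j]        # the bigram as a two-character string
            plt.text(j, i, chstr, ha='center', va='bottom', color='gray')
            plt.text(j, i, N[i, j].item(), ha='center', va='top', color='gray')
    plt.axis('off')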
00:20:45.240 So what is the structure of this array? We have all these counts. And we see that some of them occur
00:20:51.400 often, and some of them do not occur often. Now if you scrutinize this carefully, you will notice
00:20:56.360 that we're not actually being very clever. That's because when you come over here, you'll notice
00:21:01.000 that, for example, we have an entire row of completely zeros. And that's because the end character
00:21:06.280 is never possibly going to be the first character of a bigram, because we're always placing these
00:21:11.080 end tokens all at the end. Similarly, we have an entire column of zeros here, because the <S>
00:21:18.360 character will never possibly be the second element of a bigram, because we always start with <S> and
00:21:24.680 we end with <E>, and we only have the words in between. So we have an entire column of zeros, an entire
00:21:30.440 row of zeros, and in this little two by two matrix here as well, the only one that can possibly happen
00:21:35.960 is <E> directly following <S>; that can be nonzero if we have a word that has no letters. So in that
00:21:43.400 case, there's no letters in the word, it's an empty word, and we just have <S> followed by <E>. But the
00:21:47.880 other ones are just not possible. And so we're basically wasting space, and worse than that,
00:21:52.440 the <S> and the <E> are getting very crowded here. I was using these brackets because there's a
00:21:57.720 convention in natural language processing to use these kinds of brackets to denote special tokens.
00:22:03.160 But we're going to use something else. So let's fix all this and make it prettier. We're not
00:22:08.520 actually going to have two special tokens, we're only going to have one special token. So we're
00:22:13.880 going to have an N array of 27 by 27 instead; instead of having two special tokens, we will just have one,
00:22:21.480 and I will call it a dot. Okay, let me swing this over here. Now one more thing that I would like
00:22:31.560 to do is I would actually like to make this special character have position zero. And I would like
00:22:36.760 to offset all the other letters by one. I find that a little bit more pleasing. So we need a plus one
00:22:45.560 here so that the first character, which is a, will start at one. So in stoi, a now starts at one
00:22:53.160 and dot is zero. And itos, of course, we're not changing, because itos just creates the
00:23:00.120 reverse mapping and this will work fine. So one is a, two is b, and zero is dot. So we reverse that here,
00:23:08.200 and we get the dot at zero. This should work fine. We make sure N starts as zeros. And then here,
00:23:19.240 we don't go up to 28, we go up to 27. And this should just work. Okay. So we see that dot dot
00:23:32.760 never happened. It's at zero, because we don't have empty words. Then this row here now is just
00:23:38.680 very simply the counts for all the first letters. So j starts a word, h starts a word, I starts a
00:23:48.120 word, etc. And then these are all the ending characters. And in between, we have the structure
00:23:54.440 of what characters follow each other. So this is the counts array of our entire data set. So this
00:24:01.960 array actually has all the information necessary for us to actually sample from this bigram
00:24:06.840 character level language model. And roughly speaking, what we're going to do is we're just
00:24:12.520 going to start following these probabilities and these counts. And we're going to start sampling
00:24:16.760 from the model. So in the beginning, of course, we start with the dot, the start token. So
00:24:25.000 to sample the first character of a name, we're looking at this row here. So we see that we have
00:24:31.880 the counts, and those counts essentially are telling us how often any one of these characters is to
00:24:37.400 start a word. So if we take this n, and we grab the first row, we can do that by using just indexing
00:24:47.240 at zero, and then using this notation, colon, for the rest of that row. So N at zero, colon,
00:24:55.560 is indexing into the zeroth row, and then it's grabbing all the columns. And so this will give us a one
00:25:03.640 dimensional array of the first row: so zero, 4410,
00:25:10.360 1306, 1542, etc. It's just the first row. The shape of this is 27;
00:25:17.080 it's just a row of 27. And the other way that you can do this is you don't actually
00:25:22.680 give this colon; you just grab the zeroth row like this. This is equivalent. Now these are the counts.
00:25:29.880 And now what we'd like to do is we'd like to basically sample from this. Since these are
00:25:35.400 raw counts, we actually have to convert them to probabilities. So we create a probability vector.
00:25:41.400 So we'll take n of zero, and we'll actually convert this to float first. Okay, so these
00:25:50.600 integers are converted to float, floating point numbers. And the reason we're creating floats is
00:25:55.720 because we're about to normalize these counts. So to create a probability distribution here,
00:26:06.440 we want to divide; we basically want to do p equals p divided by p dot sum.
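(A sketch of grabbing the first row and turning it into probabilities; here N is the 27 by 27 counts array with the dot token at index 0.)

    p = N[0].float()   # counts for which character starts a word, as floating point numbers
    p = p / p.sum()    # normalize so the 27 probabilities sum to one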
00:26:06.440 And now we get a vector of smaller numbers. And these are now probabilities. So of course,
00:26:14.280 because we divided by the sum, the sum of p now is one. So this is a nice proper probability
00:26:20.440 distribution, it sums to one. And this is giving us the probability for any single character to be
00:26:24.920 the first character of a word. So now we can try to sample from this distribution. To sample from
00:26:31.320 these distributions, we're going to use torch dot multinomial, which I've pulled up here. So torch
00:26:36.600 dot multinomial returns samples from the multinomial probability distribution, which is a complicated
00:26:44.040 way of saying, you give me probabilities, and I will give you integers, which are sampled
00:26:48.840 according to the probability distribution. So this is the signature of the method. And to make everything
00:26:53.880 deterministic, we're going to use a generator object in pytorch. So this makes everything
00:27:00.120 deterministic. So when you run this on your computer, you're going to get the exact same
00:27:04.200 results that I'm getting here on my computer. So let me show you how this works.
00:27:08.680 Here's the deterministic way of creating a torch Generator object,
00:27:18.120 seeding it with some number that we can agree on. So that seeds a generator and gives us an object
00:27:23.960 g. And then we can pass that g to a function that creates random numbers here; torch dot rand
00:27:32.120 creates random numbers, three of them. And it's using this generator object as a source
00:27:37.960 of randomness. So without normalizing it, I can just print. This is sort of like numbers between
00:27:48.040 zero and one that are random, according to this thing. And whenever I run it again, I'm always
00:27:53.400 going to get the same result, because I keep using the same generator object, which I'm seeding here.
00:27:57.640 And then if I divide to normalize, I'm going to get a nice probability distribution of just three
00:28:05.880 elements. And then we can use torch dot multinomial to draw samples from it. So this is what that
00:28:11.800 looks like. torch dot multinomial will take the torch tensor of probability distributions.
00:28:19.480 Then we can ask for a number of samples, like say 20. Replacement equals true means that when we
00:28:27.160 draw an element, we will, we can draw it, and then we can put it back into the list of eligible indices
00:28:34.280 to draw again. And we have to specify replacement as true, because by default, for some reason,
00:28:40.200 it's false. And I think, you know, it's just something to be careful with. And the generator is
00:28:46.520 passed in here. So we're going to always get deterministic results, the same results. So if I run this,
00:28:52.200 we're going to get a bunch of samples from this distribution. Now, you'll notice here that the
00:28:58.840 probability for the first element in this tensor is 60%. So in these 20 samples, we'd expect 60% of
00:29:08.200 them to be zero. We'd expect 30% of them to be one. And because the element index two has only 10%
00:29:18.360 probability, very few of these samples should be two. And indeed, we only have a small number of
00:29:24.120 two's. And we can sample as many as we'd like. And the more we sample, the more these numbers
00:29:31.800 should roughly have the distribution here. So we should have lots of zeros, half as many
00:29:39.160 ones. And we should have three times as few, sorry, as few ones, and three times as few
00:29:50.440 two's. So you see that we have very few two's, we have some ones and most of them are zero.
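(The torch.multinomial demonstration described above looks roughly like this; the particular seed value is an arbitrary choice, any agreed-upon number works.)

    g = torch.Generator().manual_seed(2147483647)  # a deterministic source of randomness
    p = torch.rand(3, generator=g)                 # three random numbers in [0, 1)
    p = p / p.sum()                                # normalize into a probability distribution
    # draw 20 samples of the indices 0..2, with replacement, according to p
    ix = torch.multinomial(p, num_samples=20, replacement=True, generator=g)
    print(ix)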
00:29:55.640 So that's what torch dot multinomial is doing. For us here, we are interested in this row; we've created this
00:30:03.080 p here. And now we can sample from it. So if we use the same seed, and then we sample from this
00:30:14.440 distribution, let's just get one sample. Then we see that the sample is say 13. So this will be the
00:30:23.720 index. And you see how it's a tensor that wraps 13; we again have to use dot item to pop
00:30:30.840 out that integer. And now index would be just the number 13. And of course, we can
00:30:40.120 map itos of ix to figure out exactly which character we're sampling here: we're sampling
00:30:46.520 m. So we're saying that the first character in our generation is m. And just looking at the row here,
00:30:54.360 m was drawn. And we can see that m actually starts a large number of words; m started 2,500
00:31:02.520 words out of 32,000 words. So a bit less than 10% of the words start with m. So this was
00:31:09.480 actually a fairly likely character to draw. So that will be the first character of our word. And now
00:31:17.320 we can continue to sample more characters, because now we know that M started, M is already sampled.
00:31:23.960 So now to draw the next character, we will come back here, and we will look for the row
00:31:30.920 that starts with m. So you see m, and we have a row here. So we see that m-dot is
00:31:38.440 516, m-a is this many, m-b is this many, etc. So these are the counts for that row, and that's
00:31:45.880 the next character that we are going to now generate. So I think we are ready to actually just write
00:31:50.200 out the loop, because I think you're starting to get a sense of how this is going to go.
00:31:56.200 We always begin at index zero, because that's the start token. And then, while true,
00:32:03.480 we're going to grab the row corresponding to the index that we're currently on. So that's p:
00:32:10.280 that's the N array at ix, converted to float. That is our p.
00:32:19.080 Then we normalize this P to sum to one, accidentally ran the infinite loop. We normalize P to sum to
00:32:29.640 one. Then we need this generator object. We're going to initialize up here, and we're going to draw
00:32:36.360 a single sample from this distribution. And then this is going to tell us what index is going to be
00:32:44.280 next. If the index sampled is zero, then that's now the end token. So we will break.
00:32:53.560 Otherwise, we are going to print stoi of ix... sorry, itos of ix. And that's pretty much it. We're just
00:33:08.520 generating this word. Okay: 'mor'. So that's the name that we've sampled. We started with m,
00:33:16.360 the next letter was o, then r, and then dot. And this dot, we printed here as well. So let's now do
00:33:26.920 this a few times. So let's actually create an out list here. And instead of printing, we're going
00:33:38.440 to append. So out dot append of this character. And then here, let's just print it at the end. So
00:33:47.320 let's just join up all the outs, and we're just going to print 'mor' again. Okay. Now we're always getting
00:33:52.760 the same result because of the generator. So if we want to do this a few times, we can go for i
00:33:58.600 in range of 10; we can sample 10 names. And we can just do that 10 times. And these are the names
00:34:06.600 that we're getting out. Let's do 20. I'll be honest with you, this doesn't look right. So I stared at it
00:34:17.160 for a few minutes to convince myself that it actually is right. The reason these samples are so terrible
00:34:22.360 is that the bigram language model is actually just really terrible. We can generate a few more
00:34:28.760 here. And you can see that they're kind of name-like, a little bit, like Yannu,
00:34:33.800 Erile, etc. But they're just like totally messed up. And I mean, the reason that this is so bad,
00:34:40.680 like we're generating 'h' as a name. But you have to think through it from the model's perspective: it doesn't
00:34:46.920 know that this h is the very first h. All it knows is that h came previously. And now how likely is
00:34:53.320 H the last character? Well, it's somewhat likely. And so it just makes it last character. It doesn't
00:34:59.480 know that there were other things before it or there were not other things before it. And so
00:35:04.440 that's why it's generating all these nonsense names. Another way
00:35:09.320 to convince yourself that it is actually doing something reasonable, even though it's so terrible,
00:35:15.320 is this: these little p's here have 27 elements, right? 27 of them. So how about if we did something like this?
00:35:26.120 Instead of p having any structure whatsoever, how about if p was just torch dot ones
00:35:32.200 of 27? By default, this is a float32, so this is fine. Divided by 27.
00:35:40.920 So what I'm doing here is this is the uniform distribution, which will make everything equally
00:35:48.120 likely. And we can sample from that. So let's see if that does any better. Okay, so it's,
00:35:55.240 this is what you have from a model that is completely untrained, where everything is equally
00:35:59.640 likely. So it's obviously garbage. And then if we have a trained model, which is trained on just
00:36:05.640 bigrams, this is what we get. So you can see that it is more name-like; it is actually working.
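(The sampling loop just described, put together as a sketch; this assumes the 27 by 27 counts array N and the dot-based itos mapping from above, and the commented-out line is the untrained, uniform baseline.)

    g = torch.Generator().manual_seed(2147483647)  # arbitrary fixed seed for reproducibility
    for i in range(10):                  # sample 10 names
        out = []
        ix = 0                           # start at the dot (start) token
        while True:
            p = N[ix].float()
            p = p / p.sum()              # probabilities for the next character
            # p = torch.ones(27) / 27.0  # uniform baseline: a completely untrained model
            ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
            out.append(itos[ix])
            if ix == 0:                  # we sampled the end (dot) token
                break
        print(''.join(out))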
00:36:11.800 It's just that the bigram model is so terrible, and we have to do better. Now next, I would like to fix
00:36:17.880 an inefficiency that we have going on here. Because what we're doing here is we're always fetching a
00:36:23.000 row of n from the counts matrix up ahead. And then we're always doing the same things. We're
00:36:28.120 converting to float and we're dividing. And we're doing this every single iteration of this loop.
00:36:32.680 And we just keep renormalizing these rows over and over again, and it's extremely inefficient and
00:36:36.280 wasteful. So what I'd like to do is I'd like to actually prepare a matrix capital P that will
00:36:41.880 just have the probabilities in it. So in other words, it's going to be the same as the capital
00:36:46.280 and matrix here of counts. But every single row will have the row of probabilities that is normalized
00:36:52.200 to one, indicating the probability distribution for the next character, given the character before
00:36:57.320 it, as defined by which row we're in. So basically, what we'd like to do is we'd like to just do it
00:37:03.960 upfront here. And then we would like to just use that row here. So here, we would like to just do
00:37:09.880 p equals capital P at ix instead. Okay. The other reason I want to do this is not just for efficiency,
00:37:16.920 but also I would like us to practice these n-dimensional tensors. And I'd like us to practice
00:37:22.520 their manipulation, and especially something that's called broadcasting, which we'll go into in a second.
00:37:26.760 We're actually going to have to become very good at these tensor manipulations,
00:37:30.440 because we're going to build out all the way to transformers. We're going to be doing some
00:37:33.960 pretty complicated array operations for efficiency. And we need to really understand that and be
00:37:39.400 very good at it. So intuitively, what we want to do is we first want to grab the floating point
00:37:45.880 copy of N. And I'm mimicking the line here, basically. And then we want to divide all the rows
00:37:53.160 so that they sum to one. So we'd like to do something like this: P divided by P dot sum.
00:37:58.760 But now we have to be careful, because P dot sum actually produces...
00:38:07.960 sorry, P equals N dot float, a copy; and then P dot sum
00:38:12.520 sums up all of the counts of this entire matrix N, and gives us a single number, just the summation
00:38:20.200 of everything. So that's not the way we want to divide. We want to simultaneously, and in parallel,
00:38:26.040 divide all the rows by their respective sums. So what we have to do now is we have to go into
00:38:33.320 documentation for torch dot sum. And we can scroll down here to a definition that is relevant to us,
00:38:38.680 which is where we don't only provide an input array that we want to sum, but we also provide the
00:38:44.280 dimension along which we want to sum. And in particular, we want to sum up over rows, right?
00:38:51.240 Now, one more argument that I want you to pay attention to here is keepdim, which is false by default.
00:38:56.760 If keepdim is true, then the output tensor is of the same size as the input,
00:39:02.360 except of course the dimension along which it is summed, which will become just one. But if you pass in
00:39:08.840 keepdim as false, then this dimension is squeezed out. And so torch dot sum not only does the sum
00:39:16.680 and collapses the dimension to be of size one, but in addition, it does what's called a squeeze,
00:39:21.000 where it squeezes out that dimension. So basically what we want here is we instead want to do P dot
00:39:28.040 sum of some axis. And in particular, notice that P dot shape is 27 by 27. So when we sum up across
00:39:36.680 axis zero, then we would be taking the zero dimension, and we would be summing across it.
00:39:41.400 So with keepdim as true, this thing will not only give us the counts along the columns,
00:39:50.840 but notice that basically the shape of this is one by 27; we just get a row vector.
00:39:57.320 And the reason we get a row vector here again is because we pass in zero dimension. So this zero
00:40:01.640 dimension becomes one, and we've done a sum. And we get a row. And so basically we've done the sum
00:40:07.240 this way vertically, and arrived at just a single one by 27 vector of counts.
00:40:13.720 What happens when you take out keepdim is that we just get 27. So it squeezes out that dimension,
00:40:21.560 and we just get a one-dimensional vector of size 27. Now we don't actually want a
00:40:29.800 one by 27 row vector, because that gives us the counts or the sums across the columns.
00:40:38.440 We actually want to sum the other way along dimension one. And you'll see that the shape of
00:40:43.800 this is 27 by one. So it's a column vector. It's a 27 by one vector of counts.
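(The shape difference being described, as a quick check.)

    P = N.float()
    print(P.sum(0, keepdim=True).shape)  # torch.Size([1, 27]): summed down the columns, a row vector
    print(P.sum(1, keepdim=True).shape)  # torch.Size([27, 1]): summed across the rows, a column vector
    print(P.sum(1).shape)                # torch.Size([27]): keepdim=False squeezes that dimension out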
00:40:50.760 And that's because what's happened here is that we're going horizontally, and this 27 by 27 matrix
00:40:58.760 becomes a 27 by one array. Now you'll notice by the way that the actual numbers of these counts
00:41:08.920 are identical. And that's because this special array of counts here comes from
00:41:13.560 bigram statistics. And actually it just so happens, by chance or because of the way this
00:41:18.840 array is constructed, that the sums along the columns or along the rows, horizontally or vertically,
00:41:24.200 are identical. But actually what we want to do in this case is we want to sum across the
00:41:29.240 rows, horizontally. So what we want here is P dot sum of 1, with keepdim as true, which gives a
00:41:37.240 27 by one column vector. And now what we want to do is we want to divide by that.
00:41:41.640 Now we have to be careful here again. Is it possible to take P, whose shape you see
00:41:51.080 here is 27 by 27, and divide it by what is a 27 by one array?
00:41:59.480 Is that an operation that you can do? And whether or not you can perform this operation is determined
00:42:06.200 by what's called broadcasting rules. So if you just search broadcasting semantics in torch,
00:42:11.000 you'll notice that there's a special definition for what's called broadcasting, which determines whether or
00:42:17.160 not these two arrays can be combined in a binary operation like division. So the first condition
00:42:24.760 is each tensor has at least one dimension, which is the case for us. And then when iterating over
00:42:29.480 the dimension sizes, starting at the trailing dimension, the dimension sizes must either be equal,
00:42:34.600 one of them is one or one of them does not exist. Okay, so let's do that. We need to align the two
00:42:41.480 arrays and their shapes, which is very easy because both of these shapes have two elements,
00:42:46.520 so they're aligned. Then we iterate over the dimensions from the right, going to the left.
00:42:51.400 Each dimension must be either equal, one of them is a one or one of them does not exist. So in this
00:42:58.200 case, they're not equal, but one of them is a one. So this is fine. And then this dimension,
00:43:02.840 they're both equal. So this is fine. So all the dimensions are fine. And therefore,
00:43:08.840 this operation is broadcastable. So that means that this operation is allowed. And what is it that
00:43:15.640 these arrays do when you divide a 27 by 27 by a 27 by one? What it does is that it takes this dimension
00:43:22.120 one and it stretches it out, it copies it to match 27 here in this case. So in our case,
00:43:29.880 it takes this column vector, which is 27 by one, and it copies it 27 times to make these both be
00:43:38.520 27 by 27 internally. You can think of it that way. And so it copies those counts, and then it
00:43:44.520 does an element wise division, which is what we want, because these counts, we want to divide by
00:43:50.120 them on every single one of these columns in this matrix. So this actually we expect will normalize
00:43:57.640 every single row. And we can check that this is true by taking the first row, for example,
00:44:03.320 and taking its sum; we expect this to be one, because it's now normalized. And then we expect that,
00:44:11.800 because we actually correctly normalized all the rows, we should get the exact same
00:44:16.840 result here. So let's run this, it's the exact same result. So this is correct. So now I would
00:44:23.400 like to scare you a little bit. You actually have to like, I basically encourage you very strongly
00:44:28.120 to read through broadcasting semantics. And I encourage you to treat this with respect. And
00:44:32.840 it's not something to play fast and loose with, it's something to really respect, really understand,
00:44:37.400 and look up maybe some tutorials for broadcasting and practice it and be careful with it, because
00:44:41.880 you can very quickly run into bugs. Let me show you what I mean. You see how here we have
00:44:48.280 P dot sum of 1, keepdim as true; the shape of this is 27 by one. Let me take out this line just so
00:44:54.040 we have the n, and then we can see the counts. And we can see that this is all the counts across
00:45:01.160 all the rows. And it's 27 by one column vector, right? Now suppose that I tried to do the following,
00:45:10.360 but I erase keepdim equals true here. What does that do? If keepdim is not true, it's false,
00:45:17.080 then remember, according to the documentation, it gets rid of this dimension one; it squeezes it out.
00:45:22.280 So basically, we just get all the same counts, the same result, except the shape of it is not
00:45:27.960 27 by one, it is just 27, the one disappears. But all the counts are the same. So you'd think that
00:45:35.960 this division would work. First of all, can we even write this? And is it even
00:45:44.200 expected to run? Is it broadcastable? Let's determine if this result is broadcastable.
00:45:48.280 P dot sum of 1, its shape, is 27. This is 27 by 27. So 27 by 27,
00:45:55.560 broadcasting with 27. So now, rules of broadcasting: number one, align all the dimensions on the right,
00:46:05.400 done. Now iterate over all the dimensions, starting from the right, going to the left.
00:46:10.120 Each dimension must either be equal, one of them must be one, or one of them does not exist.
00:46:15.880 So here they are all equal. Here, the dimension does not exist. So internally, what broadcasting
00:46:21.160 will do is it will create a one here. And then we see that one of them is a one, and this will
00:46:28.120 get copied, and this will run; this will broadcast. Okay, so you'd expect this to work,
00:46:37.240 because this is broadcastable and we can divide these. Now if I run this, you'd expect it
00:46:45.000 to work, but it doesn't. You actually get garbage. You get a wrong result, because this is actually
00:46:51.320 a bug. This keepdim equals true makes it work; without it, this is a bug. In both cases, we are doing
00:47:04.680 the correct counts. We are summing up across the rows, but keepdim is saving us and making it work.
00:47:11.400 So in this case, I'd like to encourage you to potentially pause this video at this point
00:47:15.560 and try to think about why this is buggy and why the keepdim was necessary here.
00:47:19.960 Okay, so the reason for this is what I was hinting at here when I was sort of giving you a
00:47:27.560 bit of a hint on how this works. This 27 vector, internally inside the broadcasting,
00:47:34.120 becomes a one by 27. And one by 27 is a row vector, right? And now we are dividing a 27
00:47:41.080 by 27 by a one by 27. And torch will replicate this dimension. So basically, it will take
00:47:49.640 this row vector and it will copy it vertically now 27 times, so it aligns exactly with the 27 by 27 and
00:47:58.120 element-wise divides. And so basically what's happening here is we're actually normalizing the
00:48:05.320 columns instead of normalizing the rows. So you can check that what's happening here is that p
00:48:12.520 at zero, which is the first row of P, its sum is not one; it's seven. It is the first column, as an
00:48:20.200 example that sums to one. So to summarize, where does the issue come from? The issue comes from
00:48:27.400 the silent adding of a dimension here, because in broadcasting rules, you align on the right
00:48:32.360 and go from right to left. And if a dimension doesn't exist, you create it. So that's where the problem
00:48:37.000 happens. We still did the counts correctly. We did the counts across the rows. And we got the
00:48:42.280 counts on the right here as a column vector. But because the keepdim was not true, this dimension
00:48:48.680 was discarded, and now we just have a vector of 27. And because of broadcasting and the way it works,
00:48:53.800 this vector of 27 suddenly becomes a row vector. And then this row vector gets replicated vertically
00:48:59.880 and then at every single point we are dividing by the count in the opposite direction.
00:49:07.320 So this thing just doesn't work. This needs to be keepdim equals true in this case.
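(So the correct, vectorized normalization looks roughly like this; the buggy variant differs only in the keepdim flag.)

    P = N.float()
    P = P / P.sum(1, keepdim=True)  # 27x27 divided by 27x1 broadcasts row-wise: each row sums to one
    # P = P / P.sum(1)              # buggy: shape (27,) broadcasts as a 1x27 row, normalizing columns
    print(P[0].sum())               # tensor(1.), the first row is now a proper distribution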
00:49:13.320 So then we have that p at zero is normalized. And conversely, the first column you'd expect
00:49:21.960 to potentially not be normalized. And this is what makes it work. So pretty subtle. And hopefully
00:49:30.440 this helps to scare you that you should have respect for broadcasting. Be careful. Check your work.
00:49:36.440 And understand how it works under the hood and make sure that it's broadcasting in the direction
00:49:40.360 that you like. Otherwise, you're going to introduce very subtle bugs, very hard to find bugs. And
00:49:45.640 just be careful. One more note on efficiency. We don't want to be doing this here because this
00:49:51.080 creates a completely new tensor that we store into P. We prefer to use in place operations if
00:49:56.360 possible. So this would be an in-place operation, which has the potential to be faster. It doesn't create
00:50:02.360 new memory under the hood. And then let's erase this. We don't need it. And let's also
00:50:09.560 just do fewer just so I'm not wasting space. Okay, so we're actually in the pretty good spot now.
00:50:16.360 We trained a bigram language model. And we trained it really just by counting
00:50:21.160 how frequently any pairing occurs and then normalizing, so that we get a nice probability distribution.
00:50:27.880 So really these elements of this array P are the parameters of our bigram language model,
00:50:33.400 giving us and summarizing the statistics of these bigrams. So we trained the model and then we know
00:50:38.600 how to sample from the model. We just iteratively sample the next character and feed it in each
00:50:44.680 time and get a next character. Now what I'd like to do is I'd like to somehow evaluate the quality
00:50:50.040 of this model. We'd like to somehow summarize the quality of this model into a single number.
00:50:55.160 How good is it at predicting the training set? And as an example, so in the training set,
00:51:01.000 we can evaluate now the training loss and this training loss is telling us about sort of the
00:51:07.240 quality of this model in a single number, just like we saw in Micrograd. So let's try to think
00:51:12.920 through the quality of the model and how we would evaluate it. Basically, what we're going to do
00:51:17.960 is we're going to copy paste this code that we previously used for counting. Okay. And let me
00:51:24.440 just print these bigrams first. We're going to use f-strings. And I'm going to print character one
00:51:29.720 followed by character two. These are the bigrams. And then I don't want to do it for all the words;
00:51:33.480 let's just do the first three words. So here we have the Emma, Olivia and Ava bigrams. Now what we'd
00:51:40.680 like to do is we'd like to basically look at the probability that the model assigns to every one
00:51:46.520 of these bigrams. So in other words, we can look at the probability, which is summarized in the matrix
00:51:52.040 P at ix1, ix2. And then we can print it here as the probability. And because these probabilities print
00:52:01.720 with way too many digits, let me format them with .4f to truncate them a bit. So what do we have
00:52:09.720 here, right? We're looking at the probabilities that the model assigns to every one of these bigrams
00:52:13.800 in the data set. And so we can see some of them are 4%, 3%, etc. Just to have a measuring stick in
00:52:19.800 our mind, by the way, we have 27 possible characters or tokens. And if everything was equally likely,
00:52:26.600 then you'd expect all these probabilities to be 4% roughly. So anything above 4% means that we've
00:52:34.600 learned something useful from these bigram statistics. And you see that roughly some of these
00:52:38.760 are 4%, but some of them are as high as 40%, 35%. And so on. So you see that the model actually
00:52:44.920 assigned a pretty high probability to whatever's in the training set. And so that's a good thing.
00:52:50.040 Basically, if you have a very good model, you'd expect that these probabilities should be near one,
00:52:54.760 because that means that your model is correctly predicting what's going to come next,
00:52:58.600 especially on the training set where you train your model. So now we'd like to think about how
00:53:05.080 can we summarize these probabilities into a single number that measures the quality of this model.
00:53:10.040 Now, when you look at the literature into maximum likelihood estimation and statistical
00:53:15.640 modeling and so on, you'll see that what's typically used here is something called the likelihood.
00:53:20.680 And the likelihood is the product of all of these probabilities. And so the product of all
00:53:27.080 these probabilities is the likelihood. And it's really telling us about the probability of the
00:53:32.120 entire data set assigned by the model that we've trained. And that is a measure of quality.
00:53:38.840 So the product of these should be as high as possible when you are training the model and when
00:53:44.600 you have a good model, your product of these probabilities should be very high.
00:53:48.200 Now, because the product of these probabilities is an unwieldy thing to work with, you can see
00:53:54.600 that all of them are between zero and one. So your product of these probabilities will be a very
00:53:58.440 tiny number. So for convenience, what people work with usually is not the likelihood,
00:54:04.680 but they work with what's called the log likelihood. So the product of these is the likelihood.
00:54:10.760 To get the log likelihood, we just have to take the log of the probability. And so the log of
00:54:15.800 the probability here: I have the log of x from zero to one. The log is, as you see here, a monotonic
00:54:21.880 transformation of the probability, where if you pass in one, you get zero. So probability one
00:54:29.720 gets you a log probability of zero. And then as you go to lower and lower probabilities, the log will
00:54:35.160 grow more and more negative until all the way to negative infinity at zero. So here, we have a
00:54:43.000 log prob, which is really just the torch dot log of the probability. Let's print it out to get a
00:54:47.720 sense of what that looks like: log prob, also with .4f. Okay. So as you can see, when we plug in
00:54:58.520 probabilities that are closer to one, some of our higher numbers, we get closer and closer to zero. And
00:55:03.480 then if we plug in very bad probabilities, we get a more and more negative number. That's bad.
00:55:08.040 So, and the reason we work with this is for a large extent convenience, right, because we have
00:55:15.960 mathematically that if you have some product a times b times c of all these probabilities,
00:55:20.200 right, where the likelihood is the product of all these probabilities, then the log of these is just
00:55:29.000 log of a plus log of B plus log of C. If you remember your logs from your high school or undergrad and
00:55:38.680 so on. So we have that basically, the likelihood is the product of probabilities. The log likelihood
00:55:44.280 is just the sum of the logs of the individual probabilities. So log likelihood starts at zero.
00:55:54.600 And then log likelihood here, we can just accumulate simply. And then at the end, we can print this.
00:56:06.520 F-strings; maybe you're familiar with this. So the log likelihood is negative 38.
00:56:19.880 Okay. Now, how high can the log likelihood get? It can go to zero. So when all
00:56:30.200 the probabilities are one, log likelihood will be zero. And then when all the probabilities are
00:56:34.360 lower, this will grow more and more negative. Now, we don't actually like this because what we'd
00:56:40.120 like is a loss function. And a loss function has the semantics that low is good, because we're
00:56:46.520 trying to minimize the loss. So we actually need to invert this. And that's what gives us something
00:56:51.800 called the negative log likelihood. Negative log likelihood is just negative of the log likelihood.
00:56:59.480 These are F strings, by the way, if you'd like to look this up, negative log likelihood equals.
00:57:07.640 So negative log likelihood now is just the negative of it. And so the negative log
00:57:12.840 likelihood is a very nice loss function, because the lowest it can get is zero. And the higher it is,
00:57:21.000 the worse off the predictions are that you're making. And then one more modification to this
00:57:26.040 that sometimes people do is that, for convenience, they actually like to normalize; they like to
00:57:31.640 make it an average instead of a sum. And so here, let's just keep a count as well. So n plus
00:57:40.360 equals one, with n starting at zero. And then here, we can have sort of like a normalized log likelihood.
00:57:46.920 If we just normalize it by the count, then we will sort of get the average log likelihood.
00:57:55.720 So this would usually be our loss function here; put this way, this is what we would use.
00:58:00.760 So our loss function for the training set assigned by the model is 2.4. That's the quality of this
00:58:07.560 model. And the lower it is, the better off we are, and the higher it is, the worse off we are.
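(The loss computation being described, as a sketch over the whole training set; this assumes the dot-based stoi and the normalized matrix P from above, and uses the fact that the log of a product is the sum of the logs.)

    log_likelihood = 0.0
    n = 0
    for w in words:
        chs = ['.'] + list(w) + ['.']        # the single dot token at both ends
        for ch1, ch2 in zip(chs, chs[1:]):
            ix1 = stoi[ch1]
            ix2 = stoi[ch2]
            prob = P[ix1, ix2]               # probability the model assigns to this bigram
            logprob = torch.log(prob)
            log_likelihood += logprob        # sum of logs = log of the product
            n += 1
    nll = -log_likelihood                    # negative log likelihood
    print(f'{nll/n}')                        # average negative log likelihood; roughly 2.45 here, as above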
00:58:12.280 And the job of our, you know, training is to find the parameters that minimize the negative
00:58:19.880 log likelihood loss. And that would be like a high quality model. Okay, so to summarize, I actually
00:58:26.520 wrote it out here. So our goal is to maximize likelihood, which is the product of all the
00:58:32.760 probabilities assigned by the model. And we want to maximize this likelihood with respect to the
00:58:38.360 model parameters. And in our case, the model parameters here are defined in the table. These
00:58:43.640 numbers, the probabilities, are the model parameters, sort of, in our bigram language model so far.
00:58:49.240 But you have to keep in mind that here we are storing everything in a table format,
00:58:53.560 the probabilities. But what's coming up as a brief preview is that these numbers will not be
00:58:58.840 kept explicitly, but these numbers will be calculated by a neural network. So that's coming up. And we
00:59:04.760 want to change and tune the parameters of these neural nets. We want to change these parameters
00:59:09.400 to maximize the likelihood, the product of the probabilities. Now, maximizing the likelihood is
00:59:15.080 equivalent to maximizing the log likelihood, because log is a monotonic function. Here's the graph of
00:59:21.000 log. And basically, all it is doing — you can look at it as just a rescaling
00:59:27.880 of the loss function. And so the optimization problem here and here are actually equivalent,
00:59:34.040 because this is just a rescaling, you can look at it that way. And so these are two identical
00:59:38.600 optimization problems. Maximizing the log likelihood is equivalent to minimizing the
00:59:44.200 negative log likelihood. And then in practice, people actually minimize the average negative
00:59:49.160 log likelihood to get numbers like 2.4. And then this summarizes the quality of your model. And we'd
00:59:56.440 like to minimize it and make it as small as possible. And the lowest it can get is zero. And the lower
01:00:03.080 it is, the better off your model is, because it is assigning high probabilities
01:00:08.440 to your data. Now let's evaluate this loss over the entire training set, just to make sure
01:00:12.440 that we get something around 2.4. Let's run this over the entire — oops, let's take out the print
01:00:18.040 statement as well. Okay, 2.45 over the entire training set. Now what I'd like to show you is that you
01:00:25.720 can actually evaluate the probability for any word that you want. Like for example,
01:00:28.920 if we just test a single word, andrej, and bring back the print statement,
01:00:33.960 then you see that andrej is actually kind of an unlikely word: on average,
01:00:39.560 it comes out to about three in negative log probability per character. And roughly that's because ej apparently is very
01:00:47.160 uncommon, as an example. Now, think through this: I'm going to take andrej and I append a q,
01:00:55.640 and I test the probability of andrejq. We actually get infinity. And that's because
01:01:03.720 jq has a 0% probability according to our model. So the log likelihood, the log of zero, will
01:01:10.680 be negative infinity, and we get infinite loss. So this is kind of undesirable, right? Because we
01:01:15.960 plugged in a string that could be like a somewhat reasonable name. But basically what this is
01:01:20.120 saying is that this model is exactly 0% likely to predict this name. And our loss is infinity on this
01:01:28.600 example. And really the reason for that is that j is followed by q zero times — where is jq? —
01:01:37.880 the count for jq is zero, and so jq is 0% likely. So it's actually kind of gross and people don't like this too much.
01:01:45.160 To fix this, there's a very simple fix that people like to do to sort of smooth out your model
01:01:49.880 a little bit, and it's called model smoothing. And roughly what's happening is that we will
01:01:54.120 add some fake counts. So imagine adding a count of one to everything. So we add a count of one
01:02:02.520 like this, and then we recalculate the probabilities. And that's model smoothing. And you can add as
01:02:09.560 much as you like, you can add five, and that will give you a smoother model. And the more you add
01:02:13.800 here, the more uniform model you're going to have. And the less you add, the more peaked model you
01:02:20.840 are going to have, of course. So one is like a pretty decent count to add. And that will ensure
01:02:27.000 that there will be no zeros in our probability matrix P. And so this will of course change the
01:02:32.040 generations a little bit. In this case, it didn't, but in principle it could. But what that's
01:02:37.160 going to do now is that nothing will be infinitely unlikely. So now our model will predict some other
01:02:43.960 probability. And we see that jq now has a very small probability. So the model still finds it
01:02:48.600 very surprising that this was a word, or a bigram, but we don't get negative infinity. So it's kind
01:02:53.720 of like a nice fix that people like to apply sometimes, and it's called model smoothing.
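A sketch of that smoothing step, assuming the 27-by-27 integer bigram count matrix `N` from earlier in the video:

```python
# Model smoothing: add a fake count to every bigram before normalizing.
P = (N + 1).float()                  # add-one smoothing; a larger constant gives a more uniform model
P = P / P.sum(1, keepdim=True)       # re-normalize each row into a probability distribution
# Now no entry of P is exactly zero, so no bigram gets -inf log probability.
```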
01:02:57.240 Okay, so we've now trained a respectable bigram character-level language model. And we saw that we both
01:03:04.040 sort of trained the model by looking at the counts of all the bigrams and normalizing the rows to
01:03:09.320 get probability distributions. We saw that we can also then use those parameters of this model
01:03:15.640 to perform sampling of new words. So we sample new names according to those distributions. And we
01:03:22.440 also saw that we can evaluate the quality of this model. And the quality of this model is summarized
01:03:26.920 in a single number, which is the negative log likelihood. And the lower this number is, the
01:03:31.720 better the model is, because it is giving high probabilities to the actual next characters in
01:03:37.400 all the bigrams in our training set. So that's all well and good. But we've arrived at this model
01:03:43.560 explicitly by doing something that felt sensible. We were just performing counts, and then we were
01:03:49.080 normalizing those counts. Now what I would like to do is I would like to take an alternative approach.
01:03:54.040 We will end up in a very, very similar position, but the approach will look very different.
01:03:58.120 Because I would like to cast the problem of bigram character-level language modeling into the
01:04:02.440 neural network framework. And in the neural network framework, we're going to approach things slightly
01:04:07.560 differently. But again, end up in a very similar spot. I'll go into that later. Now our neural network
01:04:13.720 is still going to be a bigram character-level language model. So it receives a single character
01:04:19.080 as an input. Then there's neural network with some weights or some parameters w. And it's going to
01:04:25.000 output the probability distribution over the next character in a sequence. It's going to make guesses
01:04:30.360 as to what is likely to follow this character that was input to the model. And then in addition to
01:04:36.840 that, we're going to be able to evaluate any setting of the parameters of the neural network,
01:04:41.240 because we have a loss function, the negative log likelihood. So we're going to take a look
01:04:45.800 at its probability distributions, and we're going to use the labels, which are basically just the
01:04:51.160 identity of the next character in that bigram, the second character. So knowing what second
01:04:56.040 character actually comes next in the bigram allows us to then look at how high of a probability
01:05:01.560 the model assigns to that character. And then we of course want the probability to be very high.
01:05:06.040 And that is another way of saying that the loss is low. So we're going to use gradient based
01:05:12.520 optimization then to tune the parameters of this network, because we have the loss function,
01:05:16.920 and we're going to minimize it. So we're going to tune the weights so that the neural net is
01:05:21.240 correctly predicting the probabilities for the next character. So let's get started. The first
01:05:25.800 thing I want to do is I want to compile the training set of this neural network, right? So
01:05:29.640 create the training set of all the bigrams. Okay. And here, I'm going to copy paste this code,
01:05:43.720 because this code iterates over all the bigrams. So here we start with the words,
01:05:48.840 we iterate over all the bigrams. And previously, as you recall, we did the counts,
01:05:52.920 but now we're not going to do counts. We're just creating a training set. Now this training set
01:05:57.720 will be made up of two lists. We have the inputs and the targets, the labels. And these
01:06:09.800 bigrams will denote x, y — those are the characters, right? And so we're given the first character of
01:06:14.840 the bigram, and then we're trying to predict the next one. Both of these are going to be integers.
01:06:19.320 So here, we'll do xs.append(ix1) and ys.append(ix2). And then here,
01:06:28.280 we actually don't want lists of integers, we will create tensors out of these. So xs is
01:06:34.760 torch.tensor of xs, and ys is torch.tensor of ys. And then we don't actually want to take
01:06:43.400 all the words just yet, because I want everything to be manageable. So let's just do the first word,
01:06:48.440 which is Emma. And then it's clear what these xs and ys would be. Here, let me print
01:06:56.360 ch1 and ch2 just so you see what's going on here. So the bigrams of these characters
01:07:03.960 are .e, em, mm, ma, a. So this single word, as I mentioned, has one, two, three, four, five
01:07:12.440 examples for our neural network. There are five separate examples in Emma. And those examples
01:07:18.120 are summarized here. When the input to the neural network is integer zero, the desired label
01:07:24.360 is integer five, which corresponds to e. When the input to the neural network is five,
01:07:30.200 we want its weights to be arranged so that 13 gets a very high probability. When 13 is put in,
01:07:36.360 we want 13 to have a high probability. When 13 is put in again, we also want one to have a high probability.
01:07:42.440 And when one is input, we want zero to have a very high probability. So there are five separate input
01:07:48.840 examples to a neural net in this data set.
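A sketch of that training-set construction, assuming the `words` list and the `stoi` mapping from earlier; only the first word is used for now:

```python
import torch

xs, ys = [], []
for w in words[:1]:                     # just 'emma' for now, to keep it manageable
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1, ix2 = stoi[ch1], stoi[ch2]
        xs.append(ix1)                  # input: index of the first character of the bigram
        ys.append(ix2)                  # label: index of the character that follows

xs = torch.tensor(xs)                   # tensor([ 0,  5, 13, 13,  1])
ys = torch.tensor(ys)                   # tensor([ 5, 13, 13,  1,  0])
```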
01:07:58.440 Now, a quick tangent and a note of caution: be careful with a lot of the APIs of some of these frameworks. You saw me silently use torch.tensor with a
01:08:04.440 lowercase t, and the output looked right. But you should be aware that there are actually two ways of
01:08:10.280 constructing a tensor. There's torch.tensor with a lowercase t, and there's also the torch.Tensor class
01:08:15.640 with a capital T, which you can also construct. So you can actually call both. You can also do
01:08:21.000 torch.Tensor with a capital T, and you get xs and ys as well. So that's not confusing at all.
01:08:27.320 There are threads on what the difference is between these two. And unfortunately, the docs are
01:08:34.440 just not clear on the difference. When you look at the docs of lowercase tensor:
01:08:39.080 "constructs a tensor with no autograd history by copying data" — it just doesn't make
01:08:46.040 much sense. So the actual difference, as far as I can tell, is explained eventually in this random
01:08:49.960 thread that you can Google. And really it comes down to, I believe — where is this? —
01:08:57.000 torch.tensor infers the dtype, the data type, automatically, while torch.Tensor
01:09:02.440 just returns a float tensor. I would recommend you stick to torch.tensor, the lowercase one. And
01:09:07.960 indeed, we see that when I construct this with a capital T, the data type of xs here is float32.
01:09:18.040 But with torch.tensor, the lowercase one, you see how xs.dtype is now an integer type.
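The difference in a nutshell, as a small check you can run yourself:

```python
# torch.Tensor defaults to float32, while torch.tensor infers the dtype from the data.
import torch

print(torch.Tensor([0, 5, 13, 13, 1]).dtype)   # torch.float32
print(torch.tensor([0, 5, 13, 13, 1]).dtype)   # torch.int64
```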
01:09:25.000 So it's advised that you use lowercase T and you can read more about it if you like in some of
01:09:32.760 these threads. But basically, I'm pointing out some of these things because I want to caution you
01:09:39.160 and I want you to get used to reading a lot of documentation and reading through a lot of Q&As
01:09:44.680 and threads like this. And some of this stuff is unfortunately not easy and not very well
01:09:50.600 documented. And you have to be careful out there. What we want here is integers because that's what
01:09:55.320 makes sense. And so lowercase tensor is what we are using. Okay, now we want to think through
01:10:02.440 how we're going to feed in these examples into a neural network. Now it's not quite as straightforward
01:10:09.000 plugging it in because these examples right now are integers. So there's like a 0, 5, or 13.
01:10:14.520 It gives us the index of the character and you can't just plug an integer index into a neural
01:10:18.840 net. These neural nets are sort of made up of these neurons. And these neurons have weights.
01:10:26.760 And as you saw in micrograd, these weights act multiplicatively on the inputs — wx plus b,
01:10:32.440 there are tanhs, and so on. And so it doesn't really make sense to make an input neuron take on
01:10:36.760 integer values that you feed in and then multiply with weights. So instead, a common way of
01:10:43.480 encoding integers is what's called one-hot encoding. In one-hot encoding, we take an integer like 13,
01:10:50.680 and we create a vector that is all zeros except for the 13th dimension, which we turn into a 1.
01:10:56.760 And then that vector can feed into a neural net. Now, conveniently, PyTorch actually has
01:11:04.440 something called the one_hot function inside torch.nn.functional. It takes a tensor made up of
01:11:11.960 integers — long is an integer type — and it also takes the number of classes, which is how large you want
01:11:23.480 your vector to be. So here, let's import torch.nn.functional as F,
01:11:31.640 which is a common way of importing it. And then let's do F.one_hot. And we feed in the integers
01:11:38.280 that we want to encode. So we can actually feed in the entire tensor of xs. And we can tell it that
01:11:45.000 num_classes is 27, so it doesn't have to try to guess it. It may have guessed that it's only 13
01:11:51.480 and would give us an incorrect result. So this is the one-hot. Let's call this xenc, for x encoded.
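A sketch of that call, assuming the `xs` tensor of bigram input indices built a moment ago:

```python
import torch.nn.functional as F

xenc = F.one_hot(xs, num_classes=27)    # shape (5, 27); a single 1 per row, at the character's index
print(xenc.shape)                       # torch.Size([5, 27])
```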
01:11:59.960 And then we see that xenc.shape is five by 27. And we can also visualize it with plt.imshow
01:12:10.040 of xenc to make it a little bit more clear, because this is a little messy. So we see
01:12:15.800 that we've encoded all the five examples into vectors. We have five examples, so we have five rows
01:12:22.600 and each row here is now an example into a neural net. And we see that the appropriate bit is turned
01:12:28.520 on as a one and everything else is zero. So here, for example, the zero-th bit is turned on,
01:12:36.520 the fifth bit is turned on, 13th bits are turned on for both of these examples. And the first bit
01:12:42.600 here is turned on. So that's how we can encode integers into vectors. And then these vectors
01:12:50.280 can feed in to neural nets. One more issue to be careful with here, by the way, is: let's look at
01:12:55.560 the data type of this encoding. We always want to be careful with data types. What would you expect
01:13:00.680 X encoding data type to be when we're plugging numbers into neural nets, we don't want them to
01:13:05.480 be integers. We want them to be floating point numbers that can take on various values. But the
01:13:10.680 D type here is actually 64 bit integer. And the reason for that I suspect is that one-hot received
01:13:17.480 a 64 bit integer here, and it returned the same data type. And when you look at the signature of
01:13:23.160 one_hot, it doesn't even take a dtype, a desired data type for the output tensor. And so unlike
01:13:29.640 a lot of functions in torch, we're not able to do something like dtype=torch.float32,
01:13:34.360 which is what we want, because one_hot does not support that. So instead, we're going to want to cast this
01:13:40.120 to float, like this. So that everything looks the same, but the dtype
01:13:48.600 is float32. And floats can feed into neural nets. So now let's construct our first neuron.
01:13:55.000 This neuron will look at these input vectors. And as you remember from micrograd, these neurons
01:14:02.440 basically perform a very simple function, W X plus B, where W X is a dot product. Right?
01:14:08.440 So we can achieve the same thing here. Let's first define the weights of this neuron,
01:14:14.520 basically: what are the initial weights at initialization for this neuron? Let's initialize
01:14:19.320 them with torch.randn. torch.randn fills a tensor with random numbers drawn from
01:14:27.560 a normal distribution. And a normal distribution has a probability density function like this.
01:14:34.360 And so most of the numbers drawn from this distribution will be around zero. But some of them will be
01:14:40.040 as high as almost three and so on. And very few numbers will be above three in magnitude.
01:14:45.240 It needs to take a size as an input here, and I'm going to use a size of 27 by one.
01:14:53.080 So 27 by one, and then let's visualize W. So W is a column vector of 27 numbers.
01:15:03.000 And these weights are then multiplied by the inputs. So now to perform this multiplication,
01:15:10.600 we can take xenc and we can multiply it with W. This is the matrix multiplication operator
01:15:17.560 in PyTorch. And the output of this operation is five by one. The reason it is five by one is the
01:15:24.760 following: we took xenc, which is five by 27, and we multiplied it by 27 by one.
01:15:31.640 And in matrix multiplication, you see that the output will become five by one, because these
01:15:40.360 27 will multiply and add. So basically what we're seeing here, out of this operation,
01:15:48.680 is we are seeing the five activations of this neuron on these five inputs. And we've evaluated
01:15:59.160 all of them in parallel. We didn't feed in just a single input to the single neuron. We fed in
01:16:03.960 simultaneously all the five inputs into the same neuron. And in parallel, PyTorch has evaluated
01:16:10.440 the wx plus b — but here it's just wx, there's no bias. It has evaluated wx for all of them,
01:16:19.240 Uh, independently. Now instead of a single neuron, though, I would like to have 27 neurons. And I'll
01:16:24.280 show you in a second why I want 27 neurons. So instead of having just a one here, which is
01:16:29.880 indicating this presence of one single neuron, we can use 27. And then when W is 27 by 27,
01:16:36.760 this will in parallel evaluate all the 27 neurons on all the five inputs,
01:16:46.440 giving us a much, much bigger result. So now what we've done is five by 27 multiplied by
01:16:52.040 27 by 27. And the output of this is now five by 27. So we can see that the shape of this
01:16:59.400 is five by 27. So what is every element here telling us, right? It's telling us, for every one
01:17:08.520 of the 27 neurons that we created, what is the firing rate of that neuron on every one of those five
01:17:18.200 examples? So the element [3, 13], for example, is giving us the firing rate of the 13th neuron
01:17:28.600 looking at the third input. And the way this was achieved is by a dot product
01:17:36.200 between the third input and the 13th column of this W matrix here.
01:17:43.240 Okay, so using matrix multiplication, we can very efficiently evaluate
01:17:49.480 the dot product between lots of input examples in a batch and lots of neurons,
01:17:56.920 where all those neurons have weights in the columns of those Ws. And in matrix multiplication,
01:18:02.120 we're just doing those dot products and in parallel. Just to show you that this is the case,
01:18:07.160 we can take xenc and take its third row, and we can take W and take its 13th column.
01:18:14.920 And then we can do xenc[3], element-wise multiply with W[:, 13],
01:18:26.680 and sum that up. That's wx plus b — well, there's no plus b, it's just the wx dot product. And that's
01:18:33.240 this number. So you see that this is just being done efficiently by the matrix multiplication
01:18:39.560 operation for all the input examples and for all the output neurons of this first layer.
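A sketch of that layer and the manual check, assuming `xenc` from above; the specific generator seed here is just an arbitrary choice for reproducibility:

```python
import torch

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g)          # one column of weights per neuron
out = xenc.float() @ W                          # (5, 27) @ (27, 27) -> (5, 27)

# element [3, 13] is the dot product of input row 3 with weight column 13
manual = (xenc[3].float() * W[:, 13]).sum()
print(out[3, 13], manual)                       # the two numbers agree
```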
01:18:45.320 Okay, so we fed our 27 dimensional inputs into a first layer of a neural net that has 27 neurons,
01:18:52.600 right? So we have 27 inputs and now we have 27 neurons. These neurons perform W times X.
01:18:59.720 They don't have a bias and they don't have a nonlinearity like 10H. We're going to leave them
01:19:03.880 to be a linear layer. In addition to that, we're not going to have any other layers. This is going
01:19:09.480 to be it. It's just going to be the dumbest, smallest, simplest neural net, which is just a single
01:19:14.440 linear layer. And now I'd like to explain what I want those 27 outputs to be. Intuitively, what
01:19:21.880 we're trying to produce here for every single input example is we're trying to produce some kind of
01:19:25.880 a probability distribution for the next character in a sequence. And there's 27 of them. But we have
01:19:31.880 to come up with like precise semantics for exactly how we're going to interpret these 27 numbers
01:19:37.320 that these neurons take on. Now intuitively, you see here that these numbers are negative and
01:19:43.160 some of them are positive, etc. And that's because these are coming out of a neural net layer
01:19:48.120 initialized with these normal distribution parameters. But what we want is we want something like we
01:19:56.200 had here, like each row here told us the counts. And then we normalize the counts to get probabilities.
01:20:03.400 And we want something similar to come out of a neural net. But what we just have right now is
01:20:07.720 just some negative and positive numbers. Now we want those numbers to somehow represent the
01:20:12.840 probabilities for the next character. But you see that probabilities, they have a special structure,
01:20:18.760 they're positive numbers, and they sum to one. And so that doesn't just come out of a neural net.
01:20:25.000 And they can't be counts either, because counts are positive and counts are integers, and these numbers are not. So
01:20:33.080 counts are also not really a good thing to output from a neural net. So instead what the neural
01:20:37.720 net is going to output, and how we are going to interpret the 27 numbers, is that these 27
01:20:45.000 numbers are giving us log counts, basically. So instead of giving us counts directly,
01:20:52.840 like in this table, they're giving us log counts. And to get the counts, we're going to take the
01:20:58.040 log counts, and we're going to exponentiate them. Now, exponentiation takes the following form.
01:21:07.160 It takes numbers that are negative or they are positive. It takes the entire real line. And then
01:21:13.160 if you plug in negative numbers, you're going to get e to the x, which is always below one. So
01:21:20.760 you're getting numbers lower than one. And if you plug in numbers greater than zero, you're getting
01:21:26.280 numbers greater than one, growing all the way to infinity, while over here it decays toward zero.
01:21:33.240 So basically, we're going to take these numbers here. And
01:21:40.520 instead of them being positive and negative and all over the place, we're going to interpret them
01:21:47.080 as log counts. And then we're going to element wise exponentiate these numbers, exponentiating
01:21:53.560 them now gives us something like this. And you see that these numbers now, because of that they
01:21:58.120 went through an exponent, all the negative numbers turned into numbers below one, like 0.338. And all
01:22:04.440 the numbers that were originally positive turned into numbers greater than one.
01:22:09.320 So like, for example, seven is some positive number over here that is greater than zero.
01:22:19.400 But exponentiated outputs here basically give us something that we can use and interpret
01:22:27.720 as the equivalent of counts originally. So you see these counts here, one, 12, seven, 51, one,
01:22:34.360 et cetera. The neural net is now kind of predicting counts. And these counts are positive numbers.
01:22:43.880 They can never be below zero. So that makes sense. And they can now take on various values,
01:22:49.000 depending on the settings of w. So let me break this down. We're going to interpret these to
01:22:57.560 be the log counts. In other words, the term that is often used for this is "logits". These are
01:23:05.720 logits, log counts. Then these will be sort of the counts: the logits exponentiated. And this is
01:23:14.040 equivalent to the N matrix, sort of the N array that we used previously. Remember, this was N,
01:23:21.560 the array of counts, and each row here gives the counts for the next character, sort of.
01:23:30.200 So those are the counts. And now the probabilities are just the counts normalized.
01:23:38.520 And so I'm not going to scroll back to all the
01:23:45.000 places we've already done this: we want counts.sum along the first dimension, and we want
01:23:52.120 keepdim=True — we've gone over this. And this is how we normalize the rows of our counts
01:23:59.080 matrix to get our probabilities. So now these are the probabilities. And these are the counts
01:24:09.480 that we have currently. And now when I show the probabilities, you see that every row here,
01:24:16.520 of course, will sum to one, because they're normalized. And the shape of this is 5 by 27.
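Those two steps, sketched out, assuming `xenc` and the 27-by-27 `W` from above:

```python
# Interpret the outputs as log counts ("logits"), exponentiate to get
# count-like numbers, then normalize each row into a probability distribution.
logits = xenc.float() @ W                       # (5, 27) log counts
counts = logits.exp()                           # equivalent of the N count matrix, but differentiable
probs = counts / counts.sum(1, keepdim=True)    # (5, 27); each row sums to 1
```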
01:24:26.360 And so really what we've achieved is for every one of our five examples, we now have a row that
01:24:33.240 came out of a neural net. And because of the transformations here, we made sure that this
01:24:38.600 output of this neural net now are probabilities, or we can interpret to be probabilities. So
01:24:44.200 our W X here gave us logits. And then we interpret those to be log counts. We exponentiate to get
01:24:51.880 something that looks like counts. And then we normalize those counts to get a probability
01:24:56.040 distribution. And all of these are differentiable operations. So what we've done now is we are
01:25:01.960 taking inputs. We have differentiable operations that we can back propagate through. And we're
01:25:07.160 getting out probability distributions. So for example, the 0th example that we fed in,
01:25:13.800 right, the 0th example here, was a one-hot vector for index 0. And it basically corresponded to
01:25:24.680 feeding in this example here. So we're feeding in a dot into a neural net. And the way we fed the
01:25:31.160 dot into a neural net is that we first got its index, then we one-hot encoded it, then it went
01:25:37.080 into the neural net, and out came this distribution of probabilities. And its shape
01:25:44.520 is 27. There's 27 numbers. And we're going to interpret this as the neural net's assignment
01:25:52.280 for how likely every one of these characters, the 27 characters, are to come next. And as we tune
01:26:00.680 the weights W, we're going to be, of course, getting different probabilities out for any
01:26:05.320 character that you input. And so now the question is just, can we optimize and find a good W,
01:26:10.280 such that the probabilities coming out are pretty good. And the way we measure pretty good is by
01:26:16.120 the loss function. Okay, so I organized everything into a single summary so that hopefully it's a
01:26:20.200 bit more clear. So it starts here with an input data set. We have some inputs to the neural
01:26:25.800 net, and we have some labels for the correct next character in a sequence. These are integers.
01:26:32.680 Here I'm using a torch generator now so that you see the same numbers that I see. And I'm generating
01:26:39.320 the weights of 27 neurons, and each neuron here receives 27 inputs.
01:26:45.480 Then here we're going to plug in all the input examples, Xs, into a neural net. So here,
01:26:53.240 this is a forward pass. First, we have to encode all of the inputs into one-hot representations.
01:27:00.360 So we have 27 classes, we pass in these integers, and xenc becomes an array that is 5 by 27:
01:27:08.600 zeros, except for a few ones. We then multiply this in the first layer of a neural net to get
01:27:15.320 logits, exponentiate the logits to get fake counts, sort of, and normalize these counts to get
01:27:22.120 probabilities. So these last two lines, by the way, here are called the softmax, which I pulled up here.
01:27:31.160 Softmax is a very often used layer in a neural net that takes these z's, which are logits,
01:27:38.120 exponentiates them, and divides and normalizes. It's a way of taking outputs of a neural net layer,
01:27:46.280 and these outputs can be positive or negative, and it outputs probability distributions. It
01:27:52.600 outputs something that always sums to one and is all positive numbers, just like probabilities.
01:27:57.880 So this is kind of like a normalization function, if you want to think of it that way.
01:28:02.120 And you can put it on top of any other linear layer inside a neural net, and it basically
01:28:06.360 makes a neural net output probabilities that's very often used, and we used it as well here.
01:28:13.320 So this is the forward pass, and that's how we made a neural net output probability.
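Here is the whole forward pass consolidated into one sketch, assuming the `xs` tensor covering the examples; the seed is again just an arbitrary choice:

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g)          # 27 neurons, 27 inputs each

xenc = F.one_hot(xs, num_classes=27).float()    # (num_examples, 27) one-hot inputs
logits = xenc @ W                               # predict log counts
counts = logits.exp()                           # counts, equivalent to N
probs = counts / counts.sum(1, keepdim=True)    # softmax: probabilities for the next character
```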
01:28:16.760 Now, you'll notice that all of these, this entire forward pass is made up of differentiable
01:28:26.600 layers. Everything here, we can back propagate through, and we saw some of the back propagation
01:28:31.800 in micrograd. This is just multiplication and addition. All that's happening here is just
01:28:37.480 multiplying and add, and we know how to back propagate through them. Exponentiation, we know
01:28:41.800 how to back propagate through. And then here, we are summing, and sum is as easily back propagated
01:28:49.000 as well, and division as well. So everything here is a differentiable operation, and we can
01:28:55.080 back propagate through. Now, we achieve these probabilities, which are five by 27. For every
01:29:02.040 single example, we have a vector of probabilities that sums to one. And then here, I wrote a bunch
01:29:07.640 of stuff to sort of like break down the examples. So we have five examples making up Emma, right?
01:29:14.600 And there are five bigrams inside Emma. So bigram example one is that
01:29:23.960 e is the beginning character, coming right after the dot. And the indexes for these are zero and five.
01:29:30.680 So then we feed in a zero, that's the input of the neural net. We get probabilities from the
01:29:37.240 neural net that are 27 numbers. And then the label is five, because e actually comes after dot. So
01:29:45.960 that's the label. And then we use this label five to index into the probability distribution
01:29:53.320 here. So we count to index five: 0, 1, 2, 3, 4, 5. It's this number here, which is here.
01:30:03.960 So that's basically the probability assigned by the neural net to the actual correct character.
01:30:07.960 You see that the network currently thinks that this next character, e following dot, is only
01:30:13.800 1% likely, which is of course not very good, right? Because this actually is a training example,
01:30:19.640 and the network thinks that this is currently very, very unlikely. But that's just because
01:30:23.160 we didn't get very lucky in generating a good setting of w. So right now this network thinks
01:30:28.120 this is unlikely, and 0.01 is not a good probability. So the log likelihood then is very negative.
01:30:35.160 And the negative log likelihood is very positive. And so four is a very high
01:30:41.800 negative log likelihood. And that means we're going to have a high loss. Because what is the loss?
01:30:46.760 The loss is just the average negative log likelihood.
01:30:49.640 So the second bigram is em. And you see here that the network also thought that m following
01:30:56.840 e is very unlikely 1%. For m following m, it thought it was 2%. And for a following m,
01:31:05.640 it actually thought it was 7% likely. So just by chance, this one actually has a pretty good
01:31:11.240 probability and therefore a pretty low negative log likelihood. And finally here,
01:31:16.280 it thought this was 1% likely. So overall, our average negative log likelihood, which is the loss,
01:31:22.280 the total loss that summarizes basically the how well this network currently works,
01:31:27.320 at least on this one word — not on the full dataset, just this one word — is 3.76, which is actually
01:31:32.760 a fairly high loss. This is not a very good setting of w's. Now here's what we can do.
01:31:37.720 We're currently getting 3.76. We can actually come here and we can change our w, we can resample it.
01:31:45.560 So let me just add one to have a different seed. And then we get a different w. And then we can
01:31:51.000 rerun this. And with this different seed, with this different setting of w's, we now get 3.37.
01:31:57.720 So this is a much better w, right? And it's better because the probabilities just
01:32:03.080 happened to come out higher for the characters that actually come next. And so you can
01:32:09.240 imagine actually just resampling this, you know, we can keep trying. So okay, this was not very good.
01:32:17.080 Let's try one more. We can try three. Okay, this was a terrible setting, because we have a very high
01:32:23.640 loss. So anyway, I'm gonna erase this. What I'm doing here, which is just guess and check of
01:32:32.040 randomly assigning parameters and seeing if the network is good, that is amateur hour. That's not
01:32:37.480 how you optimize a neural net. The way you optimize a neural net is you start with some random guess,
01:32:42.120 and we're going to commit to this one, even though it's not very good. But now the big deal is we
01:32:46.360 have a loss function. So this loss is made up only of differentiable operations. And we can
01:32:54.840 minimize the loss by tuning w's by computing the gradients of the loss with respect to
01:33:01.880 these w matrices. And so then we can tune w to minimize the loss and find a good setting of w
01:33:09.560 using gradient based optimization. So let's see how that will work. Now things are actually going to
01:33:13.960 look almost identical to what we had with micro grad. So here, I pulled up the lecture from
01:33:20.200 micro grad, the notebook. It's from this repository. And when I scroll all the way to the end where
01:33:25.160 we left off with micro grad, we had something very, very similar. We had a number of input
01:33:30.280 examples. In this case, we had four input examples inside Xs. And we had their targets,
01:33:36.120 these are targets. Just like here, we have our Xs now, but we have five of them. And they're now
01:33:41.160 integers, instead of vectors. But we're going to convert our integers to vectors,
01:33:46.920 except our vectors will be 27 large, instead of three large. And then here, what we did is first,
01:33:53.480 we did a forward pass, where we ran a neural net on all the inputs to get predictions. Our neural
01:34:00.680 net at the time, this n(x), was an MLP, a multi-layer perceptron. Our neural net is going to look
01:34:06.520 different, because our neural net is just a single layer, single linear layer, followed by a softmax.
01:34:12.680 So that's our neural net. And the loss here was the mean squared error. So we simply subtracted
01:34:19.480 the prediction from the ground truth and squared it and summed it all up. And that was the loss.
01:34:24.120 And loss was the single number that summarized the quality of the neural net. And when loss is
01:34:29.960 low, like almost zero, that means the neural net is predicting correctly. So we had a single number
01:34:37.640 that that summarized the performance of the neural net. And everything here was differentiable
01:34:43.560 and was stored in massive compute graph. And then we iterated over all the parameters,
01:34:49.080 we made sure that the gradients are set to zero. And we called loss.backward(). And loss.backward()
01:34:55.320 initiated backpropagation at the final output node of loss. Right. So yeah, remember these
01:35:01.560 expressions, we had lost all the way at the end, we start backpropagation, and we went all the way
01:35:05.480 back. And we made sure that we populated all the parameters dot grad. So that grad started at zero,
01:35:12.280 but backpropagation filled it in. And then in the update, we iterated over all the parameters.
01:35:17.400 And we simply did a parameter update, where every single element of our parameters was
01:35:23.800 nudged a little bit in the opposite direction of the gradient. And so we're going to do the exact same thing here.
01:35:30.760 So I'm going to pull this up on the side here. So that we have it available. And we're actually
01:35:40.280 going to do the exact same thing. So this was the forward pass, where we did this. And probs is
01:35:47.720 our y_pred, our predictions. So now we have to evaluate the loss, but we're not using the mean squared error,
01:35:52.280 we're using the negative log likelihood, because we are doing classification, we're not doing
01:35:56.280 regression as it's called. So here, we want to calculate loss. Now the way we calculated is
01:36:04.200 is just this average negative log likelihood. Now probs here has a shape of five by 27.
01:36:12.360 And so to get that, we basically want to pluck out the probabilities at the correct indices
01:36:19.080 here. So in particular, because the labels are stored here in the array ys, basically what
01:36:24.920 we're after is, for the first example, we're looking at the probability at index five, right?
01:36:30.200 For the second example, at row index one, we are interested in the probability
01:36:37.400 assigned to index 13. At row index two, we also want 13. At row index three, we want one.
01:36:47.320 And at the last row, row index four, we want zero. So these are the probabilities we're interested in,
01:36:53.080 right? And you can see that they're not amazing as we saw above. So these are the probabilities we
01:36:59.640 want, but we want like a more efficient way to access these probabilities, not just listing them
01:37:05.560 out in a tuple like this. So it turns out that the way to do this in PyTorch, one of the ways at least,
01:37:10.680 is that we can basically pass in all of these integers as tensors. So these ones, you see how they're just
01:37:25.560 0, 1, 2, 3, 4 — we can actually create that using np... not np, sorry, torch.arange(5):
01:37:32.040 0, 1, 2, 3, 4. So we can index here with torch.arange(5), and here we index with ys.
01:37:40.840 And you see that that gives us exactly these numbers. So that plucks out the probabilities
01:37:50.840 that the neural network assigns to the correct next character. Now we take those probabilities,
01:37:57.880 and we actually look at the log probability. So we call .log() on them, and then we just
01:38:05.640 average that up — take the mean of all of it. And then it's the negative average log likelihood,
01:38:11.720 that is the loss. So the loss here is 3.7 something. And you see that this loss 3.76, 3.76 is exactly
01:38:21.960 as we've obtained before. But this is a vectorized form of that expression. So we get the same loss.
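That vectorized loss, sketched out, assuming the `probs` and `ys` tensors from above (5 examples here):

```python
import torch

# pluck out the probability of the correct next character for each example,
# take logs, average, and negate
loss = -probs[torch.arange(5), ys].log().mean()
print(loss.item())                      # ~3.76 for this initialization, matching the manual computation
```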
01:38:28.600 And the same loss, we can consider sort of as part of this forward pass. And we've achieved
01:38:34.760 here now loss. Okay, so we made our way all the way to loss. We defined the forward pass.
01:38:39.400 We forwarded the network and the loss. Now we're ready to do backward pass. So backward pass.
01:38:45.160 We want to first make sure that all the gradients are reset. So they're at zero. Now in PyTorch,
01:38:53.640 you can set the gradients to be zero, but you can also just set it to none. And setting it to
01:38:58.760 none is more efficient. And PyTorch will interpret none as like a lack of a gradient. And it's the
01:39:04.440 same as zeros. So this is a way to set to zero, the gradient. And now we do loss.backward.
01:39:12.040 Before we do loss.backward, we need one more thing. If you remember from micro grad,
01:39:18.280 PyTorch actually requires that we pass in requires_grad=True, so that we tell PyTorch that we
01:39:27.640 are interested in calculating gradients for this leaf tensor; by default, this is False.
01:39:33.400 So let me recalculate with that, and then set the grad to None and call loss.backward().
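The backward pass, sketched, assuming `loss` from the forward pass above and a `W` that was created with gradient tracking enabled:

```python
# W = torch.randn((27, 27), generator=g, requires_grad=True)   # gradient tracking must be on
W.grad = None       # reset the gradient; None is the efficient equivalent of zeroing it
loss.backward()     # backpropagate through the computation graph, filling in W.grad
```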
01:39:38.360 Now something magical happened when loss.backward was run because PyTorch, just like micro grad,
01:39:46.920 when we did the forward pass here, it keeps track of all the operations under the hood.
01:39:52.280 It builds a full computational graph. Just like the graphs we produced in micro grad,
01:39:57.720 those graphs exist inside PyTorch. And so it knows all the dependencies in all the
01:40:02.920 mathematical operations of everything. And when you then calculate the loss, we can call a
01:40:07.800 dot backward on it. And dot backward then fills in the gradients of all the intermediates all the
01:40:15.480 way back to the W's, which are the parameters of our neural net. So now we can look at W.grad.
01:40:21.400 And we see that it has structure. There's stuff inside it.
01:40:29.080 And these gradients — every single element here — so W.shape is 27 by 27, and W.grad's shape is the
01:40:38.120 same, 27 by 27. And every element of W.grad is telling us the influence of that weight on the
01:40:47.320 loss function. So for example, this number all the way here, the [0, 0] element
01:40:54.200 of W: because the gradient is positive, it's telling us that this has a positive influence on the loss.
01:41:01.240 Slightly nudging W[0, 0], adding a small h to it, would increase the loss
01:41:11.720 mildly, because this gradient is positive. Some of these gradients are also negative.
01:41:17.080 So that's telling us about the gradient information. And we can use this gradient
01:41:22.440 information to update the weights of this neural network. So let's now do the update.
01:41:28.200 It's going to be very similar to what we had in micrograd. We don't need a loop over all the parameters,
01:41:33.240 because we only have one parameter tensor, and that is W. So we simply do W.data plus equals —
01:41:40.280 we can actually copy this almost exactly — negative 0.1 times W.grad.
01:41:49.400 And that would be the update to the tensor. So that updates the tensor.
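The update step, as a one-line sketch:

```python
# nudge every weight a little bit against its gradient, decreasing the loss
W.data += -0.1 * W.grad
```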
01:41:56.520 And because the tensor is updated, we would expect that now the loss should decrease.
01:42:03.400 So here, if I print loss.item(), it was 3.76, right? So we've updated the W here. So if I
01:42:16.440 recalculate the forward pass, the loss now should be slightly lower. So 3.76 goes to 3.74.
01:42:24.440 And then we can again set the grad to None, and backward, and update. And now the parameters changed
01:42:33.720 again. So if we recalculate the forward pass, we expect a lower loss again, 3.72.
01:42:42.200 Okay. And this is, again, gradient descent that we're now doing.
01:42:45.560 And when we achieve a low loss, that will mean that the network is assigning high probabilities
01:42:53.560 to the correct next characters. Okay. So I rearranged everything and I put it all together from scratch.
01:42:58.520 So here is where we construct our data set of bigrams. You see that we are still iterating
01:43:04.520 along over the first word, Emma. I'm going to change that in a second. I added a number that
01:43:10.600 counts the number of elements in Xs so that we explicitly see that number of examples is 5.
01:43:16.040 Because currently we were just working with Emma, and there are 5 bigrams there.
01:43:19.640 And here I added a loop of exactly what we had before. So we had 10 iterations of
01:43:25.320 gradient descent: forward pass, backward pass, and update. And so running these two cells,
01:43:30.280 initialization and gradient descent gives us some improvement on the loss function.
01:43:38.120 But now I want to use all the words. And there are not five but 228,000
01:43:44.840 bigrams now. However, this should require no modification whatsoever. Everything should
01:43:49.880 just run, because all the code we wrote doesn't care if there are 5 bigrams or 228,000 bigrams;
01:43:55.720 everything should just work. So you see that this will just run. But now we are
01:44:00.920 optimizing over the entire training set of all the bigrams. And you see now that the loss is
01:44:05.880 decreasing, but only very slightly. So actually we can probably afford a larger learning rate.
01:44:14.280 Even 50 seems to work on this very, very simple example. So let me re-initialize
01:44:25.480 and let's run 100 iterations. See what happens.
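Putting those pieces together, here is a sketch of the full loop, assuming `xs` and `ys` now cover all the bigrams in the dataset and that `num = xs.nelement()` holds the number of examples:

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)

for k in range(100):
    # forward pass
    xenc = F.one_hot(xs, num_classes=27).float()
    logits = xenc @ W
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True)
    loss = -probs[torch.arange(num), ys].log().mean()

    # backward pass
    W.grad = None
    loss.backward()

    # update, using the larger learning rate mentioned above
    W.data += -50 * W.grad

print(loss.item())      # settles around 2.45-2.47, matching the counting approach
```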
01:44:33.240 Okay. We seem to be coming up to some pretty good losses here. 2.47. Let me run 100 more.
01:44:43.880 What is the number that we expect by the way in the loss? We expect to get something around
01:44:49.080 what we had originally actually. So all the way back if you remember in the beginning of this video,
01:44:54.920 when we optimized just by counting, our loss was roughly 2.47 after we added smoothing.
01:45:03.480 But before smoothing, we had a roughly 2.45 likelihood —
01:45:07.000 sorry, loss. And so that's actually roughly the vicinity of what we expect to achieve.
01:45:13.720 But there we achieved it by counting, and here we are achieving roughly the same result,
01:45:18.680 but with gradient based optimization. So we come to about 2.46, 2.45, etc.
01:45:25.560 And that makes sense because fundamentally we're not taking in any additional information.
01:45:29.800 We're still just taking in the previous character and trying to predict the next one.
01:45:33.080 But instead of doing it explicitly by counting and normalizing,
01:45:37.000 we are doing it with gradient based learning. And it just so happens that the explicit approach
01:45:42.600 happens to very well optimize the loss function without any need for gradient based optimization.
01:45:48.440 Because the setup for bigram language models is so straightforward, so simple,
01:45:52.840 we can just afford to estimate those probabilities directly and maintain them in a table.
01:45:58.840 But the gradient based approach is significantly more flexible.
01:46:02.120 So we've actually gained a lot because what we can do now is we can expand this approach and
01:46:11.080 complexify the neural net. So currently we're just taking a single character and feeding into a
01:46:15.320 neural net and the neural that's extremely simple. But we're about to iterate on this
01:46:19.320 substantially. We're going to be taking multiple previous characters and we're going to be feeding
01:46:24.440 them into increasingly more complex neural nets. But fundamentally, the output of the neural
01:46:29.800 net will always just be logits. And those logits will go through the exact same transformation.
01:46:35.400 We are going to take them through a softmax, calculate the loss function and the negative
01:46:39.800 log likelihood and do gradient based optimization. And so actually, as we complexify the neural
01:46:46.200 nets and work all the way up to transformers, none of this will really fundamentally change.
01:46:51.800 None of this will fundamentally change. The only thing that will change is the way we do the forward
01:46:56.760 pass, where we've taken some previous characters and calculated logits for the next character
01:47:01.640 in a sequence. That will become more complex. And I will use the same machinery to optimize it.
01:47:08.120 And it's not obvious how we would have extended this bigram approach to the case where there
01:47:16.520 are many more characters at the input, because eventually these tables will get way too large
01:47:22.120 because there's way too many combinations of what previous characters could be.
01:47:26.520 If you only have one previous character, we can just keep everything in a table of
01:47:31.160 counts. But if you have the last 10 characters of context, we can't actually keep everything
01:47:36.280 in the table anymore. So this is fundamentally an unscalable approach. And the neural network
01:47:40.680 approach is significantly more scalable. And it's something that actually we can improve on over time.
01:47:46.680 So that's where we will be digging next. I wanted to point out two more things.
01:47:50.360 Number one, I want you to notice that this xenc here is made up of one-hot vectors,
01:47:58.920 and then those one-hot vectors are multiplied by this W matrix. And we think of this as multiple
01:48:05.240 neurons being forwarded in a fully connected manner. But actually what's happening here is that,
01:48:10.200 for example, if you have a one-hot vector here that has a one in, let's say, the fifth dimension,
01:48:16.920 then because of the way the matrix multiplication works, multiplying that one-hot vector with W
01:48:23.400 actually ends up plucking out the fifth row of W. The logits become just the fifth row of W.
01:48:30.440 And that's because of the way the matrix multiplication works.
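A quick check of that plucking claim, assuming the trained 27-by-27 `W` from above:

```python
import torch
import torch.nn.functional as F

# multiplying a one-hot vector by W just selects the corresponding row of W
row = F.one_hot(torch.tensor(5), num_classes=27).float() @ W
print(torch.allclose(row, W[5]))    # True: the product equals the fifth row of W
```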
01:48:36.920 So that's actually what ends up happening. But that's exactly what happened before,
01:48:42.680 because remember, all the way up here, we had a bigram: we took the first character,
01:48:49.080 and then that first character indexed into a row of this array here. And that row gave us the
01:48:56.120 probability distribution for the next character. So the first character was used as a lookup into a
01:49:03.640 matrix here to get the probability distribution. Well, that's actually exactly what's happening
01:49:07.720 here, because we're taking the index, we're encoding it as one-hot and multiplying it by w.
01:49:12.920 So logits literally becomes the appropriate row of w. And that gets just as before,
01:49:22.600 exponentiated to create the counts, and then normalized and becomes probability. So this w here
01:49:29.480 is literally the same as this array here. But w, remember, is the log counts, not the counts.
01:49:38.840 So it's more precise to say that W exponentiated, W.exp(), is this array. But this array was filled
01:49:47.640 in by counting, by basically populating the counts of the bigrams. Whereas in the gradient
01:49:54.920 base framework, we initialize it randomly, and then we let the loss guide us to arrive at the exact
01:50:01.720 same array. So this array exactly here is basically the array w at the end of optimization, except we
01:50:10.600 arrived at it piece by piece by following the loss. And that's why we also obtain the same
01:50:16.600 loss function at the end. And the second note is if I come here, remember the smoothing,
01:50:21.560 where we added fake counts to our counts in order to smooth out and make more uniform
01:50:28.280 the distributions of these probabilities. And that prevented us from assigning zero probability
01:50:33.560 to any one bigram. Now, if I increase the count here, what's happening to the probability?
01:50:41.960 As I increase the count, the probability becomes more and more uniform, right? Because these counts go
01:50:49.800 only up to like 900 or whatever. So if I'm adding plus a million to every single number here,
01:50:55.080 you can see how each row, when you divide to get its probabilities, is just going to become
01:51:00.040 closer and closer to an exactly uniform distribution. It turns out that the
01:51:05.960 gradient-based framework has an equivalent to smoothing. In particular, think through these
01:51:14.200 w's here, which we initialize randomly. We could also think about initializing w's to be zero.
01:51:21.240 If all the entries of w are zero, then you'll see that logits will become all zero. And then
01:51:29.000 exponentiating those logits becomes all one. And then the probabilities turn out to be exactly uniform.
01:51:34.840 So basically, when the w's are all equal to each other, or especially zero, then the probabilities
01:51:42.360 come out completely uniform. So trying to incentivize w to be near zero is basically
01:51:50.040 equivalent to label smoothing. And the more you incentivize that in the loss function, the more
01:51:56.120 smooth distribution you're going to achieve. So this brings us to something that's called
01:52:00.280 regularization, where we can actually augment the loss function to have a small component that we
01:52:06.120 call a regularization loss. In particular, what we're going to do is we can take w and we can,
01:52:11.880 for example, square all of its entries. And then we can, whoops, sorry about that, we can take all
01:52:19.560 the entries of W and we can sum them. And because we're squaring, there are no signs anymore:
01:52:27.080 negatives and positives all get squashed into positive numbers. And then the way this works is you
01:52:33.720 achieve zero loss if W is exactly zero. But if W has nonzero numbers, you accumulate loss.
01:52:41.160 And so we can actually take this and we can add it on here. So we can do something like loss plus
01:52:47.720 W squared dot sum — or actually, instead of sum, let's take a mean, because otherwise the
01:52:55.160 sum gets too large. So mean is a little bit more manageable. And then we have a regularization
01:53:02.680 loss here, like say 0.01 times, or something like that, you can choose the regularization
01:53:07.560 strength. And then we can just optimize this. And now this optimization actually has two components,
01:53:15.240 not only is it trying to make all the probabilities work out, but in addition to that, there's an
01:53:19.560 additional component that simultaneously tries to make all the w's be zero, because if the w's are nonzero,
01:53:25.240 you accumulate loss. And so the only way to minimize that part is for W to be zero.
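The regularized loss, as a sketch; the 0.01 strength is just an example value, and `probs`, `ys`, `num`, and `W` are assumed from the training loop above:

```python
# data term plus a small penalty that pulls every entry of W toward zero
loss = -probs[torch.arange(num), ys].log().mean() + 0.01 * (W**2).mean()
```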
01:53:29.960 And so you can think of this as adding like a spring force, or like a gravity force,
01:53:35.400 that pushes W to be zero. So W wants to be zero, and the probabilities want to be uniform,
01:53:41.320 but they also simultaneously want to match up your probabilities as indicated by the data.
01:53:46.600 And so the strength of this regularization is exactly controlling the amount of counts
01:53:53.720 that you add here. Adding a lot more counts here corresponds to increasing this number,
01:54:04.600 because the more you increase it, the more this part of the loss function dominates this part.
01:54:09.480 And the more these weights will be unable to grow, because as they grow, they accumulate
01:54:16.440 way too much loss. And so if this is strong enough, then we are not able to overcome the force of
01:54:23.880 this loss, and basically everything will just be uniform predictions. So I
01:54:29.560 thought that's kind of cool. Okay. And lastly, before we wrap up, I wanted to show you how you
01:54:33.960 would sample from this neural net model. And I copy pasted the sampling code from before,
01:54:39.720 where, remember, we sampled five times, and all we did is we started at zero, we grabbed the
01:54:47.240 row of P at the current ix, and that was our probability row, from which we sampled the next index,
01:54:54.760 accumulated that, and broke when we hit zero. And running this gave us these results.
01:55:02.280 I still have P in memory, so this is fine. Now, this p doesn't come from a row of P.
01:55:11.320 Instead, it comes from this neural net. First, we take ix, and we encode it into a one-hot row,
01:55:20.360 xenc. This xenc multiplies our W, which really just plucks out the row of W corresponding to
01:55:28.280 ix. Really, that's what's happening. And that gets our logits, and then we exponentiate those logits
01:55:34.200 to get counts, and then normalize to get the distribution, and then we can sample
01:55:39.800 from the distribution. So if I run this — kind of anticlimactic or climactic, depending how you
01:55:47.880 look at it, but we get the exact same result. And that's because this is the identical model.
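Here is a sketch of that sampler, mirroring the earlier count-based one; it assumes the trained `W` and the `itos` index-to-character mapping from earlier, and the seed is again an arbitrary choice:

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)
for i in range(5):
    out = []
    ix = 0                                           # start at the '.' token
    while True:
        xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()
        logits = xenc @ W                            # plucks out the row of W for ix
        counts = logits.exp()
        p = counts / counts.sum(1, keepdim=True)     # distribution over the next character
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])
        if ix == 0:                                  # sampled the end token
            break
    print(''.join(out))
```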
01:55:54.600 Not only does it achieve the same loss, but as I mentioned, these are identical models,
01:55:59.800 and this W is the log counts of what we've estimated before. But we came to this answer in a very
01:56:06.440 different way, and it's got a very different interpretation. But fundamentally, this is basically
01:56:10.680 the same model and gives the same samples here. And so that's kind of cool. Okay, so we've actually
01:56:16.440 covered a lot of ground. We introduced the bigram character-level language model. We saw how we can
01:56:23.000 train the model, how we can sample from the model, and how we can evaluate the quality of the model
01:56:27.800 using the negative log likelihood loss. And then we actually trained the model in two completely
01:56:32.440 different ways that actually get the same result and the same model. In the first way, we just
01:56:37.960 counted up the frequency of all the bigrams and normalized. In the second way, we used the
01:56:43.880 negative log likelihood loss as a guide to optimizing the counts matrix, or the counts array,
01:56:51.800 so that the loss is minimized in a gradient-based framework. And we saw that both of them
01:56:56.760 give the same result. And that's it. Now the second one of these, the gradient based framework,
01:57:03.400 is much more flexible. And right now, our neural net is super simple. We're taking a single
01:57:08.840 previous character, and we're taking it through a single linear layer to calculate the logits.
01:57:13.320 This is about to complexify. So in the follow up videos, we're going to be taking more and more
01:57:18.840 of these characters. And we're going to be feeding them into a neural net. But this neural net will
01:57:23.720 still output the exact same thing, the neural net will output logits. And these logits will still be
01:57:29.240 normalized in the exact same way, and all the loss and everything else in the gradient based
01:57:33.320 framework, everything stays identical. It's just that this neural net will now complexify all the way
01:57:38.920 to transformers. So that's going to be pretty awesome. And I'm looking forward to it for now. Bye.