00:00:00.000 Hi everyone. Today we are continuing our implementation of Makemore, our favorite
00:00:04.400 character level language model. Now you'll notice that the background behind me is different,
00:00:08.640 that's because I am in Kyoto and it is awesome. So I'm in a hotel room here.
00:00:12.480 Now over the last few lectures we've built up to this architecture that is a multi-layer
00:00:18.160 perceptron character level language model. So we see that it receives three previous characters
00:00:22.640 and tries to predict the fourth character in a sequence using a very simple multi-layer perceptron
00:00:27.520 using one hidden layer of neurons with tanh nonlinearities. So what we'd like to do now in this
00:00:32.640 lecture is I'd like to complexify this architecture. In particular we would like to take more characters
00:00:37.840 in a sequence as an input, not just three. And in addition to that we don't just want to feed them
00:00:42.800 all into a single hidden layer because that squashes too much information too quickly. Instead
00:00:47.920 we would like to make a deeper model that progressively fuses this information to make its guess about
00:00:53.440 the next character in a sequence. And so we'll see that as we make this architecture more complex
00:00:59.200 we're actually going to arrive at something that looks very much like a WaveNet.
00:01:02.720 So WaveNet is this paper published by DeepMind in 2016 and it is also a language model basically,
00:01:10.960 but it tries to predict audio sequences instead of character level sequences or word level sequences.
00:01:16.240 But fundamentally the modeling setup is identical. It is an autoregressive model
00:01:22.320 and it tries to predict the next character in a sequence. And the architecture actually takes
00:01:26.320 this interesting hierarchical sort of approach to predicting the next character in a sequence
00:01:31.840 with this tree-like structure. And this is the architecture and we're going to implement it
00:01:37.760 in the course of this video. So let's get started. So the starter code for part five is very similar
00:01:43.200 to where we ended up in part three. Recall that part four was the manual backpropagation exercise
00:01:48.640 that is kind of an aside. So we are coming back to part three, copy pasting chunks out of it,
00:01:53.280 and that is our starter code for part five. I've changed very few things otherwise.
00:01:57.440 So a lot of this should look familiar to you if you've gone through part three. So in particular,
00:02:01.520 very briefly, we are doing imports, we are reading our data set of words, and we are processing the
00:02:08.400 data set of words into individual examples and none of this data generation code has changed.
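As a reminder, that data generation looks roughly like this. This is a sketch reconstructed from the earlier parts of the series; `words` and `stoi` are assumed to be the name list and the character-to-integer mapping built there:

```python
import torch

# Slide a window of block_size previous characters over every word and predict
# the next character. block_size is 3 at this point in the video.
block_size = 3

def build_dataset(words):
    X, Y = [], []
    for w in words:
        context = [0] * block_size           # pad with the '.' token
        for ch in w + '.':
            ix = stoi[ch]
            X.append(context)                # the block_size previous characters
            Y.append(ix)                     # the character to predict
            context = context[1:] + [ix]     # crop and append
    return torch.tensor(X), torch.tensor(Y)
```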
00:02:13.760 And basically, we have lots and lots of examples. In particular, we have 182,000 examples of three
00:02:20.640 characters trying to predict the fourth one. And we've broken up every one of these words into
00:02:25.920 little problems of: given three characters, predict the fourth one. So this is our data set,
00:02:30.080 and this is what we're trying to get the neural net to do. Now, in part three, we started to develop
00:02:35.120 our code around these layer modules that are, for example, a class linear. And we're doing this
00:02:42.320 because we want to think of these modules as building blocks, like Lego building
00:02:47.680 bricks that we can sort of stack up into neural networks. And we can feed data between these
00:02:52.480 layers and stack them up into sort of graphs. Now, we also developed these layers to have APIs
00:02:59.920 and signatures very similar to those that are found in PyTorch. So we have torch.nn, and it's
00:03:05.120 got all these layer building blocks that you would use in practice. And we were developing
00:03:08.960 all of these to mimic APIs of these. So for example, we have linear. So there will also be a
00:03:14.560 torch.nn.Linear and its signature will be very similar to our signature. And the functionality
00:03:20.080 will be also quite identical as far as I'm aware. So we had the linear layer with the
00:03:24.320 BatchNorm1d layer and the Tanh layer that we developed previously. And Linear just does a
00:03:30.720 matrix multiply in the forward pass of this module. BatchNorm of course is this crazy layer that we
00:03:36.480 developed in the previous lecture. And what's crazy about it is, well, there's many things.
00:03:41.280 Number one, it has these running mean and variances that are trained outside of back propagation.
00:03:47.040 They are trained using an exponential moving average inside this layer when we call the forward pass.
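As a quick sketch of that running-statistics update from the previous lecture (the momentum value here is purely illustrative):

```python
import torch

momentum = 0.001
running_mean = torch.zeros(200)
running_var = torch.ones(200)

x = torch.randn(32, 200)                 # a batch of pre-activations
xmean = x.mean(0, keepdim=True)          # batch mean, shape (1, 200)
xvar = x.var(0, keepdim=True)            # batch variance, shape (1, 200)

# exponential moving average update, done outside of backpropagation
with torch.no_grad():
    running_mean = (1 - momentum) * running_mean + momentum * xmean
    running_var = (1 - momentum) * running_var + momentum * xvar
```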
00:03:53.520 In addition to that, there's this training flag, because the behavior of batch norm is
00:03:59.200 different during train time and evaluation time. And so suddenly we have to be very careful that
00:04:03.360 batch norm is in its correct state, that it's in the evaluation state or training state. So that's
00:04:07.840 something to now keep track of, and something that sometimes introduces bugs, because you forget to
00:04:12.720 put it into the right mode. And finally, we saw that batch norm couples the statistics or
00:04:17.840 the activations across the examples in the batch. So normally we thought of the batch as just an
00:04:23.280 efficiency thing. But now we are coupling the computation across batch elements. And it's
00:04:29.840 done for the purposes of controlling the activation statistics as we saw in the previous video.
00:04:34.080 So it's a very weird layer, and it leads to a lot of bugs. Partly, for example, because you have to
00:04:40.240 modulate the training and eval phase and so on. In addition, for example, you have to wait for
00:04:46.720 the mean and variance to settle and to actually reach a steady state. And so you have to make
00:04:53.360 sure that, basically, there's state in this layer, and state is usually harmful. Now,
00:05:00.320 I brought out the generator object. Previously, we had a generator equals g and so on inside these
00:05:05.760 layers. I've discarded that in favor of just initializing the torch RNG outside here just once
00:05:13.840 globally, just for simplicity. And then here we are starting to build out some of the neural
00:05:19.040 network elements. This should look very familiar. We have our embedding table C, and then we have a
00:05:24.640 list of layers. And it's a Linear feeding into a BatchNorm, feeding into a Tanh, and then a linear output layer.
00:05:31.600 And its weights are scaled down. So we are not confidently wrong at initialization.
00:05:35.760 We see that this is about 12,000 parameters. We're telling PyTorch that the parameters require
00:05:41.200 gradients. The optimization is, as far as I'm aware, identical and should look very, very familiar.
00:05:47.760 Nothing changed here. The loss function plot looks very crazy. We should probably fix this. And that's
00:05:54.240 because 32 batch elements are too few. And so you can get very lucky or unlucky in any one
00:06:00.160 of these batches, and it creates a very thick loss plot. So we're going to fix that soon.
00:06:05.200 Now, once we want to evaluate the trained neural network, we need to remember because of the
00:06:10.480 batch norm layers to set all the layers to be training equals false. So this only matters for
00:06:14.960 the batch norm layer so far. And then we evaluate. We see that currently we have validation loss of
00:06:22.240 2.10, which is fairly good, but there's still ways to go. But even at 2.10, we see that when we
00:06:29.680 sample from the model, we actually get relatively name-like results that do not exist in the training
00:06:34.480 set. So for example, a varn, kilo, a pros, a lyia, etc. So certainly not unreasonable,
00:06:45.760 I would say, but not amazing. And we can still push this validation loss even lower and get
00:06:49.840 much better samples that are even more name like. So let's improve this model. Okay, first, let's
00:06:57.360 fix this graph because it is daggers in my eyes and I just can't take it anymore. So lossi,
00:07:02.320 if you recall, is a Python list of floats. So for example, the first 10 elements look like this.
00:07:09.600 Now what we'd like to do basically is we need to average up some of these values to get a more
00:07:16.240 sort of representative value along the way. So one way to do this is as follows. In PyTorch, if I
00:07:22.640 create, for example, a tensor of the first 10 numbers, then this is currently a one-dimensional
00:07:28.880 array. But recall that I can view this array as two-dimensional. So for example, I can view it as a
00:07:34.000 two by five array. And this is a 2D tensor now, two by five. And you see what pytorch has done is
00:07:40.240 that the first row of this tensor is the first five elements. And the second row is the second
00:07:45.280 five elements. I can also view it as a five by two as an example. And then recall that I can also
00:07:51.280 use negative one in place of one of these numbers, and PyTorch will calculate what that number must
00:07:58.400 be in order to make the number of elements work out. So this can be this or like that,
00:08:04.160 but it will work. Of course, this would not work. Okay, so this allows it to spread out some of
00:08:11.760 the consecutive values into rows. So that's very helpful because what we can do now is first of
00:08:16.400 all, we're going to create a torch. tensor out of the list of floats. And then we're going to view it
00:08:23.680 as whatever it is, but we're going to stretch it out into rows of 1000 consecutive elements.
00:08:29.760 So the shape of this now becomes 200 by 1000. And each row is 1000 consecutive elements in this
00:08:38.640 list. So that's very helpful because now we can do a mean along the rows. And the shape of this
00:08:44.880 will just be 200. And so we've taken basically the mean on every row. So plt.plot of that
00:08:51.200 should be something nicer. Much better. So we see that we've basically made a lot of progress.
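For reference, a minimal sketch of that smoothing, assuming lossi is the Python list of per-step losses collected during the optimization (200,000 steps here):

```python
import torch
import matplotlib.pyplot as plt

losses = torch.tensor(lossi)       # 1D tensor of all the logged losses
losses = losses.view(-1, 1000)     # rows of 1000 consecutive steps, e.g. shape (200, 1000)
plt.plot(losses.mean(1))           # average each row -> one point per 1000 steps
plt.show()
```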
00:08:57.520 And then here, this is the learning rate decay. So here we see that the learning rate decay
00:09:02.400 subtracted a ton of energy out of the system and allowed us to settle into sort of the local
00:09:07.120 minimum in this optimization. So this is a much nicer plot. Let me come up and delete the monster.
00:09:13.920 And we're going to be using this going forward. Now next up, what I'm bothered by is that you
00:09:19.360 see our forward pass is a little bit gnarly and takes up too many lines of code. So in particular,
00:09:25.280 we see that we've organized some of the layers inside the layers list, but not all of them
00:09:29.760 for no reason. So in particular, we see that we still have the embedding table special case
00:09:35.200 outside of the layers. And in addition to that, the viewing operation here is also outside of our
00:09:40.560 layers. So let's create layers for these. And then we can add those layers to just our list.
00:09:46.400 So in particular, the two things that we need is here, we have this embedding table, and we are
00:09:51.760 indexing with the integers inside the batch, inside the tensor Xb. So that's an embedding table
00:09:59.920 lookup just done with indexing. And then here we see that we have this view operation, which if
00:10:04.880 you recall from the previous video, simply rearranges the character embeddings and stretches them out
00:10:11.920 into a row. And effectively, what that does is the concatenation operation, basically,
00:10:17.120 except it's free because viewing is very cheap in PyTorch. And no memory is being copied,
00:10:23.280 we're just re-representing how we view that tensor. So let's create modules for both of these
00:10:29.840 operations, the embedding operation and the flattening operation. So I actually wrote the code already,
00:10:36.560 just to save some time. So we have a module embedding and a module flatten. And both of them simply
00:10:43.360 do the indexing operation in a forward pass and the flattening operation here. And this
00:10:50.400 C now will just become self.weight inside an Embedding module. And I'm calling these layers
00:10:58.240 specifically embedding and flatten because it turns out that both of them actually exist in PyTorch.
00:11:02.800 So in PyTorch, we have nn.Embedding. And it also takes the number of embeddings and the
00:11:07.840 dimensionality of the embedding, just like we have here. But in addition, PyTorch takes a lot
00:11:12.160 of other keyword arguments that we are not using for our purposes yet. And for Flatten,
00:11:18.640 that also exists in PyTorch. And it also takes additional keyword arguments that we are not
00:11:23.200 using. So we have a very simple Flatten. But both of them exist in PyTorch; ours are just a bit
00:11:28.720 simpler. And now that we have these, we can simply take out some of these special-case
00:11:35.760 things. So instead of C, we're just going to have an Embedding with a vocab size and
00:11:43.600 n embed. And then after the embedding, we are going to flatten. So let's construct those modules.
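Here is a minimal sketch of these two modules in the same style as the Linear / BatchNorm1d / Tanh classes from part 3; the exact attribute names are my reconstruction:

```python
import torch

class Embedding:
    def __init__(self, num_embeddings, embedding_dim):
        self.weight = torch.randn((num_embeddings, embedding_dim))

    def __call__(self, IX):
        # plain indexing performs the embedding-table lookup
        self.out = self.weight[IX]
        return self.out

    def parameters(self):
        return [self.weight]

class Flatten:
    def __call__(self, x):
        # view is essentially free: it only re-interprets the tensor, no memory is copied
        self.out = x.view(x.shape[0], -1)
        return self.out

    def parameters(self):
        return []
```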
00:11:50.480 And now I can take out this C. And here, I don't have to special-case it anymore,
00:11:55.280 because now C is the embedding's weight, and it's inside layers. So this should just work.
00:12:02.880 And then here, our forward pass simplifies substantially, because we don't need to do these
00:12:09.040 operations outside of the layers explicitly anymore. They're now inside layers. So we can delete those.
00:12:16.240 But now to kick things off, we want this little x, which in the beginning is just xb.
00:12:22.880 The tensor of integers specifying the identities of these characters at the input.
00:12:26.800 And so these characters can now directly feed into the first layer, and this should just work.
00:12:31.600 So let me come here and insert a break, because I just want to make sure that the first iteration
00:12:37.200 of this runs, and that there's no mistake. So that ran properly. And basically, we've
00:12:42.160 substantially simplified the forward pass here. Okay, I'm sorry, I changed my microphone, so
00:12:46.640 hopefully the audio is a little bit better. Now, one more thing that I would like to do in order
00:12:51.760 to pytorchify our code even further, is that right now we are maintaining all of our modules
00:12:56.160 in a naked list of layers. And we can also simplify this, because we can introduce the concept of
00:13:02.320 pytorch containers. So in torch.nn, which we are basically rebuilding from scratch here,
00:13:07.120 there's a concept of containers. And these containers are basically a way of organizing
00:13:11.440 layers into lists or dicts and so on. So in particular, there's a Sequential,
00:13:17.760 which maintains a list of layers, and is a module class in pytorch. And it basically just passes
00:13:24.240 a given input through all the layers sequentially exactly as we are doing here. So let's write our
00:13:29.440 own Sequential. I've written the code here. And basically the code for Sequential is quite straightforward.
00:13:35.760 We pass in a list of layers, which we keep here, and then given any input in a forward pass,
00:13:41.040 we just call all the layers sequentially, and return the result. In terms of the parameters,
00:13:45.280 it's just all the parameters of the child modules. So we can run this, and we can again simplify
00:13:51.120 this substantially, because we don't maintain this naked list of layers, we now have a notion of a
00:13:55.520 model, which is a module, and in particular is a Sequential of all the layers. And now,
00:14:05.520 the parameters are simply just model.parameters(). And so that list comprehension now lives here.
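A minimal sketch of that container, again following the same module API (this mirrors what the video describes; the details are my reconstruction):

```python
class Sequential:
    def __init__(self, layers):
        self.layers = layers

    def __call__(self, x):
        # pass the input through all the layers, one after another
        for layer in self.layers:
            x = layer(x)
        self.out = x
        return self.out

    def parameters(self):
        # all the parameters of the child modules, flattened into one list
        return [p for layer in self.layers for p in layer.parameters()]
```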
00:14:13.760 And then here we are doing all the things we used to do.
00:14:16.880 Now here, the code again simplifies substantially, because we don't have to do this forwarding here;
00:14:23.600 instead we just call the model on the input data. And the input data here are the integers inside
00:14:28.320 Xb. So the logits, which are the outputs of our model, are simply the model called
00:14:34.320 on Xb. And then the cross entropy here takes the logits and the targets. So this simplifies
00:14:42.480 substantially. And then this looks good. So let's just make sure this runs, that looks good.
00:14:47.760 Now here, we actually have some work to do still here, but I'm going to come back later. For now,
00:14:53.840 there's no more layers, there's model.layers, but it's not ideal to access attributes
00:14:59.280 of these classes directly. So we'll come back and fix this later. And then here, of course,
00:15:03.920 this simplifies substantially as well, because logits are the model called on X.
00:15:10.560 And then these logits come here. So we can evaluate the train evaluation loss,
00:15:16.480 which currently is terrible, because we just reinitialized our neural net, and then we can also
00:15:20.720 sample from the model. And this simplifies dramatically as well, because we just want to call the model
00:15:26.000 on the context, and out come the logits. And then these logits go into softmax and get the probabilities,
00:15:33.520 etc. So we can sample from this model. What did I screw up? Okay, so I fixed the issue, and we now
00:15:43.760 get the result that we expect, which is gibberish, because the model is not trained, because we
00:15:48.720 reinitialize it from scratch. The problem was that when I fixed this cell to be model.layers,
00:15:53.920 instead of just layers, I did not actually run the cell. And so our neural net was in training mode.
00:16:00.000 And what caused the issue here is the batch norm layer, as a batch norm layer often likes to do,
00:16:04.160 because batch norm was in training mode. And here we are passing in an input, which is a batch
00:16:09.840 of just a single example made up of the context. And so if you are trying to pass in a single
00:16:14.880 example into a batch norm that is in the training mode, you're going to end up estimating the variance
00:16:19.280 using the input. And the variance of a single number is not a number, because it is a measure
00:16:24.560 of a spread. So for example, the variance of just a single number five, you can see it's not a number.
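As a tiny illustration:

```python
import torch

x = torch.tensor([5.0])
print(x.var())   # tensor(nan): the variance of a single number is undefined
```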
00:16:30.480 And so that's what happened. And the batch norm basically caused an issue, and then that polluted
00:16:35.760 all of the further processing. So all that we had to do was make sure that this cell runs. And basically
00:16:42.480 we missed the issue here: again, we didn't actually see the issue with the loss. We could have
00:16:47.760 evaluated the loss, but we would have gotten the wrong result, because batch norm was in training mode.
00:16:52.080 And so we still get a result, it's just the wrong result, because it's using the sample statistics
00:16:57.840 of the batch, whereas we want to use the running mean and running variance inside the batch norm.
00:17:02.160 And so, again, an example of introducing a bug inline, because we did not properly maintain the
00:17:09.680 state of what is training or not. Okay, so I re-ran everything. And here's where we are. As a
00:17:14.480 reminder, we have the training loss of 2.05 and validation 2.10. Now, because these losses are
00:17:20.560 very similar to each other, we have a sense that we are not overfitting too much on this task.
00:17:24.960 And we can make additional progress in our performance by scaling up the size of the neural
00:17:29.040 network and making everything bigger and deeper. Now, currently, we are using this architecture here,
00:17:34.960 where we are taking in some number of characters, going into a single hidden layer, and then going
00:17:39.040 to the prediction of the next character. The problem here is we don't have a naive way of making this
00:17:45.040 bigger in a productive way. We could, of course, use our layers, sort of building blocks and materials
00:17:51.680 to introduce additional layers here and make the network deeper. But it is still the case that we
00:17:56.000 are crushing all of the characters into a single layer all the way at the beginning. And even if
00:18:01.440 we make this a bigger layer and add neurons, it's all kind of silly to squash all that
00:18:06.240 information so fast in a single step. What we'd like to do instead is we'd like our network to look a
00:18:12.480 lot more like this in the WaveNet case. So you see in the WaveNet, when we are trying to make
00:18:16.960 the prediction for the next character in a sequence, it is a function of the previous characters that
00:18:21.680 feed in, but these characters are not just crushed into a
00:18:27.040 single layer, and then you have a sandwich. They are crushed slowly. So in particular, we take two
00:18:33.600 characters and we fuse them into sort of a bigram representation. And we do that for all
00:18:38.400 these characters consecutively. And then we take the bigrams and we fuse those into four-
00:18:44.400 character-level chunks. And then we fuse that again. And so we do that in this tree-like
00:18:50.160 hierarchical manner. So we fuse the information from the previous context slowly into the network
00:18:56.560 as it gets deeper. And so this is the kind of architecture that we want to implement. Now in the
00:19:01.280 WaveNet case, this is a visualization of a stack of dilated, causal convolution layers. And this
00:19:06.800 makes it sound very scary, but actually the idea is very simple. And the fact that it's a dilated
00:19:11.440 causal convolution layer is really just an implementation detail to make everything fast.
00:19:15.440 We're going to see that later. But for now, let's just keep the basic idea of it, which is this
00:19:20.160 progressive fusion. So we want to make the network deeper. And at each level, we want to fuse only
00:19:25.280 two consecutive elements: two characters, then two bigrams, then two fourgrams, and so on.
00:19:31.600 So let's implement this. Okay, so first up, let me scroll to where we built the data set. And
00:19:35.600 let's change the block size from three to eight. So we're going to be taking eight characters of
00:19:40.320 context to predict the ninth character. So the data set now looks like this. We have a lot more
00:19:45.680 context feeding in to predict any next character in a sequence. And these eight characters are
00:19:50.320 going to be processed in this tree-like structure. Now if we scroll here, everything here should
00:19:55.920 just be able to work. So we should be able to redefine the network. You see that the number of
00:20:00.480 parameters has increased by 10,000. And that's because the block size has grown. So this first
00:20:06.000 linear layer is much, much bigger. Our linear layer now takes eight characters into this middle
00:20:11.760 layer. So there's a lot more parameters there. But this should just run. Let me just break right
00:20:17.840 after the very first iteration. So you see that this runs just fine. It's just that this
00:20:22.160 network doesn't make too much sense. We're crushing way too much information way too fast.
00:20:26.880 So let's now come in and see how we could try to implement the hierarchical scheme. Now before
00:20:32.240 we dive into the detail of the re-implementation here, I was just curious to actually run it and see
00:20:37.440 where we are in terms of the baseline performance of just lazily scaling up the context length.
00:20:42.240 So I'll let it run. We get a nice loss curve. And then evaluating the loss, we actually see quite a
00:20:46.800 bit of improvement just from increasing the context length. So I started a little bit of a
00:20:51.680 performance log here. And previously where we were is we were getting performance of 2.10
00:20:57.440 on the validation loss. And now simply scaling up the context length from three to eight gives
00:21:02.400 us a performance of 2.02. So quite a bit of an improvement here. And also when you sample from
00:21:08.000 the model, you see that the names are definitely improving qualitatively as well. So we could of
00:21:13.680 course spend a lot of time here tuning, tuning things and making it even bigger and scaling up
00:21:18.800 the network further, even with this simple sort of setup here. But let's continue and let's
00:21:24.960 implement the hierarchical model, and treat this as just a rough baseline performance. But there's a lot of
00:21:30.560 optimization like left on the table in terms of some of the hyper parameters that you're hopefully
00:21:34.960 getting a sense of now. Okay, so let's scroll up now and come back up. And what I've done here is
00:21:40.880 I've created a bit of a scratch space for us to just like look at the forward pass of the neural
00:21:45.920 net and inspect the shape of the tensor along the way as the neural net forwards. So here,
00:21:52.720 I'm just temporarily, for debugging, creating a batch of just, say, four examples. So four random
00:21:58.080 integers, then I'm plucking out those rows from our training set. And then I'm passing into the
00:22:03.600 model the input XB. Now the shape of XB here, because we have only four examples is four by eight.
00:22:10.720 And this eight is now the current block size. So inspecting XB, we just see that we have four
00:22:17.920 examples. Each one of them is a row of XB. And we have eight characters here. And this integer
00:22:25.040 tensor just contains the identities of those characters. So the first layer of our neural
00:22:30.640 net is the embedding layer. So passing XB, this integer tensor through the embedding layer,
00:22:36.000 creates an output that is four by eight by 10. So our embedding table has, for each character,
00:22:43.120 a 10 dimensional vector that we are trying to learn. And so what the embedding layer does here,
00:22:48.320 is it plucks out the embedding vector for each one of these integers, and organizes it all in a
00:22:55.200 four by eight by 10 tensor now. So all of these integers are translated into 10 dimensional vectors
00:23:01.680 inside this three dimensional tensor now. Now passing that through the flatten layer, as you
00:23:06.560 recall, what this does is it views this tensor as just a four by 80 tensor. And what that
00:23:12.720 effectively does is that all these 10 dimensional embeddings for all these eight characters just
00:23:18.000 end up being stretched out into a long row. And that looks kind of like a concatenation operation
00:23:23.600 basically. So by viewing the tensor differently, we now have a four by 80. And inside this 80,
00:23:29.920 it's all the 10 dimensional vectors just concatenated next to each other. And the linear layer,
00:23:37.040 of course, takes 80 and creates 200 channels just via matrix multiplication. So so far so good.
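A sketch of forwarding those first few layers by hand to watch the shapes (assuming Xb is the debug batch of four examples, and model.layers is ordered Embedding, Flatten, Linear, ... as constructed above; the exact indices are my assumption):

```python
e = model.layers[0](Xb)     # (4, 8, 10): one 10-dim vector per character
f = model.layers[1](e)      # (4, 80):    the eight vectors stretched out into a row
h = model.layers[2](f)      # (4, 200):   matrix multiply by the (80, 200) weight
print(e.shape, f.shape, h.shape)
```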
00:23:44.480 Now I'd like to show you something surprising. Let's look at the insides of the linear layer and
00:23:50.480 remind ourselves how it works. The linear layer here in a forward pass takes the input X, multiplies
00:23:56.480 it with a weight, and then optionally adds bias. And the weight here is two dimensional as defined
00:24:01.440 here. And the bias is one dimensional here. So effectively, in terms of the shapes involved,
00:24:07.040 what's happening inside this linear layer looks like this right now. And I'm using random numbers
00:24:12.080 here, but I'm just illustrating the shapes and what happens. Basically, a four by 80 input comes
00:24:18.560 into the linear layer gets multiplied by this 80 by 200 weight matrix inside. And there's a plus
00:24:23.440 200 bias. And the shape of the whole thing that comes out of the linear layer is four by 200,
00:24:28.560 as we see here. Now notice here, by the way, that this here will create a four by 200 tensor,
00:24:35.840 and then plus 200, there's a broadcasting happening here, but four by 200 broadcasts with 200. So
00:24:42.560 everything works here. So now the surprising thing that I'd like to show you that you may not expect
00:24:47.840 is that this input here that is being multiplied, doesn't actually have to be two dimensional.
00:24:53.280 This matrix multiply operator in PyTorch is quite powerful. And in fact, you can actually pass in
00:24:58.480 higher dimensional arrays or tensors and everything works fine. So for example, this could be four
00:25:02.800 by five by 80. And the result in that case will become four by five by 200. You can add as many
00:25:08.880 dimensions as you like on the left here. And so effectively, what's happening is that the matrix
00:25:14.080 multiplication only works on the last dimension. And the dimensions before it in the input tensor
00:25:19.680 are left unchanged. So basically, these dimensions on the left are all treated as
00:25:29.200 just a batch dimension. So we can have multiple batch dimensions. And then in parallel over all
00:25:35.360 those dimensions, we are doing the matrix multiplication on the last dimension. So this is quite convenient
00:25:40.800 because we can use that in our network now, because remember that we have these eight characters coming
00:25:46.800 in. And we don't want to now flatten all of it out into one large 80-dimensional vector,
00:25:54.160 because we don't want to matrix multiply all 80 numbers into a weight matrix immediately. Instead,
00:26:02.480 we want to group these like this. So every consecutive two elements, one and two, and three and
00:26:10.400 four, and five and six, and seven and eight, all of these should be basically flattened out and
00:26:16.560 multiplied by a weight matrix. But all of these four groups here, we'd like to process in parallel.
00:26:21.920 So it's kind of like a batch dimension that we can introduce. And then we can in parallel,
00:26:28.160 basically process all of these bigram groups in the four batch dimensions of an individual
00:26:35.120 example, and also over the actual batch dimension of the, you know, four examples in our example here.
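A small demonstration of that batched matrix multiply (random numbers, just to illustrate the shapes):

```python
import torch

# The matrix multiply acts on the last dimension only; every dimension before it
# is treated as a batch dimension and processed in parallel.
x = torch.randn(4, 4, 20)      # 4 examples, 4 bigram groups, 20 numbers each
W = torch.randn(20, 200)
b = torch.randn(200)
y = x @ W + b                  # broadcasting handles the bias
print(y.shape)                 # torch.Size([4, 4, 200])
```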
00:26:40.960 So let's see how that works. Effectively, what we want is: right now, we take a four by 80 and
00:26:47.920 multiply it by 80 by 200 in the linear layer. This is what happens. But instead, what we want
00:26:54.800 is we don't want 80 characters, or 80 numbers, to come in. We only want two characters to come in
00:27:00.880 on the very first layer. And those two characters should be fused. So in other words, we just want
00:27:06.560 20 to come in, right? 20 numbers would come in. And here we don't want a four by 80 to feed into
00:27:13.440 the linear layer. We actually want these groups of two to feed in. So instead of four by 80, we want
00:27:18.800 this to be a four by four by 20. So these are the four groups of two, and each one of them packs two
00:27:27.760 10 dimensional vectors. So what we want now is we need to change the flatten layer, so it doesn't
00:27:33.120 output a four by 80, but it outputs a four by four by 20, where basically every two
00:27:40.560 consecutive characters are packed in on the very last dimension. And then this four is the first
00:27:47.680 batch dimension, and this four is the second batch dimension, referring to the four groups inside
00:27:53.360 every one of these examples. And then this will just multiply like this. So this is what we want to
00:27:58.800 get to. So we're gonna have to change the linear layer in terms of how many inputs it expects. It
00:28:03.440 shouldn't expect 80. It should just expect 20 numbers. And we have to change our flatten layer.
00:28:08.960 So it doesn't just fully flatten out this entire example. It needs to create a four by four by 20,
00:28:15.440 instead of a four by 80. So let's see how this could be implemented. Basically, right now we have
00:28:20.640 an input that is a four by eight by 10 that feeds into the flatten layer. And currently,
00:28:26.000 the flatten layer just stretches it out. So if you remember the implementation of flatten,
00:28:30.480 it takes our X and it just views it as whatever the batch dimension is, and then negative one.
00:28:36.080 So effectively what it does right now is e.view(4, -1), and the shape of this,
00:28:42.960 of course, is four by 80. So that's what currently happens. And we instead want this to be a four by
00:28:49.520 four by 20, where these consecutive 10 dimensional vectors get concatenated. So you know how in Python
00:28:56.400 you can take a list, range(10), so we have numbers from zero to nine. And we can index like
00:29:04.000 this to get all the even parts. And we can also index starting at one and going in steps of two
00:29:09.920 to get all the odd parts. So one way to implement this would be as follows: we can take e,
00:29:17.920 and we can index into it for all the batch elements. And then just even elements in this dimension.
00:29:24.880 So at index zero, two, four, and six. And then all the parts of this last dimension.
00:29:32.160 And this gives us the even characters. And then here, this gives us all the odd characters.
00:29:41.600 And basically what we want to do is we want to make sure that these get concatenated
00:29:45.440 in PyTorch. And then we want to concatenate these two tensors along the second dimension.
00:29:51.200 So this and the shape of it would be four by four by 20. This is definitely the result we want.
00:29:58.320 We are explicitly grabbing the even parts and the odd parts, and we're arranging those two
00:30:03.360 four by four by 10 tensors right next to each other and concatenating them. So this works. But it turns out
00:30:09.920 that what also works is you can simply use a view again and just request the right shape. And it
00:30:16.640 just so happens that in this case, those vectors will again end up being arranged in exactly the
00:30:22.000 way we want. So in particular, if we take e and we just view it as a four by four by 20, which is
00:30:27.200 what we want, we can check that this is exactly equal to, let me call this the explicit
00:30:33.760 concatenation, I suppose. So explicit.shape is four by four by 20. If you just view it as
00:30:40.960 four by four by 20, you can check that when you compare it to explicit, you get a big tensor of booleans; this is
00:30:47.520 an element-wise operation. So making sure that all of them are true, the value comes out True.
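A sketch of that check (the tensor e is the four by eight by 10 output of the embedding layer from the scratch cell above):

```python
import torch

# Explicitly grab the even and odd character embeddings and concatenate them
# along the last dimension, then compare against a plain view.
explicit = torch.cat([e[:, ::2, :], e[:, 1::2, :]], dim=2)   # (4, 4, 20)
viewed = e.view(4, 4, 20)                                    # (4, 4, 20)
print((explicit == viewed).all())                            # tensor(True)
```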
00:30:51.680 So basically long story short, we don't need to make an explicit call to concatenate, etc.
00:30:57.360 We can simply take this input tensor to flatten and we can just view it in whatever way we want.
00:31:04.000 And in particular, we don't want to stretch things out with negative one. We want to actually
00:31:09.600 create a three dimensional array. And depending on how many consecutive vectors we want to
00:31:15.760 fuse, for example two, we can just simply ask for this dimension to be 20, and
00:31:24.080 use negative one here, and PyTorch will figure out how many groups it needs to pack into this
00:31:28.400 additional batch dimension. So let's now go into flatten and implement this. Okay, so I scrolled up
00:31:34.080 here to flatten. And what we'd like to do is we'd like to change it now. So let me create a constructor
00:31:39.520 and take the number of elements that are consecutive that we would like to concatenate now in the last
00:31:44.480 dimension of the output. So here we're just going to remember self.n = n. And then I
00:31:51.200 want to be careful here, because PyTorch actually has a torch.flatten, and its keyword arguments
00:31:56.400 are different and they kind of function differently. So our Flatten is going to start to
00:32:00.800 depart from PyTorch's Flatten. So let me call it FlattenConsecutive or something like that,
00:32:06.080 just to make it clear it's not the same as PyTorch's Flatten. So this basically flattens only some
00:32:13.520 n consecutive elements and puts them into the last dimension. Now here, the shape of X is B by
00:32:20.480 T by C. So let me pop those out into variables. And recall that in our example down below,
00:32:27.600 B was four, T was eight, and C was 10. Now, instead of doing x.view(B, -1),
00:32:44.160 we want this to be B by negative one by, basically, C times n. That's how many
00:32:53.200 consecutive elements we want. And here instead of negative one, I don't super love to use up
00:32:59.360 negative one because I like to be very explicit so that you get error messages when things don't
00:33:03.600 go according to your expectation. So what do we expect here? We expect this to become T,
00:33:09.120 divide and using integer division here. So that's what I expect to happen. And then one more thing
00:33:15.280 I want to do here is remember previously, all the way in the beginning, n was three, and basically
00:33:21.680 we're concatenating all the three characters that existed there. So we basically concatenated
00:33:28.560 everything. And so sometimes that can create a spurious dimension of one here. So if it is the
00:33:33.840 case that X dot shape at one is one, then it's kind of like a spurious dimension. So we don't want to
00:33:41.040 return a three dimensional tensor with a one here, we just want to return a two dimensional tensor
00:33:46.480 exactly as we did before. So in this case, basically, we will just say X equals X dot squeeze.
00:33:52.400 That is a PyTorch function. And squeeze takes a dimension: it either squeezes out all the
00:34:02.160 dimensions of a tensor that are one, or you can specify the exact dimension that you want to be
00:34:08.240 squeezed. And again, I like to be as explicit as possible always. So I expect to squeeze out the
00:34:13.600 first dimension only of this tensor, this three dimensional tensor. And if this dimension here is
00:34:20.000 one, then I just want to return B by C times n. And so self dot out will be X. And then we return
00:34:27.200 self dot out. So that's the candidate implementation. And of course, this should be self.n
00:34:33.120 instead of just n. So let's run. And let's come here now. And take it for a spin. So flatten
00:34:41.280 consecutive. And in the beginning, let's just use eight. So this should recover the previous
00:34:49.120 behavior. So flatten consecutive of eight, which is the current block size. You can do this.
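For reference, a minimal sketch of the candidate FlattenConsecutive module just described, using the same module API as the other layers (this is my reconstruction of the code on screen):

```python
class FlattenConsecutive:
    def __init__(self, n):
        self.n = n  # how many consecutive elements to concatenate in the last dimension

    def __call__(self, x):
        B, T, C = x.shape
        x = x.view(B, T // self.n, C * self.n)   # group n consecutive C-dim vectors
        if x.shape[1] == 1:
            x = x.squeeze(1)                     # drop a spurious dimension of size one
        self.out = x
        return self.out

    def parameters(self):
        return []
```

On a (4, 8, 10) input, FlattenConsecutive(8) gives (4, 80), which is the old behavior, and FlattenConsecutive(2) gives (4, 4, 20).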
00:34:56.960 That should recover the previous behavior. So we should be able to run the model. And here we can
00:35:03.760 inspect. I have a little code snippet here, where I iterate over all the layers and print the name of
00:35:10.960 the class and the shape of its output. And so we see the shapes as we expect them after every single
00:35:19.120 layer. So now let's try to restructure it using our FlattenConsecutive
00:35:24.720 and do it hierarchically. So in particular, we want to flatten consecutively not the whole
00:35:30.320 block size, but just two. And then we want to process this with a Linear. Now, the number
00:35:36.320 of inputs to this Linear will not be n_embd times block size. It will now only be n_embd times
00:35:41.520 two, i.e. 20. This goes through the first layer. And now we can in principle just copy paste this.
00:35:49.760 Now the next Linear layer should expect n_hidden times two. And the last piece of it
00:35:57.680 should expect n_hidden times two again. So this is sort of the naive version of it.
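A sketch of that naive hierarchical stack, using the modules defined above (vocab_size, n_embd, n_hidden as defined earlier; whether bias is disabled before batch norm is my assumption, and n_hidden is chosen later in the video to be 68 to match the parameter budget):

```python
# Three levels of fusion: characters -> bigrams -> fourgrams -> full context.
model = Sequential([
    Embedding(vocab_size, n_embd),
    FlattenConsecutive(2), Linear(n_embd * 2, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
    FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
    FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
    Linear(n_hidden, vocab_size),
])
```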
00:36:03.520 So running this, we now have a much, much bigger model. And we should be able to basically just
00:36:10.720 forward the model. And now we can inspect the numbers in between. So the four by eight by 10 was flattened
00:36:20.080 consecutively into a four by four by 20. This was projected into four by four by 200.
00:36:25.200 And then batch norm just worked out of the box. And we have to verify that batch norm does the
00:36:31.600 correct thing, even though it takes a three dimensional input instead of a two dimensional input.
00:36:36.320 Then we have Tanh, which is element-wise, then we crushed it again, so FlattenConsecutive,
00:36:42.080 and ended up with a four by two by 400 now. Then Linear brought it back down to 200,
00:36:47.440 batch norm, Tanh. And lastly, we get a four by 400. And we see that the FlattenConsecutive, for the
00:36:52.880 last flatten here, squeezed out that dimension of one. So we only ended up with four by 400.
00:36:59.280 And then Linear, batch norm, Tanh, and the last Linear layer to get our logits. And so the
00:37:05.280 logits end up in the same shape as they were before. But now we actually have a nice three layer
00:37:10.400 neural net. And it basically corresponds to whoops, sorry, it basically corresponds exactly to this
00:37:16.320 network now, except on this piece here, because we only have three layers. Whereas here in this
00:37:21.920 example, there's four layers with a total receptive field size of 16 characters, instead of just
00:37:29.120 eight characters, the block size here is 16. So this piece of it is basically implemented here.
00:37:35.440 Now we just have to kind of figure out some good channel numbers to use here. Now in particular,
00:37:42.000 I changed the number of hidden units to be 68 in this architecture, because when I use 68,
00:37:47.440 the number of parameters comes out to be 22,000. So that's exactly the same that we had before.
00:37:52.640 And we have the same amount of capacity of this neural net in terms of the number of parameters.
00:37:57.200 But the question is whether we are utilizing those parameters in a more efficient architecture.
00:38:00.720 So what I did then is I got rid of a lot of the debugging cells here, and we ran the optimization,
00:38:07.200 and scrolling down to the result, we see that we get the identical performance roughly.
00:38:12.720 So our validation loss now is 2.029, and previously it was 2.027. So controlling for the number of
00:38:18.720 parameters, changing from the flat to hierarchical is not giving us anything yet. That said, there
00:38:24.080 are two things to point out. Number one, we didn't really torture the architecture here very much.
00:38:30.000 This is just my first guess. And there's a bunch of hyperparameter searches that we could do
00:38:34.080 in terms of how we allocate our budget of parameters to which layers. Number two, we still
00:38:40.400 may have a bug inside the BatchNorm1d layer. So let's take a look at that, because it runs,
00:38:48.880 but does it do the right thing? So I pulled up the sort of layer inspector that we have here,
00:38:54.560 and printed out the shapes along the way. And currently it looks like the batch norm is receiving
00:38:58.560 an input that is 32 by 4 by 68. And here on the right, I have the current implementation of
00:39:04.800 batch norm that we have right now. Now this batch norm assumed, in the way we wrote it at the time,
00:39:10.400 that X is two dimensional. So it was N by D, where N was the batch size. So that's why we only reduced
00:39:17.920 the mean and the variance over the 0th dimension. But now X will basically become three dimensional.
00:39:23.120 So what's happening inside the batch norm layer right now, and how come it's working at all,
00:39:26.480 and not giving any errors? The reason for that is basically because everything broadcasts
00:39:30.800 properly, but the batch norm is not doing what we want it to do. So in particular, let's basically
00:39:37.280 think through what's happening inside the batch norm, looking at what's happening here.
00:39:43.680 I have the code here. So we're receiving an input of 32 by 4 by 68. And then we are doing
00:39:50.480 here x.mean; here I have e instead of x. But we're doing the mean over dimension 0, and that
00:39:57.360 actually gives us one by four by 68. So we're doing the mean only over the very first dimension,
00:40:02.480 and it gives us a mean and a variance that still maintain this dimension here. So these
00:40:08.240 means are only taken over 32 numbers in the first dimension. And then when we perform this,
00:40:14.160 everything broadcasts correctly still. But basically, look at what ends up happening:
00:40:22.160 the shape of it. So I'm looking at model.layers[3], which is the first batch norm
00:40:29.280 layer, and then looking at whatever the running mean became, and its shape. The shape of this
00:40:34.800 running mean now is one by four by 68. Right? Instead of it being just the size of one dimension,
00:40:42.640 because we have 68 channels, we expect to have 68 means and variances that we're maintaining.
00:40:48.400 But actually, we have an array of four by 68. And so basically what this is telling us is
00:40:52.960 this batch norm is currently working in parallel over
00:40:58.480 four times 68 channels, instead of just 68 channels. So basically, we are maintaining statistics for
00:41:09.600 every one of these four positions individually and independently. And instead, what we want to do
00:41:14.720 is we want to treat this four as a batch dimension, just like the zeroth dimension. So as far as the
00:41:21.440 batch norm is concerned, we don't want to average over just 32 numbers. We want
00:41:26.560 to now average over 32 times four numbers for every single one of these 68 channels. So let me
00:41:33.680 now remove this. It turns out that when you look at the documentation of torch.mean,
00:41:43.440 in one of its signatures, when we specify the dimension, we see that the dimension is not
00:41:54.640 just an int; it can also be a tuple of ints. So we can reduce over multiple integers at
00:42:01.040 the same time, over multiple dimensions. So instead of just reducing over zero,
00:42:05.520 we can pass in a tuple (0, 1), and here (0, 1) as well. And then what's going to happen is the output,
00:42:11.920 of course, is going to be the same. But now what's going to happen is because we reduce over 0 and 1,
00:42:16.560 if we look at its shape, we see that now we've reduced, we took the mean over both the
00:42:24.400 0 and the first dimension. So we're just getting 68 numbers and a bunch of spurious dimensions here.
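A tiny sketch of that reduction (random data, shapes only):

```python
import torch

e = torch.randn(32, 4, 68)               # (batch, groups, channels)
emean = e.mean((0, 1), keepdim=True)     # reduce over both batch dimensions
evar = e.var((0, 1), keepdim=True)
print(emean.shape, evar.shape)           # torch.Size([1, 1, 68]) for both
```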
00:42:30.800 So now this becomes one by one by 68. And the running mean and the running variance
00:42:35.920 analogously will become one by one by 68. So even though there are spurious dimensions,
00:42:40.320 the correct thing will happen, in that we are only maintaining means and
00:42:45.360 variances for 68 channels, and we're not keeping separate statistics for each of the
00:42:52.640 four positions. So that's exactly what we want. And let's change the implementation of BatchNorm1d
00:42:58.320 that we have, so that it can take in two dimensional or three dimensional inputs and
00:43:03.760 perform accordingly. So at the end of the day, the fix is relatively straightforward. Basically,
00:43:08.400 the dimension we want to reduce over is either zero or the tuple zero and one, depending on the
00:43:13.840 dimensionality of x. So if x.ndim is two, so it's a two dimensional tensor, then the dimension we
00:43:19.760 want to reduce over is just the integer zero. Elif x.ndim is three, so it's a three
00:43:24.640 dimensional tensor, then the dims we're going to reduce over are zero and one.
00:43:30.160 And then here we just pass in dim. And if the dimensionality of x is anything else, we'll
00:43:35.360 now get an error, which is good. So that should be the fix. Now I want to point out one more thing.
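A sketch of the updated layer with that fix, following the previous lecture's implementation (the reduction dimension is now chosen from the input's dimensionality; defaults are my reconstruction):

```python
import torch

class BatchNorm1d:
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        self.training = True
        # parameters (trained with backprop)
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)
        # buffers (trained with a running momentum update)
        self.running_mean = torch.zeros(dim)
        self.running_var = torch.ones(dim)

    def __call__(self, x):
        if self.training:
            if x.ndim == 2:
                dim = 0          # (N, C): average over the batch dimension
            elif x.ndim == 3:
                dim = (0, 1)     # (N, L, C): both leading dimensions are batch dims
            else:
                raise ValueError("expected 2D or 3D input")
            xmean = x.mean(dim, keepdim=True)   # per-channel batch mean
            xvar = x.var(dim, keepdim=True)     # per-channel batch variance
        else:
            xmean = self.running_mean
            xvar = self.running_var
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)
        self.out = self.gamma * xhat + self.beta
        if self.training:
            with torch.no_grad():
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar
        return self.out

    def parameters(self):
        return [self.gamma, self.beta]
```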
00:43:42.240 We're actually departing from the API of PyTorch here a little bit, because when you go to
00:43:46.320 BatchNorm1d in PyTorch, you can scroll down and you can see that the input to this layer can either be
00:43:52.400 N by C, where N is the batch size and C is the number of features or channels, or it actually
00:43:57.680 does accept three dimensional inputs, but it expects it to be n by c by L, where L is say like
00:44:03.600 the sequence length or something like that. So this is a problem, because you see how C is nested
00:44:09.920 here in the middle. And so when it gets three dimensional inputs, this batch norm layer will
00:44:15.360 reduce over zero and two, instead of zero and one. So basically, PyTorch's BatchNorm1d layer
00:44:22.480 assumes that C will always be the first dimension, whereas we assume here that C is the last dimension
00:44:30.320 and there are some number of batch dimensions beforehand. And so it expects N by C or N by C
00:44:38.480 by L; we expect N by C or N by L by C. And so it's a deviation. I think it's okay. I prefer it this
00:44:48.560 way honestly. So this is the way that we will keep it for our purposes. So I redefined the layers,
00:44:53.200 re-initialized the neural net, and did a single forward pass with a break just for one step.
00:44:57.680 Looking at the shapes along the way, they're of course identical. All the shapes are the same.
00:45:02.320 The way we see that things are actually working as we want them to now is that when we look at
00:45:06.800 the batch norm layer, the running mean shape is now one by one by 68. So we're only maintaining
00:45:11.760 68 means for every one of our channels. And we're treating both the 0th and the first dimension as
00:45:17.360 a batch dimension, which is exactly what we want. So let me retrain the neural net now. Okay, so
00:45:21.760 I retrained the neural net with the bug fix. We get a nice curve. And when we look at the validation
00:45:25.840 performance, we do actually see a slight improvement. So it went from 2.029 to 2.022. So basically
00:45:31.920 the bug inside the batch norm was holding us back a little bit, it looks like. And we are getting
00:45:37.520 a tiny improvement now, but it's not clear if this is statistically significant. And the reason
00:45:43.200 we slightly expect an improvement is because we're not maintaining so many different means
00:45:47.120 and variances that are only estimated using 32 numbers effectively. Now we are estimating
00:45:52.480 them using 32 times four numbers. So you just have a lot more numbers that go into any one estimate
00:45:57.760 of the mean and variance. And it allows things to be a bit more stable and less wiggly inside those
00:46:03.360 estimates of those statistics. So pretty nice. With this more general architecture in place,
00:46:09.280 we are now set up to push the performance further by increasing the size of the network.
00:46:13.360 So for example, I bumped up the number of embedding dimensions to 24 instead of 10, and also increased the number
00:46:19.520 of hidden units. But using the exact same architecture, we now have 76,000 parameters. And the training
00:46:25.760 takes a lot longer. But we do get a nice curve. And then when you actually evaluate the performance,
00:46:30.880 we are now getting validation performance of 1.993. So we've crossed over the 2.0 sort of territory.
00:46:37.680 And we're at about 1.99. But we are starting to have to wait quite a bit longer. And we're a little
00:46:43.440 bit in the dark with respect to the correct setting of the hyper parameters here and the learning
00:46:47.360 rates and so on, because the experiments are starting to take longer to train. And so we are
00:46:51.360 missing sort of like an experimental harness on which we could run a number of experiments and
00:46:56.400 really tune this architecture very well. So I'd like to conclude now with a few notes.
00:46:59.920 We basically improved our performance from a starting point of 2.1 down to 1.99. But I don't want that
00:47:06.320 to be the focus because honestly, we're kind of in the dark. We have no experimental harness. We're
00:47:10.480 just guessing and checking. And this whole thing is terrible. We're just looking at the training loss.
00:47:15.040 Normally, you want to look at both the training and the validation loss together. The whole thing
00:47:19.920 looks different if you're actually trying to squeeze out numbers. That said, we did implement this
00:47:25.760 architecture from the WaveNet paper. But we did not implement the specific forward pass of it,
00:47:32.320 where you have a more complicated linear layer, this sort of gated linear layer, kind of.
00:47:37.520 And there's residual connections and skip connections and so on. So we did not implement that. We just
00:47:42.880 implemented this structure. I would like to briefly hint at or preview how what we've done here relates
00:47:48.640 to convolutional neural networks as used in the WaveNet paper. And basically, the use of convolutions
00:47:53.840 is strictly for efficiency. It doesn't actually change the model we've implemented. So here,
00:47:59.520 for example, let me look at a specific name to work with an example. So there's a name in our
00:48:05.120 training set and it's DeAndre. And it has seven letters. So that is eight independent examples in
00:48:11.200 our model. So all these rows here are independent examples of DeAndre. Now you can forward, of course,
00:48:17.920 any one of these rows independently. So I can take my model and call it on any individual index.
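A sketch of that call (assuming Xtr is the training tensor of eight-character contexts built earlier and model is the Sequential we defined; the index 7 and the vocabulary size of 27 are illustrative):

```python
print(Xtr[7].shape)           # torch.Size([8]):    no batch dimension, would error if forwarded
print(Xtr[[7]].shape)         # torch.Size([1, 8]): a batch of a single example
logits = model(Xtr[[7]])      # note the extra brackets around the index
print(logits.shape)           # torch.Size([1, 27])
```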
00:48:25.520 Notice, by the way, here, I'm being a little bit tricky. The reason for this is that Xtr at seven,
00:48:30.560 that shape is just a one dimensional array of eight. So you can't actually call the model on it. You're
00:48:38.000 going to get an error, because there's no batch dimension. So when you do Xtr at a
00:48:45.840 list of seven, then the shape of this becomes one by eight. So I get an extra batch dimension of one,
00:48:51.280 and then we can forward the model. So that forwards a single example. And you might imagine that you
00:48:58.240 might actually want to forward all of these eight at the same time. So preallocating some
00:49:04.880 memory and then doing a for loop, eight times, and forwarding all of those eight here will give us
00:49:10.400 all the logits in all these different cases. Now for us with the model, as we've implemented it
00:49:15.120 right now, this is eight independent calls to our model. But what convolutions allow you to do
00:49:20.400 is they allow you to basically slide this model efficiently over the input sequence. And so this
00:49:26.640 for loop can be done not outside in Python, but inside of kernels in CUDA. And so this for loop gets
00:49:33.600 hidden into the convolution. So the convolution basically, you can think of it as it's a for loop
00:49:39.200 applying a little linear filter over space of some input sequence. And in our case, the space
00:49:45.120 we're interested in is one dimensional, and we're interested in sliding these filters over the input
00:49:49.360 data. So this diagram actually is fairly good as well. Basically, what we've done is here they are
00:49:57.040 highlighting in black one single sort of tree of this calculation. So just
00:50:02.720 calculating a single output example here. And so this is basically what we've implemented here.
00:50:09.600 We've implemented this single black structure, and calculated a single
00:50:15.200 output, like a single example. But what convolutions allow you to do is they allow you to take this
00:50:20.400 black structure and kind of like slide it over the input sequence here and calculate all of these
00:50:27.600 orange outputs at the same time. Or here, that corresponds to calculating all of these outputs
00:50:33.280 at all the positions of DeAndre at the same time. And the reason that this is much more
00:50:40.720 efficient is because number one, as I mentioned, the for loop is inside the CUDA kernels in the
00:50:46.000 sliding. So that makes it efficient. But number two, notice the variable reuse here. For example,
00:50:52.160 if we look at this circle, this node here, this node here is the right child of this node,
00:50:57.680 but it's also the left child of the node here. And so basically this node and its value is used
00:51:04.240 twice. And so right now in this naive way, we'd have to recalculate it. But here we are allowed
00:51:11.200 to reuse it. So in the convolutional neural network, you think of these linear layers that we have
00:51:16.000 it up above as filters. And we take these filters and their linear filters and you slide them over
00:51:22.160 input sequence. And we calculate the first layer, and then the second layer, and then the third layer,
00:51:27.200 and then the output layer of the sandwich. And it's all done very efficiently using these
00:51:31.360 convolutions. So we're going to cover that in a future video. The second thing I hope you took
00:51:35.600 away from this video is you've seen me basically implement all of these layer, Lego building blocks,
00:51:41.440 or module building blocks. And I'm implementing them over here. And we've implemented a number of
00:51:46.640 layers together. And we've also implemented these containers. And we've overall pytorchified
00:51:52.800 our code quite a bit more. Now basically what we're doing here is we're re-implementing torch.nn,
00:51:58.640 which is the neural networks library on top of torch.Tensor. And it looks very much like this,
00:52:04.800 except it is much better, because it's in PyTorch, instead of
00:52:09.440 my janky Jupyter notebook. So I think going forward, I will probably consider us as having unlocked
00:52:15.840 torch.nn. We understand roughly what's in there, how these modules work, how they're nested,
00:52:20.400 and what they're doing on top of torch.Tensor. So hopefully we'll just switch over and continue
00:52:26.160 and start using torch.nn directly. The next thing I hope you got a bit of a sense of is what the
00:52:30.960 development process of building deep neural networks looks like, which I think was relatively
00:52:35.600 representative to some extent. So number one, we are spending a lot of time in the documentation
00:52:40.880 page of pytorch. And we're reading through all the layers looking at the documentations,
00:52:45.840 where are the shapes of the inputs, what can they be, what does the layer do, and so on.
00:52:51.600 Unfortunately, I have to say the pytorch documentation is not very good. They spend a ton of time on
00:52:57.760 hardcore engineering of all kinds of distributed primitives, etc. But as far as I can tell,
00:53:02.320 no one is maintaining documentation. It will lie to you, it will be wrong, it will be incomplete,
00:53:08.240 it will be unclear. So unfortunately, it is what it is, and you just kind of do your best
00:53:13.600 with what they've given us. Number two, the other thing that I hope you got a sense of is
00:53:23.040 there's a ton of trying to make the shapes work. And there's a lot of gymnastics around these
00:53:27.360 multi-dimensional arrays. And are they two-dimensional, three-dimensional, four-dimensional? What layers
00:53:32.080 take what shapes? Is it NCL or NLC? And you're permuting and viewing, and it just can get pretty
00:53:39.200 messy. And so that brings me to number three. I very often prototype these layers and implementations
00:53:44.640 in Jupyter Notebooks and make sure that all the shapes work out. And I'm spending a lot of time,
00:53:49.200 basically babysitting the shapes and making sure everything is correct. And then once I'm satisfied
00:53:54.000 with the functionality in a Jupyter Notebook, I will take that code and copy paste it into my
00:53:57.920 repository of actual code that I'm training with. And so then I'm working with VS code on the side.
00:54:04.160 So I usually have Jupyter Notebook and VS code. I develop in Jupyter Notebook, I paste into VS code,
00:54:08.720 and then I kick off experiments from the repo, of course, from the code repository.
00:54:13.440 So that's roughly some notes on the development process of working with neural nets.
00:54:17.920 Lastly, I think this lecture unlocks a lot of potential
00:54:20.880 further lectures, because number one, we have to convert our neural network to actually use
00:54:25.120 these dilated causal convolutional layers. So, implementing convolutions. Number two,
00:54:31.040 I potentially started to get into what this part means, with the residual connections and skip
00:54:35.440 connections, and why they are useful. Number three, as I mentioned, we don't have any
00:54:40.880 experimental harness. So right now I'm just guessing and checking everything. This is not representative
00:54:45.440 of typical deep learning workflows. You have to set up your evaluation harness, so you can kick
00:54:50.000 off experiments. You have lots of arguments that your script can take. You're kicking off
00:54:54.240 a lot of experimentation. You're looking at a lot of plots of training and validation losses,
00:54:58.240 and you're looking at what is working and what is not working. And you're working on this like
00:55:01.760 population level, and you're doing all these hyperparameters searches. And so we've done none
00:55:06.400 of that so far. So how to set that up and how to make it good, I think, is a whole another topic.
00:55:13.200 And number four, we should probably cover recurrent neural networks:
00:55:16.720 RNNs, LSTMs, GRUs, and of course, Transformers. So many places to go,
00:55:23.600 and we'll cover that in the future. For now, sorry, I forgot to say that if you are interested,
00:55:29.120 I think it is kind of interesting to try to beat this number, 1.993. Because I really haven't
00:55:34.560 tried a lot of experimentation here, and there's quite a bit of low-hanging fruit potentially to still
00:55:38.560 push this further. So I haven't tried any other ways of allocating these channels in this neural
00:55:43.680 net. Maybe the number of dimensions for the embedding is all wrong. Maybe it's possible to
00:55:49.280 actually take the original network with just one hidden layer and make it big enough and
00:55:53.360 actually beat my fancy hierarchical network. It's not obvious. It'll be kind of embarrassing if this
00:55:59.440 did not do better, even once you torture it a little bit. Maybe you can read the WaveNet paper and
00:56:04.080 try to figure out how some of these layers work and implement them yourselves using what we have.
00:56:07.920 And of course, you can always tune some of the initialization or some of the optimization
00:56:13.760 and see if we can improve it that way. So I'd be curious if people can come up with some ways to