00:00:08.000 >> Please welcome AI researcher and founding member of OpenAI, Andrej Karpathy.
00:00:23.000 I'm happy to be here to tell you about the state of GPT
00:00:27.000 and more generally about the rapidly growing ecosystem of large language models.
00:00:32.000 So I would like to partition the talk into two parts.
00:00:35.000 In the first part, I would like to tell you about how we train GPT
00:00:39.000 And then in the second part, we are going to take a look at how we can use
00:00:43.000 these assistants effectively for your applications.
00:00:46.000 So first, let's take a look at the emerging recipe for how to train these
00:00:49.000 assistants, and keep in mind that this is all very new and still rapidly evolving.
00:00:53.000 But so far, the recipe looks something like this.
00:00:56.000 Now, this is kind of a complicated slide, so I'm going to go through it piece by piece.
00:01:00.000 But roughly speaking, we have four major stages.
00:01:03.000 Pre-training, supervised fine-tuning, reward modeling,
00:01:06.000 reinforcement learning, and they follow each other serially.
00:01:10.000 Now, in each stage, we have a data set that powers that stage.
00:01:15.000 We have an algorithm that, for our purposes, will be an objective for optimizing over that data set, and then we have a resulting model.
00:01:26.000 So the first stage we're going to start with is the pre-training stage.
00:01:29.000 Now, this stage is kind of special in this diagram, and this diagram is not to scale,
00:01:33.000 because this stage is where all of the computational work basically happens.
00:01:36.000 This is 99% of the training compute time, and also flops.
00:01:42.000 And so this is where we are dealing with internet-scale data sets with thousands of
00:01:47.000 GPUs in a supercomputer, and also months of training potentially.
00:01:51.000 The other three stages are fine-tuning stages that are much more along the lines of
00:01:59.000 a small number of GPUs and hours or days of training. So let's take a look at the pre-training stage to achieve a base model.
00:02:04.000 First, we're going to gather a large amount of data.
00:02:07.000 Here's an example of what we call a data mixture that comes from this paper that was
00:02:12.000 released by Meta, where they released this LLaMA base model.
00:02:16.000 Now, you can see roughly the kinds of data sets that enter into these collections.
00:02:20.000 So we have Common Crawl, which is just a web scrape, C4, which is also Common Crawl,
00:02:27.000 and then some high-quality data sets as well: for example, GitHub, Wikipedia, books, arXiv, Stack Exchange, and so on.
00:02:31.000 These are all mixed up together, and then they are sampled according to some given
00:02:35.000 proportions, and that forms the training set for the neural net, for the GPT.
00:02:40.000 Now, before we can actually train on this data, we need to go through one more
00:02:46.000 preprocessing step, and that is tokenization. This is basically a translation of the raw text that we scrape from the internet
00:02:51.000 into sequences of integers, because that's the native representation over which GPTs function.
00:02:57.000 Now, this is a lossless kind of translation between pieces of text and tokens and integers,
00:03:03.000 and there are a number of algorithms for this stage.
00:03:06.000 Typically, for example, you could use something like byte pair encoding, which
00:03:09.000 iteratively merges little text chunks and groups them into tokens.
00:03:13.000 And so, here I'm showing some example chunks of these tokens, and then this is the raw
00:03:18.000 integer sequence that will actually feed into a transformer.
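As a concrete illustration (my own example, not code from the talk), here is roughly what this lossless text-to-token translation looks like using OpenAI's tiktoken library, one of several BPE implementations:

```python
# A minimal tokenization sketch using OpenAI's tiktoken library; the exact
# tokenizer/vocabulary chosen here is just for illustration.
import tiktoken

enc = tiktoken.get_encoding("gpt2")           # GPT-2/GPT-3 style BPE vocabulary
tokens = enc.encode("Hello world, this is tokenization.")
print(tokens)                                 # e.g. [15496, 995, 11, ...] -- a list of integers
print(enc.decode(tokens))                     # lossless: recovers the original text exactly
```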
00:03:21.000 Now, here I'm showing two examples for hyperparameters that govern this stage.
00:03:28.000 So, for GPT-4, we did not release too much information about how it was trained and so on, so I'm using
00:03:33.000 GPT-3's numbers, but GPT-3 is, of course, a little bit old by now, about three years old.
00:03:40.000 So, these are roughly the orders of magnitude that we're dealing with when we're doing pre-training.
00:03:45.000 The vocabulary size is usually a couple tens of thousands of tokens.
00:03:48.000 The context length is usually something like 2,000, 4,000, or nowadays even 100,000,
00:03:53.000 and this governs the maximum number of integers that the GPT will look at when it's trying to
00:04:02.000 predict the next integer in the sequence. You can see that roughly the number of parameters is, say, 65 billion for LLaMA.
00:04:06.000 Now, even though LLaMA has only 65 billion parameters compared to GPT-3's 175 billion parameters,
00:04:11.000 LLaMA is a significantly more powerful model, and intuitively, that's because the model is
00:04:18.000 trained on significantly more data: in this case, 1.4 trillion tokens instead of just 300 billion tokens.
00:04:21.000 So, you shouldn't judge the power of a model just by the number of parameters that it contains.
00:04:26.000 Below, I'm showing some tables of rough hyperparameters that typically go into specifying
00:04:34.000 these transformers: the number of heads, the dimension size, the number of layers, and so on.
00:04:38.000 And on the bottom, I'm showing some training hyperparameters.
00:04:41.000 So, for example, to train the 65 billion parameter model, Meta used 2,000 GPUs, roughly 21 days of training,
00:04:52.000 and roughly several million dollars. And so, that's the rough order of magnitude that you should have in mind for the pre-training stage.
00:04:58.000 Now, when we're actually pre-training, what happens?
00:05:01.000 Roughly speaking, we are going to take our tokens and we're going to lay them out into data batches.
00:05:06.000 So, we have these arrays that will feed into the transformer, and these arrays are B by T:
00:05:11.000 B, the batch size, where the rows are independent examples stacked up, and T,
00:05:18.000 the maximum context length. So, in my picture, I only have 10 as the context length.
00:05:25.000 And what we do is we take these documents and we pack them into rows, and we delimit them
00:05:29.000 with these special end-of-text tokens that basically tell the transformer where a new document begins.
00:05:35.000 And so, here I have a few examples of documents, and then I've stretched them out into these rows.
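As a minimal sketch of that packing step (an illustration with assumed toy token ids, not the actual data pipeline):

```python
import numpy as np

# Hypothetical token ids for three short documents, plus an end-of-text delimiter id
# (50256 is the GPT-2 <|endoftext|> id, used here purely for illustration).
EOT = 50256
docs = [[5962, 22307, 25], [1026, 373, 262, 1266], [40, 1101, 3772]]

B, T = 2, 10                                   # batch size and context length for this toy example
stream = []
for d in docs:
    stream.extend(d + [EOT])                   # concatenate documents, delimited by EOT
stream = (stream + [EOT] * (B * T))[: B * T]   # pad/truncate to exactly B*T cells (toy simplification)
batch = np.array(stream).reshape(B, T)         # these rows are what actually feeds into the transformer
print(batch)
```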
00:05:42.000 Now, we're going to feed all of these numbers into a transformer.
00:05:46.000 And let me just focus on a single particular cell, but the same thing will happen at every cell in this diagram.
00:05:54.000 The green cell is going to take a look at all of the tokens before it.
00:06:00.000 And we're going to feed that entire context into the transformer neural network.
00:06:05.000 And the transformer is going to try to predict the next token in the sequence, in this case the token shown in red.
00:06:10.000 Now, unfortunately I don't have too much time to go into the full details of this transformer neural network.
00:06:15.000 It's just a large blob of neural net stuff for our purposes, and it's got several tens of billions of parameters or something like that.
00:06:22.000 And, of course, as you tune these parameters, you're getting slightly different predicted
00:06:25.000 distributions for every single one of these cells.
00:06:29.000 And so, for example, if our vocabulary size is 50,257 tokens, then we're going to have
00:06:35.000 that many numbers, because we need to specify a probability distribution for what comes next.
00:06:41.000 So, basically, we have a probability for whatever may follow.
00:06:43.000 Now, in this specific example for this specific cell, 513 will come next.
00:06:47.000 And so, we can use this as a source of supervision to update our transformers' weights.
00:06:51.000 And so, we're applying this basically on every single cell in parallel, and we keep swapping
00:06:56.000 batches, and we're trying to get the transformer to make the correct predictions over what token comes next in the sequence.
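A minimal sketch of that training objective, using stand-in tensors rather than a real transformer, is just a per-cell cross-entropy loss against the true next token:

```python
import torch
import torch.nn.functional as F

B, T, V = 4, 10, 50257                           # batch size, context length, vocabulary size
tokens = torch.randint(0, V, (B, T + 1))         # stand-in token stream (one extra column for targets)
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict token t+1 from tokens up to t

# In a real model, `logits` would come from transformer(inputs); here it is a random stand-in.
logits = torch.randn(B, T, V, requires_grad=True)

loss = F.cross_entropy(logits.reshape(B * T, V), targets.reshape(B * T))
loss.backward()                                  # gradients of this loss are what update the weights
```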
00:07:02.000 So, let me show you more concretely what this looks like when you train one of these models.
00:07:05.000 This is actually coming from New York Times, and they trained a small GPT on Shakespeare.
00:07:11.000 And so, here's a small snippet of Shakespeare, and they trained a GPT on it.
00:07:14.000 Now, in the beginning, at initialization, the GPT starts with completely random weights.
00:07:19.000 So, you're just getting completely random outputs as well.
00:07:22.000 But over time, as you train the GPT longer and longer, you are getting more and more coherent
00:07:31.000 and consistent samples. And the way you sample from it, of course, is you predict what comes next, you sample
00:07:36.000 from that distribution, and you keep feeding that back into the process, and you can basically
00:07:42.000 sample whole sequences. And so, by the end, you see that the transformer has learned about words and where to put spaces and commas and so on.
00:07:48.000 And so, we're making more and more consistent predictions over time.
00:07:51.000 These are the kinds of plots that you look at when you're doing model pre-training.
00:07:55.000 Effectively, we're looking at the loss function over time as you train, and low loss means
00:08:00.000 that our transformer is giving a higher probability to the correct next token in the sequence.
00:08:07.000 Now, what are we going to do with this model once we've trained it after a month?
00:08:11.000 Well, the first thing that we noticed as a field is that these models basically in the
00:08:17.000 process of language modeling learn very powerful general representations, and it's possible to
00:08:22.000 very efficiently fine-tune them for any arbitrary downstream task you might be interested in.
00:08:26.000 So, as an example, if you're interested in sentiment classification, the approach used to
00:08:31.000 be that you collect a bunch of positives and negatives, and then you train some kind of
00:08:37.000 But the new approach is ignore sentiment classification, go off and do large language model pre-training,
00:08:43.000 train a large transformer, and then you may only have a few examples for your task, and you can
00:08:48.000 very efficiently fine-tune your model for that task.
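As a rough conceptual sketch of that fine-tuning approach (a stand-in model and a small linear head, not any particular production recipe):

```python
import torch
import torch.nn as nn

# pretrained_transformer is assumed to return hidden states of shape (batch, seq_len, hidden_dim);
# nn.Identity() is a placeholder standing in for a real pretrained model.
hidden_dim, num_classes = 768, 2
pretrained_transformer = nn.Identity()
classifier_head = nn.Linear(hidden_dim, num_classes)

def classify(token_features):
    hidden = pretrained_transformer(token_features)   # (B, T, hidden_dim)
    pooled = hidden[:, -1, :]                          # use the last token's features as a summary
    return classifier_head(pooled)                     # (B, num_classes) logits

x = torch.randn(8, 16, hidden_dim)                     # fake batch of 8 sequences of 16 token features
logits = classify(x)
loss = nn.functional.cross_entropy(logits, torch.randint(0, num_classes, (8,)))
loss.backward()                                        # a few labeled examples can go a long way
```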
00:08:53.000 And the reason for this is that basically the transformer is forced to multitask across a huge
00:08:58.000 number of tasks in the language modeling objective, because just in terms of predicting the next
00:09:03.000 token, it's forced to understand a lot about the structure of the text and all the different concepts therein.
00:09:11.000 Now, around the time of GPT-2, people noticed that actually even better than fine-tuning,
00:09:15.000 you can actually prompt these models very effectively.
00:09:18.000 So these are language models, and they want to complete documents.
00:09:20.000 So you can actually trick them into performing tasks just by arranging these fake documents.
00:09:25.000 So in this example, for example, we have some passage, and then we sort of like do QA,
00:09:30.000 QA, QA, this is called a few-shot prompt, and then we do Q, and then as the transformer is trying
00:09:35.000 to complete the document, it's actually answering our question.
00:09:38.000 So this is an example of prompt engineering a base model: making it believe that it's sort
00:09:42.000 of imitating a document and getting it to perform a task.
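A hypothetical few-shot prompt of this kind might look roughly like the following (the passages and questions here are made up for illustration):

```python
# A made-up few-shot "fake document" prompt: the base model completes the last answer.
prompt = """Passage: Tom Brady was drafted by the New England Patriots in 2000.
Q: Which team drafted Tom Brady?
A: The New England Patriots.

Passage: The Amazon is the largest rainforest on Earth.
Q: What is the largest rainforest on Earth?
A: The Amazon.

Passage: California's population is about 39 million people.
Q: What is the population of California?
A:"""
# Feeding `prompt` to a base language model and sampling a completion typically
# yields the answer, because the model is just continuing the document.
```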
00:09:46.000 And so this kicked off, I think, the era of, I would say, prompting over fine-tuning, and seeing
00:09:51.000 that this actually can work extremely well on a lot of problems, even without training any neural networks or fine-tuning.
00:09:57.000 Now, since then, we've seen an entire evolutionary tree of base models that everyone has trained.
00:10:03.000 Not all of these models are available. For example, the GPT4 base model was never released.
00:10:08.000 The GPT4 model that you might be interacting with over API is not a base model.
00:10:12.000 It's an assistant model, and we're going to cover how to get those in a bit.
00:10:16.000 The GPT-3 base model is available via the API under the name davinci, and the GPT-2 base model weights are available on our GitHub repo.
00:10:25.000 But currently the best available base model is probably the LLaMA series from Meta, although it is not commercially licensed.
00:10:34.000 Now, one thing to point out is base models are not assistants.
00:10:38.000 They don't want to answer your questions; they just want to complete documents.
00:10:43.000 So if you ask them to write a poem about bread and cheese, they will just, you know, answer your question with more questions.
00:10:49.000 They are just completing what they think is a document.
00:10:52.000 However, you can prompt them in a specific way for base models that is more likely to work.
00:10:57.000 So, as an example, if you phrase your prompt as "Here is a poem about bread and cheese," the base model is much more likely to complete it the way you intended.
00:11:03.000 You can even trick base models into being assistants.
00:11:06.000 The way you would do this is you would create a specific few shot prompt that makes it look
00:11:11.000 like there's some kind of a document between a human and an assistant and they're exchanging information.
00:11:17.000 At the bottom, you sort of put your query at the end, and the base model will sort of
00:11:22.000 condition itself into being like a helpful assistant and kind of answer.
00:11:27.000 But this is not very reliable and doesn't work super well in practice, although it can be done.
00:11:31.000 So instead, we have a different path to make actual GPT assistants, not just base model document completers.
00:11:37.000 And so that takes us into supervised fine tuning.
00:11:39.000 So in the supervised fine-tuning stage, we are going to collect small but high-quality data sets.
00:11:45.000 In this case, we're going to ask human contractors to gather data of the form: prompt and ideal response.
00:11:52.000 And we're going to collect lots of these, typically tens of thousands or something like that.
00:11:56.000 And then we're going to still do language modeling on this data.
00:12:03.000 So nothing changed algorithmically; we just swap out the training set. It used to be internet documents, which is high quantity but low quality, and we swap it out for basically
00:12:13.000 Q&A prompt-response data, which is low quantity but high quality. So we still do language modeling, and then after training, we get an SFT model.
00:12:18.000 And you can actually deploy these models, and they are actual assistants, and they work to some extent.
00:12:23.000 Let me show you what an example demonstration might look like.
00:12:26.000 So here's something that a human contractor might come up with.
00:12:29.000 Can you write a short introduction about the relevance of the term monopsony in economics, or something like that.
00:12:35.000 And then the contractor also writes out an ideal response.
00:12:37.000 And when they write out these responses, they are following extensive labeling documentation,
00:12:41.000 and they are being asked to be helpful, truthful, and harmless.
00:12:52.000 And this is just people following instructions and trying to complete these prompts.
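A minimal sketch of what the SFT objective can look like (my own illustration; real pipelines differ in formatting details): concatenate the prompt and the ideal response, run the usual language modeling loss, and often mask out the loss on the prompt tokens so that only the response is supervised:

```python
import torch
import torch.nn.functional as F

V = 50257
prompt_ids = torch.tensor([101, 202, 303])            # hypothetical token ids of the prompt
response_ids = torch.tensor([404, 505, 606, 707])     # hypothetical ideal response from a contractor

input_ids = torch.cat([prompt_ids, response_ids])
targets = input_ids.clone()
targets[: len(prompt_ids)] = -100                     # -100 = ignore_index: no loss on prompt tokens

logits = torch.randn(len(input_ids), V, requires_grad=True)   # stand-in for the model's predictions
# Standard next-token shift: predict position t+1 from positions up to t.
loss = F.cross_entropy(logits[:-1], targets[1:], ignore_index=-100)
loss.backward()
```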
00:13:00.000 Now, you can actually continue the pipeline from here on and go into RLHF, reinforcement
00:13:05.000 learning from human feedback, which consists of both reward modeling and reinforcement learning.
00:13:12.000 And then I'll come back to why you may want to go through the extra steps and how that helps.
00:13:16.000 So in the reward modeling step, what we're going to do is we're now going to shift our
00:13:20.000 data collection to be of the form of comparisons.
00:13:23.000 So here's an example of what our data set will look like.
00:13:25.000 I have the same prompt, identical prompt on the top, which is asking the assistant to
00:13:30.000 write a program or function that checks if a given string is a palindrome.
00:13:35.000 And then what we do is we take the SFT model, which we've already trained, and we create
00:13:41.000 So in this case, we have three completions that the model has created.
00:13:44.000 And then we ask people to rank these completions.
00:13:47.000 So if you stare at this for a while (and, by the way, comparing some of these completions is a very difficult thing to
00:13:51.000 do, and it can take people even hours for a single prompt-completion comparison),
00:13:58.000 let's say we decided that one of these is much better than the others, and so on.
00:14:03.000 Then we can follow that with something that looks very much kind of like a binary classification
00:14:07.000 on all the possible pairs between these completions.
00:14:10.000 So what we do now is we lay out our prompt in rows, and the prompt is identical across all three rows.
00:14:19.000 And so the yellow tokens are coming from the SFT model.
00:14:22.000 Then what we do is we append another special reward readout token at the end, and we basically
00:14:28.000 only supervise the transformer at this single green token.
00:14:32.000 And the transformer will predict some reward for how good that completion is for that prompt.
00:14:38.000 And so basically it makes a guess about the quality of each completion, and then once
00:14:43.000 it makes a guess for every one of them, we also have the ground truth, which is telling
00:14:48.080 us which one of these is the best. And so we can actually enforce that some of these numbers should be much higher than others.
00:14:52.400 We formulate this into a loss function, and we train our model to make reward predictions
00:14:56.400 that are consistent with the ground truth coming from the comparisons from all these contractors.
00:15:02.960 And that allows us to score how good a completion is for a prompt.
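One common way to turn those rankings into a training signal, sketched here from public descriptions of RLHF rather than any specific codebase, is a pairwise loss that pushes the reward of the preferred completion above the reward of the less preferred one:

```python
import torch
import torch.nn.functional as F

# Scalar rewards the model read out at the special reward token, for one prompt and a
# pair of completions where humans preferred the first one (values are made up).
reward_chosen = torch.tensor([1.2], requires_grad=True)
reward_rejected = torch.tensor([-0.4], requires_grad=True)

# Pairwise ranking loss: low when reward_chosen exceeds reward_rejected by a wide margin.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
loss.backward()   # in practice this gradient flows back into the transformer producing the rewards
```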
00:15:07.600 Once we have a reward model, we can't deploy this because this is not very useful as an
00:15:12.360 assistant by itself, but it's very useful for the reinforcement learning stage that
00:15:17.360 Because we have a reward model, we can score the quality of any arbitrary completion for
00:15:23.160 So what we do during reinforcement learning is we basically get, again, a large collection
00:15:26.680 of prompts, and now we do reinforcement learning with respect to the reward model.
00:15:37.840 We use the model we'd like to train, which is initialized at the SFT model, to create some completions, shown in yellow.
00:15:43.920 And then we append the reward token again, and we read off the reward according to the reward model, which is now kept fixed.
00:15:52.400 And now the reward model tells us the quality of every single completion for each of these prompts.
00:15:57.400 And so what we can do is we can now just basically apply the same language modeling loss function,
00:16:02.000 but we're currently training on the yellow tokens, and we are weighing the language modeling
00:16:08.080 objective by the rewards indicated by the reward model.
00:16:11.400 So as an example, in the first row, the reward model said that this is a fairly high-scoring completion.
00:16:17.400 And so all the tokens that we happened to sample on the first row are going to get reinforced,
00:16:22.400 and they're going to get higher probabilities for the future.
00:16:25.600 Conversely, on the second row, the reward model really did not like this completion; it gave it a low score.
00:16:30.040 And so therefore, every single token that we sampled in that second row is going to get
00:16:36.400 a slightly lower probability in the future. And we do this over and over on many prompts, on many batches, and basically we get a policy
00:16:41.160 that creates yellow tokens here such that all of the completions
00:16:45.880 here will score high according to the reward model that we trained in the previous stage.
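Very schematically, and leaving out the PPO machinery, KL penalties, and other safeguards that real RLHF implementations use, the update weights the log-probabilities of the sampled tokens by the reward of the completion they belong to:

```python
import torch

# Stand-ins: log-probabilities the current policy assigned to the tokens it sampled,
# one row per completion, and the scalar reward the reward model gave each completion.
logprobs = torch.randn(3, 8, requires_grad=True)   # (num_completions, tokens_per_completion)
rewards = torch.tensor([1.0, -1.2, 0.3])           # reward model scores per completion (made up)

# REINFORCE-style objective: high-reward completions get their tokens' probabilities pushed up,
# negative-reward completions get theirs pushed down. Real RLHF uses PPO with more safeguards.
loss = -(rewards.unsqueeze(1) * logprobs).mean()
loss.backward()
```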
00:16:51.600 So that's how we train; that's what the RLHF pipeline is.
00:16:55.840 Now, and then at the end, you get a model that you could deploy.
00:16:58.720 And so, as an example, ChatGPT is an RLHF model.
00:17:02.720 But some other models that you might come across, like, for example, Vicuna-13B and so on, these are SFT models.
00:17:08.320 So we have base models, SFT models, and RLHF models, and that's kind of like the state of things there. Now, why would you want to do RLHF?
00:17:17.120 So one answer that is kind of not that exciting is that it just works better.
00:17:23.240 According to these experiments from a while ago, these PPO models are RLHF models, and we see
00:17:28.400 that they are basically just preferred in a lot of comparisons when we give them to humans.
00:17:33.760 So humans basically just prefer the tokens that come from RLHF models compared to SFT
00:17:38.800 models, compared to base model that is prompted to be an assistant.
00:17:44.000 But you might ask why, why does it work better?
00:17:47.360 And I don't think that there's a single amazing answer that the community has really agreed on,
00:17:55.720 but one potential reason has to do with the asymmetry between how easy computationally it is to compare versus to generate.
00:18:02.400 So let's take an example of generating a haiku.
00:18:05.200 Suppose I ask a model to write a haiku about paperclips.
00:18:08.200 If you're a contractor trying to provide training data, then imagine being a contractor collecting
00:18:12.480 data for the SFT stage: how are you supposed to create a nice haiku for a paperclip? You might just not be very good at that.
00:18:18.920 But if I give you a few examples of haikus, you might be able to appreciate some of these haikus a lot more than others.
00:18:24.320 And so judging which one of these is good is a much easier task.
00:18:27.600 And so basically this asymmetry makes it so that comparisons are a better way to potentially
00:18:33.360 leverage yourself as a human and your judgment to create a slightly better model.
00:18:38.160 Now, RLHF models are not strictly an improvement on base models in every respect.
00:18:43.760 So in particular, we've noticed, for example, that they lose some entropy, which means they give more peaky results.
00:18:52.320 The base model, by contrast, has lots of entropy and will give lots of diverse outputs.
00:18:57.320 So for example, one kind of place where I still prefer to use a base model is in a setup
00:19:03.320 where you basically have n things and you want to generate more things like it.
00:19:10.320 And so here is an example that I just cooked up.
00:19:15.320 I gave it seven Pokemon names and I asked a base model to complete the document, and it gave me many more Pokemon names.
00:19:28.320 And this is the kind of task that I think a base model would be good at, because it still
00:19:32.320 has lots of entropy and will give you lots of diverse, cool kinds of things that look like whatever you gave it before.
00:19:40.320 So, having said all that, these are kind of like the
00:19:44.320 assistant models that are probably available to you at this point.
00:19:47.320 There's a team at Berkeley that ranked a lot of the available assistant models and gave them ELO ratings.
00:19:53.320 So currently some of the best models are GPT-4 by far, I would say, followed by Claude,
00:19:58.320 GPT-3.5, and then a number of models, some of which might be available as weights, like Vicuna, Koala, and so on.
00:20:04.320 And the first three rows here are all RLHF models, and all of the other
00:20:10.320 models, to my knowledge, are SFT models, I believe.
00:20:16.320 So that's how we train these models on the high level.
00:20:19.320 Now I'm going to switch gears and let's look at how we can best apply the GPT assistant model to your problems.
00:20:26.320 Now I would like to work in a setting of a concrete example.
00:20:32.320 Let's say that you are working on an article or a blog post and you're going to write this sentence at the end:
00:20:38.320 California's population is 53 times that of Alaska.
00:20:41.320 So for some reason you want to compare the populations of these two states.
00:20:45.320 Think about the rich internal monologue and tool use, and how much work actually goes on computationally
00:20:50.320 in your brain to generate this one final sentence.
00:20:53.320 So here's maybe what that could look like in your brain.
00:20:55.320 Okay, for this next step of my blog, let me compare these two populations.
00:21:00.320 Okay, first I'm going to obviously need to get both of these populations.
00:21:05.320 Now I know that I probably don't know these populations off the top of my head.
00:21:09.320 So I'm kind of aware of what I know or don't know of my self-knowledge, right?
00:21:13.320 So I do some tool use and I look up California's population and Alaska's population.
00:21:22.320 But again, I know that dividing 39.2 by 0.74 is unlikely to succeed.
00:21:26.320 That's not the kind of thing that I can do in my head.
00:21:29.320 And so therefore I'm going to rely on the calculator.
00:21:32.320 So I'm going to use a calculator, punch it in and see that the output is roughly 53.
00:21:37.320 And then maybe I do some reflection and sanity checks in my brain.
00:21:42.320 Well, that's quite a large fraction, but then California is the most populous state, so maybe that looks okay.
00:21:47.320 So then I have all the information I might need, and now I get to the sort of creative portion of writing.
00:21:52.320 So I might start to write something like "California has 53x times greater," and then I think to myself, that's kind of an awkward phrasing.
00:22:00.320 So let me actually delete that and let me try again.
00:22:03.320 And so as I'm writing, I have this separate process almost inspecting what I'm writing and judging whether it looks good or not.
00:22:10.320 And then maybe I delete and maybe I reframe it and then maybe I'm happy with what comes out.
00:22:15.320 So basically long story short, a ton happens under the hood in terms of your internal monologue when you create sentences like this.
00:22:21.320 But what does a sentence like this look like when we are training a GPT on it?
00:22:26.320 From GPT's perspective, this is just a sequence of tokens.
00:22:30.320 So a GPT, when it's reading or generating these tokens, it just goes chunk chunk chunk chunk chunk.
00:22:36.320 And each chunk is roughly the same amount of computational work for each token.
00:22:40.320 And these transformers are not very shallow networks.
00:22:43.320 They have about 80 layers of reasoning, but 80 is still not too much.
00:22:47.320 And so this transformer is going to do its best to imitate, but of course the process here looks very, very different from the process that you took.
00:22:56.320 So in particular, in our final artifacts, in the data sets that we create and then eventually feed to LLMs,
00:23:02.320 all of that internal dialogue is completely stripped.
00:23:06.320 And unlike you, the GPT will look at every single token and spend the same amount of compute on every one of them.
00:23:12.320 And so you can't expect it to do too much work per token.
00:23:19.320 And also, in particular, these transformers are basically just token simulators.
00:23:28.320 They don't know what they're good at or not good at. They just try their best to imitate the next token.
00:23:32.320 They don't reflect in the loop. They don't sanity check anything.
00:23:35.320 They don't correct their mistakes along the way by default. They just sample token sequences.
00:23:40.320 They don't have separate inner monologue streams in their head, right?
00:23:45.320 Now, they do have some sort of cognitive advantages, I would say.
00:23:49.320 And that is that they do actually have very large fact-based knowledge across a vast number of areas
00:23:54.320 because they have, say, several 10 billion parameters. So that's a lot of storage for a lot of facts.
00:23:59.320 And they also, I think, have a relatively large and perfect working memory.
00:24:04.320 So whatever fits into the context window is immediately available to the transformer through its internal self-attention mechanism.
00:24:12.320 And so it's kind of like perfect memory, but it's got a finite size.
00:24:16.320 But the transformer has very direct access to it, and so it can
00:24:19.320 losslessly remember anything that is inside its context window.
00:24:24.320 So that's kind of how I would compare those two. And the reason I bring all of this up is because I think to a large extent
00:24:29.320 prompting is just making up for this sort of cognitive difference between these two kind of architectures.
00:24:36.320 Like our brains here and LLM brains. You can look at it that way almost.
00:24:41.320 So here's one thing that people found, for example, works pretty well in practice.
00:24:45.320 Especially if your tasks require reasoning, you can't expect the transformer to do too much reasoning per token.
00:24:52.320 And so you have to really spread out the reasoning across more and more tokens.
00:24:56.320 So, for example, you can't give a transformer a very complicated question and expect it to get the answer in a single token.
00:25:01.320 There's just not enough time for it. These transformers need tokens to think, quote unquote, I like to say sometimes.
00:25:06.320 And so this is some of the things that work well.
00:25:09.320 So, for example, I have a few shot prompt that shows the transformer that it should like show its work when it's answering a question.
00:25:15.320 And if you give a few examples, the transformer will imitate that template and it will just end up working out better in terms of its evaluation.
00:25:24.320 Additionally, you can elicit this kind of behavior from the transformer by saying "let's think step by step," because this conditions
00:25:30.320 the transformer into sort of showing its work.
00:25:33.320 And because it kind of snaps into a mode of showing its work, it has to do less computational work per token.
00:25:39.320 And so it's more likely to succeed as a result, because the reasoning is spread out over more tokens.
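As a made-up illustration of the difference between the two prompting styles:

```python
# Two hypothetical prompts for the same question; the second tends to elicit
# intermediate reasoning tokens and therefore more reliable answers.
direct_prompt = "Q: A train travels 120 km in 1.5 hours. What is its average speed?\nA:"
cot_prompt = (
    "Q: A train travels 120 km in 1.5 hours. What is its average speed?\n"
    "A: Let's think step by step."
)
```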
00:25:46.320 Here's another example. This one is called self consistency.
00:25:49.320 We saw that we had the ability to start writing, and then if it didn't work out, we could try again multiple times and maybe select the one that worked best.
00:26:00.320 So in these kinds of approaches, you may sample not just once, but you may sample multiple times and then have some process for finding the ones that are good and then keeping just those samples or doing a majority vote or something like that.
00:26:11.320 So basically these transformers, in the process of predicting the next token, just like you, can get unlucky and sample a not-very-good token, and they can go down sort of a blind alley in terms of reasoning.
00:26:23.320 And unlike you, they cannot recover from that. They are stuck with every single token they sample, and so they will continue the sequence even if they know that it is not going to work out.
00:26:34.320 So give them the ability to look back, inspect, or basically sample around it.
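A minimal self-consistency sketch, assuming hypothetical sample_completion and extract_answer helpers for whatever LLM you are using:

```python
from collections import Counter

def self_consistent_answer(prompt, sample_completion, extract_answer, k=10):
    """Sample k completions and return the majority-vote answer.

    `sample_completion` (prompt -> text, sampled at some temperature) and
    `extract_answer` (text -> final answer) are hypothetical callables standing
    in for your own LLM client and answer parser.
    """
    answers = [extract_answer(sample_completion(prompt)) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```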
00:26:41.320 Here's one other technique. It turns out that LLMs actually often know when they've screwed up.
00:26:47.320 So as an example, say you asked the model to generate a poem that does not rhyme and it might give you a poem but it actually rhymes.
00:26:55.320 But it turns out that especially for the bigger models like GPT-4, you can just ask it, did you meet the assignment?
00:27:01.320 And actually GPT-4 knows very well that it did not meet the assignment. It just kind of got unlucky in its sampling.
00:27:07.320 And so it will tell you, no, I didn't actually meet the assignment. Here's, let me try again.
00:27:10.320 But without you prompting it, it doesn't know to revisit its answer on its own.
00:27:17.320 So you have to make up for that in your prompts. You have to get it to check. If you don't ask it to check, it's not going to check by itself.
00:27:29.320 I think more generally a lot of these techniques fall into the bucket of what I would call recreating our System 2.
00:27:34.320 So you might be familiar with System 1 versus System 2 thinking for humans.
00:27:37.320 System 1 is a fast, automatic process, and I think it kind of corresponds to an LLM just sampling tokens.
00:27:43.320 And System 2 is the slower, deliberate, planning part of your brain.
00:27:48.320 And so this is a paper actually from just last week because this space is pretty quickly evolving.
00:27:53.320 It's called Tree of Thought. And in Tree of Thought, the authors of this paper propose maintaining multiple completions for any given prompt.
00:28:02.320 And then they are also scoring them along the way and keeping the ones that are going well, if that makes sense.
00:28:08.320 And so a lot of people are like really playing around with kind of prompt engineering to basically bring back some of these abilities that we sort of have in our brain for LLMs.
00:28:19.320 Now one thing I would like to note here is that this is not just a prompt.
00:28:22.320 This is actually prompts that are used together with some Python glue code, because you have to maintain multiple prompts and you also have to do some tree search algorithm here
00:28:31.320 to figure out which prompts to expand, etc. So it's a symbiosis of Python glue code and individual prompts that are called in a while loop or in a bigger algorithm.
00:28:42.320 I also think there's a really cool parallel here to AlphaGo.
00:28:45.320 AlphaGo has a policy for placing the next stone when it plays Go.
00:28:48.320 And this policy was trained originally by imitating humans.
00:28:52.320 But in addition to this policy, it also does Monte Carlo tree search.
00:28:56.320 And basically it will play out a number of possibilities in its head and evaluate all of them and only keep the ones that work well.
00:29:01.320 And so I think this is kind of an equivalent of AlphaGo but for text, if that makes sense.
00:29:08.320 So just like Tree of Thought, I think more generally people are starting to explore more general techniques: not just simple question-answer prompts, but something that looks a lot more like Python glue code stringing together many prompts.
00:29:21.320 So on the right, I have an example from this paper called React where they structure the answer to a prompt as a sequence of thought action observation, thought action observation, and it's a full rollout, kind of a thinking process to answer the query.
00:29:38.320 And in these actions, the model is also allowed to use tools.
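A schematic sketch of such a thought/action/observation loop (the llm and tools arguments are hypothetical stand-ins, and the format here differs from the paper's actual prompts):

```python
def react_loop(question, llm, tools, max_steps=5):
    """A schematic thought/action/observation loop in the spirit of ReAct.

    `llm` (prompt -> text) and `tools` (dict of name -> callable) are hypothetical
    stand-ins; the Thought/Action/Observation format is just a convention that the
    prompt teaches the model and that this glue code parses.
    """
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")            # model produces a thought and proposed action
        transcript += "Thought:" + step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        if "Action:" in step:
            name, _, arg = step.split("Action:")[-1].strip().partition(" ")
            observation = tools.get(name, lambda a: "unknown tool")(arg)
            transcript += f"Observation: {observation}\n"
    return transcript   # fall back to the raw rollout if no final answer emerged
```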
00:29:44.320 On the left, I have an example of AutoGPT. Now, AutoGPT, by the way, is a project that got a lot of hype recently.
00:29:51.320 And I think I still find it kind of inspirationally interesting.
00:29:55.320 It's a project that allows an LLM to sort of keep task list and continue to recursively break down tasks.
00:30:02.320 And I don't think this currently works very well and I would not advise people to use it in practical applications.
00:30:07.320 But I think it's something to take inspiration from in terms of where this is going over time.
00:30:13.320 So that's kind of like giving our model System 2 thinking.
00:30:16.320 The next thing I find interesting is this, I would say, almost psychological quirk of LLMs: LLMs don't want to succeed. They want to imitate. You want to succeed, and you should ask for it.
00:30:31.320 So what I mean by that is when transformers are trained, they have training sets.
00:30:37.320 And there can be an entire spectrum of performance qualities in their training data.
00:30:41.320 So for example, there could be some kind of a prompt for some physics question or something like that.
00:30:45.320 And there could be a student solution that is completely wrong.
00:30:48.320 But there can also be an expert answer that is extremely right.
00:30:51.320 And transformers can't tell the difference between them; or rather, they know about both low-quality solutions and high-quality solutions,
00:30:58.320 and by default they want to imitate all of it, because they are just trained on language modeling.
00:31:02.320 And so at test time, you actually have to ask for a good performance.
00:31:06.320 So in this example in this paper, they tried various prompts.
00:31:10.320 And "let's think step by step" was very powerful, because it sort of spreads out the reasoning over many tokens.
00:31:15.320 But what worked even better is let's work this out in a step by step way to be sure we have the right answer.
00:31:20.320 And so it's kind of like conditioning on getting a right answer.
00:31:23.320 And this actually makes the transformer work better.
00:31:25.320 Because the transformer doesn't have to now hedge its probability mass on low quality solutions as ridiculous as that sounds.
00:31:32.320 And so basically, feel free to ask for a strong solution.
00:31:36.320 Say something like, "You are a leading expert on this topic," or "Pretend you have IQ 120," and so on.
00:31:43.320 But be careful, because if you ask for an IQ of, like, 400, you might be out of the data distribution.
00:31:47.320 Or even worse, you could be in data distribution for some like sci-fi stuff.
00:31:52.320 And it will start to like take on some sci-fi role playing or something like that.
00:31:56.320 So you have to find the right amount of IQ; I think there's some U-shaped curve there.
00:32:02.320 Next up, as we saw, when we are trying to solve problems, we know what we are good at and what we're not good at, and we lean on tools.
00:32:11.320 You want to do the same potentially with your LLMs.
00:32:14.320 So in particular, we may want to give them calculators, code interpreters, the ability to do search, and so on.
00:32:23.320 And there's a lot of techniques for doing that.
00:32:26.320 One thing to keep in mind again is that these transformers by default may not know what they don't know.
00:32:32.320 So you may even want to tell the transformer in a prompt: "You are not very good at mental arithmetic.
00:32:37.320 Whenever you need to do very large number addition, multiplication, or whatever, instead use this calculator.
00:32:43.320 Use this token combination," et cetera, et cetera.
00:32:46.320 So you have to spell it out because the model by default doesn't know what it's good at or not good at necessarily.
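A made-up example of how such an instruction might be spelled out in a prompt; the CALC[...] syntax here is an arbitrary convention that your own glue code would have to parse, not a standard:

```python
# Hypothetical system prompt; the CALC[...] marker is an arbitrary convention that your own
# surrounding glue code would detect, evaluate with a real calculator, and splice back in.
system_prompt = (
    "You are not very good at mental arithmetic. "
    "Whenever you need to add, multiply, or divide large numbers, do not guess. "
    "Instead, emit a tool call of the form CALC[expression] and wait for the result."
)
```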
00:32:55.320 Next up, I think something that is very interesting is that we went from a world that was retrieval only,
00:33:01.320 all the way to the other extreme: the pendulum has swung to where it's memory only in LLMs.
00:33:06.320 But actually there's this entire space in between of these retrieval augmented models and this works extremely well in practice.
00:33:13.320 As I mentioned, the context window of a transformer is its working memory.
00:33:17.320 If you can load the working memory with any information that is relevant to the task, the model will work extremely well because it can immediately access all that memory.
00:33:26.320 And so I think a lot of people are really interested in basically retrieval augmented generation.
00:33:33.320 And on the bottom I have an example of LlamaIndex, which is a data connector to lots of different types of data.
00:33:39.320 You can index all of that data and make it accessible to LLMs.
00:33:44.320 And the emerging recipe there is you take relevant documents, you split them up into chunks, you embed all of them, and you basically get embedding vectors that represent that data.
00:33:53.320 You store that in the vector store and then at test time you make some kind of a query to your vector store and you fetch chunks that might be relevant to your task and you stuff them into the prompt and then you generate.
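Here is a bare-bones sketch of that recipe, with hypothetical embed and generate functions standing in for your embedding model and LLM, and a brute-force search standing in for a real vector store:

```python
import numpy as np

def retrieve_and_generate(question, chunks, embed, generate, top_k=3):
    """A minimal retrieval-augmented generation sketch.

    `embed` (text -> vector) and `generate` (prompt -> completion) are hypothetical
    stand-ins for whatever embedding model and LLM you are using; real systems
    typically use a vector database instead of this brute-force similarity search.
    """
    chunk_vecs = np.array([embed(c) for c in chunks])
    q = np.array(embed(question))
    # Cosine similarity between the query and every chunk.
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q) + 1e-8)
    best = [chunks[i] for i in np.argsort(-sims)[:top_k]]
    prompt = "Context:\n" + "\n---\n".join(best) + f"\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)
```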
00:34:06.320 So this is I think similar to when you and I solve problems, you can do everything from your memory and transformers have very large and extensive memory.
00:34:13.320 But also it really helps to reference some primary documents.
00:34:17.320 So whenever you find yourself going back to a textbook to find something, or whenever you find yourself going back to the documentation of a library to look something up, transformers definitely want to do that too.
00:34:27.320 You have some memory of how the documentation of a library works, but it's much better to look it up. So the same applies here.
00:34:36.320 Next I wanted to briefly talk about constraint prompting.
00:34:41.320 This is basically techniques for forcing a certain template in the outputs of LLMs.
00:34:50.320 So guidance is one example from Microsoft actually and here we are enforcing that the output from the LLM will be JSON.
00:34:57.320 And this will actually guarantee that the output will take on this form because they go in and they mess with the probabilities of all the different tokens that come out of the transformer and they clamp those tokens.
00:35:07.320 And then the transformer is only filling in the blanks here and then you can enforce additional restrictions on what could go into those blanks.
00:35:13.320 So this might be really helpful and I think this kind of constraint sampling is also extremely interesting.
00:35:19.320 I also wanted to say a few words about fine tuning.
00:35:22.320 It is the case that you can get really far with prompt engineering but it's also possible to think about fine tuning your models.
00:35:29.320 Now fine tuning models means that you are actually going to change the weights of the model.
00:35:33.320 It is becoming a lot more accessible to do this in practice and that's because of a number of techniques that have been developed and have libraries for very recently.
00:35:42.320 So, for example, parameter-efficient fine-tuning techniques like LoRA make sure that you are only training small, sparse pieces of your model.
00:35:50.320 So most of the model is kept clamped at the base model and some pieces of it are allowed to change and it still works pretty well empirically and makes it much cheaper to sort of tune only small pieces of your model.
00:36:01.320 It also means that because most of your model is clamped you can use very low precision inference for computing those parts because they are not going to be updated by gradient descent.
00:36:11.320 And so that makes everything a lot more efficient as well.
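As a hedged sketch of what this can look like with Hugging Face's transformers and peft libraries (exact module names and arguments depend on the model family and library version):

```python
# A sketch of parameter-efficient fine-tuning with LoRA via the transformers + peft libraries;
# target module names like "c_attn" are specific to GPT-2 and differ for other model families.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")   # small stand-in base model for illustration
config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["c_attn"],                          # GPT-2's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()   # typically well under 1% of the weights remain trainable
```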
00:36:14.320 And in addition, we have a number of open-sourced, high-quality base models currently; as I mentioned, LLaMA is quite nice, although it is not commercially licensed, I believe, right now.
00:36:24.320 Something to keep in mind is that basically fine tuning is a lot more technically involved.
00:36:29.320 It requires a lot more I think technical expertise to do right.
00:36:32.320 It requires human data contractors for data sets and or synthetic data pipelines that can be pretty complicated.
00:36:38.320 This will definitely slow down your iteration cycle by a lot.
00:36:41.320 And I would say, on a high level, SFT is achievable because it is just continuing the language modeling task.
00:36:49.320 But RLHF, I would say, is very much research territory and is even much harder to get to work.
00:36:55.320 And so I would probably not advise that someone just tries to roll their own RLHF implementation.
00:37:00.320 These things are pretty unstable, very difficult to train.
00:37:03.320 Not something that is I think very beginner friendly right now and is also potentially likely also to change pretty rapidly still.
00:37:10.320 So I think these are my sort of default recommendations right now.
00:37:15.320 I would break up your task into two major parts.
00:37:18.320 Number one, achieve your top performance and number two, optimize your performance in that order.
00:37:23.320 Number one, the best performance will currently come from the GPT-4 model.
00:37:29.320 Use prompts that are very detailed; they should have lots of task context, relevant information, and instructions.
00:37:35.320 Think along the lines of what would you tell a task contractor if they can't email you back.
00:37:40.320 But then also keep in mind that a task contractor is a human and they have inner monologue and they're very clever, et cetera.
00:37:48.320 So make sure to think through the psychology of the LLM almost and cater prompts to that.
00:37:54.320 Retrieve and add any relevant context and information to these prompts.
00:37:59.320 Basically refer to a lot of the prompt engineering techniques.
00:38:02.320 Some of them are highlighted in the slides above, but also this is a very large space and I would just advise you to look for prompt engineering techniques online.
00:38:15.320 What this refers to is you don't just want to tell, you want to show, whenever it's possible.
00:38:19.320 So give examples of everything that helps it really understand what you mean, if you can.
00:38:25.320 Experiment with tools and plug-ins to offload tasks that are difficult for LLMs natively.
00:38:30.320 And then think about not just a single prompt and answer, think about potential chains and reflection and how you glue them together and how you could potentially make multiple samples and so on.
00:38:40.320 Finally, if you think you've squeezed out prompt engineering, which I think you should stick with for a while, look at potentially fine-tuning
00:38:48.320 a model to your application, but expect this to be a lot slower and more involved.
00:38:53.320 And then there's an expert, fragile research zone here, and I would say that is RLHF, which currently does work a bit better than SFT
00:39:00.320 If you can get it to work, but again, this is pretty involved, I would say.
00:39:04.320 And to optimize your costs, try to explore lower capacity models or shorter prompts and so on.
00:39:11.320 I also wanted to say a few words about the use cases in which I think LLMs are currently well suited for.
00:39:18.320 So in particular, note that there's a large number of limitations to LLMs today, and so I would keep that definitely in mind for all your applications.
00:39:25.320 And this, by the way, could be an entire talk on its own, so I don't have time to cover it in full detail.
00:39:30.320 Models may be biased, they may fabricate or hallucinate information, they may have reasoning errors, they may struggle with entire classes of applications.
00:39:38.320 They have knowledge cutoffs, so they might not know any information after, say, September 2021.
00:39:43.320 They are susceptible to a large range of attacks, which are sort of like coming out on Twitter daily, including prompt injection, jailbreak attacks, data poisoning attacks, and so on.
00:39:52.320 So my recommendation right now is: use LLMs in low-stakes applications, always combine them with human oversight,
00:40:00.320 use them as a source of inspiration and suggestions, and think copilots instead of completely autonomous agents that are just performing a task somewhere.
00:40:07.320 It's just not clear that the models are there right now.
00:40:12.320 So I wanted to close by saying that GPT4 is an amazing artifact, I'm very thankful that it exists, and it's beautiful, it has a ton of knowledge across so many areas, it can do math, code, and so on.
00:40:22.320 And in addition, there's this thriving ecosystem of everything else that is being built and incorporated into the ecosystem, some of these things I've talked about.
00:40:31.320 And all of this power is accessible at your fingertips, so here's everything that's needed in terms of code to ask GPT4 a question, to prompt it and get a response.
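The code in question is roughly of this shape; this is a reconstruction using the pre-1.0 openai Python client that was current at the time, not necessarily the exact snippet on the slide:

```python
# Rough reconstruction of the kind of call shown on the slide, using the older
# `openai` Python client (pre-1.0) that was current at the time of the talk.
import openai

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": "Can you say something to inspire the audience of Microsoft Build 2023?"}],
)
print(response["choices"][0]["message"]["content"])
```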
00:40:42.320 In this case, I said, can you say something to inspire the audience of Microsoft Build 2023?
00:40:47.320 And I just punched this into Python and verbatim, GPT4 said the following.
00:40:53.320 And by the way, I did not know that they used this trick in the keynote, so I thought I was being clever, but it is really good at this. It says, "Ladies and gentlemen, innovators and trailblazers of Microsoft Build 2023.
00:41:07.320 Welcome to the gathering of brilliant minds like no other.
00:41:10.320 You are architects of the future, the visionaries molding the digital realm in which humanity thrives, embrace the limitless possibilities of technologies and let your ideas soar as high as your imagination.
00:41:20.320 Together, let's create a more connected, remarkable and inclusive world for generations to come, get ready to unleash your creativity, canvas the unknown, and turn dreams into reality.