This page contains notes, further readings, and general references for the associated lecture.

The Lecture Slides for the talk are also available (use your arrow keys to advance the slides), or you can download a PDF of the slides.

If you’re looking for notes to the other lectures, head here.

More about Perceptrons

While we discussed perceptrons last week, we finally get to see their full power this week once we allow them to have multiple layers. There are tons of great explanations for them online, but my personal favorite is the series by YouTuber 3Blue1Brown. The entire series is worth a watch, but be warned, it’s quite long.

During the lecture we showed the awesome 2D simulations of a multilayer perceptron digit recognizer created by Adam Harley. Adam has made a number of other visualizations for alternative neural network structures (ones we did not discuss in our lecture) that are also great to see.

  • One important neural network structure we did not have time to discuss but is still highly used in modern systems is the convolutional neural network.
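To make "multiple layers" a little more concrete, here is a minimal sketch (my own toy example, not from the lecture, and assuming NumPy is available) of a two-layer perceptron's forward pass: each layer is just a weighted sum followed by a nonlinearity, and stacking layers is what gives the network its power.

```python
import numpy as np

def relu(x):
    # the nonlinearity: pass positive values through, zero out negatives
    return np.maximum(0, x)

rng = np.random.default_rng(0)
# A tiny two-layer perceptron: 4 inputs -> 3 hidden units -> 2 outputs
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)

def forward(x):
    h = relu(W1 @ x + b1)   # hidden layer: weighted sum, then nonlinearity
    return W2 @ h + b2      # output layer: another weighted sum

x = np.array([1.0, 0.5, -0.3, 2.0])
print(forward(x).shape)  # (2,)
```

A real digit recognizer like Adam Harley's works the same way, just with far more units per layer (784 pixel inputs, 10 digit outputs).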

Backpropagation

This is some of the most technical material we covered, and likely some of the most technical material involved in current AI models. It cannot be overstated how important backpropagation is, both to the modern models we are discussing (e.g., ChatGPT) and to many of the regular "narrow AI" models of the last 40 years. It is the single most important component in the entire stack, as it is what fundamentally allows an impossibly complex model to slowly get better through exposure to examples. Note that if you want to really understand how it works, you will have to get into some calculus.

Here are some resources if you want to dive into it a little deeper:

  • 3Blue1Brown’s video on backpropagation is just visually amazing. Highly recommended.
  • The Builtin article on backprop is also good (it even cites the above video) and gives some additional background.
  • The people at Welch Labs have also created a great video series on backpropagation (be sure to see parts 2 and 3 as well), which gets a bit more technical but is well worth a watch.
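The core loop is simpler than it sounds. Here is a minimal sketch (my own toy example, not from any of the resources above) of gradient descent on a single weight: the gradient comes from the calculus chain rule, and repeated small updates slowly pull the model's prediction toward the example.

```python
# One weight w, one input x, one target y; loss = (w*x - y)^2
def loss(w, x, y):
    return (w * x - y) ** 2

def grad(w, x, y):
    # chain rule: dL/dw = 2 * (w*x - y) * x
    return 2 * (w * x - y) * x

w = 0.5            # start with a bad guess for the weight
x, y = 2.0, 3.0    # one training example: input 2.0 should map to 3.0
lr = 0.1           # learning rate: how big each correction step is

for _ in range(50):
    w -= lr * grad(w, x, y)   # step downhill on the loss

print(round(w, 4))  # converges toward y/x = 1.5
```

Backpropagation is this same chain-rule calculation applied layer by layer through a deep network, so that every weight gets its own gradient.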

Words as Numbers

Turning words into numbers (in lecture we thought of them as barcodes) has many applications beyond even LLMs. It's been a pretty standard task in the AI world for the last 30 years, which means there are lots of different ways to do it.

  • This video by Josh Starmer on word embeddings is a little silly (most of his videos are), but the mathematics and explanations are highly accurate.
  • IBM also put out a very decent article on how these numbers are created and used.
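As a toy illustration of the "barcodes" idea (hand-made vectors here, whereas real embeddings are learned), similar words end up with nearby lists of numbers, and we can measure that closeness with cosine similarity:

```python
import numpy as np

# Hypothetical hand-made word vectors, just for illustration
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.9, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    # cosine similarity: 1.0 means same direction, 0.0 means unrelated
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["king"], vectors["queen"]))  # high: similar words
print(cosine(vectors["king"], vectors["apple"]))  # low: unrelated words
```

Real systems learn vectors with hundreds of dimensions from huge text corpora, but the geometry works the same way.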

Transformers

We absolutely did not cover this in depth. Honestly, we barely mentioned it. And while it's not the largest component of modern AI systems (that's still the perceptron we spent so much time talking about), it's the component that makes the whole thing work. Transformers were famously first presented in the Google research paper Attention Is All You Need, a naming scheme that has been reused many times since. It's a technical paper and not aimed at an introductory crowd, but it still manages to be quite readable.

  • For a more visual explanation of a Transformer, see the PoloClub interactive tutorial. It too is pretty technical, but also beautiful.
  • Transformers have two main components: Encoders and Decoders. Josh Starmer from StatQuest does a decent (but silly) job with both of them (encoders, decoders).
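For the curious, the heart of a Transformer, the scaled dot-product attention from Attention Is All You Need, fits in a few lines. This is a bare-bones sketch (a single attention head, with random stand-in values instead of learned projections):

```python
import numpy as np

def attention(Q, K, V):
    # scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)              # how much each token attends to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                          # blend the values by those weights

rng = np.random.default_rng(1)
Q = rng.normal(size=(3, 4))  # 3 tokens, each represented by 4 numbers
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

out = attention(Q, K, V)
print(out.shape)  # (3, 4): one updated vector per token
```

Each output vector is a weighted mix of all the token vectors, which is how a Transformer lets every word "look at" every other word at once.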

Large Language Models

  • A classic visualization of ChatGPT (version 3, so an older version) is available from Jay Alammar. It's a little dated and not quite accurate for the truly modern models, but it's still great for understanding the basics of the full system.

  • Full citation for the “Tale of Two Cities” used in the lecture

  • Andreas Stöffelbauer wrote a very detailed article on the basics of LLMs

Training Modern AI Models

  • Stephen Bach wrote a pretty good article describing the different phases of AI training. Well worth a quick read-through.
  • Josh Starmer (from StatQuest) created a good video explaining Reinforcement Learning with Human Feedback (RLHF). His videos are always a bit silly, but they are quite approachable.
  • Neptune.ai has a very good and very understandable article on RLHF as well.

AI Benchmarks

  • Aspen Digital put together a great primer on AI benchmarks, a good place to get started on learning more.
  • Stanford's Holistic Evaluation of Language Models (HELM) maintains a nice leaderboard for various AI benchmarks.
  • HuggingFace (an open-source AI site for sharing models) also maintains a leaderboard for AI benchmarks.
  • The specific MMLU leaderboard shown during the lecture can be found here.
  • The original paper on MMLU is publicly available as well.
  • Humanity's Last Exam has its own webpage with information and links to the official paper.
  • As do ARC-AGI-1 and ARC-AGI-2.

Nature Paper

Finally, the very recent Nature paper detailing why current LLMs should be classified as "artificial general intelligence". It's a good article from one of the most reputable journals in all of academia. Interestingly, these same ideas have been stated by less prestigious individuals and organizations for at least the last six months, if not longer, but it's good to see them out in Nature now.