Learn How Large Language Models (LLM) Work
The project is designed for illustration, experimentation, and structural understanding; and not for production use.
Modern GPT Chat Agents are built on a simple objective: given some text, predict the next word.
Repeated many times, this produces full conversations.
What we recognize as a GPT-type chatbot is simply the result of making useful next-word predictions over and over, using the context of text seen so far.
During training, an LLM is exposed to large amounts of human language and learns patterns in how words follow one another. When generating a response, the model works step by step, using the words so far (its context window) to choose the next one. At each step, it makes a best guess based on what it has learned.
It doesn't plan a sentence in advance; the result emerges one word at a time.
These examples illustrate how changing:
- the context window (e.g., 0, 1, 2, or 3 prior words) and
- the training text
affects next-token predictions. The results determine how well the conversation goes.
We include a neutral cat-dog corpus (linked below) to illustrate cases where additional context (more prior words) provides little or no benefit.
Most computational effort occurs during training, not use. Running a trained model in conversation is relatively inexpensive; training it requires processing large amounts of human-written text and repeatedly adjusting the model to improve predictions. Large systems may take days or weeks of computation across many machines. The pre-trained models in Toy GPT reflect the same process at a much smaller scale.
It takes space to track each word in context (even if the context is only 2 or 3 words prior). The context-3 model stores one weight row per unique 3-token context, so the weight matrix grows as vocab³:
| Corpus | Vocab size | Model | Weight matrix rows | Approx. size |
|---|---|---|---|---|
| cat/dog | ~20 tokens | context-3 | 20³ = 8,000 | ~3 MB |
| llm_glossary | ~119 tokens | context-2 | 119² = 14,161 | ~10 MB |
| llm_glossary | ~119 tokens | context-3 | 119³ = 1,685,159 | ~428 MB |
Our simple Context-3 model (without embeddings) grew to 428 MB, consisting mostly of zeros (because relatively few combinations appear in context).
This rapid growth makes direct counting approaches impractical and motivates more compact representations for capturing similarities across words.
Modern GPT-style models use the transformer architecture from "Attention Is All You Need"1. They use:
- embeddings (numeric representations of words) and
- attention (selecting which prior words matter most)
to represent tokens and weigh which prior words are most relevant for prediction.
These mechanisms produce much more compact output. Attention is included in Toy GPT for completeness, but its full effect is hard to demonstrate on such tiny corpora.
1 Vaswani et al. (2017), Attention Is All You Need.
Interactive visualization of next-word (token) prediction in GPT-style language models.
- App -interactive visualization of next-word (token) prediction in GPT-style language models.
- App Repo
Each repository contains pre-trained artifacts for one model–corpus combination. Two things vary independently: the model and the corpus.
Select a model (how many previous tokens are used to predict the next one):
| Code | Model | Context window |
|---|---|---|
| 100 | Unigram | 0 tokens; predicts from overall frequency only |
| 200 / 201 | Bigram | 1 token; predicts from the immediately preceding token |
| 300 / 301 / 302 | Context-2 | 2 tokens; predicts from the previous two tokens |
| 400 | Context-3 | 3 tokens; predicts from the previous three tokens |
| 500 | Embeddings | learns vector representations of tokens |
| 600 | Attention | dynamically weights which prior tokens matter most |
Select a corpus (the text the model is trained on):
| ID | File | Type | Signal |
|---|---|---|---|
| 000 | cat_dog.txt | Neutral | Low; adjacency carries little predictive value |
| 001 | animals.txt | Structured | Higher; natural language patterns |
| 010 | llm_glossary.txt | Domain | Technical NLP definitions |
| 020 | repo_tour.txt | Domain | Repository file and folder descriptions |
| 030 | analytics.txt | Domain | Technical micro-lessons |
| Repository | Model | Corpus | Context |
|---|---|---|---|
| train-100-unigram | Unigram | 000 cat_dog (neutral) | 0 |
| train-200-bigram | Bigram | 000 cat_dog (neutral) | 1 |
| train-300-context-2 | Context-2 | 000 cat_dog (neutral) | 2 |
| train-400-context-3 | Context-3 | 000 cat_dog (neutral) | 3 |
| train-100-unigram-animals | Unigram | 001 animals (structured) | 0 |
| train-200-bigram-animals | Bigram | 001 animals (structured) | 1 |
| train-300-context-2-animals | Context-2 | 001 animals (structured) | 2 |
| train-400-context-3-animals | Context-3 | 001 animals (structured) | 3 |
| train-201-bigram-llm-glossary | Bigram | 010 llm_glossary (domain) | 1 |
| train-301-context-2-llm-glossary | Context-2 | 010 llm_glossary (domain) | 2 |
| train-401-context-3-llm-glossary | Context-3 | 010 llm_glossary (domain) |
3 |
| train-302-context-2-repo-tour | Context-2 | 020 repo_tour (domain) | 2 |
| train-500-embeddings | Embeddings | 030 analytics (domain) | 2 |
| train-600-attention | Attention | 030 analytics (domain) ℹ️ | 2 |
| train-600-attention-3 | Attention | 030 analytics (domain) ℹ️ | 3 |
⚠️ Too large to commit (428 MB of mostly zeros).- ℹ️ Attention requires a larger corpus to produce meaningful position weighting.
| Corpus ID | Description | 100 unigram | 200 bigram | 300 context-2 | 400 context-3 | 500 embeddings | 600 attention |
|---|---|---|---|---|---|---|---|
000_cat_dog |
small neutral (low-signal) corpus | X | X | X | X | ||
001_animals |
small structured (proximity-signal) corpus | X | X | X | X | ||
010_llm_glossary |
language model definitions | X (301) |
|||||
020_repo_tour |
repo files and folders | X (302) |
|||||
030_analytics |
technical micro-lessons | X | X |
Legend:
X indicates a completed training run with committed artifacts.
Blank cells indicate combinations intentionally not explored.
Model codes increase with architectural complexity:
- 100/unigram: predicts the next token using only overall frequency, without considering surrounding context.
- 200/bigram: predicts the next token conditioned on the immediately preceding token.
- 300/context-2: predicts the next token using the previous two tokens as context. Architecturally equivalent to a bigram model, but implemented in a way that generalizes to longer contexts.
- 400/context-3: predicts the next token using the previous three tokens as context.
- 500/embeddings: learns numeric vector representations of tokens so related tokens occupy nearby locations in a learned space.
- 600/attention: dynamically weights which prior tokens matter most when predicting the next token, rather than treating all context equally.
See the paper Attention Is All You Need for background on attention-based models.
Corpus IDs are three digits and follow a project-wide standard:
- 000–009: canonical reference corpora
- 010–049: curated explanatory corpora for focused demonstrations
- 100–799: additional contributions
Each corpus file is named NNN_slug.txt.
The corpus ID and slug are recorded in SE_MANIFEST.toml.
-
neutral corpus A corpus deliberately constructed so that adjacency carries little or no predictive signal. Increasing context length (bigram, context-2, context-3) should provide minimal benefit. Used as a negative control.
-
structured corpus A corpus where adjacency or proximity carries meaningful predictive signal (e.g., attributes, phrases, or semantic groupings). Increasing context length should measurably improve predictions.
-
unigram A model that conditions on a single token (
uni= one). -
bigram A model that conditions on two tokens (
bi= two).
This organization emphasizes clarity, reproducibility, and explicit boundaries between models, data, corpora, and claims.