Skip to content
@toy-gpt

Toy GPT

Toy GPT

Learn How Large Language Models (LLM) Work

The project is designed for illustration, experimentation, and structural understanding; and not for production use.

Modern GPT Chat Agents

Modern GPT Chat Agents are built on a simple objective: given some text, predict the next word.

Repeated many times, this produces full conversations.

What we recognize as a GPT-type chatbot is simply the result of making useful next-word predictions over and over, using the context of text seen so far.

How They Work

During training, an LLM is exposed to large amounts of human language and learns patterns in how words follow one another. When generating a response, the model works step by step, using the words so far (its context window) to choose the next one. At each step, it makes a best guess based on what it has learned.

It doesn't plan a sentence in advance; the result emerges one word at a time.

Toy GPT Exploration (2D Matrix)

These examples illustrate how changing:

  1. the context window (e.g., 0, 1, 2, or 3 prior words) and
  2. the training text

affects next-token predictions. The results determine how well the conversation goes.

We include a neutral cat-dog corpus (linked below) to illustrate cases where additional context (more prior words) provides little or no benefit.

LLM Training Costs

Most computational effort occurs during training, not use. Running a trained model in conversation is relatively inexpensive; training it requires processing large amounts of human-written text and repeatedly adjusting the model to improve predictions. Large systems may take days or weeks of computation across many machines. The pre-trained models in Toy GPT reflect the same process at a much smaller scale.

Combinatorial Explosion

It takes space to track each word in context (even if the context is only 2 or 3 words prior). The context-3 model stores one weight row per unique 3-token context, so the weight matrix grows as vocab³:

Corpus Vocab size Model Weight matrix rows Approx. size
cat/dog ~20 tokens context-3 20³ = 8,000 ~3 MB
llm_glossary ~119 tokens context-2 119² = 14,161 ~10 MB
llm_glossary ~119 tokens context-3 119³ = 1,685,159 ~428 MB

Our simple Context-3 model (without embeddings) grew to 428 MB, consisting mostly of zeros (because relatively few combinations appear in context).

This rapid growth makes direct counting approaches impractical and motivates more compact representations for capturing similarities across words.

GPT: The Transformer Breakthrough

Modern GPT-style models use the transformer architecture from "Attention Is All You Need"1. They use:

  • embeddings (numeric representations of words) and
  • attention (selecting which prior words matter most)

to represent tokens and weigh which prior words are most relevant for prediction.

These mechanisms produce much more compact output. Attention is included in Toy GPT for completeness, but its full effect is hard to demonstrate on such tiny corpora.

1 Vaswani et al. (2017), Attention Is All You Need.

App

Interactive visualization of next-word (token) prediction in GPT-style language models.

  • App -interactive visualization of next-word (token) prediction in GPT-style language models.
  • App Repo

Training Repositories

Each repository contains pre-trained artifacts for one model–corpus combination. Two things vary independently: the model and the corpus.

Choose One Model (Per Training Repository)

Select a model (how many previous tokens are used to predict the next one):

Code Model Context window
100 Unigram 0 tokens; predicts from overall frequency only
200 / 201 Bigram 1 token; predicts from the immediately preceding token
300 / 301 / 302 Context-2 2 tokens; predicts from the previous two tokens
400 Context-3 3 tokens; predicts from the previous three tokens
500 Embeddings learns vector representations of tokens
600 Attention dynamically weights which prior tokens matter most

Chose One Corpus (Per Training Repository)

Select a corpus (the text the model is trained on):

ID File Type Signal
000 cat_dog.txt Neutral Low; adjacency carries little predictive value
001 animals.txt Structured Higher; natural language patterns
010 llm_glossary.txt Domain Technical NLP definitions
020 repo_tour.txt Domain Repository file and folder descriptions
030 analytics.txt Domain Technical micro-lessons

Training Repositories: Model-Corpus Matrix

Repository Model Corpus Context
train-100-unigram Unigram 000 cat_dog (neutral) 0
train-200-bigram Bigram 000 cat_dog (neutral) 1
train-300-context-2 Context-2 000 cat_dog (neutral) 2
train-400-context-3 Context-3 000 cat_dog (neutral) 3
train-100-unigram-animals Unigram 001 animals (structured) 0
train-200-bigram-animals Bigram 001 animals (structured) 1
train-300-context-2-animals Context-2 001 animals (structured) 2
train-400-context-3-animals Context-3 001 animals (structured) 3
train-201-bigram-llm-glossary Bigram 010 llm_glossary (domain) 1
train-301-context-2-llm-glossary Context-2 010 llm_glossary (domain) 2
train-401-context-3-llm-glossary Context-3 010 llm_glossary (domain) ⚠️ 3
train-302-context-2-repo-tour Context-2 020 repo_tour (domain) 2
train-500-embeddings Embeddings 030 analytics (domain) 2
train-600-attention Attention 030 analytics (domain) ℹ️ 2
train-600-attention-3 Attention 030 analytics (domain) ℹ️ 3
  • ⚠️ Too large to commit (428 MB of mostly zeros).
  • ℹ️ Attention requires a larger corpus to produce meaningful position weighting.

Grid View

Corpus ID Description 100 unigram 200 bigram 300 context-2 400 context-3 500 embeddings 600 attention
000_cat_dog small neutral (low-signal) corpus X X X X
001_animals small structured (proximity-signal) corpus X X X X
010_llm_glossary language model definitions X (301)
020_repo_tour repo files and folders X (302)
030_analytics technical micro-lessons X X

Legend: X indicates a completed training run with committed artifacts. Blank cells indicate combinations intentionally not explored.

Model numbering

Model codes increase with architectural complexity:

  • 100/unigram: predicts the next token using only overall frequency, without considering surrounding context.
  • 200/bigram: predicts the next token conditioned on the immediately preceding token.
  • 300/context-2: predicts the next token using the previous two tokens as context. Architecturally equivalent to a bigram model, but implemented in a way that generalizes to longer contexts.
  • 400/context-3: predicts the next token using the previous three tokens as context.
  • 500/embeddings: learns numeric vector representations of tokens so related tokens occupy nearby locations in a learned space.
  • 600/attention: dynamically weights which prior tokens matter most when predicting the next token, rather than treating all context equally.

See the paper Attention Is All You Need for background on attention-based models.

Corpus numbering

Corpus IDs are three digits and follow a project-wide standard:

  • 000–009: canonical reference corpora
  • 010–049: curated explanatory corpora for focused demonstrations
  • 100–799: additional contributions

Each corpus file is named NNN_slug.txt. The corpus ID and slug are recorded in SE_MANIFEST.toml.

Terms

  • neutral corpus A corpus deliberately constructed so that adjacency carries little or no predictive signal. Increasing context length (bigram, context-2, context-3) should provide minimal benefit. Used as a negative control.

  • structured corpus A corpus where adjacency or proximity carries meaningful predictive signal (e.g., attributes, phrases, or semantic groupings). Increasing context length should measurably improve predictions.

  • unigram A model that conditions on a single token (uni = one).

  • bigram A model that conditions on two tokens (bi = two).

Purpose

This organization emphasizes clarity, reproducibility, and explicit boundaries between models, data, corpora, and claims.

Popular repositories Loading

  1. .github .github Public

    GitHub profile repo for the toy-gpt organization on GitHub.Profile. Automates checks across the various training repositories.

    Python 1

  2. train-100-unigram train-100-unigram Public

    Training repo for Toy GPT: unigram + small neutral corpus (000_cat_dog.txt)

    Python 1

  3. train-200-bigram train-200-bigram Public

    Training repo for Toy GPT: bigram + small neutral corpus (000_cat_dog.txt)

    Python 1

  4. train-300-context-2 train-300-context-2 Public

    Training repo for Toy GPT: Context-2 model + small neutral corpus (000_cat_dog.txt)

    Python 1

  5. train-400-context-3 train-400-context-3 Public

    Training repo for Toy GPT: Context-3 model + small neutral corpus (000_cat_dog.txt)

    Python 1

  6. train-100-unigram-animals train-100-unigram-animals Public

    Training repo for Toy GPT: unigram model + small structured corpus (001_animals.txt)

    Python 1

Repositories

Showing 10 of 17 repositories
  • .github Public

    GitHub profile repo for the toy-gpt organization on GitHub.Profile. Automates checks across the various training repositories.

    toy-gpt/.github’s past year of commit activity
    Python 1 MIT 0 0 0 Updated Apr 4, 2026
  • toy-gpt-chat Public

    Interactive visualization of next-word (token) prediction in GPT-style language models.

    toy-gpt/toy-gpt-chat’s past year of commit activity
    TypeScript 1 MIT 0 0 0 Updated Apr 4, 2026
  • train-600-attention-3 Public

    Training repo for Toy GPT: Context-3 model with attention + small domain corpus (030_analytics.txt). Attention requires scale.

    toy-gpt/train-600-attention-3’s past year of commit activity
    Python 1 MIT 0 0 0 Updated Apr 4, 2026
  • train-600-attention Public

    Training repo for Toy GPT: Context-2 model with attention + small domain corpus (030_analytics.txt). Attention requires scale.

    toy-gpt/train-600-attention’s past year of commit activity
    Python 1 MIT 0 0 0 Updated Apr 4, 2026
  • train-500-embeddings Public

    Training repo for Toy GPT: Context-2 model with embeddings + small domain corpus (030_analytics.txt). Much more efficient use of space.

    toy-gpt/train-500-embeddings’s past year of commit activity
    Python 1 MIT 0 0 0 Updated Apr 4, 2026
  • train-401-context-3-llm-glossary Public

    Training repo for Toy GPT: context-3 model + small domain corpus (010_llm_glossary.txt)

    toy-gpt/train-401-context-3-llm-glossary’s past year of commit activity
    Python 0 MIT 0 0 0 Updated Apr 4, 2026
  • train-400-context-3-animals Public

    Training repo for Toy GPT: context-3 model + small structured corpus (001_animals.txt)

    toy-gpt/train-400-context-3-animals’s past year of commit activity
    Python 1 MIT 0 0 0 Updated Apr 3, 2026
  • train-400-context-3 Public

    Training repo for Toy GPT: Context-3 model + small neutral corpus (000_cat_dog.txt)

    toy-gpt/train-400-context-3’s past year of commit activity
    Python 1 MIT 0 1 0 Updated Apr 4, 2026
  • train-302-context-2-repo-tour Public

    Training repo for Toy GPT: context-2 model + small domain corpus (repo tour)

    toy-gpt/train-302-context-2-repo-tour’s past year of commit activity
    Python 0 MIT 0 0 0 Updated Apr 4, 2026
  • train-301-context-2-llm-glossary Public

    Training repo for Toy GPT: context-2 model + small domain corpus (010_llm_glossary.txt)

    toy-gpt/train-301-context-2-llm-glossary’s past year of commit activity
    Python 1 MIT 0 0 0 Updated Apr 4, 2026

People

This organization has no public members. You must be a member to see who’s a part of this organization.

Top languages

Loading…

Most used topics

Loading…