Replicating the Chinchilla scaling laws at the small scale: sweep over tiny GPT-like models (12K–600K parameters) on WikiText-103 to find the compute-optimal model size and training token count for each FLOP budget, then fit power-law scaling curves.
- Data - Download & prepare byte-level train/val/test splits from WikiText-103
- Sweep - Generate an iso-compute grid of (model, tokens) pairs across 9 compute budgets (see the first sketch after this list)
- Train - Train every configuration across multiple seeds and two LR-schedule regimes (matched & fixed-horizon)
- Aggregate - Collect per-run validation losses into a summary table
- Fit & Plot - Fit a Chinchilla-style parametric loss surface and produce scaling-law plots (see the second sketch after this list)
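
The sweep step pairs each FLOP budget with a range of model sizes and derives the token count from the standard C ≈ 6·N·D approximation. Below is a minimal sketch of that grid construction; the specific budgets, model sizes, and dictionary keys are illustrative assumptions, not the exact values used in the notebook.

```python
# Sketch: build an iso-compute (model, tokens) grid under the C ≈ 6*N*D FLOP approximation.
# The budgets and model sizes here are placeholders; the notebook defines its own grid.
import numpy as np

flop_budgets = np.logspace(12, 14, 9)            # 9 hypothetical compute budgets (FLOPs)
model_sizes = np.array([12e3, 25e3, 50e3, 100e3,
                        200e3, 400e3, 600e3])    # parameter counts in the 12K-600K range

grid = []
for C in flop_budgets:
    for N in model_sizes:
        D = C / (6 * N)                          # tokens that exactly spend the budget C
        grid.append({"flops": C, "params": int(N), "tokens": int(D)})
```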
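The fit step estimates the Chinchilla parametric form L(N, D) = E + A/N^α + B/D^β from the aggregated runs. The sketch below fits it by minimizing squared error in log-loss space, a simplification of the Huber objective used in the Chinchilla paper; the function name and array inputs are assumptions about how the summary table is exposed.

```python
# Sketch: fit L(N, D) = E + A / N**alpha + B / D**beta to (params, tokens, val_loss) arrays.
import numpy as np
from scipy.optimize import minimize

def fit_loss_surface(params, tokens, losses):
    """params, tokens, losses: 1-D NumPy arrays, one entry per training run."""
    def objective(theta):
        log_E, log_A, log_B, alpha, beta = theta
        pred = (np.exp(log_E)
                + np.exp(log_A) / params ** alpha
                + np.exp(log_B) / tokens ** beta)
        return np.mean((np.log(pred) - np.log(losses)) ** 2)  # squared error in log space

    theta0 = np.array([0.5, 2.0, 2.0, 0.3, 0.3])  # rough starting point for the optimizer
    result = minimize(objective, theta0, method="L-BFGS-B")
    log_E, log_A, log_B, alpha, beta = result.x
    return np.exp(log_E), np.exp(log_A), np.exp(log_B), alpha, beta
```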
Everything lives in a single Jupyter notebook:

    pip install -r requirements.txt
    jupyter notebook notebook.ipynb

Run all cells top-to-bottom. The notebook will download data, train the sweep, and save plots to `outputs/plots/`.


