Replicating the Chinchilla scaling laws at the small scale: sweep over tiny GPT-like models (12K–600K parameters) on WikiText-103 to find the compute-optimal model size and training token count for each FLOP budget, then fit power-law scaling curves.
- Data - Download & prepare byte-level train/val/test splits from WikiText-103
- Sweep - Generate an iso-compute grid of (model, tokens) pairs across 9 compute budgets (see the first sketch after this list)
- Train - Train every configuration across multiple seeds and two LR-schedule regimes (matched & fixed-horizon)
- Aggregate - Collect per-run validation losses into a summary table
- Fit & Plot - Fit a Chinchilla-style parametric loss surface and produce scaling-law plots (see the second sketch after this list)
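
The sweep step pairs each FLOP budget with a range of model sizes and derives the token count from the standard C ≈ 6·N·D approximation. Below is a minimal sketch of that grid construction; the specific budgets, model sizes, and dictionary keys are illustrative assumptions, not the exact values used in the notebook.

```python
# Sketch: build an iso-compute (model, tokens) grid under the C ≈ 6*N*D FLOP approximation.
# The budgets and model sizes here are placeholders; the notebook defines its own grid.
import numpy as np

flop_budgets = np.logspace(12, 14, 9)            # 9 hypothetical compute budgets (FLOPs)
model_sizes = np.array([12e3, 25e3, 50e3, 100e3,
                        200e3, 400e3, 600e3])    # parameter counts in the 12K-600K range

grid = []
for C in flop_budgets:
    for N in model_sizes:
        D = C / (6 * N)                          # tokens that exactly spend the budget C
        grid.append({"flops": C, "params": int(N), "tokens": int(D)})
```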
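The fit step estimates the Chinchilla parametric form L(N, D) = E + A/N^α + B/D^β from the aggregated runs. The sketch below fits it by minimizing squared error in log-loss space, a simplification of the Huber objective used in the Chinchilla paper; the function name and array inputs are assumptions about how the summary table is exposed.

```python
# Sketch: fit L(N, D) = E + A / N**alpha + B / D**beta to (params, tokens, val_loss) arrays.
import numpy as np
from scipy.optimize import minimize

def fit_loss_surface(params, tokens, losses):
    """params, tokens, losses: 1-D NumPy arrays, one entry per training run."""
    def objective(theta):
        log_E, log_A, log_B, alpha, beta = theta
        pred = (np.exp(log_E)
                + np.exp(log_A) / params ** alpha
                + np.exp(log_B) / tokens ** beta)
        return np.mean((np.log(pred) - np.log(losses)) ** 2)  # squared error in log space

    theta0 = np.array([0.5, 2.0, 2.0, 0.3, 0.3])  # rough starting point for the optimizer
    result = minimize(objective, theta0, method="L-BFGS-B")
    log_E, log_A, log_B, alpha, beta = result.x
    return np.exp(log_E), np.exp(log_A), np.exp(log_B), alpha, beta
```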
Everything lives in a single Jupyter notebook:

    pip install -r requirements.txt
    jupyter notebook notebook.ipynb

Run all cells top-to-bottom. The notebook will download data, train the sweep, and save plots to `outputs/plots/`.


