MLM ("Master Language Model") is an experiment in domain-specialised language models. The hypothesis: a small model trained from scratch on a narrow, high-quality corpus can outperform a much larger general-purpose LLM on tasks inside that domain — at a fraction of the parameter count, memory footprint, and inference cost.
The first target domain is Three.js. We train a byte-level RNN purely on the Three.js source tree and measure whether it learns the library's idioms (class hierarchies, naming, math conventions, shader patterns, addon structure) well enough to generate plausible Three.js code.
Training is performed with EGGROLL (Evolution Guided GeneRal Optimisation
via Low-rank Learning), a gradient-free Evolution Strategies method — see
egg.c/README.md and the
paper. Because EGGROLL needs no
backpropagation, the entire trainer fits in a single C file with int8
weights, runs on Apple Silicon via NEON + Grand Central Dispatch, and has no
Python / framework dependencies on the training side.
For the experimental design — why we believe a small specialist can win at narrow tasks, where it loses, and the orchestrator + specialist-family strategy that addresses the failure modes — see HYPOTHESIS.md.

    mlm/
    ├── README.md                        this file
    ├── scripts/
    │   └── build_threejs_dataset.sh     pulls three.js and produces input.txt
    └── egg.c/                           EGGROLL trainer (upstream, vendored)
        ├── full_trained_egg.c           single-file CPU/Apple Silicon trainer
        ├── full_cuda_train_*.cu         CUDA variants (optional, for GPU boxes)
        └── README.md                    trainer-level documentation

egg.c/ is the EGGROLL reference implementation; we treat it as the
training engine and feed it domain-specific data through input.txt.
Tested on M-series MacBook Pro, macOS 13+.

    xcode-select --install   # provides clang and the Dispatch / NEON headers

git is required for the dataset script; it ships with the Command Line
Tools above.
From the repo root:

    ./scripts/build_threejs_dataset.sh

This shallow-clones mrdoob/three.js into .cache/three.js and concatenates
every *.js under src/ and examples/jsm/ into egg.c/input.txt, with
each file prefixed by a // === FILE: <path> === header so the model can
learn file boundaries. Expect roughly 6–10 MiB of code.
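
The concatenation step described above can be sketched roughly as follows. This is a hypothetical illustration, not the contents of build_threejs_dataset.sh: the function name concat_sources and its argument order are assumptions, and the clone step is left out so the sketch works on any local checkout.

```shell
#!/bin/sh
# Hypothetical sketch only; the real build_threejs_dataset.sh may differ.
#
# concat_sources REPO_DIR OUT_FILE SUBDIR...
# Writes every *.js file found under the given subdirectories of REPO_DIR
# into OUT_FILE, each preceded by a "// === FILE: <path> ===" header line.
concat_sources() {
  repo_dir=$1
  out=$2
  shift 2
  : > "$out"                                  # create or truncate the output
  for dir in "$@"; do
    [ -d "$repo_dir/$dir" ] || continue       # skip subdirectories that do not exist
    # Sort in the C locale so the file order is stable across runs.
    find "$repo_dir/$dir" -type f -name '*.js' | LC_ALL=C sort |
    while IFS= read -r file; do
      rel=${file#"$repo_dir"/}                # path relative to the repo root
      printf '// === FILE: %s ===\n' "$rel"
      cat "$file"
      printf '\n'                             # separator in case a file lacks a trailing newline
    done >> "$out"
  done
}
```

Calling concat_sources .cache/three.js egg.c/input.txt src examples/jsm would mirror the default harvest described above.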
Re-running the script refreshes the clone and rebuilds input.txt. To pin a
specific Three.js release:

    REPO_REF=r170 ./scripts/build_threejs_dataset.sh

To narrow or widen the harvest:

    INCLUDE="src" ./scripts/build_threejs_dataset.sh
    INCLUDE="src examples/jsm test/unit" ./scripts/build_threejs_dataset.sh

Then compile the trainer:

    cd egg.c
    clang -O3 full_trained_egg.c -o egg

The trainer hard-includes <arm_neon.h> and <dispatch/dispatch.h>; both are
provided by the macOS Command Line Tools, so no extra flags are needed on
Apple Silicon. The Dispatch framework is auto-linked by clang on macOS.
Recommended — uses caffeinate to keep the laptop awake, captures a
permanent log, and auto-recompiles on source changes:

    ./scripts/train.sh   # logs to ./training.log

The trainer writes checkpoints every print step (every 10 steps) using a two-tier scheme designed to survive both crashes and gradual corruption:
- A round-robin ring buffer of the last 10 checkpoints (egg.c/egg.ckpt.0 through egg.c/egg.ckpt.9). Re-running auto-resumes from the slot with the highest step.
- A separate egg.c/egg.ckpt.best, updated only when training loss strictly improves. Survives gradual model corruption (e.g. data contamination) that would silently overwrite the entire ring.
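
To check which ring slot was written most recently before resuming, a small helper along these lines may be useful. It is only a sketch: latest_slot is a hypothetical name, and it uses file modification time as a stand-in for the step number, whereas the trainer is described as picking the slot with the highest step. The two can disagree after a manual copy between slots.

```shell
#!/bin/sh
# Hypothetical helper; not part of the repo.
#
# latest_slot CKPT_DIR
# Prints the ring-buffer slot (egg.ckpt.0 .. egg.ckpt.9) with the newest
# modification time, or prints nothing if no slot exists yet.
# egg.ckpt.best is deliberately excluded by the [0-9] glob.
latest_slot() {
  newest=
  for f in "$1"/egg.ckpt.[0-9]; do
    [ -f "$f" ] || continue                   # glob matched nothing
    if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
      newest=$f
    fi
  done
  [ -n "$newest" ] && printf '%s\n' "$newest"
  return 0
}
```

For example, latest_slot egg.c prints the path of the most recently written slot, which is a reasonable first guess at where the next run will resume.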
Common operations:

    ls egg.c/egg.ckpt*                        # see what's been saved
    EGG_RESUME_BEST=1 ./scripts/train.sh      # resume from best loss instead of latest
    rm egg.c/egg.ckpt*                        # wipe everything, start over
    cp egg.c/egg.ckpt.3 egg.c/egg.ckpt.0      # manually roll back to slot 3 on next resume

Total disk for the checkpoints: ~12 MB × 11 ≈ 130 MB (gitignored).
Direct invocation also works if you'd rather not use the wrapper:

    cd egg.c && ./egg   # reads ./input.txt from cwd, same checkpoint behaviour

The default configuration in full_trained_egg.c is:

| Constant | Value | Meaning |
|---|---|---|
| VOCAB_SIZE | 256 | byte-level tokens |
| HIDDEN_DIM | 512 | model width |
| N_LAYERS | 4 | depth |
| SEQ_LEN | 4096 | truncated BPTT length (no backprop here) |
| POPULATION_SIZE | 128 | ES perturbations per step |
| BATCH_SIZE | 8 | parallel streams |

Edit those at the top of the file and recompile to fit your machine. On a
16 GB M-series laptop the defaults run comfortably; on 8 GB drop
POPULATION_SIZE to 64 and HIDDEN_DIM to 256 first.
Every 10 steps the trainer samples 30 tokens of generated text so you can watch the model converge towards Three.js code as training progresses.
A few properties matter specifically for "small specialist model" research:
- No autograd, no PyTorch. Training is a single C binary. Iteration on architecture and hyperparameters does not fight a framework.
- Integer-only weights. The trained model is already in deployment-grade int8; no post-hoc quantisation step is needed before measuring inference cost.
- Gradient-free. Lets us experiment with non-differentiable architectural choices (clipped activations, hard quantisation, custom tokenisers) without having to design a smooth surrogate.
- Population-parallel. Scales by adding cores, not by scaling a single hot path. GCD on Apple Silicon gets us close to peak throughput on a laptop.
- For larger Three.js corpora, add upstream addons by extending INCLUDE in the dataset script (e.g. three-mesh-bvh, troika-three-text).
- For multi-GPU runs, see egg.c/full_cuda_train_transformer_adam_mgpu.cu and the distributed implementation under egg.c/d-eggs/.
- For a different domain, the recipe is the same: replace build_threejs_dataset.sh with a script that produces an input.txt for your target codebase.
- EGGROLL paper: https://eshyperscale.github.io/
- Trainer documentation: egg.c/README.md
- Three.js source: https://github.com/mrdoob/three.js
