MLM — Master Language Model

MLM ("Master Language Model") is an experiment in domain-specialised language models. The hypothesis: a small model trained from scratch on a narrow, high-quality corpus can outperform a much larger general-purpose LLM on tasks inside that domain — at a fraction of the parameter count, memory footprint, and inference cost.

The first target domain is Three.js. We train a byte-level RNN purely on the Three.js source tree and measure whether it learns the library's idioms (class hierarchies, naming, math conventions, shader patterns, addon structure) well enough to generate plausible Three.js code.

Training is performed with EGGROLL (Evolution Guided GeneRal Optimisation via Low-rank Learning), a gradient-free Evolution Strategies method — see egg.c/README.md and the paper. Because EGGROLL needs no backpropagation, the entire trainer fits in a single C file with int8 weights, runs on Apple Silicon via NEON + Grand Central Dispatch, and has no Python / framework dependencies on the training side.

For the experimental design — why we believe a small specialist can win at narrow tasks, where it loses, and the orchestrator + specialist-family strategy that addresses the failure modes — see HYPOTHESIS.md.

Repository layout

mlm/
├── README.md                       this file
├── HYPOTHESIS.md                   experimental design and rationale
├── scripts/
│   ├── build_threejs_dataset.sh    pulls three.js and produces input.txt
│   └── train.sh                    training wrapper (caffeinate, logging, auto-recompile)
└── egg.c/                          EGGROLL trainer (upstream, vendored)
    ├── full_trained_egg.c          single-file CPU/Apple Silicon trainer
    ├── full_cuda_train_*.cu        CUDA variants (optional, for GPU boxes)
    └── README.md                   trainer-level documentation

egg.c/ is the EGGROLL reference implementation; we treat it as the training engine and feed it domain-specific data through input.txt.

Quick start (Apple Silicon)

Tested on M-series MacBook Pro, macOS 13+.

1. Prerequisites

xcode-select --install   # provides clang and the Dispatch / NEON headers

git is required for the dataset script; it ships with the Command Line Tools above.

2. Build the Three.js dataset

From the repo root:

./scripts/build_threejs_dataset.sh

This shallow-clones mrdoob/three.js into .cache/three.js and concatenates every *.js under src/ and examples/jsm/ into egg.c/input.txt, with each file prefixed by a // === FILE: <path> === header so the model can learn file boundaries. Expect roughly 6–10 MiB of code.
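The file-boundary header is easy to recognise on the consuming side. The sketch below is illustrative only: `parse_file_header` is a hypothetical helper, not part of the dataset script (which is shell) or of the trainer, and it assumes the header is spelled exactly `// === FILE: <path> ===` with single spaces.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 and copies the path into `out` when `line`
 * has the form "// === FILE: <path> ===", otherwise returns 0.
 * The exact spacing of the header is an assumption taken from the README. */
int parse_file_header(const char *line, char *out, size_t cap) {
    static const char prefix[] = "// === FILE: ";
    static const char suffix[] = " ===";
    const size_t plen = sizeof prefix - 1;
    const size_t slen = sizeof suffix - 1;

    size_t len = strcspn(line, "\r\n");        /* ignore a trailing newline */
    if (len < plen + slen + 1) return 0;        /* need at least one path byte */
    if (strncmp(line, prefix, plen) != 0) return 0;
    if (strncmp(line + len - slen, suffix, slen) != 0) return 0;

    size_t path_len = len - plen - slen;
    if (path_len + 1 > cap) return 0;           /* output buffer too small */
    memcpy(out, line + plen, path_len);
    out[path_len] = '\0';
    return 1;
}
```

A tool that inspects input.txt could call this on each line to split the corpus back into per-file chunks, for example when spot-checking what the harvest picked up.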

Re-running the script refreshes the clone and rebuilds input.txt. To pin a specific Three.js release:

REPO_REF=r170 ./scripts/build_threejs_dataset.sh

To narrow or widen the harvest:

INCLUDE="src" ./scripts/build_threejs_dataset.sh
INCLUDE="src examples/jsm test/unit" ./scripts/build_threejs_dataset.sh

3. Compile the trainer

cd egg.c
clang -O3 full_trained_egg.c -o egg

The trainer hard-includes <arm_neon.h> and <dispatch/dispatch.h>; both are provided by the macOS Command Line Tools, so no extra flags are needed on Apple Silicon. Grand Central Dispatch lives in libSystem, which clang links by default on macOS, so no -l or -framework option is required.

4. Train

Recommended: run the wrapper script from the repo root. It uses caffeinate to keep the laptop awake, captures a permanent log, and auto-recompiles on source changes:

./scripts/train.sh     # logs to ./training.log

The trainer writes checkpoints every print step (every 10 steps) using a two-tier scheme designed to survive both crashes and gradual corruption:

A round-robin ring buffer of the last 10 checkpoints (egg.c/egg.ckpt.0 through egg.c/egg.ckpt.9). Re-running auto-resumes from the slot with the highest step.
A separate egg.c/egg.ckpt.best, updated only when training loss strictly improves. Survives gradual model corruption (e.g. data contamination) that would silently overwrite the entire ring.
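The two tiers boil down to three small decisions: which ring slot to overwrite, which slot to resume from, and when to refresh the best checkpoint. The following is a minimal sketch of that logic, not the trainer's code. The function names are hypothetical, and the slot rule (the n-th save goes to slot n mod 10) is an assumption; only the ring size, the 10-step cadence, the "highest step wins" resume rule, and the strict-improvement rule come from the description above.

```c
#define RING_SLOTS 10   /* egg.ckpt.0 .. egg.ckpt.9 */
#define SAVE_EVERY 10   /* checkpoint cadence in training steps */

/* Assumed slot rule: the n-th save overwrites slot n mod RING_SLOTS. */
int ring_slot_for_step(long step) {
    return (int)((step / SAVE_EVERY) % RING_SLOTS);
}

/* Resume rule: pick the slot holding the highest step.
 * slot_steps[i] is the step stored in slot i, or -1 if the slot is empty.
 * Returns -1 when no checkpoint exists at all. */
int pick_resume_slot(const long slot_steps[RING_SLOTS]) {
    int best = -1;
    for (int i = 0; i < RING_SLOTS; i++) {
        if (slot_steps[i] < 0) continue;
        if (best < 0 || slot_steps[i] > slot_steps[best]) best = i;
    }
    return best;
}

/* egg.ckpt.best is refreshed only on a strict improvement in training loss,
 * so a run that slowly degrades cannot overwrite it. */
int should_update_best(float loss, float best_loss) {
    return loss < best_loss;
}
```

Keeping the best checkpoint outside the ring is what protects it: ten consecutive bad saves replace every ring slot, but none of them passes the strict-improvement test.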

Common operations:

ls egg.c/egg.ckpt*                       # see what's been saved
EGG_RESUME_BEST=1 ./scripts/train.sh     # resume from best loss instead of latest
rm egg.c/egg.ckpt*                       # wipe everything, start over
cp egg.c/egg.ckpt.3 egg.c/egg.ckpt.0     # manually roll back to slot 3 on next resume

Total disk for the checkpoints: ~12 MB × 11 ≈ 130 MB (gitignored).

Direct invocation also works if you'd rather not use the wrapper:

cd egg.c && ./egg      # reads ./input.txt from cwd, same checkpoint behaviour

The default configuration in full_trained_egg.c is:

Constant	Value	Meaning
`VOCAB_SIZE`	256	byte-level tokens
`HIDDEN_DIM`	512	model width
`N_LAYERS`	4	depth
`SEQ_LEN`	4096	sequence length per stream (plays the role of a truncated-BPTT window; no backprop here)
`POPULATION_SIZE`	128	ES perturbations per step
`BATCH_SIZE`	8	parallel streams

Edit those at the top of the file and recompile to fit your machine. On a 16 GB M-series laptop the defaults run comfortably; on 8 GB drop POPULATION_SIZE to 64 and HIDDEN_DIM to 256 first.
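For orientation, the defaults and the 8 GB suggestion could be written down as follows. This is a sketch under assumptions: the README does not show how full_trained_egg.c declares its constants, so the plain `#define` form, the `MLM_LOW_MEMORY` switch, and the `bytes_per_step` helper are all hypothetical. Only the names and values come from the table and the paragraph above.

```c
/* Defaults from the table above. The #define form is an assumption;
 * check the top of full_trained_egg.c for the real declarations. */
#ifdef MLM_LOW_MEMORY              /* hypothetical switch, not in the trainer */
#define HIDDEN_DIM        256      /* 8 GB suggestion from the text */
#define POPULATION_SIZE   64
#else
#define HIDDEN_DIM        512
#define POPULATION_SIZE   128
#endif
#define VOCAB_SIZE        256
#define N_LAYERS          4
#define SEQ_LEN           4096
#define BATCH_SIZE        8

/* Byte-level tokens: every possible byte value needs its own id. */
_Static_assert(VOCAB_SIZE == 256, "byte-level vocabulary must cover 0..255");

/* Bytes of training text one population member reads per step, assuming
 * each of the BATCH_SIZE streams consumes SEQ_LEN bytes. */
long bytes_per_step(void) {
    return (long)BATCH_SIZE * SEQ_LEN;
}
```

With the defaults this comes to 32768 bytes per step per population member, which is small next to the 6–10 MiB corpus.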

Every 10 steps the trainer samples 30 tokens of generated text so you can watch the model converge towards Three.js code as training progresses.

Why EGGROLL for this experiment

A few properties matter specifically for "small specialist model" research:

No autograd, no PyTorch. Training is a single C binary. Iteration on architecture and hyperparameters does not fight a framework.
Integer-only weights. The trained model is already in deployment-grade int8; no post-hoc quantisation step is needed before measuring inference cost.
Gradient-free. Lets us experiment with non-differentiable architectural choices (clipped activations, hard quantisation, custom tokenisers) without having to design a smooth surrogate.
Population-parallel. Scales by adding cores, not by scaling a single hot path. GCD on Apple Silicon gets us close to peak throughput on a laptop.
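To make "gradient-free" and "low-rank" concrete, here is a toy Evolution Strategies loop with rank-1 perturbations. It is a sketch of the general idea, not EGGROLL as implemented in egg.c: it uses float weights instead of int8, a 4×4 matrix, uniform rather than Gaussian noise, antithetic pairs, and a quadratic toy objective, all chosen for brevity. Every name in it is hypothetical; see egg.c/README.md and the paper for the real method.

```c
#include <stdint.h>

#define ROWS  4
#define COLS  4
#define NW    (ROWS * COLS)
#define PAIRS 8                 /* antithetic pairs per step (toy value) */

static uint32_t lcg_state = 12345u;

/* Deterministic pseudo-random float in [-1, 1) from a small LCG. */
static float rand_unit(void) {
    lcg_state = lcg_state * 1664525u + 1013904223u;
    return (float)(lcg_state >> 8) / 8388608.0f - 1.0f;
}

/* Toy objective: squared distance between weights W and target T. */
static float toy_loss(const float *W, const float *T) {
    float s = 0.0f;
    for (int k = 0; k < NW; k++) {
        float d = W[k] - T[k];
        s += d * d;
    }
    return s;
}

/* One ES step. Each perturbation is the outer product a * b^T, so only
 * ROWS + COLS random numbers are drawn instead of ROWS * COLS. The loss is
 * only evaluated, never differentiated. */
void es_step(float *W, const float *T, float sigma, float lr) {
    float update[NW] = {0};
    float Wp[NW], Wm[NW];
    for (int p = 0; p < PAIRS; p++) {
        float a[ROWS], b[COLS];
        for (int i = 0; i < ROWS; i++) a[i] = rand_unit();
        for (int j = 0; j < COLS; j++) b[j] = rand_unit();
        for (int i = 0; i < ROWS; i++) {
            for (int j = 0; j < COLS; j++) {
                float e = a[i] * b[j];
                Wp[i * COLS + j] = W[i * COLS + j] + sigma * e;
                Wm[i * COLS + j] = W[i * COLS + j] - sigma * e;
            }
        }
        /* Score the pair: how much worse is +E than -E? */
        float score = (toy_loss(Wp, T) - toy_loss(Wm, T)) / (2.0f * sigma);
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                update[i * COLS + j] += score * a[i] * b[j] / PAIRS;
    }
    /* Move against the score-weighted average perturbation. */
    for (int k = 0; k < NW; k++) W[k] -= lr * update[k];
}

/* Runs `steps` ES steps from W = 0 and reports the loss before and after. */
void toy_train(int steps, float *loss_before, float *loss_after) {
    float W[NW] = {0};
    float T[NW];
    for (int k = 0; k < NW; k++) T[k] = 0.1f * (float)(k + 1);
    *loss_before = toy_loss(W, T);
    for (int s = 0; s < steps; s++) es_step(W, T, 0.1f, 0.05f);
    *loss_after = toy_loss(W, T);
}
```

The inner loop over `PAIRS` is the part that parallelises: each pair needs only forward evaluations and can be scored on its own core, which is the property the last point above relies on.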

Going further

For larger Three.js corpora, add upstream addons by extending INCLUDE in the dataset script (e.g. three-mesh-bvh, troika-three-text).
For multi-GPU runs, see egg.c/full_cuda_train_transformer_adam_mgpu.cu and the distributed implementation under egg.c/d-eggs/.
For a different domain, the recipe is the same: replace build_threejs_dataset.sh with a script that produces an input.txt for your target codebase.

References

EGGROLL paper: https://eshyperscale.github.io/
Trainer documentation: egg.c/README.md
Three.js source: https://github.com/mrdoob/three.js
