MLM — Master Language Model


MLM ("Master Language Model") is an experiment in domain-specialised language models. The hypothesis: a small model trained from scratch on a narrow, high-quality corpus can outperform a much larger general-purpose LLM on tasks inside that domain — at a fraction of the parameter count, memory footprint, and inference cost.

The first target domain is Three.js. We train a byte-level RNN purely on the Three.js source tree and measure whether it learns the library's idioms (class hierarchies, naming, math conventions, shader patterns, addon structure) well enough to generate plausible Three.js code.

Training is performed with EGGROLL (Evolution Guided GeneRal Optimisation via Low-rank Learning), a gradient-free Evolution Strategies method — see egg.c/README.md and the paper. Because EGGROLL needs no backpropagation, the entire trainer fits in a single C file with int8 weights, runs on Apple Silicon via NEON + Grand Central Dispatch, and has no Python / framework dependencies on the training side.
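To make "gradient-free" concrete, here is a toy sketch of a generic evolution-strategies step with rank-1 perturbations. It is not the egg.c code: the float parameters, the random-sign factors, the quadratic `toy_loss`, and the names `ROWS`, `COLS`, `POP` and `es_step` are all assumptions chosen for illustration. It only shows the general shape of the idea: probe the loss at W plus and minus a low-rank perturbation, then move against the measured slope, with no backward pass.

```c
#include <assert.h>
#include <stdint.h>

#define ROWS   4
#define COLS   4
#define NPARAM (ROWS * COLS)
#define POP    16   /* antithetic pairs per step (hypothetical value) */

/* Tiny deterministic PRNG (xorshift32); the seed must be non-zero. */
uint32_t rng_next(uint32_t *s)
{
    uint32_t x = *s;
    assert(x != 0);
    x ^= x << 13; x ^= x >> 17; x ^= x << 5;
    return *s = x;
}

/* Random sign: +1.0 or -1.0. */
double rng_sign(uint32_t *s)
{
    return ((rng_next(s) >> 16) & 1u) ? 1.0 : -1.0;
}

/* Toy objective: squared distance between w and a fixed target. */
double toy_loss(const double *w, const double *target)
{
    double sum = 0.0;
    for (int i = 0; i < NPARAM; i++) {
        double d = w[i] - target[i];
        sum += d * d;
    }
    return sum;
}

/* One ES step: evaluate the loss at w +/- sigma * (a b^T) for POP random
 * rank-1 directions, then step against the finite-difference slope. */
void es_step(double *w, const double *target, double sigma, double lr,
             uint32_t *seed)
{
    double update[NPARAM] = {0};
    double probe[NPARAM];

    for (int p = 0; p < POP; p++) {
        double a[ROWS], b[COLS], e[NPARAM];
        for (int r = 0; r < ROWS; r++) a[r] = rng_sign(seed);
        for (int c = 0; c < COLS; c++) b[c] = rng_sign(seed);
        for (int r = 0; r < ROWS; r++)
            for (int c = 0; c < COLS; c++)
                e[r * COLS + c] = a[r] * b[c];   /* rank-1 outer product */

        for (int i = 0; i < NPARAM; i++) probe[i] = w[i] + sigma * e[i];
        double f_plus = toy_loss(probe, target);
        for (int i = 0; i < NPARAM; i++) probe[i] = w[i] - sigma * e[i];
        double f_minus = toy_loss(probe, target);

        /* Slope of the loss along direction e, from two forward passes. */
        double slope = (f_plus - f_minus) / (2.0 * sigma);
        for (int i = 0; i < NPARAM; i++) update[i] += slope * e[i] / POP;
    }
    for (int i = 0; i < NPARAM; i++) w[i] -= lr * update[i];
}
```

Calling `es_step` in a loop with a small `lr` (for example 0.05) drives `toy_loss` down using forward evaluations only. The real trainer applies the same principle to int8 weights; see egg.c/README.md for how it actually does so.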

For the experimental design — why we believe a small specialist can win at narrow tasks, where it loses, and the orchestrator + specialist-family strategy that addresses the failure modes — see HYPOTHESIS.md.

Repository layout

mlm/
├── README.md                       this file
├── scripts/
│   ├── build_threejs_dataset.sh    pulls three.js and produces egg.c/input.txt
│   └── train.sh                    training wrapper (caffeinate, logging, auto-recompile)
└── egg.c/                          EGGROLL trainer (upstream, vendored)
    ├── full_trained_egg.c          single-file CPU/Apple Silicon trainer
    ├── full_cuda_train_*.cu        CUDA variants (optional, for GPU boxes)
    └── README.md                   trainer-level documentation

egg.c/ is the EGGROLL reference implementation; we treat it as the training engine and feed it domain-specific data through input.txt.

Quick start (Apple Silicon)

Tested on M-series MacBook Pro, macOS 13+.

1. Prerequisites

xcode-select --install   # provides clang and the Dispatch / NEON headers

git is required for the dataset script; it ships with the Command Line Tools above.

2. Build the Three.js dataset

From the repo root:

./scripts/build_threejs_dataset.sh

This shallow-clones mrdoob/three.js into .cache/three.js and concatenates every *.js under src/ and examples/jsm/ into egg.c/input.txt, with each file prefixed by a // === FILE: <path> === header so the model can learn file boundaries. Expect roughly 6–10 MiB of code.
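The boundary header is easy to reproduce or detect when inspecting input.txt. The sketch below is written in C for consistency with the trainer; the dataset script itself is a shell script, and the helper names `format_file_header` and `is_file_header` are hypothetical. It assumes the header occupies a line of its own, which the description above suggests but does not state.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write the boundary header for `path` into `buf`.
 * Returns the header length, or -1 if it does not fit in `cap` bytes. */
int format_file_header(char *buf, size_t cap, const char *path)
{
    assert(buf && path);
    int n = snprintf(buf, cap, "// === FILE: %s ===\n", path);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}

/* Return 1 if `line` starts with the boundary-header prefix, else 0. */
int is_file_header(const char *line)
{
    static const char prefix[] = "// === FILE: ";
    assert(line);
    return strncmp(line, prefix, sizeof prefix - 1) == 0;
}
```

Counting the lines of input.txt for which `is_file_header` returns 1 gives the number of source files in the corpus, which is a quick sanity check after changing INCLUDE.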

Re-running the script refreshes the clone and rebuilds input.txt. To pin a specific Three.js release:

REPO_REF=r170 ./scripts/build_threejs_dataset.sh

To narrow or widen the harvest:

INCLUDE="src" ./scripts/build_threejs_dataset.sh
INCLUDE="src examples/jsm test/unit" ./scripts/build_threejs_dataset.sh

3. Compile the trainer

cd egg.c
clang -O3 full_trained_egg.c -o egg

The trainer hard-includes <arm_neon.h> and <dispatch/dispatch.h>; both are provided by the macOS Command Line Tools, so no extra flags are needed on Apple Silicon. Grand Central Dispatch (libdispatch) is part of libSystem, which clang links by default on macOS, so no -l flag is required either.

4. Train

Recommended — uses caffeinate to keep the laptop awake, captures a permanent log, and auto-recompiles on source changes:

./scripts/train.sh     # logs to ./training.log

The trainer writes checkpoints every print step (every 10 steps) using a two-tier scheme designed to survive both crashes and gradual corruption:

  • A round-robin ring buffer of the last 10 checkpoints (egg.c/egg.ckpt.0 through egg.c/egg.ckpt.9). Re-running auto-resumes from the slot with the highest step.
  • A separate egg.c/egg.ckpt.best, updated only when training loss strictly improves. At one checkpoint every 10 steps the ring is fully overwritten within 100 steps, so this copy is the one that survives gradual model corruption (e.g. data contamination).
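The two-tier logic can be sketched in a few lines of C. This is an illustration of the scheme as described, not the trainer's code: the step-to-slot mapping in `ring_slot_for_step`, the use of -1 for an empty slot, and all function names are assumptions.

```c
#include <assert.h>

#define RING_SLOTS 10   /* egg.ckpt.0 .. egg.ckpt.9 */
#define CKPT_EVERY 10   /* one checkpoint per print step */

/* Slot that a checkpoint taken at `step` lands in (hypothetical mapping). */
int ring_slot_for_step(long step)
{
    assert(step >= 0);
    return (int)((step / CKPT_EVERY) % RING_SLOTS);
}

/* Given the step stored in each slot (-1 = empty), return the slot to
 * resume from: the one with the highest step, or -1 if all are empty. */
int pick_resume_slot(const long steps[RING_SLOTS])
{
    int best = -1;
    for (int i = 0; i < RING_SLOTS; i++) {
        if (steps[i] < 0) continue;
        if (best < 0 || steps[i] > steps[best]) best = i;
    }
    return best;
}

/* The "best" checkpoint is rewritten only on strict improvement. */
int should_update_best(double loss, double best_loss)
{
    return loss < best_loss;
}
```

With these definitions a run that has reached step 120 holds steps 30 through 120 in the ring, and `pick_resume_slot` returns slot 2.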

Common operations:

ls egg.c/egg.ckpt*                       # see what's been saved
EGG_RESUME_BEST=1 ./scripts/train.sh     # resume from best loss instead of latest
rm egg.c/egg.ckpt*                       # wipe everything, start over
cp egg.c/egg.ckpt.3 egg.c/egg.ckpt.0     # copy slot 3 over slot 0; resume still picks the highest-step slot, so move newer slots aside to force a rollback

Total disk for the checkpoints: ~12 MB × 11 ≈ 130 MB (gitignored).

Direct invocation also works if you'd rather not use the wrapper:

cd egg.c && ./egg      # reads ./input.txt from cwd, same checkpoint behaviour

The default configuration in full_trained_egg.c is:

Constant          Value   Meaning
VOCAB_SIZE        256     byte-level tokens
HIDDEN_DIM        512     model width
N_LAYERS          4       depth
SEQ_LEN           4096    sequence length per evaluation (the window truncated BPTT would use; no backprop here)
POPULATION_SIZE   128     ES perturbations per step
BATCH_SIZE        8       parallel streams

Edit those at the top of the file and recompile to fit your machine. On a 16 GB M-series laptop the defaults run comfortably; on 8 GB drop POPULATION_SIZE to 64 and HIDDEN_DIM to 256 first.
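As an illustration, an 8 GB profile might look like the fragment below. It assumes the constants are plain #define lines, which the text above implies but does not show; check the actual declarations in full_trained_egg.c before editing.

```c
/* Hypothetical 8 GB profile: only the two values named above are changed. */
#define VOCAB_SIZE       256
#define HIDDEN_DIM       256    /* default 512 */
#define N_LAYERS         4
#define SEQ_LEN          4096
#define POPULATION_SIZE  64     /* default 128 */
#define BATCH_SIZE       8
```

Recompile with the same clang command after changing any of these values.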

Every 10 steps the trainer also prints a 30-token sample of generated text (30 bytes, since each token is one byte), so you can watch the output converge towards Three.js code as training progresses.
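With VOCAB_SIZE set to 256, tokenisation is just a byte copy: every byte value is its own token id, and no vocabulary file is involved. A minimal sketch follows; the function name `bytes_to_tokens` is hypothetical and not taken from the trainer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VOCAB_SIZE 256   /* one token id per possible byte value */

/* Copy the raw bytes of `text` into `tokens` as ids in [0, VOCAB_SIZE).
 * Returns the number of tokens written (at most `cap`). */
size_t bytes_to_tokens(const char *text, uint8_t *tokens, size_t cap)
{
    assert(text && tokens);
    size_t n = strlen(text);
    if (n > cap) n = cap;
    for (size_t i = 0; i < n; i++)
        tokens[i] = (uint8_t)text[i];   /* the byte value is the token id */
    return n;
}
```

One consequence is that non-ASCII characters in comments or strings cost more than one token each: a UTF-8 "é" is two bytes and therefore two tokens.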

Why EGGROLL for this experiment

A few properties matter specifically for "small specialist model" research:

  • No autograd, no PyTorch. Training is a single C binary. Iteration on architecture and hyperparameters does not fight a framework.
  • Integer-only weights. The trained model is already in deployment-grade int8; no post-hoc quantisation step is needed before measuring inference cost.
  • Gradient-free. Lets us experiment with non-differentiable architectural choices (clipped activations, hard quantisation, custom tokenisers) without having to design a smooth surrogate.
  • Population-parallel. Scales by adding cores, not by scaling a single hot path. GCD on Apple Silicon gets us close to peak throughput on a laptop.
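The "integer-only" and "clipped activations" points can be pictured with a small int8 matrix-vector product. This is a sketch under stated assumptions, not the egg.c kernel: the power-of-two rescale via `shift`, the symmetric clip range of [-127, 127], and the names `clip_i8` and `matvec_i8` are all illustrative choices.

```c
#include <assert.h>
#include <stdint.h>

/* Saturate a 32-bit accumulator into int8 (a hard, non-smooth clip). */
int8_t clip_i8(int32_t v)
{
    if (v > 127) return 127;
    if (v < -127) return -127;
    return (int8_t)v;
}

/* y = clip((W x) / 2^shift), with W stored row-major as int8.
 * Accumulation happens in int32 so the products cannot overflow
 * for the small column counts used here. */
void matvec_i8(const int8_t *w, const int8_t *x, int8_t *y,
               int rows, int cols, int shift)
{
    assert(shift >= 0 && shift < 31);
    int32_t scale = (int32_t)1 << shift;
    for (int r = 0; r < rows; r++) {
        int32_t acc = 0;
        for (int c = 0; c < cols; c++)
            acc += (int32_t)w[r * cols + c] * (int32_t)x[c];
        y[r] = clip_i8(acc / scale);   /* division truncates toward zero */
    }
}
```

The clip has zero slope almost everywhere outside its linear range, which is awkward for backpropagation but irrelevant to a method that only compares forward-pass losses.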

Going further

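  • For larger Three.js corpora, add third-party libraries such as three-mesh-bvh or troika-three-text. These live in their own repositories rather than in mrdoob/three.js, so extending INCLUDE alone will not reach them; the dataset script would also need to clone them.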
  • For multi-GPU runs, see egg.c/full_cuda_train_transformer_adam_mgpu.cu and the distributed implementation under egg.c/d-eggs/.
  • For a different domain, the recipe is the same: replace build_threejs_dataset.sh with a script that produces an input.txt for your target codebase.
