AnarchoBot (MLX, Apple Silicon)

138M ChatML training stack for Apple Silicon using MLX.

Canonical remote branch: main. Historical legacy-default state is preserved at tag archive/origin-master-2026-03-20.

Active Workflow

The repo now treats one path as first-class:

clean Quality2K continuation -> explicitly approved pinned checkpoint -> v19 align/full/repair SFT curriculum

Active entrypoints:

  • scripts/build_pretrain_quality2k.py
  • scripts/run_pretrain_quality2k_terminal.sh
  • scripts/audit_dense_mainline.py
  • scripts/review_plain_generation.py
  • scripts/select_quality2k_checkpoint.py
  • scripts/pin_quality2k_checkpoint.py
  • scripts/build_sft_v19_release.py
  • scripts/run_sft_release.py
  • scripts/run_sft_release_v19.py
  • scripts/run_multiturn_coherence_eval.py (fixed multi-turn transcript suite; see SFT Runbook)

Research branch entrypoints:

  • scripts/extend_tokenizer_with_vm_tokens.py
  • scripts/build_vm_pilot_dataset.py
  • scripts/init_vm_from_dense.py
  • scripts/extend_tokenizer_with_wasm_tokens.py
  • scripts/normalize_local_docs.py
  • scripts/build_wasm_subset_corpus.py
  • scripts/build_wasm80m_pretrain_corpus.py
  • scripts/build_wasm80m_sft_corpora.py
  • scripts/run_wasm80m_pretrain.py
  • scripts/run_wasm80m_sft.py
  • scripts/eval_wasm80m.py

Historical probe-era and experimental material is retained only as archived reference. See Archive Notes.

Historical dense shims:

  • scripts/build_sft_v18_release.py
  • scripts/run_sft_release_v18.py
  • scripts/run_sft_release_v18_terminal.sh

These remain compatibility shims only and are non-authoritative for release decisions.

The WASM80m scripts listed under “Research branch entrypoints” belong to a parallel tokenizer/model line (docs/wasm80m_runbook.md); they are not part of the dense 138M v19 chat release path.

The only architecture on the release path is the dense 138M line. The experimental dense_vm and dense_wasm80m lines are isolated to separate branch/config families and do not share checkpoint compatibility with the dense mainline.

Current Artifacts

  • Preserved raw pretrain base: checkpoints/pretrain_mlx_138m_chatml/mlx_step_130000.pkl
  • Active continuation config: configs/pretrain_mlx_138m_quality2k.yaml
  • Active continuation outputs: checkpoints/pretrain_mlx_138m_quality2k
  • Canonical SFT handoff: checkpoints/pretrain_mlx_138m_quality2k/selected_for_sft.pkl
  • Active v19 SFT configs:
    • configs/sft_release_v19_align.yaml
    • configs/sft_release_v19_full.yaml
    • configs/sft_release_v19_repair.yaml
  • Canonical chat/eval starting checkpoint (repair stage, step 50): checkpoints/sft_release_v19_repair/sft_step_50.pkl
  • Symlink pin for that artifact (used by scripts/eval_release_candidate.py by default): checkpoints/sft_release_v19_repair/selected_for_future_work.pkl — must resolve to the same file as sft_step_50.pkl when the pin is current; metadata lives in selected_for_future_work.json.
  • Eval commands, gate CLI, release bundle, and optional MLX smoke tests: docs/eval.md. Pin promotion, raw_reply vs reply, and gate_report.json retention: docs/sft_runbook.md (sections after Candidate Eval).
  • Mainline pin metadata for approved selections includes lineage fields: run_id, source_checkpoint, selected_step, gate_report_path, manifest_hash, and mainline_valid.
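
The pin invariants above can be sketched as a quick Python check; the paths come from this README, while the function name and the problem-list shape are illustrative:

```python
import json
import os

# Lineage fields the README lists for approved mainline pins.
REQUIRED_LINEAGE_FIELDS = {
    "run_id", "source_checkpoint", "selected_step",
    "gate_report_path", "manifest_hash", "mainline_valid",
}

def check_pin(pin_path, expected_target, metadata_path):
    """Return a list of problems; an empty list means the pin looks current."""
    problems = []
    # The symlink pin must resolve to the same file as the expected checkpoint.
    if not os.path.exists(pin_path):
        problems.append(f"pin missing: {pin_path}")
    elif not os.path.exists(expected_target):
        problems.append(f"target missing: {expected_target}")
    elif not os.path.samefile(pin_path, expected_target):
        problems.append("pin does not resolve to the expected checkpoint")
    # The sibling metadata must carry every lineage field.
    try:
        with open(metadata_path) as fh:
            meta = json.load(fh)
    except OSError:
        return problems + [f"metadata missing: {metadata_path}"]
    missing = REQUIRED_LINEAGE_FIELDS - meta.keys()
    if missing:
        problems.append(f"metadata missing fields: {sorted(missing)}")
    return problems
```

For example, an empty result from `check_pin("checkpoints/sft_release_v19_repair/selected_for_future_work.pkl", "checkpoints/sft_release_v19_repair/sft_step_50.pkl", "checkpoints/sft_release_v19_repair/selected_for_future_work.json")` would indicate the pin is current.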

Setup

python -m venv .venv
source .venv/bin/activate
pip install -e .
PYTHONPATH=src python scripts/setup_verification.py

Canonical Pretrain Continuation

Build the curated continuation corpus:

source .venv/bin/activate
PYTHONPATH=src python scripts/build_pretrain_quality2k.py

The active 138M continuation runtime contract is:

  • context: 2048 tokens
  • dropout: 0.0
  • compile: true
  • compile_granularity: microbatch
  • precision: bfloat16
  • micro_batch_size: 1
  • grad_accum_steps: 16
  • gradient_checkpointing: false
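
As a rough sketch, the contract might appear in configs/pretrain_mlx_138m_quality2k.yaml along these lines; the values are the ones listed above, but the exact key names are assumptions:

```yaml
# Hypothetical rendering of the runtime contract; key names are illustrative.
context_length: 2048
dropout: 0.0
compile: true
compile_granularity: microbatch
precision: bfloat16
micro_batch_size: 1
grad_accum_steps: 16
gradient_checkpointing: false
```

With micro_batch_size 1 and grad_accum_steps 16, the effective batch per optimizer step is 16 sequences.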

Run the continuation from Terminal:

cd /Users/admin/Downloads/VSCode/AnarchoBot
./scripts/run_pretrain_quality2k_terminal.sh

Start a fresh continuation explicitly:

cd /Users/admin/Downloads/VSCode/AnarchoBot
./scripts/run_pretrain_quality2k_terminal.sh --clean-run

Monitor the run:

source .venv/bin/activate
PYTHONPATH=src python scripts/metrics_window.py \
  --log-dir checkpoints/pretrain_mlx_138m_quality2k/logs \
  --config configs/pretrain_mlx_138m_quality2k.yaml

Validate the staged continuation checkpoints before extending the run:

source .venv/bin/activate
PYTHONPATH=src python scripts/validate_mainline_training.py grad-coverage \
  --config configs/pretrain_mlx_138m_quality2k.yaml \
  --checkpoint checkpoints/pretrain_mlx_138m_chatml/mlx_step_130000.pkl

PYTHONPATH=src python scripts/validate_mainline_training.py checkpoint-diff \
  --config configs/pretrain_mlx_138m_quality2k.yaml \
  --start-checkpoint checkpoints/pretrain_mlx_138m_chatml/mlx_step_130000.pkl \
  --end-checkpoint checkpoints/pretrain_mlx_138m_quality2k/mlx_step_11000.pkl

For the completed 12000-step continuation run, the preserved candidate pool is steps 8000, 9000, 10000, 11000, and 12000; earlier checkpoints rotated out under ckpt_keep: 5.
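
The rotation behavior amounts to simple last-N retention; the 1000-step save interval below is an assumption inferred from the preserved pool:

```python
def retained_checkpoints(saved_steps, ckpt_keep):
    """Last-N retention: only the newest ckpt_keep saves survive rotation."""
    return sorted(saved_steps)[-ckpt_keep:]

# With saves every 1000 steps up to 12000 and ckpt_keep: 5, the survivors
# are exactly the preserved candidate pool.
pool = retained_checkpoints(range(1000, 13000, 1000), 5)
# pool == [8000, 9000, 10000, 11000, 12000]
```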

Canonical SFT Handoff

Select the checkpoint with the deterministic continuation handoff rule:

source .venv/bin/activate
PYTHONPATH=src python scripts/select_quality2k_checkpoint.py \
  --manifest examples/quality2k_selection_manifest.json \
  --print-pin-command

The selector uses held-out perplexity with earliest-step tie-break, and only blocks candidates for checkpoint-diff failure, non-finite/missing perplexity, or catastrophic plain-generation regression versus the base review.
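
A minimal sketch of that selection rule, assuming a per-candidate record shape (the field names here are illustrative, not the selector's actual schema):

```python
import math

def select_checkpoint(candidates):
    """Lowest held-out perplexity wins; earliest step breaks ties.

    A candidate is blocked only for a checkpoint-diff failure, a
    non-finite/missing perplexity, or a catastrophic plain-generation
    regression versus the base review.
    """
    eligible = [
        c for c in candidates
        if c["diff_ok"]
        and not c["catastrophic_regression"]
        and c["heldout_ppl"] is not None
        and math.isfinite(c["heldout_ppl"])
    ]
    if not eligible:
        raise ValueError("no eligible candidate in the pool")
    # The (ppl, step) key realizes the earliest-step tie-break deterministically.
    return min(eligible, key=lambda c: (c["heldout_ppl"], c["step"]))
```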

Pin the chosen continuation checkpoint only after the clean rerun validations pass:

source .venv/bin/activate
PYTHONPATH=src python scripts/pin_quality2k_checkpoint.py \
  --checkpoint checkpoints/pretrain_mlx_138m_quality2k/mlx_step_11000.pkl \
  --mainline-valid \
  --artifact-role mainline_candidate \
  --validation-basis "base grad coverage + compile parity passed; checkpoint diff passed; held-out perplexity won preserved 8000-12000 pool; no catastrophic plain-generation regression vs base"

Export a Hugging Face token at runtime before rebuilding the canonical natural-chat slice:

export HF_TOKEN=...

Build the v19 SFT corpora:

source .venv/bin/activate
PYTHONPATH=src python scripts/build_sft_v19_release.py --clean-output

The standalone builder writes reports/sft_v19_release_build/build_summary.json. The shared runner writes per-run build reports under reports/sft_v19_release_builds/<run_id>/build_summary.json.

The latest validated v19 run reported these manifest counts:

  • align: 3600 examples
  • release: 22571 examples
  • eval: 1600 examples
  • repair: 2912 examples from 3000 selected repair rows after shard filtering

The shared runner now validates manifest_examples against these bands:

  • align: 3000-5000
  • release: 20000-28000
  • eval: >=1280
  • repair: 2500-3500
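
The band check can be sketched as follows; the bands are the ones listed above, while the table and function names are illustrative:

```python
# Lower/upper bounds per split; None means unbounded above.
MANIFEST_BANDS = {
    "align": (3000, 5000),
    "release": (20000, 28000),
    "eval": (1280, None),
    "repair": (2500, 3500),
}

def validate_manifest_counts(counts):
    """Return the names of splits whose example counts fall outside their band."""
    failures = []
    for split, (lo, hi) in MANIFEST_BANDS.items():
        n = counts.get(split)
        if n is None or n < lo or (hi is not None and n > hi):
            failures.append(split)
    return sorted(failures)
```

The counts reported for the latest validated v19 run (3600 / 22571 / 1600 / 2912) would pass all four bands.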

Run the v19 curriculum:

cd /Users/admin/Downloads/VSCode/AnarchoBot
PYTHONPATH=src .venv/bin/python scripts/run_sft_release_v19.py

Default v19 release controls include:

  • dual-track raw/guarded gating
  • rewrite-rate cap (<=0.15 by default)
  • one bounded repair extension window (+25 once) before final failure
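
A hypothetical sketch of how those controls might compose; only the 0.15 cap and the single +25 extension come from the defaults above, and the function and return shape are illustrative:

```python
def repair_gate(gates_passed, rewrite_rate, extensions_used,
                rewrite_cap=0.15, extension_steps=25):
    """Accept, grant the one bounded extension, or fail.

    gates_passed: whether the dual-track raw/guarded gates passed.
    rewrite_rate: fraction of replies rewritten by the guarded track.
    extensions_used: how many extension windows were already granted.
    """
    if gates_passed and rewrite_rate <= rewrite_cap:
        return ("accept", 0)
    if extensions_used == 0:
        # One bounded repair extension window before final failure.
        return ("extend", extension_steps)
    return ("fail", 0)
```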

selected_for_sft.pkl is now blocked from the canonical SFT path unless its sibling metadata file marks it mainline_valid: true.
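
That gate can be sketched as below; the sibling-metadata path convention (.pkl replaced by .json) is an assumption modeled on the selected_for_future_work.json pattern, not a confirmed repo convention:

```python
import json
import os

def assert_mainline_valid(checkpoint_path):
    """Refuse a handoff checkpoint unless its sibling metadata marks it
    mainline_valid: true."""
    meta_path = os.path.splitext(checkpoint_path)[0] + ".json"
    if not os.path.exists(meta_path):
        raise RuntimeError(f"no sibling metadata for {checkpoint_path}")
    with open(meta_path) as fh:
        meta = json.load(fh)
    if meta.get("mainline_valid") is not True:
        raise RuntimeError("checkpoint is not marked mainline_valid: true")
```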

Run the static dense-mainline audit at any time without touching training:

source .venv/bin/activate
PYTHONPATH=src python scripts/audit_dense_mainline.py \
  --json-output reports/pretrain_quality2k_review/static_dense_audit.json

Tests

source .venv/bin/activate
pip install pytest
PYTHONPATH=src pytest

Optional MLX checkpoint smoke tests (loads weights on GPU, uses checkpoints/sft_release_v19_repair/sft_step_50.pkl unless ANARCHOBOT_CANONICAL_CKPT is set):

ANARCHOBOT_RUN_MLX_TESTS=1 PYTHONPATH=src pytest -m mlx_checkpoint tests/test_canonical_checkpoint.py

Generated Artifact Policy

Repo-tracked content is source code, prompts, configs, tests, docs, and curated evidence.

Runtime artifacts are intentionally untracked:

  • continuation checkpoints
  • generated shard directories
  • runtime reports
  • transient build JSONL/message dumps

Preserved historical evidence lives under legacy_evidence/.
