Improve notes on multipron training #59

Merged
lenzo-ka merged 6 commits into cmusphinx:master from lenzo-ka:kal-incr-impr
May 21, 2026

Conversation

@lenzo-ka
Contributor

Docs and discoverability follow-up for CFG_MULTIPRON_TRAINING

Small, docs-focused follow-up to PR #58 (kal-more-multi). After that
PR landed, CFG_MULTIPRON_TRAINING was correctly threaded into every
bw stage, but it was only documented inside
templates/librispeech/etc/sphinx_train.cfg. A newcomer reading the
top-level README, bw -help, or any other template wouldn't discover
the knob exists.

This PR closes that documentation gap and adds a few unrelated small
ergonomic notes spotted in the same audit. No behavioural change in
any training path; the only C edit is one help-text string in bw.

Commits

#  Subject
1  docs: document CFG_MULTIPRON_TRAINING in README
2  bw: mention CFG_MULTIPRON_TRAINING in -multipron help text
3  templates: reference CFG_MULTIPRON_TRAINING in wsj/tidigits configs
4  templates: refresh README with current "where to start" pointer
5  templates: add tied-state and density sizing guide to librispeech cfg
6  docs: note --depth 1 caveat when packaging SphinxTrain

What changes

  • README.md – new section "Multipron training (CFG_MULTIPRON_TRAINING)"
    sibling to the existing stage-21 alignment section. Explains the
    Baum-Welch graph builder, the default, how it composes with
    $CFG_MULTIPRON, and the cost trade-off. Separately, a short note
    in the "Linux/Unix Installation" section about git clone --depth 1
    vs. git fetch --unshallow for downstream packagers.
  • src/programs/bw/train_cmd_ln.c – one sentence appended to the
    -multipron argument doc so bw -help points users at the cfg knob
    that controls it across stages.
  • templates/tidigits/etc/sphinx_train.template and
    templates/wsj/etc/sphinx_train.template – four-line commented
    # $CFG_MULTIPRON_TRAINING = 'no'; reference with a pointer to the
    README and librispeech cfg. Inactive lines; no behavioural impact.
  • templates/README – refreshed from the 2007 sign-off to add a
    "Where to start" pointer to librispeech/ as the current reference
    and mark rm/, tidigits/, wsj/ as older recipes. Original LDC
    blurb and dhuggins sign-off preserved.
  • templates/librispeech/etc/sphinx_train.cfg – inline comment
    block next to $CFG_N_TIED_STATES with rule-of-thumb senone/density
    numbers for ~10 h / ~100 h / ~1000 h continuous-model training.
    The 5000 default is unchanged.
  • No training code paths were touched. The only binary affected is
    bw, and only its help string.
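
For anyone starting from the wsj or tidigits templates, switching the knob on should amount to uncommenting the reference line and changing its value. A minimal sketch, assuming the commented line has exactly the form quoted above (no leading indentation) and that you pass in the path of your own config; the function name enable_multipron_training is hypothetical and not part of SphinxTrain:

```shell
#!/bin/sh
# Hypothetical helper (not shipped with SphinxTrain): turn the commented
# reference line
#   # $CFG_MULTIPRON_TRAINING = 'no';
# into an active
#   $CFG_MULTIPRON_TRAINING = 'yes';
# Assumes the line starts at column 0 and matches the form above.
enable_multipron_training() {
    cfg="$1"
    tmp="$cfg.tmp.$$"
    # Portable alternative to `sed -i`: write to a temp file, then replace.
    sed "s/^#[[:space:]]*\\\$CFG_MULTIPRON_TRAINING[[:space:]]*=[[:space:]]*'no';/\$CFG_MULTIPRON_TRAINING = 'yes';/" \
        "$cfg" > "$tmp" && mv "$tmp" "$cfg"
}
```

Lines that do not match, including an existing $CFG_MULTIPRON setting, are left untouched. Editing the file by hand is equally fine; the sketch only makes the intended end state explicit.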

Verification

  • cmake -S . -B build && cmake --build build completes cleanly (all
    binaries, including the modified bw).
  • perl -c reports no errors on the four edited Perl configs.
  • ./build/bw -help | grep multipron renders the new help text
    correctly.
  • End-to-end training on CMU Arctic SLT (1043 train / 55 test,
    3236-word multi-pron dict) with the locally-built tree: stages
    000 → 50 all completed with CFG_MULTIPRON_TRAINING = 'yes'
    and CFG_MULTIPRON = 'yes'. Stage-20 and stage-50 bw logs both show
    -multipron yes and the multipron-aligned transcription being used
    at every density split (1g → 2g → 4g → 8g). Final CD-tied models
    written under model_parameters/<expt>.cd_cont_200/.

lenzo-ka added 6 commits May 21, 2026 10:38
docs: document CFG_MULTIPRON_TRAINING in README
Sibling section to "Multipron alignment (optional stage 21)" so the
bw-level multipron Baum-Welch knob is discoverable from the top-level
README and not only from templates/librispeech/etc/sphinx_train.cfg.

bw: mention CFG_MULTIPRON_TRAINING in -multipron help text
Users discovering -multipron via bw -help previously had no pointer
back to the SphinxTrain-level knob that flips it on for every stage.

templates: reference CFG_MULTIPRON_TRAINING in wsj/tidigits configs
The librispeech template documents the knob inline; users starting
from the older wsj or tidigits templates had no on-file hint that it
existed. Add an inactive 'no' line with a pointer to the canonical
documentation.

templates: refresh README with current "where to start" pointer
The 2007 README listed five subdirectories without saying which one
is current. Point new projects at librispeech (the only template kept
up to date with multipron, LDA, G2P, etc.) and mark rm/tidigits/wsj
as older recipes kept for reference.

templates: add tied-state and density sizing guide to librispeech cfg
Users choosing CFG_N_TIED_STATES and CFG_FINAL_NUM_DENSITIES for a
new corpus have nothing to anchor the numbers against. Add a short
rule-of-thumb comment block next to the senone count for continuous
models at ~10 h / ~100 h / ~1000 h scales.

docs: note --depth 1 caveat when packaging SphinxTrain
Downstream projects that vendor SphinxTrain with a shallow clone
occasionally hit a confusing 'not a fast-forward' on later git pulls.
Mention the --unshallow workaround in the Building section.
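
For packagers, the workaround boils down to checking whether the vendored checkout is shallow and fetching the missing history before pulling. A minimal sketch, assuming a git recent enough to support `git rev-parse --is-shallow-repository` (2.15 or later); the function name ensure_full_history is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: convert a shallow clone into a full one before pulling.
ensure_full_history() {
    repo="$1"
    if [ "$(git -C "$repo" rev-parse --is-shallow-repository)" = "true" ]; then
        # Fetch the history that `git clone --depth 1` left out.
        git -C "$repo" fetch --unshallow
    fi
}
```

The guard matters because `git fetch --unshallow` exits with an error on a repository that already has full history, so the helper is safe to call unconditionally from a packaging script.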
@lenzo-ka lenzo-ka merged commit e0caeee into cmusphinx:master May 21, 2026
1 check passed