Improve notes on multipron training by lenzo-ka · Pull Request #59 · cmusphinx/sphinxtrain

lenzo-ka · 2026-05-21T15:44:43Z

Docs and discoverability follow-up for CFG_MULTIPRON_TRAINING

Small, docs-focused follow-up to PR #58 (kal-more-multi). After that
PR landed, CFG_MULTIPRON_TRAINING was correctly threaded into every
bw stage, but it was only documented inside
templates/librispeech/etc/sphinx_train.cfg. A newcomer reading the
top-level README, bw -help, or any other template wouldn't discover
the knob exists.

This PR closes that documentation gap and adds a few unrelated small
ergonomic notes spotted in the same audit. No behavioural change in
any training path; the only C edit is one help-text string in bw.

Commits

| # | Subject |
|---|---------|
| 1 | `docs: document CFG_MULTIPRON_TRAINING in README` |
| 2 | `bw: mention CFG_MULTIPRON_TRAINING in -multipron help text` |
| 3 | `templates: reference CFG_MULTIPRON_TRAINING in wsj/tidigits configs` |
| 4 | `templates: refresh README with current "where to start" pointer` |
| 5 | `templates: add tied-state and density sizing guide to librispeech cfg` |
| 6 | `docs: note --depth 1 caveat when packaging SphinxTrain` |

What changes

- `README.md` – new section "Multipron training (CFG_MULTIPRON_TRAINING)", sibling to the existing stage-21 alignment section. Explains the Baum-Welch graph builder, the default, how it composes with `$CFG_MULTIPRON`, and the cost trade-off. Separately, a short note in the "Linux/Unix Installation" section about `git clone --depth 1` vs. `git fetch --unshallow` for downstream packagers.
- `src/programs/bw/train_cmd_ln.c` – one sentence appended to the `-multipron` argument doc so `bw -help` points users at the cfg knob that controls it across stages.
- `templates/tidigits/etc/sphinx_train.template` and `templates/wsj/etc/sphinx_train.template` – four-line commented `# $CFG_MULTIPRON_TRAINING = 'no';` reference with a pointer to the README and librispeech cfg. Inactive lines; no behavioural impact.
- `templates/README` – refreshed from the 2007 sign-off to add a "Where to start" pointer to `librispeech/` as the current reference and mark `rm/`, `tidigits/`, `wsj/` as older recipes. Original LDC blurb and dhuggins sign-off preserved.
- `templates/librispeech/etc/sphinx_train.cfg` – inline comment block next to `$CFG_N_TIED_STATES` with rule-of-thumb senone/density numbers for ~10 h / ~100 h / ~1000 h continuous-model training. The 5000 default is unchanged.

No training code paths were touched. The only binary affected is `bw`, and only its help string.
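To illustrate why the commented reference lines are inactive, here is a minimal sketch of a check that tells an active assignment apart from a commented one. The helper name `knob_value` and the demo file are hypothetical and not part of SphinxTrain; the only syntax assumed is the single-quoted `$KNOB = 'value';` form quoted above.

```shell
#!/bin/sh
# knob_value KNOB FILE
# Print the value assigned to $KNOB on an active (uncommented) line of FILE,
# or "unset" when the knob only appears in comments or not at all.
# Assumption: assignments look like  $KNOB = 'value';  with single quotes.
knob_value() {
    value=$(grep -v '^[[:space:]]*#' "$2" \
        | sed -n 's/^[[:space:]]*\$'"$1"'[[:space:]]*=[[:space:]]*'\''\([^'\'']*\)'\''.*/\1/p' \
        | tail -n 1)
    echo "${value:-unset}"
}

# Demo on a throwaway file that mimics the commented reference (hypothetical).
demo_cfg=$(mktemp)
cat > "$demo_cfg" <<'EOF'
# $CFG_MULTIPRON_TRAINING = 'no';
$CFG_MULTIPRON = 'yes';
EOF

echo "CFG_MULTIPRON_TRAINING: $(knob_value CFG_MULTIPRON_TRAINING "$demo_cfg")"
echo "CFG_MULTIPRON:          $(knob_value CFG_MULTIPRON "$demo_cfg")"
rm -f "$demo_cfg"
```

The helper only reports what the file assigns. What the training scripts do when the knob is unset is decided by SphinxTrain itself, not by this sketch.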

Verification

- `cmake -S . -B build && cmake --build build` completes cleanly (all binaries, including the modified `bw`).
- `perl -c` reports no errors on the four edited Perl configs.
- `./build/bw -help | grep multipron` renders the new help text correctly.
- End-to-end training on CMU Arctic SLT (1043 train / 55 test, 3236-word multi-pron dict) with the locally-built tree:
  - Stages 000 → 50 all completed with `CFG_MULTIPRON_TRAINING = 'yes'` and `CFG_MULTIPRON = 'yes'`.
  - Stage-20 and stage-50 `bw` logs both show `-multipron yes` and the multipron-aligned transcription being used at every density split (1g → 2g → 4g → 8g).
  - Final CD-tied models written under `model_parameters/<expt>.cd_cont_200/`.
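The log check described above could be scripted along these lines. This is a sketch under an assumption: that each `bw` log contains the flag and its value on one line, separated by whitespace, as in `-multipron yes`. The function names are hypothetical, and the log file locations are left to the caller because they depend on the experiment layout.

```shell
#!/bin/sh
# Return success when a log mentions "-multipron" followed by "yes".
# Assumption: flag and value appear on the same line, whitespace-separated;
# adjust the pattern if your logs are laid out differently.
log_has_multipron_yes() {
    grep -Eq -- '-multipron[[:space:]]+yes' "$1"
}

# Print one PASS/FAIL line per log file and return non-zero if any failed.
check_logs() {
    status=0
    for log in "$@"; do
        if log_has_multipron_yes "$log"; then
            echo "PASS $log"
        else
            echo "FAIL $log"
            status=1
        fi
    done
    return $status
}
```

Call `check_logs` with whichever stage-20 and stage-50 `bw` log files the run produced; a non-zero exit status means at least one log did not show the flag enabled.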

Commit messages, in commit order:

docs: document CFG_MULTIPRON_TRAINING in README
Sibling section to "Multipron alignment (optional stage 21)" so the bw-level multipron Baum-Welch knob is discoverable from the top-level README and not only from templates/librispeech/etc/sphinx_train.cfg.

bw: mention CFG_MULTIPRON_TRAINING in -multipron help text
Users discovering -multipron via bw -help previously had no pointer back to the SphinxTrain-level knob that flips it on for every stage.

templates: reference CFG_MULTIPRON_TRAINING in wsj/tidigits configs
The librispeech template documents the knob inline; users starting from the older wsj or tidigits templates had no on-file hint that it existed. Add an inactive 'no' line with a pointer to the canonical documentation.

templates: refresh README with current "where to start" pointer
The 2007 README listed five subdirectories without saying which one is current. Point new projects at librispeech (the only template kept up to date with multipron, LDA, G2P, etc.) and mark rm/tidigits/wsj as older recipes kept for reference.

templates: add tied-state and density sizing guide to librispeech cfg
Users choosing CFG_N_TIED_STATES and CFG_FINAL_NUM_DENSITIES for a new corpus have nothing to anchor the numbers against. Add a short rule-of-thumb comment block next to the senone count for continuous models at ~10 h / ~100 h / ~1000 h scales.

docs: note --depth 1 caveat when packaging SphinxTrain
Downstream projects that vendor SphinxTrain with a shallow clone occasionally hit a confusing 'not a fast-forward' on later git pulls. Mention the --unshallow workaround in the Building section.
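The shallow-clone workaround can be tried without touching the real repository. The sketch below builds a throwaway local repository as a stand-in for upstream (the directory names are hypothetical), clones it with `--depth 1`, and then converts the clone to a full one with `git fetch --unshallow`.

```shell
#!/bin/sh
# Self-contained demo: create a tiny local repository, clone it shallowly,
# then fetch the missing history so the clone is no longer shallow.
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Stand-in for the upstream repository (hypothetical, local only).
git init -q "$work/upstream"
git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first"
git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "second"

# --depth is ignored for plain local paths, hence the file:// URL.
git clone -q --depth 1 "file://$work/upstream" "$work/vendored"
shallow_before=$(git -C "$work/vendored" rev-parse --is-shallow-repository)
commits_before=$(git -C "$work/vendored" rev-list --count HEAD)

# Retrieve the full history.
git -C "$work/vendored" fetch -q --unshallow
shallow_after=$(git -C "$work/vendored" rev-parse --is-shallow-repository)
commits_after=$(git -C "$work/vendored" rev-list --count HEAD)

echo "before: shallow=$shallow_before commits=$commits_before"
echo "after:  shallow=$shallow_after commits=$commits_after"
```

For a vendored SphinxTrain checkout, only the `git fetch --unshallow` step applies; the rest of the script exists to make the demo reproducible.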

lenzo-ka added 6 commits May 21, 2026 10:38

docs: document CFG_MULTIPRON_TRAINING in README

5e8d0d5

bw: mention CFG_MULTIPRON_TRAINING in -multipron help text

d3bb087

templates: reference CFG_MULTIPRON_TRAINING in wsj/tidigits configs

a8e44f5

docs: note --depth 1 caveat when packaging SphinxTrain

9d6b838

lenzo-ka merged commit e0caeee into cmusphinx:master May 21, 2026
1 check passed

lenzo-ka merged 6 commits into cmusphinx:master from lenzo-ka:kal-incr-impr
