Improve notes on multipron training #59
Merged
Sibling section to "Multipron alignment (optional stage 21)" so the bw-level multipron Baum-Welch knob is discoverable from the top-level README and not only from templates/librispeech/etc/sphinx_train.cfg.
Users discovering -multipron via bw -help previously had no pointer back to the SphinxTrain-level knob that flips it on for every stage.
The librispeech template documents the knob inline; users starting from the older wsj or tidigits templates had no on-file hint that it existed. Add an inactive 'no' line with a pointer to the canonical documentation.
The 2007 README listed five subdirectories without saying which one is current. Point new projects at librispeech (the only template kept up to date with multipron, LDA, G2P, etc.) and mark rm/tidigits/wsj as older recipes kept for reference.
Users choosing CFG_N_TIED_STATES and CFG_FINAL_NUM_DENSITIES for a new corpus have nothing to anchor the numbers against. Add a short rule-of-thumb comment block next to the senone count for continuous models at ~10 h / ~100 h / ~1000 h scales.
Downstream projects that vendor SphinxTrain with a shallow clone occasionally hit a confusing 'not a fast-forward' on later git pulls. Mention the --unshallow workaround in the Building section.
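As a rough illustration of the shallow-clone note above, the following self-contained sketch builds a throwaway three-commit repository, clones it with `--depth 1`, and then restores full history with `git fetch --unshallow`. It does not attempt to reproduce the "not a fast-forward" message itself; the repository, paths, and commit messages are all stand-ins created for the demo and have nothing to do with the real SphinxTrain tree.

```shell
#!/bin/sh
# Sketch only: a throwaway repo stands in for an upstream project.
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Build a tiny "upstream" with three empty commits.
upstream="$work/upstream"
git init -q "$upstream"
for n in 1 2 3; do
    git -C "$upstream" -c user.name=demo -c user.email=demo@example.invalid \
        commit -q --allow-empty -m "commit $n"
done

# Shallow clone. The file:// URL matters: --depth is ignored for plain
# local-path clones.
vendored="$work/vendored"
git clone -q --depth 1 "file://$upstream" "$vendored"
shallow_before=$(git -C "$vendored" rev-parse --is-shallow-repository)
count_before=$(git -C "$vendored" rev-list --count HEAD)

# Convert the shallow clone into a full one.
git -C "$vendored" fetch -q --unshallow
shallow_after=$(git -C "$vendored" rev-parse --is-shallow-repository)
count_after=$(git -C "$vendored" rev-list --count HEAD)

echo "before: shallow=$shallow_before commits=$count_before"
echo "after:  shallow=$shallow_after commits=$count_after"
```

In a real vendored checkout only the `git fetch --unshallow` step would be needed; `git rev-parse --is-shallow-repository` (Git 2.15 or later) is a quick way to tell whether it applies.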
Docs and discoverability follow-up for CFG_MULTIPRON_TRAINING
Small, docs-focused follow-up to PR #58 (`kal-more-multi`). After that PR landed, `CFG_MULTIPRON_TRAINING` was correctly threaded into every `bw` stage, but it was only documented inside `templates/librispeech/etc/sphinx_train.cfg`. A newcomer reading the top-level README, `bw -help`, or any other template wouldn't discover the knob exists.
This PR closes that documentation gap and adds a few unrelated small
ergonomic notes spotted in the same audit. No behavioural change in
any training path; the only C edit is one help-text string in
`bw`.

Commits
- docs: document CFG_MULTIPRON_TRAINING in README
- bw: mention CFG_MULTIPRON_TRAINING in -multipron help text
- templates: reference CFG_MULTIPRON_TRAINING in wsj/tidigits configs
- templates: refresh README with current "where to start" pointer
- templates: add tied-state and density sizing guide to librispeech cfg
- docs: note --depth 1 caveat when packaging SphinxTrain

What changes
- `README.md` – new section "Multipron training (CFG_MULTIPRON_TRAINING)", sibling to the existing stage-21 alignment section. Explains the Baum-Welch graph builder, the default, how it composes with `$CFG_MULTIPRON`, and the cost trade-off. Separately, a short note in the "Linux/Unix Installation" section about `git clone --depth 1` vs. `git fetch --unshallow` for downstream packagers.
- `src/programs/bw/train_cmd_ln.c` – one sentence appended to the `-multipron` argument doc so `bw -help` points users at the cfg knob that controls it across stages.
- `templates/tidigits/etc/sphinx_train.template` and `templates/wsj/etc/sphinx_train.template` – four-line commented `# $CFG_MULTIPRON_TRAINING = 'no';` reference with a pointer to the README and librispeech cfg. Inactive lines; no behavioural impact.
- `templates/README` – refreshed from the 2007 sign-off to add a "Where to start" pointer to `librispeech/` as the current reference and mark `rm/`, `tidigits/`, `wsj/` as older recipes. Original LDC blurb and dhuggins sign-off preserved.
- `templates/librispeech/etc/sphinx_train.cfg` – inline comment block next to `$CFG_N_TIED_STATES` with rule-of-thumb senone/density numbers for ~10 h / ~100 h / ~1000 h continuous-model training. The 5000 default is unchanged.
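To make the "inactive lines" point concrete, here is a minimal sketch of a check that distinguishes an active `$CFG_MULTIPRON_TRAINING` assignment from a commented-out reference. The helper name `knob_state` and the three stand-in files are hypothetical, written only for this demo; they are not the real template contents, and the `'yes'` value in the first file is just an example, not a statement about the librispeech default.

```shell
#!/bin/sh
# Sketch only: file names and contents below are stand-ins, not the real templates.
set -eu

# Report whether a config sets the knob on an uncommented line ("active"),
# mentions it only behind a '#' ("commented"), or not at all ("absent").
knob_state() {
    if grep -Eq '^[[:space:]]*\$CFG_MULTIPRON_TRAINING[[:space:]]*=' "$1"; then
        echo active
    elif grep -Eq '^[[:space:]]*#.*\$CFG_MULTIPRON_TRAINING' "$1"; then
        echo commented
    else
        echo absent
    fi
}

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Stand-in for a config that sets the knob explicitly.
cat > "$work/active.cfg" <<'EOF'
$CFG_MULTIPRON_TRAINING = 'yes';
EOF

# Stand-in for a template carrying only the commented reference line.
cat > "$work/commented.template" <<'EOF'
# $CFG_MULTIPRON_TRAINING = 'no';
EOF

# Stand-in for a config that never mentions the knob.
cat > "$work/absent.cfg" <<'EOF'
$CFG_N_TIED_STATES = 5000;
EOF

state_active=$(knob_state "$work/active.cfg")
state_commented=$(knob_state "$work/commented.template")
state_absent=$(knob_state "$work/absent.cfg")

echo "active.cfg:         $state_active"
echo "commented.template: $state_commented"
echo "absent.cfg:         $state_absent"
```

Because the commented form never reaches the Perl interpreter, a file in the "commented" state behaves the same as one in the "absent" state, which is why the wsj/tidigits edits carry no behavioural impact.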
`bw`, and only its help string.

Verification
- `cmake -S . -B build && cmake --build build` clean (all binaries, including the modified `bw`).
- `perl -c` clean on the four edited Perl configs.
- `./build/bw -help | grep multipron` renders the new help text correctly.

word multi-pron dict) with the locally-built tree: stages 000 → 50 all completed with `CFG_MULTIPRON_TRAINING = 'yes'` and `CFG_MULTIPRON = 'yes'`. Stage-20 and stage-50 `bw` logs both show `-multipron yes` and the multipron-aligned transcription being used at every density split (1g → 2g → 4g → 8g). Final CD-tied models written under `model_parameters/<expt>.cd_cont_200/`.
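
The log check described above could be scripted along the following lines. This is a sketch under stated assumptions: the helper `count_logs_with_flag`, the `stage-20`/`stage-50` directory names, and the one-line log files are all hypothetical placeholders, not real `bw` output or the actual SphinxTrain log layout. Only the `-multipron yes` string comes from the description.

```shell
#!/bin/sh
# Sketch only: the directory layout and log contents are placeholders.
set -eu

# Count *.log files under a directory that contain "-multipron yes".
count_logs_with_flag() {
    find "$1" -type f -name '*.log' \
        -exec grep -lE -e '-multipron[[:space:]]+yes' {} + | wc -l | tr -d ' '
}

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Placeholder logs: two mention the flag as enabled, one does not.
mkdir -p "$work/stage-20" "$work/stage-50"
printf '%s\n' 'bw -multipron yes' > "$work/stage-20/a.log"
printf '%s\n' 'bw -multipron yes' > "$work/stage-50/b.log"
printf '%s\n' 'bw -multipron no'  > "$work/stage-50/c.log"

with_flag=$(count_logs_with_flag "$work")
total=$(find "$work" -type f -name '*.log' | wc -l | tr -d ' ')

echo "logs with -multipron yes: $with_flag of $total"
```

Pointed at a real experiment's log directory, a mismatch between the two counts would flag a stage where the knob did not propagate.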