Bake multiple lexical pronunciation into baum-welch by lenzo-ka · Pull Request #58 · cmusphinx/sphinxtrain

lenzo-ka · 2026-05-20T13:19:08Z

Multi-pronunciation Baum-Welch training

Baum-Welch in SphinxTrain silently uses the first pronunciation variant for
every multi-pron word in the dictionary, so (2)/(3) entries receive no acoustic
training unless the transcript token already carries an explicit (N) suffix.

This branch fixes that in the utterance-HMM construction layer. The forward,
backward, Viterbi, and accumulator code paths are unchanged; they already
operate on a general DAG via state_t.next_state[] and prior_state[]. The
change replaces the implicit linear phone chain produced by mk_phone_list
with an explicit phone_graph_t carrying parallel paths per pronunciation
variant, then resolves triphone contexts and builds the state sequence over
that graph.
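
To make the shape of the change concrete, here is a minimal sketch of a phone graph with parallel paths for one two-variant word between two silences. Every sketch_* type and function is hypothetical and is not the phone_graph_t API added in this branch; the phone strings are illustrative only.

```c
#include <assert.h>

#define SKETCH_MAX_SLOTS 16
#define SKETCH_MAX_NEXT  4

/* Hypothetical slot: one CI phone plus the indices of its successor slots. */
typedef struct {
    const char *phone;
    int next[SKETCH_MAX_NEXT];
    int n_next;
} sketch_slot_t;

typedef struct {
    sketch_slot_t slot[SKETCH_MAX_SLOTS];
    int n_slot;
} sketch_graph_t;

/* Append a slot and return its index. */
int sketch_add_slot(sketch_graph_t *g, const char *phone)
{
    assert(g->n_slot < SKETCH_MAX_SLOTS);
    g->slot[g->n_slot].phone = phone;
    g->slot[g->n_slot].n_next = 0;
    return g->n_slot++;
}

/* Add a directed edge from slot `from` to slot `to`. */
void sketch_add_edge(sketch_graph_t *g, int from, int to)
{
    sketch_slot_t *s = &g->slot[from];
    assert(s->n_next < SKETCH_MAX_NEXT);
    s->next[s->n_next++] = to;
}

/* Append a linear chain of phones after slot `prev`; return the last index. */
int sketch_add_chain(sketch_graph_t *g, int prev, const char **phones, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        int cur = sketch_add_slot(g, phones[i]);
        sketch_add_edge(g, prev, cur);
        prev = cur;
    }
    return prev;
}

/* Count distinct paths from `from` to `to`; the graph must be acyclic. */
int sketch_count_paths(const sketch_graph_t *g, int from, int to)
{
    int i, total = 0;
    if (from == to)
        return 1;
    for (i = 0; i < g->slot[from].n_next; i++)
        total += sketch_count_paths(g, g->slot[from].next[i], to);
    return total;
}

/* SIL -> { R IY D | R EH D } -> SIL: one parallel path per variant. */
int sketch_demo(void)
{
    const char *v1[] = { "R", "IY", "D" };
    const char *v2[] = { "R", "EH", "D" };
    sketch_graph_t g;
    int start, end, a, b;

    g.n_slot = 0;
    start = sketch_add_slot(&g, "SIL");
    a = sketch_add_chain(&g, start, v1, 3);
    b = sketch_add_chain(&g, start, v2, 3);
    end = sketch_add_slot(&g, "SIL");
    sketch_add_edge(&g, a, end);
    sketch_add_edge(&g, b, end);
    return sketch_count_paths(&g, start, end);
}
```

With a single pronunciation the same construction degenerates to one chain and one path, which is the case the branch requires to match the linear builder.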

Commits

- Add lexicon base-word variant index. Adds a lex_entry_t.next_variant chain
  and a parallel lexicon_t.base_ht keyed by the base word ("reading" for
  "reading(2)"). Existing lexicon_lookup behavior is unchanged.
- Add phone_graph_t and graph-aware state-sequence builder. Adds
  mk_phone_graph, phone_graph_split_contexts, cvt2triphone_graph, and
  state_seq_make_graph. For a single-pron utterance the graph is a strict
  linear chain and the resulting state array is bit-identical to
  state_seq_make's. Renames the two static helpers in state_seq.c and
  shares them via a new private state_seq_internal.h.
- bw: add -multipron flag selecting the graph utterance HMM. Wires a new
  next_utt_states_graph into bw's main loop, gated by -multipron
  (default no for parity). The graph path allocates the state_t array per
  call and frees it with state_seq_free; the linear path's static-buffer
  lifetime is unchanged.
- Add CFG_MULTIPRON_TRAINING knob to pass -multipron to bw. Adds
  CFG_MULTIPRON_TRAINING (default yes) to etc/sphinx_train.cfg and the
  librispeech template, plumbed into every baum_welch.pl under
  01.lda_train, 02.mllt_train, 10.falign_ci_hmm, 20.ci_hmm,
  30.cd_hmm_untied, 50.cd_hmm_tied, 65.mmie_train, and 80.mllr_adapt.

CFG_MULTIPRON_TRAINING is independent of CFG_MULTIPRON (the stage-21
sphinx3_align disambiguation pass). The two compose: tokens already
disambiguated by stage 21 carry an (N) suffix and collapse to a
single-variant expansion in the graph builder; un-disambiguated tokens still
branch on every dictionary variant during Baum-Welch.
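
As an illustration of how a builder could tell a disambiguated token from an un-disambiguated one, the following sketch parses a trailing (N) suffix. The function name and its return convention (0 meaning "no suffix") are assumptions made for this example; the PR does not show how the graph builder performs this check.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return N if `token` ends in "(N)" with N a decimal
 * number, or 0 if there is no such suffix. The base word (the token without
 * the suffix) is copied into `base`, truncated to fit `base_sz`.
 */
int sketch_parse_variant(const char *token, char *base, size_t base_sz)
{
    size_t len, base_len, open, i;
    int n = 0;

    assert(token != NULL);
    len = strlen(token);
    base_len = len;

    /* Shortest token with a suffix is four characters, e.g. "a(2)". */
    if (len >= 4 && token[len - 1] == ')') {
        open = len - 1;
        while (open > 0 && token[open] != '(')
            open--;
        /* Require a non-empty base and at least one character inside (). */
        if (open > 0 && token[open] == '(' && open + 1 < len - 1) {
            int ok = 1;
            for (i = open + 1; i < len - 1; i++) {
                if (!isdigit((unsigned char)token[i])) {
                    ok = 0;
                    break;
                }
                n = n * 10 + (token[i] - '0');
            }
            if (ok)
                base_len = open;   /* suffix accepted: strip it */
            else
                n = 0;             /* not a numeric suffix: keep the token whole */
        }
    }

    if (base != NULL && base_sz > 0) {
        if (base_len >= base_sz)
            base_len = base_sz - 1;
        memcpy(base, token, base_len);
        base[base_len] = '\0';
    }
    return n;
}
```

Under this convention a nonzero result corresponds to the single-variant expansion described above, and a zero result corresponds to branching on every dictionary variant of the base word.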

Compatibility

- C side: the default -multipron no preserves prior behavior. Each commit
  compiles standalone.
- Cfg side: the default CFG_MULTIPRON_TRAINING = 'yes' enables the new path
  in this fork's recipes. Set it to 'no' to fall back to the linear builder.

Validation

No automated tests added (the repo's existing Perl test/ harness isn't
wired into CMake and lacks a scaffold for this layer). Validated empirically:

- Each commit builds standalone.
- Trained CMU Arctic SLT end-to-end (CI through CD-4g) twice, once with
  multipron off and once with it on, then force-aligned the 54 held-out
  utterances against both models with sphinx3_align -outsent.
  - Same model shape (means/variances/mixw/tmat all identical sizes).
  - Different model contents (cmp shows the first diff in CI/CD means around
    byte 60).
  - 491 content tokens compared; 452 (92.1%) same variant, 39 (7.9%)
    different. 29 / 54 utterances had at least one disagreement,
    concentrated on high-frequency function words.
- Consistent with the reference implementation's numbers (93.6% / 6.4% on a
  slightly different cut).
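
The agreement percentages follow directly from the token counts reported above (452 + 39 = 491). A small sketch of the arithmetic, using a hypothetical helper that rounds to one decimal place in integer math:

```c
#include <assert.h>

/*
 * Hypothetical helper: share of `count` in `total`, in tenths of a percent,
 * rounded to nearest. A return value of 921 means 92.1%.
 */
int sketch_percent_tenths(int count, int total)
{
    assert(total > 0 && count >= 0 && count <= total);
    return (int)((1000L * count + total / 2) / total);
}
```

For the counts in this PR, sketch_percent_tenths(452, 491) gives 921 (92.1%) and sketch_percent_tenths(39, 491) gives 79 (7.9%).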

Net change

~1,050 lines new C, ~80 lines modifications, ~60 lines Perl/config.

lex_entry_t gains a next_variant pointer chaining all dictionary entries that share a base word (e.g. "reading", "reading(2)", "reading(3)"). lexicon_t gains a parallel hash table base_ht keyed by the base word with the chain head as value, populated as lexicon_read() inserts each entry. New lexicon_lookup_variants() returns the chain head; lex_entry_ortho/lex_entry_next_variant are accessors. Existing lexicon_lookup() semantics and call sites are unchanged.
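
The variant chain can be pictured with a small sketch. The sketch_* names are hypothetical and are not the lexicon API; in particular, a quadratic scan stands in for the base_ht hash table, and the base word is assumed to be the orthography with a trailing parenthesized suffix removed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical entry: orthography plus a link to the next variant. */
typedef struct sketch_entry_s {
    const char *ortho;                    /* e.g. "reading(2)" */
    struct sketch_entry_s *next_variant;  /* next entry with the same base */
} sketch_entry_t;

/* Length of the base word: the orthography minus a trailing "(...)". */
size_t sketch_base_len(const char *ortho)
{
    const char *p = strrchr(ortho, '(');
    size_t len = strlen(ortho);

    assert(ortho != NULL);
    if (p != NULL && p != ortho && len > 0 && ortho[len - 1] == ')')
        return (size_t)(p - ortho);
    return len;
}

/* True if both orthographies share the same base word. */
int sketch_same_base(const char *a, const char *b)
{
    size_t la = sketch_base_len(a), lb = sketch_base_len(b);
    return la == lb && strncmp(a, b, la) == 0;
}

/* Link each entry to the next later entry with the same base word. */
void sketch_link_variants(sketch_entry_t *e, size_t n)
{
    size_t i, j;
    for (i = 0; i < n; i++) {
        e[i].next_variant = NULL;
        for (j = i + 1; j < n; j++) {
            if (sketch_same_base(e[i].ortho, e[j].ortho)) {
                e[i].next_variant = &e[j];
                break;
            }
        }
    }
}

/* Return the chain head for `base`, or NULL if the word is unknown. */
sketch_entry_t *sketch_lookup_variants(sketch_entry_t *e, size_t n,
                                       const char *base)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (sketch_same_base(e[i].ortho, base))
            return &e[i];
    return NULL;
}

/* Number of variants reachable from `head`. */
size_t sketch_chain_len(const sketch_entry_t *head)
{
    size_t n = 0;
    for (; head != NULL; head = head->next_variant)
        n++;
    return n;
}
```

Keeping the chain separate from the exact-match lookup is what lets existing callers keep their behavior: only code that asks for the chain head sees the extra variants.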

mk_phone_graph builds a per-utterance phone graph from a word sequence; for single-pron utterances the graph is a strict linear chain matching mk_phone_list's output, otherwise it carries parallel paths for every pronunciation variant returned by lexicon_lookup_variants. Transcript tokens already disambiguated with an (N) suffix fall through to lexicon_lookup and contribute a single path.

phone_graph_split_contexts duplicates any slot whose predecessors carry multiple distinct CI phones, once per distinct predecessor CI, so each slot in the result has an unambiguous left context for triphone resolution.

cvt2triphone_graph walks the post-split graph and replaces each slot's CI phone with its triphone acmod_id using the same word-position back-off as cvt2triphone.

state_seq_make_graph then builds a state_t array, fanning the non-emit exit of each slot out to every graph successor with uniform 1/n_next probability; for a linear graph it produces an array bit-identical to state_seq_make's output on the equivalent phone[] list.

The static helpers set_next_state/set_prior_state in state_seq.c are renamed state_seq_set_next/state_seq_set_prior, exposed via the new private header state_seq_internal.h, and shared by both the linear state_seq_make and the new state_seq_make_graph. CMakeLists.txt registers the three new .c files.
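
Two of the quantities in this commit message are simple to state in code: how many copies of a slot the context split needs, and the probability given to each successor in the fan-out. The sketch below uses hypothetical helper names and plain strings for CI phones; it is not the implementation in this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Number of copies a slot needs after context splitting: one per distinct
 * CI phone among its predecessors. `pred_ci` lists the predecessors' phones.
 */
size_t sketch_distinct_left_contexts(const char *const *pred_ci, size_t n_pred)
{
    size_t i, j, n_distinct = 0;

    for (i = 0; i < n_pred; i++) {
        int seen = 0;
        for (j = 0; j < i; j++) {
            if (strcmp(pred_ci[i], pred_ci[j]) == 0) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            n_distinct++;
    }
    return n_distinct;
}

/* Uniform fan-out: each of n_next successors gets probability 1/n_next. */
double sketch_fanout_prob(size_t n_next)
{
    assert(n_next > 0);
    return 1.0 / (double)n_next;
}
```

For example, a slot reached from predecessors ending in AH and IY needs two copies, while one reached from two paths that both end in D needs only one; a slot with a single successor gets probability 1.0, which is the linear-chain case.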

When -multipron yes is passed (default no for parity with prior behavior), bw builds each per-utterance training HMM via the new next_utt_states_graph, which composes mk_phone_graph, phone_graph_split_contexts, cvt2triphone_graph, and state_seq_make_graph. Forward-backward then sums posteriors across all pronunciation variants of every multi-pron word rather than silently picking the first dictionary entry. The graph path allocates the state_t array per call; main.c reads -multipron once before the utterance loop and calls state_seq_free only on the graph-built result so the linear path's static-buffer lifetime is unchanged.
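
A toy calculation shows what "sums posteriors across all pronunciation variants" means for one multi-pron word. It assumes each variant path has a known total likelihood and a uniform 1/n entry probability; real forward-backward in bw works per frame and per state, so the function below is a hypothetical illustration only.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Given the likelihood of the observations along each of n variant paths,
 * weight each path by a uniform 1/n branch probability, write the posterior
 * of each variant into post[], and return the summed likelihood.
 */
double sketch_variant_posteriors(const double *path_lik, size_t n, double *post)
{
    double total = 0.0;
    size_t i;

    assert(n > 0);
    for (i = 0; i < n; i++)
        total += path_lik[i] / (double)n;
    assert(total > 0.0);
    for (i = 0; i < n; i++)
        post[i] = (path_lik[i] / (double)n) / total;
    return total;
}
```

With path likelihoods 0.125 and 0.375, the variants receive posteriors 0.25 and 0.75, so both accumulate training counts in proportion to how well they explain the audio. The linear builder would instead give the first variant all of the counts.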

CFG_MULTIPRON_TRAINING (default yes) is independent of CFG_MULTIPRON: the new variable toggles the graph utterance HMM in every bw invocation, while CFG_MULTIPRON still controls the stage-21 sphinx3_align disambiguation pass. The two compose. Transcript tokens already disambiguated by stage 21 carry an explicit (N) suffix and collapse to a single-variant expansion in the graph builder, while un-disambiguated tokens still branch on every dictionary variant during Baum-Welch. Each baum_welch.pl under 01.lda_train, 02.mllt_train, 10.falign_ci_hmm, 20.ci_hmm, 30.cd_hmm_untied, 50.cd_hmm_tied, 65.mmie_train, and 80.mllr_adapt conditionally pushes -multipron yes onto its bw arg list. The knob is also added to templates/librispeech/etc/sphinx_train.cfg.

lenzo-ka added 4 commits May 19, 2026 18:08

lenzo-ka requested a review from dhdaines May 20, 2026 13:20

lenzo-ka merged commit d146065 into cmusphinx:master May 20, 2026
8 checks passed

lenzo-ka mentioned this pull request May 21, 2026

Improve notes on multipron training #59

Merged

lenzo-ka merged 4 commits into cmusphinx:master from lenzo-ka:kal-more-multi
