Bake multiple lexical pronunciation into baum-welch #58

Merged
lenzo-ka merged 4 commits into cmusphinx:master from lenzo-ka:kal-more-multi on May 20, 2026

@lenzo-ka lenzo-ka commented May 20, 2026

Multi-pronunciation Baum-Welch training

Baum-Welch in SphinxTrain silently uses the first pronunciation variant for
every multi-pron word in the dictionary, so (2)/(3) entries receive no acoustic
training unless the transcript token already carries an explicit (N) suffix.

This branch fixes that in the utterance-HMM construction layer. The forward,
backward, viterbi, and accumulator code paths are unchanged; they already
operate on a general DAG via state_t.next_state[] and prior_state[]. The
change replaces the implicit linear phone chain produced by mk_phone_list
with an explicit phone_graph_t carrying parallel paths per pronunciation
variant, then resolves triphone contexts and builds the state sequence over
that graph.
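
To illustrate why the forward code needs no change, here is a minimal, time-free sketch of a forward-style sum driven only by predecessor lists. `toy_state_t`, `prior_tprob`, `out_prob`, and `toy_forward_sum` are hypothetical stand-ins, not SphinxTrain's actual `state_t` layout or API; only the `prior_state[]` name comes from the description above. The same loop handles a linear chain (one predecessor per state) and a graph with parallel paths.

```c
#include <assert.h>

#define MAX_PRIOR 4

/* Toy stand-in for a state: only what the sum needs (assumed layout). */
typedef struct {
    int    n_prior;                 /* number of predecessor states */
    int    prior_state[MAX_PRIOR];  /* indices of predecessor states */
    double prior_tprob[MAX_PRIOR];  /* transition prob from each predecessor */
    double out_prob;                /* output likelihood (1.0 if non-emitting) */
} toy_state_t;

/* Sum path probabilities into the last state. States must be listed in
 * topological order, so every predecessor index is smaller than the
 * index of the state it feeds. alpha[] receives one value per state. */
double toy_forward_sum(const toy_state_t *s, int n_state, double *alpha)
{
    assert(n_state > 0);
    alpha[0] = s[0].out_prob;
    for (int j = 1; j < n_state; j++) {
        double sum = 0.0;
        for (int k = 0; k < s[j].n_prior; k++) {
            int i = s[j].prior_state[k];
            assert(i < j);
            sum += alpha[i] * s[j].prior_tprob[k];
        }
        alpha[j] = sum * s[j].out_prob;
    }
    return alpha[n_state - 1];
}
```

With a diamond-shaped graph (one entry, two parallel states, one exit) the exit value is the sum over both branches, which is the behavior the graph path relies on.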

Commits

  1. Add lexicon base-word variant index: lex_entry_t.next_variant chain
    and a parallel lexicon_t.base_ht keyed by the base word (e.g. reading for
    reading(2)). Existing lexicon_lookup behavior is unchanged.

  2. Add phone_graph_t and graph-aware state-sequence builder — new
    mk_phone_graph, phone_graph_split_contexts, cvt2triphone_graph, and
    state_seq_make_graph. For a single-pron utterance the graph is a strict
    linear chain and the resulting state array is bit-identical to
    state_seq_make's. Renames the two static helpers in state_seq.c and
    shares them via a new private state_seq_internal.h.

  3. bw: add -multipron flag selecting the graph utterance HMM — wires a new
    next_utt_states_graph into bw's main loop, gated by -multipron
    (default no for parity). The graph path allocates state_t per call and
    is freed with state_seq_free; the linear path's static-buffer lifetime
    is unchanged.

  4. Add CFG_MULTIPRON_TRAINING knob to pass -multipron to bw — new
    CFG_MULTIPRON_TRAINING (default yes) in etc/sphinx_train.cfg and the
    librispeech template, plumbed into every baum_welch.pl under
    01.lda_train, 02.mllt_train, 10.falign_ci_hmm, 20.ci_hmm,
    30.cd_hmm_untied, 50.cd_hmm_tied, 65.mmie_train, and 80.mllr_adapt.

CFG_MULTIPRON_TRAINING is independent of CFG_MULTIPRON (the stage-21
sphinx3_align disambiguation pass). The two compose: tokens already
disambiguated by stage 21 carry an (N) suffix and collapse to a
single-variant expansion in the graph builder; un-disambiguated tokens still
branch on every dictionary variant during Baum-Welch.
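
The composition rule can be sketched as follows. Both helpers are hypothetical illustrations, not functions from this branch: `token_variant_suffix` detects a trailing `(N)` on a transcript token, and `n_paths_for_token` returns how many parallel paths the token would contribute under the rule described above.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Return N if `token` ends in "(N)" with N a positive decimal integer,
 * or 0 if it carries no such suffix. Hypothetical helper. */
int token_variant_suffix(const char *token)
{
    size_t len = strlen(token);
    if (len < 4 || token[len - 1] != ')')   /* shortest case: "a(2)" */
        return 0;
    const char *open = strrchr(token, '(');
    if (open == NULL || open == token || open[1] == ')')
        return 0;
    for (const char *p = open + 1; p < token + len - 1; p++)
        if (!isdigit((unsigned char)*p))
            return 0;
    return atoi(open + 1);
}

/* One path if the transcript already pinned a variant, otherwise one
 * path per dictionary variant of the base word. */
int n_paths_for_token(const char *token, int n_dict_variants)
{
    assert(n_dict_variants >= 1);
    return token_variant_suffix(token) > 0 ? 1 : n_dict_variants;
}
```

For a word with three dictionary variants, `reading(2)` yields one path and plain `reading` yields three.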

Compatibility

  • C-side default -multipron no preserves prior behavior. Each commit
    compiles standalone.
  • Cfg-side default CFG_MULTIPRON_TRAINING = 'yes' enables the new path in
    this fork's recipes. Set to 'no' to fall back to the linear builder.

Validation

No automated tests added (the repo's existing Perl test/ harness isn't
wired into CMake and lacks a scaffold for this layer). Validated empirically:

  • Each commit builds standalone.
  • Trained CMU Arctic SLT end-to-end (CI through CD-4g) twice, once with
    multipron off and once with it on, then force-aligned the 54 held-out
    utterances against both models with sphinx3_align -outsent.
    • Same model shape (means/variances/mixw/tmat all identical sizes).
    • Different model contents (cmp shows first diff in CI/CD means around
      byte 60).
    • 491 content tokens compared; 452 (92.1%) same variant, 39 (7.9%)
      different. 29 / 54 utterances had at least one disagreement,
      concentrated on high-frequency function words.
  • Broadly consistent with the reference implementation's numbers (93.6%
    same / 6.4% different on a slightly different cut).

Net change

~1,050 lines new C, ~80 lines modifications, ~60 lines Perl/config.

lenzo-ka added 4 commits May 19, 2026 18:08

Add lexicon base-word variant index

lex_entry_t gains a next_variant pointer chaining all dictionary entries
that share a base word (e.g. "reading", "reading(2)", "reading(3)").
lexicon_t gains a parallel hash table base_ht keyed by the base word
with the chain head as value, populated as lexicon_read() inserts each
entry. New lexicon_lookup_variants() returns the chain head;
lex_entry_ortho/lex_entry_next_variant are accessors. Existing
lexicon_lookup() semantics and call sites are unchanged.
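
A minimal sketch of the chain idea, under loud assumptions: the `toy_*` types and functions are hypothetical, and a linear search over parallel arrays stands in for the real `base_ht` hash table. Only the `next_variant` field name and the "chain head per base word" behavior are taken from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ENTRIES 16
#define MAX_WORD    32

typedef struct toy_entry_s {
    const char *ortho;                 /* e.g. "reading(2)" */
    struct toy_entry_s *next_variant;  /* next entry sharing the base word */
} toy_entry_t;

typedef struct {
    toy_entry_t  entries[MAX_ENTRIES];
    int          n_entries;
    /* Stand-in for base_ht: base word -> chain head (and tail for appends). */
    char         base[MAX_ENTRIES][MAX_WORD];
    toy_entry_t *head[MAX_ENTRIES];
    toy_entry_t *tail[MAX_ENTRIES];
    int          n_base;
} toy_lexicon_t;

/* Copy the part of `ortho` before any '(' into `out` (truncating). */
static void base_of(const char *ortho, char *out, size_t out_sz)
{
    size_t n = strcspn(ortho, "(");
    if (n >= out_sz)
        n = out_sz - 1;
    memcpy(out, ortho, n);
    out[n] = '\0';
}

/* Insert an entry and append it to its base word's variant chain. */
void toy_lexicon_insert(toy_lexicon_t *lex, const char *ortho)
{
    assert(lex->n_entries < MAX_ENTRIES);
    toy_entry_t *e = &lex->entries[lex->n_entries++];
    e->ortho = ortho;
    e->next_variant = NULL;

    char base[MAX_WORD];
    base_of(ortho, base, sizeof base);
    for (int i = 0; i < lex->n_base; i++) {
        if (strcmp(lex->base[i], base) == 0) {
            lex->tail[i]->next_variant = e;
            lex->tail[i] = e;
            return;
        }
    }
    int i = lex->n_base++;
    strcpy(lex->base[i], base);
    lex->head[i] = lex->tail[i] = e;
}

/* Return the chain head for a base word, or NULL if unknown. */
toy_entry_t *toy_lookup_variants(toy_lexicon_t *lex, const char *base)
{
    for (int i = 0; i < lex->n_base; i++)
        if (strcmp(lex->base[i], base) == 0)
            return lex->head[i];
    return NULL;
}
```

Walking `next_variant` from the head returned for `reading` visits `reading`, `reading(2)`, and `reading(3)` in insertion order.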

Add phone_graph_t and graph-aware state-sequence builder

mk_phone_graph builds a per-utterance phone graph from a word sequence;
for single-pron utterances the graph is a strict linear chain matching
mk_phone_list's output, otherwise it carries parallel paths for every
pronunciation variant returned by lexicon_lookup_variants. Transcript
tokens already disambiguated with an (N) suffix fall through to
lexicon_lookup and contribute a single path.
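
The shape of the resulting graph can be sketched with a toy builder. Everything here is a hypothetical illustration (`toy_graph_t`, `toy_mk_phone_graph`, fixed-size arrays), not the actual `phone_graph_t` or `mk_phone_graph`; it assumes one independent phone chain per variant with no prefix sharing, and joins every last phone of the previous word to every first phone of the next word.

```c
#include <assert.h>
#include <string.h>

#define MAX_NODES 64
#define MAX_SUCC  8

typedef struct {
    const char *phone;     /* CI phone name */
    int n_next;            /* number of successors */
    int next[MAX_SUCC];    /* successor node indices */
} toy_node_t;

typedef struct { toy_node_t node[MAX_NODES]; int n_node; } toy_graph_t;
typedef struct { const char **phone; int n_phone; } toy_pron_t;
typedef struct { const toy_pron_t *variant; int n_variant; } toy_word_t;

static int add_node(toy_graph_t *g, const char *phone)
{
    assert(g->n_node < MAX_NODES);
    int id = g->n_node++;
    g->node[id].phone = phone;
    g->node[id].n_next = 0;
    return id;
}

static void add_edge(toy_graph_t *g, int from, int to)
{
    assert(g->node[from].n_next < MAX_SUCC);
    g->node[from].next[g->node[from].n_next++] = to;
}

/* Build one chain per pronunciation variant of each word. */
void toy_mk_phone_graph(toy_graph_t *g, const toy_word_t *word, int n_word)
{
    int frontier[MAX_SUCC], n_frontier = 0;  /* last phones of previous word */
    g->n_node = 0;
    for (int w = 0; w < n_word; w++) {
        int new_frontier[MAX_SUCC], n_new = 0;
        assert(word[w].n_variant >= 1 && word[w].n_variant <= MAX_SUCC);
        for (int v = 0; v < word[w].n_variant; v++) {
            const toy_pron_t *p = &word[w].variant[v];
            int prev = -1;
            assert(p->n_phone >= 1);
            for (int i = 0; i < p->n_phone; i++) {
                int id = add_node(g, p->phone[i]);
                if (i == 0) {
                    for (int f = 0; f < n_frontier; f++)
                        add_edge(g, frontier[f], id);
                } else {
                    add_edge(g, prev, id);
                }
                prev = id;
            }
            new_frontier[n_new++] = prev;
        }
        memcpy(frontier, new_frontier, (size_t)n_new * sizeof(int));
        n_frontier = n_new;
    }
}
```

When every word has a single variant, each node ends up with at most one successor, which is the linear-chain case described above.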

phone_graph_split_contexts duplicates any slot whose predecessors carry
multiple distinct CI phones, once per distinct predecessor CI, so each
slot in the result has an unambiguous left context for triphone
resolution. cvt2triphone_graph walks the post-split graph and replaces
each slot's CI phone with its triphone acmod_id using the same
word-position back-off as cvt2triphone. state_seq_make_graph then
builds a state_t array, fanning the non-emit exit of each slot out to
every graph successor with uniform 1/n_next probability; for a linear
graph it produces an array bit-identical to state_seq_make's output on
the equivalent phone[] list.
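
Two of the rules in this paragraph reduce to small computations, sketched below with hypothetical helpers (neither is a function from this branch). `n_context_copies` counts how many copies a slot needs, one per distinct predecessor CI phone id; `uniform_fanout` fills the exit transition probabilities with 1/n_next.

```c
#include <assert.h>

/* Copies needed for a slot: one per distinct predecessor CI phone id.
 * A slot with no predecessors (utterance entry) stays a single slot. */
int n_context_copies(const int *pred_ci, int n_pred)
{
    int n_distinct = 0;
    for (int i = 0; i < n_pred; i++) {
        int seen = 0;
        for (int j = 0; j < i; j++)
            if (pred_ci[j] == pred_ci[i])
                seen = 1;
        if (!seen)
            n_distinct++;
    }
    return n_distinct > 0 ? n_distinct : 1;
}

/* Give each of the n_next graph successors the same exit probability. */
void uniform_fanout(double *tprob, int n_next)
{
    assert(n_next > 0);
    for (int k = 0; k < n_next; k++)
        tprob[k] = 1.0 / n_next;
}
```

For example, a K slot reached from both AH and IY needs two copies, one per left context, while a slot whose predecessors all end in the same CI phone needs only one. With a single successor the fan-out probability is 1.0, consistent with the linear case.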

The static helpers set_next_state/set_prior_state in state_seq.c are
renamed state_seq_set_next/state_seq_set_prior, exposed via the new
private header state_seq_internal.h, and shared by both the linear
state_seq_make and the new state_seq_make_graph. CMakeLists.txt
registers the three new .c files.

bw: add -multipron flag selecting the graph utterance HMM

When -multipron yes is passed (default no for parity with prior
behavior), bw builds each per-utterance training HMM via the new
next_utt_states_graph, which composes mk_phone_graph,
phone_graph_split_contexts, cvt2triphone_graph, and
state_seq_make_graph. Forward-backward then sums posteriors across
all pronunciation variants of every multi-pron word rather than
silently picking the first dictionary entry.

The graph path allocates the state_t array per call; main.c reads
-multipron once before the utterance loop and calls state_seq_free
only on the graph-built result so the linear path's static-buffer
lifetime is unchanged.
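
The ownership rule can be sketched like this. All names are hypothetical stand-ins for the real builders and for main.c's loop: the linear builder hands back a static buffer that must not be freed, the graph builder returns heap memory that the caller frees once per utterance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_MAX_STATES 32

typedef struct { int dummy; } toy_state_t;

static toy_state_t linear_buf[TOY_MAX_STATES];

/* Linear path: returns a static buffer reused across utterances. */
toy_state_t *toy_states_linear(int n_state)
{
    assert(n_state <= TOY_MAX_STATES);
    return linear_buf;
}

/* Graph path: returns a fresh heap allocation owned by the caller. */
toy_state_t *toy_states_graph(int n_state)
{
    assert(n_state > 0);
    return calloc((size_t)n_state, sizeof(toy_state_t));
}

/* Utterance loop: the flag is read once, and only graph-built state
 * arrays are freed. Returns the number of frees performed. */
int toy_train(int n_utt, int multipron)
{
    int n_freed = 0;
    for (int u = 0; u < n_utt; u++) {
        toy_state_t *s = multipron ? toy_states_graph(8)
                                   : toy_states_linear(8);
        if (s == NULL)
            break;              /* allocation failure: stop early */
        /* ... forward-backward and accumulation would go here ... */
        if (multipron) {
            free(s);
            n_freed++;
        }
    }
    return n_freed;
}
```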

Add CFG_MULTIPRON_TRAINING knob to pass -multipron to bw

CFG_MULTIPRON_TRAINING (default yes) is independent of CFG_MULTIPRON:
the new variable toggles the graph utterance HMM in every bw
invocation, while CFG_MULTIPRON still controls the stage-21
sphinx3_align disambiguation pass. The two compose. Transcript tokens
already disambiguated by stage 21 carry an explicit (N) suffix and
collapse to a single-variant expansion in the graph builder, while
un-disambiguated tokens still branch on every dictionary variant
during Baum-Welch.

Each baum_welch.pl under 01.lda_train, 02.mllt_train, 10.falign_ci_hmm,
20.ci_hmm, 30.cd_hmm_untied, 50.cd_hmm_tied, 65.mmie_train, and
80.mllr_adapt conditionally pushes -multipron yes onto its bw arg
list. The knob is also added to templates/librispeech/etc/sphinx_train.cfg.
@lenzo-ka lenzo-ka requested a review from dhdaines May 20, 2026 13:20
@lenzo-ka lenzo-ka merged commit d146065 into cmusphinx:master May 20, 2026
8 checks passed