Bake multiple lexical pronunciation into baum-welch #58
Merged
lex_entry_t gains a next_variant pointer chaining all dictionary entries that share a base word (e.g. "reading", "reading(2)", "reading(3)"). lexicon_t gains a parallel hash table base_ht keyed by the base word with the chain head as value, populated as lexicon_read() inserts each entry. New lexicon_lookup_variants() returns the chain head; lex_entry_ortho/lex_entry_next_variant are accessors. Existing lexicon_lookup() semantics and call sites are unchanged.
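The variant chain described above can be sketched as follows. This is a minimal illustration, not the SphinxTrain code: `entry_t`, `base_len`, `chain_append`, and `chain_length` are hypothetical stand-ins, a plain chain takes the place of the base_ht hash table, and the rule that a base word is the key with a trailing all-digit "(N)" suffix removed is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for lex_entry_t. */
typedef struct entry_s {
    const char *ortho;            /* full dictionary key, e.g. "reading(2)" */
    struct entry_s *next_variant; /* next entry sharing the same base word */
} entry_t;

/* Length of the base word: strips a trailing "(N)" suffix when present.
 * Assumption: a suffix is "(" + one or more digits + ")" at the end. */
static size_t base_len(const char *ortho)
{
    size_t n = strlen(ortho);
    const char *open, *p;

    if (n < 4 || ortho[n - 1] != ')')   /* shortest suffixed key is "a(2)" */
        return n;
    open = strrchr(ortho, '(');
    if (open == NULL || open == ortho || open + 1 == ortho + n - 1)
        return n;                       /* no "(", empty base, or "()" */
    for (p = open + 1; p < ortho + n - 1; p++)
        if (*p < '0' || *p > '9')
            return n;                   /* non-digit inside the parentheses */
    return (size_t)(open - ortho);
}

/* Append e to the tail of a variant chain, preserving dictionary order.
 * Returns the (possibly new) chain head. */
static entry_t *chain_append(entry_t *head, entry_t *e)
{
    entry_t *t;

    assert(e != NULL);
    e->next_variant = NULL;
    if (head == NULL)
        return e;
    for (t = head; t->next_variant != NULL; t = t->next_variant)
        ;
    t->next_variant = e;
    return head;
}

/* Number of entries reachable from the chain head. */
static int chain_length(const entry_t *head)
{
    int n = 0;

    for (; head != NULL; head = head->next_variant)
        n++;
    return n;
}
```

In this sketch a lookup by base word would return the chain head, and a caller would walk `next_variant` to enumerate every pronunciation.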
mk_phone_graph builds a per-utterance phone graph from a word sequence; for single-pron utterances the graph is a strict linear chain matching mk_phone_list's output, otherwise it carries parallel paths for every pronunciation variant returned by lexicon_lookup_variants. Transcript tokens already disambiguated with an (N) suffix fall through to lexicon_lookup and contribute a single path. phone_graph_split_contexts duplicates any slot whose predecessors carry multiple distinct CI phones, once per distinct predecessor CI, so each slot in the result has an unambiguous left context for triphone resolution. cvt2triphone_graph walks the post-split graph and replaces each slot's CI phone with its triphone acmod_id using the same word-position back-off as cvt2triphone. state_seq_make_graph then builds a state_t array, fanning the non-emit exit of each slot out to every graph successor with uniform 1/n_next probability; for a linear graph it produces an array bit-identical to state_seq_make's output on the equivalent phone[] list. The static helpers set_next_state/set_prior_state in state_seq.c are renamed state_seq_set_next/state_seq_set_prior, exposed via the new private header state_seq_internal.h, and shared by both the linear state_seq_make and the new state_seq_make_graph. CMakeLists.txt registers the three new .c files.
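To make the context-splitting rule and the uniform fan-out concrete, here is a small sketch under stated assumptions. `slot_t`, `distinct_left_contexts`, `copies_after_split`, and `fanout_prob` are hypothetical names, and the fixed-size predecessor array is a simplification; the actual phone_graph_t layout is not shown in this PR text.

```c
#include <assert.h>

#define MAX_PRED 8

/* Hypothetical, simplified graph slot: one CI phone plus its predecessors. */
typedef struct {
    int ci;             /* CI phone id carried by this slot */
    int n_pred;         /* number of predecessor slots */
    int pred[MAX_PRED]; /* indices of predecessor slots in the graph array */
} slot_t;

/* Count the distinct CI phones carried by the predecessors of slot s. */
static int distinct_left_contexts(const slot_t *g, int s)
{
    int seen[MAX_PRED];
    int n_seen = 0;
    int i, j;

    assert(g[s].n_pred <= MAX_PRED);
    for (i = 0; i < g[s].n_pred; i++) {
        int ci = g[g[s].pred[i]].ci;

        for (j = 0; j < n_seen; j++)
            if (seen[j] == ci)
                break;
        if (j == n_seen)
            seen[n_seen++] = ci;
    }
    return n_seen;
}

/* Copies of slot s needed so that each copy has one left context.
 * A slot with zero or one distinct predecessor CI phone is kept as is. */
static int copies_after_split(const slot_t *g, int s)
{
    int d = distinct_left_contexts(g, s);

    return d > 1 ? d : 1;
}

/* Uniform probability placed on each exit arc of a slot with n_next
 * successors; a linear chain has n_next == 1 and therefore gets 1.0. */
static double fanout_prob(int n_next)
{
    assert(n_next > 0);
    return 1.0 / n_next;
}
```

For a word with two variants that differ in one phone, the slot following the two parallel phones sees two distinct left contexts and would be duplicated once per context, while slots further downstream see a single CI phone again.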
When -multipron yes is passed (default no for parity with prior behavior), bw builds each per-utterance training HMM via the new next_utt_states_graph, which composes mk_phone_graph, phone_graph_split_contexts, cvt2triphone_graph, and state_seq_make_graph. Forward-backward then sums posteriors across all pronunciation variants of every multi-pron word rather than silently picking the first dictionary entry. The graph path allocates the state_t array per call; main.c reads -multipron once before the utterance loop and calls state_seq_free only on the graph-built result so the linear path's static-buffer lifetime is unchanged.
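The ownership rule in this commit (free only what the graph path allocated) can be illustrated with a short sketch. Everything here is a hypothetical stand-in: `st_t` replaces state_t, `build_linear` and `build_graph` replace the two utterance-HMM builders, and `seq_free` replaces state_seq_free; the state counts are arbitrary.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int id; } st_t;   /* hypothetical stand-in for state_t */

static st_t linear_buf[64];        /* static buffer reused by the linear path */
static int n_freed = 0;            /* counts calls to seq_free, for illustration */

/* Linear builder: returns a pointer into a static buffer (caller must not free). */
static st_t *build_linear(unsigned *n_state)
{
    *n_state = 3;
    return linear_buf;
}

/* Graph builder: allocates a fresh array on every call (caller must free). */
static st_t *build_graph(unsigned *n_state)
{
    *n_state = 5;
    return calloc(*n_state, sizeof(st_t));
}

static void seq_free(st_t *seq)
{
    free(seq);
    n_freed++;
}

/* Utterance loop: the flag is read once by the caller and passed in;
 * only the graph-built array is freed. Returns the total state count. */
static unsigned process_utts(int n_utt, int multipron)
{
    unsigned total = 0;
    int i;

    for (i = 0; i < n_utt; i++) {
        unsigned n_state = 0;
        st_t *seq = multipron ? build_graph(&n_state) : build_linear(&n_state);

        if (seq == NULL)
            continue;              /* allocation failed: skip this utterance */
        total += n_state;          /* forward-backward would run here */
        if (multipron)
            seq_free(seq);         /* never free the linear path's static buffer */
    }
    return total;
}
```

Keeping the free conditional on the same flag that selected the builder is what leaves the linear path's buffer lifetime untouched.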
CFG_MULTIPRON_TRAINING (default yes) is independent of CFG_MULTIPRON: the new variable toggles the graph utterance HMM in every bw invocation, while CFG_MULTIPRON still controls the stage-21 sphinx3_align disambiguation pass. The two compose. Transcript tokens already disambiguated by stage 21 carry an explicit (N) suffix and collapse to a single-variant expansion in the graph builder, while un-disambiguated tokens still branch on every dictionary variant during Baum-Welch. Each baum_welch.pl under 01.lda_train, 02.mllt_train, 10.falign_ci_hmm, 20.ci_hmm, 30.cd_hmm_untied, 50.cd_hmm_tied, 65.mmie_train, and 80.mllr_adapt conditionally pushes -multipron yes onto its bw arg list. The knob is also added to templates/librispeech/etc/sphinx_train.cfg.
Multi-pronunciation Baum-Welch training
Baum-Welch in SphinxTrain silently uses pronunciation variant [1] for every multi-pron word in the dictionary, so (2)/(3) entries receive no acoustic training unless the transcript token already carries an explicit (N) suffix.
This branch fixes that in the utterance-HMM construction layer. The forward, backward, viterbi, and accumulator code paths are unchanged; they already operate on a general DAG via state_t.next_state[] and prior_state[]. The change replaces the implicit linear phone chain produced by mk_phone_list with an explicit phone_graph_t carrying parallel paths per pronunciation variant, then resolves triphone contexts and builds the state sequence over that graph.
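Why a recursion that already works on a DAG sums over pronunciation variants can be seen in a deliberately simplified sketch. It is node-indexed rather than time-indexed, so it is not the bw forward pass; `node_t`, `forward_total`, and the single per-node `score` (a stand-in for acoustic likelihood) are assumptions made for illustration.

```c
#include <assert.h>

#define MAX_ARCS 4

/* Hypothetical DAG node with incoming arcs only, listed in topological order. */
typedef struct {
    double score;                /* stand-in for this node's acoustic likelihood */
    int n_prior;                 /* number of incoming arcs */
    int prior[MAX_ARCS];         /* indices of predecessor nodes */
    double prior_prob[MAX_ARCS]; /* transition probability on each incoming arc */
} node_t;

/* alpha[s] = score[s] * sum over predecessors p of alpha[p] * P(p -> s).
 * Nodes must be in topological order; returns alpha of the last node. */
static double forward_total(const node_t *g, int n, double *alpha)
{
    int s, i;

    assert(n > 0);
    for (s = 0; s < n; s++) {
        if (g[s].n_prior == 0) {
            alpha[s] = g[s].score;          /* entry node */
        } else {
            double sum = 0.0;

            for (i = 0; i < g[s].n_prior; i++)
                sum += alpha[g[s].prior[i]] * g[s].prior_prob[i];
            alpha[s] = sum * g[s].score;
        }
    }
    return alpha[n - 1];
}
```

With an entry node fanning out at probability 0.5 to two variant nodes scored 0.75 and 0.25, the total is 0.5 and the first variant's share of it is 0.75: both variants contribute to the sum in proportion to how well they match, instead of only the first dictionary entry being scored.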
Commits
- Add lexicon base-word variant index: lex_entry_t.next_variant chain and a parallel lexicon_t.base_ht keyed by the base word (reading for reading(2)). Existing lexicon_lookup behavior unchanged.
- Add phone_graph_t and graph-aware state-sequence builder: new mk_phone_graph, phone_graph_split_contexts, cvt2triphone_graph, and state_seq_make_graph. For a single-pron utterance the graph is a strict linear chain and the resulting state array is bit-identical to state_seq_make's. Renames the two static helpers in state_seq.c and shares them via a new private state_seq_internal.h.
- bw: add -multipron flag selecting the graph utterance HMM: wires a new next_utt_states_graph into bw's main loop, gated by -multipron (default no for parity). The graph path allocates state_t per call and is freed with state_seq_free; the linear path's static-buffer lifetime is unchanged.
- Add CFG_MULTIPRON_TRAINING knob to pass -multipron to bw: new CFG_MULTIPRON_TRAINING (default yes) in etc/sphinx_train.cfg and the librispeech template, plumbed into every baum_welch.pl under 01.lda_train, 02.mllt_train, 10.falign_ci_hmm, 20.ci_hmm, 30.cd_hmm_untied, 50.cd_hmm_tied, 65.mmie_train, and 80.mllr_adapt.
CFG_MULTIPRON_TRAINING is independent of CFG_MULTIPRON (the stage-21 sphinx3_align disambiguation pass). The two compose: tokens already disambiguated by stage 21 carry an (N) suffix and collapse to a single-variant expansion in the graph builder; un-disambiguated tokens still branch on every dictionary variant during Baum-Welch.
Compatibility
-multipron no preserves prior behavior. Each commit compiles standalone.
CFG_MULTIPRON_TRAINING = 'yes' enables the new path in this fork's recipes. Set to 'no' to fall back to the linear builder.
Validation
No automated tests added (the repo's existing Perl test/ harness isn't wired into CMake and lacks a scaffold for this layer). Validated empirically:
models with
sphinx3_align -outsent.cmp shows first diff in CI/CD means around byte 60).
different. 29 / 54 utterances had at least one disagreement,
concentrated on high-frequency function words.
slightly different cut).
Net change
~1,050 lines new C, ~80 lines modifications, ~60 lines Perl/config.