feat: support parakeet-tdt-ctc-110m hybrid model #383
JarbasAl wants to merge 5 commits into FluidInference:main
Conversation
Add AsrModelVersion.tdtCtc110m for the 110M parameter hybrid TDT-CTC model. Key differences from the 0.6B models:
- Fused preprocessor+encoder (no separate Encoder.mlmodelc)
- Smaller dimensions: encoderHidden=512, vocabSize=1024, 1 LSTM layer
- Array-format vocabulary (vocab.json) instead of dict format
- blankId=1024 (same as v2)

Changes:
- AsrModels: optional encoder, fused frontend loading, array vocab support
- AsrManager: version-aware decoder state shapes, fused frontend availability
- AsrTranscription: skip encoder step when preprocessor output is fused
- TdtDecoderState: parameterized LSTM layer count
- TdtDecoderV3: use config.encoderHiddenSize instead of auto-detection
- EncoderFrameView: accept explicit hidden size parameter
- TranscribeCommand: --model-version tdt-ctc-110m, --model-dir flags
- ModelNames: parakeetTdtCtc110m repo, fused model requirements
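The version-dependent dimensions listed above could be sketched as computed properties on the version enum. This is a hypothetical, simplified sketch based only on the PR description; the actual FluidAudio `AsrModelVersion` API may differ.

```swift
// Hypothetical sketch of version-dependent model dimensions.
// Values come from the PR description, not the real FluidAudio source.
enum AsrModelVersion {
    case v2          // 0.6B models with a separate Encoder.mlmodelc
    case tdtCtc110m  // 110M hybrid TDT-CTC model

    /// Encoder output hidden size (512 for the 110m model, 1024 for 0.6B).
    var encoderHiddenSize: Int {
        self == .tdtCtc110m ? 512 : 1024
    }

    /// Number of LSTM layers in the prediction network (1 for 110m).
    var decoderLayers: Int {
        self == .tdtCtc110m ? 1 : 2
    }

    /// Blank token id; 1024 for both v2 and the 110m model.
    var blankId: Int { 1024 }

    /// The 110m export fuses preprocessor and encoder into one model,
    /// so no separate encoder step is needed.
    var usesFusedFrontend: Bool {
        self == .tdtCtc110m
    }
}
```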
@JarbasAl did you test this on iOS? We had originally fused preprocessor+encoder before and it had incompatibility issues on iOS. Also, what about the benchmarks?
  case qwen3Asr = "FluidInference/qwen3-asr-0.6b-coreml/f32"
  case qwen3AsrInt8 = "FluidInference/qwen3-asr-0.6b-coreml/int8"
  case multilingualG2p = "FluidInference/charsiu-g2p-byt5-coreml"
+ case parakeetTdtCtc110m = "FluidInference/parakeet-tdt-ctc-110m-coreml"
we don't have this on FluidInference HF
I assumed you would upload it before merging; I also sent a companion PR to mobius for the conversion.
Default ASRConfig uses encoderHiddenSize=1024 but the 110m model produces encoder output with hidden size 512, causing a runtime crash in EncoderFrameView. Adapt the config from the model version before passing it to the decoder.
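The mismatch described above can be illustrated with a minimal stand-in for the frame view: slicing a flat encoder buffer with the wrong hidden size either miscounts frames or reads past the end. Types and names here are simplified assumptions, not the real FluidAudio `EncoderFrameView`.

```swift
// Minimal sketch: a frame view over a flattened [frames * hiddenSize]
// encoder output buffer. If the configured hidden size (1024 default)
// does not match the model's actual output width (512 for 110m),
// frame indexing is wrong and out-of-bounds reads can crash.
struct EncoderFrameView {
    let buffer: [Float]  // flattened encoder output
    let hiddenSize: Int  // must match the model's encoder output width

    var frameCount: Int { buffer.count / hiddenSize }

    /// Returns one encoder frame as a slice of the flat buffer.
    func frame(_ index: Int) -> ArraySlice<Float> {
        let start = index * hiddenSize
        return buffer[start ..< start + hiddenSize]
    }
}
```

With a 110m-style buffer of two 512-wide frames, a view built with hiddenSize 512 sees 2 frames, while the 1024 default collapses them into 1 mis-sliced frame, which is why the config must be adapted from the model version first.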
- Accept --model-version tdt-ctc-110m/110m
- Use model-version-aware ASRConfig (blankId, encoderHiddenSize)
- Fix CI debug path to use AsrModels.defaultCacheDirectory
- Update usage text
- TranscribeCommand: add --model-dir and tdt-ctc-110m to help text; fix modelVersionLabel ternary that mislabeled 110m as "v3" in JSON
- TdtDecoderV3.prepareJointInput: use config.encoderHiddenSize instead of the convenience init that hardcodes 1024
I only tested on a Mac mini, not iOS. But I should note I had to use the iOS 18 target for the conversion to work. EDIT: I take that back, it works with 17.
The AsrModels struct holds strong references to MLModel objects. Without clearing it, cleanup() only nil'd the individual model properties but the AsrModels copy still retained all four models.
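The retention issue described above can be sketched with a stand-in class in place of MLModel: because the container struct holds strong references, nil-ing only the individual properties leaves every model alive as long as a copy of the container is retained. Names are simplified assumptions based on the PR text.

```swift
// Sketch of the leak: FakeModel stands in for MLModel.
final class FakeModel {}

// Value-type container with strong references to the loaded models.
struct AsrModels {
    var preprocessor: FakeModel?
    var decoder: FakeModel?
}

final class AsrManager {
    var asrModels: AsrModels?     // container copy that kept models alive
    var preprocessor: FakeModel?  // individual property mirroring the container

    func cleanup() {
        preprocessor = nil  // the old code stopped here; models stayed retained
        asrModels = nil     // the fix: drop the container copy, releasing all models
    }
}
```

A weak reference makes the difference observable: before the `asrModels = nil` line, the model object survives cleanup; with it, ARC deallocates the model as soon as the container is dropped.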
hi @JarbasAl
I am developing an application with FluidAudio where I use a proprietary finetuned version of that model; STT is the odd component not using FluidAudio directly. I figured it could be useful for the community to share support. The 110m model is very lightweight, and WER improved ~3% on my test data by using this instead of the CTC export.
  var timeJump: Int?
- init() throws {
+ init(decoderLayers: Int = 2) throws {
  var chunkIndex = 0
- var chunkDecoderState = TdtDecoderState.make()
+ var chunkDecoderState = TdtDecoderState.make(
+     decoderLayers: manager.asrModels?.version.decoderLayers ?? 2
+ )
BrandonWeng left a comment
We probably need to double-check whether this runs on iOS or not. We previously had issues with iOS when we tried combining the mel processor and the encoder.
If there's no problem, @Alex-Wengg, can't we just replace the existing ctc 110m with this instead of maintaining both?
This is in theory possible, but I will need to do some testing first. The custom vocab research paper did not mention anything about preprocessor specifications, and generally preprocessors are pretty simple too.
Companion PR: FluidInference/mobius#25
Why is this change needed?
Better support for https://huggingface.co/nvidia/parakeet-tdt_ctc-110m
AI Disclosure
I have never worked with Swift before; Claude Opus did most of the work.