feat: support parakeet-tdt-ctc-110m hybrid model #383

Open
JarbasAl wants to merge 5 commits into FluidInference:main from TigreGotico:feat/tdt-ctc-110m-support
Conversation


@JarbasAl JarbasAl commented Mar 16, 2026

Add AsrModelVersion.tdtCtc110m for the 110M parameter hybrid TDT-CTC model. Key differences from the 0.6B models:

  • Fused preprocessor+encoder (no separate Encoder.mlmodelc)
  • Smaller dimensions: encoderHidden=512, vocabSize=1024, 1 LSTM layer
  • Array-format vocabulary (vocab.json) instead of dict format
  • blankId=1024 (same as v2)
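
The version-dependent differences above could be captured on the model-version enum itself. A minimal hypothetical sketch — the type and property names mirror this description, not necessarily FluidAudio's actual API:

```swift
// Hypothetical sketch: version-dependent dimensions as described in this
// PR. Names and the non-110m values are illustrative assumptions, not
// FluidAudio's actual API.
enum AsrModelVersion {
    case v2, v3, tdtCtc110m

    // 110m encoder emits 512-wide frames; the 0.6B models use 1024.
    var encoderHiddenSize: Int { self == .tdtCtc110m ? 512 : 1024 }

    // 110m uses a single LSTM layer; the 0.6B models use two.
    var decoderLayers: Int { self == .tdtCtc110m ? 1 : 2 }

    // The 110m export fuses preprocessor+encoder into one compiled
    // model, so no separate Encoder.mlmodelc is loaded for it.
    var hasFusedFrontend: Bool { self == .tdtCtc110m }
}
```

Hanging the dimensions off the version keeps the decoder and frame-view code free of per-model special cases.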

Changes:

  • AsrModels: optional encoder, fused frontend loading, array vocab support
  • AsrManager: version-aware decoder state shapes, fused frontend availability
  • AsrTranscription: skip encoder step when preprocessor output is fused
  • TdtDecoderState: parameterized LSTM layer count
  • TdtDecoderV3: use config.encoderHiddenSize instead of auto-detection
  • EncoderFrameView: accept explicit hidden size parameter
  • TranscribeCommand: --model-version tdt-ctc-110m, --model-dir flags
  • ModelNames: parakeetTdtCtc110m repo, fused model requirements
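
The "array vocab support" item could be handled by a loader that accepts both shapes of vocab.json: the 110m's array of token strings (index == token id) and the 0.6B's token-to-id dictionary. A sketch under that assumption — this is not FluidAudio's actual loader:

```swift
import Foundation

// Hypothetical sketch: load vocab.json whether it is an array of tokens
// (110m export) or a {"token": id} dictionary (0.6B exports). The two
// accepted shapes are assumptions based on this PR's description.
func loadVocabulary(from data: Data) throws -> [Int: String] {
    let json = try JSONSerialization.jsonObject(with: data)
    if let array = json as? [String] {
        // Array format: the position in the array is the token id.
        return Dictionary(uniqueKeysWithValues:
            array.enumerated().map { ($0.offset, $0.element) })
    }
    if let dict = json as? [String: Int] {
        // Dict format: token string maps to its id; invert it.
        return Dictionary(uniqueKeysWithValues:
            dict.map { ($0.value, $0.key) })
    }
    throw CocoaError(.fileReadCorruptFile)
}
```

Returning a single `[Int: String]` from both branches lets the decoder stay format-agnostic.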

Companion PR: FluidInference/mobius#25

Why is this change needed?

Better support for https://huggingface.co/nvidia/parakeet-tdt_ctc-110m

AI Disclosure

I have never worked with Swift before; Claude Opus did most of the work.




Alex-Wengg (Member) commented Mar 16, 2026

@JarbasAl did you test this on iOS? We originally had a fused preprocessor+encoder and it had incompatibility issues on iOS.

Also, what about the benchmarks?

case qwen3Asr = "FluidInference/qwen3-asr-0.6b-coreml/f32"
case qwen3AsrInt8 = "FluidInference/qwen3-asr-0.6b-coreml/int8"
case multilingualG2p = "FluidInference/charsiu-g2p-byt5-coreml"
case parakeetTdtCtc110m = "FluidInference/parakeet-tdt-ctc-110m-coreml"
Alex-Wengg (Member) commented Mar 16, 2026:
we don't have this on FluidInference HF

JarbasAl (Author) commented:
I assumed you would upload it before merging; I also sent a companion PR to mobius for the conversion.

@JarbasAl JarbasAl marked this pull request as draft March 16, 2026 18:09
Default ASRConfig uses encoderHiddenSize=1024 but the 110m model produces
encoder output with hidden size 512, causing a runtime crash in
EncoderFrameView. Adapt the config from the model version before passing
it to the decoder.
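
The fix described in this commit amounts to deriving the config from the model version before constructing the decoder. A hypothetical sketch — `ASRConfig`'s fields and defaults here are assumptions, not FluidAudio's actual types:

```swift
// Hypothetical sketch of version-aware config adaptation. ASRConfig's
// fields and defaults are assumptions based on this PR's description.
struct ASRConfig {
    var encoderHiddenSize: Int = 1024  // default matches the 0.6B models
    var blankId: Int = 1024
}

func adaptedConfig(for version: String,
                   base: ASRConfig = ASRConfig()) -> ASRConfig {
    var config = base
    if version == "tdt-ctc-110m" {
        // The 110m encoder emits 512-wide frames; leaving the default
        // 1024 here is what crashed EncoderFrameView at runtime.
        config.encoderHiddenSize = 512
    }
    return config
}
```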
- Accept --model-version tdt-ctc-110m/110m
- Use model-version-aware ASRConfig (blankId, encoderHiddenSize)
- Fix CI debug path to use AsrModels.defaultCacheDirectory
- Update usage text
- TranscribeCommand: add --model-dir and tdt-ctc-110m to help text,
  fix modelVersionLabel ternary that mislabeled 110m as "v3" in JSON
- TdtDecoderV3.prepareJointInput: use config.encoderHiddenSize instead
  of convenience init that hardcodes 1024
@JarbasAl JarbasAl marked this pull request as ready for review March 16, 2026 19:35
JarbasAl (Author) commented Mar 16, 2026

@JarbasAl did you test this on iOS? We originally had a fused preprocessor+encoder and it had incompatibility issues on iOS.

Also, what about the benchmarks?

I only tested on a Mac mini, not iOS. But I should note I had to use the iOS 18 target for the conversion to work.

EDIT: I take that back; it works with 17.


The AsrModels struct holds strong references to MLModel objects.
Without clearing it, cleanup() only nil'd the individual model
properties but the AsrModels copy still retained all four models.
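
The retain issue this commit describes can be reproduced with any struct that holds class references: a minimal sketch, assuming `cleanup()` previously nil'd only the individual properties. The types here are illustrative stand-ins, not FluidAudio's:

```swift
// Minimal sketch of the retain bug: a struct copy holds strong
// references to class instances, so nil-ing the individual properties
// alone does not release them. `Model` stands in for MLModel.
final class Model {}

struct AsrModels {
    var encoder: Model?
    var decoder: Model?
}

final class AsrManager {
    var asrModels: AsrModels?
    var encoder: Model?

    func cleanup() {
        encoder = nil    // not enough on its own...
        asrModels = nil  // ...the struct copy must be cleared too
    }
}
```

Because `AsrModels` is a value type, the copy stored in `asrModels` keeps its own strong references until the whole property is set to nil.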
Alex-Wengg (Member) commented:

hi @JarbasAl
Thanks for the contribution! What's the intended use case for this model? The differences listed in the description (fused frontend, smaller hidden size, array vocab) are structural traits rather than advantages over the 0.6B. What motivated the conversion of this model?

JarbasAl (Author) commented Mar 16, 2026

hi @JarbasAl Thanks for the contribution! What's the intended use case for this model? The differences listed in the description (fused frontend, smaller hidden size, array vocab) are structural traits rather than advantages over the 0.6B. What motivated the conversion of this model?

I am developing an application with FluidAudio where I use a proprietary fine-tuned version of that model; STT is the odd component not using FluidAudio directly.

I figured it could be useful for the community to share support; the 110m model is very lightweight.

WER improved ~3% on my test data by using this instead of the CTC export.

var timeJump: Int?

init() throws {
init(decoderLayers: Int = 2) throws {
Alex-Wengg (Member) commented:

Any reason why Int = 2?

var chunkIndex = 0
var chunkDecoderState = TdtDecoderState.make()
var chunkDecoderState = TdtDecoderState.make(
decoderLayers: manager.asrModels?.version.decoderLayers ?? 2
Alex-Wengg (Member) commented Mar 16, 2026:

2?

BrandonWeng (Member) left a comment:
We probably need to double-check if this runs on iOS or not. We previously had issues with iOS when we tried combining the mel processor and the encoder.

#118


If there's no problem, @Alex-Wengg, can't we just replace the existing CTC 110m with this instead of maintaining both?

SGD2718 added labels Mar 17, 2026: enhancement (New feature or request), speech-to-text (issues related to transcription/asr)
Alex-Wengg (Member) commented:

We probably need to double-check if this runs on iOS or not. We previously had issues with iOS when we tried combining the mel processor and the encoder.

#118

If there's no problem, @Alex-Wengg, can't we just replace the existing CTC 110m with this instead of maintaining both?

This is in theory possible, but I will need to do some testing first. The custom vocab research paper did not mention anything about preprocessor specifications, and preprocessors are generally pretty simple too.
