feat: add TDT CoreML export for parakeet-tdt-ctc-110m (#25)
Add convert-tdt-coreml.py which exports the TDT decoder components (fused mel+encoder, RNNT decoder LSTM, joint decision with duration) instead of the CTC head. The CTC export only produces blank-dominant log-probabilities unsuitable for greedy transcription in hybrid models.

Components:
- convert-tdt-coreml.py: Full TDT export pipeline (iOS 18 target)
- individual_components.py: Shared torch.nn.Module wrappers for tracing
- Updated README.md: Documents both TDT and CTC export paths
- Updated pyproject.toml: Adds script entry point and includes
- Replace logits[..., -self.num_extra:] with logits[..., self.vocab_with_blank:] to fix Python -0: slicing returning all logits when num_extra == 0
- Guard duration argmax with num_extra > 0 check, return zeros otherwise
- Upgrade num_extra == 0 warning to error since TDT export is invalid without duration head
- Fix _save_mlpackage: set iOS18 deployment target (matching export), remove unnecessary try/except
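The `-0:` pitfall behind the first fix can be reproduced with plain Python lists. The sketch below is illustrative only: the function names and sizes (`vocab_with_blank = 4`, two duration logits) are hypothetical, and a flat list stands in for the last axis of the real logits tensor, assuming the duration logits are laid out after the vocabulary-plus-blank logits.

```python
def split_negative(logits, num_extra):
    """Old approach: take the duration logits by counting back from the end."""
    # -0 == 0 in Python, so with num_extra == 0 this becomes logits[0:],
    # which returns every element instead of an empty slice.
    return logits[-num_extra:]


def split_positive(logits, vocab_with_blank):
    """Fixed approach: start at the first index after the token logits."""
    return logits[vocab_with_blank:]


vocab_with_blank = 4  # hypothetical: 3 tokens + blank

with_durations = [0.1, 0.2, 0.3, 0.4, 9.0, 8.0]  # 4 token logits + 2 duration logits
no_durations = [0.1, 0.2, 0.3, 0.4]              # no duration head at all

# Both variants agree when a duration head exists.
print(split_negative(with_durations, 2))
print(split_positive(with_durations, vocab_with_blank))

# They diverge when num_extra == 0: the old slice leaks all token logits.
print(split_negative(no_durations, 0))
print(split_positive(no_durations, vocab_with_blank))
```

The same indexing rule applies to tensor slicing on the last axis, which is why slicing from a positive start index is the safer form here.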
- Bump fsspec 2024.9.0 -> 2024.12.0 (required by nemo-toolkit 2.3.1)
- Bump datasets 3.1.0 -> 3.3.2 (compatible with new fsspec)
- Add missing transitive deps: editdistance, pyannote.metrics, ipython
The 110m model has no iOS 18-only ops — the int64->int32 warnings during conversion are just precision downcasts, not spec-version-gated operations. Verified all 4 components export at spec version 8 (iOS 17) and inference produces correct transcription via FluidAudio CLI.
| - `parakeet_ctc_decoder.mlpackage` — encoder -> log_probs | ||
| Key differences from the 0.6B export: | ||
| - **Fused frontend**: mel spectrogram + encoder are a single `Preprocessor.mlpackage` (0.6B has separate Preprocessor + Encoder) | ||
| - **iOS 18 deployment target**: Required for int ops in the encoder's positional encoding |
🟡 README claims iOS 18 deployment target but code uses iOS 17
The README at line 66 states "iOS 18 deployment target: Required for int ops in the encoder's positional encoding" as a key difference from the 0.6B export. However, commit 7475673 explicitly changed the deployment target from iOS 18 to iOS 17 in the code (convert-tdt-coreml.py:57 and convert-tdt-coreml.py:185), confirming that iOS 18 is not actually required. The README was not updated to reflect this fix, leaving stale documentation that will mislead users into believing they need iOS 18.
| - **iOS 18 deployment target**: Required for int ops in the encoder's positional encoding | |
| - **iOS 17 deployment target**: int64→int32 precision downcasts are handled automatically; no iOS 18-only ops |
| "decoder_layers": decoder_layers, | ||
| "checkpoint": checkpoint_meta, | ||
| "coreml": { | ||
| "compute_units": export_settings.compute_units.name, |
🟡 Metadata records CPU_ONLY but mel+encoder is exported with CPU_AND_NE
The metadata coreml.compute_units field at convert-tdt-coreml.py:490 records export_settings.compute_units.name which is hardcoded to "CPU_ONLY" (line 184). However, the Preprocessor (mel+encoder) is actually converted with compute_units_override=melenc_cu (line 328), which defaults to CPU_AND_NE via the --mel-encoder-cu CLI option (line 165). This means the metadata misrepresents the actual compute unit configuration of the exported model, which could mislead downstream tools or developers reading the metadata to understand model behavior.
Prompt for agents
In models/stt/parakeet-tdt-ctc-110m/coreml/convert-tdt-coreml.py, the metadata at line 490 records export_settings.compute_units.name (always "CPU_ONLY") for the overall coreml configuration. However, the mel+encoder (Preprocessor) component is converted with melenc_cu (defaults to CPU_AND_NE). Either:
1. Change line 490 to record melenc_cu.name instead, or
2. Add per-component compute_units to the metadata components section (e.g., add a "compute_units" field to each component dict), or
3. Remove the top-level compute_units from the coreml metadata since it doesn't represent a single consistent value across components.
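Option 2 could look roughly like the sketch below. It is not the script's actual code: `build_coreml_metadata` and the component keys are hypothetical, and a small local `ComputeUnit` enum stands in for the coremltools enum of the same name so that the example runs without coremltools installed. Only the `CPU_ONLY` default and the `CPU_AND_NE` override for the mel+encoder come from the review above.

```python
import json
from enum import Enum


class ComputeUnit(Enum):
    """Local stand-in for coremltools' ComputeUnit enum (assumption for this sketch)."""
    CPU_ONLY = 1
    CPU_AND_NE = 2


def build_coreml_metadata(components):
    """Record compute units per component instead of one top-level value."""
    return {
        "components": {
            name: {"compute_units": cu.name} for name, cu in components.items()
        }
    }


default_cu = ComputeUnit.CPU_ONLY    # mirrors export_settings.compute_units
melenc_cu = ComputeUnit.CPU_AND_NE   # mirrors the --mel-encoder-cu default

# Hypothetical component keys; the real metadata layout may differ.
metadata = build_coreml_metadata({
    "preprocessor": melenc_cu,
    "decoder": default_cu,
    "joint_decision": default_cu,
})
print(json.dumps(metadata, indent=2))
```

Keeping the value next to each component means the metadata cannot drift from what was actually passed at conversion time, which is the inconsistency this comment flags.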
companion PR: FluidInference/FluidAudio#383
AI Disclosure
Claude Opus did most of the work