[codex] Support external speech recognition models and Xiaomi MiMo ASR #294

Draft
utopiafar wants to merge 1 commit into memex-lab:main from utopiafar:codex/external-speech-recognition-mimo


@utopiafar

Collaborator

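Background

Closes #293

This PR extends speech recognition from the previous local-model/boolean toggle to a configurable provider, so users can choose a local or an external model for recognition. The first external providers are Tencent Cloud ASR and Xiaomi MiMo ASR, and the change is compatible with the new AI service configuration page.

Requirements covered

  • Supports three speech recognition providers: local, Tencent Cloud, and Xiaomi MiMo.
  • Local and Tencent Cloud continue to support realtime preview / realtime segment updates.
  • Xiaomi MiMo uses the official mimo-v2.5-asr and supports final-text recognition after recording ends or after an audio file is imported.
  • Supports reusing an existing MiMo LLM configuration, as well as manually configuring a separate API Key/Base URL for ASR.
  • A user-supplied token-plan /anthropic address is normalized in the ASR scenario to /v1/chat/completions on the same host.
  • The new configuration stays semantically compatible with the legacy use_local_speech_to_text setting.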
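
The /anthropic normalization in the list above could look roughly like the following. This is a language-neutral Python sketch, not the PR's Dart code: the function name normalize_asr_endpoint is hypothetical, and the handling of base URLs other than /anthropic (keeping an existing /chat/completions path, defaulting an empty path to /v1) is an assumption that the PR description does not confirm.

```python
from urllib.parse import urlsplit, urlunsplit

ASR_CHAT_PATH = "/v1/chat/completions"


def normalize_asr_endpoint(base_url: str) -> str:
    """Map a user-supplied base URL to a chat-completions URL on the same host."""
    parts = urlsplit(base_url.strip())
    path = parts.path.rstrip("/")
    if path == "/anthropic" or path.startswith("/anthropic/"):
        # Rule stated in the PR: /anthropic becomes /v1/chat/completions.
        path = ASR_CHAT_PATH
    elif path.endswith("/chat/completions"):
        # Assumption: an already complete endpoint is left untouched.
        pass
    else:
        # Assumption: append the endpoint, defaulting an empty path to /v1.
        path = (path or "/v1") + "/chat/completions"
    # Scheme and host are preserved; query string and fragment are dropped.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```

Under these assumptions, the token-plan address quoted in the verification notes maps to /v1/chat/completions on the same host rather than to /anthropic/chat/completions.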

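Implementation

  • Adds a SpeechRecognitionConfig / provider domain model that stores the local, Tencent Cloud, and Xiaomi MiMo configuration in one place.
  • Adds TencentCloudAsrService, XiaomiMimoAsrService, and a unified realtime transcription interface.
  • SpeechTranscriptionService dispatches to the local, Tencent Cloud, or MiMo transcription path by provider and explicitly distinguishes realtime capability.
  • The recording entry point now creates the realtime transcriber through a unified transcriber factory; external non-realtime providers take the final-recognition path after recording ends.
  • The custom service page of the new AI configuration page gains a speech recognition capability section with a segmented provider selector, capability tags, and provider-specific forms.
  • The MiMo form supports linking an existing MiMo LLM configuration or entering credentials manually, and offers model/language options.
  • l10n: Chinese and English strings are updated in sync.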
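
The provider dispatch and realtime-capability split described above can be illustrated with a small Python sketch (the PR itself is Dart). The names Provider, SUPPORTS_REALTIME, and final_text are hypothetical, and the rule "prefer non-empty realtime text, otherwise transcribe the recorded file" is an assumption about how the two paths combine, not a copy of the PR's logic.

```python
from enum import Enum
from typing import Callable


class Provider(Enum):
    LOCAL = "local"
    TENCENT_CLOUD = "tencentCloud"
    XIAOMI_MIMO = "xiaomiMimo"


# Realtime capability as stated in the PR description: local and Tencent
# Cloud stream partial results, MiMo only produces a final text.
SUPPORTS_REALTIME = {
    Provider.LOCAL: True,
    Provider.TENCENT_CLOUD: True,
    Provider.XIAOMI_MIMO: False,
}


def final_text(
    provider: Provider,
    realtime_text: str,
    transcribe_file: Callable[[], str],
) -> str:
    """Pick the text to use once recording has stopped."""
    if SUPPORTS_REALTIME[provider] and realtime_text.strip():
        # Assumption: a non-empty realtime result wins over file transcription.
        return realtime_text
    # Non-realtime providers, or an empty realtime result, use the file path.
    return transcribe_file()
```

For example, final_text(Provider.XIAOMI_MIMO, "", lambda: "from file") takes the file path, while a local recording with a non-empty realtime result returns that result without calling transcribe_file.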

Verification

  • Tested against https://token-plan-sgp.xiaomimimo.com/anthropic with redacted credentials: appending /chat/completions directly (that is, /anthropic/chat/completions) returns 404; after normalizing to /v1/chat/completions, mimo-v2.5-asr returns the correct recognized text.
  • flutter test --no-pub --concurrency=1: 633 passed, 7 skipped.
  • Focused speech recognition suite: 58 passed.
  • flutter analyze --no-pub on touched files: No issues found.
  • git diff --check --cached: passed.

@github-actions


PR AI Review / PR AI Semantic Preflight


  • Risk level: HIGH
  • Human review required: YES
  • Golden path impact: LIKELY
  • Confidence: high
  • Workflow run: 27587304673

PR extends speech recognition from a local-only model/boolean toggle to a configurable provider system (Local, Tencent Cloud ASR, Xiaomi MiMo ASR). Adds SpeechRecognitionConfig domain model, TencentCloudAsrService, XiaomiMimoAsrService, and refactors SpeechTranscriptionService to route by provider. UI adds a speech recognition capability section in the AI setup page. Test coverage is comprehensive (6 new/modified test files covering services, ViewModel, and widgets). Risk stems from the recording core-path refactor and external ASR network dependencies.

Affected Areas

  • service
  • ui
  • view_model
  • i18n
  • tests
  • storage

Golden Path

  • record_capture
  • agent_pipeline
  • Rationale: Recording flow (start, realtime transcription, stop calibration, file import) is refactored in InputSheet and MainScreen. External ASR providers (Tencent Cloud, Xiaomi MiMo) introduce new network request paths that affect the core record-to-text pipeline.

Findings

  • high Recording core flow refactor: stop-recording and calibration logic changed. Evidence: lib/ui/main_screen/widgets/input_sheet.dart: _calibrateFromFile now returns bool, and realtimeText from finish() is checked before file transcription; lib/ui/main_screen/widgets/input_sheet.dart: _pickAudioFile now transcribes immediately for all providers (not just local), removing the old cloud-only 'set audioPath' shortcut; lib/data/services/speech_transcription_service.dart: _transcribeSamplesWithCloud removed and replaced by provider-specific routes.
    Recommendation: Test on real devices: (1) local model recording then calibration, (2) Tencent Cloud realtime transcription then stop, (3) MiMo file transcription after recording ends, (4) audio file import with each provider. Current tests use MockClient to verify request construction but do not cover real end-to-end scenarios.
  • warn External ASR uploads audio data to cloud services. Evidence: lib/data/services/tencent_cloud_asr_service.dart: transcribeFile reads local audio bytes and POSTs to asr.cloud.tencent.com; lib/data/services/xiaomi_mimo_asr_service.dart: transcribeBytes base64-encodes audio and sends it via the chat completions API; lib/l10n/app_en.arb: speechRecognitionAudioLeavesDevice = 'Cloud processing'.
    Recommendation: Privacy labels and user notices are in place ('Cloud processing' badge, provider descriptions noting audio upload). Verify whether the LLM data-sharing consent flow (hasLLMConsent) should be triggered before first use of external ASR providers.
  • warn SpeechTranscriptionService changed from a pure singleton to accepting injected dependencies. Evidence: lib/data/services/speech_transcription_service.dart: the constructor now accepts optional TencentCloudAsrService and XiaomiMimoAsrService params that default to the .instance singletons; the SpeechTranscriptionService.instance singleton still exists and works as before.
    Recommendation: The change is in the right direction (enables test injection). However, the new ASR services are not registered in dependencies.dart; they are accessed via their own .instance singletons. Confirm this aligns with the project DI convention (dependencies.dart registers only repositories/services).
  • info Legacy use_local_speech_to_text config remains backward compatible. Evidence: lib/utils/user_storage.dart: getSpeechRecognitionConfig() falls back to the legacy _keyUseLocalSpeechToText boolean when _keySpeechRecognitionConfig is not set; lib/utils/user_storage.dart: saveSpeechRecognitionConfig() writes both the new config and the legacy boolean for backward compatibility.
    Recommendation: Backward compatibility logic is clear. Legacy boolean and new config JSON are written in sync, so upgrading users are unaffected. Consider removing the legacy key in a future release.
  • info SpeechRecognitionConfig has redundant alias getters. Evidence: lib/domain/models/speech_recognition_config.dart: tencentAppId/tencentSecretId/tencentSecretKey/tencentEngineModel are aliases for tencentCloud.appId/etc.; the same file also has tencentCloudAppId/tencentCloudSecretId/tencentCloudSecretKey/tencentCloudEngineType as second-level aliases.
    Recommendation: Two sets of alias getters increase the API surface. Consider unifying to one naming style to reduce choice fatigue.
  • warn Widget test coverage for MiMo provider configuration is thin. Evidence: test/ui/settings/widgets/ai_service_setup_page_test.dart: the 'Xiaomi MiMo speech config supports linked and manual credentials' test exists but the diff was cut off; the Tencent Cloud widget test verifies field entry, engine dropdown, and persistence, so MiMo should have equivalent coverage.
    Recommendation: Verify that the MiMo widget test covers: (1) linked config selection, (2) manual API key input, (3) base URL input, (4) language selection, (5) persistence verification.
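
The legacy-compatibility finding above (read falls back to the old boolean, write updates both keys) can be pictured with a minimal Python sketch over a plain dict. This is not the project's Dart code: CONFIG_KEY is a hypothetical key name, defaulting to local when neither key exists is an assumption, and the provider chosen for a legacy false value is passed in as legacy_false_provider because the finding does not say what it maps to.

```python
import json

LEGACY_KEY = "use_local_speech_to_text"    # legacy boolean named in the PR
CONFIG_KEY = "speech_recognition_config"   # hypothetical key name


def get_speech_config(store: dict, legacy_false_provider: str) -> dict:
    """Read the new config, falling back to the legacy boolean if absent."""
    raw = store.get(CONFIG_KEY)
    if raw is not None:
        return json.loads(raw)
    # Assumption: with no stored value at all, default to the local model.
    use_local = store.get(LEGACY_KEY, True)
    return {"provider": "local" if use_local else legacy_false_provider}


def save_speech_config(store: dict, config: dict) -> None:
    """Write the new JSON config and keep the legacy boolean in sync."""
    store[CONFIG_KEY] = json.dumps(config)
    store[LEGACY_KEY] = config["provider"] == "local"
```

Making the legacy false mapping an explicit parameter keeps that decision visible, which matters because where legacy users land after the upgrade is a product choice rather than a storage detail.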

Test Gaps

  • lib/ui/main_screen/widgets/input_sheet.dart: A widget test for InputSheet recording stop with external provider transcription is missing. The new _calibrateFromFile bool return and realtimeText-first strategy need widget test coverage. Suggested check: test/ui/main_screen/widgets/input_sheet_test.dart (new or extend existing).
  • lib/domain/models/speech_recognition_config.dart: The SpeechRecognitionConfig domain model lacks a dedicated unit test file. Serialization round-trip tests for toJson/fromJson/copyWith/equality should exist under test/domain/models/. Suggested check: test/domain/models/speech_recognition_config_test.dart (new).
  • lib/data/services/speech_transcription_service.dart: The _resolveXiaomiMimoConfig linked config resolution logic in SpeechTranscriptionService is not fully tested. The edge case where apiKey is empty but baseUrl is non-empty is not covered. Suggested check: test/data/services/speech_transcription_service_test.dart (extend).
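
As an illustration of the round-trip test suggested for SpeechRecognitionConfig, here is a Python analogue. The real model is a Dart class whose fields are not shown in this thread, so the field names below are assumptions loosely based on the getters named in the findings, and dataclasses.replace stands in for copyWith.

```python
import json
from dataclasses import asdict, dataclass, field, replace


@dataclass(frozen=True)
class TencentCloudConfig:
    # Field names are assumptions modeled on the getters named in the review.
    app_id: str = ""
    secret_id: str = ""
    secret_key: str = ""
    engine_type: str = ""


@dataclass(frozen=True)
class SpeechRecognitionConfig:
    provider: str = "local"
    tencent_cloud: TencentCloudConfig = field(default_factory=TencentCloudConfig)

    def to_json(self) -> str:
        # asdict converts the nested dataclass into a nested dict.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "SpeechRecognitionConfig":
        data = json.loads(raw)
        return cls(
            provider=data["provider"],
            tencent_cloud=TencentCloudConfig(**data["tencent_cloud"]),
        )


def round_trips(config: SpeechRecognitionConfig) -> bool:
    """True when serializing and parsing yields an equal value."""
    return SpeechRecognitionConfig.from_json(config.to_json()) == config
```

A Dart test would follow the same shape: build a config with non-default values in every nested section, assert that fromJson(toJson(x)) equals x, then assert that copyWith changes only the targeted field.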

AI review is advisory. Maintainers should verify the result before merging.

github-actions bot added the labels ai: high risk (AI review classified the PR as high risk), golden path impact (AI review found possible impact to a core user flow), and needs human review (AI review or policy signals require maintainer review) on Jun 16, 2026.
@github-actions


PR Preflight Summary


  • Combined result: Low risk: both preflights completed and quality passed; use the normal manual merge flow.
  • Policy preflight: LOW RISK. No blocking or high-risk policy signal was found.
  • Flutter quality: PASS. Analyzer and test baselines found no newly introduced issue.
  • PR head: 6e1a21303611e40db55b588b554eadc47eeb7707
  • Policy run: 27587304667
  • Flutter run: 27587304674

PR Policy Preflight


  • Decision: LOW RISK
  • Changed files: 21
  • Changed lines: 4564
  • Diff truncated: false

No deterministic policy findings.

PR Flutter Quality


  • Overall: PASS
  • Analyzer baseline: PASS
  • Test baseline: PASS

Flutter Analyzer Baseline

  • Base issues: 284
  • PR issues: 284
  • New issues: 0

No new analyzer issues introduced by this PR.

Flutter Test Baseline

  • Base failures: 0
  • PR failures: 0
  • New failures: 0

No new Flutter test failures introduced by this PR.

@whbzju

whbzju commented Jun 16, 2026

Collaborator


I agree with the feature direction: speech recognition should support external providers beyond the local model. However, I would not merge this PR in its current shape because the implementation does not match the expected product behavior.

After reviewing the code, the implementation effectively hard-codes three providers: local / tencentCloud / xiaomiMimo in lib/domain/models/speech_recognition_config.dart, and the settings UI exposes only those three options in lib/ui/settings/widgets/ai_service_setup_page.dart. That means the new cloud-side support is mostly Tencent Cloud and Xiaomi MiMo, which is heavily China-provider oriented. For overseas users, this is not the right default direction.

More importantly, this PR replaces the generic cloud transcription path that already exists on main. Previously, when local speech-to-text was disabled, the app used UserStorage.getAgentLLMResources(AgentDefinitions.analyzeAssets) and sent an AudioPart through the user’s existing LLM configuration. In #294, that path is replaced with provider-specific routing to Tencent Cloud or MiMo. As a result, users who already configured an OpenAI-compatible, Gemini, OpenRouter, or other audio/transcription-capable LLM key cannot reuse that existing configuration.

The Tencent Cloud path also has a high setup burden: users need to understand and enter App ID / Secret ID / Secret Key / engine type. That should not become the default cloud transcription path after disabling local transcription. It is better suited as an advanced or regional provider.

The MiMo reuse path is also too narrow. It only reuses configs of type LLMConfig.typeMimo, and only after the user explicitly chooses the MiMo provider or a linked MiMo config. It does not automatically inspect the user’s default/text/vision LLM configs to find a provider or model that supports speech transcription, and there is no generic speech capability resolution.

I suggest changing the design as follows:

  1. Keep and strengthen the generic LLM transcription path as the default cloud path: if the user already has an LLM key and the provider/model supports audio input or speech transcription, use it directly.
  2. Add a speech model/capability resolver: explicit speech model > configured default/text model with speech support > local fallback > advanced provider.
  3. Prefer overseas/generic protocols first, such as OpenAI-compatible transcription/audio-capable endpoints, Gemini audio-capable paths, OpenRouter/custom compatible endpoints, etc. Tencent Cloud and MiMo can remain, but should be advanced/regional options.
  4. Do not migrate legacy use_local_speech_to_text=false to Tencent Cloud. That can leave existing users on an unconfigured Tencent provider and make transcription return unconfigured/null.
  5. Add tests for: reusing an existing audio-capable LLM key first; legacy cloud transcription not falling back to unconfigured Tencent Cloud; Tencent/MiMo still working when explicitly selected.
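
The priority order in point 2 could be sketched as follows. This is a Python illustration of the proposal, not existing project code: the LLMConfig fields, the supports_speech capability flag, and the function name resolve_speech_backend are all hypothetical, and how speech capability would actually be detected is left open.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LLMConfig:
    name: str
    supports_speech: bool   # hypothetical capability flag
    is_valid: bool = True


@dataclass
class Resolution:
    source: str             # "explicit", "llm_config", "local", "advanced", "none"
    target: Optional[str]


def resolve_speech_backend(
    explicit: Optional[LLMConfig],
    llm_configs: List[LLMConfig],
    local_available: bool,
    advanced_provider: Optional[str],
) -> Resolution:
    """explicit speech model > speech-capable LLM config > local > advanced."""
    if explicit is not None and explicit.is_valid and explicit.supports_speech:
        return Resolution("explicit", explicit.name)
    for cfg in llm_configs:
        if cfg.is_valid and cfg.supports_speech:
            return Resolution("llm_config", cfg.name)
    if local_available:
        return Resolution("local", "local")
    # Pass None for an unconfigured advanced provider so that it is never
    # selected implicitly (compare point 4).
    if advanced_provider is not None:
        return Resolution("advanced", advanced_provider)
    return Resolution("none", None)
```

With this shape, a user who already has a speech-capable LLM config is routed to it before the local fallback, and a regional provider such as Tencent Cloud is only reached when it has been configured and nothing earlier in the chain applies.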

Overall, the feature is valuable, but the current PR feels like “add two China ASR providers” rather than “reuse the user’s existing model configuration and choose the best available speech transcription capability.” I’d recommend making the default path LLM-config-first, then keeping Tencent/MiMo as optional advanced providers.
