fix(ai): per-provider timeout in fallback chain + raise overall cap#96
fix(ai): per-provider timeout in fallback chain + raise overall cap#96alex-mextner wants to merge 1 commit into
Conversation
The AI agent failed with "Ошибка AI" on legitimate multi-round questions because of two distinct problems: Problem A — shared abort signal poisoned the fallback chain. A single AbortController with a 60s deadline wrapped the entire agent loop, and the same signal was handed to every provider in aiStreamRound's fallback chain. Once the deadline fired mid-round, the signal stayed aborted, so z.ai → Gemini → HF all rejected instantly with the same abort — the fallback was useless. Problem B — the 60s overall budget was too tight. z.ai glm-5.1 takes ~27s per round; a 3-round query (~81s) exceeded 60s and aborted. Fixes: - Each provider in the fallback loop now gets its own fresh per-provider timeout (PER_PROVIDER_TIMEOUT_MS = 45s), combined with the caller's overall signal via AbortSignal.any. A slow provider aborts after 45s and the loop retries the next with a clean, non-aborted signal. - The loop distinguishes the two abort causes: if the caller's overall signal is aborted, it stops the chain and throws an AbortError-classified error; if only the per-provider timeout fired, it continues to the next provider. - Raise AGENT_TIMEOUT_MS to 180s and move the AbortController/timeout outside the retry loop so one deadline spans all retries — a truly-stuck run is bounded by the cap, not cap × attempts. - The overall-deadline abort is thrown as a plain Error with name='AbortError' so run()'s catch surfaces "Время ожидания истекло" instead of the generic "Ошибка AI". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
🤖 Stage bot deployed! Test it: https://t.me/ExpenseSyncStageBot Branch: |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 19d2a6fed9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| * the next one with a clean signal. NOT shared across providers, so one stuck | ||
| * provider does not poison the fallback chain. | ||
| */ | ||
| const PER_PROVIDER_TIMEOUT_MS = 45_000; |
There was a problem hiding this comment.
Preserve longer caller budgets for active streams
This 45s default now applies to every aiStreamRound caller even when the caller explicitly allows longer streaming: for example, the deep advice flow passes maxTokens: 3000 with a 120s AbortSignal.timeout and notes that deep responses need more time (src/bot/commands/ask.ts:31-40,336-340,426-428). Because this timer is wall-clock rather than idle-based, an otherwise healthy provider that is still streaming after 45s is aborted; since text has already been emitted, the fallback path refuses to switch providers and the advice generation fails/retries instead of using its intended 120s budget. Consider making this cap caller-configurable or enforcing it only when no progress is made.
Useful? React with 👍 / 👎.
Проблема
ИИ-агент падал с «❌ Ошибка AI. Попробуйте позже.» на легитимных многораундовых вопросах. Доказано по логам прода: вопрос стартовал в 09:17:59, тройной abort всех провайдеров в 09:18:59 — ровно через 60 секунд. Это срабатывание
AGENT_TIMEOUT_MS, а не флакайность провайдеров.Два отдельных бага:
AbortController(60с) оборачивал весьrunAgentLoop(до 10 раундов), и егоsignalшёл во все вызовыaiStreamRoundи во все провайдеры fallback (z.ai → Gemini → HF). Когда дедлайн срабатывал посреди раунда, signal оставался аборченным навсегда → каждый следующий провайдер падал сAPIUserAbortErrorв ту же миллисекунду. Fallback бесполезен.Фикс
streaming.ts: каждый провайдер в fallback получает свежийAbortController(PER_PROVIDER_TIMEOUT_MS = 45_000), скомбинированный с общимoptions.signalчерезAbortSignal.any. Медленный провайдер аботится через 45с, а следующий пробуется с чистым signal. После падения проверкаoptions.signal?.abortedразличает: общий дедлайн → стоп цепочки +AbortError; таймаут провайдера → следующий провайдер.agent.ts:AGENT_TIMEOUT_MS60с → 180с. Контроллер вынесен из цикла ретраев (один дедлайн на все попытки). Общий abort теперь даёт юзеру «⏳ Время ожидания истекло», а не «❌ Ошибка AI».Константы: 3×45с = 135с < 180с — один раунд успевает перебрать все три провайдера в рамках бюджета.
Тесты
TDD: fallback со свежим signal после таймаута провайдера (падал на старом коде — регрессия Problem A), стоп цепочки при общем дедлайне с классификацией AbortError, 3-раундовый вопрос завершается, застрявший прогон ограничен дедлайном с timeout-сообщением.
Полный сьют: 3412/3412 зелёные, tsc чисто.
🤖 Generated with Claude Code