Skip to content

fix(ai): per-provider timeout in fallback chain + raise overall cap#96

Open
alex-mextner wants to merge 1 commit into
mainfrom
fix/ai-agent-timeout-fallback
Open

fix(ai): per-provider timeout in fallback chain + raise overall cap#96
alex-mextner wants to merge 1 commit into
mainfrom
fix/ai-agent-timeout-fallback

Conversation

@alex-mextner
Copy link
Copy Markdown
Owner

Проблема

ИИ-агент падал с «❌ Ошибка AI. Попробуйте позже.» на легитимных многораундовых вопросах. Доказано по логам прода: вопрос стартовал в 09:17:59, тройной abort всех провайдеров в 09:18:59 — ровно через 60 секунд. Это срабатывание AGENT_TIMEOUT_MS, а не флакайность провайдеров.

Два отдельных бага:

  1. Отравление fallback-цепочки общим signal'ом. Один AbortController (60с) оборачивал весь runAgentLoop (до 10 раундов), и его signal шёл во все вызовы aiStreamRound и во все провайдеры fallback (z.ai → Gemini → HF). Когда дедлайн срабатывал посреди раунда, signal оставался аборченным навсегда → каждый следующий провайдер падал с APIUserAbortError в ту же миллисекунду. Fallback бесполезен.
  2. 60с слишком жёстко. z.ai glm-5.1 ~27с/раунд; вопрос на 3 раунда = ~81с > 60с → abort.

Фикс

  • streaming.ts: каждый провайдер в fallback получает свежий AbortController (PER_PROVIDER_TIMEOUT_MS = 45_000), скомбинированный с общим options.signal через AbortSignal.any. Медленный провайдер аботится через 45с, а следующий пробуется с чистым signal. После падения проверка options.signal?.aborted различает: общий дедлайн → стоп цепочки + AbortError; таймаут провайдера → следующий провайдер.
  • agent.ts: AGENT_TIMEOUT_MS 60с → 180с. Контроллер вынесен из цикла ретраев (один дедлайн на все попытки). Общий abort теперь даёт юзеру «⏳ Время ожидания истекло», а не «❌ Ошибка AI».

Константы: 3×45с = 135с < 180с — один раунд успевает перебрать все три провайдера в рамках бюджета.

Тесты

TDD: fallback со свежим signal после таймаута провайдера (падал на старом коде — регрессия Problem A), стоп цепочки при общем дедлайне с классификацией AbortError, 3-раундовый вопрос завершается, застрявший прогон ограничен дедлайном с timeout-сообщением.

Полный сьют: 3412/3412 зелёные, tsc чисто.

🤖 Generated with Claude Code

The AI agent failed with "Ошибка AI" on legitimate multi-round questions
because of two distinct problems:

Problem A — shared abort signal poisoned the fallback chain. A single
AbortController with a 60s deadline wrapped the entire agent loop, and the
same signal was handed to every provider in aiStreamRound's fallback chain.
Once the deadline fired mid-round, the signal stayed aborted, so z.ai →
Gemini → HF all rejected instantly with the same abort — the fallback was
useless.

Problem B — the 60s overall budget was too tight. z.ai glm-5.1 takes ~27s
per round; a 3-round query (~81s) exceeded 60s and aborted.

Fixes:
- Each provider in the fallback loop now gets its own fresh per-provider
  timeout (PER_PROVIDER_TIMEOUT_MS = 45s), combined with the caller's overall
  signal via AbortSignal.any. A slow provider aborts after 45s and the loop
  retries the next with a clean, non-aborted signal.
- The loop distinguishes the two abort causes: if the caller's overall signal
  is aborted, it stops the chain and throws an AbortError-classified error;
  if only the per-provider timeout fired, it continues to the next provider.
- Raise AGENT_TIMEOUT_MS to 180s and move the AbortController/timeout outside
  the retry loop so one deadline spans all retries — a truly-stuck run is
  bounded by the cap, not cap × attempts.
- The overall-deadline abort is thrown as a plain Error with name='AbortError'
  so run()'s catch surfaces "Время ожидания истекло" instead of the generic
  "Ошибка AI".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 4, 2026

🤖 Stage bot deployed! Test it: https://t.me/ExpenseSyncStageBot

Branch: fix/ai-agent-timeout-fallback @ c5f934a

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 19d2a6fed9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

* the next one with a clean signal. NOT shared across providers, so one stuck
* provider does not poison the fallback chain.
*/
const PER_PROVIDER_TIMEOUT_MS = 45_000;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve longer caller budgets for active streams

This 45s default now applies to every aiStreamRound caller even when the caller explicitly allows longer streaming: for example, the deep advice flow passes maxTokens: 3000 with a 120s AbortSignal.timeout and notes that deep responses need more time (src/bot/commands/ask.ts:31-40,336-340,426-428). Because this timer is wall-clock rather than idle-based, an otherwise healthy provider that is still streaming after 45s is aborted; since text has already been emitted, the fallback path refuses to switch providers and the advice generation fails/retries instead of using its intended 120s budget. Consider making this cap caller-configurable or enforcing it only when no progress is made.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant