Skip to content

Stage 2: LLM calls have no retry/backoff on 429 or timeout; default concurrency 10 is unconfigurable #10

@wernerkasselman-au

Description

@wernerkasselman-au

Summary

The Stage 2 LLM calls have no retry on transient failures. A single HTTP 429 or a single request timeout fails the call outright, and (because of the related abort-everything behaviour) takes the whole semantic pass down with it. There is also no backoff, and the default concurrency is high enough that hosted tiers 429 almost immediately.

Where it happens

Commit 8c9f5cc, v2.0.0, src/skillspector/llm_analyzer_base.py:

  • Async path, :391 and :393:

    response = await self._structured_llm.ainvoke(prompt)
    ...
    response = (await self._llm.ainvoke(prompt)).content
  • Sync path, :356 and :358:

    response = self._structured_llm.invoke(prompt)
    ...
    response = self._llm.invoke(prompt).content

None of these are wrapped in a retry. A RateLimitError (429) or an APITimeoutError raises on the first attempt and that is the end of it.

The default concurrency makes this easy to hit, :368:

max_concurrency: int = 10,

Ten parallel requests with no backoff is more than several hosted tiers will accept. On the NVIDIA build tier I saw nearly every call 429, and there is no env knob to turn the concurrency down without editing the source.

Why this matters

429 and timeout are the normal, expected failure modes of every hosted inference API; they are not exceptional. Treating the first one as fatal means the semantic pass is fragile exactly where it should be resilient. Combined with the all-or-nothing fallback, the practical result is that on a hosted tier the LLM filter frequently does not run at all, and the user is not told.

I do recognise that adding retries trades wall-clock for reliability, and that nobody wants an unbounded retry storm. A small bounded exponential backoff (a handful of attempts) is the usual middle ground and is plenty here.

Reproduce

  1. Use a provider/tier with a low requests-per-minute ceiling.
  2. Run skillspector scan <dir> --verbose over a tree with enough files to exceed the ceiling at concurrency 10.
  3. Observe repeated 429s, no retry, and (via the related abort bug) a fallback to static-only.

Suggested direction

  • Add a bounded exponential-backoff retry around the four invoke sites, scoped to genuinely transient errors only: rate-limit (429) and timeout. Do not retry 4xx like the context-overflow 400, because retrying that is pointless; let it fail the batch and be handled by per-batch isolation.
  • Make max_concurrency configurable by environment variable (something like SKILLSPECTOR_MAX_CONCURRENCY) and consider a lower default than 10, since the safe ceiling varies a lot by provider and tier.
  • Detect the retryable cases defensively across SDKs (check for a 429 status, a RateLimit/Timeout error name, or the matching text) rather than importing one provider's exception types.

Related: this compounds with the abort-everything fallback. Retries reduce how often a transient error fires; per-batch isolation limits the damage when one finally does.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions