Stage 2: LLM calls have no retry/backoff on 429 or timeout; default concurrency 10 is unconfigurable

## Summary

The Stage 2 LLM calls have no retry on transient failures. A single HTTP 429 or a single request timeout fails the call outright, and (because of the related abort-everything behaviour) takes the whole semantic pass down with it. There is also no backoff, and the default concurrency is high enough that hosted tiers 429 almost immediately.

## Where it happens

Commit `8c9f5cc`, v2.0.0, `src/skillspector/llm_analyzer_base.py`:

- Async path, `:391` and `:393`:

  ```python
  response = await self._structured_llm.ainvoke(prompt)
  ...
  response = (await self._llm.ainvoke(prompt)).content
  ```

- Sync path, `:356` and `:358`:

  ```python
  response = self._structured_llm.invoke(prompt)
  ...
  response = self._llm.invoke(prompt).content
  ```

None of these are wrapped in a retry. A `RateLimitError` (429) or an `APITimeoutError` raises on the first attempt and that is the end of it.

The default concurrency makes this easy to hit, `:368`:

```python
max_concurrency: int = 10,
```

Ten parallel requests with no backoff is more than several hosted tiers will accept. On the NVIDIA build tier I saw nearly every call 429, and there is no env knob to turn the concurrency down without editing the source.

## Why this matters

429 and timeout are the normal, expected failure modes of every hosted inference API; they are not exceptional. Treating the first one as fatal means the semantic pass is fragile exactly where it should be resilient. Combined with the all-or-nothing fallback, the practical result is that on a hosted tier the LLM filter frequently does not run at all, and the user is not told.

I do recognise that adding retries trades wall-clock for reliability, and that nobody wants an unbounded retry storm. A small bounded exponential backoff (a handful of attempts) is the usual middle ground and is plenty here.

## Reproduce

1. Use a provider/tier with a low requests-per-minute ceiling.
2. Run `skillspector scan <dir> --verbose` over a tree with enough files to exceed the ceiling at concurrency 10.
3. Observe repeated 429s, no retry, and (via the related abort bug) a fallback to static-only.

## Suggested direction

- Add a bounded exponential-backoff retry around the four invoke sites, scoped to genuinely transient errors only: rate-limit (429) and timeout. Do not retry 4xx like the context-overflow 400, because retrying that is pointless; let it fail the batch and be handled by per-batch isolation.
- Make `max_concurrency` configurable by environment variable (something like `SKILLSPECTOR_MAX_CONCURRENCY`) and consider a lower default than 10, since the safe ceiling varies a lot by provider and tier.
- Detect the retryable cases defensively across SDKs (check for a 429 status, a `RateLimit`/`Timeout` error name, or the matching text) rather than importing one provider's exception types.

Related: this compounds with the abort-everything fallback. Retries reduce how often a transient error fires; per-batch isolation limits the damage when one finally does.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stage 2: LLM calls have no retry/backoff on 429 or timeout; default concurrency 10 is unconfigurable #10

Summary

Where it happens

Why this matters

Reproduce

Suggested direction

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Stage 2: LLM calls have no retry/backoff on 429 or timeout; default concurrency 10 is unconfigurable #10

Description

Summary

Where it happens

Why this matters

Reproduce

Suggested direction

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions