Stage 2: one failed batch aborts the entire meta-analyzer pass and silently falls back to static #9

## Summary

I was running the Anthropic provider against a large skill tree and noticed that a single failed LLM call on one file quietly turned off the semantic filter for every file in the scan, not just the file that failed. The scan still exits 0 and prints a normal report, so the degradation is invisible unless you read the WARNING log.

The cause is that the meta-analyzer has no per-batch isolation. One exception anywhere in the batch fan-out aborts the whole Stage 2 pass and falls back to static-only results.

## Where it happens

Two coupled spots (commit `8c9f5cc`, v2.0.0):

1. `src/skillspector/llm_analyzer_base.py:397`

   ```python
   return list(await asyncio.gather(*[_process(b) for b in batches]))
   ```

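   `asyncio.gather` is called without `return_exceptions=True`, so the first `_process` coroutine that raises (a 429, a request timeout, a 400 on an oversized chunk) propagates straight out of `arun_batches`. `gather` itself does not cancel the sibling coroutines, but their results are never collected, and `asyncio.run` cancels whatever is still pending when it tears down the loop.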

2. `src/skillspector/nodes/meta_analyzer.py:394` then `:405`

   ```python
   batch_results = asyncio.run(analyzer.arun_batches(batches, metadata_text=metadata_text))
   ...
   except Exception as e:
       logger.warning("LLM call failed, using fallback: %s", e)
       return {"filtered_findings": _fallback_filtered(findings)}
   ```

   The whole fan-out sits under one `try/except`, so the propagated exception from a single batch lands here and `_fallback_filtered` returns every finding unfiltered.
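
The interaction between the two spots can be reproduced in isolation. The sketch below is not the project's code: `_process`, `arun_batches` and `run_stage_two` are simplified stand-ins with the same shape as the quoted lines, and the simulated 429 on batch 2 is an assumption for illustration.

```python
import asyncio


async def _process(batch: int) -> str:
    # Hypothetical stand-in for one LLM call; batch 2 simulates a 429.
    await asyncio.sleep(0.01 * batch)
    if batch == 2:
        raise RuntimeError("429 on batch 2")
    return f"ok-{batch}"


async def arun_batches(batches: list[int]) -> list[str]:
    # Same shape as the quoted line: gather without return_exceptions=True.
    return list(await asyncio.gather(*[_process(b) for b in batches]))


def run_stage_two(batches: list[int]) -> dict:
    # Same shape as the quoted try/except: one handler around the whole fan-out.
    try:
        results = asyncio.run(arun_batches(batches))
        return {"results": results, "fallback": False}
    except Exception as exc:
        # A single failing batch lands here; the other batches' results are lost.
        return {"results": [], "fallback": True, "error": str(exc)}


outcome = run_stage_two([1, 2, 3, 4])
print(outcome)
```

Batch 1 completes before the failure, yet its result is discarded along with batches 3 and 4, which is the blast-radius problem described below.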

## Why this matters

The blast radius is wrong. If file A's batch times out, I lose the semantic filtering for files B through Z as well, even though their calls would have succeeded. On a 190-file scan I watched one bad batch discard the enrichment for all 190 (0 files actually filtered), and the printed risk score was identical to a `--no-llm` run while the report still claimed the LLM pass had run. A user reasonably believes they are getting the precision pass when they are getting static-only.

To be fair, an all-or-nothing fallback is a defensible first cut, and on a tiny single-file skill it is harmless. On anything large it is not, because the probability that at least one of N batches hits a transient error climbs with N, so the larger and more interesting the skill, the more likely the filter silently switches itself off.
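
To put a rough number on that: if each batch fails independently with probability p, the chance that at least one of N batches fails is 1 - (1 - p)^N. The per-batch failure rate of 1% and the batch counts below are assumptions chosen for illustration, not measurements from the scan, and real failures (rate limits in particular) are unlikely to be fully independent.

```python
def p_any_failure(p_single: float, n_batches: int) -> float:
    """Probability that at least one of n independent batches fails."""
    return 1.0 - (1.0 - p_single) ** n_batches


# Assumed 1% per-batch failure rate; batch counts are hypothetical.
for n in (1, 10, 50, 190):
    print(n, round(p_any_failure(0.01, n), 3))
```

Under those assumptions, a 190-batch run loses the whole semantic pass roughly 85% of the time, while a single-batch run loses it 1% of the time.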

## Reproduce

1. Point at any skill tree with more than ~50 files that have findings.
2. Use a provider/tier where at least one call will 429 or time out (the NVIDIA build tier rate-caps readily; a low Anthropic concurrency limit does too).
3. Run `skillspector scan <dir> --verbose`.
4. Observe a single `using fallback` line, `0 analyzed` in the meta-analyzer debug line, and a final report that is byte-identical to `--no-llm` while exit code stays 0.

## Suggested direction

Isolate each batch so one failure cannot cancel the others, and reserve the static fallback for the batches that actually failed rather than the whole set:

- In `arun_batches`, either pass `return_exceptions=True` to `gather` and drop the failures, or wrap `_process` in its own `try/except` that logs and returns a sentinel; then filter the sentinels out before returning.
- In `meta_analyzer`, stop treating "one batch raised" as "the whole pass failed". Apply the filter to the batches that came back and handle the missing ones separately (see the related issue on `apply_filter` dropping unanalyzed findings).
- Separately, surface the degradation rather than only logging it at WARNING; a static-only fallback that is invisible at the default log level is the part that actually burns people.
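
A minimal sketch of the first and third points, assuming the `return_exceptions=True` route. `BatchOutcome`, `arun_batches_isolated` and `_fake_process` are hypothetical names that do not exist in the codebase, and the real `arun_batches` signature (including `metadata_text`) is not reproduced here.

```python
import asyncio
import logging
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable

logger = logging.getLogger(__name__)


@dataclass
class BatchOutcome:
    # Hypothetical container, not part of the current codebase.
    succeeded: list[tuple[Any, Any]] = field(default_factory=list)  # (batch, result)
    failed: list[tuple[Any, BaseException]] = field(default_factory=list)  # (batch, error)

    @property
    def degraded(self) -> bool:
        # True when at least one batch needs the static fallback.
        return bool(self.failed)


async def arun_batches_isolated(
    batches: list[Any],
    process: Callable[[Any], Awaitable[Any]],
) -> BatchOutcome:
    # return_exceptions=True makes gather hand back exceptions as values,
    # so one failing batch no longer aborts the others.
    results = await asyncio.gather(
        *[process(b) for b in batches], return_exceptions=True
    )
    outcome = BatchOutcome()
    for batch, result in zip(batches, results):
        if isinstance(result, BaseException):
            logger.warning("batch failed, static fallback for this batch only: %r", result)
            outcome.failed.append((batch, result))
        else:
            outcome.succeeded.append((batch, result))
    return outcome


async def _fake_process(batch: str) -> str:
    # Stand-in for the LLM call; batch "b" simulates a timeout.
    if batch == "b":
        raise TimeoutError("simulated timeout")
    return batch.upper()


outcome = asyncio.run(arun_batches_isolated(["a", "b", "c"], _fake_process))
print(len(outcome.succeeded), len(outcome.failed), outcome.degraded)
```

With this shape, `meta_analyzer` could apply the filter to `outcome.succeeded`, pass only the findings behind `outcome.failed` through `_fallback_filtered`, and use `outcome.degraded` to mark the report as partially static instead of relying on a WARNING line.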

Related: the schema-400 report (#4) is one trigger for this same fallback, but the abort-everything behaviour is the deeper bug and would still bite on any transient 429 or timeout even after the schema issue is fixed.
