Skip to content

Stage 2: apply_filter silently drops findings for files the LLM never analyzed (false negatives) #11

@wernerkasselman-au

Description

@wernerkasselman-au

Summary

apply_filter cannot tell the difference between "the LLM looked at this finding and rejected it" and "the LLM never saw this finding because its batch failed". Both end up dropped. Today that gap is masked by the all-or-nothing fallback (a failed batch aborts the whole pass and everything is kept), but the moment you add any per-batch resilience, this turns into a silent under-report: real findings vanish from the report because of an infrastructure hiccup, not because of a verdict.

I am filing this separately from the abort bug on purpose, because fixing the abort path without fixing this one would quietly make the scanner less safe.

Where it happens

Commit 8c9f5cc, v2.0.0, src/skillspector/nodes/meta_analyzer.py, the keep/drop loop in apply_filter (around :314-:326):

for f in findings:
    exact_key = (f.file, f.rule_id, f.start_line, f.end_line)
    start_only_key = (f.file, f.rule_id, f.start_line, None)
    coarse_key = (f.file, f.rule_id)
    if exact_key in confirmed_granular:
        ...
    elif start_only_key in confirmed_granular:
        ...
    elif coarse_key in confirmed_coarse:
        ...
    else:
        continue   # <- finding dropped

A finding is kept only if it appears in confirmed_*, which is built solely from batches that returned. A finding whose batch never returned (timeout, 429, an oversized-chunk 400) falls through to the else: continue and is silently removed. The function has no notion of "this file was not analysed, so I have no verdict and should keep the finding rather than delete it".

Why this matters

For a security scanner the two failure directions are not equal. Dropping a finding the LLM explicitly rejected is the intended filter behaviour. Dropping a finding the LLM never evaluated is a false negative manufactured by an infra error, and a false negative in a tool whose whole job is "is this skill safe to install" is the dangerous direction to fail.

Right now this is latent because of the coupled abort behaviour (see the related issue): any batch failure aborts the pass and _fallback_filtered keeps everything, so you never reach a state where some batches succeeded and others were skipped. As soon as the abort is fixed (which it should be), partial success becomes the normal case and this drop becomes live.

Suggested direction

Make the keep/drop decision distinguish rejection from non-analysis:

  • Track which files (or batches) actually came back from the LLM. Call that the analysed set.
  • For findings whose file is in the analysed set, keep the current confirm-or-drop logic; that is the real filter doing its job.
  • For findings whose file is not in the analysed set (their batch failed or was skipped), keep them unfiltered via the existing _fallback_filtered path rather than dropping them. No verdict means conservative keep, not delete.
  • Log a count of how many findings were kept unfiltered because their file could not be analysed, so the gap is visible rather than silent.

Net effect: the filter still removes what the LLM rejects, but an infrastructure failure can only ever cost you enrichment on a file, never the finding itself.

Related: this is the third of three coupled resilience issues in the Stage 2 path (the abort-everything fallback, the missing retry/backoff, and this drop). They are individually fixable but want to be considered together, because fixing one in isolation can move the failure mode rather than remove it.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions