
Perl call graph polluted by false-positive CALLS edges (builtins, framework method calls, config strings) #476

@halindrome


Summary

On real-world Perl codebases, the call graph is dominated by false-positive CALLS edges. Perl files are extracted (call sites emitted), and any call the textual resolver can't place falls back to a generic short-name matcher that has no language or call-kind awareness. It happily wires Perl builtins, framework method calls, and even mis-parsed config strings to unrelated project subs that merely share a name.

Evidence

Measured on a large real-world Perl monorepo (~1,200 .pm/.pl files plus 352 .cgi endpoint files). For .cgi callers, the generic resolver produced 4,940 suffix_match edges, of which the overwhelming majority are noise. The three classes below alone account for roughly 4,014 of them (about 81%):

| Noise class | .cgi edges | Example callees |
|---|---|---|
| CPAN/framework method calls ($obj->m()) matched to unrelated project subs | ~3,064 | log (Log::Log4perl), param/header (CGI.pm), connect/commit/rollback/execute (DBI), encode (JSON) |
| Perl builtins matched to project subs | ~645 | shift, push, keys, sprintf |
| Config strings mis-extracted as call targets | ~305 | log4perl.appender.File.utf8 |

These targets are not project calls at all — the classes belong to CPAN/framework modules not in the graph, the builtins are language primitives, and the config string is a literal. The generic resolver fabricates edges purely on short-name collision.

Root causes

  1. Callee over-extraction (Perl): extract_scripting_callee (internal/cbm/extract_calls.c, CBM_LANG_PERL branch) returns the raw first-child text of a call node, so non-identifier tokens (dotted config strings, literals) become callees.
  2. No builtin guard: the generic resolver (src/pipeline/registry.c) matches builtin-named calls (shift/push/…) to project subs of the same name.
  3. No function-vs-method distinction: CBMCall carries no call-kind, so a method call $obj->commit() with an unknown receiver is treated as a bare function and short-name-matched to a project commit sub.
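To make the failure mode concrete, here is a minimal sketch of a short-name fallback with no language or call-kind awareness. It is not the code in src/pipeline/registry.c; the symbol table, `short_name`, and `naive_suffix_match` are hypothetical names invented for illustration, and the real resolver may match differently.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical project symbol table; the names are illustrative only. */
static const char *const project_subs[] = {
    "MyApp::Repo::commit",
    "MyApp::Util::push",
    "MyApp::Logger::log",
};

/* Return the part of a qualified Perl name after the last "::". */
static const char *short_name(const char *qualified) {
    const char *last = qualified;
    const char *p = qualified;
    while ((p = strstr(p, "::")) != NULL) {
        p += 2;      /* skip the separator */
        last = p;    /* remember where the final segment starts */
    }
    return last;
}

/* Naive generic fallback: return the first project sub whose short name
 * equals the callee text. Neither the language nor the call kind
 * (function vs. method) is consulted, so any name collision matches. */
static const char *naive_suffix_match(const char *callee) {
    size_t n = sizeof project_subs / sizeof project_subs[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(short_name(project_subs[i]), callee) == 0) {
            return project_subs[i];
        }
    }
    return NULL;
}
```

With this table, `naive_suffix_match("commit")` returns `MyApp::Repo::commit` even when the call site was `$dbh->commit()` on a DBI handle, and `naive_suffix_match("push")` returns `MyApp::Util::push` for the builtin `push @x, 1`.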

Proposed fix (language-gated; non-Perl behavior unchanged)

  1. Extraction hygiene: in the Perl callee path, extract the real method/function name token and reject non-identifier callees (containing ., quotes, whitespace, etc.) so config strings/literals never become call targets.
  2. Builtin guard: add a Perl builtin set; when an unresolved Perl call's name is a builtin, suppress the generic edge (real same-file subs are already resolved by earlier stages before the generic fallback).
  3. Method-vs-function: add an is_method flag to CBMCall, set it during Perl extraction for arrow/method calls, thread it into the resolver, and suppress generic short-name matching for Perl method calls with an unknown receiver (precise method resolution is the LSP's job; bare short-name matching is almost always wrong).
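The three steps above can be sketched as one language-gated predicate. This is a simplified illustration under stated assumptions, not the proposed patch: `lang_t`, `call_t`, and every function name are hypothetical stand-ins (the real CBMCall and CBM_LANG_PERL definitions are not shown in this issue), the builtin list is a small partial sample, and the sketch assumes the receiver is already known to be unresolved by the time the generic fallback runs. In the actual proposal the identifier check lives in extraction rather than in the resolver; it is folded in here for brevity.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the real types. */
typedef enum { LANG_OTHER = 0, LANG_PERL } lang_t;

typedef struct {
    const char *callee; /* extracted callee text */
    bool is_method;     /* proposed flag: true for $obj->m() calls */
} call_t;

/* Step 1: accept only identifier-shaped callees, optionally ::-qualified.
 * Dots, quotes, whitespace and other punctuation are rejected. */
static bool perl_callee_is_identifier(const char *s) {
    if (s == NULL || *s == '\0') return false;
    bool segment_start = true;
    for (const char *p = s; *p != '\0'; p++) {
        unsigned char c = (unsigned char)*p;
        if (segment_start) {
            if (!(isalpha(c) || c == '_')) return false;
            segment_start = false;
        } else if (c == ':' && p[1] == ':') {
            p++;                  /* consume the second ':' */
            segment_start = true; /* a new segment must follow */
        } else if (!(isalnum(c) || c == '_')) {
            return false;
        }
    }
    return !segment_start; /* reject a trailing "::" */
}

/* Step 2: partial, illustrative builtin set (not the full Perl list). */
static bool perl_is_builtin(const char *name) {
    static const char *const builtins[] = {
        "shift", "push", "keys", "sprintf",
        "pop", "unshift", "values", "print",
    };
    size_t n = sizeof builtins / sizeof builtins[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(builtins[i], name) == 0) return true;
    }
    return false;
}

/* Step 3 plus gating: decide whether the generic short-name fallback
 * may run for this unresolved call. Non-Perl calls are never affected. */
static bool allow_generic_match(lang_t lang, const call_t *call) {
    if (lang != LANG_PERL) return true;
    if (!perl_callee_is_identifier(call->callee)) return false;
    if (call->is_method) return false;
    if (perl_is_builtin(call->callee)) return false;
    return true;
}
```

The early return for non-Perl languages is what keeps the shared resolver path unchanged for every other language. A bare call to a project sub such as `load_config` (a made-up name) still reaches the fallback, while `push`, a method call to `commit`, and `log4perl.appender.File.utf8` do not.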

Every change is gated on CBM_LANG_PERL — the generic resolver and CBMCall are shared by all languages, so the other nine families remain byte-identical.

Acceptance criteria

  • A Perl builtin call (e.g. push @x, 1) does not produce a CALLS edge to a project sub named push.
  • A Perl method call with an unknown receiver (e.g. $dbh->commit()) does not generic-match to a project sub named commit.
  • A dotted config-string token (e.g. log4perl.appender.File.utf8) is not extracted as a call target.
  • Genuine intra-project Perl function calls still resolve.
  • Non-Perl languages' extraction/resolution are unchanged (verified by the existing suite + the cross-language breadth check).
  • Tests cover all of the above.

Validation (on the proposed branch)

Re-indexing the same repo with the fix:

  • .cgi suffix_match edges drop 4,940 → 655 (−87%).
  • Builtin, CPAN-method, and config-string noise on .cgi goes to zero.
  • Project-wide CALLS edges drop by ~13,400 (noise removal: fewer, more-correct edges).
  • scripts/build.sh and scripts/test.sh stay green.
  • The cross-language breadth check confirms all other languages still resolve.
