Storage unification + incremental parity + MCP reader migration (PR 5/5) #380
Shidfar wants to merge 21 commits into
Core framework for 14 protocol linkers:
- servicelink.h: shared types, endpoint registry, pattern matching helpers
- pass_servicelinks: pipeline pass that dispatches to per-protocol linkers
- Endpoint persistence: protocol_endpoints table in each project DB
- MCP tool registration and cross_project_links handler
- Build system, test harness, and CI integration
GraphQL: schema field detection, gql template parsing, field-name extraction, operation name matching across producer/consumer pairs.
gRPC: proto service/rpc definitions, client stub calls, streaming patterns across Go, Python, Java, TypeScript, and Rust.
Cloud messaging linkers for AWS and Apache Kafka:
- Kafka: producer/consumer topic detection across Java, Python, Go, TS
- SQS: queue URL and queue name extraction, send/receive matching
- SNS: topic ARN detection, publish/subscribe patterns
- EventBridge: event bus, rule, and put-events pattern detection
Message broker protocol linkers:
- GCP Pub/Sub: topic/subscription detection, Terraform subscriber configs
- RabbitMQ: exchange/queue binding, AMQP topic wildcard matching
- MQTT: topic publish/subscribe with wildcard (+/#) matching
- NATS: subject publish/subscribe with wildcard (*/>) matching
- Redis Pub/Sub: channel publish/subscribe detection
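The MQTT wildcard rules mentioned above can be illustrated with a minimal sketch. It assumes standard MQTT filter semantics (`+` matches exactly one topic level, `#` matches the remainder and must be the last character) and well-formed filters. The function name `mqtt_topic_matches` is hypothetical and is not taken from the linker sources.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not from the PR): returns true when `topic`
 * matches the MQTT subscription `filter`. Assumes the filter is
 * well-formed, i.e. '+' and '#' each occupy a whole level. */
bool mqtt_topic_matches(const char *filter, const char *topic) {
    while (*filter) {
        if (*filter == '#') {
            /* '#' matches everything that remains, but only as the last char. */
            return filter[1] == '\0';
        }
        if (*filter == '+') {
            /* '+' consumes exactly one level: skip to the next '/' or the end. */
            while (*topic && *topic != '/') {
                topic++;
            }
            filter++;
            continue;
        }
        if (*topic == '\0') {
            /* Topic exhausted: only a trailing "/#" may still match
             * (e.g. "sensors/#" matches "sensors"). */
            return strcmp(filter, "/#") == 0;
        }
        if (*filter != *topic) {
            return false;
        }
        filter++;
        topic++;
    }
    /* Filter exhausted: match only if the topic is exhausted too. */
    return *topic == '\0';
}
```

Under these assumptions, `sensors/+/temp` matches `sensors/kitchen/temp` but not `sensors/kitchen/sub/temp`. NATS subjects follow the same idea with `.` as the separator and `*` / `>` as the wildcards.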
Real-time and RPC protocol linkers:
- WebSocket: connection URL detection, send/receive message matching
- SSE: EventSource URL detection, event stream endpoint matching
- tRPC: router procedure definitions, client hook call matching
Activates the linker files added by the prior cherry-picks:
- Makefile.cbm: add 14 servicelink_*.c to PIPELINE_SRCS, add 14 TEST_SERVICELINK_*_SRCS test declarations, extend ALL_TEST_SRCS
- pass_servicelinks.c: restore the LINKERS dispatch table to the full 14-entry list and remove the empty-table guard
- pipeline.c: allocate cbm_sl_endpoint_list_t at function top (alongside path_aliases) so cleanup can free it safely even when the early cancel check goto's into cleanup before ctx is declared
- test_main.c: register the 14 suite_servicelink_* test suites
Cross-project matching:
- Endpoint registry collects all producers/consumers during indexing
- _crosslinks.db stores cross-project links with confidence scores (exact=0.95 for identical strings, normalized=0.85 for case/separator diffs)
- cross_project_links MCP tool with protocol/project/identifier filters
Community detection:
- Louvain algorithm for discovering tightly-coupled node clusters
- Per-protocol community assignment
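The two confidence tiers can be sketched roughly as follows. This is an illustration only: the function names, the 256-byte scratch buffers, and the exact set of characters treated as separators (`-`, `_`, `.`) are assumptions, not the matcher's actual rules.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical normalizer: lowercases and drops '-', '_' and '.'.
 * The real separator set is not stated in the commit message. */
static void normalize_identifier(const char *in, char *out, size_t cap) {
    size_t n = 0;
    if (cap == 0) {
        return;
    }
    for (; *in && n + 1 < cap; in++) {
        unsigned char c = (unsigned char)*in;
        if (c == '-' || c == '_' || c == '.') {
            continue;
        }
        out[n++] = (char)tolower(c);
    }
    out[n] = '\0';
}

/* Returns 0.95 for identical strings, 0.85 when the strings differ only
 * by case or separators, and 0.0 when they do not match at all.
 * Identifiers longer than 255 bytes are truncated before comparison. */
double match_confidence(const char *producer_id, const char *consumer_id) {
    char a[256];
    char b[256];
    if (strcmp(producer_id, consumer_id) == 0) {
        return 0.95;
    }
    normalize_identifier(producer_id, a, sizeof a);
    normalize_identifier(consumer_id, b, sizeof b);
    if (a[0] != '\0' && strcmp(a, b) == 0) {
        return 0.85;
    }
    return 0.0;
}
```

With these assumptions, `order-events` against `order-events` scores 0.95, while `Order_Events` against `order-events` scores 0.85.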
Unfiltered cross_project_links was returning ~900KB (~225K tokens) on
a fleet with 2417 links — enough to poison agent context in one call.
Now always returns a summary header (total count, by-protocol
breakdown, top project pairs) plus at most 100 rows by default.
Adds limit, offset, and summary_only parameters.
Before: unfiltered = 898,308 bytes (~224K tokens)
After: unfiltered = 36,589 bytes (~9K tokens), 25× smaller
summary_only = 1,028 bytes (~257 tokens)
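The limit / offset / summary_only behaviour described above amounts to computing a row window over the full result set. A minimal sketch, assuming a non-positive limit falls back to the default of 100 and a negative offset is clamped to 0 (both assumptions; `page_window_t` and `page_window` are hypothetical names):

```c
#include <stddef.h>

/* Hypothetical result type: which rows to emit after the summary header. */
typedef struct {
    size_t start; /* index of the first row to emit */
    size_t count; /* number of rows to emit */
} page_window_t;

/* Computes the row window for a result set of `total` rows.
 * summary_only suppresses all rows so only the header is returned. */
page_window_t page_window(size_t total, long limit, long offset, int summary_only) {
    page_window_t w = {0, 0};
    if (summary_only) {
        return w;
    }
    if (limit <= 0) {
        limit = 100; /* default row cap from the commit message */
    }
    if (offset < 0) {
        offset = 0;
    }
    if ((size_t)offset >= total) {
        w.start = total; /* past the end: empty page */
        return w;
    }
    w.start = (size_t)offset;
    size_t remaining = total - w.start;
    w.count = remaining < (size_t)limit ? remaining : (size_t)limit;
    return w;
}
```

For the 2417-link fleet mentioned above, an unfiltered call would yield rows 0 to 99, and a call with offset 2400 would yield the last 17 rows.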
Activates the files added by the prior cherry-picks:
- Makefile.cbm: add pass_communities.c and pass_crossrepolinks.c to PIPELINE_SRCS; add TEST_COMMUNITIES_SRCS, TEST_ENDPOINT_PERSISTENCE_SRCS, and TEST_CROSS_PROJECT_LINKS_SRCS to ALL_TEST_SRCS
- pipeline_internal.h: declare cbm_pipeline_pass_communities
- pipeline.c: call cbm_pipeline_pass_communities after the service-link pass; call cbm_persist_endpoints to persist collected endpoints; call cbm_cross_project_link to compute cross-project links after dump
- test_main.c: register suite_communities, suite_endpoint_persistence, and suite_cross_project_links
- tests/test_endpoint_persistence.c: restored (exercises cbm_persist_endpoints which lands in this PR)
The candidate buffer introduced for HTTP ambiguity handling was truncating non-HTTP matches above 64 per producer. Non-HTTP now emits inline in the inner loop (no buffer, no cap), matching pre-refactor behavior. HTTP still buffers for ambiguity and now logs http.candidate_truncated when it drops candidates past the cap.
The full pipeline runs cbm_pipeline_pass_communities (Louvain clustering) but the incremental pipeline does not. Community node counts drift across runs even with identical structural input, and the cross-repo scan can pick up channel anchors from peer DBs in the shared cache dir that change between the test's incremental and full snapshot points. Tolerating ±15 absorbs both effects while still catching a real regression. Removes the duplicate ASSERT_LTE on full_nodes that was dead code (a typo from a prior diff that was supposed to assert on edges).
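The tolerance check described above boils down to an absolute-difference comparison. A minimal sketch, with a hypothetical helper name (the real test presumably uses the harness's own ASSERT macros):

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: true when the incremental and full node counts
 * differ by at most `tolerance` in either direction. */
bool counts_within_tolerance(long incremental, long full, long tolerance) {
    return labs(incremental - full) <= tolerance;
}
```

With a tolerance of 15, counts of 1010 vs 1000 pass while 1016 vs 1000 fail, so small community or anchor drift is absorbed but a larger divergence still trips the check.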
Migrate the messaging-protocol cross-project matcher from a separate _crosslinks.db file to bidirectional CROSS_* edges in each project's edges table. Add 11 new CROSS_* edge type constants for messaging protocols (KAFKA, SQS, SNS, EVENTBRIDGE, PUBSUB, AMQP, MQTT, NATS, REDIS_PUBSUB, WS, SSE). Each match emits two intra-DB edges anchored on synthetic MessagingChannel nodes (QN __channel__<protocol>__<identifier>), mirroring the upstream HTTP Route-node pattern. Producer DB gets function -> channel; consumer DB gets channel -> function. Cross-project metadata lives in edge properties JSON. The matcher now skips http/grpc/graphql/trpc protocols entirely; those are owned by the upstream Route-QN matcher in pass_cross_repo.c.
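The channel-anchor naming scheme can be sketched as a small formatting helper. The function name `channel_qn` and its error convention are hypothetical, and whether the protocol segment is lowercased in the real QN is an assumption here.

```c
#include <stdio.h>

/* Hypothetical helper: writes "__channel__<protocol>__<identifier>" into
 * `out`. Returns 0 on success, -1 if the buffer is too small. */
int channel_qn(char *out, size_t cap, const char *protocol, const char *identifier) {
    int n = snprintf(out, cap, "__channel__%s__%s", protocol, identifier);
    return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}
```

Because both sides of a match derive the same QN from the protocol and identifier, the producer DB's function -> channel edge and the consumer DB's channel -> function edge refer to the same logical channel without either DB storing the other's nodes.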
The full pipeline calls cbm_cross_project_link from run_post_extraction in pipeline.c, but the incremental pipeline never did. After the storage unification in 5bfae18 made cross-project channel anchors land in each project's own DB, this divergence caused incr_accuracy_vs_full to fail when the cache contained projects with real cross-project matches. Mirrors the full-path invocation pattern. Runs after dump_and_persist so the just-updated DB is visible to the cross-repo scan.
Storage unification moved the writer side from a shared _crosslinks.db into each project's own edges table (CROSS_* edge types), but the MCP reader still queried the legacy table and silently returned "no links found" for any caller. The reader now fans out across the cache directory:
- Enumerates *.db files via cbm_opendir / is_project_db_file (the same convention list_projects uses), skipping the legacy _crosslinks.db and other _*.db hidden DBs.
- For each project DB, selects producer-side CROSS_* edges: JOIN nodes on source_id, filter on type LIKE 'CROSS_%' AND properties LIKE '%"target_project"%'. The target_project filter naturally excludes consumer-side edges (which carry source_project instead), so each link surfaces exactly once.
- Parses properties JSON to fill in (consumer_project, consumer_qn, consumer_file, identifier, protocol, confidence). Falls back to url_path when identifier is absent; that is upstream's HTTP/async schema, where url_path plays the same role.
- Filters / sorts / paginates in memory: protocol asc, identifier asc, confidence desc.
- Aggregates "by protocol" via a contiguous-runs walk on the sorted list, and "top project pairs" via a small dynamic table with a partial selection sort for top-10.
xl_bind_filters is gone; filtering moved to xl_row_matches.
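The in-memory sort order and the contiguous-runs aggregation can be sketched as below. The struct and function names are hypothetical and the row type is reduced to the three sort keys; the actual reader carries more fields.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, reduced row type: only the fields used for ordering. */
typedef struct {
    const char *protocol;
    const char *identifier;
    double confidence;
} link_row_t;

/* Orders rows by protocol asc, identifier asc, confidence desc. */
static int link_row_cmp(const void *pa, const void *pb) {
    const link_row_t *a = pa;
    const link_row_t *b = pb;
    int c = strcmp(a->protocol, b->protocol);
    if (c != 0) {
        return c;
    }
    c = strcmp(a->identifier, b->identifier);
    if (c != 0) {
        return c;
    }
    if (a->confidence > b->confidence) {
        return -1;
    }
    if (a->confidence < b->confidence) {
        return 1;
    }
    return 0;
}

void sort_link_rows(link_row_t *rows, size_t n) {
    qsort(rows, n, sizeof *rows, link_row_cmp);
}

/* Walks the sorted rows and records the length of each run of equal
 * protocols. Writes at most `cap` counts; returns the number of runs. */
size_t count_by_protocol(const link_row_t *rows, size_t n, size_t *counts, size_t cap) {
    size_t runs = 0;
    for (size_t i = 0; i < n;) {
        size_t j = i + 1;
        while (j < n && strcmp(rows[j].protocol, rows[i].protocol) == 0) {
            j++;
        }
        if (runs < cap) {
            counts[runs] = j - i;
        }
        runs++;
        i = j;
    }
    return runs;
}
```

Because the rows are already sorted by protocol, a single linear pass is enough for the per-protocol breakdown and no hash table is needed.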
Two new TESTs in the existing cross_project_links suite:
- mcp_reader_returns_cross_links: indexes a kafka producer + consumer
pair, runs cbm_cross_project_link, then drives the MCP tool via
cbm_mcp_handle_tool and asserts the response surfaces the protocol,
identifier, both project names, and both function QNs.
- mcp_reader_filters_by_protocol: indexes overlapping kafka + pubsub
endpoints, calls the tool with {\"protocol\":\"kafka\"}, and asserts
the pubsub identifier and protocol are absent from the response.
CBM_CACHE_DIR is overridden to the test's tmpdir so the reader sees
exactly the fixture DBs.
Removes stale-fact drift from the fork era (language/agent counts, install one-liner, feature bullets) flagged in PR DeusData#295's close comment. No URL substitutions involved: README's links already pointed at DeusData; this only reverts the content body.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Heads-up: this project now validates every PR automatically — tests, lint, security/license gates, and DCO sign-off (CONTRIBUTING.md). Your branch predates this, so CI will flag the missing |
One more note: this PR currently has merge conflicts with
git fetch origin
git rebase --signoff origin/main # resolve conflicts as prompted
git push --force-with-lease
Once pushed, the full validation pipeline runs automatically. Recent |
Summary
Three coupled changes that close out the protocol-linking work:
- Storage unification: cross-project links move out of _crosslinks.db (introduced in Cross-repo pass + community detection + paginated cross_project_links (PR 3/5) #378) into each project's own edges table via synthetic MessagingChannel anchor nodes, mirroring the pre-existing HTTP Route-anchor pattern. Anchors are reactive (created only when emit_cross_edge_pair confirms a producer→consumer match), never speculative.
- Incremental parity: cbm_cross_project_link is now invoked from the incremental finalize path, mirroring run_post_extraction in the full path. Post-storage-unification this is required because channel anchors live in each project's own DB; without it, incr_accuracy_vs_full started failing when the cache had real cross-project matches.
- MCP reader migration: the reader still queried the legacy _crosslinks.db.cross_links table, silently returning "no links found" for every caller. This PR rewrites the reader to fan out across per-project DBs.

Stacked on #379. Please review the earlier PRs first.
Commits
- refactor: unify cross-repo storage on edges table (writer side)
- fix: invoke cbm_cross_project_link from incremental pipeline (full/incremental parity)
- feat(mcp): migrate cross_project_links reader to per-project edges (see below)
- test(mcp): cover cross_project_links reader end-to-end (2 new tests in the existing cross_project_links suite)

MCP reader migration detail
handle_cross_project_links now:
- Enumerates *.db files via cbm_opendir / is_project_db_file (same convention list_projects uses), skipping the legacy _crosslinks.db and other _*.db hidden DBs.
- Selects producer-side CROSS_* edges from each project DB: JOIN nodes ON source_id, properties LIKE '%"target_project"%'. The target_project predicate naturally excludes consumer-side edges (which carry source_project instead), so each link surfaces exactly once.
- Parses the properties JSON into {protocol, identifier, producer_project, producer_qn, producer_file, consumer_project, consumer_qn, consumer_file, confidence}. Falls back to url_path when identifier is absent; that is upstream's HTTP/async schema, where url_path plays the same role.

The old xl_bind_filters SQL-bind helper is gone; filtering moved to xl_row_matches in the in-memory path.

The two new tests (mcp_reader_returns_cross_links, mcp_reader_filters_by_protocol) drive the reader end-to-end through cbm_mcp_handle_tool with CBM_CACHE_DIR overridden to the test's tmpdir, so the regression class can't recur silently.

Test plan
- ./scripts/test.sh passes (3825/3825, ASan + UBSan)
- No MessagingChannel nodes created speculatively (confirmed via cross_link_no_match and the find_or_create_channel call-site audit from Cross-project HTTP edges + unified storage + paginated cross_project_links #295)