Cross-repo pass + community detection + paginated cross_project_links (PR 3/5) #378
Shidfar wants to merge 10 commits into
Core framework for 14 protocol linkers:
- servicelink.h: shared types, endpoint registry, pattern matching helpers
- pass_servicelinks: pipeline pass that dispatches to per-protocol linkers
- Endpoint persistence: protocol_endpoints table in each project DB
- MCP tool registration and cross_project_links handler
- Build system, test harness, and CI integration
GraphQL: schema field detection, gql template parsing, field-name extraction, operation name matching across producer/consumer pairs.
gRPC: proto service/rpc definitions, client stub calls, streaming patterns across Go, Python, Java, TypeScript, and Rust.
Cloud messaging linkers for AWS and Apache Kafka:
- Kafka: producer/consumer topic detection across Java, Python, Go, TS
- SQS: queue URL and queue name extraction, send/receive matching
- SNS: topic ARN detection, publish/subscribe patterns
- EventBridge: event bus, rule, and put-events pattern detection
Message broker protocol linkers:
- GCP Pub/Sub: topic/subscription detection, Terraform subscriber configs
- RabbitMQ: exchange/queue binding, AMQP topic wildcard matching
- MQTT: topic publish/subscribe with wildcard (+/#) matching
- NATS: subject publish/subscribe with wildcard (*/>) matching
- Redis Pub/Sub: channel publish/subscribe detection
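To make the MQTT wildcard rule concrete, here is a minimal sketch of filter-versus-topic matching. It is not taken from the PR: the function name mqtt_topic_matches is hypothetical, it assumes the filter is already well-formed, and it ignores the special handling of topics that start with '$'.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not from the PR): true when an MQTT topic name
 * matches a subscription filter.
 *   '+' matches exactly one topic level.
 *   '#' matches all remaining levels and must be the last filter character.
 * Assumes a well-formed filter; '$'-prefixed topics are not special-cased. */
bool mqtt_topic_matches(const char *filter, const char *topic)
{
    assert(filter != NULL && topic != NULL);

    while (*filter != '\0') {
        if (*filter == '#') {
            /* Multi-level wildcard: valid only at the very end. */
            return filter[1] == '\0';
        }
        if (*filter == '+') {
            /* Single-level wildcard: consume one whole topic level. */
            while (*topic != '\0' && *topic != '/')
                topic++;
            filter++;
            continue;
        }
        if (*topic == '\0' && strcmp(filter, "/#") == 0) {
            /* "a/#" also matches the parent level "a". */
            return true;
        }
        if (*filter != *topic)
            return false;
        filter++;
        topic++;
    }
    /* Filter exhausted: match only if the topic is exhausted too. */
    return *topic == '\0';
}
```

With this sketch, the filter sensors/+/temp matches sensors/kitchen/temp, while sensors/+ does not match sensors/a/b. NATS subjects follow the same shape with '.' as the level separator, '*' for one token, and '>' for the tail.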
Real-time and RPC protocol linkers:
- WebSocket: connection URL detection, send/receive message matching
- SSE: EventSource URL detection, event stream endpoint matching
- tRPC: router procedure definitions, client hook call matching
Activates the linker files added by the prior cherry-picks:
- Makefile.cbm: add 14 servicelink_*.c to PIPELINE_SRCS, add 14 TEST_SERVICELINK_*_SRCS test declarations, extend ALL_TEST_SRCS
- pass_servicelinks.c: restore the LINKERS dispatch table to the full 14-entry list and remove the empty-table guard
- pipeline.c: allocate cbm_sl_endpoint_list_t at function top (alongside path_aliases) so cleanup can free it safely even when the early cancel check goto's into cleanup before ctx is declared
- test_main.c: register the 14 suite_servicelink_* test suites
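The pipeline.c change above relies on a common C cleanup idiom: anything the cleanup label frees must be initialised before the first goto that can reach it. The sketch below illustrates that idiom only. The type endpoint_list_t and the function run_pipeline_sketch are hypothetical stand-ins, not the PR's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for cbm_sl_endpoint_list_t; the real layout is
 * not shown in the PR text. */
typedef struct {
    int   *items;
    size_t count;
} endpoint_list_t;

/* Returns 0 on success, 1 when cancelled early, -1 on allocation failure. */
int run_pipeline_sketch(bool cancelled)
{
    int rc = -1;

    /* Allocated at function top: every goto below runs after this pointer
     * holds either NULL or a valid allocation. */
    endpoint_list_t *endpoints = calloc(1, sizeof *endpoints);
    if (endpoints == NULL)
        goto cleanup;

    if (cancelled) {            /* early cancel check */
        rc = 1;
        goto cleanup;
    }

    {
        /* Later-stage state lives in its own block, so the early goto
         * never skips a declaration that cleanup depends on. */
        size_t wanted = 4;
        assert(endpoints->count == 0);  /* calloc zeroed the list */
        endpoints->items = malloc(wanted * sizeof *endpoints->items);
        if (endpoints->items == NULL)
            goto cleanup;
        endpoints->count = wanted;
        rc = 0;
    }

cleanup:
    if (endpoints != NULL) {
        free(endpoints->items); /* free(NULL) is a no-op */
        free(endpoints);
    }
    return rc;
}
```

Had the list been declared after the cancel check, the jump to cleanup would leave the pointer indeterminate and freeing it would be undefined behaviour.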
Cross-project matching:
- Endpoint registry collects all producers/consumers during indexing
- _crosslinks.db stores cross-project links with confidence scores (exact=0.95 for identical strings, normalized=0.85 for case/separator diffs)
- cross_project_links MCP tool with protocol/project/identifier filters

Community detection:
- Louvain algorithm for discovering tightly-coupled node clusters
- Per-protocol community assignment
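The two confidence tiers can be sketched as follows. This is an illustration under assumptions, not the PR's matcher: the names match_confidence and normalize_id are hypothetical, and the choice of '-', '_', and '.' as the separators to ignore is a guess at what "case/separator diffs" covers.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define CONF_EXACT      0.95  /* identical strings (value from the commit message) */
#define CONF_NORMALIZED 0.85  /* equal after case/separator normalisation */
#define ID_MAX          256   /* hypothetical cap on identifier length */

/* Copy src into dst, lowercased, with '-', '_' and '.' removed.
 * The separator set is an assumption made for this sketch. */
static void normalize_id(const char *src, char *dst, size_t cap)
{
    size_t n = 0;
    for (; *src != '\0' && n + 1 < cap; src++) {
        unsigned char c = (unsigned char)*src;
        if (c == '-' || c == '_' || c == '.')
            continue;
        dst[n++] = (char)tolower(c);
    }
    dst[n] = '\0';
}

/* Returns 0.95 for an exact match, 0.85 for a normalised match, 0.0 otherwise. */
double match_confidence(const char *producer, const char *consumer)
{
    char a[ID_MAX], b[ID_MAX];

    assert(producer != NULL && consumer != NULL);
    if (strcmp(producer, consumer) == 0)
        return CONF_EXACT;

    /* Refuse identifiers that would be truncated, to avoid false matches. */
    if (strlen(producer) >= ID_MAX || strlen(consumer) >= ID_MAX)
        return 0.0;

    normalize_id(producer, a, sizeof a);
    normalize_id(consumer, b, sizeof b);
    if (a[0] != '\0' && strcmp(a, b) == 0)
        return CONF_NORMALIZED;
    return 0.0;
}
```

Under these assumptions, Order_Events and order-events link at 0.85, while orders and payments do not link at all.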
Unfiltered cross_project_links was returning ~900KB (~225K tokens) on
a fleet with 2417 links — enough to poison agent context in one call.
Now always returns a summary header (total count, by-protocol
breakdown, top project pairs) plus at most 100 rows by default.
Adds limit, offset, and summary_only parameters.
Before: unfiltered = 898,308 bytes (~224K tokens)
After: unfiltered = 36,589 bytes (~9K tokens), 25× smaller
summary_only = 1,028 bytes (~257 tokens)
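A minimal sketch of how limit, offset, and summary_only could select the rows to return. The default of 100 comes from the message above and the cap of 1000 from the PR summary. The names page_t and page_window are hypothetical, as is the handling of non-positive limits and negative offsets.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LIMIT_DEFAULT 100   /* default page size stated in the commit message */
#define LIMIT_MAX     1000  /* upper bound stated in the PR summary */

/* Hypothetical result type: index of the first row and number of rows. */
typedef struct {
    size_t first;
    size_t count;
} page_t;

/* Compute which rows to emit after the summary header.
 * Assumptions of this sketch: limit <= 0 means "use the default",
 * a negative offset is treated as 0, and summary_only emits no rows. */
page_t page_window(size_t total, long limit, long offset, bool summary_only)
{
    page_t p = {0, 0};

    if (summary_only)
        return p;
    if (limit <= 0)
        limit = LIMIT_DEFAULT;
    if (limit > LIMIT_MAX)
        limit = LIMIT_MAX;
    if (offset < 0)
        offset = 0;
    if ((size_t)offset >= total) {
        p.first = total;        /* past the end: empty page */
        return p;
    }

    p.first = (size_t)offset;
    size_t remaining = total - p.first;
    p.count = remaining < (size_t)limit ? remaining : (size_t)limit;

    assert(p.first + p.count <= total);
    return p;
}
```

For the 2417-link fleet mentioned above, an unfiltered call with no parameters would return rows 0 to 99, and a call with offset 2400 would return the final 17 rows.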
Activates the files added by the prior cherry-picks:
- Makefile.cbm: add pass_communities.c and pass_crossrepolinks.c to PIPELINE_SRCS; add TEST_COMMUNITIES_SRCS, TEST_ENDPOINT_PERSISTENCE_SRCS, and TEST_CROSS_PROJECT_LINKS_SRCS to ALL_TEST_SRCS
- pipeline_internal.h: declare cbm_pipeline_pass_communities
- pipeline.c: call cbm_pipeline_pass_communities after the service-link pass; call cbm_persist_endpoints to persist collected endpoints; call cbm_cross_project_link to compute cross-project links after dump
- test_main.c: register suite_communities, suite_endpoint_persistence, and suite_cross_project_links
- tests/test_endpoint_persistence.c: restored (exercises cbm_persist_endpoints which lands in this PR)
Removes stale-fact drift from the fork era (language/agent counts, install one-liner, feature bullets) flagged in PR DeusData#295's close comment. No URL substitutions involved: README's links already pointed at DeusData; this only reverts the content body.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Heads-up: this project now validates every PR automatically (tests, lint, security/license gates, and DCO sign-off; see CONTRIBUTING.md). Your branch predates this, so CI will flag the missing sign-off.
One more note: this PR currently has merge conflicts. To resolve them:

    git fetch origin
    git rebase --signoff origin/main   # resolve conflicts as prompted
    git push --force-with-lease

Once pushed, the full validation pipeline runs automatically.
Summary
Cross-repo protocol matching, Louvain community detection on the resulting graph, and the paginated cross_project_links MCP tool that surfaces matches.

Stacked on #377; please review #376 then #377 first.
Commits
- feat: add cross-repo protocol linking and community detection. Creates pass_crossrepolinks.c (the producer↔consumer matcher, exact 0.95 + normalized 0.85) and pass_communities.c (Louvain clustering on service-link edges). Stores results in a separate _crosslinks.db (this storage choice is unified in PR 5).
- feat: add paginated summary guard to cross_project_links. New params: limit (default 100, max 1000), offset, summary_only. Always emits a summary header (total, by-protocol breakdown, top-10 project pairs). Unfiltered output dropped from ~225K tokens to ~9K tokens on a 19-project cache in the original measurement from #295 (Cross-project HTTP edges + unified storage + paginated cross_project_links).
- build: wire cross-repo pass + community detection into pipeline. Calls cbm_pipeline_pass_communities, cbm_persist_endpoints, and cbm_cross_project_link from cbm_pipeline_run; restores tests/test_endpoint_persistence.c (its dependency cbm_persist_endpoints lives in this PR); wires Makefile + test_main.c.

Test plan
- ./scripts/test.sh passes (3810/3810, ASan + UBSan)
- cross_link_* battery exercises exact/normalized/same-project/no-match/multi-protocol/missing-table/http-skipped/unresolved-qn/idempotent-rerun

Upstream overlap audit (re-checked against upstream/main @ 6226972)

Since this PR was opened the audit has been re-run on current upstream. Findings:

- src/pipeline/pass_cross_repo.c:107-684: full cross-repo matcher; writes CROSS_* edges to per-project edges tables
- src/mcp/mcp.c:append_cross_repo_summary (line 1777): cross-repo results surfaced in get_architecture
- src/mcp/mcp.c index_repository(mode=cross-repo-intelligence) (line 2399): exposes the matcher via MCP
- pass_communities.c (Louvain community detection): no upstream equivalent
- cross_project_links MCP tool with pagination, summary_only, and per-protocol filters: upstream surfaces cross-repo only via the narrative summary inside get_architecture

As a parallel matcher, pass_crossrepolinks.c duplicates upstream. Keep pass_communities.c and the dedicated MCP reader. The MCP reader should query upstream's per-project CROSS_* edges directly, which is exactly the storage unification that PR #380 (Storage unification + incremental parity + MCP reader migration (PR 5/5)) establishes.

This PR remains a draft until it is reviewed against this audit. PR #380 establishes the architectural reconciliation (cedes 4 protocols to upstream); the consolidated shape of this PR depends on how that lands.