
feat(737): global / local ctx_search scope — project: current | global | <abs-path>#758

Draft
jetnet wants to merge 7 commits into mksglu:next from jetnet:feat/737-global-search
jetnet commented Jun 1, 2026

Summary

Implements the maintainer-approved design for #737 — a real project: scope on ctx_search delivering both project-local and true cross-project (global) recall, with zero migration.

ctx_search(queries: [...])                       # current project (default)
ctx_search(queries: [...], project: "global")    # fan-out across ALL projects
ctx_search(queries: [...], project: "/abs/path") # a specific project

What was broken (all confirmed in the issue thread)

  1. project: "global" reached only 1 of 3 sources (ContentStore); session events + auto-memory ignored the scope.
  2. Default sort: "relevance" skipped session memory entirely (only timeline searched it).
  3. current/default filtered by the pinned dir in shared mode, not the real cwd → silent 0 results.
  4. Per-project mode physically opens only the current DB — global was meaningless without fan-out across DB files.

What this PR does

  • Decouples the row-level filter from DB-file selection; adds getCurrentWorkingProject() (real cwd) so current is correct in shared mode.
  • Relevance now searches all three sources (content + session events + auto-memory); sort controls ordering only.
  • Global fan-out (src/search/global-fanout.ts): unions every sessions/*.db + content/*.db, merges with RRF (k=60), read-only opens (no WAL pragmas / schema init / migration / corruption repair), bounded by CONTEXT_MODE_GLOBAL_FANOUT_MAX (default 1024).
  • <abs-path> scope: shared mode → row filter; per-project mode → open that project's hashed DBs read-only (incl. worktree-suffixed session DBs).
  • Attribution: each result header names its origin — proj:<name> + sess:<id>; orphan content in global is labelled proj:(unattributed).
  • Coverage notice: if the fan-out cap drops DBs, a visible "Global search INCOMPLETE" line reports searched X of Y and the exact value to raise — never silently partial.
  • Result limits: breadth scopes return 10–12 hits/query (so specific answers aren't buried behind common-term noise); single-DB stays 1–2. Hard 40KB output cap enforced incrementally.
  • Snippet detail: forward-biased window so explanation after a heading match is captured.
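
To illustrate the forward-biased snippet window from the last bullet, here is a minimal sketch. The 200-back / 520-forward split is taken from the commit message in this PR; the name `snippetWindow`, the ellipsis markers, and the out-of-range handling are assumptions made for illustration and are not the actual extractSnippet code.

```typescript
// Hypothetical helper: cut a window around a match, biased forward so the
// explanation that follows a matched heading is kept in the snippet.
function snippetWindow(
  text: string,
  matchIndex: number,
  back = 200,
  forward = 520,
): string {
  // Assumption: an out-of-range index falls back to the start of the text.
  const anchor = matchIndex >= 0 && matchIndex < text.length ? matchIndex : 0;
  const start = Math.max(0, anchor - back);
  const end = Math.min(text.length, anchor + forward);
  // Mark truncation on either side so the reader can tell the text was cut.
  const prefix = start > 0 ? "…" : "";
  const suffix = end < text.length ? "…" : "";
  return prefix + text.slice(start, end) + suffix;
}
```

With a symmetric window the same budget would spend half its characters on text before the heading, which is usually the tail of the previous section.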

Backwards compatibility

Zero migration. Per-project users are unaffected unless they pass project: "global"; shared-DB users keep working. The one intentional default-behaviour change is the relevance-mode source-set fix (item 2 under "What was broken").

Tests

New tests/core/global-fanout.test.ts and tests/core/ctx-search-plan.test.ts; extended search-project-filter, search, and auto-memory-adapter suites. Targeted suites green (250+), tsc --noEmit clean, npm run build (assert-bundle + asymmetric-drift) green. Validated end-to-end against a real ~170-project install (read-only confirmed: 0 DB files mutated).

Notes for reviewers

  • Bundles are intentionally not included (gitignored; regenerated at release via prepublishOnly).
  • New env var CONTEXT_MODE_GLOBAL_FANOUT_MAX is documented in README ("Search environment variables").

Closes #737

jetnet added 4 commits June 1, 2026 19:30
…emory

Decouple the row-level project filter from DB-file selection so ctx_search
gains a real cross-project ("no filter") mode, and fix relevance mode to
search all three sources.

- db.ts: searchEvents accepts projectDir: string | null (null = no filter,
  new prepared statement); add canonicalContentDbPath/canonicalSessionDbPath
  helpers (hash path, no legacy rename) for read-only search paths.
- auto-memory.ts: searchAutoMemory accepts null -> union across all per-project
  memory hash dirs; adapter base-dir detection handles hashed vs non-hashed.
- unified.ts: thread projectScope to session events + auto-memory and run them
  in BOTH relevance and timeline (Bug 2); round-robin interleave so session/
  auto-memory survive truncation; carry project/sessionId attribution fields.
- ctx-search-schema.ts: expose `project` in per-project mode too; resolver
  resolves current to real cwd; steer the model to stay on current unless
  the user explicitly asks for global.
- store.ts: export escapeLikeSource so fan-out reuses identical LIKE escaping.

Refs mksglu#737
…ion, bounded)

New src/search/global-fanout.ts implements true cross-project recall:

- listProjectDbs: enumerate every sessions/*.db + content/*.db, pair by project
  hash, cap the TOTAL file count and report coverage {totalAvailable, opened,
  cap, truncated} so callers can warn on incomplete fan-out.
- Read-only readers (readonlySearchContent FTS5 MATCH + readonlySearchEvents
  LIKE): { readonly: true } opens, no WAL pragmas, no schema init, no
  corruption repair/migration; every handle closed in finally.
- RRF merge across DBs (k=60), dedupe key excludes per-DB source so identical
  cross-DB content fuses; relevance = RRF, timeline = chronological.
- searchAbsPathProject: per-project abs-path scope opens that project's hashed
  DBs read-only (incl. worktree-suffixed session DBs), interleaved.
- Attribution: session hits carry project_dir + session_id from the row;
  content hits resolve project_dir via the sibling session DB (hash map).
- Default fan-out cap raised 64 -> 1024 (the old default silently dropped most
  projects on real 150+ DB installs); strict env parse for
  CONTEXT_MODE_GLOBAL_FANOUT_MAX.
- source/contentType filters applied consistently across all sources.
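
As a rough sketch of the RRF merge described above: each per-DB ranked list contributes 1 / (k + rank) to an item's fused score, with k = 60. The `RankedHit` type, the `rrfMerge` name, and the precomputed `key` field are hypothetical; the real merge in global-fanout.ts may differ in shape and tie-breaking.

```typescript
// Hypothetical hit shape: `key` is the dedupe key shared across DBs.
type RankedHit = { key: string; title: string };
type FusedHit = RankedHit & { score: number };

// Reciprocal Rank Fusion over several ranked lists (one list per DB).
function rrfMerge(lists: RankedHit[][], k = 60): FusedHit[] {
  const fused = new Map<string, FusedHit>();
  for (const list of lists) {
    list.forEach((hit, index) => {
      const rank = index + 1; // ranks are 1-based
      const entry = fused.get(hit.key) ?? { ...hit, score: 0 };
      entry.score += 1 / (k + rank);
      fused.set(hit.key, entry);
    });
  }
  // Highest fused score first; equal scores fall back to key order (assumption).
  return [...fused.values()].sort(
    (a, b) => b.score - a.score || a.key.localeCompare(b.key),
  );
}
```

An item that appears in two DBs accumulates two contributions, which is why the dedupe key has to exclude the per-DB source for identical content to fuse.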

Refs mksglu#737
…overage, limits

Wire the new capability into the ctx_search handler via extracted, unit-tested
pure helpers (src/search/ctx-search-plan.ts).

- planCtxSearchScope: discriminated scope (global | absPathPerProject |
  rowFilter | current) honouring shared vs per-project mode.
- getCurrentWorkingProject(): real cwd (not the pinned CONTEXT_MODE_PROJECT_DIR)
  so shared-mode "current" filters by the right path.
- Global / abs-path branches NEVER open a writable store (no getStore(),
  no migration); content dir resolved via the non-mutating storage helper.
- shouldReturnEmptyGuidance: post-search, only when content is empty AND no
  hits — so auto-memory-only recall is not masked; empty message now points to
  project:"global".
- Coverage notice: prepend a visible "Global search INCOMPLETE" warning (also
  shown on the no-results path) when the fan-out cap drops DBs, so results are
  never silently partial.
- effectiveSearchLimit: breadth scopes (global/abs-path) return 10-12 results
  (5 throttled) so specific hits are not chopped off behind common-term noise;
  single-DB stays 1-2.
- Result header attribution: proj:<name> + sess:<id>; orphan content in breadth
  scope labelled proj:(unattributed) so the model can't borrow a wrong project.
- extractSnippet: forward-biased window (200 back / 520 fwd) + larger budget so
  detail after a heading match is captured; hard 40KB output cap enforced
  incrementally with footer headroom.
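
The scope planning in the first bullet above can be pictured as a small pure function returning a discriminated union. This is only a sketch under assumptions: the four scope kinds come from the commit message, but the `StorageMode` type, the field names, and the `planScopeSketch` signature are invented here and are not the actual planCtxSearchScope API.

```typescript
// Hypothetical types: the four kinds mirror the commit message.
type StorageMode = "shared" | "per-project";
type SearchScope =
  | { kind: "current"; projectDir: string }
  | { kind: "global" }
  | { kind: "rowFilter"; projectDir: string }
  | { kind: "absPathPerProject"; projectDir: string };

function planScopeSketch(
  project: string | undefined,
  mode: StorageMode,
  realCwd: string,
): SearchScope {
  // Default and "current" resolve to the real cwd, not a pinned directory.
  if (project === undefined || project === "current") {
    return { kind: "current", projectDir: realCwd };
  }
  if (project === "global") return { kind: "global" };
  // Anything else is treated as an absolute project path (assumption).
  return mode === "shared"
    ? { kind: "rowFilter", projectDir: project } // one DB, filter rows
    : { kind: "absPathPerProject", projectDir: project }; // open that project's DBs
}
```

Keeping this step pure means the shared vs per-project branching can be unit-tested without opening any database.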

Refs mksglu#737
- project: current | global | <abs-path> table + examples; note the model
  should stay on current unless the user explicitly asks for global.
- document the per-result attribution header (proj:/sess:) and the
  "Global search INCOMPLETE" coverage notice.
- new "Search environment variables" section documenting
  CONTEXT_MODE_GLOBAL_FANOUT_MAX (default 1024, behaviour, fallback).
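
For reference, a strict parse of CONTEXT_MODE_GLOBAL_FANOUT_MAX could look like the sketch below. The default of 1024 is from this PR; what "strict" accepts (digits only, positive, safe integer) and the `parseFanoutMax` name are assumptions, and the raw value is passed in as a parameter so the example does not depend on the runtime's environment API.

```typescript
const DEFAULT_FANOUT_MAX = 1024;

// Hypothetical parser: anything other than a plain positive integer
// falls back to the default instead of being partially accepted.
function parseFanoutMax(
  raw: string | undefined,
  fallback: number = DEFAULT_FANOUT_MAX,
): number {
  if (raw === undefined) return fallback;
  const trimmed = raw.trim();
  // Digits only: rejects "64abc", "1e3", "-5", "1.5" and the empty string.
  if (!/^\d+$/.test(trimmed)) return fallback;
  const value = Number(trimmed);
  return Number.isSafeInteger(value) && value > 0 ? value : fallback;
}
```

A lenient parser such as `parseInt` would read "64abc" as 64 and quietly shrink the fan-out, which is the kind of silent truncation this PR is trying to avoid.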

Refs mksglu#737
mksglu changed the base branch from main to next June 1, 2026 18:30
jetnet added 3 commits June 1, 2026 23:57
…t, diversity cap

Cross-project (project:"global") recall surfaced README/tool-call noise but
NOT the human's actual question/decision, because session_events search matched
the ENTIRE query as one LIKE substring and ranked all categories equally.
Validated end-to-end: a real `pi -p` agent now answers "how did we customize
pi-statusline" with the correct project/session/date + the model.provider work.

New src/search/event-query.ts (shared, unit-tested):
- tokenizeSearchQuery: whitespace split, trim non-[A-Za-z0-9%_/@.-] edges while
  PRESERVING %/_ LIKE-wildcards, drop English stopwords, keep len>=2 (incl "pi"),
  dedupe, cap 8, [] for empty/all-stopword.
- escapeLike (\ % _), tokenWeight (distinctiveness: package/digit tokens outrank
  generic words; bare 2-char tokens stay low so "pi" can't dominate).
- buildEventMatch: per-term weighted OR-LIKE; relevance score = (category boost)
  * term-sum + 100*exact-phrase. COALESCE(data,'')/COALESCE(category,'') guards
  against three-valued-logic NULL poisoning (defensive across arbitrary fan-out DBs).
- categoryBoostSql: x3 for high-signal categories (role/user-prompt/decision/
  plan), x1 otherwise — boost-only, never penalises file/data recall.
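
A minimal sketch of the tokenizer and LIKE escaping listed above, following the stated rules (whitespace split, edge trim, stopword drop, length >= 2, dedupe, cap 8). The stopword list here is a tiny illustrative subset, and lowercasing before dedupe is an assumption; neither is taken from event-query.ts.

```typescript
// Illustrative stopword subset (assumption, not the real list).
const STOPWORDS = new Set([
  "a", "an", "the", "and", "or", "of", "to", "in", "on", "for",
  "is", "are", "was", "did", "do", "we", "how", "what", "it",
]);
const MAX_TERMS = 8;

function tokenizeSketch(query: string): string[] {
  const terms = new Set<string>(); // Set preserves insertion order and dedupes
  for (const raw of query.split(/\s+/)) {
    // Trim edge characters outside [A-Za-z0-9%_/@.-]; % and _ are preserved.
    const token = raw
      .replace(/^[^A-Za-z0-9%_\/@.\-]+/, "")
      .replace(/[^A-Za-z0-9%_\/@.\-]+$/, "")
      .toLowerCase(); // assumption: case-insensitive dedupe
    if (token.length < 2 || STOPWORDS.has(token)) continue;
    terms.add(token);
    if (terms.size === MAX_TERMS) break;
  }
  return [...terms];
}

// Escape the LIKE metacharacters so a term matches literally when the
// query uses ESCAPE '\'.
function escapeLikeSketch(term: string): string {
  return term.replace(/[\\%_]/g, (ch) => "\\" + ch);
}
```

For example, `tokenizeSketch("how did we customize pi-statusline?")` keeps only `customize` and `pi-statusline`, so each term can be matched with its own LIKE clause instead of requiring the whole sentence as one substring.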

src/session/db.ts searchEvents:
- tokenized dynamic SQL; new orderMode ("timeline" default = chronological,
  preserving the documented contract + tests; "relevance" = score DESC,
  created_at DESC, id ASC). Relevance fallback (0 terms) orders by recency, NOT
  a bare-integer scoreExpr (SQLite would read it as a column index).

src/search/global-fanout.ts readonlySearchEvents + merge:
- same tokenized matcher + orderMode (read-only opens, closed in finally).
- relevance sort: RRF score DESC, origin as an equal-score TIE-break only
  (prior-session > content), RRF-safe.
- HARD session diversity cap (max(3, floor(limit/4)) per sessionId, drop surplus,
  no backfill) so one chatty session (e.g. the current one, flooding the KB with
  the topic being worked on) can't monopolise the window; content/auto-memory
  (no sessionId) uncapped; non-positive limit guarded.
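
The diversity cap can be sketched as follows. The per-session formula max(3, floor(limit/4)) is from the commit message; reading "no backfill" as "truncate to the limit first, then drop surplus without pulling in lower-ranked hits" is an interpretation, and the `Hit` type and `capSessions` name are hypothetical.

```typescript
// Hypothetical hit shape: content and auto-memory hits carry no sessionId.
type Hit = { title: string; sessionId?: string };

function capSessions(ranked: Hit[], limit: number): Hit[] {
  // Guard non-positive or non-finite limits (assumed to yield no results).
  if (!Number.isFinite(limit) || limit <= 0) return [];
  const perSession = Math.max(3, Math.floor(limit / 4));
  const counts = new Map<string, number>();
  // Take the top window first; surplus hits dropped below are NOT replaced.
  return ranked.slice(0, limit).filter((hit) => {
    if (hit.sessionId === undefined) return true; // uncapped
    const seen = counts.get(hit.sessionId) ?? 0;
    if (seen >= perSession) return false; // drop surplus from a chatty session
    counts.set(hit.sessionId, seen + 1);
    return true;
  });
}
```

Under this reading the result can be shorter than `limit`, which trades a full window for not promoting low-ranked hits just to fill space.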

src/search/unified.ts threads sort -> orderMode for current/local recall.

Tests: new event-query.test.ts; new global-fanout cases (multi-term recall,
RRF source fairness, 0-term fallback recency order, category boost role>data,
session diversity cap). Reviewed across multiple rounds (oracle + reviewer + an
external model). FTS5 on session_events remains the tracked follow-up.

Refs mksglu#737
…idance

CI (test job, ubuntu + macos) caught a regression the targeted local runs
missed: the mksglu#737 empty-results guidance was reworded from the old
"No results found / After indexing" phrasing to point users at
project:"global". The mksglu#442 Read-deny-policy test asserted the OLD wording via
`searchedEmpty`, so it failed with "expected false to be true" at
server.test.ts:1509.

Teach the assertion to also accept the new "no indexed content" phrasing, and
add the explicit exfil pin the test comment already described — the denied
secret marker must never appear in search output. Full tests/core/server.test.ts
now green (475 passed).

Refs mksglu#737
…l-content dedupe key

Two MAJOR issues found in a fresh review of the global-search paths (both NEW in
this PR), confirmed across independent review passes:

1. **contentType/source filtering leaked across sources** (src/search/unified.ts
   searchAllSources):
   - contentType was only applied to the content store, but SessionDB +
     auto-memory were still queried → `contentType:"code"` wrongly returned
     session events. Now both are gated behind `if (!contentType)`, matching the
     policy global-fanout already used (session/auto-memory have no code/prose
     classification).
   - auto-memory results are now source-filtered (were not), mirroring
     global-fanout.
   - session-event results now carry their CATEGORY as `source` (was the literal
     "prior-session") in BOTH readonlySearchEvents and the unified single-DB
     mapper, so the documented `source:"decision"` (etc.) filters session memory
     by category in single-DB AND global. `origin` ("prior-session"/
     "current-session") is unchanged; attribution display uses origin separately.

2. **Global RRF dedupe fused distinct cross-project hits** (src/search/
   global-fanout.ts): itemKey was `title + content.slice(0,80)`, so two distinct
   documents sharing a title + 80-char boilerplate prefix (license headers,
   decision preambles) collided — one project silently dropped, the other's RRF
   score inflated. Key is now `title | content.length | fnv1a32(full content)`
   (new dependency-free FNV-1a helper); project/source stay excluded so genuinely
   identical content across DBs still fuses as intended.
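
A sketch of the new dedupe key, for illustration. FNV-1a (32-bit) is a standard algorithm, but this version hashes UTF-16 code units rather than UTF-8 bytes, and the exact separator and hex formatting of the key are assumptions; the helper in this PR may differ on both points.

```typescript
// 32-bit FNV-1a over UTF-16 code units (matches the byte-wise
// algorithm for ASCII input).
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime, mod 2^32
  }
  return hash >>> 0; // force unsigned
}

// Hypothetical key builder: title | content.length | hash of the FULL content.
function itemKeySketch(title: string, content: string): string {
  return `${title}|${content.length}|${fnv1a32(content).toString(16)}`;
}
```

Because the hash covers the whole content, two documents that share a title and an 80-character boilerplate prefix no longer collide, while byte-identical content from different DBs still maps to the same key.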

Tests: new regressions for distinct-prefix non-fusion + identical fusion,
contentType excluding session/auto-memory, auto-memory source filtering, and
category-as-source filtering (single-DB + global). All targeted suites green
(incl. server.test.ts) + typecheck + build.

Refs mksglu#737
mksglu marked this pull request as draft June 3, 2026 06:22
Development

Successfully merging this pull request may close these issues.

[Feature]: ctx_search project: filter — scope FTS5 ContentStore to current project when using a global DB
