CodeWell Architecture V2

This document defines the next implementation direction after the current pre-alpha MVP.

For the branch-level V2 starting point after the V1 freeze candidate was preserved, see docs/V2_KICKOFF.md.

Product Direction

CodeWell is moving from a single-project indexer toward a local code knowledge-base system for agents:

  • the user can drop in a local folder, a GitHub URL, and later a local ZIP archive
  • raw source input must remain untouched
  • derived artifacts, indexes, revisions, and logs should live outside the raw source tree when the user chooses detached-library mode
  • retrieval should stay local-first, lightweight, and fast by default
  • MCP output should be structured for agent action, not just snippet dumping

Design Principles

  • Raw sources are read-only inputs.
  • Derived artifacts are replaceable caches plus durable memory records.
  • Revision memory never mutates the original snippet or repository.
  • Local lexical + graph retrieval remains the default baseline.
  • Embeddings and LLMs are optional plugins, not required infrastructure.
  • Every new retrieval feature must be evaluated for both recall and precision.
  • Multi-goal retrieval must have a complete baseline path that requires neither an LLM nor embeddings.

Library Model

The target model is Raw / Derived / Manifest.

Raw

  • user-owned source folders
  • downloaded but unmodified GitHub snapshots
  • later: extracted ZIP imports

Derived

  • SQLite index
  • graph/cache artifacts
  • revision-memory records
  • future retrieval diagnostics and ranking traces

Manifest

  • maps a raw source root to its detached derived location
  • stores source kind, source URL/ref/commit, timestamps, and future import metadata
  • enables future admin-agent workflows without touching the raw tree
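
A manifest record along these lines could be sketched as follows. The class name and every field name (`raw_root`, `derived_root`, `source_kind`, and so on) are assumptions made for illustration, as is the JSON serialization; none of them is taken from the actual CodeWell schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class WorkspaceManifest:
    """Hypothetical manifest mapping a raw source root to its derived location."""

    raw_root: str
    derived_root: str
    source_kind: str  # assumed values: "local_folder", "github", "zip"
    source_url: Optional[str] = None
    source_ref: Optional[str] = None
    source_commit: Optional[str] = None
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # Sorted keys keep the file stable across rewrites.
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "WorkspaceManifest":
        return cls(**json.loads(text))


manifest = WorkspaceManifest(
    raw_root="/home/user/projects/demo",
    derived_root="/home/user/codewell-library/derived/demo",
    source_kind="local_folder",
)
restored = WorkspaceManifest.from_json(manifest.to_json())
```

Because the manifest lives with the derived artifacts, a round trip like this never needs to read or write anything under the raw tree.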

Current execution status:

  • Phase 1 (Library Boundary) is complete: detached-library layout, workspace manifests, and centralized path resolution are implemented.
  • Phase 2 (V2 Multi-Goal Retrieval) is complete: ambiguity-aware retrieval with subgoal decomposition, workspace clustering, primary/backup branch selection, and output trimming is implemented. The full 3×3 repeated-run evaluation shows net-positive results on all strong_multi_goal tasks.
  • codewell index --library-root <path> stores derived artifacts outside the raw source tree and writes a workspace manifest.
  • Legacy in-place .codewell/ mode still exists for compatibility.
  • The unified ingest dispatcher routes local folders, GitHub URLs, and local ZIP archives through one internal planning layer.
  • A managed intake layer exists under the detached library root:
    • intake/inbox
    • intake/archives
    • intake/papers
    • intake/documents
    • intake/source-workspaces/projects
    • intake/source-workspaces/single-files
    • intake/failed
    • intake/manifests
  • Intake can now auto-index imported code workspaces and apply read-only protection, with an additional best-effort Windows ACL layer when available.
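
The intake layout listed above could be created idempotently with a helper like the one below. The subdirectory names come from this document; the function name `ensure_intake_layout` and the constant `INTAKE_SUBDIRS` are hypothetical, and read-only protection is deliberately left out of the sketch.

```python
from pathlib import Path

# Subdirectories of <library_root>/intake, as listed in this document.
INTAKE_SUBDIRS = (
    "inbox",
    "archives",
    "papers",
    "documents",
    "source-workspaces/projects",
    "source-workspaces/single-files",
    "failed",
    "manifests",
)


def ensure_intake_layout(library_root: Path) -> list[Path]:
    """Create any missing intake directories and return all of them."""
    paths = []
    for rel in INTAKE_SUBDIRS:
        path = library_root / "intake" / rel
        # exist_ok makes repeated calls safe on an existing library.
        path.mkdir(parents=True, exist_ok=True)
        paths.append(path)
    return paths
```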

Import Pipeline

The long-term ingest layer should normalize multiple source types behind one interface:

  1. local folder
  2. local ZIP archive
  3. GitHub repository URL

Shared ingest stages:

  1. source resolution
  2. raw snapshot acquisition
  3. manifest write
  4. indexing and graph extraction
  5. evaluation/diagnostics hooks

The current internal implementation is split into a first reusable four-stage pipeline:

  1. plan
  2. acquire
  3. materialize
  4. index

This is the boundary future batch import, sync, retry, and admin-agent workflows should extend.
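
As a rough illustration of that boundary, the four stages can be modeled as plain functions over a shared job record. The stage names follow this document; `IngestJob`, `run_pipeline`, and the source-kind heuristics in `plan` are assumptions, and the last three stages are placeholders that do no real work.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class IngestJob:
    """Hypothetical job record passed through every stage."""

    source: str
    kind: str = ""
    completed: list[str] = field(default_factory=list)


def plan(job: IngestJob) -> IngestJob:
    # Decide the source kind from the input string alone (assumed heuristics).
    if job.source.startswith("https://github.com/"):
        job.kind = "github"
    elif job.source.lower().endswith(".zip"):
        job.kind = "zip"
    else:
        job.kind = "local_folder"
    return job


def acquire(job: IngestJob) -> IngestJob:
    return job  # placeholder: fetch or locate the raw snapshot


def materialize(job: IngestJob) -> IngestJob:
    return job  # placeholder: lay out the snapshot and write the manifest


def index(job: IngestJob) -> IngestJob:
    return job  # placeholder: build the index and graph


STAGES: list[tuple[str, Callable[[IngestJob], IngestJob]]] = [
    ("plan", plan),
    ("acquire", acquire),
    ("materialize", materialize),
    ("index", index),
]


def run_pipeline(job: IngestJob) -> IngestJob:
    for name, stage in STAGES:
        job = stage(job)
        job.completed.append(name)  # record progress so a retry could resume
    return job
```

Keeping the stages in one ordered table is one way batch import, sync, and retry could reuse the same sequence without duplicating control flow.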

Current intake behavior sits one layer above that pipeline:

  1. user drops files into the managed inbox
  2. intake classifies and moves them into protected storage
  3. code imports can be indexed immediately
  4. papers/documents emit lightweight metadata manifests for later context retrieval
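
The classification step could look roughly like this. The destination names match the intake directories in this document, but the suffix-to-category mapping and the function name `classify_intake_item` are assumptions; the real intake rules may differ.

```python
from pathlib import PurePath

# Assumed suffix sets, for illustration only.
CODE_SUFFIXES = {".py", ".js", ".jsx", ".ts", ".tsx"}
DOCUMENT_SUFFIXES = {".md", ".txt", ".docx"}


def classify_intake_item(name: str, is_dir: bool = False) -> str:
    """Return the intake subdirectory an inbox item should move to."""
    if is_dir:
        return "source-workspaces/projects"
    suffix = PurePath(name).suffix.lower()
    if suffix == ".zip":
        return "archives"
    if suffix == ".pdf":
        return "papers"
    if suffix in DOCUMENT_SUFFIXES:
        return "documents"
    if suffix in CODE_SUFFIXES:
        return "source-workspaces/single-files"
    # Anything unrecognized is parked instead of guessed at.
    return "failed"
```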

ZIP Support

Do not introduce 7z now.

  • GitHub archive ingest already uses ZIP.
  • Python zipfile is enough for the first local-archive workflow.
  • An external unpacker increases dependency surface and weakens the lightweight/local-first goal.
  • Revisit 7z later only if real format coverage or measured extraction performance justifies it.
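
A stdlib-only extraction step might be sketched as below. It uses only `zipfile` and `pathlib` from the standard library; the function name `extract_zip_snapshot` and the explicit path check are assumptions about how a local-archive workflow could guard its destination, not a description of the existing ingest code.

```python
import zipfile
from pathlib import Path


def extract_zip_snapshot(archive: Path, dest: Path) -> list[str]:
    """Extract a ZIP into dest and return the extracted file names.

    Raises ValueError if any member would resolve outside dest.
    """
    dest = dest.resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        members = zf.infolist()
        # Validate every member before writing anything to disk.
        for info in members:
            target = (dest / info.filename).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe member path: {info.filename}")
        zf.extractall(dest)
        return [info.filename for info in members if not info.is_dir()]
```

`zipfile` already strips absolute prefixes and `..` components during extraction; rejecting such archives up front is a stricter choice that sends suspicious imports to a failure path instead of silently rewriting their layout.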

Retrieval Precision Plan

The baseline stays lexical search plus static graph expansion. Improvement work should focus on precision governance instead of broad opaque ranking.

Planned layers:

  1. lexical path/symbol/content match
  2. one-hop graph expansion through imports, callers, callees, tests, routes, and commands
  3. bounded multi-hop expansion only when the first hop is strong
  4. revision-memory relevance and applicability checks
  5. optional reranking plugins after evaluation proves value

Current first V2 retrieval-control slice:

  • keep the same lexical plus graph backbone
  • add ambiguity-aware retrieval modes instead of adding a new retrieval stack
  • first concrete policy is navigation_prune:
    • if the query still looks like a navigation flow
    • but the top lexical candidate is already plausible enough
    • and local goal competition is high
    • then reduce lexical breadth, graph depth, and symbol-context fanout
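
The navigation_prune conditions above can be read as a single guard over a retrieval budget. In this sketch the three budget fields mirror the three quantities named in the policy, but the default values, the thresholds `plausible_score` and `high_competition`, and the halving rule are all invented for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RetrievalBudget:
    """Hypothetical knobs the policy is allowed to tighten."""

    lexical_breadth: int = 20
    graph_depth: int = 2
    symbol_fanout: int = 8


def navigation_prune(
    budget: RetrievalBudget,
    *,
    is_navigation_query: bool,
    top_lexical_score: float,
    goal_competition: float,
    plausible_score: float = 0.6,   # assumed threshold
    high_competition: float = 0.5,  # assumed threshold
) -> RetrievalBudget:
    """Return a tighter budget only when all three conditions hold."""
    should_prune = (
        is_navigation_query
        and top_lexical_score >= plausible_score
        and goal_competition >= high_competition
    )
    if not should_prune:
        return budget
    return replace(
        budget,
        lexical_breadth=max(1, budget.lexical_breadth // 2),
        graph_depth=max(1, budget.graph_depth - 1),
        symbol_fanout=max(1, budget.symbol_fanout // 2),
    )
```

Returning a new budget instead of mutating shared state keeps the policy easy to evaluate in isolation against the unpruned baseline.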

Multi-Goal Decomposition Baseline

The preferred V2 decomposition design is retrieval-aware and model-free by default.

Default baseline pipeline:

  1. split the user task into short structured subgoals
  2. run shallow lexical probes for each subgoal
  3. detect whether subgoals share the same local workspace region
  4. merge same-region subgoals into one retrieval session
  5. select one primary branch for deep context assembly
  6. retain at most one lightweight backup branch
  7. retain only a micro-context for the backup branch when budget remains
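
Steps 2 through 6 of this pipeline can be approximated without any model. The sketch below assumes a `probe` callable that returns `(path, score)` lexical hits, and it treats the parent directory of each subgoal's best hit as its workspace region; both are simplifications chosen for illustration, not the actual clustering rule.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Callable, Optional

# A probe maps a subgoal to (file path, lexical score) hits. Hypothetical type.
Probe = Callable[[str], list[tuple[str, float]]]


def plan_branches(subgoals: list[str], probe: Probe) -> dict[str, Optional[dict]]:
    """Merge same-region subgoals, then pick one primary and one backup."""
    regions: dict[str, dict] = defaultdict(lambda: {"subgoals": [], "score": 0.0})
    for goal in subgoals:
        hits = probe(goal)
        if not hits:
            continue  # a subgoal with no lexical evidence opens no branch
        path, score = max(hits, key=lambda hit: hit[1])
        region = str(PurePosixPath(path).parent)
        regions[region]["subgoals"].append(goal)
        regions[region]["score"] += score
    ranked = sorted(regions.items(), key=lambda item: item[1]["score"], reverse=True)
    branches = [{"region": region, **info} for region, info in ranked]
    return {
        "primary": branches[0] if branches else None,
        # At most one backup branch is kept; the rest are dropped.
        "backup": branches[1] if len(branches) > 1 else None,
    }


fake_index = {
    "fix login token": [("src/auth/token.py", 0.9)],
    "update login form": [("src/auth/forms.py", 0.7)],
    "adjust billing export": [("src/billing/export.py", 0.8)],
}
result = plan_branches(list(fake_index), lambda goal: fake_index.get(goal, []))
```

In this toy run the two login subgoals merge into the `src/auth` region and become the primary branch, while `src/billing` is retained as the single backup.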

This baseline deliberately avoids:

  • required LLM planning
  • required embeddings
  • required semantic rerankers
  • broad full-parallel graph expansion across every subgoal

The key design objective is not "more branches". It is "earlier commitment to the right local workspace with bounded cost".

Current intended backup-branch form:

  • one backup file at most
  • one tiny backup symbol trace at most
  • one tiny backup neighborhood at most
  • never full graph expansion for the backup branch in the default baseline

Optional Enhancement Layers

Future layers may be added behind explicit opt-in interfaces.

Optional planner interface:

  • can propose structured subgoals from difficult natural-language prompts
  • can propose dependency order between subgoals
  • must not replace baseline execution control

Optional semantic interface:

  • can cluster related subgoals with embeddings
  • can help merge ambiguous workspace candidates
  • can rerank candidate groups after the baseline is already measurable

Both interfaces must satisfy one constraint:

  • disabling them must still leave a complete, evaluated, first-class V2 system
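
One way to honor that constraint is to make the planner a strictly optional argument with a model-free fallback. The `SubgoalPlanner` protocol, `baseline_split`, and `decompose` below are hypothetical names, and the regex split is only a stand-in for whatever the baseline decomposition actually does.

```python
import re
from typing import Optional, Protocol


class SubgoalPlanner(Protocol):
    """Hypothetical opt-in planner interface."""

    def propose(self, task: str) -> list[str]: ...


def baseline_split(task: str) -> list[str]:
    # Model-free stand-in: split on separators and simple conjunctions.
    parts = re.split(r"\s*(?:;|,|\band then\b|\band\b)\s*", task)
    return [part.strip() for part in parts if part.strip()]


def decompose(task: str, planner: Optional[SubgoalPlanner] = None) -> list[str]:
    """Use the planner's proposal if one is available, else the baseline."""
    if planner is not None:
        try:
            proposed = [goal.strip() for goal in planner.propose(task) if goal.strip()]
            if proposed:
                return proposed
        except Exception:
            pass  # a failing plugin must never break the baseline path
    return baseline_split(task)
```

With `planner=None` the function is the whole system, which is the property the constraint asks for; the planner only proposes subgoals and never takes over execution control.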

Evaluation requirements:

  • keep task-level Python and TS/JS evaluations
  • add real JS/TS repository tasks before adding more languages
  • measure recall, precision, latency, token budget, and graph-size growth together
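
For a single task, recall and precision over retrieved files reduce to set arithmetic. The helper below is a minimal sketch; the function name and the choice to score at file granularity are assumptions, and latency, token budget, and graph size would be recorded alongside these two numbers.

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one task's retrieved file paths."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


precision, recall = precision_recall(
    ["a.py", "b.py", "c.py", "d.py"],
    {"a.py", "b.py", "e.py"},
)
# Two of four retrieved files are relevant, and two of three relevant files
# were found, so precision is 2/4 and recall is 2/3.
```

Reporting both values per task makes it visible when a change raises recall only by widening the result set.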

Graph And Memory Model

The current graph is file/symbol/import/call-edge centric. The next expansion should stay modular:

  • repository node
  • file node
  • symbol node
  • call edge
  • import edge
  • route/test/command relationship edge
  • revision-memory node linked to source symbols/files and applicability evidence

Future admin agents should be able to query this graph without needing direct SQL knowledge.
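
A thin wrapper over SQLite is one way to give agents graph queries without exposing SQL. The two-table schema, the `CodeGraph` class, and the `kind:name` node-id convention below are assumptions made for this sketch and do not describe the current index layout.

```python
import sqlite3

# Assumed minimal schema: typed nodes plus typed directed edges.
SCHEMA = """
CREATE TABLE nodes (id TEXT PRIMARY KEY, kind TEXT NOT NULL);
CREATE TABLE edges (src TEXT NOT NULL, dst TEXT NOT NULL, kind TEXT NOT NULL);
"""


class CodeGraph:
    """Hypothetical facade that hides SQL behind graph-shaped methods."""

    def __init__(self) -> None:
        self._conn = sqlite3.connect(":memory:")
        self._conn.executescript(SCHEMA)

    def add_node(self, node_id: str, kind: str) -> None:
        self._conn.execute(
            "INSERT OR IGNORE INTO nodes VALUES (?, ?)", (node_id, kind)
        )

    def add_edge(self, src: str, dst: str, kind: str) -> None:
        self._conn.execute("INSERT INTO edges VALUES (?, ?, ?)", (src, dst, kind))

    def neighbors(self, node_id: str, kinds: list[str]) -> list[tuple[str, str]]:
        """One-hop outgoing neighbors restricted to the given edge kinds."""
        if not kinds:
            return []
        marks = ",".join("?" for _ in kinds)
        rows = self._conn.execute(
            f"SELECT dst, kind FROM edges WHERE src = ? AND kind IN ({marks}) "
            "ORDER BY dst, kind",
            (node_id, *kinds),
        )
        return [(dst, kind) for dst, kind in rows]


graph = CodeGraph()
graph.add_node("symbol:login", "symbol")
graph.add_edge("symbol:login", "symbol:check_token", "call")
graph.add_edge("symbol:login", "file:auth/token.py", "import")
callees = graph.neighbors("symbol:login", ["call"])
```

New edge kinds such as route, test, or command relationships would only add rows with a different `kind`, which keeps the expansion modular.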

MCP Return Protocol

MCP should return minimal actionable structure, not recursive repo dumps.

Recommended tool contract direction:

  • search_code: lightweight ranked hits with file path, symbol names, snippet, and score
  • trace_symbol: current definition plus one-hop callers and one-hop callees with excerpts
  • get_context_pack: selected files, why they were chosen, bounded traces, and related revisions
  • search_revision_memory: explanation, verification state, applicability notes, signals, and warnings

Default MCP expansion rules:

  • one-hop by default
  • token-aware excerpts
  • related files only when selection reasons are explicit
  • revision memory included only when it is relevant and supported by evidence
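
To make the one-hop and token-aware defaults concrete, here is a sketch of how a trace_symbol style response could be assembled. The response keys, the function names, and the whitespace-based token estimate are assumptions; a real implementation would likely use the token counting of the consuming model.

```python
def token_aware_excerpt(source: str, max_tokens: int) -> str:
    """Keep whole lines until the (whitespace-estimated) budget is spent."""
    kept: list[str] = []
    used = 0
    for line in source.splitlines():
        cost = len(line.split())
        if used + cost > max_tokens:
            break
        kept.append(line)
        used += cost
    return "\n".join(kept)


def build_trace_response(
    symbol: str,
    definition: str,
    callers: list[str],
    callees: list[str],
    max_tokens: int = 40,
) -> dict:
    """Assemble a bounded, one-hop response (hypothetical field names)."""
    return {
        "symbol": symbol,
        "definition": token_aware_excerpt(definition, max_tokens),
        "callers": sorted(callers),
        "callees": sorted(callees),
        "hops": 1,  # deeper expansion would have to be requested explicitly
    }


response = build_trace_response(
    "login",
    "def login(user):\n    token = check_token(user)\n    return token\n",
    callers=["handle_request"],
    callees=["check_token"],
    max_tokens=5,
)
```

Cutting on line boundaries keeps excerpts syntactically readable, at the cost of sometimes leaving part of the budget unused.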

Planned next MCP additions:

  • response mode presets such as compact, default, and debug
  • explicit context-pack diagnostics for why files were added or excluded
  • source provenance blocks for detached libraries and imported archives
  • stronger paper/document retrieval once PDF extraction quality is improved
  • future branch-aware diagnostics for multi-goal decomposition, such as:
    • subgoal records
    • workspace-cluster merges
    • primary-branch versus backup-branch selection
    • backup-branch micro-context retention

Modularization Requirements

To support future collaboration between a repository-admin agent and a document-knowledge agent, keep these interfaces stable and separable:

  • ingest: source acquisition and normalization
  • library: raw/derived/manifest path resolution
  • parsers: language extraction
  • graph: relationship storage and traversal
  • retrieval: search, trace, context-pack assembly
  • memory: failures, revisions, applicability, verification
  • mcp: agent-facing protocol surface
  • ui: human-facing local inspection

Execution Roadmap

Phase 1: Library Boundary

  • formalize detached-library layout
  • keep raw sources unchanged
  • write workspace manifests
  • centralize path resolution

Status: complete.

Phase 2: Unified Import Layer

  • add a source-ingest abstraction
  • support local folder, GitHub, and local ZIP input through one pipeline
  • keep ZIP support on Python stdlib first

Phase 3: Retrieval Governance

  • broaden JS/TS task-level evaluations
  • tighten graph expansion thresholds and diagnostics
  • compare precision/latency tradeoffs before optional reranking
  • add retrieval-aware multi-goal decomposition without required model dependencies

Phase 4: Graph And Memory Upgrade

  • promote revision memory to graph-linked entities
  • add richer route/test/command relationships
  • expose reusable diagnostics to MCP and UI

Phase 5: Agent And UI Evolution

  • richer UI views for files, failures, graph relationships, and manifests
  • optional embedding/reranking providers
  • admin-agent-facing APIs for repository and memory maintenance
  • optional planner providers for complex multi-goal task decomposition