CodeWell Architecture V2

This document defines the next implementation direction after the current pre-alpha MVP.

For the branch-level V2 starting point after the V1 freeze candidate was preserved, see docs/V2_KICKOFF.md.

Product Direction

CodeWell is moving from a single-project indexer toward a local code knowledge-base system for agents:

the user can drop in a local folder, a GitHub URL, and later a local ZIP archive
raw source input must remain untouched
derived artifacts, indexes, revisions, and logs should live outside the raw source tree when the user chooses detached-library mode
retrieval should stay local-first, lightweight, and fast by default
MCP output should be structured for agent action, not just snippet dumping

Design Principles

Raw sources are read-only inputs.
Derived artifacts are replaceable caches plus durable memory records.
Revision memory never mutates the original snippet or repository.
Local lexical + graph retrieval remains the default baseline.
Embeddings and LLMs are optional plugins, not required infrastructure.
Every new retrieval feature must be evaluated for both recall and precision.
Multi-goal retrieval must have a complete no-LLM no-embedding baseline path.

Library Model

The target model is Raw / Derived / Manifest.

Raw

user-owned source folders
downloaded but unmodified GitHub snapshots
later: extracted ZIP imports

Derived

SQLite index
graph/cache artifacts
revision-memory records
future retrieval diagnostics and ranking traces

Manifest

maps a raw source root to its detached derived location
stores source kind, source URL/ref/commit, timestamps, and future import metadata
enables future admin-agent workflows without touching the raw tree
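
To make the Manifest role more concrete, the following is a minimal sketch of what such a record could hold. The class name `WorkspaceManifest`, the helper `new_manifest`, all field names, and the JSON layout are assumptions for illustration; they are not the format CodeWell actually writes.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class WorkspaceManifest:
    """Hypothetical manifest record; field names are illustrative only."""
    raw_root: str                        # user-owned source tree, never written to
    derived_root: str                    # detached location for index, graph, memory
    source_kind: str                     # e.g. "local_folder", "github", "zip"
    source_url: Optional[str] = None
    source_ref: Optional[str] = None
    source_commit: Optional[str] = None
    created_at: str = ""

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)


def new_manifest(raw_root: str, derived_root: str, source_kind: str, **source) -> WorkspaceManifest:
    """Build a manifest stamped with the current UTC time; persisting it is left to the caller."""
    return WorkspaceManifest(
        raw_root=raw_root,
        derived_root=derived_root,
        source_kind=source_kind,
        created_at=datetime.now(timezone.utc).isoformat(),
        **source,
    )
```

Keeping the record a plain serializable structure means an admin agent could read it without opening the raw tree or the SQLite index.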

Current execution status:

Phase 1 (Library Boundary) is complete — detached-library layout, workspace manifests, and centralized path resolution are implemented.
Phase 2 (V2 Multi-Goal Retrieval) is complete — ambiguity-aware retrieval with subgoal decomposition, workspace clustering, primary/backup branch selection, and output trimming. Full 3×3 repeated-run evaluation shows net-positive results on all strong_multi_goal tasks.
codewell index --library-root <path> stores derived artifacts outside the raw source tree and writes a workspace manifest.
Legacy in-place .codewell/ mode still exists for compatibility.
The unified ingest dispatcher routes local folders, GitHub URLs, and local ZIP archives through one internal planning layer.
A managed intake layer exists under the detached library root:
- intake/inbox
- intake/archives
- intake/papers
- intake/documents
- intake/source-workspaces/projects
- intake/source-workspaces/single-files
- intake/failed
- intake/manifests
Intake can now auto-index imported code workspaces and apply read-only protection, with an additional best-effort Windows ACL layer when available.
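
As a rough sketch, the intake layout listed above could be created idempotently as follows. The directory names come from the list; the function name `ensure_intake_layout` and the choice to create everything eagerly are assumptions, and read-only protection is not shown.

```python
from pathlib import Path
from typing import List

# Directory names copied from the intake layout listed above.
INTAKE_SUBDIRS = (
    "intake/inbox",
    "intake/archives",
    "intake/papers",
    "intake/documents",
    "intake/source-workspaces/projects",
    "intake/source-workspaces/single-files",
    "intake/failed",
    "intake/manifests",
)


def ensure_intake_layout(library_root: Path) -> List[Path]:
    """Create any missing intake directories under library_root (safe to call repeatedly)."""
    paths = [library_root / rel for rel in INTAKE_SUBDIRS]
    for path in paths:
        path.mkdir(parents=True, exist_ok=True)
    return paths
```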

Import Pipeline

The long-term ingest layer should normalize multiple source types behind one interface:

local folder
local ZIP archive
GitHub repository URL

Shared ingest stages:

source resolution
raw snapshot acquisition
manifest write
indexing and graph extraction
evaluation/diagnostics hooks

The current internal implementation is split into a first reusable four-stage pipeline:

plan
acquire
materialize
index

This is the boundary future batch import, sync, retry, and admin-agent workflows should extend.
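
The four stages can be pictured as plain functions composed in order. The sketch below is only an illustration of that shape: the stage names come from the list above, while `IngestJob`, `run_pipeline`, and the source-classification rules inside `plan` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IngestJob:
    """Hypothetical state object handed from stage to stage."""
    source: str
    kind: str = ""
    completed: List[str] = field(default_factory=list)


def plan(job: IngestJob) -> IngestJob:
    # Classify the source; these rules are a simplification for the sketch.
    if job.source.startswith("https://github.com/"):
        job.kind = "github"
    elif job.source.lower().endswith(".zip"):
        job.kind = "zip"
    else:
        job.kind = "local_folder"
    job.completed.append("plan")
    return job


def acquire(job: IngestJob) -> IngestJob:
    # A real stage would download or locate the raw snapshot here.
    job.completed.append("acquire")
    return job


def materialize(job: IngestJob) -> IngestJob:
    # A real stage would place the snapshot and write the manifest here.
    job.completed.append("materialize")
    return job


def index(job: IngestJob) -> IngestJob:
    # A real stage would run indexing and graph extraction here.
    job.completed.append("index")
    return job


STAGES = (plan, acquire, materialize, index)


def run_pipeline(source: str) -> IngestJob:
    """Run every stage in order and return the final job state."""
    job = IngestJob(source=source)
    for stage in STAGES:
        job = stage(job)
    return job
```

Because each stage takes and returns the same job object, batch import, retry, or sync logic could wrap individual stages without changing the others.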

Current intake behavior sits one layer above that pipeline:

user drops files into the managed inbox
intake classifies and moves them into protected storage
code imports can be indexed immediately
papers/documents emit lightweight metadata manifests for later context retrieval

ZIP Support

Do not introduce 7z now.

GitHub archive ingest already uses ZIP.
Python zipfile is enough for the first local-archive workflow.
An external unpacker increases dependency surface and weakens the lightweight/local-first goal.
Revisit 7z later only if real format coverage or measured extraction performance justifies it.
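
For reference, a stdlib-only extraction step might look like the sketch below. It uses only `zipfile` and `pathlib`; the function name `safe_extract` and the explicit path-escape check are assumptions about how a local-archive workflow could guard the destination, not a description of the existing ingest code.

```python
import zipfile
from pathlib import Path
from typing import List


def safe_extract(archive, dest: Path) -> List[str]:
    """Extract a ZIP into dest, refusing any member that would land outside dest.

    `archive` may be a filesystem path or a binary file-like object.
    """
    dest = dest.resolve()
    with zipfile.ZipFile(archive) as zf:
        members = zf.infolist()
        # Validate every member before writing anything to disk.
        for info in members:
            target = (dest / info.filename).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe member path: {info.filename}")
        zf.extractall(dest)
    return [info.filename for info in members if not info.is_dir()]
```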

Retrieval Precision Plan

The baseline stays lexical search plus static graph expansion. Improvement work should focus on precision governance instead of broad opaque ranking.

Planned layers:

lexical path/symbol/content match
one-hop graph expansion through imports, callers, callees, tests, routes, and commands
bounded multi-hop expansion only when the first hop is strong
revision-memory relevance and applicability checks
optional reranking plugins after evaluation proves value

Current first V2 retrieval-control slice:

keep the same lexical plus graph backbone
add ambiguity-aware retrieval modes instead of adding a new retrieval stack
first concrete policy is navigation_prune:
- if the query still looks like a navigation flow
- but the top lexical candidate is already plausible enough
- and local goal competition is high
- then reduce lexical breadth, graph depth, and symbol-context fanout
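
The navigation_prune conditions above can be read as a small pure function over a retrieval budget. In this sketch the three conditions and the three reduced knobs follow the list; the `RetrievalBudget` fields, default values, thresholds, and the halving rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RetrievalBudget:
    """Hypothetical retrieval knobs; defaults are placeholders."""
    lexical_breadth: int = 20
    graph_depth: int = 2
    symbol_fanout: int = 8


def navigation_prune(
    budget: RetrievalBudget,
    *,
    looks_like_navigation: bool,
    top_lexical_score: float,
    goal_competition: float,
    plausible_score: float = 0.6,     # assumed threshold
    high_competition: float = 0.5,    # assumed threshold
) -> RetrievalBudget:
    """Shrink the budget only when all three conditions hold; otherwise return it unchanged."""
    if (
        looks_like_navigation
        and top_lexical_score >= plausible_score
        and goal_competition >= high_competition
    ):
        return replace(
            budget,
            lexical_breadth=max(1, budget.lexical_breadth // 2),
            graph_depth=max(1, budget.graph_depth - 1),
            symbol_fanout=max(1, budget.symbol_fanout // 2),
        )
    return budget
```

Expressing the policy as a budget transform keeps the lexical plus graph backbone untouched, which matches the intent of adding a mode rather than a new retrieval stack.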

Multi-Goal Decomposition Baseline

The preferred V2 decomposition design is retrieval-aware and model-free by default.

Default baseline pipeline:

split the user task into short structured subgoals
run shallow lexical probes for each subgoal
detect whether subgoals share the same local workspace region
merge same-region subgoals into one retrieval session
select one primary branch for deep context assembly
retain at most one lightweight backup branch
retain only a micro-context for the backup branch when budget remains
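
The baseline steps above can be sketched without any model dependency. This is a deliberately naive illustration: splitting on connectives, scoring by token overlap, and treating the parent directory as the "workspace region" are all assumptions, as are the function names and the `files` mapping of path to text.

```python
import re
from typing import Dict, List, Optional, Tuple

TOKEN = re.compile(r"[a-z0-9_]+")


def split_subgoals(task: str) -> List[str]:
    """Split a task on simple connectives (naive, model-free)."""
    parts = re.split(r"\s*(?:;|,|\band then\b|\band\b|\bthen\b)\s*", task)
    return [p.strip() for p in parts if p.strip()]


def probe(subgoal: str, files: Dict[str, str]) -> Tuple[str, int]:
    """Shallow lexical probe: best file by token overlap with path and content."""
    tokens = set(TOKEN.findall(subgoal.lower()))
    best_path, best_score = "", 0
    for path, text in sorted(files.items()):
        score = len(tokens & set(TOKEN.findall((path + " " + text).lower())))
        if score > best_score:
            best_path, best_score = path, score
    return best_path, best_score


def region_of(path: str) -> str:
    """Assumption: the parent directory stands in for the workspace region."""
    return path.rsplit("/", 1)[0] if "/" in path else ""


def plan_branches(task: str, files: Dict[str, str]) -> Dict[str, Optional[dict]]:
    """Merge same-region subgoals, then pick one primary and at most one backup branch."""
    regions: Dict[str, dict] = {}
    for subgoal in split_subgoals(task):
        path, score = probe(subgoal, files)
        if score == 0:
            continue  # no lexical evidence, so the subgoal opens no branch
        entry = regions.setdefault(
            region_of(path), {"region": region_of(path), "subgoals": [], "score": 0}
        )
        entry["subgoals"].append(subgoal)
        entry["score"] += score
    ranked = sorted(regions.values(), key=lambda r: (-r["score"], r["region"]))
    return {
        "primary": ranked[0] if ranked else None,
        "backup": ranked[1] if len(ranked) > 1 else None,
    }
```

Only the primary branch would go on to deep context assembly; everything beyond the first backup is dropped, which keeps cost bounded regardless of how many subgoals the task contains.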

This baseline deliberately avoids:

required LLM planning
required embeddings
required semantic rerankers
broad full-parallel graph expansion across every subgoal

The key design objective is not "more branches". It is "earlier commitment to the right local workspace with bounded cost".

Current intended backup-branch form:

one backup file at most
one tiny backup symbol trace at most
one tiny backup neighborhood at most
never full graph expansion for the backup branch in the default baseline
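
The backup-branch limits above amount to a hard cap on each kind of evidence. A minimal sketch, assuming a branch is a dict with the hypothetical keys `files`, `symbol_traces`, and `neighborhoods`:

```python
from typing import Dict, List


def backup_micro_context(branch: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep at most one file, one symbol trace, and one neighborhood.

    No graph expansion happens here; the function only truncates what
    shallow probing already produced.
    """
    return {
        "files": branch.get("files", [])[:1],
        "symbol_traces": branch.get("symbol_traces", [])[:1],
        "neighborhoods": branch.get("neighborhoods", [])[:1],
    }
```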

Optional Enhancement Layers

Future layers may be added behind explicit opt-in interfaces.

Optional planner interface:

can propose structured subgoals from difficult natural-language prompts
can propose dependency order between subgoals
must not replace baseline execution control

Optional semantic interface:

can cluster related subgoals with embeddings
can help merge ambiguous workspace candidates
can rerank candidate groups after the baseline is already measurable

Both interfaces must satisfy one constraint:

disabling them must still leave a complete, evaluated, first-class V2 system
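
One way to honor that constraint is to make the optional provider an argument that defaults to `None`, so the baseline path always runs. The sketch below illustrates the idea for the planner interface only; `Planner`, `propose_subgoals`, `baseline_subgoals`, and `decompose` are hypothetical names, and the semicolon splitter is a stand-in for the real baseline.

```python
from typing import List, Optional, Protocol


class Planner(Protocol):
    """Hypothetical opt-in planner interface."""

    def propose_subgoals(self, task: str) -> List[str]: ...


def baseline_subgoals(task: str) -> List[str]:
    """Stand-in for the model-free baseline splitter."""
    return [part.strip() for part in task.split(";") if part.strip()]


def decompose(task: str, planner: Optional[Planner] = None) -> List[str]:
    """The baseline always runs; a planner may only append suggestions."""
    subgoals = baseline_subgoals(task)
    if planner is not None:
        for extra in planner.propose_subgoals(task):
            if extra not in subgoals:
                subgoals.append(extra)
    return subgoals
```

With this shape, the no-planner call is the default code path rather than a fallback, so it stays exercised by every evaluation run.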

Evaluation requirements:

keep task-level Python and TS/JS evaluations
add real JS/TS repository tasks before adding more languages
measure recall, precision, latency, token budget, and graph-size growth together
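
For task-level evaluation, recall and precision over retrieved files can be computed with the standard set-based definitions. The function below is a generic sketch; the name `precision_recall` and the use of file paths as the unit of relevance are assumptions, not the project's evaluation harness.

```python
from typing import Iterable, Tuple


def precision_recall(retrieved: Iterable[str], relevant: Iterable[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one task; empty sets yield 0.0."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall
```

Reporting both numbers per task, alongside latency and token budget, makes it visible when a change buys recall by retrieving more files at the cost of precision.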

Graph And Memory Model

The current graph is file/symbol/import/call-edge centric. The next expansion should stay modular:

repository node
file node
symbol node
call edge
import edge
route/test/command relationship edge
revision-memory node linked to source symbols/files and applicability evidence

Future admin agents should be able to query this graph without needing direct SQL knowledge.
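
A small traversal API is one way to let agents query relationships without SQL. The in-memory sketch below is illustrative only: the class `CodeGraph`, the `one_hop` method, the node-id strings, and the edge kind names are assumptions, and the real storage is the SQLite index.

```python
from collections import defaultdict
from typing import DefaultDict, List, Optional, Tuple

Edge = Tuple[str, str]  # (edge kind, other node id)


class CodeGraph:
    """Tiny in-memory graph; node ids such as "symbol:login" are illustrative."""

    def __init__(self) -> None:
        self._outgoing: DefaultDict[str, List[Edge]] = defaultdict(list)
        self._incoming: DefaultDict[str, List[Edge]] = defaultdict(list)

    def add_edge(self, src: str, kind: str, dst: str) -> None:
        """Record a directed edge of the given kind in both lookup tables."""
        self._outgoing[src].append((kind, dst))
        self._incoming[dst].append((kind, src))

    def one_hop(self, node: str, kind: Optional[str] = None, incoming: bool = False) -> List[str]:
        """Return direct neighbors in insertion order, optionally filtered by edge kind."""
        edges = self._incoming if incoming else self._outgoing
        return [other for k, other in edges.get(node, []) if kind is None or k == kind]
```

For example, `graph.one_hop("symbol:login", kind="calls", incoming=True)` would list direct callers, and the same call with a revision-memory edge kind would list linked revision nodes.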

MCP Return Protocol

MCP should return minimal actionable structure, not recursive repo dumps.

Recommended tool contract direction:

search_code: lightweight ranked hits with file path, symbol names, snippet, and score
trace_symbol: current definition plus one-hop callers and one-hop callees with excerpts
get_context_pack: selected files, why they were chosen, bounded traces, and related revisions
search_revision_memory: explanation, verification state, applicability notes, signals, and warnings
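
As an illustration of the search_code direction, a response could be a short ranked list of typed hits. The four hit fields follow the description above; the key names, the `limit` parameter, the envelope with a `tool` key, and the function name `search_code_response` are assumptions rather than the actual wire format.

```python
import json
from typing import List, TypedDict


class SearchHit(TypedDict):
    """Hypothetical hit shape based on the fields named above."""
    file_path: str
    symbols: List[str]
    snippet: str
    score: float


def search_code_response(hits: List[SearchHit], limit: int = 5) -> str:
    """Serialize the top-scoring hits as a compact, agent-readable payload."""
    ranked = sorted(hits, key=lambda hit: hit["score"], reverse=True)[:limit]
    return json.dumps({"tool": "search_code", "hits": ranked}, indent=2)
```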

Default MCP expansion rules:

one-hop by default
token-aware excerpts
related files only when selection reasons are explicit
revision memory included only when it is relevant and supported by evidence
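
Token-aware excerpts can be approximated by trimming to whole lines under a budget. The sketch below counts whitespace-separated words as a crude proxy for tokens; that proxy, the whole-line rule, and the name `token_aware_excerpt` are assumptions, and a real implementation would likely use the target model's tokenizer.

```python
def token_aware_excerpt(text: str, max_tokens: int) -> str:
    """Keep leading whole lines while the approximate token count stays within max_tokens."""
    kept = []
    used = 0
    for line in text.splitlines():
        cost = len(line.split())  # whitespace words as a rough token estimate
        if used + cost > max_tokens:
            break
        kept.append(line)
        used += cost
    return "\n".join(kept)
```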

Planned next MCP additions:

response mode presets such as compact, default, and debug
explicit context-pack diagnostics for why files were added or excluded
source provenance blocks for detached libraries and imported archives
stronger paper/document retrieval once PDF extraction quality is improved
future branch-aware diagnostics for multi-goal decomposition, such as:
- subgoal records
- workspace-cluster merges
- primary-branch versus backup-branch selection
- backup-branch micro-context retention

Modularization Requirements

To support future collaboration between a repository-admin agent and a document-knowledge agent, keep these interfaces stable and separable:

ingest: source acquisition and normalization
library: raw/derived/manifest path resolution
parsers: language extraction
graph: relationship storage and traversal
retrieval: search, trace, context-pack assembly
memory: failures, revisions, applicability, verification
mcp: agent-facing protocol surface
ui: human-facing local inspection

Execution Roadmap

Phase 1: Library Boundary

formalize detached-library layout
keep raw sources unchanged
write workspace manifests
centralize path resolution

Status: complete (see "Current execution status" under Library Model).

Phase 2: Unified Import Layer

add a source-ingest abstraction
support local folder, GitHub, and local ZIP input through one pipeline
keep ZIP support on Python stdlib first

Phase 3: Retrieval Governance

broaden JS/TS task-level evaluations
tighten graph expansion thresholds and diagnostics
compare precision/latency tradeoffs before optional reranking
add retrieval-aware multi-goal decomposition without required model dependencies

Phase 4: Graph And Memory Upgrade

promote revision memory to graph-linked entities
add richer route/test/command relationships
expose reusable diagnostics to MCP and UI

Phase 5: Agent And UI Evolution

richer UI views for files, failures, graph relationships, and manifests
optional embedding/reranking providers
admin-agent-facing APIs for repository and memory maintenance
optional planner providers for complex multi-goal task decomposition
