This document defines the next implementation direction after the current pre-alpha MVP.
For the branch-level V2 starting point after the V1 freeze candidate was preserved, see
docs/V2_KICKOFF.md.
CodeWell is moving from a single-project indexer toward a local code knowledge-base system for agents:
- the user can drop in a local folder, a GitHub URL, and later a local ZIP archive
- raw source input must remain untouched
- derived artifacts, indexes, revisions, and logs should live outside the raw source tree when the user chooses detached-library mode
- retrieval should stay local-first, lightweight, and fast by default
- MCP output should be structured for agent action, not just snippet dumping
- Raw sources are read-only inputs.
- Derived artifacts are replaceable caches plus durable memory records.
- Revision memory never mutates the original snippet or repository.
- Local lexical + graph retrieval remains the default baseline.
- Embeddings and LLMs are optional plugins, not required infrastructure.
- Every new retrieval feature must be evaluated for both recall and precision.
- Multi-goal retrieval must have a complete no-LLM no-embedding baseline path.
The target model is Raw / Derived / Manifest.
Raw:
- user-owned source folders
- downloaded but unmodified GitHub snapshots
- later: extracted ZIP imports
Derived:
- SQLite index
- graph/cache artifacts
- revision-memory records
- future retrieval diagnostics and ranking traces
Manifest:
- maps a raw source root to its detached derived location
- stores source kind, source URL/ref/commit, timestamps, and future import metadata
- enables future admin-agent workflows without touching the raw tree
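As a rough illustration of the manifest role described above, a workspace manifest could be modeled as a small serializable record. This is only a sketch: the class name `WorkspaceManifest`, its field names, and the JSON layout are assumptions for illustration, not the format CodeWell actually writes.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class WorkspaceManifest:
    """Hypothetical manifest record; all field names are illustrative."""
    raw_root: str                      # read-only source location
    derived_root: str                  # detached location for indexes and caches
    source_kind: str                   # e.g. "local_folder", "github", "zip"
    source_url: Optional[str] = None
    source_ref: Optional[str] = None
    source_commit: Optional[str] = None
    created_at: str = ""


def build_manifest(raw_root: str, derived_root: str, source_kind: str,
                   **source: Optional[str]) -> WorkspaceManifest:
    """Create a manifest stamped with the current UTC time."""
    return WorkspaceManifest(
        raw_root=raw_root,
        derived_root=derived_root,
        source_kind=source_kind,
        created_at=datetime.now(timezone.utc).isoformat(),
        **source,
    )


def to_json(manifest: WorkspaceManifest) -> str:
    """Serialize the manifest; nothing is written into the raw tree."""
    return json.dumps(asdict(manifest), indent=2, sort_keys=True)
```

Keeping the raw-to-derived mapping in one record means later tooling can locate derived artifacts by reading the manifest alone, without scanning or modifying the raw source tree.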
Current execution status:
- Phase 1 (Library Boundary) is complete — detached-library layout, workspace manifests, and centralized path resolution are implemented.
- Phase 2 (V2 Multi-Goal Retrieval) is complete — ambiguity-aware retrieval with subgoal decomposition, workspace clustering, primary/backup branch selection, and output trimming. Full 3×3 repeated-run evaluation shows net-positive results on all strong_multi_goal tasks.
- `codewell index --library-root <path>` stores derived artifacts outside the raw source tree and writes a workspace manifest.
- Legacy in-place `.codewell/` mode still exists for compatibility.
- The unified ingest dispatcher routes local folders, GitHub URLs, and local ZIP archives through one internal planning layer.
- A managed intake layer exists under the detached library root:
  - `intake/inbox`
  - `intake/archives`
  - `intake/papers`
  - `intake/documents`
  - `intake/source-workspaces/projects`
  - `intake/source-workspaces/single-files`
  - `intake/failed`
  - `intake/manifests`
- Intake can now auto-index imported code workspaces and apply read-only protection, with an additional best-effort Windows ACL layer when available.
The long-term ingest layer should normalize multiple source types behind one interface:
- local folder
- local ZIP archive
- GitHub repository URL
Shared ingest stages:
- source resolution
- raw snapshot acquisition
- manifest write
- indexing and graph extraction
- evaluation/diagnostics hooks
Current internal implementation has now been split into the first reusable four-stage pipeline:
- `plan`
- `acquire`
- `materialize`
- `index`
This is the boundary future batch import, sync, retry, and admin-agent workflows should extend.
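One way to picture the four-stage boundary is as an ordered list of stage functions applied to a shared job record. The sketch below is hypothetical: the `IngestJob` type, the stage bodies, and the source-kind detection rules are assumptions made for illustration, and only the stage names come from this document.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class IngestJob:
    """Hypothetical job record passed through every stage."""
    source: str
    kind: str = ""
    completed: list[str] = field(default_factory=list)


def plan(job: IngestJob) -> None:
    # Assumed detection rules, for illustration only.
    if job.source.startswith("https://github.com/"):
        job.kind = "github"
    elif job.source.lower().endswith(".zip"):
        job.kind = "zip"
    else:
        job.kind = "local_folder"


def acquire(job: IngestJob) -> None:
    """Placeholder: fetch or locate the raw snapshot."""


def materialize(job: IngestJob) -> None:
    """Placeholder: place the snapshot in storage and write the manifest."""


def index(job: IngestJob) -> None:
    """Placeholder: build the index and graph for the workspace."""


STAGES: list[tuple[str, Callable[[IngestJob], None]]] = [
    ("plan", plan),
    ("acquire", acquire),
    ("materialize", materialize),
    ("index", index),
]


def run_pipeline(job: IngestJob) -> IngestJob:
    """Run the stages in order, recording each completed stage name."""
    for name, stage in STAGES:
        stage(job)
        job.completed.append(name)
    return job
```

Recording completed stage names on the job is one simple way a retry or batch workflow could resume after the last successful stage.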
Current intake behavior sits one layer above that pipeline:
- user drops files into the managed inbox
- intake classifies and moves them into protected storage
- code imports can be indexed immediately
- papers/documents emit lightweight metadata manifests for later context retrieval
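The classification step might look roughly like the following sketch. The extension sets and the routing of unrecognized files to `intake/failed` are assumptions for illustration; only the destination folder names come from the intake layout in this document.

```python
from pathlib import PurePath

# Assumed extension groups; the real classifier may use different rules.
ARCHIVE_EXTS = {".zip"}
PAPER_EXTS = {".pdf"}
DOCUMENT_EXTS = {".md", ".txt", ".docx"}
CODE_EXTS = {".py", ".js", ".ts"}


def classify_intake(name: str, is_dir: bool = False) -> str:
    """Return the intake subfolder an inbox entry would be moved to."""
    if is_dir:
        return "intake/source-workspaces/projects"
    ext = PurePath(name).suffix.lower()
    if ext in ARCHIVE_EXTS:
        return "intake/archives"
    if ext in PAPER_EXTS:
        return "intake/papers"
    if ext in DOCUMENT_EXTS:
        return "intake/documents"
    if ext in CODE_EXTS:
        return "intake/source-workspaces/single-files"
    return "intake/failed"
```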
Do not introduce 7z now.
- GitHub archive ingest already uses ZIP.
- Python `zipfile` is enough for the first local-archive workflow.
- An external unpacker increases dependency surface and weakens the lightweight/local-first goal.
- Revisit `7z` later only if real format coverage or measured extraction performance justifies it.
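To show that the standard library covers the first local-archive workflow, here is a minimal sketch of extraction with `zipfile`. The function name `extract_zip_safely` and the explicit path check are illustrative assumptions, not the project's actual code; the check rejects members that would resolve outside the destination before anything is written.

```python
import zipfile
from pathlib import Path


def extract_zip_safely(archive: Path, dest: Path) -> list[Path]:
    """Extract an archive into dest, refusing members that escape dest."""
    dest = dest.resolve()
    with zipfile.ZipFile(archive) as zf:
        members = zf.infolist()
        # Validate every member first so a bad archive extracts nothing.
        for info in members:
            target = (dest / info.filename).resolve()
            if not target.is_relative_to(dest):  # Python 3.9+
                raise ValueError(f"unsafe archive member: {info.filename}")
        dest.mkdir(parents=True, exist_ok=True)
        zf.extractall(dest)
    return [dest / info.filename for info in members if not info.is_dir()]
```

Because the archive is only read and the output goes to a separate destination, this fits the rule that raw inputs stay untouched.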
The baseline stays lexical search plus static graph expansion. Improvement work should focus on precision governance instead of broad opaque ranking.
Planned layers:
- lexical path/symbol/content match
- one-hop graph expansion through imports, callers, callees, tests, routes, and commands
- bounded multi-hop expansion only when the first hop is strong
- revision-memory relevance and applicability checks
- optional reranking plugins after evaluation proves value
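The one-hop and bounded multi-hop layers can be sketched over a plain adjacency map. This is an illustrative assumption about how "strong first hop" might be decided: here a seed's lexical score must reach a hypothetical `strong_score` threshold before its second hop is explored. The function name and threshold value are not from the project.

```python
def expand_candidates(graph: dict[str, list[str]],
                      seeds: dict[str, float],
                      strong_score: float = 0.7) -> dict[str, int]:
    """Return each reached node mapped to its smallest hop distance.

    graph: node -> neighbors (imports, callers, callees, tests, ...).
    seeds: lexical hits mapped to their scores.
    """
    hops: dict[str, int] = {node: 0 for node in seeds}

    def record(node: str, distance: int) -> None:
        # Keep the shortest known distance for each node.
        hops[node] = min(hops.get(node, distance), distance)

    for seed, score in seeds.items():
        for neighbor in graph.get(seed, []):
            record(neighbor, 1)
            # A second hop is allowed only from strong seeds.
            if score >= strong_score:
                for second in graph.get(neighbor, []):
                    record(second, 2)
    return hops
```

With this shape, a weak lexical hit still contributes its direct neighbors but cannot pull in a wider neighborhood, which keeps graph growth tied to evidence strength.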
Current first V2 retrieval-control slice:
- keep the same lexical plus graph backbone
- add ambiguity-aware retrieval modes instead of adding a new retrieval stack
- first concrete policy is `navigation_prune`:
  - if the query still looks like a navigation flow
  - but the top lexical candidate is already plausible enough
  - and local goal competition is high
  - then reduce lexical breadth, graph depth, and symbol-context fanout
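The `navigation_prune` conditions can be read as a guard over a retrieval budget. The sketch below is hypothetical: the `RetrievalBudget` type, the default limits, the two thresholds, and the halving rule are all assumptions chosen for illustration rather than tuned project values.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RetrievalBudget:
    """Hypothetical limits controlled by the retrieval mode."""
    lexical_breadth: int = 20
    graph_depth: int = 2
    symbol_fanout: int = 8


def apply_navigation_prune(budget: RetrievalBudget, *,
                           is_navigation_query: bool,
                           top_lexical_score: float,
                           goal_competition: float,
                           plausible_score: float = 0.6,
                           high_competition: float = 0.5) -> RetrievalBudget:
    """Shrink the budget only when all three conditions hold."""
    if (is_navigation_query
            and top_lexical_score >= plausible_score
            and goal_competition >= high_competition):
        return replace(
            budget,
            lexical_breadth=max(1, budget.lexical_breadth // 2),
            graph_depth=max(1, budget.graph_depth - 1),
            symbol_fanout=max(1, budget.symbol_fanout // 2),
        )
    return budget
```

Expressing the policy as a pure function over a budget keeps it on the same lexical plus graph backbone: the retrieval stack is unchanged and only its limits are adjusted.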
The preferred V2 decomposition design is retrieval-aware and model-free by default.
Default baseline pipeline:
- split the user task into short structured subgoals
- run shallow lexical probes for each subgoal
- detect whether subgoals share the same local workspace region
- merge same-region subgoals into one retrieval session
- select one primary branch for deep context assembly
- retain at most one lightweight backup branch
- retain only a micro-context for the backup branch when budget remains
This baseline deliberately avoids:
- required LLM planning
- required embeddings
- required semantic rerankers
- broad full-parallel graph expansion across every subgoal
The key design objective is not "more branches". It is "earlier commitment to the right local workspace with bounded cost".
Current intended backup-branch form:
- one backup file at most
- one tiny backup symbol trace at most
- one tiny backup neighborhood at most
- never full graph expansion for the backup branch in the default baseline
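The default baseline pipeline and the backup-branch limits above can be sketched end to end without any model. Everything here is a simplifying assumption for illustration: subgoals are split on commas, semicolons, and the word "and"; a probe scores files by token overlap; and a "workspace region" is taken to be the top-level directory of the best-matching file.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath


def split_subgoals(task: str) -> list[str]:
    """Naive model-free split into short subgoals."""
    return [p.strip() for p in re.split(r"\band\b|,|;", task) if p.strip()]


def probe(subgoal: str, index: dict[str, str]):
    """Shallow lexical probe: best (path, score) by token overlap, or None."""
    tokens = set(re.findall(r"[a-z0-9_]+", subgoal.lower()))
    best = None
    for path, text in index.items():
        words = set(re.findall(r"[a-z0-9_]+", f"{path} {text}".lower()))
        score = len(tokens & words)
        if score and (best is None or score > best[1]):
            best = (path, score)
    return best


def plan_branches(task: str, index: dict[str, str]) -> dict:
    """Merge same-region subgoals, then pick one primary and one backup."""
    regions = defaultdict(lambda: {"subgoals": [], "files": [], "score": 0})
    for subgoal in split_subgoals(task):
        hit = probe(subgoal, index)
        if hit is None:
            continue
        path, score = hit
        region = regions[PurePosixPath(path).parts[0]]
        region["subgoals"].append(subgoal)
        region["files"].append(path)
        region["score"] += score
    ranked = sorted(regions.items(), key=lambda kv: kv[1]["score"], reverse=True)
    primary = {"region": ranked[0][0], **ranked[0][1]} if ranked else None
    # The backup keeps a single file only: a micro-context, never a full expansion.
    backup = ({"region": ranked[1][0], "file": ranked[1][1]["files"][0]}
              if len(ranked) > 1 else None)
    return {"primary": primary, "backup": backup}
```

The point of the sketch is the control flow: probes stay shallow, same-region subgoals collapse into one session, and only the primary branch is eligible for deep context assembly.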
Future layers may be added behind explicit opt-in interfaces.
Optional planner interface:
- can propose structured subgoals from difficult natural-language prompts
- can propose dependency order between subgoals
- must not replace baseline execution control
Optional semantic interface:
- can cluster related subgoals with embeddings
- can help merge ambiguous workspace candidates
- can rerank candidate groups after the baseline is already measurable
Both interfaces must satisfy one constraint:
- disabling them must still leave a complete, evaluated, first-class V2 system
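The constraint that disabling a plugin leaves a complete system can be expressed as an optional dependency with a baseline fallback. This is a sketch under assumptions: the `SubgoalPlanner` protocol, the `propose` method, and the comma-based baseline are hypothetical names and rules, not a published CodeWell interface.

```python
from typing import Optional, Protocol


class SubgoalPlanner(Protocol):
    """Hypothetical opt-in planner; it proposes subgoals and controls nothing else."""

    def propose(self, task: str) -> list[str]: ...


def baseline_subgoals(task: str) -> list[str]:
    """Model-free baseline: split on commas."""
    return [part.strip() for part in task.split(",") if part.strip()]


def decompose(task: str, planner: Optional[SubgoalPlanner] = None) -> list[str]:
    """Use the planner when present and useful, else fall back to the baseline."""
    if planner is None:
        return baseline_subgoals(task)
    try:
        proposed = planner.propose(task)
    except Exception:
        # A failing plugin must never break the baseline path.
        return baseline_subgoals(task)
    return proposed or baseline_subgoals(task)
```

Because the planner only returns a list of subgoals, execution control (probing, merging, branch selection) stays in baseline code whether or not a planner is installed.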
Evaluation requirements:
- keep task-level Python and TS/JS evaluations
- add real JS/TS repository tasks before adding more languages
- measure recall, precision, latency, token budget, and graph-size growth together
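For reference, file-level precision and recall for a single task can be computed as below. This is a generic sketch of the standard definitions; the function name and the choice of file paths as the unit of relevance are assumptions, not the project's evaluation harness.

```python
def precision_recall(retrieved: list[str], relevant: list[str]) -> tuple[float, float]:
    """Return (precision, recall) over sets of file paths."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall
```

Reporting both numbers together matters here because wider graph expansion tends to raise recall while lowering precision and increasing token cost.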
The current graph is file/symbol/import/call-edge centric. The next expansion should stay modular:
- repository node
- file node
- symbol node
- call edge
- import edge
- route/test/command relationship edge
- revision-memory node linked to source symbols/files and applicability evidence
Future admin agents should be able to query this graph without needing direct SQL knowledge.
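A thin query facade is one way to keep SQL out of agent code. The sketch below assumes a hypothetical single-table schema `edges(src, dst, kind)` and invented method names; the real CodeWell schema is not described in this document and may differ.

```python
import sqlite3


class GraphQuery:
    """Hypothetical read-only facade over an assumed edges table."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def callers_of(self, symbol: str) -> list[str]:
        rows = self._conn.execute(
            "SELECT src FROM edges WHERE dst = ? AND kind = 'call' ORDER BY src",
            (symbol,),
        )
        return [row[0] for row in rows]

    def callees_of(self, symbol: str) -> list[str]:
        rows = self._conn.execute(
            "SELECT dst FROM edges WHERE src = ? AND kind = 'call' ORDER BY dst",
            (symbol,),
        )
        return [row[0] for row in rows]


def open_demo_graph() -> GraphQuery:
    """Build a tiny in-memory graph for demonstration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edges (src TEXT, dst TEXT, kind TEXT)")
    conn.executemany(
        "INSERT INTO edges VALUES (?, ?, ?)",
        [
            ("cli.main", "parse", "call"),
            ("tests.test_parse", "parse", "call"),
            ("cli.main", "render", "call"),
            ("cli", "parser", "import"),
        ],
    )
    return GraphQuery(conn)
```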
MCP should return minimal actionable structure, not recursive repo dumps.
Recommended tool contract direction:
- `search_code`: lightweight ranked hits with file path, symbol names, snippet, and score
- `trace_symbol`: current definition plus one-hop callers and one-hop callees with excerpts
- `get_context_pack`: selected files, why they were chosen, bounded traces, and related revisions
- `search_revision_memory`: explanation, verification state, applicability notes, signals, and warnings
Default MCP expansion rules:
- one-hop by default
- token-aware excerpts
- related files only when selection reasons are explicit
- revision memory included only when it is relevant and supported by evidence
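To make the one-hop and token-aware rules concrete, here is a sketch of a `trace_symbol`-style payload builder. The payload field names are assumptions for illustration, and tokens are approximated by whitespace-separated words, which is only a rough stand-in for a real tokenizer.

```python
def trim_excerpt(text: str, max_tokens: int) -> str:
    """Cut an excerpt to a word budget, marking the cut with ' ...'."""
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text
    return " ".join(tokens[:max_tokens]) + " ..."


def build_trace_response(symbol: str, definition: str,
                         callers: list[tuple[str, str]],
                         callees: list[tuple[str, str]],
                         max_tokens: int = 40) -> dict:
    """Assemble a one-hop trace with bounded excerpts (hypothetical shape)."""
    def entries(items: list[tuple[str, str]]) -> list[dict]:
        return [{"name": name, "excerpt": trim_excerpt(excerpt, max_tokens)}
                for name, excerpt in items]

    return {
        "symbol": symbol,
        "definition": trim_excerpt(definition, max_tokens),
        "callers": entries(callers),
        "callees": entries(callees),
        "hops": 1,  # one-hop by default
    }
```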
Planned next MCP additions:
- response mode presets such as `compact`, `default`, and `debug`
- explicit context-pack diagnostics for why files were added or excluded
- source provenance blocks for detached libraries and imported archives
- stronger paper/document retrieval once PDF extraction quality is improved
- future branch-aware diagnostics for multi-goal decomposition, such as:
  - subgoal records
  - workspace-cluster merges
  - primary-branch versus backup-branch selection
  - backup-branch micro-context retention
To support future collaboration between a repository-admin agent and a document-knowledge agent, keep these interfaces stable and separable:
- `ingest`: source acquisition and normalization
- `library`: raw/derived/manifest path resolution
- `parsers`: language extraction
- `graph`: relationship storage and traversal
- `retrieval`: search, trace, context-pack assembly
- `memory`: failures, revisions, applicability, verification
- `mcp`: agent-facing protocol surface
- `ui`: human-facing local inspection
- formalize detached-library layout
- keep raw sources unchanged
- write workspace manifests
- centralize path resolution
Status: in progress.
- add a source-ingest abstraction
- support local folder, GitHub, and local ZIP input through one pipeline
- keep ZIP support on Python stdlib first
- broaden JS/TS task-level evaluations
- tighten graph expansion thresholds and diagnostics
- compare precision/latency tradeoffs before optional reranking
- add retrieval-aware multi-goal decomposition without required model dependencies
- promote revision memory to graph-linked entities
- add richer route/test/command relationships
- expose reusable diagnostics to MCP and UI
- richer UI views for files, failures, graph relationships, and manifests
- optional embedding/reranking providers
- admin-agent-facing APIs for repository and memory maintenance
- optional planner providers for complex multi-goal task decomposition