diff --git a/README.md b/README.md index 90a7e2a..6ab0e4c 100644 --- a/README.md +++ b/README.md @@ -1,72 +1,114 @@ +![wikifi](./assets/wikifi.png) + # wikifi -> A Python library that walks a codebase and produces a **technology-agnostic** feature and domain extraction from the source suitable for reflection into a new modern implementaiton which retains it's previous functionality and delivered value to users. +wikifi walks a legacy codebase and writes a **technology-agnostic** wiki of what the system does — domains, entities, flows, integrations, and cross-cutting concerns — extracted from the source with citations back to the lines that prove it. -For the *why* behind wikifi and the design questions still to resolve, see [`VISION.md`](./VISION.md). For the development process, see [`CLAUDE.md`](./CLAUDE.md). +The output is what a migration team needs to re-implement the system on a fresh stack from the wiki alone, without recreating the legacy structure in a new language. -To see the results, view the [.wikifi](./.wikifi/) directory in this repo, which contains the output of running `wikifi walk` on the wikifi codebase itself. +For the full rationale and content contract, see [VISION.md](./VISION.md). To see the output, browse [`.wikifi/`](./.wikifi/) — wikifi run against its own source. -## Install & use -Add wikifi to a target project, then run `init` to bootstrap wikification of that project's source. +## Quickstart ```bash +# 1. Install in the target project uv add wikifi + +# 2. Scaffold .wikifi/ and config uv run wikifi init + +# 3. Walk the codebase +uv run wikifi walk ``` +By default wikifi runs against a **local Ollama** server (Qwen 3 27B at the highest reasoning level the model exposes) — no cloud dependency, no API key, no data leaving the machine. Hosted Anthropic and OpenAI backends are opt-in. + +## What you get + +A `.wikifi/` directory in the target repo containing the synthesized wiki. The on-disk layout is at the implementor's discretion; the **content contract** is fixed and lives in [VISION.md](./VISION.md): + +- **Primary capture** (extracted from source) — domains & subdomains, intent, capabilities, entities, integrations, external dependencies, cross-cutting concerns, hard specifications, and inline schematics. +- **Derivative capture** (synthesized from the aggregate) — personas, user stories, and 10,000-foot diagrams produced *after* primary content is complete. + +Every claim in the wiki carries a numbered citation back to a `SourceRef` (file + line range + content fingerprint). Conflicting evidence across files is preserved in a "Conflicts in source" block rather than silently resolved. + ## CLI -- `init` — one-time setup; scaffolds the `.wikifi/` directory and any local config the implementor chooses to expose. -- `walk` — main entry point. Walks the target codebase and produces the wiki content. - - `--no-cache` — force a clean re-walk; drops the on-disk extraction + aggregation caches. - - `--review` — run the critic + reviser loop on derivative sections (personas, user stories, diagrams). - - `--provider {ollama|anthropic|openai}` — override the configured provider for this walk. -- `report` — print a coverage + quality report (per-section file counts, findings, body sizes). - - `--score` — additionally run the critic on every populated section for a 0-10 quality score. -- `ask` — natural language queries against the wiki content, with optional context injection from the target codebase. -- `chat` — interactive REPL for iterative exploration of the wiki content and the target codebase. - -## Architecture -- **`wikifi/` package** — the library, with the CLI entry point exposed via `[project.scripts] wikifi = "wikifi.cli:main"` in `pyproject.toml`. -- **Repository introspection** — before walking, the agent reviews the target's root structure (manifests, top-level layout, gitignore signals) and decides which paths carry production source worth analyzing. The walk that follows is deterministic — the agent does not re-pick scope mid-walk. -- **Repo graph** (`wikifi/repograph.py`) — a regex-driven static analysis builds an import / reference graph across in-scope files, plus classifies each file's `FileKind` (application code, SQL, OpenAPI, Protobuf, GraphQL, migration, other). Each file's neighborhood is injected into the extraction prompt so per-file findings can describe cross-file flows. -- **Specialized extractors** (`wikifi/specialized/`) — schema files (SQL, OpenAPI, Protobuf, GraphQL, migrations) bypass the LLM entirely and run through deterministic parsers. The structured findings reach the same notes store as LLM output, so the rest of the pipeline is unchanged. -- **Per-file extraction** — for each in-scope file, the agent extracts contributions to each *primary* capture section (see `VISION.md`) into structured findings. Each finding carries a structured `SourceRef` (file + line range + content fingerprint) for downstream citation. -- **Content-addressed cache** (`wikifi/cache.py`) — extraction findings are keyed by `(rel_path, sha256(file_bytes))`; aggregation bodies are keyed by a hash of the section's notes payload. Re-walks skip every file whose fingerprint hasn't changed; resumability after a crash is a free property of the same cache. Use `walk --no-cache` to force a clean re-walk. -- **Input filtering** — the walker recognizes and skips unstructured or near-empty files (stub `__init__` files, empty fixtures, machine-generated artifacts) before they reach the agent. Empty input must never stall the walk. -- **Section synthesis** — primary capture sections are synthesized from the accumulated per-file findings; the aggregator emits a structured `EvidenceBundle` (body + claims + contradictions) and the renderer threads numbered citations + a "Conflicts in source" block into the section markdown. Derivative sections (personas, user stories, diagrams) are produced *after* primary content is complete, taking the synthesized primary content as their input. -- **Critic + reviser** (`wikifi/critic.py`) — opt-in (`walk --review`), runs a quality pass on derivative sections: scores the body against its brief and upstream evidence, identifies unsupported claims, and re-synthesizes when the score is below threshold. Only accepts a revision if it scores at least as well as the original. -- **Coverage + quality report** (`wikifi/report.py`) — `wikifi report` produces a per-section view of files contributing, finding count, body size, and (with `--score`) critic-derived quality scores. -- **Provider abstraction** — the LLM backend is reached through a provider interface. Default is a local Ollama server (`OllamaProvider`); two hosted backends are opt-in: - - `AnthropicProvider` via `WIKIFI_PROVIDER=anthropic` — uses prompt caching with `cache_control: ephemeral` on the system prompt so the multi-KB extraction prompt is paid for once across hundreds of per-file calls. - - `OpenAIProvider` via `WIKIFI_PROVIDER=openai` — relies on OpenAI's automatic prefix caching (no marker required) and routes the `think` knob to `reasoning_effort` on `o*`/`gpt-5` reasoning models. -- **Wiki adapter** — writes the rendered wiki into the target's `.wikifi/` directory. Layout, taxonomy, and structure within `.wikifi/` are at the implementor's discretion, provided the content contract from `VISION.md` is met. +| Command | Purpose | +| --- | --- | +| `wikifi init` | One-time setup. Scaffolds `.wikifi/` and local config. | +| `wikifi walk` | Walks the target codebase and produces the wiki. | +| `wikifi report` | Coverage + quality report (per-section file counts, findings, body sizes). | +| `wikifi ask` | Natural-language queries against the wiki, with optional context injection from the source. | +| `wikifi chat` | Interactive REPL for iterative exploration of the wiki and the source. | -## Tech stack -- **Python 3.12+**, packaged with **`uv`** -- **Local LLM via Ollama** as the default runtime; thinking-capable model (Qwen 3 27B) at the highest available reasoning level. -- **Provider abstraction layer** — Ollama is the default, additional backends slot in without touching the rest of the system. -- **`ruff`** as the single tool for lint and format -- **`pytest` + `pytest-cov`** for tests -- **NiceGUI** if and when wikifi gains a UI surface (mounted on FastAPI, no JS build step) -- **GitHub Actions** for CI +`walk` flags: + +- `--no-cache` — force a clean re-walk; drops the on-disk extraction + aggregation caches. +- `--review` — run the critic + reviser loop on derivative sections (personas, user stories, diagrams). +- `--provider {ollama|anthropic|openai}` — override the configured provider for this walk. + +`report --score` runs the critic on every populated section for a 0–10 quality score. + +## Providers + +The LLM backend is reached through a provider abstraction; swapping it never touches the rest of the system. + +- **`OllamaProvider`** — default. Local server, no cloud dependency. +- **`AnthropicProvider`** — `WIKIFI_PROVIDER=anthropic`. Uses prompt caching with `cache_control: ephemeral` on the system prompt so the multi-KB extraction prompt is paid for once across hundreds of per-file calls. +- **`OpenAIProvider`** — `WIKIFI_PROVIDER=openai`. Relies on OpenAI's automatic prefix caching and routes the `think` knob to `reasoning_effort` on `o*` / `gpt-5` reasoning models. + +## How the walk works + +The walk has four responsibilities, in order: + +1. **Introspect** — review the target's root structure (manifests, top-level layout, gitignore signals) and decide which paths carry production source worth analyzing. The walk that follows is deterministic; the agent does not re-pick scope mid-walk. +2. **Filter** — recognize and skip unstructured or near-empty files (stub `__init__`, empty fixtures, generated lockfiles) before they reach the agent. Empty input must never stall the walk. +3. **Extract** — for each in-scope file, extract structured findings against the primary capture sections in [VISION.md](./VISION.md). Each finding carries a `SourceRef` for downstream citation. +4. **Synthesize** — primary sections are aggregated from the per-file findings into an `EvidenceBundle` (body + claims + contradictions). Derivative sections (personas, user stories, diagrams) are produced *after* primary content is complete, never inferred from a single file. + +Supporting machinery: + +- **Repo graph** ([`wikifi/repograph.py`](./wikifi/repograph.py)) — regex-driven static analysis builds an import / reference graph and classifies each file's `FileKind` (application code, SQL, OpenAPI, Protobuf, GraphQL, migration, other). Each file's neighborhood is injected into the extraction prompt so per-file findings can describe cross-file flows. +- **Specialized extractors** ([`wikifi/specialized/`](./wikifi/specialized/)) — schema files (SQL, OpenAPI, Protobuf, GraphQL, migrations) bypass the LLM and run through deterministic parsers. Structured findings reach the same notes store, so the rest of the pipeline is unchanged. +- **Content-addressed cache** ([`wikifi/cache.py`](./wikifi/cache.py)) — extraction findings are keyed by `(rel_path, sha256(file_bytes))`; aggregation bodies are keyed by a hash of the section's notes payload. Re-walks skip every file whose fingerprint hasn't changed; resumability after a crash is a free property of the same cache. +- **Critic + reviser** ([`wikifi/critic.py`](./wikifi/critic.py)) — opt-in via `walk --review`. Scores derivative sections against their brief and upstream evidence, identifies unsupported claims, and re-synthesizes when the score is below threshold. Only accepts a revision if it scores at least as well as the original. +- **Coverage + quality report** ([`wikifi/report.py`](./wikifi/report.py)) — `wikifi report` produces a per-section view of files contributing, finding count, body size, and (with `--score`) critic-derived quality scores. ## Configuration -wikifi reads its configuration from environment variables (a `.env.example` lands once the surface is finalized). At minimum: -- the LLM provider id and the model identifier +wikifi reads configuration from environment variables. At minimum: + +- the LLM provider id and model identifier - the local Ollama endpoint (when using the default provider) - bounds on file size and stripped-content size, so unstructured or oversized files never reach the agent - the agent's thinking / reasoning level — defaults to the highest the chosen model supports -## Setup (development) +A `.env.example` will land once the surface is finalized. + +## Tech stack + +- **Python 3.12+**, packaged with [`uv`](https://docs.astral.sh/uv/) +- **Local LLM via Ollama** as the default runtime; thinking-capable model at the highest available reasoning level +- **Provider abstraction** — Ollama default; hosted Anthropic and OpenAI slot in without touching the rest of the system +- [`ruff`](https://docs.astral.sh/ruff/) as the single tool for lint and format +- `pytest` + `pytest-cov` for tests (≥85% coverage gate) +- **GitHub Actions** for CI + +## Development + ```bash make hooks # one-time: enables .githooks/ pre-commit + pre-push uv sync # install dependencies make test # run the test suite ``` -See [`CLAUDE.md`](./CLAUDE.md) for the full development process — commands, code rules, agent workflow, and debug escalation. +See [CLAUDE.md](./CLAUDE.md) for the full development process — commands, code rules, agent workflow, and debug escalation. ## Distribution -wikifi ships as a Python library (PyPI / private index); it operates as a CLI invoked from a target project rather than as a server. + +wikifi ships as a Python library (PyPI / private index) and operates as a CLI invoked from a target project rather than as a server. + +## License + +MIT. diff --git a/assets/wikifi.png b/assets/wikifi.png new file mode 100644 index 0000000..08b970d Binary files /dev/null and b/assets/wikifi.png differ