From e285f4b5cf755ba8f37dc4fcb2ff2f287f686c2c Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 1 May 2026 17:31:24 +0000 Subject: [PATCH 1/9] feat: premium-grade extraction & wikification engine MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Lands nine premium features positioned in five shared-module groups: Group A — Caching & resumability - wikifi/fingerprint.py: stable file/text fingerprints - wikifi/cache.py: content-addressed extraction + aggregation cache with atomic persistence; resumability falls out of cache reuse Group B — Evidence & citations - wikifi/evidence.py: SourceRef / Claim / Contradiction models plus a section renderer that threads numbered citations and a "Conflicts in source" block into the final markdown Group C — Repo intelligence - wikifi/repograph.py: regex-driven import graph + FileKind classifier - wikifi/specialized/{sql,openapi,protobuf,graphql}.py: deterministic extractors that bypass the LLM for schema files and migrations Group D — Quality - wikifi/critic.py: critic + reviser loop with score-based revision acceptance; CoverageStats helper - wikifi/report.py + `wikifi report` CLI command for per-section coverage and quality scoring Group E — Premium provider - wikifi/providers/anthropic_provider.py: hosted Anthropic backend with prompt-cached system prompt, messages.parse-based structured output, adaptive-thinking + effort mapping, and APIError → RuntimeError translation Pipeline wiring: - extractor: cache lookup + replay, specialized routing, neighbor context injection, structured SourceRefs on every finding - aggregator: structured EvidenceBundle output (claims + contradictions), notes-hash section cache - deriver: optional critic loop (--review) - orchestrator: graph build, cache load/save with per-file persist, anthropic dispatch - cli: --no-cache, --review, --provider flags; new `report` command Tests + coverage: - 156 tests pass (was 88) - 93% total coverage; every new module ≥ 86% - 
Dedicated test files for fingerprint, cache, evidence, repograph, specialized, critic, report, anthropic_provider - Existing extractor / aggregator / orchestrator / cli suites extended for cache, graph, citations, contradictions, anthropic dispatch, and CLI flag plumbing See TESTING-AND-DEMO.md for end-to-end demo recipes. https://claude.ai/code/session_01K3H5GMhcvfc5HB63NhykcL --- README.md | 16 +- TESTING-AND-DEMO.md | 251 ++++++++++++++++ pyproject.toml | 1 + tests/test_aggregator.py | 95 +++++- tests/test_anthropic_provider.py | 168 +++++++++++ tests/test_cache.py | 141 +++++++++ tests/test_cli.py | 55 +++- tests/test_critic.py | 118 ++++++++ tests/test_evidence.py | 78 +++++ tests/test_extractor.py | 146 +++++++++ tests/test_fingerprint.py | 29 ++ tests/test_orchestrator.py | 76 +++++ tests/test_repograph.py | 85 ++++++ tests/test_report.py | 73 +++++ tests/test_specialized.py | 186 ++++++++++++ uv.lock | 120 ++++++++ wikifi/aggregator.py | 211 +++++++++++-- wikifi/cache.py | 280 +++++++++++++++++ wikifi/cli.py | 77 ++++- wikifi/config.py | 71 ++++- wikifi/critic.py | 242 +++++++++++++++ wikifi/deriver.py | 39 ++- wikifi/evidence.py | 160 ++++++++++ wikifi/extractor.py | 277 +++++++++++++++-- wikifi/fingerprint.py | 48 +++ wikifi/orchestrator.py | 101 +++++-- wikifi/providers/anthropic_provider.py | 235 +++++++++++++++ wikifi/repograph.py | 397 +++++++++++++++++++++++++ wikifi/report.py | 156 ++++++++++ wikifi/specialized/__init__.py | 58 ++++ wikifi/specialized/graphql.py | 120 ++++++++ wikifi/specialized/openapi.py | 188 ++++++++++++ wikifi/specialized/protobuf.py | 106 +++++++ wikifi/specialized/sql.py | 225 ++++++++++++++ 34 files changed, 4543 insertions(+), 86 deletions(-) create mode 100644 TESTING-AND-DEMO.md create mode 100644 tests/test_anthropic_provider.py create mode 100644 tests/test_cache.py create mode 100644 tests/test_critic.py create mode 100644 tests/test_evidence.py create mode 100644 tests/test_fingerprint.py create mode 100644 
tests/test_repograph.py create mode 100644 tests/test_report.py create mode 100644 tests/test_specialized.py create mode 100644 wikifi/cache.py create mode 100644 wikifi/critic.py create mode 100644 wikifi/evidence.py create mode 100644 wikifi/fingerprint.py create mode 100644 wikifi/providers/anthropic_provider.py create mode 100644 wikifi/repograph.py create mode 100644 wikifi/report.py create mode 100644 wikifi/specialized/__init__.py create mode 100644 wikifi/specialized/graphql.py create mode 100644 wikifi/specialized/openapi.py create mode 100644 wikifi/specialized/protobuf.py create mode 100644 wikifi/specialized/sql.py diff --git a/README.md b/README.md index 22309b4..bf4892e 100644 --- a/README.md +++ b/README.md @@ -19,16 +19,26 @@ uv run wikifi init - `init` — one-time setup; scaffolds the `.wikifi/` directory and any local config the implementor chooses to expose. - `walk` — main entry point. Walks the target codebase and produces the wiki content. + - `--no-cache` — force a clean re-walk; drops the on-disk extraction + aggregation caches. + - `--review` — run the critic + reviser loop on derivative sections (personas, user stories, diagrams). + - `--provider {ollama|anthropic}` — override the configured provider for this walk. +- `report` — print a coverage + quality report (per-section file counts, findings, body sizes). + - `--score` — additionally run the critic on every populated section for a 0-10 quality score. - `ask` — natural language queries against the wiki content, with optional context injection from the target codebase. - `chat` — interactive REPL for iterative exploration of the wiki content and the target codebase. ## Architecture - **`wikifi/` package** — the library, with the CLI entry point exposed via `[project.scripts] wikifi = "wikifi.cli:main"` in `pyproject.toml`. 
- **Repository introspection** — before walking, the agent reviews the target's root structure (manifests, top-level layout, gitignore signals) and decides which paths carry production source worth analyzing. The walk that follows is deterministic — the agent does not re-pick scope mid-walk. -- **Per-file extraction** — for each in-scope file, the agent extracts contributions to each *primary* capture section (see `VISION.md`) into structured findings. +- **Repo graph** (`wikifi/repograph.py`) — a regex-driven static analysis builds an import / reference graph across in-scope files, plus classifies each file's `FileKind` (application code, SQL, OpenAPI, Protobuf, GraphQL, migration, other). Each file's neighborhood is injected into the extraction prompt so per-file findings can describe cross-file flows. +- **Specialized extractors** (`wikifi/specialized/`) — schema files (SQL, OpenAPI, Protobuf, GraphQL, migrations) bypass the LLM entirely and run through deterministic parsers. The structured findings reach the same notes store as LLM output, so the rest of the pipeline is unchanged. +- **Per-file extraction** — for each in-scope file, the agent extracts contributions to each *primary* capture section (see `VISION.md`) into structured findings. Each finding carries a structured `SourceRef` (file + line range + content fingerprint) for downstream citation. +- **Content-addressed cache** (`wikifi/cache.py`) — extraction findings are keyed by `(rel_path, sha256(file_bytes))`; aggregation bodies are keyed by a hash of the section's notes payload. Re-walks skip every file whose fingerprint hasn't changed; resumability after a crash is a free property of the same cache. Use `walk --no-cache` to force a clean re-walk. - **Input filtering** — the walker recognizes and skips unstructured or near-empty files (stub `__init__` files, empty fixtures, machine-generated artifacts) before they reach the agent. Empty input must never stall the walk. 
-- **Section synthesis** — primary capture sections are synthesized from the accumulated per-file findings; derivative sections (personas, user stories, diagrams) are produced *after* primary content is complete, taking the synthesized primary content as their input. -- **Provider abstraction** — the LLM backend is reached through a provider interface. Default is a local Ollama server; alternative providers (hosted Anthropic, hosted OpenAI, custom) plug in by implementing the same interface. +- **Section synthesis** — primary capture sections are synthesized from the accumulated per-file findings; the aggregator emits a structured `EvidenceBundle` (body + claims + contradictions) and the renderer threads numbered citations + a "Conflicts in source" block into the section markdown. Derivative sections (personas, user stories, diagrams) are produced *after* primary content is complete, taking the synthesized primary content as their input. +- **Critic + reviser** (`wikifi/critic.py`) — opt-in (`walk --review`), runs a quality pass on derivative sections: scores the body against its brief and upstream evidence, identifies unsupported claims, and re-synthesizes when the score is below threshold. Only accepts a revision if it scores at least as well as the original. +- **Coverage + quality report** (`wikifi/report.py`) — `wikifi report` produces a per-section view of files contributing, finding count, body size, and (with `--score`) critic-derived quality scores. +- **Provider abstraction** — the LLM backend is reached through a provider interface. Default is a local Ollama server (`OllamaProvider`); the hosted Anthropic backend (`AnthropicProvider`) is opt-in via `WIKIFI_PROVIDER=anthropic` and uses prompt caching with `cache_control: ephemeral` on the system prompt so the multi-KB extraction prompt is written to the cache once and read at the discounted cache-read rate by later per-file calls within the cache window. - **Wiki adapter** — writes the rendered wiki into the target's `.wikifi/` directory. 
Layout, taxonomy, and structure within `.wikifi/` are at the implementor's discretion, provided the content contract from `VISION.md` is met. ## Tech stack diff --git a/TESTING-AND-DEMO.md b/TESTING-AND-DEMO.md new file mode 100644 index 0000000..4bcc8a9 --- /dev/null +++ b/TESTING-AND-DEMO.md @@ -0,0 +1,251 @@ +# Testing & demoing the premium pipeline + +This document covers how to verify and demo the nine premium features +landed in this PR. Every step works from a clean clone — no external +service required for the test suite, and only Ollama (default) or an +Anthropic API key (opt-in) for the live demos. + +## Prerequisites + +```bash +make hooks # one-time, enables the pre-commit + pre-push hooks +uv sync # installs anthropic + the other deps (already in uv.lock) +``` + +## Running the test suite + +```bash +make test # runs pytest with coverage +``` + +Expectations: +- **156 tests pass.** +- **Total coverage ≥ 93%.** Every new module is at or above 86%; the + premium-pipeline modules — `fingerprint`, `cache`, `evidence`, + `critic`, `report`, `repograph`, `specialized/*`, + `providers/anthropic_provider` — each carry a dedicated test file. + +To run only the suites for the new functionality: + +```bash +uv run pytest tests/test_fingerprint.py tests/test_cache.py tests/test_evidence.py \ + tests/test_repograph.py tests/test_specialized.py tests/test_critic.py \ + tests/test_report.py tests/test_anthropic_provider.py -v --no-cov +``` + +## Demoing each feature + +The demos below assume a working Ollama install with the model from +`.wikifi/config.toml` (default `qwen3.6:27b`). If you want the hosted +Anthropic path instead, set `ANTHROPIC_API_KEY` and pass +`--provider anthropic` to the relevant commands; everything else is +identical. + +### 1. Source-traceable citations + 5. Contradiction surfacing + +Run a walk against this repo: + +```bash +make init # one-time; idempotent +make walk +``` + +Open `.wikifi/
.md` for any populated primary section +(`entities.md`, `capabilities.md`, `cross_cutting.md`, …). At the bottom +you should see: + +``` +## Sources +1. `wikifi/extractor.py:115-187` +2. `wikifi/aggregator.py:54-79` +… +``` + +Where the aggregator detected disagreement across files, the section +also carries a `## Conflicts in source` block enumerating each +position with its sources. Search for it via: + +```bash +rg -n '^## Conflicts in source' .wikifi/ +``` + +(For unit-level evidence: `tests/test_evidence.py` exercises citation +rendering and contradiction rendering directly; `tests/test_aggregator.py` +covers the end-to-end "claim → SourceRef" resolution.) + +### 2. Incremental walks (content-addressed cache) + 11. Resumability + +Run a walk, then run it again immediately: + +```bash +make walk # first walk: extracts every in-scope file +make walk # second walk: cache_hits == files_seen +``` + +The second invocation prints `cache_hits=N` in the **Extraction** row +of the walk report — that's the number of files served from the cache +without an LLM call. + +To force a clean re-walk: + +```bash +uv run wikifi walk --no-cache +``` + +Resumability is the same mechanism: the cache is persisted after every +file finishes, so a `Ctrl-C` mid-walk loses no completed work — the next +`wikifi walk` continues from the file that was in flight when the +walk was interrupted. + +(Unit evidence: `tests/test_cache.py`, plus +`test_run_walk_persists_cache_for_resumability` in +`tests/test_orchestrator.py`.) + +### 3. Cross-file context (import graph) + +Open the live extraction prompt for any application file. 
The walker +includes a `Neighbor files` block listing files this one imports from +or is imported by: + +```bash +uv run wikifi walk -v 2>&1 | rg -A3 "Neighbor files" | head +``` + +You can also inspect the graph directly: + +```python +from pathlib import Path +from wikifi.repograph import build_graph +from wikifi.walker import WalkConfig, iter_files + +config = WalkConfig(root=Path(".")) +files = list(iter_files(config)) +graph = build_graph(repo_root=Path("."), files=files) +node = graph.get("wikifi/aggregator.py") +print(node.imports) # ('wikifi/cache.py', 'wikifi/evidence.py', ...) +print(node.imported_by) # ('wikifi/orchestrator.py', ...) +``` + +(Unit evidence: `tests/test_repograph.py`, plus +`test_extract_repo_injects_neighbor_context_when_graph_supplied` in +`tests/test_extractor.py`.) + +### 4. Type-aware extractors (SQL / OpenAPI / Protobuf / GraphQL / migrations) + +Drop a SQL file into a target project and run a walk: + +```bash +mkdir -p /tmp/demo && cd /tmp/demo && git init -q +cat > schema.sql <<'EOF' +CREATE TABLE customer ( + id INTEGER PRIMARY KEY, + email VARCHAR(255) UNIQUE NOT NULL +); +CREATE TABLE orders ( + id INTEGER PRIMARY KEY, + customer_id INTEGER REFERENCES customer(id), + total INTEGER NOT NULL +); +CREATE INDEX idx_orders_customer ON orders (customer_id); +EOF +uv run --project /home/user/wikifi wikifi init +uv run --project /home/user/wikifi wikifi walk +``` + +The walk report's **Extraction** row shows `specialized=1`. The +findings produced for `entities.md`, `integrations.md` (the FK), and +`cross_cutting.md` (the index + UNIQUE invariants) come from the +deterministic SQL parser — no LLM call was made for `schema.sql`. + +The same routing covers `*.proto`, `*.graphql`, OpenAPI YAML / JSON +specs, and any SQL file under `migrations/` / `alembic/` / +`db/migrate/` directories. 
+ +(Unit evidence: `tests/test_specialized.py` covers each parser; +`test_extract_repo_routes_sql_through_specialized_extractor` in +`tests/test_extractor.py` covers the end-to-end routing.) + +### 6. Critic + reviser pass on derivatives + +Re-run the walk with `--review`: + +```bash +uv run wikifi walk --review +``` + +The walk report shows `sections_revised=N` in the **Derivation** row — +that's how many derivative sections (personas / user stories / +diagrams) the critic flagged as below the score threshold and the +reviser improved. + +(Unit evidence: `tests/test_critic.py` covers the critic loop and the +"only accept revision if it scores at least as well" guard. Integration: +`test_run_walk_review_flag_invokes_critic`.) + +### 8. Coverage + quality report + +After a walk: + +```bash +uv run wikifi report # purely structural; no LLM calls +uv run wikifi report --score # adds critic-derived quality scores +``` + +Output is a markdown table of every section (files contributing, +findings count, body size, score, headline gap): + +``` +| Section | Files | Findings | Body | Score | Headline gap | +| --- | --- | --- | --- | --- | --- | +| `entities` | 12 | 47 | 5132 | 9/10 | — | +| `cross_cutting` | 4 | 9 | 1421 | 6/10 | unsupported: rate-limit policy | +… +``` + +(Unit evidence: `tests/test_report.py`.) + +### 9. Anthropic provider with prompt caching + +Set the API key and switch the provider for a walk: + +```bash +export ANTHROPIC_API_KEY=sk-ant-... +WIKIFI_PROVIDER=anthropic uv run wikifi walk +# or: +uv run wikifi walk --provider anthropic +``` + +The provider sets `cache_control: {"type": "ephemeral"}` on the system +prompt block. After the first per-file extraction call writes the +cache, subsequent calls within the cache window read it for ~10% of +the input price. 
+ +To verify caching is active in the wild, intercept the SDK's response: + +```python +from wikifi.providers.anthropic_provider import AnthropicProvider +provider = AnthropicProvider(model="claude-opus-4-7", think="high") +# After two calls with the same system prompt: +# response.usage.cache_read_input_tokens > 0 +``` + +(Unit evidence: `tests/test_anthropic_provider.py` locks in the +`cache_control` placement, the `messages.parse` structured-output +contract, the thinking → effort translation, and the APIError → +RuntimeError mapping. `test_build_provider_returns_anthropic_when_selected` +in `tests/test_orchestrator.py` covers dispatch.) + +## Tearing down + +The premium-pipeline state lives entirely under `.wikifi/`: + +``` +.wikifi/ + config.toml + *.md # rendered sections (committable) + .notes/ # per-section JSONL findings (gitignored) + .cache/ # extraction + aggregation caches (gitignored) +``` + +Delete `.wikifi/.cache/` to drop the cache and force a full re-walk; +delete the whole directory to start over. 
diff --git a/pyproject.toml b/pyproject.toml index 7394b57..9496eca 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -14,6 +14,7 @@ dependencies = [ "typer>=0.12", "rich>=13.7", "pathspec>=0.12", + "anthropic>=0.40", ] [project.scripts] diff --git a/tests/test_aggregator.py b/tests/test_aggregator.py index 0bd3e5e..a62fed2 100644 --- a/tests/test_aggregator.py +++ b/tests/test_aggregator.py @@ -1,6 +1,12 @@ -from wikifi.aggregator import SectionBody, aggregate_all +from wikifi.aggregator import ( + AggregatedClaim, + AggregatedContradiction, + SectionBody, + aggregate_all, +) +from wikifi.cache import WalkCache, hash_section_notes from wikifi.sections import PRIMARY_SECTIONS -from wikifi.wiki import WikiLayout, append_note, initialize +from wikifi.wiki import WikiLayout, append_note, initialize, read_notes def _setup(tmp_path): @@ -68,3 +74,88 @@ def raiser(schema, system, user): body = layout.section_path(section).read_text() assert "Aggregation failed" in body assert "Order line item." 
in body # raw notes preserved + + +def test_aggregate_renders_citations_and_contradictions(tmp_path, mock_provider_factory): + layout = _setup(tmp_path) + section = PRIMARY_SECTIONS[0] + append_note( + layout, + section, + { + "file": "a.py", + "summary": "domain", + "finding": "Tax computed at order time.", + "sources": [{"file": "src/order.py", "lines": [10, 25], "fingerprint": "abc"}], + }, + ) + append_note( + layout, + section, + { + "file": "b.py", + "summary": "domain", + "finding": "Tax computed at invoice time.", + "sources": [{"file": "src/invoice.py", "lines": [5, 12], "fingerprint": "def"}], + }, + ) + + structured = SectionBody( + body="The system computes tax somewhere.", + claims=[AggregatedClaim(text="Tax computation lives at the boundary.", source_indices=[1, 2])], + contradictions=[ + AggregatedContradiction( + summary="Where tax is computed.", + positions=[ + AggregatedClaim(text="At order time.", source_indices=[1]), + AggregatedClaim(text="At invoice time.", source_indices=[2]), + ], + ) + ], + ) + + provider = mock_provider_factory( + json_factory=lambda schema, system, user: structured, + ) + aggregate_all(layout=layout, provider=provider) + body = layout.section_path(section).read_text() + assert "Conflicts in source" in body + assert "src/order.py:10-25" in body + assert "src/invoice.py:5-12" in body + assert "## Sources" in body + + +def test_aggregate_uses_cache_to_skip_unchanged_notes(tmp_path, mock_provider_factory): + layout = _setup(tmp_path) + section = PRIMARY_SECTIONS[0] + append_note(layout, section, {"file": "a.py", "summary": "x", "finding": "Order entity."}) + + cache = WalkCache() + notes_hash = hash_section_notes(read_notes(layout, section)) + cache.record_aggregation( + section.id, + notes_hash=notes_hash, + body="Cached body for the section.", + claims=[], + contradictions=[], + ) + + provider = mock_provider_factory() # no responses queued — must not be called + stats = aggregate_all(layout=layout, provider=provider, 
cache=cache) + + body = layout.section_path(section).read_text() + assert "Cached body for the section." in body + assert stats.sections_cached == 1 + + +def test_aggregate_records_cache_entry_after_synthesis(tmp_path, mock_provider_factory): + layout = _setup(tmp_path) + section = PRIMARY_SECTIONS[0] + append_note(layout, section, {"file": "a.py", "summary": "x", "finding": "Order."}) + + cache = WalkCache() + provider = mock_provider_factory( + json_factory=lambda schema, system, user: SectionBody(body="Synthesized body."), + ) + aggregate_all(layout=layout, provider=provider, cache=cache) + assert section.id in cache.aggregation diff --git a/tests/test_anthropic_provider.py b/tests/test_anthropic_provider.py new file mode 100644 index 0000000..8abb298 --- /dev/null +++ b/tests/test_anthropic_provider.py @@ -0,0 +1,168 @@ +"""AnthropicProvider tests. + +The HTTP transport is mocked via the ``client=`` injection point so the +test never touches the network. The point is to lock in the wikifi +contract: prompt caching on the system prompt, structured output via +``messages.parse``, and APIError → RuntimeError mapping. 
+""" + +from __future__ import annotations + +from types import SimpleNamespace + +import anthropic +import pytest +from pydantic import BaseModel + +from wikifi.providers.anthropic_provider import AnthropicProvider + + +class _Echo(BaseModel): + value: str + + +class _StubClient: + """Minimal stand-in for ``anthropic.Anthropic`` exposing ``messages``.""" + + def __init__( + self, + *, + parse_response=None, + create_response=None, + raise_on_parse: Exception | None = None, + raise_on_create: Exception | None = None, + ) -> None: + self.parse_calls: list[dict] = [] + self.create_calls: list[dict] = [] + self._parse_response = parse_response + self._create_response = create_response + self._raise_on_parse = raise_on_parse + self._raise_on_create = raise_on_create + self.messages = SimpleNamespace(parse=self._parse, create=self._create) + + def _parse(self, **kwargs): + self.parse_calls.append(kwargs) + if self._raise_on_parse is not None: + raise self._raise_on_parse + return self._parse_response + + def _create(self, **kwargs): + self.create_calls.append(kwargs) + if self._raise_on_create is not None: + raise self._raise_on_create + return self._create_response + + +def _api_error(message: str = "boom", request_id: str = "req_abc") -> anthropic.APIError: + """Build an APIError without going through the real httpx wiring.""" + err = anthropic.APIError.__new__(anthropic.APIError) + err.message = message + err.request_id = request_id + err.args = (message,) + return err + + +def test_complete_json_passes_cache_control_and_returns_pydantic(): + parsed = _Echo(value="hello") + response = SimpleNamespace(parsed_output=parsed, content=[]) + client = _StubClient(parse_response=response) + + provider = AnthropicProvider(model="claude-opus-4-7", client=client, think="high") + result = provider.complete_json(system="SYS", user="USR", schema=_Echo) + + assert result == parsed + call = client.parse_calls[0] + assert call["model"] == "claude-opus-4-7" + assert 
call["output_format"] is _Echo + assert call["messages"] == [{"role": "user", "content": "USR"}] + # System prompt must be a list with a cache_control marker. + system = call["system"] + assert isinstance(system, list) + assert system[0]["cache_control"] == {"type": "ephemeral"} + assert system[0]["text"] == "SYS" + # think="high" → adaptive thinking + effort. + assert call["thinking"] == {"type": "adaptive"} + assert call["output_config"] == {"effort": "high"} + + +def test_complete_json_falls_back_to_validate_json_when_parsed_output_missing(): + response = SimpleNamespace( + parsed_output=None, + content=[SimpleNamespace(type="text", text='{"value": "fallback"}')], + ) + client = _StubClient(parse_response=response) + provider = AnthropicProvider(client=client) + out = provider.complete_json(system="s", user="u", schema=_Echo) + assert out == _Echo(value="fallback") + + +def test_complete_json_raises_runtime_error_on_api_error(): + client = _StubClient(raise_on_parse=_api_error("rate-limited", "req_xyz")) + provider = AnthropicProvider(client=client) + with pytest.raises(RuntimeError) as info: + provider.complete_json(system="s", user="u", schema=_Echo) + assert "req_xyz" in str(info.value) + assert "rate-limited" in str(info.value) + + +def test_complete_text_extracts_first_text_block(): + response = SimpleNamespace(content=[SimpleNamespace(type="text", text="hi")]) + client = _StubClient(create_response=response) + provider = AnthropicProvider(client=client) + assert provider.complete_text(system="s", user="u") == "hi" + + +def test_complete_text_returns_empty_when_no_text_block(): + response = SimpleNamespace(content=[]) + client = _StubClient(create_response=response) + provider = AnthropicProvider(client=client) + assert provider.complete_text(system="s", user="u") == "" + + +def test_chat_forwards_messages_and_caches_system(): + response = SimpleNamespace(content=[SimpleNamespace(type="text", text="hello back")]) + client = 
_StubClient(create_response=response) + provider = AnthropicProvider(client=client, think=False) + out = provider.chat( + system="SYS", + messages=[{"role": "user", "content": "first"}], + ) + assert out == "hello back" + call = client.create_calls[0] + assert call["messages"] == [{"role": "user", "content": "first"}] + assert call["system"][0]["cache_control"] == {"type": "ephemeral"} + # think=False → thinking disabled, no effort. + assert call["thinking"] == {"type": "disabled"} + assert "output_config" not in call + + +def test_thinking_kwargs_translation_table(): + """Lock the think-knob → request mapping so the contract is testable.""" + client = _StubClient(create_response=SimpleNamespace(content=[])) + cases = [ + ("low", {"thinking": {"type": "adaptive"}, "output_config": {"effort": "low"}}), + ("medium", {"thinking": {"type": "adaptive"}, "output_config": {"effort": "medium"}}), + ("high", {"thinking": {"type": "adaptive"}, "output_config": {"effort": "high"}}), + ("max", {"thinking": {"type": "adaptive"}, "output_config": {"effort": "max"}}), + (True, {"thinking": {"type": "adaptive"}}), + (False, {"thinking": {"type": "disabled"}}), + ("off", {"thinking": {"type": "disabled"}}), + ] + for think, expected in cases: + provider = AnthropicProvider(client=client, think=think) + # Reset the recorded calls between cases. 
+ client.create_calls.clear() + provider.complete_text(system="s", user="u") + call = client.create_calls[-1] + for key, value in expected.items(): + assert call.get(key) == value, f"think={think!r}: expected {key}={value}" + if "output_config" not in expected: + assert "output_config" not in call + + +def test_cache_system_prompt_off_returns_plain_string(): + response = SimpleNamespace(content=[]) + client = _StubClient(create_response=response) + provider = AnthropicProvider(client=client, cache_system_prompt=False) + provider.complete_text(system="SYS", user="u") + assert client.create_calls[0]["system"] == "SYS" diff --git a/tests/test_cache.py b/tests/test_cache.py new file mode 100644 index 0000000..c515c3b --- /dev/null +++ b/tests/test_cache.py @@ -0,0 +1,141 @@ +"""Cache layer tests.""" + +from __future__ import annotations + +from pathlib import Path + +from wikifi.cache import ( + CACHE_VERSION, + WalkCache, + aggregation_cache_path, + extraction_cache_path, + hash_section_notes, + load, + reset, + save, +) +from wikifi.wiki import WikiLayout, initialize + + +def _layout(tmp_path: Path) -> WikiLayout: + layout = WikiLayout(root=tmp_path) + initialize(layout, model="m", provider="ollama", ollama_host="http://h") + return layout + + +def test_extraction_cache_hit_and_miss(tmp_path: Path): + layout = _layout(tmp_path) + cache = load(layout) + assert cache.lookup_extraction("a.py", "abc") is None + assert cache.extraction_misses == 1 + + cache.record_extraction( + "a.py", + fingerprint="abc", + findings=[{"section_id": "entities", "finding": "x", "sources": []}], + summary="role", + chunks_processed=1, + ) + hit = cache.lookup_extraction("a.py", "abc") + assert hit is not None + assert hit.fingerprint == "abc" + assert cache.extraction_hits == 1 + + +def test_extraction_cache_invalidated_on_fingerprint_change(tmp_path: Path): + layout = _layout(tmp_path) + cache = load(layout) + cache.record_extraction("a.py", fingerprint="old", findings=[], summary="", 
chunks_processed=0) + assert cache.lookup_extraction("a.py", "new") is None + + +def test_aggregation_cache_round_trip(tmp_path: Path): + layout = _layout(tmp_path) + cache = load(layout) + cache.record_aggregation( + "entities", + notes_hash="h1", + body="body", + claims=[{"text": "c", "sources": []}], + contradictions=[], + ) + hit = cache.lookup_aggregation("entities", "h1") + assert hit is not None + assert hit.body == "body" + assert cache.lookup_aggregation("entities", "h2") is None + + +def test_save_and_load_round_trip(tmp_path: Path): + layout = _layout(tmp_path) + cache = WalkCache() + cache.record_extraction( + "src/a.py", + fingerprint="abc123", + findings=[{"section_id": "entities", "finding": "x", "sources": []}], + summary="role", + chunks_processed=2, + ) + cache.record_aggregation("entities", notes_hash="hh", body="body", claims=[], contradictions=[]) + save(layout, cache) + assert extraction_cache_path(layout).exists() + assert aggregation_cache_path(layout).exists() + + loaded = load(layout) + assert loaded.lookup_extraction("src/a.py", "abc123") is not None + assert loaded.lookup_aggregation("entities", "hh") is not None + + +def test_reset_clears_disk_files(tmp_path: Path): + layout = _layout(tmp_path) + cache = WalkCache() + cache.record_extraction("a.py", fingerprint="x", findings=[], summary="", chunks_processed=0) + save(layout, cache) + reset(layout) + assert not extraction_cache_path(layout).exists() + assert not aggregation_cache_path(layout).exists() + + +def test_load_returns_empty_when_file_missing(tmp_path: Path): + layout = _layout(tmp_path) + cache = load(layout) + assert cache.extraction == {} + assert cache.aggregation == {} + + +def test_load_drops_bad_version(tmp_path: Path): + layout = _layout(tmp_path) + extraction_cache_path(layout).parent.mkdir(parents=True, exist_ok=True) + extraction_cache_path(layout).write_text('{"version": 999, "entries": {"a.py": {"fingerprint": "abc"}}}') + cache = load(layout) + assert 
cache.extraction == {} + + +def test_prune_extraction_drops_out_of_scope_files(tmp_path: Path): + _layout(tmp_path) # ensures cache dir exists; entries are exercised below + cache = WalkCache() + for path in ("keep.py", "drop.py"): + cache.record_extraction(path, fingerprint="x", findings=[], summary="", chunks_processed=0) + removed = cache.prune_extraction(keep={"keep.py"}) + assert removed == 1 + assert "keep.py" in cache.extraction + assert "drop.py" not in cache.extraction + + +def test_hash_section_notes_is_stable(): + notes = [ + {"file": "a.py", "summary": "x", "finding": "y", "timestamp": "t1"}, + {"file": "b.py", "summary": "x", "finding": "z", "timestamp": "t2"}, + ] + same = [ + {"file": "a.py", "summary": "x", "finding": "y", "timestamp": "t99"}, + {"file": "b.py", "summary": "x", "finding": "z", "timestamp": "t100"}, + ] + assert hash_section_notes(notes) == hash_section_notes(same) + different = [{"file": "a.py", "summary": "x", "finding": "DIFFERENT"}] + assert hash_section_notes(notes) != hash_section_notes(different) + + +def test_cache_version_is_pinned(): + """Bumps to CACHE_VERSION should be intentional — guard against drift.""" + assert isinstance(CACHE_VERSION, int) + assert CACHE_VERSION >= 1 diff --git a/tests/test_cli.py b/tests/test_cli.py index 8d02b5b..36e3829 100644 --- a/tests/test_cli.py +++ b/tests/test_cli.py @@ -2,7 +2,7 @@ from wikifi import __version__ from wikifi.cli import app -from wikifi.wiki import WikiLayout, initialize +from wikifi.wiki import WikiLayout, initialize, write_section def test_version_flag(): @@ -49,6 +49,59 @@ def test_chat_command_errors_when_wiki_missing(tmp_path): assert "No .wikifi/" in result.output +def test_report_command_errors_when_wiki_missing(tmp_path): + runner = CliRunner() + result = runner.invoke(app, ["report", str(tmp_path)]) + assert result.exit_code == 1 + assert "No .wikifi/" in result.output + + +def test_report_command_renders_table(tmp_path): + layout = WikiLayout(root=tmp_path) + 
initialize(layout, model="m", provider="ollama", ollama_host="http://h") + write_section(layout, "intent", "Some intent body.") + + runner = CliRunner() + result = runner.invoke(app, ["report", str(tmp_path)]) + assert result.exit_code == 0, result.output + # Markdown rendered through rich; check for the header text. + assert "wikifi coverage" in result.output.lower() or "section" in result.output.lower() + + +def test_walk_no_cache_flag_clears_cache_dir(tmp_path, monkeypatch): + """`walk --no-cache` triggers the cache-reset path before the run starts.""" + layout = WikiLayout(root=tmp_path) + initialize(layout, model="m", provider="ollama", ollama_host="http://h") + cache_path = layout.wiki_dir / ".cache" / "extraction.json" + cache_path.parent.mkdir(parents=True, exist_ok=True) + cache_path.write_text('{"version": 1, "entries": {}}') + + captured = {} + + def fake_run_walk(*, root, settings, provider=None): + captured["use_cache"] = settings.use_cache + from wikifi.aggregator import AggregationStats + from wikifi.deriver import DerivationStats + from wikifi.extractor import ExtractionStats + from wikifi.introspection import IntrospectionResult + from wikifi.orchestrator import WalkReport + + return WalkReport( + introspection=IntrospectionResult(), + extraction=ExtractionStats(), + aggregation=AggregationStats(), + derivation=DerivationStats(), + ) + + monkeypatch.setattr("wikifi.cli.run_walk", fake_run_walk) + runner = CliRunner() + result = runner.invoke(app, ["walk", str(tmp_path), "--no-cache"]) + assert result.exit_code == 0, result.output + assert captured["use_cache"] is False + # Cache file was deleted by the flag. 
+ assert not cache_path.exists() + + def test_chat_command_runs_repl(tmp_path, monkeypatch): layout = WikiLayout(root=tmp_path) initialize(layout, model="m", provider="ollama", ollama_host="http://h") diff --git a/tests/test_critic.py b/tests/test_critic.py new file mode 100644 index 0000000..35bbdb4 --- /dev/null +++ b/tests/test_critic.py @@ -0,0 +1,118 @@ +"""Critic + reviser tests.""" + +from __future__ import annotations + +from wikifi.aggregator import SectionBody # for unused import sanity +from wikifi.critic import ( + CoverageStats, + Critique, + RevisedBody, + review_section, +) +from wikifi.sections import SECTIONS_BY_ID + +_ = SectionBody # silence "imported but unused" + + +def test_review_skips_revision_when_score_meets_threshold(mock_provider_factory): + section = SECTIONS_BY_ID["entities"] + provider = mock_provider_factory( + json_responses={ + Critique: [Critique(score=8, summary="solid")], + } + ) + outcome = review_section( + section=section, + body="Bodies of evidence here.", + upstream_evidence=None, + provider=provider, + min_score=7, + ) + assert outcome.revised is False + assert outcome.body == "Bodies of evidence here." 
+ + +def test_review_revises_when_score_below_threshold(mock_provider_factory): + section = SECTIONS_BY_ID["entities"] + queue_critique = [ + Critique(score=4, summary="weak", unsupported_claims=["X"], gaps=["Y"]), + Critique(score=8, summary="better"), + ] + + def factory(schema, system, user): + if schema is Critique: + return queue_critique.pop(0) + if schema is RevisedBody: + return RevisedBody(body="Revised body that addresses X and Y.") + raise AssertionError(f"unexpected schema {schema}") + + provider = mock_provider_factory(json_factory=factory) + + outcome = review_section( + section=section, + body="Original body.", + upstream_evidence={"intent": "upstream content"}, + provider=provider, + min_score=7, + ) + assert outcome.revised is True + assert "Revised body" in outcome.body + assert outcome.final is not None + assert outcome.final.score == 8 + + +def test_review_keeps_original_when_revision_regresses(mock_provider_factory): + section = SECTIONS_BY_ID["entities"] + critiques = [ + Critique(score=5, gaps=["Y"]), + Critique(score=3), # revision is worse + ] + + def factory(schema, system, user): + if schema is Critique: + return critiques.pop(0) + if schema is RevisedBody: + return RevisedBody(body="Worse body.") + raise AssertionError + + provider = mock_provider_factory(json_factory=factory) + + outcome = review_section( + section=section, + body="Original body.", + upstream_evidence=None, + provider=provider, + min_score=7, + ) + assert outcome.revised is False + assert outcome.body == "Original body." 
+ + +def test_review_handles_critic_failure(mock_provider_factory): + """If the critic call fails, score=0 → no revision attempt; the body stays.""" + section = SECTIONS_BY_ID["entities"] + + def factory(schema, system, user): + raise RuntimeError("model unavailable") + + provider = mock_provider_factory(json_factory=factory) + outcome = review_section( + section=section, + body="Body.", + upstream_evidence=None, + provider=provider, + min_score=7, + ) + assert outcome.body == "Body." + assert outcome.initial.score == 0 + + +def test_coverage_stats_pct(): + stats = CoverageStats( + files_total=100, + files_with_findings=42, + findings_per_section={}, + files_per_section={}, + ) + assert stats.coverage_pct() == 42.0 + assert CoverageStats(0, 0, {}, {}).coverage_pct() == 0.0 diff --git a/tests/test_evidence.py b/tests/test_evidence.py new file mode 100644 index 0000000..b5bf3dc --- /dev/null +++ b/tests/test_evidence.py @@ -0,0 +1,78 @@ +"""Evidence model + rendering tests.""" + +from __future__ import annotations + +from wikifi.evidence import ( + Claim, + Contradiction, + EvidenceBundle, + SourceRef, + coalesce_refs, + render_section_body, +) + + +def test_source_ref_render(): + assert SourceRef(file="a.py").render() == "a.py" + assert SourceRef(file="a.py", lines=(10, 10)).render() == "a.py:10" + assert SourceRef(file="a.py", lines=(10, 25)).render() == "a.py:10-25" + + +def test_claim_supported_flag(): + assert not Claim(text="x").supported() + assert Claim(text="x", sources=[SourceRef(file="a.py")]).supported() + + +def test_render_section_body_includes_sources_footer(): + bundle = EvidenceBundle( + body="The system manages orders.", + claims=[ + Claim(text="Orders carry line items.", sources=[SourceRef(file="src/order.py", lines=(1, 30))]), + Claim(text="Orders are immutable once placed.", sources=[SourceRef(file="src/order.py", lines=(1, 30))]), + ], + ) + out = render_section_body(bundle) + assert "The system manages orders." 
in out + assert "## Sources" in out + assert "src/order.py:1-30" in out + # Same source ref is deduped — only one numbered entry. + assert out.count("src/order.py:1-30") == 1 + + +def test_render_section_body_renders_contradictions(): + bundle = EvidenceBundle( + body="Order pricing is calculated downstream.", + contradictions=[ + Contradiction( + summary="Whether tax is computed at order time or invoice time.", + positions=[ + Claim(text="Tax is computed at order time.", sources=[SourceRef(file="src/order.py")]), + Claim(text="Tax is computed at invoice time.", sources=[SourceRef(file="src/invoice.py")]), + ], + ) + ], + ) + out = render_section_body(bundle) + assert "Conflicts in source" in out + assert "Tax is computed at order time" in out + assert "Tax is computed at invoice time" in out + assert "src/order.py" in out + assert "src/invoice.py" in out + + +def test_render_section_body_omits_footer_when_no_sources(): + bundle = EvidenceBundle(body="Plain body, no claims.") + out = render_section_body(bundle) + assert "Plain body, no claims." 
in out + assert "## Sources" not in out + + +def test_coalesce_refs_dedupes_by_render(): + refs = [ + SourceRef(file="a.py", lines=(1, 10)), + SourceRef(file="a.py", lines=(1, 10)), + SourceRef(file="b.py"), + ] + out = coalesce_refs(refs) + assert len(out) == 2 + assert {r.render() for r in out} == {"a.py:1-10", "b.py"} diff --git a/tests/test_extractor.py b/tests/test_extractor.py index d007cbf..2b27cca 100644 --- a/tests/test_extractor.py +++ b/tests/test_extractor.py @@ -2,12 +2,14 @@ import pytest +from wikifi.cache import WalkCache from wikifi.extractor import ( FileFindings, SectionFinding, _chunk_text, extract_repo, ) +from wikifi.repograph import build_graph from wikifi.wiki import WikiLayout, initialize, read_notes @@ -327,6 +329,150 @@ def test_section_ids_documented_in_system_prompt(): assert sid not in EXTRACTION_SYSTEM_PROMPT.split("Only emit findings for these section ids:")[1].split("\n")[0] +def test_extract_repo_uses_cache_to_skip_unchanged_files(tmp_path, mock_provider_factory): + """A file whose fingerprint matches a cache entry skips the LLM call entirely.""" + layout = _layout(tmp_path) + (tmp_path / "a.py").write_text("class Order: pass\n# meaningful body content here for the walker\n") + + cache = WalkCache() + seen: list[str] = [] + + def factory(schema, system, user): + seen.append(user) + return FileFindings( + summary="domain class", + findings=[SectionFinding(section_id="entities", finding="Order entity.")], + ) + + provider = mock_provider_factory(json_factory=factory) + extract_repo( + layout=layout, + provider=provider, + files=[Path("a.py")], + repo_root=tmp_path, + cache=cache, + ) + assert len(seen) == 1 + assert "a.py" in cache.extraction + # Second walk against the same file: cache hit, no new LLM call. 
+ seen.clear() + stats2 = extract_repo( + layout=layout, + provider=provider, + files=[Path("a.py")], + repo_root=tmp_path, + cache=cache, + ) + assert seen == [] + assert stats2.cache_hits == 1 + notes = read_notes(layout, "entities") + # Findings are replayed into the notes store on cache hit. + assert any("Order" in n["finding"] for n in notes) + + +def test_extract_repo_invalidates_cache_when_file_changes(tmp_path, mock_provider_factory): + layout = _layout(tmp_path) + target = tmp_path / "a.py" + target.write_text("class Order: pass\n# body content for the walker minimum threshold\n") + + cache = WalkCache() + call_count = {"n": 0} + + def factory(schema, system, user): + call_count["n"] += 1 + return FileFindings(findings=[SectionFinding(section_id="entities", finding="Order.")]) + + provider = mock_provider_factory(json_factory=factory) + extract_repo(layout=layout, provider=provider, files=[Path("a.py")], repo_root=tmp_path, cache=cache) + assert call_count["n"] == 1 + + # Mutate the file → fingerprint changes → cache miss → new call. + target.write_text("class Customer: pass\n# different content for the walker minimum threshold\n") + extract_repo(layout=layout, provider=provider, files=[Path("a.py")], repo_root=tmp_path, cache=cache) + assert call_count["n"] == 2 + + +def test_extract_repo_routes_sql_through_specialized_extractor(tmp_path, mock_provider_factory): + """SQL files bypass the LLM and go through the deterministic SQL extractor.""" + layout = _layout(tmp_path) + (tmp_path / "schema.sql").write_text("CREATE TABLE customer (id INTEGER PRIMARY KEY, email VARCHAR(255) NOT NULL);") + + seen: list[str] = [] + + def factory(schema, system, user): + seen.append(user) + return FileFindings() + + provider = mock_provider_factory(json_factory=factory) + stats = extract_repo( + layout=layout, + provider=provider, + files=[Path("schema.sql")], + repo_root=tmp_path, + ) + # No LLM calls — specialized extractor handled the file directly. 
+ assert seen == [] + assert stats.specialized_files == 1 + notes = read_notes(layout, "entities") + assert any("customer" in n["finding"] for n in notes) + + +def test_extract_repo_emits_source_refs_in_notes(tmp_path, mock_provider_factory): + """Every note carries a structured ``sources`` list for downstream citations.""" + layout = _layout(tmp_path) + (tmp_path / "a.py").write_text("class Order:\n pass\n# more body content for walker minimum\n") + + findings = FileFindings( + summary="domain class", + findings=[ + SectionFinding(section_id="entities", finding="Order entity.", line_range=(1, 2)), + ], + ) + provider = mock_provider_factory(json_responses={FileFindings: [findings]}) + extract_repo( + layout=layout, + provider=provider, + files=[Path("a.py")], + repo_root=tmp_path, + ) + note = read_notes(layout, "entities")[0] + sources = note["sources"] + assert sources and sources[0]["file"] == "a.py" + assert sources[0]["lines"] == [1, 2] + assert sources[0]["fingerprint"] + + +def test_extract_repo_injects_neighbor_context_when_graph_supplied(tmp_path, mock_provider_factory): + layout = _layout(tmp_path) + (tmp_path / "pkg").mkdir() + (tmp_path / "pkg" / "__init__.py").write_text("# package marker for tests; long enough to pass min_content\n") + (tmp_path / "pkg" / "main.py").write_text("from pkg.helper import compute\n\ndef run():\n return compute()\n") + (tmp_path / "pkg" / "helper.py").write_text( + "def compute():\n return 42\n# extra padding to satisfy the minimum content threshold for the walker\n" + ) + + files = [Path("pkg/__init__.py"), Path("pkg/main.py"), Path("pkg/helper.py")] + graph = build_graph(repo_root=tmp_path, files=files) + + captured: list[str] = [] + + def factory(schema, system, user): + captured.append(user) + return FileFindings() + + provider = mock_provider_factory(json_factory=factory) + extract_repo( + layout=layout, + provider=provider, + files=files, + repo_root=tmp_path, + graph=graph, + ) + main_prompt = next(p for p in 
captured if "pkg/main.py" in p) + assert "Neighbor files" in main_prompt + assert "pkg/helper.py" in main_prompt + + def test_extract_repo_drops_derivative_section_findings(tmp_path, mock_provider_factory): """Even if the model emits a derivative section id, the extractor filters it out.""" from wikifi.sections import DERIVATIVE_SECTION_IDS diff --git a/tests/test_fingerprint.py b/tests/test_fingerprint.py new file mode 100644 index 0000000..c462afd --- /dev/null +++ b/tests/test_fingerprint.py @@ -0,0 +1,29 @@ +"""Fingerprint tests.""" + +from __future__ import annotations + +from pathlib import Path + +from wikifi.fingerprint import FINGERPRINT_LENGTH, hash_bytes, hash_file, hash_text + + +def test_hash_text_is_stable_and_short(): + a = hash_text("hello world") + b = hash_text("hello world") + assert a == b + assert len(a) == FINGERPRINT_LENGTH + assert all(c in "0123456789abcdef" for c in a) + + +def test_hash_text_diverges_on_change(): + assert hash_text("hello") != hash_text("hello!") + + +def test_hash_bytes_handles_arbitrary_bytes(): + assert hash_bytes(b"\x00\x01\x02") != hash_bytes(b"\x00\x01\x03") + + +def test_hash_file_reads_bytes_from_disk(tmp_path: Path): + target = tmp_path / "file.txt" + target.write_bytes(b"contents") + assert hash_file(target) == hash_bytes(b"contents") diff --git a/tests/test_orchestrator.py b/tests/test_orchestrator.py index f6c3baf..a068ace 100644 --- a/tests/test_orchestrator.py +++ b/tests/test_orchestrator.py @@ -89,3 +89,79 @@ def test_build_provider_returns_ollama_for_ollama_settings(): def test_build_provider_rejects_unknown(): with pytest.raises(ValueError): build_provider(_settings(provider="other")) + + +def test_build_provider_returns_anthropic_when_selected(monkeypatch): + """``provider='anthropic'`` dispatches to AnthropicProvider with a Claude model default.""" + monkeypatch.setenv("ANTHROPIC_API_KEY", "test-key") + settings = _settings(provider="anthropic", model="m") # non-claude model id + provider = 
build_provider(settings) + from wikifi.providers.anthropic_provider import AnthropicProvider + + assert isinstance(provider, AnthropicProvider) + # Falls back to a sane Claude default rather than 404'ing on "m". + assert provider.model.startswith("claude-") + + +def test_run_walk_persists_cache_for_resumability(mini_target, mock_provider_factory): + """A second walk reuses the cache and skips the LLM call for unchanged files.""" + settings = _settings() + introspection = IntrospectionResult( + include=["src/"], exclude=[], primary_languages=["python"], likely_purpose="demo", rationale="ok" + ) + + extraction_calls = {"n": 0} + + def factory(schema, system, user): + if schema is IntrospectionResult: + return introspection + if schema is FileFindings: + extraction_calls["n"] += 1 + return FileFindings( + summary="role", + findings=[SectionFinding(section_id="entities", finding="Order entity inferred.")], + ) + if schema is SectionBody: + return SectionBody(body="Synthesized.") + if schema is DerivedSection: + return DerivedSection(body="Derived.") + raise AssertionError(f"unexpected {schema}") + + provider = mock_provider_factory(json_factory=factory) + run_walk(root=mini_target, settings=settings, provider=provider) + first = extraction_calls["n"] + assert first >= 2 + + # Second walk against the same target with the same content: cache reuses + # the per-file findings, so extraction calls do not increase. 
+ run_walk(root=mini_target, settings=settings, provider=provider) + assert extraction_calls["n"] == first + + +def test_run_walk_review_flag_invokes_critic(mini_target, mock_provider_factory): + """With ``review_derivatives=True`` the deriver runs the critic loop.""" + from wikifi.critic import Critique + + settings = _settings(review_derivatives=True) + introspection = IntrospectionResult( + include=["src/"], exclude=[], primary_languages=["python"], likely_purpose="demo", rationale="ok" + ) + critic_called = {"n": 0} + + def factory(schema, system, user): + if schema is IntrospectionResult: + return introspection + if schema is FileFindings: + return FileFindings(findings=[SectionFinding(section_id="entities", finding="Order.")]) + if schema is SectionBody: + return SectionBody(body="Synthesized.") + if schema is DerivedSection: + return DerivedSection(body="Derived.") + if schema is Critique: + critic_called["n"] += 1 + return Critique(score=9, summary="ok") + raise AssertionError(f"unexpected {schema}") + + provider = mock_provider_factory(json_factory=factory) + run_walk(root=mini_target, settings=settings, provider=provider) + assert critic_called["n"] >= 1 diff --git a/tests/test_repograph.py b/tests/test_repograph.py new file mode 100644 index 0000000..b97d20e --- /dev/null +++ b/tests/test_repograph.py @@ -0,0 +1,85 @@ +"""Repo graph + file classification tests.""" + +from __future__ import annotations + +from pathlib import Path + +from wikifi.repograph import FileKind, build_graph, classify + + +def test_classify_extension_only(): + assert classify(Path("schema.sql")) is FileKind.SQL + assert classify(Path("api.proto")) is FileKind.PROTOBUF + assert classify(Path("schema.graphql")) is FileKind.GRAPHQL + + +def test_classify_application_code(): + assert classify(Path("src/app.py")) is FileKind.APPLICATION_CODE + assert classify(Path("src/app.ts")) is FileKind.APPLICATION_CODE + assert classify(Path("src/app.go")) is FileKind.APPLICATION_CODE + + +def 
test_classify_migration_path(): + assert classify(Path("backend/migrations/0001_init.sql")) is FileKind.MIGRATION + assert classify(Path("alembic/versions/abc.py")) is FileKind.MIGRATION + + +def test_classify_openapi_via_sample(): + assert classify(Path("api.yaml"), sample="openapi: 3.0.3\ninfo: ...") is FileKind.OPENAPI + assert classify(Path("api.json"), sample='{"openapi": "3.0.0"}') is FileKind.OPENAPI + + +def test_classify_other(): + assert classify(Path("README.md")) is FileKind.OTHER + assert classify(Path("data.csv")) is FileKind.OTHER + + +def test_build_graph_python_imports(tmp_path: Path): + (tmp_path / "pkg").mkdir() + (tmp_path / "pkg" / "__init__.py").write_text("") + (tmp_path / "pkg" / "a.py").write_text("from pkg.b import thing\nimport os\n") + (tmp_path / "pkg" / "b.py").write_text("def thing(): return 1\n") + + files = [Path("pkg/__init__.py"), Path("pkg/a.py"), Path("pkg/b.py")] + graph = build_graph(repo_root=tmp_path, files=files) + + a_node = graph.get("pkg/a.py") + assert a_node is not None + assert "pkg/b.py" in a_node.imports + + b_node = graph.get("pkg/b.py") + assert b_node is not None + assert "pkg/a.py" in b_node.imported_by + + +def test_build_graph_js_imports(tmp_path: Path): + (tmp_path / "src").mkdir() + (tmp_path / "src" / "main.js").write_text("import { run } from './worker';\n") + (tmp_path / "src" / "worker.js").write_text("export function run() {}\n") + + files = [Path("src/main.js"), Path("src/worker.js")] + graph = build_graph(repo_root=tmp_path, files=files) + main_node = graph.get("src/main.js") + assert main_node is not None + assert "src/worker.js" in main_node.imports + + +def test_neighbor_paths_caps_results(tmp_path: Path): + """neighbor_paths() caps its results at ``limit`` to bound prompt-side noise.""" + (tmp_path / "hub.py").write_text("\n".join(f"from leaf{i} import foo" for i in range(20))) + for i in range(20): + (tmp_path / f"leaf{i}.py").write_text("def foo(): pass\n") + files = [Path("hub.py")] + [Path(f"leaf{i}.py") for i in range(20)] + 
graph = build_graph(repo_root=tmp_path, files=files) + neighbors = graph.neighbor_paths("hub.py", limit=5) + assert len(neighbors) == 5 + + +def test_build_graph_skips_unreadable_files(tmp_path: Path): + """Missing-file path is exercised even if no other tests trip it.""" + files = [Path("ghost.py")] + graph = build_graph(repo_root=tmp_path, files=files) + # No edges produced; graph still records a node with empty imports. + node = graph.get("ghost.py") + assert node is not None + assert node.imports == () diff --git a/tests/test_report.py b/tests/test_report.py new file mode 100644 index 0000000..325938b --- /dev/null +++ b/tests/test_report.py @@ -0,0 +1,73 @@ +"""Coverage + quality report tests.""" + +from __future__ import annotations + +from pathlib import Path + +from wikifi.cache import WalkCache, save +from wikifi.critic import Critique +from wikifi.report import build_report +from wikifi.wiki import WikiLayout, append_note, initialize, write_section + + +def _layout(tmp_path: Path) -> WikiLayout: + layout = WikiLayout(root=tmp_path) + initialize(layout, model="m", provider="ollama", ollama_host="http://h") + return layout + + +def test_build_report_without_provider_returns_structural_view(tmp_path: Path): + layout = _layout(tmp_path) + cache = WalkCache() + cache.record_extraction( + "src/order.py", + fingerprint="abc", + findings=[{"section_id": "entities", "finding": "Order", "sources": []}], + summary="domain", + chunks_processed=1, + ) + cache.record_extraction( + "src/empty.py", + fingerprint="def", + findings=[], + summary="", + chunks_processed=1, + ) + save(layout, cache) + append_note(layout, "entities", {"file": "src/order.py", "summary": "x", "finding": "Order"}) + write_section(layout, "entities", "Body for entities.") + + report = build_report(layout=layout, provider=None, score=False) + + assert report.coverage.files_total == 2 + assert report.coverage.files_with_findings == 1 + assert report.coverage.coverage_pct() == 50.0 + assert 
report.overall_score is None + md = report.render() + assert "wikifi coverage" in md + assert "`entities`" in md + + +def test_build_report_with_score_uses_provider(tmp_path: Path, mock_provider_factory): + layout = _layout(tmp_path) + write_section(layout, "entities", "An entity body.") + write_section(layout, "intent", "Intent body.") + + provider = mock_provider_factory( + json_factory=lambda schema, system, user: Critique(score=9, summary="great"), + ) + report = build_report(layout=layout, provider=provider, score=True) + + populated = [s for s in report.sections if s.critique is not None] + assert populated, "expected at least one populated section to be scored" + assert all(s.critique.score == 9 for s in populated) + assert report.overall_score == 9.0 + assert "9/10" in report.render() + + +def test_build_report_marks_unpopulated_sections(tmp_path: Path): + """Sections still bearing the init placeholder are flagged ``is_empty``.""" + layout = _layout(tmp_path) + save(layout, WalkCache()) + report = build_report(layout=layout, provider=None, score=False) + assert any(entry.is_empty for entry in report.sections) diff --git a/tests/test_specialized.py b/tests/test_specialized.py new file mode 100644 index 0000000..4e064f1 --- /dev/null +++ b/tests/test_specialized.py @@ -0,0 +1,186 @@ +"""Type-aware (specialized) extractor tests.""" + +from __future__ import annotations + +from wikifi.repograph import FileKind +from wikifi.specialized import select +from wikifi.specialized.graphql import extract as gql_extract +from wikifi.specialized.openapi import extract as openapi_extract +from wikifi.specialized.protobuf import extract as proto_extract +from wikifi.specialized.sql import extract as sql_extract + + +def test_select_routes_known_kinds_to_extractors(): + assert select(FileKind.SQL) is sql_extract + assert select(FileKind.PROTOBUF) is proto_extract + assert select(FileKind.GRAPHQL) is gql_extract + assert select(FileKind.OPENAPI) is openapi_extract + # 
Migrations route to a SQL variant. + assert select(FileKind.MIGRATION).__name__ == "extract_migration" + assert select(FileKind.APPLICATION_CODE) is None + assert select(FileKind.OTHER) is None + + +# --------------------------------------------------------------------------- +# SQL +# --------------------------------------------------------------------------- + + +def test_sql_extracts_table_and_foreign_key(): + text = """ + CREATE TABLE customer ( + id INTEGER PRIMARY KEY, + email VARCHAR(255) UNIQUE NOT NULL + ); + + CREATE TABLE orders ( + id INTEGER PRIMARY KEY, + customer_id INTEGER REFERENCES customer(id), + total INTEGER NOT NULL + ); + """ + result = sql_extract("schema.sql", text) + sections = {f.section_id for f in result.findings} + assert "entities" in sections + assert "integrations" in sections + findings_by_section = {s: [f for f in result.findings if f.section_id == s] for s in sections} + # Both tables surface as entities. + entity_findings = findings_by_section["entities"] + assert any("customer" in f.finding for f in entity_findings) + assert any("orders" in f.finding for f in entity_findings) + # FK becomes an integration. 
+ fk_findings = findings_by_section["integrations"] + assert any("customer" in f.finding for f in fk_findings) + + +def test_sql_migration_marks_summary(): + text = "ALTER TABLE orders ADD COLUMN refund_status TEXT;" + from wikifi.specialized.sql import extract_migration + + result = extract_migration("backend/migrations/0042_refunds.sql", text) + assert "Migration" in result.summary or "migration" in result.summary.lower() + assert any("orders" in f.finding for f in result.findings) + + +def test_sql_index_becomes_cross_cutting(): + text = "CREATE INDEX idx_orders_customer ON orders (customer_id);" + result = sql_extract("schema.sql", text) + assert any(f.section_id == "cross_cutting" and "idx_orders_customer" in f.finding for f in result.findings) + + +# --------------------------------------------------------------------------- +# OpenAPI +# --------------------------------------------------------------------------- + + +def test_openapi_extracts_endpoints_and_schemas_from_json(): + spec = """ + { + "openapi": "3.0.0", + "info": {"title": "Orders API", "version": "1.0"}, + "paths": { + "/orders": { + "post": {"summary": "Create order"}, + "get": {"summary": "List orders"} + } + }, + "components": { + "schemas": {"Order": {"type": "object"}, "LineItem": {"type": "object"}}, + "securitySchemes": {"bearerAuth": {"type": "http"}} + } + } + """ + result = openapi_extract("openapi.json", spec) + sections = {f.section_id for f in result.findings} + assert "intent" in sections + assert "capabilities" in sections + assert "entities" in sections + assert "integrations" in sections + assert "cross_cutting" in sections + cap_text = next(f.finding for f in result.findings if f.section_id == "capabilities") + assert "POST /orders" in cap_text + assert "GET /orders" in cap_text + + +def test_openapi_handles_unparseable_input(): + result = openapi_extract("openapi.yaml", "") + assert any(f.section_id == "capabilities" for f in result.findings) + assert "Unparseable" in 
result.summary or "manual review" in result.findings[0].finding.lower() + + +def test_openapi_yaml_fallback_parser(): + """The shallow YAML parser should work even without PyYAML installed.""" + spec = """openapi: 3.0.0 +info: + title: Test API + version: "1.0" +paths: + /test: + get: + summary: Test endpoint +""" + result = openapi_extract("openapi.yaml", spec) + # Should at least extract intent (title is present). + assert any("Test API" in f.finding for f in result.findings) + + +# --------------------------------------------------------------------------- +# Protobuf +# --------------------------------------------------------------------------- + + +def test_proto_extracts_messages_and_services(): + text = """ + syntax = "proto3"; + package billing.v1; + + message Invoice { + int64 id = 1; + string customer_id = 2; + } + + service BillingService { + rpc CreateInvoice (Invoice) returns (Invoice); + rpc StreamInvoices (Invoice) returns (stream Invoice); + } + """ + result = proto_extract("billing.proto", text) + sections = {f.section_id for f in result.findings} + assert "entities" in sections + assert "integrations" in sections + assert "capabilities" in sections + integrations = next(f for f in result.findings if f.section_id == "integrations") + assert "BillingService" in integrations.finding + assert "CreateInvoice" in integrations.finding + + +# --------------------------------------------------------------------------- +# GraphQL +# --------------------------------------------------------------------------- + + +def test_graphql_extracts_types_and_roots(): + sdl = """ + type Order { + id: ID! + total: Int! + } + + input OrderInput { + total: Int! + } + + type Query { + order(id: ID!): Order + } + + type Mutation { + createOrder(input: OrderInput!): Order! 
+ } + """ + result = gql_extract("schema.graphql", sdl) + sections = {f.section_id for f in result.findings} + assert "entities" in sections + assert "capabilities" in sections + cap = next(f for f in result.findings if f.section_id == "capabilities") + assert "Query" in cap.finding or "Mutation" in cap.finding diff --git a/uv.lock b/uv.lock index e2c0f09..8d9f2ed 100644 --- a/uv.lock +++ b/uv.lock @@ -20,6 +20,25 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/78/b6/6307fbef88d9b5ee7421e68d78a9f162e0da4900bc5f5793f6d3d0e34fb8/annotated_types-0.7.0-py3-none-any.whl", hash = "sha256:1f02e8b43a8fbbc3f3e0d4f0f4bfc8131bcb4eebe8849b8e5c773f3a1c582a53", size = 13643, upload-time = "2024-05-20T21:33:24.1Z" }, ] +[[package]] +name = "anthropic" +version = "0.97.0" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "anyio" }, + { name = "distro" }, + { name = "docstring-parser" }, + { name = "httpx" }, + { name = "jiter" }, + { name = "pydantic" }, + { name = "sniffio" }, + { name = "typing-extensions" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/14/93/f66ea8bfe39f2e6bb9da8e27fa5457ad2520e8f7612dfc547b17fad55c4d/anthropic-0.97.0.tar.gz", hash = "sha256:021e79fd8e21e90ad94dc5ba2bbbd8b1599f424f5b1fab6c06204009cab764be", size = 669502, upload-time = "2026-04-23T20:52:34.445Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/53/b6/8e851369fa661ad0fef2ae6266bf3b7d52b78ccf011720058f4adaca59e2/anthropic-0.97.0-py3-none-any.whl", hash = "sha256:8a1a472dfabcfc0c52ff6a3eecf724ac7e07107a2f6e2367be55ceb42f5d5613", size = 662126, upload-time = "2026-04-23T20:52:32.377Z" }, +] + [[package]] name = "anyio" version = "4.13.0" @@ -147,6 +166,24 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/9e/ee/a4cf96b8ce1e566ed238f0659ac2d3f007ed1d14b181bcb684e19561a69a/coverage-7.13.5-py3-none-any.whl", hash = "sha256:34b02417cf070e173989b3db962f7ed56d2f644307b2cf9d5a0f258e13084a61", size = 211346, upload-time 
= "2026-03-17T10:33:15.691Z" }, ] +[[package]] +name = "distro" +version = "1.9.0" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/fc/f8/98eea607f65de6527f8a2e8885fc8015d3e6f5775df186e443e0964a11c3/distro-1.9.0.tar.gz", hash = "sha256:2fa77c6fd8940f116ee1d6b94a2f90b13b5ea8d019b98bc8bafdcabcdd9bdbed", size = 60722, upload-time = "2023-12-24T09:54:32.31Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/12/b3/231ffd4ab1fc9d679809f356cebee130ac7daa00d6d6f3206dd4fd137e9e/distro-1.9.0-py3-none-any.whl", hash = "sha256:7bffd925d65168f85027d8da9af6bddab658135b840670a223589bc0c8ef02b2", size = 20277, upload-time = "2023-12-24T09:54:30.421Z" }, +] + +[[package]] +name = "docstring-parser" +version = "0.18.0" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/e0/4d/f332313098c1de1b2d2ff91cf2674415cc7cddab2ca1b01ae29774bd5fdf/docstring_parser-0.18.0.tar.gz", hash = "sha256:292510982205c12b1248696f44959db3cdd1740237a968ea1e2e7a900eeb2015", size = 29341, upload-time = "2026-04-14T04:09:19.867Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/a7/5f/ed01f9a3cdffbd5a008556fc7b2a08ddb1cc6ace7effa7340604b1d16699/docstring_parser-0.18.0-py3-none-any.whl", hash = "sha256:b3fcbed555c47d8479be0796ef7e19c2670d428d72e96da63f3a40122860374b", size = 22484, upload-time = "2026-04-14T04:09:18.638Z" }, +] + [[package]] name = "h11" version = "0.16.0" @@ -202,6 +239,78 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/cb/b1/3846dd7f199d53cb17f49cba7e651e9ce294d8497c8c150530ed11865bb8/iniconfig-2.3.0-py3-none-any.whl", hash = "sha256:f631c04d2c48c52b84d0d0549c99ff3859c98df65b3101406327ecc7d53fbf12", size = 7484, upload-time = "2025-10-18T21:55:41.639Z" }, ] +[[package]] +name = "jiter" +version = "0.14.0" +source = { registry = "https://pypi.org/simple" } +sdist = { url = 
"https://files.pythonhosted.org/packages/6e/c1/0cddc6eb17d4c53a99840953f95dd3accdc5cfc7a337b0e9b26476276be9/jiter-0.14.0.tar.gz", hash = "sha256:e8a39e66dac7153cf3f964a12aad515afa8d74938ec5cc0018adcdae5367c79e", size = 165725, upload-time = "2026-04-10T14:28:42.01Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/5a/68/7390a418f10897da93b158f2d5a8bd0bcd73a0f9ec3bb36917085bb759ef/jiter-0.14.0-cp312-cp312-macosx_10_12_x86_64.whl", hash = "sha256:2fb2ce3a7bc331256dfb14cefc34832366bb28a9aca81deaf43bbf2a5659e607", size = 316295, upload-time = "2026-04-10T14:26:24.887Z" }, + { url = "https://files.pythonhosted.org/packages/60/a0/5854ac00ff63551c52c6c89534ec6aba4b93474e7924d64e860b1c94165b/jiter-0.14.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:5252a7ca23785cef5d02d4ece6077a1b556a410c591b379f82091c3001e14844", size = 315898, upload-time = "2026-04-10T14:26:26.601Z" }, + { url = "https://files.pythonhosted.org/packages/41/a1/4f44832650a16b18e8391f1bf1d6ca4909bc738351826bcc198bba4357f4/jiter-0.14.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c409578cbd77c338975670ada777add4efd53379667edf0aceea730cabede6fb", size = 343730, upload-time = "2026-04-10T14:26:28.326Z" }, + { url = "https://files.pythonhosted.org/packages/48/64/a329e9d469f86307203594b1707e11ae51c3348d03bfd514a5f997870012/jiter-0.14.0-cp312-cp312-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:7ede4331a1899d604463369c730dbb961ffdc5312bc7f16c41c2896415b1304a", size = 370102, upload-time = "2026-04-10T14:26:30.089Z" }, + { url = "https://files.pythonhosted.org/packages/94/c1/5e3dfc59635aa4d4c7bd20a820ac1d09b8ed851568356802cf1c08edb3cf/jiter-0.14.0-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:92cd8b6025981a041f5310430310b55b25ca593972c16407af8837d3d7d2ca01", size = 461335, upload-time = "2026-04-10T14:26:31.911Z" }, + { url = 
"https://files.pythonhosted.org/packages/e3/1b/dd157009dbc058f7b00108f545ccb72a2d56461395c4fc7b9cfdccb00af4/jiter-0.14.0-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:351bf6eda4e3a7ceb876377840c702e9a3e4ecc4624dbfb2d6463c67ae52637d", size = 378536, upload-time = "2026-04-10T14:26:33.595Z" }, + { url = "https://files.pythonhosted.org/packages/91/78/256013667b7c10b8834f8e6e54cd3e562d4c6e34227a1596addccc05e38c/jiter-0.14.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:c1dcfbeb93d9ecd9ca128bbf8910120367777973fa193fb9a39c31237d8df165", size = 353859, upload-time = "2026-04-10T14:26:35.098Z" }, + { url = "https://files.pythonhosted.org/packages/de/d9/137d65ade9093a409fe80955ce60b12bb753722c986467aeda47faf450ad/jiter-0.14.0-cp312-cp312-manylinux_2_31_riscv64.whl", hash = "sha256:ae039aaef8de3f8157ecc1fdd4d85043ac4f57538c245a0afaecb8321ec951c3", size = 357626, upload-time = "2026-04-10T14:26:36.685Z" }, + { url = "https://files.pythonhosted.org/packages/2e/48/76750835b87029342727c1a268bea8878ab988caf81ee4e7b880900eeb5a/jiter-0.14.0-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:7d9d51eb96c82a9652933bd769fe6de66877d6eb2b2440e281f2938c51b5643e", size = 393172, upload-time = "2026-04-10T14:26:38.097Z" }, + { url = "https://files.pythonhosted.org/packages/a6/60/456c4e81d5c8045279aefe60e9e483be08793828800a4e64add8fdde7f2a/jiter-0.14.0-cp312-cp312-musllinux_1_1_aarch64.whl", hash = "sha256:d824ca4148b705970bf4e120924a212fdfca9859a73e42bd7889a63a4ea6bb98", size = 520300, upload-time = "2026-04-10T14:26:39.532Z" }, + { url = "https://files.pythonhosted.org/packages/a8/9f/2020e0984c235f678dced38fe4eec3058cf528e6af36ebf969b410305941/jiter-0.14.0-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:ff3a6465b3a0f54b1a430f45c3c0ba7d61ceb45cbc3e33f9e1a7f638d690baf3", size = 553059, upload-time = "2026-04-10T14:26:40.991Z" }, + { url = 
"https://files.pythonhosted.org/packages/ef/32/e2d298e1a22a4bbe6062136d1c7192db7dba003a6975e51d9a9eecabc4c2/jiter-0.14.0-cp312-cp312-win32.whl", hash = "sha256:5dec7c0a3e98d2a3f8a2e67382d0d7c3ac60c69103a4b271da889b4e8bb1e129", size = 206030, upload-time = "2026-04-10T14:26:42.517Z" }, + { url = "https://files.pythonhosted.org/packages/36/ac/96369141b3d8a4a8e4590e983085efe1c436f35c0cda940dd76d942e3e40/jiter-0.14.0-cp312-cp312-win_amd64.whl", hash = "sha256:fc7e37b4b8bc7e80a63ad6cfa5fc11fab27dbfea4cc4ae644b1ab3f273dc348f", size = 201603, upload-time = "2026-04-10T14:26:44.328Z" }, + { url = "https://files.pythonhosted.org/packages/01/c3/75d847f264647017d7e3052bbcc8b1e24b95fa139c320c5f5066fa7a0bdd/jiter-0.14.0-cp312-cp312-win_arm64.whl", hash = "sha256:ee4a72f12847ef29b072aee9ad5474041ab2924106bdca9fcf5d7d965853e057", size = 191525, upload-time = "2026-04-10T14:26:46Z" }, + { url = "https://files.pythonhosted.org/packages/97/2a/09f70020898507a89279659a1afe3364d57fc1b2c89949081975d135f6f5/jiter-0.14.0-cp313-cp313-macosx_10_12_x86_64.whl", hash = "sha256:af72f204cf4d44258e5b4c1745130ac45ddab0e71a06333b01de660ab4187a94", size = 315502, upload-time = "2026-04-10T14:26:47.697Z" }, + { url = "https://files.pythonhosted.org/packages/d6/be/080c96a45cd74f9fce5db4fd68510b88087fb37ffe2541ff73c12db92535/jiter-0.14.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:4b77da71f6e819be5fbcec11a453fde5b1d0267ef6ed487e2a392fd8e14e4e3a", size = 314870, upload-time = "2026-04-10T14:26:49.149Z" }, + { url = "https://files.pythonhosted.org/packages/7d/5e/2d0fee155826a968a832cc32438de5e2a193292c8721ca70d0b53e58245b/jiter-0.14.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:77f4ea612fe8b84b8b04e51d0e78029ecf3466348e25973f953de6e6a59aa4c1", size = 343406, upload-time = "2026-04-10T14:26:50.762Z" }, + { url = 
"https://files.pythonhosted.org/packages/70/af/bf9ee0d3a4f8dc0d679fc1337f874fe60cdbf841ebbb304b374e1c9aaceb/jiter-0.14.0-cp313-cp313-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:62fe2451f8fcc0240261e6a4df18ecbcd58327857e61e625b2393ea3b468aac9", size = 369415, upload-time = "2026-04-10T14:26:52.188Z" }, + { url = "https://files.pythonhosted.org/packages/0f/83/8e8561eadba31f4d3948a5b712fb0447ec71c3560b57a855449e7b8ddc98/jiter-0.14.0-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:6112f26f5afc75bcb475787d29da3aa92f9d09c7858f632f4be6ffe607be82e9", size = 461456, upload-time = "2026-04-10T14:26:53.611Z" }, + { url = "https://files.pythonhosted.org/packages/f6/c9/c5299e826a5fe6108d172b344033f61c69b1bb979dd8d9ddd4278a160971/jiter-0.14.0-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:215a6cb8fb7dc702aa35d475cc00ddc7f970e5c0b1417fb4b4ac5d82fa2a29db", size = 378488, upload-time = "2026-04-10T14:26:55.211Z" }, + { url = "https://files.pythonhosted.org/packages/5d/37/c16d9d15c0a471b8644b1abe3c82668092a707d9bedcf076f24ff2e380cd/jiter-0.14.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:fc4ab96a30fb3cb2c7e0cd33f7616c8860da5f5674438988a54ac717caccdbaa", size = 353242, upload-time = "2026-04-10T14:26:56.705Z" }, + { url = "https://files.pythonhosted.org/packages/58/ea/8050cb0dc654e728e1bfacbc0c640772f2181af5dedd13ae70145743a439/jiter-0.14.0-cp313-cp313-manylinux_2_31_riscv64.whl", hash = "sha256:3a99c1387b1f2928f799a9de899193484d66206a50e98233b6b088a7f0c1edb2", size = 356823, upload-time = "2026-04-10T14:26:58.281Z" }, + { url = "https://files.pythonhosted.org/packages/b0/3b/cf71506d270e5f84d97326bf220e47aed9b95e9a4a060758fb07772170ab/jiter-0.14.0-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:ab18d11074485438695f8d34a1b6da61db9754248f96d51341956607a8f39985", size = 392564, upload-time = "2026-04-10T14:27:00.018Z" }, + { url = 
"https://files.pythonhosted.org/packages/b0/cc/8c6c74a3efb5bd671bfd14f51e8a73375464ca914b1551bc3b40e26ac2c9/jiter-0.14.0-cp313-cp313-musllinux_1_1_aarch64.whl", hash = "sha256:801028dcfc26ac0895e4964cbc0fd62c73be9fd4a7d7b1aaf6e5790033a719b7", size = 520322, upload-time = "2026-04-10T14:27:01.664Z" }, + { url = "https://files.pythonhosted.org/packages/41/24/68d7b883ec959884ddf00d019b2e0e82ba81b167e1253684fa90519ce33c/jiter-0.14.0-cp313-cp313-musllinux_1_1_x86_64.whl", hash = "sha256:ad425b087aafb4a1c7e1e98a279200743b9aaf30c3e0ba723aec93f061bd9bc8", size = 552619, upload-time = "2026-04-10T14:27:03.316Z" }, + { url = "https://files.pythonhosted.org/packages/b6/89/b1a0985223bbf3150ff9e8f46f98fc9360c1de94f48abe271bbe1b465682/jiter-0.14.0-cp313-cp313-win32.whl", hash = "sha256:882bcb9b334318e233950b8be366fe5f92c86b66a7e449e76975dfd6d776a01f", size = 205699, upload-time = "2026-04-10T14:27:04.662Z" }, + { url = "https://files.pythonhosted.org/packages/4c/19/3f339a5a7f14a11730e67f6be34f9d5105751d547b615ef593fa122a5ded/jiter-0.14.0-cp313-cp313-win_amd64.whl", hash = "sha256:9b8c571a5dba09b98bd3462b5a53f27209a5cbbe85670391692ede71974e979f", size = 201323, upload-time = "2026-04-10T14:27:06.139Z" }, + { url = "https://files.pythonhosted.org/packages/50/56/752dd89c84be0e022a8ea3720bcfa0a8431db79a962578544812ce061739/jiter-0.14.0-cp313-cp313-win_arm64.whl", hash = "sha256:34f19dcc35cb1abe7c369b3756babf8c7f04595c0807a848df8f26ef8298ef92", size = 191099, upload-time = "2026-04-10T14:27:07.564Z" }, + { url = "https://files.pythonhosted.org/packages/91/28/292916f354f25a1fe8cf2c918d1415c699a4a659ae00be0430e1c5d9ffea/jiter-0.14.0-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:e89bcd7d426a75bb4952c696b267075790d854a07aad4c9894551a82c5b574ab", size = 320880, upload-time = "2026-04-10T14:27:09.326Z" }, + { url = 
"https://files.pythonhosted.org/packages/ad/c7/b002a7d8b8957ac3d469bd59c18ef4b1595a5216ae0de639a287b9816023/jiter-0.14.0-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7b25beaa0d4447ea8c7ae0c18c688905d34840d7d0b937f2f7bdd52162c98a40", size = 346563, upload-time = "2026-04-10T14:27:11.287Z" }, + { url = "https://files.pythonhosted.org/packages/f9/3b/f8d07580d8706021d255a6356b8fab13ee4c869412995550ce6ed4ddf97d/jiter-0.14.0-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:651a8758dd413c51e3b7f6557cdc6921faf70b14106f45f969f091f5cda990ea", size = 357928, upload-time = "2026-04-10T14:27:12.729Z" }, + { url = "https://files.pythonhosted.org/packages/47/5b/ac1a974da29e35507230383110ffec59998b290a8732585d04e19a9eb5ba/jiter-0.14.0-cp313-cp313t-win_amd64.whl", hash = "sha256:e1a7eead856a5038a8d291f1447176ab0b525c77a279a058121b5fccee257f6f", size = 203519, upload-time = "2026-04-10T14:27:14.125Z" }, + { url = "https://files.pythonhosted.org/packages/96/6d/9fc8433d667d2454271378a79747d8c76c10b51b482b454e6190e511f244/jiter-0.14.0-cp313-cp313t-win_arm64.whl", hash = "sha256:2e692633a12cda97e352fdcd1c4acc971b1c28707e1e33aeef782b0cbf051975", size = 190113, upload-time = "2026-04-10T14:27:16.638Z" }, + { url = "https://files.pythonhosted.org/packages/4f/1e/354ed92461b165bd581f9ef5150971a572c873ec3b68a916d5aa91da3cc2/jiter-0.14.0-cp314-cp314-macosx_10_12_x86_64.whl", hash = "sha256:6f396837fc7577871ca8c12edaf239ed9ccef3bbe39904ae9b8b63ce0a48b140", size = 315277, upload-time = "2026-04-10T14:27:18.109Z" }, + { url = "https://files.pythonhosted.org/packages/a6/95/8c7c7028aa8636ac21b7a55faef3e34215e6ed0cbf5ae58258427f621aa3/jiter-0.14.0-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:a4d50ea3d8ba4176f79754333bd35f1bbcd28e91adc13eb9b7ca91bc52a6cef9", size = 315923, upload-time = "2026-04-10T14:27:19.603Z" }, + { url = 
"https://files.pythonhosted.org/packages/47/40/e2a852a44c4a089f2681a16611b7ce113224a80fd8504c46d78491b47220/jiter-0.14.0-cp314-cp314-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ce17f8a050447d1b4153bda4fb7d26e6a9e74eb4f4a41913f30934c5075bf615", size = 344943, upload-time = "2026-04-10T14:27:21.262Z" }, + { url = "https://files.pythonhosted.org/packages/fc/1f/670f92adee1e9895eac41e8a4d623b6da68c4d46249d8b556b60b63f949e/jiter-0.14.0-cp314-cp314-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:f4f1c4b125e1652aefbc2e2c1617b60a160ab789d180e3d423c41439e5f32850", size = 369725, upload-time = "2026-04-10T14:27:22.766Z" }, + { url = "https://files.pythonhosted.org/packages/01/2f/541c9ba567d05de1c4874a0f8f8c5e3fd78e2b874266623da9a775cf46e0/jiter-0.14.0-cp314-cp314-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:be808176a6a3a14321d18c603f2d40741858a7c4fc982f83232842689fe86dd9", size = 461210, upload-time = "2026-04-10T14:27:24.315Z" }, + { url = "https://files.pythonhosted.org/packages/ce/a9/c31cbec09627e0d5de7aeaec7690dba03e090caa808fefd8133137cf45bc/jiter-0.14.0-cp314-cp314-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:26679d58ba816f88c3849306dd58cb863a90a1cf352cdd4ef67e30ccf8a77994", size = 380002, upload-time = "2026-04-10T14:27:26.155Z" }, + { url = "https://files.pythonhosted.org/packages/50/02/3c05c1666c41904a2f607475a73e7a4763d1cbde2d18229c4f85b22dc253/jiter-0.14.0-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:80381f5a19af8fa9aef743f080e34f6b25ebd89656475f8cf0470ec6157052aa", size = 354678, upload-time = "2026-04-10T14:27:27.701Z" }, + { url = "https://files.pythonhosted.org/packages/7d/97/e15b33545c2b13518f560d695f974b9891b311641bdcf178d63177e8801e/jiter-0.14.0-cp314-cp314-manylinux_2_31_riscv64.whl", hash = "sha256:004df5fdb8ecbd6d99f3227df18ba1a259254c4359736a2e6f036c944e02d7c5", size = 358920, upload-time = "2026-04-10T14:27:29.256Z" }, + { url = 
"https://files.pythonhosted.org/packages/ad/d2/8b1461def6b96ba44530df20d07ef7a1c7da22f3f9bf1727e2d611077bf1/jiter-0.14.0-cp314-cp314-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:cff5708f7ed0fa098f2b53446c6fa74c48469118e5cd7497b4f1cd569ab06928", size = 394512, upload-time = "2026-04-10T14:27:31.344Z" }, + { url = "https://files.pythonhosted.org/packages/e3/88/837566dd6ed6e452e8d3205355afd484ce44b2533edfa4ed73a298ea893e/jiter-0.14.0-cp314-cp314-musllinux_1_1_aarch64.whl", hash = "sha256:2492e5f06c36a976d25c7cc347a60e26d5470178d44cde1b9b75e60b4e519f28", size = 521120, upload-time = "2026-04-10T14:27:33.299Z" }, + { url = "https://files.pythonhosted.org/packages/89/6b/b00b45c4d1b4c031777fe161d620b755b5b02cdade1e316dcb46e4471d63/jiter-0.14.0-cp314-cp314-musllinux_1_1_x86_64.whl", hash = "sha256:7609cfbe3a03d37bfdbf5052012d5a879e72b83168a363deae7b3a26564d57de", size = 553668, upload-time = "2026-04-10T14:27:34.868Z" }, + { url = "https://files.pythonhosted.org/packages/ad/d8/6fe5b42011d19397433d345716eac16728ac241862a2aac9c91923c7509a/jiter-0.14.0-cp314-cp314-win32.whl", hash = "sha256:7282342d32e357543565286b6450378c3cd402eea333fc1ebe146f1fabb306fc", size = 207001, upload-time = "2026-04-10T14:27:36.455Z" }, + { url = "https://files.pythonhosted.org/packages/e5/43/5c2e08da1efad5e410f0eaaabeadd954812612c33fbbd8fd5328b489139d/jiter-0.14.0-cp314-cp314-win_amd64.whl", hash = "sha256:bd77945f38866a448e73b0b7637366afa814d4617790ecd88a18ca74377e6c02", size = 202187, upload-time = "2026-04-10T14:27:38Z" }, + { url = "https://files.pythonhosted.org/packages/aa/1f/6e39ac0b4cdfa23e606af5b245df5f9adaa76f35e0c5096790da430ca506/jiter-0.14.0-cp314-cp314-win_arm64.whl", hash = "sha256:f2d4c61da0821ee42e0cdf5489da60a6d074306313a377c2b35af464955a3611", size = 192257, upload-time = "2026-04-10T14:27:39.504Z" }, + { url = "https://files.pythonhosted.org/packages/05/57/7dbc0ffbbb5176a27e3518716608aa464aee2e2887dc938f0b900a120449/jiter-0.14.0-cp314-cp314t-macosx_11_0_arm64.whl", 
hash = "sha256:1bf7ff85517dd2f20a5750081d2b75083c1b269cf75afc7511bdf1f9548beb3b", size = 323441, upload-time = "2026-04-10T14:27:41.039Z" }, + { url = "https://files.pythonhosted.org/packages/83/6e/7b3314398d8983f06b557aa21b670511ec72d3b79a68ee5e4d9bff972286/jiter-0.14.0-cp314-cp314t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c8ef8791c3e78d6c6b157c6d360fbb5c715bebb8113bc6a9303c5caff012754a", size = 348109, upload-time = "2026-04-10T14:27:42.552Z" }, + { url = "https://files.pythonhosted.org/packages/ae/4f/8dc674bcd7db6dba566de73c08c763c337058baff1dbeb34567045b27cdc/jiter-0.14.0-cp314-cp314t-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:e74663b8b10da1fe0f4e4703fd7980d24ad17174b6bb35d8498d6e3ebce2ae6a", size = 368328, upload-time = "2026-04-10T14:27:44.574Z" }, + { url = "https://files.pythonhosted.org/packages/3b/5f/188e09a1f20906f98bbdec44ed820e19f4e8eb8aff88b9d1a5a497587ff3/jiter-0.14.0-cp314-cp314t-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:1aca29ba52913f78362ec9c2da62f22cdc4c3083313403f90c15460979b84d9b", size = 463301, upload-time = "2026-04-10T14:27:46.717Z" }, + { url = "https://files.pythonhosted.org/packages/ac/f0/19046ef965ed8f349e8554775bb12ff4352f443fbe12b95d31f575891256/jiter-0.14.0-cp314-cp314t-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:8b39b7d87a952b79949af5fef44d2544e58c21a28da7f1bae3ef166455c61746", size = 378891, upload-time = "2026-04-10T14:27:48.32Z" }, + { url = "https://files.pythonhosted.org/packages/c4/c3/da43bd8431ee175695777ee78cf0e93eacbb47393ff493f18c45231b427d/jiter-0.14.0-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:78d918a68b26e9fab068c2b5453577ef04943ab2807b9a6275df2a812599a310", size = 360749, upload-time = "2026-04-10T14:27:49.88Z" }, + { url = "https://files.pythonhosted.org/packages/72/26/e054771be889707c6161dbdec9c23d33a9ec70945395d70f07cfea1e9a6f/jiter-0.14.0-cp314-cp314t-manylinux_2_31_riscv64.whl", hash = 
"sha256:b08997c35aee1201c1a5361466a8fb9162d03ae7bf6568df70b6c859f1e654a4", size = 358526, upload-time = "2026-04-10T14:27:51.504Z" }, + { url = "https://files.pythonhosted.org/packages/c3/0f/7bea65ea2a6d91f2bf989ff11a18136644392bf2b0497a1fa50934c30a9c/jiter-0.14.0-cp314-cp314t-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:260bf7ca20704d58d41f669e5e9fe7fe2fa72901a6b324e79056f5d52e9c9be2", size = 393926, upload-time = "2026-04-10T14:27:53.368Z" }, + { url = "https://files.pythonhosted.org/packages/3c/a1/b1ff7d70deef61ac0b7c6c2f12d2ace950cdeecb4fdc94500a0926802857/jiter-0.14.0-cp314-cp314t-musllinux_1_1_aarch64.whl", hash = "sha256:37826e3df29e60f30a382f9294348d0238ef127f4b5d7f5f8da78b5b9e050560", size = 521052, upload-time = "2026-04-10T14:27:55.058Z" }, + { url = "https://files.pythonhosted.org/packages/0b/7b/3b0649983cbaf15eda26a414b5b1982e910c67bd6f7b1b490f3cfc76896a/jiter-0.14.0-cp314-cp314t-musllinux_1_1_x86_64.whl", hash = "sha256:645be49c46f2900937ba0eaf871ad5183c96858c0af74b6becc7f4e367e36e06", size = 553716, upload-time = "2026-04-10T14:27:57.269Z" }, + { url = "https://files.pythonhosted.org/packages/97/f8/33d78c83bd93ae0c0af05293a6660f88a1977caef39a6d72a84afab94ce0/jiter-0.14.0-cp314-cp314t-win32.whl", hash = "sha256:2f7877ed45118de283786178eceaf877110abacd04fde31efff3940ae9672674", size = 207957, upload-time = "2026-04-10T14:27:59.285Z" }, + { url = "https://files.pythonhosted.org/packages/d6/ac/2b760516c03e2227826d1f7025d89bf6bf6357a28fe75c2a2800873c50bf/jiter-0.14.0-cp314-cp314t-win_amd64.whl", hash = "sha256:14c0cb10337c49f5eafe8e7364daca5e29a020ea03580b8f8e6c597fed4e1588", size = 204690, upload-time = "2026-04-10T14:28:00.962Z" }, + { url = "https://files.pythonhosted.org/packages/dc/2e/a44c20c58aeed0355f2d326969a181696aeb551a25195f47563908a815be/jiter-0.14.0-cp314-cp314t-win_arm64.whl", hash = "sha256:5419d4aa2024961da9fe12a9cfe7484996735dca99e8e090b5c88595ef1951ff", size = 191338, upload-time = "2026-04-10T14:28:02.853Z" }, + { url = 
"https://files.pythonhosted.org/packages/21/42/9042c3f3019de4adcb8c16591c325ec7255beea9fcd33a42a43f3b0b1000/jiter-0.14.0-graalpy312-graalpy250_312_native-macosx_10_12_x86_64.whl", hash = "sha256:fbd9e482663ca9d005d051330e4d2d8150bb208a209409c10f7e7dfdf7c49da9", size = 308810, upload-time = "2026-04-10T14:28:34.673Z" }, + { url = "https://files.pythonhosted.org/packages/60/cf/a7e19b308bd86bb04776803b1f01a5f9a287a4c55205f4708827ee487fbf/jiter-0.14.0-graalpy312-graalpy250_312_native-macosx_11_0_arm64.whl", hash = "sha256:33a20d838b91ef376b3a56896d5b04e725c7df5bc4864cc6569cf046a8d73b6d", size = 308443, upload-time = "2026-04-10T14:28:36.658Z" }, + { url = "https://files.pythonhosted.org/packages/ca/44/e26ede3f0caeff93f222559cb0cc4ca68579f07d009d7b6010c5b586f9b1/jiter-0.14.0-graalpy312-graalpy250_312_native-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:432c4db5255d86a259efde91e55cb4c8d18c0521d844c9e2e7efcce3899fb016", size = 343039, upload-time = "2026-04-10T14:28:38.356Z" }, + { url = "https://files.pythonhosted.org/packages/da/e9/1f9ada30cef7b05e74bb06f52127e7a724976c225f46adb65c37b1dadfb6/jiter-0.14.0-graalpy312-graalpy250_312_native-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:67f00d94b281174144d6532a04b66a12cb866cbdc47c3af3bfe2973677f9861a", size = 349613, upload-time = "2026-04-10T14:28:40.066Z" }, +] + [[package]] name = "markdown-it-py" version = "4.0.0" @@ -475,6 +584,15 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/e0/f9/0595336914c5619e5f28a1fb793285925a8cd4b432c9da0a987836c7f822/shellingham-1.5.4-py2.py3-none-any.whl", hash = "sha256:7ecfff8f2fd72616f7481040475a65b2bf8af90a56c89140852d1120324e8686", size = 9755, upload-time = "2023-10-24T04:13:38.866Z" }, ] +[[package]] +name = "sniffio" +version = "1.3.1" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/a2/87/a6771e1546d97e7e041b6ae58d80074f81b7d5121207425c964ddf5cfdbd/sniffio-1.3.1.tar.gz", 
hash = "sha256:f4324edc670a0f49750a81b895f35c3adb843cca46f0530f79fc1babb23789dc", size = 20372, upload-time = "2024-02-25T23:20:04.057Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/e9/44/75a9c9421471a6c4805dbf2356f7c181a29c1879239abab1ea2cc8f38b40/sniffio-1.3.1-py3-none-any.whl", hash = "sha256:2f6da418d1f1e0fddd844478f41680e794e6051915791a034ff65e5f100525a2", size = 10235, upload-time = "2024-02-25T23:20:01.196Z" }, +] + [[package]] name = "typer" version = "0.24.2" @@ -516,6 +634,7 @@ name = "wikifi" version = "0.1.0" source = { editable = "." } dependencies = [ + { name = "anthropic" }, { name = "ollama" }, { name = "pathspec" }, { name = "pydantic" }, @@ -534,6 +653,7 @@ dev = [ [package.metadata] requires-dist = [ + { name = "anthropic", specifier = ">=0.40" }, { name = "ollama", specifier = ">=0.4.0" }, { name = "pathspec", specifier = ">=0.12" }, { name = "pydantic", specifier = ">=2.6" }, diff --git a/wikifi/aggregator.py b/wikifi/aggregator.py index fdbe2ca..cad0350 100644 --- a/wikifi/aggregator.py +++ b/wikifi/aggregator.py @@ -1,20 +1,41 @@ """Stage 3 — per-section synthesis. Reads the JSONL notes accumulated by the extractor and asks the LLM to -synthesize each section's final markdown. One LLM call per section, with the -section description as the contract for what should appear and what shouldn't. +synthesize each section's final markdown along with **structured +evidence**: a list of supported claims and any contradictions surfaced +across the file findings. -Sections with zero notes get a placeholder body so the wiki layout stays -complete and the absence is visible. +Three behaviors set this stage apart from a vanilla "merge LLM output": + +1. **Citations.** Every claim records the source files (and line ranges + when the extractor knew them) it draws from. The renderer threads + those into the final markdown as numbered footnotes. +2. 
**Contradiction surfacing.** When two or more files disagree about + the same domain claim — a frequent case in legacy systems where + tribal knowledge hides in inconsistencies — the conflict is rendered + under a "Conflicts in source" heading rather than silently merged. +3. **Section-level cache.** A digest of the section's note payload is + compared against the prior walk's; if the notes are byte-identical, + the cached body and evidence are reused without re-calling the LLM. """ from __future__ import annotations +import json import logging from dataclasses import dataclass from pydantic import BaseModel, Field +from wikifi.cache import WalkCache, hash_section_notes +from wikifi.evidence import ( + Claim, + Contradiction, + EvidenceBundle, + SourceRef, + coalesce_refs, + render_section_body, +) from wikifi.providers.base import LLMProvider from wikifi.sections import PRIMARY_SECTIONS, Section from wikifi.wiki import WikiLayout, read_notes, write_section @@ -25,38 +46,77 @@ You are wikifi's section aggregator. You receive structured notes that an \ extractor pass collected from individual source files in a target codebase, \ along with the brief for one section of a tech-agnostic wiki. Synthesize a \ -clean markdown body for that section. +clean markdown body for that section *and* expose the evidence you used. + +Each note carries an `[index]` tag — when you make a claim that draws on \ +specific notes, list those indices in the corresponding `claim.source_indices`. \ +Indices are 1-based and refer to the numbered notes in the user prompt. Rules: - Tech-agnostic. Never mention specific languages, frameworks, or libraries. \ Translate every observation into domain terms. -- Coherent narrative — not a transcript of the notes. Merge duplicates, \ - resolve contradictions, organize by domain logic. -- Use markdown sub-headings, lists, and tables where they help the reader. -- If notes are sparse or contradictory, say so plainly. 
Better to declare a \ - gap than to invent content. -- Output the body only. Do not repeat the section title (the writer adds it). +- Coherent narrative — not a transcript of the notes. Merge consistent \ + statements, organize by domain logic. +- DO NOT silently resolve contradictions. If two notes assert incompatible \ + things about the same topic, emit a `contradictions[]` entry naming each \ + position and the source-note indices that support it. +- Use markdown sub-headings, lists, and tables in `body` where they help. +- Keep `body` focused on prose; the renderer adds the citation footer and \ + "Conflicts in source" block from the structured `claims`/`contradictions` \ + fields. Don't duplicate citations inline. +- If notes are sparse or contradictory, say so plainly rather than inventing. +- Output the body only (no top-level heading); the writer adds the title. """ +class AggregatedClaim(BaseModel): + """One claim the aggregator extracted, indexed against the input notes.""" + + text: str = Field(description="One assertion in the synthesized body.") + source_indices: list[int] = Field( + default_factory=list, + description="1-based indices of the notes that justify this claim.", + ) + + +class AggregatedContradiction(BaseModel): + summary: str = Field(description="One-sentence description of the disagreement.") + positions: list[AggregatedClaim] = Field( + default_factory=list, + description="Each disagreeing position, with its own supporting note indices.", + ) + + class SectionBody(BaseModel): - """The final markdown body for a section.""" + """The aggregator's structured output for a single section.""" body: str = Field(description="Markdown content for the section, no top-level heading.") + claims: list[AggregatedClaim] = Field(default_factory=list) + contradictions: list[AggregatedContradiction] = Field(default_factory=list) @dataclass class AggregationStats: sections_written: int = 0 sections_empty: int = 0 + sections_cached: int = 0 -def 
aggregate_all(*, layout: WikiLayout, provider: LLMProvider) -> AggregationStats: +def aggregate_all( + *, + layout: WikiLayout, + provider: LLMProvider, + cache: WalkCache | None = None, +) -> AggregationStats: """Aggregate every primary section from its accumulated notes. Derivative sections (personas, user stories, diagrams) are populated by `wikifi.deriver.derive_all` after this stage — they have no per-file notes to aggregate from. + + When ``cache`` is supplied and the section's note digest is unchanged + from the prior walk, the cached body and evidence are reused without + invoking the LLM. """ stats = AggregationStats() for section in PRIMARY_SECTIONS: @@ -65,20 +125,96 @@ def aggregate_all(*, layout: WikiLayout, provider: LLMProvider) -> AggregationSt write_section(layout, section, _empty_body(section)) stats.sections_empty += 1 continue + + notes_hash = hash_section_notes(notes) + if cache is not None: + cached = cache.lookup_aggregation(section.id, notes_hash) + if cached is not None: + bundle = EvidenceBundle( + body=cached.body, + claims=[Claim.model_validate(c) for c in cached.claims], + contradictions=[Contradiction.model_validate(c) for c in cached.contradictions], + ) + write_section(layout, section, render_section_body(bundle)) + stats.sections_cached += 1 + stats.sections_written += 1 + continue + try: - body = provider.complete_json( + structured = provider.complete_json( system=AGGREGATION_SYSTEM_PROMPT, user=_render_user_prompt(section, notes), schema=SectionBody, - ).body + ) + bundle = _bundle_from(structured, notes) + rendered = render_section_body(bundle) except Exception as exc: log.warning("aggregation failed for %s: %s", section.id, exc) - body = _fallback_body(section, notes, error=str(exc)) - write_section(layout, section, body) + rendered = _fallback_body(section, notes, error=str(exc)) + bundle = None + + write_section(layout, section, rendered) stats.sections_written += 1 + + if cache is not None and bundle is not None: + 
cache.record_aggregation( + section.id, + notes_hash=notes_hash, + body=bundle.body, + claims=[c.model_dump() for c in bundle.claims], + contradictions=[c.model_dump() for c in bundle.contradictions], + ) + return stats +def _bundle_from(structured: SectionBody, notes: list[dict]) -> EvidenceBundle: + """Resolve note indices into concrete :class:`SourceRef` lists.""" + note_refs = _refs_per_note(notes) + + def resolve(indices: list[int]) -> list[SourceRef]: + refs: list[SourceRef] = [] + for idx in indices: + real = idx - 1 + if 0 <= real < len(note_refs): + refs.extend(note_refs[real]) + return coalesce_refs(refs) + + claims = [Claim(text=c.text, sources=resolve(c.source_indices)) for c in structured.claims] + contradictions = [ + Contradiction( + summary=c.summary, + positions=[Claim(text=p.text, sources=resolve(p.source_indices)) for p in c.positions], + ) + for c in structured.contradictions + ] + return EvidenceBundle(body=structured.body, claims=claims, contradictions=contradictions) + + +def _refs_per_note(notes: list[dict]) -> list[list[SourceRef]]: + """Map each note to its source refs. + + Notes produced by the modern extractor carry a ``sources`` list; + older notes (or hand-written ones) fall back to a single SourceRef + derived from the ``file`` field. 
+ """ + out: list[list[SourceRef]] = [] + for note in notes: + sources = note.get("sources") + if isinstance(sources, list) and sources: + try: + out.append([SourceRef.model_validate(s) for s in sources]) + continue + except Exception: # malformed sources — fall back to file + pass + file = note.get("file") + if file: + out.append([SourceRef(file=str(file))]) + else: + out.append([]) + return out + + def _render_user_prompt(section: Section, notes: list[dict]) -> str: lines: list[str] = [] lines.append(f"## Section: {section.title} (id: {section.id})") @@ -86,17 +222,38 @@ def _render_user_prompt(section: Section, notes: list[dict]) -> str: lines.append("### Brief") lines.append(section.description) lines.append("") - lines.append(f"### Notes from {len(notes)} file(s)") - for note in notes: + lines.append(f"### Notes from {len(notes)} file(s) — referenced by 1-based index in `source_indices`") + for idx, note in enumerate(notes, start=1): file_ref = note.get("file", "?") summary = note.get("summary", "") finding = note.get("finding", "") - lines.append(f"- [{file_ref}] (file role: {summary}) {finding}") + sources = note.get("sources") or [] + ranges = ", ".join(_format_source(s) for s in sources) if sources else file_ref + role = f" (file role: {summary})" if summary else "" + lines.append(f"[{idx}] {ranges}{role}: {finding}") lines.append("") - lines.append("Synthesize a coherent markdown body for this section. Follow the rules in the system prompt.") + lines.append( + "Synthesize a coherent markdown body for this section in `body`, " + "and populate `claims` (with the 1-based note indices that justify " + "each one) and `contradictions` for any disagreements. Follow the " + "rules in the system prompt." 
+ ) return "\n".join(lines) +def _format_source(source: dict | SourceRef) -> str: + if isinstance(source, SourceRef): + return source.render() + file = source.get("file", "?") + lines = source.get("lines") + if not lines: + return file + if isinstance(lines, list | tuple) and len(lines) == 2: + start, end = lines + return f"{file}:{start}-{end}" if start != end else f"{file}:{start}" + return file + + def _empty_body(section: Section) -> str: return ( f"_No findings were extracted for **{section.title}** during the last walk._\n\n" @@ -117,3 +274,15 @@ def _fallback_body(section: Section, notes: list[dict], *, error: str) -> str: finding = note.get("finding", "") lines.append(f"- **{file_ref}** — {finding}") return "\n".join(lines) + + +__all__ = [ + "AGGREGATION_SYSTEM_PROMPT", + "AggregatedClaim", + "AggregatedContradiction", + "AggregationStats", + "SectionBody", + "aggregate_all", +] +# json kept for downstream debugging needs +_ = json diff --git a/wikifi/cache.py b/wikifi/cache.py new file mode 100644 index 0000000..fecc97b --- /dev/null +++ b/wikifi/cache.py @@ -0,0 +1,280 @@ +"""Content-addressed cache for the walk pipeline. + +The cache turns a clean re-walk of a 50k-file legacy monorepo from "hours" +to "minutes-of-changed-files-only". Two scopes are persisted: + +- **Per-file extraction cache.** Keyed by ``(rel_path, file_fingerprint)``, + values are the list of structured findings the extractor produced. If a + file's bytes haven't changed since the last walk the cache entry is + reused verbatim and no LLM call is made. +- **Per-section aggregation cache.** Keyed by the SHA-256 of the section's + full notes payload (after extraction completes). If the notes payload + is bit-identical to last walk's, the cached markdown body is reused + rather than calling the aggregator again. 
+ +Resumability falls out of the per-file cache for free: a walk that crashes +at file 8127/10000 picks up exactly where it left off because the previous +8126 files' fingerprints are still in the cache from the last successful +extraction call. + +Cache files live under ``.wikifi/.cache/`` so they share the wiki's +git-ignore rules but stay out of the section markdown that *is* committed. +""" + +from __future__ import annotations + +import json +import logging +from dataclasses import dataclass, field +from datetime import UTC, datetime +from pathlib import Path +from typing import Any + +from wikifi.wiki import WikiLayout + +log = logging.getLogger("wikifi.cache") + +CACHE_DIRNAME = ".cache" +EXTRACTION_CACHE_FILENAME = "extraction.json" +AGGREGATION_CACHE_FILENAME = "aggregation.json" +CACHE_VERSION = 1 # bump to invalidate every cache entry across upgrades + + +@dataclass +class CachedFindings: + """Per-file findings recovered from cache.""" + + fingerprint: str + findings: list[dict[str, Any]] + summary: str = "" + chunks_processed: int = 0 + + +@dataclass +class CachedSection: + """Per-section aggregator output recovered from cache.""" + + notes_hash: str + body: str + claims: list[dict[str, Any]] = field(default_factory=list) + contradictions: list[dict[str, Any]] = field(default_factory=list) + + +@dataclass +class WalkCache: + """Mutable in-memory view of both caches; persisted via :func:`save`.""" + + extraction: dict[str, CachedFindings] = field(default_factory=dict) + aggregation: dict[str, CachedSection] = field(default_factory=dict) + extraction_hits: int = 0 + extraction_misses: int = 0 + aggregation_hits: int = 0 + aggregation_misses: int = 0 + + # ----- extraction scope ----- + + def lookup_extraction(self, rel_path: str, fingerprint: str) -> CachedFindings | None: + entry = self.extraction.get(rel_path) + if entry is None or entry.fingerprint != fingerprint: + self.extraction_misses += 1 + return None + self.extraction_hits += 1 + return entry + 
+ def record_extraction( + self, + rel_path: str, + *, + fingerprint: str, + findings: list[dict[str, Any]], + summary: str, + chunks_processed: int, + ) -> None: + self.extraction[rel_path] = CachedFindings( + fingerprint=fingerprint, + findings=list(findings), + summary=summary, + chunks_processed=chunks_processed, + ) + + def forget_extraction(self, rel_path: str) -> None: + self.extraction.pop(rel_path, None) + + def prune_extraction(self, *, keep: set[str]) -> int: + """Drop cache entries for files no longer in scope. Returns count removed.""" + removed = [path for path in list(self.extraction) if path not in keep] + for path in removed: + del self.extraction[path] + return len(removed) + + # ----- aggregation scope ----- + + def lookup_aggregation(self, section_id: str, notes_hash: str) -> CachedSection | None: + entry = self.aggregation.get(section_id) + if entry is None or entry.notes_hash != notes_hash: + self.aggregation_misses += 1 + return None + self.aggregation_hits += 1 + return entry + + def record_aggregation( + self, + section_id: str, + *, + notes_hash: str, + body: str, + claims: list[dict[str, Any]] | None = None, + contradictions: list[dict[str, Any]] | None = None, + ) -> None: + self.aggregation[section_id] = CachedSection( + notes_hash=notes_hash, + body=body, + claims=list(claims or []), + contradictions=list(contradictions or []), + ) + + +# --------------------------------------------------------------------------- +# Persistence +# --------------------------------------------------------------------------- + + +def cache_dir(layout: WikiLayout) -> Path: + return layout.wiki_dir / CACHE_DIRNAME + + +def extraction_cache_path(layout: WikiLayout) -> Path: + return cache_dir(layout) / EXTRACTION_CACHE_FILENAME + + +def aggregation_cache_path(layout: WikiLayout) -> Path: + return cache_dir(layout) / AGGREGATION_CACHE_FILENAME + + +def load(layout: WikiLayout) -> WalkCache: + """Load both caches from disk. 
Missing or invalid files yield an empty cache.""" + cache = WalkCache() + cache.extraction = _load_extraction(extraction_cache_path(layout)) + cache.aggregation = _load_aggregation(aggregation_cache_path(layout)) + return cache + + +def save(layout: WikiLayout, cache: WalkCache) -> None: + """Persist both caches atomically.""" + cache_dir(layout).mkdir(parents=True, exist_ok=True) + _atomic_write_json( + extraction_cache_path(layout), + { + "version": CACHE_VERSION, + "saved_at": datetime.now(UTC).isoformat(), + "entries": { + path: { + "fingerprint": entry.fingerprint, + "summary": entry.summary, + "chunks_processed": entry.chunks_processed, + "findings": entry.findings, + } + for path, entry in cache.extraction.items() + }, + }, + ) + _atomic_write_json( + aggregation_cache_path(layout), + { + "version": CACHE_VERSION, + "saved_at": datetime.now(UTC).isoformat(), + "entries": { + sid: { + "notes_hash": entry.notes_hash, + "body": entry.body, + "claims": entry.claims, + "contradictions": entry.contradictions, + } + for sid, entry in cache.aggregation.items() + }, + }, + ) + + +def reset(layout: WikiLayout) -> None: + """Delete every cache file. 
Triggered by `walk --no-cache` and tests.""" + for path in (extraction_cache_path(layout), aggregation_cache_path(layout)): + if path.exists(): + path.unlink() + + +def _atomic_write_json(path: Path, payload: dict[str, Any]) -> None: + tmp = path.with_suffix(path.suffix + ".tmp") + tmp.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8") + tmp.replace(path) + + +def _load_extraction(path: Path) -> dict[str, CachedFindings]: + raw = _load_json(path) + if not raw or raw.get("version") != CACHE_VERSION: + return {} + out: dict[str, CachedFindings] = {} + for rel, entry in raw.get("entries", {}).items(): + try: + out[rel] = CachedFindings( + fingerprint=entry["fingerprint"], + findings=list(entry.get("findings", [])), + summary=entry.get("summary", ""), + chunks_processed=int(entry.get("chunks_processed", 0)), + ) + except (KeyError, TypeError, ValueError) as exc: + log.warning("dropping malformed extraction cache entry %s: %s", rel, exc) + return out + + +def _load_aggregation(path: Path) -> dict[str, CachedSection]: + raw = _load_json(path) + if not raw or raw.get("version") != CACHE_VERSION: + return {} + out: dict[str, CachedSection] = {} + for sid, entry in raw.get("entries", {}).items(): + try: + out[sid] = CachedSection( + notes_hash=entry["notes_hash"], + body=entry.get("body", ""), + claims=list(entry.get("claims", [])), + contradictions=list(entry.get("contradictions", [])), + ) + except (KeyError, TypeError, ValueError) as exc: + log.warning("dropping malformed aggregation cache entry %s: %s", sid, exc) + return out + + +def _load_json(path: Path) -> dict[str, Any] | None: + if not path.exists(): + return None + try: + return json.loads(path.read_text(encoding="utf-8")) + except (OSError, json.JSONDecodeError) as exc: + log.warning("could not load cache at %s: %s; starting fresh", path, exc) + return None + + +# --------------------------------------------------------------------------- +# Hash helpers used at the section boundary 
+# --------------------------------------------------------------------------- + + +def hash_section_notes(notes: list[dict[str, Any]]) -> str: + """Stable digest of a section's note payload for aggregation cache keys. + + The hash spans only the *content* fields the aggregator actually reads + (file ref, summary, finding) — not timestamps or per-walk debug fields — + so regenerating identical notes on a fresh walk reuses the cached body. + """ + from wikifi.fingerprint import hash_text + + payload = [ + { + "file": n.get("file", ""), + "summary": n.get("summary", ""), + "finding": n.get("finding", ""), + } + for n in notes + ] + return hash_text(json.dumps(payload, ensure_ascii=False, sort_keys=True)) diff --git a/wikifi/cli.py b/wikifi/cli.py index e9231d0..ab4fc76 100644 --- a/wikifi/cli.py +++ b/wikifi/cli.py @@ -5,6 +5,7 @@ - ``wikifi init`` — scaffold the ``.wikifi/`` directory in CWD - ``wikifi walk`` — run the full Stage 1→2→3→4 pipeline against CWD - ``wikifi chat`` — interactive REPL with ``.wikifi/`` content as context +- ``wikifi report`` — coverage + quality report on the wiki """ from __future__ import annotations @@ -15,13 +16,16 @@ import typer from rich.console import Console +from rich.markdown import Markdown from rich.panel import Panel from rich.table import Table from wikifi import __version__ +from wikifi.cache import reset as reset_cache from wikifi.chat import run_repl from wikifi.config import get_settings from wikifi.orchestrator import build_provider, init_wiki, run_walk +from wikifi.report import build_report from wikifi.wiki import WikiLayout app = typer.Typer( @@ -79,13 +83,38 @@ def init(target: TargetArg = None) -> None: @app.command() -def walk(target: TargetArg = None) -> None: +def walk( + target: TargetArg = None, + no_cache: Annotated[ + bool, typer.Option("--no-cache", help="Force a clean re-walk; drop the on-disk cache.") + ] = False, + review: Annotated[ + bool, + typer.Option("--review/--no-review", help="Run the critic + 
reviser loop on derivative sections."), + ] = False, + provider: Annotated[ + str | None, + typer.Option("--provider", help="Override the configured provider for this walk ('ollama' | 'anthropic')."), + ] = None, +) -> None: """Walk the target codebase and populate every wiki section.""" target = target or Path.cwd() settings = get_settings() + if no_cache: + settings = settings.model_copy(update={"use_cache": False}) + reset_cache(WikiLayout(root=target)) + if review: + settings = settings.model_copy(update={"review_derivatives": True}) + if provider: + settings = settings.model_copy(update={"provider": provider}) + console.print( Panel.fit( - f"[bold]wikifi walk[/bold] — target=[cyan]{target}[/cyan] model=[cyan]{settings.model}[/cyan]", + f"[bold]wikifi walk[/bold] — target=[cyan]{target}[/cyan] " + f"provider=[cyan]{settings.provider}[/cyan] model=[cyan]{settings.model}[/cyan]\n" + f"cache=[cyan]{settings.use_cache}[/cyan] graph=[cyan]{settings.use_graph}[/cyan] " + f"specialized=[cyan]{settings.use_specialized_extractors}[/cyan] " + f"review=[cyan]{settings.review_derivatives}[/cyan]", title="starting", ) ) @@ -100,21 +129,27 @@ def walk(target: TargetArg = None) -> None: f"exclude={len(report.introspection.exclude)} " f"langs={', '.join(report.introspection.primary_languages) or '?'}", ) - table.add_row( - "2. Extraction", + extraction_row = ( f"seen={report.extraction.files_seen} " f"contributed={report.extraction.files_with_findings} " f"findings={report.extraction.findings_total} " - f"skipped={report.extraction.files_skipped}", + f"skipped={report.extraction.files_skipped} " + f"cache_hits={report.extraction.cache_hits} " + f"specialized={report.extraction.specialized_files}" ) + table.add_row("2. Extraction", extraction_row) table.add_row( "3. 
Aggregation", - f"sections_written={report.aggregation.sections_written} sections_empty={report.aggregation.sections_empty}", + f"sections_written={report.aggregation.sections_written} " + f"sections_empty={report.aggregation.sections_empty} " + f"sections_cached={report.aggregation.sections_cached}", ) - table.add_row( - "4. Derivation", - f"sections_derived={report.derivation.sections_derived} sections_skipped={report.derivation.sections_skipped}", + derivation_row = ( + f"sections_derived={report.derivation.sections_derived} " + f"sections_skipped={report.derivation.sections_skipped} " + f"sections_revised={report.derivation.sections_revised}" ) + table.add_row("4. Derivation", derivation_row) console.print(table) console.print(f"\n[green]Done.[/green] Wiki at [bold]{target}/.wikifi/[/bold]") @@ -136,6 +171,30 @@ def chat(target: TargetArg = None) -> None: run_repl(layout=layout, provider=provider, console=console) +@app.command() +def report( + target: TargetArg = None, + score: Annotated[ + bool, + typer.Option("--score/--no-score", help="Run the critic on every populated section for quality scoring."), + ] = False, +) -> None: + """Print a coverage + quality report for the wiki at ``target``.""" + target = target or Path.cwd() + layout = WikiLayout(root=target) + if not layout.wiki_dir.exists(): + console.print( + f"[red]No .wikifi/ directory at {target}.[/red] " + "Run [bold]wikifi init[/bold] and [bold]wikifi walk[/bold] first." + ) + raise typer.Exit(code=1) + + settings = get_settings() + provider = build_provider(settings) if score else None + wiki_report = build_report(layout=layout, provider=provider, score=score) + console.print(Markdown(wiki_report.render())) + + def main() -> None: """Entry point referenced by [project.scripts] in pyproject.toml.""" app() diff --git a/wikifi/config.py b/wikifi/config.py index fad9fb5..ea093bf 100644 --- a/wikifi/config.py +++ b/wikifi/config.py @@ -2,6 +2,9 @@ Defaults assume a local Ollama server with qwen3.6:27b. 
Override any field via WIKIFI_* env vars or a .env file in the target project's CWD. + +The hosted Anthropic provider is selected via ``WIKIFI_PROVIDER=anthropic`` +(plus ``ANTHROPIC_API_KEY`` from env). """ from __future__ import annotations @@ -20,7 +23,7 @@ class Settings(BaseSettings): extra="ignore", ) - provider: str = Field(default="ollama", description="LLM provider id; only 'ollama' in v1") + provider: str = Field(default="ollama", description="LLM provider id; 'ollama' (default) or 'anthropic'") model: str = Field(default="qwen3.6:27b", description="Model identifier passed to the provider") ollama_host: str = Field(default="http://localhost:11434", description="Ollama HTTP endpoint") request_timeout: float = Field(default=900.0, description="Per-request timeout in seconds") @@ -50,18 +53,68 @@ class Settings(BaseSettings): description="Skip files whose stripped content is shorter than this (avoids thinking runaway on stubs)", ) introspection_depth: int = Field(default=3, description="Tree depth fed to the introspection pass") - # Thinking mode for reasoning-capable models (Qwen3, DeepSeek-R1, etc.). + # Thinking mode for reasoning-capable models (Qwen3, DeepSeek-R1, Anthropic). # Default 'high' — wikifi prioritizes wiki quality over walk wall-time. - # Higher thinking levels produce noticeably better domain abstraction and - # cleaner Gherkin in the derivative pass; expect 1–3 minutes per real - # file on a local 27B model. The min_content_bytes guard keeps the - # thinking-runaway-on-stubs failure mode at bay. - # Accepted values: 'low' / 'medium' / 'high' (Qwen3-style); True - # (DeepSeek-style); False to opt out entirely (only safe with non- - # thinking models — Qwen3 ignores `format=` when thinking is off). + # On Anthropic, this maps to adaptive thinking + the equivalent + # ``effort`` level (low/medium/high/max). 
+ think: bool | str = Field(default="high", description="Thinking-mode level for reasoning models") + # ----- Premium pipeline knobs ----- + + use_cache: bool = Field( + default=True, + description=( + "Reuse the per-file extraction + per-section aggregation caches across walks. " + "Disable to force a clean re-walk." + ), + ) + use_graph: bool = Field( + default=True, + description=( + "Build an import/reference graph and feed each file's neighborhood into the " + "extraction prompt. Disable to fall back to per-file isolated extraction." + ), + ) + use_specialized_extractors: bool = Field( + default=True, + description=( + "Route schema files (SQL, OpenAPI, Protobuf, GraphQL, migrations) through " + "deterministic extractors that bypass the LLM." + ), + ) + review_derivatives: bool = Field( + default=False, + description=( + "Run the critic + reviser loop on derivative sections (personas, user stories, " + "diagrams). Adds up to 3 LLM calls per derivative section but materially improves " + "groundedness. Off by default to keep walk wall-time predictable." + ), + ) + review_min_score: int = Field( + default=7, + description="Critic score threshold; the reviser runs when a section scores below it.", + ) + + # ----- Anthropic provider knobs ----- + + anthropic_api_key: str | None = Field( + default=None, + description=("Explicit Anthropic API key. Falls back to ANTHROPIC_API_KEY in the environment when unset."), + ) + anthropic_max_tokens: int = Field( + default=16_000, + description="Per-call output token cap for the Anthropic provider.", + ) + @lru_cache def get_settings() -> Settings: return Settings() + + +def reset_settings_cache() -> None: + """Drop the cached :class:`Settings` instance so env changes take effect. + + Used by tests that mutate ``WIKIFI_*`` env vars between cases. 
+ """ + get_settings.cache_clear() diff --git a/wikifi/critic.py b/wikifi/critic.py new file mode 100644 index 0000000..4666baf --- /dev/null +++ b/wikifi/critic.py @@ -0,0 +1,242 @@ +"""Section-quality critic. + +Two consumers: + +- :func:`review_section` runs a *critic + reviser* loop on a synthesized + section body. The critic scores the body against its brief and the + upstream evidence, identifying unsupported claims and gaps. If the + score falls below ``min_score`` the reviser is invoked once with the + critique to produce an improved body. This catches the bulk of + hallucination and missing-coverage failures on derivative sections, + where a single-shot synthesis is most error-prone. +- :func:`score_wiki` walks every section in the wiki and produces a + rubric-style report (used by ``wikifi report``). + +The two paths share a single Pydantic schema (:class:`Critique`) so the +provider implementation can cache the system prompt across both. +""" + +from __future__ import annotations + +import logging +from collections.abc import Mapping +from dataclasses import dataclass, field + +from pydantic import BaseModel, Field + +from wikifi.providers.base import LLMProvider +from wikifi.sections import Section + +log = logging.getLogger("wikifi.critic") + + +CRITIC_SYSTEM_PROMPT = """\ +You are wikifi's quality critic. You receive (a) the brief for a section of \ +a technology-agnostic wiki, (b) the synthesized markdown body, and \ +optionally (c) the upstream evidence the body was supposed to derive from. \ +You score the body on a 0–10 rubric and identify concrete improvements. + +Rubric: +- 9–10: tech-agnostic, fully grounded in evidence, narratively coherent, \ + no unsupported claims, no obvious gaps against the brief. +- 6–8: largely sound but with one or more issues — minor unsupported \ + claims, awkward narrative, or missed coverage of brief items. +- 3–5: substantial gaps, several unsupported claims, or partial coverage. 
+- 0–2: incoherent, dominated by speculation, or off-brief. + +Be specific in `unsupported_claims` and `gaps`. A migration team will use \ +your critique to decide whether the section is ready to ship. +""" + + +REVISER_SYSTEM_PROMPT = """\ +You are wikifi's section reviser. You receive (a) the section brief, \ +(b) the prior body, (c) a critique flagging unsupported claims and gaps, \ +and (d) the upstream evidence available. Produce a revised body that \ +addresses every flagged issue. Stay tech-agnostic. Do not invent claims \ +the upstreams cannot support — declare gaps explicitly when evidence is \ +missing. Output the body only, no top-level heading. +""" + + +class Critique(BaseModel): + """Structured critic output.""" + + score: int = Field(ge=0, le=10, description="Overall quality score (0–10).") + summary: str = Field(default="", description="One- or two-sentence overall judgment.") + unsupported_claims: list[str] = Field( + default_factory=list, + description="Statements in the body not supported by the upstream evidence.", + ) + gaps: list[str] = Field( + default_factory=list, + description="Brief items the body fails to cover.", + ) + suggestions: list[str] = Field( + default_factory=list, + description="Concrete edits the reviser should make.", + ) + + +class RevisedBody(BaseModel): + body: str = Field(description="Revised markdown body for the section.") + + +@dataclass +class ReviewOutcome: + section_id: str + initial: Critique + body: str + revised: bool = False + final: Critique | None = None + + +@dataclass +class WikiQualityReport: + overall_score: float + critiques: dict[str, Critique] = field(default_factory=dict) + coverage: CoverageStats | None = None + + +def review_section( + *, + section: Section, + body: str, + upstream_evidence: Mapping[str, str] | None, + provider: LLMProvider, + min_score: int = 7, +) -> ReviewOutcome: + """Critique → optionally revise → critique again. 
Returns the outcome.""" + initial = _critique(section=section, body=body, upstream=upstream_evidence, provider=provider) + outcome = ReviewOutcome(section_id=section.id, initial=initial, body=body) + if initial.score >= min_score or not (initial.unsupported_claims or initial.gaps): + return outcome + + try: + revised = provider.complete_json( + system=REVISER_SYSTEM_PROMPT, + user=_render_revise_prompt(section, body, initial, upstream_evidence), + schema=RevisedBody, + ) + except Exception as exc: + log.warning("reviser failed for %s: %s", section.id, exc) + return outcome + + follow_up = _critique(section=section, body=revised.body, upstream=upstream_evidence, provider=provider) + # Accept the revision only if the score did not drop; otherwise + # keep the original to avoid regressions caused by a confused reviser. + if follow_up.score >= initial.score: + outcome.body = revised.body + outcome.revised = True + outcome.final = follow_up + else: + log.info( + "discarding revision for %s — score dropped from %d to %d", + section.id, + initial.score, + follow_up.score, + ) + return outcome + + +def _critique( + *, + section: Section, + body: str, + upstream: Mapping[str, str] | None, + provider: LLMProvider, +) -> Critique: + user = _render_critique_prompt(section, body, upstream) + try: + return provider.complete_json(system=CRITIC_SYSTEM_PROMPT, user=user, schema=Critique) + except Exception as exc: + log.warning("critic failed for %s: %s", section.id, exc) + return Critique(score=0, summary=f"Critic unavailable ({exc}).") + + +def _render_critique_prompt( + section: Section, + body: str, + upstream: Mapping[str, str] | None, +) -> str: + parts = [ + f"## Section: {section.title} (id: {section.id})", + "", + "### Brief", + section.description, + "", + "### Body to evaluate", + "```markdown", + body.strip() or "(empty body)", + "```", + ] + if upstream: + parts += ["", "### Upstream evidence available"] + for upstream_id, content in upstream.items(): + 
parts.append(f"#### {upstream_id}") + parts.append("```markdown") + parts.append(content.strip()) + parts.append("```") + parts.append("") + parts.append("Score the body and list unsupported claims, gaps, and suggested edits.") + return "\n".join(parts) + + +def _render_revise_prompt( + section: Section, + body: str, + critique: Critique, + upstream: Mapping[str, str] | None, +) -> str: + parts = [ + f"## Section: {section.title} (id: {section.id})", + "", + "### Brief", + section.description, + "", + "### Prior body", + "```markdown", + body.strip() or "(empty)", + "```", + "", + "### Critique", + f"score: {critique.score}/10", + ] + if critique.unsupported_claims: + parts.append("Unsupported claims to remove or qualify:") + parts += [f"- {c}" for c in critique.unsupported_claims] + if critique.gaps: + parts.append("Gaps to fill (only when evidence allows):") + parts += [f"- {g}" for g in critique.gaps] + if critique.suggestions: + parts.append("Suggested edits:") + parts += [f"- {s}" for s in critique.suggestions] + if upstream: + parts += ["", "### Upstream evidence"] + for upstream_id, content in upstream.items(): + parts.append(f"#### {upstream_id}") + parts.append("```markdown") + parts.append(content.strip()) + parts.append("```") + parts.append("") + parts.append("Output the revised body only.") + return "\n".join(parts) + + +# --------------------------------------------------------------------------- +# Coverage stats — populated by the extractor + aggregator caches and +# rendered by `wikifi report`. 
+# --------------------------------------------------------------------------- + + +@dataclass +class CoverageStats: + files_total: int + files_with_findings: int + findings_per_section: dict[str, int] + files_per_section: dict[str, int] + + def coverage_pct(self) -> float: + if self.files_total == 0: + return 0.0 + return round(100.0 * self.files_with_findings / self.files_total, 1) diff --git a/wikifi/deriver.py b/wikifi/deriver.py index b5a11bc..9c4223b 100644 --- a/wikifi/deriver.py +++ b/wikifi/deriver.py @@ -18,10 +18,11 @@ from __future__ import annotations import logging -from dataclasses import dataclass +from dataclasses import dataclass, field from pydantic import BaseModel, Field +from wikifi.critic import ReviewOutcome, review_section from wikifi.providers.base import LLMProvider from wikifi.sections import DERIVATIVE_SECTIONS, SECTIONS_BY_ID, Section from wikifi.wiki import WikiLayout, write_section @@ -60,10 +61,24 @@ class DerivedSection(BaseModel): class DerivationStats: sections_derived: int = 0 sections_skipped: int = 0 - - -def derive_all(*, layout: WikiLayout, provider: LLMProvider) -> DerivationStats: - """Synthesize every derivative section from its upstream primary sections.""" + sections_revised: int = 0 + review_outcomes: list[ReviewOutcome] = field(default_factory=list) + + +def derive_all( + *, + layout: WikiLayout, + provider: LLMProvider, + review: bool = False, + review_min_score: int = 7, +) -> DerivationStats: + """Synthesize every derivative section from its upstream primary sections. + + With ``review=True`` each derivative is run through the critic + + reviser loop after synthesis. The critic loop is the highest-leverage + quality lever for derivative sections — personas and Gherkin stories + are exactly where single-shot synthesis tends to hallucinate. 
+ """ stats = DerivationStats() for section in DERIVATIVE_SECTIONS: upstream_bodies = _collect_upstream(layout, section) @@ -85,6 +100,20 @@ def derive_all(*, layout: WikiLayout, provider: LLMProvider) -> DerivationStats: except Exception as exc: log.warning("derivation failed for %s: %s", section.id, exc) body = _fallback_body(section, upstream_bodies, error=str(exc)) + + if review: + outcome = review_section( + section=section, + body=body, + upstream_evidence=upstream_bodies, + provider=provider, + min_score=review_min_score, + ) + body = outcome.body + stats.review_outcomes.append(outcome) + if outcome.revised: + stats.sections_revised += 1 + write_section(layout, section, body) stats.sections_derived += 1 return stats diff --git a/wikifi/evidence.py b/wikifi/evidence.py new file mode 100644 index 0000000..af3c9ee --- /dev/null +++ b/wikifi/evidence.py @@ -0,0 +1,160 @@ +"""Evidence model: source references, claims, and contradictions. + +A premium migration wiki must let an architect ask, for any sentence in the +wiki, *"where in the source did this come from?"* — and get a precise, +verifiable answer. This module defines the small structured types that +carry that answer end-to-end: + +- :class:`SourceRef` — a single ``(file, lines, fingerprint)`` pointer back + to the codebase. Lines are optional because not every claim has a line + range (e.g. cross-cutting findings that span a whole module). +- :class:`Claim` — one assertion in a section's narrative, with the source + refs that justify it. The aggregator emits one or more claims per + section; the renderer converts them into citation-bearing markdown. +- :class:`Contradiction` — two or more claims that disagree, surfaced + rather than silently merged. Migration teams treat contradictions as + high-priority signals: legacy systems hide tribal knowledge in them. + +Citations are rendered as compact footnote-style markers (``[1]``, ``[2]``, +…) with an explicit "Sources" footer at the bottom of each section. 
Lines +are included when known (``path/to/file.py:42-87``). +""" + +from __future__ import annotations + +from dataclasses import dataclass + +from pydantic import BaseModel, Field + + +class SourceRef(BaseModel): + """A pointer back to a single span of source code.""" + + file: str = Field(description="Repo-relative path of the source file.") + lines: tuple[int, int] | None = Field( + default=None, + description="Optional inclusive (start, end) line range within the file.", + ) + fingerprint: str = Field( + default="", + description="Short content hash captured at extraction time. Empty when unknown.", + ) + + def render(self) -> str: + """Render as ``path:start-end`` (or just ``path`` when lines unknown).""" + if self.lines is None: + return self.file + start, end = self.lines + if start == end: + return f"{self.file}:{start}" + return f"{self.file}:{start}-{end}" + + +class Claim(BaseModel): + """A single assertion the aggregator places in a section, with sources.""" + + text: str = Field(description="Markdown sentence(s) asserting one fact.") + sources: list[SourceRef] = Field( + default_factory=list, + description="Files/lines that support this claim. 
Empty means unsupported.", + ) + + def supported(self) -> bool: + return bool(self.sources) + + +class Contradiction(BaseModel): + """Two or more conflicting claims about the same topic.""" + + summary: str = Field(description="One-sentence description of the conflict.") + positions: list[Claim] = Field( + default_factory=list, + description="Each disagreeing position, with its own sources.", + ) + + +class EvidenceBundle(BaseModel): + """The aggregator's structured output for a single section.""" + + body: str = Field(description="Markdown narrative for the section.") + claims: list[Claim] = Field(default_factory=list) + contradictions: list[Contradiction] = Field(default_factory=list) + + +# --------------------------------------------------------------------------- +# Rendering helpers +# --------------------------------------------------------------------------- + + +@dataclass +class _Numbered: + index: int + ref: SourceRef + + +def render_section_body(bundle: EvidenceBundle) -> str: + """Render an EvidenceBundle into final markdown. + + The body is appended with a "Sources" footer enumerating every distinct + source ref across claims and contradictions, plus an explicit + "Conflicts in source" section if any contradictions were surfaced. + """ + parts: list[str] = [] + if bundle.body.strip(): + parts.append(bundle.body.strip()) + + if bundle.contradictions: + parts.append("") + parts.append("## Conflicts in source") + parts.append( + "_The walker found disagreements across files. Migration teams " + "should resolve these before re-implementation._" + ) + for entry in bundle.contradictions: + parts.append("") + parts.append(f"- **{entry.summary.strip()}**") + for position in entry.positions: + refs = _format_refs(position.sources) + parts.append(f" - {position.text.strip()} {refs}".rstrip()) + + sources = _enumerate_sources(bundle) + if sources: + parts.append("") + parts.append("## Sources") + for entry in sources: + parts.append(f"{entry.index}. 
`{entry.ref.render()}`") + + return "\n".join(parts).strip() + + +def _format_refs(refs: list[SourceRef]) -> str: + if not refs: + return "" + rendered = ", ".join(f"`{ref.render()}`" for ref in refs) + return f"({rendered})" + + +def _enumerate_sources(bundle: EvidenceBundle) -> list[_Numbered]: + seen: dict[str, _Numbered] = {} + next_index = 1 + iterables: list[list[SourceRef]] = [c.sources for c in bundle.claims] + for entry in bundle.contradictions: + for position in entry.positions: + iterables.append(position.sources) + for refs in iterables: + for ref in refs: + key = ref.render() + if key not in seen: + seen[key] = _Numbered(index=next_index, ref=ref) + next_index += 1 + return list(seen.values()) + + +def coalesce_refs(refs: list[SourceRef]) -> list[SourceRef]: + """Deduplicate refs by rendered form, preserving first-seen order.""" + seen: dict[str, SourceRef] = {} + for ref in refs: + key = ref.render() + if key not in seen: + seen[key] = ref + return list(seen.values()) diff --git a/wikifi/extractor.py b/wikifi/extractor.py index 34ffa9f..8f769d0 100644 --- a/wikifi/extractor.py +++ b/wikifi/extractor.py @@ -2,25 +2,42 @@ Given the include/exclude decision from Stage 1, walk each file deterministically and ask the LLM what intent-bearing content it contributes to each capture -section. Results are appended to per-section JSONL note files for the aggregator. - -The contract: one LLM call per file *or* one call per overlapping chunk for -files that exceed the per-call window. Output is validated against a strict -Pydantic schema. Files that can't be read or validated are recorded as skipped -findings rather than crashing the walk. +section. Results are appended to per-section JSONL note files for the +aggregator. + +Three orthogonal mechanisms make this stage premium-grade: + +1. 
**Content-addressed cache.** Each file is fingerprinted; if its fingerprint + matches a cached entry, the LLM call is skipped entirely and cached + findings are replayed into the notes store. This is what makes a re-walk + of a 50k-file legacy monorepo finish in minutes. +2. **Cross-file context.** A repo-wide import graph (built once, before + extraction starts) supplies each file's neighborhood to the prompt so + findings can describe inter-file flows. +3. **Type-aware specialization.** Files classified as SQL, OpenAPI, + Protobuf, GraphQL, or migrations bypass the LLM entirely and run + through deterministic extractors that read the structure directly. + +Every emitted finding carries a structured :class:`SourceRef` so the +aggregator can stitch citations back into the rendered wiki. """ from __future__ import annotations import logging -from collections.abc import Iterable -from dataclasses import dataclass +from collections.abc import Callable, Iterable +from dataclasses import dataclass, field from pathlib import Path from pydantic import BaseModel, Field +from wikifi.cache import WalkCache +from wikifi.evidence import SourceRef +from wikifi.fingerprint import hash_file from wikifi.providers.base import LLMProvider +from wikifi.repograph import FileKind, RepoGraph, classify from wikifi.sections import PRIMARY_SECTION_IDS, PRIMARY_SECTIONS +from wikifi.specialized import select as select_specialized from wikifi.wiki import WikiLayout, append_note log = logging.getLogger("wikifi.extractor") @@ -52,6 +69,14 @@ the same finding to appear twice — that's deliberate context, not duplication \ to invent around. +When the user prompt names neighbor files (files this one imports from or is \ +imported by), you may reference those relationships when describing flows that \ +cross file boundaries. Do not fabricate flows that aren't visible in the chunk. + +Each finding can carry an optional list of supporting line ranges within \ +this file. 
Provide them when you can; omit them when the contribution is \ +diffuse across the chunk. + Only emit findings for these section ids: {_SECTION_LIST} Section briefs: @@ -73,6 +98,10 @@ class SectionFinding(BaseModel): section_id: str = Field(description=f"Must be one of: {_SECTION_LIST}") finding: str = Field(description="Tech-agnostic markdown describing the contribution. 1-5 sentences.") + line_range: tuple[int, int] | None = Field( + default=None, + description="Optional inclusive (start, end) line range within the chunk supporting this finding.", + ) class FileFindings(BaseModel): @@ -89,6 +118,9 @@ class ExtractionStats: findings_total: int = 0 files_skipped: int = 0 chunks_processed: int = 0 + cache_hits: int = 0 + specialized_files: int = 0 + files_kinds: dict[str, int] = field(default_factory=dict) def extract_repo( @@ -99,6 +131,9 @@ def extract_repo( repo_root: Path, chunk_size_bytes: int = 150_000, chunk_overlap_bytes: int = 8_000, + cache: WalkCache | None = None, + graph: RepoGraph | None = None, + persist_cache: Callable[[], None] | None = None, ) -> ExtractionStats: """Walk the supplied files and append per-section findings to the notes store. @@ -108,6 +143,12 @@ def extract_repo( chunk produces one LLM call; identical findings emerging from the overlap region are deduplicated per file so a single declaration isn't double-counted. + + When a ``cache`` is supplied, files whose content fingerprint matches a + cached entry skip the LLM call entirely and replay the cached findings. + When ``persist_cache`` is supplied, it is invoked after each file + finishes — that turns crash-resumability into a free property of the + cache layer. 
""" stats = ExtractionStats() valid_ids = set(PRIMARY_SECTION_IDS) @@ -122,12 +163,89 @@ def extract_repo( stats.files_skipped += 1 continue + try: + fingerprint = hash_file(full) + except OSError: + fingerprint = "" + + kind = classify(rel, sample=data[:4096]) + kind_label = kind.value + stats.files_kinds[kind_label] = stats.files_kinds.get(kind_label, 0) + 1 + + # ---- cache hit ---- + if cache is not None and fingerprint: + cached = cache.lookup_extraction(rel.as_posix(), fingerprint) + if cached is not None: + file_had_findings = _replay_cached(layout, rel, cached, valid_ids, stats) + if file_had_findings: + stats.files_with_findings += 1 + stats.cache_hits += 1 + if persist_cache is not None: + persist_cache() + continue + + # ---- specialized routing ---- + specialized_fn = select_specialized(kind) + if specialized_fn is not None: + stats.specialized_files += 1 + try: + result = specialized_fn(rel.as_posix(), data) + except Exception as exc: # specialized failures don't kill the walk + log.warning("specialized extraction failed for %s: %s", rel, exc) + stats.files_skipped += 1 + continue + + cached_findings = [] + file_had_findings = False + for finding in result.findings: + if finding.section_id not in valid_ids: + continue + note = _build_note( + rel=rel, + summary=result.summary, + finding_text=finding.finding, + sources=finding.sources, + extractor=f"specialized:{kind_label}", + ) + append_note(layout, finding.section_id, note) + cached_findings.append( + { + "section_id": finding.section_id, + "finding": finding.finding, + "sources": [s.model_dump() for s in finding.sources], + } + ) + stats.findings_total += 1 + file_had_findings = True + if file_had_findings: + stats.files_with_findings += 1 + if cache is not None and fingerprint: + cache.record_extraction( + rel.as_posix(), + fingerprint=fingerprint, + findings=cached_findings, + summary=result.summary, + chunks_processed=0, + ) + if persist_cache is not None: + persist_cache() + continue + + # ---- 
LLM extraction path ---- chunks = _chunk_text(data, chunk_size=chunk_size_bytes, overlap=chunk_overlap_bytes) total_chunks = len(chunks) file_had_findings = False any_chunk_failed = False seen_findings: set[tuple[str, str]] = set() latest_summary = "" + cached_findings: list[dict] = [] + chunks_done = 0 + + neighbors = graph.neighbor_paths(rel.as_posix()) if graph is not None else [] + + # Track each chunk's starting line so finding line_ranges can be + # mapped back to absolute file lines for the citation. + chunk_offsets = _chunk_line_offsets(data, chunks) for chunk_index, chunk_body in enumerate(chunks): try: @@ -138,6 +256,7 @@ def extract_repo( body=chunk_body, chunk_index=chunk_index, total_chunks=total_chunks, + neighbors=neighbors, ), schema=FileFindings, ) @@ -153,9 +272,11 @@ def extract_repo( continue stats.chunks_processed += 1 + chunks_done += 1 if chunk_findings.summary: latest_summary = chunk_findings.summary + chunk_line_offset = chunk_offsets[chunk_index] for finding in chunk_findings.findings: if finding.section_id not in valid_ids: continue @@ -164,15 +285,29 @@ def extract_repo( continue seen_findings.add(key) - note: dict[str, object] = { - "file": rel.as_posix(), - "summary": latest_summary, - "finding": finding.finding, - } - if total_chunks > 1: - note["chunk"] = chunk_index - note["chunks"] = total_chunks + line_range: tuple[int, int] | None = None + if finding.line_range is not None: + start, end = finding.line_range + line_range = (start + chunk_line_offset, end + chunk_line_offset) + + sources = [SourceRef(file=rel.as_posix(), lines=line_range, fingerprint=fingerprint)] + note = _build_note( + rel=rel, + summary=latest_summary, + finding_text=finding.finding, + sources=sources, + extractor=f"llm:{kind_label}", + chunk_index=chunk_index, + total_chunks=total_chunks, + ) append_note(layout, finding.section_id, note) + cached_findings.append( + { + "section_id": finding.section_id, + "finding": finding.finding, + "sources": [s.model_dump() 
for s in sources], + } + ) stats.findings_total += 1 file_had_findings = True @@ -184,10 +319,78 @@ def extract_repo( # chunked files lose some chunks we still keep what we got. stats.files_skipped += 1 + if cache is not None and fingerprint and chunks_done > 0: + cache.record_extraction( + rel.as_posix(), + fingerprint=fingerprint, + findings=cached_findings, + summary=latest_summary, + chunks_processed=chunks_done, + ) + if persist_cache is not None: + persist_cache() + return stats -def _render_user_prompt(*, rel: Path, body: str, chunk_index: int = 0, total_chunks: int = 1) -> str: +def _replay_cached( + layout: WikiLayout, + rel: Path, + cached, + valid_ids: set[str], + stats: ExtractionStats, +) -> bool: + """Re-emit cached findings into the notes store. Returns True if any landed.""" + file_had_findings = False + for entry in cached.findings: + section_id = entry.get("section_id", "") + if section_id not in valid_ids: + continue + sources = [SourceRef(**s) for s in entry.get("sources", [])] + note = _build_note( + rel=rel, + summary=cached.summary, + finding_text=entry.get("finding", ""), + sources=sources, + extractor="cache", + ) + append_note(layout, section_id, note) + stats.findings_total += 1 + file_had_findings = True + return file_had_findings + + +def _build_note( + *, + rel: Path, + summary: str, + finding_text: str, + sources: list[SourceRef], + extractor: str, + chunk_index: int | None = None, + total_chunks: int | None = None, +) -> dict[str, object]: + note: dict[str, object] = { + "file": rel.as_posix(), + "summary": summary, + "finding": finding_text, + "sources": [s.model_dump() for s in sources], + "extractor": extractor, + } + if total_chunks is not None and total_chunks > 1: + note["chunk"] = chunk_index + note["chunks"] = total_chunks + return note + + +def _render_user_prompt( + *, + rel: Path, + body: str, + chunk_index: int = 0, + total_chunks: int = 1, + neighbors: list[str] | None = None, +) -> str: if total_chunks > 1: chunk_header 
= ( f"Chunk: {chunk_index + 1} of {total_chunks} " @@ -196,15 +399,26 @@ def _render_user_prompt(*, rel: Path, body: str, chunk_index: int = 0, total_chu ) else: chunk_header = "" + neighbor_block = "" + if neighbors: + neighbor_lines = "\n".join(f" - {n}" for n in neighbors[:8]) + neighbor_block = ( + "Neighbor files (this file imports from or is imported by these — " + "feel free to mention cross-file relationships when supported by the chunk):\n" + f"{neighbor_lines}\n\n" + ) return ( f"File path: {rel.as_posix()}\n\n" + f"{neighbor_block}" f"{chunk_header}" "File contents:\n" "```\n" f"{body}\n" "```\n\n" "Return findings strictly in the FileFindings schema. Use section ids " - f"only from: {_SECTION_LIST}." + f"only from: {_SECTION_LIST}. Provide ``line_range`` as an inclusive " + "(start, end) pair *within this chunk* whenever the contribution is " + "tied to a specific span; omit it for diffuse contributions." ) @@ -242,6 +456,28 @@ def _chunk_text(text: str, *, chunk_size: int, overlap: int) -> list[str]: return overlapped +def _chunk_line_offsets(text: str, chunks: list[str]) -> list[int]: + """Return the starting line number (0-indexed offset) of each chunk + within ``text``. Used to translate per-chunk line ranges into absolute + file line ranges for citations. + """ + offsets: list[int] = [] + cursor = 0 + for chunk in chunks: + idx = text.find(chunk, cursor) + if idx < 0: + # Overlap or aggressive splitting can shift the search window; + # fall back to a global find. Worst case: line offsets are + # approximate, which is acceptable for citation purposes. + idx = text.find(chunk) + if idx < 0: + offsets.append(0) + continue + offsets.append(text.count("\n", 0, idx)) + cursor = idx + max(1, len(chunk) // 2) # advance past most of this chunk + return offsets + + def _recursive_split(text: str, *, chunk_size: int, separators: list[str]) -> list[str]: """Split ``text`` so every chunk fits within ``chunk_size``, trying each separator in priority order. 
The empty-string separator is the terminal @@ -284,3 +520,8 @@ def _recursive_split(text: str, *, chunk_size: int, separators: list[str]) -> li if current: chunks.append(current) return chunks + + +def classify_file(rel_path: Path, sample: str) -> FileKind: + """Public re-export so callers don't need to import :mod:`repograph`.""" + return classify(rel_path, sample=sample) diff --git a/wikifi/fingerprint.py b/wikifi/fingerprint.py new file mode 100644 index 0000000..69a4dfa --- /dev/null +++ b/wikifi/fingerprint.py @@ -0,0 +1,48 @@ +"""Stable content fingerprints for files and synthesized text. + +Used by three subsystems: + +- :mod:`wikifi.cache` keys cached extraction findings by ``hash(file_bytes)`` + and cached aggregations by ``hash(notes_payload)``. +- :mod:`wikifi.evidence` cites source files by ``(path, fingerprint, lines)`` + so a migration team can verify the wiki claim survives a re-walk. +- :mod:`wikifi.repograph` records each file's fingerprint alongside its + import edges so cross-file context invalidates correctly when source + changes. + +Fingerprints are short hex prefixes of SHA-256: enough entropy to +distinguish every file in any realistic repository (~20 million files +before a 50% collision chance with a 12-char prefix), and short enough +to render comfortably inline in citations. +""" + +from __future__ import annotations + +import hashlib +from pathlib import Path + +# Twelve hex chars = 48 bits of entropy. Using a prefix (rather than the +# full digest) keeps citations readable while leaving margin against +# collisions on any realistic codebase. 
+FINGERPRINT_LENGTH = 12 + + +def hash_text(text: str) -> str: + """Return a stable short fingerprint for a string.""" + digest = hashlib.sha256(text.encode("utf-8", errors="replace")).hexdigest() + return digest[:FINGERPRINT_LENGTH] + + +def hash_bytes(data: bytes) -> str: + """Return a stable short fingerprint for raw bytes.""" + digest = hashlib.sha256(data).hexdigest() + return digest[:FINGERPRINT_LENGTH] + + +def hash_file(path: Path) -> str: + """Return the fingerprint of the file at ``path``. + + Reads the file as bytes (not text) so the same fingerprint is produced + regardless of how the cache or extractor later decodes it. + """ + return hash_bytes(path.read_bytes()) diff --git a/wikifi/orchestrator.py b/wikifi/orchestrator.py index 1923ecc..80b26ba 100644 --- a/wikifi/orchestrator.py +++ b/wikifi/orchestrator.py @@ -1,13 +1,19 @@ """End-to-end pipeline that wires Stage 1 → Stage 2 → Stage 3 → Stage 4. -The CLI calls into ``init_wiki`` and ``run_walk``. Both accept a target root -and a configured provider so tests can substitute a mock provider trivially. +The CLI calls into ``init_wiki``, ``run_walk``, and ``run_report``. Each +accepts a target root and a configured provider so tests can substitute +a mock provider trivially. 
- Stage 1: LLM introspection of repo structure (`introspection.introspect`) -- Stage 2: deterministic per-file extraction → JSONL notes (`extractor.extract_repo`) -- Stage 3: per-section aggregation of primary sections (`aggregator.aggregate_all`) -- Stage 4: derivation of personas/user_stories/diagrams from primary section - bodies (`deriver.derive_all`) +- Stage 1.5: lightweight static analysis (`repograph.build_graph`) when + ``settings.use_graph`` is set +- Stage 2: deterministic per-file extraction → JSONL notes + (`extractor.extract_repo`), with caching, specialized routing, and + cross-file context if available +- Stage 3: per-section aggregation of primary sections + (`aggregator.aggregate_all`), with section-level cache +- Stage 4: derivation of personas/user_stories/diagrams from primary + section bodies (`deriver.derive_all`), with optional critic loop """ from __future__ import annotations @@ -17,12 +23,17 @@ from pathlib import Path from wikifi.aggregator import AggregationStats, aggregate_all +from wikifi.cache import WalkCache +from wikifi.cache import load as load_cache +from wikifi.cache import reset as reset_cache +from wikifi.cache import save as save_cache from wikifi.config import Settings from wikifi.deriver import DerivationStats, derive_all from wikifi.extractor import ExtractionStats, extract_repo from wikifi.introspection import IntrospectionResult, introspect from wikifi.providers.base import LLMProvider from wikifi.providers.ollama_provider import OllamaProvider +from wikifi.repograph import RepoGraph, build_graph from wikifi.walker import WalkConfig, iter_files from wikifi.wiki import WikiLayout, initialize, reset_notes @@ -46,6 +57,8 @@ class WalkReport: extraction: ExtractionStats aggregation: AggregationStats derivation: DerivationStats + cache: WalkCache | None = None + graph: RepoGraph | None = None def run_walk( @@ -83,9 +96,30 @@ def run_walk( min_content_bytes=settings.min_content_bytes, ) + files = list(iter_files(walk_config)) 
+ + cache: WalkCache | None = None + if settings.use_cache: + cache = load_cache(layout) + # Drop cache entries for files that fell out of scope so the + # cache size tracks the live in-scope set. + in_scope = {p.as_posix() for p in files} + cache.prune_extraction(keep=in_scope) + else: + reset_cache(layout) + + graph: RepoGraph | None = None + if settings.use_graph: + log.info("stage 1.5: building repo import graph") + graph = build_graph(repo_root=root, files=files) + log.info("stage 2: extracting per-file findings") reset_notes(layout) - files = list(iter_files(walk_config)) + + def _persist() -> None: + if cache is not None: + save_cache(layout, cache) + extraction = extract_repo( layout=layout, provider=provider, @@ -93,29 +127,60 @@ def run_walk( repo_root=root, chunk_size_bytes=settings.chunk_size_bytes, chunk_overlap_bytes=settings.chunk_overlap_bytes, + cache=cache, + graph=graph, + persist_cache=_persist if cache is not None else None, ) log.info("stage 3: aggregating primary sections") - aggregation = aggregate_all(layout=layout, provider=provider) + aggregation = aggregate_all(layout=layout, provider=provider, cache=cache) log.info("stage 4: deriving personas, user stories, and diagrams") - derivation = derive_all(layout=layout, provider=provider) + derivation = derive_all( + layout=layout, + provider=provider, + review=settings.review_derivatives, + review_min_score=settings.review_min_score, + ) + + if cache is not None: + save_cache(layout, cache) return WalkReport( introspection=introspection, extraction=extraction, aggregation=aggregation, derivation=derivation, + cache=cache, + graph=graph, ) def build_provider(settings: Settings) -> LLMProvider: - """Construct the configured provider. 
Currently Ollama is the only backend.""" - if settings.provider != "ollama": - raise ValueError(f"unknown provider {settings.provider!r}; only 'ollama' is supported in v1") - return OllamaProvider( - model=settings.model, - host=settings.ollama_host, - timeout=settings.request_timeout, - think=settings.think, - ) + """Construct the configured provider. + + Local Ollama is the default. Hosted Anthropic is opt-in via + ``WIKIFI_PROVIDER=anthropic`` and an ``ANTHROPIC_API_KEY``. + """ + if settings.provider == "ollama": + return OllamaProvider( + model=settings.model, + host=settings.ollama_host, + timeout=settings.request_timeout, + think=settings.think, + ) + if settings.provider == "anthropic": + from wikifi.providers.anthropic_provider import AnthropicProvider + + # When users opt in to Anthropic but leave the Ollama default + # model id in place, swap to a sensible Claude default rather + # than 404 on the model name. + model = settings.model if settings.model.startswith("claude-") else "claude-opus-4-7" + return AnthropicProvider( + model=model, + api_key=settings.anthropic_api_key, + timeout=settings.request_timeout, + max_tokens=settings.anthropic_max_tokens, + think=settings.think, + ) + raise ValueError(f"unknown provider {settings.provider!r}; expected 'ollama' or 'anthropic'") diff --git a/wikifi/providers/anthropic_provider.py b/wikifi/providers/anthropic_provider.py new file mode 100644 index 0000000..8241e07 --- /dev/null +++ b/wikifi/providers/anthropic_provider.py @@ -0,0 +1,235 @@ +"""Anthropic-backed implementation of :class:`LLMProvider`. + +This is the premium / hosted path. Wikifi's pipeline reuses the same +multi-KB system prompt across hundreds of per-file extraction calls; the +defining design choice here is to mark that prompt with +``cache_control: {"type": "ephemeral"}`` so subsequent calls served by +the same cache breakpoint pay ~10% of the input price (cache read) instead +of full price every time. 
Without that, hosted Anthropic is uneconomical +on a 10k-file codebase walk; with it, the cost story competes with +local Ollama at materially better extraction quality. + +Three design notes worth flagging: + +1. **Structured output via ``messages.parse``.** The Pydantic schema is + converted to JSON Schema by the SDK and the model returns a + pre-validated instance. This is the SDK's recommended path for + structured outputs (see ``claude-api`` skill, *Structured Outputs*) — + we don't hand-roll tool_use blocks for this. +2. **Adaptive thinking + effort.** Opus 4.7 (the recommended default) + supports only adaptive thinking and exposes ``effort`` for depth. + Sampling parameters (``temperature``, ``top_p``, ``top_k``) are + removed on 4.7 and would 400 if sent — we omit them entirely. The + ``think`` knob mirrors the Ollama provider's interface so the rest + of the codebase doesn't branch on provider. +3. **Errors map to ``RuntimeError``.** The aggregator/extractor/deriver + already catch broad ``Exception`` per call; mapping + ``anthropic.APIError`` (and friends) into a plain ``RuntimeError`` + with the request id keeps the pipeline's existing fallback paths + working unchanged. +""" + +from __future__ import annotations + +import logging +import os +from typing import Any, TypeVar + +from pydantic import BaseModel + +from wikifi.providers.base import ChatMessage + +try: # the dep is declared in pyproject.toml, but importing lazily yields + # a clearer error if a user installs without extras. + import anthropic +except ImportError as exc: # pragma: no cover - import error path + raise ImportError( + "wikifi.providers.anthropic_provider requires the `anthropic` package. " + "Install via `uv add anthropic` or include the [hosted] extras." + ) from exc + + +T = TypeVar("T", bound=BaseModel) +log = logging.getLogger("wikifi.providers.anthropic") + + +# Default model — opus 4.7 is the most capable for migration-grade +# domain extraction. 
Override per-walk via `WIKIFI_MODEL` env or +# `.wikifi/config.toml`. +DEFAULT_MODEL = "claude-opus-4-7" + +# Default per-call max output tokens. Wikifi's structured findings are +# small relative to the input; 16K is comfortable headroom for any of +# the section schemas without crossing the SDK's non-streaming HTTP +# timeout guard. +DEFAULT_MAX_TOKENS = 16_000 + + +ThinkLevel = bool | str | None + + +class AnthropicProvider: + """Hosted-Claude implementation of the wikifi provider protocol.""" + + name = "anthropic" + + def __init__( + self, + *, + model: str = DEFAULT_MODEL, + api_key: str | None = None, + timeout: float = 900.0, + max_tokens: int = DEFAULT_MAX_TOKENS, + think: ThinkLevel = "high", + cache_system_prompt: bool = True, + client: Any | None = None, + ) -> None: + self.model = model + self.timeout = timeout + self.max_tokens = max_tokens + self.think = think + self.cache_system_prompt = cache_system_prompt + if client is not None: + # Tests pass an injected mock; preserve the duck-typed surface. + self._client = client + else: + api_key = api_key or os.environ.get("ANTHROPIC_API_KEY") + self._client = anthropic.Anthropic(api_key=api_key, timeout=timeout) + + # ------------------------------------------------------------------ + # Provider protocol + # ------------------------------------------------------------------ + + def complete_json(self, *, system: str, user: str, schema: type[T]) -> T: + """Return a ``schema``-validated Pydantic instance. + + Uses ``messages.parse`` so the SDK runs JSON-Schema-constrained + decoding and returns the parsed Pydantic model directly. The + system prompt is wrapped in a single text block with + ``cache_control`` so successive per-file calls hit the prompt + cache. 
+ """ + try: + response = self._client.messages.parse( + model=self.model, + max_tokens=self.max_tokens, + system=self._render_system(system), + messages=[{"role": "user", "content": user}], + output_format=schema, + **self._thinking_kwargs(), + ) + except anthropic.APIError as exc: + raise RuntimeError(_format_api_error(exc)) from exc + + parsed = getattr(response, "parsed_output", None) + if parsed is None: + # Defensive: if the model refused or the SDK couldn't parse, + # fall back to schema-validating the response text. This + # keeps the protocol's ``raise on failure`` contract intact + # rather than returning a None. + text = _first_text(response) + try: + return schema.model_validate_json(text) + except Exception as exc: # pragma: no cover - defensive path + raise RuntimeError(f"anthropic provider: empty parsed_output and parse fallback failed: {exc}") from exc + return parsed # type: ignore[return-value] + + def complete_text(self, *, system: str, user: str) -> str: + """Return the model's free-text response.""" + try: + response = self._client.messages.create( + model=self.model, + max_tokens=self.max_tokens, + system=self._render_system(system), + messages=[{"role": "user", "content": user}], + **self._thinking_kwargs(), + ) + except anthropic.APIError as exc: + raise RuntimeError(_format_api_error(exc)) from exc + return _first_text(response) or "" + + def chat(self, *, system: str, messages: list[ChatMessage]) -> str: + """Multi-turn chat. 
The system prompt is cached; the running + message history follows it (and is therefore not cached itself + beyond the prefix-match window — see the prompt-caching guide + in the ``claude-api`` skill).""" + try: + response = self._client.messages.create( + model=self.model, + max_tokens=self.max_tokens, + system=self._render_system(system), + messages=list(messages), + **self._thinking_kwargs(), + ) + except anthropic.APIError as exc: + raise RuntimeError(_format_api_error(exc)) from exc + return _first_text(response) or "" + + # ------------------------------------------------------------------ + # Helpers + # ------------------------------------------------------------------ + + def _render_system(self, system: str) -> list[dict[str, Any]] | str: + """Wrap ``system`` in a single text block with ``cache_control``. + + Returning a list (not a string) is what enables the cache mark. + Wikifi's per-file system prompt is large and identical across + every Stage 2 / Stage 3 / Stage 4 call — the cache hit on the + 2nd-Nth request is the entire cost story for hosted runs. + """ + if not self.cache_system_prompt: + return system + return [ + { + "type": "text", + "text": system, + "cache_control": {"type": "ephemeral"}, + } + ] + + def _thinking_kwargs(self) -> dict[str, Any]: + """Translate ``think`` into Anthropic's adaptive-thinking config. + + - ``False`` / ``"off"`` / ``"none"`` → thinking disabled. + - ``"low"`` / ``"medium"`` / ``"high"`` / ``"xhigh"`` / ``"max"`` → + adaptive thinking with the corresponding ``effort``. Wikifi defaults + to ``"high"`` since the walk is bounded; bump to ``"max"`` for + intelligence-critical migrations. + - ``True`` / unrecognized string → adaptive thinking, no + ``effort`` override (SDK default). 
+ """ + if self.think is False or self.think in {"off", "none"}: + return {"thinking": {"type": "disabled"}} + if isinstance(self.think, str) and self.think.lower() in {"low", "medium", "high", "xhigh", "max"}: + return { + "thinking": {"type": "adaptive"}, + "output_config": {"effort": self.think.lower()}, + } + return {"thinking": {"type": "adaptive"}} + + +def _first_text(response: Any) -> str: + """Pull the first text block out of a Messages response. + + Tolerates the SDK shape (``response.content`` is a list of typed + blocks) and a duck-typed mock (a list of dicts). + """ + content = getattr(response, "content", None) + if not content: + return "" + for block in content: + block_type = getattr(block, "type", None) or (block.get("type") if isinstance(block, dict) else None) + if block_type == "text": + text = getattr(block, "text", None) or (block.get("text") if isinstance(block, dict) else None) + if text: + return text + return "" + + +def _format_api_error(exc: Exception) -> str: + """Render an APIError with the request id, when present, for diagnostics.""" + request_id = getattr(exc, "request_id", None) + msg = getattr(exc, "message", None) or str(exc) + if request_id: + return f"anthropic provider failed ({request_id}): {msg}" + return f"anthropic provider failed: {msg}" diff --git a/wikifi/repograph.py b/wikifi/repograph.py new file mode 100644 index 0000000..6fa8bb5 --- /dev/null +++ b/wikifi/repograph.py @@ -0,0 +1,397 @@ +"""Lightweight static analysis of the repository. + +Two outputs feed Stage 2: + +1. **File classification.** Each in-scope file is tagged with a + :class:`FileKind` (``application_code``, ``sql``, ``openapi``, + ``protobuf``, ``graphql``, ``migration``, ``other``). Specialized + extractors short-circuit the LLM for the structured kinds — a SQL + DDL file becomes a precise entity diff without a 90-second model + call. Application code falls through to the existing LLM extraction + path, but enriched with the import graph. + +2. 
**Import / reference graph.** A regex-driven scan builds an undirected + neighbor map: for each file, "this file imports from these files, + and is imported by these files". The neighbor list is injected into + the Stage 2 prompt so per-file findings can talk about cross-file + flows ("this handler delegates to ``services/billing.py`` for the + order-totalling step") rather than treating each file as an island. + +The implementation is deliberately language-pluralistic and relies only +on regex + path resolution. tree-sitter would give richer structure but +adds a binary dep wikifi has explicitly avoided so far; the regex graph +is good enough to surface neighbors for the LLM to reason over, which is +the only consumer that matters here. +""" + +from __future__ import annotations + +import logging +import re +from collections import defaultdict +from collections.abc import Iterable +from dataclasses import dataclass, field +from enum import StrEnum +from pathlib import Path + +log = logging.getLogger("wikifi.repograph") + + +class FileKind(StrEnum): + APPLICATION_CODE = "application_code" + SQL = "sql" + OPENAPI = "openapi" + PROTOBUF = "protobuf" + GRAPHQL = "graphql" + MIGRATION = "migration" + OTHER = "other" + + +# Suffixes that pin a file kind purely by extension. +_EXTENSION_KINDS: dict[str, FileKind] = { + ".sql": FileKind.SQL, + ".ddl": FileKind.SQL, + ".proto": FileKind.PROTOBUF, + ".graphql": FileKind.GRAPHQL, + ".graphqls": FileKind.GRAPHQL, + ".gql": FileKind.GRAPHQL, +} + + +_APPLICATION_EXTS: frozenset[str] = frozenset( + { + ".py", + ".js", + ".jsx", + ".ts", + ".tsx", + ".mjs", + ".cjs", + ".go", + ".rs", + ".rb", + ".php", + ".java", + ".kt", + ".kts", + ".scala", + ".cs", + ".cpp", + ".cc", + ".c", + ".h", + ".hpp", + ".swift", + ".m", + ".mm", + ".dart", + ".ex", + ".exs", + ".clj", + ".cljs", + ".lua", + } +) + + +# Common conventions for migration directories (Alembic, Django, Rails, +# Knex, Flyway, Liquibase). 
A ``.sql`` file in any of these is a migration +# rather than a generic DDL — both kinds run through the SQL extractor +# but the migration label keeps the wiki distinguishing forward-only +# changes from current schema. +_MIGRATION_DIR_TOKENS: tuple[str, ...] = ( + "/migrations/", + "/alembic/", + "/db/migrate/", + "/database/migrations/", + "/prisma/migrations/", + "/flyway/", + "/liquibase/", +) + + +# Heuristics for OpenAPI/Swagger detection inside YAML and JSON files. +_OPENAPI_HEAD_PATTERNS: tuple[re.Pattern[str], ...] = ( + re.compile(r"^\s*openapi\s*:\s*[\"']?\d", re.MULTILINE), + re.compile(r'"openapi"\s*:\s*"\d'), + re.compile(r"^\s*swagger\s*:\s*[\"']?\d", re.MULTILINE), + re.compile(r'"swagger"\s*:\s*"\d'), +) + + +def classify(rel_path: Path, sample: str | None = None) -> FileKind: + """Return the :class:`FileKind` for a repo-relative path. + + ``sample`` may carry the first ~4 KB of the file's contents and is + consulted for kinds that can't be decided from the path alone (YAML + / JSON files that may or may not be OpenAPI specs). + """ + suffix = rel_path.suffix.lower() + posix = rel_path.as_posix().lower() + + if suffix in _EXTENSION_KINDS: + kind = _EXTENSION_KINDS[suffix] + if kind is FileKind.SQL and any(token in f"/{posix}" for token in _MIGRATION_DIR_TOKENS): + return FileKind.MIGRATION + return kind + + if suffix in {".yml", ".yaml", ".json"} and sample is not None: + head = sample[:4096] + if any(pat.search(head) for pat in _OPENAPI_HEAD_PATTERNS): + return FileKind.OPENAPI + + if suffix in _APPLICATION_EXTS: + if any(token in f"/{posix}" for token in _MIGRATION_DIR_TOKENS): + return FileKind.MIGRATION + return FileKind.APPLICATION_CODE + + return FileKind.OTHER + + +# --------------------------------------------------------------------------- +# Import graph +# --------------------------------------------------------------------------- + + +@dataclass(frozen=True) +class GraphNode: + rel_path: str + imports: tuple[str, ...] 
+ imported_by: tuple[str, ...] + + def neighbors(self, *, limit: int = 8) -> list[str]: + """Combined neighbor list, capped, for prompt enrichment.""" + out: list[str] = [] + seen: set[str] = set() + for paths in (self.imports, self.imported_by): + for path in paths: + if path not in seen: + seen.add(path) + out.append(path) + if len(out) >= limit: + return out + return out + + +@dataclass +class RepoGraph: + """Per-file import edges across an in-scope file list.""" + + nodes: dict[str, GraphNode] = field(default_factory=dict) + + def get(self, rel_path: str) -> GraphNode | None: + return self.nodes.get(rel_path) + + def neighbor_paths(self, rel_path: str, *, limit: int = 8) -> list[str]: + node = self.nodes.get(rel_path) + return node.neighbors(limit=limit) if node else [] + + def __contains__(self, rel_path: str) -> bool: # pragma: no cover - convenience + return rel_path in self.nodes + + +# Per-language import patterns. Each pattern captures the imported module +# path/identifier; resolution to a real file is handled by a separate +# heuristic. +_PY_IMPORT = re.compile( + r"^\s*(?:from\s+([A-Za-z_][\w.]*)\s+import|import\s+([A-Za-z_][\w.]*))", + re.MULTILINE, +) +_JS_IMPORT = re.compile( + r"""(?:import\s+[^'"\n]*?from\s*['"]([^'"\n]+)['"])""" + r"""|(?:require\(\s*['"]([^'"\n]+)['"]\s*\))""" + r"""|(?:import\(\s*['"]([^'"\n]+)['"]\s*\))""", +) +_GO_IMPORT = re.compile(r"""import\s+(?:\([^)]*\)|\"([^\"]+)\")""", re.DOTALL) +_GO_IMPORT_BLOCK = re.compile(r"^\s*\"([^\"]+)\"", re.MULTILINE) +_JAVA_IMPORT = re.compile(r"^\s*import\s+(?:static\s+)?([\w.]+);", re.MULTILINE) +_RUBY_REQUIRE = re.compile(r"""^\s*require(?:_relative)?\s+['"]([^'"\n]+)['"]""", re.MULTILINE) + + +def build_graph(*, repo_root: Path, files: Iterable[Path]) -> RepoGraph: + """Build a :class:`RepoGraph` from the supplied in-scope files. 
+ + Files outside :data:`_APPLICATION_EXTS` contribute nothing — their + import semantics aren't text-recoverable in any meaningful sense + (binary, image, lockfile, etc.). + """ + file_list = [Path(f) for f in files] + file_set = {p.as_posix() for p in file_list} + candidates_by_module: dict[str, list[str]] = _index_modules(file_set) + + raw_edges: dict[str, set[str]] = defaultdict(set) + reverse: dict[str, set[str]] = defaultdict(set) + + for rel in file_list: + full = repo_root / rel + if rel.suffix.lower() not in _APPLICATION_EXTS: + continue + try: + text = full.read_text(encoding="utf-8", errors="replace") + except OSError: + continue + targets = _resolve_imports(rel, text, file_set=file_set, modules=candidates_by_module) + rel_str = rel.as_posix() + for target in targets: + if target == rel_str: + continue + raw_edges[rel_str].add(target) + reverse[target].add(rel_str) + + nodes: dict[str, GraphNode] = {} + for rel in file_list: + rel_str = rel.as_posix() + nodes[rel_str] = GraphNode( + rel_path=rel_str, + imports=tuple(sorted(raw_edges.get(rel_str, set()))), + imported_by=tuple(sorted(reverse.get(rel_str, set()))), + ) + return RepoGraph(nodes=nodes) + + +def _index_modules(file_set: set[str]) -> dict[str, list[str]]: + """Build module-name → candidate-paths index for resolution. + + For Python ``foo.bar.baz`` we register every dotted prefix that maps + to a concrete file (``foo/bar/baz.py`` or ``foo/bar/baz/__init__.py``). + For Java ``com.foo.Bar`` we register the matching ``com/foo/Bar.java``. + Other languages fall back to filename-stem matching when imports are + bare names. 
+ """ + index: dict[str, list[str]] = defaultdict(list) + for path in file_set: + p = Path(path) + suffix = p.suffix.lower() + stem = p.stem + # Bare filename → all paths sharing that stem + index[stem].append(path) + + if suffix == ".py": + parts = list(p.with_suffix("").parts) + if parts and parts[-1] == "__init__": + parts = parts[:-1] + for size in range(1, len(parts) + 1): + dotted = ".".join(parts[-size:]) + index[dotted].append(path) + elif suffix in {".java", ".kt", ".scala", ".cs"}: + parts = list(p.with_suffix("").parts) + for size in range(1, len(parts) + 1): + dotted = ".".join(parts[-size:]) + index[dotted].append(path) + elif suffix in {".js", ".jsx", ".ts", ".tsx", ".mjs", ".cjs"}: + parts = list(p.parts) + # JS imports are usually written without extension. + for size in range(1, len(parts) + 1): + tail = "/".join(parts[-size:]) + stripped = re.sub(r"\.(?:js|jsx|ts|tsx|mjs|cjs)$", "", tail) + index[stripped].append(path) + index[tail].append(path) + elif suffix == ".go": + parts = list(p.parts) + for size in range(1, len(parts) + 1): + index["/".join(parts[-size:])].append(path) + return index + + +def _resolve_imports( + source: Path, + text: str, + *, + file_set: set[str], + modules: dict[str, list[str]], +) -> list[str]: + suffix = source.suffix.lower() + raw_targets: list[str] = [] + + if suffix == ".py": + for match in _PY_IMPORT.finditer(text): + raw_targets.append(match.group(1) or match.group(2)) + elif suffix in {".js", ".jsx", ".ts", ".tsx", ".mjs", ".cjs"}: + for match in _JS_IMPORT.finditer(text): + raw_targets.append(next((g for g in match.groups() if g), "")) + elif suffix == ".go": + for match in _GO_IMPORT.finditer(text): + block = match.group(0) + for inner in _GO_IMPORT_BLOCK.finditer(block): + raw_targets.append(inner.group(1)) + if match.group(1): + raw_targets.append(match.group(1)) + elif suffix in {".java", ".kt", ".scala", ".cs"}: + for match in _JAVA_IMPORT.finditer(text): + raw_targets.append(match.group(1)) + elif suffix 
== ".rb": + for match in _RUBY_REQUIRE.finditer(text): + raw_targets.append(match.group(1)) + + resolved: list[str] = [] + seen: set[str] = set() + for raw in raw_targets: + if not raw: + continue + normalized = raw.strip().strip('"').strip("'") + if not normalized: + continue + for candidate in _candidates_for(normalized, source=source, file_set=file_set, modules=modules): + if candidate not in seen: + seen.add(candidate) + resolved.append(candidate) + return resolved + + +def _candidates_for( + raw: str, + *, + source: Path, + file_set: set[str], + modules: dict[str, list[str]], +) -> list[str]: + # Relative imports (``./foo``, ``../bar``) — resolve within the repo. + # Path.resolve() would expand against the CWD; we want the result + # relative to the repo root so it can match file_set entries. + if raw.startswith((".", "/")): + target = source.parent / raw + normalized = _normalize_relative(target) + return [p for p in _try_path_variants(normalized) if p in file_set] + + # Strip leading dots from Python relative-from imports + stripped = raw.lstrip(".") + matches = list(modules.get(stripped, [])) + matches += modules.get(stripped.split(".")[-1], []) + matches += modules.get(stripped.split("/")[-1], []) + + out: list[str] = [] + seen: set[str] = set() + for path in matches: + if path in file_set and path not in seen and path != source.as_posix(): + seen.add(path) + out.append(path) + return out + + +def _normalize_relative(path: Path) -> Path: + """Collapse ``..`` / ``.`` segments without touching the filesystem. + + ``Path.resolve()`` would anchor against the current working directory + and break the repo-relative semantics we rely on for graph keys. 
+ """ + parts: list[str] = [] + for part in path.parts: + if part in ("", "."): + continue + if part == "..": + if parts: + parts.pop() + continue + parts.append(part) + return Path(*parts) if parts else Path() + + +def _try_path_variants(path: Path) -> list[str]: + candidates: list[str] = [] + for ext in (".py", ".js", ".ts", ".tsx", ".jsx", ".mjs", ".cjs", ".rb", ".go", ""): + with_ext = path if ext == "" or not path.name else path.with_suffix(ext) + candidates.append(with_ext.as_posix()) + candidates.append((path / "__init__.py").as_posix()) + candidates.append((path / "index.ts").as_posix()) + candidates.append((path / "index.js").as_posix()) + return candidates diff --git a/wikifi/report.py b/wikifi/report.py new file mode 100644 index 0000000..4942f4f --- /dev/null +++ b/wikifi/report.py @@ -0,0 +1,156 @@ +"""``wikifi report`` — coverage and quality view of the wiki. + +The report answers two questions migration leads ask before they fund a +re-implementation: + +1. **Did the walk cover the system?** Per-section file/finding counts, + total files seen vs. files that contributed something. +2. **Is the wiki good enough to act on?** Per-section quality score from + the critic, with the headline ``unsupported_claims`` and ``gaps``. + +The report runs purely from on-disk artifacts (notes JSONL + section +markdown + cache) plus optional provider-driven scoring; it never +modifies the wiki. 
+""" + +from __future__ import annotations + +import logging +from dataclasses import dataclass, field + +from wikifi.cache import WalkCache, load +from wikifi.critic import CoverageStats, Critique, _critique +from wikifi.providers.base import LLMProvider +from wikifi.sections import PRIMARY_SECTIONS, SECTIONS, Section +from wikifi.wiki import WikiLayout, read_notes + +log = logging.getLogger("wikifi.report") + + +@dataclass +class SectionReport: + section: Section + files_contributing: int + findings_count: int + body_chars: int + is_empty: bool + critique: Critique | None = None + + +@dataclass +class WikiReport: + coverage: CoverageStats + sections: list[SectionReport] = field(default_factory=list) + overall_score: float | None = None + + def render(self) -> str: + lines: list[str] = [] + lines.append("# wikifi coverage + quality report") + lines.append("") + lines.append( + f"Files seen: **{self.coverage.files_total}** · " + f"Files with findings: **{self.coverage.files_with_findings}** " + f"({self.coverage.coverage_pct()}%)" + ) + if self.overall_score is not None: + lines.append(f"Overall section score (mean of populated sections): **{self.overall_score:.1f} / 10**") + lines.append("") + lines.append("| Section | Files | Findings | Body | Score | Headline gap |") + lines.append("| --- | ---: | ---: | ---: | ---: | --- |") + for entry in self.sections: + score = "—" if entry.critique is None else f"{entry.critique.score}/10" + gap = "" + if entry.critique and entry.critique.gaps: + gap = entry.critique.gaps[0][:60] + elif entry.critique and entry.critique.unsupported_claims: + gap = "unsupported: " + entry.critique.unsupported_claims[0][:50] + elif entry.is_empty: + gap = "no findings" + lines.append( + f"| `{entry.section.id}` " + f"| {entry.files_contributing} " + f"| {entry.findings_count} " + f"| {entry.body_chars} " + f"| {score} " + f"| {gap} |" + ) + return "\n".join(lines) + + +def build_report( + *, + layout: WikiLayout, + provider: LLMProvider | 
None = None, + score: bool = False, +) -> WikiReport: + """Inspect a wiki and produce a :class:`WikiReport`. + + With ``score=True`` and a provider supplied, every populated section + is run through the critic for a quality score. Without that, the + report is purely structural — useful in CI without an LLM. + """ + files_total, files_with_findings = _coverage_from_cache(layout) + findings_per_section: dict[str, int] = {} + files_per_section: dict[str, int] = {} + for section in PRIMARY_SECTIONS: + notes = read_notes(layout, section) + findings_per_section[section.id] = len(notes) + files_per_section[section.id] = len({n.get("file") for n in notes if n.get("file")}) + + coverage = CoverageStats( + files_total=files_total, + files_with_findings=files_with_findings, + findings_per_section=findings_per_section, + files_per_section=files_per_section, + ) + + section_reports: list[SectionReport] = [] + scored: list[int] = [] + for section in SECTIONS: + path = layout.section_path(section) + body = path.read_text(encoding="utf-8") if path.exists() else "" + is_empty = ( + "Not yet populated" in body + or "No findings were extracted" in body + or "upstream sections required to derive" in body.lower() + ) + critique: Critique | None = None + if score and provider is not None and not is_empty and body.strip(): + critique = _critique( + section=section, + body=body, + upstream=_collect_upstream(layout, section) if section.tier == "derivative" else None, + provider=provider, + ) + scored.append(critique.score) + section_reports.append( + SectionReport( + section=section, + files_contributing=files_per_section.get(section.id, 0), + findings_count=findings_per_section.get(section.id, 0), + body_chars=len(body), + is_empty=is_empty, + critique=critique, + ) + ) + + overall = sum(scored) / len(scored) if scored else None + return WikiReport(coverage=coverage, sections=section_reports, overall_score=overall) + + +def _coverage_from_cache(layout: WikiLayout) -> tuple[int, int]: + 
cache: WalkCache = load(layout) + files_total = len(cache.extraction) + files_with_findings = sum(1 for entry in cache.extraction.values() if entry.findings) + return files_total, files_with_findings + + +def _collect_upstream(layout: WikiLayout, section: Section) -> dict[str, str]: + bodies: dict[str, str] = {} + for upstream_id in section.derived_from: + path = layout.section_path(upstream_id) + if path.exists(): + text = path.read_text(encoding="utf-8") + if "Not yet populated" not in text and "No findings were extracted" not in text: + bodies[upstream_id] = text + return bodies diff --git a/wikifi/specialized/__init__.py b/wikifi/specialized/__init__.py new file mode 100644 index 0000000..d911a91 --- /dev/null +++ b/wikifi/specialized/__init__.py @@ -0,0 +1,58 @@ +"""Type-aware extractors for high-signal source artifacts. + +Schema files, IDLs, OpenAPI specs, and migrations carry the system's +contracts in machine-readable form. Running them through the same prose +LLM extractor as application code is wasteful and lossy: the structure +is already there, the extractor just has to read it. + +Each module in this package implements one or more parsers that consume +the file's text and emit a list of structured findings, in the same +``{section_id, finding, sources}`` shape the LLM extractor produces. +That keeps the downstream aggregator interface unchanged — the +specialized path is a drop-in replacement for the LLM call when the +file kind is recognized. + +Extractor selection lives in :func:`select` below. 
+""" + +from __future__ import annotations + +import logging +from collections.abc import Callable +from dataclasses import dataclass, field + +from wikifi.evidence import SourceRef +from wikifi.repograph import FileKind + +log = logging.getLogger("wikifi.specialized") + + +@dataclass +class SpecializedFinding: + section_id: str + finding: str + sources: list[SourceRef] = field(default_factory=list) + + +@dataclass +class SpecializedResult: + findings: list[SpecializedFinding] = field(default_factory=list) + summary: str = "" + + +# Each extractor takes ``(rel_path, text)`` and returns a SpecializedResult. +ExtractorFn = Callable[[str, str], SpecializedResult] + + +def select(kind: FileKind) -> ExtractorFn | None: + """Return the specialized extractor for a file kind, or ``None``.""" + from wikifi.specialized import graphql, openapi, protobuf, sql + + table: dict[FileKind, ExtractorFn] = { + FileKind.SQL: sql.extract, + FileKind.MIGRATION: sql.extract_migration, + FileKind.OPENAPI: openapi.extract, + FileKind.PROTOBUF: protobuf.extract, + FileKind.GRAPHQL: graphql.extract, + } + return table.get(kind) diff --git a/wikifi/specialized/graphql.py b/wikifi/specialized/graphql.py new file mode 100644 index 0000000..c972bcc --- /dev/null +++ b/wikifi/specialized/graphql.py @@ -0,0 +1,120 @@ +"""GraphQL SDL extractor. + +Pulls types, inputs, queries, mutations, and subscriptions. Maps them to +``entities`` (types/inputs) and ``capabilities`` + ``integrations`` +(query/mutation/subscription roots). 
+""" + +from __future__ import annotations + +import re + +from wikifi.evidence import SourceRef +from wikifi.specialized import SpecializedFinding, SpecializedResult + +_TYPE_RE = re.compile(r"^\s*type\s+(\w+)\s*(?:implements\s+[^\{]+)?\{", re.MULTILINE) +_INPUT_RE = re.compile(r"^\s*input\s+(\w+)\s*\{", re.MULTILINE) +_INTERFACE_RE = re.compile(r"^\s*interface\s+(\w+)\s*\{", re.MULTILINE) +_ENUM_RE = re.compile(r"^\s*enum\s+(\w+)\s*\{", re.MULTILINE) +_SCHEMA_FIELD_RE = re.compile(r"^\s*(\w+)\s*(?:\([^)]*\))?\s*:\s*[^\n]+", re.MULTILINE) + + +def extract(rel_path: str, text: str) -> SpecializedResult: + findings: list[SpecializedFinding] = [] + summary_bits: list[str] = [] + + types = [(m.group(1), _line(text, m.start())) for m in _TYPE_RE.finditer(text)] + inputs = [(m.group(1), _line(text, m.start())) for m in _INPUT_RE.finditer(text)] + interfaces = [(m.group(1), _line(text, m.start())) for m in _INTERFACE_RE.finditer(text)] + enums = [(m.group(1), _line(text, m.start())) for m in _ENUM_RE.finditer(text)] + + domain_types = [t for t in types if t[0] not in {"Query", "Mutation", "Subscription"}] + root_types = [t for t in types if t[0] in {"Query", "Mutation", "Subscription"}] + + if domain_types: + summary_bits.append(f"{len(domain_types)} type(s)") + bullets = "\n".join(f" - **{name}**" for name, _ in domain_types[:25]) + findings.append( + SpecializedFinding( + section_id="entities", + finding=("GraphQL domain types:\n" + bullets), + sources=[ + SourceRef( + file=rel_path, + lines=(domain_types[0][1], domain_types[-1][1]), + ) + ], + ) + ) + + if interfaces: + bullets = "\n".join(f" - **{name}**" for name, _ in interfaces) + findings.append( + SpecializedFinding( + section_id="entities", + finding=("Interfaces (shared shape contracts):\n" + bullets), + sources=[SourceRef(file=rel_path)], + ) + ) + + if inputs: + bullets = "\n".join(f" - **{name}**" for name, _ in inputs[:25]) + findings.append( + SpecializedFinding( + section_id="entities", + finding=("Input 
types (request payload shapes):\n" + bullets), + sources=[SourceRef(file=rel_path)], + ) + ) + + if enums: + bullets = "\n".join(f" - **{name}**" for name, _ in enums[:15]) + findings.append( + SpecializedFinding( + section_id="entities", + finding=("Enum types (closed value sets):\n" + bullets), + sources=[SourceRef(file=rel_path)], + ) + ) + + if root_types: + # Pull each root's fields by scanning the snippet between its + # declaration line and the next ``}``. + for name, line in root_types: + block = _block_after(text, line) + fields = _SCHEMA_FIELD_RE.findall(block) + bullets = "\n".join(f" - `{f}`" for f in fields[:30]) + section_id = "capabilities" if name in {"Query", "Mutation"} else "integrations" + findings.append( + SpecializedFinding( + section_id=section_id, + finding=(f"GraphQL **{name}** root exposes:\n" + (bullets or " - (no fields detected)")), + sources=[SourceRef(file=rel_path, lines=(line, line))], + ) + ) + summary_bits.append(", ".join(name for name, _ in root_types) + " roots") + + return SpecializedResult( + findings=findings, + summary=("GraphQL SDL: " + ", ".join(summary_bits)) if summary_bits else "GraphQL SDL.", + ) + + +def _line(text: str, offset: int) -> int: + return text.count("\n", 0, offset) + 1 + + +def _block_after(text: str, line: int) -> str: + """Return the text between line ``line`` and the next top-level ``}``. + + Approximate: enough to read field declarations for a GraphQL root + type. Matches ``}`` that appears at column 0. + """ + lines = text.splitlines() + start = max(0, line - 1) + out: list[str] = [] + for ln in lines[start:]: + if ln.startswith("}"): + break + out.append(ln) + return "\n".join(out) diff --git a/wikifi/specialized/openapi.py b/wikifi/specialized/openapi.py new file mode 100644 index 0000000..db407d3 --- /dev/null +++ b/wikifi/specialized/openapi.py @@ -0,0 +1,188 @@ +"""OpenAPI / Swagger contract extractor. 
+ +OpenAPI specs are migration gold: every public endpoint, request/response +body, and authentication method is enumerated in one structured document. +We avoid pulling PyYAML as a hard dependency by attempting JSON first, +then falling back to a small permissive YAML parser sufficient for the +keys we read. Specs that exceed the parser's limits are flagged with a +single ``capabilities`` finding noting the parse failure rather than +crashing the walk. +""" + +from __future__ import annotations + +import json +import logging +import re +from typing import Any + +from wikifi.evidence import SourceRef +from wikifi.specialized import SpecializedFinding, SpecializedResult + +log = logging.getLogger("wikifi.specialized.openapi") + + +def extract(rel_path: str, text: str) -> SpecializedResult: + spec = _parse(text) + if spec is None: + return SpecializedResult( + findings=[ + SpecializedFinding( + section_id="capabilities", + finding=( + "An API contract was found but could not be parsed for " + "structured extraction. Migration teams should consult " + "this file directly for endpoint inventory." 
+ ), + sources=[SourceRef(file=rel_path)], + ) + ], + summary="Unparseable API spec — manual review recommended.", + ) + + findings: list[SpecializedFinding] = [] + summary_bits: list[str] = [] + + info = spec.get("info") or {} + if isinstance(info, dict) and (title := info.get("title")): + findings.append( + SpecializedFinding( + section_id="intent", + finding=( + f"The system exposes a public API titled **{title}**" + + (f" (v{info.get('version')})" if info.get("version") else "") + + (f": {info.get('description')}" if info.get("description") else ".") + ), + sources=[SourceRef(file=rel_path)], + ) + ) + + paths = spec.get("paths") or {} + if isinstance(paths, dict): + verbs = ("get", "post", "put", "patch", "delete", "head", "options") + endpoints: list[tuple[str, str, str]] = [] + for path, ops in paths.items(): + if not isinstance(ops, dict): + continue + for verb in verbs: + op = ops.get(verb) + if not isinstance(op, dict): + continue + description = op.get("summary") or op.get("description") or "" + endpoints.append((verb.upper(), str(path), str(description))) + if endpoints: + summary_bits.append(f"{len(endpoints)} endpoint(s)") + top = endpoints[:20] + bullets = "\n".join(f" - `{verb} {path}`{(' — ' + desc) if desc else ''}" for verb, path, desc in top) + more = f"\n - … {len(endpoints) - 20} more endpoint(s) elided." if len(endpoints) > 20 else "" + findings.append( + SpecializedFinding( + section_id="capabilities", + finding=("Public API surface (subset shown):\n" + bullets + more), + sources=[SourceRef(file=rel_path)], + ) + ) + findings.append( + SpecializedFinding( + section_id="integrations", + finding=( + f"Inbound integration: HTTP API exposes {len(endpoints)} endpoint(s) for external consumers." 
+ ), + sources=[SourceRef(file=rel_path)], + ) + ) + + components = spec.get("components") or {} + schemas = components.get("schemas") if isinstance(components, dict) else None + if isinstance(schemas, dict) and schemas: + names = list(schemas.keys()) + summary_bits.append(f"{len(names)} schema(s)") + bullets = "\n".join(f" - **{name}**" for name in names[:25]) + more = f"\n - … {len(names) - 25} more schema(s) elided." if len(names) > 25 else "" + findings.append( + SpecializedFinding( + section_id="entities", + finding=("API schemas (request/response models):\n" + bullets + more), + sources=[SourceRef(file=rel_path)], + ) + ) + + security = components.get("securitySchemes") if isinstance(components, dict) else None + if isinstance(security, dict) and security: + types = sorted({(v or {}).get("type", "?") for v in security.values() if isinstance(v, dict)}) + findings.append( + SpecializedFinding( + section_id="cross_cutting", + finding=("Authentication contract for the API: scheme(s) " + ", ".join(f"`{t}`" for t in types) + "."), + sources=[SourceRef(file=rel_path)], + ) + ) + + return SpecializedResult( + findings=findings, + summary="API contract: " + ", ".join(summary_bits) if summary_bits else "API contract.", + ) + + +def _parse(text: str) -> dict[str, Any] | None: + stripped = text.strip() + if not stripped: + return None + if stripped.startswith("{"): + try: + return json.loads(stripped) + except json.JSONDecodeError: + return None + try: + import yaml # type: ignore[import-not-found] + except ImportError: + return _shallow_yaml(stripped) + try: + loaded = yaml.safe_load(stripped) + except Exception as exc: # pragma: no cover - depends on installed PyYAML + log.warning("yaml parse failed: %s", exc) + return None + return loaded if isinstance(loaded, dict) else None + + +# --------------------------------------------------------------------------- +# Tiny YAML fallback — only handles the OpenAPI subset we need (top-level +# keys, simple nested dicts, and 
method blocks under paths). +# --------------------------------------------------------------------------- + + +_KEY_RE = re.compile(r"^(\s*)([\w./{}-]+):\s*(.*)$") + + +def _shallow_yaml(text: str) -> dict[str, Any] | None: + """Best-effort YAML parser sufficient for OpenAPI's known shape. + + Returns nested dicts where each key contributes a string value or a + nested dict; lists and complex flow-style structures collapse to + string descriptions, which is fine for the keys :func:`extract` + actually inspects. + """ + root: dict[str, Any] = {} + stack: list[tuple[int, dict[str, Any]]] = [(-1, root)] + for raw_line in text.splitlines(): + if not raw_line.strip() or raw_line.lstrip().startswith("#"): + continue + match = _KEY_RE.match(raw_line) + if not match: + continue + indent = len(match.group(1)) + key = match.group(2).strip() + value = match.group(3).strip() + while stack and stack[-1][0] >= indent: + stack.pop() + if not stack: + stack.append((-1, root)) + parent = stack[-1][1] + if value == "" or value == "{}": + child: dict[str, Any] = {} + parent[key] = child + stack.append((indent, child)) + else: + stripped = value.strip().strip('"').strip("'") + parent[key] = stripped + return root or None diff --git a/wikifi/specialized/protobuf.py b/wikifi/specialized/protobuf.py new file mode 100644 index 0000000..b4c864b --- /dev/null +++ b/wikifi/specialized/protobuf.py @@ -0,0 +1,106 @@ +"""Protobuf IDL extractor. + +Surfaces ``message`` types as entities and ``service``/``rpc`` blocks as +integration touchpoints. Proto files are pure contract: a migration team +re-implementing in a new stack can read these findings directly into +their interface design. 
+""" + +from __future__ import annotations + +import re + +from wikifi.evidence import SourceRef +from wikifi.specialized import SpecializedFinding, SpecializedResult + +_MESSAGE_RE = re.compile(r"^\s*message\s+(\w+)\s*\{", re.MULTILINE) +_SERVICE_RE = re.compile(r"^\s*service\s+(\w+)\s*\{", re.MULTILINE) +_RPC_RE = re.compile( + r"^\s*rpc\s+(\w+)\s*\(\s*(stream\s+)?([\w.]+)\s*\)\s*returns\s*\(\s*(stream\s+)?([\w.]+)\s*\)", + re.MULTILINE, +) +_ENUM_RE = re.compile(r"^\s*enum\s+(\w+)\s*\{", re.MULTILINE) +_PACKAGE_RE = re.compile(r"^\s*package\s+([\w.]+)\s*;", re.MULTILINE) + + +def extract(rel_path: str, text: str) -> SpecializedResult: + findings: list[SpecializedFinding] = [] + summary_bits: list[str] = [] + + package_match = _PACKAGE_RE.search(text) + package = package_match.group(1) if package_match else "" + + messages = [(m.group(1), _line(text, m.start())) for m in _MESSAGE_RE.finditer(text)] + enums = [(m.group(1), _line(text, m.start())) for m in _ENUM_RE.finditer(text)] + services = [(m.group(1), _line(text, m.start())) for m in _SERVICE_RE.finditer(text)] + rpcs = [ + (m.group(1), m.group(3), m.group(5), bool(m.group(2)), bool(m.group(4)), _line(text, m.start())) + for m in _RPC_RE.finditer(text) + ] + + if messages: + summary_bits.append(f"{len(messages)} message(s)") + bullets = "\n".join(f" - **{name}**" for name, _ in messages[:25]) + more = f"\n - … {len(messages) - 25} more message(s) elided." 
if len(messages) > 25 else "" + findings.append( + SpecializedFinding( + section_id="entities", + finding=( + f"Protocol entities{(' in package `' + package + '`') if package else ''}:\n" + bullets + more + ), + sources=[SourceRef(file=rel_path, lines=(messages[0][1], messages[-1][1]))], + ) + ) + + if enums: + bullets = "\n".join(f" - **{name}**" for name, _ in enums[:15]) + findings.append( + SpecializedFinding( + section_id="entities", + finding=("Enum types (closed value sets):\n" + bullets), + sources=[SourceRef(file=rel_path, lines=(enums[0][1], enums[-1][1]))], + ) + ) + + for service_name, line in services: + related = [r for r in rpcs if line <= r[5] and not any(line < start <= r[5] for _, start in services)] + bullets = "\n".join( + f" - `{name}({_arrow(in_msg, in_stream)}) -> {_arrow(out_msg, out_stream)}`" + for name, in_msg, out_msg, in_stream, out_stream, _ in related[:25] + ) + findings.append( + SpecializedFinding( + section_id="integrations", + finding=( + f"Service **{service_name}** exposes the following RPCs:\n" + + (bullets if bullets else " - (no RPCs detected)") + ), + sources=[SourceRef(file=rel_path, lines=(line, line))], + ) + ) + if services: + summary_bits.append(f"{len(services)} service(s)") + if rpcs: + summary_bits.append(f"{len(rpcs)} rpc(s)") + findings.append( + SpecializedFinding( + section_id="capabilities", + finding=( + f"Wire protocol exposes {len(rpcs)} remote procedure(s) across {len(services) or 1} service(s)." 
+ ), + sources=[SourceRef(file=rel_path)], + ) + ) + + return SpecializedResult( + findings=findings, + summary=("Proto file: " + ", ".join(summary_bits)) if summary_bits else "Proto file.", + ) + + +def _arrow(name: str, stream: bool) -> str: + return f"stream {name}" if stream else name + + +def _line(text: str, offset: int) -> int: + return text.count("\n", 0, offset) + 1 diff --git a/wikifi/specialized/sql.py b/wikifi/specialized/sql.py new file mode 100644 index 0000000..c3b6926 --- /dev/null +++ b/wikifi/specialized/sql.py @@ -0,0 +1,225 @@ +"""SQL DDL + migration extractor. + +Pulls table definitions, columns, primary/foreign keys, indexes, and +constraints. Each table becomes an ``entities`` finding; foreign keys +become ``integrations``-style relationships if they cross obvious +service boundaries (heuristic), and ``cross_cutting`` for storage +invariants like ``UNIQUE`` and ``NOT NULL`` constraints. + +Migration files (Alembic/Knex/Flyway/etc.) are extracted with the same +parser and additionally tagged in the summary so the migration team can +spot forward-only schema changes vs. baseline DDL. +""" + +from __future__ import annotations + +import re +from dataclasses import dataclass, field + +from wikifi.evidence import SourceRef +from wikifi.specialized import SpecializedFinding, SpecializedResult + +# Line-number tracking is precise to "the line containing the matched +# keyword" — that's specific enough for citations and avoids the cost +# of a full SQL parser. Migrations frequently mix dialects; we tolerate +# anything that loosely matches the keyword grammar. +_CREATE_TABLE_RE = re.compile( + r"create\s+(?:or\s+replace\s+)?(?:temporary\s+)?table\s+(?:if\s+not\s+exists\s+)?" 
+ r"([\"`\[\]\w.]+)\s*\((.*?)\)\s*;", + re.IGNORECASE | re.DOTALL, +) +_ALTER_TABLE_RE = re.compile( + r"alter\s+table\s+([\"`\[\]\w.]+)\s+(.*?);", + re.IGNORECASE | re.DOTALL, +) +_FK_RE = re.compile( + r"foreign\s+key\s*\(([^)]+)\)\s*references\s+([\"`\[\]\w.]+)\s*\(([^)]+)\)", + re.IGNORECASE, +) +_REF_INLINE_RE = re.compile(r"references\s+([\"`\[\]\w.]+)\s*\(([^)]+)\)", re.IGNORECASE) +_UNIQUE_RE = re.compile(r"\bunique\b", re.IGNORECASE) +_NOT_NULL_RE = re.compile(r"\bnot\s+null\b", re.IGNORECASE) +_INDEX_RE = re.compile( + r"create\s+(?:unique\s+)?index\s+(?:if\s+not\s+exists\s+)?([\"`\[\]\w.]+)\s+on\s+([\"`\[\]\w.]+)", + re.IGNORECASE, +) + + +@dataclass +class _TableHit: + name: str + line: int + body: str + columns: list[str] = field(default_factory=list) + fks: list[tuple[str, str, str]] = field(default_factory=list) + + +def extract(rel_path: str, text: str) -> SpecializedResult: + return _extract(rel_path, text, migration=False) + + +def extract_migration(rel_path: str, text: str) -> SpecializedResult: + return _extract(rel_path, text, migration=True) + + +def _extract(rel_path: str, text: str, *, migration: bool) -> SpecializedResult: + findings: list[SpecializedFinding] = [] + tables: list[_TableHit] = [] + + for match in _CREATE_TABLE_RE.finditer(text): + name = _strip_ident(match.group(1)) + body = match.group(2) + line = _line_of(text, match.start()) + hit = _TableHit(name=name, line=line, body=body) + _populate_columns(hit) + tables.append(hit) + + for hit in tables: + bullet_lines = ", ".join(hit.columns) if hit.columns else "(no columns parsed)" + prefix = "Migration adds" if migration else "Persists" + findings.append( + SpecializedFinding( + section_id="entities", + finding=(f"{prefix} the **{hit.name}** entity. 
Columns: {bullet_lines}."), + sources=[SourceRef(file=rel_path, lines=(hit.line, hit.line))], + ) + ) + + for column, ref_table, ref_column in hit.fks: + findings.append( + SpecializedFinding( + section_id="integrations", + finding=( + f"`{hit.name}.{column}` references " + f"`{ref_table}.{ref_column}` — a hard relational link " + "between these entities." + ), + sources=[SourceRef(file=rel_path, lines=(hit.line, hit.line))], + ) + ) + + constraints = _parse_constraints(hit.body) + if constraints: + findings.append( + SpecializedFinding( + section_id="cross_cutting", + finding=(f"Storage invariants on **{hit.name}**: {constraints}."), + sources=[SourceRef(file=rel_path, lines=(hit.line, hit.line))], + ) + ) + + for match in _ALTER_TABLE_RE.finditer(text): + line = _line_of(text, match.start()) + target = _strip_ident(match.group(1)) + action = match.group(2).strip() + prefix = "Migration alters" if migration else "Alters" + findings.append( + SpecializedFinding( + section_id="entities", + finding=(f"{prefix} entity **{target}**: {_summarize_alter(action)}."), + sources=[SourceRef(file=rel_path, lines=(line, line))], + ) + ) + + for match in _INDEX_RE.finditer(text): + line = _line_of(text, match.start()) + idx = _strip_ident(match.group(1)) + target = _strip_ident(match.group(2)) + findings.append( + SpecializedFinding( + section_id="cross_cutting", + finding=( + f"Index `{idx}` on **{target}** — encodes a query-time " + "performance invariant the new system must preserve." + ), + sources=[SourceRef(file=rel_path, lines=(line, line))], + ) + ) + + summary = f"Migration touches {len(tables)} table(s)." if migration else f"Schema for {len(tables)} table(s)." 
+ return SpecializedResult(findings=findings, summary=summary) + + +def _populate_columns(hit: _TableHit) -> None: + """Pull column names + foreign-key edges from a CREATE TABLE body.""" + body = hit.body + columns: list[str] = [] + fks: list[tuple[str, str, str]] = [] + + for fk in _FK_RE.finditer(body): + local_cols = [c.strip().strip('"`[]') for c in fk.group(1).split(",")] + ref_table = _strip_ident(fk.group(2)) + ref_cols = [c.strip().strip('"`[]') for c in fk.group(3).split(",")] + for lc, rc in zip(local_cols, ref_cols, strict=False): + fks.append((lc, ref_table, rc)) + + # Split top-level commas so we can read column lines. + for raw_line in _split_top_level_commas(body): + line = raw_line.strip() + if not line: + continue + lowered = line.lower() + if re.match(r"(primary\s+key|foreign\s+key|unique|constraint|check|index)\b", lowered): + continue + # First token is the column name (may be quoted). + match = re.match(r"\s*([\"`\[]?[\w]+[\"`\]]?)", line) + if not match: + continue + column = match.group(1).strip('"`[]') + columns.append(column) + + ref = _REF_INLINE_RE.search(line) + if ref: + ref_table = _strip_ident(ref.group(1)) + ref_cols = [c.strip().strip('"`[]') for c in ref.group(2).split(",")] + for rc in ref_cols: + fks.append((column, ref_table, rc)) + + hit.columns = columns + hit.fks = fks + + +def _split_top_level_commas(body: str) -> list[str]: + """Split on commas that are not inside parentheses.""" + out: list[str] = [] + depth = 0 + buf: list[str] = [] + for ch in body: + if ch == "(": + depth += 1 + buf.append(ch) + elif ch == ")": + depth = max(0, depth - 1) + buf.append(ch) + elif ch == "," and depth == 0: + out.append("".join(buf)) + buf = [] + else: + buf.append(ch) + if buf: + out.append("".join(buf)) + return out + + +def _parse_constraints(body: str) -> str: + bits: list[str] = [] + if _UNIQUE_RE.search(body): + bits.append("UNIQUE") + if _NOT_NULL_RE.search(body): + bits.append("NOT NULL") + return ", ".join(bits) + + +def 
_summarize_alter(action: str) -> str: + cleaned = " ".join(action.split()) + if len(cleaned) > 160: + cleaned = cleaned[:157] + "..." + return cleaned + + +def _strip_ident(name: str) -> str: + return name.strip().strip('"`[]') + + +def _line_of(text: str, offset: int) -> int: + return text.count("\n", 0, offset) + 1 From f0a8f527ab6435eb3e223ed49fec223e8a5ce597 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 1 May 2026 17:57:37 +0000 Subject: [PATCH 2/9] feat: add OpenAI provider with auto-cached prefixes + reasoning effort MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Third backend alongside Ollama (default) and Anthropic. Selected via WIKIFI_PROVIDER=openai plus an OPENAI_API_KEY. Implementation notes: - Structured output via client.chat.completions.parse — returns a schema-validated Pydantic instance directly, same protocol contract as the Anthropic path. - Prompt caching is automatic (≥ 1024-token prefixes, ~5-10 min). No cache_control marker needed; system prompt sits at message[0] so the multi-KB extraction prompt is what the prefix cache catches. - Reasoning effort: think={"low","medium","high"} routes to reasoning_effort on o*/gpt-5 models and is stripped on plain models to avoid future-strict 400s. Reasoning models also receive max_completion_tokens in place of max_tokens. - APIError → RuntimeError, mirroring the Anthropic provider so per-call fallback paths in extractor/aggregator/deriver are unchanged. 
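A minimal sketch of the reasoning-effort and token-cap routing described in the notes above, assuming reasoning-capable models are detected by a prefix match on the model id. `REASONING_MODEL_RE` and `request_kwargs` are hypothetical names used for illustration only; they are not the provider's actual helpers.

```python
import re

# Assumption: reasoning-capable ids are recognised by prefix (o1, o3, o4, gpt-5, ...).
REASONING_MODEL_RE = re.compile(r"^(o\d|gpt-5)", re.IGNORECASE)


def request_kwargs(model: str, think: object, max_tokens: int = 16_000) -> dict:
    """Hypothetical helper: build the token-cap and effort kwargs for one call."""
    if not REASONING_MODEL_RE.match(model):
        # Plain model: classic token cap, effort knob stripped from the request.
        return {"max_tokens": max_tokens}
    # Reasoning model: the cap moves to max_completion_tokens regardless of think.
    kwargs: dict = {"max_completion_tokens": max_tokens}
    if think in ("low", "medium", "high"):
        # Only explicit levels are forwarded; True, False, "off" and None are not.
        kwargs["reasoning_effort"] = think
    return kwargs
```

Under these assumptions, `request_kwargs("o3-mini", "medium")` carries both `max_completion_tokens` and `reasoning_effort`, while `request_kwargs("gpt-4o", "high")` carries only `max_tokens`.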
Wiring: - wikifi/config.py: openai_api_key, openai_base_url, openai_max_tokens - wikifi/orchestrator.build_provider: dispatches openai with a default guard (model id replaced with gpt-4o if it doesn't look like a GPT/o-series id, mirroring the anthropic guard) - wikifi/cli.py: --provider help text mentions all three options Tests: - tests/test_openai_provider.py (10 cases): parse path returns Pydantic, fallback to validate_json, APIError mapping, text + chat, reasoning-effort + max_completion_tokens routing on reasoning models, full (model, think) translation table. - tests/test_orchestrator.py: build_provider dispatch + model-default preservation cases. 168 tests pass (was 156); 93% total coverage. Lint clean. https://claude.ai/code/session_01K3H5GMhcvfc5HB63NhykcL --- README.md | 4 +- TESTING-AND-DEMO.md | 48 ++++-- pyproject.toml | 1 + tests/test_openai_provider.py | 232 ++++++++++++++++++++++++++ tests/test_orchestrator.py | 21 +++ uv.lock | 33 ++++ wikifi/cli.py | 5 +- wikifi/config.py | 25 ++- wikifi/orchestrator.py | 28 +++- wikifi/providers/openai_provider.py | 241 ++++++++++++++++++++++++++++ 10 files changed, 620 insertions(+), 18 deletions(-) create mode 100644 tests/test_openai_provider.py create mode 100644 wikifi/providers/openai_provider.py diff --git a/README.md b/README.md index bf4892e..9514185 100644 --- a/README.md +++ b/README.md @@ -38,7 +38,9 @@ uv run wikifi init - **Section synthesis** — primary capture sections are synthesized from the accumulated per-file findings; the aggregator emits a structured `EvidenceBundle` (body + claims + contradictions) and the renderer threads numbered citations + a "Conflicts in source" block into the section markdown. Derivative sections (personas, user stories, diagrams) are produced *after* primary content is complete, taking the synthesized primary content as their input. 
- **Critic + reviser** (`wikifi/critic.py`) — opt-in (`walk --review`), runs a quality pass on derivative sections: scores the body against its brief and upstream evidence, identifies unsupported claims, and re-synthesizes when the score is below threshold. Only accepts a revision if it scores at least as well as the original. - **Coverage + quality report** (`wikifi/report.py`) — `wikifi report` produces a per-section view of files contributing, finding count, body size, and (with `--score`) critic-derived quality scores. -- **Provider abstraction** — the LLM backend is reached through a provider interface. Default is a local Ollama server (`OllamaProvider`); the hosted Anthropic backend (`AnthropicProvider`) is opt-in via `WIKIFI_PROVIDER=anthropic` and uses prompt caching with `cache_control: ephemeral` on the system prompt so the multi-KB extraction prompt is paid for once across hundreds of per-file calls. +- **Provider abstraction** — the LLM backend is reached through a provider interface. Default is a local Ollama server (`OllamaProvider`); two hosted backends are opt-in: + - `AnthropicProvider` via `WIKIFI_PROVIDER=anthropic` — uses prompt caching with `cache_control: ephemeral` on the system prompt so the multi-KB extraction prompt is paid for once across hundreds of per-file calls. + - `OpenAIProvider` via `WIKIFI_PROVIDER=openai` — relies on OpenAI's automatic prefix caching (no marker required) and routes the `think` knob to `reasoning_effort` on `o*`/`gpt-5` reasoning models. - **Wiki adapter** — writes the rendered wiki into the target's `.wikifi/` directory. Layout, taxonomy, and structure within `.wikifi/` are at the implementor's discretion, provided the content contract from `VISION.md` is met. 
## Tech stack diff --git a/TESTING-AND-DEMO.md b/TESTING-AND-DEMO.md index 4bcc8a9..5c72fdf 100644 --- a/TESTING-AND-DEMO.md +++ b/TESTING-AND-DEMO.md @@ -204,9 +204,13 @@ findings count, body size, score, headline gap): (Unit evidence: `tests/test_report.py`.) -### 9. Anthropic provider with prompt caching +### 9. Hosted providers with prompt caching -Set the API key and switch the provider for a walk: +Two opt-in hosted backends share the same provider abstraction. + +**Anthropic.** Sets `cache_control: {"type": "ephemeral"}` on the system +prompt block; subsequent per-file extraction calls read the cache for +~10% of the input price. ```bash export ANTHROPIC_API_KEY=sk-ant-... @@ -215,13 +219,6 @@ WIKIFI_PROVIDER=anthropic uv run wikifi walk uv run wikifi walk --provider anthropic ``` -The provider sets `cache_control: {"type": "ephemeral"}` on the system -prompt block. After the first per-file extraction call writes the -cache, subsequent calls within the cache window read it for ~10% of -the input price. - -To verify caching is active in the wild, intercept the SDK's response: - ```python from wikifi.providers.anthropic_provider import AnthropicProvider provider = AnthropicProvider(model="claude-opus-4-7", think="high") @@ -229,11 +226,42 @@ provider = AnthropicProvider(model="claude-opus-4-7", think="high") # response.usage.cache_read_input_tokens > 0 ``` +**OpenAI.** Relies on OpenAI's automatic prefix caching — no marker +required, prefixes ≥ 1024 tokens are cached for ~5–10 minutes. The +provider also routes the `think` knob to `reasoning_effort` on +reasoning-capable models (`o*`, `gpt-5`): + +```bash +export OPENAI_API_KEY=sk-... 
+WIKIFI_PROVIDER=openai uv run wikifi walk +# or, with a reasoning model: +WIKIFI_PROVIDER=openai WIKIFI_MODEL=o3-mini uv run wikifi walk +# or via flag: +uv run wikifi walk --provider openai +``` + +```python +from wikifi.providers.openai_provider import OpenAIProvider +provider = OpenAIProvider(model="gpt-4o", think="high") +# Reasoning routing: +# OpenAIProvider(model="o3-mini", think="medium") → forwards reasoning_effort +# OpenAIProvider(model="gpt-4o", think="medium") → no reasoning_effort +``` + +For Azure-OpenAI or a corporate proxy, set +`WIKIFI_OPENAI_BASE_URL` (or pass `base_url=...` directly to the +constructor). + (Unit evidence: `tests/test_anthropic_provider.py` locks in the `cache_control` placement, the `messages.parse` structured-output contract, the thinking → effort translation, and the APIError → +RuntimeError mapping. `tests/test_openai_provider.py` covers the +`chat.completions.parse` structured-output contract, the +reasoning-effort routing for `o*`/`gpt-5` vs plain models, the +`max_tokens` vs `max_completion_tokens` swap, and the same APIError → RuntimeError mapping. `test_build_provider_returns_anthropic_when_selected` -in `tests/test_orchestrator.py` covers dispatch.) +and `test_build_provider_returns_openai_when_selected` in +`tests/test_orchestrator.py` cover dispatch.) ## Tearing down diff --git a/pyproject.toml b/pyproject.toml index 9496eca..5908fac 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -15,6 +15,7 @@ dependencies = [ "rich>=13.7", "pathspec>=0.12", "anthropic>=0.40", + "openai>=1.50", ] [project.scripts] diff --git a/tests/test_openai_provider.py b/tests/test_openai_provider.py new file mode 100644 index 0000000..023d4ee --- /dev/null +++ b/tests/test_openai_provider.py @@ -0,0 +1,232 @@ +"""OpenAIProvider tests. + +The HTTP transport is mocked via the ``client=`` injection point so the +test never touches the network. 
The point is to lock in the wikifi +contract: structured output via ``chat.completions.parse``, the +reasoning-effort routing for reasoning vs. plain models, the +``max_tokens`` vs ``max_completion_tokens`` swap, and APIError → +RuntimeError mapping. +""" + +from __future__ import annotations + +from types import SimpleNamespace + +import openai +import pytest +from pydantic import BaseModel + +from wikifi.providers.openai_provider import OpenAIProvider + + +class _Echo(BaseModel): + value: str + + +class _StubClient: + """Minimal stand-in for ``openai.OpenAI``. + + Exposes ``chat.completions.parse`` and ``chat.completions.create`` + via the same ``SimpleNamespace`` shape the real SDK uses. + """ + + def __init__( + self, + *, + parse_response=None, + create_response=None, + raise_on_parse: Exception | None = None, + raise_on_create: Exception | None = None, + ) -> None: + self.parse_calls: list[dict] = [] + self.create_calls: list[dict] = [] + self._parse_response = parse_response + self._create_response = create_response + self._raise_on_parse = raise_on_parse + self._raise_on_create = raise_on_create + self.chat = SimpleNamespace( + completions=SimpleNamespace(parse=self._parse, create=self._create), + ) + + def _parse(self, **kwargs): + self.parse_calls.append(kwargs) + if self._raise_on_parse is not None: + raise self._raise_on_parse + return self._parse_response + + def _create(self, **kwargs): + self.create_calls.append(kwargs) + if self._raise_on_create is not None: + raise self._raise_on_create + return self._create_response + + +def _api_error(message: str = "boom", request_id: str = "req_abc") -> openai.APIError: + """Construct an APIError without going through the real httpx wiring.""" + err = openai.APIError.__new__(openai.APIError) + err.message = message + err.request_id = request_id + err.args = (message,) + return err + + +def _parse_response(parsed): + return SimpleNamespace( + choices=[SimpleNamespace(message=SimpleNamespace(parsed=parsed, 
content=""))], + ) + + +def _text_response(text: str | None): + return SimpleNamespace( + choices=[SimpleNamespace(message=SimpleNamespace(content=text, parsed=None))], + ) + + +# --------------------------------------------------------------------------- +# complete_json +# --------------------------------------------------------------------------- + + +def test_complete_json_returns_parsed_pydantic_instance(): + parsed = _Echo(value="hello") + client = _StubClient(parse_response=_parse_response(parsed)) + provider = OpenAIProvider(model="gpt-4o", client=client, think="high") + + result = provider.complete_json(system="SYS", user="USR", schema=_Echo) + + assert result == parsed + call = client.parse_calls[0] + assert call["model"] == "gpt-4o" + assert call["response_format"] is _Echo + assert call["messages"] == [ + {"role": "system", "content": "SYS"}, + {"role": "user", "content": "USR"}, + ] + # gpt-4o is non-reasoning → max_tokens, not max_completion_tokens. + assert "max_tokens" in call + assert "max_completion_tokens" not in call + # think="high" must NOT leak through on a non-reasoning model. 
+ assert "reasoning_effort" not in call + + +def test_complete_json_falls_back_to_validate_json_when_parsed_missing(): + response = SimpleNamespace( + choices=[SimpleNamespace(message=SimpleNamespace(parsed=None, content='{"value": "fallback"}'))], + ) + client = _StubClient(parse_response=response) + provider = OpenAIProvider(client=client) + out = provider.complete_json(system="s", user="u", schema=_Echo) + assert out == _Echo(value="fallback") + + +def test_complete_json_raises_runtime_error_on_api_error(): + client = _StubClient(raise_on_parse=_api_error("rate-limited", "req_xyz")) + provider = OpenAIProvider(client=client) + with pytest.raises(RuntimeError) as info: + provider.complete_json(system="s", user="u", schema=_Echo) + assert "req_xyz" in str(info.value) + assert "rate-limited" in str(info.value) + + +# --------------------------------------------------------------------------- +# complete_text + chat +# --------------------------------------------------------------------------- + + +def test_complete_text_extracts_first_message_content(): + client = _StubClient(create_response=_text_response("hi")) + provider = OpenAIProvider(client=client) + assert provider.complete_text(system="s", user="u") == "hi" + + +def test_complete_text_returns_empty_when_content_none(): + client = _StubClient(create_response=_text_response(None)) + provider = OpenAIProvider(client=client) + assert provider.complete_text(system="s", user="u") == "" + + +def test_chat_prepends_system_and_returns_content(): + client = _StubClient(create_response=_text_response("reply")) + provider = OpenAIProvider(client=client) + out = provider.chat( + system="SYS", + messages=[ + {"role": "user", "content": "first"}, + {"role": "assistant", "content": "first reply"}, + {"role": "user", "content": "second"}, + ], + ) + assert out == "reply" + call = client.create_calls[0] + assert call["messages"][0] == {"role": "system", "content": "SYS"} + assert call["messages"][-1] == {"role": "user", 
"content": "second"} + assert len(call["messages"]) == 4 + + +# --------------------------------------------------------------------------- +# Reasoning model routing +# --------------------------------------------------------------------------- + + +def test_reasoning_model_forwards_reasoning_effort_and_uses_completion_tokens(): + """o-series + gpt-5 models should receive ``reasoning_effort`` and + ``max_completion_tokens`` instead of ``max_tokens``.""" + client = _StubClient(create_response=_text_response("x")) + provider = OpenAIProvider(model="o3-mini", client=client, think="medium") + provider.complete_text(system="s", user="u") + call = client.create_calls[0] + assert call["reasoning_effort"] == "medium" + assert "max_completion_tokens" in call + assert "max_tokens" not in call + + +def test_reasoning_model_strips_effort_when_think_is_off(): + client = _StubClient(create_response=_text_response("x")) + provider = OpenAIProvider(model="gpt-5", client=client, think=False) + provider.complete_text(system="s", user="u") + call = client.create_calls[0] + assert "reasoning_effort" not in call + # Reasoning model still uses max_completion_tokens regardless of think. 
+ assert "max_completion_tokens" in call + + +def test_plain_model_does_not_forward_reasoning_effort(): + client = _StubClient(create_response=_text_response("x")) + provider = OpenAIProvider(model="gpt-4o", client=client, think="high") + provider.complete_text(system="s", user="u") + call = client.create_calls[0] + assert "reasoning_effort" not in call + + +# --------------------------------------------------------------------------- +# Token-knob translation table +# --------------------------------------------------------------------------- + + +def test_reasoning_kwargs_translation_table(): + """Lock the (model, think) → request mapping so the contract is testable.""" + client = _StubClient(create_response=_text_response("x")) + cases = [ + # Reasoning-capable model: each level forwards through + ("o3-mini", "low", {"reasoning_effort": "low"}), + ("o3-mini", "medium", {"reasoning_effort": "medium"}), + ("o3-mini", "high", {"reasoning_effort": "high"}), + ("o3-mini", True, {}), # SDK default + ("o3-mini", False, {}), # disabled + ("o3-mini", "off", {}), + # Plain model: never forwards + ("gpt-4o", "high", {}), + ("gpt-4o", "low", {}), + ("gpt-4o", False, {}), + ] + for model, think, expected_extras in cases: + provider = OpenAIProvider(model=model, client=client, think=think) + client.create_calls.clear() + provider.complete_text(system="s", user="u") + call = client.create_calls[-1] + if "reasoning_effort" in expected_extras: + assert call.get("reasoning_effort") == expected_extras["reasoning_effort"], ( + f"model={model} think={think!r}: want {expected_extras}" + ) + else: + assert "reasoning_effort" not in call, f"model={model} think={think!r}: must not forward effort" diff --git a/tests/test_orchestrator.py b/tests/test_orchestrator.py index a068ace..f4f2458 100644 --- a/tests/test_orchestrator.py +++ b/tests/test_orchestrator.py @@ -103,6 +103,27 @@ def test_build_provider_returns_anthropic_when_selected(monkeypatch): assert 
provider.model.startswith("claude-") +def test_build_provider_returns_openai_when_selected(monkeypatch): + """``provider='openai'`` dispatches to OpenAIProvider with a GPT default.""" + monkeypatch.setenv("OPENAI_API_KEY", "test-key") + settings = _settings(provider="openai", model="m") # non-gpt id + provider = build_provider(settings) + from wikifi.providers.openai_provider import OpenAIProvider + + assert isinstance(provider, OpenAIProvider) + # Falls back to gpt-4o rather than 404'ing on "m". + assert provider.model.startswith("gpt-") + + +def test_build_provider_preserves_explicit_openai_model(monkeypatch): + """A user-supplied gpt/o-series model id is passed through unchanged.""" + monkeypatch.setenv("OPENAI_API_KEY", "test-key") + for model in ("gpt-4o", "o3-mini", "gpt-5"): + settings = _settings(provider="openai", model=model) + provider = build_provider(settings) + assert provider.model == model + + def test_run_walk_persists_cache_for_resumability(mini_target, mock_provider_factory): """A second walk reuses the cache and skips the LLM call for unchanged files.""" settings = _settings() diff --git a/uv.lock b/uv.lock index 8d9f2ed..234d787 100644 --- a/uv.lock +++ b/uv.lock @@ -345,6 +345,25 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/47/4f/4a617ee93d8208d2bcf26b2d8b9402ceaed03e3853c754940e2290fed063/ollama-0.6.1-py3-none-any.whl", hash = "sha256:fc4c984b345735c5486faeee67d8a265214a31cbb828167782dc642ce0a2bf8c", size = 14354, upload-time = "2025-11-13T23:02:16.292Z" }, ] +[[package]] +name = "openai" +version = "2.33.0" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "anyio" }, + { name = "distro" }, + { name = "httpx" }, + { name = "jiter" }, + { name = "pydantic" }, + { name = "sniffio" }, + { name = "tqdm" }, + { name = "typing-extensions" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/f0/ee/d056c82f63c05f06baac0cffb4a90952d8274f90c49dfe244f20497b9bbd/openai-2.33.0.tar.gz", hash = 
"sha256:f850c435e2a4685bba3295bd54912dd26315d9c1b7733068186134d6e0599f9a", size = 693254, upload-time = "2026-04-28T14:04:42.428Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/7d/32/37734d769bc8b42e4938785313cc05aade6cb0fa72479d3220a0d61a4e78/openai-2.33.0-py3-none-any.whl", hash = "sha256:03ac37d70e8c9e3a8124214e3afa785e2cbc12e627fbd98177a086ef2fd87ad5", size = 1162695, upload-time = "2026-04-28T14:04:40.482Z" }, +] + [[package]] name = "packaging" version = "26.2" @@ -593,6 +612,18 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/e9/44/75a9c9421471a6c4805dbf2356f7c181a29c1879239abab1ea2cc8f38b40/sniffio-1.3.1-py3-none-any.whl", hash = "sha256:2f6da418d1f1e0fddd844478f41680e794e6051915791a034ff65e5f100525a2", size = 10235, upload-time = "2024-02-25T23:20:01.196Z" }, ] +[[package]] +name = "tqdm" +version = "4.67.3" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "colorama", marker = "sys_platform == 'win32'" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/09/a9/6ba95a270c6f1fbcd8dac228323f2777d886cb206987444e4bce66338dd4/tqdm-4.67.3.tar.gz", hash = "sha256:7d825f03f89244ef73f1d4ce193cb1774a8179fd96f31d7e1dcde62092b960bb", size = 169598, upload-time = "2026-02-03T17:35:53.048Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/16/e1/3079a9ff9b8e11b846c6ac5c8b5bfb7ff225eee721825310c91b3b50304f/tqdm-4.67.3-py3-none-any.whl", hash = "sha256:ee1e4c0e59148062281c49d80b25b67771a127c85fc9676d3be5f243206826bf", size = 78374, upload-time = "2026-02-03T17:35:50.982Z" }, +] + [[package]] name = "typer" version = "0.24.2" @@ -636,6 +667,7 @@ source = { editable = "." 
} dependencies = [ { name = "anthropic" }, { name = "ollama" }, + { name = "openai" }, { name = "pathspec" }, { name = "pydantic" }, { name = "pydantic-settings" }, @@ -655,6 +687,7 @@ dev = [ requires-dist = [ { name = "anthropic", specifier = ">=0.40" }, { name = "ollama", specifier = ">=0.4.0" }, + { name = "openai", specifier = ">=1.50" }, { name = "pathspec", specifier = ">=0.12" }, { name = "pydantic", specifier = ">=2.6" }, { name = "pydantic-settings", specifier = ">=2.2" }, diff --git a/wikifi/cli.py b/wikifi/cli.py index ab4fc76..8f4f4e0 100644 --- a/wikifi/cli.py +++ b/wikifi/cli.py @@ -94,7 +94,10 @@ def walk( ] = False, provider: Annotated[ str | None, - typer.Option("--provider", help="Override the configured provider for this walk ('ollama' | 'anthropic')."), + typer.Option( + "--provider", + help="Override the configured provider for this walk ('ollama' | 'anthropic' | 'openai').", + ), ] = None, ) -> None: """Walk the target codebase and populate every wiki section.""" diff --git a/wikifi/config.py b/wikifi/config.py index ea093bf..99b7a07 100644 --- a/wikifi/config.py +++ b/wikifi/config.py @@ -3,8 +3,9 @@ Defaults assume a local Ollama server with qwen3.6:27b. Override any field via WIKIFI_* env vars or a .env file in the target project's CWD. -The hosted Anthropic provider is selected via ``WIKIFI_PROVIDER=anthropic`` -(plus ``ANTHROPIC_API_KEY`` from env). 
+Hosted providers are opt-in: +- ``WIKIFI_PROVIDER=anthropic`` (plus ``ANTHROPIC_API_KEY``) +- ``WIKIFI_PROVIDER=openai`` (plus ``OPENAI_API_KEY``) """ from __future__ import annotations @@ -23,7 +24,10 @@ class Settings(BaseSettings): extra="ignore", ) - provider: str = Field(default="ollama", description="LLM provider id; 'ollama' (default) or 'anthropic'") + provider: str = Field( + default="ollama", + description="LLM provider id; 'ollama' (default), 'anthropic', or 'openai'", + ) model: str = Field(default="qwen3.6:27b", description="Model identifier passed to the provider") ollama_host: str = Field(default="http://localhost:11434", description="Ollama HTTP endpoint") request_timeout: float = Field(default=900.0, description="Per-request timeout in seconds") @@ -106,6 +110,21 @@ class Settings(BaseSettings): description="Per-call output token cap for the Anthropic provider.", ) + # ----- OpenAI provider knobs ----- + + openai_api_key: str | None = Field( + default=None, + description=("Explicit OpenAI API key. Falls back to OPENAI_API_KEY in the environment when unset."), + ) + openai_base_url: str | None = Field( + default=None, + description=("Explicit OpenAI base URL (for Azure-OpenAI / proxies). Defaults to api.openai.com."), + ) + openai_max_tokens: int = Field( + default=16_000, + description="Per-call output token cap for the OpenAI provider.", + ) + @lru_cache def get_settings() -> Settings: diff --git a/wikifi/orchestrator.py b/wikifi/orchestrator.py index 80b26ba..da9864b 100644 --- a/wikifi/orchestrator.py +++ b/wikifi/orchestrator.py @@ -159,8 +159,9 @@ def _persist() -> None: def build_provider(settings: Settings) -> LLMProvider: """Construct the configured provider. - Local Ollama is the default. Hosted Anthropic is opt-in via - ``WIKIFI_PROVIDER=anthropic`` and an ``ANTHROPIC_API_KEY``. + Local Ollama is the default. 
Hosted backends are opt-in via + ``WIKIFI_PROVIDER=anthropic`` (plus ``ANTHROPIC_API_KEY``) or + ``WIKIFI_PROVIDER=openai`` (plus ``OPENAI_API_KEY``). """ if settings.provider == "ollama": return OllamaProvider( @@ -183,4 +184,25 @@ def build_provider(settings: Settings) -> LLMProvider: max_tokens=settings.anthropic_max_tokens, think=settings.think, ) - raise ValueError(f"unknown provider {settings.provider!r}; expected 'ollama' or 'anthropic'") + if settings.provider == "openai": + from wikifi.providers.openai_provider import OpenAIProvider + + # Same default-swap guard as the Anthropic path: a user opting + # in to OpenAI shouldn't 404 because the Ollama model id is + # still in their config. + model = settings.model if _looks_like_openai_model(settings.model) else "gpt-4o" + return OpenAIProvider( + model=model, + api_key=settings.openai_api_key, + base_url=settings.openai_base_url, + timeout=settings.request_timeout, + max_tokens=settings.openai_max_tokens, + think=settings.think, + ) + raise ValueError(f"unknown provider {settings.provider!r}; expected 'ollama', 'anthropic', or 'openai'") + + +def _looks_like_openai_model(model: str) -> bool: + """Heuristic — covers gpt-*, o1/o3/o4 reasoning, and ft: variants.""" + lowered = model.lower() + return lowered.startswith(("gpt-", "o1", "o3", "o4", "ft:")) diff --git a/wikifi/providers/openai_provider.py b/wikifi/providers/openai_provider.py new file mode 100644 index 0000000..de2de05 --- /dev/null +++ b/wikifi/providers/openai_provider.py @@ -0,0 +1,241 @@ +"""OpenAI-backed implementation of :class:`LLMProvider`. + +The third provider, alongside :mod:`wikifi.providers.ollama_provider` +(local default) and :mod:`wikifi.providers.anthropic_provider` (hosted +Claude). Selected via ``WIKIFI_PROVIDER=openai`` plus an +``OPENAI_API_KEY``. + +Three implementation notes worth flagging: + +1. 
**Structured output via ``chat.completions.parse``.** The Pydantic + schema is converted to a JSON Schema by the SDK, which parses the + response into a validated instance. This is OpenAI's GA path for + schema-constrained decoding; we don't hand-roll function calls. +2. **Prompt caching is automatic.** Unlike Anthropic, OpenAI does not + require a ``cache_control`` marker — the API caches identical + prefixes (≥ 1024 tokens) for ~5–10 minutes automatically. We keep + the system prompt at message position 0 so wikifi's repeated multi-KB + extraction prompt is what gets cached. +3. **Reasoning effort.** Reasoning-capable models (o1, o3, o4, gpt-5 + families) accept a ``reasoning_effort`` parameter that mirrors + wikifi's ``think`` knob. We forward the knob only when the model id + looks reasoning-capable and a level ("low", "medium", "high") is + set; on plain models it is stripped from the request so a strictly + validating API or SDK cannot turn it into a 400. +""" + +from __future__ import annotations + +import logging +import os +import re +from typing import Any, TypeVar + +from pydantic import BaseModel + +from wikifi.providers.base import ChatMessage + +try: + import openai +except ImportError as exc: # pragma: no cover - import error path + raise ImportError( + "wikifi.providers.openai_provider requires the `openai` package. " + "Install via `uv add openai` or include the [hosted] extras." + ) from exc + + +T = TypeVar("T", bound=BaseModel) +log = logging.getLogger("wikifi.providers.openai") + + +# Default model — gpt-4o is a stable, broadly available model with +# structured-output support. Override per-walk via ``WIKIFI_MODEL`` +# env or ``.wikifi/config.toml`` (e.g. set to a reasoning model like +# ``o3-mini`` or ``gpt-5`` to opt into the reasoning_effort path). +DEFAULT_MODEL = "gpt-4o" + +# Default per-call output token cap. 
wikifi's structured findings are +# small relative to the input; 16K leaves headroom for any of the +# section schemas without crossing the SDK's HTTP timeout guard. +DEFAULT_MAX_TOKENS = 16_000 + + +# Names that match a reasoning-capable model family. We inspect the +# model id by prefix because OpenAI's lineup is too volatile to +# enumerate exactly. Anything matching gets ``reasoning_effort`` +# forwarded; anything else has it stripped from the request. +_REASONING_MODEL_RE = re.compile(r"^(o\d|gpt-5)", re.IGNORECASE) + + +ThinkLevel = bool | str | None + + +class OpenAIProvider: + """Hosted-OpenAI implementation of the wikifi provider protocol.""" + + name = "openai" + + def __init__( + self, + *, + model: str = DEFAULT_MODEL, + api_key: str | None = None, + base_url: str | None = None, + timeout: float = 900.0, + max_tokens: int = DEFAULT_MAX_TOKENS, + think: ThinkLevel = "high", + client: Any | None = None, + ) -> None: + self.model = model + self.timeout = timeout + self.max_tokens = max_tokens + self.think = think + if client is not None: + # Tests pass an injected mock; preserve the duck-typed surface. + self._client = client + else: + api_key = api_key or os.environ.get("OPENAI_API_KEY") + self._client = openai.OpenAI( + api_key=api_key, + base_url=base_url, + timeout=timeout, + ) + + # ------------------------------------------------------------------ + # Provider protocol + # ------------------------------------------------------------------ + + def complete_json(self, *, system: str, user: str, schema: type[T]) -> T: + """Return a ``schema``-validated Pydantic instance. + + Uses ``chat.completions.parse`` so the SDK runs JSON-Schema- + constrained decoding and returns the parsed Pydantic model + directly. The system prompt sits at position 0 so OpenAI's + automatic prefix cache catches the repeated multi-KB extraction + prompt across per-file calls. 
+ """ + try: + response = self._client.chat.completions.parse( + model=self.model, + messages=[ + {"role": "system", "content": system}, + {"role": "user", "content": user}, + ], + response_format=schema, + **self._token_kwargs(), + **self._reasoning_kwargs(), + ) + except openai.APIError as exc: + raise RuntimeError(_format_api_error(exc)) from exc + + parsed = _first_parsed(response) + if parsed is None: + # Defensive fallback: if the SDK couldn't parse (refusal, + # truncation), schema-validate the raw JSON text. Keeps the + # protocol's "raise on failure" contract intact rather than + # returning a None. + text = _first_text(response) + try: + return schema.model_validate_json(text) + except Exception as exc: # pragma: no cover - defensive path + raise RuntimeError(f"openai provider: empty parsed and validate fallback failed: {exc}") from exc + return parsed + + def complete_text(self, *, system: str, user: str) -> str: + """Return the model's free-text response.""" + try: + response = self._client.chat.completions.create( + model=self.model, + messages=[ + {"role": "system", "content": system}, + {"role": "user", "content": user}, + ], + **self._token_kwargs(), + **self._reasoning_kwargs(), + ) + except openai.APIError as exc: + raise RuntimeError(_format_api_error(exc)) from exc + return _first_text(response) or "" + + def chat(self, *, system: str, messages: list[ChatMessage]) -> str: + """Multi-turn chat. 
The system prompt sits at position 0; the + running message history follows it.""" + try: + response = self._client.chat.completions.create( + model=self.model, + messages=[{"role": "system", "content": system}, *messages], + **self._token_kwargs(), + **self._reasoning_kwargs(), + ) + except openai.APIError as exc: + raise RuntimeError(_format_api_error(exc)) from exc + return _first_text(response) or "" + + # ------------------------------------------------------------------ + # Helpers + # ------------------------------------------------------------------ + + def _is_reasoning_model(self) -> bool: + return bool(_REASONING_MODEL_RE.match(self.model)) + + def _reasoning_kwargs(self) -> dict[str, Any]: + """Forward the ``think`` knob as ``reasoning_effort`` only on + reasoning-capable models. Plain models silently ignore it but + we still strip it so a future strict validation can't 400 us. + """ + if not self._is_reasoning_model(): + return {} + if self.think is False or self.think in {"off", "none"}: + return {} + if isinstance(self.think, str) and self.think.lower() in {"low", "medium", "high"}: + return {"reasoning_effort": self.think.lower()} + # ``True`` / unrecognized string → adopt SDK default by omitting. + return {} + + def _token_kwargs(self) -> dict[str, Any]: + """Output cap. Reasoning models use ``max_completion_tokens``; + plain chat models use ``max_tokens``. We send the appropriate + one so neither path 400s on an unrecognized parameter.""" + key = "max_completion_tokens" if self._is_reasoning_model() else "max_tokens" + return {key: self.max_tokens} + + +def _first_parsed(response: Any) -> Any: + """Pull the parsed Pydantic instance out of a parse() response. + + Tolerates the SDK shape (``response.choices[0].message.parsed``) + and a duck-typed mock (a list of dicts). 
+ """ + choices = getattr(response, "choices", None) or (response.get("choices") if isinstance(response, dict) else None) + if not choices: + return None + first = choices[0] + message = getattr(first, "message", None) or (first.get("message") if isinstance(first, dict) else None) + if message is None: + return None + parsed = getattr(message, "parsed", None) or (message.get("parsed") if isinstance(message, dict) else None) + return parsed + + +def _first_text(response: Any) -> str: + """Pull the first text content out of a chat-completion response.""" + choices = getattr(response, "choices", None) or (response.get("choices") if isinstance(response, dict) else None) + if not choices: + return "" + first = choices[0] + message = getattr(first, "message", None) or (first.get("message") if isinstance(first, dict) else None) + if message is None: + return "" + content = getattr(message, "content", None) + if content is None and isinstance(message, dict): + content = message.get("content") + return content or "" + + +def _format_api_error(exc: Exception) -> str: + """Render an APIError with the request id, when present, for diagnostics.""" + request_id = getattr(exc, "request_id", None) + msg = getattr(exc, "message", None) or str(exc) + if request_id: + return f"openai provider failed ({request_id}): {msg}" + return f"openai provider failed: {msg}" From ddd193cdae20a3e9d9e0907d3ea30684204068c0 Mon Sep 17 00:00:00 2001 From: Dallas Pool Date: Fri, 1 May 2026 21:19:30 -0500 Subject: [PATCH 3/9] e2e run --- .wikifi/.cache/aggregation.json | 2656 +++++++++++++++++ .wikifi/.cache/extraction.json | 4568 ++++++++++++++++++++++++++++++ .wikifi/capabilities.md | 131 +- .wikifi/config.toml | 4 +- .wikifi/cross_cutting.md | 133 +- .wikifi/diagrams.md | 416 ++- .wikifi/domains.md | 67 +- .wikifi/entities.md | 176 +- .wikifi/external_dependencies.md | 52 +- .wikifi/hard_specifications.md | 88 +- .wikifi/integrations.md | 100 +- .wikifi/intent.md | 87 +- .wikifi/personas.md | 204 +- 
.wikifi/user_stories.md | 335 ++- 14 files changed, 8528 insertions(+), 489 deletions(-) create mode 100644 .wikifi/.cache/aggregation.json create mode 100644 .wikifi/.cache/extraction.json diff --git a/.wikifi/.cache/aggregation.json b/.wikifi/.cache/aggregation.json new file mode 100644 index 0000000..e678b78 --- /dev/null +++ b/.wikifi/.cache/aggregation.json @@ -0,0 +1,2656 @@ +{ + "version": 1, + "saved_at": "2026-05-02T02:17:19.876759+00:00", + "entries": { + "domains": { + "notes_hash": "4040897a09cc", + "body": "## Core Domain\n\nThe system's core domain is **codebase knowledge extraction**: ingesting an existing source base, classifying its contents, deriving domain findings from individual files, and synthesising those findings into a structured, technology-agnostic wiki. The primary consumers are migration teams who need to understand business intent, domain structure, and operational behaviour before re-implementing or replacing a legacy system.\n\n## Subdomains\n\n### Repository Introspection\nThis subdomain concerns discovering and classifying the files that make up a target codebase. Its central responsibility is distinguishing production source that encodes business intent from infrastructure, tooling, and other artefacts that do not. Tech-agnosticism is a first-class constraint here: the classification logic must not rely on recognising any specific language, framework, or runtime.\n\n### Per-File Knowledge Extraction\nOnce relevant files are identified, each is analysed independently to surface domain findings. This subdomain covers the full extraction loop — examining file content, applying domain heuristics, and producing structured evidence — and forms the first phase of wiki generation (primary sections).\n\n### Section Synthesis and Aggregation\nThe second phase of wiki generation operates over the evidence produced by per-file extraction. 
It aggregates findings across files into coherent wiki sections, derives higher-level content that cannot be inferred from any single file, and enforces the dependency ordering between primary (evidence-driven) and derivative (aggregated) sections. This ordering is a structural design constraint, not merely a runtime convention.\n\n### Wiki Authoring and Organisation\nA secondary domain governs how extracted knowledge is structured and stored. It defines the taxonomy of sections, distinguishes primary from derivative content, and produces output that a migration team can navigate and consume independently of the source codebase.\n\n### Interactive Knowledge Retrieval\nA supporting subdomain exposes the generated wiki to conversational or query-driven access, allowing stakeholders to interrogate extracted knowledge without directly inspecting raw wiki files.\n\n## Cross-Cutting Constraint: Tech-Agnosticism\nTech-agnosticism spans every subdomain. All analysis, extraction, and synthesis must produce domain-level descriptions that are free of references to specific languages, frameworks, or libraries. 
This constraint is enforced at both the classification stage (repository introspection) and the output stage (section content).\n\n## Subdomain Relationships\n\n| Subdomain | Role | Depends On |\n|---|---|---|\n| Repository Introspection | Identifies source worth analysing | — |\n| Per-File Knowledge Extraction | Produces primary section evidence | Introspection |\n| Section Synthesis & Aggregation | Produces derivative sections | Per-File Extraction |\n| Wiki Authoring & Organisation | Structures and stores the wiki | Synthesis |\n| Interactive Knowledge Retrieval | Queries the completed wiki | Authoring |\n", + "claims": [ + { + "text": "The core domain is codebase knowledge extraction: ingesting source files, classifying them, deriving domain findings, and synthesising those findings into a structured wiki.", + "sources": [ + { + "file": "README.md", + "lines": [ + 28, + 52 + ], + "fingerprint": "996c401d036d" + }, + { + "file": "VISION.md", + "lines": [ + 3, + 20 + ], + "fingerprint": "10651b456a64" + }, + { + "file": "wikifi/cli.py", + "lines": [ + 1, + 8 + ], + "fingerprint": "f326383c7da1" + } + ] + }, + { + "text": "The primary consumers are migration teams who need to understand business intent, domain structure, and operational behaviour of a legacy system.", + "sources": [ + { + "file": "VISION.md", + "lines": [ + 3, + 20 + ], + "fingerprint": "10651b456a64" + } + ] + }, + { + "text": "Repository introspection is responsible for distinguishing production source that encodes business intent from infrastructure, tooling, and other artefacts.", + "sources": [ + { + "file": "wikifi/cli.py", + "lines": [ + 1, + 8 + ], + "fingerprint": "f326383c7da1" + }, + { + "file": "wikifi/introspection.py", + "lines": [ + 19, + 44 + ], + "fingerprint": "59cd5940f72e" + } + ] + }, + { + "text": "Tech-agnosticism is a first-class constraint on the introspection subdomain: classification must not rely on recognising any specific language or framework.", + "sources": [ + { + 
"file": "wikifi/introspection.py", + "lines": [ + 19, + 44 + ], + "fingerprint": "59cd5940f72e" + } + ] + }, + { + "text": "Wiki generation is split into two subdomains: per-file evidence extraction (primary sections) and aggregate synthesis (derivative sections), with a structurally enforced dependency ordering between them.", + "sources": [ + { + "file": "README.md", + "lines": [ + 28, + 52 + ], + "fingerprint": "996c401d036d" + }, + { + "file": "wikifi/sections.py", + "lines": [ + 1, + 19 + ], + "fingerprint": "f743972a8fce" + } + ] + }, + { + "text": "Wiki authoring and organisation is a secondary domain governing how extracted knowledge is structured and stored for consumption.", + "sources": [ + { + "file": "VISION.md", + "lines": [ + 3, + 20 + ], + "fingerprint": "10651b456a64" + }, + { + "file": "wikifi/sections.py", + "lines": [ + 1, + 19 + ], + "fingerprint": "f743972a8fce" + } + ] + }, + { + "text": "Interactive knowledge retrieval is a supporting subdomain that exposes the generated wiki to query-driven access.", + "sources": [ + { + "file": "wikifi/cli.py", + "lines": [ + 1, + 8 + ], + "fingerprint": "f326383c7da1" + } + ] + } + ], + "contradictions": [] + }, + "intent": { + "notes_hash": "5f1844b4a404", + "body": "wikifi exists because the intent embedded in a legacy system is typically invisible — locked inside years of implementation choices, technology-specific conventions, and accumulated structure that makes it difficult to separate *what the system does and why* from *how it currently does it*. Migration teams tasked with replacing or re-implementing such a system need the former without the latter.\n\n### The Core Problem\n\nWhen a team inherits a large legacy codebase and must produce a new implementation, they face a knowledge-extraction problem. The source describes a particular way of solving a set of problems, but rarely describes the problems themselves at a level that is portable to a new context. 
Reading the source directly tends to reproduce the same structure and constraints in the new system — recreating legacy decisions rather than the underlying intent.\n\nwikifi addresses this by walking a repository and producing a structured, technology-agnostic wiki that surfaces:\n\n- **Domain entities and capabilities** — what the system models and what it can do\n- **API contracts and integration touchpoints** — what it exposes and to whom\n- **Cross-cutting concerns** — considerations that span the system as a whole\n- **Personas, user stories, and diagrams** — who uses the system, what they need, and how flows connect\n\nThe goal is to make legacy intent explicit, complete, and portable so a fresh implementation can retain full functional value without inheriting structural decisions.\n\n### Primary Audience\n\nThe immediate audience is migration teams — architects and developers who need to understand a system's domain well enough to re-implement it rather than maintain it. A secondary audience includes anyone who must understand what a system does without reading its source directly, including those who need to interrogate the resulting wiki conversationally.\n\n### What the System Is Not\n\nwikifi is explicitly a feature-extraction tool, not a transposition tool. It surfaces what a legacy system does and leaves all decisions about target architecture, structure, and approach entirely to the migration team. The output prescribes nothing about how the new system should be built.\n\n### Shaping Constraints\n\nSeveral constraints are built into the design from the outset:\n\n| Constraint | Rationale |\n|---|---|\n| **Technology agnosticism** | Output must be expressed in domain terms, never in terms of the implementation technology found in the source, so the wiki does not embed the very assumptions it is meant to dissolve. |\n| **Quality over speed** | Accuracy and completeness of the generated wiki are prioritised over processing throughput. 
|\n| **Arbitrary scale** | The system must handle repositories of any size — including legacy monorepos with tens of thousands of files — through caching and chunking strategies that make repeated and interrupted runs cheap. |\n| **Full traceability** | Every assertion in the generated wiki must trace back to specific source files and locations so architects can verify any claim against the original codebase. |\n| **Honest disagreement** | Where source files contain conflicting signals, the system surfaces those contradictions explicitly rather than silently resolving them, preserving the full picture for the migration team. |", + "claims": [ + { + "text": "wikifi exists because the intent of legacy systems is locked inside their implementation choices, making it difficult to separate what the system does from how it does it.", + "sources": [ + { + "file": "VISION.md", + "lines": [ + 3, + 9 + ], + "fingerprint": "10651b456a64" + } + ] + }, + { + "text": "Migration teams need a description of what a system does and why, decoupled from how it currently does it, so they can re-implement on a fresh stack without recreating legacy structure.", + "sources": [ + { + "file": "CLAUDE.md", + "lines": [ + 73, + 75 + ], + "fingerprint": "ac9698d91de6" + }, + { + "file": "README.md", + "lines": [ + 3, + 3 + ], + "fingerprint": "996c401d036d" + }, + { + "file": "VISION.md", + "lines": [ + 3, + 9 + ], + "fingerprint": "10651b456a64" + }, + { + "file": "wikifi/cli.py", + "lines": [ + 1, + 8 + ], + "fingerprint": "f326383c7da1" + } + ] + }, + { + "text": "wikifi walks a repository and produces a structured, technology-agnostic wiki surfacing features, domains, entities, capabilities, and delivered value.", + "sources": [ + { + "file": ".env.example", + "lines": [ + 1, + 2 + ], + "fingerprint": "2e493dbd2d87" + }, + { + "file": "README.md", + "lines": [ + 3, + 3 + ], + "fingerprint": "996c401d036d" + }, + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 1, + 6 + ], + "fingerprint": 
"3b93f710ebca" + }, + { + "file": "wikifi/cli.py", + "lines": [ + 1, + 8 + ], + "fingerprint": "f326383c7da1" + }, + { + "file": "wikifi/config.py", + "lines": [ + 1, + 8 + ], + "fingerprint": "8cd2ca53c957" + } + ] + }, + { + "text": "The wiki includes domain entities and capabilities, API contracts and integration touchpoints, and cross-cutting concerns extracted from source files.", + "sources": [ + { + "file": "README.md", + "lines": [ + 3, + 3 + ], + "fingerprint": "996c401d036d" + }, + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 1, + 6 + ], + "fingerprint": "3b93f710ebca" + }, + { + "file": "wikifi/specialized/__init__.py", + "lines": [ + 1, + 13 + ], + "fingerprint": "84d6c382c745" + }, + { + "file": "wikifi/specialized/openapi.py", + "lines": [ + 1, + 11 + ], + "fingerprint": "ae97781309c4" + }, + { + "file": "wikifi/specialized/protobuf.py", + "lines": [ + 1, + 8 + ], + "fingerprint": "e20d5913745a" + } + ] + }, + { + "text": "Certain wiki sections — personas, user stories, and diagrams — are synthesized from aggregated primary evidence because they cannot be inferred from individual files.", + "sources": [ + { + "file": "wikifi/deriver.py", + "lines": [ + 1, + 18 + ], + "fingerprint": "0b7f4f5abb09" + }, + { + "file": "wikifi/sections.py", + "lines": [ + 1, + 19 + ], + "fingerprint": "f743972a8fce" + } + ] + }, + { + "text": "The goal is to make legacy intent explicit, complete, and portable so a fresh implementation retains full functional value without inheriting structural decisions.", + "sources": [ + { + "file": "README.md", + "lines": [ + 3, + 3 + ], + "fingerprint": "996c401d036d" + }, + { + "file": "VISION.md", + "lines": [ + 3, + 9 + ], + "fingerprint": "10651b456a64" + } + ] + }, + { + "text": "The primary audience is migration teams who need to understand a system's domain well enough to re-implement it.", + "sources": [ + { + "file": "CLAUDE.md", + "lines": [ + 73, + 75 + ], + "fingerprint": "ac9698d91de6" + }, + { + "file": "VISION.md", 
+ "lines": [ + 3, + 9 + ], + "fingerprint": "10651b456a64" + }, + { + "file": "VISION.md", + "lines": [ + 86, + 89 + ], + "fingerprint": "10651b456a64" + }, + { + "file": "wikifi/critic.py", + "lines": [ + 1, + 15 + ], + "fingerprint": "502af9aee392" + } + ] + }, + { + "text": "Users can also interrogate the generated wiki conversationally, with every answer grounded in the extracted sections rather than invented detail.", + "sources": [ + { + "file": "wikifi/chat.py", + "lines": [ + 1, + 32 + ], + "fingerprint": "0333e700a046" + } + ] + }, + { + "text": "wikifi is explicitly a feature-extraction tool, not a transposition tool — it surfaces what the legacy system does and leaves all target architecture decisions to the migration team.", + "sources": [ + { + "file": "CLAUDE.md", + "lines": [ + 73, + 75 + ], + "fingerprint": "ac9698d91de6" + }, + { + "file": "VISION.md", + "lines": [ + 86, + 89 + ], + "fingerprint": "10651b456a64" + } + ] + }, + { + "text": "All output is expressed in domain terms, never in terms of the implementation technology found in the source.", + "sources": [ + { + "file": "CLAUDE.md", + "lines": [ + 73, + 75 + ], + "fingerprint": "ac9698d91de6" + }, + { + "file": "README.md", + "lines": [ + 3, + 3 + ], + "fingerprint": "996c401d036d" + }, + { + "file": "VISION.md", + "lines": [ + 86, + 89 + ], + "fingerprint": "10651b456a64" + }, + { + "file": "wikifi/cli.py", + "lines": [ + 1, + 8 + ], + "fingerprint": "f326383c7da1" + } + ] + }, + { + "text": "The system prioritises documentation quality over processing speed.", + "sources": [ + { + "file": ".env.example", + "lines": [ + 1, + 2 + ], + "fingerprint": "2e493dbd2d87" + }, + { + "file": "wikifi/config.py", + "lines": [ + 1, + 8 + ], + "fingerprint": "8cd2ca53c957" + } + ] + }, + { + "text": "The system is designed to handle repositories of arbitrary size, including legacy monorepos with tens of thousands of files, through caching and chunking.", + "sources": [ + { + "file": "wikifi/cache.py", + 
"lines": [ + 1, + 21 + ], + "fingerprint": "1ba541fe863d" + }, + { + "file": "wikifi/config.py", + "lines": [ + 1, + 8 + ], + "fingerprint": "8cd2ca53c957" + }, + { + "file": "wikifi/extractor.py", + "lines": [ + 1, + 37 + ], + "fingerprint": "b0e939259557" + } + ] + }, + { + "text": "Every assertion in the generated wiki must trace back to specific source files and locations so architects can verify any claim against the original codebase.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 1, + 15 + ], + "fingerprint": "c5f76cb7c4a3" + }, + { + "file": "wikifi/evidence.py", + "lines": [ + 1, + 18 + ], + "fingerprint": "dddfe1a01c85" + } + ] + }, + { + "text": "Where source files contain conflicting signals, the system surfaces those contradictions explicitly rather than silently resolving them.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 1, + 15 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + } + ], + "contradictions": [] + }, + "capabilities": { + "notes_hash": "4a4c91043bca", + "body": "wikifi analyzes any target codebase and produces a structured, technology-agnostic wiki that captures domain knowledge, system intent, capabilities, external dependencies, integrations, cross-cutting concerns, core entities, and hard specifications — expressed entirely in domain terms rather than in the language of a specific technology stack.\n\n## Workspace Initialization\n\nBefore analysis begins, the system bootstraps a wiki workspace inside the target project in an idempotent manner, creating the required directory structure, a configuration file, version-control ignore rules, and one placeholder document per defined section. Repeat invocations leave already-existing artifacts untouched.\n\n## Codebase Analysis Pipeline\n\nThe core pipeline runs in four ordered stages:\n\n1. 
**Repository introspection** — The system compresses the repository's directory layout and reads key manifest files, then uses this compact view to classify every path as either worth walking (production source, business logic, integrations, domain models) or worth skipping (vendored dependencies, build output, tests, CI/CD). The classification is returned as a structured, diffable result.\n\n2. **Per-file extraction** — Every in-scope file is routed through one of three extraction paths:\n - *Cache replay* — if a file's content is unchanged since the last run, previously stored findings are reused without any further processing.\n - *Deterministic schema parsing* — files recognised as structured schema artifacts (SQL DDL, database migrations, API contract specs, interface definition files, and graph schema files) are processed by purpose-built parsers that produce findings about entities, relationships, operations, and constraints without invoking an AI model.\n - *AI-assisted extraction* — all remaining files pass through an AI extraction pass; large files are recursively split into overlapping chunks so no content is missed regardless of size.\n\n Every finding carries a source citation — the originating file path, an inclusive line range, and a content fingerprint — enabling full traceability back to the codebase.\n\n3. **Cross-file context enrichment** — In parallel with extraction, the system builds an import and reference graph across the entire in-scope file set. Each file's neighborhood (the files it depends on and the files that depend on it) is injected into its extraction prompt, enabling findings to describe inter-file flows rather than treating each file in isolation.\n\n4. **Section aggregation** — Per-file findings are grouped by their target wiki section and synthesised into readable markdown bodies. Every asserted claim is backed by numbered citations pointing to the originating files and line ranges. 
Where two or more files make incompatible assertions about the same topic, the system surfaces the conflict explicitly in a dedicated *Conflicts in source* block rather than silently resolving it — a deliberate feature for legacy codebases where disagreements encode high-priority migration signals.\n\n## Wiki Structure\n\nThe generated wiki is organised into **eleven sections**: eight primary sections populated directly from per-file evidence, and three derivative sections synthesised from the completed primaries:\n\n| Section type | Sections |\n|---|---|\n| Primary (8) | Business domains, system intent, capabilities, external dependencies, integrations, cross-cutting concerns, core entities, hard specifications |\n| Derivative (3) | User personas, Gherkin-style user stories, Mermaid architectural diagrams |\n\nDerivative sections are only generated after the primaries they depend on are finalised. If upstream primary sections are empty or missing, the system writes a placeholder that declares the gap rather than fabricating content.\n\n## Quality Assurance\n\nAn optional critic-and-reviser pass evaluates any synthesised section against its brief and the upstream evidence it drew from, producing a structured quality score (0–10) with itemised unsupported claims, gaps, and suggested edits. When a section scores below a configurable threshold, a revision is automatically invoked; the revision is accepted only if it matches or improves the original score, preventing regressions. This loop is particularly valuable for derivative sections — personas and user stories — where single-shot synthesis is most prone to introducing unsupported assertions.\n\n## Incremental and Resumable Walks\n\nThe pipeline uses a two-scope content-addressed cache: per-file extraction results are keyed to a combination of file path and content fingerprint, and per-section aggregation results are keyed to a digest of the contributing notes payload. 
Only changed files and affected sections are reprocessed on incremental runs. Because results are persisted after every completed file, an interrupted walk resumes from the last unprocessed file rather than restarting from scratch. The cache can also be fully invalidated to force a clean re-walk.\n\n## Coverage and Quality Reporting\n\nA report command produces a human-readable markdown table summarising every wiki section by contributing file count, finding count, body size, optional critic-derived quality score, and the highest-priority content gap identified by the critic. Coverage statistics also surface *dead zones* — files that were processed but produced no findings — so teams can identify blind spots in the analysis.\n\n## Interactive Knowledge Querying\n\nOnce a wiki has been generated, users can open an interactive conversational session grounded in all populated sections. The session supports multi-turn exchanges, conversation history reset, and introspection of which sections are currently loaded as context. Only meaningfully populated sections are included, ensuring the assistant is not grounded in placeholder content.\n\n## Graceful Degradation\n\nWhen AI synthesis fails for a section, the system falls back to emitting the raw collected notes directly in the section body, preserving information at the cost of polish and surfacing the error inline. 
Similarly, unparseable schema files produce an advisory finding directing reviewers to inspect the file manually rather than silently failing.", + "claims": [ + { + "text": "wikifi analyzes a target codebase and produces a technology-agnostic wiki covering DDD domains, intent, capabilities, external dependencies, integrations, cross-cutting concerns, core entities, and hard specifications.", + "sources": [ + { + "file": "VISION.md", + "lines": [ + 6, + 8 + ], + "fingerprint": "10651b456a64" + }, + { + "file": "wikifi/sections.py", + "lines": [ + 44, + 142 + ], + "fingerprint": "f743972a8fce" + } + ] + }, + { + "text": "The system bootstraps a wiki workspace inside the target project in an idempotent manner, creating directory structure, configuration, version-control ignore rules, and one placeholder document per section.", + "sources": [ + { + "file": "README.md", + "lines": [ + 14, + 24 + ], + "fingerprint": "996c401d036d" + }, + { + "file": "wikifi/orchestrator.py", + "lines": [ + 62, + 76 + ], + "fingerprint": "6ed682a87356" + }, + { + "file": "wikifi/wiki.py", + "lines": [ + 64, + 86 + ], + "fingerprint": "9230b7444e0d" + } + ] + }, + { + "text": "Stage 1 introspects the repository by compressing its directory layout and reading manifest files, then classifies paths as worth walking or worth skipping.", + "sources": [ + { + "file": "wikifi/introspection.py", + "lines": [ + 28, + 44 + ], + "fingerprint": "59cd5940f72e" + }, + { + "file": "wikifi/introspection.py", + "lines": [ + 61, + 70 + ], + "fingerprint": "59cd5940f72e" + }, + { + "file": "wikifi/walker.py", + "lines": [ + 92, + 186 + ], + "fingerprint": "a29bd1ad8bdb" + } + ] + }, + { + "text": "Per-file extraction routes each file through one of three paths: cache replay, deterministic schema parsing, or AI-assisted extraction with chunking for large files.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": [ + 140, + 200 + ], + "fingerprint": "b0e939259557" + }, + { + "file": 
"wikifi/cache.py", + "lines": [ + 5, + 8 + ], + "fingerprint": "1ba541fe863d" + } + ] + }, + { + "text": "Structured schema artifacts — SQL DDL, migrations, API contract specs, interface definition files, and graph schema files — are processed by purpose-built deterministic parsers without invoking an AI model.", + "sources": [ + { + "file": "README.md", + "lines": [ + 34, + 36 + ], + "fingerprint": "996c401d036d" + }, + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 116, + 149 + ], + "fingerprint": "3b93f710ebca" + }, + { + "file": "wikifi/config.py", + "lines": [ + 75, + 81 + ], + "fingerprint": "8cd2ca53c957" + }, + { + "file": "wikifi/extractor.py", + "lines": [ + 140, + 200 + ], + "fingerprint": "b0e939259557" + }, + { + "file": "wikifi/repograph.py", + "lines": [ + 41, + 52 + ], + "fingerprint": "3d8bbdb10112" + }, + { + "file": "wikifi/specialized/__init__.py", + "lines": [ + 46, + 57 + ], + "fingerprint": "84d6c382c745" + }, + { + "file": "wikifi/specialized/sql.py", + "lines": [ + 56, + 62 + ], + "fingerprint": "1ef5e77c4038" + } + ] + }, + { + "text": "Every finding carries a source citation including file path, line range, and content fingerprint.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": [ + 251, + 270 + ], + "fingerprint": "b0e939259557" + }, + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 40, + 66 + ], + "fingerprint": "3b93f710ebca" + }, + { + "file": "wikifi/aggregator.py", + "lines": [ + 1, + 15 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "text": "The system builds an import and reference graph across the in-scope file set and injects each file's neighborhood into its extraction prompt.", + "sources": [ + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 90, + 114 + ], + "fingerprint": "3b93f710ebca" + }, + { + "file": "wikifi/config.py", + "lines": [ + 69, + 74 + ], + "fingerprint": "8cd2ca53c957" + }, + { + "file": "wikifi/extractor.py", + "lines": [ + 241, + 246 + ], + "fingerprint": "b0e939259557" + }, 
+ { + "file": "wikifi/repograph.py", + "lines": [ + 155, + 210 + ], + "fingerprint": "3d8bbdb10112" + } + ] + }, + { + "text": "Per-file findings are synthesised into readable markdown section bodies with every claim backed by numbered citations.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 1, + 15 + ], + "fingerprint": "c5f76cb7c4a3" + }, + { + "file": "wikifi/evidence.py", + "lines": [ + 88, + 121 + ], + "fingerprint": "dddfe1a01c85" + } + ] + }, + { + "text": "Where two or more files make incompatible assertions about the same topic, the system surfaces the conflict explicitly in a 'Conflicts in source' block rather than silently resolving it.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 9, + 14 + ], + "fingerprint": "c5f76cb7c4a3" + }, + { + "file": "wikifi/evidence.py", + "lines": [ + 13, + 17 + ], + "fingerprint": "dddfe1a01c85" + }, + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 40, + 66 + ], + "fingerprint": "3b93f710ebca" + } + ] + }, + { + "text": "The wiki is organised into eight primary sections and three derivative sections (user personas, Gherkin-style user stories, and Mermaid diagrams).", + "sources": [ + { + "file": "wikifi/sections.py", + "lines": [ + 44, + 142 + ], + "fingerprint": "f743972a8fce" + }, + { + "file": "VISION.md", + "lines": [ + 53, + 63 + ], + "fingerprint": "10651b456a64" + }, + { + "file": "wikifi/deriver.py", + "lines": [ + 73, + 107 + ], + "fingerprint": "0b7f4f5abb09" + } + ] + }, + { + "text": "Derivative sections are only generated after the primaries they depend on are finalised; if upstream sections are empty, a placeholder is written rather than fabricating content.", + "sources": [ + { + "file": "wikifi/deriver.py", + "lines": [ + 73, + 107 + ], + "fingerprint": "0b7f4f5abb09" + }, + { + "file": "wikifi/sections.py", + "lines": [ + 44, + 142 + ], + "fingerprint": "f743972a8fce" + } + ] + }, + { + "text": "An optional critic-and-reviser pass evaluates sections against 
a quality rubric, scoring them 0–10, and invokes a revision only when the score is below a configurable threshold, accepting the revision only if it matches or improves the original score.", + "sources": [ + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 151, + 164 + ], + "fingerprint": "3b93f710ebca" + }, + { + "file": "wikifi/config.py", + "lines": [ + 83, + 94 + ], + "fingerprint": "8cd2ca53c957" + }, + { + "file": "wikifi/critic.py", + "lines": [ + 100, + 153 + ], + "fingerprint": "502af9aee392" + }, + { + "file": "wikifi/deriver.py", + "lines": [ + 90, + 103 + ], + "fingerprint": "0b7f4f5abb09" + } + ] + }, + { + "text": "The pipeline uses a two-scope content-addressed cache — per-file and per-section — so only changed files and affected sections are reprocessed on incremental runs.", + "sources": [ + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 67, + 88 + ], + "fingerprint": "3b93f710ebca" + }, + { + "file": "wikifi/cache.py", + "lines": [ + 5, + 8 + ], + "fingerprint": "1ba541fe863d" + }, + { + "file": "wikifi/cache.py", + "lines": [ + 9, + 12 + ], + "fingerprint": "1ba541fe863d" + }, + { + "file": "wikifi/config.py", + "lines": [ + 63, + 68 + ], + "fingerprint": "8cd2ca53c957" + } + ] + }, + { + "text": "Interrupted walks are resumable because per-file results are persisted incrementally; the cache can also be fully invalidated to force a clean re-walk.", + "sources": [ + { + "file": "wikifi/cache.py", + "lines": [ + 14, + 18 + ], + "fingerprint": "1ba541fe863d" + }, + { + "file": "wikifi/cache.py", + "lines": [ + 105, + 113 + ], + "fingerprint": "1ba541fe863d" + }, + { + "file": "README.md", + "lines": [ + 16, + 20 + ], + "fingerprint": "996c401d036d" + }, + { + "file": "wikifi/cli.py", + "lines": [ + 88, + 112 + ], + "fingerprint": "f326383c7da1" + } + ] + }, + { + "text": "A report command produces a markdown table summarising every wiki section by file count, finding count, body size, quality score, and highest-priority content gap.", + 
"sources": [ + { + "file": "README.md", + "lines": [ + 21, + 23 + ], + "fingerprint": "996c401d036d" + }, + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 166, + 186 + ], + "fingerprint": "3b93f710ebca" + }, + { + "file": "wikifi/critic.py", + "lines": [ + 155, + 180 + ], + "fingerprint": "502af9aee392" + }, + { + "file": "wikifi/report.py", + "lines": [ + 44, + 77 + ], + "fingerprint": "2b94c0a5e62e" + } + ] + }, + { + "text": "Coverage statistics surface dead zones — files processed but producing no findings.", + "sources": [ + { + "file": "wikifi/report.py", + "lines": [ + 103, + 107 + ], + "fingerprint": "2b94c0a5e62e" + } + ] + }, + { + "text": "An interactive conversational session grounded in all populated wiki sections supports multi-turn exchanges and various session management commands.", + "sources": [ + { + "file": "README.md", + "lines": [ + 24, + 25 + ], + "fingerprint": "996c401d036d" + }, + { + "file": "wikifi/chat.py", + "lines": [ + 88, + 130 + ], + "fingerprint": "0333e700a046" + }, + { + "file": "wikifi/chat.py", + "lines": [ + 63, + 82 + ], + "fingerprint": "0333e700a046" + }, + { + "file": "wikifi/cli.py", + "lines": [ + 60, + 220 + ], + "fingerprint": "f326383c7da1" + } + ] + }, + { + "text": "When AI synthesis fails for a section, the system falls back to emitting raw collected notes with the error message inline.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 272, + 285 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "text": "Unparseable schema files produce an advisory finding rather than failing silently.", + "sources": [ + { + "file": "wikifi/specialized/openapi.py", + "lines": [ + 23, + 50 + ], + "fingerprint": "ae97781309c4" + } + ] + } + ], + "contradictions": [] + }, + "external_dependencies": { + "notes_hash": "ba55342df61c", + "body": "The system draws on several categories of external service: language-model inference backends, development-time tooling integrations, and a continuous-integration 
platform.\n\n## Language-Model Inference\n\nAll substantive text generation and structured extraction is delegated to an external (or locally hosted) language-model service. Three backends are supported through a common provider abstraction:\n\n| Backend | Hosting | Authentication | Role |\n|---|---|---|---|\n| Local inference server (default) | Self-hosted, no network egress | None required | Default backend for all extraction and synthesis calls; configurable host address and 15-minute per-call timeout |\n| Hosted AI service A (Anthropic) | Cloud API | API key (`ANTHROPIC_API_KEY`) | Opt-in backend; uses an ephemeral prompt-cache marker on the system prompt so that large extraction prompts are billed at roughly 10 % of normal input-token cost across repeated per-file calls |\n| Hosted AI service B (OpenAI-compatible) | Cloud API (or compatible proxy/Azure endpoint) | API key + optional custom base URL | Opt-in backend; relies on automatic prefix caching (prefixes ≥ 1 024 tokens cached for ~5–10 minutes); exposes a reasoning-intensity knob mapped to the backend's reasoning-effort parameter on capable model variants |\n\nThe local inference server is the default and requires no credentials or external network access. The two hosted backends are opt-in and each require a provisioned API key. All three backends are configured with a model name, timeout, and per-call output-token cap drawn from the application's runtime settings.\n\n### Caching Strategy\nBecause the extraction prompt is large and is reused across every file in a repository, minimising repeated billing for identical prompt prefixes is a first-class concern. The hosted-AI-service-A integration achieves this by tagging the system-prompt block with an ephemeral cache-control marker. 
The hosted-AI-service-B integration relies on the provider's automatic prefix-caching mechanism without requiring explicit markers.\n\n## Development-Time Tool Integrations\n\nThe MCP server configuration reveals several additional integrations that appear to be used during development or agent-assisted workflows rather than in the core production pipeline:\n\n- **Google AI generative API** — consumed by at least two registered tool integrations; authenticated via a shared API key.\n- **Self-hosted web-crawling service** — running locally on a fixed port with no API key, providing crawling capability on demand.\n- **External documentation/context lookup service** — called over HTTP with a dedicated API key; likely used to retrieve up-to-date reference documentation for prompt enrichment.\n- **Google-hosted orchestration service (", + "claims": [], + "contradictions": [] + }, + "integrations": { + "notes_hash": "dc7982e6a028", + "body": "### Inbound: Entry Points into the System\n\nThe system is distributed as a library installed directly into a target project. The command-line interface (CLI) is the primary inbound entry point, exposing subcommands that drive the full pipeline from repository introspection through wiki generation, interactive querying, and quality reporting. The CLI delegates all pipeline coordination to the orchestrator, which is also the central hub wiring together every downstream stage.\n\n---\n\n### Outbound: AI Model Backends\n\nAll pipeline stages — introspection, per-file extraction, section aggregation, derivative content derivation, quality critique, and interactive chat — communicate with an AI model backend exclusively through a shared provider abstraction. No stage calls a specific backend directly. 
Three interaction shapes are exposed through this abstraction: schema-validated structured output, free-text completion, and multi-turn stateful conversation.\n\nThree backends are available and are interchangeable without altering any pipeline code:\n\n| Backend type | Hosting model |\n|---|---|\n| Local self-hosted inference runtime | On-premise / developer machine |\n| Hosted AI service (Anthropic-compatible) | Remote cloud |\n| Hosted AI service (OpenAI-compatible) | Remote cloud or self-managed endpoint |\n\nThe active backend is selected via an environment variable or a per-invocation flag at the CLI level. OpenAI-compatible endpoints — including corporate reverse proxies and managed cloud deployments — are supported by overriding the base URL alone, with no other changes to the calling code.\n\n---\n\n### Outbound: Development-Time Tool Servers (MCP)\n\nA separate set of external capability providers is declared through an MCP client configuration used during development or runtime. Four tool servers are wired up: a local AI utility, a local web crawler, a remote documentation-context service, and a remote search-and-stitching service. The system acts as an MCP client that fans requests out to these providers as needed.\n\n---\n\n### Outbound: Filesystem and Persistence Layer\n\nAll reading and writing of wiki artifacts — extraction notes, finished section bodies, and cache entries — flows through a centralized layout abstraction that manages the `.wikifi/` output directory inside the target project. The extractor, aggregator, deriver, CLI, and orchestrator all resolve paths through this abstraction rather than independently.\n\nA content-addressed cache layer sits between the orchestrator and the AI backend, consulting a fingerprinting service to derive content hashes as cache keys. 
The extractor, aggregator, and orchestrator each consult the cache before issuing AI calls, enabling both incremental re-runs and resumability for large codebases.\n\n---\n\n### Integration Touchpoints Discovered in Target Codebases\n\nWhen analyzing a target codebase, the system identifies and surfaces integration touchpoints from high-signal artifact files through specialized parsers:\n\n- **HTTP API surfaces** — Parsed from API contract files; each contract contributes a finding recording the count of externally exposed endpoints, establishing the public-facing API surface as a documented integration point.\n- **RPC service definitions** — Each declared service and its remote procedures are mapped, capturing procedure names, request and response message types, and whether either channel is streaming.\n- **Event-driven channels** — Subscription roots found in schema definition files are classified as real-time integration touchpoints rather than ordinary capabilities, reflecting their role as channels that external consumers attach to.\n- **Relational links** — Foreign key declarations (both explicit and inline) are surfaced as hard relational links between domain entities, identifying cross-entity data dependencies.\n\nThe dispatcher that routes files to these specialized parsers uses the file-kind classification produced by the repository graph module, ensuring each artifact type reaches the appropriate parser while preserving a uniform output contract for downstream aggregation.", + "claims": [ + { + "text": "The system is distributed as a library installed into a target project and invoked via a CLI from that project's root.", + "sources": [ + { + "file": "README.md", + "lines": [ + 8, + 12 + ], + "fingerprint": "996c401d036d" + } + ] + }, + { + "text": "The CLI is the primary inbound entry point, exposing subcommands that drive the full pipeline from introspection through wiki generation, chat, and reporting.", + "sources": [ + { + "file": "wikifi/cli.py", + 
"lines": [ + 98, + 101 + ], + "fingerprint": "f326383c7da1" + } + ] + }, + { + "text": "The orchestrator is the central hub called by the CLI that wires together all pipeline stages.", + "sources": [ + { + "file": "wikifi/orchestrator.py", + "lines": [ + 40, + 60 + ], + "fingerprint": "6ed682a87356" + } + ] + }, + { + "text": "All pipeline stages communicate with an AI model backend exclusively through a shared provider abstraction; no stage calls a specific backend directly.", + "sources": [ + { + "file": "wikifi/providers/base.py", + "lines": [ + 30, + 48 + ], + "fingerprint": "2750f0f56327" + } + ] + }, + { + "text": "Three interaction shapes are exposed through the provider abstraction: schema-validated structured output, free-text completion, and multi-turn stateful conversation.", + "sources": [ + { + "file": "wikifi/providers/anthropic_provider.py", + "lines": [ + 115, + 175 + ], + "fingerprint": "872020d40ac3" + }, + { + "file": "wikifi/providers/base.py", + "lines": [ + 30, + 48 + ], + "fingerprint": "2750f0f56327" + }, + { + "file": "wikifi/providers/ollama_provider.py", + "lines": [ + 58, + 95 + ], + "fingerprint": "0a21916665a5" + }, + { + "file": "wikifi/providers/openai_provider.py", + "lines": [ + 1, + 8 + ], + "fingerprint": "428df9ba13f1" + } + ] + }, + { + "text": "Three interchangeable backends are available: a local self-hosted inference runtime, an Anthropic-hosted service, and an OpenAI-compatible service.", + "sources": [ + { + "file": "README.md", + "lines": [ + 46, + 51 + ], + "fingerprint": "996c401d036d" + }, + { + "file": "wikifi/cli.py", + "lines": [ + 98, + 101 + ], + "fingerprint": "f326383c7da1" + }, + { + "file": "wikifi/providers/anthropic_provider.py", + "lines": [ + 115, + 175 + ], + "fingerprint": "872020d40ac3" + }, + { + "file": "wikifi/providers/ollama_provider.py", + "lines": [ + 58, + 95 + ], + "fingerprint": "0a21916665a5" + }, + { + "file": "wikifi/providers/openai_provider.py", + "lines": [ + 1, + 8 + ], + "fingerprint": 
"428df9ba13f1" + } + ] + }, + { + "text": "The active backend is selected via an environment variable or a per-invocation flag.", + "sources": [ + { + "file": "README.md", + "lines": [ + 46, + 51 + ], + "fingerprint": "996c401d036d" + }, + { + "file": "wikifi/cli.py", + "lines": [ + 98, + 101 + ], + "fingerprint": "f326383c7da1" + } + ] + }, + { + "text": "OpenAI-compatible endpoints including corporate reverse proxies and managed cloud deployments are supported by overriding the base URL only, with no other code changes.", + "sources": [ + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 232, + 235 + ], + "fingerprint": "3b93f710ebca" + } + ] + }, + { + "text": "An MCP client configuration wires up four external tool servers: a local AI utility, a local web crawler, a remote documentation-context service, and a remote search-and-stitching service.", + "sources": [ + { + "file": ".mcp.json", + "lines": [ + 2, + 36 + ], + "fingerprint": "b6b856cb3fe2" + } + ] + }, + { + "text": "All wiki artifact persistence flows through a centralized layout abstraction managing the .wikifi/ directory, consumed by the orchestrator, extractor, aggregator, deriver, and CLI.", + "sources": [ + { + "file": "wikifi/wiki.py", + "lines": [ + 34, + 61 + ], + "fingerprint": "9230b7444e0d" + } + ] + }, + { + "text": "A content-addressed cache layer uses a fingerprinting service to compute content hashes as cache keys, and is consulted by the extractor, aggregator, and orchestrator before issuing AI calls.", + "sources": [ + { + "file": "wikifi/cache.py", + "lines": [ + 244, + 246 + ], + "fingerprint": "1ba541fe863d" + }, + { + "file": "wikifi/cache.py", + "lines": [ + 30, + 30 + ], + "fingerprint": "1ba541fe863d" + } + ] + }, + { + "text": "API contract files are parsed to produce inbound-integration findings recording the count of HTTP endpoints exposed to external consumers.", + "sources": [ + { + "file": "wikifi/specialized/openapi.py", + "lines": [ + 83, + 92 + ], + "fingerprint": 
"ae97781309c4" + } + ] + }, + { + "text": "Service and RPC definition files are parsed to map each procedure, capturing name, request and response types, and streaming flags, with each service emitted as a distinct integration touchpoint.", + "sources": [ + { + "file": "wikifi/specialized/protobuf.py", + "lines": [ + 70, + 87 + ], + "fingerprint": "e20d5913745a" + } + ] + }, + { + "text": "Subscription roots in schema definition files are classified as real-time integration touchpoints rather than ordinary capabilities.", + "sources": [ + { + "file": "wikifi/specialized/graphql.py", + "lines": [ + 88, + 91 + ], + "fingerprint": "bbb305e0d47f" + } + ] + }, + { + "text": "Foreign key declarations are surfaced as hard relational links between domain entities.", + "sources": [ + { + "file": "wikifi/specialized/sql.py", + "lines": [ + 86, + 96 + ], + "fingerprint": "1ef5e77c4038" + } + ] + }, + { + "text": "The specialized-parser dispatcher uses the file-kind classification from the repository graph module and routes to four sibling parsers while preserving a uniform output contract.", + "sources": [ + { + "file": "wikifi/specialized/__init__.py", + "lines": [ + 46, + 57 + ], + "fingerprint": "84d6c382c745" + } + ] + } + ], + "contradictions": [] + }, + "cross_cutting": { + "notes_hash": "9142920419c0", + "body": "## Observability\n\nA consistent, pipeline-wide observability model spans every stage of the system. Structured logging is initialised once and reused across all subcommands; a single verbose flag activates debug-level output globally without each subsystem needing its own toggle. Stage-boundary log events are emitted at each major transition — repository introspection, dependency-graph construction, file extraction, section aggregation, and derivative synthesis — so operators can pinpoint where a long walk is spending time. 
Revision and quality-scoring events are counted in the run's statistics, and cache hit counts are surfaced in the post-walk report, giving a quantitative picture of incremental efficiency.\n\n## Resilience and Error Handling\n\nThe system is designed so that no single failure can abort an entire pipeline run. Extraction failures — whether caused by an inference provider or a specialised deterministic parser — are logged and tallied but never propagated upward; a file whose processing fails entirely is recorded as skipped, and partially-recovered files retain whatever findings were salvaged. Aggregation and derivation failures follow the same pattern: errors are caught and logged at warning level, and a fallback body that preserves the raw upstream evidence is written so the wiki remains inspectable. Quality-assurance (critic and reviser) failures degrade gracefully to returning the original body with a diagnostic score of zero rather than halting. Provider failures during interactive query sessions are surfaced inline without terminating the session. Across all provider backends, raw infrastructure errors are caught at the provider boundary and re-raised as a normalised internal error type carrying the upstream request identifier when available, so the rest of the pipeline does not branch on provider-specific exception shapes.\n\n## Content-Addressed Caching and Crash Resumability\n\nAll expensive inference work is protected by a two-scope content-addressed cache stored under a dedicated hidden subdirectory within the wiki output directory, inheriting the same version-control ignore rules as other working-state artifacts.\n\n- **Extraction scope:** each file's results are keyed by the combination of its relative path and a stable hash of its raw bytes. Any unchanged file is skipped on re-walk with no inference call.\n- **Aggregation scope:** each section's synthesised body is keyed by a deterministic digest of its note payload. 
Unchanged inputs reuse the stored body and evidence bundle.\n\nCache entries are written after every individual file completes, so a mid-walk crash loses at most one file's work. Writes are performed atomically — content is first written to a temporary location and then renamed into place — preventing corrupt partial writes. Malformed entries are silently dropped and logged rather than causing a hard failure, so a partially corrupt cache degrades gracefully to a fresh extraction for only the affected entries. A monotonically increasing version tag is embedded in every persisted cache file; a version mismatch on load causes the entire cache to be discarded and rebuilt, providing a controlled invalidation path across software upgrades. Between runs, entries for files no longer in scope are pruned automatically.\n\n## Input Integrity Guards\n\nA layered set of guards prevents low-signal or pathological inputs from ever reaching the inference layer.\n\n| Guard | Threshold | Effect |\n|---|---|---|\n| Minimum content size | 64 bytes (stripped) | File silently skipped |\n| Maximum file size | 2 MB | File silently skipped |\n| Large-file windowing | 150 KB – 2 MB | File split into overlapping chunks with 8 KB overlap |\n| Manifest truncation | 20 000 bytes | Hard-truncated with visible marker |\n| Per-request timeout | 900 seconds | Uniform backstop across all providers |\n\nDirectory traversal prunes excluded subtrees before descending into them, so ignore patterns are applied efficiently at the directory level rather than file-by-file. Files carrying no extractable intent — stub initialisers, empty fixtures, generated lockfiles — are identified and dropped before reaching the inference layer; the invariant that a single empty or unstructured file must never stall the walk is explicitly upheld. 
Findings produced from the overlap region between adjacent large-file chunks are deduplicated by section and normalised text within each file's pass, preventing double-counting in downstream aggregation.\n\n## Provider Abstraction\n\nAll inference calls — structured extraction, free-text generation, and multi-turn chat — are routed through a single provider abstraction layer. This boundary is where observability hooks, retry logic, error normalisation, and backend-switching concerns live; no extraction or aggregation logic needs knowledge of which backend is active. Supported backend shapes include local inference runtimes and hosted services; the local-inference path is the default, with hosted options as addenda, and swapping between them requires no changes outside the provider boundary.\n\nStructured-output calls enforce a schema-validation contract: the model response must be validated against a declared schema before being returned to the caller, ensuring type-safe data flows through every pipeline stage. To maximise determinism, temperature is hard-pinned to zero on all structured-output calls; free-text and conversational paths accept model-default variability in exchange for naturalness.\n\nWhen a backend exposes a reasoning-depth control, the system runs at the highest available setting, prioritising output quality over walk speed. 
A configurable depth parameter is translated into the provider's native adaptive-thinking feature, allowing callers to trade latency and cost against quality without branching on provider type in shared pipeline code.\n\nHosted backends employ prompt-caching strategies — placing the large, repeated system prompt at a fixed position in every request so the service can serve subsequent calls from a cached prefix — making large-scale walks economically viable by paying full input cost only on the first call and a fraction of that cost on subsequent ones.\n\n## Source Traceability and Hallucination Prevention\n\nFull source traceability is a non-negotiable structural invariant: every assertion in every wiki section must be linkable back to the originating file and, where available, the precise line range within it. This is enforced through typed evidence structures (claims and source references) rather than by convention, so the constraint cannot be silently bypassed.\n\nHallucination prevention operates at two additional levels. First, the inference prompt explicitly instructs the model never to name specific technologies, translating all observations into domain terms — this is a mandatory invariant enforced at the prompt layer. Second, upstream section content that matches known placeholder shapes is filtered out before derivative synthesis, preventing empty or stub sections from being treated as real evidence; these same sentinel strings are used by the quality-report layer to exclude placeholder sections from scoring. 
Interactive query sessions are similarly grounded: the assistant is instructed to explicitly acknowledge when the wiki does not cover a topic rather than generating unsupported answers.\n\nContent fingerprints serve a triple cross-cutting role: keying both extraction and aggregation caches so stale results are never served, anchoring source-evidence citations so claims can be re-verified against a fresh repository walk, and tracking file identity inside the dependency graph so cross-file context is invalidated when any contributing source changes. Files are always fingerprinted as raw bytes rather than decoded text to ensure the cache layer and the extractor agree on identity regardless of encoding assumptions.\n\n## Authentication and Storage Invariants\n\nSpecialised deterministic parsers extract security and data-integrity contracts from high-signal artifacts and surface them as first-class cross-cutting concerns that must be preserved through any migration:\n\n- **Authentication schemes** declared in API contract files are extracted and categorised by type, flagging which security contracts (key-based, delegated authorisation, bearer-token, etc.) 
the new system must honour.\n- **Data integrity constraints** — uniqueness and non-nullability — found in schema definitions are extracted as storage invariants explicitly marked as migration-critical.\n- **Query-performance invariants** — index definitions — are recorded with an explicit note that the new system must preserve equivalent access patterns.\n\nAll specialised parsers return results in the same structured shape as the general inference extractor, so the aggregation layer needs no knowledge of which extraction path was taken; this uniform interface contract is itself an invariant that must be preserved.\n\n## Data Storage Layout\n\nThe pipeline's working state is isolated to a single hidden directory within the repository:\n\n- **Rendered section documents** live at the root of this directory and are intended to be committed to version control.\n- **Per-section extraction notes** (JSONL, each record UTC-timestamped) are stored in a notes subdirectory and excluded from version control via a generated ignore file.\n- **Extraction and aggregation caches** are stored in a cache subdirectory and similarly excluded.\n\nDeleting the cache subdirectory forces a full re-walk; deleting the entire working directory resets all pipeline state. 
This layout ensures generated documentation commits remain clean and the boundary between committed outputs and ephemeral working state is unambiguous.", + "claims": [ + { + "text": "A single verbose flag activates debug-level structured logging globally across all subcommands.", + "sources": [ + { + "file": "wikifi/cli.py", + "lines": [ + 51, + 60 + ], + "fingerprint": "f326383c7da1" + } + ] + }, + { + "text": "Stage-boundary log events are emitted at each major pipeline transition — introspection, dependency-graph construction, extraction, aggregation, and derivation.", + "sources": [ + { + "file": "wikifi/orchestrator.py", + "lines": [ + 84, + 148 + ], + "fingerprint": "6ed682a87356" + } + ] + }, + { + "text": "Revision and quality-scoring events are counted in run statistics, and cache hit counts are surfaced in the post-walk report.", + "sources": [ + { + "file": "wikifi/cli.py", + "lines": [ + 90, + 97 + ], + "fingerprint": "f326383c7da1" + }, + { + "file": "wikifi/deriver.py", + "lines": [ + 110, + 135 + ], + "fingerprint": "0b7f4f5abb09" + } + ] + }, + { + "text": "Structured logging is initialised under a dedicated namespace for the report subsystem.", + "sources": [ + { + "file": "wikifi/report.py", + "lines": [ + 22, + 22 + ], + "fingerprint": "2b94c0a5e62e" + } + ] + }, + { + "text": "Extraction failures are logged and tallied but never propagate to abort the walk; a file whose processing fails entirely is recorded as skipped.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": [ + 228, + 242 + ], + "fingerprint": "b0e939259557" + } + ] + }, + { + "text": "Aggregation failures are caught and logged at warning level, and a fallback body preserving raw notes is written.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 143, + 152 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "text": "Derivation failures are caught and logged; a fallback body preserving upstream evidence is written rather than leaving the section 
blank.", + "sources": [ + { + "file": "wikifi/deriver.py", + "lines": [ + 96, + 107 + ], + "fingerprint": "0b7f4f5abb09" + } + ] + }, + { + "text": "Quality-assurance failures degrade gracefully by returning the original body with a diagnostic score of zero.", + "sources": [ + { + "file": "wikifi/critic.py", + "lines": [ + 158, + 165 + ], + "fingerprint": "502af9aee392" + } + ] + }, + { + "text": "Provider failures during interactive sessions are surfaced inline without terminating the session.", + "sources": [ + { + "file": "wikifi/chat.py", + "lines": [ + 120, + 125 + ], + "fingerprint": "0333e700a046" + } + ] + }, + { + "text": "Raw infrastructure errors are caught at the provider boundary and re-raised as a normalised internal error type carrying the upstream request identifier.", + "sources": [ + { + "file": "wikifi/providers/anthropic_provider.py", + "lines": [ + 238, + 244 + ], + "fingerprint": "872020d40ac3" + }, + { + "file": "wikifi/providers/openai_provider.py", + "lines": [ + 248, + 255 + ], + "fingerprint": "428df9ba13f1" + } + ] + }, + { + "text": "Extraction results are keyed by the combination of a file's relative path and a stable hash of its raw bytes; unchanged files are skipped on re-walk.", + "sources": [ + { + "file": "README.md", + "lines": [ + 40, + 43 + ], + "fingerprint": "996c401d036d" + }, + { + "file": "wikifi/fingerprint.py", + "lines": [ + 1, + 18 + ], + "fingerprint": "853400108135" + } + ] + }, + { + "text": "Aggregation results are keyed by a deterministic digest of the note payload; unchanged inputs reuse the stored body and evidence bundle.", + "sources": [ + { + "file": "README.md", + "lines": [ + 40, + 43 + ], + "fingerprint": "996c401d036d" + }, + { + "file": "wikifi/aggregator.py", + "lines": [ + 126, + 155 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "text": "Cache entries are written after every individual file completes, so a mid-walk crash loses at most one file's work.", + "sources": [ + { + "file": 
"TESTING-AND-DEMO.md", + "lines": [ + 67, + 88 + ], + "fingerprint": "3b93f710ebca" + }, + { + "file": "wikifi/extractor.py", + "lines": [ + 155, + 175 + ], + "fingerprint": "b0e939259557" + } + ] + }, + { + "text": "Cache writes are atomic — content is written to a temporary location and then renamed into place.", + "sources": [ + { + "file": "wikifi/cache.py", + "lines": [ + 189, + 193 + ], + "fingerprint": "1ba541fe863d" + } + ] + }, + { + "text": "Malformed cache entries are silently dropped and logged rather than causing a hard failure.", + "sources": [ + { + "file": "wikifi/cache.py", + "lines": [ + 196, + 222 + ], + "fingerprint": "1ba541fe863d" + } + ] + }, + { + "text": "A monotonically increasing version tag is embedded in every cache file; a version mismatch causes the entire cache to be discarded and rebuilt.", + "sources": [ + { + "file": "wikifi/cache.py", + "lines": [ + 38, + 38 + ], + "fingerprint": "1ba541fe863d" + } + ] + }, + { + "text": "Entries for files no longer in scope are pruned from the cache between runs.", + "sources": [ + { + "file": "wikifi/orchestrator.py", + "lines": [ + 95, + 110 + ], + "fingerprint": "6ed682a87356" + } + ] + }, + { + "text": "Cache files are stored under a dedicated hidden subdirectory within the wiki output directory, inheriting version-control ignore rules.", + "sources": [ + { + "file": "wikifi/cache.py", + "lines": [ + 19, + 21 + ], + "fingerprint": "1ba541fe863d" + } + ] + }, + { + "text": "Files below 64 bytes (stripped) are silently skipped to prevent inference on effectively empty inputs.", + "sources": [ + { + "file": "wikifi/config.py", + "lines": [ + 56, + 59 + ], + "fingerprint": "8cd2ca53c957" + }, + { + "file": "wikifi/walker.py", + "lines": [ + 61, + 79 + ], + "fingerprint": "a29bd1ad8bdb" + } + ] + }, + { + "text": "Files above 2 MB are silently skipped on the assumption they are vendored, generated, or binary.", + "sources": [ + { + "file": "wikifi/config.py", + "lines": [ + 38, + 56 + ], + 
"fingerprint": "8cd2ca53c957" + }, + { + "file": "wikifi/walker.py", + "lines": [ + 61, + 79 + ], + "fingerprint": "a29bd1ad8bdb" + } + ] + }, + { + "text": "Files between 150 KB and 2 MB are split into overlapping chunks with an 8 KB overlap.", + "sources": [ + { + "file": "wikifi/config.py", + "lines": [ + 38, + 56 + ], + "fingerprint": "8cd2ca53c957" + } + ] + }, + { + "text": "Manifest files are hard-truncated at 20,000 bytes with a visible truncation marker.", + "sources": [ + { + "file": "wikifi/walker.py", + "lines": [ + 220, + 231 + ], + "fingerprint": "a29bd1ad8bdb" + } + ] + }, + { + "text": "All inference calls share a single per-request timeout of 900 seconds as a uniform backstop.", + "sources": [ + { + "file": ".env.example", + "lines": [ + 16, + 29 + ], + "fingerprint": "2e493dbd2d87" + }, + { + "file": "wikifi/config.py", + "lines": [ + 33, + 34 + ], + "fingerprint": "8cd2ca53c957" + } + ] + }, + { + "text": "Directory traversal prunes excluded subtrees before descending into them.", + "sources": [ + { + "file": "wikifi/walker.py", + "lines": [ + 133, + 143 + ], + "fingerprint": "a29bd1ad8bdb" + } + ] + }, + { + "text": "Files carrying no extractable intent are identified and dropped before reaching the inference layer; an empty file must never stall the walk.", + "sources": [ + { + "file": "README.md", + "lines": [ + 44, + 46 + ], + "fingerprint": "996c401d036d" + }, + { + "file": "VISION.md", + "lines": [ + 99, + 100 + ], + "fingerprint": "10651b456a64" + } + ] + }, + { + "text": "Findings from overlap regions between adjacent chunks are deduplicated by section and normalised text within each file's pass.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": [ + 253, + 262 + ], + "fingerprint": "b0e939259557" + } + ] + }, + { + "text": "All inference calls are routed through a provider abstraction layer where observability, retry logic, error normalisation, and backend-switching concerns are centralised.", + "sources": [ + { + "file": 
"CLAUDE.md", + "lines": [ + 53, + 54 + ], + "fingerprint": "ac9698d91de6" + }, + { + "file": "VISION.md", + "lines": [ + 92, + 96 + ], + "fingerprint": "10651b456a64" + } + ] + }, + { + "text": "Structured-output calls require the model response to be validated against a declared schema before being returned to the caller.", + "sources": [ + { + "file": "wikifi/providers/base.py", + "lines": [ + 36, + 38 + ], + "fingerprint": "2750f0f56327" + } + ] + }, + { + "text": "Temperature is hard-pinned to zero on all structured-output calls to enforce determinism; free-text and conversational paths use model-default temperature.", + "sources": [ + { + "file": "wikifi/providers/ollama_provider.py", + "lines": [ + 58, + 68 + ], + "fingerprint": "0a21916665a5" + } + ] + }, + { + "text": "When a backend exposes a reasoning-depth control, the system runs at the highest available setting, prioritising quality over speed.", + "sources": [ + { + "file": "VISION.md", + "lines": [ + 97, + 98 + ], + "fingerprint": "10651b456a64" + }, + { + "file": "wikifi/providers/anthropic_provider.py", + "lines": [ + 212, + 232 + ], + "fingerprint": "872020d40ac3" + } + ] + }, + { + "text": "Hosted backends employ prompt-caching strategies so that only the first call in a walk pays full input-token cost.", + "sources": [ + { + "file": "wikifi/providers/anthropic_provider.py", + "lines": [ + 193, + 210 + ], + "fingerprint": "872020d40ac3" + }, + { + "file": "wikifi/providers/openai_provider.py", + "lines": [ + 13, + 17 + ], + "fingerprint": "428df9ba13f1" + } + ] + }, + { + "text": "Full source traceability is enforced structurally: every wiki assertion must be linkable to its originating file and line range via typed evidence structures.", + "sources": [ + { + "file": "wikifi/evidence.py", + "lines": [ + 1, + 18 + ], + "fingerprint": "dddfe1a01c85" + } + ] + }, + { + "text": "The inference prompt mandates tech-agnostic output, explicitly instructing the model to translate all observations into 
domain terms.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 54, + 67 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "text": "Upstream section content matching known placeholder shapes is filtered out before derivative synthesis to prevent fabrication from empty inputs.", + "sources": [ + { + "file": "wikifi/deriver.py", + "lines": [ + 110, + 135 + ], + "fingerprint": "0b7f4f5abb09" + }, + { + "file": "wikifi/deriver.py", + "lines": [ + 118, + 135 + ], + "fingerprint": "0b7f4f5abb09" + } + ] + }, + { + "text": "The same sentinel strings used in derivation filtering are used by the quality-report layer to exclude placeholder sections from scoring.", + "sources": [ + { + "file": "wikifi/report.py", + "lines": [ + 118, + 123 + ], + "fingerprint": "2b94c0a5e62e" + } + ] + }, + { + "text": "Interactive query sessions instruct the assistant to acknowledge when the wiki does not cover a topic rather than generating unsupported answers.", + "sources": [ + { + "file": "wikifi/chat.py", + "lines": [ + 27, + 31 + ], + "fingerprint": "0333e700a046" + } + ] + }, + { + "text": "Content fingerprints serve three roles: cache keying, citation anchoring, and dependency-graph invalidation.", + "sources": [ + { + "file": "wikifi/fingerprint.py", + "lines": [ + 1, + 18 + ], + "fingerprint": "853400108135" + } + ] + }, + { + "text": "Files are always fingerprinted as raw bytes to ensure consistent identity regardless of encoding assumptions.", + "sources": [ + { + "file": "wikifi/fingerprint.py", + "lines": [ + 44, + 50 + ], + "fingerprint": "853400108135" + } + ] + }, + { + "text": "Authentication schemes declared in API contract files are extracted and categorised by type as migration-critical cross-cutting concerns.", + "sources": [ + { + "file": "wikifi/specialized/openapi.py", + "lines": [ + 110, + 121 + ], + "fingerprint": "ae97781309c4" + } + ] + }, + { + "text": "Uniqueness and non-nullability constraints from schema definitions are extracted as 
storage invariants flagged as migration-critical.", + "sources": [ + { + "file": "wikifi/specialized/sql.py", + "lines": [ + 97, + 98 + ], + "fingerprint": "1ef5e77c4038" + } + ] + }, + { + "text": "Index definitions are recorded as query-performance invariants that the target system must preserve.", + "sources": [ + { + "file": "wikifi/specialized/sql.py", + "lines": [ + 113, + 125 + ], + "fingerprint": "1ef5e77c4038" + } + ] + }, + { + "text": "All specialised parsers return results in the same structured shape as the general inference extractor, preserving a uniform interface contract downstream.", + "sources": [ + { + "file": "wikifi/specialized/__init__.py", + "lines": [ + 9, + 13 + ], + "fingerprint": "84d6c382c745" + } + ] + }, + { + "text": "Rendered section documents are committed to version control; extraction notes and caches are excluded via a generated ignore file.", + "sources": [ + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 249, + 265 + ], + "fingerprint": "3b93f710ebca" + }, + { + "file": "wikifi/wiki.py", + "lines": [ + 96, + 121 + ], + "fingerprint": "9230b7444e0d" + } + ] + }, + { + "text": "Deleting the cache subdirectory forces a full re-walk; deleting the entire working directory resets all pipeline state.", + "sources": [ + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 249, + 265 + ], + "fingerprint": "3b93f710ebca" + } + ] + } + ], + "contradictions": [] + }, + "entities": { + "notes_hash": "aff1a81afdaf", + "body": "The system's domain model spans five functional layers — wiki structure, file classification, extraction, evidence, and review — plus supporting entities for caching, derivation, and chat.\n\n---\n\n## Wiki Structure\n\n**Section** is the central organizing entity. Each section carries a unique identifier, a human-readable title, a prose description of what belongs in it, and a tier (primary or derivative). 
Derivative sections additionally declare an ordered list of upstream section identifiers they depend on, forming an explicit dependency graph. An invariant holds at startup: every upstream identifier in a derivative section's dependency list must refer to a section that appears earlier in the canonical ordering (topological sort enforced).\n\n**WikiLayout** is an immutable value object that encodes the on-disk structure of a wiki workspace. Given a project root, it derives all canonical sub-paths: the wiki directory, configuration file, gitignore file, notes directory, per-section markdown files, and per-section note files. No fields are mutable after construction.\n\n**WalkConfig** is an immutable configuration record consumed by the filesystem walker. It captures the repository root, extra exclusion patterns, a flag for honouring ignore rules, a maximum file size in bytes, and a minimum stripped-content size in bytes.\n\n---\n\n## File Classification and Graph\n\n**FileKind** is a closed enumeration of seven mutually exclusive file roles: application code, SQL, OpenAPI, Protobuf, GraphQL, migration, and other. This classification determines whether a file is routed to a specialised deterministic parser or the general-purpose extraction path.\n\n**GraphNode** represents a single file's position in the repository's import graph. It carries the file's repo-relative path, the ordered set of files it imports, and the ordered set of files that import it. It exposes a capped combined-neighbour list for inclusion in extraction prompts.\n\n**RepoGraph** holds the complete import-edge map for a repository scan. 
It supports node lookup by path and retrieval of a capped neighbour list for any given file, providing cross-file context during extraction.\n\n**DirSummary** is a value object holding aggregate statistics for a single (non-recursive) directory: its repo-relative path, file count, total byte size, a frequency map of the top-10 file extensions, and a tuple of notable filenames (manifests, readmes) present in that directory.\n\n---\n\n## Extraction Layer\n\n**SectionFinding** represents one file's contribution to one wiki section. It carries the target section identifier, a technology-agnostic prose description of the contribution, and an optional inclusive line range within the source chunk.\n\n**FileFindings** groups a one-sentence summary of a file with all `SectionFinding` records produced for it.\n\n**SpecializedFinding** is the output unit of the deterministic parsing paths. It carries a section identifier, a human-readable description, and a list of source references. **SpecializedResult** groups zero or more such findings with an optional summary string; this is the uniform output contract for all specialised extractors, ensuring interoperability with the general extraction path downstream.\n\n**ExtractionStats** is a walk-level counter record, accumulating: total files seen, files yielding at least one finding, total findings, skipped files, chunks processed, cache hits, specialised-extractor invocations, and a per-kind file breakdown.\n\n---\n\n## Evidence Layer\n\n**SourceRef** represents a single span of source: a repo-relative file path, an optional inclusive line range, and a short content fingerprint captured at extraction time for change detection.\n\n**Claim** represents one assertion placed in a wiki section. It carries the markdown text and a list of `SourceRef` values that justify it. 
A claim with no sources is explicitly marked unsupported — this is a first-class state, not an error.\n\n**Contradiction** groups two or more conflicting `Claim` objects about the same topic under a single summary sentence. Each disagreeing position retains its own source references, preserving full traceability.\n\n**EvidenceBundle** is the aggregator's structured output for a single wiki section. It combines the narrative body text, a list of `Claim` records, and a list of `Contradiction` records. The renderer uses the bundle to thread numbered citations and a conflicts block into the final markdown.\n\nDuring aggregation, the pipeline works with intermediate forms: **AggregatedClaim** pairs a single prose assertion with the 1-based indices of the input notes that support it, and **AggregatedContradiction** holds a one-sentence summary alongside multiple conflicting positional claims, each with its own note indices. These are the structured forms that the language model produces before being resolved into the full evidence model.\n\n---\n\n## Cache Entities\n\n**CachedFindings** stores the extraction result for a single file: the file's content fingerprint, the list of structured findings produced, a one-sentence summary, and a count of processed chunks. Its invariant is content-addressed — the fingerprint is the cache key.\n\n**CachedSection** stores the aggregation result for a single wiki section: the hash of the notes payload that produced it, the rendered markdown body, and lists of claims and contradictions. It too is content-addressed on the notes hash.\n\n**WalkCache** is the in-memory container for both caches. 
It holds extraction and aggregation entries alongside hit and miss counters, enabling observability into cache effectiveness across a run.\n\n---\n\n## Quality and Review Layer\n\n**Critique** captures the quality assessment of a single section: an integer score (0–10), a short overall judgment, a list of unsupported claims, a list of gaps relative to the section brief, and a list of concrete revision suggestions.\n\n**ReviewOutcome** tracks a section's review lifecycle: the section identifier, the initial critique, the current body text, a flag indicating whether a revision was applied, and the optional follow-up critique produced after revision.\n\n**WikiQualityReport** aggregates the full-wiki audit: an overall numeric score, a mapping from section identifiers to their individual `Critique` records, and optional coverage statistics.\n\n**CoverageStats** records total files seen, files with findings, and per-section breakdowns of finding counts and contributing file counts; it exposes a coverage-percentage computation.\n\n**SectionReport** captures the per-section view for reporting: the section descriptor, count of contributing files, total findings count, body size in characters, an emptiness flag, and an optional quality critique.\n\n**WikiReport** aggregates all `SectionReport` records alongside overall coverage statistics and an optional mean quality score across populated sections.\n\n---\n\n## Derivation and Pipeline Outputs\n\n**IntrospectionResult** captures the Stage 1 decision about which files are worth deeper analysis: a list of gitignore-style include patterns, a list of exclude patterns, a list of primary languages (informational), a one-paragraph guess at the system's purpose, and a rationale for the choices made.\n\n**AggregationStats** records, for a single aggregation run, how many sections were written fresh, skipped due to empty notes, or served from cache.\n\n**DerivationStats** accumulates pipeline metrics for the derivation stage: counts 
of sections derived, skipped, and revised, plus the full list of `ReviewOutcome` records. It acts as an audit trail for the synthesis stage.\n\n**WalkReport** is the single return value of a completed wiki-generation run, aggregating the introspection result, extraction statistics, aggregation statistics, derivation statistics, the live cache state, and the repository import graph.\n\n---\n\n## Chat Layer\n\n**ChatMessage** carries a role and a content field, representing a single turn in a multi-turn conversation. Lists of these are accumulated to maintain conversation history.\n\n**LoadedSection** pairs a `Section` descriptor with its rendered markdown body, representing a single populated section ready for inclusion in a chat context.\n\n**ChatSession** holds a provider reference, the frozen system prompt built from wiki sections, and the accumulated conversation history as an ordered list of `ChatMessage` records. It supports appending user and assistant turns and clearing history while retaining the wiki context.\n\n---\n\n## Relationships and Invariants Summary\n\n| Entity | Key relationships | Notable invariants |\n|---|---|---|\n| Section | depends on upstream Sections (derivative tier only) | Dependency graph must be topologically ordered |\n| WikiLayout | derived from a project root | Immutable; all paths are computed, not stored independently |\n| SourceRef | referenced by Claim, SpecializedFinding | Fingerprint enables staleness detection |\n| Claim | groups SourceRefs; composed into EvidenceBundle | Sourceless claims are explicitly flagged unsupported |\n| Contradiction | groups ≥2 conflicting Claims | Each position retains its own SourceRefs |\n| CachedFindings | keyed on file content fingerprint | Cache miss if fingerprint changes |\n| CachedSection | keyed on notes-payload hash | Cache miss if any upstream note changes |\n| ReviewOutcome | holds pre- and post-revision Critique | Revision flag distinguishes touched from untouched sections |\n| 
WalkReport | aggregates all four stage outputs | Single return value for a complete run |", + "claims": [ + { + "text": "A Section entity carries a unique identifier, human-readable title, prose description, tier (primary or derivative), and an ordered list of upstream section identifiers for derivative sections, forming an explicit dependency graph.", + "sources": [ + { + "file": "wikifi/sections.py", + "lines": [ + 30, + 40 + ], + "fingerprint": "f743972a8fce" + }, + { + "file": "wikifi/deriver.py", + "lines": [ + 112, + 116 + ], + "fingerprint": "0b7f4f5abb09" + } + ] + }, + { + "text": "An invariant holds at startup: every upstream identifier in a derivative section's dependency list must refer to a section that appears earlier in the canonical ordering (topological sort enforced).", + "sources": [ + { + "file": "wikifi/sections.py", + "lines": [ + 30, + 40 + ], + "fingerprint": "f743972a8fce" + } + ] + }, + { + "text": "WikiLayout is an immutable value object that encodes the on-disk wiki workspace structure and derives all canonical sub-paths from a project root.", + "sources": [ + { + "file": "wikifi/cli.py", + "lines": [ + 166, + 172 + ], + "fingerprint": "f326383c7da1" + }, + { + "file": "wikifi/wiki.py", + "lines": [ + 34, + 61 + ], + "fingerprint": "9230b7444e0d" + } + ] + }, + { + "text": "WalkConfig is an immutable configuration record capturing repository root, extra exclusion patterns, gitignore-honouring flag, maximum file size in bytes, and minimum stripped-content size in bytes.", + "sources": [ + { + "file": "wikifi/walker.py", + "lines": [ + 61, + 79 + ], + "fingerprint": "a29bd1ad8bdb" + } + ] + }, + { + "text": "FileKind is a closed enumeration of seven mutually exclusive file roles (application code, SQL, OpenAPI, Protobuf, GraphQL, migration, other), driving routing to specialised or general-purpose extraction paths.", + "sources": [ + { + "file": "README.md", + "lines": [ + 31, + 33 + ], + "fingerprint": "996c401d036d" + }, + { + "file": 
"wikifi/repograph.py", + "lines": [ + 41, + 52 + ], + "fingerprint": "3d8bbdb10112" + } + ] + }, + { + "text": "GraphNode represents a single file's position in the import graph, carrying its repo-relative path, the ordered set of files it imports, and the ordered set of files that import it, and exposes a capped combined-neighbour list.", + "sources": [ + { + "file": "wikifi/repograph.py", + "lines": [ + 148, + 167 + ], + "fingerprint": "3d8bbdb10112" + } + ] + }, + { + "text": "RepoGraph holds the complete per-file import-edge map, supporting node lookup by path and retrieval of a capped neighbour list for any file.", + "sources": [ + { + "file": "wikifi/repograph.py", + "lines": [ + 170, + 181 + ], + "fingerprint": "3d8bbdb10112" + } + ] + }, + { + "text": "DirSummary is a value object holding aggregate statistics for a single non-recursive directory: path, file count, total byte size, top-10 extension frequency map, and notable filenames.", + "sources": [ + { + "file": "wikifi/walker.py", + "lines": [ + 144, + 153 + ], + "fingerprint": "a29bd1ad8bdb" + } + ] + }, + { + "text": "SectionFinding represents one file's contribution to one wiki section, carrying the target section identifier, a technology-agnostic prose description, and an optional inclusive line range.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": [ + 106, + 123 + ], + "fingerprint": "b0e939259557" + } + ] + }, + { + "text": "FileFindings groups a one-sentence file summary with all SectionFinding records produced for that file.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": [ + 106, + 123 + ], + "fingerprint": "b0e939259557" + } + ] + }, + { + "text": "SpecializedFinding carries a section identifier, a human-readable description, and a list of source references. 
SpecializedResult groups zero or more such findings with an optional summary string and is the uniform output contract for all specialised extractors.", + "sources": [ + { + "file": "wikifi/specialized/__init__.py", + "lines": [ + 29, + 38 + ], + "fingerprint": "84d6c382c745" + } + ] + }, + { + "text": "ExtractionStats accumulates walk-level counters: total files seen, files with findings, total findings, skipped files, chunks processed, cache hits, specialised-extractor invocations, and a per-kind file breakdown.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": [ + 126, + 135 + ], + "fingerprint": "b0e939259557" + } + ] + }, + { + "text": "SourceRef represents a single span of source: a repo-relative file path, an optional inclusive line range, and a short content fingerprint captured at extraction time.", + "sources": [ + { + "file": "wikifi/evidence.py", + "lines": [ + 37, + 52 + ], + "fingerprint": "dddfe1a01c85" + }, + { + "file": "README.md", + "lines": [ + 37, + 39 + ], + "fingerprint": "996c401d036d" + } + ] + }, + { + "text": "Claim carries markdown text and a list of SourceRefs that justify it; a claim with no sources is explicitly marked unsupported.", + "sources": [ + { + "file": "wikifi/evidence.py", + "lines": [ + 55, + 67 + ], + "fingerprint": "dddfe1a01c85" + }, + { + "file": "wikifi/aggregator.py", + "lines": [ + 166, + 186 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "text": "Contradiction groups two or more conflicting Claims about the same topic under a single summary sentence, with each disagreeing position retaining its own source references.", + "sources": [ + { + "file": "wikifi/evidence.py", + "lines": [ + 70, + 77 + ], + "fingerprint": "dddfe1a01c85" + }, + { + "file": "wikifi/aggregator.py", + "lines": [ + 74, + 101 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "text": "EvidenceBundle combines the narrative body text, a list of Claims, and a list of Contradictions for a single wiki section; the 
renderer uses it to thread citations and a conflicts block into the final markdown.", + "sources": [ + { + "file": "README.md", + "lines": [ + 46, + 48 + ], + "fingerprint": "996c401d036d" + }, + { + "file": "wikifi/evidence.py", + "lines": [ + 80, + 85 + ], + "fingerprint": "dddfe1a01c85" + }, + { + "file": "wikifi/aggregator.py", + "lines": [ + 166, + 186 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "text": "AggregatedClaim pairs a single prose assertion with the 1-based indices of the input notes that support it.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 74, + 101 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "text": "AggregatedContradiction holds a one-sentence summary alongside multiple conflicting positional claims, each with its own note indices.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 74, + 101 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "text": "CachedFindings stores the extraction result for a single file: content fingerprint, list of structured findings, one-sentence summary, and chunk count; it is content-addressed on the fingerprint.", + "sources": [ + { + "file": "wikifi/cache.py", + "lines": [ + 44, + 51 + ], + "fingerprint": "1ba541fe863d" + } + ] + }, + { + "text": "CachedSection stores the aggregation result for a single wiki section: the hash of the notes payload, the rendered markdown body, and lists of claims and contradictions; it is content-addressed on the notes hash.", + "sources": [ + { + "file": "wikifi/cache.py", + "lines": [ + 54, + 60 + ], + "fingerprint": "1ba541fe863d" + } + ] + }, + { + "text": "WalkCache is the in-memory container for both caches, holding extraction and aggregation entries alongside hit and miss counters.", + "sources": [ + { + "file": "wikifi/cache.py", + "lines": [ + 63, + 70 + ], + "fingerprint": "1ba541fe863d" + } + ] + }, + { + "text": "AggregationStats records how many sections were written fresh, skipped due to empty 
notes, or served from cache during a single aggregation run.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 103, + 107 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "text": "Critique captures the quality assessment of a single section: an integer score (0–10), a short overall judgment, a list of unsupported claims, a list of gaps relative to the section brief, and a list of revision suggestions.", + "sources": [ + { + "file": "wikifi/critic.py", + "lines": [ + 67, + 84 + ], + "fingerprint": "502af9aee392" + } + ] + }, + { + "text": "ReviewOutcome tracks a section's review lifecycle: the section identifier, initial critique, current body text, a revision-applied flag, and an optional follow-up critique.", + "sources": [ + { + "file": "wikifi/critic.py", + "lines": [ + 91, + 96 + ], + "fingerprint": "502af9aee392" + } + ] + }, + { + "text": "WikiQualityReport aggregates an overall numeric score, a mapping from section identifiers to individual Critique records, and optional coverage statistics.", + "sources": [ + { + "file": "wikifi/critic.py", + "lines": [ + 99, + 114 + ], + "fingerprint": "502af9aee392" + } + ] + }, + { + "text": "CoverageStats records total files seen, files with findings, and per-section breakdowns of finding counts and contributing file counts, and exposes a coverage-percentage computation.", + "sources": [ + { + "file": "wikifi/critic.py", + "lines": [ + 99, + 114 + ], + "fingerprint": "502af9aee392" + }, + { + "file": "wikifi/report.py", + "lines": [ + 85, + 94 + ], + "fingerprint": "2b94c0a5e62e" + } + ] + }, + { + "text": "SectionReport captures the per-section view: section descriptor, contributing file count, total findings count, body size in characters, an emptiness flag, and an optional quality critique.", + "sources": [ + { + "file": "wikifi/report.py", + "lines": [ + 28, + 42 + ], + "fingerprint": "2b94c0a5e62e" + } + ] + }, + { + "text": "WikiReport aggregates all SectionReport records alongside 
overall coverage statistics and an optional mean quality score.", + "sources": [ + { + "file": "wikifi/report.py", + "lines": [ + 28, + 42 + ], + "fingerprint": "2b94c0a5e62e" + } + ] + }, + { + "text": "IntrospectionResult captures the Stage 1 decision: include/exclude gitignore-style patterns, primary languages (informational), a one-paragraph purpose guess, and a rationale.", + "sources": [ + { + "file": "wikifi/introspection.py", + "lines": [ + 47, + 64 + ], + "fingerprint": "59cd5940f72e" + } + ] + }, + { + "text": "DerivationStats accumulates pipeline metrics for the derivation stage: derived, skipped, and revised counts, plus the full list of ReviewOutcome records.", + "sources": [ + { + "file": "wikifi/deriver.py", + "lines": [ + 57, + 62 + ], + "fingerprint": "0b7f4f5abb09" + } + ] + }, + { + "text": "WalkReport aggregates the introspection result, extraction statistics, aggregation statistics, derivation statistics, the live cache state, and the repository import graph; it is the single return value of a completed wiki-generation run.", + "sources": [ + { + "file": "wikifi/orchestrator.py", + "lines": [ + 54, + 61 + ], + "fingerprint": "6ed682a87356" + }, + { + "file": "wikifi/cli.py", + "lines": [ + 118, + 153 + ], + "fingerprint": "f326383c7da1" + } + ] + }, + { + "text": "ChatMessage carries a role and content field, representing a single turn in a multi-turn conversation.", + "sources": [ + { + "file": "wikifi/providers/base.py", + "lines": [ + 28, + 30 + ], + "fingerprint": "2750f0f56327" + } + ] + }, + { + "text": "LoadedSection pairs a Section descriptor with its rendered markdown body.", + "sources": [ + { + "file": "wikifi/chat.py", + "lines": [ + 42, + 45 + ], + "fingerprint": "0333e700a046" + } + ] + }, + { + "text": "ChatSession holds a provider reference, a frozen system prompt built from wiki sections, and an accumulated conversation history of ChatMessage records; it supports appending turns and clearing history while retaining wiki 
context.", + "sources": [ + { + "file": "wikifi/chat.py", + "lines": [ + 46, + 57 + ], + "fingerprint": "0333e700a046" + } + ] + }, + { + "text": "Specialised extractors surface domain entities from structured artifacts: SQL CREATE TABLE statements are treated as domain entities capturing table name, columns, foreign keys, and storage constraints; ALTER TABLE statements track schema evolution per entity.", + "sources": [ + { + "file": "wikifi/specialized/sql.py", + "lines": [ + 64, + 84 + ], + "fingerprint": "1ef5e77c4038" + }, + { + "file": "wikifi/specialized/sql.py", + "lines": [ + 99, + 111 + ], + "fingerprint": "1ef5e77c4038" + } + ] + }, + { + "text": "GraphQL object types (excluding operation roots), interfaces, input types, and enums are each treated as distinct entity-level constructs by the GraphQL specialised extractor.", + "sources": [ + { + "file": "wikifi/specialized/graphql.py", + "lines": [ + 32, + 81 + ], + "fingerprint": "bbb305e0d47f" + } + ] + }, + { + "text": "Protobuf message types and enum types are extracted from interface definition files, grouped by package namespace, with counts truncated after 25 items.", + "sources": [ + { + "file": "wikifi/specialized/protobuf.py", + "lines": [ + 44, + 68 + ], + "fingerprint": "e20d5913745a" + } + ] + }, + { + "text": "OpenAPI component schemas are extracted as canonical data models (up to 25, with overflow count) and surfaced as entity-level findings.", + "sources": [ + { + "file": "wikifi/specialized/openapi.py", + "lines": [ + 94, + 108 + ], + "fingerprint": "ae97781309c4" + } + ] + } + ], + "contradictions": [] + } + } +} \ No newline at end of file diff --git a/.wikifi/.cache/extraction.json b/.wikifi/.cache/extraction.json new file mode 100644 index 0000000..67b0d06 --- /dev/null +++ b/.wikifi/.cache/extraction.json @@ -0,0 +1,4568 @@ +{ + "version": 1, + "saved_at": "2026-05-02T02:17:19.874633+00:00", + "entries": { + ".env.example": { + "fingerprint": "2e493dbd2d87", + "summary": "Example 
environment configuration exposing all tuneable runtime parameters for the wikifi documentation-generation system.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "wikifi is a tool that generates wiki documentation from source codebases using a local large-language model. It is designed to prioritise documentation quality over processing speed, and includes guards to prevent runaway behaviour on near-empty or oversized files.", + "sources": [ + { + "file": ".env.example", + "lines": [ + 1, + 2 + ], + "fingerprint": "2e493dbd2d87" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system processes source files through at least two pipeline stages: a Stage 1 introspection pass (which receives a configurable directory-tree depth) and a per-file extraction pass. Files outside configurable size bounds are skipped to avoid wasting model inference time.", + "sources": [ + { + "file": ".env.example", + "lines": [ + 20, + 29 + ], + "fingerprint": "2e493dbd2d87" + } + ] + }, + { + "section_id": "external_dependencies", + "finding": "Ollama is the sole supported LLM provider in version 1, serving models locally over HTTP (default endpoint http://localhost:11434). The system is designed to work with reasoning-capable models such as Qwen3 and DeepSeek-R1, which support a 'thinking mode' that trades latency for output depth.", + "sources": [ + { + "file": ".env.example", + "lines": [ + 7, + 14 + ], + "fingerprint": "2e493dbd2d87" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Per-request timeouts (default 900 seconds) are set to accommodate high-thinking model runs on real source files. 
Minimum and maximum file-size thresholds act as integrity guards: the minimum prevents thinking-mode runaway on stub files, while the maximum prevents processing files too large to be useful.", + "sources": [ + { + "file": ".env.example", + "lines": [ + 16, + 29 + ], + "fingerprint": "2e493dbd2d87" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Only the 'ollama' provider is supported in v1. The default request timeout is 900 seconds (15 minutes). Fully disabling thinking mode ('false') is documented as unsafe with Qwen3 models because those models ignore the JSON-schema output constraint and emit free text instead.", + "sources": [ + { + "file": ".env.example", + "lines": [ + 7, + 44 + ], + "fingerprint": "2e493dbd2d87" + } + ] + } + ] + }, + ".gitignore": { + "fingerprint": "493b2310ee7c", + "summary": "Standard version-control ignore file for a full-stack project with backend and frontend components.", + "chunks_processed": 1, + "findings": [] + }, + ".mcp.json": { + "fingerprint": "b6b856cb3fe2", + "summary": "MCP server configuration wiring together several external tool/API integrations used during development or runtime.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "external_dependencies", + "finding": "A locally-running web-crawling service is depended upon at a fixed local address (port 3002), requiring no API key, suggesting a self-hosted crawling capability used by the system.", + "sources": [ + { + "file": ".mcp.json", + "lines": [ + 14, + 20 + ], + "fingerprint": "b6b856cb3fe2" + } + ] + }, + { + "section_id": "external_dependencies", + "finding": "Google's AI/generative API is consumed under the key named GOOGLE_API_KEY, used by at least two registered server integrations (nano-banana and stitch).", + "sources": [ + { + "file": ".mcp.json", + "lines": [ + 4, + 8 + ], + "fingerprint": "b6b856cb3fe2" + } + ] + }, + { + "section_id": "external_dependencies", + "finding": "An external documentation/context lookup 
service (context7) is called over HTTP using a dedicated API key, likely to enrich prompts or retrieve up-to-date library documentation.", + "sources": [ + { + "file": ".mcp.json", + "lines": [ + 22, + 28 + ], + "fingerprint": "b6b856cb3fe2" + } + ] + }, + { + "section_id": "external_dependencies", + "finding": "A Google-hosted MCP-compatible service called 'stitch' is consumed over HTTP, authenticated via the Google API key, purpose not fully specified but likely an orchestration or data-stitching capability.", + "sources": [ + { + "file": ".mcp.json", + "lines": [ + 29, + 35 + ], + "fingerprint": "b6b856cb3fe2" + } + ] + }, + { + "section_id": "integrations", + "finding": "Four tool-server integrations are declared: a local banana/AI utility, a local web crawler, a remote documentation context service, and a remote Google stitching service — suggesting the system acts as an MCP client that fans out to multiple capability providers.", + "sources": [ + { + "file": ".mcp.json", + "lines": [ + 2, + 36 + ], + "fingerprint": "b6b856cb3fe2" + } + ] + } + ] + }, + "CLAUDE.md": { + "fingerprint": "ac9698d91de6", + "summary": "Developer and agent operating guide for the wikifi CLI library, capturing tooling rules, code constraints, architectural invariants, and workflow conventions.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "wikifi exists to describe what a legacy system does — producing a technology-agnostic wiki of its capabilities and domain model — so that migration teams can consume that knowledge without the tool itself prescribing any target architecture, language, or framework.", + "sources": [ + { + "file": "CLAUDE.md", + "lines": [ + 73, + 75 + ], + "fingerprint": "ac9698d91de6" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system exposes two primary commands: `init` (initialise wiki scaffolding against a repository) and `walk` (traverse the repository, extract per-file findings, and synthesise wiki 
sections). The walk is responsible for repository introspection, empty-file filtering, deterministic per-file extraction, and multi-file synthesis — in that order.", + "sources": [ + { + "file": "CLAUDE.md", + "lines": [ + 60, + 72 + ], + "fingerprint": "ac9698d91de6" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "The system must run against a local LLM out of the box with no cloud dependency required; hosted backends (Anthropic, OpenAI, custom) are valid additional options but never the default.", + "sources": [ + { + "file": "CLAUDE.md", + "lines": [ + 51, + 52 + ], + "fingerprint": "ac9698d91de6" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Provider abstraction is mandatory: swapping the LLM backend must not require changes outside the provider boundary.", + "sources": [ + { + "file": "CLAUDE.md", + "lines": [ + 53, + 54 + ], + "fingerprint": "ac9698d91de6" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "When the chosen model exposes a reasoning or thinking level, the system must run at the highest available setting; lower reasoning levels are opt-in only.", + "sources": [ + { + "file": "CLAUDE.md", + "lines": [ + 55, + 56 + ], + "fingerprint": "ac9698d91de6" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Test coverage target is ≥ 85%; every feature must ship with tests.", + "sources": [ + { + "file": "CLAUDE.md", + "lines": [ + 45, + 46 + ], + "fingerprint": "ac9698d91de6" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "wikifi is strictly a feature-extraction tool: it describes what the legacy system does and must never transform source into any target architecture, language, or framework shape.", + "sources": [ + { + "file": "CLAUDE.md", + "lines": [ + 73, + 75 + ], + "fingerprint": "ac9698d91de6" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Derivative wiki sections (personas, user stories, diagrams) must be produced only 
after primary content sections are complete and must never be inferred from a single file.", + "sources": [ + { + "file": "CLAUDE.md", + "lines": [ + 66, + 72 + ], + "fingerprint": "ac9698d91de6" + } + ] + }, + { + "section_id": "external_dependencies", + "finding": "A local LLM runtime (e.g. Ollama) is the default inference backend, requiring no external network dependency for core operation.", + "sources": [ + { + "file": "CLAUDE.md", + "lines": [ + 51, + 52 + ], + "fingerprint": "ac9698d91de6" + } + ] + }, + { + "section_id": "external_dependencies", + "finding": "Hosted Anthropic and hosted OpenAI are supported as optional alternative inference backends, reachable through the mandatory provider abstraction layer.", + "sources": [ + { + "file": "CLAUDE.md", + "lines": [ + 53, + 54 + ], + "fingerprint": "ac9698d91de6" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "A pre-commit hook auto-fixes lint and re-stages changed files; a pre-push hook runs the full test suite and gates the push, ensuring the main branch remains deployable at all times.", + "sources": [ + { + "file": "CLAUDE.md", + "lines": [ + 47, + 48 + ], + "fingerprint": "ac9698d91de6" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "A mandatory provider abstraction layer insulates all LLM calls, so observability, retry logic, and backend-switching concerns are centralised at that boundary rather than scattered across extraction logic.", + "sources": [ + { + "file": "CLAUDE.md", + "lines": [ + 53, + 54 + ], + "fingerprint": "ac9698d91de6" + } + ] + } + ] + }, + "CODE-FORMAT.md": { + "fingerprint": "b5e0603faf44", + "summary": "Project conventions and tooling guide defining how software is built, structured, tested, and deployed — serves as the single source of truth for agents and humans working on any project.", + "chunks_processed": 1, + "findings": [] + }, + "README.md": { + "fingerprint": "996c401d036d", + "summary": "Top-level README describing wikifi's purpose, CLI 
surface, architecture, and configuration as a codebase analysis tool that produces technology-agnostic domain and feature wikis.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "wikifi exists to walk an arbitrary codebase and produce a technology-agnostic extraction of its features, domains, and delivered value — so that a new modern implementation can be built that fully retains the functionality and value the original system provides to its users.", + "sources": [ + { + "file": "README.md", + "lines": [ + 3, + 3 + ], + "fingerprint": "996c401d036d" + } + ] + }, + { + "section_id": "domains", + "finding": "The core domain is codebase knowledge extraction: ingesting source files, classifying them, extracting domain findings per file, and synthesising those findings into structured wiki sections. Subdomains include repository introspection, static import-graph analysis, LLM-backed extraction, section synthesis, quality critique, and coverage reporting.", + "sources": [ + { + "file": "README.md", + "lines": [ + 28, + 52 + ], + "fingerprint": "996c401d036d" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system can initialise a wiki workspace for a target project, walk the target codebase to extract per-file domain findings, synthesise primary wiki sections from accumulated findings, and then derive higher-level artefacts (personas, user stories, architecture diagrams) from the primary content.", + "sources": [ + { + "file": "README.md", + "lines": [ + 14, + 24 + ], + "fingerprint": "996c401d036d" + } + ] + }, + { + "section_id": "capabilities", + "finding": "Users can force a clean re-walk, run an opt-in quality critique and revision loop on derivative sections, and override the configured LLM provider at invocation time.", + "sources": [ + { + "file": "README.md", + "lines": [ + 16, + 20 + ], + "fingerprint": "996c401d036d" + } + ] + }, + { + "section_id": "capabilities", + "finding": "A coverage and quality 
report shows per-section file counts, finding counts, body sizes, and optionally critic-derived quality scores for every populated section.", + "sources": [ + { + "file": "README.md", + "lines": [ + 21, + 23 + ], + "fingerprint": "996c401d036d" + } + ] + }, + { + "section_id": "capabilities", + "finding": "Natural-language querying and an interactive chat REPL allow iterative exploration of the extracted wiki content alongside the target codebase.", + "sources": [ + { + "file": "README.md", + "lines": [ + 24, + 25 + ], + "fingerprint": "996c401d036d" + } + ] + }, + { + "section_id": "capabilities", + "finding": "Schema files (SQL, OpenAPI, Protobuf, GraphQL, migrations) are processed through deterministic parsers rather than an LLM, producing the same structured findings as LLM extraction without consuming model tokens.", + "sources": [ + { + "file": "README.md", + "lines": [ + 34, + 36 + ], + "fingerprint": "996c401d036d" + } + ] + }, + { + "section_id": "external_dependencies", + "finding": "A local Ollama server is the default LLM runtime, used for all per-file extraction and synthesis calls; the default model is a thinking-capable model run at the highest available reasoning level.", + "sources": [ + { + "file": "README.md", + "lines": [ + 56, + 57 + ], + "fingerprint": "996c401d036d" + } + ] + }, + { + "section_id": "external_dependencies", + "finding": "Anthropic's hosted API is an opt-in LLM backend; it uses prompt caching with ephemeral cache-control markers on the system prompt so the large extraction prompt is paid for only once across hundreds of per-file calls.", + "sources": [ + { + "file": "README.md", + "lines": [ + 48, + 49 + ], + "fingerprint": "996c401d036d" + } + ] + }, + { + "section_id": "external_dependencies", + "finding": "OpenAI's hosted API is an opt-in LLM backend; it relies on automatic prefix caching and routes a 'think' knob to reasoning-effort on reasoning-capable models.", + "sources": [ + { + "file": "README.md", + "lines": [ + 50, 
+ 51 + ], + "fingerprint": "996c401d036d" + } + ] + }, + { + "section_id": "external_dependencies", + "finding": "GitHub Actions provides the continuous integration pipeline.", + "sources": [ + { + "file": "README.md", + "lines": [ + 61, + 61 + ], + "fingerprint": "996c401d036d" + } + ] + }, + { + "section_id": "integrations", + "finding": "wikifi is distributed as a library installed into a target project and invoked as a CLI from that project's root; it reads the target's source tree and writes its output into a `.wikifi/` directory within that project.", + "sources": [ + { + "file": "README.md", + "lines": [ + 8, + 12 + ], + "fingerprint": "996c401d036d" + } + ] + }, + { + "section_id": "integrations", + "finding": "The LLM backend is reached through a provider abstraction; Ollama, Anthropic, and OpenAI backends slot in without changing the rest of the pipeline, selectable via an environment variable or a per-invocation flag.", + "sources": [ + { + "file": "README.md", + "lines": [ + 46, + 51 + ], + "fingerprint": "996c401d036d" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Extraction results are content-addressed: each file's findings are keyed by the combination of its relative path and the SHA-256 hash of its bytes, so re-walks skip unchanged files automatically and the walk is resumable after a crash.", + "sources": [ + { + "file": "README.md", + "lines": [ + 40, + 43 + ], + "fingerprint": "996c401d036d" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Aggregation bodies are keyed by a hash of the section's notes payload, so synthesis is also skipped when inputs have not changed.", + "sources": [ + { + "file": "README.md", + "lines": [ + 40, + 43 + ], + "fingerprint": "996c401d036d" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Input filtering discards unstructured, near-empty, or machine-generated files before they reach the LLM; an invariant is that empty input must never stall a walk.", + "sources": [ + { 
+ "file": "README.md", + "lines": [ + 44, + 46 + ], + "fingerprint": "996c401d036d" + } + ] + }, + { + "section_id": "entities", + "finding": "A per-file finding captures a single file's contribution to one wiki section and carries a structured source reference (file path, line range, content fingerprint) for downstream citation.", + "sources": [ + { + "file": "README.md", + "lines": [ + 37, + 39 + ], + "fingerprint": "996c401d036d" + } + ] + }, + { + "section_id": "entities", + "finding": "An EvidenceBundle is the output of section synthesis: it contains the section body, the supporting claims, and any contradictions found across source findings; the renderer uses it to thread numbered citations and a conflicts block into the final markdown.", + "sources": [ + { + "file": "README.md", + "lines": [ + 46, + 48 + ], + "fingerprint": "996c401d036d" + } + ] + }, + { + "section_id": "entities", + "finding": "FileKind classifies each in-scope file as one of: application code, SQL, OpenAPI, Protobuf, GraphQL, migration, or other; this classification drives routing to LLM extraction or a deterministic parser.", + "sources": [ + { + "file": "README.md", + "lines": [ + 31, + 33 + ], + "fingerprint": "996c401d036d" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "The critic-reviser loop must only accept a revised section if its quality score is at least as high as the score of the original; downgrades are rejected.", + "sources": [ + { + "file": "README.md", + "lines": [ + 53, + 55 + ], + "fingerprint": "996c401d036d" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Empty or near-empty input files must never stall the walk; the walker is required to filter them out before any LLM call is made.", + "sources": [ + { + "file": "README.md", + "lines": [ + 44, + 45 + ], + "fingerprint": "996c401d036d" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Every per-file finding must carry a structured SourceRef (file, line 
range, content fingerprint) to support citation in the rendered wiki.", + "sources": [ + { + "file": "README.md", + "lines": [ + 37, + 39 + ], + "fingerprint": "996c401d036d" + } + ] + } + ] + }, + "TESTING-AND-DEMO.md": { + "fingerprint": "3b93f710ebca", + "summary": "Developer-facing documentation covering how to test and demonstrate the premium pipeline features of the wikifi wiki-generation tool, including setup, feature verification steps, and teardown instructions.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "The system exists to automatically generate technology-agnostic wiki documentation from a source-code repository. It extracts domain knowledge, entities, capabilities, and cross-cutting concerns from code files using a combination of deterministic parsers and language-model-based extraction, then aggregates and refines the results into committed markdown sections.", + "sources": [ + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 1, + 6 + ], + "fingerprint": "3b93f710ebca" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system performs incremental, content-addressed extraction walks over a repository, serving previously processed files from a persistent cache and skipping redundant language-model calls. A walk can be interrupted and resumed without losing progress, since the cache is flushed after every completed file.", + "sources": [ + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 67, + 88 + ], + "fingerprint": "3b93f710ebca" + } + ] + }, + { + "section_id": "capabilities", + "finding": "Each generated wiki section carries source-traceable citations (file path and line range) so readers can verify where each claim originates. 
Where the aggregation step detects disagreement across files, the section also renders a Conflicts block enumerating each conflicting position alongside its sources.", + "sources": [ + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 40, + 66 + ], + "fingerprint": "3b93f710ebca" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The walker builds an import graph across the repository and injects a Neighbor files block into each extraction prompt, giving the language model cross-file context about which modules a file imports from and which modules depend on it.", + "sources": [ + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 90, + 114 + ], + "fingerprint": "3b93f710ebca" + } + ] + }, + { + "section_id": "capabilities", + "finding": "Specialized, deterministic parsers handle structured schema files (SQL DDL, Protobuf, GraphQL, OpenAPI YAML/JSON, and migration scripts) without invoking a language model, producing findings for entities, integrations, and cross-cutting invariants directly from the schema syntax.", + "sources": [ + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 116, + 149 + ], + "fingerprint": "3b93f710ebca" + } + ] + }, + { + "section_id": "capabilities", + "finding": "An optional critic-and-reviser pass evaluates derivative sections (personas, user stories, diagrams) against a quality threshold and rewrites any that fall below it, accepting the revision only when it scores at least as well as the original.", + "sources": [ + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 151, + 164 + ], + "fingerprint": "3b93f710ebca" + } + ] + }, + { + "section_id": "capabilities", + "finding": "A report command produces a markdown table summarising every wiki section by contributing file count, finding count, body size, critic-derived quality score (0–10), and the highest-priority content gap identified by the critic.", + "sources": [ + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 166, + 186 + ], + "fingerprint": "3b93f710ebca" + } + ] + 
}, + { + "section_id": "external_dependencies", + "finding": "Ollama serves as the default local language-model backend; the model is configured via the repository's config file (defaulting to qwen3.6:27b). No external service or API key is required to use this path.", + "sources": [ + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 27, + 32 + ], + "fingerprint": "3b93f710ebca" + } + ] + }, + { + "section_id": "external_dependencies", + "finding": "The Anthropic API is an opt-in hosted language-model backend. The system sets an ephemeral cache-control marker on the system-prompt block so that repeated per-file extraction calls read the cached prompt at roughly 10% of the normal input token price.", + "sources": [ + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 188, + 209 + ], + "fingerprint": "3b93f710ebca" + } + ] + }, + { + "section_id": "external_dependencies", + "finding": "The OpenAI API is a second opt-in hosted backend. It relies on OpenAI's automatic prefix caching (no explicit marker required; prefixes of at least 1024 tokens are cached for approximately 5–10 minutes). 
The integration also routes a 'think' intensity knob to the reasoning_effort parameter on reasoning-capable models (o-series and gpt-5), while omitting that parameter for standard models.", + "sources": [ + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 210, + 236 + ], + "fingerprint": "3b93f710ebca" + } + ] + }, + { + "section_id": "integrations", + "finding": "Azure OpenAI deployments and corporate reverse-proxy endpoints are supported by overriding the base URL for the OpenAI provider, either via an environment variable or a constructor parameter, with no other changes to the calling code.", + "sources": [ + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 232, + 235 + ], + "fingerprint": "3b93f710ebca" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Extraction results are persisted in a content-addressed cache under .wikifi/.cache/ and are written after every individual file completes. This ensures that a process crash or manual interruption at any point does not require re-processing already-completed files.", + "sources": [ + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 67, + 88 + ], + "fingerprint": "3b93f710ebca" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "The full pipeline state is isolated to the .wikifi/ directory: committed markdown sections live at the root of that directory, per-section JSONL findings are stored in .notes/ (gitignored), and extraction/aggregation caches are stored in .cache/ (gitignored). Deleting .cache/ forces a full re-walk; deleting the entire directory resets all state.", + "sources": [ + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 249, + 265 + ], + "fingerprint": "3b93f710ebca" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "The test suite must include exactly 156 passing tests with total line coverage at or above 93%. 
Every new module must individually reach at least 86% coverage, and each premium-pipeline module (fingerprint, cache, evidence, critic, report, repograph, specialized parsers, and the Anthropic provider) must carry a dedicated test file.", + "sources": [ + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 20, + 30 + ], + "fingerprint": "3b93f710ebca" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "The Anthropic provider must place cache_control of type 'ephemeral' on the system-prompt block, use the messages.parse structured-output contract, translate the 'think' intensity setting to an effort level, and map API errors to a RuntimeError. These behaviors are locked in by the provider's dedicated test file.", + "sources": [ + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 237, + 242 + ], + "fingerprint": "3b93f710ebca" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "The OpenAI provider must use the chat.completions.parse structured-output contract, route reasoning_effort only to o-series and gpt-5 models (not standard models), swap max_tokens for max_completion_tokens on reasoning models, and map API errors to RuntimeError. 
OpenAI's automatic prefix caching applies to prefixes of at least 1024 tokens and lasts approximately 5–10 minutes.", + "sources": [ + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 215, + 230 + ], + "fingerprint": "3b93f710ebca" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "The critic-reviser loop must only accept a revised derivative section if the revision scores at least as well as the original; a revision that scores lower must be discarded.", + "sources": [ + { + "file": "TESTING-AND-DEMO.md", + "lines": [ + 158, + 163 + ], + "fingerprint": "3b93f710ebca" + } + ] + } + ] + }, + "VISION.md": { + "fingerprint": "10651b456a64", + "summary": "VISION.md defines wikifi's purpose, scope, operational requirements, and success criteria as a technology-agnostic codebase-to-wiki extraction tool for legacy migration teams.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "wikifi exists because the intent of legacy systems is locked inside their implementation choices. A migration team needs a description of *what the system does and why*, decoupled from *how it currently does it*, so they can re-implement on a fresh stack without recreating the legacy system's structure. 
The goal is to make legacy intent explicit, complete, and portable.", + "sources": [ + { + "file": "VISION.md", + "lines": [ + 3, + 9 + ], + "fingerprint": "10651b456a64" + } + ] + }, + { + "section_id": "intent", + "finding": "wikifi is explicitly a feature-extraction tool, not a transposition tool — it surfaces what a legacy system does and leaves the act of reshaping it to a target architecture entirely to the migration team.", + "sources": [ + { + "file": "VISION.md", + "lines": [ + 86, + 89 + ], + "fingerprint": "10651b456a64" + } + ] + }, + { + "section_id": "capabilities", + "finding": "wikifi walks a target codebase, uses an AI agent to extract domain knowledge from each source file, and writes a technology-agnostic wiki covering DDD domains, intent, capabilities, external dependencies, integrations, cross-cutting concerns, core entities, and hard specifications.", + "sources": [ + { + "file": "VISION.md", + "lines": [ + 6, + 8 + ], + "fingerprint": "10651b456a64" + } + ] + }, + { + "section_id": "capabilities", + "finding": "After primary per-file capture is complete, wikifi performs a derivative synthesis pass that produces user personas (with intent, needs, pain points, and usage patterns), Gherkin-style user stories keyed to those personas, and aggregate system diagrams — none of which can be inferred from any single source file.", + "sources": [ + { + "file": "VISION.md", + "lines": [ + 53, + 63 + ], + "fingerprint": "10651b456a64" + } + ] + }, + { + "section_id": "capabilities", + "finding": "wikifi exposes a CLI interface for interacting with the generated wiki; an MCP interface is identified as in-scope for a follow-up release.", + "sources": [ + { + "file": "VISION.md", + "lines": [ + 79, + 80 + ], + "fingerprint": "10651b456a64" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "The agent must run against a local LLM by default with no cloud dependency; hosted backends are valid additional options but not the default. 
The LLM backend must be reachable through a provider abstraction layer so it can be swapped (local Ollama, hosted Anthropic, hosted OpenAI, or custom) without changes outside the provider boundary.", + "sources": [ + { + "file": "VISION.md", + "lines": [ + 92, + 96 + ], + "fingerprint": "10651b456a64" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "When the chosen model exposes a thinking/reasoning level, the agent runs at the highest available setting, prioritising wiki quality over walk speed.", + "sources": [ + { + "file": "VISION.md", + "lines": [ + 97, + 98 + ], + "fingerprint": "10651b456a64" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "The walker must recognise and skip files carrying no extractable intent (stub init files, empty fixtures, generated lockfiles, and similar) before they reach the agent; a single empty or unstructured file must never stall the walk.", + "sources": [ + { + "file": "VISION.md", + "lines": [ + 99, + 100 + ], + "fingerprint": "10651b456a64" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "The generated wiki must at minimum contain: DDD domains and subdomains, system intent, domain-level capabilities, external-system dependencies, internal and external integrations, cross-cutting concerns, core entities and their structures, and hard specifications — regardless of the on-disk layout chosen by the implementor.", + "sources": [ + { + "file": "VISION.md", + "lines": [ + 26, + 47 + ], + "fingerprint": "10651b456a64" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Derivative wiki sections (user personas, user stories, aggregate diagrams) must be produced in a step that runs *after* primary capture and must never be inferred from a single source file.", + "sources": [ + { + "file": "VISION.md", + "lines": [ + 50, + 63 + ], + "fingerprint": "10651b456a64" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Wiki content is stored in the target 
project's `.wikifi/` directory; the contract is the content the wiki conveys, not its on-disk shape or file structure within that directory.", + "sources": [ + { + "file": "VISION.md", + "lines": [ + 73, + 76 + ], + "fingerprint": "10651b456a64" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Success is defined as: a migration team working from the wiki alone — without reference to the original codebase — can deliver a microservice re-implementation that preserves the original system's personas, problem space, integrations, cross-cutting concerns, entities, data patterns, and user value.", + "sources": [ + { + "file": "VISION.md", + "lines": [ + 103, + 105 + ], + "fingerprint": "10651b456a64" + } + ] + }, + { + "section_id": "domains", + "finding": "wikifi's core domain is legacy-system knowledge extraction: capturing business intent, domain structure, and operational behaviour from an existing codebase and representing it as a technology-agnostic wiki. A secondary domain is wiki authoring and organisation, governing how extracted knowledge is structured and stored for consumption by a migration team.", + "sources": [ + { + "file": "VISION.md", + "lines": [ + 3, + 20 + ], + "fingerprint": "10651b456a64" + } + ] + } + ] + }, + "wikifi/aggregator.py": { + "fingerprint": "c5f76cb7c4a3", + "summary": "Stage 3 of the wiki-generation pipeline: synthesises per-section wiki content from accumulated file-level notes using an LLM, attaches structured evidence (citations + contradictions), and writes the final section body.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "The aggregator exists to turn many small, file-scoped observations into a coherent, tech-agnostic wiki section — while refusing to silently hide disagreements between sources. 
The system prompt makes explicit that contradictions across files must be surfaced as named conflicts rather than merged away, and that every claim must trace back to the specific source files that justified it.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 1, + 15 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system synthesises notes collected from individual source files into readable markdown bodies for each primary wiki section, with every asserted claim backed by numbered citations pointing to the originating files and optional line ranges.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 1, + 15 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "section_id": "capabilities", + "finding": "When two or more files make incompatible assertions about the same domain topic, the system surfaces the conflict explicitly under a 'Conflicts in source' heading rather than silently choosing one position — a deliberate feature for legacy codebases where tribal knowledge hides in inconsistencies.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 9, + 14 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "section_id": "capabilities", + "finding": "A section-level cache compares a digest of the current note payload against the previous walk; if the notes are unchanged, the prior rendered body and evidence bundle are reused without re-invoking the LLM, saving cost and latency on incremental runs.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 15, + 17 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "section_id": "capabilities", + "finding": "When LLM synthesis fails, the system falls back to emitting the raw notes directly in the section body, preserving information at the cost of polish and providing the error message inline.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 272, + 285 + ], + 
"fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "section_id": "entities", + "finding": "An `AggregatedClaim` pairs a single prose assertion with the 1-based indices of the input notes that support it. A `SectionBody` groups a markdown body string with a list of such claims and a list of `AggregatedContradiction` records, each contradiction holding a one-sentence summary and multiple conflicting claim positions.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 74, + 101 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "section_id": "entities", + "finding": "An `AggregationStats` record tracks how many sections were written fresh, skipped due to empty notes, or served from cache during a single aggregation run.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 103, + 107 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "section_id": "entities", + "finding": "A `SourceRef` links a claim back to a specific file, optionally scoped to a line range. Multiple `SourceRef` values are coalesced before being attached to a `Claim`; a `Claim` is the evidence-layer representation of one assertion with its resolved file sources. 
An `EvidenceBundle` carries the final body, claims, and contradictions for a section.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 166, + 186 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Section-level caching uses a deterministic digest of the note payload; hits reuse the stored body and evidence bundle without any LLM call, and misses record the fresh result back to the cache for the next walk.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 126, + 155 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "All aggregation failures are logged at WARNING level and produce a fallback body that preserves the raw notes, ensuring a section is always written even when the LLM call fails.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 143, + 152 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "The aggregator enforces a tech-agnostic invariant at the prompt level: the LLM is explicitly instructed never to name languages, frameworks, or libraries in the synthesised output, translating all observations into domain terms.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 54, + 67 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "section_id": "integrations", + "finding": "The aggregator reads accumulated per-file notes from the wiki layout store (via `read_notes`) and writes finished section bodies back to it (via `write_section`), acting as the bridge between the extraction and rendering stages.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 109, + 160 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "section_id": "integrations", + "finding": "The aggregator calls the LLM provider's structured-JSON completion endpoint, passing a system prompt and a rendered user prompt, and expects the response to 
conform to the `SectionBody` schema.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 136, + 141 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "section_id": "integrations", + "finding": "Derivative sections (personas, user stories, diagrams) are explicitly excluded from this stage and are instead populated by a separate deriver stage that runs afterwards, indicating a two-stage downstream pipeline.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 111, + 116 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Contradictions between source notes must never be silently resolved: any incompatible claims must produce a `contradictions[]` entry naming each position and the note indices that support it. This is stated as a hard rule in the LLM system prompt and enforced structurally via the `AggregatedContradiction` schema.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 61, + 63 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Wiki section bodies must be tech-agnostic: no mention of specific languages, frameworks, or libraries is permitted in synthesised output; every observation must be translated into domain terms.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 57, + 59 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Note indices presented to the LLM are 1-based, and the resolution logic subtracts 1 before indexing into the notes list — an off-by-one invariant that must be preserved if the prompting scheme changes.", + "sources": [ + { + "file": "wikifi/aggregator.py", + "lines": [ + 167, + 173 + ], + "fingerprint": "c5f76cb7c4a3" + } + ] + } + ] + }, + "wikifi/cache.py": { + "fingerprint": "1ba541fe863d", + "summary": "Implements a two-scope content-addressed cache that lets the documentation walk 
pipeline skip unchanged files and unchanged section aggregations, providing both speed and resumability for large codebases.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "The system is designed to handle very large legacy codebases (example cited: 50,000 files) where regenerating documentation on every run would take hours. The cache layer reduces repeat runs to processing only changed files, and provides free resumability so an interrupted walk restarts from the last completed file rather than from scratch.", + "sources": [ + { + "file": "wikifi/cache.py", + "lines": [ + 1, + 21 + ], + "fingerprint": "1ba541fe863d" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The pipeline caches per-file extraction results keyed by the combination of relative file path and a content fingerprint, so files whose bytes have not changed since the last run reuse their previous structured findings without incurring an AI extraction call.", + "sources": [ + { + "file": "wikifi/cache.py", + "lines": [ + 5, + 8 + ], + "fingerprint": "1ba541fe863d" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The pipeline also caches per-section aggregation results keyed by a stable digest of the section's notes payload. 
If all contributing file findings are identical to the previous run, the cached rendered section body is reused without re-invoking the aggregator.", + "sources": [ + { + "file": "wikifi/cache.py", + "lines": [ + 9, + 12 + ], + "fingerprint": "1ba541fe863d" + } + ] + }, + { + "section_id": "capabilities", + "finding": "Interrupted walks are automatically resumable: because per-file results are persisted incrementally, a walk that fails partway through resumes from the last unprocessed file on the next invocation.", + "sources": [ + { + "file": "wikifi/cache.py", + "lines": [ + 14, + 18 + ], + "fingerprint": "1ba541fe863d" + } + ] + }, + { + "section_id": "capabilities", + "finding": "Stale cache entries for files that are no longer in scope can be pruned, and the entire cache can be reset (e.g., via a `--no-cache` flag) to force a full fresh walk.", + "sources": [ + { + "file": "wikifi/cache.py", + "lines": [ + 105, + 113 + ], + "fingerprint": "1ba541fe863d" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Cache writes are performed atomically by writing to a temporary file then renaming it into place, preventing corrupt cache state from a partial write.", + "sources": [ + { + "file": "wikifi/cache.py", + "lines": [ + 189, + 193 + ], + "fingerprint": "1ba541fe863d" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "A monotonically increasing cache version number is embedded in every persisted cache file; any version mismatch causes the entire cache to be silently discarded and rebuilt, providing a controlled invalidation mechanism across software upgrades.", + "sources": [ + { + "file": "wikifi/cache.py", + "lines": [ + 38, + 38 + ], + "fingerprint": "1ba541fe863d" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Malformed individual cache entries are logged as warnings and silently dropped rather than causing a hard failure, so a partially corrupt cache degrades gracefully to a fresh extraction for only the affected 
entries.", + "sources": [ + { + "file": "wikifi/cache.py", + "lines": [ + 196, + 222 + ], + "fingerprint": "1ba541fe863d" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Cache files are stored under a dedicated hidden subdirectory within the wiki output directory so they inherit the same version-control ignore rules as the rest of the tool's working state, keeping generated documentation commits clean.", + "sources": [ + { + "file": "wikifi/cache.py", + "lines": [ + 19, + 21 + ], + "fingerprint": "1ba541fe863d" + } + ] + }, + { + "section_id": "entities", + "finding": "A `CachedFindings` entity represents the extraction result for a single file: it holds the file's content fingerprint, the list of structured findings produced, a one-sentence summary, and a count of how many chunks were processed.", + "sources": [ + { + "file": "wikifi/cache.py", + "lines": [ + 44, + 51 + ], + "fingerprint": "1ba541fe863d" + } + ] + }, + { + "section_id": "entities", + "finding": "A `CachedSection` entity represents the aggregation result for a single wiki section: it holds the hash of the notes payload that produced it, the rendered markdown body, and lists of claims and contradictions identified during aggregation.", + "sources": [ + { + "file": "wikifi/cache.py", + "lines": [ + 54, + 60 + ], + "fingerprint": "1ba541fe863d" + } + ] + }, + { + "section_id": "entities", + "finding": "A `WalkCache` entity is the in-memory container for both caches, tracking extraction and aggregation entries alongside hit and miss counters for observability into cache effectiveness.", + "sources": [ + { + "file": "wikifi/cache.py", + "lines": [ + 63, + 70 + ], + "fingerprint": "1ba541fe863d" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "The aggregation cache key is computed only over content-bearing fields (file reference, summary, finding text) and explicitly excludes timestamps and per-walk debug fields, ensuring that regenerating identical notes on a fresh 
walk always produces a cache hit.", + "sources": [ + { + "file": "wikifi/cache.py", + "lines": [ + 238, + 251 + ], + "fingerprint": "1ba541fe863d" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Cache files must reside at `.wikifi/.cache/extraction.json` and `.wikifi/.cache/aggregation.json` relative to the wiki directory root.", + "sources": [ + { + "file": "wikifi/cache.py", + "lines": [ + 33, + 36 + ], + "fingerprint": "1ba541fe863d" + } + ] + }, + { + "section_id": "integrations", + "finding": "This module depends on the fingerprinting service (imported from `wikifi/fingerprint.py`) to compute content hashes used as cache keys for both extraction and aggregation scopes.", + "sources": [ + { + "file": "wikifi/cache.py", + "lines": [ + 244, + 246 + ], + "fingerprint": "1ba541fe863d" + } + ] + }, + { + "section_id": "integrations", + "finding": "The cache is consumed by the extractor, aggregator, and orchestrator (all neighbor files) to gate whether AI calls are needed; the wiki layout structure (from `wikifi/wiki.py`) determines where cache files are persisted on disk.", + "sources": [ + { + "file": "wikifi/cache.py", + "lines": [ + 30, + 30 + ], + "fingerprint": "1ba541fe863d" + } + ] + } + ] + }, + "wikifi/chat.py": { + "fingerprint": "0333e700a046", + "summary": "Implements an interactive multi-turn chat session grounded in the populated wiki sections of a target project, allowing users to query the wiki content conversationally.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "The chat module exists so that users can hold a natural-language conversation about a codebase's wiki, with every answer grounded in the extracted wiki sections rather than invented detail. 
It explicitly instructs the assistant to cite section names and admit gaps rather than fabricate information.", + "sources": [ + { + "file": "wikifi/chat.py", + "lines": [ + 1, + 32 + ], + "fingerprint": "0333e700a046" + } + ] + }, + { + "section_id": "capabilities", + "finding": "Users can launch an interactive session that loads all populated wiki sections as context, then send multi-turn messages and receive responses grounded in that context. The session supports conversation history reset (clearing turns while keeping wiki context), listing which sections are loaded, and graceful exit — all via slash commands.", + "sources": [ + { + "file": "wikifi/chat.py", + "lines": [ + 88, + 130 + ], + "fingerprint": "0333e700a046" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system filters out unpopulated or placeholder wiki sections before building the context bundle, ensuring the assistant is only grounded in meaningful content.", + "sources": [ + { + "file": "wikifi/chat.py", + "lines": [ + 63, + 82 + ], + "fingerprint": "0333e700a046" + } + ] + }, + { + "section_id": "entities", + "finding": "A `ChatSession` entity holds the LLM provider reference, the frozen system prompt built from wiki sections, and the accumulated conversation history (ordered list of role/content message pairs). 
It supports appending user and assistant turns and clearing history while retaining the wiki context.", + "sources": [ + { + "file": "wikifi/chat.py", + "lines": [ + 46, + 57 + ], + "fingerprint": "0333e700a046" + } + ] + }, + { + "section_id": "entities", + "finding": "A `LoadedSection` entity pairs a wiki Section descriptor with its markdown body text, representing a single populated section ready for inclusion in a chat context.", + "sources": [ + { + "file": "wikifi/chat.py", + "lines": [ + 42, + 45 + ], + "fingerprint": "0333e700a046" + } + ] + }, + { + "section_id": "integrations", + "finding": "The chat session delegates all LLM inference to the configured provider via a `chat` call, passing the system prompt and accumulated message history. The provider abstraction is sourced from `wikifi/providers/base.py`, keeping the chat logic decoupled from any specific LLM service.", + "sources": [ + { + "file": "wikifi/chat.py", + "lines": [ + 52, + 55 + ], + "fingerprint": "0333e700a046" + } + ] + }, + { + "section_id": "integrations", + "finding": "Wiki section content is read directly from the `.wikifi/` directory on disk using the layout abstraction from `wikifi/wiki.py`, and section metadata is sourced from `wikifi/sections.py`.", + "sources": [ + { + "file": "wikifi/chat.py", + "lines": [ + 63, + 82 + ], + "fingerprint": "0333e700a046" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Provider failures during a chat turn are caught and surfaced as inline error messages rather than crashing the REPL, ensuring a single failed inference call does not terminate the session.", + "sources": [ + { + "file": "wikifi/chat.py", + "lines": [ + 120, + 125 + ], + "fingerprint": "0333e700a046" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "The system prompt instructs the assistant to explicitly acknowledge when the wiki does not cover a topic, enforcing a data-integrity constraint that responses must be grounded in extracted content rather than 
hallucinated.", + "sources": [ + { + "file": "wikifi/chat.py", + "lines": [ + 27, + 31 + ], + "fingerprint": "0333e700a046" + } + ] + } + ] + }, + "wikifi/cli.py": { + "fingerprint": "f326383c7da1", + "summary": "The command-line entry point for wikifi, exposing four subcommands (init, walk, chat, report) that drive the full pipeline from codebase introspection through wiki generation and interactive querying.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "wikifi exists to walk a codebase and produce a technology-agnostic markdown wiki of its intent — translating implementation details into domain-level documentation for whoever needs to understand what the system does rather than how it is built.", + "sources": [ + { + "file": "wikifi/cli.py", + "lines": [ + 1, + 8 + ], + "fingerprint": "f326383c7da1" + } + ] + }, + { + "section_id": "capabilities", + "finding": "Users can initialise a wiki workspace in any project directory, run a full multi-stage extraction-and-aggregation pipeline against that codebase, query the resulting wiki through an interactive conversational interface, and obtain a coverage and quality report on how completely the wiki sections have been populated.", + "sources": [ + { + "file": "wikifi/cli.py", + "lines": [ + 60, + 220 + ], + "fingerprint": "f326383c7da1" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The walk pipeline supports opt-in cache invalidation (forcing a clean re-walk), an optional critic-and-reviser review loop on derivative sections, and runtime override of the AI provider, giving operators fine-grained control over cost and quality trade-offs.", + "sources": [ + { + "file": "wikifi/cli.py", + "lines": [ + 88, + 112 + ], + "fingerprint": "f326383c7da1" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The report command can optionally invoke a critic against every populated wiki section to produce quality scores, in addition to its baseline coverage 
summary.", + "sources": [ + { + "file": "wikifi/cli.py", + "lines": [ + 185, + 205 + ], + "fingerprint": "f326383c7da1" + } + ] + }, + { + "section_id": "domains", + "finding": "Two core domains are visible: codebase introspection (discovering and classifying source files) and wiki generation (extracting findings, aggregating them into sections, and deriving higher-level content). A supporting subdomain covers interactive knowledge retrieval against the generated wiki.", + "sources": [ + { + "file": "wikifi/cli.py", + "lines": [ + 1, + 8 + ], + "fingerprint": "f326383c7da1" + } + ] + }, + { + "section_id": "entities", + "finding": "A WikiLayout entity represents the on-disk structure of a wiki workspace rooted at a project directory; it tracks the presence of the .wikifi/ directory and organises the paths that other pipeline stages read and write.", + "sources": [ + { + "file": "wikifi/cli.py", + "lines": [ + 166, + 172 + ], + "fingerprint": "f326383c7da1" + } + ] + }, + { + "section_id": "entities", + "finding": "A walk report entity carries structured metrics for each of the four pipeline stages: introspection (included/excluded file counts, detected languages), extraction (files seen, files with findings, total findings, skipped files, cache hits, specialised-extractor files), aggregation (sections written/empty/cached), and derivation (sections derived/skipped/revised).", + "sources": [ + { + "file": "wikifi/cli.py", + "lines": [ + 118, + 153 + ], + "fingerprint": "f326383c7da1" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Verbose mode activates debug-level structured logging across all subcommands via a shared callback, providing a consistent observability toggle for the entire pipeline.", + "sources": [ + { + "file": "wikifi/cli.py", + "lines": [ + 51, + 60 + ], + "fingerprint": "f326383c7da1" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "An on-disk cache is used by the walk pipeline to avoid redundant re-processing; it 
can be explicitly invalidated at runtime, and cache hit counts are surfaced in the walk report.", + "sources": [ + { + "file": "wikifi/cli.py", + "lines": [ + 90, + 97 + ], + "fingerprint": "f326383c7da1" + } + ] + }, + { + "section_id": "integrations", + "finding": "The CLI integrates with configurable AI providers (ollama, anthropic, openai) at runtime; the provider and model are resolved from settings but can be overridden per-walk invocation, and the same provider instance is reused for both the chat REPL and quality scoring in the report command.", + "sources": [ + { + "file": "wikifi/cli.py", + "lines": [ + 98, + 101 + ], + "fingerprint": "f326383c7da1" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "The tool's entry point must be declared as `wikifi` in the project's script configuration and must delegate directly to the Typer application; this contract ties the installed command name to the main() function in this module.", + "sources": [ + { + "file": "wikifi/cli.py", + "lines": [ + 210, + 215 + ], + "fingerprint": "f326383c7da1" + } + ] + } + ] + }, + "wikifi/config.py": { + "fingerprint": "8cd2ca53c957", + "summary": "Runtime configuration module that declares all tunable settings for the wiki-generation pipeline, loaded from environment variables or a .env file.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "wikifi exists to generate technology-agnostic wiki documentation by walking source repositories, extracting structured findings from each file via language models, and assembling them into coherent wiki sections. 
The configuration reveals the system prioritizes wiki quality over processing speed, and is designed to handle arbitrarily large codebases through chunking and caching.", + "sources": [ + { + "file": "wikifi/config.py", + "lines": [ + 1, + 8 + ], + "fingerprint": "8cd2ca53c957" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The pipeline can route schema files (SQL, OpenAPI, Protobuf, GraphQL, migrations) through deterministic extractors that bypass the language model entirely, providing faster and more reliable handling of structured file types.", + "sources": [ + { + "file": "wikifi/config.py", + "lines": [ + 75, + 81 + ], + "fingerprint": "8cd2ca53c957" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system builds an import/reference graph of the codebase and feeds each file's neighborhood into the extraction prompt, enabling context-aware extraction that understands cross-file relationships rather than treating each file in isolation.", + "sources": [ + { + "file": "wikifi/config.py", + "lines": [ + 69, + 74 + ], + "fingerprint": "8cd2ca53c957" + } + ] + }, + { + "section_id": "capabilities", + "finding": "A critic-and-reviser loop can be applied to derivative wiki sections (personas, user stories, diagrams), invoking a revision pass whenever a quality score falls below a configurable threshold, improving groundedness at the cost of additional processing.", + "sources": [ + { + "file": "wikifi/config.py", + "lines": [ + 83, + 94 + ], + "fingerprint": "8cd2ca53c957" + } + ] + }, + { + "section_id": "capabilities", + "finding": "Per-file extraction results and per-section aggregation results are cached across walks, allowing incremental re-runs that only reprocess changed files.", + "sources": [ + { + "file": "wikifi/config.py", + "lines": [ + 63, + 68 + ], + "fingerprint": "8cd2ca53c957" + } + ] + }, + { + "section_id": "external_dependencies", + "finding": "A locally-hosted Ollama inference server (defaulting to 
localhost:11434) serves as the default language model provider, with qwen3:27b as the default model.", + "sources": [ + { + "file": "wikifi/config.py", + "lines": [ + 31, + 32 + ], + "fingerprint": "8cd2ca53c957" + } + ] + }, + { + "section_id": "external_dependencies", + "finding": "Anthropic's hosted API is an opt-in language model provider, authenticated via an API key, with a configurable per-call output token cap.", + "sources": [ + { + "file": "wikifi/config.py", + "lines": [ + 100, + 106 + ], + "fingerprint": "8cd2ca53c957" + } + ] + }, + { + "section_id": "external_dependencies", + "finding": "OpenAI's API (or compatible proxies such as Azure OpenAI) is an opt-in language model provider, authenticated via an API key and configurable base URL, with a configurable per-call output token cap.", + "sources": [ + { + "file": "wikifi/config.py", + "lines": [ + 108, + 117 + ], + "fingerprint": "8cd2ca53c957" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "File processing enforces a hard size ceiling (default 2 MB) above which files are silently skipped as vendored or generated noise; files below the ceiling but above 150 KB are split into overlapping windows (8 KB overlap) so each language model call stays within a comfortable context budget.", + "sources": [ + { + "file": "wikifi/config.py", + "lines": [ + 38, + 56 + ], + "fingerprint": "8cd2ca53c957" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "A minimum content threshold (default 64 bytes) prevents the system from invoking the language model on essentially empty stub files, guarding against runaway reasoning on trivial inputs.", + "sources": [ + { + "file": "wikifi/config.py", + "lines": [ + 56, + 59 + ], + "fingerprint": "8cd2ca53c957" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "All language model calls share a single per-request timeout (default 900 seconds), providing a uniform backstop against hung inference requests across all providers.", + "sources": 
[ + { + "file": "wikifi/config.py", + "lines": [ + 33, + 34 + ], + "fingerprint": "8cd2ca53c957" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Files exceeding 2,000,000 bytes are unconditionally dropped and never read; this threshold is explicitly documented as targeting vendored or generated noise rather than real source files.", + "sources": [ + { + "file": "wikifi/config.py", + "lines": [ + 36, + 42 + ], + "fingerprint": "8cd2ca53c957" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Each language model call is limited to a 150,000-byte content window, sized to fit within a 32K-context model after prompt overhead; larger files must be split into overlapping chunks rather than truncated.", + "sources": [ + { + "file": "wikifi/config.py", + "lines": [ + 43, + 50 + ], + "fingerprint": "8cd2ca53c957" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Adjacent file chunks share an 8,000-byte overlap region to preserve cross-boundary context; this overlap guarantee must be maintained when the chunking logic is modified.", + "sources": [ + { + "file": "wikifi/config.py", + "lines": [ + 51, + 54 + ], + "fingerprint": "8cd2ca53c957" + } + ] + } + ] + }, + "wikifi/critic.py": { + "fingerprint": "502af9aee392", + "summary": "Quality-assurance component that scores synthesized wiki sections against a rubric, identifies unsupported claims and gaps, and optionally revises bodies that fall below a minimum quality threshold.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "The critic exists specifically to catch hallucination and missing-coverage failures that are most likely to occur in single-shot synthesis of derivative sections. 
It enforces that all wiki content remains tech-agnostic and grounded in upstream evidence, so that a migration team can trust the output without manually verifying every claim.", + "sources": [ + { + "file": "wikifi/critic.py", + "lines": [ + 1, + 15 + ], + "fingerprint": "502af9aee392" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system can evaluate any synthesized wiki section body against its brief and upstream evidence, producing a structured score (0–10) with itemised unsupported claims, gaps, and suggested edits. When the initial score falls below a configurable threshold, it automatically invokes a revision pass and only accepts the revision if it improves or matches the prior score, preventing regressions.", + "sources": [ + { + "file": "wikifi/critic.py", + "lines": [ + 100, + 153 + ], + "fingerprint": "502af9aee392" + } + ] + }, + { + "section_id": "capabilities", + "finding": "A separate audit path walks every section in the finished wiki and produces a rubric-style quality report, including per-section coverage statistics such as total files analysed, files that produced findings, and finding counts per section. 
This report is surfaced via the `wikifi report` command.", + "sources": [ + { + "file": "wikifi/critic.py", + "lines": [ + 155, + 180 + ], + "fingerprint": "502af9aee392" + } + ] + }, + { + "section_id": "entities", + "finding": "A `Critique` entity captures the quality assessment of a single section: an integer score (0–10), a short overall judgment, a list of unsupported claims, a list of gaps relative to the section brief, and a list of concrete revision suggestions.", + "sources": [ + { + "file": "wikifi/critic.py", + "lines": [ + 67, + 84 + ], + "fingerprint": "502af9aee392" + } + ] + }, + { + "section_id": "entities", + "finding": "A `ReviewOutcome` entity tracks the lifecycle of a section review: the section identifier, the initial critique, the current body text, a flag indicating whether a revision was applied, and the optional follow-up critique produced after revision.", + "sources": [ + { + "file": "wikifi/critic.py", + "lines": [ + 91, + 96 + ], + "fingerprint": "502af9aee392" + } + ] + }, + { + "section_id": "entities", + "finding": "A `WikiQualityReport` entity aggregates the full-wiki audit results: an overall numeric score, a mapping from section identifiers to their individual critiques, and optional coverage statistics. `CoverageStats` records total files, files with findings, and per-section finding and file counts.", + "sources": [ + { + "file": "wikifi/critic.py", + "lines": [ + 99, + 114 + ], + "fingerprint": "502af9aee392" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "The scoring rubric is fixed: 9–10 indicates fully grounded, tech-agnostic, narratively coherent content with no unsupported claims; 6–8 allows minor issues; 3–5 signals substantial gaps or partial coverage; 0–2 marks incoherent or off-brief content. 
The default minimum acceptable score for shipping a section without revision is 7.", + "sources": [ + { + "file": "wikifi/critic.py", + "lines": [ + 31, + 48 + ], + "fingerprint": "502af9aee392" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "A revised body is only accepted if its follow-up critique score is greater than or equal to the initial score; any revision that causes a score regression is discarded and the original body is retained. This invariant must be preserved in any reimplementation.", + "sources": [ + { + "file": "wikifi/critic.py", + "lines": [ + 137, + 147 + ], + "fingerprint": "502af9aee392" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "All section bodies must be tech-agnostic: the reviser is explicitly instructed not to invent claims unsupported by upstream evidence and to declare gaps explicitly when evidence is missing rather than speculating.", + "sources": [ + { + "file": "wikifi/critic.py", + "lines": [ + 53, + 61 + ], + "fingerprint": "502af9aee392" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Failures in either the critic or reviser calls are caught and logged as warnings; the system degrades gracefully by returning the original body rather than propagating errors. A score of 0 with a diagnostic message is returned when the critic is unavailable.", + "sources": [ + { + "file": "wikifi/critic.py", + "lines": [ + 158, + 165 + ], + "fingerprint": "502af9aee392" + } + ] + }, + { + "section_id": "integrations", + "finding": "This module calls into the shared LLM provider (from the providers layer) for both structured critique completions and structured revision completions, passing domain-specific system prompts and Pydantic schemas for type-safe JSON responses. 
It consumes Section metadata from the sections catalogue and its outputs are consumed by the report module.", + "sources": [ + { + "file": "wikifi/critic.py", + "lines": [ + 30, + 32 + ], + "fingerprint": "502af9aee392" + } + ] + } + ] + }, + "wikifi/deriver.py": { + "fingerprint": "0b7f4f5abb09", + "summary": "Stage 4 of the wiki generation pipeline: synthesizes derivative sections (personas, user stories, diagrams) by feeding aggregated upstream section content into the language model, then optionally running a critic/reviser quality loop.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "Some wiki sections — personas, user stories, and diagrams — cannot be extracted from individual source files because they only emerge from the aggregate of capabilities, entities, and integrations. This module exists specifically to synthesize those derivative sections after all primary sections have been written, grounding each output exclusively in already-aggregated upstream evidence.", + "sources": [ + { + "file": "wikifi/deriver.py", + "lines": [ + 1, + 18 + ], + "fingerprint": "0b7f4f5abb09" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system synthesizes derivative wiki sections (personas, Gherkin-style user stories, and Mermaid architectural diagrams) by collecting the final markdown bodies of all upstream primary sections and passing them to the language model with a targeted brief. If upstream sections are empty or missing, the system writes a placeholder that declares the gap rather than fabricating content.", + "sources": [ + { + "file": "wikifi/deriver.py", + "lines": [ + 73, + 107 + ], + "fingerprint": "0b7f4f5abb09" + } + ] + }, + { + "section_id": "capabilities", + "finding": "Each synthesized derivative section can optionally be run through a critic-and-reviser quality loop. 
The system explicitly notes this loop is the highest-leverage quality control point for derivative sections, because personas and Gherkin stories are where single-shot synthesis most often hallucinates.", + "sources": [ + { + "file": "wikifi/deriver.py", + "lines": [ + 90, + 103 + ], + "fingerprint": "0b7f4f5abb09" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Hallucination prevention is enforced at two levels: (1) a heuristic filters placeholder bodies so no derivative section treats an unpopulated upstream as real evidence, and (2) an optional critic review loop scores and revises each derivative before it is written. Revision events are counted in the run's stats for observability.", + "sources": [ + { + "file": "wikifi/deriver.py", + "lines": [ + 110, + 135 + ], + "fingerprint": "0b7f4f5abb09" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "The empty-content heuristic must match all known placeholder shapes ('not yet populated', 'no findings were extracted', 'upstream sections required to derive') to prevent fabricated findings cascading into downstream derivative sections.", + "sources": [ + { + "file": "wikifi/deriver.py", + "lines": [ + 118, + 135 + ], + "fingerprint": "0b7f4f5abb09" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "All derivation failures are caught and logged; a fallback body is written that preserves the upstream evidence verbatim rather than leaving the section blank, ensuring the wiki remains inspectable even after partial failures.", + "sources": [ + { + "file": "wikifi/deriver.py", + "lines": [ + 96, + 107 + ], + "fingerprint": "0b7f4f5abb09" + } + ] + }, + { + "section_id": "entities", + "finding": "A `DerivationStats` record accumulates pipeline metrics for a single run: count of sections derived, skipped, and revised, plus the full list of critic review outcomes. 
This acts as an audit trail for the synthesis stage.", + "sources": [ + { + "file": "wikifi/deriver.py", + "lines": [ + 57, + 62 + ], + "fingerprint": "0b7f4f5abb09" + } + ] + }, + { + "section_id": "entities", + "finding": "A `Section` entity has a `derived_from` list declaring which upstream section IDs it depends on, establishing an explicit dependency graph between primary and derivative sections.", + "sources": [ + { + "file": "wikifi/deriver.py", + "lines": [ + 112, + 116 + ], + "fingerprint": "0b7f4f5abb09" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Derivative sections must be grounded solely in upstream section content. The model is instructed to declare gaps explicitly rather than filling them with invented facts — this is a hard constraint on output integrity.", + "sources": [ + { + "file": "wikifi/deriver.py", + "lines": [ + 34, + 50 + ], + "fingerprint": "0b7f4f5abb09" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "All wiki content, including derivative sections, must remain technology-agnostic: language names, framework names, and library names are forbidden and must be translated into domain terms.", + "sources": [ + { + "file": "wikifi/deriver.py", + "lines": [ + 37, + 39 + ], + "fingerprint": "0b7f4f5abb09" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Gherkin-style outputs must use proper Given/When/Then syntax inside fenced ```gherkin code blocks. Mermaid diagrams must be valid and inside fenced ```mermaid code blocks, preferring graph, classDiagram, erDiagram, and sequenceDiagram diagram types.", + "sources": [ + { + "file": "wikifi/deriver.py", + "lines": [ + 40, + 45 + ], + "fingerprint": "0b7f4f5abb09" + } + ] + }, + { + "section_id": "integrations", + "finding": "The deriver reads upstream section bodies from the filesystem layout written by the aggregator stage, and writes its output back through the same layout abstraction. 
It also calls into the critic module to obtain scored review outcomes for each derivative section.", + "sources": [ + { + "file": "wikifi/deriver.py", + "lines": [ + 73, + 107 + ], + "fingerprint": "0b7f4f5abb09" + } + ] + } + ] + }, + "wikifi/evidence.py": { + "fingerprint": "dddfe1a01c85", + "summary": "Defines the core evidence model — source references, claims, contradictions, and rendering helpers — that gives every wiki assertion a traceable pointer back to the original codebase.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "The system exists so that any architect reading the migration wiki can ask 'where in the source did this come from?' and receive a precise, verifiable answer. Every assertion in the generated wiki is backed by file paths and optional line ranges captured at extraction time.", + "sources": [ + { + "file": "wikifi/evidence.py", + "lines": [ + 1, + 18 + ], + "fingerprint": "dddfe1a01c85" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system surfaces conflicting information found across source files as explicit 'Contradiction' entries rather than silently merging them, treating disagreements as high-priority migration signals that encode tribal knowledge.", + "sources": [ + { + "file": "wikifi/evidence.py", + "lines": [ + 13, + 17 + ], + "fingerprint": "dddfe1a01c85" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system renders each wiki section with a 'Sources' footer enumerating every distinct source reference that backs claims in that section, and an additional 'Conflicts in source' sub-section when contradictions exist.", + "sources": [ + { + "file": "wikifi/evidence.py", + "lines": [ + 88, + 121 + ], + "fingerprint": "dddfe1a01c85" + } + ] + }, + { + "section_id": "entities", + "finding": "A SourceRef represents a single span of source code: a repo-relative file path, an optional inclusive line range, and a short content fingerprint captured at extraction 
time for change detection.", + "sources": [ + { + "file": "wikifi/evidence.py", + "lines": [ + 37, + 52 + ], + "fingerprint": "dddfe1a01c85" + } + ] + }, + { + "section_id": "entities", + "finding": "A Claim represents one assertion placed in a wiki section, carrying the markdown text and a list of SourceRefs that justify it; a claim with no sources is explicitly marked as unsupported.", + "sources": [ + { + "file": "wikifi/evidence.py", + "lines": [ + 55, + 67 + ], + "fingerprint": "dddfe1a01c85" + } + ] + }, + { + "section_id": "entities", + "finding": "A Contradiction groups two or more conflicting Claims about the same topic under a single summary sentence; each disagreeing position retains its own source references.", + "sources": [ + { + "file": "wikifi/evidence.py", + "lines": [ + 70, + 77 + ], + "fingerprint": "dddfe1a01c85" + } + ] + }, + { + "section_id": "entities", + "finding": "An EvidenceBundle is the aggregator's structured output for a single wiki section, combining the narrative body text, a list of Claims, and a list of Contradictions.", + "sources": [ + { + "file": "wikifi/evidence.py", + "lines": [ + 80, + 85 + ], + "fingerprint": "dddfe1a01c85" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Full source traceability is a non-negotiable invariant: every sentence in every wiki section must be linkable back to the originating file and, when available, the precise line range. This is enforced structurally through the Claim and SourceRef types rather than by convention.", + "sources": [ + { + "file": "wikifi/evidence.py", + "lines": [ + 1, + 18 + ], + "fingerprint": "dddfe1a01c85" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Source references must be rendered in the format 'path/to/file:start-end' (or 'path/to/file:line' for a single line, or just 'path/to/file' when lines are unknown). The 'Sources' footer uses 1-based sequential numeric indices in the form '1. 
`path`'.", + "sources": [ + { + "file": "wikifi/evidence.py", + "lines": [ + 45, + 52 + ], + "fingerprint": "dddfe1a01c85" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Contradictions must never be silently merged into a unified narrative; they must be explicitly surfaced in a dedicated 'Conflicts in source' sub-section, with a warning that migration teams must resolve them before re-implementation.", + "sources": [ + { + "file": "wikifi/evidence.py", + "lines": [ + 96, + 102 + ], + "fingerprint": "dddfe1a01c85" + } + ] + } + ] + }, + "wikifi/extractor.py": { + "fingerprint": "b0e939259557", + "summary": "Orchestrates per-file extraction of intent-bearing findings from repository source files, routing each file through caching, specialized deterministic parsing, or LLM-based chunked extraction, then appending structured findings to per-section note stores for downstream aggregation.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "The extractor exists to make large codebases understandable by walking every included source file and asking what intent-bearing content it contributes to each section of a technology-agnostic wiki. It is designed to handle repositories of arbitrary size — including 50,000-file legacy monorepos — by making repeated walks cheap through content-addressed caching and by keeping per-chunk LLM failures isolated so partial results are never discarded.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": [ + 1, + 37 + ], + "fingerprint": "b0e939259557" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system can extract structured, section-tagged findings from every file in a repository, supporting three extraction paths: replaying previously cached findings for unchanged files, running deterministic structure-reading for files classified as SQL, OpenAPI, Protobuf, GraphQL, or migrations, and invoking an LLM for all other files. 
Large files are recursively split into overlapping chunks so no content is missed regardless of file size.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": [ + 140, + 200 + ], + "fingerprint": "b0e939259557" + } + ] + }, + { + "section_id": "capabilities", + "finding": "Every emitted finding carries a source citation — the originating file path, an optional inclusive line range within that file, and a content fingerprint — enabling the aggregator to stitch citations back into the rendered wiki. Line ranges reported per-chunk are translated to absolute file line numbers before storage.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": [ + 251, + 270 + ], + "fingerprint": "b0e939259557" + } + ] + }, + { + "section_id": "capabilities", + "finding": "Cross-file context is surfaced to the LLM by supplying each file's import neighborhood (up to eight neighbor paths) in the extraction prompt, enabling findings to describe inter-file flows rather than treating each file in isolation.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": [ + 241, + 246 + ], + "fingerprint": "b0e939259557" + } + ] + }, + { + "section_id": "entities", + "finding": "A `SectionFinding` represents one contribution from a single file to one wiki section, carrying the target section identifier, a technology-agnostic prose description of the contribution, and an optional inclusive line range within the source chunk. A `FileFindings` groups a one-sentence file summary with all findings produced for that file.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": [ + 106, + 123 + ], + "fingerprint": "b0e939259557" + } + ] + }, + { + "section_id": "entities", + "finding": "An `ExtractionStats` record accumulates walk-level counters: total files seen, files yielding at least one finding, total findings, skipped files, chunks processed, cache hits, specialized-extractor invocations, and a per-kind file breakdown. 
It is returned by the walk so callers can report or act on extraction quality.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": [ + 126, + 135 + ], + "fingerprint": "b0e939259557" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Crash-resumability is a first-class property: after each file completes, the cache is optionally persisted via a caller-supplied callback, so a mid-walk failure loses at most one file's worth of work. Combined with content-addressed cache lookup, this makes re-walking an interrupted run nearly instantaneous for already-processed files.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": [ + 155, + 175 + ], + "fingerprint": "b0e939259557" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Findings that emerge from the overlap region shared between adjacent chunks are deduplicated by (section_id, normalized finding text) within each file's processing pass, preventing the same declaration from being double-counted in the aggregated wiki.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": [ + 253, + 262 + ], + "fingerprint": "b0e939259557" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Extraction failures — whether from the LLM provider or a specialized parser — are logged and counted but never propagate to abort the walk. 
A file whose only chunk fails with no salvageable findings is counted as skipped; partially-chunked files retain whatever findings were recovered.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": [ + 228, + 242 + ], + "fingerprint": "b0e939259557" + } + ] + }, + { + "section_id": "integrations", + "finding": "The extractor delegates LLM calls to a provider abstraction sourced from the providers layer, passes file fingerprints to a cache layer for lookup and recording, reads import-graph neighborhoods from a repo-graph component, appends findings to a wiki layout via a note-store helper, and hands off recognized structured file types to a specialized extractor registry.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": [ + 32, + 42 + ], + "fingerprint": "b0e939259557" + } + ] + }, + { + "section_id": "external_dependencies", + "finding": "An LLM service is the primary external dependency for the general extraction path: structured prompts are sent per chunk and the responses are parsed as typed finding objects. The system is designed so the LLM call is the sole expensive operation, with all other mechanisms (caching, specialization, chunking) oriented toward minimizing how often it must be invoked.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": [ + 224, + 236 + ], + "fingerprint": "b0e939259557" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Per-file extraction is restricted to primary wiki sections only. 
Derivative sections (personas, user stories, diagrams) are explicitly excluded from per-file extraction and are instead produced in a later aggregation stage; requesting them at the per-file level is documented as producing sparse, speculative findings.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": [ + 46, + 51 + ], + "fingerprint": "b0e939259557" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "The recursive text splitter must guarantee termination on any input, including minified single-line files with no whitespace, by falling back through separator priority (blank lines → single newlines → spaces → character boundaries). The character-boundary split is the terminal step that ensures every byte is eventually consumed.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": [ + 91, + 103 + ], + "fingerprint": "b0e939259557" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Chunk overlap must satisfy `0 <= overlap < chunk_size`; violating this constraint raises an error. 
The effective base chunk size is `chunk_size - overlap` so that prepending an overlap tail never causes a chunk to exceed `chunk_size` bytes.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": [ + 328, + 336 + ], + "fingerprint": "b0e939259557" + } + ] + } + ] + }, + "wikifi/fingerprint.py": { + "fingerprint": "853400108135", + "summary": "Utility that produces stable short content fingerprints (12-character SHA-256 hex prefixes) used for cache keying, source-evidence citations, and dependency-graph invalidation.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "cross_cutting", + "finding": "Content fingerprints serve three cross-cutting roles: keying extraction and aggregation caches so stale results are never served, anchoring source-evidence citations so wiki claims can be verified against a re-walk of the repository, and tracking file identity inside the dependency graph so cross-file context is invalidated when any source changes.", + "sources": [ + { + "file": "wikifi/fingerprint.py", + "lines": [ + 1, + 18 + ], + "fingerprint": "853400108135" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Fingerprints are defined as the first 12 hexadecimal characters of a SHA-256 digest (48 bits of entropy). This length is explicitly chosen to be sufficient to distinguish every file in any realistic repository (by the birthday bound, the 50% collision threshold for a 48-bit value is roughly 20 million files) while remaining short enough to embed inline in human-readable citations.
This format must be preserved across any migration because it is recorded in cached artefacts and emitted into wiki evidence references.", + "sources": [ + { + "file": "wikifi/fingerprint.py", + "lines": [ + 23, + 27 + ], + "fingerprint": "853400108135" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Files are always hashed as raw bytes rather than decoded text, ensuring that the cache layer and the extractor produce the same fingerprint for the same file regardless of encoding assumptions.", + "sources": [ + { + "file": "wikifi/fingerprint.py", + "lines": [ + 44, + 50 + ], + "fingerprint": "853400108135" + } + ] + } + ] + }, + "wikifi/introspection.py": { + "fingerprint": "59cd5940f72e", + "summary": "Implements Stage 1 of the wiki-generation pipeline: a single LLM call that examines a compressed repository structure and decides which paths contain production source worth deeper analysis.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "The system exists to produce a technology-agnostic wiki from an unknown codebase. Stage 1 solves the problem of not knowing which parts of a repository contain intent-bearing production code versus scaffolding, tests, or build artifacts — without reading any source files up front.", + "sources": [ + { + "file": "wikifi/introspection.py", + "lines": [ + 1, + 9 + ], + "fingerprint": "59cd5940f72e" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system can analyze a repository's directory layout and manifest files to classify paths as worth walking (production source, business logic, integrations, domain models) or worth skipping (vendored dependencies, build output, test code, CI/CD, documentation). 
The classification is returned as a structured, diffable result.", + "sources": [ + { + "file": "wikifi/introspection.py", + "lines": [ + 28, + 44 + ], + "fingerprint": "59cd5940f72e" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system compresses the entire repository tree into a compact directory-summary representation and reads selected manifest files, then submits this compressed view to an LLM to infer the repository's likely purpose, primary languages, and path patterns to include or exclude.", + "sources": [ + { + "file": "wikifi/introspection.py", + "lines": [ + 61, + 70 + ], + "fingerprint": "59cd5940f72e" + } + ] + }, + { + "section_id": "domains", + "finding": "The core domain is automated repository understanding: deciding which parts of an arbitrary codebase encode business intent versus infrastructure or tooling. A key subdomain constraint is tech-agnosticism — the analysis must not depend on recognizing any specific language or framework.", + "sources": [ + { + "file": "wikifi/introspection.py", + "lines": [ + 19, + 44 + ], + "fingerprint": "59cd5940f72e" + } + ] + }, + { + "section_id": "entities", + "finding": "The `IntrospectionResult` entity captures the Stage 1 decision: a list of gitignore-style include patterns, a list of exclude patterns, a list of primary languages (informational), a one-paragraph guess at the system's purpose, and a rationale for the include/exclude choices.", + "sources": [ + { + "file": "wikifi/introspection.py", + "lines": [ + 47, + 64 + ], + "fingerprint": "59cd5940f72e" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "LLM output is constrained to a strict Pydantic schema to ensure deterministic parsing and easy diffing between runs. 
At Stage 1, the agent deliberately has no access to source file contents — only compressed directory metadata and manifest files — enforcing a clean separation of concerns between introspection and per-file analysis.", + "sources": [ + { + "file": "wikifi/introspection.py", + "lines": [ + 5, + 9 + ], + "fingerprint": "59cd5940f72e" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Stage 1 must operate without reading any source files; it sees only directory-level summaries and manifest contents. This constraint is architectural and must be preserved: source reading is exclusively Stage 2's responsibility.", + "sources": [ + { + "file": "wikifi/introspection.py", + "lines": [ + 5, + 9 + ], + "fingerprint": "59cd5940f72e" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Include and exclude patterns produced by Stage 1 must be in gitignore-style format relative to the repository root.", + "sources": [ + { + "file": "wikifi/introspection.py", + "lines": [ + 50, + 58 + ], + "fingerprint": "59cd5940f72e" + } + ] + }, + { + "section_id": "integrations", + "finding": "Stage 1 calls into an LLM provider (via the shared LLMProvider interface) requesting structured JSON output conforming to the IntrospectionResult schema. It also depends on the walker component to produce directory summaries and read manifest file contents. 
The orchestrator calls this stage as the first step of the pipeline.", + "sources": [ + { + "file": "wikifi/introspection.py", + "lines": [ + 61, + 70 + ], + "fingerprint": "59cd5940f72e" + } + ] + } + ] + }, + "wikifi/orchestrator.py": { + "fingerprint": "6ed682a87356", + "summary": "Central orchestrator that wires the full four-stage documentation-generation pipeline (repository introspection → per-file extraction → section aggregation → derivative content derivation) and constructs the appropriate LLM provider based on configuration.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "The system automates the creation of a structured wiki from a source repository. It analyses the repository's files using a language model and produces aggregated documentation sections along with derived artefacts such as personas, user stories, and diagrams — removing the burden of manual documentation from developers.", + "sources": [ + { + "file": "wikifi/orchestrator.py", + "lines": [ + 1, + 16 + ], + "fingerprint": "6ed682a87356" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The pipeline initialises a wiki skeleton in the target repository, introspects its structure to determine which files are in scope, performs per-file content extraction into structured notes with optional caching and import-graph context, aggregates those notes into primary wiki sections, and then derives higher-level artefacts (personas, user stories, diagrams) from the aggregated sections. 
An optional critic loop can review and rescore the derived content.", + "sources": [ + { + "file": "wikifi/orchestrator.py", + "lines": [ + 62, + 155 + ], + "fingerprint": "6ed682a87356" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system can be initialised idempotently against any repository root, automatically bootstrapping the required directory skeleton if it does not yet exist before executing the walk pipeline.", + "sources": [ + { + "file": "wikifi/orchestrator.py", + "lines": [ + 62, + 76 + ], + "fingerprint": "6ed682a87356" + } + ] + }, + { + "section_id": "external_dependencies", + "finding": "Three external language-model backends are supported: a locally hosted inference server (Ollama, the default), the Anthropic hosted API (requiring an API key), and the OpenAI hosted API (requiring an API key and an optional custom base URL). Each backend is configured with a model name, timeout, and token limits drawn from application settings.", + "sources": [ + { + "file": "wikifi/orchestrator.py", + "lines": [ + 163, + 210 + ], + "fingerprint": "6ed682a87356" + } + ] + }, + { + "section_id": "integrations", + "finding": "The orchestrator is the primary entry point called by the CLI layer (`init_wiki`, `run_walk`). It delegates outward to the introspection, extraction, aggregation, and derivation modules, and further to the chosen LLM provider. The cache and file-walker modules are also called as part of the pipeline.", + "sources": [ + { + "file": "wikifi/orchestrator.py", + "lines": [ + 40, + 60 + ], + "fingerprint": "6ed682a87356" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "A persistent extraction cache is maintained across runs: on each walk the cache is loaded, entries for files that are no longer in scope are pruned, and the cache is saved after extraction and again after all stages complete. 
Caching can be disabled, in which case the cache is fully reset before the walk begins.", + "sources": [ + { + "file": "wikifi/orchestrator.py", + "lines": [ + 95, + 110 + ], + "fingerprint": "6ed682a87356" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Structured logging is used at each major stage boundary (introspection, graph build, extraction, aggregation, derivation) to provide observability into pipeline progress.", + "sources": [ + { + "file": "wikifi/orchestrator.py", + "lines": [ + 84, + 148 + ], + "fingerprint": "6ed682a87356" + } + ] + }, + { + "section_id": "entities", + "finding": "A `WalkReport` aggregates the outputs of all four pipeline stages: the repository introspection result, per-file extraction statistics, section aggregation statistics, derivation statistics, the live cache state, and the repository import graph. It is the single return value representing a completed wiki-generation run.", + "sources": [ + { + "file": "wikifi/orchestrator.py", + "lines": [ + 54, + 61 + ], + "fingerprint": "6ed682a87356" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "When a user selects the Anthropic provider but the configured model name does not begin with 'claude-', the system silently substitutes the model identifier 'claude-opus-4-7' rather than forwarding an invalid name. Similarly, for OpenAI, non-OpenAI-pattern model names are replaced with 'gpt-4o'. This model-name substitution logic must be preserved so that users migrating from the default local provider do not receive opaque remote API errors.", + "sources": [ + { + "file": "wikifi/orchestrator.py", + "lines": [ + 178, + 207 + ], + "fingerprint": "6ed682a87356" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "The only accepted provider identifiers are 'ollama', 'anthropic', and 'openai'; any other value raises an error. 
This contract is enforced at provider construction time and must be maintained by any future provider registration mechanism.", + "sources": [ + { + "file": "wikifi/orchestrator.py", + "lines": [ + 208, + 210 + ], + "fingerprint": "6ed682a87356" + } + ] + } + ] + }, + "wikifi/repograph.py": { + "fingerprint": "3d8bbdb10112", + "summary": "Provides lightweight, language-agnostic static analysis of a repository: classifies each in-scope file by kind and constructs an import/reference graph so that per-file wiki extraction can reference cross-file flows rather than treating each file in isolation.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "This module exists to enrich wiki extraction with two signals: what kind of structured artifact each file represents (schema, API contract, migration, general code) and which other files it depends on or is depended upon by. The goal is to let per-file analysis describe cross-file flows (e.g. 'this handler delegates to the billing service') rather than producing island findings.", + "sources": [ + { + "file": "wikifi/repograph.py", + "lines": [ + 1, + 30 + ], + "fingerprint": "3d8bbdb10112" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system classifies every in-scope file into one of seven categories — general application code, SQL DDL, OpenAPI/Swagger spec, Protobuf definition, GraphQL schema, database migration, or unclassified — allowing downstream stages to route structured files to specialised extractors that skip the language model entirely.", + "sources": [ + { + "file": "wikifi/repograph.py", + "lines": [ + 41, + 52 + ], + "fingerprint": "3d8bbdb10112" + } + ] + }, + { + "section_id": "capabilities", + "finding": "A regex-driven import graph is built across the entire in-scope file set for multiple language families, producing for each file the list of files it imports and the list of files that import it. 
This neighbor map is injected into per-file extraction prompts to surface cross-file relationships.", + "sources": [ + { + "file": "wikifi/repograph.py", + "lines": [ + 155, + 210 + ], + "fingerprint": "3d8bbdb10112" + } + ] + }, + { + "section_id": "capabilities", + "finding": "Migration files are distinguished from generic DDL by detecting well-known migration directory conventions (Alembic, Django, Rails, Prisma, Flyway, Liquibase), ensuring the wiki can separate the current schema state from historical forward-only change scripts.", + "sources": [ + { + "file": "wikifi/repograph.py", + "lines": [ + 84, + 99 + ], + "fingerprint": "3d8bbdb10112" + } + ] + }, + { + "section_id": "capabilities", + "finding": "OpenAPI/Swagger specifications are detected inside YAML and JSON files by scanning the first 4 KB for characteristic header patterns, since extension alone is insufficient to distinguish them from other structured data files.", + "sources": [ + { + "file": "wikifi/repograph.py", + "lines": [ + 102, + 109 + ], + "fingerprint": "3d8bbdb10112" + } + ] + }, + { + "section_id": "entities", + "finding": "A `FileKind` enumeration captures seven mutually exclusive file roles: application code, SQL, OpenAPI, Protobuf, GraphQL, migration, and other. This classification drives routing to specialised or general-purpose extraction paths.", + "sources": [ + { + "file": "wikifi/repograph.py", + "lines": [ + 41, + 52 + ], + "fingerprint": "3d8bbdb10112" + } + ] + }, + { + "section_id": "entities", + "finding": "A `GraphNode` entity represents a single file's position in the import graph, carrying its repo-relative path, the ordered set of files it imports, and the ordered set of files that import it. 
It exposes a capped combined-neighbor list for use in prompts.", + "sources": [ + { + "file": "wikifi/repograph.py", + "lines": [ + 148, + 167 + ], + "fingerprint": "3d8bbdb10112" + } + ] + }, + { + "section_id": "entities", + "finding": "A `RepoGraph` entity holds the complete per-file import-edge map for a repository scan, supporting lookup of a node by path and retrieval of a capped neighbor list for any given file.", + "sources": [ + { + "file": "wikifi/repograph.py", + "lines": [ + 170, + 181 + ], + "fingerprint": "3d8bbdb10112" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "The import graph caps neighbor lists at eight entries by default when constructing prompt context, providing a bounded, deterministic input size regardless of how highly connected a file is in the repository.", + "sources": [ + { + "file": "wikifi/repograph.py", + "lines": [ + 156, + 165 + ], + "fingerprint": "3d8bbdb10112" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "The implementation must remain dependency-free beyond regex and path resolution — tree-sitter or similar binary dependencies are explicitly prohibited so that the tool can be installed without native compilation.", + "sources": [ + { + "file": "wikifi/repograph.py", + "lines": [ + 24, + 29 + ], + "fingerprint": "3d8bbdb10112" + } + ] + }, + { + "section_id": "integrations", + "finding": "This module is consumed by the extractor and orchestrator components: the file classification result determines which specialised extractor is invoked, and the neighbor list from the graph is injected into the Stage 2 extraction prompt for each file.", + "sources": [ + { + "file": "wikifi/repograph.py", + "lines": [ + 1, + 15 + ], + "fingerprint": "3d8bbdb10112" + } + ] + } + ] + }, + "wikifi/report.py": { + "fingerprint": "2b94c0a5e62e", + "summary": "Produces a read-only coverage-and-quality report over a completed wiki walk, answering whether the codebase was fully covered and whether the resulting 
wiki is good enough to act on.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "The report module addresses the questions that migration leads ask before committing to a re-implementation effort: (1) did the automated walk cover the entire system, and (2) is the generated wiki accurate and complete enough to guide action? It runs purely from existing on-disk artifacts and never mutates the wiki, making it safe to run in CI without side-effects.", + "sources": [ + { + "file": "wikifi/report.py", + "lines": [ + 1, + 14 + ], + "fingerprint": "2b94c0a5e62e" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system can produce a structured coverage report showing, per wiki section, how many files contributed findings and how many findings were extracted. It also offers an optional quality-scoring mode where each populated section is evaluated by the critic and assigned a numeric score out of 10, surfacing unsupported claims and gaps. A human-readable markdown table is rendered from these results.", + "sources": [ + { + "file": "wikifi/report.py", + "lines": [ + 44, + 77 + ], + "fingerprint": "2b94c0a5e62e" + } + ] + }, + { + "section_id": "capabilities", + "finding": "Coverage statistics are derived from the persisted walk cache, comparing total files seen against files that produced at least one finding, and computing a coverage percentage. This allows teams to identify dead zones — files the extraction pass processed but from which no signal was extracted.", + "sources": [ + { + "file": "wikifi/report.py", + "lines": [ + 103, + 107 + ], + "fingerprint": "2b94c0a5e62e" + } + ] + }, + { + "section_id": "entities", + "finding": "A `SectionReport` captures the per-section view: the section descriptor, count of contributing files, total findings count, body size in characters, an emptiness flag, and an optional quality critique. 
A `WikiReport` aggregates all section reports alongside overall coverage statistics and an optional mean quality score across populated sections.", + "sources": [ + { + "file": "wikifi/report.py", + "lines": [ + 28, + 42 + ], + "fingerprint": "2b94c0a5e62e" + } + ] + }, + { + "section_id": "entities", + "finding": "A `CoverageStats` entity (defined in the critic module) holds the total files seen, files with findings, and per-section breakdowns of findings and contributing files, and exposes a coverage percentage computation.", + "sources": [ + { + "file": "wikifi/report.py", + "lines": [ + 85, + 94 + ], + "fingerprint": "2b94c0a5e62e" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "The report is explicitly read-only: it inspects on-disk artifacts (notes JSONL, section markdown files, and the walk cache) without modifying the wiki. This invariant is stated in the module docstring and upheld throughout the implementation.", + "sources": [ + { + "file": "wikifi/report.py", + "lines": [ + 9, + 12 + ], + "fingerprint": "2b94c0a5e62e" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Emptiness detection for a section applies three textual sentinel checks — 'Not yet populated', 'No findings were extracted', and 'upstream sections required to derive' — to decide whether a section should be skipped for scoring. 
These sentinels act as a lightweight data-integrity signal distinguishing meaningfully populated sections from placeholder content.", + "sources": [ + { + "file": "wikifi/report.py", + "lines": [ + 118, + 123 + ], + "fingerprint": "2b94c0a5e62e" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Structured logging is initialised under the `wikifi.report` namespace for observability into report generation steps.", + "sources": [ + { + "file": "wikifi/report.py", + "lines": [ + 22, + 22 + ], + "fingerprint": "2b94c0a5e62e" + } + ] + }, + { + "section_id": "integrations", + "finding": "The report integrates with the walk cache to retrieve the full extraction manifest (file → findings mapping) for computing coverage statistics, and with the notes store to count per-section findings and contributing files.", + "sources": [ + { + "file": "wikifi/report.py", + "lines": [ + 103, + 107 + ], + "fingerprint": "2b94c0a5e62e" + } + ] + }, + { + "section_id": "integrations", + "finding": "When quality scoring is requested and a language-model provider is available, the report delegates to the critic subsystem, optionally supplying upstream section bodies for derivative sections that depend on earlier wiki content.", + "sources": [ + { + "file": "wikifi/report.py", + "lines": [ + 124, + 131 + ], + "fingerprint": "2b94c0a5e62e" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Quality scoring is only performed when explicitly requested (`score=True`) and a provider is supplied; without both conditions the report remains purely structural. 
This ensures the tool can run in provider-free environments such as CI pipelines without failure.", + "sources": [ + { + "file": "wikifi/report.py", + "lines": [ + 80, + 84 + ], + "fingerprint": "2b94c0a5e62e" + } + ] + } + ] + }, + "wikifi/sections.py": { + "fingerprint": "f743972a8fce", + "summary": "Defines the complete taxonomy of wiki sections that the system generates, distinguishing between primary sections (extracted from per-file evidence) and derivative sections (synthesized from aggregates of primary sections).", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "The system exists to generate structured, technology-agnostic wiki documentation from a codebase. Its design explicitly separates extraction of direct per-file evidence (primary sections) from higher-order synthesis across the whole codebase (derivative sections), because single files rarely contain enough signal to infer personas, user stories, or diagrams.", + "sources": [ + { + "file": "wikifi/sections.py", + "lines": [ + 1, + 19 + ], + "fingerprint": "f743972a8fce" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system produces wiki documentation organized into eight primary sections — business domains, system intent, capabilities, external dependencies, integrations, cross-cutting concerns, core entities, and hard specifications — plus three derivative sections: user personas, Gherkin-style user stories, and Mermaid diagrams. 
Each derivative section declares which primary sections it depends on and is only generated after those primaries are finalized.", + "sources": [ + { + "file": "wikifi/sections.py", + "lines": [ + 44, + 142 + ], + "fingerprint": "f743972a8fce" + } + ] + }, + { + "section_id": "entities", + "finding": "A Section entity captures: a unique identifier, a human-readable title, a prose description of what belongs in that section, a tier (primary or derivative), and an ordered tuple of upstream section identifiers it is derived from. Sections form a dependency graph that must be topologically ordered — each derivative's upstreams must appear earlier in the canonical section list, enforced at startup.", + "sources": [ + { + "file": "wikifi/sections.py", + "lines": [ + 30, + 40 + ], + "fingerprint": "f743972a8fce" + } + ] + }, + { + "section_id": "domains", + "finding": "The wiki-generation process is split into two subdomains: per-file evidence extraction (primary sections, Stages 2–3) and aggregate synthesis (derivative sections, Stage 4). The dependency ordering between these two subdomains is a first-class design constraint enforced structurally.", + "sources": [ + { + "file": "wikifi/sections.py", + "lines": [ + 1, + 19 + ], + "fingerprint": "f743972a8fce" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Derivative sections must always reference only known section IDs, and every upstream a derivative depends on must appear earlier in the canonical SECTIONS ordering. 
This ordering invariant is validated at module load time and any violation raises an error, making it a hard structural requirement for the section taxonomy.", + "sources": [ + { + "file": "wikifi/sections.py", + "lines": [ + 148, + 158 + ], + "fingerprint": "f743972a8fce" + } + ] + } + ] + }, + "wikifi/walker.py": { + "fingerprint": "a29bd1ad8bdb", + "summary": "Filesystem walker that enumerates and filters source files in a repository, building directory summaries and reading manifest content for downstream analysis passes.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "The walker exists to give higher-level analysis passes a clean, noise-free view of a repository's source files. It deliberately excludes VCS metadata, dependency caches, build artifacts, and the tool's own working directory so that only meaningful source content reaches the analysis layer. It is described as intentionally provider-free — it knows nothing about analysis models or output sections.", + "sources": [ + { + "file": "wikifi/walker.py", + "lines": [ + 1, + 12 + ], + "fingerprint": "a29bd1ad8bdb" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system can enumerate all analyzable files in a repository tree while respecting gitignore rules and a configurable set of additional exclusion patterns. It produces a depth-limited, pre-order directory summary (file counts, byte totals, extension histograms, notable filenames) used as a compressed structural view for the Stage 1 introspection pass. 
It can also read a targeted set of manifest and readme files up to a configurable byte limit for inclusion in introspection prompts.", + "sources": [ + { + "file": "wikifi/walker.py", + "lines": [ + 92, + 186 + ], + "fingerprint": "a29bd1ad8bdb" + } + ] + }, + { + "section_id": "capabilities", + "finding": "File selection applies layered filtering: pattern-based exclusion, size upper-bound to discard generated or vendored assets, and a minimum content threshold to skip stubs and near-empty files before they reach any analysis model.", + "sources": [ + { + "file": "wikifi/walker.py", + "lines": [ + 100, + 130 + ], + "fingerprint": "a29bd1ad8bdb" + } + ] + }, + { + "section_id": "entities", + "finding": "WalkConfig is a configuration entity capturing: the repository root path, extra exclusion patterns beyond the defaults, whether to honour gitignore rules, a maximum file size in bytes, and a minimum stripped-content size in bytes. It is immutable once constructed.", + "sources": [ + { + "file": "wikifi/walker.py", + "lines": [ + 61, + 79 + ], + "fingerprint": "a29bd1ad8bdb" + } + ] + }, + { + "section_id": "entities", + "finding": "DirSummary is a value object representing aggregate statistics for a single (non-recursive) directory: its repo-relative path, file count, total byte size, a frequency map of file extensions (top 10), and a tuple of notable filenames (manifests, readmes) present in that directory.", + "sources": [ + { + "file": "wikifi/walker.py", + "lines": [ + 144, + 153 + ], + "fingerprint": "a29bd1ad8bdb" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Files exceeding 2 MB are silently dropped on the assumption they are vendored, generated, or binary assets; real source files are expected to fit within this bound. 
Files whose stripped text content is shorter than 64 bytes are also dropped to prevent analysis models from producing speculative or hallucinated findings on effectively empty inputs.", + "sources": [ + { + "file": "wikifi/walker.py", + "lines": [ + 61, + 79 + ], + "fingerprint": "a29bd1ad8bdb" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Manifest files read for introspection context are hard-truncated at 20,000 bytes with a visible truncation marker to keep prompt payloads bounded.", + "sources": [ + { + "file": "wikifi/walker.py", + "lines": [ + 220, + 231 + ], + "fingerprint": "a29bd1ad8bdb" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Directory traversal prunes ignored directories before descending into them, meaning exclusion patterns apply to entire subtrees efficiently rather than file-by-file.", + "sources": [ + { + "file": "wikifi/walker.py", + "lines": [ + 133, + 143 + ], + "fingerprint": "a29bd1ad8bdb" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "The maximum file size threshold is 2,000,000 bytes (2 MB); files at or above this limit are unconditionally skipped and never sent for analysis. The minimum content threshold is 64 bytes of stripped text. Manifest files are truncated to 20,000 bytes maximum before being included in any prompt.", + "sources": [ + { + "file": "wikifi/walker.py", + "lines": [ + 61, + 79 + ], + "fingerprint": "a29bd1ad8bdb" + } + ] + }, + { + "section_id": "integrations", + "finding": "The walker is called into by the introspection layer (Stage 1) and the extractor layer: introspection uses `summarize_tree` and `read_manifest_files` to build its compressed repo view, while the extractor uses `iter_files` to obtain the filtered file list for per-file analysis. 
The walker itself calls into no external services.", + "sources": [ + { + "file": "wikifi/walker.py", + "lines": [ + 1, + 12 + ], + "fingerprint": "a29bd1ad8bdb" + } + ] + } + ] + }, + "wikifi/wiki.py": { + "fingerprint": "9230b7444e0d", + "summary": "Manages the on-disk wiki directory layout, scaffolding lifecycle, and persistence of per-file extraction notes and rendered section documents.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "hard_specifications", + "finding": "The directory layout is explicitly declared as a stable contract between the tool and any target project: upgrading the tool must not break existing wikis. This constraint is called out in the module docstring and governs all future changes to path conventions.", + "sources": [ + { + "file": "wikifi/wiki.py", + "lines": [ + 1, + 13 + ], + "fingerprint": "9230b7444e0d" + } + ] + }, + { + "section_id": "entities", + "finding": "WikiLayout is a value object (immutable dataclass) that encapsulates the root path of a target project and derives all canonical sub-paths from it: the wiki directory, configuration file, gitignore file, notes directory, per-section markdown files, and per-section JSONL note files.", + "sources": [ + { + "file": "wikifi/wiki.py", + "lines": [ + 34, + 61 + ], + "fingerprint": "9230b7444e0d" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system can initialize a wiki skeleton inside a target project directory in an idempotent manner: it creates the directory structure, a provider/model configuration file, a gitignore, and one placeholder markdown file per defined section — skipping anything that already exists.", + "sources": [ + { + "file": "wikifi/wiki.py", + "lines": [ + 64, + 86 + ], + "fingerprint": "9230b7444e0d" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system can write a fully rendered markdown document for any individual section, replacing its previous content, and can append timestamped per-file extraction 
notes to a section's scratch log, read those notes back in insertion order, and wipe all notes at the start of a fresh analysis run.", + "sources": [ + { + "file": "wikifi/wiki.py", + "lines": [ + 89, + 121 + ], + "fingerprint": "9230b7444e0d" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Per-file extraction state is stored as newline-delimited JSON (JSONL), with each record automatically stamped with a UTC timestamp. Notes are excluded from version control via a generated gitignore entry, while rendered section markdown is intended to be committed.", + "sources": [ + { + "file": "wikifi/wiki.py", + "lines": [ + 96, + 121 + ], + "fingerprint": "9230b7444e0d" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "The `.wikifi/` directory layout follows a fixed, documented schema: `config.toml` for provider/model overrides, `.gitignore` for excluding notes, one `
.md` per defined section, and a `.notes/
.jsonl` per section for extraction state. This schema must remain stable across upgrades.", + "sources": [ + { + "file": "wikifi/wiki.py", + "lines": [ + 1, + 13 + ], + "fingerprint": "9230b7444e0d" + } + ] + }, + { + "section_id": "integrations", + "finding": "This module is the authoritative persistence layer consumed by the orchestrator, extractor, aggregator, deriver, and CLI — all path resolution for reading and writing wiki artifacts flows through WikiLayout rather than being scattered across those callers.", + "sources": [ + { + "file": "wikifi/wiki.py", + "lines": [ + 34, + 61 + ], + "fingerprint": "9230b7444e0d" + } + ] + } + ] + }, + "wikifi/specialized/__init__.py": { + "fingerprint": "84d6c382c745", + "summary": "Dispatcher that routes high-signal source artifacts (schemas, IDLs, API specs, migrations) to purpose-built parsers instead of the general LLM extraction path, while preserving a uniform output contract downstream.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "Schema files, interface definition languages, API specifications, and database migrations encode a system's contracts in structured, machine-readable form. Passing them through a general-purpose prose extractor is both inefficient and lossy; dedicated parsers can read the structure directly. 
This package implements that optimised path as a drop-in replacement for the LLM call whenever a file's kind is recognised.", + "sources": [ + { + "file": "wikifi/specialized/__init__.py", + "lines": [ + 1, + 13 + ], + "fingerprint": "84d6c382c745" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system selects a specialised extractor based on the detected kind of a source file — covering SQL queries, database migrations, OpenAPI specifications, Protocol Buffer definitions, and GraphQL schemas — and routes each file to the parser best suited to its structure rather than treating all files identically.", + "sources": [ + { + "file": "wikifi/specialized/__init__.py", + "lines": [ + 46, + 57 + ], + "fingerprint": "84d6c382c745" + } + ] + }, + { + "section_id": "entities", + "finding": "A `SpecializedFinding` represents a single structured insight extracted from a file, carrying a section identifier, a human-readable description, and a list of source references. A `SpecializedResult` groups zero or more such findings together with an optional summary string, and is the standard output contract for every specialised extractor.", + "sources": [ + { + "file": "wikifi/specialized/__init__.py", + "lines": [ + 29, + 38 + ], + "fingerprint": "84d6c382c745" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "All specialised extractors are required to return results in the same `{section_id, finding, sources}` shape that the LLM extractor produces, ensuring the downstream aggregation layer needs no knowledge of which extraction path was taken. 
This interface contract is an invariant that must be preserved.", + "sources": [ + { + "file": "wikifi/specialized/__init__.py", + "lines": [ + 9, + 13 + ], + "fingerprint": "84d6c382c745" + } + ] + }, + { + "section_id": "integrations", + "finding": "The dispatcher integrates internally with the file-kind classification system (sourced from the repository graph module) and delegates to four sibling extractor modules — SQL, OpenAPI, Protobuf, and GraphQL — each responsible for a distinct artifact type.", + "sources": [ + { + "file": "wikifi/specialized/__init__.py", + "lines": [ + 46, + 57 + ], + "fingerprint": "84d6c382c745" + } + ] + } + ] + }, + "wikifi/specialized/graphql.py": { + "fingerprint": "bbb305e0d47f", + "summary": "Specialized extractor that parses GraphQL Schema Definition Language files and maps their constructs to structured wiki findings about domain entities and API capabilities.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "capabilities", + "finding": "The system can analyze GraphQL schema files to identify and catalog all operation roots. Query and Mutation roots are recognized as first-class capability surfaces, with each declared field surfaced as an individually named operation the API exposes.", + "sources": [ + { + "file": "wikifi/specialized/graphql.py", + "lines": [ + 85, + 101 + ], + "fingerprint": "bbb305e0d47f" + } + ] + }, + { + "section_id": "integrations", + "finding": "Subscription roots in a GraphQL schema are treated as integration touchpoints rather than capabilities, reflecting their role as real-time or event-driven channels that external consumers attach to.", + "sources": [ + { + "file": "wikifi/specialized/graphql.py", + "lines": [ + 88, + 91 + ], + "fingerprint": "bbb305e0d47f" + } + ] + }, + { + "section_id": "entities", + "finding": "Domain-level GraphQL object types (excluding operation roots) are extracted and recorded as named entities. 
Interfaces, input types, and enums are each treated as distinct entity-level constructs: interfaces represent shared shape contracts, inputs represent request payload shapes, and enums represent closed value sets.", + "sources": [ + { + "file": "wikifi/specialized/graphql.py", + "lines": [ + 32, + 81 + ], + "fingerprint": "bbb305e0d47f" + } + ] + }, + { + "section_id": "intent", + "finding": "The extractor exists to make GraphQL schemas first-class source material for wiki generation, automatically translating schema structure into domain-meaningful findings about entities and API surface rather than requiring manual documentation of those schemas.", + "sources": [ + { + "file": "wikifi/specialized/graphql.py", + "lines": [ + 1, + 7 + ], + "fingerprint": "bbb305e0d47f" + } + ] + } + ] + }, + "wikifi/specialized/openapi.py": { + "fingerprint": "ae97781309c4", + "summary": "Parses OpenAPI/Swagger contract files and extracts structured findings about public endpoints, data schemas, and authentication schemes to support migration analysis.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "OpenAPI specification files are treated as authoritative 'migration gold' because they enumerate every public endpoint, request/response body, and authentication method in one structured document. This extractor surfaces that information so migration teams have a complete picture of the API contract without manually reading raw spec files.", + "sources": [ + { + "file": "wikifi/specialized/openapi.py", + "lines": [ + 1, + 11 + ], + "fingerprint": "ae97781309c4" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system can parse an API contract file (JSON or YAML format) and produce an inventory of all public HTTP endpoints — including the verb, path, and human-readable description — capping display at 20 with a count of omitted entries. 
When a spec cannot be parsed, it emits a graceful advisory finding rather than failing, directing reviewers to inspect the file manually.", + "sources": [ + { + "file": "wikifi/specialized/openapi.py", + "lines": [ + 23, + 50 + ], + "fingerprint": "ae97781309c4" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The extractor also surfaces the API title, version, and description from the contract's info block, providing high-level identity context for the API being migrated.", + "sources": [ + { + "file": "wikifi/specialized/openapi.py", + "lines": [ + 53, + 66 + ], + "fingerprint": "ae97781309c4" + } + ] + }, + { + "section_id": "entities", + "finding": "API request/response schemas are extracted from the contract's component definitions and listed by name (up to 25, with overflow count). These represent the canonical data models the API exposes or consumes, and are surfaced as entity-level findings for migration awareness.", + "sources": [ + { + "file": "wikifi/specialized/openapi.py", + "lines": [ + 94, + 108 + ], + "fingerprint": "ae97781309c4" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Authentication schemes declared in the API contract are extracted and categorized by type. 
This ensures that security contracts (e.g., API key, OAuth, HTTP bearer) are recorded as cross-cutting concerns that must be preserved through migration.", + "sources": [ + { + "file": "wikifi/specialized/openapi.py", + "lines": [ + 110, + 121 + ], + "fingerprint": "ae97781309c4" + } + ] + }, + { + "section_id": "integrations", + "finding": "Each parsed API contract contributes an inbound-integration finding recording the count of HTTP endpoints exposed to external consumers, establishing the external-facing API surface as a documented integration point.", + "sources": [ + { + "file": "wikifi/specialized/openapi.py", + "lines": [ + 83, + 92 + ], + "fingerprint": "ae97781309c4" + } + ] + }, + { + "section_id": "external_dependencies", + "finding": "The system optionally relies on an external YAML parsing library for reading YAML-formatted API specifications. When the library is unavailable, a built-in minimal parser handles the subset of YAML structures present in standard OpenAPI documents, ensuring no hard runtime dependency is introduced.", + "sources": [ + { + "file": "wikifi/specialized/openapi.py", + "lines": [ + 143, + 157 + ], + "fingerprint": "ae97781309c4" + } + ] + } + ] + }, + "wikifi/specialized/protobuf.py": { + "fingerprint": "e20d5913745a", + "summary": "Specialized extractor that parses interface definition (proto) files to surface message types as domain entities and service/RPC blocks as integration touchpoints for migration-ready contract analysis.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "This extractor exists to treat interface definition files as pure contracts: its findings are intended to be read directly into interface design when re-implementing in a new stack. 
It bridges the gap between existing wire-protocol definitions and a migration team's understanding of what must be preserved.", + "sources": [ + { + "file": "wikifi/specialized/protobuf.py", + "lines": [ + 1, + 8 + ], + "fingerprint": "e20d5913745a" + } + ] + }, + { + "section_id": "capabilities", + "finding": "Parses protocol definition files to extract: named data types (messages) grouped by package, closed value sets (enums), service definitions, and individual remote procedure signatures including streaming variants on both input and output. Produces structured findings categorised by entity type and integration surface.", + "sources": [ + { + "file": "wikifi/specialized/protobuf.py", + "lines": [ + 27, + 95 + ], + "fingerprint": "e20d5913745a" + } + ] + }, + { + "section_id": "entities", + "finding": "Extracts protocol message types and enum types from interface definition files, recording each by name and source line. Messages are grouped under their package namespace; enums are surfaced separately as closed value sets. Up to 25 of each are reported verbatim; larger sets are truncated with a count.", + "sources": [ + { + "file": "wikifi/specialized/protobuf.py", + "lines": [ + 44, + 68 + ], + "fingerprint": "e20d5913745a" + } + ] + }, + { + "section_id": "integrations", + "finding": "Identifies every declared service and maps each of its remote procedures, capturing the procedure name, request message type, response message type, and whether either side is a streaming channel. 
Each service is emitted as a distinct integration touchpoint.", + "sources": [ + { + "file": "wikifi/specialized/protobuf.py", + "lines": [ + 70, + 87 + ], + "fingerprint": "e20d5913745a" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "The module explicitly designates proto file findings as direct inputs to interface design during migration, implying that message names, enum value sets, service names, RPC signatures, and streaming contracts must be preserved verbatim when porting to a new stack.", + "sources": [ + { + "file": "wikifi/specialized/protobuf.py", + "lines": [ + 1, + 8 + ], + "fingerprint": "e20d5913745a" + } + ] + } + ] + }, + "wikifi/specialized/sql.py": { + "fingerprint": "1ef5e77c4038", + "summary": "SQL DDL and migration extractor that converts CREATE TABLE, ALTER TABLE, and CREATE INDEX statements into structured findings about domain entities, relational links, and storage invariants.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "capabilities", + "finding": "The system can parse SQL schema files and migration scripts to automatically discover domain entities, their fields, foreign-key relationships, uniqueness and nullability constraints, and index definitions — producing structured wiki findings without requiring a live database connection.", + "sources": [ + { + "file": "wikifi/specialized/sql.py", + "lines": [ + 56, + 62 + ], + "fingerprint": "1ef5e77c4038" + } + ] + }, + { + "section_id": "capabilities", + "finding": "Migration files are parsed with the same logic as baseline DDL but are distinguished in output summaries, allowing the migration team to differentiate additive schema changes from the original baseline structure.", + "sources": [ + { + "file": "wikifi/specialized/sql.py", + "lines": [ + 56, + 62 + ], + "fingerprint": "1ef5e77c4038" + } + ] + }, + { + "section_id": "entities", + "finding": "Each CREATE TABLE statement is treated as a domain entity: the extractor captures the table name, all 
column names, foreign key edges, and storage constraints as structured entity findings.", + "sources": [ + { + "file": "wikifi/specialized/sql.py", + "lines": [ + 64, + 84 + ], + "fingerprint": "1ef5e77c4038" + } + ] + }, + { + "section_id": "entities", + "finding": "ALTER TABLE statements are also tracked as entity-level findings, recording what schema evolution has been applied to a given entity over time (e.g., added columns, dropped constraints).", + "sources": [ + { + "file": "wikifi/specialized/sql.py", + "lines": [ + 99, + 111 + ], + "fingerprint": "1ef5e77c4038" + } + ] + }, + { + "section_id": "integrations", + "finding": "Foreign key declarations — both explicit FOREIGN KEY clauses and inline REFERENCES annotations on column definitions — are surfaced as hard relational links between entities, capturing which field on one entity points into another.", + "sources": [ + { + "file": "wikifi/specialized/sql.py", + "lines": [ + 86, + 96 + ], + "fingerprint": "1ef5e77c4038" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "UNIQUE and NOT NULL constraints found within a table definition are extracted as storage invariants that the system flags must be preserved across any migration or re-implementation.", + "sources": [ + { + "file": "wikifi/specialized/sql.py", + "lines": [ + 97, + 98 + ], + "fingerprint": "1ef5e77c4038" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Index definitions (CREATE INDEX) are recorded as query-time performance invariants, with an explicit note that the new system must preserve them.", + "sources": [ + { + "file": "wikifi/specialized/sql.py", + "lines": [ + 113, + 125 + ], + "fingerprint": "1ef5e77c4038" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Indexes are explicitly annotated as performance invariants that 'the new system must preserve,' establishing a carry-forward requirement for any target platform.", + "sources": [ + { + "file": "wikifi/specialized/sql.py", + "lines": [ 
+ 117, + 122 + ], + "fingerprint": "1ef5e77c4038" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "UNIQUE and NOT NULL constraints are treated as storage-level invariants that must survive migration, not merely advisory metadata.", + "sources": [ + { + "file": "wikifi/specialized/sql.py", + "lines": [ + 97, + 98 + ], + "fingerprint": "1ef5e77c4038" + } + ] + } + ] + }, + "wikifi/providers/anthropic_provider.py": { + "fingerprint": "872020d40ac3", + "summary": "Implements the hosted-AI provider by calling an external large-language-model service, with a prompt-caching strategy that makes large-scale codebase walks economically viable.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "external_dependencies", + "finding": "The system depends on a hosted large-language-model service (Anthropic Claude) for structured extraction and free-text generation. The API key is resolved from an environment variable (`ANTHROPIC_API_KEY`) or injected at construction time, and a configurable HTTP timeout guards against long-running inference calls.", + "sources": [ + { + "file": "wikifi/providers/anthropic_provider.py", + "lines": [ + 100, + 108 + ], + "fingerprint": "872020d40ac3" + } + ] + }, + { + "section_id": "integrations", + "finding": "This module is the outbound integration point to the hosted LLM service. It is consumed by the orchestrator (`wikifi/orchestrator.py`) to drive all per-file extraction, aggregation, and derivation calls. 
Three interaction shapes are exposed: schema-validated structured output, free-text completion, and multi-turn chat.", + "sources": [ + { + "file": "wikifi/providers/anthropic_provider.py", + "lines": [ + 115, + 175 + ], + "fingerprint": "872020d40ac3" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "A prompt-caching mechanism marks the large, repeated system prompt with an ephemeral cache breakpoint so that only the first call in a pipeline walk pays full input-token cost; subsequent calls pay roughly 10% of that cost as a cache read. The module's own documentation states this is the critical cost control that makes hosted inference economical on codebases of tens of thousands of files.", + "sources": [ + { + "file": "wikifi/providers/anthropic_provider.py", + "lines": [ + 193, + 210 + ], + "fingerprint": "872020d40ac3" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "All errors from the external LLM service are caught and re-raised as a normalized internal error type, carrying the provider's request identifier when available. 
This preserves the pipeline's existing per-call fallback behaviour without leaking provider-specific error types into the broader system.", + "sources": [ + { + "file": "wikifi/providers/anthropic_provider.py", + "lines": [ + 238, + 244 + ], + "fingerprint": "872020d40ac3" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "A configurable reasoning-depth control (`think`) is translated into the external API's adaptive-thinking feature, allowing callers to trade inference latency and cost against extraction quality without branching on provider type elsewhere in the codebase.", + "sources": [ + { + "file": "wikifi/providers/anthropic_provider.py", + "lines": [ + 212, + 232 + ], + "fingerprint": "872020d40ac3" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Sampling parameters (temperature, top_p, top_k) must not be sent to the claude-opus-4-7 model variant — doing so causes a 400 error. The provider explicitly omits these parameters for this model generation, making their absence a hard constraint carried forward with the provider implementation.", + "sources": [ + { + "file": "wikifi/providers/anthropic_provider.py", + "lines": [ + 17, + 21 + ], + "fingerprint": "872020d40ac3" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "The maximum output token budget per call is set at 16,000 tokens. 
This is documented as comfortable headroom for any section schema response while staying within the SDK's non-streaming HTTP timeout guard, making it an operationally important default that should not be reduced without re-validating pipeline completions.", + "sources": [ + { + "file": "wikifi/providers/anthropic_provider.py", + "lines": [ + 72, + 76 + ], + "fingerprint": "872020d40ac3" + } + ] + } + ] + }, + "wikifi/providers/base.py": { + "fingerprint": "2750f0f56327", + "summary": "Defines the minimal provider abstraction that all language-model backends must satisfy, enabling the rest of the system to swap underlying AI services without changing call sites.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "The provider protocol is explicitly designed to be minimal so that switching between different AI backends (local, hosted, or mock) requires changing only a single class, keeping the rest of the system decoupled from any particular AI service.", + "sources": [ + { + "file": "wikifi/providers/base.py", + "lines": [ + 1, + 16 + ], + "fingerprint": "2750f0f56327" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system interacts with language models in three distinct modes: structured JSON generation (used for repository introspection, per-file extraction, and section aggregation), free-form markdown generation (used for diagram passes), and stateful multi-turn conversation (used for the interactive chat REPL).", + "sources": [ + { + "file": "wikifi/providers/base.py", + "lines": [ + 1, + 16 + ], + "fingerprint": "2750f0f56327" + } + ] + }, + { + "section_id": "integrations", + "finding": "The provider abstraction is the outbound integration boundary between wikifi's pipeline stages and any AI model backend; all pipeline components (introspection, extractor, aggregator, deriver, critic, chat, orchestrator) call through this interface rather than directly to a specific service.", + "sources": [ + { + "file": 
"wikifi/providers/base.py", + "lines": [ + 30, + 48 + ], + "fingerprint": "2750f0f56327" + } + ] + }, + { + "section_id": "entities", + "finding": "A `ChatMessage` entity carries a `role` and `content` field and represents a single turn in a multi-turn conversation; lists of these are passed to the chat mode to maintain conversation history across turns.", + "sources": [ + { + "file": "wikifi/providers/base.py", + "lines": [ + 28, + 30 + ], + "fingerprint": "2750f0f56327" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Structured output validation is a cross-cutting concern: the `complete_json` mode requires the model response to be validated against a declared schema before being returned, ensuring type-safe data flows through every pipeline stage that uses it.", + "sources": [ + { + "file": "wikifi/providers/base.py", + "lines": [ + 36, + 38 + ], + "fingerprint": "2750f0f56327" + } + ] + } + ] + }, + "wikifi/providers/ollama_provider.py": { + "fingerprint": "0a21916665a5", + "summary": "Ollama-backed LLM provider that handles schema-enforced structured output, free-text generation, and multi-turn chat by connecting to a locally-hosted inference runtime.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "external_dependencies", + "finding": "The system depends on a locally-hosted Ollama inference runtime reachable at a configurable host address. Ollama serves as the sole LLM backend, handling structured JSON output (via schema enforcement), free-text completion, and multi-turn chat. A per-connection timeout (defaulting to 900 seconds) gates all calls to this service.", + "sources": [ + { + "file": "wikifi/providers/ollama_provider.py", + "lines": [ + 50, + 56 + ], + "fingerprint": "0a21916665a5" + } + ] + }, + { + "section_id": "integrations", + "finding": "The orchestrator calls this provider for all LLM inference work. 
The provider, in turn, calls the Ollama service for three interaction modes: schema-constrained structured extraction, unconstrained text generation, and stateful multi-turn conversation. This provider is the exclusive outbound integration boundary between the system and any language model.", + "sources": [ + { + "file": "wikifi/providers/ollama_provider.py", + "lines": [ + 58, + 95 + ], + "fingerprint": "0a21916665a5" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Temperature is hard-pinned to 0 on every structured-output call, enforcing determinism so that identical inputs reliably produce identical structured results across repeated runs. The text and chat paths leave temperature at the model default, accepting variability in exchange for naturalness.", + "sources": [ + { + "file": "wikifi/providers/ollama_provider.py", + "lines": [ + 58, + 68 + ], + "fingerprint": "0a21916665a5" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Qwen3-family models must not be invoked with think=False on the structured-output path: doing so causes the model to bypass the schema constraint and emit free text, which fails downstream validation. The thinking level must be 'low' or higher to preserve schema compliance. For the derivative-section synthesis pass, 'high' thinking is the preferred setting for output quality, but callers must budget 1–3 minutes per file and configure the timeout to at least 900 seconds to absorb that latency.", + "sources": [ + { + "file": "wikifi/providers/ollama_provider.py", + "lines": [ + 14, + 32 + ], + "fingerprint": "0a21916665a5" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The provider exposes three distinct LLM interaction modes: structured extraction (response validated against a caller-supplied schema), open-ended text generation, and multi-turn conversational exchange. 
This separation allows the orchestrator to select the right interaction pattern for each processing stage (per-file extraction vs. derivative synthesis vs. interactive refinement).", + "sources": [ + { + "file": "wikifi/providers/ollama_provider.py", + "lines": [ + 58, + 95 + ], + "fingerprint": "0a21916665a5" + } + ] + } + ] + }, + "wikifi/providers/openai_provider.py": { + "fingerprint": "428df9ba13f1", + "summary": "Implements the OpenAI-hosted LLM backend for wikifi, providing structured-output, free-text, and multi-turn chat completions with automatic prompt-caching and reasoning-effort control.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "external_dependencies", + "finding": "The system depends on OpenAI's hosted language model API for all inference when this provider is selected. It is activated via an environment variable (`WIKIFI_PROVIDER=openai`) and an API key, and supports an optional custom base URL to point at compatible third-party endpoints.", + "sources": [ + { + "file": "wikifi/providers/openai_provider.py", + "lines": [ + 1, + 10 + ], + "fingerprint": "428df9ba13f1" + } + ] + }, + { + "section_id": "integrations", + "finding": "This provider is one of three selectable backends (alongside local and Anthropic-hosted options) consumed by the orchestrator. It implements the shared provider protocol defined in `wikifi/providers/base.py`, exposing structured-JSON, free-text, and multi-turn chat completion methods that the orchestrator calls during per-file extraction and synthesis passes.", + "sources": [ + { + "file": "wikifi/providers/openai_provider.py", + "lines": [ + 1, + 8 + ], + "fingerprint": "428df9ba13f1" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The provider supports three interaction modes: schema-constrained structured output (returning a validated domain object), free-text generation, and stateful multi-turn conversation. 
This covers the full range of interactions wikifi needs — per-file structured extraction, narrative synthesis, and interactive Q&A.", + "sources": [ + { + "file": "wikifi/providers/openai_provider.py", + "lines": [ + 120, + 185 + ], + "fingerprint": "428df9ba13f1" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "API errors are caught and re-raised with a normalised diagnostic message that includes the upstream request identifier when available, preserving traceability across the provider boundary without leaking raw SDK exceptions to callers.", + "sources": [ + { + "file": "wikifi/providers/openai_provider.py", + "lines": [ + 248, + 255 + ], + "fingerprint": "428df9ba13f1" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Prompt caching is exploited automatically by placing the system prompt at message position 0 in every call; the hosted service caches identical long prefixes, reducing latency and cost for the repeated multi-kilobyte extraction prompt that is sent once per source file.", + "sources": [ + { + "file": "wikifi/providers/openai_provider.py", + "lines": [ + 13, + 17 + ], + "fingerprint": "428df9ba13f1" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Reasoning-capable model families (identified by name prefix) must receive output-token limits via a distinct parameter name from standard chat models; sending the wrong parameter to either family causes a request failure. The provider routes the correct parameter unconditionally based on model identity.", + "sources": [ + { + "file": "wikifi/providers/openai_provider.py", + "lines": [ + 215, + 226 + ], + "fingerprint": "428df9ba13f1" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "The `think` (reasoning-effort) knob must only be forwarded to reasoning-capable models; forwarding it to a plain chat model risks a validation error from the hosted service. 
The mapping from wikifi's internal knob values (`low`, `medium`, `high`) to the API's accepted values is fixed and must be preserved.", + "sources": [ + { + "file": "wikifi/providers/openai_provider.py", + "lines": [ + 200, + 214 + ], + "fingerprint": "428df9ba13f1" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "When the hosted service returns a response that cannot be parsed into the expected structured schema (e.g. due to refusal or truncation), the system falls back to direct JSON validation of the raw text rather than returning a null result, preserving the protocol contract that callers always receive a validated object or an explicit error.", + "sources": [ + { + "file": "wikifi/providers/openai_provider.py", + "lines": [ + 150, + 163 + ], + "fingerprint": "428df9ba13f1" + } + ] + } + ] + } + } +} \ No newline at end of file diff --git a/.wikifi/capabilities.md b/.wikifi/capabilities.md index fc72e87..9ced16c 100644 --- a/.wikifi/capabilities.md +++ b/.wikifi/capabilities.md @@ -1,28 +1,107 @@ # Capabilities -### Value Proposition & Core Purpose -The application automates the transformation of raw source artifacts into structured, domain-focused documentation. By systematically analyzing codebases, it extracts business logic, system relationships, and functional capabilities, delivering a living knowledge base that reduces documentation debt, standardizes terminology, and accelerates cross-team onboarding. The system is designed to keep documentation synchronized with implementation without requiring manual authoring overhead. 
- -### Sequential Analysis Workflow -The application operates through a deterministic, four-stage pipeline that progresses from structural discovery to polished documentation: - -| Pipeline Stage | Domain Focus | Primary Output | -|---|---|---| -| **Structural Analysis** | Repository layout evaluation, manifest inspection, and production-relevance classification | Scoped processing boundaries and system purpose inference | -| **Granular Extraction** | File-by-file translation of technical implementations into domain concepts | Schema-validated, technology-agnostic capability notes | -| **Section Synthesis** | Aggregation of extracted notes into cohesive documentation units | Finalized wiki sections with consistent structure and terminology | -| **Cross-Cutting Derivation** | Identification of relationships spanning multiple components | Inferred user personas, behavioral stories, and system interaction diagrams | - -### Key Capabilities -- **Intelligent Traversal & Filtering:** Recursively navigates directory structures while automatically excluding version-controlled noise, large binary assets, and empty stubs. Processing focus is dynamically adjusted to prioritize substantive, domain-relevant files. -- **Domain-Centric Translation:** Strips away implementation-specific syntax to surface underlying business rules, data flows, and functional responsibilities. Technical artifacts are consistently mapped to business-readable concepts. -- **Adaptive Reasoning Depth:** Analytical intensity can be tuned to balance comprehensive detail against processing efficiency, allowing the system to scale from lightweight overviews to deep architectural breakdowns. -- **Workspace Lifecycle Management:** Initializes and maintains a standardized documentation environment, handling section scaffolding, versioning rules, and intermediate state cleanup between generation cycles. 
- -### Quality Assurance & Transparency -- **Explicit Gap Declaration:** When upstream data is incomplete or ambiguous, the system preserves raw evidence and explicitly documents missing information rather than generating speculative content. -- **Execution Reporting:** Produces detailed summaries capturing file inclusion/exclusion metrics, processing counts, and generation status for full auditability and pipeline monitoring. -- **Timestamped Provenance:** Maintains a chronological record of extraction notes per section, enabling traceability from final documentation back to the original source artifacts. - -### Adaptive Configuration -The application supports flexible configuration of analysis parameters, including file size thresholds, content length filters, and traversal depth limits. Analytical interactions are standardized into two operational modes: schema-validated structured generation for systematic processing phases, and free-form analytical generation for narrative documentation and visual representations. This dual-mode approach ensures both machine-readable consistency and human-readable clarity across all generated artifacts. +wikifi analyzes any target codebase and produces a structured, technology-agnostic wiki that captures domain knowledge, system intent, capabilities, external dependencies, integrations, cross-cutting concerns, core entities, and hard specifications — expressed entirely in domain terms rather than in the language of a specific technology stack. + +## Workspace Initialization + +Before analysis begins, the system bootstraps a wiki workspace inside the target project in an idempotent manner, creating the required directory structure, a configuration file, version-control ignore rules, and one placeholder document per defined section. Repeat invocations leave already-existing artifacts untouched. + +## Codebase Analysis Pipeline + +The core pipeline runs in four ordered stages: + +1. 
**Repository introspection** — The system compresses the repository's directory layout and reads key manifest files, then uses this compact view to classify every path as either worth walking (production source, business logic, integrations, domain models) or worth skipping (vendored dependencies, build output, tests, CI/CD). The classification is returned as a structured, diffable result. + +2. **Per-file extraction** — Every in-scope file is routed through one of three extraction paths: + - *Cache replay* — if a file's content is unchanged since the last run, previously stored findings are reused without any further processing. + - *Deterministic schema parsing* — files recognised as structured schema artifacts (SQL DDL, database migrations, API contract specs, interface definition files, and graph schema files) are processed by purpose-built parsers that produce findings about entities, relationships, operations, and constraints without invoking an AI model. + - *AI-assisted extraction* — all remaining files pass through an AI extraction pass; large files are recursively split into overlapping chunks so no content is missed regardless of size. + + Every finding carries a source citation — the originating file path, an inclusive line range, and a content fingerprint — enabling full traceability back to the codebase. + +3. **Cross-file context enrichment** — In parallel with extraction, the system builds an import and reference graph across the entire in-scope file set. Each file's neighborhood (the files it depends on and the files that depend on it) is injected into its extraction prompt, enabling findings to describe inter-file flows rather than treating each file in isolation. + +4. **Section aggregation** — Per-file findings are grouped by their target wiki section and synthesised into readable markdown bodies. Every asserted claim is backed by numbered citations pointing to the originating files and line ranges. 
Where two or more files make incompatible assertions about the same topic, the system surfaces the conflict explicitly in a dedicated *Conflicts in source* block rather than silently resolving it — a deliberate feature for legacy codebases where disagreements encode high-priority migration signals. + +## Wiki Structure + +The generated wiki is organised into **eleven sections**: eight primary sections populated directly from per-file evidence, and three derivative sections synthesised from the completed primaries: + +| Section type | Sections | +|---|---| +| Primary (8) | Business domains, system intent, capabilities, external dependencies, integrations, cross-cutting concerns, core entities, hard specifications | +| Derivative (3) | User personas, Gherkin-style user stories, Mermaid architectural diagrams | + +Derivative sections are only generated after the primaries they depend on are finalised. If upstream primary sections are empty or missing, the system writes a placeholder that declares the gap rather than fabricating content. + +## Quality Assurance + +An optional critic-and-reviser pass evaluates any synthesised section against its brief and the upstream evidence it drew from, producing a structured quality score (0–10) with itemised unsupported claims, gaps, and suggested edits. When a section scores below a configurable threshold, a revision is automatically invoked; the revision is accepted only if it matches or improves the original score, preventing regressions. This loop is particularly valuable for derivative sections — personas and user stories — where single-shot synthesis is most prone to introducing unsupported assertions. + +## Incremental and Resumable Walks + +The pipeline uses a two-scope content-addressed cache: per-file extraction results are keyed to a combination of file path and content fingerprint, and per-section aggregation results are keyed to a digest of the contributing notes payload. 
Only changed files and affected sections are reprocessed on incremental runs. Because results are persisted after every completed file, an interrupted walk resumes from the last unprocessed file rather than restarting from scratch. The cache can also be fully invalidated to force a clean re-walk. + +## Coverage and Quality Reporting + +A report command produces a human-readable markdown table summarising every wiki section by contributing file count, finding count, body size, optional critic-derived quality score, and the highest-priority content gap identified by the critic. Coverage statistics also surface *dead zones* — files that were processed but produced no findings — so teams can identify blind spots in the analysis. + +## Interactive Knowledge Querying + +Once a wiki has been generated, users can open an interactive conversational session grounded in all populated sections. The session supports multi-turn exchanges, conversation history reset, and introspection of which sections are currently loaded as context. Only meaningfully populated sections are included, ensuring the assistant is not grounded in placeholder content. + +## Graceful Degradation + +When AI synthesis fails for a section, the system falls back to emitting the raw collected notes directly in the section body, preserving information at the cost of polish and surfacing the error inline. Similarly, unparseable schema files produce an advisory finding directing reviewers to inspect the file manually rather than silently failing. + +## Sources +1. `VISION.md:6-8` +2. `wikifi/sections.py:44-142` +3. `README.md:14-24` +4. `wikifi/orchestrator.py:62-76` +5. `wikifi/wiki.py:64-86` +6. `wikifi/introspection.py:28-44` +7. `wikifi/introspection.py:61-70` +8. `wikifi/walker.py:92-186` +9. `wikifi/extractor.py:140-200` +10. `wikifi/cache.py:5-8` +11. `README.md:34-36` +12. `TESTING-AND-DEMO.md:116-149` +13. `wikifi/config.py:75-81` +14. `wikifi/repograph.py:41-52` +15. 
`wikifi/specialized/__init__.py:46-57` +16. `wikifi/specialized/sql.py:56-62` +17. `wikifi/extractor.py:251-270` +18. `TESTING-AND-DEMO.md:40-66` +19. `wikifi/aggregator.py:1-15` +20. `TESTING-AND-DEMO.md:90-114` +21. `wikifi/config.py:69-74` +22. `wikifi/extractor.py:241-246` +23. `wikifi/repograph.py:155-210` +24. `wikifi/evidence.py:88-121` +25. `wikifi/aggregator.py:9-14` +26. `wikifi/evidence.py:13-17` +27. `VISION.md:53-63` +28. `wikifi/deriver.py:73-107` +29. `TESTING-AND-DEMO.md:151-164` +30. `wikifi/config.py:83-94` +31. `wikifi/critic.py:100-153` +32. `wikifi/deriver.py:90-103` +33. `TESTING-AND-DEMO.md:67-88` +34. `wikifi/cache.py:9-12` +35. `wikifi/config.py:63-68` +36. `wikifi/cache.py:14-18` +37. `wikifi/cache.py:105-113` +38. `README.md:16-20` +39. `wikifi/cli.py:88-112` +40. `README.md:21-23` +41. `TESTING-AND-DEMO.md:166-186` +42. `wikifi/critic.py:155-180` +43. `wikifi/report.py:44-77` +44. `wikifi/report.py:103-107` +45. `README.md:24-25` +46. `wikifi/chat.py:88-130` +47. `wikifi/chat.py:63-82` +48. `wikifi/cli.py:60-220` +49. `wikifi/aggregator.py:272-285` +50. `wikifi/specialized/openapi.py:23-50` diff --git a/.wikifi/config.toml b/.wikifi/config.toml index 571ab5e..28ed551 100644 --- a/.wikifi/config.toml +++ b/.wikifi/config.toml @@ -1,4 +1,4 @@ # wikifi local config — overrides WIKIFI_* environment variables when present. -provider = "ollama" -model = "qwen3.6:27b" +provider = "anthropic" +model = "claude-sonnet-4-6" ollama_host = "http://localhost:11434" diff --git a/.wikifi/cross_cutting.md b/.wikifi/cross_cutting.md index aa10602..d504546 100644 --- a/.wikifi/cross_cutting.md +++ b/.wikifi/cross_cutting.md @@ -1,25 +1,124 @@ # Cross-Cutting Concerns -### Observability and Monitoring -The system maintains comprehensive visibility into pipeline execution through dynamic logging and structured progress tracking. Logging verbosity adjusts based on operational context, supporting both routine monitoring and deep debugging. 
Each processing stage emits standardized progress markers, while metric tracking and warning logs capture anomalies such as missing sections or synthesis failures. System identification metadata is exposed to facilitate version tracking and compatibility verification across deployments. +## Observability -### Data Integrity and Traceability -Content generation adheres to strict evidence-based principles, explicitly prohibiting fabrication when upstream sources lack relevant information. Every derivative output is validated against a predefined schema to prevent malformed data from propagating downstream. When synthesis encounters failures, fallback mechanisms preserve raw findings, guaranteeing that all documentation sections receive either synthesized content, structured placeholders, or unprocessed source material. Traceability is maintained by linking generated content directly to its originating evidence, and deterministic processing order ensures consistent evaluation of source artifacts. +A consistent, pipeline-wide observability model spans every stage of the system. Structured logging is initialised once and reused across all subcommands; a single verbose flag activates debug-level output globally without each subsystem needing its own toggle. Stage-boundary log events are emitted at each major transition — repository introspection, dependency-graph construction, file extraction, section aggregation, and derivative synthesis — so operators can pinpoint where a long walk is spending time. Revision and quality-scoring events are counted in the run's statistics, and cache hit counts are surfaced in the post-walk report, giving a quantitative picture of incremental efficiency. -### State Management and Data Storage -Workspace initialization and intermediate data resets are designed to be idempotent, preventing state corruption during repeated or interrupted executions. 
The system enforces strict format contracts for configuration files, documentation outputs, and intermediate logs to maintain structural consistency. Working state is automatically isolated from committed artifacts, preserving version control hygiene and ensuring that transient processing data does not interfere with finalized documentation. +## Resilience and Error Handling -### Operational Guardrails and Determinism -Runtime behavior is governed by centralized, environment-driven configuration that standardizes parameters across execution contexts. To prevent resource exhaustion and processing delays, the system enforces several operational limits: +The system is designed so that no single failure can abort an entire pipeline run. Extraction failures — whether caused by an inference provider or a specialised deterministic parser — are logged and tallied but never propagated upward; a file whose processing fails entirely is recorded as skipped, and partially-recovered files retain whatever findings were salvaged. Aggregation and derivation failures follow the same pattern: errors are caught and logged at warning level, and a fallback body that preserves the raw upstream evidence is written so the wiki remains inspectable. Quality-assurance (critic and reviser) failures degrade gracefully to returning the original body with a diagnostic score of zero rather than halting. Provider failures during interactive query sessions are surfaced inline without terminating the session. Across all provider backends, raw infrastructure errors are caught at the provider boundary and re-raised as a normalised internal error type carrying the upstream request identifier when available, so the rest of the pipeline does not branch on provider-specific exception shapes. 
-| Constraint | Purpose | -|---|---| -| Request Timeouts | Accommodate variable processing durations while preventing indefinite hangs | -| File Size & Content Caps | Filter out oversized or trivial inputs to conserve computational resources | -| Reasoning Mode Controls | Balance depth of analysis against execution speed | -| Determinism Parameters | Ensure reproducible outputs for identical inputs across runs | +## Content-Addressed Caching and Crash Resumability -The pipeline incorporates graceful degradation strategies to handle read errors, parsing failures, and permission restrictions during directory traversal without halting execution. +All expensive inference work is protected by a two-scope content-addressed cache stored under a dedicated hidden subdirectory within the wiki output directory, inheriting the same version-control ignore rules as other working-state artifacts. -### Authentication and Authorization -The provided notes do not contain information regarding access controls, credential management, or authorization mechanisms. This area remains undocumented and should be addressed separately to ensure secure handling of sensitive source materials and generated artifacts. +- **Extraction scope:** each file's results are keyed by the combination of its relative path and a stable hash of its raw bytes. Any unchanged file is skipped on re-walk with no inference call. +- **Aggregation scope:** each section's synthesised body is keyed by a deterministic digest of its note payload. Unchanged inputs reuse the stored body and evidence bundle. + +Cache entries are written after every individual file completes, so a mid-walk crash loses at most one file's work. Writes are performed atomically — content is first written to a temporary location and then renamed into place — preventing corrupt partial writes. 
Malformed entries are silently dropped and logged rather than causing a hard failure, so a partially corrupt cache degrades gracefully to a fresh extraction for only the affected entries. A monotonically increasing version tag is embedded in every persisted cache file; a version mismatch on load causes the entire cache to be discarded and rebuilt, providing a controlled invalidation path across software upgrades. Between runs, entries for files no longer in scope are pruned automatically. + +## Input Integrity Guards + +A layered set of guards prevents low-signal or pathological inputs from ever reaching the inference layer. + +| Guard | Threshold | Effect | +|---|---|---| +| Minimum content size | 64 bytes (stripped) | File silently skipped | +| Maximum file size | 2 MB | File silently skipped | +| Large-file windowing | 150 KB – 2 MB | File split into overlapping chunks with 8 KB overlap | +| Manifest truncation | 20 000 bytes | Hard-truncated with visible marker | +| Per-request timeout | 900 seconds | Uniform backstop across all providers | + +Directory traversal prunes excluded subtrees before descending into them, so ignore patterns are applied efficiently at the directory level rather than file-by-file. Files carrying no extractable intent — stub initialisers, empty fixtures, generated lockfiles — are identified and dropped before reaching the inference layer; the invariant that a single empty or unstructured file must never stall the walk is explicitly upheld. Findings produced from the overlap region between adjacent large-file chunks are deduplicated by section and normalised text within each file's pass, preventing double-counting in downstream aggregation. + +## Provider Abstraction + +All inference calls — structured extraction, free-text generation, and multi-turn chat — are routed through a single provider abstraction layer. 
This boundary is where observability hooks, retry logic, error normalisation, and backend-switching concerns live; no extraction or aggregation logic needs knowledge of which backend is active. Supported backend shapes include local inference runtimes and hosted services; the local-inference path is the default, with hosted options as addenda, and swapping between them requires no changes outside the provider boundary. + +Structured-output calls enforce a schema-validation contract: the model response must be validated against a declared schema before being returned to the caller, ensuring type-safe data flows through every pipeline stage. To maximise determinism, temperature is hard-pinned to zero on all structured-output calls; free-text and conversational paths accept model-default variability in exchange for naturalness. + +When a backend exposes a reasoning-depth control, the system runs at the highest available setting, prioritising output quality over walk speed. A configurable depth parameter is translated into the provider's native adaptive-thinking feature, allowing callers to trade latency and cost against quality without branching on provider type in shared pipeline code. + +Hosted backends employ prompt-caching strategies — placing the large, repeated system prompt at a fixed position in every request so the service can serve subsequent calls from a cached prefix — making large-scale walks economically viable by paying full input cost only on the first call and a fraction of that cost on subsequent ones. + +## Source Traceability and Hallucination Prevention + +Full source traceability is a non-negotiable structural invariant: every assertion in every wiki section must be linkable back to the originating file and, where available, the precise line range within it. This is enforced through typed evidence structures (claims and source references) rather than by convention, so the constraint cannot be silently bypassed. 
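As a rough illustration of how typed evidence structures can make traceability hard to bypass, the sketch below models a source reference and a claim as frozen dataclasses. The class names `SourceRef` and `Claim` come from the commit description; the field names, the `label` and `render_sources` helpers, and the citation format are assumptions made for this example and may differ from `wikifi/evidence.py`.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class SourceRef:
    path: str                                 # repo-relative file path
    lines: Optional[Tuple[int, int]] = None   # inclusive line range, if known
    fingerprint: str = ""                     # content hash captured at extraction time

    def label(self) -> str:
        """Format as `path`, `path:N`, or `path:N-M`."""
        if self.lines is None:
            return self.path
        start, end = self.lines
        return f"{self.path}:{start}" if start == end else f"{self.path}:{start}-{end}"


@dataclass(frozen=True)
class Claim:
    text: str
    sources: Tuple[SourceRef, ...] = ()

    @property
    def supported(self) -> bool:
        # A claim without sources is a valid but explicitly unsupported state.
        return len(self.sources) > 0


def render_sources(claims: Iterable[Claim]) -> str:
    """Build a numbered, de-duplicated source list in first-seen order."""
    labels: List[str] = []
    for claim in claims:
        for ref in claim.sources:
            label = ref.label()
            if label not in labels:
                labels.append(label)
    return "\n".join(f"{i}. `{label}`" for i, label in enumerate(labels, start=1))
```

Because every claim carries its references as data rather than as free text, a renderer can refuse to emit, or visibly flag, any claim whose `supported` property is false.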
+ +Hallucination prevention operates at two additional levels. First, the inference prompt explicitly instructs the model never to name specific technologies, translating all observations into domain terms — this is a mandatory invariant enforced at the prompt layer. Second, upstream section content that matches known placeholder shapes is filtered out before derivative synthesis, preventing empty or stub sections from being treated as real evidence; these same sentinel strings are used by the quality-report layer to exclude placeholder sections from scoring. Interactive query sessions are similarly grounded: the assistant is instructed to explicitly acknowledge when the wiki does not cover a topic rather than generating unsupported answers. + +Content fingerprints serve a triple cross-cutting role: keying both extraction and aggregation caches so stale results are never served, anchoring source-evidence citations so claims can be re-verified against a fresh repository walk, and tracking file identity inside the dependency graph so cross-file context is invalidated when any contributing source changes. Files are always fingerprinted as raw bytes rather than decoded text to ensure the cache layer and the extractor agree on identity regardless of encoding assumptions. + +## Authentication and Storage Invariants + +Specialised deterministic parsers extract security and data-integrity contracts from high-signal artifacts and surface them as first-class cross-cutting concerns that must be preserved through any migration: + +- **Authentication schemes** declared in API contract files are extracted and categorised by type, flagging which security contracts (key-based, delegated authorisation, bearer-token, etc.) the new system must honour. +- **Data integrity constraints** — uniqueness and non-nullability — found in schema definitions are extracted as storage invariants explicitly marked as migration-critical. 
+- **Query-performance invariants** — index definitions — are recorded with an explicit note that the new system must preserve equivalent access patterns. + +All specialised parsers return results in the same structured shape as the general inference extractor, so the aggregation layer needs no knowledge of which extraction path was taken; this uniform interface contract is itself an invariant that must be preserved. + +## Data Storage Layout + +The pipeline's working state is isolated to a single hidden directory within the repository: + +- **Rendered section documents** live at the root of this directory and are intended to be committed to version control. +- **Per-section extraction notes** (JSONL, each record UTC-timestamped) are stored in a notes subdirectory and excluded from version control via a generated ignore file. +- **Extraction and aggregation caches** are stored in a cache subdirectory and similarly excluded. + +Deleting the cache subdirectory forces a full re-walk; deleting the entire working directory resets all pipeline state. This layout ensures generated documentation commits remain clean and the boundary between committed outputs and ephemeral working state is unambiguous. + +## Sources +1. `wikifi/cli.py:51-60` +2. `wikifi/orchestrator.py:84-148` +3. `wikifi/cli.py:90-97` +4. `wikifi/deriver.py:110-135` +5. `wikifi/report.py:22` +6. `wikifi/extractor.py:228-242` +7. `wikifi/aggregator.py:143-152` +8. `wikifi/deriver.py:96-107` +9. `wikifi/critic.py:158-165` +10. `wikifi/chat.py:120-125` +11. `wikifi/providers/anthropic_provider.py:238-244` +12. `wikifi/providers/openai_provider.py:248-255` +13. `README.md:40-43` +14. `wikifi/fingerprint.py:1-18` +15. `wikifi/aggregator.py:126-155` +16. `TESTING-AND-DEMO.md:67-88` +17. `wikifi/extractor.py:155-175` +18. `wikifi/cache.py:189-193` +19. `wikifi/cache.py:196-222` +20. `wikifi/cache.py:38` +21. `wikifi/orchestrator.py:95-110` +22. `wikifi/cache.py:19-21` +23. `wikifi/config.py:56-59` +24. 
`wikifi/walker.py:61-79` +25. `wikifi/config.py:38-56` +26. `wikifi/walker.py:220-231` +27. `.env.example:16-29` +28. `wikifi/config.py:33-34` +29. `wikifi/walker.py:133-143` +30. `README.md:44-46` +31. `VISION.md:99-100` +32. `wikifi/extractor.py:253-262` +33. `CLAUDE.md:53-54` +34. `VISION.md:92-96` +35. `wikifi/providers/base.py:36-38` +36. `wikifi/providers/ollama_provider.py:58-68` +37. `VISION.md:97-98` +38. `wikifi/providers/anthropic_provider.py:212-232` +39. `wikifi/providers/anthropic_provider.py:193-210` +40. `wikifi/providers/openai_provider.py:13-17` +41. `wikifi/evidence.py:1-18` +42. `wikifi/aggregator.py:54-67` +43. `wikifi/deriver.py:118-135` +44. `wikifi/report.py:118-123` +45. `wikifi/chat.py:27-31` +46. `wikifi/fingerprint.py:44-50` +47. `wikifi/specialized/openapi.py:110-121` +48. `wikifi/specialized/sql.py:97-98` +49. `wikifi/specialized/sql.py:113-125` +50. `wikifi/specialized/__init__.py:9-13` +51. `TESTING-AND-DEMO.md:249-265` +52. `wikifi/wiki.py:96-121` diff --git a/.wikifi/diagrams.md b/.wikifi/diagrams.md index f7588d9..5a0e4c3 100644 --- a/.wikifi/diagrams.md +++ b/.wikifi/diagrams.md @@ -1,137 +1,283 @@ # Diagrams -### Domain Map -The following graph visualizes the bounded contexts within the core domain of Automated Knowledge Translation. It reflects the strict, stage-gated dependency chain and the cross-cutting nature of external intelligence integration. 
- -```mermaid -graph TD - Core[Core Domain: Automated Knowledge Translation] - Introspection[Repository Introspection & Curation\nSupporting] - Extraction[Semantic Extraction & Analysis\nCore] - Aggregation[Information Aggregation & Synthesis\nCore] - Orchestration[Pipeline Orchestration & Lifecycle Management\nSupporting] - External[External Intelligence Integration\nGeneralized] - - Core --> Introspection - Introspection -->|Curated artifacts & structural metadata| Extraction - Extraction -->|Structured knowledge units & analysis results| Aggregation - Aggregation -->|Synthesized content & workspace population| Orchestration - - External -.->|On-demand pattern resolution & narrative generation| Extraction -``` - -**Key Observations:** -- Data flows unidirectionally through the pipeline, with intermediate states explicitly persisted between stages to support incremental processing, auditability, and fault tolerance. -- External Intelligence Integration operates as a generalized, cross-cutting capability invoked on-demand within the extraction context rather than dictating pipeline progression. -- Orchestration and workspace lifecycle management responsibilities currently overlap; future modeling may require separating execution coordination from directory/configuration governance. - -### Entity Relationship View -This entity-relationship diagram maps the core domain entities, their primary fields, and the structural boundaries that govern data transformation from raw repository scanning to final documentation assembly. 
- -```mermaid -erDiagram - CONFIGURATION ||--o{ SCAN_TRAVERSAL_CONFIG : "defines" - SCAN_TRAVERSAL_CONFIG ||--o{ DIRECTORY_SUMMARY : "scopes" - DIRECTORY_SUMMARY ||--|| INTROSPECTION_ASSESSMENT : "generates" - INTROSPECTION_ASSESSMENT ||--o{ EXTRACTION_NOTE : "guides" - EXTRACTION_NOTE }o--|| DOCUMENTATION_SECTION : "aggregates_to" - DOCUMENTATION_SECTION ||--o{ AGGREGATION_STATS : "updates" - DOCUMENTATION_SECTION ||--o{ WORKSPACE_LAYOUT : "populates" - EXECUTION_SUMMARY }o--|| PIPELINE_EXECUTION : "observes" - - CONFIGURATION { - string default_settings - string local_overrides - } - SCAN_TRAVERSAL_CONFIG { - string root_path - string inclusion_exclusion_patterns - number size_thresholds - } - DIRECTORY_SUMMARY { - number file_count - number total_size - string extension_distribution - boolean manifest_presence - } - INTROSPECTION_ASSESSMENT { - string primary_languages - string inferred_purpose - string classification_rationale - } - EXTRACTION_NOTE { - datetime timestamp - string file_reference - string role_summary - string extracted_finding - } - DOCUMENTATION_SECTION { - string category - string aggregated_content - string final_markdown_body - } - AGGREGATION_STATS { - number successful_writes - number empty_section_count - } - WORKSPACE_LAYOUT { - string config_paths - string notes_paths - string sections_paths - } - EXECUTION_SUMMARY { - string stage_metrics - string completion_status - string consolidated_findings - } -``` - -**Key Observations:** -- Configuration entities establish hard boundaries for traversal and analysis, ensuring processing never exceeds defined size constraints or excluded paths. -- Extraction notes are immutable, timestamped records tied to single source files, serving as the raw material for downstream aggregation. -- Aggregation statistics and the execution summary function as cross-cutting observers, tracking pipeline health and output readiness without interfering with the primary data flow. 
-- **Known Gap:** The exact mapping rules between intermediate extraction notes and final documentation sections are implied but not explicitly detailed. Further specification is required to define how notes are grouped, prioritized, or filtered during section assembly, and how empty sections are resolved or reported upstream. - -### Integration Flow -The sequence diagram below illustrates the internal pipeline handoffs and external interface interactions. It captures the staged execution model, centralized orchestration, and abstracted external dependencies. - -```mermaid -sequenceDiagram - participant CLI as CLI Interface - participant Orch as Orchestrator - participant Traversal as Traversal & Introspection - participant Extractor as Source Analysis & Extraction - participant Aggregator as Content Aggregation - participant Deriver as Derivative Generation - participant AI as Generative AI Services - participant Telemetry as Observability & Telemetry - participant Storage as Wiki Storage - - CLI->>Orch: Trigger execution / provision workspace - Orch->>Traversal: Delegate scanning & structural analysis - Traversal->>Traversal: Apply path filters & size constraints - Traversal-->>Orch: Return filtered paths & metadata - Orch->>Extractor: Delegate artifact analysis - Extractor->>AI: Request pattern resolution / narrative generation (on-demand) - AI-->>Extractor: Return processed findings - Extractor->>Telemetry: Log processing metrics & outcomes - Extractor-->>Orch: Return structured analysis notes - Orch->>Aggregator: Delegate content synthesis - Aggregator->>AI: Request section-level synthesis - AI-->>Aggregator: Return synthesized markdown - Aggregator->>Storage: Write documentation sections - Aggregator-->>Orch: Return aggregation statistics - Orch->>Deriver: Delegate supplementary content generation - Deriver->>AI: Request derivative synthesis - AI-->>Deriver: Return derivative documentation - Deriver->>Storage: Write derivative artifacts - Deriver-->>Orch: 
Confirm completion - Orch->>Orch: Consolidate metrics & generate execution summary - Orch-->>CLI: Report pipeline health & output readiness -``` - -**Key Observations:** -- The orchestrator acts as the central coordinator, delegating execution to specialized components in a strict sequence while maintaining a single source of truth for pipeline health. -- All external dependencies are routed through standardized contracts, isolating core business logic from provider-specific implementations and enabling swappable analytical backends. -- Observability and telemetry are integrated directly into the extraction stage to monitor processing metrics and record analysis outcomes in real time. -- **Known Gaps:** The integration contracts do not specify exact data schemas or serialization formats for inter-module handoffs. Error handling, retry policies, fallback mechanisms for external service degradation, authentication/rate-limiting constraints, and versioning guarantees between pipeline stages remain undefined and require clarification in implementation documentation. +_Derivation failed for **Diagrams** (anthropic provider: empty parsed_output and parse fallback failed: 1 validation error for DerivedSection + Invalid JSON: EOF while parsing a value at line 1 column 0 [type=json_invalid, input_value='', input_type=str] + For further information visit https://errors.pydantic.dev/2.13/v/json_invalid). Upstream evidence preserved below._ + +> Brief: Mermaid diagrams that visualize structural and behavioral relationships across the system: a domain map (graph or classDiagram across domains), an entity relationship view (erDiagram across entities), and an integration flow (sequence or flowchart across integrations). Tech-agnostic — no reference to current stack. 
+ + +## From domains +# Domains and Subdomains + +## Core Domain + +The system's core domain is **codebase knowledge extraction**: ingesting an existing source base, classifying its contents, deriving domain findings from individual files, and synthesising those findings into a structured, technology-agnostic wiki. The primary consumers are migration teams who need to understand business intent, domain structure, and operational behaviour before re-implementing or replacing a legacy system. + +## Subdomains + +### Repository Introspection +This subdomain concerns discovering and classifying the files that make up a target codebase. Its central responsibility is distinguishing production source that encodes business intent from infrastructure, tooling, and other artefacts that do not. Tech-agnosticism is a first-class constraint here: the classification logic must not rely on recognising any specific language, framework, or runtime. + +### Per-File Knowledge Extraction +Once relevant files are identified, each is analysed independently to surface domain findings. This subdomain covers the full extraction loop — examining file content, applying domain heuristics, and producing structured evidence — and forms the first phase of wiki generation (primary sections). + +### Section Synthesis and Aggregation +The second phase of wiki generation operates over the evidence produced by per-file extraction. It aggregates findings across files into coherent wiki sections, derives higher-level content that cannot be inferred from any single file, and enforces the dependency ordering between primary (evidence-driven) and derivative (aggregated) sections. This ordering is a structural design constraint, not merely a runtime convention. + +### Wiki Authoring and Organisation +A secondary domain governs how extracted knowledge is structured and stored. 
It defines the taxonomy of sections, distinguishes primary from derivative content, and produces output that a migration team can navigate and consume independently of the source codebase. + +### Interactive Knowledge Retrieval +A supporting subdomain exposes the generated wiki to conversational or query-driven access, allowing stakeholders to interrogate extracted knowledge without directly inspecting raw wiki files. + +## Cross-Cutting Constraint: Tech-Agnosticism +Tech-agnosticism spans every subdomain. All analysis, extraction, and synthesis must produce domain-level descriptions that are free of references to specific languages, frameworks, or libraries. This constraint is enforced at both the classification stage (repository introspection) and the output stage (section content). + +## Subdomain Relationships + +| Subdomain | Role | Depends On | +|---|---|---| +| Repository Introspection | Identifies source worth analysing | — | +| Per-File Knowledge Extraction | Produces primary section evidence | Introspection | +| Section Synthesis & Aggregation | Produces derivative sections | Per-File Extraction | +| Wiki Authoring & Organisation | Structures and stores the wiki | Synthesis | +| Interactive Knowledge Retrieval | Queries the completed wiki | Authoring | + +## Sources +1. `README.md:28-52` +2. `VISION.md:3-20` +3. `wikifi/cli.py:1-8` +4. `wikifi/introspection.py:19-44` +5. `wikifi/sections.py:1-19` + +## From entities +# Core Entities + +The system's domain model spans five functional layers — wiki structure, file classification, extraction, evidence, and review — plus supporting entities for caching, derivation, and chat. + +--- + +## Wiki Structure + +**Section** is the central organizing entity. Each section carries a unique identifier, a human-readable title, a prose description of what belongs in it, and a tier (primary or derivative). 
Derivative sections additionally declare an ordered list of upstream section identifiers they depend on, forming an explicit dependency graph. An invariant holds at startup: every upstream identifier in a derivative section's dependency list must refer to a section that appears earlier in the canonical ordering (topological sort enforced). + +**WikiLayout** is an immutable value object that encodes the on-disk structure of a wiki workspace. Given a project root, it derives all canonical sub-paths: the wiki directory, configuration file, gitignore file, notes directory, per-section markdown files, and per-section note files. No fields are mutable after construction. + +**WalkConfig** is an immutable configuration record consumed by the filesystem walker. It captures the repository root, extra exclusion patterns, a flag for honouring ignore rules, a maximum file size in bytes, and a minimum stripped-content size in bytes. + +--- + +## File Classification and Graph + +**FileKind** is a closed enumeration of seven mutually exclusive file roles: application code, SQL, OpenAPI, Protobuf, GraphQL, migration, and other. This classification determines whether a file is routed to a specialised deterministic parser or the general-purpose extraction path. + +**GraphNode** represents a single file's position in the repository's import graph. It carries the file's repo-relative path, the ordered set of files it imports, and the ordered set of files that import it. It exposes a capped combined-neighbour list for inclusion in extraction prompts. + +**RepoGraph** holds the complete import-edge map for a repository scan. It supports node lookup by path and retrieval of a capped neighbour list for any given file, providing cross-file context during extraction. 
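The import-graph entities above might look something like the following sketch. It assumes a simple in-memory adjacency representation; the `add_edge` method, the default cap of 5, and the field names are illustrative choices rather than the actual `wikifi/repograph.py` API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GraphNode:
    path: str
    imports: List[str] = field(default_factory=list)      # files this file imports
    imported_by: List[str] = field(default_factory=list)  # files that import this file

    def neighbours(self, cap: int = 5) -> List[str]:
        """Combined neighbour list, de-duplicated in order and capped."""
        ordered: Dict[str, None] = {}
        for other in [*self.imports, *self.imported_by]:
            if other != self.path:
                ordered.setdefault(other, None)
        return list(ordered)[:cap]


@dataclass
class RepoGraph:
    nodes: Dict[str, GraphNode] = field(default_factory=dict)

    def add_edge(self, importer: str, imported: str) -> None:
        """Record that `importer` imports `imported`, creating nodes as needed."""
        src = self.nodes.setdefault(importer, GraphNode(importer))
        dst = self.nodes.setdefault(imported, GraphNode(imported))
        if imported not in src.imports:
            src.imports.append(imported)
        if importer not in dst.imported_by:
            dst.imported_by.append(importer)

    def neighbours(self, path: str, cap: int = 5) -> List[str]:
        """Capped neighbour list for a file; empty if the file is unknown."""
        node = self.nodes.get(path)
        return node.neighbours(cap) if node else []
```

Capping the neighbour list keeps the extra context injected into an extraction prompt bounded, even for files that sit at the centre of the import graph.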
+ +**DirSummary** is a value object holding aggregate statistics for a single (non-recursive) directory: its repo-relative path, file count, total byte size, a frequency map of the top-10 file extensions, and a tuple of notable filenames (manifests, readmes) present in that directory. + +--- + +## Extraction Layer + +**SectionFinding** represents one file's contribution to one wiki section. It carries the target section identifier, a technology-agnostic prose description of the contribution, and an optional inclusive line range within the source chunk. + +**FileFindings** groups a one-sentence summary of a file with all `SectionFinding` records produced for it. + +**SpecializedFinding** is the output unit of the deterministic parsing paths. It carries a section identifier, a human-readable description, and a list of source references. **SpecializedResult** groups zero or more such findings with an optional summary string; this is the uniform output contract for all specialised extractors, ensuring interoperability with the general extraction path downstream. + +**ExtractionStats** is a walk-level counter record, accumulating: total files seen, files yielding at least one finding, total findings, skipped files, chunks processed, cache hits, specialised-extractor invocations, and a per-kind file breakdown. + +--- + +## Evidence Layer + +**SourceRef** represents a single span of source: a repo-relative file path, an optional inclusive line range, and a short content fingerprint captured at extraction time for change detection. + +**Claim** represents one assertion placed in a wiki section. It carries the markdown text and a list of `SourceRef` values that justify it. A claim with no sources is explicitly marked unsupported — this is a first-class state, not an error. + +**Contradiction** groups two or more conflicting `Claim` objects about the same topic under a single summary sentence. 
Each disagreeing position retains its own source references, preserving full traceability. + +**EvidenceBundle** is the aggregator's structured output for a single wiki section. It combines the narrative body text, a list of `Claim` records, and a list of `Contradiction` records. The renderer uses the bundle to thread numbered citations and a conflicts block into the final markdown. + +During aggregation, the pipeline works with intermediate forms: **AggregatedClaim** pairs a single prose assertion with the 1-based indices of the input notes that support it, and **AggregatedContradiction** holds a one-sentence summary alongside multiple conflicting positional claims, each with its own note indices. These are the structured forms that the language model produces before being resolved into the full evidence model. + +--- + +## Cache Entities + +**CachedFindings** stores the extraction result for a single file: the file's content fingerprint, the list of structured findings produced, a one-sentence summary, and a count of processed chunks. Its invariant is content-addressed — the fingerprint is the cache key. + +**CachedSection** stores the aggregation result for a single wiki section: the hash of the notes payload that produced it, the rendered markdown body, and lists of claims and contradictions. It too is content-addressed on the notes hash. + +**WalkCache** is the in-memory container for both caches. It holds extraction and aggregation entries alongside hit and miss counters, enabling observability into cache effectiveness across a run. + +--- + +## Quality and Review Layer + +**Critique** captures the quality assessment of a single section: an integer score (0–10), a short overall judgment, a list of unsupported claims, a list of gaps relative to the section brief, and a list of concrete revision suggestions. 
+ +**ReviewOutcome** tracks a section's review lifecycle: the section identifier, the initial critique, the current body text, a flag indicating whether a revision was applied, and the optional follow-up critique produced after revision. + +**WikiQualityReport** aggregates the full-wiki audit: an overall numeric score, a mapping from section identifiers to their individual `Critique` records, and optional coverage statistics. + +**CoverageStats** records total files seen, files with findings, and per-section breakdowns of finding counts and contributing file counts; it exposes a coverage-percentage computation. + +**SectionReport** captures the per-section view for reporting: the section descriptor, count of contributing files, total findings count, body size in characters, an emptiness flag, and an optional quality critique. + +**WikiReport** aggregates all `SectionReport` records alongside overall coverage statistics and an optional mean quality score across populated sections. + +--- + +## Derivation and Pipeline Outputs + +**IntrospectionResult** captures the Stage 1 decision about which files are worth deeper analysis: a list of gitignore-style include patterns, a list of exclude patterns, a list of primary languages (informational), a one-paragraph guess at the system's purpose, and a rationale for the choices made. + +**AggregationStats** records, for a single aggregation run, how many sections were written fresh, skipped due to empty notes, or served from cache. + +**DerivationStats** accumulates pipeline metrics for the derivation stage: counts of sections derived, skipped, and revised, plus the full list of `ReviewOutcome` records. It acts as an audit trail for the synthesis stage. + +**WalkReport** is the single return value of a completed wiki-generation run, aggregating the introspection result, extraction statistics, aggregation statistics, derivation statistics, the live cache state, and the repository import graph. 
+ +--- + +## Chat Layer + +**ChatMessage** carries a role and a content field, representing a single turn in a multi-turn conversation. Lists of these are accumulated to maintain conversation history. + +**LoadedSection** pairs a `Section` descriptor with its rendered markdown body, representing a single populated section ready for inclusion in a chat context. + +**ChatSession** holds a provider reference, the frozen system prompt built from wiki sections, and the accumulated conversation history as an ordered list of `ChatMessage` records. It supports appending user and assistant turns and clearing history while retaining the wiki context. + +--- + +## Relationships and Invariants Summary + +| Entity | Key relationships | Notable invariants | +|---|---|---| +| Section | depends on upstream Sections (derivative tier only) | Dependency graph must be topologically ordered | +| WikiLayout | derived from a project root | Immutable; all paths are computed, not stored independently | +| SourceRef | referenced by Claim, SpecializedFinding | Fingerprint enables staleness detection | +| Claim | groups SourceRefs; composed into EvidenceBundle | Sourceless claims are explicitly flagged unsupported | +| Contradiction | groups ≥2 conflicting Claims | Each position retains its own SourceRefs | +| CachedFindings | keyed on file content fingerprint | Cache miss if fingerprint changes | +| CachedSection | keyed on notes-payload hash | Cache miss if any upstream note changes | +| ReviewOutcome | holds pre- and post-revision Critique | Revision flag distinguishes touched from untouched sections | +| WalkReport | aggregates all four stage outputs | Single return value for a complete run | + +## Sources +1. `wikifi/sections.py:30-40` +2. `wikifi/deriver.py:112-116` +3. `wikifi/cli.py:166-172` +4. `wikifi/wiki.py:34-61` +5. `wikifi/walker.py:61-79` +6. `README.md:31-33` +7. `wikifi/repograph.py:41-52` +8. `wikifi/repograph.py:148-167` +9. `wikifi/repograph.py:170-181` +10. 
`wikifi/walker.py:144-153` +11. `wikifi/extractor.py:106-123` +12. `wikifi/specialized/__init__.py:29-38` +13. `wikifi/extractor.py:126-135` +14. `wikifi/evidence.py:37-52` +15. `README.md:37-39` +16. `wikifi/evidence.py:55-67` +17. `wikifi/aggregator.py:166-186` +18. `wikifi/evidence.py:70-77` +19. `wikifi/aggregator.py:74-101` +20. `README.md:46-48` +21. `wikifi/evidence.py:80-85` +22. `wikifi/cache.py:44-51` +23. `wikifi/cache.py:54-60` +24. `wikifi/cache.py:63-70` +25. `wikifi/aggregator.py:103-107` +26. `wikifi/critic.py:67-84` +27. `wikifi/critic.py:91-96` +28. `wikifi/critic.py:99-114` +29. `wikifi/report.py:85-94` +30. `wikifi/report.py:28-42` +31. `wikifi/introspection.py:47-64` +32. `wikifi/deriver.py:57-62` +33. `wikifi/orchestrator.py:54-61` +34. `wikifi/cli.py:118-153` +35. `wikifi/providers/base.py:28-30` +36. `wikifi/chat.py:42-45` +37. `wikifi/chat.py:46-57` +38. `wikifi/specialized/sql.py:64-84` +39. `wikifi/specialized/sql.py:99-111` +40. `wikifi/specialized/graphql.py:32-81` +41. `wikifi/specialized/protobuf.py:44-68` +42. `wikifi/specialized/openapi.py:94-108` + +## From integrations +# Integrations + +### Inbound: Entry Points into the System + +The system is distributed as a library installed directly into a target project. The command-line interface (CLI) is the primary inbound entry point, exposing subcommands that drive the full pipeline from repository introspection through wiki generation, interactive querying, and quality reporting. The CLI delegates all pipeline coordination to the orchestrator, which is also the central hub wiring together every downstream stage. + +--- + +### Outbound: AI Model Backends + +All pipeline stages — introspection, per-file extraction, section aggregation, derivative content derivation, quality critique, and interactive chat — communicate with an AI model backend exclusively through a shared provider abstraction. No stage calls a specific backend directly. 
Three interaction shapes are exposed through this abstraction: schema-validated structured output, free-text completion, and multi-turn stateful conversation. + +Three backends are available and are interchangeable without altering any pipeline code: + +| Backend type | Hosting model | +|---|---| +| Local self-hosted inference runtime | On-premise / developer machine | +| Hosted AI service (Anthropic-compatible) | Remote cloud | +| Hosted AI service (OpenAI-compatible) | Remote cloud or self-managed endpoint | + +The active backend is selected via an environment variable or a per-invocation flag at the CLI level. OpenAI-compatible endpoints — including corporate reverse proxies and managed cloud deployments — are supported by overriding the base URL alone, with no other changes to the calling code. + +--- + +### Outbound: Development-Time Tool Servers (MCP) + +A separate set of external capability providers is declared through an MCP client configuration used during development or runtime. Four tool servers are wired up: a local AI utility, a local web crawler, a remote documentation-context service, and a remote search-and-stitching service. The system acts as an MCP client that fans requests out to these providers as needed. + +--- + +### Outbound: Filesystem and Persistence Layer + +All reading and writing of wiki artifacts — extraction notes, finished section bodies, and cache entries — flows through a centralized layout abstraction that manages the `.wikifi/` output directory inside the target project. The extractor, aggregator, deriver, CLI, and orchestrator all resolve paths through this abstraction rather than independently. + +A content-addressed cache layer sits between the orchestrator and the AI backend, consulting a fingerprinting service to derive content hashes as cache keys. The extractor, aggregator, and orchestrator each consult the cache before issuing AI calls, enabling both incremental re-runs and resumability for large codebases. 
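A minimal sketch of the cache-before-inference pattern described above, assuming a single JSON cache file and SHA-256 fingerprints. The function names, the `CACHE_VERSION` constant, and the on-disk shape are hypothetical; only the general approach (key on relative path plus a hash of raw bytes, persist after each file, write via temp file and rename, discard on version mismatch) is taken from the text.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

CACHE_VERSION = 1  # hypothetical version tag embedded in the persisted file


def fingerprint(data: bytes) -> str:
    """Stable fingerprint of raw bytes (no text decoding involved)."""
    return hashlib.sha256(data).hexdigest()


def load_cache(cache_file: Path) -> Dict[str, dict]:
    """Load entries; a missing, corrupt, or version-mismatched file yields {}."""
    try:
        payload = json.loads(cache_file.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return {}
    if not isinstance(payload, dict) or payload.get("version") != CACHE_VERSION:
        return {}
    entries = payload.get("entries")
    return entries if isinstance(entries, dict) else {}


def save_cache(cache_file: Path, entries: Dict[str, dict]) -> None:
    """Write to a temp file in the same directory, then rename into place."""
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=cache_file.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump({"version": CACHE_VERSION, "entries": entries}, handle)
        os.replace(tmp_name, cache_file)  # atomic rename on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def extract_cached(
    rel_path: str,
    raw: bytes,
    entries: Dict[str, dict],
    cache_file: Path,
    extract: Callable[[bytes], List[str]],
) -> List[str]:
    """Return cached findings for unchanged content, else extract and persist."""
    digest = fingerprint(raw)
    hit = entries.get(rel_path)
    if hit is not None and hit.get("fingerprint") == digest:
        return hit["findings"]  # unchanged file: no inference call
    findings = extract(raw)
    entries[rel_path] = {"fingerprint": digest, "findings": findings}
    save_cache(cache_file, entries)  # persist per file so a crash loses little
    return findings
```

Persisting after each file is what makes resumability a by-product of caching in this sketch: a rerun after a crash simply hits the cache for every file that completed.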
+ +--- + +### Integration Touchpoints Discovered in Target Codebases + +When analyzing a target codebase, the system identifies and surfaces integration touchpoints from high-signal artifact files through specialized parsers: + +- **HTTP API surfaces** — Parsed from API contract files; each contract contributes a finding recording the count of externally exposed endpoints, establishing the public-facing API surface as a documented integration point. +- **RPC service definitions** — Each declared service and its remote procedures are mapped, capturing procedure names, request and response message types, and whether either channel is streaming. +- **Event-driven channels** — Subscription roots found in schema definition files are classified as real-time integration touchpoints rather than ordinary capabilities, reflecting their role as channels that external consumers attach to. +- **Relational links** — Foreign key declarations (both explicit and inline) are surfaced as hard relational links between domain entities, identifying cross-entity data dependencies. + +The dispatcher that routes files to these specialized parsers uses the file-kind classification produced by the repository graph module, ensuring each artifact type reaches the appropriate parser while preserving a uniform output contract for downstream aggregation. + +## Sources +1. `README.md:8-12` +2. `wikifi/cli.py:98-101` +3. `wikifi/orchestrator.py:40-60` +4. `wikifi/providers/base.py:30-48` +5. `wikifi/providers/anthropic_provider.py:115-175` +6. `wikifi/providers/ollama_provider.py:58-95` +7. `wikifi/providers/openai_provider.py:1-8` +8. `README.md:46-51` +9. `TESTING-AND-DEMO.md:232-235` +10. `.mcp.json:2-36` +11. `wikifi/wiki.py:34-61` +12. `wikifi/cache.py:244-246` +13. `wikifi/cache.py:30` +14. `wikifi/specialized/openapi.py:83-92` +15. `wikifi/specialized/protobuf.py:70-87` +16. `wikifi/specialized/graphql.py:88-91` +17. `wikifi/specialized/sql.py:86-96` +18. 
`wikifi/specialized/__init__.py:46-57` diff --git a/.wikifi/domains.md b/.wikifi/domains.md index d5de393..1508c89 100644 --- a/.wikifi/domains.md +++ b/.wikifi/domains.md @@ -1,35 +1,42 @@ # Domains and Subdomains -### Core Domain: Automated Knowledge Translation -The system operates within a single core domain focused on transforming raw technical artifacts into structured, business-readable documentation. This domain treats source repositories as unstructured knowledge sources that require systematic discovery, semantic translation, and narrative synthesis. All processing is deliberately decoupled from implementation specifics, ensuring that technical constructs are consistently mapped to domain-agnostic business concepts. +## Core Domain -### Bounded Contexts & Subdomains -The core domain is partitioned into five bounded contexts, each with distinct responsibilities and clear boundaries: +The system's core domain is **codebase knowledge extraction**: ingesting an existing source base, classifying its contents, deriving domain findings from individual files, and synthesising those findings into a structured, technology-agnostic wiki. The primary consumers are migration teams who need to understand business intent, domain structure, and operational behaviour before re-implementing or replacing a legacy system. -| Subdomain | Primary Responsibility | DDD Classification | +## Subdomains + +### Repository Introspection +This subdomain concerns discovering and classifying the files that make up a target codebase. Its central responsibility is distinguishing production source that encodes business intent from infrastructure, tooling, and other artefacts that do not. Tech-agnosticism is a first-class constraint here: the classification logic must not rely on recognising any specific language, framework, or runtime. + +### Per-File Knowledge Extraction +Once relevant files are identified, each is analysed independently to surface domain findings. 
This subdomain covers the full extraction loop — examining file content, applying domain heuristics, and producing structured evidence — and forms the first phase of wiki generation (primary sections). + +### Section Synthesis and Aggregation +The second phase of wiki generation operates over the evidence produced by per-file extraction. It aggregates findings across files into coherent wiki sections, derives higher-level content that cannot be inferred from any single file, and enforces the dependency ordering between primary (evidence-driven) and derivative (aggregated) sections. This ordering is a structural design constraint, not merely a runtime convention. + +### Wiki Authoring and Organisation +A secondary domain governs how extracted knowledge is structured and stored. It defines the taxonomy of sections, distinguishes primary from derivative content, and produces output that a migration team can navigate and consume independently of the source codebase. + +### Interactive Knowledge Retrieval +A supporting subdomain exposes the generated wiki to conversational or query-driven access, allowing stakeholders to interrogate extracted knowledge without directly inspecting raw wiki files. + +## Cross-Cutting Constraint: Tech-Agnosticism +Tech-agnosticism spans every subdomain. All analysis, extraction, and synthesis must produce domain-level descriptions that are free of references to specific languages, frameworks, or libraries. This constraint is enforced at both the classification stage (repository introspection) and the output stage (section content). + +## Subdomain Relationships + +| Subdomain | Role | Depends On | |---|---|---| -| **Repository Introspection & Curation** | Discovers project structure, classifies artifacts, filters irrelevant content, and establishes workspace boundaries. | Supporting | -| **Semantic Extraction & Analysis** | Processes individual artifacts to translate technical patterns into structured knowledge units. 
Leverages external analytical services for complex pattern recognition. | Core | -| **Information Aggregation & Synthesis** | Consumes extracted knowledge units, resolves redundancies, aligns terminology, and composes coherent section-level documentation. | Core | -| **Pipeline Orchestration & Lifecycle Management** | Governs sequential stage execution, manages reporting, coordinates output derivation, and controls the documentation workspace lifecycle. | Supporting | -| **External Intelligence Integration** | Abstracts communication with generative analysis services. Standardizes request formulation and response consumption, decoupling core logic from provider implementations. | Generalized | - -### Context Relationships & Data Flow -The subdomains form a strict, stage-gated dependency chain. Data flows unidirectionally through the pipeline: - -1. **Introspection → Extraction**: Curated artifact lists and structural metadata are passed to the extraction context. -2. **Extraction → Aggregation**: Structured knowledge units and intermediate analysis results are consumed for section-level synthesis. -3. **Aggregation → Orchestration**: Synthesized content is handed off for final artifact derivation, workspace population, and lifecycle closure. - -External Intelligence Integration operates as a cross-cutting capability within the Extraction context. It is invoked on-demand to resolve ambiguous technical patterns or generate analytical narratives, but does not dictate pipeline progression. - -### State Management & Persistence -Intermediate analysis results are explicitly persisted between pipeline stages. This design supports: -- **Incremental Processing**: Only modified or newly discovered artifacts trigger re-analysis. -- **Auditability**: Each transformation step is traceable, preserving the lineage from raw artifact to final documentation. -- **Fault Tolerance**: Pipeline stages can resume from the last persisted state without requiring full re-execution. 
- -### Modeling Gaps & Observations -- **Error & Conflict Resolution**: The notes emphasize a linear, deterministic flow but provide limited detail on how conflicting domain interpretations are resolved during synthesis, or how pipeline failures trigger rollback or recovery. -- **Orchestration vs. Workspace Boundaries**: Responsibilities for pipeline execution and workspace lifecycle management appear overlapping. Future modeling may benefit from separating execution coordination from directory/configuration governance. -- **Provider Abstraction Depth**: While external intelligence is abstracted, the notes do not specify how fallback mechanisms or service degradation are handled when analytical responses are incomplete or malformed. +| Repository Introspection | Identifies source worth analysing | — | +| Per-File Knowledge Extraction | Produces primary section evidence | Introspection | +| Section Synthesis & Aggregation | Produces derivative sections | Per-File Extraction | +| Wiki Authoring & Organisation | Structures and stores the wiki | Synthesis | +| Interactive Knowledge Retrieval | Queries the completed wiki | Authoring | + +## Sources +1. `README.md:28-52` +2. `VISION.md:3-20` +3. `wikifi/cli.py:1-8` +4. `wikifi/introspection.py:19-44` +5. `wikifi/sections.py:1-19` diff --git a/.wikifi/entities.md b/.wikifi/entities.md index ef70396..e175e7c 100644 --- a/.wikifi/entities.md +++ b/.wikifi/entities.md @@ -1,35 +1,159 @@ # Core Entities -The documentation generation pipeline relies on a set of core domain entities that manage configuration, source analysis, content extraction, and final output assembly. These entities are organized to enforce consistent processing boundaries, track intermediate findings, and produce structured documentation artifacts. +The system's domain model spans five functional layers — wiki structure, file classification, extraction, evidence, and review — plus supporting entities for caching, derivation, and chat. 
-### Configuration & Processing Boundaries -The system uses a hierarchical configuration model to define how source repositories are scanned and processed. A base settings container manages default values, which can be overridden by local configuration files to ensure environment-specific customization. Scanning and traversal configurations establish the root directory, path inclusion/exclusion filters, and file size constraints. These boundaries ensure that only relevant source files are processed while preventing resource exhaustion from oversized or irrelevant directories. +--- -### Analysis & Introspection -Before content generation, the system performs structural and semantic analysis of the target repository. Directory summaries capture aggregate statistics, including file counts, total size, extension distribution, and the presence of key manifest or documentation files. An introspection assessment synthesizes this structural data to identify primary languages, infer the system's overarching purpose, and document a classification rationale. This assessment respects the previously defined path filters and serves as the foundation for targeted content extraction. +## Wiki Structure -### Extraction & Intermediate Records -During source analysis, intermediate findings are captured as timestamped extraction notes. Each note functions as a structured record that links a specific file reference to a role summary and the extracted finding. These records preserve the context of individual source files and serve as the raw material for downstream aggregation. The system maintains a chronological log of these notes to ensure traceability throughout the pipeline. +**Section** is the central organizing entity. Each section carries a unique identifier, a human-readable title, a prose description of what belongs in it, and a tier (primary or derivative). 
Derivative sections additionally declare an ordered list of upstream section identifiers they depend on, forming an explicit dependency graph. An invariant holds at startup: every upstream identifier in a derivative section's dependency list must refer to a section that appears earlier in the canonical ordering (topological sort enforced). -### Aggregation & Output Structure -Extracted notes are consolidated into categorized documentation sections. Each section acts as a logical container for generated content, ultimately producing a final markdown body. Aggregation statistics track the success rate of section writes and explicitly flag empty sections to highlight coverage gaps. The final output adheres to a predefined workspace layout that organizes configuration files, intermediate notes, and final section artifacts into a consistent, navigable directory hierarchy. +**WikiLayout** is an immutable value object that encodes the on-disk structure of a wiki workspace. Given a project root, it derives all canonical sub-paths: the wiki directory, configuration file, gitignore file, notes directory, per-section markdown files, and per-section note files. No fields are mutable after construction. -### Pipeline Execution & Reporting -A unified execution summary consolidates metrics, findings, and completion status across all processing stages. This entity provides a single source of truth for pipeline health, output readiness, and overall processing efficiency, enabling operators to verify that all stages completed successfully before final delivery. +**WalkConfig** is an immutable configuration record consumed by the filesystem walker. It captures the repository root, extra exclusion patterns, a flag for honouring ignore rules, a maximum file size in bytes, and a minimum stripped-content size in bytes. 
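
The ordering invariant stated for **Section** (every upstream identifier must appear earlier in the canonical ordering) could be checked with a single pass like the one below. The field names, the `validate_ordering` helper, and the section identifiers in the usage note are hypothetical and only illustrate the rule; the actual check in `wikifi/sections.py` may be written differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    """Illustrative stand-in for the Section entity (field names assumed)."""

    id: str
    title: str
    tier: str  # "primary" or "derivative"
    upstream: tuple[str, ...] = ()


def validate_ordering(sections: list[Section]) -> None:
    """Raise ValueError if a section depends on one that does not precede it."""
    seen: set[str] = set()
    for section in sections:
        for dep in section.upstream:
            if dep not in seen:
                raise ValueError(
                    f"{section.id!r} depends on {dep!r}, which does not appear earlier"
                )
        seen.add(section.id)
```

For example, a list holding a primary `entities` section followed by a derivative section whose `upstream` is `("entities",)` passes, while the same two sections in reverse order raise `ValueError`.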
-### Entity Fields, Relationships & Invariants -| Entity | Primary Fields | Key Invariants | +--- + +## File Classification and Graph + +**FileKind** is a closed enumeration of seven mutually exclusive file roles: application code, SQL, OpenAPI, Protobuf, GraphQL, migration, and other. This classification determines whether a file is routed to a specialised deterministic parser or the general-purpose extraction path. + +**GraphNode** represents a single file's position in the repository's import graph. It carries the file's repo-relative path, the ordered set of files it imports, and the ordered set of files that import it. It exposes a capped combined-neighbour list for inclusion in extraction prompts. + +**RepoGraph** holds the complete import-edge map for a repository scan. It supports node lookup by path and retrieval of a capped neighbour list for any given file, providing cross-file context during extraction. + +**DirSummary** is a value object holding aggregate statistics for a single (non-recursive) directory: its repo-relative path, file count, total byte size, a frequency map of the top-10 file extensions, and a tuple of notable filenames (manifests, readmes) present in that directory. + +--- + +## Extraction Layer + +**SectionFinding** represents one file's contribution to one wiki section. It carries the target section identifier, a technology-agnostic prose description of the contribution, and an optional inclusive line range within the source chunk. + +**FileFindings** groups a one-sentence summary of a file with all `SectionFinding` records produced for it. + +**SpecializedFinding** is the output unit of the deterministic parsing paths. It carries a section identifier, a human-readable description, and a list of source references. 
**SpecializedResult** groups zero or more such findings with an optional summary string; this is the uniform output contract for all specialised extractors, ensuring interoperability with the general extraction path downstream. + +**ExtractionStats** is a walk-level counter record, accumulating: total files seen, files yielding at least one finding, total findings, skipped files, chunks processed, cache hits, specialised-extractor invocations, and a per-kind file breakdown. + +--- + +## Evidence Layer + +**SourceRef** represents a single span of source: a repo-relative file path, an optional inclusive line range, and a short content fingerprint captured at extraction time for change detection. + +**Claim** represents one assertion placed in a wiki section. It carries the markdown text and a list of `SourceRef` values that justify it. A claim with no sources is explicitly marked unsupported — this is a first-class state, not an error. + +**Contradiction** groups two or more conflicting `Claim` objects about the same topic under a single summary sentence. Each disagreeing position retains its own source references, preserving full traceability. + +**EvidenceBundle** is the aggregator's structured output for a single wiki section. It combines the narrative body text, a list of `Claim` records, and a list of `Contradiction` records. The renderer uses the bundle to thread numbered citations and a conflicts block into the final markdown. + +During aggregation, the pipeline works with intermediate forms: **AggregatedClaim** pairs a single prose assertion with the 1-based indices of the input notes that support it, and **AggregatedContradiction** holds a one-sentence summary alongside multiple conflicting positional claims, each with its own note indices. These are the structured forms that the language model produces before being resolved into the full evidence model. 
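
The step from the intermediate form to the full evidence model, where 1-based note indices are resolved into source references, might look like the sketch below. The class shapes are simplified, `resolve_claim` is a hypothetical helper, and silently dropping out-of-range indices is an assumption rather than documented behaviour.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceRef:
    """Simplified source span: a path plus an optional inclusive line range."""

    path: str
    lines: tuple[int, int] | None = None


@dataclass
class Claim:
    """One assertion together with the references that justify it."""

    text: str
    sources: list[SourceRef] = field(default_factory=list)

    @property
    def supported(self) -> bool:
        # A claim without sources is a valid, explicitly unsupported state.
        return bool(self.sources)


def resolve_claim(text: str, note_indices: list[int], note_refs: list[SourceRef]) -> Claim:
    """Map 1-based note indices to SourceRefs (out-of-range indices are dropped)."""
    sources = [note_refs[i - 1] for i in note_indices if 1 <= i <= len(note_refs)]
    return Claim(text=text, sources=sources)
```

Keeping the model's output as indices rather than file paths means the language model never has to reproduce a path verbatim; the pipeline owns the mapping back to real source spans.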
+ +--- + +## Cache Entities + +**CachedFindings** stores the extraction result for a single file: the file's content fingerprint, the list of structured findings produced, a one-sentence summary, and a count of processed chunks. Its invariant is content-addressed — the fingerprint is the cache key. + +**CachedSection** stores the aggregation result for a single wiki section: the hash of the notes payload that produced it, the rendered markdown body, and lists of claims and contradictions. It too is content-addressed on the notes hash. + +**WalkCache** is the in-memory container for both caches. It holds extraction and aggregation entries alongside hit and miss counters, enabling observability into cache effectiveness across a run. + +--- + +## Quality and Review Layer + +**Critique** captures the quality assessment of a single section: an integer score (0–10), a short overall judgment, a list of unsupported claims, a list of gaps relative to the section brief, and a list of concrete revision suggestions. + +**ReviewOutcome** tracks a section's review lifecycle: the section identifier, the initial critique, the current body text, a flag indicating whether a revision was applied, and the optional follow-up critique produced after revision. + +**WikiQualityReport** aggregates the full-wiki audit: an overall numeric score, a mapping from section identifiers to their individual `Critique` records, and optional coverage statistics. + +**CoverageStats** records total files seen, files with findings, and per-section breakdowns of finding counts and contributing file counts; it exposes a coverage-percentage computation. + +**SectionReport** captures the per-section view for reporting: the section descriptor, count of contributing files, total findings count, body size in characters, an emptiness flag, and an optional quality critique. 
+ +**WikiReport** aggregates all `SectionReport` records alongside overall coverage statistics and an optional mean quality score across populated sections. + +--- + +## Derivation and Pipeline Outputs + +**IntrospectionResult** captures the Stage 1 decision about which files are worth deeper analysis: a list of gitignore-style include patterns, a list of exclude patterns, a list of primary languages (informational), a one-paragraph guess at the system's purpose, and a rationale for the choices made. + +**AggregationStats** records, for a single aggregation run, how many sections were written fresh, skipped due to empty notes, or served from cache. + +**DerivationStats** accumulates pipeline metrics for the derivation stage: counts of sections derived, skipped, and revised, plus the full list of `ReviewOutcome` records. It acts as an audit trail for the synthesis stage. + +**WalkReport** is the single return value of a completed wiki-generation run, aggregating the introspection result, extraction statistics, aggregation statistics, derivation statistics, the live cache state, and the repository import graph. + +--- + +## Chat Layer + +**ChatMessage** carries a role and a content field, representing a single turn in a multi-turn conversation. Lists of these are accumulated to maintain conversation history. + +**LoadedSection** pairs a `Section` descriptor with its rendered markdown body, representing a single populated section ready for inclusion in a chat context. + +**ChatSession** holds a provider reference, the frozen system prompt built from wiki sections, and the accumulated conversation history as an ordered list of `ChatMessage` records. It supports appending user and assistant turns and clearing history while retaining the wiki context. 
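
A compact sketch of the **ChatSession** behaviour described above: a fixed system prompt, an accumulating history, and a reset that keeps the wiki context. The `ChatProvider` protocol and the method names `ask` and `reset` are assumptions for illustration and are not taken from `wikifi/chat.py`.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class ChatMessage:
    role: str  # "user" or "assistant"
    content: str


class ChatProvider(Protocol):
    """Hypothetical provider interface for multi-turn conversation."""

    def chat(self, system: str, messages: list[ChatMessage]) -> str: ...


@dataclass
class ChatSession:
    provider: ChatProvider
    system_prompt: str  # built once from the wiki sections, never rewritten here
    history: list[ChatMessage] = field(default_factory=list)

    def ask(self, question: str) -> str:
        """Append the user turn, call the provider, then record its reply."""
        self.history.append(ChatMessage("user", question))
        answer = self.provider.chat(self.system_prompt, list(self.history))
        self.history.append(ChatMessage("assistant", answer))
        return answer

    def reset(self) -> None:
        """Clear the conversation while retaining the wiki context."""
        self.history.clear()
```

Any object with a matching `chat` method satisfies the protocol, so a test double can stand in for a real backend.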
+ +--- + +## Relationships and Invariants Summary + +| Entity | Key relationships | Notable invariants | |---|---|---| -| Configuration | Default settings, local overrides | Local overrides always take precedence over environment defaults | -| Scan/Traversal Config | Root path, inclusion/exclusion patterns, size thresholds | Processing never exceeds defined size constraints or traverses excluded paths | -| Directory Summary | File count, total size, extension distribution, manifest presence | Statistics reflect only files within allowed traversal boundaries | -| Introspection Assessment | Primary languages, inferred purpose, classification rationale | Assessment is derived strictly from directory summaries and path filters | -| Extraction Note | Timestamp, file reference, role summary, extracted finding | Each note is immutable once created and tied to a single source file | -| Documentation Section | Category, aggregated content, final markdown body | Sections are generated only after successful note aggregation | -| Aggregation Stats | Successful writes, empty section count | Stats are updated atomically after each section generation attempt | -| Workspace Layout | Paths for config, notes, sections | Directory structure remains consistent across pipeline runs | -| Execution Summary | Stage metrics, completion status, consolidated findings | Summary is generated only after all pipeline stages report completion | - -**Relationships:** Configuration entities dictate the boundaries for analysis entities. Analysis outputs feed directly into extraction notes, which are then grouped and transformed into documentation sections. Aggregation statistics and the execution summary operate as cross-cutting observers, tracking the health and output of the entire flow. - -**Known Gaps:** The exact mapping rules between intermediate extraction notes and final documentation sections are implied by the aggregation process but not explicitly detailed in the available notes. 
Further specification may be needed to define how notes are grouped, prioritized, or filtered during section assembly, as well as how empty sections are resolved or reported upstream. +| Section | depends on upstream Sections (derivative tier only) | Dependency graph must be topologically ordered | +| WikiLayout | derived from a project root | Immutable; all paths are computed, not stored independently | +| SourceRef | referenced by Claim, SpecializedFinding | Fingerprint enables staleness detection | +| Claim | groups SourceRefs; composed into EvidenceBundle | Sourceless claims are explicitly flagged unsupported | +| Contradiction | groups ≥2 conflicting Claims | Each position retains its own SourceRefs | +| CachedFindings | keyed on file content fingerprint | Cache miss if fingerprint changes | +| CachedSection | keyed on notes-payload hash | Cache miss if any upstream note changes | +| ReviewOutcome | holds pre- and post-revision Critique | Revision flag distinguishes touched from untouched sections | +| WalkReport | aggregates all four stage outputs | Single return value for a complete run | + +## Sources +1. `wikifi/sections.py:30-40` +2. `wikifi/deriver.py:112-116` +3. `wikifi/cli.py:166-172` +4. `wikifi/wiki.py:34-61` +5. `wikifi/walker.py:61-79` +6. `README.md:31-33` +7. `wikifi/repograph.py:41-52` +8. `wikifi/repograph.py:148-167` +9. `wikifi/repograph.py:170-181` +10. `wikifi/walker.py:144-153` +11. `wikifi/extractor.py:106-123` +12. `wikifi/specialized/__init__.py:29-38` +13. `wikifi/extractor.py:126-135` +14. `wikifi/evidence.py:37-52` +15. `README.md:37-39` +16. `wikifi/evidence.py:55-67` +17. `wikifi/aggregator.py:166-186` +18. `wikifi/evidence.py:70-77` +19. `wikifi/aggregator.py:74-101` +20. `README.md:46-48` +21. `wikifi/evidence.py:80-85` +22. `wikifi/cache.py:44-51` +23. `wikifi/cache.py:54-60` +24. `wikifi/cache.py:63-70` +25. `wikifi/aggregator.py:103-107` +26. `wikifi/critic.py:67-84` +27. `wikifi/critic.py:91-96` +28. 
`wikifi/critic.py:99-114` +29. `wikifi/report.py:85-94` +30. `wikifi/report.py:28-42` +31. `wikifi/introspection.py:47-64` +32. `wikifi/deriver.py:57-62` +33. `wikifi/orchestrator.py:54-61` +34. `wikifi/cli.py:118-153` +35. `wikifi/providers/base.py:28-30` +36. `wikifi/chat.py:42-45` +37. `wikifi/chat.py:46-57` +38. `wikifi/specialized/sql.py:64-84` +39. `wikifi/specialized/sql.py:99-111` +40. `wikifi/specialized/graphql.py:32-81` +41. `wikifi/specialized/protobuf.py:44-68` +42. `wikifi/specialized/openapi.py:94-108` diff --git a/.wikifi/external_dependencies.md b/.wikifi/external_dependencies.md index c10527e..5d5e596 100644 --- a/.wikifi/external_dependencies.md +++ b/.wikifi/external_dependencies.md @@ -1,29 +1,27 @@ # External-System Dependencies -The system relies on a set of external services and infrastructure components that enable source code ingestion, semantic analysis, and structured documentation generation. These dependencies are abstracted to support interchangeable implementations while maintaining consistent operational roles. - -### AI Inference Engine -The primary external dependency is an AI inference service, which may be provisioned as a third-party API or a locally hosted instance. This engine provides the cognitive layer required for: -- Semantic analysis and intent extraction from raw source code -- Interpretation of code structure and abstraction of business domains -- Transformation of technical evidence into formal specifications, structured narratives, and architectural artifacts -- Generation of both structured data and unstructured explanatory text based on system prompts - -The system abstracts the deployment model of this layer, allowing it to operate against either cloud-hosted endpoints or local inference servers without altering core workflows. 
- -### Supporting Infrastructure & Standards -Beyond the inference engine, the system depends on several foundational services and standards to ensure reliable operation and output consistency: - -- **Host File System:** Direct read access is required to ingest source files and gather the raw technical evidence processed by the extraction engine. -- **Data Validation Framework:** A structured validation layer verifies output integrity, ensuring that generated artifacts conform to expected schemas before delivery. -- **Documentation & Diagramming Standards:** The system relies on standardized markup and diagram syntaxes to guarantee consistent rendering and interoperability across downstream consumption platforms. -- **Repository Filtering Logic:** Pattern-matching utilities aligned with standard version control ignore semantics are used to safely exclude irrelevant directories, build artifacts, and configuration files during traversal. - -### Dependency Summary -| Dependency | Role in System | -|---|---| -| AI Inference Service | Semantic analysis, intent extraction, content generation, and domain abstraction | -| Host File System | Source code ingestion and raw evidence collection | -| Data Validation Framework | Output integrity verification and schema enforcement | -| Standardized Markup/Diagram Syntaxes | Cross-platform rendering consistency and interoperability | -| VCS Ignore Pattern Logic | Safe repository traversal and artifact filtering | +The system draws on several categories of external service: language-model inference backends, development-time tooling integrations, and a continuous-integration platform. + +## Language-Model Inference + +All substantive text generation and structured extraction is delegated to an external (or locally hosted) language-model service. 
Three backends are supported through a common provider abstraction: + +| Backend | Hosting | Authentication | Role | +|---|---|---|---| +| Local inference server (default) | Self-hosted, no network egress | None required | Default backend for all extraction and synthesis calls; configurable host address and 15-minute per-call timeout | +| Hosted AI service A (Anthropic) | Cloud API | API key (`ANTHROPIC_API_KEY`) | Opt-in backend; uses an ephemeral prompt-cache marker on the system prompt so that large extraction prompts are billed at roughly 10 % of normal input-token cost across repeated per-file calls | +| Hosted AI service B (OpenAI-compatible) | Cloud API (or compatible proxy/Azure endpoint) | API key + optional custom base URL | Opt-in backend; relies on automatic prefix caching (prefixes ≥ 1 024 tokens cached for ~5–10 minutes); exposes a reasoning-intensity knob mapped to the backend's reasoning-effort parameter on capable model variants | + +The local inference server is the default and requires no credentials or external network access. The two hosted backends are opt-in and each require a provisioned API key. All three backends are configured with a model name, timeout, and per-call output-token cap drawn from the application's runtime settings. + +### Caching Strategy +Because the extraction prompt is large and is reused across every file in a repository, minimising repeated billing for identical prompt prefixes is a first-class concern. The hosted-AI-service-A integration achieves this by tagging the system-prompt block with an ephemeral cache-control marker. The hosted-AI-service-B integration relies on the provider's automatic prefix-caching mechanism without requiring explicit markers. 
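
To make the explicit-marker approach concrete, the sketch below builds a request body in the shape of Anthropic's Messages API, with `cache_control` set to `ephemeral` on the system-prompt block. It only constructs a dictionary and makes no network call; `build_request` and the placeholder model name in the usage note are hypothetical, and the provider module may assemble its request differently.

```python
def build_request(system_prompt: str, file_text: str, model: str, max_tokens: int) -> dict:
    """Build a Messages-style payload whose system block is marked cacheable."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # The marker asks the service to cache the prompt prefix up to
                # and including this block.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Only this per-file part changes between calls.
        "messages": [{"role": "user", "content": file_text}],
    }
```

Calling `build_request(prompt, text, "example-model", 1024)` for two different files yields payloads with identical `system` blocks, which is the precondition for the cached prefix to be reused across per-file calls.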
+ +## Development-Time Tool Integrations + +The MCP server configuration reveals several additional integrations that appear to be used during development or agent-assisted workflows rather than in the core production pipeline: + +- **Google AI generative API** — consumed by at least two registered tool integrations; authenticated via a shared API key. +- **Self-hosted web-crawling service** — running locally on a fixed port with no API key, providing crawling capability on demand. +- **External documentation/context lookup service** — called over HTTP with a dedicated API key; likely used to retrieve up-to-date reference documentation for prompt enrichment. +- **Google-hosted orchestration service ( diff --git a/.wikifi/hard_specifications.md b/.wikifi/hard_specifications.md index 21f362a..82fc253 100644 --- a/.wikifi/hard_specifications.md +++ b/.wikifi/hard_specifications.md @@ -1,31 +1,69 @@ # Hard Specifications -### Pipeline Execution & Architecture -- **Sequential Processing Order:** The system must execute stages in a strict, non-negotiable sequence: Introspection → Extraction → Aggregation → Derivation. Deviations from this order are prohibited. -- **Single-Provider Constraint:** The current release supports only one designated processing backend. Configuration attempts targeting alternative providers must fail gracefully without interrupting the pipeline. -- **Workspace Auto-Provisioning:** The target documentation workspace must be automatically initialized if it does not exist prior to pipeline execution. +_Aggregation failed for **Hard Specifications** (anthropic provider: empty parsed_output and parse fallback failed: 1 validation error for SectionBody + Invalid JSON: EOF while parsing a value at line 1 column 0 [type=json_invalid, input_value='', input_type=str] + For further information visit https://errors.pydantic.dev/2.13/v/json_invalid). 
Raw notes preserved below._ -### Input Processing & Data Boundaries -- **Deterministic & Non-Destructive Execution:** All processing stages must operate deterministically and preserve original source integrity. No upstream data may be altered or deleted during transformation. -- **Immutable Exclusion Patterns:** Version control metadata, dependency caches, build artifacts, and tool-specific directories are permanently excluded from traversal. -- **Strict Size Thresholds:** - | Metric | Limit | Handling Behavior | - |---|---|---| - | Maximum file size | 200,000 bytes | Truncated to limit | - | Minimum stripped content | 64 bytes | Files below threshold are ignored | -- **Fault Tolerance:** Invalid or unreadable inputs must be logged and skipped. The pipeline must never halt due to malformed source material. -- **Structural Recognition:** A predefined set of notable manifest and documentation filenames is used exclusively for structural analysis and routing. +> Brief: Critical requirements that must be carried forward verbatim: compliance rules, SLAs, contractual obligations, immutable formats. -### Content Synthesis & Documentation Standards -- **Technology-Agnostic Translation:** All outputs must strip implementation-specific terminology. Technical observations must be translated into domain-focused, user-facing intent. -- **Narrative Synthesis:** Generated content must form a coherent, structured narrative. Raw transcripts, verbatim note dumps, or unprocessed fragments are prohibited. -- **Behavioral Documentation Structure:** All operational or behavioral descriptions must adhere to a strict `Given/When/Then` format. -- **Visual & Formatting Constraints:** Diagrams must utilize standardized syntax with approved chart types. Output must exclude top-level headings and rely exclusively on appropriate markdown sub-headings, lists, and tables. 
-- **Explicit Gap & Contradiction Reporting:** Missing data, failed derivations, or conflicting upstream evidence must be explicitly declared. The system must preserve original evidence rather than fabricating content or leaving silent blanks. -### Configuration & Artifact Management -- **Configuration Precedence:** Local configuration files strictly override environment-level variables. This hierarchy is immutable. -- **Intermediate Artifact Isolation:** Temporary processing directories (intermediate notes) must be excluded from version control by default to prevent repository bloat and maintain clean lineage. +## Raw findings -### Output Structure & Compatibility Contract -- **Immutable Directory Schema:** The documentation directory layout functions as a strict backward-compatibility contract. Structural modifications are prohibited, as they will break existing documentation readability and violate compliance expectations. +- **.env.example** — Only the 'ollama' provider is supported in v1. The default request timeout is 900 seconds (15 minutes). Fully disabling thinking mode ('false') is documented as unsafe with Qwen3 models because those models ignore the JSON-schema output constraint and emit free text instead. +- **CLAUDE.md** — The system must run against a local LLM out of the box with no cloud dependency required; hosted backends (Anthropic, OpenAI, custom) are valid additional options but never the default. +- **CLAUDE.md** — Provider abstraction is mandatory: swapping the LLM backend must not require changes outside the provider boundary. +- **CLAUDE.md** — When the chosen model exposes a reasoning or thinking level, the system must run at the highest available setting; lower reasoning levels are opt-in only. +- **CLAUDE.md** — Test coverage target is ≥ 85%; every feature must ship with tests. 
+- **CLAUDE.md** — wikifi is strictly a feature-extraction tool: it describes what the legacy system does and must never transform source into any target architecture, language, or framework shape. +- **CLAUDE.md** — Derivative wiki sections (personas, user stories, diagrams) must be produced only after primary content sections are complete and must never be inferred from a single file. +- **README.md** — The critic-reviser loop must only accept a revised section if its quality score is at least as high as the score of the original; downgrades are rejected. +- **README.md** — Empty or near-empty input files must never stall the walk; the walker is required to filter them out before any LLM call is made. +- **README.md** — Every per-file finding must carry a structured SourceRef (file, line range, content fingerprint) to support citation in the rendered wiki. +- **TESTING-AND-DEMO.md** — The test suite must include exactly 156 passing tests with total line coverage at or above 93%. Every new module must individually reach at least 86% coverage, and each premium-pipeline module (fingerprint, cache, evidence, critic, report, repograph, specialized parsers, and the Anthropic provider) must carry a dedicated test file. +- **TESTING-AND-DEMO.md** — The Anthropic provider must place cache_control of type 'ephemeral' on the system-prompt block, use the messages.parse structured-output contract, translate the 'think' intensity setting to an effort level, and map API errors to a RuntimeError. These behaviors are locked in by the provider's dedicated test file. +- **TESTING-AND-DEMO.md** — The OpenAI provider must use the chat.completions.parse structured-output contract, route reasoning_effort only to o-series and gpt-5 models (not standard models), swap max_tokens for max_completion_tokens on reasoning models, and map API errors to RuntimeError. OpenAI's automatic prefix caching applies to prefixes of at least 1024 tokens and lasts approximately 5–10 minutes. 
+- **TESTING-AND-DEMO.md** — The critic-reviser loop must only accept a revised derivative section if the revision scores at least as well as the original; a revision that scores lower must be discarded. +- **VISION.md** — The generated wiki must at minimum contain: DDD domains and subdomains, system intent, domain-level capabilities, external-system dependencies, internal and external integrations, cross-cutting concerns, core entities and their structures, and hard specifications — regardless of the on-disk layout chosen by the implementor. +- **VISION.md** — Derivative wiki sections (user personas, user stories, aggregate diagrams) must be produced in a step that runs *after* primary capture and must never be inferred from a single source file. +- **VISION.md** — Wiki content is stored in the target project's `.wikifi/` directory; the contract is the content the wiki conveys, not its on-disk shape or file structure within that directory. +- **VISION.md** — Success is defined as: a migration team working from the wiki alone — without reference to the original codebase — can deliver a microservice re-implementation that preserves the original system's personas, problem space, integrations, cross-cutting concerns, entities, data patterns, and user value. +- **wikifi/aggregator.py** — Contradictions between source notes must never be silently resolved: any incompatible claims must produce a `contradictions[]` entry naming each position and the note indices that support it. This is stated as a hard rule in the LLM system prompt and enforced structurally via the `AggregatedContradiction` schema. +- **wikifi/aggregator.py** — Wiki section bodies must be tech-agnostic: no mention of specific languages, frameworks, or libraries is permitted in synthesised output; every observation must be translated into domain terms. 
+- **wikifi/aggregator.py** — Note indices presented to the LLM are 1-based, and the resolution logic subtracts 1 before indexing into the notes list — an off-by-one invariant that must be preserved if the prompting scheme changes. +- **wikifi/cache.py** — The aggregation cache key is computed only over content-bearing fields (file reference, summary, finding text) and explicitly excludes timestamps and per-walk debug fields, ensuring that regenerating identical notes on a fresh walk always produces a cache hit. +- **wikifi/cache.py** — Cache files must reside at `.wikifi/.cache/extraction.json` and `.wikifi/.cache/aggregation.json` relative to the wiki directory root. +- **wikifi/cli.py** — The tool's entry point must be declared as `wikifi` in the project's script configuration and must delegate directly to the Typer application; this contract ties the installed command name to the main() function in this module. +- **wikifi/config.py** — Files exceeding 2,000,000 bytes are unconditionally dropped and never read; this threshold is explicitly documented as targeting vendored or generated noise rather than real source files. +- **wikifi/config.py** — Each language model call is limited to a 150,000-byte content window, sized to fit within a 32K-context model after prompt overhead; larger files must be split into overlapping chunks rather than truncated. +- **wikifi/config.py** — Adjacent file chunks share an 8,000-byte overlap region to preserve cross-boundary context; this overlap guarantee must be maintained when the chunking logic is modified. +- **wikifi/critic.py** — The scoring rubric is fixed: 9–10 indicates fully grounded, tech-agnostic, narratively coherent content with no unsupported claims; 6–8 allows minor issues; 3–5 signals substantial gaps or partial coverage; 0–2 marks incoherent or off-brief content. The default minimum acceptable score for shipping a section without revision is 7. 
+- **wikifi/critic.py** — A revised body is only accepted if its follow-up critique score is greater than or equal to the initial score; any revision that causes a score regression is discarded and the original body is retained. This invariant must be preserved in any reimplementation. +- **wikifi/critic.py** — All section bodies must be tech-agnostic: the reviser is explicitly instructed not to invent claims unsupported by upstream evidence and to declare gaps explicitly when evidence is missing rather than speculating. +- **wikifi/deriver.py** — Derivative sections must be grounded solely in upstream section content. The model is instructed to declare gaps explicitly rather than filling them with invented facts — this is a hard constraint on output integrity. +- **wikifi/deriver.py** — All wiki content, including derivative sections, must remain technology-agnostic: language names, framework names, and library names are forbidden and must be translated into domain terms. +- **wikifi/deriver.py** — Gherkin-style outputs must use proper Given/When/Then syntax inside fenced ```gherkin code blocks. Mermaid diagrams must be valid and inside fenced ```mermaid code blocks, preferring graph, classDiagram, erDiagram, and sequenceDiagram diagram types. +- **wikifi/evidence.py** — Source references must be rendered in the format 'path/to/file:start-end' (or 'path/to/file:line' for a single line, or just 'path/to/file' when lines are unknown). The 'Sources' footer uses 1-based sequential numeric indices in the form '1. `path`'. +- **wikifi/evidence.py** — Contradictions must never be silently merged into a unified narrative; they must be explicitly surfaced in a dedicated 'Conflicts in source' sub-section, with a warning that migration teams must resolve them before re-implementation. +- **wikifi/extractor.py** — Per-file extraction is restricted to primary wiki sections only. 
Derivative sections (personas, user stories, diagrams) are explicitly excluded from per-file extraction and are instead produced in a later aggregation stage; requesting them at the per-file level is documented as producing sparse, speculative findings. +- **wikifi/extractor.py** — The recursive text splitter must guarantee termination on any input, including minified single-line files with no whitespace, by falling back through separator priority (blank lines → single newlines → spaces → character boundaries). The character-boundary split is the terminal step that ensures every byte is eventually consumed. +- **wikifi/extractor.py** — Chunk overlap must satisfy `0 <= overlap < chunk_size`; violating this constraint raises an error. The effective base chunk size is `chunk_size - overlap` so that prepending an overlap tail never causes a chunk to exceed `chunk_size` bytes. +- **wikifi/fingerprint.py** — Fingerprints are defined as the first 12 hexadecimal characters of a SHA-256 digest (48 bits of entropy). This length is explicitly chosen to be sufficient to distinguish every file in any realistic repository (by the birthday bound, a 50% chance of any collision is reached only at roughly 20 million files) while remaining short enough to embed inline in human-readable citations. This format must be preserved across any migration because it is recorded in cached artefacts and emitted into wiki evidence references. +- **wikifi/introspection.py** — Stage 1 must operate without reading any source files; it sees only directory-level summaries and manifest contents. This constraint is architectural and must be preserved: source reading is exclusively Stage 2's responsibility. +- **wikifi/introspection.py** — Include and exclude patterns produced by Stage 1 must be in gitignore-style format relative to the repository root. 
+- **wikifi/orchestrator.py** — When a user selects the Anthropic provider but the configured model name does not begin with 'claude-', the system silently substitutes the model identifier 'claude-opus-4-7' rather than forwarding an invalid name. Similarly, for OpenAI, non-OpenAI-pattern model names are replaced with 'gpt-4o'. This model-name substitution logic must be preserved so that users migrating from the default local provider do not receive opaque remote API errors. +- **wikifi/orchestrator.py** — The only accepted provider identifiers are 'ollama', 'anthropic', and 'openai'; any other value raises an error. This contract is enforced at provider construction time and must be maintained by any future provider registration mechanism. +- **wikifi/repograph.py** — The implementation must remain dependency-free beyond regex and path resolution — tree-sitter or similar binary dependencies are explicitly prohibited so that the tool can be installed without native compilation. +- **wikifi/report.py** — Quality scoring is only performed when explicitly requested (`score=True`) and a provider is supplied; without both conditions the report remains purely structural. This ensures the tool can run in provider-free environments such as CI pipelines without failure. +- **wikifi/sections.py** — Derivative sections must always reference only known section IDs, and every upstream a derivative depends on must appear earlier in the canonical SECTIONS ordering. This ordering invariant is validated at module load time and any violation raises an error, making it a hard structural requirement for the section taxonomy. +- **wikifi/walker.py** — The maximum file size threshold is 2,000,000 bytes (2 MB); files at or above this limit are unconditionally skipped and never sent for analysis. The minimum content threshold is 64 bytes of stripped text. Manifest files are truncated to 20,000 bytes maximum before being included in any prompt. 
+- **wikifi/wiki.py** — The directory layout is explicitly declared as a stable contract between the tool and any target project: upgrading the tool must not break existing wikis. This constraint is called out in the module docstring and governs all future changes to path conventions. +- **wikifi/wiki.py** — The `.wikifi/` directory layout follows a fixed, documented schema: `config.toml` for provider/model overrides, `.gitignore` for excluding notes, one `
.md` per defined section, and a `.notes/
.jsonl` per section for extraction state. This schema must remain stable across upgrades. +- **wikifi/specialized/protobuf.py** — The module explicitly designates proto file findings as direct inputs to interface design during migration, implying that message names, enum value sets, service names, RPC signatures, and streaming contracts must be preserved verbatim when porting to a new stack. +- **wikifi/specialized/sql.py** — Indexes are explicitly annotated as performance invariants that 'the new system must preserve,' establishing a carry-forward requirement for any target platform. +- **wikifi/specialized/sql.py** — UNIQUE and NOT NULL constraints are treated as storage-level invariants that must survive migration, not merely advisory metadata. +- **wikifi/providers/anthropic_provider.py** — Sampling parameters (temperature, top_p, top_k) must not be sent to the claude-opus-4-7 model variant — doing so causes a 400 error. The provider explicitly omits these parameters for this model generation, making their absence a hard constraint carried forward with the provider implementation. +- **wikifi/providers/anthropic_provider.py** — The maximum output token budget per call is set at 16,000 tokens. This is documented as comfortable headroom for any section schema response while staying within the SDK's non-streaming HTTP timeout guard, making it an operationally important default that should not be reduced without re-validating pipeline completions. +- **wikifi/providers/ollama_provider.py** — Qwen3-family models must not be invoked with think=False on the structured-output path: doing so causes the model to bypass the schema constraint and emit free text, which fails downstream validation. The thinking level must be 'low' or higher to preserve schema compliance. 
For the derivative-section synthesis pass, 'high' thinking is the preferred setting for output quality, but callers must budget 1–3 minutes per file and configure the timeout to at least 900 seconds to absorb that latency. +- **wikifi/providers/openai_provider.py** — Reasoning-capable model families (identified by name prefix) must receive output-token limits via a distinct parameter name from standard chat models; sending the wrong parameter to either family causes a request failure. The provider routes the correct parameter unconditionally based on model identity. +- **wikifi/providers/openai_provider.py** — The `think` (reasoning-effort) knob must only be forwarded to reasoning-capable models; forwarding it to a plain chat model risks a validation error from the hosted service. The mapping from wikifi's internal knob values (`low`, `medium`, `high`) to the API's accepted values is fixed and must be preserved. +- **wikifi/providers/openai_provider.py** — When the hosted service returns a response that cannot be parsed into the expected structured schema (e.g. due to refusal or truncation), the system falls back to direct JSON validation of the raw text rather than returning a null result, preserving the protocol contract that callers always receive a validated object or an explicit error. diff --git a/.wikifi/integrations.md b/.wikifi/integrations.md index 7a65e87..26c67d9 100644 --- a/.wikifi/integrations.md +++ b/.wikifi/integrations.md @@ -1,36 +1,68 @@ # Integrations -#### Internal Pipeline Handoffs -The system operates as a staged processing pipeline where each module consumes structured outputs from upstream stages and passes refined data downstream. The orchestration layer serves as the central coordinator, triggered by external commands to provision workspaces or initiate full processing cycles. 
It delegates execution to specialized components in the following sequence: - -- **Repository Traversal & Introspection:** The traversal component scans target directories and supplies filtered file paths and structural metadata. The introspection module consumes directory summaries and manifests to generate filtering patterns and metadata, which guide subsequent analysis stages. -- **Source Analysis & Extraction:** The extraction engine receives the filtered file lists, analyzes individual artifacts, and translates technical content into structured, technology-agnostic notes. These notes are passed to the aggregation layer. -- **Content Aggregation:** The aggregation module consumes the structured notes, synthesizes them into formatted documentation, and writes the results to the central knowledge base layout. -- **Derivative Generation:** The derivation stage consumes finalized documentation, interfaces with generative synthesis services, and produces supplementary content. This output is written back into the central layout, completing the continuous pipeline from raw artifact analysis to polished documentation. - -#### External & Abstracted Interfaces -All external dependencies are routed through standardized contracts to isolate core business logic from implementation details: - -- **Generative AI Services:** A unified abstraction layer handles all AI-driven content requests. Downstream modules submit contextual prompts and source snippets through this interface and receive processed findings or synthesized text in return. Provider-specific implementations are swappable without modifying the analysis engine. -- **Configuration & Runtime Management:** A centralized settings provider supplies runtime parameters to the orchestration and traversal layers. These parameters govern model selection, provider routing, timeout thresholds, content size constraints, and file exclusion lists. 
-- **User Interface & Console:** The command-line interface delegates initialization and execution to the orchestration service. It manages structured console output, progress reporting, and user feedback, ensuring a consistent interaction model. -- **Observability & Telemetry:** The extraction stage integrates with a logging and statistics tracking system to monitor processing metrics, track pipeline health, and record analysis outcomes. - -#### Integration Touchpoint Summary -| Component | Inbound Dependencies | Outbound Deliverables | External Interfaces | -|---|---|---|---| -| **Orchestrator** | CLI commands, centralized config | Task delegation signals | None (internal coordinator) | -| **CLI Interface** | User input, runtime config | Execution triggers, console feedback | Standard console/terminal | -| **Traversal & Introspection** | Config/exclusion lists, directory manifests | Filtered paths, metadata, filtering patterns | Repository filesystem | -| **Extractor** | Filtered file lists, AI responses | Structured analysis notes | AI provider interface, logging/telemetry | -| **Aggregator** | Structured notes, AI responses | Synthesized markdown sections | AI provider interface, wiki storage | -| **Deriver** | Finalized markdown, AI responses | Derivative documentation | Generative synthesis service, wiki storage | -| **AI Provider Layer** | Contextual prompts, source snippets | Processed findings, synthesized text | External inference backends | - -#### Documentation Gaps -The provided notes outline the directional flow and high-level contracts but do not specify: -- Exact data schemas or serialization formats used for inter-module handoffs -- Error handling, retry policies, or fallback mechanisms for external service failures -- Authentication, rate-limiting, or security constraints for AI provider interactions -- Versioning or compatibility guarantees between pipeline stages -These details should be clarified in implementation documentation or interface 
contracts. +### Inbound: Entry Points into the System + +The system is distributed as a library installed directly into a target project. The command-line interface (CLI) is the primary inbound entry point, exposing subcommands that drive the full pipeline from repository introspection through wiki generation, interactive querying, and quality reporting. The CLI delegates all pipeline coordination to the orchestrator, which is also the central hub wiring together every downstream stage. + +--- + +### Outbound: AI Model Backends + +All pipeline stages — introspection, per-file extraction, section aggregation, derivative content derivation, quality critique, and interactive chat — communicate with an AI model backend exclusively through a shared provider abstraction. No stage calls a specific backend directly. Three interaction shapes are exposed through this abstraction: schema-validated structured output, free-text completion, and multi-turn stateful conversation. + +Three backends are available and are interchangeable without altering any pipeline code: + +| Backend type | Hosting model | +|---|---| +| Local self-hosted inference runtime | On-premise / developer machine | +| Hosted AI service (Anthropic-compatible) | Remote cloud | +| Hosted AI service (OpenAI-compatible) | Remote cloud or self-managed endpoint | + +The active backend is selected via an environment variable or a per-invocation flag at the CLI level. OpenAI-compatible endpoints — including corporate reverse proxies and managed cloud deployments — are supported by overriding the base URL alone, with no other changes to the calling code. + +--- + +### Outbound: Development-Time Tool Servers (MCP) + +A separate set of external capability providers is declared through an MCP client configuration used during development or runtime. Four tool servers are wired up: a local AI utility, a local web crawler, a remote documentation-context service, and a remote search-and-stitching service. 
The system acts as an MCP client that fans requests out to these providers as needed. + +--- + +### Outbound: Filesystem and Persistence Layer + +All reading and writing of wiki artifacts — extraction notes, finished section bodies, and cache entries — flows through a centralized layout abstraction that manages the `.wikifi/` output directory inside the target project. The extractor, aggregator, deriver, CLI, and orchestrator all resolve paths through this abstraction rather than independently. + +A content-addressed cache layer sits between the orchestrator and the AI backend, consulting a fingerprinting service to derive content hashes as cache keys. The extractor, aggregator, and orchestrator each consult the cache before issuing AI calls, enabling both incremental re-runs and resumability for large codebases. + +--- + +### Integration Touchpoints Discovered in Target Codebases + +When analyzing a target codebase, the system identifies and surfaces integration touchpoints from high-signal artifact files through specialized parsers: + +- **HTTP API surfaces** — Parsed from API contract files; each contract contributes a finding recording the count of externally exposed endpoints, establishing the public-facing API surface as a documented integration point. +- **RPC service definitions** — Each declared service and its remote procedures are mapped, capturing procedure names, request and response message types, and whether either channel is streaming. +- **Event-driven channels** — Subscription roots found in schema definition files are classified as real-time integration touchpoints rather than ordinary capabilities, reflecting their role as channels that external consumers attach to. +- **Relational links** — Foreign key declarations (both explicit and inline) are surfaced as hard relational links between domain entities, identifying cross-entity data dependencies. 
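To make the relational-links item above more concrete, here is a minimal sketch of how explicit and inline foreign key declarations might be pulled out of schema text with regular expressions. It is an assumption-laden illustration, not the parser shipped in `wikifi/specialized/sql.py`: the function name `find_foreign_keys`, both patterns, and the sample table are hypothetical, and real SQL dialects have many forms this does not cover.

```python
import re

# Explicit form: [CONSTRAINT name] FOREIGN KEY (col) REFERENCES table (col)
EXPLICIT_FK = re.compile(
    r"FOREIGN\s+KEY\s*\(\s*(\w+)\s*\)\s*REFERENCES\s+(\w+)\s*\(\s*(\w+)\s*\)",
    re.IGNORECASE,
)
# Inline form: col TYPE ... REFERENCES table (col)
INLINE_FK = re.compile(
    r"\s*(\w+)\s+\w+[^,\n]*?\bREFERENCES\s+(\w+)\s*\(\s*(\w+)\s*\)",
    re.IGNORECASE,
)


def find_foreign_keys(ddl: str) -> list[tuple[str, str, str]]:
    """Return (column, referenced_table, referenced_column) triples."""
    links: list[tuple[str, str, str]] = []
    for line in ddl.splitlines():
        # Check the explicit form first so "FOREIGN KEY" is never
        # mistaken for a column name by the inline pattern.
        explicit = EXPLICIT_FK.search(line)
        if explicit:
            links.append(explicit.groups())
            continue
        inline = INLINE_FK.match(line)
        if inline:
            links.append(inline.groups())
    return links


SAMPLE_DDL = """
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    product_id INTEGER,
    FOREIGN KEY (product_id) REFERENCES products (id)
);
"""

links = find_foreign_keys(SAMPLE_DDL)
```

Each triple could then be reported as a link between two domain entities, which is the shape the bullet describes.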
+ +The dispatcher that routes files to these specialized parsers uses the file-kind classification produced by the repository graph module, ensuring each artifact type reaches the appropriate parser while preserving a uniform output contract for downstream aggregation. + +## Sources +1. `README.md:8-12` +2. `wikifi/cli.py:98-101` +3. `wikifi/orchestrator.py:40-60` +4. `wikifi/providers/base.py:30-48` +5. `wikifi/providers/anthropic_provider.py:115-175` +6. `wikifi/providers/ollama_provider.py:58-95` +7. `wikifi/providers/openai_provider.py:1-8` +8. `README.md:46-51` +9. `TESTING-AND-DEMO.md:232-235` +10. `.mcp.json:2-36` +11. `wikifi/wiki.py:34-61` +12. `wikifi/cache.py:244-246` +13. `wikifi/cache.py:30` +14. `wikifi/specialized/openapi.py:83-92` +15. `wikifi/specialized/protobuf.py:70-87` +16. `wikifi/specialized/graphql.py:88-91` +17. `wikifi/specialized/sql.py:86-96` +18. `wikifi/specialized/__init__.py:46-57` diff --git a/.wikifi/intent.md b/.wikifi/intent.md index 54c5959..73d0214 100644 --- a/.wikifi/intent.md +++ b/.wikifi/intent.md @@ -1,34 +1,57 @@ # Intent and Problem Space -### Purpose and Problem Statement -The system exists to eliminate the labor-intensive overhead of manual documentation and resolve the fragmentation of technical knowledge within software repositories. When teams inherit, maintain, or scale complex codebases, understanding the underlying business logic, user value, and architectural relationships typically requires tedious reverse-engineering. This tool automates that process by systematically analyzing source artifacts to produce structured, navigable documentation that captures *what* the system does and *why* it exists, deliberately abstracting away implementation-specific mechanics. 
- -### Target Audience -- Engineering teams onboarding to unfamiliar, legacy, or rapidly evolving codebases -- Technical writers and architects seeking a reliable, evidence-based baseline for system documentation -- Organizations requiring consistent, technology-agnostic knowledge bases across multiple projects or acquisition targets - -### Design Constraints and Guiding Principles -The system’s architecture is shaped by several non-negotiable constraints that prioritize reliability, analytical depth, and long-term maintainability: - -- **Strict Technology Agnosticism:** All extraction and synthesis processes deliberately ignore language-specific syntax, framework conventions, or library dependencies. The focus remains exclusively on business purpose, user value, and behavioral specifications. -- **Fidelity Over Throughput:** Processing is optimized for analytical depth and output accuracy rather than raw speed. The system explicitly trades computational cost for higher-quality, cross-cutting insights, providing configurable controls to balance resource expenditure against result quality. -- **Deterministic, Stage-Gated Execution:** Analysis follows a strictly ordered pipeline. Each phase must complete successfully before downstream processing begins, ensuring predictable outcomes, graceful failure handling, and reproducible results across runs. -- **Backend Decoupling:** Core analytical logic is strictly separated from underlying reasoning or generation services. This allows seamless substitution of processing backends without altering the system’s operational contract or output structure. -- **Upgrade-Safe Documentation Contract:** The output structure adheres to a stable, version-resilient schema. This ensures that documentation remains navigable and consistent even as the underlying analysis methods evolve. 
-- **Automated Noise Filtration:** The system automatically isolates production behavior from non-essential artifacts (e.g., tests, third-party dependencies, configuration files, generated code) to prevent analysis dilution and conserve processing resources. - -### Operational Boundaries -| Dimension | In Scope | Out of Scope | -|---|---|---| -| **Analysis Focus** | Business logic, user value, architectural relationships, behavioral narratives | Low-level implementation details, syntax optimization, performance profiling | -| **Input Handling** | Unknown or unstructured repositories, mixed-paradigm codebases | Pre-documented systems, strictly standardized templates | -| **State Management** | Intermediate data preservation, incremental processing, debugging traceability | Real-time code generation, automated refactoring, deployment pipelines | - -### Documented Gaps -While the system’s intent and high-level constraints are well-defined, the following operational parameters remain unspecified in the current documentation: -- Exact thresholds or heuristics used to balance computational cost against result quality -- Conflict resolution strategies when extracted insights from different files contradict one another -- Specific criteria for classifying artifacts as "non-essential" across highly customized or non-standard repository structures - -These gaps do not impact the system’s core purpose but should be addressed before production deployment in complex or highly regulated environments. +wikifi exists because the intent embedded in a legacy system is typically invisible — locked inside years of implementation choices, technology-specific conventions, and accumulated structure that makes it difficult to separate *what the system does and why* from *how it currently does it*. Migration teams tasked with replacing or re-implementing such a system need the former without the latter. 
+ +### The Core Problem + +When a team inherits a large legacy codebase and must produce a new implementation, they face a knowledge-extraction problem. The source describes a particular way of solving a set of problems, but rarely describes the problems themselves at a level that is portable to a new context. Reading the source directly tends to reproduce the same structure and constraints in the new system — recreating legacy decisions rather than the underlying intent. + +wikifi addresses this by walking a repository and producing a structured, technology-agnostic wiki that surfaces: + +- **Domain entities and capabilities** — what the system models and what it can do +- **API contracts and integration touchpoints** — what it exposes and to whom +- **Cross-cutting concerns** — considerations that span the system as a whole +- **Personas, user stories, and diagrams** — who uses the system, what they need, and how flows connect + +The goal is to make legacy intent explicit, complete, and portable so a fresh implementation can retain full functional value without inheriting structural decisions. + +### Primary Audience + +The immediate audience is migration teams — architects and developers who need to understand a system's domain well enough to re-implement it rather than maintain it. A secondary audience includes anyone who must understand what a system does without reading its source directly, including those who need to interrogate the resulting wiki conversationally. + +### What the System Is Not + +wikifi is explicitly a feature-extraction tool, not a transposition tool. It surfaces what a legacy system does and leaves all decisions about target architecture, structure, and approach entirely to the migration team. The output prescribes nothing about how the new system should be built. 
+ +### Shaping Constraints + +Several constraints are built into the design from the outset: + +| Constraint | Rationale | +|---|---| +| **Technology agnosticism** | Output must be expressed in domain terms, never in terms of the implementation technology found in the source, so the wiki does not embed the very assumptions it is meant to dissolve. | +| **Quality over speed** | Accuracy and completeness of the generated wiki are prioritised over processing throughput. | +| **Arbitrary scale** | The system must handle repositories of any size — including legacy monorepos with tens of thousands of files — through caching and chunking strategies that make repeated and interrupted runs cheap. | +| **Full traceability** | Every assertion in the generated wiki must trace back to specific source files and locations so architects can verify any claim against the original codebase. | +| **Honest disagreement** | Where source files contain conflicting signals, the system surfaces those contradictions explicitly rather than silently resolving them, preserving the full picture for the migration team. | + +## Sources +1. `VISION.md:3-9` +2. `CLAUDE.md:73-75` +3. `README.md:3` +4. `wikifi/cli.py:1-8` +5. `.env.example:1-2` +6. `TESTING-AND-DEMO.md:1-6` +7. `wikifi/config.py:1-8` +8. `wikifi/specialized/__init__.py:1-13` +9. `wikifi/specialized/openapi.py:1-11` +10. `wikifi/specialized/protobuf.py:1-8` +11. `wikifi/deriver.py:1-18` +12. `wikifi/sections.py:1-19` +13. `VISION.md:86-89` +14. `wikifi/critic.py:1-15` +15. `wikifi/chat.py:1-32` +16. `wikifi/cache.py:1-21` +17. `wikifi/extractor.py:1-37` +18. `wikifi/aggregator.py:1-15` +19. `wikifi/evidence.py:1-18` diff --git a/.wikifi/personas.md b/.wikifi/personas.md index 3b37339..9c48864 100644 --- a/.wikifi/personas.md +++ b/.wikifi/personas.md @@ -1,63 +1,153 @@ # User Personas -### Primary Human Operators - -The system’s target audience is explicitly defined across the intent and capability specifications. 
By aggregating the stated problem space, pipeline behaviors, and integration contracts, three distinct human operator personas emerge. Each persona interacts with the system to resolve specific documentation debt, knowledge fragmentation, or onboarding friction. - -#### 1. Onboarding Engineering Practitioner -*Focus: Rapid comprehension of unfamiliar, legacy, or rapidly evolving codebases.* - -- **Goals:** Accelerate cross-team onboarding; quickly map business logic and functional capabilities without manual reverse-engineering; maintain awareness of system relationships as the codebase evolves. -- **Needs:** Structured, navigable documentation that stays synchronized with implementation; standardized terminology across components; explicit declarations of missing or ambiguous information; traceability from documentation back to original source artifacts. -- **Pain Points:** Fragmented technical knowledge; labor-intensive manual documentation; outdated or speculative content that drifts from actual implementation; difficulty distinguishing production behavior from test or configuration noise. -- **Served Use Cases:** - - Structural analysis for system purpose inference and scoped processing boundaries - - Granular extraction of domain concepts from technical implementations - - Adaptive reasoning depth to toggle between lightweight overviews and deep architectural breakdowns - - Timestamped provenance for auditability and change tracking - -#### 2. Technical Writer & System Architect -*Focus: Establishing reliable, evidence-based documentation baselines and behavioral narratives.* - -- **Goals:** Produce consistent, technology-agnostic documentation; capture cross-cutting relationships and behavioral specifications; maintain long-term documentation stability across tooling or backend updates. 
-- **Needs:** Schema-validated structured generation for systematic phases; free-form analytical generation for narrative clarity; deterministic, stage-gated execution for reproducible outputs; explicit gap preservation rather than speculative filling. -- **Pain Points:** Inconsistent terminology across projects; lack of traceability between documentation and source artifacts; manual authoring overhead; documentation contracts that break when analysis methods or backends change. -- **Served Use Cases:** - - Section synthesis for cohesive, consistently structured documentation units - - Cross-cutting derivation for behavioral stories and system interaction diagrams - - Workspace lifecycle management for section scaffolding, versioning rules, and intermediate state cleanup - - Dual-mode generation to balance machine-readable consistency with human-readable clarity - -#### 3. Portfolio Manager & Acquisition Integrator -*Focus: Standardizing knowledge bases across multiple projects, mixed-paradigm repositories, or acquisition targets.* - -- **Goals:** Assess system purpose and classification rationale quickly; maintain a unified, technology-agnostic knowledge base without manual overhead; ensure processing efficiency across diverse repository structures. -- **Needs:** Automated noise filtration to isolate production behavior; flexible configuration of traversal depth, file size thresholds, and content filters; backend decoupling for seamless processing substitution; consistent workspace layouts across pipeline runs. -- **Pain Points:** Resource exhaustion from scanning irrelevant directories; inconsistent output structures when analysis methods evolve; lack of auditability for compliance or assessment; fragmented knowledge across acquired or legacy projects. 
-- **Served Use Cases:** - - Intelligent traversal & filtering for production-relevance classification and dynamic focus adjustment - - Introspection assessment for primary language identification and classification rationale - - Aggregation statistics and execution summaries for pipeline health monitoring and output readiness verification - - Upgrade-safe documentation contract to preserve navigability as underlying analysis methods evolve - -### Persona-to-Pipeline Mapping - -| Pipeline Stage / Capability | Onboarding Practitioner | Technical Writer & Architect | Portfolio Manager & Integrator | -|---|---|---|---| -| **Structural Analysis & Introspection** | System purpose inference, scoped boundaries | Classification rationale, structural metadata | Primary language/purpose assessment across targets | -| **Granular Extraction & Domain Translation** | Business logic mapping, noise isolation | Evidence-based baseline, traceable notes | Technology-agnostic abstraction, standardized terminology | -| **Section Synthesis & Dual-Mode Generation** | Lightweight overviews vs. deep breakdowns | Schema-validated structure + narrative clarity | Consistent output formatting across projects | -| **Cross-Cutting Derivation** | Relationship mapping, onboarding acceleration | Behavioral stories, interaction diagrams | *(Note: System also auto-generates behavioral personas as a downstream artifact)* | -| **Workspace Lifecycle & Execution Reporting** | Change tracking, provenance | Reproducible runs, gap preservation | Pipeline health metrics, auditability, upgrade-safe contracts | +Two broad audiences are evident from the system's stated purpose and the capabilities built to serve them: **migration teams** who need portable domain knowledge extracted from a legacy codebase, and **knowledge consumers** who need to interrogate that knowledge without reading the source. A third role — the **wiki operator** — emerges from the pipeline management and quality-assurance capabilities. 
The knowledge-consumer audience is implied by the interactive chat interface and the explicitly non-technical framing of the conversational output. Because migration teams divide into an architect role and a developer role, this yields the four personas described below.
+
+---
+
+## Persona 1 — The Migration Architect
+
+> *"I need to understand what this system does, not how it does it."*
+
+### Profile
+Leads the technical planning for a re-implementation or replacement of an inherited legacy system. Responsible for defining the scope and domain boundaries of the new system before any build work begins.
+
+### Goals
+- Recover the intent embedded in a legacy codebase independently of its current technology choices.
+- Identify domain entities, capabilities, integration touchpoints, and cross-cutting concerns that must be preserved in the new system.
+- Produce artefacts (diagrams, user stories, entity maps) that can brief the wider delivery team.
+
+### Needs
+- A technology-agnostic wiki that does not reproduce legacy structural decisions.
+- Full traceability — every assertion must point back to a specific location in the source so claims can be verified.
+- Explicit surfacing of contradictions in the source rather than silent resolution, since disagreements flag high-priority migration risks.
+- Architectural diagrams and structured user stories derived automatically from the extracted knowledge.
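The traceability need listed above can be illustrated with a minimal Python sketch. The names `SourceRef` and `Claim` echo the models named in the commit message, but the fields, the `render_section` helper, and the `*(unsupported)*` marker are assumptions made for illustration and may not match `wikifi/evidence.py`. The sketch assumes citations are rendered as `path:start-end`, in the style of the Sources lists in this patch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceRef:
    """Hypothetical citation target: repo-relative path plus inclusive line range."""

    path: str
    start: int
    end: int

    def label(self) -> str:
        # A single-line reference is shown without a range.
        if self.start == self.end:
            return f"{self.path}:{self.start}"
        return f"{self.path}:{self.start}-{self.end}"


@dataclass
class Claim:
    """Hypothetical claim: one assertion plus the references that back it."""

    text: str
    refs: list[SourceRef] = field(default_factory=list)


def render_section(claims: list[Claim]) -> str:
    """Render claims with numbered citations and a trailing Sources list."""
    numbered: dict[SourceRef, int] = {}  # insertion order gives citation order
    lines: list[str] = []
    for claim in claims:
        if not claim.refs:
            # Claims without backing are flagged rather than given a fake number.
            lines.append(f"- {claim.text} *(unsupported)*")
            continue
        marks = []
        for ref in claim.refs:
            # Reuse the existing number when the same span is cited again.
            number = numbered.setdefault(ref, len(numbered) + 1)
            marks.append(f"[{number}]")
        lines.append(f"- {claim.text} {''.join(marks)}")
    if numbered:
        lines.append("")
        lines.append("## Sources")
        for ref, number in numbered.items():
            lines.append(f"{number}. `{ref.label()}`")
    return "\n".join(lines)
```

Keying the numbering on the frozen `SourceRef` means a span cited by several claims keeps one number, so the Sources list stays short and each entry can be checked once.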
+ +### Pain Points +| Pain point | How the system addresses it | +|---|---| +| Reading legacy source directly tends to reproduce its structure in the new design | Output is expressed entirely in domain terms, never in technology-specific terms | +| Conflicting signals in different parts of the codebase are invisible | Contradictions are surfaced in a dedicated *Conflicts in source* block | +| Claims cannot be verified without re-reading the entire codebase | Every claim carries numbered citations to originating files and line ranges | +| No portable documentation exists to brief the wider team | Derivative sections produce Mermaid diagrams, Gherkin stories, and persona documents | + +### Use Cases Served +- Full wiki generation from a legacy repository +- Review of the *Core Entities*, *Integrations*, and *Hard Specifications* sections +- Conflict review for migration risk prioritisation +- Sharing generated diagrams and user stories as briefing materials + +--- + +## Persona 2 — The Migration Developer + +> *"I need to understand a specific subsystem quickly and know which parts of the source back that up."* + +### Profile +A developer on the migration team working at the implementation level. Inherits specific domain areas to re-implement and needs targeted, verifiable knowledge about those areas without reading the entire legacy codebase. + +### Goals +- Understand the behaviour and boundaries of an assigned domain area. +- Trace any uncertainty back to the exact source location. +- Ask follow-up questions about the system without re-reading multiple files. + +### Needs +- Per-section wiki bodies with inline citations. +- Cross-file flow descriptions that show how files and components interact, not just what each file does in isolation. +- An interactive conversational interface grounded in the full wiki for targeted queries. +- Resumable analysis so a large codebase can be processed incrementally and interrupted runs are not lost. 
+ +### Pain Points +| Pain point | How the system addresses it | +|---|---| +| Technical debt obscures the boundary between accidental and essential complexity | Technology-agnostic extraction separates domain behaviour from implementation noise | +| No way to ask targeted questions without reading source | Multi-turn chat session grounded in all populated wiki sections | +| Uncertainty about which source files are authoritative | Import and reference graph enriches findings with inter-file context; citations identify exact source spans | +| Repetitive re-runs on large repos are slow | Content-addressed cache replays unchanged file results; interrupted walks resume from the last processed file | + +### Use Cases Served +- Querying the interactive chat session for specific domain questions +- Reading per-section markdown with source citations +- Reviewing cross-file flow descriptions produced by the reference graph enrichment +- Verifying claims against cited file locations and line ranges -### Documented Gaps & Unresolved Persona Dimensions +--- -The upstream specifications define the system’s operational boundaries and target audiences but remain silent on several persona-specific dimensions. These gaps must be resolved before production deployment in complex or regulated environments: +## Persona 3 — The Domain Knowledge Consumer -- **Role-Based Configuration Presets:** No predefined configuration profiles or heuristic thresholds are specified for balancing computational cost against result quality per persona. -- **Access & Security Controls:** Authentication, rate-limiting, and role-based access constraints for AI provider interactions and workspace management are not defined. -- **Workflow Integration Points:** Exact data schemas, serialization formats, and error-handling/retry policies for inter-module handoffs are unspecified, leaving persona-specific CI/CD or documentation workflow integration undefined. 
-- **Conflict Resolution:** Strategies for reconciling contradictory extracted insights across files are not documented, which may impact how architects and writers validate synthesized sections. -- **Non-Essential Classification Criteria:** Specific heuristics for classifying artifacts as "non-essential" across highly customized or non-standard repository structures remain undefined, potentially affecting portfolio managers scanning atypical acquisition targets. +> *"I need to understand what this system does, but I cannot read the source code."* + +### Profile +A stakeholder — for example, a domain expert, product owner, or business analyst — who holds contextual knowledge about what the system is supposed to do but lacks the ability or time to read the codebase directly. May need to validate whether the extracted wiki accurately reflects business intent or to answer specific questions about system behaviour. + +### Goals +- Gain a clear, jargon-free understanding of what the system does and why. +- Validate or challenge the extracted domain model against real-world business knowledge. +- Ask specific questions without requiring a technical intermediary. + +### Needs +- Plain-language, technology-agnostic output that does not assume programming knowledge. +- A conversational interface for targeted questions rather than having to read structured markdown. +- Assurance that only populated, meaningful content is included in any context provided to the assistant. 
+ +### Pain Points +| Pain point | How the system addresses it | +|---|---| +| No readable documentation exists for the legacy system | The generated wiki is expressed in domain terms without implementation-specific language | +| Technical intermediaries are needed to answer basic questions about behaviour | The interactive chat session allows direct conversational querying of the wiki | +| Risk that the AI-generated summary does not reflect ground truth | Full traceability to source and explicit conflict surfacing allow domain experts to challenge assertions | + +### Use Cases Served +- Reading generated wiki sections (particularly *Business Domains*, *System Intent*, and *User Personas*) +- Conducting multi-turn chat sessions to interrogate specific capabilities or entities +- Reviewing Gherkin-style user stories for business accuracy + +--- + +## Persona 4 — The Wiki Operator + +> *"I need to keep this wiki accurate, complete, and trustworthy as the codebase evolves."* + +### Profile +A technical lead, DevOps engineer, or senior developer responsible for running and maintaining the wiki-generation pipeline over time. Focuses on pipeline health, analysis completeness, and quality assurance rather than consuming the wiki content directly. + +### Goals +- Run and re-run the pipeline efficiently as the codebase changes. +- Monitor which areas of the codebase produced no useful findings (dead zones). +- Validate that generated sections meet a defined quality bar before the wiki is shared with the wider team. +- Configure the pipeline to match the constraints of the deployment environment (on-premise AI backend, private endpoints, exclusion patterns). + +### Needs +- Coverage reports showing per-section file counts, finding counts, and body sizes. +- Identification of dead zones — files that were processed but produced no findings. +- A configurable quality threshold that triggers automatic revision when sections fall below it. 
+- Support for on-premise or privately hosted AI backends for air-gapped or data-sensitive environments.
+- Idempotent workspace initialisation so re-runs do not overwrite existing work.
+
+### Pain Points
+| Pain point | How the system addresses it |
+|---|---|
+| Large repositories make full re-runs prohibitively slow | Two-scope content-addressed cache means only changed files and affected sections are reprocessed |
+| Blind spots in the analysis go undetected | Coverage report surfaces dead zones and per-section gaps |
+| Generated sections may introduce unsupported claims | Critic-and-reviser pass scores each section and auto-revises below a configurable threshold |
+| Interrupted runs waste all completed work | Results are persisted after every completed file; walks resume from the first unprocessed file |
+| Different deployment environments require different AI backends | Active backend is selected via environment variable or per-invocation flag; no pipeline code changes needed |
+
+### Use Cases Served
+- Running incremental and full wiki-generation pipelines
+- Reviewing the coverage and quality report
+- Configuring quality thresholds and exclusion patterns
+- Selecting and overriding the AI backend for private or on-premise deployments
+- Forcing cache invalidation when a clean re-walk is required
+
+---
+
+## Persona Summary
+
+| Persona | Primary interaction | Core output consumed | Key system capability relied on |
+|---|---|---|---|
+| Migration Architect | CLI — full wiki generation | All eleven sections; diagrams; user stories | Tech-agnostic extraction; conflict surfacing; derivative section synthesis |
+| Migration Developer | CLI + interactive chat | Per-section bodies with citations; chat responses | Cross-file context enrichment; conversational querying; incremental caching |
+| Domain Knowledge Consumer | Interactive chat; generated markdown | Plain-language wiki sections; Gherkin stories | Conversational session; technology-agnostic output |
+| Wiki Operator | CLI — pipeline management and reporting | Coverage and quality reports | Incremental walks; dead zone detection; critic-and-reviser pass; backend configuration | -These gaps do not alter the system’s core purpose but should be addressed in implementation contracts or operational runbooks to fully support each persona’s workflow expectations. +> **Coverage note:** The upstream sections do not describe any end-user of the *target* legacy system as an audience for wikifi itself. All personas above are consumers of the extraction tool and its outputs, not of the system being analysed. diff --git a/.wikifi/user_stories.md b/.wikifi/user_stories.md index c4d4713..289efdb 100644 --- a/.wikifi/user_stories.md +++ b/.wikifi/user_stories.md @@ -1,121 +1,300 @@ # User Stories -### Feature: Intelligent Traversal & Structural Analysis +## Feature: Wiki Workspace Initialisation -**User Story** -As a Portfolio Manager & Acquisition Integrator, I want the system to automatically filter out non-essential files and large binaries during repository scanning, so that I can assess system purpose and classification rationale without resource exhaustion. 
+**As a Wiki Operator, I want the workspace to be initialised idempotently, so that re-running setup does not destroy work that has already been completed.** ```gherkin -Given a target repository containing mixed-paradigm artifacts and version-controlled noise -When the structural analysis stage executes with configured path filters and size thresholds -Then the system excludes irrelevant directories and oversized assets -And produces a directory summary reflecting only allowed traversal boundaries -And generates an introspection assessment identifying primary languages and system purpose +Given a target project root that already contains a partially populated wiki workspace +When the workspace initialisation command is invoked again +Then the existing directory structure, configuration file, version-control ignore rules, + and per-section placeholder documents are left untouched +And no previously generated section bodies are overwritten ``` -**Entities Involved:** `Scan/Traversal Config`, `Directory Summary`, `Introspection Assessment` -**Acceptance Criteria:** -- Processing never exceeds defined size constraints or traverses excluded paths. -- Directory statistics accurately reflect file counts, total size, and extension distribution within allowed boundaries. -- Classification rationale is derived strictly from structural data and path filters. -- *(Gap Declaration)* Specific heuristics for classifying artifacts as "non-essential" across highly customized or non-standard repository structures remain undefined. 
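The idempotent initialisation scenario added in this hunk can be sketched in Python as follows. This is an illustrative sketch only: the `init_workspace` function, the section names in `SECTIONS`, and the placeholder text are hypothetical, and only the `.wikifi/` directory name is taken from the file paths in this patch.

```python
from pathlib import Path

# Hypothetical subset of section names, used only for this illustration.
SECTIONS = ("intent", "personas", "user_stories")


def init_workspace(root: Path, sections=SECTIONS) -> list[Path]:
    """Create any missing placeholder files and return the paths created.

    Existing files are never opened for writing, so repeated calls leave
    previously generated section bodies untouched.
    """
    wiki_dir = root / ".wikifi"
    wiki_dir.mkdir(parents=True, exist_ok=True)  # no error if already present
    created: list[Path] = []
    for name in sections:
        target = wiki_dir / f"{name}.md"
        try:
            # Mode "x" fails if the file exists, which makes the
            # "do not overwrite" rule explicit at the filesystem level.
            with target.open("x", encoding="utf-8") as handle:
                handle.write(f"# {name}\n\n_No content yet._\n")
        except FileExistsError:
            continue
        created.append(target)
    return created
```

Exclusive-create mode is used instead of an `exists()` check followed by a write, so a file that appears between the check and the write cannot be clobbered.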
+--- + +## Feature: Technology-Agnostic Wiki Generation + +**As a Migration Architect, I want the wiki to express all findings in domain terms rather than in the vocabulary of the legacy technology stack, so that the new system design is not inadvertently shaped by the old implementation's structure.** + +```gherkin +Given a legacy codebase built on any technology stack +When a full wiki generation run completes +Then every wiki section body is expressed in technology-agnostic, domain-level language +And no technology-specific constructs, naming conventions, or structural patterns + from the source appear in the generated output +``` + +**As a Migration Architect, I want the wiki to be organised into all defined sections covering domains, intent, capabilities, integrations, entities, cross-cutting concerns, and hard specifications, so that I have a complete set of artefacts to brief the wider delivery team.** + +```gherkin +Given a completed wiki generation run against a legacy repository +When I inspect the generated wiki workspace +Then eight primary sections are populated with evidence-backed content +And three derivative sections (user personas, user stories, and architectural diagrams) + are synthesised from the completed primary sections +And any section for which no evidence was found contains an explicit placeholder + declaring the gap rather than fabricated content +``` + +--- + +## Feature: Source Traceability + +**As a Migration Developer, I want every assertion in the wiki to carry a numbered citation back to its originating file and line range, so that I can verify any claim without re-reading the entire codebase.** + +```gherkin +Given a populated wiki section body +When I read a claim made in that section +Then the claim is annotated with a numbered citation +And the citation resolves to a specific repo-relative file path + and an inclusive line range in the source repository +And a content fingerprint is stored alongside the citation to enable 
staleness detection +``` + +**As a Migration Developer, I want claims that have no source backing to be explicitly flagged as unsupported, so that I know which assertions require further investigation rather than assuming all claims are verified.** + +```gherkin +Given a wiki section that contains a claim for which no source reference could be identified +When the section is rendered +Then the claim is explicitly marked as unsupported +And no citation number is fabricated or silently omitted without a visible notice +``` + +--- + +## Feature: Conflict Surfacing + +**As a Migration Architect, I want contradictory assertions from different parts of the codebase to be surfaced explicitly, so that I can treat disagreements as high-priority migration risks rather than discovering them later in the build.** + +```gherkin +Given two or more source files that make incompatible assertions about the same topic +When the section aggregation stage processes their findings +Then a dedicated "Conflicts in source" block is included in the relevant section body +And each conflicting position is listed with its own source references +And no silent resolution or averaging of the conflict is performed +``` + +--- + +## Feature: Cross-File Context Enrichment + +**As a Migration Developer, I want extraction findings to reflect inter-file flows rather than isolated per-file summaries, so that I understand how components in an assigned domain area interact with one another.** + +```gherkin +Given an in-scope file that imports other in-scope files or is imported by them +When the extraction pipeline processes that file +Then the file's import neighbourhood (files it depends on and files that depend on it) + is included as context during extraction +And the resulting findings describe cross-file interactions + rather than treating the file in isolation +``` + +--- + +## Feature: Specialised Schema Parsing + +**As a Migration Architect, I want structured schema files — including SQL 
definitions, API contract specifications, interface definition files, graph schemas, and database migrations — to be parsed deterministically, so that entity and relationship extraction from these files is reliable and reproducible.** + +```gherkin +Given a repository that contains structured schema artifacts +When the extraction pipeline classifies and routes those files +Then each schema file is processed by a purpose-built deterministic parser +And the resulting findings describe entities, relationships, operations, and constraints +And no AI model invocation is required for the deterministic parsing path +``` + +**As a Wiki Operator, I want unparseable schema files to produce an advisory finding directing reviewers to inspect the file manually, so that a single malformed file does not silently omit domain knowledge from the wiki.** + +```gherkin +Given a schema file that cannot be parsed by the deterministic parser +When the specialised extraction path attempts to process that file +Then an advisory finding is produced directing reviewers to inspect the file manually +And no silent failure or empty result is returned without notice +``` + +--- + +## Feature: Large File Handling + +**As a Migration Developer, I want large source files to be fully analysed regardless of size, so that no content is silently missed during AI-assisted extraction.** + +```gherkin +Given a source file whose size exceeds the processing capacity of a single extraction pass +When the extraction pipeline routes the file through the AI-assisted extraction path +Then the file is recursively split into overlapping chunks +And each chunk is processed independently +And findings from all chunks are combined so that no content is omitted +``` --- -### Feature: Domain-Centric Translation & Granular Extraction +## Feature: Incremental and Resumable Processing -**User Story** -As an Onboarding Engineering Practitioner, I want technical implementations translated into domain concepts with explicit 
gap declarations, so that I can quickly map business logic and functional capabilities without manual reverse-engineering. +**As a Migration Developer, I want incremental runs to skip files and sections that have not changed, so that iterating on a large codebase does not require waiting for a full re-walk each time.** ```gherkin -Given a set of source files within the scoped processing boundaries -When the granular extraction stage translates technical implementations into domain concepts -Then the system strips implementation-specific syntax to surface underlying business rules -And creates timestamped extraction notes linking each file to a role summary and finding -And preserves raw evidence for ambiguous data instead of generating speculative content +Given a repository that has been walked at least once +And a subsequent run in which only a subset of files has changed +When the pipeline runs again +Then only files whose content fingerprint has changed are re-extracted +And only wiki sections whose contributing notes payload has changed are re-aggregated +And all unchanged results are served from the content-addressed cache ``` -**Entities Involved:** `Configuration`, `Extraction Note` -**Acceptance Criteria:** -- Each extraction note is immutable once created and tied to a single source file. -- Technical artifacts are consistently mapped to business-readable concepts. -- Missing or ambiguous information is explicitly documented rather than filled speculatively. -- *(Gap Declaration)* Strategies for reconciling contradictory extracted insights across files are not documented. 
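The incremental-run scenario added in this hunk (re-extract only files whose content fingerprint changed, serve the rest from a content-addressed cache) can be sketched as below. This is a minimal sketch under stated assumptions, not the code in `wikifi/cache.py`: the `ExtractionCache` class, the use of SHA-256, and the single JSON file layout are all hypothetical choices made for illustration.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def fingerprint(data: bytes) -> str:
    """Stable content fingerprint (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(data).hexdigest()


class ExtractionCache:
    """Hypothetical content-addressed cache keyed by file fingerprint."""

    def __init__(self, path: Path):
        self.path = path
        self.entries: dict[str, list[str]] = {}
        if path.exists():
            # Loading earlier results is what makes an interrupted run resumable.
            self.entries = json.loads(path.read_text(encoding="utf-8"))

    def get_or_extract(self, content: bytes, extract) -> list[str]:
        key = fingerprint(content)
        if key not in self.entries:
            # Only unseen content reaches the (expensive) extractor.
            self.entries[key] = extract(content)
            self._persist()  # persist after every newly processed file
        return self.entries[key]

    def _persist(self) -> None:
        # Write to a temporary file in the same directory, then swap it in
        # atomically so a crash cannot leave a half-written cache behind.
        fd, tmp_name = tempfile.mkstemp(dir=self.path.parent, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(self.entries, handle)
        os.replace(tmp_name, self.path)
```

Because the key is derived from content rather than from path or modification time, a renamed but otherwise unchanged file still hits the cache, and deleting the cache file forces a clean re-walk.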
+**As a Wiki Operator, I want an interrupted pipeline run to resume from the last processed file, so that no completed extraction work is lost when a run is cut short.** + +```gherkin +Given a pipeline run that was interrupted before processing all in-scope files +When the pipeline is restarted +Then processing resumes from the first unprocessed file +And all previously persisted extraction results are retained and not re-run +``` + +**As a Wiki Operator, I want to force a full cache invalidation, so that I can obtain a clean re-walk when the pipeline configuration or extraction logic has materially changed.** + +```gherkin +Given an existing populated cache for a repository +When a cache invalidation is requested +Then all cached extraction and aggregation results are cleared +And the next run processes every in-scope file from scratch +``` --- -### Feature: Section Synthesis & Dual-Mode Generation +## Feature: Derivative Section Synthesis -**User Story** -As a Technical Writer & System Architect, I want schema-validated structured generation combined with free-form narrative clarity, so that I can produce consistent, technology-agnostic documentation baselines. 
+**As a Migration Architect, I want Gherkin-style user stories generated automatically from the extracted wiki, so that I can brief the delivery team with structured acceptance criteria without writing them by hand.** ```gherkin -Given aggregated extraction notes from the granular extraction stage -When the section synthesis stage consolidates findings into documentation units -Then the system applies schema-validated structured generation for systematic phases -And uses free-form analytical generation for narrative clarity -And outputs finalized wiki sections with consistent terminology and structure +Given all primary wiki sections have been populated with evidence-backed content +When the derivative section synthesis stage runs for user stories +Then a set of Gherkin-style user stories is produced, grouped by feature +And each story is grounded only in capabilities and entities present in the primary sections +And if the required upstream primary sections are empty, a placeholder is written + declaring the gap rather than fabricating stories ``` -**Entities Involved:** `Documentation Section`, `Aggregation Stats`, `Workspace Layout` -**Acceptance Criteria:** -- Sections are generated only after successful note aggregation. -- Aggregation statistics track successful writes and explicitly flag empty sections. -- Directory structure remains consistent across pipeline runs, handling scaffolding and intermediate state cleanup. -- *(Gap Declaration)* Exact mapping rules between intermediate extraction notes and final documentation sections are implied by the aggregation process but not explicitly detailed. 
+**As a Migration Architect, I want Mermaid architectural diagrams generated automatically from the extracted wiki, so that I have portable visual artefacts for briefing the delivery team.** + +```gherkin +Given all primary wiki sections have been populated with evidence-backed content +When the derivative section synthesis stage runs for architectural diagrams +Then valid Mermaid diagram markup is produced + reflecting entities and relationships found in the primary sections +And no diagram elements are introduced that are not supported by the primary section evidence +``` --- -### Feature: Cross-Cutting Derivation & Behavioral Mapping +## Feature: Interactive Knowledge Querying + +**As a Migration Developer, I want to ask targeted questions about a specific domain area through a conversational interface, so that I can find precise answers without reading multiple wiki sections sequentially.** + +```gherkin +Given a wiki that has been generated for a legacy repository +When I open an interactive chat session +Then the session is grounded in all meaningfully populated wiki sections +And placeholder or empty sections are excluded from the context +And I can conduct multi-turn exchanges with conversation history retained across turns +``` -**User Story** -As a Technical Writer & System Architect, I want the system to derive behavioral stories and interaction diagrams from cross-component relationships, so that I can capture system interactions and maintain long-term documentation stability. 
+**As a Domain Knowledge Consumer, I want to interrogate system behaviour conversationally without needing a technical intermediary, so that I can validate or challenge the extracted domain model directly against my own business knowledge.** ```gherkin -Given finalized documentation sections and extracted domain concepts -When the cross-cutting derivation stage identifies relationships spanning multiple components -Then the system generates behavioral narratives and system interaction diagrams -And auto-generates behavioral personas as downstream artifacts -And ensures deterministic, stage-gated execution for reproducible outputs +Given a populated wiki for a legacy system +When I open an interactive chat session and ask a question about system behaviour +Then the assistant responds using only information present in populated wiki sections +And the response is expressed in plain, domain-level language + without implementation-specific detail +And I can reset conversation history and begin a fresh line of questioning + within the same session while retaining the wiki context ``` -**Entities Involved:** `Documentation Section`, `Execution Summary` -**Acceptance Criteria:** -- Cross-cutting relationships are identified without manual authoring overhead. -- Generated artifacts maintain traceability back to original source artifacts. -- Execution follows a deterministic, four-stage pipeline progression. -- *(Gap Declaration)* Workflow integration points, including exact data schemas, serialization formats, and error-handling/retry policies for inter-module handoffs, are unspecified. 
+**As a Migration Developer, I want to inspect which wiki sections are currently loaded as context in a chat session, so that I understand the boundaries of the assistant's knowledge before relying on its answers.** + +```gherkin +Given an active interactive chat session +When I request a context introspection +Then the session reports which sections are currently loaded as grounding context +And sections that are empty or contain only placeholders are listed as excluded +``` --- -### Feature: Execution Reporting & Provenance Tracking +## Feature: Quality Assurance Pass -**User Story** -As a Portfolio Manager & Acquisition Integrator, I want detailed execution summaries and timestamped provenance for all generated artifacts, so that I can ensure auditability and verify pipeline health across acquisition targets. +**As a Wiki Operator, I want each synthesised section to be scored against its brief and the upstream evidence, so that I can identify sections containing unsupported claims before the wiki is shared with the team.** ```gherkin -Given a completed pipeline run across all processing stages -When the system consolidates metrics, findings, and completion status -Then an execution summary is generated as a single source of truth for pipeline health -And a chronological record of extraction notes is maintained per section -And file inclusion/exclusion metrics and generation status are reported for full auditability +Given a synthesised wiki section and the upstream evidence it was derived from +When the critic-and-reviser pass runs +Then a quality score between 0 and 10 is produced for that section +And the critique itemises unsupported claims, gaps relative to the section brief, + and concrete revision suggestions ``` -**Entities Involved:** `Execution Summary`, `Extraction Note`, `Aggregation Stats` -**Acceptance Criteria:** -- Execution summary is generated only after all pipeline stages report completion. 
-- Provenance enables traceability from final documentation back to original source artifacts. -- Pipeline health metrics and output readiness are verified before final delivery. -- *(Gap Declaration)* Authentication, rate-limiting, and role-based access constraints for workspace management and AI provider interactions are not defined. +**As a Wiki Operator, I want sections that score below a configurable threshold to be automatically revised, so that low-quality sections are improved before the wiki reaches the broader team.** + +```gherkin +Given a synthesised section whose quality score falls below the configured threshold +When automatic revision is triggered +Then a revised body is produced informed by the critique's suggestions +And the revised section is accepted only if its score matches or improves on the original +And if the revision would regress the score, it is rejected + and the original body is retained +``` --- -### Story-to-Component Mapping Reference +## Feature: Coverage and Dead-Zone Reporting + +**As a Wiki Operator, I want a coverage report showing per-section file counts, finding counts, body sizes, and quality scores, so that I can assess the completeness of the wiki before distributing it.** + +```gherkin +Given a completed wiki-generation run +When the report command is invoked +Then a markdown table is produced listing every wiki section + with its contributing file count, finding count, body size, and emptiness status +And where a critic pass has been run, the quality score and highest-priority content gap + are included for each section +``` + +**As a Wiki Operator, I want files that were processed but produced no findings to be surfaced as dead zones in the coverage report, so that I can identify blind spots in the analysis and decide whether to investigate them further.** + +```gherkin +Given a repository walk in which some in-scope files yielded no findings for any section +When the coverage report is generated +Then a list of 
dead-zone files is included in the report +And those files are distinguished from files that were excluded during the classification stage +``` + +--- -| Feature | Primary Persona | Core Capability | Key Entities | Known Gaps Addressed | -|---|---|---|---|---| -| Intelligent Traversal & Structural Analysis | Portfolio Manager & Acquisition Integrator | Intelligent Traversal & Filtering | `Scan/Traversal Config`, `Directory Summary`, `Introspection Assessment` | Non-essential classification heuristics | -| Domain-Centric Translation & Granular Extraction | Onboarding Engineering Practitioner | Granular Extraction / Domain-Centric Translation | `Configuration`, `Extraction Note` | Contradictory insight resolution | -| Section Synthesis & Dual-Mode Generation | Technical Writer & System Architect | Section Synthesis / Dual-Mode Generation | `Documentation Section`, `Aggregation Stats`, `Workspace Layout` | Note-to-section mapping rules | -| Cross-Cutting Derivation & Behavioral Mapping | Technical Writer & System Architect | Cross-Cutting Derivation | `Documentation Section`, `Execution Summary` | Workflow integration & serialization schemas | -| Execution Reporting & Provenance Tracking | Portfolio Manager & Acquisition Integrator | Execution Reporting / Timestamped Provenance | `Execution Summary`, `Extraction Note`, `Aggregation Stats` | Access controls & role-based presets | +## Feature: AI Backend Configuration + +**As a Wiki Operator, I want to select and override the active AI backend via an environment variable or a per-invocation flag, so that the pipeline can run in air-gapped or data-sensitive environments without requiring code changes.** + +```gherkin +Given a deployment environment that requires a privately hosted or on-premise AI backend +When the pipeline is invoked with a backend selection flag or the corresponding + environment variable set +Then all AI-assisted extraction, aggregation, derivation, and critic calls + are routed to the specified 
backend +And no modification to pipeline code or shared configuration files is required +``` + +--- + +## Feature: Graceful Degradation + +**As a Wiki Operator, I want the pipeline to recover gracefully from AI synthesis failures on individual sections, so that a single failed section does not prevent the rest of the wiki from being produced.** + +```gherkin +Given a wiki section for which AI synthesis fails during aggregation +When the pipeline handles the failure +Then the raw collected notes for that section are emitted directly in the section body +And the error is surfaced inline in the section rather than silently suppressed +And all remaining sections continue to be generated normally +``` From bb4f9c773c2ca945093201288864036f4ba31da4 Mon Sep 17 00:00:00 2001 From: Dallas Pool Date: Fri, 1 May 2026 21:40:19 -0500 Subject: [PATCH 4/9] fix(pr15): address review comments and Anthropic empty-response bug MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Addresses every review comment on PR #15 plus the user-reported walk failure where adaptive thinking exhausted ``max_tokens`` and the Anthropic SDK returned an empty structured response. Human review comments - providers: introduce nominal ``LLMProvider`` ABC; Ollama, Anthropic, OpenAI and the test ``MockProvider`` now inherit from it. Hosted providers share ``format_api_error`` instead of duplicating the helper. - specialized: split the package — ``models.py`` (dataclasses), ``dispatch.py`` (``select(kind, rel_path=…)``); ``__init__.py`` is now a docstring-only marker per the project's no-re-exports rule. Copilot review comments - specialized.dispatch: only route SQL-shaped migrations (``.sql``/``.ddl``) through the SQL parser. Python/JS/Ruby migration scripts (Alembic, Django, Knex) stay on the LLM path. - extractor: honor ``settings.use_specialized_extractors``; wire through ``orchestrator.run_walk``. 
- specialized.graphql: handle ``extend type Query/Mutation`` and indented closing braces in ``_block_after``; anchor line numbers on the captured name offset so the leading-newline regex artifact no longer points one line above the declaration. - specialized.protobuf: bound each service's RPCs to its own ``{ … }`` block so multi-service files stop attributing later RPCs to the first service. - specialized.sql: count both CREATE and ALTER targets in the migration summary so an ALTER-only migration no longer reports "0 table(s)". - repograph: parse Python relative imports (``from .b import x``, ``from . import helpers``, ``from ..sibling import x``) and resolve them within the package, instead of stripping the leading dots and missing every intra-package edge. - cache.hash_section_notes: include each note's ``sources`` (file, lines, fingerprint) in the digest so cache hits can't replay stale citations after lines or file fingerprints change. - report: derive coverage from the on-disk notes JSONL first; fall back to the cache only when no notes exist. ``wikifi report`` after ``walk --no-cache`` (or after a manual cache wipe) now reports accurate coverage instead of 0%. - evidence.render_section_body: insert per-claim ``[N]`` markers next to matching sentences in the body, with a "Supporting claims" list for paraphrased claims that don't appear verbatim. - README: document the ``--provider openai`` option. Anthropic empty-response bug (user-reported) - anthropic_provider: bump ``DEFAULT_MAX_TOKENS`` 16K → 32K and raise ``settings.anthropic_max_tokens`` default to match. The 16K default was leaving no room for the structured-output block when adaptive thinking ran at ``effort=high``, causing the reported ``empty parsed_output and parse fallback failed`` error on hard sections like ``hard_specifications`` and ``diagrams``. 
- emit a diagnostic ``RuntimeError`` that names ``stop_reason``, ``output_tokens``, ``max_tokens``, and the relevant tuning knob ("raise max_tokens", "lower think effort") instead of letting a cryptic ``Invalid JSON: EOF`` pydantic error escape. Tests - 180 tests pass (was 168) at 93% coverage. Each fixed bug has a dedicated regression test. Note: bumping the ``hash_section_notes`` shape silently invalidates existing aggregation cache entries on disk. The next walk regenerates them — no action required from users. --- README.md | 2 +- tests/conftest.py | 4 +- tests/test_anthropic_provider.py | 29 +++++++ tests/test_cache.py | 47 ++++++++++ tests/test_evidence.py | 45 ++++++++++ tests/test_extractor.py | 32 +++++++ tests/test_repograph.py | 50 +++++++++++ tests/test_report.py | 26 ++++++ tests/test_specialized.py | 116 ++++++++++++++++++++++++- wikifi/cache.py | 47 +++++++++- wikifi/config.py | 11 ++- wikifi/evidence.py | 75 +++++++++++++++- wikifi/extractor.py | 5 +- wikifi/orchestrator.py | 1 + wikifi/providers/anthropic_provider.py | 76 ++++++++++------ wikifi/providers/base.py | 39 +++++++-- wikifi/providers/ollama_provider.py | 4 +- wikifi/providers/openai_provider.py | 19 ++-- wikifi/repograph.py | 64 ++++++++++++-- wikifi/report.py | 22 ++++- wikifi/specialized/__init__.py | 62 ++----------- wikifi/specialized/dispatch.py | 62 +++++++++++++ wikifi/specialized/graphql.py | 74 ++++++++++++---- wikifi/specialized/models.py | 30 +++++++ wikifi/specialized/openapi.py | 2 +- wikifi/specialized/protobuf.py | 43 ++++++++- wikifi/specialized/sql.py | 14 ++- 27 files changed, 851 insertions(+), 150 deletions(-) create mode 100644 wikifi/specialized/dispatch.py create mode 100644 wikifi/specialized/models.py diff --git a/README.md b/README.md index 9514185..3805c80 100644 --- a/README.md +++ b/README.md @@ -21,7 +21,7 @@ uv run wikifi init - `walk` — main entry point. Walks the target codebase and produces the wiki content. 
- `--no-cache` — force a clean re-walk; drops the on-disk extraction + aggregation caches. - `--review` — run the critic + reviser loop on derivative sections (personas, user stories, diagrams). - - `--provider {ollama|anthropic}` — override the configured provider for this walk. + - `--provider {ollama|anthropic|openai}` — override the configured provider for this walk. - `report` — print a coverage + quality report (per-section file counts, findings, body sizes). - `--score` — additionally run the critic on every populated section for a 0-10 quality score. - `ask` — natural language queries against the wiki content, with optional context injection from the target codebase. diff --git a/tests/conftest.py b/tests/conftest.py index d406e18..f327544 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -15,6 +15,8 @@ class so the same provider can serve introspection, extraction, and import pytest from pydantic import BaseModel +from wikifi.providers.base import LLMProvider + T = TypeVar("T", bound=BaseModel) # --------------------------------------------------------------------------- @@ -22,7 +24,7 @@ class so the same provider can serve introspection, extraction, and # --------------------------------------------------------------------------- -class MockProvider: +class MockProvider(LLMProvider): """Test double for ``LLMProvider`` driven by per-schema response queues.""" name = "mock" diff --git a/tests/test_anthropic_provider.py b/tests/test_anthropic_provider.py index 8abb298..9d80a97 100644 --- a/tests/test_anthropic_provider.py +++ b/tests/test_anthropic_provider.py @@ -166,3 +166,32 @@ def test_cache_system_prompt_off_returns_plain_string(): provider = AnthropicProvider(client=client, cache_system_prompt=False) provider.complete_text(system="SYS", user="u") assert client.create_calls[0]["system"] == "SYS" + + +def test_complete_json_raises_diagnostic_on_fully_empty_response(): + """Empty parsed_output AND empty text → emit a diagnostic with knobs. 
+ + Locks in the user-reported failure mode where adaptive thinking + consumes the entire ``max_tokens`` budget and the structured + output block never lands. The replacement RuntimeError must + surface ``stop_reason``, ``output_tokens``, and ``max_tokens`` so + operators see which knob to turn (raise max_tokens, lower think + effort) instead of the original cryptic "Invalid JSON: EOF" + pydantic validation error. + """ + response = SimpleNamespace( + parsed_output=None, + content=[], + stop_reason="max_tokens", + usage=SimpleNamespace(output_tokens=16_000), + ) + client = _StubClient(parse_response=response) + provider = AnthropicProvider(client=client, max_tokens=16_000) + with pytest.raises(RuntimeError) as info: + provider.complete_json(system="s", user="u", schema=_Echo) + msg = str(info.value) + # Operator-facing diagnostic — names the knobs, not the SDK internals. + assert "max_tokens=16000" in msg + assert "output_tokens=16000" in msg + assert "stop_reason='max_tokens'" in msg + assert "raise max_tokens" in msg.lower() or "lower think" in msg.lower() diff --git a/tests/test_cache.py b/tests/test_cache.py index c515c3b..9081de5 100644 --- a/tests/test_cache.py +++ b/tests/test_cache.py @@ -139,3 +139,50 @@ def test_cache_version_is_pinned(): """Bumps to CACHE_VERSION should be intentional — guard against drift.""" assert isinstance(CACHE_VERSION, int) assert CACHE_VERSION >= 1 + + +def test_hash_section_notes_changes_when_sources_change(): + """The aggregation cache key must reflect each note's `sources`. + + Two notes with identical finding text but different source line + ranges or fingerprints describe different evidence; reusing the + same cached body would replay stale citations against new code. 
+ """ + base = [ + { + "file": "a.py", + "summary": "role", + "finding": "Order entity.", + "sources": [{"file": "a.py", "lines": [1, 30], "fingerprint": "abc1234"}], + } + ] + same = [ + { + "file": "a.py", + "summary": "role", + "finding": "Order entity.", + "sources": [{"file": "a.py", "lines": (1, 30), "fingerprint": "abc1234"}], + } + ] + moved_lines = [ + { + "file": "a.py", + "summary": "role", + "finding": "Order entity.", + "sources": [{"file": "a.py", "lines": [42, 70], "fingerprint": "abc1234"}], + } + ] + new_fingerprint = [ + { + "file": "a.py", + "summary": "role", + "finding": "Order entity.", + "sources": [{"file": "a.py", "lines": [1, 30], "fingerprint": "deadbee"}], + } + ] + # Tuple vs list line range: same logical evidence, identical hash. + assert hash_section_notes(base) == hash_section_notes(same) + # Lines moved → new evidence → cache must miss. + assert hash_section_notes(base) != hash_section_notes(moved_lines) + # File contents changed (fingerprint shifted) → cache must miss. + assert hash_section_notes(base) != hash_section_notes(new_fingerprint) diff --git a/tests/test_evidence.py b/tests/test_evidence.py index b5bf3dc..2680023 100644 --- a/tests/test_evidence.py +++ b/tests/test_evidence.py @@ -76,3 +76,48 @@ def test_coalesce_refs_dedupes_by_render(): out = coalesce_refs(refs) assert len(out) == 2 assert {r.render() for r in out} == {"a.py:1-10", "b.py"} + + +def test_render_section_body_inserts_claim_markers_inline(): + """Each supported claim's text in the body picks up its `[N]` marker. + + Without inline markers the reader has the source list at the bottom + of the section but no way to tell which sentence each source backs. + """ + bundle = EvidenceBundle( + body="Orders carry line items. 
Tax is computed downstream.", + claims=[ + Claim(text="Orders carry line items.", sources=[SourceRef(file="src/order.py", lines=(1, 30))]), + Claim(text="Tax is computed downstream.", sources=[SourceRef(file="src/billing.py", lines=(40, 60))]), + ], + ) + out = render_section_body(bundle) + # Markers are appended next to the matching sentences, in source order. + assert "Orders carry line items.[1]" in out + assert "Tax is computed downstream.[2]" in out + # Sources footer still enumerates the distinct refs. + assert "1. `src/order.py:1-30`" in out + assert "2. `src/billing.py:40-60`" in out + + +def test_render_section_body_paraphrased_claims_listed_as_supporting(): + """Claims whose text doesn't appear verbatim go in a Supporting list. + + A conservative inline match avoids attaching markers to the wrong + sentence when the aggregator paraphrased — the claim still gets a + citation, just out-of-line. + """ + bundle = EvidenceBundle( + body="The system tracks orders end-to-end.", + claims=[ + Claim( + text="Order state transitions are persisted on every change.", + sources=[SourceRef(file="src/order.py", lines=(80, 95))], + ), + ], + ) + out = render_section_body(bundle) + assert "## Supporting claims" in out + assert "Order state transitions are persisted on every change." in out + assert "[1]" in out # marker still attached to the supporting-claim entry + assert "1. `src/order.py:80-95`" in out diff --git a/tests/test_extractor.py b/tests/test_extractor.py index 2b27cca..029425b 100644 --- a/tests/test_extractor.py +++ b/tests/test_extractor.py @@ -501,3 +501,35 @@ def test_extract_repo_drops_derivative_section_findings(tmp_path, mock_provider_ assert len(read_notes(layout, "entities")) == 1 assert read_notes(layout, derivative_id) == [] + + +def test_extract_repo_use_specialized_extractors_false_falls_back_to_llm(tmp_path, mock_provider_factory): + """`use_specialized_extractors=False` keeps schema files on the LLM path. 
+ + Lock in the `use_specialized_extractors` setting wired through from + config — without this the knob would be silently ignored and SQL/ + GraphQL/Protobuf/OpenAPI files would always bypass the LLM regardless + of the user's explicit opt-out. + """ + layout = _layout(tmp_path) + (tmp_path / "schema.sql").write_text("CREATE TABLE customer (id INTEGER PRIMARY KEY);") + + seen: list[str] = [] + + def factory(schema, system, user): + seen.append(user) + return FileFindings(findings=[SectionFinding(section_id="entities", finding="Routed to LLM.")]) + + provider = mock_provider_factory(json_factory=factory) + stats = extract_repo( + layout=layout, + provider=provider, + files=[Path("schema.sql")], + repo_root=tmp_path, + use_specialized_extractors=False, + ) + + assert seen, "LLM should have been called when specialized extractors are disabled" + assert stats.specialized_files == 0 + notes = read_notes(layout, "entities") + assert any("Routed to LLM." in n["finding"] for n in notes) diff --git a/tests/test_repograph.py b/tests/test_repograph.py index b97d20e..1affa68 100644 --- a/tests/test_repograph.py +++ b/tests/test_repograph.py @@ -83,3 +83,53 @@ def test_build_graph_skips_unreadable_files(tmp_path: Path): node = graph.get("ghost.py") assert node is not None assert node.imports == () + + +def test_build_graph_python_relative_imports(tmp_path: Path): + """`from .b import x` resolves to a sibling within the same package. + + Without this the regex skips the leading-dot form entirely and + intra-package edges silently disappear from the graph, so per-file + neighbor context for Python codebases is incomplete. + """ + pkg = tmp_path / "pkg" + pkg.mkdir() + (pkg / "__init__.py").write_text("") + (pkg / "a.py").write_text("from .b import thing\nfrom . 
import helpers\n") + (pkg / "b.py").write_text("def thing(): return 1\n") + (pkg / "helpers.py").write_text("VALUE = 1\n") + + files = [ + Path("pkg/__init__.py"), + Path("pkg/a.py"), + Path("pkg/b.py"), + Path("pkg/helpers.py"), + ] + graph = build_graph(repo_root=tmp_path, files=files) + + a_node = graph.get("pkg/a.py") + assert a_node is not None + assert "pkg/b.py" in a_node.imports + assert "pkg/helpers.py" in a_node.imports + + +def test_build_graph_python_double_dot_relative_import(tmp_path: Path): + """`from ..sibling import x` walks one level up before resolving.""" + sub = tmp_path / "pkg" / "sub" + sub.mkdir(parents=True) + (tmp_path / "pkg" / "__init__.py").write_text("") + (sub / "__init__.py").write_text("") + (sub / "leaf.py").write_text("from ..sibling import thing\n") + (tmp_path / "pkg" / "sibling.py").write_text("def thing(): return 1\n") + + files = [ + Path("pkg/__init__.py"), + Path("pkg/sibling.py"), + Path("pkg/sub/__init__.py"), + Path("pkg/sub/leaf.py"), + ] + graph = build_graph(repo_root=tmp_path, files=files) + + leaf = graph.get("pkg/sub/leaf.py") + assert leaf is not None + assert "pkg/sibling.py" in leaf.imports diff --git a/tests/test_report.py b/tests/test_report.py index 325938b..1cf8188 100644 --- a/tests/test_report.py +++ b/tests/test_report.py @@ -71,3 +71,29 @@ def test_build_report_marks_unpopulated_sections(tmp_path: Path): save(layout, WalkCache()) report = build_report(layout=layout, provider=None, score=False) assert any(entry.is_empty for entry in report.sections) + + +def test_build_report_uses_notes_when_cache_is_empty(tmp_path: Path): + """`wikifi report` after `walk --no-cache` must still report coverage. + + Coverage was previously derived from the cache only; with caching + disabled or the cache deleted, every walk reported `0%` even though + notes and section bodies were present on disk. Pulling + ``files_with_findings`` from the JSONL notes restores accuracy. 
+ """ + layout = _layout(tmp_path) + # No cache written — emulates `walk --no-cache` or a manual cache wipe. + append_note(layout, "entities", {"file": "src/order.py", "summary": "x", "finding": "Order"}) + append_note(layout, "entities", {"file": "src/customer.py", "summary": "y", "finding": "Customer"}) + append_note(layout, "capabilities", {"file": "src/order.py", "summary": "x", "finding": "Place order"}) + write_section(layout, "entities", "Body for entities.") + + report = build_report(layout=layout, provider=None, score=False) + + # Two distinct files contributed — coverage reflects them, not 0. + assert report.coverage.files_with_findings == 2 + assert report.coverage.files_total >= 2 + assert report.coverage.coverage_pct() > 0 + # Per-section counts still come from the notes themselves. + assert report.coverage.findings_per_section["entities"] == 2 + assert report.coverage.findings_per_section["capabilities"] == 1 diff --git a/tests/test_specialized.py b/tests/test_specialized.py index 4e064f1..ab28adb 100644 --- a/tests/test_specialized.py +++ b/tests/test_specialized.py @@ -3,7 +3,7 @@ from __future__ import annotations from wikifi.repograph import FileKind -from wikifi.specialized import select +from wikifi.specialized.dispatch import select from wikifi.specialized.graphql import extract as gql_extract from wikifi.specialized.openapi import extract as openapi_extract from wikifi.specialized.protobuf import extract as proto_extract @@ -15,8 +15,18 @@ def test_select_routes_known_kinds_to_extractors(): assert select(FileKind.PROTOBUF) is proto_extract assert select(FileKind.GRAPHQL) is gql_extract assert select(FileKind.OPENAPI) is openapi_extract - # Migrations route to a SQL variant. - assert select(FileKind.MIGRATION).__name__ == "extract_migration" + # SQL-shaped migrations route to the SQL migration variant. 
+ sql_mig = select(FileKind.MIGRATION, rel_path="db/migrations/0042_orders.sql") + assert sql_mig is not None + assert sql_mig.__name__ == "extract_migration" + # Python / JS / Ruby migrations stay on the LLM path — the SQL + # parser would silently produce empty findings on real code. + assert select(FileKind.MIGRATION, rel_path="alembic/versions/0001_init.py") is None + assert select(FileKind.MIGRATION, rel_path="db/migrate/20260501_add_users.rb") is None + assert select(FileKind.MIGRATION, rel_path="db/migrations/001-add-users.js") is None + # Without a rel_path the dispatcher can't tell SQL from non-SQL — + # err on the safe side and return ``None``. + assert select(FileKind.MIGRATION) is None assert select(FileKind.APPLICATION_CODE) is None assert select(FileKind.OTHER) is None @@ -184,3 +194,103 @@ def test_graphql_extracts_types_and_roots(): assert "capabilities" in sections cap = next(f for f in result.findings if f.section_id == "capabilities") assert "Query" in cap.finding or "Mutation" in cap.finding + + +def test_graphql_extract_handles_extend_type_query(tmp_path): + """`extend type Query` blocks contribute to the capabilities section. + + Modular GraphQL schemas split root types across files; if the + extractor only matched bare `type Query { ... }` declarations, + capabilities would silently disappear for any schema composed from + multiple files. + """ + sdl = """ + type Order { + id: ID! + } + + extend type Query { + orderById(id: ID!): Order + } + + extend type Mutation { + cancelOrder(id: ID!): Boolean! + } + """ + result = gql_extract("schema.graphql", sdl) + capabilities = [f for f in result.findings if f.section_id == "capabilities"] + assert any("orderById" in f.finding for f in capabilities) + assert any("cancelOrder" in f.finding for f in capabilities) + + +def test_graphql_block_after_handles_indented_closing_brace(): + """`_block_after` must stop on indented `}` lines, not just column-0 ones. 
+ + Many SDL formatters indent the closing brace; the previous + column-0-only check would let the scan run into subsequent type + declarations, polluting the root field list with unrelated fields. + """ + sdl = """ + type Query { + orderById(id: ID!): Order + listOrders: [Order!]! + } + + type SecretOps { + shouldNotAppear: String! + } + """ + result = gql_extract("schema.graphql", sdl) + capabilities = next(f for f in result.findings if f.section_id == "capabilities") + assert "orderById" in capabilities.finding + assert "listOrders" in capabilities.finding + assert "shouldNotAppear" not in capabilities.finding + + +def test_proto_scopes_rpcs_to_owning_service(): + """Multiple `service` blocks: each owns only its own RPCs. + + The previous scope ("every RPC at or after my line") attributed + every later service's RPCs to the first service, inflating the + integration inventory whenever a proto file declared more than one. + """ + text = """ + service AccountsService { + rpc CreateAccount (CreateAccountRequest) returns (Account); + } + + service BillingService { + rpc ChargeAccount (ChargeRequest) returns (Receipt); + rpc Refund (RefundRequest) returns (Receipt); + } + """ + result = proto_extract("svc.proto", text) + integrations = {f.finding.split("\n", 1)[0]: f.finding for f in result.findings if f.section_id == "integrations"} + accounts_finding = next(v for k, v in integrations.items() if "AccountsService" in k) + billing_finding = next(v for k, v in integrations.items() if "BillingService" in k) + + assert "CreateAccount" in accounts_finding + assert "ChargeAccount" not in accounts_finding + assert "Refund" not in accounts_finding + + assert "ChargeAccount" in billing_finding + assert "Refund" in billing_finding + assert "CreateAccount" not in billing_finding + + +def test_sql_migration_with_only_alter_counts_altered_tables(): + """An ALTER-only migration reports its altered targets, not 0 tables. 
+ + Prior to the fix the summary counted only CREATE TABLE matches, so + a migration that only ALTERs existing tables was reported as + "Migration touches 0 table(s)" even though it had real targets. + """ + from wikifi.specialized.sql import extract_migration + + text = """ + ALTER TABLE orders ADD COLUMN refund_status TEXT; + ALTER TABLE customers ADD COLUMN tier TEXT; + """ + result = extract_migration("backend/migrations/0042_alter.sql", text) + assert "0 table" not in result.summary + assert "2 table" in result.summary diff --git a/wikifi/cache.py b/wikifi/cache.py index fecc97b..b611cfc 100644 --- a/wikifi/cache.py +++ b/wikifi/cache.py @@ -263,9 +263,13 @@ def _load_json(path: Path) -> dict[str, Any] | None: def hash_section_notes(notes: list[dict[str, Any]]) -> str: """Stable digest of a section's note payload for aggregation cache keys. - The hash spans only the *content* fields the aggregator actually reads - (file ref, summary, finding) — not timestamps or per-walk debug fields — - so regenerating identical notes on a fresh walk reuses the cached body. + The hash spans the *content* fields the aggregator and renderer + actually rely on — file ref, summary, finding text, and the + structured ``sources`` list (file/lines/fingerprint per source). + Including ``sources`` is what keeps citation freshness honest: + when a referenced file's lines move or its fingerprint changes, + the cache misses and we re-aggregate against the new evidence + instead of replaying stale citations. 
""" from wikifi.fingerprint import hash_text @@ -274,7 +278,44 @@ def hash_section_notes(notes: list[dict[str, Any]]) -> str: "file": n.get("file", ""), "summary": n.get("summary", ""), "finding": n.get("finding", ""), + "sources": _normalize_sources(n.get("sources")), } for n in notes ] return hash_text(json.dumps(payload, ensure_ascii=False, sort_keys=True)) + + +def _normalize_sources(sources: Any) -> list[dict[str, Any]]: + """Render the ``sources`` list into a stable dict shape for hashing. + + Notes vary in how ``sources`` is stored — a list of dicts from the + JSONL store, a list of Pydantic models from in-memory paths, or + missing entirely on legacy notes. Coerce each entry to the same + ``{file, lines, fingerprint}`` shape so the hash is stable across + code paths. + """ + if not sources: + return [] + out: list[dict[str, Any]] = [] + for src in sources: + if isinstance(src, dict): + file = src.get("file", "") + lines = src.get("lines") + fingerprint = src.get("fingerprint", "") + else: + file = getattr(src, "file", "") + lines = getattr(src, "lines", None) + fingerprint = getattr(src, "fingerprint", "") + # Tuples and lists both serialize the same in JSON, but coerce + # to a list so two notes with identical (start, end) ranges + # produce identical bytes regardless of representation. + normalized_lines: list[int] | None + if lines is None: + normalized_lines = None + else: + try: + normalized_lines = [int(lines[0]), int(lines[1])] + except (TypeError, ValueError, IndexError): + normalized_lines = None + out.append({"file": file, "lines": normalized_lines, "fingerprint": fingerprint or ""}) + return out diff --git a/wikifi/config.py b/wikifi/config.py index 99b7a07..67f7f1f 100644 --- a/wikifi/config.py +++ b/wikifi/config.py @@ -106,8 +106,15 @@ class Settings(BaseSettings): description=("Explicit Anthropic API key. 
Falls back to ANTHROPIC_API_KEY in the environment when unset."), ) anthropic_max_tokens: int = Field( - default=16_000, - description="Per-call output token cap for the Anthropic provider.", + default=32_000, + description=( + "Per-call output token cap for the Anthropic provider. " + "Adaptive thinking at ``effort=high`` can consume substantial " + "output budget; 32K leaves comfortable headroom for the wiki " + "section schemas while staying under the SDK's non-streaming " + "HTTP timeout guard. Premium-effort callers (xhigh/max) " + "should bump higher and enable streaming." + ), ) # ----- OpenAI provider knobs ----- diff --git a/wikifi/evidence.py b/wikifi/evidence.py index af3c9ee..66d6eb8 100644 --- a/wikifi/evidence.py +++ b/wikifi/evidence.py @@ -98,10 +98,30 @@ def render_section_body(bundle: EvidenceBundle) -> str: The body is appended with a "Sources" footer enumerating every distinct source ref across claims and contradictions, plus an explicit "Conflicts in source" section if any contradictions were surfaced. + + When the bundle carries supported claims, each claim's footnote + markers (``[1]``, ``[2]``…) are appended to the body — either next + to the matching sentence (when the claim text appears verbatim in + the body) or as a "Supporting claims" list when the body is a + paraphrase. Without this the reader has the source list at the + bottom but no way to tell which sentence each source backs up. 
""" + sources = _enumerate_sources(bundle) + source_index_for: dict[str, int] = {entry.ref.render(): entry.index for entry in sources} + parts: list[str] = [] - if bundle.body.strip(): - parts.append(bundle.body.strip()) + body_with_markers = _annotate_body_with_markers(bundle, source_index_for) + if body_with_markers.strip(): + parts.append(body_with_markers.strip()) + + unmatched_claims = [c for c in bundle.claims if c.sources and not _claim_text_in_body(c, bundle.body)] + if unmatched_claims: + parts.append("") + parts.append("## Supporting claims") + for claim in unmatched_claims: + markers = _markers_for(claim.sources, source_index_for) + suffix = f" {markers}" if markers else "" + parts.append(f"- {claim.text.strip()}{suffix}") if bundle.contradictions: parts.append("") @@ -117,7 +137,6 @@ def render_section_body(bundle: EvidenceBundle) -> str: refs = _format_refs(position.sources) parts.append(f" - {position.text.strip()} {refs}".rstrip()) - sources = _enumerate_sources(bundle) if sources: parts.append("") parts.append("## Sources") @@ -150,6 +169,56 @@ def _enumerate_sources(bundle: EvidenceBundle) -> list[_Numbered]: return list(seen.values()) +def _markers_for(refs: list[SourceRef], source_index_for: dict[str, int]) -> str: + """Return the bracketed footnote markers for a list of source refs.""" + indices: list[int] = [] + seen: set[int] = set() + for ref in refs: + idx = source_index_for.get(ref.render()) + if idx is not None and idx not in seen: + seen.add(idx) + indices.append(idx) + if not indices: + return "" + return "".join(f"[{i}]" for i in indices) + + +def _claim_text_in_body(claim: Claim, body: str) -> bool: + """True when the claim's exact text appears in the body, modulo whitespace.""" + needle = " ".join(claim.text.split()) + haystack = " ".join(body.split()) + return bool(needle) and needle in haystack + + +def _annotate_body_with_markers(bundle: EvidenceBundle, source_index_for: dict[str, int]) -> str: + """Append claim-level markers next 
to matching sentences in the body. + + Conservative substring match: only annotate when the claim's text + appears verbatim in the body. If the aggregator paraphrased, the + claim falls through to the "Supporting claims" list rather than + getting attached to the wrong sentence. + """ + if not bundle.body or not bundle.claims: + return bundle.body + annotated = bundle.body + for claim in bundle.claims: + if not claim.sources: + continue + # ``_claim_text_in_body`` is the gate that decides "match" vs. + # "paraphrase"; reusing it here keeps the two passes in agreement. + # The replacement below is an exact-substring match, so a claim that + # matches only after whitespace normalization is left unannotated. + if not _claim_text_in_body(claim, annotated): + continue + markers = _markers_for(claim.sources, source_index_for) + if not markers: + continue + text = claim.text.strip() + if text and text in annotated and (text + markers) not in annotated: + annotated = annotated.replace(text, text + markers, 1) + return annotated + + def coalesce_refs(refs: list[SourceRef]) -> list[SourceRef]: """Deduplicate refs by rendered form, preserving first-seen order.""" seen: dict[str, SourceRef] = {} diff --git a/wikifi/extractor.py b/wikifi/extractor.py index 8f769d0..c375a05 100644 --- a/wikifi/extractor.py +++ b/wikifi/extractor.py @@ -37,7 +37,7 @@ from wikifi.providers.base import LLMProvider from wikifi.repograph import FileKind, RepoGraph, classify from wikifi.sections import PRIMARY_SECTION_IDS, PRIMARY_SECTIONS -from wikifi.specialized import select as select_specialized +from wikifi.specialized.dispatch import select as select_specialized from wikifi.wiki import WikiLayout, append_note log = logging.getLogger("wikifi.extractor") @@ -134,6 +134,7 @@ def extract_repo( cache: WalkCache | None = None, graph: RepoGraph | None = None, persist_cache: Callable[[], None] | None = None, + use_specialized_extractors: bool = True, ) -> ExtractionStats: """Walk the supplied files and append per-section findings to the notes 
store. @@ -185,7 +186,7 @@ def extract_repo( continue # ---- specialized routing ---- - specialized_fn = select_specialized(kind) + specialized_fn = select_specialized(kind, rel_path=rel.as_posix()) if use_specialized_extractors else None if specialized_fn is not None: stats.specialized_files += 1 try: diff --git a/wikifi/orchestrator.py b/wikifi/orchestrator.py index da9864b..bddb94f 100644 --- a/wikifi/orchestrator.py +++ b/wikifi/orchestrator.py @@ -130,6 +130,7 @@ def _persist() -> None: cache=cache, graph=graph, persist_cache=_persist if cache is not None else None, + use_specialized_extractors=settings.use_specialized_extractors, ) log.info("stage 3: aggregating primary sections") diff --git a/wikifi/providers/anthropic_provider.py b/wikifi/providers/anthropic_provider.py index 8241e07..134d5cc 100644 --- a/wikifi/providers/anthropic_provider.py +++ b/wikifi/providers/anthropic_provider.py @@ -37,7 +37,7 @@ from pydantic import BaseModel -from wikifi.providers.base import ChatMessage +from wikifi.providers.base import ChatMessage, LLMProvider try: # the dep is declared in pyproject.toml, but importing lazily yields # a clearer error if a user installs without extras. @@ -58,17 +58,21 @@ # `.wikifi/config.toml`. DEFAULT_MODEL = "claude-opus-4-7" -# Default per-call max output tokens. Wikifi's structured findings are -# small relative to the input; 16K is comfortable headroom for any of -# the section schemas without crossing the SDK's non-streaming HTTP -# timeout guard. -DEFAULT_MAX_TOKENS = 16_000 +# Default per-call max output tokens. Adaptive thinking at ``effort=high`` +# can consume substantial output budget on its own; if ``max_tokens`` is too +# tight, the model burns its allowance on the thinking trace and the +# structured-output block comes back empty (``parsed_output is None`` and +# no text content). 32K leaves comfortable headroom for any of the wiki +# section schemas while staying under the SDK's non-streaming HTTP timeout +# guard. 
Premium-effort callers ("xhigh"/"max") should bump higher and +# enable streaming — see Anthropic's Opus 4.7 migration notes. +DEFAULT_MAX_TOKENS = 32_000 ThinkLevel = bool | str | None -class AnthropicProvider: +class AnthropicProvider(LLMProvider): """Hosted-Claude implementation of the wikifi provider protocol.""" name = "anthropic" @@ -119,20 +123,27 @@ def complete_json(self, *, system: str, user: str, schema: type[T]) -> T: **self._thinking_kwargs(), ) except anthropic.APIError as exc: - raise RuntimeError(_format_api_error(exc)) from exc + raise RuntimeError(self.format_api_error(self.name, exc)) from exc parsed = getattr(response, "parsed_output", None) - if parsed is None: - # Defensive: if the model refused or the SDK couldn't parse, - # fall back to schema-validating the response text. This - # keeps the protocol's ``raise on failure`` contract intact - # rather than returning a None. - text = _first_text(response) + if parsed is not None: + return parsed # type: ignore[return-value] + # The SDK couldn't parse a structured instance. Try the raw text + # block (covers refusals where the model emitted text rather than + # the structured form) before raising. + text = _first_text(response) + if text: try: return schema.model_validate_json(text) except Exception as exc: # pragma: no cover - defensive path - raise RuntimeError(f"anthropic provider: empty parsed_output and parse fallback failed: {exc}") from exc - return parsed # type: ignore[return-value] + raise RuntimeError( + f"anthropic provider: parsed_output missing and JSON validation of response text failed: {exc}" + ) from exc + # No parsed output and no text — typically the thinking trace + # consumed the entire output budget. Surface stop_reason and + # usage so the caller knows whether to raise ``max_tokens``, + # lower ``effort``, or look at a refusal. 
+ raise RuntimeError(_empty_response_message(response, self.max_tokens)) def complete_text(self, *, system: str, user: str) -> str: """Return the model's free-text response.""" @@ -145,7 +156,7 @@ def complete_text(self, *, system: str, user: str) -> str: **self._thinking_kwargs(), ) except anthropic.APIError as exc: - raise RuntimeError(_format_api_error(exc)) from exc + raise RuntimeError(self.format_api_error(self.name, exc)) from exc return _first_text(response) or "" def chat(self, *, system: str, messages: list[ChatMessage]) -> str: @@ -162,7 +173,7 @@ def chat(self, *, system: str, messages: list[ChatMessage]) -> str: **self._thinking_kwargs(), ) except anthropic.APIError as exc: - raise RuntimeError(_format_api_error(exc)) from exc + raise RuntimeError(self.format_api_error(self.name, exc)) from exc return _first_text(response) or "" # ------------------------------------------------------------------ @@ -226,10 +237,25 @@ def _first_text(response: Any) -> str: return "" -def _format_api_error(exc: Exception) -> str: - """Render an APIError with the request id, when present, for diagnostics.""" - request_id = getattr(exc, "request_id", None) - msg = getattr(exc, "message", None) or str(exc) - if request_id: - return f"anthropic provider failed ({request_id}): {msg}" - return f"anthropic provider failed: {msg}" +def _empty_response_message(response: Any, max_tokens: int) -> str: + """Diagnose an empty structured response with stop_reason + usage. + + The dominant cause is adaptive thinking consuming the entire + ``max_tokens`` budget before the structured output block is + produced. Surface the operational knobs (``max_tokens``, + ``effort``) so the caller sees the fix at the failure site. 
+ """ + stop_reason = getattr(response, "stop_reason", None) + usage = getattr(response, "usage", None) + output_tokens = getattr(usage, "output_tokens", None) if usage is not None else None + parts = [ + "anthropic provider: empty structured response (no parsed_output, no text block)", + f"stop_reason={stop_reason!r}", + f"output_tokens={output_tokens}", + f"max_tokens={max_tokens}", + ] + if stop_reason == "max_tokens" or (output_tokens is not None and output_tokens >= max_tokens): + parts.append("hint: thinking likely consumed the budget — raise max_tokens or lower think/effort") + elif stop_reason == "refusal": + parts.append("hint: model refused; the input may need rewording") + return " | ".join(parts) diff --git a/wikifi/providers/base.py b/wikifi/providers/base.py index a99fbdd..cec0740 100644 --- a/wikifi/providers/base.py +++ b/wikifi/providers/base.py @@ -1,4 +1,4 @@ -"""LLM provider protocol. +"""LLM provider abstract base class. Wikifi calls a provider in three modes: @@ -11,13 +11,16 @@ message list. Used by the ``wikifi chat`` REPL where conversation history carries between turns. -The protocol is deliberately minimal so swapping providers (Ollama → hosted -APIs → mock) is a one-class change. +The base class is deliberately minimal so swapping providers (Ollama → hosted +APIs → mock) is a one-class change. Concrete subclasses inherit nominally so +``isinstance(p, LLMProvider)`` works and ``ABC`` enforces the three call +surfaces at construction time. """ from __future__ import annotations -from typing import Protocol, TypedDict, TypeVar +from abc import ABC, abstractmethod +from typing import TypedDict, TypeVar from pydantic import BaseModel @@ -29,18 +32,38 @@ class ChatMessage(TypedDict): content: str -class LLMProvider(Protocol): +class LLMProvider(ABC): + """Nominal base class every backend implements. + + Subclasses set the class-level ``name`` (provider id) and assign + ``self.model`` in ``__init__``. 
The three abstract methods are the + full contract — wikifi never calls anything else on a provider. + """ + name: str model: str + @abstractmethod def complete_json(self, *, system: str, user: str, schema: type[T]) -> T: """Return an instance of ``schema`` populated by the model.""" - ... + @abstractmethod def complete_text(self, *, system: str, user: str) -> str: """Return the model's text response verbatim.""" - ... + @abstractmethod def chat(self, *, system: str, messages: list[ChatMessage]) -> str: """Run a multi-turn exchange and return the assistant's next reply.""" - ... + + @staticmethod + def format_api_error(provider_name: str, exc: Exception) -> str: + """Render a vendor APIError with the request id, when present. + + Shared by hosted providers (Anthropic, OpenAI) so the diagnostic + format is consistent across backends. + """ + request_id = getattr(exc, "request_id", None) + msg = getattr(exc, "message", None) or str(exc) + if request_id: + return f"{provider_name} provider failed ({request_id}): {msg}" + return f"{provider_name} provider failed: {msg}" diff --git a/wikifi/providers/ollama_provider.py b/wikifi/providers/ollama_provider.py index 1c85ca9..b52f591 100644 --- a/wikifi/providers/ollama_provider.py +++ b/wikifi/providers/ollama_provider.py @@ -36,14 +36,14 @@ from ollama import Client from pydantic import BaseModel -from wikifi.providers.base import ChatMessage +from wikifi.providers.base import ChatMessage, LLMProvider T = TypeVar("T", bound=BaseModel) ThinkLevel = bool | str | None -class OllamaProvider: +class OllamaProvider(LLMProvider): name = "ollama" def __init__( diff --git a/wikifi/providers/openai_provider.py b/wikifi/providers/openai_provider.py index de2de05..68b3ea6 100644 --- a/wikifi/providers/openai_provider.py +++ b/wikifi/providers/openai_provider.py @@ -33,7 +33,7 @@ from pydantic import BaseModel -from wikifi.providers.base import ChatMessage +from wikifi.providers.base import ChatMessage, LLMProvider try: import openai 
@@ -70,7 +70,7 @@ ThinkLevel = bool | str | None -class OpenAIProvider: +class OpenAIProvider(LLMProvider): """Hosted-OpenAI implementation of the wikifi provider protocol.""" name = "openai" @@ -126,7 +126,7 @@ def complete_json(self, *, system: str, user: str, schema: type[T]) -> T: **self._reasoning_kwargs(), ) except openai.APIError as exc: - raise RuntimeError(_format_api_error(exc)) from exc + raise RuntimeError(self.format_api_error(self.name, exc)) from exc parsed = _first_parsed(response) if parsed is None: @@ -154,7 +154,7 @@ def complete_text(self, *, system: str, user: str) -> str: **self._reasoning_kwargs(), ) except openai.APIError as exc: - raise RuntimeError(_format_api_error(exc)) from exc + raise RuntimeError(self.format_api_error(self.name, exc)) from exc return _first_text(response) or "" def chat(self, *, system: str, messages: list[ChatMessage]) -> str: @@ -168,7 +168,7 @@ def chat(self, *, system: str, messages: list[ChatMessage]) -> str: **self._reasoning_kwargs(), ) except openai.APIError as exc: - raise RuntimeError(_format_api_error(exc)) from exc + raise RuntimeError(self.format_api_error(self.name, exc)) from exc return _first_text(response) or "" # ------------------------------------------------------------------ @@ -230,12 +230,3 @@ def _first_text(response: Any) -> str: if content is None and isinstance(message, dict): content = message.get("content") return content or "" - - -def _format_api_error(exc: Exception) -> str: - """Render an APIError with the request id, when present, for diagnostics.""" - request_id = getattr(exc, "request_id", None) - msg = getattr(exc, "message", None) or str(exc) - if request_id: - return f"openai provider failed ({request_id}): {msg}" - return f"openai provider failed: {msg}" diff --git a/wikifi/repograph.py b/wikifi/repograph.py index 6fa8bb5..7301c93 100644 --- a/wikifi/repograph.py +++ b/wikifi/repograph.py @@ -192,9 +192,18 @@ def __contains__(self, rel_path: str) -> bool: # pragma: no cover - 
convenience # Per-language import patterns. Each pattern captures the imported module # path/identifier; resolution to a real file is handled by a separate -# heuristic. +# heuristic. The Python pattern allows leading dots so relative imports +# (``from .foo import bar`` / ``from .. import baz``) survive the scan — +# without that, intra-package edges silently disappear from the graph. +# A second pattern (``_PY_FROM_DOT_IMPORT``) handles ``from . import X``, +# where the regex above only captures the bare dot prefix and would lose +# the ``X`` symbol that names the actual sibling module. _PY_IMPORT = re.compile( - r"^\s*(?:from\s+([A-Za-z_][\w.]*)\s+import|import\s+([A-Za-z_][\w.]*))", + r"^\s*(?:from\s+(\.+[\w.]*|[A-Za-z_][\w.]*)\s+import|import\s+([A-Za-z_][\w.]*))", + re.MULTILINE, +) +_PY_FROM_DOT_IMPORT = re.compile( + r"^\s*from\s+(\.+)\s+import\s+([\w*][\w, \t]*)", re.MULTILINE, ) _JS_IMPORT = re.compile( @@ -306,6 +315,16 @@ def _resolve_imports( if suffix == ".py": for match in _PY_IMPORT.finditer(text): raw_targets.append(match.group(1) or match.group(2)) + # ``from . import a, b`` adds an edge to each *named* sibling + # rather than to the package's ``__init__.py``. The base regex + # above only captures the dot prefix, so we expand the symbol + # list here and synthesize one ``.symbol`` raw target per name. + for match in _PY_FROM_DOT_IMPORT.finditer(text): + dots = match.group(1) + for symbol in match.group(2).split(","): + symbol = symbol.strip() + if symbol and symbol != "*" and symbol.isidentifier(): + raw_targets.append(f"{dots}{symbol}") elif suffix in {".js", ".jsx", ".ts", ".tsx", ".mjs", ".cjs"}: for match in _JS_IMPORT.finditer(text): raw_targets.append(next((g for g in match.groups() if g), "")) @@ -345,15 +364,24 @@ def _candidates_for( file_set: set[str], modules: dict[str, list[str]], ) -> list[str]: - # Relative imports (``./foo``, ``../bar``) — resolve within the repo. 
- # Path.resolve() would expand against the CWD; we want the result - # relative to the repo root so it can match file_set entries. - if raw.startswith((".", "/")): + # Python relative imports (``from .foo import bar``, ``from .. import baz``) + # use leading dots, NOT path-style ``./`` or ``../``. JS/TS relative + # imports use the path style and are handled below. Treat the two + # syntaxes separately so a ``.foo`` from Python doesn't get joined as + # ``parent/.foo`` (a hidden-file path that won't match any module). + if source.suffix.lower() == ".py" and raw.startswith("."): + return _python_relative_candidates(raw, source=source, file_set=file_set) + + # Path-style relative imports (``./foo``, ``../bar``) and absolute + # paths — resolve within the repo. Path.resolve() would expand + # against the CWD; we want the result relative to the repo root so + # it can match file_set entries. + if raw.startswith(("./", "../", "/")): target = source.parent / raw normalized = _normalize_relative(target) return [p for p in _try_path_variants(normalized) if p in file_set] - # Strip leading dots from Python relative-from imports + # Strip leading dots from any other dotted form (defensive). stripped = raw.lstrip(".") matches = modules.get(stripped, []) matches += modules.get(stripped.split(".")[-1], []) @@ -368,6 +396,28 @@ def _candidates_for( return out +def _python_relative_candidates(raw: str, *, source: Path, file_set: set[str]) -> list[str]: + """Resolve a Python ``from .foo`` style import against the repo. + + Each leading dot pops one level from the source's package directory: + a single dot is the package itself, two dots is the parent package, + and so on. Whatever follows is a dotted module path inside the + resolved package (``a.b`` → ``a/b``), which we attempt with the + standard ``.py`` and ``__init__.py`` variants. 
+ """ + leading = len(raw) - len(raw.lstrip(".")) + remainder = raw[leading:] + # ``source.parent`` is the directory the source file lives in, + # which corresponds to the *current* package (one dot's worth). + base = source.parent + for _ in range(leading - 1): + if not base.parts: + return [] + base = base.parent + target = base / Path(*remainder.split(".")) if remainder else base + return [p for p in _try_path_variants(target) if p in file_set] + + def _normalize_relative(path: Path) -> Path: """Collapse ``..`` / ``.`` segments without touching the filesystem. diff --git a/wikifi/report.py b/wikifi/report.py index 4942f4f..fdc83a4 100644 --- a/wikifi/report.py +++ b/wikifi/report.py @@ -89,13 +89,25 @@ def build_report( is run through the critic for a quality score. Without that, the report is purely structural — useful in CI without an LLM. """ - files_total, files_with_findings = _coverage_from_cache(layout) findings_per_section: dict[str, int] = {} files_per_section: dict[str, int] = {} + contributing_files: set[str] = set() for section in PRIMARY_SECTIONS: notes = read_notes(layout, section) findings_per_section[section.id] = len(notes) - files_per_section[section.id] = len({n.get("file") for n in notes if n.get("file")}) + section_files = {n.get("file") for n in notes if n.get("file")} + files_per_section[section.id] = len(section_files) + contributing_files.update(f for f in section_files if isinstance(f, str)) + + # Coverage is derived from the on-disk notes first so a walk run with + # ``--no-cache`` (or one whose cache was deleted) still reports + # accurate counts. When notes are present they're authoritative; we + # only fall back to the cache when no notes have been written yet. 
+ if contributing_files: + files_with_findings = len(contributing_files) + files_total = max(files_with_findings, _files_total_from_cache(layout)) + else: + files_total, files_with_findings = _coverage_from_cache(layout) coverage = CoverageStats( files_total=files_total, @@ -145,6 +157,12 @@ def _coverage_from_cache(layout: WikiLayout) -> tuple[int, int]: return files_total, files_with_findings +def _files_total_from_cache(layout: WikiLayout) -> int: + """Return the cache's seen-files count if available; ``0`` otherwise.""" + cache: WalkCache = load(layout) + return len(cache.extraction) + + def _collect_upstream(layout: WikiLayout, section: Section) -> dict[str, str]: bodies: dict[str, str] = {} for upstream_id in section.derived_from: diff --git a/wikifi/specialized/__init__.py b/wikifi/specialized/__init__.py index d911a91..3382b5d 100644 --- a/wikifi/specialized/__init__.py +++ b/wikifi/specialized/__init__.py @@ -1,58 +1,12 @@ """Type-aware extractors for high-signal source artifacts. -Schema files, IDLs, OpenAPI specs, and migrations carry the system's -contracts in machine-readable form. Running them through the same prose -LLM extractor as application code is wasteful and lossy: the structure -is already there, the extractor just has to read it. - Each module in this package implements one or more parsers that consume -the file's text and emit a list of structured findings, in the same -``{section_id, finding, sources}`` shape the LLM extractor produces. -That keeps the downstream aggregator interface unchanged — the -specialized path is a drop-in replacement for the LLM call when the -file kind is recognized. - -Extractor selection lives in :func:`select` below. +a file's text and emit structured findings in the same shape the LLM +extractor produces. 
Import from the concrete module — never from this +``__init__.py`` — per the project's no-re-exports rule: + +- :mod:`wikifi.specialized.models` — finding/result dataclasses +- :mod:`wikifi.specialized.dispatch` — :func:`select` for kind → extractor +- :mod:`wikifi.specialized.sql` / ``openapi`` / ``protobuf`` / ``graphql`` — + the per-format extractors """ - -from __future__ import annotations - -import logging -from collections.abc import Callable -from dataclasses import dataclass, field - -from wikifi.evidence import SourceRef -from wikifi.repograph import FileKind - -log = logging.getLogger("wikifi.specialized") - - -@dataclass -class SpecializedFinding: - section_id: str - finding: str - sources: list[SourceRef] = field(default_factory=list) - - -@dataclass -class SpecializedResult: - findings: list[SpecializedFinding] = field(default_factory=list) - summary: str = "" - - -# Each extractor takes ``(rel_path, text)`` and returns a SpecializedResult. -ExtractorFn = Callable[[str, str], SpecializedResult] - - -def select(kind: FileKind) -> ExtractorFn | None: - """Return the specialized extractor for a file kind, or ``None``.""" - from wikifi.specialized import graphql, openapi, protobuf, sql - - table: dict[FileKind, ExtractorFn] = { - FileKind.SQL: sql.extract, - FileKind.MIGRATION: sql.extract_migration, - FileKind.OPENAPI: openapi.extract, - FileKind.PROTOBUF: protobuf.extract, - FileKind.GRAPHQL: graphql.extract, - } - return table.get(kind) diff --git a/wikifi/specialized/dispatch.py b/wikifi/specialized/dispatch.py new file mode 100644 index 0000000..f23cb8f --- /dev/null +++ b/wikifi/specialized/dispatch.py @@ -0,0 +1,62 @@ +"""Dispatch a :class:`FileKind` to its specialized extractor. + +Schema files, IDLs, OpenAPI specs, and migrations carry the system's +contracts in machine-readable form. Running them through the same prose +LLM extractor as application code is wasteful and lossy: the structure +is already there, the extractor just has to read it. 
+ +Selection respects the file's *path* — not just its kind — so a Python +Alembic/Django migration is not silently routed through the SQL parser. +The classifier upstream (``wikifi.repograph.classify``) tags every file +under a migrations directory as :attr:`FileKind.MIGRATION`; this layer +narrows that to the SQL-shaped subset (``.sql`` / ``.ddl``) and returns +``None`` for the rest, letting them fall through to the LLM path. +""" + +from __future__ import annotations + +import logging +from pathlib import PurePosixPath + +from wikifi.repograph import FileKind +from wikifi.specialized.models import ExtractorFn + +log = logging.getLogger("wikifi.specialized") + + +# Suffixes that the SQL extractor can actually read. Anything else +# tagged :attr:`FileKind.MIGRATION` (e.g. an Alembic ``.py`` script, +# a Django ``0001_initial.py``, a Knex ``.js`` migration) keeps its +# logic in code, not DDL — those belong on the LLM extraction path. +_SQL_MIGRATION_SUFFIXES: frozenset[str] = frozenset({".sql", ".ddl"}) + + +def select(kind: FileKind, *, rel_path: str | None = None) -> ExtractorFn | None: + """Return the specialized extractor for a file, or ``None``. + + ``rel_path`` is required for :attr:`FileKind.MIGRATION` because the + classifier marks any file inside a migrations directory as a + migration, including non-SQL ones. Without the path, we can't tell + a SQL migration from an Alembic Python script. + """ + # Imports are lazy so this module stays cheap to load and so the + # extractors can import freely from ``wikifi.specialized.models`` + # without a circular ``__init__`` dependency. 
+ from wikifi.specialized import graphql, openapi, protobuf, sql + + if kind is FileKind.SQL: + return sql.extract + if kind is FileKind.OPENAPI: + return openapi.extract + if kind is FileKind.PROTOBUF: + return protobuf.extract + if kind is FileKind.GRAPHQL: + return graphql.extract + if kind is FileKind.MIGRATION: + if rel_path is None: + return None + suffix = PurePosixPath(rel_path).suffix.lower() + if suffix in _SQL_MIGRATION_SUFFIXES: + return sql.extract_migration + return None + return None diff --git a/wikifi/specialized/graphql.py b/wikifi/specialized/graphql.py index c972bcc..d6d4f48 100644 --- a/wikifi/specialized/graphql.py +++ b/wikifi/specialized/graphql.py @@ -3,6 +3,11 @@ Pulls types, inputs, queries, mutations, and subscriptions. Maps them to ``entities`` (types/inputs) and ``capabilities`` + ``integrations`` (query/mutation/subscription roots). + +Modular GraphQL schemas often split root types across files using +``extend type Query`` / ``extend type Mutation``; we treat those exactly +like the base declaration so capabilities don't disappear when a schema +is composed from many files. """ from __future__ import annotations @@ -10,9 +15,12 @@ import re from wikifi.evidence import SourceRef -from wikifi.specialized import SpecializedFinding, SpecializedResult +from wikifi.specialized.models import SpecializedFinding, SpecializedResult _TYPE_RE = re.compile(r"^\s*type\s+(\w+)\s*(?:implements\s+[^\{]+)?\{", re.MULTILINE) +# ``extend type Query { ... }`` is the standard way to add fields to a +# root from a separate SDL file; treat it as a same-named root. 
+_EXTEND_TYPE_RE = re.compile(r"^\s*extend\s+type\s+(\w+)\s*(?:implements\s+[^\{]+)?\{", re.MULTILINE) _INPUT_RE = re.compile(r"^\s*input\s+(\w+)\s*\{", re.MULTILINE) _INTERFACE_RE = re.compile(r"^\s*interface\s+(\w+)\s*\{", re.MULTILINE) _ENUM_RE = re.compile(r"^\s*enum\s+(\w+)\s*\{", re.MULTILINE) @@ -23,13 +31,22 @@ def extract(rel_path: str, text: str) -> SpecializedResult: findings: list[SpecializedFinding] = [] summary_bits: list[str] = [] - types = [(m.group(1), _line(text, m.start())) for m in _TYPE_RE.finditer(text)] - inputs = [(m.group(1), _line(text, m.start())) for m in _INPUT_RE.finditer(text)] - interfaces = [(m.group(1), _line(text, m.start())) for m in _INTERFACE_RE.finditer(text)] - enums = [(m.group(1), _line(text, m.start())) for m in _ENUM_RE.finditer(text)] - - domain_types = [t for t in types if t[0] not in {"Query", "Mutation", "Subscription"}] - root_types = [t for t in types if t[0] in {"Query", "Mutation", "Subscription"}] + # Anchor line numbers on the captured *name* offset, not the match + # start. The leading ``^\s*`` in each pattern can consume the + # preceding newline (``\s`` is newline-aware by default), which + # would otherwise put the line number one above the actual + # declaration and confuse :func:`_block_after`. + types = [(m.group(1), _line(text, m.start(1))) for m in _TYPE_RE.finditer(text)] + extensions = [(m.group(1), _line(text, m.start(1))) for m in _EXTEND_TYPE_RE.finditer(text)] + inputs = [(m.group(1), _line(text, m.start(1))) for m in _INPUT_RE.finditer(text)] + interfaces = [(m.group(1), _line(text, m.start(1))) for m in _INTERFACE_RE.finditer(text)] + enums = [(m.group(1), _line(text, m.start(1))) for m in _ENUM_RE.finditer(text)] + + root_names = {"Query", "Mutation", "Subscription"} + domain_types = [t for t in types if t[0] not in root_names] + # Root declarations come from both ``type Query { ... }`` and + # ``extend type Query { ... }`` forms. 
+ root_types = [t for t in types if t[0] in root_names] + [t for t in extensions if t[0] in root_names] if domain_types: summary_bits.append(f"{len(domain_types)} type(s)") @@ -79,7 +96,9 @@ def extract(rel_path: str, text: str) -> SpecializedResult: if root_types: # Pull each root's fields by scanning the snippet between its - # declaration line and the next ``}``. + # declaration line and the matching closing brace. Multiple + # root declarations (the file may contain ``extend type Query`` + # blocks) get one finding each. for name, line in root_types: block = _block_after(text, line) fields = _SCHEMA_FIELD_RE.findall(block) @@ -92,7 +111,13 @@ def extract(rel_path: str, text: str) -> SpecializedResult: sources=[SourceRef(file=rel_path, lines=(line, line))], ) ) - summary_bits.append(", ".join(name for name, _ in root_types) + " roots") + # Deduped name list for the summary (Query/Mutation likely repeat + # across base + extend blocks). + seen_root_names: list[str] = [] + for name, _ in root_types: + if name not in seen_root_names: + seen_root_names.append(name) + summary_bits.append(", ".join(seen_root_names) + " roots") return SpecializedResult( findings=findings, @@ -105,16 +130,33 @@ def _line(text: str, offset: int) -> int: def _block_after(text: str, line: int) -> str: - """Return the text between line ``line`` and the next top-level ``}``. + """Return the body lines between ``line`` and the matching ``}``. - Approximate: enough to read field declarations for a GraphQL root - type. Matches ``}`` that appears at column 0. + Walks the source brace-depth-aware so an indented closing brace + (`` }``) ends the block — many SDL formatters indent the closing + brace, and a column-0-only check would consume every type that + follows. 
""" lines = text.splitlines() - start = max(0, line - 1) + start_index = max(0, line - 1) out: list[str] = [] - for ln in lines[start:]: - if ln.startswith("}"): + depth = 0 + started = False + for ln in lines[start_index:]: + opens = ln.count("{") + closes = ln.count("}") + if not started: + # The declaration line carries the opening ``{``; record it + # but don't emit the declaration itself as a body line. + depth += opens - closes + started = True + if depth <= 0: + # ``type X {}`` on a single line — empty body. + break + continue + if closes and depth - closes <= 0: + # The line that closes the block — stop before consuming it. break + depth += opens - closes out.append(ln) return "\n".join(out) diff --git a/wikifi/specialized/models.py b/wikifi/specialized/models.py new file mode 100644 index 0000000..ae3b208 --- /dev/null +++ b/wikifi/specialized/models.py @@ -0,0 +1,30 @@ +"""Result types emitted by specialized extractors. + +Specialized extractors short-circuit the LLM on schema/IDL files — +their output flows into the same notes store the LLM extractor writes +to, so the dispatch contract is just ``(rel_path, text) -> SpecializedResult``. +""" + +from __future__ import annotations + +from collections.abc import Callable +from dataclasses import dataclass, field + +from wikifi.evidence import SourceRef + + +@dataclass +class SpecializedFinding: + section_id: str + finding: str + sources: list[SourceRef] = field(default_factory=list) + + +@dataclass +class SpecializedResult: + findings: list[SpecializedFinding] = field(default_factory=list) + summary: str = "" + + +# Each extractor takes ``(rel_path, text)`` and returns a SpecializedResult. 
+ExtractorFn = Callable[[str, str], SpecializedResult] diff --git a/wikifi/specialized/openapi.py b/wikifi/specialized/openapi.py index db407d3..f2c13fc 100644 --- a/wikifi/specialized/openapi.py +++ b/wikifi/specialized/openapi.py @@ -17,7 +17,7 @@ from typing import Any from wikifi.evidence import SourceRef -from wikifi.specialized import SpecializedFinding, SpecializedResult +from wikifi.specialized.models import SpecializedFinding, SpecializedResult log = logging.getLogger("wikifi.specialized.openapi") diff --git a/wikifi/specialized/protobuf.py b/wikifi/specialized/protobuf.py index b4c864b..1829833 100644 --- a/wikifi/specialized/protobuf.py +++ b/wikifi/specialized/protobuf.py @@ -11,7 +11,7 @@ import re from wikifi.evidence import SourceRef -from wikifi.specialized import SpecializedFinding, SpecializedResult +from wikifi.specialized.models import SpecializedFinding, SpecializedResult _MESSAGE_RE = re.compile(r"^\s*message\s+(\w+)\s*\{", re.MULTILINE) _SERVICE_RE = re.compile(r"^\s*service\s+(\w+)\s*\{", re.MULTILINE) @@ -62,8 +62,14 @@ def extract(rel_path: str, text: str) -> SpecializedResult: ) ) - for service_name, line in services: - related = [r for r in rpcs if line <= r[5]] + # Each service owns the RPCs declared between its opening ``{`` and + # the matching ``}``. The previous "every RPC at or after my line" + # filter would attribute every later service's RPCs to the first + # service block in a multi-service file, inflating the integration + # inventory. Bound each service by its block-end line instead. 
+ service_spans = _service_spans(text, services) + for (service_name, start_line), (_, end_line) in zip(services, service_spans, strict=True): + related = [r for r in rpcs if start_line <= r[5] <= end_line] bullets = "\n".join( f" - `{name}({_arrow(in_msg, in_stream)}) -> {_arrow(out_msg, out_stream)}`" for name, in_msg, out_msg, in_stream, out_stream, _ in related[:25] @@ -75,7 +81,7 @@ def extract(rel_path: str, text: str) -> SpecializedResult: f"Service **{service_name}** exposes the following RPCs:\n" + (bullets if bullets else " - (no RPCs detected)") ), - sources=[SourceRef(file=rel_path, lines=(line, line))], + sources=[SourceRef(file=rel_path, lines=(start_line, end_line))], ) ) if services: @@ -104,3 +110,32 @@ def _arrow(name: str, stream: bool) -> str: def _line(text: str, offset: int) -> int: return text.count("\n", 0, offset) + 1 + + +def _service_spans(text: str, services: list[tuple[str, int]]) -> list[tuple[str, int]]: + """For each (service_name, start_line) return (service_name, end_line). + + ``end_line`` is the line carrying the brace that closes the service + block, found by walking forward and counting brace depth so nested + blocks (e.g. an RPC's ``{ option ... }`` body) don't terminate the scan. + If the closing brace is missing the span runs to EOF.
+ """ + lines = text.splitlines() + spans: list[tuple[str, int]] = [] + last_line = len(lines) + for name, start_line in services: + depth = 0 + started = False + end_line = last_line + for i in range(start_line - 1, last_line): + ln = lines[i] + opens = ln.count("{") + closes = ln.count("}") + depth += opens - closes + if not started and opens: + started = True + if started and depth <= 0: + end_line = i + 1 + break + spans.append((name, end_line)) + return spans diff --git a/wikifi/specialized/sql.py b/wikifi/specialized/sql.py index c3b6926..810c22a 100644 --- a/wikifi/specialized/sql.py +++ b/wikifi/specialized/sql.py @@ -17,7 +17,7 @@ from dataclasses import dataclass, field from wikifi.evidence import SourceRef -from wikifi.specialized import SpecializedFinding, SpecializedResult +from wikifi.specialized.models import SpecializedFinding, SpecializedResult # Line-number tracking is precise to "the line containing the matched # keyword" — that's specific enough for citations and avoids the cost @@ -65,6 +65,7 @@ def extract_migration(rel_path: str, text: str) -> SpecializedResult: def _extract(rel_path: str, text: str, *, migration: bool) -> SpecializedResult: findings: list[SpecializedFinding] = [] tables: list[_TableHit] = [] + altered_tables: set[str] = set() for match in _CREATE_TABLE_RE.finditer(text): name = _strip_ident(match.group(1)) @@ -112,6 +113,7 @@ def _extract(rel_path: str, text: str, *, migration: bool) -> SpecializedResult: line = _line_of(text, match.start()) target = _strip_ident(match.group(1)) action = match.group(2).strip() + altered_tables.add(target) prefix = "Migration alters" if migration else "Alters" findings.append( SpecializedFinding( @@ -136,7 +138,15 @@ def _extract(rel_path: str, text: str, *, migration: bool) -> SpecializedResult: ) ) - summary = f"Migration touches {len(tables)} table(s)." if migration else f"Schema for {len(tables)} table(s)." 
+ if migration: + # Count both newly-created tables AND tables targeted by ALTER — + # a migration that only ALTERs still touches its targets, and + # a "0 table(s)" summary on an ALTER-only file misled callers + # browsing the report. + touched = len({hit.name for hit in tables} | altered_tables) + summary = f"Migration touches {touched} table(s)." + else: + summary = f"Schema for {len(tables)} table(s)." return SpecializedResult(findings=findings, summary=summary) From c6ea760ebceb0700b9e2fcd6529292cf68de2225 Mon Sep 17 00:00:00 2001 From: Dallas Pool Date: Fri, 1 May 2026 22:07:42 -0500 Subject: [PATCH 5/9] feat(extractor): emit per-file INFO log during walk MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit A `wikifi walk` against a large target was opaque between the "stage 2: extracting" header and the next stage banner — users had no visibility into which file the walker was on or whether it had hung. Add an INFO line per file so the live walk logs read like: - extracting: ./src/billing/orders.py - extracting: ./src/billing/refunds.py - extracting: ./src/main.py The CLI already configures ``logging.basicConfig(level=INFO, …)`` so this surfaces by default; no flag flip required. The line is emitted once per file regardless of route (cache hit, specialized parser, or LLM call) so cache-replay re-walks remain auditable too.
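A minimal sketch of the per-file log line described above, for illustration only. The logger name `wikifi.extractor.sketch` and the helpers `walk_files` and `demo` are hypothetical and are not part of the patch; the only piece taken from the change itself is the `"- extracting: ./%s"` format with `rel.as_posix()`.

```python
import io
import logging
from pathlib import PurePosixPath

# Hypothetical logger name, chosen so the sketch does not touch real loggers.
log = logging.getLogger("wikifi.extractor.sketch")


def walk_files(files):
    """Emit one INFO line per file, before any routing decision is made."""
    seen = 0
    for rel in files:
        seen += 1
        # Same message shape as the patch: lazy %-formatting, POSIX-style path.
        log.info("- extracting: ./%s", rel.as_posix())
    return seen


def demo():
    """Capture the emitted lines in memory so they can be inspected."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setFormatter(logging.Formatter("%(message)s"))
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.propagate = False  # keep the sketch's output out of the root logger
    try:
        walk_files(
            [PurePosixPath("src/billing/orders.py"), PurePosixPath("src/main.py")]
        )
    finally:
        log.removeHandler(handler)
    return buffer.getvalue().splitlines()
```

Because the line is logged at the top of the loop, it appears whether the file is later served from cache, routed to a specialized parser, or sent to the LLM.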
--- wikifi/extractor.py | 1 + 1 file changed, 1 insertion(+) diff --git a/wikifi/extractor.py b/wikifi/extractor.py index c375a05..08adc8c 100644 --- a/wikifi/extractor.py +++ b/wikifi/extractor.py @@ -156,6 +156,7 @@ def extract_repo( for rel in files: stats.files_seen += 1 + log.info("- extracting: ./%s", rel.as_posix()) full = repo_root / rel try: data = full.read_text(encoding="utf-8", errors="replace") From f1f51b4205c3edb5f67fe634e4cd78c6631352e2 Mon Sep 17 00:00:00 2001 From: Dallas Pool Date: Fri, 1 May 2026 22:11:22 -0500 Subject: [PATCH 6/9] fix(wiki): include `.cache/` in `.wikifi/.gitignore` and backfill on init MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Addresses Copilot review comment on wikifi/cache.py:21. The premium cache layer writes to `.wikifi/.cache/`, but the generated `.wikifi/.gitignore` only ignored `.notes/`, so every walk left unignored cache files in the target repo — exactly the noise the wiki contract promises to avoid. Changes - Hoist `CACHE_DIRNAME` from `cache.py` to `wiki.py` (next to `NOTES_DIRNAME` and `WIKI_DIRNAME`) so the layout has one source of truth and the gitignore template can reference it without inverting the existing `cache → wiki` import direction. `cache.py` re-exports the name for backwards compatibility. - `WikiLayout.cache_dir` property added; `cache.cache_dir(layout)` delegates to it. - `DEFAULT_GITIGNORE` now lists both `.notes/` and `.cache/` from a single `_GITIGNORE_REQUIRED_ENTRIES` tuple so future additions flow through automatically. - `initialize()` now calls `_ensure_gitignore()` which: - writes the full template on a fresh init, AND - backfills any missing required entries into a pre-existing `.gitignore` (the legacy ".notes/-only" case from wikis created before the cache layer landed). - preserves user-added lines verbatim — only appends what's missing. 
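The backfill behaviour listed above can be sketched roughly as follows. This is an assumption-laden illustration, not the body of `_ensure_gitignore()` in `wikifi/wiki.py`: the names `REQUIRED_ENTRIES`, `ensure_gitignore`, and `demo` are hypothetical, and the real template may contain more than the two entries named in this message.

```python
import tempfile
from pathlib import Path

# Assumed contents of the required-entries tuple (only the two entries
# named in the commit message).
REQUIRED_ENTRIES = (".notes/", ".cache/")


def ensure_gitignore(path: Path, required=REQUIRED_ENTRIES) -> list[str]:
    """Create or backfill a .gitignore; return the entries that were added."""
    if not path.exists():
        # Fresh init: write every required entry.
        path.write_text("\n".join(required) + "\n", encoding="utf-8")
        return list(required)
    text = path.read_text(encoding="utf-8")
    present = {line.strip() for line in text.splitlines()}
    missing = [entry for entry in required if entry not in present]
    if missing:
        # Existing lines are kept verbatim; only missing entries are appended.
        if text and not text.endswith("\n"):
            text += "\n"
        path.write_text(text + "\n".join(missing) + "\n", encoding="utf-8")
    return missing


def demo() -> tuple[list[str], list[str], str]:
    """Run the backfill twice against a legacy '.notes/-only' file."""
    with tempfile.TemporaryDirectory() as tmp:
        gitignore = Path(tmp) / ".gitignore"
        gitignore.write_text(".notes/\nmy-scratch/\n", encoding="utf-8")
        first = ensure_gitignore(gitignore)
        second = ensure_gitignore(gitignore)  # re-run must not duplicate
        return first, second, gitignore.read_text(encoding="utf-8")
```

Running the helper twice is the interesting case: the second call finds nothing missing, so the file is left untouched and no duplicate entries appear.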
Tests - 183 tests pass (was 180); 3 new regression tests cover fresh init, legacy-gitignore backfill (no duplicates on re-run), and preservation of user-authored extra entries. - `wikifi/wiki.py` now at 100% coverage. --- .wikifi/.cache/aggregation.json | 2 +- .wikifi/.cache/extraction.json | 2149 +++++++++++++++++++------------ .wikifi/domains.md | 55 +- tests/test_wiki.py | 73 ++ wikifi/cache.py | 25 +- wikifi/wiki.py | 52 +- 6 files changed, 1476 insertions(+), 880 deletions(-) diff --git a/.wikifi/.cache/aggregation.json b/.wikifi/.cache/aggregation.json index e678b78..b597b5c 100644 --- a/.wikifi/.cache/aggregation.json +++ b/.wikifi/.cache/aggregation.json @@ -1,6 +1,6 @@ { "version": 1, - "saved_at": "2026-05-02T02:17:19.876759+00:00", + "saved_at": "2026-05-02T03:10:48.125020+00:00", "entries": { "domains": { "notes_hash": "4040897a09cc", diff --git a/.wikifi/.cache/extraction.json b/.wikifi/.cache/extraction.json index 67b0d06..6013d81 100644 --- a/.wikifi/.cache/extraction.json +++ b/.wikifi/.cache/extraction.json @@ -1,6 +1,6 @@ { "version": 1, - "saved_at": "2026-05-02T02:17:19.874633+00:00", + "saved_at": "2026-05-02T03:10:48.123655+00:00", "entries": { ".env.example": { "fingerprint": "2e493dbd2d87", @@ -344,315 +344,273 @@ "findings": [] }, "README.md": { - "fingerprint": "996c401d036d", - "summary": "Top-level README describing wikifi's purpose, CLI surface, architecture, and configuration as a codebase analysis tool that produces technology-agnostic domain and feature wikis.", + "fingerprint": "369c47fa5d27", + "summary": "Top-level README describing wikifi's purpose, CLI surface, architecture, and technology choices as a codebase-analysis tool that produces technology-agnostic wiki content.", "chunks_processed": 1, "findings": [ { "section_id": "intent", - "finding": "wikifi exists to walk an arbitrary codebase and produce a technology-agnostic extraction of its features, domains, and delivered value — so that a new modern implementation can be 
built that fully retains the functionality and value the original system provides to its users.", + "finding": "wikifi exists to walk an existing codebase and produce a technology-agnostic extraction of its features, domains, and delivered value — the output is designed to guide re-implementation in a modern stack while preserving what the original system actually does for its users. The tool treats understanding intent and capabilities as a first-class problem, separate from any specific language or framework.", "sources": [ { "file": "README.md", "lines": [ - 3, - 3 - ], - "fingerprint": "996c401d036d" - } - ] - }, - { - "section_id": "domains", - "finding": "The core domain is codebase knowledge extraction: ingesting source files, classifying them, extracting domain findings per file, and synthesising those findings into structured wiki sections. Subdomains include repository introspection, static import-graph analysis, LLM-backed extraction, section synthesis, quality critique, and coverage reporting.", - "sources": [ - { - "file": "README.md", - "lines": [ - 28, - 52 - ], - "fingerprint": "996c401d036d" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system can initialise a wiki workspace for a target project, walk the target codebase to extract per-file domain findings, synthesise primary wiki sections from accumulated findings, and then derive higher-level artefacts (personas, user stories, architecture diagrams) from the primary content.", - "sources": [ - { - "file": "README.md", - "lines": [ - 14, - 24 + 1, + 5 ], - "fingerprint": "996c401d036d" + "fingerprint": "369c47fa5d27" } ] }, { "section_id": "capabilities", - "finding": "Users can force a clean re-walk, run an opt-in quality critique and revision loop on derivative sections, and override the configured LLM provider at invocation time.", + "finding": "The system provides a one-time project setup command that scaffolds a local configuration directory, and a primary walk command 
that traverses a target codebase and produces structured wiki content organized into primary capture sections (domains, entities, capabilities, etc.) and derivative sections (personas, user stories, diagrams).", "sources": [ { "file": "README.md", "lines": [ - 16, - 20 + 20, + 27 ], - "fingerprint": "996c401d036d" + "fingerprint": "369c47fa5d27" } ] }, { "section_id": "capabilities", - "finding": "A coverage and quality report shows per-section file counts, finding counts, body sizes, and optionally critic-derived quality scores for every populated section.", + "finding": "An interactive query capability lets practitioners ask natural-language questions against the extracted wiki content, optionally injected with context from the target codebase; a REPL-style chat mode supports iterative exploration of both the wiki and the source.", "sources": [ { "file": "README.md", "lines": [ - 21, - 23 + 28, + 29 ], - "fingerprint": "996c401d036d" + "fingerprint": "369c47fa5d27" } ] }, { "section_id": "capabilities", - "finding": "Natural-language querying and an interactive chat REPL allow iterative exploration of the extracted wiki content alongside the target codebase.", + "finding": "A coverage and quality reporting command surfaces per-section statistics (contributing file count, finding count, body size) and can invoke an automated quality scorer to assign 0–10 scores to each populated section.", "sources": [ { "file": "README.md", "lines": [ - 24, - 25 + 26, + 27 ], - "fingerprint": "996c401d036d" + "fingerprint": "369c47fa5d27" } ] }, { "section_id": "capabilities", - "finding": "Schema files (SQL, OpenAPI, Protobuf, GraphQL, migrations) are processed through deterministic parsers rather than an LLM, producing the same structured findings as LLM extraction without consuming model tokens.", - "sources": [ - { - "file": "README.md", - "lines": [ - 34, - 36 - ], - "fingerprint": "996c401d036d" - } - ] - }, - { - "section_id": "external_dependencies", - "finding": "A local 
Ollama server is the default LLM runtime, used for all per-file extraction and synthesis calls; the default model is a thinking-capable model run at the highest available reasoning level.", + "finding": "An opt-in critic-and-reviser loop runs a quality pass on derivative sections: it scores each section's body against its brief and upstream evidence, flags unsupported claims, and re-synthesizes the section when quality falls below threshold — accepting the revision only if it scores at least as well as the original.", "sources": [ { "file": "README.md", "lines": [ - 56, - 57 + 52, + 53 ], - "fingerprint": "996c401d036d" + "fingerprint": "369c47fa5d27" } ] }, { - "section_id": "external_dependencies", - "finding": "Anthropic's hosted API is an opt-in LLM backend; it uses prompt caching with ephemeral cache-control markers on the system prompt so the large extraction prompt is paid for only once across hundreds of per-file calls.", + "section_id": "domains", + "finding": "The core domain is codebase knowledge extraction: the system understands a target repository's structure (manifests, layout, import graphs, file kinds) and distills that understanding into a structured, technology-agnostic wiki. 
Subdomains include repository introspection, per-file extraction, section synthesis, and quality assurance of generated content.", "sources": [ { "file": "README.md", "lines": [ - 48, - 49 + 32, + 55 ], - "fingerprint": "996c401d036d" + "fingerprint": "369c47fa5d27" } ] }, { - "section_id": "external_dependencies", - "finding": "OpenAI's hosted API is an opt-in LLM backend; it relies on automatic prefix caching and routes a 'think' knob to reasoning-effort on reasoning-capable models.", + "section_id": "domains", + "finding": "A secondary domain is provider abstraction: the system decouples the extraction intelligence from any specific AI backend, allowing local and hosted inference providers to be swapped without changing the extraction pipeline.", "sources": [ { "file": "README.md", "lines": [ - 50, - 51 + 57, + 63 ], - "fingerprint": "996c401d036d" + "fingerprint": "369c47fa5d27" } ] }, { - "section_id": "external_dependencies", - "finding": "GitHub Actions provides the continuous integration pipeline.", + "section_id": "entities", + "finding": "A SourceRef captures the provenance of each extracted finding: it records the file path, line range, and a content fingerprint, enabling downstream citation and traceability back to source.", "sources": [ { "file": "README.md", "lines": [ - 61, - 61 + 43, + 44 ], - "fingerprint": "996c401d036d" + "fingerprint": "369c47fa5d27" } ] }, { - "section_id": "integrations", - "finding": "wikifi is distributed as a library installed into a target project and invoked as a CLI from that project's root; it reads the target's source tree and writes its output into a `.wikifi/` directory within that project.", + "section_id": "entities", + "finding": "An EvidenceBundle is the output of section aggregation: it contains a synthesized body, a set of supported claims, and any contradictions detected across per-file findings. 
The renderer uses it to thread numbered citations and a conflicts block into the final section markdown.", "sources": [ { "file": "README.md", "lines": [ - 8, - 12 + 48, + 50 ], - "fingerprint": "996c401d036d" + "fingerprint": "369c47fa5d27" } ] }, { - "section_id": "integrations", - "finding": "The LLM backend is reached through a provider abstraction; Ollama, Anthropic, and OpenAI backends slot in without changing the rest of the pipeline, selectable via an environment variable or a per-invocation flag.", + "section_id": "entities", + "finding": "FileKind classifies each in-scope file as one of: application code, SQL, OpenAPI, Protobuf, GraphQL, migration, or other. This classification determines whether the file is routed through the LLM extractor or a deterministic specialized parser.", "sources": [ { "file": "README.md", "lines": [ - 46, - 51 + 35, + 37 ], - "fingerprint": "996c401d036d" + "fingerprint": "369c47fa5d27" } ] }, { "section_id": "cross_cutting", - "finding": "Extraction results are content-addressed: each file's findings are keyed by the combination of its relative path and the SHA-256 hash of its bytes, so re-walks skip unchanged files automatically and the walk is resumable after a crash.", + "finding": "All extraction findings are stored in a content-addressed cache keyed by the tuple (relative file path, SHA-256 of file bytes); aggregation bodies are keyed by a hash of the section's notes payload. 
This design provides free resumability after a crash and allows re-walks to skip files whose content has not changed.", "sources": [ { "file": "README.md", "lines": [ - 40, - 43 + 45, + 47 ], - "fingerprint": "996c401d036d" + "fingerprint": "369c47fa5d27" } ] }, { "section_id": "cross_cutting", - "finding": "Aggregation bodies are keyed by a hash of the section's notes payload, so synthesis is also skipped when inputs have not changed.", + "finding": "Input filtering is applied before any file reaches the extraction agent: stub files, empty fixtures, and machine-generated artifacts are recognized and skipped; size bounds on raw and stripped content are enforced via configuration so oversized or unstructured files never stall the walk.", "sources": [ { "file": "README.md", "lines": [ - 40, - 43 + 47, + 48 ], - "fingerprint": "996c401d036d" + "fingerprint": "369c47fa5d27" } ] }, { "section_id": "cross_cutting", - "finding": "Input filtering discards unstructured, near-empty, or machine-generated files before they reach the LLM; an invariant is that empty input must never stall a walk.", + "finding": "The Anthropic backend uses prompt caching (ephemeral cache control on the system prompt) so the large extraction prompt is paid for only once across potentially hundreds of per-file calls, reducing both latency and cost at scale.", "sources": [ { "file": "README.md", "lines": [ - 44, - 46 + 60, + 61 ], - "fingerprint": "996c401d036d" + "fingerprint": "369c47fa5d27" } ] }, { - "section_id": "entities", - "finding": "A per-file finding captures a single file's contribution to one wiki section and carries a structured source reference (file path, line range, content fingerprint) for downstream citation.", + "section_id": "external_dependencies", + "finding": "A local Ollama inference server is the default AI backend; it is expected to host a thinking-capable model (referenced as Qwen 3 27B) at the highest available reasoning level, and its endpoint is configurable via 
environment variable.", "sources": [ { "file": "README.md", "lines": [ - 37, - 39 + 68, + 70 ], - "fingerprint": "996c401d036d" + "fingerprint": "369c47fa5d27" } ] }, { - "section_id": "entities", - "finding": "An EvidenceBundle is the output of section synthesis: it contains the section body, the supporting claims, and any contradictions found across source findings; the renderer uses it to thread numbered citations and a conflicts block into the final markdown.", + "section_id": "external_dependencies", + "finding": "Anthropic's hosted API is an opt-in inference backend, selected via environment variable; the integration takes advantage of Anthropic's prompt-caching feature to amortize the cost of a large system prompt across many calls.", "sources": [ { "file": "README.md", "lines": [ - 46, - 48 + 60, + 61 ], - "fingerprint": "996c401d036d" + "fingerprint": "369c47fa5d27" } ] }, { - "section_id": "entities", - "finding": "FileKind classifies each in-scope file as one of: application code, SQL, OpenAPI, Protobuf, GraphQL, migration, or other; this classification drives routing to LLM extraction or a deterministic parser.", + "section_id": "external_dependencies", + "finding": "OpenAI's hosted API is a second opt-in inference backend; the integration routes a configurable reasoning-effort knob to OpenAI's reasoning_effort parameter for compatible models, and relies on OpenAI's automatic prefix caching rather than explicit cache markers.", "sources": [ { "file": "README.md", "lines": [ - 31, - 33 + 62, + 63 ], - "fingerprint": "996c401d036d" + "fingerprint": "369c47fa5d27" } ] }, { "section_id": "hard_specifications", - "finding": "The critic-reviser loop must only accept a revised section if its quality score is at least as high as the score of the original; downgrades are rejected.", + "finding": "The extraction cache key for per-file findings must be the tuple (relative path, SHA-256 of file bytes); the aggregation cache key must be a hash of the section's notes 
payload. These are the canonical identifiers for cache hit/miss decisions and must be preserved for cache compatibility.", "sources": [ { "file": "README.md", "lines": [ - 53, - 55 + 45, + 47 ], - "fingerprint": "996c401d036d" + "fingerprint": "369c47fa5d27" } ] }, { "section_id": "hard_specifications", - "finding": "Empty or near-empty input files must never stall the walk; the walker is required to filter them out before any LLM call is made.", + "finding": "The critic-and-reviser loop must only accept a revised section if the revision's quality score is at least as high as the original's score; accepting a lower-scoring revision is explicitly prohibited.", "sources": [ { "file": "README.md", "lines": [ - 44, - 45 + 52, + 53 ], - "fingerprint": "996c401d036d" + "fingerprint": "369c47fa5d27" } ] }, { "section_id": "hard_specifications", - "finding": "Every per-file finding must carry a structured SourceRef (file, line range, content fingerprint) to support citation in the rendered wiki.", + "finding": "The walk scope decision is made once during repository introspection and is deterministic — the agent must not re-pick or alter scope mid-walk.", "sources": [ { "file": "README.md", "lines": [ - 37, - 39 + 34, + 35 ], - "fingerprint": "996c401d036d" + "fingerprint": "369c47fa5d27" } ] } @@ -1338,231 +1296,231 @@ ] }, "wikifi/cache.py": { - "fingerprint": "1ba541fe863d", - "summary": "Implements a two-scope content-addressed cache that lets the documentation walk pipeline skip unchanged files and unchanged section aggregations, providing both speed and resumability for large codebases.", + "fingerprint": "e0a85dbf45f8", + "summary": "Content-addressed, two-scope cache that eliminates redundant extraction and aggregation work when re-processing large codebases, and provides free resumability after interrupted walks.", "chunks_processed": 1, "findings": [ { "section_id": "intent", - "finding": "The system is designed to handle very large legacy codebases (example 
cited: 50,000 files) where regenerating documentation on every run would take hours. The cache layer reduces repeat runs to processing only changed files, and provides free resumability so an interrupted walk restarts from the last completed file rather than from scratch.", + "finding": "The cache exists to make iterative re-walks of large codebases economical: without it a full re-walk of a 50 000-file monorepo would take hours; with it only files whose content has changed since the last run require fresh processing. Resumability of interrupted runs is a first-class consequence of the design, not an add-on.", "sources": [ { "file": "wikifi/cache.py", "lines": [ 1, - 21 + 20 ], - "fingerprint": "1ba541fe863d" + "fingerprint": "e0a85dbf45f8" } ] }, { "section_id": "capabilities", - "finding": "The pipeline caches per-file extraction results keyed by the combination of relative file path and a content fingerprint, so files whose bytes have not changed since the last run reuse their previous structured findings without incurring an AI extraction call.", + "finding": "The system maintains two independent caches: a per-file extraction cache that skips LLM calls for any file whose byte content is unchanged, and a per-section aggregation cache that skips the aggregation step whenever the full set of notes for a section is bit-identical to the previous run.", "sources": [ { "file": "wikifi/cache.py", "lines": [ - 5, - 8 + 4, + 13 ], - "fingerprint": "1ba541fe863d" + "fingerprint": "e0a85dbf45f8" } ] }, { "section_id": "capabilities", - "finding": "The pipeline also caches per-section aggregation results keyed by a stable digest of the section's notes payload. 
If all contributing file findings are identical to the previous run, the cached rendered section body is reused without re-invoking the aggregator.", + "finding": "Interrupted processing runs can be resumed automatically: because each file's result is persisted as soon as it is produced, a walk that crashes part-way through restores all previously completed files from cache and continues from the point of failure.", "sources": [ { "file": "wikifi/cache.py", "lines": [ - 9, - 12 + 15, + 18 ], - "fingerprint": "1ba541fe863d" + "fingerprint": "e0a85dbf45f8" } ] }, { "section_id": "capabilities", - "finding": "Interrupted walks are automatically resumable: because per-file results are persisted incrementally, a walk that fails partway through resumes from the last unprocessed file on the next invocation.", + "finding": "The cache can be pruned to remove entries for files no longer in scope, reset entirely (e.g. via a `--no-cache` flag), and reports hit and miss counters for both scopes to aid observability.", "sources": [ { "file": "wikifi/cache.py", "lines": [ - 14, - 18 + 109, + 122 ], - "fingerprint": "1ba541fe863d" + "fingerprint": "e0a85dbf45f8" } ] }, { - "section_id": "capabilities", - "finding": "Stale cache entries for files that are no longer in scope can be pruned, and the entire cache can be reset (e.g., via a `--no-cache` flag) to force a full fresh walk.", + "section_id": "entities", + "finding": "A `CachedFindings` record holds a file's content fingerprint, the structured list of findings produced by the extractor, a one-sentence summary, and the count of chunks processed. 
It is keyed by relative file path.", "sources": [ { "file": "wikifi/cache.py", "lines": [ - 105, - 113 + 44, + 50 ], - "fingerprint": "1ba541fe863d" + "fingerprint": "e0a85dbf45f8" } ] }, { - "section_id": "cross_cutting", - "finding": "Cache writes are performed atomically by writing to a temporary file then renaming it into place, preventing corrupt cache state from a partial write.", + "section_id": "entities", + "finding": "A `CachedSection` record holds the hash of a section's notes payload, the rendered markdown body, and lists of claims and contradictions. It is keyed by section identifier.", "sources": [ { "file": "wikifi/cache.py", "lines": [ - 189, - 193 + 53, + 58 ], - "fingerprint": "1ba541fe863d" + "fingerprint": "e0a85dbf45f8" } ] }, { - "section_id": "cross_cutting", - "finding": "A monotonically increasing cache version number is embedded in every persisted cache file; any version mismatch causes the entire cache to be silently discarded and rebuilt, providing a controlled invalidation mechanism across software upgrades.", + "section_id": "entities", + "finding": "A `WalkCache` is the in-memory container for both caches and their hit/miss counters; it is loaded from and persisted to disk as a unit.", "sources": [ { "file": "wikifi/cache.py", "lines": [ - 38, - 38 + 61, + 73 ], - "fingerprint": "1ba541fe863d" + "fingerprint": "e0a85dbf45f8" } ] }, { "section_id": "cross_cutting", - "finding": "Malformed individual cache entries are logged as warnings and silently dropped rather than causing a hard failure, so a partially corrupt cache degrades gracefully to a fresh extraction for only the affected entries.", + "finding": "All cache files are written atomically: content is first written to a temporary file alongside the target, then renamed into place, preventing partial or corrupt cache files from being read on subsequent runs.", "sources": [ { "file": "wikifi/cache.py", "lines": [ - 196, - 222 + 192, + 196 ], - "fingerprint": "1ba541fe863d" + 
"fingerprint": "e0a85dbf45f8" } ] }, { "section_id": "cross_cutting", - "finding": "Cache files are stored under a dedicated hidden subdirectory within the wiki output directory so they inherit the same version-control ignore rules as the rest of the tool's working state, keeping generated documentation commits clean.", + "finding": "Cache files carry a version tag; any file whose version does not match the current constant is silently discarded and treated as an empty cache, allowing safe schema evolution across releases.", "sources": [ { "file": "wikifi/cache.py", "lines": [ - 19, - 21 + 199, + 204 ], - "fingerprint": "1ba541fe863d" + "fingerprint": "e0a85dbf45f8" } ] }, { - "section_id": "entities", - "finding": "A `CachedFindings` entity represents the extraction result for a single file: it holds the file's content fingerprint, the list of structured findings produced, a one-sentence summary, and a count of how many chunks were processed.", + "section_id": "cross_cutting", + "finding": "Malformed individual cache entries are dropped with a warning log rather than aborting the load, ensuring a single corrupt record does not invalidate the entire cache.", "sources": [ { "file": "wikifi/cache.py", "lines": [ - 44, - 51 + 207, + 210 ], - "fingerprint": "1ba541fe863d" + "fingerprint": "e0a85dbf45f8" } ] }, { - "section_id": "entities", - "finding": "A `CachedSection` entity represents the aggregation result for a single wiki section: it holds the hash of the notes payload that produced it, the rendered markdown body, and lists of claims and contradictions identified during aggregation.", + "section_id": "cross_cutting", + "finding": "Hit and miss counters are maintained in memory for both the extraction and aggregation scopes, enabling downstream reporting on cache efficiency.", "sources": [ { "file": "wikifi/cache.py", "lines": [ - 54, - 60 + 66, + 70 ], - "fingerprint": "1ba541fe863d" + "fingerprint": "e0a85dbf45f8" } ] }, { - "section_id": "entities", - 
"finding": "A `WalkCache` entity is the in-memory container for both caches, tracking extraction and aggregation entries alongside hit and miss counters for observability into cache effectiveness.", + "section_id": "hard_specifications", + "finding": "The aggregation cache key must include not just finding text but also the per-source tuple of (file path, line range, fingerprint). This ensures that when a referenced file's lines shift or its content changes, the cache misses and re-aggregation occurs against fresh evidence rather than replaying stale citations.", "sources": [ { "file": "wikifi/cache.py", "lines": [ - 63, - 70 + 241, + 254 ], - "fingerprint": "1ba541fe863d" + "fingerprint": "e0a85dbf45f8" } ] }, { "section_id": "hard_specifications", - "finding": "The aggregation cache key is computed only over content-bearing fields (file reference, summary, finding text) and explicitly excludes timestamps and per-walk debug fields, ensuring that regenerating identical notes on a fresh walk always produces a cache hit.", + "finding": "The cache version constant (`CACHE_VERSION = 1`) must be incremented whenever the cache schema changes, as version mismatch causes all existing entries to be unconditionally dropped.", "sources": [ { "file": "wikifi/cache.py", "lines": [ - 238, - 251 + 36, + 36 ], - "fingerprint": "1ba541fe863d" + "fingerprint": "e0a85dbf45f8" } ] }, { "section_id": "hard_specifications", - "finding": "Cache files must reside at `.wikifi/.cache/extraction.json` and `.wikifi/.cache/aggregation.json` relative to the wiki directory root.", + "finding": "Line-range values stored in source records must be normalized to a two-element integer list regardless of whether they arrive as tuples, lists, or other sequences, so that identical ranges always produce identical hash bytes across code paths.", "sources": [ { "file": "wikifi/cache.py", "lines": [ - 33, - 36 + 276, + 285 ], - "fingerprint": "1ba541fe863d" + "fingerprint": "e0a85dbf45f8" } ] }, { 
"section_id": "integrations", - "finding": "This module depends on the fingerprinting service (imported from `wikifi/fingerprint.py`) to compute content hashes used as cache keys for both extraction and aggregation scopes.", + "finding": "The cache interacts with the fingerprinting subsystem (imported from `wikifi/fingerprint.py`) to produce stable content hashes used as cache keys for both file-level and section-level entries.", "sources": [ { "file": "wikifi/cache.py", "lines": [ - 244, - 246 + 238, + 240 ], - "fingerprint": "1ba541fe863d" + "fingerprint": "e0a85dbf45f8" } ] }, { "section_id": "integrations", - "finding": "The cache is consumed by the extractor, aggregator, and orchestrator (all neighbor files) to gate whether AI calls are needed; the wiki layout structure (from `wikifi/wiki.py`) determines where cache files are persisted on disk.", + "finding": "Cache storage is rooted under the wiki layout directory (sourced from `wikifi/wiki.py`), placing cache files at `.wikifi/.cache/` so they co-locate with wiki output but remain outside committed section markdown.", "sources": [ { "file": "wikifi/cache.py", "lines": [ - 30, - 30 + 18, + 19 ], - "fingerprint": "1ba541fe863d" + "fingerprint": "e0a85dbf45f8" } ] } @@ -1863,203 +1821,161 @@ ] }, "wikifi/config.py": { - "fingerprint": "8cd2ca53c957", - "summary": "Runtime configuration module that declares all tunable settings for the wiki-generation pipeline, loaded from environment variables or a .env file.", + "fingerprint": "953e3d59fb7e", + "summary": "Defines all runtime-configurable settings for the wikifi codebase-to-wiki pipeline, including LLM provider selection, file-processing thresholds, chunking strategy, and optional pipeline stages.", "chunks_processed": 1, "findings": [ { "section_id": "intent", - "finding": "wikifi exists to generate technology-agnostic wiki documentation by walking source repositories, extracting structured findings from each file via language models, and assembling them into 
coherent wiki sections. The configuration reveals the system prioritizes wiki quality over processing speed, and is designed to handle arbitrarily large codebases through chunking and caching.", + "finding": "The system is designed to walk a codebase and produce wiki documentation using an LLM. It defaults to a locally-hosted inference server but supports hosted cloud providers as opt-in alternatives, prioritising wiki quality over processing speed.", "sources": [ { "file": "wikifi/config.py", "lines": [ 1, - 8 + 9 ], - "fingerprint": "8cd2ca53c957" + "fingerprint": "953e3d59fb7e" } ] }, { - "section_id": "capabilities", - "finding": "The pipeline can route schema files (SQL, OpenAPI, Protobuf, GraphQL, migrations) through deterministic extractors that bypass the language model entirely, providing faster and more reliable handling of structured file types.", + "section_id": "external_dependencies", + "finding": "The system integrates with three LLM inference providers: a locally-hosted Ollama server (default, no key required), the Anthropic API (opt-in, requires API key), and the OpenAI API (opt-in, supports custom base URLs for Azure or proxies). Each provider is independently configurable with its own output-token cap and authentication credential.", "sources": [ { "file": "wikifi/config.py", "lines": [ - 75, - 81 + 26, + 120 ], - "fingerprint": "8cd2ca53c957" + "fingerprint": "953e3d59fb7e" } ] }, { "section_id": "capabilities", - "finding": "The system builds an import/reference graph of the codebase and feeds each file's neighborhood into the extraction prompt, enabling context-aware extraction that understands cross-file relationships rather than treating each file in isolation.", + "finding": "The pipeline can build an import/reference graph of the target project and feed each file's neighbourhood into the extraction prompt, improving cross-file context. 
This behaviour is independently toggleable.", "sources": [ { "file": "wikifi/config.py", "lines": [ - 69, - 74 + 72, + 79 ], - "fingerprint": "8cd2ca53c957" + "fingerprint": "953e3d59fb7e" } ] }, { "section_id": "capabilities", - "finding": "A critic-and-reviser loop can be applied to derivative wiki sections (personas, user stories, diagrams), invoking a revision pass whenever a quality score falls below a configurable threshold, improving groundedness at the cost of additional processing.", + "finding": "Schema and structured-definition files (SQL, OpenAPI, Protobuf, GraphQL, migrations) are routed through deterministic extractors that bypass the LLM entirely, trading flexibility for reliability on well-structured inputs.", "sources": [ { "file": "wikifi/config.py", "lines": [ - 83, - 94 + 80, + 87 ], - "fingerprint": "8cd2ca53c957" + "fingerprint": "953e3d59fb7e" } ] }, { "section_id": "capabilities", - "finding": "Per-file extraction results and per-section aggregation results are cached across walks, allowing incremental re-runs that only reprocess changed files.", - "sources": [ - { - "file": "wikifi/config.py", - "lines": [ - 63, - 68 - ], - "fingerprint": "8cd2ca53c957" - } - ] - }, - { - "section_id": "external_dependencies", - "finding": "A locally-hosted Ollama inference server (defaulting to localhost:11434) serves as the default language model provider, with qwen3:27b as the default model.", - "sources": [ - { - "file": "wikifi/config.py", - "lines": [ - 31, - 32 - ], - "fingerprint": "8cd2ca53c957" - } - ] - }, - { - "section_id": "external_dependencies", - "finding": "Anthropic's hosted API is an opt-in language model provider, authenticated via an API key, with a configurable per-call output token cap.", - "sources": [ - { - "file": "wikifi/config.py", - "lines": [ - 100, - 106 - ], - "fingerprint": "8cd2ca53c957" - } - ] - }, - { - "section_id": "external_dependencies", - "finding": "OpenAI's API (or compatible proxies such as Azure OpenAI) is an 
opt-in language model provider, authenticated via an API key and configurable base URL, with a configurable per-call output token cap.", + "finding": "Derivative wiki sections (personas, user stories, diagrams) can optionally be passed through a critic-then-reviser quality loop; the loop is triggered only when a critic score falls below a configurable threshold, and is disabled by default to keep run time predictable.", "sources": [ { "file": "wikifi/config.py", "lines": [ - 108, - 117 + 88, + 99 ], - "fingerprint": "8cd2ca53c957" + "fingerprint": "953e3d59fb7e" } ] }, { "section_id": "cross_cutting", - "finding": "File processing enforces a hard size ceiling (default 2 MB) above which files are silently skipped as vendored or generated noise; files below the ceiling but above 150 KB are split into overlapping windows (8 KB overlap) so each language model call stays within a comfortable context budget.", + "finding": "The pipeline maintains a per-file extraction cache and a per-section aggregation cache that persist across successive walks. Both can be disabled together to force a full re-walk, giving operators a straightforward cache-invalidation escape hatch.", "sources": [ { "file": "wikifi/config.py", "lines": [ - 38, - 56 + 63, + 71 ], - "fingerprint": "8cd2ca53c957" + "fingerprint": "953e3d59fb7e" } ] }, { "section_id": "cross_cutting", - "finding": "A minimum content threshold (default 64 bytes) prevents the system from invoking the language model on essentially empty stub files, guarding against runaway reasoning on trivial inputs.", + "finding": "A per-request timeout (default 900 seconds) guards against runaway LLM calls. 
A minimum-content threshold (default 64 bytes) prevents the LLM from being invoked on near-empty stub files, avoiding token waste and reasoning loops.", "sources": [ { "file": "wikifi/config.py", "lines": [ - 56, - 59 + 33, + 49 ], - "fingerprint": "8cd2ca53c957" + "fingerprint": "953e3d59fb7e" } ] }, { - "section_id": "cross_cutting", - "finding": "All language model calls share a single per-request timeout (default 900 seconds), providing a uniform backstop against hung inference requests across all providers.", + "section_id": "hard_specifications", + "finding": "Files exceeding 2 MB are unconditionally skipped by the walker and treated as vendored or generated noise; this threshold is not applied to chunking logic.", "sources": [ { "file": "wikifi/config.py", "lines": [ - 33, - 34 + 37, + 43 ], - "fingerprint": "8cd2ca53c957" + "fingerprint": "953e3d59fb7e" } ] }, { "section_id": "hard_specifications", - "finding": "Files exceeding 2,000,000 bytes are unconditionally dropped and never read; this threshold is explicitly documented as targeting vendored or generated noise rather than real source files.", + "finding": "Large files are split into overlapping windows of 150 KB each with an 8 KB overlap between adjacent chunks; each window is sent as a separate LLM call. 
The 150 KB size is chosen to fit within a 32 K-token context window after prompt overhead.", "sources": [ { "file": "wikifi/config.py", "lines": [ - 36, - 42 + 44, + 54 ], - "fingerprint": "8cd2ca53c957" + "fingerprint": "953e3d59fb7e" } ] }, { "section_id": "hard_specifications", - "finding": "Each language model call is limited to a 150,000-byte content window, sized to fit within a 32K-context model after prompt overhead; larger files must be split into overlapping chunks rather than truncated.", + "finding": "The Anthropic provider is capped at 32 000 output tokens per call to stay within the SDK's non-streaming HTTP timeout guard; callers using maximum-effort thinking are advised to raise this cap and enable streaming. The OpenAI provider is capped at 16 000 output tokens per call.", "sources": [ { "file": "wikifi/config.py", "lines": [ - 43, - 50 + 104, + 120 ], - "fingerprint": "8cd2ca53c957" + "fingerprint": "953e3d59fb7e" } ] }, { "section_id": "hard_specifications", - "finding": "Adjacent file chunks share an 8,000-byte overlap region to preserve cross-boundary context; this overlap guarantee must be maintained when the chunking logic is modified.", + "finding": "The introspection pass receives a directory-tree snapshot limited to depth 3, bounding the context size fed to that stage.", "sources": [ { "file": "wikifi/config.py", "lines": [ - 51, - 54 + 55, + 56 ], - "fingerprint": "8cd2ca53c957" + "fingerprint": "953e3d59fb7e" } ] } @@ -2402,13 +2318,13 @@ ] }, "wikifi/evidence.py": { - "fingerprint": "dddfe1a01c85", - "summary": "Defines the core evidence model — source references, claims, contradictions, and rendering helpers — that gives every wiki assertion a traceable pointer back to the original codebase.", + "fingerprint": "9c7863e99adc", + "summary": "Defines the evidence model — source references, claims, and contradictions — that lets architects trace every sentence in the generated migration wiki back to a precise location in the source 
codebase.", "chunks_processed": 1, "findings": [ { "section_id": "intent", - "finding": "The system exists so that any architect reading the migration wiki can ask 'where in the source did this come from?' and receive a precise, verifiable answer. Every assertion in the generated wiki is backed by file paths and optional line ranges captured at extraction time.", + "finding": "The system exists so that any architect reading a generated migration wiki can ask ", "sources": [ { "file": "wikifi/evidence.py", @@ -2416,336 +2332,417 @@ 1, 18 ], - "fingerprint": "dddfe1a01c85" + "fingerprint": "9c7863e99adc" } ] }, { "section_id": "capabilities", - "finding": "The system surfaces conflicting information found across source files as explicit 'Contradiction' entries rather than silently merging them, treating disagreements as high-priority migration signals that encode tribal knowledge.", + "finding": "The system generates citation-bearing markdown narratives where every assertion is linked to the specific source-file locations that support it. Claims that cannot be matched verbatim to the narrative body are collected into a separate 'Supporting claims' list rather than silently dropped.", "sources": [ { "file": "wikifi/evidence.py", "lines": [ - 13, - 17 + 85, + 120 ], - "fingerprint": "dddfe1a01c85" + "fingerprint": "9c7863e99adc" } ] }, { "section_id": "capabilities", - "finding": "The system renders each wiki section with a 'Sources' footer enumerating every distinct source reference that backs claims in that section, and an additional 'Conflicts in source' sub-section when contradictions exist.", + "finding": "The system explicitly surfaces conflicting claims across source files as a dedicated 'Conflicts in source' section in each wiki section's output. 
Migration teams are directed to resolve these conflicts before re-implementation, treating them as high-priority signals about hidden tribal knowledge.", "sources": [ { "file": "wikifi/evidence.py", "lines": [ - 88, - 121 + 121, + 133 ], - "fingerprint": "dddfe1a01c85" + "fingerprint": "9c7863e99adc" } ] }, { "section_id": "entities", - "finding": "A SourceRef represents a single span of source code: a repo-relative file path, an optional inclusive line range, and a short content fingerprint captured at extraction time for change detection.", + "finding": "A SourceRef represents a single pointer into the source codebase, carrying a repo-relative file path, an optional inclusive line range, and a short content fingerprint captured at extraction time. It renders as `path:start-end` (or just `path` when no line range is known).", "sources": [ { "file": "wikifi/evidence.py", "lines": [ - 37, - 52 + 35, + 55 ], - "fingerprint": "dddfe1a01c85" + "fingerprint": "9c7863e99adc" } ] }, { "section_id": "entities", - "finding": "A Claim represents one assertion placed in a wiki section, carrying the markdown text and a list of SourceRefs that justify it; a claim with no sources is explicitly marked as unsupported.", + "finding": "A Claim is a single markdown assertion placed in a section's narrative, backed by zero or more SourceRefs. A claim with no sources is considered unsupported. 
A Contradiction groups two or more conflicting Claims, each with its own sources, under a one-sentence summary of the conflict.", "sources": [ { "file": "wikifi/evidence.py", "lines": [ - 55, - 67 + 57, + 80 ], - "fingerprint": "dddfe1a01c85" + "fingerprint": "9c7863e99adc" } ] }, { "section_id": "entities", - "finding": "A Contradiction groups two or more conflicting Claims about the same topic under a single summary sentence; each disagreeing position retains its own source references.", + "finding": "An EvidenceBundle is the aggregator's structured output for a single section, containing the markdown narrative body, a list of Claims, and a list of Contradictions. It is the primary handoff type between the aggregator and the renderer.", "sources": [ { "file": "wikifi/evidence.py", "lines": [ - 70, - 77 + 82, + 87 ], - "fingerprint": "dddfe1a01c85" + "fingerprint": "9c7863e99adc" } ] }, { - "section_id": "entities", - "finding": "An EvidenceBundle is the aggregator's structured output for a single wiki section, combining the narrative body text, a list of Claims, and a list of Contradictions.", + "section_id": "cross_cutting", + "finding": "Full source provenance is a non-functional invariant: every claim in the output must carry the file and optional line range that justifies it. Contradictions are never silently merged — they are always rendered explicitly so that data integrity issues visible in the source are preserved and escalated to the migration team.", "sources": [ { "file": "wikifi/evidence.py", "lines": [ - 80, - 85 + 1, + 18 ], - "fingerprint": "dddfe1a01c85" + "fingerprint": "9c7863e99adc" } ] }, { - "section_id": "cross_cutting", - "finding": "Full source traceability is a non-negotiable invariant: every sentence in every wiki section must be linkable back to the originating file and, when available, the precise line range. 
This is enforced structurally through the Claim and SourceRef types rather than by convention.", + "section_id": "hard_specifications", + "finding": "Citations must be rendered as compact footnote-style markers ([1], [2], …) with a Sources footer at the bottom of each section. Line ranges are formatted as `path/to/file:start-end`; when start equals end, as `path/to/file:line`; when unknown, as `path/to/file` alone.", "sources": [ { "file": "wikifi/evidence.py", "lines": [ - 1, - 18 + 43, + 52 ], - "fingerprint": "dddfe1a01c85" + "fingerprint": "9c7863e99adc" } ] }, { "section_id": "hard_specifications", - "finding": "Source references must be rendered in the format 'path/to/file:start-end' (or 'path/to/file:line' for a single line, or just 'path/to/file' when lines are unknown). The 'Sources' footer uses 1-based sequential numeric indices in the form '1. `path`'.", + "finding": "Any contradictions detected across source files must appear verbatim in the wiki output under a 'Conflicts in source' heading with the explicit instruction that migration teams must resolve them before re-implementation — they must not be suppressed or merged.", "sources": [ { "file": "wikifi/evidence.py", "lines": [ - 45, - 52 + 121, + 131 ], - "fingerprint": "dddfe1a01c85" + "fingerprint": "9c7863e99adc" } ] }, { - "section_id": "hard_specifications", - "finding": "Contradictions must never be silently merged into a unified narrative; they must be explicitly surfaced in a dedicated 'Conflicts in source' sub-section, with a warning that migration teams must resolve them before re-implementation.", + "section_id": "integrations", + "finding": "The evidence model types (SourceRef, Claim, Contradiction, EvidenceBundle) are consumed by the aggregator to produce section output, and by specialized extractors (GraphQL, OpenAPI, Protobuf, SQL, general models) that populate SourceRefs during the extraction pass.", "sources": [ { "file": "wikifi/evidence.py", "lines": [ - 96, - 102 + 1, + 5 ], - 
"fingerprint": "dddfe1a01c85" + "fingerprint": "9c7863e99adc" } ] } ] }, "wikifi/extractor.py": { - "fingerprint": "b0e939259557", - "summary": "Orchestrates per-file extraction of intent-bearing findings from repository source files, routing each file through caching, specialized deterministic parsing, or LLM-based chunked extraction, then appending structured findings to per-section note stores for downstream aggregation.", + "fingerprint": "67bd95fa3f07", + "summary": "Stage 2 of the wiki-generation pipeline: walks each included source file, extracts structured intent-bearing findings per wiki section via LLM or deterministic extractors, and appends them to a per-section notes store for later aggregation.", "chunks_processed": 1, "findings": [ { "section_id": "intent", - "finding": "The extractor exists to make large codebases understandable by walking every included source file and asking what intent-bearing content it contributes to each section of a technology-agnostic wiki. It is designed to handle repositories of arbitrary size — including 50,000-file legacy monorepos — by making repeated walks cheap through content-addressed caching and by keeping per-chunk LLM failures isolated so partial results are never discarded.", + "finding": "The extractor exists to translate raw source files into structured, technology-agnostic findings that describe *why* code exists rather than how it is implemented. It is designed to scale to very large legacy codebases (e.g. 
50 000-file monorepos) by skipping unchanged files via content-addressed caching, routing schema-typed files through deterministic parsers, and splitting oversized files into overlapping windows so no content is lost.", "sources": [ { "file": "wikifi/extractor.py", "lines": [ 1, - 37 + 35 ], - "fingerprint": "b0e939259557" + "fingerprint": "67bd95fa3f07" + } + ] + }, + { + "section_id": "domains", + "finding": "The core domain is automated knowledge extraction from source repositories: classifying files by kind, extracting intent-bearing findings per wiki section, and recording citations back to the originating file and line range. Subdomains include caching/memoization of extraction results, import-graph-based cross-file context, and chunk-level deduplication.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": null, + "fingerprint": "67bd95fa3f07" } ] }, { "section_id": "capabilities", - "finding": "The system can extract structured, section-tagged findings from every file in a repository, supporting three extraction paths: replaying previously cached findings for unchanged files, running deterministic structure-reading for files classified as SQL, OpenAPI, Protobuf, GraphQL, or migrations, and invoking an LLM for all other files. Large files are recursively split into overlapping chunks so no content is missed regardless of file size.", + "finding": "The system can walk an arbitrary set of source files and produce structured findings mapped to predefined wiki sections. 
It supports three extraction paths: (1) cache replay for files whose content has not changed, (2) deterministic specialized extractors for schema-typed files (SQL, OpenAPI, Protobuf, GraphQL, migrations), and (3) LLM-based extraction with recursive chunking for everything else.", "sources": [ { "file": "wikifi/extractor.py", "lines": [ - 140, - 200 + 155, + 250 ], - "fingerprint": "b0e939259557" + "fingerprint": "67bd95fa3f07" } ] }, { "section_id": "capabilities", - "finding": "Every emitted finding carries a source citation — the originating file path, an optional inclusive line range within that file, and a content fingerprint — enabling the aggregator to stitch citations back into the rendered wiki. Line ranges reported per-chunk are translated to absolute file line numbers before storage.", + "finding": "Large files are split into overlapping windows so cross-boundary context is preserved. Findings that appear in the overlap region of adjacent chunks are deduplicated by (section, finding-text) identity before being written to the notes store, preventing double-counting of the same declaration.", "sources": [ { "file": "wikifi/extractor.py", "lines": [ - 251, - 270 + 253, + 290 ], - "fingerprint": "b0e939259557" + "fingerprint": "67bd95fa3f07" } ] }, { "section_id": "capabilities", - "finding": "Cross-file context is surfaced to the LLM by supplying each file's import neighborhood (up to eight neighbor paths) in the extraction prompt, enabling findings to describe inter-file flows rather than treating each file in isolation.", + "finding": "Processing is crash-resumable: the cache is persisted after each file completes, so a run interrupted partway through can be restarted without re-processing already-extracted files. 
Per-chunk and per-file failures are isolated — a single failed LLM call does not abort the overall walk.", "sources": [ { "file": "wikifi/extractor.py", "lines": [ - 241, - 246 + 160, + 165 + ], + "fingerprint": "67bd95fa3f07" + } + ] + }, + { + "section_id": "capabilities", + "finding": "Each emitted finding carries a structured source citation (file path, absolute line range, content fingerprint) so the downstream aggregation stage can stitch precise references into the rendered wiki.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": [ + 262, + 270 ], - "fingerprint": "b0e939259557" + "fingerprint": "67bd95fa3f07" } ] }, { "section_id": "entities", - "finding": "A `SectionFinding` represents one contribution from a single file to one wiki section, carrying the target section identifier, a technology-agnostic prose description of the contribution, and an optional inclusive line range within the source chunk. A `FileFindings` groups a one-sentence file summary with all findings produced for that file.", + "finding": "A `SectionFinding` represents one contribution from a file to one wiki section, carrying the section identifier, a technology-agnostic prose description, and an optional inclusive line range within the chunk where the evidence appears.", "sources": [ { "file": "wikifi/extractor.py", "lines": [ - 106, - 123 + 105, + 116 ], - "fingerprint": "b0e939259557" + "fingerprint": "67bd95fa3f07" } ] }, { "section_id": "entities", - "finding": "An `ExtractionStats` record accumulates walk-level counters: total files seen, files yielding at least one finding, total findings, skipped files, chunks processed, cache hits, specialized-extractor invocations, and a per-kind file breakdown. It is returned by the walk so callers can report or act on extraction quality.", + "finding": "A `FileFindings` groups all findings produced for a single file, along with a one-sentence summary of the file's role. 
It is the schema the LLM must conform to when returning structured results.", "sources": [ { "file": "wikifi/extractor.py", "lines": [ - 126, + 119, + 122 + ], + "fingerprint": "67bd95fa3f07" + } + ] + }, + { + "section_id": "entities", + "finding": "An `ExtractionStats` record accumulates operational counters for a full walk: files seen, files with findings, total findings, skipped files, chunks processed, cache hits, specialized-extractor files, and a breakdown of file kinds encountered.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": [ + 125, 135 ], - "fingerprint": "b0e939259557" + "fingerprint": "67bd95fa3f07" } ] }, { - "section_id": "cross_cutting", - "finding": "Crash-resumability is a first-class property: after each file completes, the cache is optionally persisted via a caller-supplied callback, so a mid-walk failure loses at most one file's worth of work. Combined with content-addressed cache lookup, this makes re-walking an interrupted run nearly instantaneous for already-processed files.", + "section_id": "integrations", + "finding": "The extractor delegates LLM calls to an injected `LLMProvider` (from `wikifi/providers/base.py`), which must support a structured JSON completion method that accepts a system prompt, a user prompt, and a response schema. 
The extractor is otherwise provider-agnostic.", "sources": [ { "file": "wikifi/extractor.py", "lines": [ - 155, - 175 + 139, + 152 ], - "fingerprint": "b0e939259557" + "fingerprint": "67bd95fa3f07" } ] }, { - "section_id": "cross_cutting", - "finding": "Findings that emerge from the overlap region shared between adjacent chunks are deduplicated by (section_id, normalized finding text) within each file's processing pass, preventing the same declaration from being double-counted in the aggregated wiki.", + "section_id": "integrations", + "finding": "Findings are written to the notes store via `append_note` from `wikifi/wiki.py`, using a `WikiLayout` that describes where section note files live. The extractor has no direct knowledge of the storage format.", "sources": [ { "file": "wikifi/extractor.py", "lines": [ - 253, - 262 + 43, + 44 ], - "fingerprint": "b0e939259557" + "fingerprint": "67bd95fa3f07" } ] }, { - "section_id": "cross_cutting", - "finding": "Extraction failures — whether from the LLM provider or a specialized parser — are logged and counted but never propagate to abort the walk. A file whose only chunk fails with no salvageable findings is counted as skipped; partially-chunked files retain whatever findings were recovered.", + "section_id": "integrations", + "finding": "Import-graph neighbor data is consumed from `wikifi/repograph.py` (`RepoGraph.neighbor_paths`) and injected into each LLM prompt so the model can describe cross-file flows. 
File-kind classification (`classify`) from the same module drives specialized-extractor routing.", "sources": [ { "file": "wikifi/extractor.py", "lines": [ - 228, - 242 + 38, + 39 ], - "fingerprint": "b0e939259557" + "fingerprint": "67bd95fa3f07" } ] }, { "section_id": "integrations", - "finding": "The extractor delegates LLM calls to a provider abstraction sourced from the providers layer, passes file fingerprints to a cache layer for lookup and recording, reads import-graph neighborhoods from a repo-graph component, appends findings to a wiki layout via a note-store helper, and hands off recognized structured file types to a specialized extractor registry.", + "finding": "Specialized extractor selection is delegated to `wikifi/specialized/dispatch.py` (`select`), which returns a deterministic extraction function for schema-typed files or `None` for files that should go through the LLM path.", "sources": [ { "file": "wikifi/extractor.py", "lines": [ - 32, + 41, 42 ], - "fingerprint": "b0e939259557" + "fingerprint": "67bd95fa3f07" } ] }, { - "section_id": "external_dependencies", - "finding": "An LLM service is the primary external dependency for the general extraction path: structured prompts are sent per chunk and the responses are parsed as typed finding objects. The system is designed so the LLM call is the sole expensive operation, with all other mechanisms (caching, specialization, chunking) oriented toward minimizing how often it must be invoked.", + "section_id": "integrations", + "finding": "Content fingerprinting is handled by `wikifi/fingerprint.py` (`hash_file`) and cache lookup/recording by `wikifi/cache.py` (`WalkCache`). 
Source citations use the `SourceRef` model from `wikifi/evidence.py`.", "sources": [ { "file": "wikifi/extractor.py", "lines": [ - 224, - 236 + 36, + 43 + ], + "fingerprint": "67bd95fa3f07" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Content-addressed caching ensures idempotency: a file is only sent to the LLM if its fingerprint has changed since the last run. Cache state is persisted after every file, turning the cache into a crash-recovery checkpoint with no additional coordination required.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": [ + 185, + 200 + ], + "fingerprint": "67bd95fa3f07" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Deduplication of findings across chunk overlap regions is enforced by tracking a (section_id, finding_text) set per file, so identical findings discovered in the shared context of adjacent chunks are counted and stored exactly once.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": [ + 253, + 270 + ], + "fingerprint": "67bd95fa3f07" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Errors are scoped: specialized-extractor failures and per-chunk LLM failures are logged at warning level and cause only the affected unit to be skipped, while the rest of the walk continues. A file is counted as fully skipped only if it is a single-chunk file whose sole LLM call failed.", + "sources": [ + { + "file": "wikifi/extractor.py", + "lines": [ + 211, + 220 ], - "fingerprint": "b0e939259557" + "fingerprint": "67bd95fa3f07" } ] }, { "section_id": "hard_specifications", - "finding": "Per-file extraction is restricted to primary wiki sections only. 
Derivative sections (personas, user stories, diagrams) are explicitly excluded from per-file extraction and are instead produced in a later aggregation stage; requesting them at the per-file level is documented as producing sparse, speculative findings.", + "finding": "Per-file extraction targets only *primary* wiki sections. Derivative sections (personas, user stories, diagrams) are explicitly excluded from this stage and deferred to Stage 4 aggregation, because asking the model to identify them at the per-file level produces sparse, speculative findings.", "sources": [ { "file": "wikifi/extractor.py", "lines": [ - 46, - 51 + 51, + 56 ], - "fingerprint": "b0e939259557" + "fingerprint": "67bd95fa3f07" } ] }, { "section_id": "hard_specifications", - "finding": "The recursive text splitter must guarantee termination on any input, including minified single-line files with no whitespace, by falling back through separator priority (blank lines → single newlines → spaces → character boundaries). The character-boundary split is the terminal step that ensures every byte is eventually consumed.", + "finding": "Chunk size and overlap must satisfy `chunk_size > 0` and `0 <= overlap < chunk_size`; violations raise a `ValueError`. Chunks are built so that `base_size + overlap == chunk_size`, guaranteeing no chunk ever exceeds the configured byte limit.", "sources": [ { "file": "wikifi/extractor.py", "lines": [ - 91, - 103 + 333, + 340 ], - "fingerprint": "b0e939259557" + "fingerprint": "67bd95fa3f07" } ] }, { "section_id": "hard_specifications", - "finding": "Chunk overlap must satisfy `0 <= overlap < chunk_size`; violating this constraint raises an error. The effective base chunk size is `chunk_size - overlap` so that prepending an overlap tail never causes a chunk to exceed `chunk_size` bytes.", + "finding": "The recursive text splitter must always terminate and produce chunks that fit within the configured size, even for inputs with no whitespace (e.g. minified files). 
This is enforced by the empty-string terminal separator, which falls back to character-level slicing.", "sources": [ { "file": "wikifi/extractor.py", "lines": [ - 328, - 336 + 90, + 98 ], - "fingerprint": "b0e939259557" + "fingerprint": "67bd95fa3f07" } ] } @@ -2934,13 +2931,13 @@ ] }, "wikifi/orchestrator.py": { - "fingerprint": "6ed682a87356", - "summary": "Central orchestrator that wires the full four-stage documentation-generation pipeline (repository introspection → per-file extraction → section aggregation → derivative content derivation) and constructs the appropriate LLM provider based on configuration.", + "fingerprint": "1528ab8f73c3", + "summary": "Central pipeline orchestrator that sequences repository introspection, per-file extraction, section aggregation, and derivative artifact generation while managing caching, provider selection, and filesystem layout.", "chunks_processed": 1, "findings": [ { "section_id": "intent", - "finding": "The system automates the creation of a structured wiki from a source repository. It analyses the repository's files using a language model and produces aggregated documentation sections along with derived artefacts such as personas, user stories, and diagrams — removing the burden of manual documentation from developers.", + "finding": "The system exists to automatically generate a structured wiki from any source-code repository. 
It sequences LLM-powered analysis through four deterministic stages — structure introspection, per-file knowledge extraction, cross-file aggregation, and high-level artifact derivation — so developers obtain living documentation without writing it by hand.", "sources": [ { "file": "wikifi/orchestrator.py", @@ -2948,455 +2945,581 @@ 1, 16 ], - "fingerprint": "6ed682a87356" + "fingerprint": "1528ab8f73c3" } ] }, { "section_id": "capabilities", - "finding": "The pipeline initialises a wiki skeleton in the target repository, introspects its structure to determine which files are in scope, performs per-file content extraction into structured notes with optional caching and import-graph context, aggregates those notes into primary wiki sections, and then derives higher-level artefacts (personas, user stories, diagrams) from the aggregated sections. An optional critic loop can review and rescore the derived content.", + "finding": "The pipeline offers: (1) repository structure introspection to identify files worth analysing, (2) per-file finding extraction with chunking support and optional cross-file import-graph context, (3) section-level aggregation of findings across all files, and (4) derivation of personas, user stories, and diagrams from aggregated content — including an optional critic review loop.", "sources": [ { "file": "wikifi/orchestrator.py", "lines": [ - 62, - 155 + 1, + 16 ], - "fingerprint": "6ed682a87356" + "fingerprint": "1528ab8f73c3" } ] }, { "section_id": "capabilities", - "finding": "The system can be initialised idempotently against any repository root, automatically bootstrapping the required directory skeleton if it does not yet exist before executing the walk pipeline.", + "finding": "The system supports a caching layer that persists extraction and aggregation results between runs and automatically prunes stale entries for files that have fallen out of scope, enabling incremental re-runs over large repositories.", "sources": [ { "file": 
"wikifi/orchestrator.py", "lines": [ - 62, - 76 + 96, + 108 ], - "fingerprint": "6ed682a87356" + "fingerprint": "1528ab8f73c3" } ] }, { - "section_id": "external_dependencies", - "finding": "Three external language-model backends are supported: a locally hosted inference server (Ollama, the default), the Anthropic hosted API (requiring an API key), and the OpenAI hosted API (requiring an API key and an optional custom base URL). Each backend is configured with a model name, timeout, and token limits drawn from application settings.", + "section_id": "capabilities", + "finding": "An optional static import-graph analysis enriches per-file extraction with neighbour context, giving the LLM visibility into how files relate to each other before generating findings.", "sources": [ { "file": "wikifi/orchestrator.py", "lines": [ - 163, - 210 + 110, + 114 ], - "fingerprint": "6ed682a87356" + "fingerprint": "1528ab8f73c3" } ] }, { - "section_id": "integrations", - "finding": "The orchestrator is the primary entry point called by the CLI layer (`init_wiki`, `run_walk`). It delegates outward to the introspection, extraction, aggregation, and derivation modules, and further to the chosen LLM provider. The cache and file-walker modules are also called as part of the pipeline.", + "section_id": "capabilities", + "finding": "The wiki directory is initialised idempotently, so the tool can be run on a fresh project without a prior setup step.", "sources": [ { "file": "wikifi/orchestrator.py", "lines": [ - 40, - 60 + 55, + 64 ], - "fingerprint": "6ed682a87356" + "fingerprint": "1528ab8f73c3" } ] }, { - "section_id": "cross_cutting", - "finding": "A persistent extraction cache is maintained across runs: on each walk the cache is loaded, entries for files that are no longer in scope are pruned, and the cache is saved after extraction and again after all stages complete. 
Caching can be disabled, in which case the cache is fully reset before the walk begins.", + "section_id": "external_dependencies", + "finding": "A locally-hosted Ollama service is the default LLM backend, used for all four pipeline stages. It is configurable by host URL, model name, request timeout, and an optional extended-thinking mode.", "sources": [ { "file": "wikifi/orchestrator.py", "lines": [ - 95, - 110 + 166, + 174 ], - "fingerprint": "6ed682a87356" + "fingerprint": "1528ab8f73c3" } ] }, { - "section_id": "cross_cutting", - "finding": "Structured logging is used at each major stage boundary (introspection, graph build, extraction, aggregation, derivation) to provide observability into pipeline progress.", + "section_id": "external_dependencies", + "finding": "Anthropic's hosted API (Claude family) is an opt-in backend selected via configuration. When the configured model name does not begin with 'claude-', the system substitutes the default 'claude-opus-4-7' to avoid model-not-found errors.", "sources": [ { "file": "wikifi/orchestrator.py", "lines": [ - 84, - 148 + 175, + 186 ], - "fingerprint": "6ed682a87356" + "fingerprint": "1528ab8f73c3" } ] }, { - "section_id": "entities", - "finding": "A `WalkReport` aggregates the outputs of all four pipeline stages: the repository introspection result, per-file extraction statistics, section aggregation statistics, derivation statistics, the live cache state, and the repository import graph. It is the single return value representing a completed wiki-generation run.", + "section_id": "external_dependencies", + "finding": "OpenAI's API is a second opt-in hosted backend, with a configurable base URL to support compatible third-party endpoints. 
When the model name does not match known OpenAI patterns, it defaults to 'gpt-4o'.", "sources": [ { "file": "wikifi/orchestrator.py", "lines": [ - 54, - 61 + 187, + 200 ], - "fingerprint": "6ed682a87356" + "fingerprint": "1528ab8f73c3" } ] }, { - "section_id": "hard_specifications", - "finding": "When a user selects the Anthropic provider but the configured model name does not begin with 'claude-', the system silently substitutes the model identifier 'claude-opus-4-7' rather than forwarding an invalid name. Similarly, for OpenAI, non-OpenAI-pattern model names are replaced with 'gpt-4o'. This model-name substitution logic must be preserved so that users migrating from the default local provider do not receive opaque remote API errors.", + "section_id": "integrations", + "finding": "The CLI surface exposes three entry points — init_wiki, run_walk, and run_report — each accepting a root path and a provider instance, making the pipeline fully substitutable with a mock provider in tests.", "sources": [ { "file": "wikifi/orchestrator.py", "lines": [ - 178, - 207 + 44, + 53 ], - "fingerprint": "6ed682a87356" + "fingerprint": "1528ab8f73c3" } ] }, { - "section_id": "hard_specifications", - "finding": "The only accepted provider identifiers are 'ollama', 'anthropic', and 'openai'; any other value raises an error. 
This contract is enforced at provider construction time and must be maintained by any future provider registration mechanism.", + "section_id": "integrations", + "finding": "The orchestrator calls into six internal subsystems in sequence: the introspector, the file walker, the extractor, the aggregator, the deriver, and the cache layer, coordinating their inputs and outputs to form the end-to-end pipeline.", "sources": [ { "file": "wikifi/orchestrator.py", "lines": [ - 208, - 210 - ], - "fingerprint": "6ed682a87356" - } - ] - } - ] - }, - "wikifi/repograph.py": { - "fingerprint": "3d8bbdb10112", - "summary": "Provides lightweight, language-agnostic static analysis of a repository: classifies each in-scope file by kind and constructs an import/reference graph so that per-file wiki extraction can reference cross-file flows rather than treating each file in isolation.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "This module exists to enrich wiki extraction with two signals: what kind of structured artifact each file represents (schema, API contract, migration, general code) and which other files it depends on or is depended upon by. The goal is to let per-file analysis describe cross-file flows (e.g. 
'this handler delegates to the billing service') rather than producing island findings.", - "sources": [ - { - "file": "wikifi/repograph.py", - "lines": [ - 1, - 30 - ], - "fingerprint": "3d8bbdb10112" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system classifies every in-scope file into one of seven categories — general application code, SQL DDL, OpenAPI/Swagger spec, Protobuf definition, GraphQL schema, database migration, or unclassified — allowing downstream stages to route structured files to specialised extractors that skip the language model entirely.", - "sources": [ - { - "file": "wikifi/repograph.py", - "lines": [ - 41, - 52 + 66, + 157 ], - "fingerprint": "3d8bbdb10112" + "fingerprint": "1528ab8f73c3" } ] }, { - "section_id": "capabilities", - "finding": "A regex-driven import graph is built across the entire in-scope file set for multiple language families, producing for each file the list of files it imports and the list of files that import it. This neighbor map is injected into per-file extraction prompts to surface cross-file relationships.", + "section_id": "cross_cutting", + "finding": "Structured logging is emitted at the start of each pipeline stage (introspection, graph build, extraction, aggregation, derivation), providing observability into long-running batch runs.", "sources": [ { - "file": "wikifi/repograph.py", + "file": "wikifi/orchestrator.py", "lines": [ - 155, - 210 + 87, + 148 ], - "fingerprint": "3d8bbdb10112" + "fingerprint": "1528ab8f73c3" } ] }, { - "section_id": "capabilities", - "finding": "Migration files are distinguished from generic DDL by detecting well-known migration directory conventions (Alembic, Django, Rails, Prisma, Flyway, Liquibase), ensuring the wiki can separate the current schema state from historical forward-only change scripts.", + "section_id": "cross_cutting", + "finding": "The cache is saved to disk after both the aggregation stage and again at the end of the full run, ensuring 
partial progress is not lost if derivation fails.", "sources": [ { - "file": "wikifi/repograph.py", + "file": "wikifi/orchestrator.py", "lines": [ - 84, - 99 + 150, + 156 ], - "fingerprint": "3d8bbdb10112" + "fingerprint": "1528ab8f73c3" } ] }, { - "section_id": "capabilities", - "finding": "OpenAPI/Swagger specifications are detected inside YAML and JSON files by scanning the first 4 KB for characteristic header patterns, since extension alone is insufficient to distinguish them from other structured data files.", + "section_id": "cross_cutting", + "finding": "Cache entries for files that leave the in-scope set are pruned before extraction begins, preventing stale data from leaking into aggregation and keeping cache size proportional to the live file set.", "sources": [ { - "file": "wikifi/repograph.py", + "file": "wikifi/orchestrator.py", "lines": [ - 102, - 109 + 100, + 106 ], - "fingerprint": "3d8bbdb10112" + "fingerprint": "1528ab8f73c3" } ] }, { "section_id": "entities", - "finding": "A `FileKind` enumeration captures seven mutually exclusive file roles: application code, SQL, OpenAPI, Protobuf, GraphQL, migration, and other. This classification drives routing to specialised or general-purpose extraction paths.", + "finding": "WalkReport is the top-level result record for a full pipeline run, carrying the introspection result, extraction statistics, aggregation statistics, derivation statistics, the live cache snapshot, and the repo import graph.", "sources": [ { - "file": "wikifi/repograph.py", + "file": "wikifi/orchestrator.py", "lines": [ - 41, - 52 + 56, + 63 ], - "fingerprint": "3d8bbdb10112" + "fingerprint": "1528ab8f73c3" } ] }, { "section_id": "entities", - "finding": "A `GraphNode` entity represents a single file's position in the import graph, carrying its repo-relative path, the ordered set of files it imports, and the ordered set of files that import it. 
It exposes a capped combined-neighbor list for use in prompts.", + "finding": "Settings is the central configuration entity, carrying all tuneable parameters: provider identity and credentials, model name, file-size limits, chunk dimensions, cache and graph feature flags, the critic-review threshold, and provider-specific token limits.", "sources": [ { - "file": "wikifi/repograph.py", + "file": "wikifi/orchestrator.py", "lines": [ - 148, - 167 + 166, + 201 ], - "fingerprint": "3d8bbdb10112" + "fingerprint": "1528ab8f73c3" } ] }, { - "section_id": "entities", - "finding": "A `RepoGraph` entity holds the complete per-file import-edge map for a repository scan, supporting lookup of a node by path and retrieval of a capped neighbor list for any given file.", + "section_id": "hard_specifications", + "finding": "Exactly three provider values are accepted — 'ollama', 'anthropic', 'openai' — and any other value raises a hard error with the valid options listed explicitly.", "sources": [ { - "file": "wikifi/repograph.py", + "file": "wikifi/orchestrator.py", "lines": [ - 170, - 181 + 202, + 203 ], - "fingerprint": "3d8bbdb10112" + "fingerprint": "1528ab8f73c3" } ] }, { - "section_id": "cross_cutting", - "finding": "The import graph caps neighbor lists at eight entries by default when constructing prompt context, providing a bounded, deterministic input size regardless of how highly connected a file is in the repository.", + "section_id": "hard_specifications", + "finding": "When Anthropic is selected, a model name not starting with 'claude-' is silently replaced with 'claude-opus-4-7' to prevent API 404 errors from a carried-over Ollama model id.", "sources": [ { - "file": "wikifi/repograph.py", + "file": "wikifi/orchestrator.py", "lines": [ - 156, - 165 + 180, + 185 ], - "fingerprint": "3d8bbdb10112" + "fingerprint": "1528ab8f73c3" } ] }, { "section_id": "hard_specifications", - "finding": "The implementation must remain dependency-free beyond regex and path resolution — 
tree-sitter or similar binary dependencies are explicitly prohibited so that the tool can be installed without native compilation.", + "finding": "When OpenAI is selected, a model name that does not match the heuristic patterns (gpt-*, o1, o3, o4, ft:) is silently replaced with 'gpt-4o'; fine-tuned variants (ft: prefix) are explicitly recognised as valid OpenAI identifiers.", "sources": [ { - "file": "wikifi/repograph.py", + "file": "wikifi/orchestrator.py", "lines": [ - 24, - 29 + 192, + 205 ], - "fingerprint": "3d8bbdb10112" + "fingerprint": "1528ab8f73c3" } ] }, { - "section_id": "integrations", - "finding": "This module is consumed by the extractor and orchestrator components: the file classification result determines which specialised extractor is invoked, and the neighbor list from the graph is injected into the Stage 2 extraction prompt for each file.", + "section_id": "domains", + "finding": "The system operates across two core domains: repository intelligence (understanding a codebase's structure, file relationships, and per-file semantics) and documentation synthesis (aggregating extracted knowledge into human-readable wiki sections and higher-level narrative artifacts like personas and diagrams).", "sources": [ { - "file": "wikifi/repograph.py", + "file": "wikifi/orchestrator.py", "lines": [ 1, - 15 + 16 ], - "fingerprint": "3d8bbdb10112" + "fingerprint": "1528ab8f73c3" } ] } ] }, - "wikifi/report.py": { - "fingerprint": "2b94c0a5e62e", - "summary": "Produces a read-only coverage-and-quality report over a completed wiki walk, answering whether the codebase was fully covered and whether the resulting wiki is good enough to act on.", + "wikifi/repograph.py": { + "fingerprint": "808453182a95", + "summary": "Performs lightweight static analysis of a repository to classify each file by structural kind and build a cross-file import/reference graph used to enrich per-file knowledge extraction.", "chunks_processed": 1, "findings": [ { "section_id": "intent", - 
"finding": "The report module addresses the questions that migration leads ask before committing to a re-implementation effort: (1) did the automated walk cover the entire system, and (2) is the generated wiki accurate and complete enough to guide action? It runs purely from existing on-disk artifacts and never mutates the wiki, making it safe to run in CI without side-effects.", + "finding": "This module exists to give the extraction pipeline two pieces of context it would otherwise lack: a classification of each file's structural role (e.g. schema definition, API contract, migration, application logic), and a map of which files import which. Together these allow structured files to bypass the language-model path entirely, and allow application-code findings to mention cross-file relationships rather than treating each file in isolation.", "sources": [ { - "file": "wikifi/report.py", + "file": "wikifi/repograph.py", "lines": [ 1, - 14 + 30 ], - "fingerprint": "2b94c0a5e62e" + "fingerprint": "808453182a95" } ] }, { "section_id": "capabilities", - "finding": "The system can produce a structured coverage report showing, per wiki section, how many files contributed findings and how many findings were extracted. It also offers an optional quality-scoring mode where each populated section is evaluated by the critic and assigned a numeric score out of 10, surfacing unsupported claims and gaps. 
A human-readable markdown table is rendered from these results.", + "finding": "The system classifies every in-scope repository file into one of seven structural kinds — application code, SQL DDL, OpenAPI specification, Protocol Buffer definition, GraphQL schema, database migration, or other — using file extension, path conventions, and content sampling.", "sources": [ { - "file": "wikifi/report.py", + "file": "wikifi/repograph.py", "lines": [ - 44, - 77 - ], - "fingerprint": "2b94c0a5e62e" + 43, + 70 + ], + "fingerprint": "808453182a95" } ] }, { "section_id": "capabilities", - "finding": "Coverage statistics are derived from the persisted walk cache, comparing total files seen against files that produced at least one finding, and computing a coverage percentage. This allows teams to identify dead zones — files the extraction pass processed but from which no signal was extracted.", + "finding": "For structured file kinds (SQL, OpenAPI, Protobuf, GraphQL, migration) the system short-circuits the language-model extraction path and routes to a specialized extractor, avoiding expensive model calls for files whose structure is mechanically parseable.", "sources": [ { - "file": "wikifi/report.py", + "file": "wikifi/repograph.py", "lines": [ - 103, - 107 + 1, + 15 + ], + "fingerprint": "808453182a95" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system builds a language-pluralistic import/reference graph across all application-code files using regex-based scanning, recording for each file which files it imports and which files import it. 
This neighbor map is injected into per-file extraction prompts so findings can describe cross-file flows.", + "sources": [ + { + "file": "wikifi/repograph.py", + "lines": [ + 162, + 215 + ], + "fingerprint": "808453182a95" + } + ] + }, + { + "section_id": "capabilities", + "finding": "Import resolution handles Python relative imports (dot-prefix syntax), JavaScript/TypeScript path-style and bare module imports, Go import blocks, Java/Kotlin/Scala/C# dotted class imports, and Ruby require statements — resolving each to a concrete repo-relative path where possible.", + "sources": [ + { + "file": "wikifi/repograph.py", + "lines": [ + 280, + 340 ], - "fingerprint": "2b94c0a5e62e" + "fingerprint": "808453182a95" } ] }, { "section_id": "entities", - "finding": "A `SectionReport` captures the per-section view: the section descriptor, count of contributing files, total findings count, body size in characters, an emptiness flag, and an optional quality critique. A `WikiReport` aggregates all section reports alongside overall coverage statistics and an optional mean quality score across populated sections.", + "finding": "A `GraphNode` represents a single file's position in the import graph, carrying its repo-relative path, the ordered set of files it imports, and the ordered set of files that import it. 
It exposes a combined neighbor list capped at a configurable limit for use in prompt enrichment.", "sources": [ { - "file": "wikifi/report.py", + "file": "wikifi/repograph.py", "lines": [ - 28, - 42 + 143, + 162 ], - "fingerprint": "2b94c0a5e62e" + "fingerprint": "808453182a95" } ] }, { "section_id": "entities", - "finding": "A `CoverageStats` entity (defined in the critic module) holds the total files seen, files with findings, and per-section breakdowns of findings and contributing files, and exposes a coverage percentage computation.", + "finding": "A `RepoGraph` is the complete per-repository import graph, keyed by repo-relative file path, providing lookup of individual nodes and neighbor path lists.", "sources": [ { - "file": "wikifi/report.py", + "file": "wikifi/repograph.py", "lines": [ - 85, - 94 + 165, + 177 + ], + "fingerprint": "808453182a95" + } + ] + }, + { + "section_id": "entities", + "finding": "A `FileKind` enumeration defines the seven recognized structural categories of source files: application code, SQL, OpenAPI, Protobuf, GraphQL, migration, and other. Classification drives routing to specialized or general-purpose extractors.", + "sources": [ + { + "file": "wikifi/repograph.py", + "lines": [ + 43, + 56 + ], + "fingerprint": "808453182a95" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "The implementation explicitly avoids tree-sitter or any binary dependency for import graph construction, relying only on regex and path resolution. 
This is a stated architectural constraint: the regex graph is considered sufficient for its sole consumer (the language model), and adding a binary dependency has been explicitly rejected.", + "sources": [ + { + "file": "wikifi/repograph.py", + "lines": [ + 22, + 30 + ], + "fingerprint": "808453182a95" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Migration files are detected by matching a hardcoded list of well-known migration directory path tokens (Alembic, Django, Rails, Knex, Flyway, Liquibase, Prisma). A SQL file in one of these directories is classified as a migration rather than generic DDL, preserving the distinction between forward-only schema changes and current schema in the generated wiki.", + "sources": [ + { + "file": "wikifi/repograph.py", + "lines": [ + 80, + 93 ], - "fingerprint": "2b94c0a5e62e" + "fingerprint": "808453182a95" } ] }, { "section_id": "cross_cutting", - "finding": "The report is explicitly read-only: it inspects on-disk artifacts (notes JSONL, section markdown files, and the walk cache) without modifying the wiki. This invariant is stated in the module docstring and upheld throughout the implementation.", + "finding": "File classification uses a content sample of at most 4 096 bytes to detect OpenAPI/Swagger YAML and JSON files, avoiding full-file reads during the classification pass.", + "sources": [ + { + "file": "wikifi/repograph.py", + "lines": [ + 113, + 123 + ], + "fingerprint": "808453182a95" + } + ] + }, + { + "section_id": "integrations", + "finding": "This module feeds two downstream consumers: the specialized extractor dispatcher (which routes structured file kinds to dedicated parsers) and the orchestrator (which uses the `RepoGraph` to inject neighbor context into per-file extraction prompts). 
Both relationships are visible from neighbor file references.", + "sources": [ + { + "file": "wikifi/repograph.py", + "lines": [ + 1, + 10 + ], + "fingerprint": "808453182a95" + } + ] + } + ] + }, + "wikifi/report.py": { + "fingerprint": "eaa5459516bf", + "summary": "Generates a structural and quality report for a completed wiki walk, answering coverage and readiness questions for migration leads without modifying any wiki content.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "The report module exists to answer two pre-migration questions: whether the automated walk covered the entire system (per-section file and finding counts), and whether the resulting wiki is good enough to act on (per-section quality scores surfacing unsupported claims and gaps). It is designed to run read-only, relying solely on already-produced on-disk artifacts.", "sources": [ { "file": "wikifi/report.py", "lines": [ - 9, - 12 + 1, + 16 ], - "fingerprint": "2b94c0a5e62e" + "fingerprint": "eaa5459516bf" } ] }, { - "section_id": "cross_cutting", - "finding": "Emptiness detection for a section applies three textual sentinel checks — 'Not yet populated', 'No findings were extracted', and 'upstream sections required to derive' — to decide whether a section should be skipped for scoring. 
These sentinels act as a lightweight data-integrity signal distinguishing meaningfully populated sections from placeholder content.", + "section_id": "capabilities", + "finding": "The system can produce a coverage report showing how many files were seen, how many contributed findings, and a per-section breakdown of both — without requiring an AI provider, making it suitable for CI pipelines.", "sources": [ { "file": "wikifi/report.py", "lines": [ - 118, - 123 + 82, + 85 + ], + "fingerprint": "eaa5459516bf" + } + ] + }, + { + "section_id": "capabilities", + "finding": "When an AI provider is supplied and scoring is enabled, every populated section is evaluated by a quality critic, yielding a per-section score out of 10 along with identified gaps and unsupported claims. An overall score is computed as the mean across all scored sections.", + "sources": [ + { + "file": "wikifi/report.py", + "lines": [ + 106, + 114 + ], + "fingerprint": "eaa5459516bf" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The report renders as a markdown table listing each section with its file count, finding count, body character count, quality score, and the single most prominent gap or unsupported claim — giving migration leads a one-page readiness summary.", + "sources": [ + { + "file": "wikifi/report.py", + "lines": [ + 46, + 74 + ], + "fingerprint": "eaa5459516bf" + } + ] + }, + { + "section_id": "entities", + "finding": "A `SectionReport` captures the per-section view: a reference to the section definition, count of contributing files, count of findings, character length of the written body, an emptiness flag, and an optional quality critique.", + "sources": [ + { + "file": "wikifi/report.py", + "lines": [ + 29, + 36 + ], + "fingerprint": "eaa5459516bf" + } + ] + }, + { + "section_id": "entities", + "finding": "A `WikiReport` aggregates all section reports, the overall coverage statistics, and an optional mean quality score across populated sections.", + "sources": [ + { 
+ "file": "wikifi/report.py", + "lines": [ + 39, + 44 ], - "fingerprint": "2b94c0a5e62e" + "fingerprint": "eaa5459516bf" } ] }, { "section_id": "cross_cutting", - "finding": "Structured logging is initialised under the `wikifi.report` namespace for observability into report generation steps.", + "finding": "Coverage statistics are authoritative from on-disk notes when available; the cache is used as a fallback only when no notes have been written — ensuring accuracy even when the cache has been deleted or the walk was run without caching.", "sources": [ { "file": "wikifi/report.py", "lines": [ - 22, - 22 + 90, + 99 ], - "fingerprint": "2b94c0a5e62e" + "fingerprint": "eaa5459516bf" } ] }, { - "section_id": "integrations", - "finding": "The report integrates with the walk cache to retrieve the full extraction manifest (file → findings mapping) for computing coverage statistics, and with the notes store to count per-section findings and contributing files.", + "section_id": "cross_cutting", + "finding": "Section emptiness is determined by detecting specific sentinel strings in the body text: 'Not yet populated', 'No findings were extracted', and 'upstream sections required to derive'. Sections matching any of these are excluded from quality scoring.", "sources": [ { "file": "wikifi/report.py", "lines": [ 103, - 107 + 108 ], - "fingerprint": "2b94c0a5e62e" + "fingerprint": "eaa5459516bf" } ] }, { - "section_id": "integrations", - "finding": "When quality scoring is requested and a language-model provider is available, the report delegates to the critic subsystem, optionally supplying upstream section bodies for derivative sections that depend on earlier wiki content.", + "section_id": "hard_specifications", + "finding": "Three exact sentinel strings must be preserved as the canonical markers for unpopulated sections: 'Not yet populated', 'No findings were extracted', and 'upstream sections required to derive'. 
The report module depends on these exact strings to correctly identify and exclude empty sections from scoring and gap analysis.", "sources": [ { "file": "wikifi/report.py", "lines": [ - 124, - 131 + 103, + 108 ], - "fingerprint": "2b94c0a5e62e" + "fingerprint": "eaa5459516bf" } ] }, { - "section_id": "hard_specifications", - "finding": "Quality scoring is only performed when explicitly requested (`score=True`) and a provider is supplied; without both conditions the report remains purely structural. This ensures the tool can run in provider-free environments such as CI pipelines without failure.", + "section_id": "integrations", + "finding": "The report reads section content from on-disk markdown files via the wiki layout abstraction, reads per-file extraction notes from a JSONL store, and optionally reads upstream section bodies to provide context when scoring derivative sections through the critic component.", "sources": [ { "file": "wikifi/report.py", "lines": [ - 80, - 84 + 78, + 130 ], - "fingerprint": "2b94c0a5e62e" + "fingerprint": "eaa5459516bf" } ] } @@ -3732,153 +3855,153 @@ ] }, "wikifi/specialized/__init__.py": { - "fingerprint": "84d6c382c745", - "summary": "Dispatcher that routes high-signal source artifacts (schemas, IDLs, API specs, migrations) to purpose-built parsers instead of the general LLM extraction path, while preserving a uniform output contract downstream.", + "fingerprint": "06204b629ff9", + "summary": "Package init for the specialized extractors sub-package, documenting its structure and conventions.", "chunks_processed": 1, "findings": [ { "section_id": "intent", - "finding": "Schema files, interface definition languages, API specifications, and database migrations encode a system's contracts in structured, machine-readable form. Passing them through a general-purpose prose extractor is both inefficient and lossy; dedicated parsers can read the structure directly. 
This package implements that optimised path as a drop-in replacement for the LLM call whenever a file's kind is recognised.", + "finding": "The specialized extractors package exists to provide type-aware, high-signal parsing of specific source artifact formats (SQL, OpenAPI, Protobuf, GraphQL) as an alternative or complement to the general LLM-based extraction pass, producing the same structured findings shape.", "sources": [ { "file": "wikifi/specialized/__init__.py", "lines": [ 1, - 13 + 12 ], - "fingerprint": "84d6c382c745" + "fingerprint": "06204b629ff9" } ] }, { "section_id": "capabilities", - "finding": "The system selects a specialised extractor based on the detected kind of a source file — covering SQL queries, database migrations, OpenAPI specifications, Protocol Buffer definitions, and GraphQL schemas — and routes each file to the parser best suited to its structure rather than treating all files identically.", + "finding": "The system can parse and extract structured findings from multiple well-known artifact formats — SQL schemas, OpenAPI specifications, Protobuf definitions, and GraphQL schemas — each handled by a dedicated per-format extractor module.", "sources": [ { "file": "wikifi/specialized/__init__.py", "lines": [ - 46, - 57 + 8, + 11 ], - "fingerprint": "84d6c382c745" + "fingerprint": "06204b629ff9" } ] }, { - "section_id": "entities", - "finding": "A `SpecializedFinding` represents a single structured insight extracted from a file, carrying a section identifier, a human-readable description, and a list of source references. 
A `SpecializedResult` groups zero or more such findings together with an optional summary string, and is the standard output contract for every specialised extractor.", + "section_id": "integrations", + "finding": "A dispatch function (`select`) in the sibling dispatch module maps a file's kind to the appropriate extractor, acting as the internal routing layer between artifact type detection and structured extraction.", "sources": [ { "file": "wikifi/specialized/__init__.py", "lines": [ - 29, - 38 + 7, + 8 ], - "fingerprint": "84d6c382c745" + "fingerprint": "06204b629ff9" } ] - }, + } + ] + }, + "wikifi/specialized/graphql.py": { + "fingerprint": "1d454892894d", + "summary": "GraphQL SDL extractor that maps schema constructs to wiki sections: domain types and inputs become entities, Query/Mutation roots become capabilities, and Subscription roots become integrations.", + "chunks_processed": 1, + "findings": [ { - "section_id": "cross_cutting", - "finding": "All specialised extractors are required to return results in the same `{section_id, finding, sources}` shape that the LLM extractor produces, ensuring the downstream aggregation layer needs no knowledge of which extraction path was taken. 
This interface contract is an invariant that must be preserved.", + "section_id": "intent", + "finding": "This module exists to let wikifi understand GraphQL schema definition files as first-class inputs, translating the type system and operation roots into technology-agnostic wiki findings rather than raw SDL syntax.", "sources": [ { - "file": "wikifi/specialized/__init__.py", + "file": "wikifi/specialized/graphql.py", "lines": [ - 9, - 13 + 1, + 11 ], - "fingerprint": "84d6c382c745" + "fingerprint": "1d454892894d" } ] }, { - "section_id": "integrations", - "finding": "The dispatcher integrates internally with the file-kind classification system (sourced from the repository graph module) and delegates to four sibling extractor modules — SQL, OpenAPI, Protobuf, and GraphQL — each responsible for a distinct artifact type.", + "section_id": "capabilities", + "finding": "The system can parse a GraphQL SDL file and enumerate every Query and Mutation root field, producing capability findings that describe what operations the API exposes. Fields are extracted per root block, with up to 30 field names captured per root.", "sources": [ { - "file": "wikifi/specialized/__init__.py", + "file": "wikifi/specialized/graphql.py", "lines": [ - 46, - 57 + 101, + 113 ], - "fingerprint": "84d6c382c745" + "fingerprint": "1d454892894d" } ] - } - ] - }, - "wikifi/specialized/graphql.py": { - "fingerprint": "bbb305e0d47f", - "summary": "Specialized extractor that parses GraphQL Schema Definition Language files and maps their constructs to structured wiki findings about domain entities and API capabilities.", - "chunks_processed": 1, - "findings": [ + }, { "section_id": "capabilities", - "finding": "The system can analyze GraphQL schema files to identify and catalog all operation roots. 
Query and Mutation roots are recognized as first-class capability surfaces, with each declared field surfaced as an individually named operation the API exposes.", + "finding": "The extractor explicitly supports modular schema composition: `extend type Query` and `extend type Mutation` declarations are treated identically to base root declarations, so capabilities defined across multiple SDL files are all surfaced rather than dropped.", "sources": [ { "file": "wikifi/specialized/graphql.py", "lines": [ - 85, - 101 + 19, + 20 ], - "fingerprint": "bbb305e0d47f" + "fingerprint": "1d454892894d" } ] }, { - "section_id": "integrations", - "finding": "Subscription roots in a GraphQL schema are treated as integration touchpoints rather than capabilities, reflecting their role as real-time or event-driven channels that external consumers attach to.", + "section_id": "entities", + "finding": "GraphQL domain object types (any `type` that is not a root operation) are recorded as domain entities, with up to 25 names listed per file. Interfaces (shared shape contracts), input types (request payload shapes), and enums (closed value sets) are each captured as separate entity findings.", "sources": [ { "file": "wikifi/specialized/graphql.py", "lines": [ - 88, - 91 + 56, + 95 ], - "fingerprint": "bbb305e0d47f" + "fingerprint": "1d454892894d" } ] }, { - "section_id": "entities", - "finding": "Domain-level GraphQL object types (excluding operation roots) are extracted and recorded as named entities. 
Interfaces, input types, and enums are each treated as distinct entity-level constructs: interfaces represent shared shape contracts, inputs represent request payload shapes, and enums represent closed value sets.", + "section_id": "integrations", + "finding": "GraphQL Subscription roots are mapped to the integrations section, reflecting that subscriptions represent event-driven integration touchpoints rather than direct capabilities.", "sources": [ { "file": "wikifi/specialized/graphql.py", "lines": [ - 32, - 81 + 108, + 110 ], - "fingerprint": "bbb305e0d47f" + "fingerprint": "1d454892894d" } ] }, { - "section_id": "intent", - "finding": "The extractor exists to make GraphQL schemas first-class source material for wiki generation, automatically translating schema structure into domain-meaningful findings about entities and API surface rather than requiring manual documentation of those schemas.", + "section_id": "cross_cutting", + "finding": "Line-number anchoring is computed against the captured name offset (not the full match start) to prevent off-by-one errors when SDL files use leading whitespace, ensuring that source references in findings point accurately to declarations.", "sources": [ { "file": "wikifi/specialized/graphql.py", "lines": [ - 1, - 7 + 38, + 43 ], - "fingerprint": "bbb305e0d47f" + "fingerprint": "1d454892894d" } ] } ] }, "wikifi/specialized/openapi.py": { - "fingerprint": "ae97781309c4", - "summary": "Parses OpenAPI/Swagger contract files and extracts structured findings about public endpoints, data schemas, and authentication schemes to support migration analysis.", + "fingerprint": "bdc664e7ad72", + "summary": "Extracts structured migration intelligence from OpenAPI/Swagger contract files, surfacing endpoints, schemas, and authentication schemes as typed findings.", "chunks_processed": 1, "findings": [ { "section_id": "intent", - "finding": "OpenAPI specification files are treated as authoritative 'migration gold' because they enumerate 
every public endpoint, request/response body, and authentication method in one structured document. This extractor surfaces that information so migration teams have a complete picture of the API contract without manually reading raw spec files.", + "finding": "This module exists to turn OpenAPI/Swagger contract files into structured migration intelligence. Every public endpoint, request/response schema, and authentication method declared in the contract is enumerated so migration teams have a complete, authoritative picture of the API surface without manually reading raw spec files.", "sources": [ { "file": "wikifi/specialized/openapi.py", @@ -3886,104 +4009,104 @@ 1, 11 ], - "fingerprint": "ae97781309c4" + "fingerprint": "bdc664e7ad72" } ] }, { "section_id": "capabilities", - "finding": "The system can parse an API contract file (JSON or YAML format) and produce an inventory of all public HTTP endpoints — including the verb, path, and human-readable description — capping display at 20 with a count of omitted entries. When a spec cannot be parsed, it emits a graceful advisory finding rather than failing, directing reviewers to inspect the file manually.", + "finding": "Parses API contract files to extract: the API title and version, the full list of HTTP endpoints with their verbs, paths, and summaries (up to 20 shown inline, with a count of additional ones), and the named request/response schema models. 
Produces a concise summary noting total endpoint and schema counts.", "sources": [ { "file": "wikifi/specialized/openapi.py", "lines": [ - 23, - 50 + 53, + 116 ], - "fingerprint": "ae97781309c4" + "fingerprint": "bdc664e7ad72" } ] }, { - "section_id": "capabilities", - "finding": "The extractor also surfaces the API title, version, and description from the contract's info block, providing high-level identity context for the API being migrated.", + "section_id": "integrations", + "finding": "Identifies inbound integration surface: each parsed API contract contributes a finding recording how many HTTP endpoints the system exposes to external consumers, forming the inbound integration inventory for the wiki.", "sources": [ { "file": "wikifi/specialized/openapi.py", "lines": [ - 53, - 66 + 96, + 103 ], - "fingerprint": "ae97781309c4" + "fingerprint": "bdc664e7ad72" } ] }, { "section_id": "entities", - "finding": "API request/response schemas are extracted from the contract's component definitions and listed by name (up to 25, with overflow count). These represent the canonical data models the API exposes or consumes, and are surfaced as entity-level findings for migration awareness.", + "finding": "Extracts named API schema definitions (request/response models) from the contract's component definitions, listing up to 25 named schemas with a note when additional schemas are elided.", "sources": [ { "file": "wikifi/specialized/openapi.py", "lines": [ - 94, - 108 + 105, + 116 ], - "fingerprint": "ae97781309c4" + "fingerprint": "bdc664e7ad72" } ] }, { "section_id": "cross_cutting", - "finding": "Authentication schemes declared in the API contract are extracted and categorized by type. 
This ensures that security contracts (e.g., API key, OAuth, HTTP bearer) are recorded as cross-cutting concerns that must be preserved through migration.", + "finding": "Reads the declared security scheme types from the API contract and records the authentication contract for the API, capturing scheme categories such as API key, bearer token, or OAuth flows.", "sources": [ { "file": "wikifi/specialized/openapi.py", "lines": [ - 110, - 121 + 118, + 126 ], - "fingerprint": "ae97781309c4" + "fingerprint": "bdc664e7ad72" } ] }, { - "section_id": "integrations", - "finding": "Each parsed API contract contributes an inbound-integration finding recording the count of HTTP endpoints exposed to external consumers, establishing the external-facing API surface as a documented integration point.", + "section_id": "hard_specifications", + "finding": "When an API contract file is present but cannot be parsed, the system must emit an explicit warning finding directing migration teams to review the file manually rather than silently dropping it. Specs that exceed the parser's capability are flagged, not skipped.", "sources": [ { "file": "wikifi/specialized/openapi.py", "lines": [ - 83, - 92 - ], - "fingerprint": "ae97781309c4" + 24, + 37 + ], + "fingerprint": "bdc664e7ad72" } ] }, { "section_id": "external_dependencies", - "finding": "The system optionally relies on an external YAML parsing library for reading YAML-formatted API specifications.
When the library is unavailable, a built-in minimal parser handles the subset of YAML structures present in standard OpenAPI documents, ensuring no hard runtime dependency is introduced.", + "finding": "Optionally relies on a third-party YAML parsing library for full YAML spec support; when that library is absent the system falls back to an internal minimal YAML parser sufficient to read the OpenAPI keys it needs, so the library is a soft rather than hard dependency.", "sources": [ { "file": "wikifi/specialized/openapi.py", "lines": [ - 143, - 157 + 154, + 162 ], - "fingerprint": "ae97781309c4" + "fingerprint": "bdc664e7ad72" } ] } ] }, "wikifi/specialized/protobuf.py": { - "fingerprint": "e20d5913745a", - "summary": "Specialized extractor that parses interface definition (proto) files to surface message types as domain entities and service/RPC blocks as integration touchpoints for migration-ready contract analysis.", + "fingerprint": "5a5f77699e9b", + "summary": "Protobuf IDL extractor that parses protocol definition files to surface message types as domain entities, enum types as closed value sets, and service/RPC blocks as integration touchpoints for use by migration teams designing new interfaces.", "chunks_processed": 1, "findings": [ { "section_id": "intent", - "finding": "This extractor exists to treat interface definition files as pure contracts: its findings are intended to be read directly into interface design when re-implementing in a new stack. 
It bridges the gap between existing wire-protocol definitions and a migration team's understanding of what must be preserved.", + "finding": "This extractor treats protocol definition files as pure contracts — the findings it produces are intended to be read directly by a migration team redesigning system interfaces in a new stack, requiring no knowledge of the original implementation.", "sources": [ { "file": "wikifi/specialized/protobuf.py", @@ -3991,419 +4114,447 @@ 1, 8 ], - "fingerprint": "e20d5913745a" + "fingerprint": "5a5f77699e9b" } ] }, { "section_id": "capabilities", - "finding": "Parses protocol definition files to extract: named data types (messages) grouped by package, closed value sets (enums), service definitions, and individual remote procedure signatures including streaming variants on both input and output. Produces structured findings categorised by entity type and integration surface.", + "finding": "The system can parse interface definition files to enumerate all defined message types, closed-value enum types, named services, and their remote procedure calls — including whether any call leg uses a streaming transport. Results are summarised as counts of messages, services, and RPCs.", "sources": [ { "file": "wikifi/specialized/protobuf.py", "lines": [ - 27, + 26, 95 ], - "fingerprint": "e20d5913745a" + "fingerprint": "5a5f77699e9b" } ] }, { "section_id": "entities", - "finding": "Extracts protocol message types and enum types from interface definition files, recording each by name and source line. Messages are grouped under their package namespace; enums are surfaced separately as closed value sets. Up to 25 of each are reported verbatim; larger sets are truncated with a count.", + "finding": "Message types declared in a protocol definition are surfaced as named domain entities, grouped by their package namespace. Enum types are separately captured as closed value sets. 
Up to 25 of each kind are rendered verbatim; additional entries are noted as elided.", "sources": [ { "file": "wikifi/specialized/protobuf.py", "lines": [ - 44, - 68 + 42, + 60 ], - "fingerprint": "e20d5913745a" + "fingerprint": "5a5f77699e9b" } ] }, { "section_id": "integrations", - "finding": "Identifies every declared service and maps each of its remote procedures, capturing the procedure name, request message type, response message type, and whether either side is a streaming channel. Each service is emitted as a distinct integration touchpoint.", + "finding": "Each service block in a protocol definition is treated as an integration touchpoint. The extractor resolves which RPCs belong to each service by brace-counting to find the exact closing boundary, preventing cross-attribution in files with multiple services. Each RPC is described with its input and output message types, annotated with streaming direction where present.", "sources": [ { "file": "wikifi/specialized/protobuf.py", "lines": [ - 70, - 87 + 64, + 90 ], - "fingerprint": "e20d5913745a" + "fingerprint": "5a5f77699e9b" } ] }, { "section_id": "hard_specifications", - "finding": "The module explicitly designates proto file findings as direct inputs to interface design during migration, implying that message names, enum value sets, service names, RPC signatures, and streaming contracts must be preserved verbatim when porting to a new stack.", + "finding": "Service-to-RPC attribution must be computed by tracking brace depth (counting nested blocks) rather than by line proximity, ensuring each RPC is assigned only to the service whose block encloses it — a correctness invariant required for accurate integration inventories in multi-service files.", "sources": [ { "file": "wikifi/specialized/protobuf.py", "lines": [ - 1, - 8 + 62, + 67 ], - "fingerprint": "e20d5913745a" + "fingerprint": "5a5f77699e9b" } ] } ] }, "wikifi/specialized/sql.py": { - "fingerprint": "1ef5e77c4038", - "summary": "SQL DDL and 
migration extractor that converts CREATE TABLE, ALTER TABLE, and CREATE INDEX statements into structured findings about domain entities, relational links, and storage invariants.", + "fingerprint": "ebbdecc4c021", + "summary": "SQL DDL and migration extractor that parses table definitions, relationships, and constraints from schema files, producing structured findings about entities, relational links, and storage invariants for use by a migration team.", "chunks_processed": 1, "findings": [ { - "section_id": "capabilities", - "finding": "The system can parse SQL schema files and migration scripts to automatically discover domain entities, their fields, foreign-key relationships, uniqueness and nullability constraints, and index definitions — producing structured wiki findings without requiring a live database connection.", + "section_id": "intent", + "finding": "This extractor exists to help a migration team understand an existing database schema by systematically pulling entity definitions, column inventories, foreign-key relationships, and storage invariants out of SQL DDL and migration scripts. 
It distinguishes baseline schema files from incremental migration files so the team can identify forward-only schema changes versus original table definitions.", "sources": [ { "file": "wikifi/specialized/sql.py", "lines": [ - 56, - 62 + 1, + 13 ], - "fingerprint": "1ef5e77c4038" + "fingerprint": "ebbdecc4c021" } ] }, { "section_id": "capabilities", - "finding": "Migration files are parsed with the same logic as baseline DDL but are distinguished in output summaries, allowing the migration team to differentiate additive schema changes from the original baseline structure.", + "finding": "The system can parse SQL DDL files to enumerate every persisted entity with its columns, extract foreign-key relationships between entities, surface UNIQUE and NOT NULL storage invariants, record indexes as performance invariants that must be carried forward, and summarize ALTER TABLE operations. It separately tags migration files so that schema additions can be distinguished from the original baseline.", "sources": [ { "file": "wikifi/specialized/sql.py", "lines": [ - 56, - 62 + 57, + 130 ], - "fingerprint": "1ef5e77c4038" + "fingerprint": "ebbdecc4c021" } ] }, { "section_id": "entities", - "finding": "Each CREATE TABLE statement is treated as a domain entity: the extractor captures the table name, all column names, foreign key edges, and storage constraints as structured entity findings.", + "finding": "The internal `_TableHit` model captures a parsed database table: its name, the source line where it is defined, the raw body text, a list of column names, and a list of foreign-key edges expressed as (local column, referenced table, referenced column) tuples. 
This model is the intermediate representation that drives all downstream entity and relationship findings.", "sources": [ { "file": "wikifi/specialized/sql.py", "lines": [ - 64, - 84 + 50, + 58 ], - "fingerprint": "1ef5e77c4038" + "fingerprint": "ebbdecc4c021" } ] }, { - "section_id": "entities", - "finding": "ALTER TABLE statements are also tracked as entity-level findings, recording what schema evolution has been applied to a given entity over time (e.g., added columns, dropped constraints).", + "section_id": "integrations", + "finding": "Foreign-key references between tables are recorded as hard relational links between entities — a column in one table references a column in another, which constrains how the two entities may be separated or migrated independently. Both explicit FOREIGN KEY constraint syntax and inline column-level REFERENCES syntax are detected.", "sources": [ { "file": "wikifi/specialized/sql.py", "lines": [ - 99, - 111 + 88, + 98 ], - "fingerprint": "1ef5e77c4038" + "fingerprint": "ebbdecc4c021" } ] }, { - "section_id": "integrations", - "finding": "Foreign key declarations — both explicit FOREIGN KEY clauses and inline REFERENCES annotations on column definitions — are surfaced as hard relational links between entities, capturing which field on one entity points into another.", + "section_id": "cross_cutting", + "finding": "UNIQUE and NOT NULL constraints on any table are surfaced as storage invariants that the target system must honour. 
Additionally, every index definition is flagged as a query-time performance invariant explicitly annotated as something 'the new system must preserve'.", "sources": [ { "file": "wikifi/specialized/sql.py", "lines": [ - 86, - 96 + 100, + 121 ], - "fingerprint": "1ef5e77c4038" + "fingerprint": "ebbdecc4c021" } ] }, { - "section_id": "cross_cutting", - "finding": "UNIQUE and NOT NULL constraints found within a table definition are extracted as storage invariants that the system flags must be preserved across any migration or re-implementation.", + "section_id": "hard_specifications", + "finding": "Indexes are explicitly declared to encode query-time performance invariants that must be preserved through migration — the extractor emits this requirement verbatim in every index finding so it is not lost in the migration report.", "sources": [ { "file": "wikifi/specialized/sql.py", "lines": [ - 97, - 98 + 115, + 121 ], - "fingerprint": "1ef5e77c4038" + "fingerprint": "ebbdecc4c021" } ] }, { "section_id": "cross_cutting", - "finding": "Index definitions (CREATE INDEX) are recorded as query-time performance invariants, with an explicit note that the new system must preserve them.", + "finding": "Migration files are counted differently from baseline DDL: the summary tracks all tables touched by either CREATE TABLE or ALTER TABLE operations, ensuring that ALTER-only migrations are not misleadingly reported as touching zero tables when browsing the migration report.", "sources": [ { "file": "wikifi/specialized/sql.py", "lines": [ - 113, - 125 + 123, + 130 ], - "fingerprint": "1ef5e77c4038" + "fingerprint": "ebbdecc4c021" } ] - }, + } + ] + }, + "wikifi/providers/anthropic_provider.py": { + "fingerprint": "fe8422f0e6c5", + "summary": "Implements the hosted-Claude LLM provider used by the wikifi pipeline for structured per-file extraction, with prompt caching and adaptive reasoning depth to make large-scale codebase walks economically viable.", + "chunks_processed": 1, + 
"findings": [ { - "section_id": "hard_specifications", - "finding": "Indexes are explicitly annotated as performance invariants that 'the new system must preserve,' establishing a carry-forward requirement for any target platform.", + "section_id": "intent", + "finding": "The file's module-level docstring explains a core economic constraint: running a large multi-KB system prompt across hundreds of per-file extraction calls at full input-token price is uneconomical on large codebases. The solution is to mark the repeated system prompt as cacheable so subsequent calls pay roughly 10% of the normal input price, making hosted AI extraction cost-competitive with local alternatives at better quality.", "sources": [ { - "file": "wikifi/specialized/sql.py", + "file": "wikifi/providers/anthropic_provider.py", "lines": [ - 117, - 122 + 1, + 19 ], - "fingerprint": "1ef5e77c4038" + "fingerprint": "fe8422f0e6c5" } ] }, { - "section_id": "hard_specifications", - "finding": "UNIQUE and NOT NULL constraints are treated as storage-level invariants that must survive migration, not merely advisory metadata.", + "section_id": "external_dependencies", + "finding": "The system depends on Anthropic's hosted Claude API for all LLM inference on this path. The default model is `claude-opus-4-7`, described as the most capable option for migration-grade domain extraction. The API key is sourced from the `ANTHROPIC_API_KEY` environment variable. 
A configurable HTTP timeout (default 900 seconds) guards against long-running inference calls.", "sources": [ { - "file": "wikifi/specialized/sql.py", + "file": "wikifi/providers/anthropic_provider.py", "lines": [ - 97, - 98 + 83, + 100 ], - "fingerprint": "1ef5e77c4038" + "fingerprint": "fe8422f0e6c5" } ] - } - ] - }, - "wikifi/providers/anthropic_provider.py": { - "fingerprint": "872020d40ac3", - "summary": "Implements the hosted-AI provider by calling an external large-language-model service, with a prompt-caching strategy that makes large-scale codebase walks economically viable.", - "chunks_processed": 1, - "findings": [ + }, { - "section_id": "external_dependencies", - "finding": "The system depends on a hosted large-language-model service (Anthropic Claude) for structured extraction and free-text generation. The API key is resolved from an environment variable (`ANTHROPIC_API_KEY`) or injected at construction time, and a configurable HTTP timeout guards against long-running inference calls.", + "section_id": "capabilities", + "finding": "The provider exposes three interaction modes: (1) structured JSON extraction where the model returns a schema-validated domain object directly, (2) free-text completion for open-ended responses, and (3) multi-turn conversational chat for interactive or iterative workflows. All three modes share the same prompt-caching and adaptive-reasoning configuration.", "sources": [ { "file": "wikifi/providers/anthropic_provider.py", "lines": [ - 100, - 108 + 107, + 186 ], - "fingerprint": "872020d40ac3" + "fingerprint": "fe8422f0e6c5" } ] }, { - "section_id": "integrations", - "finding": "This module is the outbound integration point to the hosted LLM service. It is consumed by the orchestrator (`wikifi/orchestrator.py`) to drive all per-file extraction, aggregation, and derivation calls. 
Three interaction shapes are exposed: schema-validated structured output, free-text completion, and multi-turn chat.", + "section_id": "capabilities", + "finding": "The provider supports configurable reasoning depth ('low', 'medium', 'high', 'max') that controls how deeply the model deliberates before producing output. This knob is intentionally mirrored to match the interface of the local (Ollama) provider so the rest of the pipeline does not need to branch on which provider is active.", "sources": [ { "file": "wikifi/providers/anthropic_provider.py", "lines": [ - 115, - 175 + 226, + 245 ], - "fingerprint": "872020d40ac3" + "fingerprint": "fe8422f0e6c5" } ] }, { "section_id": "cross_cutting", - "finding": "A prompt-caching mechanism marks the large, repeated system prompt with an ephemeral cache breakpoint so that only the first call in a pipeline walk pays full input-token cost; subsequent calls pay roughly 10% of that cost as a cache read. The module's own documentation states this is the critical cost control that makes hosted inference economical on codebases of tens of thousands of files.", + "finding": "All API errors from the hosted service are caught and re-raised as a uniform internal error type carrying the original request ID. This preserves the pipeline's existing per-call fallback paths, which catch broad exceptions, without leaking provider-specific error types into the orchestration layer.", "sources": [ { "file": "wikifi/providers/anthropic_provider.py", "lines": [ - 193, - 210 + 119, + 127 ], - "fingerprint": "872020d40ac3" + "fingerprint": "fe8422f0e6c5" } ] }, { "section_id": "cross_cutting", - "finding": "All errors from the external LLM service are caught and re-raised as a normalized internal error type, carrying the provider's request identifier when available. 
This preserves the pipeline's existing per-call fallback behaviour without leaking provider-specific error types into the broader system.", + "finding": "The system prompt is wrapped in a cacheable block on every call, so the large shared extraction prompt is billed at the cache-read rate for all calls after the first within a cache window. This is described as 'the entire cost story for hosted runs' and is enabled by default but can be disabled.", "sources": [ { "file": "wikifi/providers/anthropic_provider.py", "lines": [ - 238, - 244 + 196, + 211 ], - "fingerprint": "872020d40ac3" + "fingerprint": "fe8422f0e6c5" } ] }, { "section_id": "cross_cutting", - "finding": "A configurable reasoning-depth control (`think`) is translated into the external API's adaptive-thinking feature, allowing callers to trade inference latency and cost against extraction quality without branching on provider type elsewhere in the codebase.", + "finding": "When the model returns neither a parsed structured object nor any text content, a diagnostic message surfaces the stop reason, output token count, and configured token budget, along with operational hints (raise the token limit or lower reasoning effort) to help operators resolve the failure at the point it occurs.", "sources": [ { "file": "wikifi/providers/anthropic_provider.py", "lines": [ - 212, - 232 + 255, + 275 ], - "fingerprint": "872020d40ac3" + "fingerprint": "fe8422f0e6c5" } ] }, { "section_id": "hard_specifications", - "finding": "Sampling parameters (temperature, top_p, top_k) must not be sent to the claude-opus-4-7 model variant — doing so causes a 400 error. The provider explicitly omits these parameters for this model generation, making their absence a hard constraint carried forward with the provider implementation.", + "finding": "Sampling parameters (temperature, top_p, top_k) must NOT be sent to the claude-opus-4-7 model — doing so causes a 400 error. 
The provider omits them entirely rather than conditionally including them.", "sources": [ { "file": "wikifi/providers/anthropic_provider.py", "lines": [ - 17, - 21 + 14, + 17 ], - "fingerprint": "872020d40ac3" + "fingerprint": "fe8422f0e6c5" } ] }, { "section_id": "hard_specifications", - "finding": "The maximum output token budget per call is set at 16,000 tokens. This is documented as comfortable headroom for any section schema response while staying within the SDK's non-streaming HTTP timeout guard, making it an operationally important default that should not be reduced without re-validating pipeline completions.", + "finding": "The default maximum output token budget is 32,000. The rationale is that adaptive reasoning at high effort can consume the entire budget before the structured output block is produced; too low a value causes the model to return an empty structured response. Callers using the highest effort levels are advised to increase this limit and enable streaming.", "sources": [ { "file": "wikifi/providers/anthropic_provider.py", "lines": [ - 72, - 76 + 70, + 79 + ], + "fingerprint": "fe8422f0e6c5" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Structured output is obtained via the SDK's schema-constrained decoding path (messages.parse), not via manually constructed tool-use blocks. The fallback path attempts to parse the raw text block as JSON if the primary parsed output is absent, before raising an error.", + "sources": [ + { + "file": "wikifi/providers/anthropic_provider.py", + "lines": [ + 107, + 145 + ], + "fingerprint": "fe8422f0e6c5" + } + ] + }, + { + "section_id": "integrations", + "finding": "This provider implements the LLMProvider interface defined in the base provider module and is consumed by the orchestrator. 
The interface contract includes three methods — structured JSON completion, free-text completion, and multi-turn chat — which the orchestrator calls without branching on which concrete provider is active.", + "sources": [ + { + "file": "wikifi/providers/anthropic_provider.py", + "lines": [ + 83, + 106 ], - "fingerprint": "872020d40ac3" + "fingerprint": "fe8422f0e6c5" } ] } ] }, "wikifi/providers/base.py": { - "fingerprint": "2750f0f56327", - "summary": "Defines the minimal provider abstraction that all language-model backends must satisfy, enabling the rest of the system to swap underlying AI services without changing call sites.", + "fingerprint": "f40d924f0cb0", + "summary": "Defines the abstract contract every LLM backend must implement, exposing exactly three interaction modes used throughout the wiki-generation pipeline.", "chunks_processed": 1, "findings": [ { "section_id": "intent", - "finding": "The provider protocol is explicitly designed to be minimal so that switching between different AI backends (local, hosted, or mock) requires changing only a single class, keeping the rest of the system decoupled from any particular AI service.", + "finding": "The provider abstraction exists so that any LLM backend (local, hosted, or mock) can be substituted with a single-class change. 
The rest of the system never calls anything beyond these three surfaces, keeping the integration boundary explicit and narrow.", "sources": [ { "file": "wikifi/providers/base.py", "lines": [ 1, - 16 + 17 ], - "fingerprint": "2750f0f56327" + "fingerprint": "f40d924f0cb0" } ] }, { "section_id": "capabilities", - "finding": "The system interacts with language models in three distinct modes: structured JSON generation (used for repository introspection, per-file extraction, and section aggregation), free-form markdown generation (used for diagram passes), and stateful multi-turn conversation (used for the interactive chat REPL).", + "finding": "The system interacts with language models in three distinct modes: (1) structured single-shot completion where the model returns a validated structured document, used for introspection, per-file extraction, and aggregation stages; (2) free-text single-shot completion for unstructured narrative output such as diagram sections; and (3) multi-turn conversation that preserves history across turns, used in the interactive REPL.", "sources": [ { "file": "wikifi/providers/base.py", "lines": [ - 1, - 16 + 5, + 14 ], - "fingerprint": "2750f0f56327" + "fingerprint": "f40d924f0cb0" } ] }, { - "section_id": "integrations", - "finding": "The provider abstraction is the outbound integration boundary between wikifi's pipeline stages and any AI model backend; all pipeline components (introspection, extractor, aggregator, deriver, critic, chat, orchestrator) call through this interface rather than directly to a specific service.", + "section_id": "entities", + "finding": "A `ChatMessage` carries two fields — a `role` identifier and a `content` string — representing one turn in a multi-turn exchange. 
The `LLMProvider` entity carries a `name` (provider identity) and a `model` (specific model variant) and is the sole point of contact between the pipeline and any language model backend.", "sources": [ { "file": "wikifi/providers/base.py", "lines": [ - 30, - 48 + 33, + 52 ], - "fingerprint": "2750f0f56327" + "fingerprint": "f40d924f0cb0" } ] }, { - "section_id": "entities", - "finding": "A `ChatMessage` entity carries a `role` and `content` field and represents a single turn in a multi-turn conversation; lists of these are passed to the chat mode to maintain conversation history across turns.", + "section_id": "cross_cutting", + "finding": "All hosted provider backends share a single error-formatting routine that extracts a vendor-issued request identifier when present, producing a consistent diagnostic string across all backends. This ensures that failure messages are uniformly attributable regardless of which backend is active.", "sources": [ { "file": "wikifi/providers/base.py", "lines": [ - 28, - 30 + 54, + 63 ], - "fingerprint": "2750f0f56327" + "fingerprint": "f40d924f0cb0" } ] }, { - "section_id": "cross_cutting", - "finding": "Structured output validation is a cross-cutting concern: the `complete_json` mode requires the model response to be validated against a declared schema before being returned, ensuring type-safe data flows through every pipeline stage that uses it.", + "section_id": "hard_specifications", + "finding": "The three abstract methods — structured completion, text completion, and chat — constitute the complete and exclusive contract that the rest of the system relies on; no other methods on a provider are ever invoked. 
Any conforming implementation must satisfy all three signatures exactly.", "sources": [ { "file": "wikifi/providers/base.py", "lines": [ - 36, - 38 + 42, + 52 ], - "fingerprint": "2750f0f56327" + "fingerprint": "f40d924f0cb0" } ] } ] }, "wikifi/providers/ollama_provider.py": { - "fingerprint": "0a21916665a5", - "summary": "Ollama-backed LLM provider that handles schema-enforced structured output, free-text generation, and multi-turn chat by connecting to a locally-hosted inference runtime.", + "fingerprint": "dda16c755eff", + "summary": "Concrete implementation of the LLM provider contract backed by a locally-hosted language model service, with explicit controls for structured-output reliability and reasoning depth.", "chunks_processed": 1, "findings": [ { "section_id": "external_dependencies", - "finding": "The system depends on a locally-hosted Ollama inference runtime reachable at a configurable host address. Ollama serves as the sole LLM backend, handling structured JSON output (via schema enforcement), free-text completion, and multi-turn chat. A per-connection timeout (defaulting to 900 seconds) gates all calls to this service.", + "finding": "The system relies on a locally-hosted language model service (Ollama) for all AI inference. It connects via a configurable host address and timeout, and uses the service's native schema-enforcement mechanism to obtain structured JSON responses.", "sources": [ { "file": "wikifi/providers/ollama_provider.py", "lines": [ - 50, - 56 + 52, + 52 ], - "fingerprint": "0a21916665a5" + "fingerprint": "dda16c755eff" } ] }, { - "section_id": "integrations", - "finding": "The orchestrator calls this provider for all LLM inference work. The provider, in turn, calls the Ollama service for three interaction modes: schema-constrained structured extraction, unconstrained text generation, and stateful multi-turn conversation. 
This provider is the exclusive outbound integration boundary between the system and any language model.", + "section_id": "capabilities", + "finding": "The provider exposes three interaction modes: structured JSON extraction (with schema validation), free-form text completion, and multi-turn conversation. Structured extraction pins output temperature to zero to guarantee reproducibility across identical inputs.", "sources": [ { "file": "wikifi/providers/ollama_provider.py", "lines": [ 58, - 95 + 91 ], - "fingerprint": "0a21916665a5" + "fingerprint": "dda16c755eff" } ] }, { "section_id": "cross_cutting", - "finding": "Temperature is hard-pinned to 0 on every structured-output call, enforcing determinism so that identical inputs reliably produce identical structured results across repeated runs. The text and chat paths leave temperature at the model default, accepting variability in exchange for naturalness.", + "finding": "Temperature is fixed at zero for all structured-output calls so that the same input always produces the same structured result across runs; free-text and chat calls inherit the model's default temperature. This is a non-negotiable invariant for the JSON extraction path.", "sources": [ { "file": "wikifi/providers/ollama_provider.py", @@ -4411,154 +4562,460 @@ 58, 68 ], - "fingerprint": "0a21916665a5" + "fingerprint": "dda16c755eff" } ] }, { "section_id": "hard_specifications", - "finding": "Qwen3-family models must not be invoked with think=False on the structured-output path: doing so causes the model to bypass the schema constraint and emit free text, which fails downstream validation. The thinking level must be 'low' or higher to preserve schema compliance. 
For the derivative-section synthesis pass, 'high' thinking is the preferred setting for output quality, but callers must budget 1–3 minutes per file and configure the timeout to at least 900 seconds to absorb that latency.", + "finding": "Disabling the reasoning trace (think=False) on Qwen3-family models causes them to ignore the JSON schema constraint and emit free text, breaking validation. The system therefore defaults reasoning to 'high' and must never disable it for Qwen3-style models used in the structured-output path.", "sources": [ { "file": "wikifi/providers/ollama_provider.py", "lines": [ - 14, - 32 + 9, + 27 ], - "fingerprint": "0a21916665a5" + "fingerprint": "dda16c755eff" } ] }, { - "section_id": "capabilities", - "finding": "The provider exposes three distinct LLM interaction modes: structured extraction (response validated against a caller-supplied schema), open-ended text generation, and multi-turn conversational exchange. This separation allows the orchestrator to select the right interaction pattern for each processing stage (per-file extraction vs. derivative synthesis vs. interactive refinement).", + "section_id": "hard_specifications", + "finding": "The default request timeout is 900 seconds, chosen to absorb the 1–3 minute per-file latency observed at the 'high' thinking level on local 27B-parameter models. Reducing this timeout risks aborting in-progress reasoning traces.", "sources": [ { "file": "wikifi/providers/ollama_provider.py", "lines": [ - 58, - 95 + 50, + 54 + ], + "fingerprint": "dda16c755eff" + } + ] + }, + { + "section_id": "integrations", + "finding": "This component implements the shared LLMProvider interface defined in the base provider module and is consumed by the orchestrator. 
It bridges the orchestrator's abstract completion requests to the concrete locally-hosted model service.", + "sources": [ + { + "file": "wikifi/providers/ollama_provider.py", + "lines": [ + 44, + 46 + ], + "fingerprint": "dda16c755eff" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "A companion content-size filter (min_content_bytes, described in the docstring as living in the walker) acts as a guard to prevent near-empty files from reaching the extractor and triggering expensive, potentially timeout-inducing reasoning traces.", + "sources": [ + { + "file": "wikifi/providers/ollama_provider.py", + "lines": [ + 28, + 35 ], - "fingerprint": "0a21916665a5" + "fingerprint": "dda16c755eff" } ] } ] }, "wikifi/providers/openai_provider.py": { - "fingerprint": "428df9ba13f1", - "summary": "Implements the OpenAI-hosted LLM backend for wikifi, providing structured-output, free-text, and multi-turn chat completions with automatic prompt-caching and reasoning-effort control.", + "fingerprint": "a64fb7819574", + "summary": "OpenAI-backed LLM provider that handles structured-output extraction, free-text completion, and multi-turn chat against OpenAI-hosted models, with automatic prompt caching and reasoning-effort routing.", "chunks_processed": 1, "findings": [ { - "section_id": "external_dependencies", - "finding": "The system depends on OpenAI's hosted language model API for all inference when this provider is selected. It is activated via an environment variable (`WIKIFI_PROVIDER=openai`) and an API key, and supports an optional custom base URL to point at compatible third-party endpoints.", + "section_id": "integrations", + "finding": "This module is one of three selectable hosted LLM backends (alongside local and Anthropic options). 
It is activated by setting the provider selector to `openai` and supplying an API key, and is invoked by the orchestrator for per-file extraction passes and synthesis steps.", "sources": [ { "file": "wikifi/providers/openai_provider.py", "lines": [ 1, - 10 + 9 ], - "fingerprint": "428df9ba13f1" + "fingerprint": "a64fb7819574" } ] }, { - "section_id": "integrations", - "finding": "This provider is one of three selectable backends (alongside local and Anthropic-hosted options) consumed by the orchestrator. It implements the shared provider protocol defined in `wikifi/providers/base.py`, exposing structured-JSON, free-text, and multi-turn chat completion methods that the orchestrator calls during per-file extraction and synthesis passes.", + "section_id": "external_dependencies", + "finding": "Depends on the OpenAI hosted API for all language-model inference. The API is used in three modes: schema-constrained structured decoding (returning validated domain objects), free-text completion, and multi-turn conversational chat. API errors surface as normalised runtime failures.", "sources": [ { "file": "wikifi/providers/openai_provider.py", "lines": [ - 1, - 8 + 113, + 175 ], - "fingerprint": "428df9ba13f1" + "fingerprint": "a64fb7819574" } ] }, { "section_id": "capabilities", - "finding": "The provider supports three interaction modes: schema-constrained structured output (returning a validated domain object), free-text generation, and stateful multi-turn conversation. This covers the full range of interactions wikifi needs — per-file structured extraction, narrative synthesis, and interactive Q&A.", + "finding": "Provides three interaction modes with the hosted model: (1) structured extraction that returns a fully validated domain-findings object matching a declared schema; (2) open-ended text generation; and (3) multi-turn chat for iterative wiki synthesis. 
All three modes share a unified output-token cap and reasoning-effort configuration.", "sources": [ { "file": "wikifi/providers/openai_provider.py", "lines": [ - 120, - 185 + 113, + 175 ], - "fingerprint": "428df9ba13f1" + "fingerprint": "a64fb7819574" } ] }, { "section_id": "cross_cutting", - "finding": "API errors are caught and re-raised with a normalised diagnostic message that includes the upstream request identifier when available, preserving traceability across the provider boundary without leaking raw SDK exceptions to callers.", + "finding": "Prompt caching is exploited by placing the large, repeated extraction system prompt at message position 0; OpenAI automatically caches identical prefixes of ≥ 1024 tokens for roughly 5–10 minutes, reducing latency and cost across the many per-file calls in a single wiki walk.", "sources": [ { "file": "wikifi/providers/openai_provider.py", "lines": [ - 248, - 255 + 14, + 18 ], - "fingerprint": "428df9ba13f1" + "fingerprint": "a64fb7819574" } ] }, { "section_id": "cross_cutting", - "finding": "Prompt caching is exploited automatically by placing the system prompt at message position 0 in every call; the hosted service caches identical long prefixes, reducing latency and cost for the repeated multi-kilobyte extraction prompt that is sent once per source file.", + "finding": "All API failures are caught and re-raised as normalised runtime errors via a shared `format_api_error` helper, ensuring that provider-specific error shapes do not leak into the orchestration layer.", "sources": [ { "file": "wikifi/providers/openai_provider.py", "lines": [ - 13, - 17 + 128, + 135 ], - "fingerprint": "428df9ba13f1" + "fingerprint": "a64fb7819574" } ] }, { "section_id": "hard_specifications", - "finding": "Reasoning-capable model families (identified by name prefix) must receive output-token limits via a distinct parameter name from standard chat models; sending the wrong parameter to either family causes a request failure. 
The provider routes the correct parameter unconditionally based on model identity.", + "finding": "Reasoning-capable model families (identified by the prefixes `o` or `gpt-5`) must receive `max_completion_tokens` instead of `max_tokens`, and may optionally receive a `reasoning_effort` value of `low`, `medium`, or `high`. Non-reasoning models must not receive `reasoning_effort` to avoid API validation errors.", "sources": [ { "file": "wikifi/providers/openai_provider.py", "lines": [ 215, - 226 + 235 ], - "fingerprint": "428df9ba13f1" + "fingerprint": "a64fb7819574" } ] }, { "section_id": "hard_specifications", - "finding": "The `think` (reasoning-effort) knob must only be forwarded to reasoning-capable models; forwarding it to a plain chat model risks a validation error from the hosted service. The mapping from wikifi's internal knob values (`low`, `medium`, `high`) to the API's accepted values is fixed and must be preserved.", + "finding": "The default output token cap is 16,000 tokens per call, chosen to accommodate the largest structured findings schema without hitting SDK HTTP timeout guards. The default model is `gpt-4o` and the default per-call timeout is 900 seconds.", "sources": [ { "file": "wikifi/providers/openai_provider.py", "lines": [ - 200, - 214 + 59, + 66 ], - "fingerprint": "428df9ba13f1" + "fingerprint": "a64fb7819574" } ] }, { "section_id": "hard_specifications", - "finding": "When the hosted service returns a response that cannot be parsed into the expected structured schema (e.g. due to refusal or truncation), the system falls back to direct JSON validation of the raw text rather than returning a null result, preserving the protocol contract that callers always receive a validated object or an explicit error.", + "finding": "When the structured-output parse path returns no parsed object (e.g. 
due to a refusal or truncation), the implementation must fall back to validating the raw JSON text against the schema rather than returning a null, preserving the provider protocol's contract of raising on failure rather than silently returning nothing.", "sources": [ { "file": "wikifi/providers/openai_provider.py", "lines": [ - 150, - 163 + 136, + 144 + ], + "fingerprint": "a64fb7819574" + } + ] + } + ] + }, + "Dockerfile": { + "fingerprint": "a3f802d0c632", + "summary": "Multi-stage build placeholder with no domain logic.", + "chunks_processed": 1, + "findings": [] + }, + "Makefile": { + "fingerprint": "961d8c040205", + "summary": "Build and developer-workflow automation for the wikifi project.", + "chunks_processed": 1, + "findings": [] + }, + "docker-compose.yml": { + "fingerprint": "26be8a812822", + "summary": "Local development environment definition providing a persistent relational database service.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "external_dependencies", + "finding": "A relational database service is used for persistent data storage, configured with a dedicated user, password, and database named 'wikifi'. 
Data is persisted across restarts via a named volume.", + "sources": [ + { + "file": "docker-compose.yml", + "lines": [ + 2, + 11 + ], + "fingerprint": "26be8a812822" + } + ] + } + ] + }, + "pyproject.toml": { + "fingerprint": "e9bb63d5a6a9", + "summary": "Project manifest declaring wikifi's identity, dependencies, and entry point as a codebase-documentation tool.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "wikifi exists to walk any codebase and produce a technology-agnostic markdown wiki of its intent — helping teams understand, migrate, or document software independent of its implementation details.", + "sources": [ + { + "file": "pyproject.toml", + "lines": [ + 3, + 3 + ], + "fingerprint": "e9bb63d5a6a9" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The tool exposes a single command-line entry point (`wikifi`) that orchestrates codebase traversal and wiki generation.", + "sources": [ + { + "file": "pyproject.toml", + "lines": [ + 19, + 21 + ], + "fingerprint": "e9bb63d5a6a9" + } + ] + }, + { + "section_id": "external_dependencies", + "finding": "Three distinct LLM back-ends are supported: a locally-hosted model service (Ollama), the Anthropic Claude API, and the OpenAI API. 
Each plays the role of the inference engine that converts source code into domain-language documentation.", + "sources": [ + { + "file": "pyproject.toml", + "lines": [ + 11, + 17 + ], + "fingerprint": "e9bb63d5a6a9" + } + ] + }, + { + "section_id": "external_dependencies", + "finding": "A file-pattern library (pathspec) is used to control which files are included or excluded during codebase traversal, likely respecting .gitignore-style rules.", + "sources": [ + { + "file": "pyproject.toml", + "lines": [ + 16, + 16 + ], + "fingerprint": "e9bb63d5a6a9" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Structured data validation and settings management are enforced throughout the system via a schema/validation library, ensuring configuration and model outputs conform to expected shapes.", + "sources": [ + { + "file": "pyproject.toml", + "lines": [ + 12, + 13 + ], + "fingerprint": "e9bb63d5a6a9" + } + ] + } + ] + }, + "wikifi/specialized/dispatch.py": { + "fingerprint": "cec0697482a9", + "summary": "Routes each recognized file kind to the appropriate specialized extractor, or returns None to let the file fall through to the general LLM extraction path.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "The system distinguishes structured contract files (schemas, interface definitions, API specs, migrations) from general application code, recognizing that their machine-readable structure can be extracted more accurately and efficiently by targeted parsers than by a general prose LLM extractor. 
The dispatch layer enforces this routing decision so that the LLM path is reserved only for files where structure is implicit.", + "sources": [ + { + "file": "wikifi/specialized/dispatch.py", + "lines": [ + 1, + 13 + ], + "fingerprint": "cec0697482a9" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system can route files to four specialized extraction paths — SQL schemas, OpenAPI specifications, Protobuf definitions, and GraphQL schemas — as well as a dedicated SQL migration extraction path, based on the file's classified kind and, for migrations, its path suffix.", + "sources": [ + { + "file": "wikifi/specialized/dispatch.py", + "lines": [ + 44, + 62 + ], + "fingerprint": "cec0697482a9" + } + ] + }, + { + "section_id": "hard_specifications", + "finding": "Only migration files with `.sql` or `.ddl` suffixes are sent to the SQL migration extractor; all other migration files (e.g. Python Alembic scripts, Django initial migrations, Knex JavaScript migrations) must fall through to the LLM extraction path. This rule is enforced by inspecting the file path suffix, not just the file kind classification.", + "sources": [ + { + "file": "wikifi/specialized/dispatch.py", + "lines": [ + 28, + 62 + ], + "fingerprint": "cec0697482a9" + } + ] + }, + { + "section_id": "integrations", + "finding": "This module acts as the internal integration hub between the upstream file classifier (repograph) and the downstream specialized extractors (sql, openapi, protobuf, graphql). 
The upstream classifier tags every file in a migrations directory uniformly; this layer narrows that coarse classification into actionable routing decisions.", + "sources": [ + { + "file": "wikifi/specialized/dispatch.py", + "lines": [ + 36, + 62 + ], + "fingerprint": "cec0697482a9" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Lazy imports of the specialized extractor modules are used to keep this dispatch module cheap to load and to avoid circular dependency issues within the specialized package hierarchy — a structural invariant that must be preserved as the extractor set grows.", + "sources": [ + { + "file": "wikifi/specialized/dispatch.py", + "lines": [ + 40, + 43 + ], + "fingerprint": "cec0697482a9" + } + ] + } + ] + }, + "wikifi/specialized/models.py": { + "fingerprint": "32d041c141a3", + "summary": "Defines the shared result types and extractor contract used by specialized (non-LLM) extractors for schema and IDL files.", + "chunks_processed": 1, + "findings": [ + { + "section_id": "intent", + "finding": "Specialized extractors exist to bypass LLM processing for schema and interface-definition files, producing structured findings through a deterministic code path. 
Their output is intentionally compatible with the LLM extractor's output so both flow into a single unified notes store.", + "sources": [ + { + "file": "wikifi/specialized/models.py", + "lines": [ + 1, + 8 + ], + "fingerprint": "32d041c141a3" + } + ] + }, + { + "section_id": "capabilities", + "finding": "The system can analyze schema and IDL files (such as GraphQL, OpenAPI, Protobuf, SQL) using dedicated extractors rather than general-purpose language model inference, short-circuiting the LLM for these well-structured inputs.", + "sources": [ + { + "file": "wikifi/specialized/models.py", + "lines": [ + 4, + 6 + ], + "fingerprint": "32d041c141a3" + } + ] + }, + { + "section_id": "entities", + "finding": "A `SpecializedFinding` represents a single extracted insight tied to a wiki section, carrying the section identifier, the finding text, and one or more source references indicating where in the file the finding originates.", + "sources": [ + { + "file": "wikifi/specialized/models.py", + "lines": [ + 19, + 22 + ], + "fingerprint": "32d041c141a3" + } + ] + }, + { + "section_id": "entities", + "finding": "A `SpecializedResult` aggregates a list of `SpecializedFinding` items and an optional summary string, forming the complete output of a single specialized extractor run over one file.", + "sources": [ + { + "file": "wikifi/specialized/models.py", + "lines": [ + 25, + 27 + ], + "fingerprint": "32d041c141a3" + } + ] + }, + { + "section_id": "cross_cutting", + "finding": "Specialized extractor output conforms to the same contract as LLM extractor output — both write to the same notes store — ensuring that the downstream wiki-building pipeline is agnostic to which extraction path produced a given finding.", + "sources": [ + { + "file": "wikifi/specialized/models.py", + "lines": [ + 4, + 8 + ], + "fingerprint": "32d041c141a3" + } + ] + }, + { + "section_id": "integrations", + "finding": "The extractor function type `(rel_path, text) -> SpecializedResult` defines the 
internal integration contract between the dispatch layer and each specialized extractor (GraphQL, OpenAPI, Protobuf, SQL), as well as between all extractors and the shared evidence/notes store.", + "sources": [ + { + "file": "wikifi/specialized/models.py", + "lines": [ + 30, + 31 ], - "fingerprint": "428df9ba13f1" + "fingerprint": "32d041c141a3" } ] } diff --git a/.wikifi/domains.md b/.wikifi/domains.md index 1508c89..38a24fd 100644 --- a/.wikifi/domains.md +++ b/.wikifi/domains.md @@ -2,41 +2,48 @@ ## Core Domain -The system's core domain is **codebase knowledge extraction**: ingesting an existing source base, classifying its contents, deriving domain findings from individual files, and synthesising those findings into a structured, technology-agnostic wiki. The primary consumers are migration teams who need to understand business intent, domain structure, and operational behaviour before re-implementing or replacing a legacy system. +The system's core domain is **codebase knowledge extraction**: reasoning about an arbitrary repository's structure, intent, and behaviour, then representing that understanding as a technology-agnostic, human-readable wiki. The domain is explicitly decoupled from any recognition of specific languages, frameworks, or runtimes — tech-agnosticism is a first-class constraint enforced at the analysis level, not merely a presentation concern. -## Subdomains +## Primary Subdomains ### Repository Introspection -This subdomain concerns discovering and classifying the files that make up a target codebase. Its central responsibility is distinguishing production source that encodes business intent from infrastructure, tooling, and other artefacts that do not. Tech-agnosticism is a first-class constraint here: the classification logic must not rely on recognising any specific language, framework, or runtime. 
+This subdomain covers the initial act of understanding a repository: discovering which paths exist, classifying files by kind, resolving import relationships, and deciding which parts of the codebase encode genuine business intent versus infrastructure or tooling noise. The output is a curated inclusion set that drives all downstream work. ### Per-File Knowledge Extraction -Once relevant files are identified, each is analysed independently to surface domain findings. This subdomain covers the full extraction loop — examining file content, applying domain heuristics, and producing structured evidence — and forms the first phase of wiki generation (primary sections). +Operating over the inclusion set produced by introspection, this subdomain extracts intent-bearing findings from individual source files, organised by wiki section. It encompasses caching and memoisation of extraction results, cross-file context derived from the import graph, and chunk-level deduplication to prevent redundant evidence. -### Section Synthesis and Aggregation -The second phase of wiki generation operates over the evidence produced by per-file extraction. It aggregates findings across files into coherent wiki sections, derives higher-level content that cannot be inferred from any single file, and enforces the dependency ordering between primary (evidence-driven) and derivative (aggregated) sections. This ordering is a structural design constraint, not merely a runtime convention. +### Documentation Synthesis +This subdomain aggregates per-file findings into coherent wiki sections and then derives higher-level artifacts (narrative summaries, personas, diagrams) from those aggregates. A critical design constraint enforced structurally is the **dependency ordering** between primary evidence extraction and derivative synthesis: derivative sections may only consume content that primary sections have already produced. 
-### Wiki Authoring and Organisation -A secondary domain governs how extracted knowledge is structured and stored. It defines the taxonomy of sections, distinguishes primary from derivative content, and produces output that a migration team can navigate and consume independently of the source codebase. +## Secondary Subdomains -### Interactive Knowledge Retrieval -A supporting subdomain exposes the generated wiki to conversational or query-driven access, allowing stakeholders to interrogate extracted knowledge without directly inspecting raw wiki files. +| Subdomain | Responsibility | +|---|---| +| **Provider Abstraction** | Decouples extraction and synthesis intelligence from any specific inference backend, allowing local and hosted providers to be swapped without altering the pipeline. | +| **Wiki Authoring & Organisation** | Governs how extracted knowledge is structured, stored on the filesystem, and made navigable for consumers such as migration teams. | +| **Interactive Knowledge Retrieval** | Supports on-demand querying of the generated wiki, enabling a conversational interface over the accumulated knowledge base. | -## Cross-Cutting Constraint: Tech-Agnosticism -Tech-agnosticism spans every subdomain. All analysis, extraction, and synthesis must produce domain-level descriptions that are free of references to specific languages, frameworks, or libraries. This constraint is enforced at both the classification stage (repository introspection) and the output stage (section content). +## Domain Relationships -## Subdomain Relationships +Repository Introspection feeds Per-File Extraction, which in turn feeds Documentation Synthesis — forming a directed, stage-ordered pipeline. Provider Abstraction is a horizontal supporting concern that all three primary subdomains depend on. Wiki Authoring & Organisation governs the output representation consumed by Interactive Knowledge Retrieval. 
Quality assurance of generated content is an ancillary concern cross-cutting the extraction and synthesis stages. -| Subdomain | Role | Depends On | -|---|---|---| -| Repository Introspection | Identifies source worth analysing | — | -| Per-File Knowledge Extraction | Produces primary section evidence | Introspection | -| Section Synthesis & Aggregation | Produces derivative sections | Per-File Extraction | -| Wiki Authoring & Organisation | Structures and stores the wiki | Synthesis | -| Interactive Knowledge Retrieval | Queries the completed wiki | Authoring | +## Supporting claims +- The core domain is codebase knowledge extraction: reasoning about an arbitrary repository's structure, intent, and behaviour and representing that understanding as a technology-agnostic wiki. [1][2][3][4] +- Tech-agnosticism is a first-class constraint at the analysis level, not merely a presentation concern. [5] +- The repository introspection subdomain covers discovering and classifying files, resolving import relationships, and deciding which parts of a codebase encode business intent versus infrastructure or tooling. [1][6][5] +- The per-file knowledge extraction subdomain extracts intent-bearing findings per wiki section, and encompasses caching/memoisation, import-graph-based cross-file context, and chunk-level deduplication. [1][3] +- The documentation synthesis subdomain aggregates per-file findings into wiki sections and derives higher-level artifacts such as narrative summaries, personas, and diagrams. [1][4][7] +- The dependency ordering between primary evidence extraction and derivative synthesis is a first-class design constraint enforced structurally. [7] +- Provider abstraction is a secondary domain that decouples extraction intelligence from any specific inference backend. [8] +- Wiki authoring and organisation is a secondary domain governing how extracted knowledge is structured and stored for consumption by a migration team. 
[2] +- Interactive knowledge retrieval against the generated wiki is a supporting subdomain. [6] ## Sources -1. `README.md:28-52` +1. `README.md:32-55` 2. `VISION.md:3-20` -3. `wikifi/cli.py:1-8` -4. `wikifi/introspection.py:19-44` -5. `wikifi/sections.py:1-19` +3. `wikifi/extractor.py` +4. `wikifi/orchestrator.py:1-16` +5. `wikifi/introspection.py:19-44` +6. `wikifi/cli.py:1-8` +7. `wikifi/sections.py:1-19` +8. `README.md:57-63` diff --git a/tests/test_wiki.py b/tests/test_wiki.py index d95712f..bc54404 100644 --- a/tests/test_wiki.py +++ b/tests/test_wiki.py @@ -90,3 +90,76 @@ def test_write_section_with_section_object(tmp_path): body = "Some **bold** content." path = write_section(layout, section, body) assert section.title in path.read_text() + + +def test_initialize_gitignore_includes_cache_dir(tmp_path): + """Fresh init must ignore both `.notes/` AND `.cache/`. + + The cache layer writes to `.wikifi/.cache/`; if the gitignore + template misses it, every walk leaves untracked files in the + target repo — exactly the noise the wiki contract promises to + avoid. + """ + from wikifi.wiki import CACHE_DIRNAME, NOTES_DIRNAME + + layout = _layout(tmp_path) + initialize(layout, model="m", provider="ollama", ollama_host="http://h") + contents = layout.gitignore_path.read_text() + assert f"{NOTES_DIRNAME}/" in contents + assert f"{CACHE_DIRNAME}/" in contents + + +def test_initialize_backfills_cache_into_legacy_gitignore(tmp_path): + """An older wiki's `.gitignore` (only `.notes/`) gains `.cache/` on re-init. + + Wikis created before the cache layer landed have a `.gitignore` + missing the new entry. Re-running `wikifi init` against them must + append the missing line in place rather than leaving the older + config silently incomplete. + """ + from wikifi.wiki import CACHE_DIRNAME + + layout = _layout(tmp_path) + layout.wiki_dir.mkdir(parents=True) + # Simulate the pre-cache-era gitignore — comment + .notes/ only. 
+ legacy = "# wikifi local working state — section markdown is committed, notes are not.\n.notes/\n" + layout.gitignore_path.write_text(legacy) + + initialize(layout, model="m", provider="ollama", ollama_host="http://h") + + contents = layout.gitignore_path.read_text() + # The original line is preserved unchanged. + assert ".notes/" in contents + # The missing entry is appended. + assert f"{CACHE_DIRNAME}/" in contents + # No duplication on a second init. + initialize(layout, model="m", provider="ollama", ollama_host="http://h") + after_second = layout.gitignore_path.read_text() + assert after_second.count(f"{CACHE_DIRNAME}/") == 1 + + +def test_initialize_preserves_user_extra_lines_in_gitignore(tmp_path): + """User-added entries in `.wikifi/.gitignore` survive re-init. + + Backfill must only *append* missing required entries — it must + never rewrite, reorder, or strip lines the user added themselves + (e.g. `local-notes/`, `*.draft`, etc.). + """ + from wikifi.wiki import CACHE_DIRNAME + + layout = _layout(tmp_path) + layout.wiki_dir.mkdir(parents=True) + # User-customized: includes the standard .notes/ plus an extra entry, + # but is missing the new .cache/ line. + user_authored = "# my custom comment\n.notes/\nlocal-notes/\n*.draft\n" + layout.gitignore_path.write_text(user_authored) + + initialize(layout, model="m", provider="ollama", ollama_host="http://h") + contents = layout.gitignore_path.read_text() + + # User content unchanged. + assert "# my custom comment" in contents + assert "local-notes/" in contents + assert "*.draft" in contents + # Required entry appended. 
+ assert f"{CACHE_DIRNAME}/" in contents diff --git a/wikifi/cache.py b/wikifi/cache.py index b611cfc..bf43e24 100644 --- a/wikifi/cache.py +++ b/wikifi/cache.py @@ -30,15 +30,34 @@ from pathlib import Path from typing import Any -from wikifi.wiki import WikiLayout +from wikifi.wiki import CACHE_DIRNAME, WikiLayout log = logging.getLogger("wikifi.cache") -CACHE_DIRNAME = ".cache" EXTRACTION_CACHE_FILENAME = "extraction.json" AGGREGATION_CACHE_FILENAME = "aggregation.json" CACHE_VERSION = 1 # bump to invalidate every cache entry across upgrades +# Re-exposed for callers that already import ``CACHE_DIRNAME`` from this +# module; the constant itself lives in :mod:`wikifi.wiki` next to the +# other layout names. +__all__ = [ + "CACHE_DIRNAME", + "AGGREGATION_CACHE_FILENAME", + "EXTRACTION_CACHE_FILENAME", + "CACHE_VERSION", + "CachedFindings", + "CachedSection", + "WalkCache", + "aggregation_cache_path", + "cache_dir", + "extraction_cache_path", + "hash_section_notes", + "load", + "reset", + "save", +] + @dataclass class CachedFindings: @@ -140,7 +159,7 @@ def record_aggregation( def cache_dir(layout: WikiLayout) -> Path: - return layout.wiki_dir / CACHE_DIRNAME + return layout.cache_dir def extraction_cache_path(layout: WikiLayout) -> Path: diff --git a/wikifi/wiki.py b/wikifi/wiki.py index 5c0e014..77d1114 100644 --- a/wikifi/wiki.py +++ b/wikifi/wiki.py @@ -6,9 +6,10 @@ ``` /.wikifi/ config.toml # provider/model overrides; created by `wikifi init` - .gitignore # excludes per-file extraction notes by default + .gitignore # excludes per-file extraction notes + cache by default
.md # one per entry in wikifi.sections.SECTIONS .notes/ # per-file/per-section extraction state (jsonl) + .cache/ # content-addressed extraction + aggregation cache ``` """ @@ -24,12 +25,26 @@ WIKI_DIRNAME = ".wikifi" NOTES_DIRNAME = ".notes" +# Cache dir constant lives here (not in ``cache.py``) so the layout has +# one source of truth and ``cache.py`` can import it without inverting +# the existing ``cache → wiki`` dependency direction. +CACHE_DIRNAME = ".cache" CONFIG_FILENAME = "config.toml" GITIGNORE_FILENAME = ".gitignore" -DEFAULT_GITIGNORE = """# wikifi local working state — section markdown is committed, notes are not. -.notes/ -""" +# Lines we guarantee in ``.wikifi/.gitignore``. Both ``.notes/`` and +# ``.cache/`` are local working state — section markdown is what gets +# committed. New entries appended here are also backfilled into older +# wikis on the next ``wikifi init`` (see :func:`initialize`) so users +# upgrading wikifi don't accumulate noisy untracked files. +_GITIGNORE_REQUIRED_ENTRIES: tuple[str, ...] 
= ( + f"{NOTES_DIRNAME}/", + f"{CACHE_DIRNAME}/", +) +DEFAULT_GITIGNORE = ( + "# wikifi local working state — section markdown is committed, " + "notes and cache are not.\n" + "\n".join(_GITIGNORE_REQUIRED_ENTRIES) + "\n" +) @dataclass(frozen=True) @@ -52,6 +67,10 @@ def gitignore_path(self) -> Path: def notes_dir(self) -> Path: return self.wiki_dir / NOTES_DIRNAME + @property + def cache_dir(self) -> Path: + return self.wiki_dir / CACHE_DIRNAME + def section_path(self, section: Section | str) -> Path: sid = section.id if isinstance(section, Section) else section return self.wiki_dir / f"{sid}.md" @@ -79,8 +98,7 @@ def initialize(layout: WikiLayout, *, model: str, provider: str, ollama_host: st layout.config_path.write_text(_render_config(model=model, provider=provider, ollama_host=ollama_host)) created.append(layout.config_path) - if not layout.gitignore_path.exists(): - layout.gitignore_path.write_text(DEFAULT_GITIGNORE) + _ensure_gitignore(layout) created.append(layout.gitignore_path) for section in SECTIONS: @@ -92,6 +110,28 @@ def initialize(layout: WikiLayout, *, model: str, provider: str, ollama_host: st return created +def _ensure_gitignore(layout: WikiLayout) -> None: + """Ensure the wiki's .gitignore exists and covers every required entry. + + Older wikis predate the cache layer and have a ``.gitignore`` that + only ignores ``.notes/``. Backfill any missing line-by-line entries + from :data:`_GITIGNORE_REQUIRED_ENTRIES` so users upgrading wikifi + don't end up with stray ``.cache/`` (or future-added) directories + showing as untracked changes in the target repo. 
+ """ + path = layout.gitignore_path + if not path.exists(): + path.write_text(DEFAULT_GITIGNORE) + return + existing = path.read_text(encoding="utf-8") + existing_lines = {line.strip() for line in existing.splitlines() if line.strip()} + missing = [entry for entry in _GITIGNORE_REQUIRED_ENTRIES if entry not in existing_lines] + if not missing: + return + suffix_nl = "" if existing.endswith("\n") else "\n" + path.write_text(existing + suffix_nl + "\n".join(missing) + "\n") + + def write_section(layout: WikiLayout, section: Section, body: str) -> Path: """Replace a section's body with rendered markdown.""" path = layout.section_path(section) From cee343c8dab2bf57c2480c2225f88e8de0c4572f Mon Sep 17 00:00:00 2001 From: Dallas Pool Date: Fri, 1 May 2026 22:13:01 -0500 Subject: [PATCH 7/9] chore: untrack pre-existing wiki cache files; sync local .gitignore MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Follow-up to f1f51b4. The previous commit added `.cache/` to the generated gitignore template and backfilled it on `wikifi init`, but two pre-existing cache JSON files (`.wikifi/.cache/aggregation.json`, `.wikifi/.cache/extraction.json`) were already tracked from the "e2e run" snapshot in ddd193c. Git only honors gitignore for untracked paths, so future walks would still mark those two files as modified despite the new ignore rule. Untrack them here so the gitignore actually takes effect for this repo, and bring `.wikifi/.gitignore` in line with the updated template (the template change only writes on fresh inits — existing wikis upgrade through `_ensure_gitignore` on the next `wikifi init`, but this repo's file was already on disk so it needs the manual sync). 
--- .wikifi/.cache/aggregation.json | 2656 ---------------- .wikifi/.cache/extraction.json | 5025 ------------------------------- .wikifi/.gitignore | 3 +- 3 files changed, 2 insertions(+), 7682 deletions(-) delete mode 100644 .wikifi/.cache/aggregation.json delete mode 100644 .wikifi/.cache/extraction.json diff --git a/.wikifi/.cache/aggregation.json b/.wikifi/.cache/aggregation.json deleted file mode 100644 index b597b5c..0000000 --- a/.wikifi/.cache/aggregation.json +++ /dev/null @@ -1,2656 +0,0 @@ -{ - "version": 1, - "saved_at": "2026-05-02T03:10:48.125020+00:00", - "entries": { - "domains": { - "notes_hash": "4040897a09cc", - "body": "## Core Domain\n\nThe system's core domain is **codebase knowledge extraction**: ingesting an existing source base, classifying its contents, deriving domain findings from individual files, and synthesising those findings into a structured, technology-agnostic wiki. The primary consumers are migration teams who need to understand business intent, domain structure, and operational behaviour before re-implementing or replacing a legacy system.\n\n## Subdomains\n\n### Repository Introspection\nThis subdomain concerns discovering and classifying the files that make up a target codebase. Its central responsibility is distinguishing production source that encodes business intent from infrastructure, tooling, and other artefacts that do not. Tech-agnosticism is a first-class constraint here: the classification logic must not rely on recognising any specific language, framework, or runtime.\n\n### Per-File Knowledge Extraction\nOnce relevant files are identified, each is analysed independently to surface domain findings. 
This subdomain covers the full extraction loop — examining file content, applying domain heuristics, and producing structured evidence — and forms the first phase of wiki generation (primary sections).\n\n### Section Synthesis and Aggregation\nThe second phase of wiki generation operates over the evidence produced by per-file extraction. It aggregates findings across files into coherent wiki sections, derives higher-level content that cannot be inferred from any single file, and enforces the dependency ordering between primary (evidence-driven) and derivative (aggregated) sections. This ordering is a structural design constraint, not merely a runtime convention.\n\n### Wiki Authoring and Organisation\nA secondary domain governs how extracted knowledge is structured and stored. It defines the taxonomy of sections, distinguishes primary from derivative content, and produces output that a migration team can navigate and consume independently of the source codebase.\n\n### Interactive Knowledge Retrieval\nA supporting subdomain exposes the generated wiki to conversational or query-driven access, allowing stakeholders to interrogate extracted knowledge without directly inspecting raw wiki files.\n\n## Cross-Cutting Constraint: Tech-Agnosticism\nTech-agnosticism spans every subdomain. All analysis, extraction, and synthesis must produce domain-level descriptions that are free of references to specific languages, frameworks, or libraries. 
This constraint is enforced at both the classification stage (repository introspection) and the output stage (section content).\n\n## Subdomain Relationships\n\n| Subdomain | Role | Depends On |\n|---|---|---|\n| Repository Introspection | Identifies source worth analysing | — |\n| Per-File Knowledge Extraction | Produces primary section evidence | Introspection |\n| Section Synthesis & Aggregation | Produces derivative sections | Per-File Extraction |\n| Wiki Authoring & Organisation | Structures and stores the wiki | Synthesis |\n| Interactive Knowledge Retrieval | Queries the completed wiki | Authoring |\n", - "claims": [ - { - "text": "The core domain is codebase knowledge extraction: ingesting source files, classifying them, deriving domain findings, and synthesising those findings into a structured wiki.", - "sources": [ - { - "file": "README.md", - "lines": [ - 28, - 52 - ], - "fingerprint": "996c401d036d" - }, - { - "file": "VISION.md", - "lines": [ - 3, - 20 - ], - "fingerprint": "10651b456a64" - }, - { - "file": "wikifi/cli.py", - "lines": [ - 1, - 8 - ], - "fingerprint": "f326383c7da1" - } - ] - }, - { - "text": "The primary consumers are migration teams who need to understand business intent, domain structure, and operational behaviour of a legacy system.", - "sources": [ - { - "file": "VISION.md", - "lines": [ - 3, - 20 - ], - "fingerprint": "10651b456a64" - } - ] - }, - { - "text": "Repository introspection is responsible for distinguishing production source that encodes business intent from infrastructure, tooling, and other artefacts.", - "sources": [ - { - "file": "wikifi/cli.py", - "lines": [ - 1, - 8 - ], - "fingerprint": "f326383c7da1" - }, - { - "file": "wikifi/introspection.py", - "lines": [ - 19, - 44 - ], - "fingerprint": "59cd5940f72e" - } - ] - }, - { - "text": "Tech-agnosticism is a first-class constraint on the introspection subdomain: classification must not rely on recognising any specific language or framework.", - "sources": [ - { - 
"file": "wikifi/introspection.py", - "lines": [ - 19, - 44 - ], - "fingerprint": "59cd5940f72e" - } - ] - }, - { - "text": "Wiki generation is split into two subdomains: per-file evidence extraction (primary sections) and aggregate synthesis (derivative sections), with a structurally enforced dependency ordering between them.", - "sources": [ - { - "file": "README.md", - "lines": [ - 28, - 52 - ], - "fingerprint": "996c401d036d" - }, - { - "file": "wikifi/sections.py", - "lines": [ - 1, - 19 - ], - "fingerprint": "f743972a8fce" - } - ] - }, - { - "text": "Wiki authoring and organisation is a secondary domain governing how extracted knowledge is structured and stored for consumption.", - "sources": [ - { - "file": "VISION.md", - "lines": [ - 3, - 20 - ], - "fingerprint": "10651b456a64" - }, - { - "file": "wikifi/sections.py", - "lines": [ - 1, - 19 - ], - "fingerprint": "f743972a8fce" - } - ] - }, - { - "text": "Interactive knowledge retrieval is a supporting subdomain that exposes the generated wiki to query-driven access.", - "sources": [ - { - "file": "wikifi/cli.py", - "lines": [ - 1, - 8 - ], - "fingerprint": "f326383c7da1" - } - ] - } - ], - "contradictions": [] - }, - "intent": { - "notes_hash": "5f1844b4a404", - "body": "wikifi exists because the intent embedded in a legacy system is typically invisible — locked inside years of implementation choices, technology-specific conventions, and accumulated structure that makes it difficult to separate *what the system does and why* from *how it currently does it*. Migration teams tasked with replacing or re-implementing such a system need the former without the latter.\n\n### The Core Problem\n\nWhen a team inherits a large legacy codebase and must produce a new implementation, they face a knowledge-extraction problem. The source describes a particular way of solving a set of problems, but rarely describes the problems themselves at a level that is portable to a new context. 
Reading the source directly tends to reproduce the same structure and constraints in the new system — recreating legacy decisions rather than the underlying intent.\n\nwikifi addresses this by walking a repository and producing a structured, technology-agnostic wiki that surfaces:\n\n- **Domain entities and capabilities** — what the system models and what it can do\n- **API contracts and integration touchpoints** — what it exposes and to whom\n- **Cross-cutting concerns** — considerations that span the system as a whole\n- **Personas, user stories, and diagrams** — who uses the system, what they need, and how flows connect\n\nThe goal is to make legacy intent explicit, complete, and portable so a fresh implementation can retain full functional value without inheriting structural decisions.\n\n### Primary Audience\n\nThe immediate audience is migration teams — architects and developers who need to understand a system's domain well enough to re-implement it rather than maintain it. A secondary audience includes anyone who must understand what a system does without reading its source directly, including those who need to interrogate the resulting wiki conversationally.\n\n### What the System Is Not\n\nwikifi is explicitly a feature-extraction tool, not a transposition tool. It surfaces what a legacy system does and leaves all decisions about target architecture, structure, and approach entirely to the migration team. The output prescribes nothing about how the new system should be built.\n\n### Shaping Constraints\n\nSeveral constraints are built into the design from the outset:\n\n| Constraint | Rationale |\n|---|---|\n| **Technology agnosticism** | Output must be expressed in domain terms, never in terms of the implementation technology found in the source, so the wiki does not embed the very assumptions it is meant to dissolve. |\n| **Quality over speed** | Accuracy and completeness of the generated wiki are prioritised over processing throughput. 
|\n| **Arbitrary scale** | The system must handle repositories of any size — including legacy monorepos with tens of thousands of files — through caching and chunking strategies that make repeated and interrupted runs cheap. |\n| **Full traceability** | Every assertion in the generated wiki must trace back to specific source files and locations so architects can verify any claim against the original codebase. |\n| **Honest disagreement** | Where source files contain conflicting signals, the system surfaces those contradictions explicitly rather than silently resolving them, preserving the full picture for the migration team. |", - "claims": [ - { - "text": "wikifi exists because the intent of legacy systems is locked inside their implementation choices, making it difficult to separate what the system does from how it does it.", - "sources": [ - { - "file": "VISION.md", - "lines": [ - 3, - 9 - ], - "fingerprint": "10651b456a64" - } - ] - }, - { - "text": "Migration teams need a description of what a system does and why, decoupled from how it currently does it, so they can re-implement on a fresh stack without recreating legacy structure.", - "sources": [ - { - "file": "CLAUDE.md", - "lines": [ - 73, - 75 - ], - "fingerprint": "ac9698d91de6" - }, - { - "file": "README.md", - "lines": [ - 3, - 3 - ], - "fingerprint": "996c401d036d" - }, - { - "file": "VISION.md", - "lines": [ - 3, - 9 - ], - "fingerprint": "10651b456a64" - }, - { - "file": "wikifi/cli.py", - "lines": [ - 1, - 8 - ], - "fingerprint": "f326383c7da1" - } - ] - }, - { - "text": "wikifi walks a repository and produces a structured, technology-agnostic wiki surfacing features, domains, entities, capabilities, and delivered value.", - "sources": [ - { - "file": ".env.example", - "lines": [ - 1, - 2 - ], - "fingerprint": "2e493dbd2d87" - }, - { - "file": "README.md", - "lines": [ - 3, - 3 - ], - "fingerprint": "996c401d036d" - }, - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 1, - 6 - ], - "fingerprint": 
"3b93f710ebca" - }, - { - "file": "wikifi/cli.py", - "lines": [ - 1, - 8 - ], - "fingerprint": "f326383c7da1" - }, - { - "file": "wikifi/config.py", - "lines": [ - 1, - 8 - ], - "fingerprint": "8cd2ca53c957" - } - ] - }, - { - "text": "The wiki includes domain entities and capabilities, API contracts and integration touchpoints, and cross-cutting concerns extracted from source files.", - "sources": [ - { - "file": "README.md", - "lines": [ - 3, - 3 - ], - "fingerprint": "996c401d036d" - }, - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 1, - 6 - ], - "fingerprint": "3b93f710ebca" - }, - { - "file": "wikifi/specialized/__init__.py", - "lines": [ - 1, - 13 - ], - "fingerprint": "84d6c382c745" - }, - { - "file": "wikifi/specialized/openapi.py", - "lines": [ - 1, - 11 - ], - "fingerprint": "ae97781309c4" - }, - { - "file": "wikifi/specialized/protobuf.py", - "lines": [ - 1, - 8 - ], - "fingerprint": "e20d5913745a" - } - ] - }, - { - "text": "Certain wiki sections — personas, user stories, and diagrams — are synthesized from aggregated primary evidence because they cannot be inferred from individual files.", - "sources": [ - { - "file": "wikifi/deriver.py", - "lines": [ - 1, - 18 - ], - "fingerprint": "0b7f4f5abb09" - }, - { - "file": "wikifi/sections.py", - "lines": [ - 1, - 19 - ], - "fingerprint": "f743972a8fce" - } - ] - }, - { - "text": "The goal is to make legacy intent explicit, complete, and portable so a fresh implementation retains full functional value without inheriting structural decisions.", - "sources": [ - { - "file": "README.md", - "lines": [ - 3, - 3 - ], - "fingerprint": "996c401d036d" - }, - { - "file": "VISION.md", - "lines": [ - 3, - 9 - ], - "fingerprint": "10651b456a64" - } - ] - }, - { - "text": "The primary audience is migration teams who need to understand a system's domain well enough to re-implement it.", - "sources": [ - { - "file": "CLAUDE.md", - "lines": [ - 73, - 75 - ], - "fingerprint": "ac9698d91de6" - }, - { - "file": "VISION.md", 
- "lines": [ - 3, - 9 - ], - "fingerprint": "10651b456a64" - }, - { - "file": "VISION.md", - "lines": [ - 86, - 89 - ], - "fingerprint": "10651b456a64" - }, - { - "file": "wikifi/critic.py", - "lines": [ - 1, - 15 - ], - "fingerprint": "502af9aee392" - } - ] - }, - { - "text": "Users can also interrogate the generated wiki conversationally, with every answer grounded in the extracted sections rather than invented detail.", - "sources": [ - { - "file": "wikifi/chat.py", - "lines": [ - 1, - 32 - ], - "fingerprint": "0333e700a046" - } - ] - }, - { - "text": "wikifi is explicitly a feature-extraction tool, not a transposition tool — it surfaces what the legacy system does and leaves all target architecture decisions to the migration team.", - "sources": [ - { - "file": "CLAUDE.md", - "lines": [ - 73, - 75 - ], - "fingerprint": "ac9698d91de6" - }, - { - "file": "VISION.md", - "lines": [ - 86, - 89 - ], - "fingerprint": "10651b456a64" - } - ] - }, - { - "text": "All output is expressed in domain terms, never in terms of the implementation technology found in the source.", - "sources": [ - { - "file": "CLAUDE.md", - "lines": [ - 73, - 75 - ], - "fingerprint": "ac9698d91de6" - }, - { - "file": "README.md", - "lines": [ - 3, - 3 - ], - "fingerprint": "996c401d036d" - }, - { - "file": "VISION.md", - "lines": [ - 86, - 89 - ], - "fingerprint": "10651b456a64" - }, - { - "file": "wikifi/cli.py", - "lines": [ - 1, - 8 - ], - "fingerprint": "f326383c7da1" - } - ] - }, - { - "text": "The system prioritises documentation quality over processing speed.", - "sources": [ - { - "file": ".env.example", - "lines": [ - 1, - 2 - ], - "fingerprint": "2e493dbd2d87" - }, - { - "file": "wikifi/config.py", - "lines": [ - 1, - 8 - ], - "fingerprint": "8cd2ca53c957" - } - ] - }, - { - "text": "The system is designed to handle repositories of arbitrary size, including legacy monorepos with tens of thousands of files, through caching and chunking.", - "sources": [ - { - "file": "wikifi/cache.py", - 
"lines": [ - 1, - 21 - ], - "fingerprint": "1ba541fe863d" - }, - { - "file": "wikifi/config.py", - "lines": [ - 1, - 8 - ], - "fingerprint": "8cd2ca53c957" - }, - { - "file": "wikifi/extractor.py", - "lines": [ - 1, - 37 - ], - "fingerprint": "b0e939259557" - } - ] - }, - { - "text": "Every assertion in the generated wiki must trace back to specific source files and locations so architects can verify any claim against the original codebase.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 1, - 15 - ], - "fingerprint": "c5f76cb7c4a3" - }, - { - "file": "wikifi/evidence.py", - "lines": [ - 1, - 18 - ], - "fingerprint": "dddfe1a01c85" - } - ] - }, - { - "text": "Where source files contain conflicting signals, the system surfaces those contradictions explicitly rather than silently resolving them.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 1, - 15 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - } - ], - "contradictions": [] - }, - "capabilities": { - "notes_hash": "4a4c91043bca", - "body": "wikifi analyzes any target codebase and produces a structured, technology-agnostic wiki that captures domain knowledge, system intent, capabilities, external dependencies, integrations, cross-cutting concerns, core entities, and hard specifications — expressed entirely in domain terms rather than in the language of a specific technology stack.\n\n## Workspace Initialization\n\nBefore analysis begins, the system bootstraps a wiki workspace inside the target project in an idempotent manner, creating the required directory structure, a configuration file, version-control ignore rules, and one placeholder document per defined section. Repeat invocations leave already-existing artifacts untouched.\n\n## Codebase Analysis Pipeline\n\nThe core pipeline runs in four ordered stages:\n\n1. 
**Repository introspection** — The system compresses the repository's directory layout and reads key manifest files, then uses this compact view to classify every path as either worth walking (production source, business logic, integrations, domain models) or worth skipping (vendored dependencies, build output, tests, CI/CD). The classification is returned as a structured, diffable result.\n\n2. **Per-file extraction** — Every in-scope file is routed through one of three extraction paths:\n - *Cache replay* — if a file's content is unchanged since the last run, previously stored findings are reused without any further processing.\n - *Deterministic schema parsing* — files recognised as structured schema artifacts (SQL DDL, database migrations, API contract specs, interface definition files, and graph schema files) are processed by purpose-built parsers that produce findings about entities, relationships, operations, and constraints without invoking an AI model.\n - *AI-assisted extraction* — all remaining files pass through an AI extraction pass; large files are recursively split into overlapping chunks so no content is missed regardless of size.\n\n Every finding carries a source citation — the originating file path, an inclusive line range, and a content fingerprint — enabling full traceability back to the codebase.\n\n3. **Cross-file context enrichment** — In parallel with extraction, the system builds an import and reference graph across the entire in-scope file set. Each file's neighborhood (the files it depends on and the files that depend on it) is injected into its extraction prompt, enabling findings to describe inter-file flows rather than treating each file in isolation.\n\n4. **Section aggregation** — Per-file findings are grouped by their target wiki section and synthesised into readable markdown bodies. Every asserted claim is backed by numbered citations pointing to the originating files and line ranges. 
Where two or more files make incompatible assertions about the same topic, the system surfaces the conflict explicitly in a dedicated *Conflicts in source* block rather than silently resolving it — a deliberate feature for legacy codebases where disagreements encode high-priority migration signals.\n\n## Wiki Structure\n\nThe generated wiki is organised into **eleven sections**: eight primary sections populated directly from per-file evidence, and three derivative sections synthesised from the completed primaries:\n\n| Section type | Sections |\n|---|---|\n| Primary (8) | Business domains, system intent, capabilities, external dependencies, integrations, cross-cutting concerns, core entities, hard specifications |\n| Derivative (3) | User personas, Gherkin-style user stories, Mermaid architectural diagrams |\n\nDerivative sections are only generated after the primaries they depend on are finalised. If upstream primary sections are empty or missing, the system writes a placeholder that declares the gap rather than fabricating content.\n\n## Quality Assurance\n\nAn optional critic-and-reviser pass evaluates any synthesised section against its brief and the upstream evidence it drew from, producing a structured quality score (0–10) with itemised unsupported claims, gaps, and suggested edits. When a section scores below a configurable threshold, a revision is automatically invoked; the revision is accepted only if it matches or improves the original score, preventing regressions. This loop is particularly valuable for derivative sections — personas and user stories — where single-shot synthesis is most prone to introducing unsupported assertions.\n\n## Incremental and Resumable Walks\n\nThe pipeline uses a two-scope content-addressed cache: per-file extraction results are keyed to a combination of file path and content fingerprint, and per-section aggregation results are keyed to a digest of the contributing notes payload. 
Only changed files and affected sections are reprocessed on incremental runs. Because results are persisted after every completed file, an interrupted walk resumes from the last unprocessed file rather than restarting from scratch. The cache can also be fully invalidated to force a clean re-walk.\n\n## Coverage and Quality Reporting\n\nA report command produces a human-readable markdown table summarising every wiki section by contributing file count, finding count, body size, optional critic-derived quality score, and the highest-priority content gap identified by the critic. Coverage statistics also surface *dead zones* — files that were processed but produced no findings — so teams can identify blind spots in the analysis.\n\n## Interactive Knowledge Querying\n\nOnce a wiki has been generated, users can open an interactive conversational session grounded in all populated sections. The session supports multi-turn exchanges, conversation history reset, and introspection of which sections are currently loaded as context. Only meaningfully populated sections are included, ensuring the assistant is not grounded in placeholder content.\n\n## Graceful Degradation\n\nWhen AI synthesis fails for a section, the system falls back to emitting the raw collected notes directly in the section body, preserving information at the cost of polish and surfacing the error inline. 
Similarly, unparseable schema files produce an advisory finding directing reviewers to inspect the file manually rather than silently failing.", - "claims": [ - { - "text": "wikifi analyzes a target codebase and produces a technology-agnostic wiki covering DDD domains, intent, capabilities, external dependencies, integrations, cross-cutting concerns, core entities, and hard specifications.", - "sources": [ - { - "file": "VISION.md", - "lines": [ - 6, - 8 - ], - "fingerprint": "10651b456a64" - }, - { - "file": "wikifi/sections.py", - "lines": [ - 44, - 142 - ], - "fingerprint": "f743972a8fce" - } - ] - }, - { - "text": "The system bootstraps a wiki workspace inside the target project in an idempotent manner, creating directory structure, configuration, version-control ignore rules, and one placeholder document per section.", - "sources": [ - { - "file": "README.md", - "lines": [ - 14, - 24 - ], - "fingerprint": "996c401d036d" - }, - { - "file": "wikifi/orchestrator.py", - "lines": [ - 62, - 76 - ], - "fingerprint": "6ed682a87356" - }, - { - "file": "wikifi/wiki.py", - "lines": [ - 64, - 86 - ], - "fingerprint": "9230b7444e0d" - } - ] - }, - { - "text": "Stage 1 introspects the repository by compressing its directory layout and reading manifest files, then classifies paths as worth walking or worth skipping.", - "sources": [ - { - "file": "wikifi/introspection.py", - "lines": [ - 28, - 44 - ], - "fingerprint": "59cd5940f72e" - }, - { - "file": "wikifi/introspection.py", - "lines": [ - 61, - 70 - ], - "fingerprint": "59cd5940f72e" - }, - { - "file": "wikifi/walker.py", - "lines": [ - 92, - 186 - ], - "fingerprint": "a29bd1ad8bdb" - } - ] - }, - { - "text": "Per-file extraction routes each file through one of three paths: cache replay, deterministic schema parsing, or AI-assisted extraction with chunking for large files.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": [ - 140, - 200 - ], - "fingerprint": "b0e939259557" - }, - { - "file": 
"wikifi/cache.py", - "lines": [ - 5, - 8 - ], - "fingerprint": "1ba541fe863d" - } - ] - }, - { - "text": "Structured schema artifacts — SQL DDL, migrations, API contract specs, interface definition files, and graph schema files — are processed by purpose-built deterministic parsers without invoking an AI model.", - "sources": [ - { - "file": "README.md", - "lines": [ - 34, - 36 - ], - "fingerprint": "996c401d036d" - }, - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 116, - 149 - ], - "fingerprint": "3b93f710ebca" - }, - { - "file": "wikifi/config.py", - "lines": [ - 75, - 81 - ], - "fingerprint": "8cd2ca53c957" - }, - { - "file": "wikifi/extractor.py", - "lines": [ - 140, - 200 - ], - "fingerprint": "b0e939259557" - }, - { - "file": "wikifi/repograph.py", - "lines": [ - 41, - 52 - ], - "fingerprint": "3d8bbdb10112" - }, - { - "file": "wikifi/specialized/__init__.py", - "lines": [ - 46, - 57 - ], - "fingerprint": "84d6c382c745" - }, - { - "file": "wikifi/specialized/sql.py", - "lines": [ - 56, - 62 - ], - "fingerprint": "1ef5e77c4038" - } - ] - }, - { - "text": "Every finding carries a source citation including file path, line range, and content fingerprint.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": [ - 251, - 270 - ], - "fingerprint": "b0e939259557" - }, - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 40, - 66 - ], - "fingerprint": "3b93f710ebca" - }, - { - "file": "wikifi/aggregator.py", - "lines": [ - 1, - 15 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "text": "The system builds an import and reference graph across the in-scope file set and injects each file's neighborhood into its extraction prompt.", - "sources": [ - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 90, - 114 - ], - "fingerprint": "3b93f710ebca" - }, - { - "file": "wikifi/config.py", - "lines": [ - 69, - 74 - ], - "fingerprint": "8cd2ca53c957" - }, - { - "file": "wikifi/extractor.py", - "lines": [ - 241, - 246 - ], - "fingerprint": "b0e939259557" - }, 
- { - "file": "wikifi/repograph.py", - "lines": [ - 155, - 210 - ], - "fingerprint": "3d8bbdb10112" - } - ] - }, - { - "text": "Per-file findings are synthesised into readable markdown section bodies with every claim backed by numbered citations.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 1, - 15 - ], - "fingerprint": "c5f76cb7c4a3" - }, - { - "file": "wikifi/evidence.py", - "lines": [ - 88, - 121 - ], - "fingerprint": "dddfe1a01c85" - } - ] - }, - { - "text": "Where two or more files make incompatible assertions about the same topic, the system surfaces the conflict explicitly in a 'Conflicts in source' block rather than silently resolving it.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 9, - 14 - ], - "fingerprint": "c5f76cb7c4a3" - }, - { - "file": "wikifi/evidence.py", - "lines": [ - 13, - 17 - ], - "fingerprint": "dddfe1a01c85" - }, - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 40, - 66 - ], - "fingerprint": "3b93f710ebca" - } - ] - }, - { - "text": "The wiki is organised into eight primary sections and three derivative sections (user personas, Gherkin-style user stories, and Mermaid diagrams).", - "sources": [ - { - "file": "wikifi/sections.py", - "lines": [ - 44, - 142 - ], - "fingerprint": "f743972a8fce" - }, - { - "file": "VISION.md", - "lines": [ - 53, - 63 - ], - "fingerprint": "10651b456a64" - }, - { - "file": "wikifi/deriver.py", - "lines": [ - 73, - 107 - ], - "fingerprint": "0b7f4f5abb09" - } - ] - }, - { - "text": "Derivative sections are only generated after the primaries they depend on are finalised; if upstream sections are empty, a placeholder is written rather than fabricating content.", - "sources": [ - { - "file": "wikifi/deriver.py", - "lines": [ - 73, - 107 - ], - "fingerprint": "0b7f4f5abb09" - }, - { - "file": "wikifi/sections.py", - "lines": [ - 44, - 142 - ], - "fingerprint": "f743972a8fce" - } - ] - }, - { - "text": "An optional critic-and-reviser pass evaluates sections against 
a quality rubric, scoring them 0–10, and invokes a revision only when the score is below a configurable threshold, accepting the revision only if it matches or improves the original score.", - "sources": [ - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 151, - 164 - ], - "fingerprint": "3b93f710ebca" - }, - { - "file": "wikifi/config.py", - "lines": [ - 83, - 94 - ], - "fingerprint": "8cd2ca53c957" - }, - { - "file": "wikifi/critic.py", - "lines": [ - 100, - 153 - ], - "fingerprint": "502af9aee392" - }, - { - "file": "wikifi/deriver.py", - "lines": [ - 90, - 103 - ], - "fingerprint": "0b7f4f5abb09" - } - ] - }, - { - "text": "The pipeline uses a two-scope content-addressed cache — per-file and per-section — so only changed files and affected sections are reprocessed on incremental runs.", - "sources": [ - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 67, - 88 - ], - "fingerprint": "3b93f710ebca" - }, - { - "file": "wikifi/cache.py", - "lines": [ - 5, - 8 - ], - "fingerprint": "1ba541fe863d" - }, - { - "file": "wikifi/cache.py", - "lines": [ - 9, - 12 - ], - "fingerprint": "1ba541fe863d" - }, - { - "file": "wikifi/config.py", - "lines": [ - 63, - 68 - ], - "fingerprint": "8cd2ca53c957" - } - ] - }, - { - "text": "Interrupted walks are resumable because per-file results are persisted incrementally; the cache can also be fully invalidated to force a clean re-walk.", - "sources": [ - { - "file": "wikifi/cache.py", - "lines": [ - 14, - 18 - ], - "fingerprint": "1ba541fe863d" - }, - { - "file": "wikifi/cache.py", - "lines": [ - 105, - 113 - ], - "fingerprint": "1ba541fe863d" - }, - { - "file": "README.md", - "lines": [ - 16, - 20 - ], - "fingerprint": "996c401d036d" - }, - { - "file": "wikifi/cli.py", - "lines": [ - 88, - 112 - ], - "fingerprint": "f326383c7da1" - } - ] - }, - { - "text": "A report command produces a markdown table summarising every wiki section by file count, finding count, body size, quality score, and highest-priority content gap.", - 
"sources": [ - { - "file": "README.md", - "lines": [ - 21, - 23 - ], - "fingerprint": "996c401d036d" - }, - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 166, - 186 - ], - "fingerprint": "3b93f710ebca" - }, - { - "file": "wikifi/critic.py", - "lines": [ - 155, - 180 - ], - "fingerprint": "502af9aee392" - }, - { - "file": "wikifi/report.py", - "lines": [ - 44, - 77 - ], - "fingerprint": "2b94c0a5e62e" - } - ] - }, - { - "text": "Coverage statistics surface dead zones — files processed but producing no findings.", - "sources": [ - { - "file": "wikifi/report.py", - "lines": [ - 103, - 107 - ], - "fingerprint": "2b94c0a5e62e" - } - ] - }, - { - "text": "An interactive conversational session grounded in all populated wiki sections supports multi-turn exchanges and various session management commands.", - "sources": [ - { - "file": "README.md", - "lines": [ - 24, - 25 - ], - "fingerprint": "996c401d036d" - }, - { - "file": "wikifi/chat.py", - "lines": [ - 88, - 130 - ], - "fingerprint": "0333e700a046" - }, - { - "file": "wikifi/chat.py", - "lines": [ - 63, - 82 - ], - "fingerprint": "0333e700a046" - }, - { - "file": "wikifi/cli.py", - "lines": [ - 60, - 220 - ], - "fingerprint": "f326383c7da1" - } - ] - }, - { - "text": "When AI synthesis fails for a section, the system falls back to emitting raw collected notes with the error message inline.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 272, - 285 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "text": "Unparseable schema files produce an advisory finding rather than failing silently.", - "sources": [ - { - "file": "wikifi/specialized/openapi.py", - "lines": [ - 23, - 50 - ], - "fingerprint": "ae97781309c4" - } - ] - } - ], - "contradictions": [] - }, - "external_dependencies": { - "notes_hash": "ba55342df61c", - "body": "The system draws on several categories of external service: language-model inference backends, development-time tooling integrations, and a continuous-integration 
platform.\n\n## Language-Model Inference\n\nAll substantive text generation and structured extraction is delegated to an external (or locally hosted) language-model service. Three backends are supported through a common provider abstraction:\n\n| Backend | Hosting | Authentication | Role |\n|---|---|---|---|\n| Local inference server (default) | Self-hosted, no network egress | None required | Default backend for all extraction and synthesis calls; configurable host address and 15-minute per-call timeout |\n| Hosted AI service A (Anthropic) | Cloud API | API key (`ANTHROPIC_API_KEY`) | Opt-in backend; uses an ephemeral prompt-cache marker on the system prompt so that large extraction prompts are billed at roughly 10 % of normal input-token cost across repeated per-file calls |\n| Hosted AI service B (OpenAI-compatible) | Cloud API (or compatible proxy/Azure endpoint) | API key + optional custom base URL | Opt-in backend; relies on automatic prefix caching (prefixes ≥ 1 024 tokens cached for ~5–10 minutes); exposes a reasoning-intensity knob mapped to the backend's reasoning-effort parameter on capable model variants |\n\nThe local inference server is the default and requires no credentials or external network access. The two hosted backends are opt-in and each require a provisioned API key. All three backends are configured with a model name, timeout, and per-call output-token cap drawn from the application's runtime settings.\n\n### Caching Strategy\nBecause the extraction prompt is large and is reused across every file in a repository, minimising repeated billing for identical prompt prefixes is a first-class concern. The hosted-AI-service-A integration achieves this by tagging the system-prompt block with an ephemeral cache-control marker. 
The hosted-AI-service-B integration relies on the provider's automatic prefix-caching mechanism without requiring explicit markers.\n\n## Development-Time Tool Integrations\n\nThe MCP server configuration reveals several additional integrations that appear to be used during development or agent-assisted workflows rather than in the core production pipeline:\n\n- **Google AI generative API** — consumed by at least two registered tool integrations; authenticated via a shared API key.\n- **Self-hosted web-crawling service** — running locally on a fixed port with no API key, providing crawling capability on demand.\n- **External documentation/context lookup service** — called over HTTP with a dedicated API key; likely used to retrieve up-to-date reference documentation for prompt enrichment.\n- **Google-hosted orchestration service (", - "claims": [], - "contradictions": [] - }, - "integrations": { - "notes_hash": "dc7982e6a028", - "body": "### Inbound: Entry Points into the System\n\nThe system is distributed as a library installed directly into a target project. The command-line interface (CLI) is the primary inbound entry point, exposing subcommands that drive the full pipeline from repository introspection through wiki generation, interactive querying, and quality reporting. The CLI delegates all pipeline coordination to the orchestrator, which is also the central hub wiring together every downstream stage.\n\n---\n\n### Outbound: AI Model Backends\n\nAll pipeline stages — introspection, per-file extraction, section aggregation, derivative content derivation, quality critique, and interactive chat — communicate with an AI model backend exclusively through a shared provider abstraction. No stage calls a specific backend directly. 
Three interaction shapes are exposed through this abstraction: schema-validated structured output, free-text completion, and multi-turn stateful conversation.\n\nThree backends are available and are interchangeable without altering any pipeline code:\n\n| Backend type | Hosting model |\n|---|---|\n| Local self-hosted inference runtime | On-premise / developer machine |\n| Hosted AI service (Anthropic-compatible) | Remote cloud |\n| Hosted AI service (OpenAI-compatible) | Remote cloud or self-managed endpoint |\n\nThe active backend is selected via an environment variable or a per-invocation flag at the CLI level. OpenAI-compatible endpoints — including corporate reverse proxies and managed cloud deployments — are supported by overriding the base URL alone, with no other changes to the calling code.\n\n---\n\n### Outbound: Development-Time Tool Servers (MCP)\n\nA separate set of external capability providers is declared through an MCP client configuration used during development or runtime. Four tool servers are wired up: a local AI utility, a local web crawler, a remote documentation-context service, and a remote search-and-stitching service. The system acts as an MCP client that fans requests out to these providers as needed.\n\n---\n\n### Outbound: Filesystem and Persistence Layer\n\nAll reading and writing of wiki artifacts — extraction notes, finished section bodies, and cache entries — flows through a centralized layout abstraction that manages the `.wikifi/` output directory inside the target project. The extractor, aggregator, deriver, CLI, and orchestrator all resolve paths through this abstraction rather than independently.\n\nA content-addressed cache layer sits between the orchestrator and the AI backend, consulting a fingerprinting service to derive content hashes as cache keys. 
The extractor, aggregator, and orchestrator each consult the cache before issuing AI calls, enabling both incremental re-runs and resumability for large codebases.\n\n---\n\n### Integration Touchpoints Discovered in Target Codebases\n\nWhen analyzing a target codebase, the system identifies and surfaces integration touchpoints from high-signal artifact files through specialized parsers:\n\n- **HTTP API surfaces** — Parsed from API contract files; each contract contributes a finding recording the count of externally exposed endpoints, establishing the public-facing API surface as a documented integration point.\n- **RPC service definitions** — Each declared service and its remote procedures are mapped, capturing procedure names, request and response message types, and whether either channel is streaming.\n- **Event-driven channels** — Subscription roots found in schema definition files are classified as real-time integration touchpoints rather than ordinary capabilities, reflecting their role as channels that external consumers attach to.\n- **Relational links** — Foreign key declarations (both explicit and inline) are surfaced as hard relational links between domain entities, identifying cross-entity data dependencies.\n\nThe dispatcher that routes files to these specialized parsers uses the file-kind classification produced by the repository graph module, ensuring each artifact type reaches the appropriate parser while preserving a uniform output contract for downstream aggregation.", - "claims": [ - { - "text": "The system is distributed as a library installed into a target project and invoked via a CLI from that project's root.", - "sources": [ - { - "file": "README.md", - "lines": [ - 8, - 12 - ], - "fingerprint": "996c401d036d" - } - ] - }, - { - "text": "The CLI is the primary inbound entry point, exposing subcommands that drive the full pipeline from introspection through wiki generation, chat, and reporting.", - "sources": [ - { - "file": "wikifi/cli.py", - 
"lines": [ - 98, - 101 - ], - "fingerprint": "f326383c7da1" - } - ] - }, - { - "text": "The orchestrator is the central hub called by the CLI that wires together all pipeline stages.", - "sources": [ - { - "file": "wikifi/orchestrator.py", - "lines": [ - 40, - 60 - ], - "fingerprint": "6ed682a87356" - } - ] - }, - { - "text": "All pipeline stages communicate with an AI model backend exclusively through a shared provider abstraction; no stage calls a specific backend directly.", - "sources": [ - { - "file": "wikifi/providers/base.py", - "lines": [ - 30, - 48 - ], - "fingerprint": "2750f0f56327" - } - ] - }, - { - "text": "Three interaction shapes are exposed through the provider abstraction: schema-validated structured output, free-text completion, and multi-turn stateful conversation.", - "sources": [ - { - "file": "wikifi/providers/anthropic_provider.py", - "lines": [ - 115, - 175 - ], - "fingerprint": "872020d40ac3" - }, - { - "file": "wikifi/providers/base.py", - "lines": [ - 30, - 48 - ], - "fingerprint": "2750f0f56327" - }, - { - "file": "wikifi/providers/ollama_provider.py", - "lines": [ - 58, - 95 - ], - "fingerprint": "0a21916665a5" - }, - { - "file": "wikifi/providers/openai_provider.py", - "lines": [ - 1, - 8 - ], - "fingerprint": "428df9ba13f1" - } - ] - }, - { - "text": "Three interchangeable backends are available: a local self-hosted inference runtime, an Anthropic-hosted service, and an OpenAI-compatible service.", - "sources": [ - { - "file": "README.md", - "lines": [ - 46, - 51 - ], - "fingerprint": "996c401d036d" - }, - { - "file": "wikifi/cli.py", - "lines": [ - 98, - 101 - ], - "fingerprint": "f326383c7da1" - }, - { - "file": "wikifi/providers/anthropic_provider.py", - "lines": [ - 115, - 175 - ], - "fingerprint": "872020d40ac3" - }, - { - "file": "wikifi/providers/ollama_provider.py", - "lines": [ - 58, - 95 - ], - "fingerprint": "0a21916665a5" - }, - { - "file": "wikifi/providers/openai_provider.py", - "lines": [ - 1, - 8 - ], - "fingerprint": 
"428df9ba13f1" - } - ] - }, - { - "text": "The active backend is selected via an environment variable or a per-invocation flag.", - "sources": [ - { - "file": "README.md", - "lines": [ - 46, - 51 - ], - "fingerprint": "996c401d036d" - }, - { - "file": "wikifi/cli.py", - "lines": [ - 98, - 101 - ], - "fingerprint": "f326383c7da1" - } - ] - }, - { - "text": "OpenAI-compatible endpoints including corporate reverse proxies and managed cloud deployments are supported by overriding the base URL only, with no other code changes.", - "sources": [ - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 232, - 235 - ], - "fingerprint": "3b93f710ebca" - } - ] - }, - { - "text": "An MCP client configuration wires up four external tool servers: a local AI utility, a local web crawler, a remote documentation-context service, and a remote search-and-stitching service.", - "sources": [ - { - "file": ".mcp.json", - "lines": [ - 2, - 36 - ], - "fingerprint": "b6b856cb3fe2" - } - ] - }, - { - "text": "All wiki artifact persistence flows through a centralized layout abstraction managing the .wikifi/ directory, consumed by the orchestrator, extractor, aggregator, deriver, and CLI.", - "sources": [ - { - "file": "wikifi/wiki.py", - "lines": [ - 34, - 61 - ], - "fingerprint": "9230b7444e0d" - } - ] - }, - { - "text": "A content-addressed cache layer uses a fingerprinting service to compute content hashes as cache keys, and is consulted by the extractor, aggregator, and orchestrator before issuing AI calls.", - "sources": [ - { - "file": "wikifi/cache.py", - "lines": [ - 244, - 246 - ], - "fingerprint": "1ba541fe863d" - }, - { - "file": "wikifi/cache.py", - "lines": [ - 30, - 30 - ], - "fingerprint": "1ba541fe863d" - } - ] - }, - { - "text": "API contract files are parsed to produce inbound-integration findings recording the count of HTTP endpoints exposed to external consumers.", - "sources": [ - { - "file": "wikifi/specialized/openapi.py", - "lines": [ - 83, - 92 - ], - "fingerprint": 
"ae97781309c4" - } - ] - }, - { - "text": "Service and RPC definition files are parsed to map each procedure, capturing name, request and response types, and streaming flags, with each service emitted as a distinct integration touchpoint.", - "sources": [ - { - "file": "wikifi/specialized/protobuf.py", - "lines": [ - 70, - 87 - ], - "fingerprint": "e20d5913745a" - } - ] - }, - { - "text": "Subscription roots in schema definition files are classified as real-time integration touchpoints rather than ordinary capabilities.", - "sources": [ - { - "file": "wikifi/specialized/graphql.py", - "lines": [ - 88, - 91 - ], - "fingerprint": "bbb305e0d47f" - } - ] - }, - { - "text": "Foreign key declarations are surfaced as hard relational links between domain entities.", - "sources": [ - { - "file": "wikifi/specialized/sql.py", - "lines": [ - 86, - 96 - ], - "fingerprint": "1ef5e77c4038" - } - ] - }, - { - "text": "The specialized-parser dispatcher uses the file-kind classification from the repository graph module and routes to four sibling parsers while preserving a uniform output contract.", - "sources": [ - { - "file": "wikifi/specialized/__init__.py", - "lines": [ - 46, - 57 - ], - "fingerprint": "84d6c382c745" - } - ] - } - ], - "contradictions": [] - }, - "cross_cutting": { - "notes_hash": "9142920419c0", - "body": "## Observability\n\nA consistent, pipeline-wide observability model spans every stage of the system. Structured logging is initialised once and reused across all subcommands; a single verbose flag activates debug-level output globally without each subsystem needing its own toggle. Stage-boundary log events are emitted at each major transition — repository introspection, dependency-graph construction, file extraction, section aggregation, and derivative synthesis — so operators can pinpoint where a long walk is spending time. 
Revision and quality-scoring events are counted in the run's statistics, and cache hit counts are surfaced in the post-walk report, giving a quantitative picture of incremental efficiency.\n\n## Resilience and Error Handling\n\nThe system is designed so that no single failure can abort an entire pipeline run. Extraction failures — whether caused by an inference provider or a specialised deterministic parser — are logged and tallied but never propagated upward; a file whose processing fails entirely is recorded as skipped, and partially-recovered files retain whatever findings were salvaged. Aggregation and derivation failures follow the same pattern: errors are caught and logged at warning level, and a fallback body that preserves the raw upstream evidence is written so the wiki remains inspectable. Quality-assurance (critic and reviser) failures degrade gracefully to returning the original body with a diagnostic score of zero rather than halting. Provider failures during interactive query sessions are surfaced inline without terminating the session. Across all provider backends, raw infrastructure errors are caught at the provider boundary and re-raised as a normalised internal error type carrying the upstream request identifier when available, so the rest of the pipeline does not branch on provider-specific exception shapes.\n\n## Content-Addressed Caching and Crash Resumability\n\nAll expensive inference work is protected by a two-scope content-addressed cache stored under a dedicated hidden subdirectory within the wiki output directory, inheriting the same version-control ignore rules as other working-state artifacts.\n\n- **Extraction scope:** each file's results are keyed by the combination of its relative path and a stable hash of its raw bytes. Any unchanged file is skipped on re-walk with no inference call.\n- **Aggregation scope:** each section's synthesised body is keyed by a deterministic digest of its note payload. 
Unchanged inputs reuse the stored body and evidence bundle.\n\nCache entries are written after every individual file completes, so a mid-walk crash loses at most one file's work. Writes are performed atomically — content is first written to a temporary location and then renamed into place — preventing corrupt partial writes. Malformed entries are silently dropped and logged rather than causing a hard failure, so a partially corrupt cache degrades gracefully to a fresh extraction for only the affected entries. A monotonically increasing version tag is embedded in every persisted cache file; a version mismatch on load causes the entire cache to be discarded and rebuilt, providing a controlled invalidation path across software upgrades. Between runs, entries for files no longer in scope are pruned automatically.\n\n## Input Integrity Guards\n\nA layered set of guards prevents low-signal or pathological inputs from ever reaching the inference layer.\n\n| Guard | Threshold | Effect |\n|---|---|---|\n| Minimum content size | 64 bytes (stripped) | File silently skipped |\n| Maximum file size | 2 MB | File silently skipped |\n| Large-file windowing | 150 KB – 2 MB | File split into overlapping chunks with 8 KB overlap |\n| Manifest truncation | 20 000 bytes | Hard-truncated with visible marker |\n| Per-request timeout | 900 seconds | Uniform backstop across all providers |\n\nDirectory traversal prunes excluded subtrees before descending into them, so ignore patterns are applied efficiently at the directory level rather than file-by-file. Files carrying no extractable intent — stub initialisers, empty fixtures, generated lockfiles — are identified and dropped before reaching the inference layer; the invariant that a single empty or unstructured file must never stall the walk is explicitly upheld. 
Findings produced from the overlap region between adjacent large-file chunks are deduplicated by section and normalised text within each file's pass, preventing double-counting in downstream aggregation.\n\n## Provider Abstraction\n\nAll inference calls — structured extraction, free-text generation, and multi-turn chat — are routed through a single provider abstraction layer. This boundary is where observability hooks, retry logic, error normalisation, and backend-switching concerns live; no extraction or aggregation logic needs knowledge of which backend is active. Supported backend shapes include local inference runtimes and hosted services; the local-inference path is the default, with hosted options as addenda, and swapping between them requires no changes outside the provider boundary.\n\nStructured-output calls enforce a schema-validation contract: the model response must be validated against a declared schema before being returned to the caller, ensuring type-safe data flows through every pipeline stage. To maximise determinism, temperature is hard-pinned to zero on all structured-output calls; free-text and conversational paths accept model-default variability in exchange for naturalness.\n\nWhen a backend exposes a reasoning-depth control, the system runs at the highest available setting, prioritising output quality over walk speed. 
A configurable depth parameter is translated into the provider's native adaptive-thinking feature, allowing callers to trade latency and cost against quality without branching on provider type in shared pipeline code.\n\nHosted backends employ prompt-caching strategies — placing the large, repeated system prompt at a fixed position in every request so the service can serve subsequent calls from a cached prefix — making large-scale walks economically viable by paying full input cost only on the first call and a fraction of that cost on subsequent ones.\n\n## Source Traceability and Hallucination Prevention\n\nFull source traceability is a non-negotiable structural invariant: every assertion in every wiki section must be linkable back to the originating file and, where available, the precise line range within it. This is enforced through typed evidence structures (claims and source references) rather than by convention, so the constraint cannot be silently bypassed.\n\nHallucination prevention operates at two additional levels. First, the inference prompt explicitly instructs the model never to name specific technologies, translating all observations into domain terms — this is a mandatory invariant enforced at the prompt layer. Second, upstream section content that matches known placeholder shapes is filtered out before derivative synthesis, preventing empty or stub sections from being treated as real evidence; these same sentinel strings are used by the quality-report layer to exclude placeholder sections from scoring. 
Interactive query sessions are similarly grounded: the assistant is instructed to explicitly acknowledge when the wiki does not cover a topic rather than generating unsupported answers.\n\nContent fingerprints serve a triple cross-cutting role: keying both extraction and aggregation caches so stale results are never served, anchoring source-evidence citations so claims can be re-verified against a fresh repository walk, and tracking file identity inside the dependency graph so cross-file context is invalidated when any contributing source changes. Files are always fingerprinted as raw bytes rather than decoded text to ensure the cache layer and the extractor agree on identity regardless of encoding assumptions.\n\n## Authentication and Storage Invariants\n\nSpecialised deterministic parsers extract security and data-integrity contracts from high-signal artifacts and surface them as first-class cross-cutting concerns that must be preserved through any migration:\n\n- **Authentication schemes** declared in API contract files are extracted and categorised by type, flagging which security contracts (key-based, delegated authorisation, bearer-token, etc.) 
the new system must honour.\n- **Data integrity constraints** — uniqueness and non-nullability — found in schema definitions are extracted as storage invariants explicitly marked as migration-critical.\n- **Query-performance invariants** — index definitions — are recorded with an explicit note that the new system must preserve equivalent access patterns.\n\nAll specialised parsers return results in the same structured shape as the general inference extractor, so the aggregation layer needs no knowledge of which extraction path was taken; this uniform interface contract is itself an invariant that must be preserved.\n\n## Data Storage Layout\n\nThe pipeline's working state is isolated to a single hidden directory within the repository:\n\n- **Rendered section documents** live at the root of this directory and are intended to be committed to version control.\n- **Per-section extraction notes** (JSONL, each record UTC-timestamped) are stored in a notes subdirectory and excluded from version control via a generated ignore file.\n- **Extraction and aggregation caches** are stored in a cache subdirectory and similarly excluded.\n\nDeleting the cache subdirectory forces a full re-walk; deleting the entire working directory resets all pipeline state. 
This layout ensures generated documentation commits remain clean and the boundary between committed outputs and ephemeral working state is unambiguous.", - "claims": [ - { - "text": "A single verbose flag activates debug-level structured logging globally across all subcommands.", - "sources": [ - { - "file": "wikifi/cli.py", - "lines": [ - 51, - 60 - ], - "fingerprint": "f326383c7da1" - } - ] - }, - { - "text": "Stage-boundary log events are emitted at each major pipeline transition — introspection, dependency-graph construction, extraction, aggregation, and derivation.", - "sources": [ - { - "file": "wikifi/orchestrator.py", - "lines": [ - 84, - 148 - ], - "fingerprint": "6ed682a87356" - } - ] - }, - { - "text": "Revision and quality-scoring events are counted in run statistics, and cache hit counts are surfaced in the post-walk report.", - "sources": [ - { - "file": "wikifi/cli.py", - "lines": [ - 90, - 97 - ], - "fingerprint": "f326383c7da1" - }, - { - "file": "wikifi/deriver.py", - "lines": [ - 110, - 135 - ], - "fingerprint": "0b7f4f5abb09" - } - ] - }, - { - "text": "Structured logging is initialised under a dedicated namespace for the report subsystem.", - "sources": [ - { - "file": "wikifi/report.py", - "lines": [ - 22, - 22 - ], - "fingerprint": "2b94c0a5e62e" - } - ] - }, - { - "text": "Extraction failures are logged and tallied but never propagate to abort the walk; a file whose processing fails entirely is recorded as skipped.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": [ - 228, - 242 - ], - "fingerprint": "b0e939259557" - } - ] - }, - { - "text": "Aggregation failures are caught and logged at warning level, and a fallback body preserving raw notes is written.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 143, - 152 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "text": "Derivation failures are caught and logged; a fallback body preserving upstream evidence is written rather than leaving the section 
blank.", - "sources": [ - { - "file": "wikifi/deriver.py", - "lines": [ - 96, - 107 - ], - "fingerprint": "0b7f4f5abb09" - } - ] - }, - { - "text": "Quality-assurance failures degrade gracefully by returning the original body with a diagnostic score of zero.", - "sources": [ - { - "file": "wikifi/critic.py", - "lines": [ - 158, - 165 - ], - "fingerprint": "502af9aee392" - } - ] - }, - { - "text": "Provider failures during interactive sessions are surfaced inline without terminating the session.", - "sources": [ - { - "file": "wikifi/chat.py", - "lines": [ - 120, - 125 - ], - "fingerprint": "0333e700a046" - } - ] - }, - { - "text": "Raw infrastructure errors are caught at the provider boundary and re-raised as a normalised internal error type carrying the upstream request identifier.", - "sources": [ - { - "file": "wikifi/providers/anthropic_provider.py", - "lines": [ - 238, - 244 - ], - "fingerprint": "872020d40ac3" - }, - { - "file": "wikifi/providers/openai_provider.py", - "lines": [ - 248, - 255 - ], - "fingerprint": "428df9ba13f1" - } - ] - }, - { - "text": "Extraction results are keyed by the combination of a file's relative path and a stable hash of its raw bytes; unchanged files are skipped on re-walk.", - "sources": [ - { - "file": "README.md", - "lines": [ - 40, - 43 - ], - "fingerprint": "996c401d036d" - }, - { - "file": "wikifi/fingerprint.py", - "lines": [ - 1, - 18 - ], - "fingerprint": "853400108135" - } - ] - }, - { - "text": "Aggregation results are keyed by a deterministic digest of the note payload; unchanged inputs reuse the stored body and evidence bundle.", - "sources": [ - { - "file": "README.md", - "lines": [ - 40, - 43 - ], - "fingerprint": "996c401d036d" - }, - { - "file": "wikifi/aggregator.py", - "lines": [ - 126, - 155 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "text": "Cache entries are written after every individual file completes, so a mid-walk crash loses at most one file's work.", - "sources": [ - { - "file": 
"TESTING-AND-DEMO.md", - "lines": [ - 67, - 88 - ], - "fingerprint": "3b93f710ebca" - }, - { - "file": "wikifi/extractor.py", - "lines": [ - 155, - 175 - ], - "fingerprint": "b0e939259557" - } - ] - }, - { - "text": "Cache writes are atomic — content is written to a temporary location and then renamed into place.", - "sources": [ - { - "file": "wikifi/cache.py", - "lines": [ - 189, - 193 - ], - "fingerprint": "1ba541fe863d" - } - ] - }, - { - "text": "Malformed cache entries are silently dropped and logged rather than causing a hard failure.", - "sources": [ - { - "file": "wikifi/cache.py", - "lines": [ - 196, - 222 - ], - "fingerprint": "1ba541fe863d" - } - ] - }, - { - "text": "A monotonically increasing version tag is embedded in every cache file; a version mismatch causes the entire cache to be discarded and rebuilt.", - "sources": [ - { - "file": "wikifi/cache.py", - "lines": [ - 38, - 38 - ], - "fingerprint": "1ba541fe863d" - } - ] - }, - { - "text": "Entries for files no longer in scope are pruned from the cache between runs.", - "sources": [ - { - "file": "wikifi/orchestrator.py", - "lines": [ - 95, - 110 - ], - "fingerprint": "6ed682a87356" - } - ] - }, - { - "text": "Cache files are stored under a dedicated hidden subdirectory within the wiki output directory, inheriting version-control ignore rules.", - "sources": [ - { - "file": "wikifi/cache.py", - "lines": [ - 19, - 21 - ], - "fingerprint": "1ba541fe863d" - } - ] - }, - { - "text": "Files below 64 bytes (stripped) are silently skipped to prevent inference on effectively empty inputs.", - "sources": [ - { - "file": "wikifi/config.py", - "lines": [ - 56, - 59 - ], - "fingerprint": "8cd2ca53c957" - }, - { - "file": "wikifi/walker.py", - "lines": [ - 61, - 79 - ], - "fingerprint": "a29bd1ad8bdb" - } - ] - }, - { - "text": "Files above 2 MB are silently skipped on the assumption they are vendored, generated, or binary.", - "sources": [ - { - "file": "wikifi/config.py", - "lines": [ - 38, - 56 - ], - 
"fingerprint": "8cd2ca53c957" - }, - { - "file": "wikifi/walker.py", - "lines": [ - 61, - 79 - ], - "fingerprint": "a29bd1ad8bdb" - } - ] - }, - { - "text": "Files between 150 KB and 2 MB are split into overlapping chunks with an 8 KB overlap.", - "sources": [ - { - "file": "wikifi/config.py", - "lines": [ - 38, - 56 - ], - "fingerprint": "8cd2ca53c957" - } - ] - }, - { - "text": "Manifest files are hard-truncated at 20,000 bytes with a visible truncation marker.", - "sources": [ - { - "file": "wikifi/walker.py", - "lines": [ - 220, - 231 - ], - "fingerprint": "a29bd1ad8bdb" - } - ] - }, - { - "text": "All inference calls share a single per-request timeout of 900 seconds as a uniform backstop.", - "sources": [ - { - "file": ".env.example", - "lines": [ - 16, - 29 - ], - "fingerprint": "2e493dbd2d87" - }, - { - "file": "wikifi/config.py", - "lines": [ - 33, - 34 - ], - "fingerprint": "8cd2ca53c957" - } - ] - }, - { - "text": "Directory traversal prunes excluded subtrees before descending into them.", - "sources": [ - { - "file": "wikifi/walker.py", - "lines": [ - 133, - 143 - ], - "fingerprint": "a29bd1ad8bdb" - } - ] - }, - { - "text": "Files carrying no extractable intent are identified and dropped before reaching the inference layer; an empty file must never stall the walk.", - "sources": [ - { - "file": "README.md", - "lines": [ - 44, - 46 - ], - "fingerprint": "996c401d036d" - }, - { - "file": "VISION.md", - "lines": [ - 99, - 100 - ], - "fingerprint": "10651b456a64" - } - ] - }, - { - "text": "Findings from overlap regions between adjacent chunks are deduplicated by section and normalised text within each file's pass.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": [ - 253, - 262 - ], - "fingerprint": "b0e939259557" - } - ] - }, - { - "text": "All inference calls are routed through a provider abstraction layer where observability, retry logic, error normalisation, and backend-switching concerns are centralised.", - "sources": [ - { - "file": 
"CLAUDE.md", - "lines": [ - 53, - 54 - ], - "fingerprint": "ac9698d91de6" - }, - { - "file": "VISION.md", - "lines": [ - 92, - 96 - ], - "fingerprint": "10651b456a64" - } - ] - }, - { - "text": "Structured-output calls require the model response to be validated against a declared schema before being returned to the caller.", - "sources": [ - { - "file": "wikifi/providers/base.py", - "lines": [ - 36, - 38 - ], - "fingerprint": "2750f0f56327" - } - ] - }, - { - "text": "Temperature is hard-pinned to zero on all structured-output calls to enforce determinism; free-text and conversational paths use model-default temperature.", - "sources": [ - { - "file": "wikifi/providers/ollama_provider.py", - "lines": [ - 58, - 68 - ], - "fingerprint": "0a21916665a5" - } - ] - }, - { - "text": "When a backend exposes a reasoning-depth control, the system runs at the highest available setting, prioritising quality over speed.", - "sources": [ - { - "file": "VISION.md", - "lines": [ - 97, - 98 - ], - "fingerprint": "10651b456a64" - }, - { - "file": "wikifi/providers/anthropic_provider.py", - "lines": [ - 212, - 232 - ], - "fingerprint": "872020d40ac3" - } - ] - }, - { - "text": "Hosted backends employ prompt-caching strategies so that only the first call in a walk pays full input-token cost.", - "sources": [ - { - "file": "wikifi/providers/anthropic_provider.py", - "lines": [ - 193, - 210 - ], - "fingerprint": "872020d40ac3" - }, - { - "file": "wikifi/providers/openai_provider.py", - "lines": [ - 13, - 17 - ], - "fingerprint": "428df9ba13f1" - } - ] - }, - { - "text": "Full source traceability is enforced structurally: every wiki assertion must be linkable to its originating file and line range via typed evidence structures.", - "sources": [ - { - "file": "wikifi/evidence.py", - "lines": [ - 1, - 18 - ], - "fingerprint": "dddfe1a01c85" - } - ] - }, - { - "text": "The inference prompt mandates tech-agnostic output, explicitly instructing the model to translate all observations into 
domain terms.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 54, - 67 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "text": "Upstream section content matching known placeholder shapes is filtered out before derivative synthesis to prevent fabrication from empty inputs.", - "sources": [ - { - "file": "wikifi/deriver.py", - "lines": [ - 110, - 135 - ], - "fingerprint": "0b7f4f5abb09" - }, - { - "file": "wikifi/deriver.py", - "lines": [ - 118, - 135 - ], - "fingerprint": "0b7f4f5abb09" - } - ] - }, - { - "text": "The same sentinel strings used in derivation filtering are used by the quality-report layer to exclude placeholder sections from scoring.", - "sources": [ - { - "file": "wikifi/report.py", - "lines": [ - 118, - 123 - ], - "fingerprint": "2b94c0a5e62e" - } - ] - }, - { - "text": "Interactive query sessions instruct the assistant to acknowledge when the wiki does not cover a topic rather than generating unsupported answers.", - "sources": [ - { - "file": "wikifi/chat.py", - "lines": [ - 27, - 31 - ], - "fingerprint": "0333e700a046" - } - ] - }, - { - "text": "Content fingerprints serve three roles: cache keying, citation anchoring, and dependency-graph invalidation.", - "sources": [ - { - "file": "wikifi/fingerprint.py", - "lines": [ - 1, - 18 - ], - "fingerprint": "853400108135" - } - ] - }, - { - "text": "Files are always fingerprinted as raw bytes to ensure consistent identity regardless of encoding assumptions.", - "sources": [ - { - "file": "wikifi/fingerprint.py", - "lines": [ - 44, - 50 - ], - "fingerprint": "853400108135" - } - ] - }, - { - "text": "Authentication schemes declared in API contract files are extracted and categorised by type as migration-critical cross-cutting concerns.", - "sources": [ - { - "file": "wikifi/specialized/openapi.py", - "lines": [ - 110, - 121 - ], - "fingerprint": "ae97781309c4" - } - ] - }, - { - "text": "Uniqueness and non-nullability constraints from schema definitions are extracted as 
storage invariants flagged as migration-critical.", - "sources": [ - { - "file": "wikifi/specialized/sql.py", - "lines": [ - 97, - 98 - ], - "fingerprint": "1ef5e77c4038" - } - ] - }, - { - "text": "Index definitions are recorded as query-performance invariants that the target system must preserve.", - "sources": [ - { - "file": "wikifi/specialized/sql.py", - "lines": [ - 113, - 125 - ], - "fingerprint": "1ef5e77c4038" - } - ] - }, - { - "text": "All specialised parsers return results in the same structured shape as the general inference extractor, preserving a uniform interface contract downstream.", - "sources": [ - { - "file": "wikifi/specialized/__init__.py", - "lines": [ - 9, - 13 - ], - "fingerprint": "84d6c382c745" - } - ] - }, - { - "text": "Rendered section documents are committed to version control; extraction notes and caches are excluded via a generated ignore file.", - "sources": [ - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 249, - 265 - ], - "fingerprint": "3b93f710ebca" - }, - { - "file": "wikifi/wiki.py", - "lines": [ - 96, - 121 - ], - "fingerprint": "9230b7444e0d" - } - ] - }, - { - "text": "Deleting the cache subdirectory forces a full re-walk; deleting the entire working directory resets all pipeline state.", - "sources": [ - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 249, - 265 - ], - "fingerprint": "3b93f710ebca" - } - ] - } - ], - "contradictions": [] - }, - "entities": { - "notes_hash": "aff1a81afdaf", - "body": "The system's domain model spans five functional layers — wiki structure, file classification, extraction, evidence, and review — plus supporting entities for caching, derivation, and chat.\n\n---\n\n## Wiki Structure\n\n**Section** is the central organizing entity. Each section carries a unique identifier, a human-readable title, a prose description of what belongs in it, and a tier (primary or derivative). 
Derivative sections additionally declare an ordered list of upstream section identifiers they depend on, forming an explicit dependency graph. An invariant holds at startup: every upstream identifier in a derivative section's dependency list must refer to a section that appears earlier in the canonical ordering (topological sort enforced).\n\n**WikiLayout** is an immutable value object that encodes the on-disk structure of a wiki workspace. Given a project root, it derives all canonical sub-paths: the wiki directory, configuration file, gitignore file, notes directory, per-section markdown files, and per-section note files. No fields are mutable after construction.\n\n**WalkConfig** is an immutable configuration record consumed by the filesystem walker. It captures the repository root, extra exclusion patterns, a flag for honouring ignore rules, a maximum file size in bytes, and a minimum stripped-content size in bytes.\n\n---\n\n## File Classification and Graph\n\n**FileKind** is a closed enumeration of seven mutually exclusive file roles: application code, SQL, OpenAPI, Protobuf, GraphQL, migration, and other. This classification determines whether a file is routed to a specialised deterministic parser or the general-purpose extraction path.\n\n**GraphNode** represents a single file's position in the repository's import graph. It carries the file's repo-relative path, the ordered set of files it imports, and the ordered set of files that import it. It exposes a capped combined-neighbour list for inclusion in extraction prompts.\n\n**RepoGraph** holds the complete import-edge map for a repository scan. 
It supports node lookup by path and retrieval of a capped neighbour list for any given file, providing cross-file context during extraction.\n\n**DirSummary** is a value object holding aggregate statistics for a single (non-recursive) directory: its repo-relative path, file count, total byte size, a frequency map of the top-10 file extensions, and a tuple of notable filenames (manifests, readmes) present in that directory.\n\n---\n\n## Extraction Layer\n\n**SectionFinding** represents one file's contribution to one wiki section. It carries the target section identifier, a technology-agnostic prose description of the contribution, and an optional inclusive line range within the source chunk.\n\n**FileFindings** groups a one-sentence summary of a file with all `SectionFinding` records produced for it.\n\n**SpecializedFinding** is the output unit of the deterministic parsing paths. It carries a section identifier, a human-readable description, and a list of source references. **SpecializedResult** groups zero or more such findings with an optional summary string; this is the uniform output contract for all specialised extractors, ensuring interoperability with the general extraction path downstream.\n\n**ExtractionStats** is a walk-level counter record, accumulating: total files seen, files yielding at least one finding, total findings, skipped files, chunks processed, cache hits, specialised-extractor invocations, and a per-kind file breakdown.\n\n---\n\n## Evidence Layer\n\n**SourceRef** represents a single span of source: a repo-relative file path, an optional inclusive line range, and a short content fingerprint captured at extraction time for change detection.\n\n**Claim** represents one assertion placed in a wiki section. It carries the markdown text and a list of `SourceRef` values that justify it. 
A claim with no sources is explicitly marked unsupported — this is a first-class state, not an error.\n\n**Contradiction** groups two or more conflicting `Claim` objects about the same topic under a single summary sentence. Each disagreeing position retains its own source references, preserving full traceability.\n\n**EvidenceBundle** is the aggregator's structured output for a single wiki section. It combines the narrative body text, a list of `Claim` records, and a list of `Contradiction` records. The renderer uses the bundle to thread numbered citations and a conflicts block into the final markdown.\n\nDuring aggregation, the pipeline works with intermediate forms: **AggregatedClaim** pairs a single prose assertion with the 1-based indices of the input notes that support it, and **AggregatedContradiction** holds a one-sentence summary alongside multiple conflicting positional claims, each with its own note indices. These are the structured forms that the language model produces before being resolved into the full evidence model.\n\n---\n\n## Cache Entities\n\n**CachedFindings** stores the extraction result for a single file: the file's content fingerprint, the list of structured findings produced, a one-sentence summary, and a count of processed chunks. Its invariant is content-addressed — the fingerprint is the cache key.\n\n**CachedSection** stores the aggregation result for a single wiki section: the hash of the notes payload that produced it, the rendered markdown body, and lists of claims and contradictions. It too is content-addressed on the notes hash.\n\n**WalkCache** is the in-memory container for both caches. 
It holds extraction and aggregation entries alongside hit and miss counters, enabling observability into cache effectiveness across a run.\n\n---\n\n## Quality and Review Layer\n\n**Critique** captures the quality assessment of a single section: an integer score (0–10), a short overall judgment, a list of unsupported claims, a list of gaps relative to the section brief, and a list of concrete revision suggestions.\n\n**ReviewOutcome** tracks a section's review lifecycle: the section identifier, the initial critique, the current body text, a flag indicating whether a revision was applied, and the optional follow-up critique produced after revision.\n\n**WikiQualityReport** aggregates the full-wiki audit: an overall numeric score, a mapping from section identifiers to their individual `Critique` records, and optional coverage statistics.\n\n**CoverageStats** records total files seen, files with findings, and per-section breakdowns of finding counts and contributing file counts; it exposes a coverage-percentage computation.\n\n**SectionReport** captures the per-section view for reporting: the section descriptor, count of contributing files, total findings count, body size in characters, an emptiness flag, and an optional quality critique.\n\n**WikiReport** aggregates all `SectionReport` records alongside overall coverage statistics and an optional mean quality score across populated sections.\n\n---\n\n## Derivation and Pipeline Outputs\n\n**IntrospectionResult** captures the Stage 1 decision about which files are worth deeper analysis: a list of gitignore-style include patterns, a list of exclude patterns, a list of primary languages (informational), a one-paragraph guess at the system's purpose, and a rationale for the choices made.\n\n**AggregationStats** records, for a single aggregation run, how many sections were written fresh, skipped due to empty notes, or served from cache.\n\n**DerivationStats** accumulates pipeline metrics for the derivation stage: counts 
of sections derived, skipped, and revised, plus the full list of `ReviewOutcome` records. It acts as an audit trail for the synthesis stage.\n\n**WalkReport** is the single return value of a completed wiki-generation run, aggregating the introspection result, extraction statistics, aggregation statistics, derivation statistics, the live cache state, and the repository import graph.\n\n---\n\n## Chat Layer\n\n**ChatMessage** carries a role and a content field, representing a single turn in a multi-turn conversation. Lists of these are accumulated to maintain conversation history.\n\n**LoadedSection** pairs a `Section` descriptor with its rendered markdown body, representing a single populated section ready for inclusion in a chat context.\n\n**ChatSession** holds a provider reference, the frozen system prompt built from wiki sections, and the accumulated conversation history as an ordered list of `ChatMessage` records. It supports appending user and assistant turns and clearing history while retaining the wiki context.\n\n---\n\n## Relationships and Invariants Summary\n\n| Entity | Key relationships | Notable invariants |\n|---|---|---|\n| Section | depends on upstream Sections (derivative tier only) | Dependency graph must be topologically ordered |\n| WikiLayout | derived from a project root | Immutable; all paths are computed, not stored independently |\n| SourceRef | referenced by Claim, SpecializedFinding | Fingerprint enables staleness detection |\n| Claim | groups SourceRefs; composed into EvidenceBundle | Sourceless claims are explicitly flagged unsupported |\n| Contradiction | groups ≥2 conflicting Claims | Each position retains its own SourceRefs |\n| CachedFindings | keyed on file content fingerprint | Cache miss if fingerprint changes |\n| CachedSection | keyed on notes-payload hash | Cache miss if any upstream note changes |\n| ReviewOutcome | holds pre- and post-revision Critique | Revision flag distinguishes touched from untouched sections |\n| 
WalkReport | aggregates all four stage outputs | Single return value for a complete run |", - "claims": [ - { - "text": "A Section entity carries a unique identifier, human-readable title, prose description, tier (primary or derivative), and an ordered list of upstream section identifiers for derivative sections, forming an explicit dependency graph.", - "sources": [ - { - "file": "wikifi/sections.py", - "lines": [ - 30, - 40 - ], - "fingerprint": "f743972a8fce" - }, - { - "file": "wikifi/deriver.py", - "lines": [ - 112, - 116 - ], - "fingerprint": "0b7f4f5abb09" - } - ] - }, - { - "text": "An invariant holds at startup: every upstream identifier in a derivative section's dependency list must refer to a section that appears earlier in the canonical ordering (topological sort enforced).", - "sources": [ - { - "file": "wikifi/sections.py", - "lines": [ - 30, - 40 - ], - "fingerprint": "f743972a8fce" - } - ] - }, - { - "text": "WikiLayout is an immutable value object that encodes the on-disk wiki workspace structure and derives all canonical sub-paths from a project root.", - "sources": [ - { - "file": "wikifi/cli.py", - "lines": [ - 166, - 172 - ], - "fingerprint": "f326383c7da1" - }, - { - "file": "wikifi/wiki.py", - "lines": [ - 34, - 61 - ], - "fingerprint": "9230b7444e0d" - } - ] - }, - { - "text": "WalkConfig is an immutable configuration record capturing repository root, extra exclusion patterns, gitignore-honouring flag, maximum file size in bytes, and minimum stripped-content size in bytes.", - "sources": [ - { - "file": "wikifi/walker.py", - "lines": [ - 61, - 79 - ], - "fingerprint": "a29bd1ad8bdb" - } - ] - }, - { - "text": "FileKind is a closed enumeration of seven mutually exclusive file roles (application code, SQL, OpenAPI, Protobuf, GraphQL, migration, other), driving routing to specialised or general-purpose extraction paths.", - "sources": [ - { - "file": "README.md", - "lines": [ - 31, - 33 - ], - "fingerprint": "996c401d036d" - }, - { - "file": 
"wikifi/repograph.py", - "lines": [ - 41, - 52 - ], - "fingerprint": "3d8bbdb10112" - } - ] - }, - { - "text": "GraphNode represents a single file's position in the import graph, carrying its repo-relative path, the ordered set of files it imports, and the ordered set of files that import it, and exposes a capped combined-neighbour list.", - "sources": [ - { - "file": "wikifi/repograph.py", - "lines": [ - 148, - 167 - ], - "fingerprint": "3d8bbdb10112" - } - ] - }, - { - "text": "RepoGraph holds the complete per-file import-edge map, supporting node lookup by path and retrieval of a capped neighbour list for any file.", - "sources": [ - { - "file": "wikifi/repograph.py", - "lines": [ - 170, - 181 - ], - "fingerprint": "3d8bbdb10112" - } - ] - }, - { - "text": "DirSummary is a value object holding aggregate statistics for a single non-recursive directory: path, file count, total byte size, top-10 extension frequency map, and notable filenames.", - "sources": [ - { - "file": "wikifi/walker.py", - "lines": [ - 144, - 153 - ], - "fingerprint": "a29bd1ad8bdb" - } - ] - }, - { - "text": "SectionFinding represents one file's contribution to one wiki section, carrying the target section identifier, a technology-agnostic prose description, and an optional inclusive line range.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": [ - 106, - 123 - ], - "fingerprint": "b0e939259557" - } - ] - }, - { - "text": "FileFindings groups a one-sentence file summary with all SectionFinding records produced for that file.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": [ - 106, - 123 - ], - "fingerprint": "b0e939259557" - } - ] - }, - { - "text": "SpecializedFinding carries a section identifier, a human-readable description, and a list of source references. 
SpecializedResult groups zero or more such findings with an optional summary string and is the uniform output contract for all specialised extractors.", - "sources": [ - { - "file": "wikifi/specialized/__init__.py", - "lines": [ - 29, - 38 - ], - "fingerprint": "84d6c382c745" - } - ] - }, - { - "text": "ExtractionStats accumulates walk-level counters: total files seen, files with findings, total findings, skipped files, chunks processed, cache hits, specialised-extractor invocations, and a per-kind file breakdown.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": [ - 126, - 135 - ], - "fingerprint": "b0e939259557" - } - ] - }, - { - "text": "SourceRef represents a single span of source: a repo-relative file path, an optional inclusive line range, and a short content fingerprint captured at extraction time.", - "sources": [ - { - "file": "wikifi/evidence.py", - "lines": [ - 37, - 52 - ], - "fingerprint": "dddfe1a01c85" - }, - { - "file": "README.md", - "lines": [ - 37, - 39 - ], - "fingerprint": "996c401d036d" - } - ] - }, - { - "text": "Claim carries markdown text and a list of SourceRefs that justify it; a claim with no sources is explicitly marked unsupported.", - "sources": [ - { - "file": "wikifi/evidence.py", - "lines": [ - 55, - 67 - ], - "fingerprint": "dddfe1a01c85" - }, - { - "file": "wikifi/aggregator.py", - "lines": [ - 166, - 186 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "text": "Contradiction groups two or more conflicting Claims about the same topic under a single summary sentence, with each disagreeing position retaining its own source references.", - "sources": [ - { - "file": "wikifi/evidence.py", - "lines": [ - 70, - 77 - ], - "fingerprint": "dddfe1a01c85" - }, - { - "file": "wikifi/aggregator.py", - "lines": [ - 74, - 101 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "text": "EvidenceBundle combines the narrative body text, a list of Claims, and a list of Contradictions for a single wiki section; the 
renderer uses it to thread citations and a conflicts block into the final markdown.", - "sources": [ - { - "file": "README.md", - "lines": [ - 46, - 48 - ], - "fingerprint": "996c401d036d" - }, - { - "file": "wikifi/evidence.py", - "lines": [ - 80, - 85 - ], - "fingerprint": "dddfe1a01c85" - }, - { - "file": "wikifi/aggregator.py", - "lines": [ - 166, - 186 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "text": "AggregatedClaim pairs a single prose assertion with the 1-based indices of the input notes that support it.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 74, - 101 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "text": "AggregatedContradiction holds a one-sentence summary alongside multiple conflicting positional claims, each with its own note indices.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 74, - 101 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "text": "CachedFindings stores the extraction result for a single file: content fingerprint, list of structured findings, one-sentence summary, and chunk count; it is content-addressed on the fingerprint.", - "sources": [ - { - "file": "wikifi/cache.py", - "lines": [ - 44, - 51 - ], - "fingerprint": "1ba541fe863d" - } - ] - }, - { - "text": "CachedSection stores the aggregation result for a single wiki section: the hash of the notes payload, the rendered markdown body, and lists of claims and contradictions; it is content-addressed on the notes hash.", - "sources": [ - { - "file": "wikifi/cache.py", - "lines": [ - 54, - 60 - ], - "fingerprint": "1ba541fe863d" - } - ] - }, - { - "text": "WalkCache is the in-memory container for both caches, holding extraction and aggregation entries alongside hit and miss counters.", - "sources": [ - { - "file": "wikifi/cache.py", - "lines": [ - 63, - 70 - ], - "fingerprint": "1ba541fe863d" - } - ] - }, - { - "text": "AggregationStats records how many sections were written fresh, skipped due to empty 
notes, or served from cache during a single aggregation run.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 103, - 107 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "text": "Critique captures the quality assessment of a single section: an integer score (0–10), a short overall judgment, a list of unsupported claims, a list of gaps relative to the section brief, and a list of revision suggestions.", - "sources": [ - { - "file": "wikifi/critic.py", - "lines": [ - 67, - 84 - ], - "fingerprint": "502af9aee392" - } - ] - }, - { - "text": "ReviewOutcome tracks a section's review lifecycle: the section identifier, initial critique, current body text, a revision-applied flag, and an optional follow-up critique.", - "sources": [ - { - "file": "wikifi/critic.py", - "lines": [ - 91, - 96 - ], - "fingerprint": "502af9aee392" - } - ] - }, - { - "text": "WikiQualityReport aggregates an overall numeric score, a mapping from section identifiers to individual Critique records, and optional coverage statistics.", - "sources": [ - { - "file": "wikifi/critic.py", - "lines": [ - 99, - 114 - ], - "fingerprint": "502af9aee392" - } - ] - }, - { - "text": "CoverageStats records total files seen, files with findings, and per-section breakdowns of finding counts and contributing file counts, and exposes a coverage-percentage computation.", - "sources": [ - { - "file": "wikifi/critic.py", - "lines": [ - 99, - 114 - ], - "fingerprint": "502af9aee392" - }, - { - "file": "wikifi/report.py", - "lines": [ - 85, - 94 - ], - "fingerprint": "2b94c0a5e62e" - } - ] - }, - { - "text": "SectionReport captures the per-section view: section descriptor, contributing file count, total findings count, body size in characters, an emptiness flag, and an optional quality critique.", - "sources": [ - { - "file": "wikifi/report.py", - "lines": [ - 28, - 42 - ], - "fingerprint": "2b94c0a5e62e" - } - ] - }, - { - "text": "WikiReport aggregates all SectionReport records alongside 
overall coverage statistics and an optional mean quality score.", - "sources": [ - { - "file": "wikifi/report.py", - "lines": [ - 28, - 42 - ], - "fingerprint": "2b94c0a5e62e" - } - ] - }, - { - "text": "IntrospectionResult captures the Stage 1 decision: include/exclude gitignore-style patterns, primary languages (informational), a one-paragraph purpose guess, and a rationale.", - "sources": [ - { - "file": "wikifi/introspection.py", - "lines": [ - 47, - 64 - ], - "fingerprint": "59cd5940f72e" - } - ] - }, - { - "text": "DerivationStats accumulates pipeline metrics for the derivation stage: derived, skipped, and revised counts, plus the full list of ReviewOutcome records.", - "sources": [ - { - "file": "wikifi/deriver.py", - "lines": [ - 57, - 62 - ], - "fingerprint": "0b7f4f5abb09" - } - ] - }, - { - "text": "WalkReport aggregates the introspection result, extraction statistics, aggregation statistics, derivation statistics, the live cache state, and the repository import graph; it is the single return value of a completed wiki-generation run.", - "sources": [ - { - "file": "wikifi/orchestrator.py", - "lines": [ - 54, - 61 - ], - "fingerprint": "6ed682a87356" - }, - { - "file": "wikifi/cli.py", - "lines": [ - 118, - 153 - ], - "fingerprint": "f326383c7da1" - } - ] - }, - { - "text": "ChatMessage carries a role and content field, representing a single turn in a multi-turn conversation.", - "sources": [ - { - "file": "wikifi/providers/base.py", - "lines": [ - 28, - 30 - ], - "fingerprint": "2750f0f56327" - } - ] - }, - { - "text": "LoadedSection pairs a Section descriptor with its rendered markdown body.", - "sources": [ - { - "file": "wikifi/chat.py", - "lines": [ - 42, - 45 - ], - "fingerprint": "0333e700a046" - } - ] - }, - { - "text": "ChatSession holds a provider reference, a frozen system prompt built from wiki sections, and an accumulated conversation history of ChatMessage records; it supports appending turns and clearing history while retaining wiki 
context.", - "sources": [ - { - "file": "wikifi/chat.py", - "lines": [ - 46, - 57 - ], - "fingerprint": "0333e700a046" - } - ] - }, - { - "text": "Specialised extractors surface domain entities from structured artifacts: SQL CREATE TABLE statements are treated as domain entities capturing table name, columns, foreign keys, and storage constraints; ALTER TABLE statements track schema evolution per entity.", - "sources": [ - { - "file": "wikifi/specialized/sql.py", - "lines": [ - 64, - 84 - ], - "fingerprint": "1ef5e77c4038" - }, - { - "file": "wikifi/specialized/sql.py", - "lines": [ - 99, - 111 - ], - "fingerprint": "1ef5e77c4038" - } - ] - }, - { - "text": "GraphQL object types (excluding operation roots), interfaces, input types, and enums are each treated as distinct entity-level constructs by the GraphQL specialised extractor.", - "sources": [ - { - "file": "wikifi/specialized/graphql.py", - "lines": [ - 32, - 81 - ], - "fingerprint": "bbb305e0d47f" - } - ] - }, - { - "text": "Protobuf message types and enum types are extracted from interface definition files, grouped by package namespace, with counts truncated after 25 items.", - "sources": [ - { - "file": "wikifi/specialized/protobuf.py", - "lines": [ - 44, - 68 - ], - "fingerprint": "e20d5913745a" - } - ] - }, - { - "text": "OpenAPI component schemas are extracted as canonical data models (up to 25, with overflow count) and surfaced as entity-level findings.", - "sources": [ - { - "file": "wikifi/specialized/openapi.py", - "lines": [ - 94, - 108 - ], - "fingerprint": "ae97781309c4" - } - ] - } - ], - "contradictions": [] - } - } -} \ No newline at end of file diff --git a/.wikifi/.cache/extraction.json b/.wikifi/.cache/extraction.json deleted file mode 100644 index 6013d81..0000000 --- a/.wikifi/.cache/extraction.json +++ /dev/null @@ -1,5025 +0,0 @@ -{ - "version": 1, - "saved_at": "2026-05-02T03:10:48.123655+00:00", - "entries": { - ".env.example": { - "fingerprint": "2e493dbd2d87", - "summary": "Example 
environment configuration exposing all tuneable runtime parameters for the wikifi documentation-generation system.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "wikifi is a tool that generates wiki documentation from source codebases using a local large-language model. It is designed to prioritise documentation quality over processing speed, and includes guards to prevent runaway behaviour on near-empty or oversized files.", - "sources": [ - { - "file": ".env.example", - "lines": [ - 1, - 2 - ], - "fingerprint": "2e493dbd2d87" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system processes source files through at least two pipeline stages: a Stage 1 introspection pass (which receives a configurable directory-tree depth) and a per-file extraction pass. Files outside configurable size bounds are skipped to avoid wasting model inference time.", - "sources": [ - { - "file": ".env.example", - "lines": [ - 20, - 29 - ], - "fingerprint": "2e493dbd2d87" - } - ] - }, - { - "section_id": "external_dependencies", - "finding": "Ollama is the sole supported LLM provider in version 1, serving models locally over HTTP (default endpoint http://localhost:11434). The system is designed to work with reasoning-capable models such as Qwen3 and DeepSeek-R1, which support a 'thinking mode' that trades latency for output depth.", - "sources": [ - { - "file": ".env.example", - "lines": [ - 7, - 14 - ], - "fingerprint": "2e493dbd2d87" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Per-request timeouts (default 900 seconds) are set to accommodate high-thinking model runs on real source files. 
Minimum and maximum file-size thresholds act as integrity guards: the minimum prevents thinking-mode runaway on stub files, while the maximum prevents processing files too large to be useful.", - "sources": [ - { - "file": ".env.example", - "lines": [ - 16, - 29 - ], - "fingerprint": "2e493dbd2d87" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Only the 'ollama' provider is supported in v1. The default request timeout is 900 seconds (15 minutes). Fully disabling thinking mode ('false') is documented as unsafe with Qwen3 models because those models ignore the JSON-schema output constraint and emit free text instead.", - "sources": [ - { - "file": ".env.example", - "lines": [ - 7, - 44 - ], - "fingerprint": "2e493dbd2d87" - } - ] - } - ] - }, - ".gitignore": { - "fingerprint": "493b2310ee7c", - "summary": "Standard version-control ignore file for a full-stack project with backend and frontend components.", - "chunks_processed": 1, - "findings": [] - }, - ".mcp.json": { - "fingerprint": "b6b856cb3fe2", - "summary": "MCP server configuration wiring together several external tool/API integrations used during development or runtime.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "external_dependencies", - "finding": "A locally-running web-crawling service is depended upon at a fixed local address (port 3002), requiring no API key, suggesting an self-hosted crawling capability used by the system.", - "sources": [ - { - "file": ".mcp.json", - "lines": [ - 14, - 20 - ], - "fingerprint": "b6b856cb3fe2" - } - ] - }, - { - "section_id": "external_dependencies", - "finding": "Google's AI/generative API is consumed under the key named GOOGLE_API_KEY, used by at least two registered server integrations (nano-banana and stitch).", - "sources": [ - { - "file": ".mcp.json", - "lines": [ - 4, - 8 - ], - "fingerprint": "b6b856cb3fe2" - } - ] - }, - { - "section_id": "external_dependencies", - "finding": "An external documentation/context lookup 
service (context7) is called over HTTP using a dedicated API key, likely to enrich prompts or retrieve up-to-date library documentation.", - "sources": [ - { - "file": ".mcp.json", - "lines": [ - 22, - 28 - ], - "fingerprint": "b6b856cb3fe2" - } - ] - }, - { - "section_id": "external_dependencies", - "finding": "A Google-hosted MCP-compatible service called 'stitch' is consumed over HTTP, authenticated via the Google API key, purpose not fully specified but likely an orchestration or data-stitching capability.", - "sources": [ - { - "file": ".mcp.json", - "lines": [ - 29, - 35 - ], - "fingerprint": "b6b856cb3fe2" - } - ] - }, - { - "section_id": "integrations", - "finding": "Four tool-server integrations are declared: a local banana/AI utility, a local web crawler, a remote documentation context service, and a remote Google stitching service — suggesting the system acts as an MCP client that fans out to multiple capability providers.", - "sources": [ - { - "file": ".mcp.json", - "lines": [ - 2, - 36 - ], - "fingerprint": "b6b856cb3fe2" - } - ] - } - ] - }, - "CLAUDE.md": { - "fingerprint": "ac9698d91de6", - "summary": "Developer and agent operating guide for the wikifi CLI library, capturing tooling rules, code constraints, architectural invariants, and workflow conventions.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "wikifi exists to describe what a legacy system does — producing a technology-agnostic wiki of its capabilities and domain model — so that migration teams can consume that knowledge without the tool itself prescribing any target architecture, language, or framework.", - "sources": [ - { - "file": "CLAUDE.md", - "lines": [ - 73, - 75 - ], - "fingerprint": "ac9698d91de6" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system exposes two primary commands: `init` (initialise wiki scaffolding against a repository) and `walk` (traverse the repository, extract per-file findings, and synthesise wiki 
sections). The walk is responsible for repository introspection, empty-file filtering, deterministic per-file extraction, and multi-file synthesis — in that order.", - "sources": [ - { - "file": "CLAUDE.md", - "lines": [ - 60, - 72 - ], - "fingerprint": "ac9698d91de6" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "The system must run against a local LLM out of the box with no cloud dependency required; hosted backends (Anthropic, OpenAI, custom) are valid additional options but never the default.", - "sources": [ - { - "file": "CLAUDE.md", - "lines": [ - 51, - 52 - ], - "fingerprint": "ac9698d91de6" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Provider abstraction is mandatory: swapping the LLM backend must not require changes outside the provider boundary.", - "sources": [ - { - "file": "CLAUDE.md", - "lines": [ - 53, - 54 - ], - "fingerprint": "ac9698d91de6" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "When the chosen model exposes a reasoning or thinking level, the system must run at the highest available setting; lower reasoning levels are opt-in only.", - "sources": [ - { - "file": "CLAUDE.md", - "lines": [ - 55, - 56 - ], - "fingerprint": "ac9698d91de6" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Test coverage target is ≥ 85%; every feature must ship with tests.", - "sources": [ - { - "file": "CLAUDE.md", - "lines": [ - 45, - 46 - ], - "fingerprint": "ac9698d91de6" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "wikifi is strictly a feature-extraction tool: it describes what the legacy system does and must never transform source into any target architecture, language, or framework shape.", - "sources": [ - { - "file": "CLAUDE.md", - "lines": [ - 73, - 75 - ], - "fingerprint": "ac9698d91de6" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Derivative wiki sections (personas, user stories, diagrams) must be produced only 
after primary content sections are complete and must never be inferred from a single file.", - "sources": [ - { - "file": "CLAUDE.md", - "lines": [ - 66, - 72 - ], - "fingerprint": "ac9698d91de6" - } - ] - }, - { - "section_id": "external_dependencies", - "finding": "A local LLM runtime (e.g. Ollama) is the default inference backend, requiring no external network dependency for core operation.", - "sources": [ - { - "file": "CLAUDE.md", - "lines": [ - 51, - 52 - ], - "fingerprint": "ac9698d91de6" - } - ] - }, - { - "section_id": "external_dependencies", - "finding": "Hosted Anthropic and hosted OpenAI are supported as optional alternative inference backends, reachable through the mandatory provider abstraction layer.", - "sources": [ - { - "file": "CLAUDE.md", - "lines": [ - 53, - 54 - ], - "fingerprint": "ac9698d91de6" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "A pre-commit hook auto-fixes lint and re-stages changed files; a pre-push hook runs the full test suite and gates the push, ensuring the main branch remains deployable at all times.", - "sources": [ - { - "file": "CLAUDE.md", - "lines": [ - 47, - 48 - ], - "fingerprint": "ac9698d91de6" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "A mandatory provider abstraction layer insulates all LLM calls, so observability, retry logic, and backend-switching concerns are centralised at that boundary rather than scattered across extraction logic.", - "sources": [ - { - "file": "CLAUDE.md", - "lines": [ - 53, - 54 - ], - "fingerprint": "ac9698d91de6" - } - ] - } - ] - }, - "CODE-FORMAT.md": { - "fingerprint": "b5e0603faf44", - "summary": "Project conventions and tooling guide defining how software is built, structured, tested, and deployed — serves as the single source of truth for agents and humans working on any project.", - "chunks_processed": 1, - "findings": [] - }, - "README.md": { - "fingerprint": "369c47fa5d27", - "summary": "Top-level README describing wikifi's purpose, CLI 
surface, architecture, and technology choices as a codebase-analysis tool that produces technology-agnostic wiki content.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "wikifi exists to walk an existing codebase and produce a technology-agnostic extraction of its features, domains, and delivered value — the output is designed to guide re-implementation in a modern stack while preserving what the original system actually does for its users. The tool treats understanding intent and capabilities as a first-class problem, separate from any specific language or framework.", - "sources": [ - { - "file": "README.md", - "lines": [ - 1, - 5 - ], - "fingerprint": "369c47fa5d27" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system provides a one-time project setup command that scaffolds a local configuration directory, and a primary walk command that traverses a target codebase and produces structured wiki content organized into primary capture sections (domains, entities, capabilities, etc.) 
and derivative sections (personas, user stories, diagrams).", - "sources": [ - { - "file": "README.md", - "lines": [ - 20, - 27 - ], - "fingerprint": "369c47fa5d27" - } - ] - }, - { - "section_id": "capabilities", - "finding": "An interactive query capability lets practitioners ask natural-language questions against the extracted wiki content, optionally injected with context from the target codebase; a REPL-style chat mode supports iterative exploration of both the wiki and the source.", - "sources": [ - { - "file": "README.md", - "lines": [ - 28, - 29 - ], - "fingerprint": "369c47fa5d27" - } - ] - }, - { - "section_id": "capabilities", - "finding": "A coverage and quality reporting command surfaces per-section statistics (contributing file count, finding count, body size) and can invoke an automated quality scorer to assign 0–10 scores to each populated section.", - "sources": [ - { - "file": "README.md", - "lines": [ - 26, - 27 - ], - "fingerprint": "369c47fa5d27" - } - ] - }, - { - "section_id": "capabilities", - "finding": "An opt-in critic-and-reviser loop runs a quality pass on derivative sections: it scores each section's body against its brief and upstream evidence, flags unsupported claims, and re-synthesizes the section when quality falls below threshold — accepting the revision only if it scores at least as well as the original.", - "sources": [ - { - "file": "README.md", - "lines": [ - 52, - 53 - ], - "fingerprint": "369c47fa5d27" - } - ] - }, - { - "section_id": "domains", - "finding": "The core domain is codebase knowledge extraction: the system understands a target repository's structure (manifests, layout, import graphs, file kinds) and distills that understanding into a structured, technology-agnostic wiki. 
Subdomains include repository introspection, per-file extraction, section synthesis, and quality assurance of generated content.", - "sources": [ - { - "file": "README.md", - "lines": [ - 32, - 55 - ], - "fingerprint": "369c47fa5d27" - } - ] - }, - { - "section_id": "domains", - "finding": "A secondary domain is provider abstraction: the system decouples the extraction intelligence from any specific AI backend, allowing local and hosted inference providers to be swapped without changing the extraction pipeline.", - "sources": [ - { - "file": "README.md", - "lines": [ - 57, - 63 - ], - "fingerprint": "369c47fa5d27" - } - ] - }, - { - "section_id": "entities", - "finding": "A SourceRef captures the provenance of each extracted finding: it records the file path, line range, and a content fingerprint, enabling downstream citation and traceability back to source.", - "sources": [ - { - "file": "README.md", - "lines": [ - 43, - 44 - ], - "fingerprint": "369c47fa5d27" - } - ] - }, - { - "section_id": "entities", - "finding": "An EvidenceBundle is the output of section aggregation: it contains a synthesized body, a set of supported claims, and any contradictions detected across per-file findings. The renderer uses it to thread numbered citations and a conflicts block into the final section markdown.", - "sources": [ - { - "file": "README.md", - "lines": [ - 48, - 50 - ], - "fingerprint": "369c47fa5d27" - } - ] - }, - { - "section_id": "entities", - "finding": "FileKind classifies each in-scope file as one of: application code, SQL, OpenAPI, Protobuf, GraphQL, migration, or other. 
This classification determines whether the file is routed through the LLM extractor or a deterministic specialized parser.", - "sources": [ - { - "file": "README.md", - "lines": [ - 35, - 37 - ], - "fingerprint": "369c47fa5d27" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "All extraction findings are stored in a content-addressed cache keyed by the tuple (relative file path, SHA-256 of file bytes); aggregation bodies are keyed by a hash of the section's notes payload. This design provides free resumability after a crash and allows re-walks to skip files whose content has not changed.", - "sources": [ - { - "file": "README.md", - "lines": [ - 45, - 47 - ], - "fingerprint": "369c47fa5d27" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Input filtering is applied before any file reaches the extraction agent: stub files, empty fixtures, and machine-generated artifacts are recognized and skipped; size bounds on raw and stripped content are enforced via configuration so oversized or unstructured files never stall the walk.", - "sources": [ - { - "file": "README.md", - "lines": [ - 47, - 48 - ], - "fingerprint": "369c47fa5d27" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "The Anthropic backend uses prompt caching (ephemeral cache control on the system prompt) so the large extraction prompt is paid for only once across potentially hundreds of per-file calls, reducing both latency and cost at scale.", - "sources": [ - { - "file": "README.md", - "lines": [ - 60, - 61 - ], - "fingerprint": "369c47fa5d27" - } - ] - }, - { - "section_id": "external_dependencies", - "finding": "A local Ollama inference server is the default AI backend; it is expected to host a thinking-capable model (referenced as Qwen 3 27B) at the highest available reasoning level, and its endpoint is configurable via environment variable.", - "sources": [ - { - "file": "README.md", - "lines": [ - 68, - 70 - ], - "fingerprint": "369c47fa5d27" - } - ] - }, - { 
- "section_id": "external_dependencies", - "finding": "Anthropic's hosted API is an opt-in inference backend, selected via environment variable; the integration takes advantage of Anthropic's prompt-caching feature to amortize the cost of a large system prompt across many calls.", - "sources": [ - { - "file": "README.md", - "lines": [ - 60, - 61 - ], - "fingerprint": "369c47fa5d27" - } - ] - }, - { - "section_id": "external_dependencies", - "finding": "OpenAI's hosted API is a second opt-in inference backend; the integration routes a configurable reasoning-effort knob to OpenAI's reasoning_effort parameter for compatible models, and relies on OpenAI's automatic prefix caching rather than explicit cache markers.", - "sources": [ - { - "file": "README.md", - "lines": [ - 62, - 63 - ], - "fingerprint": "369c47fa5d27" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "The extraction cache key for per-file findings must be the tuple (relative path, SHA-256 of file bytes); the aggregation cache key must be a hash of the section's notes payload. 
These are the canonical identifiers for cache hit/miss decisions and must be preserved for cache compatibility.", - "sources": [ - { - "file": "README.md", - "lines": [ - 45, - 47 - ], - "fingerprint": "369c47fa5d27" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "The critic-and-reviser loop must only accept a revised section if the revision's quality score is at least as high as the original's score; accepting a lower-scoring revision is explicitly prohibited.", - "sources": [ - { - "file": "README.md", - "lines": [ - 52, - 53 - ], - "fingerprint": "369c47fa5d27" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "The walk scope decision is made once during repository introspection and is deterministic — the agent must not re-pick or alter scope mid-walk.", - "sources": [ - { - "file": "README.md", - "lines": [ - 34, - 35 - ], - "fingerprint": "369c47fa5d27" - } - ] - } - ] - }, - "TESTING-AND-DEMO.md": { - "fingerprint": "3b93f710ebca", - "summary": "Developer-facing documentation covering how to test and demonstrate the premium pipeline features of the wikifi wiki-generation tool, including setup, feature verification steps, and teardown instructions.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "The system exists to automatically generate technology-agnostic wiki documentation from a source-code repository. 
It extracts domain knowledge, entities, capabilities, and cross-cutting concerns from code files using a combination of deterministic parsers and language-model-based extraction, then aggregates and refines the results into committed markdown sections.", - "sources": [ - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 1, - 6 - ], - "fingerprint": "3b93f710ebca" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system performs incremental, content-addressed extraction walks over a repository, serving previously processed files from a persistent cache and skipping redundant language-model calls. A walk can be interrupted and resumed without losing progress, since the cache is flushed after every completed file.", - "sources": [ - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 67, - 88 - ], - "fingerprint": "3b93f710ebca" - } - ] - }, - { - "section_id": "capabilities", - "finding": "Each generated wiki section carries source-traceable citations (file path and line range) so readers can verify where each claim originates. 
Where the aggregation step detects disagreement across files, the section also renders a Conflicts block enumerating each conflicting position alongside its sources.", - "sources": [ - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 40, - 66 - ], - "fingerprint": "3b93f710ebca" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The walker builds an import graph across the repository and injects a Neighbor files block into each extraction prompt, giving the language model cross-file context about which modules a file imports from and which modules depend on it.", - "sources": [ - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 90, - 114 - ], - "fingerprint": "3b93f710ebca" - } - ] - }, - { - "section_id": "capabilities", - "finding": "Specialized, deterministic parsers handle structured schema files (SQL DDL, Protobuf, GraphQL, OpenAPI YAML/JSON, and migration scripts) without invoking a language model, producing findings for entities, integrations, and cross-cutting invariants directly from the schema syntax.", - "sources": [ - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 116, - 149 - ], - "fingerprint": "3b93f710ebca" - } - ] - }, - { - "section_id": "capabilities", - "finding": "An optional critic-and-reviser pass evaluates derivative sections (personas, user stories, diagrams) against a quality threshold and rewrites any that fall below it, accepting the revision only when it scores at least as well as the original.", - "sources": [ - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 151, - 164 - ], - "fingerprint": "3b93f710ebca" - } - ] - }, - { - "section_id": "capabilities", - "finding": "A report command produces a markdown table summarising every wiki section by contributing file count, finding count, body size, critic-derived quality score (0–10), and the highest-priority content gap identified by the critic.", - "sources": [ - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 166, - 186 - ], - "fingerprint": "3b93f710ebca" - } - ] - 
}, - { - "section_id": "external_dependencies", - "finding": "Ollama serves as the default local language-model backend; the model is configured via the repository's config file (defaulting to qwen3.6:27b). No external service or API key is required to use this path.", - "sources": [ - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 27, - 32 - ], - "fingerprint": "3b93f710ebca" - } - ] - }, - { - "section_id": "external_dependencies", - "finding": "The Anthropic API is an opt-in hosted language-model backend. The system sets an ephemeral cache-control marker on the system-prompt block so that repeated per-file extraction calls read the cached prompt at roughly 10% of the normal input token price.", - "sources": [ - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 188, - 209 - ], - "fingerprint": "3b93f710ebca" - } - ] - }, - { - "section_id": "external_dependencies", - "finding": "The OpenAI API is a second opt-in hosted backend. It relies on OpenAI's automatic prefix caching (no explicit marker required; prefixes of at least 1024 tokens are cached for approximately 5–10 minutes). 
The integration also routes a 'think' intensity knob to the reasoning_effort parameter on reasoning-capable models (o-series and gpt-5), while omitting that parameter for standard models.", - "sources": [ - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 210, - 236 - ], - "fingerprint": "3b93f710ebca" - } - ] - }, - { - "section_id": "integrations", - "finding": "Azure OpenAI deployments and corporate reverse-proxy endpoints are supported by overriding the base URL for the OpenAI provider, either via an environment variable or a constructor parameter, with no other changes to the calling code.", - "sources": [ - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 232, - 235 - ], - "fingerprint": "3b93f710ebca" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Extraction results are persisted in a content-addressed cache under .wikifi/.cache/ and are written after every individual file completes. This ensures that a process crash or manual interruption at any point does not require re-processing already-completed files.", - "sources": [ - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 67, - 88 - ], - "fingerprint": "3b93f710ebca" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "The full pipeline state is isolated to the .wikifi/ directory: committed markdown sections live at the root of that directory, per-section JSONL findings are stored in .notes/ (gitignored), and extraction/aggregation caches are stored in .cache/ (gitignored). Deleting .cache/ forces a full re-walk; deleting the entire directory resets all state.", - "sources": [ - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 249, - 265 - ], - "fingerprint": "3b93f710ebca" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "The test suite must include exactly 156 passing tests with total line coverage at or above 93%. 
Every new module must individually reach at least 86% coverage, and each premium-pipeline module (fingerprint, cache, evidence, critic, report, repograph, specialized parsers, and the Anthropic provider) must carry a dedicated test file.", - "sources": [ - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 20, - 30 - ], - "fingerprint": "3b93f710ebca" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "The Anthropic provider must place cache_control of type 'ephemeral' on the system-prompt block, use the messages.parse structured-output contract, translate the 'think' intensity setting to an effort level, and map API errors to a RuntimeError. These behaviors are locked in by the provider's dedicated test file.", - "sources": [ - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 237, - 242 - ], - "fingerprint": "3b93f710ebca" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "The OpenAI provider must use the chat.completions.parse structured-output contract, route reasoning_effort only to o-series and gpt-5 models (not standard models), swap max_tokens for max_completion_tokens on reasoning models, and map API errors to RuntimeError. 
OpenAI's automatic prefix caching applies to prefixes of at least 1024 tokens and lasts approximately 5–10 minutes.", - "sources": [ - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 215, - 230 - ], - "fingerprint": "3b93f710ebca" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "The critic-reviser loop must only accept a revised derivative section if the revision scores at least as well as the original; a revision that scores lower must be discarded.", - "sources": [ - { - "file": "TESTING-AND-DEMO.md", - "lines": [ - 158, - 163 - ], - "fingerprint": "3b93f710ebca" - } - ] - } - ] - }, - "VISION.md": { - "fingerprint": "10651b456a64", - "summary": "VISION.md defines wikifi's purpose, scope, operational requirements, and success criteria as a technology-agnostic codebase-to-wiki extraction tool for legacy migration teams.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "wikifi exists because the intent of legacy systems is locked inside their implementation choices. A migration team needs a description of *what the system does and why*, decoupled from *how it currently does it*, so they can re-implement on a fresh stack without recreating the legacy system's structure. 
The goal is to make legacy intent explicit, complete, and portable.", - "sources": [ - { - "file": "VISION.md", - "lines": [ - 3, - 9 - ], - "fingerprint": "10651b456a64" - } - ] - }, - { - "section_id": "intent", - "finding": "wikifi is explicitly a feature-extraction tool, not a transposition tool — it surfaces what a legacy system does and leaves the act of reshaping it to a target architecture entirely to the migration team.", - "sources": [ - { - "file": "VISION.md", - "lines": [ - 86, - 89 - ], - "fingerprint": "10651b456a64" - } - ] - }, - { - "section_id": "capabilities", - "finding": "wikifi walks a target codebase, uses an AI agent to extract domain knowledge from each source file, and writes a technology-agnostic wiki covering DDD domains, intent, capabilities, external dependencies, integrations, cross-cutting concerns, core entities, and hard specifications.", - "sources": [ - { - "file": "VISION.md", - "lines": [ - 6, - 8 - ], - "fingerprint": "10651b456a64" - } - ] - }, - { - "section_id": "capabilities", - "finding": "After primary per-file capture is complete, wikifi performs a derivative synthesis pass that produces user personas (with intent, needs, pain points, and usage patterns), Gherkin-style user stories keyed to those personas, and aggregate system diagrams — none of which can be inferred from any single source file.", - "sources": [ - { - "file": "VISION.md", - "lines": [ - 53, - 63 - ], - "fingerprint": "10651b456a64" - } - ] - }, - { - "section_id": "capabilities", - "finding": "wikifi exposes a CLI interface for interacting with the generated wiki; an MCP interface is identified as in-scope for a follow-up release.", - "sources": [ - { - "file": "VISION.md", - "lines": [ - 79, - 80 - ], - "fingerprint": "10651b456a64" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "The agent must run against a local LLM by default with no cloud dependency; hosted backends are valid additional options but not the default. 
The LLM backend must be reachable through a provider abstraction layer so it can be swapped (local Ollama, hosted Anthropic, hosted OpenAI, or custom) without changes outside the provider boundary.", - "sources": [ - { - "file": "VISION.md", - "lines": [ - 92, - 96 - ], - "fingerprint": "10651b456a64" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "When the chosen model exposes a thinking/reasoning level, the agent runs at the highest available setting, prioritising wiki quality over walk speed.", - "sources": [ - { - "file": "VISION.md", - "lines": [ - 97, - 98 - ], - "fingerprint": "10651b456a64" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "The walker must recognise and skip files carrying no extractable intent (stub init files, empty fixtures, generated lockfiles, and similar) before they reach the agent; a single empty or unstructured file must never stall the walk.", - "sources": [ - { - "file": "VISION.md", - "lines": [ - 99, - 100 - ], - "fingerprint": "10651b456a64" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "The generated wiki must at minimum contain: DDD domains and subdomains, system intent, domain-level capabilities, external-system dependencies, internal and external integrations, cross-cutting concerns, core entities and their structures, and hard specifications — regardless of the on-disk layout chosen by the implementor.", - "sources": [ - { - "file": "VISION.md", - "lines": [ - 26, - 47 - ], - "fingerprint": "10651b456a64" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Derivative wiki sections (user personas, user stories, aggregate diagrams) must be produced in a step that runs *after* primary capture and must never be inferred from a single source file.", - "sources": [ - { - "file": "VISION.md", - "lines": [ - 50, - 63 - ], - "fingerprint": "10651b456a64" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Wiki content is stored in the target 
project's `.wikifi/` directory; the contract is the content the wiki conveys, not its on-disk shape or file structure within that directory.", - "sources": [ - { - "file": "VISION.md", - "lines": [ - 73, - 76 - ], - "fingerprint": "10651b456a64" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Success is defined as: a migration team working from the wiki alone — without reference to the original codebase — can deliver a microservice re-implementation that preserves the original system's personas, problem space, integrations, cross-cutting concerns, entities, data patterns, and user value.", - "sources": [ - { - "file": "VISION.md", - "lines": [ - 103, - 105 - ], - "fingerprint": "10651b456a64" - } - ] - }, - { - "section_id": "domains", - "finding": "wikifi's core domain is legacy-system knowledge extraction: capturing business intent, domain structure, and operational behaviour from an existing codebase and representing it as a technology-agnostic wiki. A secondary domain is wiki authoring and organisation, governing how extracted knowledge is structured and stored for consumption by a migration team.", - "sources": [ - { - "file": "VISION.md", - "lines": [ - 3, - 20 - ], - "fingerprint": "10651b456a64" - } - ] - } - ] - }, - "wikifi/aggregator.py": { - "fingerprint": "c5f76cb7c4a3", - "summary": "Stage 3 of the wiki-generation pipeline: synthesises per-section wiki content from accumulated file-level notes using an LLM, attaches structured evidence (citations + contradictions), and writes the final section body.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "The aggregator exists to turn many small, file-scoped observations into a coherent, tech-agnostic wiki section — while refusing to silently hide disagreements between sources. 
The system prompt makes explicit that contradictions across files must be surfaced as named conflicts rather than merged away, and that every claim must trace back to the specific source files that justified it.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 1, - 15 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system synthesises notes collected from individual source files into readable markdown bodies for each primary wiki section, with every asserted claim backed by numbered citations pointing to the originating files and optional line ranges.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 1, - 15 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "section_id": "capabilities", - "finding": "When two or more files make incompatible assertions about the same domain topic, the system surfaces the conflict explicitly under a 'Conflicts in source' heading rather than silently choosing one position — a deliberate feature for legacy codebases where tribal knowledge hides in inconsistencies.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 9, - 14 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "section_id": "capabilities", - "finding": "A section-level cache compares a digest of the current note payload against the previous walk; if the notes are unchanged, the prior rendered body and evidence bundle are reused without re-invoking the LLM, saving cost and latency on incremental runs.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 15, - 17 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "section_id": "capabilities", - "finding": "When LLM synthesis fails, the system falls back to emitting the raw notes directly in the section body, preserving information at the cost of polish and providing the error message inline.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 272, - 285 - ], - 
"fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "section_id": "entities", - "finding": "An `AggregatedClaim` pairs a single prose assertion with the 1-based indices of the input notes that support it. A `SectionBody` groups a markdown body string with a list of such claims and a list of `AggregatedContradiction` records, each contradiction holding a one-sentence summary and multiple conflicting claim positions.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 74, - 101 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "section_id": "entities", - "finding": "An `AggregationStats` record tracks how many sections were written fresh, skipped due to empty notes, or served from cache during a single aggregation run.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 103, - 107 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "section_id": "entities", - "finding": "A `SourceRef` links a claim back to a specific file, optionally scoped to a line range. Multiple `SourceRef` values are coalesced before being attached to a `Claim`; a `Claim` is the evidence-layer representation of one assertion with its resolved file sources. 
An `EvidenceBundle` carries the final body, claims, and contradictions for a section.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 166, - 186 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Section-level caching uses a deterministic digest of the note payload; hits reuse the stored body and evidence bundle without any LLM call, and misses record the fresh result back to the cache for the next walk.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 126, - 155 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "All aggregation failures are logged at WARNING level and produce a fallback body that preserves the raw notes, ensuring a section is always written even when the LLM call fails.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 143, - 152 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "The aggregator enforces a tech-agnostic invariant at the prompt level: the LLM is explicitly instructed never to name languages, frameworks, or libraries in the synthesised output, translating all observations into domain terms.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 54, - 67 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "section_id": "integrations", - "finding": "The aggregator reads accumulated per-file notes from the wiki layout store (via `read_notes`) and writes finished section bodies back to it (via `write_section`), acting as the bridge between the extraction and rendering stages.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 109, - 160 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "section_id": "integrations", - "finding": "The aggregator calls the LLM provider's structured-JSON completion endpoint, passing a system prompt and a rendered user prompt, and expects the response to 
conform to the `SectionBody` schema.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 136, - 141 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "section_id": "integrations", - "finding": "Derivative sections (personas, user stories, diagrams) are explicitly excluded from this stage and are instead populated by a separate deriver stage that runs afterwards, indicating a two-stage downstream pipeline.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 111, - 116 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Contradictions between source notes must never be silently resolved: any incompatible claims must produce a `contradictions[]` entry naming each position and the note indices that support it. This is stated as a hard rule in the LLM system prompt and enforced structurally via the `AggregatedContradiction` schema.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 61, - 63 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Wiki section bodies must be tech-agnostic: no mention of specific languages, frameworks, or libraries is permitted in synthesised output; every observation must be translated into domain terms.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 57, - 59 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Note indices presented to the LLM are 1-based, and the resolution logic subtracts 1 before indexing into the notes list — an off-by-one invariant that must be preserved if the prompting scheme changes.", - "sources": [ - { - "file": "wikifi/aggregator.py", - "lines": [ - 167, - 173 - ], - "fingerprint": "c5f76cb7c4a3" - } - ] - } - ] - }, - "wikifi/cache.py": { - "fingerprint": "e0a85dbf45f8", - "summary": "Content-addressed, two-scope cache that eliminates redundant extraction and 
aggregation work when re-processing large codebases, and provides free resumability after interrupted walks.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "The cache exists to make iterative re-walks of large codebases economical: without it a full re-walk of a 50 000-file monorepo would take hours; with it only files whose content has changed since the last run require fresh processing. Resumability of interrupted runs is a first-class consequence of the design, not an add-on.", - "sources": [ - { - "file": "wikifi/cache.py", - "lines": [ - 1, - 20 - ], - "fingerprint": "e0a85dbf45f8" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system maintains two independent caches: a per-file extraction cache that skips LLM calls for any file whose byte content is unchanged, and a per-section aggregation cache that skips the aggregation step whenever the full set of notes for a section is bit-identical to the previous run.", - "sources": [ - { - "file": "wikifi/cache.py", - "lines": [ - 4, - 13 - ], - "fingerprint": "e0a85dbf45f8" - } - ] - }, - { - "section_id": "capabilities", - "finding": "Interrupted processing runs can be resumed automatically: because each file's result is persisted as soon as it is produced, a walk that crashes part-way through restores all previously completed files from cache and continues from the point of failure.", - "sources": [ - { - "file": "wikifi/cache.py", - "lines": [ - 15, - 18 - ], - "fingerprint": "e0a85dbf45f8" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The cache can be pruned to remove entries for files no longer in scope, reset entirely (e.g. 
via a `--no-cache` flag), and reports hit and miss counters for both scopes to aid observability.", - "sources": [ - { - "file": "wikifi/cache.py", - "lines": [ - 109, - 122 - ], - "fingerprint": "e0a85dbf45f8" - } - ] - }, - { - "section_id": "entities", - "finding": "A `CachedFindings` record holds a file's content fingerprint, the structured list of findings produced by the extractor, a one-sentence summary, and the count of chunks processed. It is keyed by relative file path.", - "sources": [ - { - "file": "wikifi/cache.py", - "lines": [ - 44, - 50 - ], - "fingerprint": "e0a85dbf45f8" - } - ] - }, - { - "section_id": "entities", - "finding": "A `CachedSection` record holds the hash of a section's notes payload, the rendered markdown body, and lists of claims and contradictions. It is keyed by section identifier.", - "sources": [ - { - "file": "wikifi/cache.py", - "lines": [ - 53, - 58 - ], - "fingerprint": "e0a85dbf45f8" - } - ] - }, - { - "section_id": "entities", - "finding": "A `WalkCache` is the in-memory container for both caches and their hit/miss counters; it is loaded from and persisted to disk as a unit.", - "sources": [ - { - "file": "wikifi/cache.py", - "lines": [ - 61, - 73 - ], - "fingerprint": "e0a85dbf45f8" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "All cache files are written atomically: content is first written to a temporary file alongside the target, then renamed into place, preventing partial or corrupt cache files from being read on subsequent runs.", - "sources": [ - { - "file": "wikifi/cache.py", - "lines": [ - 192, - 196 - ], - "fingerprint": "e0a85dbf45f8" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Cache files carry a version tag; any file whose version does not match the current constant is silently discarded and treated as an empty cache, allowing safe schema evolution across releases.", - "sources": [ - { - "file": "wikifi/cache.py", - "lines": [ - 199, - 204 - ], - "fingerprint": 
"e0a85dbf45f8" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Malformed individual cache entries are dropped with a warning log rather than aborting the load, ensuring a single corrupt record does not invalidate the entire cache.", - "sources": [ - { - "file": "wikifi/cache.py", - "lines": [ - 207, - 210 - ], - "fingerprint": "e0a85dbf45f8" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Hit and miss counters are maintained in memory for both the extraction and aggregation scopes, enabling downstream reporting on cache efficiency.", - "sources": [ - { - "file": "wikifi/cache.py", - "lines": [ - 66, - 70 - ], - "fingerprint": "e0a85dbf45f8" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "The aggregation cache key must include not just finding text but also the per-source tuple of (file path, line range, fingerprint). This ensures that when a referenced file's lines shift or its content changes, the cache misses and re-aggregation occurs against fresh evidence rather than replaying stale citations.", - "sources": [ - { - "file": "wikifi/cache.py", - "lines": [ - 241, - 254 - ], - "fingerprint": "e0a85dbf45f8" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "The cache version constant (`CACHE_VERSION = 1`) must be incremented whenever the cache schema changes, as version mismatch causes all existing entries to be unconditionally dropped.", - "sources": [ - { - "file": "wikifi/cache.py", - "lines": [ - 36, - 36 - ], - "fingerprint": "e0a85dbf45f8" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Line-range values stored in source records must be normalized to a two-element integer list regardless of whether they arrive as tuples, lists, or other sequences, so that identical ranges always produce identical hash bytes across code paths.", - "sources": [ - { - "file": "wikifi/cache.py", - "lines": [ - 276, - 285 - ], - "fingerprint": "e0a85dbf45f8" - } - ] - }, - { - 
"section_id": "integrations", - "finding": "The cache interacts with the fingerprinting subsystem (imported from `wikifi/fingerprint.py`) to produce stable content hashes used as cache keys for both file-level and section-level entries.", - "sources": [ - { - "file": "wikifi/cache.py", - "lines": [ - 238, - 240 - ], - "fingerprint": "e0a85dbf45f8" - } - ] - }, - { - "section_id": "integrations", - "finding": "Cache storage is rooted under the wiki layout directory (sourced from `wikifi/wiki.py`), placing cache files at `.wikifi/.cache/` so they co-locate with wiki output but remain outside committed section markdown.", - "sources": [ - { - "file": "wikifi/cache.py", - "lines": [ - 18, - 19 - ], - "fingerprint": "e0a85dbf45f8" - } - ] - } - ] - }, - "wikifi/chat.py": { - "fingerprint": "0333e700a046", - "summary": "Implements an interactive multi-turn chat session grounded in the populated wiki sections of a target project, allowing users to query the wiki content conversationally.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "The chat module exists so that users can hold a natural-language conversation about a codebase's wiki, with every answer grounded in the extracted wiki sections rather than invented detail. It explicitly instructs the assistant to cite section names and admit gaps rather than fabricate information.", - "sources": [ - { - "file": "wikifi/chat.py", - "lines": [ - 1, - 32 - ], - "fingerprint": "0333e700a046" - } - ] - }, - { - "section_id": "capabilities", - "finding": "Users can launch an interactive session that loads all populated wiki sections as context, then send multi-turn messages and receive responses grounded in that context. 
The session supports conversation history reset (clearing turns while keeping wiki context), listing which sections are loaded, and graceful exit — all via slash commands.", - "sources": [ - { - "file": "wikifi/chat.py", - "lines": [ - 88, - 130 - ], - "fingerprint": "0333e700a046" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system filters out unpopulated or placeholder wiki sections before building the context bundle, ensuring the assistant is only grounded in meaningful content.", - "sources": [ - { - "file": "wikifi/chat.py", - "lines": [ - 63, - 82 - ], - "fingerprint": "0333e700a046" - } - ] - }, - { - "section_id": "entities", - "finding": "A `ChatSession` entity holds the LLM provider reference, the frozen system prompt built from wiki sections, and the accumulated conversation history (ordered list of role/content message pairs). It supports appending user and assistant turns and clearing history while retaining the wiki context.", - "sources": [ - { - "file": "wikifi/chat.py", - "lines": [ - 46, - 57 - ], - "fingerprint": "0333e700a046" - } - ] - }, - { - "section_id": "entities", - "finding": "A `LoadedSection` entity pairs a wiki Section descriptor with its markdown body text, representing a single populated section ready for inclusion in a chat context.", - "sources": [ - { - "file": "wikifi/chat.py", - "lines": [ - 42, - 45 - ], - "fingerprint": "0333e700a046" - } - ] - }, - { - "section_id": "integrations", - "finding": "The chat session delegates all LLM inference to the configured provider via a `chat` call, passing the system prompt and accumulated message history. 
The provider abstraction is sourced from `wikifi/providers/base.py`, keeping the chat logic decoupled from any specific LLM service.", - "sources": [ - { - "file": "wikifi/chat.py", - "lines": [ - 52, - 55 - ], - "fingerprint": "0333e700a046" - } - ] - }, - { - "section_id": "integrations", - "finding": "Wiki section content is read directly from the `.wikifi/` directory on disk using the layout abstraction from `wikifi/wiki.py`, and section metadata is sourced from `wikifi/sections.py`.", - "sources": [ - { - "file": "wikifi/chat.py", - "lines": [ - 63, - 82 - ], - "fingerprint": "0333e700a046" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Provider failures during a chat turn are caught and surfaced as inline error messages rather than crashing the REPL, ensuring a single failed inference call does not terminate the session.", - "sources": [ - { - "file": "wikifi/chat.py", - "lines": [ - 120, - 125 - ], - "fingerprint": "0333e700a046" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "The system prompt instructs the assistant to explicitly acknowledge when the wiki does not cover a topic, enforcing a data-integrity constraint that responses must be grounded in extracted content rather than hallucinated.", - "sources": [ - { - "file": "wikifi/chat.py", - "lines": [ - 27, - 31 - ], - "fingerprint": "0333e700a046" - } - ] - } - ] - }, - "wikifi/cli.py": { - "fingerprint": "f326383c7da1", - "summary": "The command-line entry point for wikifi, exposing four subcommands (init, walk, chat, report) that drive the full pipeline from codebase introspection through wiki generation and interactive querying.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "wikifi exists to walk a codebase and produce a technology-agnostic markdown wiki of its intent — translating implementation details into domain-level documentation for whoever needs to understand what the system does rather than how it is built.", - 
"sources": [ - { - "file": "wikifi/cli.py", - "lines": [ - 1, - 8 - ], - "fingerprint": "f326383c7da1" - } - ] - }, - { - "section_id": "capabilities", - "finding": "Users can initialise a wiki workspace in any project directory, run a full multi-stage extraction-and-aggregation pipeline against that codebase, query the resulting wiki through an interactive conversational interface, and obtain a coverage and quality report on how completely the wiki sections have been populated.", - "sources": [ - { - "file": "wikifi/cli.py", - "lines": [ - 60, - 220 - ], - "fingerprint": "f326383c7da1" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The walk pipeline supports opt-in cache invalidation (forcing a clean re-walk), an optional critic-and-reviser review loop on derivative sections, and runtime override of the AI provider, giving operators fine-grained control over cost and quality trade-offs.", - "sources": [ - { - "file": "wikifi/cli.py", - "lines": [ - 88, - 112 - ], - "fingerprint": "f326383c7da1" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The report command can optionally invoke a critic against every populated wiki section to produce quality scores, in addition to its baseline coverage summary.", - "sources": [ - { - "file": "wikifi/cli.py", - "lines": [ - 185, - 205 - ], - "fingerprint": "f326383c7da1" - } - ] - }, - { - "section_id": "domains", - "finding": "Two core domains are visible: codebase introspection (discovering and classifying source files) and wiki generation (extracting findings, aggregating them into sections, and deriving higher-level content). 
A supporting subdomain covers interactive knowledge retrieval against the generated wiki.", - "sources": [ - { - "file": "wikifi/cli.py", - "lines": [ - 1, - 8 - ], - "fingerprint": "f326383c7da1" - } - ] - }, - { - "section_id": "entities", - "finding": "A WikiLayout entity represents the on-disk structure of a wiki workspace rooted at a project directory; it tracks the presence of the .wikifi/ directory and organises the paths that other pipeline stages read and write.", - "sources": [ - { - "file": "wikifi/cli.py", - "lines": [ - 166, - 172 - ], - "fingerprint": "f326383c7da1" - } - ] - }, - { - "section_id": "entities", - "finding": "A walk report entity carries structured metrics for each of the four pipeline stages: introspection (included/excluded file counts, detected languages), extraction (files seen, files with findings, total findings, skipped files, cache hits, specialised-extractor files), aggregation (sections written/empty/cached), and derivation (sections derived/skipped/revised).", - "sources": [ - { - "file": "wikifi/cli.py", - "lines": [ - 118, - 153 - ], - "fingerprint": "f326383c7da1" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Verbose mode activates debug-level structured logging across all subcommands via a shared callback, providing a consistent observability toggle for the entire pipeline.", - "sources": [ - { - "file": "wikifi/cli.py", - "lines": [ - 51, - 60 - ], - "fingerprint": "f326383c7da1" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "An on-disk cache is used by the walk pipeline to avoid redundant re-processing; it can be explicitly invalidated at runtime, and cache hit counts are surfaced in the walk report.", - "sources": [ - { - "file": "wikifi/cli.py", - "lines": [ - 90, - 97 - ], - "fingerprint": "f326383c7da1" - } - ] - }, - { - "section_id": "integrations", - "finding": "The CLI integrates with configurable AI providers (ollama, anthropic, openai) at runtime; the provider and model are 
resolved from settings but can be overridden per-walk invocation, and the same provider instance is reused for both the chat REPL and quality scoring in the report command.", - "sources": [ - { - "file": "wikifi/cli.py", - "lines": [ - 98, - 101 - ], - "fingerprint": "f326383c7da1" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "The tool's entry point must be declared as `wikifi` in the project's script configuration and must delegate directly to the Typer application; this contract ties the installed command name to the main() function in this module.", - "sources": [ - { - "file": "wikifi/cli.py", - "lines": [ - 210, - 215 - ], - "fingerprint": "f326383c7da1" - } - ] - } - ] - }, - "wikifi/config.py": { - "fingerprint": "953e3d59fb7e", - "summary": "Defines all runtime-configurable settings for the wikifi codebase-to-wiki pipeline, including LLM provider selection, file-processing thresholds, chunking strategy, and optional pipeline stages.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "The system is designed to walk a codebase and produce wiki documentation using an LLM. It defaults to a locally-hosted inference server but supports hosted cloud providers as opt-in alternatives, prioritising wiki quality over processing speed.", - "sources": [ - { - "file": "wikifi/config.py", - "lines": [ - 1, - 9 - ], - "fingerprint": "953e3d59fb7e" - } - ] - }, - { - "section_id": "external_dependencies", - "finding": "The system integrates with three LLM inference providers: a locally-hosted Ollama server (default, no key required), the Anthropic API (opt-in, requires API key), and the OpenAI API (opt-in, supports custom base URLs for Azure or proxies). 
Each provider is independently configurable with its own output-token cap and authentication credential.", - "sources": [ - { - "file": "wikifi/config.py", - "lines": [ - 26, - 120 - ], - "fingerprint": "953e3d59fb7e" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The pipeline can build an import/reference graph of the target project and feed each file's neighbourhood into the extraction prompt, improving cross-file context. This behaviour is independently toggleable.", - "sources": [ - { - "file": "wikifi/config.py", - "lines": [ - 72, - 79 - ], - "fingerprint": "953e3d59fb7e" - } - ] - }, - { - "section_id": "capabilities", - "finding": "Schema and structured-definition files (SQL, OpenAPI, Protobuf, GraphQL, migrations) are routed through deterministic extractors that bypass the LLM entirely, trading flexibility for reliability on well-structured inputs.", - "sources": [ - { - "file": "wikifi/config.py", - "lines": [ - 80, - 87 - ], - "fingerprint": "953e3d59fb7e" - } - ] - }, - { - "section_id": "capabilities", - "finding": "Derivative wiki sections (personas, user stories, diagrams) can optionally be passed through a critic-then-reviser quality loop; the loop is triggered only when a critic score falls below a configurable threshold, and is disabled by default to keep run time predictable.", - "sources": [ - { - "file": "wikifi/config.py", - "lines": [ - 88, - 99 - ], - "fingerprint": "953e3d59fb7e" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "The pipeline maintains a per-file extraction cache and a per-section aggregation cache that persist across successive walks. 
Both can be disabled together to force a full re-walk, giving operators a straightforward cache-invalidation escape hatch.", - "sources": [ - { - "file": "wikifi/config.py", - "lines": [ - 63, - 71 - ], - "fingerprint": "953e3d59fb7e" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "A per-request timeout (default 900 seconds) guards against runaway LLM calls. A minimum-content threshold (default 64 bytes) prevents the LLM from being invoked on near-empty stub files, avoiding token waste and reasoning loops.", - "sources": [ - { - "file": "wikifi/config.py", - "lines": [ - 33, - 49 - ], - "fingerprint": "953e3d59fb7e" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Files exceeding 2 MB are unconditionally skipped by the walker and treated as vendored or generated noise; this threshold is not applied to chunking logic.", - "sources": [ - { - "file": "wikifi/config.py", - "lines": [ - 37, - 43 - ], - "fingerprint": "953e3d59fb7e" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Large files are split into overlapping windows of 150 KB each with an 8 KB overlap between adjacent chunks; each window is sent as a separate LLM call. The 150 KB size is chosen to fit within a 32 K-token context window after prompt overhead.", - "sources": [ - { - "file": "wikifi/config.py", - "lines": [ - 44, - 54 - ], - "fingerprint": "953e3d59fb7e" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "The Anthropic provider is capped at 32 000 output tokens per call to stay within the SDK's non-streaming HTTP timeout guard; callers using maximum-effort thinking are advised to raise this cap and enable streaming. 
The OpenAI provider is capped at 16 000 output tokens per call.", - "sources": [ - { - "file": "wikifi/config.py", - "lines": [ - 104, - 120 - ], - "fingerprint": "953e3d59fb7e" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "The introspection pass receives a directory-tree snapshot limited to depth 3, bounding the context size fed to that stage.", - "sources": [ - { - "file": "wikifi/config.py", - "lines": [ - 55, - 56 - ], - "fingerprint": "953e3d59fb7e" - } - ] - } - ] - }, - "wikifi/critic.py": { - "fingerprint": "502af9aee392", - "summary": "Quality-assurance component that scores synthesized wiki sections against a rubric, identifies unsupported claims and gaps, and optionally revises bodies that fall below a minimum quality threshold.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "The critic exists specifically to catch hallucination and missing-coverage failures that are most likely to occur in single-shot synthesis of derivative sections. It enforces that all wiki content remains tech-agnostic and grounded in upstream evidence, so that a migration team can trust the output without manually verifying every claim.", - "sources": [ - { - "file": "wikifi/critic.py", - "lines": [ - 1, - 15 - ], - "fingerprint": "502af9aee392" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system can evaluate any synthesized wiki section body against its brief and upstream evidence, producing a structured score (0–10) with itemised unsupported claims, gaps, and suggested edits. 
When the initial score falls below a configurable threshold, it automatically invokes a revision pass and only accepts the revision if it improves or matches the prior score, preventing regressions.", - "sources": [ - { - "file": "wikifi/critic.py", - "lines": [ - 100, - 153 - ], - "fingerprint": "502af9aee392" - } - ] - }, - { - "section_id": "capabilities", - "finding": "A separate audit path walks every section in the finished wiki and produces a rubric-style quality report, including per-section coverage statistics such as total files analysed, files that produced findings, and finding counts per section. This report is surfaced via the `wikifi report` command.", - "sources": [ - { - "file": "wikifi/critic.py", - "lines": [ - 155, - 180 - ], - "fingerprint": "502af9aee392" - } - ] - }, - { - "section_id": "entities", - "finding": "A `Critique` entity captures the quality assessment of a single section: an integer score (0–10), a short overall judgment, a list of unsupported claims, a list of gaps relative to the section brief, and a list of concrete revision suggestions.", - "sources": [ - { - "file": "wikifi/critic.py", - "lines": [ - 67, - 84 - ], - "fingerprint": "502af9aee392" - } - ] - }, - { - "section_id": "entities", - "finding": "A `ReviewOutcome` entity tracks the lifecycle of a section review: the section identifier, the initial critique, the current body text, a flag indicating whether a revision was applied, and the optional follow-up critique produced after revision.", - "sources": [ - { - "file": "wikifi/critic.py", - "lines": [ - 91, - 96 - ], - "fingerprint": "502af9aee392" - } - ] - }, - { - "section_id": "entities", - "finding": "A `WikiQualityReport` entity aggregates the full-wiki audit results: an overall numeric score, a mapping from section identifiers to their individual critiques, and optional coverage statistics. 
`CoverageStats` records total files, files with findings, and per-section finding and file counts.", - "sources": [ - { - "file": "wikifi/critic.py", - "lines": [ - 99, - 114 - ], - "fingerprint": "502af9aee392" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "The scoring rubric is fixed: 9–10 indicates fully grounded, tech-agnostic, narratively coherent content with no unsupported claims; 6–8 allows minor issues; 3–5 signals substantial gaps or partial coverage; 0–2 marks incoherent or off-brief content. The default minimum acceptable score for shipping a section without revision is 7.", - "sources": [ - { - "file": "wikifi/critic.py", - "lines": [ - 31, - 48 - ], - "fingerprint": "502af9aee392" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "A revised body is only accepted if its follow-up critique score is greater than or equal to the initial score; any revision that causes a score regression is discarded and the original body is retained. This invariant must be preserved in any reimplementation.", - "sources": [ - { - "file": "wikifi/critic.py", - "lines": [ - 137, - 147 - ], - "fingerprint": "502af9aee392" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "All section bodies must be tech-agnostic: the reviser is explicitly instructed not to invent claims unsupported by upstream evidence and to declare gaps explicitly when evidence is missing rather than speculating.", - "sources": [ - { - "file": "wikifi/critic.py", - "lines": [ - 53, - 61 - ], - "fingerprint": "502af9aee392" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Failures in either the critic or reviser calls are caught and logged as warnings; the system degrades gracefully by returning the original body rather than propagating errors. 
A score of 0 with a diagnostic message is returned when the critic is unavailable.", - "sources": [ - { - "file": "wikifi/critic.py", - "lines": [ - 158, - 165 - ], - "fingerprint": "502af9aee392" - } - ] - }, - { - "section_id": "integrations", - "finding": "This module calls into the shared LLM provider (from the providers layer) for both structured critique completions and structured revision completions, passing domain-specific system prompts and Pydantic schemas for type-safe JSON responses. It consumes Section metadata from the sections catalogue and its outputs are consumed by the report module.", - "sources": [ - { - "file": "wikifi/critic.py", - "lines": [ - 30, - 32 - ], - "fingerprint": "502af9aee392" - } - ] - } - ] - }, - "wikifi/deriver.py": { - "fingerprint": "0b7f4f5abb09", - "summary": "Stage 4 of the wiki generation pipeline: synthesizes derivative sections (personas, user stories, diagrams) by feeding aggregated upstream section content into the language model, then optionally running a critic/reviser quality loop.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "Some wiki sections — personas, user stories, and diagrams — cannot be extracted from individual source files because they only emerge from the aggregate of capabilities, entities, and integrations. This module exists specifically to synthesize those derivative sections after all primary sections have been written, grounding each output exclusively in already-aggregated upstream evidence.", - "sources": [ - { - "file": "wikifi/deriver.py", - "lines": [ - 1, - 18 - ], - "fingerprint": "0b7f4f5abb09" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system synthesizes derivative wiki sections (personas, Gherkin-style user stories, and Mermaid architectural diagrams) by collecting the final markdown bodies of all upstream primary sections and passing them to the language model with a targeted brief. 
If upstream sections are empty or missing, the system writes a placeholder that declares the gap rather than fabricating content.", - "sources": [ - { - "file": "wikifi/deriver.py", - "lines": [ - 73, - 107 - ], - "fingerprint": "0b7f4f5abb09" - } - ] - }, - { - "section_id": "capabilities", - "finding": "Each synthesized derivative section can optionally be run through a critic-and-reviser quality loop. The system explicitly notes this loop is the highest-leverage quality control point for derivative sections, because personas and Gherkin stories are where single-shot synthesis most often hallucinates.", - "sources": [ - { - "file": "wikifi/deriver.py", - "lines": [ - 90, - 103 - ], - "fingerprint": "0b7f4f5abb09" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Hallucination prevention is enforced at two levels: (1) a heuristic filters placeholder bodies so no derivative section treats an unpopulated upstream as real evidence, and (2) an optional critic review loop scores and revises each derivative before it is written. 
Revision events are counted in the run's stats for observability.", - "sources": [ - { - "file": "wikifi/deriver.py", - "lines": [ - 110, - 135 - ], - "fingerprint": "0b7f4f5abb09" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "The empty-content heuristic must match all known placeholder shapes ('not yet populated', 'no findings were extracted', 'upstream sections required to derive') to prevent fabricated findings cascading into downstream derivative sections.", - "sources": [ - { - "file": "wikifi/deriver.py", - "lines": [ - 118, - 135 - ], - "fingerprint": "0b7f4f5abb09" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "All derivation failures are caught and logged; a fallback body is written that preserves the upstream evidence verbatim rather than leaving the section blank, ensuring the wiki remains inspectable even after partial failures.", - "sources": [ - { - "file": "wikifi/deriver.py", - "lines": [ - 96, - 107 - ], - "fingerprint": "0b7f4f5abb09" - } - ] - }, - { - "section_id": "entities", - "finding": "A `DerivationStats` record accumulates pipeline metrics for a single run: count of sections derived, skipped, and revised, plus the full list of critic review outcomes. This acts as an audit trail for the synthesis stage.", - "sources": [ - { - "file": "wikifi/deriver.py", - "lines": [ - 57, - 62 - ], - "fingerprint": "0b7f4f5abb09" - } - ] - }, - { - "section_id": "entities", - "finding": "A `Section` entity has a `derived_from` list declaring which upstream section IDs it depends on, establishing an explicit dependency graph between primary and derivative sections.", - "sources": [ - { - "file": "wikifi/deriver.py", - "lines": [ - 112, - 116 - ], - "fingerprint": "0b7f4f5abb09" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Derivative sections must be grounded solely in upstream section content. 
The model is instructed to declare gaps explicitly rather than filling them with invented facts — this is a hard constraint on output integrity.", - "sources": [ - { - "file": "wikifi/deriver.py", - "lines": [ - 34, - 50 - ], - "fingerprint": "0b7f4f5abb09" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "All wiki content, including derivative sections, must remain technology-agnostic: language names, framework names, and library names are forbidden and must be translated into domain terms.", - "sources": [ - { - "file": "wikifi/deriver.py", - "lines": [ - 37, - 39 - ], - "fingerprint": "0b7f4f5abb09" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Gherkin-style outputs must use proper Given/When/Then syntax inside fenced ```gherkin code blocks. Mermaid diagrams must be valid and inside fenced ```mermaid code blocks, preferring graph, classDiagram, erDiagram, and sequenceDiagram diagram types.", - "sources": [ - { - "file": "wikifi/deriver.py", - "lines": [ - 40, - 45 - ], - "fingerprint": "0b7f4f5abb09" - } - ] - }, - { - "section_id": "integrations", - "finding": "The deriver reads upstream section bodies from the filesystem layout written by the aggregator stage, and writes its output back through the same layout abstraction. 
It also calls into the critic module to obtain scored review outcomes for each derivative section.", - "sources": [ - { - "file": "wikifi/deriver.py", - "lines": [ - 73, - 107 - ], - "fingerprint": "0b7f4f5abb09" - } - ] - } - ] - }, - "wikifi/evidence.py": { - "fingerprint": "9c7863e99adc", - "summary": "Defines the evidence model — source references, claims, and contradictions — that lets architects trace every sentence in the generated migration wiki back to a precise location in the source codebase.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "The system exists so that any architect reading a generated migration wiki can ask ", - "sources": [ - { - "file": "wikifi/evidence.py", - "lines": [ - 1, - 18 - ], - "fingerprint": "9c7863e99adc" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system generates citation-bearing markdown narratives where every assertion is linked to the specific source-file locations that support it. Claims that cannot be matched verbatim to the narrative body are collected into a separate 'Supporting claims' list rather than silently dropped.", - "sources": [ - { - "file": "wikifi/evidence.py", - "lines": [ - 85, - 120 - ], - "fingerprint": "9c7863e99adc" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system explicitly surfaces conflicting claims across source files as a dedicated 'Conflicts in source' section in each wiki section's output. Migration teams are directed to resolve these conflicts before re-implementation, treating them as high-priority signals about hidden tribal knowledge.", - "sources": [ - { - "file": "wikifi/evidence.py", - "lines": [ - 121, - 133 - ], - "fingerprint": "9c7863e99adc" - } - ] - }, - { - "section_id": "entities", - "finding": "A SourceRef represents a single pointer into the source codebase, carrying a repo-relative file path, an optional inclusive line range, and a short content fingerprint captured at extraction time. 
It renders as `path:start-end` (or just `path` when no line range is known).", - "sources": [ - { - "file": "wikifi/evidence.py", - "lines": [ - 35, - 55 - ], - "fingerprint": "9c7863e99adc" - } - ] - }, - { - "section_id": "entities", - "finding": "A Claim is a single markdown assertion placed in a section's narrative, backed by zero or more SourceRefs. A claim with no sources is considered unsupported. A Contradiction groups two or more conflicting Claims, each with its own sources, under a one-sentence summary of the conflict.", - "sources": [ - { - "file": "wikifi/evidence.py", - "lines": [ - 57, - 80 - ], - "fingerprint": "9c7863e99adc" - } - ] - }, - { - "section_id": "entities", - "finding": "An EvidenceBundle is the aggregator's structured output for a single section, containing the markdown narrative body, a list of Claims, and a list of Contradictions. It is the primary handoff type between the aggregator and the renderer.", - "sources": [ - { - "file": "wikifi/evidence.py", - "lines": [ - 82, - 87 - ], - "fingerprint": "9c7863e99adc" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Full source provenance is a non-functional invariant: every claim in the output must carry the file and optional line range that justifies it. Contradictions are never silently merged — they are always rendered explicitly so that data integrity issues visible in the source are preserved and escalated to the migration team.", - "sources": [ - { - "file": "wikifi/evidence.py", - "lines": [ - 1, - 18 - ], - "fingerprint": "9c7863e99adc" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Citations must be rendered as compact footnote-style markers ([1], [2], …) with a Sources footer at the bottom of each section. 
Line ranges are formatted as `path/to/file:start-end`; when start equals end, as `path/to/file:line`; when unknown, as `path/to/file` alone.", - "sources": [ - { - "file": "wikifi/evidence.py", - "lines": [ - 43, - 52 - ], - "fingerprint": "9c7863e99adc" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Any contradictions detected across source files must appear verbatim in the wiki output under a 'Conflicts in source' heading with the explicit instruction that migration teams must resolve them before re-implementation — they must not be suppressed or merged.", - "sources": [ - { - "file": "wikifi/evidence.py", - "lines": [ - 121, - 131 - ], - "fingerprint": "9c7863e99adc" - } - ] - }, - { - "section_id": "integrations", - "finding": "The evidence model types (SourceRef, Claim, Contradiction, EvidenceBundle) are consumed by the aggregator to produce section output, and by specialized extractors (GraphQL, OpenAPI, Protobuf, SQL, general models) that populate SourceRefs during the extraction pass.", - "sources": [ - { - "file": "wikifi/evidence.py", - "lines": [ - 1, - 5 - ], - "fingerprint": "9c7863e99adc" - } - ] - } - ] - }, - "wikifi/extractor.py": { - "fingerprint": "67bd95fa3f07", - "summary": "Stage 2 of the wiki-generation pipeline: walks each included source file, extracts structured intent-bearing findings per wiki section via LLM or deterministic extractors, and appends them to a per-section notes store for later aggregation.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "The extractor exists to translate raw source files into structured, technology-agnostic findings that describe *why* code exists rather than how it is implemented. It is designed to scale to very large legacy codebases (e.g. 
50 000-file monorepos) by skipping unchanged files via content-addressed caching, routing schema-typed files through deterministic parsers, and splitting oversized files into overlapping windows so no content is lost.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": [ - 1, - 35 - ], - "fingerprint": "67bd95fa3f07" - } - ] - }, - { - "section_id": "domains", - "finding": "The core domain is automated knowledge extraction from source repositories: classifying files by kind, extracting intent-bearing findings per wiki section, and recording citations back to the originating file and line range. Subdomains include caching/memoization of extraction results, import-graph-based cross-file context, and chunk-level deduplication.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": null, - "fingerprint": "67bd95fa3f07" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system can walk an arbitrary set of source files and produce structured findings mapped to predefined wiki sections. It supports three extraction paths: (1) cache replay for files whose content has not changed, (2) deterministic specialized extractors for schema-typed files (SQL, OpenAPI, Protobuf, GraphQL, migrations), and (3) LLM-based extraction with recursive chunking for everything else.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": [ - 155, - 250 - ], - "fingerprint": "67bd95fa3f07" - } - ] - }, - { - "section_id": "capabilities", - "finding": "Large files are split into overlapping windows so cross-boundary context is preserved. 
Findings that appear in the overlap region of adjacent chunks are deduplicated by (section, finding-text) identity before being written to the notes store, preventing double-counting of the same declaration.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": [ - 253, - 290 - ], - "fingerprint": "67bd95fa3f07" - } - ] - }, - { - "section_id": "capabilities", - "finding": "Processing is crash-resumable: the cache is persisted after each file completes, so a run interrupted partway through can be restarted without re-processing already-extracted files. Per-chunk and per-file failures are isolated — a single failed LLM call does not abort the overall walk.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": [ - 160, - 165 - ], - "fingerprint": "67bd95fa3f07" - } - ] - }, - { - "section_id": "capabilities", - "finding": "Each emitted finding carries a structured source citation (file path, absolute line range, content fingerprint) so the downstream aggregation stage can stitch precise references into the rendered wiki.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": [ - 262, - 270 - ], - "fingerprint": "67bd95fa3f07" - } - ] - }, - { - "section_id": "entities", - "finding": "A `SectionFinding` represents one contribution from a file to one wiki section, carrying the section identifier, a technology-agnostic prose description, and an optional inclusive line range within the chunk where the evidence appears.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": [ - 105, - 116 - ], - "fingerprint": "67bd95fa3f07" - } - ] - }, - { - "section_id": "entities", - "finding": "A `FileFindings` groups all findings produced for a single file, along with a one-sentence summary of the file's role. 
It is the schema the LLM must conform to when returning structured results.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": [ - 119, - 122 - ], - "fingerprint": "67bd95fa3f07" - } - ] - }, - { - "section_id": "entities", - "finding": "An `ExtractionStats` record accumulates operational counters for a full walk: files seen, files with findings, total findings, skipped files, chunks processed, cache hits, specialized-extractor files, and a breakdown of file kinds encountered.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": [ - 125, - 135 - ], - "fingerprint": "67bd95fa3f07" - } - ] - }, - { - "section_id": "integrations", - "finding": "The extractor delegates LLM calls to an injected `LLMProvider` (from `wikifi/providers/base.py`), which must support a structured JSON completion method that accepts a system prompt, a user prompt, and a response schema. The extractor is otherwise provider-agnostic.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": [ - 139, - 152 - ], - "fingerprint": "67bd95fa3f07" - } - ] - }, - { - "section_id": "integrations", - "finding": "Findings are written to the notes store via `append_note` from `wikifi/wiki.py`, using a `WikiLayout` that describes where section note files live. The extractor has no direct knowledge of the storage format.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": [ - 43, - 44 - ], - "fingerprint": "67bd95fa3f07" - } - ] - }, - { - "section_id": "integrations", - "finding": "Import-graph neighbor data is consumed from `wikifi/repograph.py` (`RepoGraph.neighbor_paths`) and injected into each LLM prompt so the model can describe cross-file flows. 
File-kind classification (`classify`) from the same module drives specialized-extractor routing.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": [ - 38, - 39 - ], - "fingerprint": "67bd95fa3f07" - } - ] - }, - { - "section_id": "integrations", - "finding": "Specialized extractor selection is delegated to `wikifi/specialized/dispatch.py` (`select`), which returns a deterministic extraction function for schema-typed files or `None` for files that should go through the LLM path.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": [ - 41, - 42 - ], - "fingerprint": "67bd95fa3f07" - } - ] - }, - { - "section_id": "integrations", - "finding": "Content fingerprinting is handled by `wikifi/fingerprint.py` (`hash_file`) and cache lookup/recording by `wikifi/cache.py` (`WalkCache`). Source citations use the `SourceRef` model from `wikifi/evidence.py`.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": [ - 36, - 43 - ], - "fingerprint": "67bd95fa3f07" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Content-addressed caching ensures idempotency: a file is only sent to the LLM if its fingerprint has changed since the last run. 
Cache state is persisted after every file, turning the cache into a crash-recovery checkpoint with no additional coordination required.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": [ - 185, - 200 - ], - "fingerprint": "67bd95fa3f07" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Deduplication of findings across chunk overlap regions is enforced by tracking a (section_id, finding_text) set per file, so identical findings discovered in the shared context of adjacent chunks are counted and stored exactly once.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": [ - 253, - 270 - ], - "fingerprint": "67bd95fa3f07" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Errors are scoped: specialized-extractor failures and per-chunk LLM failures are logged at warning level and cause only the affected unit to be skipped, while the rest of the walk continues. A file is counted as fully skipped only if it is a single-chunk file whose sole LLM call failed.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": [ - 211, - 220 - ], - "fingerprint": "67bd95fa3f07" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Per-file extraction targets only *primary* wiki sections. Derivative sections (personas, user stories, diagrams) are explicitly excluded from this stage and deferred to Stage 4 aggregation, because asking the model to identify them at the per-file level produces sparse, speculative findings.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": [ - 51, - 56 - ], - "fingerprint": "67bd95fa3f07" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Chunk size and overlap must satisfy `chunk_size > 0` and `0 <= overlap < chunk_size`; violations raise a `ValueError`. 
Chunks are built so that `base_size + overlap == chunk_size`, guaranteeing no chunk ever exceeds the configured byte limit.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": [ - 333, - 340 - ], - "fingerprint": "67bd95fa3f07" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "The recursive text splitter must always terminate and produce chunks that fit within the configured size, even for inputs with no whitespace (e.g. minified files). This is enforced by the empty-string terminal separator, which falls back to character-level slicing.", - "sources": [ - { - "file": "wikifi/extractor.py", - "lines": [ - 90, - 98 - ], - "fingerprint": "67bd95fa3f07" - } - ] - } - ] - }, - "wikifi/fingerprint.py": { - "fingerprint": "853400108135", - "summary": "Utility that produces stable short content fingerprints (12-character SHA-256 hex prefixes) used for cache keying, source-evidence citations, and dependency-graph invalidation.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "cross_cutting", - "finding": "Content fingerprints serve three cross-cutting roles: keying extraction and aggregation caches so stale results are never served, anchoring source-evidence citations so wiki claims can be verified against a re-walk of the repository, and tracking file identity inside the dependency graph so cross-file context is invalidated when any source changes.", - "sources": [ - { - "file": "wikifi/fingerprint.py", - "lines": [ - 1, - 18 - ], - "fingerprint": "853400108135" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Fingerprints are defined as the first 12 hexadecimal characters of a SHA-256 digest (48 bits of entropy). This length is explicitly chosen to be sufficient to distinguish every file in any realistic repository (estimated 50% collision threshold at ~10 trillion files) while remaining short enough to embed inline in human-readable citations. 
This format must be preserved across any migration because it is recorded in cached artefacts and emitted into wiki evidence references.", - "sources": [ - { - "file": "wikifi/fingerprint.py", - "lines": [ - 23, - 27 - ], - "fingerprint": "853400108135" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Files are always hashed as raw bytes rather than decoded text, ensuring that the cache layer and the extractor produce the same fingerprint for the same file regardless of encoding assumptions.", - "sources": [ - { - "file": "wikifi/fingerprint.py", - "lines": [ - 44, - 50 - ], - "fingerprint": "853400108135" - } - ] - } - ] - }, - "wikifi/introspection.py": { - "fingerprint": "59cd5940f72e", - "summary": "Implements Stage 1 of the wiki-generation pipeline: a single LLM call that examines a compressed repository structure and decides which paths contain production source worth deeper analysis.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "The system exists to produce a technology-agnostic wiki from an unknown codebase. Stage 1 solves the problem of not knowing which parts of a repository contain intent-bearing production code versus scaffolding, tests, or build artifacts — without reading any source files up front.", - "sources": [ - { - "file": "wikifi/introspection.py", - "lines": [ - 1, - 9 - ], - "fingerprint": "59cd5940f72e" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system can analyze a repository's directory layout and manifest files to classify paths as worth walking (production source, business logic, integrations, domain models) or worth skipping (vendored dependencies, build output, test code, CI/CD, documentation). 
The classification is returned as a structured, diffable result.", - "sources": [ - { - "file": "wikifi/introspection.py", - "lines": [ - 28, - 44 - ], - "fingerprint": "59cd5940f72e" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system compresses the entire repository tree into a compact directory-summary representation and reads selected manifest files, then submits this compressed view to an LLM to infer the repository's likely purpose, primary languages, and path patterns to include or exclude.", - "sources": [ - { - "file": "wikifi/introspection.py", - "lines": [ - 61, - 70 - ], - "fingerprint": "59cd5940f72e" - } - ] - }, - { - "section_id": "domains", - "finding": "The core domain is automated repository understanding: deciding which parts of an arbitrary codebase encode business intent versus infrastructure or tooling. A key subdomain constraint is tech-agnosticism — the analysis must not depend on recognizing any specific language or framework.", - "sources": [ - { - "file": "wikifi/introspection.py", - "lines": [ - 19, - 44 - ], - "fingerprint": "59cd5940f72e" - } - ] - }, - { - "section_id": "entities", - "finding": "The `IntrospectionResult` entity captures the Stage 1 decision: a list of gitignore-style include patterns, a list of exclude patterns, a list of primary languages (informational), a one-paragraph guess at the system's purpose, and a rationale for the include/exclude choices.", - "sources": [ - { - "file": "wikifi/introspection.py", - "lines": [ - 47, - 64 - ], - "fingerprint": "59cd5940f72e" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "LLM output is constrained to a strict Pydantic schema to ensure deterministic parsing and easy diffing between runs. 
At Stage 1, the agent deliberately has no access to source file contents — only compressed directory metadata and manifest files — enforcing a clean separation of concerns between introspection and per-file analysis.", - "sources": [ - { - "file": "wikifi/introspection.py", - "lines": [ - 5, - 9 - ], - "fingerprint": "59cd5940f72e" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Stage 1 must operate without reading any source files; it sees only directory-level summaries and manifest contents. This constraint is architectural and must be preserved: source reading is exclusively Stage 2's responsibility.", - "sources": [ - { - "file": "wikifi/introspection.py", - "lines": [ - 5, - 9 - ], - "fingerprint": "59cd5940f72e" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Include and exclude patterns produced by Stage 1 must be in gitignore-style format relative to the repository root.", - "sources": [ - { - "file": "wikifi/introspection.py", - "lines": [ - 50, - 58 - ], - "fingerprint": "59cd5940f72e" - } - ] - }, - { - "section_id": "integrations", - "finding": "Stage 1 calls into an LLM provider (via the shared LLMProvider interface) requesting structured JSON output conforming to the IntrospectionResult schema. It also depends on the walker component to produce directory summaries and read manifest file contents. 
The orchestrator calls this stage as the first step of the pipeline.", - "sources": [ - { - "file": "wikifi/introspection.py", - "lines": [ - 61, - 70 - ], - "fingerprint": "59cd5940f72e" - } - ] - } - ] - }, - "wikifi/orchestrator.py": { - "fingerprint": "1528ab8f73c3", - "summary": "Central pipeline orchestrator that sequences repository introspection, per-file extraction, section aggregation, and derivative artifact generation while managing caching, provider selection, and filesystem layout.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "The system exists to automatically generate a structured wiki from any source-code repository. It sequences LLM-powered analysis through four deterministic stages — structure introspection, per-file knowledge extraction, cross-file aggregation, and high-level artifact derivation — so developers obtain living documentation without writing it by hand.", - "sources": [ - { - "file": "wikifi/orchestrator.py", - "lines": [ - 1, - 16 - ], - "fingerprint": "1528ab8f73c3" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The pipeline offers: (1) repository structure introspection to identify files worth analysing, (2) per-file finding extraction with chunking support and optional cross-file import-graph context, (3) section-level aggregation of findings across all files, and (4) derivation of personas, user stories, and diagrams from aggregated content — including an optional critic review loop.", - "sources": [ - { - "file": "wikifi/orchestrator.py", - "lines": [ - 1, - 16 - ], - "fingerprint": "1528ab8f73c3" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system supports a caching layer that persists extraction and aggregation results between runs and automatically prunes stale entries for files that have fallen out of scope, enabling incremental re-runs over large repositories.", - "sources": [ - { - "file": "wikifi/orchestrator.py", - "lines": [ - 96, - 108 - 
], - "fingerprint": "1528ab8f73c3" - } - ] - }, - { - "section_id": "capabilities", - "finding": "An optional static import-graph analysis enriches per-file extraction with neighbour context, giving the LLM visibility into how files relate to each other before generating findings.", - "sources": [ - { - "file": "wikifi/orchestrator.py", - "lines": [ - 110, - 114 - ], - "fingerprint": "1528ab8f73c3" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The wiki directory is initialised idempotently, so the tool can be run on a fresh project without a prior setup step.", - "sources": [ - { - "file": "wikifi/orchestrator.py", - "lines": [ - 55, - 64 - ], - "fingerprint": "1528ab8f73c3" - } - ] - }, - { - "section_id": "external_dependencies", - "finding": "A locally-hosted Ollama service is the default LLM backend, used for all four pipeline stages. It is configurable by host URL, model name, request timeout, and an optional extended-thinking mode.", - "sources": [ - { - "file": "wikifi/orchestrator.py", - "lines": [ - 166, - 174 - ], - "fingerprint": "1528ab8f73c3" - } - ] - }, - { - "section_id": "external_dependencies", - "finding": "Anthropic's hosted API (Claude family) is an opt-in backend selected via configuration. When the configured model name does not begin with 'claude-', the system substitutes the default 'claude-opus-4-7' to avoid model-not-found errors.", - "sources": [ - { - "file": "wikifi/orchestrator.py", - "lines": [ - 175, - 186 - ], - "fingerprint": "1528ab8f73c3" - } - ] - }, - { - "section_id": "external_dependencies", - "finding": "OpenAI's API is a second opt-in hosted backend, with a configurable base URL to support compatible third-party endpoints. 
When the model name does not match known OpenAI patterns, it defaults to 'gpt-4o'.", - "sources": [ - { - "file": "wikifi/orchestrator.py", - "lines": [ - 187, - 200 - ], - "fingerprint": "1528ab8f73c3" - } - ] - }, - { - "section_id": "integrations", - "finding": "The CLI surface exposes three entry points — init_wiki, run_walk, and run_report — each accepting a root path and a provider instance, making the pipeline fully substitutable with a mock provider in tests.", - "sources": [ - { - "file": "wikifi/orchestrator.py", - "lines": [ - 44, - 53 - ], - "fingerprint": "1528ab8f73c3" - } - ] - }, - { - "section_id": "integrations", - "finding": "The orchestrator calls into six internal subsystems in sequence: the introspector, the file walker, the extractor, the aggregator, the deriver, and the cache layer, coordinating their inputs and outputs to form the end-to-end pipeline.", - "sources": [ - { - "file": "wikifi/orchestrator.py", - "lines": [ - 66, - 157 - ], - "fingerprint": "1528ab8f73c3" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Structured logging is emitted at the start of each pipeline stage (introspection, graph build, extraction, aggregation, derivation), providing observability into long-running batch runs.", - "sources": [ - { - "file": "wikifi/orchestrator.py", - "lines": [ - 87, - 148 - ], - "fingerprint": "1528ab8f73c3" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "The cache is saved to disk after both the aggregation stage and again at the end of the full run, ensuring partial progress is not lost if derivation fails.", - "sources": [ - { - "file": "wikifi/orchestrator.py", - "lines": [ - 150, - 156 - ], - "fingerprint": "1528ab8f73c3" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Cache entries for files that leave the in-scope set are pruned before extraction begins, preventing stale data from leaking into aggregation and keeping cache size proportional to the live file set.", - "sources": [ 
- { - "file": "wikifi/orchestrator.py", - "lines": [ - 100, - 106 - ], - "fingerprint": "1528ab8f73c3" - } - ] - }, - { - "section_id": "entities", - "finding": "WalkReport is the top-level result record for a full pipeline run, carrying the introspection result, extraction statistics, aggregation statistics, derivation statistics, the live cache snapshot, and the repo import graph.", - "sources": [ - { - "file": "wikifi/orchestrator.py", - "lines": [ - 56, - 63 - ], - "fingerprint": "1528ab8f73c3" - } - ] - }, - { - "section_id": "entities", - "finding": "Settings is the central configuration entity, carrying all tuneable parameters: provider identity and credentials, model name, file-size limits, chunk dimensions, cache and graph feature flags, the critic-review threshold, and provider-specific token limits.", - "sources": [ - { - "file": "wikifi/orchestrator.py", - "lines": [ - 166, - 201 - ], - "fingerprint": "1528ab8f73c3" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Exactly three provider values are accepted — 'ollama', 'anthropic', 'openai' — and any other value raises a hard error with the valid options listed explicitly.", - "sources": [ - { - "file": "wikifi/orchestrator.py", - "lines": [ - 202, - 203 - ], - "fingerprint": "1528ab8f73c3" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "When Anthropic is selected, a model name not starting with 'claude-' is silently replaced with 'claude-opus-4-7' to prevent API 404 errors from a carried-over Ollama model id.", - "sources": [ - { - "file": "wikifi/orchestrator.py", - "lines": [ - 180, - 185 - ], - "fingerprint": "1528ab8f73c3" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "When OpenAI is selected, a model name that does not match the heuristic patterns (gpt-*, o1, o3, o4, ft:) is silently replaced with 'gpt-4o'; fine-tuned variants (ft: prefix) are explicitly recognised as valid OpenAI identifiers.", - "sources": [ - { - "file": 
"wikifi/orchestrator.py", - "lines": [ - 192, - 205 - ], - "fingerprint": "1528ab8f73c3" - } - ] - }, - { - "section_id": "domains", - "finding": "The system operates across two core domains: repository intelligence (understanding a codebase's structure, file relationships, and per-file semantics) and documentation synthesis (aggregating extracted knowledge into human-readable wiki sections and higher-level narrative artifacts like personas and diagrams).", - "sources": [ - { - "file": "wikifi/orchestrator.py", - "lines": [ - 1, - 16 - ], - "fingerprint": "1528ab8f73c3" - } - ] - } - ] - }, - "wikifi/repograph.py": { - "fingerprint": "808453182a95", - "summary": "Performs lightweight static analysis of a repository to classify each file by structural kind and build a cross-file import/reference graph used to enrich per-file knowledge extraction.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "This module exists to give the extraction pipeline two pieces of context it would otherwise lack: a classification of each file's structural role (e.g. schema definition, API contract, migration, application logic), and a map of which files import which. 
Together these allow structured files to bypass the language-model path entirely, and allow application-code findings to mention cross-file relationships rather than treating each file in isolation.", - "sources": [ - { - "file": "wikifi/repograph.py", - "lines": [ - 1, - 30 - ], - "fingerprint": "808453182a95" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system classifies every in-scope repository file into one of seven structural kinds — application code, SQL DDL, OpenAPI specification, Protocol Buffer definition, GraphQL schema, database migration, or other — using file extension, path conventions, and content sampling.", - "sources": [ - { - "file": "wikifi/repograph.py", - "lines": [ - 43, - 70 - ], - "fingerprint": "808453182a95" - } - ] - }, - { - "section_id": "capabilities", - "finding": "For structured file kinds (SQL, OpenAPI, Protobuf, GraphQL, migration) the system short-circuits the language-model extraction path and routes to a specialized extractor, avoiding expensive model calls for files whose structure is mechanically parseable.", - "sources": [ - { - "file": "wikifi/repograph.py", - "lines": [ - 1, - 15 - ], - "fingerprint": "808453182a95" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system builds a language-pluralistic import/reference graph across all application-code files using regex-based scanning, recording for each file which files it imports and which files import it. 
This neighbor map is injected into per-file extraction prompts so findings can describe cross-file flows.", - "sources": [ - { - "file": "wikifi/repograph.py", - "lines": [ - 162, - 215 - ], - "fingerprint": "808453182a95" - } - ] - }, - { - "section_id": "capabilities", - "finding": "Import resolution handles Python relative imports (dot-prefix syntax), JavaScript/TypeScript path-style and bare module imports, Go import blocks, Java/Kotlin/Scala/C# dotted class imports, and Ruby require statements — resolving each to a concrete repo-relative path where possible.", - "sources": [ - { - "file": "wikifi/repograph.py", - "lines": [ - 280, - 340 - ], - "fingerprint": "808453182a95" - } - ] - }, - { - "section_id": "entities", - "finding": "A `GraphNode` represents a single file's position in the import graph, carrying its repo-relative path, the ordered set of files it imports, and the ordered set of files that import it. It exposes a combined neighbor list capped at a configurable limit for use in prompt enrichment.", - "sources": [ - { - "file": "wikifi/repograph.py", - "lines": [ - 143, - 162 - ], - "fingerprint": "808453182a95" - } - ] - }, - { - "section_id": "entities", - "finding": "A `RepoGraph` is the complete per-repository import graph, keyed by repo-relative file path, providing lookup of individual nodes and neighbor path lists.", - "sources": [ - { - "file": "wikifi/repograph.py", - "lines": [ - 165, - 177 - ], - "fingerprint": "808453182a95" - } - ] - }, - { - "section_id": "entities", - "finding": "A `FileKind` enumeration defines the seven recognized structural categories of source files: application code, SQL, OpenAPI, Protobuf, GraphQL, migration, and other. 
Classification drives routing to specialized or general-purpose extractors.", - "sources": [ - { - "file": "wikifi/repograph.py", - "lines": [ - 43, - 56 - ], - "fingerprint": "808453182a95" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "The implementation explicitly avoids tree-sitter or any binary dependency for import graph construction, relying only on regex and path resolution. This is a stated architectural constraint: the regex graph is considered sufficient for its sole consumer (the language model), and adding a binary dependency has been explicitly rejected.", - "sources": [ - { - "file": "wikifi/repograph.py", - "lines": [ - 22, - 30 - ], - "fingerprint": "808453182a95" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Migration files are detected by matching a hardcoded list of well-known migration directory path tokens (Alembic, Django, Rails, Knex, Flyway, Liquibase, Prisma). A SQL file in one of these directories is classified as a migration rather than generic DDL, preserving the distinction between forward-only schema changes and current schema in the generated wiki.", - "sources": [ - { - "file": "wikifi/repograph.py", - "lines": [ - 80, - 93 - ], - "fingerprint": "808453182a95" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "File classification uses a content sample of at most 4 096 bytes to detect OpenAPI/Swagger YAML and JSON files, avoiding full-file reads during the classification pass.", - "sources": [ - { - "file": "wikifi/repograph.py", - "lines": [ - 113, - 123 - ], - "fingerprint": "808453182a95" - } - ] - }, - { - "section_id": "integrations", - "finding": "This module feeds two downstream consumers: the specialized extractor dispatcher (which routes structured file kinds to dedicated parsers) and the orchestrator (which uses the `RepoGraph` to inject neighbor context into per-file extraction prompts). 
Both relationships are visible from neighbor file references.", - "sources": [ - { - "file": "wikifi/repograph.py", - "lines": [ - 1, - 10 - ], - "fingerprint": "808453182a95" - } - ] - } - ] - }, - "wikifi/report.py": { - "fingerprint": "eaa5459516bf", - "summary": "Generates a structural and quality report for a completed wiki walk, answering coverage and readiness questions for migration leads without modifying any wiki content.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "The report module exists to answer two pre-migration questions: whether the automated walk covered the entire system (per-section file and finding counts), and whether the resulting wiki is good enough to act on (per-section quality scores surfacing unsupported claims and gaps). It is designed to run read-only, relying solely on already-produced on-disk artifacts.", - "sources": [ - { - "file": "wikifi/report.py", - "lines": [ - 1, - 16 - ], - "fingerprint": "eaa5459516bf" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system can produce a coverage report showing how many files were seen, how many contributed findings, and a per-section breakdown of both — without requiring an AI provider, making it suitable for CI pipelines.", - "sources": [ - { - "file": "wikifi/report.py", - "lines": [ - 82, - 85 - ], - "fingerprint": "eaa5459516bf" - } - ] - }, - { - "section_id": "capabilities", - "finding": "When an AI provider is supplied and scoring is enabled, every populated section is evaluated by a quality critic, yielding a per-section score out of 10 along with identified gaps and unsupported claims. 
An overall score is computed as the mean across all scored sections.", - "sources": [ - { - "file": "wikifi/report.py", - "lines": [ - 106, - 114 - ], - "fingerprint": "eaa5459516bf" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The report renders as a markdown table listing each section with its file count, finding count, body character count, quality score, and the single most prominent gap or unsupported claim — giving migration leads a one-page readiness summary.", - "sources": [ - { - "file": "wikifi/report.py", - "lines": [ - 46, - 74 - ], - "fingerprint": "eaa5459516bf" - } - ] - }, - { - "section_id": "entities", - "finding": "A `SectionReport` captures the per-section view: a reference to the section definition, count of contributing files, count of findings, character length of the written body, an emptiness flag, and an optional quality critique.", - "sources": [ - { - "file": "wikifi/report.py", - "lines": [ - 29, - 36 - ], - "fingerprint": "eaa5459516bf" - } - ] - }, - { - "section_id": "entities", - "finding": "A `WikiReport` aggregates all section reports, the overall coverage statistics, and an optional mean quality score across populated sections.", - "sources": [ - { - "file": "wikifi/report.py", - "lines": [ - 39, - 44 - ], - "fingerprint": "eaa5459516bf" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Coverage statistics are authoritative from on-disk notes when available; the cache is used as a fallback only when no notes have been written — ensuring accuracy even when the cache has been deleted or the walk was run without caching.", - "sources": [ - { - "file": "wikifi/report.py", - "lines": [ - 90, - 99 - ], - "fingerprint": "eaa5459516bf" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Section emptiness is determined by detecting specific sentinel strings in the body text: 'Not yet populated', 'No findings were extracted', and 'upstream sections required to derive'. 
Sections matching any of these are excluded from quality scoring.", - "sources": [ - { - "file": "wikifi/report.py", - "lines": [ - 103, - 108 - ], - "fingerprint": "eaa5459516bf" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Three exact sentinel strings must be preserved as the canonical markers for unpopulated sections: 'Not yet populated', 'No findings were extracted', and 'upstream sections required to derive'. The report module depends on these exact strings to correctly identify and exclude empty sections from scoring and gap analysis.", - "sources": [ - { - "file": "wikifi/report.py", - "lines": [ - 103, - 108 - ], - "fingerprint": "eaa5459516bf" - } - ] - }, - { - "section_id": "integrations", - "finding": "The report reads section content from on-disk markdown files via the wiki layout abstraction, reads per-file extraction notes from a JSONL store, and optionally reads upstream section bodies to provide context when scoring derivative sections through the critic component.", - "sources": [ - { - "file": "wikifi/report.py", - "lines": [ - 78, - 130 - ], - "fingerprint": "eaa5459516bf" - } - ] - } - ] - }, - "wikifi/sections.py": { - "fingerprint": "f743972a8fce", - "summary": "Defines the complete taxonomy of wiki sections that the system generates, distinguishing between primary sections (extracted from per-file evidence) and derivative sections (synthesized from aggregates of primary sections).", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "The system exists to generate structured, technology-agnostic wiki documentation from a codebase. 
Its design explicitly separates extraction of direct per-file evidence (primary sections) from higher-order synthesis across the whole codebase (derivative sections), because single files rarely contain enough signal to infer personas, user stories, or diagrams.", - "sources": [ - { - "file": "wikifi/sections.py", - "lines": [ - 1, - 19 - ], - "fingerprint": "f743972a8fce" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system produces wiki documentation organized into eight primary sections — business domains, system intent, capabilities, external dependencies, integrations, cross-cutting concerns, core entities, and hard specifications — plus three derivative sections: user personas, Gherkin-style user stories, and Mermaid diagrams. Each derivative section declares which primary sections it depends on and is only generated after those primaries are finalized.", - "sources": [ - { - "file": "wikifi/sections.py", - "lines": [ - 44, - 142 - ], - "fingerprint": "f743972a8fce" - } - ] - }, - { - "section_id": "entities", - "finding": "A Section entity captures: a unique identifier, a human-readable title, a prose description of what belongs in that section, a tier (primary or derivative), and an ordered tuple of upstream section identifiers it is derived from. Sections form a dependency graph that must be topologically ordered — each derivative's upstreams must appear earlier in the canonical section list, enforced at startup.", - "sources": [ - { - "file": "wikifi/sections.py", - "lines": [ - 30, - 40 - ], - "fingerprint": "f743972a8fce" - } - ] - }, - { - "section_id": "domains", - "finding": "The wiki-generation process is split into two subdomains: per-file evidence extraction (primary sections, Stages 2–3) and aggregate synthesis (derivative sections, Stage 4). 
The dependency ordering between these two subdomains is a first-class design constraint enforced structurally.", - "sources": [ - { - "file": "wikifi/sections.py", - "lines": [ - 1, - 19 - ], - "fingerprint": "f743972a8fce" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Derivative sections must always reference only known section IDs, and every upstream a derivative depends on must appear earlier in the canonical SECTIONS ordering. This ordering invariant is validated at module load time and any violation raises an error, making it a hard structural requirement for the section taxonomy.", - "sources": [ - { - "file": "wikifi/sections.py", - "lines": [ - 148, - 158 - ], - "fingerprint": "f743972a8fce" - } - ] - } - ] - }, - "wikifi/walker.py": { - "fingerprint": "a29bd1ad8bdb", - "summary": "Filesystem walker that enumerates and filters source files in a repository, building directory summaries and reading manifest content for downstream analysis passes.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "The walker exists to give higher-level analysis passes a clean, noise-free view of a repository's source files. It deliberately excludes VCS metadata, dependency caches, build artifacts, and the tool's own working directory so that only meaningful source content reaches the analysis layer. It is described as intentionally provider-free — it knows nothing about analysis models or output sections.", - "sources": [ - { - "file": "wikifi/walker.py", - "lines": [ - 1, - 12 - ], - "fingerprint": "a29bd1ad8bdb" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system can enumerate all analyzable files in a repository tree while respecting gitignore rules and a configurable set of additional exclusion patterns. 
It produces a depth-limited, pre-order directory summary (file counts, byte totals, extension histograms, notable filenames) used as a compressed structural view for the Stage 1 introspection pass. It can also read a targeted set of manifest and readme files up to a configurable byte limit for inclusion in introspection prompts.", - "sources": [ - { - "file": "wikifi/walker.py", - "lines": [ - 92, - 186 - ], - "fingerprint": "a29bd1ad8bdb" - } - ] - }, - { - "section_id": "capabilities", - "finding": "File selection applies layered filtering: pattern-based exclusion, size upper-bound to discard generated or vendored assets, and a minimum content threshold to skip stubs and near-empty files before they reach any analysis model.", - "sources": [ - { - "file": "wikifi/walker.py", - "lines": [ - 100, - 130 - ], - "fingerprint": "a29bd1ad8bdb" - } - ] - }, - { - "section_id": "entities", - "finding": "WalkConfig is a configuration entity capturing: the repository root path, extra exclusion patterns beyond the defaults, whether to honour gitignore rules, a maximum file size in bytes, and a minimum stripped-content size in bytes. It is immutable once constructed.", - "sources": [ - { - "file": "wikifi/walker.py", - "lines": [ - 61, - 79 - ], - "fingerprint": "a29bd1ad8bdb" - } - ] - }, - { - "section_id": "entities", - "finding": "DirSummary is a value object representing aggregate statistics for a single (non-recursive) directory: its repo-relative path, file count, total byte size, a frequency map of file extensions (top 10), and a tuple of notable filenames (manifests, readmes) present in that directory.", - "sources": [ - { - "file": "wikifi/walker.py", - "lines": [ - 144, - 153 - ], - "fingerprint": "a29bd1ad8bdb" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Files exceeding 2 MB are silently dropped on the assumption they are vendored, generated, or binary assets; real source files are expected to fit within this bound. 
Files whose stripped text content is shorter than 64 bytes are also dropped to prevent analysis models from producing speculative or hallucinated findings on effectively empty inputs.", - "sources": [ - { - "file": "wikifi/walker.py", - "lines": [ - 61, - 79 - ], - "fingerprint": "a29bd1ad8bdb" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Manifest files read for introspection context are hard-truncated at 20,000 bytes with a visible truncation marker to keep prompt payloads bounded.", - "sources": [ - { - "file": "wikifi/walker.py", - "lines": [ - 220, - 231 - ], - "fingerprint": "a29bd1ad8bdb" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Directory traversal prunes ignored directories before descending into them, meaning exclusion patterns apply to entire subtrees efficiently rather than file-by-file.", - "sources": [ - { - "file": "wikifi/walker.py", - "lines": [ - 133, - 143 - ], - "fingerprint": "a29bd1ad8bdb" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "The maximum file size threshold is 2,000,000 bytes (2 MB); files at or above this limit are unconditionally skipped and never sent for analysis. The minimum content threshold is 64 bytes of stripped text. Manifest files are truncated to 20,000 bytes maximum before being included in any prompt.", - "sources": [ - { - "file": "wikifi/walker.py", - "lines": [ - 61, - 79 - ], - "fingerprint": "a29bd1ad8bdb" - } - ] - }, - { - "section_id": "integrations", - "finding": "The walker is called into by the introspection layer (Stage 1) and the extractor layer: introspection uses `summarize_tree` and `read_manifest_files` to build its compressed repo view, while the extractor uses `iter_files` to obtain the filtered file list for per-file analysis. 
The walker itself calls into no external services.", - "sources": [ - { - "file": "wikifi/walker.py", - "lines": [ - 1, - 12 - ], - "fingerprint": "a29bd1ad8bdb" - } - ] - } - ] - }, - "wikifi/wiki.py": { - "fingerprint": "9230b7444e0d", - "summary": "Manages the on-disk wiki directory layout, scaffolding lifecycle, and persistence of per-file extraction notes and rendered section documents.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "hard_specifications", - "finding": "The directory layout is explicitly declared as a stable contract between the tool and any target project: upgrading the tool must not break existing wikis. This constraint is called out in the module docstring and governs all future changes to path conventions.", - "sources": [ - { - "file": "wikifi/wiki.py", - "lines": [ - 1, - 13 - ], - "fingerprint": "9230b7444e0d" - } - ] - }, - { - "section_id": "entities", - "finding": "WikiLayout is a value object (immutable dataclass) that encapsulates the root path of a target project and derives all canonical sub-paths from it: the wiki directory, configuration file, gitignore file, notes directory, per-section markdown files, and per-section JSONL note files.", - "sources": [ - { - "file": "wikifi/wiki.py", - "lines": [ - 34, - 61 - ], - "fingerprint": "9230b7444e0d" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system can initialize a wiki skeleton inside a target project directory in an idempotent manner: it creates the directory structure, a provider/model configuration file, a gitignore, and one placeholder markdown file per defined section — skipping anything that already exists.", - "sources": [ - { - "file": "wikifi/wiki.py", - "lines": [ - 64, - 86 - ], - "fingerprint": "9230b7444e0d" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system can write a fully rendered markdown document for any individual section, replacing its previous content, and can append timestamped per-file extraction 
notes to a section's scratch log, read those notes back in insertion order, and wipe all notes at the start of a fresh analysis run.", - "sources": [ - { - "file": "wikifi/wiki.py", - "lines": [ - 89, - 121 - ], - "fingerprint": "9230b7444e0d" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Per-file extraction state is stored as newline-delimited JSON (JSONL), with each record automatically stamped with a UTC timestamp. Notes are excluded from version control via a generated gitignore entry, while rendered section markdown is intended to be committed.", - "sources": [ - { - "file": "wikifi/wiki.py", - "lines": [ - 96, - 121 - ], - "fingerprint": "9230b7444e0d" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "The `.wikifi/` directory layout follows a fixed, documented schema: `config.toml` for provider/model overrides, `.gitignore` for excluding notes, one `
.md` per defined section, and a `.notes/
.jsonl` per section for extraction state. This schema must remain stable across upgrades.", - "sources": [ - { - "file": "wikifi/wiki.py", - "lines": [ - 1, - 13 - ], - "fingerprint": "9230b7444e0d" - } - ] - }, - { - "section_id": "integrations", - "finding": "This module is the authoritative persistence layer consumed by the orchestrator, extractor, aggregator, deriver, and CLI — all path resolution for reading and writing wiki artifacts flows through WikiLayout rather than being scattered across those callers.", - "sources": [ - { - "file": "wikifi/wiki.py", - "lines": [ - 34, - 61 - ], - "fingerprint": "9230b7444e0d" - } - ] - } - ] - }, - "wikifi/specialized/__init__.py": { - "fingerprint": "06204b629ff9", - "summary": "Package init for the specialized extractors sub-package, documenting its structure and conventions.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "The specialized extractors package exists to provide type-aware, high-signal parsing of specific source artifact formats (SQL, OpenAPI, Protobuf, GraphQL) as an alternative or complement to the general LLM-based extraction pass, producing the same structured findings shape.", - "sources": [ - { - "file": "wikifi/specialized/__init__.py", - "lines": [ - 1, - 12 - ], - "fingerprint": "06204b629ff9" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system can parse and extract structured findings from multiple well-known artifact formats — SQL schemas, OpenAPI specifications, Protobuf definitions, and GraphQL schemas — each handled by a dedicated per-format extractor module.", - "sources": [ - { - "file": "wikifi/specialized/__init__.py", - "lines": [ - 8, - 11 - ], - "fingerprint": "06204b629ff9" - } - ] - }, - { - "section_id": "integrations", - "finding": "A dispatch function (`select`) in the sibling dispatch module maps a file's kind to the appropriate extractor, acting as the internal routing layer between artifact type detection and 
structured extraction.", - "sources": [ - { - "file": "wikifi/specialized/__init__.py", - "lines": [ - 7, - 8 - ], - "fingerprint": "06204b629ff9" - } - ] - } - ] - }, - "wikifi/specialized/graphql.py": { - "fingerprint": "1d454892894d", - "summary": "GraphQL SDL extractor that maps schema constructs to wiki sections: domain types and inputs become entities, Query/Mutation roots become capabilities, and Subscription roots become integrations.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "This module exists to let wikifi understand GraphQL schema definition files as first-class inputs, translating the type system and operation roots into technology-agnostic wiki findings rather than raw SDL syntax.", - "sources": [ - { - "file": "wikifi/specialized/graphql.py", - "lines": [ - 1, - 11 - ], - "fingerprint": "1d454892894d" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system can parse a GraphQL SDL file and enumerate every Query and Mutation root field, producing capability findings that describe what operations the API exposes. Fields are extracted per root block, with up to 30 field names captured per root.", - "sources": [ - { - "file": "wikifi/specialized/graphql.py", - "lines": [ - 101, - 113 - ], - "fingerprint": "1d454892894d" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The extractor explicitly supports modular schema composition: `extend type Query` and `extend type Mutation` declarations are treated identically to base root declarations, so capabilities defined across multiple SDL files are all surfaced rather than dropped.", - "sources": [ - { - "file": "wikifi/specialized/graphql.py", - "lines": [ - 19, - 20 - ], - "fingerprint": "1d454892894d" - } - ] - }, - { - "section_id": "entities", - "finding": "GraphQL domain object types (any `type` that is not a root operation) are recorded as domain entities, with up to 25 names listed per file. 
Interfaces (shared shape contracts), input types (request payload shapes), and enums (closed value sets) are each captured as separate entity findings.", - "sources": [ - { - "file": "wikifi/specialized/graphql.py", - "lines": [ - 56, - 95 - ], - "fingerprint": "1d454892894d" - } - ] - }, - { - "section_id": "integrations", - "finding": "GraphQL Subscription roots are mapped to the integrations section, reflecting that subscriptions represent event-driven integration touchpoints rather than direct capabilities.", - "sources": [ - { - "file": "wikifi/specialized/graphql.py", - "lines": [ - 108, - 110 - ], - "fingerprint": "1d454892894d" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Line-number anchoring is computed against the captured name offset (not the full match start) to prevent off-by-one errors when SDL files use leading whitespace, ensuring that source references in findings point accurately to declarations.", - "sources": [ - { - "file": "wikifi/specialized/graphql.py", - "lines": [ - 38, - 43 - ], - "fingerprint": "1d454892894d" - } - ] - } - ] - }, - "wikifi/specialized/openapi.py": { - "fingerprint": "bdc664e7ad72", - "summary": "Extracts structured migration intelligence from OpenAPI/Swagger contract files, surfacing endpoints, schemas, and authentication schemes as typed findings.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "This module exists to turn OpenAPI/Swagger contract files into structured migration intelligence. 
Every public endpoint, request/response schema, and authentication method declared in the contract is enumerated so migration teams have a complete, authoritative picture of the API surface without manually reading raw spec files.", - "sources": [ - { - "file": "wikifi/specialized/openapi.py", - "lines": [ - 1, - 11 - ], - "fingerprint": "bdc664e7ad72" - } - ] - }, - { - "section_id": "capabilities", - "finding": "Parses API contract files to extract: the API title and version, the full list of HTTP endpoints with their verbs, paths, and summaries (up to 20 shown inline, with a count of additional ones), and the named request/response schema models. Produces a concise summary noting total endpoint and schema counts.", - "sources": [ - { - "file": "wikifi/specialized/openapi.py", - "lines": [ - 53, - 116 - ], - "fingerprint": "bdc664e7ad72" - } - ] - }, - { - "section_id": "integrations", - "finding": "Identifies inbound integration surface: each parsed API contract contributes a finding recording how many HTTP endpoints the system exposes to external consumers, forming the inbound integration inventory for the wiki.", - "sources": [ - { - "file": "wikifi/specialized/openapi.py", - "lines": [ - 96, - 103 - ], - "fingerprint": "bdc664e7ad72" - } - ] - }, - { - "section_id": "entities", - "finding": "Extracts named API schema definitions (request/response models) from the contract's component definitions, listing up to 25 named schemas with a note when additional schemas are elided.", - "sources": [ - { - "file": "wikifi/specialized/openapi.py", - "lines": [ - 105, - 116 - ], - "fingerprint": "bdc664e7ad72" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Reads the declared security scheme types from the API contract and records the authentication contract for the API, capturing scheme categories such as API key, bearer token, or OAuth flows.", - "sources": [ - { - "file": "wikifi/specialized/openapi.py", - "lines": [ - 118, - 126 - ], - "fingerprint": 
"bdc664e7ad72" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "When an API contract file is present but cannot be parsed, the system must emit a explicit warning finding directing migration teams to review the file manually rather than silently dropping it. Specs that exceed the parser's capability are flagged, not skipped.", - "sources": [ - { - "file": "wikifi/specialized/openapi.py", - "lines": [ - 24, - 37 - ], - "fingerprint": "bdc664e7ad72" - } - ] - }, - { - "section_id": "external_dependencies", - "finding": "Optionally relies on a third-party YAML parsing library for full YAML spec support; when that library is absent the system falls back to an internal minimal YAML parser sufficient to read the OpenAPI keys it needs, so the library is a soft rather than hard dependency.", - "sources": [ - { - "file": "wikifi/specialized/openapi.py", - "lines": [ - 154, - 162 - ], - "fingerprint": "bdc664e7ad72" - } - ] - } - ] - }, - "wikifi/specialized/protobuf.py": { - "fingerprint": "5a5f77699e9b", - "summary": "Protobuf IDL extractor that parses protocol definition files to surface message types as domain entities, enum types as closed value sets, and service/RPC blocks as integration touchpoints for use by migration teams designing new interfaces.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "This extractor treats protocol definition files as pure contracts — the findings it produces are intended to be read directly by a migration team redesigning system interfaces in a new stack, requiring no knowledge of the original implementation.", - "sources": [ - { - "file": "wikifi/specialized/protobuf.py", - "lines": [ - 1, - 8 - ], - "fingerprint": "5a5f77699e9b" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system can parse interface definition files to enumerate all defined message types, closed-value enum types, named services, and their remote procedure calls — including whether any 
call leg uses a streaming transport. Results are summarised as counts of messages, services, and RPCs.", - "sources": [ - { - "file": "wikifi/specialized/protobuf.py", - "lines": [ - 26, - 95 - ], - "fingerprint": "5a5f77699e9b" - } - ] - }, - { - "section_id": "entities", - "finding": "Message types declared in a protocol definition are surfaced as named domain entities, grouped by their package namespace. Enum types are separately captured as closed value sets. Up to 25 of each kind are rendered verbatim; additional entries are noted as elided.", - "sources": [ - { - "file": "wikifi/specialized/protobuf.py", - "lines": [ - 42, - 60 - ], - "fingerprint": "5a5f77699e9b" - } - ] - }, - { - "section_id": "integrations", - "finding": "Each service block in a protocol definition is treated as an integration touchpoint. The extractor resolves which RPCs belong to each service by brace-counting to find the exact closing boundary, preventing cross-attribution in files with multiple services. Each RPC is described with its input and output message types, annotated with streaming direction where present.", - "sources": [ - { - "file": "wikifi/specialized/protobuf.py", - "lines": [ - 64, - 90 - ], - "fingerprint": "5a5f77699e9b" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Service-to-RPC attribution must be computed by tracking brace depth (counting nested blocks) rather than by line proximity, ensuring each RPC is assigned only to the service whose block encloses it — a correctness invariant required for accurate integration inventories in multi-service files.", - "sources": [ - { - "file": "wikifi/specialized/protobuf.py", - "lines": [ - 62, - 67 - ], - "fingerprint": "5a5f77699e9b" - } - ] - } - ] - }, - "wikifi/specialized/sql.py": { - "fingerprint": "ebbdecc4c021", - "summary": "SQL DDL and migration extractor that parses table definitions, relationships, and constraints from schema files, producing structured findings about entities, relational 
links, and storage invariants for use by a migration team.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "This extractor exists to help a migration team understand an existing database schema by systematically pulling entity definitions, column inventories, foreign-key relationships, and storage invariants out of SQL DDL and migration scripts. It distinguishes baseline schema files from incremental migration files so the team can identify forward-only schema changes versus original table definitions.", - "sources": [ - { - "file": "wikifi/specialized/sql.py", - "lines": [ - 1, - 13 - ], - "fingerprint": "ebbdecc4c021" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system can parse SQL DDL files to enumerate every persisted entity with its columns, extract foreign-key relationships between entities, surface UNIQUE and NOT NULL storage invariants, record indexes as performance invariants that must be carried forward, and summarize ALTER TABLE operations. It separately tags migration files so that schema additions can be distinguished from the original baseline.", - "sources": [ - { - "file": "wikifi/specialized/sql.py", - "lines": [ - 57, - 130 - ], - "fingerprint": "ebbdecc4c021" - } - ] - }, - { - "section_id": "entities", - "finding": "The internal `_TableHit` model captures a parsed database table: its name, the source line where it is defined, the raw body text, a list of column names, and a list of foreign-key edges expressed as (local column, referenced table, referenced column) tuples. 
This model is the intermediate representation that drives all downstream entity and relationship findings.", - "sources": [ - { - "file": "wikifi/specialized/sql.py", - "lines": [ - 50, - 58 - ], - "fingerprint": "ebbdecc4c021" - } - ] - }, - { - "section_id": "integrations", - "finding": "Foreign-key references between tables are recorded as hard relational links between entities — a column in one table references a column in another, which constrains how the two entities may be separated or migrated independently. Both explicit FOREIGN KEY constraint syntax and inline column-level REFERENCES syntax are detected.", - "sources": [ - { - "file": "wikifi/specialized/sql.py", - "lines": [ - 88, - 98 - ], - "fingerprint": "ebbdecc4c021" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "UNIQUE and NOT NULL constraints on any table are surfaced as storage invariants that the target system must honour. Additionally, every index definition is flagged as a query-time performance invariant explicitly annotated as something 'the new system must preserve'.", - "sources": [ - { - "file": "wikifi/specialized/sql.py", - "lines": [ - 100, - 121 - ], - "fingerprint": "ebbdecc4c021" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Indexes are explicitly declared to encode query-time performance invariants that must be preserved through migration — the extractor emits this requirement verbatim in every index finding so it is not lost in the migration report.", - "sources": [ - { - "file": "wikifi/specialized/sql.py", - "lines": [ - 115, - 121 - ], - "fingerprint": "ebbdecc4c021" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Migration files are counted differently from baseline DDL: the summary tracks all tables touched by either CREATE TABLE or ALTER TABLE operations, ensuring that ALTER-only migrations are not misleadingly reported as touching zero tables when browsing the migration report.", - "sources": [ - { - "file": 
"wikifi/specialized/sql.py", - "lines": [ - 123, - 130 - ], - "fingerprint": "ebbdecc4c021" - } - ] - } - ] - }, - "wikifi/providers/anthropic_provider.py": { - "fingerprint": "fe8422f0e6c5", - "summary": "Implements the hosted-Claude LLM provider used by the wikifi pipeline for structured per-file extraction, with prompt caching and adaptive reasoning depth to make large-scale codebase walks economically viable.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "The file's module-level docstring explains a core economic constraint: running a large multi-KB system prompt across hundreds of per-file extraction calls at full input-token price is uneconomical on large codebases. The solution is to mark the repeated system prompt as cacheable so subsequent calls pay roughly 10% of the normal input price, making hosted AI extraction cost-competitive with local alternatives at better quality.", - "sources": [ - { - "file": "wikifi/providers/anthropic_provider.py", - "lines": [ - 1, - 19 - ], - "fingerprint": "fe8422f0e6c5" - } - ] - }, - { - "section_id": "external_dependencies", - "finding": "The system depends on Anthropic's hosted Claude API for all LLM inference on this path. The default model is `claude-opus-4-7`, described as the most capable option for migration-grade domain extraction. The API key is sourced from the `ANTHROPIC_API_KEY` environment variable. A configurable HTTP timeout (default 900 seconds) guards against long-running inference calls.", - "sources": [ - { - "file": "wikifi/providers/anthropic_provider.py", - "lines": [ - 83, - 100 - ], - "fingerprint": "fe8422f0e6c5" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The provider exposes three interaction modes: (1) structured JSON extraction where the model returns a schema-validated domain object directly, (2) free-text completion for open-ended responses, and (3) multi-turn conversational chat for interactive or iterative workflows. 
All three modes share the same prompt-caching and adaptive-reasoning configuration.", - "sources": [ - { - "file": "wikifi/providers/anthropic_provider.py", - "lines": [ - 107, - 186 - ], - "fingerprint": "fe8422f0e6c5" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The provider supports configurable reasoning depth ('low', 'medium', 'high', 'max') that controls how deeply the model deliberates before producing output. This knob is intentionally mirrored to match the interface of the local (Ollama) provider so the rest of the pipeline does not need to branch on which provider is active.", - "sources": [ - { - "file": "wikifi/providers/anthropic_provider.py", - "lines": [ - 226, - 245 - ], - "fingerprint": "fe8422f0e6c5" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "All API errors from the hosted service are caught and re-raised as a uniform internal error type carrying the original request ID. This preserves the pipeline's existing per-call fallback paths, which catch broad exceptions, without leaking provider-specific error types into the orchestration layer.", - "sources": [ - { - "file": "wikifi/providers/anthropic_provider.py", - "lines": [ - 119, - 127 - ], - "fingerprint": "fe8422f0e6c5" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "The system prompt is wrapped in a cacheable block on every call, so the large shared extraction prompt is billed at the cache-read rate for all calls after the first within a cache window. 
This is described as 'the entire cost story for hosted runs' and is enabled by default but can be disabled.", - "sources": [ - { - "file": "wikifi/providers/anthropic_provider.py", - "lines": [ - 196, - 211 - ], - "fingerprint": "fe8422f0e6c5" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "When the model returns neither a parsed structured object nor any text content, a diagnostic message surfaces the stop reason, output token count, and configured token budget, along with operational hints (raise the token limit or lower reasoning effort) to help operators resolve the failure at the point it occurs.", - "sources": [ - { - "file": "wikifi/providers/anthropic_provider.py", - "lines": [ - 255, - 275 - ], - "fingerprint": "fe8422f0e6c5" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Sampling parameters (temperature, top_p, top_k) must NOT be sent to the claude-opus-4-7 model — doing so causes a 400 error. The provider omits them entirely rather than conditionally including them.", - "sources": [ - { - "file": "wikifi/providers/anthropic_provider.py", - "lines": [ - 14, - 17 - ], - "fingerprint": "fe8422f0e6c5" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "The default maximum output token budget is 32,000. The rationale is that adaptive reasoning at high effort can consume the entire budget before the structured output block is produced; too low a value causes the model to return an empty structured response. Callers using the highest effort levels are advised to increase this limit and enable streaming.", - "sources": [ - { - "file": "wikifi/providers/anthropic_provider.py", - "lines": [ - 70, - 79 - ], - "fingerprint": "fe8422f0e6c5" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Structured output is obtained via the SDK's schema-constrained decoding path (messages.parse), not via manually constructed tool-use blocks. 
The fallback path attempts to parse the raw text block as JSON if the primary parsed output is absent, before raising an error.", - "sources": [ - { - "file": "wikifi/providers/anthropic_provider.py", - "lines": [ - 107, - 145 - ], - "fingerprint": "fe8422f0e6c5" - } - ] - }, - { - "section_id": "integrations", - "finding": "This provider implements the LLMProvider interface defined in the base provider module and is consumed by the orchestrator. The interface contract includes three methods — structured JSON completion, free-text completion, and multi-turn chat — which the orchestrator calls without branching on which concrete provider is active.", - "sources": [ - { - "file": "wikifi/providers/anthropic_provider.py", - "lines": [ - 83, - 106 - ], - "fingerprint": "fe8422f0e6c5" - } - ] - } - ] - }, - "wikifi/providers/base.py": { - "fingerprint": "f40d924f0cb0", - "summary": "Defines the abstract contract every LLM backend must implement, exposing exactly three interaction modes used throughout the wiki-generation pipeline.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "The provider abstraction exists so that any LLM backend (local, hosted, or mock) can be substituted with a single-class change. 
The rest of the system never calls anything beyond these three surfaces, keeping the integration boundary explicit and narrow.", - "sources": [ - { - "file": "wikifi/providers/base.py", - "lines": [ - 1, - 17 - ], - "fingerprint": "f40d924f0cb0" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system interacts with language models in three distinct modes: (1) structured single-shot completion where the model returns a validated structured document, used for introspection, per-file extraction, and aggregation stages; (2) free-text single-shot completion for unstructured narrative output such as diagram sections; and (3) multi-turn conversation that preserves history across turns, used in the interactive REPL.", - "sources": [ - { - "file": "wikifi/providers/base.py", - "lines": [ - 5, - 14 - ], - "fingerprint": "f40d924f0cb0" - } - ] - }, - { - "section_id": "entities", - "finding": "A `ChatMessage` carries two fields — a `role` identifier and a `content` string — representing one turn in a multi-turn exchange. The `LLMProvider` entity carries a `name` (provider identity) and a `model` (specific model variant) and is the sole point of contact between the pipeline and any language model backend.", - "sources": [ - { - "file": "wikifi/providers/base.py", - "lines": [ - 33, - 52 - ], - "fingerprint": "f40d924f0cb0" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "All hosted provider backends share a single error-formatting routine that extracts a vendor-issued request identifier when present, producing a consistent diagnostic string across all backends. 
This ensures that failure messages are uniformly attributable regardless of which backend is active.", - "sources": [ - { - "file": "wikifi/providers/base.py", - "lines": [ - 54, - 63 - ], - "fingerprint": "f40d924f0cb0" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "The three abstract methods — structured completion, text completion, and chat — constitute the complete and exclusive contract that the rest of the system relies on; no other methods on a provider are ever invoked. Any conforming implementation must satisfy all three signatures exactly.", - "sources": [ - { - "file": "wikifi/providers/base.py", - "lines": [ - 42, - 52 - ], - "fingerprint": "f40d924f0cb0" - } - ] - } - ] - }, - "wikifi/providers/ollama_provider.py": { - "fingerprint": "dda16c755eff", - "summary": "Concrete implementation of the LLM provider contract backed by a locally-hosted language model service, with explicit controls for structured-output reliability and reasoning depth.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "external_dependencies", - "finding": "The system relies on a locally-hosted language model service (Ollama) for all AI inference. It connects via a configurable host address and timeout, and uses the service's native schema-enforcement mechanism to obtain structured JSON responses.", - "sources": [ - { - "file": "wikifi/providers/ollama_provider.py", - "lines": [ - 52, - 52 - ], - "fingerprint": "dda16c755eff" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The provider exposes three interaction modes: structured JSON extraction (with schema validation), free-form text completion, and multi-turn conversation. 
Structured extraction pins output temperature to zero to guarantee reproducibility across identical inputs.", - "sources": [ - { - "file": "wikifi/providers/ollama_provider.py", - "lines": [ - 58, - 91 - ], - "fingerprint": "dda16c755eff" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Temperature is fixed at zero for all structured-output calls so that the same input always produces the same structured result across runs; free-text and chat calls inherit the model's default temperature. This is a non-negotiable invariant for the JSON extraction path.", - "sources": [ - { - "file": "wikifi/providers/ollama_provider.py", - "lines": [ - 58, - 68 - ], - "fingerprint": "dda16c755eff" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Disabling the reasoning trace (think=False) on Qwen3-family models causes them to ignore the JSON schema constraint and emit free text, breaking validation. The system therefore defaults reasoning to 'high' and must never disable it for Qwen3-style models used in the structured-output path.", - "sources": [ - { - "file": "wikifi/providers/ollama_provider.py", - "lines": [ - 9, - 27 - ], - "fingerprint": "dda16c755eff" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "The default request timeout is 900 seconds, chosen to absorb the 1–3 minute per-file latency observed at the 'high' thinking level on local 27B-parameter models. Reducing this timeout risks aborting in-progress reasoning traces.", - "sources": [ - { - "file": "wikifi/providers/ollama_provider.py", - "lines": [ - 50, - 54 - ], - "fingerprint": "dda16c755eff" - } - ] - }, - { - "section_id": "integrations", - "finding": "This component implements the shared LLMProvider interface defined in the base provider module and is consumed by the orchestrator. 
It bridges the orchestrator's abstract completion requests to the concrete locally-hosted model service.", - "sources": [ - { - "file": "wikifi/providers/ollama_provider.py", - "lines": [ - 44, - 46 - ], - "fingerprint": "dda16c755eff" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "A companion content-size filter (min_content_bytes, described in the docstring as living in the walker) acts as a guard to prevent near-empty files from reaching the extractor and triggering expensive, potentially timeout-inducing reasoning traces.", - "sources": [ - { - "file": "wikifi/providers/ollama_provider.py", - "lines": [ - 28, - 35 - ], - "fingerprint": "dda16c755eff" - } - ] - } - ] - }, - "wikifi/providers/openai_provider.py": { - "fingerprint": "a64fb7819574", - "summary": "OpenAI-backed LLM provider that handles structured-output extraction, free-text completion, and multi-turn chat against OpenAI-hosted models, with automatic prompt caching and reasoning-effort routing.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "integrations", - "finding": "This module is one of three selectable hosted LLM backends (alongside local and Anthropic options). It is activated by setting the provider selector to `openai` and supplying an API key, and is invoked by the orchestrator for per-file extraction passes and synthesis steps.", - "sources": [ - { - "file": "wikifi/providers/openai_provider.py", - "lines": [ - 1, - 9 - ], - "fingerprint": "a64fb7819574" - } - ] - }, - { - "section_id": "external_dependencies", - "finding": "Depends on the OpenAI hosted API for all language-model inference. The API is used in three modes: schema-constrained structured decoding (returning validated domain objects), free-text completion, and multi-turn conversational chat. 
API errors surface as normalised runtime failures.", - "sources": [ - { - "file": "wikifi/providers/openai_provider.py", - "lines": [ - 113, - 175 - ], - "fingerprint": "a64fb7819574" - } - ] - }, - { - "section_id": "capabilities", - "finding": "Provides three interaction modes with the hosted model: (1) structured extraction that returns a fully validated domain-findings object matching a declared schema; (2) open-ended text generation; and (3) multi-turn chat for iterative wiki synthesis. All three modes share a unified output-token cap and reasoning-effort configuration.", - "sources": [ - { - "file": "wikifi/providers/openai_provider.py", - "lines": [ - 113, - 175 - ], - "fingerprint": "a64fb7819574" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Prompt caching is exploited by placing the large, repeated extraction system prompt at message position 0; OpenAI automatically caches identical prefixes of ≥ 1024 tokens for roughly 5–10 minutes, reducing latency and cost across the many per-file calls in a single wiki walk.", - "sources": [ - { - "file": "wikifi/providers/openai_provider.py", - "lines": [ - 14, - 18 - ], - "fingerprint": "a64fb7819574" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "All API failures are caught and re-raised as normalised runtime errors via a shared `format_api_error` helper, ensuring that provider-specific error shapes do not leak into the orchestration layer.", - "sources": [ - { - "file": "wikifi/providers/openai_provider.py", - "lines": [ - 128, - 135 - ], - "fingerprint": "a64fb7819574" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Reasoning-capable model families (identified by the prefixes `o` or `gpt-5`) must receive `max_completion_tokens` instead of `max_tokens`, and may optionally receive a `reasoning_effort` value of `low`, `medium`, or `high`. 
Non-reasoning models must not receive `reasoning_effort` to avoid API validation errors.", - "sources": [ - { - "file": "wikifi/providers/openai_provider.py", - "lines": [ - 215, - 235 - ], - "fingerprint": "a64fb7819574" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "The default output token cap is 16,000 tokens per call, chosen to accommodate the largest structured findings schema without hitting SDK HTTP timeout guards. The default model is `gpt-4o` and the default per-call timeout is 900 seconds.", - "sources": [ - { - "file": "wikifi/providers/openai_provider.py", - "lines": [ - 59, - 66 - ], - "fingerprint": "a64fb7819574" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "When the structured-output parse path returns no parsed object (e.g. due to a refusal or truncation), the implementation must fall back to validating the raw JSON text against the schema rather than returning a null, preserving the provider protocol's contract of raising on failure rather than silently returning nothing.", - "sources": [ - { - "file": "wikifi/providers/openai_provider.py", - "lines": [ - 136, - 144 - ], - "fingerprint": "a64fb7819574" - } - ] - } - ] - }, - "Dockerfile": { - "fingerprint": "a3f802d0c632", - "summary": "Multi-stage build placeholder with no domain logic.", - "chunks_processed": 1, - "findings": [] - }, - "Makefile": { - "fingerprint": "961d8c040205", - "summary": "Build and developer-workflow automation for the wikifi project.", - "chunks_processed": 1, - "findings": [] - }, - "docker-compose.yml": { - "fingerprint": "26be8a812822", - "summary": "Local development environment definition providing a persistent relational database service.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "external_dependencies", - "finding": "A relational database service is used for persistent data storage, configured with a dedicated user, password, and database named 'wikifi'. 
Data is persisted across restarts via a named volume.", - "sources": [ - { - "file": "docker-compose.yml", - "lines": [ - 2, - 11 - ], - "fingerprint": "26be8a812822" - } - ] - } - ] - }, - "pyproject.toml": { - "fingerprint": "e9bb63d5a6a9", - "summary": "Project manifest declaring wikifi's identity, dependencies, and entry point as a codebase-documentation tool.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "wikifi exists to walk any codebase and produce a technology-agnostic markdown wiki of its intent — helping teams understand, migrate, or document software independent of its implementation details.", - "sources": [ - { - "file": "pyproject.toml", - "lines": [ - 3, - 3 - ], - "fingerprint": "e9bb63d5a6a9" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The tool exposes a single command-line entry point (`wikifi`) that orchestrates codebase traversal and wiki generation.", - "sources": [ - { - "file": "pyproject.toml", - "lines": [ - 19, - 21 - ], - "fingerprint": "e9bb63d5a6a9" - } - ] - }, - { - "section_id": "external_dependencies", - "finding": "Three distinct LLM back-ends are supported: a locally-hosted model service (Ollama), the Anthropic Claude API, and the OpenAI API. 
Each plays the role of the inference engine that converts source code into domain-language documentation.", - "sources": [ - { - "file": "pyproject.toml", - "lines": [ - 11, - 17 - ], - "fingerprint": "e9bb63d5a6a9" - } - ] - }, - { - "section_id": "external_dependencies", - "finding": "A file-pattern library (pathspec) is used to control which files are included or excluded during codebase traversal, likely respecting .gitignore-style rules.", - "sources": [ - { - "file": "pyproject.toml", - "lines": [ - 16, - 16 - ], - "fingerprint": "e9bb63d5a6a9" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Structured data validation and settings management are enforced throughout the system via a schema/validation library, ensuring configuration and model outputs conform to expected shapes.", - "sources": [ - { - "file": "pyproject.toml", - "lines": [ - 12, - 13 - ], - "fingerprint": "e9bb63d5a6a9" - } - ] - } - ] - }, - "wikifi/specialized/dispatch.py": { - "fingerprint": "cec0697482a9", - "summary": "Routes each recognized file kind to the appropriate specialized extractor, or returns None to let the file fall through to the general LLM extraction path.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "The system distinguishes structured contract files (schemas, interface definitions, API specs, migrations) from general application code, recognizing that their machine-readable structure can be extracted more accurately and efficiently by targeted parsers than by a general prose LLM extractor. 
The dispatch layer enforces this routing decision so that the LLM path is reserved only for files where structure is implicit.", - "sources": [ - { - "file": "wikifi/specialized/dispatch.py", - "lines": [ - 1, - 13 - ], - "fingerprint": "cec0697482a9" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system can route files to four specialized extraction paths — SQL schemas, OpenAPI specifications, Protobuf definitions, and GraphQL schemas — as well as a dedicated SQL migration extraction path, based on the file's classified kind and, for migrations, its path suffix.", - "sources": [ - { - "file": "wikifi/specialized/dispatch.py", - "lines": [ - 44, - 62 - ], - "fingerprint": "cec0697482a9" - } - ] - }, - { - "section_id": "hard_specifications", - "finding": "Only migration files with `.sql` or `.ddl` suffixes are sent to the SQL migration extractor; all other migration files (e.g. Python Alembic scripts, Django initial migrations, Knex JavaScript migrations) must fall through to the LLM extraction path. This rule is enforced by inspecting the file path suffix, not just the file kind classification.", - "sources": [ - { - "file": "wikifi/specialized/dispatch.py", - "lines": [ - 28, - 62 - ], - "fingerprint": "cec0697482a9" - } - ] - }, - { - "section_id": "integrations", - "finding": "This module acts as the internal integration hub between the upstream file classifier (repograph) and the downstream specialized extractors (sql, openapi, protobuf, graphql). 
The upstream classifier tags every file in a migrations directory uniformly; this layer narrows that coarse classification into actionable routing decisions.", - "sources": [ - { - "file": "wikifi/specialized/dispatch.py", - "lines": [ - 36, - 62 - ], - "fingerprint": "cec0697482a9" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Lazy imports of the specialized extractor modules are used to keep this dispatch module cheap to load and to avoid circular dependency issues within the specialized package hierarchy — a structural invariant that must be preserved as the extractor set grows.", - "sources": [ - { - "file": "wikifi/specialized/dispatch.py", - "lines": [ - 40, - 43 - ], - "fingerprint": "cec0697482a9" - } - ] - } - ] - }, - "wikifi/specialized/models.py": { - "fingerprint": "32d041c141a3", - "summary": "Defines the shared result types and extractor contract used by specialized (non-LLM) extractors for schema and IDL files.", - "chunks_processed": 1, - "findings": [ - { - "section_id": "intent", - "finding": "Specialized extractors exist to bypass LLM processing for schema and interface-definition files, producing structured findings through a deterministic code path. 
Their output is intentionally compatible with the LLM extractor's output so both flow into a single unified notes store.", - "sources": [ - { - "file": "wikifi/specialized/models.py", - "lines": [ - 1, - 8 - ], - "fingerprint": "32d041c141a3" - } - ] - }, - { - "section_id": "capabilities", - "finding": "The system can analyze schema and IDL files (such as GraphQL, OpenAPI, Protobuf, SQL) using dedicated extractors rather than general-purpose language model inference, short-circuiting the LLM for these well-structured inputs.", - "sources": [ - { - "file": "wikifi/specialized/models.py", - "lines": [ - 4, - 6 - ], - "fingerprint": "32d041c141a3" - } - ] - }, - { - "section_id": "entities", - "finding": "A `SpecializedFinding` represents a single extracted insight tied to a wiki section, carrying the section identifier, the finding text, and one or more source references indicating where in the file the finding originates.", - "sources": [ - { - "file": "wikifi/specialized/models.py", - "lines": [ - 19, - 22 - ], - "fingerprint": "32d041c141a3" - } - ] - }, - { - "section_id": "entities", - "finding": "A `SpecializedResult` aggregates a list of `SpecializedFinding` items and an optional summary string, forming the complete output of a single specialized extractor run over one file.", - "sources": [ - { - "file": "wikifi/specialized/models.py", - "lines": [ - 25, - 27 - ], - "fingerprint": "32d041c141a3" - } - ] - }, - { - "section_id": "cross_cutting", - "finding": "Specialized extractor output conforms to the same contract as LLM extractor output — both write to the same notes store — ensuring that the downstream wiki-building pipeline is agnostic to which extraction path produced a given finding.", - "sources": [ - { - "file": "wikifi/specialized/models.py", - "lines": [ - 4, - 8 - ], - "fingerprint": "32d041c141a3" - } - ] - }, - { - "section_id": "integrations", - "finding": "The extractor function type `(rel_path, text) -> SpecializedResult` defines the 
internal integration contract between the dispatch layer and each specialized extractor (GraphQL, OpenAPI, Protobuf, SQL), as well as between all extractors and the shared evidence/notes store.", - "sources": [ - { - "file": "wikifi/specialized/models.py", - "lines": [ - 30, - 31 - ], - "fingerprint": "32d041c141a3" - } - ] - } - ] - } - } -} \ No newline at end of file diff --git a/.wikifi/.gitignore b/.wikifi/.gitignore index 23c6b94..03adbc0 100644 --- a/.wikifi/.gitignore +++ b/.wikifi/.gitignore @@ -1,2 +1,3 @@ -# wikifi local working state — section markdown is committed, notes are not. +# wikifi local working state — section markdown is committed, notes and cache are not. .notes/ +.cache/ From a308ab11232b1e47693ee537b0b9bd61486ad4e9 Mon Sep 17 00:00:00 2001 From: Dallas Pool Date: Fri, 1 May 2026 22:24:59 -0500 Subject: [PATCH 8/9] fix(pr15): honor target's .wikifi/config.toml; preserve Azure deployment IDs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Addresses two new Copilot review comments on PR #15. cli.py:197 — `report --score` ignored target's .wikifi/config.toml - New `load_target_settings(target)` in `wikifi/config.py` reads `/.wikifi/config.toml` and layers its values onto the env-derived defaults. The wiki's own config wins over per-session env vars, matching the contract printed at the top of every scaffolded `config.toml` ("overrides WIKIFI_* environment variables when present"). - `walk`, `chat`, and `report` CLI commands now use it; only `init` stays on `get_settings()` because it's the command that *creates* config.toml. - Allow-list of overridable fields (`provider`, `model`, `ollama_host`) so a stale or hand-edited config can't silently start steering fields the user didn't sign up for. - Malformed TOML logs a warning and falls back to env defaults rather than crashing the command. 
orchestrator.py:194 — OpenAI model swap clobbered Azure deployment IDs - `_looks_like_openai_model` was an allow-list (gpt-/o1/o3/o4/ft:) that fell back to "gpt-4o" for everything else, including valid Azure-OpenAI deployment names like `prod-gpt4o`, `eastus-chat`, or `my-team-deployment`. With `openai_base_url` now documented for Azure use, the swap silently routed users to the wrong model. - Replaced with `_looks_like_ollama_model` (deny-list): only swaps when the model id obviously looks like an Ollama identifier (`family:tag`), excluding fine-tuned OpenAI models which also contain a colon (`ft:gpt-4o:...`). Anything else passes through — Azure deployment IDs, plain proxy aliases, and untouched OpenAI defaults all keep their configured value. Tests - 189 tests pass (was 183). New regression coverage: - `load_target_settings` happy path, toml-wins-over-env, missing config, and malformed-toml warning paths - Azure deployment ID and fine-tuned OpenAI model both pass through the OpenAI provider builder unchanged - Updated `test_build_provider_returns_openai_when_selected` to use a realistic Ollama model id (`qwen3.6:27b`) since the new heuristic only swaps obviously-Ollama identifiers. --- tests/test_config.py | 93 ++++++++++++++++++++++++++++++++++++++ tests/test_orchestrator.py | 39 ++++++++++++++-- wikifi/cli.py | 8 ++-- wikifi/config.py | 70 +++++++++++++++++++++++++++- wikifi/orchestrator.py | 26 +++++++---- 5 files changed, 219 insertions(+), 17 deletions(-) diff --git a/tests/test_config.py b/tests/test_config.py index 69852dc..6e79f13 100644 --- a/tests/test_config.py +++ b/tests/test_config.py @@ -20,3 +20,96 @@ def test_get_settings_is_cached(): a = get_settings() b = get_settings() assert a is b + + +def test_load_target_settings_reads_config_toml(tmp_path, monkeypatch): + """`/.wikifi/config.toml` overrides field defaults. 
+ + A target wiki initialized with `provider = "anthropic"` should + produce settings that say "anthropic" even when the calling shell + has no WIKIFI_* env vars set. + """ + from wikifi.config import load_target_settings, reset_settings_cache + + # ``Settings`` reads ``.env`` from CWD; chdir to tmp_path so the + # project-root .env (which sets WIKIFI_PROVIDER=anthropic) doesn't + # leak into the test. + monkeypatch.chdir(tmp_path) + monkeypatch.delenv("WIKIFI_PROVIDER", raising=False) + monkeypatch.delenv("WIKIFI_MODEL", raising=False) + monkeypatch.delenv("WIKIFI_OLLAMA_HOST", raising=False) + reset_settings_cache() + + wiki_dir = tmp_path / ".wikifi" + wiki_dir.mkdir() + (wiki_dir / "config.toml").write_text( + 'provider = "anthropic"\nmodel = "claude-opus-4-7"\nollama_host = "http://unused:11434"\n' + ) + + settings = load_target_settings(tmp_path) + assert settings.provider == "anthropic" + assert settings.model == "claude-opus-4-7" + reset_settings_cache() + + +def test_load_target_settings_toml_wins_over_env(tmp_path, monkeypatch): + """The target wiki's `config.toml` wins over per-session env vars. + + Matches the contract printed at the top of every scaffolded + `config.toml`: "overrides WIKIFI_* environment variables when + present". A wiki initialized for a hosted backend should keep + using that backend even if the user happens to have + `WIKIFI_PROVIDER=ollama` exported in their shell. 
+ """ + from wikifi.config import load_target_settings, reset_settings_cache + + monkeypatch.setenv("WIKIFI_PROVIDER", "ollama") + monkeypatch.setenv("WIKIFI_MODEL", "qwen3.6:27b") + reset_settings_cache() + + wiki_dir = tmp_path / ".wikifi" + wiki_dir.mkdir() + (wiki_dir / "config.toml").write_text( + 'provider = "anthropic"\nmodel = "claude-opus-4-7"\n', + ) + + settings = load_target_settings(tmp_path) + assert settings.provider == "anthropic" + assert settings.model == "claude-opus-4-7" + reset_settings_cache() + + +def test_load_target_settings_handles_missing_config(tmp_path, monkeypatch): + """No `.wikifi/config.toml` → fall back cleanly to env defaults.""" + from wikifi.config import load_target_settings, reset_settings_cache + + monkeypatch.chdir(tmp_path) + monkeypatch.delenv("WIKIFI_PROVIDER", raising=False) + monkeypatch.delenv("WIKIFI_MODEL", raising=False) + reset_settings_cache() + + settings = load_target_settings(tmp_path) + assert settings.provider == "ollama" + reset_settings_cache() + + +def test_load_target_settings_ignores_malformed_toml(tmp_path, monkeypatch, caplog): + """A corrupt config.toml warns and falls back instead of raising.""" + import logging + + from wikifi.config import load_target_settings, reset_settings_cache + + monkeypatch.chdir(tmp_path) + monkeypatch.delenv("WIKIFI_PROVIDER", raising=False) + reset_settings_cache() + + wiki_dir = tmp_path / ".wikifi" + wiki_dir.mkdir() + (wiki_dir / "config.toml").write_text("not = valid = toml = at all\n") + + with caplog.at_level(logging.WARNING, logger="wikifi.config"): + settings = load_target_settings(tmp_path) + + assert settings.provider == "ollama" + assert any("could not read" in record.message for record in caplog.records) + reset_settings_cache() diff --git a/tests/test_orchestrator.py b/tests/test_orchestrator.py index f4f2458..212cf8c 100644 --- a/tests/test_orchestrator.py +++ b/tests/test_orchestrator.py @@ -104,14 +104,19 @@ def 
test_build_provider_returns_anthropic_when_selected(monkeypatch): def test_build_provider_returns_openai_when_selected(monkeypatch): - """``provider='openai'`` dispatches to OpenAIProvider with a GPT default.""" + """``provider='openai'`` dispatches to OpenAIProvider with a GPT default. + + The default-swap fires when the configured model id is obviously + an Ollama identifier (``family:tag``) — the common "user opted + into openai but forgot to update WIKIFI_MODEL" case. + """ monkeypatch.setenv("OPENAI_API_KEY", "test-key") - settings = _settings(provider="openai", model="m") # non-gpt id + settings = _settings(provider="openai", model="qwen3.6:27b") provider = build_provider(settings) from wikifi.providers.openai_provider import OpenAIProvider assert isinstance(provider, OpenAIProvider) - # Falls back to gpt-4o rather than 404'ing on "m". + # Falls back to gpt-4o rather than 404'ing on the Ollama default. assert provider.model.startswith("gpt-") @@ -124,6 +129,34 @@ def test_build_provider_preserves_explicit_openai_model(monkeypatch): assert provider.model == model +def test_build_provider_preserves_azure_openai_deployment_id(monkeypatch): + """Arbitrary Azure / proxy deployment IDs survive the swap. + + Azure-OpenAI (and OpenAI-compatible proxies) commonly use + deployment names that don't match the upstream OpenAI prefixes — + e.g. ``prod-gpt4o``, ``eastus-chat``, ``my-team-deployment``. + Replacing them with ``gpt-4o`` would silently route the user to + the wrong model on a perfectly valid configuration. 
+ """ + monkeypatch.setenv("OPENAI_API_KEY", "test-key") + for deployment in ("prod-gpt4o", "eastus-chat", "my-team-deployment", "fine-tuned-v3"): + settings = _settings( + provider="openai", + model=deployment, + openai_base_url="https://my-azure-endpoint.openai.azure.com/", + ) + provider = build_provider(settings) + assert provider.model == deployment, f"{deployment} should pass through unchanged" + + +def test_build_provider_preserves_fine_tuned_openai_model(monkeypatch): + """``ft:gpt-4o:org::id`` contains a colon but stays on the OpenAI path.""" + monkeypatch.setenv("OPENAI_API_KEY", "test-key") + settings = _settings(provider="openai", model="ft:gpt-4o:my-org::abc123") + provider = build_provider(settings) + assert provider.model == "ft:gpt-4o:my-org::abc123" + + def test_run_walk_persists_cache_for_resumability(mini_target, mock_provider_factory): """A second walk reuses the cache and skips the LLM call for unchanged files.""" settings = _settings() diff --git a/wikifi/cli.py b/wikifi/cli.py index 8f4f4e0..8207421 100644 --- a/wikifi/cli.py +++ b/wikifi/cli.py @@ -23,7 +23,7 @@ from wikifi import __version__ from wikifi.cache import reset as reset_cache from wikifi.chat import run_repl -from wikifi.config import get_settings +from wikifi.config import get_settings, load_target_settings from wikifi.orchestrator import build_provider, init_wiki, run_walk from wikifi.report import build_report from wikifi.wiki import WikiLayout @@ -102,7 +102,7 @@ def walk( ) -> None: """Walk the target codebase and populate every wiki section.""" target = target or Path.cwd() - settings = get_settings() + settings = load_target_settings(target) if no_cache: settings = settings.model_copy(update={"use_cache": False}) reset_cache(WikiLayout(root=target)) @@ -169,7 +169,7 @@ def chat(target: TargetArg = None) -> None: ) raise typer.Exit(code=1) - settings = get_settings() + settings = load_target_settings(target) provider = build_provider(settings) run_repl(layout=layout, 
provider=provider, console=console) @@ -192,7 +192,7 @@ def report( ) raise typer.Exit(code=1) - settings = get_settings() + settings = load_target_settings(target) provider = build_provider(settings) if score else None wiki_report = build_report(layout=layout, provider=provider, score=score) console.print(Markdown(wiki_report.render())) diff --git a/wikifi/config.py b/wikifi/config.py index 67f7f1f..977b5a6 100644 --- a/wikifi/config.py +++ b/wikifi/config.py @@ -1,7 +1,23 @@ -"""Runtime settings loaded from environment / .env. +"""Runtime settings loaded from environment / .env / target's .wikifi/config.toml. Defaults assume a local Ollama server with qwen3.6:27b. Override any field via -WIKIFI_* env vars or a .env file in the target project's CWD. +WIKIFI_* env vars, a .env file in the target project's CWD, or by writing +provider/model entries into ``/.wikifi/config.toml`` (the file +``wikifi init`` scaffolds — and what callers expect to be authoritative for +that wiki). + +Resolution order, highest precedence first: + +1. ``/.wikifi/config.toml`` +2. ``WIKIFI_*`` environment variables (and ``.env``) +3. Field defaults + +The wiki's own ``config.toml`` wins over per-session env vars: a wiki +initialized for a hosted backend should still drive its own runs even +when the user happens to have ``WIKIFI_PROVIDER=ollama`` exported in +their shell. This matches the contract printed at the top of every +generated ``config.toml`` ("overrides WIKIFI_* environment variables +when present"). 
Hosted providers are opt-in: - ``WIKIFI_PROVIDER=anthropic`` (plus ``ANTHROPIC_API_KEY``) @@ -10,11 +26,17 @@ from __future__ import annotations +import logging +import tomllib from functools import lru_cache +from pathlib import Path +from typing import Any from pydantic import Field from pydantic_settings import BaseSettings, SettingsConfigDict +log = logging.getLogger("wikifi.config") + class Settings(BaseSettings): model_config = SettingsConfigDict( @@ -144,3 +166,47 @@ def reset_settings_cache() -> None: Used by tests that mutate ``WIKIFI_*`` env vars between cases. """ get_settings.cache_clear() + + +# Field names a wiki's ``config.toml`` is allowed to override. We accept +# only the fields ``wikifi init`` writes today (provider, model, +# ollama_host) so a stale or hand-edited config can't silently start +# overriding behavior the user didn't sign up for. +_TARGET_CONFIG_FIELDS: frozenset[str] = frozenset({"provider", "model", "ollama_host"}) + + +def load_target_settings(target: Path) -> Settings: + """Return :class:`Settings` for a wiki at ``target``. + + Reads ``/.wikifi/config.toml`` (when present) and layers + its values on top of the env-derived defaults — the wiki's own + config wins over per-session env vars, matching the contract + printed at the top of every generated ``config.toml``. + + Without this, ``wikifi report --score `` (and the other + target-aware commands) would build a provider from the process-wide + defaults regardless of what the target wiki was actually + initialized with — fine when target equals CWD, but wrong when the + user is operating against another project's wiki. 
+    """
+    base = get_settings()
+    overrides = _read_target_config(target)
+    if not overrides:
+        return base
+    effective: dict[str, Any] = {field: value for field, value in overrides.items() if field in _TARGET_CONFIG_FIELDS}
+    if not effective:
+        return base
+    return base.model_copy(update=effective)
+
+
+def _read_target_config(target: Path) -> dict[str, Any]:
+    """Parse ``/.wikifi/config.toml``; return ``{}`` on any failure."""
+    config_path = target / ".wikifi" / "config.toml"
+    if not config_path.exists():
+        return {}
+    try:
+        with config_path.open("rb") as handle:
+            return tomllib.load(handle)
+    except (OSError, tomllib.TOMLDecodeError) as exc:
+        log.warning("could not read %s: %s; falling back to env-only settings", config_path, exc)
+        return {}
diff --git a/wikifi/orchestrator.py b/wikifi/orchestrator.py
index bddb94f..4069ffa 100644
--- a/wikifi/orchestrator.py
+++ b/wikifi/orchestrator.py
@@ -188,10 +188,15 @@ def build_provider(settings: Settings) -> LLMProvider:
     if settings.provider == "openai":
         from wikifi.providers.openai_provider import OpenAIProvider
 
-        # Same default-swap guard as the Anthropic path: a user opting
-        # in to OpenAI shouldn't 404 because the Ollama model id is
-        # still in their config.
-        model = settings.model if _looks_like_openai_model(settings.model) else "gpt-4o"
+        # Same default-swap guard as the Anthropic path, but inverted:
+        # only swap when the model id is *obviously* an Ollama
+        # identifier (the user opted into openai but forgot to update
+        # WIKIFI_MODEL). Anything else passes through unchanged so
+        # Azure-OpenAI / proxy deployments — which use arbitrary
+        # deployment IDs like ``prod-gpt4o`` or ``eastus-chat`` that
+        # don't match the upstream OpenAI naming convention — keep
+        # working.
+        model = "gpt-4o" if _looks_like_ollama_model(settings.model) else settings.model
         return OpenAIProvider(
             model=model,
             api_key=settings.openai_api_key,
@@ -203,7 +208,12 @@ def build_provider(settings: Settings) -> LLMProvider:
     raise ValueError(f"unknown provider {settings.provider!r}; expected 'ollama', 'anthropic', or 'openai'")
 
 
-def _looks_like_openai_model(model: str) -> bool:
-    """Heuristic — covers gpt-*, o1/o3/o4 reasoning, and ft: variants."""
-    lowered = model.lower()
-    return lowered.startswith(("gpt-", "o1", "o3", "o4", "ft:"))
+def _looks_like_ollama_model(model: str) -> bool:
+    """Heuristic — Ollama uses ``family:tag`` (e.g. ``qwen3.6:27b``).
+
+    Fine-tuned OpenAI models also contain ``:`` (``ft:gpt-4o:...``)
+    so we exclude that prefix. Anything else without a ``:`` —
+    upstream OpenAI ids, Azure deployment names, plain proxy aliases —
+    is left alone.
+    """
+    return ":" in model and not model.lower().startswith("ft:")
From 942af1714d3d3bbc6436c745b5509d93d1d094f9 Mon Sep 17 00:00:00 2001
From: Dallas Pool
Date: Mon, 4 May 2026 11:03:12 -0500
Subject: [PATCH 9/9] wikifi

---
 .wikifi/capabilities.md          | 184 ++++++-------
 .wikifi/cross_cutting.md         | 159 ++++--------
 .wikifi/diagrams.md              | 425 +++++++++++--------------------
 .wikifi/domains.md               |  71 +++---
 .wikifi/entities.md              | 252 +++++++++++-------
 .wikifi/external_dependencies.md |  76 ++++--
 .wikifi/hard_specifications.md   | 253 +++++++++++++-----
 .wikifi/integrations.md          | 125 +++++----
 .wikifi/intent.md                |  85 ++++---
 .wikifi/personas.md              | 150 +----------
 .wikifi/user_stories.md          | 322 +++++++++--------------
 11 files changed, 971 insertions(+), 1131 deletions(-)

diff --git a/.wikifi/capabilities.md b/.wikifi/capabilities.md
index 9ced16c..2b780f7 100644
--- a/.wikifi/capabilities.md
+++ b/.wikifi/capabilities.md
@@ -1,107 +1,115 @@
 # Capabilities
 
-wikifi analyzes any target codebase and produces a structured, technology-agnostic wiki that captures domain knowledge, system intent, capabilities, external
dependencies, integrations, cross-cutting concerns, core entities, and hard specifications — expressed entirely in domain terms rather than in the language of a specific technology stack. +wikifi turns any codebase into a structured, technology-agnostic wiki by walking source files, extracting structured knowledge, and synthesizing readable documentation with a full evidence trail that links every assertion back to specific source locations. -## Workspace Initialization +## Documentation Pipeline -Before analysis begins, the system bootstraps a wiki workspace inside the target project in an idempotent manner, creating the required directory structure, a configuration file, version-control ignore rules, and one placeholder document per defined section. Repeat invocations leave already-existing artifacts untouched. +Documentation is produced through four ordered stages: -## Codebase Analysis Pipeline +**1. Repository Triage** +The system examines the directory layout and manifest files to determine which paths contain production source worth deeper analysis and which should be skipped (vendored dependencies, build artifacts, generated files, CI configuration). Files outside configurable size bounds are excluded before any analysis begins.[7][8] -The core pipeline runs in four ordered stages: +**2. Per-file Extraction** +Each in-scope file is analyzed to produce structured findings describing its contribution to each wiki section.[9] Files whose format is well-structured — relational schemas, API contracts, interface definitions, and migration scripts — are routed to dedicated deterministic extractors that bypass AI inference entirely, improving accuracy and reducing cost for these artifact types. General-purpose source files are analyzed via AI inference, with large files split into overlapping chunks so no content is lost at boundaries. 
Findings are deduplicated across chunk boundaries to avoid double-counting, and each finding carries a citation (path and line range) for downstream traceability. -1. **Repository introspection** — The system compresses the repository's directory layout and reads key manifest files, then uses this compact view to classify every path as either worth walking (production source, business logic, integrations, domain models) or worth skipping (vendored dependencies, build output, tests, CI/CD). The classification is returned as a structured, diffable result. +Optionally, the system builds a cross-file import and reference graph before extraction begins. Each file's extraction is then enriched with its neighborhood in that graph — which files it depends on and which depend on it — enabling findings to describe cross-file flows rather than treating files in isolation. -2. **Per-file extraction** — Every in-scope file is routed through one of three extraction paths: - - *Cache replay* — if a file's content is unchanged since the last run, previously stored findings are reused without any further processing. - - *Deterministic schema parsing* — files recognised as structured schema artifacts (SQL DDL, database migrations, API contract specs, interface definition files, and graph schema files) are processed by purpose-built parsers that produce findings about entities, relationships, operations, and constraints without invoking an AI model. - - *AI-assisted extraction* — all remaining files pass through an AI extraction pass; large files are recursively split into overlapping chunks so no content is missed regardless of size. +**3. Section Synthesis** +Per-file findings are aggregated into coherent, readable markdown bodies for each primary wiki section. Every assertion in the output is backed by numbered citations traceable to the specific source files and line ranges from which it was inferred. 
- Every finding carries a source citation — the originating file path, an inclusive line range, and a content fingerprint — enabling full traceability back to the codebase. - -3. **Cross-file context enrichment** — In parallel with extraction, the system builds an import and reference graph across the entire in-scope file set. Each file's neighborhood (the files it depends on and the files that depend on it) is injected into its extraction prompt, enabling findings to describe inter-file flows rather than treating each file in isolation. - -4. **Section aggregation** — Per-file findings are grouped by their target wiki section and synthesised into readable markdown bodies. Every asserted claim is backed by numbered citations pointing to the originating files and line ranges. Where two or more files make incompatible assertions about the same topic, the system surfaces the conflict explicitly in a dedicated *Conflicts in source* block rather than silently resolving it — a deliberate feature for legacy codebases where disagreements encode high-priority migration signals. +**4. Derivative Section Generation** +Higher-level artifacts — user personas, scenario-based user stories, and architectural diagrams — are synthesized from the finalized primary sections. If upstream content is absent, the system writes a placeholder declaring the gap rather than fabricating content. ## Wiki Structure -The generated wiki is organised into **eleven sections**: eight primary sections populated directly from per-file evidence, and three derivative sections synthesised from the completed primaries: - -| Section type | Sections | -|---|---| -| Primary (8) | Business domains, system intent, capabilities, external dependencies, integrations, cross-cutting concerns, core entities, hard specifications | -| Derivative (3) | User personas, Gherkin-style user stories, Mermaid architectural diagrams | - -Derivative sections are only generated after the primaries they depend on are finalised. 
If upstream primary sections are empty or missing, the system writes a placeholder that declares the gap rather than fabricating content. - -## Quality Assurance +The generated wiki covers **eight primary sections**: business domains, system intent, capabilities, external dependencies, integrations, cross-cutting concerns, core entities, and hard specifications. Three derivative sections (personas, user stories, architectural diagrams) are generated only after all relevant primary sections are finalized. -An optional critic-and-reviser pass evaluates any synthesised section against its brief and the upstream evidence it drew from, producing a structured quality score (0–10) with itemised unsupported claims, gaps, and suggested edits. When a section scores below a configurable threshold, a revision is automatically invoked; the revision is accepted only if it matches or improves the original score, preventing regressions. This loop is particularly valuable for derivative sections — personas and user stories — where single-shot synthesis is most prone to introducing unsupported assertions. +The system can scaffold a complete wiki directory structure in a target project in an idempotent manner — re-running initialization leaves existing content untouched while creating any missing pieces. -## Incremental and Resumable Walks +## Conflict Detection and Evidence Traceability -The pipeline uses a two-scope content-addressed cache: per-file extraction results are keyed to a combination of file path and content fingerprint, and per-section aggregation results are keyed to a digest of the contributing notes payload. Only changed files and affected sections are reprocessed on incremental runs. Because results are persisted after every completed file, an interrupted walk resumes from the last unprocessed file rather than restarting from scratch. The cache can also be fully invalidated to force a clean re-walk. 
+When source files make incompatible assertions about the same domain topic, the conflict is surfaced explicitly under a dedicated heading rather than silently resolved. This is a deliberate design choice for legacy codebases, where tribal knowledge often hides in inconsistencies; teams are directed to resolve conflicts before re-implementation. Claims that appear in the supporting evidence but cannot be matched to the synthesized narrative body are collected into a separate supporting-claims list, ensuring nothing is silently dropped. -## Coverage and Quality Reporting - -A report command produces a human-readable markdown table summarising every wiki section by contributing file count, finding count, body size, optional critic-derived quality score, and the highest-priority content gap identified by the critic. Coverage statistics also surface *dead zones* — files that were processed but produced no findings — so teams can identify blind spots in the analysis. - -## Interactive Knowledge Querying - -Once a wiki has been generated, users can open an interactive conversational session grounded in all populated sections. The session supports multi-turn exchanges, conversation history reset, and introspection of which sections are currently loaded as context. Only meaningfully populated sections are included, ensuring the assistant is not grounded in placeholder content. - -## Graceful Degradation +## Quality Assurance -When AI synthesis fails for a section, the system falls back to emitting the raw collected notes directly in the section body, preserving information at the cost of polish and surfacing the error inline. Similarly, unparseable schema files produce an advisory finding directing reviewers to inspect the file manually rather than silently failing. 
+An optional critic-and-reviser loop evaluates each synthesized section against a structured rubric (scored 0–10), identifies unsupported claims and coverage gaps, and triggers a revision pass when the score falls below a configurable threshold. A revised body is accepted only if it improves or matches the prior score, preventing regressions. The loop is off by default to keep generation time predictable, but is most beneficial for derivative sections where single-shot synthesis is most likely to stray from evidence. If synthesis fails entirely for a section, the system falls back to emitting the raw notes directly, preserving information at the cost of polish. + +## Coverage and Readiness Reporting + +A dedicated report command produces a markdown table listing each section with its file count, finding count, body size, quality score, and the most prominent gap or unsupported claim — giving teams a one-page readiness summary. The coverage portion requires no AI provider and is safe for automated pipelines. + +## Incremental and Crash-Resumable Operation + +Two independently keyed caches — one per file (keyed by content fingerprint) and one per section (keyed by a hash of its full notes payload) — allow re-walks to skip unchanged material entirely. The cache is written after each file completes, making the pipeline crash-resumable.[34] Stale entries for files removed from the repository can be pruned. A monotonically incremented version number embedded in every cache file causes a clean rebuild on version mismatch, preventing stale data from surviving format upgrades. Cache files are written atomically so a crash during persistence never corrupts the stored state. A force-reanalysis mode is also available to drop the cache entirely and perform a clean walk. + +## Interactive Query Interface + +Once a wiki is generated, users can open a conversational session grounded in the populated wiki sections. Only sections with meaningful content are loaded as context. 
The session supports multi-turn questioning, conversation history reset while retaining the wiki context, and inspection of which sections are currently loaded. + +## Supporting claims +- wikifi produces a technology-agnostic wiki from any codebase, linking every assertion back to specific source locations. [1][2][3] +- Stage 1 examines the directory layout and manifest files to classify paths as worth walking or skippable (vendored dependencies, build output, generated files, CI configuration). [4][5][6] +- Files with well-structured formats (relational schemas, API contracts, interface definitions, migration scripts) are routed to dedicated deterministic extractors that bypass AI inference entirely. [10][11][12][13][14][15] +- Large files are split into overlapping chunks so no content is lost at boundaries, with separators tried from coarsest to finest. [16] +- Findings are deduplicated across chunk boundaries to avoid double-counting, and each finding carries a citation (path and line range). [9] +- The system optionally builds a cross-file import and reference graph, injecting each file's neighborhood into its extraction pass to enable cross-file flow descriptions. [17][18][19] +- Per-file findings are aggregated into coherent, readable markdown bodies for each primary wiki section, with every assertion backed by numbered citations. [1][3] +- Derivative sections — user personas, scenario-based user stories, and architectural diagrams — are synthesized from finalized primary sections; absent upstream content produces a placeholder rather than fabricated content. [20][21] +- The wiki covers eight primary sections and three derivative sections; derivative sections are generated only after their upstream primaries are finalized. [21] +- The system scaffolds a complete wiki directory structure idempotently, leaving existing content untouched while creating missing pieces. 
[22] +- Incompatible assertions across source files are surfaced explicitly under a dedicated heading rather than silently resolved — a deliberate feature for legacy codebases where tribal knowledge hides in inconsistencies. [23][24] +- Claims that appear in the supporting evidence but cannot be matched to the narrative body are collected into a separate supporting-claims list rather than silently dropped. [3] +- An optional critic-and-reviser loop evaluates each synthesized section on a 0–10 rubric, identifies unsupported claims and gaps, and triggers revision when the score falls below a configurable threshold, accepting the revision only if it improves or matches the prior score. [25][26] +- The critic-reviser loop is off by default to keep generation time predictable, and is most beneficial for derivative sections where single-shot synthesis is most likely to fabricate. [25][27] +- If synthesis fails entirely, the system falls back to emitting raw notes directly in the section body, preserving information at the cost of polish. [28] +- A report command produces a per-section markdown table with file counts, finding counts, body size, quality score, and the most prominent gap or unsupported claim. [29][30][31] +- The coverage portion of the report requires no AI provider and is safe for automated pipelines. [29] +- Two independently keyed caches — per-file (content fingerprint) and per-section (notes-payload hash) — allow re-walks to skip unchanged material entirely. [32][33][34] +- Stale cache entries for removed files can be pruned in bulk. [35][18] +- A monotonically incremented version number in every cache file triggers a clean rebuild on version mismatch, preventing stale data from surviving upgrades. [36] +- Cache files are written atomically so a crash during persistence never leaves a corrupted cache. [37] +- A force-reanalysis mode drops the on-disk cache entirely to perform a clean walk. 
[38] +- Users can open a conversational session grounded in populated wiki sections, supporting multi-turn questioning, history reset, and inspection of loaded sections. [39][2] +- Only sections with meaningful content are loaded as context for the conversational session; placeholder sections are filtered out. [40] ## Sources -1. `VISION.md:6-8` -2. `wikifi/sections.py:44-142` -3. `README.md:14-24` -4. `wikifi/orchestrator.py:62-76` -5. `wikifi/wiki.py:64-86` -6. `wikifi/introspection.py:28-44` -7. `wikifi/introspection.py:61-70` -8. `wikifi/walker.py:92-186` -9. `wikifi/extractor.py:140-200` -10. `wikifi/cache.py:5-8` -11. `README.md:34-36` -12. `TESTING-AND-DEMO.md:116-149` -13. `wikifi/config.py:75-81` -14. `wikifi/repograph.py:41-52` -15. `wikifi/specialized/__init__.py:46-57` -16. `wikifi/specialized/sql.py:56-62` -17. `wikifi/extractor.py:251-270` -18. `TESTING-AND-DEMO.md:40-66` -19. `wikifi/aggregator.py:1-15` -20. `TESTING-AND-DEMO.md:90-114` -21. `wikifi/config.py:69-74` -22. `wikifi/extractor.py:241-246` -23. `wikifi/repograph.py:155-210` -24. `wikifi/evidence.py:88-121` -25. `wikifi/aggregator.py:9-14` -26. `wikifi/evidence.py:13-17` -27. `VISION.md:53-63` -28. `wikifi/deriver.py:73-107` -29. `TESTING-AND-DEMO.md:151-164` -30. `wikifi/config.py:83-94` -31. `wikifi/critic.py:100-153` -32. `wikifi/deriver.py:90-103` -33. `TESTING-AND-DEMO.md:67-88` -34. `wikifi/cache.py:9-12` -35. `wikifi/config.py:63-68` -36. `wikifi/cache.py:14-18` -37. `wikifi/cache.py:105-113` -38. `README.md:16-20` -39. `wikifi/cli.py:88-112` -40. `README.md:21-23` -41. `TESTING-AND-DEMO.md:166-186` -42. `wikifi/critic.py:155-180` -43. `wikifi/report.py:44-77` -44. `wikifi/report.py:103-107` -45. `README.md:24-25` -46. `wikifi/chat.py:88-130` -47. `wikifi/chat.py:63-82` -48. `wikifi/cli.py:60-220` -49. `wikifi/aggregator.py:272-285` -50. `wikifi/specialized/openapi.py:23-50` +1. `wikifi/aggregator.py:1-15` +2. `wikifi/cli.py:63-210` +3. `wikifi/evidence.py:85-120` +4. 
`wikifi/introspection.py:28-44` +5. `wikifi/introspection.py:61-70` +6. `wikifi/walker.py:92-186` +7. `.env.example:20-29` +8. `wikifi/walker.py:100-130` +9. `wikifi/extractor.py` +10. `wikifi/config.py:97-102` +11. `wikifi/extractor.py:183-218` +12. `wikifi/repograph.py:1-15` +13. `wikifi/specialized/__init__.py:8-11` +14. `wikifi/specialized/dispatch.py:44-62` +15. `wikifi/specialized/models.py:4-6` +16. `wikifi/extractor.py:298-360` +17. `wikifi/config.py:60-93` +18. `wikifi/orchestrator.py:55-145` +19. `wikifi/repograph.py:162-215` +20. `wikifi/deriver.py:73-107` +21. `wikifi/sections.py:44-142` +22. `wikifi/wiki.py:72-101` +23. `wikifi/aggregator.py:9-14` +24. `wikifi/evidence.py:121-133` +25. `wikifi/config.py:103-113` +26. `wikifi/critic.py:100-153` +27. `wikifi/deriver.py:90-103` +28. `wikifi/aggregator.py:272-285` +29. `wikifi/report.py:82-85` +30. `wikifi/report.py:106-114` +31. `wikifi/report.py:46-74` +32. `wikifi/cache.py:5-15` +33. `wikifi/config.py:88-96` +34. `wikifi/extractor.py:166-182` +35. `wikifi/cache.py:113-118` +36. `wikifi/cache.py:37` +37. `wikifi/cache.py:205-209` +38. `wikifi/cli.py:90-122` +39. `wikifi/chat.py:88-130` +40. `wikifi/chat.py:63-82` diff --git a/.wikifi/cross_cutting.md b/.wikifi/cross_cutting.md index d504546..4dfe08d 100644 --- a/.wikifi/cross_cutting.md +++ b/.wikifi/cross_cutting.md @@ -1,124 +1,57 @@ # Cross-Cutting Concerns -## Observability +## Observability and Logging -A consistent, pipeline-wide observability model spans every stage of the system. Structured logging is initialised once and reused across all subcommands; a single verbose flag activates debug-level output globally without each subsystem needing its own toggle. Stage-boundary log events are emitted at each major transition — repository introspection, dependency-graph construction, file extraction, section aggregation, and derivative synthesis — so operators can pinpoint where a long walk is spending time. 
Revision and quality-scoring events are counted in the run's statistics, and cache hit counts are surfaced in the post-walk report, giving a quantitative picture of incremental efficiency. +Log verbosity is configured globally before any pipeline stage executes: a verbose flag activates debug-level output, while the default level is informational. Structured log events are emitted at the entry point of each pipeline stage, giving operators a continuous view of progress across the entire run. -## Resilience and Error Handling +All significant failure modes follow a uniform pattern: errors are caught at the point of failure, logged at WARNING level, and a graceful fallback is substituted so that downstream stages are never blocked. Specifically: -The system is designed so that no single failure can abort an entire pipeline run. Extraction failures — whether caused by an inference provider or a specialised deterministic parser — are logged and tallied but never propagated upward; a file whose processing fails entirely is recorded as skipped, and partially-recovered files retain whatever findings were salvaged. Aggregation and derivation failures follow the same pattern: errors are caught and logged at warning level, and a fallback body that preserves the raw upstream evidence is written so the wiki remains inspectable. Quality-assurance (critic and reviser) failures degrade gracefully to returning the original body with a diagnostic score of zero rather than halting. Provider failures during interactive query sessions are surfaced inline without terminating the session. Across all provider backends, raw infrastructure errors are caught at the provider boundary and re-raised as a normalised internal error type carrying the upstream request identifier when available, so the rest of the pipeline does not branch on provider-specific exception shapes. +- Aggregation failures produce a fallback body that preserves the raw notes, ensuring a section is always written. 
+- Derivation failures write a fallback body that retains the upstream evidence verbatim rather than leaving the section blank. +- Quality-review failures return the original body with a zero score and a diagnostic annotation rather than propagating the error. -## Content-Addressed Caching and Crash Resumability +When an inference call returns neither structured output nor any usable text, a diagnostic message surfaces the stop reason, output-token count, and configured resource budget, together with actionable hints so operators can resolve the issue at the point it occurs. -All expensive inference work is protected by a two-scope content-addressed cache stored under a dedicated hidden subdirectory within the wiki output directory, inheriting the same version-control ignore rules as other working-state artifacts. +All provider backends share a single error-formatting routine that extracts a vendor-issued request identifier when present, producing uniformly attributable failure messages regardless of which backend is active. -- **Extraction scope:** each file's results are keyed by the combination of its relative path and a stable hash of its raw bytes. Any unchanged file is skipped on re-walk with no inference call. -- **Aggregation scope:** each section's synthesised body is keyed by a deterministic digest of its note payload. Unchanged inputs reuse the stored body and evidence bundle. +--- -Cache entries are written after every individual file completes, so a mid-walk crash loses at most one file's work. Writes are performed atomically — content is first written to a temporary location and then renamed into place — preventing corrupt partial writes. Malformed entries are silently dropped and logged rather than causing a hard failure, so a partially corrupt cache degrades gracefully to a fresh extraction for only the affected entries. 
A monotonically increasing version tag is embedded in every persisted cache file; a version mismatch on load causes the entire cache to be discarded and rebuilt, providing a controlled invalidation path across software upgrades. Between runs, entries for files no longer in scope are pruned automatically. +## Error Isolation and Graceful Degradation -## Input Integrity Guards +Errors are scoped as narrowly as possible throughout the pipeline: -A layered set of guards prevents low-signal or pathological inputs from ever reaching the inference layer. - -| Guard | Threshold | Effect | -|---|---|---| -| Minimum content size | 64 bytes (stripped) | File silently skipped | -| Maximum file size | 2 MB | File silently skipped | -| Large-file windowing | 150 KB – 2 MB | File split into overlapping chunks with 8 KB overlap | -| Manifest truncation | 20 000 bytes | Hard-truncated with visible marker | -| Per-request timeout | 900 seconds | Uniform backstop across all providers | - -Directory traversal prunes excluded subtrees before descending into them, so ignore patterns are applied efficiently at the directory level rather than file-by-file. Files carrying no extractable intent — stub initialisers, empty fixtures, generated lockfiles — are identified and dropped before reaching the inference layer; the invariant that a single empty or unstructured file must never stall the walk is explicitly upheld. Findings produced from the overlap region between adjacent large-file chunks are deduplicated by section and normalised text within each file's pass, preventing double-counting in downstream aggregation. - -## Provider Abstraction - -All inference calls — structured extraction, free-text generation, and multi-turn chat — are routed through a single provider abstraction layer. This boundary is where observability hooks, retry logic, error normalisation, and backend-switching concerns live; no extraction or aggregation logic needs knowledge of which backend is active. 
Supported backend shapes include local inference runtimes and hosted services; the local-inference path is the default, with hosted options as addenda, and swapping between them requires no changes outside the provider boundary. - -Structured-output calls enforce a schema-validation contract: the model response must be validated against a declared schema before being returned to the caller, ensuring type-safe data flows through every pipeline stage. To maximise determinism, temperature is hard-pinned to zero on all structured-output calls; free-text and conversational paths accept model-default variability in exchange for naturalness. - -When a backend exposes a reasoning-depth control, the system runs at the highest available setting, prioritising output quality over walk speed. A configurable depth parameter is translated into the provider's native adaptive-thinking feature, allowing callers to trade latency and cost against quality without branching on provider type in shared pipeline code. - -Hosted backends employ prompt-caching strategies — placing the large, repeated system prompt at a fixed position in every request so the service can serve subsequent calls from a cached prefix — making large-scale walks economically viable by paying full input cost only on the first call and a fraction of that cost on subsequent ones. - -## Source Traceability and Hallucination Prevention - -Full source traceability is a non-negotiable structural invariant: every assertion in every wiki section must be linkable back to the originating file and, where available, the precise line range within it. This is enforced through typed evidence structures (claims and source references) rather than by convention, so the constraint cannot be silently bypassed. - -Hallucination prevention operates at two additional levels. 
First, the inference prompt explicitly instructs the model never to name specific technologies, translating all observations into domain terms — this is a mandatory invariant enforced at the prompt layer. Second, upstream section content that matches known placeholder shapes is filtered out before derivative synthesis, preventing empty or stub sections from being treated as real evidence; these same sentinel strings are used by the quality-report layer to exclude placeholder sections from scoring. Interactive query sessions are similarly grounded: the assistant is instructed to explicitly acknowledge when the wiki does not cover a topic rather than generating unsupported answers. - -Content fingerprints serve a triple cross-cutting role: keying both extraction and aggregation caches so stale results are never served, anchoring source-evidence citations so claims can be re-verified against a fresh repository walk, and tracking file identity inside the dependency graph so cross-file context is invalidated when any contributing source changes. Files are always fingerprinted as raw bytes rather than decoded text to ensure the cache layer and the extractor agree on identity regardless of encoding assumptions. - -## Authentication and Storage Invariants - -Specialised deterministic parsers extract security and data-integrity contracts from high-signal artifacts and surface them as first-class cross-cutting concerns that must be preserved through any migration: - -- **Authentication schemes** declared in API contract files are extracted and categorised by type, flagging which security contracts (key-based, delegated authorisation, bearer-token, etc.) the new system must honour. -- **Data integrity constraints** — uniqueness and non-nullability — found in schema definitions are extracted as storage invariants explicitly marked as migration-critical. 
-- **Query-performance invariants** — index definitions — are recorded with an explicit note that the new system must preserve equivalent access patterns. - -All specialised parsers return results in the same structured shape as the general inference extractor, so the aggregation layer needs no knowledge of which extraction path was taken; this uniform interface contract is itself an invariant that must be preserved. - -## Data Storage Layout - -The pipeline's working state is isolated to a single hidden directory within the repository: - -- **Rendered section documents** live at the root of this directory and are intended to be committed to version control. -- **Per-section extraction notes** (JSONL, each record UTC-timestamped) are stored in a notes subdirectory and excluded from version control via a generated ignore file. -- **Extraction and aggregation caches** are stored in a cache subdirectory and similarly excluded. - -Deleting the cache subdirectory forces a full re-walk; deleting the entire working directory resets all pipeline state. This layout ensures generated documentation commits remain clean and the boundary between committed outputs and ephemeral working state is unambiguous. - -## Sources -1. `wikifi/cli.py:51-60` -2. `wikifi/orchestrator.py:84-148` -3. `wikifi/cli.py:90-97` -4. `wikifi/deriver.py:110-135` -5. `wikifi/report.py:22` -6. `wikifi/extractor.py:228-242` -7. `wikifi/aggregator.py:143-152` -8. `wikifi/deriver.py:96-107` -9. `wikifi/critic.py:158-165` -10. `wikifi/chat.py:120-125` -11. `wikifi/providers/anthropic_provider.py:238-244` -12. `wikifi/providers/openai_provider.py:248-255` -13. `README.md:40-43` -14. `wikifi/fingerprint.py:1-18` -15. `wikifi/aggregator.py:126-155` -16. `TESTING-AND-DEMO.md:67-88` -17. `wikifi/extractor.py:155-175` -18. `wikifi/cache.py:189-193` -19. `wikifi/cache.py:196-222` -20. `wikifi/cache.py:38` -21. `wikifi/orchestrator.py:95-110` -22. `wikifi/cache.py:19-21` -23. `wikifi/config.py:56-59` -24. 
`wikifi/walker.py:61-79` -25. `wikifi/config.py:38-56` -26. `wikifi/walker.py:220-231` -27. `.env.example:16-29` -28. `wikifi/config.py:33-34` -29. `wikifi/walker.py:133-143` -30. `README.md:44-46` -31. `VISION.md:99-100` -32. `wikifi/extractor.py:253-262` -33. `CLAUDE.md:53-54` -34. `VISION.md:92-96` -35. `wikifi/providers/base.py:36-38` -36. `wikifi/providers/ollama_provider.py:58-68` -37. `VISION.md:97-98` -38. `wikifi/providers/anthropic_provider.py:212-232` -39. `wikifi/providers/anthropic_provider.py:193-210` -40. `wikifi/providers/openai_provider.py:13-17` -41. `wikifi/evidence.py:1-18` -42. `wikifi/aggregator.py:54-67` -43. `wikifi/deriver.py:118-135` -44. `wikifi/report.py:118-123` -45. `wikifi/chat.py:27-31` -46. `wikifi/fingerprint.py:44-50` -47. `wikifi/specialized/openapi.py:110-121` -48. `wikifi/specialized/sql.py:97-98` -49. `wikifi/specialized/sql.py:113-125` -50. `wikifi/specialized/__init__.py:9-13` -51. `TESTING-AND-DEMO.md:249-265` -52. `wikifi/wiki.py:96-121` +- A failure in one file chunk does not abort extraction of the remaining chunks; a failure on one file does not abort the repository walk. Files that fail entirely are counted as skipped rather than silently lost. +- Provider inference failures during interactive sessions are surfaced as inline error messages rather than terminating the session. +- Cache I/O failures (missing files, malformed content, or bad individual entries) are logged as warnings and fall back to an empty cache, preserving pipeline continuity. +- Provider-specific error shapes are never allowed to leak into the orchestration layer; all backends normalize errors through a shared formatting helper before re-raising them. + +--- + +## Data Integrity and Source Provenance + +Full source provenance is a non-negotiable invariant: every claim in the output must carry the source file path, line range, and content fingerprint that justifies it. 
This citation chain is preserved through caching and replay so that any re-walk of the repository can verify claims against the current source. + +Content fingerprints serve three cross-cutting roles: + +| Role | Effect | +|---|---| +| Cache keying | Stale extraction or aggregation results are never served when source content changes | +| Citation anchoring | Claims in the wiki can be traced to the exact file revision that produced them | +| Dependency-graph invalidation | Cross-file context is invalidated when any referenced file changes | + +Files are always hashed as raw bytes rather than decoded text, ensuring that encoding differences never cause the cache and the extractor to disagree on a file's identity. The aggregation cache key deliberately includes each source file's fingerprint and line range in addition to the finding text, so that even if the text is unchanged, a shift in the cited location triggers a cache miss and re-derives citations from fresh evidence. + +Contradictions found in the source are never silently merged. They are always rendered explicitly in the output so that data-integrity issues visible in the source are escalated to the team rather than hidden. + +Note records are stamped with a UTC timestamp at write time, providing an audit trail of when each per-file extraction was recorded. + +Structured inference output is constrained to a strict schema at every pipeline stage, ensuring deterministic parsing and making successive runs straightforwardly diffable. + +--- + +## Hallucination and Fabrication Prevention + +Several independent mechanisms work together to ensure all generated content is grounded in extracted evidence: + +- **Deterministic structured output**: Temperature is fixed at zero for all structured-output inference calls so that the same input always produces the same structured result across runs. This is treated as a non-negotiable invariant for the extraction path. 
+- **Placeholder filtering**: A heuristic matches all known empty-section shapes ( diff --git a/.wikifi/diagrams.md b/.wikifi/diagrams.md index 5a0e4c3..4b27753 100644 --- a/.wikifi/diagrams.md +++ b/.wikifi/diagrams.md @@ -1,283 +1,146 @@ # Diagrams -_Derivation failed for **Diagrams** (anthropic provider: empty parsed_output and parse fallback failed: 1 validation error for DerivedSection - Invalid JSON: EOF while parsing a value at line 1 column 0 [type=json_invalid, input_value='', input_type=str] - For further information visit https://errors.pydantic.dev/2.13/v/json_invalid). Upstream evidence preserved below._ - -> Brief: Mermaid diagrams that visualize structural and behavioral relationships across the system: a domain map (graph or classDiagram across domains), an entity relationship view (erDiagram across entities), and an integration flow (sequence or flowchart across integrations). Tech-agnostic — no reference to current stack. - - -## From domains -# Domains and Subdomains - -## Core Domain - -The system's core domain is **codebase knowledge extraction**: ingesting an existing source base, classifying its contents, deriving domain findings from individual files, and synthesising those findings into a structured, technology-agnostic wiki. The primary consumers are migration teams who need to understand business intent, domain structure, and operational behaviour before re-implementing or replacing a legacy system. - -## Subdomains - -### Repository Introspection -This subdomain concerns discovering and classifying the files that make up a target codebase. Its central responsibility is distinguishing production source that encodes business intent from infrastructure, tooling, and other artefacts that do not. Tech-agnosticism is a first-class constraint here: the classification logic must not rely on recognising any specific language, framework, or runtime. 
- -### Per-File Knowledge Extraction -Once relevant files are identified, each is analysed independently to surface domain findings. This subdomain covers the full extraction loop — examining file content, applying domain heuristics, and producing structured evidence — and forms the first phase of wiki generation (primary sections). - -### Section Synthesis and Aggregation -The second phase of wiki generation operates over the evidence produced by per-file extraction. It aggregates findings across files into coherent wiki sections, derives higher-level content that cannot be inferred from any single file, and enforces the dependency ordering between primary (evidence-driven) and derivative (aggregated) sections. This ordering is a structural design constraint, not merely a runtime convention. - -### Wiki Authoring and Organisation -A secondary domain governs how extracted knowledge is structured and stored. It defines the taxonomy of sections, distinguishes primary from derivative content, and produces output that a migration team can navigate and consume independently of the source codebase. - -### Interactive Knowledge Retrieval -A supporting subdomain exposes the generated wiki to conversational or query-driven access, allowing stakeholders to interrogate extracted knowledge without directly inspecting raw wiki files. - -## Cross-Cutting Constraint: Tech-Agnosticism -Tech-agnosticism spans every subdomain. All analysis, extraction, and synthesis must produce domain-level descriptions that are free of references to specific languages, frameworks, or libraries. This constraint is enforced at both the classification stage (repository introspection) and the output stage (section content). 
- -## Subdomain Relationships - -| Subdomain | Role | Depends On | -|---|---|---| -| Repository Introspection | Identifies source worth analysing | — | -| Per-File Knowledge Extraction | Produces primary section evidence | Introspection | -| Section Synthesis & Aggregation | Produces derivative sections | Per-File Extraction | -| Wiki Authoring & Organisation | Structures and stores the wiki | Synthesis | -| Interactive Knowledge Retrieval | Queries the completed wiki | Authoring | - -## Sources -1. `README.md:28-52` -2. `VISION.md:3-20` -3. `wikifi/cli.py:1-8` -4. `wikifi/introspection.py:19-44` -5. `wikifi/sections.py:1-19` - -## From entities -# Core Entities - -The system's domain model spans five functional layers — wiki structure, file classification, extraction, evidence, and review — plus supporting entities for caching, derivation, and chat. - ---- - -## Wiki Structure - -**Section** is the central organizing entity. Each section carries a unique identifier, a human-readable title, a prose description of what belongs in it, and a tier (primary or derivative). Derivative sections additionally declare an ordered list of upstream section identifiers they depend on, forming an explicit dependency graph. An invariant holds at startup: every upstream identifier in a derivative section's dependency list must refer to a section that appears earlier in the canonical ordering (topological sort enforced). - -**WikiLayout** is an immutable value object that encodes the on-disk structure of a wiki workspace. Given a project root, it derives all canonical sub-paths: the wiki directory, configuration file, gitignore file, notes directory, per-section markdown files, and per-section note files. No fields are mutable after construction. - -**WalkConfig** is an immutable configuration record consumed by the filesystem walker. 
It captures the repository root, extra exclusion patterns, a flag for honouring ignore rules, a maximum file size in bytes, and a minimum stripped-content size in bytes. - ---- - -## File Classification and Graph - -**FileKind** is a closed enumeration of seven mutually exclusive file roles: application code, SQL, OpenAPI, Protobuf, GraphQL, migration, and other. This classification determines whether a file is routed to a specialised deterministic parser or the general-purpose extraction path. - -**GraphNode** represents a single file's position in the repository's import graph. It carries the file's repo-relative path, the ordered set of files it imports, and the ordered set of files that import it. It exposes a capped combined-neighbour list for inclusion in extraction prompts. - -**RepoGraph** holds the complete import-edge map for a repository scan. It supports node lookup by path and retrieval of a capped neighbour list for any given file, providing cross-file context during extraction. - -**DirSummary** is a value object holding aggregate statistics for a single (non-recursive) directory: its repo-relative path, file count, total byte size, a frequency map of the top-10 file extensions, and a tuple of notable filenames (manifests, readmes) present in that directory. - ---- - -## Extraction Layer - -**SectionFinding** represents one file's contribution to one wiki section. It carries the target section identifier, a technology-agnostic prose description of the contribution, and an optional inclusive line range within the source chunk. - -**FileFindings** groups a one-sentence summary of a file with all `SectionFinding` records produced for it. - -**SpecializedFinding** is the output unit of the deterministic parsing paths. It carries a section identifier, a human-readable description, and a list of source references. 
**SpecializedResult** groups zero or more such findings with an optional summary string; this is the uniform output contract for all specialised extractors, ensuring interoperability with the general extraction path downstream. - -**ExtractionStats** is a walk-level counter record, accumulating: total files seen, files yielding at least one finding, total findings, skipped files, chunks processed, cache hits, specialised-extractor invocations, and a per-kind file breakdown. - ---- - -## Evidence Layer - -**SourceRef** represents a single span of source: a repo-relative file path, an optional inclusive line range, and a short content fingerprint captured at extraction time for change detection. - -**Claim** represents one assertion placed in a wiki section. It carries the markdown text and a list of `SourceRef` values that justify it. A claim with no sources is explicitly marked unsupported — this is a first-class state, not an error. - -**Contradiction** groups two or more conflicting `Claim` objects about the same topic under a single summary sentence. Each disagreeing position retains its own source references, preserving full traceability. - -**EvidenceBundle** is the aggregator's structured output for a single wiki section. It combines the narrative body text, a list of `Claim` records, and a list of `Contradiction` records. The renderer uses the bundle to thread numbered citations and a conflicts block into the final markdown. - -During aggregation, the pipeline works with intermediate forms: **AggregatedClaim** pairs a single prose assertion with the 1-based indices of the input notes that support it, and **AggregatedContradiction** holds a one-sentence summary alongside multiple conflicting positional claims, each with its own note indices. These are the structured forms that the language model produces before being resolved into the full evidence model. 
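The resolution step described above, where an intermediate claim carrying 1-based note indices becomes a claim with source references, can be sketched as follows. This is a minimal illustration, not the code in `wikifi/aggregator.py` or `wikifi/evidence.py`: the field names (`lines`, `note_indices`), the `unsupported` property, the `resolve_claim` helper, and the choice to drop out-of-range indices are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class SourceRef:
    """A span of source: path, optional inclusive line range, fingerprint."""
    path: str
    lines: Optional[Tuple[int, int]] = None  # hypothetical field name
    fingerprint: str = ""


@dataclass
class Claim:
    """One assertion plus the references that justify it."""
    text: str
    sources: List[SourceRef] = field(default_factory=list)

    @property
    def unsupported(self) -> bool:
        # A claim with no sources is a valid state, not an error.
        return not self.sources


@dataclass
class AggregatedClaim:
    """Intermediate form: prose plus 1-based indices into the input notes."""
    text: str
    note_indices: List[int]  # hypothetical field name


def resolve_claim(agg: AggregatedClaim, note_refs: List[SourceRef]) -> Claim:
    """Map 1-based note indices to SourceRefs.

    Assumption for this sketch: out-of-range indices are dropped and
    duplicate references are collapsed, preserving first-seen order.
    """
    seen: Set[SourceRef] = set()
    sources: List[SourceRef] = []
    for idx in agg.note_indices:
        if 1 <= idx <= len(note_refs):
            ref = note_refs[idx - 1]  # convert 1-based index to 0-based
            if ref not in seen:
                seen.add(ref)
                sources.append(ref)
    return Claim(text=agg.text, sources=sources)
```

Under these assumptions, a model response that cites only invalid indices yields a claim whose `unsupported` property is true, which matches the description of sourceless claims as a first-class state.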
- ---- - -## Cache Entities - -**CachedFindings** stores the extraction result for a single file: the file's content fingerprint, the list of structured findings produced, a one-sentence summary, and a count of processed chunks. Its invariant is content-addressed — the fingerprint is the cache key. - -**CachedSection** stores the aggregation result for a single wiki section: the hash of the notes payload that produced it, the rendered markdown body, and lists of claims and contradictions. It too is content-addressed on the notes hash. - -**WalkCache** is the in-memory container for both caches. It holds extraction and aggregation entries alongside hit and miss counters, enabling observability into cache effectiveness across a run. - ---- - -## Quality and Review Layer - -**Critique** captures the quality assessment of a single section: an integer score (0–10), a short overall judgment, a list of unsupported claims, a list of gaps relative to the section brief, and a list of concrete revision suggestions. - -**ReviewOutcome** tracks a section's review lifecycle: the section identifier, the initial critique, the current body text, a flag indicating whether a revision was applied, and the optional follow-up critique produced after revision. - -**WikiQualityReport** aggregates the full-wiki audit: an overall numeric score, a mapping from section identifiers to their individual `Critique` records, and optional coverage statistics. - -**CoverageStats** records total files seen, files with findings, and per-section breakdowns of finding counts and contributing file counts; it exposes a coverage-percentage computation. - -**SectionReport** captures the per-section view for reporting: the section descriptor, count of contributing files, total findings count, body size in characters, an emptiness flag, and an optional quality critique. 
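As a rough sketch of how the `Critique` and `ReviewOutcome` records could drive the score-based acceptance rule (revise when the score is below a threshold, keep the revision only if it matches or improves the prior score), consider the following. It is not the code in `wikifi/critic.py`: the field names, the `review_section` helper, the callable signatures, and the default threshold of 7 are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Critique:
    """Quality assessment of one section (field names are illustrative)."""
    score: int  # rubric score on a 0-10 scale
    summary: str = ""
    unsupported_claims: List[str] = field(default_factory=list)
    gaps: List[str] = field(default_factory=list)
    suggestions: List[str] = field(default_factory=list)


@dataclass
class ReviewOutcome:
    """Review lifecycle of one section."""
    section_id: str
    initial: Critique
    body: str
    revised: bool = False
    followup: Optional[Critique] = None


def review_section(
    section_id: str,
    body: str,
    critic: Callable[[str], Critique],
    reviser: Callable[[str, Critique], str],
    min_score: int = 7,  # hypothetical default threshold
) -> ReviewOutcome:
    """Critique a body and revise it only when the score is below min_score."""
    initial = critic(body)
    outcome = ReviewOutcome(section_id=section_id, initial=initial, body=body)
    if initial.score >= min_score:
        return outcome  # good enough: no revision pass is spent

    candidate = reviser(body, initial)
    followup = critic(candidate)
    outcome.followup = followup
    # Accept the revision only if it matches or improves the prior score,
    # so a revision can never make the section worse.
    if followup.score >= initial.score:
        outcome.body = candidate
        outcome.revised = True
    return outcome
```

Passing the critic and reviser in as plain callables keeps the acceptance logic testable without a provider; in the real pipeline both would presumably be backed by inference calls.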
-
-**WikiReport** aggregates all `SectionReport` records alongside overall coverage statistics and an optional mean quality score across populated sections.
-
----
-
-## Derivation and Pipeline Outputs
-
-**IntrospectionResult** captures the Stage 1 decision about which files are worth deeper analysis: a list of gitignore-style include patterns, a list of exclude patterns, a list of primary languages (informational), a one-paragraph guess at the system's purpose, and a rationale for the choices made.
-
-**AggregationStats** records, for a single aggregation run, how many sections were written fresh, skipped due to empty notes, or served from cache.
-
-**DerivationStats** accumulates pipeline metrics for the derivation stage: counts of sections derived, skipped, and revised, plus the full list of `ReviewOutcome` records. It acts as an audit trail for the synthesis stage.
-
-**WalkReport** is the single return value of a completed wiki-generation run, aggregating the introspection result, extraction statistics, aggregation statistics, derivation statistics, the live cache state, and the repository import graph.
-
----
-
-## Chat Layer
-
-**ChatMessage** carries a role and a content field, representing a single turn in a multi-turn conversation. Lists of these are accumulated to maintain conversation history.
-
-**LoadedSection** pairs a `Section` descriptor with its rendered markdown body, representing a single populated section ready for inclusion in a chat context.
-
-**ChatSession** holds a provider reference, the frozen system prompt built from wiki sections, and the accumulated conversation history as an ordered list of `ChatMessage` records. It supports appending user and assistant turns and clearing history while retaining the wiki context.
-
----
-
-## Relationships and Invariants Summary
-
-| Entity | Key relationships | Notable invariants |
-|---|---|---|
-| Section | depends on upstream Sections (derivative tier only) | Dependency graph must be topologically ordered |
-| WikiLayout | derived from a project root | Immutable; all paths are computed, not stored independently |
-| SourceRef | referenced by Claim, SpecializedFinding | Fingerprint enables staleness detection |
-| Claim | groups SourceRefs; composed into EvidenceBundle | Sourceless claims are explicitly flagged unsupported |
-| Contradiction | groups ≥2 conflicting Claims | Each position retains its own SourceRefs |
-| CachedFindings | keyed on file content fingerprint | Cache miss if fingerprint changes |
-| CachedSection | keyed on notes-payload hash | Cache miss if any upstream note changes |
-| ReviewOutcome | holds pre- and post-revision Critique | Revision flag distinguishes touched from untouched sections |
-| WalkReport | aggregates all four stage outputs | Single return value for a complete run |
-
-## Sources
-1. `wikifi/sections.py:30-40`
-2. `wikifi/deriver.py:112-116`
-3. `wikifi/cli.py:166-172`
-4. `wikifi/wiki.py:34-61`
-5. `wikifi/walker.py:61-79`
-6. `README.md:31-33`
-7. `wikifi/repograph.py:41-52`
-8. `wikifi/repograph.py:148-167`
-9. `wikifi/repograph.py:170-181`
-10. `wikifi/walker.py:144-153`
-11. `wikifi/extractor.py:106-123`
-12. `wikifi/specialized/__init__.py:29-38`
-13. `wikifi/extractor.py:126-135`
-14. `wikifi/evidence.py:37-52`
-15. `README.md:37-39`
-16. `wikifi/evidence.py:55-67`
-17. `wikifi/aggregator.py:166-186`
-18. `wikifi/evidence.py:70-77`
-19. `wikifi/aggregator.py:74-101`
-20. `README.md:46-48`
-21. `wikifi/evidence.py:80-85`
-22. `wikifi/cache.py:44-51`
-23. `wikifi/cache.py:54-60`
-24. `wikifi/cache.py:63-70`
-25. `wikifi/aggregator.py:103-107`
-26. `wikifi/critic.py:67-84`
-27. `wikifi/critic.py:91-96`
-28. `wikifi/critic.py:99-114`
-29. `wikifi/report.py:85-94`
-30. `wikifi/report.py:28-42`
-31. `wikifi/introspection.py:47-64`
-32. `wikifi/deriver.py:57-62`
-33. `wikifi/orchestrator.py:54-61`
-34. `wikifi/cli.py:118-153`
-35. `wikifi/providers/base.py:28-30`
-36. `wikifi/chat.py:42-45`
-37. `wikifi/chat.py:46-57`
-38. `wikifi/specialized/sql.py:64-84`
-39. `wikifi/specialized/sql.py:99-111`
-40. `wikifi/specialized/graphql.py:32-81`
-41. `wikifi/specialized/protobuf.py:44-68`
-42. `wikifi/specialized/openapi.py:94-108`
-
-## From integrations
-# Integrations
-
-### Inbound: Entry Points into the System
-
-The system is distributed as a library installed directly into a target project. The command-line interface (CLI) is the primary inbound entry point, exposing subcommands that drive the full pipeline from repository introspection through wiki generation, interactive querying, and quality reporting. The CLI delegates all pipeline coordination to the orchestrator, which is also the central hub wiring together every downstream stage.
-
----
-
-### Outbound: AI Model Backends
-
-All pipeline stages — introspection, per-file extraction, section aggregation, derivative content derivation, quality critique, and interactive chat — communicate with an AI model backend exclusively through a shared provider abstraction. No stage calls a specific backend directly. Three interaction shapes are exposed through this abstraction: schema-validated structured output, free-text completion, and multi-turn stateful conversation.
-
-Three backends are available and are interchangeable without altering any pipeline code:
-
-| Backend type | Hosting model |
-|---|---|
-| Local self-hosted inference runtime | On-premise / developer machine |
-| Hosted AI service (Anthropic-compatible) | Remote cloud |
-| Hosted AI service (OpenAI-compatible) | Remote cloud or self-managed endpoint |
-
-The active backend is selected via an environment variable or a per-invocation flag at the CLI level. OpenAI-compatible endpoints — including corporate reverse proxies and managed cloud deployments — are supported by overriding the base URL alone, with no other changes to the calling code.
-
----
-
-### Outbound: Development-Time Tool Servers (MCP)
-
-A separate set of external capability providers is declared through an MCP client configuration used during development or runtime. Four tool servers are wired up: a local AI utility, a local web crawler, a remote documentation-context service, and a remote search-and-stitching service. The system acts as an MCP client that fans requests out to these providers as needed.
-
----
-
-### Outbound: Filesystem and Persistence Layer
-
-All reading and writing of wiki artifacts — extraction notes, finished section bodies, and cache entries — flows through a centralized layout abstraction that manages the `.wikifi/` output directory inside the target project. The extractor, aggregator, deriver, CLI, and orchestrator all resolve paths through this abstraction rather than independently.
-
-A content-addressed cache layer sits between the orchestrator and the AI backend, consulting a fingerprinting service to derive content hashes as cache keys. The extractor, aggregator, and orchestrator each consult the cache before issuing AI calls, enabling both incremental re-runs and resumability for large codebases.
-
----
-
-### Integration Touchpoints Discovered in Target Codebases
-
-When analyzing a target codebase, the system identifies and surfaces integration touchpoints from high-signal artifact files through specialized parsers:
-
-- **HTTP API surfaces** — Parsed from API contract files; each contract contributes a finding recording the count of externally exposed endpoints, establishing the public-facing API surface as a documented integration point.
-- **RPC service definitions** — Each declared service and its remote procedures are mapped, capturing procedure names, request and response message types, and whether either channel is streaming.
-- **Event-driven channels** — Subscription roots found in schema definition files are classified as real-time integration touchpoints rather than ordinary capabilities, reflecting their role as channels that external consumers attach to.
-- **Relational links** — Foreign key declarations (both explicit and inline) are surfaced as hard relational links between domain entities, identifying cross-entity data dependencies.
-
-The dispatcher that routes files to these specialized parsers uses the file-kind classification produced by the repository graph module, ensuring each artifact type reaches the appropriate parser while preserving a uniform output contract for downstream aggregation.
-
-## Sources
-1. `README.md:8-12`
-2. `wikifi/cli.py:98-101`
-3. `wikifi/orchestrator.py:40-60`
-4. `wikifi/providers/base.py:30-48`
-5. `wikifi/providers/anthropic_provider.py:115-175`
-6. `wikifi/providers/ollama_provider.py:58-95`
-7. `wikifi/providers/openai_provider.py:1-8`
-8. `README.md:46-51`
-9. `TESTING-AND-DEMO.md:232-235`
-10. `.mcp.json:2-36`
-11. `wikifi/wiki.py:34-61`
-12. `wikifi/cache.py:244-246`
-13. `wikifi/cache.py:30`
-14. `wikifi/specialized/openapi.py:83-92`
-15. `wikifi/specialized/protobuf.py:70-87`
-16. `wikifi/specialized/graphql.py:88-91`
-17. `wikifi/specialized/sql.py:86-96`
-18. `wikifi/specialized/__init__.py:46-57`
+Three diagrams follow: a domain map, an entity–relationship view, and an integration flow. All representations are technology-agnostic and derived solely from the documented system model.
+
+## Domain Map
+
+Subdomains, their responsibilities, and the directed dependency chain that governs pipeline ordering. No subdomain reaches backwards; the arrows below are the authoritative expression of inter-subdomain dependency.
+
+```mermaid
+graph TD
+    subgraph CORE["Core Domain — Automated Documentation Synthesis"]
+        RI[Repository Introspection]
+        KE[Knowledge Extraction]
+        SS[Section Synthesis]
+        APW[Artefact Persistence — working state]
+        APC[Artefact Persistence — committed wiki]
+    end
+
+    RI -->|include and exclude scope| KE
+    KE -->|extraction notes| APW
+    KE -->|evidential record| SS
+    SS -->|rendered section markdown| APC
+```
+
+## Entity Relationship View
+
+Core entities across all concern areas. Cardinality follows the documented information model.
+
+```mermaid
+erDiagram
+    SECTION {
+        string id PK
+        string title
+        string brief
+        string tier
+    }
+    SECTION ||--o{ SECTION : "upstream-of"
+    WIKI_LAYOUT ||--o{ SECTION : "resolves paths for"
+    SECTION_REPORT }o--|| SECTION : "describes"
+    WIKI_REPORT ||--|{ SECTION_REPORT : "aggregates"
+    LOADED_SECTION ||--|| SECTION : "pairs body with"
+
+    SECTION ||--o{ SECTION_FINDING : "collects"
+    FILE_FINDINGS ||--|{ SECTION_FINDING : "groups"
+
+    SOURCE_REF {
+        string file_path
+        string line_range
+        string fingerprint
+    }
+    CLAIM }o--|{ SOURCE_REF : "backed by"
+    CONTRADICTION }|--|{ CLAIM : "groups conflicting"
+    EVIDENCE_BUNDLE ||--|{ CLAIM : "contains"
+    EVIDENCE_BUNDLE ||--o{ CONTRADICTION : "contains"
+
+    WALK_REPORT ||--|| INTROSPECTION_RESULT : "carries"
+    WALK_REPORT ||--|| EXTRACTION_STATS : "carries"
+    WALK_REPORT ||--|| AGGREGATION_STATS : "carries"
+    WALK_REPORT ||--|| DERIVATION_STATS : "carries"
+    WALK_REPORT ||--|| WALK_CACHE : "carries"
+    WALK_REPORT ||--|| REPO_GRAPH : "carries"
+    WALK_CACHE ||--o{ CACHED_FINDINGS : "holds"
+    WALK_CACHE ||--o{ CACHED_SECTION : "holds"
+    REPO_GRAPH ||--|{ GRAPH_NODE : "indexes"
+
+    DERIVATION_STATS ||--o{ REVIEW_OUTCOME : "audit trail"
+    REVIEW_OUTCOME ||--|| CRITIQUE : "initial"
+    REVIEW_OUTCOME ||--o| CRITIQUE : "follow-up"
+
+    CHAT_SESSION ||--|| LLM_PROVIDER : "uses"
+    CHAT_SESSION ||--|{ CHAT_MESSAGE : "history"
+```
+
+## Integration Flow
+
+End-to-end pipeline sequence from CLI invocation through all four stages, showing each stage's interactions with the LLM provider abstraction, the cache layer, the import graph, and the filesystem layout.
+
+```mermaid
+sequenceDiagram
+    autonumber
+    participant CLI
+    participant Orchestrator
+    participant Config
+    participant LLMProvider
+    participant ImportGraph
+    participant SpecDispatch
+    participant Cache
+    participant FilesystemLayout
+
+    CLI->>Config: load settings and feature flags
+    CLI->>Orchestrator: walk command
+    Orchestrator->>LLMProvider: Stage 1 — scope classification
+    LLMProvider-->>Orchestrator: IntrospectionResult
+    Orchestrator->>FilesystemLayout: initialise layout
+
+    loop per in-scope file
+        Orchestrator->>ImportGraph: fetch file neighbours
+        ImportGraph-->>Orchestrator: neighbour paths
+        Orchestrator->>Cache: lookup by content fingerprint
+        alt cache hit
+            Cache-->>Orchestrator: FileFindings cached
+        else cache miss
+            Orchestrator->>SpecDispatch: route by FileKind
+            alt recognised kind
+                SpecDispatch-->>Orchestrator: SpecializedFindings
+            else general path
+                SpecDispatch->>LLMProvider: Stage 2 — extraction
+                LLMProvider-->>SpecDispatch: SectionFindings
+                SpecDispatch-->>Orchestrator: FileFindings
+            end
+            Orchestrator->>Cache: store findings
+        end
+        Orchestrator->>FilesystemLayout: append notes per section
+    end
+
+    loop per primary section
+        Orchestrator->>Cache: lookup by notes-payload hash
+        alt cache hit
+            Cache-->>Orchestrator: rendered section body
+        else cache miss
+            Orchestrator->>LLMProvider: Stage 3 — aggregation
+            LLMProvider-->>Orchestrator: EvidenceBundle
+            Orchestrator->>FilesystemLayout: write section markdown
+            Orchestrator->>Cache: store aggregated section
+        end
+    end
+
+    loop per derivative section in topological order
+        Orchestrator->>FilesystemLayout: read upstream section bodies
+        Orchestrator->>LLMProvider: Stage 4 — derivation
+        LLMProvider-->>Orchestrator: section body
+        Orchestrator->>FilesystemLayout: write section markdown
+        opt quality review enabled
+            Orchestrator->>LLMProvider: critique
+            LLMProvider-->>Orchestrator: Critique with score
+            alt score below revision threshold
+                Orchestrator->>LLMProvider: revise
+                LLMProvider-->>Orchestrator: revised body
+                Orchestrator->>FilesystemLayout: overwrite section markdown
+            end
+        end
+    end
+
+    Orchestrator-->>CLI: WalkReport
+    Note over CLI,FilesystemLayout: chat and report subcommands read finished wiki via FilesystemLayout
+```
diff --git a/.wikifi/domains.md b/.wikifi/domains.md
index 38a24fd..0330a14 100644
--- a/.wikifi/domains.md
+++ b/.wikifi/domains.md
@@ -2,48 +2,55 @@
 
 ## Core Domain
 
-The system's core domain is **codebase knowledge extraction**: reasoning about an arbitrary repository's structure, intent, and behaviour, then representing that understanding as a technology-agnostic, human-readable wiki. The domain is explicitly decoupled from any recognition of specific languages, frameworks, or runtimes — tech-agnosticism is a first-class constraint enforced at the analysis level, not merely a presentation concern.
+The system's core domain is **automated documentation synthesis**: ingesting an arbitrary source repository and producing a structured, intent-bearing wiki that describes the codebase in technology-agnostic terms. The central concern is not the mechanics of reading files, but the act of surfacing *business intent* — distinguishing what a system does from the accidental details of how it is implemented.
 
-## Primary Subdomains
+## Subdomains
 
 ### Repository Introspection
-This subdomain covers the initial act of understanding a repository: discovering which paths exist, classifying files by kind, resolving import relationships, and deciding which parts of the codebase encode genuine business intent versus infrastructure or tooling noise. The output is a curated inclusion set that drives all downstream work.
+Before any analysis begins, the system must decide which parts of a repository carry production intent and which represent infrastructure, tooling, or generated artefacts. This subdomain owns that classification decision. A defining constraint is **tech-agnosticism**: the introspection logic must not rely on recognising specific languages, frameworks, or conventions, so that it generalises across any codebase.
 
-### Per-File Knowledge Extraction
-Operating over the inclusion set produced by introspection, this subdomain extracts intent-bearing findings from individual source files, organised by wiki section. It encompasses caching and memoisation of extraction results, cross-file context derived from the import graph, and chunk-level deduplication to prevent redundant evidence.
+### Knowledge Extraction
+Once relevant files are identified, this subdomain is responsible for extracting structured, intent-bearing findings from each one. It encompasses file classification, content chunking, querying an inference backend for structured observations, and persisting those observations with precise citations for downstream use. The output of this subdomain is the raw evidential record.
 
-### Documentation Synthesis
-This subdomain aggregates per-file findings into coherent wiki sections and then derives higher-level artifacts (narrative summaries, personas, diagrams) from those aggregates. A critical design constraint enforced structurally is the **dependency ordering** between primary evidence extraction and derivative synthesis: derivative sections may only consume content that primary sections have already produced.
+### Section Synthesis
+The documentation produced by the system is split along a clear dependency boundary:
 
-## Secondary Subdomains
+| Subdomain tier | Description | Pipeline position |
+|---|---|---|
+| **Primary sections** | Built from per-file evidence produced by the extraction subdomain | Stages 2–3 |
+| **Derivative sections** | Synthesised by aggregating across all primary-section findings | Stage 4 |
 
-| Subdomain | Responsibility |
-|---|---|
-| **Provider Abstraction** | Decouples extraction and synthesis intelligence from any specific inference backend, allowing local and hosted providers to be swapped without altering the pipeline. |
-| **Wiki Authoring & Organisation** | Governs how extracted knowledge is structured, stored on the filesystem, and made navigable for consumers such as migration teams. |
-| **Interactive Knowledge Retrieval** | Supports on-demand querying of the generated wiki, enabling a conversational interface over the accumulated knowledge base. |
+This ordering is a first-class design constraint: derivative sections cannot be produced until all primary evidence is available. The boundary between the two tiers is enforced structurally, not merely by convention.
 
-## Domain Relationships
+### Artefact Persistence
+Two distinct storage concerns are separated within the system. *Committed wiki content* — the section markdown files that are versioned alongside the target project — is kept apart from *local working state*, which includes per-file extraction notes and a content-addressed cache. The persistence subdomain owns this boundary and ensures that working state is never accidentally treated as part of the published record.
 
-Repository Introspection feeds Per-File Extraction, which in turn feeds Documentation Synthesis — forming a directed, stage-ordered pipeline. Provider Abstraction is a horizontal supporting concern that all three primary subdomains depend on. Wiki Authoring & Organisation governs the output representation consumed by Interactive Knowledge Retrieval. Quality assurance of generated content is an ancillary concern cross-cutting the extraction and synthesis stages.
+## Subdomain Relationships
+
+The subdomains form a directed dependency chain:
+
+```
+Repository Introspection
+        ↓
+  Knowledge Extraction → Artefact Persistence (working state)
+        ↓
+  Section Synthesis
+        ↓
+  Artefact Persistence (committed wiki content)
+```
+
+No subdomain reaches backwards in this chain; the pipeline ordering is the authoritative expression of inter-subdomain dependency.
 
 ## Supporting claims
-- The core domain is codebase knowledge extraction: reasoning about an arbitrary repository's structure, intent, and behaviour and representing that understanding as a technology-agnostic wiki. [1][2][3][4]
-- Tech-agnosticism is a first-class constraint at the analysis level, not merely a presentation concern. [5]
-- The repository introspection subdomain covers discovering and classifying files, resolving import relationships, and deciding which parts of a codebase encode business intent versus infrastructure or tooling. [1][6][5]
-- The per-file knowledge extraction subdomain extracts intent-bearing findings per wiki section, and encompasses caching/memoisation, import-graph-based cross-file context, and chunk-level deduplication. [1][3]
-- The documentation synthesis subdomain aggregates per-file findings into wiki sections and derives higher-level artifacts such as narrative summaries, personas, and diagrams. [1][4][7]
-- The dependency ordering between primary evidence extraction and derivative synthesis is a first-class design constraint enforced structurally. [7]
-- Provider abstraction is a secondary domain that decouples extraction intelligence from any specific inference backend. [8]
-- Wiki authoring and organisation is a secondary domain governing how extracted knowledge is structured and stored for consumption by a migration team. [2]
-- Interactive knowledge retrieval against the generated wiki is a supporting subdomain. [6]
+- The core domain is automated documentation synthesis: extracting business intent from a source repository and producing a technology-agnostic wiki. [1][2]
+- The repository introspection subdomain decides which parts of a codebase encode business intent versus infrastructure or tooling. [2]
+- A defining constraint of repository introspection is tech-agnosticism — the analysis must not depend on recognising any specific language or framework. [2]
+- The knowledge extraction subdomain covers file classification, content chunking, inference-backend querying, and persisting findings with citations. [1]
+- Documentation sections are divided into primary sections (built from per-file evidence) and derivative sections (synthesised from aggregates of primary sections), with the ordering enforced as a structural constraint. [3]
+- Two distinct storage concerns exist: committed wiki content and local working state (extraction notes and cache); the persistence subdomain enforces this boundary. [4]
 
 ## Sources
-1. `README.md:32-55`
-2. `VISION.md:3-20`
-3. `wikifi/extractor.py`
-4. `wikifi/orchestrator.py:1-16`
-5. `wikifi/introspection.py:19-44`
-6. `wikifi/cli.py:1-8`
-7. `wikifi/sections.py:1-19`
-8. `README.md:57-63`
+1. `wikifi/extractor.py`
+2. `wikifi/introspection.py:19-44`
+3. `wikifi/sections.py:1-19`
+4. `wikifi/wiki.py:1-50`
diff --git a/.wikifi/entities.md b/.wikifi/entities.md
index e175e7c..016420b 100644
--- a/.wikifi/entities.md
+++ b/.wikifi/entities.md
@@ -1,159 +1,217 @@
 # Core Entities
 
-The system's domain model spans five functional layers — wiki structure, file classification, extraction, evidence, and review — plus supporting entities for caching, derivation, and chat.
+The system's information model spans six concern areas: wiki structure, evidence tracing, extraction and aggregation, repository analysis, caching, and pipeline orchestration. The entities below are described domain-first; implementation details such as storage format are noted only where they affect the entity's invariants.
 
 ---
 
 ## Wiki Structure
 
-**Section** is the central organizing entity. Each section carries a unique identifier, a human-readable title, a prose description of what belongs in it, and a tier (primary or derivative). Derivative sections additionally declare an ordered list of upstream section identifiers they depend on, forming an explicit dependency graph. An invariant holds at startup: every upstream identifier in a derivative section's dependency list must refer to a section that appears earlier in the canonical ordering (topological sort enforced).
+A **Section** is the fundamental organisational unit of the generated wiki. It carries:
 
-**WikiLayout** is an immutable value object that encodes the on-disk structure of a wiki workspace. Given a project root, it derives all canonical sub-paths: the wiki directory, configuration file, gitignore file, notes directory, per-section markdown files, and per-section note files. No fields are mutable after construction.
+| Field | Description |
+|---|---|
+| Unique identifier | Stable key used throughout the pipeline |
+| Title | Human-readable heading |
+| Brief | Prose description of what belongs in the section |
+| Tier | Either *primary* (populated from per-file evidence) or *derivative* (synthesised from primary sections) |
+| Upstream list | Ordered tuple of section identifiers this section depends on (derivative sections only) |
 
-**WalkConfig** is an immutable configuration record consumed by the filesystem walker. It captures the repository root, extra exclusion patterns, a flag for honouring ignore rules, a maximum file size in bytes, and a minimum stripped-content size in bytes.
+Derivative sections declare explicit upstream dependencies, forming a directed acyclic graph. The system enforces topological ordering at startup: every section's upstreams must appear earlier in the canonical section list.
 
----
-
-## File Classification and Graph
+A **WikiLayout** anchors all on-disk path resolution to a single project root, exposing named locations for the wiki directory, configuration file, gitignore, notes directory, cache directory, and per-section markdown and notes files. Its existence is a precondition for the conversational query and report commands.
 
-**FileKind** is a closed enumeration of seven mutually exclusive file roles: application code, SQL, OpenAPI, Protobuf, GraphQL, migration, and other. This classification determines whether a file is routed to a specialised deterministic parser or the general-purpose extraction path.
+A **LoadedSection** pairs a Section descriptor with its rendered markdown body, representing one populated section ready for downstream use (such as building a conversational context).
 
-**GraphNode** represents a single file's position in the repository's import graph. It carries the file's repo-relative path, the ordered set of files it imports, and the ordered set of files that import it. It exposes a capped combined-neighbour list for inclusion in extraction prompts.
+A **SectionReport** captures the per-section view for reporting: a reference to the Section definition, the count of contributing files, the count of findings, the character length of the written body, an emptiness flag, and an optional quality critique. A **WikiReport** aggregates all SectionReports together with overall coverage statistics and an optional mean quality score across populated sections.
 
-**RepoGraph** holds the complete import-edge map for a repository scan. It supports node lookup by path and retrieval of a capped neighbour list for any given file, providing cross-file context during extraction.
 
+---
 
-**DirSummary** is a value object holding aggregate statistics for a single (non-recursive) directory: its repo-relative path, file count, total byte size, a frequency map of the top-10 file extensions, and a tuple of notable filenames (manifests, readmes) present in that directory.
+## Evidence and Citation Model
 
----
+Every factual sentence in the generated wiki is traceable back through a three-layer evidence hierarchy.
 
-## Extraction Layer
+**SourceRef** — the lowest-level pointer. Carries a repo-relative file path, an optional inclusive line range, and a short content fingerprint captured at extraction time. Renders as `path:start–end` or just `path` when no line range is available.
 
-**SectionFinding** represents one file's contribution to one wiki section. It carries the target section identifier, a technology-agnostic prose description of the contribution, and an optional inclusive line range within the source chunk.
+**Claim** — a single markdown assertion placed in a section's narrative. Backed by zero or more SourceRefs. A claim with no sources is explicitly considered *unsupported*.
 
-**FileFindings** groups a one-sentence summary of a file with all `SectionFinding` records produced for it.
+**Contradiction** — groups two or more conflicting Claims under a one-sentence summary of the conflict; each conflicting position retains its own SourceRefs.
 
-**SpecializedFinding** is the output unit of the deterministic parsing paths. It carries a section identifier, a human-readable description, and a list of source references. **SpecializedResult** groups zero or more such findings with an optional summary string; this is the uniform output contract for all specialised extractors, ensuring interoperability with the general extraction path downstream.
+**EvidenceBundle** — the aggregator's structured handoff to the renderer for one section: the markdown narrative body, the ordered list of Claims, and the list of Contradictions.
 
-**ExtractionStats** is a walk-level counter record, accumulating: total files seen, files yielding at least one finding, total findings, skipped files, chunks processed, cache hits, specialised-extractor invocations, and a per-kind file breakdown.
+During the language-model aggregation pass, an intermediate form is used: an **AggregatedClaim** pairs a prose assertion with 1-based indices into the input notes (rather than resolved file paths), and an **AggregatedContradiction** wraps a one-sentence summary around multiple such indexed positions. These are resolved into full SourceRefs and Claims before the EvidenceBundle is assembled.
 
 ---
 
-## Evidence Layer
+## Extraction Layer
+
+**IntrospectionResult** captures the Stage 1 decision: include patterns, exclude patterns, a one-paragraph hypothesis about the system's purpose, an informational list of primary technologies detected, and the rationale for the filtering choices.
+
+**SectionFinding** is the atomic extraction unit from one source file for one section. Fields:
+- Target section identifier
+- Technology-agnostic markdown description (one to five sentences)
+- Optional inclusive line range within the source chunk
 
-**SourceRef** represents a single span of source: a repo-relative file path, an optional inclusive line range, and a short content fingerprint captured at extraction time for change detection.
+**FileFindings** groups all SectionFindings produced for a single file, together with a one-sentence summary of that file's role. It is the unit exchanged between an extraction call and the notes store.
 
-**Claim** represents one assertion placed in a wiki section. It carries the markdown text and a list of `SourceRef` values that justify it. A claim with no sources is explicitly marked unsupported — this is a first-class state, not an error.
+Specialised extractors — handling schema definition languages, API contracts, and data-definition files — produce **SpecializedFindings** rather than relying on general LLM inference. Each carries a section identifier, finding text, and one or more source references. Multiple SpecializedFindings are collected into a **SpecializedResult**, which additionally carries an optional summary string.
 
-**Contradiction** groups two or more conflicting `Claim` objects about the same topic under a single summary sentence. Each disagreeing position retains its own source references, preserving full traceability.
+For data-definition schema files, an intermediate **table record** is derived first (table name, source line, raw body, column list, and foreign-key edges expressed as local-column → referenced-table.referenced-column tuples). All downstream entity and relationship findings are derived from this intermediate form.
 
-**EvidenceBundle** is the aggregator's structured output for a single wiki section. It combines the narrative body text, a list of `Claim` records, and a list of `Contradiction` records. The renderer uses the bundle to thread numbered citations and a conflicts block into the final markdown.
+Domain object types from API schema files (those that are not root operation types) are surfaced directly as domain entity findings, grouped by their namespace, with closed value sets (enumerations) and shared shape contracts (interfaces, input types) captured as separate finding categories.
 
-During aggregation, the pipeline works with intermediate forms: **AggregatedClaim** pairs a single prose assertion with the 1-based indices of the input notes that support it, and **AggregatedContradiction** holds a one-sentence summary alongside multiple conflicting positional claims, each with its own note indices. These are the structured forms that the language model produces before being resolved into the full evidence model.
+**ExtractionStats** accumulates per-run metrics: files seen, files with findings, total findings, skipped files, chunks processed, cache hits, files routed to specialised extractors, and a breakdown by file kind.
 ---
-## Cache Entities
+## Repository Analysis Entities
-**CachedFindings** stores the extraction result for a single file: the file's content fingerprint, the list of structured findings produced, a one-sentence summary, and a count of processed chunks. Its invariant is content-addressed — the fingerprint is the cache key.
+**FileKind** is a fixed enumeration of seven structural categories: application code, SQL, OpenAPI contract, protocol definition, GraphQL schema, migration script, and other. The classification drives routing to the appropriate extractor.
-**CachedSection** stores the aggregation result for a single wiki section: the hash of the notes payload that produced it, the rendered markdown body, and lists of claims and contradictions. It too is content-addressed on the notes hash.
+**GraphNode** represents one file's position in the cross-file import graph: its repo-relative path, the ordered set of files it imports, and the ordered set of files that import it. It exposes a combined neighbour list capped at a configurable limit for use in prompt enrichment.
-**WalkCache** is the in-memory container for both caches. It holds extraction and aggregation entries alongside hit and miss counters, enabling observability into cache effectiveness across a run.
+**RepoGraph** is the complete repository-level import graph, keyed by repo-relative file path, providing lookup of individual GraphNodes and neighbour path lists.
+
+**DirSummary** is a value object for a single non-recursive directory: its repo-relative path, file count, total byte size, a frequency map of the top-10 file extensions, and a tuple of notable filenames (manifests, readmes).
 ---
-## Quality and Review Layer
+## Caching Entities
+
+| Entity | Cache key | Stored payload |
+|---|---|---|
+| **CachedFindings** | Content fingerprint of the source file | Findings list, one-sentence file summary, chunk count |
+| **CachedSection** | Hash of the notes payload | Rendered markdown body, claims list, contradictions list |
-**Critique** captures the quality assessment of a single section: an integer score (0–10), a short overall judgment, a list of unsupported claims, a list of gaps relative to the section brief, and a list of concrete revision suggestions.
+**WalkCache** is the in-memory aggregate of both caches. It tracks four counters — extraction hits, extraction misses, aggregation hits, aggregation misses — supporting efficiency reporting across a full pipeline run.
-**ReviewOutcome** tracks a section's review lifecycle: the section identifier, the initial critique, the current body text, a flag indicating whether a revision was applied, and the optional follow-up critique produced after revision.
+---
-**WikiQualityReport** aggregates the full-wiki audit: an overall numeric score, a mapping from section identifiers to their individual `Critique` records, and optional coverage statistics.
+## Quality-Review Entities
-**CoverageStats** records total files seen, files with findings, and per-section breakdowns of finding counts and contributing file counts; it exposes a coverage-percentage computation.
+A **Critique** captures the quality assessment of one section:
+- Integer score (0–10)
+- Short overall judgment
+- List of unsupported claims
+- List of gaps relative to the section brief
+- List of concrete revision suggestions
-**SectionReport** captures the per-section view for reporting: the section descriptor, count of contributing files, total findings count, body size in characters, an emptiness flag, and an optional quality critique.
+A **ReviewOutcome** tracks the lifecycle of a single section review: the section identifier, the initial Critique, the current body text, a boolean flag indicating whether a revision was applied, and an optional follow-up Critique produced after revision.
-**WikiReport** aggregates all `SectionReport` records alongside overall coverage statistics and an optional mean quality score across populated sections.
+A **WikiQualityReport** aggregates the full-wiki audit: an overall numeric score, a mapping from section identifiers to individual Critiques, and optional **CoverageStats** (total files, files with findings, and per-section finding and file counts).
 ---
-## Derivation and Pipeline Outputs
+## Pipeline Orchestration Entities
-**IntrospectionResult** captures the Stage 1 decision about which files are worth deeper analysis: a list of gitignore-style include patterns, a list of exclude patterns, a list of primary languages (informational), a one-paragraph guess at the system's purpose, and a rationale for the choices made.
+**WalkConfig** encapsulates the parameters for file traversal. Notes from two different pipeline layers describe it somewhat differently (see Conflicts below), but the agreed-upon core fields are: repository root, file-size limits, minimum content thresholds, and extra exclusion patterns. It is treated as immutable once constructed.
-**AggregationStats** records, for a single aggregation run, how many sections were written fresh, skipped due to empty notes, or served from cache.
+**Notes records** are the ephemeral per-section extraction state persisted during a walk. Each record carries a UTC timestamp and arbitrary key-value metadata. Records for a section are accumulated in insertion order.
-**DerivationStats** accumulates pipeline metrics for the derivation stage: counts of sections derived, skipped, and revised, plus the full list of `ReviewOutcome` records. It acts as an audit trail for the synthesis stage.
+**WalkReport** is the primary return value from a complete pipeline run. It carries the IntrospectionResult, ExtractionStats, AggregationStats, DerivationStats, the WalkCache state, and the RepoGraph.
-**WalkReport** is the single return value of a completed wiki-generation run, aggregating the introspection result, extraction statistics, aggregation statistics, derivation statistics, the live cache state, and the repository import graph.
+**AggregationStats** tracks three counters for a single aggregation pass: sections written fresh, sections skipped due to empty notes, and sections served from cache.
----
+**DerivationStats** accumulates pipeline metrics for the derivation stage: count of sections derived, skipped, and revised, plus the full list of ReviewOutcomes as an audit trail.
-## Chat Layer
+---
-**ChatMessage** carries a role and a content field, representing a single turn in a multi-turn conversation. Lists of these are accumulated to maintain conversation history.
+## Interaction Entities
-**LoadedSection** pairs a `Section` descriptor with its rendered markdown body, representing a single populated section ready for inclusion in a chat context.
+A **ChatMessage** carries a role identifier and a content string, representing one turn in a multi-turn exchange.
-**ChatSession** holds a provider reference, the frozen system prompt built from wiki sections, and the accumulated conversation history as an ordered list of `ChatMessage` records. It supports appending user and assistant turns and clearing history while retaining the wiki context.
+A **ChatSession** holds a reference to the language-model provider, the frozen system prompt built from populated wiki sections, and the accumulated conversation history (an ordered list of ChatMessages). It supports appending user and assistant turns and clearing the history while retaining the wiki context.
--- -## Relationships and Invariants Summary - -| Entity | Key relationships | Notable invariants | -|---|---|---| -| Section | depends on upstream Sections (derivative tier only) | Dependency graph must be topologically ordered | -| WikiLayout | derived from a project root | Immutable; all paths are computed, not stored independently | -| SourceRef | referenced by Claim, SpecializedFinding | Fingerprint enables staleness detection | -| Claim | groups SourceRefs; composed into EvidenceBundle | Sourceless claims are explicitly flagged unsupported | -| Contradiction | groups ≥2 conflicting Claims | Each position retains its own SourceRefs | -| CachedFindings | keyed on file content fingerprint | Cache miss if fingerprint changes | -| CachedSection | keyed on notes-payload hash | Cache miss if any upstream note changes | -| ReviewOutcome | holds pre- and post-revision Critique | Revision flag distinguishes touched from untouched sections | -| WalkReport | aggregates all four stage outputs | Single return value for a complete run | +## Configuration and Provider Entities + +**Settings** captures all runtime knobs for a wiki-generation run: provider and model identity, inference endpoint, request timeout, file-size and chunk thresholds, pipeline feature flags (caching, graph building, specialised extractors, review loop), the quality threshold that triggers revision, and provider-specific credentials and token caps. + +An **LLMProvider** carries a provider name and a specific model variant. It is the sole point of contact between the pipeline and any language-model backend, exposing exactly three interaction modes used throughout the system. + +## Supporting claims +- A Section carries a unique identifier, human-readable title, prose brief, a tier (primary or derivative), and an ordered tuple of upstream section identifiers. [1] +- Derivative sections declare explicit upstream dependencies forming a directed acyclic graph enforced by topological ordering at startup. 
[2][1] +- WikiLayout anchors all on-disk path resolution to a single project root, exposing named locations for wiki, config, gitignore, notes, cache, and per-section files; its existence is a precondition for the chat and report commands. [3][4] +- A LoadedSection pairs a Section descriptor with its rendered markdown body. [5] +- A SectionReport carries the section definition reference, contributing file count, findings count, body character length, emptiness flag, and an optional quality critique. [6] +- A WikiReport aggregates all SectionReports, overall coverage statistics, and an optional mean quality score. [7] +- A SourceRef holds a repo-relative file path, an optional inclusive line range, and a short content fingerprint captured at extraction time. [8][9] +- A Claim is a single markdown assertion backed by zero or more SourceRefs; a claim with no sources is explicitly considered unsupported. [8][10] +- A Contradiction groups two or more conflicting Claims under a one-sentence summary, each position retaining its own SourceRefs. [10] +- An EvidenceBundle is the aggregator's structured handoff to the renderer: markdown body, ordered Claims list, and Contradictions list. [8][11] +- During the aggregation pass an AggregatedClaim pairs a prose assertion with 1-based input-note indices, and an AggregatedContradiction wraps a one-sentence summary around multiple such indexed positions; these are resolved into SourceRefs before the EvidenceBundle is assembled. [12] +- IntrospectionResult captures include/exclude patterns, a purpose hypothesis, an informational language list, and the filtering rationale. [13] +- A SectionFinding carries a target section identifier, a technology-agnostic markdown description of one to five sentences, and an optional line range. [14] +- A FileFindings groups all SectionFindings for one file plus a one-sentence file-role summary, and is the unit exchanged between the extraction call and the notes store. 
[15] +- A SpecializedFinding carries a section identifier, finding text, and one or more source references; multiple SpecializedFindings are collected into a SpecializedResult that also carries an optional summary string. [16][17] +- For data-definition schema files an intermediate table record is derived first (name, source line, raw body, column list, and foreign-key edges) and all downstream findings are derived from it. [18] +- Domain object types from schema files are surfaced as domain entity findings; closed value sets and shared shape contracts are captured as separate finding categories; a maximum of 25 items per category are rendered with elision noted. [19][20][21] +- ExtractionStats accumulates: files seen, files with findings, total findings, skipped files, chunks processed, cache hits, specialised-extractor files, and a file-kind breakdown. [22] +- FileKind is a fixed enumeration of seven structural categories: application code, SQL, OpenAPI, Protobuf, GraphQL, migration, and other; it drives routing to the appropriate extractor. [23] +- A GraphNode carries its repo-relative path, the ordered set of files it imports, and the ordered set of files that import it, with a configurable cap on the combined neighbour list. [24] +- A RepoGraph is the complete repository import graph keyed by repo-relative path. [25] +- A DirSummary holds a directory's path, file count, total byte size, top-10 extension frequency map, and a tuple of notable filenames. [26] +- CachedFindings stores a content fingerprint, findings list, one-sentence summary, and chunk count; CachedSection stores a notes-payload hash, rendered markdown body, claims list, and contradictions list. [27][28] +- WalkCache aggregates both caches and tracks four counters (extraction hits/misses, aggregation hits/misses) for efficiency reporting. [29] +- A Critique carries an integer score (0–10), a short judgment, a list of unsupported claims, a list of brief gaps, and a list of revision suggestions. 
[30] +- A ReviewOutcome tracks the section identifier, initial critique, current body, revision-applied flag, and optional follow-up critique. [31] +- A WikiQualityReport carries an overall numeric score, a mapping from section identifiers to individual Critiques, and optional CoverageStats (total files, files with findings, per-section finding and file counts). [32] +- WalkReport is the primary return value from a full pipeline run, carrying IntrospectionResult, ExtractionStats, AggregationStats, DerivationStats, WalkCache state, and RepoGraph. [33] +- AggregationStats tracks sections written fresh, skipped due to empty notes, and served from cache. [34] +- DerivationStats accumulates sections derived, skipped, and revised counts, plus the full list of ReviewOutcomes. [35] +- Notes records carry a UTC timestamp and arbitrary key-value metadata, stored per section in insertion order. [36] +- A ChatMessage carries a role identifier and a content string representing one turn in a multi-turn exchange. [37] +- A ChatSession holds an LLM provider reference, a frozen system prompt built from wiki sections, and an ordered conversation history; it supports appending turns and clearing history while retaining context. [38] +- Settings captures provider and model identity, inference endpoint, timeout, file-size and chunk thresholds, pipeline feature flags, revision quality threshold, and provider-specific credentials and token caps. [39] +- An LLMProvider carries a provider name and model variant and is the sole point of contact between the pipeline and any language-model backend. [37] + +## Conflicts in source +_The walker found disagreements across files. 
Migration teams should resolve these before re-implementation._ + +- **Two sources describe a 'WalkConfig' entity with partially different field sets, suggesting either two distinct same-named entities at different pipeline layers or a single entity incompletely described in each source.** + - WalkConfig (orchestrator layer) encapsulates repository root, byte-size limits, minimum content thresholds, and an optional introspection-derived exclusion list; it is constructed twice per run — once before introspection and once after with the exclusion list populated. (`wikifi/orchestrator.py:83-101`) + - WalkConfig (filesystem-walker layer) encapsulates repository root, extra exclusion patterns beyond defaults, a flag for honouring gitignore rules, maximum file size in bytes, and minimum stripped-content size in bytes; it is immutable once constructed. (`wikifi/walker.py:61-79`) ## Sources 1. `wikifi/sections.py:30-40` 2. `wikifi/deriver.py:112-116` -3. `wikifi/cli.py:166-172` -4. `wikifi/wiki.py:34-61` -5. `wikifi/walker.py:61-79` -6. `README.md:31-33` -7. `wikifi/repograph.py:41-52` -8. `wikifi/repograph.py:148-167` -9. `wikifi/repograph.py:170-181` -10. `wikifi/walker.py:144-153` -11. `wikifi/extractor.py:106-123` -12. `wikifi/specialized/__init__.py:29-38` -13. `wikifi/extractor.py:126-135` -14. `wikifi/evidence.py:37-52` -15. `README.md:37-39` -16. `wikifi/evidence.py:55-67` -17. `wikifi/aggregator.py:166-186` -18. `wikifi/evidence.py:70-77` -19. `wikifi/aggregator.py:74-101` -20. `README.md:46-48` -21. `wikifi/evidence.py:80-85` -22. `wikifi/cache.py:44-51` -23. `wikifi/cache.py:54-60` -24. `wikifi/cache.py:63-70` -25. `wikifi/aggregator.py:103-107` -26. `wikifi/critic.py:67-84` -27. `wikifi/critic.py:91-96` -28. `wikifi/critic.py:99-114` -29. `wikifi/report.py:85-94` -30. `wikifi/report.py:28-42` -31. `wikifi/introspection.py:47-64` -32. `wikifi/deriver.py:57-62` -33. `wikifi/orchestrator.py:54-61` -34. `wikifi/cli.py:118-153` -35. 
`wikifi/providers/base.py:28-30` -36. `wikifi/chat.py:42-45` -37. `wikifi/chat.py:46-57` -38. `wikifi/specialized/sql.py:64-84` -39. `wikifi/specialized/sql.py:99-111` -40. `wikifi/specialized/graphql.py:32-81` -41. `wikifi/specialized/protobuf.py:44-68` -42. `wikifi/specialized/openapi.py:94-108` +3. `wikifi/cli.py:172-183` +4. `wikifi/wiki.py:55-80` +5. `wikifi/chat.py:42-45` +6. `wikifi/report.py:29-36` +7. `wikifi/report.py:39-44` +8. `wikifi/aggregator.py:166-186` +9. `wikifi/evidence.py:35-55` +10. `wikifi/evidence.py:57-80` +11. `wikifi/evidence.py:82-87` +12. `wikifi/aggregator.py:74-101` +13. `wikifi/introspection.py:47-64` +14. `wikifi/extractor.py:113-125` +15. `wikifi/extractor.py:128-131` +16. `wikifi/specialized/models.py:19-22` +17. `wikifi/specialized/models.py:25-27` +18. `wikifi/specialized/sql.py:50-58` +19. `wikifi/specialized/graphql.py:56-95` +20. `wikifi/specialized/openapi.py:105-116` +21. `wikifi/specialized/protobuf.py:42-60` +22. `wikifi/extractor.py:134-142` +23. `wikifi/repograph.py:43-56` +24. `wikifi/repograph.py:143-162` +25. `wikifi/repograph.py:165-177` +26. `wikifi/walker.py:144-153` +27. `wikifi/cache.py:60-66` +28. `wikifi/cache.py:69-74` +29. `wikifi/cache.py:77-88` +30. `wikifi/critic.py:67-84` +31. `wikifi/critic.py:91-96` +32. `wikifi/critic.py:99-114` +33. `wikifi/orchestrator.py:60-70` +34. `wikifi/aggregator.py:103-107` +35. `wikifi/deriver.py:57-62` +36. `wikifi/wiki.py:136-152` +37. `wikifi/providers/base.py:33-52` +38. `wikifi/chat.py:46-57` +39. `wikifi/config.py:46-155` +40. `wikifi/orchestrator.py:83-101` +41. 
`wikifi/walker.py:61-79` diff --git a/.wikifi/external_dependencies.md b/.wikifi/external_dependencies.md index 5d5e596..2655643 100644 --- a/.wikifi/external_dependencies.md +++ b/.wikifi/external_dependencies.md @@ -1,27 +1,61 @@ # External-System Dependencies -The system draws on several categories of external service: language-model inference backends, development-time tooling integrations, and a continuous-integration platform. +The system depends on external services in two areas: the **language-model inference layer** that drives all AI analysis, and a set of **tooling integrations** used for development support and runtime enrichment. -## Language-Model Inference +### Language-Model Inference -All substantive text generation and structured extraction is delegated to an external (or locally hosted) language-model service. Three backends are supported through a common provider abstraction: +Three mutually exclusive inference backends are supported; exactly one is active per deployment. 
-| Backend | Hosting | Authentication | Role |
+| Backend | Role | Authentication | Key Configuration |
 |---|---|---|---|
-| Local inference server (default) | Self-hosted, no network egress | None required | Default backend for all extraction and synthesis calls; configurable host address and 15-minute per-call timeout |
-| Hosted AI service A (Anthropic) | Cloud API | API key (`ANTHROPIC_API_KEY`) | Opt-in backend; uses an ephemeral prompt-cache marker on the system prompt so that large extraction prompts are billed at roughly 10 % of normal input-token cost across repeated per-file calls |
-| Hosted AI service B (OpenAI-compatible) | Cloud API (or compatible proxy/Azure endpoint) | API key + optional custom base URL | Opt-in backend; relies on automatic prefix caching (prefixes ≥ 1 024 tokens cached for ~5–10 minutes); exposes a reasoning-intensity knob mapped to the backend's reasoning-effort parameter on capable model variants |
-
-The local inference server is the default and requires no credentials or external network access. The two hosted backends are opt-in and each require a provisioned API key. All three backends are configured with a model name, timeout, and per-call output-token cap drawn from the application's runtime settings.
-
-### Caching Strategy
-Because the extraction prompt is large and is reused across every file in a repository, minimising repeated billing for identical prompt prefixes is a first-class concern. The hosted-AI-service-A integration achieves this by tagging the system-prompt block with an ephemeral cache-control marker. The hosted-AI-service-B integration relies on the provider's automatic prefix-caching mechanism without requiring explicit markers.
- -## Development-Time Tool Integrations - -The MCP server configuration reveals several additional integrations that appear to be used during development or agent-assisted workflows rather than in the core production pipeline: - -- **Google AI generative API** — consumed by at least two registered tool integrations; authenticated via a shared API key. -- **Self-hosted web-crawling service** — running locally on a fixed port with no API key, providing crawling capability on demand. -- **External documentation/context lookup service** — called over HTTP with a dedicated API key; likely used to retrieve up-to-date reference documentation for prompt enrichment. -- **Google-hosted orchestration service ( +| Self-hosted local inference service | Default LLM backend; serves models over HTTP on the local network | None required | Configurable host address and request timeout | +| Anthropic's hosted inference API | Opt-in cloud backend for high-capability extraction | API key (environment variable) | Configurable output-token cap and HTTP timeout to manage long-running calls; supports adaptive reasoning depth | +| OpenAI-compatible hosted inference API | Opt-in cloud backend for structured decoding, completion, and chat | API key | Configurable base URL, enabling compatible proxy or alternate deployment targets | + +The self-hosted local service is the default and the zero-friction starting point for new users — it requires no credentials and no cloud account. The two hosted cloud services are opt-in alternatives that require API keys and expose additional parameters for latency and cost control. + +The local backend supports reasoning-capable model variants that trade increased latency for greater analytical depth; this extended-reasoning mode is also available on the hosted cloud backends. + +The Anthropic-backed path operates in a single structured-extraction mode. 
The OpenAI-compatible path supports three distinct usage modes: schema-constrained structured decoding (returning validated domain objects), free-text completion, and multi-turn conversational chat. + +### Development and Runtime Tooling Integrations + +Several additional services are configured in the tooling layer: + +- **Self-hosted web-crawling service** — runs locally on a fixed port with no external credentials required. Provides on-demand web-crawling capability, used to gather source material. +- **Google's hosted AI/generative API** — authenticated via a dedicated API key; consumed by at least two registered tool integrations. +- **External documentation context service** — called over HTTP using a dedicated API key; enriches prompts or retrieves up-to-date reference documentation at runtime. +- **Google-hosted orchestration service** — an HTTP service authenticated via the same Google API key; its exact role is not fully specified in available sources but is likely related to data composition or workflow orchestration. + +### Soft Dependency: Structured-Data Parsing + +When processing structured API contract files, the system can optionally leverage an external YAML parsing library for full format support. If that library is absent, an internal minimal parser serves as a fallback, covering the specific fields the system requires. This is a soft rather than hard dependency — the system remains functional without it. + +## Supporting claims +- Three mutually exclusive LLM inference backends are supported: a self-hosted local service, Anthropic's hosted API, and an OpenAI-compatible hosted API. [1][2][3][4][5] +- The self-hosted local inference service is the default backend, requires no API key, and connects over a configurable HTTP endpoint. [1][2][5][6] +- The self-hosted local service supports reasoning-capable model variants that trade latency for greater analytical depth. 
[1][6] +- Anthropic's hosted inference API is an opt-in backend authenticated via an environment-variable API key, with a configurable output-token cap and HTTP timeout to manage long-running inference calls. [3][5][7] +- Anthropic's hosted backend supports an adaptive reasoning depth mode. [3][7] +- The OpenAI-compatible hosted API is an opt-in backend authenticated via API key, with a configurable base URL enabling use of compatible proxy deployments. [4][5][8] +- The OpenAI-compatible backend supports three usage modes: schema-constrained structured decoding, free-text completion, and multi-turn conversational chat. [8] +- A self-hosted web-crawling service runs locally on a fixed port and requires no external credentials. [9] +- Google's hosted AI/generative API is consumed by at least two tool integrations, authenticated via a dedicated API key. [10][11] +- An external documentation context service is called over HTTP with a dedicated API key, used to enrich prompts or retrieve reference documentation at runtime. [12] +- A Google-hosted orchestration service is consumed over HTTP, authenticated via the same Google API key; its exact role is not fully specified in available sources. [11] +- An external YAML parsing library is a soft dependency for structured API contract processing; the system falls back to an internal minimal parser when the library is absent. [13] + +## Sources +1. `.env.example:7-14` +2. `wikifi/config.py:53-55` +3. `wikifi/config.py:116-134` +4. `wikifi/config.py:136-151` +5. `wikifi/orchestrator.py:148-200` +6. `wikifi/providers/ollama_provider.py:52` +7. `wikifi/providers/anthropic_provider.py:83-100` +8. `wikifi/providers/openai_provider.py:113-175` +9. `.mcp.json:14-20` +10. `.mcp.json:4-8` +11. `.mcp.json:29-35` +12. `.mcp.json:22-28` +13. 
`wikifi/specialized/openapi.py:154-162` diff --git a/.wikifi/hard_specifications.md b/.wikifi/hard_specifications.md index 82fc253..a8fa680 100644 --- a/.wikifi/hard_specifications.md +++ b/.wikifi/hard_specifications.md @@ -1,69 +1,188 @@ # Hard Specifications -_Aggregation failed for **Hard Specifications** (anthropic provider: empty parsed_output and parse fallback failed: 1 validation error for SectionBody - Invalid JSON: EOF while parsing a value at line 1 column 0 [type=json_invalid, input_value='', input_type=str] - For further information visit https://errors.pydantic.dev/2.13/v/json_invalid). Raw notes preserved below._ - -> Brief: Critical requirements that must be carried forward verbatim: compliance rules, SLAs, contractual obligations, immutable formats. - - -## Raw findings - -- **.env.example** — Only the 'ollama' provider is supported in v1. The default request timeout is 900 seconds (15 minutes). Fully disabling thinking mode ('false') is documented as unsafe with Qwen3 models because those models ignore the JSON-schema output constraint and emit free text instead. -- **CLAUDE.md** — The system must run against a local LLM out of the box with no cloud dependency required; hosted backends (Anthropic, OpenAI, custom) are valid additional options but never the default. -- **CLAUDE.md** — Provider abstraction is mandatory: swapping the LLM backend must not require changes outside the provider boundary. -- **CLAUDE.md** — When the chosen model exposes a reasoning or thinking level, the system must run at the highest available setting; lower reasoning levels are opt-in only. -- **CLAUDE.md** — Test coverage target is ≥ 85%; every feature must ship with tests. -- **CLAUDE.md** — wikifi is strictly a feature-extraction tool: it describes what the legacy system does and must never transform source into any target architecture, language, or framework shape. 
-- **CLAUDE.md** — Derivative wiki sections (personas, user stories, diagrams) must be produced only after primary content sections are complete and must never be inferred from a single file.
-- **README.md** — The critic-reviser loop must only accept a revised section if its quality score is at least as high as the score of the original; downgrades are rejected.
-- **README.md** — Empty or near-empty input files must never stall the walk; the walker is required to filter them out before any LLM call is made.
-- **README.md** — Every per-file finding must carry a structured SourceRef (file, line range, content fingerprint) to support citation in the rendered wiki.
-- **TESTING-AND-DEMO.md** — The test suite must include exactly 156 passing tests with total line coverage at or above 93%. Every new module must individually reach at least 86% coverage, and each premium-pipeline module (fingerprint, cache, evidence, critic, report, repograph, specialized parsers, and the Anthropic provider) must carry a dedicated test file.
-- **TESTING-AND-DEMO.md** — The Anthropic provider must place cache_control of type 'ephemeral' on the system-prompt block, use the messages.parse structured-output contract, translate the 'think' intensity setting to an effort level, and map API errors to a RuntimeError. These behaviors are locked in by the provider's dedicated test file.
-- **TESTING-AND-DEMO.md** — The OpenAI provider must use the chat.completions.parse structured-output contract, route reasoning_effort only to o-series and gpt-5 models (not standard models), swap max_tokens for max_completion_tokens on reasoning models, and map API errors to RuntimeError. OpenAI's automatic prefix caching applies to prefixes of at least 1024 tokens and lasts approximately 5–10 minutes.
-- **TESTING-AND-DEMO.md** — The critic-reviser loop must only accept a revised derivative section if the revision scores at least as well as the original; a revision that scores lower must be discarded.
-- **VISION.md** — The generated wiki must at minimum contain: DDD domains and subdomains, system intent, domain-level capabilities, external-system dependencies, internal and external integrations, cross-cutting concerns, core entities and their structures, and hard specifications — regardless of the on-disk layout chosen by the implementor.
-- **VISION.md** — Derivative wiki sections (user personas, user stories, aggregate diagrams) must be produced in a step that runs *after* primary capture and must never be inferred from a single source file.
-- **VISION.md** — Wiki content is stored in the target project's `.wikifi/` directory; the contract is the content the wiki conveys, not its on-disk shape or file structure within that directory.
-- **VISION.md** — Success is defined as: a migration team working from the wiki alone — without reference to the original codebase — can deliver a microservice re-implementation that preserves the original system's personas, problem space, integrations, cross-cutting concerns, entities, data patterns, and user value.
-- **wikifi/aggregator.py** — Contradictions between source notes must never be silently resolved: any incompatible claims must produce a `contradictions[]` entry naming each position and the note indices that support it. This is stated as a hard rule in the LLM system prompt and enforced structurally via the `AggregatedContradiction` schema.
-- **wikifi/aggregator.py** — Wiki section bodies must be tech-agnostic: no mention of specific languages, frameworks, or libraries is permitted in synthesised output; every observation must be translated into domain terms.
-- **wikifi/aggregator.py** — Note indices presented to the LLM are 1-based, and the resolution logic subtracts 1 before indexing into the notes list — an off-by-one invariant that must be preserved if the prompting scheme changes.
-- **wikifi/cache.py** — The aggregation cache key is computed only over content-bearing fields (file reference, summary, finding text) and explicitly excludes timestamps and per-walk debug fields, ensuring that regenerating identical notes on a fresh walk always produces a cache hit. -- **wikifi/cache.py** — Cache files must reside at `.wikifi/.cache/extraction.json` and `.wikifi/.cache/aggregation.json` relative to the wiki directory root. -- **wikifi/cli.py** — The tool's entry point must be declared as `wikifi` in the project's script configuration and must delegate directly to the Typer application; this contract ties the installed command name to the main() function in this module. -- **wikifi/config.py** — Files exceeding 2,000,000 bytes are unconditionally dropped and never read; this threshold is explicitly documented as targeting vendored or generated noise rather than real source files. -- **wikifi/config.py** — Each language model call is limited to a 150,000-byte content window, sized to fit within a 32K-context model after prompt overhead; larger files must be split into overlapping chunks rather than truncated. -- **wikifi/config.py** — Adjacent file chunks share an 8,000-byte overlap region to preserve cross-boundary context; this overlap guarantee must be maintained when the chunking logic is modified. -- **wikifi/critic.py** — The scoring rubric is fixed: 9–10 indicates fully grounded, tech-agnostic, narratively coherent content with no unsupported claims; 6–8 allows minor issues; 3–5 signals substantial gaps or partial coverage; 0–2 marks incoherent or off-brief content. The default minimum acceptable score for shipping a section without revision is 7. -- **wikifi/critic.py** — A revised body is only accepted if its follow-up critique score is greater than or equal to the initial score; any revision that causes a score regression is discarded and the original body is retained. This invariant must be preserved in any reimplementation. 
-- **wikifi/critic.py** — All section bodies must be tech-agnostic: the reviser is explicitly instructed not to invent claims unsupported by upstream evidence and to declare gaps explicitly when evidence is missing rather than speculating. -- **wikifi/deriver.py** — Derivative sections must be grounded solely in upstream section content. The model is instructed to declare gaps explicitly rather than filling them with invented facts — this is a hard constraint on output integrity. -- **wikifi/deriver.py** — All wiki content, including derivative sections, must remain technology-agnostic: language names, framework names, and library names are forbidden and must be translated into domain terms. -- **wikifi/deriver.py** — Gherkin-style outputs must use proper Given/When/Then syntax inside fenced ```gherkin code blocks. Mermaid diagrams must be valid and inside fenced ```mermaid code blocks, preferring graph, classDiagram, erDiagram, and sequenceDiagram diagram types. -- **wikifi/evidence.py** — Source references must be rendered in the format 'path/to/file:start-end' (or 'path/to/file:line' for a single line, or just 'path/to/file' when lines are unknown). The 'Sources' footer uses 1-based sequential numeric indices in the form '1. `path`'. -- **wikifi/evidence.py** — Contradictions must never be silently merged into a unified narrative; they must be explicitly surfaced in a dedicated 'Conflicts in source' sub-section, with a warning that migration teams must resolve them before re-implementation. -- **wikifi/extractor.py** — Per-file extraction is restricted to primary wiki sections only. Derivative sections (personas, user stories, diagrams) are explicitly excluded from per-file extraction and are instead produced in a later aggregation stage; requesting them at the per-file level is documented as producing sparse, speculative findings. 
-- **wikifi/extractor.py** — The recursive text splitter must guarantee termination on any input, including minified single-line files with no whitespace, by falling back through separator priority (blank lines → single newlines → spaces → character boundaries). The character-boundary split is the terminal step that ensures every byte is eventually consumed. -- **wikifi/extractor.py** — Chunk overlap must satisfy `0 <= overlap < chunk_size`; violating this constraint raises an error. The effective base chunk size is `chunk_size - overlap` so that prepending an overlap tail never causes a chunk to exceed `chunk_size` bytes. -- **wikifi/fingerprint.py** — Fingerprints are defined as the first 12 hexadecimal characters of a SHA-256 digest (48 bits of entropy). This length is explicitly chosen to be sufficient to distinguish every file in any realistic repository (estimated 50% collision threshold at ~10 trillion files) while remaining short enough to embed inline in human-readable citations. This format must be preserved across any migration because it is recorded in cached artefacts and emitted into wiki evidence references. -- **wikifi/introspection.py** — Stage 1 must operate without reading any source files; it sees only directory-level summaries and manifest contents. This constraint is architectural and must be preserved: source reading is exclusively Stage 2's responsibility. -- **wikifi/introspection.py** — Include and exclude patterns produced by Stage 1 must be in gitignore-style format relative to the repository root. -- **wikifi/orchestrator.py** — When a user selects the Anthropic provider but the configured model name does not begin with 'claude-', the system silently substitutes the model identifier 'claude-opus-4-7' rather than forwarding an invalid name. Similarly, for OpenAI, non-OpenAI-pattern model names are replaced with 'gpt-4o'. 
This model-name substitution logic must be preserved so that users migrating from the default local provider do not receive opaque remote API errors. -- **wikifi/orchestrator.py** — The only accepted provider identifiers are 'ollama', 'anthropic', and 'openai'; any other value raises an error. This contract is enforced at provider construction time and must be maintained by any future provider registration mechanism. -- **wikifi/repograph.py** — The implementation must remain dependency-free beyond regex and path resolution — tree-sitter or similar binary dependencies are explicitly prohibited so that the tool can be installed without native compilation. -- **wikifi/report.py** — Quality scoring is only performed when explicitly requested (`score=True`) and a provider is supplied; without both conditions the report remains purely structural. This ensures the tool can run in provider-free environments such as CI pipelines without failure. -- **wikifi/sections.py** — Derivative sections must always reference only known section IDs, and every upstream a derivative depends on must appear earlier in the canonical SECTIONS ordering. This ordering invariant is validated at module load time and any violation raises an error, making it a hard structural requirement for the section taxonomy. -- **wikifi/walker.py** — The maximum file size threshold is 2,000,000 bytes (2 MB); files at or above this limit are unconditionally skipped and never sent for analysis. The minimum content threshold is 64 bytes of stripped text. Manifest files are truncated to 20,000 bytes maximum before being included in any prompt. -- **wikifi/wiki.py** — The directory layout is explicitly declared as a stable contract between the tool and any target project: upgrading the tool must not break existing wikis. This constraint is called out in the module docstring and governs all future changes to path conventions. 
-- **wikifi/wiki.py** — The `.wikifi/` directory layout follows a fixed, documented schema: `config.toml` for provider/model overrides, `.gitignore` for excluding notes, one `
.md` per defined section, and a `.notes/
.jsonl` per section for extraction state. This schema must remain stable across upgrades. -- **wikifi/specialized/protobuf.py** — The module explicitly designates proto file findings as direct inputs to interface design during migration, implying that message names, enum value sets, service names, RPC signatures, and streaming contracts must be preserved verbatim when porting to a new stack. -- **wikifi/specialized/sql.py** — Indexes are explicitly annotated as performance invariants that 'the new system must preserve,' establishing a carry-forward requirement for any target platform. -- **wikifi/specialized/sql.py** — UNIQUE and NOT NULL constraints are treated as storage-level invariants that must survive migration, not merely advisory metadata. -- **wikifi/providers/anthropic_provider.py** — Sampling parameters (temperature, top_p, top_k) must not be sent to the claude-opus-4-7 model variant — doing so causes a 400 error. The provider explicitly omits these parameters for this model generation, making their absence a hard constraint carried forward with the provider implementation. -- **wikifi/providers/anthropic_provider.py** — The maximum output token budget per call is set at 16,000 tokens. This is documented as comfortable headroom for any section schema response while staying within the SDK's non-streaming HTTP timeout guard, making it an operationally important default that should not be reduced without re-validating pipeline completions. -- **wikifi/providers/ollama_provider.py** — Qwen3-family models must not be invoked with think=False on the structured-output path: doing so causes the model to bypass the schema constraint and emit free text, which fails downstream validation. The thinking level must be 'low' or higher to preserve schema compliance. 
For the derivative-section synthesis pass, 'high' thinking is the preferred setting for output quality, but callers must budget 1–3 minutes per file and configure the timeout to at least 900 seconds to absorb that latency. -- **wikifi/providers/openai_provider.py** — Reasoning-capable model families (identified by name prefix) must receive output-token limits via a distinct parameter name from standard chat models; sending the wrong parameter to either family causes a request failure. The provider routes the correct parameter unconditionally based on model identity. -- **wikifi/providers/openai_provider.py** — The `think` (reasoning-effort) knob must only be forwarded to reasoning-capable models; forwarding it to a plain chat model risks a validation error from the hosted service. The mapping from wikifi's internal knob values (`low`, `medium`, `high`) to the API's accepted values is fixed and must be preserved. -- **wikifi/providers/openai_provider.py** — When the hosted service returns a response that cannot be parsed into the expected structured schema (e.g. due to refusal or truncation), the system falls back to direct JSON validation of the raw text rather than returning a null result, preserving the protocol contract that callers always receive a validated object or an explicit error. +## Output Integrity + +These rules govern what the system is permitted to emit and are enforced at multiple stages of the pipeline. + +- **Tech-agnostic language.** All synthesised wiki content — both primary sections and derivative sections such as personas, user stories, and diagrams — must be free of specific language, framework, or library names. Every such observation must be translated into domain terms. This constraint applies equally to the aggregator, the reviser, and the deriver. 
+- **No silent contradiction resolution.** Whenever two source notes make incompatible claims about the same topic, the output must include a `contradictions[]` entry naming each position and the note indices that support it. Suppressing or merging conflicting claims is forbidden. +- **No invented facts.** When evidence is absent, the system must declare the gap explicitly rather than speculating.[2][6] This applies to both primary aggregation and derivative synthesis. +- **Derivative sections grounded in upstream content only.** Derivative sections must draw exclusively on the aggregated bodies of the primary sections that precede them in the canonical section ordering; they may not introduce claims not present in those upstream bodies. + +## Evidence and Citation Format + +The citation scheme is a contractual output format. + +- Claims must be rendered with compact footnote-style markers (`[1]`, `[2]`, …) and a **Sources** footer at the bottom of each section. +- Line ranges are formatted as `path/to/file:start-end`; a single-line reference as `path/to/file:line`; an unknown range as `path/to/file` alone. +- Detected contradictions must appear verbatim under a **Conflicts in source** heading with an explicit instruction that migration teams must resolve them before re-implementation. They must not be suppressed. +- Note indices presented to the synthesis stage are 1-based; the internal resolution step subtracts 1 before indexing into the underlying list. This off-by-one invariant must be preserved if the prompting scheme is ever changed. 
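The citation format and the 1-based index rule above can be sketched as follows. This is a minimal illustration, not the code in `wikifi/evidence.py` or `wikifi/aggregator.py`: the function names are hypothetical, and dropping out-of-range indices is an assumption made for the example.

```python
from typing import Optional, Sequence


def format_source_ref(path: str, start: Optional[int] = None, end: Optional[int] = None) -> str:
    """Render a reference as `path`, `path:line`, or `path:start-end`."""
    if start is None:
        return path  # line range unknown
    if end is None or end == start:
        return f"{path}:{start}"  # single-line reference
    return f"{path}:{start}-{end}"


def resolve_note_indices(notes: Sequence[str], indices: Sequence[int]) -> list[str]:
    """Map the 1-based indices shown to the synthesis stage onto a 0-based list."""
    resolved = []
    for i in indices:
        # Assumption: indices outside 1..len(notes) are ignored rather than raising.
        if 1 <= i <= len(notes):
            resolved.append(notes[i - 1])  # subtract 1 before indexing
    return resolved


def render_sources_footer(refs: Sequence[str]) -> str:
    """Build a Sources footer with 1-based sequential numbering."""
    lines = ["## Sources"]
    lines += [f"{n}. `{ref}`" for n, ref in enumerate(refs, start=1)]
    return "\n".join(lines)
```

Keeping the subtraction in one helper means a change to the prompting scheme only has to touch a single place.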
+ +## File Processing Thresholds + +| Parameter | Value | Rule | +|---|---|---| +| Maximum file size | 2,000,000 bytes | Files at or above this limit are unconditionally skipped and never read | +| Minimum content size | 64 bytes (stripped) | Files below this threshold are skipped entirely | +| Chunk window | 150,000 bytes | Fixed sliding-window size for splitting large files | +| Chunk overlap | 8,000 bytes | Overlap between adjacent chunks to preserve cross-boundary context | +| Manifest truncation | 20,000 bytes | Manifest files are truncated to this length before inclusion in any prompt | + +Additionally, chunk overlap must satisfy `0 ≤ overlap < chunk_size`, and chunk size must be positive. These inequalities are hard invariants; violating them causes the recursive splitter to fail on edge-case inputs such as whitespace-free monolithic files. + +## Caching Constraints + +- **Aggregation cache key completeness.** The hash used to key an aggregation result must span the file reference, summary, finding text, and the full structured sources list (file path, line range, and fingerprint per source). Omitting any field allows stale citation metadata to be replayed without re-aggregation. +- **Atomic write pattern.** Cache persistence must write to a sibling `.tmp` file and then rename it atomically. A crash during saving must never produce a corrupt cache file. +- **Fingerprint format.** Content fingerprints are defined as the first 12 hexadecimal characters of a SHA-256 digest. This format must be preserved across any migration because it is recorded in cached artefacts and emitted into wiki evidence references. 
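The three caching constraints above can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the contents of `wikifi/cache.py` or `wikifi/fingerprint.py`: the function names, the note field names (`file`, `summary`, `finding`, `sources`), and the use of a full SHA-256 digest for the aggregation key are all hypothetical.

```python
import hashlib
import json
import os
from pathlib import Path


def fingerprint(text: str) -> str:
    """First 12 hex characters of a SHA-256 digest, as the wiki describes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def aggregation_key(notes: list[dict]) -> str:
    """Hash only content-bearing fields so timestamps never change the key."""
    canonical = [
        {
            "file": n["file"],
            "summary": n["summary"],
            "finding": n["finding"],
            # Assumption: each source is a dict of path, line range, fingerprint.
            "sources": n.get("sources", []),
        }
        for n in notes
    ]
    blob = json.dumps(canonical, sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def atomic_write_json(path: Path, payload: dict) -> None:
    """Write to a sibling .tmp file, then rename over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(payload, sort_keys=True), encoding="utf-8")
    os.replace(tmp, path)  # the rename is atomic when both paths share a filesystem
```

Because the rename either happens or does not, a reader sees the previous file or the new one, never a half-written cache.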
+ +## Quality Assurance Rules + +The scoring rubric is fixed and non-negotiable: + +| Score range | Meaning | +|---|---| +| 9–10 | Fully grounded, tech-agnostic, narratively coherent; no unsupported claims | +| 6–8 | Minor issues only | +| 3–5 | Substantial gaps or partial coverage | +| 0–2 | Incoherent or off-brief | + +- The **minimum acceptable score** for publishing a section without revision is **7**. +- A revised body is accepted only if its follow-up critique score is **greater than or equal to** the initial score. Any revision that produces a score regression is discarded and the original body is retained. This invariant must be preserved in any reimplementation. + +## Provider and API Constraints + +### Shared +- The default per-call request timeout is **900 seconds**, chosen to absorb the observed latency of high-effort reasoning on large local models. Reducing this value risks aborting in-progress reasoning traces. +- Three abstract interaction modes — structured completion, text completion, and chat — constitute the **complete and exclusive** contract between the pipeline and any backend. No other methods are ever invoked; any conforming implementation must satisfy all three signatures exactly. + +### Hosted-Claude Backend +- Default maximum output is **32,000 tokens** per call. Callers using the highest reasoning effort levels are expected to raise this limit and enable streaming; too low a value causes the model to exhaust the budget on reasoning before producing structured output. +- Sampling parameters (temperature, top-p, top-k) **must not** be sent to the `claude-opus-4-7` model variant; doing so causes a validation error. The provider omits them unconditionally. +- Structured output is obtained via schema-constrained decoding; if the primary parsed result is absent, the implementation falls back to parsing the raw text block as JSON before raising an error. 
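The structured-output fallback described in the last bullet above can be sketched as follows. This is a hedged illustration rather than the provider's actual code: `parsed_or_fallback`, `validate_section`, and the `body` field are hypothetical names, and a plain callable stands in for whatever schema validation the real provider performs.

```python
import json
from typing import Any, Callable, Optional, TypeVar

T = TypeVar("T")


def parsed_or_fallback(parsed: Optional[T], raw_text: str, validate: Callable[[Any], T]) -> T:
    """Return the parsed object, or validate the raw text as JSON before failing."""
    if parsed is not None:
        return parsed
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        # Callers always get a validated object or an explicit error, never None.
        raise RuntimeError(f"no parsed output and raw text is not JSON: {exc}") from exc
    return validate(data)


def validate_section(data: Any) -> dict:
    """Hypothetical validator: require a dict with a string 'body' field."""
    if not isinstance(data, dict) or not isinstance(data.get("body"), str):
        raise RuntimeError("response does not match the expected section schema")
    return data
```

For example, `parsed_or_fallback(None, '{"body": "text"}', validate_section)` returns the validated dict, while non-JSON raw text raises `RuntimeError`.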
+ +### Local-Model Backend +- Disabling the reasoning trace on Qwen3-family models causes them to ignore the JSON schema constraint and emit free text, breaking validation. Reasoning must never be disabled for Qwen3-style models on the structured-output path. The configuration documentation explicitly marks fully-disabled thinking as unsafe for this reason. +- Default per-call timeout is 900 seconds (same rationale as above). + +### OpenAI-Compatible Backend +- Default output cap is **16,000 tokens** per call; default per-call timeout is 900 seconds. +- Reasoning-capable model families (identified by the prefixes `o` or `gpt-5`) must receive `max_completion_tokens` instead of `max_tokens`, and may receive a `reasoning_effort` value of `low`, `medium`, or `high`. Non-reasoning models must **not** receive `reasoning_effort` to avoid API validation errors. +- When the structured-output parse path returns no parsed object (due to a refusal or truncation), the implementation must fall back to validating raw JSON text against the schema, not return a null silently. + +### Model Identifier Routing +- **Ollama heuristic:** a model identifier is classified as an Ollama-style identifier if it contains `:` and does not begin with the prefix `ft:` (case-insensitive). This rule must be carried forward exactly to avoid misclassifying fine-tuned models or Azure deployment IDs. +- When the hosted-Claude backend is selected but no Claude-prefixed model identifier is configured, the system falls back to a specific default model rather than forwarding the potentially invalid identifier. +- Azure/proxy deployments with non-standard deployment IDs are preserved unchanged. + +## Pipeline Stage Boundaries + +- **Stage 1** must operate without reading any source files; it sees only directory-level summaries and manifest contents. Source reading is exclusively Stage 2's responsibility. 
+- **Stage 1** must produce include and exclude path patterns in gitignore-style format relative to the repository root. +- **Stage 2 (extraction)** targets only primary wiki sections. Derivative sections are explicitly excluded and are produced in Stage 4 from the aggregate of primary findings. This boundary must be preserved through any migration. +- **Derivative section ordering.** Every derivative section must reference only known section IDs, and every upstream dependency must appear earlier in the canonical section ordering. This ordering invariant is validated at module load time; any violation raises an error. + +## Interface and Directory Contracts + +- The CLI entry point and its four subcommands (`init`, `walk`, `chat`, `report`) are declared as a named script in the package manifest; the command name and subcommand surface are **contractual interfaces** for users and tooling. +- The on-disk directory layout (`.wikifi/`, `config.toml`, `.gitignore`, one markdown file per section, `.notes/`, `.cache/`) is the **explicit versioned contract** with target projects and must not change in ways that break existing wikis. +- The `.notes/` and `.cache/` directories must always be excluded from version control; only section markdown files are committed. Any new required gitignore entries introduced in future versions must be backfilled into older wikis automatically on the next `init` run. +- Three exact sentinel strings mark unpopulated sections and must not be altered: `Not yet populated`, `No findings were extracted`, and `upstream sections required to derive`. The report module depends on these exact strings for gap analysis and scoring exclusion. + +## Specialized Extractor Rules + +- Only migration files with `.sql` or `.ddl` suffixes are routed to the SQL migration extractor; all other migration files must fall through to the general extraction path. Routing is determined by file suffix inspection, not by file-kind classification alone. 
+- When an API contract file is present but cannot be parsed, the system must emit an explicit warning finding directing migration teams to review the file manually. Unparseable specs are flagged, not silently skipped. +- Service-to-RPC attribution in protocol definition files must be computed by tracking brace depth (counting nested blocks), not by line proximity, to ensure correct attribution in multi-service files. +- Index definitions in schema files encode query-time performance invariants that must be preserved through migration; the extractor emits this requirement explicitly in every index finding. +- The import/reference graph must be constructed without any binary or compiled dependencies; only pattern matching and path resolution are permitted. This is a stated architectural constraint. +- Migration files are detected by matching a hardcoded list of well-known migration directory path tokens. A SQL file located in such a directory is classified as a migration rather than generic schema, preserving the distinction between forward-only schema changes and current schema state. + +## Supporting claims +- All synthesised wiki content must be free of specific language, framework, or library names and must be translated into domain terms. [1][2][3] +- Whenever two source notes make incompatible claims, the output must include a contradictions entry naming each position and the note indices that support it; suppressing or merging conflicting claims is forbidden. [4][5] +- Derivative sections must draw exclusively on the aggregated bodies of the primary sections that precede them. [6][7] +- Claims must be rendered with compact footnote-style markers and a Sources footer; detected contradictions must appear under a Conflicts in source heading. [8][5] +- Note indices are 1-based and the internal resolution step subtracts 1 before indexing; this off-by-one invariant must be preserved. 
[9] +- Files at or above 2,000,000 bytes are unconditionally skipped and never read. [10][11] +- Files below 64 bytes of stripped content are skipped entirely. [12][11] +- Chunk window is 150,000 bytes with 8,000 bytes of overlap; chunk overlap must satisfy 0 ≤ overlap < chunk_size and chunk size must be positive. [12][13] +- Manifest files are truncated to 20,000 bytes maximum before inclusion in any prompt. [11] +- The aggregation cache key must span the file reference, summary, finding text, and the full structured sources list; omitting any field allows stale metadata to be replayed. [14] +- Cache persistence must use an atomic write pattern (write to a sibling .tmp file, then rename) to guarantee a crash never produces a corrupt cache file. [15] +- Content fingerprints are defined as the first 12 hexadecimal characters of a SHA-256 digest and this format must be preserved across any migration. [16] +- The minimum acceptable quality score for publishing a section without revision is 7; the fixed rubric maps 9–10 to fully grounded, 6–8 to minor issues, 3–5 to substantial gaps, and 0–2 to incoherent. [17] +- A revised body is accepted only if its follow-up critique score is greater than or equal to the initial score; any regression causes the original body to be retained. [18] +- The default per-call request timeout is 900 seconds. [19][20][21] +- Three abstract interaction modes — structured completion, text completion, and chat — constitute the complete and exclusive provider contract. [22] +- The hosted-Claude backend defaults to a 32,000 token output cap; sampling parameters must not be sent to the claude-opus-4-7 model variant. [23][24][25] +- Disabling the reasoning trace on Qwen3-family models causes them to ignore the JSON schema constraint and emit free text; reasoning must never be disabled for these models on the structured-output path. 
[19][26] +- Reasoning-capable model families must receive max_completion_tokens instead of max_tokens and may receive a reasoning_effort value; non-reasoning models must not receive reasoning_effort. [27] +- When the structured-output parse path returns no parsed object, the implementation must fall back to validating raw JSON text against the schema rather than returning null. [28][29] +- An Ollama-style model identifier is defined as a string containing ':' that does not begin with the prefix 'ft:' (case-insensitive); this rule must be carried forward exactly. [30] +- Stage 1 must operate without reading any source files and must produce include/exclude patterns in gitignore-style format relative to the repository root. [31][32] +- Stage 2 extraction targets only primary sections; derivative sections are excluded and produced in Stage 4. [33] +- Every derivative section must reference only known section IDs, and every upstream dependency must appear earlier in the canonical section ordering; violations raise an error at module load time. [7] +- The CLI entry point and its four subcommands (init, walk, chat, report) are contractual interfaces for users and tooling. [34] +- The on-disk directory layout is the explicit versioned contract with target projects and must not change in ways that break existing wikis. [35] +- .notes/ and .cache/ directories must always be excluded from version control; new required gitignore entries must be backfilled automatically on the next init run. [36] +- Three exact sentinel strings — 'Not yet populated', 'No findings were extracted', and 'upstream sections required to derive' — must be preserved as canonical markers for unpopulated sections. [37] +- Only migration files with .sql or .ddl suffixes are routed to the SQL migration extractor; all others fall through to the general extraction path. [38] +- When an API contract file cannot be parsed, the system must emit an explicit warning finding rather than silently dropping it. 
[39] +- Service-to-RPC attribution must be computed by tracking brace depth, not line proximity. [40] +- Index definitions encode query-time performance invariants that must be preserved through migration; the extractor emits this requirement explicitly in every index finding. [41] +- The import/reference graph must be constructed without any binary or compiled dependencies. [42] +- Gherkin outputs must use Given/When/Then syntax inside fenced gherkin code blocks; Mermaid diagrams must be valid and inside fenced mermaid code blocks. [43] + +## Conflicts in source +_The walker found disagreements across files. Migration teams should resolve these before re-implementation._ + +- **The example environment configuration states that only the local-model provider is supported in v1, but multiple other sources document the hosted-Claude and OpenAI providers as fully implemented first-class backends with detailed API constraints.** + - Only the local-model (Ollama) provider is supported in v1. (`.env.example:7-44`) + - The hosted-Claude and OpenAI backends are fully implemented with detailed token caps, sampling-parameter rules, model-routing logic, and fallback behaviour. (`wikifi/config.py:122-134`, `wikifi/orchestrator.py:160-200`, `wikifi/providers/anthropic_provider.py:14-17`, `wikifi/providers/anthropic_provider.py:70-79`, `wikifi/providers/openai_provider.py:215-235`, `wikifi/providers/openai_provider.py:59-66`, `wikifi/providers/openai_provider.py:136-144`) + +## Sources +1. `wikifi/aggregator.py:57-59` +2. `wikifi/critic.py:53-61` +3. `wikifi/deriver.py:37-39` +4. `wikifi/aggregator.py:61-63` +5. `wikifi/evidence.py:121-131` +6. `wikifi/deriver.py:34-50` +7. `wikifi/sections.py:148-158` +8. `wikifi/evidence.py:43-52` +9. `wikifi/aggregator.py:167-173` +10. `wikifi/config.py:59-65` +11. `wikifi/walker.py:61-79` +12. `wikifi/config.py:66-81` +13. `wikifi/extractor.py:302-308` +14. `wikifi/cache.py:243-255` +15. `wikifi/cache.py:205-209` +16. 
`wikifi/fingerprint.py:23-27` +17. `wikifi/critic.py:31-48` +18. `wikifi/critic.py:137-147` +19. `.env.example:7-44` +20. `wikifi/providers/ollama_provider.py:50-54` +21. `wikifi/providers/openai_provider.py:59-66` +22. `wikifi/providers/base.py:42-52` +23. `wikifi/config.py:122-134` +24. `wikifi/providers/anthropic_provider.py:14-17` +25. `wikifi/providers/anthropic_provider.py:70-79` +26. `wikifi/providers/ollama_provider.py:9-27` +27. `wikifi/providers/openai_provider.py:215-235` +28. `wikifi/providers/anthropic_provider.py:107-145` +29. `wikifi/providers/openai_provider.py:136-144` +30. `wikifi/orchestrator.py:205-215` +31. `wikifi/introspection.py:5-9` +32. `wikifi/introspection.py:50-58` +33. `wikifi/extractor.py:51-56` +34. `wikifi/cli.py:1-7` +35. `wikifi/wiki.py:1-8` +36. `wikifi/wiki.py:36-47` +37. `wikifi/report.py:103-108` +38. `wikifi/specialized/dispatch.py:28-62` +39. `wikifi/specialized/openapi.py:24-37` +40. `wikifi/specialized/protobuf.py:62-67` +41. `wikifi/specialized/sql.py:115-121` +42. `wikifi/repograph.py:22-30` +43. `wikifi/deriver.py:40-45` +44. `wikifi/orchestrator.py:160-200` diff --git a/.wikifi/integrations.md b/.wikifi/integrations.md index 26c67d9..86dc8ea 100644 --- a/.wikifi/integrations.md +++ b/.wikifi/integrations.md @@ -1,68 +1,105 @@ # Integrations -### Inbound: Entry Points into the System +## Outbound Integrations -The system is distributed as a library installed directly into a target project. The command-line interface (CLI) is the primary inbound entry point, exposing subcommands that drive the full pipeline from repository introspection through wiki generation, interactive querying, and quality reporting. The CLI delegates all pipeline coordination to the orchestrator, which is also the central hub wiring together every downstream stage. 
+### Language-Model Providers ---- - -### Outbound: AI Model Backends - -All pipeline stages — introspection, per-file extraction, section aggregation, derivative content derivation, quality critique, and interactive chat — communicate with an AI model backend exclusively through a shared provider abstraction. No stage calls a specific backend directly. Three interaction shapes are exposed through this abstraction: schema-validated structured output, free-text completion, and multi-turn stateful conversation. +The system maintains a uniform provider abstraction that isolates every pipeline stage from the concrete inference backend. Three selectable backends are supported — a locally-hosted model service, a hosted Anthropic-compatible service, and an OpenAI-compatible service — each implementing the same three-method contract: structured JSON completion, free-text completion, and multi-turn chat. The active backend is chosen by configuration; the orchestrator and all downstream stages call into it without branching on which concrete provider is live. 
-Three backends are available and are interchangeable without altering any pipeline code: +Every stage that performs inference uses this abstraction: -| Backend type | Hosting model | +| Stage | Operation | |---|---| -| Local self-hosted inference runtime | On-premise / developer machine | -| Hosted AI service (Anthropic-compatible) | Remote cloud | -| Hosted AI service (OpenAI-compatible) | Remote cloud or self-managed endpoint | +| Introspection (Stage 1) | Structured JSON completion to classify repository paths | +| Extraction (Stage 2) | Structured JSON completion against a findings schema, per file | +| Aggregation (Stage 3) | Structured JSON completion against a section-body schema | +| Derivation (Stage 4) | Structured completion for personas, user stories, and diagrams | +| Critic / Reviser | Structured completion for rubric scoring and body revision | +| Chat session | Multi-turn chat grounded in populated wiki content | -The active backend is selected via an environment variable or a per-invocation flag at the CLI level. OpenAI-compatible endpoints — including corporate reverse proxies and managed cloud deployments — are supported by overriding the base URL alone, with no other changes to the calling code. +### External Tool and Capability Servers ---- +At the development and runtime boundary, the system is configured as a client that fans out to multiple external capability providers via a tool-server protocol. Four integrations are declared: a local AI utility, a local web crawler, a remote documentation context service, and a remote stitching/search service. This makes the system both a producer of wiki content and a consumer of external knowledge services during operation. -### Outbound: Development-Time Tool Servers (MCP) +### Filesystem and Layout Abstraction -A separate set of external capability providers is declared through an MCP client configuration used during development or runtime. 
Four tool servers are wired up: a local AI utility, a local web crawler, a remote documentation-context service, and a remote search-and-stitching service. The system acts as an MCP client that fans requests out to these providers as needed.
+All pipeline stages read and write through a shared filesystem layout abstraction rather than addressing paths directly. Extraction findings are appended to a notes store; aggregated section bodies are written back through the same abstraction; the report and chat components read section markdown from the same on-disk layout. The cache layer uses this abstraction to locate its storage directory, and all cache reads and writes (keyed on file fingerprints and section-content hashes) pass through it.
----
+### Import Graph
-### Outbound: Filesystem and Persistence Layer
+The extraction stage integrates with a repository-wide import/reference graph. For each file being analysed, the graph supplies the file's direct neighbors — files it depends on and files that depend on it — which are injected into the extraction prompt to enable cross-file flow descriptions. The graph also drives the specialized-extractor dispatch path by classifying each file's structural kind before routing.
-All reading and writing of wiki artifacts — extraction notes, finished section bodies, and cache entries — flows through a centralized layout abstraction that manages the `.wikifi/` output directory inside the target project. The extractor, aggregator, deriver, CLI, and orchestrator all resolve paths through this abstraction rather than independently.
+### Per-Project Configuration
-A content-addressed cache layer sits between the orchestrator and the AI backend, consulting a fingerprinting service to derive content hashes as cache keys. The extractor, aggregator, and orchestrator each consult the cache before issuing AI calls, enabling both incremental re-runs and resumability for large codebases.
+Project-specific provider selection, model preferences, caching behavior, and feature flags are read from a TOML configuration file stored inside each managed project's wiki directory. Parse failures fall back gracefully to environment-derived defaults rather than aborting the pipeline.
 ---
-### Integration Touchpoints Discovered in Target Codebases
+## Inbound Integrations
-When analyzing a target codebase, the system identifies and surfaces integration touchpoints from high-signal artifact files through specialized parsers:
+The primary entry point is the command-line interface, which exposes four subcommands (`init`, `walk`, `chat`, `report`). It constructs the provider instance and passes it directly into the chat and report capabilities. All other pipeline stages are driven by the orchestrator, which sequences introspection → extraction → aggregation → derivation and is itself invoked by the CLI `walk` subcommand.
-- **HTTP API surfaces** — Parsed from API contract files; each contract contributes a finding recording the count of externally exposed endpoints, establishing the public-facing API surface as a documented integration point.
-- **RPC service definitions** — Each declared service and its remote procedures are mapped, capturing procedure names, request and response message types, and whether either channel is streaming.
-- **Event-driven channels** — Subscription roots found in schema definition files are classified as real-time integration touchpoints rather than ordinary capabilities, reflecting their role as channels that external consumers attach to.
-- **Relational links** — Foreign key declarations (both explicit and inline) are surfaced as hard relational links between domain entities, identifying cross-entity data dependencies.
+---
-The dispatcher that routes files to these specialized parsers uses the file-kind classification produced by the repository graph module, ensuring each artifact type reaches the appropriate parser while preserving a uniform output contract for downstream aggregation.
+## Integration Surfaces Detected in Analysed Codebases
+
+When the system analyses a target repository, it surfaces the following categories of integration touchpoint:
+
+- **HTTP API endpoints** — each parsed API contract contributes a finding recording the number of endpoints the analysed system exposes to external consumers, forming the inbound integration inventory.
+- **RPC service blocks** — each service definition is treated as an integration touchpoint; individual operations are described with their request and response types, including streaming direction where declared.
+- **Event-driven subscriptions** — subscription roots in schema definition files are mapped specifically to the integrations section, reflecting that they represent event-driven touchpoints rather than direct request/response capabilities.
+- **Relational foreign-key links** — cross-table references are recorded as hard relational links between entities, surfacing constraints that affect how components may be separated or migrated independently.
+
+The specialized extractor dispatch layer acts as the internal routing hub between the upstream file classifier (which tags file kinds) and the downstream extractors responsible for each artifact type. Files that do not match a recognized kind fall through to the general LLM extraction path.
+
+## Supporting claims
+- Three selectable LLM backends are supported — a locally-hosted model service, a hosted Anthropic-compatible service, and an OpenAI-compatible service. [1][2][3]
+- Each backend implements the same three-method contract: structured JSON completion, free-text completion, and multi-turn chat.
[1][2][3]
+- The orchestrator and all downstream stages call into the provider without branching on which concrete provider is active. [4][1][2][3]
+- The introspection stage uses structured JSON completion to classify repository paths. [5]
+- The extraction stage uses structured JSON completion against a findings schema, per file. [6]
+- The aggregation stage uses structured JSON completion against a section-body schema. [7]
+- The critic/reviser uses structured completions for rubric scoring and body revision. [8]
+- The chat session uses multi-turn chat grounded in populated wiki content. [9]
+- Four external tool-server integrations are declared: a local AI utility, a local web crawler, a remote documentation context service, and a remote stitching/search service, making the system an MCP client that fans out to multiple capability providers. [10]
+- All pipeline stages read and write through a shared filesystem layout abstraction; the cache layer uses this abstraction to locate its storage directory. [11][12][13][14][15][16]
+- The extraction stage integrates with a repository-wide import/reference graph; each file's neighbors are injected into the extraction prompt. [17][18]
+- The import graph also drives the specialized-extractor dispatch path by classifying each file's structural kind. [18][19]
+- Project-specific settings are read from a TOML configuration file inside each managed project's wiki directory; parse failures fall back gracefully to defaults. [20][21]
+- The CLI constructs the provider instance and passes it directly into the chat and report capabilities. [4]
+- Each parsed API contract contributes a finding recording the number of HTTP endpoints the analysed system exposes to external consumers. [22]
+- Each RPC service block in a protocol definition is treated as an integration touchpoint, with operations described including streaming direction.
[23]
+- Subscription roots are mapped specifically to the integrations section, reflecting event-driven touchpoints. [24]
+- Cross-table foreign-key references are recorded as hard relational links between entities, surfacing migration constraints. [25]
+- The specialized extractor dispatch layer routes recognized file kinds to dedicated extractors; unrecognized files fall through to the general LLM extraction path. [26][19][27]
+- Derivative sections are excluded from the aggregation stage and are instead populated by a separate deriver stage that runs afterwards. [28][14]
 ## Sources
-1. `README.md:8-12`
-2. `wikifi/cli.py:98-101`
-3. `wikifi/orchestrator.py:40-60`
-4. `wikifi/providers/base.py:30-48`
-5. `wikifi/providers/anthropic_provider.py:115-175`
-6. `wikifi/providers/ollama_provider.py:58-95`
-7. `wikifi/providers/openai_provider.py:1-8`
-8. `README.md:46-51`
-9. `TESTING-AND-DEMO.md:232-235`
+1. `wikifi/providers/anthropic_provider.py:83-106`
+2. `wikifi/providers/ollama_provider.py:44-46`
+3. `wikifi/providers/openai_provider.py:1-9`
+4. `wikifi/cli.py:176-179`
+5. `wikifi/introspection.py:61-70`
+6. `wikifi/extractor.py:220-235`
+7. `wikifi/aggregator.py:136-141`
+8. `wikifi/critic.py:30-32`
+9. `wikifi/chat.py:52-55`
 10. `.mcp.json:2-36`
-11. `wikifi/wiki.py:34-61`
-12. `wikifi/cache.py:244-246`
-13. `wikifi/cache.py:30`
-14. `wikifi/specialized/openapi.py:83-92`
-15. `wikifi/specialized/protobuf.py:70-87`
-16. `wikifi/specialized/graphql.py:88-91`
-17. `wikifi/specialized/sql.py:86-96`
-18. `wikifi/specialized/__init__.py:46-57`
+11. `wikifi/aggregator.py:109-160`
+12. `wikifi/cache.py:30-32`
+13. `wikifi/chat.py:63-82`
+14. `wikifi/deriver.py:73-107`
+15. `wikifi/extractor.py`
+16. `wikifi/report.py:78-130`
+17. `wikifi/extractor.py:213-215`
+18. `wikifi/repograph.py:1-10`
+19. `wikifi/specialized/dispatch.py:36-62`
+20. `wikifi/cli.py:103-105`
+21. `wikifi/config.py:169-200`
+22. `wikifi/specialized/openapi.py:96-103`
+23.
`wikifi/specialized/protobuf.py:64-90`
+24. `wikifi/specialized/graphql.py:108-110`
+25. `wikifi/specialized/sql.py:88-98`
+26. `wikifi/specialized/__init__.py:7-8`
+27. `wikifi/specialized/models.py:30-31`
+28. `wikifi/aggregator.py:111-116`
diff --git a/.wikifi/intent.md b/.wikifi/intent.md
index 73d0214..4f2d24e 100644
--- a/.wikifi/intent.md
+++ b/.wikifi/intent.md
@@ -1,57 +1,64 @@
 # Intent and Problem Space
-wikifi exists because the intent embedded in a legacy system is typically invisible — locked inside years of implementation choices, technology-specific conventions, and accumulated structure that makes it difficult to separate *what the system does and why* from *how it currently does it*. Migration teams tasked with replacing or re-implementing such a system need the former without the latter.
+wikifi exists to produce a structured, technology-agnostic wiki from an arbitrary source code repository — explaining **what a system does and why**, independent of the languages, frameworks, or infrastructure used to build it. Its primary audience is the team inheriting or migrating an existing codebase: architects and engineers who need a trustworthy, actionable picture of domain entities, capabilities, and integrations without spending days reading raw source files.
-### The Core Problem
+### The core problem
-When a team inherits a large legacy codebase and must produce a new implementation, they face a knowledge-extraction problem. The source describes a particular way of solving a set of problems, but rarely describes the problems themselves at a level that is portable to a new context. Reading the source directly tends to reproduce the same structure and constraints in the new system — recreating legacy decisions rather than the underlying intent.
+Large and legacy codebases resist quick comprehension. Source files encode intent implicitly, mixed with scaffolding, build artifacts, tests, and dependency code that carry no domain signal.
At the same time, certain structured artifacts — database schemas, API contracts, protocol definitions — express intent with machine-readable precision that general-purpose analysis handles poorly. Any naive, uniform approach to understanding a codebase either drowns in noise or misses the highest-fidelity evidence.
-wikifi addresses this by walking a repository and producing a structured, technology-agnostic wiki that surfaces:
+Beyond individual files, some concepts — user personas, end-to-end user stories, system-level diagrams — only emerge from the *aggregate* of capabilities, entities, and integrations, and simply cannot be read from any single file in isolation.
-- **Domain entities and capabilities** — what the system models and what it can do
-- **API contracts and integration touchpoints** — what it exposes and to whom
-- **Cross-cutting concerns** — considerations that span the system as a whole
-- **Personas, user stories, and diagrams** — who uses the system, what they need, and how flows connect
+wikifi addresses all of this by treating repository understanding as a structured, multi-stage extraction problem rather than a documentation-writing task.
-The goal is to make legacy intent explicit, complete, and portable so a fresh implementation can retain full functional value without inheriting structural decisions.
+### For whom
-### Primary Audience
+The system is designed explicitly around the needs of **migration teams and technical architects** who must understand a live system well enough to redesign or replatform it. Every design choice — traceability of claims to source locations, surfacing of contradictions rather than silently merging them, quality scoring before handoff — is oriented toward answering the question: *can we trust this wiki enough to act on it?*
-The immediate audience is migration teams — architects and developers who need to understand a system's domain well enough to re-implement it rather than maintain it.
A secondary audience includes anyone who must understand what a system does without reading its source directly, including those who need to interrogate the resulting wiki conversationally.
+### Constraints that shape the design
-### What the System Is Not
+**Trust and traceability over convenience.** The system refuses to silently resolve disagreements between source files. Every synthesized claim must trace back to the specific files that justified it, and a dedicated quality-assurance pass flags unsupported claims and coverage gaps before output is delivered.
-wikifi is explicitly a feature-extraction tool, not a transposition tool. It surfaces what a legacy system does and leaves all decisions about target architecture, structure, and approach entirely to the migration team. The output prescribes nothing about how the new system should be built.
+**Technology neutrality.** All output is expressed in domain terms — entities, capabilities, integrations, personas — never in terms of the implementation technology. This ensures the wiki remains useful even when the migration replaces the entire stack.
-### Shaping Constraints
+**Local-first operation.** The default configuration routes all inference through a locally-hosted model to avoid cloud API dependencies. Hosted providers are explicit opt-ins, reflecting a philosophy of keeping sensitive source code within the operator's own infrastructure unless otherwise chosen.
-Several constraints are built into the design from the outset:
+**Quality over speed.** The system prioritises documentation quality over processing throughput. Guards prevent runaway behaviour on near-empty or oversized files, and higher-order sections are synthesized only after all primary evidence has been assembled.
-| Constraint | Rationale |
-|---|---|
-| **Technology agnosticism** | Output must be expressed in domain terms, never in terms of the implementation technology found in the source, so the wiki does not embed the very assumptions it is meant to dissolve. |
-| **Quality over speed** | Accuracy and completeness of the generated wiki are prioritised over processing throughput. |
-| **Arbitrary scale** | The system must handle repositories of any size — including legacy monorepos with tens of thousands of files — through caching and chunking strategies that make repeated and interrupted runs cheap. |
-| **Full traceability** | Every assertion in the generated wiki must trace back to specific source files and locations so architects can verify any claim against the original codebase. |
-| **Honest disagreement** | Where source files contain conflicting signals, the system surfaces those contradictions explicitly rather than silently resolving them, preserving the full picture for the migration team. |
+**Scalability on large codebases.** Re-processing a large legacy codebase on every run is impractical. Content-addressed caching ensures only changed files require new analysis, making repeated full-repository walks economical and enabling recovery after mid-run failures. Structured contract files bypass general-model processing entirely when deterministic parsing is more accurate and less costly.
+
+**Stable output contract.** The on-disk layout produced by wikifi is treated as a contract with the target project: it must remain stable across tool upgrades so that existing wikis stay readable and can be updated incrementally without full regeneration.
+
+## Supporting claims
+- wikifi exists to produce a technology-agnostic wiki explaining what a system does and why, independent of the technologies used to build it. [1][2][3][4][5]
+- Its primary audience is migration teams and architects who need a trustworthy picture of a codebase without manual source-reading.
[6][7][8][9][10][11]
+- Some concepts — personas, user stories, diagrams — only emerge from the aggregate of capabilities and entities and cannot be extracted from individual files. [12][5]
+- The system refuses to silently resolve contradictions; every claim must trace back to the specific source files that justified it. [13][7]
+- A quality-assurance pass flags unsupported claims and coverage gaps before output is delivered, so migration teams can trust the result without manually verifying every claim. [6][8]
+- The default configuration routes inference through a locally-hosted model; hosted providers are explicit opt-ins reflecting a local-first philosophy. [14]
+- The system prioritises documentation quality over processing throughput, with guards against runaway behaviour on near-empty or oversized files. [1]
+- Content-addressed caching makes repeated full-repository walks economical and enables recovery after mid-run failures; only changed files require new analysis. [15][16]
+- Structured contract files bypass general-model processing when deterministic parsing is more accurate and less costly. [17][18][19][20]
+- The on-disk layout is treated as a stable contract with the target project, kept consistent across tool upgrades so existing wikis remain readable. [21]
 ## Sources
-1. `VISION.md:3-9`
-2. `CLAUDE.md:73-75`
-3. `README.md:3`
-4. `wikifi/cli.py:1-8`
-5. `.env.example:1-2`
-6. `TESTING-AND-DEMO.md:1-6`
-7. `wikifi/config.py:1-8`
-8. `wikifi/specialized/__init__.py:1-13`
+1. `.env.example:1-2`
+2. `wikifi/cli.py:1-10`
+3. `wikifi/introspection.py:1-9`
+4. `wikifi/orchestrator.py:1-17`
+5. `wikifi/sections.py:1-19`
+6. `wikifi/critic.py:1-15`
+7. `wikifi/evidence.py:1-18`
+8. `wikifi/report.py:1-16`
 9. `wikifi/specialized/openapi.py:1-11`
 10. `wikifi/specialized/protobuf.py:1-8`
-11. `wikifi/deriver.py:1-18`
-12. `wikifi/sections.py:1-19`
-13. `VISION.md:86-89`
-14. `wikifi/critic.py:1-15`
-15. `wikifi/chat.py:1-32`
-16. `wikifi/cache.py:1-21`
-17.
`wikifi/extractor.py:1-37`
-18. `wikifi/aggregator.py:1-15`
-19. `wikifi/evidence.py:1-18`
+11. `wikifi/specialized/sql.py:1-13`
+12. `wikifi/deriver.py:1-18`
+13. `wikifi/aggregator.py:1-15`
+14. `wikifi/config.py:1-26`
+15. `wikifi/cache.py:1-20`
+16. `wikifi/extractor.py:1-30`
+17. `wikifi/repograph.py:1-30`
+18. `wikifi/specialized/__init__.py:1-12`
+19. `wikifi/specialized/dispatch.py:1-13`
+20. `wikifi/specialized/models.py:1-8`
+21. `wikifi/wiki.py:1-8`
diff --git a/.wikifi/personas.md b/.wikifi/personas.md
index 9c48864..3e033b3 100644
--- a/.wikifi/personas.md
+++ b/.wikifi/personas.md
@@ -1,153 +1,9 @@
 # User Personas
-Two broad audiences are evident from the system's stated purpose and the capabilities built to serve them: **migration teams** who need portable domain knowledge extracted from a legacy codebase, and **knowledge consumers** who need to interrogate that knowledge without reading the source. A third role — the **wiki operator** — emerges from the pipeline management and quality-assurance capabilities. A fourth is implied by the interactive chat interface and the explicitly non-technical framing of the conversational output.
+The system's design choices, capabilities, and integration surfaces converge on a small set of distinct roles. Each persona below is inferred from the aggregate of what the system does — no single module or feature alone would justify any of them.
 ---
-## Persona 1 — The Migration Architect
+## Persona 1 — Migration Architect
-> *"I need to understand what this system does, not how it does it."*
-
-### Profile
-Leads the technical planning for a re-implementation or replacement of an inherited legacy system. Responsible for defining the scope and domain boundaries of the new system before any build work begins.
-
-### Goals
-- Recover the intent embedded in a legacy codebase independently of its current technology choices.
-- Identify domain entities, capabilities, integration touchpoints, and cross-cutting concerns that must be preserved in the new system.
-- Produce artefacts (diagrams, user stories, entity maps) that can brief the wider delivery team.
-
-### Needs
-- A technology-agnostic wiki that does not reproduce legacy structural decisions.
-- Full traceability — every assertion must point back to a specific location in the source so claims can be verified.
-- Explicit surfacing of contradictions in the source rather than silent resolution, since disagreements flag high-priority migration risks.
-- Architectural diagrams and structured user stories derived automatically from the extracted knowledge.
-
-### Pain Points
-| Pain point | How the system addresses it |
-|---|---|
-| Reading legacy source directly tends to reproduce its structure in the new design | Output is expressed entirely in domain terms, never in technology-specific terms |
-| Conflicting signals in different parts of the codebase are invisible | Contradictions are surfaced in a dedicated *Conflicts in source* block |
-| Claims cannot be verified without re-reading the entire codebase | Every claim carries numbered citations to originating files and line ranges |
-| No portable documentation exists to brief the wider team | Derivative sections produce Mermaid diagrams, Gherkin stories, and persona documents |
-
-### Use Cases Served
-- Full wiki generation from a legacy repository
-- Review of the *Core Entities*, *Integrations*, and *Hard Specifications* sections
-- Conflict review for migration risk prioritisation
-- Sharing generated diagrams and user stories as briefing materials
-
----
-
-## Persona 2 — The Migration Developer
-
-> *"I need to understand a specific subsystem quickly and know which parts of the source back that up."*
-
-### Profile
-A developer on the migration team working at the implementation level.
Inherits specific domain areas to re-implement and needs targeted, verifiable knowledge about those areas without reading the entire legacy codebase.
-
-### Goals
-- Understand the behaviour and boundaries of an assigned domain area.
-- Trace any uncertainty back to the exact source location.
-- Ask follow-up questions about the system without re-reading multiple files.
-
-### Needs
-- Per-section wiki bodies with inline citations.
-- Cross-file flow descriptions that show how files and components interact, not just what each file does in isolation.
-- An interactive conversational interface grounded in the full wiki for targeted queries.
-- Resumable analysis so a large codebase can be processed incrementally and interrupted runs are not lost.
-
-### Pain Points
-| Pain point | How the system addresses it |
-|---|---|
-| Technical debt obscures the boundary between accidental and essential complexity | Technology-agnostic extraction separates domain behaviour from implementation noise |
-| No way to ask targeted questions without reading source | Multi-turn chat session grounded in all populated wiki sections |
-| Uncertainty about which source files are authoritative | Import and reference graph enriches findings with inter-file context; citations identify exact source spans |
-| Repetitive re-runs on large repos are slow | Content-addressed cache replays unchanged file results; interrupted walks resume from the last processed file |
-
-### Use Cases Served
-- Querying the interactive chat session for specific domain questions
-- Reading per-section markdown with source citations
-- Reviewing cross-file flow descriptions produced by the reference graph enrichment
-- Verifying claims against cited file locations and line ranges
-
----
-
-## Persona 3 — The Domain Knowledge Consumer
-
-> *"I need to understand what this system does, but I cannot read the source code."*
-
-### Profile
-A stakeholder — for example, a domain expert, product owner, or business analyst
— who holds contextual knowledge about what the system is supposed to do but lacks the ability or time to read the codebase directly. May need to validate whether the extracted wiki accurately reflects business intent or to answer specific questions about system behaviour.
-
-### Goals
-- Gain a clear, jargon-free understanding of what the system does and why.
-- Validate or challenge the extracted domain model against real-world business knowledge.
-- Ask specific questions without requiring a technical intermediary.
-
-### Needs
-- Plain-language, technology-agnostic output that does not assume programming knowledge.
-- A conversational interface for targeted questions rather than having to read structured markdown.
-- Assurance that only populated, meaningful content is included in any context provided to the assistant.
-
-### Pain Points
-| Pain point | How the system addresses it |
-|---|---|
-| No readable documentation exists for the legacy system | The generated wiki is expressed in domain terms without implementation-specific language |
-| Technical intermediaries are needed to answer basic questions about behaviour | The interactive chat session allows direct conversational querying of the wiki |
-| Risk that the AI-generated summary does not reflect ground truth | Full traceability to source and explicit conflict surfacing allow domain experts to challenge assertions |
-
-### Use Cases Served
-- Reading generated wiki sections (particularly *Business Domains*, *System Intent*, and *User Personas*)
-- Conducting multi-turn chat sessions to interrogate specific capabilities or entities
-- Reviewing Gherkin-style user stories for business accuracy
-
----
-
-## Persona 4 — The Wiki Operator
-
-> *"I need to keep this wiki accurate, complete, and trustworthy as the codebase evolves."*
-
-### Profile
-A technical lead, DevOps engineer, or senior developer responsible for running and maintaining the wiki-generation pipeline over time.
Focuses on pipeline health, analysis completeness, and quality assurance rather than consuming the wiki content directly.
-
-### Goals
-- Run and re-run the pipeline efficiently as the codebase changes.
-- Monitor which areas of the codebase produced no useful findings (dead zones).
-- Validate that generated sections meet a defined quality bar before the wiki is shared with the wider team.
-- Configure the pipeline to match the constraints of the deployment environment (on-premise AI backend, private endpoints, exclusion patterns).
-
-### Needs
-- Coverage reports showing per-section file counts, finding counts, and body sizes.
-- Identification of dead zones — files that were processed but produced no findings.
-- A configurable quality threshold that triggers automatic revision when sections fall below it.
-- Support for on-premise or privately hosted AI backends for air-gapped or data-sensitive environments.
-- Idempotent workspace initialisation so re-runs do not overwrite existing work.
-
-### Pain Points
-| Pain point | How the system addresses it |
-|---|---|
-| Large repositories make full re-runs prohibitively slow | Two-scope content-addressed cache means only changed files and affected sections are reprocessed |
-| Blind spots in the analysis go undetected | Coverage report surfaces dead zones and per-section gaps |
-| Generated sections may introduce unsupported claims | Critic-and-reviser pass scores each section and auto-revises below a configurable threshold |
-| Interrupted runs waste all completed work | Results are persisted after every completed file; walks resume from the last unprocessed file |
-| Different deployment environments require different AI backends | Active backend is selected via environment variable or per-invocation flag; no pipeline code changes needed |
-
-### Use Cases Served
-- Running incremental and full wiki-generation pipelines
-- Reviewing the coverage and quality report
-- Configuring quality thresholds and exclusion patterns
-- Selecting and overriding the AI backend for private or on-premise deployments
-- Forcing cache invalidation when a clean re-walk is required
-
----
-
-## Persona Summary
-
-| Persona | Primary interaction | Core output consumed | Key system capability relied on |
-|---|---|---|---|
-| Migration Architect | CLI — full wiki generation | All eleven sections; diagrams; user stories | Tech-agnostic extraction; conflict surfacing; derivative section synthesis |
-| Migration Developer | CLI + interactive chat | Per-section bodies with citations; chat responses | Cross-file context enrichment; conversational querying; incremental caching |
-| Domain Knowledge Consumer | Interactive chat; generated markdown | Plain-language wiki sections; Gherkin stories | Conversational session; technology-agnostic output |
-| Wiki Operator | CLI — pipeline management and reporting | Coverage and quality reports | Incremental walks; dead zone detection; critic-and-reviser pass; backend configuration |
-
->
**Coverage note:** The upstream sections do not describe any end-user of the *target* legacy system as an audience for wikifi itself. All personas above are consumers of the extraction tool and its outputs, not of the system being analysed.
+> *
diff --git a/.wikifi/user_stories.md b/.wikifi/user_stories.md
index 289efdb..4ca05c5 100644
--- a/.wikifi/user_stories.md
+++ b/.wikifi/user_stories.md
@@ -1,300 +1,218 @@
 # User Stories
-## Feature: Wiki Workspace Initialisation
+## Feature: Repository Triage and Scoping
-**As a Wiki Operator, I want the workspace to be initialised idempotently, so that re-running setup does not destroy work that has already been completed.**
+### As a Documentation Engineer, I want the system to classify which paths contain production source before any deep analysis begins, so that analysis effort is focused on meaningful content and costs are bounded.
 ```gherkin
-Given a target project root that already contains a partially populated wiki workspace
-When the workspace initialisation command is invoked again
-Then the existing directory structure, configuration file, version-control ignore rules,
- and per-section placeholder documents are left untouched
-And no previously generated section bodies are overwritten
+Given a repository containing production source alongside vendored dependencies, build artifacts, generated files, and CI configuration
+And file-size bounds are configured in the pipeline settings
+When Stage 1 triage runs
+Then paths classified as vendored dependencies, build artifacts, generated files, or CI configuration are excluded from analysis
+And files outside the configured size bounds are filtered before any extraction begins
+And the rationale for every filtering choice is recorded in the IntrospectionResult
 ```
 ---
-## Feature: Technology-Agnostic Wiki Generation
+## Feature: Per-File Extraction
-**As a Migration Architect, I want the wiki to express all findings in domain terms rather than in the vocabulary of
the legacy technology stack, so that the new system design is not inadvertently shaped by the old implementation's structure.** +### As a Documentation Engineer, I want well-structured files to be routed to deterministic extractors rather than AI inference, so that extraction is more accurate and cost-effective for those artifact types. ```gherkin -Given a legacy codebase built on any technology stack -When a full wiki generation run completes -Then every wiki section body is expressed in technology-agnostic, domain-level language -And no technology-specific constructs, naming conventions, or structural patterns - from the source appear in the generated output +Given a repository containing relational schema files, API contract files, interface definitions, and migration scripts alongside general source files +When the extraction stage begins +Then well-structured files are routed to dedicated deterministic extractors +And general-purpose source files are analyzed via AI inference +And every finding carries a citation recording the repo-relative file path and line range ``` -**As a Migration Architect, I want the wiki to be organised into all defined sections covering domains, intent, capabilities, integrations, entities, cross-cutting concerns, and hard specifications, so that I have a complete set of artefacts to brief the wider delivery team.** +### As a Documentation Engineer, I want large source files to be split into overlapping chunks during extraction, so that no content is lost at chunk boundaries. 
```gherkin -Given a completed wiki generation run against a legacy repository -When I inspect the generated wiki workspace -Then eight primary sections are populated with evidence-backed content -And three derivative sections (user personas, user stories, and architectural diagrams) - are synthesised from the completed primary sections -And any section for which no evidence was found contains an explicit placeholder - declaring the gap rather than fabricated content +Given a source file whose size exceeds the configured chunk threshold +When the file is processed during extraction +Then the file is divided into overlapping chunks using the coarsest available boundary first +And findings are deduplicated across chunk boundaries to avoid double-counting +And each finding retains its citation to the originating path and line range ``` ---- - -## Feature: Source Traceability - -**As a Migration Developer, I want every assertion in the wiki to carry a numbered citation back to its originating file and line range, so that I can verify any claim without re-reading the entire codebase.** +### As a Migration Architect, I want each file's extraction to be enriched with its import-graph neighbourhood, so that findings describe cross-file flows rather than treating each file in isolation. 
```gherkin -Given a populated wiki section body -When I read a claim made in that section -Then the claim is annotated with a numbered citation -And the citation resolves to a specific repo-relative file path - and an inclusive line range in the source repository -And a content fingerprint is stored alongside the citation to enable staleness detection -``` - -**As a Migration Developer, I want claims that have no source backing to be explicitly flagged as unsupported, so that I know which assertions require further investigation rather than assuming all claims are verified.** - -```gherkin -Given a wiki section that contains a claim for which no source reference could be identified -When the section is rendered -Then the claim is explicitly marked as unsupported -And no citation number is fabricated or silently omitted without a visible notice +Given a repository with interdependent files +And the cross-file import graph option is enabled in settings +When per-file extraction runs +Then the system builds a cross-file import and reference graph before extraction begins +And each file's extraction pass is enriched with the files it depends on and the files that depend on it +And findings can assert cross-file relationships rather than single-file observations ``` --- -## Feature: Conflict Surfacing +## Feature: Section Synthesis and Derivative Generation -**As a Migration Architect, I want contradictory assertions from different parts of the codebase to be surfaced explicitly, so that I can treat disagreements as high-priority migration risks rather than discovering them later in the build.** +### As a Documentation Engineer, I want primary wiki sections to be synthesized from per-file findings with full citation trails, so that every assertion in the output is traceable to a specific source location. 
```gherkin -Given two or more source files that make incompatible assertions about the same topic -When the section aggregation stage processes their findings -Then a dedicated "Conflicts in source" block is included in the relevant section body -And each conflicting position is listed with its own source references -And no silent resolution or averaging of the conflict is performed +Given a set of per-file findings accumulated for a primary section +When the aggregation stage runs for that section +Then a coherent markdown body is produced from the findings +And every assertion in the body is backed by numbered citations traceable to the source files and line ranges from which it was inferred +And claims present in the supporting evidence but absent from the narrative body are collected into a separate supporting-claims list rather than silently dropped ``` ---- - -## Feature: Cross-File Context Enrichment - -**As a Migration Developer, I want extraction findings to reflect inter-file flows rather than isolated per-file summaries, so that I understand how components in an assigned domain area interact with one another.** +### As a Documentation Engineer, I want derivative sections to be synthesized only after all their upstream primary sections are finalized, so that personas, user stories, and diagrams are grounded in complete evidence. 
```gherkin -Given an in-scope file that imports other in-scope files or is imported by them -When the extraction pipeline processes that file -Then the file's import neighbourhood (files it depends on and files that depend on it) - is included as context during extraction -And the resulting findings describe cross-file interactions - rather than treating the file in isolation +Given a wiki configuration where derivative sections declare upstream dependencies +When the pipeline reaches the derivation stage +Then sections are processed in topological order enforced at startup +And no derivative section is synthesized until all of its declared upstream sections are finalized +And if an upstream section is absent or empty, a placeholder is emitted rather than fabricated content ``` --- -## Feature: Specialised Schema Parsing +## Feature: Wiki Scaffolding -**As a Migration Architect, I want structured schema files — including SQL definitions, API contract specifications, interface definition files, graph schemas, and database migrations — to be parsed deterministically, so that entity and relationship extraction from these files is reliable and reproducible.** +### As a Pipeline Operator, I want wiki initialization to be idempotent, so that re-running the scaffold command in an automated pipeline does not overwrite existing content. 
```gherkin -Given a repository that contains structured schema artifacts -When the extraction pipeline classifies and routes those files -Then each schema file is processed by a purpose-built deterministic parser -And the resulting findings describe entities, relationships, operations, and constraints -And no AI model invocation is required for the deterministic parsing path -``` - -**As a Wiki Operator, I want unparseable schema files to produce an advisory finding directing reviewers to inspect the file manually, so that a single malformed file does not silently omit domain knowledge from the wiki.** - -```gherkin -Given a schema file that cannot be parsed by the deterministic parser -When the specialised extraction path attempts to process that file -Then an advisory finding is produced directing reviewers to inspect the file manually -And no silent failure or empty result is returned without notice +Given a project with a partially populated wiki directory structure +When the wiki scaffold command is run again +Then existing content is left untouched +And only missing structural pieces are created ``` --- -## Feature: Large File Handling +## Feature: Conflict Detection and Evidence Traceability -**As a Migration Developer, I want large source files to be fully analysed regardless of size, so that no content is silently missed during AI-assisted extraction.** +### As a Migration Architect, I want incompatible assertions across source files to be surfaced explicitly rather than silently resolved, so that my team can identify and resolve tribal knowledge conflicts before re-implementation. 
```gherkin -Given a source file whose size exceeds the processing capacity of a single extraction pass -When the extraction pipeline routes the file through the AI-assisted extraction path -Then the file is recursively split into overlapping chunks -And each chunk is processed independently -And findings from all chunks are combined so that no content is omitted +Given a codebase where two or more source files make incompatible assertions about the same domain topic +When the aggregation stage produces the section body +Then the conflict is surfaced under a dedicated heading in the output +And each conflicting position retains its own source references +And the narrative does not silently choose one position over another ``` ---- - -## Feature: Incremental and Resumable Processing - -**As a Migration Developer, I want incremental runs to skip files and sections that have not changed, so that iterating on a large codebase does not require waiting for a full re-walk each time.** +### As a Migration Architect, I want every factual claim in the wiki traceable to the file and line range from which it was inferred, so that I can verify assertions against the original source. 
```gherkin -Given a repository that has been walked at least once -And a subsequent run in which only a subset of files has changed -When the pipeline runs again -Then only files whose content fingerprint has changed are re-extracted -And only wiki sections whose contributing notes payload has changed are re-aggregated -And all unchanged results are served from the content-addressed cache -``` - -**As a Wiki Operator, I want an interrupted pipeline run to resume from the last processed file, so that no completed extraction work is lost when a run is cut short.** - -```gherkin -Given a pipeline run that was interrupted before processing all in-scope files -When the pipeline is restarted -Then processing resumes from the first unprocessed file -And all previously persisted extraction results are retained and not re-run -``` - -**As a Wiki Operator, I want to force a full cache invalidation, so that I can obtain a clean re-walk when the pipeline configuration or extraction logic has materially changed.** - -```gherkin -Given an existing populated cache for a repository -When a cache invalidation is requested -Then all cached extraction and aggregation results are cleared -And the next run processes every in-scope file from scratch +Given a generated wiki section containing factual assertions +When I inspect any claim in the narrative body +Then the claim is backed by one or more SourceRefs each identifying a repo-relative file path and line range +And a claim with no SourceRefs is explicitly marked as unsupported ``` --- -## Feature: Derivative Section Synthesis +## Feature: Quality Assurance -**As a Migration Architect, I want Gherkin-style user stories generated automatically from the extracted wiki, so that I can brief the delivery team with structured acceptance criteria without writing them by hand.** +### As a Quality Reviewer, I want sections evaluated against a structured rubric and revised when they fall below a quality threshold, so that the generated wiki 
meets a minimum standard before publication. ```gherkin -Given all primary wiki sections have been populated with evidence-backed content -When the derivative section synthesis stage runs for user stories -Then a set of Gherkin-style user stories is produced, grouped by feature -And each story is grounded only in capabilities and entities present in the primary sections -And if the required upstream primary sections are empty, a placeholder is written - declaring the gap rather than fabricating stories +Given the critic-and-reviser loop is enabled in settings +And a minimum quality threshold score is configured +When a section body is synthesized +Then the section is scored on a 0–10 rubric identifying unsupported claims and coverage gaps +And if the score falls below the configured threshold a revision pass is triggered +And the revised body is accepted only if it improves or matches the prior score +And if synthesis fails entirely raw notes are emitted directly preserving information at the cost of polish ``` -**As a Migration Architect, I want Mermaid architectural diagrams generated automatically from the extracted wiki, so that I have portable visual artefacts for briefing the delivery team.** +### As a Pipeline Operator, I want the critic-and-reviser loop to be disabled by default, so that generation time remains predictable in routine runs. 
```gherkin -Given all primary wiki sections have been populated with evidence-backed content -When the derivative section synthesis stage runs for architectural diagrams -Then valid Mermaid diagram markup is produced - reflecting entities and relationships found in the primary sections -And no diagram elements are introduced that are not supported by the primary section evidence +Given a pipeline run with default settings +When the pipeline generates and derives sections +Then the critic-and-reviser loop is not executed +And generation time is predictable +When the loop is explicitly enabled via settings +Then it is applied to all sections and is most beneficial for derivative sections where single-shot synthesis is most likely to stray from evidence ``` --- -## Feature: Interactive Knowledge Querying - -**As a Migration Developer, I want to ask targeted questions about a specific domain area through a conversational interface, so that I can find precise answers without reading multiple wiki sections sequentially.** - -```gherkin -Given a wiki that has been generated for a legacy repository -When I open an interactive chat session -Then the session is grounded in all meaningfully populated wiki sections -And placeholder or empty sections are excluded from the context -And I can conduct multi-turn exchanges with conversation history retained across turns -``` - -**As a Domain Knowledge Consumer, I want to interrogate system behaviour conversationally without needing a technical intermediary, so that I can validate or challenge the extracted domain model directly against my own business knowledge.** +## Feature: Coverage and Readiness Reporting -```gherkin -Given a populated wiki for a legacy system -When I open an interactive chat session and ask a question about system behaviour -Then the assistant responds using only information present in populated wiki sections -And the response is expressed in plain, domain-level language - without implementation-specific 
detail -And I can reset conversation history and begin a fresh line of questioning - within the same session while retaining the wiki context -``` - -**As a Migration Developer, I want to inspect which wiki sections are currently loaded as context in a chat session, so that I understand the boundaries of the assistant's knowledge before relying on its answers.** +### As a Pipeline Operator, I want a single-page readiness report listing per-section metrics, so that I can assess documentation completeness in an automated pipeline without requiring an AI provider. ```gherkin -Given an active interactive chat session -When I request a context introspection -Then the session reports which sections are currently loaded as grounding context -And sections that are empty or contain only placeholders are listed as excluded +Given a wiki with one or more generated sections +When the report command is run +Then a markdown table is produced listing each section with its contributing file count, finding count, body character length, quality score, and most prominent gap or unsupported claim +And the coverage portion of the report executes without an AI provider +And the report output is safe for automated pipelines ``` --- -## Feature: Quality Assurance Pass +## Feature: Incremental and Crash-Resumable Operation -**As a Wiki Operator, I want each synthesised section to be scored against its brief and the upstream evidence, so that I can identify sections containing unsupported claims before the wiki is shared with the team.** +### As a Pipeline Operator, I want unchanged files and sections to be served from cache on re-runs, so that the pipeline completes quickly when only a subset of files has changed. 
```gherkin -Given a synthesised wiki section and the upstream evidence it was derived from -When the critic-and-reviser pass runs -Then a quality score between 0 and 10 is produced for that section -And the critique itemises unsupported claims, gaps relative to the section brief, - and concrete revision suggestions +Given a repository that has been walked at least once with caching enabled +And some files have not changed since the last run +When the pipeline runs again +Then files whose content fingerprint matches the cache are skipped without re-extraction +And sections whose notes-payload hash matches the cache are served from cache without re-synthesis +And the cache is written after each individual file completes so that a mid-run crash leaves previously completed work intact ``` -**As a Wiki Operator, I want sections that score below a configurable threshold to be automatically revised, so that low-quality sections are improved before the wiki reaches the broader team.** +### As a Pipeline Operator, I want cache entries to be automatically invalidated when the cache format changes, so that obsolete data from a previous version never silently corrupts a rebuild. 
```gherkin -Given a synthesised section whose quality score falls below the configured threshold -When automatic revision is triggered -Then a revised body is produced informed by the critique's suggestions -And the revised section is accepted only if its score matches or improves on the original -And if the revision would regress the score, it is rejected - and the original body is retained +Given a pipeline upgrade that changes the internal cache format version +When the pipeline runs after the upgrade +Then the version number embedded in existing cache files is compared to the current version +And a mismatch triggers a clean rebuild discarding all stale entries +And cache files are written atomically so that a crash during persistence never leaves a corrupted cache ``` ---- - -## Feature: Coverage and Dead-Zone Reporting - -**As a Wiki Operator, I want a coverage report showing per-section file counts, finding counts, body sizes, and quality scores, so that I can assess the completeness of the wiki before distributing it.** +### As a Documentation Engineer, I want a force-reanalysis mode that drops all cached data, so that I can guarantee a completely fresh walk when cache state is suspect. 
```gherkin -Given a completed wiki-generation run -When the report command is invoked -Then a markdown table is produced listing every wiki section - with its contributing file count, finding count, body size, and emptiness status -And where a critic pass has been run, the quality score and highest-priority content gap - are included for each section +Given a pipeline run invoked with force-reanalysis mode enabled +When the pipeline begins +Then the on-disk cache is dropped entirely before any files are processed +And all files are re-extracted and all sections re-synthesized from scratch regardless of cached state ``` -**As a Wiki Operator, I want files that were processed but produced no findings to be surfaced as dead zones in the coverage report, so that I can identify blind spots in the analysis and decide whether to investigate them further.** +### As a Documentation Engineer, I want stale cache entries for removed files to be prunable in bulk, so that the cache does not accumulate entries for files that no longer exist in the repository. 
```gherkin -Given a repository walk in which some in-scope files yielded no findings for any section -When the coverage report is generated -Then a list of dead-zone files is included in the report -And those files are distinguished from files that were excluded during the classification stage +Given a repository from which one or more files have been deleted since the last walk +When the cache pruning operation is run +Then cache entries for files no longer present in the repository are removed +And remaining entries for current files are left intact ``` --- -## Feature: AI Backend Configuration +## Feature: Interactive Query Interface -**As a Wiki Operator, I want to select and override the active AI backend via an environment variable or a per-invocation flag, so that the pipeline can run in air-gapped or data-sensitive environments without requiring code changes.** +### As a Codebase Explorer, I want to open a conversational session grounded in the generated wiki, so that I can ask natural-language questions about the codebase without reading raw source files. 
```gherkin -Given a deployment environment that requires a privately hosted or on-premise AI backend -When the pipeline is invoked with a backend selection flag or the corresponding - environment variable set -Then all AI-assisted extraction, aggregation, derivation, and critic calls - are routed to the specified backend -And no modification to pipeline code or shared configuration files is required +Given a wiki with one or more sections containing meaningful content +When I open a conversational session +Then only sections with meaningful content are loaded as context +And placeholder sections are excluded from the context +And I can ask multi-turn questions drawing on the loaded wiki sections +And I can inspect which sections are currently loaded as context ``` ---- - -## Feature: Graceful Degradation - -**As a Wiki Operator, I want the pipeline to recover gracefully from AI synthesis failures on individual sections, so that a single failed section does not prevent the rest of the wiki from being produced.** +### As a Codebase Explorer, I want to reset the conversation history without losing the wiki context, so that I can start a fresh line of questioning without reloading the wiki. ```gherkin -Given a wiki section for which AI synthesis fails during aggregation -When the pipeline handles the failure -Then the raw collected notes for that section are emitted directly in the section body -And the error is surfaced inline in the section rather than silently suppressed -And all remaining sections continue to be generated normally +Given an active conversational session with accumulated conversation history +When I issue a history-reset command +Then the conversation history is cleared +And the frozen system prompt built from the wiki sections is retained +And subsequent questions are answered using the same wiki context without the prior exchanges ```