diff --git a/.wikifi/capabilities.md b/.wikifi/capabilities.md
index bf6f0a4..0612a31 100644
--- a/.wikifi/capabilities.md
+++ b/.wikifi/capabilities.md
@@ -1,32 +1,110 @@
 # Capabilities
-The system is an automated documentation generation tool that produces a comprehensive, technology-agnostic wiki from an existing codebase. Its core value proposition is making implicit knowledge explicit, surfacing hidden inconsistencies, and keeping documentation current with low marginal effort on subsequent runs.
+wikifi turns any codebase into a structured, evidence-backed wiki and keeps that wiki current as the codebase evolves — without requiring the development team to write documentation by hand.
-### Four-Stage Analysis Pipeline
+### Automated Repository Discovery
-Documentation is produced through a coordinated four-stage pipeline:
+The system autonomously examines a repository using only its directory structure and manifest file contents, classifying paths into those that contain meaningful production source and those that should be skipped. From this metadata alone it derives a one-paragraph summary of the system's apparent purpose, a list of primary languages in use, and a set of include/exclude patterns for the analysis that follows.
-1. **Repository Introspection** — The system examines the directory layout and manifest files of a target repository, classifying paths as worth analyzing (production source, business logic, domain models, integrations) or worth skipping (vendored dependencies, build output, test code, CI configuration). The classification is returned as a structured, diffable result.
+### Structured Documentation Generation
-2. **Per-File Extraction** — Each included file is analyzed to produce structured findings describing what it contributes to each wiki section. Large files are split into overlapping windows so no content is missed; adjacent windows share a configurable overlap region to preserve cross-boundary context. Findings are deduplicated across window boundaries so a single declaration is never counted twice. Each finding carries a precise citation (path and line range) so it can be traced later.
+A multi-stage pipeline walks every in-scope file, extracts domain-level findings, and synthesizes those findings into a canonical set of wiki sections.[4][5][6] Documentation is organized into two tiers:
-3. **Section Aggregation** — Findings collected from all files are synthesized into coherent markdown narratives for each primary wiki section. Every claim in the narrative is backed by numbered citations pointing to the originating files and line ranges.
+| Tier | Sections | Grounding |
+|---|---|---|
+| Primary | Domains, Intent, Capabilities, External Dependencies, Integrations, Cross-Cutting Concerns, Entities, Hard Specifications | Direct per-file evidence |
+| Derivative | User Personas, User Stories, Diagrams | Cross-file synthesis from aggregated primaries |
-4. **Derivative Synthesis** — User personas, scenario-based user stories, and architectural diagrams are generated from the finalized primary section bodies. If upstream sections are empty, the system writes a placeholder declaring the gap rather than fabricating content.
+Structured artifacts — API contracts, schema definitions, interface definition files, and database migration scripts — are processed through deterministic, type-aware extractors for higher reliability and speed. All other source files are analyzed through a general-purpose understanding pass. The cross-file reference graph is consulted per file so that findings can describe flows between modules rather than treating each file in isolation.
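Reviewer sketch (not part of the diff): the routing between deterministic extractors and the general-purpose pass described in the added "Structured Documentation Generation" text might look roughly like the following. All names here (`dispatch`, `extract_sql`, `extract_general`, `SPECIALIZED`) are hypothetical and chosen for illustration; the actual code in `wikifi/specialized/dispatch.py` is not shown in this diff and may differ.

```python
from pathlib import Path
from typing import Callable

# A finding is modelled as a plain dict here; this shape is an assumption.
Finding = dict[str, str]


def extract_sql(text: str) -> list[Finding]:
    """Hypothetical deterministic extractor: one finding per CREATE TABLE."""
    findings: list[Finding] = []
    for line in text.splitlines():
        words = line.strip().split()
        if len(words) >= 3 and [w.upper() for w in words[:2]] == ["CREATE", "TABLE"]:
            table = words[2].strip("(;")
            findings.append({"section": "entities", "text": f"Persisted entity: {table}"})
    return findings


def extract_general(text: str) -> list[Finding]:
    """Stand-in for the general-purpose understanding pass (no model call here)."""
    return [{"section": "capabilities", "text": f"Needs general analysis ({len(text)} chars)"}]


# Suffix-to-extractor table; only one entry is shown for brevity.
SPECIALIZED: dict[str, Callable[[str], list[Finding]]] = {".sql": extract_sql}


def dispatch(path: str, text: str) -> list[Finding]:
    """Route a file to a type-aware extractor, else fall back to the general pass."""
    extractor = SPECIALIZED.get(Path(path).suffix.lower(), extract_general)
    return extractor(text)
```

Keeping the routing in a lookup table means a new structured format can be supported by adding one entry, while unknown file types still flow through the general pass.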
-### Structured Documentation Output
+### Citation Traceability and Contradiction Surfacing
-The wiki is organized into **eight primary sections** — business domains, system intent, capabilities, external dependencies, integrations, cross-cutting concerns, core entities, and hard specifications — plus three **derivative sections** (personas, scenario-based user stories, and architectural diagrams). Derivative sections are only generated after all primary sections are finalized and declare which primaries they depend on.
+Every claim in the generated wiki is anchored to the specific source file and line range from which it was derived, rendered as inline citation markers linked to a numbered source footer. Where two or more source locations assert incompatible things about the same topic, the system surfaces an explicit "Conflicts in source" block rather than silently reconciling the disagreement — giving downstream readers an honest view of ambiguity.
-### Schema-Aware Extraction
+### Incremental Re-Analysis
-Files recognized as data-definition schemas, API contract specifications, interface definition files, or schema migration scripts are processed by deterministic, format-specific extractors rather than the general AI path. This improves both accuracy and cost for schema-heavy artifacts. These extractors surface:
+A content-addressed record of prior analysis results is maintained at multiple independent granularities: per file, per section, and per repository scope. On a subsequent run, only files whose content has changed are re-processed; sections whose evidence base is entirely unchanged are served from cache without additional analysis work. When only a small subset of findings in a section changes, the system performs targeted in-place edits to preserve established prose rather than rewriting from scratch, retaining citation numbering and unaffected paragraphs verbatim. A re-run in which nothing has changed is a complete no-op.
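Reviewer sketch (not part of the diff): the per-file half of the content-addressed re-analysis described in the added "Incremental Re-Analysis" text can be illustrated as below. The function names and the cache shape are hypothetical, and SHA-256 is an assumption; the diff only states that the fingerprint is a 12-character prefix of a cryptographic hash.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """12-character hex prefix of a SHA-256 digest (hash choice is assumed)."""
    return hashlib.sha256(content).hexdigest()[:12]


def plan_rerun(files: dict[str, bytes], cache: dict[str, str]) -> tuple[list[str], dict[str, str]]:
    """Return (paths needing re-analysis, refreshed cache).

    `cache` maps path -> fingerprint recorded on the previous run. Paths that
    are no longer in `files` are dropped from the refreshed cache.
    """
    changed: list[str] = []
    refreshed: dict[str, str] = {}
    for path, content in sorted(files.items()):
        digest = fingerprint(content)
        refreshed[path] = digest
        if cache.get(path) != digest:
            changed.append(path)
    return changed, refreshed


files = {"a.py": b"x", "b.py": b"y"}
first, cache = plan_rerun(files, {})      # cold run: every file is new
second, cache = plan_rerun(files, cache)  # nothing changed: no work planned
files["b.py"] = b"z"
third, cache = plan_rerun(files, cache)   # only the edited file is re-processed
```

Because the key is derived from file content rather than timestamps, touching a file without changing its bytes does not trigger re-analysis.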
-- **Data schema files** — persisted entities with their columns, foreign-key relationships, uniqueness and nullability invariants, index definitions, and ALTER-based schema additions distinguished from the original baseline.
-- **API contract files** — the full list of endpoints (operation, path, summary), named request/response models, and aggregate endpoint and model counts.
-- **Interface definition files** — message types, closed-value enum types, named services, and their remote procedure calls including streaming legs.
-- **Query/mutation schema files** — every operation root field, including fields defined across multiple composed schema files, so capabilities spread across modular definitions are all surfaced.
+### Quality Review
-### Conflict Detection
+An optional critic-and-reviser cycle can be applied to any wiki section: the body is scored against a structured rubric, weaknesses are identified, a revised body is produced, and the better version is kept. This loop is considered the highest-leverage quality lever for derivative outputs such as personas and user stories. A separate coverage and quality report summarizes, per section, how many files contributed findings and how complete the analysis was; when the critic is enabled, per-section quality scores and unsupported-claim flags are included alongside an overall mean score.
-When source files contain incompatible assertions about the same domain topic, the system surfaces the disagreement explicitly in a dedicated
+### Interactive Exploration
+
+Once a wiki has been generated, users can open an interactive conversational session grounded in the wiki content. Questions can span multiple turns, conversation history can be reset while retaining the wiki context, and currently loaded sections can be listed at any time. Only fully populated sections are included in the assistant's context, preventing placeholder content from diluting answers.
+
+### Resilience and Graceful Degradation
+
+If synthesis of a section fails at runtime, the raw extracted notes are preserved in the output so the wiki always contains some content rather than a blank section. Files that cannot be parsed — for example, malformed API contract files — are flagged for manual review rather than causing processing to halt entirely. Cache state is persisted after each file so that a crash at any stage resumes from exactly the last completed file rather than restarting from scratch.
+
+## Supporting claims
+- The system autonomously classifies a repository's directory tree into paths worth analysing and paths to skip, using only directory metadata and manifest file contents. [1][2][3]
+- From directory metadata alone the system derives a one-paragraph summary of the system's apparent purpose and identifies primary languages in use. [1]
+- Documentation is organized into eight primary sections grounded in direct per-file evidence and three derivative sections synthesized from the aggregated primaries. [6]
+- Primary sections cover: domains, intent, capabilities, external dependencies, integrations, cross-cutting concerns, entities, and hard specifications. [6]
+- Derivative sections cover user personas, user stories, and diagrams, and require cross-file synthesis to avoid speculation. [6][7]
+- Structured artifacts such as API contracts, schema definitions, interface definition files, and database migration scripts are processed through deterministic, type-aware extractors rather than the general understanding pass. [8][5][9][10][11][12][13]
+- The cross-file reference graph is consulted per file so findings can describe flows between modules rather than treating each file in isolation. [14][15][16]
+- Every claim in the generated wiki is anchored to the originating source file and line range, rendered as inline citation markers linked to a numbered source footer. [17][18]
+- Where two or more sources assert incompatible things about the same topic, the system surfaces an explicit 'Conflicts in source' block rather than silently reconciling the disagreement. [19][17]
+- A content-addressed record of prior analysis results is maintained at multiple independent granularities — per file, per section, and per repository scope — enabling incremental re-runs. [20][21]
+- On a subsequent run, only files whose content has changed are re-processed; unchanged sections are served from cache without additional analysis work. [21][22]
+- When only a small subset of findings in a section changes, the system performs targeted in-place edits to preserve established prose rather than rewriting from scratch. [23][24][25][26][27]
+- A re-run in which nothing has changed is a complete no-op — stages 3 and 4 are skipped entirely. [28][22]
+- An optional critic-and-reviser cycle scores any section body against a structured rubric, produces a revised body, and keeps the better version. [29]
+- The critic-and-reviser loop is considered the highest-leverage quality lever for derivative outputs such as personas and user stories. [30]
+- A coverage and quality report summarizes, per section, how many files contributed findings; when the critic is enabled it also includes per-section quality scores and unsupported-claim flags. [31][32][33][34]
+- Users can open an interactive conversational session grounded in the generated wiki content, spanning multiple turns, with the ability to reset conversation history while retaining wiki context. [35][36]
+- Only fully populated wiki sections are included in the conversational assistant's context, preventing placeholder content from diluting answers. [37]
+- If synthesis of a section fails, the raw extracted notes are preserved in the output so the wiki always contains some content. [38]
+- Files that cannot be parsed are flagged for manual review rather than halting processing entirely. [39]
+- Cache state is persisted after each file so that a crash at any stage resumes from the last completed file rather than restarting from scratch. [20][21]
+
+## Conflicts in source
+_The walker found disagreements across files. Migration teams should resolve these before re-implementation._
+
+- **The number of distinct routing paths for section re-synthesis is described inconsistently across notes: one describes four paths (full cache hit, finding-set unchanged but metadata shifted, surgical edit, full rewrite) while another describes three paths (full cache hit, surgical edit, full rewrite).**
+  - Four prioritized paths govern section production: full cache hit, finding-set unchanged but notes metadata shifted (refresh cache key, no LLM call), surgical edit for low-churn deltas, and full rewrite for high churn or no prior body. (`wikifi/aggregator.py:130-175`)
+  - Three routing paths exist: full cache hit (no work), surgical edit (append or remove specific claims), and full rewrite (when changes are too broad). (`wikifi/cache.py:100-116`)
+
+## Sources
+1. `wikifi/introspection.py:55-70`
+2. `wikifi/introspection.py:86-117`
+3. `wikifi/walker.py:89-115`
+4. `wikifi/orchestrator.py:57-100`
+5. `wikifi/extractor.py:1-20`
+6. `wikifi/sections.py:41-136`
+7. `wikifi/deriver.py:80-135`
+8. `wikifi/config.py:114-119`
+9. `wikifi/specialized/__init__.py:1-12`
+10. `wikifi/specialized/dispatch.py:43-60`
+11. `wikifi/specialized/models.py:1-9`
+12. `wikifi/specialized/sql.py:63-70`
+13. `wikifi/specialized/sql.py:56-62`
+14. `wikifi/config.py:107-113`
+15. `wikifi/extractor.py:242-244`
+16. `wikifi/repograph.py:163-230`
+17. `wikifi/evidence.py:95-145`
+18. `wikifi/extractor.py:254-268`
+19. `wikifi/aggregator.py:1-16`
+20. `wikifi/cache.py:5-35`
+21. `wikifi/extractor.py:193-215`
+22. `wikifi/orchestrator.py:175-205`
+23. `wikifi/aggregator.py:130-175`
+24. `wikifi/config.py:131-148`
+25. `wikifi/surgical.py:148-193`
+26. `wikifi/surgical.py:196-234`
+27. `wikifi/surgical.py:237-284`
+28. `wikifi/cache.py:22-30`
+29. `wikifi/critic.py:4-10`
+30. `wikifi/deriver.py:115-130`
+31. `wikifi/report.py:4-8`
+32. `wikifi/report.py:8-11`
+33. `wikifi/report.py:43-70`
+34. `wikifi/critic.py:213-232`
+35. `wikifi/chat.py:95-134`
+36. `wikifi/cli.py:63-240`
+37. `wikifi/chat.py:68-83`
+38. `wikifi/aggregator.py:290-302`
+39. `wikifi/specialized/openapi.py:27-41`
+40. `wikifi/cache.py:100-116`
diff --git a/.wikifi/cross_cutting.md b/.wikifi/cross_cutting.md
index 70d6f75..83300e0 100644
--- a/.wikifi/cross_cutting.md
+++ b/.wikifi/cross_cutting.md
@@ -1,169 +1,164 @@
 # Cross-Cutting Concerns
-## Observability
+### Observability
-Logging is a first-class, globally applied concern. At startup, a verbosity flag controls whether the runtime emits debug-level or INFO-level output; this choice is applied before any pipeline stage executes, so every downstream component inherits a consistent logging posture. Structured log lines are emitted at the entry point of each pipeline stage, giving operators a reliable breadcrumb trail across introspection, extraction, aggregation, and derivation.
+Structured logging is enabled globally across all pipeline stages through a single startup hook, with a verbosity flag that switches between standard and debug output levels. Each subsystem registers its own log namespace (conversation, reporting, derivation, and others), so failures are attributable to their origin. Provider-level errors are normalized by a shared formatting helper that includes the vendor-assigned request identifier when available, ensuring consistent and traceable diagnostics regardless of which backend is active.
-Diagnostic quality is also enforced at the inference boundary: when a model call produces no usable output, the system surfaces the stop reason, token counts, and the configured budget, together with actionable remediation hints. All backend providers share a single error-formatting routine that extracts any vendor-issued request identifier when present, ensuring failure messages are uniformly attributable regardless of which backend is in use.
+Token-budget issues are diagnosed explicitly at the provider level: when a structured response comes back empty, the system surfaces the stop reason, output token count, and the active limit, together with hints that distinguish budget exhaustion from a model refusal.
-## Graceful Degradation
+The quality-reporting command is strictly read-only and never modifies the wiki, making it safe to invoke repeatedly in automated pipelines or monitoring hooks without risk of data mutation.
-The pipeline is designed so that no single failure propagates to a total abort:
+---
-- **Per-chunk and per-file extraction failures** are isolated; a failed chunk does not stop the rest of the file, and a failed file does not stop the walk. Single-chunk files that fail entirely are counted as skipped rather than silently lost.
-- **Aggregation failures** are logged at WARNING level and result in a fallback body that preserves the raw notes verbatim, guaranteeing that every section is written.
-- **Derivation failures** likewise produce a fallback body that retains upstream evidence, keeping the wiki inspectable after partial failures.
-- **Critic and reviser failures** degrade gracefully by returning the original body and recording a score of zero with a diagnostic message.
-- **Cache I/O failures** (missing files, malformed records, bad individual entries) are logged as warnings and fall back to an empty cache, preserving pipeline continuity.
-- **Chat-session inference failures** are surfaced as inline messages rather than terminating the interactive session.
+### Resilience and Graceful Degradation
-A per-request timeout is enforced on all inference calls, preventing indefinite hangs during long reasoning passes.
+The pipeline is designed to produce output under partial failure at every stage:
-## Data Integrity and Provenance
+- **Synthesis failures** during aggregation are caught and logged; affected sections degrade to raw-note preservation rather than raising an exception.
+- **Critic and reviser failures** fall back to a zero-score sentinel and the original body, respectively; a score-regression guard prevents a poorly-guided rewrite from replacing better-quality existing content.
+- **Per-chunk extraction failures** are logged as warnings; the file continues with whatever chunks succeeded, and a fully failed file increments a counter rather than being silently dropped.
+- **Configuration parse failures** are logged as warnings and the system falls back gracefully to environment-derived settings.
+- **Provider failures during interactive sessions** are reported to the user without terminating the session.
-Full source provenance is a non-negotiable invariant. Every finding emitted by any extraction path carries the relative file path, an absolute line range (chunk-relative ranges are translated to file-absolute positions before storage), and a content fingerprint. This citation chain is preserved through caching and replay, so the aggregation stage can always render accurate source links.
+Incremental cache persistence — a per-section callback invoked after each successfully aggregated section — converts a mid-stage crash or out-of-memory event into a survivable event, preserving all aggregation progress up to the last completed section.
-Contradictions discovered in source material are **never silently resolved**. They are always surfaced explicitly in the output so that incompatibilities visible in the source are escalated to the migration team rather than hidden.
+---
-Finding deduplication is enforced within a single file across all its chunks, preventing overlapping analysis windows from producing duplicate wiki entries.
+### Data Integrity and Evidence Traceability
-### Hallucination Prevention
+Every finding produced by the pipeline carries a structured source reference containing the originating file path, an optional line range, and a content fingerprint captured at extraction time. This fingerprint is a 12-character prefix of a cryptographic hash — providing 48 bits of entropy, sufficient to be collision-resistant across any realistic repository — and allows any claim to be re-verified against the original source after a subsequent repository walk.
-Grounding in real evidence is enforced at multiple levels:
+Deduplication is enforced within each file: identical (section, finding-text) pairs arising from overlapping processing chunks are discarded before being written to the notes store, preventing count inflation.
-| Level | Mechanism |
-|---|---|
-| Prompt contract | The assistant is instructed to acknowledge explicitly when the wiki does not cover a topic rather than extrapolating.[15] |
-| Output terminology | The synthesiser is instructed to translate all observations into domain terms, never naming implementation-specific artefacts.[16] |
-| Derivative sections | A heuristic filters placeholder bodies before they can be treated as real upstream evidence; an optional critic-and-reviser loop then scores and revises each derivative section before it is written. |
-| Placeholder detection | A fixed set of sentinel strings identifies unpopulated sections; these are excluded from quality scoring and cannot seed downstream derivation. |
+Contradictions between source files are never silently merged. They are surfaced as high-priority signals in the rendered output, on the explicit rationale that legacy systems routinely hide tribal knowledge in disagreements; making contradictions visible is therefore a non-functional invariant of the documentation pipeline.
-Revision events are counted in run statistics so operators can observe how often the quality loop intervenes.
+Introspection-stage responses are validated against a strict schema, making that stage's output deterministic to validate and straightforward to diff between successive runs — a consistency guarantee that carries forward into extraction.
-## Caching and Cache Integrity
+After incremental (surgical) edits, citation integrity is actively maintained: cached claims are preserved or selectively dropped, new claims carry resolved source references, and contradiction records are fully replaced rather than partially updated. A companion content-integrity invariant requires that any sentence in the cached body not affected by a changed finding must appear verbatim in the revised output.
-Content fingerprints (short, stable hashes derived from raw bytes rather than decoded text) serve three cross-cutting roles simultaneously: keying the extraction and aggregation caches, anchoring source-evidence citations, and tracking file identity inside the dependency graph.
+A heuristic scans upstream section bodies for known sentinel strings before synthesis begins, preventing derivative sections from being generated from boilerplate or placeholder content and guarding against fabricated findings.
-The aggregation cache key is designed to be maximally conservative: it includes each referenced source file's fingerprint **and** its line range alongside the finding text. This means that when any referenced file changes, the cache misses and citations are re-derived from fresh evidence rather than replayed from stale data.
+All per-file extraction notes are persisted with a UTC timestamp on every record, providing a lightweight audit trail of when each finding was recorded.
-Cache files are stored in a designated subdirectory co-located with the wiki output and governed by the same repository-ignore rules, keeping them out of version control without extra configuration. When caching is explicitly disabled, the cache store is reset at run start to prevent stale data from influencing results; when caching is enabled, entries for files no longer in scope are pruned to keep cache size proportional to the live working set.
+A startup validation routine enforces referential integrity across the section dependency graph, raising a descriptive error if any cross-reference names an unknown or out-of-order section. This check runs at module load time so misconfiguration is detected before any pipeline work begins.
-Operators can also force a full cache invalidation at the command line to guarantee a fresh analysis after source changes.
+---
-## Configuration Safety
+### Cache and Storage Integrity
-Configuration resolution follows a strict, documented precedence order: a project's own configuration file overrides process-wide environment variables, which override compiled-in defaults. This ensures that a wiki initialised for a specific backend continues to use that backend regardless of the operator's shell environment.
+All cache writes are atomic: content is first written to a sibling temporary file and then renamed into place, so a crash during writing leaves the previous valid cache file intact. Cache files are stored in a dedicated subdirectory co-located with the wiki output; a programmatically managed ignore file ensures that only the final section markdown is committed to source control, while intermediate extraction notes and the cache are excluded. Missing or corrupt cache files are silently treated as empty (a fresh start), and malformed individual entries are logged as warnings and dropped.
-To prevent accidental side-effects, only a small, named set of fields (backend provider, model identifier, and local service host) may be overridden by a project-level configuration file. All other settings are controlled exclusively at the process level, so stale or hand-edited project configs cannot silently alter pipeline behaviour the user did not intend to change.
+A monotonically incremented schema version gates all cache reads: any cache file written under an older version is rejected entirely and treated as empty, forcing a clean re-run rather than consuming structurally incompatible data. Cache entries for files that leave scope are pruned at the start of each walk to prevent unbounded growth.
-## Authentication and Authorization
+Derivation cache entries are keyed by both the hash of upstream section bodies and a flag recording whether the critic-review loop ran; entries from non-reviewed runs are not reused when a reviewed run is requested, preserving quality parity across run modes.
-Authentication schemes declared in API contract files are captured as explicit findings during extraction. The scheme categories recorded include API-key, bearer-token, and OAuth flows. These are surfaced as migration-relevant intelligence so that the target system's authentication implementation can be verified against the source contract.
+The short-circuit predicate that skips aggregation and derivation enforces five simultaneous conditions: caching must be enabled, at least one file must have been processed, all files must have been cache hits or handled by deterministic extractors, the introspection scope hash must match the prior walk's, and every aggregation and derivation cache entry must be fresh. This prevents a mid-run crash from permanently locking in stale sections. Introspection results are persisted to disk before extraction begins so that this predicate remains evaluable even after a crash.
-## Storage Invariants
+For structured-output inference calls the temperature is pinned to zero, ensuring identical inputs produce identical outputs across runs. Free-text and conversational calls use the model's default temperature and may therefore vary between runs.
-Schema analysis produces structured findings that the migration team must honour in the target system:
+Prompt caching at the service level — achieved by placing the large, repeated system prompt in a predictable position in every request — reduces the cost and latency of the many per-file calls that make up a single pipeline run.
-- **Uniqueness and nullability constraints** on any table are surfaced as explicit storage invariants.
-- **Index definitions** are flagged as query-time performance invariants annotated as requirements the new system must preserve.
-- **Migration history** is tracked separately from baseline schema definitions: tables touched by incremental schema-change operations are counted distinctly from those established in the original schema, ensuring that alter-only migrations are not misleadingly reported as empty.
+---
-## Input Filtering
+### Configuration Integrity
-To prevent speculative or low-quality inference, the pipeline applies size guards before any file reaches an analysis model:
+Target-specific configuration overrides are allowlisted to a small set of fields; any additional or unrecognized fields in a hand-edited or stale configuration file are silently ignored to prevent unintended behavior changes. Configuration parse failures are logged as warnings and the system falls back to environment-derived settings rather than crashing. A process-wide settings singleton is used to avoid redundant reads, with an explicit invalidation mechanism to support isolated test execution when environment variables are mutated between cases.
-- Files exceeding a fixed size threshold are dropped on the assumption that they are vendored, generated, or binary assets.
-- Files whose text content is shorter than a minimum byte threshold are also dropped to prevent hallucinated findings on effectively empty inputs.
-- Manifest files read for structural context are hard-truncated at a maximum byte count with a visible marker, keeping prompt payloads bounded.
+---
-Directory traversal prunes excluded subtrees before descending, so exclusion patterns are applied efficiently at the directory level rather than file-by-file.
+### Authentication and Authorization
-## Audit Trail
+Authentication and authorization contracts are extracted from API specification files: the set of scheme types (bearer tokens, API keys, OAuth flows, and similar mechanisms) that external consumers must present to access the system is surfaced as explicit wiki findings. This ensures that access-control requirements are visible to migration teams as first-class documented facts rather than being implicit in raw specification files.
-Every per-file extraction record is stamped with a UTC timestamp at write time, providing an audit trail of when each finding was recorded. The repository-ignore file for the wiki directory is actively maintained: any required exclusion entries missing from an older configuration are backfilled on each initialisation run, preventing local working state from surfacing as untracked changes after a tool upgrade.
+---
-## Structured Output Determinism
+### Data-Quality Invariants
-All structured-output inference calls use a fixed temperature of zero, ensuring that the same input always produces the same structured result across runs. This is treated as a non-negotiable invariant for the extraction path. Model output is additionally constrained to a strict schema, enabling deterministic parsing and straightforward diffing between runs. Both LLM-based and specialised (deterministic) extractors write to the same notes store under the same contract, so the downstream aggregation pipeline is agnostic to which extraction path produced a given finding.
+File selection applies a cheapest-first filter chain (path pattern → size check → content read) that enforces a minimum content threshold before any analysis pass runs. This threshold is explicitly motivated by inference model behaviour — preventing speculative or runaway outputs on near-empty files — making it a data-quality invariant that must be preserved through any migration of the analysis pipeline.
+
+Relational schema extractors capture UNIQUE and NOT NULL constraints as storage invariants and indexes as performance invariants, both annotated with the explicit requirement that they must survive migration to the target system.
+
+The tech-agnostic output invariant — that synthesized prose must express all observations in domain terms and must never name specific technologies — is enforced both in the inference system prompt and as a stated structural design goal of the incremental update subsystem.
 ## Supporting claims
-- Logging verbosity is controlled by a runtime flag applied globally before any pipeline stage executes. [1]
-- Structured log lines are emitted at the entry point of each pipeline stage. [2]
-- When a model call produces no usable output, the system surfaces the stop reason, token counts, and configured budget with actionable remediation hints. [3]
-- All backend providers share a single error-formatting routine that extracts any vendor-issued request identifier, ensuring uniform failure attribution. [4]
-- Per-chunk and per-file extraction failures are isolated; a failed chunk does not stop the rest of the file, and a failed file does not stop the walk. [5]
-- Aggregation failures are logged at WARNING level and produce a fallback body that preserves the raw notes verbatim. [6]
-- Derivation failures produce a fallback body retaining upstream evidence verbatim. [7]
-- Critic and reviser failures degrade gracefully by returning the original body and recording a score of zero with a diagnostic message. [8]
-- Cache I/O failures are logged as warnings and fall back to an empty cache, preserving pipeline continuity. [9]
-- Chat-session inference failures are surfaced as inline messages rather than terminating the interactive session. [10]
-- A per-request timeout is enforced on all inference calls to prevent indefinite hangs. [11]
-- Every finding carries the relative file path, absolute line range, and content fingerprint as source provenance. [12]
-- Contradictions discovered in source material are never silently resolved; they are always surfaced explicitly. [13]
-- Finding deduplication is enforced within a single file across all its chunks. [14]
-- A heuristic filters placeholder bodies before they can be treated as real upstream evidence, and an optional critic-and-reviser loop scores and revises each derivative section. [17]
-- A fixed set of sentinel strings identifies unpopulated sections, which are excluded from quality scoring and cannot seed downstream derivation. [18][19]
-- Revision events are counted in run statistics for observability. [17]
-- Content fingerprints are derived from raw bytes rather than decoded text, ensuring consistent hashing regardless of encoding assumptions. [20]
-- Content fingerprints serve three cross-cutting roles: cache keying, source-evidence citation anchoring, and dependency-graph invalidation. [21]
-- The aggregation cache key includes each referenced source file's fingerprint and line range, so any change causes a cache miss and fresh re-derivation. [22]
-- Cache files are stored in a designated subdirectory co-located with wiki output and governed by repository-ignore rules. [23]
-- When caching is disabled the cache store is reset at run start; when enabled, out-of-scope entries are pruned. [24]
-- Operators can force full cache invalidation at the command line. [25]
-- Configuration resolution follows a strict precedence: project config file overrides environment variables, which override compiled-in defaults. [26]
-- Only a small named set of fields may be overridden by a project-level configuration file; all other settings are process-level only. [27]
-- Authentication scheme categories declared in API contract files are captured as explicit findings. [28]
-- Uniqueness and nullability constraints are surfaced as explicit storage invariants the target system must honour. [29]
-- Index definitions are flagged as query-time performance invariants the new system must preserve. [29]
-- Migration history is tracked separately from baseline schema, ensuring alter-only migrations are not reported as touching zero tables. [30]
-- Files exceeding a maximum size threshold are dropped; files shorter than a minimum byte threshold are also dropped to prevent hallucinated findings. [31]
-- Manifest files read for structural context are hard-truncated at a maximum byte count with a visible marker. [32]
-- Directory traversal prunes excluded subtrees before descending, applying exclusion patterns at the directory level. [33]
-- Every per-file extraction record is stamped with a UTC timestamp at write time. [34]
-- The repository-ignore file for the wiki directory is backfilled with any missing required entries on each initialisation run. [35]
-- All structured-output inference calls use a fixed temperature of zero, ensuring deterministic structured results across runs. [36]
-- Model output is constrained to a strict schema enabling deterministic parsing and diffing between runs. [37]
-- Both LLM-based and specialised extractors write to the same notes store under the same contract, making the aggregation pipeline agnostic to extraction path. [38]
-- API errors from hosted backends are caught and re-raised as a uniform internal error type, preventing provider-specific error shapes from leaking into the orchestration layer. [39][40]
+- Structured logging is enabled globally through a single startup hook with a verbosity flag; each subsystem registers its own log namespace.
[1][2][3][4] +- Provider-level errors are normalized by a shared formatting helper that includes the vendor-assigned request identifier when available. [5] +- Token-budget issues are diagnosed explicitly: stop reason, output token count, active limit, and disambiguation hints are surfaced. [6] +- The quality-reporting command is strictly read-only and never modifies the wiki. [7] +- Synthesis failures during aggregation degrade to raw-note preservation rather than raising an exception. [8] +- Critic and reviser failures fall back to a zero-score sentinel and the original body respectively; a score-regression guard is present. [9] +- Per-chunk extraction failures are logged as warnings and do not abort the file walk; fully failed files increment a counter. [10] +- Configuration parse failures are logged as warnings and the system falls back to environment-derived settings. [11] +- Provider failures during interactive sessions are reported to the user without terminating the session. [12] +- Incremental cache persistence via a per-section callback makes mid-stage crashes survivable by preserving all aggregation progress up to the last completed section. [13] +- Every finding carries a structured source reference with file path, optional line range, and a content fingerprint captured at extraction time. [14][15][16][17] +- The content fingerprint is a 12-character prefix of a cryptographic hash, providing 48 bits of entropy and collision resistance across any realistic repository. [18][19] +- Deduplication of findings within each file is enforced by tracking (section, finding-text) pairs across all processing chunks. [20] +- Contradictions between source files are surfaced as high-priority signals and never silently merged; their visibility is a stated non-functional invariant. [21] +- Introspection-stage LLM responses are validated against a strict schema, making that output deterministic to validate and easy to diff between runs. 
[22] +- After surgical edits, citation integrity is maintained by re-anchoring references; a content-integrity invariant requires unchanged sentences to appear verbatim. [23][24] +- A heuristic scans upstream bodies for sentinel strings to prevent synthesis from placeholder or boilerplate content. [25] +- All per-file extraction notes are persisted with a UTC timestamp on every record. [26] +- A startup validation routine enforces referential integrity of the section dependency graph at module load time. [27] +- All cache writes are atomic: content is written to a sibling temporary file and then renamed into place. [28] +- Cache files are stored in a dedicated subdirectory co-located with the wiki; only final section markdown is committed to source control via a programmatically managed ignore file. [29][30] +- A monotonically incremented schema version gates all cache reads; older versions are rejected entirely and treated as empty. [31] +- Cache entries for files that leave scope are pruned at the start of each walk. [32] +- Derivation cache entries are keyed by upstream section body hash and a critic-review flag; non-reviewed entries are not reused for reviewed runs. [33] +- The short-circuit predicate enforces five simultaneous conditions before skipping aggregation and derivation stages. [34][35] +- Introspection results are persisted to disk before the extraction stage begins so the short-circuit predicate remains evaluable after a crash. [36] +- For structured-output inference calls the temperature is pinned to zero; free-text and conversational calls may vary between runs. [37] +- Prompt caching at the service level is achieved by placing the system prompt in a predictable position in every request, reducing cost and latency across many per-file calls. [38][39] +- Target-specific configuration overrides are allowlisted; unknown fields are silently ignored and parse failures fall back to environment-derived settings. 
[40][11] +- A process-wide settings singleton includes an explicit invalidation mechanism to support isolated test execution. [41] +- Authentication and authorization contracts are extracted from API specification files, surfacing scheme types as explicit wiki findings. [42] +- File selection applies a cheapest-first filter chain enforcing a minimum content threshold as a data-quality invariant motivated by inference model behaviour. [43] +- Relational schema extractors capture UNIQUE and NOT NULL constraints as storage invariants and indexes as performance invariants, both annotated with a migration-preservation requirement. [44] +- The tech-agnostic output invariant — no technology names in synthesized prose — is enforced in the inference system prompt and as a structural design goal. [45] ## Sources -1. `wikifi/cli.py:53-61` -2. `wikifi/orchestrator.py:103-145` -3. `wikifi/providers/anthropic_provider.py:255-275` -4. `wikifi/providers/base.py:54-63` -5. `wikifi/extractor.py:192-198` -6. `wikifi/aggregator.py:143-152` -7. `wikifi/deriver.py:96-107` -8. `wikifi/critic.py:158-165` -9. `wikifi/cache.py:162-168` -10. `wikifi/chat.py:120-125` -11. `wikifi/config.py:56-57` -12. `wikifi/extractor.py:240-256` -13. `wikifi/evidence.py:1-18` -14. `wikifi/extractor.py:224-234` -15. `wikifi/chat.py:27-31` -16. `wikifi/aggregator.py:54-67` -17. `wikifi/deriver.py:110-135` -18. `wikifi/deriver.py:118-135` -19. `wikifi/report.py:103-108` -20. `wikifi/fingerprint.py:44-50` -21. `wikifi/fingerprint.py:1-18` -22. `wikifi/cache.py:243-255` -23. `wikifi/cache.py:18-20` -24. `wikifi/orchestrator.py:108-116` -25. `wikifi/cli.py:107-110` -26. `wikifi/config.py:28-44` -27. `wikifi/config.py:161-167` -28. `wikifi/specialized/openapi.py:118-126` -29. `wikifi/specialized/sql.py:100-121` -30. `wikifi/specialized/sql.py:123-130` -31. `wikifi/walker.py:61-79` -32. `wikifi/walker.py:220-231` -33. `wikifi/walker.py:133-143` -34. `wikifi/wiki.py:136-141` -35. `wikifi/wiki.py:103-126` -36. 
`wikifi/providers/ollama_provider.py:58-68` -37. `wikifi/introspection.py:5-9` -38. `wikifi/specialized/models.py:4-8` -39. `wikifi/providers/anthropic_provider.py:119-127` -40. `wikifi/providers/openai_provider.py:128-135` +1. `wikifi/chat.py:22` +2. `wikifi/cli.py:52-62` +3. `wikifi/deriver.py:88-95` +4. `wikifi/report.py:22` +5. `wikifi/providers/base.py:59-68` +6. `wikifi/providers/anthropic_provider.py:250-275` +7. `wikifi/report.py:13-15` +8. `wikifi/aggregator.py:195-204` +9. `wikifi/critic.py:128-145` +10. `wikifi/extractor.py:271-282` +11. `wikifi/config.py:223-227` +12. `wikifi/chat.py:126-132` +13. `wikifi/aggregator.py:162-170` +14. `wikifi/evidence.py:41-46` +15. `wikifi/extractor.py:254-265` +16. `wikifi/specialized/graphql.py:54-68` +17. `wikifi/specialized/protobuf.py:13-14` +18. `wikifi/fingerprint.py:1-17` +19. `wikifi/fingerprint.py:22-24` +20. `wikifi/extractor.py:249-253` +21. `wikifi/evidence.py:73-79` +22. `wikifi/introspection.py:43-62` +23. `wikifi/surgical.py:46-67` +24. `wikifi/surgical.py:196-234` +25. `wikifi/deriver.py:186-200` +26. `wikifi/wiki.py:135-142` +27. `wikifi/sections.py:143-156` +28. `wikifi/cache.py:298-302` +29. `wikifi/cache.py:36-42` +30. `wikifi/wiki.py:102-128` +31. `wikifi/cache.py:54-68` +32. `wikifi/orchestrator.py:120-140` +33. `wikifi/deriver.py:95-115` +34. `wikifi/aggregator.py:305-328` +35. `wikifi/orchestrator.py:243-300` +36. `wikifi/orchestrator.py:144-152` +37. `wikifi/providers/ollama_provider.py:59-69` +38. `wikifi/providers/anthropic_provider.py:195-211` +39. `wikifi/providers/openai_provider.py:8-13` +40. `wikifi/config.py:196-200` +41. `wikifi/config.py:185-193` +42. `wikifi/specialized/openapi.py:112-121` +43. `wikifi/walker.py:89-115` +44. `wikifi/specialized/sql.py:100-121` +45. 
`wikifi/surgical.py:57-59` diff --git a/.wikifi/domains.md b/.wikifi/domains.md index 0330a14..cf1db4c 100644 --- a/.wikifi/domains.md +++ b/.wikifi/domains.md @@ -1,56 +1,62 @@ # Domains and Subdomains -## Core Domain +The system is organized around a single **core domain**: automated codebase documentation — the systematic transformation of a living software repository into a structured, technology-agnostic wiki that is kept coherent as the codebase evolves. -The system's core domain is **automated documentation synthesis**: ingesting an arbitrary source repository and producing a structured, intent-bearing wiki that describes the codebase in technology-agnostic terms. The central concern is not the mechanics of reading files, but the act of surfacing *business intent* — distinguishing what a system does from the accidental details of how it is implemented. +Within that core domain, five supporting subdomains have been identified. -## Subdomains +--- -### Repository Introspection -Before any analysis begins, the system must decide which parts of a repository carry production intent and which represent infrastructure, tooling, or generated artefacts. This subdomain owns that classification decision. A defining constraint is **tech-agnosticism**: the introspection logic must not rely on recognising specific languages, frameworks, or conventions, so that it generalises across any codebase. +### Codebase Analysis -### Knowledge Extraction -Once relevant files are identified, this subdomain is responsible for extracting structured, intent-bearing findings from each one. It encompasses file classification, content chunking, querying an inference backend for structured observations, and persisting those observations with precise citations for downstream use. The output of this subdomain is the raw evidential record. +This subdomain is responsible for reading every source file and producing structured, section-aligned findings. 
It breaks into two tightly coupled concerns: -### Section Synthesis -The documentation produced by the system is split along a clear dependency boundary: +- **File Classification** — assigning each file a semantic kind (e.g., schema definition, migration script, API contract, or general prose). Path-level disambiguation handles ambiguous files. +- **Knowledge Extraction** — consuming the classification to select the appropriate reading strategy (a specialized structured extractor or a general prose extraction path), then producing normalized findings for each predefined wiki section. -| Subdomain tier | Description | Pipeline position | -|---|---|---| -| **Primary sections** | Built from per-file evidence produced by the extraction subdomain | Stages 2–3 | -| **Derivative sections** | Synthesised by aggregating across all primary-section findings | Stage 4 | +### Source Traceability + +This subdomain bridges raw analysis and the human-readable wiki. Every claim written into a section must be anchored to a precise location in the source code. The subdomain also surfaces contradictions — cases where two sources make incompatible assertions about the same topic — so that the final wiki reflects genuine uncertainty rather than silent resolution. + +### Incremental Content Maintenance + +When the underlying evidence for a section changes, this subdomain decides how much of the previously synthesized content to preserve. It models the decision as a three-way classification — **unchanged**, **surgical** (small targeted edits), or **rewrite** (full regeneration) — driven by a churn ratio derived from the delta in the finding set. -This ordering is a first-class design constraint: derivative sections cannot be produced until all primary evidence is available. The boundary between the two tiers is enforced structurally, not merely by convention. +### Wiki Persistence -### Artefact Persistence -Two distinct storage concerns are separated within the system. 
*Committed wiki content* — the section markdown files that are versioned alongside the target project — is kept apart from *local working state*, which includes per-file extraction notes and a content-addressed cache. The persistence subdomain owns this boundary and ensures that working state is never accidentally treated as part of the published record.
+This subdomain owns the authoritative on-disk representation of the wiki: directory naming and layout conventions, idempotent scaffolding, and all read/write operations against section bodies, per-file extraction notes, configuration, and the cache directory.
-## Subdomain Relationships
+### Knowledge Presentation
-The subdomains form a directed dependency chain:
+Once a wiki exists, this subdomain makes it accessible. It encompasses two capabilities: conversational question-and-answer over the generated content, and coverage or quality reporting that surfaces gaps or weak sections.
-```
-Repository Introspection
- ↓
- Knowledge Extraction → Artefact Persistence (working state)
- ↓
- Section Synthesis
- ↓
- Artefact Persistence (committed wiki content)
-```
+---
+
+### Domain Relationships
+
+| Subdomain | Depends on | Feeds into |
+|---|---|---|
+| File Classification | — | Knowledge Extraction |
+| Knowledge Extraction | File Classification | Source Traceability, Wiki Persistence |
+| Source Traceability | Knowledge Extraction | Wiki Persistence, Knowledge Presentation |
+| Incremental Content Maintenance | Wiki Persistence, Knowledge Extraction | Wiki Persistence |
+| Wiki Persistence | All upstream | Knowledge Presentation |
+| Knowledge Presentation | Wiki Persistence | (end user) |
-No subdomain reaches backwards in this chain; the pipeline ordering is the authoritative expression of inter-subdomain dependency.
+The boundaries described here are defined by domain responsibility, not by any particular module or technology choice.
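
The three-way decision described under Incremental Content Maintenance (unchanged, surgical, or rewrite, driven by a churn ratio) can be sketched roughly as below. This is only an illustration under stated assumptions: the function name `classify_change`, the churn formula (added plus removed findings over the union of both sets), and the `0.3` threshold are hypothetical and are not taken from `wikifi/surgical.py`.

```python
from typing import Literal

Decision = Literal["unchanged", "surgical", "rewrite"]


def classify_change(
    cached_ids: set[str],
    live_ids: set[str],
    surgical_threshold: float = 0.3,  # hypothetical cut-off, not wikifi's value
) -> tuple[Decision, float]:
    """Classify a section update from the delta between two finding sets."""
    added = live_ids - cached_ids
    removed = cached_ids - live_ids
    total = len(cached_ids | live_ids)
    # Assumed definition: share of all known findings that were added or removed.
    churn = (len(added) + len(removed)) / total if total else 0.0
    if churn == 0.0:
        return "unchanged", churn
    if churn <= surgical_threshold:
        return "surgical", churn
    return "rewrite", churn
```

Under these assumptions, adding one finding to a section backed by ten cached findings stays on the surgical path, while replacing every finding forces a full rewrite.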
## Supporting claims -- The core domain is automated documentation synthesis: extracting business intent from a source repository and producing a technology-agnostic wiki. [1][2] -- The repository introspection subdomain decides which parts of a codebase encode business intent versus infrastructure or tooling. [2] -- A defining constraint of repository introspection is tech-agnosticism — the analysis must not depend on recognising any specific language or framework. [2] -- The knowledge extraction subdomain covers file classification, content chunking, inference-backend querying, and persisting findings with citations. [1] -- Documentation sections are divided into primary sections (built from per-file evidence) and derivative sections (synthesised from aggregates of primary sections), with the ordering enforced as a structural constraint. [3] -- Two distinct storage concerns exist: committed wiki content and local working state (extraction notes and cache); the persistence subdomain enforces this boundary. [4] +- The core domain is automated codebase documentation: the systematic transformation of a software repository into a structured, technology-agnostic wiki. [1][2] +- A file classification subdomain tags every file with a semantic kind and handles path-level disambiguation for ambiguous files. [3] +- A knowledge extraction subdomain consumes the file classification to select the appropriate reading strategy and produces normalized findings per wiki section. [2][3] +- A source traceability subdomain anchors every narrative claim to a precise codebase location and surfaces contradictions between sources. [4] +- An incremental content maintenance subdomain classifies section updates as unchanged, surgical, or rewrite based on a churn ratio derived from changes in the finding set. 
[5] +- A wiki persistence subdomain owns the authoritative on-disk layout, idempotent scaffolding, and all read/write operations for section bodies, extraction notes, configuration, and cache. [6] +- A knowledge presentation subdomain provides conversational Q&A over the generated wiki and coverage/quality reporting. [1] ## Sources -1. `wikifi/extractor.py` -2. `wikifi/introspection.py:19-44` -3. `wikifi/sections.py:1-19` -4. `wikifi/wiki.py:1-50` +1. `wikifi/cli.py:1-11` +2. `wikifi/extractor.py:1-20` +3. `wikifi/specialized/dispatch.py:36-60` +4. `wikifi/evidence.py:1-18` +5. `wikifi/surgical.py:1-34` +6. `wikifi/wiki.py:1-55` diff --git a/.wikifi/entities.md b/.wikifi/entities.md index 016420b..5c3ecb5 100644 --- a/.wikifi/entities.md +++ b/.wikifi/entities.md @@ -1,217 +1,225 @@ # Core Entities -The system's information model spans six concern areas: wiki structure, evidence tracing, extraction and aggregation, repository analysis, caching, and pipeline orchestration. The entities below are described domain-first; implementation details such as storage format are noted only where they affect the entity's invariants. +## Evidence Primitives ---- +The traceability model is anchored by three small, composable entities. + +**SourceRef** is the atomic pointer into the codebase: a repo-relative file path, an optional inclusive line range, and a short content fingerprint captured at extraction time. It renders as `path:start–end`, `path:line`, or bare `path` depending on available information. + +**Claim** is one assertion placed into a wiki section. It pairs the markdown text of the assertion with a list of `SourceRef` objects that justify it. A claim that carries no source references is explicitly considered unsupported. -## Wiki Structure +**Contradiction** captures two or more conflicting `Claim` objects about the same topic. It carries a one-sentence summary of the disagreement and, for each disagreeing position, the full `Claim` with its own source references. 
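
The evidence primitives above can be pictured with a small sketch. The field names (`path`, `start`, `end`, `fingerprint`), the `render` method, and the `supported` property are assumptions made for illustration, not the definitions in `wikifi/evidence.py`; the range separator follows the plain hyphen used in the Sources lists.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceRef:
    """Hypothetical pointer into the codebase."""
    path: str                    # repo-relative file path
    start: Optional[int] = None  # first line of the inclusive range, if known
    end: Optional[int] = None    # last line of the inclusive range, if known
    fingerprint: str = ""        # short content fingerprint from extraction time

    def render(self) -> str:
        # Bare path when no line information is available.
        if self.start is None:
            return self.path
        # Single line when the range collapses to one line.
        if self.end is None or self.end == self.start:
            return f"{self.path}:{self.start}"
        return f"{self.path}:{self.start}-{self.end}"


@dataclass(frozen=True)
class Claim:
    """One assertion plus the references that justify it."""
    text: str
    sources: tuple[SourceRef, ...] = ()

    @property
    def supported(self) -> bool:
        # A claim with no source references counts as unsupported.
        return bool(self.sources)
```

A frozen dataclass is used here only because it makes the value-object nature of the two records explicit; the real entities may be modelled differently.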
-A **Section** is the fundamental organisational unit of the generated wiki. It carries:
+At the aggregation stage, the LLM expresses the same concepts through index-based counterparts. An **AggregatedClaim** carries claim text plus a list of 1-based indices into the ordered input notes list. When an `AggregatedClaim` is resolved against the notes list, it converges with the `Claim` / `SourceRef` model.
-| Field | Description |
-|---|---|
-| Unique identifier | Stable key used throughout the pipeline |
-| Title | Human-readable heading |
-| Brief | Prose description of what belongs in the section |
-| Tier | Either *primary* (populated from per-file evidence) or *derivative* (synthesised from primary sections) |
-| Upstream list | Ordered tuple of section identifiers this section depends on (derivative sections only) |
+An **EvidenceBundle** (defined in the evidence model) and a **SectionBody** (defined as the LLM's structured response schema) both represent the aggregator's complete output for one wiki section: a markdown narrative body, a list of claims, and a list of contradictions. `SectionBody` is the schema the LLM fills in during generation (claims are index-based); `EvidenceBundle` is the broader domain concept once indices have been resolved to `SourceRef` objects.
-Derivative sections declare explicit upstream dependencies, forming a directed acyclic graph. The system enforces topological ordering at startup: every section's upstreams must appear earlier in the canonical section list.
+---
-A **WikiLayout** anchors all on-disk path resolution to a single project root, exposing named locations for the wiki directory, configuration file, gitignore, notes directory, cache directory, and per-section markdown and notes files. Its existence is a precondition for the conversational query and report commands.
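
Resolving an index-based `AggregatedClaim` against the ordered notes list might look like the sketch below. The `Note` record, the `note_indices` field name, and the choice to skip out-of-range or repeated indices are assumptions for illustration; the source does not say how invalid indices are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    """Hypothetical stand-in for one extraction note and its citation."""
    text: str
    source: str  # rendered reference, e.g. "wikifi/cli.py:52-62"


@dataclass(frozen=True)
class AggregatedClaim:
    text: str
    note_indices: tuple[int, ...]  # 1-based positions in the notes list


def resolve_sources(claim: AggregatedClaim, notes: list[Note]) -> list[str]:
    """Map 1-based note indices to citations, keeping first-seen order."""
    resolved: list[str] = []
    for index in claim.note_indices:
        # Assumed policy: ignore indices that fall outside the notes list.
        if 1 <= index <= len(notes):
            source = notes[index - 1].source
            if source not in resolved:
                resolved.append(source)
    return resolved
```

Keeping the indices 1-based in the schema and converting only at resolution time matches the way the numbered citations are presented to readers.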
+## Extraction Layer
-A **LoadedSection** pairs a Section descriptor with its rendered markdown body, representing one populated section ready for downstream use (such as building a conversational context).
+| Entity | Key fields | Invariants / notes |
+|---|---|---|
+| **SectionFinding** | Target section ID, tech-agnostic markdown description (1–5 sentences), optional line range | One contribution from one file to one section |
+| **FileFindings** | List of `SectionFinding` records, one-sentence file-role summary | Unit of output from one extractor call |
+| **SpecializedFinding** | Target section ID, descriptive finding string, list of `SourceRef` objects | Produced by deterministic schema-aware extractors |
+| **SpecializedResult** | List of `SpecializedFinding` records, optional summary | Mirrors the output shape of the LLM extractor |
+| **ExtractionStats** | Files seen, files with findings, total findings, skipped files, chunks processed, cache hits, specialized-extractor files, file-kind breakdown | Run-level accumulators only; no identity |
-A **SectionReport** captures the per-section view for reporting: a reference to the Section definition, the count of contributing files, the count of findings, the character length of the written body, an emptiness flag, and an optional quality critique. A **WikiReport** aggregates all SectionReports together with overall coverage statistics and an optional mean quality score across populated sections.
+Schema-aware extractors surface specialised sub-types of `SpecializedFinding`. GraphQL schema files yield five entity categories: named domain object types, interface contracts, input types (request-payload shapes), enum types (closed value sets), and root operation types. Protocol-buffer IDL files yield named protocol entities (from `message` declarations, grouped by package) and closed value sets (from `enum` declarations).
SQL DDL files yield a richer internal record — a **_TableHit** — that adds the raw column-definition body, a parsed column list, and a list of foreign-key edges expressed as (local column, referenced table, referenced column) triples. API-contract files yield named request and response schemas from the component section, capped at 25 inline names.[14] --- -## Evidence and Citation Model +## Wiki Taxonomy -Every factual sentence in the generated wiki is traceable back through a three-layer evidence hierarchy. +A **Section** is the unit of wiki organisation. Its fields are: a stable identifier, a human-readable title, a descriptive text, a tier classification (`primary` or `derivative`), and an ordered set of upstream section identifiers it depends on. The entity is immutable once created. The full set of sections forms a dependency graph governed by a hard topological invariant: derivative sections may only reference sections that appear earlier in the canonical sequence. -**SourceRef** — the lowest-level pointer. Carries a repo-relative file path, an optional inclusive line range, and a short content fingerprint captured at extraction time. Renders as `path:start–end` or just `path` when no line range is available. +A **WikiLayout** is an immutable value object. It holds the project root path and derives every well-known filesystem path from it — wiki directory, config file, gitignore, notes directory, cache directory, and per-section markdown and notes files. It is the single source of truth for path resolution across all pipeline stages. -**Claim** — a single markdown assertion placed in a section's narrative. Backed by zero or more SourceRefs. A claim with no sources is explicitly considered *unsupported*. +A **LoadedSection** pairs a `Section` descriptor with the markdown body read from disk. It is the unit of context fed into the chat system prompt. 
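
The topological invariant on the section list (a derivative section may only reference sections that appear earlier in the canonical sequence) can be checked with a single pass. The sketch below is an assumption-laden illustration: the `Section` field names and the `validate_section_order` helper are hypothetical and do not reproduce the routine in `wikifi/sections.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    """Hypothetical, simplified section descriptor."""
    id: str
    title: str
    tier: str  # "primary" or "derivative"
    upstream: tuple[str, ...] = ()


def validate_section_order(sections: list[Section]) -> None:
    """Raise ValueError unless every upstream id appears earlier in the list."""
    seen: set[str] = set()
    for section in sections:
        missing = [up for up in section.upstream if up not in seen]
        if missing:
            raise ValueError(
                f"section {section.id!r} references {missing} before they are defined"
            )
        seen.add(section.id)
```

Because each section is compared only against identifiers already seen, the same pass also rejects self-references and references to unknown sections.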
-**Contradiction** — groups two or more conflicting Claims under a one-sentence summary of the conflict; each conflicting position retains its own SourceRefs.
+---
+
+## Cache Entities
-**EvidenceBundle** — the aggregator's structured handoff to the renderer for one section: the markdown narrative body, the ordered list of Claims, and the list of Contradictions.
+Four typed cache scopes are unified into a single **WalkCache** object that is passed through the entire pipeline. It exposes typed lookup and record methods per scope and maintains hit/miss counters for each.
-During the language-model aggregation pass, an intermediate form is used: an **AggregatedClaim** pairs a prose assertion with 1-based indices into the input notes (rather than resolved file paths), and an **AggregatedContradiction** wraps a one-sentence summary around multiple such indexed positions. These are resolved into full SourceRefs and Claims before the EvidenceBundle is assembled.
+| Cache entity | Keyed by | Content |
+|---|---|---|
+| **CachedFindings** | Repo-relative file path | Content fingerprint, structured findings, file-role summary, chunk count |
+| **CachedSection** | Hash of the notes that produced the section | Markdown body, structured claims, contradictions, ordered finding-ID list |
+| **CachedDerivation** | Hash of upstream primary-section bodies | Rendered body, `reviewed` flag |
+| **CachedIntrospection** | Include/exclude scope hash | Stage 1 introspection result |
+
+Three properties deserve emphasis. `CachedSection` stores an ordered list of stable finding identifiers aligned with note position so that 1-based claim source indices remain meaningful across sessions. `CachedDerivation` records whether the critic-and-reviser loop ran; a reviewed body is never silently substituted for an unreviewed one.
`CachedIntrospection` deliberately excludes descriptive fields (primary languages, rationale) from its key hash, so model-run variation in those fields does not invalidate an otherwise valid scope result. --- -## Extraction Layer +## Surgical Edit Entities -**IntrospectionResult** captures the Stage 1 decision: include patterns, exclude patterns, a one-paragraph hypothesis about the system's purpose, an informational list of primary technologies detected, and the rationale for the filtering choices. +When the finding set changes only slightly, the pipeline performs an in-place update rather than a full rewrite. -**SectionFinding** is the atomic extraction unit from one source file for one section. Fields: -- Target section identifier -- Technology-agnostic markdown description (one to five sentences) -- Optional inclusive line range within the source chunk +**SectionChange** captures the diff result for one section: a decision value (`unchanged`, `surgical`, or `rewrite`), the 1-based indices of live findings absent from the cache, the cached finding IDs no longer in the live set, a count of unchanged findings, the total live count, and a churn ratio derived from those counts. -**FileFindings** groups all SectionFindings produced for a single file, together with a one-sentence summary of that file's role. It is the unit exchanged between an extraction call and the notes store. +**SurgicalClaim** pairs an assertion text with 1-based indices into the added-findings list. **SurgicalContradiction** pairs a summary sentence with a list of `SurgicalClaim` objects representing disagreeing positions. -Specialised extractors — handling schema definition languages, API contracts, and data-definition files — produce **SpecializedFindings** rather than relying on general LLM inference. Each carries a section identifier, finding text, and one or more source references. 
Multiple SpecializedFindings are collected into a **SpecializedResult**, which additionally carries an optional summary string. +**SurgicalEdit** is the structured output of one surgical pass: an edited markdown body, the list of newly introduced `SurgicalClaim` records, a list of 1-based indices into the cached claims to drop, and a full replacement list of `SurgicalContradiction` records. -For data-definition schema files, an intermediate **table record** is derived first (table name, source line, raw body, column list, and foreign-key edges expressed as local-column → referenced-table.referenced-column tuples). All downstream entity and relationship findings are derived from this intermediate form. +--- -Domain object types from API schema files (those that are not root operation types) are surfaced directly as domain entity findings, grouped by their namespace, with closed value sets (enumerations) and shared shape contracts (interfaces, input types) captured as separate finding categories. +## Aggregation Output -**ExtractionStats** accumulates per-run metrics: files seen, files with findings, total findings, skipped files, chunks processed, cache hits, files routed to specialised extractors, and a breakdown by file kind. +**AggregationStats** tracks, across a single pipeline walk, how many sections were written fresh, left empty, served from cache, updated via surgical edit, or fully rewritten. --- -## Repository Analysis Entities +## Derivation -**FileKind** is a fixed enumeration of seven structural categories: application code, SQL, OpenAPI contract, protocol definition, GraphQL schema, migration script, and other. The classification drives routing to the appropriate extractor. +**DerivedSection** is the final markdown body produced for one derivative section. Its single invariant is that it contains no top-level heading; the wiki writer adds the heading separately. 
-**GraphNode** represents one file's position in the cross-file import graph: its repo-relative path, the ordered set of files it imports, and the ordered set of files that import it. It exposes a combined neighbour list capped at a configurable limit for use in prompt enrichment. +**DerivationStats** accumulates, across one run, counts of derivative sections derived, skipped, revised by the critic loop, and served from cache, together with the list of individual review outcomes. -**RepoGraph** is the complete repository-level import graph, keyed by repo-relative file path, providing lookup of individual GraphNodes and neighbour path lists. +--- -**DirSummary** is a value object for a single non-recursive directory: its repo-relative path, file count, total byte size, a frequency map of the top-10 file extensions, and a tuple of notable filenames (manifests, readmes). +## Quality and Review ---- +**Critique** captures the quality assessment of one section: an integer score (0–10), a one-to-two sentence summary judgment, a list of unsupported claims, a list of gaps against the section brief, and a list of concrete suggested edits. -## Caching Entities +**ReviewOutcome** records the full lifecycle of a section review: the section identifier, the initial `Critique`, the current body text, a boolean indicating whether a revision was accepted, and the follow-up `Critique` if revision occurred. -| Entity | Cache key | Stored payload | -|---|---|---| -| **CachedFindings** | Content fingerprint of the source file | Findings list, one-sentence file summary, chunk count | -| **CachedSection** | Hash of the notes payload | Rendered markdown body, claims list, contradictions list | +**WikiQualityReport** aggregates a whole-wiki scoring run: an overall numeric score, a map from section identifiers to their individual `Critique` objects, and optional coverage statistics. -**WalkCache** is the in-memory aggregate of both caches. 
It tracks four counters — extraction hits, extraction misses, aggregation hits, aggregation misses — supporting efficiency reporting across a full pipeline run. +**CoverageStats** records extraction coverage: total files seen, files that contributed findings, per-section counts of both findings and contributing files, and a coverage percentage derived from those counts. --- -## Quality-Review Entities - -A **Critique** captures the quality assessment of one section: -- Integer score (0–10) -- Short overall judgment -- List of unsupported claims -- List of gaps relative to the section brief -- List of concrete revision suggestions +## Reporting -A **ReviewOutcome** tracks the lifecycle of a single section review: the section identifier, the initial Critique, the current body text, a boolean flag indicating whether a revision was applied, and an optional follow-up Critique produced after revision. +**SectionReport** is a read-only per-section summary: contributing file count, finding count, body character length, an empty flag, and an optional quality `Critique`. -A **WikiQualityReport** aggregates the full-wiki audit: an overall numeric score, a mapping from section identifiers to individual Critiques, and optional **CoverageStats** (total files, files with findings, and per-section finding and file counts). +**WikiReport** aggregates coverage statistics across all sections, the list of `SectionReport` records, and an optional overall quality score computed as the mean of all scored sections. --- -## Pipeline Orchestration Entities - -**WalkConfig** encapsulates the parameters for file traversal. Notes from two different pipeline layers describe it somewhat differently (see Conflicts below), but the agreed-upon core fields are: repository root, file-size limits, minimum content thresholds, and extra exclusion patterns. It is treated as immutable once constructed. 
+## Pipeline Orchestration -**Notes records** are the ephemeral per-section extraction state persisted during a walk. Each record carries a UTC timestamp and arbitrary key-value metadata. Records for a section are accumulated in insertion order. +**IntrospectionResult** is the output of Stage 1: include patterns, exclude patterns, primary languages, a likely-purpose paragraph, and a rationale string. Patterns are gitignore-style and relative to the repository root. -**WalkReport** is the primary return value from a complete pipeline run. It carries the IntrospectionResult, ExtractionStats, AggregationStats, DerivationStats, the WalkCache state, and the RepoGraph. +**WalkConfig** is an immutable record governing one filesystem enumeration: root directory, supplemental exclude patterns, a flag to respect the project's own ignore file, maximum file size, and minimum stripped-content size. -**AggregationStats** tracks three counters for a single aggregation pass: sections written fresh, sections skipped due to empty notes, and sections served from cache. +**DirSummary** holds non-recursive aggregate statistics for one directory: repo-relative path, file count, total byte size, a top-10 map of file extensions to counts, and a list of notable manifest or readme filenames. It is produced in pre-order depth-first traversal. -**DerivationStats** accumulates pipeline metrics for the derivation stage: count of sections derived, skipped, and revised, plus the full list of ReviewOutcomes as an audit trail. +**WalkReport** is the top-level result for a complete pipeline run. It carries the `IntrospectionResult`, extraction statistics, aggregation statistics, derivation statistics, the `WalkCache` snapshot, the repo import graph, and a boolean indicating whether the run was a full cache hit with no generation work performed. 
--- -## Interaction Entities +## Repository Graph -A **ChatMessage** carries a role identifier and a content string, representing one turn in a multi-turn exchange. +**GraphNode** represents one in-scope file: its repo-relative path, the tuple of files it imports, the tuple of files that import it, and a capped combined-neighbor accessor used for prompt enrichment. -A **ChatSession** holds a reference to the language-model provider, the frozen system prompt built from populated wiki sections, and the accumulated conversation history (an ordered list of ChatMessages). It supports appending user and assistant turns and clearing the history while retaining the wiki context. +**RepoGraph** is the aggregate of all `GraphNode` records for the in-scope file set. It supports lookup by path and neighbor-path retrieval with a configurable cap. --- -## Configuration and Provider Entities +## Provider and Chat Entities + +**ChatMessage** is one turn in a conversation: a role identifier and a content string. + +**LLMProvider** is the abstract contract every generation backend must satisfy. It carries a provider name and a model identifier and must implement three call surfaces: structured extraction (schema-constrained, returns a validated document), free-text generation, and multi-turn conversation. + +**ChatSession** holds a reference to the configured provider, the frozen system prompt built from wiki content, and the growing message history for the current session. It exposes two operations: *send* (appends the user turn, calls the provider, appends the reply) and *reset* (clears history while keeping the system context). + +--- -**Settings** captures all runtime knobs for a wiki-generation run: provider and model identity, inference endpoint, request timeout, file-size and chunk thresholds, pipeline feature flags (caching, graph building, specialised extractors, review loop), the quality threshold that triggers revision, and provider-specific credentials and token caps. 
+## Settings -An **LLMProvider** carries a provider name and a specific model variant. It is the sole point of contact between the pipeline and any language-model backend, exposing exactly three interaction modes used throughout the system. +**Settings** is the single runtime configuration entity. Its fields span: LLM provider identity and model identifier, Ollama endpoint URL, per-request timeout, file-size thresholds (maximum, chunk, overlap, minimum content), introspection tree depth, thinking-mode level, pipeline feature flags (caching, graph analysis, specialized extractors, critic loop, surgical edits), surgical-edit churn threshold, and provider-specific API keys, base URLs, and token caps. ## Supporting claims -- A Section carries a unique identifier, human-readable title, prose brief, a tier (primary or derivative), and an ordered tuple of upstream section identifiers. [1] -- Derivative sections declare explicit upstream dependencies forming a directed acyclic graph enforced by topological ordering at startup. [2][1] -- WikiLayout anchors all on-disk path resolution to a single project root, exposing named locations for wiki, config, gitignore, notes, cache, and per-section files; its existence is a precondition for the chat and report commands. [3][4] -- A LoadedSection pairs a Section descriptor with its rendered markdown body. [5] -- A SectionReport carries the section definition reference, contributing file count, findings count, body character length, emptiness flag, and an optional quality critique. [6] -- A WikiReport aggregates all SectionReports, overall coverage statistics, and an optional mean quality score. [7] -- A SourceRef holds a repo-relative file path, an optional inclusive line range, and a short content fingerprint captured at extraction time. [8][9] -- A Claim is a single markdown assertion backed by zero or more SourceRefs; a claim with no sources is explicitly considered unsupported. 
[8][10] -- A Contradiction groups two or more conflicting Claims under a one-sentence summary, each position retaining its own SourceRefs. [10] -- An EvidenceBundle is the aggregator's structured handoff to the renderer: markdown body, ordered Claims list, and Contradictions list. [8][11] -- During the aggregation pass an AggregatedClaim pairs a prose assertion with 1-based input-note indices, and an AggregatedContradiction wraps a one-sentence summary around multiple such indexed positions; these are resolved into SourceRefs before the EvidenceBundle is assembled. [12] -- IntrospectionResult captures include/exclude patterns, a purpose hypothesis, an informational language list, and the filtering rationale. [13] -- A SectionFinding carries a target section identifier, a technology-agnostic markdown description of one to five sentences, and an optional line range. [14] -- A FileFindings groups all SectionFindings for one file plus a one-sentence file-role summary, and is the unit exchanged between the extraction call and the notes store. [15] -- A SpecializedFinding carries a section identifier, finding text, and one or more source references; multiple SpecializedFindings are collected into a SpecializedResult that also carries an optional summary string. [16][17] -- For data-definition schema files an intermediate table record is derived first (name, source line, raw body, column list, and foreign-key edges) and all downstream findings are derived from it. [18] -- Domain object types from schema files are surfaced as domain entity findings; closed value sets and shared shape contracts are captured as separate finding categories; a maximum of 25 items per category are rendered with elision noted. [19][20][21] -- ExtractionStats accumulates: files seen, files with findings, total findings, skipped files, chunks processed, cache hits, specialised-extractor files, and a file-kind breakdown. 
[22] -- FileKind is a fixed enumeration of seven structural categories: application code, SQL, OpenAPI, Protobuf, GraphQL, migration, and other; it drives routing to the appropriate extractor. [23] -- A GraphNode carries its repo-relative path, the ordered set of files it imports, and the ordered set of files that import it, with a configurable cap on the combined neighbour list. [24] -- A RepoGraph is the complete repository import graph keyed by repo-relative path. [25] -- A DirSummary holds a directory's path, file count, total byte size, top-10 extension frequency map, and a tuple of notable filenames. [26] -- CachedFindings stores a content fingerprint, findings list, one-sentence summary, and chunk count; CachedSection stores a notes-payload hash, rendered markdown body, claims list, and contradictions list. [27][28] -- WalkCache aggregates both caches and tracks four counters (extraction hits/misses, aggregation hits/misses) for efficiency reporting. [29] -- A Critique carries an integer score (0–10), a short judgment, a list of unsupported claims, a list of brief gaps, and a list of revision suggestions. [30] -- A ReviewOutcome tracks the section identifier, initial critique, current body, revision-applied flag, and optional follow-up critique. [31] -- A WikiQualityReport carries an overall numeric score, a mapping from section identifiers to individual Critiques, and optional CoverageStats (total files, files with findings, per-section finding and file counts). [32] -- WalkReport is the primary return value from a full pipeline run, carrying IntrospectionResult, ExtractionStats, AggregationStats, DerivationStats, WalkCache state, and RepoGraph. [33] -- AggregationStats tracks sections written fresh, skipped due to empty notes, and served from cache. [34] -- DerivationStats accumulates sections derived, skipped, and revised counts, plus the full list of ReviewOutcomes. 
[35] -- Notes records carry a UTC timestamp and arbitrary key-value metadata, stored per section in insertion order. [36] -- A ChatMessage carries a role identifier and a content string representing one turn in a multi-turn exchange. [37] -- A ChatSession holds an LLM provider reference, a frozen system prompt built from wiki sections, and an ordered conversation history; it supports appending turns and clearing history while retaining context. [38] -- Settings captures provider and model identity, inference endpoint, timeout, file-size and chunk thresholds, pipeline feature flags, revision quality threshold, and provider-specific credentials and token caps. [39] -- An LLMProvider carries a provider name and model variant and is the sole point of contact between the pipeline and any language-model backend. [37] - -## Conflicts in source -_The walker found disagreements across files. Migration teams should resolve these before re-implementation._ - -- **Two sources describe a 'WalkConfig' entity with partially different field sets, suggesting either two distinct same-named entities at different pipeline layers or a single entity incompletely described in each source.** - - WalkConfig (orchestrator layer) encapsulates repository root, byte-size limits, minimum content thresholds, and an optional introspection-derived exclusion list; it is constructed twice per run — once before introspection and once after with the exclusion list populated. (`wikifi/orchestrator.py:83-101`) - - WalkConfig (filesystem-walker layer) encapsulates repository root, extra exclusion patterns beyond defaults, a flag for honouring gitignore rules, maximum file size in bytes, and minimum stripped-content size in bytes; it is immutable once constructed. (`wikifi/walker.py:61-79`) +- A SourceRef points to a specific location in the codebase: a repo-relative file path, an optional inclusive line range, and a short content fingerprint captured at extraction time. 
[1] +- A Claim pairs the markdown text of an assertion with a list of SourceRef objects that justify it; a claim carrying no source references is explicitly considered unsupported. [2] +- A Contradiction captures two or more conflicting Claim objects about the same topic, with a one-sentence summary and each disagreeing position retaining its own source references. [3] +- An AggregatedClaim carries claim text plus a list of 1-based indices into the ordered input notes list. [4] +- EvidenceBundle and SectionBody both represent the aggregator's complete output for one wiki section: a markdown narrative body, claims, and contradictions. [5][6] +- A SectionFinding carries the target section identifier, a technology-agnostic markdown description of 1–5 sentences, and an optional line range within the source chunk. [7] +- FileFindings groups all findings produced for a single file together with a one-sentence file-role summary; it is the unit of output from one extractor call. [8] +- A SpecializedFinding carries a target section identifier, a descriptive finding string, and a list of source references linking back to originating file locations. [9] +- A SpecializedResult aggregates a list of SpecializedFindings together with an optional summary string, mirroring the output shape of the LLM extractor. [9] +- ExtractionStats accumulates run-level counters: files seen, files with findings, total findings, skipped files, chunks processed, cache hits, specialized-extractor files, and a file-kind breakdown. [10] +- GraphQL schema files yield five entity categories: named domain object types, interface contracts, input types, enum types, and root operation types. [11] +- Protocol-buffer IDL files yield named protocol entities from message declarations (grouped by package) and closed value sets from enum declarations. 
[12] +- SQL DDL files yield a _TableHit internal record with table name, source line, raw column-definition body, parsed column names, and foreign-key edges as (local column, referenced table, referenced column) triples. [13] +- A Section carries an identifier, title, description, tier classification (primary or derivative), and ordered upstream dependencies; it is immutable and participates in a topologically ordered dependency graph where derivatives may only reference earlier sections. [15] +- WikiLayout is an immutable value object that holds the project root and derives all well-known filesystem paths from it; it is the single source of truth for path resolution. [16] +- A LoadedSection pairs a Section descriptor with the markdown body read from disk and is the unit of context fed into the chat system prompt. [17] +- WalkCache is the unified in-memory view of all four cache scopes plus hit/miss counters for each; it is the single cache object passed through the pipeline. [18] +- CachedFindings is keyed by repo-relative file path and holds the content fingerprint, structured findings, file-role summary, and chunk count. [19] +- CachedSection holds the notes hash, rendered markdown body, structured claims, contradictions, and an ordered list of stable finding identifiers so that 1-based claim source indices map to finding IDs across sessions. [20] +- CachedDerivation holds the hash of upstream primary-section bodies, the rendered body, and a flag recording whether the critic-and-reviser review loop ran, preventing a reviewed body from being silently substituted for an unreviewed one. [21] +- CachedIntrospection deliberately excludes descriptive fields (primary languages, rationale) from its key hash so model-run variation does not invalidate an otherwise valid scope result. 
[22] +- SectionChange captures a per-section diff: a decision (unchanged/surgical/rewrite), new finding indices, dropped finding IDs, unchanged count, total live count, and a derived churn ratio. [23] +- A SurgicalClaim pairs assertion text with 1-based indices into the added-findings list; a SurgicalContradiction pairs a summary sentence with a list of SurgicalClaims. [24] +- SurgicalEdit is the structured output of one surgical pass: an edited markdown body, newly introduced claims, 1-based indices of cached claims to drop, and a full replacement list of contradictions. [25] +- AggregationStats tracks how many sections were written, left empty, served from cache, surgically edited, or fully rewritten in a single walk. [5] +- DerivedSection holds the final markdown body for one derivative section with the invariant that it contains no top-level heading. [26] +- DerivationStats counts derivative sections derived, skipped, revised by the critic loop, and served from cache, plus the list of review outcomes. [27] +- Critique captures an integer score (0–10), a one-to-two sentence summary judgment, unsupported claims, gaps against the section brief, and concrete suggested edits. [28] +- ReviewOutcome records the section identifier, initial critique, current body text, revision-accepted flag, and the follow-up critique if revision occurred. [29] +- WikiQualityReport aggregates an overall numeric score, a map from section identifiers to individual critiques, and optional coverage statistics. [30] +- CoverageStats records total files seen, files with findings, per-section counts of findings and contributing files, and a coverage percentage. [31] +- SectionReport is a read-only per-section summary: contributing file count, finding count, body character length, empty flag, and optional quality critique. [32] +- WikiReport aggregates coverage statistics, the list of SectionReport records, and an optional overall quality score computed as the mean of all scored sections. 
[33] +- IntrospectionResult captures Stage 1 output: include patterns, exclude patterns, primary languages, a purpose paragraph, and a rationale string; patterns are gitignore-style and relative to the repository root. [34] +- WalkConfig is an immutable record capturing: root directory, supplemental exclude patterns, ignore-file flag, maximum file size, and minimum stripped-content size. [35] +- DirSummary holds non-recursive aggregate statistics for one directory: repo-relative path, file count, total byte size, a top-10 extension-to-count map, and notable manifest/readme filenames; it is produced in pre-order depth-first traversal. [36] +- WalkReport is the top-level result for a pipeline run, carrying introspection result, extraction stats, aggregation stats, derivation stats, walk cache snapshot, repo import graph, and a full-cache-hit flag. [37] +- GraphNode carries a file's repo-relative path, the tuple of files it imports, the tuple of files that import it, and a capped combined-neighbor accessor. [38] +- RepoGraph aggregates all GraphNode records for the in-scope file set and supports lookup by path and neighbor-path retrieval with a configurable cap. [39] +- ChatMessage carries a role identifier and a content string. [40] +- LLMProvider carries a provider name and model identifier and must implement structured extraction, free-text generation, and multi-turn conversation call surfaces. [40] +- ChatSession holds a reference to the configured provider, the frozen system prompt, and the growing message history; it exposes send and reset operations. [41] +- Settings captures LLM provider identity, model identifier, endpoint URL, timeout, file-size thresholds, introspection tree depth, thinking-mode level, pipeline feature flags, surgical-edit churn threshold, and provider-specific API keys, base URLs, and token caps. [42] ## Sources -1. `wikifi/sections.py:30-40` -2. `wikifi/deriver.py:112-116` -3. `wikifi/cli.py:172-183` -4. `wikifi/wiki.py:55-80` -5. 
`wikifi/chat.py:42-45` -6. `wikifi/report.py:29-36` -7. `wikifi/report.py:39-44` -8. `wikifi/aggregator.py:166-186` -9. `wikifi/evidence.py:35-55` -10. `wikifi/evidence.py:57-80` -11. `wikifi/evidence.py:82-87` -12. `wikifi/aggregator.py:74-101` -13. `wikifi/introspection.py:47-64` -14. `wikifi/extractor.py:113-125` -15. `wikifi/extractor.py:128-131` -16. `wikifi/specialized/models.py:19-22` -17. `wikifi/specialized/models.py:25-27` -18. `wikifi/specialized/sql.py:50-58` -19. `wikifi/specialized/graphql.py:56-95` -20. `wikifi/specialized/openapi.py:105-116` -21. `wikifi/specialized/protobuf.py:42-60` -22. `wikifi/extractor.py:134-142` -23. `wikifi/repograph.py:43-56` -24. `wikifi/repograph.py:143-162` -25. `wikifi/repograph.py:165-177` -26. `wikifi/walker.py:144-153` -27. `wikifi/cache.py:60-66` -28. `wikifi/cache.py:69-74` -29. `wikifi/cache.py:77-88` -30. `wikifi/critic.py:67-84` -31. `wikifi/critic.py:91-96` -32. `wikifi/critic.py:99-114` -33. `wikifi/orchestrator.py:60-70` -34. `wikifi/aggregator.py:103-107` -35. `wikifi/deriver.py:57-62` -36. `wikifi/wiki.py:136-152` -37. `wikifi/providers/base.py:33-52` -38. `wikifi/chat.py:46-57` -39. `wikifi/config.py:46-155` -40. `wikifi/orchestrator.py:83-101` -41. `wikifi/walker.py:61-79` +1. `wikifi/evidence.py:35-57` +2. `wikifi/evidence.py:60-70` +3. `wikifi/evidence.py:73-79` +4. `wikifi/aggregator.py:76-97` +5. `wikifi/aggregator.py:100-115` +6. `wikifi/evidence.py:82-87` +7. `wikifi/extractor.py:97-107` +8. `wikifi/extractor.py:110-113` +9. `wikifi/specialized/models.py:17-27` +10. `wikifi/extractor.py:116-124` +11. `wikifi/specialized/graphql.py:44-93` +12. `wikifi/specialized/protobuf.py:43-64` +13. `wikifi/specialized/sql.py:48-55` +14. `wikifi/specialized/openapi.py:98-110` +15. `wikifi/sections.py:31-38` +16. `wikifi/wiki.py:54-76` +17. `wikifi/chat.py:42-44` +18. `wikifi/cache.py:146-166` +19. `wikifi/cache.py:89-96` +20. `wikifi/cache.py:98-116` +21. `wikifi/cache.py:118-131` +22. 
`wikifi/cache.py:133-143` +23. `wikifi/surgical.py:127-150` +24. `wikifi/surgical.py:72-92` +25. `wikifi/surgical.py:95-120` +26. `wikifi/deriver.py:58-62` +27. `wikifi/deriver.py:64-71` +28. `wikifi/critic.py:63-78` +29. `wikifi/critic.py:85-90` +30. `wikifi/critic.py:93-96` +31. `wikifi/critic.py:216-232` +32. `wikifi/report.py:29-36` +33. `wikifi/report.py:39-41` +34. `wikifi/introspection.py:43-62` +35. `wikifi/walker.py:57-74` +36. `wikifi/walker.py:148-157` +37. `wikifi/orchestrator.py:76-91` +38. `wikifi/repograph.py:151-170` +39. `wikifi/repograph.py:172-183` +40. `wikifi/providers/base.py:31-57` +41. `wikifi/chat.py:46-58` +42. `wikifi/config.py:44-181` diff --git a/.wikifi/external_dependencies.md b/.wikifi/external_dependencies.md index 67bc63b..270d8e6 100644 --- a/.wikifi/external_dependencies.md +++ b/.wikifi/external_dependencies.md @@ -1,51 +1,49 @@ # External-System Dependencies -The system draws on external services in three areas: language-model inference, development-time tooling integrations, and an optional document-parsing aid. +The system integrates with up to three language-model inference backends — exactly one active at a time — selected at configuration time. An additional soft dependency covers structured API-contract parsing. -## Language-Model Inference Services +## Inference Backends -All AI inference is routed through exactly one of three backends, selected at configuration time. 
+### Local Inference Service (default) -| Backend | Hosting model | Authentication | Notes | -|---|---|---|---| -| Self-hosted model server | Local process, user-managed | None required | Default path; connects via a configurable HTTP endpoint | -| Anthropic hosted API | Cloud, vendor-managed | API key (environment variable) | Opt-in; supports adaptive reasoning depth; configurable per-call token cap (default 32 000) and HTTP timeout (default 900 s) to handle long inference runs | -| OpenAI-compatible hosted API | Cloud, vendor-managed (or proxy) | API key | Opt-in; base URL is overridable, enabling compatible third-party or private deployments | +By default, all language-model calls are routed to a locally-hosted **Ollama** inference service reached over HTTP at a configurable host address. No API key is required. The service must support three interaction modes: schema-constrained structured output, free-text generation, and multi-turn conversation. Models are identified using a `family:tag` naming convention. A configurable connection timeout (default: 900 seconds) accommodates long reasoning traces on large models. -The self-hosted model server is the zero-configuration default: new users can run it locally without obtaining any credentials. The two cloud options are opt-in and require API keys supplied via environment variables. Both cloud providers support configurable timeouts, token limits, and extended reasoning modes; the cloud inference path also performs structured-output decoding (schema-constrained responses returning validated domain objects), free-text completion, and multi-turn conversational exchange. +### Anthropic API — Hosted Inference (opt-in) -## Development-Time Tool Integrations +Anthropic's hosted inference API is an opt-in alternative. It requires an API key supplied via configuration or the runtime environment. 
The integration exposes adaptive thinking controls — allowing reasoning depth to be traded against latency on a per-call basis — and supports configurable per-call output token budgets. The system defaults to a specific known-good model when no explicit identifier is provided, preventing routing failures caused by locally-formatted model names being submitted to the hosted endpoint. A configurable network timeout (default: 900 seconds) covers extended-thinking and structured-output calls. -A separate layer of integrations, declared in the project's tool-server configuration, augments the system during development or at runtime with auxiliary capabilities: +### OpenAI API — Hosted Inference (opt-in) -- **Google's generative AI service** — consumed via a shared API key; powers at least two registered tool integrations, including one described as an orchestration or data-assembly capability. -- **External documentation and context lookup** — an HTTP-based service queried with its own API key to retrieve up-to-date reference material, likely used to enrich prompts with current library or API documentation. -- **Self-hosted web-crawling service** — a locally-running crawler reachable at a fixed port, requiring no API key, used to fetch and process web content on demand. +OpenAI's hosted API is a second opt-in alternative. It exposes two surfaces: a schema-constrained structured-output endpoint returning pre-validated typed responses, and a standard chat-completion endpoint for free text and multi-turn conversation. The API key is sourced from the runtime environment. The integration supports standard hosted endpoints, enterprise cloud deployments (e.g. Azure-hosted variants), and arbitrary proxy deployments via a configurable base URL; deployment identifiers are forwarded unchanged to preserve compatibility. 
When a locally-formatted model identifier is detected, the system substitutes a known-good hosted model to prevent request failures; explicit deployment identifiers bypass this substitution. -## Optional Parsing Support +## API-Contract Parsing -When processing contract specification files in YAML format, the system can delegate parsing to a third-party YAML library if one is present in the environment. If the library is absent the system falls back to a built-in minimal parser that covers the subset of YAML constructs it needs. This makes the external library a soft, non-blocking dependency rather than a hard requirement. +When processing structured API-contract files, the system can optionally rely on an external YAML-parsing library. If the library is absent at runtime, a built-in minimal parser handles the relevant document subset, making this a soft dependency — the capability remains functional without the library installed. ## Supporting claims -- The self-hosted model server is the default inference backend, requires no API key, and is reachable at a configurable HTTP endpoint. [1][2][3] -- Anthropic's hosted inference API is an opt-in backend authenticated via an environment-variable API key, supporting adaptive reasoning modes, a configurable per-call token cap (default 32 000), and an HTTP timeout defaulting to 900 seconds. [4][2][5] -- The OpenAI-compatible hosted API is an opt-in backend authenticated via API key with a configurable base URL, enabling use of compatible third-party or private proxy deployments. [6][2][7] -- The cloud inference path supports structured-output (schema-constrained) decoding, free-text completion, and multi-turn conversational chat. [7] -- Google's generative AI service is consumed via a shared API key and powers at least two registered tool integrations, including one orchestration or data-assembly capability. 
[8][9] -- An HTTP-based external documentation and context lookup service is queried with its own dedicated API key, likely to enrich prompts with up-to-date reference material. [10] -- A self-hosted web-crawling service runs locally at a fixed port, requires no API key, and is used to fetch and process web content. [11] -- A third-party YAML parsing library is a soft dependency for processing YAML-format specification files; the system falls back to a built-in minimal parser when the library is absent. [12] +- A locally-hosted Ollama inference service is the default language-model backend, requiring no API key and no hosted service subscription. [1][2][3] +- The Ollama service is reached over HTTP at a configurable host address. [1][3] +- The Ollama backend must support schema-constrained structured output, free-text generation, and multi-turn conversation. [3] +- Ollama models are identified using a family:tag naming convention. [2] +- The Ollama backend carries a configurable connection timeout defaulting to 900 seconds. [3] +- Anthropic's hosted inference API is an opt-in alternative that requires an API key from configuration or environment. [4][5][6] +- The Anthropic integration supports adaptive thinking modes and configurable per-call output token budgets. [4][6] +- When no explicit model identifier is provided for the Anthropic backend, the system substitutes a specific default model to avoid routing errors from locally-formatted model names. [5][6] +- The Anthropic backend carries a configurable network timeout defaulting to 900 seconds to accommodate extended-thinking and structured-output calls. [6] +- OpenAI's hosted API is a second opt-in inference backend; its API key is sourced from the runtime environment. [7][8][9] +- The OpenAI integration exposes two API surfaces: a schema-constrained structured-output endpoint and a standard chat-completion endpoint for free text and multi-turn conversation. 
[9] +- The OpenAI integration supports standard hosted endpoints, enterprise cloud deployments, and proxy deployments via a configurable base URL; arbitrary deployment identifiers are forwarded unchanged. [7][8] +- When a locally-formatted model identifier is detected, the OpenAI backend substitutes a known-good hosted model; explicit deployment identifiers are passed through unchanged. [8] +- An external YAML-parsing library is an optional soft dependency for processing API-contract files; a built-in fallback parser handles the required document subset when the library is absent. [10] ## Sources -1. `wikifi/config.py:53-55` -2. `wikifi/orchestrator.py:148-200` -3. `wikifi/providers/ollama_provider.py:52` -4. `wikifi/config.py:116-134` -5. `wikifi/providers/anthropic_provider.py:83-100` -6. `wikifi/config.py:136-151` -7. `wikifi/providers/openai_provider.py:113-175` -8. `.mcp.json:4-8` -9. `.mcp.json:29-35` -10. `.mcp.json:22-28` -11. `.mcp.json:14-20` -12. `wikifi/specialized/openapi.py:154-162` +1. `wikifi/config.py:51-53` +2. `wikifi/orchestrator.py:255-264` +3. `wikifi/providers/ollama_provider.py:48-57` +4. `wikifi/config.py:151-166` +5. `wikifi/orchestrator.py:265-277` +6. `wikifi/providers/anthropic_provider.py:75-100` +7. `wikifi/config.py:170-181` +8. `wikifi/orchestrator.py:278-299` +9. `wikifi/providers/openai_provider.py:111-135` +10. `wikifi/specialized/openapi.py:131-153` diff --git a/.wikifi/hard_specifications.md b/.wikifi/hard_specifications.md index 8423c1c..12cdf0a 100644 --- a/.wikifi/hard_specifications.md +++ b/.wikifi/hard_specifications.md @@ -1,197 +1,200 @@ # Hard Specifications -This section documents requirements that must be preserved verbatim through any reimplementation, migration, or refactor of the wiki-generation pipeline. They are grouped by domain. +## Technology-Agnostic Output Mandate -### Output Integrity +All generated content at every stage of the pipeline must be expressed in technology-agnostic domain language. 
No specific languages, frameworks, or library names may appear in findings, synthesized bodies, derivative sections, or introspection outputs. This requirement applies to per-file extraction, aggregation, derivation, and chat responses alike. -All synthesized wiki content — including derivative sections — must be technology-agnostic: no names of specific languages, frameworks, or libraries may appear in any generated output. Every observation must be expressed in domain terms. +## File Intake Thresholds -Contradictions between source notes must never be silently resolved. Any incompatible claims must produce a dedicated contradictions entry identifying each position and the note indices that support it. This rule applies at the aggregation stage and throughout the critic/reviser loop. +Two size-based hard cutoffs govern which files may enter the analysis pipeline: -All generated section bodies must declare gaps explicitly rather than speculating or inventing claims unsupported by upstream evidence. The reviser is bound by the same constraint: fabricated claims are prohibited even when evidence is sparse. +| Condition | Threshold | Action | +|---|---|---| +| File size exceeds | 2,000,000 bytes | Unconditionally skipped (treated as vendored or generated noise) | +| Stripped text content shorter than | 64 bytes | Unconditionally skipped (prevents speculative AI reasoning on near-empty stubs) | -### Quality Scoring Rubric +These thresholds are applied both by the filesystem walker and independently in configuration, making them doubly enforced. -The scoring rubric is fixed and must not be altered: +## Chunking Parameters -| Score | Meaning | -|---|---| -| 9–10 | Fully grounded, tech-agnostic, narratively coherent; no unsupported claims | -| 6–8 | Minor issues acceptable | -| 3–5 | Substantial gaps or partial coverage | -| 0–2 | Incoherent or off-brief | +The content chunking window is fixed at **150,000 bytes** with an **8,000-byte overlap** between adjacent chunks. 
The overlap is required to preserve cross-boundary context and must be maintained whenever chunking logic is modified. The invariant `0 ≤ overlap < chunk_size` is enforced as a hard error; any violation raises immediately. The recursive splitting strategy must guarantee full coverage of any input — no content may be silently dropped regardless of format. + +## Identity and Cache Rules + +**Finding identity** is defined as the SHA-256 digest of the concatenation `file::section_id::finding`, truncated to 12 hexadecimal characters (the same FINGERPRINT_LENGTH that governs content fingerprints elsewhere). A single-character change to any of the three components produces a different identity, treated as a delete-plus-insert rather than an edit, invalidating any cached prose referencing the prior wording. + +**Content fingerprints** must be exactly 12 hexadecimal characters derived from the leading 12 characters of a SHA-256 digest computed over raw bytes (not decoded text), ensuring encoding-independence across all subsystems. + +**Aggregation cache keys** must include the full sources list — file reference, line range, and content fingerprint per citation — not merely the finding text. When a cited file's line numbers shift or its content changes, the cache must miss and aggregation must rerun against fresh evidence. -The minimum acceptable score for shipping a section without revision is **7**. A revised body is only accepted if its follow-up critique score is **greater than or equal to** the initial score; any revision that causes a score regression must be discarded and the original body retained. +**Review-mode cache asymmetry**: an entry produced under the critic-review mode must not be silently served to a walk running without review. The reverse is permitted — a reviewed body is considered strictly higher quality and may be reused by a non-review walk. 
For derivative sections, both a matching upstream-content hash and a matching review-mode flag are required for cache reuse. -### Evidence and Citation Format +## Directory Layout and Storage Formats -Citations must be rendered as compact footnote-style markers (`[1]`, `[2]`, …) with a Sources footer at the bottom of each section. Line-range references follow the format `path/to/file:start-end`; when start equals end, `path/to/file:line`; when no range is known, `path/to/file` alone. +The wiki artefact directory must be named `.wikifi/` and must reside inside the target project root. The chat and report commands treat the existence of this directory as a hard pre-condition and refuse to proceed without it. The `.wikifi/` layout is declared a stable forward-compatible contract; any new required scaffolding entries must be backfilled into existing wikis on the next initialisation run. -Detected contradictions must appear verbatim in wiki output under a **'Conflicts in source'** heading, with explicit direction that migration teams must resolve them before re-implementation. They must not be suppressed or merged. +**Notes storage format**: notes are stored as JSONL (one JSON object per line). Each record must include a `timestamp` field in ISO-8601 UTC format. This format is consumed by the aggregation stage and must remain stable across versions. -Three exact sentinel strings serve as the canonical markers for unpopulated sections and must not be modified: -- `Not yet populated` -- `No findings were extracted` -- `upstream sections required to derive` +**Citation format**: citations must be rendered as compact footnote-style markers (`[1]`, `[2]`, …) with an explicit numbered "Sources" footer at the bottom of each section. Line-precise references must be formatted as `path/to/file:start-end` or `path/to/file:line` when line information is available. 
Inline annotation must use a conservative verbatim substring match — a claim is only inlined next to a matching sentence when the claim's exact text appears in the body; paraphrased bodies must place claims in a separate "Supporting claims" list to prevent mis-attribution. -### Content Fingerprint Format +**Report sentinel values**: section bodies containing the strings `Not yet populated`, `No findings were extracted`, or `upstream sections required to derive` are contractually treated as empty and excluded from quality scoring. These values form a stable interface between the writer and the reporter. -Fingerprints are defined as the first **12 hexadecimal characters** of a SHA-256 digest (48 bits of entropy). This length and format must be preserved across any migration, as fingerprints are recorded in cached artefacts and emitted into wiki evidence references. +## Section Taxonomy and Pipeline Stages -### File-Processing Thresholds +The analysis pipeline is divided into exactly four named stages in fixed order: Stage 1 Introspection, Stage 2 Extraction, Stage 3 Aggregation, and Stage 4 Derivation.[21] This four-stage contract is surfaced to users in walk report output. -The following thresholds are fixed pipeline constants: +The canonical set of section identifiers is fixed: -| Parameter | Value | +`domains`, `intent`, `capabilities`, `external_dependencies`, `integrations`, `cross_cutting`, `entities`, `hard_specifications`, `personas`, `user_stories`, `diagrams` + +Derivative sections must declare their upstream dependencies; those upstreams must exist in the taxonomy and appear earlier in the ordered sequence. This ordering constraint is validated at startup and failure raises an error. 
Per-file extraction is restricted to primary section IDs only; derivative sections are produced exclusively in Stage 4.[23] + +## Quality Scoring Rubric + +The quality rubric is fixed: + +| Score range | Meaning | |---|---| -| Maximum file size | 2,000,000 bytes | -| Minimum content (stripped) | 64 bytes | -| Chunk size | 150,000 bytes | -| Chunk overlap | 8,000 bytes | -| Manifest truncation limit | 20,000 bytes | +| 9–10 | Technology-agnostic, fully evidence-grounded, no unsupported claims | +| 6–8 | Largely sound with minor issues | +| 3–5 | Substantial gaps or partial coverage | +| 0–2 | Incoherent or off-brief | -Chunk overlap must satisfy `0 ≤ overlap < chunk_size`; chunk size must be positive. These invariants must hold for the recursive splitter to terminate correctly on all inputs, including whitespace-free monolithic files. +The score field is constrained to integers 0–10 inclusive. The default acceptance threshold (below which revision is triggered) is **7**. A revised body is only accepted when its score is greater than or equal to the score it replaces, preventing quality regressions. -### Cache Integrity +## Surgical Edit Constraints -The aggregation cache hash must span: file reference, summary, finding text, and the full structured sources list (file path, line range, and per-source fingerprint). Omitting any of these fields allows stale citation metadata to be replayed without re-aggregation. +When incremental updates are applied surgically: -Cache persistence must use an atomic write pattern — write to a sibling temporary file, then rename — to guarantee that a crash during saving never produces a corrupt cache file. +- Unchanged paragraphs **must** appear in the output exactly as they appeared in the input. This is a hard preservation rule, not a soft preference. 
+- Claim and finding indices are 1-based throughout: added findings are tagged `[A1]`, `[A2]`, …; cached claims are tagged `[C1]`, `[C2]`, …; removed findings are tagged `[R1]`, `[R2]`, …. Any deviation breaks citation re-anchoring. +- The contradictions field in surgical edit output is a **full replacement** of the cached contradictions list, not a delta; the model must include contradictions that survived the edit as well as any new ones. -### On-Disk Directory Layout +## Configuration Precedence -The directory layout (`.wikifi/`, `config.toml`, `.gitignore`, one markdown file per section, `.notes/`, `.cache/`) is the versioned contract with target projects and must not change in ways that break existing wikis. The `.notes/` and `.cache/` directories must always be excluded from version control; only section markdown files are committed. New required gitignore entries introduced in future versions must be backfilled automatically on the next initialization run. +Configuration resolution precedence is contractual: the wiki's own config file overrides environment variables when present.[29] This contract is printed at the top of every generated config file and must be preserved. -### Command-Line Interface Contract +## Inference Provider Constraints -The tool's entry point name and its four subcommands (`init`, `walk`, `chat`, `report`) are contractual interfaces for users and tooling. They must not be renamed or removed. +**Structured output** is obtained via the inference provider SDK's schema-constrained decoding path, not via manually constructed tool-use blocks. This path is load-bearing for extraction correctness. -### Pipeline Stage Boundaries +**Output token limits**: the default output token ceiling for the extended-reasoning hosted path is 32,000 tokens — headroom required because reasoning traces consume output tokens before producing the structured result. The default ceiling for the direct API path is 16,000 tokens. 
Reasoning-capable model families must receive the output token limit under a distinct request parameter name from plain chat models; sending the wrong parameter name causes a 400 error. -- **Stage 1** must operate without reading any source files; it sees only directory-level summaries and manifest contents. Source reading is exclusively Stage 2's responsibility. -- **Stage 2** targets only primary wiki sections during per-file extraction. Derivative sections (personas, user stories, diagrams) are explicitly excluded from this stage and are produced later from aggregated primary findings. -- Include and exclude patterns produced by Stage 1 must be in gitignore-style format relative to the repository root.[23] +**Temperature**: on any schema-constrained completion call via the local inference path, temperature must be 0 to preserve reproducibility of structured output. Thinking mode must not be disabled for certain local model families that rely on it for schema adherence, as disabling it causes the model to ignore the output schema and produce unparseable free text. -### Section Taxonomy Invariant +**Sampling parameters**: for certain hosted model variants, sampling parameters (temperature, top_p, top_k) must not be sent at all — the API returns a 400 error if they are present. -Every derivative section must reference only known section IDs, and every upstream dependency must appear earlier in the canonical section ordering. This ordering invariant is validated at module load time; violations raise an error. +**Request timeout**: the default request timeout is 900 seconds and must not be reduced when high-reasoning mode is active. -### Note Index Invariant +**System prompt position**: the system prompt must always be sent at message position 0 to ensure prefix caching eligibility. -Note indices presented to the synthesis stage are **1-based**. The resolution logic subtracts 1 before indexing into the notes list. 
This off-by-one invariant must be preserved if the prompting scheme changes. +**Model identifier routing**: a model name containing a colon separator (family:tag) is treated as a local inference identifier and swapped for a provider-appropriate default when a hosted provider is selected, except when the name begins with `ft:` (a fine-tuned model prefix), which is left unchanged. -### Derivative Section Output Formats +## File Classification Rules -Gherkin-style outputs must use proper `Given/When/Then` syntax inside fenced `gherkin` code blocks. Diagram outputs must be valid and inside fenced `mermaid` code blocks, with `graph`, `classDiagram`, `erDiagram`, and `sequenceDiagram` as the preferred diagram types. +**OpenAPI/Swagger detection** is performed against only the first 4,096 bytes of candidate files. Parse failures on API contract files must never crash the documentation walk; exactly one advisory finding must be emitted directing users to inspect the file manually. -### Provider API Contract +**Migration directory detection** uses a fixed enumeration of path tokens: `/migrations/`, `/alembic/`, `/db/migrate/`, `/database/migrations/`, `/prisma/migrations/`, `/flyway/`, `/liquibase/`. Files matching these tokens are classified as migrations rather than generic source. -The provider contract consists of exactly three interaction modes — structured completion, text completion, and chat. No other methods are ever invoked; any conforming implementation must satisfy all three signatures exactly. +**SQL extractor routing**: only migration files with the exact extension `.sql` or `.ddl` are routed to the SQL migration extractor. Migration files in any other form must fall through to prose LLM extraction. This rule must be preserved under any refactor. 
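The migration-directory and SQL-routing rules stated above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function names `is_migration_path` and `route_migration_file`, the returned extractor labels, the leading-slash normalisation, and the case-sensitive suffix comparison are all assumptions made for the example. Only the path tokens and the `.sql`/`.ddl` extensions come from the text.

```python
from pathlib import PurePosixPath

# Path tokens enumerated in the specification text above.
MIGRATION_TOKENS = (
    "/migrations/",
    "/alembic/",
    "/db/migrate/",
    "/database/migrations/",
    "/prisma/migrations/",
    "/flyway/",
    "/liquibase/",
)

# Extensions that are routed to the SQL migration extractor.
SQL_SUFFIXES = {".sql", ".ddl"}


def is_migration_path(rel_path: str) -> bool:
    """Return True when the path contains one of the fixed migration tokens."""
    # Assumption: normalise separators and prepend a slash so that a
    # top-level "migrations/..." directory also matches "/migrations/".
    normalised = "/" + rel_path.replace("\\", "/").lstrip("/")
    return any(token in normalised for token in MIGRATION_TOKENS)


def route_migration_file(rel_path: str) -> str:
    """Pick a (hypothetical) extractor label for a repository-relative path."""
    if not is_migration_path(rel_path):
        return "general"
    suffix = PurePosixPath(rel_path.replace("\\", "/")).suffix
    # Assumption: "exact extension" is read as a case-sensitive comparison.
    if suffix in SQL_SUFFIXES:
        return "sql_migration"
    # Any other migration form falls through to prose extraction.
    return "prose_llm"
```

Under these assumptions, a file such as `db/migrate/001_init.sql` would go to the SQL extractor, while `app/migrations/0001_initial.py` would fall through to prose extraction.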
-Additional provider-specific invariants that must be carried forward: +**Schema index preservation**: every index discovered in a schema is recorded with the explicit requirement that it encodes a query-time performance invariant the new system must preserve, treating index existence as a non-negotiable carry-forward obligation rather than an implementation detail. -- Sampling parameters (temperature, top_p, top_k) must **not** be sent to certain hosted reasoning models; doing so causes a 400 validation error. They must be omitted entirely, not conditionally included. -- Reasoning-capable model families (identified by specific name prefixes) must receive `max_completion_tokens` instead of `max_tokens`, and may optionally receive a `reasoning_effort` value of `low`, `medium`, or `high`. Non-reasoning models must not receive `reasoning_effort`. -- Disabling the reasoning trace on certain locally-hosted model families causes them to ignore schema constraints and emit free text. Reasoning must default to **high** and must never be disabled for these models in the structured-output path. -- The default request timeout for locally-hosted model backends is **900 seconds**, chosen to absorb 1–3 minute per-file latencies at high thinking levels. Reducing this timeout risks aborting in-progress reasoning traces. -- The default output token cap for the hosted cloud provider is **32,000** tokens per call; for the OpenAI-compatible provider it is **16,000** tokens per call. -- When the structured-output parse path returns no parsed object (due to refusal or truncation), the implementation must fall back to validating raw JSON text against the schema, not return null. The provider protocol's contract is to raise on failure, never to silently return nothing. 
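The parameter-naming rule in the provider invariants above (reasoning-capable families receive `max_completion_tokens` and an optional `reasoning_effort`, other models receive `max_tokens` only) might look like the sketch below. The helper name `build_token_params` and the `is_reasoning_model` flag are hypothetical. The text does not list the name prefixes that identify reasoning-capable families, so the caller is assumed to have decided that already.

```python
from typing import Any, Optional

# Effort levels named in the specification text.
ALLOWED_EFFORTS = {"low", "medium", "high"}


def build_token_params(
    is_reasoning_model: bool,
    max_output_tokens: int = 16_000,
    reasoning_effort: Optional[str] = None,
) -> dict[str, Any]:
    """Build the token-limit part of a request payload (illustrative only)."""
    if is_reasoning_model:
        params: dict[str, Any] = {"max_completion_tokens": max_output_tokens}
        if reasoning_effort is not None:
            if reasoning_effort not in ALLOWED_EFFORTS:
                raise ValueError(f"unsupported reasoning_effort: {reasoning_effort!r}")
            params["reasoning_effort"] = reasoning_effort
        return params
    # Non-reasoning models get the plain limit; any requested
    # reasoning_effort is dropped rather than forwarded.
    return {"max_tokens": max_output_tokens}
```

Keeping the two branches separate means a request never carries both parameter names, which is the failure mode the text associates with a 400 response.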
+## Hallucination Avoidance Contracts -### Model Identifier Heuristics +Two system-level prompts encode explicit anti-hallucination contracts: -When the hosted-Claude backend is configured but no matching model identifier is detected, the system falls back to `claude-opus-4-7`. The Ollama model identifier heuristic is: a string containing `:` that does not begin with the prefix `ft:` (case-insensitive). This exact rule must be carried forward without modification to avoid misclassifying fine-tuned model identifiers or Azure deployment IDs. +- The chat interface requires the assistant to ground every answer in the wiki material and to explicitly acknowledge when something is not covered, rather than fabricating detail. +- Derivative synthesis output must be grounded exclusively in the upstream sections provided; the model must declare a gap rather than invent any fact not supported by upstream evidence. -### Specialized Extractor Rules +## Fixed Exclusion Patterns -- Only migration files with `.sql` or `.ddl` suffixes are routed to the SQL migration extractor; all other migration files fall through to the general extraction path. Routing is determined by file suffix, not file-kind classification. -- When an API contract file is present but cannot be parsed, an explicit warning finding must be emitted directing migration teams to review the file manually. The file must not be silently dropped. -- Service-to-RPC attribution in protocol definition files must be computed by tracking brace depth (counting nested blocks), not by line proximity, to correctly handle multi-service files. -- Index definitions must be emitted as explicit findings recording that they encode query-time performance invariants which must be preserved through migration. 
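The model identifier heuristic described above (a name containing `:` that does not begin with `ft:`, compared case-insensitively) is small enough to sketch directly. The function names here are hypothetical, and the substitution step simply assumes the caller supplies the hosted default, such as the `claude-opus-4-7` fallback mentioned in the text.

```python
def looks_like_local_model(name: str) -> bool:
    """True for family:tag style names, excluding fine-tuned 'ft:' identifiers."""
    return ":" in name and not name.lower().startswith("ft:")


def resolve_hosted_model(name: str, hosted_default: str) -> str:
    """Swap a locally-formatted name for the hosted default; pass others through."""
    return hosted_default if looks_like_local_model(name) else name
```

A deployment identifier without a colon, or a fine-tuned identifier starting with `ft:`, passes through unchanged under this rule.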
+A fixed set of directory and file patterns is unconditionally excluded from the analysis pipeline regardless of user configuration or ignore-file contents, including: version control metadata directories, common dependency caches, build output directories, the tool's own working directory, and compiled binary file extensions (including lock files and minified assets). ## Supporting claims -- All synthesized wiki content must be technology-agnostic: no names of specific languages, frameworks, or libraries may appear in any generated output, and every observation must be expressed in domain terms. [1][2][3] -- Contradictions between source notes must never be silently resolved; any incompatible claims must produce a dedicated contradictions entry identifying each position and the supporting note indices. [4] -- All generated section bodies must declare gaps explicitly rather than speculating or inventing claims unsupported by upstream evidence; this constraint applies to the reviser as well. [2][5] -- The scoring rubric is fixed: 9–10 fully grounded and coherent; 6–8 minor issues; 3–5 substantial gaps; 0–2 incoherent or off-brief. [6] -- The minimum acceptable score for shipping a section without revision is 7. [6] -- A revised body is only accepted if its follow-up critique score is greater than or equal to the initial score; any revision that causes a score regression must be discarded and the original body retained. [7] -- Citations must be rendered as compact footnote-style markers with a Sources footer; line ranges formatted as path:start-end, path:line for single lines, and path alone when unknown. [8] -- Detected contradictions must appear verbatim in wiki output under a 'Conflicts in source' heading and must not be suppressed or merged. [9] -- Three exact sentinel strings mark unpopulated sections and must not be modified: 'Not yet populated', 'No findings were extracted', and 'upstream sections required to derive'. 
[10] -- Fingerprints are defined as the first 12 hexadecimal characters of a SHA-256 digest and this format must be preserved across any migration. [11] -- The maximum file size threshold is 2,000,000 bytes; the minimum content threshold is 64 bytes of stripped text; chunk size is 150,000 bytes with 8,000 bytes of overlap; manifest files are truncated to 20,000 bytes. [12][13][14] -- Chunk overlap must satisfy 0 ≤ overlap < chunk_size and chunk size must be positive; these invariants must hold for the recursive splitter to terminate correctly. [15] -- The aggregation cache hash must span file reference, summary, finding text, and the full structured sources list; omitting any field allows stale citation metadata to be replayed. [16] -- Cache persistence must use an atomic write pattern (write to a sibling temp file, then rename) to guarantee a crash never produces a corrupt cache file. [17] -- The on-disk directory layout is the versioned contract with target projects and must not change in ways that break existing wikis. [18] -- .notes/ and .cache/ directories must always be excluded from version control; only section markdown files are committed. New gitignore entries must be backfilled automatically on next init. [19] -- The tool's entry point name and its four subcommands (init, walk, chat, report) are contractual interfaces and must not be renamed or removed. [20] -- Stage 1 must operate without reading any source files; it sees only directory-level summaries and manifest contents. [21] -- Stage 2 targets only primary wiki sections; derivative sections are excluded and produced later from aggregated primary findings. [22] -- Every derivative section must reference only known section IDs and every upstream dependency must appear earlier in the canonical section ordering; violations raise an error at load time. 
[24] -- Note indices presented to the synthesis stage are 1-based and the resolution logic subtracts 1 before indexing; this off-by-one invariant must be preserved if the prompting scheme changes. [25] -- Gherkin-style outputs must use Given/When/Then syntax inside fenced gherkin code blocks; diagrams must be valid and inside fenced mermaid code blocks. [26] -- The provider contract consists of exactly three interaction modes — structured completion, text completion, and chat — and any conforming implementation must satisfy all three signatures exactly. [27] -- Sampling parameters must not be sent to certain hosted reasoning models; doing so causes a 400 error and they must be omitted entirely. [28] -- Reasoning-capable model families must receive max_completion_tokens instead of max_tokens and may receive reasoning_effort; non-reasoning models must not receive reasoning_effort. [29] -- Reasoning must default to high and must never be disabled for certain locally-hosted model families in the structured-output path, as disabling it causes schema constraints to be ignored. [30] -- The default request timeout for locally-hosted model backends is 900 seconds, chosen to absorb 1–3 minute per-file latencies; reducing it risks aborting in-progress reasoning traces. [31] -- The default output token cap for the hosted cloud provider is 32,000 tokens per call; for the OpenAI-compatible provider it is 16,000 tokens per call. [32][33][34] -- When the structured-output parse path returns no parsed object, the implementation must fall back to validating raw JSON text against the schema rather than returning null. [35][36] -- When the hosted-Claude backend is configured but no matching model identifier is detected, the system falls back to claude-opus-4-7. [37] -- The Ollama model identifier heuristic is: a string containing ':' that does not begin with the prefix 'ft:' (case-insensitive); this rule must be preserved exactly. 
[38]
-- Only migration files with .sql or .ddl suffixes are routed to the SQL migration extractor; routing is determined by file suffix, not file-kind classification. [39]
-- When an API contract file cannot be parsed, an explicit warning finding must be emitted; the file must not be silently dropped. [40]
-- Service-to-RPC attribution in protocol definition files must be computed by tracking brace depth, not by line proximity. [41]
-- Index definitions must be emitted as explicit findings recording that they encode query-time performance invariants that must be preserved through migration. [42]
-
-## Conflicts in source
-_The walker found disagreements across files. Migration teams should resolve these before re-implementation._
-
-- **Notes disagree on whether a file of exactly 2,000,000 bytes is skipped: one states files 'larger than' that threshold are skipped (exclusive boundary), while another states files 'at or above' that limit are skipped (inclusive boundary).**
-  - Files larger than 2,000,000 bytes are unconditionally skipped — implying a file of exactly 2,000,000 bytes would not be skipped. (`wikifi/config.py:59-65`)
-  - Files at or above 2,000,000 bytes are unconditionally skipped — implying a file of exactly 2,000,000 bytes would be skipped. (`wikifi/walker.py:61-79`)
+- All generated content at every stage of the pipeline must be expressed in technology-agnostic domain language with no specific language, framework, or library names. [1][2][3][4]
+- Files exceeding 2,000,000 bytes are unconditionally skipped and treated as vendored or generated noise. [5][6]
+- Files whose stripped text content is shorter than 64 bytes are unconditionally skipped to prevent speculative AI reasoning on stubs. [5][7]
+- The content chunking window is fixed at 150,000 bytes with an 8,000-byte overlap between adjacent chunks. [8][9]
+- The invariant 0 ≤ overlap < chunk_size is enforced as a hard error on any change to chunking logic. [9]
+- Finding identity is SHA-256(file::section_id::finding) truncated to 12 hex characters; a single-character change to any component is treated as a delete-plus-insert invalidating cached prose. [10]
+- Content fingerprints must be exactly 12 hexadecimal characters derived from the leading 12 characters of a SHA-256 digest computed over raw bytes. [11]
+- Aggregation cache keys must include the full sources list (file reference, line range, and content fingerprint per citation), not merely the finding text. [12]
+- A cache entry produced under critic-review mode must not be silently served to a non-review walk; the reverse is permitted. [13]
+- Derivative section cache reuse requires both a matching upstream-content hash and a matching review-mode flag. [14]
+- The wiki artefact directory must be named .wikifi/ inside the target project root; its absence is a hard pre-condition that blocks chat and report commands. [15][16]
+- The .wikifi/ directory layout is a stable forward-compatible contract; new scaffolding must be backfilled into existing wikis on the next initialisation run. [16]
+- Notes are stored as JSONL with one JSON object per line; each record must include a timestamp field in ISO-8601 UTC format, and this format must remain stable across versions. [17]
+- Citations must be rendered as compact footnote-style markers ([1], [2], …) with an explicit numbered Sources footer; line-precise references must use path/to/file:start-end or path/to/file:line format. [18]
+- Inline annotation uses a conservative verbatim substring match; paraphrased bodies must place claims in a separate Supporting claims list to prevent mis-attribution. [19]
+- Section bodies containing the strings 'Not yet populated', 'No findings were extracted', or 'upstream sections required to derive' are contractually treated as empty and excluded from quality scoring. [20]
+- The canonical set of section identifiers is fixed and must be preserved; derivative sections must declare upstreams that exist in the taxonomy and appear earlier in the ordered sequence, validated at startup. [22]
+- The quality rubric is fixed on a 0–10 integer scale with defined bands: 9–10 fully grounded, 6–8 largely sound, 3–5 substantial gaps, 0–2 incoherent. [24]
+- The default acceptance threshold below which revision is triggered is a score of 7; a revised body is only accepted when its score is greater than or equal to the score it replaces. [24][25]
+- Unchanged paragraphs in surgical edits must appear in the output exactly as they appeared in the input — a hard preservation rule. [26]
+- Claim and finding indices in surgical edit prompts are 1-based: added findings tagged [A1]/[A2]/…, cached claims tagged [C1]/[C2]/…, removed findings tagged [R1]/[R2]/…. [27]
+- The contradictions field in surgical edit output is a full replacement of the cached list, not a delta. [28]
+- Structured output is obtained via the SDK's schema-constrained decoding path, not via manually constructed tool-use blocks; this path is load-bearing for extraction correctness. [30]
+- The default output token ceiling for the extended-reasoning hosted path is 32,000 tokens; too low a limit causes the reasoning trace to exhaust the budget and return an empty structured response. [31]
+- The default output token ceiling for the direct API path is 16,000 tokens and the default request timeout is 900 seconds. [32]
+- Reasoning-capable model families must receive the output token limit under a distinct parameter name from plain chat models; sending the wrong name causes a 400 error. [33]
+- On schema-constrained completion calls via the local inference path, temperature must be 0 to preserve reproducibility of structured output. [34]
+- Thinking mode must not be disabled for certain local model families, as disabling it causes the model to ignore the output schema and produce unparseable free text. [34]
+- For certain hosted model variants, sampling parameters (temperature, top_p, top_k) must not be sent at all; the API returns a 400 error if they are present. [35]
+- The default request timeout is 900 seconds and must not be reduced when high-reasoning mode is active. [34][32]
+- The system prompt must always be sent at message position 0 to ensure prefix caching eligibility. [32]
+- A model name with a colon separator is treated as a local inference identifier and swapped for a provider default when a hosted provider is selected, except names beginning with ft: which are left unchanged. [36]
+- OpenAPI/Swagger detection is performed against only the first 4,096 bytes of candidate files; parse failures must never crash the walk and must emit exactly one advisory finding. [37][38]
+- Migration directory detection uses a fixed enumeration of path tokens: /migrations/, /alembic/, /db/migrate/, /database/migrations/, /prisma/migrations/, /flyway/, /liquibase/. [39]
+- Only migration files with the exact extension .sql or .ddl are routed to the SQL migration extractor; all other forms must fall through to prose LLM extraction. [40]
+- Every index discovered in a schema encodes a query-time performance invariant the new system must preserve — a non-negotiable carry-forward obligation. [41]
+- The chat interface requires the assistant to ground every answer in the wiki material and explicitly acknowledge gaps rather than fabricating detail. [42]
+- Derivative synthesis output must be grounded exclusively in the upstream sections provided; the model must declare a gap rather than invent unsupported facts. [43]
+- A fixed set of directory and file patterns is unconditionally excluded from the analysis pipeline regardless of user configuration or ignore-file contents. [44]
+- Contradictions must never be silently resolved; the aggregator must emit a structured contradictions[] entry naming each incompatible position and its supporting note indices. [1]
 ## Sources
-1. `wikifi/aggregator.py:57-59`
-2. `wikifi/critic.py:53-61`
-3. `wikifi/deriver.py:37-39`
-4. `wikifi/aggregator.py:61-63`
-5. `wikifi/deriver.py:34-50`
-6. `wikifi/critic.py:31-48`
-7. `wikifi/critic.py:137-147`
-8. `wikifi/evidence.py:43-52`
-9. `wikifi/evidence.py:121-131`
-10. `wikifi/report.py:103-108`
-11. `wikifi/fingerprint.py:23-27`
-12. `wikifi/config.py:59-65`
-13. `wikifi/config.py:66-81`
-14. `wikifi/walker.py:61-79`
-15. `wikifi/extractor.py:302-308`
-16. `wikifi/cache.py:243-255`
-17. `wikifi/cache.py:205-209`
-18. `wikifi/wiki.py:1-8`
-19. `wikifi/wiki.py:36-47`
-20. `wikifi/cli.py:1-7`
-21. `wikifi/introspection.py:5-9`
-22. `wikifi/extractor.py:51-56`
-23. `wikifi/introspection.py:50-58`
-24. `wikifi/sections.py:148-158`
-25. `wikifi/aggregator.py:167-173`
-26. `wikifi/deriver.py:40-45`
-27. `wikifi/providers/base.py:42-52`
-28. `wikifi/providers/anthropic_provider.py:14-17`
-29. `wikifi/providers/openai_provider.py:215-235`
-30. `wikifi/providers/ollama_provider.py:9-27`
-31. `wikifi/providers/ollama_provider.py:50-54`
-32. `wikifi/config.py:122-134`
-33. `wikifi/providers/anthropic_provider.py:70-79`
-34. `wikifi/providers/openai_provider.py:59-66`
-35. `wikifi/providers/anthropic_provider.py:107-145`
-36. `wikifi/providers/openai_provider.py:136-144`
-37. `wikifi/orchestrator.py:160-200`
-38. `wikifi/orchestrator.py:205-215`
-39. `wikifi/specialized/dispatch.py:28-62`
-40. `wikifi/specialized/openapi.py:24-37`
-41. `wikifi/specialized/protobuf.py:62-67`
-42. `wikifi/specialized/sql.py:115-121`
+1. `wikifi/aggregator.py:55-72`
+2. `wikifi/deriver.py:39-41`
+3. `wikifi/extractor.py:65-69`
+4. `wikifi/introspection.py:15-40`
+5. `wikifi/config.py:64-82`
+6. `wikifi/walker.py:70-73`
+7. `wikifi/walker.py:74-76`
+8. `wikifi/config.py:73-80`
+9. `wikifi/extractor.py:317-321`
+10. `wikifi/cache.py:389-408`
+11. `wikifi/fingerprint.py:22-52`
+12. `wikifi/cache.py:353-373`
+13. `wikifi/cache.py:234-248`
+14. `wikifi/deriver.py:148-160`
+15. `wikifi/cli.py:205-240`
+16. `wikifi/wiki.py:1-10`
+17. `wikifi/wiki.py:135-142`
+18. `wikifi/evidence.py:1-18`
+19. `wikifi/evidence.py:168-185`
+20. `wikifi/report.py:117-122`
+21. `wikifi/cli.py:151-200`
+22. `wikifi/sections.py:41-156`
+23. `wikifi/extractor.py:45-50`
+24. `wikifi/critic.py:33-52`
+25. `wikifi/critic.py:63-65`
+26. `wikifi/surgical.py:47-50`
+27. `wikifi/surgical.py:60-67`
+28. `wikifi/surgical.py:63-66`
+29. `wikifi/config.py:9-21`
+30. `wikifi/providers/anthropic_provider.py:19-24`
+31. `wikifi/providers/anthropic_provider.py:60-69`
+32. `wikifi/providers/openai_provider.py:55-61`
+33. `wikifi/providers/openai_provider.py:224-228`
+34. `wikifi/providers/ollama_provider.py:1-40`
+35. `wikifi/providers/anthropic_provider.py:20-27`
+36. `wikifi/orchestrator.py:307-316`
+37. `wikifi/repograph.py:111-120`
+38. `wikifi/specialized/openapi.py:7-11`
+39. `wikifi/repograph.py:94-108`
+40. `wikifi/specialized/dispatch.py:28-56`
+41. `wikifi/specialized/sql.py:112-121`
+42. `wikifi/chat.py:25-31`
+43. `wikifi/deriver.py:36-52`
+44. `wikifi/walker.py:22-52`
diff --git a/.wikifi/integrations.md b/.wikifi/integrations.md
index 86dc8ea..8635f93 100644
--- a/.wikifi/integrations.md
+++ b/.wikifi/integrations.md
@@ -1,105 +1,132 @@
 # Integrations
-## Outbound Integrations
+## External Integrations
-### Language-Model Providers
+### Outbound: Language-Model Providers
-The system maintains a uniform provider abstraction that isolates every pipeline stage from the concrete inference backend. Three selectable backends are supported — a locally-hosted model service, a hosted Anthropic-compatible service, and an OpenAI-compatible service — each implementing the same three-method contract: structured JSON completion, free-text completion, and multi-turn chat. The active backend is chosen by configuration; the orchestrator and all downstream stages call into it without branching on which concrete provider is live.
+All communication with external or locally-hosted language models is routed through a single shared provider abstraction that declares three call surfaces: schema-validated structured output, unstructured free-text completion, and stateful multi-turn conversation. Every pipeline stage that needs inference (extraction, aggregation, derivation, quality critique, surgical editing, and the chat REPL) calls exclusively through this interface rather than directly into any particular backend.
-Every stage that performs inference uses this abstraction:
+Three concrete provider backends are available and are selected at runtime via configuration:
-| Stage | Operation |
-|---|---|
-| Introspection (Stage 1) | Structured JSON completion to classify repository paths |
-| Extraction (Stage 2) | Structured JSON completion against a findings schema, per file |
-| Aggregation (Stage 3) | Structured JSON completion against a section-body schema |
-| Derivation (Stage 4) | Structured completion for personas, user stories, and diagrams |
-| Critic / Reviser | Structured completion for rubric scoring and body revision |
-| Chat session | Multi-turn chat grounded in populated wiki content |
+| Backend | Activation | Destination |
+|---|---|---|
+| Hosted (large-model API) | Default or explicit flag | External hosted inference API |
+| Alternative hosted | Provider selector | Separate external hosted API |
+| Local | Provider selector | Locally-running inference service |
-### External Tool and Capability Servers
+All three backends normalize API errors into uniform runtime errors before surfacing them, so the rest of the pipeline can apply a single fallback strategy regardless of which provider is active. The hosted backends additionally support features such as prompt caching and adaptive reasoning modes to make large-scale codebase walks economically viable.
-At the development and runtime boundary, the system is configured as a client that fans out to multiple external capability providers via a tool-server protocol. Four integrations are declared: a local AI utility, a local web crawler, a remote documentation context service, and a remote stitching/search service. This makes the system both a producer of wiki content and a consumer of external knowledge services during operation.
+### Inbound: CLI Entry Point
-### Filesystem and Layout Abstraction
+The command-line interface is the sole external entry point for human operators. It exposes four sub-commands and delegates entirely to internal modules — it contains no domain logic of its own. Configuration is resolved from up to three sources in strict precedence order: a per-target config file stored inside the target repository's own wiki directory, process-level environment variables (including a `.env` file), and built-in field defaults. The per-target config wins, so each analysed repository can drive its own pipeline settings.
-All pipeline stages read and write through a shared filesystem layout abstraction rather than addressing paths directly. Extraction findings are appended to a notes store; aggregated section bodies are written back through the same abstraction; the report and chat components read section markdown from the same on-disk layout. The cache layer uses this abstraction to locate its storage directory, and all cache reads and writes (keyed on file fingerprints and section-content hashes) pass through it.
+---
-### Import Graph
+## Internal Integration Topology
-The extraction stage integrates with a repository-wide import/reference graph. For each file being analysed, the graph supplies the file's direct neighbors — files it depends on and files that depend on it — which are injected into the extraction prompt to enable cross-file flow descriptions. The graph also drives the specialized-extractor dispatch path by classifying each file's structural kind before routing.
+### Orchestrator as Central Hub
-### Per-Project Configuration
+The orchestrator is the primary internal integration point. It sequences five subsystems in order:
-Project-specific provider selection, model preferences, caching behavior, and feature flags are read from a TOML configuration file stored inside each managed project's wiki directory. Parse failures fall back gracefully to environment-derived defaults rather than aborting the pipeline.
+1. **Repository introspection** — classifies the directory tree and decides which paths are worth analysing.
+2. **Extraction** — walks each included file, calling the provider for structured findings or delegating to deterministic specialized extractors.
+3. **Aggregation** — synthesizes per-file notes into coherent, citation-backed section bodies.
+4. **Derivation** — generates derivative content (personas, user stories, diagrams) from the finished primary sections.
+5. **Cache layer** — consulted and updated at every stage to avoid redundant work.
----
+
+### On-Disk State via the Wiki Layout
+
+All read and write access to wiki artifacts — section markdown files, per-section extraction note stores (JSONL), and related paths — flows through a single canonical layout abstraction. The extraction, aggregation, caching, orchestration, derivation, chat, and CLI modules all resolve their file paths through this shared type, making it the de facto integration bus for persistent state between pipeline stages.
+
+### Cache and Fingerprint Subsystems
+
+The cache module is consumed by the orchestrator, extractor, aggregator, and deriver to check and record results. It delegates all content hashing to a dedicated fingerprinting subsystem (stable short-form SHA-256 prefixes). The fingerprinting subsystem is also used independently by the evidence citation layer (pairing source paths with fingerprints and line ranges) and by the dependency graph subsystem (propagating cache invalidation across files that import one another).
+
+### Extraction Stage
+
+The extractor calls the language-model provider once per file chunk for unstructured source files and writes results to the notes store via an `append_note` interface consumed downstream by the aggregator. For structured file kinds (relational schema, API contract, service definition, graph schema, and migration files), it delegates to a dispatch layer that selects a deterministic specialized extractor and bypasses the model entirely. The dispatch layer in turn depends on the file classifier from the repository analysis module.
+
+The repository analysis module also builds an import/reference graph. The orchestrator uses this graph to inject neighbor-file context into each extraction prompt.
-## Inbound Integrations
+### Aggregation and Surgical Editing
-The primary entry point is the command-line interface, which exposes four subcommands (`init`, `walk`, `chat`, `report`). It constructs the provider instance and passes it directly into the chat and report capabilities. All other pipeline stages are driven by the orchestrator, which sequences introspection → extraction → aggregation → derivation and is itself invoked by the CLI `walk` subcommand.
+The aggregator coordinates with three internal subsystems: the cache (to determine whether a full rewrite or a surgical patch is needed), the surgical editor (which sits between the cache and the rendering layer for small deltas, calling the provider for targeted JSON-output edits), and the evidence renderer (which formats citation footers and conflict blocks from structured claim data).
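The short-form fingerprints mentioned under "Cache and Fingerprint Subsystems" (a 12-hex-character SHA-256 prefix over raw bytes, and a finding identity hashed over `file::section_id::finding`) can be sketched roughly as follows. This is a minimal illustration rather than the project's code: the function names `content_fingerprint` and `finding_id` are hypothetical, and UTF-8 encoding of the identity key is an assumption.

```python
import hashlib

# Prefix length taken from the hard specifications quoted in this diff.
FINGERPRINT_LEN = 12


def content_fingerprint(raw: bytes) -> str:
    """Return the leading 12 hex characters of SHA-256 over the raw bytes."""
    return hashlib.sha256(raw).hexdigest()[:FINGERPRINT_LEN]


def finding_id(file: str, section_id: str, finding: str) -> str:
    """Hypothetical helper: identity of one finding.

    Hashes 'file::section_id::finding' and truncates to 12 hex characters.
    Encoding the key as UTF-8 is an assumption of this sketch.
    """
    key = "::".join((file, section_id, finding))
    return content_fingerprint(key.encode("utf-8"))


if __name__ == "__main__":
    before = finding_id("wikifi/cache.py", "entities", "Cache entries are keyed by hash.")
    after = finding_id("wikifi/cache.py", "entities", "Cache entries are keyed by hash!")
    # A single-character change yields a different identity, so the cache
    # would treat it as a delete plus an insert.
    print(before, after, before != after)
```

Because the identity covers all three components, renaming a file or moving a finding to another section changes the identifier just as editing the finding text does.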
+
+### Quality Reporting and Critic
+
+The report module reads section markdown and notes files from disk via the wiki layout, reads the walk cache to determine total files processed, and optionally calls the critic subsystem to score section bodies. For derivative sections, the report also collects upstream section text to give the critic the context needed to evaluate how well the derivative synthesizes its sources. The critic and reviser themselves call the shared provider abstraction with structured schemas, keeping the quality pipeline independent of the active model.
+
+### Chat REPL
+
+The interactive chat feature has two integration dependencies: it reads all populated section files from disk via the wiki layout to assemble a grounding context, and it delegates each user turn to the configured provider through the shared interface, passing the full message history on every call.
+
+### Walker
+
+The filesystem walker is called by the introspection stage (to produce a compressed structural summary) and by the orchestrator/extractor stages (to supply the actual file list). It has no outbound integrations beyond the local filesystem and applies gitignore semantics plus hardcoded exclusion rules before any model pass runs.
 ---
 ## Integration Surfaces Detected in Analysed Codebases
-When the system analyses a target repository, it surfaces the following categories of integration touchpoint:
-
-- **HTTP API endpoints** — each parsed API contract contributes a finding recording the number of endpoints the analysed system exposes to external consumers, forming the inbound integration inventory.
-- **RPC service blocks** — each service definition is treated as an integration touchpoint; individual operations are described with their request and response types, including streaming direction where declared.
-- **Event-driven subscriptions** — subscription roots in schema definition files are mapped specifically to the integrations section, reflecting that they represent event-driven touchpoints rather than direct request/response capabilities.
-- **Relational foreign-key links** — cross-table references are recorded as hard relational links between entities, surfacing constraints that affect how components may be separated or migrated independently.
+When the system analyses a target codebase, several specialized extractors surface integration touchpoints in the target's own code:
-The specialized extractor dispatch layer acts as the internal routing hub between the upstream file classifier (which tags file kinds) and the downstream extractors responsible for each artifact type. Files that do not match a recognized kind fall through to the general LLM extraction path.
+- **Service definition files** — each declared service and its remote-procedure calls are recorded as integration touchpoints, including input/output message types and streaming direction per operation.
+- **API contract files** — the total number of endpoints exposed to external HTTP consumers is recorded, characterising the inbound HTTP integration surface.
+- **Graph schema files** — subscription roots are classified as integration touchpoints (event-driven outbound signals or real-time feeds) rather than as capabilities.
+- **Relational schema files** — foreign-key references between tables are emitted as directed relational links, recording the exact columns that form each join.
 ## Supporting claims
-- Three selectable LLM backends are supported — a locally-hosted model service, a hosted Anthropic-compatible service, and an OpenAI-compatible service. [1][2][3]
-- Each backend implements the same three-method contract: structured JSON completion, free-text completion, and multi-turn chat. [1][2][3]
-- The orchestrator and all downstream stages call into the provider without branching on which concrete provider is active. [4][1][2][3]
-- The introspection stage uses structured JSON completion to classify repository paths. [5]
-- The extraction stage uses structured JSON completion against a findings schema, per file. [6]
-- The aggregation stage uses structured JSON completion against a section-body schema. [7]
-- The critic/reviser uses structured completions for rubric scoring and body revision. [8]
-- The chat session uses multi-turn chat grounded in populated wiki content. [9]
-- Four external tool-server integrations are declared: a local AI utility, a local web crawler, a remote documentation context service, and a remote stitching/search service, making the system an MCP client that fans out to multiple capability providers. [10]
-- All pipeline stages read and write through a shared filesystem layout abstraction; the cache layer uses this abstraction to locate its storage directory. [11][12][13][14][15][16]
-- The extraction stage integrates with a repository-wide import/reference graph; each file's neighbors are injected into the extraction prompt. [17][18]
-- The import graph also drives the specialized-extractor dispatch path by classifying each file's structural kind. [18][19]
-- Project-specific settings are read from a TOML configuration file inside each managed project's wiki directory; parse failures fall back gracefully to defaults. [20][21]
-- The CLI constructs the provider instance and passes it directly into the chat and report capabilities. [4]
-- Each parsed API contract contributes a finding recording the number of HTTP endpoints the analysed system exposes to external consumers. [22]
-- Each RPC service block in a protocol definition is treated as an integration touchpoint, with operations described including streaming direction. [23]
-- Subscription roots are mapped specifically to the integrations section, reflecting event-driven touchpoints. [24]
-- Cross-table foreign-key references are recorded as hard relational links between entities, surfacing migration constraints. [25]
-- The specialized extractor dispatch layer routes recognized file kinds to dedicated extractors; unrecognized files fall through to the general LLM extraction path. [26][19][27]
-- Derivative sections are excluded from the aggregation stage and are instead populated by a separate deriver stage that runs afterwards. [28][14]
+- All communication with language models is routed through a single shared provider abstraction that declares three call surfaces: schema-validated structured output, unstructured free-text completion, and stateful multi-turn conversation. [1]
+- Every pipeline stage that needs inference (extraction, aggregation, derivation, quality critique, surgical editing, and the chat REPL) calls exclusively through this interface. [2][3][4][5][6][1]
+- Three concrete provider backends are available: two externally hosted and one local, selected at runtime via configuration. [7][8][9]
+- All three backends normalize API errors into uniform runtime errors before surfacing them. [7][9]
+- The command-line interface is the sole external entry point and delegates entirely to internal modules, containing no domain logic of its own. [10]
+- Configuration is resolved from up to three sources in strict precedence order: a per-target config file, process-level environment variables, and built-in field defaults, with the per-target config winning. [11]
+- The orchestrator sequences five internal subsystems: introspection, extraction, aggregation, derivation, and the cache layer. [12]
+- All read and write access to wiki artifacts flows through a single canonical wiki layout abstraction shared by extraction, aggregation, caching, orchestration, derivation, chat, and CLI modules. [13]
+- The cache module is consumed by the orchestrator, extractor, aggregator, and deriver, and delegates all content hashing to a dedicated fingerprinting subsystem. [14][15]
+- The fingerprinting subsystem is also used by the evidence citation layer and the dependency graph subsystem. [15]
+- For structured file kinds, the extractor delegates to a deterministic dispatch layer that bypasses the language model entirely. [16][17]
+- The dispatch layer depends on the file classifier from the repository analysis module. [18][17]
+- The repository analysis module builds an import/reference graph used by the orchestrator to inject neighbor-file context into extraction prompts. [18]
+- Completed extraction results are written to a notes store (one JSONL file per section) via an append_note interface consumed downstream by the aggregator. [19]
+- The aggregator coordinates with the cache, the surgical editor, and the evidence renderer as internal subsystems. [2]
+- The surgical editor sits between the cache and the rendering layer for small deltas, calling the provider for targeted structured edits. [6][20]
+- The report module reads section files and notes from disk, reads the walk cache for total file counts, and optionally calls the critic to score section bodies. [21]
+- For derivative sections, the report collects upstream section text to give the critic context-aware scoring capability. [22]
+- The chat REPL reads all populated section files from disk via the wiki layout and delegates each user turn to the configured provider with the full message history. [23][3]
+- The filesystem walker is called by the introspection stage and by the orchestrator/extractor stages, and has no outbound integrations beyond the local filesystem. [24]
+- Service definition files surface each declared service and its remote-procedure calls as integration touchpoints, including input/output message types and streaming direction per operation. [25]
+- API contract files surface the total number of endpoints exposed to external HTTP consumers as an inbound integration surface finding. [26]
+- Graph schema subscription roots are classified as integration touchpoints rather than capabilities, reflecting their event-driven nature. [27]
+- Relational schema foreign-key references are emitted as directed relational links recording the exact columns that form each join. [28]
 ## Sources
-1. `wikifi/providers/anthropic_provider.py:83-106`
-2. `wikifi/providers/ollama_provider.py:44-46`
-3. `wikifi/providers/openai_provider.py:1-9`
-4. `wikifi/cli.py:176-179`
-5. `wikifi/introspection.py:61-70`
-6. `wikifi/extractor.py:220-235`
-7. `wikifi/aggregator.py:136-141`
-8. `wikifi/critic.py:30-32`
-9. `wikifi/chat.py:52-55`
-10. `.mcp.json:2-36`
-11. `wikifi/aggregator.py:109-160`
-12. `wikifi/cache.py:30-32`
-13. `wikifi/chat.py:63-82`
-14. `wikifi/deriver.py:73-107`
-15. `wikifi/extractor.py`
-16. `wikifi/report.py:78-130`
-17. `wikifi/extractor.py:213-215`
-18. `wikifi/repograph.py:1-10`
-19. `wikifi/specialized/dispatch.py:36-62`
-20. `wikifi/cli.py:103-105`
-21. `wikifi/config.py:169-200`
-22. `wikifi/specialized/openapi.py:96-103`
-23. `wikifi/specialized/protobuf.py:64-90`
-24. `wikifi/specialized/graphql.py:108-110`
-25. `wikifi/specialized/sql.py:88-98`
-26. `wikifi/specialized/__init__.py:7-8`
-27. `wikifi/specialized/models.py:30-31`
-28. `wikifi/aggregator.py:111-116`
+1. `wikifi/providers/base.py:37-57`
+2. `wikifi/aggregator.py:118-210`
+3. `wikifi/chat.py:50-55`
+4. `wikifi/critic.py:27-29`
+5. `wikifi/extractor.py:228-241`
+6. `wikifi/surgical.py:196-220`
+7. `wikifi/providers/anthropic_provider.py:117-145`
+8. `wikifi/providers/ollama_provider.py:43-100`
+9. `wikifi/providers/openai_provider.py:1-20`
+10. `wikifi/cli.py:21-29`
+11. `wikifi/config.py:1-26`
+12. `wikifi/orchestrator.py:57-240`
+13. `wikifi/wiki.py:54-76`
+14. `wikifi/cache.py`
+15. `wikifi/fingerprint.py:1-17`
+16. `wikifi/extractor.py:216-235`
+17. `wikifi/specialized/dispatch.py:18-60`
+18. `wikifi/repograph.py:1-16`
+19. `wikifi/extractor.py:265-268`
+20. `wikifi/surgical.py:196-234`
+21. `wikifi/report.py:72-130`
+22. `wikifi/report.py:153-163`
+23. `wikifi/chat.py:62-82`
+24. `wikifi/walker.py:1-11`
+25. `wikifi/specialized/protobuf.py:66-97`
+26. `wikifi/specialized/openapi.py:90-96`
+27. `wikifi/specialized/graphql.py:103-106`
+28. `wikifi/specialized/sql.py:87-98`
diff --git a/.wikifi/intent.md b/.wikifi/intent.md
index c21b850..f68c661 100644
--- a/.wikifi/intent.md
+++ b/.wikifi/intent.md
@@ -1,71 +1,57 @@
 # Intent and Problem Space
-wikifi exists to answer a question that becomes increasingly urgent as software systems age: **what does this codebase actually do, and why?** Raw source code communicates intent only to engineers already steeped in its conventions; it is opaque to architects planning migrations, to onboarding teams, and to stakeholders who need to reason about capability and risk without reading every file.
+Large software repositories — especially long-lived, legacy monorepos — accumulate operational knowledge that is scattered across files, embedded in implementation choices, and never captured as coherent documentation. Developers asked to understand or migrate such a system face two intertwined challenges: the sheer volume of source material makes manual review impractical, and the documentation that does exist describes *how* the system is built rather than *what it does for its users*.
-The system addresses this by walking a repository and producing a **technology-agnostic, evidence-grounded wiki** — documentation that describes a system's purpose, entities, capabilities, and integrations in terms that survive a technology change. Its design is shaped above all by the needs of migration teams: every assertion must be traceable to a precise location in the source, contradictions between files must be surfaced rather than silently resolved, and the output must remain readable when the tool is re-run against an upgraded codebase.
+The system exists to automatically generate a technology-agnostic wiki from a target repository — a structured body of documentation that describes what a system accomplishes for its users, deliberately decoupled from the languages, frameworks, and libraries used to build it. Because the wiki captures intent rather than implementation, it remains legible and actionable even when the underlying technology stack is replaced entirely.
-### The problems it is designed to solve
+## Who It Is For
-**Navigating unknown repositories at scale.** A large legacy codebase may contain tens of thousands of files, the majority of which are scaffolding, build artifacts, dependency caches, or tests — not intent-bearing application logic. The tool must decide, before reading any file's content, which paths are worth analysing. Without this, processing cost grows with repository size rather than with the amount of meaningful content.
+The primary audience is:
-**Translating implementation detail into intent.** Source files express *how* a system works. The wiki must express *what* it does and *why*. This translation cannot be a manual step if the output is to stay current. The same requirement applies to structured contract files — schema definitions, interface specifications, API descriptions, and protocol definitions — which encode authoritative domain models but only in formats that require interpretation to surface as human-readable findings. Targeted, deterministic parsers handle these files more accurately and efficiently than a general prose-generation path, so the system routes them separately.
+- **Developers** who need to understand a codebase they did not write
+- **Migration leads and architects** who need to scope or fund a re-implementation
+- **Teams maintaining legacy systems** where tribal knowledge hides in inconsistencies scattered across files
-**Scalability and cost.** Processing a large codebase with an AI-assisted extraction pass is expensive if every file requires the same full-cost call on every run. The system is designed to make repeated walks practical: a content-addressed cache ensures that only files whose bytes have changed trigger new AI calls, and the ability to resume after mid-run interruptions is a direct by-product of the same mechanism. For hosted AI providers, the system exploits prompt-caching facilities so that a large shared system prompt used across hundreds of per-file calls pays a fraction of the standard input-token price, making hosted extraction cost-competitive at scale.
+Before committing resources to a re-implementation, migration leads typically ask two questions: did the analysis walk cover the full system, and is the resulting wiki trustworthy enough to act on? The system is designed to answer both, and to do so without modifying any wiki content during the reporting pass, making it safe to run in automated pipelines.
-**Preventing hallucination and unsupported claims.** Single-shot synthesis — producing a section in one pass without verification — is the most common source of fabricated detail. The system includes a quality-assurance layer that scores each synthesised section against the upstream evidence, flags unsupported claims, identifies coverage gaps, and optionally rewrites sections that fall below a defined quality threshold. Derivative sections — inferred personas, user stories, and structural diagrams — which emerge only from the aggregate of other sections rather than from individual files, receive this treatment by design.
+## Constraints That Shape the Design -**Surfacing disagreements, not hiding them.** When two source files make incompatible claims about the same aspect of the system, a naive synthesis would silently pick one or blend them into a misleading statement. The system requires instead that conflicts be named and attributed to specific sources, so that a migration team can investigate rather than inherit a wrong assumption. - -**Conversational access to the wiki.** Beyond static documents, the system provides a grounded conversational interface over the generated wiki content, allowing users to query findings in natural language. The assistant is constrained to cite section names and admit gaps rather than invent answers, extending the evidence-grounding guarantee into the interactive surface. - -### Constraints that shape the design +Several explicit constraints govern the system's design decisions end to end: | Constraint | What it means in practice | |---|---| -| Tech-agnosticism | All output is in domain terms; no technology names appear in findings or wiki sections | -| Evidence traceability | Every claim links to the source files that justify it | -| Explicit contradiction handling | Conflicts between files are named, not merged away | -| Local-first operation | Defaults to a locally-hosted inference server; cloud providers are explicit opt-ins | -| Stable output contract | The on-disk wiki layout is versioned and isolated, so existing wikis survive tool upgrades | -| Primary vs. 
derivative separation | Per-file evidence feeds primary sections; aggregate synthesis produces derivative sections; the distinction is enforced structurally | +| **Technology agnosticism** | Every extracted observation must describe what the code accomplishes for users; implementation-specific terms are excluded at the extraction stage | +| **Traceability** | Every assertion in the wiki is linked to the precise source location that supports it, so an architect can verify any claim without trusting unsourced documentation | +| **Contradiction surfacing** | When source files contain conflicting observations, the system surfaces those conflicts explicitly rather than silently resolving or discarding them | +| **Quality over wall-time** | Configuration explicitly prioritises documentation quality over processing speed | +| **Incremental efficiency** | Repeated analysis of large codebases must be practical; unchanged work is skipped so that re-runs take minutes rather than hours | +| **Backend independence** | The AI inference backend can be swapped without affecting any pipeline stage, preventing vendor lock-in at the infrastructure level | +| **Economic viability** | Calling a hosted AI service for hundreds of per-file extraction passes without cost controls would be prohibitive; prompt-cache reuse makes the hosted path economically competitive | + +These constraints are not incidental implementation choices — they are the stated rationale behind the system's four-stage pipeline, its caching architecture, its structured evidence model, and its provider abstraction layer. ## Supporting claims -- The system exists to produce a technology-agnostic wiki from a source code repository, helping users understand what a system does and why, independent of the technologies used to build it. 
[1][2][3] -- Its design is shaped by the needs of migration teams: every assertion must be traceable to a precise source location and contradictions between files must be surfaced rather than silently resolved. [4][5][6] -- Before reading file content, the system must decide which repository paths contain intent-bearing production code versus scaffolding, tests, or build artifacts. [7][8] -- Structured contract files (schema definitions, interface specifications, API descriptions, protocol definitions) are routed to targeted deterministic parsers rather than the general AI extraction path, because their machine-readable structure can be extracted more accurately that way. [9][10][11][12][13][14] -- A content-addressed cache ensures that only files whose bytes have changed trigger new AI calls on repeated walks, and resumability after mid-run interruptions is a direct by-product of the same mechanism. [15][16] -- For hosted AI providers, prompt-caching facilities are exploited so that a large shared system prompt used across many per-file calls pays a fraction of the normal input-token price, making hosted extraction economically viable at scale. [17] -- A quality-assurance layer scores each synthesised section against upstream evidence, flags unsupported claims and gaps, and optionally rewrites sections that fall below a defined quality threshold. [5] -- Derivative sections — personas, user stories, and structural diagrams — can only emerge from the aggregate of all primary sections; no single file contains enough signal to produce them reliably, so they are synthesised in a separate stage after primary sections are complete. [18][3] -- When source files make incompatible claims, conflicts must be named and attributed to specific sources rather than merged, so migration teams can investigate rather than inherit wrong assumptions. 
[4] -- The system defaults to a locally-hosted inference server, treating cloud AI providers as explicit opt-ins, reflecting a local-first design philosophy. [19] -- The on-disk wiki layout is treated as a stable, versioned contract between the tool and the projects that consume it, and its definition is kept isolated from other modules. [20] -- The system provides a grounded conversational interface over the generated wiki, constraining the assistant to cite section names and admit gaps rather than fabricate answers. [21] -- The extraction stage enriches per-file analysis by supplying each file's import-graph neighborhood as context, allowing findings to mention cross-file relationships rather than treating each file in isolation. [22] -- A read-only reporting capability answers pre-migration coverage and quality questions — whether the walk covered the entire system and whether the resulting wiki is reliable enough to act on — without modifying any wiki content. [23] +- The system exists to automatically generate a technology-agnostic wiki from a target repository, decoupling what the system does from the technologies used to build it. [1][2][3] +- The primary audience is developers who need to understand or migrate a codebase, and migration leads or architects who need to scope a re-implementation. [1][4][5] +- Legacy systems accumulate tribal knowledge that hides in inconsistencies scattered across files. [6] +- Every extracted observation must describe what the code accomplishes for users, never naming specific technologies, so the wiki survives a technology migration. [1][7] +- Every assertion in the wiki is linked to the precise source location that supports it, so any architect can ask 'where in the source did this come from?' and receive a verifiable answer. [4] +- When source files contain conflicting or inconsistent observations, the system surfaces those contradictions explicitly rather than silently merging or discarding them. 
[6] +- Configuration explicitly prioritises wiki quality over processing wall-time. [8] +- Repeated analysis of large codebases (including 50,000-file legacy monorepos) must be efficient; unchanged work is skipped so re-runs take minutes rather than hours. [9][2] +- The AI inference backend can be swapped by implementing a single abstract contract, keeping all pipeline stages decoupled from any specific vendor. [10] +- Without deliberate cost controls, calling a hosted AI service for hundreds of per-file extraction passes would be cost-prohibitive; prompt-cache reuse makes hosted inference economically viable. [11] +- Before funding a re-implementation, migration leads ask two questions: did the analysis walk cover the full system, and is the resulting wiki trustworthy enough to act on? [5] ## Sources -1. `wikifi/cli.py:1-10` -2. `wikifi/orchestrator.py:1-17` -3. `wikifi/sections.py:1-19` -4. `wikifi/aggregator.py:1-15` -5. `wikifi/critic.py:1-15` -6. `wikifi/evidence.py:1-18` -7. `wikifi/introspection.py:1-9` -8. `wikifi/walker.py:1-12` -9. `wikifi/specialized/__init__.py:1-12` -10. `wikifi/specialized/dispatch.py:1-13` -11. `wikifi/specialized/models.py:1-8` -12. `wikifi/specialized/openapi.py:1-11` -13. `wikifi/specialized/protobuf.py:1-8` -14. `wikifi/specialized/sql.py:1-13` -15. `wikifi/cache.py:1-20` -16. `wikifi/extractor.py:1-30` -17. `wikifi/providers/anthropic_provider.py:1-19` -18. `wikifi/deriver.py:1-18` -19. `wikifi/config.py:1-26` -20. `wikifi/wiki.py:1-8` -21. `wikifi/chat.py:1-32` -22. `wikifi/repograph.py:1-30` -23. `wikifi/report.py:1-16` +1. `wikifi/cli.py:1-11` +2. `wikifi/orchestrator.py:1-27` +3. `wikifi/sections.py:1-18` +4. `wikifi/evidence.py:1-18` +5. `wikifi/report.py:1-15` +6. `wikifi/aggregator.py:1-16` +7. `wikifi/extractor.py:1-20` +8. `wikifi/config.py:1-26` +9. `wikifi/cache.py:1-35` +10. `wikifi/providers/base.py:1-19` +11. 
`wikifi/providers/anthropic_provider.py:1-18` diff --git a/.wikifi/personas.md b/.wikifi/personas.md index 39e4948..848b324 100644 --- a/.wikifi/personas.md +++ b/.wikifi/personas.md @@ -1,139 +1,148 @@ # User Personas -Four personas emerge from the aggregate of what the system does, the problems it is designed to solve, and the interaction surfaces it exposes. No persona is inferred from a single capability; each is grounded in the convergence of multiple upstream sections. +Four distinct personas emerge from the aggregate of capabilities, integrations, and stated intent. Each is grounded in what the system demonstrably does; roles the upstreams are silent on are noted as gaps at the end. --- -## Persona 1 — The Migration Planner +## Persona 1 — The Onboarding Developer -> *Leads or participates in the planned migration of a legacy system to a new architecture, organizational boundary, or technology set.* +> *"I have inherited a large codebase I did not write. I need to understand what it does before I can safely change it."* -The upstream Intent section names migration teams as the primary design audience. Every major architectural decision — evidence traceability, explicit conflict surfacing, technology-agnostic output, and a pre-migration coverage report — is framed in terms of what a migration team needs. +### Profile +A developer who joins a project mid-life and must build a mental model of an unfamiliar system. The codebase may be a long-lived monorepo where operational knowledge is scattered across files and never captured as coherent documentation. ### Goals -- Establish a shared, authoritative understanding of what an existing system does *before* any changes are made. -- Identify hidden dependencies, integration touchpoints, and entity relationships that could break during migration. -- Determine whether the generated documentation is reliable enough to act on. +- Quickly form an accurate picture of what the system accomplishes for its users.
+- Navigate cross-module flows without reading every file individually. +- Ask follow-up questions when the static documentation raises new ones. ### Needs -| Need | System capability that serves it | +| Need | How the system addresses it | |---|---| -| Every claim traceable to a precise source location | SourceRef model with path and line-range citations on every Claim | -| Explicit surfacing of file-level contradictions | Contradiction entity; conflict-detection pipeline stage | -| Tech-agnostic output that survives a technology change | Domain-terms-only output enforced throughout the pipeline | -| Coverage assurance before migration begins | Read-only reporting capability (WikiReport, WikiQualityReport, CoverageStats) | -| Integration inventory of the target system | Specialized extractors surface HTTP endpoints, RPC services, event subscriptions, and foreign-key constraints | +| Intent over implementation | Wiki describes what the system does, not how it is built, so the developer's mental model survives future technology changes | +| Cross-module flow descriptions | The cross-file reference graph is consulted per file so findings describe flows between modules rather than treating each file in isolation | +| Trustworthy claims | Every assertion is anchored to a specific source file and line range, so the developer can verify any claim without trusting unsourced documentation | +| Interactive clarification | An interactive conversational session grounded in the wiki content supports multi-turn questions, with only fully populated sections included to prevent placeholder content from diluting answers | ### Pain Points -- Legacy codebases contain implicit, undocumented knowledge that lives only in engineers' heads. -- Naive documentation tools silently merge conflicting assertions, causing migration teams to inherit wrong assumptions. -- Documentation written in implementation terms becomes meaningless after the technology changes. 
+- Sheer volume of source material makes manual review impractical. +- Existing documentation, where it exists, describes implementation rather than intent. +- Tribal knowledge is hidden in inconsistencies scattered across files. -### Use Cases Served -- Running the full four-stage pipeline walk against the target repository. -- Reviewing surfaced contradictions and tracing each back to its originating source files. -- Using the pre-migration coverage report to verify that the walk reached all intent-bearing areas of the repository. -- Reading the integrations section to catalog external dependencies that must be preserved, replaced, or renegotiated. +### Primary Use Cases +- Reading primary wiki sections (domains, intent, capabilities, entities) to build an initial model. +- Querying the interactive chat session to chase down specific flows or clarify ambiguous sections. +- Reviewing contradiction blocks to understand where the source itself is inconsistent, rather than receiving a silently resolved answer. --- -## Persona 2 — The Onboarding Engineer +## Persona 2 — The Migration Lead or Architect -> *A software engineer who is new to a codebase and needs to build a working mental model quickly without reading tens of thousands of files.* +> *"Before I can scope this re-implementation, I need to know the analysis covered the full system and that the resulting wiki is trustworthy enough to act on."* -The Intent section explicitly identifies onboarding teams as a target audience. The conversational chat interface, cross-file import-graph enrichment, and the section structure covering entities, capabilities, and integrations all directly address the onboarding problem. +### Profile +A technical decision-maker — lead architect, principal engineer, or programme manager — who must scope, fund, or execute a re-implementation of a legacy system. They are not reading the source themselves; they are reading the wiki as the evidentiary basis for consequential decisions. 
### Goals -- Build a coherent picture of domain entities, system capabilities, and integration points without a guided tour. -- Locate the code that actually matters — not scaffolding, build artifacts, or vendored dependencies. -- Ask targeted questions about unfamiliar parts of the system and get grounded, citable answers. +- Confirm that the analysis walk covered the full system, not just the easy-to-parse parts. +- Verify that every claim in the wiki can be traced back to a specific source location. +- Identify where the source itself contains contradictions that a migration team will need to resolve. +- Assess documentation quality before committing to use the wiki as migration input. ### Needs -| Need | System capability that serves it | +| Need | How the system addresses it | |---|---| -| Domain-level descriptions of each file's role | Per-file FileFindings with one-sentence role summary | -| Cross-file relationship context | Import-graph neighborhood injected into extraction prompts | -| Navigable, structured wiki | Eight primary sections plus derivative persona, user-story, and diagram sections | -| Interactive question-and-answer over the wiki | ChatSession grounded in populated wiki content; constrained to cite sections and admit gaps | +| Coverage assurance | A coverage and quality report summarises, per section, how many files contributed findings and how complete the analysis was | +| Traceability | Every wiki assertion is rendered with inline citation markers linked to a numbered source footer; any claim can be traced to the originating file and line range | +| Contradiction visibility | Where two or more sources assert incompatible things, the system surfaces an explicit conflict block rather than silently reconciling the disagreement | +| Quality scoring | An optional critic-and-reviser cycle scores section bodies against a structured rubric, producing per-section quality scores and unsupported-claim flags alongside an overall mean score | +| 
Technology agnosticism | Observations are expressed in domain terms, keeping the wiki legible and actionable even when the underlying technology stack is replaced entirely | ### Pain Points -- Raw source code communicates intent only to engineers already steeped in its conventions. -- Large repositories contain mostly scaffolding and build artifacts, making it hard to identify intent-bearing logic. -- There is no interactive way to ask targeted questions about the codebase without interrupting senior colleagues. +- Cannot act on documentation that lacks a verifiable source — any unsourced claim introduces risk into a re-implementation plan. +- A partial analysis walk that silently skips complex or malformed files produces a false sense of completeness. +- Silent conflict resolution hides exactly the ambiguities a migration team most needs to know about. -### Use Cases Served -- Reading generated wiki sections to understand the system's entities, capabilities, and external dependencies. -- Using the conversational interface (`chat` subcommand) to ask domain questions and receive answers that cite specific wiki sections. -- Following numbered citations in section narratives back to the originating source files to dive deeper. +### Primary Use Cases +- Running or reviewing a quality report (with the critic loop enabled) before sign-off. +- Inspecting the coverage report to confirm file contribution counts per section. +- Using contradiction blocks as an explicit work-item list for the migration team. +- Reviewing derivative sections (personas, user stories, diagrams) synthesized from the aggregated primaries. --- -## Persona 3 — The Non-Technical Stakeholder +## Persona 3 — The Legacy System Maintainer -> *A product owner, business analyst, or executive who needs to reason about a system's capabilities and risks without reading or interpreting code.* +> *"The system keeps changing. 
I cannot afford to rewrite the documentation by hand every time, and I need to know when something has gone inconsistent."* -The Intent section explicitly names stakeholders as an audience who must be able to reason about capability and risk. The technology-agnostic output constraint and the business-domains primary section exist precisely to serve this persona. +### Profile +A developer or small team responsible for keeping a production system running over months or years. The codebase changes continuously; documentation written at one point in time drifts out of date and becomes a liability rather than an asset. ### Goals -- Understand what the system does at a domain level, in terms that do not require engineering background. -- Assess system capability and integration risk for planning, prioritization, or compliance purposes. +- Keep the wiki accurate as the codebase evolves, without manual documentation effort. +- Know when a recent change has introduced an inconsistency. +- Avoid paying the full analysis cost on every run when only a small part of the codebase changed. 
### Needs -| Need | System capability that serves it | +| Need | How the system addresses it | |---|---| -| Output free of implementation terminology | Tech-agnosticism enforced as a hard constraint across all pipeline stages | -| Business-level capability summary | Business domains and system intent as dedicated primary sections | -| Readable summary of external dependencies | Integrations section populated from specialized extractors | -| Ability to ask questions in plain language | Grounded conversational interface constrained to cite sections rather than invent answers | +| Automated currency | The system re-analyses changed files on each run without any manual authoring | +| Incremental efficiency | Only files whose content has changed are re-processed; unchanged sections are served from cache; a run in which nothing has changed is a complete no-op | +| Surgical preservation | When only a small subset of findings changes, targeted in-place edits preserve established prose, citation numbering, and unaffected paragraphs verbatim | +| Inconsistency surfacing | Contradiction blocks make newly introduced conflicts visible immediately after the next run | +| Crash resilience | Cache state is persisted after each file, so a crash at any stage resumes from the last completed file rather than restarting from scratch; malformed files are flagged for manual review rather than halting the run | ### Pain Points -- Technical documentation is written for engineers and is impenetrable without prior context. -- There is no single, reliable source of truth about what a system does expressed in business terms. -- Asking engineers directly is time-consuming and produces inconsistent answers. +- Full rewrites of documentation on every run would erase carefully reviewed prose and reset citation numbering. +- A pipeline that halts on a single unparseable file blocks the entire team. 
+- Re-running a full analysis on a large codebase just because one file changed is economically and practically unacceptable. -### Use Cases Served -- Reading the generated wiki sections covering business domains, system intent, and capabilities. -- Using the conversational interface to query specific capabilities or integration touchpoints in plain language. -- Sharing wiki output as a neutral artifact that communicates system scope across engineering and non-engineering audiences. +### Primary Use Cases +- Scheduling incremental re-runs after code merges to keep the wiki current. +- Reviewing the post-run report to spot newly surfaced contradictions or newly empty sections. +- Checking flagged files that could not be parsed to decide whether manual review is warranted. --- -## Persona 4 — The Platform or Documentation Engineer +## Persona 4 — The Pipeline Operator -> *An engineer responsible for operating and maintaining the documentation pipeline across one or more projects, managing inference costs, and ensuring quality on repeated runs.* +> *"This needs to run in an automated environment. I cannot babysit it, and it must not modify content during a reporting pass."* -The Capabilities section identifies low marginal effort on subsequent runs as a core value proposition. The caching model, quality-review loop, configurable provider selection, resumability, and per-project TOML configuration all point to an operator persona distinct from those who consume the output. +### Profile +An engineer or team responsible for integrating wiki generation into an automated workflow — for example, a scheduled job that runs after each significant merge. They interact with the system primarily through the command-line interface and configuration, not through the interactive chat. ### Goals -- Keep documentation current with minimal re-processing cost on every repository change. 
-- Ensure that generated sections meet a defined quality bar before they are distributed or acted upon. -- Control inference costs across multiple projects with different size and sensitivity profiles. +- Run the full pipeline unattended without human intervention. +- Swap the inference backend without changing any pipeline logic. +- Control cost by governing which provider is active and whether prompt caching is used. +- Ensure the reporting pass does not modify any wiki content (making it safe for automated pipelines). ### Needs -| Need | System capability that serves it | +| Need | How the system addresses it | |---|---| -| Incremental re-processing on repeated walks | Content-addressed cache (CachedFindings, CachedSection); only changed files trigger new inference calls | -| Recovery after interruption | Resumability as a direct by-product of the cache mechanism | -| Quality assurance with automatic revision | Critic/reviser loop; Critique and ReviewOutcome entities; configurable quality threshold | -| Cost control across provider types | Configurable provider selection (local-first default, hosted as explicit opt-in); prompt-caching for hosted providers | -| Per-project tuning | TOML configuration file per project; graceful fallback to environment-derived defaults on parse failure | -| Audit trail for a run | WalkReport carrying IntrospectionResult, ExtractionStats, AggregationStats, DerivationStats, and ReviewOutcomes | +| Single entry point | The command-line interface is the sole external entry point; it delegates entirely to internal modules and contains no domain logic | +| Layered configuration | Configuration is resolved from a per-target config file, environment variables, and built-in defaults in strict precedence order, so each analysed repository can drive its own settings | +| Backend flexibility | The inference backend is selected at runtime via configuration; the abstract provider contract means swapping backends requires no pipeline changes | +| 
Cost control | Prompt-cache reuse and adaptive reasoning modes make large-scale walks economically viable; the operator can tune these via settings | +| Safe reporting | The reporting pass reads wiki content and notes without modifying them, making it safe to run in automated pipelines | +| Graceful degradation | Synthesis failures preserve raw extracted notes rather than producing blank sections; the run always yields some output | ### Pain Points -- Full re-generation on every repository change is prohibitively expensive for large codebases. -- Undetected hallucinations or unsupported claims in generated documentation create risk for downstream consumers. -- Different projects require different cost-quality trade-offs that a single global configuration cannot express. -- Mid-run failures on large repositories waste significant compute if the run cannot be resumed. - -### Use Cases Served -- Configuring provider, model identity, caching behavior, feature flags, and quality threshold per project. -- Running repeated walks that leverage the cache and skip files whose content has not changed. -- Reviewing WikiQualityReport and per-section SectionReport outputs to verify coverage and quality before release. -- Tuning the quality threshold that triggers automatic section revision, balancing thoroughness against cost. -- Monitoring ExtractionStats and AggregationStats to understand pipeline efficiency across runs. +- A backend that cannot be swapped creates vendor lock-in at the infrastructure level. +- Calling a hosted inference service for hundreds of per-file passes without cost controls would be cost-prohibitive. +- A pipeline that produces blank sections or halts on failure cannot be trusted in an automated context. + +### Primary Use Cases +- Configuring and scheduling unattended wiki generation runs. +- Selecting and rotating inference backends via configuration. 
+- Reviewing the walk report and coverage statistics as pipeline outputs rather than as interactive documents. --- -## Coverage Gap +## Gaps + +The upstream sections are silent on the following potential audiences. These personas cannot be inferred from the available evidence and are declared here as gaps rather than invented: -The upstream sections are silent on whether any persona interacts with the system through an interface other than the command-line (`init`, `walk`, `chat`, `report` subcommands) or a direct filesystem read of the generated wiki. No web-based, API-based, or programmatic consumer persona is evidenced; any such persona would need to be inferred from sources not available in these upstreams. +- **End users of the documented system** — the wiki describes what the system does for its users, but those end users are not themselves users of the wiki-generation system as described in the upstreams. +- **Product managers or business stakeholders** — no upstream section describes non-technical readers consuming the generated wiki for product-level decision-making. +- **Security or compliance reviewers** — no upstream section describes use of the wiki for audit, compliance, or security assessment purposes. diff --git a/.wikifi/user_stories.md b/.wikifi/user_stories.md index 4629ea0..99ee1f3 100644 --- a/.wikifi/user_stories.md +++ b/.wikifi/user_stories.md @@ -1,267 +1,264 @@ # User Stories -## Feature: Repository Analysis and Pipeline Execution +Three distinct groups of stories are derived from the four confirmed personas (Onboarding Developer, Migration Lead, Legacy System Maintainer, Pipeline Operator) and the capabilities described in the upstream sections. Each story is followed by Gherkin acceptance criteria. Personas for whom no upstream evidence exists — end users of the documented system, product managers, and compliance reviewers — are out of scope and are not represented here. 
-**As a Migration Planner, I want to run the full four-stage analysis pipeline against a target repository, so that I can establish a shared, authoritative understanding of what the system does before any migration work begins.**
+---
+
+## Feature: Repository Discovery and Structured Documentation
+
+### Story 1 — Understand system intent without reading every source file
+
+> *As an Onboarding Developer, I want a structured wiki generated automatically from an unfamiliar codebase, so that I can form an accurate picture of what the system accomplishes without reading every file individually.*

 ```gherkin
-Given a target repository has been registered with the system
-And a project configuration file is present at the expected location
-When I invoke the walk command
-Then the system performs repository introspection and produces an IntrospectionResult
-  with include/exclude patterns, a purpose hypothesis, and filtering rationale
-And each included file produces FileFindings containing section-level descriptions
-  and a one-sentence file-role summary
-And all findings are synthesized into coherent markdown narratives for each of the
-  eight primary wiki sections
-And derivative sections are generated only after all primary sections are finalized
-And a WalkReport is returned carrying IntrospectionResult, ExtractionStats,
-  AggregationStats, DerivationStats, WalkCache state, and RepoGraph
+Given a repository root with no prior analysis
+When the pipeline runs for the first time
+Then a wiki is produced containing at minimum the eight primary sections
+  (Domains, Intent, Capabilities, External Dependencies, Integrations,
+  Cross-Cutting Concerns, Entities, Hard Specifications)
+And each section describes the system in technology-agnostic domain terms
+And every claim carries an inline citation marker linked to the originating
+  file path and line range
+And no section contains implementation-specific language
 ```

----
+### Story 2 — Navigate cross-module flows without tracing imports manually
+
+> *As an Onboarding Developer, I want findings to describe flows between modules rather than treating each file in isolation, so that I can understand how the system's parts interact without manually tracing every dependency.*
+
+```gherkin
+Given a codebase with inter-file dependencies recorded in the repository graph
+When a source file is analysed
+Then the cross-file reference graph is consulted as part of that file's analysis
+And the resulting findings describe relationships and flows to neighbouring modules
+And findings are not limited to the contents of the single file under analysis
+```

-## Feature: Evidence Traceability
+### Story 3 — Confirm analysis covered the full system

-**As a Migration Planner, I want every claim in the generated wiki traceable to a precise source location, so that I can verify assertions and investigate discrepancies before committing to migration decisions.**
+> *As a Migration Lead, I want a coverage report showing how many files contributed findings to each wiki section, so that I can verify the analysis walked the full repository before treating the wiki as migration input.*

 ```gherkin
-Given the pipeline has completed a walk of the repository
-When I read any narrative section of the generated wiki
-Then each factual sentence is backed by at least one SourceRef identifying a
-  repo-relative file path and an optional inclusive line range
-And numbered citations in section narratives resolve to the originating files
-And any Claim with no supporting sources is explicitly marked as unsupported
-  rather than silently presented as fact
+Given a completed pipeline run
+When the coverage report is reviewed
+Then each section entry shows the count of files that contributed findings
+And sections with zero contributing files are flagged as empty
+And a coverage percentage is included in the report
+And the report is produced without modifying any wiki section
 ```

 ---

-## Feature: Contradiction and Conflict Surfacing
+## Feature: Citation Traceability
+
+### Story 4 — Trace any wiki claim back to its source before acting on it

-**As a Migration Planner, I want conflicting assertions about the same domain topic surfaced explicitly, so that I do not inherit wrong assumptions when planning the migration.**
+> *As a Migration Lead, I want every wiki assertion rendered with inline citation markers linked to a numbered source footer, so that I can verify any claim against the codebase before committing to a re-implementation decision.*

 ```gherkin
-Given two or more source files contain incompatible assertions about the same
-  domain topic
-When the aggregation stage processes findings from those files
-Then a Contradiction entity is created grouping the conflicting Claims
-And each conflicting Claim retains its own SourceRefs pointing to its
-  originating file path and line range
-And the disagreement is rendered explicitly in the relevant section narrative
-  rather than being silently merged into a single assertion
+Given a generated wiki section containing multiple claims
+When I inspect any individual claim
+Then it carries at least one inline citation marker
+And that marker resolves to a footer entry identifying the originating file path
+  and inclusive line range
+And any claim carrying no source reference is explicitly identified as unsupported
+  rather than silently omitted
 ```

 ---

-## Feature: Pre-Migration Coverage Reporting
+## Feature: Contradiction Surfacing

-**As a Migration Planner, I want a coverage report after each pipeline run, so that I can confirm the walk reached all intent-bearing areas of the repository before acting on the output.**
+### Story 5 — Use conflict blocks as a migration work-item list
+
+> *As a Migration Lead, I want incompatible source claims surfaced as explicit contradiction blocks rather than silently resolved, so that the migration team has a concrete and honest list of ambiguities to resolve.*

 ```gherkin
-Given a pipeline walk has completed
-When I invoke the report command
-Then a WikiReport is produced aggregating all SectionReports
-And each SectionReport shows contributing file count, findings count,
-  body character length, and an emptiness flag
-And a WikiQualityReport is available with CoverageStats showing total files,
-  files with findings, and per-section finding and file counts
-And sections with zero findings are flagged as empty rather than omitted
+Given two or more source locations that assert incompatible things about the same topic
+When the wiki section covering that topic is synthesised
+Then a "Conflicts in source" block is included in that section
+And each conflicting position is listed with its own source references
+And no silent reconciliation of the disagreement is performed
 ```

----
-
-## Feature: Integration Inventory
+### Story 6 — Detect newly introduced inconsistencies after a code change

-**As a Migration Planner, I want a structured inventory of external integration touchpoints, so that I can identify dependencies that must be preserved, replaced, or renegotiated during migration.**
+> *As a Legacy System Maintainer, I want newly introduced contradictions made visible after each incremental run, so that I know when a recent change has created an inconsistency without reviewing every file manually.*

 ```gherkin
-Given the repository contains API contract files, interface definition files,
-  or data-definition schema files
-When the schema-aware extraction stage routes those files to specialized extractors
-Then HTTP endpoints with operation, path, and summary are surfaced as structured findings
-And remote procedure calls including streaming legs are extracted from interface
-  definition files
-And foreign-key constraints and persisted entity relationships are extracted from
-  data-definition files
-And each finding carries one or more SourceRefs traceable to the originating
-  file and line range
-And the integrations section of the wiki consolidates these findings into a
-  readable inventory
+Given a previous run produced no contradiction block for a given section
+And a code change has introduced conflicting claims between two files
+When the pipeline is re-run
+Then the affected section now contains a contradiction block surfacing the conflict
+And each disagreeing position retains its own source references
 ```

 ---

-## Feature: Schema-Aware Extraction
+## Feature: Incremental Re-Analysis
+
+### Story 7 — Avoid full analysis cost when only a few files changed

-**As a Migration Planner, I want schema and contract files processed by format-specific extractors, so that persisted entity structures and external contracts are surfaced accurately without relying solely on general inference.**
+> *As a Legacy System Maintainer, I want only changed files re-processed on each run, so that I do not pay the full analysis cost every time a small part of the codebase changes.*

 ```gherkin
-Given the repository contains files whose kind is classified as SQL data-definition,
-  API contract, interface definition, or query/mutation schema
-When the extraction stage routes these files by FileKind to the appropriate
-  specialized extractor
-Then persisted entities with their columns, foreign-key edges, uniqueness and
-  nullability invariants, and index definitions are captured as SpecializedFindings
-And closed value sets and shared shape contracts from schema files are captured
-  as separate finding categories
-And every SpecializedFinding carries one or more SourceRefs traceable to the
-  source file and line range
-And a SpecializedResult collects all findings for the file along with an optional
-  summary string
+Given a prior completed pipeline run with cached results
+When the pipeline is re-run and only a subset of files have changed
+Then only files whose content has changed are re-extracted
+And sections whose entire evidence base is unchanged are served from cache
+And a run in which nothing has changed is a complete no-op with no generation work performed
+```
+
+### Story 8 — Preserve reviewed prose and citation numbering on small changes
+
+> *As a Legacy System Maintainer, I want targeted in-place edits performed when findings change only slightly, so that carefully reviewed prose, citation numbering, and unaffected paragraphs are not erased on every run.*
+
+```gherkin
+Given a section with an established cached body
+When the re-run finds that only a small subset of findings for that section has changed
+And the churn ratio is at or below the configured threshold
+Then the system performs a surgical edit introducing new claims and removing dropped claims
+And all unaffected paragraphs and citation numbering are retained verbatim
+When the churn ratio exceeds the threshold or no prior body exists
+Then a full rewrite is performed instead
 ```

 ---

-## Feature: Wiki Navigation and Onboarding
+## Feature: Quality Review
+
+### Story 9 — Score and revise wiki sections before migration sign-off

-**As an Onboarding Engineer, I want structured, domain-level descriptions of each file's role, so that I can build a coherent mental model of the system without reading the entire codebase.**
+> *As a Migration Lead, I want an optional critic-and-reviser cycle to score section bodies and flag unsupported claims, so that I can assess documentation quality before committing to use the wiki as migration input.*

 ```gherkin
-Given a wiki has been generated for the target repository
-When I navigate to any primary wiki section
-Then each included file is represented by at least one SectionFinding with a
-  technology-agnostic description of one to five sentences
-And each file's one-sentence role summary from its FileFindings is accessible
-And cross-file import-graph context has been injected so relationships between
-  files are reflected in section narratives
-And all descriptions are free of implementation-specific terminology
+Given a completed wiki generation run
+When the critic loop is enabled for a section
+Then the section body is scored on an integer scale from 0 to 10
+And a list of unsupported claims is produced
+And a list of gaps against the section brief is produced
+And a concrete list of suggested edits is produced
+And a revised body is generated and kept only if it scores better than the original
+And per-section quality scores are included in the coverage report alongside
+  an overall mean score
 ```

-**As an Onboarding Engineer, I want a navigable wiki covering all eight primary concerns plus derivative sections, so that I can locate relevant information without a guided tour of the codebase.**
+### Story 10 — Distinguish reviewed from unreviewed cached derivations
+
+> *As a Migration Lead, I want the cache to record whether the critic loop ran for a derivative section, so that a reviewed body is never silently replaced by an unreviewed one on a subsequent run.*

 ```gherkin
-Given the pipeline has completed a walk
-When I open the generated wiki
-Then eight primary sections are present: business domains, system intent,
-  capabilities, external dependencies, integrations, cross-cutting concerns,
-  core entities, and hard specifications
-And three derivative sections are present: personas, user stories, and
-  architectural diagrams
-And any section for which no findings exist contains a placeholder declaring
-  the gap rather than fabricating content
+Given a derivative section whose cached body was produced with the critic loop enabled
+When the pipeline is re-run with the critic loop disabled
+Then the cached reviewed body is served without being silently substituted
+  by an unreviewed regeneration
+And the reviewed flag in the cached derivation record remains set
 ```

 ---

-## Feature: Conversational Query Interface
+## Feature: Interactive Exploration
+
+### Story 11 — Ask follow-up questions grounded in the generated wiki

-**As an Onboarding Engineer, I want to ask targeted questions about the system through a conversational interface and receive answers that cite specific wiki sections, so that I can deepen my understanding without interrupting senior colleagues.**
+> *As an Onboarding Developer, I want an interactive conversational session grounded in the wiki, so that I can chase down specific flows or clarify ambiguous sections without re-reading the raw source.*

 ```gherkin
-Given a wiki has been generated and a WikiLayout exists at the project root
-When I invoke the chat subcommand and submit a domain question
-Then the ChatSession formulates its response using a frozen system prompt built
-  from the populated wiki sections
-And each response cites the specific sections from which it draws information
-And the session accumulates conversation history across turns while retaining
-  the frozen wiki context
-And when the wiki does not contain information relevant to the question the
-  session explicitly declares the gap rather than inventing an answer
+Given a wiki with at least one fully populated section
+When an interactive session is opened
+Then questions can be answered across multiple turns using populated sections as context
+And empty or placeholder sections are excluded from the assistant's context
+And conversation history can be reset while the wiki context is retained
 ```

-**As a Non-Technical Stakeholder, I want to ask plain-language questions about system capabilities and integration points, so that I can assess scope and risk without requiring engineering background.**
+### Story 12 — Know which wiki sections are informing chat answers
+
+> *As an Onboarding Developer, I want to list the currently loaded wiki sections at any time during a session, so that I understand the scope of context informing the answers I receive.*

 ```gherkin
-Given a wiki has been generated and the chat subcommand is available
-When I submit a plain-language question about a system capability or integration
-Then the ChatSession provides an answer grounded in the business-domain,
-  system-intent, and integrations sections
-And the answer contains no implementation-specific terminology
-And if the question falls outside what the wiki covers the session admits the
-  gap explicitly rather than speculating
+Given an active interactive session
+When the user requests the list of loaded sections
+Then the session returns the identifiers and titles of all fully populated sections
+  currently included in the context
 ```

 ---

-## Feature: Technology-Agnostic Output
+## Feature: Pipeline Configuration and Backend Flexibility
+
+### Story 13 — Run the full pipeline unattended from a single entry point

-**As a Non-Technical Stakeholder, I want all generated documentation expressed in domain terms, so that the wiki remains meaningful and shareable after the technology stack changes.**
+> *As a Pipeline Operator, I want the full pipeline to run unattended via a single command-line entry point, so that I can integrate wiki generation into scheduled automated workflows without manual steps.*

 ```gherkin
-Given the pipeline has completed a walk and generated all wiki sections
-When I read any section body
-Then no section contains references to specific implementation technologies,
-  frameworks, or libraries
-And business-level capability summaries are present in the business domains
-  and system intent sections
-And the integrations section describes external dependencies in terms that
-  do not require an engineering background to interpret
+Given a configured repository target and a valid settings source
+When the pipeline is invoked via the command-line interface
+Then the pipeline runs to completion without prompting for human input
+And all domain logic is delegated to internal modules
+And the run always produces some output even when individual section synthesis fails
 ```

----
-
-## Feature: Incremental Re-Processing and Caching
+### Story 14 — Swap inference backends without changing pipeline logic

-**As a Platform or Documentation Engineer, I want the pipeline to skip unchanged files on repeated runs, so that documentation stays current without incurring the full inference cost of a fresh walk.**
+> *As a Pipeline Operator, I want the inference backend selected at runtime via configuration, so that I can swap or rotate backends without modifying any pipeline code.*

 ```gherkin
-Given a prior pipeline walk has completed and a WalkCache is populated
-And only a subset of source files have changed since the last run
-When I invoke the walk command again
-Then files whose content fingerprint matches an existing CachedFindings entry
-  are served from cache without triggering new inference calls
-And section aggregations whose notes-payload hash matches an existing
-  CachedSection entry are served from cache
-And ExtractionStats records the count of cache hits and misses for the run
+Given a pipeline configured to use one inference backend
+When the backend identifier is changed in configuration
+Then the pipeline uses the new backend on the next run
+And no pipeline logic changes are required
+And all three call surfaces (structured extraction, free-text generation,
+  multi-turn conversation) are satisfied by the new backend
 ```

-**As a Platform or Documentation Engineer, I want interrupted pipeline runs to resume from the point of failure, so that compute already expended on a large repository is not wasted.**
+### Story 15 — Drive per-repository settings independently
+
+> *As a Pipeline Operator, I want configuration resolved in strict precedence order from a per-target config file, environment variables, and built-in defaults, so that each analysed repository controls its own settings without affecting others.*

 ```gherkin
-Given a pipeline walk was interrupted before all files were processed
-And the WalkCache contains findings for files processed before the interruption
-When I re-invoke the walk command
-Then files already present in the cache are not re-processed
-And the run completes from the point of interruption rather than restarting
-And the final WalkReport reflects contributions from both cached and newly
-  processed files
+Given multiple repository targets each with their own config file
+When the pipeline is run for a given target
+Then the per-target config file takes precedence over environment variables
+And environment variables take precedence over built-in defaults
+And settings from one target's config file do not affect another target's run
 ```

 ---

-## Feature: Quality Review Loop
+## Feature: Resilience and Graceful Degradation
+
+### Story 16 — Resume a crashed run without restarting from scratch

-**As a Platform or Documentation Engineer, I want sections that fall below a configured quality threshold to be automatically revised, so that generated documentation meets a defined quality bar before it is distributed.**
+> *As a Legacy System Maintainer, I want cache state persisted after each file so that a pipeline crash resumes from the last completed file, so that a failure mid-run does not force a full restart.*

 ```gherkin
-Given the quality review loop feature flag is enabled in Settings
-And a numeric quality threshold has been configured for the project
-When the derivation stage produces a Critique for each section
-Then any section whose Critique score falls below the threshold triggers an
-  automatic revision pass
-And a ReviewOutcome is recorded capturing the section identifier, initial
-  Critique, current body text, revision-applied flag, and optional follow-up Critique
-And DerivationStats accumulates counts of sections derived, revised, and skipped
+Given a pipeline run that crashes partway through a large codebase
+When the pipeline is re-invoked
+Then it resumes from the last successfully cached file
+And previously cached findings are not re-extracted
+And no work completed before the crash is lost
 ```

-**As a Platform or Documentation Engineer, I want a per-section quality audit report available after each run, so that I can verify coverage and quality before distributing the wiki.**
+### Story 17 — Continue a run when a file cannot be parsed
+
+> *As a Pipeline Operator, I want unparseable files flagged for manual review rather than halting the pipeline, so that a single malformed file does not block analysis of the rest of the repository.*

 ```gherkin
-Given a pipeline walk has completed with the review loop enabled
-When I request the quality report
-Then a WikiQualityReport is produced with an overall numeric score and a
-  mapping of section identifiers to individual Critiques
-And each Critique declares the integer score, unsupported claims, gaps relative
-  to the section brief, and concrete revision suggestions
-And CoverageStats within the report show total files, files with findings,
-  and per-section finding and file counts
+Given a repository containing one or more malformed structured-artifact files
+When the pipeline processes those files
+Then each unparseable file is recorded as flagged for manual review
+And pipeline execution continues with the remaining in-scope files
+And the final run report identifies which files were flagged
 ```

----
-
-## Feature: Per-Project Pipeline Configuration
+### Story 18 — Always receive some wiki output even when synthesis fails

-**As a Platform or Documentation Engineer, I want each project to carry its own configuration controlling provider, model, feature flags, and quality threshold, so that different projects can express different cost-quality trade-offs independently.**
+> *As a Pipeline Operator, I want raw extracted notes preserved when section synthesis fails, so that automated pipelines always receive some output and never produce blank sections.*

 ```gherkin
-Given a project TOML configuration file is present at the project root
-When the pipeline is initialized for that project
-Then Settings are loaded from the configuration file including provider and model
-  identity, inference endpoint, timeout, file-size and chunk thresholds, pipeline
-  feature flags, revision quality threshold, and provider-specific credentials
-And the chosen provider is the sole point of contact between the pipeline and
-  any language-model backend for that run
-And if the configuration file is absent or unparseable the system falls back
-  to environment-derived defaults without failing
+Given a section whose synthesis step fails at runtime
+When the pipeline completes
+Then the raw extracted notes for that section are preserved in the output
+And the section is not blank
+And the pipeline does not halt due to the synthesis failure
 ```
diff --git a/tests/test_surgical.py b/tests/test_surgical.py
new file mode 100644
index 0000000..ce7f692
--- /dev/null
+++ b/tests/test_surgical.py
@@ -0,0 +1,697 @@
+"""Plan B — surgical aggregation tests.
+
+Three surfaces under test:
+
+- ``classify_section_change`` — the diff classifier that routes
+  unchanged / surgical / rewrite paths.
+- ``surgical_aggregate`` — the LLM-driven surgical edit including
+  citation re-anchoring (cached claims minus removed + new claims
+  resolved against added findings).
+- The aggregator's decision tree that dispatches to those paths and
+  records ``finding_ids`` on every cache write.
+"""
+
+from __future__ import annotations
+
+from wikifi.aggregator import aggregate_all
+from wikifi.cache import (
+    CachedSection,
+    WalkCache,
+    compute_finding_id,
+    note_finding_ids,
+)
+from wikifi.sections import PRIMARY_SECTIONS
+from wikifi.surgical import (
+    SurgicalClaim,
+    SurgicalContradiction,
+    SurgicalEdit,
+    classify_section_change,
+    surgical_aggregate,
+)
+from wikifi.wiki import WikiLayout, append_note, initialize
+
+
+def _layout(tmp_path):
+    layout = WikiLayout(root=tmp_path)
+    initialize(layout, model="m", provider="ollama", ollama_host="http://h")
+    return layout
+
+
+def _note(file: str, finding: str, summary: str = "x") -> dict:
+    return {"file": file, "summary": summary, "finding": finding}
+
+
+# ---------- compute_finding_id / note_finding_ids ----------
+
+
+def test_compute_finding_id_is_deterministic_and_localized():
+    """Same (file, section, finding) → same id; any change → new id."""
+    base = compute_finding_id(file="a.py", section_id="entities", finding="Order entity.")
+    assert base == compute_finding_id(file="a.py", section_id="entities", finding="Order entity.")
+    assert base != compute_finding_id(file="b.py", section_id="entities", finding="Order entity.")
+    assert base != compute_finding_id(file="a.py", section_id="capabilities", finding="Order entity.")
+    assert base != compute_finding_id(file="a.py", section_id="entities", finding="Order entity reworded.")
+
+
+def test_note_finding_ids_aligns_with_note_order():
+    notes = [_note("a.py", "A1"), _note("b.py", "B1"), _note("a.py", "A2")]
+    ids = note_finding_ids(notes, section_id="entities")
+    assert len(ids) == 3
+    assert ids[0] == compute_finding_id(file="a.py", section_id="entities", finding="A1")
+    assert ids[1] == compute_finding_id(file="b.py", section_id="entities", finding="B1")
+    assert ids[2] == compute_finding_id(file="a.py", section_id="entities", finding="A2")
+
+
+# ---------- classify_section_change ----------
+
+
+def test_classify_no_cache_routes_to_rewrite():
+    live_ids = ["id1", "id2"]
+    change = classify_section_change(cached=None, live_finding_ids=live_ids, surgical_threshold=0.3)
+    assert change.decision == "rewrite"
+    assert change.added_indices == [1, 2]
+    assert change.removed_ids == []
+
+
+def test_classify_legacy_v2_cache_with_no_finding_ids_routes_to_rewrite():
+    """A cached entry from a pre-v3 cache has empty ``finding_ids`` — must rewrite."""
+    cached = CachedSection(notes_hash="h", body="cached", finding_ids=[])
+    change = classify_section_change(cached=cached, live_finding_ids=["id1"], surgical_threshold=0.3)
+    assert change.decision == "rewrite"
+
+
+def test_classify_identical_sets_routes_to_unchanged():
+    cached = CachedSection(notes_hash="h", body="cached", finding_ids=["id1", "id2"])
+    change = classify_section_change(cached=cached, live_finding_ids=["id1", "id2"], surgical_threshold=0.3)
+    assert change.decision == "unchanged"
+    assert change.added_indices == []
+    assert change.removed_ids == []
+    assert change.unchanged_count == 2
+
+
+def test_classify_small_delta_routes_to_surgical():
+    """1 added out of 5 → 0.2 churn ≤ 0.3 threshold → surgical."""
+    cached = CachedSection(notes_hash="h", body="cached", finding_ids=["id1", "id2", "id3", "id4"])
+    change = classify_section_change(
+        cached=cached,
+        live_finding_ids=["id1", "id2", "id3", "id4", "id5"],
+        surgical_threshold=0.3,
+    )
+    assert change.decision == "surgical"
+    assert change.added_indices == [5]
+    assert change.removed_ids == []
+
+
+def test_classify_large_delta_routes_to_rewrite():
+    """3 added of 4 live → 0.75 churn > 0.3 threshold → rewrite."""
+    cached = CachedSection(notes_hash="h", body="cached", finding_ids=["id1"])
+    change = classify_section_change(
+        cached=cached,
+        live_finding_ids=["id1", "idA", "idB", "idC"],
+        surgical_threshold=0.3,
+    )
+    assert change.decision == "rewrite"
+
+
+def test_classify_threshold_disabled_always_rewrites_when_changed():
+    """Negative threshold (use_surgical_edits=False sentinel) disables surgical entirely."""
+    cached = CachedSection(notes_hash="h", body="cached", finding_ids=["id1", "id2"])
+    change = classify_section_change(
+        cached=cached,
+        live_finding_ids=["id1", "id3"],  # 1 add, 1 remove → would be 1.0 churn anyway
+        surgical_threshold=-1.0,
+    )
+    assert change.decision == "rewrite"
+
+
+def test_classify_churn_ratio_at_exactly_threshold_routes_to_surgical():
+    """``churn_ratio == surgical_threshold`` is inclusive — surgical, not rewrite."""
+    cached = CachedSection(notes_hash="h", body="cached", finding_ids=["id1", "id2", "id3"])
+    change = classify_section_change(
+        cached=cached,
+        live_finding_ids=["id1", "id2", "id3", "id4"],
+        surgical_threshold=0.25,
+    )
+    assert change.decision == "surgical"
+    assert change.churn_ratio == 0.25
+
+
+# ---------- surgical_aggregate ----------
+
+
+def test_surgical_aggregate_merges_cached_and_new_claims(mock_provider_factory):
+    """Cached claims survive minus removed; new claims attach to added notes."""
+    section = PRIMARY_SECTIONS[0]
+    cached = CachedSection(
+        notes_hash="h",
+        body="Original body paragraph one.\n\nOriginal body paragraph two.",
+        claims=[
+            {
+                "text": "Cached claim A.",
+                "sources": [{"file": "a.py", "lines": None, "fingerprint": ""}],
+            },
+            {
+                "text": "Cached claim B (will be removed).",
+                "sources": [{"file": "b.py", "lines": None, "fingerprint": ""}],
+            },
+        ],
+        contradictions=[],
+        finding_ids=[
+            compute_finding_id(file="a.py", section_id=section.id, finding="A1"),
+            compute_finding_id(file="b.py", section_id=section.id, finding="B1 (removed)"),
+        ],
+    )
+    live_notes = [
+        _note("a.py", "A1"),
+        _note("c.py", "C1 (added)"),
+    ]
+    live_ids = note_finding_ids(live_notes, section_id=section.id)
+    # 1 added + 1 removed out of 2 live = 1.0 churn — needs threshold ≥ 1.0
+    # to route surgical. The merge logic itself is what's under test.
+    change = classify_section_change(cached=cached, live_finding_ids=live_ids, surgical_threshold=1.0)
+    assert change.decision == "surgical"
+
+    edit = SurgicalEdit(
+        body="Edited body — preserves A, drops B, adds C.",
+        new_claims=[
+            SurgicalClaim(text="New claim from C.", source_indices=[1]),
+        ],
+        removed_claim_indices=[2],  # drop "Cached claim B"
+        contradictions=[],
+    )
+
+    provider = mock_provider_factory(json_factory=lambda schema, system, user: edit)
+    bundle = surgical_aggregate(
+        section=section,
+        cached=cached,
+        live_notes=live_notes,
+        change=change,
+        provider=provider,
+    )
+    assert bundle.body == "Edited body — preserves A, drops B, adds C."
+    # Cached claim A survives, cached claim B was removed, new claim from C added.
+    assert len(bundle.claims) == 2
+    assert bundle.claims[0].text == "Cached claim A."
+    assert bundle.claims[1].text == "New claim from C."
+    # The new claim's source resolves to c.py (the added note), not a.py or b.py.
+    assert any(ref.file == "c.py" for ref in bundle.claims[1].sources)
+
+
+def test_surgical_aggregate_ignores_out_of_range_removed_indices(mock_provider_factory):
+    """Bad indices from the model don't break the merge — they're silently dropped."""
+    section = PRIMARY_SECTIONS[0]
+    cached = CachedSection(
+        notes_hash="h",
+        body="Body.",
+        claims=[{"text": "Only claim.", "sources": [{"file": "a.py", "lines": None, "fingerprint": ""}]}],
+        contradictions=[],
+        finding_ids=[compute_finding_id(file="a.py", section_id=section.id, finding="A1")],
+    )
+    live_notes = [_note("a.py", "A1"), _note("c.py", "C1")]
+    live_ids = note_finding_ids(live_notes, section_id=section.id)
+    change = classify_section_change(cached=cached, live_finding_ids=live_ids, surgical_threshold=1.0)
+
+    edit = SurgicalEdit(
+        body="Edited.",
+        new_claims=[],
+        removed_claim_indices=[42, -1, 0],  # all invalid
+        contradictions=[],
+    )
+    provider = mock_provider_factory(json_factory=lambda schema, system, user: edit)
+    bundle = surgical_aggregate(
+        section=section,
+        cached=cached,
+        live_notes=live_notes,
+        change=change,
+        provider=provider,
+    )
+    assert len(bundle.claims) == 1
+    assert bundle.claims[0].text == "Only claim."
+
+
+def test_surgical_aggregate_replaces_contradictions_with_new_set(mock_provider_factory):
+    """``edit.contradictions`` is the FULL post-edit set — cached contradictions are replaced."""
+    section = PRIMARY_SECTIONS[0]
+    cached = CachedSection(
+        notes_hash="h",
+        body="Body.",
+        claims=[],
+        contradictions=[
+            {
+                "summary": "Old conflict.",
+                "positions": [{"text": "stale", "sources": []}],
+            }
+        ],
+        finding_ids=[compute_finding_id(file="a.py", section_id=section.id, finding="A1")],
+    )
+    live_notes = [_note("a.py", "A1"), _note("c.py", "C1")]
+    live_ids = note_finding_ids(live_notes, section_id=section.id)
+    change = classify_section_change(cached=cached, live_finding_ids=live_ids, surgical_threshold=1.0)
+
+    edit = SurgicalEdit(
+        body="Edited.",
+        new_claims=[],
+        removed_claim_indices=[],
+        contradictions=[
+            SurgicalContradiction(
+                summary="Fresh conflict.",
+                positions=[SurgicalClaim(text="position from added", source_indices=[1])],
+            )
+        ],
+    )
+    provider = mock_provider_factory(json_factory=lambda schema, system, user: edit)
+    bundle = surgical_aggregate(
+        section=section,
+        cached=cached,
+        live_notes=live_notes,
+        change=change,
+        provider=provider,
+    )
+    assert len(bundle.contradictions) == 1
+    assert bundle.contradictions[0].summary == "Fresh conflict."
+
+
+# ---------- aggregator decision tree ----------
+
+
+def test_aggregate_records_finding_ids_on_cache_write(tmp_path, mock_provider_factory):
+    """Every aggregation cache write must include finding_ids — the diff cornerstone."""
+    from wikifi.aggregator import SectionBody
+
+    layout = _layout(tmp_path)
+    section = PRIMARY_SECTIONS[0]
+    append_note(layout, section, _note("a.py", "Order entity."))
+
+    cache = WalkCache()
+    provider = mock_provider_factory(
+        json_factory=lambda schema, system, user: SectionBody(body="Synthesized."),
+    )
+    aggregate_all(layout=layout, provider=provider, cache=cache)
+    entry = cache.aggregation[section.id]
+    assert entry.finding_ids == [compute_finding_id(file="a.py", section_id=section.id, finding="Order entity.")]
+
+
+def test_aggregate_takes_unchanged_path_when_only_notes_hash_differs(tmp_path, mock_provider_factory):
+    """Same finding ids, different ``notes_hash`` (e.g. line range moved) → no LLM call."""
+    from wikifi.aggregator import SectionBody
+
+    layout = _layout(tmp_path)
+    section = PRIMARY_SECTIONS[0]
+    # Note has a source ref; we'll change the lines to invalidate notes_hash
+    # while keeping finding_id stable.
+    append_note(
+        layout,
+        section,
+        {
+            "file": "a.py",
+            "summary": "x",
+            "finding": "Order entity.",
+            "sources": [{"file": "a.py", "lines": [10, 20], "fingerprint": "abc"}],
+        },
+    )
+
+    cache = WalkCache()
+    cache.record_aggregation(
+        section.id,
+        notes_hash="stale-hash-from-prior-walk",
+        body="Cached body that should survive.",
+        claims=[{"text": "Claim.", "sources": [{"file": "a.py", "lines": None, "fingerprint": ""}]}],
+        contradictions=[],
+        finding_ids=[compute_finding_id(file="a.py", section_id=section.id, finding="Order entity.")],
+    )
+
+    call_count = {"n": 0}
+
+    def factory(schema, system, user):
+        call_count["n"] += 1
+        return SectionBody(body="should not be called")
+
+    provider = mock_provider_factory(json_factory=factory)
+    stats = aggregate_all(layout=layout, provider=provider, cache=cache)
+
+    assert call_count["n"] == 0, "unchanged-finding-ids path must not call the LLM"
+    assert stats.sections_cached == 1
+    body = layout.section_path(section).read_text()
+    assert "Cached body that should survive." in body
+    # Cache key is *not* refreshed here. The cached claims still carry
+    # SourceRefs from the prior walk's notes (e.g. lines=[10, 20] when
+    # current lines could be [12, 22]); refreshing notes_hash to match
+    # live notes would let ``aggregation_fully_cached`` flag the entry
+    # as fresh and let the orchestrator's short-circuit lock those
+    # stale citations in place. Leaving the key alone keeps the
+    # predicate honest until a real Path 4 rewrite refreshes citations.
+ assert cache.aggregation[section.id].notes_hash == "stale-hash-from-prior-walk" + + +def test_aggregate_takes_surgical_path_for_small_delta(tmp_path, mock_provider_factory): + """1 added finding out of 4 live → surgical edit, not rewrite.""" + from wikifi.aggregator import SectionBody + from wikifi.cache import hash_section_notes + + layout = _layout(tmp_path) + section = PRIMARY_SECTIONS[0] + findings_text = ["A1", "A2", "A3"] + for f in findings_text: + append_note(layout, section, _note("a.py", f)) + + cache = WalkCache() + cache.record_aggregation( + section.id, + notes_hash=hash_section_notes( + [{"file": "a.py", "summary": "x", "finding": f, "sources": []} for f in findings_text] + ), + body="Original body.", + claims=[], + contradictions=[], + finding_ids=[compute_finding_id(file="a.py", section_id=section.id, finding=f) for f in findings_text], + ) + # Add a fourth note — 1/4 churn = 0.25 ≤ 0.3 → surgical. + append_note(layout, section, _note("a.py", "A4_added")) + + call_log = {"surgical": 0, "rewrite": 0} + + def factory(schema, system, user): + if schema is SurgicalEdit: + call_log["surgical"] += 1 + return SurgicalEdit(body="Edited body with A4.") + if schema is SectionBody: + call_log["rewrite"] += 1 + return SectionBody(body="rewrite body") + raise AssertionError(f"unexpected schema {schema}") + + provider = mock_provider_factory(json_factory=factory) + stats = aggregate_all(layout=layout, provider=provider, cache=cache, surgical_threshold=0.3) + assert call_log["surgical"] == 1 + assert call_log["rewrite"] == 0 + assert stats.sections_edited == 1 + assert stats.sections_rewritten == 0 + body = layout.section_path(section).read_text() + assert "Edited body with A4." 
in body + + +def test_aggregate_takes_rewrite_path_when_churn_above_threshold(tmp_path, mock_provider_factory): + from wikifi.aggregator import SectionBody + from wikifi.cache import hash_section_notes + + layout = _layout(tmp_path) + section = PRIMARY_SECTIONS[0] + cache = WalkCache() + # Cache says one finding existed; live notes have four entirely different findings. + cache.record_aggregation( + section.id, + notes_hash=hash_section_notes([{"file": "a.py", "summary": "x", "finding": "old", "sources": []}]), + body="Old body.", + claims=[], + contradictions=[], + finding_ids=[compute_finding_id(file="a.py", section_id=section.id, finding="old")], + ) + for finding in ("new1", "new2", "new3", "new4"): + append_note(layout, section, _note("a.py", finding)) + + call_log = {"surgical": 0, "rewrite": 0} + + def factory(schema, system, user): + if schema is SurgicalEdit: + call_log["surgical"] += 1 + return SurgicalEdit(body="surgical") + if schema is SectionBody: + call_log["rewrite"] += 1 + return SectionBody(body="Full rewrite body.") + raise AssertionError(f"unexpected schema {schema}") + + provider = mock_provider_factory(json_factory=factory) + stats = aggregate_all(layout=layout, provider=provider, cache=cache, surgical_threshold=0.3) + assert call_log["surgical"] == 0 + assert call_log["rewrite"] == 1 + assert stats.sections_rewritten == 1 + assert stats.sections_edited == 0 + + +def test_aggregate_use_surgical_false_skips_surgical_path(tmp_path, mock_provider_factory): + """``use_surgical_edits=False`` disables the surgical path entirely (Plan A behavior).""" + from wikifi.aggregator import SectionBody + from wikifi.cache import hash_section_notes + + layout = _layout(tmp_path) + section = PRIMARY_SECTIONS[0] + findings_text = ["A1", "A2", "A3"] + for f in findings_text: + append_note(layout, section, _note("a.py", f)) + + cache = WalkCache() + cache.record_aggregation( + section.id, + notes_hash=hash_section_notes( + [{"file": "a.py", "summary": "x", 
"finding": f, "sources": []} for f in findings_text] + ), + body="Original.", + claims=[], + contradictions=[], + finding_ids=[compute_finding_id(file="a.py", section_id=section.id, finding=f) for f in findings_text], + ) + append_note(layout, section, _note("a.py", "A4_added")) + + call_log = {"surgical": 0, "rewrite": 0} + + def factory(schema, system, user): + if schema is SurgicalEdit: + call_log["surgical"] += 1 + return SurgicalEdit(body="surgical") + if schema is SectionBody: + call_log["rewrite"] += 1 + return SectionBody(body="rewrite") + raise AssertionError(f"unexpected schema {schema}") + + provider = mock_provider_factory(json_factory=factory) + aggregate_all( + layout=layout, + provider=provider, + cache=cache, + use_surgical_edits=False, + ) + assert call_log["surgical"] == 0 + assert call_log["rewrite"] == 1 + + +def test_aggregate_falls_back_to_rewrite_when_surgical_raises(tmp_path, mock_provider_factory): + """A surgical-path LLM failure must not leave the section empty — fall back to rewrite.""" + from wikifi.aggregator import SectionBody + from wikifi.cache import hash_section_notes + + layout = _layout(tmp_path) + section = PRIMARY_SECTIONS[0] + findings_text = ["A1", "A2", "A3"] + for f in findings_text: + append_note(layout, section, _note("a.py", f)) + + cache = WalkCache() + cache.record_aggregation( + section.id, + notes_hash=hash_section_notes( + [{"file": "a.py", "summary": "x", "finding": f, "sources": []} for f in findings_text] + ), + body="Cached.", + claims=[], + contradictions=[], + finding_ids=[compute_finding_id(file="a.py", section_id=section.id, finding=f) for f in findings_text], + ) + append_note(layout, section, _note("a.py", "A4_added")) + + def factory(schema, system, user): + if schema is SurgicalEdit: + raise RuntimeError("surgical model unavailable") + if schema is SectionBody: + return SectionBody(body="Recovered via rewrite.") + raise AssertionError(f"unexpected schema {schema}") + + provider = 
mock_provider_factory(json_factory=factory) + stats = aggregate_all(layout=layout, provider=provider, cache=cache, surgical_threshold=0.5) + body = layout.section_path(section).read_text() + assert "Recovered via rewrite." in body + # Counted as rewrite, not edit, because the fallback path won. + assert stats.sections_rewritten == 1 + assert stats.sections_edited == 0 + + +# ---------- stability test (the load-bearing assertion) ---------- + + +def test_surgical_edit_preserves_unchanged_paragraphs_when_model_honors_contract( + mock_provider_factory, +): + """If the LLM honors the prompt and returns the input body verbatim plus an addition, + the surgical pipeline preserves every byte of the unchanged content. + + This is the contract Plan B is built on. The test mocks the LLM + response to be a *literal* good citizen — input body unchanged, one + sentence appended — and verifies the merged output preserves the + original prose. Real-world stability depends on prompt quality + a + follow-up critic; this test guards the *mechanism* (no rendering + pass mangles the unchanged region). + """ + section = PRIMARY_SECTIONS[0] + original = "First paragraph that should survive verbatim.\n\nSecond paragraph that should also survive verbatim." + cached = CachedSection( + notes_hash="h", + body=original, + claims=[], + contradictions=[], + finding_ids=[ + compute_finding_id(file="a.py", section_id=section.id, finding="A1"), + compute_finding_id(file="b.py", section_id=section.id, finding="B1"), + ], + ) + live_notes = [ + _note("a.py", "A1"), + _note("b.py", "B1"), + _note("c.py", "C1 added"), + ] + live_ids = note_finding_ids(live_notes, section_id=section.id) + change = classify_section_change(cached=cached, live_finding_ids=live_ids, surgical_threshold=0.5) + assert change.decision == "surgical" + + edited_body = original + "\n\nThird paragraph integrating the added finding." 
+ edit = SurgicalEdit( + body=edited_body, + new_claims=[SurgicalClaim(text="C1 claim.", source_indices=[1])], + removed_claim_indices=[], + contradictions=[], + ) + provider = mock_provider_factory(json_factory=lambda schema, system, user: edit) + bundle = surgical_aggregate(section=section, cached=cached, live_notes=live_notes, change=change, provider=provider) + # The original two paragraphs survive the merge byte-for-byte. + assert original in bundle.body + # And the new content is present. + assert "Third paragraph integrating the added finding." in bundle.body + + +# ---------- PR review regressions ---------- + + +def test_note_finding_ids_returns_empty_for_malformed_notes(): + """Notes missing ``file`` or ``finding`` get an empty-string id. + + Without this, hashing empty strings via :func:`compute_finding_id` + produces a deterministic non-empty id that would let two malformed + notes from different walks compare equal as "unchanged" findings — + routing a section onto the cache/surgical paths despite having no + real identity to anchor on. + """ + notes = [ + _note("a.py", "real finding"), + {"file": "", "summary": "x", "finding": "no file"}, + {"file": "b.py", "summary": "x", "finding": ""}, + {"summary": "x"}, # neither file nor finding present + {"file": "c.py", "summary": "x", "finding": None}, # null finding + ] + ids = note_finding_ids(notes, section_id="entities") + assert ids[0] != "" + assert ids[1] == "" + assert ids[2] == "" + assert ids[3] == "" + assert ids[4] == "" + + +def test_classify_force_rewrites_when_any_finding_id_is_empty(): + """A malformed/legacy note (empty id) in either set forces a rewrite. + + Two empty-string ids would otherwise compare equal in the + set-symmetric-difference computation and let the section land on + the cache/surgical paths. The classifier must short-circuit to + rewrite when it detects any empty id. 
+ """ + cached = CachedSection(notes_hash="h", body="cached", finding_ids=["real_id_a", ""]) + change = classify_section_change(cached=cached, live_finding_ids=["real_id_a", ""], surgical_threshold=0.3) + assert change.decision == "rewrite", ( + "empty finding_id on either side must force rewrite, even when cached and live look superficially identical" + ) + + +def test_section_change_churn_ratio_max_when_everything_removed(): + """``total_live == 0`` plus removed_ids → max churn (1.0), not 0.0. + + Mirrors the classifier's ``max(total_live, 1)`` guard. Without + this, downstream code reading ``churn_ratio`` for a "removed + everything" section would see 0.0 and treat it as a no-change + case. + """ + from wikifi.surgical import SectionChange + + removed_only = SectionChange( + decision="rewrite", + added_indices=[], + removed_ids=["a", "b"], + unchanged_count=0, + total_live=0, + ) + assert removed_only.churn_ratio == 1.0 + + # Truly empty section (no cached, no live) stays at 0.0. + empty = SectionChange( + decision="rewrite", + added_indices=[], + removed_ids=[], + unchanged_count=0, + total_live=0, + ) + assert empty.churn_ratio == 0.0 + + +def test_aggregate_path2_does_not_refresh_cache_key(tmp_path, mock_provider_factory): + """The unchanged-finding-ids path must leave the cache entry alone. + + Refreshing ``notes_hash`` to match live notes would let + :func:`aggregation_fully_cached` think the section is fresh and + skip stage 3 on the next walk — locking in cached citations whose + line ranges / fingerprints have drifted since the prior walk. + Leaving the cache key unchanged keeps the orchestrator's + short-circuit honest until a real Path 4 rewrite refreshes + citations. 
+ """ + from wikifi.aggregator import SectionBody, aggregation_fully_cached + + layout = _layout(tmp_path) + section = PRIMARY_SECTIONS[0] + append_note( + layout, + section, + { + "file": "a.py", + "summary": "x", + "finding": "Order entity.", + "sources": [{"file": "a.py", "lines": [10, 20], "fingerprint": "abc"}], + }, + ) + cache = WalkCache() + cache.record_aggregation( + section.id, + notes_hash="prior-walk-hash", + body="Cached body.", + claims=[], + contradictions=[], + finding_ids=[compute_finding_id(file="a.py", section_id=section.id, finding="Order entity.")], + ) + # Empty cache entries on every other primary section so the + # aggregator doesn't crash trying to look them up. + for other in PRIMARY_SECTIONS[1:]: + from wikifi.cache import hash_section_notes + + cache.record_aggregation( + other.id, + notes_hash=hash_section_notes([]), + body="", + claims=[], + contradictions=[], + ) + + provider = mock_provider_factory( + json_factory=lambda schema, system, user: SectionBody(body="x"), + ) + aggregate_all(layout=layout, provider=provider, cache=cache) + + # Cache key untouched. + assert cache.aggregation[section.id].notes_hash == "prior-walk-hash" + # And aggregation_fully_cached refuses to flag the section as fresh. 
+ assert not aggregation_fully_cached(layout, cache), ( + "Path 2 must not be reachable as 'fully cached' until a real Path 4 rewrite refreshes citations" + ) diff --git a/wikifi/aggregator.py b/wikifi/aggregator.py index f5ec240..e8d3fc7 100644 --- a/wikifi/aggregator.py +++ b/wikifi/aggregator.py @@ -28,7 +28,7 @@ from pydantic import BaseModel, Field -from wikifi.cache import WalkCache, hash_section_notes +from wikifi.cache import WalkCache, hash_section_notes, note_finding_ids from wikifi.evidence import ( Claim, Contradiction, @@ -39,6 +39,7 @@ ) from wikifi.providers.base import LLMProvider from wikifi.sections import PRIMARY_SECTIONS, Section +from wikifi.surgical import classify_section_change, surgical_aggregate from wikifi.wiki import WikiLayout, read_notes, write_section log = logging.getLogger("wikifi.aggregator") @@ -101,6 +102,8 @@ class AggregationStats: sections_written: int = 0 sections_empty: int = 0 sections_cached: int = 0 + sections_edited: int = 0 + sections_rewritten: int = 0 def aggregate_all( @@ -109,6 +112,8 @@ def aggregate_all( provider: LLMProvider, cache: WalkCache | None = None, persist_cache: Callable[[], None] | None = None, + surgical_threshold: float = 0.3, + use_surgical_edits: bool = True, ) -> AggregationStats: """Aggregate every primary section from its accumulated notes. @@ -116,9 +121,33 @@ def aggregate_all( `wikifi.deriver.derive_all` after this stage — they have no per-file notes to aggregate from. - When ``cache`` is supplied and the section's note digest is unchanged - from the prior walk, the cached body and evidence are reused without - invoking the LLM. + Each section follows one of four paths, in priority order: + + 1. **Cache hit** — ``notes_hash`` matches the cached entry exactly. + Re-render the cached bundle, no LLM call. (Plan A behavior.) + 2. **Unchanged finding set** — ``notes_hash`` differs but the + finding ids are identical (a source line shifted, summary + changed, etc.). 
Re-render the cached body, no LLM call. The + cached ``notes_hash`` is *not* refreshed: the citations in the + cached claims still reference the prior walk's line ranges / + fingerprints, so a later walk needs another aggregation pass to + refresh them. Leaving the cache key stale keeps + :func:`aggregation_fully_cached` honest — the orchestrator's + short-circuit will not skip stage 3 for a section whose + citations could be drifting. + 3. **Surgical edit** — finding-set churn ratio is at or below + ``surgical_threshold``. Send the cached body plus the added / + removed delta to the LLM and merge the edit into the cached + claims. Preserves prose for unchanged paragraphs. + 4. **Full rewrite** — too much churn (or no prior cached body to + edit). Re-aggregate from scratch. (Plan A behavior.) + + Only Path 3 is gated by ``use_surgical_edits``; setting it to + ``False`` disables the LLM-side surgical edit, leaving the other + three paths (the no-LLM full cache hit, the no-LLM + unchanged-finding-set re-render, and the LLM-backed full rewrite) + intact. Path 2 in particular still fires when + only line ranges or summaries drift — that's a generic cache-reuse + optimization not specific to surgical editing. When ``persist_cache`` is supplied, it is invoked after each successful section's cache update — that turns a Ctrl-C / OOM mid-stage-3 into a @@ -152,42 +181,114 @@ def aggregate_all( continue notes_hash = hash_section_notes(notes) + live_finding_ids = note_finding_ids(notes, section_id=section.id) + + # Path 1: full cache hit. Identical notes payload, identical + # citations, identical body — re-render and move on.
if cache is not None: - cached = cache.lookup_aggregation(section.id, notes_hash) - if cached is not None: + cached_hit = cache.lookup_aggregation(section.id, notes_hash) + if cached_hit is not None: bundle = EvidenceBundle( - body=cached.body, - claims=[Claim.model_validate(c) for c in cached.claims], - contradictions=[Contradiction.model_validate(c) for c in cached.contradictions], + body=cached_hit.body, + claims=[Claim.model_validate(c) for c in cached_hit.claims], + contradictions=[Contradiction.model_validate(c) for c in cached_hit.contradictions], ) write_section(layout, section, render_section_body(bundle)) stats.sections_cached += 1 stats.sections_written += 1 continue - try: - structured = provider.complete_json( - system=AGGREGATION_SYSTEM_PROMPT, - user=_render_user_prompt(section, notes), - schema=SectionBody, + cached_entry = cache.aggregation.get(section.id) if cache is not None else None + change = classify_section_change( + cached=cached_entry, + live_finding_ids=live_finding_ids, + surgical_threshold=surgical_threshold if use_surgical_edits else -1.0, + ) + + # Path 2: finding ids unchanged but notes_hash differs (e.g. + # line ranges shifted, summary changed). The cached body's + # narrative still holds because every finding_id is still + # present, so we re-render rather than calling the LLM. + # + # Deliberately *don't* refresh the cache entry here. The cached + # ``claims`` carry resolved SourceRefs (file/lines/fingerprint) + # captured at prior-walk extraction time; if line ranges or + # fingerprints drifted, those refs are stale. Re-rendering with + # them is a tolerable per-walk shortcut, but updating + # ``notes_hash`` to match live notes would let + # :func:`aggregation_fully_cached` flag the section as fresh + # and the orchestrator would then skip stage 3 entirely on the + # next no-source-change walk — locking the stale citations in + # place. 
Leaving the cache key alone keeps the predicate honest + # and lets a future Path 4 rewrite refresh citations cleanly. + if change.decision == "unchanged" and cached_entry is not None: + bundle = EvidenceBundle( + body=cached_entry.body, + claims=[Claim.model_validate(c) for c in cached_entry.claims], + contradictions=[Contradiction.model_validate(c) for c in cached_entry.contradictions], ) - bundle = _bundle_from(structured, notes) - rendered = render_section_body(bundle) - except Exception as exc: - log.warning("aggregation failed for %s: %s", section.id, exc) - rendered = _fallback_body(section, notes, error=str(exc)) - bundle = None - - write_section(layout, section, rendered) + write_section(layout, section, render_section_body(bundle)) + stats.sections_cached += 1 + stats.sections_written += 1 + continue + + bundle: EvidenceBundle | None = None + path_taken: str = "rewrite" + + # Path 3: surgical edit. Hand the cached body + delta to the LLM. + # On any failure we fall through to the rewrite path so the user + # never gets a missing section. + if change.decision == "surgical" and cached_entry is not None: + try: + bundle = surgical_aggregate( + section=section, + cached=cached_entry, + live_notes=notes, + change=change, + provider=provider, + ) + path_taken = "edited" + except Exception as exc: + log.warning( + "surgical edit failed for %s (%s); falling back to full rewrite", + section.id, + exc, + ) + bundle = None + + # Path 4: full rewrite. Either the classifier said so, or the + # surgical attempt above raised. 
+ if bundle is None: + try: + structured = provider.complete_json( + system=AGGREGATION_SYSTEM_PROMPT, + user=_render_user_prompt(section, notes), + schema=SectionBody, + ) + bundle = _bundle_from(structured, notes) + path_taken = "rewrite" + except Exception as exc: + log.warning("aggregation failed for %s: %s", section.id, exc) + rendered = _fallback_body(section, notes, error=str(exc)) + write_section(layout, section, rendered) + stats.sections_written += 1 + continue + + write_section(layout, section, render_section_body(bundle)) stats.sections_written += 1 + if path_taken == "edited": + stats.sections_edited += 1 + else: + stats.sections_rewritten += 1 - if cache is not None and bundle is not None: + if cache is not None: cache.record_aggregation( section.id, notes_hash=notes_hash, body=bundle.body, claims=[c.model_dump() for c in bundle.claims], contradictions=[c.model_dump() for c in bundle.contradictions], + finding_ids=live_finding_ids, ) if persist_cache is not None: persist_cache() diff --git a/wikifi/cache.py b/wikifi/cache.py index 5d38685..72fd3b9 100644 --- a/wikifi/cache.py +++ b/wikifi/cache.py @@ -49,11 +49,16 @@ AGGREGATION_CACHE_FILENAME = "aggregation.json" DERIVATION_CACHE_FILENAME = "derivation.json" INTROSPECTION_CACHE_FILENAME = "introspection.json" -# Bumped from 1 → 2 when stages 3 & 4 gained incremental persistence and -# the derivation/introspection caches were added. v1 entries load to -# empty so an upgraded wiki re-extracts on the first walk; subsequent -# walks pick up the new short-circuit behavior. -CACHE_VERSION = 2 +# Cache schema version, bumped to invalidate every entry across upgrades: +# v1 → v2: stages 3 & 4 gained incremental persistence; derivation + +# introspection caches added. 
+# v2 → v3: ``CachedSection.finding_ids`` added so the aggregator can +# classify per-section deltas (added / removed findings) and +# choose between cache-hit, surgical-edit, and full-rewrite +# paths instead of always rewriting on any note change. +# Older entries load to empty so an upgraded wiki re-extracts on the +# first walk; subsequent walks pick up the new behavior. +CACHE_VERSION = 3 # Re-exposed for callers that already import ``CACHE_DIRNAME`` from this # module; the constant itself lives in :mod:`wikifi.wiki` next to the @@ -72,6 +77,7 @@ "WalkCache", "aggregation_cache_path", "cache_dir", + "compute_finding_id", "derivation_cache_path", "extraction_cache_path", "hash_introspection_scope", @@ -79,6 +85,7 @@ "hash_upstream_bodies", "introspection_cache_path", "load", + "note_finding_ids", "reset", "save", "save_aggregation", @@ -100,12 +107,24 @@ class CachedFindings: @dataclass class CachedSection: - """Per-section aggregator output recovered from cache.""" + """Per-section aggregator output recovered from cache. + + ``finding_ids`` is the ordered list of stable per-note ids (see + :func:`compute_finding_id`) that fed into the cached body. Its sole + consumer is :func:`wikifi.surgical.classify_section_change`, which + diffs cached vs live ids to decide between cache-hit / unchanged / + surgical / rewrite. Cached ``claims`` and ``contradictions`` already + carry resolved :class:`~wikifi.evidence.SourceRef` objects, so this + list is *not* used to re-resolve sources — note positional alignment + is preserved for the classifier's set comparison, not for source + reconstruction. 
+ """ notes_hash: str body: str claims: list[dict[str, Any]] = field(default_factory=list) contradictions: list[dict[str, Any]] = field(default_factory=list) + finding_ids: list[str] = field(default_factory=list) @dataclass @@ -213,12 +232,14 @@ def record_aggregation( body: str, claims: list[dict[str, Any]] | None = None, contradictions: list[dict[str, Any]] | None = None, + finding_ids: list[str] | None = None, ) -> None: self.aggregation[section_id] = CachedSection( notes_hash=notes_hash, body=body, claims=list(claims or []), contradictions=list(contradictions or []), + finding_ids=list(finding_ids or []), ) # ----- derivation scope ----- @@ -350,6 +371,7 @@ def save_aggregation(layout: WikiLayout, cache: WalkCache) -> None: "body": entry.body, "claims": entry.claims, "contradictions": entry.contradictions, + "finding_ids": entry.finding_ids, } for sid, entry in cache.aggregation.items() }, @@ -441,6 +463,7 @@ def _load_aggregation(path: Path) -> dict[str, CachedSection]: body=entry.get("body", ""), claims=list(entry.get("claims", [])), contradictions=list(entry.get("contradictions", [])), + finding_ids=[str(fid) for fid in entry.get("finding_ids", [])], ) except (KeyError, TypeError, ValueError) as exc: log.warning("dropping malformed aggregation cache entry %s: %s", sid, exc) @@ -532,6 +555,53 @@ def hash_upstream_bodies(upstream_bodies: dict[str, str]) -> str: return hash_text(json.dumps(payload, ensure_ascii=False, sort_keys=True)) +def compute_finding_id(*, file: str, section_id: str, finding: str) -> str: + """Stable identity for one note across walks. + + Composition: ``sha256(file + "::" + section_id + "::" + finding)``, + truncated via :func:`wikifi.fingerprint.hash_text` to + ``FINGERPRINT_LENGTH`` (12) hex chars — the same digest length the + rest of wikifi uses for content fingerprints. 
Two walks that emit + the same finding text from the same file targeting the same section + produce the same id; any change in any of the three components is + a fresh id. + + A reword of ``finding`` (even one character) gets a new id, which + semantically counts as "removed-and-added" from the surgical + aggregator's point of view — the same as a delete-plus-insert in a + text diff. That's intentional: a paragraph the cached body wrote + around the old wording can no longer be assumed to still be + grounded by the new wording. + """ + from wikifi.fingerprint import hash_text + + return hash_text(f"{file}::{section_id}::{finding}") + + +def note_finding_ids(notes: list[dict[str, Any]], *, section_id: str) -> list[str]: + """Compute the ordered ``finding_id`` list for a section's notes. + + Aligned with note position so the surgical classifier and the + aggregator's Path 2 fast path can reason about which findings + appear at which index. Notes missing ``file`` or ``finding`` (or + carrying empty values for either) get an empty-string id — + `compute_finding_id` would still produce a stable hash for empty + inputs, which would let two malformed notes from different walks + look "unchanged" to the classifier. Returning ``""`` here forces + them to compare unequal to any real id and routes the section + around them. + """ + out: list[str] = [] + for note in notes: + file = str(note.get("file") or "") + finding = str(note.get("finding") or "") + if not file or not finding: + out.append("") + continue + out.append(compute_finding_id(file=file, section_id=section_id, finding=finding)) + return out + + def hash_introspection_scope(*, include: list[str], exclude: list[str]) -> str: """Hash only the scope-defining fields of an :class:`IntrospectionResult`. 
diff --git a/wikifi/cli.py b/wikifi/cli.py index ce997a1..294b650 100644 --- a/wikifi/cli.py +++ b/wikifi/cli.py @@ -92,6 +92,16 @@ def walk( bool, typer.Option("--review/--no-review", help="Run the critic + reviser loop on derivative sections."), ] = False, + no_surgical: Annotated[ + bool, + typer.Option( + "--no-surgical", + help=( + "Disable the surgical-edit path. When some findings change in a section, " + "fall back to a full LLM rewrite instead of editing the cached body in place." + ), + ), + ] = False, provider: Annotated[ str | None, typer.Option( @@ -108,6 +118,8 @@ def walk( reset_cache(WikiLayout(root=target)) if review: settings = settings.model_copy(update={"review_derivatives": True}) + if no_surgical: + settings = settings.model_copy(update={"use_surgical_edits": False}) if provider: settings = settings.model_copy(update={"provider": provider}) @@ -117,7 +129,8 @@ def walk( f"provider=[cyan]{settings.provider}[/cyan] model=[cyan]{settings.model}[/cyan]\n" f"cache=[cyan]{settings.use_cache}[/cyan] graph=[cyan]{settings.use_graph}[/cyan] " f"specialized=[cyan]{settings.use_specialized_extractors}[/cyan] " - f"review=[cyan]{settings.review_derivatives}[/cyan]", + f"review=[cyan]{settings.review_derivatives}[/cyan] " + f"surgical=[cyan]{settings.use_surgical_edits}[/cyan]", title="starting", ) ) @@ -155,7 +168,9 @@ def walk( "3. 
Aggregation", f"sections_written={report.aggregation.sections_written} " f"sections_empty={report.aggregation.sections_empty} " - f"sections_cached={report.aggregation.sections_cached}", + f"sections_cached={report.aggregation.sections_cached} " + f"sections_edited={report.aggregation.sections_edited} " + f"sections_rewritten={report.aggregation.sections_rewritten}", ) derivation_row = ( f"sections_derived={report.derivation.sections_derived} " diff --git a/wikifi/config.py b/wikifi/config.py index 977b5a6..bb3c542 100644 --- a/wikifi/config.py +++ b/wikifi/config.py @@ -120,6 +120,27 @@ class Settings(BaseSettings): default=7, description="Minimum critic score below which the reviser is invoked.", ) + use_surgical_edits: bool = Field( + default=True, + description=( + "When some findings change in a section but most are unchanged, edit the cached " + "section body in place around the delta instead of rewriting from scratch. " + "Preserves established prose and citation numbering. Disabling this only gates " + "the LLM-side surgical edit path; the no-LLM cache-reuse paths (full cache hit " + "and unchanged-finding-set re-render) still fire regardless of this flag." + ), + ) + surgical_edit_threshold: float = Field( + default=0.3, + ge=0.0, + le=1.0, + description=( + "Maximum churn ratio (added+removed findings divided by live findings) that " + "still routes to the surgical-edit path. Above this a section falls back to " + "full re-aggregation, which produces a cleaner narrative when the underlying " + "evidence has shifted substantially." 
+ ), + ) + # ----- Anthropic provider knobs ----- diff --git a/wikifi/orchestrator.py b/wikifi/orchestrator.py index 0ce462a..49cc6ea 100644 --- a/wikifi/orchestrator.py +++ b/wikifi/orchestrator.py @@ -205,6 +205,8 @@ def _persist_aggregation() -> None: provider=provider, cache=cache, persist_cache=_persist_aggregation if cache is not None else None, + surgical_threshold=settings.surgical_edit_threshold, + use_surgical_edits=settings.use_surgical_edits, ) def _persist_derivation() -> None: diff --git a/wikifi/surgical.py b/wikifi/surgical.py new file mode 100644 index 0000000..bd35dd1 --- /dev/null +++ b/wikifi/surgical.py @@ -0,0 +1,420 @@ +"""Surgical aggregation — edit a cached section body around a small finding delta. + +Plan A's aggregation cache is whole-section: any change to any contributing +file invalidates the section's body and triggers a from-scratch re-synthesis. +That works, but on a partially changed repo it (a) wastes tokens +re-narrating identical claims and (b) risks losing established prose that +the prior walk got right because the model has no anchor to it. + +Plan B inserts a third path between "cache hit" and "full rewrite": + +- **unchanged** — every cached finding id is present in the live notes and + vice versa. Hand the cached body back *without* refreshing the cache key + (the ``notes_hash`` may have shifted due to a source line move that didn't + touch the finding text itself; the stale key lets a later aggregation + pass refresh the cached citations). +- **surgical** — the churn ratio (added plus removed finding ids, divided + by live findings) is at or below ``surgical_edit_threshold``. Send the cached body plus a + delta of *added* and *removed* findings to the LLM and ask it to edit + in place, preserving every paragraph that doesn't depend on the delta. +- **rewrite** — too much churn. Fall back to Plan A's whole-section + re-aggregation path; surgical edits beyond the threshold tend to hide + latent inconsistencies the model would otherwise resolve cleanly.
+ +The surgical path is the one that protects the user's stated concern — +"potentially omitting or changing key details" — by making the cached +body the explicit anchor the LLM has to edit around rather than +re-derive from scratch. +""" + +from __future__ import annotations + +import logging +from dataclasses import dataclass +from typing import Literal + +from pydantic import BaseModel, Field + +from wikifi.cache import CachedSection +from wikifi.evidence import Claim, Contradiction, EvidenceBundle, SourceRef, coalesce_refs +from wikifi.providers.base import LLMProvider +from wikifi.sections import Section + +log = logging.getLogger("wikifi.surgical") + + +SURGICAL_SYSTEM_PROMPT = """\ +You are wikifi's section editor. You receive an existing markdown body for one \ +section of a technology-agnostic wiki, plus a small set of *added* findings \ +(new evidence) and *removed* findings (evidence no longer supported by the \ +source). Edit the body to integrate the added findings and revise around the \ +removed ones. + +Rules: +- PRESERVE unchanged paragraphs verbatim. Every sentence that does not \ + depend on a removed finding or directly contradict an added one MUST \ + appear in the output exactly as it was in the input. The reader should \ + be able to diff your output against the input and see only the localized \ + edit. +- When integrating an added finding, add it as a single sentence or bullet \ + within the most topically relevant existing paragraph. Only create a new \ + paragraph if no existing one is on-topic. +- When a removed finding underpinned a sentence, revise or delete that \ + sentence. Do not leave a claim in the body that no longer has source \ + support. +- Tech-agnostic. Never name languages, frameworks, or libraries — \ + translate every observation into domain terms. +- Use `new_claims` to declare each claim newly introduced in the edited \ + body. 
Index its `source_indices` against the *added findings* list \ + (1-based; the user prompt tags them [A1], [A2], …). +- Use `removed_claim_indices` (1-based against the *cached claims* list \ + the user prompt shows as [C1], [C2], …) to identify any cached claim \ + whose supporting evidence is gone. Those entries will be dropped from \ + the rendered citation footer. +- Use `contradictions` for the FULL post-edit set (not a delta). It \ + replaces the cached contradictions verbatim, so include any contradictions \ + that survived the edit too. +- Output the body only (no top-level heading); the writer adds the title. +""" + + +class SurgicalClaim(BaseModel): + """One claim added by the surgical edit, indexed against the *added* notes.""" + + text: str = Field(description="One assertion newly introduced in the edited body.") + source_indices: list[int] = Field( + default_factory=list, + description=( + "1-based indices into the ADDED findings list (the [A1], [A2] tags). " + "Do not reference cached claims here — use removed_claim_indices for those." + ), + ) + + +class SurgicalContradiction(BaseModel): + summary: str = Field(description="One-sentence description of the disagreement.") + positions: list[SurgicalClaim] = Field( + default_factory=list, + description="Each disagreeing position, with its own added-findings indices.", + ) + + +class SurgicalEdit(BaseModel): + """The model's structured output for one surgical aggregation pass.""" + + body: str = Field(description="Edited markdown body. Preserves unchanged paragraphs verbatim.") + new_claims: list[SurgicalClaim] = Field( + default_factory=list, + description="Claims newly introduced by the edit; indexed against added findings.", + ) + removed_claim_indices: list[int] = Field( + default_factory=list, + description=( + "1-based indices into the cached claims list ([C1], [C2] in the prompt) for " + "claims whose supporting evidence is gone — drop them from the citation footer." 
+ ), + ) + contradictions: list[SurgicalContradiction] = Field( + default_factory=list, + description="Full post-edit contradictions list; replaces the cached entries.", + ) + + +@dataclass(frozen=True) +class SectionChange: + """Per-section diff between cached and live finding sets.""" + + decision: Literal["unchanged", "surgical", "rewrite"] + added_indices: list[int] + """1-based positions in the live notes whose finding_id is not in the cache.""" + removed_ids: list[str] + """Cached finding ids no longer present in the live notes.""" + unchanged_count: int + total_live: int + + @property + def churn_ratio(self) -> float: + # Mirror :func:`classify_section_change`'s guard: when the live + # finding set is empty but the cache had findings (i.e. + # ``removed_ids`` is non-empty), every cached finding is gone — + # the maximum possible churn. Returning 0.0 there would be + # misleading and could let downstream callers treat a + # remove-everything diff as "no change." + if self.total_live == 0: + return 1.0 if self.removed_ids else 0.0 + return (len(self.added_indices) + len(self.removed_ids)) / self.total_live + + +def classify_section_change( + *, + cached: CachedSection | None, + live_finding_ids: list[str], + surgical_threshold: float, +) -> SectionChange: + """Choose between cache-hit-rerender, surgical edit, and full rewrite. + + - No cached entry, or a cached entry with no ``finding_ids`` (legacy + v2 caches), routes to ``rewrite`` — without a prior finding-set + we can't compute a meaningful delta. + - Symmetric difference empty → ``unchanged`` (the caller refreshes + the notes_hash and re-renders the cached body, no LLM call). + - Otherwise compute churn ratio. ≤ threshold → ``surgical``; + > threshold → ``rewrite``. 
+ """ + if cached is None or not cached.finding_ids: + return SectionChange( + decision="rewrite", + added_indices=list(range(1, len(live_finding_ids) + 1)), + removed_ids=[], + unchanged_count=0, + total_live=len(live_finding_ids), + ) + + # Empty-string ids surface from malformed or legacy notes (see + # :func:`wikifi.cache.note_finding_ids`). They have no meaningful + # identity, so any section that contains one — cached or live — + # forces a full rewrite rather than risking a "two empties look + # unchanged" set collision in the comparisons below. + if "" in cached.finding_ids or "" in live_finding_ids: + return SectionChange( + decision="rewrite", + added_indices=list(range(1, len(live_finding_ids) + 1)), + removed_ids=list(cached.finding_ids), + unchanged_count=0, + total_live=len(live_finding_ids), + ) + + cached_set = set(cached.finding_ids) + live_set = set(live_finding_ids) + added_indices = [i + 1 for i, fid in enumerate(live_finding_ids) if fid not in cached_set] + removed_ids = [fid for fid in cached.finding_ids if fid not in live_set] + unchanged_count = len(cached_set & live_set) + total_live = len(live_finding_ids) + + if not added_indices and not removed_ids: + return SectionChange( + decision="unchanged", + added_indices=[], + removed_ids=[], + unchanged_count=unchanged_count, + total_live=total_live, + ) + + # ``max(total_live, 1)`` guards the "removed everything" edge case + # where total_live is 0 — the entire cached set is gone, which is a + # rewrite by any reasonable threshold. 
+ churn = (len(added_indices) + len(removed_ids)) / max(total_live, 1) + decision: Literal["unchanged", "surgical", "rewrite"] = "surgical" if churn <= surgical_threshold else "rewrite" + return SectionChange( + decision=decision, + added_indices=added_indices, + removed_ids=removed_ids, + unchanged_count=unchanged_count, + total_live=total_live, + ) + + +def surgical_aggregate( + *, + section: Section, + cached: CachedSection, + live_notes: list[dict], + change: SectionChange, + provider: LLMProvider, +) -> EvidenceBundle: + """Send the cached body + finding delta to the LLM and merge the edit. + + Returns the bundle the renderer takes (body + claims + contradictions). + Citation re-anchoring is done here: + - Cached claims survive verbatim, except those listed in + ``removed_claim_indices`` are dropped. + - ``new_claims`` indices are resolved against the *added notes only* + and converted to :class:`SourceRef`. + - Contradictions are fully replaced from the model output. + """ + added_notes = [live_notes[i - 1] for i in change.added_indices if 1 <= i <= len(live_notes)] + removed_notes = _removed_finding_descriptors(change.removed_ids) + user_prompt = _render_surgical_user_prompt( + section=section, + cached_body=cached.body, + added_notes=added_notes, + removed_notes=removed_notes, + cached_claims=cached.claims, + ) + edit = provider.complete_json( + system=SURGICAL_SYSTEM_PROMPT, + user=user_prompt, + schema=SurgicalEdit, + ) + return _merge_edit_with_cached( + edit=edit, + cached=cached, + added_notes=added_notes, + ) + + +def _merge_edit_with_cached( + *, + edit: SurgicalEdit, + cached: CachedSection, + added_notes: list[dict], +) -> EvidenceBundle: + """Re-anchor citations: keep cached claims minus removed, add new claims.""" + survivors = _drop_removed_claims(cached.claims, edit.removed_claim_indices) + surviving_claims = [ + Claim( + text=c.get("text", ""), + sources=[SourceRef.model_validate(s) for s in c.get("sources", [])], + ) + for c in survivors + ] 
+ + added_refs = _refs_per_added_note(added_notes) + new_claims = [ + Claim( + text=nc.text, + sources=_resolve_added_indices(nc.source_indices, added_refs), + ) + for nc in edit.new_claims + ] + + contradictions = [ + Contradiction( + summary=c.summary, + positions=[ + Claim(text=p.text, sources=_resolve_added_indices(p.source_indices, added_refs)) for p in c.positions + ], + ) + for c in edit.contradictions + ] + + return EvidenceBundle( + body=edit.body, + claims=surviving_claims + new_claims, + contradictions=contradictions, + ) + + +def _drop_removed_claims(cached_claims: list[dict], removed_indices: list[int]) -> list[dict]: + """Return the cached claims whose 1-based index is NOT in ``removed_indices``.""" + drop = {i for i in removed_indices if 1 <= i <= len(cached_claims)} + return [c for i, c in enumerate(cached_claims, start=1) if i not in drop] + + +def _refs_per_added_note(added_notes: list[dict]) -> list[list[SourceRef]]: + """Map each added note to its source refs (mirrors ``aggregator._refs_per_note``).""" + out: list[list[SourceRef]] = [] + for note in added_notes: + sources = note.get("sources") + if isinstance(sources, list) and sources: + try: + out.append([SourceRef.model_validate(s) for s in sources]) + continue + except Exception: # malformed sources — fall back to file + pass + file = note.get("file") + if file: + out.append([SourceRef(file=str(file))]) + else: + out.append([]) + return out + + +def _resolve_added_indices(indices: list[int], added_refs: list[list[SourceRef]]) -> list[SourceRef]: + refs: list[SourceRef] = [] + for idx in indices: + real = idx - 1 + if 0 <= real < len(added_refs): + refs.extend(added_refs[real]) + return coalesce_refs(refs) + + +def _removed_finding_descriptors(removed_ids: list[str]) -> list[dict]: + """Build the ``removed_notes`` payload the surgical prompt expects. 
+ + The cache stores stable ``finding_ids`` aligned with the prior + walk's notes order, but it does *not* persist the original note + bodies (file path or finding text). All we can give the model for + a removed finding is the opaque id; the prompt then asks the model + to find any sentence in the cached body that cites this id-shaped + handle and revise around it. + + Restoring richer context (file path, finding text) would require + persisting the underlying notes alongside ``finding_ids`` in the + cache schema — a separate change, not addressed here. + """ + return [{"finding_id": fid} for fid in removed_ids] + + +def _render_surgical_user_prompt( + *, + section: Section, + cached_body: str, + added_notes: list[dict], + removed_notes: list[dict], + cached_claims: list[dict], +) -> str: + lines: list[str] = [] + lines.append(f"## Section: {section.title} (id: {section.id})") + lines.append("") + lines.append("### Brief") + lines.append(section.description) + lines.append("") + lines.append("### Current section body (preserve unchanged paragraphs verbatim)") + lines.append("```markdown") + lines.append(cached_body.strip()) + lines.append("```") + lines.append("") + + lines.append(f"### Added findings ({len(added_notes)}) — index against new_claims.source_indices") + if not added_notes: + lines.append("_(none)_") + for idx, note in enumerate(added_notes, start=1): + file_ref = note.get("file", "?") + finding = note.get("finding", "") + summary = note.get("summary", "") + role = f" (file role: {summary})" if summary else "" + lines.append(f"[A{idx}] {file_ref}{role}: {finding}") + lines.append("") + + lines.append( + f"### Removed findings ({len(removed_notes)}) — these no longer have source support; " + "revise or drop the prose that depended on them" + ) + if not removed_notes: + lines.append("_(none)_") + for idx, note in enumerate(removed_notes, start=1): + file_ref = note.get("file", note.get("finding_id", "?")) + finding = note.get("finding", "") + if 
finding: + lines.append(f"[R{idx}] {file_ref}: {finding}") + else: + lines.append(f"[R{idx}] (cached finding id {file_ref})") + lines.append("") + + lines.append( + f"### Cached claims ({len(cached_claims)}) — index into removed_claim_indices " + "to drop any whose evidence is gone" + ) + if not cached_claims: + lines.append("_(none)_") + for idx, claim in enumerate(cached_claims, start=1): + text = claim.get("text", "") + lines.append(f"[C{idx}] {text}") + lines.append("") + + lines.append( + "Edit the body to integrate the added findings and revise around the removed ones. " + "Preserve unchanged paragraphs verbatim. Return SurgicalEdit per the schema." + ) + return "\n".join(lines) + + +__all__ = [ + "SURGICAL_SYSTEM_PROMPT", + "SectionChange", + "SurgicalClaim", + "SurgicalContradiction", + "SurgicalEdit", + "classify_section_change", + "surgical_aggregate", +]
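The three-way decision made by `classify_section_change` can be illustrated with a minimal, self-contained sketch. It is a simplified stand-in, not the module's code: `churn` and `decide` are hypothetical helper names, plain lists of ids replace `CachedSection`, and the `0.25` threshold is an assumed value chosen for the example (the actual default of `surgical_edit_threshold` is not shown in this diff).

```python
def churn(cached_ids: list[str], live_ids: list[str]) -> float:
    """Size of the symmetric difference relative to the live finding count."""
    cached_set, live_set = set(cached_ids), set(live_ids)
    added = [fid for fid in live_ids if fid not in cached_set]
    removed = [fid for fid in cached_ids if fid not in live_set]
    # max(..., 1) avoids dividing by zero when every live finding is gone.
    return (len(added) + len(removed)) / max(len(live_ids), 1)


def decide(cached_ids: list[str], live_ids: list[str], threshold: float) -> str:
    """Hypothetical stand-in for the unchanged / surgical / rewrite routing."""
    # No prior finding set, or an empty-string id on either side:
    # no meaningful delta can be computed, so fall back to a full rewrite.
    if not cached_ids or "" in cached_ids or "" in live_ids:
        return "rewrite"
    ratio = churn(cached_ids, live_ids)
    if ratio == 0.0:
        return "unchanged"
    return "surgical" if ratio <= threshold else "rewrite"


# Ten cached findings; one is replaced in the live walk (1 added + 1 removed).
cached = [f"f{i}" for i in range(1, 11)]
live = cached[:-1] + ["f11"]

print(churn(cached, live))           # 0.2
print(decide(cached, live, 0.25))    # surgical
print(decide(cached, live, 0.1))     # rewrite
print(decide(cached, cached, 0.25))  # unchanged
print(decide(cached, [], 0.25))      # rewrite
```

With one finding swapped out of ten, the churn is (1 + 1) / 10 = 0.2, so the section is edited in place under the assumed 0.25 threshold and rewritten from scratch under a stricter 0.1 threshold.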