diff --git a/.wikifi/.gitignore b/.wikifi/.gitignore
index 23c6b94..03adbc0 100644
--- a/.wikifi/.gitignore
+++ b/.wikifi/.gitignore
@@ -1,2 +1,3 @@
-# wikifi local working state — section markdown is committed, notes are not.
+# wikifi local working state — section markdown is committed, notes and cache are not.
 .notes/
+.cache/
diff --git a/.wikifi/capabilities.md b/.wikifi/capabilities.md
index fc72e87..2b780f7 100644
--- a/.wikifi/capabilities.md
+++ b/.wikifi/capabilities.md
@@ -1,28 +1,115 @@
 # Capabilities
-### Value Proposition & Core Purpose
-The application automates the transformation of raw source artifacts into structured, domain-focused documentation. By systematically analyzing codebases, it extracts business logic, system relationships, and functional capabilities, delivering a living knowledge base that reduces documentation debt, standardizes terminology, and accelerates cross-team onboarding. The system is designed to keep documentation synchronized with implementation without requiring manual authoring overhead.
-
-### Sequential Analysis Workflow
-The application operates through a deterministic, four-stage pipeline that progresses from structural discovery to polished documentation:
-
-| Pipeline Stage | Domain Focus | Primary Output |
-|---|---|---|
-| **Structural Analysis** | Repository layout evaluation, manifest inspection, and production-relevance classification | Scoped processing boundaries and system purpose inference |
-| **Granular Extraction** | File-by-file translation of technical implementations into domain concepts | Schema-validated, technology-agnostic capability notes |
-| **Section Synthesis** | Aggregation of extracted notes into cohesive documentation units | Finalized wiki sections with consistent structure and terminology |
-| **Cross-Cutting Derivation** | Identification of relationships spanning multiple components | Inferred user personas, behavioral stories, and system interaction diagrams |
-
-### Key Capabilities
-- **Intelligent Traversal & Filtering:** Recursively navigates directory structures while automatically excluding version-controlled noise, large binary assets, and empty stubs. Processing focus is dynamically adjusted to prioritize substantive, domain-relevant files.
-- **Domain-Centric Translation:** Strips away implementation-specific syntax to surface underlying business rules, data flows, and functional responsibilities. Technical artifacts are consistently mapped to business-readable concepts.
-- **Adaptive Reasoning Depth:** Analytical intensity can be tuned to balance comprehensive detail against processing efficiency, allowing the system to scale from lightweight overviews to deep architectural breakdowns.
-- **Workspace Lifecycle Management:** Initializes and maintains a standardized documentation environment, handling section scaffolding, versioning rules, and intermediate state cleanup between generation cycles.
-
-### Quality Assurance & Transparency
-- **Explicit Gap Declaration:** When upstream data is incomplete or ambiguous, the system preserves raw evidence and explicitly documents missing information rather than generating speculative content.
-- **Execution Reporting:** Produces detailed summaries capturing file inclusion/exclusion metrics, processing counts, and generation status for full auditability and pipeline monitoring.
-- **Timestamped Provenance:** Maintains a chronological record of extraction notes per section, enabling traceability from final documentation back to the original source artifacts.
-
-### Adaptive Configuration
-The application supports flexible configuration of analysis parameters, including file size thresholds, content length filters, and traversal depth limits. Analytical interactions are standardized into two operational modes: schema-validated structured generation for systematic processing phases, and free-form analytical generation for narrative documentation and visual representations. This dual-mode approach ensures both machine-readable consistency and human-readable clarity across all generated artifacts.
+wikifi turns any codebase into a structured, technology-agnostic wiki by walking source files, extracting structured knowledge, and synthesizing readable documentation with a full evidence trail that links every assertion back to specific source locations.
+
+## Documentation Pipeline
+
+Documentation is produced through four ordered stages:
+
+**1. Repository Triage**
+The system examines the directory layout and manifest files to determine which paths contain production source worth deeper analysis and which should be skipped (vendored dependencies, build artifacts, generated files, CI configuration). Files outside configurable size bounds are excluded before any analysis begins.[7][8]
+
+**2. Per-file Extraction**
+Each in-scope file is analyzed to produce structured findings describing its contribution to each wiki section.[9] Files whose format is well-structured — relational schemas, API contracts, interface definitions, and migration scripts — are routed to dedicated deterministic extractors that bypass AI inference entirely, improving accuracy and reducing cost for these artifact types. General-purpose source files are analyzed via AI inference, with large files split into overlapping chunks so no content is lost at boundaries. Findings are deduplicated across chunk boundaries to avoid double-counting, and each finding carries a citation (path and line range) for downstream traceability.
+
+Optionally, the system builds a cross-file import and reference graph before extraction begins. Each file's extraction is then enriched with its neighborhood in that graph — which files it depends on and which depend on it — enabling findings to describe cross-file flows rather than treating files in isolation.
+
+**3. Section Synthesis**
+Per-file findings are aggregated into coherent, readable markdown bodies for each primary wiki section. Every assertion in the output is backed by numbered citations traceable to the specific source files and line ranges from which it was inferred.
+
+**4. Derivative Section Generation**
+Higher-level artifacts — user personas, scenario-based user stories, and architectural diagrams — are synthesized from the finalized primary sections. If upstream content is absent, the system writes a placeholder declaring the gap rather than fabricating content.
+
+## Wiki Structure
+
+The generated wiki covers **eight primary sections**: business domains, system intent, capabilities, external dependencies, integrations, cross-cutting concerns, core entities, and hard specifications. Three derivative sections (personas, user stories, architectural diagrams) are generated only after all relevant primary sections are finalized.
+
+The system can scaffold a complete wiki directory structure in a target project in an idempotent manner — re-running initialization leaves existing content untouched while creating any missing pieces.
+
+## Conflict Detection and Evidence Traceability
+
+When source files make incompatible assertions about the same domain topic, the conflict is surfaced explicitly under a dedicated heading rather than silently resolved. This is a deliberate design choice for legacy codebases, where tribal knowledge often hides in inconsistencies; teams are directed to resolve conflicts before re-implementation. Claims that appear in the supporting evidence but cannot be matched to the synthesized narrative body are collected into a separate supporting-claims list, ensuring nothing is silently dropped.
+
+## Quality Assurance
+
+An optional critic-and-reviser loop evaluates each synthesized section against a structured rubric (scored 0–10), identifies unsupported claims and coverage gaps, and triggers a revision pass when the score falls below a configurable threshold. A revised body is accepted only if it improves or matches the prior score, preventing regressions. The loop is off by default to keep generation time predictable, but is most beneficial for derivative sections where single-shot synthesis is most likely to stray from evidence. If synthesis fails entirely for a section, the system falls back to emitting the raw notes directly, preserving information at the cost of polish.
+
+## Coverage and Readiness Reporting
+
+A dedicated report command produces a markdown table listing each section with its file count, finding count, body size, quality score, and the most prominent gap or unsupported claim — giving teams a one-page readiness summary. The coverage portion requires no AI provider and is safe for automated pipelines.
+
+## Incremental and Crash-Resumable Operation
+
+Two independently keyed caches — one per file (keyed by content fingerprint) and one per section (keyed by a hash of its full notes payload) — allow re-walks to skip unchanged material entirely. The cache is written after each file completes, making the pipeline crash-resumable.[34] Stale entries for files removed from the repository can be pruned. A monotonically incremented version number embedded in every cache file causes a clean rebuild on version mismatch, preventing stale data from surviving format upgrades. Cache files are written atomically so a crash during persistence never corrupts the stored state. A force-reanalysis mode is also available to drop the cache entirely and perform a clean walk.
+
+## Interactive Query Interface
+
+Once a wiki is generated, users can open a conversational session grounded in the populated wiki sections. Only sections with meaningful content are loaded as context. The session supports multi-turn questioning, conversation history reset while retaining the wiki context, and inspection of which sections are currently loaded.
+
+## Supporting claims
+- wikifi produces a technology-agnostic wiki from any codebase, linking every assertion back to specific source locations. [1][2][3]
+- Stage 1 examines the directory layout and manifest files to classify paths as worth walking or skippable (vendored dependencies, build output, generated files, CI configuration). [4][5][6]
+- Files with well-structured formats (relational schemas, API contracts, interface definitions, migration scripts) are routed to dedicated deterministic extractors that bypass AI inference entirely. [10][11][12][13][14][15]
+- Large files are split into overlapping chunks so no content is lost at boundaries, with separators tried from coarsest to finest. [16]
+- Findings are deduplicated across chunk boundaries to avoid double-counting, and each finding carries a citation (path and line range). [9]
+- The system optionally builds a cross-file import and reference graph, injecting each file's neighborhood into its extraction pass to enable cross-file flow descriptions. [17][18][19]
+- Per-file findings are aggregated into coherent, readable markdown bodies for each primary wiki section, with every assertion backed by numbered citations. [1][3]
+- Derivative sections — user personas, scenario-based user stories, and architectural diagrams — are synthesized from finalized primary sections; absent upstream content produces a placeholder rather than fabricated content. [20][21]
+- The wiki covers eight primary sections and three derivative sections; derivative sections are generated only after their upstream primaries are finalized. [21]
+- The system scaffolds a complete wiki directory structure idempotently, leaving existing content untouched while creating missing pieces. [22]
+- Incompatible assertions across source files are surfaced explicitly under a dedicated heading rather than silently resolved — a deliberate feature for legacy codebases where tribal knowledge hides in inconsistencies. [23][24]
+- Claims that appear in the supporting evidence but cannot be matched to the narrative body are collected into a separate supporting-claims list rather than silently dropped. [3]
+- An optional critic-and-reviser loop evaluates each synthesized section on a 0–10 rubric, identifies unsupported claims and gaps, and triggers revision when the score falls below a configurable threshold, accepting the revision only if it improves or matches the prior score. [25][26]
+- The critic-reviser loop is off by default to keep generation time predictable, and is most beneficial for derivative sections where single-shot synthesis is most likely to fabricate. [25][27]
+- If synthesis fails entirely, the system falls back to emitting raw notes directly in the section body, preserving information at the cost of polish. [28]
+- A report command produces a per-section markdown table with file counts, finding counts, body size, quality score, and the most prominent gap or unsupported claim. [29][30][31]
+- The coverage portion of the report requires no AI provider and is safe for automated pipelines. [29]
+- Two independently keyed caches — per-file (content fingerprint) and per-section (notes-payload hash) — allow re-walks to skip unchanged material entirely. [32][33][34]
+- Stale cache entries for removed files can be pruned in bulk. [35][18]
+- A monotonically incremented version number in every cache file triggers a clean rebuild on version mismatch, preventing stale data from surviving upgrades. [36]
+- Cache files are written atomically so a crash during persistence never leaves a corrupted cache. [37]
+- A force-reanalysis mode drops the on-disk cache entirely to perform a clean walk. [38]
+- Users can open a conversational session grounded in populated wiki sections, supporting multi-turn questioning, history reset, and inspection of loaded sections. [39][2]
+- Only sections with meaningful content are loaded as context for the conversational session; placeholder sections are filtered out. [40]
+
+## Sources
+1. `wikifi/aggregator.py:1-15`
+2. `wikifi/cli.py:63-210`
+3. `wikifi/evidence.py:85-120`
+4. `wikifi/introspection.py:28-44`
+5. `wikifi/introspection.py:61-70`
+6. `wikifi/walker.py:92-186`
+7. `.env.example:20-29`
+8. `wikifi/walker.py:100-130`
+9. `wikifi/extractor.py`
+10. `wikifi/config.py:97-102`
+11. `wikifi/extractor.py:183-218`
+12. `wikifi/repograph.py:1-15`
+13. `wikifi/specialized/__init__.py:8-11`
+14. `wikifi/specialized/dispatch.py:44-62`
+15. `wikifi/specialized/models.py:4-6`
+16. `wikifi/extractor.py:298-360`
+17. `wikifi/config.py:60-93`
+18. `wikifi/orchestrator.py:55-145`
+19. `wikifi/repograph.py:162-215`
+20. `wikifi/deriver.py:73-107`
+21. `wikifi/sections.py:44-142`
+22. `wikifi/wiki.py:72-101`
+23. `wikifi/aggregator.py:9-14`
+24. `wikifi/evidence.py:121-133`
+25. `wikifi/config.py:103-113`
+26. `wikifi/critic.py:100-153`
+27. `wikifi/deriver.py:90-103`
+28. `wikifi/aggregator.py:272-285`
+29. `wikifi/report.py:82-85`
+30. `wikifi/report.py:106-114`
+31. `wikifi/report.py:46-74`
+32. `wikifi/cache.py:5-15`
+33. `wikifi/config.py:88-96`
+34. `wikifi/extractor.py:166-182`
+35. `wikifi/cache.py:113-118`
+36. `wikifi/cache.py:37`
+37. `wikifi/cache.py:205-209`
+38. `wikifi/cli.py:90-122`
+39. `wikifi/chat.py:88-130`
+40. `wikifi/chat.py:63-82`
diff --git a/.wikifi/config.toml b/.wikifi/config.toml
index 571ab5e..28ed551 100644
--- a/.wikifi/config.toml
+++ b/.wikifi/config.toml
@@ -1,4 +1,4 @@
 # wikifi local config — overrides WIKIFI_* environment variables when present.
-provider = "ollama"
-model = "qwen3.6:27b"
+provider = "anthropic"
+model = "claude-sonnet-4-6"
 ollama_host = "http://localhost:11434"
diff --git a/.wikifi/cross_cutting.md b/.wikifi/cross_cutting.md
index aa10602..4dfe08d 100644
--- a/.wikifi/cross_cutting.md
+++ b/.wikifi/cross_cutting.md
@@ -1,25 +1,57 @@
 # Cross-Cutting Concerns
-### Observability and Monitoring
-The system maintains comprehensive visibility into pipeline execution through dynamic logging and structured progress tracking. Logging verbosity adjusts based on operational context, supporting both routine monitoring and deep debugging. Each processing stage emits standardized progress markers, while metric tracking and warning logs capture anomalies such as missing sections or synthesis failures. System identification metadata is exposed to facilitate version tracking and compatibility verification across deployments.
+## Observability and Logging
-### Data Integrity and Traceability
-Content generation adheres to strict evidence-based principles, explicitly prohibiting fabrication when upstream sources lack relevant information. Every derivative output is validated against a predefined schema to prevent malformed data from propagating downstream. When synthesis encounters failures, fallback mechanisms preserve raw findings, guaranteeing that all documentation sections receive either synthesized content, structured placeholders, or unprocessed source material. Traceability is maintained by linking generated content directly to its originating evidence, and deterministic processing order ensures consistent evaluation of source artifacts.
+Log verbosity is configured globally before any pipeline stage executes: a verbose flag activates debug-level output, while the default level is informational. Structured log events are emitted at the entry point of each pipeline stage, giving operators a continuous view of progress across the entire run.
-### State Management and Data Storage
-Workspace initialization and intermediate data resets are designed to be idempotent, preventing state corruption during repeated or interrupted executions. The system enforces strict format contracts for configuration files, documentation outputs, and intermediate logs to maintain structural consistency. Working state is automatically isolated from committed artifacts, preserving version control hygiene and ensuring that transient processing data does not interfere with finalized documentation.
+All significant failure modes follow a uniform pattern: errors are caught at the point of failure, logged at WARNING level, and a graceful fallback is substituted so that downstream stages are never blocked. Specifically:
-### Operational Guardrails and Determinism
-Runtime behavior is governed by centralized, environment-driven configuration that standardizes parameters across execution contexts. To prevent resource exhaustion and processing delays, the system enforces several operational limits:
+- Aggregation failures produce a fallback body that preserves the raw notes, ensuring a section is always written.
+- Derivation failures write a fallback body that retains the upstream evidence verbatim rather than leaving the section blank.
+- Quality-review failures return the original body with a zero score and a diagnostic annotation rather than propagating the error.
-| Constraint | Purpose |
+When an inference call returns neither structured output nor any usable text, a diagnostic message surfaces the stop reason, output-token count, and configured resource budget, together with actionable hints so operators can resolve the issue at the point it occurs.
+
+All provider backends share a single error-formatting routine that extracts a vendor-issued request identifier when present, producing uniformly attributable failure messages regardless of which backend is active.
+
+---
+
+## Error Isolation and Graceful Degradation
+
+Errors are scoped as narrowly as possible throughout the pipeline:
+
+- A failure in one file chunk does not abort extraction of the remaining chunks; a failure on one file does not abort the repository walk. Files that fail entirely are counted as skipped rather than silently lost.
+- Provider inference failures during interactive sessions are surfaced as inline error messages rather than terminating the session.
+- Cache I/O failures (missing files, malformed content, or bad individual entries) are logged as warnings and fall back to an empty cache, preserving pipeline continuity.
+- Provider-specific error shapes are never allowed to leak into the orchestration layer; all backends normalize errors through a shared formatting helper before re-raising them.
+
+---
+
+## Data Integrity and Source Provenance
+
+Full source provenance is a non-negotiable invariant: every claim in the output must carry the source file path, line range, and content fingerprint that justifies it. This citation chain is preserved through caching and replay so that any re-walk of the repository can verify claims against the current source.
+
+Content fingerprints serve three cross-cutting roles:
+
+| Role | Effect |
 |---|---|
-| Request Timeouts | Accommodate variable processing durations while preventing indefinite hangs |
-| File Size & Content Caps | Filter out oversized or trivial inputs to conserve computational resources |
-| Reasoning Mode Controls | Balance depth of analysis against execution speed |
-| Determinism Parameters | Ensure reproducible outputs for identical inputs across runs |
+| Cache keying | Stale extraction or aggregation results are never served when source content changes |
+| Citation anchoring | Claims in the wiki can be traced to the exact file revision that produced them |
+| Dependency-graph invalidation | Cross-file context is invalidated when any referenced file changes |
+
+Files are always hashed as raw bytes rather than decoded text, ensuring that encoding differences never cause the cache and the extractor to disagree on a file's identity. The aggregation cache key deliberately includes each source file's fingerprint and line range in addition to the finding text, so that even if the text is unchanged, a shift in the cited location triggers a cache miss and re-derives citations from fresh evidence.
+
+Contradictions found in the source are never silently merged. They are always rendered explicitly in the output so that data-integrity issues visible in the source are escalated to the team rather than hidden.
+
+Note records are stamped with a UTC timestamp at write time, providing an audit trail of when each per-file extraction was recorded.
+
+Structured inference output is constrained to a strict schema at every pipeline stage, ensuring deterministic parsing and making successive runs straightforwardly diffable.
+
+---
+
+## Hallucination and Fabrication Prevention
-The pipeline incorporates graceful degradation strategies to handle read errors, parsing failures, and permission restrictions during directory traversal without halting execution.
+Several independent mechanisms work together to ensure all generated content is grounded in extracted evidence:
-### Authentication and Authorization
-The provided notes do not contain information regarding access controls, credential management, or authorization mechanisms. This area remains undocumented and should be addressed separately to ensure secure handling of sensitive source materials and generated artifacts.
+- **Deterministic structured output**: Temperature is fixed at zero for all structured-output inference calls so that the same input always produces the same structured result across runs. This is treated as a non-negotiable invariant for the extraction path.
+- **Placeholder filtering**: A heuristic matches all known empty-section shapes (
diff --git a/.wikifi/diagrams.md b/.wikifi/diagrams.md
index f7588d9..4b27753 100644
--- a/.wikifi/diagrams.md
+++ b/.wikifi/diagrams.md
@@ -1,137 +1,146 @@
 # Diagrams
-### Domain Map
-The following graph visualizes the bounded contexts within the core domain of Automated Knowledge Translation. It reflects the strict, stage-gated dependency chain and the cross-cutting nature of external intelligence integration.
+Three diagrams follow: a domain map, an entity–relationship view, and an integration flow. All representations are technology-agnostic and derived solely from the documented system model.
+
+## Domain Map
+
+Subdomains, their responsibilities, and the directed dependency chain that governs pipeline ordering. No subdomain reaches backwards; the arrows below are the authoritative expression of inter-subdomain dependency.
 ```mermaid
 graph TD
- Core[Core Domain: Automated Knowledge Translation]
- Introspection[Repository Introspection & Curation\nSupporting]
- Extraction[Semantic Extraction & Analysis\nCore]
- Aggregation[Information Aggregation & Synthesis\nCore]
- Orchestration[Pipeline Orchestration & Lifecycle Management\nSupporting]
- External[External Intelligence Integration\nGeneralized]
-
- Core --> Introspection
- Introspection -->|Curated artifacts & structural metadata| Extraction
- Extraction -->|Structured knowledge units & analysis results| Aggregation
- Aggregation -->|Synthesized content & workspace population| Orchestration
-
- External -.->|On-demand pattern resolution & narrative generation| Extraction
+ subgraph CORE["Core Domain — Automated Documentation Synthesis"]
+ RI[Repository Introspection]
+ KE[Knowledge Extraction]
+ SS[Section Synthesis]
+ APW[Artefact Persistence — working state]
+ APC[Artefact Persistence — committed wiki]
+ end
+
+ RI -->|include and exclude scope| KE
+ KE -->|extraction notes| APW
+ KE -->|evidential record| SS
+ SS -->|rendered section markdown| APC
 ```
-**Key Observations:**
-- Data flows unidirectionally through the pipeline, with intermediate states explicitly persisted between stages to support incremental processing, auditability, and fault tolerance.
-- External Intelligence Integration operates as a generalized, cross-cutting capability invoked on-demand within the extraction context rather than dictating pipeline progression.
-- Orchestration and workspace lifecycle management responsibilities currently overlap; future modeling may require separating execution coordination from directory/configuration governance.
+## Entity Relationship View
-### Entity Relationship View
-This entity-relationship diagram maps the core domain entities, their primary fields, and the structural boundaries that govern data transformation from raw repository scanning to final documentation assembly.
+Core entities across all concern areas. Cardinality follows the documented information model.
 ```mermaid
 erDiagram
- CONFIGURATION ||--o{ SCAN_TRAVERSAL_CONFIG : "defines"
- SCAN_TRAVERSAL_CONFIG ||--o{ DIRECTORY_SUMMARY : "scopes"
- DIRECTORY_SUMMARY ||--|| INTROSPECTION_ASSESSMENT : "generates"
- INTROSPECTION_ASSESSMENT ||--o{ EXTRACTION_NOTE : "guides"
- EXTRACTION_NOTE }o--|| DOCUMENTATION_SECTION : "aggregates_to"
- DOCUMENTATION_SECTION ||--o{ AGGREGATION_STATS : "updates"
- DOCUMENTATION_SECTION ||--o{ WORKSPACE_LAYOUT : "populates"
- EXECUTION_SUMMARY }o--|| PIPELINE_EXECUTION : "observes"
-
- CONFIGURATION {
- string default_settings
- string local_overrides
- }
- SCAN_TRAVERSAL_CONFIG {
- string root_path
- string inclusion_exclusion_patterns
- number size_thresholds
- }
- DIRECTORY_SUMMARY {
- number file_count
- number total_size
- string extension_distribution
- boolean manifest_presence
- }
- INTROSPECTION_ASSESSMENT {
- string primary_languages
- string inferred_purpose
- string classification_rationale
+ SECTION {
+ string id PK
+ string title
+ string brief
+ string tier
 }
- EXTRACTION_NOTE {
- datetime timestamp
- string file_reference
- string role_summary
- string extracted_finding
- }
- DOCUMENTATION_SECTION {
- string category
- string aggregated_content
- string final_markdown_body
- }
- AGGREGATION_STATS {
- number successful_writes
- number empty_section_count
- }
- WORKSPACE_LAYOUT {
- string config_paths
- string notes_paths
- string sections_paths
- }
- EXECUTION_SUMMARY {
- string stage_metrics
- string completion_status
- string consolidated_findings
+ SECTION ||--o{ SECTION : "upstream-of"
+ WIKI_LAYOUT ||--o{ SECTION : "resolves paths for"
+ SECTION_REPORT }o--|| SECTION : "describes"
+ WIKI_REPORT ||--|{ SECTION_REPORT : "aggregates"
+ LOADED_SECTION ||--|| SECTION : "pairs body with"
+
+ SECTION ||--o{ SECTION_FINDING : "collects"
+ FILE_FINDINGS ||--|{ SECTION_FINDING : "groups"
+
+ SOURCE_REF {
+ string file_path
+ string line_range
+ string fingerprint
 }
+ CLAIM }o--|{ SOURCE_REF : "backed by"
+ CONTRADICTION }|--|{ CLAIM : "groups conflicting"
+ EVIDENCE_BUNDLE ||--|{ CLAIM : "contains"
+ EVIDENCE_BUNDLE ||--o{ CONTRADICTION : "contains"
+
+ WALK_REPORT ||--|| INTROSPECTION_RESULT : "carries"
+ WALK_REPORT ||--|| EXTRACTION_STATS : "carries"
+ WALK_REPORT ||--|| AGGREGATION_STATS : "carries"
+ WALK_REPORT ||--|| DERIVATION_STATS : "carries"
+ WALK_REPORT ||--|| WALK_CACHE : "carries"
+ WALK_REPORT ||--|| REPO_GRAPH : "carries"
+ WALK_CACHE ||--o{ CACHED_FINDINGS : "holds"
+ WALK_CACHE ||--o{ CACHED_SECTION : "holds"
+ REPO_GRAPH ||--|{ GRAPH_NODE : "indexes"
+
+ DERIVATION_STATS ||--o{ REVIEW_OUTCOME : "audit trail"
+ REVIEW_OUTCOME ||--|| CRITIQUE : "initial"
+ REVIEW_OUTCOME ||--o| CRITIQUE : "follow-up"
+
+ CHAT_SESSION ||--|| LLM_PROVIDER : "uses"
+ CHAT_SESSION ||--|{ CHAT_MESSAGE : "history"
 ```
-**Key Observations:**
-- Configuration entities establish hard boundaries for traversal and analysis, ensuring processing never exceeds defined size constraints or excluded paths.
-- Extraction notes are immutable, timestamped records tied to single source files, serving as the raw material for downstream aggregation.
-- Aggregation statistics and the execution summary function as cross-cutting observers, tracking pipeline health and output readiness without interfering with the primary data flow.
-- **Known Gap:** The exact mapping rules between intermediate extraction notes and final documentation sections are implied but not explicitly detailed. Further specification is required to define how notes are grouped, prioritized, or filtered during section assembly, and how empty sections are resolved or reported upstream.
+## Integration Flow
-### Integration Flow
-The sequence diagram below illustrates the internal pipeline handoffs and external interface interactions. It captures the staged execution model, centralized orchestration, and abstracted external dependencies.
+End-to-end pipeline sequence from CLI invocation through all four stages, showing each stage's interactions with the LLM provider abstraction, the cache layer, the import graph, and the filesystem layout.
 ```mermaid
 sequenceDiagram
- participant CLI as CLI Interface
- participant Orch as Orchestrator
- participant Traversal as Traversal & Introspection
- participant Extractor as Source Analysis & Extraction
- participant Aggregator as Content Aggregation
- participant Deriver as Derivative Generation
- participant AI as Generative AI Services
- participant Telemetry as Observability & Telemetry
- participant Storage as Wiki Storage
-
- CLI->>Orch: Trigger execution / provision workspace
- Orch->>Traversal: Delegate scanning & structural analysis
- Traversal->>Traversal: Apply path filters & size constraints
- Traversal-->>Orch: Return filtered paths & metadata
- Orch->>Extractor: Delegate artifact analysis
- Extractor->>AI: Request pattern resolution / narrative generation (on-demand)
- AI-->>Extractor: Return processed findings
- Extractor->>Telemetry: Log processing metrics & outcomes
- Extractor-->>Orch: Return structured analysis notes
- Orch->>Aggregator: Delegate content synthesis
- Aggregator->>AI: Request section-level synthesis
- AI-->>Aggregator: Return synthesized markdown
- Aggregator->>Storage: Write documentation sections
- Aggregator-->>Orch: Return aggregation statistics
- Orch->>Deriver: Delegate supplementary content generation
- Deriver->>AI: Request derivative synthesis
- AI-->>Deriver: Return derivative documentation
- Deriver->>Storage: Write derivative artifacts
- Deriver-->>Orch: Confirm completion
- Orch->>Orch: Consolidate metrics & generate execution summary
- Orch-->>CLI: Report pipeline health & output readiness
-```
+ autonumber
+ participant CLI
+ participant Orchestrator
+ participant Config
+ participant LLMProvider
+ participant ImportGraph
+ participant SpecDispatch
+ participant Cache
+ participant FilesystemLayout
-**Key Observations:**
-- The orchestrator acts as the central coordinator, delegating execution to specialized components in a strict sequence while maintaining a single source of truth for pipeline health.
-- All external dependencies are routed through standardized contracts, isolating core business logic from provider-specific implementations and enabling swappable analytical backends.
-- Observability and telemetry are integrated directly into the extraction stage to monitor processing metrics and record analysis outcomes in real time.
-- **Known Gaps:** The integration contracts do not specify exact data schemas or serialization formats for inter-module handoffs. Error handling, retry policies, fallback mechanisms for external service degradation, authentication/rate-limiting constraints, and versioning guarantees between pipeline stages remain undefined and require clarification in implementation documentation.
+ CLI->>Config: load settings and feature flags
+ CLI->>Orchestrator: walk command
+ Orchestrator->>LLMProvider: Stage 1 — scope classification
+ LLMProvider-->>Orchestrator: IntrospectionResult
+ Orchestrator->>FilesystemLayout: initialise layout
+
+ loop per in-scope file
+ Orchestrator->>ImportGraph: fetch file neighbours
+ ImportGraph-->>Orchestrator: neighbour paths
+ Orchestrator->>Cache: lookup by content fingerprint
+ alt cache hit
+ Cache-->>Orchestrator: FileFindings cached
+ else cache miss
+ Orchestrator->>SpecDispatch: route by FileKind
+ alt recognised kind
+ SpecDispatch-->>Orchestrator: SpecializedFindings
+ else general path
+ SpecDispatch->>LLMProvider: Stage 2 — extraction
+ LLMProvider-->>SpecDispatch: SectionFindings
+ SpecDispatch-->>Orchestrator: FileFindings
+ end
+ Orchestrator->>Cache: store findings
+ end
+ Orchestrator->>FilesystemLayout: append notes per section
+ end
+
+ loop per primary section
+ Orchestrator->>Cache: lookup by notes-payload hash
+ alt cache hit
+ Cache-->>Orchestrator: rendered section body
+ else cache miss
+ Orchestrator->>LLMProvider: Stage 3 — aggregation
+ LLMProvider-->>Orchestrator: EvidenceBundle
+ Orchestrator->>FilesystemLayout: write section markdown
+ Orchestrator->>Cache: store aggregated section
+ end
+ end
+
+ loop per derivative section in topological order
+ Orchestrator->>FilesystemLayout: read upstream section bodies
+ Orchestrator->>LLMProvider: Stage 4 — derivation
+ LLMProvider-->>Orchestrator: section body
+ Orchestrator->>FilesystemLayout: write section markdown
+ opt quality review enabled
+ Orchestrator->>LLMProvider: critique
+ LLMProvider-->>Orchestrator: Critique with score
+ alt score below revision threshold
+ Orchestrator->>LLMProvider: revise
+ LLMProvider-->>Orchestrator: revised body
+ Orchestrator->>FilesystemLayout: overwrite section markdown
+ end
+ end
+ end
+
+ Orchestrator-->>CLI: WalkReport
+ Note over CLI,FilesystemLayout: chat and report subcommands read finished wiki via FilesystemLayout
+```
diff --git a/.wikifi/domains.md b/.wikifi/domains.md
index d5de393..0330a14 100644
--- a/.wikifi/domains.md
+++ b/.wikifi/domains.md
@@ -1,35 +1,56 @@
 # Domains and Subdomains
-### Core Domain: Automated Knowledge Translation
-The system operates within a single core domain focused on transforming raw technical artifacts into structured, business-readable documentation. This domain treats source repositories as unstructured knowledge sources that require systematic discovery, semantic translation, and narrative synthesis. All processing is deliberately decoupled from implementation specifics, ensuring that technical constructs are consistently mapped to domain-agnostic business concepts.
+## Core Domain -### Bounded Contexts & Subdomains -The core domain is partitioned into five bounded contexts, each with distinct responsibilities and clear boundaries: +The system's core domain is **automated documentation synthesis**: ingesting an arbitrary source repository and producing a structured, intent-bearing wiki that describes the codebase in technology-agnostic terms. The central concern is not the mechanics of reading files, but the act of surfacing *business intent* — distinguishing what a system does from the accidental details of how it is implemented. -| Subdomain | Primary Responsibility | DDD Classification | +## Subdomains + +### Repository Introspection +Before any analysis begins, the system must decide which parts of a repository carry production intent and which represent infrastructure, tooling, or generated artefacts. This subdomain owns that classification decision. A defining constraint is **tech-agnosticism**: the introspection logic must not rely on recognising specific languages, frameworks, or conventions, so that it generalises across any codebase. + +### Knowledge Extraction +Once relevant files are identified, this subdomain is responsible for extracting structured, intent-bearing findings from each one. It encompasses file classification, content chunking, querying an inference backend for structured observations, and persisting those observations with precise citations for downstream use. The output of this subdomain is the raw evidential record. + +### Section Synthesis +The documentation produced by the system is split along a clear dependency boundary: + +| Subdomain tier | Description | Pipeline position | |---|---|---| -| **Repository Introspection & Curation** | Discovers project structure, classifies artifacts, filters irrelevant content, and establishes workspace boundaries. 
| Supporting | -| **Semantic Extraction & Analysis** | Processes individual artifacts to translate technical patterns into structured knowledge units. Leverages external analytical services for complex pattern recognition. | Core | -| **Information Aggregation & Synthesis** | Consumes extracted knowledge units, resolves redundancies, aligns terminology, and composes coherent section-level documentation. | Core | -| **Pipeline Orchestration & Lifecycle Management** | Governs sequential stage execution, manages reporting, coordinates output derivation, and controls the documentation workspace lifecycle. | Supporting | -| **External Intelligence Integration** | Abstracts communication with generative analysis services. Standardizes request formulation and response consumption, decoupling core logic from provider implementations. | Generalized | - -### Context Relationships & Data Flow -The subdomains form a strict, stage-gated dependency chain. Data flows unidirectionally through the pipeline: - -1. **Introspection → Extraction**: Curated artifact lists and structural metadata are passed to the extraction context. -2. **Extraction → Aggregation**: Structured knowledge units and intermediate analysis results are consumed for section-level synthesis. -3. **Aggregation → Orchestration**: Synthesized content is handed off for final artifact derivation, workspace population, and lifecycle closure. - -External Intelligence Integration operates as a cross-cutting capability within the Extraction context. It is invoked on-demand to resolve ambiguous technical patterns or generate analytical narratives, but does not dictate pipeline progression. - -### State Management & Persistence -Intermediate analysis results are explicitly persisted between pipeline stages. This design supports: -- **Incremental Processing**: Only modified or newly discovered artifacts trigger re-analysis. 
-- **Auditability**: Each transformation step is traceable, preserving the lineage from raw artifact to final documentation. -- **Fault Tolerance**: Pipeline stages can resume from the last persisted state without requiring full re-execution. - -### Modeling Gaps & Observations -- **Error & Conflict Resolution**: The notes emphasize a linear, deterministic flow but provide limited detail on how conflicting domain interpretations are resolved during synthesis, or how pipeline failures trigger rollback or recovery. -- **Orchestration vs. Workspace Boundaries**: Responsibilities for pipeline execution and workspace lifecycle management appear overlapping. Future modeling may benefit from separating execution coordination from directory/configuration governance. -- **Provider Abstraction Depth**: While external intelligence is abstracted, the notes do not specify how fallback mechanisms or service degradation are handled when analytical responses are incomplete or malformed. +| **Primary sections** | Built from per-file evidence produced by the extraction subdomain | Stages 2–3 | +| **Derivative sections** | Synthesised by aggregating across all primary-section findings | Stage 4 | + +This ordering is a first-class design constraint: derivative sections cannot be produced until all primary evidence is available. The boundary between the two tiers is enforced structurally, not merely by convention. + +### Artefact Persistence +Two distinct storage concerns are separated within the system. *Committed wiki content* — the section markdown files that are versioned alongside the target project — is kept apart from *local working state*, which includes per-file extraction notes and a content-addressed cache. The persistence subdomain owns this boundary and ensures that working state is never accidentally treated as part of the published record. 
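A minimal sketch of how the boundary between committed wiki content and local working state could be expressed. The `.wikifi/`, `.notes/`, and `.cache/` names follow the paths visible in this diff. The class name `LayoutSketch`, its methods, and the rule that everything else under `.wikifi/` counts as committed are illustrative assumptions, not the actual `wikifi/wiki.py` API.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Directory names mirror .wikifi/.gitignore in this diff; the rest is hypothetical.
WORKING_STATE_DIRS = (".notes", ".cache")


@dataclass(frozen=True)
class LayoutSketch:
    root: PurePosixPath

    @property
    def wiki_dir(self) -> PurePosixPath:
        return self.root / ".wikifi"

    def section_path(self, section_id: str) -> PurePosixPath:
        # Committed wiki content: one markdown file per section.
        return self.wiki_dir / f"{section_id}.md"

    def is_working_state(self, path: PurePosixPath) -> bool:
        # True when the path lives under a git-ignored working-state directory.
        try:
            relative = path.relative_to(self.wiki_dir)
        except ValueError:
            return False  # outside the wiki directory entirely
        return bool(relative.parts) and relative.parts[0] in WORKING_STATE_DIRS
```

Keeping the check in one place means a publishing step could refuse any path for which `is_working_state` returns true, rather than relying on the gitignore alone.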
+ +## Subdomain Relationships + +The subdomains form a directed dependency chain: + +``` +Repository Introspection + ↓ + Knowledge Extraction → Artefact Persistence (working state) + ↓ + Section Synthesis + ↓ + Artefact Persistence (committed wiki content) +``` + +No subdomain reaches backwards in this chain; the pipeline ordering is the authoritative expression of inter-subdomain dependency. + +## Supporting claims +- The core domain is automated documentation synthesis: extracting business intent from a source repository and producing a technology-agnostic wiki. [1][2] +- The repository introspection subdomain decides which parts of a codebase encode business intent versus infrastructure or tooling. [2] +- A defining constraint of repository introspection is tech-agnosticism — the analysis must not depend on recognising any specific language or framework. [2] +- The knowledge extraction subdomain covers file classification, content chunking, inference-backend querying, and persisting findings with citations. [1] +- Documentation sections are divided into primary sections (built from per-file evidence) and derivative sections (synthesised from aggregates of primary sections), with the ordering enforced as a structural constraint. [3] +- Two distinct storage concerns exist: committed wiki content and local working state (extraction notes and cache); the persistence subdomain enforces this boundary. [4] + +## Sources +1. `wikifi/extractor.py` +2. `wikifi/introspection.py:19-44` +3. `wikifi/sections.py:1-19` +4. `wikifi/wiki.py:1-50` diff --git a/.wikifi/entities.md b/.wikifi/entities.md index ef70396..016420b 100644 --- a/.wikifi/entities.md +++ b/.wikifi/entities.md @@ -1,35 +1,217 @@ # Core Entities -The documentation generation pipeline relies on a set of core domain entities that manage configuration, source analysis, content extraction, and final output assembly. 
These entities are organized to enforce consistent processing boundaries, track intermediate findings, and produce structured documentation artifacts. +The system's information model spans six concern areas: wiki structure, evidence tracing, extraction and aggregation, repository analysis, caching, and pipeline orchestration. The entities below are described domain-first; implementation details such as storage format are noted only where they affect the entity's invariants. -### Configuration & Processing Boundaries -The system uses a hierarchical configuration model to define how source repositories are scanned and processed. A base settings container manages default values, which can be overridden by local configuration files to ensure environment-specific customization. Scanning and traversal configurations establish the root directory, path inclusion/exclusion filters, and file size constraints. These boundaries ensure that only relevant source files are processed while preventing resource exhaustion from oversized or irrelevant directories. +--- -### Analysis & Introspection -Before content generation, the system performs structural and semantic analysis of the target repository. Directory summaries capture aggregate statistics, including file counts, total size, extension distribution, and the presence of key manifest or documentation files. An introspection assessment synthesizes this structural data to identify primary languages, infer the system's overarching purpose, and document a classification rationale. This assessment respects the previously defined path filters and serves as the foundation for targeted content extraction. +## Wiki Structure -### Extraction & Intermediate Records -During source analysis, intermediate findings are captured as timestamped extraction notes. Each note functions as a structured record that links a specific file reference to a role summary and the extracted finding. 
These records preserve the context of individual source files and serve as the raw material for downstream aggregation. The system maintains a chronological log of these notes to ensure traceability throughout the pipeline. +A **Section** is the fundamental organisational unit of the generated wiki. It carries: -### Aggregation & Output Structure -Extracted notes are consolidated into categorized documentation sections. Each section acts as a logical container for generated content, ultimately producing a final markdown body. Aggregation statistics track the success rate of section writes and explicitly flag empty sections to highlight coverage gaps. The final output adheres to a predefined workspace layout that organizes configuration files, intermediate notes, and final section artifacts into a consistent, navigable directory hierarchy. +| Field | Description | +|---|---| +| Unique identifier | Stable key used throughout the pipeline | +| Title | Human-readable heading | +| Brief | Prose description of what belongs in the section | +| Tier | Either *primary* (populated from per-file evidence) or *derivative* (synthesised from primary sections) | +| Upstream list | Ordered tuple of section identifiers this section depends on (derivative sections only) | -### Pipeline Execution & Reporting -A unified execution summary consolidates metrics, findings, and completion status across all processing stages. This entity provides a single source of truth for pipeline health, output readiness, and overall processing efficiency, enabling operators to verify that all stages completed successfully before final delivery. +Derivative sections declare explicit upstream dependencies, forming a directed acyclic graph. The system enforces topological ordering at startup: every section's upstreams must appear earlier in the canonical section list. 
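The startup check described above can be sketched as a single pass over the canonical section list. This is a hedged illustration only: the `Section` field names, the `validate_order` function, and the example section identifiers are assumptions and are not taken from `wikifi/sections.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    # Field names are illustrative stand-ins for the fields in the table above.
    id: str
    title: str
    brief: str
    tier: str = "primary"            # "primary" or "derivative"
    upstream: tuple[str, ...] = ()   # only meaningful for derivative sections


def validate_order(sections: list[Section]) -> None:
    """Raise ValueError if a section names an upstream that has not appeared earlier."""
    seen: set[str] = set()
    for section in sections:
        missing = [u for u in section.upstream if u not in seen]
        if missing:
            raise ValueError(
                f"section {section.id!r} depends on {missing}, which must appear earlier"
            )
        seen.add(section.id)
```

Because the list itself is required to be in dependency order, a cycle can never pass this check: at least one section in any cycle would name an upstream that has not yet been seen.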
-### Entity Fields, Relationships & Invariants -| Entity | Primary Fields | Key Invariants | +A **WikiLayout** anchors all on-disk path resolution to a single project root, exposing named locations for the wiki directory, configuration file, gitignore, notes directory, cache directory, and per-section markdown and notes files. Its existence is a precondition for the conversational query and report commands. + +A **LoadedSection** pairs a Section descriptor with its rendered markdown body, representing one populated section ready for downstream use (such as building a conversational context). + +A **SectionReport** captures the per-section view for reporting: a reference to the Section definition, the count of contributing files, the count of findings, the character length of the written body, an emptiness flag, and an optional quality critique. A **WikiReport** aggregates all SectionReports together with overall coverage statistics and an optional mean quality score across populated sections. + +--- + +## Evidence and Citation Model + +Every factual sentence in the generated wiki is traceable back through a three-layer evidence hierarchy. + +**SourceRef** — the lowest-level pointer. Carries a repo-relative file path, an optional inclusive line range, and a short content fingerprint captured at extraction time. Renders as `path:start–end` or just `path` when no line range is available. + +**Claim** — a single markdown assertion placed in a section's narrative. Backed by zero or more SourceRefs. A claim with no sources is explicitly considered *unsupported*. + +**Contradiction** — groups two or more conflicting Claims under a one-sentence summary of the conflict; each conflicting position retains its own SourceRefs. + +**EvidenceBundle** — the aggregator's structured handoff to the renderer for one section: the markdown narrative body, the ordered list of Claims, and the list of Contradictions. 
+ +During the language-model aggregation pass, an intermediate form is used: an **AggregatedClaim** pairs a prose assertion with 1-based indices into the input notes (rather than resolved file paths), and an **AggregatedContradiction** wraps a one-sentence summary around multiple such indexed positions. These are resolved into full SourceRefs and Claims before the EvidenceBundle is assembled. + +--- + +## Extraction Layer + +**IntrospectionResult** captures the Stage 1 decision: include patterns, exclude patterns, a one-paragraph hypothesis about the system's purpose, an informational list of primary technologies detected, and the rationale for the filtering choices. + +**SectionFinding** is the atomic extraction unit from one source file for one section. Fields: +- Target section identifier +- Technology-agnostic markdown description (one to five sentences) +- Optional inclusive line range within the source chunk + +**FileFindings** groups all SectionFindings produced for a single file, together with a one-sentence summary of that file's role. It is the unit exchanged between an extraction call and the notes store. + +Specialised extractors — handling schema definition languages, API contracts, and data-definition files — produce **SpecializedFindings** rather than relying on general LLM inference. Each carries a section identifier, finding text, and one or more source references. Multiple SpecializedFindings are collected into a **SpecializedResult**, which additionally carries an optional summary string. + +For data-definition schema files, an intermediate **table record** is derived first (table name, source line, raw body, column list, and foreign-key edges expressed as local-column → referenced-table.referenced-column tuples). All downstream entity and relationship findings are derived from this intermediate form. 
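One way to picture the intermediate table record and a finding derived from it is sketched below. The `TableRecord` class, its field names, the `relationship_findings` helper, and the example table are all hypothetical; the real structure lives in `wikifi/specialized/sql.py` and may differ. No SQL parsing is attempted here.

```python
from dataclasses import dataclass
from typing import Tuple

# (local column, referenced table, referenced column)
ForeignKey = Tuple[str, str, str]


@dataclass(frozen=True)
class TableRecord:
    name: str
    source_line: int
    raw_body: str
    columns: Tuple[str, ...]
    foreign_keys: Tuple[ForeignKey, ...] = ()


def relationship_findings(table: TableRecord) -> list[str]:
    """Turn each foreign-key edge into one plain-language relationship statement."""
    return [
        f"{table.name}.{local} references {ref_table}.{ref_column}"
        for local, ref_table, ref_column in table.foreign_keys
    ]
```

Deriving findings from a parsed record rather than from raw text keeps the entity and relationship outputs consistent with each other, since both read the same columns and edges.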
+ +Domain object types from API schema files (those that are not root operation types) are surfaced directly as domain entity findings, grouped by their namespace, with closed value sets (enumerations) and shared shape contracts (interfaces, input types) captured as separate finding categories. + +**ExtractionStats** accumulates per-run metrics: files seen, files with findings, total findings, skipped files, chunks processed, cache hits, files routed to specialised extractors, and a breakdown by file kind. + +--- + +## Repository Analysis Entities + +**FileKind** is a fixed enumeration of seven structural categories: application code, SQL, OpenAPI contract, protocol definition, GraphQL schema, migration script, and other. The classification drives routing to the appropriate extractor. + +**GraphNode** represents one file's position in the cross-file import graph: its repo-relative path, the ordered set of files it imports, and the ordered set of files that import it. It exposes a combined neighbour list capped at a configurable limit for use in prompt enrichment. + +**RepoGraph** is the complete repository-level import graph, keyed by repo-relative file path, providing lookup of individual GraphNodes and neighbour path lists. + +**DirSummary** is a value object for a single non-recursive directory: its repo-relative path, file count, total byte size, a frequency map of the top-10 file extensions, and a tuple of notable filenames (manifests, readmes). 
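The capped neighbour list exposed by a GraphNode might look like the following sketch. The ordering (imports first, then importers), the de-duplication, and the default limit of 8 are assumptions made for illustration; `wikifi/repograph.py` may make different choices.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GraphNode:
    path: str                            # repo-relative path of this file
    imports: Tuple[str, ...] = ()        # files this file imports
    imported_by: Tuple[str, ...] = ()    # files that import this file

    def neighbours(self, limit: int = 8) -> list[str]:
        """Return imports followed by importers, de-duplicated and capped at `limit`."""
        combined: list[str] = []
        for candidate in (*self.imports, *self.imported_by):
            if candidate not in combined:
                combined.append(candidate)
        return combined[:limit]
```

Capping the list keeps the prompt enrichment bounded even for files that are imported almost everywhere, such as shared configuration modules.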
+ +--- + +## Caching Entities + +| Entity | Cache key | Stored payload | |---|---|---| -| Configuration | Default settings, local overrides | Local overrides always take precedence over environment defaults | -| Scan/Traversal Config | Root path, inclusion/exclusion patterns, size thresholds | Processing never exceeds defined size constraints or traverses excluded paths | -| Directory Summary | File count, total size, extension distribution, manifest presence | Statistics reflect only files within allowed traversal boundaries | -| Introspection Assessment | Primary languages, inferred purpose, classification rationale | Assessment is derived strictly from directory summaries and path filters | -| Extraction Note | Timestamp, file reference, role summary, extracted finding | Each note is immutable once created and tied to a single source file | -| Documentation Section | Category, aggregated content, final markdown body | Sections are generated only after successful note aggregation | -| Aggregation Stats | Successful writes, empty section count | Stats are updated atomically after each section generation attempt | -| Workspace Layout | Paths for config, notes, sections | Directory structure remains consistent across pipeline runs | -| Execution Summary | Stage metrics, completion status, consolidated findings | Summary is generated only after all pipeline stages report completion | - -**Relationships:** Configuration entities dictate the boundaries for analysis entities. Analysis outputs feed directly into extraction notes, which are then grouped and transformed into documentation sections. Aggregation statistics and the execution summary operate as cross-cutting observers, tracking the health and output of the entire flow. - -**Known Gaps:** The exact mapping rules between intermediate extraction notes and final documentation sections are implied by the aggregation process but not explicitly detailed in the available notes. 
Further specification may be needed to define how notes are grouped, prioritized, or filtered during section assembly, as well as how empty sections are resolved or reported upstream. +| **CachedFindings** | Content fingerprint of the source file | Findings list, one-sentence file summary, chunk count | +| **CachedSection** | Hash of the notes payload | Rendered markdown body, claims list, contradictions list | + +**WalkCache** is the in-memory aggregate of both caches. It tracks four counters — extraction hits, extraction misses, aggregation hits, aggregation misses — supporting efficiency reporting across a full pipeline run. + +--- + +## Quality-Review Entities + +A **Critique** captures the quality assessment of one section: +- Integer score (0–10) +- Short overall judgment +- List of unsupported claims +- List of gaps relative to the section brief +- List of concrete revision suggestions + +A **ReviewOutcome** tracks the lifecycle of a single section review: the section identifier, the initial Critique, the current body text, a boolean flag indicating whether a revision was applied, and an optional follow-up Critique produced after revision. + +A **WikiQualityReport** aggregates the full-wiki audit: an overall numeric score, a mapping from section identifiers to individual Critiques, and optional **CoverageStats** (total files, files with findings, and per-section finding and file counts). + +--- + +## Pipeline Orchestration Entities + +**WalkConfig** encapsulates the parameters for file traversal. Notes from two different pipeline layers describe it somewhat differently (see Conflicts below), but the agreed-upon core fields are: repository root, file-size limits, minimum content thresholds, and extra exclusion patterns. It is treated as immutable once constructed. + +**Notes records** are the ephemeral per-section extraction state persisted during a walk. Each record carries a UTC timestamp and arbitrary key-value metadata. 
Records for a section are accumulated in insertion order. + +**WalkReport** is the primary return value from a complete pipeline run. It carries the IntrospectionResult, ExtractionStats, AggregationStats, DerivationStats, the WalkCache state, and the RepoGraph. + +**AggregationStats** tracks three counters for a single aggregation pass: sections written fresh, sections skipped due to empty notes, and sections served from cache. + +**DerivationStats** accumulates pipeline metrics for the derivation stage: count of sections derived, skipped, and revised, plus the full list of ReviewOutcomes as an audit trail. + +--- + +## Interaction Entities + +A **ChatMessage** carries a role identifier and a content string, representing one turn in a multi-turn exchange. + +A **ChatSession** holds a reference to the language-model provider, the frozen system prompt built from populated wiki sections, and the accumulated conversation history (an ordered list of ChatMessages). It supports appending user and assistant turns and clearing the history while retaining the wiki context. + +--- + +## Configuration and Provider Entities + +**Settings** captures all runtime knobs for a wiki-generation run: provider and model identity, inference endpoint, request timeout, file-size and chunk thresholds, pipeline feature flags (caching, graph building, specialised extractors, review loop), the quality threshold that triggers revision, and provider-specific credentials and token caps. + +An **LLMProvider** carries a provider name and a specific model variant. It is the sole point of contact between the pipeline and any language-model backend, exposing exactly three interaction modes used throughout the system. + +## Supporting claims +- A Section carries a unique identifier, human-readable title, prose brief, a tier (primary or derivative), and an ordered tuple of upstream section identifiers. 
[1] +- Derivative sections declare explicit upstream dependencies forming a directed acyclic graph enforced by topological ordering at startup. [2][1] +- WikiLayout anchors all on-disk path resolution to a single project root, exposing named locations for wiki, config, gitignore, notes, cache, and per-section files; its existence is a precondition for the chat and report commands. [3][4] +- A LoadedSection pairs a Section descriptor with its rendered markdown body. [5] +- A SectionReport carries the section definition reference, contributing file count, findings count, body character length, emptiness flag, and an optional quality critique. [6] +- A WikiReport aggregates all SectionReports, overall coverage statistics, and an optional mean quality score. [7] +- A SourceRef holds a repo-relative file path, an optional inclusive line range, and a short content fingerprint captured at extraction time. [8][9] +- A Claim is a single markdown assertion backed by zero or more SourceRefs; a claim with no sources is explicitly considered unsupported. [8][10] +- A Contradiction groups two or more conflicting Claims under a one-sentence summary, each position retaining its own SourceRefs. [10] +- An EvidenceBundle is the aggregator's structured handoff to the renderer: markdown body, ordered Claims list, and Contradictions list. [8][11] +- During the aggregation pass an AggregatedClaim pairs a prose assertion with 1-based input-note indices, and an AggregatedContradiction wraps a one-sentence summary around multiple such indexed positions; these are resolved into SourceRefs before the EvidenceBundle is assembled. [12] +- IntrospectionResult captures include/exclude patterns, a purpose hypothesis, an informational language list, and the filtering rationale. [13] +- A SectionFinding carries a target section identifier, a technology-agnostic markdown description of one to five sentences, and an optional line range. 
[14] +- A FileFindings groups all SectionFindings for one file plus a one-sentence file-role summary, and is the unit exchanged between the extraction call and the notes store. [15] +- A SpecializedFinding carries a section identifier, finding text, and one or more source references; multiple SpecializedFindings are collected into a SpecializedResult that also carries an optional summary string. [16][17] +- For data-definition schema files an intermediate table record is derived first (name, source line, raw body, column list, and foreign-key edges) and all downstream findings are derived from it. [18] +- Domain object types from schema files are surfaced as domain entity findings; closed value sets and shared shape contracts are captured as separate finding categories; a maximum of 25 items per category are rendered with elision noted. [19][20][21] +- ExtractionStats accumulates: files seen, files with findings, total findings, skipped files, chunks processed, cache hits, specialised-extractor files, and a file-kind breakdown. [22] +- FileKind is a fixed enumeration of seven structural categories: application code, SQL, OpenAPI, Protobuf, GraphQL, migration, and other; it drives routing to the appropriate extractor. [23] +- A GraphNode carries its repo-relative path, the ordered set of files it imports, and the ordered set of files that import it, with a configurable cap on the combined neighbour list. [24] +- A RepoGraph is the complete repository import graph keyed by repo-relative path. [25] +- A DirSummary holds a directory's path, file count, total byte size, top-10 extension frequency map, and a tuple of notable filenames. [26] +- CachedFindings stores a content fingerprint, findings list, one-sentence summary, and chunk count; CachedSection stores a notes-payload hash, rendered markdown body, claims list, and contradictions list. 
[27][28] +- WalkCache aggregates both caches and tracks four counters (extraction hits/misses, aggregation hits/misses) for efficiency reporting. [29] +- A Critique carries an integer score (0–10), a short judgment, a list of unsupported claims, a list of brief gaps, and a list of revision suggestions. [30] +- A ReviewOutcome tracks the section identifier, initial critique, current body, revision-applied flag, and optional follow-up critique. [31] +- A WikiQualityReport carries an overall numeric score, a mapping from section identifiers to individual Critiques, and optional CoverageStats (total files, files with findings, per-section finding and file counts). [32] +- WalkReport is the primary return value from a full pipeline run, carrying IntrospectionResult, ExtractionStats, AggregationStats, DerivationStats, WalkCache state, and RepoGraph. [33] +- AggregationStats tracks sections written fresh, skipped due to empty notes, and served from cache. [34] +- DerivationStats accumulates sections derived, skipped, and revised counts, plus the full list of ReviewOutcomes. [35] +- Notes records carry a UTC timestamp and arbitrary key-value metadata, stored per section in insertion order. [36] +- A ChatMessage carries a role identifier and a content string representing one turn in a multi-turn exchange. [37] +- A ChatSession holds an LLM provider reference, a frozen system prompt built from wiki sections, and an ordered conversation history; it supports appending turns and clearing history while retaining context. [38] +- Settings captures provider and model identity, inference endpoint, timeout, file-size and chunk thresholds, pipeline feature flags, revision quality threshold, and provider-specific credentials and token caps. [39] +- An LLMProvider carries a provider name and model variant and is the sole point of contact between the pipeline and any language-model backend. [37] + +## Conflicts in source +_The walker found disagreements across files. 
Migration teams should resolve these before re-implementation._ + +- **Two sources describe a 'WalkConfig' entity with partially different field sets, suggesting either two distinct same-named entities at different pipeline layers or a single entity incompletely described in each source.** + - WalkConfig (orchestrator layer) encapsulates repository root, byte-size limits, minimum content thresholds, and an optional introspection-derived exclusion list; it is constructed twice per run — once before introspection and once after with the exclusion list populated. (`wikifi/orchestrator.py:83-101`) + - WalkConfig (filesystem-walker layer) encapsulates repository root, extra exclusion patterns beyond defaults, a flag for honouring gitignore rules, maximum file size in bytes, and minimum stripped-content size in bytes; it is immutable once constructed. (`wikifi/walker.py:61-79`) + +## Sources +1. `wikifi/sections.py:30-40` +2. `wikifi/deriver.py:112-116` +3. `wikifi/cli.py:172-183` +4. `wikifi/wiki.py:55-80` +5. `wikifi/chat.py:42-45` +6. `wikifi/report.py:29-36` +7. `wikifi/report.py:39-44` +8. `wikifi/aggregator.py:166-186` +9. `wikifi/evidence.py:35-55` +10. `wikifi/evidence.py:57-80` +11. `wikifi/evidence.py:82-87` +12. `wikifi/aggregator.py:74-101` +13. `wikifi/introspection.py:47-64` +14. `wikifi/extractor.py:113-125` +15. `wikifi/extractor.py:128-131` +16. `wikifi/specialized/models.py:19-22` +17. `wikifi/specialized/models.py:25-27` +18. `wikifi/specialized/sql.py:50-58` +19. `wikifi/specialized/graphql.py:56-95` +20. `wikifi/specialized/openapi.py:105-116` +21. `wikifi/specialized/protobuf.py:42-60` +22. `wikifi/extractor.py:134-142` +23. `wikifi/repograph.py:43-56` +24. `wikifi/repograph.py:143-162` +25. `wikifi/repograph.py:165-177` +26. `wikifi/walker.py:144-153` +27. `wikifi/cache.py:60-66` +28. `wikifi/cache.py:69-74` +29. `wikifi/cache.py:77-88` +30. `wikifi/critic.py:67-84` +31. `wikifi/critic.py:91-96` +32. `wikifi/critic.py:99-114` +33. 
`wikifi/orchestrator.py:60-70` +34. `wikifi/aggregator.py:103-107` +35. `wikifi/deriver.py:57-62` +36. `wikifi/wiki.py:136-152` +37. `wikifi/providers/base.py:33-52` +38. `wikifi/chat.py:46-57` +39. `wikifi/config.py:46-155` +40. `wikifi/orchestrator.py:83-101` +41. `wikifi/walker.py:61-79` diff --git a/.wikifi/external_dependencies.md b/.wikifi/external_dependencies.md index c10527e..2655643 100644 --- a/.wikifi/external_dependencies.md +++ b/.wikifi/external_dependencies.md @@ -1,29 +1,61 @@ # External-System Dependencies -The system relies on a set of external services and infrastructure components that enable source code ingestion, semantic analysis, and structured documentation generation. These dependencies are abstracted to support interchangeable implementations while maintaining consistent operational roles. - -### AI Inference Engine -The primary external dependency is an AI inference service, which may be provisioned as a third-party API or a locally hosted instance. This engine provides the cognitive layer required for: -- Semantic analysis and intent extraction from raw source code -- Interpretation of code structure and abstraction of business domains -- Transformation of technical evidence into formal specifications, structured narratives, and architectural artifacts -- Generation of both structured data and unstructured explanatory text based on system prompts - -The system abstracts the deployment model of this layer, allowing it to operate against either cloud-hosted endpoints or local inference servers without altering core workflows. - -### Supporting Infrastructure & Standards -Beyond the inference engine, the system depends on several foundational services and standards to ensure reliable operation and output consistency: - -- **Host File System:** Direct read access is required to ingest source files and gather the raw technical evidence processed by the extraction engine. 
-- **Data Validation Framework:** A structured validation layer verifies output integrity, ensuring that generated artifacts conform to expected schemas before delivery. -- **Documentation & Diagramming Standards:** The system relies on standardized markup and diagram syntaxes to guarantee consistent rendering and interoperability across downstream consumption platforms. -- **Repository Filtering Logic:** Pattern-matching utilities aligned with standard version control ignore semantics are used to safely exclude irrelevant directories, build artifacts, and configuration files during traversal. - -### Dependency Summary -| Dependency | Role in System | -|---|---| -| AI Inference Service | Semantic analysis, intent extraction, content generation, and domain abstraction | -| Host File System | Source code ingestion and raw evidence collection | -| Data Validation Framework | Output integrity verification and schema enforcement | -| Standardized Markup/Diagram Syntaxes | Cross-platform rendering consistency and interoperability | -| VCS Ignore Pattern Logic | Safe repository traversal and artifact filtering | +The system depends on external services in two areas: the **language-model inference layer** that drives all AI analysis, and a set of **tooling integrations** used for development support and runtime enrichment. + +### Language-Model Inference + +Three mutually exclusive inference backends are supported; exactly one is active per deployment. 
+ +| Backend | Role | Authentication | Key Configuration | +|---|---|---|---| +| Self-hosted local inference service | Default LLM backend; serves models over HTTP on the local network | None required | Configurable host address and request timeout | +| Anthropic's hosted inference API | Opt-in cloud backend for high-capability extraction | API key (environment variable) | Configurable output-token cap and HTTP timeout to manage long-running calls; supports adaptive reasoning depth | +| OpenAI-compatible hosted inference API | Opt-in cloud backend for structured decoding, completion, and chat | API key | Configurable base URL, enabling compatible proxy or alternate deployment targets | + +The self-hosted local service is the default and the zero-friction starting point for new users — it requires no credentials and no cloud account. The two hosted cloud services are opt-in alternatives that require API keys and expose additional parameters for latency and cost control. + +The local backend supports reasoning-capable model variants that trade increased latency for greater analytical depth; this extended-reasoning mode is also available on the hosted cloud backends. + +The Anthropic-backed path operates in a single structured-extraction mode. The OpenAI-compatible path supports three distinct usage modes: schema-constrained structured decoding (returning validated domain objects), free-text completion, and multi-turn conversational chat. + +### Development and Runtime Tooling Integrations + +Several additional services are configured in the tooling layer: + +- **Self-hosted web-crawling service** — runs locally on a fixed port with no external credentials required. Provides on-demand web-crawling capability, used to gather source material. +- **Google's hosted AI/generative API** — authenticated via a dedicated API key; consumed by at least two registered tool integrations. 
+- **External documentation context service** — called over HTTP using a dedicated API key; enriches prompts or retrieves up-to-date reference documentation at runtime. +- **Google-hosted orchestration service** — an HTTP service authenticated via the same Google API key; its exact role is not fully specified in available sources but is likely related to data composition or workflow orchestration. + +### Soft Dependency: Structured-Data Parsing + +When processing structured API contract files, the system can optionally leverage an external YAML parsing library for full format support. If that library is absent, an internal minimal parser serves as a fallback, covering the specific fields the system requires. This is a soft rather than hard dependency — the system remains functional without it. + +## Supporting claims +- Three mutually exclusive LLM inference backends are supported: a self-hosted local service, Anthropic's hosted API, and an OpenAI-compatible hosted API. [1][2][3][4][5] +- The self-hosted local inference service is the default backend, requires no API key, and connects over a configurable HTTP endpoint. [1][2][5][6] +- The self-hosted local service supports reasoning-capable model variants that trade latency for greater analytical depth. [1][6] +- Anthropic's hosted inference API is an opt-in backend authenticated via an environment-variable API key, with a configurable output-token cap and HTTP timeout to manage long-running inference calls. [3][5][7] +- Anthropic's hosted backend supports an adaptive reasoning depth mode. [3][7] +- The OpenAI-compatible hosted API is an opt-in backend authenticated via API key, with a configurable base URL enabling use of compatible proxy deployments. [4][5][8] +- The OpenAI-compatible backend supports three usage modes: schema-constrained structured decoding, free-text completion, and multi-turn conversational chat. 
[8] +- A self-hosted web-crawling service runs locally on a fixed port and requires no external credentials. [9] +- Google's hosted AI/generative API is consumed by at least two tool integrations, authenticated via a dedicated API key. [10][11] +- An external documentation context service is called over HTTP with a dedicated API key, used to enrich prompts or retrieve reference documentation at runtime. [12] +- A Google-hosted orchestration service is consumed over HTTP, authenticated via the same Google API key; its exact role is not fully specified in available sources. [11] +- An external YAML parsing library is a soft dependency for structured API contract processing; the system falls back to an internal minimal parser when the library is absent. [13] + +## Sources +1. `.env.example:7-14` +2. `wikifi/config.py:53-55` +3. `wikifi/config.py:116-134` +4. `wikifi/config.py:136-151` +5. `wikifi/orchestrator.py:148-200` +6. `wikifi/providers/ollama_provider.py:52` +7. `wikifi/providers/anthropic_provider.py:83-100` +8. `wikifi/providers/openai_provider.py:113-175` +9. `.mcp.json:14-20` +10. `.mcp.json:4-8` +11. `.mcp.json:29-35` +12. `.mcp.json:22-28` +13. `wikifi/specialized/openapi.py:154-162` diff --git a/.wikifi/hard_specifications.md b/.wikifi/hard_specifications.md index 21f362a..a8fa680 100644 --- a/.wikifi/hard_specifications.md +++ b/.wikifi/hard_specifications.md @@ -1,31 +1,188 @@ # Hard Specifications -### Pipeline Execution & Architecture -- **Sequential Processing Order:** The system must execute stages in a strict, non-negotiable sequence: Introspection → Extraction → Aggregation → Derivation. Deviations from this order are prohibited. -- **Single-Provider Constraint:** The current release supports only one designated processing backend. Configuration attempts targeting alternative providers must fail gracefully without interrupting the pipeline. 
-- **Workspace Auto-Provisioning:** The target documentation workspace must be automatically initialized if it does not exist prior to pipeline execution. - -### Input Processing & Data Boundaries -- **Deterministic & Non-Destructive Execution:** All processing stages must operate deterministically and preserve original source integrity. No upstream data may be altered or deleted during transformation. -- **Immutable Exclusion Patterns:** Version control metadata, dependency caches, build artifacts, and tool-specific directories are permanently excluded from traversal. -- **Strict Size Thresholds:** - | Metric | Limit | Handling Behavior | - |---|---|---| - | Maximum file size | 200,000 bytes | Truncated to limit | - | Minimum stripped content | 64 bytes | Files below threshold are ignored | -- **Fault Tolerance:** Invalid or unreadable inputs must be logged and skipped. The pipeline must never halt due to malformed source material. -- **Structural Recognition:** A predefined set of notable manifest and documentation filenames is used exclusively for structural analysis and routing. - -### Content Synthesis & Documentation Standards -- **Technology-Agnostic Translation:** All outputs must strip implementation-specific terminology. Technical observations must be translated into domain-focused, user-facing intent. -- **Narrative Synthesis:** Generated content must form a coherent, structured narrative. Raw transcripts, verbatim note dumps, or unprocessed fragments are prohibited. -- **Behavioral Documentation Structure:** All operational or behavioral descriptions must adhere to a strict `Given/When/Then` format. -- **Visual & Formatting Constraints:** Diagrams must utilize standardized syntax with approved chart types. Output must exclude top-level headings and rely exclusively on appropriate markdown sub-headings, lists, and tables. 
-- **Explicit Gap & Contradiction Reporting:** Missing data, failed derivations, or conflicting upstream evidence must be explicitly declared. The system must preserve original evidence rather than fabricating content or leaving silent blanks. - -### Configuration & Artifact Management -- **Configuration Precedence:** Local configuration files strictly override environment-level variables. This hierarchy is immutable. -- **Intermediate Artifact Isolation:** Temporary processing directories (intermediate notes) must be excluded from version control by default to prevent repository bloat and maintain clean lineage. - -### Output Structure & Compatibility Contract -- **Immutable Directory Schema:** The documentation directory layout functions as a strict backward-compatibility contract. Structural modifications are prohibited, as they will break existing documentation readability and violate compliance expectations. +## Output Integrity + +These rules govern what the system is permitted to emit and are enforced at multiple stages of the pipeline. + +- **Tech-agnostic language.** All synthesised wiki content — both primary sections and derivative sections such as personas, user stories, and diagrams — must be free of specific language, framework, or library names. Every such observation must be translated into domain terms. This constraint applies equally to the aggregator, the reviser, and the deriver. +- **No silent contradiction resolution.** Whenever two source notes make incompatible claims about the same topic, the output must include a `contradictions[]` entry naming each position and the note indices that support it. Suppressing or merging conflicting claims is forbidden. +- **No invented facts.** When evidence is absent, the system must declare the gap explicitly rather than speculating.[2][6] This applies to both primary aggregation and derivative synthesis. 
+- **Derivative sections grounded in upstream content only.** Derivative sections must draw exclusively on the aggregated bodies of the primary sections that precede them in the canonical section ordering; they may not introduce claims not present in those upstream bodies. + +## Evidence and Citation Format + +The citation scheme is a contractual output format. + +- Claims must be rendered with compact footnote-style markers (`[1]`, `[2]`, …) and a **Sources** footer at the bottom of each section. +- Line ranges are formatted as `path/to/file:start-end`; a single-line reference as `path/to/file:line`; an unknown range as `path/to/file` alone. +- Detected contradictions must appear verbatim under a **Conflicts in source** heading with an explicit instruction that migration teams must resolve them before re-implementation. They must not be suppressed. +- Note indices presented to the synthesis stage are 1-based; the internal resolution step subtracts 1 before indexing into the underlying list. This off-by-one invariant must be preserved if the prompting scheme is ever changed. + +## File Processing Thresholds + +| Parameter | Value | Rule | +|---|---|---| +| Maximum file size | 2,000,000 bytes | Files at or above this limit are unconditionally skipped and never read | +| Minimum content size | 64 bytes (stripped) | Files below this threshold are skipped entirely | +| Chunk window | 150,000 bytes | Fixed sliding-window size for splitting large files | +| Chunk overlap | 8,000 bytes | Overlap between adjacent chunks to preserve cross-boundary context | +| Manifest truncation | 20,000 bytes | Manifest files are truncated to this length before inclusion in any prompt | + +Additionally, chunk overlap must satisfy `0 ≤ overlap < chunk_size`, and chunk size must be positive. These inequalities are hard invariants; violating them causes the recursive splitter to fail on edge-case inputs such as whitespace-free monolithic files. 
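The chunk-window rules in the table above can be sketched as follows. This is a minimal illustration, not the project's splitter: the function name `split_into_chunks` is hypothetical, and the sketch uses a plain iterative window rather than the recursive splitter the text mentions. Only the default sizes (150,000 and 8,000 bytes) and the two inequalities come from the specification.

```python
def split_into_chunks(
    data: bytes, chunk_size: int = 150_000, overlap: int = 8_000
) -> list[bytes]:
    """Split data into fixed-size windows that overlap by `overlap` bytes.

    Hypothetical helper: only the defaults and the invariants below are
    taken from the specification; the rest is an assumption.
    """
    # Hard invariants from the spec: positive size, 0 <= overlap < chunk_size.
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    # Small inputs fit in a single window and are returned unsplit.
    if len(data) <= chunk_size:
        return [data]

    # The window advances by chunk_size - overlap, which the checks above
    # guarantee is at least 1, so the loop always terminates.
    step = chunk_size - overlap
    chunks: list[bytes] = []
    start = 0
    while True:
        chunks.append(data[start:start + chunk_size])
        if start + chunk_size >= len(data):
            break
        start += step
    return chunks
```

In this sketch the reason for the invariant is visible directly: if `overlap` were equal to or larger than `chunk_size`, the step would be zero or negative and the window would never advance.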
+ +## Caching Constraints + +- **Aggregation cache key completeness.** The hash used to key an aggregation result must span the file reference, summary, finding text, and the full structured sources list (file path, line range, and fingerprint per source). Omitting any field allows stale citation metadata to be replayed without re-aggregation. +- **Atomic write pattern.** Cache persistence must write to a sibling `.tmp` file and then rename it atomically. A crash during saving must never produce a corrupt cache file. +- **Fingerprint format.** Content fingerprints are defined as the first 12 hexadecimal characters of a SHA-256 digest. This format must be preserved across any migration because it is recorded in cached artefacts and emitted into wiki evidence references. + +## Quality Assurance Rules + +The scoring rubric is fixed and non-negotiable: + +| Score range | Meaning | +|---|---| +| 9–10 | Fully grounded, tech-agnostic, narratively coherent; no unsupported claims | +| 6–8 | Minor issues only | +| 3–5 | Substantial gaps or partial coverage | +| 0–2 | Incoherent or off-brief | + +- The **minimum acceptable score** for publishing a section without revision is **7**. +- A revised body is accepted only if its follow-up critique score is **greater than or equal to** the initial score. Any revision that produces a score regression is discarded and the original body is retained. This invariant must be preserved in any reimplementation. + +## Provider and API Constraints + +### Shared +- The default per-call request timeout is **900 seconds**, chosen to absorb the observed latency of high-effort reasoning on large local models. Reducing this value risks aborting in-progress reasoning traces. +- Three abstract interaction modes — structured completion, text completion, and chat — constitute the **complete and exclusive** contract between the pipeline and any backend. 
No other methods are ever invoked; any conforming implementation must satisfy all three signatures exactly. + +### Hosted-Claude Backend +- Default maximum output is **32,000 tokens** per call. Callers using the highest reasoning effort levels are expected to raise this limit and enable streaming; too low a value causes the model to exhaust the budget on reasoning before producing structured output. +- Sampling parameters (temperature, top-p, top-k) **must not** be sent to the `claude-opus-4-7` model variant; doing so causes a validation error. The provider omits them unconditionally. +- Structured output is obtained via schema-constrained decoding; if the primary parsed result is absent, the implementation falls back to parsing the raw text block as JSON before raising an error. + +### Local-Model Backend +- Disabling the reasoning trace on Qwen3-family models causes them to ignore the JSON schema constraint and emit free text, breaking validation. Reasoning must never be disabled for Qwen3-style models on the structured-output path. The configuration documentation explicitly marks fully-disabled thinking as unsafe for this reason. +- Default per-call timeout is 900 seconds (same rationale as above). + +### OpenAI-Compatible Backend +- Default output cap is **16,000 tokens** per call; default per-call timeout is 900 seconds. +- Reasoning-capable model families (identified by the prefixes `o` or `gpt-5`) must receive `max_completion_tokens` instead of `max_tokens`, and may receive a `reasoning_effort` value of `low`, `medium`, or `high`. Non-reasoning models must **not** receive `reasoning_effort` to avoid API validation errors. +- When the structured-output parse path returns no parsed object (due to a refusal or truncation), the implementation must fall back to validating raw JSON text against the schema, not return a null silently. 
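The token-parameter rule for the OpenAI-compatible backend might be expressed roughly as below. The helper `build_token_params` and the constant names are hypothetical and are not taken from the provider module. The sketch applies the stated prefixes (`o`, `gpt-5`) literally, so a real implementation may well match model families more narrowly. The request field names `max_tokens`, `max_completion_tokens`, and `reasoning_effort` are the ones named in the rule itself.

```python
from typing import Optional

# Assumption: prefixes applied literally as written in the specification.
REASONING_PREFIXES = ("o", "gpt-5")
ALLOWED_EFFORTS = {"low", "medium", "high"}


def build_token_params(
    model: str,
    max_output_tokens: int = 16_000,
    reasoning_effort: Optional[str] = None,
) -> dict:
    """Return the token-related request fields for a given model name.

    Hypothetical helper illustrating the rule, not the provider's code.
    """
    if model.startswith(REASONING_PREFIXES):
        # Reasoning-capable families take max_completion_tokens and may
        # optionally carry a reasoning_effort value.
        params: dict = {"max_completion_tokens": max_output_tokens}
        if reasoning_effort is not None:
            if reasoning_effort not in ALLOWED_EFFORTS:
                raise ValueError(f"unsupported reasoning_effort: {reasoning_effort!r}")
            params["reasoning_effort"] = reasoning_effort
        return params

    # Non-reasoning models must not receive reasoning_effort at all,
    # so any configured value is dropped rather than forwarded.
    return {"max_tokens": max_output_tokens}
```

Dropping `reasoning_effort` for non-reasoning models, instead of raising, is a design choice made for this sketch; the specification only requires that the field is never sent.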
+ +### Model Identifier Routing +- **Ollama heuristic:** a model identifier is classified as an Ollama-style identifier if it contains `:` and does not begin with the prefix `ft:` (case-insensitive). This rule must be carried forward exactly to avoid misclassifying fine-tuned models or Azure deployment IDs. +- When the hosted-Claude backend is selected but no Claude-prefixed model identifier is configured, the system falls back to a specific default model rather than forwarding the potentially invalid identifier. +- Azure/proxy deployments with non-standard deployment IDs are preserved unchanged. + +## Pipeline Stage Boundaries + +- **Stage 1** must operate without reading any source files; it sees only directory-level summaries and manifest contents. Source reading is exclusively Stage 2's responsibility. +- **Stage 1** must produce include and exclude path patterns in gitignore-style format relative to the repository root. +- **Stage 2 (extraction)** targets only primary wiki sections. Derivative sections are explicitly excluded and are produced in Stage 4 from the aggregate of primary findings. This boundary must be preserved through any migration. +- **Derivative section ordering.** Every derivative section must reference only known section IDs, and every upstream dependency must appear earlier in the canonical section ordering. This ordering invariant is validated at module load time; any violation raises an error. + +## Interface and Directory Contracts + +- The CLI entry point and its four subcommands (`init`, `walk`, `chat`, `report`) are declared as a named script in the package manifest; the command name and subcommand surface are **contractual interfaces** for users and tooling. +- The on-disk directory layout (`.wikifi/`, `config.toml`, `.gitignore`, one markdown file per section, `.notes/`, `.cache/`) is the **explicit versioned contract** with target projects and must not change in ways that break existing wikis. 
+- The `.notes/` and `.cache/` directories must always be excluded from version control; only section markdown files are committed. Any new required gitignore entries introduced in future versions must be backfilled into older wikis automatically on the next `init` run. +- Three exact sentinel strings mark unpopulated sections and must not be altered: `Not yet populated`, `No findings were extracted`, and `upstream sections required to derive`. The report module depends on these exact strings for gap analysis and scoring exclusion. + +## Specialized Extractor Rules + +- Only migration files with `.sql` or `.ddl` suffixes are routed to the SQL migration extractor; all other migration files must fall through to the general extraction path. Routing is determined by file suffix inspection, not by file-kind classification alone. +- When an API contract file is present but cannot be parsed, the system must emit an explicit warning finding directing migration teams to review the file manually. Unparseable specs are flagged, not silently skipped. +- Service-to-RPC attribution in protocol definition files must be computed by tracking brace depth (counting nested blocks), not by line proximity, to ensure correct attribution in multi-service files. +- Index definitions in schema files encode query-time performance invariants that must be preserved through migration; the extractor emits this requirement explicitly in every index finding. +- The import/reference graph must be constructed without any binary or compiled dependencies; only pattern matching and path resolution are permitted. This is a stated architectural constraint. +- Migration files are detected by matching a hardcoded list of well-known migration directory path tokens. A SQL file located in such a directory is classified as a migration rather than generic schema, preserving the distinction between forward-only schema changes and current schema state. 
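The brace-depth rule for service-to-RPC attribution can be illustrated with a short sketch. The function `attribute_rpcs` and its regular expressions are assumptions made for illustration and do not reproduce the extractor's code. The sketch strips comments, then walks tokens in order and ties each `rpc` to whichever `service` block is open at that depth. It ignores braces inside string literals, which a production parser would need to handle.

```python
import re

# One pattern covering the four token kinds the walk cares about.
TOKEN_RE = re.compile(
    r"\bservice\s+(?P<service>\w+)"
    r"|\brpc\s+(?P<rpc>\w+)"
    r"|(?P<open>\{)"
    r"|(?P<close>\})"
)


def attribute_rpcs(proto_text: str) -> dict[str, list[str]]:
    """Map each service name to the RPC names declared inside its block."""
    # Remove block and line comments so commented-out braces are not counted.
    text = re.sub(r"/\*.*?\*/", "", proto_text, flags=re.S)
    text = re.sub(r"//[^\n]*", "", text)

    result: dict[str, list[str]] = {}
    depth = 0              # current brace nesting level
    pending = None         # service name seen, waiting for its opening brace
    current = None         # service whose block is currently open
    service_depth = 0      # nesting level of the open service block

    for match in TOKEN_RE.finditer(text):
        if match.group("service"):
            pending = match.group("service")
        elif match.group("rpc"):
            if current is not None:
                result[current].append(match.group("rpc"))
        elif match.group("open"):
            depth += 1
            if pending is not None:
                current, service_depth, pending = pending, depth, None
                result.setdefault(current, [])
        else:
            # A closing brace ends the service only at the service's own depth,
            # so nested option blocks inside an rpc do not end it early.
            if current is not None and depth == service_depth:
                current = None
            depth -= 1
    return result
```

Because attribution depends on depth and not on line position, an `rpc` with its own `{ ... }` option block, or several services declared on adjacent lines, is still assigned to the correct service.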
+ +## Supporting claims +- All synthesised wiki content must be free of specific language, framework, or library names and must be translated into domain terms. [1][2][3] +- Whenever two source notes make incompatible claims, the output must include a contradictions entry naming each position and the note indices that support it; suppressing or merging conflicting claims is forbidden. [4][5] +- Derivative sections must draw exclusively on the aggregated bodies of the primary sections that precede them. [6][7] +- Claims must be rendered with compact footnote-style markers and a Sources footer; detected contradictions must appear under a Conflicts in source heading. [8][5] +- Note indices are 1-based and the internal resolution step subtracts 1 before indexing; this off-by-one invariant must be preserved. [9] +- Files at or above 2,000,000 bytes are unconditionally skipped and never read. [10][11] +- Files below 64 bytes of stripped content are skipped entirely. [12][11] +- Chunk window is 150,000 bytes with 8,000 bytes of overlap; chunk overlap must satisfy 0 ≤ overlap < chunk_size and chunk size must be positive. [12][13] +- Manifest files are truncated to 20,000 bytes maximum before inclusion in any prompt. [11] +- The aggregation cache key must span the file reference, summary, finding text, and the full structured sources list; omitting any field allows stale metadata to be replayed. [14] +- Cache persistence must use an atomic write pattern (write to a sibling .tmp file, then rename) to guarantee a crash never produces a corrupt cache file. [15] +- Content fingerprints are defined as the first 12 hexadecimal characters of a SHA-256 digest and this format must be preserved across any migration. [16] +- The minimum acceptable quality score for publishing a section without revision is 7; the fixed rubric maps 9–10 to fully grounded, 6–8 to minor issues, 3–5 to substantial gaps, and 0–2 to incoherent. 
[17] +- A revised body is accepted only if its follow-up critique score is greater than or equal to the initial score; any regression causes the original body to be retained. [18] +- The default per-call request timeout is 900 seconds. [19][20][21] +- Three abstract interaction modes — structured completion, text completion, and chat — constitute the complete and exclusive provider contract. [22] +- The hosted-Claude backend defaults to a 32,000 token output cap; sampling parameters must not be sent to the claude-opus-4-7 model variant. [23][24][25] +- Disabling the reasoning trace on Qwen3-family models causes them to ignore the JSON schema constraint and emit free text; reasoning must never be disabled for these models on the structured-output path. [19][26] +- Reasoning-capable model families must receive max_completion_tokens instead of max_tokens and may receive a reasoning_effort value; non-reasoning models must not receive reasoning_effort. [27] +- When the structured-output parse path returns no parsed object, the implementation must fall back to validating raw JSON text against the schema rather than returning null. [28][29] +- An Ollama-style model identifier is defined as a string containing ':' that does not begin with the prefix 'ft:' (case-insensitive); this rule must be carried forward exactly. [30] +- Stage 1 must operate without reading any source files and must produce include/exclude patterns in gitignore-style format relative to the repository root. [31][32] +- Stage 2 extraction targets only primary sections; derivative sections are excluded and produced in Stage 4. [33] +- Every derivative section must reference only known section IDs, and every upstream dependency must appear earlier in the canonical section ordering; violations raise an error at module load time. [7] +- The CLI entry point and its four subcommands (init, walk, chat, report) are contractual interfaces for users and tooling. 
[34] +- The on-disk directory layout is the explicit versioned contract with target projects and must not change in ways that break existing wikis. [35] +- .notes/ and .cache/ directories must always be excluded from version control; new required gitignore entries must be backfilled automatically on the next init run. [36] +- Three exact sentinel strings — 'Not yet populated', 'No findings were extracted', and 'upstream sections required to derive' — must be preserved as canonical markers for unpopulated sections. [37] +- Only migration files with .sql or .ddl suffixes are routed to the SQL migration extractor; all others fall through to the general extraction path. [38] +- When an API contract file cannot be parsed, the system must emit an explicit warning finding rather than silently dropping it. [39] +- Service-to-RPC attribution must be computed by tracking brace depth, not line proximity. [40] +- Index definitions encode query-time performance invariants that must be preserved through migration; the extractor emits this requirement explicitly in every index finding. [41] +- The import/reference graph must be constructed without any binary or compiled dependencies. [42] +- Gherkin outputs must use Given/When/Then syntax inside fenced gherkin code blocks; Mermaid diagrams must be valid and inside fenced mermaid code blocks. [43] + +## Conflicts in source +_The walker found disagreements across files. Migration teams should resolve these before re-implementation._ + +- **The example environment configuration states that only the local-model provider is supported in v1, but multiple other sources document the hosted-Claude and OpenAI providers as fully implemented first-class backends with detailed API constraints.** + - Only the local-model (Ollama) provider is supported in v1. (`.env.example:7-44`) + - The hosted-Claude and OpenAI backends are fully implemented with detailed token caps, sampling-parameter rules, model-routing logic, and fallback behaviour. 
(`wikifi/config.py:122-134`, `wikifi/orchestrator.py:160-200`, `wikifi/providers/anthropic_provider.py:14-17`, `wikifi/providers/anthropic_provider.py:70-79`, `wikifi/providers/openai_provider.py:215-235`, `wikifi/providers/openai_provider.py:59-66`, `wikifi/providers/openai_provider.py:136-144`) + +## Sources +1. `wikifi/aggregator.py:57-59` +2. `wikifi/critic.py:53-61` +3. `wikifi/deriver.py:37-39` +4. `wikifi/aggregator.py:61-63` +5. `wikifi/evidence.py:121-131` +6. `wikifi/deriver.py:34-50` +7. `wikifi/sections.py:148-158` +8. `wikifi/evidence.py:43-52` +9. `wikifi/aggregator.py:167-173` +10. `wikifi/config.py:59-65` +11. `wikifi/walker.py:61-79` +12. `wikifi/config.py:66-81` +13. `wikifi/extractor.py:302-308` +14. `wikifi/cache.py:243-255` +15. `wikifi/cache.py:205-209` +16. `wikifi/fingerprint.py:23-27` +17. `wikifi/critic.py:31-48` +18. `wikifi/critic.py:137-147` +19. `.env.example:7-44` +20. `wikifi/providers/ollama_provider.py:50-54` +21. `wikifi/providers/openai_provider.py:59-66` +22. `wikifi/providers/base.py:42-52` +23. `wikifi/config.py:122-134` +24. `wikifi/providers/anthropic_provider.py:14-17` +25. `wikifi/providers/anthropic_provider.py:70-79` +26. `wikifi/providers/ollama_provider.py:9-27` +27. `wikifi/providers/openai_provider.py:215-235` +28. `wikifi/providers/anthropic_provider.py:107-145` +29. `wikifi/providers/openai_provider.py:136-144` +30. `wikifi/orchestrator.py:205-215` +31. `wikifi/introspection.py:5-9` +32. `wikifi/introspection.py:50-58` +33. `wikifi/extractor.py:51-56` +34. `wikifi/cli.py:1-7` +35. `wikifi/wiki.py:1-8` +36. `wikifi/wiki.py:36-47` +37. `wikifi/report.py:103-108` +38. `wikifi/specialized/dispatch.py:28-62` +39. `wikifi/specialized/openapi.py:24-37` +40. `wikifi/specialized/protobuf.py:62-67` +41. `wikifi/specialized/sql.py:115-121` +42. `wikifi/repograph.py:22-30` +43. `wikifi/deriver.py:40-45` +44. 
`wikifi/orchestrator.py:160-200`
diff --git a/.wikifi/integrations.md b/.wikifi/integrations.md
index 7a65e87..86dc8ea 100644
--- a/.wikifi/integrations.md
+++ b/.wikifi/integrations.md
@@ -1,36 +1,105 @@
 # Integrations
-#### Internal Pipeline Handoffs
-The system operates as a staged processing pipeline where each module consumes structured outputs from upstream stages and passes refined data downstream. The orchestration layer serves as the central coordinator, triggered by external commands to provision workspaces or initiate full processing cycles. It delegates execution to specialized components in the following sequence:
-
-- **Repository Traversal & Introspection:** The traversal component scans target directories and supplies filtered file paths and structural metadata. The introspection module consumes directory summaries and manifests to generate filtering patterns and metadata, which guide subsequent analysis stages.
-- **Source Analysis & Extraction:** The extraction engine receives the filtered file lists, analyzes individual artifacts, and translates technical content into structured, technology-agnostic notes. These notes are passed to the aggregation layer.
-- **Content Aggregation:** The aggregation module consumes the structured notes, synthesizes them into formatted documentation, and writes the results to the central knowledge base layout.
-- **Derivative Generation:** The derivation stage consumes finalized documentation, interfaces with generative synthesis services, and produces supplementary content. This output is written back into the central layout, completing the continuous pipeline from raw artifact analysis to polished documentation.
-
-#### External & Abstracted Interfaces
-All external dependencies are routed through standardized contracts to isolate core business logic from implementation details:
-
-- **Generative AI Services:** A unified abstraction layer handles all AI-driven content requests. Downstream modules submit contextual prompts and source snippets through this interface and receive processed findings or synthesized text in return. Provider-specific implementations are swappable without modifying the analysis engine.
-- **Configuration & Runtime Management:** A centralized settings provider supplies runtime parameters to the orchestration and traversal layers. These parameters govern model selection, provider routing, timeout thresholds, content size constraints, and file exclusion lists.
-- **User Interface & Console:** The command-line interface delegates initialization and execution to the orchestration service. It manages structured console output, progress reporting, and user feedback, ensuring a consistent interaction model.
-- **Observability & Telemetry:** The extraction stage integrates with a logging and statistics tracking system to monitor processing metrics, track pipeline health, and record analysis outcomes.
-
-#### Integration Touchpoint Summary
-| Component | Inbound Dependencies | Outbound Deliverables | External Interfaces |
-|---|---|---|---|
-| **Orchestrator** | CLI commands, centralized config | Task delegation signals | None (internal coordinator) |
-| **CLI Interface** | User input, runtime config | Execution triggers, console feedback | Standard console/terminal |
-| **Traversal & Introspection** | Config/exclusion lists, directory manifests | Filtered paths, metadata, filtering patterns | Repository filesystem |
-| **Extractor** | Filtered file lists, AI responses | Structured analysis notes | AI provider interface, logging/telemetry |
-| **Aggregator** | Structured notes, AI responses | Synthesized markdown sections | AI provider interface, wiki storage |
-| **Deriver** | Finalized markdown, AI responses | Derivative documentation | Generative synthesis service, wiki storage |
-| **AI Provider Layer** | Contextual prompts, source snippets | Processed findings, synthesized text | External inference backends |
-
-#### Documentation Gaps
-The provided notes outline the directional flow and high-level contracts but do not specify:
-- Exact data schemas or serialization formats used for inter-module handoffs
-- Error handling, retry policies, or fallback mechanisms for external service failures
-- Authentication, rate-limiting, or security constraints for AI provider interactions
-- Versioning or compatibility guarantees between pipeline stages
-These details should be clarified in implementation documentation or interface contracts.
+## Outbound Integrations
+
+### Language-Model Providers
+
+The system maintains a uniform provider abstraction that isolates every pipeline stage from the concrete inference backend. Three selectable backends are supported — a locally-hosted model service, a hosted Anthropic-compatible service, and an OpenAI-compatible service — each implementing the same three-method contract: structured JSON completion, free-text completion, and multi-turn chat. The active backend is chosen by configuration; the orchestrator and all downstream stages call into it without branching on which concrete provider is live.
+
+Every stage that performs inference uses this abstraction:
+
+| Stage | Operation |
+|---|---|
+| Introspection (Stage 1) | Structured JSON completion to classify repository paths |
+| Extraction (Stage 2) | Structured JSON completion against a findings schema, per file |
+| Aggregation (Stage 3) | Structured JSON completion against a section-body schema |
+| Derivation (Stage 4) | Structured completion for personas, user stories, and diagrams |
+| Critic / Reviser | Structured completion for rubric scoring and body revision |
+| Chat session | Multi-turn chat grounded in populated wiki content |
+
+### External Tool and Capability Servers
+
+At the development and runtime boundary, the system is configured as a client that fans out to multiple external capability providers via a tool-server protocol. Four integrations are declared: a local AI utility, a local web crawler, a remote documentation context service, and a remote stitching/search service. This makes the system both a producer of wiki content and a consumer of external knowledge services during operation.
+
+### Filesystem and Layout Abstraction
+
+All pipeline stages read and write through a shared filesystem layout abstraction rather than addressing paths directly. Extraction findings are appended to a notes store; aggregated section bodies are written back through the same abstraction; the report and chat components read section markdown from the same on-disk layout. The cache layer uses this abstraction to locate its storage directory, and all cache reads and writes (keyed on file fingerprints and section-content hashes) pass through it.
+
+### Import Graph
+
+The extraction stage integrates with a repository-wide import/reference graph. For each file being analysed, the graph supplies the file's direct neighbors — files it depends on and files that depend on it — which are injected into the extraction prompt to enable cross-file flow descriptions. The graph also drives the specialized-extractor dispatch path by classifying each file's structural kind before routing.
+
+### Per-Project Configuration
+
+Project-specific provider selection, model preferences, caching behavior, and feature flags are read from a TOML configuration file stored inside each managed project's wiki directory. Parse failures fall back gracefully to environment-derived defaults rather than aborting the pipeline.
+
+---
+
+## Inbound Integrations
+
+The primary entry point is the command-line interface, which exposes four subcommands (`init`, `walk`, `chat`, `report`). It constructs the provider instance and passes it directly into the chat and report capabilities. All other pipeline stages are driven by the orchestrator, which sequences introspection → extraction → aggregation → derivation and is itself invoked by the CLI `walk` subcommand.
+
+---
+
+## Integration Surfaces Detected in Analysed Codebases
+
+When the system analyses a target repository, it surfaces the following categories of integration touchpoint:
+
+- **HTTP API endpoints** — each parsed API contract contributes a finding recording the number of endpoints the analysed system exposes to external consumers, forming the inbound integration inventory.
+- **RPC service blocks** — each service definition is treated as an integration touchpoint; individual operations are described with their request and response types, including streaming direction where declared.
+- **Event-driven subscriptions** — subscription roots in schema definition files are mapped specifically to the integrations section, reflecting that they represent event-driven touchpoints rather than direct request/response capabilities.
+- **Relational foreign-key links** — cross-table references are recorded as hard relational links between entities, surfacing constraints that affect how components may be separated or migrated independently.
+
+The specialized extractor dispatch layer acts as the internal routing hub between the upstream file classifier (which tags file kinds) and the downstream extractors responsible for each artifact type. Files that do not match a recognized kind fall through to the general LLM extraction path.
+
+## Supporting claims
+- Three selectable LLM backends are supported — a locally-hosted model service, a hosted Anthropic-compatible service, and an OpenAI-compatible service. [1][2][3]
+- Each backend implements the same three-method contract: structured JSON completion, free-text completion, and multi-turn chat. [1][2][3]
+- The orchestrator and all downstream stages call into the provider without branching on which concrete provider is active. [4][1][2][3]
+- The introspection stage uses structured JSON completion to classify repository paths. [5]
+- The extraction stage uses structured JSON completion against a findings schema, per file. [6]
+- The aggregation stage uses structured JSON completion against a section-body schema. [7]
+- The critic/reviser uses structured completions for rubric scoring and body revision. [8]
+- The chat session uses multi-turn chat grounded in populated wiki content. [9]
+- Four external tool-server integrations are declared: a local AI utility, a local web crawler, a remote documentation context service, and a remote stitching/search service, making the system an MCP client that fans out to multiple capability providers. [10]
+- All pipeline stages read and write through a shared filesystem layout abstraction; the cache layer uses this abstraction to locate its storage directory. [11][12][13][14][15][16]
+- The extraction stage integrates with a repository-wide import/reference graph; each file's neighbors are injected into the extraction prompt. [17][18]
+- The import graph also drives the specialized-extractor dispatch path by classifying each file's structural kind. [18][19]
+- Project-specific settings are read from a TOML configuration file inside each managed project's wiki directory; parse failures fall back gracefully to defaults. [20][21]
+- The CLI constructs the provider instance and passes it directly into the chat and report capabilities. [4]
+- Each parsed API contract contributes a finding recording the number of HTTP endpoints the analysed system exposes to external consumers. [22]
+- Each RPC service block in a protocol definition is treated as an integration touchpoint, with operations described including streaming direction. [23]
+- Subscription roots are mapped specifically to the integrations section, reflecting event-driven touchpoints. [24]
+- Cross-table foreign-key references are recorded as hard relational links between entities, surfacing migration constraints. [25]
+- The specialized extractor dispatch layer routes recognized file kinds to dedicated extractors; unrecognized files fall through to the general LLM extraction path. [26][19][27]
+- Derivative sections are excluded from the aggregation stage and are instead populated by a separate deriver stage that runs afterwards. [28][14]
+
+## Sources
+1. `wikifi/providers/anthropic_provider.py:83-106`
+2. `wikifi/providers/ollama_provider.py:44-46`
+3. `wikifi/providers/openai_provider.py:1-9`
+4. `wikifi/cli.py:176-179`
+5. `wikifi/introspection.py:61-70`
+6. `wikifi/extractor.py:220-235`
+7. `wikifi/aggregator.py:136-141`
+8. `wikifi/critic.py:30-32`
+9. `wikifi/chat.py:52-55`
+10. `.mcp.json:2-36`
+11. `wikifi/aggregator.py:109-160`
+12. `wikifi/cache.py:30-32`
+13. `wikifi/chat.py:63-82`
+14. `wikifi/deriver.py:73-107`
+15. `wikifi/extractor.py`
+16. `wikifi/report.py:78-130`
+17. `wikifi/extractor.py:213-215`
+18. `wikifi/repograph.py:1-10`
+19. `wikifi/specialized/dispatch.py:36-62`
+20. `wikifi/cli.py:103-105`
+21. `wikifi/config.py:169-200`
+22. `wikifi/specialized/openapi.py:96-103`
+23. `wikifi/specialized/protobuf.py:64-90`
+24. `wikifi/specialized/graphql.py:108-110`
+25. `wikifi/specialized/sql.py:88-98`
+26. `wikifi/specialized/__init__.py:7-8`
+27. `wikifi/specialized/models.py:30-31`
+28. `wikifi/aggregator.py:111-116`
diff --git a/.wikifi/intent.md b/.wikifi/intent.md
index 54c5959..4f2d24e 100644
--- a/.wikifi/intent.md
+++ b/.wikifi/intent.md
@@ -1,34 +1,64 @@
 # Intent and Problem Space
-### Purpose and Problem Statement
-The system exists to eliminate the labor-intensive overhead of manual documentation and resolve the fragmentation of technical knowledge within software repositories. When teams inherit, maintain, or scale complex codebases, understanding the underlying business logic, user value, and architectural relationships typically requires tedious reverse-engineering. This tool automates that process by systematically analyzing source artifacts to produce structured, navigable documentation that captures *what* the system does and *why* it exists, deliberately abstracting away implementation-specific mechanics.
-
-### Target Audience
-- Engineering teams onboarding to unfamiliar, legacy, or rapidly evolving codebases
-- Technical writers and architects seeking a reliable, evidence-based baseline for system documentation
-- Organizations requiring consistent, technology-agnostic knowledge bases across multiple projects or acquisition targets
-
-### Design Constraints and Guiding Principles
-The system’s architecture is shaped by several non-negotiable constraints that prioritize reliability, analytical depth, and long-term maintainability:
-
-- **Strict Technology Agnosticism:** All extraction and synthesis processes deliberately ignore language-specific syntax, framework conventions, or library dependencies. The focus remains exclusively on business purpose, user value, and behavioral specifications.
-- **Fidelity Over Throughput:** Processing is optimized for analytical depth and output accuracy rather than raw speed. The system explicitly trades computational cost for higher-quality, cross-cutting insights, providing configurable controls to balance resource expenditure against result quality.
-- **Deterministic, Stage-Gated Execution:** Analysis follows a strictly ordered pipeline. Each phase must complete successfully before downstream processing begins, ensuring predictable outcomes, graceful failure handling, and reproducible results across runs.
-- **Backend Decoupling:** Core analytical logic is strictly separated from underlying reasoning or generation services. This allows seamless substitution of processing backends without altering the system’s operational contract or output structure.
-- **Upgrade-Safe Documentation Contract:** The output structure adheres to a stable, version-resilient schema. This ensures that documentation remains navigable and consistent even as the underlying analysis methods evolve.
-- **Automated Noise Filtration:** The system automatically isolates production behavior from non-essential artifacts (e.g., tests, third-party dependencies, configuration files, generated code) to prevent analysis dilution and conserve processing resources.
-
-### Operational Boundaries
-| Dimension | In Scope | Out of Scope |
-|---|---|---|
-| **Analysis Focus** | Business logic, user value, architectural relationships, behavioral narratives | Low-level implementation details, syntax optimization, performance profiling |
-| **Input Handling** | Unknown or unstructured repositories, mixed-paradigm codebases | Pre-documented systems, strictly standardized templates |
-| **State Management** | Intermediate data preservation, incremental processing, debugging traceability | Real-time code generation, automated refactoring, deployment pipelines |
-
-### Documented Gaps
-While the system’s intent and high-level constraints are well-defined, the following operational parameters remain unspecified in the current documentation:
-- Exact thresholds or heuristics used to balance computational cost against result quality
-- Conflict resolution strategies when extracted insights from different files contradict one another
-- Specific criteria for classifying artifacts as "non-essential" across highly customized or non-standard repository structures
-
-These gaps do not impact the system’s core purpose but should be addressed before production deployment in complex or highly regulated environments.
+wikifi exists to produce a structured, technology-agnostic wiki from an arbitrary source code repository — explaining **what a system does and why**, independent of the languages, frameworks, or infrastructure used to build it. Its primary audience is the team inheriting or migrating an existing codebase: architects and engineers who need a trustworthy, actionable picture of domain entities, capabilities, and integrations without spending days reading raw source files.
+
+### The core problem
+
+Large and legacy codebases resist quick comprehension. Source files encode intent implicitly, mixed with scaffolding, build artifacts, tests, and dependency code that carry no domain signal. At the same time, certain structured artifacts — database schemas, API contracts, protocol definitions — express intent with machine-readable precision that general-purpose analysis handles poorly. Any naive, uniform approach to understanding a codebase either drowns in noise or misses the highest-fidelity evidence.
+
+Beyond individual files, some concepts — user personas, end-to-end user stories, system-level diagrams — only emerge from the *aggregate* of capabilities, entities, and integrations, and simply cannot be read from any single file in isolation.
+
+wikifi addresses all of this by treating repository understanding as a structured, multi-stage extraction problem rather than a documentation-writing task.
+
+### For whom
+
+The system is designed explicitly around the needs of **migration teams and technical architects** who must understand a live system well enough to redesign or replatform it. Every design choice — traceability of claims to source locations, surfacing of contradictions rather than silently merging them, quality scoring before handoff — is oriented toward answering the question: *can we trust this wiki enough to act on it?*
+
+### Constraints that shape the design
+
+**Trust and traceability over convenience.** The system refuses to silently resolve disagreements between source files. Every synthesized claim must trace back to the specific files that justified it, and a dedicated quality-assurance pass flags unsupported claims and coverage gaps before output is delivered.
+
+**Technology neutrality.** All output is expressed in domain terms — entities, capabilities, integrations, personas — never in terms of the implementation technology. This ensures the wiki remains useful even when the migration replaces the entire stack.
+
+**Local-first operation.** The default configuration routes all inference through a locally-hosted model to avoid cloud API dependencies. Hosted providers are explicit opt-ins, reflecting a philosophy of keeping sensitive source code within the operator's own infrastructure unless otherwise chosen.
+
+**Quality over speed.** The system prioritises documentation quality over processing throughput. Guards prevent runaway behaviour on near-empty or oversized files, and higher-order sections are synthesized only after all primary evidence has been assembled.
+
+**Scalability on large codebases.** Re-processing a large legacy codebase on every run is impractical. Content-addressed caching ensures only changed files require new analysis, making repeated full-repository walks economical and enabling recovery after mid-run failures. Structured contract files bypass general-model processing entirely when deterministic parsing is more accurate and less costly.
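Editorial illustration, not part of the diff: the content-addressed caching described in the "Scalability on large codebases" paragraph can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not wikifi's actual implementation. The names `fingerprint`, `FindingsCache`, `get_or_compute`, and `fake_analyze` are hypothetical, and the choice of SHA-256 with one JSON file per fingerprint is an assumption made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def fingerprint(content: bytes) -> str:
    """Return a hex SHA-256 digest used as the cache key for file content."""
    return hashlib.sha256(content).hexdigest()


class FindingsCache:
    """Hypothetical on-disk cache: one JSON file per content fingerprint."""

    def __init__(self, cache_dir: Path) -> None:
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def get_or_compute(self, content: bytes, analyze: Callable[[bytes], list]) -> list:
        entry = self.cache_dir / f"{fingerprint(content)}.json"
        if entry.exists():
            # Cache hit: the content is unchanged, so skip the expensive analysis.
            return json.loads(entry.read_text(encoding="utf-8"))
        findings = analyze(content)
        # Persist immediately so a mid-run failure does not lose finished work.
        entry.write_text(json.dumps(findings), encoding="utf-8")
        return findings


calls = []


def fake_analyze(content: bytes) -> list:
    """Stand-in for an expensive analysis step; records each invocation."""
    calls.append(content)
    return [f"{len(content)} bytes analysed"]


cache = FindingsCache(Path(tempfile.mkdtemp()) / ".cache")
first = cache.get_or_compute(b"def f(): pass", fake_analyze)
second = cache.get_or_compute(b"def f(): pass", fake_analyze)    # served from cache
third = cache.get_or_compute(b"def f(): return 1", fake_analyze)  # changed content
```

Because the key is derived from the bytes rather than the path or modification time, a renamed or re-checked-out file with identical content still hits the cache, and a rerun after a crash repeats only the files whose entries were never written.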
+
+**Stable output contract.** The on-disk layout produced by wikifi is treated as a contract with the target project: it must remain stable across tool upgrades so that existing wikis stay readable and can be updated incrementally without full regeneration.
+
+## Supporting claims
+- wikifi exists to produce a technology-agnostic wiki explaining what a system does and why, independent of the technologies used to build it. [1][2][3][4][5]
+- Its primary audience is migration teams and architects who need a trustworthy picture of a codebase without manual source-reading. [6][7][8][9][10][11]
+- Some concepts — personas, user stories, diagrams — only emerge from the aggregate of capabilities and entities and cannot be extracted from individual files. [12][5]
+- The system refuses to silently resolve contradictions; every claim must trace back to the specific source files that justified it. [13][7]
+- A quality-assurance pass flags unsupported claims and coverage gaps before output is delivered, so migration teams can trust the result without manually verifying every claim. [6][8]
+- The default configuration routes inference through a locally-hosted model; hosted providers are explicit opt-ins reflecting a local-first philosophy. [14]
+- The system prioritises documentation quality over processing throughput, with guards against runaway behaviour on near-empty or oversized files. [1]
+- Content-addressed caching makes repeated full-repository walks economical and enables recovery after mid-run failures; only changed files require new analysis. [15][16]
+- Structured contract files bypass general-model processing when deterministic parsing is more accurate and less costly. [17][18][19][20]
+- The on-disk layout is treated as a stable contract with the target project, kept consistent across tool upgrades so existing wikis remain readable. [21]
+
+## Sources
+1. `.env.example:1-2`
+2. `wikifi/cli.py:1-10`
+3. `wikifi/introspection.py:1-9`
+4. `wikifi/orchestrator.py:1-17`
+5. `wikifi/sections.py:1-19`
+6. `wikifi/critic.py:1-15`
+7. `wikifi/evidence.py:1-18`
+8. `wikifi/report.py:1-16`
+9. `wikifi/specialized/openapi.py:1-11`
+10. `wikifi/specialized/protobuf.py:1-8`
+11. `wikifi/specialized/sql.py:1-13`
+12. `wikifi/deriver.py:1-18`
+13. `wikifi/aggregator.py:1-15`
+14. `wikifi/config.py:1-26`
+15. `wikifi/cache.py:1-20`
+16. `wikifi/extractor.py:1-30`
+17. `wikifi/repograph.py:1-30`
+18. `wikifi/specialized/__init__.py:1-12`
+19. `wikifi/specialized/dispatch.py:1-13`
+20. `wikifi/specialized/models.py:1-8`
+21. `wikifi/wiki.py:1-8`
diff --git a/.wikifi/personas.md b/.wikifi/personas.md
index 3b37339..3e033b3 100644
--- a/.wikifi/personas.md
+++ b/.wikifi/personas.md
@@ -1,63 +1,9 @@
 # User Personas
-### Primary Human Operators
+The system's design choices, capabilities, and integration surfaces converge on a small set of distinct roles. Each persona below is inferred from the aggregate of what the system does — no single module or feature alone would justify any of them.
-The system’s target audience is explicitly defined across the intent and capability specifications. By aggregating the stated problem space, pipeline behaviors, and integration contracts, three distinct human operator personas emerge. Each persona interacts with the system to resolve specific documentation debt, knowledge fragmentation, or onboarding friction.
+---
-#### 1. Onboarding Engineering Practitioner
-*Focus: Rapid comprehension of unfamiliar, legacy, or rapidly evolving codebases.*
+## Persona 1 — Migration Architect
-- **Goals:** Accelerate cross-team onboarding; quickly map business logic and functional capabilities without manual reverse-engineering; maintain awareness of system relationships as the codebase evolves.
-- **Needs:** Structured, navigable documentation that stays synchronized with implementation; standardized terminology across components; explicit declarations of missing or ambiguous information; traceability from documentation back to original source artifacts.
-- **Pain Points:** Fragmented technical knowledge; labor-intensive manual documentation; outdated or speculative content that drifts from actual implementation; difficulty distinguishing production behavior from test or configuration noise.
-- **Served Use Cases:**
-  - Structural analysis for system purpose inference and scoped processing boundaries
-  - Granular extraction of domain concepts from technical implementations
-  - Adaptive reasoning depth to toggle between lightweight overviews and deep architectural breakdowns
-  - Timestamped provenance for auditability and change tracking
-
-#### 2. Technical Writer & System Architect
-*Focus: Establishing reliable, evidence-based documentation baselines and behavioral narratives.*
-
-- **Goals:** Produce consistent, technology-agnostic documentation; capture cross-cutting relationships and behavioral specifications; maintain long-term documentation stability across tooling or backend updates.
-- **Needs:** Schema-validated structured generation for systematic phases; free-form analytical generation for narrative clarity; deterministic, stage-gated execution for reproducible outputs; explicit gap preservation rather than speculative filling.
-- **Pain Points:** Inconsistent terminology across projects; lack of traceability between documentation and source artifacts; manual authoring overhead; documentation contracts that break when analysis methods or backends change.
-- **Served Use Cases:**
-  - Section synthesis for cohesive, consistently structured documentation units
-  - Cross-cutting derivation for behavioral stories and system interaction diagrams
-  - Workspace lifecycle management for section scaffolding, versioning rules, and intermediate state cleanup
-  - Dual-mode generation to balance machine-readable consistency with human-readable clarity
-
-#### 3. Portfolio Manager & Acquisition Integrator
-*Focus: Standardizing knowledge bases across multiple projects, mixed-paradigm repositories, or acquisition targets.*
-
-- **Goals:** Assess system purpose and classification rationale quickly; maintain a unified, technology-agnostic knowledge base without manual overhead; ensure processing efficiency across diverse repository structures.
-- **Needs:** Automated noise filtration to isolate production behavior; flexible configuration of traversal depth, file size thresholds, and content filters; backend decoupling for seamless processing substitution; consistent workspace layouts across pipeline runs.
-- **Pain Points:** Resource exhaustion from scanning irrelevant directories; inconsistent output structures when analysis methods evolve; lack of auditability for compliance or assessment; fragmented knowledge across acquired or legacy projects.
-- **Served Use Cases:**
-  - Intelligent traversal & filtering for production-relevance classification and dynamic focus adjustment
-  - Introspection assessment for primary language identification and classification rationale
-  - Aggregation statistics and execution summaries for pipeline health monitoring and output readiness verification
-  - Upgrade-safe documentation contract to preserve navigability as underlying analysis methods evolve
-
-### Persona-to-Pipeline Mapping
-
-| Pipeline Stage / Capability | Onboarding Practitioner | Technical Writer & Architect | Portfolio Manager & Integrator |
-|---|---|---|---|
-| **Structural Analysis & Introspection** | System purpose inference, scoped boundaries | Classification rationale, structural metadata | Primary language/purpose assessment across targets |
-| **Granular Extraction & Domain Translation** | Business logic mapping, noise isolation | Evidence-based baseline, traceable notes | Technology-agnostic abstraction, standardized terminology |
-| **Section Synthesis & Dual-Mode Generation** | Lightweight overviews vs. deep breakdowns | Schema-validated structure + narrative clarity | Consistent output formatting across projects |
-| **Cross-Cutting Derivation** | Relationship mapping, onboarding acceleration | Behavioral stories, interaction diagrams | *(Note: System also auto-generates behavioral personas as a downstream artifact)* |
-| **Workspace Lifecycle & Execution Reporting** | Change tracking, provenance | Reproducible runs, gap preservation | Pipeline health metrics, auditability, upgrade-safe contracts |
-
-### Documented Gaps & Unresolved Persona Dimensions
-
-The upstream specifications define the system’s operational boundaries and target audiences but remain silent on several persona-specific dimensions. These gaps must be resolved before production deployment in complex or regulated environments:
-
-- **Role-Based Configuration Presets:** No predefined configuration profiles or heuristic thresholds are specified for balancing computational cost against result quality per persona.
-- **Access & Security Controls:** Authentication, rate-limiting, and role-based access constraints for AI provider interactions and workspace management are not defined.
-- **Workflow Integration Points:** Exact data schemas, serialization formats, and error-handling/retry policies for inter-module handoffs are unspecified, leaving persona-specific CI/CD or documentation workflow integration undefined.
-- **Conflict Resolution:** Strategies for reconciling contradictory extracted insights across files are not documented, which may impact how architects and writers validate synthesized sections.
-- **Non-Essential Classification Criteria:** Specific heuristics for classifying artifacts as "non-essential" across highly customized or non-standard repository structures remain undefined, potentially affecting portfolio managers scanning atypical acquisition targets.
-
-These gaps do not alter the system’s core purpose but should be addressed in implementation contracts or operational runbooks to fully support each persona’s workflow expectations.
+> *
diff --git a/.wikifi/user_stories.md b/.wikifi/user_stories.md
index c4d4713..4ca05c5 100644
--- a/.wikifi/user_stories.md
+++ b/.wikifi/user_stories.md
@@ -1,121 +1,218 @@
 # User Stories
-### Feature: Intelligent Traversal & Structural Analysis
+## Feature: Repository Triage and Scoping
-**User Story**
-As a Portfolio Manager & Acquisition Integrator, I want the system to automatically filter out non-essential files and large binaries during repository scanning, so that I can assess system purpose and classification rationale without resource exhaustion.
+### As a Documentation Engineer, I want the system to classify which paths contain production source before any deep analysis begins, so that analysis effort is focused on meaningful content and costs are bounded.
 ```gherkin
-Given a target repository containing mixed-paradigm artifacts and version-controlled noise
-When the structural analysis stage executes with configured path filters and size thresholds
-Then the system excludes irrelevant directories and oversized assets
-And produces a directory summary reflecting only allowed traversal boundaries
-And generates an introspection assessment identifying primary languages and system purpose
+Given a repository containing production source alongside vendored dependencies, build artifacts, generated files, and CI configuration
+And file-size bounds are configured in the pipeline settings
+When Stage 1 triage runs
+Then paths classified as vendored dependencies, build artifacts, generated files, or CI configuration are excluded from analysis
+And files outside the configured size bounds are filtered before any extraction begins
+And the rationale for every filtering choice is recorded in the IntrospectionResult
 ```
-**Entities Involved:** `Scan/Traversal Config`, `Directory Summary`, `Introspection Assessment`
-**Acceptance Criteria:**
-- Processing never exceeds defined size constraints or traverses excluded paths.
-- Directory statistics accurately reflect file counts, total size, and extension distribution within allowed boundaries.
-- Classification rationale is derived strictly from structural data and path filters.
-- *(Gap Declaration)* Specific heuristics for classifying artifacts as "non-essential" across highly customized or non-standard repository structures remain undefined.
+---
+
+## Feature: Per-File Extraction
+
+### As a Documentation Engineer, I want well-structured files to be routed to deterministic extractors rather than AI inference, so that extraction is more accurate and cost-effective for those artifact types.
+
+```gherkin
+Given a repository containing relational schema files, API contract files, interface definitions, and migration scripts alongside general source files
+When the extraction stage begins
+Then well-structured files are routed to dedicated deterministic extractors
+And general-purpose source files are analyzed via AI inference
+And every finding carries a citation recording the repo-relative file path and line range
+```
+
+### As a Documentation Engineer, I want large source files to be split into overlapping chunks during extraction, so that no content is lost at chunk boundaries.
+
+```gherkin
+Given a source file whose size exceeds the configured chunk threshold
+When the file is processed during extraction
+Then the file is divided into overlapping chunks using the coarsest available boundary first
+And findings are deduplicated across chunk boundaries to avoid double-counting
+And each finding retains its citation to the originating path and line range
+```
+
+### As a Migration Architect, I want each file's extraction to be enriched with its import-graph neighbourhood, so that findings describe cross-file flows rather than treating each file in isolation.
+
+```gherkin
+Given a repository with interdependent files
+And the cross-file import graph option is enabled in settings
+When per-file extraction runs
+Then the system builds a cross-file import and reference graph before extraction begins
+And each file's extraction pass is enriched with the files it depends on and the files that depend on it
+And findings can assert cross-file relationships rather than single-file observations
+```
 ---
-### Feature: Domain-Centric Translation & Granular Extraction
+## Feature: Section Synthesis and Derivative Generation
-**User Story**
-As an Onboarding Engineering Practitioner, I want technical implementations translated into domain concepts with explicit gap declarations, so that I can quickly map business logic and functional capabilities without manual reverse-engineering.
+### As a Documentation Engineer, I want primary wiki sections to be synthesized from per-file findings with full citation trails, so that every assertion in the output is traceable to a specific source location.
 ```gherkin
-Given a set of source files within the scoped processing boundaries
-When the granular extraction stage translates technical implementations into domain concepts
-Then the system strips implementation-specific syntax to surface underlying business rules
-And creates timestamped extraction notes linking each file to a role summary and finding
-And preserves raw evidence for ambiguous data instead of generating speculative content
+Given a set of per-file findings accumulated for a primary section
+When the aggregation stage runs for that section
+Then a coherent markdown body is produced from the findings
+And every assertion in the body is backed by numbered citations traceable to the source files and line ranges from which it was inferred
+And claims present in the supporting evidence but absent from the narrative body are collected into a separate supporting-claims list rather than silently dropped
 ```
-**Entities Involved:** `Configuration`, `Extraction Note`
-**Acceptance Criteria:**
-- Each extraction note is immutable once created and tied to a single source file.
-- Technical artifacts are consistently mapped to business-readable concepts.
-- Missing or ambiguous information is explicitly documented rather than filled speculatively.
-- *(Gap Declaration)* Strategies for reconciling contradictory extracted insights across files are not documented.
+### As a Documentation Engineer, I want derivative sections to be synthesized only after all their upstream primary sections are finalized, so that personas, user stories, and diagrams are grounded in complete evidence.
+
+```gherkin
+Given a wiki configuration where derivative sections declare upstream dependencies
+When the pipeline reaches the derivation stage
+Then sections are processed in topological order enforced at startup
+And no derivative section is synthesized until all of its declared upstream sections are finalized
+And if an upstream section is absent or empty, a placeholder is emitted rather than fabricated content
+```
+
+---
+
+## Feature: Wiki Scaffolding
+
+### As a Pipeline Operator, I want wiki initialization to be idempotent, so that re-running the scaffold command in an automated pipeline does not overwrite existing content.
+
+```gherkin
+Given a project with a partially populated wiki directory structure
+When the wiki scaffold command is run again
+Then existing content is left untouched
+And only missing structural pieces are created
+```
 ---
-### Feature: Section Synthesis & Dual-Mode Generation
+## Feature: Conflict Detection and Evidence Traceability
-**User Story**
-As a Technical Writer & System Architect, I want schema-validated structured generation combined with free-form narrative clarity, so that I can produce consistent, technology-agnostic documentation baselines.
+### As a Migration Architect, I want incompatible assertions across source files to be surfaced explicitly rather than silently resolved, so that my team can identify and resolve tribal knowledge conflicts before re-implementation.
 ```gherkin
-Given aggregated extraction notes from the granular extraction stage
-When the section synthesis stage consolidates findings into documentation units
-Then the system applies schema-validated structured generation for systematic phases
-And uses free-form analytical generation for narrative clarity
-And outputs finalized wiki sections with consistent terminology and structure
+Given a codebase where two or more source files make incompatible assertions about the same domain topic
+When the aggregation stage produces the section body
+Then the conflict is surfaced under a dedicated heading in the output
+And each conflicting position retains its own source references
+And the narrative does not silently choose one position over another
 ```
-**Entities Involved:** `Documentation Section`, `Aggregation Stats`, `Workspace Layout`
-**Acceptance Criteria:**
-- Sections are generated only after successful note aggregation.
-- Aggregation statistics track successful writes and explicitly flag empty sections.
-- Directory structure remains consistent across pipeline runs, handling scaffolding and intermediate state cleanup.
-- *(Gap Declaration)* Exact mapping rules between intermediate extraction notes and final documentation sections are implied by the aggregation process but not explicitly detailed.
+### As a Migration Architect, I want every factual claim in the wiki traceable to the file and line range from which it was inferred, so that I can verify assertions against the original source.
+ +```gherkin +Given a generated wiki section containing factual assertions +When I inspect any claim in the narrative body +Then the claim is backed by one or more SourceRefs each identifying a repo-relative file path and line range +And a claim with no SourceRefs is explicitly marked as unsupported +``` --- -### Feature: Cross-Cutting Derivation & Behavioral Mapping +## Feature: Quality Assurance -**User Story** -As a Technical Writer & System Architect, I want the system to derive behavioral stories and interaction diagrams from cross-component relationships, so that I can capture system interactions and maintain long-term documentation stability. +### As a Quality Reviewer, I want sections evaluated against a structured rubric and revised when they fall below a quality threshold, so that the generated wiki meets a minimum standard before publication. ```gherkin -Given finalized documentation sections and extracted domain concepts -When the cross-cutting derivation stage identifies relationships spanning multiple components -Then the system generates behavioral narratives and system interaction diagrams -And auto-generates behavioral personas as downstream artifacts -And ensures deterministic, stage-gated execution for reproducible outputs +Given the critic-and-reviser loop is enabled in settings +And a minimum quality threshold score is configured +When a section body is synthesized +Then the section is scored on a 0–10 rubric identifying unsupported claims and coverage gaps +And if the score falls below the configured threshold a revision pass is triggered +And the revised body is accepted only if it improves or matches the prior score +And if synthesis fails entirely raw notes are emitted directly preserving information at the cost of polish ``` -**Entities Involved:** `Documentation Section`, `Execution Summary` -**Acceptance Criteria:** -- Cross-cutting relationships are identified without manual authoring overhead. 
-- Generated artifacts maintain traceability back to original source artifacts. -- Execution follows a deterministic, four-stage pipeline progression. -- *(Gap Declaration)* Workflow integration points, including exact data schemas, serialization formats, and error-handling/retry policies for inter-module handoffs, are unspecified. +### As a Pipeline Operator, I want the critic-and-reviser loop to be disabled by default, so that generation time remains predictable in routine runs. + +```gherkin +Given a pipeline run with default settings +When the pipeline generates and derives sections +Then the critic-and-reviser loop is not executed +And generation time is predictable +When the loop is explicitly enabled via settings +Then it is applied to all sections and is most beneficial for derivative sections where single-shot synthesis is most likely to stray from evidence +``` + +--- + +## Feature: Coverage and Readiness Reporting + +### As a Pipeline Operator, I want a single-page readiness report listing per-section metrics, so that I can assess documentation completeness in an automated pipeline without requiring an AI provider. + +```gherkin +Given a wiki with one or more generated sections +When the report command is run +Then a markdown table is produced listing each section with its contributing file count, finding count, body character length, quality score, and most prominent gap or unsupported claim +And the coverage portion of the report executes without an AI provider +And the report output is safe for automated pipelines +``` --- -### Feature: Execution Reporting & Provenance Tracking +## Feature: Incremental and Crash-Resumable Operation -**User Story** -As a Portfolio Manager & Acquisition Integrator, I want detailed execution summaries and timestamped provenance for all generated artifacts, so that I can ensure auditability and verify pipeline health across acquisition targets. 
+### As a Pipeline Operator, I want unchanged files and sections to be served from cache on re-runs, so that the pipeline completes quickly when only a subset of files has changed. ```gherkin -Given a completed pipeline run across all processing stages -When the system consolidates metrics, findings, and completion status -Then an execution summary is generated as a single source of truth for pipeline health -And a chronological record of extraction notes is maintained per section -And file inclusion/exclusion metrics and generation status are reported for full auditability +Given a repository that has been walked at least once with caching enabled +And some files have not changed since the last run +When the pipeline runs again +Then files whose content fingerprint matches the cache are skipped without re-extraction +And sections whose notes-payload hash matches the cache are served from cache without re-synthesis +And the cache is written after each individual file completes so that a mid-run crash leaves previously completed work intact ``` -**Entities Involved:** `Execution Summary`, `Extraction Note`, `Aggregation Stats` -**Acceptance Criteria:** -- Execution summary is generated only after all pipeline stages report completion. -- Provenance enables traceability from final documentation back to original source artifacts. -- Pipeline health metrics and output readiness are verified before final delivery. -- *(Gap Declaration)* Authentication, rate-limiting, and role-based access constraints for workspace management and AI provider interactions are not defined. +### As a Pipeline Operator, I want cache entries to be automatically invalidated when the cache format changes, so that obsolete data from a previous version never silently corrupts a rebuild. 
+ +```gherkin +Given a pipeline upgrade that changes the internal cache format version +When the pipeline runs after the upgrade +Then the version number embedded in existing cache files is compared to the current version +And a mismatch triggers a clean rebuild discarding all stale entries +And cache files are written atomically so that a crash during persistence never leaves a corrupted cache +``` + +### As a Documentation Engineer, I want a force-reanalysis mode that drops all cached data, so that I can guarantee a completely fresh walk when cache state is suspect. + +```gherkin +Given a pipeline run invoked with force-reanalysis mode enabled +When the pipeline begins +Then the on-disk cache is dropped entirely before any files are processed +And all files are re-extracted and all sections re-synthesized from scratch regardless of cached state +``` + +### As a Documentation Engineer, I want stale cache entries for removed files to be prunable in bulk, so that the cache does not accumulate entries for files that no longer exist in the repository. + +```gherkin +Given a repository from which one or more files have been deleted since the last walk +When the cache pruning operation is run +Then cache entries for files no longer present in the repository are removed +And remaining entries for current files are left intact +``` --- -### Story-to-Component Mapping Reference +## Feature: Interactive Query Interface + +### As a Codebase Explorer, I want to open a conversational session grounded in the generated wiki, so that I can ask natural-language questions about the codebase without reading raw source files. 
-| Feature | Primary Persona | Core Capability | Key Entities | Known Gaps Addressed | -|---|---|---|---|---| -| Intelligent Traversal & Structural Analysis | Portfolio Manager & Acquisition Integrator | Intelligent Traversal & Filtering | `Scan/Traversal Config`, `Directory Summary`, `Introspection Assessment` | Non-essential classification heuristics | -| Domain-Centric Translation & Granular Extraction | Onboarding Engineering Practitioner | Granular Extraction / Domain-Centric Translation | `Configuration`, `Extraction Note` | Contradictory insight resolution | -| Section Synthesis & Dual-Mode Generation | Technical Writer & System Architect | Section Synthesis / Dual-Mode Generation | `Documentation Section`, `Aggregation Stats`, `Workspace Layout` | Note-to-section mapping rules | -| Cross-Cutting Derivation & Behavioral Mapping | Technical Writer & System Architect | Cross-Cutting Derivation | `Documentation Section`, `Execution Summary` | Workflow integration & serialization schemas | -| Execution Reporting & Provenance Tracking | Portfolio Manager & Acquisition Integrator | Execution Reporting / Timestamped Provenance | `Execution Summary`, `Extraction Note`, `Aggregation Stats` | Access controls & role-based presets | +```gherkin +Given a wiki with one or more sections containing meaningful content +When I open a conversational session +Then only sections with meaningful content are loaded as context +And placeholder sections are excluded from the context +And I can ask multi-turn questions drawing on the loaded wiki sections +And I can inspect which sections are currently loaded as context +``` + +### As a Codebase Explorer, I want to reset the conversation history without losing the wiki context, so that I can start a fresh line of questioning without reloading the wiki. 
+ +```gherkin +Given an active conversational session with accumulated conversation history +When I issue a history-reset command +Then the conversation history is cleared +And the frozen system prompt built from the wiki sections is retained +And subsequent questions are answered using the same wiki context without the prior exchanges +``` diff --git a/README.md b/README.md index 22309b4..3805c80 100644 --- a/README.md +++ b/README.md @@ -19,16 +19,28 @@ uv run wikifi init - `init` — one-time setup; scaffolds the `.wikifi/` directory and any local config the implementor chooses to expose. - `walk` — main entry point. Walks the target codebase and produces the wiki content. + - `--no-cache` — force a clean re-walk; drops the on-disk extraction + aggregation caches. + - `--review` — run the critic + reviser loop on derivative sections (personas, user stories, diagrams). + - `--provider {ollama|anthropic|openai}` — override the configured provider for this walk. +- `report` — print a coverage + quality report (per-section file counts, findings, body sizes). + - `--score` — additionally run the critic on every populated section for a 0-10 quality score. - `ask` — natural language queries against the wiki content, with optional context injection from the target codebase. - `chat` — interactive REPL for iterative exploration of the wiki content and the target codebase. ## Architecture - **`wikifi/` package** — the library, with the CLI entry point exposed via `[project.scripts] wikifi = "wikifi.cli:main"` in `pyproject.toml`. - **Repository introspection** — before walking, the agent reviews the target's root structure (manifests, top-level layout, gitignore signals) and decides which paths carry production source worth analyzing. The walk that follows is deterministic — the agent does not re-pick scope mid-walk. 
-- **Per-file extraction** — for each in-scope file, the agent extracts contributions to each *primary* capture section (see `VISION.md`) into structured findings. +- **Repo graph** (`wikifi/repograph.py`) — a regex-driven static analysis builds an import / reference graph across in-scope files, plus classifies each file's `FileKind` (application code, SQL, OpenAPI, Protobuf, GraphQL, migration, other). Each file's neighborhood is injected into the extraction prompt so per-file findings can describe cross-file flows. +- **Specialized extractors** (`wikifi/specialized/`) — schema files (SQL, OpenAPI, Protobuf, GraphQL, migrations) bypass the LLM entirely and run through deterministic parsers. The structured findings reach the same notes store as LLM output, so the rest of the pipeline is unchanged. +- **Per-file extraction** — for each in-scope file, the agent extracts contributions to each *primary* capture section (see `VISION.md`) into structured findings. Each finding carries a structured `SourceRef` (file + line range + content fingerprint) for downstream citation. +- **Content-addressed cache** (`wikifi/cache.py`) — extraction findings are keyed by `(rel_path, sha256(file_bytes))`; aggregation bodies are keyed by a hash of the section's notes payload. Re-walks skip every file whose fingerprint hasn't changed; resumability after a crash is a free property of the same cache. Use `walk --no-cache` to force a clean re-walk. - **Input filtering** — the walker recognizes and skips unstructured or near-empty files (stub `__init__` files, empty fixtures, machine-generated artifacts) before they reach the agent. Empty input must never stall the walk. -- **Section synthesis** — primary capture sections are synthesized from the accumulated per-file findings; derivative sections (personas, user stories, diagrams) are produced *after* primary content is complete, taking the synthesized primary content as their input. 
-- **Provider abstraction** — the LLM backend is reached through a provider interface. Default is a local Ollama server; alternative providers (hosted Anthropic, hosted OpenAI, custom) plug in by implementing the same interface. +- **Section synthesis** — primary capture sections are synthesized from the accumulated per-file findings; the aggregator emits a structured `EvidenceBundle` (body + claims + contradictions) and the renderer threads numbered citations + a "Conflicts in source" block into the section markdown. Derivative sections (personas, user stories, diagrams) are produced *after* primary content is complete, taking the synthesized primary content as their input. +- **Critic + reviser** (`wikifi/critic.py`) — opt-in (`walk --review`), runs a quality pass on derivative sections: scores the body against its brief and upstream evidence, identifies unsupported claims, and re-synthesizes when the score is below threshold. Only accepts a revision if it scores at least as well as the original. +- **Coverage + quality report** (`wikifi/report.py`) — `wikifi report` produces a per-section view of files contributing, finding count, body size, and (with `--score`) critic-derived quality scores. +- **Provider abstraction** — the LLM backend is reached through a provider interface. Default is a local Ollama server (`OllamaProvider`); two hosted backends are opt-in: + - `AnthropicProvider` via `WIKIFI_PROVIDER=anthropic` — uses prompt caching with `cache_control: ephemeral` on the system prompt so the multi-KB extraction prompt is paid for once across hundreds of per-file calls. + - `OpenAIProvider` via `WIKIFI_PROVIDER=openai` — relies on OpenAI's automatic prefix caching (no marker required) and routes the `think` knob to `reasoning_effort` on `o*`/`gpt-5` reasoning models. - **Wiki adapter** — writes the rendered wiki into the target's `.wikifi/` directory. 
Layout, taxonomy, and structure within `.wikifi/` are at the implementor's discretion, provided the content contract from `VISION.md` is met. ## Tech stack diff --git a/TESTING-AND-DEMO.md b/TESTING-AND-DEMO.md new file mode 100644 index 0000000..5c72fdf --- /dev/null +++ b/TESTING-AND-DEMO.md @@ -0,0 +1,279 @@ +# Testing & demoing the premium pipeline + +This document covers how to verify and demo the nine premium features +landed in this PR. Every step works from a clean clone — no external +service required for the test suite, and only Ollama (default) or an +Anthropic API key (opt-in) for the live demos. + +## Prerequisites + +```bash +make hooks # one-time, enables the pre-commit + pre-push hooks +uv sync # installs anthropic + the other deps (already in uv.lock) +``` + +## Running the test suite + +```bash +make test # runs pytest with coverage +``` + +Expectations: +- **156 tests pass.** +- **Total coverage ≥ 93%.** Every new module is at or above 86%; the + premium-pipeline modules — `fingerprint`, `cache`, `evidence`, + `critic`, `report`, `repograph`, `specialized/*`, + `providers/anthropic_provider` — each carry a dedicated test file. + +To run only the suites for the new functionality: + +```bash +uv run pytest tests/test_fingerprint.py tests/test_cache.py tests/test_evidence.py \ + tests/test_repograph.py tests/test_specialized.py tests/test_critic.py \ + tests/test_report.py tests/test_anthropic_provider.py -v --no-cov +``` + +## Demoing each feature + +The demos below assume a working Ollama install with the model from +`.wikifi/config.toml` (default `qwen3.6:27b`). If you want the hosted +Anthropic path instead, set `ANTHROPIC_API_KEY` and pass +`--provider anthropic` to the relevant commands; everything else is +identical. + +### 1. Source-traceable citations + 5. Contradiction surfacing + +Run a walk against this repo: + +```bash +make init # one-time; idempotent +make walk +``` + +Open `.wikifi/
.md` for any populated primary section +(`entities.md`, `capabilities.md`, `cross_cutting.md`, …). At the bottom +you should see: + +``` +## Sources +1. `wikifi/extractor.py:115-187` +2. `wikifi/aggregator.py:54-79` +… +``` + +Where the aggregator detected disagreement across files, the section +also carries a `## Conflicts in source` block enumerating each +position with its sources. Search for it via: + +```bash +rg -n '^## Conflicts in source' .wikifi/ +``` + +(For unit-level evidence: `tests/test_evidence.py` exercises citation +rendering and contradiction rendering directly; `tests/test_aggregator.py` +covers the end-to-end "claim → SourceRef" resolution.) + +### 2. Incremental walks (content-addressed cache) + 11. Resumability + +Run a walk, then run it again immediately: + +```bash +make walk # first walk: extracts every in-scope file +make walk # second walk: cache_hits == files_seen +``` + +The second invocation prints `cache_hits=N` in the **Extraction** row +of the walk report — that's the number of files served from the cache +without an LLM call. + +To force a clean re-walk: + +```bash +uv run wikifi walk --no-cache +``` + +Resumability is the same mechanism: the cache is persisted after every +file finishes, so a `Ctrl-C` mid-walk loses no progress — the next +`wikifi walk` continues from the file that was in flight when the +crash happened. + +(Unit evidence: `tests/test_cache.py`, plus +`test_run_walk_persists_cache_for_resumability` in +`tests/test_orchestrator.py`.) + +### 3. Cross-file context (import graph) + +Open the live extraction prompt for any application file. 
The walker +includes a `Neighbor files` block listing files this one imports from +or is imported by: + +```bash +uv run wikifi walk -v 2>&1 | rg -A3 "Neighbor files" | head +``` + +You can also inspect the graph directly: + +```python +from pathlib import Path +from wikifi.repograph import build_graph +from wikifi.walker import WalkConfig, iter_files + +config = WalkConfig(root=Path(".")) +files = list(iter_files(config)) +graph = build_graph(repo_root=Path("."), files=files) +node = graph.get("wikifi/aggregator.py") +print(node.imports) # ('wikifi/cache.py', 'wikifi/evidence.py', ...) +print(node.imported_by) # ('wikifi/orchestrator.py', ...) +``` + +(Unit evidence: `tests/test_repograph.py`, plus +`test_extract_repo_injects_neighbor_context_when_graph_supplied` in +`tests/test_extractor.py`.) + +### 4. Type-aware extractors (SQL / OpenAPI / Protobuf / GraphQL / migrations) + +Drop a SQL file into a target project and run a walk: + +```bash +mkdir -p /tmp/demo && cd /tmp/demo && git init -q +cat > schema.sql <<'EOF' +CREATE TABLE customer ( + id INTEGER PRIMARY KEY, + email VARCHAR(255) UNIQUE NOT NULL +); +CREATE TABLE orders ( + id INTEGER PRIMARY KEY, + customer_id INTEGER REFERENCES customer(id), + total INTEGER NOT NULL +); +CREATE INDEX idx_orders_customer ON orders (customer_id); +EOF +uv run --project /home/user/wikifi wikifi init +uv run --project /home/user/wikifi wikifi walk +``` + +The walk report's **Extraction** row shows `specialized=1`. The +findings produced for `entities.md`, `integrations.md` (the FK), and +`cross_cutting.md` (the index + UNIQUE invariants) come from the +deterministic SQL parser — no LLM call was made for `schema.sql`. + +The same routing covers `*.proto`, `*.graphql`, OpenAPI YAML / JSON +specs, and any SQL file under `migrations/` / `alembic/` / +`db/migrate/` directories. 
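
To make the "no LLM call" routing for schema files concrete, here is a minimal sketch of a deterministic SQL pass over the demo `schema.sql` above. It is not the parser shipped in `wikifi/specialized/`; the `SchemaFacts` dataclass, the `parse_sql_schema` function, and the regexes are hypothetical names chosen for illustration, and the sketch only handles the plain `CREATE TABLE` / `REFERENCES` / `CREATE INDEX` forms used in the demo.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SchemaFacts:
    """Hypothetical container for facts pulled out of a SQL file."""
    tables: List[str] = field(default_factory=list)
    # (table, column, referenced_table)
    foreign_keys: List[Tuple[str, str, str]] = field(default_factory=list)
    # (index_name, table)
    indexes: List[Tuple[str, str]] = field(default_factory=list)


# Assumption: statements end with ");" and table names are bare identifiers.
_TABLE_RE = re.compile(
    r"CREATE\s+TABLE\s+(\w+)\s*\((.*?)\)\s*;", re.IGNORECASE | re.DOTALL
)
# Matches "<column> <type>[(args)] REFERENCES <table>" inside a table body.
_FK_RE = re.compile(
    r"(\w+)\s+\w+(?:\([^)]*\))?\s+REFERENCES\s+(\w+)", re.IGNORECASE
)
_INDEX_RE = re.compile(
    r"CREATE\s+(?:UNIQUE\s+)?INDEX\s+(\w+)\s+ON\s+(\w+)", re.IGNORECASE
)


def parse_sql_schema(sql: str) -> SchemaFacts:
    """Extract tables, inline foreign keys, and indexes without any LLM call."""
    facts = SchemaFacts()
    for table, body in _TABLE_RE.findall(sql):
        facts.tables.append(table)
        for column, target in _FK_RE.findall(body):
            facts.foreign_keys.append((table, column, target))
    facts.indexes.extend(_INDEX_RE.findall(sql))
    return facts


DEMO_SQL = """
CREATE TABLE customer (
  id INTEGER PRIMARY KEY,
  email VARCHAR(255) UNIQUE NOT NULL
);
CREATE TABLE orders (
  id INTEGER PRIMARY KEY,
  customer_id INTEGER REFERENCES customer(id),
  total INTEGER NOT NULL
);
CREATE INDEX idx_orders_customer ON orders (customer_id);
"""

facts = parse_sql_schema(DEMO_SQL)
print(facts.tables)
print(facts.foreign_keys)
print(facts.indexes)
```

Because the output depends only on the file bytes, a pass like this is reproducible across runs, which is the property that lets schema files skip the model. A real parser would also need to handle quoted identifiers, `IF NOT EXISTS`, and table-level `FOREIGN KEY` constraints, none of which this sketch attempts.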
+ +(Unit evidence: `tests/test_specialized.py` covers each parser; +`test_extract_repo_routes_sql_through_specialized_extractor` in +`tests/test_extractor.py` covers the end-to-end routing.) + +### 6. Critic + reviser pass on derivatives + +Re-run the walk with `--review`: + +```bash +uv run wikifi walk --review +``` + +The walk report shows `sections_revised=N` in the **Derivation** row — +that's how many derivative sections (personas / user stories / +diagrams) the critic flagged as below the score threshold and the +reviser improved. + +(Unit evidence: `tests/test_critic.py` covers the critic loop and the +"only accept revision if it scores at least as well" guard. Integration: +`test_run_walk_review_flag_invokes_critic`.) + +### 8. Coverage + quality report + +After a walk: + +```bash +uv run wikifi report # purely structural; no LLM calls +uv run wikifi report --score # adds critic-derived quality scores +``` + +Output is a markdown table of every section (files contributing, +findings count, body size, score, headline gap): + +``` +| Section | Files | Findings | Body | Score | Headline gap | +| --- | --- | --- | --- | --- | --- | +| `entities` | 12 | 47 | 5132 | 9/10 | — | +| `cross_cutting` | 4 | 9 | 1421 | 6/10 | unsupported: rate-limit policy | +… +``` + +(Unit evidence: `tests/test_report.py`.) + +### 9. Hosted providers with prompt caching + +Two opt-in hosted backends share the same provider abstraction. + +**Anthropic.** Sets `cache_control: {"type": "ephemeral"}` on the system +prompt block; subsequent per-file extraction calls read the cache for +~10% of the input price. + +```bash +export ANTHROPIC_API_KEY=sk-ant-... 
+WIKIFI_PROVIDER=anthropic uv run wikifi walk +# or: +uv run wikifi walk --provider anthropic +``` + +```python +from wikifi.providers.anthropic_provider import AnthropicProvider +provider = AnthropicProvider(model="claude-opus-4-7", think="high") +# After two calls with the same system prompt: +# response.usage.cache_read_input_tokens > 0 +``` + +**OpenAI.** Relies on OpenAI's automatic prefix caching — no marker +required, prefixes ≥ 1024 tokens are cached for ~5–10 minutes. The +provider also routes the `think` knob to `reasoning_effort` on +reasoning-capable models (`o*`, `gpt-5`): + +```bash +export OPENAI_API_KEY=sk-... +WIKIFI_PROVIDER=openai uv run wikifi walk +# or, with a reasoning model: +WIKIFI_PROVIDER=openai WIKIFI_MODEL=o3-mini uv run wikifi walk +# or via flag: +uv run wikifi walk --provider openai +``` + +```python +from wikifi.providers.openai_provider import OpenAIProvider +provider = OpenAIProvider(model="gpt-4o", think="high") +# Reasoning routing: +# OpenAIProvider(model="o3-mini", think="medium") → forwards reasoning_effort +# OpenAIProvider(model="gpt-4o", think="medium") → no reasoning_effort +``` + +For Azure-OpenAI or a corporate proxy, set +`WIKIFI_OPENAI_BASE_URL` (or pass `base_url=...` directly to the +constructor). + +(Unit evidence: `tests/test_anthropic_provider.py` locks in the +`cache_control` placement, the `messages.parse` structured-output +contract, the thinking → effort translation, and the APIError → +RuntimeError mapping. `tests/test_openai_provider.py` covers the +`chat.completions.parse` structured-output contract, the +reasoning-effort routing for `o*`/`gpt-5` vs plain models, the +`max_tokens` vs `max_completion_tokens` swap, and the same APIError → +RuntimeError mapping. `test_build_provider_returns_anthropic_when_selected` +and `test_build_provider_returns_openai_when_selected` in +`tests/test_orchestrator.py` cover dispatch.) 
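
The two request-shaping behaviors described above can be sketched as plain functions that build request arguments without touching the network. This is an illustration under stated assumptions, not the code in `wikifi/providers/`: `cached_system_blocks`, `is_reasoning_model`, and `build_openai_kwargs` are hypothetical helpers, and the model-name pattern is a guess at what "`o*`/`gpt-5`" means in practice.

```python
import re
from typing import Any, Dict, List, Optional


def cached_system_blocks(system_prompt: str) -> List[Dict[str, Any]]:
    """Wrap the system prompt in a single text block marked for prompt caching.

    Mirrors the shape the tests assert on: a list whose first block carries
    cache_control == {"type": "ephemeral"}.
    """
    return [
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},
        }
    ]


# Assumption: reasoning-capable models are named "o<digit>..." or "gpt-5...".
_REASONING_RE = re.compile(r"^(o\d|gpt-5)")


def is_reasoning_model(model: str) -> bool:
    return bool(_REASONING_RE.match(model))


def build_openai_kwargs(
    model: str, think: Optional[str], max_tokens: int
) -> Dict[str, Any]:
    """Choose the token-limit key and forward `think` only to reasoning models."""
    kwargs: Dict[str, Any] = {"model": model}
    if is_reasoning_model(model):
        # Reasoning models take max_completion_tokens rather than max_tokens.
        kwargs["max_completion_tokens"] = max_tokens
        if think is not None:
            kwargs["reasoning_effort"] = think
    else:
        kwargs["max_tokens"] = max_tokens
    return kwargs


print(cached_system_blocks("SYS"))
print(build_openai_kwargs("o3-mini", think="medium", max_tokens=512))
print(build_openai_kwargs("gpt-4o", think="medium", max_tokens=512))
```

Keeping this logic in pure functions makes it easy to lock down with unit tests of the kind listed above, since the assertions only inspect dictionaries and never need a live client.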
+ +## Tearing down + +The premium-pipeline state lives entirely under `.wikifi/`: + +``` +.wikifi/ + config.toml + *.md # rendered sections (committable) + .notes/ # per-section JSONL findings (gitignored) + .cache/ # extraction + aggregation caches (gitignored) +``` + +Delete `.wikifi/.cache/` to drop the cache and force a full re-walk; +delete the whole directory to start over. diff --git a/pyproject.toml b/pyproject.toml index 7394b57..5908fac 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -14,6 +14,8 @@ dependencies = [ "typer>=0.12", "rich>=13.7", "pathspec>=0.12", + "anthropic>=0.40", + "openai>=1.50", ] [project.scripts] diff --git a/tests/conftest.py b/tests/conftest.py index d406e18..f327544 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -15,6 +15,8 @@ class so the same provider can serve introspection, extraction, and import pytest from pydantic import BaseModel +from wikifi.providers.base import LLMProvider + T = TypeVar("T", bound=BaseModel) # --------------------------------------------------------------------------- @@ -22,7 +24,7 @@ class so the same provider can serve introspection, extraction, and # --------------------------------------------------------------------------- -class MockProvider: +class MockProvider(LLMProvider): """Test double for ``LLMProvider`` driven by per-schema response queues.""" name = "mock" diff --git a/tests/test_aggregator.py b/tests/test_aggregator.py index 0bd3e5e..a62fed2 100644 --- a/tests/test_aggregator.py +++ b/tests/test_aggregator.py @@ -1,6 +1,12 @@ -from wikifi.aggregator import SectionBody, aggregate_all +from wikifi.aggregator import ( + AggregatedClaim, + AggregatedContradiction, + SectionBody, + aggregate_all, +) +from wikifi.cache import WalkCache, hash_section_notes from wikifi.sections import PRIMARY_SECTIONS -from wikifi.wiki import WikiLayout, append_note, initialize +from wikifi.wiki import WikiLayout, append_note, initialize, read_notes def _setup(tmp_path): @@ -68,3 +74,88 @@ def 
raiser(schema, system, user): body = layout.section_path(section).read_text() assert "Aggregation failed" in body assert "Order line item." in body # raw notes preserved + + +def test_aggregate_renders_citations_and_contradictions(tmp_path, mock_provider_factory): + layout = _setup(tmp_path) + section = PRIMARY_SECTIONS[0] + append_note( + layout, + section, + { + "file": "a.py", + "summary": "domain", + "finding": "Tax computed at order time.", + "sources": [{"file": "src/order.py", "lines": [10, 25], "fingerprint": "abc"}], + }, + ) + append_note( + layout, + section, + { + "file": "b.py", + "summary": "domain", + "finding": "Tax computed at invoice time.", + "sources": [{"file": "src/invoice.py", "lines": [5, 12], "fingerprint": "def"}], + }, + ) + + structured = SectionBody( + body="The system computes tax somewhere.", + claims=[AggregatedClaim(text="Tax computation lives at the boundary.", source_indices=[1, 2])], + contradictions=[ + AggregatedContradiction( + summary="Where tax is computed.", + positions=[ + AggregatedClaim(text="At order time.", source_indices=[1]), + AggregatedClaim(text="At invoice time.", source_indices=[2]), + ], + ) + ], + ) + + provider = mock_provider_factory( + json_factory=lambda schema, system, user: structured, + ) + aggregate_all(layout=layout, provider=provider) + body = layout.section_path(section).read_text() + assert "Conflicts in source" in body + assert "src/order.py:10-25" in body + assert "src/invoice.py:5-12" in body + assert "## Sources" in body + + +def test_aggregate_uses_cache_to_skip_unchanged_notes(tmp_path, mock_provider_factory): + layout = _setup(tmp_path) + section = PRIMARY_SECTIONS[0] + append_note(layout, section, {"file": "a.py", "summary": "x", "finding": "Order entity."}) + + cache = WalkCache() + notes_hash = hash_section_notes(read_notes(layout, section)) + cache.record_aggregation( + section.id, + notes_hash=notes_hash, + body="Cached body for the section.", + claims=[], + contradictions=[], + ) + + 
provider = mock_provider_factory() # no responses queued — must not be called + stats = aggregate_all(layout=layout, provider=provider, cache=cache) + + body = layout.section_path(section).read_text() + assert "Cached body for the section." in body + assert stats.sections_cached == 1 + + +def test_aggregate_records_cache_entry_after_synthesis(tmp_path, mock_provider_factory): + layout = _setup(tmp_path) + section = PRIMARY_SECTIONS[0] + append_note(layout, section, {"file": "a.py", "summary": "x", "finding": "Order."}) + + cache = WalkCache() + provider = mock_provider_factory( + json_factory=lambda schema, system, user: SectionBody(body="Synthesized body."), + ) + aggregate_all(layout=layout, provider=provider, cache=cache) + assert section.id in cache.aggregation diff --git a/tests/test_anthropic_provider.py b/tests/test_anthropic_provider.py new file mode 100644 index 0000000..9d80a97 --- /dev/null +++ b/tests/test_anthropic_provider.py @@ -0,0 +1,197 @@ +"""AnthropicProvider tests. + +The HTTP transport is mocked via the ``client=`` injection point so the +test never touches the network. The point is to lock in the wikifi +contract: prompt caching on the system prompt, structured output via +``messages.parse``, and APIError → RuntimeError mapping. 
+""" + +from __future__ import annotations + +from types import SimpleNamespace + +import anthropic +import pytest +from pydantic import BaseModel + +from wikifi.providers.anthropic_provider import AnthropicProvider + + +class _Echo(BaseModel): + value: str + + +class _StubClient: + """Minimal stand-in for ``anthropic.Anthropic`` exposing ``messages``.""" + + def __init__( + self, + *, + parse_response=None, + create_response=None, + raise_on_parse: Exception | None = None, + raise_on_create: Exception | None = None, + ) -> None: + self.parse_calls: list[dict] = [] + self.create_calls: list[dict] = [] + self._parse_response = parse_response + self._create_response = create_response + self._raise_on_parse = raise_on_parse + self._raise_on_create = raise_on_create + self.messages = SimpleNamespace(parse=self._parse, create=self._create) + + def _parse(self, **kwargs): + self.parse_calls.append(kwargs) + if self._raise_on_parse is not None: + raise self._raise_on_parse + return self._parse_response + + def _create(self, **kwargs): + self.create_calls.append(kwargs) + if self._raise_on_create is not None: + raise self._raise_on_create + return self._create_response + + +def _api_error(message: str = "boom", request_id: str = "req_abc") -> anthropic.APIError: + """Build an APIError without going through the real httpx wiring.""" + err = anthropic.APIError.__new__(anthropic.APIError) + err.message = message + err.request_id = request_id + err.args = (message,) + return err + + +def test_complete_json_passes_cache_control_and_returns_pydantic(): + parsed = _Echo(value="hello") + response = SimpleNamespace(parsed_output=parsed, content=[]) + client = _StubClient(parse_response=response) + + provider = AnthropicProvider(model="claude-opus-4-7", client=client, think="high") + result = provider.complete_json(system="SYS", user="USR", schema=_Echo) + + assert result == parsed + call = client.parse_calls[0] + assert call["model"] == "claude-opus-4-7" + assert 
call["output_format"] is _Echo + assert call["messages"] == [{"role": "user", "content": "USR"}] + # System prompt must be a list with a cache_control marker. + system = call["system"] + assert isinstance(system, list) + assert system[0]["cache_control"] == {"type": "ephemeral"} + assert system[0]["text"] == "SYS" + # think="high" → adaptive thinking + effort. + assert call["thinking"] == {"type": "adaptive"} + assert call["output_config"] == {"effort": "high"} + + +def test_complete_json_falls_back_to_validate_json_when_parsed_output_missing(): + response = SimpleNamespace( + parsed_output=None, + content=[SimpleNamespace(type="text", text='{"value": "fallback"}')], + ) + client = _StubClient(parse_response=response) + provider = AnthropicProvider(client=client) + out = provider.complete_json(system="s", user="u", schema=_Echo) + assert out == _Echo(value="fallback") + + +def test_complete_json_raises_runtime_error_on_api_error(): + client = _StubClient(raise_on_parse=_api_error("rate-limited", "req_xyz")) + provider = AnthropicProvider(client=client) + with pytest.raises(RuntimeError) as info: + provider.complete_json(system="s", user="u", schema=_Echo) + assert "req_xyz" in str(info.value) + assert "rate-limited" in str(info.value) + + +def test_complete_text_extracts_first_text_block(): + response = SimpleNamespace(content=[SimpleNamespace(type="text", text="hi")]) + client = _StubClient(create_response=response) + provider = AnthropicProvider(client=client) + assert provider.complete_text(system="s", user="u") == "hi" + + +def test_complete_text_returns_empty_when_no_text_block(): + response = SimpleNamespace(content=[]) + client = _StubClient(create_response=response) + provider = AnthropicProvider(client=client) + assert provider.complete_text(system="s", user="u") == "" + + +def test_chat_forwards_messages_and_caches_system(): + response = SimpleNamespace(content=[SimpleNamespace(type="text", text="hello back")]) + client = 
_StubClient(create_response=response) + provider = AnthropicProvider(client=client, think=False) + out = provider.chat( + system="SYS", + messages=[{"role": "user", "content": "first"}], + ) + assert out == "hello back" + call = client.create_calls[0] + assert call["messages"] == [{"role": "user", "content": "first"}] + assert call["system"][0]["cache_control"] == {"type": "ephemeral"} + # think=False → thinking disabled, no effort. + assert call["thinking"] == {"type": "disabled"} + assert "output_config" not in call + + +def test_thinking_kwargs_translation_table(): + """Lock the think-knob → request mapping so the contract is testable.""" + client = _StubClient(create_response=SimpleNamespace(content=[])) + cases = [ + ("low", {"thinking": {"type": "adaptive"}, "output_config": {"effort": "low"}}), + ("medium", {"thinking": {"type": "adaptive"}, "output_config": {"effort": "medium"}}), + ("high", {"thinking": {"type": "adaptive"}, "output_config": {"effort": "high"}}), + ("max", {"thinking": {"type": "adaptive"}, "output_config": {"effort": "max"}}), + (True, {"thinking": {"type": "adaptive"}}), + (False, {"thinking": {"type": "disabled"}}), + ("off", {"thinking": {"type": "disabled"}}), + ] + for think, expected in cases: + provider = AnthropicProvider(client=client, think=think) + # Reset the recorded calls between cases. 
+ client.create_calls.clear() + provider.complete_text(system="s", user="u") + call = client.create_calls[-1] + for key, value in expected.items(): + assert call.get(key) == value, f"think={think!r}: expected {key}={value}" + if "output_config" not in expected: + assert "output_config" not in call + + +def test_cache_system_prompt_off_returns_plain_string(): + response = SimpleNamespace(content=[]) + client = _StubClient(create_response=response) + provider = AnthropicProvider(client=client, cache_system_prompt=False) + provider.complete_text(system="SYS", user="u") + assert client.create_calls[0]["system"] == "SYS" + + +def test_complete_json_raises_diagnostic_on_fully_empty_response(): + """Empty parsed_output AND empty text → emit a diagnostic with knobs. + + Locks in the user-reported failure mode where adaptive thinking + consumes the entire ``max_tokens`` budget and the structured + output block never lands. The replacement RuntimeError must + surface ``stop_reason``, ``output_tokens``, and ``max_tokens`` so + operators see which knob to turn (raise max_tokens, lower think + effort) instead of the original cryptic "Invalid JSON: EOF" + pydantic validation error. + """ + response = SimpleNamespace( + parsed_output=None, + content=[], + stop_reason="max_tokens", + usage=SimpleNamespace(output_tokens=16_000), + ) + client = _StubClient(parse_response=response) + provider = AnthropicProvider(client=client, max_tokens=16_000) + with pytest.raises(RuntimeError) as info: + provider.complete_json(system="s", user="u", schema=_Echo) + msg = str(info.value) + # Operator-facing diagnostic — names the knobs, not the SDK internals. 
+ assert "max_tokens=16000" in msg + assert "output_tokens=16000" in msg + assert "stop_reason='max_tokens'" in msg + assert "raise max_tokens" in msg.lower() or "lower think" in msg.lower() diff --git a/tests/test_cache.py b/tests/test_cache.py new file mode 100644 index 0000000..9081de5 --- /dev/null +++ b/tests/test_cache.py @@ -0,0 +1,188 @@ +"""Cache layer tests.""" + +from __future__ import annotations + +from pathlib import Path + +from wikifi.cache import ( + CACHE_VERSION, + WalkCache, + aggregation_cache_path, + extraction_cache_path, + hash_section_notes, + load, + reset, + save, +) +from wikifi.wiki import WikiLayout, initialize + + +def _layout(tmp_path: Path) -> WikiLayout: + layout = WikiLayout(root=tmp_path) + initialize(layout, model="m", provider="ollama", ollama_host="http://h") + return layout + + +def test_extraction_cache_hit_and_miss(tmp_path: Path): + layout = _layout(tmp_path) + cache = load(layout) + assert cache.lookup_extraction("a.py", "abc") is None + assert cache.extraction_misses == 1 + + cache.record_extraction( + "a.py", + fingerprint="abc", + findings=[{"section_id": "entities", "finding": "x", "sources": []}], + summary="role", + chunks_processed=1, + ) + hit = cache.lookup_extraction("a.py", "abc") + assert hit is not None + assert hit.fingerprint == "abc" + assert cache.extraction_hits == 1 + + +def test_extraction_cache_invalidated_on_fingerprint_change(tmp_path: Path): + layout = _layout(tmp_path) + cache = load(layout) + cache.record_extraction("a.py", fingerprint="old", findings=[], summary="", chunks_processed=0) + assert cache.lookup_extraction("a.py", "new") is None + + +def test_aggregation_cache_round_trip(tmp_path: Path): + layout = _layout(tmp_path) + cache = load(layout) + cache.record_aggregation( + "entities", + notes_hash="h1", + body="body", + claims=[{"text": "c", "sources": []}], + contradictions=[], + ) + hit = cache.lookup_aggregation("entities", "h1") + assert hit is not None + assert hit.body == "body" + 
assert cache.lookup_aggregation("entities", "h2") is None + + +def test_save_and_load_round_trip(tmp_path: Path): + layout = _layout(tmp_path) + cache = WalkCache() + cache.record_extraction( + "src/a.py", + fingerprint="abc123", + findings=[{"section_id": "entities", "finding": "x", "sources": []}], + summary="role", + chunks_processed=2, + ) + cache.record_aggregation("entities", notes_hash="hh", body="body", claims=[], contradictions=[]) + save(layout, cache) + assert extraction_cache_path(layout).exists() + assert aggregation_cache_path(layout).exists() + + loaded = load(layout) + assert loaded.lookup_extraction("src/a.py", "abc123") is not None + assert loaded.lookup_aggregation("entities", "hh") is not None + + +def test_reset_clears_disk_files(tmp_path: Path): + layout = _layout(tmp_path) + cache = WalkCache() + cache.record_extraction("a.py", fingerprint="x", findings=[], summary="", chunks_processed=0) + save(layout, cache) + reset(layout) + assert not extraction_cache_path(layout).exists() + assert not aggregation_cache_path(layout).exists() + + +def test_load_returns_empty_when_file_missing(tmp_path: Path): + layout = _layout(tmp_path) + cache = load(layout) + assert cache.extraction == {} + assert cache.aggregation == {} + + +def test_load_drops_bad_version(tmp_path: Path): + layout = _layout(tmp_path) + extraction_cache_path(layout).parent.mkdir(parents=True, exist_ok=True) + extraction_cache_path(layout).write_text('{"version": 999, "entries": {"a.py": {"fingerprint": "abc"}}}') + cache = load(layout) + assert cache.extraction == {} + + +def test_prune_extraction_drops_out_of_scope_files(tmp_path: Path): + _layout(tmp_path) # ensures cache dir exists; entries are exercised below + cache = WalkCache() + for path in ("keep.py", "drop.py"): + cache.record_extraction(path, fingerprint="x", findings=[], summary="", chunks_processed=0) + removed = cache.prune_extraction(keep={"keep.py"}) + assert removed == 1 + assert "keep.py" in cache.extraction + assert 
"drop.py" not in cache.extraction + + +def test_hash_section_notes_is_stable(): + notes = [ + {"file": "a.py", "summary": "x", "finding": "y", "timestamp": "t1"}, + {"file": "b.py", "summary": "x", "finding": "z", "timestamp": "t2"}, + ] + same = [ + {"file": "a.py", "summary": "x", "finding": "y", "timestamp": "t99"}, + {"file": "b.py", "summary": "x", "finding": "z", "timestamp": "t100"}, + ] + assert hash_section_notes(notes) == hash_section_notes(same) + different = [{"file": "a.py", "summary": "x", "finding": "DIFFERENT"}] + assert hash_section_notes(notes) != hash_section_notes(different) + + +def test_cache_version_is_pinned(): + """Bumps to CACHE_VERSION should be intentional — guard against drift.""" + assert isinstance(CACHE_VERSION, int) + assert CACHE_VERSION >= 1 + + +def test_hash_section_notes_changes_when_sources_change(): + """The aggregation cache key must reflect each note's `sources`. + + Two notes with identical finding text but different source line + ranges or fingerprints describe different evidence; reusing the + same cached body would replay stale citations against new code. + """ + base = [ + { + "file": "a.py", + "summary": "role", + "finding": "Order entity.", + "sources": [{"file": "a.py", "lines": [1, 30], "fingerprint": "abc1234"}], + } + ] + same = [ + { + "file": "a.py", + "summary": "role", + "finding": "Order entity.", + "sources": [{"file": "a.py", "lines": (1, 30), "fingerprint": "abc1234"}], + } + ] + moved_lines = [ + { + "file": "a.py", + "summary": "role", + "finding": "Order entity.", + "sources": [{"file": "a.py", "lines": [42, 70], "fingerprint": "abc1234"}], + } + ] + new_fingerprint = [ + { + "file": "a.py", + "summary": "role", + "finding": "Order entity.", + "sources": [{"file": "a.py", "lines": [1, 30], "fingerprint": "deadbee"}], + } + ] + # Tuple vs list line range: same logical evidence, identical hash. + assert hash_section_notes(base) == hash_section_notes(same) + # Lines moved → new evidence → cache must miss. 
+ assert hash_section_notes(base) != hash_section_notes(moved_lines) + # File contents changed (fingerprint shifted) → cache must miss. + assert hash_section_notes(base) != hash_section_notes(new_fingerprint) diff --git a/tests/test_cli.py b/tests/test_cli.py index 8d02b5b..36e3829 100644 --- a/tests/test_cli.py +++ b/tests/test_cli.py @@ -2,7 +2,7 @@ from wikifi import __version__ from wikifi.cli import app -from wikifi.wiki import WikiLayout, initialize +from wikifi.wiki import WikiLayout, initialize, write_section def test_version_flag(): @@ -49,6 +49,59 @@ def test_chat_command_errors_when_wiki_missing(tmp_path): assert "No .wikifi/" in result.output +def test_report_command_errors_when_wiki_missing(tmp_path): + runner = CliRunner() + result = runner.invoke(app, ["report", str(tmp_path)]) + assert result.exit_code == 1 + assert "No .wikifi/" in result.output + + +def test_report_command_renders_table(tmp_path): + layout = WikiLayout(root=tmp_path) + initialize(layout, model="m", provider="ollama", ollama_host="http://h") + write_section(layout, "intent", "Some intent body.") + + runner = CliRunner() + result = runner.invoke(app, ["report", str(tmp_path)]) + assert result.exit_code == 0, result.output + # Markdown rendered through rich; check for the header text. 
+ assert "wikifi coverage" in result.output.lower() or "section" in result.output.lower() + + +def test_walk_no_cache_flag_clears_cache_dir(tmp_path, monkeypatch): + """`walk --no-cache` triggers the cache-reset path before the run starts.""" + layout = WikiLayout(root=tmp_path) + initialize(layout, model="m", provider="ollama", ollama_host="http://h") + cache_path = layout.wiki_dir / ".cache" / "extraction.json" + cache_path.parent.mkdir(parents=True, exist_ok=True) + cache_path.write_text('{"version": 1, "entries": {}}') + + captured = {} + + def fake_run_walk(*, root, settings, provider=None): + captured["use_cache"] = settings.use_cache + from wikifi.aggregator import AggregationStats + from wikifi.deriver import DerivationStats + from wikifi.extractor import ExtractionStats + from wikifi.introspection import IntrospectionResult + from wikifi.orchestrator import WalkReport + + return WalkReport( + introspection=IntrospectionResult(), + extraction=ExtractionStats(), + aggregation=AggregationStats(), + derivation=DerivationStats(), + ) + + monkeypatch.setattr("wikifi.cli.run_walk", fake_run_walk) + runner = CliRunner() + result = runner.invoke(app, ["walk", str(tmp_path), "--no-cache"]) + assert result.exit_code == 0, result.output + assert captured["use_cache"] is False + # Cache file was deleted by the flag. + assert not cache_path.exists() + + def test_chat_command_runs_repl(tmp_path, monkeypatch): layout = WikiLayout(root=tmp_path) initialize(layout, model="m", provider="ollama", ollama_host="http://h") diff --git a/tests/test_config.py b/tests/test_config.py index 69852dc..6e79f13 100644 --- a/tests/test_config.py +++ b/tests/test_config.py @@ -20,3 +20,96 @@ def test_get_settings_is_cached(): a = get_settings() b = get_settings() assert a is b + + +def test_load_target_settings_reads_config_toml(tmp_path, monkeypatch): + """`/.wikifi/config.toml` overrides field defaults. 
+ + A target wiki initialized with `provider = "anthropic"` should + produce settings that say "anthropic" even when the calling shell + has no WIKIFI_* env vars set. + """ + from wikifi.config import load_target_settings, reset_settings_cache + + # ``Settings`` reads ``.env`` from CWD; chdir to tmp_path so the + # project-root .env (which sets WIKIFI_PROVIDER=anthropic) doesn't + # leak into the test. + monkeypatch.chdir(tmp_path) + monkeypatch.delenv("WIKIFI_PROVIDER", raising=False) + monkeypatch.delenv("WIKIFI_MODEL", raising=False) + monkeypatch.delenv("WIKIFI_OLLAMA_HOST", raising=False) + reset_settings_cache() + + wiki_dir = tmp_path / ".wikifi" + wiki_dir.mkdir() + (wiki_dir / "config.toml").write_text( + 'provider = "anthropic"\nmodel = "claude-opus-4-7"\nollama_host = "http://unused:11434"\n' + ) + + settings = load_target_settings(tmp_path) + assert settings.provider == "anthropic" + assert settings.model == "claude-opus-4-7" + reset_settings_cache() + + +def test_load_target_settings_toml_wins_over_env(tmp_path, monkeypatch): + """The target wiki's `config.toml` wins over per-session env vars. + + Matches the contract printed at the top of every scaffolded + `config.toml`: "overrides WIKIFI_* environment variables when + present". A wiki initialized for a hosted backend should keep + using that backend even if the user happens to have + `WIKIFI_PROVIDER=ollama` exported in their shell. 
+ """ + from wikifi.config import load_target_settings, reset_settings_cache + + monkeypatch.setenv("WIKIFI_PROVIDER", "ollama") + monkeypatch.setenv("WIKIFI_MODEL", "qwen3.6:27b") + reset_settings_cache() + + wiki_dir = tmp_path / ".wikifi" + wiki_dir.mkdir() + (wiki_dir / "config.toml").write_text( + 'provider = "anthropic"\nmodel = "claude-opus-4-7"\n', + ) + + settings = load_target_settings(tmp_path) + assert settings.provider == "anthropic" + assert settings.model == "claude-opus-4-7" + reset_settings_cache() + + +def test_load_target_settings_handles_missing_config(tmp_path, monkeypatch): + """No `.wikifi/config.toml` → fall back cleanly to env defaults.""" + from wikifi.config import load_target_settings, reset_settings_cache + + monkeypatch.chdir(tmp_path) + monkeypatch.delenv("WIKIFI_PROVIDER", raising=False) + monkeypatch.delenv("WIKIFI_MODEL", raising=False) + reset_settings_cache() + + settings = load_target_settings(tmp_path) + assert settings.provider == "ollama" + reset_settings_cache() + + +def test_load_target_settings_ignores_malformed_toml(tmp_path, monkeypatch, caplog): + """A corrupt config.toml warns and falls back instead of raising.""" + import logging + + from wikifi.config import load_target_settings, reset_settings_cache + + monkeypatch.chdir(tmp_path) + monkeypatch.delenv("WIKIFI_PROVIDER", raising=False) + reset_settings_cache() + + wiki_dir = tmp_path / ".wikifi" + wiki_dir.mkdir() + (wiki_dir / "config.toml").write_text("not = valid = toml = at all\n") + + with caplog.at_level(logging.WARNING, logger="wikifi.config"): + settings = load_target_settings(tmp_path) + + assert settings.provider == "ollama" + assert any("could not read" in record.message for record in caplog.records) + reset_settings_cache() diff --git a/tests/test_critic.py b/tests/test_critic.py new file mode 100644 index 0000000..35bbdb4 --- /dev/null +++ b/tests/test_critic.py @@ -0,0 +1,118 @@ +"""Critic + reviser tests.""" + +from __future__ import annotations + 
+from wikifi.aggregator import SectionBody # for unused import sanity +from wikifi.critic import ( + CoverageStats, + Critique, + RevisedBody, + review_section, +) +from wikifi.sections import SECTIONS_BY_ID + +_ = SectionBody # silence "imported but unused" + + +def test_review_skips_revision_when_score_meets_threshold(mock_provider_factory): + section = SECTIONS_BY_ID["entities"] + provider = mock_provider_factory( + json_responses={ + Critique: [Critique(score=8, summary="solid")], + } + ) + outcome = review_section( + section=section, + body="Bodies of evidence here.", + upstream_evidence=None, + provider=provider, + min_score=7, + ) + assert outcome.revised is False + assert outcome.body == "Bodies of evidence here." + + +def test_review_revises_when_score_below_threshold(mock_provider_factory): + section = SECTIONS_BY_ID["entities"] + queue_critique = [ + Critique(score=4, summary="weak", unsupported_claims=["X"], gaps=["Y"]), + Critique(score=8, summary="better"), + ] + + def factory(schema, system, user): + if schema is Critique: + return queue_critique.pop(0) + if schema is RevisedBody: + return RevisedBody(body="Revised body that addresses X and Y.") + raise AssertionError(f"unexpected schema {schema}") + + provider = mock_provider_factory(json_factory=factory) + + outcome = review_section( + section=section, + body="Original body.", + upstream_evidence={"intent": "upstream content"}, + provider=provider, + min_score=7, + ) + assert outcome.revised is True + assert "Revised body" in outcome.body + assert outcome.final is not None + assert outcome.final.score == 8 + + +def test_review_keeps_original_when_revision_regresses(mock_provider_factory): + section = SECTIONS_BY_ID["entities"] + critiques = [ + Critique(score=5, gaps=["Y"]), + Critique(score=3), # revision is worse + ] + + def factory(schema, system, user): + if schema is Critique: + return critiques.pop(0) + if schema is RevisedBody: + return RevisedBody(body="Worse body.") + raise AssertionError 
+ + provider = mock_provider_factory(json_factory=factory) + + outcome = review_section( + section=section, + body="Original body.", + upstream_evidence=None, + provider=provider, + min_score=7, + ) + assert outcome.revised is False + assert outcome.body == "Original body." + + +def test_review_handles_critic_failure(mock_provider_factory): + """If the critic call fails, score=0 → no revision attempt; the body stays.""" + section = SECTIONS_BY_ID["entities"] + + def factory(schema, system, user): + raise RuntimeError("model unavailable") + + provider = mock_provider_factory(json_factory=factory) + outcome = review_section( + section=section, + body="Body.", + upstream_evidence=None, + provider=provider, + min_score=7, + ) + assert outcome.body == "Body." + assert outcome.initial.score == 0 + + +def test_coverage_stats_pct(): + stats = CoverageStats( + files_total=100, + files_with_findings=42, + findings_per_section={}, + files_per_section={}, + ) + assert stats.coverage_pct() == 42.0 + assert CoverageStats(0, 0, {}, {}).coverage_pct() == 0.0 diff --git a/tests/test_evidence.py b/tests/test_evidence.py new file mode 100644 index 0000000..2680023 --- /dev/null +++ b/tests/test_evidence.py @@ -0,0 +1,123 @@ +"""Evidence model + rendering tests.""" + +from __future__ import annotations + +from wikifi.evidence import ( + Claim, + Contradiction, + EvidenceBundle, + SourceRef, + coalesce_refs, + render_section_body, +) + + +def test_source_ref_render(): + assert SourceRef(file="a.py").render() == "a.py" + assert SourceRef(file="a.py", lines=(10, 10)).render() == "a.py:10" + assert SourceRef(file="a.py", lines=(10, 25)).render() == "a.py:10-25" + + +def test_claim_supported_flag(): + assert not Claim(text="x").supported() + assert Claim(text="x", sources=[SourceRef(file="a.py")]).supported() + + +def test_render_section_body_includes_sources_footer(): + bundle = EvidenceBundle( + body="The system manages orders.", + claims=[ + Claim(text="Orders carry line items.", 
sources=[SourceRef(file="src/order.py", lines=(1, 30))]), + Claim(text="Orders are immutable once placed.", sources=[SourceRef(file="src/order.py", lines=(1, 30))]), + ], + ) + out = render_section_body(bundle) + assert "The system manages orders." in out + assert "## Sources" in out + assert "src/order.py:1-30" in out + # Same source ref is deduped — only one numbered entry. + assert out.count("src/order.py:1-30") == 1 + + +def test_render_section_body_renders_contradictions(): + bundle = EvidenceBundle( + body="Order pricing is calculated downstream.", + contradictions=[ + Contradiction( + summary="Whether tax is computed at order time or invoice time.", + positions=[ + Claim(text="Tax is computed at order time.", sources=[SourceRef(file="src/order.py")]), + Claim(text="Tax is computed at invoice time.", sources=[SourceRef(file="src/invoice.py")]), + ], + ) + ], + ) + out = render_section_body(bundle) + assert "Conflicts in source" in out + assert "Tax is computed at order time" in out + assert "Tax is computed at invoice time" in out + assert "src/order.py" in out + assert "src/invoice.py" in out + + +def test_render_section_body_omits_footer_when_no_sources(): + bundle = EvidenceBundle(body="Plain body, no claims.") + out = render_section_body(bundle) + assert "Plain body, no claims." in out + assert "## Sources" not in out + + +def test_coalesce_refs_dedupes_by_render(): + refs = [ + SourceRef(file="a.py", lines=(1, 10)), + SourceRef(file="a.py", lines=(1, 10)), + SourceRef(file="b.py"), + ] + out = coalesce_refs(refs) + assert len(out) == 2 + assert {r.render() for r in out} == {"a.py:1-10", "b.py"} + + +def test_render_section_body_inserts_claim_markers_inline(): + """Each supported claim's text in the body picks up its `[N]` marker. + + Without inline markers the reader has the source list at the bottom + of the section but no way to tell which sentence each source backs. + """ + bundle = EvidenceBundle( + body="Orders carry line items. 
Tax is computed downstream.", + claims=[ + Claim(text="Orders carry line items.", sources=[SourceRef(file="src/order.py", lines=(1, 30))]), + Claim(text="Tax is computed downstream.", sources=[SourceRef(file="src/billing.py", lines=(40, 60))]), + ], + ) + out = render_section_body(bundle) + # Markers are appended next to the matching sentences, in source order. + assert "Orders carry line items.[1]" in out + assert "Tax is computed downstream.[2]" in out + # Sources footer still enumerates the distinct refs. + assert "1. `src/order.py:1-30`" in out + assert "2. `src/billing.py:40-60`" in out + + +def test_render_section_body_paraphrased_claims_listed_as_supporting(): + """Claims whose text doesn't appear verbatim go in a Supporting list. + + A conservative inline match avoids attaching markers to the wrong + sentence when the aggregator paraphrased — the claim still gets a + citation, just out-of-line. + """ + bundle = EvidenceBundle( + body="The system tracks orders end-to-end.", + claims=[ + Claim( + text="Order state transitions are persisted on every change.", + sources=[SourceRef(file="src/order.py", lines=(80, 95))], + ), + ], + ) + out = render_section_body(bundle) + assert "## Supporting claims" in out + assert "Order state transitions are persisted on every change." in out + assert "[1]" in out # marker still attached to the supporting-claim entry + assert "1. 
`src/order.py:80-95`" in out diff --git a/tests/test_extractor.py b/tests/test_extractor.py index d007cbf..029425b 100644 --- a/tests/test_extractor.py +++ b/tests/test_extractor.py @@ -2,12 +2,14 @@ import pytest +from wikifi.cache import WalkCache from wikifi.extractor import ( FileFindings, SectionFinding, _chunk_text, extract_repo, ) +from wikifi.repograph import build_graph from wikifi.wiki import WikiLayout, initialize, read_notes @@ -327,6 +329,150 @@ def test_section_ids_documented_in_system_prompt(): assert sid not in EXTRACTION_SYSTEM_PROMPT.split("Only emit findings for these section ids:")[1].split("\n")[0] +def test_extract_repo_uses_cache_to_skip_unchanged_files(tmp_path, mock_provider_factory): + """A file whose fingerprint matches a cache entry skips the LLM call entirely.""" + layout = _layout(tmp_path) + (tmp_path / "a.py").write_text("class Order: pass\n# meaningful body content here for the walker\n") + + cache = WalkCache() + seen: list[str] = [] + + def factory(schema, system, user): + seen.append(user) + return FileFindings( + summary="domain class", + findings=[SectionFinding(section_id="entities", finding="Order entity.")], + ) + + provider = mock_provider_factory(json_factory=factory) + extract_repo( + layout=layout, + provider=provider, + files=[Path("a.py")], + repo_root=tmp_path, + cache=cache, + ) + assert len(seen) == 1 + assert "a.py" in cache.extraction + # Second walk against the same file: cache hit, no new LLM call. + seen.clear() + stats2 = extract_repo( + layout=layout, + provider=provider, + files=[Path("a.py")], + repo_root=tmp_path, + cache=cache, + ) + assert seen == [] + assert stats2.cache_hits == 1 + notes = read_notes(layout, "entities") + # Findings are replayed into the notes store on cache hit. 
+ assert any("Order" in n["finding"] for n in notes) + + +def test_extract_repo_invalidates_cache_when_file_changes(tmp_path, mock_provider_factory): + layout = _layout(tmp_path) + target = tmp_path / "a.py" + target.write_text("class Order: pass\n# body content for the walker minimum threshold\n") + + cache = WalkCache() + call_count = {"n": 0} + + def factory(schema, system, user): + call_count["n"] += 1 + return FileFindings(findings=[SectionFinding(section_id="entities", finding="Order.")]) + + provider = mock_provider_factory(json_factory=factory) + extract_repo(layout=layout, provider=provider, files=[Path("a.py")], repo_root=tmp_path, cache=cache) + assert call_count["n"] == 1 + + # Mutate the file → fingerprint changes → cache miss → new call. + target.write_text("class Customer: pass\n# different content for the walker minimum threshold\n") + extract_repo(layout=layout, provider=provider, files=[Path("a.py")], repo_root=tmp_path, cache=cache) + assert call_count["n"] == 2 + + +def test_extract_repo_routes_sql_through_specialized_extractor(tmp_path, mock_provider_factory): + """SQL files bypass the LLM and go through the deterministic SQL extractor.""" + layout = _layout(tmp_path) + (tmp_path / "schema.sql").write_text("CREATE TABLE customer (id INTEGER PRIMARY KEY, email VARCHAR(255) NOT NULL);") + + seen: list[str] = [] + + def factory(schema, system, user): + seen.append(user) + return FileFindings() + + provider = mock_provider_factory(json_factory=factory) + stats = extract_repo( + layout=layout, + provider=provider, + files=[Path("schema.sql")], + repo_root=tmp_path, + ) + # No LLM calls — specialized extractor handled the file directly. 
+ assert seen == [] + assert stats.specialized_files == 1 + notes = read_notes(layout, "entities") + assert any("customer" in n["finding"] for n in notes) + + +def test_extract_repo_emits_source_refs_in_notes(tmp_path, mock_provider_factory): + """Every note carries a structured ``sources`` list for downstream citations.""" + layout = _layout(tmp_path) + (tmp_path / "a.py").write_text("class Order:\n pass\n# more body content for walker minimum\n") + + findings = FileFindings( + summary="domain class", + findings=[ + SectionFinding(section_id="entities", finding="Order entity.", line_range=(1, 2)), + ], + ) + provider = mock_provider_factory(json_responses={FileFindings: [findings]}) + extract_repo( + layout=layout, + provider=provider, + files=[Path("a.py")], + repo_root=tmp_path, + ) + note = read_notes(layout, "entities")[0] + sources = note["sources"] + assert sources and sources[0]["file"] == "a.py" + assert sources[0]["lines"] == [1, 2] + assert sources[0]["fingerprint"] + + +def test_extract_repo_injects_neighbor_context_when_graph_supplied(tmp_path, mock_provider_factory): + layout = _layout(tmp_path) + (tmp_path / "pkg").mkdir() + (tmp_path / "pkg" / "__init__.py").write_text("# package marker for tests; long enough to pass min_content\n") + (tmp_path / "pkg" / "main.py").write_text("from pkg.helper import compute\n\ndef run():\n return compute()\n") + (tmp_path / "pkg" / "helper.py").write_text( + "def compute():\n return 42\n# extra padding to satisfy the minimum content threshold for the walker\n" + ) + + files = [Path("pkg/__init__.py"), Path("pkg/main.py"), Path("pkg/helper.py")] + graph = build_graph(repo_root=tmp_path, files=files) + + captured: list[str] = [] + + def factory(schema, system, user): + captured.append(user) + return FileFindings() + + provider = mock_provider_factory(json_factory=factory) + extract_repo( + layout=layout, + provider=provider, + files=files, + repo_root=tmp_path, + graph=graph, + ) + main_prompt = next(p for p in 
captured if "pkg/main.py" in p) + assert "Neighbor files" in main_prompt + assert "pkg/helper.py" in main_prompt + + def test_extract_repo_drops_derivative_section_findings(tmp_path, mock_provider_factory): """Even if the model emits a derivative section id, the extractor filters it out.""" from wikifi.sections import DERIVATIVE_SECTION_IDS @@ -355,3 +501,35 @@ def test_extract_repo_drops_derivative_section_findings(tmp_path, mock_provider_ assert len(read_notes(layout, "entities")) == 1 assert read_notes(layout, derivative_id) == [] + + +def test_extract_repo_use_specialized_extractors_false_falls_back_to_llm(tmp_path, mock_provider_factory): + """`use_specialized_extractors=False` keeps schema files on the LLM path. + + Lock in the `use_specialized_extractors` setting wired through from + config — without this the knob would be silently ignored and SQL/ + GraphQL/Protobuf/OpenAPI files would always bypass the LLM regardless + of the user's explicit opt-out. + """ + layout = _layout(tmp_path) + (tmp_path / "schema.sql").write_text("CREATE TABLE customer (id INTEGER PRIMARY KEY);") + + seen: list[str] = [] + + def factory(schema, system, user): + seen.append(user) + return FileFindings(findings=[SectionFinding(section_id="entities", finding="Routed to LLM.")]) + + provider = mock_provider_factory(json_factory=factory) + stats = extract_repo( + layout=layout, + provider=provider, + files=[Path("schema.sql")], + repo_root=tmp_path, + use_specialized_extractors=False, + ) + + assert seen, "LLM should have been called when specialized extractors are disabled" + assert stats.specialized_files == 0 + notes = read_notes(layout, "entities") + assert any("Routed to LLM." 
in n["finding"] for n in notes) diff --git a/tests/test_fingerprint.py b/tests/test_fingerprint.py new file mode 100644 index 0000000..c462afd --- /dev/null +++ b/tests/test_fingerprint.py @@ -0,0 +1,29 @@ +"""Fingerprint tests.""" + +from __future__ import annotations + +from pathlib import Path + +from wikifi.fingerprint import FINGERPRINT_LENGTH, hash_bytes, hash_file, hash_text + + +def test_hash_text_is_stable_and_short(): + a = hash_text("hello world") + b = hash_text("hello world") + assert a == b + assert len(a) == FINGERPRINT_LENGTH + assert all(c in "0123456789abcdef" for c in a) + + +def test_hash_text_diverges_on_change(): + assert hash_text("hello") != hash_text("hello!") + + +def test_hash_bytes_handles_arbitrary_bytes(): + assert hash_bytes(b"\x00\x01\x02") != hash_bytes(b"\x00\x01\x03") + + +def test_hash_file_reads_bytes_from_disk(tmp_path: Path): + target = tmp_path / "file.txt" + target.write_bytes(b"contents") + assert hash_file(target) == hash_bytes(b"contents") diff --git a/tests/test_openai_provider.py b/tests/test_openai_provider.py new file mode 100644 index 0000000..023d4ee --- /dev/null +++ b/tests/test_openai_provider.py @@ -0,0 +1,232 @@ +"""OpenAIProvider tests. + +The HTTP transport is mocked via the ``client=`` injection point so the +test never touches the network. The point is to lock in the wikifi +contract: structured output via ``chat.completions.parse``, the +reasoning-effort routing for reasoning vs. plain models, the +``max_tokens`` vs ``max_completion_tokens`` swap, and APIError → +RuntimeError mapping. +""" + +from __future__ import annotations + +from types import SimpleNamespace + +import openai +import pytest +from pydantic import BaseModel + +from wikifi.providers.openai_provider import OpenAIProvider + + +class _Echo(BaseModel): + value: str + + +class _StubClient: + """Minimal stand-in for ``openai.OpenAI``. 
+ + Exposes ``chat.completions.parse`` and ``chat.completions.create`` + via the same ``SimpleNamespace`` shape the real SDK uses. + """ + + def __init__( + self, + *, + parse_response=None, + create_response=None, + raise_on_parse: Exception | None = None, + raise_on_create: Exception | None = None, + ) -> None: + self.parse_calls: list[dict] = [] + self.create_calls: list[dict] = [] + self._parse_response = parse_response + self._create_response = create_response + self._raise_on_parse = raise_on_parse + self._raise_on_create = raise_on_create + self.chat = SimpleNamespace( + completions=SimpleNamespace(parse=self._parse, create=self._create), + ) + + def _parse(self, **kwargs): + self.parse_calls.append(kwargs) + if self._raise_on_parse is not None: + raise self._raise_on_parse + return self._parse_response + + def _create(self, **kwargs): + self.create_calls.append(kwargs) + if self._raise_on_create is not None: + raise self._raise_on_create + return self._create_response + + +def _api_error(message: str = "boom", request_id: str = "req_abc") -> openai.APIError: + """Construct an APIError without going through the real httpx wiring.""" + err = openai.APIError.__new__(openai.APIError) + err.message = message + err.request_id = request_id + err.args = (message,) + return err + + +def _parse_response(parsed): + return SimpleNamespace( + choices=[SimpleNamespace(message=SimpleNamespace(parsed=parsed, content=""))], + ) + + +def _text_response(text: str | None): + return SimpleNamespace( + choices=[SimpleNamespace(message=SimpleNamespace(content=text, parsed=None))], + ) + + +# --------------------------------------------------------------------------- +# complete_json +# --------------------------------------------------------------------------- + + +def test_complete_json_returns_parsed_pydantic_instance(): + parsed = _Echo(value="hello") + client = _StubClient(parse_response=_parse_response(parsed)) + provider = OpenAIProvider(model="gpt-4o", client=client, 
think="high") + + result = provider.complete_json(system="SYS", user="USR", schema=_Echo) + + assert result == parsed + call = client.parse_calls[0] + assert call["model"] == "gpt-4o" + assert call["response_format"] is _Echo + assert call["messages"] == [ + {"role": "system", "content": "SYS"}, + {"role": "user", "content": "USR"}, + ] + # gpt-4o is non-reasoning → max_tokens, not max_completion_tokens. + assert "max_tokens" in call + assert "max_completion_tokens" not in call + # think="high" must NOT leak through on a non-reasoning model. + assert "reasoning_effort" not in call + + +def test_complete_json_falls_back_to_validate_json_when_parsed_missing(): + response = SimpleNamespace( + choices=[SimpleNamespace(message=SimpleNamespace(parsed=None, content='{"value": "fallback"}'))], + ) + client = _StubClient(parse_response=response) + provider = OpenAIProvider(client=client) + out = provider.complete_json(system="s", user="u", schema=_Echo) + assert out == _Echo(value="fallback") + + +def test_complete_json_raises_runtime_error_on_api_error(): + client = _StubClient(raise_on_parse=_api_error("rate-limited", "req_xyz")) + provider = OpenAIProvider(client=client) + with pytest.raises(RuntimeError) as info: + provider.complete_json(system="s", user="u", schema=_Echo) + assert "req_xyz" in str(info.value) + assert "rate-limited" in str(info.value) + + +# --------------------------------------------------------------------------- +# complete_text + chat +# --------------------------------------------------------------------------- + + +def test_complete_text_extracts_first_message_content(): + client = _StubClient(create_response=_text_response("hi")) + provider = OpenAIProvider(client=client) + assert provider.complete_text(system="s", user="u") == "hi" + + +def test_complete_text_returns_empty_when_content_none(): + client = _StubClient(create_response=_text_response(None)) + provider = OpenAIProvider(client=client) + assert provider.complete_text(system="s", 
user="u") == "" + + +def test_chat_prepends_system_and_returns_content(): + client = _StubClient(create_response=_text_response("reply")) + provider = OpenAIProvider(client=client) + out = provider.chat( + system="SYS", + messages=[ + {"role": "user", "content": "first"}, + {"role": "assistant", "content": "first reply"}, + {"role": "user", "content": "second"}, + ], + ) + assert out == "reply" + call = client.create_calls[0] + assert call["messages"][0] == {"role": "system", "content": "SYS"} + assert call["messages"][-1] == {"role": "user", "content": "second"} + assert len(call["messages"]) == 4 + + +# --------------------------------------------------------------------------- +# Reasoning model routing +# --------------------------------------------------------------------------- + + +def test_reasoning_model_forwards_reasoning_effort_and_uses_completion_tokens(): + """o-series + gpt-5 models should receive ``reasoning_effort`` and + ``max_completion_tokens`` instead of ``max_tokens``.""" + client = _StubClient(create_response=_text_response("x")) + provider = OpenAIProvider(model="o3-mini", client=client, think="medium") + provider.complete_text(system="s", user="u") + call = client.create_calls[0] + assert call["reasoning_effort"] == "medium" + assert "max_completion_tokens" in call + assert "max_tokens" not in call + + +def test_reasoning_model_strips_effort_when_think_is_off(): + client = _StubClient(create_response=_text_response("x")) + provider = OpenAIProvider(model="gpt-5", client=client, think=False) + provider.complete_text(system="s", user="u") + call = client.create_calls[0] + assert "reasoning_effort" not in call + # Reasoning model still uses max_completion_tokens regardless of think. 
+ assert "max_completion_tokens" in call + + +def test_plain_model_does_not_forward_reasoning_effort(): + client = _StubClient(create_response=_text_response("x")) + provider = OpenAIProvider(model="gpt-4o", client=client, think="high") + provider.complete_text(system="s", user="u") + call = client.create_calls[0] + assert "reasoning_effort" not in call + + +# --------------------------------------------------------------------------- +# Token-knob translation table +# --------------------------------------------------------------------------- + + +def test_reasoning_kwargs_translation_table(): + """Lock the (model, think) → request mapping so the contract is testable.""" + client = _StubClient(create_response=_text_response("x")) + cases = [ + # Reasoning-capable model: each level forwards through + ("o3-mini", "low", {"reasoning_effort": "low"}), + ("o3-mini", "medium", {"reasoning_effort": "medium"}), + ("o3-mini", "high", {"reasoning_effort": "high"}), + ("o3-mini", True, {}), # SDK default + ("o3-mini", False, {}), # disabled + ("o3-mini", "off", {}), + # Plain model: never forwards + ("gpt-4o", "high", {}), + ("gpt-4o", "low", {}), + ("gpt-4o", False, {}), + ] + for model, think, expected_extras in cases: + provider = OpenAIProvider(model=model, client=client, think=think) + client.create_calls.clear() + provider.complete_text(system="s", user="u") + call = client.create_calls[-1] + if "reasoning_effort" in expected_extras: + assert call.get("reasoning_effort") == expected_extras["reasoning_effort"], ( + f"model={model} think={think!r}: want {expected_extras}" + ) + else: + assert "reasoning_effort" not in call, f"model={model} think={think!r}: must not forward effort" diff --git a/tests/test_orchestrator.py b/tests/test_orchestrator.py index f6c3baf..212cf8c 100644 --- a/tests/test_orchestrator.py +++ b/tests/test_orchestrator.py @@ -89,3 +89,133 @@ def test_build_provider_returns_ollama_for_ollama_settings(): def test_build_provider_rejects_unknown(): with 
pytest.raises(ValueError): build_provider(_settings(provider="other")) + + +def test_build_provider_returns_anthropic_when_selected(monkeypatch): + """``provider='anthropic'`` dispatches to AnthropicProvider with a Claude model default.""" + monkeypatch.setenv("ANTHROPIC_API_KEY", "test-key") + settings = _settings(provider="anthropic", model="m") # non-claude model id + provider = build_provider(settings) + from wikifi.providers.anthropic_provider import AnthropicProvider + + assert isinstance(provider, AnthropicProvider) + # Falls back to a sane Claude default rather than 404'ing on "m". + assert provider.model.startswith("claude-") + + +def test_build_provider_returns_openai_when_selected(monkeypatch): + """``provider='openai'`` dispatches to OpenAIProvider with a GPT default. + + The default-swap fires when the configured model id is obviously + an Ollama identifier (``family:tag``) — the common "user opted + into openai but forgot to update WIKIFI_MODEL" case. + """ + monkeypatch.setenv("OPENAI_API_KEY", "test-key") + settings = _settings(provider="openai", model="qwen3.6:27b") + provider = build_provider(settings) + from wikifi.providers.openai_provider import OpenAIProvider + + assert isinstance(provider, OpenAIProvider) + # Falls back to gpt-4o rather than 404'ing on the Ollama default. + assert provider.model.startswith("gpt-") + + +def test_build_provider_preserves_explicit_openai_model(monkeypatch): + """A user-supplied gpt/o-series model id is passed through unchanged.""" + monkeypatch.setenv("OPENAI_API_KEY", "test-key") + for model in ("gpt-4o", "o3-mini", "gpt-5"): + settings = _settings(provider="openai", model=model) + provider = build_provider(settings) + assert provider.model == model + + +def test_build_provider_preserves_azure_openai_deployment_id(monkeypatch): + """Arbitrary Azure / proxy deployment IDs survive the swap. 
+ + Azure-OpenAI (and OpenAI-compatible proxies) commonly use + deployment names that don't match the upstream OpenAI prefixes — + e.g. ``prod-gpt4o``, ``eastus-chat``, ``my-team-deployment``. + Replacing them with ``gpt-4o`` would silently route the user to + the wrong model on a perfectly valid configuration. + """ + monkeypatch.setenv("OPENAI_API_KEY", "test-key") + for deployment in ("prod-gpt4o", "eastus-chat", "my-team-deployment", "fine-tuned-v3"): + settings = _settings( + provider="openai", + model=deployment, + openai_base_url="https://my-azure-endpoint.openai.azure.com/", + ) + provider = build_provider(settings) + assert provider.model == deployment, f"{deployment} should pass through unchanged" + + +def test_build_provider_preserves_fine_tuned_openai_model(monkeypatch): + """``ft:gpt-4o:org::id`` contains a colon but stays on the OpenAI path.""" + monkeypatch.setenv("OPENAI_API_KEY", "test-key") + settings = _settings(provider="openai", model="ft:gpt-4o:my-org::abc123") + provider = build_provider(settings) + assert provider.model == "ft:gpt-4o:my-org::abc123" + + +def test_run_walk_persists_cache_for_resumability(mini_target, mock_provider_factory): + """A second walk reuses the cache and skips the LLM call for unchanged files.""" + settings = _settings() + introspection = IntrospectionResult( + include=["src/"], exclude=[], primary_languages=["python"], likely_purpose="demo", rationale="ok" + ) + + extraction_calls = {"n": 0} + + def factory(schema, system, user): + if schema is IntrospectionResult: + return introspection + if schema is FileFindings: + extraction_calls["n"] += 1 + return FileFindings( + summary="role", + findings=[SectionFinding(section_id="entities", finding="Order entity inferred.")], + ) + if schema is SectionBody: + return SectionBody(body="Synthesized.") + if schema is DerivedSection: + return DerivedSection(body="Derived.") + raise AssertionError(f"unexpected {schema}") + + provider = mock_provider_factory(json_factory=factory) 
+ run_walk(root=mini_target, settings=settings, provider=provider) + first = extraction_calls["n"] + assert first >= 2 + + # Second walk against the same target with the same content: cache reuses + # the per-file findings, so extraction calls do not increase. + run_walk(root=mini_target, settings=settings, provider=provider) + assert extraction_calls["n"] == first + + +def test_run_walk_review_flag_invokes_critic(mini_target, mock_provider_factory): + """With ``review_derivatives=True`` the deriver runs the critic loop.""" + from wikifi.critic import Critique + + settings = _settings(review_derivatives=True) + introspection = IntrospectionResult( + include=["src/"], exclude=[], primary_languages=["python"], likely_purpose="demo", rationale="ok" + ) + critic_called = {"n": 0} + + def factory(schema, system, user): + if schema is IntrospectionResult: + return introspection + if schema is FileFindings: + return FileFindings(findings=[SectionFinding(section_id="entities", finding="Order.")]) + if schema is SectionBody: + return SectionBody(body="Synthesized.") + if schema is DerivedSection: + return DerivedSection(body="Derived.") + if schema is Critique: + critic_called["n"] += 1 + return Critique(score=9, summary="ok") + raise AssertionError(f"unexpected {schema}") + + provider = mock_provider_factory(json_factory=factory) + run_walk(root=mini_target, settings=settings, provider=provider) + assert critic_called["n"] >= 1 diff --git a/tests/test_repograph.py b/tests/test_repograph.py new file mode 100644 index 0000000..1affa68 --- /dev/null +++ b/tests/test_repograph.py @@ -0,0 +1,135 @@ +"""Repo graph + file classification tests.""" + +from __future__ import annotations + +from pathlib import Path + +from wikifi.repograph import FileKind, build_graph, classify + + +def test_classify_extension_only(): + assert classify(Path("schema.sql")) is FileKind.SQL + assert classify(Path("api.proto")) is FileKind.PROTOBUF + assert classify(Path("schema.graphql")) is 
+ FileKind.GRAPHQL + + +def test_classify_application_code(): + assert classify(Path("src/app.py")) is FileKind.APPLICATION_CODE + assert classify(Path("src/app.ts")) is FileKind.APPLICATION_CODE + assert classify(Path("src/app.go")) is FileKind.APPLICATION_CODE + + +def test_classify_migration_path(): + assert classify(Path("backend/migrations/0001_init.sql")) is FileKind.MIGRATION + assert classify(Path("alembic/versions/abc.py")) is FileKind.MIGRATION + + +def test_classify_openapi_via_sample(): + assert classify(Path("api.yaml"), sample="openapi: 3.0.3\ninfo: ...") is FileKind.OPENAPI + assert classify(Path("api.json"), sample='{"openapi": "3.0.0"}') is FileKind.OPENAPI + + +def test_classify_other(): + assert classify(Path("README.md")) is FileKind.OTHER + assert classify(Path("data.csv")) is FileKind.OTHER + + +def test_build_graph_python_imports(tmp_path: Path): + (tmp_path / "pkg").mkdir() + (tmp_path / "pkg" / "__init__.py").write_text("") + (tmp_path / "pkg" / "a.py").write_text("from pkg.b import thing\nimport os\n") + (tmp_path / "pkg" / "b.py").write_text("def thing(): return 1\n") + + files = [Path("pkg/__init__.py"), Path("pkg/a.py"), Path("pkg/b.py")] + graph = build_graph(repo_root=tmp_path, files=files) + + a_node = graph.get("pkg/a.py") + assert a_node is not None + assert "pkg/b.py" in a_node.imports + + b_node = graph.get("pkg/b.py") + assert b_node is not None + assert "pkg/a.py" in b_node.imported_by + + +def test_build_graph_js_imports(tmp_path: Path): + (tmp_path / "src").mkdir() + (tmp_path / "src" / "main.js").write_text("import { run } from './worker';\n") + (tmp_path / "src" / "worker.js").write_text("export function run() {}\n") + + files = [Path("src/main.js"), Path("src/worker.js")] + graph = build_graph(repo_root=tmp_path, files=files) + main_node = graph.get("src/main.js") + assert main_node is not None + assert "src/worker.js" in main_node.imports + + +def test_neighbor_paths_caps_results(tmp_path: Path): + """neighbor_paths() bounds the 
prompt-side noise.""" + (tmp_path / "hub.py").write_text("\n".join(f"from leaf{i} import foo" for i in range(20))) + for i in range(20): + (tmp_path / f"leaf{i}.py").write_text("def foo(): pass\n") + files = [Path("hub.py")] + [Path(f"leaf{i}.py") for i in range(20)] + graph = build_graph(repo_root=tmp_path, files=files) + neighbors = graph.neighbor_paths("hub.py", limit=5) + assert len(neighbors) == 5 + + +def test_build_graph_skips_unreadable_files(tmp_path: Path): + """Missing-file path is exercised even if no other tests trip it.""" + files = [Path("ghost.py")] + graph = build_graph(repo_root=tmp_path, files=files) + # No edges produced; graph still records a node with empty imports. + node = graph.get("ghost.py") + assert node is not None + assert node.imports == () + + +def test_build_graph_python_relative_imports(tmp_path: Path): + """`from .b import x` resolves to a sibling within the same package. + + Without this the regex skips the leading-dot form entirely and + intra-package edges silently disappear from the graph, so per-file + neighbor context for Python codebases is incomplete. + """ + pkg = tmp_path / "pkg" + pkg.mkdir() + (pkg / "__init__.py").write_text("") + (pkg / "a.py").write_text("from .b import thing\nfrom . 
import helpers\n") + (pkg / "b.py").write_text("def thing(): return 1\n") + (pkg / "helpers.py").write_text("VALUE = 1\n") + + files = [ + Path("pkg/__init__.py"), + Path("pkg/a.py"), + Path("pkg/b.py"), + Path("pkg/helpers.py"), + ] + graph = build_graph(repo_root=tmp_path, files=files) + + a_node = graph.get("pkg/a.py") + assert a_node is not None + assert "pkg/b.py" in a_node.imports + assert "pkg/helpers.py" in a_node.imports + + +def test_build_graph_python_double_dot_relative_import(tmp_path: Path): + """`from ..sibling import x` walks one level up before resolving.""" + sub = tmp_path / "pkg" / "sub" + sub.mkdir(parents=True) + (tmp_path / "pkg" / "__init__.py").write_text("") + (sub / "__init__.py").write_text("") + (sub / "leaf.py").write_text("from ..sibling import thing\n") + (tmp_path / "pkg" / "sibling.py").write_text("def thing(): return 1\n") + + files = [ + Path("pkg/__init__.py"), + Path("pkg/sibling.py"), + Path("pkg/sub/__init__.py"), + Path("pkg/sub/leaf.py"), + ] + graph = build_graph(repo_root=tmp_path, files=files) + + leaf = graph.get("pkg/sub/leaf.py") + assert leaf is not None + assert "pkg/sibling.py" in leaf.imports diff --git a/tests/test_report.py b/tests/test_report.py new file mode 100644 index 0000000..1cf8188 --- /dev/null +++ b/tests/test_report.py @@ -0,0 +1,99 @@ +"""Coverage + quality report tests.""" + +from __future__ import annotations + +from pathlib import Path + +from wikifi.cache import WalkCache, save +from wikifi.critic import Critique +from wikifi.report import build_report +from wikifi.wiki import WikiLayout, append_note, initialize, write_section + + +def _layout(tmp_path: Path) -> WikiLayout: + layout = WikiLayout(root=tmp_path) + initialize(layout, model="m", provider="ollama", ollama_host="http://h") + return layout + + +def test_build_report_without_provider_returns_structural_view(tmp_path: Path): + layout = _layout(tmp_path) + cache = WalkCache() + cache.record_extraction( + "src/order.py", + 
fingerprint="abc", + findings=[{"section_id": "entities", "finding": "Order", "sources": []}], + summary="domain", + chunks_processed=1, + ) + cache.record_extraction( + "src/empty.py", + fingerprint="def", + findings=[], + summary="", + chunks_processed=1, + ) + save(layout, cache) + append_note(layout, "entities", {"file": "src/order.py", "summary": "x", "finding": "Order"}) + write_section(layout, "entities", "Body for entities.") + + report = build_report(layout=layout, provider=None, score=False) + + assert report.coverage.files_total == 2 + assert report.coverage.files_with_findings == 1 + assert report.coverage.coverage_pct() == 50.0 + assert report.overall_score is None + md = report.render() + assert "wikifi coverage" in md + assert "`entities`" in md + + +def test_build_report_with_score_uses_provider(tmp_path: Path, mock_provider_factory): + layout = _layout(tmp_path) + write_section(layout, "entities", "An entity body.") + write_section(layout, "intent", "Intent body.") + + provider = mock_provider_factory( + json_factory=lambda schema, system, user: Critique(score=9, summary="great"), + ) + report = build_report(layout=layout, provider=provider, score=True) + + populated = [s for s in report.sections if s.critique is not None] + assert populated, "expected at least one populated section to be scored" + assert all(s.critique.score == 9 for s in populated) + assert report.overall_score == 9.0 + assert "9/10" in report.render() + + +def test_build_report_marks_unpopulated_sections(tmp_path: Path): + """Sections still bearing the init placeholder are flagged ``is_empty``.""" + layout = _layout(tmp_path) + save(layout, WalkCache()) + report = build_report(layout=layout, provider=None, score=False) + assert any(entry.is_empty for entry in report.sections) + + +def test_build_report_uses_notes_when_cache_is_empty(tmp_path: Path): + """`wikifi report` after `walk --no-cache` must still report coverage. 
+ + Coverage was previously derived from the cache only; with caching + disabled or the cache deleted, every walk reported `0%` even though + notes and section bodies were present on disk. Pulling + ``files_with_findings`` from the JSONL notes restores accuracy. + """ + layout = _layout(tmp_path) + # No cache written — emulates `walk --no-cache` or a manual cache wipe. + append_note(layout, "entities", {"file": "src/order.py", "summary": "x", "finding": "Order"}) + append_note(layout, "entities", {"file": "src/customer.py", "summary": "y", "finding": "Customer"}) + append_note(layout, "capabilities", {"file": "src/order.py", "summary": "x", "finding": "Place order"}) + write_section(layout, "entities", "Body for entities.") + + report = build_report(layout=layout, provider=None, score=False) + + # Two distinct files contributed — coverage reflects them, not 0. + assert report.coverage.files_with_findings == 2 + assert report.coverage.files_total >= 2 + assert report.coverage.coverage_pct() > 0 + # Per-section counts still come from the notes themselves. 
+ assert report.coverage.findings_per_section["entities"] == 2 + assert report.coverage.findings_per_section["capabilities"] == 1 diff --git a/tests/test_specialized.py b/tests/test_specialized.py new file mode 100644 index 0000000..ab28adb --- /dev/null +++ b/tests/test_specialized.py @@ -0,0 +1,296 @@ +"""Type-aware (specialized) extractor tests.""" + +from __future__ import annotations + +from wikifi.repograph import FileKind +from wikifi.specialized.dispatch import select +from wikifi.specialized.graphql import extract as gql_extract +from wikifi.specialized.openapi import extract as openapi_extract +from wikifi.specialized.protobuf import extract as proto_extract +from wikifi.specialized.sql import extract as sql_extract + + +def test_select_routes_known_kinds_to_extractors(): + assert select(FileKind.SQL) is sql_extract + assert select(FileKind.PROTOBUF) is proto_extract + assert select(FileKind.GRAPHQL) is gql_extract + assert select(FileKind.OPENAPI) is openapi_extract + # SQL-shaped migrations route to the SQL migration variant. + sql_mig = select(FileKind.MIGRATION, rel_path="db/migrations/0042_orders.sql") + assert sql_mig is not None + assert sql_mig.__name__ == "extract_migration" + # Python / JS / Ruby migrations stay on the LLM path — the SQL + # parser would silently produce empty findings on real code. + assert select(FileKind.MIGRATION, rel_path="alembic/versions/0001_init.py") is None + assert select(FileKind.MIGRATION, rel_path="db/migrate/20260501_add_users.rb") is None + assert select(FileKind.MIGRATION, rel_path="db/migrations/001-add-users.js") is None + # Without a rel_path the dispatcher can't tell SQL from non-SQL — + # err on the safe side and return ``None``. 
+ assert select(FileKind.MIGRATION) is None + assert select(FileKind.APPLICATION_CODE) is None + assert select(FileKind.OTHER) is None + + +# --------------------------------------------------------------------------- +# SQL +# --------------------------------------------------------------------------- + + +def test_sql_extracts_table_and_foreign_key(): + text = """ + CREATE TABLE customer ( + id INTEGER PRIMARY KEY, + email VARCHAR(255) UNIQUE NOT NULL + ); + + CREATE TABLE orders ( + id INTEGER PRIMARY KEY, + customer_id INTEGER REFERENCES customer(id), + total INTEGER NOT NULL + ); + """ + result = sql_extract("schema.sql", text) + sections = {f.section_id for f in result.findings} + assert "entities" in sections + assert "integrations" in sections + findings_by_section = {s: [f for f in result.findings if f.section_id == s] for s in sections} + # Both tables surface as entities. + entity_findings = findings_by_section["entities"] + assert any("customer" in f.finding for f in entity_findings) + assert any("orders" in f.finding for f in entity_findings) + # FK becomes an integration. 
+ fk_findings = findings_by_section["integrations"] + assert any("customer" in f.finding for f in fk_findings) + + +def test_sql_migration_marks_summary(): + text = "ALTER TABLE orders ADD COLUMN refund_status TEXT;" + from wikifi.specialized.sql import extract_migration + + result = extract_migration("backend/migrations/0042_refunds.sql", text) + assert "Migration" in result.summary or "migration" in result.summary.lower() + assert any("orders" in f.finding for f in result.findings) + + +def test_sql_index_becomes_cross_cutting(): + text = "CREATE INDEX idx_orders_customer ON orders (customer_id);" + result = sql_extract("schema.sql", text) + assert any(f.section_id == "cross_cutting" and "idx_orders_customer" in f.finding for f in result.findings) + + +# --------------------------------------------------------------------------- +# OpenAPI +# --------------------------------------------------------------------------- + + +def test_openapi_extracts_endpoints_and_schemas_from_json(): + spec = """ + { + "openapi": "3.0.0", + "info": {"title": "Orders API", "version": "1.0"}, + "paths": { + "/orders": { + "post": {"summary": "Create order"}, + "get": {"summary": "List orders"} + } + }, + "components": { + "schemas": {"Order": {"type": "object"}, "LineItem": {"type": "object"}}, + "securitySchemes": {"bearerAuth": {"type": "http"}} + } + } + """ + result = openapi_extract("openapi.json", spec) + sections = {f.section_id for f in result.findings} + assert "intent" in sections + assert "capabilities" in sections + assert "entities" in sections + assert "integrations" in sections + assert "cross_cutting" in sections + cap_text = next(f.finding for f in result.findings if f.section_id == "capabilities") + assert "POST /orders" in cap_text + assert "GET /orders" in cap_text + + +def test_openapi_handles_unparseable_input(): + result = openapi_extract("openapi.yaml", "") + assert any(f.section_id == "capabilities" for f in result.findings) + assert "Unparseable" in 
result.summary or "manual review" in result.findings[0].finding.lower() + + +def test_openapi_yaml_fallback_parser(): + """The shallow YAML parser should work even without PyYAML installed.""" + spec = """openapi: 3.0.0 +info: + title: Test API + version: "1.0" +paths: + /test: + get: + summary: Test endpoint +""" + result = openapi_extract("openapi.yaml", spec) + # Should at least extract intent (title is present). + assert any("Test API" in f.finding for f in result.findings) + + +# --------------------------------------------------------------------------- +# Protobuf +# --------------------------------------------------------------------------- + + +def test_proto_extracts_messages_and_services(): + text = """ + syntax = "proto3"; + package billing.v1; + + message Invoice { + int64 id = 1; + string customer_id = 2; + } + + service BillingService { + rpc CreateInvoice (Invoice) returns (Invoice); + rpc StreamInvoices (Invoice) returns (stream Invoice); + } + """ + result = proto_extract("billing.proto", text) + sections = {f.section_id for f in result.findings} + assert "entities" in sections + assert "integrations" in sections + assert "capabilities" in sections + integrations = next(f for f in result.findings if f.section_id == "integrations") + assert "BillingService" in integrations.finding + assert "CreateInvoice" in integrations.finding + + +# --------------------------------------------------------------------------- +# GraphQL +# --------------------------------------------------------------------------- + + +def test_graphql_extracts_types_and_roots(): + sdl = """ + type Order { + id: ID! + total: Int! + } + + input OrderInput { + total: Int! + } + + type Query { + order(id: ID!): Order + } + + type Mutation { + createOrder(input: OrderInput!): Order! 
+ } + """ + result = gql_extract("schema.graphql", sdl) + sections = {f.section_id for f in result.findings} + assert "entities" in sections + assert "capabilities" in sections + cap = next(f for f in result.findings if f.section_id == "capabilities") + assert "Query" in cap.finding or "Mutation" in cap.finding + + +def test_graphql_extract_handles_extend_type_query(tmp_path): + """`extend type Query` blocks contribute to the capabilities section. + + Modular GraphQL schemas split root types across files; if the + extractor only matched bare `type Query { ... }` declarations, + capabilities would silently disappear for any schema composed from + multiple files. + """ + sdl = """ + type Order { + id: ID! + } + + extend type Query { + orderById(id: ID!): Order + } + + extend type Mutation { + cancelOrder(id: ID!): Boolean! + } + """ + result = gql_extract("schema.graphql", sdl) + capabilities = [f for f in result.findings if f.section_id == "capabilities"] + assert any("orderById" in f.finding for f in capabilities) + assert any("cancelOrder" in f.finding for f in capabilities) + + +def test_graphql_block_after_handles_indented_closing_brace(): + """`_block_after` must stop on indented `}` lines, not just column-0 ones. + + Many SDL formatters indent the closing brace; the previous + column-0-only check would let the scan run into subsequent type + declarations, polluting the root field list with unrelated fields. + """ + sdl = """ + type Query { + orderById(id: ID!): Order + listOrders: [Order!]! + } + + type SecretOps { + shouldNotAppear: String! + } + """ + result = gql_extract("schema.graphql", sdl) + capabilities = next(f for f in result.findings if f.section_id == "capabilities") + assert "orderById" in capabilities.finding + assert "listOrders" in capabilities.finding + assert "shouldNotAppear" not in capabilities.finding + + +def test_proto_scopes_rpcs_to_owning_service(): + """Multiple `service` blocks: each owns only its own RPCs. 
+ + The previous scope ("every RPC at or after my line") attributed + every later service's RPCs to the first service, inflating the + integration inventory whenever a proto file declared more than one. + """ + text = """ + service AccountsService { + rpc CreateAccount (CreateAccountRequest) returns (Account); + } + + service BillingService { + rpc ChargeAccount (ChargeRequest) returns (Receipt); + rpc Refund (RefundRequest) returns (Receipt); + } + """ + result = proto_extract("svc.proto", text) + integrations = {f.finding.split("\n", 1)[0]: f.finding for f in result.findings if f.section_id == "integrations"} + accounts_finding = next(v for k, v in integrations.items() if "AccountsService" in k) + billing_finding = next(v for k, v in integrations.items() if "BillingService" in k) + + assert "CreateAccount" in accounts_finding + assert "ChargeAccount" not in accounts_finding + assert "Refund" not in accounts_finding + + assert "ChargeAccount" in billing_finding + assert "Refund" in billing_finding + assert "CreateAccount" not in billing_finding + + +def test_sql_migration_with_only_alter_counts_altered_tables(): + """An ALTER-only migration reports its altered targets, not 0 tables. + + Prior to the fix the summary counted only CREATE TABLE matches, so + a migration that only ALTERs existing tables was reported as + "Migration touches 0 table(s)" even though it had real targets. + """ + from wikifi.specialized.sql import extract_migration + + text = """ + ALTER TABLE orders ADD COLUMN refund_status TEXT; + ALTER TABLE customers ADD COLUMN tier TEXT; + """ + result = extract_migration("backend/migrations/0042_alter.sql", text) + assert "0 table" not in result.summary + assert "2 table" in result.summary diff --git a/tests/test_wiki.py b/tests/test_wiki.py index d95712f..bc54404 100644 --- a/tests/test_wiki.py +++ b/tests/test_wiki.py @@ -90,3 +90,76 @@ def test_write_section_with_section_object(tmp_path): body = "Some **bold** content." 
path = write_section(layout, section, body) assert section.title in path.read_text() + + +def test_initialize_gitignore_includes_cache_dir(tmp_path): + """Fresh init must ignore both `.notes/` AND `.cache/`. + + The cache layer writes to `.wikifi/.cache/`; if the gitignore + template misses it, every walk leaves untracked files in the + target repo — exactly the noise the wiki contract promises to + avoid. + """ + from wikifi.wiki import CACHE_DIRNAME, NOTES_DIRNAME + + layout = _layout(tmp_path) + initialize(layout, model="m", provider="ollama", ollama_host="http://h") + contents = layout.gitignore_path.read_text() + assert f"{NOTES_DIRNAME}/" in contents + assert f"{CACHE_DIRNAME}/" in contents + + +def test_initialize_backfills_cache_into_legacy_gitignore(tmp_path): + """An older wiki's `.gitignore` (only `.notes/`) gains `.cache/` on re-init. + + Wikis created before the cache layer landed have a `.gitignore` + missing the new entry. Re-running `wikifi init` against them must + append the missing line in place rather than leaving the older + config silently incomplete. + """ + from wikifi.wiki import CACHE_DIRNAME + + layout = _layout(tmp_path) + layout.wiki_dir.mkdir(parents=True) + # Simulate the pre-cache-era gitignore — comment + .notes/ only. + legacy = "# wikifi local working state — section markdown is committed, notes are not.\n.notes/\n" + layout.gitignore_path.write_text(legacy) + + initialize(layout, model="m", provider="ollama", ollama_host="http://h") + + contents = layout.gitignore_path.read_text() + # The original line is preserved unchanged. + assert ".notes/" in contents + # The missing entry is appended. + assert f"{CACHE_DIRNAME}/" in contents + # No duplication on a second init. 
+ initialize(layout, model="m", provider="ollama", ollama_host="http://h") + after_second = layout.gitignore_path.read_text() + assert after_second.count(f"{CACHE_DIRNAME}/") == 1 + + +def test_initialize_preserves_user_extra_lines_in_gitignore(tmp_path): + """User-added entries in `.wikifi/.gitignore` survive re-init. + + Backfill must only *append* missing required entries — it must + never rewrite, reorder, or strip lines the user added themselves + (e.g. `local-notes/`, `*.draft`, etc.). + """ + from wikifi.wiki import CACHE_DIRNAME + + layout = _layout(tmp_path) + layout.wiki_dir.mkdir(parents=True) + # User-customized: includes the standard .notes/ plus an extra entry, + # but is missing the new .cache/ line. + user_authored = "# my custom comment\n.notes/\nlocal-notes/\n*.draft\n" + layout.gitignore_path.write_text(user_authored) + + initialize(layout, model="m", provider="ollama", ollama_host="http://h") + contents = layout.gitignore_path.read_text() + + # User content unchanged. + assert "# my custom comment" in contents + assert "local-notes/" in contents + assert "*.draft" in contents + # Required entry appended. 
+ assert f"{CACHE_DIRNAME}/" in contents diff --git a/uv.lock b/uv.lock index e2c0f09..234d787 100644 --- a/uv.lock +++ b/uv.lock @@ -20,6 +20,25 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/78/b6/6307fbef88d9b5ee7421e68d78a9f162e0da4900bc5f5793f6d3d0e34fb8/annotated_types-0.7.0-py3-none-any.whl", hash = "sha256:1f02e8b43a8fbbc3f3e0d4f0f4bfc8131bcb4eebe8849b8e5c773f3a1c582a53", size = 13643, upload-time = "2024-05-20T21:33:24.1Z" }, ] +[[package]] +name = "anthropic" +version = "0.97.0" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "anyio" }, + { name = "distro" }, + { name = "docstring-parser" }, + { name = "httpx" }, + { name = "jiter" }, + { name = "pydantic" }, + { name = "sniffio" }, + { name = "typing-extensions" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/14/93/f66ea8bfe39f2e6bb9da8e27fa5457ad2520e8f7612dfc547b17fad55c4d/anthropic-0.97.0.tar.gz", hash = "sha256:021e79fd8e21e90ad94dc5ba2bbbd8b1599f424f5b1fab6c06204009cab764be", size = 669502, upload-time = "2026-04-23T20:52:34.445Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/53/b6/8e851369fa661ad0fef2ae6266bf3b7d52b78ccf011720058f4adaca59e2/anthropic-0.97.0-py3-none-any.whl", hash = "sha256:8a1a472dfabcfc0c52ff6a3eecf724ac7e07107a2f6e2367be55ceb42f5d5613", size = 662126, upload-time = "2026-04-23T20:52:32.377Z" }, +] + [[package]] name = "anyio" version = "4.13.0" @@ -147,6 +166,24 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/9e/ee/a4cf96b8ce1e566ed238f0659ac2d3f007ed1d14b181bcb684e19561a69a/coverage-7.13.5-py3-none-any.whl", hash = "sha256:34b02417cf070e173989b3db962f7ed56d2f644307b2cf9d5a0f258e13084a61", size = 211346, upload-time = "2026-03-17T10:33:15.691Z" }, ] +[[package]] +name = "distro" +version = "1.9.0" +source = { registry = "https://pypi.org/simple" } +sdist = { url = 
"https://files.pythonhosted.org/packages/fc/f8/98eea607f65de6527f8a2e8885fc8015d3e6f5775df186e443e0964a11c3/distro-1.9.0.tar.gz", hash = "sha256:2fa77c6fd8940f116ee1d6b94a2f90b13b5ea8d019b98bc8bafdcabcdd9bdbed", size = 60722, upload-time = "2023-12-24T09:54:32.31Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/12/b3/231ffd4ab1fc9d679809f356cebee130ac7daa00d6d6f3206dd4fd137e9e/distro-1.9.0-py3-none-any.whl", hash = "sha256:7bffd925d65168f85027d8da9af6bddab658135b840670a223589bc0c8ef02b2", size = 20277, upload-time = "2023-12-24T09:54:30.421Z" }, +] + +[[package]] +name = "docstring-parser" +version = "0.18.0" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/e0/4d/f332313098c1de1b2d2ff91cf2674415cc7cddab2ca1b01ae29774bd5fdf/docstring_parser-0.18.0.tar.gz", hash = "sha256:292510982205c12b1248696f44959db3cdd1740237a968ea1e2e7a900eeb2015", size = 29341, upload-time = "2026-04-14T04:09:19.867Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/a7/5f/ed01f9a3cdffbd5a008556fc7b2a08ddb1cc6ace7effa7340604b1d16699/docstring_parser-0.18.0-py3-none-any.whl", hash = "sha256:b3fcbed555c47d8479be0796ef7e19c2670d428d72e96da63f3a40122860374b", size = 22484, upload-time = "2026-04-14T04:09:18.638Z" }, +] + [[package]] name = "h11" version = "0.16.0" @@ -202,6 +239,78 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/cb/b1/3846dd7f199d53cb17f49cba7e651e9ce294d8497c8c150530ed11865bb8/iniconfig-2.3.0-py3-none-any.whl", hash = "sha256:f631c04d2c48c52b84d0d0549c99ff3859c98df65b3101406327ecc7d53fbf12", size = 7484, upload-time = "2025-10-18T21:55:41.639Z" }, ] +[[package]] +name = "jiter" +version = "0.14.0" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/6e/c1/0cddc6eb17d4c53a99840953f95dd3accdc5cfc7a337b0e9b26476276be9/jiter-0.14.0.tar.gz", hash = "sha256:e8a39e66dac7153cf3f964a12aad515afa8d74938ec5cc0018adcdae5367c79e", 
size = 165725, upload-time = "2026-04-10T14:28:42.01Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/5a/68/7390a418f10897da93b158f2d5a8bd0bcd73a0f9ec3bb36917085bb759ef/jiter-0.14.0-cp312-cp312-macosx_10_12_x86_64.whl", hash = "sha256:2fb2ce3a7bc331256dfb14cefc34832366bb28a9aca81deaf43bbf2a5659e607", size = 316295, upload-time = "2026-04-10T14:26:24.887Z" }, + { url = "https://files.pythonhosted.org/packages/60/a0/5854ac00ff63551c52c6c89534ec6aba4b93474e7924d64e860b1c94165b/jiter-0.14.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:5252a7ca23785cef5d02d4ece6077a1b556a410c591b379f82091c3001e14844", size = 315898, upload-time = "2026-04-10T14:26:26.601Z" }, + { url = "https://files.pythonhosted.org/packages/41/a1/4f44832650a16b18e8391f1bf1d6ca4909bc738351826bcc198bba4357f4/jiter-0.14.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c409578cbd77c338975670ada777add4efd53379667edf0aceea730cabede6fb", size = 343730, upload-time = "2026-04-10T14:26:28.326Z" }, + { url = "https://files.pythonhosted.org/packages/48/64/a329e9d469f86307203594b1707e11ae51c3348d03bfd514a5f997870012/jiter-0.14.0-cp312-cp312-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:7ede4331a1899d604463369c730dbb961ffdc5312bc7f16c41c2896415b1304a", size = 370102, upload-time = "2026-04-10T14:26:30.089Z" }, + { url = "https://files.pythonhosted.org/packages/94/c1/5e3dfc59635aa4d4c7bd20a820ac1d09b8ed851568356802cf1c08edb3cf/jiter-0.14.0-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:92cd8b6025981a041f5310430310b55b25ca593972c16407af8837d3d7d2ca01", size = 461335, upload-time = "2026-04-10T14:26:31.911Z" }, + { url = "https://files.pythonhosted.org/packages/e3/1b/dd157009dbc058f7b00108f545ccb72a2d56461395c4fc7b9cfdccb00af4/jiter-0.14.0-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:351bf6eda4e3a7ceb876377840c702e9a3e4ecc4624dbfb2d6463c67ae52637d", size = 378536, upload-time = 
"2026-04-10T14:26:33.595Z" }, + { url = "https://files.pythonhosted.org/packages/91/78/256013667b7c10b8834f8e6e54cd3e562d4c6e34227a1596addccc05e38c/jiter-0.14.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:c1dcfbeb93d9ecd9ca128bbf8910120367777973fa193fb9a39c31237d8df165", size = 353859, upload-time = "2026-04-10T14:26:35.098Z" }, + { url = "https://files.pythonhosted.org/packages/de/d9/137d65ade9093a409fe80955ce60b12bb753722c986467aeda47faf450ad/jiter-0.14.0-cp312-cp312-manylinux_2_31_riscv64.whl", hash = "sha256:ae039aaef8de3f8157ecc1fdd4d85043ac4f57538c245a0afaecb8321ec951c3", size = 357626, upload-time = "2026-04-10T14:26:36.685Z" }, + { url = "https://files.pythonhosted.org/packages/2e/48/76750835b87029342727c1a268bea8878ab988caf81ee4e7b880900eeb5a/jiter-0.14.0-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:7d9d51eb96c82a9652933bd769fe6de66877d6eb2b2440e281f2938c51b5643e", size = 393172, upload-time = "2026-04-10T14:26:38.097Z" }, + { url = "https://files.pythonhosted.org/packages/a6/60/456c4e81d5c8045279aefe60e9e483be08793828800a4e64add8fdde7f2a/jiter-0.14.0-cp312-cp312-musllinux_1_1_aarch64.whl", hash = "sha256:d824ca4148b705970bf4e120924a212fdfca9859a73e42bd7889a63a4ea6bb98", size = 520300, upload-time = "2026-04-10T14:26:39.532Z" }, + { url = "https://files.pythonhosted.org/packages/a8/9f/2020e0984c235f678dced38fe4eec3058cf528e6af36ebf969b410305941/jiter-0.14.0-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:ff3a6465b3a0f54b1a430f45c3c0ba7d61ceb45cbc3e33f9e1a7f638d690baf3", size = 553059, upload-time = "2026-04-10T14:26:40.991Z" }, + { url = "https://files.pythonhosted.org/packages/ef/32/e2d298e1a22a4bbe6062136d1c7192db7dba003a6975e51d9a9eecabc4c2/jiter-0.14.0-cp312-cp312-win32.whl", hash = "sha256:5dec7c0a3e98d2a3f8a2e67382d0d7c3ac60c69103a4b271da889b4e8bb1e129", size = 206030, upload-time = "2026-04-10T14:26:42.517Z" }, + { url = 
"https://files.pythonhosted.org/packages/36/ac/96369141b3d8a4a8e4590e983085efe1c436f35c0cda940dd76d942e3e40/jiter-0.14.0-cp312-cp312-win_amd64.whl", hash = "sha256:fc7e37b4b8bc7e80a63ad6cfa5fc11fab27dbfea4cc4ae644b1ab3f273dc348f", size = 201603, upload-time = "2026-04-10T14:26:44.328Z" }, + { url = "https://files.pythonhosted.org/packages/01/c3/75d847f264647017d7e3052bbcc8b1e24b95fa139c320c5f5066fa7a0bdd/jiter-0.14.0-cp312-cp312-win_arm64.whl", hash = "sha256:ee4a72f12847ef29b072aee9ad5474041ab2924106bdca9fcf5d7d965853e057", size = 191525, upload-time = "2026-04-10T14:26:46Z" }, + { url = "https://files.pythonhosted.org/packages/97/2a/09f70020898507a89279659a1afe3364d57fc1b2c89949081975d135f6f5/jiter-0.14.0-cp313-cp313-macosx_10_12_x86_64.whl", hash = "sha256:af72f204cf4d44258e5b4c1745130ac45ddab0e71a06333b01de660ab4187a94", size = 315502, upload-time = "2026-04-10T14:26:47.697Z" }, + { url = "https://files.pythonhosted.org/packages/d6/be/080c96a45cd74f9fce5db4fd68510b88087fb37ffe2541ff73c12db92535/jiter-0.14.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:4b77da71f6e819be5fbcec11a453fde5b1d0267ef6ed487e2a392fd8e14e4e3a", size = 314870, upload-time = "2026-04-10T14:26:49.149Z" }, + { url = "https://files.pythonhosted.org/packages/7d/5e/2d0fee155826a968a832cc32438de5e2a193292c8721ca70d0b53e58245b/jiter-0.14.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:77f4ea612fe8b84b8b04e51d0e78029ecf3466348e25973f953de6e6a59aa4c1", size = 343406, upload-time = "2026-04-10T14:26:50.762Z" }, + { url = "https://files.pythonhosted.org/packages/70/af/bf9ee0d3a4f8dc0d679fc1337f874fe60cdbf841ebbb304b374e1c9aaceb/jiter-0.14.0-cp313-cp313-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:62fe2451f8fcc0240261e6a4df18ecbcd58327857e61e625b2393ea3b468aac9", size = 369415, upload-time = "2026-04-10T14:26:52.188Z" }, + { url = 
"https://files.pythonhosted.org/packages/0f/83/8e8561eadba31f4d3948a5b712fb0447ec71c3560b57a855449e7b8ddc98/jiter-0.14.0-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:6112f26f5afc75bcb475787d29da3aa92f9d09c7858f632f4be6ffe607be82e9", size = 461456, upload-time = "2026-04-10T14:26:53.611Z" }, + { url = "https://files.pythonhosted.org/packages/f6/c9/c5299e826a5fe6108d172b344033f61c69b1bb979dd8d9ddd4278a160971/jiter-0.14.0-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:215a6cb8fb7dc702aa35d475cc00ddc7f970e5c0b1417fb4b4ac5d82fa2a29db", size = 378488, upload-time = "2026-04-10T14:26:55.211Z" }, + { url = "https://files.pythonhosted.org/packages/5d/37/c16d9d15c0a471b8644b1abe3c82668092a707d9bedcf076f24ff2e380cd/jiter-0.14.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:fc4ab96a30fb3cb2c7e0cd33f7616c8860da5f5674438988a54ac717caccdbaa", size = 353242, upload-time = "2026-04-10T14:26:56.705Z" }, + { url = "https://files.pythonhosted.org/packages/58/ea/8050cb0dc654e728e1bfacbc0c640772f2181af5dedd13ae70145743a439/jiter-0.14.0-cp313-cp313-manylinux_2_31_riscv64.whl", hash = "sha256:3a99c1387b1f2928f799a9de899193484d66206a50e98233b6b088a7f0c1edb2", size = 356823, upload-time = "2026-04-10T14:26:58.281Z" }, + { url = "https://files.pythonhosted.org/packages/b0/3b/cf71506d270e5f84d97326bf220e47aed9b95e9a4a060758fb07772170ab/jiter-0.14.0-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:ab18d11074485438695f8d34a1b6da61db9754248f96d51341956607a8f39985", size = 392564, upload-time = "2026-04-10T14:27:00.018Z" }, + { url = "https://files.pythonhosted.org/packages/b0/cc/8c6c74a3efb5bd671bfd14f51e8a73375464ca914b1551bc3b40e26ac2c9/jiter-0.14.0-cp313-cp313-musllinux_1_1_aarch64.whl", hash = "sha256:801028dcfc26ac0895e4964cbc0fd62c73be9fd4a7d7b1aaf6e5790033a719b7", size = 520322, upload-time = "2026-04-10T14:27:01.664Z" }, + { url = 
"https://files.pythonhosted.org/packages/41/24/68d7b883ec959884ddf00d019b2e0e82ba81b167e1253684fa90519ce33c/jiter-0.14.0-cp313-cp313-musllinux_1_1_x86_64.whl", hash = "sha256:ad425b087aafb4a1c7e1e98a279200743b9aaf30c3e0ba723aec93f061bd9bc8", size = 552619, upload-time = "2026-04-10T14:27:03.316Z" }, + { url = "https://files.pythonhosted.org/packages/b6/89/b1a0985223bbf3150ff9e8f46f98fc9360c1de94f48abe271bbe1b465682/jiter-0.14.0-cp313-cp313-win32.whl", hash = "sha256:882bcb9b334318e233950b8be366fe5f92c86b66a7e449e76975dfd6d776a01f", size = 205699, upload-time = "2026-04-10T14:27:04.662Z" }, + { url = "https://files.pythonhosted.org/packages/4c/19/3f339a5a7f14a11730e67f6be34f9d5105751d547b615ef593fa122a5ded/jiter-0.14.0-cp313-cp313-win_amd64.whl", hash = "sha256:9b8c571a5dba09b98bd3462b5a53f27209a5cbbe85670391692ede71974e979f", size = 201323, upload-time = "2026-04-10T14:27:06.139Z" }, + { url = "https://files.pythonhosted.org/packages/50/56/752dd89c84be0e022a8ea3720bcfa0a8431db79a962578544812ce061739/jiter-0.14.0-cp313-cp313-win_arm64.whl", hash = "sha256:34f19dcc35cb1abe7c369b3756babf8c7f04595c0807a848df8f26ef8298ef92", size = 191099, upload-time = "2026-04-10T14:27:07.564Z" }, + { url = "https://files.pythonhosted.org/packages/91/28/292916f354f25a1fe8cf2c918d1415c699a4a659ae00be0430e1c5d9ffea/jiter-0.14.0-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:e89bcd7d426a75bb4952c696b267075790d854a07aad4c9894551a82c5b574ab", size = 320880, upload-time = "2026-04-10T14:27:09.326Z" }, + { url = "https://files.pythonhosted.org/packages/ad/c7/b002a7d8b8957ac3d469bd59c18ef4b1595a5216ae0de639a287b9816023/jiter-0.14.0-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7b25beaa0d4447ea8c7ae0c18c688905d34840d7d0b937f2f7bdd52162c98a40", size = 346563, upload-time = "2026-04-10T14:27:11.287Z" }, + { url = 
"https://files.pythonhosted.org/packages/f9/3b/f8d07580d8706021d255a6356b8fab13ee4c869412995550ce6ed4ddf97d/jiter-0.14.0-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:651a8758dd413c51e3b7f6557cdc6921faf70b14106f45f969f091f5cda990ea", size = 357928, upload-time = "2026-04-10T14:27:12.729Z" }, + { url = "https://files.pythonhosted.org/packages/47/5b/ac1a974da29e35507230383110ffec59998b290a8732585d04e19a9eb5ba/jiter-0.14.0-cp313-cp313t-win_amd64.whl", hash = "sha256:e1a7eead856a5038a8d291f1447176ab0b525c77a279a058121b5fccee257f6f", size = 203519, upload-time = "2026-04-10T14:27:14.125Z" }, + { url = "https://files.pythonhosted.org/packages/96/6d/9fc8433d667d2454271378a79747d8c76c10b51b482b454e6190e511f244/jiter-0.14.0-cp313-cp313t-win_arm64.whl", hash = "sha256:2e692633a12cda97e352fdcd1c4acc971b1c28707e1e33aeef782b0cbf051975", size = 190113, upload-time = "2026-04-10T14:27:16.638Z" }, + { url = "https://files.pythonhosted.org/packages/4f/1e/354ed92461b165bd581f9ef5150971a572c873ec3b68a916d5aa91da3cc2/jiter-0.14.0-cp314-cp314-macosx_10_12_x86_64.whl", hash = "sha256:6f396837fc7577871ca8c12edaf239ed9ccef3bbe39904ae9b8b63ce0a48b140", size = 315277, upload-time = "2026-04-10T14:27:18.109Z" }, + { url = "https://files.pythonhosted.org/packages/a6/95/8c7c7028aa8636ac21b7a55faef3e34215e6ed0cbf5ae58258427f621aa3/jiter-0.14.0-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:a4d50ea3d8ba4176f79754333bd35f1bbcd28e91adc13eb9b7ca91bc52a6cef9", size = 315923, upload-time = "2026-04-10T14:27:19.603Z" }, + { url = "https://files.pythonhosted.org/packages/47/40/e2a852a44c4a089f2681a16611b7ce113224a80fd8504c46d78491b47220/jiter-0.14.0-cp314-cp314-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ce17f8a050447d1b4153bda4fb7d26e6a9e74eb4f4a41913f30934c5075bf615", size = 344943, upload-time = "2026-04-10T14:27:21.262Z" }, + { url = 
"https://files.pythonhosted.org/packages/fc/1f/670f92adee1e9895eac41e8a4d623b6da68c4d46249d8b556b60b63f949e/jiter-0.14.0-cp314-cp314-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:f4f1c4b125e1652aefbc2e2c1617b60a160ab789d180e3d423c41439e5f32850", size = 369725, upload-time = "2026-04-10T14:27:22.766Z" }, + { url = "https://files.pythonhosted.org/packages/01/2f/541c9ba567d05de1c4874a0f8f8c5e3fd78e2b874266623da9a775cf46e0/jiter-0.14.0-cp314-cp314-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:be808176a6a3a14321d18c603f2d40741858a7c4fc982f83232842689fe86dd9", size = 461210, upload-time = "2026-04-10T14:27:24.315Z" }, + { url = "https://files.pythonhosted.org/packages/ce/a9/c31cbec09627e0d5de7aeaec7690dba03e090caa808fefd8133137cf45bc/jiter-0.14.0-cp314-cp314-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:26679d58ba816f88c3849306dd58cb863a90a1cf352cdd4ef67e30ccf8a77994", size = 380002, upload-time = "2026-04-10T14:27:26.155Z" }, + { url = "https://files.pythonhosted.org/packages/50/02/3c05c1666c41904a2f607475a73e7a4763d1cbde2d18229c4f85b22dc253/jiter-0.14.0-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:80381f5a19af8fa9aef743f080e34f6b25ebd89656475f8cf0470ec6157052aa", size = 354678, upload-time = "2026-04-10T14:27:27.701Z" }, + { url = "https://files.pythonhosted.org/packages/7d/97/e15b33545c2b13518f560d695f974b9891b311641bdcf178d63177e8801e/jiter-0.14.0-cp314-cp314-manylinux_2_31_riscv64.whl", hash = "sha256:004df5fdb8ecbd6d99f3227df18ba1a259254c4359736a2e6f036c944e02d7c5", size = 358920, upload-time = "2026-04-10T14:27:29.256Z" }, + { url = "https://files.pythonhosted.org/packages/ad/d2/8b1461def6b96ba44530df20d07ef7a1c7da22f3f9bf1727e2d611077bf1/jiter-0.14.0-cp314-cp314-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:cff5708f7ed0fa098f2b53446c6fa74c48469118e5cd7497b4f1cd569ab06928", size = 394512, upload-time = "2026-04-10T14:27:31.344Z" }, + { url = 
"https://files.pythonhosted.org/packages/e3/88/837566dd6ed6e452e8d3205355afd484ce44b2533edfa4ed73a298ea893e/jiter-0.14.0-cp314-cp314-musllinux_1_1_aarch64.whl", hash = "sha256:2492e5f06c36a976d25c7cc347a60e26d5470178d44cde1b9b75e60b4e519f28", size = 521120, upload-time = "2026-04-10T14:27:33.299Z" }, + { url = "https://files.pythonhosted.org/packages/89/6b/b00b45c4d1b4c031777fe161d620b755b5b02cdade1e316dcb46e4471d63/jiter-0.14.0-cp314-cp314-musllinux_1_1_x86_64.whl", hash = "sha256:7609cfbe3a03d37bfdbf5052012d5a879e72b83168a363deae7b3a26564d57de", size = 553668, upload-time = "2026-04-10T14:27:34.868Z" }, + { url = "https://files.pythonhosted.org/packages/ad/d8/6fe5b42011d19397433d345716eac16728ac241862a2aac9c91923c7509a/jiter-0.14.0-cp314-cp314-win32.whl", hash = "sha256:7282342d32e357543565286b6450378c3cd402eea333fc1ebe146f1fabb306fc", size = 207001, upload-time = "2026-04-10T14:27:36.455Z" }, + { url = "https://files.pythonhosted.org/packages/e5/43/5c2e08da1efad5e410f0eaaabeadd954812612c33fbbd8fd5328b489139d/jiter-0.14.0-cp314-cp314-win_amd64.whl", hash = "sha256:bd77945f38866a448e73b0b7637366afa814d4617790ecd88a18ca74377e6c02", size = 202187, upload-time = "2026-04-10T14:27:38Z" }, + { url = "https://files.pythonhosted.org/packages/aa/1f/6e39ac0b4cdfa23e606af5b245df5f9adaa76f35e0c5096790da430ca506/jiter-0.14.0-cp314-cp314-win_arm64.whl", hash = "sha256:f2d4c61da0821ee42e0cdf5489da60a6d074306313a377c2b35af464955a3611", size = 192257, upload-time = "2026-04-10T14:27:39.504Z" }, + { url = "https://files.pythonhosted.org/packages/05/57/7dbc0ffbbb5176a27e3518716608aa464aee2e2887dc938f0b900a120449/jiter-0.14.0-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:1bf7ff85517dd2f20a5750081d2b75083c1b269cf75afc7511bdf1f9548beb3b", size = 323441, upload-time = "2026-04-10T14:27:41.039Z" }, + { url = 
"https://files.pythonhosted.org/packages/83/6e/7b3314398d8983f06b557aa21b670511ec72d3b79a68ee5e4d9bff972286/jiter-0.14.0-cp314-cp314t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c8ef8791c3e78d6c6b157c6d360fbb5c715bebb8113bc6a9303c5caff012754a", size = 348109, upload-time = "2026-04-10T14:27:42.552Z" }, + { url = "https://files.pythonhosted.org/packages/ae/4f/8dc674bcd7db6dba566de73c08c763c337058baff1dbeb34567045b27cdc/jiter-0.14.0-cp314-cp314t-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:e74663b8b10da1fe0f4e4703fd7980d24ad17174b6bb35d8498d6e3ebce2ae6a", size = 368328, upload-time = "2026-04-10T14:27:44.574Z" }, + { url = "https://files.pythonhosted.org/packages/3b/5f/188e09a1f20906f98bbdec44ed820e19f4e8eb8aff88b9d1a5a497587ff3/jiter-0.14.0-cp314-cp314t-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:1aca29ba52913f78362ec9c2da62f22cdc4c3083313403f90c15460979b84d9b", size = 463301, upload-time = "2026-04-10T14:27:46.717Z" }, + { url = "https://files.pythonhosted.org/packages/ac/f0/19046ef965ed8f349e8554775bb12ff4352f443fbe12b95d31f575891256/jiter-0.14.0-cp314-cp314t-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:8b39b7d87a952b79949af5fef44d2544e58c21a28da7f1bae3ef166455c61746", size = 378891, upload-time = "2026-04-10T14:27:48.32Z" }, + { url = "https://files.pythonhosted.org/packages/c4/c3/da43bd8431ee175695777ee78cf0e93eacbb47393ff493f18c45231b427d/jiter-0.14.0-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:78d918a68b26e9fab068c2b5453577ef04943ab2807b9a6275df2a812599a310", size = 360749, upload-time = "2026-04-10T14:27:49.88Z" }, + { url = "https://files.pythonhosted.org/packages/72/26/e054771be889707c6161dbdec9c23d33a9ec70945395d70f07cfea1e9a6f/jiter-0.14.0-cp314-cp314t-manylinux_2_31_riscv64.whl", hash = "sha256:b08997c35aee1201c1a5361466a8fb9162d03ae7bf6568df70b6c859f1e654a4", size = 358526, upload-time = "2026-04-10T14:27:51.504Z" }, + { url = 
"https://files.pythonhosted.org/packages/c3/0f/7bea65ea2a6d91f2bf989ff11a18136644392bf2b0497a1fa50934c30a9c/jiter-0.14.0-cp314-cp314t-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:260bf7ca20704d58d41f669e5e9fe7fe2fa72901a6b324e79056f5d52e9c9be2", size = 393926, upload-time = "2026-04-10T14:27:53.368Z" }, + { url = "https://files.pythonhosted.org/packages/3c/a1/b1ff7d70deef61ac0b7c6c2f12d2ace950cdeecb4fdc94500a0926802857/jiter-0.14.0-cp314-cp314t-musllinux_1_1_aarch64.whl", hash = "sha256:37826e3df29e60f30a382f9294348d0238ef127f4b5d7f5f8da78b5b9e050560", size = 521052, upload-time = "2026-04-10T14:27:55.058Z" }, + { url = "https://files.pythonhosted.org/packages/0b/7b/3b0649983cbaf15eda26a414b5b1982e910c67bd6f7b1b490f3cfc76896a/jiter-0.14.0-cp314-cp314t-musllinux_1_1_x86_64.whl", hash = "sha256:645be49c46f2900937ba0eaf871ad5183c96858c0af74b6becc7f4e367e36e06", size = 553716, upload-time = "2026-04-10T14:27:57.269Z" }, + { url = "https://files.pythonhosted.org/packages/97/f8/33d78c83bd93ae0c0af05293a6660f88a1977caef39a6d72a84afab94ce0/jiter-0.14.0-cp314-cp314t-win32.whl", hash = "sha256:2f7877ed45118de283786178eceaf877110abacd04fde31efff3940ae9672674", size = 207957, upload-time = "2026-04-10T14:27:59.285Z" }, + { url = "https://files.pythonhosted.org/packages/d6/ac/2b760516c03e2227826d1f7025d89bf6bf6357a28fe75c2a2800873c50bf/jiter-0.14.0-cp314-cp314t-win_amd64.whl", hash = "sha256:14c0cb10337c49f5eafe8e7364daca5e29a020ea03580b8f8e6c597fed4e1588", size = 204690, upload-time = "2026-04-10T14:28:00.962Z" }, + { url = "https://files.pythonhosted.org/packages/dc/2e/a44c20c58aeed0355f2d326969a181696aeb551a25195f47563908a815be/jiter-0.14.0-cp314-cp314t-win_arm64.whl", hash = "sha256:5419d4aa2024961da9fe12a9cfe7484996735dca99e8e090b5c88595ef1951ff", size = 191338, upload-time = "2026-04-10T14:28:02.853Z" }, + { url = 
"https://files.pythonhosted.org/packages/21/42/9042c3f3019de4adcb8c16591c325ec7255beea9fcd33a42a43f3b0b1000/jiter-0.14.0-graalpy312-graalpy250_312_native-macosx_10_12_x86_64.whl", hash = "sha256:fbd9e482663ca9d005d051330e4d2d8150bb208a209409c10f7e7dfdf7c49da9", size = 308810, upload-time = "2026-04-10T14:28:34.673Z" }, + { url = "https://files.pythonhosted.org/packages/60/cf/a7e19b308bd86bb04776803b1f01a5f9a287a4c55205f4708827ee487fbf/jiter-0.14.0-graalpy312-graalpy250_312_native-macosx_11_0_arm64.whl", hash = "sha256:33a20d838b91ef376b3a56896d5b04e725c7df5bc4864cc6569cf046a8d73b6d", size = 308443, upload-time = "2026-04-10T14:28:36.658Z" }, + { url = "https://files.pythonhosted.org/packages/ca/44/e26ede3f0caeff93f222559cb0cc4ca68579f07d009d7b6010c5b586f9b1/jiter-0.14.0-graalpy312-graalpy250_312_native-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:432c4db5255d86a259efde91e55cb4c8d18c0521d844c9e2e7efcce3899fb016", size = 343039, upload-time = "2026-04-10T14:28:38.356Z" }, + { url = "https://files.pythonhosted.org/packages/da/e9/1f9ada30cef7b05e74bb06f52127e7a724976c225f46adb65c37b1dadfb6/jiter-0.14.0-graalpy312-graalpy250_312_native-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:67f00d94b281174144d6532a04b66a12cb866cbdc47c3af3bfe2973677f9861a", size = 349613, upload-time = "2026-04-10T14:28:40.066Z" }, +] + [[package]] name = "markdown-it-py" version = "4.0.0" @@ -236,6 +345,25 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/47/4f/4a617ee93d8208d2bcf26b2d8b9402ceaed03e3853c754940e2290fed063/ollama-0.6.1-py3-none-any.whl", hash = "sha256:fc4c984b345735c5486faeee67d8a265214a31cbb828167782dc642ce0a2bf8c", size = 14354, upload-time = "2025-11-13T23:02:16.292Z" }, ] +[[package]] +name = "openai" +version = "2.33.0" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "anyio" }, + { name = "distro" }, + { name = "httpx" }, + { name = "jiter" }, + { name = "pydantic" }, + { name = "sniffio" }, + 
{ name = "tqdm" }, + { name = "typing-extensions" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/f0/ee/d056c82f63c05f06baac0cffb4a90952d8274f90c49dfe244f20497b9bbd/openai-2.33.0.tar.gz", hash = "sha256:f850c435e2a4685bba3295bd54912dd26315d9c1b7733068186134d6e0599f9a", size = 693254, upload-time = "2026-04-28T14:04:42.428Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/7d/32/37734d769bc8b42e4938785313cc05aade6cb0fa72479d3220a0d61a4e78/openai-2.33.0-py3-none-any.whl", hash = "sha256:03ac37d70e8c9e3a8124214e3afa785e2cbc12e627fbd98177a086ef2fd87ad5", size = 1162695, upload-time = "2026-04-28T14:04:40.482Z" }, +] + [[package]] name = "packaging" version = "26.2" @@ -475,6 +603,27 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/e0/f9/0595336914c5619e5f28a1fb793285925a8cd4b432c9da0a987836c7f822/shellingham-1.5.4-py2.py3-none-any.whl", hash = "sha256:7ecfff8f2fd72616f7481040475a65b2bf8af90a56c89140852d1120324e8686", size = 9755, upload-time = "2023-10-24T04:13:38.866Z" }, ] +[[package]] +name = "sniffio" +version = "1.3.1" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/a2/87/a6771e1546d97e7e041b6ae58d80074f81b7d5121207425c964ddf5cfdbd/sniffio-1.3.1.tar.gz", hash = "sha256:f4324edc670a0f49750a81b895f35c3adb843cca46f0530f79fc1babb23789dc", size = 20372, upload-time = "2024-02-25T23:20:04.057Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/e9/44/75a9c9421471a6c4805dbf2356f7c181a29c1879239abab1ea2cc8f38b40/sniffio-1.3.1-py3-none-any.whl", hash = "sha256:2f6da418d1f1e0fddd844478f41680e794e6051915791a034ff65e5f100525a2", size = 10235, upload-time = "2024-02-25T23:20:01.196Z" }, +] + +[[package]] +name = "tqdm" +version = "4.67.3" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "colorama", marker = "sys_platform == 'win32'" }, +] +sdist = { url = 
"https://files.pythonhosted.org/packages/09/a9/6ba95a270c6f1fbcd8dac228323f2777d886cb206987444e4bce66338dd4/tqdm-4.67.3.tar.gz", hash = "sha256:7d825f03f89244ef73f1d4ce193cb1774a8179fd96f31d7e1dcde62092b960bb", size = 169598, upload-time = "2026-02-03T17:35:53.048Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/16/e1/3079a9ff9b8e11b846c6ac5c8b5bfb7ff225eee721825310c91b3b50304f/tqdm-4.67.3-py3-none-any.whl", hash = "sha256:ee1e4c0e59148062281c49d80b25b67771a127c85fc9676d3be5f243206826bf", size = 78374, upload-time = "2026-02-03T17:35:50.982Z" }, +] + [[package]] name = "typer" version = "0.24.2" @@ -516,7 +665,9 @@ name = "wikifi" version = "0.1.0" source = { editable = "." } dependencies = [ + { name = "anthropic" }, { name = "ollama" }, + { name = "openai" }, { name = "pathspec" }, { name = "pydantic" }, { name = "pydantic-settings" }, @@ -534,7 +685,9 @@ dev = [ [package.metadata] requires-dist = [ + { name = "anthropic", specifier = ">=0.40" }, { name = "ollama", specifier = ">=0.4.0" }, + { name = "openai", specifier = ">=1.50" }, { name = "pathspec", specifier = ">=0.12" }, { name = "pydantic", specifier = ">=2.6" }, { name = "pydantic-settings", specifier = ">=2.2" }, diff --git a/wikifi/aggregator.py b/wikifi/aggregator.py index fdbe2ca..cad0350 100644 --- a/wikifi/aggregator.py +++ b/wikifi/aggregator.py @@ -1,20 +1,41 @@ """Stage 3 — per-section synthesis. Reads the JSONL notes accumulated by the extractor and asks the LLM to -synthesize each section's final markdown. One LLM call per section, with the -section description as the contract for what should appear and what shouldn't. +synthesize each section's final markdown along with **structured +evidence**: a list of supported claims and any contradictions surfaced +across the file findings. -Sections with zero notes get a placeholder body so the wiki layout stays -complete and the absence is visible. +Three behaviors set this stage apart from a vanilla "merge LLM output": + +1. 
**Citations.** Every claim records the source files (and line ranges + when the extractor knew them) it draws from. The renderer threads + those into the final markdown as numbered footnotes. +2. **Contradiction surfacing.** When two or more files disagree about + the same domain claim — a frequent case in legacy systems where + tribal knowledge hides in inconsistencies — the conflict is rendered + under a "Conflicts in source" heading rather than silently merged. +3. **Section-level cache.** A digest of the section's note payload is + compared against the prior walk's; if the notes are byte-identical, + the cached body and evidence are reused without re-calling the LLM. """ from __future__ import annotations +import json import logging from dataclasses import dataclass from pydantic import BaseModel, Field +from wikifi.cache import WalkCache, hash_section_notes +from wikifi.evidence import ( + Claim, + Contradiction, + EvidenceBundle, + SourceRef, + coalesce_refs, + render_section_body, +) from wikifi.providers.base import LLMProvider from wikifi.sections import PRIMARY_SECTIONS, Section from wikifi.wiki import WikiLayout, read_notes, write_section @@ -25,38 +46,77 @@ You are wikifi's section aggregator. You receive structured notes that an \ extractor pass collected from individual source files in a target codebase, \ along with the brief for one section of a tech-agnostic wiki. Synthesize a \ -clean markdown body for that section. +clean markdown body for that section *and* expose the evidence you used. + +Each note carries an `[index]` tag — when you make a claim that draws on \ +specific notes, list those indices in the corresponding `claim.source_indices`. \ +Indices are 1-based and refer to the numbered notes in the user prompt. Rules: - Tech-agnostic. Never mention specific languages, frameworks, or libraries. \ Translate every observation into domain terms. -- Coherent narrative — not a transcript of the notes. 
Merge duplicates, \ - resolve contradictions, organize by domain logic. -- Use markdown sub-headings, lists, and tables where they help the reader. -- If notes are sparse or contradictory, say so plainly. Better to declare a \ - gap than to invent content. -- Output the body only. Do not repeat the section title (the writer adds it). +- Coherent narrative — not a transcript of the notes. Merge consistent \ + statements, organize by domain logic. +- DO NOT silently resolve contradictions. If two notes assert incompatible \ + things about the same topic, emit a `contradictions[]` entry naming each \ + position and the source-note indices that support it. +- Use markdown sub-headings, lists, and tables in `body` where they help. +- Keep `body` focused on prose; the renderer adds the citation footer and \ + "Conflicts in source" block from the structured `claims`/`contradictions` \ + fields. Don't duplicate citations inline. +- If notes are sparse or contradictory, say so plainly rather than inventing. +- Output the body only (no top-level heading); the writer adds the title. 
""" +class AggregatedClaim(BaseModel): + """One claim the aggregator extracted, indexed against the input notes.""" + + text: str = Field(description="One assertion in the synthesized body.") + source_indices: list[int] = Field( + default_factory=list, + description="1-based indices of the notes that justify this claim.", + ) + + +class AggregatedContradiction(BaseModel): + summary: str = Field(description="One-sentence description of the disagreement.") + positions: list[AggregatedClaim] = Field( + default_factory=list, + description="Each disagreeing position, with its own supporting note indices.", + ) + + class SectionBody(BaseModel): - """The final markdown body for a section.""" + """The aggregator's structured output for a single section.""" body: str = Field(description="Markdown content for the section, no top-level heading.") + claims: list[AggregatedClaim] = Field(default_factory=list) + contradictions: list[AggregatedContradiction] = Field(default_factory=list) @dataclass class AggregationStats: sections_written: int = 0 sections_empty: int = 0 + sections_cached: int = 0 -def aggregate_all(*, layout: WikiLayout, provider: LLMProvider) -> AggregationStats: +def aggregate_all( + *, + layout: WikiLayout, + provider: LLMProvider, + cache: WalkCache | None = None, +) -> AggregationStats: """Aggregate every primary section from its accumulated notes. Derivative sections (personas, user stories, diagrams) are populated by `wikifi.deriver.derive_all` after this stage — they have no per-file notes to aggregate from. + + When ``cache`` is supplied and the section's note digest is unchanged + from the prior walk, the cached body and evidence are reused without + invoking the LLM. 
""" stats = AggregationStats() for section in PRIMARY_SECTIONS: @@ -65,20 +125,96 @@ def aggregate_all(*, layout: WikiLayout, provider: LLMProvider) -> AggregationSt write_section(layout, section, _empty_body(section)) stats.sections_empty += 1 continue + + notes_hash = hash_section_notes(notes) + if cache is not None: + cached = cache.lookup_aggregation(section.id, notes_hash) + if cached is not None: + bundle = EvidenceBundle( + body=cached.body, + claims=[Claim.model_validate(c) for c in cached.claims], + contradictions=[Contradiction.model_validate(c) for c in cached.contradictions], + ) + write_section(layout, section, render_section_body(bundle)) + stats.sections_cached += 1 + stats.sections_written += 1 + continue + try: - body = provider.complete_json( + structured = provider.complete_json( system=AGGREGATION_SYSTEM_PROMPT, user=_render_user_prompt(section, notes), schema=SectionBody, - ).body + ) + bundle = _bundle_from(structured, notes) + rendered = render_section_body(bundle) except Exception as exc: log.warning("aggregation failed for %s: %s", section.id, exc) - body = _fallback_body(section, notes, error=str(exc)) - write_section(layout, section, body) + rendered = _fallback_body(section, notes, error=str(exc)) + bundle = None + + write_section(layout, section, rendered) stats.sections_written += 1 + + if cache is not None and bundle is not None: + cache.record_aggregation( + section.id, + notes_hash=notes_hash, + body=bundle.body, + claims=[c.model_dump() for c in bundle.claims], + contradictions=[c.model_dump() for c in bundle.contradictions], + ) + return stats +def _bundle_from(structured: SectionBody, notes: list[dict]) -> EvidenceBundle: + """Resolve note indices into concrete :class:`SourceRef` lists.""" + note_refs = _refs_per_note(notes) + + def resolve(indices: list[int]) -> list[SourceRef]: + refs: list[SourceRef] = [] + for idx in indices: + real = idx - 1 + if 0 <= real < len(note_refs): + refs.extend(note_refs[real]) + return 
coalesce_refs(refs) + + claims = [Claim(text=c.text, sources=resolve(c.source_indices)) for c in structured.claims] + contradictions = [ + Contradiction( + summary=c.summary, + positions=[Claim(text=p.text, sources=resolve(p.source_indices)) for p in c.positions], + ) + for c in structured.contradictions + ] + return EvidenceBundle(body=structured.body, claims=claims, contradictions=contradictions) + + +def _refs_per_note(notes: list[dict]) -> list[list[SourceRef]]: + """Map each note to its source refs. + + Notes produced by the modern extractor carry a ``sources`` list; + older notes (or hand-written ones) fall back to a single SourceRef + derived from the ``file`` field. + """ + out: list[list[SourceRef]] = [] + for note in notes: + sources = note.get("sources") + if isinstance(sources, list) and sources: + try: + out.append([SourceRef.model_validate(s) for s in sources]) + continue + except Exception: # malformed sources — fall back to file + pass + file = note.get("file") + if file: + out.append([SourceRef(file=str(file))]) + else: + out.append([]) + return out + + def _render_user_prompt(section: Section, notes: list[dict]) -> str: lines: list[str] = [] lines.append(f"## Section: {section.title} (id: {section.id})") @@ -86,17 +222,38 @@ def _render_user_prompt(section: Section, notes: list[dict]) -> str: lines.append("### Brief") lines.append(section.description) lines.append("") - lines.append(f"### Notes from {len(notes)} file(s)") - for note in notes: + lines.append(f"### Notes from {len(notes)} file(s) — referenced by 1-based index in `source_indices`") + for idx, note in enumerate(notes, start=1): file_ref = note.get("file", "?") summary = note.get("summary", "") finding = note.get("finding", "") - lines.append(f"- [{file_ref}] (file role: {summary}) {finding}") + sources = note.get("sources") or [] + ranges = ", ".join(_format_source(s) for s in sources) if sources else file_ref + role = f" (file role: {summary})" if summary else "" + 
lines.append(f"[{idx}] {ranges}{role}: {finding}") lines.append("") - lines.append("Synthesize a coherent markdown body for this section. Follow the rules in the system prompt.") + lines.append( + "Synthesize a coherent markdown body for this section in `body`, " + "and populate `claims` (with the 1-based note indices that justify " + "each one) and `contradictions` for any disagreements. Follow the " + "rules in the system prompt." + ) return "\n".join(lines) +def _format_source(source: dict | SourceRef) -> str: + if isinstance(source, SourceRef): + return source.render() + file = source.get("file", "?") + lines = source.get("lines") + if not lines: + return file + if isinstance(lines, list | tuple) and len(lines) == 2: + start, end = lines + return f"{file}:{start}-{end}" if start != end else f"{file}:{start}" + return file + + def _empty_body(section: Section) -> str: return ( f"_No findings were extracted for **{section.title}** during the last walk._\n\n" @@ -117,3 +274,15 @@ def _fallback_body(section: Section, notes: list[dict], *, error: str) -> str: finding = note.get("finding", "") lines.append(f"- **{file_ref}** — {finding}") return "\n".join(lines) + + +__all__ = [ + "AGGREGATION_SYSTEM_PROMPT", + "AggregatedClaim", + "AggregatedContradiction", + "AggregationStats", + "SectionBody", + "aggregate_all", +] +# json kept for downstream debugging needs +_ = json diff --git a/wikifi/cache.py b/wikifi/cache.py new file mode 100644 index 0000000..bf43e24 --- /dev/null +++ b/wikifi/cache.py @@ -0,0 +1,340 @@ +"""Content-addressed cache for the walk pipeline. + +The cache turns a clean re-walk of a 50k-file legacy monorepo from "hours" +to "minutes-of-changed-files-only". Two scopes are persisted: + +- **Per-file extraction cache.** Keyed by ``(rel_path, file_fingerprint)``, + values are the list of structured findings the extractor produced. If a + file's bytes haven't changed since the last walk the cache entry is + reused verbatim and no LLM call is made. 
+- **Per-section aggregation cache.** Keyed by the SHA-256 of the section's + full notes payload (after extraction completes). If the notes payload + is bit-identical to last walk's, the cached markdown body is reused + rather than calling the aggregator again. + +Resumability falls out of the per-file cache for free: a walk that crashes +at file 8127/10000 picks up exactly where it left off because the previous +8126 files' fingerprints are still in the cache from the last successful +extraction call. + +Cache files live under ``.wikifi/.cache/`` so they share the wiki's +git-ignore rules but stay out of the section markdown that *is* committed. +""" + +from __future__ import annotations + +import json +import logging +from dataclasses import dataclass, field +from datetime import UTC, datetime +from pathlib import Path +from typing import Any + +from wikifi.wiki import CACHE_DIRNAME, WikiLayout + +log = logging.getLogger("wikifi.cache") + +EXTRACTION_CACHE_FILENAME = "extraction.json" +AGGREGATION_CACHE_FILENAME = "aggregation.json" +CACHE_VERSION = 1 # bump to invalidate every cache entry across upgrades + +# Re-exposed for callers that already import ``CACHE_DIRNAME`` from this +# module; the constant itself lives in :mod:`wikifi.wiki` next to the +# other layout names. 
+__all__ = [ + "CACHE_DIRNAME", + "AGGREGATION_CACHE_FILENAME", + "EXTRACTION_CACHE_FILENAME", + "CACHE_VERSION", + "CachedFindings", + "CachedSection", + "WalkCache", + "aggregation_cache_path", + "cache_dir", + "extraction_cache_path", + "hash_section_notes", + "load", + "reset", + "save", +] + + +@dataclass +class CachedFindings: + """Per-file findings recovered from cache.""" + + fingerprint: str + findings: list[dict[str, Any]] + summary: str = "" + chunks_processed: int = 0 + + +@dataclass +class CachedSection: + """Per-section aggregator output recovered from cache.""" + + notes_hash: str + body: str + claims: list[dict[str, Any]] = field(default_factory=list) + contradictions: list[dict[str, Any]] = field(default_factory=list) + + +@dataclass +class WalkCache: + """Mutable in-memory view of both caches; persisted via :func:`save`.""" + + extraction: dict[str, CachedFindings] = field(default_factory=dict) + aggregation: dict[str, CachedSection] = field(default_factory=dict) + extraction_hits: int = 0 + extraction_misses: int = 0 + aggregation_hits: int = 0 + aggregation_misses: int = 0 + + # ----- extraction scope ----- + + def lookup_extraction(self, rel_path: str, fingerprint: str) -> CachedFindings | None: + entry = self.extraction.get(rel_path) + if entry is None or entry.fingerprint != fingerprint: + self.extraction_misses += 1 + return None + self.extraction_hits += 1 + return entry + + def record_extraction( + self, + rel_path: str, + *, + fingerprint: str, + findings: list[dict[str, Any]], + summary: str, + chunks_processed: int, + ) -> None: + self.extraction[rel_path] = CachedFindings( + fingerprint=fingerprint, + findings=list(findings), + summary=summary, + chunks_processed=chunks_processed, + ) + + def forget_extraction(self, rel_path: str) -> None: + self.extraction.pop(rel_path, None) + + def prune_extraction(self, *, keep: set[str]) -> int: + """Drop cache entries for files no longer in scope. 
Returns count removed.""" + removed = [path for path in list(self.extraction) if path not in keep] + for path in removed: + del self.extraction[path] + return len(removed) + + # ----- aggregation scope ----- + + def lookup_aggregation(self, section_id: str, notes_hash: str) -> CachedSection | None: + entry = self.aggregation.get(section_id) + if entry is None or entry.notes_hash != notes_hash: + self.aggregation_misses += 1 + return None + self.aggregation_hits += 1 + return entry + + def record_aggregation( + self, + section_id: str, + *, + notes_hash: str, + body: str, + claims: list[dict[str, Any]] | None = None, + contradictions: list[dict[str, Any]] | None = None, + ) -> None: + self.aggregation[section_id] = CachedSection( + notes_hash=notes_hash, + body=body, + claims=list(claims or []), + contradictions=list(contradictions or []), + ) + + +# --------------------------------------------------------------------------- +# Persistence +# --------------------------------------------------------------------------- + + +def cache_dir(layout: WikiLayout) -> Path: + return layout.cache_dir + + +def extraction_cache_path(layout: WikiLayout) -> Path: + return cache_dir(layout) / EXTRACTION_CACHE_FILENAME + + +def aggregation_cache_path(layout: WikiLayout) -> Path: + return cache_dir(layout) / AGGREGATION_CACHE_FILENAME + + +def load(layout: WikiLayout) -> WalkCache: + """Load both caches from disk. 
Missing or invalid files yield an empty cache.""" + cache = WalkCache() + cache.extraction = _load_extraction(extraction_cache_path(layout)) + cache.aggregation = _load_aggregation(aggregation_cache_path(layout)) + return cache + + +def save(layout: WikiLayout, cache: WalkCache) -> None: + """Persist both caches atomically.""" + cache_dir(layout).mkdir(parents=True, exist_ok=True) + _atomic_write_json( + extraction_cache_path(layout), + { + "version": CACHE_VERSION, + "saved_at": datetime.now(UTC).isoformat(), + "entries": { + path: { + "fingerprint": entry.fingerprint, + "summary": entry.summary, + "chunks_processed": entry.chunks_processed, + "findings": entry.findings, + } + for path, entry in cache.extraction.items() + }, + }, + ) + _atomic_write_json( + aggregation_cache_path(layout), + { + "version": CACHE_VERSION, + "saved_at": datetime.now(UTC).isoformat(), + "entries": { + sid: { + "notes_hash": entry.notes_hash, + "body": entry.body, + "claims": entry.claims, + "contradictions": entry.contradictions, + } + for sid, entry in cache.aggregation.items() + }, + }, + ) + + +def reset(layout: WikiLayout) -> None: + """Delete every cache file. 
Triggered by `walk --no-cache` and tests.""" + for path in (extraction_cache_path(layout), aggregation_cache_path(layout)): + if path.exists(): + path.unlink() + + +def _atomic_write_json(path: Path, payload: dict[str, Any]) -> None: + tmp = path.with_suffix(path.suffix + ".tmp") + tmp.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8") + tmp.replace(path) + + +def _load_extraction(path: Path) -> dict[str, CachedFindings]: + raw = _load_json(path) + if not raw or raw.get("version") != CACHE_VERSION: + return {} + out: dict[str, CachedFindings] = {} + for rel, entry in raw.get("entries", {}).items(): + try: + out[rel] = CachedFindings( + fingerprint=entry["fingerprint"], + findings=list(entry.get("findings", [])), + summary=entry.get("summary", ""), + chunks_processed=int(entry.get("chunks_processed", 0)), + ) + except (KeyError, TypeError, ValueError) as exc: + log.warning("dropping malformed extraction cache entry %s: %s", rel, exc) + return out + + +def _load_aggregation(path: Path) -> dict[str, CachedSection]: + raw = _load_json(path) + if not raw or raw.get("version") != CACHE_VERSION: + return {} + out: dict[str, CachedSection] = {} + for sid, entry in raw.get("entries", {}).items(): + try: + out[sid] = CachedSection( + notes_hash=entry["notes_hash"], + body=entry.get("body", ""), + claims=list(entry.get("claims", [])), + contradictions=list(entry.get("contradictions", [])), + ) + except (KeyError, TypeError, ValueError) as exc: + log.warning("dropping malformed aggregation cache entry %s: %s", sid, exc) + return out + + +def _load_json(path: Path) -> dict[str, Any] | None: + if not path.exists(): + return None + try: + return json.loads(path.read_text(encoding="utf-8")) + except (OSError, json.JSONDecodeError) as exc: + log.warning("could not load cache at %s: %s; starting fresh", path, exc) + return None + + +# --------------------------------------------------------------------------- +# Hash helpers used at the section boundary 
+# --------------------------------------------------------------------------- + + +def hash_section_notes(notes: list[dict[str, Any]]) -> str: + """Stable digest of a section's note payload for aggregation cache keys. + + The hash spans the *content* fields the aggregator and renderer + actually rely on — file ref, summary, finding text, and the + structured ``sources`` list (file/lines/fingerprint per source). + Including ``sources`` is what keeps citation freshness honest: + when a referenced file's lines move or its fingerprint changes, + the cache misses and we re-aggregate against the new evidence + instead of replaying stale citations. + """ + from wikifi.fingerprint import hash_text + + payload = [ + { + "file": n.get("file", ""), + "summary": n.get("summary", ""), + "finding": n.get("finding", ""), + "sources": _normalize_sources(n.get("sources")), + } + for n in notes + ] + return hash_text(json.dumps(payload, ensure_ascii=False, sort_keys=True)) + + +def _normalize_sources(sources: Any) -> list[dict[str, Any]]: + """Render the ``sources`` list into a stable dict shape for hashing. + + Notes vary in how ``sources`` is stored — a list of dicts from the + JSONL store, a list of Pydantic models from in-memory paths, or + missing entirely on legacy notes. Coerce each entry to the same + ``{file, lines, fingerprint}`` shape so the hash is stable across + code paths. + """ + if not sources: + return [] + out: list[dict[str, Any]] = [] + for src in sources: + if isinstance(src, dict): + file = src.get("file", "") + lines = src.get("lines") + fingerprint = src.get("fingerprint", "") + else: + file = getattr(src, "file", "") + lines = getattr(src, "lines", None) + fingerprint = getattr(src, "fingerprint", "") + # Tuples and lists both serialize the same in JSON, but coerce + # to a list so two notes with identical (start, end) ranges + # produce identical bytes regardless of representation. 
+ normalized_lines: list[int] | None + if lines is None: + normalized_lines = None + else: + try: + normalized_lines = [int(lines[0]), int(lines[1])] + except (TypeError, ValueError, IndexError): + normalized_lines = None + out.append({"file": file, "lines": normalized_lines, "fingerprint": fingerprint or ""}) + return out diff --git a/wikifi/cli.py b/wikifi/cli.py index e9231d0..8207421 100644 --- a/wikifi/cli.py +++ b/wikifi/cli.py @@ -5,6 +5,7 @@ - ``wikifi init`` — scaffold the ``.wikifi/`` directory in CWD - ``wikifi walk`` — run the full Stage 1→2→3→4 pipeline against CWD - ``wikifi chat`` — interactive REPL with ``.wikifi/`` content as context +- ``wikifi report`` — coverage + quality report on the wiki """ from __future__ import annotations @@ -15,13 +16,16 @@ import typer from rich.console import Console +from rich.markdown import Markdown from rich.panel import Panel from rich.table import Table from wikifi import __version__ +from wikifi.cache import reset as reset_cache from wikifi.chat import run_repl -from wikifi.config import get_settings +from wikifi.config import get_settings, load_target_settings from wikifi.orchestrator import build_provider, init_wiki, run_walk +from wikifi.report import build_report from wikifi.wiki import WikiLayout app = typer.Typer( @@ -79,13 +83,41 @@ def init(target: TargetArg = None) -> None: @app.command() -def walk(target: TargetArg = None) -> None: +def walk( + target: TargetArg = None, + no_cache: Annotated[ + bool, typer.Option("--no-cache", help="Force a clean re-walk; drop the on-disk cache.") + ] = False, + review: Annotated[ + bool, + typer.Option("--review/--no-review", help="Run the critic + reviser loop on derivative sections."), + ] = False, + provider: Annotated[ + str | None, + typer.Option( + "--provider", + help="Override the configured provider for this walk ('ollama' | 'anthropic' | 'openai').", + ), + ] = None, +) -> None: """Walk the target codebase and populate every wiki section.""" target = target 
or Path.cwd() - settings = get_settings() + settings = load_target_settings(target) + if no_cache: + settings = settings.model_copy(update={"use_cache": False}) + reset_cache(WikiLayout(root=target)) + if review: + settings = settings.model_copy(update={"review_derivatives": True}) + if provider: + settings = settings.model_copy(update={"provider": provider}) + console.print( Panel.fit( - f"[bold]wikifi walk[/bold] — target=[cyan]{target}[/cyan] model=[cyan]{settings.model}[/cyan]", + f"[bold]wikifi walk[/bold] — target=[cyan]{target}[/cyan] " + f"provider=[cyan]{settings.provider}[/cyan] model=[cyan]{settings.model}[/cyan]\n" + f"cache=[cyan]{settings.use_cache}[/cyan] graph=[cyan]{settings.use_graph}[/cyan] " + f"specialized=[cyan]{settings.use_specialized_extractors}[/cyan] " + f"review=[cyan]{settings.review_derivatives}[/cyan]", title="starting", ) ) @@ -100,21 +132,27 @@ def walk(target: TargetArg = None) -> None: f"exclude={len(report.introspection.exclude)} " f"langs={', '.join(report.introspection.primary_languages) or '?'}", ) - table.add_row( - "2. Extraction", + extraction_row = ( f"seen={report.extraction.files_seen} " f"contributed={report.extraction.files_with_findings} " f"findings={report.extraction.findings_total} " - f"skipped={report.extraction.files_skipped}", + f"skipped={report.extraction.files_skipped} " + f"cache_hits={report.extraction.cache_hits} " + f"specialized={report.extraction.specialized_files}" ) + table.add_row("2. Extraction", extraction_row) table.add_row( "3. Aggregation", - f"sections_written={report.aggregation.sections_written} sections_empty={report.aggregation.sections_empty}", + f"sections_written={report.aggregation.sections_written} " + f"sections_empty={report.aggregation.sections_empty} " + f"sections_cached={report.aggregation.sections_cached}", ) - table.add_row( - "4. 
Derivation", - f"sections_derived={report.derivation.sections_derived} sections_skipped={report.derivation.sections_skipped}", + derivation_row = ( + f"sections_derived={report.derivation.sections_derived} " + f"sections_skipped={report.derivation.sections_skipped} " + f"sections_revised={report.derivation.sections_revised}" ) + table.add_row("4. Derivation", derivation_row) console.print(table) console.print(f"\n[green]Done.[/green] Wiki at [bold]{target}/.wikifi/[/bold]") @@ -131,11 +169,35 @@ def chat(target: TargetArg = None) -> None: ) raise typer.Exit(code=1) - settings = get_settings() + settings = load_target_settings(target) provider = build_provider(settings) run_repl(layout=layout, provider=provider, console=console) +@app.command() +def report( + target: TargetArg = None, + score: Annotated[ + bool, + typer.Option("--score/--no-score", help="Run the critic on every populated section for quality scoring."), + ] = False, +) -> None: + """Print a coverage + quality report for the wiki at ``target``.""" + target = target or Path.cwd() + layout = WikiLayout(root=target) + if not layout.wiki_dir.exists(): + console.print( + f"[red]No .wikifi/ directory at {target}.[/red] " + "Run [bold]wikifi init[/bold] and [bold]wikifi walk[/bold] first." + ) + raise typer.Exit(code=1) + + settings = load_target_settings(target) + provider = build_provider(settings) if score else None + wiki_report = build_report(layout=layout, provider=provider, score=score) + console.print(Markdown(wiki_report.render())) + + def main() -> None: """Entry point referenced by [project.scripts] in pyproject.toml.""" app() diff --git a/wikifi/config.py b/wikifi/config.py index fad9fb5..977b5a6 100644 --- a/wikifi/config.py +++ b/wikifi/config.py @@ -1,16 +1,42 @@ -"""Runtime settings loaded from environment / .env. +"""Runtime settings loaded from environment / .env / target's .wikifi/config.toml. Defaults assume a local Ollama server with qwen3.6:27b. 
Override any field via -WIKIFI_* env vars or a .env file in the target project's CWD. +WIKIFI_* env vars, a .env file in the target project's CWD, or by writing +provider/model entries into ``/.wikifi/config.toml`` (the file +``wikifi init`` scaffolds — and what callers expect to be authoritative for +that wiki). + +Resolution order, highest precedence first: + +1. ``/.wikifi/config.toml`` +2. ``WIKIFI_*`` environment variables (and ``.env``) +3. Field defaults + +The wiki's own ``config.toml`` wins over per-session env vars: a wiki +initialized for a hosted backend should still drive its own runs even +when the user happens to have ``WIKIFI_PROVIDER=ollama`` exported in +their shell. This matches the contract printed at the top of every +generated ``config.toml`` ("overrides WIKIFI_* environment variables +when present"). + +Hosted providers are opt-in: +- ``WIKIFI_PROVIDER=anthropic`` (plus ``ANTHROPIC_API_KEY``) +- ``WIKIFI_PROVIDER=openai`` (plus ``OPENAI_API_KEY``) """ from __future__ import annotations +import logging +import tomllib from functools import lru_cache +from pathlib import Path +from typing import Any from pydantic import Field from pydantic_settings import BaseSettings, SettingsConfigDict +log = logging.getLogger("wikifi.config") + class Settings(BaseSettings): model_config = SettingsConfigDict( @@ -20,7 +46,10 @@ class Settings(BaseSettings): extra="ignore", ) - provider: str = Field(default="ollama", description="LLM provider id; only 'ollama' in v1") + provider: str = Field( + default="ollama", + description="LLM provider id; 'ollama' (default), 'anthropic', or 'openai'", + ) model: str = Field(default="qwen3.6:27b", description="Model identifier passed to the provider") ollama_host: str = Field(default="http://localhost:11434", description="Ollama HTTP endpoint") request_timeout: float = Field(default=900.0, description="Per-request timeout in seconds") @@ -50,18 +79,134 @@ class Settings(BaseSettings): description="Skip files whose stripped 
content is shorter than this (avoids thinking runaway on stubs)", ) introspection_depth: int = Field(default=3, description="Tree depth fed to the introspection pass") - # Thinking mode for reasoning-capable models (Qwen3, DeepSeek-R1, etc.). + # Thinking mode for reasoning-capable models (Qwen3, DeepSeek-R1, Anthropic). # Default 'high' — wikifi prioritizes wiki quality over walk wall-time. - # Higher thinking levels produce noticeably better domain abstraction and - # cleaner Gherkin in the derivative pass; expect 1–3 minutes per real - # file on a local 27B model. The min_content_bytes guard keeps the - # thinking-runaway-on-stubs failure mode at bay. - # Accepted values: 'low' / 'medium' / 'high' (Qwen3-style); True - # (DeepSeek-style); False to opt out entirely (only safe with non- - # thinking models — Qwen3 ignores `format=` when thinking is off). + # On Anthropic, this maps to adaptive thinking + the equivalent + # ``effort`` level (low/medium/high/max). think: bool | str = Field(default="high", description="Thinking-mode level for reasoning models") + # ----- Premium pipeline knobs ----- + + use_cache: bool = Field( + default=True, + description=( + "Reuse the per-file extraction + per-section aggregation caches across walks. " + "Disable to force a clean re-walk." + ), + ) + use_graph: bool = Field( + default=True, + description=( + "Build an import/reference graph and feed each file's neighborhood into the " + "extraction prompt. Disable to fall back to per-file isolated extraction." + ), + ) + use_specialized_extractors: bool = Field( + default=True, + description=( + "Route schema files (SQL, OpenAPI, Protobuf, GraphQL, migrations) through " + "deterministic extractors that bypass the LLM." + ), + ) + review_derivatives: bool = Field( + default=False, + description=( + "Run the critic + reviser loop on derivative sections (personas, user stories, " + "diagrams). Adds 2 LLM calls per derivative section but materially improves " + "groundedness. 
Off by default to keep walk wall-time predictable." + ), + ) + review_min_score: int = Field( + default=7, + description="Minimum critic score below which the reviser is invoked.", + ) + + # ----- Anthropic provider knobs ----- + + anthropic_api_key: str | None = Field( + default=None, + description=("Explicit Anthropic API key. Falls back to ANTHROPIC_API_KEY in the environment when unset."), + ) + anthropic_max_tokens: int = Field( + default=32_000, + description=( + "Per-call output token cap for the Anthropic provider. " + "Adaptive thinking at ``effort=high`` can consume substantial " + "output budget; 32K leaves comfortable headroom for the wiki " + "section schemas while staying under the SDK's non-streaming " + "HTTP timeout guard. Premium-effort callers (xhigh/max) " + "should bump higher and enable streaming." + ), + ) + + # ----- OpenAI provider knobs ----- + + openai_api_key: str | None = Field( + default=None, + description=("Explicit OpenAI API key. Falls back to OPENAI_API_KEY in the environment when unset."), + ) + openai_base_url: str | None = Field( + default=None, + description=("Explicit OpenAI base URL (for Azure-OpenAI / proxies). Defaults to api.openai.com."), + ) + openai_max_tokens: int = Field( + default=16_000, + description="Per-call output token cap for the OpenAI provider.", + ) + @lru_cache def get_settings() -> Settings: return Settings() + + +def reset_settings_cache() -> None: + """Drop the cached :class:`Settings` instance so env changes take effect. + + Used by tests that mutate ``WIKIFI_*`` env vars between cases. + """ + get_settings.cache_clear() + + +# Field names a wiki's ``config.toml`` is allowed to override. We accept +# only the fields ``wikifi init`` writes today (provider, model, +# ollama_host) so a stale or hand-edited config can't silently start +# overriding behavior the user didn't sign up for. 
+_TARGET_CONFIG_FIELDS: frozenset[str] = frozenset({"provider", "model", "ollama_host"}) + + +def load_target_settings(target: Path) -> Settings: + """Return :class:`Settings` for a wiki at ``target``. + + Reads ``/.wikifi/config.toml`` (when present) and layers + its values on top of the env-derived defaults — the wiki's own + config wins over per-session env vars, matching the contract + printed at the top of every generated ``config.toml``. + + Without this, ``wikifi report --score `` (and the other + target-aware commands) would build a provider from the process-wide + defaults regardless of what the target wiki was actually + initialized with — fine when target equals CWD, but wrong when the + user is operating against another project's wiki. + """ + base = get_settings() + overrides = _read_target_config(target) + if not overrides: + return base + effective: dict[str, Any] = {field: value for field, value in overrides.items() if field in _TARGET_CONFIG_FIELDS} + if not effective: + return base + return base.model_copy(update=effective) + + +def _read_target_config(target: Path) -> dict[str, Any]: + """Parse ``/.wikifi/config.toml``; return ``{}`` on any failure.""" + config_path = target / ".wikifi" / "config.toml" + if not config_path.exists(): + return {} + try: + with config_path.open("rb") as handle: + return tomllib.load(handle) + except (OSError, tomllib.TOMLDecodeError) as exc: + log.warning("could not read %s: %s; falling back to env-only settings", config_path, exc) + return {} diff --git a/wikifi/critic.py b/wikifi/critic.py new file mode 100644 index 0000000..4666baf --- /dev/null +++ b/wikifi/critic.py @@ -0,0 +1,242 @@ +"""Section-quality critic. + +Two consumers: + +- :func:`review_section` runs a *critic + reviser* loop on a synthesized + section body. The critic scores the body against its brief and the + upstream evidence, identifying unsupported claims and gaps. 
If the + score falls below ``min_score`` the reviser is invoked once with the + critique to produce an improved body. This catches the bulk of + hallucination and missing-coverage failures on derivative sections, + where a single-shot synthesis is most error-prone. +- :func:`score_wiki` walks every section in the wiki and produces a + rubric-style report (used by ``wikifi report``). + +The two paths share a single Pydantic schema (:class:`Critique`) so the +provider implementation can cache the system prompt across both. +""" + +from __future__ import annotations + +import logging +from collections.abc import Mapping +from dataclasses import dataclass, field + +from pydantic import BaseModel, Field + +from wikifi.providers.base import LLMProvider +from wikifi.sections import Section + +log = logging.getLogger("wikifi.critic") + + +CRITIC_SYSTEM_PROMPT = """\ +You are wikifi's quality critic. You receive (a) the brief for a section of \ +a technology-agnostic wiki, (b) the synthesized markdown body, and \ +optionally (c) the upstream evidence the body was supposed to derive from. \ +You score the body on a 0–10 rubric and identify concrete improvements. + +Rubric: +- 9–10: tech-agnostic, fully grounded in evidence, narratively coherent, \ + no unsupported claims, no obvious gaps against the brief. +- 6–8: largely sound but with one or more issues — minor unsupported \ + claims, awkward narrative, or missed coverage of brief items. +- 3–5: substantial gaps, several unsupported claims, or partial coverage. +- 0–2: incoherent, dominated by speculation, or off-brief. + +Be specific in `unsupported_claims` and `gaps`. A migration team will use \ +your critique to decide whether the section is ready to ship. +""" + + +REVISER_SYSTEM_PROMPT = """\ +You are wikifi's section reviser. You receive (a) the section brief, \ +(b) the prior body, (c) a critique flagging unsupported claims and gaps, \ +and (d) the upstream evidence available. 
Produce a revised body that \ +addresses every flagged issue. Stay tech-agnostic. Do not invent claims \ +the upstreams cannot support — declare gaps explicitly when evidence is \ +missing. Output the body only, no top-level heading. +""" + + +class Critique(BaseModel): + """Structured critic output.""" + + score: int = Field(ge=0, le=10, description="Overall quality score (0–10).") + summary: str = Field(default="", description="One- or two-sentence overall judgment.") + unsupported_claims: list[str] = Field( + default_factory=list, + description="Statements in the body not supported by the upstream evidence.", + ) + gaps: list[str] = Field( + default_factory=list, + description="Brief items the body fails to cover.", + ) + suggestions: list[str] = Field( + default_factory=list, + description="Concrete edits the reviser should make.", + ) + + +class RevisedBody(BaseModel): + body: str = Field(description="Revised markdown body for the section.") + + +@dataclass +class ReviewOutcome: + section_id: str + initial: Critique + body: str + revised: bool = False + final: Critique | None = None + + +@dataclass +class WikiQualityReport: + overall_score: float + critiques: dict[str, Critique] = field(default_factory=dict) + coverage: CoverageStats | None = None + + +def review_section( + *, + section: Section, + body: str, + upstream_evidence: Mapping[str, str] | None, + provider: LLMProvider, + min_score: int = 7, +) -> ReviewOutcome: + """Critique → optionally revise → critique again. 
Returns the outcome.""" + initial = _critique(section=section, body=body, upstream=upstream_evidence, provider=provider) + outcome = ReviewOutcome(section_id=section.id, initial=initial, body=body) + if initial.score >= min_score or not (initial.unsupported_claims or initial.gaps): + return outcome + + try: + revised = provider.complete_json( + system=REVISER_SYSTEM_PROMPT, + user=_render_revise_prompt(section, body, initial, upstream_evidence), + schema=RevisedBody, + ) + except Exception as exc: + log.warning("reviser failed for %s: %s", section.id, exc) + return outcome + + follow_up = _critique(section=section, body=revised.body, upstream=upstream_evidence, provider=provider) + # Only accept the revision if it actually improved the score; otherwise + # keep the original to avoid regressions caused by a confused reviser. + if follow_up.score >= initial.score: + outcome.body = revised.body + outcome.revised = True + outcome.final = follow_up + else: + log.info( + "discarding revision for %s — score dropped from %d to %d", + section.id, + initial.score, + follow_up.score, + ) + return outcome + + +def _critique( + *, + section: Section, + body: str, + upstream: Mapping[str, str] | None, + provider: LLMProvider, +) -> Critique: + user = _render_critique_prompt(section, body, upstream) + try: + return provider.complete_json(system=CRITIC_SYSTEM_PROMPT, user=user, schema=Critique) + except Exception as exc: + log.warning("critic failed for %s: %s", section.id, exc) + return Critique(score=0, summary=f"Critic unavailable ({exc}).") + + +def _render_critique_prompt( + section: Section, + body: str, + upstream: Mapping[str, str] | None, +) -> str: + parts = [ + f"## Section: {section.title} (id: {section.id})", + "", + "### Brief", + section.description, + "", + "### Body to evaluate", + "```markdown", + body.strip() or "(empty body)", + "```", + ] + if upstream: + parts += ["", "### Upstream evidence available"] + for upstream_id, content in upstream.items(): + 
parts.append(f"#### {upstream_id}") + parts.append("```markdown") + parts.append(content.strip()) + parts.append("```") + parts.append("") + parts.append("Score the body and list unsupported claims, gaps, and suggested edits.") + return "\n".join(parts) + + +def _render_revise_prompt( + section: Section, + body: str, + critique: Critique, + upstream: Mapping[str, str] | None, +) -> str: + parts = [ + f"## Section: {section.title} (id: {section.id})", + "", + "### Brief", + section.description, + "", + "### Prior body", + "```markdown", + body.strip() or "(empty)", + "```", + "", + "### Critique", + f"score: {critique.score}/10", + ] + if critique.unsupported_claims: + parts.append("Unsupported claims to remove or qualify:") + parts += [f"- {c}" for c in critique.unsupported_claims] + if critique.gaps: + parts.append("Gaps to fill (only when evidence allows):") + parts += [f"- {g}" for g in critique.gaps] + if critique.suggestions: + parts.append("Suggested edits:") + parts += [f"- {s}" for s in critique.suggestions] + if upstream: + parts += ["", "### Upstream evidence"] + for upstream_id, content in upstream.items(): + parts.append(f"#### {upstream_id}") + parts.append("```markdown") + parts.append(content.strip()) + parts.append("```") + parts.append("") + parts.append("Output the revised body only.") + return "\n".join(parts) + + +# --------------------------------------------------------------------------- +# Coverage stats — populated by the extractor + aggregator caches and +# rendered by `wikifi report`. 
+# --------------------------------------------------------------------------- + + +@dataclass +class CoverageStats: + files_total: int + files_with_findings: int + findings_per_section: dict[str, int] + files_per_section: dict[str, int] + + def coverage_pct(self) -> float: + if self.files_total == 0: + return 0.0 + return round(100.0 * self.files_with_findings / self.files_total, 1) diff --git a/wikifi/deriver.py b/wikifi/deriver.py index b5a11bc..9c4223b 100644 --- a/wikifi/deriver.py +++ b/wikifi/deriver.py @@ -18,10 +18,11 @@ from __future__ import annotations import logging -from dataclasses import dataclass +from dataclasses import dataclass, field from pydantic import BaseModel, Field +from wikifi.critic import ReviewOutcome, review_section from wikifi.providers.base import LLMProvider from wikifi.sections import DERIVATIVE_SECTIONS, SECTIONS_BY_ID, Section from wikifi.wiki import WikiLayout, write_section @@ -60,10 +61,24 @@ class DerivedSection(BaseModel): class DerivationStats: sections_derived: int = 0 sections_skipped: int = 0 - - -def derive_all(*, layout: WikiLayout, provider: LLMProvider) -> DerivationStats: - """Synthesize every derivative section from its upstream primary sections.""" + sections_revised: int = 0 + review_outcomes: list[ReviewOutcome] = field(default_factory=list) + + +def derive_all( + *, + layout: WikiLayout, + provider: LLMProvider, + review: bool = False, + review_min_score: int = 7, +) -> DerivationStats: + """Synthesize every derivative section from its upstream primary sections. + + With ``review=True`` each derivative is run through the critic + + reviser loop after synthesis. The critic loop is the highest-leverage + quality lever for derivative sections — personas and Gherkin stories + are exactly where single-shot synthesis tends to hallucinate. 
+ """ stats = DerivationStats() for section in DERIVATIVE_SECTIONS: upstream_bodies = _collect_upstream(layout, section) @@ -85,6 +100,20 @@ def derive_all(*, layout: WikiLayout, provider: LLMProvider) -> DerivationStats: except Exception as exc: log.warning("derivation failed for %s: %s", section.id, exc) body = _fallback_body(section, upstream_bodies, error=str(exc)) + + if review: + outcome = review_section( + section=section, + body=body, + upstream_evidence=upstream_bodies, + provider=provider, + min_score=review_min_score, + ) + body = outcome.body + stats.review_outcomes.append(outcome) + if outcome.revised: + stats.sections_revised += 1 + write_section(layout, section, body) stats.sections_derived += 1 return stats diff --git a/wikifi/evidence.py b/wikifi/evidence.py new file mode 100644 index 0000000..66d6eb8 --- /dev/null +++ b/wikifi/evidence.py @@ -0,0 +1,229 @@ +"""Evidence model: source references, claims, and contradictions. + +A premium migration wiki must let an architect ask, for any sentence in the +wiki, *"where in the source did this come from?"* — and get a precise, +verifiable answer. This module defines the small structured types that +carry that answer end-to-end: + +- :class:`SourceRef` — a single ``(file, lines, fingerprint)`` pointer back + to the codebase. Lines are optional because not every claim has a line + range (e.g. cross-cutting findings that span a whole module). +- :class:`Claim` — one assertion in a section's narrative, with the source + refs that justify it. The aggregator emits one or more claims per + section; the renderer converts them into citation-bearing markdown. +- :class:`Contradiction` — two or more claims that disagree, surfaced + rather than silently merged. Migration teams treat contradictions as + high-priority signals: legacy systems hide tribal knowledge in them. + +Citations are rendered as compact footnote-style markers (``[1]``, ``[2]``, +…) with an explicit "Sources" footer at the bottom of each section. 
Lines +are included when known (``path/to/file.py:42-87``). +""" + +from __future__ import annotations + +from dataclasses import dataclass + +from pydantic import BaseModel, Field + + +class SourceRef(BaseModel): + """A pointer back to a single span of source code.""" + + file: str = Field(description="Repo-relative path of the source file.") + lines: tuple[int, int] | None = Field( + default=None, + description="Optional inclusive (start, end) line range within the file.", + ) + fingerprint: str = Field( + default="", + description="Short content hash captured at extraction time. Empty when unknown.", + ) + + def render(self) -> str: + """Render as ``path:start-end`` (or just ``path`` when lines unknown).""" + if self.lines is None: + return self.file + start, end = self.lines + if start == end: + return f"{self.file}:{start}" + return f"{self.file}:{start}-{end}" + + +class Claim(BaseModel): + """A single assertion the aggregator places in a section, with sources.""" + + text: str = Field(description="Markdown sentence(s) asserting one fact.") + sources: list[SourceRef] = Field( + default_factory=list, + description="Files/lines that support this claim. 
Empty means unsupported.", + ) + + def supported(self) -> bool: + return bool(self.sources) + + +class Contradiction(BaseModel): + """Two or more conflicting claims about the same topic.""" + + summary: str = Field(description="One-sentence description of the conflict.") + positions: list[Claim] = Field( + default_factory=list, + description="Each disagreeing position, with its own sources.", + ) + + +class EvidenceBundle(BaseModel): + """The aggregator's structured output for a single section.""" + + body: str = Field(description="Markdown narrative for the section.") + claims: list[Claim] = Field(default_factory=list) + contradictions: list[Contradiction] = Field(default_factory=list) + + +# --------------------------------------------------------------------------- +# Rendering helpers +# --------------------------------------------------------------------------- + + +@dataclass +class _Numbered: + index: int + ref: SourceRef + + +def render_section_body(bundle: EvidenceBundle) -> str: + """Render an EvidenceBundle into final markdown. + + The body is appended with a "Sources" footer enumerating every distinct + source ref across claims and contradictions, plus an explicit + "Conflicts in source" section if any contradictions were surfaced. + + When the bundle carries supported claims, each claim's footnote + markers (``[1]``, ``[2]``…) are appended to the body — either next + to the matching sentence (when the claim text appears verbatim in + the body) or as a "Supporting claims" list when the body is a + paraphrase. Without this the reader has the source list at the + bottom but no way to tell which sentence each source backs up. 
+ """ + sources = _enumerate_sources(bundle) + source_index_for: dict[str, int] = {entry.ref.render(): entry.index for entry in sources} + + parts: list[str] = [] + body_with_markers = _annotate_body_with_markers(bundle, source_index_for) + if body_with_markers.strip(): + parts.append(body_with_markers.strip()) + + unmatched_claims = [c for c in bundle.claims if c.sources and not _claim_text_in_body(c, bundle.body)] + if unmatched_claims: + parts.append("") + parts.append("## Supporting claims") + for claim in unmatched_claims: + markers = _markers_for(claim.sources, source_index_for) + suffix = f" {markers}" if markers else "" + parts.append(f"- {claim.text.strip()}{suffix}") + + if bundle.contradictions: + parts.append("") + parts.append("## Conflicts in source") + parts.append( + "_The walker found disagreements across files. Migration teams " + "should resolve these before re-implementation._" + ) + for entry in bundle.contradictions: + parts.append("") + parts.append(f"- **{entry.summary.strip()}**") + for position in entry.positions: + refs = _format_refs(position.sources) + parts.append(f" - {position.text.strip()} {refs}".rstrip()) + + if sources: + parts.append("") + parts.append("## Sources") + for entry in sources: + parts.append(f"{entry.index}. 
`{entry.ref.render()}`") + + return "\n".join(parts).strip() + + +def _format_refs(refs: list[SourceRef]) -> str: + if not refs: + return "" + rendered = ", ".join(f"`{ref.render()}`" for ref in refs) + return f"({rendered})" + + +def _enumerate_sources(bundle: EvidenceBundle) -> list[_Numbered]: + seen: dict[str, _Numbered] = {} + next_index = 1 + iterables: list[list[SourceRef]] = [c.sources for c in bundle.claims] + for entry in bundle.contradictions: + for position in entry.positions: + iterables.append(position.sources) + for refs in iterables: + for ref in refs: + key = ref.render() + if key not in seen: + seen[key] = _Numbered(index=next_index, ref=ref) + next_index += 1 + return list(seen.values()) + + +def _markers_for(refs: list[SourceRef], source_index_for: dict[str, int]) -> str: + """Return the bracketed footnote markers for a list of source refs.""" + indices: list[int] = [] + seen: set[int] = set() + for ref in refs: + idx = source_index_for.get(ref.render()) + if idx is not None and idx not in seen: + seen.add(idx) + indices.append(idx) + if not indices: + return "" + return "".join(f"[{i}]" for i in indices) + + +def _claim_text_in_body(claim: Claim, body: str) -> bool: + """True when the claim's exact text appears in the body, modulo whitespace.""" + needle = " ".join(claim.text.split()) + haystack = " ".join(body.split()) + return bool(needle) and needle in haystack + + +def _annotate_body_with_markers(bundle: EvidenceBundle, source_index_for: dict[str, int]) -> str: + """Append claim-level markers next to matching sentences in the body. + + Conservative substring match: only annotate when the claim's text + appears verbatim in the body. If the aggregator paraphrased, the + claim falls through to the "Supporting claims" list rather than + getting attached to the wrong sentence. 
+ """ + if not bundle.body or not bundle.claims: + return bundle.body + annotated = bundle.body + for claim in bundle.claims: + if not claim.sources: + continue + # ``_claim_text_in_body`` is the gate that decides "match" vs. + # "paraphrase"; we use the same predicate here so a claim + # classified as inline-matchable always actually gets inlined, + # never silently dropped between the two passes. + if not _claim_text_in_body(claim, annotated): + continue + markers = _markers_for(claim.sources, source_index_for) + if not markers: + continue + text = claim.text.strip() + if text and text in annotated and markers not in annotated: + annotated = annotated.replace(text, text + markers, 1) + return annotated + + +def coalesce_refs(refs: list[SourceRef]) -> list[SourceRef]: + """Deduplicate refs by rendered form, preserving first-seen order.""" + seen: dict[str, SourceRef] = {} + for ref in refs: + key = ref.render() + if key not in seen: + seen[key] = ref + return list(seen.values()) diff --git a/wikifi/extractor.py b/wikifi/extractor.py index 34ffa9f..08adc8c 100644 --- a/wikifi/extractor.py +++ b/wikifi/extractor.py @@ -2,25 +2,42 @@ Given the include/exclude decision from Stage 1, walk each file deterministically and ask the LLM what intent-bearing content it contributes to each capture -section. Results are appended to per-section JSONL note files for the aggregator. - -The contract: one LLM call per file *or* one call per overlapping chunk for -files that exceed the per-call window. Output is validated against a strict -Pydantic schema. Files that can't be read or validated are recorded as skipped -findings rather than crashing the walk. +section. Results are appended to per-section JSONL note files for the +aggregator. + +Three orthogonal mechanisms make this stage premium-grade: + +1. 
**Content-addressed cache.** Each file is fingerprinted; if its fingerprint + matches a cached entry, the LLM call is skipped entirely and cached + findings are replayed into the notes store. This is what makes a re-walk + of a 50k-file legacy monorepo finish in minutes. +2. **Cross-file context.** A repo-wide import graph (built once, before + extraction starts) supplies each file's neighborhood to the prompt so + findings can describe inter-file flows. +3. **Type-aware specialization.** Files classified as SQL, OpenAPI, + Protobuf, GraphQL, or migrations bypass the LLM entirely and run + through deterministic extractors that read the structure directly. + +Every emitted finding carries a structured :class:`SourceRef` so the +aggregator can stitch citations back into the rendered wiki. """ from __future__ import annotations import logging -from collections.abc import Iterable -from dataclasses import dataclass +from collections.abc import Callable, Iterable +from dataclasses import dataclass, field from pathlib import Path from pydantic import BaseModel, Field +from wikifi.cache import WalkCache +from wikifi.evidence import SourceRef +from wikifi.fingerprint import hash_file from wikifi.providers.base import LLMProvider +from wikifi.repograph import FileKind, RepoGraph, classify from wikifi.sections import PRIMARY_SECTION_IDS, PRIMARY_SECTIONS +from wikifi.specialized.dispatch import select as select_specialized from wikifi.wiki import WikiLayout, append_note log = logging.getLogger("wikifi.extractor") @@ -52,6 +69,14 @@ the same finding to appear twice — that's deliberate context, not duplication \ to invent around. +When the user prompt names neighbor files (files this one imports from or is \ +imported by), you may reference those relationships when describing flows that \ +cross file boundaries. Do not fabricate flows that aren't visible in the chunk. + +Each finding can carry an optional list of supporting line ranges within \ +this file. 
Provide them when you can; omit them when the contribution is \ +diffuse across the chunk. + Only emit findings for these section ids: {_SECTION_LIST} Section briefs: @@ -73,6 +98,10 @@ class SectionFinding(BaseModel): section_id: str = Field(description=f"Must be one of: {_SECTION_LIST}") finding: str = Field(description="Tech-agnostic markdown describing the contribution. 1-5 sentences.") + line_range: tuple[int, int] | None = Field( + default=None, + description="Optional inclusive (start, end) line range within the chunk supporting this finding.", + ) class FileFindings(BaseModel): @@ -89,6 +118,9 @@ class ExtractionStats: findings_total: int = 0 files_skipped: int = 0 chunks_processed: int = 0 + cache_hits: int = 0 + specialized_files: int = 0 + files_kinds: dict[str, int] = field(default_factory=dict) def extract_repo( @@ -99,6 +131,10 @@ def extract_repo( repo_root: Path, chunk_size_bytes: int = 150_000, chunk_overlap_bytes: int = 8_000, + cache: WalkCache | None = None, + graph: RepoGraph | None = None, + persist_cache: Callable[[], None] | None = None, + use_specialized_extractors: bool = True, ) -> ExtractionStats: """Walk the supplied files and append per-section findings to the notes store. @@ -108,12 +144,19 @@ def extract_repo( chunk produces one LLM call; identical findings emerging from the overlap region are deduplicated per file so a single declaration isn't double-counted. + + When a ``cache`` is supplied, files whose content fingerprint matches a + cached entry skip the LLM call entirely and replay the cached findings. + When ``persist_cache`` is supplied, it is invoked after each file + finishes — that turns crash-resumability into a free property of the + cache layer. 
""" stats = ExtractionStats() valid_ids = set(PRIMARY_SECTION_IDS) for rel in files: stats.files_seen += 1 + log.info("- extracting: ./%s", rel.as_posix()) full = repo_root / rel try: data = full.read_text(encoding="utf-8", errors="replace") @@ -122,12 +165,89 @@ def extract_repo( stats.files_skipped += 1 continue + try: + fingerprint = hash_file(full) + except OSError: + fingerprint = "" + + kind = classify(rel, sample=data[:4096]) + kind_label = kind.value + stats.files_kinds[kind_label] = stats.files_kinds.get(kind_label, 0) + 1 + + # ---- cache hit ---- + if cache is not None and fingerprint: + cached = cache.lookup_extraction(rel.as_posix(), fingerprint) + if cached is not None: + file_had_findings = _replay_cached(layout, rel, cached, valid_ids, stats) + if file_had_findings: + stats.files_with_findings += 1 + stats.cache_hits += 1 + if persist_cache is not None: + persist_cache() + continue + + # ---- specialized routing ---- + specialized_fn = select_specialized(kind, rel_path=rel.as_posix()) if use_specialized_extractors else None + if specialized_fn is not None: + stats.specialized_files += 1 + try: + result = specialized_fn(rel.as_posix(), data) + except Exception as exc: # specialized failures don't kill the walk + log.warning("specialized extraction failed for %s: %s", rel, exc) + stats.files_skipped += 1 + continue + + cached_findings = [] + file_had_findings = False + for finding in result.findings: + if finding.section_id not in valid_ids: + continue + note = _build_note( + rel=rel, + summary=result.summary, + finding_text=finding.finding, + sources=finding.sources, + extractor=f"specialized:{kind_label}", + ) + append_note(layout, finding.section_id, note) + cached_findings.append( + { + "section_id": finding.section_id, + "finding": finding.finding, + "sources": [s.model_dump() for s in finding.sources], + } + ) + stats.findings_total += 1 + file_had_findings = True + if file_had_findings: + stats.files_with_findings += 1 + if cache is not None 
and fingerprint: + cache.record_extraction( + rel.as_posix(), + fingerprint=fingerprint, + findings=cached_findings, + summary=result.summary, + chunks_processed=0, + ) + if persist_cache is not None: + persist_cache() + continue + + # ---- LLM extraction path ---- chunks = _chunk_text(data, chunk_size=chunk_size_bytes, overlap=chunk_overlap_bytes) total_chunks = len(chunks) file_had_findings = False any_chunk_failed = False seen_findings: set[tuple[str, str]] = set() latest_summary = "" + cached_findings: list[dict] = [] + chunks_done = 0 + + neighbors = graph.neighbor_paths(rel.as_posix()) if graph is not None else [] + + # Track each chunk's starting line so finding line_ranges can be + # mapped back to absolute file lines for the citation. + chunk_offsets = _chunk_line_offsets(data, chunks) for chunk_index, chunk_body in enumerate(chunks): try: @@ -138,6 +258,7 @@ def extract_repo( body=chunk_body, chunk_index=chunk_index, total_chunks=total_chunks, + neighbors=neighbors, ), schema=FileFindings, ) @@ -153,9 +274,11 @@ def extract_repo( continue stats.chunks_processed += 1 + chunks_done += 1 if chunk_findings.summary: latest_summary = chunk_findings.summary + chunk_line_offset = chunk_offsets[chunk_index] for finding in chunk_findings.findings: if finding.section_id not in valid_ids: continue @@ -164,15 +287,29 @@ def extract_repo( continue seen_findings.add(key) - note: dict[str, object] = { - "file": rel.as_posix(), - "summary": latest_summary, - "finding": finding.finding, - } - if total_chunks > 1: - note["chunk"] = chunk_index - note["chunks"] = total_chunks + line_range: tuple[int, int] | None = None + if finding.line_range is not None: + start, end = finding.line_range + line_range = (start + chunk_line_offset, end + chunk_line_offset) + + sources = [SourceRef(file=rel.as_posix(), lines=line_range, fingerprint=fingerprint)] + note = _build_note( + rel=rel, + summary=latest_summary, + finding_text=finding.finding, + sources=sources, + 
extractor=f"llm:{kind_label}", + chunk_index=chunk_index, + total_chunks=total_chunks, + ) append_note(layout, finding.section_id, note) + cached_findings.append( + { + "section_id": finding.section_id, + "finding": finding.finding, + "sources": [s.model_dump() for s in sources], + } + ) stats.findings_total += 1 file_had_findings = True @@ -184,10 +321,78 @@ def extract_repo( # chunked files lose some chunks we still keep what we got. stats.files_skipped += 1 + if cache is not None and fingerprint and chunks_done > 0: + cache.record_extraction( + rel.as_posix(), + fingerprint=fingerprint, + findings=cached_findings, + summary=latest_summary, + chunks_processed=chunks_done, + ) + if persist_cache is not None: + persist_cache() + return stats -def _render_user_prompt(*, rel: Path, body: str, chunk_index: int = 0, total_chunks: int = 1) -> str: +def _replay_cached( + layout: WikiLayout, + rel: Path, + cached, + valid_ids: set[str], + stats: ExtractionStats, +) -> bool: + """Re-emit cached findings into the notes store. 
Returns True if any landed.""" + file_had_findings = False + for entry in cached.findings: + section_id = entry.get("section_id", "") + if section_id not in valid_ids: + continue + sources = [SourceRef(**s) for s in entry.get("sources", [])] + note = _build_note( + rel=rel, + summary=cached.summary, + finding_text=entry.get("finding", ""), + sources=sources, + extractor="cache", + ) + append_note(layout, section_id, note) + stats.findings_total += 1 + file_had_findings = True + return file_had_findings + + +def _build_note( + *, + rel: Path, + summary: str, + finding_text: str, + sources: list[SourceRef], + extractor: str, + chunk_index: int | None = None, + total_chunks: int | None = None, +) -> dict[str, object]: + note: dict[str, object] = { + "file": rel.as_posix(), + "summary": summary, + "finding": finding_text, + "sources": [s.model_dump() for s in sources], + "extractor": extractor, + } + if total_chunks is not None and total_chunks > 1: + note["chunk"] = chunk_index + note["chunks"] = total_chunks + return note + + +def _render_user_prompt( + *, + rel: Path, + body: str, + chunk_index: int = 0, + total_chunks: int = 1, + neighbors: list[str] | None = None, +) -> str: if total_chunks > 1: chunk_header = ( f"Chunk: {chunk_index + 1} of {total_chunks} " @@ -196,15 +401,26 @@ def _render_user_prompt(*, rel: Path, body: str, chunk_index: int = 0, total_chu ) else: chunk_header = "" + neighbor_block = "" + if neighbors: + neighbor_lines = "\n".join(f" - {n}" for n in neighbors[:8]) + neighbor_block = ( + "Neighbor files (this file imports from or is imported by these — " + "feel free to mention cross-file relationships when supported by the chunk):\n" + f"{neighbor_lines}\n\n" + ) return ( f"File path: {rel.as_posix()}\n\n" + f"{neighbor_block}" f"{chunk_header}" "File contents:\n" "```\n" f"{body}\n" "```\n\n" "Return findings strictly in the FileFindings schema. Use section ids " - f"only from: {_SECTION_LIST}." + f"only from: {_SECTION_LIST}. 
Provide ``line_range`` as an inclusive " + "(start, end) pair *within this chunk* whenever the contribution is " + "tied to a specific span; omit it for diffuse contributions." ) @@ -242,6 +458,28 @@ def _chunk_text(text: str, *, chunk_size: int, overlap: int) -> list[str]: return overlapped +def _chunk_line_offsets(text: str, chunks: list[str]) -> list[int]: + """Return the starting line number (0-indexed offset) of each chunk + within ``text``. Used to translate per-chunk line ranges into absolute + file line ranges for citations. + """ + offsets: list[int] = [] + cursor = 0 + for chunk in chunks: + idx = text.find(chunk, cursor) + if idx < 0: + # Overlap or aggressive splitting can shift the search window; + # fall back to a global find. Worst case: line offsets are + # approximate, which is acceptable for citation purposes. + idx = text.find(chunk) + if idx < 0: + offsets.append(0) + continue + offsets.append(text.count("\n", 0, idx)) + cursor = idx + max(1, len(chunk) // 2) # advance past most of this chunk + return offsets + + def _recursive_split(text: str, *, chunk_size: int, separators: list[str]) -> list[str]: """Split ``text`` so every chunk fits within ``chunk_size``, trying each separator in priority order. The empty-string separator is the terminal @@ -284,3 +522,8 @@ def _recursive_split(text: str, *, chunk_size: int, separators: list[str]) -> li if current: chunks.append(current) return chunks + + +def classify_file(rel_path: Path, sample: str) -> FileKind: + """Public re-export so callers don't need to import :mod:`repograph`.""" + return classify(rel_path, sample=sample) diff --git a/wikifi/fingerprint.py b/wikifi/fingerprint.py new file mode 100644 index 0000000..69a4dfa --- /dev/null +++ b/wikifi/fingerprint.py @@ -0,0 +1,48 @@ +"""Stable content fingerprints for files and synthesized text. 
+ +Used by three subsystems: + +- :mod:`wikifi.cache` keys cached extraction findings by ``hash(file_bytes)`` + and cached aggregations by ``hash(notes_payload)``. +- :mod:`wikifi.evidence` cites source files by ``(path, fingerprint, lines)`` + so a migration team can verify the wiki claim survives a re-walk. +- :mod:`wikifi.repograph` records each file's fingerprint alongside its + import edges so cross-file context invalidates correctly when source + changes. + +Fingerprints are short hex prefixes of SHA-256: enough entropy to +distinguish every file in any realistic repository (~20 million files +before a 50% collision chance with a 12-char prefix), and short enough +to render comfortably inline in citations. +""" + +from __future__ import annotations + +import hashlib +from pathlib import Path + +# Twelve hex chars = 48 bits of entropy. Using a prefix (rather than the +# full digest) keeps citations readable while leaving margin against +# collisions on any realistic codebase. +FINGERPRINT_LENGTH = 12 + + +def hash_text(text: str) -> str: + """Return a stable short fingerprint for a string.""" + digest = hashlib.sha256(text.encode("utf-8", errors="replace")).hexdigest() + return digest[:FINGERPRINT_LENGTH] + + +def hash_bytes(data: bytes) -> str: + """Return a stable short fingerprint for raw bytes.""" + digest = hashlib.sha256(data).hexdigest() + return digest[:FINGERPRINT_LENGTH] + + +def hash_file(path: Path) -> str: + """Return the fingerprint of the file at ``path``. + + Reads the file as bytes (not text) so the same fingerprint is produced + regardless of how the cache or extractor later decodes it. + """ + return hash_bytes(path.read_bytes()) diff --git a/wikifi/orchestrator.py b/wikifi/orchestrator.py index 1923ecc..4069ffa 100644 --- a/wikifi/orchestrator.py +++ b/wikifi/orchestrator.py @@ -1,13 +1,19 @@ """End-to-end pipeline that wires Stage 1 → Stage 2 → Stage 3 → Stage 4. -The CLI calls into ``init_wiki`` and ``run_walk``.
Both accept a target root -and a configured provider so tests can substitute a mock provider trivially. +The CLI calls into ``init_wiki``, ``run_walk``, and ``run_report``. Each +accepts a target root and a configured provider so tests can substitute +a mock provider trivially. - Stage 1: LLM introspection of repo structure (`introspection.introspect`) -- Stage 2: deterministic per-file extraction → JSONL notes (`extractor.extract_repo`) -- Stage 3: per-section aggregation of primary sections (`aggregator.aggregate_all`) -- Stage 4: derivation of personas/user_stories/diagrams from primary section - bodies (`deriver.derive_all`) +- Stage 1.5: lightweight static analysis (`repograph.build_graph`) when + ``settings.use_graph`` is set +- Stage 2: deterministic per-file extraction → JSONL notes + (`extractor.extract_repo`), with caching, specialized routing, and + cross-file context if available +- Stage 3: per-section aggregation of primary sections + (`aggregator.aggregate_all`), with section-level cache +- Stage 4: derivation of personas/user_stories/diagrams from primary + section bodies (`deriver.derive_all`), with optional critic loop """ from __future__ import annotations @@ -17,12 +23,17 @@ from pathlib import Path from wikifi.aggregator import AggregationStats, aggregate_all +from wikifi.cache import WalkCache +from wikifi.cache import load as load_cache +from wikifi.cache import reset as reset_cache +from wikifi.cache import save as save_cache from wikifi.config import Settings from wikifi.deriver import DerivationStats, derive_all from wikifi.extractor import ExtractionStats, extract_repo from wikifi.introspection import IntrospectionResult, introspect from wikifi.providers.base import LLMProvider from wikifi.providers.ollama_provider import OllamaProvider +from wikifi.repograph import RepoGraph, build_graph from wikifi.walker import WalkConfig, iter_files from wikifi.wiki import WikiLayout, initialize, reset_notes @@ -46,6 +57,8 @@ class WalkReport: 
extraction: ExtractionStats aggregation: AggregationStats derivation: DerivationStats + cache: WalkCache | None = None + graph: RepoGraph | None = None def run_walk( @@ -83,9 +96,30 @@ def run_walk( min_content_bytes=settings.min_content_bytes, ) + files = list(iter_files(walk_config)) + + cache: WalkCache | None = None + if settings.use_cache: + cache = load_cache(layout) + # Drop cache entries for files that fell out of scope so the + # cache size tracks the live in-scope set. + in_scope = {p.as_posix() for p in files} + cache.prune_extraction(keep=in_scope) + else: + reset_cache(layout) + + graph: RepoGraph | None = None + if settings.use_graph: + log.info("stage 1.5: building repo import graph") + graph = build_graph(repo_root=root, files=files) + log.info("stage 2: extracting per-file findings") reset_notes(layout) - files = list(iter_files(walk_config)) + + def _persist() -> None: + if cache is not None: + save_cache(layout, cache) + extraction = extract_repo( layout=layout, provider=provider, @@ -93,29 +127,93 @@ def run_walk( repo_root=root, chunk_size_bytes=settings.chunk_size_bytes, chunk_overlap_bytes=settings.chunk_overlap_bytes, + cache=cache, + graph=graph, + persist_cache=_persist if cache is not None else None, + use_specialized_extractors=settings.use_specialized_extractors, ) log.info("stage 3: aggregating primary sections") - aggregation = aggregate_all(layout=layout, provider=provider) + aggregation = aggregate_all(layout=layout, provider=provider, cache=cache) log.info("stage 4: deriving personas, user stories, and diagrams") - derivation = derive_all(layout=layout, provider=provider) + derivation = derive_all( + layout=layout, + provider=provider, + review=settings.review_derivatives, + review_min_score=settings.review_min_score, + ) + + if cache is not None: + save_cache(layout, cache) return WalkReport( introspection=introspection, extraction=extraction, aggregation=aggregation, derivation=derivation, + cache=cache, + graph=graph, ) def 
build_provider(settings: Settings) -> LLMProvider: - """Construct the configured provider. Currently Ollama is the only backend.""" - if settings.provider != "ollama": - raise ValueError(f"unknown provider {settings.provider!r}; only 'ollama' is supported in v1") - return OllamaProvider( - model=settings.model, - host=settings.ollama_host, - timeout=settings.request_timeout, - think=settings.think, - ) + """Construct the configured provider. + + Local Ollama is the default. Hosted backends are opt-in via + ``WIKIFI_PROVIDER=anthropic`` (plus ``ANTHROPIC_API_KEY``) or + ``WIKIFI_PROVIDER=openai`` (plus ``OPENAI_API_KEY``). + """ + if settings.provider == "ollama": + return OllamaProvider( + model=settings.model, + host=settings.ollama_host, + timeout=settings.request_timeout, + think=settings.think, + ) + if settings.provider == "anthropic": + from wikifi.providers.anthropic_provider import AnthropicProvider + + # When users opt in to Anthropic but leave the Ollama default + # model id in place, swap to a sensible Claude default rather + # than 404 on the model name. + model = settings.model if settings.model.startswith("claude-") else "claude-opus-4-7" + return AnthropicProvider( + model=model, + api_key=settings.anthropic_api_key, + timeout=settings.request_timeout, + max_tokens=settings.anthropic_max_tokens, + think=settings.think, + ) + if settings.provider == "openai": + from wikifi.providers.openai_provider import OpenAIProvider + + # Same default-swap guard as the Anthropic path, but inverted: + # only swap when the model id is *obviously* an Ollama + # identifier (the user opted into openai but forgot to update + # WIKIFI_MODEL). Anything else passes through unchanged so + # Azure-OpenAI / proxy deployments — which use arbitrary + # deployment IDs like ``prod-gpt4o`` or ``eastus-chat`` that + # don't match the upstream OpenAI naming convention — keep + # working. 
+ model = "gpt-4o" if _looks_like_ollama_model(settings.model) else settings.model + return OpenAIProvider( + model=model, + api_key=settings.openai_api_key, + base_url=settings.openai_base_url, + timeout=settings.request_timeout, + max_tokens=settings.openai_max_tokens, + think=settings.think, + ) + raise ValueError(f"unknown provider {settings.provider!r}; expected 'ollama', 'anthropic', or 'openai'") + + +def _looks_like_ollama_model(model: str) -> bool: + """Heuristic — Ollama uses ``family:tag`` (e.g. ``qwen3.6:27b``). + + Fine-tuned OpenAI models also contain ``:`` (``ft:gpt-4o:...``) + so we exclude that prefix. Anything else without a ``:`` — + upstream OpenAI ids, Azure deployment names, plain proxy aliases — + is left alone. + """ + return ":" in model and not model.lower().startswith("ft:") diff --git a/wikifi/providers/anthropic_provider.py b/wikifi/providers/anthropic_provider.py new file mode 100644 index 0000000..134d5cc --- /dev/null +++ b/wikifi/providers/anthropic_provider.py @@ -0,0 +1,261 @@ +"""Anthropic-backed implementation of :class:`LLMProvider`. + +This is the premium / hosted path. Wikifi's pipeline reuses the same +multi-KB system prompt across hundreds of per-file extraction calls; the +defining design choice here is to mark that prompt with +``cache_control: {"type": "ephemeral"}`` so subsequent calls served by +the same cache breakpoint pay ~10% of the input price (cache read) instead +of full price every time. Without that, hosted Anthropic is uneconomical +on a 10k-file codebase walk; with it, the cost story competes with +local Ollama at materially better extraction quality. + +Three design notes worth flagging: + +1. **Structured output via ``messages.parse``.** The Pydantic schema is + converted to JSON Schema by the SDK and the model returns a + pre-validated instance. This is the SDK's recommended path for + structured outputs (see ``claude-api`` skill, *Structured Outputs*) — + we don't hand-roll tool_use blocks for this. +2. 
**Adaptive thinking + effort.** Opus 4.7 (the recommended default) + supports only adaptive thinking and exposes ``effort`` for depth. + Sampling parameters (``temperature``, ``top_p``, ``top_k``) are + removed on 4.7 and would 400 if sent — we omit them entirely. The + ``think`` knob mirrors the Ollama provider's interface so the rest + of the codebase doesn't branch on provider. +3. **Errors map to ``RuntimeError``.** The aggregator/extractor/deriver + already catch broad ``Exception`` per call; mapping + ``anthropic.APIError`` (and friends) into a plain ``RuntimeError`` + with the request id keeps the pipeline's existing fallback paths + working unchanged. +""" + +from __future__ import annotations + +import logging +import os +from typing import Any, TypeVar + +from pydantic import BaseModel + +from wikifi.providers.base import ChatMessage, LLMProvider + +try: # the dep is declared in pyproject.toml, but importing lazily yields + # a clearer error if a user installs without extras. + import anthropic +except ImportError as exc: # pragma: no cover - import error path + raise ImportError( + "wikifi.providers.anthropic_provider requires the `anthropic` package. " + "Install via `uv add anthropic` or include the [hosted] extras." + ) from exc + + +T = TypeVar("T", bound=BaseModel) +log = logging.getLogger("wikifi.providers.anthropic") + + +# Default model — opus 4.7 is the most capable for migration-grade +# domain extraction. Override per-walk via `WIKIFI_MODEL` env or +# `.wikifi/config.toml`. +DEFAULT_MODEL = "claude-opus-4-7" + +# Default per-call max output tokens. Adaptive thinking at ``effort=high`` +# can consume substantial output budget on its own; if ``max_tokens`` is too +# tight, the model burns its allowance on the thinking trace and the +# structured-output block comes back empty (``parsed_output is None`` and +# no text content). 
32K leaves comfortable headroom for any of the wiki +# section schemas while staying under the SDK's non-streaming HTTP timeout +# guard. Premium-effort callers ("xhigh"/"max") should bump higher and +# enable streaming — see Anthropic's Opus 4.7 migration notes. +DEFAULT_MAX_TOKENS = 32_000 + + +ThinkLevel = bool | str | None + + +class AnthropicProvider(LLMProvider): + """Hosted-Claude implementation of the wikifi provider protocol.""" + + name = "anthropic" + + def __init__( + self, + *, + model: str = DEFAULT_MODEL, + api_key: str | None = None, + timeout: float = 900.0, + max_tokens: int = DEFAULT_MAX_TOKENS, + think: ThinkLevel = "high", + cache_system_prompt: bool = True, + client: Any | None = None, + ) -> None: + self.model = model + self.timeout = timeout + self.max_tokens = max_tokens + self.think = think + self.cache_system_prompt = cache_system_prompt + if client is not None: + # Tests pass an injected mock; preserve the duck-typed surface. + self._client = client + else: + api_key = api_key or os.environ.get("ANTHROPIC_API_KEY") + self._client = anthropic.Anthropic(api_key=api_key, timeout=timeout) + + # ------------------------------------------------------------------ + # Provider protocol + # ------------------------------------------------------------------ + + def complete_json(self, *, system: str, user: str, schema: type[T]) -> T: + """Return a ``schema``-validated Pydantic instance. + + Uses ``messages.parse`` so the SDK runs JSON-Schema-constrained + decoding and returns the parsed Pydantic model directly. The + system prompt is wrapped in a single text block with + ``cache_control`` so successive per-file calls hit the prompt + cache. 
+ """ + try: + response = self._client.messages.parse( + model=self.model, + max_tokens=self.max_tokens, + system=self._render_system(system), + messages=[{"role": "user", "content": user}], + output_format=schema, + **self._thinking_kwargs(), + ) + except anthropic.APIError as exc: + raise RuntimeError(self.format_api_error(self.name, exc)) from exc + + parsed = getattr(response, "parsed_output", None) + if parsed is not None: + return parsed # type: ignore[return-value] + # The SDK couldn't parse a structured instance. Try the raw text + # block (covers refusals where the model emitted text rather than + # the structured form) before raising. + text = _first_text(response) + if text: + try: + return schema.model_validate_json(text) + except Exception as exc: # pragma: no cover - defensive path + raise RuntimeError( + f"anthropic provider: parsed_output missing and JSON validation of response text failed: {exc}" + ) from exc + # No parsed output and no text — typically the thinking trace + # consumed the entire output budget. Surface stop_reason and + # usage so the caller knows whether to raise ``max_tokens``, + # lower ``effort``, or look at a refusal. + raise RuntimeError(_empty_response_message(response, self.max_tokens)) + + def complete_text(self, *, system: str, user: str) -> str: + """Return the model's free-text response.""" + try: + response = self._client.messages.create( + model=self.model, + max_tokens=self.max_tokens, + system=self._render_system(system), + messages=[{"role": "user", "content": user}], + **self._thinking_kwargs(), + ) + except anthropic.APIError as exc: + raise RuntimeError(self.format_api_error(self.name, exc)) from exc + return _first_text(response) or "" + + def chat(self, *, system: str, messages: list[ChatMessage]) -> str: + """Multi-turn chat. 
The system prompt is cached; the running + message history follows it (and is therefore not cached itself + beyond the prefix-match window; see the prompt-caching guide + in the ``claude-api`` skill).""" + try: + response = self._client.messages.create( + model=self.model, + max_tokens=self.max_tokens, + system=self._render_system(system), + messages=list(messages), + **self._thinking_kwargs(), + ) + except anthropic.APIError as exc: + raise RuntimeError(self.format_api_error(self.name, exc)) from exc + return _first_text(response) or "" + + # ------------------------------------------------------------------ + # Helpers + # ------------------------------------------------------------------ + + def _render_system(self, system: str) -> list[dict[str, Any]] | str: + """Wrap ``system`` in a single text block with ``cache_control``. + + Returning a list (not a string) is what enables the cache mark. + Wikifi's per-file system prompt is large and identical across + every Stage 2 / Stage 3 / Stage 4 call; the cache hit on the + 2nd-Nth request is the entire cost story for hosted runs. + """ + if not self.cache_system_prompt: + return system + return [ + { + "type": "text", + "text": system, + "cache_control": {"type": "ephemeral"}, + } + ] + + def _thinking_kwargs(self) -> dict[str, Any]: + """Translate ``think`` into Anthropic's adaptive-thinking config. + + - ``False`` / ``"off"`` / ``"none"`` → thinking disabled. + - ``"low"`` / ``"medium"`` / ``"high"`` / ``"xhigh"`` / ``"max"`` → adaptive + thinking with the corresponding ``effort``. Wikifi defaults + to ``"high"`` since the walk is bounded; bump to ``"max"`` for + intelligence-critical migrations. + - ``True`` / unrecognized string → adaptive thinking, no + ``effort`` override (SDK default).
+ """ + if self.think is False or self.think in {"off", "none"}: + return {"thinking": {"type": "disabled"}} + if isinstance(self.think, str) and self.think.lower() in {"low", "medium", "high", "xhigh", "max"}: + return { + "thinking": {"type": "adaptive"}, + "output_config": {"effort": self.think.lower()}, + } + return {"thinking": {"type": "adaptive"}} + + +def _first_text(response: Any) -> str: + """Pull the first text block out of a Messages response. + + Tolerates the SDK shape (``response.content`` is a list of typed + blocks) and a duck-typed mock (a list of dicts). + """ + content = getattr(response, "content", None) + if not content: + return "" + for block in content: + block_type = getattr(block, "type", None) or (block.get("type") if isinstance(block, dict) else None) + if block_type == "text": + text = getattr(block, "text", None) or (block.get("text") if isinstance(block, dict) else None) + if text: + return text + return "" + + +def _empty_response_message(response: Any, max_tokens: int) -> str: + """Diagnose an empty structured response with stop_reason + usage. + + The dominant cause is adaptive thinking consuming the entire + ``max_tokens`` budget before the structured output block is + produced. Surface the operational knobs (``max_tokens``, + ``effort``) so the caller sees the fix at the failure site. 
+ """ + stop_reason = getattr(response, "stop_reason", None) + usage = getattr(response, "usage", None) + output_tokens = getattr(usage, "output_tokens", None) if usage is not None else None + parts = [ + "anthropic provider: empty structured response (no parsed_output, no text block)", + f"stop_reason={stop_reason!r}", + f"output_tokens={output_tokens}", + f"max_tokens={max_tokens}", + ] + if stop_reason == "max_tokens" or (output_tokens is not None and output_tokens >= max_tokens): + parts.append("hint: thinking likely consumed the budget — raise max_tokens or lower think/effort") + elif stop_reason == "refusal": + parts.append("hint: model refused; the input may need rewording") + return " | ".join(parts) diff --git a/wikifi/providers/base.py b/wikifi/providers/base.py index a99fbdd..cec0740 100644 --- a/wikifi/providers/base.py +++ b/wikifi/providers/base.py @@ -1,4 +1,4 @@ -"""LLM provider protocol. +"""LLM provider abstract base class. Wikifi calls a provider in three modes: @@ -11,13 +11,16 @@ message list. Used by the ``wikifi chat`` REPL where conversation history carries between turns. -The protocol is deliberately minimal so swapping providers (Ollama → hosted -APIs → mock) is a one-class change. +The base class is deliberately minimal so swapping providers (Ollama → hosted +APIs → mock) is a one-class change. Concrete subclasses inherit nominally so +``isinstance(p, LLMProvider)`` works and ``ABC`` enforces the three call +surfaces at construction time. """ from __future__ import annotations -from typing import Protocol, TypedDict, TypeVar +from abc import ABC, abstractmethod +from typing import TypedDict, TypeVar from pydantic import BaseModel @@ -29,18 +32,38 @@ class ChatMessage(TypedDict): content: str -class LLMProvider(Protocol): +class LLMProvider(ABC): + """Nominal base class every backend implements. + + Subclasses set the class-level ``name`` (provider id) and assign + ``self.model`` in ``__init__``. 
The three abstract methods are the + full contract — wikifi never calls anything else on a provider. + """ + name: str model: str + @abstractmethod def complete_json(self, *, system: str, user: str, schema: type[T]) -> T: """Return an instance of ``schema`` populated by the model.""" - ... + @abstractmethod def complete_text(self, *, system: str, user: str) -> str: """Return the model's text response verbatim.""" - ... + @abstractmethod def chat(self, *, system: str, messages: list[ChatMessage]) -> str: """Run a multi-turn exchange and return the assistant's next reply.""" - ... + + @staticmethod + def format_api_error(provider_name: str, exc: Exception) -> str: + """Render a vendor APIError with the request id, when present. + + Shared by hosted providers (Anthropic, OpenAI) so the diagnostic + format is consistent across backends. + """ + request_id = getattr(exc, "request_id", None) + msg = getattr(exc, "message", None) or str(exc) + if request_id: + return f"{provider_name} provider failed ({request_id}): {msg}" + return f"{provider_name} provider failed: {msg}" diff --git a/wikifi/providers/ollama_provider.py b/wikifi/providers/ollama_provider.py index 1c85ca9..b52f591 100644 --- a/wikifi/providers/ollama_provider.py +++ b/wikifi/providers/ollama_provider.py @@ -36,14 +36,14 @@ from ollama import Client from pydantic import BaseModel -from wikifi.providers.base import ChatMessage +from wikifi.providers.base import ChatMessage, LLMProvider T = TypeVar("T", bound=BaseModel) ThinkLevel = bool | str | None -class OllamaProvider: +class OllamaProvider(LLMProvider): name = "ollama" def __init__( diff --git a/wikifi/providers/openai_provider.py b/wikifi/providers/openai_provider.py new file mode 100644 index 0000000..68b3ea6 --- /dev/null +++ b/wikifi/providers/openai_provider.py @@ -0,0 +1,232 @@ +"""OpenAI-backed implementation of :class:`LLMProvider`. 
+ +The third provider, alongside :mod:`wikifi.providers.ollama_provider` +(local default) and :mod:`wikifi.providers.anthropic_provider` (hosted +Claude). Selected via ``WIKIFI_PROVIDER=openai`` plus an +``OPENAI_API_KEY``. + +Three implementation notes worth flagging: + +1. **Structured output via ``chat.completions.parse``.** The Pydantic + schema is converted to a JSON Schema by the SDK and the model + returns a pre-validated instance. This is OpenAI's GA path for + schema-constrained decoding; we don't hand-roll function calls. +2. **Prompt caching is automatic.** Unlike Anthropic, OpenAI does not + require a ``cache_control`` marker — the API caches identical + prefixes (≥ 1024 tokens) for ~5–10 minutes automatically. We keep + the system prompt at message position 0 so wikifi's repeated multi-KB + extraction prompt is what gets cached. +3. **Reasoning effort.** Reasoning-capable models (o1, o3, o4, gpt-5 + families) accept a ``reasoning_effort`` parameter that mirrors + wikifi's ``think`` knob. Non-reasoning models silently ignore the + parameter, so we route the knob through whenever a reasoning level + is set and skip it on plain models to avoid surfacing a 400 if a + future SDK starts validating it. +""" + +from __future__ import annotations + +import logging +import os +import re +from typing import Any, TypeVar + +from pydantic import BaseModel + +from wikifi.providers.base import ChatMessage, LLMProvider + +try: + import openai +except ImportError as exc: # pragma: no cover - import error path + raise ImportError( + "wikifi.providers.openai_provider requires the `openai` package. " + "Install via `uv add openai` or include the [hosted] extras." + ) from exc + + +T = TypeVar("T", bound=BaseModel) +log = logging.getLogger("wikifi.providers.openai") + + +# Default model — gpt-4o is the most stable, broadly-available +# structured-output capable model. Override per-walk via ``WIKIFI_MODEL`` +# env or ``.wikifi/config.toml`` (e.g. 
set to a reasoning model like +# ``o3-mini`` or ``gpt-5`` to opt into the reasoning_effort path). +DEFAULT_MODEL = "gpt-4o" + +# Default per-call output token cap. wikifi's structured findings are +# small relative to the input; 16K leaves headroom for any of the +# section schemas without crossing the SDK's HTTP timeout guard. +DEFAULT_MAX_TOKENS = 16_000 + + +# Names that match a reasoning-capable model family. We inspect the +# model id by prefix because OpenAI's lineup is too volatile to +# enumerate exactly. Anything matching gets ``reasoning_effort`` +# forwarded; anything else has it stripped from the request. +_REASONING_MODEL_RE = re.compile(r"^(o\d|gpt-5)", re.IGNORECASE) + + +ThinkLevel = bool | str | None + + +class OpenAIProvider(LLMProvider): + """Hosted-OpenAI implementation of the wikifi provider protocol.""" + + name = "openai" + + def __init__( + self, + *, + model: str = DEFAULT_MODEL, + api_key: str | None = None, + base_url: str | None = None, + timeout: float = 900.0, + max_tokens: int = DEFAULT_MAX_TOKENS, + think: ThinkLevel = "high", + client: Any | None = None, + ) -> None: + self.model = model + self.timeout = timeout + self.max_tokens = max_tokens + self.think = think + if client is not None: + # Tests pass an injected mock; preserve the duck-typed surface. + self._client = client + else: + api_key = api_key or os.environ.get("OPENAI_API_KEY") + self._client = openai.OpenAI( + api_key=api_key, + base_url=base_url, + timeout=timeout, + ) + + # ------------------------------------------------------------------ + # Provider protocol + # ------------------------------------------------------------------ + + def complete_json(self, *, system: str, user: str, schema: type[T]) -> T: + """Return a ``schema``-validated Pydantic instance. + + Uses ``chat.completions.parse`` so the SDK runs JSON-Schema- + constrained decoding and returns the parsed Pydantic model + directly. 
The system prompt sits at position 0 so OpenAI's + automatic prefix cache catches the repeated multi-KB extraction + prompt across per-file calls. + """ + try: + response = self._client.chat.completions.parse( + model=self.model, + messages=[ + {"role": "system", "content": system}, + {"role": "user", "content": user}, + ], + response_format=schema, + **self._token_kwargs(), + **self._reasoning_kwargs(), + ) + except openai.APIError as exc: + raise RuntimeError(self.format_api_error(self.name, exc)) from exc + + parsed = _first_parsed(response) + if parsed is None: + # Defensive fallback: if the SDK couldn't parse (refusal, + # truncation), schema-validate the raw JSON text. Keeps the + # protocol's "raise on failure" contract intact rather than + # returning a None. + text = _first_text(response) + try: + return schema.model_validate_json(text) + except Exception as exc: # pragma: no cover - defensive path + raise RuntimeError(f"openai provider: empty parsed and validate fallback failed: {exc}") from exc + return parsed + + def complete_text(self, *, system: str, user: str) -> str: + """Return the model's free-text response.""" + try: + response = self._client.chat.completions.create( + model=self.model, + messages=[ + {"role": "system", "content": system}, + {"role": "user", "content": user}, + ], + **self._token_kwargs(), + **self._reasoning_kwargs(), + ) + except openai.APIError as exc: + raise RuntimeError(self.format_api_error(self.name, exc)) from exc + return _first_text(response) or "" + + def chat(self, *, system: str, messages: list[ChatMessage]) -> str: + """Multi-turn chat. 
The system prompt sits at position 0; the + running message history follows it.""" + try: + response = self._client.chat.completions.create( + model=self.model, + messages=[{"role": "system", "content": system}, *messages], + **self._token_kwargs(), + **self._reasoning_kwargs(), + ) + except openai.APIError as exc: + raise RuntimeError(self.format_api_error(self.name, exc)) from exc + return _first_text(response) or "" + + # ------------------------------------------------------------------ + # Helpers + # ------------------------------------------------------------------ + + def _is_reasoning_model(self) -> bool: + return bool(_REASONING_MODEL_RE.match(self.model)) + + def _reasoning_kwargs(self) -> dict[str, Any]: + """Forward the ``think`` knob as ``reasoning_effort`` only on + reasoning-capable models. Plain models silently ignore it but + we still strip it so a future strict validation can't 400 us. + """ + if not self._is_reasoning_model(): + return {} + if self.think is False or self.think in {"off", "none"}: + return {} + if isinstance(self.think, str) and self.think.lower() in {"low", "medium", "high"}: + return {"reasoning_effort": self.think.lower()} + # ``True`` / unrecognized string → adopt SDK default by omitting. + return {} + + def _token_kwargs(self) -> dict[str, Any]: + """Output cap. Reasoning models use ``max_completion_tokens``; + plain chat models use ``max_tokens``. We send the appropriate + one so neither path 400s on an unrecognized parameter.""" + key = "max_completion_tokens" if self._is_reasoning_model() else "max_tokens" + return {key: self.max_tokens} + + +def _first_parsed(response: Any) -> Any: + """Pull the parsed Pydantic instance out of a parse() response. + + Tolerates the SDK shape (``response.choices[0].message.parsed``) + and a duck-typed mock (a list of dicts). 
+ """ + choices = getattr(response, "choices", None) or (response.get("choices") if isinstance(response, dict) else None) + if not choices: + return None + first = choices[0] + message = getattr(first, "message", None) or (first.get("message") if isinstance(first, dict) else None) + if message is None: + return None + parsed = getattr(message, "parsed", None) or (message.get("parsed") if isinstance(message, dict) else None) + return parsed + + +def _first_text(response: Any) -> str: + """Pull the first text content out of a chat-completion response.""" + choices = getattr(response, "choices", None) or (response.get("choices") if isinstance(response, dict) else None) + if not choices: + return "" + first = choices[0] + message = getattr(first, "message", None) or (first.get("message") if isinstance(first, dict) else None) + if message is None: + return "" + content = getattr(message, "content", None) + if content is None and isinstance(message, dict): + content = message.get("content") + return content or "" diff --git a/wikifi/repograph.py b/wikifi/repograph.py new file mode 100644 index 0000000..7301c93 --- /dev/null +++ b/wikifi/repograph.py @@ -0,0 +1,447 @@ +"""Lightweight static analysis of the repository. + +Two outputs feed Stage 2: + +1. **File classification.** Each in-scope file is tagged with a + :class:`FileKind` (``application_code``, ``sql``, ``openapi``, + ``protobuf``, ``graphql``, ``migration``, ``other``). Specialized + extractors short-circuit the LLM for the structured kinds — a SQL + DDL file becomes a precise entity diff without a 90-second model + call. Application code falls through to the existing LLM extraction + path, but enriched with the import graph. + +2. **Import / reference graph.** A regex-driven scan builds an undirected + neighbor map: for each file, "this file imports from these files, + and is imported by these files". 
The neighbor list is injected into + the Stage 2 prompt so per-file findings can talk about cross-file + flows ("this handler delegates to ``services/billing.py`` for the + order-totalling step") rather than treating each file as an island. + +The implementation is deliberately language-pluralistic and relies only +on regex + path resolution. tree-sitter would give richer structure but +adds a binary dep wikifi has explicitly avoided so far; the regex graph +is good enough to surface neighbors for the LLM to reason over, which is +the only consumer that matters here. +""" + +from __future__ import annotations + +import logging +import re +from collections import defaultdict +from collections.abc import Iterable +from dataclasses import dataclass, field +from enum import StrEnum +from pathlib import Path + +log = logging.getLogger("wikifi.repograph") + + +class FileKind(StrEnum): + APPLICATION_CODE = "application_code" + SQL = "sql" + OPENAPI = "openapi" + PROTOBUF = "protobuf" + GRAPHQL = "graphql" + MIGRATION = "migration" + OTHER = "other" + + +# Suffixes that pin a file kind purely by extension. +_EXTENSION_KINDS: dict[str, FileKind] = { + ".sql": FileKind.SQL, + ".ddl": FileKind.SQL, + ".proto": FileKind.PROTOBUF, + ".graphql": FileKind.GRAPHQL, + ".graphqls": FileKind.GRAPHQL, + ".gql": FileKind.GRAPHQL, +} + + +_APPLICATION_EXTS: frozenset[str] = frozenset( + { + ".py", + ".js", + ".jsx", + ".ts", + ".tsx", + ".mjs", + ".cjs", + ".go", + ".rs", + ".rb", + ".php", + ".java", + ".kt", + ".kts", + ".scala", + ".cs", + ".cpp", + ".cc", + ".c", + ".h", + ".hpp", + ".swift", + ".m", + ".mm", + ".dart", + ".ex", + ".exs", + ".clj", + ".cljs", + ".lua", + } +) + + +# Common conventions for migration directories (Alembic, Django, Rails, +# Knex, Flyway, Liquibase). 
A ``.sql`` file in any of these is a migration +# rather than a generic DDL — both kinds run through the SQL extractor +# but the migration label keeps the wiki distinguishing forward-only +# changes from current schema. +_MIGRATION_DIR_TOKENS: tuple[str, ...] = ( + "/migrations/", + "/alembic/", + "/db/migrate/", + "/database/migrations/", + "/prisma/migrations/", + "/flyway/", + "/liquibase/", +) + + +# Heuristics for OpenAPI/Swagger detection inside YAML and JSON files. +_OPENAPI_HEAD_PATTERNS: tuple[re.Pattern[str], ...] = ( + re.compile(r"^\s*openapi\s*:\s*[\"']?\d", re.MULTILINE), + re.compile(r'"openapi"\s*:\s*"\d'), + re.compile(r"^\s*swagger\s*:\s*[\"']?\d", re.MULTILINE), + re.compile(r'"swagger"\s*:\s*"\d'), +) + + +def classify(rel_path: Path, sample: str | None = None) -> FileKind: + """Return the :class:`FileKind` for a repo-relative path. + + ``sample`` may carry the first ~4 KB of the file's contents and is + consulted for kinds that can't be decided from the path alone (YAML + / JSON files that may or may not be OpenAPI specs). + """ + suffix = rel_path.suffix.lower() + posix = rel_path.as_posix().lower() + + if suffix in _EXTENSION_KINDS: + kind = _EXTENSION_KINDS[suffix] + if kind is FileKind.SQL and any(token in f"/{posix}" for token in _MIGRATION_DIR_TOKENS): + return FileKind.MIGRATION + return kind + + if suffix in {".yml", ".yaml", ".json"} and sample is not None: + head = sample[:4096] + if any(pat.search(head) for pat in _OPENAPI_HEAD_PATTERNS): + return FileKind.OPENAPI + + if suffix in _APPLICATION_EXTS: + if any(token in f"/{posix}" for token in _MIGRATION_DIR_TOKENS): + return FileKind.MIGRATION + return FileKind.APPLICATION_CODE + + return FileKind.OTHER + + +# --------------------------------------------------------------------------- +# Import graph +# --------------------------------------------------------------------------- + + +@dataclass(frozen=True) +class GraphNode: + rel_path: str + imports: tuple[str, ...] 
+ imported_by: tuple[str, ...] + + def neighbors(self, *, limit: int = 8) -> list[str]: + """Combined neighbor list, capped, for prompt enrichment.""" + out: list[str] = [] + seen: set[str] = set() + for paths in (self.imports, self.imported_by): + for path in paths: + if path not in seen: + seen.add(path) + out.append(path) + if len(out) >= limit: + return out + return out + + +@dataclass +class RepoGraph: + """Per-file import edges across an in-scope file list.""" + + nodes: dict[str, GraphNode] = field(default_factory=dict) + + def get(self, rel_path: str) -> GraphNode | None: + return self.nodes.get(rel_path) + + def neighbor_paths(self, rel_path: str, *, limit: int = 8) -> list[str]: + node = self.nodes.get(rel_path) + return node.neighbors(limit=limit) if node else [] + + def __contains__(self, rel_path: str) -> bool: # pragma: no cover - convenience + return rel_path in self.nodes + + +# Per-language import patterns. Each pattern captures the imported module +# path/identifier; resolution to a real file is handled by a separate +# heuristic. The Python pattern allows leading dots so relative imports +# (``from .foo import bar`` / ``from .. import baz``) survive the scan — +# without that, intra-package edges silently disappear from the graph. +# A second pattern (``_PY_FROM_DOT_IMPORT``) handles ``from . import X``, +# where the regex above only captures the bare dot prefix and would lose +# the ``X`` symbol that names the actual sibling module. 
+_PY_IMPORT = re.compile( + r"^\s*(?:from\s+(\.+[\w.]*|[A-Za-z_][\w.]*)\s+import|import\s+([A-Za-z_][\w.]*))", + re.MULTILINE, +) +_PY_FROM_DOT_IMPORT = re.compile( + r"^\s*from\s+(\.+)\s+import\s+([\w*][\w,\s]*)", + re.MULTILINE, +) +_JS_IMPORT = re.compile( + r"""(?:import\s+[^'"\n]*?from\s*['"]([^'"\n]+)['"])""" + r"""|(?:require\(\s*['"]([^'"\n]+)['"]\s*\))""" + r"""|(?:import\(\s*['"]([^'"\n]+)['"]\s*\))""", +) +_GO_IMPORT = re.compile(r"""import\s+(?:\([^)]*\)|\"([^\"]+)\")""", re.DOTALL) +_GO_IMPORT_BLOCK = re.compile(r"^\s*\"([^\"]+)\"", re.MULTILINE) +_JAVA_IMPORT = re.compile(r"^\s*import\s+(?:static\s+)?([\w.]+);", re.MULTILINE) +_RUBY_REQUIRE = re.compile(r"""^\s*require(?:_relative)?\s+['"]([^'"\n]+)['"]""", re.MULTILINE) + + +def build_graph(*, repo_root: Path, files: Iterable[Path]) -> RepoGraph: + """Build a :class:`RepoGraph` from the supplied in-scope files. + + Files outside :data:`_APPLICATION_EXTS` contribute nothing — their + import semantics aren't text-recoverable in any meaningful sense + (binary, image, lockfile, etc.). 
+ """ + file_list = [Path(f) for f in files] + file_set = {p.as_posix() for p in file_list} + candidates_by_module: dict[str, list[str]] = _index_modules(file_set) + + raw_edges: dict[str, set[str]] = defaultdict(set) + reverse: dict[str, set[str]] = defaultdict(set) + + for rel in file_list: + full = repo_root / rel + if rel.suffix.lower() not in _APPLICATION_EXTS: + continue + try: + text = full.read_text(encoding="utf-8", errors="replace") + except OSError: + continue + targets = _resolve_imports(rel, text, file_set=file_set, modules=candidates_by_module) + rel_str = rel.as_posix() + for target in targets: + if target == rel_str: + continue + raw_edges[rel_str].add(target) + reverse[target].add(rel_str) + + nodes: dict[str, GraphNode] = {} + for rel in file_list: + rel_str = rel.as_posix() + nodes[rel_str] = GraphNode( + rel_path=rel_str, + imports=tuple(sorted(raw_edges.get(rel_str, set()))), + imported_by=tuple(sorted(reverse.get(rel_str, set()))), + ) + return RepoGraph(nodes=nodes) + + +def _index_modules(file_set: set[str]) -> dict[str, list[str]]: + """Build module-name → candidate-paths index for resolution. + + For Python ``foo.bar.baz`` we register every dotted prefix that maps + to a concrete file (``foo/bar/baz.py`` or ``foo/bar/baz/__init__.py``). + For Java ``com.foo.Bar`` we register the matching ``com/foo/Bar.java``. + Other languages fall back to filename-stem matching when imports are + bare names. 
+ """ + index: dict[str, list[str]] = defaultdict(list) + for path in file_set: + p = Path(path) + suffix = p.suffix.lower() + stem = p.stem + # Bare filename → all paths sharing that stem + index[stem].append(path) + + if suffix == ".py": + parts = list(p.with_suffix("").parts) + if parts and parts[-1] == "__init__": + parts = parts[:-1] + for size in range(1, len(parts) + 1): + dotted = ".".join(parts[-size:]) + index[dotted].append(path) + elif suffix in {".java", ".kt", ".scala", ".cs"}: + parts = list(p.with_suffix("").parts) + for size in range(1, len(parts) + 1): + dotted = ".".join(parts[-size:]) + index[dotted].append(path) + elif suffix in {".js", ".jsx", ".ts", ".tsx", ".mjs", ".cjs"}: + parts = list(p.parts) + # JS imports are usually written without extension. + for size in range(1, len(parts) + 1): + tail = "/".join(parts[-size:]) + stripped = re.sub(r"\.(?:js|jsx|ts|tsx|mjs|cjs)$", "", tail) + index[stripped].append(path) + index[tail].append(path) + elif suffix == ".go": + parts = list(p.parts) + for size in range(1, len(parts) + 1): + index["/".join(parts[-size:])].append(path) + return index + + +def _resolve_imports( + source: Path, + text: str, + *, + file_set: set[str], + modules: dict[str, list[str]], +) -> list[str]: + suffix = source.suffix.lower() + raw_targets: list[str] = [] + + if suffix == ".py": + for match in _PY_IMPORT.finditer(text): + raw_targets.append(match.group(1) or match.group(2)) + # ``from . import a, b`` adds an edge to each *named* sibling + # rather than to the package's ``__init__.py``. The base regex + # above only captures the dot prefix, so we expand the symbol + # list here and synthesize one ``.symbol`` raw target per name. 
+ for match in _PY_FROM_DOT_IMPORT.finditer(text): + dots = match.group(1) + for symbol in match.group(2).split(","): + symbol = symbol.strip() + if symbol and symbol != "*" and symbol.isidentifier(): + raw_targets.append(f"{dots}{symbol}") + elif suffix in {".js", ".jsx", ".ts", ".tsx", ".mjs", ".cjs"}: + for match in _JS_IMPORT.finditer(text): + raw_targets.append(next((g for g in match.groups() if g), "")) + elif suffix == ".go": + for match in _GO_IMPORT.finditer(text): + block = match.group(0) + for inner in _GO_IMPORT_BLOCK.finditer(block): + raw_targets.append(inner.group(1)) + if match.group(1): + raw_targets.append(match.group(1)) + elif suffix in {".java", ".kt", ".scala", ".cs"}: + for match in _JAVA_IMPORT.finditer(text): + raw_targets.append(match.group(1)) + elif suffix == ".rb": + for match in _RUBY_REQUIRE.finditer(text): + raw_targets.append(match.group(1)) + + resolved: list[str] = [] + seen: set[str] = set() + for raw in raw_targets: + if not raw: + continue + normalized = raw.strip().strip('"').strip("'") + if not normalized: + continue + for candidate in _candidates_for(normalized, source=source, file_set=file_set, modules=modules): + if candidate not in seen: + seen.add(candidate) + resolved.append(candidate) + return resolved + + +def _candidates_for( + raw: str, + *, + source: Path, + file_set: set[str], + modules: dict[str, list[str]], +) -> list[str]: + # Python relative imports (``from .foo import bar``, ``from .. import baz``) + # use leading dots, NOT path-style ``./`` or ``../``. JS/TS relative + # imports use the path style and are handled below. Treat the two + # syntaxes separately so a ``.foo`` from Python doesn't get joined as + # ``parent/.foo`` (a hidden-file path that won't match any module). 
+ if source.suffix.lower() == ".py" and raw.startswith("."): + return _python_relative_candidates(raw, source=source, file_set=file_set) + + # Path-style relative imports (``./foo``, ``../bar``) and absolute + # paths — resolve within the repo. Path.resolve() would expand + # against the CWD; we want the result relative to the repo root so + # it can match file_set entries. + if raw.startswith(("./", "../", "/")): + target = source.parent / raw + normalized = _normalize_relative(target) + return [p for p in _try_path_variants(normalized) if p in file_set] + + # Strip leading dots from any other dotted form (defensive). + stripped = raw.lstrip(".") + matches = list(modules.get(stripped, [])) + matches += modules.get(stripped.split(".")[-1], []) + matches += modules.get(stripped.split("/")[-1], []) + + out: list[str] = [] + seen: set[str] = set() + for path in matches: + if path in file_set and path not in seen and path != source.as_posix(): + seen.add(path) + out.append(path) + return out + + +def _python_relative_candidates(raw: str, *, source: Path, file_set: set[str]) -> list[str]: + """Resolve a Python ``from .foo`` style import against the repo. + + Each leading dot pops one level from the source's package directory: + a single dot is the package itself, two dots is the parent package, + and so on. Whatever follows is a dotted module path inside the + resolved package (``a.b`` → ``a/b``), which we attempt with the + standard ``.py`` and ``__init__.py`` variants. + """ + leading = len(raw) - len(raw.lstrip(".")) + remainder = raw[leading:] + # ``source.parent`` is the directory the source file lives in, + # which corresponds to the *current* package (one dot's worth).
+ base = source.parent + for _ in range(leading - 1): + if not base.parts: + return [] + base = base.parent + target = base / Path(*remainder.split(".")) if remainder else base + return [p for p in _try_path_variants(target) if p in file_set] + + +def _normalize_relative(path: Path) -> Path: + """Collapse ``..`` / ``.`` segments without touching the filesystem. + + ``Path.resolve()`` would anchor against the current working directory + and break the repo-relative semantics we rely on for graph keys. + """ + parts: list[str] = [] + for part in path.parts: + if part in ("", "."): + continue + if part == "..": + if parts: + parts.pop() + continue + parts.append(part) + return Path(*parts) if parts else Path() + + +def _try_path_variants(path: Path) -> list[str]: + candidates: list[str] = [] + for ext in (".py", ".js", ".ts", ".tsx", ".jsx", ".mjs", ".cjs", ".rb", ".go", ""): + with_ext = path if ext == "" or not path.name else path.with_suffix(ext) + candidates.append(with_ext.as_posix()) + candidates.append((path / "__init__.py").as_posix()) + candidates.append((path / "index.ts").as_posix()) + candidates.append((path / "index.js").as_posix()) + return candidates diff --git a/wikifi/report.py b/wikifi/report.py new file mode 100644 index 0000000..fdc83a4 --- /dev/null +++ b/wikifi/report.py @@ -0,0 +1,174 @@ +"""``wikifi report`` — coverage and quality view of the wiki. + +The report answers two questions migration leads ask before they fund a +re-implementation: + +1. **Did the walk cover the system?** Per-section file/finding counts, + total files seen vs. files that contributed something. +2. **Is the wiki good enough to act on?** Per-section quality score from + the critic, with the headline ``unsupported_claims`` and ``gaps``. + +The report runs purely from on-disk artifacts (notes JSONL + section +markdown + cache) plus optional provider-driven scoring; it never +modifies the wiki.
+""" + +from __future__ import annotations + +import logging +from dataclasses import dataclass, field + +from wikifi.cache import WalkCache, load +from wikifi.critic import CoverageStats, Critique, _critique +from wikifi.providers.base import LLMProvider +from wikifi.sections import PRIMARY_SECTIONS, SECTIONS, Section +from wikifi.wiki import WikiLayout, read_notes + +log = logging.getLogger("wikifi.report") + + +@dataclass +class SectionReport: + section: Section + files_contributing: int + findings_count: int + body_chars: int + is_empty: bool + critique: Critique | None = None + + +@dataclass +class WikiReport: + coverage: CoverageStats + sections: list[SectionReport] = field(default_factory=list) + overall_score: float | None = None + + def render(self) -> str: + lines: list[str] = [] + lines.append("# wikifi coverage + quality report") + lines.append("") + lines.append( + f"Files seen: **{self.coverage.files_total}** · " + f"Files with findings: **{self.coverage.files_with_findings}** " + f"({self.coverage.coverage_pct()}%)" + ) + if self.overall_score is not None: + lines.append(f"Overall section score (mean of populated sections): **{self.overall_score:.1f} / 10**") + lines.append("") + lines.append("| Section | Files | Findings | Body | Score | Headline gap |") + lines.append("| --- | ---: | ---: | ---: | ---: | --- |") + for entry in self.sections: + score = "—" if entry.critique is None else f"{entry.critique.score}/10" + gap = "" + if entry.critique and entry.critique.gaps: + gap = entry.critique.gaps[0][:60] + elif entry.critique and entry.critique.unsupported_claims: + gap = "unsupported: " + entry.critique.unsupported_claims[0][:50] + elif entry.is_empty: + gap = "no findings" + lines.append( + f"| `{entry.section.id}` " + f"| {entry.files_contributing} " + f"| {entry.findings_count} " + f"| {entry.body_chars} " + f"| {score} " + f"| {gap} |" + ) + return "\n".join(lines) + + +def build_report( + *, + layout: WikiLayout, + provider: LLMProvider | 
None = None, + score: bool = False, +) -> WikiReport: + """Inspect a wiki and produce a :class:`WikiReport`. + + With ``score=True`` and a provider supplied, every populated section + is run through the critic for a quality score. Without that, the + report is purely structural — useful in CI without an LLM. + """ + findings_per_section: dict[str, int] = {} + files_per_section: dict[str, int] = {} + contributing_files: set[str] = set() + for section in PRIMARY_SECTIONS: + notes = read_notes(layout, section) + findings_per_section[section.id] = len(notes) + section_files = {n.get("file") for n in notes if n.get("file")} + files_per_section[section.id] = len(section_files) + contributing_files.update(f for f in section_files if isinstance(f, str)) + + # Coverage is derived from the on-disk notes first so a walk run with + # ``--no-cache`` (or one whose cache was deleted) still reports + # accurate counts. When notes are present they're authoritative; we + # only fall back to the cache when no notes have been written yet. 
+ if contributing_files: + files_with_findings = len(contributing_files) + files_total = max(files_with_findings, _files_total_from_cache(layout)) + else: + files_total, files_with_findings = _coverage_from_cache(layout) + + coverage = CoverageStats( + files_total=files_total, + files_with_findings=files_with_findings, + findings_per_section=findings_per_section, + files_per_section=files_per_section, + ) + + section_reports: list[SectionReport] = [] + scored: list[int] = [] + for section in SECTIONS: + path = layout.section_path(section) + body = path.read_text(encoding="utf-8") if path.exists() else "" + is_empty = ( + "Not yet populated" in body + or "No findings were extracted" in body + or "upstream sections required to derive" in body.lower() + ) + critique: Critique | None = None + if score and provider is not None and not is_empty and body.strip(): + critique = _critique( + section=section, + body=body, + upstream=_collect_upstream(layout, section) if section.tier == "derivative" else None, + provider=provider, + ) + scored.append(critique.score) + section_reports.append( + SectionReport( + section=section, + files_contributing=files_per_section.get(section.id, 0), + findings_count=findings_per_section.get(section.id, 0), + body_chars=len(body), + is_empty=is_empty, + critique=critique, + ) + ) + + overall = sum(scored) / len(scored) if scored else None + return WikiReport(coverage=coverage, sections=section_reports, overall_score=overall) + + +def _coverage_from_cache(layout: WikiLayout) -> tuple[int, int]: + cache: WalkCache = load(layout) + files_total = len(cache.extraction) + files_with_findings = sum(1 for entry in cache.extraction.values() if entry.findings) + return files_total, files_with_findings + + +def _files_total_from_cache(layout: WikiLayout) -> int: + """Return the cache's seen-files count if available; ``0`` otherwise.""" + cache: WalkCache = load(layout) + return len(cache.extraction) + + +def _collect_upstream(layout: WikiLayout, 
section: Section) -> dict[str, str]: + bodies: dict[str, str] = {} + for upstream_id in section.derived_from: + path = layout.section_path(upstream_id) + if path.exists(): + text = path.read_text(encoding="utf-8") + if "Not yet populated" not in text and "No findings were extracted" not in text: + bodies[upstream_id] = text + return bodies diff --git a/wikifi/specialized/__init__.py b/wikifi/specialized/__init__.py new file mode 100644 index 0000000..3382b5d --- /dev/null +++ b/wikifi/specialized/__init__.py @@ -0,0 +1,12 @@ +"""Type-aware extractors for high-signal source artifacts. + +Each module in this package implements one or more parsers that consume +a file's text and emit structured findings in the same shape the LLM +extractor produces. Import from the concrete module — never from this +``__init__.py`` — per the project's no-re-exports rule: + +- :mod:`wikifi.specialized.models` — finding/result dataclasses +- :mod:`wikifi.specialized.dispatch` — :func:`select` for kind → extractor +- :mod:`wikifi.specialized.sql` / ``openapi`` / ``protobuf`` / ``graphql`` — + the per-format extractors +""" diff --git a/wikifi/specialized/dispatch.py b/wikifi/specialized/dispatch.py new file mode 100644 index 0000000..f23cb8f --- /dev/null +++ b/wikifi/specialized/dispatch.py @@ -0,0 +1,62 @@ +"""Dispatch a :class:`FileKind` to its specialized extractor. + +Schema files, IDLs, OpenAPI specs, and migrations carry the system's +contracts in machine-readable form. Running them through the same prose +LLM extractor as application code is wasteful and lossy: the structure +is already there, the extractor just has to read it. + +Selection respects the file's *path* — not just its kind — so a Python +Alembic/Django migration is not silently routed through the SQL parser. 
+The classifier upstream (``wikifi.repograph.classify``) tags every file +under a migrations directory as :attr:`FileKind.MIGRATION`; this layer +narrows that to the SQL-shaped subset (``.sql`` / ``.ddl``) and returns +``None`` for the rest, letting them fall through to the LLM path. +""" + +from __future__ import annotations + +import logging +from pathlib import PurePosixPath + +from wikifi.repograph import FileKind +from wikifi.specialized.models import ExtractorFn + +log = logging.getLogger("wikifi.specialized") + + +# Suffixes that the SQL extractor can actually read. Anything else +# tagged :attr:`FileKind.MIGRATION` (e.g. an Alembic ``.py`` script, +# a Django ``0001_initial.py``, a Knex ``.js`` migration) keeps its +# logic in code, not DDL — those belong on the LLM extraction path. +_SQL_MIGRATION_SUFFIXES: frozenset[str] = frozenset({".sql", ".ddl"}) + + +def select(kind: FileKind, *, rel_path: str | None = None) -> ExtractorFn | None: + """Return the specialized extractor for a file, or ``None``. + + ``rel_path`` is required for :attr:`FileKind.MIGRATION` because the + classifier marks any file inside a migrations directory as a + migration, including non-SQL ones. Without the path, we can't tell + a SQL migration from an Alembic Python script. + """ + # Imports are lazy so this module stays cheap to load and so the + # extractors can import freely from ``wikifi.specialized.models`` + # without a circular ``__init__`` dependency. 
+ from wikifi.specialized import graphql, openapi, protobuf, sql + + if kind is FileKind.SQL: + return sql.extract + if kind is FileKind.OPENAPI: + return openapi.extract + if kind is FileKind.PROTOBUF: + return protobuf.extract + if kind is FileKind.GRAPHQL: + return graphql.extract + if kind is FileKind.MIGRATION: + if rel_path is None: + return None + suffix = PurePosixPath(rel_path).suffix.lower() + if suffix in _SQL_MIGRATION_SUFFIXES: + return sql.extract_migration + return None + return None diff --git a/wikifi/specialized/graphql.py b/wikifi/specialized/graphql.py new file mode 100644 index 0000000..d6d4f48 --- /dev/null +++ b/wikifi/specialized/graphql.py @@ -0,0 +1,162 @@ +"""GraphQL SDL extractor. + +Pulls types, inputs, queries, mutations, and subscriptions. Maps them to +``entities`` (types/inputs) and ``capabilities`` + ``integrations`` +(query/mutation/subscription roots). + +Modular GraphQL schemas often split root types across files using +``extend type Query`` / ``extend type Mutation``; we treat those exactly +like the base declaration so capabilities don't disappear when a schema +is composed from many files. +""" + +from __future__ import annotations + +import re + +from wikifi.evidence import SourceRef +from wikifi.specialized.models import SpecializedFinding, SpecializedResult + +_TYPE_RE = re.compile(r"^\s*type\s+(\w+)\s*(?:implements\s+[^\{]+)?\{", re.MULTILINE) +# ``extend type Query { ... }`` is the standard way to add fields to a +# root from a separate SDL file; treat it as a same-named root. 
+_EXTEND_TYPE_RE = re.compile(r"^\s*extend\s+type\s+(\w+)\s*(?:implements\s+[^\{]+)?\{", re.MULTILINE) +_INPUT_RE = re.compile(r"^\s*input\s+(\w+)\s*\{", re.MULTILINE) +_INTERFACE_RE = re.compile(r"^\s*interface\s+(\w+)\s*\{", re.MULTILINE) +_ENUM_RE = re.compile(r"^\s*enum\s+(\w+)\s*\{", re.MULTILINE) +_SCHEMA_FIELD_RE = re.compile(r"^\s*(\w+)\s*(?:\([^)]*\))?\s*:\s*[^\n]+", re.MULTILINE) + + +def extract(rel_path: str, text: str) -> SpecializedResult: + findings: list[SpecializedFinding] = [] + summary_bits: list[str] = [] + + # Anchor line numbers on the captured *name* offset, not the match + # start. The leading ``^\s*`` in each pattern can consume the + # preceding newline (``\s`` is newline-aware by default), which + # would otherwise put the line number one above the actual + # declaration and confuse :func:`_block_after`. + types = [(m.group(1), _line(text, m.start(1))) for m in _TYPE_RE.finditer(text)] + extensions = [(m.group(1), _line(text, m.start(1))) for m in _EXTEND_TYPE_RE.finditer(text)] + inputs = [(m.group(1), _line(text, m.start(1))) for m in _INPUT_RE.finditer(text)] + interfaces = [(m.group(1), _line(text, m.start(1))) for m in _INTERFACE_RE.finditer(text)] + enums = [(m.group(1), _line(text, m.start(1))) for m in _ENUM_RE.finditer(text)] + + root_names = {"Query", "Mutation", "Subscription"} + domain_types = [t for t in types if t[0] not in root_names] + # Root declarations come from both ``type Query { ... }`` and + # ``extend type Query { ... }`` forms. 
+ root_types = [t for t in types if t[0] in root_names] + [t for t in extensions if t[0] in root_names] + + if domain_types: + summary_bits.append(f"{len(domain_types)} type(s)") + bullets = "\n".join(f" - **{name}**" for name, _ in domain_types[:25]) + findings.append( + SpecializedFinding( + section_id="entities", + finding=("GraphQL domain types:\n" + bullets), + sources=[ + SourceRef( + file=rel_path, + lines=(domain_types[0][1], domain_types[-1][1]), + ) + ], + ) + ) + + if interfaces: + bullets = "\n".join(f" - **{name}**" for name, _ in interfaces) + findings.append( + SpecializedFinding( + section_id="entities", + finding=("Interfaces (shared shape contracts):\n" + bullets), + sources=[SourceRef(file=rel_path)], + ) + ) + + if inputs: + bullets = "\n".join(f" - **{name}**" for name, _ in inputs[:25]) + findings.append( + SpecializedFinding( + section_id="entities", + finding=("Input types (request payload shapes):\n" + bullets), + sources=[SourceRef(file=rel_path)], + ) + ) + + if enums: + bullets = "\n".join(f" - **{name}**" for name, _ in enums[:15]) + findings.append( + SpecializedFinding( + section_id="entities", + finding=("Enum types (closed value sets):\n" + bullets), + sources=[SourceRef(file=rel_path)], + ) + ) + + if root_types: + # Pull each root's fields by scanning the snippet between its + # declaration line and the matching closing brace. Multiple + # root declarations (the file may contain ``extend type Query`` + # blocks) get one finding each. 
+ for name, line in root_types: + block = _block_after(text, line) + fields = _SCHEMA_FIELD_RE.findall(block) + bullets = "\n".join(f" - `{f}`" for f in fields[:30]) + section_id = "capabilities" if name in {"Query", "Mutation"} else "integrations" + findings.append( + SpecializedFinding( + section_id=section_id, + finding=(f"GraphQL **{name}** root exposes:\n" + (bullets or " - (no fields detected)")), + sources=[SourceRef(file=rel_path, lines=(line, line))], + ) + ) + # Deduped name list for the summary (Query/Mutation likely repeat + # across base + extend blocks). + seen_root_names: list[str] = [] + for name, _ in root_types: + if name not in seen_root_names: + seen_root_names.append(name) + summary_bits.append(", ".join(seen_root_names) + " roots") + + return SpecializedResult( + findings=findings, + summary=("GraphQL SDL: " + ", ".join(summary_bits)) if summary_bits else "GraphQL SDL.", + ) + + +def _line(text: str, offset: int) -> int: + return text.count("\n", 0, offset) + 1 + + +def _block_after(text: str, line: int) -> str: + """Return the body lines between ``line`` and the matching ``}``. + + Walks the source brace-depth-aware so an indented closing brace + (`` }``) ends the block — many SDL formatters indent the closing + brace, and a column-0-only check would consume every type that + follows. + """ + lines = text.splitlines() + start_index = max(0, line - 1) + out: list[str] = [] + depth = 0 + started = False + for ln in lines[start_index:]: + opens = ln.count("{") + closes = ln.count("}") + if not started: + # The declaration line carries the opening ``{``; record it + # but don't emit the declaration itself as a body line. + depth += opens - closes + started = True + if depth <= 0: + # ``type X {}`` on a single line — empty body. + break + continue + if closes and depth - closes <= 0: + # The line that closes the block — stop before consuming it. 
+ break + depth += opens - closes + out.append(ln) + return "\n".join(out) diff --git a/wikifi/specialized/models.py b/wikifi/specialized/models.py new file mode 100644 index 0000000..ae3b208 --- /dev/null +++ b/wikifi/specialized/models.py @@ -0,0 +1,30 @@ +"""Result types emitted by specialized extractors. + +Specialized extractors short-circuit the LLM on schema/IDL files — +their output flows into the same notes store the LLM extractor writes +to, so the dispatch contract is just ``(rel_path, text) -> SpecializedResult``. +""" + +from __future__ import annotations + +from collections.abc import Callable +from dataclasses import dataclass, field + +from wikifi.evidence import SourceRef + + +@dataclass +class SpecializedFinding: + section_id: str + finding: str + sources: list[SourceRef] = field(default_factory=list) + + +@dataclass +class SpecializedResult: + findings: list[SpecializedFinding] = field(default_factory=list) + summary: str = "" + + +# Each extractor takes ``(rel_path, text)`` and returns a SpecializedResult. +ExtractorFn = Callable[[str, str], SpecializedResult] diff --git a/wikifi/specialized/openapi.py b/wikifi/specialized/openapi.py new file mode 100644 index 0000000..f2c13fc --- /dev/null +++ b/wikifi/specialized/openapi.py @@ -0,0 +1,188 @@ +"""OpenAPI / Swagger contract extractor. + +OpenAPI specs are migration gold: every public endpoint, request/response +body, and authentication method is enumerated in one structured document. +We avoid pulling PyYAML as a hard dependency by attempting JSON first, +then falling back to a small permissive YAML parser sufficient for the +keys we read. Specs that exceed the parser's limits are flagged with a +single ``capabilities`` finding noting the parse failure rather than +crashing the walk. 
+""" + +from __future__ import annotations + +import json +import logging +import re +from typing import Any + +from wikifi.evidence import SourceRef +from wikifi.specialized.models import SpecializedFinding, SpecializedResult + +log = logging.getLogger("wikifi.specialized.openapi") + + +def extract(rel_path: str, text: str) -> SpecializedResult: + spec = _parse(text) + if spec is None: + return SpecializedResult( + findings=[ + SpecializedFinding( + section_id="capabilities", + finding=( + "An API contract was found but could not be parsed for " + "structured extraction. Migration teams should consult " + "this file directly for endpoint inventory." + ), + sources=[SourceRef(file=rel_path)], + ) + ], + summary="Unparseable API spec — manual review recommended.", + ) + + findings: list[SpecializedFinding] = [] + summary_bits: list[str] = [] + + info = spec.get("info") or {} + if isinstance(info, dict) and (title := info.get("title")): + findings.append( + SpecializedFinding( + section_id="intent", + finding=( + f"The system exposes a public API titled **{title}**" + + (f" (v{info.get('version')})" if info.get("version") else "") + + (f": {info.get('description')}" if info.get("description") else ".") + ), + sources=[SourceRef(file=rel_path)], + ) + ) + + paths = spec.get("paths") or {} + if isinstance(paths, dict): + verbs = ("get", "post", "put", "patch", "delete", "head", "options") + endpoints: list[tuple[str, str, str]] = [] + for path, ops in paths.items(): + if not isinstance(ops, dict): + continue + for verb in verbs: + op = ops.get(verb) + if not isinstance(op, dict): + continue + description = op.get("summary") or op.get("description") or "" + endpoints.append((verb.upper(), str(path), str(description))) + if endpoints: + summary_bits.append(f"{len(endpoints)} endpoint(s)") + top = endpoints[:20] + bullets = "\n".join(f" - `{verb} {path}`{(' — ' + desc) if desc else ''}" for verb, path, desc in top) + more = f"\n - … {len(endpoints) - 20} more endpoint(s) 
elided." if len(endpoints) > 20 else "" + findings.append( + SpecializedFinding( + section_id="capabilities", + finding=("Public API surface (subset shown):\n" + bullets + more), + sources=[SourceRef(file=rel_path)], + ) + ) + findings.append( + SpecializedFinding( + section_id="integrations", + finding=( + f"Inbound integration: HTTP API exposes {len(endpoints)} endpoint(s) for external consumers." + ), + sources=[SourceRef(file=rel_path)], + ) + ) + + components = spec.get("components") or {} + schemas = components.get("schemas") if isinstance(components, dict) else None + if isinstance(schemas, dict) and schemas: + names = list(schemas.keys()) + summary_bits.append(f"{len(names)} schema(s)") + bullets = "\n".join(f" - **{name}**" for name in names[:25]) + more = f"\n - … {len(names) - 25} more schema(s) elided." if len(names) > 25 else "" + findings.append( + SpecializedFinding( + section_id="entities", + finding=("API schemas (request/response models):\n" + bullets + more), + sources=[SourceRef(file=rel_path)], + ) + ) + + security = components.get("securitySchemes") if isinstance(components, dict) else None + if isinstance(security, dict) and security: + types = sorted({(v or {}).get("type", "?") for v in security.values() if isinstance(v, dict)}) + findings.append( + SpecializedFinding( + section_id="cross_cutting", + finding=("Authentication contract for the API: scheme(s) " + ", ".join(f"`{t}`" for t in types) + "."), + sources=[SourceRef(file=rel_path)], + ) + ) + + return SpecializedResult( + findings=findings, + summary="API contract: " + ", ".join(summary_bits) if summary_bits else "API contract.", + ) + + +def _parse(text: str) -> dict[str, Any] | None: + stripped = text.strip() + if not stripped: + return None + if stripped.startswith("{"): + try: + return json.loads(stripped) + except json.JSONDecodeError: + return None + try: + import yaml # type: ignore[import-not-found] + except ImportError: + return _shallow_yaml(stripped) + try: + loaded = 
yaml.safe_load(stripped) + except Exception as exc: # pragma: no cover - depends on installed PyYAML + log.warning("yaml parse failed: %s", exc) + return None + return loaded if isinstance(loaded, dict) else None + + +# --------------------------------------------------------------------------- +# Tiny YAML fallback — only handles the OpenAPI subset we need (top-level +# keys, simple nested dicts, and method blocks under paths). +# --------------------------------------------------------------------------- + + +_KEY_RE = re.compile(r"^(\s*)([\w./{}-]+):\s*(.*)$") + + +def _shallow_yaml(text: str) -> dict[str, Any] | None: + """Best-effort YAML parser sufficient for OpenAPI's known shape. + + Returns nested dicts where each key contributes a string value or a + nested dict; lists and complex flow-style structures collapse to + string descriptions, which is fine for the keys :func:`extract` + actually inspects. + """ + root: dict[str, Any] = {} + stack: list[tuple[int, dict[str, Any]]] = [(-1, root)] + for raw_line in text.splitlines(): + if not raw_line.strip() or raw_line.lstrip().startswith("#"): + continue + match = _KEY_RE.match(raw_line) + if not match: + continue + indent = len(match.group(1)) + key = match.group(2).strip() + value = match.group(3).strip() + while stack and stack[-1][0] >= indent: + stack.pop() + if not stack: + stack.append((-1, root)) + parent = stack[-1][1] + if value == "" or value == "{}": + child: dict[str, Any] = {} + parent[key] = child + stack.append((indent, child)) + else: + stripped = value.strip().strip('"').strip("'") + parent[key] = stripped + return root or None diff --git a/wikifi/specialized/protobuf.py b/wikifi/specialized/protobuf.py new file mode 100644 index 0000000..1829833 --- /dev/null +++ b/wikifi/specialized/protobuf.py @@ -0,0 +1,141 @@ +"""Protobuf IDL extractor. + +Surfaces ``message`` types as entities and ``service``/``rpc`` blocks as +integration touchpoints. 
Proto files are pure contract: a migration team +re-implementing in a new stack can read these findings directly into +their interface design. +""" + +from __future__ import annotations + +import re + +from wikifi.evidence import SourceRef +from wikifi.specialized.models import SpecializedFinding, SpecializedResult + +_MESSAGE_RE = re.compile(r"^\s*message\s+(\w+)\s*\{", re.MULTILINE) +_SERVICE_RE = re.compile(r"^\s*service\s+(\w+)\s*\{", re.MULTILINE) +_RPC_RE = re.compile( + r"^\s*rpc\s+(\w+)\s*\(\s*(stream\s+)?([\w.]+)\s*\)\s*returns\s*\(\s*(stream\s+)?([\w.]+)\s*\)", + re.MULTILINE, +) +_ENUM_RE = re.compile(r"^\s*enum\s+(\w+)\s*\{", re.MULTILINE) +_PACKAGE_RE = re.compile(r"^\s*package\s+([\w.]+)\s*;", re.MULTILINE) + + +def extract(rel_path: str, text: str) -> SpecializedResult: + findings: list[SpecializedFinding] = [] + summary_bits: list[str] = [] + + package_match = _PACKAGE_RE.search(text) + package = package_match.group(1) if package_match else "" + + messages = [(m.group(1), _line(text, m.start())) for m in _MESSAGE_RE.finditer(text)] + enums = [(m.group(1), _line(text, m.start())) for m in _ENUM_RE.finditer(text)] + services = [(m.group(1), _line(text, m.start())) for m in _SERVICE_RE.finditer(text)] + rpcs = [ + (m.group(1), m.group(3), m.group(5), bool(m.group(2)), bool(m.group(4)), _line(text, m.start())) + for m in _RPC_RE.finditer(text) + ] + + if messages: + summary_bits.append(f"{len(messages)} message(s)") + bullets = "\n".join(f" - **{name}**" for name, _ in messages[:25]) + more = f"\n - … {len(messages) - 25} more message(s) elided." 
if len(messages) > 25 else "" + findings.append( + SpecializedFinding( + section_id="entities", + finding=( + f"Protocol entities {('in package `' + package + '`') if package else ''}:\n" + bullets + more + ), + sources=[SourceRef(file=rel_path, lines=(messages[0][1], messages[-1][1]))], + ) + ) + + if enums: + bullets = "\n".join(f" - **{name}**" for name, _ in enums[:15]) + findings.append( + SpecializedFinding( + section_id="entities", + finding=("Enum types (closed value sets):\n" + bullets), + sources=[SourceRef(file=rel_path, lines=(enums[0][1], enums[-1][1]))], + ) + ) + + # Each service owns the RPCs declared between its opening ``{`` and + # the matching ``}``. The previous "every RPC at or after my line" + # filter would attribute every later service's RPCs to the first + # service block in a multi-service file, inflating the integration + # inventory. Bound each service by its block-end line instead. + service_spans = _service_spans(text, services) + for (service_name, start_line), (_, end_line) in zip(services, service_spans, strict=True): + related = [r for r in rpcs if start_line <= r[5] <= end_line] + bullets = "\n".join( + f" - `{name}({_arrow(in_msg, in_stream)}) -> {_arrow(out_msg, out_stream)}`" + for name, in_msg, out_msg, in_stream, out_stream, _ in related[:25] + ) + findings.append( + SpecializedFinding( + section_id="integrations", + finding=( + f"Service **{service_name}** exposes the following RPCs:\n" + + (bullets if bullets else " - (no RPCs detected)") + ), + sources=[SourceRef(file=rel_path, lines=(start_line, end_line))], + ) + ) + if services: + summary_bits.append(f"{len(services)} service(s)") + if rpcs: + summary_bits.append(f"{len(rpcs)} rpc(s)") + findings.append( + SpecializedFinding( + section_id="capabilities", + finding=( + f"Wire protocol exposes {len(rpcs)} remote procedure(s) across {len(services) or 1} service(s)." 
+ ), + sources=[SourceRef(file=rel_path)], + ) + ) + + return SpecializedResult( + findings=findings, + summary=("Proto file: " + ", ".join(summary_bits)) if summary_bits else "Proto file.", + ) + + +def _arrow(name: str, stream: bool) -> str: + return f"stream {name}" if stream else name + + +def _line(text: str, offset: int) -> int: + return text.count("\n", 0, offset) + 1 + + +def _service_spans(text: str, services: list[tuple[str, int]]) -> list[tuple[str, int]]: + """For each (service_name, start_line) return (service_name, end_line). + + ``end_line`` is the line carrying the brace that closes the service + block, found by walking forward and counting brace depth so nested + blocks (``oneof``, message-in-service) don't terminate the scan. + If the closing brace is missing the span runs to EOF. + """ + lines = text.splitlines() + spans: list[tuple[str, int]] = [] + last_line = len(lines) + for name, start_line in services: + depth = 0 + started = False + end_line = last_line + for i in range(start_line - 1, last_line): + ln = lines[i] + opens = ln.count("{") + closes = ln.count("}") + depth += opens - closes + if not started and opens: + started = True + if started and depth <= 0: + end_line = i + 1 + break + spans.append((name, end_line)) + return spans diff --git a/wikifi/specialized/sql.py b/wikifi/specialized/sql.py new file mode 100644 index 0000000..810c22a --- /dev/null +++ b/wikifi/specialized/sql.py @@ -0,0 +1,235 @@ +"""SQL DDL + migration extractor. + +Pulls table definitions, columns, primary/foreign keys, indexes, and +constraints. Each table becomes an ``entities`` finding; foreign keys +become ``integrations``-style relationships if they cross obvious +service boundaries (heuristic), and ``cross_cutting`` for storage +invariants like ``UNIQUE`` and ``NOT NULL`` constraints. + +Migration files (Alembic/Knex/Flyway/etc.) 
are extracted with the same +parser and additionally tagged in the summary so the migration team can +spot forward-only schema changes vs. baseline DDL. +""" + +from __future__ import annotations + +import re +from dataclasses import dataclass, field + +from wikifi.evidence import SourceRef +from wikifi.specialized.models import SpecializedFinding, SpecializedResult + +# Line-number tracking is precise to "the line containing the matched +# keyword" — that's specific enough for citations and avoids the cost +# of a full SQL parser. Migrations frequently mix dialects; we tolerate +# anything that loosely matches the keyword grammar. +_CREATE_TABLE_RE = re.compile( + r"create\s+(?:or\s+replace\s+)?(?:temporary\s+)?table\s+(?:if\s+not\s+exists\s+)?" + r"([\"`\[\]\w.]+)\s*\((.*?)\)\s*;", + re.IGNORECASE | re.DOTALL, +) +_ALTER_TABLE_RE = re.compile( + r"alter\s+table\s+([\"`\[\]\w.]+)\s+(.*?);", + re.IGNORECASE | re.DOTALL, +) +_FK_RE = re.compile( + r"foreign\s+key\s*\(([^)]+)\)\s*references\s+([\"`\[\]\w.]+)\s*\(([^)]+)\)", + re.IGNORECASE, +) +_REF_INLINE_RE = re.compile(r"references\s+([\"`\[\]\w.]+)\s*\(([^)]+)\)", re.IGNORECASE) +_UNIQUE_RE = re.compile(r"\bunique\b", re.IGNORECASE) +_NOT_NULL_RE = re.compile(r"\bnot\s+null\b", re.IGNORECASE) +_INDEX_RE = re.compile( + r"create\s+(?:unique\s+)?index\s+(?:if\s+not\s+exists\s+)?([\"`\[\]\w.]+)\s+on\s+([\"`\[\]\w.]+)", + re.IGNORECASE, +) + + +@dataclass +class _TableHit: + name: str + line: int + body: str + columns: list[str] = field(default_factory=list) + fks: list[tuple[str, str, str]] = field(default_factory=list) + + +def extract(rel_path: str, text: str) -> SpecializedResult: + return _extract(rel_path, text, migration=False) + + +def extract_migration(rel_path: str, text: str) -> SpecializedResult: + return _extract(rel_path, text, migration=True) + + +def _extract(rel_path: str, text: str, *, migration: bool) -> SpecializedResult: + findings: list[SpecializedFinding] = [] + tables: list[_TableHit] = [] + 
altered_tables: set[str] = set() + + for match in _CREATE_TABLE_RE.finditer(text): + name = _strip_ident(match.group(1)) + body = match.group(2) + line = _line_of(text, match.start()) + hit = _TableHit(name=name, line=line, body=body) + _populate_columns(hit) + tables.append(hit) + + for hit in tables: + bullet_lines = ", ".join(hit.columns) if hit.columns else "(no columns parsed)" + prefix = "Migration adds" if migration else "Persists" + findings.append( + SpecializedFinding( + section_id="entities", + finding=(f"{prefix} the **{hit.name}** entity. Columns: {bullet_lines}."), + sources=[SourceRef(file=rel_path, lines=(hit.line, hit.line))], + ) + ) + + for column, ref_table, ref_column in hit.fks: + findings.append( + SpecializedFinding( + section_id="integrations", + finding=( + f"`{hit.name}.{column}` references " + f"`{ref_table}.{ref_column}` — a hard relational link " + "between these entities." + ), + sources=[SourceRef(file=rel_path, lines=(hit.line, hit.line))], + ) + ) + + constraints = _parse_constraints(hit.body) + if constraints: + findings.append( + SpecializedFinding( + section_id="cross_cutting", + finding=(f"Storage invariants on **{hit.name}**: {constraints}."), + sources=[SourceRef(file=rel_path, lines=(hit.line, hit.line))], + ) + ) + + for match in _ALTER_TABLE_RE.finditer(text): + line = _line_of(text, match.start()) + target = _strip_ident(match.group(1)) + action = match.group(2).strip() + altered_tables.add(target) + prefix = "Migration alters" if migration else "Alters" + findings.append( + SpecializedFinding( + section_id="entities", + finding=(f"{prefix} entity **{target}**: {_summarize_alter(action)}."), + sources=[SourceRef(file=rel_path, lines=(line, line))], + ) + ) + + for match in _INDEX_RE.finditer(text): + line = _line_of(text, match.start()) + idx = _strip_ident(match.group(1)) + target = _strip_ident(match.group(2)) + findings.append( + SpecializedFinding( + section_id="cross_cutting", + finding=( + f"Index `{idx}` on 
**{target}** — encodes a query-time " + "performance invariant the new system must preserve." + ), + sources=[SourceRef(file=rel_path, lines=(line, line))], + ) + ) + + if migration: + # Count both newly-created tables AND tables targeted by ALTER — + # a migration that only ALTERs still touches its targets, and + # a "0 table(s)" summary on an ALTER-only file misled callers + # browsing the report. + touched = len({hit.name for hit in tables} | altered_tables) + summary = f"Migration touches {touched} table(s)." + else: + summary = f"Schema for {len(tables)} table(s)." + return SpecializedResult(findings=findings, summary=summary) + + +def _populate_columns(hit: _TableHit) -> None: + """Pull column names + foreign-key edges from a CREATE TABLE body.""" + body = hit.body + columns: list[str] = [] + fks: list[tuple[str, str, str]] = [] + + for fk in _FK_RE.finditer(body): + local_cols = [c.strip().strip('"`[]') for c in fk.group(1).split(",")] + ref_table = _strip_ident(fk.group(2)) + ref_cols = [c.strip().strip('"`[]') for c in fk.group(3).split(",")] + for lc, rc in zip(local_cols, ref_cols, strict=False): + fks.append((lc, ref_table, rc)) + + # Split top-level commas so we can read column lines. + for raw_line in _split_top_level_commas(body): + line = raw_line.strip() + if not line: + continue + lowered = line.lower() + if lowered.startswith(("primary key", "foreign key", "unique", "constraint", "check", "index")): + continue + # First token is the column name (may be quoted). 
+ match = re.match(r"\s*([\"`\[]?[\w]+[\"`\]]?)", line) + if not match: + continue + column = match.group(1).strip('"`[]') + columns.append(column) + + ref = _REF_INLINE_RE.search(line) + if ref: + ref_table = _strip_ident(ref.group(1)) + ref_cols = [c.strip().strip('"`[]') for c in ref.group(2).split(",")] + for rc in ref_cols: + fks.append((column, ref_table, rc)) + + hit.columns = columns + hit.fks = fks + + +def _split_top_level_commas(body: str) -> list[str]: + """Split on commas that are not inside parentheses.""" + out: list[str] = [] + depth = 0 + buf: list[str] = [] + for ch in body: + if ch == "(": + depth += 1 + buf.append(ch) + elif ch == ")": + depth = max(0, depth - 1) + buf.append(ch) + elif ch == "," and depth == 0: + out.append("".join(buf)) + buf = [] + else: + buf.append(ch) + if buf: + out.append("".join(buf)) + return out + + +def _parse_constraints(body: str) -> str: + bits: list[str] = [] + if _UNIQUE_RE.search(body): + bits.append("UNIQUE") + if _NOT_NULL_RE.search(body): + bits.append("NOT NULL") + return ", ".join(bits) + + +def _summarize_alter(action: str) -> str: + cleaned = " ".join(action.split()) + if len(cleaned) > 160: + cleaned = cleaned[:157] + "..." + return cleaned + + +def _strip_ident(name: str) -> str: + return name.strip().strip('"`[]') + + +def _line_of(text: str, offset: int) -> int: + return text.count("\n", 0, offset) + 1 diff --git a/wikifi/wiki.py b/wikifi/wiki.py index 5c0e014..77d1114 100644 --- a/wikifi/wiki.py +++ b/wikifi/wiki.py @@ -6,9 +6,10 @@ ``` /.wikifi/ config.toml # provider/model overrides; created by `wikifi init` - .gitignore # excludes per-file extraction notes by default + .gitignore # excludes per-file extraction notes + cache by default
.md # one per entry in wikifi.sections.SECTIONS .notes/ # per-file/per-section extraction state (jsonl) + .cache/ # content-addressed extraction + aggregation cache ``` """ @@ -24,12 +25,26 @@ WIKI_DIRNAME = ".wikifi" NOTES_DIRNAME = ".notes" +# Cache dir constant lives here (not in ``cache.py``) so the layout has +# one source of truth and ``cache.py`` can import it without inverting +# the existing ``cache → wiki`` dependency direction. +CACHE_DIRNAME = ".cache" CONFIG_FILENAME = "config.toml" GITIGNORE_FILENAME = ".gitignore" -DEFAULT_GITIGNORE = """# wikifi local working state — section markdown is committed, notes are not. -.notes/ -""" +# Lines we guarantee in ``.wikifi/.gitignore``. Both ``.notes/`` and +# ``.cache/`` are local working state — section markdown is what gets +# committed. New entries appended here are also backfilled into older +# wikis on the next ``wikifi init`` (see :func:`initialize`) so users +# upgrading wikifi don't accumulate noisy untracked files. +_GITIGNORE_REQUIRED_ENTRIES: tuple[str, ...] 
= ( + f"{NOTES_DIRNAME}/", + f"{CACHE_DIRNAME}/", +) +DEFAULT_GITIGNORE = ( + "# wikifi local working state — section markdown is committed, " + "notes and cache are not.\n" + "\n".join(_GITIGNORE_REQUIRED_ENTRIES) + "\n" +) @dataclass(frozen=True) @@ -52,6 +67,10 @@ def gitignore_path(self) -> Path: def notes_dir(self) -> Path: return self.wiki_dir / NOTES_DIRNAME + @property + def cache_dir(self) -> Path: + return self.wiki_dir / CACHE_DIRNAME + def section_path(self, section: Section | str) -> Path: sid = section.id if isinstance(section, Section) else section return self.wiki_dir / f"{sid}.md" @@ -79,8 +98,7 @@ def initialize(layout: WikiLayout, *, model: str, provider: str, ollama_host: st layout.config_path.write_text(_render_config(model=model, provider=provider, ollama_host=ollama_host)) created.append(layout.config_path) - if not layout.gitignore_path.exists(): - layout.gitignore_path.write_text(DEFAULT_GITIGNORE) + _ensure_gitignore(layout) created.append(layout.gitignore_path) for section in SECTIONS: @@ -92,6 +110,28 @@ def initialize(layout: WikiLayout, *, model: str, provider: str, ollama_host: st return created +def _ensure_gitignore(layout: WikiLayout) -> None: + """Ensure the wiki's .gitignore exists and covers every required entry. + + Older wikis predate the cache layer and have a ``.gitignore`` that + only ignores ``.notes/``. Backfill any missing line-by-line entries + from :data:`_GITIGNORE_REQUIRED_ENTRIES` so users upgrading wikifi + don't end up with stray ``.cache/`` (or future-added) directories + showing as untracked changes in the target repo. 
+ """ + path = layout.gitignore_path + if not path.exists(): + path.write_text(DEFAULT_GITIGNORE) + return + existing = path.read_text(encoding="utf-8") + existing_lines = {line.strip() for line in existing.splitlines() if line.strip()} + missing = [entry for entry in _GITIGNORE_REQUIRED_ENTRIES if entry not in existing_lines] + if not missing: + return + suffix_nl = "" if existing.endswith("\n") else "\n" + path.write_text(existing + suffix_nl + "\n".join(missing) + "\n") + + def write_section(layout: WikiLayout, section: Section, body: str) -> Path: """Replace a section's body with rendered markdown.""" path = layout.section_path(section)
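Review note on the `_ensure_gitignore` hunk above: the backfill behaviour can be exercised in isolation with a small sketch. The helper name `backfill_gitignore`, its return value, and the `REQUIRED_ENTRIES` constant are hypothetical stand-ins for illustration, not part of wikifi; only the append-missing-entries logic is taken from the diff.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for _GITIGNORE_REQUIRED_ENTRIES from the diff.
REQUIRED_ENTRIES: tuple[str, ...] = (".notes/", ".cache/")


def backfill_gitignore(path: Path, required: tuple[str, ...] = REQUIRED_ENTRIES) -> list[str]:
    """Append required entries missing from ``path`` and return the ones added."""
    if not path.exists():
        # Fresh wiki: write every required entry.
        path.write_text("\n".join(required) + "\n", encoding="utf-8")
        return list(required)
    existing = path.read_text(encoding="utf-8")
    # Compare whole stripped lines so a comment mentioning ".cache/" does not count.
    present = {line.strip() for line in existing.splitlines() if line.strip()}
    missing = [entry for entry in required if entry not in present]
    if missing:
        # Add a newline first if the old file did not end with one.
        separator = "" if existing.endswith("\n") else "\n"
        path.write_text(existing + separator + "\n".join(missing) + "\n", encoding="utf-8")
    return missing


with tempfile.TemporaryDirectory() as tmp:
    gitignore = Path(tmp) / ".gitignore"
    # Simulate a pre-cache wiki: only .notes/ is ignored, no trailing newline.
    gitignore.write_text("# old wiki\n.notes/", encoding="utf-8")
    first = backfill_gitignore(gitignore)   # adds the missing .cache/ entry
    second = backfill_gitignore(gitignore)  # nothing left to add
    final_text = gitignore.read_text(encoding="utf-8")
```

Running the helper twice shows the property the hunk relies on: the second call finds nothing missing and leaves the file untouched, so repeated `wikifi init` runs should not grow the file.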