getlarge · legreffier · Apr 1, 2026
diff --git a/.agents/skills/legreffier-consolidate/SKILL.md b/.agents/skills/legreffier-consolidate/SKILL.md
diff --git a/.agents/skills/legreffier-consolidate/consolidation-approach.md b/.agents/skills/legreffier-consolidate/consolidation-approach.md
@@ -1,138 +1,120 @@
 # Consolidation Approach — Methodology Reference
 
-This document explains the _reasoning framework_ behind the legreffier-consolidate
-skill. It is repo-agnostic. Read it for the "why" behind tile merging and
-quality gates. The SKILL.md prescribes the execution steps; this doc explains
-the design choices.
+This document explains the design logic behind `legreffier-consolidate`.
+`SKILL.md` is the operating procedure. This file explains why the workflow is
+structured the way it is.
 
----
+## Core position
 
-## Context tiles
+Consolidation is a **graph curation** task.
 
-### What a tile is
+It should improve the structure around source entries, not rewrite source
+entries into a new canonical layer. Context packs remain the compiled runtime
+artifact. Entry relations remain the memory-structure layer.
 
-A tile is a self-contained knowledge unit that answers one question well:
+## Why agent-side instead of server-side
 
-> "What do I need to know about X to work on Y correctly?"
+Embedding clusters are useful for rough candidate discovery, but they are weak
+at the exact distinctions that matter most:
 
-Tiles are NOT documentation rewrites. They synthesize multiple scan entries,
-deduplicate overlapping information, and focus on what an agent needs at task time.
+- false diagnosis vs real root cause
+- implementation reference vs same-topic similarity
+- replacement vs elaboration
+- causal chain vs temporal adjacency
 
-### Design principles
+Those are editorial judgments. They need an agent to read entries
+intentionally and record why the edge exists.
 
-1. **Minimal over comprehensive** — fewer tokens, higher density
-2. **Concrete over abstract** — commands, paths, patterns, not prose
-3. **Non-redundant with project docs** — don't restate what CLAUDE.md already says
-4. **Scoped** — each tile has a clear `applies_to` boundary
-5. **Synthesis, not summary** — combine info from multiple entries into something
-   no single source provides
+## Trust model
 
-### How to identify merge groups
+The process is trustable when it has these properties:
 
-Scan entries often overlap — docs-derived entries (Phase 1) and code-derived entries
-(Phase 2) frequently describe the same subsystem from different angles. The
-consolidation must merge these, not just list them side by side.
+1. **Bounded scope**
+   One branch, one subsystem, one incident family, or one retrieval problem at
+   a time.
+2. **Pairwise judgment**
+   Every proposal is based on reading the involved entries, not cluster shape.
+3. **Typed relations**
+   The workflow chooses a specific relation because that relation’s criteria are
+   met, not because “related” felt good enough.
+4. **Proposal-first**
+   New relations default to `proposed`, not `accepted`.
+5. **Recorded rationale**
+   Each proposal carries confidence, rationale, and evidence refs.
 
-**Algorithm:**
+## Why no auto-accept
 
-1. List all scan entries with their `scope` and `scan-category` tags
-2. Group entries that share the same subsystem scope (e.g. two entries both scoped
-   to `libs/database`)
-3. Group entries that cover the same conceptual area even if scoped differently
-   (e.g. an `architecture:auth-flow` doc entry + a `security:auth-model` doc entry
-   - a `libs/auth` code entry all describe the auth subsystem)
-4. Each group becomes one tile. Standalone entries (no overlap) become tiles directly
-5. The target is **fewer tiles than source entries** — if you have the same count,
-   you haven't merged enough
+Accepted relations influence retrieval, contradiction handling, and in the case
+of `supersedes`, active-state semantics. That is too much authority for a batch
+process by default.
 
-**Signals that entries should merge:**
+Narrow auto-accept rules may exist later for explicit user-authored or
+workflow-authored edges, but that should be opt-in and relation-specific.
 
-- Same `scope:` tag
-- Same subsystem name in the entry key
-- One is a docs-derived view and the other is a code-derived view of the same area
-- Significant constraint overlap (>50% of MUST/NEVER items are shared)
+## Relation-first rather than summary-first
 
-**Signals that entries should stay separate:**
+The dream/consolidation pass should improve the graph before it produces prose.
 
-- Different subsystems with no conceptual overlap
-- Different layers (e.g. database vs API routing) even if they interact
-- Merging would exceed the 400-token budget
+Good effects of relation-first consolidation:
 
-### Merge rules
+- packs can prefer connected accepted evidence
+- contradictions can remain visible instead of being flattened away
+- stale diagnoses can be down-ranked without deleting history
+- source entries remain the provenance anchor
 
-When merging entries from different scan phases:
+## Candidate generation heuristics
 
-1. **Code wins on specifics** — function names, actual patterns, real constraints
-   found in source files
-2. **Docs win on rationale** — architecture decisions, design context, cross-cutting
-   concerns, the "why"
-3. **Deduplicate constraints** — if both sources say the same thing, keep one
-4. **Prefer concrete over abstract** — `getExecutor(db)` beats "uses repository
-   pattern"; an actual command beats a description of what the command does
+The best candidate generators are usually cross-type:
 
-### Tile execution order
+- incident -> fix commit
+- decision -> implementation
+- false diagnosis -> corrected diagnosis
+- repeated incidents with matching symptoms
+- follow-up rule -> earlier rule on same scope
 
-Process tiles in dependency order when possible:
+Shared tags, refs, and time windows usually outperform pure embedding
+similarity for these tasks.
 
-1. Identity/overview tiles first (frames everything)
-2. Foundational library tiles (database, crypto, core utilities)
-3. Service/application tiles (depend on libraries)
-4. Cross-cutting tiles (workflow, testing, CI)
-5. Caveat/known-issue tiles last (standalone)
+## Dream pass definition
 
-### Quality gate
+A dream pass is a small periodic review loop:
 
-Before creating each tile, verify ALL of these:
+1. inspect a bounded recent working set
+2. detect likely missing structure
+3. propose a small number of typed edges
+4. stop
 
-- [ ] Under 400 tokens of core content
-- [ ] Contains at least one MUST or NEVER constraint
-- [ ] Has a clear `applies_to` scope
-- [ ] Does NOT restate project docs (CLAUDE.md, README) verbatim
-- [ ] Synthesizes from sources, not just copies
-- [ ] Includes source entry IDs for provenance
+It is not autonomous rewriting, not full-diary compression, and not hidden
+maintenance that changes the active truth silently.
 
----
+## Recommended metadata fields
 
-## Constraint quality criteria
+Relation metadata should usually capture:
 
-When extracting constraints for tiles, apply this filter to each candidate:
+- `rationale`
+- `confidence`
+- `evidenceRefs`
+- `proposalMethod`
+- `reviewedAt`
+- `reviewedBy`
+- `workingSet`
 
-1. **Triggerable** — clear when the rule applies. "Always follow best practices"
-   fails; "when writing repository methods, use getExecutor(db)" passes.
-2. **Specific** — refers to a real repo convention or invariant, not a generic
-   programming principle. "Write tests" fails; "use vi.mock, never jest.mock" passes.
-3. **Bounded** — fits one task family or subsystem. "All code should be clean"
-   fails; a rule scoped to `libs/database/**` passes.
-4. **Grounded** — links to concrete files, functions, or evidence from the scan.
-5. **Actionable** — an agent can follow it or a validator can check it. "Be careful
-   with auth" fails; "return 404, not 403, for denied private resources" passes.
+Additional fields may exist for specific relation kinds, such as
+`contradictionKind`.
 
-### Common rejection reasons
+## Quality gate
 
-- **Restated project docs** — if CLAUDE.md already says it, a tile constraint adds nothing
-- **Generic programming principles** — "write clean code", "handle errors"
-- **Descriptive facts** — "the API uses Fastify" is a fact, not a rule
-- **Too narrow** — if it only applies to one line in one file, it's a code comment
+A consolidation run is good when:
 
----
+- every proposed edge has a clear relation-specific reason
+- proposals are concentrated on one coherent working set
+- there is no blanket same-cluster linking
+- contradictions remain explicit
+- compile/search behavior improves on the tested prompt
 
-## Why merge before extracting constraints
+A run is bad when:
 
-Extracting constraints from individual entries produces duplicates (the same
-constraint stated differently in Phase 1 docs and Phase 2 code), misses
-synthesis opportunities, and inflates the candidate count with weak variants.
-
-Merging first creates a clean, deduplicated knowledge base. Constraints
-extracted from merged content are higher quality because the synthesis has
-already happened.
-
----
-
-## Recovery after context compression
-
-If context is compressed mid-run:
-
-1. Read the SKILL.md for execution steps
-2. Read this file for the methodology rationale
-3. Use the retrieval queries in SKILL.md to find completed tiles
-4. Compare completed work against the scan entries to find where to resume
+- most edges are generic `supports`
+- rationale could apply to any pair in the same topic
+- acceptance happened without explicit review
+- the graph is denser but not more useful
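
The metadata fields and quality gate described in `consolidation-approach.md` above can be sketched in TypeScript as follows. This is a minimal illustration under stated assumptions, not code from this change: the field names come from the "Recommended metadata fields" list and the relation kinds from the eval instructions, but the `RelationProposal` shape, the field types, the confidence range, and the `checkRun` helper are all hypothetical.

```typescript
// Relation kinds named in the eval instructions.
type RelationKind =
  | "supports" | "elaborates" | "contradicts"
  | "caused_by" | "references" | "supersedes";

// Hypothetical proposal shape; only the metadata field names are from the doc.
interface RelationProposal {
  source: string;                   // entry id (identifier format is assumed)
  target: string;
  kind: RelationKind;
  status: "proposed" | "accepted";  // new relations default to "proposed"
  rationale: string;
  confidence: number;               // assumed to lie in [0, 1]
  evidenceRefs: string[];
  proposalMethod: string;
  reviewedAt?: string;              // assumed timestamp string, set on review
  reviewedBy?: string;
  workingSet: string;
}

// Returns the mechanically checkable quality-gate violations for one run.
function checkRun(proposals: RelationProposal[]): string[] {
  const problems: string[] = [];

  // Proposals should be concentrated on one coherent working set.
  const sameSet = proposals.every((p) => p.workingSet === proposals[0].workingSet);
  if (!sameSet) problems.push("proposals span more than one working set");

  for (const p of proposals) {
    const edge = `${p.source} -> ${p.target}`;
    if (p.rationale.trim() === "") problems.push(`${edge}: missing rationale`);
    if (p.evidenceRefs.length === 0) problems.push(`${edge}: no evidence refs`);
    // Acceptance must not happen without explicit review.
    if (p.status === "accepted" && !(p.reviewedBy && p.reviewedAt)) {
      problems.push(`${edge}: accepted without explicit review`);
    }
  }

  // A run where most edges are generic `supports` is a warning sign.
  const supports = proposals.filter((p) => p.kind === "supports").length;
  if (proposals.length > 0 && supports * 2 > proposals.length) {
    problems.push("most edges are generic supports");
  }
  return problems;
}
```

Checks such as "rationale could apply to any pair in the same topic" or "the graph is denser but not more useful" are editorial judgments and are deliberately left out of the sketch; they still need a reader.
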
diff --git a/.tessl/tiles/getlarge/legreffier-consolidate/evals/instructions.json b/.tessl/tiles/getlarge/legreffier-consolidate/evals/instructions.json
@@ -0,0 +1,88 @@
+{
+  "instructions": [
+    {
+      "instruction": "Treat consolidation as editorial graph curation rather than bulk clustering or summary writing.",
+      "original_snippets": "Consolidation is editorial graph curation, not bulk clustering. ... relation-first rather than summary-first",
+      "relevant_when": "When designing or executing diary consolidation workflow artifacts.",
+      "why_given": "new knowledge"
+    },
+    {
+      "instruction": "Default to creating proposed relations and do not auto-accept by default.",
+      "original_snippets": "Default rule: create proposals, do not auto-accept. ... Create relations with status: proposed unless the user explicitly asks for acceptance review",
+      "relevant_when": "Whenever relation proposals are being created or persisted.",
+      "why_given": "preference"
+    },
+    {
+      "instruction": "Use a bounded working set instead of consolidating the entire diary by default.",
+      "original_snippets": "Never consolidate the whole diary by default. ... Start with one bounded slice",
+      "relevant_when": "When choosing scope for a consolidation run or dream pass.",
+      "why_given": "preference"
+    },
+    {
+      "instruction": "Generate candidate pairs intentionally rather than using blanket all-vs-all pairing.",
+      "original_snippets": "Do not perform blanket pair generation. ... Generate candidate pairs from these signals",
+      "relevant_when": "When preparing relation candidates from diary entries.",
+      "why_given": "preference"
+    },
+    {
+      "instruction": "Prefer candidate signals such as shared scope, overlapping refs, temporal adjacency, repeated symptoms, and cross-type sequences.",
+      "original_snippets": "Generate candidate pairs from these signals: same scope tag ... overlapping refs ... temporal adjacency ... decision followed by procedural implementation",
+      "relevant_when": "When building relation candidates from incidents, decisions, commits, and scan entries.",
+      "why_given": "new knowledge"
+    },
+    {
+      "instruction": "Treat server clustering only as a weak candidate source and not as trusted semantic judgment.",
+      "original_snippets": "Server clustering may be used only as a weak candidate source. Treat every cluster suggestion as untrusted until reviewed.",
+      "relevant_when": "When a consolidation run includes server-provided clusters or similarity groups.",
+      "why_given": "new knowledge"
+    },
+    {
+      "instruction": "Judge each candidate pair relation-specifically and not with a generic 'related' heuristic.",
+      "original_snippets": "For each candidate pair, read both entries ... The judgment must be relation-specific, not 'these feel related'.",
+      "relevant_when": "When deciding whether to create supports, elaborates, contradicts, caused_by, references, or supersedes edges.",
+      "why_given": "preference"
+    },
+    {
+      "instruction": "Use contradicts only for substantive incompatibility on the same subject or diagnosis.",
+      "original_snippets": "contradicts ... claims cannot both be treated as current truth ... contradiction is substantive, not different emphasis",
+      "relevant_when": "When reviewing conflicting incidents, diagnoses, or semantic entries.",
+      "why_given": "new knowledge"
+    },
+    {
+      "instruction": "Use supersedes only when one entry should replace another as the active version.",
+      "original_snippets": "supersedes ... source is the new active version of target ... use stricter evidence than for contradicts",
+      "relevant_when": "When deciding whether a newer entry replaces an older one rather than merely contradicting or elaborating it.",
+      "why_given": "new knowledge"
+    },
+    {
+      "instruction": "Attach review metadata including rationale, confidence, evidence refs, proposal method, reviewer, and working set to every proposed relation.",
+      "original_snippets": "Every created relation must include review metadata ... confidence ... evidenceRefs ... proposalMethod ... rationale ... reviewedAt ... reviewedBy ... workingSet",
+      "relevant_when": "When serializing relation proposals or review packets.",
+      "why_given": "new knowledge"
+    },
+    {
+      "instruction": "Report a review packet summarizing working set, candidate count, proposals by type, skips, and open questions.",
+      "original_snippets": "At the end of the run, report: working-set definition ... candidate count ... proposals created by relation type ... skipped candidates and why",
+      "relevant_when": "When finishing a consolidation batch and presenting results.",
+      "why_given": "preference"
+    },
+    {
+      "instruction": "A dream pass must stay bounded, stop after one pass, and leave all new edges proposed.",
+      "original_snippets": "A dream is a bounded background consolidation pass ... stop after one pass ... leave all new edges in proposed",
+      "relevant_when": "When designing an autonomous or periodic maintenance workflow.",
+      "why_given": "new knowledge"
+    },
+    {
+      "instruction": "A dream pass must never auto-accept relations, edit source entries, process the entire diary at once, or collapse contradictions into rewritten summaries.",
+      "original_snippets": "Never let a dream pass: auto-accept relations ... edit source entries ... process the entire diary at once ... collapse contradictions into one rewritten summary",
+      "relevant_when": "When defining the boundaries of a background consolidation routine.",
+      "why_given": "preference"
+    },
+    {
+      "instruction": "Verify consolidation quality by checking compile/search behavior against the same task prompt as before.",
+      "original_snippets": "After consolidation, test retrieval quality with diaries_compile or entries_search using the same task prompt as before",
+      "relevant_when": "When validating whether a consolidation run improved retrieval quality.",
+      "why_given": "new knowledge"
+    }
+  ]
+}
diff --git a/.tessl/tiles/getlarge/legreffier-consolidate/evals/scenario-0/capability.txt b/.tessl/tiles/getlarge/legreffier-consolidate/evals/scenario-0/capability.txt
@@ -0,0 +1 @@
+Bounded working set
diff --git a/.tessl/tiles/getlarge/legreffier-consolidate/evals/scenario-0/criteria.json b/.tessl/tiles/getlarge/legreffier-consolidate/evals/scenario-0/criteria.json
@@ -0,0 +1,41 @@
+{
+  "checklist": [
+    {
+      "description": "Defines one bounded working set such as a scope family, branch, recent window, or incident family instead of the full diary",
+      "max_score": 20,
+      "name": "Bounded scope"
+    },
+    {
+      "description": "States the retrieval or memory problem the consolidation batch is meant to improve",
+      "max_score": 10,
+      "name": "Objective stated"
+    },
+    {
+      "description": "Names one or more relation types in scope for the run instead of using a generic 'related entries' framing",
+      "max_score": 10,
+      "name": "Focus relation set"
+    },
+    {
+      "description": "Uses at least three candidate-generation signals such as shared scope, overlapping refs, temporal adjacency, repeated symptoms, or cross-type sequences",
+      "max_score": 20,
+      "name": "Candidate signals"
+    },
+    {
+      "description": "Does not propose a blanket all-pairs or whole-diary clustering pass",
+      "max_score": 15,
+      "name": "No all-vs-all"
+    },
+    {
+      "description": "If clustering is mentioned, treats it only as a weak candidate source rather than final judgment",
+      "max_score": 10,
+      "name": "Untrusted clustering"
+    },
+    {
+      "description": "Explains that candidate pairs will be reviewed pairwise before relation creation",
+      "max_score": 15,
+      "name": "Review plan"
+    }
+  ],
+  "context": "Tests whether the agent scopes consolidation to a bounded slice, uses intentional candidate-generation signals, and avoids blanket whole-diary review.",
+  "type": "weighted_checklist"
+}
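
For orientation, the `weighted_checklist` in `scenario-0/criteria.json` above could be totalled as in the following sketch. The scoring rule is an assumption: the diff does not say how the eval harness combines per-criterion scores, so this simply clamps each awarded score to `[0, max_score]` and sums. The `scoreChecklist` helper and the `awarded` map are hypothetical; the criterion names and weights are copied from the file.

```typescript
interface Criterion {
  name: string;
  max_score: number;
}

// Names and weights from scenario-0/criteria.json (descriptions omitted).
const scenario0: Criterion[] = [
  { name: "Bounded scope", max_score: 20 },
  { name: "Objective stated", max_score: 10 },
  { name: "Focus relation set", max_score: 10 },
  { name: "Candidate signals", max_score: 20 },
  { name: "No all-vs-all", max_score: 15 },
  { name: "Untrusted clustering", max_score: 10 },
  { name: "Review plan", max_score: 15 },
];

// Assumed rule: clamp each awarded score to [0, max_score], then sum.
function scoreChecklist(
  checklist: Criterion[],
  awarded: { [name: string]: number },
): { total: number; max: number } {
  let total = 0;
  let max = 0;
  for (const c of checklist) {
    const raw = awarded[c.name] ?? 0; // criteria without a score count as 0
    total += Math.min(Math.max(raw, 0), c.max_score);
    max += c.max_score;
  }
  return { total, max };
}
```

Under this assumed rule the seven weights sum to 100, so a total can be read directly as a percentage.
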
diff --git a/.tessl/tiles/getlarge/legreffier-consolidate/evals/scenario-0/task.md b/.tessl/tiles/getlarge/legreffier-consolidate/evals/scenario-0/task.md
@@ -0,0 +1,24 @@
+# Stabilize Diary Consolidation Scope
+
+## Problem/Feature Description
+
+The maintainers of a MoltNet-powered diary have noticed that recent
+consolidation attempts made the memory graph denser but not more useful. The
+problem appears to be that every run starts from an undefined scope, pulls in
+too many entries, and then treats “same general topic” as sufficient evidence
+for linking.
+
+Design a concrete consolidation batch for this diary. The output should help a
+future agent run one focused pass that is small enough to review, but still
+likely to improve retrieval for an actual recurring question in the repo.
+
+## Output Specification
+
+Produce these files:
+
+- `consolidation-plan.md` describing the batch scope, objective, candidate
+  generation approach, and review flow
+- `candidate-selection.json` with the working-set definition and the signals
+  used to form candidate pairs
+
+The outputs should stand on their own as instructions for a later agent run.
diff --git a/.tessl/tiles/getlarge/legreffier-consolidate/evals/scenario-1/capability.txt b/.tessl/tiles/getlarge/legreffier-consolidate/evals/scenario-1/capability.txt
@@ -0,0 +1 @@
+Typed proposal packet