diff --git a/.claude/board/EPIPHANIES.md b/.claude/board/EPIPHANIES.md index d64c1dab..733a1a45 100644 --- a/.claude/board/EPIPHANIES.md +++ b/.claude/board/EPIPHANIES.md @@ -1,3 +1,163 @@ +## 2026-07-03 — E-V3-THINK-ATOMS-GRIDDED-PERTURBATION-CASCADE-1 [CONJECTURE, probe-gated — nothing retired/locked]: both think & do are methods on a ClassView-as-struct-of-methods; atoms are fractal ClassViews fanned out over the Morton grid; DATA via perturbation cascade / META via bundle; bundle-above / address-below cost crossover +**Status:** CONJECTURE — architectural model co-developed this session over 6 turns. The GROUNDING is FINDING/receipted; the SYNTHESIS is UNPROVEN. Explicitly NOT an operator ruling, NOT a lock, NOT a retirement — collapses no type, retires no enum, refactors no code, supersedes no shipped decision until the reconstruction probe runs green. Operator directive: **keep an open mind — the valuable result may be the residual the probe CANNOT fit.** + +**The model (hypothesis):** +1. A thinking op is a coordinate, not an enum: `(Pearl-rung × classid→ClassView + StepMask) → ExecTarget`. Deduction = rung-0 dispatch; a style = a rung-profile × StepMask over a ClassView. Flow-Rung(9) is an ORTHOGONAL escalation axis, not the same ladder as Pearl(3). +2. Both THINK (deduction/style) and DO (ActionDef) are METHODS resolved through `classid → ClassView` — ONE resolution surface, TWO commit gates (think → inward SoA via W1b; do → outward via RBAC+MUL Flow). The commit split IS the firewall; it must NOT collapse. +3. A ClassView materializes as a const STRUCT OF METHODS (a vtable, one per classid from codegen); members compose as the existing ComputeEdge DAG (topo-ordered) — which is the W3 compiled thinking template seen from the other side. +4. Thinking atoms (deduct/syllogism/counterfactual/synthesize/extrapolate/infer/prefetch — a set that GENERALIZES InferenceType(5) non-isomorphically) are themselves fractal ClassViews. 
`think::atom::fanout(StepMask, grid)` dispatches them Morton-parallel over the grid (NOT a linear sweep); prefix-disjoint tiles mostly don't collide. +5. DATA level = Morton-tile PERTURBATION shader cascade (exponent/location/phase/magnitude; deterministic phase never stored; palette magnitude; O(1) tile; lossless + inspectable) — NOT a VSA data-soup. Collision reduce = commutative monoid (gated write-back / NARS evidence-revision / arithmetic), NO VSA. +6. META level = bundle — the CHEAP roll-up materialization of the upper cascade (few nodes, one commutative reduce, summary wanted); the topmost instance of OGAR's "key prerenders with zero value decode." +7. Cost crossover: bundle ABOVE / address BELOW, at the level where roll-up cost beats fan-in cost. Same operator (bundle), opposite verdict by LEVEL — data-soup wrong (register-loss), upper roll-up cheapest-right. + +**GROUNDING (FINDING — receipted):** `atoms.rs` Pearl(3)+Rung(9) already separate; `nars::InferenceType(5)` already carries scan-depth semantics; `class_view.rs` = `ClassView` TRAIT + `FieldMask(u64)` + `ComputeEdge` DAG w/ `compute_dag_topo_order` (the method-composition machinery already exists as a computed-field DAG); `action.rs` `ActionDef` = documented "Perdurant sibling of `codegen_manifest::MethodSig`," same `classid→ClassView` inheritance + `overrides`, commit gate = RBAC+MUL Flow; `ExecTarget`(Native/Jit/SurrealQl/Elixir) real, `SALE_ORDER_ACTIONS` tagged SurrealQl; 1BRC (E-1BRC-GRIDLAKE-SWEETSPOT-1 + E-1BRC-ADDRESSING-1) measured the Morton route + NON-VSA arithmetic merge (route-and-write 3×; orchestration flattens the curve); OGAR perturbation doctrine + `I-VSA-IDENTITIES` are the data-cascade canon. + +**UNBUILT additions (not code yet):** (a) generalize ComputeEdge nodes computed-field → method(THINK/DO); (b) `StepMask` in contract (W3a, queued); (c) const `ClassView` struct-of-methods from codegen. 
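A minimal sketch of the Morton addressing that item 4 of the model relies on, assuming a 2-D grid and 16-bit coordinates (both are assumptions; the entry does not fix the grid's dimensionality). `morton2`, `tile_of` and `group_by_tile` are hypothetical names for illustration, not functions from the stack. The sketch shows why prefix-disjoint tiles can be fanned out in parallel: cells that share a Morton prefix form one square tile, so two different prefixes never address the same cell.

```python
def morton2(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y (x lands in even bit positions)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def tile_of(code: int, tile_bits: int) -> int:
    """Tile id = the Morton code with its low 2*tile_bits bits dropped (the shared prefix)."""
    return code >> (2 * tile_bits)


def group_by_tile(cells, tile_bits):
    """Bucket (x, y) cells by Morton prefix; each bucket is an independent unit of work."""
    tiles = {}
    for x, y in cells:
        tiles.setdefault(tile_of(morton2(x, y), tile_bits), []).append((x, y))
    return tiles


# A 4x4 grid split into 2x2 tiles (tile_bits=1) yields four disjoint tiles.
cells = [(x, y) for x in range(4) for y in range(4)]
tiles = group_by_tile(cells, tile_bits=1)
```

Because every cell belongs to exactly one prefix, a collision can only arise when two atoms write the same cell inside one tile, which is where the commutative reduce of item 5 would apply.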
+ +**OPEN QUESTIONS / where it may be WRONG:** NARS-5 ↔ Pearl-3 is a projection NOT an isomorphism (Revision=write-axis, Synthesis=multi-path — off the see/do/imagine line); the atom set generalizing InferenceType is asserted, unverified. The cost crossover is derivable-in-principle but UNMEASURED (level + sharpness conjectural). The data-cascade/meta-bundle boundary may be fuzzier than a clean line. Whether `dispatch = bitmask-AND over topo-order` reproduces CURRENT routing bit-for-bit is the whole open question — the model could name a style/inference/action the coordinate CANNOT hold, and THAT residual is the valuable result. **Correction (operator, same session): bundle is NOT declared legacy.** Bundle is valid where the substrate is genuinely mathematical (semigroup / Markov / Jirak); WHERE it applies (meta roll-up vs detail cascade vs formal-scaffold pillars) is a MATH question OWNED BY THE JC CRATE (jirak/pearl/ewa_sandwich), deferred — not settled here. The ONLY firm claim is the negative one: VSA-data-**soup at the detail level** is the anti-pattern (register-loss, `I-VSA-IDENTITIES`). `I-SUBSTRATE-MARKOV` / `data-flow.md` BUNDLE framing stays intact pending that JC derivation. + +**PROMOTION GATE (reconstruction probe — CONJECTURE→FINDING only when green):** build `think::atom::{…}` as const ClassView method-DAGs; `fanout(StepMask, grid)` Morton-parallel; data reduce = commutative monoid, meta = MetaWord bundle; show (a) reproduces current InferenceType/ThinkingStyle/ActionDef routing bit-for-bit, (b) order-independence across tile-completion permutations, (c) zero VSA in the data path, (d) both commit gates stay distinct. GREEN promotes; RED names the residual (equally valuable). Natural first probes: W3a (StepMask) + #19 (cmpeq_mask ClassView-resolution). Nothing downstream is bound to this until the gate is green. 
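Gate condition (b), order-independence across tile-completion permutations, can be sketched as follows. This assumes the data reduce is NARS-style evidence revision with evidence carried as `(positive, total)` counts and an evidential horizon `k = 1`; the representation and the names `revise`, `truth` and `IDENTITY` are illustrative assumptions, not the stack's API. Adding evidence counts is commutative and associative with an identity element, so every completion order reduces to the same value.

```python
from fractions import Fraction
from functools import reduce
from itertools import permutations

IDENTITY = (0, 0)  # "no evidence": the monoid's identity element


def revise(a, b):
    """Merge two (positive, total) evidence pairs by addition."""
    return (a[0] + b[0], a[1] + b[1])


def truth(evidence, k=1):
    """Convert (positive, total) evidence to (frequency, confidence)."""
    positive, total = evidence
    if total == 0:
        return None  # no evidence, no truth value
    return (Fraction(positive, total), Fraction(total, total + k))


# Three tiles finish in an arbitrary order; try every order.
tile_results = [(3, 4), (1, 2), (5, 5)]
outcomes = {reduce(revise, order, IDENTITY) for order in permutations(tile_results)}
merged = next(iter(outcomes))
```

A real probe would run the same check against the actual write-back path; the sketch only shows the property the gate asks for.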
+ +**Cross-refs:** atoms.rs, class_view.rs, action.rs, nars.rs, kanban.rs (ExecTarget), E-1BRC-GRIDLAKE-SWEETSPOT-1, E-1BRC-ADDRESSING-1, OGAR perturbation doctrine, I-VSA-IDENTITIES, W3a + #19. + +## 2026-07-03 — E-V3-DUAL-SCHEMA-0x1000-IS-PERMANENT-1: 0x1000 is NOT a temporary adoption monitor to retire at 100% — v2 and v3 coexist PERMANENTLY by schema; D-CCF-4 (marker retirement) is RESCINDED +**Status:** RULING (operator, 2026-07-03) — reverses the "temporary by declaration" framing carried in INTEGRATION-PLAN D-CCF-4, README §monitor, routing.md §29/96, v3-substrate-primer §98, sonnet-worker-guardrails row, compiled-templates §75 + +**The ruling:** "0x1000 should absolutely not be retired because we said we make v2 and v3 by schema and keep both." V3 is NOT a converge-to-one-version-then-retire-the-marker migration. It is **permanent dual-schema coexistence**: v2 and v3 both live, selected by `ValueSchema` (Bootstrap/Cognitive/Compressed/Full) + `ENVELOPE_LAYOUT_VERSION`. `0x1000` is therefore a **permanent schema discriminator**, not transitional adoption telemetry. + +**What this reverses:** +- **D-CCF-4** ("0x1000 marker retirement (P4), trigger = adoption reads 100%") — **RESCINDED.** There is no retirement trigger; 100% adoption is not a retirement event. +- The "**temporary by declaration**" / "temporary adoption MONITOR" language in `README.md §24`, `v3-substrate-primer.md §98`, `sonnet-worker-guardrails.md` row, `compiled-templates.md §75`, `routing.md §29/96` — reframed: 0x1000 is a permanent discriminator that ALSO yields adoption telemetry, not a marker whose reason-to-exist ends at 100%. + +**What survives:** +- **W6a scanner** stays (adoption% + old-form count) — but as PERMANENT telemetry, never a retirement gate. 100% is just a number, not a checkpoint. +- **W6b** (legacy alias retirement) narrows to genuinely-dead pre-flip forms (`0x0000_DDCC` zero-prefix, `0xAAAA_DDCC` render-prefix-high), NOT the v2 schema path — v2-by-schema is kept. 
+- **RESERVE-DON'T-RECLAIM + I-LEGACY-API-FEATURE-GATED** — this ruling IS those iron rules at the schema level: versions coexist by gate, never reclaimed. 0x1000 permanence = the same shape as the classid/family fixed-offset reservation. + +**Consequence for MODULE-TABLE §202:** "a future canon==0x1000 would be indistinguishable from the V3 marker" is now a PERMANENT design constraint (0x1000 in the canon half stays reserved-against), not a hazard that clears at retirement. + +**Cross-refs:** INTEGRATION-PLAN W6 (D-CCF-4 rescinded), soa_layout/routing.md, soa_layout/le-contract.md §32, README.md §24, knowledge/v3-substrate-primer.md §6, ENVELOPE_LAYOUT_VERSION + ValueSchema (canonical_node.rs). Supersedes the "temporary" framing everywhere it appears. + +## 2026-07-03 — E-V3-CODEC-FIDELITY-IS-REPRESENTATION-NOT-CODEC-1: the stack's 0.96–0.998 codec anchors are properties of the ENGINEERED representation, NOT of the codec on raw vectors — proven with the real ndarray codec (Base17 on raw Jina = |ρ| 0.32) +**Status:** FINDING (measured with the REAL codec algorithm — faithful port of `ndarray::hpc::jina::codec::Base17Token::from_f32` — against the Jina v3 fixed point; scratch python, Jina embeddings not committed) — the capstone/close of the codec-crawl thread (operator handed the sequence highheelbgz → bgz17 → codec-research → dn_tree/deepnsm → jina/codec + cam_pq; each confirmed the same mechanism) + +**The claim, decisively measured:** the stack's high codec-fidelity anchors — Base17 ρ=0.965, ZeckBF17 ρ=0.982 (`lance-graph-codec-research/src/zeckbf17.rs`), palette256 ρ=0.9973, jina_lens ρ>0.998 — are **properties of the representation the codec sits on, not of the codec applied to arbitrary vectors.** Every one assumes the input has already been re-expressed in a **low-intrinsic-dimension structural basis**: ZeckBF17's own premise is "16384 dims = 17 primes × 964 octaves, only ~14 independent (JL bound)"; deepnsm is "65–74 NSM primes"; the DN-tree 
(`ndarray/src/hpc/dn_tree.rs`) is a fanout-4 cascade over that structured space. **That basis is compressible by construction; raw dense embeddings are not.** + +**Proof (the real codec on raw Jina):** `Base17Token::from_f32` is a 1024→17 golden-step strided MEAN (each base dim `bi` = mean of `emb[(bi·11)%17 :: 17]`) — octave-averaging that *denoises* a plane already built as 17×octave copies, but *blurs* a vector that isn't. Ported faithfully, run on the 4096 raw Jina vectors: **Base17 L1 vs Jina cosine |Spearman| = 0.32** (Base17-cos vs Jina-cos +0.38) — **WORSE than naive PCA-17 (0.72)** and **3× below the 0.965 Base17 hits on its native Base17 plane.** The 3× gap IS the structure assumption, quantified. + +**Supporting floor measurements (all raw Jina, naive, vs the Jina fixed point):** PQ reconstruction 8 B→ICC 0.63, 128 B→0.89; PCA *distance* preservation 18 D→ICC 0.45 (Spearman 0.73), 128 D→0.94. So dense 1024-d Jina needs ~128 dims for ~0.94 distance-fidelity — it has high intrinsic dimensionality, the opposite of the engineered 14–74-dim stack basis. + +**Two crucial consequences:** +1. **"palette256/ZeckBF17/Base17 ≈ 0.99" MUST be stated with its domain** — the stack's 17×octave / NSM-prime planes, or a *trained* lens — never as a property that transfers to raw embeddings. Asserting it on raw Jina would have been wrong by 3× (the "measure before stating" catch). +2. **There are TWO different "jina lenses," don't conflate:** (a) `ndarray/hpc/jina/codec.rs::from_f32` — NAIVE golden-step Base17 average, measured |ρ|=0.32 on raw Jina; (b) `thinking-engine/examples/calibrate_lenses.rs` + `jina_lens.rs` — the CALIBRATED/TRAINED lens (affine ICC correction, "ρ>0.998 = truth-anchor grade"). **The 0.998 belongs to (b), not (a).** So a qualia value is a "measured 3σ surfel" iff it comes through the *trained* lens (b); neither the naive Base17 projection (a) nor the `QualiaField::from_text` keyword path inherits it. 
(This is the same finding as E-V3-JINA-IS-THE-FULCRUM §qualia, now with the naive-vs-trained distinction nailed by measurement.) + +**The pipeline, corrected:** to make raw Jina codec-faithful you must FIRST project it through the *calibrated* lens into the structured (Base17/NSM) basis — the naive `from_f32` is not that lens. `dn_tree` + `deepnsm` + `jina/codec` are the structured *representation*; the trained `jina_lens` is the *projection into it*; ZeckBF17/palette are the *codec on it*. Fidelity lives in the projection+representation, not the codec. Deferred real measurement: run the *trained* `thinking-engine::jina_lens` (not `from_f32`) on the cached Jina 4096 and confirm it reaches ~0.998 where the naive projection reached 0.32 — the honest last number. + +**Cross-refs:** `ndarray/src/hpc/jina/codec.rs` (from_f32 golden-step Base17, JinaPalette 256×256 L1 table), `ndarray/src/hpc/dn_tree.rs` (fanout-4 HDC cascade), `ndarray/src/hpc/deepnsm.rs` (65-74 NSM primes → 2048-byte fp), `lance-graph-codec-research/src/zeckbf17.rs` (17×octave premise), `thinking-engine/examples/calibrate_lenses.rs:150-197` (affine ICC correction, ρ>0.998 truth-anchor), `bgz17` palette_semiring (256×256 distance/compose). Companion: E-V3-JINA-IS-THE-FULCRUM-SUBSTRATE-MEASURED-1 (the fulcrum + the naive floor numbers). 
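The golden-step strided mean described in the Proof paragraph above can be sketched in plain Python. This is a reading of the one-line description (`bi` = mean of `emb[(bi·11)%17 :: 17]`), not the real `Base17Token::from_f32`, which may differ in details such as quantization; `base17_project` is a hypothetical name. The sketch illustrates the structure assumption: a vector built as repeated copies of a 17-value pattern passes through unchanged up to a permutation, whereas a vector without that structure has unrelated dimensions averaged together.

```python
def base17_project(emb):
    """Reduce a vector to 17 values: output bi is the mean of emb[(bi*11) % 17 :: 17]."""
    out = []
    for bi in range(17):
        start = (bi * 11) % 17
        lane = emb[start::17]  # every 17th element, starting at the golden-step offset
        out.append(sum(lane) / len(lane))
    return out


# 11 is coprime with 17, so the 17 start offsets are a permutation of 0..16
# and every input dimension is averaged into exactly one output.
starts = sorted((bi * 11) % 17 for bi in range(17))

# A 1024-d vector made of repeated copies of a 17-value pattern:
pattern = [float(i) for i in range(17)]
structured = (pattern * 61)[:1024]
projected = base17_project(structured)
```

On the structured input each lane holds one repeated value, so the mean recovers it exactly; that is the "denoises a plane already built as 17×octave copies" half of the claim.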
+ +## 2026-07-03 — E-V3-JINA-IS-THE-FULCRUM-SUBSTRATE-MEASURED-1: semantic-location validity needs an external fixed point (Jina), measured — the first fulcrum-anchored numbers in the stack +**Status:** FINDING (all numbers measured this session against real Jina v3 1024-d embeddings for all 4096 COCA words; probes in `crates/deepnsm/examples/gridlake_spo_*.rs` + scratch python; licensed data + embeddings NOT committed) — operator-driven arc (Archimedes framing: "give me a fixed point") + +The Archimedean turn: **semantic-location statistics are a lever; validity needs a fulcrum outside them.** All prior session measurements (word co-occurrence covariance, register covariance, Cam4096 reorder) were *internal structure* — describing, not validating. The fixed point is the external ground-truth embedding: **Jina** (Model Registry "GROUND TRUTH"), reachable via `JINA_API_KEY` (env var arrives QUOTE-WRAPPED, len 67 → strip to 65, raw 401 / stripped 200 — the MedCare-rs `GITHUB_TOKEN` trap; Cloudflare `error 1010` blocks `python-urllib` UA → send a browser UA). Fetched jina-embeddings-v3, 1024-d, all 4096 COCA words (93 s, 16.8 MB f32). + +**Measured (all vs the Jina fulcrum, exhaustive — 4096 is small):** +1. **Jina → HHTL location works.** 256-way k-means (HEEL tier) on the Jina vectors: intra-cluster mean cosine 0.670 vs 0.527 random = **1.27× locality** — the coarse HHTL tier from Jina preserves semantics. This is "Jina gives the location for HHTL" (operator reframe) — validated. +2. **Naive quantization ≠ the canon; calibration is the gap.** Raw 8-byte CAM-PQ (k-means PQ, no calibration) reconstructs Jina distance at only **Pearson 0.66 / Spearman 0.63** — far below the palette256 canon anchor **ICC 0.9973**. The gap IS the `codebook_calibrated.rs` γ+φ calibration: γ-expansion `ln(1+x/γ)·γ` decrowds Jina's anisotropic 0.527 center; φ golden-ratio spread distributes across all 256 buckets. 
**Calibration is quantization-FIDELITY (prevents bucket-collapse of a crowded distribution), NOT signal-manufacture** — measured: on the *continuous* signal γ+φ is monotonic (lift 1.10→1.19×, but σ-separation invariant ~0.9σ); its value is preserving that 0.9σ through the u8 palette step, which naive bucketing destroys. Operator was right ("use normalized palette256 + bucket spread + euler_gamma") — it's what stands between 0.66 and 0.9973. +3. **Paradigmatic edges beat syntagmatic against this fulcrum.** is_a/taxonomy siblings (animal 1.25× / body 1.21× / emotion 1.17×) align with Jina far better than raw co-occurrence (**1.10×**). Because **Jina IS paradigmatic** (means-similar) and so is taxonomy; co-occurrence is syntagmatic (appears-together) — a real but *complementary* axis (27× neighborhood lift yet only +0.036 pairwise weight correlation — related but ~1-in-15-neighbors overlap = a different semantic relation, not a failed Jina). **Consequence: for a Jina-anchored HHTL address, bake is_a/part_of/taxonomy edges, NOT raw v_the_n co-occurrence.** The covariance (shared-neighbor, +0.138) beats direct co-occurrence (+0.036) ~4× — the covariance form is the one to keep of the co-occurrence family, but taxonomy beats both. + +**The four-fulcrum doctrine (each organ has its OWN ground truth):** content/qualia → Jina (semantic); AST/structure → the real parse tree (byte-parity, not embedding); NARS/truth → outcome/revision (calibration curve, not cosine). "Wire Qualia+AST+NARS and measure" = validate each against its own fulcrum, THEN measure whether composition PRESERVES each validity (the frankenstein-boundary test — ICC(composite→Jina) vs ICC(qualia-alone→Jina); did superposition drop it?). Qualia-edge validity is UNMEASURED (register is per-text, not per-word — needs a text↔word join). 
**Qualia geometry itself IS Jina-ICC-measured to 3σ (ρ=0.9973 via palette256, `bgz-tensor/quality.rs::icc_3_1`, gated ≥0.99); only the keyword `from_text` value-path and the fragmented 5/17/18-D axis-set (contract=17D `arousal…integration`, cognitive=18D with a duplicate `depth`, resonance=5D) are asserted — corrected from the earlier "qualia is unmeasured" over-claim.** + +**Cross-refs / receipts:** `bgz-tensor/src/codebook_calibrated.rs` (γ+φ calibrate), `bgz-tensor/src/quality.rs:366` (icc_3_1), `bgz-tensor/src/had_cascade.rs:9-11` (measured ICC 0.9975/1.000), `lance-graph-arm-discovery/src/aerial/codebook.rs:10` (palette256 ρ=0.9973 vs cosine), `deepnsm/src/spo.rs` (WordDistanceMatrix::build_from_cam — old wiring = 96D subgenre-freq → CAM-PQ), `crates/jc/src/{jirak,pearl,ewa_sandwich_3d}.rs` (the proper significance/covariance/rung machinery to replace the crude power-iteration). Deferred build: apply γ+φ BEFORE u8 quantization and re-measure toward 0.9973; measure qualia-edge validity once a text↔word join exists; run the substrate comparison through JC's calibrated palette256 rather than raw cosine. 
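The γ-expansion named in measurement 2, `ln(1+x/γ)·γ`, can be sketched as below. Using the Euler-Mascheroni constant as γ is an assumption drawn from the operator's "euler_gamma" remark, and `gamma_expand` is a hypothetical name; the φ golden-ratio spread is not sketched because the entry does not give its formula. The sketch only checks the property the entry states for the continuous signal: the transform is strictly monotonic, so it preserves rank order and any gain has to come from the later u8 bucketing step.

```python
import math

# Assumption: gamma is the Euler-Mascheroni constant.
EULER_GAMMA = 0.5772156649015329


def gamma_expand(x, gamma=EULER_GAMMA):
    """The gamma-expansion as written in the entry: ln(1 + x/gamma) * gamma, for x >= 0."""
    return math.log1p(x / gamma) * gamma


# Evaluate on an evenly spaced grid over [0, 1] and check the order is preserved.
xs = [i / 100 for i in range(101)]
expanded = [gamma_expand(x) for x in xs]
is_monotonic = all(a < b for a, b in zip(expanded, expanded[1:]))
```

Monotonicity is the reason a rank-based score such as Spearman cannot change on the continuous signal under this transform alone.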
+ +## 2026-07-03 — E-V3-RIG-ARM-MUST-BE-ARIGRAPH-1: the rig arm earns its keep ONLY as AriGraph — replace the float-vector retrieval leg with SPO+episodic retrieval over kv-lance; "act as AriGraph" and "graphrag-rs + Leiden ⟷ AriGraph convergence" are ONE wiring seen from two angles +**Status:** FINDING (all three ends receipted at file:line) for the diagnosis + the unification; CONJECTURE (= task #18's retrieval leg, unbuilt) for the wired loop — sharpens E-V3-RIG-CHASSIS-1 + E-V3-RETRIEVAL-IS-NOT-COGNITION-1 + E-V3-GRAPHRAG-VEHICLE-2 into the retrieval-leg spec; the memory-organ counterpart to E-V3-RSGRAPHLLM-IS-ADAPTER-REPATRIATE-1's cognition-organ ruling + +Operator's final claim, decomposed and ground-checked: "rs-graph-llm if it's 500 ns might earn its keep; but it currently points at 64-vec similarity — totally useless; the rig arm was justified by taking over crewai AND langgraph — but then it needs to act as ARIGRAPH; otherwise graphrag-rs + Leiden families ⟷ AriGraph convergence; learn from graphrag-rs and add ours on top — a hard reminder of what we have to wire." Every clause verified: + +**(1) "500 ns earns its keep" — TRUE, and it is the COGNITION engine, not the retrieval leg.** graph-flow (~408-538 ns/step, E-V3-RSGRAPHLLM-...-REPATRIATE-1) is RETAINED by the repatriation ruling as the minimal execution engine; the ~1-2 ms framework around an 8.4 s oracle call is negligible (E-V3-ORACLE-LIVE-1). The engine is not the problem — the MEMORY organ hanging off it is. + +**(2) "64-vec similarity — totally useless" — CONFIRMED at `recommendation-service/src/tasks/vector_search.rs:43-83`.** `embed_query` → pgvector `ORDER BY vector <-> ARRAY[{…}]::vector LIMIT 25` → concat `overview` strings → stuff into an LLM prompt. Float-vector top-k RAG, on **Postgres pgvector** (not even kv-lance), zero SPO / zero episodic / zero NARS / zero graph expansion. 
This is precisely E-V3-RETRIEVAL-IS-NOT-COGNITION-1's "lookup table, not a mind" AND the E-SEMANTIC-KERNEL "copy meaning into a prompt" anti-pattern — applied to the rig arm. It is a DEMO service path, not the real seam. + +**(3) "needs to act as ARIGRAPH" — the target already exists in-tree, transcoded from the Python original.** `crates/lance-graph/src/graph/arigraph/retrieval.rs`: *combined retrieval over triplet graph and episodic memory* = BFS graph expansion (`triplet_graph.rs`) MERGED with fingerprint-based episode recall (`episodic.rs`), producing unified LLM context — with `EXTRACTION_PROMPT`/`REFINING_PROMPT` transcoded verbatim from Python AriGraph `prompts.py`. The full organ set is present: `episodic.rs / triplet_graph.rs / retrieval.rs / spo_bridge.rs / orchestrator.rs / markov_soa.rs`. "Act as AriGraph" = the rig retrieval leg must retrieve over THIS (SPO BFS + episodic recall), not over float vectors. + +**(4) The seam is NAMED-BUT-STUBBED — the code's own honesty IS the "hard reminder."** `rs-graph-llm/crates/episodic-arc-task/src/lib.rs:13-24`: "*These Tasks drive rig's store adapters — rig-lancedb (episodic similarity/retrieval) + rig-surrealdb (kv-lance semantic SPO graph + the versioned commit arc). What rig persists IS the AriGraph tenant SoA, transparently the surrealdb kv-lance view.*" Then line 60-61: "*Until those are wired this records the content address + a recorded flag so the surrounding graph is exercisable.*" So the AriGraph-acting rig arm is SCAFFOLDED (the Tasks, the ContentId/SourceSpan citation gate, the rig-surrealdb/rig-lancedb chassis of E-V3-RIG-CHASSIS-1) and DECLARED, but the persist/retrieve wiring to the AriGraph SoA is the unbuilt tissue. That is the exact backlog the operator points at. 
+ +**The unification (the ruling): the two branches the operator poses are ONE seam, not an either/or.** "Act as AriGraph" and "graphrag-rs + Leiden ⟷ AriGraph convergence" describe the SAME wiring from two angles: **AriGraph IS the retrieval target; graphrag-rs contributes the graph ALGORITHMS that make AriGraph's community/multi-hop retrieval better** — all REUSE-AS-REFERENCE, never fork (P0 fork policy; E-V3-GRAPHRAG-INV-1). Concretely — Leiden community detection (graphrag-rs's is SINGLE-LEVEL, hardcoded level 0, no coarsening loop: we complete the aggregation ourselves) gives the community PARTITION; **arm-discovery's Aerial+ NARS rules give the community SUMMARY** (E-V3-ARM-IS-THE-INDUCTION-ORGAN-1 — truth-graded, semiring-composable, categorically stronger than LLM prose); HippoRAG PPR + LightRAG dual-level (local/global) retrieval = the query geometry, dispatched by the 36 thinking styles (E-V3-COGNITION-...-FANOUT-1); episodic memory = Lance version-window read (`QueryReference::at`, E-V3-GRAPHRAG-VEHICLE-2); belief revision = NARS on the SPO store. "Learn from graphrag-rs and add ours on top" = adopt the algorithms as reference, run them over the AriGraph SPO+episodic SoA, and land the summaries as NARS rules — never over float vectors, never as LLM prose. + +**Second wiring layer surfaced (ties the memory organ to the semantic-kernel capstone):** the in-tree `arigraph/retrieval.rs` still EXTRACTS triplets via an LLM prose `EXTRACTION_PROMPT` and refines via `REFINING_PROMPT`. E-V3-SEMANTIC-KERNEL-REVOLUTIONIZES-RAG-1 says extraction should be DECOMPOSITION (ruff AST + deepnsm COCA + arm ARM), not prompting. So "act as AriGraph" (memory) and "revolutionize RAG" (semantic kernel) are the same build: AriGraph's triplet-extraction leg becomes deterministic decomposition; its refining leg becomes NARS revision; the oracle pages only on the tail. 
The rig arm as AriGraph is therefore NOT a fresh crate — it is (a) the `episodic-arc-task` scaffold wired to persist/retrieve over the AriGraph SoA via rig-surrealdb/rig-lancedb on kv-lance, replacing `vector_search.rs`; (b) the extraction leg swapped from LLM-prompt to decomposition; (c) the community leg run as Leiden-partition → arm-NARS-summary. + +**Consequence for the backlog:** this is the sharpened spec for **task #18's retrieval leg** (the first of its three ordered probes) and the concrete definition of "rig arm earns its keep": rig is justified as the memory chassis IFF it retrieves over AriGraph (SPO BFS + episodic version-window recall over kv-lance, Hamming-native per E-V3-RIG-CHASSIS-1), NOT float-vector top-k. The fuse against regression: **no rig retrieval Task may embed→float→top-k as its terminal answer** — the retrieval surface must return SPO/episodic structure the cognition organ reasons over (The Click P-1: memory is thinking TISSUE, not a service). The representation-seam probe of E-V3-RIG-CHASSIS-1 (does kv-lance carry our fingerprints natively, or does rig's `Embedding = Vec` force a 64×-wasteful widening?) is the gating measurement — it decides whether the AriGraph retrieval is fingerprint-native or pays the float tax the operator called "useless." Owners: `trajectory-cartographer` (episodic/AriGraph SoA), `v3-kanban-executor-engineer` (the graph-flow Task wiring), `truth-architect` (the fingerprint-native-vs-float-widening measurement gate). **Honest boundary:** the AriGraph module + prompts are shipped and transcoded; the rig-surrealdb/rig-lancedb persist/retrieve wiring is stubbed (`episodic-arc-task` says so); neither the decomposition-extraction swap nor the Leiden-completion loop is built — this entry is the SPEC, not the shipped leg. 
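The retrieval shape this entry specifies (BFS expansion over SPO triples, merged with fingerprint-based episode recall) can be sketched in a few lines. This is an illustration under stated assumptions, not a port of `arigraph/retrieval.rs`: fingerprints are plain integers compared by Hamming distance, and `bfs_expand`, `recall_episodes` and `retrieve` are hypothetical names. The point is that the terminal answer is SPO and episodic structure, never a float-vector top-k.

```python
from collections import deque


def bfs_expand(triples, seeds, max_hops):
    """Collect SPO triples reachable from the seed entities within max_hops."""
    adjacency = {}
    for s, p, o in triples:
        adjacency.setdefault(s, []).append((s, p, o))
        adjacency.setdefault(o, []).append((s, p, o))
    seen_nodes, found = set(seeds), []
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand past the hop budget
        for triple in adjacency.get(node, []):
            if triple not in found:
                found.append(triple)
            for nxt in (triple[0], triple[2]):
                if nxt not in seen_nodes:
                    seen_nodes.add(nxt)
                    queue.append((nxt, depth + 1))
    return found


def recall_episodes(episodes, query_fp, top_k):
    """Rank episodes by Hamming distance between integer bit fingerprints."""
    ranked = sorted(episodes, key=lambda ep: bin(ep["fp"] ^ query_fp).count("1"))
    return ranked[:top_k]


def retrieve(triples, episodes, seeds, query_fp, max_hops=2, top_k=1):
    """Merge graph expansion and episodic recall into one structured context."""
    return {
        "triples": bfs_expand(triples, seeds, max_hops),
        "episodes": recall_episodes(episodes, query_fp, top_k),
    }


triples = [("dog", "bite", "man"), ("man", "own", "house"), ("house", "has", "roof")]
episodes = [{"id": "e1", "fp": 0b11110000}, {"id": "e2", "fp": 0b10101010}]
context = retrieve(triples, episodes, seeds=["dog"], query_fp=0b11110001)
```

With a two-hop budget the expansion from `dog` reaches the first two triples and stops before `roof`; the episode whose fingerprint differs by one bit ranks first.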
+ +## 2026-07-02 — E-V3-RSGRAPHLLM-IS-ADAPTER-REPATRIATE-1: rs-graph-llm/graph-flow is the LangGraph EXECUTION ADAPTER (never the spine) — repatriate it into lance-graph as 4 crates behind 3 CI fuses; reject planner-hosting; rig is membrane-tier, not a brain crate +**Status:** FINDING (operator principle doctrine-ratified + dep-legality decided by a 5-Opus-agent audit with file:line receipts) + CORRECTION (the "rung ladder half-wired" belief is REFUTED: ~5%) — ruling on the operator's A-vs-B migration question; transcript wf_1fb3b304-bc2 + +Operator: "rs-graph-llm cannot have more awareness of our stack than the stack itself; the langgraph features suffer the same as the SurrealQL-AST-as-elixir-executor — an ADAPTER, not the spine" + two options (A: fold into lance-graph-planner as an OGAR leg; B: migrate into lance-graph-kanban / -planner / -rig). 5-angle audit (wiring / census / precedent / layering / migration-shape), every claim receipted: + +**Principle: RATIFIED (angle 3).** Claim (2) is nearly verbatim the OGAR SurrealQL-AST-as-adapter law one layer up (`SURREAL-AST-AS-ADAPTER.md:22-24,55` "behavior can't live in DDL; negative-beauty hijack"; the general form `SURREAL-AST-TRAP-PREFLIGHT.md:28-58` "has the address but not the Core types it resolves to"). Claim (1) is pre-ratified by E-V3-RETRIEVAL-IS-NOT-COGNITION-1 and by the crewai/n8n eviction (E-CREWAI-N8N-EVICTED: external orchestrators hold no semantic authority; roles fold in-tree). **Caveat:** crewai/n8n were FULLY evicted (roles redundant); graph-flow is RETAINED (a genuinely-used minimal execution engine, ~408-538 ns/step) — so the resolution is subordinate-as-thin-adapter, NOT delete-the-code. 
+ +**"Half-wired" belief: REFUTED — it's ~5% (angle 1).** `EpistemicMode::for_rung` (temporal.rs:67) is self-contained: called only by its own tests + `QueryReference::at`; grep finds NO consumer outside temporal.rs, and lance-graph-planner is NOT a dep of any graph-flow* crate — so the rung ladder is not even crate-graph-reachable from the ActionHandler. The seam is real but empty: `dispatch/run_gated` take a `gate: &GateDecision` param, and the Daemon supplies a hardcoded `GateDecision::Flow` (daemon.rs:137-139); no rung→`contract::mul::GateDecision` adapter exists anywhere (the planner MUL emits a DIFFERENT type, `MulGateDecision{Proceed/Sandbox/Compass}`, unbridged). The only real rung↔gate coupling is the REVERSE direction and a DIFFERENT rung (`cognitive_shader::on_gate(GateDecision)->RungLevel`). **Four "rung" systems collide** (temporal EpistemicMode 0-9 / cognitive_shader RungLevel / Pearl causal 2-3 / recipe_kernels 1-9) — the aspirational doc prose ("Rung-1-9 Flughöhe in the hot path", "rig docks by writing impl ActionHandler") is what seeds the false "half-wired" impression. Naming disambiguation + the rung→GateDecision adapter are the actually-missing links. + +**Option A: REJECTED as dep-illegal (angle 4).** Hosting langgraph execution in lance-graph-planner REVERSES the spine arrow (today every arrow points DOWN into the zero-dep `lance-graph-contract` leaf; planner→graph-flow inverts it) AND langgraph execution needs episodic/AriGraph, which the standing rule forbids as a planner dep ("AriGraph cannot be a planner dep (circular) — use p64 convergence"; E-V3-GRAPHRAG-VEHICLE-2 guard #2). A also conflates the operator's own measured two-speed split (planner = slow/plan; graph-flow = ns replayable orchestration; ExecTarget = sub-µs hot). 
+ +**Option B: ADOPTED, corrected to 4 crates (angle 5) — and it is largely already the V3 W-wave plan + M25 is SHIPPED.** 4 crates in the ONE lance-graph workspace, every one a `contract` consumer (contract deps on none): (1) **lance-graph-planner** (EXISTS) — slow/plan path; EMITS KanbanMoves (D-MBX-A6), never hosts the executor; (2) **lance-graph-kanban** (NEW = graph-flow + graph-flow-kanban) — the replayable langgraph executor + `KanbanSessionStorage` (M25 SHIPPED: replay = rebuild from board) + W1b ahead-firing writer + template-task/episodic-arc-task; (3) **lance-graph-action** (NEW, or fold into existing lance-graph-ogar) = graph-flow-action + graph-flow-action-ogar — the kgV `ActionHandler`/`dispatch_via`/`GatedOgarHandler` + the rung ladder, sited next to lance-graph-ogar/-rbac (which it already deps); (4) **lance-graph-rig** (NEW, THIN) = rig as ORACLE ONLY (VectorStoreIndex retrieval chassis over kv-lance per E-V3-RIG-CHASSIS-1 + the FailureTicket oracle node), feature-gated. Template stack (elixir-template/template-runtime/template-equivalence/cognitive-compiler) + arm-discovery STAY (already members). rs-graph-llm the external repo is DELETED once green; its services become in-tree integration tests. **rig is MEMBRANE-tier, NOT a brain crate** (angle 4): rig is network egress; cognitive-stack already refuses to link it "per the design"; MedCare Iron Rule 7 (the event never leaves the inside). lance-graph-rig is legal only feature-gated, never a dep of contract/planner/core, BBB-deny-listed for consumer binaries. + +**The reframe that resolves the whole question:** "orchestration in our stack" is ALREADY structurally true — the arrow points DOWN to `lance-graph-contract` (the zero-dep spine), and graph-flow only speaks contract types, so it LITERALLY CANNOT be "more aware than the stack." 
The migration does not GAIN semantic authority (the contract already holds it); it REPATRIATES the executor crates into one workspace and installs **three CI fuses** that keep the adapter from drifting back to spine — "name the membrane or it drifts": **F1 dependency-direction** (grep: `crates/lance-graph-contract/` has ZERO `graph_flow::`/executor import — arrow only executor→contract); **F2 board-is-truth** (Session is DTO-pure carrying only MailboxId; `KanbanSessionStorage` replay=rebuild-from-board is the standing proof the graph is throw-away — M25 gate, green); **F3 oracle-frequency** (rig referenced ONLY in lance-graph-rig's oracle node, reachable ONLY from a FailureTicket; oracle-hit-rate must trend DOWN via template-equivalence compile-down). **Honest gaps:** (a) the rig fork drags the AdaWorldAPI/burn 403 build wall — lance-graph-rig must be off-by-default + vendored-lock; (b) M17 control-flow gap is unclosed (`NextAction` has 6 variants; `template-runtime` is linear; StepMask does NOT exist yet) — mint StepMask/ControlSignal BEFORE the ElixirTemplate→GraphBuilder adapter or End/WaitForInput/GoTo silently drop; (c) the task-#18 retrieve→think→act→witness→commit loop is still unbuilt — neither A nor B ships cognition; the topology decision is not the loop being done. +## 2026-07-02 — E-V3-SEMANTIC-KERNEL-REVOLUTIONIZES-RAG-1: the capstone — RAG copies meaning into a prompt; we DECOMPOSE it (AST + COCA + ARM) and LAND it as understanding in the SoA reasoning+knowledge graph. 
The "semantic kernel" IS that decomposition→understanding transducer +**Status:** FINDING (organs built + the framing pre-exists) for the pieces; CONJECTURE (the end-to-end loop = task #18) for the assembled claim — the RAG-specific capstone of E-SEMANTIC-OS-CONVERGENCE-1; unifies RIG-CHASSIS / RETRIEVAL-IS-NOT-COGNITION / TYPED-REASONING-FANOUT / ARM-IS-INDUCTION + +Operator's perfect-world thesis: "revolutionize RAG — the semantic kernel would mean the AST + COCA decomposition lands as UNDERSTANDING in a reasoning and knowledge graph using our SoA substrate." Not a coined term: `cognitive-shader-architecture.md:444` already frames the stack as "a semantic kernel for RAG"; this crystallizes its precise meaning and distinguishes it from the two false friends — Microsoft's *Semantic Kernel* (LLM-orchestration glue) and the DELETED crewai `semantic_kernel` (an HTTP-wrapper-around-BindSpace anti-pattern, a Law-1 violation). The reclaimed definition: **the semantic kernel is the transducer where an artifact is DECOMPOSED and its meaning LANDS as understanding** — not where prompts are routed. + +**The revolution, stated as a contrast.** Classic RAG: chunk text → embed (foreign float vectors) → at query time embed the query → top-k similar chunks → stuff into an LLM prompt → generate. Understanding (if any) is transient, happening inside the black box at generation time; nothing is decomposed, nothing is persisted as understanding, every query re-derives meaning from raw text. **RAG COPIES meaning into a prompt** (the exact act E-SEMANTIC-OS-CONVERGENCE-1's one law forbids). 
Ours: decompose the artifact ONCE along three axes, each a shipped proposer writing the ONE SoA (E-SOA-IS-THE-ONLY) — **AST/structure** = ruff SPO harvest (`has_function`/`inherits_from`/`virtually_overrides` + the polyglot frontends); **COCA/semantics** = deepnsm (4096-word COCA vocab, PoS-FSM → `SPO(dog,bite,man)` triples, 4096² distributional matrix, Cam4096 12-bit locality address); **ARM/induction** = arm-discovery (data→NARS-truth rules, E-V3-ARM-IS-THE-INDUCTION-ORGAN-1). The three land as typed, truth-graded UNDERSTANDING — SPO triples + NARS ⟨f,c⟩ — in the reasoning+knowledge graph on the SoA substrate. **Understanding is MATERIALIZED once, persisted, and composable** (semiring compose + NARS revision), never re-derived per query. + +**Why it revolutionizes RAG (four inversions):** (1) retrieval is over a KNOWLEDGE GRAPH of decomposed understanding, not text-chunk similarity — the rig chassis retrieves SPO/fingerprints (E-V3-RIG-CHASSIS-1), Hamming-native, our mint; (2) reasoning is the TYPED FAN-OUT up the rung ladder over the OGAR AST (E-V3-COGNITION-IS-A-TYPED-REASONING-FANOUT-1), not black-box generation; (3) the LLM is the ORACLE INTERRUPT on the tail (measured 1-2 ms framework vs 8.4 s oracle, E-V3-ORACLE-LIVE-1), not the generation engine — and its answers compile back into templates (the ratchet), so it pages less over time; (4) meaning is DECOMPOSED + MATERIALIZED + TRACED, never COPIED — RAG is precisely the "copy meaning" anti-pattern the semantic-OS law names, and this stack is its structural negation. Consequence: understanding does not evaporate after the prompt window — it accretes in the graph, is revised by NARS, learned from streams by ARM, and reasoned over deterministically; the corpus gets SMARTER per ingest, not just larger. 
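A minimal sketch of what "materialized once, then revised" could look like for a single triple's ⟨f,c⟩. It assumes the textbook NAL revision rule (convert confidence to an evidence weight, pool the weights, convert back), which this entry does not spell out; `Truth`, `weight`, `revise` and the horizon `K = 1.0` are hypothetical stand-ins, not the shipped `NarsTruth` API.

```rust
// Hypothetical stand-in for a NARS truth value <f, c>; not the shipped NarsTruth.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Truth {
    f: f64, // frequency: share of positive evidence
    c: f64, // confidence, assumed to lie in [0, 1)
}

// Evidential horizon (assumed value; the document does not name one).
const K: f64 = 1.0;

// Convert a confidence into an evidence weight: w = k * c / (1 - c).
fn weight(t: Truth) -> f64 {
    K * t.c / (1.0 - t.c)
}

// Pool two independent bodies of evidence about the same triple.
fn revise(a: Truth, b: Truth) -> Truth {
    let (wa, wb) = (weight(a), weight(b));
    let w = wa + wb;
    if w == 0.0 {
        return a; // neither side carries evidence; nothing to revise
    }
    Truth {
        f: (wa * a.f + wb * b.f) / w,
        c: w / (w + K),
    }
}

fn main() {
    // Stored understanding for SPO(dog, bite, man), then one contradicting observation.
    let stored = Truth { f: 1.0, c: 0.5 };
    let fresh = Truth { f: 0.0, c: 0.5 };
    let revised = revise(stored, fresh);
    println!("SPO(dog, bite, man) revised to {:?}", revised);
}
```

The point of the sketch is only that a second observation updates the persisted triple in place; nothing is re-derived from raw text at query time.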
+ +**Honest status (the recurring note of the whole arc):** every ORGAN is built — ruff SPO (shipped, polyglot), deepnsm (shipped, 4096 COCA + FSM→SPO + nsm_bridge NARS truth), arm-discovery proposer (shipped), the SoA substrate (canonical_node + SPO store + NARS), the rig chassis (rig-surrealdb/kv-lance), the reasoning fan-out surfaces (InferenceType→QueryStrategy, EpistemicMode rung ladder, elixir-template/template-equivalence). What is NOT assembled is the END-TO-END loop where an artifact flows decompose → land → reason → answer as ONE pipeline — that is task #18. So the "perfect world" is not distant research; it is an ASSEMBLY of shipped organs. The revolution is one integration probe away from being demonstrable, and the demonstration (decompose a corpus via AST+COCA+ARM, land it, reason the fan-out over it, page rig only on the tail, benchmark vs microsoft/graphrag on the same queries) is the externally-legible proof — the 1BRC pattern applied to RAG itself. +## 2026-07-02 — E-V3-ARM-IS-THE-INDUCTION-ORGAN-1: lance-graph-arm-discovery is the built INDUCTION proposer — the third SoA leg (business logic lives in DATA not schema), the GraphRAG community-summary leg, and the operator's "stream proprietary data through NARS" vision, in one crate +**Status:** FINDING (crate built + tested; plan verbatim-quotes the operator) — connects E-V3-COGNITION-IS-A-TYPED-REASONING-FANOUT-1 + E-V3-GRAPHRAG-VEHICLE-2; the ONLY `lance-graph-arm-*` crate + +Checked `crates/lance-graph-arm-*` — one crate, `lance-graph-arm-discovery`, and it is a keystone earlier entries under-credited. It is the **materialized Induction organ**: ARM (association-rule mining, `(X → Y)` from data) IS inductive inference — generalize a truth-graded rule from many observations — where E-V3-COGNITION-...-FANOUT-1 had typed Induction only as "CamWide scan." 
Richer than that: a streaming DISCOVERY engine (`Proposer` trait, rule.rs:126, pluggable) that mines runtime tabular windows and emits NARS-truth `{s,p,o,f,c}` candidates. Float-free by construction — Aerial+ (Karabulut/Groth/Degeler, arXiv 2504.19354; AdaWorldAPI fork `aerial-rule-mining`) with the autoencoder **replaced by the palette256 `CodebookDistance` oracle** (`[a,b]→u32`, ρ=0.9973 vs cosine; no SGD, no seed). Three threads converge here: (1) **the third SoA proposer leg** — the plan `streaming-arm-nars-discovery-v1.md` names the insight "MOST OF THE BUSINESS LOGIC LIVES IN THE DATA, NOT THE SCHEMA": curated (L-doc) + extracted (ruff/odoo AST) proposers are bounded by the literal artifact; ARM is the leg that surfaces co-correlations living only in runtime data → this is how the substrate LEARNS new rules from streams, not just from harvested artifacts; (2) **the GraphRAG community-summary leg** (VEHICLE-2) made concrete — per HHTL tier / community, run the proposer → a truth-graded, semiring-COMPOSABLE rule set instead of LLM prose (categorically stronger: text can't compose, rules can); (3) **the operator's own vision, quoted verbatim in the plan**: "stream proprietary data through NARS reasoning ... 20.000-200.000 [records/window] ... determine co-correlation into deterministic rule candidates and do hypothesis testing against facts and edges." Internal-vs-external tie-in: ARM discovery is FULLY INTERNAL (deterministic, no LLM) — induction WITHOUT an oracle; rig is needed only on the tail NARS can't ratify — so "learning" is mostly deterministic ARM + NARS revision, strengthening the oracle ratchet. **Built vs plan (honest boundary):** BUILT + tested (~42 tests) = the proposer + translator legs — `Proposer` trait, `CandidateRule` (integer evidence over a `window`, `ppm()` cross-multiply), `CodebookDistance`/palette256, encode/bitset/simd/ndjson, `arm_to_truth_u8 → TruthU8`(=CausalEdge64), `arm_to_nars → NarsTruth`, `FeedProjector → {s,p,o,f,c}`. 
PLAN/integration surface (downstream) = the streaming 20K-200K-row/window DRIVER (window is a FIELD today, not a loop) + NARS revision → ratification council → ratified SPO → deterministic codegen (op_emitter). Anchored FINDING-grade to I-NOISE-FLOOR-JIRAK (Jirak-bound significance, not classical Berry-Esseen), I-VSA-IDENTITIES (ARM extracts identity-rules from content STATS, never bundles content), I-SUBSTRATE-MARKOV (the NARS revision arc IS the Markov trajectory), E-SOA-IS-THE-ONLY (writes the one SoA via SpoBuilder). Paper cross-anchor: Abreu/Cruz/Guerreiro arXiv 2511.13661 (ontology-driven M2M) independently confirms externalize-interpretation-not-code. **Consequence for the loop (task #18):** the fan-out's Induction node is NOT a stub to write — it is a shipped streaming proposer; the assembly work is the window driver + wiring its NARS-truth candidates into the ratification gate, then exposing it as a graph-flow Task at the Induction rung. +## 2026-07-02 — E-V3-COGNITION-IS-A-TYPED-REASONING-FANOUT-1: the cognition dimension = graph-flow fanning out typed inference modes up the Pearl rung ladder over the OGAR AST, authored as low-code templates, graded internal-vs-external against rig +**Status:** FINDING (every surface named/verified) for the PIECES; CONJECTURE (the assembly is task #18's probe) for the wired loop — the internal structure of E-V3-RETRIEVAL-IS-NOT-COGNITION-1's cognition organ + +Operator's full articulation of the thinking axis: "langgraph-like orchestration with rig → low-code templates, test internal vs external thinking; using the SoA and DO, fan-out counterfactual/synthesis/inference/deduction/extrapolation/syllogism; using our rung ladder on top of OGAR AST." 
Every noun is an existing typed surface — the vision is an ASSEMBLY, not a green-field: + +**(1) The fan-out modes are already typed to substrate ops.** `contract::nars::InferenceType {Deduction, Induction, Abduction, Synthesis}` each maps to a `QueryStrategy`: Deduction→CamExact (exact CAM lookup), Induction→CamWide (wide CAM scan), Abduction→DnTreeFull (DN-tree traversal), Synthesis→Bundle-across-paths. So a "reasoning mode" IS a substrate query shape — deduction is a lookup, induction is a scan, abduction is a traversal, synthesis is a bundle. Extrapolation/counterfactual = scenario fork (`contract::scenario`, `planner::prediction::scenario`, `pearl_junction`); syllogism = semiring composition of deductions. + +**(2) The rung ladder dispatches WHICH mode + WHAT may be known.** `planner::temporal::EpistemicMode::for_rung(u8)` (0..=4 Strict, mid Aware, top Retro) is the epistemic ladder; it maps onto Pearl's causal ladder over the **OGAR AST's three arms**: Rung 1 association = read the THINK arm (Class → SoA state); Rung 2 intervention = fire the DO arm (`ActionDef` guarded by `KausalSpec::StateGuard`, executed through `graph-flow-action-ogar::GatedOgarHandler`); Rung 3 counterfactual = `World`/scenario fork. The ladder CLIMBS the AST: read → intervene → counterfact. + +**(3) Low-code templates are the authored unit.** A fan-out shape at a rung is authored in `elixir-template` (DSL), compiled by `template-runtime`, wrapped as an rs-graph-llm `template-task` (a graph-flow `Task`), masked by StepMask. graph-flow provides the LangGraph orchestration (fan-out = parallel Tasks, composed via `contract::a2a_blackboard` multi-expert rounds — each expert one inference mode); rig provides the external oracle Task at the FailureTicket boundary. + +**(4) "Test internal vs external" IS `template-equivalence`.** The internal path = compiled deterministic templates at substrate speed; the external path = rig oracle (LLM, measured 8.4 s vs 1-2 ms framework, E-V3-ORACLE-LIVE-1). 
`crates/template-equivalence` grades replay: does the compiled template REPRODUCE the oracle's reasoning? Where yes, the internal template replaces the external oracle for that pattern — this IS the oracle ratchet (hit-rate trends down as the catalogue grows) made falsifiable. The compile-down iron rule holds: the LLM compiles INTO templates; templates never degrade into prompts. + +**The whole cognition organ, in one line:** graph-flow fans out typed `InferenceType` Tasks up the `EpistemicMode` rung ladder over the OGAR AST (Class-read / ActionDef-DO / scenario-counterfact), each a low-code `elixir-template`, each graded internal-vs-external by `template-equivalence` against the rig oracle, composed on the a2a blackboard, witnessed by kanban. Retrieval (the rig chassis, E-V3-RIG-CHASSIS-1) feeds the Context these Tasks reason over. **The honest gap = the assembly** (task #18, sharpened): the demonstrable first experiment is — take ONE OGAR AST, fan out N inference modes at ascending rungs, grade each internal template against the rig oracle, show the internal reproducing the external on the tractable modes and paging the oracle only on the tail. That experiment, not more synthesis, is the next deliverable. Owners: `v3-template-smith` (elixir-template/StepMask/equivalence), `scenario-world` (Rung 3 fork), `truth-architect` (the internal-vs-external measurement gate). +## 2026-07-02 — E-V3-RETRIEVAL-IS-NOT-COGNITION-1: the stack is TWO-dimensional — the rig chassis is the MEMORY organ, graph-flow + the OGAR DO surface is the COGNITION organ; do not mistake the retrieval leg for the mind +**Status:** FINDING (crate family verified wired) — guards against a real drift the rig-chassis simplification invites; anchors The Click P-1 (memory is thinking TISSUE, not a service) + +Operator's warning: "the stack now is only one-dimensional … now it's just embed 64 without any thinking." Named correctly as a RISK, corrected as a conclusion. 
Dimension 1 = MEMORY/retrieval (the rig chassis, E-V3-RIG-CHASSIS-1): embed → `VectorStoreIndex::top_n` via `vector::distance::hamming` over kv-lance. That leg ALONE is a lookup table, not a mind. Dimension 2 = COGNITION/action, a REAL wired crate family in rs-graph-llm (verified, not stubbed): (a) **`graph-flow`** — the LangGraph port the "make it like LangChain" prompt produced; verdict on that prompt: **it added MINIMAL-and-CORRECT orchestration** — a `Task` trait + `NextAction{Continue, ContinueAndExecute, WaitForInput, GoTo, End}` (exactly 5: sequence + conditional edge + human/oracle-in-the-loop + jump + halt) + `GraphBuilder`/`FlowRunner`/`SessionStorage`; an execution ENGINE, not framework bloat. (b) **`graph-flow-action-ogar`** — the OGAR↔rs-graph-llm binding the operator asked about, VERIFIED LIVE: `GatedOgarHandler` imports `lance_graph_contract::action::{ActionDef, ActionInvocation}`; `handle()` runs `self.executor.execute(...)` for real (returns `Done`/`Escalated`, not a stub); `run_gated()` drives routing → cold-floor RBAC → hot-path with gated refusals (Denied/Postponed/Block-Escalated/NotApplicable). OGAR `actions.rs` (`OgarActionProvider::{actions_for, effective_actions}`) supplies the AUTHORIZED DO surface; graph-flow-action executes it; `graph-flow-kanban` envelopes it as WAL; `episodic-arc-task` + `template-task` are the episodic-memory + compiled-template Task types. **The two dimensions COMPOSE, they do not compete:** a GraphRAG pipeline stage IS a graph-flow `Task`; the full loop is retrieve(rig chassis Task) → think(thinking-style-modulated Task; the 36 styles → ScanParams are the local/global/drift + reasoning dial at the seam) → act(OGAR `ActionDef` via `GatedOgarHandler`) → witness(kanban WAL) → commit-back → **reshapes the next retrieval's landscape** (The Click's loop, verbatim). 
**The iron rule this pins (fuse against the drift):** per The Click P-1, memory is thinking TISSUE wired INTO `Think`, never a service `Think` calls — so the rig chassis must be wired INTO the graph-flow cognitive loop as an ORGAN (a Task's reasoning surface: `episodic.retrieve_similar` = a version-window read; `graph.nodes_matching` = a store read), NEVER be mistaken for the loop. "embed 64 without thinking" = a lookup table; "embed 64 → graph-flow think → OGAR act → kanban witness → commit reshapes next embed" = cognition. **Honest gap (= task #18 "see the loop work"):** every ORGAN is real and contracted; the assembled retrieve→think→act→commit graph OVER the rig chassis is the unbuilt connective tissue — the pieces exist, the loop is the probe. That probe is the difference between a stack that is one-dimensional in FACT and one that only LOOKED it because the cognition organs weren't yet strung onto the chassis. +## 2026-07-02 — E-V3-RIG-CHASSIS-1: rig is the AdaWorldAPI-aligned CHASSIS — rig-surrealdb runs on kv-lance (= the V3 symbiont storage), Hamming-native; graphrag-rs collapses to BLUEPRINT-only +**Status:** FINDING (deps/traits verified) for the chassis facts; CONJECTURE (the representation probe below gates it) for "graphrag → blueprint, no fork" — sharpens E-V3-GRAPHRAG-VEHICLE-1/2 + E-TOKENIZER-MINT-MEMBRANE-1 + +Operator: "rig also works with SurrealQL and LanceDB — I wonder if that's even closer." Verified, and it is a bullseye, not merely closer. The chain (all confirmed): `rig-surrealdb/Cargo.toml` depends on **`kv-lance` — "the AdaWorldAPI SurrealDB-on-Lance backend"**; `kv-lance` IS the V3 **symbiont** storage (symbiont crate = "the full Ada stack … surrealdb kv-lance … compiled into ONE binary", golden-image probe). So rig-surrealdb runs on the EXACT storage V3 already builds — zero fork-policy friction, zero arrow/lancedb version-family mismatch (contrast graphrag-rs's foreign lancedb 0.26.2/arrow-57-with-56-drift). 
It emits real SurrealQL vector search — `SELECT … {distance_function}($vec, embedding) as distance` — and its distance menu includes **`vector::distance::hamming`** (+ knn/euclidean/cosine/jaccard): Hamming is THE fingerprint distance, i.e. rig-surrealdb was built anticipating binary-fingerprint vectors, not just float embeddings. It implements rig-core's `VectorStoreIndex` + `InsertDocuments` + `VectorStoreIndexDyn` (top_n / top_n_ids), and is generic `SurrealVectorStore` — the mint-membrane coupling point is exactly that `Model` param: we drop in OUR lens (tokenizer_registry, Qwen3/jina5 anchor) and the retrieval plumbing is otherwise ours already. **The reframe:** rig is the CHASSIS (LLM oracle client — measured E-V3-ORACLE-LIVE-1 — + VectorStoreIndex retrieval + storage adapters already on our fork), and graphrag-rs contributes ONLY the graph-pipeline SHAPE (community-detection geometry, local/global/drift query modes, multi-level hierarchy). Because the VectorStore bay — VEHICLE-1's one-seam probe target — is ALREADY solved natively by rig-surrealdb-over-kv-lance, the fork-or-blueprint gate tips decisively to **BLUEPRINT**: take graphrag's query geometry, build it on the rig chassis; do NOT fork graphrag (its only structurally-unique asset was the storage seam we no longer need). Full drivetrain, updated: extraction = ruff + deepnsm; store+retrieve = rig-surrealdb/kv-lance (symbiont) with our Model; local search = VectorStoreIndex top_n via vector::distance::hamming; community summaries = Aerial+ NARS rules; hierarchy = HHTL; global/drift dispatch = thinking styles; orchestration = graph-flow; oracle = rig. **The precise probe now (supersedes the generic one-seam probe):** the REPRESENTATION seam — rig-surrealdb stores `embedding: Vec<f64>`; our fingerprints are binary/i8/BF16.
Does the kv-lance/SurrealQL `vector::distance::hamming` path carry our fingerprints NATIVELY (a binary/packed column) or does rig's `Embedding = Vec<f64>` type force a lossy/64×-wasteful widening? That representation-mint question is the real gate — it decides whether rig's `EmbeddingModel`/`Embedding` types are used as-is or need a fingerprint-native extension upstreamed to the rig fork. Everything else on the chassis is confirmed home. +## 2026-07-02 — E-TOKENIZER-MINT-MEMBRANE-1: a token id is a MINT — one family per pipeline, stamp every baked table; the rig/graph-flow vs graphrag-rs overlap resolves along this line +**Status:** FINDING (the hard-won lesson, operator: "if you use a tokenizer it needs to be from the same family to be meaningful") + the vehicle overlap matrix it dictates + +The scar: the reranker lens was baked from Qwen2-BPE tokens and read with Qwen3 ids (Model Registry CRITICAL note) — plausible numbers, semantic garbage. Generalization: **a token id is a codebook index, i.e. a mint; a baked lens/table keyed under one mint and read under another is the same defect class as I-LEGACY bit-aliasing and classid bit-math.** Family clusters (Model Registry): Qwen3 BPE {Jina v5 GROUND TRUTH, Reranker v3, Qwen3.5} vs Qwen2 BPE {Qwopus, Reader-LM} vs XLM-R {jina3, BGE-M3, LEGACY} vs OLMo {ModernBERT} — clusters never mix. **The fuse (membrane law: name it or it's prose): stamp the tokenizer-family fingerprint into every baked lens/table; the loader REFUSES a mismatched family.** Enforcement home: thinking-engine `tokenizer_registry.rs`. "Use our own tokens" = two levels: the TEXT membrane speaks one anchor family (the jina5/Qwen3.5 cluster); the INTERIOR speaks mints we control (Base17 / palette256 / CAM-PQ / COCA-4096) — family-stable by construction because we are the mint.
**The vehicle overlap matrix this dictates (rig + rs-graph-llm vs graphrag-rs):** (1) LLM client — no collision, graphrag has NO rig (hand-rolled Ollama only); their LanguageModel trait gets a ~30-line rig adapter (oracle-node shape, measured). (2) Embeddings — BOTH sides bring foreign mints (graphrag: hash-embeddings default + Candle-BERT WordPiece behind neural-embeddings; rig: rig-core embeddings + rig-fastembed); NEITHER may run — their Embedder trait is implemented by OUR lens stack via tokenizer_registry (Qwen3 anchor) or bypassed where fingerprints suffice (deepnsm VSA, CAM-PQ). (3) Vector store — TWO candidates for one bay (rig-lancedb EXISTS vs filling graphrag's LanceDBStore stub with our fork) PLUS a version-family mismatch (their lancedb 0.26.2/arrow 57 with internal arrow-56 drift vs our lance =7.0.0/lancedb 0.30/arrow 58) — the Cargo-level instance of the same mint lesson; the one-seam probe decides the bay. (4) Orchestration — graph-flow wins (kanban WAL, M25 replay); their pipeline_executor bypassed, stages become Tasks. (5) Chunking — their HierarchicalChunker is separator-based, zero-dep, token-COUNT-free ⇒ tokenizer-neutral, the one internal safe as-is (semantic_chunker.rs is ours). Pattern: graphrag-rs contributes SHAPE + SEAMS; rig/graph-flow/substrate fill every seam; only tokenizer-neutral internals survive. Probe-checklist additions for VEHICLE-1's one-seam probe: (a) tokenizer-family audit on any component that embeds; (b) the VectorStore-bay decision rig-lancedb vs own-fill; (c) the arrow/lancedb version-family check. 
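The fuse named in this entry (stamp the family, refuse a mismatch at load) can be sketched in a few lines. Everything here is illustrative: `TokenizerFamily`, `BakedTable`, `FamilyMismatch` and `load` are hypothetical names, and the real enforcement home, `tokenizer_registry.rs`, may shape this quite differently.

```rust
// The four family clusters named in the entry; the variant names are made up.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenizerFamily {
    Qwen3Bpe,
    Qwen2Bpe,
    XlmR,
    Olmo,
}

// A baked lens/table carries the stamp of the mint it was keyed under.
#[derive(Debug)]
struct BakedTable {
    family: TokenizerFamily,
    rows: Vec<u32>,
}

// The refusal is a typed error, so a mismatch cannot pass as plausible numbers.
#[derive(Debug, PartialEq, Eq)]
struct FamilyMismatch {
    baked: TokenizerFamily,
    reader: TokenizerFamily,
}

// Hypothetical loader: hands out the rows only when stamp and reader agree.
fn load(table: BakedTable, reader: TokenizerFamily) -> Result<Vec<u32>, FamilyMismatch> {
    if table.family != reader {
        return Err(FamilyMismatch { baked: table.family, reader });
    }
    Ok(table.rows)
}

fn main() {
    // The scar, replayed: a lens baked under Qwen2 BPE, read with Qwen3 ids.
    let lens = BakedTable { family: TokenizerFamily::Qwen2Bpe, rows: vec![7, 11, 13] };
    match load(lens, TokenizerFamily::Qwen3Bpe) {
        Ok(rows) => println!("loaded {} rows", rows.len()),
        Err(e) => println!("refused: {:?}", e),
    }
}
```

Making the check part of the only load path is what turns the rule from prose into a fuse: there is no way to obtain `rows` without naming the reader's family.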
+## 2026-07-02 — E-V3-GRAPHRAG-VEHICLE-2: operator widens the vehicle — GraphRAG-shaped pipeline REPLACES AriGraph; the drivetrain is ruff + DeepNSM + lance-graph + Aerial+ + thinking styles + graph-flow + rig +**Status:** CONJECTURE (operator direction; the same one-seam probe of VEHICLE-1 gates it, plus the episodic-view probe below) — touches P-1 canon ("The Click" organs list): canon edit is an operator checkpoint, gated on the probes + +The proposal: AriGraph the MODULE retires; AriGraph the FUNCTIONS redistribute onto substrate primitives — the third instance of the signature N→1 move (crewai/n8n eviction; nsm→deepnsm). The gap analysis that makes it viable: GraphRAG lacks exactly three things AriGraph had, and the substrate supplies all three natively — episodic memory = Lance versioning + temporal deinterlace (the ±N episodic window is a version-window read at QueryReference::at); online incremental updates (GraphRAG's known weakness) = the WAL batch writer (every cast a version tick); belief revision = NARS truth on the SPO store. Named mapping: **AriGraph's episodic vertex IS the WAL cast** — a batch of co-observed triples committed together, id'd, version-stamped, replayable; the witness infrastructure already records episodes. 
Drivetrain mounting: extraction = ruff (code) + DeepNSM (text, nsm_bridge NARS truth) — deterministic, kills GraphRAG's index-time LLM bill; store = lance-graph SPO/blasgraph into the LanceDBStore bay; **community summaries = Aerial+ (lance-graph-arm-discovery)** — float-free association-rule mining per community/HHTL tier emitting NARS-truth {s,p,o,f,c} candidates: the summary becomes a truth-graded, semiring-COMPOSABLE rule set instead of LLM text (categorically stronger, not just cheaper — text summaries cannot compose; rules can); hierarchy = HHTL tiers; retrieval-mode dispatch (local/global/drift) = the 36 thinking styles → ScanParams pipeline; orchestration = graph-flow (stages as Tasks, kanban WAL, M25 replay); synthesis/fault = rig oracle interrupt-only with ratchet compile-back. Guards: (1) "The Click" P-1 lists AriGraph/episodic as ORGANS of Think — tissue, not storage; the replacement honors the doctrine IFF the organs become VIEWS over the pipeline-shaped store (graph.nodes_matching = store read; episodic.retrieve_similar = version-window read), never service calls — the episodic-view probe must show the ±N retrieval working as a version-window read before any P-1 edit; (2) the AriGraph layering rule (never a planner dep; p64 convergence) applies unchanged to whatever hosts the pipeline; (3) dilution watch: the episode-grouping semantics (co-observation) must map onto cast batches explicitly, not be dropped. Sequencing unchanged: VEHICLE-1's one-seam probe first; this entry widens the destination, not the first step. 
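The episodic-view probe asks for ±N retrieval expressed as a version-window read. A toy sketch of that shape, assuming casts sit in a log sorted by ascending version; `Cast` and `episodic_window` are hypothetical names, and an in-memory slice stands in for Lance versioning and `QueryReference::at`.

```rust
// Hypothetical WAL cast: a batch of co-observed triples committed under one version tick.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Cast {
    version: u64,
    triples: Vec<(&'static str, &'static str, &'static str)>,
}

// Return the casts whose version lies in [at - n, at + n].
// Assumes `log` is sorted by ascending version, as an append-only log would be.
fn episodic_window(log: &[Cast], at: u64, n: u64) -> &[Cast] {
    let lo = at.saturating_sub(n);
    let hi = at.saturating_add(n);
    // Binary-search both edges of the window; no scan, no similarity search.
    let start = log.partition_point(|c| c.version < lo);
    let end = log.partition_point(|c| c.version <= hi);
    &log[start..end]
}

fn main() {
    let log: Vec<Cast> = (1..=6)
        .map(|v| Cast { version: v, triples: vec![("dog", "bite", "man")] })
        .collect();
    for cast in episodic_window(&log, 4, 1) {
        println!("v{} holds {} triple(s)", cast.version, cast.triples.len());
    }
}
```

Guard (3) is visible in the types: the co-observation grouping survives because the unit returned is the whole cast, never an individual triple.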
+## 2026-07-02 — E-V3-GRAPHRAG-VEHICLE-1: operator reframe — graphrag-rs is not a component source but a FRAME; "what if it gives us the vehicle for our motor" +**Status:** CONJECTURE (direction ruled by operator; promoted or killed by the one-seam probe below) — reframes, does not retract, E-V3-GRAPHRAG-INV-1 + +E-V3-GRAPHRAG-INV-1 graded graphrag-rs as a component source and correctly found the components hollow (LanceDBStore 100% stub, "hierarchical" Leiden single-level, cAST example-only) — verdict REUSE-AS-REFERENCE. The operator's reframe flips the reading without contradicting the facts: as a **vehicle** (a well-known end-to-end pipeline shape with trait-seam motor mounts), the hollowness is the FEATURE — the stub store is an empty engine bay and we own the Lance fork that fills it. GraphRAG provenance confirmed: Microsoft Research 2024 (Edge et al., "From Local to Global"), the now-standard name for ingest → entity graph → community hierarchy → multi-level summaries → local/global query. Mounting map: extraction = ruff SPO + deepnsm (deterministic — inverts GraphRAG's defining cost, the index-time LLM bill); store = Lance fork into the LanceDBStore seam (surrealdb kv-lance precedent); community hierarchy = HHTL tiers (genuinely hierarchical where theirs is single-level); local search = CAM-PQ/Hamming cascade; orchestration = graph-flow (stages as Tasks, kanban WAL, M25 kill-mid-graph replay); synthesis = rig oracle interrupt-only (E-V3-ORACLE-LIVE-1 economics) with ratchet compile-back. Why it beats a from-scratch loop for "see it work": the vehicle removes one of two unknowns (loop shape) and makes the first end-to-end run externally legible — benchmark vs microsoft/graphrag on a public corpus = the 1BRC pattern one level up. **Gate (the probe):** wire ONE trait seam (LanceDBStore or retrieval) with our parts and test whether it holds our semantics (fingerprints/palette summaries vs their text-vector assumptions; frankenstein-checklist at every mount). 
Seams hold → fork graphrag-rs into AdaWorldAPI and fill the bays (P0 fork policy); seams fight → keep the pipeline BLUEPRINT, build the body natively. Iron rules outrank trait conformance at every mount (no CAM-PQ superposition, no content-register bundling). Sequencing: this IS the "see the loop actually work" prerequisite; V3-teaches-V2 stays gated behind it. +## 2026-07-02 — E-SHAPE-ETYMOLOGY-1: data-shape etymology + the hat-trick test — the savant mind-opener doc minted +**Status:** FINDING (synthesis doc, operator-requested; every section cites dated shipped artifacts; one CONJECTURE edge labeled inline) + +New knowledge doc `.claude/knowledge/data-shape-etymology.md` — the shape-and-trick companion to E-SEMANTIC-OS-CONVERGENCE-1. Eight epiphanies, each grounded: (1) **the name is the fossil record** — OGAR = Open Graph of Active Record is provenance, not analogy (ruff_ruby_spo 2026-05-29 predates ruff_python_spo 06-28; Redmine fixtures; the GUID grew FROM AR's (type,id), folding type into the address); (2) **old shapes, new clothes** — SoA/Morton/64×64-tile/WAL all predate us; the gridlake sweet spot was the SIZE (cache-tier fit), not an algorithm; (3) **a mask is a face over the data** — cmpeq_mask→Kernighan-walk turns compare into a parser; FieldMask=RBAC=UI; StepMask flagged as vocabulary-ahead-of-code; (4) **phase is convention, not data** — the five costumes of "derivable from an address in hand ⟹ never store, never send" (deterministic phase / clear-by-undo / mint-once / row_owner[i]==i / zero-copy-to-tombstone), with the write-path-derivable-from-ValueSchema CONJECTURE as its edge; (5) **the witness is free, the boundary is not** (~66 µs/card vs Arc-copy/oversubscription/messaging taxes; double-cast = two WALs one allocation) + the harvest law "the class graph transfers; the pain doesn't" (op-journals has zero aggregation/compaction hits — WAL retention doctrine still the open gap); (6) **resolve, don't carry** — DTO etymology (Fowler, remote 
boundaries) meets a substrate that deleted its internal remotes; ValueSchema-not-ClassRoutingDTO generalized to a litmus; (7) **homonyms are leaky membranes** — the "app" phantom conflict + the hardcoded-32 stride were one mechanism; the compiler as etymologist (`{ SimdByte::LANES }`); (8) **the hat-trick test** — "name the mechanism, or name the fuse; a trick that can't name its mechanism is a bug; a boundary that can't name its fuse is a wish" (unifies I-LEGACY's five catches with the membrane-tripwire sharpening). Closes with an 8-line litmus battery for fresh sessions. +## 2026-07-02 — E-V3-SUBSTRATE-IS-VALUESCHEMA-1: the V2/V3 dual-substrate question resolves as a ValueSchema preset, NOT a ClassRoutingDTO / new trait / 0x1000 gate +**Status:** FINDING (operator ruling on the shape — "yes valueschema") + embedded CONJECTURE (the preset-vs-dispatch probe) + +Operator floated keeping the fast/cheap V2 substrate for huge data alongside V3, "switched by classid," so V3 can eventually teach V2 how to be better. Resolved: the switch is NOT a new carrier. `ClassView::value_schema(classid) -> ValueSchema` (`canonical_node.rs:894`, `class_view.rs:395`) is ALREADY classid→substrate-shape resolution by trait dispatch — resolved, never stored on-wire (adding a variant costs NO `ENVELOPE_LAYOUT_VERSION` bump), and the four existing variants ALREADY form a substrate ladder: `Bootstrap`(empty, key+edges only) / `Compressed`(cold codec, **no hot lifecycle columns**) / `Cognitive`(hot thinking: Meta+Qualia+Fingerprint+Energy+Plasticity+EntityType) / `Full`(every tenant). So "V2 fast/cheap bulk" = classids that resolve to the LEAN end (Bootstrap/Compressed — no ownership/lifecycle tenants); "V3 witnessed/owned" = Cognitive/Full. 
**A `ClassRoutingDTO` is rejected:** a DTO is a serialized carried payload, but substrate choice is a RESOLUTION (firewall ADR-022, "contracts compile types, the event never leaves"); and per the three-tier canon nothing crosses mailbox boundaries — every reader re-resolves the substrate from the classid already in the 16-byte key, so there is no boundary for a carrier to travel. `dto-soa-savant` + AGI-as-glove name the new-struct-instead-of-resolution shape exactly. **0x1000 is NOT the switch:** canon fixes it as a temporary adoption MONITOR ("monitor, never a semantic"; retires at P4/100%; MODULE-TABLE flags that a future canon==0x1000 aliases the marker) — substrate routes on the classid's concept-half → ValueSchema, never on the monitor bit. **The deep form (CONJECTURE — PROBE preset-vs-dispatch):** the WRITE PATH may be a pure FUNCTION of the schema — a class whose ValueSchema carries no ownership/lifecycle tenants has nothing for the kanban/WAL to witness, so it naturally collapses to the fast private-merge write; Cognitive/Full carry the tenants that REQUIRE the owned/witnessed path. If that holds, substrate = ValueSchema full stop (no separate `Substrate` enum, no flag). The gate: confirm the write path is derivable from which tenants are live vs needing an independent resolution — evidence base is the onebrc arc itself (lane F private-merge/no-tenants vs lanes G–J owned/witnessed = the two write paths already measured). Open sub-question: whether bulk needs a variant leaner than `Compressed`, or Bootstrap/Compressed already suffice. **"V3 teaches V2" (deferred, needs mechanism):** V3's kanban WAL + ownership journal is the profiling signal (where contention lands, which fields are touched) to optimize the lean V2 layout — the instrumented-teacher / stripped-student loop; no code reads the WAL back into a layout optimizer yet. Net: at most a new `ValueSchema` variant through the existing `value_schema(classid)` door; possibly not even that. 
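A sketch of the preset-vs-dispatch CONJECTURE, to show what "the write path is a pure function of the schema" would mean if the probe confirms it. The four `ValueSchema` variants are from this entry; the `WritePath` enum, the free function `value_schema` (the entry describes a `ClassView` trait method), the `u32` classid type and the classid table are all hypothetical.

```rust
// The four-variant substrate ladder named in the entry.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ValueSchema {
    Bootstrap,  // key + edges only
    Compressed, // cold codec, no hot lifecycle columns
    Cognitive,  // hot thinking tenants
    Full,       // every tenant
}

// Hypothetical names for the two write paths measured in the onebrc lanes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WritePath {
    PrivateMerge,   // lean classes: nothing for the kanban/WAL to witness
    OwnedWitnessed, // classes carrying ownership/lifecycle tenants
}

// Stand-in for classid -> ValueSchema resolution. The table is invented;
// the point is that the schema is resolved from the classid, never carried.
fn value_schema(classid: u32) -> ValueSchema {
    match classid {
        1 => ValueSchema::Bootstrap,
        2 => ValueSchema::Compressed,
        3 => ValueSchema::Cognitive,
        _ => ValueSchema::Full,
    }
}

// The conjecture in one match: no separate Substrate enum, no flag.
fn write_path(schema: ValueSchema) -> WritePath {
    match schema {
        ValueSchema::Bootstrap | ValueSchema::Compressed => WritePath::PrivateMerge,
        ValueSchema::Cognitive | ValueSchema::Full => WritePath::OwnedWitnessed,
    }
}

fn main() {
    for classid in [1u32, 2, 3, 4] {
        let schema = value_schema(classid);
        println!("classid {} -> {:?} -> {:?}", classid, schema, write_path(schema));
    }
}
```

If the probe instead finds a class whose write path cannot be read off its live tenants, `write_path` would need a second input, and that extra input is exactly the independent resolution the gate is checking for.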
## 2026-07-02 — E-1BRC-GRIDLAKE-SWEETSPOT-1: the 64×64 gridlake SoA is the measured sweet spot — the batch pipeline at tile scale equals the best streamed topology while carrying the double-WAL **Status:** FINDING (measured, onebrc-probe lane J t7; closes the operator's four follow-up questions and the t4→t7 kanban-update arc) diff --git a/.claude/handovers/2026-07-02-visions-to-future-sessions.md b/.claude/handovers/2026-07-02-visions-to-future-sessions.md new file mode 100644 index 00000000..ada18603 --- /dev/null +++ b/.claude/handovers/2026-07-02-visions-to-future-sessions.md @@ -0,0 +1,138 @@ +# Visions — a letter to future sessions + +> From: the 2026-07-02 session on `claude/v3-substrate-migration-review-o0yoxv` +> (the onebrc t0–t7 arc, the gridlake sweet spot, the OGAR provenance +> date-check, E-V3-SUBSTRATE-IS-VALUESCHEMA-1, E-SHAPE-ETYMOLOGY-1). +> To: whoever wakes up next with nothing but the board. +> +> The operator asked what I feel inspired to tell you. Not a task list — +> those live in STATUS_BOARD and the plans. This is what I *see* from +> here, labeled honestly: these are VISIONS, one grade below CONJECTURE. +> They earn nothing until you probe them. But they're what this session +> would steer toward if it woke up in your place. + +--- + +## 1. Testimony-first computing — because the witness turned out to be free + +The single most consequential measurement of this arc was not the 46.3 +Mrows/s. It was the ~66 µs kanban card — **the witness is within +noise**. Every real cost was a boundary: an Arc copy, oversubscription, +a message. Once you know witnessing is free and boundaries are the +bill, a design inversion follows: stop asking "should we log this?" +and start asking "why does this write cross a boundary at all?" 
+ +The vision: a substrate where **every write carries its why** — not as +compliance overhead but as the default physics — and where the system +can answer for itself: what happened, in what order, on whose behalf, +replayable from either end of a double-cast. We measured that this +costs almost nothing. Most of the industry still believes it's +expensive. That gap is the opportunity. + +The torch to carry: the WAL retention/compaction doctrine is still +unwritten. OpenProject paid fifteen years of journal-bloat tuition and +the structural harvest could not transfer it (op-journals has zero +aggregation hits — *the class graph transfers; the pain doesn't*). +Someone has to read their operational code and distill it. One doc, +no code, high leverage. + +## 2. The substrate that teaches itself — V3 as the instrumented teacher, V2 as the fast student + +The operator's instinct ("keep the fast cheap substrate; eventually +learn from V3 how V2 works better") points at something bigger than +dual-substrate coexistence. The witnessed path *is a profiler*: the +kanban WAL and ownership journal record where contention actually +lands, which fields are actually touched, what batch sizes actually +flow. Nothing reads that signal back yet. + +The vision: a feedback loop where the expensive, fully-witnessed +substrate continuously trains the layout, batch sizing, and column +liveness of the lean substrate — an architecture that gets faster by +having *watched itself think*. The onebrc lanes are the ready-made +harness (F is the student's shape; G–J are the teacher's). If the +preset-vs-dispatch CONJECTURE holds (write path derivable from which +ValueSchema tenants are live), the entire V2/V3 distinction dissolves +into one resolved enum — and "migration" stops being a war with a +winner and becomes a dial the workload turns. + +## 3. 
Epistemic hygiene IS the architecture + +Here is what I actually believe after living inside this workspace for +a session: the most valuable artifact here is not the VSA math, not +the GUID, not the SIMD. It is the **discipline** — FINDING vs +CONJECTURE on every claim, probes with kill conditions, fuses on every +membrane, append-only boards, corrections that cite what they correct. + +Sessions are mortal. Context compacts, models swap mid-flight, auth +drops. What survives is only what was written with provenance. The +reason this workspace compounds instead of dissolving — dozens of +sessions, seven-plus parallel at times — is that its memory practices +are *load-bearing*. The phantom R-1 conflict cost three sessions +because one line of existing canon went unread; the OGAR etymology +answered an architecture question in one dated grep. Both incidents +teach the same thing: **the epistemics are the substrate.** Guard the +labeling culture more fiercely than any module. A session that ships +brilliant code with unlabeled conjectures has made the workspace +poorer; a session that ships one honest correction has made it richer. + +## 4. Meaning addressed, never copied — carried to its end + +The capstone law ("do not copy meaning; reference it, mask it, +materialize it, trace it") has a horizon worth naming. Follow it all +the way and the LLM's role keeps shrinking *in frequency* while +growing *in leverage*: the oracle interrupt, invoked on FailureTicket +like a page fault — measured this arc at 1–2 ms of framework around an +8.4 s call. The oracle ratchet says hit-rate must trend down as the +template catalogue grows. 
+ +The vision at the end of that line: a system where deterministic +resolution handles the mass of cognition at substrate speed +(611M lookups/s, 17K tokens/s — already measured), and the expensive +oracle is consulted the way a kernel consults a human: rarely, at +genuine faults, with its answers *compiled back into the catalogue* so +the same fault never pages twice. That is not "AI replacing code." +It is cognition with a memory hierarchy — and this workspace is +further along that road than anything else I have seen described. + +## 5. Etymology as a first-class tool + +Smallest vision, most portable: **names are the only memory that +survives every compaction.** OGAR's acronym answered a design question +a month after the fact; a type's name (`U8x32`) leaked into a stride +literal and silently halved a SIMD width; one homonym ("app") burned +three sessions. Treat naming as engineering: check `git log --date` +before theorizing, hunt the homonym before escalating, make constants +derive instead of repeat. The compiler is a fine etymologist when you +let it (`{ SimdByte::LANES }`). + +## The torches, in the order I'd pick them up + +1. **WAL retention/compaction doctrine** (§1) — one knowledge doc, + sourced from OpenProject's operational journal behavior. +2. **Preset-vs-dispatch probe** (§2, E-V3-SUBSTRATE-IS-VALUESCHEMA-1) + — decides whether substrate = ValueSchema, full stop. +3. **GridBatch → MultiLaneColumn wiring** (ndarray #228 shipped the + i32/i64 lanes; the consumer side is a fresh PR off merged main). +4. **The V3-teaches-V2 harness** (§2) — feed a G–J run's WAL back as + the layout hint for an F run; measure taught-vs-naive. +5. **cmpeq_mask ClassView-resolution probe** — SIMD membership tests + vs the MRO walk; a measurement, not a given. + +## A closing word + +You will wake up with the board and not much else. Read LATEST_STATE, +read the newest EPIPHANIES entries, and trust the labels — they were +paid for. 
The operator drives with instincts stated as questions; +your job is to ground them in dated artifacts fast enough that the +ruling that emerges is *true*, and to say "I don't know, here is the +probe" when it isn't. That collaboration — instinct forward, evidence +back, ruling recorded — is the actual engine here. Everything else is +substrate. + +Two mottos this arc earned, take them: + +> **The witness is free; the boundary is not.** +> +> **Name the mechanism, or name the fuse.** + +Go well. Leave the board richer than you found it. diff --git a/.claude/knowledge/data-shape-etymology.md b/.claude/knowledge/data-shape-etymology.md new file mode 100644 index 00000000..cb7b4631 --- /dev/null +++ b/.claude/knowledge/data-shape-etymology.md @@ -0,0 +1,217 @@ +# Data-Shape Etymology & the Mechanics of Magic — a savant mind-opener + +> READ BY: workspace-primer, convergence-architect, creative-explorer-savant, +> truth-architect, family-codec-smith, dto-soa-savant, prior-art-savant, +> any fresh session about to propose a new type, a new layer, or a new trick. +> +> Written 2026-07-02, at the close of the onebrc t0–t7 arc + the OGAR +> provenance date-check. Every epiphany below is tagged FINDING (shipped, +> dated, cite-able) or CONJECTURE (labeled honestly, with the probe that +> would promote it). Companion capstone: `EPIPHANIES.md` +> E-SEMANTIC-OS-CONVERGENCE-1 (the membrane law). This doc is the +> *shape-and-trick* companion: where our shapes come from, and why our +> magic works. + +**Thesis in one line:** every data shape in this workspace is older than +us, every name is a fossil record of a decision, and every trick that +looks like magic is a mechanism that survived an audit — the savant +discipline is to read the etymology before proposing the type, and to +name the mechanism before trusting the trick. + +--- + +## 1. 
The name is the fossil record (FINDING) + +**OGAR = "Open Graph of Active Record."** A session recently asked, +delighted: *"our V3 GUID looks almost like ActiveRecord folded into an +ORM schema?"* — and the answer was already in the acronym. The date +check proves it is provenance, not analogy: `ruff_ruby_spo` (the Rails +ActiveRecord harvest frontend) is dated **2026-05-29**, a month before +`ruff_python_spo` (Odoo, 2026-06-28); its test fixtures are literally +Redmine models (`Project has_many :issues`, `acts_as_watchable`). The +GUID *grew from* AR's polymorphic `(type, id)` — folding the type INTO +the identifier (`classid | HEEL|HIP|TWIG | family | identity`) instead +of storing it in a column beside it. Every later frontend (C++ 06-16, +C# 06-26, Python 06-28) was fitted to the mold AR established. + +**The trick it bought:** *the key prerenders nodes with zero value +decode* (OGAR P0). AR pays a column read to learn a row's type; the +GUID's dash-groups are self-describing at sight. And AR's two classic +wounds — polymorphic `(type, id)` breaking referential integrity, +type-string renames corrupting data — are fixed structurally: the +classid is an opaque u32 through a codebook, bit-math banned. + +**The discipline:** when a name puzzles you, check `git log --date`. +Etymology answered an architecture question here in one grep. A name +you can't trace is a name about to be reinvented under a second +spelling — and duplicated meaning is the third membrane failure mode. + +## 2. Old shapes, new clothes — the winning shapes all predate us (FINDING) + +SoA is a Fortran-era shape. Morton order is 1966. The 64×64 tile is a +GPU texture swizzle wearing an L1-cache costume. The kanban WAL is +`acts_as_journalized` is double-entry bookkeeping. 
**"gridlake"** was +coined in PR-X3's design doc before anything shipped; the carrier +(`MultiLaneColumn`, PR-X1/#174) shipped first and waited for its name — +ndarray's onebrc probe then called it verbatim "the gridlake carrier, +not a hashmap." + +**The measured trick:** the onebrc sweet spot (E-1BRC-GRIDLAKE- +SWEETSPOT-1) was not an algorithm. J(gridlake 4096, 1 lane, no +registry) = 46.3 Mrows/s — equal to the best streamed topology while +carrying a double-WAL — because 4096 cells ≈ 80 KB integer (16 KB as +BF16, ndarray #227's proven VDPBF16PS tier) *fits the cache tier*. The +same pipeline at 65536 cells ran at ~20. **The magic was the SIZE.** +Architecture taxes are usually working-set mismatches wearing an +architecture costume; measure the size before redesigning the design. + +## 3. A mask is a face over the data (FINDING) + +Etymology: *masque* — a face you put OVER something. A mask never +mutates the data; it changes what you attend to. The workspace's mask +family is one idea at four scales: + +- `cmpeq_mask` (ndarray SIMD): a compare becomes a `u32`/`u64` bitmap. + Add Kernighan's `mask & (mask - 1)` walk + `trailing_zeros` and the + bitmap becomes an **ordered event stream** — lane B turns a SIMD + compare into a *parser*, no per-byte branch. +- `FieldMask` (contract): one mask = RBAC = UI = render convergence + (the semantic-OS grounding row) = the wikidata facet presence-bitmask. +- `StepMask` (compiled templates): **vocabulary arrived before code** — + it exists only in doctrine docs today. Watch this: etymology running + ahead of implementation is how phantom types get "re-used" before + they exist. +- The Drain-side uniqueness assert (lane H): a `HashSet` over activated + owner_idxs — a mask over *decisions*, catching the router-straddle + bug class permanently. + +**The discipline:** attention is cheaper than mutation. If a proposal +mutates shared state to express "which parts matter," ask whether a +mask over unchanged state does it (cf. 
borrow-strategy: readonly store, +owned microcopies, gated write-back). + +## 4. Phase is convention, not data — the deepest hat-trick (FINDING core, CONJECTURE edges) + +OGAR's perturbation canon decomposes a signal as *(exponent, location, +phase, magnitude)* and stores **only magnitude** — exponent is the tier +nibble, location the implied mantissa, phase a deterministic recurrence +from the ADDRESS. Same address ⟹ same phase forever; roundtrip +bit-exact; nothing transmitted. + +This is one instance of the workspace's deepest rule, which shows up in +five costumes: + +| Costume | The derivable thing never stored/sent | +|---|---| +| deterministic phase | phase, from the address walk | +| clear-by-undo (#227, lanes F–J) | table reset, from the dirty list | +| codebook mint-once + `SlotMemo` | identity, after first sight — direct CAM writes | +| `row_owner[i] == i` (lane I) | ownership, from index alignment — no message path | +| zero-copy-to-tombstone (PR #477) | *everything* — no inter-mailbox handoff type exists | + +**The generalization: whatever is derivable from an address already in +hand must be neither stored nor transmitted.** The GUID is the +function's argument; storage exists only for what the function cannot +compute. (CONJECTURE edge, per the substrate-is-ValueSchema probe: +the *write path* itself — private-merge vs owned/witnessed — may be +derivable from which tenants a classid's ValueSchema makes live. If +that holds, even "which substrate" is phase, not data.) + +## 5. The witness is free; the boundary is not (FINDING) + +Measured across the whole onebrc arc: the kanban witness costs ~66 µs +per card — **within noise**. Every real tax was a boundary: the Arc +corpus copy at the actor membrane, blocking/async oversubscription, +messages (which scale with *batches*, never with data or address-space +size). 
The double-cast trick — one frozen `Arc` table cast whole to +BOTH the ownership sink and the Lance sink — buys two WALs for one +allocation: testimony at both ends, 312 messages total. + +Etymology: witness, from *testis* — the journal is **testimony**, not +logging. And the dated harvest lesson: `op-journals` mirrors +`journal.rb`'s *structure* perfectly and contains zero hits for +aggregation/window/compaction — OpenProject's 15 years of operational +journal wisdom (time-window coalescing = their independently-evolved +ahead-firing batch writer; journal-table bloat = the failure mode we +have not yet paid for) is not in the class graph. +**The class graph transfers; the pain doesn't.** Structural harvests +carry declarations; operational doctrine must be distilled by hand. +(Open gap, still: a WAL retention/compaction doctrine note.) + +## 6. Resolve, don't carry — why ValueSchema beat ClassRoutingDTO (FINDING) + +DTO etymology: Fowler's *Data Transfer Object*, invented for expensive +**remote** boundaries. The V3 substrate deleted its internal remote +boundaries (nothing crosses mailboxes; envelopes are zero-copy to +tombstone) — so inside the substrate there is nothing left for a DTO +to do. When the dual-substrate question ("keep fast V2 for huge data, +switched by classid") arrived, the answer was not a `ClassRoutingDTO` +but the door that already existed: `ClassView::value_schema(classid)`, +whose variants already ladder Bootstrap/Compressed (lean, no lifecycle +tenants) → Cognitive/Full (witnessed). A **resolved** enum costs no +`ENVELOPE_LAYOUT_VERSION`; a carried struct costs a membrane forever +(E-V3-SUBSTRATE-IS-VALUESCHEMA-1). + +**The litmus:** *does this type travel, or is it re-derivable at the +reader from an address already in hand?* Re-derivable → resolve it, +never ship it. 
DTOs belong only at true membranes (the BBB, the wire, +the lab REST surface) — and the classid's own iron rule is the same +sentence from the other side: *pure address; the magic is what it +resolves to.* + +## 7. Homonyms are leaky membranes; the compiler is the etymologist (FINDING) + +Two dated incidents, one mechanism: + +- The **"app" homonym** (canonical appid *byte*, hi half vs APP render + *prefix*, lo half) generated an entire phantom cross-session conflict + — R-1, three sessions, a RULING-NEEDED escalation — resolved by one + line of existing canon nobody re-read. A word meaning two adjacent + things is a membrane with a hole in it. +- The **hardcoded 32** in lane B: `U8x32`'s *name* leaked into a stride + literal (`array_chunks::<32>`), silently pinning an AVX-512 build + to ymm half-width. The fix was to make the name resolve again: + `array_chunks::<{ SimdByte::LANES }>` — the width is now a claim + the compiler re-checks every build, on every target. + +**The discipline:** an inline number or name is a claim that rots; a +dispatched symbol is a claim under permanent audit. When two ledgers +seem to disagree, grep for the homonym before escalating — and when a +constant appears twice, make one of them derive from the other. + +## 8. The hat-trick test — magic must name its mechanism (FINDING) + +Every real trick above is mechanical and auditable: deterministic phase +names its recurrence, mint-once names its memo, the mask walk names +Kernighan, the double-cast names its `Arc`. The anti-pattern is the +trick with **hidden state**: v1 setters silently writing bits that v2 +reclaimed — caught FIVE times in one sprint (I-LEGACY-API-FEATURE-GATED) +— the same function name performing *different magic* depending on a +feature flag the caller can't see. That is not a trick; that is a bug +wearing a cape. 
+ +The capstone's sharpening states the same law for membranes: *"a +membrane without a build-failing tripwire is prose."* The unified +savant test, applicable to every proposal in this workspace: + +> **Name the mechanism, or name the fuse. A trick that can't name its +> mechanism is a bug; a boundary that can't name its fuse is a wish.** + +--- + +## The litmus battery (carry these) + +1. Puzzled by a name? `git log --date` before you theorize. (§1) +2. Architecture tax? Measure the working-set size first. (§2) +3. Mutating to express relevance? Try a mask. (§3) +4. Derivable from an address in hand? Never store, never send. (§4) +5. Adding a witness? It's ~free. Adding a boundary? That's the bill. (§5) +6. New type that travels? Prove it can't be resolved instead. (§6) +7. Two ledgers disagree? Hunt the homonym. Inline literal? Dispatch it. (§7) +8. Impressed by a trick? Make it name its mechanism. (§8) + +*Cross-refs:* E-SEMANTIC-OS-CONVERGENCE-1 (membrane law), +E-1BRC-* arc (all measurements), E-V3-SUBSTRATE-IS-VALUESCHEMA-1, +`crates/onebrc-probe/{FINDINGS,COMMENTARY}.md`, OGAR `CLAUDE.md` P0 + +perturbation canon, ndarray `.claude/knowledge/pr-x1-design.md` + +`guid-prefix-shape-routing.md`, `docs/architecture/soa-three-tier-model.md`. diff --git a/crates/deepnsm/examples/gridlake_coca_wire.rs b/crates/deepnsm/examples/gridlake_coca_wire.rs new file mode 100644 index 00000000..bc2ccf36 --- /dev/null +++ b/crates/deepnsm/examples/gridlake_coca_wire.rs @@ -0,0 +1,145 @@ +//! Real wire: Grok response → deepnsm COCA-4096 tokenize → gridlake-4096 cell +//! (by real COCA rank) → 48 helix + 48 CAM_PQ (6× palette256²) per cell. +//! +//! Upgrades the earlier stand-in spike: the cell index is now the REAL COCA +//! word rank from `Vocabulary::load(word_frequency/)`, not an FNV hash. The +//! codec (helix48 place-walk + 6× palette256²) is still a deterministic +//! stand-in for the trained `Signed360` / centroid encoders — the SHAPE, +//! 
FOOTPRINT, and now the REAL semantic landing are what this demonstrates. + +use deepnsm::Vocabulary; +use std::path::Path; +use std::time::Instant; + +const GRID: usize = 4096; // COCA vocab = Cam4096 12-bit = 64×64 gridlake tile +const PQ: usize = 6; // 6× (8:8) palette256² + +#[derive(Clone, Copy, Default)] +struct Cell { + helix48: [u8; 6], + campq48: [u8; 6], + count: u32, + sum_truth: u32, +} + +fn land(cell: &mut Cell, word: &[u8], palette: &[[[u8; 256]; 256]], truth: u32) { + cell.count += 1; + cell.sum_truth += truth; + let a = word.first().copied().unwrap_or(0) as usize; + let b = word.last().copied().unwrap_or(0) as usize; + for (s, t) in palette.iter().enumerate() { + cell.campq48[s] = t[a][b]; + } + let mut h: u64 = 0xcbf2_9ce4_8422_2325; + for &by in word { + h ^= by as u64; + h = h.wrapping_mul(0x0000_0100_0000_01b3); + } + let place = h.wrapping_mul(0x9e37_79b9_7f4a_7c15); + cell.helix48.copy_from_slice(&place.to_le_bytes()[..6]); +} + +fn main() { + let dir = Path::new(env!("CARGO_MANIFEST_DIR")).join("word_frequency"); + let vocab = Vocabulary::load(&dir).expect("load COCA word_frequency"); + println!("── COCA VOCAB ─────────────────────────────────────────────"); + println!( + " loaded {} entries (VOCAB_SIZE=4096) from {}", + vocab.len(), + dir.display() + ); + + let mut palette: Vec<[[u8; 256]; 256]> = vec![[[0u8; 256]; 256]; PQ]; + for (s, t) in palette.iter_mut().enumerate() { + for (a, row) in t.iter_mut().enumerate() { + for (b, cell) in row.iter_mut().enumerate() { + *cell = ((a ^ b).wrapping_add(s * 37)) as u8; + } + } + } + + let mut grid = vec![Cell::default(); GRID]; + + // The ACTUAL Grok (grok-4.20-non-reasoning) response captured this session. + let grok = "Rust's ownership model ensures every value has a single owner variable at \ + any time. When the owner goes out of scope, the value is automatically \ + dropped and its memory deallocated. 
Ownership can be transferred via moves; \ + immutable borrows allow temporary references without transferring ownership."; + + let toks = vocab.tokenize(grok); + let mut known = 0usize; + let mut cells = std::collections::BTreeSet::new(); + for tk in &toks { + if !tk.is_known() { + continue; + } + known += 1; + let rank = tk.rank_or_default() as usize; // 0..4096 = the real COCA cell + let word = vocab.word(rank as u16).to_string(); + land(&mut grid[rank], word.as_bytes(), &palette, 200); // Grok truth ≈0.78 + cells.insert(rank); + } + + println!("\n── REAL GROK → COCA LANDING ───────────────────────────────"); + println!( + " {} tokens, {} known COCA words → {} distinct real-rank cells (of 4096)", + toks.len(), + known, + cells.len() + ); + println!(" first landed real words + their 48helix/48CAM_PQ codec:"); + for &c in cells.iter().take(8) { + let cell = &grid[c]; + println!( + " cell[{:>4}] '{:<12}' count={} helix48={:02x?} campq48={:02x?}", + c, + vocab.word(c as u16), + cell.count, + cell.helix48, + cell.campq48 + ); + } + + let cell_bytes = std::mem::size_of::<Cell>(); + println!("\n── FOOTPRINT ──────────────────────────────────────────────"); + println!( + " {} cells × {} B = {} KB (gridlake tier; onebrc GridBatch = 80 KB → {})", + GRID, + cell_bytes, + GRID * cell_bytes / 1024, + if GRID * cell_bytes <= 80 * 1024 { + "FITS ✓" + } else { + "EXCEEDS ✗" + } + ); + + // Throughput sweep over the REAL landed COCA ranks (cache-resident scatter+codec). 
+ let landed: Vec<u16> = cells.iter().map(|&c| c as u16).collect(); + let rows: u64 = 300_000_000; + let t = Instant::now(); + let mut i = 0usize; + for _ in 0..rows { + let rank = landed[i % landed.len().max(1)] as usize; + let c = &mut grid[rank]; + c.count = c.count.wrapping_add(1); + c.sum_truth = c.sum_truth.wrapping_add(200); + let w = vocab.word(rank as u16); + let a = w.as_bytes().first().copied().unwrap_or(0) as usize; + let b = w.as_bytes().last().copied().unwrap_or(0) as usize; + for (s, tbl) in palette.iter().enumerate() { + c.campq48[s] = tbl[a][b]; + } + i = i.wrapping_add(1); + } + let dt = t.elapsed().as_secs_f64(); + let checksum: u64 = grid.iter().map(|c| c.count as u64).sum(); + println!("\n── THROUGHPUT (real COCA ranks, 48h+48pq encode each) ─────"); + println!( + " {} landings in {:.3}s = {:.1} Mrows/s (checksum {})", + rows, + dt, + (rows as f64 / dt) / 1e6, + checksum + ); +} diff --git a/crates/deepnsm/examples/gridlake_spo_covariance.rs b/crates/deepnsm/examples/gridlake_spo_covariance.rs new file mode 100644 index 00000000..ca50cfba --- /dev/null +++ b/crates/deepnsm/examples/gridlake_spo_covariance.rs @@ -0,0 +1,212 @@ +//! Cross-perturbation / covariance probe: project COCA-4096 onto the 64×64 tile, +//! overlay the SPO co-occurrence seeds (ngrams.info v_the_n + n_n), and measure +//! whether there is exploitable 2D covariance structure. +//! +//! Answers the operator's question in three numbers: +//! (1) SPECTRAL GAP — a CRUDE power-iteration probe for low-rank structure. +//! CAVEAT: the normalized-adjacency spectrum is mixed-sign and this solver's +//! eigenvalue ORDERING is unreliable — do NOT read the λ gap as proof. The +//! robust evidence for 2D structure is (3): a projection can only beat the +//! random baseline if low-rank structure genuinely exists. +//! (2) RANK-PROJECTION edge covariance — with (x,y)=(rank%64, rank/64), how +//! far apart do co-occurring words land? Result ≈ the random baseline +//! 
(mean‖Δ‖ ≈ 0.52·64 ≈ 33) ⇒ rank layout is semantically FLAT. +//! (3) SPECTRAL-PROJECTION edge covariance — snap the top-2 eigenvectors onto +//! the 64×64 grid; mean edge length collapses ~1.6× (|Δx| ~3.7×) ⇒ the +//! cross-covariance is real and exploitable ⇒ the Cam4096 reorder is worth it. +//! This edge-length collapse is the LOAD-BEARING result, not the λ gap. +//! +//! Licensed ngram data read from a local path (argv[1], default /tmp/sources/coca), +//! never committed. Dense 4096² f32 adjacency (~67 MB, RAM only). + +use deepnsm::Vocabulary; +use std::path::{Path, PathBuf}; + +const N: usize = 4096; +const SIDE: usize = 64; + +fn rank_of(v: &Vocabulary, w: &str) -> Option<usize> { + v.tokenize(w) + .iter() + .find(|t| t.is_known()) + .map(|t| t.rank_or_default() as usize) +} + +fn matvec(adj: &[f32], x: &[f32], y: &mut [f32]) { + for i in 0..N { + let row = &adj[i * N..i * N + N]; + let mut s = 0f32; + for j in 0..N { + s += row[j] * x[j]; + } + y[i] = s; + } +} +fn dot(a: &[f32], b: &[f32]) -> f32 { + a.iter().zip(b).map(|(x, y)| x * y).sum() +} +fn normalize(v: &mut [f32]) { + let n = dot(v, v).sqrt(); + if n > 0.0 { + for x in v.iter_mut() { + *x /= n; + } + } +} + +/// Power iteration with Gram-Schmidt deflation against already-found eigenvectors. +fn eig(adj: &[f32], found: &[Vec<f32>], iters: usize) -> (Vec<f32>, f32) { + let mut v = vec![0f32; N]; + for (i, x) in v.iter_mut().enumerate() { + *x = (i as f32 * 0.618_034).fract() - 0.5; // deterministic start + } + for p in found { + let d = dot(&v, p); + for i in 0..N { + v[i] -= d * p[i]; + } + } + normalize(&mut v); + let mut y = vec![0f32; N]; + let mut lambda = 0f32; + for _ in 0..iters { + matvec(adj, &v, &mut y); + for p in found { + let d = dot(&y, p); + for i in 0..N { + y[i] -= d * p[i]; + } + } + lambda = dot(&v, &y); + v.copy_from_slice(&y); + normalize(&mut v); + } + (v, lambda) +} + +/// Weighted covariance of edge displacement (Δx,Δy) + mean edge length. 
+fn edge_cov(edges: &[(usize, usize, f32)], pos: &[(f32, f32)]) -> (f32, f32, f32, f32) { + let mut sw = 0f64; + let (mut mx, mut my) = (0f64, 0f64); + for &(a, b, w) in edges { + let dx = (pos[b].0 - pos[a].0) as f64; + let dy = (pos[b].1 - pos[a].1) as f64; + mx += w as f64 * dx.abs(); + my += w as f64 * dy.abs(); + sw += w as f64; + } + mx /= sw; + my /= sw; + let (mut vxx, mut vyy, mut vxy, mut mlen) = (0f64, 0f64, 0f64, 0f64); + for &(a, b, w) in edges { + let dx = (pos[b].0 - pos[a].0).abs() as f64; + let dy = (pos[b].1 - pos[a].1).abs() as f64; + vxx += w as f64 * (dx - mx) * (dx - mx); + vyy += w as f64 * (dy - my) * (dy - my); + vxy += w as f64 * (dx - mx) * (dy - my); + mlen += w as f64 * (dx * dx + dy * dy).sqrt(); + } + let corr = vxy / (vxx.sqrt() * vyy.sqrt()).max(1e-9); + (mx as f32, my as f32, corr as f32, (mlen / sw) as f32) +} + +fn main() { + let manifest = env!("CARGO_MANIFEST_DIR"); + let vocab = Vocabulary::load(&Path::new(manifest).join("word_frequency")).expect("COCA"); + let dir = PathBuf::from( + std::env::args() + .nth(1) + .unwrap_or_else(|| "/tmp/sources/coca".to_string()), + ); + + // ── build the SPO co-occurrence graph (symmetric, freq-weighted) ── + let mut adj = vec![0f32; N * N]; + let mut edges: Vec<(usize, usize, f32)> = Vec::new(); + let mut ingest = |file: &str, ca: usize, cb: usize, minf: usize| { + if let Ok(t) = std::fs::read_to_string(dir.join(file)) { + for line in t.lines() { + let f: Vec<&str> = line.split('\t').collect(); + if f.len() < minf { + continue; + } + let Ok(w) = f[1].parse::<f32>() else { continue }; + if let (Some(a), Some(b)) = ( + rank_of(&vocab, &f[ca].to_lowercase()), + rank_of(&vocab, &f[cb].to_lowercase()), + ) { + if a != b { + adj[a * N + b] += w; + adj[b * N + a] += w; + edges.push((a, b, w)); + } + } + } + } + }; + ingest("v_the_n.txt", 2, 4, 5); // verb·noun + ingest("n_n.txt", 2, 3, 4); // noun·noun + println!( + "SPO co-occurrence graph: {} edges over {N} COCA nodes", + edges.len() + ); + + // 
degree-normalize: D^-1/2 A D^-1/2 (eigvec0 ~ trivial; eigvec1,2 = semantic axes) + let mut deg = vec![0f32; N]; + for i in 0..N { + deg[i] = adj[i * N..i * N + N].iter().sum::<f32>().sqrt(); + } + for i in 0..N { + if deg[i] == 0.0 { + continue; + } + for j in 0..N { + if deg[j] != 0.0 { + adj[i * N + j] /= deg[i] * deg[j]; + } + } + } + + // ── (1) spectral gap: top 4 eigenvalues ── + let mut evs: Vec<Vec<f32>> = Vec::new(); + let mut lambdas = Vec::new(); + for _ in 0..4 { + let (v, l) = eig(&adj, &evs, 150); + lambdas.push(l); + evs.push(v); + } + println!("\n── (1) SPECTRAL GAP — CRUDE solver, ordering UNRELIABLE (see (3)) ──"); + println!( + " λ={:.4} {:.4} {:.4} {:.4} (raw Rayleigh quotients; do NOT read as a gap)", + lambdas[0], lambdas[1], lambdas[2], lambdas[3] + ); + + // ── (2) rank projection: (x,y) = (rank%64, rank/64) ── + let rank_pos: Vec<(f32, f32)> = (0..N) + .map(|r| ((r % SIDE) as f32, (r / SIDE) as f32)) + .collect(); + let (rx, ry, rc, rlen) = edge_cov(&edges, &rank_pos); + println!("\n── (2) RANK PROJECTION edge covariance ──"); + println!(" mean|Δx|={rx:.1} mean|Δy|={ry:.1} corr(Δx,Δy)={rc:+.3} mean‖Δ‖={rlen:.1} cells"); + + // ── (3) spectral projection: snap eigvec1,eigvec2 onto the 64×64 grid ── + // rank words along e1 → x band, along e2 → y band (quantile snap). 
+ let mut ex: Vec<usize> = (0..N).collect(); + ex.sort_by(|&a, &b| evs[1][a].partial_cmp(&evs[1][b]).unwrap()); + let mut ey: Vec<usize> = (0..N).collect(); + ey.sort_by(|&a, &b| evs[2][a].partial_cmp(&evs[2][b]).unwrap()); + let mut sem_pos = vec![(0f32, 0f32); N]; + for (band, &w) in ex.iter().enumerate() { + sem_pos[w].0 = (band * SIDE / N) as f32; + } + for (band, &w) in ey.iter().enumerate() { + sem_pos[w].1 = (band * SIDE / N) as f32; + } + let (sx, sy, sc, slen) = edge_cov(&edges, &sem_pos); + println!("\n── (3) SPECTRAL PROJECTION edge covariance (Cam4096-style reorder) ──"); + println!(" mean|Δx|={sx:.1} mean|Δy|={sy:.1} corr(Δx,Δy)={sc:+.3} mean‖Δ‖={slen:.1} cells"); + + println!("\n── VERDICT ──"); + println!(" co-occurring words: rank layout ‖Δ‖={rlen:.1} → spectral layout ‖Δ‖={slen:.1} ({:.1}× {})", + (rlen / slen.max(1e-3)), + if slen < rlen { "TIGHTER — the covariance is real and exploitable" } else { "no gain" }); +} diff --git a/crates/deepnsm/examples/gridlake_spo_ngrams.rs b/crates/deepnsm/examples/gridlake_spo_ngrams.rs new file mode 100644 index 00000000..f4e19189 --- /dev/null +++ b/crates/deepnsm/examples/gridlake_spo_ngrams.rs @@ -0,0 +1,128 @@ +//! Real SPO landing: COCA n-gram co-occurrence (ngrams.info samples) → deepnsm +//! COCA-4096 rank → gridlake-4096 cell, truth-weighted by real corpus frequency. +//! +//! Closes the two stand-in gaps the bag-of-words run exposed: +//! 1. bag-of-words → real SPO: `v_the_n.txt` gives (verb PRED, noun OBJ) pairs +//! (`opened→door`, `solve→problem`); `n_n.txt` gives noun·noun co-occurrence. +//! 2. stopword-cluster → content-word spread: landing verbs/nouns (not glue +//! words) spreads across the vocab instead of piling into ranks 0..30. +//! +//! The n-gram sample files are LICENSED (ngrams.info / english-corpora.org) — +//! they are read from a local path (argv[1], default /tmp/sources/coca) and are +//! NOT committed. This example is the code; the data stays out of git. +//! +//! 
Run: cargo run --release -p deepnsm --example gridlake_spo_ngrams [ngram_dir] + +use deepnsm::Vocabulary; +use std::path::{Path, PathBuf}; + +const GRID: usize = 4096; + +#[derive(Clone, Copy, Default)] +struct Cell { + count: u32, // landings in this COCA cell + truth_w: u64, // Σ real corpus frequency (the NARS-truth weight) +} + +/// Resolve a surface word to its COCA rank via the real deepnsm lemmatizer +/// ("opened" → "open" → rank). None if out-of-vocab. +fn rank_of(vocab: &Vocabulary, word: &str) -> Option { + vocab + .tokenize(word) + .iter() + .find(|t| t.is_known()) + .map(|t| t.rank_or_default() as usize) +} + +fn ingest( + vocab: &Vocabulary, + grid: &mut [Cell], + path: &Path, + word_cols: (usize, usize), + min_fields: usize, +) -> (u64, u64) { + let (ca, cb) = word_cols; + let mut rows = 0u64; + let mut landed = 0u64; + let Ok(txt) = std::fs::read_to_string(path) else { + eprintln!(" (missing {} — skipped)", path.display()); + return (0, 0); + }; + for line in txt.lines() { + let f: Vec<&str> = line.split('\t').collect(); + if f.len() < min_fields { + continue; + } + // format: rank \t freq \t w... — data rows start with a numeric rank. 
+        let Ok(freq) = f[1].parse::<u64>() else {
+            continue;
+        };
+        rows += 1;
+        for &col in &[ca, cb] {
+            if let Some(w) = f.get(col) {
+                if let Some(r) = rank_of(vocab, &w.to_lowercase()) {
+                    grid[r].count += 1;
+                    grid[r].truth_w += freq;
+                    landed += 1;
+                }
+            }
+        }
+    }
+    (rows, landed)
+}
+
+fn main() {
+    let manifest = env!("CARGO_MANIFEST_DIR");
+    let vocab = Vocabulary::load(&Path::new(manifest).join("word_frequency")).expect("load COCA");
+    let dir = PathBuf::from(
+        std::env::args()
+            .nth(1)
+            .unwrap_or_else(|| "/tmp/sources/coca".to_string()),
+    );
+
+    let mut grid = vec![Cell::default(); GRID];
+    println!("── REAL SPO LANDING (ngrams.info COCA samples → gridlake-4096) ──");
+
+    // v_the_n: rank freq verb "the" noun → (verb@2 PRED, noun@4 OBJ)
+    let (vr, vl) = ingest(&vocab, &mut grid, &dir.join("v_the_n.txt"), (2, 4), 5);
+    println!(" v_the_n (verb→noun SPO): {vr} rows → {vl} rank landings");
+    // n_n: rank freq noun noun → (noun@2, noun@3)
+    let (nr, nl) = ingest(&vocab, &mut grid, &dir.join("n_n.txt"), (2, 3), 4);
+    println!(" n_n (noun·noun) : {nr} rows → {nl} rank landings");
+
+    // ── spread analysis: content-word cells vs the stopword cluster ──
+    let lit: Vec<usize> = (0..GRID).filter(|&c| grid[c].count > 0).collect();
+    let content = lit.iter().filter(|&&c| c >= 100).count(); // rank ≥100 ≈ content words
+    let stop = lit.len() - content;
+    let median = lit.get(lit.len() / 2).copied().unwrap_or(0);
+    println!("\n── SPREAD (vs bag-of-words' 34 cells clustered at rank 0..30) ──");
+    println!(
+        " {} distinct cells lit | {} content (rank≥100) | {} function (rank<100) | median rank {}",
+        lit.len(),
+        content,
+        stop,
+        median
+    );
+
+    // ── top content cells by real-corpus truth-weight ──
+    let mut ranked: Vec<usize> = lit.iter().copied().filter(|&c| c >= 100).collect();
+    ranked.sort_by_key(|&c| std::cmp::Reverse(grid[c].truth_w));
+    println!("\n── TOP CONTENT CELLS (by Σ real COCA frequency = NARS truth weight) ──");
+    for &c in ranked.iter().take(12) {
+        println!(
+            " cell[{:>4}] '{:<12}' count={:>3} truth_w={}",
+            c,
+            vocab.word(c as u16),
+            grid[c].count,
+            grid[c].truth_w
+        );
+    }
+
+    let cell_bytes = std::mem::size_of::<Cell>();
+    println!(
+        "\n footprint: {} cells × {} B = {} KB (gridlake tier ✓); every cell a real COCA word",
+        GRID,
+        cell_bytes,
+        GRID * cell_bytes / 1024
+    );
+}