From 2f1df8d56c6182b9248e7cde10a82373f01e9bd9 Mon Sep 17 00:00:00 2001
From: Claude
Date: Sat, 4 Jul 2026 11:33:34 +0000
Subject: [PATCH 1/7] contract: transcode the Tesseract recoder load side (UnicharCompress)

New zero-dep module lance_graph_contract::unicharcompress -- the load side
of Tesseract's UnicharCompress (ccutil/unicharcompress.{h,cpp}), the LSTM
recognizer's recoded-code <-> unichar-id table. First binary-format leaf: a
little-endian TFile reader (u32 count + per-RecodedCharID
[i8 self_normalized][i32 length][i32*length code]), then ComputeCodeRange
(max+1) and the decode map (last-writer-wins on a shared code). Load side
only (DeSerialize + Encode/Decode/code_range); ComputeEncoding + beam-search
maps are deferred to training/recognizer leaves.

Byte-parity GREEN on real eng.lstm-recoder: encode 112/112 + decode 112/112
+ code_range=111 (examples/recoder_dump.rs {encode,decode} diffed vs a
libtesseract 5.3.4 oracle; the 1012-byte size = 4 + 112*9 was derived before
the parse). Strict where C++ is UB: rejects length > kMaxCodeLen(9) and
short buffers.

+10 unit tests; clippy -D warnings + fmt clean (-p lance-graph-contract).

Board: EPIPHANIES E-CPP-PARITY-7, LATEST_STATE contract inventory. Resolves
the OGAR #148 recoder=0x0802 concept to its content-store module.
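The wire layout named in this message can be sketched as a small standalone
reader. This is a minimal sketch under stated assumptions, not the module in
this patch: `serialized_size`, `take`, `read4` and `parse_recoder` are
hypothetical names, errors are reduced to `None`, and the size check assumes
every entry carries the same number of codes.

```rust
// Hypothetical standalone sketch of the recoder wire format:
// u32 count, then per entry [i8 self_normalized][i32 length][i32 * length code].
const MAX_CODE_LEN: usize = 9; // mirrors kMaxCodeLen(9) as stated above

/// Expected byte size for `count` entries that all carry `len` codes.
fn serialized_size(count: usize, len: usize) -> usize {
    4 + count * (1 + 4 + 4 * len)
}

/// Advance `pos` over `n` bytes, or return None if the buffer is short.
fn take<'a>(bytes: &'a [u8], pos: &mut usize, n: usize) -> Option<&'a [u8]> {
    let end = pos.checked_add(n)?;
    let slice = bytes.get(*pos..end)?;
    *pos = end;
    Some(slice)
}

/// Read four raw bytes; the caller picks the u32 or i32 interpretation.
fn read4(bytes: &[u8], pos: &mut usize) -> Option<[u8; 4]> {
    let s = take(bytes, pos, 4)?;
    Some([s[0], s[1], s[2], s[3]])
}

/// Parse the little-endian buffer into (self_normalized, codes) pairs.
/// Returns None on a short buffer or a length outside 0..=MAX_CODE_LEN.
fn parse_recoder(bytes: &[u8]) -> Option<Vec<(i8, Vec<i32>)>> {
    let mut pos = 0usize;
    let count = u32::from_le_bytes(read4(bytes, &mut pos)?);
    let mut entries = Vec::new();
    for _ in 0..count {
        let self_normalized = take(bytes, &mut pos, 1)?[0] as i8;
        let length = i32::from_le_bytes(read4(bytes, &mut pos)?);
        if length < 0 || length as usize > MAX_CODE_LEN {
            return None;
        }
        let mut codes = Vec::with_capacity(length as usize);
        for _ in 0..length {
            codes.push(i32::from_le_bytes(read4(bytes, &mut pos)?));
        }
        entries.push((self_normalized, codes));
    }
    Some(entries)
}

fn main() {
    // 112 length-1 entries, the shape reported above for eng.lstm-recoder.
    println!("expected size: {}", serialized_size(112, 1));

    // A hand-built buffer with two entries: [7] and [3, 4].
    let mut buf = Vec::new();
    buf.extend_from_slice(&2u32.to_le_bytes());
    buf.push(1);
    buf.extend_from_slice(&1i32.to_le_bytes());
    buf.extend_from_slice(&7i32.to_le_bytes());
    buf.push(1);
    buf.extend_from_slice(&2i32.to_le_bytes());
    buf.extend_from_slice(&3i32.to_le_bytes());
    buf.extend_from_slice(&4i32.to_le_bytes());
    println!("{:?}", parse_recoder(&buf));
}
```

The sketch grows its entry vector one push at a time instead of reserving
`count` slots up front, so a hostile count fails at the first short read
rather than through a large allocation. That is a choice made for the
illustration only.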
Co-Authored-By: Claude Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1 --- .claude/board/EPIPHANIES.md | 10 + .claude/board/LATEST_STATE.md | 6 + .../examples/recoder_dump.rs | 51 ++ crates/lance-graph-contract/src/lib.rs | 1 + .../src/unicharcompress.rs | 559 ++++++++++++++++++ 5 files changed, 627 insertions(+) create mode 100644 crates/lance-graph-contract/examples/recoder_dump.rs create mode 100644 crates/lance-graph-contract/src/unicharcompress.rs diff --git a/.claude/board/EPIPHANIES.md b/.claude/board/EPIPHANIES.md index e3d2c802..7aae6eaf 100644 --- a/.claude/board/EPIPHANIES.md +++ b/.claude/board/EPIPHANIES.md @@ -177,6 +177,16 @@ New knowledge doc `.claude/knowledge/data-shape-etymology.md` — the shape-and- **Status:** FINDING (operator ruling on the shape — "yes valueschema") + embedded CONJECTURE (the preset-vs-dispatch probe) Operator floated keeping the fast/cheap V2 substrate for huge data alongside V3, "switched by classid," so V3 can eventually teach V2 how to be better. Resolved: the switch is NOT a new carrier. `ClassView::value_schema(classid) -> ValueSchema` (`canonical_node.rs:894`, `class_view.rs:395`) is ALREADY classid→substrate-shape resolution by trait dispatch — resolved, never stored on-wire (adding a variant costs NO `ENVELOPE_LAYOUT_VERSION` bump), and the four existing variants ALREADY form a substrate ladder: `Bootstrap`(empty, key+edges only) / `Compressed`(cold codec, **no hot lifecycle columns**) / `Cognitive`(hot thinking: Meta+Qualia+Fingerprint+Energy+Plasticity+EntityType) / `Full`(every tenant). So "V2 fast/cheap bulk" = classids that resolve to the LEAN end (Bootstrap/Compressed — no ownership/lifecycle tenants); "V3 witnessed/owned" = Cognitive/Full. 
**A `ClassRoutingDTO` is rejected:** a DTO is a serialized carried payload, but substrate choice is a RESOLUTION (firewall ADR-022, "contracts compile types, the event never leaves"); and per the three-tier canon nothing crosses mailbox boundaries — every reader re-resolves the substrate from the classid already in the 16-byte key, so there is no boundary for a carrier to travel. `dto-soa-savant` + AGI-as-glove name the new-struct-instead-of-resolution shape exactly. **0x1000 is NOT the switch:** canon fixes it as a temporary adoption MONITOR ("monitor, never a semantic"; retires at P4/100%; MODULE-TABLE flags that a future canon==0x1000 aliases the marker) — substrate routes on the classid's concept-half → ValueSchema, never on the monitor bit. **The deep form (CONJECTURE — PROBE preset-vs-dispatch):** the WRITE PATH may be a pure FUNCTION of the schema — a class whose ValueSchema carries no ownership/lifecycle tenants has nothing for the kanban/WAL to witness, so it naturally collapses to the fast private-merge write; Cognitive/Full carry the tenants that REQUIRE the owned/witnessed path. If that holds, substrate = ValueSchema full stop (no separate `Substrate` enum, no flag). The gate: confirm the write path is derivable from which tenants are live vs needing an independent resolution — evidence base is the onebrc arc itself (lane F private-merge/no-tenants vs lanes G–J owned/witnessed = the two write paths already measured). Open sub-question: whether bulk needs a variant leaner than `Compressed`, or Bootstrap/Compressed already suffice. **"V3 teaches V2" (deferred, needs mechanism):** V3's kanban WAL + ownership journal is the profiling signal (where contention lands, which fields are touched) to optimize the lean V2 layout — the instrumented-teacher / stripped-student loop; no code reads the WAL back into a layout optimizer yet. Net: at most a new `ValueSchema` variant through the existing `value_schema(classid)` door; possibly not even that. 
+## 2026-07-04 — E-CPP-PARITY-7 — the UNICHARCOMPRESS (recoder) load side is byte-identical to libtesseract; the seventh leaf, and the FIRST binary-format transcode (`TFile` little-endian) +**Status:** FINDING (byte-parity proven vs libtesseract 5.3.4; in-contract, tested) + +The recoder (`ccutil/unicharcompress.{h,cpp}`) is the LSTM recognizer's code↔id table — the first non-UNICHARSET Core type and the first BINARY leaf (every prior leaf parsed text). `lance_graph_contract::unicharcompress::UnicharCompress` transcodes the load side only (`DeSerialize` → `from_le_bytes`; `EncodeUnichar`/`DecodeUnichar`/`code_range`); byte-parity GREEN on real `/tmp/eng.lstm-recoder` — encode 112/112 + decode 112/112 + code_range=111, via the committed `examples/recoder_dump.rs {encode,decode}` diffed against a libtesseract oracle. + +Two firsts + one correction: (1) FIRST binary format — `TFile` LE: `u32 count` + per-`RecodedCharID` `[i8 self_normalized][i32 length][i32×length code]`; the 1012-byte on-disk size = `4 + 112·9` was derived from the format BEFORE the parse (a first-principles pre-registration of correctness). (2) The 5.5.0-header / 5.3.4-lib ABI skew is a NEW object layout not covered by the UNICHARSET bijection: the oracle's `Encode∘Decode` round-trip (1 explained mismatch — ids 1,2 share code 110, last-wins `decode→2`) + `enc_size=112` self-validated the layout. (3) `kMaxCodeLen = 9` — the recoder-plan summary said 3; Hangul/Han USE length-3 but the array is sized 9. + +**Pattern holds (E-CPP-KEYSTONE-1).** A new Core type, but the SAME shape: content-store tier (zero-dep, rides the keystone), one `diff` per mode, no Core gap. +10 contract tests. Consumed by `tesseract-core::{Recoder, recoded_to_text}` (codes→decode→ids→`ids_to_text`; +1 boundary test, 8/8). 
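The last-wins round trip and the `code_range` rule noted above can be illustrated with a small standalone sketch. It is hedged throughout: `build_decoder`, `round_trip_mismatches` and `code_range` are hypothetical free functions over plain `Vec<i32>` codes, not the contract module's API, and the four-entry table is invented to reproduce the reported shape (ids 1 and 2 sharing code 110).

```rust
use std::collections::HashMap;

/// Build code -> id in ascending id order; a later id overwrites an earlier one.
fn build_decoder(encoder: &[Vec<i32>]) -> HashMap<Vec<i32>, usize> {
    let mut decoder = HashMap::new();
    for (id, codes) in encoder.iter().enumerate() {
        decoder.insert(codes.clone(), id);
    }
    decoder
}

/// Ids whose encode -> decode round trip does not return the same id.
fn round_trip_mismatches(encoder: &[Vec<i32>]) -> Vec<usize> {
    let decoder = build_decoder(encoder);
    encoder
        .iter()
        .enumerate()
        .filter(|(id, codes)| decoder.get(*codes) != Some(id))
        .map(|(id, _)| id)
        .collect()
}

/// 1 + the largest code value over every entry; 0 for an empty table.
fn code_range(encoder: &[Vec<i32>]) -> i32 {
    encoder.iter().flatten().fold(-1, |max, &c| max.max(c)) + 1
}

fn main() {
    // Invented table: ids 1 and 2 share code [110].
    let encoder = vec![vec![0], vec![110], vec![110], vec![5]];
    println!("decode([110]) = {}", build_decoder(&encoder)[&vec![110]]);
    println!("mismatched ids = {:?}", round_trip_mismatches(&encoder));
    println!("code_range = {}", code_range(&encoder));
}
```

Only the earlier of the two ids sharing a code fails the round trip, which matches the single explained mismatch recorded in this entry.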
The recoder keystone (`invoke_recoder`, the E-CPP-KEYSTONE-1 analog) is UNBLOCKED — OGAR #148 minted concept `recoder`=0x0802 (mirrored in `ogar_codebook`) — but DEFERRED: the `classid→ClassView→content` dispatch is already proven generically, so a recoder keystone would re-prove a pattern with no new byte-parity information. + +Routing re-verified LIVE against OGAR (not the plan's cached answers): SURREAL-AST-TRAP-PREFLIGHT 5Q (data-shaped table, zero lifecycle vocabulary → content-store is honest) + OGAR-AS-IR §3 (adds no `Class`/`ActionDef`/`KausalSpec` → rerouted to the content tier, NOT `emit_rust`). ndarray and `ruff_cpp_spo` were correctly NOT used: the recoder is zero-SIMD data, and `UnicharCompress`/`RecodedCharID` have no inheritance/vtable for the harvest to resolve. Cross-ref: `E-CPP-PARITY-1..6`, `E-CPP-KEYSTONE-1`, `.claude/knowledge/core-first-transcode-doctrine.md`, OGAR #148 (0x08 OCR mint). Branch `claude/happy-hamilton-0azlw4`, lance-graph + tesseract-rs. ## 2026-07-02 — E-1BRC-GRIDLAKE-SWEETSPOT-1: the 64×64 gridlake SoA is the measured sweet spot — the batch pipeline at tile scale equals the best streamed topology while carrying the double-WAL **Status:** FINDING (measured, onebrc-probe lane J t7; closes the operator's four follow-up questions and the t4→t7 kanban-update arc) diff --git a/.claude/board/LATEST_STATE.md b/.claude/board/LATEST_STATE.md index d90788f9..353f1672 100644 --- a/.claude/board/LATEST_STATE.md +++ b/.claude/board/LATEST_STATE.md @@ -10,6 +10,12 @@ --- +## 2026-07-04 — branch `claude/happy-hamilton-0azlw4` — `contract::unicharcompress` — the Tesseract recoder load side (byte-parity vs libtesseract) + +**NEW** `lance_graph_contract::unicharcompress`: `UnicharCompress` (the LSTM recoder's code↔id table) + `RecodedCharId` + `RecoderError`, load side only (`from_le_bytes` / `load_from_file` = C++ `DeSerialize`; `encode` / `decode` / `code_range`; `dump_encode` / `dump_decode` parity surfaces). 
The FIRST binary-format leaf (`TFile` little-endian: `u32 count` + per-entry `[i8 self_normalized][i32 length][i32×length code]`). Byte-parity **GREEN** on real `/tmp/eng.lstm-recoder` — encode 112/112 + decode 112/112 + code_range=111 — via the committed `examples/recoder_dump.rs`, diffed vs a libtesseract 5.3.4 oracle (the 5.5.0-header ABI skew self-validated by the `Encode∘Decode` round-trip + `enc_size=112`). +10 contract tests; `-p lance-graph-contract` clippy `-D warnings` + fmt clean. Consumed by `tesseract-core::{Recoder, recoded_to_text}` (codes→decode→ids→`ids_to_text`; +1 boundary test, 8/8). Resolves the `recoder`=0x0802 concept (OGAR #148 mint, mirrored in the "0x08XX OCR rows" line below) to its content-store module. The recoder keystone (`invoke_recoder`) is UNBLOCKED but deferred (dispatch already proven generically by E-CPP-KEYSTONE-1). Refs: EPIPHANIES `E-CPP-PARITY-7`. Not yet a PR. + +--- + ## 2026-06-23 — IN PR (`claude/medcare-bridge-lance-graph-wmx76z`) — ActionHandler⟷RBAC⟷orchestration spine `contract::rbac`: `ScopeSpec` (axis-3 Copy token) + `ClassRbac` §4 default methods (`roles_reaching`/`row_scope`/`field_mask`; backward-compat, probe green). `contract::class_view::FieldMask::union`. `contract::action::ActionInvocation::commit_via` (no-admin-bypass convergence of the inline gate). `lance-graph-rbac::{authorize_scoped, ScopedDecision}` (§5 two-stage). `lance-graph-ogar::{OgarRbac, GrantSource}` (Q5 local newtype, §6 evaporation seam). rs-graph-llm: `graph-flow-kanban::{run_cycle, CycleOutcome}` + `graph-flow-action::dispatch_via`. Plan: integration-actionhandler-rbac-orchestration-v1. diff --git a/crates/lance-graph-contract/examples/recoder_dump.rs b/crates/lance-graph-contract/examples/recoder_dump.rs new file mode 100644 index 00000000..1f2eed26 --- /dev/null +++ b/crates/lance-graph-contract/examples/recoder_dump.rs @@ -0,0 +1,51 @@ +//! Dump a `.lstm-recoder`'s encoder table (`encode`) or decode round-trip +//! 
(`decode`) — the Rust side of the recoder byte-parity leaf, sibling to
+//! `unicharset_dump`.
+//!
+//! ```sh
+//! # on a box with libtesseract + libleptonica installed:
+//! combine_tessdata -u $(dpkg -L tesseract-ocr-eng | grep eng.traineddata) /tmp/eng.
+//! # C++ oracle (recoder_oracle.cpp): loads the SAME component via TFile and dumps
+//! # EncodeUnichar / DecodeUnichar / code_range. It also prints, per id, the
+//! # UNICHARSET bijection + an Encode.Decode round-trip so the NEW UnicharCompress
+//! # object layout self-validates against the 5.5.0-header / 5.3.4-lib ABI skew.
+//! # ./recoder_oracle /tmp/eng.lstm-unicharset /tmp/eng.lstm-recoder encode > /tmp/oracle_recoder_encode.tsv
+//! # ./recoder_oracle /tmp/eng.lstm-unicharset /tmp/eng.lstm-recoder decode > /tmp/oracle_recoder_decode.tsv
+//! # Rust side:
+//! cargo run -p lance-graph-contract --example recoder_dump -- /tmp/eng.lstm-recoder encode > /tmp/rust_recoder_encode.tsv
+//! cargo run -p lance-graph-contract --example recoder_dump -- /tmp/eng.lstm-recoder decode > /tmp/rust_recoder_decode.tsv
+//! diff /tmp/oracle_recoder_encode.tsv /tmp/rust_recoder_encode.tsv \
+//!   && diff /tmp/oracle_recoder_decode.tsv /tmp/rust_recoder_decode.tsv
+//! # both byte-identical => the recoder load-side is byte-parity green
+//!
```
+
+#![allow(
+    clippy::print_stdout,
+    reason = "a dump CLI example writes to stdout by design"
+)]
+
+use std::path::Path;
+use std::process::ExitCode;
+
+use lance_graph_contract::unicharcompress::UnicharCompress;
+
+fn main() -> ExitCode {
+    let Some(path) = std::env::args().nth(1) else {
+        eprintln!("usage: recoder_dump [encode|decode]");
+        return ExitCode::FAILURE;
+    };
+    let mode = std::env::args().nth(2).unwrap_or_default();
+    match UnicharCompress::load_from_file(Path::new(&path)) {
+        Ok(recoder) => {
+            match mode.as_str() {
+                "decode" => print!("{}", recoder.dump_decode()),
+                _ => print!("{}", recoder.dump_encode()),
+            }
+            ExitCode::SUCCESS
+        }
+        Err(err) => {
+            eprintln!("error: {err}");
+            ExitCode::FAILURE
+        }
+    }
+}
diff --git a/crates/lance-graph-contract/src/lib.rs b/crates/lance-graph-contract/src/lib.rs
index 12b9101c..f206e7aa 100644
--- a/crates/lance-graph-contract/src/lib.rs
+++ b/crates/lance-graph-contract/src/lib.rs
@@ -133,6 +133,7 @@ pub mod tax;
 pub mod tenant_counter;
 pub mod thinking;
 pub mod unichar;
+pub mod unicharcompress;
 pub mod unicharset;
 pub mod unicharset_adapter;
 pub mod view_angle;
diff --git a/crates/lance-graph-contract/src/unicharcompress.rs b/crates/lance-graph-contract/src/unicharcompress.rs
new file mode 100644
index 00000000..a2e32c3f
--- /dev/null
+++ b/crates/lance-graph-contract/src/unicharcompress.rs
@@ -0,0 +1,559 @@
+//! `UNICHARCOMPRESS` (the recoder) content store — the Rust side of the recoder
+//! byte-parity leaf, sibling to [`crate::unicharset`].
+//!
+//! Tesseract's `UnicharCompress` (`ccutil/unicharcompress.{h,cpp}`) re-encodes
+//! each unichar-id as a short sequence of small codes (Han radical-stroke,
+//! Hangul Jamo, ligature dissection; pass-through for simple scripts). The LSTM
+//! recognizer's output lattice speaks these **recoded codes, not raw
+//! unichar-ids**, so `ids_to_text` only becomes real OCR output once the decode
+//! table exists.
Per the Core-First doctrine this is a **classid-keyed
+//! content-store tier** (a loaded codec table — id ↔ code-sequence bijection +
+//! bounds), exactly like [`crate::unicharset::UniCharSet`]: data-shaped, no
+//! lifecycle vocabulary, no effects. It rides the existing keystone; it is NOT
+//! IR-surface (`docs/OGAR-AS-IR.md` §3: adds no `Class` field, no `ActionDef`,
+//! no `KausalSpec` slot).
+//!
+//! # Load-side scope
+//!
+//! This module transcodes the **load side only** — `DeSerialize` +
+//! `EncodeUnichar` + `DecodeUnichar` + `code_range` (the recognizer runtime
+//! surface). `ComputeEncoding` (the training-side table builder) is out of
+//! scope. `SetupDecoder`'s beam-search maps (`is_valid_start_` / `next_codes_` /
+//! `final_codes_`, `unicharcompress.cpp:396-434`) are the recognizer's, not the
+//! decode table's — they are deferred to the recognizer leaf; only the
+//! `decoder_` map (code → id) is built here.
+//!
+//! # Binary format (byte-parity surface)
+//!
+//! Every prior leaf parsed text; the recoder is **binary** (`serialis.h` `TFile`
+//! conventions). `UnicharCompress::Serialize` writes exactly the `encoder_`
+//! vector (`unicharcompress.cpp:318-320`, comment `unicharcompress.h:229`: "the
+//! only part that is serialized. The rest is computed on load"). The wire form
+//! (little-endian; `TFile::swap_ == false` on x86) is:
+//!
+//! ```text
+//! u32 count                 // TFile::DeSerialize(vector), serialis.h:90
+//! count × RecodedCharID:
+//!   i8  self_normalized     // RecodedCharID::DeSerialize, unicharcompress.h:75
+//!   i32 length              // number of codes in use (<= kMaxCodeLen=9)
+//!   i32 × length code       // only `length` codes are written, not all 9
+//! ```
+//!
+//! For real `eng.lstm-recoder` (112 pass-through entries, all length-1):
+//! `4 + 112·(1+4+4) = 1012` bytes — the exact on-disk size, a first-principles
+//! pre-registration of a correct parse. On load, `ComputeCodeRange`
+//!
(`unicharcompress.cpp:383`, `max(code)+1`) and the `decoder_` map
+//! (`unicharcompress.cpp:400-402`, `decoder_[code]=id` in ascending-id order, so
+//! **last writer wins** on a shared code) are recomputed.
+//!
+//! [`UnicharCompress::dump_encode`] / [`UnicharCompress::dump_decode`] are the
+//! byte-parity surfaces, diffed against the C++ `UnicharCompress` oracle
+//! (`recoder_oracle.cpp`, which links libtesseract, loads the same component via
+//! `TFile`, and dumps `EncodeUnichar` / `DecodeUnichar` / `code_range`). The
+//! oracle's `Encode∘Decode` round-trip + the `UNICHARSET` bijection guard the
+//! 5.5.0-header / 5.3.4-lib ABI skew for this NEW object layout.
+//!
+//! # Strict-vs-lenient
+//!
+//! C++ `RecodedCharID::DeSerialize` reads `length` then reads that many `i32`
+//! into the fixed `code_[9]` — a buffer overflow (UB) if `length > 9` on hostile
+//! input. This reader instead rejects `length < 0 || length > kMaxCodeLen`
+//! ([`RecoderError::BadCodeLength`]) and a truncated buffer
+//! ([`RecoderError::UnexpectedEof`]). On well-formed trained data (`length` is
+//! always 1..=3) the byte-parity diff is unaffected; the guard only fires on
+//! corruption.
+
+use std::collections::HashMap;
+use std::hash::{Hash, Hasher};
+use std::path::Path;
+
+/// `RecodedCharID::kMaxCodeLen` (tesseract `unicharcompress.h:35`) — the fixed
+/// capacity of a code array. Hangul/Han use length 3; the array is sized 9.
+const K_MAX_CODE_LEN: usize = 9;
+
+/// The C++ `INVALID_UNICHAR_ID` sentinel (tesseract `unichar.h`) — what
+/// [`UnicharCompress::decode`] returns for a code with no matching id, mirroring
+/// `DecodeUnichar` (`unicharcompress.cpp:305-315`).
+const INVALID_UNICHAR_ID: i32 = -1;
+
+/// The `TFile::DeSerialize(vector)` sanity cap (tesseract `serialis.h:96`):
+/// a declared element count above this is treated as corrupt input.
+const MAX_ELEMENTS: u32 = 50_000_000;
+
+/// The code sequence for one recoded unichar-id — the transcription of
+/// tesseract's `RecodedCharID` (`unicharcompress.h:32-109`).
+///
+/// Equality and hashing mirror the C++ `operator==` / `RecodedCharIDHash`
+/// (`unicharcompress.h:79-99`): **only `length` + the used `code[0..length]`
+/// participate**; `self_normalized` and any trailing array slots are ignored, so
+/// this is a sound [`HashMap`] key for the decoder (`decoder_[code]`).
+#[derive(Debug, Clone)]
+pub struct RecodedCharId {
+    /// True (`1`) if this is the master entry for ids sharing one code; stored as
+    /// `i8` for serialization (`unicharcompress.h:104`). Preserved on load for
+    /// round-trip fidelity; not part of identity.
+    self_normalized: i8,
+    /// The number of codes in use in `code` (`unicharcompress.h:106`).
+    length: i32,
+    /// The re-encoded form (`unicharcompress.h:108`). Only `code[0..length]` is
+    /// meaningful; trailing slots are `0`.
+    code: [i32; K_MAX_CODE_LEN],
+}
+
+impl Default for RecodedCharId {
+    /// Mirrors the C++ default ctor (`unicharcompress.h:37`): `self_normalized =
+    /// 1`, `length = 0`, all codes `0`.
+    fn default() -> Self {
+        Self {
+            self_normalized: 1,
+            length: 0,
+            code: [0; K_MAX_CODE_LEN],
+        }
+    }
+}
+
+impl RecodedCharId {
+    /// The codes in use — `code[0..length]`. The only bytes that carry identity.
+    #[must_use]
+    pub fn codes(&self) -> &[i32] {
+        let len = self.length.max(0) as usize;
+        // `length` is bounded to `<= K_MAX_CODE_LEN` at load; `min` keeps this
+        // total even for a hand-built value.
+        &self.code[..len.min(K_MAX_CODE_LEN)]
+    }
+
+    /// The number of codes in use (the C++ `length()`, `unicharcompress.h:62`).
+    #[must_use]
+    pub fn length(&self) -> i32 {
+        self.length
+    }
+
+    /// Whether this code is empty (`length == 0`), the C++ `empty()`
+    /// (`unicharcompress.h:58`).
+    #[must_use]
+    pub fn is_empty(&self) -> bool {
+        self.length == 0
+    }
+
+    /// Whether this is the self-normalizing master entry (`unicharcompress.h:104`).
+    #[must_use]
+    pub fn self_normalized(&self) -> bool {
+        self.self_normalized != 0
+    }
+
+    /// Read one `RecodedCharID` from the little-endian cursor. Rejects a
+    /// `length` outside `0..=kMaxCodeLen` (the C++ UB guard) and a short buffer.
+    fn read(r: &mut ByteReader<'_>) -> Result<Self, RecoderError> {
+        let self_normalized = r.read_i8()?;
+        let length = r.read_i32()?;
+        if length < 0 || length as usize > K_MAX_CODE_LEN {
+            return Err(RecoderError::BadCodeLength(length));
+        }
+        let mut code = [0_i32; K_MAX_CODE_LEN];
+        for slot in code.iter_mut().take(length as usize) {
+            *slot = r.read_i32()?;
+        }
+        Ok(Self {
+            self_normalized,
+            length,
+            code,
+        })
+    }
+}
+
+impl PartialEq for RecodedCharId {
+    /// `operator==` (`unicharcompress.h:79-89`): compares `length` +
+    /// `code[0..length]` only.
+    fn eq(&self, other: &Self) -> bool {
+        self.codes() == other.codes()
+    }
+}
+
+impl Eq for RecodedCharId {}
+
+impl Hash for RecodedCharId {
+    /// Consistent with [`PartialEq`]: hash the used codes only. (The C++
+    /// `RecodedCharIDHash` folds the same `code[0..length]`; the Rust hasher need
+    /// only agree with `eq`, not reproduce the C++ bit-mix.)
+    fn hash<H: Hasher>(&self, state: &mut H) {
+        self.codes().hash(state);
+    }
+}
+
+/// A loaded `UnicharCompress` (the recoder): the `encoder_` table (id → codes),
+/// its inverse `decoder_` (codes → id), and `code_range` — the transcription of
+/// tesseract's `UnicharCompress` load side (`unicharcompress.{h,cpp}`).
+#[derive(Debug, Clone, Default)]
+pub struct UnicharCompress {
+    /// id → code sequence (index IS the unichar-id). The only serialized part
+    /// (`unicharcompress.h:229-230`).
+    encoder: Vec<RecodedCharId>,
+    /// code → unichar-id, recomputed on load (`SetupDecoder`,
+    /// `unicharcompress.cpp:400-402`). Last-writer-wins on a shared code.
+    decoder: HashMap<RecodedCharId, u32>,
+    /// `1 + max code value` (`ComputeCodeRange`, `unicharcompress.cpp:383-393`);
+    /// the lattice width. `0` for an empty encoder (`-1 + 1`).
+    code_range: i32,
+}
+
+impl UnicharCompress {
+    /// Load a recoder from the raw little-endian bytes of a `.lstm-recoder`
+    /// component (the C++ `DeSerialize`, `unicharcompress.cpp:323-330`): read the
+    /// `encoder_` vector, then recompute `code_range` and the decode map.
+    ///
+    /// # Errors
+    ///
+    /// [`RecoderError::UnexpectedEof`] on a truncated buffer,
+    /// [`RecoderError::TooManyElements`] if the declared count exceeds the
+    /// `serialis.h` sanity cap, and [`RecoderError::BadCodeLength`] if any entry
+    /// declares a code length outside `0..=9`.
+    pub fn from_le_bytes(bytes: &[u8]) -> Result<Self, RecoderError> {
+        let mut r = ByteReader::new(bytes);
+        let count = r.read_u32()?;
+        if count > MAX_ELEMENTS {
+            return Err(RecoderError::TooManyElements(count));
+        }
+        let mut encoder = Vec::with_capacity(count as usize);
+        for _ in 0..count {
+            encoder.push(RecodedCharId::read(&mut r)?);
+        }
+        // Trailing bytes are ignored on purpose: a component extracted from a
+        // TFile stream may be followed by the next component's bytes (the C++
+        // reader leaves the cursor for them). A standalone `.lstm-recoder` is
+        // consumed exactly.
+        let mut this = Self {
+            encoder,
+            decoder: HashMap::new(),
+            code_range: 0,
+        };
+        this.compute_code_range();
+        this.setup_decoder();
+        Ok(this)
+    }
+
+    /// Load a recoder from a `.lstm-recoder` file (a thin wrapper over
+    /// [`Self::from_le_bytes`]). Extract one via
+    /// `combine_tessdata -u eng.traineddata /tmp/eng.`.
+    ///
+    /// # Errors
+    ///
+    /// [`RecoderError::Io`] if the file cannot be read, else the parse errors of
+    /// [`Self::from_le_bytes`].
+    pub fn load_from_file(path: &Path) -> Result<Self, RecoderError> {
+        let bytes = std::fs::read(path).map_err(|e| RecoderError::Io(e.to_string()))?;
+        Self::from_le_bytes(&bytes)
+    }
+
+    /// `1 + max code value` — the lattice width (`code_range`,
+    /// `unicharcompress.h:171`).
+    #[must_use]
+    pub fn code_range(&self) -> i32 {
+        self.code_range
+    }
+
+    /// The number of encoded unichar-ids (`encoder_.size()`).
+    #[must_use]
+    pub fn len(&self) -> usize {
+        self.encoder.len()
+    }
+
+    /// Whether the encoder is empty.
+    #[must_use]
+    pub fn is_empty(&self) -> bool {
+        self.encoder.is_empty()
+    }
+
+    /// The code sequence for `unichar_id`, or `None` if out of range — the C++
+    /// `EncodeUnichar` (`unicharcompress.cpp:295-301`; a `None` here is the C++
+    /// return of length `0`).
+    #[must_use]
+    pub fn encode(&self, unichar_id: u32) -> Option<&RecodedCharId> {
+        self.encoder.get(unichar_id as usize)
+    }
+
+    /// The unichar-id for `code`, or [`INVALID_UNICHAR_ID`] (`-1`) if the code is
+    /// ill-formed or unknown — the C++ `DecodeUnichar`
+    /// (`unicharcompress.cpp:305-315`).
+    #[must_use]
+    pub fn decode(&self, code: &RecodedCharId) -> i32 {
+        let len = code.length();
+        if len <= 0 || len as usize > K_MAX_CODE_LEN {
+            return INVALID_UNICHAR_ID;
+        }
+        self.decoder
+            .get(code)
+            .map_or(INVALID_UNICHAR_ID, |&id| id as i32)
+    }
+
+    /// `ComputeCodeRange` (`unicharcompress.cpp:383-393`): `code_range = 1 + max`
+    /// code value over every position of every entry (`0` for an empty encoder).
+    fn compute_code_range(&mut self) {
+        let mut max = -1_i32;
+        for entry in &self.encoder {
+            for &c in entry.codes() {
+                if c > max {
+                    max = c;
+                }
+            }
+        }
+        self.code_range = max + 1;
+    }
+
+    /// The decode-map half of `SetupDecoder` (`unicharcompress.cpp:400-402`):
+    /// `decoder_[encoder_[id]] = id` in ascending id order, so **last writer
+    /// wins** when two ids share a code. The beam-search maps are the
+    /// recognizer's and are not built here (see module docs).
+    fn setup_decoder(&mut self) {
+        self.decoder.clear();
+        self.decoder.reserve(self.encoder.len());
+        for (id, code) in self.encoder.iter().enumerate() {
+            self.decoder.insert(code.clone(), id as u32);
+        }
+    }
+
+    /// Render the id→code table as `"<id>\t<length>\t<code>[,<code>...]\n"` lines — the
+    /// exact shape the C++ recoder oracle's `encode` mode prints, so the
+    /// byte-parity diff is `diff oracle_recoder_encode.tsv rust_recoder_encode.tsv`.
+    #[must_use]
+    pub fn dump_encode(&self) -> String {
+        let mut out = String::new();
+        for (id, entry) in self.encoder.iter().enumerate() {
+            out.push_str(&id.to_string());
+            out.push('\t');
+            out.push_str(&entry.length().to_string());
+            out.push('\t');
+            for (i, &c) in entry.codes().iter().enumerate() {
+                if i > 0 {
+                    out.push(',');
+                }
+                out.push_str(&c.to_string());
+            }
+            out.push('\n');
+        }
+        out
+    }
+
+    /// Render `"code_range\t<n>\n"` then `"<id>\t<decoded>\n"` lines (where
+    /// `decoded = decode(encode(id))`) — the exact shape the C++ recoder oracle's
+    /// `decode` mode prints, so the byte-parity diff is
+    /// `diff oracle_recoder_decode.tsv rust_recoder_decode.tsv`. On a shared code
+    /// the decoded id is the last-writer, matching the C++ map.
+    #[must_use]
+    pub fn dump_decode(&self) -> String {
+        let mut out = String::new();
+        out.push_str("code_range\t");
+        out.push_str(&self.code_range.to_string());
+        out.push('\n');
+        for (id, entry) in self.encoder.iter().enumerate() {
+            out.push_str(&id.to_string());
+            out.push('\t');
+            out.push_str(&self.decode(entry).to_string());
+            out.push('\n');
+        }
+        out
+    }
+}
+
+/// A little-endian byte cursor over the recoder component — the reader half of
+/// the `TFile` primitives this leaf needs (`FReadEndian` with `swap_ == false`).
+struct ByteReader<'a> {
+    bytes: &'a [u8],
+    pos: usize,
+}
+
+impl<'a> ByteReader<'a> {
+    fn new(bytes: &'a [u8]) -> Self {
+        Self { bytes, pos: 0 }
+    }
+
+    /// Advance over `n` bytes, or [`RecoderError::UnexpectedEof`] if short.
+    fn take(&mut self, n: usize) -> Result<&'a [u8], RecoderError> {
+        let end = self.pos.checked_add(n).ok_or(RecoderError::UnexpectedEof)?;
+        let slice = self
+            .bytes
+            .get(self.pos..end)
+            .ok_or(RecoderError::UnexpectedEof)?;
+        self.pos = end;
+        Ok(slice)
+    }
+
+    fn read_i8(&mut self) -> Result<i8, RecoderError> {
+        Ok(self.take(1)?[0] as i8)
+    }
+
+    fn read_u32(&mut self) -> Result<u32, RecoderError> {
+        let arr: [u8; 4] = self
+            .take(4)?
+            .try_into()
+            .map_err(|_| RecoderError::UnexpectedEof)?;
+        Ok(u32::from_le_bytes(arr))
+    }
+
+    fn read_i32(&mut self) -> Result<i32, RecoderError> {
+        let arr: [u8; 4] = self
+            .take(4)?
+            .try_into()
+            .map_err(|_| RecoderError::UnexpectedEof)?;
+        Ok(i32::from_le_bytes(arr))
+    }
+}
+
+/// A failure loading a `UnicharCompress` (recoder).
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub enum RecoderError {
+    /// The buffer ended mid-field.
+    UnexpectedEof,
+    /// The declared element count exceeded the `serialis.h` sanity cap.
+    TooManyElements(u32),
+    /// A `RecodedCharID` declared a code length outside `0..=9` (the C++ fixed
+    /// array capacity `kMaxCodeLen`).
+    BadCodeLength(i32),
+    /// The file could not be read (message from the underlying I/O error).
+    Io(String),
+}
+
+impl std::fmt::Display for RecoderError {
+    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+        match self {
+            Self::UnexpectedEof => write!(f, "recoder buffer ended mid-field"),
+            Self::TooManyElements(n) => {
+                write!(
+                    f,
+                    "recoder declared {n} elements (over the {MAX_ELEMENTS} cap)"
+                )
+            }
+            Self::BadCodeLength(len) => {
+                write!(
+                    f,
+                    "recoded code length {len} out of range 0..={K_MAX_CODE_LEN}"
+                )
+            }
+            Self::Io(msg) => write!(f, "recoder read failed: {msg}"),
+        }
+    }
+}
+
+impl std::error::Error for RecoderError {}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    /// Build a `.lstm-recoder` byte buffer from `(self_normalized, codes)`
+    /// entries, in the exact little-endian wire form the C++ `Serialize` writes.
+    fn build(entries: &[(i8, &[i32])]) -> Vec<u8> {
+        let mut b = Vec::new();
+        b.extend_from_slice(&u32::try_from(entries.len()).unwrap().to_le_bytes());
+        for (self_norm, codes) in entries {
+            b.push(*self_norm as u8);
+            b.extend_from_slice(&i32::try_from(codes.len()).unwrap().to_le_bytes());
+            for &c in *codes {
+                b.extend_from_slice(&c.to_le_bytes());
+            }
+        }
+        b
+    }
+
+    #[test]
+    fn parses_count_and_entries() {
+        let bytes = build(&[(1, &[0]), (1, &[5]), (1, &[5])]);
+        let rec = UnicharCompress::from_le_bytes(&bytes).expect("valid");
+        assert_eq!(rec.len(), 3);
+        assert_eq!(rec.encode(0).unwrap().codes(), &[0]);
+        assert_eq!(rec.encode(2).unwrap().codes(), &[5]);
+        assert!(rec.encode(3).is_none(), "out-of-range id -> None");
+    }
+
+    #[test]
+    fn code_range_is_max_plus_one() {
+        // max code value 5 -> code_range 6.
+        let rec = UnicharCompress::from_le_bytes(&build(&[(1, &[0]), (1, &[5]), (1, &[3])]))
+            .expect("valid");
+        assert_eq!(rec.code_range(), 6);
+        // Empty encoder -> -1 + 1 = 0 (matches ComputeCodeRange's seed).
+        let empty = UnicharCompress::from_le_bytes(&build(&[])).expect("valid");
+        assert_eq!(empty.code_range(), 0);
+    }
+
+    #[test]
+    fn decode_is_last_writer_wins_on_shared_code() {
+        // ids 1 and 2 both encode to code [5]; decoder keeps the last (id 2) —
+        // exactly the eng.lstm-recoder id1/id2 -> code 110 case.
+        let rec = UnicharCompress::from_le_bytes(&build(&[(1, &[0]), (1, &[5]), (1, &[5])]))
+            .expect("valid");
+        assert_eq!(rec.decode(rec.encode(0).unwrap()), 0);
+        assert_eq!(
+            rec.decode(rec.encode(1).unwrap()),
+            2,
+            "shared code -> last id"
+        );
+        assert_eq!(rec.decode(rec.encode(2).unwrap()), 2);
+    }
+
+    #[test]
+    fn decode_unknown_or_illformed_is_invalid() {
+        let rec = UnicharCompress::from_le_bytes(&build(&[(1, &[0])])).expect("valid");
+        // An empty code (length 0) is ill-formed for decode.
+        assert_eq!(rec.decode(&RecodedCharId::default()), INVALID_UNICHAR_ID);
+    }
+
+    #[test]
+    fn equality_ignores_self_normalized_and_trailing() {
+        // Same code, different self_normalized -> equal (C++ operator==).
+        let a = UnicharCompress::from_le_bytes(&build(&[(1, &[7])])).expect("valid");
+        let b = UnicharCompress::from_le_bytes(&build(&[(0, &[7])])).expect("valid");
+        assert_eq!(a.encode(0).unwrap(), b.encode(0).unwrap());
+    }
+
+    #[test]
+    fn dump_encode_matches_oracle_shape() {
+        // A multi-code entry exercises the comma join.
+        let rec = UnicharCompress::from_le_bytes(&build(&[(1, &[0]), (1, &[5]), (1, &[1, 2, 3])]))
+            .expect("valid");
+        assert_eq!(rec.dump_encode(), "0\t1\t0\n1\t1\t5\n2\t3\t1,2,3\n");
+    }
+
+    #[test]
+    fn dump_decode_matches_oracle_shape() {
+        let rec = UnicharCompress::from_le_bytes(&build(&[(1, &[0]), (1, &[5]), (1, &[5])]))
+            .expect("valid");
+        // code_range = 6; id1 decodes to 2 (last-writer on shared code [5]).
+        assert_eq!(rec.dump_decode(), "code_range\t6\n0\t0\n1\t2\n2\t2\n");
+    }
+
+    #[test]
+    fn truncated_buffer_errors() {
+        let mut bytes = build(&[(1, &[0])]);
+        bytes.pop(); // drop the last code byte
+        assert_eq!(
+            UnicharCompress::from_le_bytes(&bytes).unwrap_err(),
+            RecoderError::UnexpectedEof
+        );
+        // A count with no entries at all.
+        assert_eq!(
+            UnicharCompress::from_le_bytes(&[3, 0, 0, 0]).unwrap_err(),
+            RecoderError::UnexpectedEof
+        );
+    }
+
+    #[test]
+    fn bad_code_length_errors() {
+        // count=1, self_norm=1, length=10 (> kMaxCodeLen) — the C++ UB case.
+        let mut bytes = vec![1, 0, 0, 0, 1];
+        bytes.extend_from_slice(&10_i32.to_le_bytes());
+        assert_eq!(
+            UnicharCompress::from_le_bytes(&bytes).unwrap_err(),
+            RecoderError::BadCodeLength(10)
+        );
+    }
+
+    #[test]
+    fn too_many_elements_errors() {
+        // A declared count over the cap fails fast without allocating.
+        let bytes = (MAX_ELEMENTS + 1).to_le_bytes();
+        assert_eq!(
+            UnicharCompress::from_le_bytes(&bytes).unwrap_err(),
+            RecoderError::TooManyElements(MAX_ELEMENTS + 1)
+        );
+    }
+}
From ba5ce72f9a91aabad6f60643fa81f684d76bab20 Mon Sep 17 00:00:00 2001
From: Claude
Date: Sat, 4 Jul 2026 12:45:42 +0000
Subject: [PATCH 2/7] board: E-OCR-COMPUTE-NDARRAY-SEAM-1 -- recognizer int8 path maps 1:1 onto ndarray

CONJECTURE (design-pass finding; byte-parity probe = recognizer Leaf 1). The OCR recognizer is COMPUTE (dense int8 GEMM), not content -- it consumes ndarray's existing matmul_i8_to_i32 / quantize / dequantize with no Core gap. int8->i32 is exact + bit-reproducible across AMX/VNNI/scalar. Corrects the "OCR is ndarray-free" framing. Cross-ref E-CPP-PARITY-7, the recognizer plan.

Co-Authored-By: Claude
Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
---
 .claude/board/EPIPHANIES.md | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/.claude/board/EPIPHANIES.md b/.claude/board/EPIPHANIES.md
index 7aae6eaf..21344c47 100644
--- a/.claude/board/EPIPHANIES.md
+++ b/.claude/board/EPIPHANIES.md
@@ -177,6 +177,14 @@ New knowledge doc `.claude/knowledge/data-shape-etymology.md` — the shape-and-
 **Status:** FINDING (operator ruling on the shape — "yes valueschema") + embedded CONJECTURE (the preset-vs-dispatch probe)
 
 Operator floated keeping the fast/cheap V2 substrate for huge data alongside V3, "switched by classid," so V3 can eventually teach V2 how to be better. Resolved: the switch is NOT a new carrier.
`ClassView::value_schema(classid) -> ValueSchema` (`canonical_node.rs:894`, `class_view.rs:395`) is ALREADY classid→substrate-shape resolution by trait dispatch — resolved, never stored on-wire (adding a variant costs NO `ENVELOPE_LAYOUT_VERSION` bump), and the four existing variants ALREADY form a substrate ladder: `Bootstrap`(empty, key+edges only) / `Compressed`(cold codec, **no hot lifecycle columns**) / `Cognitive`(hot thinking: Meta+Qualia+Fingerprint+Energy+Plasticity+EntityType) / `Full`(every tenant). So "V2 fast/cheap bulk" = classids that resolve to the LEAN end (Bootstrap/Compressed — no ownership/lifecycle tenants); "V3 witnessed/owned" = Cognitive/Full. **A `ClassRoutingDTO` is rejected:** a DTO is a serialized carried payload, but substrate choice is a RESOLUTION (firewall ADR-022, "contracts compile types, the event never leaves"); and per the three-tier canon nothing crosses mailbox boundaries — every reader re-resolves the substrate from the classid already in the 16-byte key, so there is no boundary for a carrier to travel. `dto-soa-savant` + AGI-as-glove name the new-struct-instead-of-resolution shape exactly. **0x1000 is NOT the switch:** canon fixes it as a temporary adoption MONITOR ("monitor, never a semantic"; retires at P4/100%; MODULE-TABLE flags that a future canon==0x1000 aliases the marker) — substrate routes on the classid's concept-half → ValueSchema, never on the monitor bit. **The deep form (CONJECTURE — PROBE preset-vs-dispatch):** the WRITE PATH may be a pure FUNCTION of the schema — a class whose ValueSchema carries no ownership/lifecycle tenants has nothing for the kanban/WAL to witness, so it naturally collapses to the fast private-merge write; Cognitive/Full carry the tenants that REQUIRE the owned/witnessed path. If that holds, substrate = ValueSchema full stop (no separate `Substrate` enum, no flag). 
The gate: confirm the write path is derivable from which tenants are live vs needing an independent resolution — evidence base is the onebrc arc itself (lane F private-merge/no-tenants vs lanes G–J owned/witnessed = the two write paths already measured). Open sub-question: whether bulk needs a variant leaner than `Compressed`, or Bootstrap/Compressed already suffice. **"V3 teaches V2" (deferred, needs mechanism):** V3's kanban WAL + ownership journal is the profiling signal (where contention lands, which fields are touched) to optimize the lean V2 layout — the instrumented-teacher / stripped-student loop; no code reads the WAL back into a layout optimizer yet. Net: at most a new `ValueSchema` variant through the existing `value_schema(classid)` door; possibly not even that. +## 2026-07-04 — E-OCR-COMPUTE-NDARRAY-SEAM-1 — the OCR recognizer's int8 hot path maps 1:1 onto ndarray's existing `matmul_i8_to_i32`; no Core gap, and int8→i32 is bit-reproducible across every SIMD tier +**Status:** CONJECTURE (design-pass finding; the byte-parity probe is recognizer Leaf 1, not yet run). Corrects the earlier "OCR transcode is ndarray-free" framing (operator sanity check: *OCR without hardware acceleration isn't smart*). + +The recoder/unicharset leaves were codec TABLES (correctly zero-dep content tier); the RECOGNIZER is COMPUTE — Tesseract's LSTM forward pass is dense int8 GEMM (`IntSimdMatrix::MatrixDotVector`, `WeightMatrix`; `src/arch` + `src/lstm`). Surveyed 2026-07-04 against ndarray master: it maps ONE-TO-ONE onto primitives ndarray ALREADY ships — `IntSimdMatrix::MatrixDotVector` (int8 W × int8 u → i32) ↔ `simd_runtime::matmul_i8_to_i32` (AMX TDPBUSD → VPDPBUSD → scalar); `WeightMatrix::ConvertToInt` (row max-abs → INT8_MAX + per-row float scale) ↔ `simd_amx::quantize_energy_i8`; the scale-back ↔ `dequantize_result_f64`. **No Core gap** — the recognizer CONSUMES ndarray's GEMM (the `simd-savant` "all SIMD from `ndarray::simd`" invariant), never re-transcodes SIMD. 
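A minimal sketch of the row max-abs quantization named above (`WeightMatrix::ConvertToInt`: row max-abs maps to `INT8_MAX`, with a per-row float scale), written in plain Rust. It does not call ndarray's `quantize_energy_i8` or `dequantize_result_f64`, whose signatures are not given in this entry. `quantize_row_i8` and `dequantize` are hypothetical names, and the rounding mode and the scale of 1.0 for an all-zero row are assumptions, not confirmed Tesseract behavior.

```rust
/// Hypothetical helper (not an ndarray or Tesseract API): quantize one weight
/// row to int8 by row max-abs, returning the row and its float scale so that
/// `w[i]` is approximately `q[i] as f64 * scale`.
fn quantize_row_i8(row: &[f64]) -> (Vec<i8>, f64) {
    // Largest magnitude in the row; it maps to +/-INT8_MAX (127).
    let max_abs = row.iter().fold(0.0_f64, |m, w| m.max(w.abs()));
    // Assumption: an all-zero row keeps scale 1.0 so the division is defined.
    let scale = if max_abs == 0.0 {
        1.0
    } else {
        max_abs / f64::from(i8::MAX)
    };
    let q = row
        .iter()
        // Round to nearest, then clamp into the symmetric int8 range.
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Scale-back: recover an approximate float from one int8 weight.
fn dequantize(q: i8, scale: f64) -> f64 {
    f64::from(q) * scale
}
```

The scale is kept per row, so a row with small weights does not lose resolution to a row with large ones.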
+ +Two load-bearing properties: (1) int8×int8→i32 accumulation is EXACT + order-independent, so AMX / AVX512-VNNI / AVX2-VNNI / scalar all yield the IDENTICAL i32 — the recognizer's integer matmul is **bit-reproducible across every SIMD tier** (unlike float/BF16 GEMM), which is what makes byte-parity clean. (2) The base `MatrixDotVector` bias is `w(i,num_in)·INT8_MAX` NOT `·1` (intsimdmatrix.cpp:101 — the input's imaginary `1.0` is int8-quantized to 127); `TFloat` = `double` unless `FAST_FLOAT` (tesstypes.h) → the float half is a Leaf-1 probe, the i32 half is exact. + +Two-foundations architecture (the correction): `lance-graph-contract` = CONTENT (codec tables, zero-dep) · `ndarray` = COMPUTE (int8/bf16 SIMD GEMM, already shipped) · `tesseract-core` = content consumer (zero-dep) · NEW `tesseract-recognizer` = compute consumer (deps `ndarray` + `tesseract-core`). Plan: `tesseract-rs/.claude/plans/recognizer-core-shape-v1.md` (Leaf 1 = `MatrixDotVector` byte-parity on synthetic int8, no `Pix`). Cross-ref: `E-CPP-PARITY-7` (recoder), `.claude/knowledge/core-first-transcode-doctrine.md`, ndarray `vertical-simd-consumer-contract.md`. Branch `claude/happy-hamilton-0azlw4`. ## 2026-07-04 — E-CPP-PARITY-7 — the UNICHARCOMPRESS (recoder) load side is byte-identical to libtesseract; the seventh leaf, and the FIRST binary-format transcode (`TFile` little-endian) **Status:** FINDING (byte-parity proven vs libtesseract 5.3.4; in-contract, tested) From 856358a273b3629fd04799c8b20aa6ccb6055a99 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 4 Jul 2026 13:12:32 +0000 Subject: [PATCH 3/7] board: E-OCR-MATDOTVEC-1 -- recognizer Leaf 1 byte-parity green (promotes seam FINDING) The int8 MatrixDotVector, via ndarray's matmul_i8_to_i32, equals libtesseract exactly on synthetic int8 (integer-combined diff, TFloat-agnostic). Promotes E-OCR-COMPUTE-NDARRAY-SEAM-1 CONJECTURE->FINDING. New crate tesseract-recognizer (compute tier). in-env libtesseract is FAST_FLOAT. 
Co-Authored-By: Claude Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1 --- .claude/board/EPIPHANIES.md | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/.claude/board/EPIPHANIES.md b/.claude/board/EPIPHANIES.md index 21344c47..58967f0a 100644 --- a/.claude/board/EPIPHANIES.md +++ b/.claude/board/EPIPHANIES.md @@ -177,8 +177,16 @@ New knowledge doc `.claude/knowledge/data-shape-etymology.md` — the shape-and- **Status:** FINDING (operator ruling on the shape — "yes valueschema") + embedded CONJECTURE (the preset-vs-dispatch probe) Operator floated keeping the fast/cheap V2 substrate for huge data alongside V3, "switched by classid," so V3 can eventually teach V2 how to be better. Resolved: the switch is NOT a new carrier. `ClassView::value_schema(classid) -> ValueSchema` (`canonical_node.rs:894`, `class_view.rs:395`) is ALREADY classid→substrate-shape resolution by trait dispatch — resolved, never stored on-wire (adding a variant costs NO `ENVELOPE_LAYOUT_VERSION` bump), and the four existing variants ALREADY form a substrate ladder: `Bootstrap`(empty, key+edges only) / `Compressed`(cold codec, **no hot lifecycle columns**) / `Cognitive`(hot thinking: Meta+Qualia+Fingerprint+Energy+Plasticity+EntityType) / `Full`(every tenant). So "V2 fast/cheap bulk" = classids that resolve to the LEAN end (Bootstrap/Compressed — no ownership/lifecycle tenants); "V3 witnessed/owned" = Cognitive/Full. **A `ClassRoutingDTO` is rejected:** a DTO is a serialized carried payload, but substrate choice is a RESOLUTION (firewall ADR-022, "contracts compile types, the event never leaves"); and per the three-tier canon nothing crosses mailbox boundaries — every reader re-resolves the substrate from the classid already in the 16-byte key, so there is no boundary for a carrier to travel. `dto-soa-savant` + AGI-as-glove name the new-struct-instead-of-resolution shape exactly. 
**0x1000 is NOT the switch:** canon fixes it as a temporary adoption MONITOR ("monitor, never a semantic"; retires at P4/100%; MODULE-TABLE flags that a future canon==0x1000 aliases the marker) — substrate routes on the classid's concept-half → ValueSchema, never on the monitor bit. **The deep form (CONJECTURE — PROBE preset-vs-dispatch):** the WRITE PATH may be a pure FUNCTION of the schema — a class whose ValueSchema carries no ownership/lifecycle tenants has nothing for the kanban/WAL to witness, so it naturally collapses to the fast private-merge write; Cognitive/Full carry the tenants that REQUIRE the owned/witnessed path. If that holds, substrate = ValueSchema full stop (no separate `Substrate` enum, no flag). The gate: confirm the write path is derivable from which tenants are live vs needing an independent resolution — evidence base is the onebrc arc itself (lane F private-merge/no-tenants vs lanes G–J owned/witnessed = the two write paths already measured). Open sub-question: whether bulk needs a variant leaner than `Compressed`, or Bootstrap/Compressed already suffice. **"V3 teaches V2" (deferred, needs mechanism):** V3's kanban WAL + ownership journal is the profiling signal (where contention lands, which fields are touched) to optimize the lean V2 layout — the instrumented-teacher / stripped-student loop; no code reads the WAL back into a layout optimizer yet. Net: at most a new `ValueSchema` variant through the existing `value_schema(classid)` door; possibly not even that. +## 2026-07-04 — E-OCR-MATDOTVEC-1 — recognizer Leaf 1 is byte-parity green: the int8 `MatrixDotVector`, via ndarray's `matmul_i8_to_i32`, equals libtesseract exactly (promotes `E-OCR-COMPUTE-NDARRAY-SEAM-1` CONJECTURE→FINDING) +**Status:** FINDING (byte-parity proven vs libtesseract 5.3.4; new crate `tesseract-recognizer`, tested) + +The recognizer's first COMPUTE leaf ships. 
`tesseract-recognizer::matrix_dot_vector` transcodes Tesseract's base `IntSimdMatrix::MatrixDotVector` (intsimdmatrix.cpp:78-117) by **consuming** `ndarray::simd_runtime::matmul_i8_to_i32` (AMX `TDPBUSD` → `VPDPBUSD` → scalar) — the bias falls out of one matmul by padding the input with a trailing `INT8_MAX` (127), the int8 quantization of the imaginary `1.0` bias. Byte-parity GREEN on synthetic int8 across two shapes (48×49, 7×5) vs a libtesseract oracle, diffing the EXACT INTEGER combined value (`Σ w·u + w_bias·127`, scales=1.0, exact in float) so the diff is `TFloat`-agnostic. +4 unit tests; clippy `-D warnings` + fmt clean (`-p tesseract-recognizer`, scoped). + +Two plan unknowns resolved: (1) `matmul_i8_to_i32` is behind ndarray's `runtime-dispatch` feature (stable, NOT nightly); the cold ndarray compile is only ~36 s on 1.95. (2) libtesseract 5.3.4 in-env is **FAST_FLOAT → `TFloat = float`** (self-check `sizeof(TFloat)=4`, `lib=157585 hand=157585`): the ABI probe was the DOUBLE-signature link error, then the FLOAT rebuild self-validated. The integer accumulate (the transcode's core) is exact; the scaled float is an adapter float-type choice (f64, documented for a later leaf). + +Toolchain: operator policy "always bump to 1.95" cleared ndarray's `rust-version = 1.95` manifest gate (env was 1.94.1; 1.95.0 set default). CI updated to sibling-checkout ndarray + a 1.95 step. The two-foundations architecture is now REAL: `tesseract-recognizer` (deps ndarray) = compute tier next to `tesseract-core` (deps lance-graph-contract) = content tier. Plan: `tesseract-rs/.claude/plans/recognizer-core-shape-v1.md` (Leaf 1 EXECUTED; next = `WeightMatrix::DeSerialize` + the network graph → `recodebeam` → the code lattice `recoded_to_text` eats). Cross-ref: `E-OCR-COMPUTE-NDARRAY-SEAM-1` (now FINDING), `E-CPP-PARITY-7`. Branch `claude/happy-hamilton-0azlw4`. 
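The bias-by-padding step of E-OCR-MATDOTVEC-1 can be sketched with a scalar stand-in for the GEMM. This is an illustration only: the shipped leaf consumes ndarray's `matmul_i8_to_i32`, whose signature is not shown in this entry, so `matvec_i8_to_i32` and `matvec_with_bias` below are hypothetical names. The sketch assumes a row-major weight matrix whose last column holds the bias weight `w(i, num_in)`.

```rust
/// Scalar reference: int8 matrix times int8 vector with exact i32 accumulation.
/// `w` is row-major with `rows` x `cols` entries; `u` has `cols` entries.
fn matvec_i8_to_i32(w: &[i8], rows: usize, cols: usize, u: &[i8]) -> Vec<i32> {
    assert!(cols > 0, "chunks_exact needs a non-zero row width");
    assert_eq!(w.len(), rows * cols);
    assert_eq!(u.len(), cols);
    w.chunks_exact(cols)
        .map(|row| {
            row.iter()
                .zip(u)
                .map(|(&a, &b)| i32::from(a) * i32::from(b))
                .sum::<i32>()
        })
        .collect()
}

/// Fold the bias column into the same product by appending INT8_MAX (127),
/// the int8 quantization of the imaginary 1.0 input, to the input vector.
/// `w` therefore has `num_in + 1` columns per row.
fn matvec_with_bias(w: &[i8], num_out: usize, num_in: usize, u: &[i8]) -> Vec<i32> {
    assert_eq!(u.len(), num_in);
    let mut padded = Vec::with_capacity(num_in + 1);
    padded.extend_from_slice(u);
    padded.push(i8::MAX);
    matvec_i8_to_i32(w, num_out, num_in + 1, &padded)
}
```

For the rows `[1, 2, 3]` and `[-1, 0, 2]` with input `[10, -4]`, the combined values are `10 - 8 + 3·127 = 383` and `-10 + 0 + 2·127 = 244`. Because every partial product is an exact integer, the accumulation order does not change the result.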
## 2026-07-04 — E-OCR-COMPUTE-NDARRAY-SEAM-1 — the OCR recognizer's int8 hot path maps 1:1 onto ndarray's existing `matmul_i8_to_i32`; no Core gap, and int8→i32 is bit-reproducible across every SIMD tier -**Status:** CONJECTURE (design-pass finding; the byte-parity probe is recognizer Leaf 1, not yet run). Corrects the earlier "OCR transcode is ndarray-free" framing (operator sanity check: *OCR without hardware acceleration isn't smart*). +**Status:** FINDING (2026-07-04 — byte-parity proven by recognizer Leaf 1, `E-OCR-MATDOTVEC-1`; was CONJECTURE at design-pass time). Corrects the earlier "OCR transcode is ndarray-free" framing (operator sanity check: *OCR without hardware acceleration isn't smart*). The recoder/unicharset leaves were codec TABLES (correctly zero-dep content tier); the RECOGNIZER is COMPUTE — Tesseract's LSTM forward pass is dense int8 GEMM (`IntSimdMatrix::MatrixDotVector`, `WeightMatrix`; `src/arch` + `src/lstm`). Surveyed 2026-07-04 against ndarray master: it maps ONE-TO-ONE onto primitives ndarray ALREADY ships — `IntSimdMatrix::MatrixDotVector` (int8 W × int8 u → i32) ↔ `simd_runtime::matmul_i8_to_i32` (AMX TDPBUSD → VPDPBUSD → scalar); `WeightMatrix::ConvertToInt` (row max-abs → INT8_MAX + per-row float scale) ↔ `simd_amx::quantize_energy_i8`; the scale-back ↔ `dequantize_result_f64`. **No Core gap** — the recognizer CONSUMES ndarray's GEMM (the `simd-savant` "all SIMD from `ndarray::simd`" invariant), never re-transcodes SIMD. From 4af9162d53064334d39ff0d3c8f4c8feb7877ca3 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 4 Jul 2026 13:34:08 +0000 Subject: [PATCH 4/7] board: E-OCR-WEIGHTMATRIX-1 -- recognizer Leaf 2 byte-parity green WeightMatrix::DeSerialize (int mode) transcoded + byte-parity vs libtesseract (f32 bit-patterns, two shapes). forward() chains Leaf 1's proven int8 GEMM, scaling in f32 to match FAST_FLOAT. Rust-writes / lib-reads independent proof. 
Co-Authored-By: Claude Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1 --- .claude/board/EPIPHANIES.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/.claude/board/EPIPHANIES.md b/.claude/board/EPIPHANIES.md index 58967f0a..0e7fa213 100644 --- a/.claude/board/EPIPHANIES.md +++ b/.claude/board/EPIPHANIES.md @@ -177,6 +177,14 @@ New knowledge doc `.claude/knowledge/data-shape-etymology.md` — the shape-and- **Status:** FINDING (operator ruling on the shape — "yes valueschema") + embedded CONJECTURE (the preset-vs-dispatch probe) Operator floated keeping the fast/cheap V2 substrate for huge data alongside V3, "switched by classid," so V3 can eventually teach V2 how to be better. Resolved: the switch is NOT a new carrier. `ClassView::value_schema(classid) -> ValueSchema` (`canonical_node.rs:894`, `class_view.rs:395`) is ALREADY classid→substrate-shape resolution by trait dispatch — resolved, never stored on-wire (adding a variant costs NO `ENVELOPE_LAYOUT_VERSION` bump), and the four existing variants ALREADY form a substrate ladder: `Bootstrap`(empty, key+edges only) / `Compressed`(cold codec, **no hot lifecycle columns**) / `Cognitive`(hot thinking: Meta+Qualia+Fingerprint+Energy+Plasticity+EntityType) / `Full`(every tenant). So "V2 fast/cheap bulk" = classids that resolve to the LEAN end (Bootstrap/Compressed — no ownership/lifecycle tenants); "V3 witnessed/owned" = Cognitive/Full. **A `ClassRoutingDTO` is rejected:** a DTO is a serialized carried payload, but substrate choice is a RESOLUTION (firewall ADR-022, "contracts compile types, the event never leaves"); and per the three-tier canon nothing crosses mailbox boundaries — every reader re-resolves the substrate from the classid already in the 16-byte key, so there is no boundary for a carrier to travel. `dto-soa-savant` + AGI-as-glove name the new-struct-instead-of-resolution shape exactly. 
**0x1000 is NOT the switch:** canon fixes it as a temporary adoption MONITOR ("monitor, never a semantic"; retires at P4/100%; MODULE-TABLE flags that a future canon==0x1000 aliases the marker) — substrate routes on the classid's concept-half → ValueSchema, never on the monitor bit. **The deep form (CONJECTURE — PROBE preset-vs-dispatch):** the WRITE PATH may be a pure FUNCTION of the schema — a class whose ValueSchema carries no ownership/lifecycle tenants has nothing for the kanban/WAL to witness, so it naturally collapses to the fast private-merge write; Cognitive/Full carry the tenants that REQUIRE the owned/witnessed path. If that holds, substrate = ValueSchema full stop (no separate `Substrate` enum, no flag). The gate: confirm the write path is derivable from which tenants are live vs needing an independent resolution — evidence base is the onebrc arc itself (lane F private-merge/no-tenants vs lanes G–J owned/witnessed = the two write paths already measured). Open sub-question: whether bulk needs a variant leaner than `Compressed`, or Bootstrap/Compressed already suffice. **"V3 teaches V2" (deferred, needs mechanism):** V3's kanban WAL + ownership journal is the profiling signal (where contention lands, which fields are touched) to optimize the lean V2 layout — the instrumented-teacher / stripped-student loop; no code reads the WAL back into a layout optimizer yet. Net: at most a new `ValueSchema` variant through the existing `value_schema(classid)` door; possibly not even that. +## 2026-07-04 — E-OCR-WEIGHTMATRIX-1 — recognizer Leaf 2: `WeightMatrix::DeSerialize` (int mode) is byte-parity green vs libtesseract; the forward chains Leaf 1's proven int8 GEMM +**Status:** FINDING (byte-parity proven vs libtesseract 5.3.4; `tesseract-recognizer`, tested) + +The recognizer's second leaf loads the int-mode `WeightMatrix`. 
`tesseract_recognizer::WeightMatrix::from_le_bytes` transcodes `WeightMatrix::DeSerialize` (weightmatrix.cpp:280-320, int-mode arm): the little-endian `TFile` layout `u8 mode(0x81) | wi_[GENERIC_2D_ARRAY: u32 dim1, u32 dim2, i8 empty_, dim1·dim2 i8] | u32 num_scales | num_scales × f64 (=scale·127)`. `forward()` runs the int8 forward by consuming the byte-parity-proven `matrix_dot_vector_i32` (Leaf 1) then scaling in **f32** to match Tesseract's FAST_FLOAT build. + +Byte-parity GREEN vs a libtesseract oracle on two shapes (8×5, 24×17), comparing **f32 bit-patterns** exactly. Proof design: Rust WRITES the serialized bytes, libtesseract READS them via the REAL `DeSerialize` + `MatrixDotVector` — a wrong wire layout would make the real parser diverge, so the diff is an independent proof (no `InitWeightsFloat`/`TRand` oracle-build needed). +5 unit tests (hand-built bytes); clippy `-D warnings` + fmt clean. + +Three source-only format details captured: `mode` always carries `kDoubleFlag` (its absence = old float layout → `UnsupportedFormat`); the `empty_` fill byte sits BETWEEN the dims and the data; **scales are doubles on disk regardless of `FAST_FLOAT`** (weightmatrix.cpp:257, loaded `/INT8_MAX`). Plus: `Init` may pad `scales_` past `num_out` (SIMD layout) — the loader keeps only the first `num_out`; and the SIMD `MatrixDotVector` OVER-READS the input to `RoundInputs` padding (the oracle zero-pads `u`). Next Leaf 3+: the network graph forward (`Series`/`LSTM`/`FullyConnected`/`Convolve`) → `recodebeam` → the code lattice `recoded_to_text` eats. Cross-ref: `E-OCR-MATDOTVEC-1` (Leaf 1), `E-OCR-COMPUTE-NDARRAY-SEAM-1`. Plan: `recognizer-core-shape-v1.md` (Leaf 2 EXECUTED). Branch `claude/happy-hamilton-0azlw4`. 
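The int-mode wire layout listed in E-OCR-WEIGHTMATRIX-1 can be read with a small little-endian cursor. The sketch below is not the crate's `WeightMatrix::from_le_bytes`: `IntWeights`, `LoadError`, `take`, `read_u32`, and `parse_int_weights` are hypothetical names. It also simplifies by accepting only `mode == 0x81` exactly and by treating `dim1` as `num_out`, both of which are assumptions.

```rust
#[derive(Debug, PartialEq)]
enum LoadError {
    UnexpectedEof,
    UnsupportedFormat(u8),
}

struct IntWeights {
    rows: usize,
    cols: usize,
    w: Vec<i8>,
    scales: Vec<f64>,
}

/// Split `n` bytes off the front of the cursor, or fail on a short buffer.
fn take<'a>(buf: &mut &'a [u8], n: usize) -> Result<&'a [u8], LoadError> {
    let all: &'a [u8] = *buf;
    if all.len() < n {
        return Err(LoadError::UnexpectedEof);
    }
    let (head, tail) = all.split_at(n);
    *buf = tail;
    Ok(head)
}

fn read_u32(buf: &mut &[u8]) -> Result<u32, LoadError> {
    let b = take(buf, 4)?;
    Ok(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

fn parse_int_weights(mut buf: &[u8]) -> Result<IntWeights, LoadError> {
    // u8 mode: simplification, only the 0x81 value named in the entry is accepted.
    let mode = take(&mut buf, 1)?[0];
    if mode != 0x81 {
        return Err(LoadError::UnsupportedFormat(mode));
    }
    let rows = read_u32(&mut buf)? as usize; // dim1
    let cols = read_u32(&mut buf)? as usize; // dim2
    let _empty = take(&mut buf, 1)?; // fill byte sits between dims and data
    let n = rows.checked_mul(cols).ok_or(LoadError::UnexpectedEof)?;
    let w: Vec<i8> = take(&mut buf, n)?.iter().map(|&b| b as i8).collect();
    let num_scales = read_u32(&mut buf)?;
    let mut scales = Vec::new();
    for _ in 0..num_scales {
        let mut raw = [0u8; 8];
        raw.copy_from_slice(take(&mut buf, 8)?);
        // On-disk doubles hold scale * 127; divide back on load.
        scales.push(f64::from_le_bytes(raw) / f64::from(i8::MAX));
    }
    // The scale vector may be padded past num_out; keep the first `rows`.
    scales.truncate(rows);
    Ok(IntWeights { rows, cols, w, scales })
}
```

Every read goes through `take`, so a truncated buffer or an oversized declared length fails with `UnexpectedEof` before anything is allocated for it.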
## 2026-07-04 — E-OCR-MATDOTVEC-1 — recognizer Leaf 1 is byte-parity green: the int8 `MatrixDotVector`, via ndarray's `matmul_i8_to_i32`, equals libtesseract exactly (promotes `E-OCR-COMPUTE-NDARRAY-SEAM-1` CONJECTURE→FINDING) **Status:** FINDING (byte-parity proven vs libtesseract 5.3.4; new crate `tesseract-recognizer`, tested) From c60d8f55a1c8bc9ccf27b1279e34ef43d2857644 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 4 Jul 2026 13:41:02 +0000 Subject: [PATCH 5/7] board: E-OCR-ACTIVATION-1 -- recognizer Leaf 3 byte-parity green The LUT activations (Tanh/Logistic + Relu/Clip/Softmax) transcoded + byte-parity vs libtesseract on a 4096-pt sweep; the regenerated tables match the baked ones. All f32 (FAST_FLOAT). Leaf 2 + Leaf 3 = the pieces of a FullyConnected forward. Co-Authored-By: Claude Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1 --- .claude/board/EPIPHANIES.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/.claude/board/EPIPHANIES.md b/.claude/board/EPIPHANIES.md index 0e7fa213..8c35bfbb 100644 --- a/.claude/board/EPIPHANIES.md +++ b/.claude/board/EPIPHANIES.md @@ -177,6 +177,14 @@ New knowledge doc `.claude/knowledge/data-shape-etymology.md` — the shape-and- **Status:** FINDING (operator ruling on the shape — "yes valueschema") + embedded CONJECTURE (the preset-vs-dispatch probe) Operator floated keeping the fast/cheap V2 substrate for huge data alongside V3, "switched by classid," so V3 can eventually teach V2 how to be better. Resolved: the switch is NOT a new carrier. 
`ClassView::value_schema(classid) -> ValueSchema` (`canonical_node.rs:894`, `class_view.rs:395`) is ALREADY classid→substrate-shape resolution by trait dispatch — resolved, never stored on-wire (adding a variant costs NO `ENVELOPE_LAYOUT_VERSION` bump), and the four existing variants ALREADY form a substrate ladder: `Bootstrap`(empty, key+edges only) / `Compressed`(cold codec, **no hot lifecycle columns**) / `Cognitive`(hot thinking: Meta+Qualia+Fingerprint+Energy+Plasticity+EntityType) / `Full`(every tenant). So "V2 fast/cheap bulk" = classids that resolve to the LEAN end (Bootstrap/Compressed — no ownership/lifecycle tenants); "V3 witnessed/owned" = Cognitive/Full. **A `ClassRoutingDTO` is rejected:** a DTO is a serialized carried payload, but substrate choice is a RESOLUTION (firewall ADR-022, "contracts compile types, the event never leaves"); and per the three-tier canon nothing crosses mailbox boundaries — every reader re-resolves the substrate from the classid already in the 16-byte key, so there is no boundary for a carrier to travel. `dto-soa-savant` + AGI-as-glove name the new-struct-instead-of-resolution shape exactly. **0x1000 is NOT the switch:** canon fixes it as a temporary adoption MONITOR ("monitor, never a semantic"; retires at P4/100%; MODULE-TABLE flags that a future canon==0x1000 aliases the marker) — substrate routes on the classid's concept-half → ValueSchema, never on the monitor bit. **The deep form (CONJECTURE — PROBE preset-vs-dispatch):** the WRITE PATH may be a pure FUNCTION of the schema — a class whose ValueSchema carries no ownership/lifecycle tenants has nothing for the kanban/WAL to witness, so it naturally collapses to the fast private-merge write; Cognitive/Full carry the tenants that REQUIRE the owned/witnessed path. If that holds, substrate = ValueSchema full stop (no separate `Substrate` enum, no flag). 
The gate: confirm the write path is derivable from which tenants are live vs needing an independent resolution — evidence base is the onebrc arc itself (lane F private-merge/no-tenants vs lanes G–J owned/witnessed = the two write paths already measured). Open sub-question: whether bulk needs a variant leaner than `Compressed`, or Bootstrap/Compressed already suffice. **"V3 teaches V2" (deferred, needs mechanism):** V3's kanban WAL + ownership journal is the profiling signal (where contention lands, which fields are touched) to optimize the lean V2 layout — the instrumented-teacher / stripped-student loop; no code reads the WAL back into a layout optimizer yet. Net: at most a new `ValueSchema` variant through the existing `value_schema(classid)` door; possibly not even that. +## 2026-07-04 — E-OCR-ACTIVATION-1 — recognizer Leaf 3: the LUT activations (Tanh/Logistic + Relu/Clip/Softmax) are byte-parity green vs libtesseract; the regenerated tables match the baked ones +**Status:** FINDING (byte-parity proven vs libtesseract 5.3.4; `tesseract-recognizer`, tested) + +The recognizer's activation non-linearities (lstm/functions.h). `tesseract_recognizer::activation::{tanh, logistic}` transcode the 4096-entry LUT sigmoids (`kScaleFactor=256`, linear interp; functions.h:44-72), regenerating `TanhTable[i]=tanh(i/256)` / `LogisticTable[i]=logistic(i/256)` (generate_lut.py's exact formula — f64 compute → f32 store) in a `LazyLock`, plus `relu`/`clip_f`/`clip_g`/`identity`/`softmax_in_place` (functions.h:85-207). All in f32 to match the FAST_FLOAT build. + +Byte-parity GREEN vs a libtesseract oracle on a 4096-point x-sweep (x ∈ [-16,16), tanh + logistic), f32 bit-patterns identical — which ALSO proves the regenerated LUTs match libtesseract's BAKED `TanhTable`/`LogisticTable` byte-exactly (this env's libm == the build's for these values), and that the negative-reflection + f32 interp match. +4 unit tests; clippy `-D warnings` + fmt clean. 
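The table lookup described in this entry (4096 entries, scale factor 256, linear interpolation, negative inputs handled by reflection) can be sketched for `tanh` as follows. `TanhLut` is a hypothetical type that owns its table, standing in for the `LazyLock` static the entry mentions. The saturation to 1.0 at the last table index is an assumption about the upstream behavior, and the sketch does not itself claim bit-parity with libtesseract.

```rust
const TABLE_SIZE: usize = 4096;
const SCALE: f32 = 256.0;

/// Hypothetical stand-in for the crate's static tanh table.
struct TanhLut {
    table: Vec<f32>,
}

impl TanhLut {
    /// Build table[i] = tanh(i / 256), computed in f64 and stored as f32.
    fn new() -> Self {
        let table = (0..TABLE_SIZE)
            .map(|i| (i as f64 / 256.0).tanh() as f32)
            .collect();
        Self { table }
    }

    /// Evaluate tanh(x) by table lookup with linear interpolation in f32.
    fn eval(&self, x: f32) -> f32 {
        // tanh is odd, so negative inputs reflect onto the positive half.
        if x < 0.0 {
            return -self.eval(-x);
        }
        let scaled = x * SCALE;
        let index = scaled as usize; // float-to-int casts saturate in Rust
        // Assumption: inputs at or past the last interval saturate to 1.
        if index >= TABLE_SIZE - 1 {
            return 1.0;
        }
        let (t0, t1) = (self.table[index], self.table[index + 1]);
        t0 + (t1 - t0) * (scaled - index as f32)
    }
}
```

At a table point such as `x = 1.0` the interpolation weight is zero, so `eval` returns the stored entry unchanged, which gives a direct spot check of the table contents.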
+ +With Leaf 2 (`WeightMatrix`) + Leaf 3 (activations), the recognizer now holds both pieces of a FullyConnected layer forward — **Leaf 4 = `FullyConnected::Forward`** composing them (`activation(WeightMatrix·input)`), the first COMPLETE network layer; then LSTM/Series/Parallel → `recodebeam` → the code lattice `recoded_to_text` eats. Cross-ref: `E-OCR-WEIGHTMATRIX-1` (Leaf 2), `E-OCR-MATDOTVEC-1` (Leaf 1). Plan: recognizer-core-shape-v1.md (Leaf 3 EXECUTED). Branch `claude/happy-hamilton-0azlw4`. ## 2026-07-04 — E-OCR-WEIGHTMATRIX-1 — recognizer Leaf 2: `WeightMatrix::DeSerialize` (int mode) is byte-parity green vs libtesseract; the forward chains Leaf 1's proven int8 GEMM **Status:** FINDING (byte-parity proven vs libtesseract 5.3.4; `tesseract-recognizer`, tested) From a7dba3a8d4f5c91d9b06e6ceee6d6cf2ea2ce581 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 4 Jul 2026 14:36:57 +0000 Subject: [PATCH 6/7] =?UTF-8?q?contract::network=20=E2=80=94=20sink=20the?= =?UTF-8?q?=20Tesseract=20Network=20layer=20graph=20onto=20V3=20SoA=20(byt?= =?UTF-8?q?e-parity)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Executes the operator directive "6x8:8, 16 B tenant = classid + 12 B, ruff>OGAR transpiler sink-in". The polymorphic Network subclass tree lands on the OGAR Core the right way — NOT a hand-rolled enum (that draft was the parallel-object-model anti-pattern). - NEW src/network.rs: NetworkType (27 layer types; ordinal == on-wire kTypeNames discriminant, network.h:41-78 / network.cpp:60-75) + NetworkHeader::from_le_bytes (the base header Network::CreateFromFile reads before subclass dispatch, network.cpp:214-248) + to_facet() (sinks each node onto facet::FacetCascade, 16 B = classid + 6x8:8, CascadeShape::G6D2) + NetworkType::classid() (the invoke_network dispatch seed). facet_classid = compose_classid(network_layer, ntype) canon-high; subclass in the classid custom-low half, not 27 slots. 
- ogar_codebook: ONE mint network_layer=0x0804 in the 0x08 OCR domain. - NEW examples/network_dump.rs: the byte-parity surface. Byte-parity GREEN on real eng.lstm: Rust NetworkHeader::from_le_bytes == libtesseract Network::CreateFromFile for the outer node (Series ni=36 no=111 num_weights=385807 name=Series); the oracle's spec() == the model spec string (known-answer self-check, 5.5.0-hdr/5.3.4-lib ABI skew guarded, oracle built -DFAST_FLOAT). The facet 0x08040009 decodes losslessly. Reviewed by core-first-architect (TARGETS-CORE), v3-envelope-auditor (LAYOUT-CLEAN, no version bump), brutally-honest-tester (LAND). Folded in: compile-lock test (NETWORK_LAYER == codebook mint), custom-half invariant doc, to_facet debug_assert on the ni/no u16 range. +7 contract tests; clippy -D warnings + fmt clean (scoped -p lance-graph-contract). Board: EPIPHANIES E-OCR-NETWORK-SINK-1, LATEST_STATE contract inventory. Co-Authored-By: Claude Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1 --- .claude/board/EPIPHANIES.md | 13 + .claude/board/LATEST_STATE.md | 6 +- .../examples/network_dump.rs | 56 ++ crates/lance-graph-contract/src/lib.rs | 2 + crates/lance-graph-contract/src/network.rs | 620 ++++++++++++++++++ .../lance-graph-contract/src/ogar_codebook.rs | 6 + 6 files changed, 702 insertions(+), 1 deletion(-) create mode 100644 crates/lance-graph-contract/examples/network_dump.rs create mode 100644 crates/lance-graph-contract/src/network.rs diff --git a/.claude/board/EPIPHANIES.md b/.claude/board/EPIPHANIES.md index 8c35bfbb..80d7447b 100644 --- a/.claude/board/EPIPHANIES.md +++ b/.claude/board/EPIPHANIES.md @@ -177,6 +177,19 @@ New knowledge doc `.claude/knowledge/data-shape-etymology.md` — the shape-and- **Status:** FINDING (operator ruling on the shape — "yes valueschema") + embedded CONJECTURE (the preset-vs-dispatch probe) Operator floated keeping the fast/cheap V2 substrate for huge data alongside V3, "switched by classid," so V3 can eventually teach 
V2 how to be better. Resolved: the switch is NOT a new carrier. `ClassView::value_schema(classid) -> ValueSchema` (`canonical_node.rs:894`, `class_view.rs:395`) is ALREADY classid→substrate-shape resolution by trait dispatch — resolved, never stored on-wire (adding a variant costs NO `ENVELOPE_LAYOUT_VERSION` bump), and the four existing variants ALREADY form a substrate ladder: `Bootstrap`(empty, key+edges only) / `Compressed`(cold codec, **no hot lifecycle columns**) / `Cognitive`(hot thinking: Meta+Qualia+Fingerprint+Energy+Plasticity+EntityType) / `Full`(every tenant). So "V2 fast/cheap bulk" = classids that resolve to the LEAN end (Bootstrap/Compressed — no ownership/lifecycle tenants); "V3 witnessed/owned" = Cognitive/Full. **A `ClassRoutingDTO` is rejected:** a DTO is a serialized carried payload, but substrate choice is a RESOLUTION (firewall ADR-022, "contracts compile types, the event never leaves"); and per the three-tier canon nothing crosses mailbox boundaries — every reader re-resolves the substrate from the classid already in the 16-byte key, so there is no boundary for a carrier to travel. `dto-soa-savant` + AGI-as-glove name the new-struct-instead-of-resolution shape exactly. **0x1000 is NOT the switch:** canon fixes it as a temporary adoption MONITOR ("monitor, never a semantic"; retires at P4/100%; MODULE-TABLE flags that a future canon==0x1000 aliases the marker) — substrate routes on the classid's concept-half → ValueSchema, never on the monitor bit. **The deep form (CONJECTURE — PROBE preset-vs-dispatch):** the WRITE PATH may be a pure FUNCTION of the schema — a class whose ValueSchema carries no ownership/lifecycle tenants has nothing for the kanban/WAL to witness, so it naturally collapses to the fast private-merge write; Cognitive/Full carry the tenants that REQUIRE the owned/witnessed path. If that holds, substrate = ValueSchema full stop (no separate `Substrate` enum, no flag). 
The gate: confirm the write path is derivable from which tenants are live vs needing an independent resolution — evidence base is the onebrc arc itself (lane F private-merge/no-tenants vs lanes G–J owned/witnessed = the two write paths already measured). Open sub-question: whether bulk needs a variant leaner than `Compressed`, or Bootstrap/Compressed already suffice. **"V3 teaches V2" (deferred, needs mechanism):** V3's kanban WAL + ownership journal is the profiling signal (where contention lands, which fields are touched) to optimize the lean V2 layout — the instrumented-teacher / stripped-student loop; no code reads the WAL back into a layout optimizer yet. Net: at most a new `ValueSchema` variant through the existing `value_schema(classid)` door; possibly not even that. +## 2026-07-04 — E-OCR-NETWORK-SINK-1 — the Tesseract `Network` layer graph sinks onto V3 SoA via ruff→OGAR: base-header parse byte-parity green + `FacetCascade` (16 B) sink, NOT a hand-rolled enum +**Status:** FINDING (byte-parity proven vs libtesseract 5.3.4; `lance-graph-contract`, tested) + +The operator directive — *"use new V3 substrate AR rail shaped (6x8:8), 16 bytes tenant = classid + 12 bytes, use ruff>OGAR transpiler sink-in substrate"* — is executed and proven. The polymorphic `Network` subclass tree is sunk onto the Core the RIGHT way (a hand-rolled `enum NetworkKind` was rejected earlier this arc as the parallel-object-model anti-pattern): + +1. **ruff→OGAR harvest** (`ruff/crates/ruff_cpp_spo/examples/harvest_network.rs`, committed) — the libclang walker over the 11 network layer headers emits the `has_function`/`inherits_from`/`virtually_overrides` SPO manifest: **62 classes, 5060 triples** on real Tesseract 5.5.0 src. The `Forward` override set (FullyConnected/LSTM/Series/Parallel/Convolve/Maxpool/Reversed/Reconfig/Input) = the compute-leaf list; the `DeSerialize` set (FullyConnected/LSTM/Plumbing/Convolve/Maxpool/Reconfig/Input) = the binary-leaf list. 
This IS the `classid → ClassView` method-resolution manifest (the vtable the enum would have faked). +2. **Base-header leaf** (`lance_graph_contract::network`) — `NetworkHeader::from_le_bytes` transcodes the shared serialization prefix EVERY layer writes (`network.cpp:214-248` `Network::CreateFromFile`: `i8 tag(0) | u32+str type_name | i8 training | i8 needs_backprop | i32 flags | i32 ni | i32 no | i32 num_weights | u32+str name`). `NetworkType` (27 types, ordinal == discriminant, `kTypeNames` on-wire strings) + `to_facet()`. +3. **V3 SoA sink** — each node → `crate::facet::FacetCascade` (16 B = `classid(4) | 6×(8:8)`), read under `CascadeShape::G6D2` (the "6x8:8"): tier0=ni, tier1=no, tier2=flags, tiers3-4=num_weights u32, tier5=lifecycle. `facet_classid = compose_classid(NETWORK_LAYER=0x0804, ntype)` — canon-high, ONE `network_layer` OCR-domain mint (the 27 subclasses live in the classid custom-low half, NOT 27 codebook slots). Name + weight blob are out-of-line (`I-VSA-IDENTITIES`). + +**Byte-parity GREEN** on real `/tmp/eng.lstm`: Rust parse == libtesseract `Network::CreateFromFile` for the outer node — `Series ni=36 no=111 num_weights=385807 name=Series` — with the oracle's `spec()` == the model spec `[1,36,0,1[C3,3Ft16]Mp3,3TxyLfys48Lfx96RxLrx96Lfx192Fc111]` (the known-answer self-check guarding the 5.5.0-hdr/5.3.4-lib ABI skew; oracle built `-DFAST_FLOAT`). The facet `0x08040009` decodes losslessly (ni=36/no=111/flags=192/nw=385807/lifecycle=0). Example `network_dump.rs`; +5 contract tests; clippy `-D warnings` + fmt clean (`-p lance-graph-contract` scoped). + +Deferred (follow-ups): per-subclass payload parse + tree recursion (Plumbing children → `EdgeBlock`, weights → out-of-line Lance column); the `invoke_network` keystone (dispatch already proven generically by E-CPP-KEYSTONE-1); the recognizer COMPUTE leaves (`tesseract-recognizer`, deps ndarray — Leaf 4 `FullyConnected::Forward`, Leaf 5 `LSTM::Forward`, then `recodebeam`). 
Plan: `tesseract-rs/.claude/plans/network-ruff-ogar-sink-v1.md`. Cross-ref: `E-CPP-PARITY-7` (recoder, the sibling binary leaf), `E-OCR-MATDOTVEC-1`/`E-OCR-WEIGHTMATRIX-1`/`E-OCR-ACTIVATION-1` (the compute leaves), `E-CPP-KEYSTONE-1` (classid→ClassView dispatch). Branch `claude/happy-hamilton-0azlw4`. + ## 2026-07-04 — E-OCR-ACTIVATION-1 — recognizer Leaf 3: the LUT activations (Tanh/Logistic + Relu/Clip/Softmax) are byte-parity green vs libtesseract; the regenerated tables match the baked ones **Status:** FINDING (byte-parity proven vs libtesseract 5.3.4; `tesseract-recognizer`, tested) diff --git a/.claude/board/LATEST_STATE.md b/.claude/board/LATEST_STATE.md index 353f1672..68d79495 100644 --- a/.claude/board/LATEST_STATE.md +++ b/.claude/board/LATEST_STATE.md @@ -10,6 +10,10 @@ --- +## 2026-07-04 — branch `claude/happy-hamilton-0azlw4` — `contract::network` — the Tesseract `Network` layer graph sunk onto V3 SoA via ruff→OGAR (byte-parity vs libtesseract) + +**NEW** `lance_graph_contract::network`: `NetworkType` (27 layer types, ordinal == on-wire `kTypeNames` discriminant) + `NetworkHeader` (`from_le_bytes` = the base header `Network::CreateFromFile` reads before subclass dispatch: `i8 tag | u32+str type_name | i8 training | i8 needs_backprop | i32 flags | i32 ni | i32 no | i32 num_weights | u32+str name`) + `to_facet()` (the V3 SoA sink) + `NetworkType::classid()` (the `invoke_network` dispatch seed). 
Executes the operator directive *"6x8:8, 16 B tenant = classid + 12 B, ruff>OGAR sink-in"*: (1) the `ruff_cpp_spo` `harvest_network` example (committed to ruff) walks the 11 network headers via libclang → the `has_function`/`virtually_overrides` SPO manifest (62 classes, 5060 triples) = the `classid → ClassView` method-resolution table, NOT a hand-rolled enum; (2) each node sinks onto `crate::facet::FacetCascade` (16 B = `classid(4) | 6×(8:8)`, read `CascadeShape::G6D2`): tier0=ni, tier1=no, tier2=flags, tiers3-4=num_weights u32, tier5=lifecycle; `facet_classid = compose_classid(network_layer=0x0804, ntype)` canon-high. Byte-parity **GREEN** on real `/tmp/eng.lstm`: Rust parse == libtesseract `Network::CreateFromFile` — `Series ni=36 no=111 num_weights=385807 name=Series` — oracle `spec()` == the model spec string (known-answer self-check, 5.5.0-hdr/5.3.4-lib ABI skew guarded). Example `network_dump.rs`; +5 contract tests; clippy `-D warnings` + fmt clean (scoped `-p lance-graph-contract`). ONE `network_layer`=0x0804 OCR-domain mint added (subclasses in classid custom-low, not 27 slots). Deferred: per-subclass payload + tree recursion, the `invoke_network` keystone, the recognizer COMPUTE leaves. Refs: EPIPHANIES `E-OCR-NETWORK-SINK-1`; plan `tesseract-rs/.claude/plans/network-ruff-ogar-sink-v1.md`. Not yet a PR. + ## 2026-07-04 — branch `claude/happy-hamilton-0azlw4` — `contract::unicharcompress` — the Tesseract recoder load side (byte-parity vs libtesseract) **NEW** `lance_graph_contract::unicharcompress`: `UnicharCompress` (the LSTM recoder's code↔id table) + `RecodedCharId` + `RecoderError`, load side only (`from_le_bytes` / `load_from_file` = C++ `DeSerialize`; `encode` / `decode` / `code_range`; `dump_encode` / `dump_decode` parity surfaces). The FIRST binary-format leaf (`TFile` little-endian: `u32 count` + per-entry `[i8 self_normalized][i32 length][i32×length code]`). 
Byte-parity **GREEN** on real `/tmp/eng.lstm-recoder` — encode 112/112 + decode 112/112 + code_range=111 — via the committed `examples/recoder_dump.rs`, diffed vs a libtesseract 5.3.4 oracle (the 5.5.0-header ABI skew self-validated by the `Encode∘Decode` round-trip + `enc_size=112`). +10 contract tests; `-p lance-graph-contract` clippy `-D warnings` + fmt clean. Consumed by `tesseract-core::{Recoder, recoded_to_text}` (codes→decode→ids→`ids_to_text`; +1 boundary test, 8/8). Resolves the `recoder`=0x0802 concept (OGAR #148 mint, mirrored in the "0x08XX OCR rows" line below) to its content-store module. The recoder keystone (`invoke_recoder`) is UNBLOCKED but deferred (dispatch already proven generically by E-CPP-KEYSTONE-1). Refs: EPIPHANIES `E-CPP-PARITY-7`. Not yet a PR. @@ -699,5 +703,5 @@ PR sequence: #360 → #361 → post-#360 substrate-sweep (this PR). - **`codegen_spine::RouteBucketTyped`** (NEW; C6 merged verbatim from op-nexgen's vendored diff, codex-reviewed on nexgen PR #8). Kind-generic sibling of `RouteBucket` (`type Kind: Copy + Eq`) + `?Sized` blanket bridge (`impl RouteBucketTyped for T { type Kind = OdooMethodKind; }`) so non-Odoo codegen targets bring their own kind enum additively. Coherence rule: a type needing a different Kind skips the legacy trait. 12/12 module tests incl. dyn-object coverage. - **`emission_scan`** (NEW; op-nexgen L2). Zero-dep typed-DDL adoption counter, `classid_scan`'s design-language sibling: `TypedForm {Typed, AnyTyped, RecordLink, Stub}` (#[non_exhaustive]) + tokenizer `classify_ddl_type` (precedence Stub > RecordLink > AnyTyped > Typed; word-boundary tokens so `many`/`recording` never false-match) + `EmissionCounts` fold with `typed_ratio()` (f64, mirrors `adoption_pct`). 15 tests. Module doc NAMES the contract scan-family pattern (Form enum + classify_* + fold-to-counts): the next governance counter mirrors it. 
-- **`ogar_codebook` 0x08XX OCR rows** — `unicharset` (0x0801) / `recoder` (0x0802) / `charset` (0x0803) mirroring OGAR #148's mint (container kinds only; content never becomes concepts — Osint zero-rows precedent). Drift-guard test extended. CODEBOOK now 68 entries. +- **`ogar_codebook` 0x08XX OCR rows** — `unicharset` (0x0801) / `recoder` (0x0802) / `charset` (0x0803) / `network_layer` (0x0804) mirroring OGAR #148's mint (container kinds only; content never becomes concepts — Osint zero-rows precedent). `network_layer` = the KIND "a Tesseract recognizer network layer"; the 27 subclasses live in the classid custom-low half (`NetworkType` ordinal), NOT 27 slots. Drift-guard test extended. CODEBOOK now 69 entries. - **Rulings + intake record:** EPIPHANIES E-V3-XSESSION-INTAKE-1(+RULINGS), E-V3-GRAPHRAG-INV-1; handover `.claude/handovers/2026-07-02-cross-session-wishlist-intake.md`; plan Addendum-10/11 (per-consumer classid ownership + tripwires ratified; R-1 naming phantom closed — `domain:appid:classview`; R-2 closed — 512-byte row frozen, edges via strided view; L3 new-Arrow-schema design killed; five post-fuse workstreams enumerated). Knowledge: `graphrag-rs-inventory.md`. diff --git a/crates/lance-graph-contract/examples/network_dump.rs b/crates/lance-graph-contract/examples/network_dump.rs new file mode 100644 index 00000000..c15b192f --- /dev/null +++ b/crates/lance-graph-contract/examples/network_dump.rs @@ -0,0 +1,56 @@ +//! Dump the base `Network` header at the front of a serialized recognizer +//! component (`eng.lstm`) — the Rust side of the network base-header byte-parity +//! leaf, sibling to `recoder_dump`. Also prints the [`FacetCascade`] the node +//! sinks onto (the ruff→OGAR harvest → V3 SoA target). +//! +//! ```sh +//! # Extract the lstm component (starts with the network, lstmrecognizer.cpp:135): +//! combine_tessdata -u $(dpkg -L tesseract-ocr-eng | grep eng.traineddata) /tmp/eng. +//! 
# C++ oracle (network_spec_oracle.cpp): links libtesseract, calls the REAL +//! # Network::CreateFromFile on the same bytes and dumps the loaded top node's +//! # type / ni / no / num_weights / name + spec() (the known-answer self-check). +//! # ./network_spec_oracle /tmp/eng.lstm > /tmp/oracle_network.txt +//! # Rust side (parses only the base header — the shared prefix of every layer): +//! cargo run -p lance-graph-contract --example network_dump -- /tmp/eng.lstm > /tmp/rust_network.txt +//! # The "header:" line is byte-identical between the two => the base header +//! # parse is byte-parity green. +//! ``` + +#![allow( + clippy::print_stdout, + reason = "a dump CLI example writes to stdout by design" +)] + +use std::process::ExitCode; + +use lance_graph_contract::network::NetworkHeader; + +fn main() -> ExitCode { + let Some(path) = std::env::args().nth(1) else { + eprintln!("usage: network_dump "); + return ExitCode::FAILURE; + }; + let bytes = match std::fs::read(&path) { + Ok(b) => b, + Err(err) => { + eprintln!("error reading {path}: {err}"); + return ExitCode::FAILURE; + } + }; + match NetworkHeader::from_le_bytes(&bytes) { + Ok((header, consumed)) => { + // The byte-parity line (diffed against the oracle's loaded top node). + println!("header: {}", header.dump()); + // The V3 SoA sink: the 16-byte FacetCascade (classid + 6×8:8), hex. 
+ let f = header.to_facet(); + let hex: String = f.to_bytes().iter().map(|b| format!("{b:02x}")).collect(); + println!("facet: classid={:#010x} bytes={hex}", f.facet_classid); + println!("consumed: {consumed} bytes (base header; subclass payload follows)"); + ExitCode::SUCCESS + } + Err(err) => { + eprintln!("error parsing header: {err:?}"); + ExitCode::FAILURE + } + } +} diff --git a/crates/lance-graph-contract/src/lib.rs b/crates/lance-graph-contract/src/lib.rs index f206e7aa..a25071a5 100644 --- a/crates/lance-graph-contract/src/lib.rs +++ b/crates/lance-graph-contract/src/lib.rs @@ -96,6 +96,8 @@ pub mod manifest; pub mod mul; pub mod nan_projection; pub mod nars; +/// LSTM `Network` layer-graph structure — base-header parse + `FacetCascade` sink. +pub mod network; pub mod ocr; /// D-OVC-1 — OGAR concept codebook (`0xDDCC` domain layout), wire-compat mirror. pub mod ogar_codebook; diff --git a/crates/lance-graph-contract/src/network.rs b/crates/lance-graph-contract/src/network.rs new file mode 100644 index 00000000..df41479d --- /dev/null +++ b/crates/lance-graph-contract/src/network.rs @@ -0,0 +1,620 @@ +//! LSTM `Network` layer-graph structure — the Rust side of the network +//! base-header byte-parity leaf, and the **sink of the ruff→OGAR harvest onto +//! the V3 SoA** ([`crate::facet::FacetCascade`]). +//! +//! Tesseract's recognizer is a tree of `Network` subclasses (`lstm/network.{h,cpp}` +//! + `series.cpp` / `parallel.cpp` / `fullyconnected.cpp` / `lstm.cpp` / …). Every +//! node — whatever its subclass — is serialized with the SAME base header, written +//! by `Network::Serialize` and read back by the factory `Network::CreateFromFile` +//! (`network.cpp:155-248`). This module transcodes that **base header** (the shared +//! prefix of every layer) + the `kTypeNames` on-wire type discriminant, and sinks +//! each parsed node onto a content-blind [`FacetCascade`] — the operator's +//! "16-byte tenant, classid + 12 bytes" V3 substrate. +//! +//! 
# Core-First placement +//! +//! Per the Core-First doctrine this is **structure** (identity + typed dims), not +//! compute: the recognizer's `Forward`/weight math lives in `tesseract-recognizer` +//! (deps ndarray); the layer *graph* — which types, nested how, with what +//! `ni`/`no` — is content the OGAR Core owns, exactly like the recoder +//! ([`crate::unicharcompress`]) and the unicharset ([`crate::unicharset`]). The +//! `ruff_cpp_spo` harvest (`has_function` / `virtually_overrides`) is the +//! `classid → ClassView` method-resolution manifest; THIS is where a harvested +//! node lands as a typed SoA row. No parallel object model: a network node is a +//! [`FacetCascade`], its type a `classid`, never a bespoke `enum NetworkKind`. +//! +//! # Base-header wire format (byte-parity surface) +//! +//! The factory reads, in order (`network.cpp:214-248`; little-endian, +//! `TFile::swap_ == false` on x86; `std::string` = `u32 len` + `len` raw bytes, +//! `serialis.cpp:94-110`): +//! +//! ```text +//! i8 tag // always NT_NONE(0); getNetworkType, network.cpp:191 +//! u32 type_name_len // then the ASCII type name (kTypeNames entry) +//! … type_name bytes // "Series" / "Input" / "LSTM" / … — the discriminant +//! i8 training // TrainingState (recognizer = TS_DISABLED) +//! i8 needs_to_backprop // 0/1 +//! i32 network_flags // NetworkFlags bits +//! i32 ni // number of inputs +//! i32 no // number of outputs +//! i32 num_weights // weights in THIS node and its sub-network (cumulative) +//! u32 name_len // then the layer's unique name +//! … name bytes +//! ``` +//! +//! then the subclass's own `DeSerialize` payload (weights / children) — DEFERRED +//! to follow-up leaves (the per-subclass payloads: `Plumbing` reads its child +//! vector, `FullyConnected`/`LSTM` read `WeightMatrix` blobs). This leaf proves the +//! shared base header, exactly as the recoder leaf proved the header before the +//! beam maps. +//! +//! 
For real `eng.lstm` (the extracted `lstm` component; `LSTMRecognizer::DeSerialize` +//! calls `Network::CreateFromFile` FIRST, `lstmrecognizer.cpp:135`) the outermost +//! node parses to `type=Series, ni=36, no=111, num_weights=385807` — matching the +//! model spec `[1,36,0,1[C3,3Ft16]Mp3,3TxyLfys48Lfx96RxLrx96Lfx192Fc111]` (ni=36 +//! feature rows, no=111 = the Fc111 softmax classes). That is the first-principles +//! pre-registration of a correct parse (the recoder-leaf method). +//! +//! [`NetworkHeader::dump`] is the byte-parity surface, diffed against the C++ +//! `network_spec_oracle` (which links libtesseract, calls the real +//! `Network::CreateFromFile`, and dumps `spec()` / `ni()` / `no()` / +//! `num_weights()` / `name()` of the loaded top node). + +use crate::facet::{FacetCascade, FacetTier}; +use crate::ogar_codebook::compose_classid; + +/// The `network_layer` container concept in the `0x08XX` OCR domain +/// ([`crate::ogar_codebook`]). One canon-high slot for the KIND "a Tesseract +/// network layer"; the SPECIFIC subclass (Series / LSTM / …) is the classid's +/// custom-low half = the [`NetworkType`] ordinal, NOT 27 codebook slots (the +/// "container kinds, not content" mint discipline). `compose_classid(NETWORK_LAYER, +/// nt as u16)` is the node's `facet_classid`. +/// +/// **Custom-half invariant:** a network-layer classid's custom-low half is the +/// [`NetworkType`] ordinal — a recognizer-INTERNAL facet discriminant, never a +/// render/RBAC app-prefix ([`classid_app_prefix`](crate::ogar_codebook::classid_app_prefix)). +/// These facet classids stay inside the OCR recognizer's SoA; they are never fed +/// to the app-prefix render path (which would misread ordinal 14 as an `AppPrefix`). +/// The value is kept in lock-step with the codebook by +/// [`tests::network_layer_const_matches_codebook`]. +pub const NETWORK_LAYER: u16 = 0x0804; + +/// `NetworkType` — the serialized layer-type discriminant (`network.h:41-78`, +/// `enum NetworkType`). 
The ordinal IS the discriminant and is stable across +/// versions (the `kTypeNames` string, written on the wire, decouples the on-disk +/// form from the enum order — `network.cpp:56-75`). `NT_NONE`(0) is the naked base +/// class / "invalid" sentinel; `NT_COUNT` is the array size, not a real type. +#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)] +#[repr(u8)] +pub enum NetworkType { + /// The naked base class ("Invalid" on the wire) — the 0 sentinel. + None = 0, + /// Inputs from an image. + Input = 1, + /// Duplicates inputs in a sliding-window neighborhood. + Convolve = 2, + /// Chooses the max result from a rectangle. + Maxpool = 3, + /// Runs networks in parallel. + Parallel = 4, + /// Runs identical networks in parallel. + Replicated = 5, + /// Runs LTR and RTL LSTMs in parallel. + ParRlLstm = 6, + /// Runs Up and Down LSTMs in parallel. + ParUdLstm = 7, + /// Runs 4 LSTMs in parallel. + Par2dLstm = 8, + /// Executes a sequence of layers. + Series = 9, + /// Scales the time/y size but makes the output deeper. + Reconfig = 10, + /// Reverses the x direction of the inputs/outputs. + XReversed = 11, + /// Reverses the y-direction of the inputs/outputs. + YReversed = 12, + /// Transposes x and y (for a single op). + XyTranspose = 13, + /// Long-Short-Term-Memory block. + Lstm = 14, + /// LSTM that only keeps its last output. + LstmSummary = 15, + /// Fully connected logistic nonlinearity. + Logistic = 16, + /// Fully connected rect-lin version of logistic. + PosClip = 17, + /// Fully connected rect-lin version of tanh. + SymClip = 18, + /// Fully connected with tanh nonlinearity. + Tanh = 19, + /// Fully connected with rectifier nonlinearity. + Relu = 20, + /// Fully connected with no nonlinearity. + Linear = 21, + /// Softmax with exponential normalization, with CTC. + Softmax = 22, + /// Softmax with exponential normalization, no CTC. + SoftmaxNoCtc = 23, + /// 1-d LSTM with built-in fully connected softmax. 
+ LstmSoftmax = 24, + /// 1-d LSTM with built-in binary-encoded softmax. + LstmSoftmaxEncoded = 25, + /// A TensorFlow graph encapsulated as a Tesseract network. + TensorFlow = 26, +} + +/// The number of real `NetworkType`s (`NT_COUNT`, `network.h:78`) — the length of +/// the [`NetworkType::TYPE_NAMES`] table. +pub const NT_COUNT: usize = 27; + +impl NetworkType { + /// The on-wire `kTypeNames` strings (`network.cpp:60-75`), indexed by ordinal. + /// This is the serialization discriminant matched by `getNetworkType` + /// (`network.cpp:191-209`) — index-aligned with the enum, so + /// `TYPE_NAMES[nt as usize] == nt.type_name()`. + pub const TYPE_NAMES: [&'static str; NT_COUNT] = [ + "Invalid", + "Input", + "Convolve", + "Maxpool", + "Parallel", + "Replicated", + "ParBidiLSTM", + "DepParUDLSTM", + "Par2dLSTM", + "Series", + "Reconfig", + "RTLReversed", + "TTBReversed", + "XYTranspose", + "LSTM", + "SummLSTM", + "Logistic", + "LinLogistic", + "LinTanh", + "Tanh", + "Relu", + "Linear", + "Softmax", + "SoftmaxNoCTC", + "LSTMSoftmax", + "LSTMBinarySoftmax", + "TensorFlow", + ]; + + /// This type's `kTypeNames` string (the inverse of [`from_type_name`]). + /// + /// [`from_type_name`]: NetworkType::from_type_name + #[inline] + #[must_use] + pub const fn type_name(self) -> &'static str { + Self::TYPE_NAMES[self as usize] + } + + /// Resolve an ordinal (`0..NT_COUNT`) to a [`NetworkType`] — the enum + /// discriminant. `None` for `NT_COUNT` or beyond. + #[inline] + #[must_use] + pub const fn from_ordinal(o: u8) -> Option<Self> { + // One arm per real ordinal; the `_` arm maps `NT_COUNT` and beyond to `None`. 
+ Some(match o { + 0 => NetworkType::None, + 1 => NetworkType::Input, + 2 => NetworkType::Convolve, + 3 => NetworkType::Maxpool, + 4 => NetworkType::Parallel, + 5 => NetworkType::Replicated, + 6 => NetworkType::ParRlLstm, + 7 => NetworkType::ParUdLstm, + 8 => NetworkType::Par2dLstm, + 9 => NetworkType::Series, + 10 => NetworkType::Reconfig, + 11 => NetworkType::XReversed, + 12 => NetworkType::YReversed, + 13 => NetworkType::XyTranspose, + 14 => NetworkType::Lstm, + 15 => NetworkType::LstmSummary, + 16 => NetworkType::Logistic, + 17 => NetworkType::PosClip, + 18 => NetworkType::SymClip, + 19 => NetworkType::Tanh, + 20 => NetworkType::Relu, + 21 => NetworkType::Linear, + 22 => NetworkType::Softmax, + 23 => NetworkType::SoftmaxNoCtc, + 24 => NetworkType::LstmSoftmax, + 25 => NetworkType::LstmSoftmaxEncoded, + 26 => NetworkType::TensorFlow, + _ => return None, + }) + } + + /// Resolve an on-wire type name to a [`NetworkType`] — the exact + /// `getNetworkType` match loop (`network.cpp:201`): linear scan of + /// [`TYPE_NAMES`]. `None` is `getNetworkType`'s `data == NT_COUNT` "Invalid + /// network layer type" path. + /// + /// [`TYPE_NAMES`]: NetworkType::TYPE_NAMES + #[inline] + #[must_use] + pub fn from_type_name(name: &str) -> Option<Self> { + let mut i = 0; + while i < NT_COUNT { + if Self::TYPE_NAMES[i] == name { + return Self::from_ordinal(i as u8); + } + i += 1; + } + None + } + + /// This layer type's full `classid` in the OCR domain: canon = + /// [`NETWORK_LAYER`], custom = the type ordinal. The node's `facet_classid`; + /// the `invoke_network` dispatch (the `invoke_unicharset` keystone analog) + /// resolves the subclass by [`classid_custom`](crate::ogar_codebook::classid_custom). + #[inline] + #[must_use] + pub fn classid(self) -> u32 { + compose_classid(NETWORK_LAYER, self as u16) + } +} + +/// A parse error in a serialized [`NetworkHeader`]. 
+#[derive(Debug, Clone, Copy, PartialEq, Eq)] +pub enum NetworkError { + /// The buffer ended before the base header was fully read. + UnexpectedEof, + /// The `tag` byte was not `NT_NONE`(0) — an unversioned/foreign blob + /// (`getNetworkType` only branches into the string path when `tag == 0`). + BadTag(i8), + /// The `type_name` string did not match any [`NetworkType::TYPE_NAMES`] + /// entry (`getNetworkType`'s `data == NT_COUNT` path). + UnknownType, + /// A negative dimension (`ni`/`no`/`num_weights` are non-negative for any + /// serialized model). + NegativeDim, +} + +/// The base `Network` header shared by every layer node — the fields +/// `Network::CreateFromFile` reads before dispatching to the subclass +/// (`network.cpp:214-248`). +#[derive(Debug, Clone, PartialEq, Eq)] +pub struct NetworkHeader { + /// The layer subclass, from the `kTypeNames` on-wire discriminant. + pub ntype: NetworkType, + /// `TrainingState` byte (recognizer models serialize `TS_DISABLED`). + pub training: i8, + /// Whether the node needs to output back-deltas (`0`/`1`). + pub needs_backprop: bool, + /// `NetworkFlags` bits. + pub network_flags: i32, + /// Number of input values. + pub ni: i32, + /// Number of output values. + pub no: i32, + /// Number of weights in THIS node and its sub-network (cumulative). + pub num_weights: i32, + /// The layer's unique name. + pub name: String, +} + +impl NetworkHeader { + /// Parse the base header from the front of `bytes`, returning the header and + /// the number of bytes consumed (the offset at which the subclass payload + /// begins). Rejects a non-zero tag, an unknown type name, and negative dims + /// — a serialized model never carries them, so they signal a bad/foreign + /// blob rather than silently mis-parsing (stricter than the C++ factory, + /// which trusts its own output). 
+ pub fn from_le_bytes(bytes: &[u8]) -> Result<(NetworkHeader, usize), NetworkError> { + let mut r = ByteReader::new(bytes); + let tag = r.read_i8()?; + if tag != 0 { + return Err(NetworkError::BadTag(tag)); + } + let type_name = r.read_string()?; + let ntype = NetworkType::from_type_name(&type_name).ok_or(NetworkError::UnknownType)?; + let training = r.read_i8()?; + let needs_backprop = r.read_i8()? != 0; + let network_flags = r.read_i32()?; + let ni = r.read_i32()?; + let no = r.read_i32()?; + let num_weights = r.read_i32()?; + if ni < 0 || no < 0 || num_weights < 0 { + return Err(NetworkError::NegativeDim); + } + let name = r.read_string()?; + Ok(( + NetworkHeader { + ntype, + training, + needs_backprop, + network_flags, + ni, + no, + num_weights, + name, + }, + r.pos, + )) + } + + /// Sink this node onto the V3 SoA as a content-blind [`FacetCascade`] — the + /// "16-byte tenant, classid + 12 bytes" substrate, read under + /// [`CascadeShape::G6D2`](crate::facet::CascadeShape::G6D2) (six `u16` tiers). + /// + /// The `network_layer` ClassView projection of the 6 tiers: + /// + /// | tier | 8:8 `u16` | field | + /// |---|---|---| + /// | 0 | `ni` | inputs | + /// | 1 | `no` | outputs | + /// | 2 | `network_flags & 0xFFFF` | behaviour flags | + /// | 3 | `num_weights` low 16 | cumulative weight count (lo) | + /// | 4 | `num_weights` high 16 | cumulative weight count (hi) | + /// | 5 | `training : needs_backprop` | lifecycle bytes (`lo:hi`) | + /// + /// `facet_classid` = [`NetworkType::classid`] (`NETWORK_LAYER : ntype`). The + /// **name** is NOT bundled (`I-VSA-IDENTITIES`: the facet is the identity + + /// typed dims; the name string is content in an out-of-line store keyed by the + /// classid+identity). The **weights** are out-of-line too — only their `count` + /// rides tiers 3-4; the blob is a separate Lance column. 
`ni`/`no`/flags are + /// truncated to `u16` (every real eng.lstm dim is `< 65536`); a hypothetical + /// `> u16` model would carry the overflow out-of-line, same as the weights. + #[inline] + #[must_use] + pub fn to_facet(&self) -> FacetCascade { + // ni/no are the semantic dims that MUST round-trip; every real eng.lstm dim + // is < 65536, but a hypothetical wider model would truncate here silently. + // Fail loudly in debug (mirrors the CANON mint-path `debug_assert`); a real + // out-of-range dim is the trigger to add an out-of-line escape. `ni`/`no` are + // non-negative (`NegativeDim` is rejected in `from_le_bytes`). `network_flags` + // is a bitmask whose low-16 is the documented projection, not a dim, so it is + // deliberately not asserted. The prefix-routing readouts (`hi_distance` etc.) + // are NOT meaningful across the tiers-3/4 `num_weights` split — this facet is + // read as 6× concatenated-`u16`, not as `hi`/`lo` prefix chains. + debug_assert!( + (self.ni as u32) <= u16::MAX as u32 && (self.no as u32) <= u16::MAX as u32, + "network ni/no exceeds u16 — needs an out-of-line escape (network.rs::to_facet)" + ); + let nw = self.num_weights as u32; + FacetCascade { + facet_classid: self.ntype.classid(), + tiers: [ + tier_u16(self.ni as u32 as u16), + tier_u16(self.no as u32 as u16), + tier_u16(self.network_flags as u32 as u16), + tier_u16((nw & 0xFFFF) as u16), + tier_u16((nw >> 16) as u16), + FacetTier { + lo: self.training as u8, + hi: u8::from(self.needs_backprop), + }, + ], + } + } + + /// A one-line byte-parity dump (`type ni no num_weights name`) — the surface + /// diffed against the C++ `network_spec_oracle`. + #[must_use] + pub fn dump(&self) -> String { + format!( + "{} ni={} no={} num_weights={} name={}", + self.ntype.type_name(), + self.ni, + self.no, + self.num_weights, + self.name + ) + } +} + +/// One 8:8 [`FacetTier`] carrying a `u16` as `(hi, lo)` — the concatenated-`u16` +/// projection ([`FacetTier::as_u16`] is its inverse). 
+#[inline] +const fn tier_u16(v: u16) -> FacetTier { + FacetTier { + lo: (v & 0xFF) as u8, + hi: (v >> 8) as u8, + } +} + +/// A forward-only little-endian cursor (the Core's per-module binary-read idiom; +/// mirrors [`crate::unicharcompress`]'s reader). `TFile::swap_ == false` on a LE +/// host, so scalars are raw `from_le_bytes`; a `std::string` is a `u32` length +/// prefix then that many raw bytes (`serialis.cpp:94-110`). +struct ByteReader<'a> { + bytes: &'a [u8], + pos: usize, +} + +impl<'a> ByteReader<'a> { + fn new(bytes: &'a [u8]) -> Self { + Self { bytes, pos: 0 } + } + + fn take(&mut self, n: usize) -> Result<&'a [u8], NetworkError> { + let end = self.pos.checked_add(n).ok_or(NetworkError::UnexpectedEof)?; + let slice = self + .bytes + .get(self.pos..end) + .ok_or(NetworkError::UnexpectedEof)?; + self.pos = end; + Ok(slice) + } + + fn read_i8(&mut self) -> Result<i8, NetworkError> { + Ok(self.take(1)?[0] as i8) + } + + fn read_i32(&mut self) -> Result<i32, NetworkError> { + let arr: [u8; 4] = self + .take(4)? + .try_into() + .map_err(|_| NetworkError::UnexpectedEof)?; + Ok(i32::from_le_bytes(arr)) + } + + fn read_u32(&mut self) -> Result<u32, NetworkError> { + let arr: [u8; 4] = self + .take(4)? + .try_into() + .map_err(|_| NetworkError::UnexpectedEof)?; + Ok(u32::from_le_bytes(arr)) + } + + /// A `TFile` `std::string`: `u32 len` then `len` raw bytes (`serialis.cpp:94-110`). + fn read_string(&mut self) -> Result<String, NetworkError> { + let len = self.read_u32()? 
as usize; + let bytes = self.take(len)?; + Ok(String::from_utf8_lossy(bytes).into_owned()) + } +} + +#[cfg(test)] +mod tests { + use super::*; + use crate::facet::CascadeShape; + use crate::ogar_codebook::{canonical_concept_id, classid_canon, classid_custom}; + + #[test] + fn network_layer_const_matches_codebook() { + // The compile-lock: NETWORK_LAYER (used to build every facet_classid) must + // equal the codebook's `network_layer` mint — else a rename/renumber on one + // side silently mis-routes every network node's classid (core-first-architect + // hygiene finding). The codebook is the single source of truth. + assert_eq!( + canonical_concept_id("network_layer"), + Some(NETWORK_LAYER), + "network_layer const drifted from the ogar_codebook mint" + ); + } + + /// Build the base header a `Network::Serialize` would write for a node. + fn header_bytes(type_name: &str, ni: i32, no: i32, num_weights: i32, name: &str) -> Vec<u8> { + let mut b = Vec::new(); + b.push(0u8); // tag = NT_NONE + b.extend_from_slice(&(type_name.len() as u32).to_le_bytes()); + b.extend_from_slice(type_name.as_bytes()); + b.push(0u8); // training = TS_DISABLED + b.push(0u8); // needs_backprop = false + b.extend_from_slice(&192i32.to_le_bytes()); // network_flags + b.extend_from_slice(&ni.to_le_bytes()); + b.extend_from_slice(&no.to_le_bytes()); + b.extend_from_slice(&num_weights.to_le_bytes()); + b.extend_from_slice(&(name.len() as u32).to_le_bytes()); + b.extend_from_slice(name.as_bytes()); + b + } + + #[test] + fn type_names_round_trip_and_are_ordinal_aligned() { + assert_eq!(NetworkType::TYPE_NAMES.len(), NT_COUNT); + for o in 0..NT_COUNT as u8 { + let nt = NetworkType::from_ordinal(o).expect("real ordinal"); + assert_eq!(nt as u8, o, "discriminant == ordinal"); + assert_eq!(nt.type_name(), NetworkType::TYPE_NAMES[o as usize]); + assert_eq!(NetworkType::from_type_name(nt.type_name()), Some(nt)); + } + assert_eq!(NetworkType::from_ordinal(NT_COUNT as u8), None); 
assert_eq!(NetworkType::from_type_name("NotAType"), None); + // The wire discriminant is decoupled from the enum name (kTypeNames). + assert_eq!( + NetworkType::from_type_name("SummLSTM"), + Some(NetworkType::LstmSummary) + ); + assert_eq!(NetworkType::None.type_name(), "Invalid"); + } + + #[test] + fn parses_pre_registered_eng_lstm_outer_header() { + // The first-principles pre-registration: eng.lstm's outermost node + // (module docs) — Series, ni=36, no=111, num_weights=385807. Built here + // as the exact bytes Network::Serialize writes; the real-file parity is + // the network_dump example vs the libtesseract oracle. + let bytes = header_bytes("Series", 36, 111, 385807, "root"); + let (h, consumed) = NetworkHeader::from_le_bytes(&bytes).expect("valid header"); + assert_eq!(h.ntype, NetworkType::Series); + assert_eq!(h.ni, 36); + assert_eq!(h.no, 111); + assert_eq!(h.num_weights, 385807); + assert_eq!(h.name, "root"); + assert_eq!( + consumed, + bytes.len(), + "base header consumes the whole prefix" + ); + assert_eq!(h.dump(), "Series ni=36 no=111 num_weights=385807 name=root"); + } + + #[test] + fn header_sinks_onto_g6d2_facet_losslessly() { + let (h, _) = NetworkHeader::from_le_bytes(&header_bytes("LSTM", 48, 96, 55296, "L1")) + .expect("valid"); + let f = h.to_facet(); + + // facet_classid = network_layer(0x0804) canon : LSTM(14) custom. + assert_eq!(classid_canon(f.facet_classid), NETWORK_LAYER); + assert_eq!(classid_custom(f.facet_classid), NetworkType::Lstm as u16); + assert_eq!(f.facet_classid, NetworkType::Lstm.classid()); + + // Read the tiers back under the operator's 6x8:8 (G6D2) shape. + let s = CascadeShape::G6D2; + assert_eq!(s.levels(), 2, "6x8:8 = 6 groups x 2 levels"); + assert_eq!(f.tiers[0].as_u16(), 48, "tier0 = ni"); + assert_eq!(f.tiers[1].as_u16(), 96, "tier1 = no"); + assert_eq!(f.tiers[2].as_u16(), 192, "tier2 = network_flags low16"); + // num_weights 55296 = 0x0000_D800 → lo=0xD800(55296), hi=0. 
+ let nw = (f.tiers[3].as_u16() as u32) | ((f.tiers[4].as_u16() as u32) << 16); + assert_eq!(nw, 55296, "tiers 3-4 = num_weights u32"); + assert_eq!(f.tiers[5].lo, 0, "training byte"); + assert_eq!(f.tiers[5].hi, 0, "needs_backprop byte"); + + // The facet is exactly 16 bytes: classid(4) + 6x(8:8)=12. + assert_eq!(f.to_bytes().len(), 16); + } + + #[test] + fn num_weights_high_half_survives_the_two_tiers() { + // A cumulative count above u16 (the eng.lstm root is 385807) round-trips + // through tiers 3-4 — the reason num_weights takes two 8:8 tiers. + let (h, _) = NetworkHeader::from_le_bytes(&header_bytes("Series", 36, 111, 385807, "r")) + .expect("ok"); + let f = h.to_facet(); + let nw = (f.tiers[3].as_u16() as u32) | ((f.tiers[4].as_u16() as u32) << 16); + assert_eq!(nw, 385807); + assert!(f.tiers[4].as_u16() > 0, "high half is non-zero for 385807"); + } + + #[test] + fn rejects_bad_tag_and_short_and_unknown() { + // Non-zero tag → BadTag. + let mut b = header_bytes("Series", 1, 1, 0, "x"); + b[0] = 7; + assert_eq!( + NetworkHeader::from_le_bytes(&b), + Err(NetworkError::BadTag(7)) + ); + + // Truncated mid-header → UnexpectedEof. + let full = header_bytes("Series", 1, 1, 0, "x"); + assert_eq!( + NetworkHeader::from_le_bytes(&full[..10]), + Err(NetworkError::UnexpectedEof) + ); + + // Unknown type string → UnknownType. + let b = header_bytes("Frobnicate", 1, 1, 0, "x"); + assert_eq!( + NetworkHeader::from_le_bytes(&b), + Err(NetworkError::UnknownType) + ); + + // Negative dim → NegativeDim. 
+ let b = header_bytes("Series", -1, 1, 0, "x"); + assert_eq!( + NetworkHeader::from_le_bytes(&b), + Err(NetworkError::NegativeDim) + ); + } +} diff --git a/crates/lance-graph-contract/src/ogar_codebook.rs b/crates/lance-graph-contract/src/ogar_codebook.rs index 8ebf7450..4f83a9aa 100644 --- a/crates/lance-graph-contract/src/ogar_codebook.rs +++ b/crates/lance-graph-contract/src/ogar_codebook.rs @@ -486,6 +486,12 @@ pub const CODEBOOK: &[(&str, u16)] = &[ ("unicharset", 0x0801), ("recoder", 0x0802), ("charset", 0x0803), + // `network_layer` = the KIND "a Tesseract recognizer network layer" (Series / + // LSTM / Convolve / …). ONE container slot: the specific subclass is the + // classid's custom-low half (the `NetworkType` ordinal, `network::NETWORK_LAYER`), + // not 27 slots — the layer graph sinks onto `FacetCascade` tenants (the + // ruff→OGAR network harvest lands here). + ("network_layer", 0x0804), // ── 0x09XX — Health domain (MedCare; OGIT NTO/Healthcare promotion) ── ("patient", 0x0901), ("diagnosis", 0x0902), From 4e14db01b00e23de19a1e6b8316cb3ef0c71326d Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 4 Jul 2026 14:45:59 +0000 Subject: [PATCH 7/7] =?UTF-8?q?board:=20E-OCR-FULLYCONNECTED-1=20=E2=80=94?= =?UTF-8?q?=20recognizer=20Leaf=204=20byte-parity=20green?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit FullyConnected::Forward (int8 path) = activation(WeightMatrix·input), the first complete network layer, composing the two proven halves (Leaf 2 WeightMatrix + Leaf 3 activations). Byte-parity green across all 7 activations + 2 shapes vs a libtesseract oracle running the REAL MatrixDotVector+FuncInplace. Code lands in tesseract-recognizer (the compute crate); board hygiene lands here per the CLAUDE.md rule. 
Co-Authored-By: Claude Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1 --- .claude/board/EPIPHANIES.md | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/.claude/board/EPIPHANIES.md b/.claude/board/EPIPHANIES.md index 80d7447b..8bef0d94 100644 --- a/.claude/board/EPIPHANIES.md +++ b/.claude/board/EPIPHANIES.md @@ -177,6 +177,15 @@ New knowledge doc `.claude/knowledge/data-shape-etymology.md` — the shape-and- **Status:** FINDING (operator ruling on the shape — "yes valueschema") + embedded CONJECTURE (the preset-vs-dispatch probe) Operator floated keeping the fast/cheap V2 substrate for huge data alongside V3, "switched by classid," so V3 can eventually teach V2 how to be better. Resolved: the switch is NOT a new carrier. `ClassView::value_schema(classid) -> ValueSchema` (`canonical_node.rs:894`, `class_view.rs:395`) is ALREADY classid→substrate-shape resolution by trait dispatch — resolved, never stored on-wire (adding a variant costs NO `ENVELOPE_LAYOUT_VERSION` bump), and the four existing variants ALREADY form a substrate ladder: `Bootstrap`(empty, key+edges only) / `Compressed`(cold codec, **no hot lifecycle columns**) / `Cognitive`(hot thinking: Meta+Qualia+Fingerprint+Energy+Plasticity+EntityType) / `Full`(every tenant). So "V2 fast/cheap bulk" = classids that resolve to the LEAN end (Bootstrap/Compressed — no ownership/lifecycle tenants); "V3 witnessed/owned" = Cognitive/Full. **A `ClassRoutingDTO` is rejected:** a DTO is a serialized carried payload, but substrate choice is a RESOLUTION (firewall ADR-022, "contracts compile types, the event never leaves"); and per the three-tier canon nothing crosses mailbox boundaries — every reader re-resolves the substrate from the classid already in the 16-byte key, so there is no boundary for a carrier to travel. `dto-soa-savant` + AGI-as-glove name the new-struct-instead-of-resolution shape exactly. 
**0x1000 is NOT the switch:** canon fixes it as a temporary adoption MONITOR ("monitor, never a semantic"; retires at P4/100%; MODULE-TABLE flags that a future canon==0x1000 aliases the marker) — substrate routes on the classid's concept-half → ValueSchema, never on the monitor bit. **The deep form (CONJECTURE — PROBE preset-vs-dispatch):** the WRITE PATH may be a pure FUNCTION of the schema — a class whose ValueSchema carries no ownership/lifecycle tenants has nothing for the kanban/WAL to witness, so it naturally collapses to the fast private-merge write; Cognitive/Full carry the tenants that REQUIRE the owned/witnessed path. If that holds, substrate = ValueSchema full stop (no separate `Substrate` enum, no flag). The gate: confirm the write path is derivable from which tenants are live vs needing an independent resolution — evidence base is the onebrc arc itself (lane F private-merge/no-tenants vs lanes G–J owned/witnessed = the two write paths already measured). Open sub-question: whether bulk needs a variant leaner than `Compressed`, or Bootstrap/Compressed already suffice. **"V3 teaches V2" (deferred, needs mechanism):** V3's kanban WAL + ownership journal is the profiling signal (where contention lands, which fields are touched) to optimize the lean V2 layout — the instrumented-teacher / stripped-student loop; no code reads the WAL back into a layout optimizer yet. Net: at most a new `ValueSchema` variant through the existing `value_schema(classid)` door; possibly not even that. +## 2026-07-04 — E-OCR-FULLYCONNECTED-1 — recognizer Leaf 4: `FullyConnected::Forward` (int8 path) is byte-parity green vs libtesseract — the first COMPLETE layer, the composition of the two proven halves +**Status:** FINDING (byte-parity proven vs libtesseract 5.3.4; `tesseract-recognizer`, tested) + +The first COMPLETE network layer ships. 
`tesseract_recognizer::fully_connected_forward` transcodes `FullyConnected::ForwardTimeStep(const int8_t*, …)` (`fullyconnected.cpp:230-234`) — which is EXACTLY two operations in order with NO intermediate step: `weights_.MatrixDotVector(i_input, output_line)` (Leaf 2 `WeightMatrix::forward`) then `ForwardTimeStep(t, output_line)` (Leaf 3 activation, dispatched on the layer's `NetworkType`, `fullyconnected.cpp:203-219`). So Leaf 4 = `activation(W·u)`, composing the two independently-proven halves; what it NEWLY proves is the composition (order + no scaling/quant between matmul and activation). + +Byte-parity **GREEN** across all 7 activations (`tanh`/`logistic`/`relu`/`softmax`/`posclip`/`symclip`/`linear`) at 8×5 AND the larger 48×49 shape, diffing f32 bit-patterns. The oracle (`/tmp/fc_oracle.cpp`, built `-DFAST_FLOAT`) runs the REAL `WeightMatrix::MatrixDotVector` then the REAL `FuncInplace` / `SoftmaxInPlace` — the exact two library calls `ForwardTimeStep` makes — so the diff is an independent proof of the composition, not a re-implementation. The `NT_POSCLIP→clip_f(clamp[0,1])` / `NT_SYMCLIP→clip_g(clamp[-1,1])` mapping was verified against `functions.h:85-95/124-134` (not swapped). +5 unit tests (18 total); clippy `-D warnings` + fmt clean (`-p tesseract-recognizer` scoped). + +Design detail: `FcActivation` (the 8 FullyConnected variants) is named LOCALLY in the compute crate — it does NOT depend on the Core's `NetworkType`; the boundary is the stable u8 ordinal (`FcActivation::from_network_type_ordinal`). This is the compute vocabulary of "which non-linearity," NOT a parallel network model (the graph structure stays in `lance_graph_contract::network`, E-OCR-NETWORK-SINK-1). **Next Leaf 5 = `LSTM::Forward`** (the gates CI/GI/GF1/GO + cell `c=clip(f·c+i·g,±100)` + `h=o·tanh(c)` recurrent state) — the recurrent counterpart, reusing `FullyConnected::Forward` for each gate. 
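The Leaf 4 composition in this entry, `activation(W·u)` with nothing between the matmul and the non-linearity, can be sketched in plain f32 Rust. This is a hedged illustration only: `Activation`, `mat_dot_vec`, and `fc_forward_sketch` are hypothetical names that are not taken from `tesseract-recognizer`, the float mat-vec stands in for the int8 `MatrixDotVector` path, and no byte-parity with libtesseract is claimed. The clip mappings follow the ones stated above (posclip clamps to [0,1], symclip clamps to [-1,1]).

```rust
/// Hypothetical activation set for this sketch (not the crate's `FcActivation`).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Activation {
    Tanh,
    Logistic,
    Relu,
    Softmax,
    PosClip,
    SymClip,
    Linear,
}

/// Plain f32 matrix-vector product; `weights` is row-major, `rows x cols`.
/// Assumption: a float stand-in for the int8 matmul, with no bias or scales.
fn mat_dot_vec(weights: &[f32], rows: usize, cols: usize, input: &[f32]) -> Vec<f32> {
    assert!(cols > 0, "cols must be non-zero");
    assert_eq!(weights.len(), rows * cols, "weights must be rows x cols");
    assert_eq!(input.len(), cols, "input length must equal cols");
    weights
        .chunks_exact(cols)
        .map(|row| row.iter().zip(input).map(|(w, x)| w * x).sum::<f32>())
        .collect()
}

/// Apply the non-linearity in place, directly on the matmul output.
fn apply_activation(act: Activation, v: &mut [f32]) {
    match act {
        Activation::Tanh => v.iter_mut().for_each(|x| *x = (*x).tanh()),
        Activation::Logistic => v.iter_mut().for_each(|x| *x = 1.0 / (1.0 + (-*x).exp())),
        Activation::Relu => v.iter_mut().for_each(|x| *x = (*x).max(0.0)),
        Activation::PosClip => v.iter_mut().for_each(|x| *x = (*x).clamp(0.0, 1.0)),
        Activation::SymClip => v.iter_mut().for_each(|x| *x = (*x).clamp(-1.0, 1.0)),
        Activation::Linear => {}
        Activation::Softmax => {
            // Subtract the max before exponentiating for numerical stability.
            let max = v.iter().copied().fold(f32::NEG_INFINITY, f32::max);
            let mut sum = 0.0f32;
            for x in v.iter_mut() {
                *x = (*x - max).exp();
                sum += *x;
            }
            if sum > 0.0 {
                for x in v.iter_mut() {
                    *x /= sum;
                }
            }
        }
    }
}

/// The composition itself: matmul first, activation second, nothing in between.
fn fc_forward_sketch(
    act: Activation,
    weights: &[f32],
    rows: usize,
    cols: usize,
    input: &[f32],
) -> Vec<f32> {
    let mut out = mat_dot_vec(weights, rows, cols, input);
    apply_activation(act, &mut out);
    out
}

fn main() {
    // 2x2 identity weights, so the output is just the activated input.
    let w = [1.0f32, 0.0, 0.0, 1.0];
    let out = fc_forward_sketch(Activation::SymClip, &w, 2, 2, &[3.0, -0.25]);
    println!("{out:?}");
}
```

Keeping the activation as a separate in-place pass mirrors the two-call structure the entry describes, which is why a per-gate reuse in a later recurrent layer would only need to pick a different `Activation`.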
Then `Series`/`Parallel` graph walk → `recodebeam` → the code lattice `recoded_to_text` eats. Cross-ref: `E-OCR-WEIGHTMATRIX-1` (Leaf 2), `E-OCR-ACTIVATION-1` (Leaf 3), `E-OCR-MATDOTVEC-1` (Leaf 1), `E-OCR-NETWORK-SINK-1` (the structure side). Plan: `tesseract-rs/.claude/plans/recognizer-core-shape-v1.md` (Leaf 4 EXECUTED). Branch `claude/happy-hamilton-0azlw4`. + ## 2026-07-04 — E-OCR-NETWORK-SINK-1 — the Tesseract `Network` layer graph sinks onto V3 SoA via ruff→OGAR: base-header parse byte-parity green + `FacetCascade` (16 B) sink, NOT a hand-rolled enum **Status:** FINDING (byte-parity proven vs libtesseract 5.3.4; `lance-graph-contract`, tested)
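The tier layout that the `header_sinks_onto_g6d2_facet_losslessly` and `num_weights_high_half_survives_the_two_tiers` tests in this patch assert can be sketched as follows. This is a minimal stand-in under stated assumptions: `Tier`, `pack_tiers`, `unpack_num_weights`, and `tiers_to_bytes` are hypothetical names rather than the `lance-graph-contract` API, and the lo-then-hi byte order inside each tier is assumed. Only the field-to-tier assignment (ni, no, flags low16, num_weights split over tiers 3-4, training and needs_backprop bytes in tier 5) is taken from the tests.

```rust
/// Hypothetical 8:8 tier: two bytes that read back as one u16.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct Tier {
    lo: u8,
    hi: u8,
}

impl Tier {
    fn from_u16(v: u16) -> Self {
        Tier { lo: v as u8, hi: (v >> 8) as u8 }
    }

    fn as_u16(self) -> u16 {
        (self.lo as u16) | ((self.hi as u16) << 8)
    }
}

/// Pack the header fields into six tiers, following the assignment the tests assert.
fn pack_tiers(
    ni: u16,
    no: u16,
    network_flags: i32,
    num_weights: u32,
    training: u8,
    needs_backprop: bool,
) -> [Tier; 6] {
    [
        Tier::from_u16(ni),                         // tier0 = ni
        Tier::from_u16(no),                         // tier1 = no
        Tier::from_u16(network_flags as u16),       // tier2 = flags, low 16 bits only
        Tier::from_u16(num_weights as u16),         // tier3 = num_weights low half
        Tier::from_u16((num_weights >> 16) as u16), // tier4 = num_weights high half
        Tier { lo: training, hi: needs_backprop as u8 }, // tier5 = two state bytes
    ]
}

/// Rebuild the u32 weight count from tiers 3 and 4.
fn unpack_num_weights(tiers: &[Tier; 6]) -> u32 {
    (tiers[3].as_u16() as u32) | ((tiers[4].as_u16() as u32) << 16)
}

/// Flatten the tiers to 12 bytes (lo then hi per tier; the order is an assumption).
fn tiers_to_bytes(tiers: &[Tier; 6]) -> Vec<u8> {
    tiers.iter().flat_map(|t| [t.lo, t.hi]).collect()
}

fn main() {
    // The eng.lstm root values used in the tests: Series, ni=36, no=111.
    let tiers = pack_tiers(36, 111, 192, 385_807, 0, false);
    println!("num_weights = {}", unpack_num_weights(&tiers));
    println!("tier bytes = {}", tiers_to_bytes(&tiers).len());
}
```

A count such as 385807 does not fit in one 16-bit tier, which is why the split across two tiers is needed; the 12 tier bytes plus the 4-byte classid give the 16-byte facet the tests check.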