Merged
39 changes: 39 additions & 0 deletions .claude/board/EPIPHANIES.md
@@ -1,3 +1,42 @@
## 2026-07-04 — E-OCR-RECODEBEAM-1 — recognizer Leaf 7b: `RecodeBeamSearch::Decode` (the non-dict CTC beam) is byte-parity green — logits → labels → text, the "recognizer produces text" milestone (the hardest leaf, 1382 lines)
**Status:** FINDING (byte-parity proven vs libtesseract 5.3.4; `tesseract-core`, tested)

The CTC beam decode ships — the single hardest recognizer leaf (`recodebeam.cpp`, 1382 lines), the step that turns the LSTM's per-timestep softmax logits into a code sequence → text. `tesseract_core::recodebeam::RecodeBeamSearch` transcodes the **non-dictionary** path (`dict_ == nullptr`, `permuter` fixed at `TOP_CHOICE_PERM`, the dawg beams never populated). Placed in **tesseract-core** (not `tesseract-recognizer`): the beam is compute but **SIMD-free and recoder-coupled**, so it belongs next to the recoder tables it walks (Leaf 7a's `is_valid_first_code`/`get_final_codes`/`get_next_codes`) and the `recoded_to_text` (`E-CPP-PARITY-7`) step it feeds. Transcoded whole: `ComputeTopN` (top-n flags via the min-heap), `DecodeStep` (per-timestep, top-n-group fallback), `ContinueContext` (the beam-extension crux — prefix walk + `get_final_codes`/`get_next_codes` continuations + the dup/null CTC combination rules), `ContinueUnichar`, `PushDupOrNoDawgIfBetter`, `PushHeapIfBetter` (`score = cert + prev.score`, evict-worst-when-full), `ComputeCodeHash` (the u64 rolling mix for dup-path removal), `UpdateHeapIfMatched` (+`Reshuffle`), `ExtractBestPaths`/`ExtractBestPathAsLabels`, plus a faithful `GenericHeap<KDPairInc<f64,…>>` (binary min-heap, exact `SiftUp`/`SiftDown`/`Reshuffle` so the `get(i)` internal order the decode walks matches). The C++ borrowed-`prev`-pointer lattice becomes a safe **arena** (`Vec<RecodeNode>` + `prev: Option<u32>` index) — no `unsafe`, no dangling across heap reshuffles.
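
The arena idea can be sketched as follows. This is a minimal illustration, not the shipped `RecodeNode`: the three-field node, the `Arena` type and its `push`/`path` methods are hypothetical names, and only the `prev: Option<u32>` index and the `score = cert + prev.score` rule come from the entry above.

```rust
/// Hypothetical, trimmed-down node; the real `RecodeNode` carries more state.
#[derive(Debug, Clone)]
struct RecodeNode {
    code: i32,
    score: f32,        // accumulated in f32, as the entry describes
    prev: Option<u32>, // arena index instead of a borrowed pointer
}

/// Hypothetical arena: nodes are only ever appended, so indices stay valid
/// no matter how a heap holding those indices is reshuffled.
#[derive(Default)]
struct Arena {
    nodes: Vec<RecodeNode>,
}

impl Arena {
    /// Append a node whose score is `cert + prev.score` (0 for a root).
    fn push(&mut self, code: i32, cert: f32, prev: Option<u32>) -> u32 {
        let prev_score = prev.map_or(0.0, |p| self.nodes[p as usize].score);
        self.nodes.push(RecodeNode { code, score: cert + prev_score, prev });
        (self.nodes.len() - 1) as u32
    }

    /// Follow `prev` links back to the root; return codes in forward order.
    fn path(&self, last: u32) -> Vec<i32> {
        let mut codes = Vec::new();
        let mut cur = Some(last);
        while let Some(i) = cur {
            let node = &self.nodes[i as usize];
            codes.push(node.code);
            cur = node.prev;
        }
        codes.reverse();
        codes
    }
}
```

Because a node refers to its predecessor by index rather than by reference, no lifetime ties the lattice to the heap storage, which is why a layout like this needs no `unsafe`.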

Byte-parity **GREEN** on the real `eng.lstm-recoder`: the public `extract_best_path_as_labels` diffs **byte-identical** across **4 configs** — `null_char ∈ {110, 0, 42}` × {fold, simple-text} — vs a libtesseract oracle (`/tmp/recodebeam_oracle.cpp`) that constructs the REAL `RecodeBeamSearch(recoder, null_char, simple, nullptr)` and runs the public `Decode(GENERIC_2D_ARRAY<float>, 1.0, 0.0, 0.0, nullptr)` + `ExtractBestPathAsLabels` on the SAME synthetic softmax matrix (read from a shared `.bin`, so the input is byte-identical). Using only the **public** ctor + Decode + ExtractBestPathAsLabels means **no private-member access** → the 5.5.0-header / 5.3.4-lib ABI skew that dogged the earlier leaves cannot bite. The folding path proved `[5,5,·,7,7,7,·,9] → [5,7,9]` (drop nulls, fold adjacent); the simple-text path proved the no-fold branch. **Float contract:** scores accumulate in f32 (`TFloat=float`), `ProbToCertainty(p) = p>e^-20 ? ln(p) : -20` is a raw `ln` (NOT the Leaf-3 LUT), heap keys are `f64::from(score)` (lossless → order-preserving) — no FP drift vs libtesseract. **Core addition:** `RecodedCharId::from_codes` (the beam's prefix/full-code key builder, `unicharcompress.h:43` `Set` loop). eng.lstm's recoder is pass-through (all length-1) so `next_codes_` is empty and every beam sits at length 0 — this proves the CTC core for a simple (non-CJK) script; the multi-code `next_codes_` trie is Han/Hangul, out of `eng` scope (consistent with every prior leaf, and the `next_codes_` maps themselves are proven in 7a). +4 unit tests (12 `tesseract-core` total); clippy `-D warnings` + fmt clean (`-p tesseract-core` scoped).
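
Two of the rules quoted above are small enough to sketch directly. The function names `prob_to_certainty` and `fold_labels` are illustrative, and `fold_labels` is the textbook CTC collapse (drop nulls, fold adjacent repeats) applied to a finished label sequence. It is an assumption that this matches the folding branch outside the example given in the entry, and it is not the beam itself.

```rust
/// `ProbToCertainty(p) = p > e^-20 ? ln(p) : -20`, using a raw `ln`.
fn prob_to_certainty(p: f32) -> f32 {
    const MIN_CERTAINTY: f32 = -20.0;
    if p > MIN_CERTAINTY.exp() {
        p.ln()
    } else {
        MIN_CERTAINTY
    }
}

/// Textbook CTC collapse: skip `null_char`, and skip a label that repeats
/// the immediately preceding timestep's label.
fn fold_labels(labels: &[i32], null_char: i32) -> Vec<i32> {
    let mut out = Vec::new();
    let mut prev: Option<i32> = None;
    for &label in labels {
        if label != null_char && prev != Some(label) {
            out.push(label);
        }
        prev = Some(label);
    }
    out
}
```

With `null_char = 110`, `fold_labels(&[5, 5, 110, 7, 7, 7, 110, 9], 110)` yields `[5, 7, 9]`, the folding case quoted above.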

**The recognizer now spans logits → text:** the 1-D forward (Leaves 1-6) produces the softmax logits, and Leaf 7 (7a maps + 7b beam) decodes them → codes → `recoded_to_text` → the string. The remaining gap to **image → text** is the **2-D front-end** (`Convolve`/`Maxpool`/`Reconfig`/`XYTranspose` + the `NetworkIO`/`StrideMap` grid + the leptonica image `Input`); the dictionary / language-model beam is the later accuracy layer. Cross-ref: `E-OCR-RECODER-BEAM-1` (7a, the maps this walks), `E-CPP-PARITY-7` (`recoded_to_text`, the text step it feeds), `E-OCR-GRAPHWALK-1` (Leaf 6, the logits it decodes). Plan `tesseract-rs/.claude/plans/recognizer-decode-frontend-v1.md` (Leaf 7a + 7b EXECUTED). lance-graph PR #647 (`from_codes` + board); tesseract-rs PR #5 (the beam). Branch `claude/happy-hamilton-0azlw4`.

## 2026-07-04 — E-OCR-RECODER-BEAM-1 — recognizer Leaf 7a: the recoder `SetupDecoder` beam-search trie maps (`is_valid_start_`/`final_codes_`/`next_codes_`) are byte-parity green — the deferred half of the recoder leaf, the surface `RecodeBeamSearch` walks
**Status:** FINDING (byte-parity proven vs libtesseract 5.3.4; `lance-graph-contract`, tested)

The deferred half of the recoder leaf (`E-CPP-PARITY-7` built only the `decoder_` map) now ships in the Core's `UnicharCompress`. `setup_decoder` transcodes the **full** C++ `SetupDecoder` (`unicharcompress.cpp:395-436`) in one ascending-id pass: the `decoder_` map **and** the three beam-search trie maps — `is_valid_start_` (`code(0)` is a valid first code, `IsValidFirstCode`), `final_codes_` (prefix `code[0..len-1]` → the completing last codes, `GetFinalCodes`), `next_codes_` (prefix → non-final continuations, `GetNextCodes`). The `while (--len >= 0)` prefix walk climbs from the direct parent toward the empty prefix, deduping and stopping at the first already-seeded `next_codes_` entry. `RecodedCharId` gained `code_at` (C++ `operator()`) + a private `truncated` (C++ `Truncate`, which only sets `length_` — the trailing `code` slots drop out of identity since eq/hash read `code[0..length]`). These maps are **Core content** (computed table state loaded alongside the encoder, data-shaped, no lifecycle); the recognizer's `RecodeBeamSearch` (Leaf 7b) *consumes* them read-only via the three new public accessors `is_valid_first_code`/`get_final_codes`/`get_next_codes`. Per Core-First: the table is Core, the beam **search** that walks it is recognizer compute.
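
The single ascending-id pass can be sketched as below. This is a simplified illustration under stated assumptions: codes are plain non-empty `Vec<i32>` keys rather than `RecodedCharId`, the `decoder_` map is omitted, and `TrieMaps` / `setup_decoder_sketch` are hypothetical names. Only the control flow (seed `final_codes_`, then climb with `--len`, dedup, and `break` at the first already-seeded `next_codes_` prefix) follows the description above.

```rust
use std::collections::HashMap;

/// Hypothetical container for the three beam-search maps.
#[derive(Default, Debug)]
struct TrieMaps {
    is_valid_start: Vec<bool>,
    final_codes: HashMap<Vec<i32>, Vec<i32>>,
    next_codes: HashMap<Vec<i32>, Vec<i32>>,
}

/// `encoder[id]` is the (non-empty) code sequence for unichar `id`.
fn setup_decoder_sketch(encoder: &[Vec<i32>], code_range: usize) -> TrieMaps {
    let mut maps = TrieMaps {
        is_valid_start: vec![false; code_range],
        ..Default::default()
    };
    for code in encoder {
        maps.is_valid_start[code[0] as usize] = true;
        let mut len = code.len() - 1;
        let prefix = code[..len].to_vec();
        // Prefix already seeded: just dedup the completing last code.
        if let Some(finals) = maps.final_codes.get_mut(&prefix) {
            if !finals.contains(&code[len]) {
                finals.push(code[len]);
            }
            continue;
        }
        maps.final_codes.insert(prefix, vec![code[len]]);
        // Equivalent of `while (--len >= 0)`: climb toward the empty prefix.
        while len > 0 {
            len -= 1;
            let prefix = code[..len].to_vec();
            match maps.next_codes.get_mut(&prefix) {
                None => {
                    maps.next_codes.insert(prefix, vec![code[len]]);
                }
                Some(nexts) => {
                    if !nexts.contains(&code[len]) {
                        nexts.push(code[len]);
                    }
                    break; // this prefix was already processed
                }
            }
        }
    }
    maps
}
```

For an all-length-1 encoder the inner loop never runs, so `next_codes` stays empty and every last code lands under the empty prefix in push order.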

Byte-parity **GREEN** on the real `eng.lstm-recoder`: `dump_beam` (a deterministic walk — `unordered_map` iteration order is unspecified and differs from Rust's `HashMap`, so the dump drives off `encoder_` id-order × truncation-length ascending, deduped, querying the accessors) diffs **114 lines byte-identical** vs the `beam` mode added to `recoder_oracle.cpp` (links libtesseract, same deterministic walk via `IsValidFirstCode`/`GetFinalCodes`/`GetNextCodes`). The real data corrected an assumption baked into the module docs: `code_range = 111` (max code 110, **not** 112), `enc_size = 112`, one shared code (`roundtrip_bad=1`, the id1/id2 → code 110 case), and `final_codes_[<empty>]` in push order `0,110,1,2,…,109` (id1 seeds 110, id2 dedups, then codes 1..109). `next_codes_` is empty (all length-1 → the `--len` loop never runs); a hand-traced length-3 (Han/Hangul-shaped) unit test exercises the full trie (the `--len` climb, the dedup-then-`break`, multiple finals under one prefix). encode + decode regressions still green. +2 unit tests (814 contract tests total); clippy `-D warnings` + fmt clean (`-p lance-graph-contract` scoped).
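
The reason for the deterministic walk can be illustrated with a small sketch: instead of iterating the hash map (whose order is unspecified on both sides), the dump drives off the ordered encoder and only queries the map. The function name `dump_final_codes` and the tab-separated line format are hypothetical and are not the actual `dump_beam` output.

```rust
use std::collections::{HashMap, HashSet};

/// Walk `encoder` in id order and each code's prefixes in ascending
/// truncation length, emitting each distinct prefix at most once.
fn dump_final_codes(
    encoder: &[Vec<i32>],
    final_codes: &HashMap<Vec<i32>, Vec<i32>>,
) -> String {
    let mut seen: HashSet<Vec<i32>> = HashSet::new();
    let mut out = String::new();
    for code in encoder {
        for len in 0..code.len() {
            let prefix = code[..len].to_vec();
            if !seen.insert(prefix.clone()) {
                continue; // already dumped via an earlier id
            }
            // Query only; never iterate the map itself.
            if let Some(finals) = final_codes.get(&prefix) {
                out.push_str(&format!("final\t{:?}\t{:?}\n", prefix, finals));
            }
        }
    }
    out
}
```

Two implementations that agree on the map contents then produce identical text regardless of their hash functions, which is what makes a line-by-line `diff` meaningful.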

**Leaf 7a unblocks Leaf 7b** — `RecodeBeamSearch::Decode` (the CTC beam search, `recodebeam.cpp` 1382 lines): `ComputeTopN` per-timestep → `DecodeStep`/`ContinueContext` walking these `next_codes_`/`final_codes_`/`is_valid_start_` maps → `ExtractBestPaths` → the code lattice → `recoded_to_text` (`E-CPP-PARITY-7`) → the string. Non-dict first pass (the falsifiable core; dict / language-model beam deferred). Cross-ref: `E-CPP-PARITY-7` (the recoder load side + `recoded_to_text`), `E-OCR-GRAPHWALK-1` (Leaf 6, produces the softmax logits the beam decodes), plan `tesseract-rs/.claude/plans/recognizer-decode-frontend-v1.md` (Leaf 7a EXECUTED). lance-graph PR #647 (recognizer Core side); tesseract-rs PR #5 (Leaves 1-6). Branch `claude/happy-hamilton-0azlw4`.

## 2026-07-04 — E-OCR-GRAPHWALK-1 — recognizer Leaf 6: the graph walk (`Series`/`Reversed`/`Parallel`) is byte-parity green — the composition that chains the proven layer leaves into a network forward, with the inter-layer int8 requant
**Status:** FINDING (byte-parity proven vs libtesseract 5.3.4; `tesseract-recognizer`, tested)

The composition that turns the individual layer leaves into a network forward — the compute-side **execution tree**, the `invoke_network` counterpart (the Core's `lance_graph_contract::network` FacetCascade describes the tree *structure*; `tesseract_recognizer::graph::Layer` *runs* it). NOT a parallel object model: it is the runnable subset (the layers whose `Forward` this crate transcodes), built from the Core's tree by a consumer. `Layer { Lstm, FullyConnected, Reversed, Series, Parallel }`:
- **`Series`** (`series.cpp:Forward`): run each sub-layer in turn, output of N → input of N+1. The recognizer runs int8, so the intermediate `NetworkScratch::IO` buffers inherit `int_mode` → the inter-layer conversion is the **int8 requant** (`quantize_i8` = `NetworkIO::WriteTimeStep`, proven Leaf 5). The final softmax is the `ResizeFloat` exception.
- **`Reversed`** (XREVERSED, `reversed.cpp`): reverse the 1-D sequence → inner → reverse. (`Txy`/`YREVERSED` are 2-D front-end, deferred.)
- **`Parallel`** (`parallel.cpp`): same input to each sub-layer, concatenate outputs.
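
The three composition rules above can be sketched as a small recursive enum. This is an illustration only: `Leaf` is a hypothetical stand-in for the proven `Lstm`/`FullyConnected` layers, sequences are plain `f32` vectors, and the inter-layer int8 requant is deliberately left out. The helper leaves `double` and `running_sum` exist only to make the behavior observable.

```rust
/// Timesteps x features.
type Seq = Vec<Vec<f32>>;

enum Layer {
    /// Hypothetical stand-in for a transcoded leaf layer.
    Leaf(fn(&Seq) -> Seq),
    Series(Vec<Layer>),
    Reversed(Box<Layer>),
    Parallel(Vec<Layer>),
}

impl Layer {
    fn forward(&self, input: &Seq) -> Seq {
        match self {
            Layer::Leaf(f) => f(input),
            // Output of layer N becomes the input of layer N+1.
            Layer::Series(layers) => layers
                .iter()
                .fold(input.clone(), |x, layer| layer.forward(&x)),
            // Reverse the 1-D sequence, run the inner layer, reverse back.
            Layer::Reversed(inner) => {
                let mut rev = input.clone();
                rev.reverse();
                let mut out = inner.forward(&rev);
                out.reverse();
                out
            }
            // Same input to every sub-layer; concatenate per timestep.
            Layer::Parallel(layers) => {
                let mut out: Seq = vec![Vec::new(); input.len()];
                for layer in layers {
                    for (dst, src) in out.iter_mut().zip(layer.forward(input)) {
                        dst.extend(src);
                    }
                }
                out
            }
        }
    }
}

/// Example leaf: scale every feature by 2.
fn double(x: &Seq) -> Seq {
    x.iter()
        .map(|t| t.iter().map(|v| v * 2.0).collect())
        .collect()
}

/// Example leaf with memory across timesteps (a toy recurrence).
fn running_sum(x: &Seq) -> Seq {
    let mut acc = vec![0.0f32; x.first().map_or(0, Vec::len)];
    x.iter()
        .map(|t| {
            for (a, v) in acc.iter_mut().zip(t) {
                *a += *v;
            }
            acc.clone()
        })
        .collect()
}
```

A leaf with memory across timesteps, such as `running_sum`, is what makes `Reversed` visible: wrapping a purely per-timestep leaf in `Reversed` would change nothing.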

Byte-parity **GREEN**: `Series[LSTM, FC(tanh)]` across 4 shapes incl. `ns=96/ni=192/no=111` (eng.lstm's LSTM192→Fc111 tail) — diffing f32 bit-patterns vs a libtesseract oracle (`/tmp/graph_oracle.cpp`, `-DFAST_FLOAT`) that runs the REAL LSTM per-timestep body over all timesteps → the REAL `WriteTimeStep` int8 requant → the REAL `MatrixDotVector`+`FuncInplace` — the exact `Series::Forward` composition. Proves the chaining order + the inter-layer requant (the one debug: the SIMD `MatrixDotVector` over-writes `RoundOutputs` padding, so oracle output vectors need `+128`). +4 unit tests (27 total); clippy `-D warnings` + fmt clean.

The recognizer now chains the proven layers into a **1-D network forward** from feature sequences → softmax logits. **Next Leaf 7 = `recodebeam`** — the CTC beam decode (needs the recoder's deferred `SetupDecoder` beam maps: `GetNextCodes`/`GetFinalCodes`/`is_valid_start_`) → the code lattice `recoded_to_text` (E-CPP-PARITY-7) eats → the text string. Then the 2-D front-end (`Convolve`/`Maxpool`/`XYTranspose` + the `NetworkIO`/`StrideMap` grid + leptonica image `Input`) closes image→text. Cross-ref: `E-OCR-LSTM-1` (Leaf 5), `E-OCR-FULLYCONNECTED-1` (Leaf 4), `E-CPP-PARITY-7` (the recoder `recoded_to_text` the lattice feeds). tesseract-rs PR #5 (recognizer Leaves 1-6). Branch `claude/happy-hamilton-0azlw4`.

## 2026-07-04 — E-OCR-LSTM-1 — recognizer Leaf 5: `LSTM::Forward` (1-D int8) is byte-parity green vs libtesseract — the recurrent layer, the hardest leaf, composes 4 Leaf-4 gates + the int8-quantized state recurrence
**Status:** FINDING (byte-parity proven vs libtesseract 5.3.4; `tesseract-recognizer`, tested)

The recurrent layer ships — the hardest recognizer leaf. `tesseract_recognizer::Lstm` transcodes the 1-D non-softmax `LSTM` (`lstm/lstm.{h,cpp}`, the eng.lstm case). `from_le_bytes` = `LSTM::DeSerialize` (`lstm.cpp:253-287`): `i32 na_` + 4 gate `WeightMatrix`es (`CI, GI, GF1, GO`, `GFS` skipped for 1-D), `ns = CI.num_outputs`, `ni = na_ − ns`. `forward` = `LSTM::Forward` (`lstm.cpp:363-454`) per timestep: source = `[input | int8_quantize(prev_output)]`; the 4 gates via **`fully_connected_forward` (Leaf 4)** — CI=tanh(`GFunc`), GI/GF1/GO=logistic(`FFunc`); cell `c = clip(GF1·c + CI·GI, ±100)` (`kStateClip`); output `h = tanh(c)·GO` (`HFunc`). The crux of the int8 path: the recurrent `h` is **quantized back to int8** (`NetworkIO::WriteTimeStepPart`, `networkio.cpp:662-666`: `clip(IntCastRounded(x·127), ±127)`, round-half-away-from-zero, never −128) before the next timestep's gate matmuls.
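
The scalar arithmetic of one timestep can be sketched as below. The function names and the scalar (per-element) form are illustrative. Using `f32::round`, which rounds half away from zero, as a stand-in for `IntCastRounded` is an assumption, and nothing here claims byte-parity with the transcoded layer.

```rust
/// `kStateClip` from the entry above.
const STATE_CLIP: f32 = 100.0;

/// `clip(IntCastRounded(x * 127), ±127)`: the result is never -128.
fn quantize_i8(x: f32) -> i8 {
    // `f32::round` rounds half-way cases away from zero.
    (x * 127.0).round().clamp(-127.0, 127.0) as i8
}

/// One element of the cell update, given already-activated gate outputs:
/// `ci` = tanh gate, `gi`/`gf1`/`go` = logistic gates.
/// Returns `(c, h)` with `c = clip(gf1*c_prev + ci*gi, ±100)` and
/// `h = tanh(c) * go`.
fn cell_step(c_prev: f32, ci: f32, gi: f32, gf1: f32, go: f32) -> (f32, f32) {
    let c = (gf1 * c_prev + ci * gi).clamp(-STATE_CLIP, STATE_CLIP);
    let h = c.tanh() * go;
    (c, h)
}
```

In the recurrence described above, `h` would pass through `quantize_i8` before being appended to the next timestep's gate input.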

Byte-parity **GREEN** across 3 shapes incl. `ns=48/ni=36` (eng.lstm's LSTM48) × **8 timesteps** — diffing f32 bit-patterns vs a libtesseract oracle (`/tmp/lstm_oracle.cpp`, `-DFAST_FLOAT`) that runs the REAL `WeightMatrix::MatrixDotVector` + `FuncInplace<GFunc/FFunc>` + `MultiplyVectorsInPlace`/`MultiplyAccumulate`/`ClipVector` + `FuncMultiply<HFunc>` + the real `IntCastRounded`/`ClipToRange` quant — the exact per-timestep body of `LSTM::Forward`. The 8-timestep run proves the int8 recurrence feedback is correct across the sequence. **No FMA discrepancy** — the recurrence's separate mul+add matches libtesseract byte-exactly at `-O2`. Added `WeightMatrix::from_le_bytes_prefix` (returns bytes consumed) to chain the 4 gate matrices. +5 unit tests (23 total); clippy `-D warnings` + fmt clean (`-p tesseract-recognizer` scoped).

The recognizer now has all five leaf types of the eng.lstm forward pass: int8 GEMM (Leaf 1), WeightMatrix (Leaf 2), activations (Leaf 3), FullyConnected layer (Leaf 4), LSTM layer (Leaf 5). **Next Leaf 6 = the graph walk**: `Series`/`Parallel`/`Reversed`/`Convolve`/`Maxpool` composing the layers per the model tree (`[1,36,0,1[C3,3Ft16]Mp3,3TxyLfys48Lfx96RxLrx96Lfx192Fc111]`) → then `recodebeam` (CTC decode → the code lattice `recoded_to_text` eats). Cross-ref: `E-OCR-FULLYCONNECTED-1` (Leaf 4, reused per gate), `E-OCR-ACTIVATION-1` (Leaf 3), `E-OCR-WEIGHTMATRIX-1` (Leaf 2), `E-OCR-NETWORK-SINK-1` (the structure side). Plan: `tesseract-rs/.claude/plans/recognizer-core-shape-v1.md` (Leaf 5 EXECUTED). tesseract-rs PR #5 (recognizer Leaves 1-5). Branch `claude/happy-hamilton-0azlw4`.

## 2026-07-04 — E-V1-TAIL-FORBIDDEN-V3-IS-CONTENT-BLIND-1: the flat contiguous-u24 tail (`family:u24 ++ identity:u24`) is V1-LEGACY, FORBIDDEN for new units; V3 is the content-blind 4+12 facet whose 12B the ClassView projects as axis-grouped byte rails — migration of residues MANDATORY
**Status:** RULING (operator, 2026-07-04). Specializes/enforces `E-V3-FACET-4-PLUS-12` (operator-locked 2026-07-02) with an explicit *forbidding* + a *mandatory-migration* stamp. Reverses the older 2026-06-10 OGAR P0 pin (u24+u24) and this repo's own 2026-06-13 CANON block (which still document the flat tail) — latest operator word wins; both are regraded in place, append-only, never deleted.

7 changes: 7 additions & 0 deletions crates/lance-graph-contract/examples/recoder_dump.rs
@@ -17,6 +17,12 @@
//! diff /tmp/oracle_recoder_encode.tsv /tmp/rust_recoder_encode.tsv \
//! && diff /tmp/oracle_recoder_decode.tsv /tmp/rust_recoder_decode.tsv
//! # both byte-identical => the recoder load-side is byte-parity green
//!
//! # The `beam` mode dumps the SetupDecoder trie maps (is_valid_start_ /
//! # final_codes_ / next_codes_ — the RecodeBeamSearch surface, Leaf 7a):
//! # ./recoder_oracle /tmp/eng.lstm-unicharset /tmp/eng.lstm-recoder beam > /tmp/oracle_recoder_beam.tsv
//! cargo run -p lance-graph-contract --example recoder_dump -- /tmp/eng.lstm-recoder beam > /tmp/rust_recoder_beam.tsv
//! diff /tmp/oracle_recoder_beam.tsv /tmp/rust_recoder_beam.tsv # byte-identical => 7a green
//! ```

#![allow(
@@ -39,6 +45,7 @@ fn main() -> ExitCode {
        Ok(recoder) => {
            match mode.as_str() {
                "decode" => print!("{}", recoder.dump_decode()),
                "beam" => print!("{}", recoder.dump_beam()),
                _ => print!("{}", recoder.dump_encode()),
            }
            ExitCode::SUCCESS