diff --git a/.claude/board/EPIPHANIES.md b/.claude/board/EPIPHANIES.md index 3d74b326..6fa307d6 100644 --- a/.claude/board/EPIPHANIES.md +++ b/.claude/board/EPIPHANIES.md @@ -1,3 +1,42 @@ +## 2026-07-04 — E-OCR-RECODEBEAM-1 — recognizer Leaf 7b: `RecodeBeamSearch::Decode` (the non-dict CTC beam) is byte-parity green — logits → labels → text, the "recognizer produces text" milestone (the hardest leaf, 1382 lines) +**Status:** FINDING (byte-parity proven vs libtesseract 5.3.4; `tesseract-core`, tested) + +The CTC beam decode ships — the single hardest recognizer leaf (`recodebeam.cpp`, 1382 lines), the step that turns the LSTM's per-timestep softmax logits into a code sequence → text. `tesseract_core::recodebeam::RecodeBeamSearch` transcodes the **non-dictionary** path (`dict_ == nullptr`, `permuter` fixed at `TOP_CHOICE_PERM`, the dawg beams never populated). Placed in **tesseract-core** (not `tesseract-recognizer`): the beam is compute but **SIMD-free and recoder-coupled**, so it belongs next to the recoder tables it walks (Leaf 7a's `is_valid_first_code`/`get_final_codes`/`get_next_codes`) and the `recoded_to_text` (`E-CPP-PARITY-7`) step it feeds. Transcoded whole: `ComputeTopN` (top-n flags via the min-heap), `DecodeStep` (per-timestep, top-n-group fallback), `ContinueContext` (the beam-extension crux — prefix walk + `get_final_codes`/`get_next_codes` continuations + the dup/null CTC combination rules), `ContinueUnichar`, `PushDupOrNoDawgIfBetter`, `PushHeapIfBetter` (`score = cert + prev.score`, evict-worst-when-full), `ComputeCodeHash` (the u64 rolling mix for dup-path removal), `UpdateHeapIfMatched` (+`Reshuffle`), `ExtractBestPaths`/`ExtractBestPathAsLabels`, plus a faithful `GenericHeap>` (binary min-heap, exact `SiftUp`/`SiftDown`/`Reshuffle` so the `get(i)` internal order the decode walks matches). The C++ borrowed-`prev`-pointer lattice becomes a safe **arena** (`Vec` + `prev: Option` index) — no `unsafe`, no dangling across heap reshuffles. 
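The arena idea above can be pictured with a minimal sketch. It assumes a simplified node that keeps only a code, an accumulated score, and a `prev` index; the names `Node`, `Arena`, `push`, and `path` are hypothetical and are not the crate's API. The point is that a back-pointer stored as an index into a `Vec` stays valid however the heaps that reference the nodes are reshuffled.

```rust
/// One lattice node. Hypothetical, simplified stand-in for the beam's node type.
#[derive(Debug, Clone, PartialEq)]
pub struct Node {
    pub code: i32,
    /// Accumulated score: this step's certainty plus the predecessor's score.
    pub score: f32,
    /// Index of the predecessor in the arena, or `None` at the start of a path.
    pub prev: Option<usize>,
}

/// Append-only node storage. Indices handed out by `push` never move.
#[derive(Debug, Default)]
pub struct Arena {
    nodes: Vec<Node>,
}

impl Arena {
    /// Store a node and return its index. The score follows the
    /// `score = cert + prev.score` rule quoted in the entry above.
    pub fn push(&mut self, code: i32, certainty: f32, prev: Option<usize>) -> usize {
        let prev_score = prev.map_or(0.0, |p| self.nodes[p].score);
        self.nodes.push(Node {
            code,
            score: certainty + prev_score,
            prev,
        });
        self.nodes.len() - 1
    }

    /// Accumulated score of the node at `at`.
    pub fn score(&self, at: usize) -> f32 {
        self.nodes[at].score
    }

    /// Walk the `prev` indices back from `tail` and return the codes in
    /// forward (first-to-last) order.
    pub fn path(&self, tail: usize) -> Vec<i32> {
        let mut out = Vec::new();
        let mut at = Some(tail);
        while let Some(i) = at {
            out.push(self.nodes[i].code);
            at = self.nodes[i].prev;
        }
        out.reverse();
        out
    }
}
```

Because a node only ever refers to an earlier index, a heap can hold plain `usize` handles and be reordered freely; no borrow of the arena has to outlive a heap operation.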
+ +Byte-parity **GREEN** on the real `eng.lstm-recoder`: the public `extract_best_path_as_labels` diffs **byte-identical** across **4 configs** — `null_char ∈ {110, 0, 42}` × {fold, simple-text} — vs a libtesseract oracle (`/tmp/recodebeam_oracle.cpp`) that constructs the REAL `RecodeBeamSearch(recoder, null_char, simple, nullptr)` and runs the public `Decode(GENERIC_2D_ARRAY, 1.0, 0.0, 0.0, nullptr)` + `ExtractBestPathAsLabels` on the SAME synthetic softmax matrix (read from a shared `.bin`, so the input is byte-identical). Using only the **public** ctor + Decode + ExtractBestPathAsLabels means **no private-member access** → the 5.5.0-header / 5.3.4-lib ABI skew that dogged the earlier leaves cannot bite. The folding path proved `[5,5,·,7,7,7,·,9] → [5,7,9]` (drop nulls, fold adjacent); the simple-text path proved the no-fold branch. **Float contract:** scores accumulate in f32 (`TFloat=float`), `ProbToCertainty(p) = p>e^-20 ? ln(p) : -20` is a raw `ln` (NOT the Leaf-3 LUT), heap keys are `f64::from(score)` (lossless → order-preserving) — no FP drift vs libtesseract. **Core addition:** `RecodedCharId::from_codes` (the beam's prefix/full-code key builder, `unicharcompress.h:43` `Set` loop). eng.lstm's recoder is pass-through (all length-1) so `next_codes_` is empty and every beam sits at length 0 — this proves the CTC core for a simple (non-CJK) script; the multi-code `next_codes_` trie is Han/Hangul, out of `eng` scope (consistent with every prior leaf, and the `next_codes_` maps themselves are proven in 7a). +4 unit tests (12 `tesseract-core` total); clippy `-D warnings` + fmt clean (`-p tesseract-core` scoped). + +**The recognizer now spans logits → text:** the 1-D forward (Leaves 1-6) produces the softmax logits, and Leaf 7 (7a maps + 7b beam) decodes them → codes → `recoded_to_text` → the string. 
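The folding rule and the float contract quoted above can be illustrated with a small sketch. It is a textbook CTC collapse plus the stated `ProbToCertainty` formula, written under assumptions: `ctc_fold` and `prob_to_certainty` are illustrative names, not the functions in `tesseract-core`, and no byte-parity with libtesseract is claimed for them.

```rust
/// Floor for certainties, per the `-20` quoted in the float contract above.
const MIN_CERTAINTY: f32 = -20.0;

/// `p > e^-20 ? ln(p) : -20`, using a raw natural log (no lookup table).
pub fn prob_to_certainty(prob: f32) -> f32 {
    if prob > MIN_CERTAINTY.exp() {
        prob.ln()
    } else {
        MIN_CERTAINTY
    }
}

/// Collapse a per-timestep label sequence: fold runs of the same label into
/// one and drop the null label. A null between two equal labels keeps them
/// distinct, as in standard CTC.
pub fn ctc_fold(labels: &[i32], null_char: i32) -> Vec<i32> {
    let mut out = Vec::new();
    let mut prev: Option<i32> = None;
    for &label in labels {
        // Emit only when the label is not null and differs from the
        // immediately preceding timestep's label.
        if label != null_char && prev != Some(label) {
            out.push(label);
        }
        prev = Some(label);
    }
    out
}
```

With `null_char = 110`, `ctc_fold(&[5, 5, 110, 7, 7, 7, 110, 9], 110)` yields `[5, 7, 9]`, the same shape as the folding case reported above.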
The remaining gap to **image → text** is the **2-D front-end** (`Convolve`/`Maxpool`/`Reconfig`/`XYTranspose` + the `NetworkIO`/`StrideMap` grid + the leptonica image `Input`); the dictionary / language-model beam is the later accuracy layer. Cross-ref: `E-OCR-RECODER-BEAM-1` (7a, the maps this walks), `E-CPP-PARITY-7` (`recoded_to_text`, the text step it feeds), `E-OCR-GRAPHWALK-1` (Leaf 6, the logits it decodes). Plan `tesseract-rs/.claude/plans/recognizer-decode-frontend-v1.md` (Leaf 7a + 7b EXECUTED). lance-graph PR #647 (`from_codes` + board); tesseract-rs PR #5 (the beam). Branch `claude/happy-hamilton-0azlw4`. + +## 2026-07-04 — E-OCR-RECODER-BEAM-1 — recognizer Leaf 7a: the recoder `SetupDecoder` beam-search trie maps (`is_valid_start_`/`final_codes_`/`next_codes_`) are byte-parity green — the deferred half of the recoder leaf, the surface `RecodeBeamSearch` walks +**Status:** FINDING (byte-parity proven vs libtesseract 5.3.4; `lance-graph-contract`, tested) + +The deferred half of the recoder leaf (`E-CPP-PARITY-7` built only the `decoder_` map) now ships in the Core's `UnicharCompress`. `setup_decoder` transcodes the **full** C++ `SetupDecoder` (`unicharcompress.cpp:395-436`) in one ascending-id pass: the `decoder_` map **and** the three beam-search trie maps — `is_valid_start_` (`code(0)` is a valid first code, `IsValidFirstCode`), `final_codes_` (prefix `code[0..len-1]` → the completing last codes, `GetFinalCodes`), `next_codes_` (prefix → non-final continuations, `GetNextCodes`). The `while (--len >= 0)` prefix walk climbs from the direct parent toward the empty prefix, deduping and stopping at the first already-seeded `next_codes_` entry. `RecodedCharId` gained `code_at` (C++ `operator()`) + a private `truncated` (C++ `Truncate`, which only sets `length_` — the trailing `code` slots drop out of identity since eq/hash read `code[0..length]`). 
These maps are **Core content** (computed table state loaded alongside the encoder, data-shaped, no lifecycle); the recognizer's `RecodeBeamSearch` (Leaf 7b) *consumes* them read-only via the three new public accessors `is_valid_first_code`/`get_final_codes`/`get_next_codes`. Per Core-First: the table is Core, the beam **search** that walks it is recognizer compute. + +Byte-parity **GREEN** on the real `eng.lstm-recoder`: `dump_beam` (a deterministic walk — `unordered_map` iteration order is unspecified and differs from Rust's `HashMap`, so the dump drives off `encoder_` id-order × truncation-length ascending, deduped, querying the accessors) diffs **114 lines byte-identical** vs the `beam` mode added to `recoder_oracle.cpp` (links libtesseract, same deterministic walk via `IsValidFirstCode`/`GetFinalCodes`/`GetNextCodes`). The real data corrected an assumption baked into the module docs: `code_range = 111` (max code 110, **not** 112), `enc_size = 112`, one shared code (`roundtrip_bad=1`, the id1/id2 → code 110 case), and `final_codes_[]` in push order `0,110,1,2,…,109` (id1 seeds 110, id2 dedups, then codes 1..109). `next_codes_` is empty (all length-1 → the `--len` loop never runs); a hand-traced length-3 (Han/Hangul-shaped) unit test exercises the full trie (the `--len` climb, the dedup-then-`break`, multiple finals under one prefix). encode + decode regressions still green. +2 unit tests (814 contract tests total); clippy `-D warnings` + fmt clean (`-p lance-graph-contract` scoped). + +**Leaf 7a unblocks Leaf 7b** — `RecodeBeamSearch::Decode` (the CTC beam search, `recodebeam.cpp` 1382 lines): `ComputeTopN` per-timestep → `DecodeStep`/`ContinueContext` walking these `next_codes_`/`final_codes_`/`is_valid_start_` maps → `ExtractBestPaths` → the code lattice → `recoded_to_text` (`E-CPP-PARITY-7`) → the string. Non-dict first pass (the falsifiable core; dict / language-model beam deferred). 
Cross-ref: `E-CPP-PARITY-7` (the recoder load side + `recoded_to_text`), `E-OCR-GRAPHWALK-1` (Leaf 6, produces the softmax logits the beam decodes), plan `tesseract-rs/.claude/plans/recognizer-decode-frontend-v1.md` (Leaf 7a EXECUTED). lance-graph PR #647 (recognizer Core side); tesseract-rs PR #5 (Leaves 1-6). Branch `claude/happy-hamilton-0azlw4`. + +## 2026-07-04 — E-OCR-GRAPHWALK-1 — recognizer Leaf 6: the graph walk (`Series`/`Reversed`/`Parallel`) is byte-parity green — the composition that chains the proven layer leaves into a network forward, with the inter-layer int8 requant +**Status:** FINDING (byte-parity proven vs libtesseract 5.3.4; `tesseract-recognizer`, tested) + +The composition that turns the individual layer leaves into a network forward — the compute-side **execution tree**, the `invoke_network` counterpart (the Core's `lance_graph_contract::network` FacetCascade describes the tree *structure*; `tesseract_recognizer::graph::Layer` *runs* it). NOT a parallel object model: it is the runnable subset (the layers whose `Forward` this crate transcodes), built from the Core's tree by a consumer. `Layer { Lstm, FullyConnected, Reversed, Series, Parallel }`: +- **`Series`** (`series.cpp:Forward`): run each sub-layer in turn, output of N → input of N+1. The recognizer runs int8, so the intermediate `NetworkScratch::IO` buffers inherit `int_mode` → the inter-layer conversion is the **int8 requant** (`quantize_i8` = `NetworkIO::WriteTimeStep`, proven Leaf 5). The final softmax is the `ResizeFloat` exception. +- **`Reversed`** (XREVERSED, `reversed.cpp`): reverse the 1-D sequence → inner → reverse. (`Txy`/`YREVERSED` are 2-D front-end, deferred.) +- **`Parallel`** (`parallel.cpp`): same input to each sub-layer, concatenate outputs. + +Byte-parity **GREEN**: `Series[LSTM, FC(tanh)]` across 4 shapes incl. 
`ns=96/ni=192/no=111` (eng.lstm's LSTM192→Fc111 tail) — diffing f32 bit-patterns vs a libtesseract oracle (`/tmp/graph_oracle.cpp`, `-DFAST_FLOAT`) that runs the REAL LSTM per-timestep body over all timesteps → the REAL `WriteTimeStep` int8 requant → the REAL `MatrixDotVector`+`FuncInplace` — the exact `Series::Forward` composition. Proves the chaining order + the inter-layer requant (the one debug: the SIMD `MatrixDotVector` over-writes `RoundOutputs` padding, so oracle output vectors need `+128`). +4 unit tests (27 total); clippy `-D warnings` + fmt clean. + +The recognizer now chains the proven layers into a **1-D network forward** from feature sequences → softmax logits. **Next Leaf 7 = `recodebeam`** — the CTC beam decode (needs the recoder's deferred `SetupDecoder` beam maps: `GetNextCodes`/`GetFinalCodes`/`is_valid_start_`) → the code lattice `recoded_to_text` (E-CPP-PARITY-7) eats → the text string. Then the 2-D front-end (`Convolve`/`Maxpool`/`XYTranspose` + the `NetworkIO`/`StrideMap` grid + leptonica image `Input`) closes image→text. Cross-ref: `E-OCR-LSTM-1` (Leaf 5), `E-OCR-FULLYCONNECTED-1` (Leaf 4), `E-CPP-PARITY-7` (the recoder `recoded_to_text` the lattice feeds). tesseract-rs PR #5 (recognizer Leaves 1-6). Branch `claude/happy-hamilton-0azlw4`. + +## 2026-07-04 — E-OCR-LSTM-1 — recognizer Leaf 5: `LSTM::Forward` (1-D int8) is byte-parity green vs libtesseract — the recurrent layer, the hardest leaf, composes 4 Leaf-4 gates + the int8-quantized state recurrence +**Status:** FINDING (byte-parity proven vs libtesseract 5.3.4; `tesseract-recognizer`, tested) + +The recurrent layer ships — the hardest recognizer leaf. `tesseract_recognizer::Lstm` transcodes the 1-D non-softmax `LSTM` (`lstm/lstm.{h,cpp}`, the eng.lstm case). `from_le_bytes` = `LSTM::DeSerialize` (`lstm.cpp:253-287`): `i32 na_` + 4 gate `WeightMatrix`es (`CI, GI, GF1, GO`, `GFS` skipped for 1-D), `ns = CI.num_outputs`, `ni = na_ − ns`. 
`forward` = `LSTM::Forward` (`lstm.cpp:363-454`) per timestep: source = `[input | int8_quantize(prev_output)]`; the 4 gates via **`fully_connected_forward` (Leaf 4)** — CI=tanh(`GFunc`), GI/GF1/GO=logistic(`FFunc`); cell `c = clip(GF1·c + CI·GI, ±100)` (`kStateClip`); output `h = tanh(c)·GO` (`HFunc`). The crux of the int8 path: the recurrent `h` is **quantized back to int8** (`NetworkIO::WriteTimeStepPart`, `networkio.cpp:662-666`: `clip(IntCastRounded(x·127), ±127)`, round-half-away-from-zero, never −128) before the next timestep's gate matmuls. + +Byte-parity **GREEN** across 3 shapes incl. `ns=48/ni=36` (eng.lstm's LSTM48) × **8 timesteps** — diffing f32 bit-patterns vs a libtesseract oracle (`/tmp/lstm_oracle.cpp`, `-DFAST_FLOAT`) that runs the REAL `WeightMatrix::MatrixDotVector` + `FuncInplace` + `MultiplyVectorsInPlace`/`MultiplyAccumulate`/`ClipVector` + `FuncMultiply` + the real `IntCastRounded`/`ClipToRange` quant — the exact per-timestep body of `LSTM::Forward`. The 8-timestep run proves the int8 recurrence feedback is correct across the sequence. **No FMA discrepancy** — the recurrence's separate mul+add matches libtesseract byte-exactly at `-O2`. Added `WeightMatrix::from_le_bytes_prefix` (returns bytes consumed) to chain the 4 gate matrices. +5 unit tests (23 total); clippy `-D warnings` + fmt clean (`-p tesseract-recognizer` scoped). + +The recognizer now has all five leaf types of the eng.lstm forward pass: int8 GEMM (Leaf 1), WeightMatrix (Leaf 2), activations (Leaf 3), FullyConnected layer (Leaf 4), LSTM layer (Leaf 5). **Next Leaf 6 = the graph walk** — `Series`/`Parallel`/`Reversed`/`Convolve`/`Maxpool` composing the layers per the model tree (`[1,36,0,1[C3,3Ft16]Mp3,3TxyLfys48Lfx96RxLrx96Lfx192Fc111]`) → then `recodebeam` (CTC decode → the code lattice `recoded_to_text` eats). 
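The int8 requantization of the recurrent output and the cell-state clip described in this entry can be sketched as follows. This is an approximation under stated assumptions: it uses `f32::round` (which rounds half away from zero) in place of the C++ `IntCastRounded`, so agreement on every edge-case input is not claimed, and the names `quantize_i8` and `clip_state` are used here only for illustration.

```rust
/// Scale a float activation in roughly [-1, 1] to int8:
/// `clip(round(x * 127), -127, 127)`. The result is never -128.
pub fn quantize_i8(x: f32) -> i8 {
    // `f32::round` rounds half-way cases away from zero.
    let scaled = (x * 127.0).round();
    // Clamp in float space first so the cast below cannot leave the range.
    scaled.clamp(-127.0, 127.0) as i8
}

/// Clip the LSTM cell state to +/-100, the `kStateClip` bound quoted above.
pub fn clip_state(c: f32) -> f32 {
    c.clamp(-100.0, 100.0)
}
```

Clamping to ±127 rather than the full int8 range keeps the quantized values symmetric around zero, which is why −128 never appears in the recurrent input.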
Cross-ref: `E-OCR-FULLYCONNECTED-1` (Leaf 4, reused per gate), `E-OCR-ACTIVATION-1` (Leaf 3), `E-OCR-WEIGHTMATRIX-1` (Leaf 2), `E-OCR-NETWORK-SINK-1` (the structure side). Plan: `tesseract-rs/.claude/plans/recognizer-core-shape-v1.md` (Leaf 5 EXECUTED). tesseract-rs PR #5 (recognizer Leaves 1-5). Branch `claude/happy-hamilton-0azlw4`. + ## 2026-07-04 — E-V1-TAIL-FORBIDDEN-V3-IS-CONTENT-BLIND-1: the flat contiguous-u24 tail (`family:u24 ++ identity:u24`) is V1-LEGACY, FORBIDDEN for new units; V3 is the content-blind 4+12 facet whose 12B the ClassView projects as axis-grouped byte rails — migration of residues MANDATORY **Status:** RULING (operator, 2026-07-04). Specializes/enforces `E-V3-FACET-4-PLUS-12` (operator-locked 2026-07-02) with an explicit *forbidding* + a *mandatory-migration* stamp. Reverses the older 2026-06-10 OGAR P0 pin (u24+u24) and this repo's own 2026-06-13 CANON block (which still document the flat tail) — latest operator word wins; both are regraded in place, append-only, never deleted. diff --git a/crates/lance-graph-contract/examples/recoder_dump.rs b/crates/lance-graph-contract/examples/recoder_dump.rs index 1f2eed26..0adb3c7a 100644 --- a/crates/lance-graph-contract/examples/recoder_dump.rs +++ b/crates/lance-graph-contract/examples/recoder_dump.rs @@ -17,6 +17,12 @@ //! diff /tmp/oracle_recoder_encode.tsv /tmp/rust_recoder_encode.tsv \ //! && diff /tmp/oracle_recoder_decode.tsv /tmp/rust_recoder_decode.tsv //! # both byte-identical => the recoder load-side is byte-parity green +//! +//! # The `beam` mode dumps the SetupDecoder trie maps (is_valid_start_ / +//! # final_codes_ / next_codes_ — the RecodeBeamSearch surface, Leaf 7a): +//! # ./recoder_oracle /tmp/eng.lstm-unicharset /tmp/eng.lstm-recoder beam > /tmp/oracle_recoder_beam.tsv +//! cargo run -p lance-graph-contract --example recoder_dump -- /tmp/eng.lstm-recoder beam > /tmp/rust_recoder_beam.tsv +//! 
diff /tmp/oracle_recoder_beam.tsv /tmp/rust_recoder_beam.tsv # byte-identical => 7a green //! ``` #![allow( @@ -39,6 +45,7 @@ fn main() -> ExitCode { Ok(recoder) => { match mode.as_str() { "decode" => print!("{}", recoder.dump_decode()), + "beam" => print!("{}", recoder.dump_beam()), _ => print!("{}", recoder.dump_encode()), } ExitCode::SUCCESS diff --git a/crates/lance-graph-contract/src/unicharcompress.rs b/crates/lance-graph-contract/src/unicharcompress.rs index a2e32c3f..8f5142f3 100644 --- a/crates/lance-graph-contract/src/unicharcompress.rs +++ b/crates/lance-graph-contract/src/unicharcompress.rs @@ -18,10 +18,15 @@ //! This module transcodes the **load side only** — `DeSerialize` + //! `EncodeUnichar` + `DecodeUnichar` + `code_range` (the recognizer runtime //! surface). `ComputeEncoding` (the training-side table builder) is out of -//! scope. `SetupDecoder`'s beam-search maps (`is_valid_start_` / `next_codes_` / -//! `final_codes_`, `unicharcompress.cpp:396-434`) are the recognizer's, not the -//! decode table's — they are deferred to the recognizer leaf; only the -//! `decoder_` map (code → id) is built here. +//! scope. `SetupDecoder`'s full state — the `decoder_` map (code → id) **and** +//! the beam-search trie maps `is_valid_start_` / `final_codes_` / `next_codes_` +//! (`unicharcompress.cpp:396-434`) — is built here: they are computed table +//! state loaded alongside the encoder (data-shaped, no lifecycle), and the +//! recognizer's `RecodeBeamSearch` *consumes* them read-only via +//! [`UnicharCompress::is_valid_first_code`] / [`UnicharCompress::get_final_codes`] +//! / [`UnicharCompress::get_next_codes`] (the C++ `IsValidFirstCode` / +//! `GetFinalCodes` / `GetNextCodes` accessors). The maps are Core content; the +//! beam *search* that walks them is recognizer compute (Leaf 7b). //! //! # Binary format (byte-parity surface) //! 
@@ -113,6 +118,27 @@ impl Default for RecodedCharId { } impl RecodedCharId { + /// Construct a code from an explicit slice of code values — the beam-search + /// consumer's key builder (the C++ `RecodedCharID::Set` loop, + /// `unicharcompress.h:43`). `RecodeBeamSearch` builds a `prefix` + /// (`codes[0..length]`) to query [`get_final_codes`](UnicharCompress::get_final_codes) + /// / [`get_next_codes`](UnicharCompress::get_next_codes) and a `full_code` + /// (`prefix ++ code`) to feed [`UnicharCompress::decode`]. Only the first + /// [`K_MAX_CODE_LEN`](struct@RecodedCharId) codes are kept; extras are dropped + /// (the C++ fixed `code_[9]`). `self_normalized` is the default `1` (it never + /// participates in identity). + #[must_use] + pub fn from_codes(codes: &[i32]) -> Self { + let mut code = [0_i32; K_MAX_CODE_LEN]; + let len = codes.len().min(K_MAX_CODE_LEN); + code[..len].copy_from_slice(&codes[..len]); + Self { + self_normalized: 1, + length: len as i32, + code, + } + } + /// The codes in use — `code[0..length]`. The only bytes that carry identity. #[must_use] pub fn codes(&self) -> &[i32] { @@ -141,6 +167,26 @@ impl RecodedCharId { self.self_normalized != 0 } + /// The code value at `index` — the C++ `operator()(int)` (`unicharcompress.h:65`), + /// reading `code_[index]` directly. Out-of-range indices read `0` (the array is + /// zero-initialized and `SetupDecoder` only ever indexes `0..length`). + #[must_use] + pub fn code_at(&self, index: usize) -> i32 { + self.code.get(index).copied().unwrap_or(0) + } + + /// A copy truncated to `len` codes — the C++ `Truncate(int)` + /// (`unicharcompress.h:40`, which only sets `length_`). The trailing `code` + /// slots are retained but drop out of identity ([`codes`](Self::codes) / + /// [`PartialEq`] / [`Hash`] read `code[0..length]`), exactly as C++ leaves + /// `code_` intact and compares only `code_[0..length_]`. 
+ #[must_use] + fn truncated(&self, len: i32) -> Self { + let mut out = self.clone(); + out.length = len; + out + } + /// Read one `RecodedCharID` from the little-endian cursor. Rejects a /// `length` outside `0..=kMaxCodeLen` (the C++ UB guard) and a short buffer. fn read(r: &mut ByteReader<'_>) -> Result { @@ -191,6 +237,18 @@ pub struct UnicharCompress { /// code → unichar-id, recomputed on load (`SetupDecoder`, /// `unicharcompress.cpp:400-402`). Last-writer-wins on a shared code. decoder: HashMap<RecodedCharId, u32>, + /// `is_valid_start_` (`unicharcompress.h:234`): indexed by code value in + /// `0..code_range`, `true` where some entry's first code is that value — the + /// beam search's valid-first-code gate (`IsValidFirstCode`). + is_valid_start: Vec<bool>, + /// `final_codes_` (`unicharcompress.h:241`): prefix (`code[0..len-1]`) → the + /// last codes that complete a full sequence from that prefix. Keyed by the + /// truncated [`RecodedCharId`]; the empty prefix maps every length-1 code. + final_codes: HashMap<RecodedCharId, Vec<i32>>, + /// `next_codes_` (`unicharcompress.h:237`): prefix → the valid *non-final* + /// continuation codes. Empty for a pass-through recoder (all length-1); only + /// multi-code scripts (Han/Hangul, length 3) populate it. + next_codes: HashMap<RecodedCharId, Vec<i32>>, /// `1 + max code value` (`ComputeCodeRange`, `unicharcompress.cpp:383-393`); /// the lattice width. `0` for an empty encoder (`-1 + 1`). code_range: i32, @@ -224,6 +282,9 @@ impl UnicharCompress { let mut this = Self { encoder, decoder: HashMap::new(), + is_valid_start: Vec::new(), + final_codes: HashMap::new(), + next_codes: HashMap::new(), code_range: 0, }; this.compute_code_range(); @@ -299,18 +360,102 @@ impl UnicharCompress { self.code_range = max + 1; } - /// The decode-map half of `SetupDecoder` (`unicharcompress.cpp:400-402`): - /// `decoder_[encoder_[id]] = id` in ascending id order, so **last writer - /// wins** when two ids share a code. 
The beam-search maps are the - /// recognizer's and are not built here (see module docs). + /// The full `SetupDecoder` (`unicharcompress.cpp:395-436`): in one ascending-id + /// pass over `encoder_`, build the `decoder_` map (code → id, **last writer + /// wins** on a shared code) **and** the beam-search trie maps — + /// `is_valid_start_` (`code(0)` is a valid first code), `final_codes_` (prefix + /// `code[0..len-1]` → the completing last codes), and `next_codes_` (prefix → + /// valid non-final continuations). The `while (--len >= 0)` prefix walk climbs + /// from the direct parent toward the empty prefix, stopping at the first + /// already-populated `next_codes_` entry (that prefix, and all shorter ones, + /// were seeded by an earlier entry). + /// + /// For the eng.lstm pass-through recoder (112 entries, all length-1) this is + /// `is_valid_start_[c0]=true` for each, `final_codes_[]` = every + /// distinct `code(0)` in id order, and an empty `next_codes_` (the `--len` + /// loop never runs). Multi-code scripts (Han/Hangul, length 3) exercise the + /// full trie. fn setup_decoder(&mut self) { self.decoder.clear(); self.decoder.reserve(self.encoder.len()); - for (id, code) in self.encoder.iter().enumerate() { + self.is_valid_start = vec![false; self.code_range.max(0) as usize]; + self.final_codes.clear(); + self.next_codes.clear(); + // Iterate by index so the loop body can mutate the other maps without + // holding a borrow of `self.encoder` (C++ reads `encoder_[c]` by value). + for id in 0..self.encoder.len() { + let code = self.encoder[id].clone(); self.decoder.insert(code.clone(), id as u32); + let length = code.length(); + if length <= 0 { + // Trained recoders never carry an empty entry (the reader rejects + // `length < 0`, and `ComputeEncoding` emits length >= 1); the C++ + // would degenerately index `is_valid_start_[code(0)=0]`. Skipping + // it cannot affect the byte-parity diff on real data. 
+ continue; + } + let last = length - 1; // index of the final code, `code.length() - 1` + if let Some(slot) = self.is_valid_start.get_mut(code.code_at(0) as usize) { + *slot = true; + } + let prefix = code.truncated(last); + if let Some(list) = self.final_codes.get_mut(&prefix) { + let v = code.code_at(last as usize); + if !list.contains(&v) { + list.push(v); + } + } else { + self.final_codes + .insert(prefix, vec![code.code_at(last as usize)]); + let mut len = last; + loop { + len -= 1; + if len < 0 { + break; + } + let p = code.truncated(len); + let v = code.code_at(len as usize); + if let Some(list) = self.next_codes.get_mut(&p) { + // Reached via multiple code lengths: dedup, then stop — + // this prefix (and all shorter) is already seeded. + if !list.contains(&v) { + list.push(v); + } + break; + } + self.next_codes.insert(p, vec![v]); + } + } } } + /// Whether `code` is a valid start (or single) code — the C++ + /// `IsValidFirstCode` (`unicharcompress.h:182`). Bounds-checked (C++ indexes + /// `is_valid_start_[code]` unchecked); an out-of-range code is not valid. + #[must_use] + pub fn is_valid_first_code(&self, code: i32) -> bool { + usize::try_from(code) + .ok() + .and_then(|i| self.is_valid_start.get(i)) + .copied() + .unwrap_or(false) + } + + /// The valid final codes that complete a sequence from `prefix`, or `None` — + /// the C++ `GetFinalCodes` (`unicharcompress.h:193`). `prefix` is a + /// [`RecodedCharId`] truncated to the codes seen so far. + #[must_use] + pub fn get_final_codes(&self, prefix: &RecodedCharId) -> Option<&[i32]> { + self.final_codes.get(prefix).map(Vec::as_slice) + } + + /// The valid non-final continuation codes for `prefix`, or `None` — the C++ + /// `GetNextCodes` (`unicharcompress.h:187`). 
+ #[must_use] + pub fn get_next_codes(&self, prefix: &RecodedCharId) -> Option<&[i32]> { + self.next_codes.get(prefix).map(Vec::as_slice) + } + /// Render the id→code table as `"\t\t[,...]\n"` lines — the /// exact shape the C++ recoder oracle's `encode` mode prints, so the /// byte-parity diff is `diff oracle_recoder_encode.tsv rust_recoder_encode.tsv`. @@ -352,6 +497,75 @@ impl UnicharCompress { } out } + + /// Render the beam-search maps in a **deterministic order** (the C++ + /// `unordered_map` iteration order is unspecified and differs from Rust's + /// [`HashMap`], so the dump drives itself off `encoder_` id-order instead): + /// + /// ```text + /// is_valid_start\t<code_range> + /// <code>\t<0|1> // for each code in 0..code_range + /// final\t<prefix>\t<codes|-> // each distinct prefix, once + /// next\t<prefix>\t<codes|-> + /// ``` + /// + /// Distinct prefixes are enumerated by walking every entry in id order and, + /// within each, truncation lengths `0..length` ascending, emitting each prefix + /// the first time it is seen. The C++ oracle's `beam` mode performs the + /// identical walk via `GetFinalCodes` / `GetNextCodes`, so the diff is + /// `diff oracle_recoder_beam.tsv rust_recoder_beam.tsv`. The per-prefix code + /// lists are already in push order (id-ascending, deduped) on both sides. 
+ #[must_use] + pub fn dump_beam(&self) -> String { + fn csv(codes: &[i32]) -> String { + let mut s = String::new(); + for (i, c) in codes.iter().enumerate() { + if i > 0 { + s.push(','); + } + s.push_str(&c.to_string()); + } + s + } + fn list_or_dash(list: Option<&[i32]>) -> String { + match list { + Some(l) => csv(l), + None => "-".to_string(), + } + } + + let mut out = String::new(); + out.push_str("is_valid_start\t"); + out.push_str(&self.code_range.to_string()); + out.push('\n'); + for (i, &valid) in self.is_valid_start.iter().enumerate() { + out.push_str(&i.to_string()); + out.push('\t'); + out.push(if valid { '1' } else { '0' }); + out.push('\n'); + } + + let mut seen: std::collections::HashSet<RecodedCharId> = std::collections::HashSet::new(); + for code in &self.encoder { + for l in 0..code.length().max(0) { + let prefix = code.truncated(l); + if !seen.insert(prefix.clone()) { + continue; + } + out.push_str("final\t"); + out.push_str(&csv(prefix.codes())); + out.push('\t'); + out.push_str(&list_or_dash(self.get_final_codes(&prefix))); + out.push('\n'); + out.push_str("next\t"); + out.push_str(&csv(prefix.codes())); + out.push('\t'); + out.push_str(&list_or_dash(self.get_next_codes(&prefix))); + out.push('\n'); + } + } + out + } } /// A little-endian byte cursor over the recoder component — the reader half of @@ -521,6 +735,100 @@ mod tests { assert_eq!(rec.dump_decode(), "code_range\t6\n0\t0\n1\t2\n2\t2\n"); } + /// Build a `RecodedCharId` from a slice of codes — a beam-map query key. + fn rc(codes: &[i32]) -> RecodedCharId { + let mut code = [0_i32; K_MAX_CODE_LEN]; + code[..codes.len()].copy_from_slice(codes); + RecodedCharId { + self_normalized: 1, + length: codes.len() as i32, + code, + } + } + + #[test] + fn from_codes_builds_identity_key() { + // The public beam-consumer constructor agrees with the private test `rc` + // (identity = length + code[0..length]); an empty slice is the empty + // prefix (== default); overflow past kMaxCodeLen is truncated. 
+ assert_eq!(RecodedCharId::from_codes(&[2, 3]), rc(&[2, 3])); + assert_eq!(RecodedCharId::from_codes(&[2, 3]).codes(), &[2, 3]); + assert_eq!(RecodedCharId::from_codes(&[]), RecodedCharId::default()); + assert_eq!(RecodedCharId::from_codes(&[7]).length(), 1); + let over = RecodedCharId::from_codes(&[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]); + assert_eq!(over.length() as usize, K_MAX_CODE_LEN, "truncated to 9"); + assert_eq!(over.codes(), &[1, 2, 3, 4, 5, 6, 7, 8, 9]); + } + + #[test] + fn beam_maps_passthrough_all_length1() { + // 3 pass-through codes: every code is a valid start, the empty prefix maps + // all three final codes, and next_codes stays empty (the --len loop never + // runs) — the eng.lstm shape in miniature. + let rec = + UnicharCompress::from_le_bytes(&build(&[(1, &[0]), (1, &[1]), (1, &[2])])).expect("ok"); + assert!( + rec.is_valid_first_code(0) && rec.is_valid_first_code(1) && rec.is_valid_first_code(2) + ); + assert!( + !rec.is_valid_first_code(3), + "out-of-range code is not a start" + ); + assert_eq!( + rec.get_final_codes(&RecodedCharId::default()), + Some(&[0, 1, 2][..]) + ); + assert_eq!(rec.get_next_codes(&RecodedCharId::default()), None); + assert_eq!( + rec.dump_beam(), + "is_valid_start\t3\n0\t1\n1\t1\n2\t1\nfinal\t\t0,1,2\nnext\t\t-\n" + ); + } + + #[test] + fn beam_maps_trie_multicode() { + // Length-3 (Han/Hangul-shaped) entries sharing prefixes exercise the full + // trie: the `while (--len >= 0)` walk, the dedup-then-`break` on an already + // seeded next prefix, and multiple finals under one prefix. 
+ // id0 [1] id1 [2,3,4] id2 [2,3,5] id3 [2,6,7] + let rec = UnicharCompress::from_le_bytes(&build(&[ + (1, &[1]), + (1, &[2, 3, 4]), + (1, &[2, 3, 5]), + (1, &[2, 6, 7]), + ])) + .expect("ok"); + // final_codes: {} <- [1]; [2,3] <- [4,5]; [2,6] <- [7] + assert_eq!( + rec.get_final_codes(&RecodedCharId::default()), + Some(&[1][..]) + ); + assert_eq!(rec.get_final_codes(&rc(&[2, 3])), Some(&[4, 5][..])); + assert_eq!(rec.get_final_codes(&rc(&[2, 6])), Some(&[7][..])); + assert_eq!( + rec.get_final_codes(&rc(&[2])), + None, + "[2] is a next-prefix, not final" + ); + // next_codes: {} <- [2] (from id1 only); [2] <- [3,6] (id1 seeds 3, id3 adds 6 then breaks) + assert_eq!( + rec.get_next_codes(&RecodedCharId::default()), + Some(&[2][..]) + ); + assert_eq!(rec.get_next_codes(&rc(&[2])), Some(&[3, 6][..])); + // is_valid_start: only the first codes 1 and 2 (code_range = 7+1 = 8). + assert!(rec.is_valid_first_code(1) && rec.is_valid_first_code(2)); + assert!(!rec.is_valid_first_code(3) && !rec.is_valid_first_code(0)); + assert_eq!( + rec.dump_beam(), + "is_valid_start\t8\n0\t0\n1\t1\n2\t1\n3\t0\n4\t0\n5\t0\n6\t0\n7\t0\n\ + final\t\t1\nnext\t\t2\n\ + final\t2\t-\nnext\t2\t3,6\n\ + final\t2,3\t4,5\nnext\t2,3\t-\n\ + final\t2,6\t7\nnext\t2,6\t-\n" + ); + } + #[test] fn truncated_buffer_errors() { let mut bytes = build(&[(1, &[0])]);