recognizer Leaf 7 Core side: recoder SetupDecoder beam maps (7a) + RecodedCharId::from_codes (7b) + board #647
Merged
…reen

The recurrent layer, the hardest recognizer leaf: 4 gates via Leaf 4 (fully_connected_forward) + the cell recurrence c=clip(GF1·c+CI·GI,±100), output h=tanh(c)·GO, and the int8-quantized recurrent feedback. Byte-parity green across 3 shapes incl. ns=48/ni=36 × 8 timesteps vs a libtesseract oracle running the REAL per-timestep LSTM::Forward body.

Code in tesseract-rs #5 (recognizer Leaves 1-5); board hygiene lands here per the CLAUDE.md rule.

Co-Authored-By: Claude <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
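For readers skimming the recurrence in the commit message above, here is a minimal floating-point sketch of one timestep. It is not the int8 leaf from tesseract-rs: the function name `lstm_cell_step`, the `f64` element type, and the assumption that the four gate vectors arrive already activated are all illustrative choices; only the two formulas and the ±100 clip come from the text.

```rust
/// Bound on the cell-state magnitude, taken from the ±100 clip quoted above.
const STATE_CLIP: f64 = 100.0;

/// One timestep of the cell recurrence (hypothetical helper, float only).
/// `ci`, `gi`, `gf1`, `go` are assumed to be already-activated gate outputs;
/// in the real leaf they would come from the Leaf 4 fully-connected forward.
/// Updates the cell state `c` in place and returns the output `h`.
pub fn lstm_cell_step(
    c: &mut [f64],
    ci: &[f64],
    gi: &[f64],
    gf1: &[f64],
    go: &[f64],
) -> Vec<f64> {
    let n = c.len();
    assert!(ci.len() == n && gi.len() == n && gf1.len() == n && go.len() == n);
    let mut h = Vec::with_capacity(n);
    for i in 0..n {
        // c = clip(GF1·c + CI·GI, ±100)
        let next = (gf1[i] * c[i] + ci[i] * gi[i]).clamp(-STATE_CLIP, STATE_CLIP);
        c[i] = next;
        // h = tanh(c)·GO
        h.push(next.tanh() * go[i]);
    }
    h
}
```

Calling it once per timestep with the same `c` buffer carries the state forward; the quantized recurrent feedback of `h` into the next step's gates is deliberately left out of this sketch.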
… green

The composition (graph::Layer: Series/Reversed/Parallel) that chains the proven layer leaves into a network forward, with the inter-layer int8 requant. Byte-parity green: Series[LSTM,FC] across 4 shapes incl. ns=96/ni=192/no=111 (eng.lstm's LSTM192→Fc111 tail) vs a libtesseract oracle.

Code in tesseract-rs #5 (Leaves 1-6); board lands here per the CLAUDE.md rule. Next: Leaf 7 recodebeam (CTC) → text.

Co-Authored-By: Claude <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
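The Series/Reversed/Parallel composition named above can be pictured with a small recursive enum. This is only a sketch of the shape, assuming float sequences and two made-up leaves (`Scale` and `CumSum`) in place of the real LSTM and FC leaves; it is not the `graph::Layer` type from tesseract-rs, and the inter-layer int8 requant is omitted.

```rust
/// A sequence of timesteps, each a feature vector.
pub type Seq = Vec<Vec<f32>>;

pub enum Layer {
    /// Hypothetical stateless leaf: multiply every feature by a constant.
    Scale(f32),
    /// Hypothetical stateful leaf: running sum over time (stands in for a recurrence).
    CumSum,
    /// Feed each child's output into the next child.
    Series(Vec<Layer>),
    /// Run the child on the time-reversed input, then restore the order.
    Reversed(Box<Layer>),
    /// Run every child on the same input and concatenate features per timestep.
    Parallel(Vec<Layer>),
}

impl Layer {
    pub fn forward(&self, input: &Seq) -> Seq {
        match self {
            Layer::Scale(k) => input
                .iter()
                .map(|t| t.iter().map(|x| x * k).collect())
                .collect(),
            Layer::CumSum => {
                let mut acc: Vec<f32> = Vec::new();
                input
                    .iter()
                    .map(|t| {
                        if acc.is_empty() {
                            acc = vec![0.0; t.len()];
                        }
                        for (a, x) in acc.iter_mut().zip(t) {
                            *a += *x;
                        }
                        acc.clone()
                    })
                    .collect()
            }
            Layer::Series(children) => {
                let mut cur = input.clone();
                for child in children {
                    cur = child.forward(&cur);
                }
                cur
            }
            Layer::Reversed(child) => {
                let rev: Seq = input.iter().rev().cloned().collect();
                let mut out = child.forward(&rev);
                out.reverse();
                out
            }
            Layer::Parallel(children) => {
                let mut out: Seq = vec![Vec::new(); input.len()];
                for child in children {
                    for (dst, src) in out.iter_mut().zip(child.forward(input)) {
                        dst.extend(src);
                    }
                }
                out
            }
        }
    }
}
```

For example, `Parallel(vec![CumSum, Reversed(Box::new(CumSum))])` gives each timestep a forward-running and a backward-running feature side by side, which is the usual way a bidirectional layer is assembled from these three combinators.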
📝 Walkthrough

This PR adds two documentation entries to the EPIPHANIES.md board file, recording byte-parity validation findings for recognizer pipeline components: Leaf 6 (graph walk composition) and Leaf 5 (LSTM::Forward int8), both marked as GREEN parity against a libtesseract oracle.

Changes: epiphany log updates
…-parity green

Transcode the full C++ UnicharCompress::SetupDecoder (unicharcompress.cpp:395-436) into the Core's UnicharCompress: the deferred half of the recoder leaf. In one ascending-id pass, setup_decoder now builds the decoder_ map AND the three beam-search trie maps that RecodeBeamSearch consumes:

- is_valid_start_ (Vec<bool>): code(0) is a valid first code (IsValidFirstCode)
- final_codes_ (prefix -> completing last codes, GetFinalCodes)
- next_codes_ (prefix -> non-final continuations, GetNextCodes)

The `while (--len >= 0)` prefix walk climbs from the direct parent toward the empty prefix, deduping and breaking at the first already-seeded next prefix.

RecodedCharId gains code_at (C++ operator()) + a private truncated (C++ Truncate; trailing code slots drop out of identity since eq/hash read code[0..length]). Three new public accessors surface the maps read-only for the recognizer's beam search (Leaf 7b): is_valid_first_code / get_final_codes / get_next_codes. Per Core-First the table is Core content; the beam SEARCH that walks it is recognizer compute.

Byte-parity GREEN on real eng.lstm-recoder: dump_beam (a deterministic walk, since unordered_map order is unspecified) diffs 114 lines byte-identical vs the new `beam` mode in recoder_oracle.cpp. Real data corrected the module docs: code_range=111 (max code 110, not 112), one shared code (id1/id2 -> 110), final_codes_[<empty>] push order 0,110,1,2..109; next_codes_ empty (all length-1). A hand-traced length-3 unit test exercises the full trie. encode + decode regressions still green.

+2 tests (814 contract total); clippy -D warnings + fmt clean (-p lance-graph-contract scoped). Cross-ref E-OCR-RECODER-BEAM-1, E-CPP-PARITY-7, E-OCR-GRAPHWALK-1.

Co-Authored-By: Claude <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
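The single-pass construction described in this commit message can be sketched roughly as follows. This is a reading of the prose above, not the Core's actual `setup_decoder`: the names `BeamMaps` and `build_beam_maps` are hypothetical, prefixes are plain `Vec<i32>` keys rather than `RecodedCharId`, and the `decoder_` map is left out.

```rust
use std::collections::HashMap;

/// The three beam-search maps, keyed by code prefix (hypothetical container).
#[derive(Debug, Default)]
pub struct BeamMaps {
    pub is_valid_start: Vec<bool>,
    pub final_codes: HashMap<Vec<i32>, Vec<i32>>,
    pub next_codes: HashMap<Vec<i32>, Vec<i32>>,
}

/// One ascending-id pass over `encoder` (index = unichar id, value = its code).
/// Assumes every code is non-empty and every value lies in `0..code_range`.
pub fn build_beam_maps(encoder: &[Vec<i32>], code_range: usize) -> BeamMaps {
    let mut maps = BeamMaps {
        is_valid_start: vec![false; code_range],
        ..Default::default()
    };
    for code in encoder {
        maps.is_valid_start[code[0] as usize] = true;
        let mut len = code.len() - 1;
        let last = code[len];
        // Direct parent prefix already known: dedupe the final code and stop.
        if let Some(list) = maps.final_codes.get_mut(&code[..len]) {
            if !list.contains(&last) {
                list.push(last);
            }
            continue;
        }
        maps.final_codes.insert(code[..len].to_vec(), vec![last]);
        // Climb from the direct parent toward the empty prefix.
        while len > 0 {
            len -= 1;
            let step = code[len];
            match maps.next_codes.get_mut(&code[..len]) {
                Some(list) => {
                    if !list.contains(&step) {
                        list.push(step);
                    }
                    break; // this prefix was already seeded by an earlier code
                }
                None => {
                    maps.next_codes.insert(code[..len].to_vec(), vec![step]);
                }
            }
        }
    }
    maps
}
```

With all-length-1 codes such as `[[0], [110], [110], [1], [2]]`, every entry lands in `final_codes` under the empty prefix in first-seen order (0, 110, 1, 2) and `next_codes` stays empty, which is the same pattern the commit message reports for the real recoder data.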
… (Leaf 7b)

Add the public `RecodedCharId::from_codes(&[i32])` constructor (the C++ `RecodedCharID::Set` loop, unicharcompress.h:43): the recognizer's RecodeBeamSearch (Leaf 7b, in tesseract-core) builds `prefix` (codes[0..length]) to query get_final_codes/get_next_codes and `full_code` (prefix ++ code) to feed DecodeUnichar. This is the last Core surface the beam needs; the beam itself (byte-parity green across 4 configs vs libtesseract on eng.lstm-recoder) lands in tesseract-rs PR #5.

Board: prepend E-OCR-RECODEBEAM-1 (Leaf 7b: the non-dict CTC beam, the "recognizer produces text" milestone).

+1 test (815 contract total); clippy -D warnings + fmt clean (-p lance-graph-contract scoped).

Co-Authored-By: Claude <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
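To make the `prefix` / `full_code` key building concrete, here is a self-contained sketch of a fixed-capacity id whose equality and hash read only `code[0..length]`. The struct layout, the `MAX_CODE_LEN` value, and the `beam_keys` helper are assumptions for illustration; only the method names `from_codes`, `code_at`, and `truncated` come from the commit messages, and their real signatures may differ.

```rust
use std::hash::{Hash, Hasher};

/// Assumed capacity for this sketch; the real constant lives in the Core.
const MAX_CODE_LEN: usize = 9;

#[derive(Clone, Copy, Debug)]
pub struct RecodedCharId {
    length: usize,
    code: [i32; MAX_CODE_LEN],
}

impl RecodedCharId {
    /// Copy `codes` into the fixed array and record how many slots are live.
    pub fn from_codes(codes: &[i32]) -> Self {
        assert!(codes.len() <= MAX_CODE_LEN);
        let mut id = RecodedCharId { length: codes.len(), code: [0; MAX_CODE_LEN] };
        id.code[..codes.len()].copy_from_slice(codes);
        id
    }

    pub fn code_at(&self, i: usize) -> i32 {
        self.code[i]
    }

    /// Shorten the id; the stale trailing slots are left in place on purpose.
    fn truncated(mut self, len: usize) -> Self {
        self.length = self.length.min(len);
        self
    }

    fn active(&self) -> &[i32] {
        &self.code[..self.length]
    }
}

// Identity reads only the live slots, so stale trailing codes do not matter.
impl PartialEq for RecodedCharId {
    fn eq(&self, other: &Self) -> bool {
        self.active() == other.active()
    }
}
impl Eq for RecodedCharId {}
impl Hash for RecodedCharId {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.active().hash(state);
    }
}

/// Hypothetical helper: the two keys one beam step needs.
pub fn beam_keys(codes: &[i32], length: usize, code: i32) -> (RecodedCharId, RecodedCharId) {
    let prefix = RecodedCharId::from_codes(codes).truncated(length);
    let mut full = prefix.active().to_vec();
    full.push(code);
    (prefix, RecodedCharId::from_codes(&full))
}
```

Because `Eq` and `Hash` both go through the same live slice, a truncated id and a freshly built id with the same live codes land on the same map entry, which is what lets the beam use `prefix` directly as a lookup key.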
The lance-graph (OGAR Core) side of the Tesseract recognizer's Leaf 7 (the CTC decode). The compute-side beam lands in tesseract-rs PR #6; this PR is the two Core surfaces it consumes plus the board findings. Every leaf is byte-parity-proven against a live libtesseract 5.3.4 oracle.
Leaf 7a: recoder `SetupDecoder` beam maps (3242f6bb, CODE)

The deferred half of the recoder leaf (`E-CPP-PARITY-7` built only `decoder_`). `setup_decoder` now transcodes the full C++ `SetupDecoder` (unicharcompress.cpp:395-436): the `decoder_` map and the three beam-search trie maps the recognizer walks, `is_valid_start_` (`IsValidFirstCode`), `final_codes_` (`GetFinalCodes`), and `next_codes_` (`GetNextCodes`), surfaced as public accessors. `RecodedCharId` gains `code_at` + `truncated`. Byte-parity GREEN on real `eng.lstm-recoder`: `dump_beam` diffs 114 lines byte-identical vs a new `beam` mode in `recoder_oracle.cpp` (deterministic walk, since `unordered_map` order is unspecified). Real data corrected the docs: `code_range=111`, one shared code (id1/id2→110), `final_codes_[<empty>]` push order `0,110,1…109`, `next_codes_` empty (all length-1). Hand-traced length-3 unit test exercises the full trie. `E-OCR-RECODER-BEAM-1`.

Leaf 7b dependency: `RecodedCharId::from_codes` (26370b0e, CODE)

The public `from_codes(&[i32])` constructor (the C++ `RecodedCharID::Set` loop) is the beam's key builder: it makes a `prefix` (`codes[0..length]`) to query `get_final_codes`/`get_next_codes` and a `full_code` (prefix ++ code) to feed `decode`. The last Core surface the beam needs.

Board

`E-OCR-RECODER-BEAM-1` (7a) + `E-OCR-RECODEBEAM-1` (7b: the non-dict CTC beam, byte-parity green across 4 configs vs libtesseract; the "recognizer produces text" milestone) prepended to EPIPHANIES.

+2 tests (815 contract total); clippy `-D warnings` + fmt clean (`-p lance-graph-contract` scoped). Branch restarted from `main` after #643/#644/#645 merged.

Merge order

Merge this before tesseract-rs PR #6: that PR's `lance-graph-contract` path dep builds against lance-graph `main`, and its beam needs both the 7a accessors and `from_codes`. Expect #6 CI red until this merges.

🤖 Generated with Claude Code
https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1