
recognizer Leaf 7 Core side: recoder SetupDecoder beam maps (7a) + RecodedCharId::from_codes (7b) + board #647

Merged: AdaWorldAPI merged 4 commits into main from claude/happy-hamilton-0azlw4 on Jul 4, 2026

@AdaWorldAPI (Owner) commented Jul 4, 2026

This is the lance-graph (OGAR Core) side of the Tesseract recognizer's Leaf 7 (the CTC decode). The compute-side beam lands in tesseract-rs PR #6; this PR adds the two Core surfaces that beam consumes, plus the board findings. Every leaf is byte-parity-proven against a live libtesseract 5.3.4 oracle.

Leaf 7a — recoder SetupDecoder beam maps (3242f6bb, CODE)

The deferred half of the recoder leaf (E-CPP-PARITY-7 built only decoder_). setup_decoder now transcodes the full C++ SetupDecoder (unicharcompress.cpp:395-436): the decoder_ map plus the three beam-search trie maps the recognizer walks, each surfaced through a public accessor:

  - is_valid_start_ (IsValidFirstCode)
  - final_codes_ (GetFinalCodes)
  - next_codes_ (GetNextCodes)

RecodedCharId gains code_at + truncated.

Byte-parity GREEN on real eng.lstm-recoder: dump_beam diffs 114 lines byte-identical vs a new beam mode in recoder_oracle.cpp (a deterministic walk, since unordered_map order is unspecified). Real data corrected the docs: code_range=111, one shared code (id1/id2→110), final_codes_[<empty>] push order 0,110,1…109, next_codes_ empty (all codes are length-1). A hand-traced length-3 unit test exercises the full trie. E-OCR-RECODER-BEAM-1.
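As an illustration of the trie construction described above, here is a minimal sketch of a single ascending-id pass that fills all four maps. It is not the lance-graph-contract code: `BeamMaps` and `build_beam_maps` are hypothetical names, and plain `Vec<i32>` keys stand in for `RecodedCharId`. The control flow follows the C++ behaviour as summarised in this PR (dedupe on insert, stop climbing at the first prefix already present in next_codes_).

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the maps `setup_decoder` builds. The real Core
/// type keys these by `RecodedCharId`; `Vec<i32>` is used here for brevity.
#[derive(Debug, Default)]
pub struct BeamMaps {
    pub decoder: HashMap<Vec<i32>, usize>,
    pub is_valid_start: Vec<bool>,
    pub final_codes: HashMap<Vec<i32>, Vec<i32>>,
    pub next_codes: HashMap<Vec<i32>, Vec<i32>>,
}

/// Builds all four maps in one ascending-id pass; `encoder[id]` is the code
/// sequence for id `id`. Assumes every sequence is non-empty and every code
/// lies in `0..code_range`.
pub fn build_beam_maps(encoder: &[Vec<i32>], code_range: usize) -> BeamMaps {
    let mut maps = BeamMaps {
        is_valid_start: vec![false; code_range],
        ..Default::default()
    };
    for (id, code) in encoder.iter().enumerate() {
        // When two ids share a code, the later id overwrites the earlier one.
        maps.decoder.insert(code.clone(), id);
        maps.is_valid_start[code[0] as usize] = true;

        let mut len = code.len() - 1;
        let last = code[len];
        match maps.final_codes.get_mut(&code[..len]) {
            // Direct parent already known: record the final code once.
            Some(finals) => {
                if !finals.contains(&last) {
                    finals.push(last);
                }
            }
            // New parent: seed it, then climb toward the empty prefix.
            None => {
                maps.final_codes.insert(code[..len].to_vec(), vec![last]);
                while len > 0 {
                    len -= 1;
                    let step = code[len];
                    match maps.next_codes.get_mut(&code[..len]) {
                        Some(nexts) => {
                            if !nexts.contains(&step) {
                                nexts.push(step);
                            }
                            break; // everything above this prefix is already seeded
                        }
                        None => {
                            maps.next_codes.insert(code[..len].to_vec(), vec![step]);
                        }
                    }
                }
            }
        }
    }
    maps
}
```

With an all-length-1 encoder in which two ids share a code, this sketch leaves `next_codes` empty and deduplicates the shared code under the empty prefix, which is the same shape the real eng.lstm-recoder data is reported to have.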

Leaf 7b dependency — RecodedCharId::from_codes (26370b0e, CODE)

The public from_codes(&[i32]) constructor (the C++ RecodedCharID::Set loop) is the beam's key builder: the beam uses it to build a prefix (codes[0..length]) for querying get_final_codes/get_next_codes, and a full_code (prefix ++ code) to feed decode. It is the last Core surface the beam needs.
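To make the two keys concrete, the following is a rough sketch of how a beam step might derive `prefix` and `full_code` from a slice-based constructor. `CodeKey`, `beam_keys`, and `MAX_CODE_LEN` are hypothetical; the capacity of 9 is an assumption based on the C++ `kMaxCodeLen`, and the real `RecodedCharId::from_codes` may differ in return type and error handling.

```rust
/// Assumed fixed capacity, mirroring the C++ `kMaxCodeLen` (an assumption).
const MAX_CODE_LEN: usize = 9;

/// Hypothetical stand-in for `RecodedCharId`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CodeKey {
    length: usize,
    code: [i32; MAX_CODE_LEN],
}

impl CodeKey {
    /// Copies `codes` slot by slot, like a loop of per-index `Set` calls.
    /// Panics if `codes` is longer than `MAX_CODE_LEN`.
    pub fn from_codes(codes: &[i32]) -> Self {
        assert!(codes.len() <= MAX_CODE_LEN, "code sequence too long");
        let mut key = CodeKey { length: 0, code: [0; MAX_CODE_LEN] };
        for (i, &c) in codes.iter().enumerate() {
            key.code[i] = c;
        }
        key.length = codes.len();
        key
    }

    /// The live part of the key, `code[0..length]`.
    pub fn codes(&self) -> &[i32] {
        &self.code[..self.length]
    }
}

/// Builds the two keys for one beam step: `prefix` is what the trie maps
/// would be queried with, `full_code` is what would be handed to decode.
pub fn beam_keys(path: &[i32], code: i32) -> (CodeKey, CodeKey) {
    let prefix = CodeKey::from_codes(path);
    let mut full = path.to_vec();
    full.push(code);
    (prefix, CodeKey::from_codes(&full))
}
```

For a path of `[1, 2]` and a candidate code `3`, `beam_keys` yields a prefix holding `[1, 2]` and a full code holding `[1, 2, 3]`.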

Board

E-OCR-RECODER-BEAM-1 (7a) + E-OCR-RECODEBEAM-1 (7b — the non-dict CTC beam, byte-parity green across 4 configs vs libtesseract; the "recognizer produces text" milestone) prepended to EPIPHANIES.

+2 tests (815 contract total); clippy -D warnings + fmt clean (-p lance-graph-contract scoped). Branch restarted from main after #643/#644/#645 merged.

Merge order

Merge this before tesseract-rs PR #6 — that PR's lance-graph-contract path dep builds against lance-graph main, and its beam needs both the 7a accessors and from_codes. Expect #6 CI red until this merges.

🤖 Generated with Claude Code

https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1

claude added 2 commits July 4, 2026 17:23
…reen

The recurrent layer, the hardest recognizer leaf: 4 gates via Leaf 4
(fully_connected_forward) + the cell recurrence c=clip(GF1·c+CI·GI,±100),
output h=tanh(c)·GO, and the int8-quantized recurrent feedback. Byte-parity
green across 3 shapes incl. ns=48/ni=36 × 8 timesteps vs a libtesseract oracle
running the REAL per-timestep LSTM::Forward body. Code in tesseract-rs #5
(recognizer Leaves 1-5); board hygiene lands here per the CLAUDE.md rule.
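
The cell recurrence quoted above can be written out as a small float sketch. This is only an illustration of the two formulas: it takes the four gate activations as already-computed inputs, works in f32, and leaves out the int8-quantized recurrent feedback that the real leaf reproduces. The function name `lstm_cell_step` and the constant name are hypothetical.

```rust
/// Clip bound for the cell state, taken from the ±100 in the formula above.
const STATE_CLIP: f32 = 100.0;

/// One timestep of the cell update for a single unit.
/// `ci`, `gi`, `gf1`, `go` are gate activations that are assumed to be
/// computed elsewhere (e.g. by the fully connected leaf).
/// Returns `(c, h)`: the new cell state and the output.
pub fn lstm_cell_step(c_prev: f32, ci: f32, gi: f32, gf1: f32, go: f32) -> (f32, f32) {
    // c = clip(GF1·c + CI·GI, ±100)
    let c = (gf1 * c_prev + ci * gi).clamp(-STATE_CLIP, STATE_CLIP);
    // h = tanh(c)·GO
    let h = c.tanh() * go;
    (c, h)
}
```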

Co-Authored-By: Claude <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
… green

The composition (graph::Layer: Series/Reversed/Parallel) that chains the proven
layer leaves into a network forward, with the inter-layer int8 requant. Byte-parity
green: Series[LSTM,FC] across 4 shapes incl. ns=96/ni=192/no=111 (eng.lstm's
LSTM192→Fc111 tail) vs a libtesseract oracle. Code in tesseract-rs #5 (Leaves 1-6);
board lands here per the CLAUDE.md rule. Next: Leaf 7 recodebeam (CTC) → text.

Co-Authored-By: Claude <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1

coderabbitai Bot commented Jul 4, 2026

Walkthrough

This PR adds two documentation entries to the EPIPHANIES.md board file, recording byte-parity validation findings for recognizer pipeline components: Leaf 6 (graph walk composition) and Leaf 5 (LSTM::Forward int8), both marked as GREEN parity against a libtesseract oracle.

Changes

Epiphany log updates

| Layer / File(s) | Summary |
|---|---|
| Add Leaf 6 and Leaf 5 epiphany entries (.claude/board/EPIPHANIES.md) | Documents byte-parity GREEN results for graph-walk composition (Series/Reversed/Parallel) with int8 requantization, and for LSTM::Forward 1-D int8 per-timestep recurrence, including test notes and cross-references. |

Estimated code review effort: 1 (Trivial) | ~2 minutes

Poem

A rabbit hops through boards of green,
Two findings logged, both crisp and clean.
Byte for byte, the parity holds,
LSTM whispers what the graph walk told.
🐇✨ Onward to the next epiphany scene!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ⚠️ Warning | The title points to Leaf 7 recoder changes, but the PR only updates the board with findings for Leaves 5 and 6. | Rename it to reflect the board-only update and the actual Leaf 5/6 byte-parity findings. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

…-parity green

Transcode the full C++ UnicharCompress::SetupDecoder (unicharcompress.cpp:395-436)
into the Core's UnicharCompress — the deferred half of the recoder leaf. In one
ascending-id pass, setup_decoder now builds the decoder_ map AND the three
beam-search trie maps that RecodeBeamSearch consumes:

  - is_valid_start_ (Vec<bool>): code(0) is a valid first code (IsValidFirstCode)
  - final_codes_   (prefix -> completing last codes, GetFinalCodes)
  - next_codes_    (prefix -> non-final continuations, GetNextCodes)

The `while (--len >= 0)` prefix walk climbs from the direct parent toward the
empty prefix, deduping and breaking at the first already-seeded next prefix.
RecodedCharId gains code_at (C++ operator()) + a private truncated (C++
Truncate; trailing code slots drop out of identity since eq/hash read
code[0..length]). Three new public accessors surface the maps read-only for
the recognizer's beam search (Leaf 7b): is_valid_first_code / get_final_codes /
get_next_codes. Per Core-First the table is Core content; the beam SEARCH that
walks it is recognizer compute.
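
The remark that trailing code slots drop out of identity can be illustrated with a small sketch in which equality and hashing read only `code[0..length]`. `CodeId` and `MAX_CODE_LEN` are hypothetical stand-ins (the capacity of 9 is an assumption based on the C++ `kMaxCodeLen`); the real `RecodedCharId` may be laid out differently.

```rust
use std::hash::{Hash, Hasher};

/// Assumed fixed capacity (an assumption, mirroring C++ `kMaxCodeLen`).
const MAX_CODE_LEN: usize = 9;

/// Hypothetical stand-in for `RecodedCharId`.
#[derive(Clone, Copy, Debug)]
pub struct CodeId {
    length: usize,
    code: [i32; MAX_CODE_LEN],
}

impl CodeId {
    /// Panics if `codes` is longer than `MAX_CODE_LEN`.
    pub fn new(codes: &[i32]) -> Self {
        let mut code = [0; MAX_CODE_LEN];
        code[..codes.len()].copy_from_slice(codes);
        CodeId { length: codes.len(), code }
    }

    /// Indexed read, in the spirit of the C++ `operator()`.
    pub fn code_at(&self, index: usize) -> i32 {
        self.code[index]
    }

    /// Shortens the id without clearing the slots past `length`.
    fn truncated(&self, length: usize) -> Self {
        let mut copy = *self;
        copy.length = length.min(self.length);
        copy
    }
}

// Identity is defined over the live slice only, so stale trailing slots
// left behind by `truncated` cannot affect map lookups.
impl PartialEq for CodeId {
    fn eq(&self, other: &Self) -> bool {
        self.code[..self.length] == other.code[..other.length]
    }
}

impl Eq for CodeId {}

impl Hash for CodeId {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.code[..self.length].hash(state);
    }
}
```

Under these definitions, an id built from `[1, 2, 3]` and truncated to length 2 compares and hashes equal to an id built from `[1, 2]`, so both find the same map entry.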

Byte-parity GREEN on real eng.lstm-recoder: dump_beam (a deterministic walk,
since unordered_map order is unspecified) diffs 114 lines byte-identical vs the
new `beam` mode in recoder_oracle.cpp. Real data corrected the module docs:
code_range=111 (max code 110, not 112), one shared code (id1/id2 -> 110),
final_codes_[<empty>] push order 0,110,1,2..109; next_codes_ empty (all
length-1). A hand-traced length-3 unit test exercises the full trie. encode +
decode regressions still green. +2 tests (814 contract total); clippy -D
warnings + fmt clean (-p lance-graph-contract scoped).
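
Because hash-map iteration order is unspecified on both the C++ and Rust sides, a line-by-line diff needs a dump whose order depends only on the contents. One way to get that, sketched below, is to sort the keys before printing while leaving each value list in push order. This is an assumed strategy for illustration, with a hypothetical `dump_sorted` helper and `Vec<i32>` keys; the actual dump_beam walk and its output format may differ.

```rust
use std::collections::HashMap;

/// Renders a prefix -> codes map as one line per entry, with keys in
/// lexicographic order. Value lists keep their original (push) order,
/// since that order is part of what a parity check compares.
pub fn dump_sorted(map: &HashMap<Vec<i32>, Vec<i32>>) -> Vec<String> {
    let mut keys: Vec<&Vec<i32>> = map.keys().collect();
    keys.sort();
    keys.iter()
        .map(|k| format!("{:?} -> {:?}", k, map[*k]))
        .collect()
}
```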

Cross-ref E-OCR-RECODER-BEAM-1, E-CPP-PARITY-7, E-OCR-GRAPHWALK-1.

Co-Authored-By: Claude <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
@AdaWorldAPI changed the title from "board: recognizer Leaves 5-6 (LSTM::Forward + graph walk) byte-parity findings" to "recognizer: Leaf 7a recoder SetupDecoder beam maps (Core code) + Leaves 5-6 board findings" on Jul 4, 2026
… (Leaf 7b)

Add the public `RecodedCharId::from_codes(&[i32])` constructor (the C++
`RecodedCharID::Set` loop, unicharcompress.h:43) — the recognizer's
RecodeBeamSearch (Leaf 7b, in tesseract-core) builds `prefix` (codes[0..length])
to query get_final_codes/get_next_codes and `full_code` (prefix ++ code) to feed
DecodeUnichar. This is the last Core surface the beam needs; the beam itself
(byte-parity green across 4 configs vs libtesseract on eng.lstm-recoder) lands in
tesseract-rs PR #5.

Board: prepend E-OCR-RECODEBEAM-1 (Leaf 7b — the non-dict CTC beam, the
"recognizer produces text" milestone). +1 test (815 contract total); clippy -D
warnings + fmt clean (-p lance-graph-contract scoped).

Co-Authored-By: Claude <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
@AdaWorldAPI changed the title from "recognizer: Leaf 7a recoder SetupDecoder beam maps (Core code) + Leaves 5-6 board findings" to "recognizer Leaf 7 Core side: recoder SetupDecoder beam maps (7a) + RecodedCharId::from_codes (7b) + board" on Jul 4, 2026
@AdaWorldAPI merged commit b74378c into main on Jul 4, 2026
6 checks passed