contract: Tesseract recoder + recognizer-leaf boards + network→V3-SoA sink (byte-parity)#643

Merged
AdaWorldAPI merged 7 commits into main from claude/happy-hamilton-0azlw4
Jul 4, 2026

Conversation

@AdaWorldAPI

@AdaWorldAPI AdaWorldAPI commented Jul 4, 2026

Owner

The Core-side of the post-#633 Tesseract-transcode plateau: the recoder content-store leaf, the Network layer-graph → V3-SoA sink, and the board record for the recognizer compute leaves (whose code lives in the companion tesseract-rs PR, per the board-hygiene rule). All additive to lance-graph-contract (zero-dep); no NodeRow/ValueTenant/ValueSchema/stride/ENVELOPE_LAYOUT_VERSION impact.

What ships (7 commits)

contract::unicharcompress (2f1df8d5) — the LSTM recoder load side (UnicharCompress + RecodedCharId, from_le_bytes = C++ DeSerialize; encode/decode/code_range). The FIRST binary-format leaf (TFile LE). Byte-parity GREEN 112 enc + 112 dec on real eng.lstm-recoder. EPIPHANIES E-CPP-PARITY-7.

contract::network (a7dba3a8) — the operator directive "6x8:8, 16-byte tenant = classid + 12 bytes, ruff→OGAR sink-in", executed the right way (NOT a hand-rolled enum). NetworkType (27 layer types, ordinal == on-wire kTypeNames) + NetworkHeader::from_le_bytes (the base header Network::CreateFromFile reads, network.cpp:214-248) + to_facet() → facet::FacetCascade (16 B = classid + 6×8:8, CascadeShape::G6D2); facet_classid = compose_classid(network_layer=0x0804, ntype) canon-high (ONE OCR-domain mint; the 27 subclasses live in the classid custom-low, not 27 slots). Byte-parity GREEN vs libtesseract Network::CreateFromFile on real eng.lstm (Series ni=36 no=111 num_weights=385807; oracle spec() == the model spec string). Reviewed by core-first-architect (TARGETS-CORE), v3-envelope-auditor (LAYOUT-CLEAN, no version bump), brutally-honest-tester (LAND); their advisories folded in (compile-lock test NETWORK_LAYER == codebook mint, custom-half invariant doc, to_facet ni/no debug_assert). EPIPHANIES E-OCR-NETWORK-SINK-1.

Recognizer-leaf boards (ba5ce72f, 856358a2, 4af9162d, c60d8f55, 4e14db01) — EPIPHANIES E-OCR-{COMPUTE-NDARRAY-SEAM,MATDOTVEC,WEIGHTMATRIX,ACTIVATION,FULLYCONNECTED}-1, the byte-parity record for the recognizer compute crate (Leaves 1-4). The code lands in the companion tesseract-rs PR; the boards land here per the workspace hygiene rule ("recognizer boards land in lance-graph").

Proof / gates

  • Byte-parity (in-env, libtesseract 5.3.4, oracles built -DFAST_FLOAT): recoder 112+112, network base-header vs Network::CreateFromFile.
  • -p lance-graph-contract: network tests green; clippy -D warnings + fmt clean. Rebased onto current main (post-onebrc/lane-j: typed GridlakeCarrierError (addresses #641 review) #642, +29 commits) with the compose_classid/facet/canonical_concept_id surface verified intact.
  • Board hygiene in-commit: EPIPHANIES (6 entries) + LATEST_STATE (contract::network + contract::unicharcompress inventory + network_layer=0x0804 codebook row).

Merge order

Merge this FIRST. The companion tesseract-rs PR (recoder consumer surface + recognizer Leaves 1-4 + network docs) builds its lance-graph-contract path dep against lance-graph main, so its CI is red until this merges. The ruff PR (harvest_network — the ruff→OGAR harvester that produced the network manifest) is independent.

🤖 Generated with Claude Code

https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1



Summary by CodeRabbit

  • New Features

    • Added support for loading and inspecting two new OCR contract formats.
    • Introduced command-line dump tools to view parsed headers and encoding tables for comparison and troubleshooting.
    • Expanded the recognized OCR codebook with an additional network-related entry.
  • Bug Fixes

    • Improved byte-level compatibility with existing OCR outputs, including more reliable parsing and round-trip behavior.
    • Added stronger validation for malformed or truncated inputs.

claude added 7 commits July 4, 2026 15:15
New zero-dep module lance_graph_contract::unicharcompress -- the load side of
Tesseract's UnicharCompress (ccutil/unicharcompress.{h,cpp}), the LSTM
recognizer's recoded-code <-> unichar-id table. First binary-format leaf: a
little-endian TFile reader (u32 count + per-RecodedCharID
[i8 self_normalized][i32 length][i32*length code]), then ComputeCodeRange
(max+1) and the decode map (last-writer-wins on a shared code). Load side only
(DeSerialize + Encode/Decode/code_range); ComputeEncoding + beam-search maps
are deferred to training/recognizer leaves.

Byte-parity GREEN on real eng.lstm-recoder: encode 112/112 + decode 112/112 +
code_range=111 (examples/recoder_dump.rs {encode,decode} diffed vs a
libtesseract 5.3.4 oracle; the 1012-byte size = 4 + 112*9 was derived before
the parse). Strict where C++ is UB: rejects length > kMaxCodeLen(9) and short
buffers.
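The wire layout described above (a `u32` count, then per entry an `i8` flag, an `i32` length, and `length` little-endian `i32` codes) can be sketched as a small reader. This is a minimal illustration under stated assumptions, not the module's code: `SketchError`, `Entry`, `take`, `read_4`, and `parse_recoder` are hypothetical names, and the decode map is left out.

```rust
/// Hypothetical error type for this sketch (not the crate's `RecoderError`).
#[derive(Debug, PartialEq, Eq)]
pub enum SketchError {
    UnexpectedEof,
    BadCodeLength(i32),
}

/// Upper bound on codes per entry (`kMaxCodeLen` in the commit message).
const MAX_CODE_LEN: i32 = 9;

/// One parsed entry: the self-normalized flag and its code sequence.
#[derive(Debug, PartialEq, Eq)]
pub struct Entry {
    pub self_normalized: i8,
    pub codes: Vec<i32>,
}

/// Split `n` bytes off the front of `buf`, failing on a short buffer.
fn take<'a>(buf: &mut &'a [u8], n: usize) -> Result<&'a [u8], SketchError> {
    let whole: &'a [u8] = *buf;
    if whole.len() < n {
        return Err(SketchError::UnexpectedEof);
    }
    let (head, tail) = whole.split_at(n);
    *buf = tail;
    Ok(head)
}

/// Read four raw bytes, ready for `from_le_bytes`.
fn read_4(buf: &mut &[u8]) -> Result<[u8; 4], SketchError> {
    let b = take(buf, 4)?;
    Ok([b[0], b[1], b[2], b[3]])
}

/// Parse the table and return the entries plus `code_range` (max code + 1).
pub fn parse_recoder(mut buf: &[u8]) -> Result<(Vec<Entry>, i32), SketchError> {
    let count = u32::from_le_bytes(read_4(&mut buf)?);
    let mut entries = Vec::new();
    let mut max_code = -1_i32;
    for _ in 0..count {
        let self_normalized = take(&mut buf, 1)?[0] as i8;
        let length = i32::from_le_bytes(read_4(&mut buf)?);
        // Strict where the commit message says C++ is UB: reject bad lengths.
        if !(0..=MAX_CODE_LEN).contains(&length) {
            return Err(SketchError::BadCodeLength(length));
        }
        let mut codes = Vec::with_capacity(length as usize);
        for _ in 0..length {
            let code = i32::from_le_bytes(read_4(&mut buf)?);
            max_code = max_code.max(code);
            codes.push(code);
        }
        entries.push(Entry { self_normalized, codes });
    }
    Ok((entries, max_code.saturating_add(1)))
}
```

Under this layout an entry carrying a single code occupies 1 + 4 + 4 = 9 bytes, which matches the quoted size of 4 + 112*9 = 1012 bytes.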

+10 unit tests; clippy -D warnings + fmt clean (-p lance-graph-contract).
Board: EPIPHANIES E-CPP-PARITY-7, LATEST_STATE contract inventory. Resolves the
OGAR #148 recoder=0x0802 concept to its content-store module.

Co-Authored-By: Claude <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
…onto ndarray

CONJECTURE (design-pass finding; byte-parity probe = recognizer Leaf 1). The
OCR recognizer is COMPUTE (dense int8 GEMM), not content -- it consumes
ndarray's existing matmul_i8_to_i32 / quantize / dequantize with no Core gap.
int8->i32 is exact + bit-reproducible across AMX/VNNI/scalar. Corrects the
"OCR is ndarray-free" framing. Cross-ref E-CPP-PARITY-7, the recognizer plan.

Co-Authored-By: Claude <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
…otes seam FINDING)

The int8 MatrixDotVector, via ndarray's matmul_i8_to_i32, equals libtesseract
exactly on synthetic int8 (integer-combined diff, TFloat-agnostic). Promotes
E-OCR-COMPUTE-NDARRAY-SEAM-1 CONJECTURE->FINDING. New crate tesseract-recognizer
(compute tier). in-env libtesseract is FAST_FLOAT.
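Why an int8 matrix-vector product can match bit for bit is easiest to see in a scalar reference: every product is widened to `i32` before summing, so there is no rounding and the accumulation order does not matter. The sketch below is a plain-Rust illustration with hypothetical function names; it is not ndarray's `matmul_i8_to_i32`, and it leaves out any bias or scaling step.

```rust
/// Dot product of two int8 vectors, accumulated exactly in i32.
/// Each product is at most 128 * 128 = 16384 in magnitude, so vectors of
/// up to 131071 elements cannot overflow the i32 accumulator.
pub fn dot_i8_to_i32(w: &[i8], x: &[i8]) -> i32 {
    assert_eq!(w.len(), x.len(), "dimension mismatch");
    w.iter()
        .zip(x)
        .map(|(&a, &b)| i32::from(a) * i32::from(b))
        .sum()
}

/// Matrix-vector product: one exact i32 output per weight row.
pub fn mat_dot_vec_i8(rows: &[Vec<i8>], x: &[i8]) -> Vec<i32> {
    rows.iter().map(|row| dot_i8_to_i32(row, x)).collect()
}
```

Because integer addition is associative, a vectorized kernel that sums in a different order should produce the same `i32` values as this loop, which is the property the finding relies on.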

Co-Authored-By: Claude <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
WeightMatrix::DeSerialize (int mode) transcoded + byte-parity vs libtesseract
(f32 bit-patterns, two shapes). forward() chains Leaf 1's proven int8 GEMM,
scaling in f32 to match FAST_FLOAT. Rust-writes / lib-reads independent proof.

Co-Authored-By: Claude <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
The LUT activations (Tanh/Logistic + Relu/Clip/Softmax) transcoded + byte-parity
vs libtesseract on a 4096-pt sweep; the regenerated tables match the baked ones.
All f32 (FAST_FLOAT). Leaf 2 + Leaf 3 = the pieces of a FullyConnected forward.

Co-Authored-By: Claude <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
…A (byte-parity)

Executes the operator directive "6x8:8, 16 B tenant = classid + 12 B,
ruff>OGAR transpiler sink-in". The polymorphic Network subclass tree lands on
the OGAR Core the right way — NOT a hand-rolled enum (that draft was the
parallel-object-model anti-pattern).

- NEW src/network.rs: NetworkType (27 layer types; ordinal == on-wire kTypeNames
  discriminant, network.h:41-78 / network.cpp:60-75) + NetworkHeader::from_le_bytes
  (the base header Network::CreateFromFile reads before subclass dispatch,
  network.cpp:214-248) + to_facet() (sinks each node onto facet::FacetCascade,
  16 B = classid + 6x8:8, CascadeShape::G6D2) + NetworkType::classid() (the
  invoke_network dispatch seed). facet_classid = compose_classid(network_layer,
  ntype) canon-high; subclass in the classid custom-low half, not 27 slots.
- ogar_codebook: ONE mint network_layer=0x0804 in the 0x08 OCR domain.
- NEW examples/network_dump.rs: the byte-parity surface.

Byte-parity GREEN on real eng.lstm: Rust NetworkHeader::from_le_bytes ==
libtesseract Network::CreateFromFile for the outer node
(Series ni=36 no=111 num_weights=385807 name=Series); the oracle's spec() ==
the model spec string (known-answer self-check, 5.5.0-hdr/5.3.4-lib ABI skew
guarded, oracle built -DFAST_FLOAT). The facet 0x08040009 decodes losslessly.
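The classid arithmetic behind `0x08040009` can be illustrated as follows. This is a sketch under assumptions: the function names are hypothetical stand-ins (the real `compose_classid` signature is not shown in this PR), and it assumes the canon concept occupies the high 16 bits and the `NetworkType` ordinal the low 16 bits, with Series at ordinal 9 as the quoted facet implies.

```rust
/// OCR-domain mint for network layers, as stated in the commit message.
pub const NETWORK_LAYER: u16 = 0x0804;

/// Hypothetical stand-in for `compose_classid`: canon concept in the high
/// half, subclass ordinal in the custom low half.
pub const fn compose_classid_sketch(canon_high: u16, custom_low: u16) -> u32 {
    ((canon_high as u32) << 16) | custom_low as u32
}

/// Inverse of the composition: recover (canon_high, custom_low).
pub const fn split_classid_sketch(classid: u32) -> (u16, u16) {
    ((classid >> 16) as u16, (classid & 0xFFFF) as u16)
}
```

With these definitions, `compose_classid_sketch(NETWORK_LAYER, 9)` yields `0x0804_0009`, and splitting it returns both halves unchanged, which is one way to read "decodes losslessly".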

Reviewed by core-first-architect (TARGETS-CORE), v3-envelope-auditor
(LAYOUT-CLEAN, no version bump), brutally-honest-tester (LAND). Folded in:
compile-lock test (NETWORK_LAYER == codebook mint), custom-half invariant doc,
to_facet debug_assert on the ni/no u16 range. +7 contract tests; clippy
-D warnings + fmt clean (scoped -p lance-graph-contract).

Board: EPIPHANIES E-OCR-NETWORK-SINK-1, LATEST_STATE contract inventory.

Co-Authored-By: Claude <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
FullyConnected::Forward (int8 path) = activation(WeightMatrix·input), the first
complete network layer, composing the two proven halves (Leaf 2 WeightMatrix +
Leaf 3 activations). Byte-parity green across all 7 activations + 2 shapes vs a
libtesseract oracle running the REAL MatrixDotVector+FuncInplace. Code lands in
tesseract-recognizer (the compute crate); board hygiene lands here per the
CLAUDE.md rule.

Co-Authored-By: Claude <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
@coderabbitai

coderabbitai Bot commented Jul 4, 2026


Walkthrough

Adds two new contract modules to lance-graph-contract: network (NetworkType/NetworkHeader parsing and FacetCascade projection) and unicharcompress (RecodedCharId/UnicharCompress recoder load and encode/decode). Adds corresponding CLI dump examples, a new ogar_codebook entry, and board documentation updates.

Changes

Network and recoder contract modules

| Layer | File(s) | Summary |
|---|---|---|
| Network header parsing and facet projection | crates/lance-graph-contract/src/network.rs, crates/lance-graph-contract/src/lib.rs | Adds NetworkType, NetworkError, NetworkHeader::from_le_bytes parsing, to_facet() projection into FacetCascade tiers, an internal ByteReader, unit tests, and wires the module into the crate. |
| Network dump CLI example | crates/lance-graph-contract/examples/network_dump.rs | Adds a CLI that parses a NetworkHeader from a file and prints the header dump plus derived facet classid and hex bytes. |
| UnicharCompress recoder load, encode/decode | crates/lance-graph-contract/src/unicharcompress.rs, crates/lance-graph-contract/src/lib.rs | Adds RecodedCharId, UnicharCompress load/encode/decode logic, decoder rebuilding, dump formatting, RecoderError, an internal ByteReader, unit tests, and wires the module into the crate. |
| Recoder dump CLI example | crates/lance-graph-contract/examples/recoder_dump.rs | Adds a CLI that loads a recoder file and prints either encoder or decode table dumps. |
| Codebook entry and board documentation | crates/lance-graph-contract/src/ogar_codebook.rs, .claude/board/EPIPHANIES.md, .claude/board/LATEST_STATE.md | Adds a network_layer (0x0804) entry to CODEBOOK, and documents byte-parity findings and module status in board markdown files. |

Estimated code review effort: 3 (Moderate) | ~30 minutes

Possibly related PRs

  • AdaWorldAPI/lance-graph#563: Introduces/defines the ogar_codebook mirror and CODEBOOK/lookup helpers that this PR extends with the new network_layer entry.

Poem

A header parsed, byte by byte,
A recoder table set just right,
Facets stacked in tidy tiers,
Parity green through all the years,
This rabbit thumps with pure delight! 🐰✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main changes: Tesseract recoder work, recognizer-leaf board updates, and the network-to-V3-SoA sink byte-parity path. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
crates/lance-graph-contract/src/unicharcompress.rs (1)

210-219: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Bound the pre-allocation to the available buffer.

count is only checked against the 50M MAX_ELEMENTS cap, then passed straight to Vec::with_capacity. A tiny hostile file (just the 4-byte count header declaring, say, 50M) forces a ~2 GB upfront allocation before the very first RecodedCharId::read fails with UnexpectedEof. Each entry needs at least 5 bytes on the wire, so you can cheaply bound the reservation to what the buffer could actually contain. This matches the module's stated hostile-input hardening posture.

♻️ Suggested bound
-        let mut encoder = Vec::with_capacity(count as usize);
+        // Each entry is at least 5 bytes (i8 self_normalized + i32 length), so a
+        // declared count larger than the remaining buffer can hold is corrupt.
+        let max_possible = r.remaining() / 5;
+        let mut encoder = Vec::with_capacity((count as usize).min(max_possible));

You'd add a small remaining() helper to ByteReader.
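A minimal sketch of the `remaining()` helper and the bounded reservation, assuming a cursor-style reader. The `ByteReader` below is a hypothetical stand-in for the module's internal type, whose fields are not shown in this review, and `bounded_capacity` is an illustrative helper name.

```rust
/// Hypothetical stand-in for the module's internal reader.
pub struct ByteReader<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> ByteReader<'a> {
    pub fn new(buf: &'a [u8]) -> Self {
        Self { buf, pos: 0 }
    }

    /// Bytes not yet consumed.
    pub fn remaining(&self) -> usize {
        self.buf.len() - self.pos
    }

    /// Read a little-endian u32, or `None` on a short buffer.
    pub fn read_u32_le(&mut self) -> Option<u32> {
        let end = self.pos.checked_add(4)?;
        let b = self.buf.get(self.pos..end)?;
        self.pos = end;
        Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
    }
}

/// Smallest possible entry on the wire: i8 flag + i32 length.
const MIN_ENTRY_BYTES: usize = 5;

/// Cap the reservation at what the unread bytes could actually hold.
pub fn bounded_capacity(declared: u32, reader: &ByteReader<'_>) -> usize {
    (declared as usize).min(reader.remaining() / MIN_ENTRY_BYTES)
}
```

With a 4-byte file that declares 50M entries, the reservation drops to zero and the first entry read fails on EOF as before, without the large upfront allocation.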

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/lance-graph-contract/src/unicharcompress.rs` around lines 210 - 219,
The pre-allocation in from_le_bytes currently trusts count after only checking
MAX_ELEMENTS, so a small input can trigger a huge Vec::with_capacity before any
RecodedCharId::read occurs. Add a ByteReader remaining() helper and use it in
from_le_bytes to cap the reserved encoder size to the maximum number of entries
that can fit in the available buffer, while keeping the existing
RecoderError::TooManyElements guard and the RecodedCharId::read loop intact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 0ac97b88-4e98-4885-af9b-7a69775ac83d

📥 Commits

Reviewing files that changed from the base of the PR and between 0ac5eb9 and 4e14db0.

📒 Files selected for processing (8)
  • .claude/board/EPIPHANIES.md
  • .claude/board/LATEST_STATE.md
  • crates/lance-graph-contract/examples/network_dump.rs
  • crates/lance-graph-contract/examples/recoder_dump.rs
  • crates/lance-graph-contract/src/lib.rs
  • crates/lance-graph-contract/src/network.rs
  • crates/lance-graph-contract/src/ogar_codebook.rs
  • crates/lance-graph-contract/src/unicharcompress.rs

Comment on lines +264 to +277
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NetworkError {
    /// The buffer ended before the base header was fully read.
    UnexpectedEof,
    /// The `tag` byte was not `NT_NONE`(0) — an unversioned/foreign blob
    /// (`getNetworkType` only branches into the string path when `tag == 0`).
    BadTag(i8),
    /// The `type_name` string did not match any [`NetworkType::TYPE_NAMES`]
    /// entry (`getNetworkType`'s `data == NT_COUNT` path).
    UnknownType,
    /// A negative dimension (`ni`/`no`/`num_weights` are non-negative for any
    /// serialized model).
    NegativeDim,
}


📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

NetworkError doesn't follow the crate's snafu error pattern.

NetworkError is a bare enum with no Display/std::error::Error impl at all (its sibling RecoderError in unicharcompress.rs at least hand-rolls Display/Error, still not via snafu). The coding guidelines call for reusing snafu error patterns for Rust error types in this crate.

♻️ Suggested snafu-based error
-#[derive(Debug, Clone, Copy, PartialEq, Eq)]
-pub enum NetworkError {
-    /// The buffer ended before the base header was fully read.
-    UnexpectedEof,
-    /// The `tag` byte was not `NT_NONE`(0) — an unversioned/foreign blob
-    /// (`getNetworkType` only branches into the string path when `tag == 0`).
-    BadTag(i8),
-    /// The `type_name` string did not match any [`NetworkType::TYPE_NAMES`]
-    /// entry (`getNetworkType`'s `data == NT_COUNT` path).
-    UnknownType,
-    /// A negative dimension (`ni`/`no`/`num_weights` are non-negative for any
-    /// serialized model).
-    NegativeDim,
-}
+#[derive(Debug, Clone, Copy, PartialEq, Eq, snafu::Snafu)]
+pub enum NetworkError {
+    #[snafu(display("network header buffer ended before it was fully read"))]
+    UnexpectedEof,
+    #[snafu(display("network header tag {tag} was not NT_NONE(0)"))]
+    BadTag { tag: i8 },
+    #[snafu(display("network header type name did not match any known NetworkType"))]
+    UnknownType,
+    #[snafu(display("network header contained a negative dimension"))]
+    NegativeDim,
+}

As per coding guidelines, crates/**/*.rs: "reuse snafu error patterns" — this applies to the new error type here.
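Since the PR description calls lance-graph-contract zero-dep, a hand-rolled `Display`/`Error` impl (the approach this comment attributes to `RecoderError`) is one way to get standard error behavior without adding snafu. The sketch below keeps the enum shape from the diff; the message strings are borrowed from the suggestion above and are illustrative only.

```rust
use std::fmt;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NetworkError {
    UnexpectedEof,
    BadTag(i8),
    UnknownType,
    NegativeDim,
}

impl fmt::Display for NetworkError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::UnexpectedEof => {
                f.write_str("network header buffer ended before it was fully read")
            }
            Self::BadTag(tag) => {
                write!(f, "network header tag {} was not NT_NONE(0)", tag)
            }
            Self::UnknownType => {
                f.write_str("network header type name did not match any known NetworkType")
            }
            Self::NegativeDim => {
                f.write_str("network header contained a negative dimension")
            }
        }
    }
}

// An empty impl is enough: `Debug` and `Display` are already provided.
impl std::error::Error for NetworkError {}
```

This keeps the tuple variant `BadTag(i8)`, so existing call sites such as `Err(NetworkError::BadTag(tag))` would not need to change.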

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/lance-graph-contract/src/network.rs` around lines 264 - 277,
`NetworkError` is currently a plain enum, so update it to follow the crate’s
existing snafu-based error pattern instead of relying on a bare type. Add the
appropriate snafu error derive/annotations to `NetworkError`, define per-variant
messages for `UnexpectedEof`, `BadTag`, `UnknownType`, and `NegativeDim`, and
make sure the type still supports standard error usage through the generated
`Display` and `std::error::Error` behavior. Use `NetworkError` and its variants
in `network.rs` as the main anchor when updating the error definition.

Source: Coding guidelines

Comment on lines +364 to +395
#[inline]
#[must_use]
pub fn to_facet(&self) -> FacetCascade {
    // ni/no are the semantic dims that MUST round-trip; every real eng.lstm dim
    // is < 65536, but a hypothetical wider model would truncate here silently.
    // Fail loudly in debug (mirrors the CANON mint-path `debug_assert`); a real
    // out-of-range dim is the trigger to add an out-of-line escape. `ni`/`no` are
    // non-negative (`NegativeDim` is rejected in `from_le_bytes`). `network_flags`
    // is a bitmask whose low-16 is the documented projection, not a dim, so it is
    // deliberately not asserted. The prefix-routing redouts (`hi_distance` etc.)
    // are NOT meaningful across the tiers-3/4 `num_weights` split — this facet is
    // read as 6× concatenated-`u16`, not as `hi`/`lo` prefix chains.
    debug_assert!(
        (self.ni as u32) <= u16::MAX as u32 && (self.no as u32) <= u16::MAX as u32,
        "network ni/no exceeds u16 — needs an out-of-line escape (network.rs::to_facet)"
    );
    let nw = self.num_weights as u32;
    FacetCascade {
        facet_classid: self.ntype.classid(),
        tiers: [
            tier_u16(self.ni as u32 as u16),
            tier_u16(self.no as u32 as u16),
            tier_u16(self.network_flags as u32 as u16),
            tier_u16((nw & 0xFFFF) as u16),
            tier_u16((nw >> 16) as u16),
            FacetTier {
                lo: self.training as u8,
                hi: u8::from(self.needs_backprop),
            },
        ],
    }
}


🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Silent truncation of ni/no in release builds.

to_facet() guards the ni/no ≤ u16::MAX invariant only with debug_assert!, which is compiled out in release builds. If from_le_bytes is ever fed a header where ni/no exceed u16::MAX (e.g. a corrupted or unexpected future model), release builds will silently truncate the values into the facet with no error signal: a data-integrity gap for what is meant to be a byte-parity contract surface.

Consider returning a Result (or a dedicated NetworkError variant) instead of relying on debug_assert! for this invariant, so the out-of-line escape mentioned in the doc comment is actually enforced in production.
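One possible shape for a release-enforced check is sketched below. Everything here is hypothetical: `Tier`, `DimOutOfRange`, `checked_tier`, and `project_dims` are illustrative names rather than the crate's types, and the low-byte/high-byte split inside `tier_u16` is an assumption about how a `u16` maps onto a tier.

```rust
/// Hypothetical 2-byte tier (stand-in for `FacetTier`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Tier {
    pub lo: u8,
    pub hi: u8,
}

/// Hypothetical error carrying the offending field and value.
#[derive(Debug, PartialEq, Eq)]
pub struct DimOutOfRange {
    pub field: &'static str,
    pub value: i32,
}

/// Assumed mapping: low byte into `lo`, high byte into `hi`.
fn tier_u16(v: u16) -> Tier {
    Tier { lo: (v & 0xFF) as u8, hi: (v >> 8) as u8 }
}

/// Range-check in every build profile, then project.
fn checked_tier(field: &'static str, value: i32) -> Result<Tier, DimOutOfRange> {
    if (0..=i32::from(u16::MAX)).contains(&value) {
        Ok(tier_u16(value as u16))
    } else {
        Err(DimOutOfRange { field, value })
    }
}

/// Project `ni`/`no` into two tiers, or report which one did not fit.
pub fn project_dims(ni: i32, no: i32) -> Result<[Tier; 2], DimOutOfRange> {
    Ok([checked_tier("ni", ni)?, checked_tier("no", no)?])
}
```

For the eng.lstm outer node quoted earlier (ni=36, no=111) both dims fit, while a dim such as 70000 would return an error instead of wrapping.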

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/lance-graph-contract/src/network.rs` around lines 364 - 395, The issue
is that `to_facet()` only enforces the `ni`/`no` `u16::MAX` invariant with
`debug_assert!`, so release builds can silently truncate invalid values. Update
`Network::to_facet` to perform a production check and return a `Result` (or a
dedicated `NetworkError`) when `ni` or `no` exceed `u16::MAX`, and propagate
that error from the caller paths instead of constructing `FacetCascade`
unconditionally. Keep the existing `FacetCascade`/`tier_u16` mapping logic for
valid values, but make the out-of-line escape mentioned in the doc comment
explicit and enforced in release builds.

Comment on lines +290 to +300
fn compute_code_range(&mut self) {
    let mut max = -1_i32;
    for entry in &self.encoder {
        for &c in entry.codes() {
            if c > max {
                max = c;
            }
        }
    }
    self.code_range = max + 1;
}


🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

max + 1 can overflow on hostile code values.

Code values are read as raw i32 with no upper-bound validation (unlike length). A single entry with a code of i32::MAX makes max + 1 overflow — panic in debug, wrap to i32::MIN in release. Given the module explicitly guards BadCodeLength/UnexpectedEof against corrupt input, this path deserves the same treatment.

🛡️ Proposed fix
-        self.code_range = max + 1;
+        self.code_range = max.saturating_add(1);
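If the overflow should surface as corrupt input instead of being clamped, `checked_add` is an alternative to the saturating increment. The sketch below is a free function over code slices with a hypothetical error type; it is not the module's method and assumes nothing about `RecoderError`.

```rust
/// Hypothetical error for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub struct CodeRangeOverflow;

/// Return max code + 1 over all entries, or an error if that overflows i32.
/// An empty table yields 0, matching the `max = -1` starting point.
pub fn code_range<'a, I>(entries: I) -> Result<i32, CodeRangeOverflow>
where
    I: IntoIterator<Item = &'a [i32]>,
{
    let mut max = -1_i32;
    for codes in entries {
        for &c in codes {
            if c > max {
                max = c;
            }
        }
    }
    max.checked_add(1).ok_or(CodeRangeOverflow)
}
```

The trade-off is that the caller then has to propagate an error, whereas `saturating_add` keeps the method infallible at the cost of reporting a `code_range` of `i32::MAX` for a hostile table.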
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/lance-graph-contract/src/unicharcompress.rs` around lines 290 - 300,
The compute_code_range method can overflow when it sets self.code_range to max +
1 after scanning self.encoder for raw code values. Add validation or a
checked/saturating increment so hostile i32::MAX codes do not panic in debug or
wrap in release, and handle the invalid input consistently with the existing
BadCodeLength/UnexpectedEof corruption checks. Keep the fix localized to
compute_code_range and its code_range assignment.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4e14db01b0

Comment on lines +312 to +313
if tag != 0 {
    return Err(NetworkError::BadTag(tag));


[P2] Accept ordinal-encoded network headers

For legacy/order-coded network blobs, Tesseract's getNetworkType treats a non-zero first byte as the NetworkType ordinal and continues reading the rest of the header without a type-name string. Rejecting every non-zero byte here means NetworkHeader::from_le_bytes fails on network files that Network::CreateFromFile still accepts, so the Rust byte-parity loader is narrower than the C++ reader for those serialized models.
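The two-path dispatch described here could look roughly like the following. This is a sketch under assumptions: `resolve_type` and `TypeError` are hypothetical names, the name table is passed in because the real `kTypeNames` list is not reproduced in this review, and the range check on the ordinal is an added strictness that the comment does not attribute to the C++ reader.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum TypeError {
    MissingName,
    UnknownType,
    BadOrdinal(i8),
}

/// Resolve a layer-type ordinal from the leading tag byte.
/// - tag == 0: the type is given by a name string looked up in `names`.
/// - tag != 0: the tag itself is treated as the ordinal (legacy path).
pub fn resolve_type(
    tag: i8,
    type_name: Option<&str>,
    names: &[&str],
) -> Result<usize, TypeError> {
    if tag == 0 {
        let name = type_name.ok_or(TypeError::MissingName)?;
        names
            .iter()
            .position(|candidate| *candidate == name)
            .ok_or(TypeError::UnknownType)
    } else if tag > 0 && (tag as usize) < names.len() {
        Ok(tag as usize)
    } else {
        Err(TypeError::BadOrdinal(tag))
    }
}
```

In the legacy branch no name string follows the tag, so a real reader would also need to skip the string read before continuing with the rest of the header.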


Comment on lines +384 to +385
tier_u16(self.ni as u32 as u16),
tier_u16(self.no as u32 as u16),


[P2] Avoid silently truncating wide layer dimensions

When a valid custom network has ni or no above u16::MAX, release builds skip the debug_assert! and these casts wrap the dimensions into the facet. That silently corrupts the SoA projection for those models even though the header format stores the dimensions as i32; this should either return an error or preserve the overflow out-of-line before constructing a FacetCascade.


@AdaWorldAPI AdaWorldAPI merged commit e2ad431 into main Jul 4, 2026
7 checks passed
AdaWorldAPI pushed a commit that referenced this pull request Jul 4, 2026
…perator ruling)

Operator ruling 2026-07-04 ("mark all as migration mandatory"): the V1
contiguous-u24 node-key tail (family:u24 ++ identity:u24) is forbidden and its
migration to the V3 6×(u8:u8) facet is mandatory on every surface — upgrading the
le-contract §L7 #2 reconciliation from optional to a hard mandate.

- ISSUES.md ISS-V1-U24-TAIL-MIGRATION-MANDATORY: the full residue enumerated with
  file:line (ocr.rs:121, soa_graph.rs:412, aiwar.rs:104, action.rs:417/693,
  callcenter graph_table + OWL bytes[13..16] writers, ogar lib.rs:195, and the
  CLAUDE.md CANON doc), each mandatory, each gated per-site on v3-envelope-auditor.
  Records the mechanism (no new_v3 constructor; classid tail_variant resolves V3)
  and the gotcha (NodeGuid::new byte-packing does NOT align with the V3 reading —
  classid swap alone is insufficient). Test-only fold assertions exempt.
- EPIPHANIES.md E-V3-V1-U24-MIGRATION-MANDATORY: the ruling as policy.

Confirmed already V3-clean (no action): the Tesseract transcode arc
(contract::network FacetCascade #643, recoder, tesseract-recognizer, ruff
harvest) + OGAR render_class_with_methods (#150) — zero contiguous-u24.

Board-only; no code, no build step.

Co-Authored-By: Claude <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1