Skip to content

fix(gl): paginate gl-clone Arweave recovery and make /encrypted-blobs parsing schema-strict (#49)#70

Open
beardthelion wants to merge 4 commits into
mainfrom
fix/gl-clone-arweave-paginate-strict-blobs
Open

fix(gl): paginate gl-clone Arweave recovery and make /encrypted-blobs parsing schema-strict (#49)#70
beardthelion wants to merge 4 commits into
mainfrom
fix/gl-clone-arweave-paginate-strict-blobs

Conversation

@beardthelion

@beardthelion beardthelion commented Jun 20, 2026

Copy link
Copy Markdown
Collaborator

Summary

gl clone's Arweave fallback recovery dropped private blobs whose most recent seal fell outside the first 100 manifest anchors, and treated an /encrypted-blobs schema regression as "nothing to recover". This paginates discovery and makes the parse fail closed.

Motivation & context

Closes #49

Surfaced by CodeRabbit on #46; this is the B3 Arweave-recovery code from #36. Two defects in crates/gl/src/clone.rs:

  1. Discovery hard-coded a single first:100 GraphQL query with no pagination. One manifest is anchored per push, so a repo with more than 100 pushes silently lost recovery coverage for any blob whose latest seal fell outside that window, even though the blob is live at HEAD.
  2. /encrypted-blobs parsed the 200 body into a serde_json::Value and defaulted blobs to [] via unwrap_or_default(), so a missing or renamed field yielded a silent empty recovery instead of an error. The sibling /withheld-paths path already deserializes a typed struct and errors on drift.

Kind of change

  • Bug fix
  • Feature
  • Security fix
  • Docs
  • Tests / CI
  • Refactor (no behavior change)
  • Breaking or protocol change (issue required first)

What changed

Crate touched: gl.

  • Strict typed parse of the node /encrypted-blobs 200 body (EncryptedBlobsResponse) so schema drift is a hard error instead of a silent empty result, mirroring the existing WithheldPathsResponse convention. The sole caller surfaces the error rather than swallowing it into unwrap_or_default().
  • Cursor-paginate Arweave manifest discovery (pageInfo + after) until coverage is complete, replacing the single first:100 query.
  • Bound discovery against a hostile or buggy gateway: 1000-page and 10k-transaction caps, a per-page edge bound, cross-page dedup, and a non-advancing-cursor guard. Incomplete discovery warns rather than silently truncating.
  • Resolve equal-timestamp merge ties by Arweave block height (defensively parsed: a not-yet-mined anchor counts as newest, an unreadable height ranks lowest), preserving the pre-pagination "newest wins" behavior independent of page order.

How a reviewer can verify

cargo test -p gl --bin gl clone::
cargo clippy --workspace --all-targets -- -D warnings

30 clone tests cover the strict-parse matrix, multi-page discovery, cross-page dedup (fetched once), non-advancing-cursor termination, the per-page bound, and the block-height tie-break cases (distinct height, pending anchor, same-block, empty timestamp).

Before you request review

  • Scope is one logical change; no unrelated churn
  • cargo test --workspace passes locally
  • New behavior is covered by tests (required for fixes)
  • cargo fmt --all and cargo clippy --workspace --all-targets -- -D warnings are clean
  • Commit titles use Conventional Commits (feat(...), fix(...), docs(...))
  • Docs / .env.example updated if behavior or config changed (or N/A)
  • Checked existing PRs so this isn't a duplicate

Notes for reviewers

  • Node-side anchoring and the /encrypted-blobs endpoint are unchanged; this is read-path only.
  • Block height is gateway-reported, not independently verified. It removes page-order dependence and honest-node clock skew between nodes, not trust in the gateway (discovery already assumes an honest gateway). Ranking on block height as the primary signal instead of the node-reported timestamp is recorded as a deliberate follow-up.
  • Known follow-up: clone.rs was already over 1000 lines before this change; extracting the discovery loop into a helper to bring it back down was raised in review and deferred to keep this PR focused.

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Enhanced recovery validation to catch missing or incorrect data earlier, preventing silent failures.
    • Improved fallback recovery robustness with safeguards against infinite loops and transaction ID deduplication.
    • Strengthened error handling in recovery processes.
  • Tests

    • Expanded test coverage for recovery mechanisms and edge case scenarios.

Deserialize the /encrypted-blobs 200 body into a typed EncryptedBlobsResponse
instead of poking at a serde_json::Value and defaulting blobs to [] via
unwrap_or_default(). A server schema regression now fails closed (a hard parse
error) rather than silently yielding nothing to recover, mirroring the existing
WithheldPathsResponse convention. The sole caller still does not abort the clone
but now surfaces the error instead of swallowing it into an empty result.

Part of #49.
…eight

Discovery hard-coded a single first:100 GraphQL query with no pagination, so a
repo with more than 100 encrypted-manifest anchors (one per push) silently lost
recovery coverage for any blob whose most recent seal fell outside that window.
Follow cursor pagination (after + pageInfo) until coverage is complete, deduping
transaction ids and bounding the loop (1000 pages / 10000 ids, degenerate-cursor
guard) so a hostile or buggy gateway cannot grow the downstream per-id fetch loop
without limit. A full page returned without pageInfo, a cap hit, or a failure
after discovery began now warns; a page-1 failure (unconfigured/unreachable
gateway) stays silent as before.

Pagination concatenates pages in gateway order, which made merge_manifests'
equal-timestamp tie-break (previously insertion-order, i.e. accidentally newest
on single-page HEIGHT_DESC) gateway-dependent. Rank candidates by the composite
key (timestamp, Arweave block height, cid): timestamp primary keeps the existing
distinct-timestamp latest-wins; block height breaks equal-timestamp ties by
anchor recency (a not-yet-mined anchor counts as newest), preserving the
pre-pagination behavior; cid is a final deterministic tiebreak for same-block
anchors. Height is parsed defensively so a present-but-unreadable height (e.g. a
gateway stringifying large ints) cannot masquerade as a pending/newest anchor.

Part of #49.
…/dedup

Code-review follow-ups on the #49 recovery change:
- Cap edges parsed per GraphQL page at ARWEAVE_PAGE_SIZE so a hostile gateway
  returning one oversized page cannot defeat the MAX_TX_IDS budget or flood the
  downstream per-id manifest fetch loop (adversarial review).
- Collapse the two identical gateway-failure arms in the discovery loop into a
  single warn site, and hoist PAGE_SIZE to a named const tied to the query's
  first:100 (maintainability).
- Add tests: per-page edge bound, non-advancing-cursor termination (infinite-loop
  guard), and cross-page tx-id dedup fetched exactly once (testing).

Part of #49.
…-size invariant

Round-2 review follow-ups:
- The non-advancing-cursor test bound its mock to a _-prefixed handle with
  expect_at_most(3) and never asserted it; mockito defaults assert_on_drop=false,
  so the test passed even if the guard were removed. Bind it, tighten to expect(2)
  (the guard breaks on the second page, so discovery makes exactly two POSTs), and
  assert_async.
- Correct the parse_tx_page docstring: the per-page take() bounds the TxRef vec and
  fetch loop, not the response-body deserialization (an unbounded-body concern
  shared by every gateway read here).
- Hoist the discovery query to a const and add query_first_matches_page_size so the
  query first:N literal and ARWEAVE_PAGE_SIZE are kept equal by a test rather than
  a comment.

Part of #49.
@coderabbitai

coderabbitai Bot commented Jun 20, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 40670457-d79a-47b8-bae2-2585dc026100

📥 Commits

Reviewing files that changed from the base of the PR and between e37ea7f and 3a06b27.

📒 Files selected for processing (1)
  • crates/gl/src/clone.rs

📝 Walkthrough

Walkthrough

crates/gl/src/clone.rs receives two hardening changes: the /encrypted-blobs node-recovery path is switched from untyped serde_json::Value to a strict EncryptedBlobsResponse struct that hard-errors on schema drift, and the Arweave fallback discovery is replaced by a bounded cursor-paginated loop with tx-ID deduplication and a composite timestamp/height/cid tie-break for manifest merging.

Changes

gl clone recovery hardening

Layer / File(s) Summary
Strict /encrypted-blobs typed schema and error handling
crates/gl/src/clone.rs
Adds EncryptedBlobsResponse and EncryptedBlobEntry structs; updates recover_encrypted_blobs to deserialize into the typed body and iterate typed .blobs; changes the run call site to capture parse errors as a logged warning and fall through to Arweave with an empty vec instead of silently defaulting.
Pagination types, parse_tx_page, and merge_manifests
crates/gl/src/clone.rs
Introduces ARWEAVE_PAGE_SIZE, ARWEAVE_DISCOVERY_QUERY, TxRef, TxPage; adds parse_tx_page to extract (id, Option<u64> height) and pageInfo from GraphQL edges; defines height_rank for pending-vs-mined ordering; implements merge_manifests with a composite timestamp → height-rank → cid deterministic tie-break.
Bounded cursor-paginated discovery loop
crates/gl/src/clone.rs
Refactors the discovery section of recover_from_arweave into a loop bounded by MAX_PAGES / MAX_TX_IDS; deduplicates tx IDs across pages; guards against non-advancing cursors; treats missing pageInfo as terminal; warns only when page-1 has succeeded before an interruption.
Unit and end-to-end tests
crates/gl/src/clone.rs
Adds EncryptedBlobsResponse strict-parse tests and parse_tx_page coverage (null blocks, numeric-string heights, per-page bounding, query invariant); updates existing merge test signature; replaces timestamp-only merge suite with composite ordering tests; adds four mocked end-to-end pagination tests (later-page recovery, partial page-2 failure, cursor stall guard, cross-page dedup).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 Hoppity-hop through the Arweave chain,
No more one page—we paginate the lane!
Typed blobs now catch what once slipped through,
Cursor guards loop so we never re-queue.
Timestamps, heights, and cids align in a row,
The manifest merges with a deterministic glow! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The PR title accurately summarizes the two main changes: paginating Arweave recovery and making /encrypted-blobs parsing schema-strict.
Description check ✅ Passed The description comprehensively addresses both defects, explains motivation, covers kind of change, lists concrete changes, and includes verification steps and pre-review checklist.
Linked Issues check ✅ Passed The PR directly addresses both objectives from issue #49: cursor pagination for manifest discovery and strict typed schema parsing for /encrypted-blobs endpoint.
Out of Scope Changes check ✅ Passed All changes in clone.rs are scoped to the two identified defects in Arweave recovery with appropriate safeguards and test coverage.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/gl-clone-arweave-paginate-strict-blobs

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

gl clone Arweave recovery: paginate manifest discovery and make /encrypted-blobs parsing schema-strict

1 participant