From a2feffe283a7b27175b08eeb29f896b2240d58e3 Mon Sep 17 00:00:00 2001
From: Claude
Date: Fri, 12 Jun 2026 22:26:34 +0000
Subject: [PATCH 1/3] ci(rust-test): add mold linker to the coverage job (parity with test) + TD-CI-COVERAGE-MOLD-1

Diagnosis (grounded, not inferred): the test-with-coverage job
intermittently failed (2/50 recent runs) while the plain test job stayed
green on the SAME test command. Root cause is NOT the SoA-singleton
migration and NOT a logic bug -- a logic bug would fail the plain test
job too. The cause is a CI asymmetry: the `test` job sets up the mold
linker (with a comment that the heavy lance+datafusion binaries OOM the
default GNU ld at link), but the `test-with-coverage` job did not -- and
it links even LARGER llvm-cov instrumented binaries with the default
linker, so the OOM is more likely there.

Fix: add the identical mold setup step to the coverage job (the action
is already trusted -- used by the test job, release.yml,
rust-publish.yml).

Board: TD-CI-COVERAGE-MOLD-1 recorded (Open, paid-by this PR, confirm on
next green coverage run). The entry explicitly records that the SoA
migration plan (bindspace-singleton-to-mailbox-soa-v1) needs NO
calibration on account of this -- the coverage failure is orthogonal
infra noise, fail_ci_if_error:false already keeps it non-blocking, and
the honest residual (timing-race not 100% excluded without the 403'd
log) is noted with its escalation path.

https://claude.ai/code/session_01PBTGaPCSnnt6u3pjXpbLwY
---
 .claude/board/TECH_DEBT.md      | 29 +++++++++++++++++++++++++++++
 .github/workflows/rust-test.yml |  8 ++++++++
 2 files changed, 37 insertions(+)

diff --git a/.claude/board/TECH_DEBT.md b/.claude/board/TECH_DEBT.md
index c3ef7040..8bb75c72 100644
--- a/.claude/board/TECH_DEBT.md
+++ b/.claude/board/TECH_DEBT.md
@@ -15,6 +15,35 @@
 
 ## Open Debt
 
+### TD-CI-COVERAGE-MOLD-1 — `test-with-coverage` job lacks the mold linker the `test` job has (2026-06-12)
+
+**Open — fix applied this PR, CONFIRM on next green run.** The `Rust Tests`
+workflow's `test` job sets up the `mold` linker (`rui314/setup-mold@v1`) with the
+comment *"Heavy lance+datafusion integration-test binaries OOM the default GNU
+`ld` at the link step (intermittent)."* The sibling `test-with-coverage` job did
+**not** set up mold, and links the **even larger llvm-cov-instrumented** binaries
+with the default linker — so the OOM is *more* likely there. Symptom: across the
+last 50 `rust-test.yml` runs, exactly 2 hit `test=success / cov=failure`
+(`claude/probe-mantissa-fill` a32cb177 and `claude/nice-edison-g4rhhl` 12c5ea35);
+the plain `test` job stayed green in both. **This is NOT a logic/test failure and
+NOT a side-effect of the SoA-singleton migration** (`bindspace-singleton-to-mailbox-soa-v1`
+et al.): a migration bug would fail the plain `test` job too — it doesn't, and the
+two SoA debts (TD-RESONANCEDTO-DUP-1 P3/deferred, TD-UNBUNDLE-FROM-1 ~1-bit/100-epoch
+drift) crash nothing. Confidence: HIGH on the infra cause; the workflow's own
+comment names this exact OOM, mold is missing only on the coverage job, and the
+failure is intermittent (= memory pressure, not a deterministic bug).
+**Residual (honest):** the codecov upload step already sets `fail_ci_if_error:
+false`, so the noise is a job-level ❌ that does NOT block merge (`mergeable=True`);
+and without the CI log (token 403) a timing-sensitive race surfacing only under
+instrumentation's slower execution cannot be *100%* excluded — but the migration's
+concurrency tests (D-SNGL-6 writer+reader threads) are PROPOSAL, not shipped, so
+there is no concurrent SoA test to race yet. **Paid by:** this PR adds the mold
+step to the coverage job (parity with `test`). **Confirm** by a green
+`test-with-coverage` run; if it still fails after mold, escalate to the
+timing-race hypothesis (read the actual `cargo llvm-cov` log with a scoped token).
+Cross-ref: `.github/workflows/rust-test.yml` (test job mold step vs coverage job);
+`bindspace-singleton-to-mailbox-soa-v1` (the migration this is NOT).
+
 ### TD-UNBUNDLE-FROM-1 — `unbundle_from` is NOT the inverse of `bundle_into` (2026-06-07)
 
 **Open.** `crates/lance-graph-planner/src/cache/kv_bundle.rs` — `unbundle_from`
diff --git a/.github/workflows/rust-test.yml b/.github/workflows/rust-test.yml
index 98e37bce..d9331d71 100644
--- a/.github/workflows/rust-test.yml
+++ b/.github/workflows/rust-test.yml
@@ -120,6 +120,14 @@ jobs:
         run: |
           rustup toolchain install stable
           rustup default stable
+      - name: Setup mold linker
+        # Parity with the `test` job above (TD-CI-COVERAGE-MOLD-1): the heavy
+        # lance+datafusion test binaries OOM the default GNU `ld` at link
+        # (intermittent) — and llvm-cov INSTRUMENTED binaries are larger, so
+        # the OOM is MORE likely here than in the plain `test` job that already
+        # has mold. Without this step the coverage job flaked while `test`
+        # stayed green (2/50 runs). mold links them fast + low-memory.
+        uses: rui314/setup-mold@v1
       - uses: Swatinem/rust-cache@v2
         with:
           shared-key: "lance-graph-deps"

From b56bb2cd38e426979ed30b43c1ceac4e14462818 Mon Sep 17 00:00:00 2001
From: Claude
Date: Fri, 12 Jun 2026 22:26:34 +0000
Subject: [PATCH 2/3] =?UTF-8?q?ci(rust-test):=20coverage=20job=20debuginfo?=
 =?UTF-8?q?=3D0=20=E2=80=94=20local=20repro=20confirms=20TD-CI-COVERAGE-MO?=
 =?UTF-8?q?LD-1,=20second=20ceiling=20found?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Local reproduction with CI's exact flags (debuginfo=1, x86-64-v3,
CARGO_INCREMENTAL=0) confirms the diagnosis and sharpens it:

- The --tests --no-run build died 3x at link with CI's exact opaque
  signature: rustc-LLVM 'IO failure on output stream', ld killed by
  SIGBUS, 'could not compile ... (exit status: 101)'. Resource
  exhaustion at link — never a compile or test error.
- Measured: 17 integration-test binaries x ~930 MB at debuginfo=1
  (~252 MB at debuginfo=0, -73%). Set + deps + instrumentation +
  profraw lands exactly on a hosted runner's disk/RSS budget — a cliff
  edge, which is what a 2/50 intermittent looks like. TWO ceilings:
  GNU-ld RSS (mold fixes) AND disk (mold does not).
- No test bug: every binary that linked was executed — 98/98
  integration tests pass on lance 7.0.0. The SoA exoneration in the debt
  entry is now empirical.
- debuginfo=0 is coverage-safe, verified: 600/600 contract tests under
  '-C instrument-coverage -C debuginfo=0'; __llvm_covmap + __llvm_prf_*
  sections present; .profraw emitted. Coverage mapping is not DWARF.

Fix: job-level RUSTFLAGS '-C debuginfo=0 -C target-cpu=x86-64-v3' on
test-with-coverage only (test job keeps workflow-level debuginfo=1).
Mold stays from the parent commit. Note: job-level RUSTFLAGS gives the
coverage job its own Swatinem cache key; first run repopulates.

https://claude.ai/code/session_01PBTGaPCSnnt6u3pjXpbLwY
---
 .claude/board/TECH_DEBT.md      | 37 +++++++++++++++++++++++++++++++++
 .github/workflows/rust-test.yml | 14 +++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/.claude/board/TECH_DEBT.md b/.claude/board/TECH_DEBT.md
index 8bb75c72..66e9d6b0 100644
--- a/.claude/board/TECH_DEBT.md
+++ b/.claude/board/TECH_DEBT.md
@@ -17,6 +17,43 @@
 
 ### TD-CI-COVERAGE-MOLD-1 — `test-with-coverage` job lacks the mold linker the `test` job has (2026-06-12)
 
+**2026-06-12 local-repro addendum (same PR, later commit) — diagnosis CONFIRMED
+and sharpened; fix extended to `debuginfo=0`.** Reproduced the failure mode
+locally with CI's exact env (`RUSTFLAGS="-C debuginfo=1 -C target-cpu=x86-64-v3"`,
+`CARGO_INCREMENTAL=0`, stable toolchain):
+
+- `cargo test --manifest-path crates/lance-graph/Cargo.toml --tests --no-run`
+  died **3×** at the link step with the exact opaque signature CI shows:
+  `rustc-LLVM ERROR: IO failure on output stream: No space left on device`,
+  `collect2: fatal error: ld terminated with signal 7 [Bus error]` (SIGBUS =
+  mmap'd output on a full filesystem), `error: could not compile … (exit
+  status: 101)`. Resource exhaustion, not a compile error.
+- **Measured weight:** each of the 17 integration-test binaries links to
+  ~930 MB at `debuginfo=1`; ~252 MB stripped of debuginfo (−73 %). Set total
+  ≈ 16 GB + ~13 GB deps tree + instrumentation growth + `.profraw` ≈ the
+  hosted runner's disk/RSS budget — a cliff edge, which is exactly what a
+  2/50 intermittent looks like. So there are TWO ceilings, not one: GNU-ld
+  RSS (mold fixes) AND disk (mold does NOT fix).
+- **No test bug exists:** every integration-test binary that linked was
+  executed — **98/98 tests pass** against lance 7.0.0 (test_sql_query 14,
+  test_datafusion_varlength_complex 19, test_to_sql 12, neighborhood_cascade
+  10, test_explain_output 8, test_lance_vector_search 7, test_to_spark_sql 7,
+  spo_ground_truth 7, spo_promotion 4, test_case_insensitivity 4,
+  test_complex_return_clauses 3, hdr_proof 3). The SoA-migration exoneration
+  above is now empirical, not inferential.
+- **`debuginfo=0` is coverage-safe (verified, not assumed):** 600/600
+  lance-graph-contract lib tests pass under
+  `-C instrument-coverage -C debuginfo=0`; the test binary embeds
+  `__llvm_covmap` / `__llvm_prf_{names,cnts,data}` sections and emits
+  `.profraw`. LLVM coverage mapping is independent of DWARF.
+- **Paid-by (extended):** this PR now also sets job-level
+  `RUSTFLAGS: "-C debuginfo=0 -C target-cpu=x86-64-v3"` on
+  `test-with-coverage` (workflow-level stays `debuginfo=1` for the `test`
+  job). Relieves both ceilings; mold stays as parity + link-speed insurance.
+  Side effect: the coverage job gets its own Swatinem cache key (first run
+  repopulates). The "escalate to timing-race hypothesis" path below is
+  retired unless coverage still flakes after BOTH fixes.
+
 **Open — fix applied this PR, CONFIRM on next green run.** The `Rust Tests`
 workflow's `test` job sets up the `mold` linker (`rui314/setup-mold@v1`) with the
 comment *"Heavy lance+datafusion integration-test binaries OOM the default GNU
diff --git a/.github/workflows/rust-test.yml b/.github/workflows/rust-test.yml
index d9331d71..fd3646a9 100644
--- a/.github/workflows/rust-test.yml
+++ b/.github/workflows/rust-test.yml
@@ -101,6 +101,20 @@ jobs:
   test-with-coverage:
     runs-on: ubuntu-24.04
     timeout-minutes: 30
+    env:
+      # Override the workflow-level debuginfo=1 for this job only. llvm-cov
+      # coverage lives in __llvm_covmap/__llvm_prf_* ELF sections, NOT in
+      # DWARF (verified locally: 600/600 contract tests pass under
+      # `-C instrument-coverage -C debuginfo=0`, covmap sections present,
+      # .profraw emitted). At debuginfo=1 each of the 17 instrumented
+      # integration-test binaries links to ~930 MB (measured; ~252 MB at
+      # debuginfo=0, -73%) — the full set + deps tree sits exactly at the
+      # hosted runner's disk/RSS cliff, which is the 2/50 intermittent
+      # exit-101 (TD-CI-COVERAGE-MOLD-1). Dropping debuginfo relieves BOTH
+      # ceilings (GNU-ld/mold RSS at link, and disk). Note: a job-level
+      # RUSTFLAGS gives this job its own Swatinem cache key — the first run
+      # after this change repopulates the coverage cache.
+      RUSTFLAGS: "-C debuginfo=0 -C target-cpu=x86-64-v3"
     defaults:
       run:
         working-directory: lance-graph

From defd29005ff9ecf01b23e4d1eb006f5938d057e6 Mon Sep 17 00:00:00 2001
From: Claude
Date: Sat, 13 Jun 2026 06:40:10 +0000
Subject: [PATCH 3/3] ci(rust-test): SHA-pin rui314/setup-mold (CodeRabbit zizmor unpinned-uses)

Replace 'uses: rui314/setup-mold@v1' with the resolved commit SHA
9c9c13bf4c3f1adef0cc596abc155580bcb04444 in both occurrences (test job +
test-with-coverage job). CodeRabbit flagged line 144 only; the test
job's existing pin at line 59 carries the identical tag-retargeting risk
for the same action, so SHA-pin both for consistency.

Other tag-pinned actions in this workflow (actions/checkout,
Swatinem/rust-cache, taiki-e/install-action, codecov/codecov-action) are
pre-existing in main and out of scope for this PR.

https://claude.ai/code/session_01PBTGaPCSnnt6u3pjXpbLwY
---
 .github/workflows/rust-test.yml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/rust-test.yml b/.github/workflows/rust-test.yml
index fd3646a9..d83581ea 100644
--- a/.github/workflows/rust-test.yml
+++ b/.github/workflows/rust-test.yml
@@ -56,7 +56,7 @@ jobs:
         # Heavy lance+datafusion integration-test binaries OOM the default GNU `ld`
         # at the `cargo test --no-run` link step (intermittent). mold links them
         # fast + low-memory (already used by release.yml / rust-publish.yml).
-        uses: rui314/setup-mold@v1
+        uses: rui314/setup-mold@9c9c13bf4c3f1adef0cc596abc155580bcb04444 # v1
       - uses: Swatinem/rust-cache@v2
         with:
           shared-key: "lance-graph-deps"
@@ -141,7 +141,7 @@ jobs:
         # the OOM is MORE likely here than in the plain `test` job that already
         # has mold. Without this step the coverage job flaked while `test`
         # stayed green (2/50 runs). mold links them fast + low-memory.
-        uses: rui314/setup-mold@v1
+        uses: rui314/setup-mold@9c9c13bf4c3f1adef0cc596abc155580bcb04444 # v1
       - uses: Swatinem/rust-cache@v2
         with:
           shared-key: "lance-graph-deps"
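
Review note: the size and test-count figures quoted in PATCH 2/3 can be cross-checked with a few lines of arithmetic. The sketch below is only a back-of-envelope check, not part of the patch. Every constant is copied from the commit message and debt entry; the function and variable names are hypothetical, decimal units (1 GB = 1000 MB) are an assumption, and nothing here measures a real runner.

```python
# Back-of-envelope check of the figures quoted in PATCH 2/3.
# Constants are taken from the commit message; all names are illustrative.

BINARIES = 17                 # integration-test binaries
MB_PER_BIN_DEBUGINFO_1 = 930  # quoted size per binary at -C debuginfo=1
MB_PER_BIN_DEBUGINFO_0 = 252  # quoted size per binary at -C debuginfo=0
DEPS_TREE_GB = 13             # quoted approximate deps tree size

# Per-binary pass counts listed in the TECH_DEBT addendum.
TEST_COUNTS = {
    "test_sql_query": 14,
    "test_datafusion_varlength_complex": 19,
    "test_to_sql": 12,
    "neighborhood_cascade": 10,
    "test_explain_output": 8,
    "test_lance_vector_search": 7,
    "test_to_spark_sql": 7,
    "spo_ground_truth": 7,
    "spo_promotion": 4,
    "test_case_insensitivity": 4,
    "test_complex_return_clauses": 3,
    "hdr_proof": 3,
}


def set_total_gb(per_binary_mb: int, count: int = BINARIES) -> float:
    """Total size of all test binaries in decimal GB (assumes 1 GB = 1000 MB)."""
    return per_binary_mb * count / 1000


def reduction_percent(before_mb: int, after_mb: int) -> float:
    """Size reduction as a percentage of the original size."""
    return (before_mb - after_mb) / before_mb * 100


total_1 = set_total_gb(MB_PER_BIN_DEBUGINFO_1)
total_0 = set_total_gb(MB_PER_BIN_DEBUGINFO_0)
cut = reduction_percent(MB_PER_BIN_DEBUGINFO_1, MB_PER_BIN_DEBUGINFO_0)

print(f"debuginfo=1 binaries: {total_1:.1f} GB")
print(f"debuginfo=1 binaries + deps tree: {total_1 + DEPS_TREE_GB:.1f} GB")
print(f"debuginfo=0 binaries: {total_0:.1f} GB")
print(f"per-binary reduction: {cut:.0f}%")
print(f"tests listed: {sum(TEST_COUNTS.values())} across {len(TEST_COUNTS)} binaries")
```

The numbers are consistent with the patch text: about 15.8 GB of binaries at `debuginfo=1` (the "≈ 16 GB"), a 73% per-binary reduction, and 98 listed tests. The deps-tree figure is only added to the `debuginfo=1` total because the patch does not state its size at `debuginfo=0`.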