diff --git a/.claude/board/TECH_DEBT.md b/.claude/board/TECH_DEBT.md index c3ef7040..66e9d6b0 100644 --- a/.claude/board/TECH_DEBT.md +++ b/.claude/board/TECH_DEBT.md @@ -15,6 +15,72 @@ ## Open Debt +### TD-CI-COVERAGE-MOLD-1 — `test-with-coverage` job lacks the mold linker the `test` job has (2026-06-12) + +**2026-06-12 local-repro addendum (same PR, later commit) — diagnosis CONFIRMED +and sharpened; fix extended to `debuginfo=0`.** Reproduced the failure mode +locally with CI's exact env (`RUSTFLAGS="-C debuginfo=1 -C target-cpu=x86-64-v3"`, +`CARGO_INCREMENTAL=0`, stable toolchain): + +- `cargo test --manifest-path crates/lance-graph/Cargo.toml --tests --no-run` + died **3×** at the link step with the exact opaque signature CI shows: + `rustc-LLVM ERROR: IO failure on output stream: No space left on device`, + `collect2: fatal error: ld terminated with signal 7 [Bus error]` (SIGBUS = + mmap'd output on a full filesystem), `error: could not compile … (exit + status: 101)`. Resource exhaustion, not a compile error. +- **Measured weight:** each of the 17 integration-test binaries links to + ~930 MB at `debuginfo=1`; ~252 MB stripped of debuginfo (−73 %). Set total + ≈ 16 GB + ~13 GB deps tree + instrumentation growth + `.profraw` ≈ the + hosted runner's disk/RSS budget — a cliff edge, which is exactly what a + 2/50 intermittent looks like. So there are TWO ceilings, not one: GNU-ld + RSS (mold fixes) AND disk (mold does NOT fix). +- **No test bug exists:** every integration-test binary that linked was + executed — **98/98 tests pass** against lance 7.0.0 (test_sql_query 14, + test_datafusion_varlength_complex 19, test_to_sql 12, neighborhood_cascade + 10, test_explain_output 8, test_lance_vector_search 7, test_to_spark_sql 7, + spo_ground_truth 7, spo_promotion 4, test_case_insensitivity 4, + test_complex_return_clauses 3, hdr_proof 3). The SoA-migration exoneration + above is now empirical, not inferential. +- **`debuginfo=0` is coverage-safe (verified, not assumed):** 600/600 + lance-graph-contract lib tests pass under + `-C instrument-coverage -C debuginfo=0`; the test binary embeds + `__llvm_covmap` / `__llvm_prf_{names,cnts,data}` sections and emits + `.profraw`. LLVM coverage mapping is independent of DWARF. +- **Paid-by (extended):** this PR now also sets job-level + `RUSTFLAGS: "-C debuginfo=0 -C target-cpu=x86-64-v3"` on + `test-with-coverage` (workflow-level stays `debuginfo=1` for the `test` + job). Relieves both ceilings; mold stays as parity + link-speed insurance. + Side effect: the coverage job gets its own Swatinem cache key (first run + repopulates). The "escalate to timing-race hypothesis" path below is + retired unless coverage still flakes after BOTH fixes. + +**Open — fix applied this PR, CONFIRM on next green run.** The `Rust Tests` +workflow's `test` job sets up the `mold` linker (`rui314/setup-mold@v1`) with the +comment *"Heavy lance+datafusion integration-test binaries OOM the default GNU +`ld` at the link step (intermittent)."* The sibling `test-with-coverage` job did +**not** set up mold, and links the **even larger llvm-cov-instrumented** binaries +with the default linker — so the OOM is *more* likely there. Symptom: across the +last 50 `rust-test.yml` runs, exactly 2 hit `test=success / cov=failure` +(`claude/probe-mantissa-fill` a32cb177 and `claude/nice-edison-g4rhhl` 12c5ea35); +the plain `test` job stayed green in both. **This is NOT a logic/test failure and +NOT a side-effect of the SoA-singleton migration** (`bindspace-singleton-to-mailbox-soa-v1` +et al.): a migration bug would fail the plain `test` job too — it doesn't, and the +two SoA debts (TD-RESONANCEDTO-DUP-1 P3/deferred, TD-UNBUNDLE-FROM-1 ~1-bit/100-epoch +drift) crash nothing. Confidence: HIGH on the infra cause; the workflow's own +comment names this exact OOM, mold is missing only on the coverage job, and the +failure is intermittent (= memory pressure, not a deterministic bug). +**Residual (honest):** the codecov upload step already sets `fail_ci_if_error: +false`, so the noise is a job-level ❌ that does NOT block merge (`mergeable=True`); +and without the CI log (token 403) a timing-sensitive race surfacing only under +instrumentation's slower execution cannot be *100%* excluded — but the migration's +concurrency tests (D-SNGL-6 writer+reader threads) are PROPOSAL, not shipped, so +there is no concurrent SoA test to race yet. **Paid by:** this PR adds the mold +step to the coverage job (parity with `test`). **Confirm** by a green +`test-with-coverage` run; if it still fails after mold, escalate to the +timing-race hypothesis (read the actual `cargo llvm-cov` log with a scoped token). +Cross-ref: `.github/workflows/rust-test.yml` (test job mold step vs coverage job); +`bindspace-singleton-to-mailbox-soa-v1` (the migration this is NOT). + ### TD-UNBUNDLE-FROM-1 — `unbundle_from` is NOT the inverse of `bundle_into` (2026-06-07) **Open.** `crates/lance-graph-planner/src/cache/kv_bundle.rs` — `unbundle_from` diff --git a/.github/workflows/rust-test.yml b/.github/workflows/rust-test.yml index 98e37bce..d83581ea 100644 --- a/.github/workflows/rust-test.yml +++ b/.github/workflows/rust-test.yml @@ -56,7 +56,7 @@ jobs: # Heavy lance+datafusion integration-test binaries OOM the default GNU `ld` # at the `cargo test --no-run` link step (intermittent). mold links them # fast + low-memory (already used by release.yml / rust-publish.yml). - uses: rui314/setup-mold@v1 + uses: rui314/setup-mold@9c9c13bf4c3f1adef0cc596abc155580bcb04444 # v1 - uses: Swatinem/rust-cache@v2 with: shared-key: "lance-graph-deps" @@ -101,6 +101,20 @@ jobs: test-with-coverage: runs-on: ubuntu-24.04 timeout-minutes: 30 + env: + # Override the workflow-level debuginfo=1 for this job only. llvm-cov + # coverage lives in __llvm_covmap/__llvm_prf_* ELF sections, NOT in + # DWARF (verified locally: 600/600 contract tests pass under + # `-C instrument-coverage -C debuginfo=0`, covmap sections present, + # .profraw emitted). At debuginfo=1 each of the 17 instrumented + # integration-test binaries links to ~930 MB (measured; ~252 MB at + # debuginfo=0, -73%) — the full set + deps tree sits exactly at the + # hosted runner's disk/RSS cliff, which is the 2/50 intermittent + # exit-101 (TD-CI-COVERAGE-MOLD-1). Dropping debuginfo relieves BOTH + # ceilings (GNU-ld/mold RSS at link, and disk). Note: a job-level + # RUSTFLAGS gives this job its own Swatinem cache key — the first run + # after this change repopulates the coverage cache. + RUSTFLAGS: "-C debuginfo=0 -C target-cpu=x86-64-v3" defaults: run: working-directory: lance-graph @@ -120,6 +134,14 @@ jobs: run: | rustup toolchain install stable rustup default stable + - name: Setup mold linker + # Parity with the `test` job above (TD-CI-COVERAGE-MOLD-1): the heavy + # lance+datafusion test binaries OOM the default GNU `ld` at link + # (intermittent) — and llvm-cov INSTRUMENTED binaries are larger, so + # the OOM is MORE likely here than in the plain `test` job that already + # has mold. Without this step the coverage job flaked while `test` + # stayed green (2/50 runs). mold links them fast + low-memory. + uses: rui314/setup-mold@9c9c13bf4c3f1adef0cc596abc155580bcb04444 # v1 - uses: Swatinem/rust-cache@v2 with: shared-key: "lance-graph-deps"