Fix master CI: expv zero-input NaN, JET-on-1.12 QA, GPU-in-All #229

Draft
ChrisRackauckas-Claude wants to merge 2 commits into SciML:master from ChrisRackauckas-Claude:fix-master-ci-1.12-nan-jet-gpu

@ChrisRackauckas-Claude

Contributor

Fixes three independent failures on the master grouped-tests CI.

1. Core: NaN == 0.0 at basictests.jl:307 (zero-input expv)

The real expv!(w, t::Real, Ks) method was missing the iszero(beta) guard that the complex method already has. For a zero input vector, firststep! skips initializing the Krylov basis V (it only fills V[:,1] when beta != 0), so the final lmul!(beta, mul!(w, @view(V[:,1:m]), expHe)) computes 0 * <uninitialized memory>, which is NaN whenever that memory happens to contain NaN or Inf. That explains why the failure was flaky: it depends on heap contents, so it was green on some OS/runs and NaN on others. Added the same early-return guard so that expv of a zero vector is exactly zero.

Verified locally: full GROUP=Core Pkg.test passes on Julia 1.10 and 1.12 (it reliably produced NaN on 1.10 before).
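
A minimal sketch of the failure mode and the guard, using only LinearAlgebra. The function `combine!` and its `guard` keyword are hypothetical stand-ins for the final step of the real method, and NaN-filled storage stands in for uninitialized memory; this is not the package's code.

```julia
using LinearAlgebra

# Hypothetical stand-in for the last step of the real-valued expv!; the name
# `combine!` and the `guard` keyword are illustrative, not package API.
function combine!(w, beta, V, expHe; guard::Bool)
    if guard && iszero(beta)
        return fill!(w, zero(eltype(w)))   # zero input gives exactly zero output
    end
    return lmul!(beta, mul!(w, V, expHe))  # w = beta * (V * expHe)
end

V = fill(NaN, 4, 3)        # NaN stands in for uninitialized memory
expHe = [1.0, 0.0, 0.0]
beta = 0.0                 # norm of a zero input vector

unguarded = combine!(zeros(4), beta, V, expHe; guard = false)
guarded   = combine!(zeros(4), beta, V, expHe; guard = true)
```

Without the guard every entry of the result is NaN, because scaling by zero follows IEEE semantics (0 * NaN is NaN); with the guard the result is all zeros.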

2. QA: 6 JET failures on the 1 (= Julia 1.12) channel

lts (1.10) was green; only 1 (1.12) failed. On 1.12, JET traces into LinearAlgebra/Base internals, namely norm(::Vector) -> norm_recursive_check -> iterate(::Nothing) and the broadcast unalias/copyto_unaliased! path over Adjoint{T, Union{}}, and reports abstract-interpretation artifacts there that this package does not control. Scoped the QA report_call invocations to target_modules = (ExponentialUtilities,) (the standard JET-as-package-QA configuration), which keeps full coverage of this package's own code.

That scoping surfaced two genuine "may be undefined" findings, which are fixed here so the scoped analysis is clean (not silenced):

  • si in exponential! (exp_baseexp.jl) — conditionally assigned inside if s > 0, used inside a separate if s > 0; now initialized to 0 unconditionally.
  • order / kest in kiops (kiops.jl) — carried across loop iterations via the orderold/kestold "reuse" flags but only conditionally assigned; now seeded with their first-iteration defaults.

Verified locally: QA passes 17/17 on Julia 1.10 and 1.12.
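
The shape of the `si` finding can be illustrated with a scalar scaling-and-squaring toy. The function `exp_scaled` below is hypothetical and only mirrors the control flow described above (assignment in one `if s > 0`, use in a separate `if s > 0`); it is not the code in exp_baseexp.jl.

```julia
# Scalar toy of scaling and squaring: exp(x) == exp(x / 2^si)^(2^si).
# `exp_scaled` is a hypothetical illustration, not package code.
function exp_scaled(x::Float64, s::Int)
    si = 0                  # the fix: assigned on every path
    if s > 0
        si = s              # previously the only assignment of `si`
        x /= 2.0^si         # scale the argument down
    end
    y = exp(x)              # stand-in for the Padé approximant
    if s > 0
        for _ in 1:si       # a static analyzer cannot tie this branch to the first one
            y *= y          # square back up
        end
    end
    return y
end

scaled   = exp_scaled(1.0, 3)
unscaled = exp_scaled(1.0, 0)
```

At runtime the two branches always agree, so the original code never read an undefined variable; the unconditional initialization simply makes that provable without changing results. The same idea applies to seeding `order` and `kest` before the loop in `kiops`.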

3. Core (windows): "CUDA driver not functional"

On Windows the Core job runs the run_tests "All" aggregate, which pulled in the GPU group, and using CUDA errored on the non-GPU runner. Marked the GPU group in_all = false so it only ever runs under an explicit GROUP=GPU on the self-hosted CUDA runner. Verified locally: GROUP=All now runs only Core/basictests.jl, never GPU/gputests.jl.
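
A rough sketch of how an `in_all` flag keeps a group out of the "All" aggregate. The group table and the `files_for` helper are hypothetical and are not the actual run_tests API; only the group names, file paths, and the `in_all` flag come from the description above.

```julia
# Hypothetical group table; only the names, paths, and the `in_all` flag
# come from the PR text. This is not the real run_tests configuration.
groups = [
    (name = "Core", files = ["Core/basictests.jl"], in_all = true),
    (name = "GPU",  files = ["GPU/gputests.jl"],    in_all = false),
]

# Select the test files for a GROUP value. "All" only picks groups that opted in;
# any other value selects the group with that exact name.
function files_for(groups, group::AbstractString)
    selected = if group == "All"
        filter(g -> g.in_all, groups)
    else
        filter(g -> g.name == group, groups)
    end
    return reduce(vcat, (g.files for g in selected); init = String[])
end

all_files = files_for(groups, "All")
gpu_files = files_for(groups, "GPU")
```

With this selection rule, GPU tests are reachable only through an explicit GROUP=GPU, which matches the behavior verified above.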

Not addressed (reported separately)

  • Core (julia pre, macos-latest): Static Arrays tolerance failure at basictests.jl:265 (expv(t,A,b) ≈ exp(t*A)*b). On linux Julia 1.13-rc1 the worst relative error is 1.25e-15; the macOS-pre failure shows ~1e-7. This is a macOS/1.13-rc-specific accuracy difference I could not reproduce or correctly fix on linux, and I will not loosen the tolerance without being able to prove the macOS deviation is benign.
  • GPU (self-hosted): requires CUDA hardware (infra), out of scope here.

Please ignore until reviewed by @ChrisRackauckas.

ChrisRackauckas and others added 2 commits June 19, 2026 05:17
Three independent master-CI failures on the grouped-tests workflow:

1. Core (NaN == 0.0 at basictests.jl:307, flaky across OS/version).
   The real `expv!(w, t::Real, Ks)` method lacked the `iszero(beta)`
   guard that the complex method already has. For a zero input vector
   `firststep!` skips initializing the Krylov basis V (it only fills it
   when beta != 0), so `lmul!(beta, mul!(w, V, expHe))` computes
   `0 * <uninitialized memory>`, which is NaN whenever V holds garbage.
   Add the same early-return guard, making expv of a zero vector exactly
   zero (matching the complex method). Verified: full Core suite now
   passes on Julia 1.10 and 1.12 (was reliably NaN on 1.10).

2. QA (6 JET failures on the Julia "1" = 1.12 channel; lts/1.10 was
   green). On 1.12 JET traces into LinearAlgebra/Base internals
   (`norm(::Vector)` -> `norm_recursive_check` -> `iterate(::Nothing)`,
   and the broadcast `unalias`/`copyto_unaliased!` path over
   `Adjoint{T, Union{}}`) and reports artifacts there that this package
   does not control. Scope the QA `report_call`s to
   `target_modules = (ExponentialUtilities,)` — the standard JET-as-QA
   configuration — which keeps full coverage of this package's own code.
   That scoping surfaced two genuine `may be undefined` findings, fixed
   here so the scoped analysis is clean: `si` in `exponential!` and
   `order`/`kest` in `kiops` are now unconditionally initialized before
   use. Verified: QA passes 17/17 on Julia 1.10 and 1.12.

3. Core (windows, all versions: "CUDA driver not functional"). On
   Windows the Core job runs the run_tests "All" aggregate, which pulled
   in the GPU group and `using CUDA` errored on the non-GPU runner. Mark
   the GPU group `in_all = false` so it only runs under an explicit
   GROUP=GPU on the self-hosted CUDA runner. Verified locally: GROUP=All
   now runs only Core/basictests.jl, never GPU/gputests.jl.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The "Static Arrays" testset compared `expv(t, A, b)` against `exp(t * A) * b`
where `exp(t * A)` is StaticArrays' SMatrix matrix exponential. That reference
uses an unbalanced scaling-and-squaring Padé path which loses ~7-9 digits for
the larger non-normal N=8 cases on macOS + Julia prerelease (relerr ~1e-7..1e-5),
tripping the default-tolerance isapprox in "Core (julia pre, macos-latest)".

Verified against a 512-bit BigFloat ground truth that the macOS `expv` output is
correct to ~1e-16 on both platforms; it was the StaticArrays `exp` reference, not
`expv`, that drifted. Switching the reference to the dense LAPACK `exp`, which is
balanced and accurate on every platform, keeps this a machine-precision assertion
that still catches real `expv` regressions (no tolerance loosening).

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ChrisRackauckas-Claude

Contributor Author

Resolved the last red — Core (julia pre, macos-latest) failing at test/basictests.jl:265 in the "Static Arrays" testset.

Root cause: not an expv bug. The assertion was expv(t, A, b) ≈ exp(t * A) * b, where exp(t * A) dispatches to StaticArrays.jl's own SMatrix matrix exponential. I reconstructed the two exact failing matrices (N=8, t=1.0 and N=8, t=10.0; RNG seed 0) and computed a 512-bit BigFloat ground truth:

| Quantity | relerr vs BigFloat truth (N=8, t=1.0) | relerr vs BigFloat truth (N=8, t=10.0) | Verdict |
|---|---|---|---|
| macOS expv output (under test) | 3.2e-16 | 7.7e-16 | correct |
| macOS exp(t*A) StaticArrays reference | 4.0e-7 | 7.5e-6 | wrong |
| Linux expv | 3.2e-16 | 7.3e-16 | |
| Linux exp(t*A) StaticArrays reference | 1.9e-16 | 7.3e-16 | |

So expv is machine-accurate on both platforms. It was the reference exp(t*A) (StaticArrays' unbalanced scaling-and-squaring Padé path — the source even notes "omitted: matrix balancing") that drifted ~7-9 digits on macOS + Julia 1.13-rc1. The test was comparing a correct value against a platform-fragile reference that is less accurate than the thing under test.
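
For readers who want to reproduce this kind of check, here is a minimal sketch of a BigFloat ground truth and the relative-error measurement. It is not the script used for the numbers above: `exp_big` and `relerr` are hypothetical helpers, the Taylor-based exponential is only one possible way to build the truth, and the 2x2 rotation generator is a stand-in for the failing N=8 matrices, which are not reproduced here.

```julia
using LinearAlgebra

# Arbitrary-precision matrix exponential via scaling, Taylor series, and squaring.
# Hypothetical helper; slow, and meant only as a ground truth for small matrices.
function exp_big(A::AbstractMatrix{<:Real}; bits::Int = 512, terms::Int = 120)
    setprecision(BigFloat, bits) do
        # Scale so the 1-norm of B is at most 1/2, which makes the series converge fast.
        s = ceil(Int, log2(max(1.0, opnorm(Float64.(A), 1)))) + 1
        B = BigFloat.(A) ./ (big(2)^s)
        E = Matrix{BigFloat}(I, size(B)...)
        term = copy(E)
        for k in 1:terms
            term = term * B / k      # B^k / k!
            E += term
        end
        for _ in 1:s
            E = E * E                # undo the scaling by repeated squaring
        end
        return E
    end
end

# Relative 2-norm error of `x` against the high-precision `truth`.
relerr(x, truth) = Float64(norm(x - truth) / norm(truth))

A = [0.0 1.0; -1.0 0.0]              # stand-in matrix, exp(t*A) is a rotation
b = [1.0, 0.0]
t = 10.0

truth = exp_big(t * A) * b           # BigFloat ground truth
dense = exp(t * A) * b               # dense LAPACK-backed reference
err_dense = relerr(dense, truth)
```

Any candidate (an `expv` result, a StaticArrays `exp` reference) can be passed to `relerr` the same way, which is how a drifting reference can be told apart from a drifting implementation.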

Default tolerances for context. The SMatrix expv extension targets eps(T)/2 ≈ 1.1e-16 (default_tolerance), and the Krylov expv path's happy-breakdown tol is 1e-7; neither is the issue here. The test gate is the default isapprox (rtol ≈ 1.49e-8). The macOS error of ~1e-7..1e-5 is far above any plausible expv FP floor (I confirmed across 400 seeds and forced mo/s/break-tol perturbations that faithful expv stays ≤5e-14), which is what pointed at the reference, not expv.
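
To make the gate concrete: for Float64 the default `isapprox` relative tolerance is `sqrt(eps(Float64))`. The snippet below is a scalar simplification (the real test compares vectors by norm), and the helper `passes` is hypothetical; it plugs in the N=8, t=1.0 relative errors quoted in this comment.

```julia
rtol_default = sqrt(eps(Float64))    # default rtol of isapprox for Float64 inputs
half_eps     = eps(Float64) / 2      # the eps(T)/2 target mentioned above

truth = 1.0
# Does a value with the given relative error pass the default `≈` against `truth`?
passes(relerr) = isapprox(truth * (1 + relerr), truth)

expv_passes      = passes(3.2e-16)   # expv relative error from the comparison above
reference_passes = passes(4.0e-7)    # macOS StaticArrays reference relative error
```

An error of 4.0e-7 is more than an order of magnitude above the default tolerance, so the assertion fails regardless of how accurate `expv` is.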

Fix (no tolerance loosening): compare expv against the dense LAPACK exponential exp(t * Matrix(A)) * Vector(b), which is balanced and accurate on every platform. This keeps a machine-precision (default-tolerance) assertion that still catches real expv regressions.

Verified locally (Pkg.test GROUP=Core, full basictests.jl):

  • Julia 1.13.0-rc1 (= CI pre): 329 pass, 1 broken (pre-existing @test_broken)
  • Julia 1.12.6: 329 pass, 1 broken
  • Julia 1.10.11 (lts): 329 pass, 1 broken

The "Static Arrays" testset itself: 12 pass / 12 on rc.

Ignore until reviewed by @ChrisRackauckas.
