
Support for CUDA toolkit 13.3 #3155

Merged
maleadt merged 10 commits into main from tb/ctk_13.3 on May 28, 2026

maleadt commented May 27, 2026

No description provided.

maleadt and others added 8 commits May 27, 2026 08:51
CUDA 13.0 removed CUFFT_INCOMPLETE_PARAMETER_LIST, CUFFT_PARSE_ERROR and
CUFFT_LICENSE_ERROR from cufftResult. Since the bindings are regenerated
against 13.3, those names no longer exist, so description() threw an
UndefVarError for any error code that fell through to them. Drop the dead
branches and add descriptions for the new codes (CUFFT_MISSING_DEPENDENCY,
CUFFT_NVRTC_FAILURE, CUFFT_NVJITLINK_FAILURE, CUFFT_NVSHMEM_FAILURE).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
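The failure mode described in this commit can be reproduced without cuFFT. The sketch below is only an illustration: the `ToyResult` enum, its `TOY_*` names and their numeric values are made up and do not correspond to the real `cufftResult` codes. Julia looks up a global name only when the branch that mentions it is evaluated, so the stale function still works for codes handled earlier and throws only for codes that fall through to the dead branch.

```julia
# Hypothetical stand-in for a regenerated enum; names and values are made up.
@enum ToyResult TOY_SUCCESS=0 TOY_INVALID_PLAN=1 TOY_MISSING_DEPENDENCY=2

# Stale version: still mentions TOY_PARSE_ERROR, which is no longer defined.
function stale_description(err::ToyResult)
    if err == TOY_SUCCESS
        return "the operation completed successfully"
    elseif err == TOY_INVALID_PLAN
        return "invalid plan handle"
    elseif err == TOY_PARSE_ERROR      # undefined name, only looked up when reached
        return "internal plan database error"
    else
        return "no description for this error"
    end
end

# Fixed version: the dead branch is gone and the new code has a description.
function fixed_description(err::ToyResult)
    if err == TOY_SUCCESS
        return "the operation completed successfully"
    elseif err == TOY_INVALID_PLAN
        return "invalid plan handle"
    elseif err == TOY_MISSING_DEPENDENCY
        return "a required dependency is missing"
    else
        return "no description for this error"
    end
end

# Returns true if calling f(err) throws an UndefVarError.
function throws_undef(f, err)
    try
        f(err)
        return false
    catch e
        return e isa UndefVarError
    end
end
```

Calling `throws_undef(stale_description, TOY_MISSING_DEPENDENCY)` exercises the fall-through path, while `stale_description(TOY_SUCCESS)` never reaches the undefined name.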
cuEventElapsedTime_v2 (CUDA 12.8+) supersedes the now-deprecated v1 entry
point with improved accuracy and argument validation. Branch on
driver_version() so we call it on new enough drivers and keep the v1
fallback otherwise. Covered by the existing "events" testset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
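A minimal sketch of the version branch described in this commit, assuming the driver version is available as a `VersionNumber`. The functions `elapsed_v1`, `elapsed_v2` and `elapsed_entry_point` are hypothetical stand-ins written for this example; they do not call the driver and are not the names used in the package.

```julia
# Hypothetical stand-ins for the two driver entry points. Each returns its own
# name so that the selected path is visible without a GPU.
elapsed_v1(start, stop) = :cuEventElapsedTime
elapsed_v2(start, stop) = :cuEventElapsedTime_v2

# Select the entry point from the driver version, which is passed in explicitly
# here instead of being queried from the driver.
function elapsed_entry_point(driver::VersionNumber)
    return driver >= v"12.8" ? elapsed_v2 : elapsed_v1
end

for driver in (v"12.4", v"12.8", v"13.3")
    f = elapsed_entry_point(driver)
    println(driver, " => ", f(nothing, nothing))
end
```

Comparing `VersionNumber` values keeps the gate readable and handles patch releases such as `v"12.8.1"` without extra cases.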
cuSPARSE 12.8.1 (CUDA 13.3) added the generic SpGEAM API for C = αA + βB,
replacing the type-specific csrgeam2 routines. Prefer it when available and
keep csrgeam2 as the fallback for older versions.

Also fix the generated SpGEAM bindings: the device workspace was typed as a
host Ptr{Cvoid} (it must be CuPtr{Cvoid}), and the alpha/beta scalars are now
PtrOrCuPtr{Cvoid} to match the other generic APIs. Fixed in res/wrap too.

Covered by the existing geam tests in interfaces/mul.jl.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
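For reference, the operation that SpGEAM computes, C = αA + βB, can be written on the CPU with the `SparseArrays` standard library. This is only a reference for the arithmetic, not the GPU code path; `geam_reference` is a hypothetical helper name.

```julia
using SparseArrays

# CPU reference for C = αA + βB on sparse matrices. A and B may have different
# sparsity patterns; the result covers the union of both.
geam_reference(α, A::SparseMatrixCSC, β, B::SparseMatrixCSC) = α * A + β * B

A = sparse([1, 2], [1, 2], [1.0, 2.0], 2, 2)   # diagonal matrix diag(1, 2)
B = sparse([1, 1], [1, 2], [3.0, 4.0], 2, 2)   # first row is [3 4], second row is zero
C = geam_reference(2.0, A, 0.5, B)

println(Matrix(C))
```

A reference like this is the kind of result a geam test can compare a GPU result against after copying it back to the host.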
cuSPARSE 12.8.1 (CUDA 13.3) added native CSC support to the triangular solve
APIs. Use it instead of modelling a CSC matrix as its transposed CSR on new
enough versions; the workaround couldn't represent transa = 'C', so the
adjoint of a complex CSC matrix now works too. Relax the corresponding test
skips accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
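Why the transposed-CSR workaround cannot express `transa = 'C'` can be seen on the CPU. The sketch below, using the `SparseArrays` standard library, reads the CSC buffers of a complex matrix as if they were CSR buffers, which yields the transpose. The helper `workaround_op` is hypothetical and only illustrates the op mapping such a workaround needs; it is not the package's code.

```julia
using SparseArrays

# Expand CSR buffers (row pointers, column indices, values) into a dense matrix.
function dense_from_csr(rowptr, colval, nzval, nrows, ncols)
    M = zeros(eltype(nzval), nrows, ncols)
    for i in 1:nrows, k in rowptr[i]:(rowptr[i + 1] - 1)
        M[i, colval[k]] = nzval[k]
    end
    return M
end

# A complex lower-triangular matrix stored as CSC (Julia's native sparse format).
A = sparse([1, 2, 2], [1, 1, 2], [1.0 + 2.0im, 3.0 - 1.0im, 4.0 + 0.5im], 2, 2)

# Reading the CSC buffers as if they were CSR yields transpose(A), not A.
At = dense_from_csr(A.colptr, A.rowval, A.nzval, size(A, 2), size(A, 1))

# Hypothetical helper: the op to request on the reinterpreted CSR matrix At.
function workaround_op(transa::Char, ::Type{T}) where {T}
    transa == 'N' && return 'T'      # A            == transpose(At)
    transa == 'T' && return 'N'      # transpose(A) == At
    if transa == 'C'
        T <: Real && return 'N'      # adjoint(A) == At when the values are real
        # adjoint(A) == conj(At): conjugation without transposition, which is
        # none of the three ops 'N', 'T' and 'C'.
        throw(ArgumentError("adjoint of a complex CSC matrix is not expressible"))
    end
    throw(ArgumentError("unknown op $transa"))
end
```

With native CSC support the matrix can be described as it is stored, so the requested op can be passed through unchanged.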
CUDA added tensor-core emulation of higher precisions: BF16x9 reproduces
full FP32 accuracy (cuBLAS 12.9+) and the Ozaki fixed-point scheme emulates
FP64 (cuBLAS 13.1+, i.e. CUDA 13.0 Update 2). Expose them through the
existing `math_mode!`/`math_precision` mechanism: under FAST_MATH, a
`:BFloat16x9` precision selects FP32 emulation and `:FixedPoint` selects FP64
emulation. The math mode is applied to the cuBLAS handle (covering plain
GEMMs) and the matching compute types are returned from gemmExComputeType
(covering gemmEx!); the handle now also re-applies when the precision alone
changes. Version gates use the cuBLAS library version, which does not track
the toolkit version (CUDA 13.0u2 ships cuBLAS 13.1.0, CUDA 13.3 ships 13.5.1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
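The gating rules in this commit can be summarised as a small decision function. This is a sketch under stated assumptions: `emulation_for` and the symbols it returns are hypothetical labels, the math mode is modelled as a plain `Symbol`, and the cuBLAS version is passed in explicitly instead of being queried from the library. Only the precision names and the version thresholds are taken from the commit message.

```julia
# Hypothetical decision function: which emulation (if any) applies to element
# type T under the given math mode, math precision and cuBLAS library version.
function emulation_for(::Type{T}, mode::Symbol, precision::Symbol,
                       cublas::VersionNumber) where {T}
    # Emulation is only selected under fast math.
    mode === :FAST_MATH || return :none
    if T === Float32 && precision === :BFloat16x9 && cublas >= v"12.9"
        return :fp32_emulation       # BF16x9 on tensor cores
    elseif T === Float64 && precision === :FixedPoint && cublas >= v"13.1"
        return :fp64_emulation       # Ozaki fixed-point scheme
    end
    return :none
end

# The gate uses the cuBLAS version, not the toolkit version:
# CUDA 13.0 Update 2 ships cuBLAS 13.1.0 and CUDA 13.3 ships cuBLAS 13.5.1.
println(emulation_for(Float32, :FAST_MATH, :BFloat16x9, v"13.5.1"))
println(emulation_for(Float64, :FAST_MATH, :FixedPoint, v"13.1.0"))
```

Keeping the decision in one function means the handle math mode and the compute type returned for `gemmEx!` can be derived from the same answer.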

github-actions Bot left a comment


CUDA.jl Benchmarks

| Benchmark suite | Current: cb21ca6 | Previous: 6a129e0 | Ratio |
|---|---|---|---|
| `array/accumulate/Float32/1d` | 99934 ns | 100053 ns | 1.00 |
| `array/accumulate/Float32/dims=1` | 75686 ns | 75787 ns | 1.00 |
| `array/accumulate/Float32/dims=1L` | 1587057 ns | 1577853 ns | 1.01 |
| `array/accumulate/Float32/dims=2` | 141448 ns | 141062 ns | 1.00 |
| `array/accumulate/Float32/dims=2L` | 653854 ns | 653707 ns | 1.00 |
| `array/accumulate/Int64/1d` | 117378 ns | 116750 ns | 1.01 |
| `array/accumulate/Int64/dims=1` | 79900 ns | 78711 ns | 1.02 |
| `array/accumulate/Int64/dims=1L` | 1699383 ns | 1683110 ns | 1.01 |
| `array/accumulate/Int64/dims=2` | 152175 ns | 151060 ns | 1.01 |
| `array/accumulate/Int64/dims=2L` | 960126 ns | 959381 ns | 1.00 |
| `array/broadcast` | 18881 ns | 19982 ns | 0.94 |
| `array/construct` | 1207.4 ns | 1194.4 ns | 1.01 |
| `array/copy` | 16473 ns | 16735 ns | 0.98 |
| `array/copyto!/cpu_to_gpu` | 214155 ns | 212740 ns | 1.01 |
| `array/copyto!/gpu_to_cpu` | 281669 ns | 279146 ns | 1.01 |
| `array/copyto!/gpu_to_gpu` | 10387 ns | 10253 ns | 1.01 |
| `array/iteration/findall/bool` | 132571 ns | 131244 ns | 1.01 |
| `array/iteration/findall/int` | 146336 ns | 145039 ns | 1.01 |
| `array/iteration/findfirst/bool` | 69618 ns | 79765 ns | 0.87 |
| `array/iteration/findfirst/int` | 71515 ns | 81388 ns | 0.88 |
| `array/iteration/findmin/1d` | 67419 ns | 66315 ns | 1.02 |
| `array/iteration/findmin/2d` | 101824 ns | 101872 ns | 1.00 |
| `array/iteration/logical` | 192079 ns | 190195 ns | 1.01 |
| `array/iteration/scalar` | 65603 ns | 64845 ns | 1.01 |
| `array/permutedims/2d` | 49721 ns | 49329 ns | 1.01 |
| `array/permutedims/3d` | 51360 ns | 49995 ns | 1.03 |
| `array/permutedims/4d` | 50755 ns | 49456 ns | 1.03 |
| `array/random/rand/Float32` | 11928 ns | 11887 ns | 1.00 |
| `array/random/rand/Int64` | 23844 ns | 23761 ns | 1.00 |
| `array/random/rand!/Float32` | 8021.666666666667 ns | 8603.666666666666 ns | 0.93 |
| `array/random/rand!/Int64` | 17867 ns | 20714 ns | 0.86 |
| `array/random/randn/Float32` | 36246 ns | 35746 ns | 1.01 |
| `array/random/randn!/Float32` | 24199 ns | 24698 ns | 0.98 |
| `array/reductions/mapreduce/Float32/1d` | 33444 ns | 33262 ns | 1.01 |
| `array/reductions/mapreduce/Float32/dims=1` | 37924 ns | 38065 ns | 1.00 |
| `array/reductions/mapreduce/Float32/dims=1L` | 50252 ns | 50303 ns | 1.00 |
| `array/reductions/mapreduce/Float32/dims=2` | 55904 ns | 55700 ns | 1.00 |
| `array/reductions/mapreduce/Float32/dims=2L` | 67201 ns | 66967 ns | 1.00 |
| `array/reductions/mapreduce/Int64/1d` | 40065 ns | 39371 ns | 1.02 |
| `array/reductions/mapreduce/Int64/dims=1` | 40927 ns | 41064 ns | 1.00 |
| `array/reductions/mapreduce/Int64/dims=1L` | 86328 ns | 86505 ns | 1.00 |
| `array/reductions/mapreduce/Int64/dims=2` | 57984 ns | 57928 ns | 1.00 |
| `array/reductions/mapreduce/Int64/dims=2L` | 83490 ns | 82783 ns | 1.01 |
| `array/reductions/reduce/Float32/1d` | 33479 ns | 33213 ns | 1.01 |
| `array/reductions/reduce/Float32/dims=1` | 38394 ns | 38126 ns | 1.01 |
| `array/reductions/reduce/Float32/dims=1L` | 50618 ns | 50464 ns | 1.00 |
| `array/reductions/reduce/Float32/dims=2` | 55745 ns | 55536 ns | 1.00 |
| `array/reductions/reduce/Float32/dims=2L` | 67728 ns | 67415 ns | 1.00 |
| `array/reductions/reduce/Int64/1d` | 40179 ns | 40070 ns | 1.00 |
| `array/reductions/reduce/Int64/dims=1` | 40669 ns | 40780 ns | 1.00 |
| `array/reductions/reduce/Int64/dims=1L` | 86396 ns | 86708 ns | 1.00 |
| `array/reductions/reduce/Int64/dims=2` | 57868 ns | 57749 ns | 1.00 |
| `array/reductions/reduce/Int64/dims=2L` | 82658 ns | 82436 ns | 1.00 |
| `array/reverse/1d` | 16862 ns | 16956 ns | 0.99 |
| `array/reverse/1dL` | 67699 ns | 67986 ns | 1.00 |
| `array/reverse/1dL_inplace` | 65292 ns | 65223 ns | 1.00 |
| `array/reverse/1d_inplace` | 8280.333333333334 ns | 8237.333333333334 ns | 1.01 |
| `array/reverse/2d` | 20334 ns | 20361 ns | 1.00 |
| `array/reverse/2dL` | 71899 ns | 72160 ns | 1.00 |
| `array/reverse/2dL_inplace` | 65109 ns | 65120 ns | 1.00 |
| `array/reverse/2d_inplace` | 9713 ns | 9782 ns | 0.99 |
| `array/sorting/1d` | 2713130 ns | 2707240 ns | 1.00 |
| `array/sorting/2d` | 1062830 ns | 1063955 ns | 1.00 |
| `array/sorting/by` | 3269686 ns | 3281778 ns | 1.00 |
| `cuda/synchronization/context/auto` | 1133.5 ns | 1132.3 ns | 1.00 |
| `cuda/synchronization/context/blocking` | 952.304347826087 ns | 923.0333333333333 ns | 1.03 |
| `cuda/synchronization/context/nonblocking` | 6086.2 ns | 6078 ns | 1.00 |
| `cuda/synchronization/stream/auto` | 986 ns | 974.4 ns | 1.01 |
| `cuda/synchronization/stream/blocking` | 830.2051282051282 ns | 783.4722222222222 ns | 1.06 |
| `cuda/synchronization/stream/nonblocking` | 5974.4 ns | 6004.333333333333 ns | 1.00 |
| `integration/byval/reference` | 143284 ns | 143190 ns | 1.00 |
| `integration/byval/slices=1` | 145461 ns | 145031 ns | 1.00 |
| `integration/byval/slices=2` | 283970 ns | 283718 ns | 1.00 |
| `integration/byval/slices=3` | 422516 ns | 422133 ns | 1.00 |
| `integration/cudadevrt` | 101724 ns | 101714 ns | 1.00 |
| `integration/volumerhs` | 8882117 ns | 9906116 ns | 0.90 |
| `kernel/indexing` | 13006 ns | 12607 ns | 1.03 |
| `kernel/indexing_checked` | 13648 ns | 13261 ns | 1.03 |
| `kernel/launch` | 2065.5555555555557 ns | 2122.5555555555557 ns | 0.97 |
| `kernel/occupancy` | 728.5496183206106 ns | 718.696 ns | 1.01 |
| `kernel/rand` | 13903 ns | 14142 ns | 0.98 |
| `latency/import` | 3863319483 ns | 3854591463 ns | 1.00 |
| `latency/precompile` | 4627375059 ns | 4627171718 ns | 1.00 |
| `latency/ttfp` | 4503227168 ns | 4491759222 ns | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

maleadt enabled auto-merge May 28, 2026 11:19
maleadt disabled auto-merge May 28, 2026 11:19
maleadt merged commit e13541e into main May 28, 2026
1 of 2 checks passed
maleadt deleted the tb/ctk_13.3 branch May 28, 2026 11:20
codecov Bot commented May 28, 2026

Codecov Report

❌ Patch coverage is 11.11111% with 56 lines in your changes missing coverage. Please review.
✅ Project coverage is 16.32%. Comparing base (2fe75d6) to head (cb21ca6).
⚠️ Report is 19 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| lib/cusparse/src/extra.jl | 0.00% | 29 Missing ⚠️ |
| lib/cublas/src/cuBLAS.jl | 41.66% | 7 Missing ⚠️ |
| lib/cublas/src/wrappers.jl | 22.22% | 7 Missing ⚠️ |
| lib/cusparse/src/helpers.jl | 0.00% | 7 Missing ⚠️ |
| lib/cusparse/src/generic.jl | 0.00% | 6 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3155      +/-   ##
==========================================
- Coverage   16.40%   16.32%   -0.08%     
==========================================
  Files         124      124              
  Lines        9827     9875      +48     
==========================================
  Hits         1612     1612              
- Misses       8215     8263      +48     

