Support for CUDA toolkit 13.3 #3155
Merged
CUDA 13.0 removed CUFFT_INCOMPLETE_PARAMETER_LIST, CUFFT_PARSE_ERROR and CUFFT_LICENSE_ERROR from cufftResult. Since the bindings are regenerated against 13.3, those names no longer exist, so description() threw an UndefVarError for any error code that fell through to them. Drop the dead branches and add descriptions for the new codes (CUFFT_MISSING_DEPENDENCY, CUFFT_NVRTC_FAILURE, CUFFT_NVJITLINK_FAILURE, CUFFT_NVSHMEM_FAILURE).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
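The failure mode described above can be reproduced without a GPU. The following is a minimal sketch, assuming a made-up enum `DemoResult` and made-up member names and messages; it is not the real cufftResult definition or the actual description() code. It only illustrates why a branch naming a removed constant raises an UndefVarError when an error code falls through to it, while earlier branches keep working.

```julia
# Hypothetical stand-in for a regenerated result enum. The member names and
# values are illustrative, not the real cufftResult definition.
@enum DemoResult begin
    DEMO_SUCCESS = 0
    DEMO_INVALID_PLAN = 1
    DEMO_NVRTC_FAILURE = 2
end

# Stale lookup: one branch still names DEMO_PARSE_ERROR, which is not defined.
function stale_description(err::DemoResult)
    if err == DEMO_SUCCESS
        "the operation completed successfully"
    elseif err == DEMO_PARSE_ERROR      # undefined global, evaluated only on fall-through
        "a parse error occurred"
    else
        "no description for this return value"
    end
end

# Fixed lookup: only names that exist in the enum are referenced.
function description(err::DemoResult)
    if err == DEMO_SUCCESS
        "the operation completed successfully"
    elseif err == DEMO_INVALID_PLAN
        "an invalid plan handle was passed"
    elseif err == DEMO_NVRTC_FAILURE
        "a runtime compilation step failed"
    else
        "no description for this return value"
    end
end

# Returns true when calling f(err) raises an UndefVarError.
function throws_undef(f, err)
    try
        f(err)
        return false
    catch e
        return e isa UndefVarError
    end
end
```

The stale function only fails for codes that reach the dead comparison, which is why such a bug can go unnoticed until an uncommon error is reported.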
cuEventElapsedTime_v2 (CUDA 12.8+) supersedes the now-deprecated v1 entry point with improved accuracy and argument validation. Branch on driver_version() so we call it on new enough drivers and keep the v1 fallback otherwise. Covered by the existing "events" testset.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
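The version branch can be sketched in plain Julia. This is an illustration under stated assumptions: `elapsed_v1`, `elapsed_v2` and `elapsed` are hypothetical stand-ins that take plain numbers, and the driver version is passed in explicitly rather than queried, so the snippet runs without a GPU. It is not the code from this PR.

```julia
# Hypothetical stand-ins for the two driver entry points. Each returns a tag
# naming the path taken plus the elapsed value as Float32.
elapsed_v1(start, stop) = (:v1, Float32(stop - start))
elapsed_v2(start, stop) = (:v2, Float32(stop - start))

# Pick the entry point from a driver version supplied by the caller.
# The 12.8 threshold comes from the commit message above.
function elapsed(start, stop; driver::VersionNumber)
    if driver >= v"12.8"
        elapsed_v2(start, stop)
    else
        elapsed_v1(start, stop)   # fallback for older drivers
    end
end
```

Comparing `VersionNumber` values keeps the gate readable and avoids manual arithmetic on encoded integer versions.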
cuSPARSE 12.8.1 (CUDA 13.3) added the generic SpGEAM API for C = αA + βB,
replacing the type-specific csrgeam2 routines. Prefer it when available and
keep csrgeam2 as the fallback for older versions.
Also fix the generated SpGEAM bindings: the device workspace was typed as a
host Ptr{Cvoid} (it must be CuPtr{Cvoid}), and the alpha/beta scalars are now
PtrOrCuPtr{Cvoid} to match the other generic APIs. Fixed in res/wrap too.
Covered by the existing geam tests in interfaces/mul.jl.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
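For reference, the operation that both SpGEAM and csrgeam2 compute, C = αA + βB, can be written on the CPU with the SparseArrays standard library. This sketch uses made-up input values and does not call cuSPARSE; it only pins down the expected result that a GPU implementation should match.

```julia
using SparseArrays

# Two small sparse matrices with different sparsity patterns.
A = sparse([1, 2, 3], [1, 2, 3], [1.0, 2.0, 3.0], 3, 3)      # diagonal
B = sparse([1, 1, 3], [1, 3, 2], [10.0, 20.0, 30.0], 3, 3)

# Scalars for the general matrix addition.
α, β = 2.0, -1.0

# CPU reference for C = αA + βB. The result stays sparse and its pattern is
# the union of the patterns of A and B.
C = α * A + β * B
```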
cuSPARSE 12.8.1 (CUDA 13.3) added native CSC support to the triangular solve APIs. Use it instead of modelling a CSC matrix as its transposed CSR on new enough versions; the workaround couldn't represent transa = 'C', so the adjoint of a complex CSC matrix now works too. Relax the corresponding test skips accordingly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
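A small dense example shows why the transposed-CSR workaround has no answer for the adjoint case. Reading the CSC buffers of A as CSR yields B = transpose(A). The operations available on B are then B, transpose(B) and adjoint(B), but adjoint(A) equals conj(B), which for a complex matrix is none of those three. The matrix values below are made up for illustration.

```julia
# A complex lower-triangular matrix (illustrative values).
A = ComplexF64[1+2im 0; 3-1im 4+1im]

# What the CSR reinterpretation of A's CSC buffers represents.
B = Matrix(transpose(A))

# The plain and transposed cases map onto operations on B.
plain_ok     = Matrix(transpose(B)) == A              # op(A) = A
transpose_ok = B == Matrix(transpose(A))              # op(A) = transpose(A)

# The adjoint case needs conj(B), which no operation on B provides.
target      = Matrix(adjoint(A))
needs_conj  = target == conj(B)
candidates  = (B, Matrix(transpose(B)), Matrix(adjoint(B)))
none_match  = all(c -> c != target, candidates)
```

For real matrices conj(B) equals B, so the gap only shows up with complex element types.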
CUDA added tensor-core emulation of higher precisions: BF16x9 reproduces full FP32 accuracy (cuBLAS 12.9+) and the Ozaki fixed-point scheme emulates FP64 (cuBLAS 13.1+, i.e. CUDA 13.0 Update 2).
Expose them through the existing `math_mode!`/`math_precision` mechanism: under FAST_MATH, a `:BFloat16x9` precision selects FP32 emulation and `:FixedPoint` selects FP64 emulation. The math mode is applied to the cuBLAS handle (covering plain GEMMs) and the matching compute types are returned from gemmExComputeType (covering gemmEx!); the handle now also re-applies when the precision alone changes.
Version gates use the cuBLAS library version, which does not track the toolkit version (CUDA 13.0u2 ships cuBLAS 13.1.0, CUDA 13.3 ships 13.5.1).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
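The version gating can be sketched as a pure function of the cuBLAS library version. `emulation_available` is a hypothetical helper invented for this illustration and is not part of CUDA.jl; the thresholds and the shipped versions are the ones quoted in the commit message above.

```julia
# Hypothetical helper: reports whether an emulated precision could be offered
# by a given cuBLAS library version. Thresholds follow the commit message:
# BF16x9 needs cuBLAS 12.9+, the fixed-point FP64 scheme needs cuBLAS 13.1+.
function emulation_available(precision::Symbol, cublas::VersionNumber)
    if precision === :BFloat16x9
        cublas >= v"12.9"
    elseif precision === :FixedPoint
        cublas >= v"13.1"
    else
        false   # unknown precision symbols are never treated as emulated
    end
end

# cuBLAS versions quoted in the commit message for two toolkit releases.
cublas_in_cuda_13_0u2 = v"13.1.0"
cublas_in_cuda_13_3   = v"13.5.1"
```

Gating on the library version rather than the toolkit version matters here because a toolkit-level check against 13.0 could not tell the original 13.0 release apart from Update 2.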
CUDA.jl Benchmarks
| Benchmark suite | Current: cb21ca6 | Previous: 6a129e0 | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 99934 ns | 100053 ns | 1.00 |
| array/accumulate/Float32/dims=1 | 75686 ns | 75787 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1587057 ns | 1577853 ns | 1.01 |
| array/accumulate/Float32/dims=2 | 141448 ns | 141062 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 653854 ns | 653707 ns | 1.00 |
| array/accumulate/Int64/1d | 117378 ns | 116750 ns | 1.01 |
| array/accumulate/Int64/dims=1 | 79900 ns | 78711 ns | 1.02 |
| array/accumulate/Int64/dims=1L | 1699383 ns | 1683110 ns | 1.01 |
| array/accumulate/Int64/dims=2 | 152175 ns | 151060 ns | 1.01 |
| array/accumulate/Int64/dims=2L | 960126 ns | 959381 ns | 1.00 |
| array/broadcast | 18881 ns | 19982 ns | 0.94 |
| array/construct | 1207.4 ns | 1194.4 ns | 1.01 |
| array/copy | 16473 ns | 16735 ns | 0.98 |
| array/copyto!/cpu_to_gpu | 214155 ns | 212740 ns | 1.01 |
| array/copyto!/gpu_to_cpu | 281669 ns | 279146 ns | 1.01 |
| array/copyto!/gpu_to_gpu | 10387 ns | 10253 ns | 1.01 |
| array/iteration/findall/bool | 132571 ns | 131244 ns | 1.01 |
| array/iteration/findall/int | 146336 ns | 145039 ns | 1.01 |
| array/iteration/findfirst/bool | 69618 ns | 79765 ns | 0.87 |
| array/iteration/findfirst/int | 71515 ns | 81388 ns | 0.88 |
| array/iteration/findmin/1d | 67419 ns | 66315 ns | 1.02 |
| array/iteration/findmin/2d | 101824 ns | 101872 ns | 1.00 |
| array/iteration/logical | 192079 ns | 190195 ns | 1.01 |
| array/iteration/scalar | 65603 ns | 64845 ns | 1.01 |
| array/permutedims/2d | 49721 ns | 49329 ns | 1.01 |
| array/permutedims/3d | 51360 ns | 49995 ns | 1.03 |
| array/permutedims/4d | 50755 ns | 49456 ns | 1.03 |
| array/random/rand/Float32 | 11928 ns | 11887 ns | 1.00 |
| array/random/rand/Int64 | 23844 ns | 23761 ns | 1.00 |
| array/random/rand!/Float32 | 8021.666666666667 ns | 8603.666666666666 ns | 0.93 |
| array/random/rand!/Int64 | 17867 ns | 20714 ns | 0.86 |
| array/random/randn/Float32 | 36246 ns | 35746 ns | 1.01 |
| array/random/randn!/Float32 | 24199 ns | 24698 ns | 0.98 |
| array/reductions/mapreduce/Float32/1d | 33444 ns | 33262 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=1 | 37924 ns | 38065 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1L | 50252 ns | 50303 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2 | 55904 ns | 55700 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 67201 ns | 66967 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 40065 ns | 39371 ns | 1.02 |
| array/reductions/mapreduce/Int64/dims=1 | 40927 ns | 41064 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1L | 86328 ns | 86505 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 57984 ns | 57928 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 83490 ns | 82783 ns | 1.01 |
| array/reductions/reduce/Float32/1d | 33479 ns | 33213 ns | 1.01 |
| array/reductions/reduce/Float32/dims=1 | 38394 ns | 38126 ns | 1.01 |
| array/reductions/reduce/Float32/dims=1L | 50618 ns | 50464 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 55745 ns | 55536 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 67728 ns | 67415 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 40179 ns | 40070 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1 | 40669 ns | 40780 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1L | 86396 ns | 86708 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 57868 ns | 57749 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 82658 ns | 82436 ns | 1.00 |
| array/reverse/1d | 16862 ns | 16956 ns | 0.99 |
| array/reverse/1dL | 67699 ns | 67986 ns | 1.00 |
| array/reverse/1dL_inplace | 65292 ns | 65223 ns | 1.00 |
| array/reverse/1d_inplace | 8280.333333333334 ns | 8237.333333333334 ns | 1.01 |
| array/reverse/2d | 20334 ns | 20361 ns | 1.00 |
| array/reverse/2dL | 71899 ns | 72160 ns | 1.00 |
| array/reverse/2dL_inplace | 65109 ns | 65120 ns | 1.00 |
| array/reverse/2d_inplace | 9713 ns | 9782 ns | 0.99 |
| array/sorting/1d | 2713130 ns | 2707240 ns | 1.00 |
| array/sorting/2d | 1062830 ns | 1063955 ns | 1.00 |
| array/sorting/by | 3269686 ns | 3281778 ns | 1.00 |
| cuda/synchronization/context/auto | 1133.5 ns | 1132.3 ns | 1.00 |
| cuda/synchronization/context/blocking | 952.304347826087 ns | 923.0333333333333 ns | 1.03 |
| cuda/synchronization/context/nonblocking | 6086.2 ns | 6078 ns | 1.00 |
| cuda/synchronization/stream/auto | 986 ns | 974.4 ns | 1.01 |
| cuda/synchronization/stream/blocking | 830.2051282051282 ns | 783.4722222222222 ns | 1.06 |
| cuda/synchronization/stream/nonblocking | 5974.4 ns | 6004.333333333333 ns | 1.00 |
| integration/byval/reference | 143284 ns | 143190 ns | 1.00 |
| integration/byval/slices=1 | 145461 ns | 145031 ns | 1.00 |
| integration/byval/slices=2 | 283970 ns | 283718 ns | 1.00 |
| integration/byval/slices=3 | 422516 ns | 422133 ns | 1.00 |
| integration/cudadevrt | 101724 ns | 101714 ns | 1.00 |
| integration/volumerhs | 8882117 ns | 9906116 ns | 0.90 |
| kernel/indexing | 13006 ns | 12607 ns | 1.03 |
| kernel/indexing_checked | 13648 ns | 13261 ns | 1.03 |
| kernel/launch | 2065.5555555555557 ns | 2122.5555555555557 ns | 0.97 |
| kernel/occupancy | 728.5496183206106 ns | 718.696 ns | 1.01 |
| kernel/rand | 13903 ns | 14142 ns | 0.98 |
| latency/import | 3863319483 ns | 3854591463 ns | 1.00 |
| latency/precompile | 4627375059 ns | 4627171718 ns | 1.00 |
| latency/ttfp | 4503227168 ns | 4491759222 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #3155 +/- ##
==========================================
- Coverage 16.40% 16.32% -0.08%
==========================================
Files 124 124
Lines 9827 9875 +48
==========================================
Hits 1612 1612
- Misses 8215 8263 +48
View full report in Codecov by Sentry.
No description provided.