cmake: portable CUDA arch default + require CMake 3.25 for device LTO#875
cmake: portable CUDA arch default + require CMake 3.25 for device LTO#875IvanaGyro wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request updates the CMake configuration to dynamically set CUDA target architectures from environment variables (falling back to a portable fat binary default) and adds a workaround to disable CUDA device LTO while preserving host C++ LTO. The review feedback highlights two critical issues: first, environment variables for CUDA architectures are space-separated and must be converted to semicolon-separated lists to prevent CMake build failures; second, the LTO workaround unconditionally applies GCC-specific flags on all non-Apple platforms, which will break builds on Windows (MSVC) and Clang-based systems. Code suggestions are provided to resolve both issues.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #875 +/- ##
=======================================
Coverage 29.49% 29.49%
=======================================
Files 241 241
Lines 35512 35512
Branches 14777 14777
=======================================
Hits 10475 10475
Misses 17784 17784
Partials 7253 7253
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
d495108 to
b402f36
Compare
…fault
The previous unconditional `set(CMAKE_CUDA_ARCHITECTURES native)` had
three problems for redistributable PyPI wheels built on GPU-less CI runners:
1. `native` queries the local GPU at configure time, so the build fails
outright on a machine with no NVIDIA device.
2. Even when it succeeds, `native` bakes in only the build host's
architecture — the resulting wheel does not run on any other GPU
generation.
3. Because it was a plain (non-cache) `set()`, it overrode any value
supplied via -D, a CMakePresets.json cacheVariables entry, the CUDAARCHS
environment variable, or the cache, so there was no way to override it
without editing the file.
Replace it with a guarded default that runs before enable_language(CUDA):
if(NOT CMAKE_CUDA_ARCHITECTURES AND NOT DEFINED ENV{CUDAARCHS})
set(CMAKE_CUDA_ARCHITECTURES
70-real 75-real 80-real 86-real 89-real 90-real 90-virtual)
endif()
enable_language(CUDA)
The guard must run before enable_language(CUDA): afterwards
CMAKE_CUDA_ARCHITECTURES is never empty (CMake fills in its own default),
so the "not specified" case can no longer be detected. enable_language(CUDA)
already reads the CUDAARCHS environment variable on its own, so the guard
only has to avoid shadowing it — there is no need to copy CUDAARCHS into the
variable, and no need to honor a CMAKE_CUDA_ARCHITECTURES environment
variable (CMake defines no such variable; only CUDAARCHS is standard).
A -D flag, the cache, or a preset's cacheVariables populate the normal/cache
variable, so the NOT CMAKE_CUDA_ARCHITECTURES guard lets them win.
The default targets Volta through Hopper (sm_70 is the minimum required by
cuTENSOR/cuQuantum), with 90-virtual PTX so the driver can JIT-compile for
GPUs newer than Hopper without a rebuild.
Also drop the now-false "native" justification on the cmake_minimum_required
line: with the native keyword gone, the comment is updated to note that
cmake_language(EVAL CODE ...) is the feature setting the 3.18 lower bound,
with 3.24 kept as the tested minimum.
Closes #870.
Co-authored-by: Claude <noreply@anthropic.com>
CMake 3.25 is the first release where CMAKE_INTERPROCEDURAL_OPTIMIZATION (and the INTERPROCEDURAL_OPTIMIZATION target property) activate CUDA device link-time optimization (nvcc -dlto) in addition to host C++ LTO. On earlier CMake versions the same setting silently produces no device LTO for CUDA targets, so the optimisation the build asks for via CMAKE_INTERPROCEDURAL_OPTIMIZATION=TRUE would be quietly dropped. Raise the floor from 3.24 to 3.25 so the requested device LTO is actually emitted rather than ignored. The previous 3.24 floor existed only for the removed "native" CUDA architecture keyword; the remaining version-sensitive feature, cmake_language(EVAL CODE ...), needs 3.18, which 3.25 also covers. Co-authored-by: Claude <noreply@anthropic.com>
b402f36 to
57b7618
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 57b7618b84
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # the floor required by cuTENSOR/cuQuantum, up through Hopper sm_90) plus PTX | ||
| # of the newest (90-virtual) so the driver can JIT for newer/unknown GPUs. | ||
| if(NOT CMAKE_CUDA_ARCHITECTURES AND NOT DEFINED ENV{CUDAARCHS}) | ||
| set(CMAKE_CUDA_ARCHITECTURES 70-real 75-real 80-real 86-real 89-real 90-real 90-virtual) |
There was a problem hiding this comment.
Gate CUDA architectures by toolkit version
When USE_CUDA=ON is configured with the documented CUDA v10+ dependency (docs/source/adv_install.rst:32) but without an explicit CMAKE_CUDA_ARCHITECTURES, this default now passes compute_89/compute_90 to nvcc. CUDA toolkits before 11.8 do not know those architectures, so CMake's CUDA compiler check or the first CUDA compile fails with nvcc fatal: Unsupported gpu architecture, even though those CUDA versions were previously supported via native. Please derive the default from CMAKE_CUDA_COMPILER_VERSION or keep the default to architectures supported by the minimum CUDA version.
Useful? React with 👍 / 👎.
Review: PR #875 — "cmake: portable CUDA arch default + require CMake 3.25 for device LTO"OverviewTwo focused CMake changes: replace Change 1 — Portable CUDA architecture defaultThe logic is correct:
The existing Gemini/codex comments on One open concern (codex P1): Architecture list vs. documented CUDA minimum The default list includes The simplest fix is to update the documentation minimum to match the arch list. In practice, the full GPU feature set (cuTENSOR 24.x, cuQuantum) already requires CUDA 12.x, so "CUDA 10+" is stale regardless. Either:
Change 2 —
|
| Item | Status |
|---|---|
| Guard logic and ordering | ✅ Correct |
CUDAARCHS env var handling |
✅ Correct (enable_language handles natively) |
90-virtual PTX for future GPUs |
✅ Correct |
| Gemini / outdated codex comments | ✅ Addressed — code they reviewed was dropped |
cmake_minimum_required 3.24 → 3.25 |
✅ Correct and justified |
| Docs still say "CUDA v10+" but default arch list requires CUDA 11.8+ | adv_install.rst before merging |
Posted by Claude Code on behalf of @pcchen
|
@IvanaGyro We will update the document later. Please decide if we want to follow this comment: "Trim the default arch list to 70-real 75-real 80-real 86-real 80-virtual (safe with CUDA ≥ 11.1) and note that users on Hopper/Ada should override." |
Summary
Two related CMake changes to harden CUDA configuration for GPU-less build environments and enable CUDA device LTO.
1. Replace
nativeCUDA architecture default with a portable fat-binary listProblem: The previous
set(CMAKE_CUDA_ARCHITECTURES native)requires a CUDA-capable GPU to be present at configure time. Configure fails on CI runners, cloud VMs, and developer machines without a GPU — even when cross-compiling for a GPU target.Change: A guarded default is set before
enable_language(CUDA):NOT … AND NOT DEFINED ENV{CUDAARCHS}) ensures any user-supplied value — either the CMake cache variable or theCUDAARCHSenvironment variable — takes precedence. CMake'senable_language(CUDA)readsCUDAARCHSnatively, so the old manual copy of that env var intoCMAKE_CUDA_ARCHITECTURESwas redundant and is removed.enable_language(CUDA): after that call CMake fills in its own default, making the "not yet set" check unreliable.90-virtual) is included so the binary JIT-compiles on future architectures.2. Raise
cmake_minimum_requiredto 3.25Problem: CUDA device LTO (
-dlto, activated byCMAKE_INTERPROCEDURAL_OPTIMIZATION) requires CMake ≥ 3.25. Declaring a lower minimum silently degraded to host-only LTO even whenINTERPROCEDURAL_OPTIMIZATIONwas set.Change:
cmake_minimum_required(VERSION 3.25)with an inline comment documenting the reason.Testing
Verified on a machine with CUDA toolkit installed but
CMAKE_CUDA_ARCHITECTURESunset:build.ninjashows-gencode arch=compute_70,code=sm_70 … arch=compute_90,code=sm_90(fat binary).CUDAARCHS=80 cmake --preset openblas-cudaoverrides to sm_80 only, confirming the guard is respected.