From 913971c5fd4391ea1755df59a4666454da323f44 Mon Sep 17 00:00:00 2001
From: Ivana <ivana.gyro@gmail.com>
Date: Wed, 3 Jun 2026 05:34:18 +0000
Subject: [PATCH 1/2] cmake: replace hardcoded CUDA native arch with portable
 fat-binary default
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The previous unconditional `set(CMAKE_CUDA_ARCHITECTURES native)` had
three problems for redistributable PyPI wheels built on GPU-less CI runners:

1. `native` queries the local GPU at configure time, so the build fails
   outright on a machine with no NVIDIA device.
2. Even when it succeeds, `native` compiles only for the build host's
   architecture, so the resulting wheel cannot run on GPUs of any other
   generation.
3. Because it was a plain (non-cache) `set()`, it shadowed any value
   supplied via -D, a CMakePresets.json cacheVariables entry, an existing
   cache entry, or the CUDAARCHS environment variable, so the only way to
   change it was to edit the file.

Replace it with a guarded default that runs before enable_language(CUDA):

  if(NOT CMAKE_CUDA_ARCHITECTURES AND NOT DEFINED ENV{CUDAARCHS})
    set(CMAKE_CUDA_ARCHITECTURES
        70-real 75-real 80-real 86-real 89-real 90-real 90-virtual)
  endif()
  enable_language(CUDA)

The guard must run before enable_language(CUDA): afterwards
CMAKE_CUDA_ARCHITECTURES is never empty (CMake fills in its own default),
so the "not specified" case can no longer be detected. enable_language(CUDA)
already reads the CUDAARCHS environment variable on its own, so the guard
only has to avoid shadowing it — there is no need to copy CUDAARCHS into the
variable, and no need to honor a CMAKE_CUDA_ARCHITECTURES environment
variable (CMake defines no such variable; only CUDAARCHS is standard).
A -D flag, a preset's cacheVariables, or an existing cache entry all set the
CMAKE_CUDA_ARCHITECTURES cache entry, which the NOT CMAKE_CUDA_ARCHITECTURES
test sees, so the caller's value wins.
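
As a rough illustration of how the guard resolves in each case, the
following script-mode sketch mirrors the condition without enabling CUDA.
The file name guard_demo.cmake and the helper demo_resolve_archs are
hypothetical and not part of this patch; plain set() and set(ENV{...})
calls stand in for a -D flag and an exported CUDAARCHS.

```cmake
# guard_demo.cmake (hypothetical): run with `cmake -P guard_demo.cmake`.
# Mirrors the guard from this patch; it only prints which source would win.
function(demo_resolve_archs label)
  if(NOT CMAKE_CUDA_ARCHITECTURES AND NOT DEFINED ENV{CUDAARCHS})
    # Nothing specified: the project default applies.
    set(CMAKE_CUDA_ARCHITECTURES
        70-real 75-real 80-real 86-real 89-real 90-real 90-virtual)
    message(STATUS "${label}: project default: ${CMAKE_CUDA_ARCHITECTURES}")
  elseif(CMAKE_CUDA_ARCHITECTURES)
    # The caller already set the variable (-D, preset, or cache).
    message(STATUS "${label}: caller's value kept: ${CMAKE_CUDA_ARCHITECTURES}")
  else()
    # Variable unset but CUDAARCHS exported: leave it for enable_language(CUDA).
    message(STATUS "${label}: left unset, CUDAARCHS=$ENV{CUDAARCHS}")
  endif()
endfunction()

# Case 1: neither the variable nor the environment variable is set.
unset(CMAKE_CUDA_ARCHITECTURES)
unset(ENV{CUDAARCHS})
demo_resolve_archs("nothing set")

# Case 2: stand-in for -DCMAKE_CUDA_ARCHITECTURES=86 on the command line.
set(CMAKE_CUDA_ARCHITECTURES 86)
demo_resolve_archs("variable set")

# Case 3: stand-in for `CUDAARCHS="80;90" cmake ...` in the shell.
unset(CMAKE_CUDA_ARCHITECTURES)
set(ENV{CUDAARCHS} "80;90")
demo_resolve_archs("CUDAARCHS set")
```

In the real CMakeLists.txt the third case needs no extra code, because
enable_language(CUDA) initializes CMAKE_CUDA_ARCHITECTURES from CUDAARCHS
itself.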

The default targets Volta through Hopper (sm_70 is the minimum required by
cuTENSOR/cuQuantum), with 90-virtual PTX so the driver can JIT-compile for
GPUs newer than Hopper without a rebuild.
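
To make the -real and -virtual suffixes concrete, the script-mode sketch
below prints, for each entry of the default list, what gets embedded and a
roughly equivalent nvcc -gencode pair. The file name arch_demo.cmake is
hypothetical, and the printed flags are an approximation for illustration
rather than the exact command line CMake generates.

```cmake
# arch_demo.cmake (hypothetical): run with `cmake -P arch_demo.cmake`.
set(archs 70-real 75-real 80-real 86-real 89-real 90-real 90-virtual)

foreach(entry IN LISTS archs)
  if(entry MATCHES "^([0-9]+)-real$")
    # -real: machine code (SASS) for exactly this architecture, no PTX.
    set(cc "${CMAKE_MATCH_1}")
    message(STATUS "${entry}: SASS only, arch=compute_${cc},code=sm_${cc}")
  elseif(entry MATCHES "^([0-9]+)-virtual$")
    # -virtual: PTX only, which the driver JIT-compiles on the target GPU.
    set(cc "${CMAKE_MATCH_1}")
    message(STATUS "${entry}: PTX only, arch=compute_${cc},code=compute_${cc}")
  elseif(entry MATCHES "^([0-9]+)$")
    # No suffix: both SASS and PTX for this architecture.
    set(cc "${CMAKE_MATCH_1}")
    message(STATUS "${entry}: SASS and PTX, arch=compute_${cc},code=[compute_${cc},sm_${cc}]")
  endif()
endforeach()
```

Listing 90-real and 90-virtual separately has the same effect as a bare 90
entry; spelling both out keeps the intent (Hopper SASS plus forward-looking
PTX) visible in the list.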

Also update the comment on the cmake_minimum_required line, whose "native"
justification no longer applies: it now notes that
cmake_language(EVAL CODE ...) sets the 3.18 lower bound, with 3.24 kept as
the tested minimum.

Closes #870.

Co-authored-by: Claude <noreply@anthropic.com>
---
 CMakeLists.txt | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/CMakeLists.txt b/CMakeLists.txt
index a5a86523f..f5096314e 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -7,7 +7,7 @@ message(STATUS "")
 # #####################################################################
 # ## CMAKE and CXX VERSION
 # #####################################################################
-cmake_minimum_required(VERSION 3.24) # require for the "native" value of CUDA_ARCHITECTURES
+cmake_minimum_required(VERSION 3.24) # 3.18+ required for cmake_language(EVAL CODE ...); 3.24 is a tested minimum
 
 set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_CURRENT_SOURCE_DIR}/cmake/Modules")
 # Inria's morse_cmake provides an up-to-date FindLAPACKE (and helpers) that
@@ -185,7 +185,18 @@ project(CYTNX VERSION ${CYTNX_VERSION} LANGUAGES CXX C)
 
 set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
 if(USE_CUDA)
-  set(CMAKE_CUDA_ARCHITECTURES native)
+  # Default to a portable fat binary unless the caller picked architectures via
+  # -D, the cache, a preset's cacheVariables, or the CUDAARCHS environment
+  # variable. This must run before enable_language(CUDA): afterwards
+  # CMAKE_CUDA_ARCHITECTURES is never empty (CMake fills in its own default), so
+  # the "not specified" case can no longer be detected. enable_language(CUDA)
+  # reads CUDAARCHS on its own, so we only avoid shadowing it here, not copy it.
+  # The default embeds SASS for each supported real architecture (Volta sm_70 is
+  # the floor required by cuTENSOR/cuQuantum, up through Hopper sm_90) plus PTX
+  # of the newest (90-virtual) so the driver can JIT for newer/unknown GPUs.
+  if(NOT CMAKE_CUDA_ARCHITECTURES AND NOT DEFINED ENV{CUDAARCHS})
+    set(CMAKE_CUDA_ARCHITECTURES 70-real 75-real 80-real 86-real 89-real 90-real 90-virtual)
+  endif()
   enable_language(CUDA)
   # Disable generation of "--option-file" flag in compile_commands.json.
   # This workaround helps VSCode's cpptools extension correctly locate CUDA

From 57b7618b844a7f192d8da756116b61e196bdbbd8 Mon Sep 17 00:00:00 2001
From: Ivana <ivana.gyro@gmail.com>
Date: Wed, 3 Jun 2026 09:23:56 +0000
Subject: [PATCH 2/2] cmake: require CMake 3.25 for CUDA device LTO support

CMake 3.25 is the first release in which CMAKE_INTERPROCEDURAL_OPTIMIZATION
(and the INTERPROCEDURAL_OPTIMIZATION target property) activates CUDA device
link-time optimization (nvcc -dlto) in addition to host C++ LTO. On earlier
CMake versions the same setting silently produces no device LTO for CUDA
targets, so the optimization the build requests via
CMAKE_INTERPROCEDURAL_OPTIMIZATION=TRUE would be quietly dropped.

Raise the floor from 3.24 to 3.25 so the requested device LTO is actually
emitted rather than ignored. The previous 3.24 floor existed only for the
removed "native" CUDA architecture keyword; the remaining version-sensitive
feature, cmake_language(EVAL CODE ...), needs 3.18, which 3.25 also covers.

Co-authored-by: Claude <noreply@anthropic.com>
---
 CMakeLists.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/CMakeLists.txt b/CMakeLists.txt
index f5096314e..0d99ea934 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -7,7 +7,7 @@ message(STATUS "")
 # #####################################################################
 # ## CMAKE and CXX VERSION
 # #####################################################################
-cmake_minimum_required(VERSION 3.24) # 3.18+ required for cmake_language(EVAL CODE ...); 3.24 is a tested minimum
+cmake_minimum_required(VERSION 3.25) # 3.25 added CUDA device LTO via INTERPROCEDURAL_OPTIMIZATION
 
 set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_CURRENT_SOURCE_DIR}/cmake/Modules")
 # Inria's morse_cmake provides an up-to-date FindLAPACKE (and helpers) that