[CIR] Construct the offload merge driver pipeline for CUDA (single arch) by RiverDave · Pull Request #7 · RiverDave/llvm-project

RiverDave · 2026-06-18T07:39:09Z

We're back with the PRs 😄

This is where the actions from #6 actually get placed into the driver's action graph - the construction that #6 deferred. With --clangir-offload-merge on a CUDA compile, each host/device TU is lowered to serialized CIR, combined into a single cir.offload.container, split back out, and every module then resumes the backend from -x cir (the .cir input path landed in #5).

The driver is a bit dense, so the relevant bits here are:

- BuildOffloadingActions builds the graph: host + per-arch device compiles become inputs to one CIRMergeJobAction, which feeds a single CIRSplitJobAction, wrapped in an OffloadAction that exposes the split as one host + N device dependences.
- ConstructPhaseAction stops the CUDA host/device compile at CIR (instead of going straight to LLVM) when the flag is set, so there's something for the merge to consume.
- BuildActions re-applies the backend phase to each split-out module separately: the host module and each device module have different toolchains/output types, so each needs its own backend action feeding off the shared split node.

I added an isCIROffloadMerge() helper since the same gating condition is checked across all three sites.
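The shape of the graph described above can be sketched with a toy model. This is only an illustration under stated assumptions: `Node`, `Graph`, `MergePipeline`, and `buildMergePipeline` are hypothetical names made up for this sketch and are not clang's `Action` classes or the code in this PR.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a driver action. This is NOT clang's Action class.
struct Node {
  std::string Kind;                 // e.g. "compile-to-cir", "merge", "split"
  std::string Target;               // "host" or a device arch such as "sm_80"
  std::vector<const Node *> Inputs; // actions this one consumes
};

// Owns the nodes so raw pointers between them stay valid.
class Graph {
  std::vector<std::unique_ptr<Node>> Nodes;

public:
  const Node *add(std::string Kind, std::string Target,
                  std::vector<const Node *> Inputs = {}) {
    Nodes.push_back(std::make_unique<Node>(
        Node{std::move(Kind), std::move(Target), std::move(Inputs)}));
    return Nodes.back().get();
  }
};

struct MergePipeline {
  const Node *Merge = nullptr;
  const Node *Split = nullptr;
  std::vector<const Node *> Backends; // index 0 is the host backend
};

// Build (host compile + one compile per device arch) -> merge -> split,
// then one backend node per split-out module, all fed by the same split.
MergePipeline buildMergePipeline(Graph &G,
                                 const std::vector<std::string> &DeviceArchs) {
  std::vector<std::string> Targets{"host"};
  Targets.insert(Targets.end(), DeviceArchs.begin(), DeviceArchs.end());

  std::vector<const Node *> Compiles;
  for (const std::string &T : Targets)
    Compiles.push_back(G.add("compile-to-cir", T));

  MergePipeline P;
  P.Merge = G.add("merge", "all", Compiles);
  P.Split = G.add("split", "all", {P.Merge});
  for (const std::string &T : Targets)
    P.Backends.push_back(G.add("backend", T, {P.Split}));
  return P;
}
```

For the single-arch case in this PR, `buildMergePipeline(G, {"sm_80"})` yields a merge node with two inputs, one split node, and two backend nodes that both hang off the shared split.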

Here's a better illustration in case the above wasn't clear. (Thanks Claude 😉)

graph TD
  HSrc["InputAction<br/>foo.cu (host)"] --> HPre["PreprocessJobAction"]
  HPre --> HComp["CompileJobAction → TY_CIR<br/>-fclangir -emit-cir"]

  DSrc["InputAction<br/>foo.cu (cuda-device, sm_80)"] --> DPre["PreprocessJobAction"]
  DPre --> DComp["CompileJobAction → TY_CIR<br/>-fclangir -emit-cir -fcuda-is-device"]

  HComp --> Merge["CIRMergeJobAction → TY_CIR<br/>cir-offload-merge -combine"]
  DComp --> Merge

  Merge --> Split["CIRSplitJobAction → TY_CIR<br/>cir-offload-merge -split"]

  Split --> HBack["BackendJobAction host → TY_LLVM_IR<br/>-fclangir -emit-llvm -x cir"]
  Split --> DBack["BackendJobAction device → TY_LLVM_BC<br/>-fclangir -emit-llvm-bc -x cir"]

  HBack --> OA["OffloadAction (root)<br/>host + device grouping"]
  DBack --> OA

RiverDave · 2026-06-18T07:39:22Z

[CIR] Construct the offload merge driver pipeline for CUDA (single arch) #7 👈 (View in Graphite)
users/riverdave/cir/cuda-register-module-pass

This stack of pull requests is managed by Graphite.

RiverDave · 2026-06-18T07:57:18Z

@koparasy I assume that eventually, when we hook up merge and split to the optimization layer, we'd want that stage to be part of CIRMergeJobAction? I was wondering whether we'd want a separate action to make that explicit, but I recall us discussing how hard it already is to justify adding new actions to the driver to the community.

koparasy · 2026-06-18T16:39:16Z

I don't think a separate invocation/action is needed. It would be nice if all of our optimizations could run by just calling cir-opt provided that the loaded module is a combined TU. But in terms of actions I would not introduce a new action. No.

koparasy · 2026-06-18T16:49:19Z

+// MERGE: "-cc1" {{.*}}"-fclangir" {{.*}}"-emit-cir"
+// MERGE: "-cc1" {{.*}}"-fclangir" {{.*}}"-emit-cir" {{.*}}"-fcuda-is-device"
+// MERGE: "{{.*}}cir-offload-merge{{(\.exe)?}}" "-combine" "-targets=host-x86_64-unknown-linux-gnu,cuda-nvptx64-nvidia-cuda-unknown-sm_80"
+// MERGE: "{{.*}}cir-offload-merge{{(\.exe)?}}" "-split" "-targets=host-x86_64-unknown-linux-gnu,cuda-nvptx64-nvidia-cuda-unknown-sm_80"
+// MERGE: "-cc1" {{.*}}"-fclangir" {{.*}}"-emit-llvm-bc" {{.*}}"-fcuda-is-device" {{.*}}"-x" "cir"


Shouldn't you also track the input and output files here with some pattern, to make sure that the outputs of cc1 -emit-cir are forwarded properly as inputs to cir-offload-merge?

The same goes the other way around: the split should output files, right?

koparasy · 2026-06-18T16:49:51Z

+// MERGE: "-cc1" {{.*}}"-fclangir" {{.*}}"-emit-cir"
+// MERGE: "-cc1" {{.*}}"-fclangir" {{.*}}"-emit-cir" {{.*}}"-fcuda-is-device"
+// MERGE: "{{.*}}cir-offload-merge{{(\.exe)?}}" "-combine" "-targets=host-x86_64-unknown-linux-gnu,cuda-nvptx64-nvidia-cuda-unknown-sm_80"
+// MERGE: "{{.*}}cir-offload-merge{{(\.exe)?}}" "-split" "-targets=host-x86_64-unknown-linux-gnu,cuda-nvptx64-nvidia-cuda-unknown-sm_80"
+// MERGE: "-cc1" {{.*}}"-fclangir" {{.*}}"-emit-llvm-bc" {{.*}}"-fcuda-is-device" {{.*}}"-x" "cir"


Why emit-llvm-bc, is this the normal cuda driver approach?

Yes, that's what the driver constructs when you pass -emit-llvm for CUDA. See:

./build/bin/clang -### -target x86_64-unknown-linux-gnu -x cuda -S -emit-llvm \
  --cuda-gpu-arch=sm_80 -nocudainc -nocudalib -c /dev/null 2>&1 | grep '-emit-llvm-bc'

I get something like:

"/Users/davidfeliperiveraguerra/dev/gsoc-combine/build/bin/clang-23" "-cc1" "-triple" "nvptx64-nvidia-cuda" "-aux-triple" "x86_64-unknown-linux-gnu" "-emit-llvm-bc" .....

Perhaps I should've made things clearer in that test and shown the end-to-end lowering instead of just halting at LLVM? I did it mainly to keep things simple.

It makes sense to have everything here. The entire lowering. In the end you need to verify that everything is correct.

koparasy · 2026-06-18T17:43:48Z

      handleTimeTrace(C, Args, JA, BaseInput, Result);
  }

+  if (TargetDeviceOffloadKind != Action::OFK_None &&


Shouldn't you here be more defensive and check if (isCIROffloadMerge(C, C.getArgs())...)?

koparasy · 2026-06-18T17:45:59Z

+         Args.hasArg(options::OPT_clangir_offload_merge) &&
+         (C.isOffloadingHostKind(Action::OFK_Cuda) ||
+          C.isOffloadingHostKind(Action::OFK_HIP));
+}


NIT: Do we support HIP? If so, we should add a test; if not, maybe we add an assertion here, so that HIP + OffloadMerge is an error?

No HIP yet; I'll wire that target in incrementally. I'll add the assert, btw.
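One way to picture the outcome of this exchange is a gate that separates "merge requested and supported" from "merge requested on a target that isn't wired up yet". The sketch below is a standalone toy, not the PR's `isCIROffloadMerge()`: `OffloadKind`, `DriverState`, `MergeDecision`, and `classifyOffloadMerge` are hypothetical names, and it returns a status instead of asserting so the unsupported case is easy to see.

```cpp
#include <cassert>

// Hypothetical mirror of the offload kinds discussed in this thread.
enum class OffloadKind { None, Cuda, HIP };

// Hypothetical snapshot of the two inputs the gate depends on here.
struct DriverState {
  bool OffloadMergeFlag;       // stands in for --clangir-offload-merge
  OffloadKind HostOffloadKind; // what the host compilation offloads to
};

enum class MergeDecision { Disabled, Enabled, Unsupported };

// CUDA + flag enables the pipeline; HIP + flag is reported as unsupported
// (the place where the real driver would assert or diagnose).
MergeDecision classifyOffloadMerge(const DriverState &S) {
  if (!S.OffloadMergeFlag)
    return MergeDecision::Disabled;
  switch (S.HostOffloadKind) {
  case OffloadKind::Cuda:
    return MergeDecision::Enabled;
  case OffloadKind::HIP:
    return MergeDecision::Unsupported;
  case OffloadKind::None:
    return MergeDecision::Disabled;
  }
  return MergeDecision::Disabled;
}
```

Keeping the decision in a single function matches the motivation given for the helper: every call site sees the same answer for the same flag and offload kind.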

RiverDave · 2026-06-21T19:04:13Z

Something I'm currently facing is a crash when appending the fatbins to the host on multi-arch. I believe this comes from the nature of offloading, where compiling for different archs collapses their respective binaries into one before appending to the host, which expects a single input. I think that's the expectation behind:

llvm-project/clang/lib/Driver/ToolChains/Clang.cpp

Line 8081 in 99192fe

assert(HostOffloadingInputs.size() == 1 && "Only one input expected");

Multi-arch is the focus of a future PR (not this one), so I might need to see how the bundler resolves these things.

koparasy · 2026-06-22T16:33:44Z

+// RUN: %clang -### -target x86_64-unknown-linux-gnu -x cuda -fclangir \
+// RUN:   --cuda-gpu-arch=sm_80 -nocudainc -nocudalib \
+// RUN:   --clangir-offload-merge -c %s 2>&1 \
+// RUN: | FileCheck %s --check-prefix=MERGE
+


What happens if you invoke clang without the -###?

We can absolutely build the pipeline. As of now, these are the bindings we emit per action.

Bindings through the stock CUDA driver (no merge):

➜ gsoc-combine git:(06-18-_cir_introduce_the_offload_merge_driver_pipeline_for_cuda_single_arch_) ✗ ./build/bin/clang -ccc-print-bindings \
    -target x86_64-unknown-linux-gnu \
    -x cu -fclangir \
    --cuda-gpu-arch=sm_80 \
    -nocudainc -nocudalib \
    -c /tmp/test.cu
# "nvptx64-nvidia-cuda" - "clang", inputs: ["/tmp/test.cu"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-2f55d0.s"
# "nvptx64-nvidia-cuda" - "NVPTX::Assembler", inputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-2f55d0.s"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-cc0829.o"
# "nvptx64-nvidia-cuda" - "NVPTX::Linker", inputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-cc0829.o", "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-2f55d0.s"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-e4154d.fatbin"
# "x86_64-unknown-linux-gnu" - "clang", inputs: ["/tmp/test.cu", "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-e4154d.fatbin"], output: "test.o"

Bindings through cir-offload-merge (CUDA, single arch):

# "x86_64-unknown-linux-gnu" - "clang", inputs: ["/tmp/test.cu"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-c35709.cir"
# "nvptx64-nvidia-cuda" - "clang", inputs: ["/tmp/test.cu"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-883aff.cir"
# "nvptx64-nvidia-cuda" - "CIR offload merge", inputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-c35709.cir", "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-883aff.cir"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-b077f2.cir"
# "nvptx64-nvidia-cuda" - "CIR offload merge", inputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-b077f2.cir"], outputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-980cac.cir", "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-63eb4f.cir"]
# "nvptx64-nvidia-cuda" - "clang", inputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-63eb4f.cir"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-44f3d0.s"
# "nvptx64-nvidia-cuda" - "NVPTX::Assembler", inputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-44f3d0.s"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-39463f.o"
# "nvptx64-nvidia-cuda" - "NVPTX::Linker", inputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-39463f.o", "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-44f3d0.s"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-1fa102.fatbin"
# "x86_64-unknown-linux-gnu" - "clang", inputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-980cac.cir", "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-1fa102.fatbin"], output: "test.o"
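The property the reviewer asked about earlier (each step's outputs are what the next step consumes) can be stated as a small check over bindings like the ones above. This is a rough sketch, not part of the PR: `Binding`, `inputsAreProduced`, and `mergeBindings` are hypothetical, and the temp paths are shortened to made-up names such as `host.cir` purely for readability.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// One line of -ccc-print-bindings output, reduced to what the check needs.
struct Binding {
  std::string Tool;
  std::vector<std::string> Inputs;
  std::vector<std::string> Outputs;
};

// Returns true when every input of every job is either an original source
// file or an output of an earlier job in the list.
bool inputsAreProduced(const std::vector<Binding> &Jobs,
                       const std::set<std::string> &Sources) {
  std::set<std::string> Available = Sources;
  for (const Binding &J : Jobs) {
    for (const std::string &In : J.Inputs)
      if (Available.count(In) == 0)
        return false;
    Available.insert(J.Outputs.begin(), J.Outputs.end());
  }
  return true;
}

// The merge-pipeline bindings with temp paths replaced by illustrative names.
std::vector<Binding> mergeBindings() {
  return {
      {"clang", {"test.cu"}, {"host.cir"}},
      {"clang", {"test.cu"}, {"dev.cir"}},
      {"CIR offload merge", {"host.cir", "dev.cir"}, {"combined.cir"}},
      {"CIR offload merge", {"combined.cir"},
       {"host.split.cir", "dev.split.cir"}},
      {"clang", {"dev.split.cir"}, {"dev.s"}},
      {"NVPTX::Assembler", {"dev.s"}, {"dev.o"}},
      {"NVPTX::Linker", {"dev.o", "dev.s"}, {"test.fatbin"}},
      {"clang", {"host.split.cir", "test.fatbin"}, {"test.o"}},
  };
}
```

With `test.cu` as the only source, the check passes for the chain as listed and fails if, say, the combine step is dropped, since the split would then consume a file nobody produced.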

I'm compiling an LLVM build on an NVIDIA machine so I'll update you on this when I get it running there.

I was able to gather compile timings for polybench after running this end to end:

Phase averages (wall seconds, over successful compilations)

| Phase | no-merge avg | merge avg | delta |
|---|---|---|---|
| Frontend+IRGen | 4.197 | 1.553 | -2.645 |
| ISel | 0.019 | 0.019 | -0.000 |
| LLVM-analysis | 0.004 | 0.004 | -0.000 |
| LLVM-passes | 0.139 | 0.135 | -0.005 |
| RegAlloc | 0.001 | 0.001 | -0.000 |
| Total (wall) | 4.254 | 1.660 | -2.594 |

In the frontend and IRGen, we're seeing a 2.6x average speedup on trivial kernels and up to 8x on complex ones like adi. It's possible that the benchmark instrumentation I ran this with took the first cc1 invocation as the source of truth without accounting for the CIR -> fatbin transformation.

Please do take these results with a grain of salt; I will probably dig deeper into this, as I find it extremely weird.
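As a quick cross-check of the phase averages above, the ratios can be recomputed directly from the two rows that matter. This is a small sketch using only the numbers from the table; `PhaseTiming`, `speedup`, and `delta` are names made up for the example.

```cpp
#include <cassert>
#include <cmath>

// One row of the phase-average table (wall seconds).
struct PhaseTiming {
  const char *Phase;
  double NoMerge; // "no-merge avg" column
  double Merge;   // "merge avg" column
};

// Ratio of the stock pipeline's time to the merge pipeline's time.
double speedup(const PhaseTiming &T) { return T.NoMerge / T.Merge; }

// Signed difference, matching the sign convention of the "delta" column.
double delta(const PhaseTiming &T) { return T.Merge - T.NoMerge; }

// Values copied from the table.
const PhaseTiming FrontendIRGen{"Frontend+IRGen", 4.197, 1.553};
const PhaseTiming TotalWall{"Total (wall)", 4.254, 1.660};
```

On these averages the Frontend+IRGen ratio comes out to roughly 2.7x and the total wall-time ratio to roughly 2.6x, so nearly all of the measured difference sits in the frontend phase, which is the part the caveat above is about.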

It looks like the binary contents produced through the merge pipeline are different: we're missing crucial sections and function calls in the final host object, such as nv_fatbin, .nvFatBinSegment, __cudaRegisterFunction, etc. It looks like the registration calls are not being emitted on the host! I'm starting to look at what might be the cause.

OK, big thing: it looks like the current fork doesn't have registration (on vars) implemented, unlike upstream. I'll need to rebase and will figure this out.

Ping me once you have this done.

RiverDave · 2026-06-26T02:54:51Z

I had to figure out a way to pass the gpu binary path when we don't consume an AST in the second cc1 invocation so that we can emit the right runtime code on the host. Therefore this patch now depends on #8.

RiverDave changed the title ~~[CIR] Introduce the offload merge driver pipeline for CUDA (single arch)~~ [CIR] Construct the offload merge driver pipeline for CUDA (single arch) Jun 18, 2026

RiverDave marked this pull request as ready for review June 18, 2026 07:53

RiverDave requested a review from koparasy June 18, 2026 07:53

koparasy reviewed Jun 18, 2026

RiverDave requested a review from koparasy June 22, 2026 16:06

koparasy reviewed Jun 22, 2026

RiverDave requested a review from koparasy June 24, 2026 16:42

RiverDave force-pushed the gsoc/combine-cir branch from 99192fe to c95753c Compare June 25, 2026 01:54

RiverDave force-pushed the 06-18-_cir_introduce_the_offload_merge_driver_pipeline_for_cuda_single_arch_ branch from 672cab2 to 0e3fd3e Compare June 25, 2026 02:17

RiverDave mentioned this pull request Jun 25, 2026

[CIR][CUDA] Add separate CUDA registration pass when consuming cir through CIRGenAction #8

Open

RiverDave force-pushed the 06-18-_cir_introduce_the_offload_merge_driver_pipeline_for_cuda_single_arch_ branch from 0e3fd3e to 15fd8e0 Compare June 26, 2026 02:52

RiverDave changed the base branch from gsoc/combine-cir to users/riverdave/cir/cuda-register-module-pass June 26, 2026 02:52

RiverDave force-pushed the users/riverdave/cir/cuda-register-module-pass branch from 0b4e5e2 to c3fb89f Compare June 26, 2026 18:53

RiverDave added 5 commits June 26, 2026 15:11

[CIR] Introduce the offload merge driver pipeline for CUDA (single arch)

d22d2f4

Address review comments

6d98f7f

Propagate device info post split.

eda05cc

Embed a fatbin to host instead of raw cubin

9da4e49

Propagate proper offload info post-split

55cc2a9

RiverDave force-pushed the 06-18-_cir_introduce_the_offload_merge_driver_pipeline_for_cuda_single_arch_ branch from 15fd8e0 to 55cc2a9 Compare June 26, 2026 19:11

add CUDA -> LLVM tests when resuming CIR pipeline under -cir-offload-merge

8d4fab9

RiverDave mentioned this pull request Jun 27, 2026

[CIR][CUDA] multi-arch support for the CIR offload-merge pipeline #9

Open
