[CIR] Construct the offload merge driver pipeline for CUDA (single arch)#7

Open
RiverDave wants to merge 6 commits into users/riverdave/cir/cuda-register-module-pass from 06-18-_cir_introduce_the_offload_merge_driver_pipeline_for_cuda_single_arch_

@RiverDave RiverDave commented Jun 18, 2026

Owner

We're back with the PRs 😄

This is where the actions from #6 actually get placed into the driver's action graph - the construction that #6 deferred. With --clangir-offload-merge on a CUDA compile, each host/device TU is lowered to serialized CIR, combined into a single cir.offload.container, split back out, and every module then resumes the backend from -x cir (the .cir input path landed in #5).

The driver is a bit dense, so the relevant bits here are:

  • BuildOffloadingActions builds the graph: host + per-arch device compiles become inputs to one CIRMergeJobAction, which feeds a single CIRSplitJobAction, wrapped in an OffloadAction that exposes the split as one host + N device dependences.
  • ConstructPhaseAction stops the CUDA host/device compile at CIR (instead of going straight to LLVM) when the flag is set, so there's something for the merge to consume.
  • BuildActions re-applies the backend phase to each split-out module separately - the host module and each device module have different toolchains/output types, so each needs its own backend action feeding off the shared split node.

I added an isCIROffloadMerge() helper since the same gating condition is checked across all three sites.
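For readers who don't have the driver in their head, the graph shape described above can be modelled with a toy sketch. This is not clang's `Action` hierarchy: `Node`, `Graph` and `buildMergeGraph` are hypothetical names made up for illustration, and the only thing the sketch tries to capture is the wiring (every compile feeds one merge, the merge feeds one split, and each module gets its own backend hanging off the shared split).

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a driver action; NOT clang's Action class.
struct Node {
  std::string Kind;                 // "compile", "merge", "split" or "backend"
  std::string Target;               // "host", a device arch, or "container"
  std::vector<const Node *> Inputs; // nodes this one consumes
};

// Owns every node so the raw pointers between nodes stay valid.
struct Graph {
  std::vector<std::unique_ptr<Node>> Nodes;

  const Node *add(std::string Kind, std::string Target,
                  std::vector<const Node *> Inputs = {}) {
    Nodes.push_back(std::make_unique<Node>(
        Node{std::move(Kind), std::move(Target), std::move(Inputs)}));
    return Nodes.back().get();
  }
};

// Wires compile(host) + compile(each arch) -> merge -> split -> one backend
// per module. Returns the backend nodes, host first.
std::vector<const Node *> buildMergeGraph(Graph &G,
                                          const std::vector<std::string> &Archs) {
  assert(!Archs.empty() && "expected at least one device arch");

  std::vector<const Node *> Compiles{G.add("compile", "host")};
  for (const std::string &Arch : Archs)
    Compiles.push_back(G.add("compile", Arch));

  const Node *Merge = G.add("merge", "container", Compiles);
  const Node *Split = G.add("split", "container", {Merge});

  // Host and device modules need different toolchains and output types,
  // so each gets its own backend node fed by the same split node.
  std::vector<const Node *> Backends{G.add("backend", "host", {Split})};
  for (const std::string &Arch : Archs)
    Backends.push_back(G.add("backend", Arch, {Split}));
  return Backends;
}
```

Calling `buildMergeGraph(G, {"sm_80"})` yields six nodes in total, with both backends pointing at the same split node, which matches the single-arch case this PR targets.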

Here's a better illustration in case the above wasn't clear. (Thanks Claude 😉)

graph TD
  HSrc["InputAction<br/>foo.cu (host)"] --> HPre["PreprocessJobAction"]
  HPre --> HComp["CompileJobAction → TY_CIR<br/>-fclangir -emit-cir"]

  DSrc["InputAction<br/>foo.cu (cuda-device, sm_80)"] --> DPre["PreprocessJobAction"]
  DPre --> DComp["CompileJobAction → TY_CIR<br/>-fclangir -emit-cir -fcuda-is-device"]

  HComp --> Merge["CIRMergeJobAction → TY_CIR<br/>cir-offload-merge -combine"]
  DComp --> Merge

  Merge --> Split["CIRSplitJobAction → TY_CIR<br/>cir-offload-merge -split"]

  Split --> HBack["BackendJobAction host → TY_LLVM_IR<br/>-fclangir -emit-llvm -x cir"]
  Split --> DBack["BackendJobAction device → TY_LLVM_BC<br/>-fclangir -emit-llvm-bc -x cir"]

  HBack --> OA["OffloadAction (root)<br/>host + device grouping"]
  DBack --> OA
  

RiverDave commented Jun 18, 2026

Owner Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

@RiverDave RiverDave changed the title from "[CIR] Introduce the offload merge driver pipeline for CUDA (single arch)" to "[CIR] Construct the offload merge driver pipeline for CUDA (single arch)" Jun 18, 2026
@RiverDave RiverDave marked this pull request as ready for review June 18, 2026 07:53
@RiverDave RiverDave requested a review from koparasy June 18, 2026 07:53
@RiverDave

Owner Author

@koparasy I assume that eventually, when we hook up merge and split to the optimization layer, we'd want that stage to be part of CIRMergeJobAction? I was wondering whether we'd want a separate action for that to be explicit, but I recall us discussing how complicated it already is to justify adding new actions to the driver to the community.

@koparasy

Collaborator

> @koparasy I assume that eventually, when we hook up merge and split to the optimization layer, we'd want that stage to be part of CIRMergeJobAction? I was wondering whether we'd want a separate action for that to be explicit, but I recall us discussing how complicated it already is to justify adding new actions to the driver to the community.

I don't think a separate invocation/action is needed. It would be nice if all of our optimizations could run by just calling cir-opt provided that the loaded module is a combined TU. But in terms of actions I would not introduce a new action. No.

Comment thread clang/test/Driver/cir-offload-merge.cu Outdated
Comment on lines +14 to +18
// MERGE: "-cc1" {{.*}}"-fclangir" {{.*}}"-emit-cir"
// MERGE: "-cc1" {{.*}}"-fclangir" {{.*}}"-emit-cir" {{.*}}"-fcuda-is-device"
// MERGE: "{{.*}}cir-offload-merge{{(\.exe)?}}" "-combine" "-targets=host-x86_64-unknown-linux-gnu,cuda-nvptx64-nvidia-cuda-unknown-sm_80"
// MERGE: "{{.*}}cir-offload-merge{{(\.exe)?}}" "-split" "-targets=host-x86_64-unknown-linux-gnu,cuda-nvptx64-nvidia-cuda-unknown-sm_80"
// MERGE: "-cc1" {{.*}}"-fclangir" {{.*}}"-emit-llvm-bc" {{.*}}"-fcuda-is-device" {{.*}}"-x" "cir"

Collaborator

Shouldn't you also track the input/output files here with some pattern, to make sure that the outputs of cc1 -emit-cir are forwarded properly as inputs to cir-offload-merge?

The same goes the other way around: the split should output files, right?
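The property being asked for can be stated precisely: every input of the merge job should be an output of an earlier job, and likewise for the split. In the lit test itself this would presumably be expressed with FileCheck variable captures rather than code; the sketch below only models the property, and `Binding` and `inputsAreForwarded` are hypothetical names, not driver APIs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One driver job: the tool it runs plus the files it reads and writes.
struct Binding {
  std::string Tool;
  std::vector<std::string> Inputs;
  std::vector<std::string> Outputs;
};

// Returns true if every input of Jobs[Idx] was produced by an earlier job.
bool inputsAreForwarded(const std::vector<Binding> &Jobs, std::size_t Idx) {
  assert(Idx < Jobs.size() && "job index out of range");
  for (const std::string &In : Jobs[Idx].Inputs) {
    bool Found = false;
    // Only jobs that run before Idx can have produced this input.
    for (std::size_t J = 0; J < Idx && !Found; ++J) {
      const std::vector<std::string> &Outs = Jobs[J].Outputs;
      Found = std::find(Outs.begin(), Outs.end(), In) != Outs.end();
    }
    if (!Found)
      return false;
  }
  return true;
}
```

With a job list of host compile, device compile, combine and split (in that order), the check should hold for the combine and split jobs and fail for the first compile, whose input is the original source file.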

Comment thread clang/test/Driver/cir-offload-merge.cu Outdated
Comment on lines +14 to +18
// MERGE: "-cc1" {{.*}}"-fclangir" {{.*}}"-emit-cir"
// MERGE: "-cc1" {{.*}}"-fclangir" {{.*}}"-emit-cir" {{.*}}"-fcuda-is-device"
// MERGE: "{{.*}}cir-offload-merge{{(\.exe)?}}" "-combine" "-targets=host-x86_64-unknown-linux-gnu,cuda-nvptx64-nvidia-cuda-unknown-sm_80"
// MERGE: "{{.*}}cir-offload-merge{{(\.exe)?}}" "-split" "-targets=host-x86_64-unknown-linux-gnu,cuda-nvptx64-nvidia-cuda-unknown-sm_80"
// MERGE: "-cc1" {{.*}}"-fclangir" {{.*}}"-emit-llvm-bc" {{.*}}"-fcuda-is-device" {{.*}}"-x" "cir"

Collaborator

Why -emit-llvm-bc? Is this the normal CUDA driver approach?

@RiverDave RiverDave Jun 18, 2026

Owner Author

Yes, that's what the driver constructs when you pass -emit-llvm for CUDA; see:

./build/bin/clang -### -target x86_64-unknown-linux-gnu -x cuda -S -emit-llvm \
  --cuda-gpu-arch=sm_80 -nocudainc -nocudalib -c /dev/null 2>&1 |
  grep -e '-emit-llvm-bc'

I get something like:

 "/Users/davidfeliperiveraguerra/dev/gsoc-combine/build/bin/clang-23" 
  "-cc1" "-triple" "nvptx64-nvidia-cuda" "-aux-triple" "x86_64-unknown-linux-gnu" 
   "-emit-llvm-bc" .....

Owner Author

Perhaps I should've made things clearer in that test and shown the end-to-end lowering instead of just halting at LLVM IR? I did it mainly to keep things simple.

Collaborator

It makes sense to have everything here. The entire lowering. In the end you need to verify that everything is correct.

Comment thread clang/lib/Driver/Driver.cpp Outdated
handleTimeTrace(C, Args, JA, BaseInput, Result);
}

if (TargetDeviceOffloadKind != Action::OFK_None &&

Collaborator

Shouldn't you here be more defensive and check if (isCIROffloadMerge(C, C.getArgs())...)?

Comment on lines +4498 to +4501
Args.hasArg(options::OPT_clangir_offload_merge) &&
(C.isOffloadingHostKind(Action::OFK_Cuda) ||
C.isOffloadingHostKind(Action::OFK_HIP));
}

Collaborator

NIT: Do we support HIP? If so, we should add a test; if not, maybe we add an assertion here, so that HIP + OffloadMerge = error?

Owner Author

No HIP yet; I'll wire that target up incrementally. I'll add the assert btw.
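One way to picture the "HIP + OffloadMerge = error" outcome is a gate that distinguishes three cases instead of two. This is a minimal sketch under assumptions, not the code in this PR: `OffloadKind`, `MergeMode`, `Options` and `classifyOffloadMerge` are hypothetical stand-ins for the driver's real types and for the isCIROffloadMerge() helper.

```cpp
#include <cassert>

// Hypothetical stand-ins for the driver's offload kinds and parsed options.
enum class OffloadKind { None, Cuda, HIP };
enum class MergeMode { Off, On, Unsupported };

struct Options {
  bool ClangIROffloadMerge; // models --clangir-offload-merge
  OffloadKind HostOffload;  // offloading model active on the host side
};

// Decides whether the merge pipeline applies. HIP is reported as unsupported
// so the caller can reject it instead of silently building a CUDA-only graph.
MergeMode classifyOffloadMerge(const Options &Opts) {
  if (!Opts.ClangIROffloadMerge)
    return MergeMode::Off;
  switch (Opts.HostOffload) {
  case OffloadKind::None:
    return MergeMode::Off;
  case OffloadKind::Cuda:
    return MergeMode::On;
  case OffloadKind::HIP:
    return MergeMode::Unsupported;
  }
  assert(false && "unhandled OffloadKind");
  return MergeMode::Off;
}
```

Returning a distinct `Unsupported` value keeps the decision testable; whether the caller turns it into an assert or a user-facing diagnostic is a separate choice, bearing in mind that a plain `assert` disappears in builds compiled with `NDEBUG`.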

RiverDave commented Jun 21, 2026

Owner Author

Something I'm currently facing is a crash when appending the fatbins to the host on multi-arch. I believe this comes from the nature of offloading: compiling for different archs collapses their respective binaries into one before appending it to the host, which expects a single input. I think that's the expectation behind:

assert(HostOffloadingInputs.size() == 1 && "Only one input expected");

Multi-arch is the focus of a future PR (not this one), so I might need to see how the bundler resolves these things.

@RiverDave RiverDave requested a review from koparasy June 22, 2026 16:06
Comment on lines +6 to +10
// RUN: %clang -### -target x86_64-unknown-linux-gnu -x cuda -fclangir \
// RUN: --cuda-gpu-arch=sm_80 -nocudainc -nocudalib \
// RUN: --clangir-offload-merge -c %s 2>&1 \
// RUN: | FileCheck %s --check-prefix=MERGE

Collaborator

What happens if you invoke clang without the -###?

Owner Author

We can absolutely build the pipeline. As of now, these are the bindings we emit per action.

Bindings through the stock CUDA driver (no merge):

$ ./build/bin/clang -ccc-print-bindings \
  -target x86_64-unknown-linux-gnu \
  -x cuda -fclangir \
  --cuda-gpu-arch=sm_80 \
  -nocudainc -nocudalib \
  -c /tmp/test.cu

# "nvptx64-nvidia-cuda" - "clang", inputs: ["/tmp/test.cu"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-2f55d0.s"
# "nvptx64-nvidia-cuda" - "NVPTX::Assembler", inputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-2f55d0.s"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-cc0829.o"
# "nvptx64-nvidia-cuda" - "NVPTX::Linker", inputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-cc0829.o", "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-2f55d0.s"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-e4154d.fatbin"
# "x86_64-unknown-linux-gnu" - "clang", inputs: ["/tmp/test.cu", "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-e4154d.fatbin"], output: "test.o"

Bindings through cir-offload-merge (CUDA, single arch):


# "x86_64-unknown-linux-gnu" - "clang", inputs: ["/tmp/test.cu"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-c35709.cir"
# "nvptx64-nvidia-cuda" - "clang", inputs: ["/tmp/test.cu"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-883aff.cir"
# "nvptx64-nvidia-cuda" - "CIR offload merge", inputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-c35709.cir", "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-883aff.cir"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-b077f2.cir"
# "nvptx64-nvidia-cuda" - "CIR offload merge", inputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-b077f2.cir"], outputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-980cac.cir", "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-63eb4f.cir"]
# "nvptx64-nvidia-cuda" - "clang", inputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-63eb4f.cir"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-44f3d0.s"
# "nvptx64-nvidia-cuda" - "NVPTX::Assembler", inputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-44f3d0.s"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-39463f.o"
# "nvptx64-nvidia-cuda" - "NVPTX::Linker", inputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-39463f.o", "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-44f3d0.s"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-1fa102.fatbin"
# "x86_64-unknown-linux-gnu" - "clang", inputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-980cac.cir", "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-1fa102.fatbin"], output: "test.o"

I'm compiling an LLVM build on an NVIDIA machine so I'll update you on this when I get it running there.

@RiverDave RiverDave Jun 24, 2026

Owner Author

I was able to gather compile timings for polybench after running this end to end:

Phase averages (wall seconds, over successful compilations)

| Phase | no-merge avg | merge avg | delta |
|---|---|---|---|
| Frontend+IRGen | 4.197 | 1.553 | -2.645 |
| ISel | 0.019 | 0.019 | -0.000 |
| LLVM-analysis | 0.004 | 0.004 | -0.000 |
| LLVM-passes | 0.139 | 0.135 | -0.005 |
| RegAlloc | 0.001 | 0.001 | -0.000 |
| Total (wall) | 4.254 | 1.660 | -2.594 |

In the frontend and IRGen, we're seeing a 2.6x average speedup on trivial kernels and up to 8x on complex ones like adi. It's possible that the benchmark instrumentation I ran this with took the first cc1 invocation as the source of truth, without accounting for the CIR -> fatbin transformation.

Please take these results with a grain of salt; I'll probably dig deeper into this, as I find it extremely weird.
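As a sanity check on the reported averages, the speedup implied by the phase table can be recomputed directly. The sketch below only does arithmetic on the numbers quoted above; `PhaseTiming`, `speedup` and `delta` are hypothetical helper names. Note that a ratio of averages is not the same thing as an average of per-kernel ratios, so it need not match a figure computed the other way.

```cpp
#include <cassert>

// Average wall time in seconds for one phase, without and with the merge.
struct PhaseTiming {
  const char *Name;
  double NoMerge;
  double Merge;
};

// Ratio of the two averages: values above 1 mean the merge path was faster.
double speedup(const PhaseTiming &P) {
  assert(P.Merge > 0.0 && "merge average must be positive");
  return P.NoMerge / P.Merge;
}

// Absolute change in seconds: negative means the merge path took less time.
double delta(const PhaseTiming &P) { return P.Merge - P.NoMerge; }

// Rows copied from the phase-average table.
const PhaseTiming kFrontendIRGen{"Frontend+IRGen", 4.197, 1.553};
const PhaseTiming kTotalWall{"Total (wall)", 4.254, 1.660};
```

On these rows the ratio comes out at roughly 2.70x for Frontend+IRGen and roughly 2.56x for total wall time. The recomputed Frontend+IRGen delta is -2.644, which differs from the table's -2.645 only in the last digit, presumably because the table was derived from unrounded data.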

Owner Author

It looks like the binary contents produced through the merge pipeline are different: the final host object is missing crucial sections and function calls such as .nv_fatbin, .nvFatBinSegment, __cudaRegisterFunction, etc. It looks like registration calls are not being emitted on the host! I'm starting to look into what might be the cause.

@RiverDave RiverDave Jun 24, 2026

Owner Author

OK, big thing: it looks like the current fork doesn't have registration (on vars) implemented, unlike upstream. I'll need to rebase; will figure this out.

Collaborator

Ping me once you have this done.

@RiverDave RiverDave requested a review from koparasy June 24, 2026 16:42
RiverDave pushed a commit that referenced this pull request Jun 25, 2026
I recently noticed an LLDB crash during execution of the `script
print(lldb.SBDebugger().GetBroadcaster().GetName())` command:
```
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.	Program arguments: /home/sergei/llvm-project/build/bin/lldb-dap
 #0 0x000062735c3403d2 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/home/sergei/llvm-project/build/bin/lldb-dap+0x7c3d2)
 #1 0x000062735c33d7ec llvm::sys::RunSignalHandlers() (/home/sergei/llvm-project/build/bin/lldb-dap+0x797ec)
 #2 0x000062735c33d94c SignalHandler(int, siginfo_t*, void*) Signals.cpp:0:0
 #3 0x00007eaa6aa45330 (/lib/x86_64-linux-gnu/libc.so.6+0x45330)
 #4 0x00007eaa6bb0c092 lldb::SBBroadcaster::GetName() const (/home/sergei/llvm-project/build/bin/../lib/liblldb.so.23.0git+0x90c092)
 #5 0x00007eaa6bcb9a5d _wrap_SBBroadcaster_GetName LLDBWrapPython.cpp:0:0
 #6 0x00007eaa6a1df5f5 (/lib/x86_64-linux-gnu/libpython3.12.so.1.0+0x1df5f5)
 #7 0x00007eaa6a182b2c PyObject_Vectorcall (/lib/x86_64-linux-gnu/libpython3.12.so.1.0+0x182b2c)
 #8 0x00007eaa6a11d5ee _PyEval_EvalFrameDefault (/lib/x86_64-linux-gnu/libpython3.12.so.1.0+0x11d5ee)
 #9 0x00007eaa6a2a091f PyEval_EvalCode (/lib/x86_64-linux-gnu/libpython3.12.so.1.0+0x2a091f)
#10 0x00007eaa6a29c8b0 (/lib/x86_64-linux-gnu/libpython3.12.so.1.0+0x29c8b0)
#11 0x00007eaa6a11fbd3 _PyEval_EvalFrameDefault (/lib/x86_64-linux-gnu/libpython3.12.so.1.0+0x11fbd3)
#12 0x00007eaa6c4891b7 lldb_private::ScriptInterpreterPythonImpl::ExecuteOneLine(llvm::StringRef, lldb_private::CommandReturnObject*, lldb_private::ExecuteScriptOptions const&) (/home/sergei/llvm-project/build/bin/../lib/liblldb.so.23.0git+0x12891b7)
#13 0x00007eaa70326ff5 CommandObjectScriptingRun::DoExecute(llvm::StringRef, lldb_private::CommandReturnObject&) (/home/sergei/llvm-project/build/bin/../lib/liblldb.so.23.0git+0x5126ff5)
#14 0x00007eaa6bee3739 lldb_private::CommandObjectRaw::Execute(char const*, lldb_private::CommandReturnObject&) (/home/sergei/llvm-project/build/bin/../lib/liblldb.so.23.0git+0xce3739)
#15 0x00007eaa6bede09a lldb_private::CommandInterpreter::HandleCommand(char const*, lldb_private::LazyBool, lldb_private::CommandReturnObject&, bool) (/home/sergei/llvm-project/build/bin/../lib/liblldb.so.23.0git+0xcde09a)
#16 0x00007eaa6bb0f0f8 lldb::SBCommandInterpreter::HandleCommand(char const*, lldb::SBExecutionContext&, lldb::SBCommandReturnObject&, bool) (/home/sergei/llvm-project/build/bin/../lib/liblldb.so.23.0git+0x90f0f8)
#17 0x00007eaa6bb0f265 lldb::SBCommandInterpreter::HandleCommand(char const*, lldb::SBCommandReturnObject&, bool) (/home/sergei/llvm-project/build/bin/../lib/liblldb.so.23.0git+0x90f265)
#18 0x000062735c3707f3 lldb_dap::RunLLDBCommands[abi:cxx11](lldb::SBDebugger&, lldb::SBMutex, llvm::StringRef, llvm::ArrayRef<lldb_dap::protocol::String> const&, bool&, bool, bool) (/home/sergei/llvm-project/build/bin/lldb-dap+0xac7f3)
#19 0x000062735c3a8019 lldb_dap::EvaluateRequestHandler::Run(lldb_dap::protocol::EvaluateArguments const&) const (/home/sergei/llvm-project/build/bin/lldb-dap+0xe4019)
#20 0x000062735c3aba78 lldb_dap::RequestHandler<lldb_dap::protocol::EvaluateArguments, llvm::Expected<lldb_dap::protocol::EvaluateResponseBody>>::operator()(lldb_dap::protocol::Request const&) const (/home/sergei/llvm-project/build/bin/lldb-dap+0xe7a78)
#21 0x000062735c3ce1bf lldb_dap::BaseRequestHandler::Run(lldb_dap::protocol::Request const&) (/home/sergei/llvm-project/build/bin/lldb-dap+0x10a1bf)
#22 0x000062735c3577e7 lldb_dap::DAP::HandleObject(std::variant<lldb_dap::protocol::Request, lldb_dap::protocol::Response, lldb_dap::protocol::Event> const&) (/home/sergei/llvm-project/build/bin/lldb-dap+0x937e7)
#23 0x000062735c358705 lldb_dap::DAP::Loop() (/home/sergei/llvm-project/build/bin/lldb-dap+0x94705)
#24 0x000062735c2ed0c7 main (/home/sergei/llvm-project/build/bin/lldb-dap+0x290c7)
#25 0x00007eaa6aa2a1ca __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:74:3
```
As far as I understand, default constructors should be covered by fuzzing
tests, so I don't know how to write a test for this patch.
@RiverDave RiverDave force-pushed the 06-18-_cir_introduce_the_offload_merge_driver_pipeline_for_cuda_single_arch_ branch from 672cab2 to 0e3fd3e Compare June 25, 2026 02:17
@RiverDave RiverDave force-pushed the 06-18-_cir_introduce_the_offload_merge_driver_pipeline_for_cuda_single_arch_ branch from 0e3fd3e to 15fd8e0 Compare June 26, 2026 02:52
@RiverDave RiverDave changed the base branch from gsoc/combine-cir to users/riverdave/cir/cuda-register-module-pass June 26, 2026 02:52
@RiverDave

Owner Author

I had to figure out a way to pass the GPU binary path when we don't consume an AST in the second cc1 invocation, so that we can emit the right runtime code on the host. Therefore, this patch now depends on #8.

@RiverDave RiverDave force-pushed the users/riverdave/cir/cuda-register-module-pass branch from 0b4e5e2 to c3fb89f Compare June 26, 2026 18:53
@RiverDave RiverDave force-pushed the 06-18-_cir_introduce_the_offload_merge_driver_pipeline_for_cuda_single_arch_ branch from 15fd8e0 to 55cc2a9 Compare June 26, 2026 19:11