[CIR] Construct the offload merge driver pipeline for CUDA (single arch) #7
@koparasy I assume eventually when we hook up merge and split to the optimization layer we'd want that stage to be part of |
I don't think a separate invocation/action is needed. It would be nice if all of our optimizations could run by just calling |
// MERGE: "-cc1" {{.*}}"-fclangir" {{.*}}"-emit-cir"
// MERGE: "-cc1" {{.*}}"-fclangir" {{.*}}"-emit-cir" {{.*}}"-fcuda-is-device"
// MERGE: "{{.*}}cir-offload-merge{{(\.exe)?}}" "-combine" "-targets=host-x86_64-unknown-linux-gnu,cuda-nvptx64-nvidia-cuda-unknown-sm_80"
// MERGE: "{{.*}}cir-offload-merge{{(\.exe)?}}" "-split" "-targets=host-x86_64-unknown-linux-gnu,cuda-nvptx64-nvidia-cuda-unknown-sm_80"
// MERGE: "-cc1" {{.*}}"-fclangir" {{.*}}"-emit-llvm-bc" {{.*}}"-fcuda-is-device" {{.*}}"-x" "cir"
Shouldn't you also track the input and output files here with some pattern, to make sure that the outputs of cc1 -emit-cir are forwarded properly as inputs to cir-offload-merge?
The same goes the other way around: the split should output files, right?
// MERGE: "-cc1" {{.*}}"-fclangir" {{.*}}"-emit-cir"
// MERGE: "-cc1" {{.*}}"-fclangir" {{.*}}"-emit-cir" {{.*}}"-fcuda-is-device"
// MERGE: "{{.*}}cir-offload-merge{{(\.exe)?}}" "-combine" "-targets=host-x86_64-unknown-linux-gnu,cuda-nvptx64-nvidia-cuda-unknown-sm_80"
// MERGE: "{{.*}}cir-offload-merge{{(\.exe)?}}" "-split" "-targets=host-x86_64-unknown-linux-gnu,cuda-nvptx64-nvidia-cuda-unknown-sm_80"
// MERGE: "-cc1" {{.*}}"-fclangir" {{.*}}"-emit-llvm-bc" {{.*}}"-fcuda-is-device" {{.*}}"-x" "cir"
Why -emit-llvm-bc? Is this the normal CUDA driver approach?
Yes, that's what the driver constructs when you pass -emit-llvm for CUDA. See:
./build/bin/clang -### -target x86_64-unknown-linux-gnu -x cuda -S -emit-llvm \
--cuda-gpu-arch=sm_80 -nocudainc -nocudalib -c /dev/null 2>&1 |
grep -- '-emit-llvm-bc'
I get something like:
"/Users/davidfeliperiveraguerra/dev/gsoc-combine/build/bin/clang-23"
"-cc1" "-triple" "nvptx64-nvidia-cuda" "-aux-triple" "x86_64-unknown-linux-gnu"
"-emit-llvm-bc" .....
Perhaps I should've made things clearer in that test and shown the end-to-end lowering instead of just halting at LLVM? I did it mainly to keep things simple.
It makes sense to have everything here. The entire lowering. In the end you need to verify that everything is correct.
handleTimeTrace(C, Args, JA, BaseInput, Result);
}

if (TargetDeviceOffloadKind != Action::OFK_None &&
Shouldn't you be more defensive here and check if (isCIROffloadMerge(C, C.getArgs())...)?
Args.hasArg(options::OPT_clangir_offload_merge) &&
(C.isOffloadingHostKind(Action::OFK_Cuda) ||
C.isOffloadingHostKind(Action::OFK_HIP));
}
NIT: Do we support HIP? If so, we should add a test; if not, maybe we should add an assertion here, so that HIP + OffloadMerge = Error?
No HIP yet; I'll wire up that target incrementally. I'll add the assert, btw.
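A minimal, self-contained sketch of the more defensive gating discussed above, assuming HIP should be rejected loudly rather than silently accepted. `OffloadKind` and `isCIROffloadMergeSketch` are hypothetical stand-ins, not the driver's real types or the helper in this patch; the real check would go through `Compilation` and `ArgList`, and would likely use an assert or a driver diagnostic instead of an exception.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for clang's Action::OffloadKind; illustration only.
enum class OffloadKind { None, Cuda, HIP };

// Hypothetical model of the gating check: the merge pipeline is enabled only
// when the flag is present and the host offloads to a supported kind.
// HIP + merge is treated as an error instead of being accepted.
bool isCIROffloadMergeSketch(bool HasMergeFlag, OffloadKind HostKind) {
  if (!HasMergeFlag)
    return false; // Flag absent: stock pipeline, regardless of offload kind.
  if (HostKind == OffloadKind::HIP)
    throw std::invalid_argument(
        "--clangir-offload-merge does not support HIP yet");
  return HostKind == OffloadKind::Cuda;
}
```

With this shape, CUDA with the flag returns true, any compile without the flag returns false, and HIP with the flag fails immediately, which is the behaviour a negative driver test could then pin down.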
Something I'm currently facing is a crash when appending the fatbins to the host on multi-arch. I believe this comes from the nature of offloading, where compiling for different archs collapses their respective binaries before appending them to the host, which expects a single input - I think that's the expectation for: Multi-arch is the focus of a future PR (not this one), so I might need to see how the bundler resolves these things.
// RUN: %clang -### -target x86_64-unknown-linux-gnu -x cuda -fclangir \
// RUN: --cuda-gpu-arch=sm_80 -nocudainc -nocudalib \
// RUN: --clangir-offload-merge -c %s 2>&1 \
// RUN: | FileCheck %s --check-prefix=MERGE
What happens if you invoke clang without the -###?
We can absolutely build the pipeline. As of now, these are the bindings we emit per action:
bindings through CUDA stock driver (no merge):
./build/bin/clang -ccc-print-bindings \
-target x86_64-unknown-linux-gnu \
-x cuda -fclangir \
--cuda-gpu-arch=sm_80 \
-nocudainc -nocudalib \
-c /tmp/test.cu
# "nvptx64-nvidia-cuda" - "clang", inputs: ["/tmp/test.cu"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-2f55d0.s"
# "nvptx64-nvidia-cuda" - "NVPTX::Assembler", inputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-2f55d0.s"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-cc0829.o"
# "nvptx64-nvidia-cuda" - "NVPTX::Linker", inputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-cc0829.o", "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-2f55d0.s"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-e4154d.fatbin"
# "x86_64-unknown-linux-gnu" - "clang", inputs: ["/tmp/test.cu", "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-e4154d.fatbin"], output: "test.o"
bindings through cir-offload-merge (CUDA - single arch):
# "x86_64-unknown-linux-gnu" - "clang", inputs: ["/tmp/test.cu"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-c35709.cir"
# "nvptx64-nvidia-cuda" - "clang", inputs: ["/tmp/test.cu"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-883aff.cir"
# "nvptx64-nvidia-cuda" - "CIR offload merge", inputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-c35709.cir", "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-883aff.cir"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-b077f2.cir"
# "nvptx64-nvidia-cuda" - "CIR offload merge", inputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-b077f2.cir"], outputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-980cac.cir", "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-63eb4f.cir"]
# "nvptx64-nvidia-cuda" - "clang", inputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-63eb4f.cir"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-44f3d0.s"
# "nvptx64-nvidia-cuda" - "NVPTX::Assembler", inputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-44f3d0.s"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-39463f.o"
# "nvptx64-nvidia-cuda" - "NVPTX::Linker", inputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-39463f.o", "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-sm_80-44f3d0.s"], output: "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-1fa102.fatbin"
# "x86_64-unknown-linux-gnu" - "clang", inputs: ["/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-980cac.cir", "/var/folders/9q/4b7_q0n13lg1xnbmy3qnptxm0000gp/T/test-1fa102.fatbin"], output: "test.o"
I'm compiling an LLVM build on an NVIDIA machine so I'll update you on this when I get it running there.
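The merge bindings above form a chain in which each job consumes files produced by an earlier job, which is the same forwarding property the review asks the test to check. The sketch below models that chain and verifies it. It is only an illustration: `Binding`, `inputsAreForwarded`, and `mergePipeline` are hypothetical names, and the file names are shortened, made-up stand-ins for the temporary paths in the listing.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// One driver job: the tool that runs, the files it reads, the files it writes.
struct Binding {
  std::string Tool;
  std::vector<std::string> Inputs;
  std::vector<std::string> Outputs;
};

// Returns true when every input of every job is either an original source
// file or an output of an earlier job, i.e. nothing is dropped or misrouted.
bool inputsAreForwarded(const std::vector<Binding> &Jobs,
                        const std::set<std::string> &Sources) {
  std::set<std::string> Available = Sources;
  for (const Binding &Job : Jobs) {
    for (const std::string &In : Job.Inputs)
      if (!Available.count(In))
        return false;
    Available.insert(Job.Outputs.begin(), Job.Outputs.end());
  }
  return true;
}

// Single-arch merge pipeline, in the order of the bindings listed above.
// File names are invented placeholders for the temporary paths.
std::vector<Binding> mergePipeline() {
  return {
      {"clang (host)", {"test.cu"}, {"host.cir"}},
      {"clang (device)", {"test.cu"}, {"dev.cir"}},
      {"CIR offload merge (combine)", {"host.cir", "dev.cir"}, {"merged.cir"}},
      {"CIR offload merge (split)",
       {"merged.cir"},
       {"host.split.cir", "dev.split.cir"}},
      {"clang (device)", {"dev.split.cir"}, {"dev.s"}},
      {"NVPTX::Assembler", {"dev.s"}, {"dev.o"}},
      {"NVPTX::Linker", {"dev.o", "dev.s"}, {"test.fatbin"}},
      {"clang (host)", {"host.split.cir", "test.fatbin"}, {"test.o"}},
  };
}
```

Calling `inputsAreForwarded(mergePipeline(), {"test.cu"})` should hold for this chain, and it stops holding as soon as one job is pointed at a file that no earlier job produced.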
I was able to gather compile timings for polybench after running this end to end:
Phase averages (wall seconds, over successful compilations)
| Phase | no-merge avg | merge avg | delta |
|---|---|---|---|
| Frontend+IRGen | 4.197 | 1.553 | -2.645 |
| ISel | 0.019 | 0.019 | -0.000 |
| LLVM-analysis | 0.004 | 0.004 | -0.000 |
| LLVM-passes | 0.139 | 0.135 | -0.005 |
| RegAlloc | 0.001 | 0.001 | -0.000 |
| Total (wall) | 4.254 | 1.660 | -2.594 |
In the frontend and IRGen, we're seeing a 2.6x average speedup on trivial kernels and up to 8x on complex ones like adi. It's possible that the benchmark instrumentation I ran this with took the first cc1 invocation as the source of truth without considering the CIR -> fatbin transformation.
Please take these results with a grain of salt; I'll probably dig deeper into this, as I find it extremely weird.
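For reference, the speedup implied by the table's averages can be recomputed as no-merge time divided by merge time. The sketch below is plain arithmetic on the numbers above; the helper name `speedup` is made up for illustration. It suggests the Frontend+IRGen row alone works out to roughly 2.7x, while the total wall time gives roughly 2.56x, which is where a rounded 2.6x would come from.

```cpp
#include <cassert>
#include <cmath>

// Speedup of the merge pipeline relative to the stock one: baseline / merged.
double speedup(double NoMergeSeconds, double MergeSeconds) {
  assert(MergeSeconds > 0.0 && "merged time must be positive");
  return NoMergeSeconds / MergeSeconds;
}

// Averages copied from the phase table above (wall seconds).
const double FrontendIRGenSpeedup = speedup(4.197, 1.553); // about 2.70x
const double TotalWallSpeedup = speedup(4.254, 1.660);     // about 2.56x
```

These are ratios of averages, so they do not say anything about per-kernel outliers such as the 8x case.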
It looks like the binary contents produced through the merge pipeline are different: we're missing crucial sections and function calls in the final host object, like .nv_fatbin, .nvFatBinSegment, __cudaRegisterFunction, etc. It looks like registration calls are not being emitted on the host! I'm starting to look at what might be the cause.
OK, big thing: it looks like the current fork doesn't have registration (on vars) implemented, unlike upstream. I'll need to rebase; I'll figure this out.
Ping me once you have this done.
99192fe to c95753c
672cab2 to 0e3fd3e
0e3fd3e to 15fd8e0
I had to figure out a way to pass the GPU binary path when we don't consume an AST in the second cc1 invocation, so that we can emit the right runtime code on the host. This patch therefore now depends on #8.
0b4e5e2 to c3fb89f
15fd8e0 to 55cc2a9

We're back with the PRs 😄
This is where the actions from #6 actually get placed into the driver's action graph - the construction that #6 deferred. With --clangir-offload-merge on a CUDA compile, each host/device TU is lowered to serialized CIR, combined into a single cir.offload.container, split back out, and every module then resumes the backend from -x cir (the .cir input path landed in #5).
The driver is a bit dense, so the relevant bits here are:
I added an isCIROffloadMerge() helper since the same gating condition is checked across all three sites.
Here's a better illustration in case the above wasn't clear. (Thanks Claude 😉)