[CIR][CUDA] Add separate CUDA registration pass when consuming cir through CIRGenAction #8
Thanks for digging into the resume-path registration gap. The diagnosis is The core difference: this PR threads a 1. The triple is already on the module:No POD field needed 2.
This sounds very reasonable. We'd still need a separate registration pass that runs when the binary path is passed explicitly on resuming the pipeline. So, following your advice, consuming .cir could look like this (if gpu bin is passed): It is important to note that once the driver boundary RFC is resolved, we could group this registration pass with the post-Target-Lowering passes.
LoweringPrepare currently reaches into the live ASTContext to mangle the C++20 named-module initializer function name. That dependency blocks running the pass on serialized CIR (e.g. the proposed split-compilation flow where host/device CIR is combined, optimized, and split before resuming target-specific lowering in a separate cc1/tool invocation). Precompute the mangled name during CIRGen release() when CXX20ModuleInits is active and stash it on the ModuleOp as cir.cxx_module_init_fn_name. LoweringPrepare now prefers the attribute and only falls back to the AST-based path when it is absent. Part of the broader effort to make post-CIRGen CIR passes AST-free. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> (cherry picked from commit eea6f6f)
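The attribute-first lookup described in this commit can be sketched in isolation. This is a minimal model, not the pass's code: `ToyModule`, `stampModuleInitName`, and `resolveModuleInitName` are hypothetical names, the module is reduced to a map of string attributes, and the AST-based mangler is represented by an optional callback. Only the attribute name `cir.cxx_module_init_fn_name` comes from the commit message.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for a CIR module: just a bag of string attributes.
struct ToyModule {
  std::map<std::string, std::string> attrs;
};

// Attribute name taken from the commit message.
static const char *const kModuleInitAttr = "cir.cxx_module_init_fn_name";

// Models the CIRGen side: stash the precomputed name on the module.
void stampModuleInitName(ToyModule &mod, const std::string &mangledName) {
  assert(!mangledName.empty() && "expected a precomputed name");
  mod.attrs[kModuleInitAttr] = mangledName;
}

// Models the consumer side: prefer the attribute, and only consult the
// AST-based fallback (which may be unavailable) when the attribute is absent.
std::optional<std::string>
resolveModuleInitName(const ToyModule &mod,
                      const std::function<std::string()> &astFallback) {
  auto it = mod.attrs.find(kModuleInitAttr);
  if (it != mod.attrs.end())
    return it->second;
  if (astFallback)
    return astFallback();
  return std::nullopt; // neither source is available
}
```

With this shape, a module reloaded from serialized form resolves the name without any AST, provided the attribute was stamped before serialization.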
LoweringPrepare and any future pass that queries the ASTVarDeclInterface (isLocalVarDecl, TLSKind, isInline, TemplateSpecializationKind) currently requires a live clang::ASTContext: the existing ASTVarDeclAttr stores a raw clang::VarDecl pointer and delegates each query to the AST. That pointer is not serializable and dangles the moment CIR leaves the process, which blocks running the pipeline on reloaded .cir modules -- the exact scenario the proposed CombineCIR/SplitCIR driver actions need. Introduce StaticLocalInfoAttr, a concrete ASTVarDeclInterface implementation whose fields cache the four facts as plain data. Add the cir-materialize-ast-facts pass which walks cir.global operations, reads the AST-backed ASTVarDeclAttr (while the AST is still live), and rewrites $ast to a StaticLocalInfoAttr carrying the snapshot. The pass is registered early in runCIRToCIRPasses, right after canonicalization, so it runs in-process before any step that could serialize CIR. LoweringPrepare's read sites are unchanged: they go through the interface, which now dispatches to either the AST-backed attribute (for tests/tools that skip materialization) or the cached attribute. After this change, post-CIRGen CIR can round-trip through text or other serialization without hitting a stale VarDecl pointer. Part of the broader effort to make post-CIRGen CIR passes AST-free. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> (cherry picked from commit 07b7b13)
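The snapshot idea in this commit (read the four facts while the AST is live, then answer the same interface from plain data) can be illustrated with a self-contained toy. Everything here is an assumption made for illustration: `ToyVarDecl`, `ToyVarDeclFacts`, `AstBackedFacts`, `CachedFacts`, and `materialize` are hypothetical stand-ins, and the TLS and template-specialization kinds are modeled as plain `int`s rather than the real enums.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for clang::VarDecl; the real class is far richer.
struct ToyVarDecl {
  bool localVarDecl;
  int tlsKind;
  bool inlineVar;
  int templateSpecKind;
};

// Toy interface exposing the four queries named in the commit message.
struct ToyVarDeclFacts {
  virtual ~ToyVarDeclFacts() = default;
  virtual bool isLocalVarDecl() const = 0;
  virtual int getTLSKind() const = 0;
  virtual bool isInline() const = 0;
  virtual int getTemplateSpecializationKind() const = 0;
};

// AST-backed flavour: holds a raw pointer and forwards every query, so it
// is only valid while the declaration is alive.
class AstBackedFacts : public ToyVarDeclFacts {
public:
  explicit AstBackedFacts(const ToyVarDecl *d) : decl(d) {
    assert(decl && "AST-backed facts need a live declaration");
  }
  bool isLocalVarDecl() const override { return decl->localVarDecl; }
  int getTLSKind() const override { return decl->tlsKind; }
  bool isInline() const override { return decl->inlineVar; }
  int getTemplateSpecializationKind() const override {
    return decl->templateSpecKind;
  }

private:
  const ToyVarDecl *decl;
};

// Cached flavour: the same four facts stored as plain data.
class CachedFacts : public ToyVarDeclFacts {
public:
  CachedFacts(bool local, int tls, bool inl, int tsk)
      : local(local), tls(tls), inl(inl), tsk(tsk) {}
  bool isLocalVarDecl() const override { return local; }
  int getTLSKind() const override { return tls; }
  bool isInline() const override { return inl; }
  int getTemplateSpecializationKind() const override { return tsk; }

private:
  bool local;
  int tls;
  bool inl;
  int tsk;
};

// Snapshot step: query the source once and return a pointer-free copy.
std::unique_ptr<ToyVarDeclFacts> materialize(const ToyVarDeclFacts &src) {
  return std::make_unique<CachedFacts>(src.isLocalVarDecl(), src.getTLSKind(),
                                       src.isInline(),
                                       src.getTemplateSpecializationKind());
}
```

Callers that only go through `ToyVarDeclFacts` do not need to know which flavour they hold, which mirrors why the read sites stay unchanged.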
…e factory LowerModule's constructor took LangOptions and CodeGenOptions by value but discarded them, leaving only the TargetInfo accessible. The two MissingFeatures markers (lowerModuleLangOpts/lowerModuleCodeGenOpts) flagged this gap. As LoweringPrepare migrates off ASTContext, downstream consumers need to read those facts through LowerModule. Store the options as members, expose getLangOpts/getCodeGenOpts, and add a second createLowerModule overload that takes them from the surrounding cc1 invocation (along with an already-built TargetInfo). The module-only overload remains for callers that only have a parsed CIR module on hand, but its LangOptions/CodeGenOptions stay default-constructed. Drop the now-satisfied MissingFeatures markers and the assert in LowerModule::getCXXABIKind that referenced them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> (cherry picked from commit b4492c9)
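As a rough illustration of the two construction paths described here, the sketch below stores the options as members and offers a module-only factory alongside an invocation-backed one. The `Toy*` types and `createToyLowerModule` are hypothetical; the real `LowerModule`, `LangOptions`, and `CodeGenOptions` have different and much larger interfaces, and the target is reduced to a triple string.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical, heavily reduced option bundles.
struct ToyLangOptions {
  bool cuda = false;
  bool hip = false;
};
struct ToyCodeGenOptions {
  int optLevel = 0;
};

class ToyLowerModule {
public:
  ToyLowerModule(ToyLangOptions lo, ToyCodeGenOptions cgo, std::string tgt)
      : langOpts(lo), codeGenOpts(cgo), triple(std::move(tgt)) {
    assert(!triple.empty() && "a target triple is required");
  }
  // Accessors: the options are kept, not discarded.
  const ToyLangOptions &getLangOpts() const { return langOpts; }
  const ToyCodeGenOptions &getCodeGenOpts() const { return codeGenOpts; }
  const std::string &getTriple() const { return triple; }

private:
  ToyLangOptions langOpts;
  ToyCodeGenOptions codeGenOpts;
  std::string triple;
};

// Module-only path: nothing but the triple is known, so the options stay
// default-constructed.
ToyLowerModule createToyLowerModule(std::string triple) {
  return ToyLowerModule(ToyLangOptions{}, ToyCodeGenOptions{},
                        std::move(triple));
}

// Invocation-backed path: options come from the surrounding invocation.
ToyLowerModule createToyLowerModule(std::string triple, ToyLangOptions lo,
                                    ToyCodeGenOptions cgo) {
  return ToyLowerModule(lo, cgo, std::move(triple));
}
```

The practical consequence is that a consumer reading language facts from a module-only instance sees defaults, so anything that must survive serialization has to be carried some other way.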
LoweringPrepare's only remaining consumer of ASTContext was the bundle of TargetInfo / LangOpts / CXXABI queries it issues for guard-variable sizing, TLS handling, complex-div promotion, CUDA/HIP runtime registration, and the like. None of those facts are AST-derived: they all come from the cc1 invocation itself, and a LowerModule built from that invocation already exposes them in a serializable, AST-free form. Replace the pass's astCtx member with a cir::LowerModule * (constructor: createLoweringPreparePass(LowerModule*, vfs)) and route every former astCtx query through it. The CUDA fatbin VFS, also previously read from ASTContext, becomes an explicit IntrusiveRefCntPtr<vfs::FileSystem> plumbed through the pass options and defaulting to the real filesystem when absent. The C++20 named-module init AST fallback is removed -- PR1 already moved the mangled name to a module-level attribute, and the only tests that bypassed CIRGen now run on the same path as everyone else. runCIRToCIRPasses gains a LowerModule and VFS parameter; CIRGenAction's in-process path constructs the LowerModule from the surrounding CompilerInstance. The .cir cc1 input path added in the previous commit now runs the full target-lowering / cxxabi-lowering / lowering-prepare pipeline before handing the module to CIR-to-LLVM, so emit-obj on a parsed .cir file works end-to-end. IdiomRecognizer is intentionally not on the .cir path: its own header documents an AST dependency and it is opt-in even on the source path. The cc1-cir-input lit test grows an emit-obj run that confirms the new pipeline produces a linkable object with the expected symbol. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> (cherry picked from commit 0356eed)
Extract the CUDA/HIP registration emission out of LoweringPrepare into a shared CUDAModuleRegistrationBuilder and a standalone CUDARegisterModulePass. The builder sources every target/LangOpts fact from a LowerModule (no POD), so the same logic runs in-place during LoweringPrepare and, on a resumed .cir where LoweringPrepare has already run pre-serialization, from the standalone pass. cir-opt builds a module-only LowerModule from cir.triple; the SDK version (the one TargetOptions-backed fact the module-only target lacks) is passed explicitly and left empty there. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
On the merge pipeline the fatbin only exists at the second cc1 invocation (it is produced downstream of the serialized host.cir), so stamp its path as cir.cu.binary_handle and run the registration pass on the .cir resume path, sourcing target facts from a LowerModule built off the invocation. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
0b4e5e2 to c3fb89f
I've stacked my changes on top of the commits in your fork that you pointed me to. Those were not present here, so a cherry-pick and a few minor conflict resolutions were sufficient. There's plenty of LOC here, so I suggest reading it commit by commit; the co-authoring trailers show which ones came from your source. Most notable changes:
Lastly, I'll stack #7 on top of this, using it as the base, and run polybench again.
I opened this to address certain obstacles presented in: #7
In the offload-merge pipeline the host TU is lowered to CIR, serialized, and only resumed to an object through a second -x cir cc1 invocation. CUDA registration (module ctor, __cudaRegisterFatBinary, the embedded .nv_fatbin) is built by LoweringPrepare in the first invocation, but the device fatbin doesn't exist there yet, since it is a downstream product of the host.cir being emitted. As a result, buildCUDAModuleCtor no-ops on the absent CUDABinaryHandleAttr and the object ships without registration.
The resume invocation is the only place the fatbin actually exists (-fcuda-include-gpubinary), yet it never re-ran registration. This PR factors registration out of LoweringPrepare into a standalone pass parameterized on a small POD (CUDARegisterInfo) instead of an ASTContext, and runs it on the -x cir path: stamp the binary handle, build the ctor, then lower. Architecturally, the Non-merge path is unchanged; HIP rides the same pass via isHIP.
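The ordering problem described above can be modeled with a small toy: registration is a no-op while no binary handle is present, and the resume step stamps the handle before running registration. This is only a sketch under stated assumptions: `ToyModule`, `stampBinaryHandle`, `runRegistration`, `resumeFromSerializedCir`, and the emitted name `toy_module_ctor` are hypothetical, and the attribute name `cir.cu.binary_handle` is the one mentioned in the commit message.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a CIR module: string attributes plus the names
// of the functions emitted so far.
struct ToyModule {
  std::map<std::string, std::string> attrs;
  std::vector<std::string> functions;
};

// Attribute name taken from the commit message.
static const char *const kBinaryHandleAttr = "cir.cu.binary_handle";

// Record the fatbin path on the module so registration can find it.
void stampBinaryHandle(ToyModule &mod, const std::string &fatbinPath) {
  assert(!fatbinPath.empty() && "expected a non-empty fatbin path");
  mod.attrs[kBinaryHandleAttr] = fatbinPath;
}

// Toy registration step: emits a placeholder ctor only when a handle is
// present. Returns true if something was emitted.
bool runRegistration(ToyModule &mod) {
  if (mod.attrs.find(kBinaryHandleAttr) == mod.attrs.end())
    return false; // no fatbin known yet: nothing to register
  mod.functions.push_back("toy_module_ctor"); // placeholder name
  return true;
}

// Toy resume path: stamp the handle (if a fatbin was passed), run
// registration, then lower. Lowering is omitted in this sketch.
bool resumeFromSerializedCir(ToyModule &mod,
                             const std::optional<std::string> &gpuBinary) {
  if (gpuBinary)
    stampBinaryHandle(mod, *gpuBinary);
  return runRegistration(mod);
}
```

In this model the first invocation corresponds to calling `runRegistration` on an unstamped module (nothing is emitted), while the resume invocation corresponds to `resumeFromSerializedCir` with a fatbin path, which emits the ctor exactly once.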