
[CIR][CUDA] Add separate CUDA registration pass when consuming cir through CIRGenAction #8

Open
RiverDave wants to merge 6 commits into gsoc/combine-cir from users/riverdave/cir/cuda-register-module-pass


@RiverDave RiverDave commented Jun 25, 2026

Owner

I opened this to address the obstacles raised in #7.

In the offload-merge pipeline the host TU is lowered to CIR, serialized, and only resumed to an object through a second -x cir cc1 invocation. CUDA registration (module ctor, __cudaRegisterFatBinary, the embedded .nv_fatbin) is built by LoweringPrepare in the first invocation. The device fatbin doesn't exist at that point, since it's a downstream product of the host.cir we're emitting, so buildCUDAModuleCtor no-ops on the absent CUDABinaryHandleAttr and the object ships without registration.

The resume invocation is the only place the fatbin actually exists (-fcuda-include-gpubinary), yet it never re-ran registration. This PR factors registration out of LoweringPrepare into a standalone pass parameterized on a small POD (CUDARegisterInfo) instead of an ASTContext, and runs it on the -x cir path: stamp the binary handle, build the ctor, then lower. Architecturally, the non-merge path is unchanged; HIP rides the same pass via isHIP.
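The registration gap can be modelled in a few lines of plain C++. This is a toy sketch, not the CIR API: `ToyModule`, `buildModuleCtor`, `firstInvocation`, and `resumeInvocation` are hypothetical stand-ins for the module op, `buildCUDAModuleCtor`, and the two cc1 invocations, and the optional string plays the role of `CUDABinaryHandleAttr`.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a CIR module: an optional fatbin path (playing the
// role of CUDABinaryHandleAttr) plus the names of functions emitted so far.
struct ToyModule {
  std::optional<std::string> binaryHandle;
  std::vector<std::string> functions;
};

// Without a binary handle there is nothing to register, so the builder
// silently does nothing and reports that via its return value.
bool buildModuleCtor(ToyModule &m) {
  if (!m.binaryHandle)
    return false;
  m.functions.push_back("__cuda_module_ctor"); // illustrative name
  return true;
}

// First invocation: the fatbin does not exist yet, so no handle is stamped
// and the registration attempt is a no-op.
ToyModule firstInvocation() {
  ToyModule m;
  buildModuleCtor(m);
  return m;
}

// Resume invocation: stamp the handle now that the fatbin exists, then
// re-run registration on the reloaded module.
ToyModule resumeInvocation(ToyModule m, const std::string &fatbinPath) {
  m.binaryHandle = fatbinPath;
  buildModuleCtor(m);
  return m;
}
```

Calling `firstInvocation()` yields a module with no ctor, and passing that module to `resumeInvocation` with a fatbin path adds it, which is the ordering this PR restores.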

@RiverDave RiverDave requested a review from koparasy June 25, 2026 20:15
@RiverDave RiverDave marked this pull request as ready for review June 25, 2026 20:15
@RiverDave RiverDave changed the title from "[CIR][CUDA] Add separate CUDA registration pass" to "[CIR][CUDA] Add separate CUDA registration pass when consuming cir through CIRGenAction" Jun 25, 2026
@koparasy

Collaborator

Thanks for digging into the resume-path registration gap. The diagnosis is
spot on. Before this lands I want to flag that we already have a stack that
solves the same root cause (LoweringPrepare can't run on resumed .cir
because it reaches into a live ASTContext) from the other direction, and a
fair amount of this PR overlaps it.

The core difference: this PR threads a CUDARegisterInfo POD into a new
standalone pass, populated twice. Once from astCtx in LoweringPrepare
and once from the CompilerInstance on the -x cir path. Our approach makes
LoweringPrepare's state source itself AST-free and serializable, so the
existing pass runs on the .cir path unchanged. Specific overlaps:

1. The triple is already on the module: no POD field needed

CIRGenModule serializes the triple as cir.triple (CIRGenModule.cpp#L136),
and the CUDABinaryHandleAttr you stamp is already a module attr too. The new
pass ignores cir.triple and the cir-opt fallback hardcodes
"x86_64-unknown-linux-gnu", which is a latent bug. The .cir cc1-input
commit instead honors the cc1 -triple by overriding cir.triple on parse
(CIRGenAction::ExecuteAction, CIRGenAction.cpp#L287-L295).
The charWidth/sizeTypeWidth POD fields are likewise derivable from the
module data layout (you already build cir::CIRDataLayout(mlirModule) a few
lines away to size vars).
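
To make the triple point concrete, here is a minimal sketch of the resolution order in plain C++. It is not the CIRGenAction code: `ToyModuleAttrs` and `resolveTriple` are hypothetical names, and returning an empty result when no triple is available (rather than falling back to a hardcoded one) is an assumption about what a fix could look like.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for module-level attributes such as cir.triple.
using ToyModuleAttrs = std::map<std::string, std::string>;

// Resolve the target triple without a hardcoded fallback:
//  1. an explicit cc1 -triple wins and is written back onto the module,
//  2. otherwise the serialized cir.triple is used,
//  3. otherwise the caller gets nullopt and has to report an error.
std::optional<std::string>
resolveTriple(ToyModuleAttrs &attrs,
              const std::optional<std::string> &cc1Triple) {
  if (cc1Triple && !cc1Triple->empty()) {
    attrs["cir.triple"] = *cc1Triple;
    return cc1Triple;
  }
  auto it = attrs.find("cir.triple");
  if (it != attrs.end())
    return it->second;
  return std::nullopt;
}
```

With this shape a pass only ever reads the triple from the module, so the same code path serves both the in-process and the resumed .cir case.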

2. parseCIRInput + triple override already exist

Your parseCIRInput/stampCUDABinaryHandle/triple-override block is the same
machinery as a061a51265e2
("[CIR] Accept serialized .cir as cc1 input"), which adds a hasCIRSupport
capability flag and a real .cir input path in ExecuteAction
(CIRGenAction.cpp#L243).

3. linkInModules is already extracted into a shared helper

Your linkInModules is the duplicate that
09f730c2ed5a
("[CIR][CodeGen] Extract shared LinkModule struct and loadLinkModules helper")
deletes — it pulls the shared LinkModule/loadLinkModules into
clang/include/clang/CodeGen/ModuleLinker.h so CIR and classic CodeGen share
one copy. On branch
cir/cirgenaction-state-lift.

4. The actual fix: drive LoweringPrepare from LowerModule, not ASTContext

0356eed35e04
("[CIR] Drive LoweringPrepare from LowerModule instead of ASTContext")
replaces the pass's astCtx member with a cir::LowerModule * plus an
injected VFS, and routes every former astCtx query — TLS, guard sizing,
complex-div, and the entire CUDA/HIP registration path — through LowerModule:

LowerModule already stores LangOptions/CodeGenOptions and has an
invocation-aware factory built from the surrounding cc1 invocation
(b4492c9b0f2d).
So instead of a new pass + a hand-threaded POD populated in two places,
LoweringPrepare runs as-is on the .cir path because its state is now
serializable and built once.

Branches

Suggestion

Could we rebase this on that stack? The fatbin-only-exists-in-the-second-invocation
problem you identified is real and the test coverage is valuable,
but most of the plumbing here (parse .cir, link modules, re-derive
triple/widths) is already done AST-free upstream of the registration logic.
Happy to walk through it — cir/lowering-prepare-via-lower-module is the one
to start from.

This was kindly worded by Claude, after I pointed it to the shared branches we had before we started the RFC.

@RiverDave

Owner Author

This sounds very reasonable. We'd still need a separate registration pass that runs when the GPU binary path is passed explicitly on resuming the pipeline.

So, following your advice, consuming .cir could look like this (if a GPU binary is passed):

parseCIRInput                              // ours (PR #5, canonical)
stampCUDABinaryHandle(CI, module)          // ours — from CI.getCodeGenOpts().CudaGpuBinaryFileName
                                           //   (only now does the fatbin exist on disk)
lowerModule = makeLowerModuleFromInvocation(CI, module)   // your helper
pm.addPass(createCUDARegisterModulePass(lowerModule, vfs)) // ours, re-parameterized
pm.run(module)
lowerFromCIRToLLVMIR                        // backend action

Note that once the driver-boundary RFC is resolved, we could group this registration pass with the passes that run after target lowering.
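
The ordering in the outline above can be sketched as a self-contained toy in plain C++. None of this is the MLIR pass manager or the real helpers: `ToyResumeModule`, `ToyPass`, `buildResumePipeline`, `runResume`, and the step names written to the log are all hypothetical, chosen only to show that stamping precedes registration and that both are skipped when no GPU binary is passed.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for the reloaded module: it records which steps ran.
struct ToyResumeModule {
  std::string binaryHandle;     // empty means "no fatbin stamped"
  std::vector<std::string> log; // steps applied, in order
};

using ToyPass = std::function<void(ToyResumeModule &)>;

// Build the resume-path step list. Stamping and registration are scheduled
// only when a GPU binary path was supplied; lowering always runs last.
std::vector<ToyPass> buildResumePipeline(const std::string &gpuBinaryPath) {
  std::vector<ToyPass> steps;
  if (!gpuBinaryPath.empty()) {
    steps.push_back([gpuBinaryPath](ToyResumeModule &m) {
      m.binaryHandle = gpuBinaryPath;
      m.log.push_back("stamp-binary-handle");
    });
    steps.push_back([](ToyResumeModule &m) {
      // Registration depends on the handle stamped by the previous step.
      if (!m.binaryHandle.empty())
        m.log.push_back("cuda-register-module");
    });
  }
  steps.push_back([](ToyResumeModule &m) { m.log.push_back("lower-to-llvm"); });
  return steps;
}

// Parse, run the scheduled steps in order, and return the resulting module.
ToyResumeModule runResume(const std::string &gpuBinaryPath) {
  ToyResumeModule m;
  m.log.push_back("parse-cir");
  for (const ToyPass &step : buildResumePipeline(gpuBinaryPath))
    step(m);
  return m;
}
```

Keeping registration as its own step is what would let it move later in the pipeline without touching the stamping logic.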

koparasy and others added 6 commits June 26, 2026 13:44
LoweringPrepare currently reaches into the live ASTContext to mangle the
C++20 named-module initializer function name. That dependency blocks
running the pass on serialized CIR (e.g. the proposed split-compilation
flow where host/device CIR is combined, optimized, and split before
resuming target-specific lowering in a separate cc1/tool invocation).

Precompute the mangled name during CIRGen release() when CXX20ModuleInits
is active and stash it on the ModuleOp as cir.cxx_module_init_fn_name.
LoweringPrepare now prefers the attribute and only falls back to the
AST-based path when it is absent.

Part of the broader effort to make post-CIRGen CIR passes AST-free.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit eea6f6f)

LoweringPrepare and any future pass that queries the ASTVarDeclInterface
(isLocalVarDecl, TLSKind, isInline, TemplateSpecializationKind) currently
requires a live clang::ASTContext: the existing ASTVarDeclAttr stores a
raw clang::VarDecl pointer and delegates each query to the AST. That
pointer is not serializable and dangles the moment CIR leaves the
process, which blocks running the pipeline on reloaded .cir modules --
the exact scenario the proposed CombineCIR/SplitCIR driver actions need.

Introduce StaticLocalInfoAttr, a concrete ASTVarDeclInterface
implementation whose fields cache the four facts as plain data. Add the
cir-materialize-ast-facts pass which walks cir.global operations,
reads the AST-backed ASTVarDeclAttr (while the AST is still live), and
rewrites $ast to a StaticLocalInfoAttr carrying the snapshot. The pass
is registered early in runCIRToCIRPasses, right after canonicalization,
so it runs in-process before any step that could serialize CIR.

LoweringPrepare's read sites are unchanged: they go through the
interface, which now dispatches to either the AST-backed attribute (for
tests/tools that skip materialization) or the cached attribute. After
this change, post-CIRGen CIR can round-trip through text or other
serialization without hitting a stale VarDecl pointer.
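
A minimal sketch of the snapshot idea, in plain C++ rather than MLIR attributes. Every name here (`ToyVarDecl`, `ToyVarDeclInterface`, `ToyASTBackedAttr`, `ToyStaticLocalInfo`, `materialize`) is hypothetical, and the TLS and template-specialization kinds are reduced to plain ints for illustration.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a clang::VarDecl; only valid while the AST lives.
struct ToyVarDecl {
  bool localVarDecl;
  bool inlineVar;
  int tlsKind;
  int templateSpecKind;
};

// The interface that read sites use; they do not care which class answers.
struct ToyVarDeclInterface {
  virtual ~ToyVarDeclInterface() = default;
  virtual bool isLocalVarDecl() const = 0;
  virtual bool isInline() const = 0;
  virtual int getTLSKind() const = 0;
  virtual int getTemplateSpecKind() const = 0;
};

// AST-backed: holds a raw pointer and delegates each query, so it cannot be
// serialized and dangles once the declaration is gone.
struct ToyASTBackedAttr : ToyVarDeclInterface {
  const ToyVarDecl *decl;
  explicit ToyASTBackedAttr(const ToyVarDecl *d) : decl(d) {}
  bool isLocalVarDecl() const override { return decl->localVarDecl; }
  bool isInline() const override { return decl->inlineVar; }
  int getTLSKind() const override { return decl->tlsKind; }
  int getTemplateSpecKind() const override { return decl->templateSpecKind; }
};

// Cached: the four facts copied out as plain data, safe to keep or serialize.
struct ToyStaticLocalInfo : ToyVarDeclInterface {
  bool localVarDecl, inlineVar;
  int tlsKind, templateSpecKind;
  ToyStaticLocalInfo(bool local, bool inl, int tls, int tsk)
      : localVarDecl(local), inlineVar(inl), tlsKind(tls),
        templateSpecKind(tsk) {}
  bool isLocalVarDecl() const override { return localVarDecl; }
  bool isInline() const override { return inlineVar; }
  int getTLSKind() const override { return tlsKind; }
  int getTemplateSpecKind() const override { return templateSpecKind; }
};

// Materialization: query the source once, while it is still valid, and
// return a pointer-free snapshot behind the same interface.
std::unique_ptr<ToyVarDeclInterface>
materialize(const ToyVarDeclInterface &src) {
  return std::make_unique<ToyStaticLocalInfo>(
      src.isLocalVarDecl(), src.isInline(), src.getTLSKind(),
      src.getTemplateSpecKind());
}
```

The snapshot returned by `materialize` stays usable after the original declaration has been destroyed, which is the property the round-trip relies on.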

Part of the broader effort to make post-CIRGen CIR passes AST-free.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit 07b7b13)

…e factory

LowerModule's constructor took LangOptions and CodeGenOptions by value but
discarded them, leaving only the TargetInfo accessible. The two
MissingFeatures markers (lowerModuleLangOpts/lowerModuleCodeGenOpts) flagged
this gap. As LoweringPrepare migrates off ASTContext, downstream consumers
need to read those facts through LowerModule.

Store the options as members, expose getLangOpts/getCodeGenOpts, and add a
second createLowerModule overload that takes them from the surrounding cc1
invocation (along with an already-built TargetInfo). The module-only
overload remains for callers that only have a parsed CIR module on hand,
but its LangOptions/CodeGenOptions stay default-constructed.

Drop the now-satisfied MissingFeatures markers and the assert in
LowerModule::getCXXABIKind that referenced them.
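
The two-factory shape described above can be sketched in plain C++. This is an illustration only: `ToyLangOptions`, `ToyCodeGenOptions`, `ToyLowerModule`, and `createToyLowerModule` are hypothetical, heavily reduced stand-ins, and the real signatures (including the TargetInfo parameter) are not reproduced here.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical, heavily reduced stand-ins for the option structs.
struct ToyLangOptions {
  bool cuda = false;
  bool hip = false;
};
struct ToyCodeGenOptions {
  std::string cudaGpuBinaryFileName;
};

// Stores the options instead of discarding them, and exposes read accessors.
class ToyLowerModule {
public:
  ToyLowerModule(std::string tripleStr, ToyLangOptions lo,
                 ToyCodeGenOptions cgo)
      : triple(std::move(tripleStr)), langOpts(std::move(lo)),
        codeGenOpts(std::move(cgo)) {}
  const std::string &getTriple() const { return triple; }
  const ToyLangOptions &getLangOpts() const { return langOpts; }
  const ToyCodeGenOptions &getCodeGenOpts() const { return codeGenOpts; }

private:
  std::string triple;
  ToyLangOptions langOpts;
  ToyCodeGenOptions codeGenOpts;
};

// Module-only factory: only the triple is known, options stay defaulted.
ToyLowerModule createToyLowerModule(const std::string &moduleTriple) {
  return ToyLowerModule(moduleTriple, ToyLangOptions{}, ToyCodeGenOptions{});
}

// Invocation-aware factory: options come from the surrounding invocation.
ToyLowerModule createToyLowerModule(const std::string &moduleTriple,
                                    const ToyLangOptions &lo,
                                    const ToyCodeGenOptions &cgo) {
  return ToyLowerModule(moduleTriple, lo, cgo);
}
```

A consumer that needs language or codegen facts reads them through the accessors, so it works the same whichever factory built the object.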

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit b4492c9)

LoweringPrepare's only remaining consumer of ASTContext was the bundle of
TargetInfo / LangOpts / CXXABI queries it issues for guard-variable sizing,
TLS handling, complex-div promotion, CUDA/HIP runtime registration, and
the like. None of those facts are AST-derived: they all come from the
cc1 invocation itself, and a LowerModule built from that invocation
already exposes them in a serializable, AST-free form.

Replace the pass's astCtx member with a cir::LowerModule * (constructor:
createLoweringPreparePass(LowerModule*, vfs)) and route every former
astCtx query through it. The CUDA fatbin VFS, also previously read from
ASTContext, becomes an explicit IntrusiveRefCntPtr<vfs::FileSystem>
plumbed through the pass options and defaulting to the real filesystem
when absent. The C++20 named-module init AST fallback is removed -- PR1
already moved the mangled name to a module-level attribute, and the only
tests that bypassed CIRGen now run on the same path as everyone else.

runCIRToCIRPasses gains a LowerModule and VFS parameter; CIRGenAction's
in-process path constructs the LowerModule from the surrounding
CompilerInstance. The .cir cc1 input path added in the previous commit
now runs the full target-lowering / cxxabi-lowering / lowering-prepare
pipeline before handing the module to CIR-to-LLVM, so emit-obj on a
parsed .cir file works end-to-end. IdiomRecognizer is intentionally not
on the .cir path: its own header documents an AST dependency and it is
opt-in even on the source path.

The cc1-cir-input lit test grows an emit-obj run that confirms the new
pipeline produces a linkable object with the expected symbol.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit 0356eed)

Extract the CUDA/HIP registration emission out of LoweringPrepare into a
shared CUDAModuleRegistrationBuilder and a standalone CUDARegisterModulePass.
The builder sources every target/LangOpts fact from a LowerModule (no POD), so
the same logic runs in-place during LoweringPrepare and, on a resumed .cir
where LoweringPrepare has already run pre-serialization, from the standalone
pass. cir-opt builds a module-only LowerModule from cir.triple; the SDK version
(the one TargetOptions-backed fact the module-only target lacks) is passed
explicitly and left empty there.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

On the merge pipeline the fatbin only exists at the second cc1 invocation
(it is produced downstream of the serialized host.cir), so stamp its path as
cir.cu.binary_handle and run the registration pass on the .cir resume path,
sourcing target facts from a LowerModule built off the invocation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@RiverDave RiverDave force-pushed the users/riverdave/cir/cuda-register-module-pass branch from 0b4e5e2 to c3fb89f Compare June 26, 2026 18:53
@RiverDave

RiverDave commented Jun 26, 2026

Owner Author

@koparasy

I've stacked my changes on top of the commits from your fork that you pointed to. They weren't present here, so a cherry-pick plus a few minor conflict resolutions was enough. There's a lot of LOC, so I suggest reading it commit by commit; the co-author trailers show which commits came from your source.

Most notable changes:

  1. POD dropped -> LowerModule-sourced builder.
  2. We did not take 0356eed's resume-runs-LoweringPrepare wiring - our boundary is serialize-post-LoweringPrepare, so the registration pass is scheduled standalone instead.

Lastly, I'll stack #7 on top of this, using it as the base, and run PolyBench again.
