[CIR][CUDA] Add separate CUDA registration pass when consuming cir through CIRGenAction by RiverDave · Pull Request #8 · RiverDave/llvm-project

RiverDave · 2026-06-25T04:49:26Z

I opened this to address certain obstacles presented in: #7

In the offload-merge pipeline the host TU is lowered to CIR, serialized, and only resumed to an object through a second -x cir cc1 invocation. CUDA registration (module ctor, __cudaRegisterFatBinary, the embedded .nv_fatbin) is built by LoweringPrepare in the first invocation - but the device fatbin doesn't exist yet there, since it's a downstream product of the host.cir we're emitting, so buildCUDAModuleCtor no-ops on the absent CUDABinaryHandleAttr and the object ships without registration.

The resume invocation is the only place the fatbin actually exists (-fcuda-include-gpubinary), yet it never re-ran registration. This PR factors registration out of LoweringPrepare into a standalone pass parameterized on a small POD (CUDARegisterInfo) instead of an ASTContext, and runs it on the -x cir path: stamp the binary handle, build the ctor, then lower. The non-merge path is architecturally unchanged; HIP goes through the same pass via isHIP.
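To make the gap concrete, here is a minimal toy model of the two invocations. It is only a sketch: ToyModule, ToyRegisterInfo, buildModuleCtor and resumeWithFatbin are hypothetical stand-ins, not the real CIR types or the pass in this PR, and the ctor names are illustrative strings.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Toy stand-in for a CIR module; NOT the real mlir::ModuleOp.
struct ToyModule {
  std::optional<std::string> cudaBinaryHandle; // models CUDABinaryHandleAttr
  std::vector<std::string> globalCtors;        // models emitted module ctors
};

// Toy stand-in for the small set of facts registration needs.
struct ToyRegisterInfo {
  bool isHIP = false;
};

// Models the no-op described above: without a binary handle there is
// nothing to register, so the module is left untouched.
bool buildModuleCtor(ToyModule &m, const ToyRegisterInfo &info) {
  if (!m.cudaBinaryHandle)
    return false;
  m.globalCtors.push_back(info.isHIP ? "__hip_module_ctor"
                                     : "__cuda_module_ctor");
  return true;
}

// Models the resume path: stamp the handle first, then build the ctor.
bool resumeWithFatbin(ToyModule &m, const std::string &fatbinPath,
                      const ToyRegisterInfo &info) {
  assert(!fatbinPath.empty() && "resume path expects a fatbin path");
  m.cudaBinaryHandle = fatbinPath;
  return buildModuleCtor(m, info);
}
```

In this model the first call to buildModuleCtor on a fresh ToyModule returns false and emits nothing, which mirrors the object shipping without registration; only resumeWithFatbin produces a ctor.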

koparasy · 2026-06-26T15:14:03Z

Thanks for digging into the resume-path registration gap. The diagnosis is
spot on. Before this lands I want to flag that we already have a stack that
solves the same root cause (LoweringPrepare can't run on resumed .cir
because it reaches into a live ASTContext) from the other direction, and a
fair amount of this PR overlaps it.

The core difference: this PR threads a CUDARegisterInfo POD into a new
standalone pass, populated twice. Once from astCtx in LoweringPrepare
and once from the CompilerInstance on the -x cir path. Our approach makes
LoweringPrepare's state source itself AST-free and serializable, so the
existing pass runs on the .cir path unchanged. Specific overlaps:

1. The triple is already on the module:

No POD field is needed. CIRGenModule serializes the triple as cir.triple
(CIRGenModule.cpp#L136), and the CUDABinaryHandleAttr you stamp is already a
module attr too. The new pass ignores cir.triple, and the cir-opt fallback
hardcodes "x86_64-unknown-linux-gnu", which is a latent bug. The .cir
cc1-input commit instead honors the cc1 -triple by overriding cir.triple on
parse (CIRGenAction::ExecuteAction, CIRGenAction.cpp#L287-L295).
The charWidth/sizeTypeWidth POD fields are likewise derivable from the
module data layout (you already build cir::CIRDataLayout(mlirModule) a few
lines away to size vars).
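A rough sketch of the triple handling described in this point, assuming a toy attribute holder rather than the real module: ToyModuleAttrs and resolveTriple are hypothetical names, and the real override lives in CIRGenAction::ExecuteAction.

```cpp
#include <optional>
#include <string>

// Toy stand-in for the module attributes; NOT the real CIR module.
struct ToyModuleAttrs {
  std::optional<std::string> triple; // models the cir.triple attribute
};

// An explicit cc1 -triple overrides the serialized one on parse; otherwise
// the serialized triple is kept. There is deliberately no hardcoded default:
// a missing triple is reported to the caller as an empty optional.
std::optional<std::string>
resolveTriple(ToyModuleAttrs &attrs,
              const std::optional<std::string> &cc1Triple) {
  if (cc1Triple && !cc1Triple->empty())
    attrs.triple = *cc1Triple;
  return attrs.triple;
}
```

Returning an empty optional instead of a fixed string keeps a missing triple visible to the caller, which is the failure mode the hardcoded fallback hides.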

2. `parseCIRInput` + triple override already exist

Your parseCIRInput/stampCUDABinaryHandle/triple-override block is the same
machinery as a061a51265e2
("[CIR] Accept serialized .cir as cc1 input"), which adds a hasCIRSupport
capability flag and a real .cir input path in ExecuteAction
(CIRGenAction.cpp#L243).

3. `linkInModules` is already extracted into a shared helper

Your linkInModules is the duplicate that 09f730c2ed5a
("[CIR][CodeGen] Extract shared LinkModule struct and loadLinkModules helper")
deletes: it pulls the shared LinkModule/loadLinkModules into
clang/include/clang/CodeGen/ModuleLinker.h so CIR and classic CodeGen share
one copy. It lives on branch cir/cirgenaction-state-lift.

4. The actual fix: drive LoweringPrepare from `LowerModule`, not `ASTContext`

0356eed35e04
("[CIR] Drive LoweringPrepare from LowerModule instead of ASTContext")
replaces the pass's astCtx member with a cir::LowerModule * plus an
injected VFS, and routes every former astCtx query — TLS, guard sizing,
complex-div, and the entire CUDA/HIP registration path — through LowerModule:

isHIP → lowerModule->getLangOpts().HIP (LoweringPrepare.cpp#L2310)
GPURelocatableDeviceCode → #L2312; sizeType width → #L1891
fatbin VFS → explicit IntrusiveRefCntPtr<vfs::FileSystem>, defaulting to the real FS (#L2336)

LowerModule already stores LangOptions/CodeGenOptions and has an
invocation-aware factory built from the surrounding cc1 invocation
(b4492c9b0f2d).
So instead of a new pass + a hand-threaded POD populated in two places,
LoweringPrepare runs as-is on the .cir path because its state is now
serializable and built once.

Branches

cir/ast-free-lowering-prepare — materialize AST facts as module attrs + accept .cir as cc1 input
cir/cirgenaction-state-lift — shared LinkModule helper + lift link state into CIRGenAction
cir/lowering-prepare-via-lower-module — drive LoweringPrepare from LowerModule (start here)

Suggestion

Could we rebase this on that stack? The
fatbin-only-exists-in-the-second-invocation problem you identified is real
and the test coverage is valuable,
but most of the plumbing here (parse .cir, link modules, re-derive
triple/widths) is already done AST-free upstream of the registration logic.
Happy to walk through it — cir/lowering-prepare-via-lower-module is the one
to start from.

This was kindly worded by Claude once I pointed it to the shared branches we had before we started the RFC.

RiverDave · 2026-06-26T17:31:52Z

This sounds very reasonable. We'd still need a separate registration pass that runs when the GPU binary path is passed explicitly on resuming the pipeline.

So, following your advice, consuming .cir could look like this (if a GPU binary is passed):

parseCIRInput                              // ours (PR #5, canonical)
stampCUDABinaryHandle(CI, module)          // ours — from CI.getCodeGenOpts().CudaGpuBinaryFileName
                                           //   (only now does the fatbin exist on disk)
lowerModule = makeLowerModuleFromInvocation(CI, module)   // your helper
pm.addPass(createCUDARegisterModulePass(lowerModule, vfs)) // ours, re-parameterized
pm.run(module)
lowerFromCIRToLLVMIR                        // backend action

Note that once the driver boundary RFC is resolved, we could move this registration pass so that it is grouped with the post Target-Lowering passes.

LoweringPrepare currently reaches into the live ASTContext to mangle the C++20 named-module initializer function name. That dependency blocks running the pass on serialized CIR (e.g. the proposed split-compilation flow where host/device CIR is combined, optimized, and split before resuming target-specific lowering in a separate cc1/tool invocation).

Precompute the mangled name during CIRGen release() when CXX20ModuleInits is active and stash it on the ModuleOp as cir.cxx_module_init_fn_name. LoweringPrepare now prefers the attribute and only falls back to the AST-based path when it is absent.

Part of the broader effort to make post-CIRGen CIR passes AST-free.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit eea6f6f)
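The prefer-the-attribute-then-fall-back lookup in this commit can be sketched as follows. ToyModuleOp and getModuleInitFnName are hypothetical names, the callback stands in for the AST-based mangling path, and any name passed in is just an example string.

```cpp
#include <functional>
#include <optional>
#include <string>

// Toy stand-in for the ModuleOp; the field models the
// cir.cxx_module_init_fn_name attribute.
struct ToyModuleOp {
  std::optional<std::string> cxxModuleInitFnName;
};

// Prefer the precomputed attribute; only consult the AST-based path
// (modelled here as a callback) when the attribute is absent.
std::string
getModuleInitFnName(const ToyModuleOp &module,
                    const std::function<std::string()> &mangleViaAST) {
  if (module.cxxModuleInitFnName)
    return *module.cxxModuleInitFnName;
  return mangleViaAST();
}
```

On a reloaded .cir module there is no AST to call back into, so in this model the attribute has to be present for the lookup to succeed without the fallback.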

LoweringPrepare and any future pass that queries the ASTVarDeclInterface (isLocalVarDecl, TLSKind, isInline, TemplateSpecializationKind) currently requires a live clang::ASTContext: the existing ASTVarDeclAttr stores a raw clang::VarDecl pointer and delegates each query to the AST. That pointer is not serializable and dangles the moment CIR leaves the process, which blocks running the pipeline on reloaded .cir modules -- the exact scenario the proposed CombineCIR/SplitCIR driver actions need.

Introduce StaticLocalInfoAttr, a concrete ASTVarDeclInterface implementation whose fields cache the four facts as plain data. Add the cir-materialize-ast-facts pass which walks cir.global operations, reads the AST-backed ASTVarDeclAttr (while the AST is still live), and rewrites $ast to a StaticLocalInfoAttr carrying the snapshot. The pass is registered early in runCIRToCIRPasses, right after canonicalization, so it runs in-process before any step that could serialize CIR.

LoweringPrepare's read sites are unchanged: they go through the interface, which now dispatches to either the AST-backed attribute (for tests/tools that skip materialization) or the cached attribute. After this change, post-CIRGen CIR can round-trip through text or other serialization without hitting a stale VarDecl pointer.

Part of the broader effort to make post-CIRGen CIR passes AST-free.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit 07b7b13)
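The snapshot idea in this commit can be illustrated with a small interface that has two implementations, one backed by a live declaration and one holding plain data. All Toy* names and the materialize function are hypothetical; this is a sketch of the pattern, not the real ASTVarDeclInterface or StaticLocalInfoAttr.

```cpp
#include <memory>

enum class ToyTLSKind { None, Static, Dynamic };
enum class ToyTemplateKind { Undeclared, ImplicitInstantiation };

// Toy stand-in for clang::VarDecl.
struct ToyVarDecl {
  bool localVar;
  ToyTLSKind tls;
  bool inlineVar;
  ToyTemplateKind templateKind;
};

// Models the interface that read sites query.
struct ToyVarDeclInterface {
  virtual ~ToyVarDeclInterface() = default;
  virtual bool isLocalVarDecl() const = 0;
  virtual ToyTLSKind getTLSKind() const = 0;
  virtual bool isInline() const = 0;
  virtual ToyTemplateKind getTemplateKind() const = 0;
};

// AST-backed: every query dereferences a raw pointer, so it is only valid
// while the declaration is alive and cannot be serialized.
struct ToyASTBackedAttr final : ToyVarDeclInterface {
  explicit ToyASTBackedAttr(const ToyVarDecl *d) : decl(d) {}
  bool isLocalVarDecl() const override { return decl->localVar; }
  ToyTLSKind getTLSKind() const override { return decl->tls; }
  bool isInline() const override { return decl->inlineVar; }
  ToyTemplateKind getTemplateKind() const override {
    return decl->templateKind;
  }
  const ToyVarDecl *decl;
};

// Cached: the four facts are stored as plain data.
struct ToyStaticLocalInfo final : ToyVarDeclInterface {
  ToyStaticLocalInfo(bool l, ToyTLSKind t, bool i, ToyTemplateKind k)
      : localVar(l), tls(t), inlineVar(i), templateKind(k) {}
  bool isLocalVarDecl() const override { return localVar; }
  ToyTLSKind getTLSKind() const override { return tls; }
  bool isInline() const override { return inlineVar; }
  ToyTemplateKind getTemplateKind() const override { return templateKind; }
  bool localVar;
  ToyTLSKind tls;
  bool inlineVar;
  ToyTemplateKind templateKind;
};

// Models the materialization step: read through the interface while the
// source is still valid and return a self-contained snapshot.
std::unique_ptr<ToyVarDeclInterface>
materialize(const ToyVarDeclInterface &src) {
  return std::make_unique<ToyStaticLocalInfo>(
      src.isLocalVarDecl(), src.getTLSKind(), src.isInline(),
      src.getTemplateKind());
}
```

Because callers only see ToyVarDeclInterface, the read sites do not change when the backing implementation is swapped, which is the property the commit message relies on.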

…e factory

LowerModule's constructor took LangOptions and CodeGenOptions by value but discarded them, leaving only the TargetInfo accessible. The two MissingFeatures markers (lowerModuleLangOpts/lowerModuleCodeGenOpts) flagged this gap. As LoweringPrepare migrates off ASTContext, downstream consumers need to read those facts through LowerModule.

Store the options as members, expose getLangOpts/getCodeGenOpts, and add a second createLowerModule overload that takes them from the surrounding cc1 invocation (along with an already-built TargetInfo). The module-only overload remains for callers that only have a parsed CIR module on hand, but its LangOptions/CodeGenOptions stay default-constructed.

Drop the now-satisfied MissingFeatures markers and the assert in LowerModule::getCXXABIKind that referenced them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit b4492c9)

LoweringPrepare's only remaining consumer of ASTContext was the bundle of TargetInfo / LangOpts / CXXABI queries it issues for guard-variable sizing, TLS handling, complex-div promotion, CUDA/HIP runtime registration, and the like. None of those facts are AST-derived: they all come from the cc1 invocation itself, and a LowerModule built from that invocation already exposes them in a serializable, AST-free form.

Replace the pass's astCtx member with a cir::LowerModule * (constructor: createLoweringPreparePass(LowerModule*, vfs)) and route every former astCtx query through it. The CUDA fatbin VFS, also previously read from ASTContext, becomes an explicit IntrusiveRefCntPtr<vfs::FileSystem> plumbed through the pass options and defaulting to the real filesystem when absent.

The C++20 named-module init AST fallback is removed -- PR1 already moved the mangled name to a module-level attribute, and the only tests that bypassed CIRGen now run on the same path as everyone else.

runCIRToCIRPasses gains a LowerModule and VFS parameter; CIRGenAction's in-process path constructs the LowerModule from the surrounding CompilerInstance. The .cir cc1 input path added in the previous commit now runs the full target-lowering / cxxabi-lowering / lowering-prepare pipeline before handing the module to CIR-to-LLVM, so emit-obj on a parsed .cir file works end-to-end. IdiomRecognizer is intentionally not on the .cir path: its own header documents an AST dependency and it is opt-in even on the source path.

The cc1-cir-input lit test grows an emit-obj run that confirms the new pipeline produces a linkable object with the expected symbol.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit 0356eed)
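As a sketch of the shape this commit describes (a pass that reads options from a LowerModule-like object and takes an injected file reader that defaults to the real filesystem), consider the toy below. Every Toy* name and the "kind:size" result are hypothetical; the real pass uses cir::LowerModule and an IntrusiveRefCntPtr<vfs::FileSystem>, which are not modelled here.

```cpp
#include <fstream>
#include <functional>
#include <iterator>
#include <optional>
#include <string>
#include <utility>

// Toy stand-ins for the option bundles a LowerModule would expose.
struct ToyLangOpts {
  bool HIP = false;
  bool GPURelocatableDeviceCode = false;
};

struct ToyLowerModule {
  ToyLangOpts langOpts;
  const ToyLangOpts &getLangOpts() const { return langOpts; }
};

// Models the injected VFS as a function from path to file contents.
using ToyFileReader =
    std::function<std::optional<std::string>(const std::string &)>;

// Default reader: the real filesystem.
std::optional<std::string> readFromRealFS(const std::string &path) {
  std::ifstream in(path, std::ios::binary);
  if (!in)
    return std::nullopt;
  return std::string(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>());
}

class ToyRegistrationPass {
public:
  // No AST handle: all facts come from the lower module and the reader.
  explicit ToyRegistrationPass(const ToyLowerModule *lm,
                               ToyFileReader reader = {})
      : lowerModule(lm),
        readFile(reader ? std::move(reader) : ToyFileReader(readFromRealFS)) {}

  // Returns "<runtime>:<fatbin size>" or nothing if the fatbin is missing.
  std::optional<std::string> run(const std::string &fatbinPath) const {
    std::optional<std::string> bytes = readFile(fatbinPath);
    if (!bytes)
      return std::nullopt;
    const char *runtime = lowerModule->getLangOpts().HIP ? "hip" : "cuda";
    return std::string(runtime) + ":" + std::to_string(bytes->size());
  }

private:
  const ToyLowerModule *lowerModule;
  ToyFileReader readFile;
};
```

Injecting the reader is what lets a test supply an in-memory fatbin, while ordinary callers can omit it and get the real filesystem.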

Extract the CUDA/HIP registration emission out of LoweringPrepare into a shared CUDAModuleRegistrationBuilder and a standalone CUDARegisterModulePass. The builder sources every target/LangOpts fact from a LowerModule (no POD), so the same logic runs in-place during LoweringPrepare and, on a resumed .cir where LoweringPrepare has already run pre-serialization, from the standalone pass. cir-opt builds a module-only LowerModule from cir.triple; the SDK version (the one TargetOptions-backed fact the module-only target lacks) is passed explicitly and left empty there.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

On the merge pipeline the fatbin only exists at the second cc1 invocation (it is produced downstream of the serialized host.cir), so stamp its path as cir.cu.binary_handle and run the registration pass on the .cir resume path, sourcing target facts from a LowerModule built off the invocation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RiverDave · 2026-06-26T19:00:06Z

@koparasy

I've stacked my changes on top of the commits from your fork that you pointed out. They were not present here, so a cherry-pick plus a few minor conflict resolutions was sufficient. There's a lot of LOC here, so I suggest reading it commit by commit; the co-author trailers show which commits came from your source.

Most notable changes:

POD dropped -> LowerModule-sourced builder.
We did not take 0356eed's resume-runs-LoweringPrepare wiring - our boundary is serialize-post-LoweringPrepare, so the registration pass is scheduled standalone instead.

Lastly, I'll stack #7 on top of this as its base and run polybench again.

RiverDave requested a review from koparasy June 25, 2026 20:15

RiverDave marked this pull request as ready for review June 25, 2026 20:15

RiverDave changed the title ~~[CIR][CUDA] Add separate CUDA registration pass~~ [CIR][CUDA] Add separate CUDA registration pass when consuming cir through CIRGenAction Jun 25, 2026

RiverDave mentioned this pull request Jun 26, 2026

[CIR] Construct the offload merge driver pipeline for CUDA (single arch) #7

Open

koparasy and others added 6 commits June 26, 2026 13:44

RiverDave force-pushed the users/riverdave/cir/cuda-register-module-pass branch from 0b4e5e2 to c3fb89f Compare June 26, 2026 18:53

RiverDave wants to merge 6 commits into gsoc/combine-cir from users/riverdave/cir/cuda-register-module-pass