[OPUS] gfx942 a16w16 bf16 GEMM splitK and WKC pipeline family#3594
Open
yifehuan wants to merge 3 commits into
Open
[OPUS] gfx942 a16w16 bf16 GEMM splitK and WKC pipeline family#3594yifehuan wants to merge 3 commits into
yifehuan wants to merge 3 commits into
Conversation
Contributor
🏷️ CI GuideRuns automatically on every PR:
Extended tests (opt-in via labels):
|
da15623 to
bdc12b7
Compare
bdc12b7 to
ea63649
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
This PR improves OPUS coverage and performance for gfx942 a16w16 bf16 GEMM shapes used by the DSV4 bf16 workload.
The main goals are:
The branch contains three commits:
aa3cc29b3 feat(opus): add gfx942 EM3EN4 LDS1 PGR2 splitK path605f18d4c feat(opus): add gfx942 WKC paths and reduce cleanupda15623d6 perf(opus): add gfx942 bf16 workspace splitK pathTechnical Details
This PR adds a gfx942 OPUS pipeline family for bf16 GEMM and related codegen/config plumbing.
1. EM3EN4 LDS1/PGR2 splitK path
Adds a new gfx942 splitK pipeline:
opus_gemm_pipeline_a16w16_em3en4_lds1_pgr2_sk.cuh50204:a16w16_em3en4_lds1_pgr2_skThis path targets a 128x96 host/output tiling shape while using a physical 96x128 device-side orientation. It uses:
The intent is to improve coverage for mid/large DSV4 bf16 shapes where the previous 50200/50203-style splitK paths had weaker tiling or staging behavior.
2. Wave-K cooperative gfx942 WKC family
Adds a new gfx942 wave-K cooperative pipeline:
opus_gemm_pipeline_a16w16_wave_k_coop.cuhNew tuned kids:
50300: WKC 16x16, BK6450301: WKC 16x32, BK3250302: WKC 32x16, BK6450303: WKC 32x32, BK64This family is designed for skinny/small-M or skinny/small-N bf16 GEMM shapes where deterministic in-kernel wave cooperation is more competitive than the older splitK +
separate reduce-launch path.
The WKC pipeline uses:
The old fused-reduce pipeline was removed because the deterministic fused splitK structure depended on expensive cross-wave/workgroup synchronization and did not show a
viable win path for the target shapes.
3. SplitK reduce cleanup
The reduce path was cleaned up and moved toward gfx942-specific generated support.
Notable changes include:
codegen/gen_instances_gfx942.pyThis keeps the fast reduce support scoped to the gfx942 cases that need it while avoiding unnecessary changes to unrelated architectures.
4. bf16 workspace splitK path
Adds a bf16 workspace variant for legacy splitK:
50206:splitk_legacy_bf16ws_512x128x128x64The splitK workspace dtype is now threaded through the gfx942 traits/codegen path, allowing splitK main kernels and reduce kernels to use either fp32 or bf16 workspace
storage depending on the selected kid.
The bf16 workspace path is gated to bf16 output, since the target production use case is bf16 input/output GEMM. This avoids silently using bf16 intermediate storage for
output types where fp32 workspace may still be required for accuracy or semantics.
This reduces workspace bandwidth and improves performance on several larger bf16 DSV4 shapes, especially cases such as:
512x256x4096512x512x4096256x512x40965. DSV4 config update
aiter/configs/model_configs/dsv4_bf16_tuned_gemm.csvwas refreshed with OPUS-only global winner rows from the latest DSV4 bf16 retune.Important routing policy:
err_ratio=0.0.Test Plan
Validation was done in several layers:
err_ratio=0.0./mnt/raid0/yifehuan/dsv4/all_bf16_mm.csv--iters 31--shape_groupedhipblasltexcludedCompared OPUS against:
hipblasltwas intentionally excluded from this retune, matching the requested DSV4 comparison setup.dsv4_bf16_tuned_gemm.csv.Test Result
Full DSV4 bf16 no-hipblaslt gradlib sweep completed successfully.
Sweep summary:
hipblaslt: excludedWinner distribution:
The DSV4 config was refreshed with only the 29 OPUS global-winner shapes.
Accuracy summary:
err_ratio=0.0err_ratio=0.0332, from non-OPUS comparison rows in the all-lib sweep.Result artifacts are saved under:
/mnt/raid0/yifehuan/performance/dsv4_gemm/all_bf16_nohipblaslt_20260608/Key files:
dsv4_all_bf16_nohipblaslt_20260608.mddsv4_all_bf16_nohipblaslt_per_lib_best.csvdsv4_all_bf16_nohipblaslt_winners.csvdsv4_all_bf16_nohipblaslt_lib_best_long.csvtuned_opus_dsv4_all_bf16_nohipblaslt.csvdsv4_opus_behind_priority_top10.csvRepresentative OPUS winner examples from the final sweep:
512x256x4096: OPUS50206,20.5522us512x512x4096: OPUS50206,25.7569us256x1024x4096: OPUS,25.87us128x2048x4096: OPUS,29.55us64x256x4096: OPUS50300,7.37us32x512x4096: OPUS,7.79us1x2048x4096: OPUS50301,11.7529us512x64x4096: OPUS50302,10.0625usSubmission Checklist