[OPUS] gfx942 a16w16 bf16 GEMM splitK and WKC pipeline family by yifehuan · Pull Request #3594 · ROCm/aiter

yifehuan · 2026-06-08T05:31:44Z

Motivation

This PR improves OPUS coverage and performance for gfx942 a16w16 bf16 GEMM shapes used by the DSV4 bf16 workload.

The main goals are:

Add deterministic OPUS pipeline coverage for DSV4 bf16 GEMM shapes where the existing splitK paths were not competitive.
Improve small/skinny and medium-shape coverage with a gfx942 wave-K cooperative family.
Add a bf16-workspace splitK variant for bf16-in/bf16-out workloads to reduce splitK workspace traffic and improve latency on larger bf16 shapes.
Clean up obsolete fused-reduce / reduce-path code and keep the generated DSV4 model config restricted to OPUS global winners only.

The branch contains three commits:

aa3cc29b3 feat(opus): add gfx942 EM3EN4 LDS1 PGR2 splitK path
605f18d4c feat(opus): add gfx942 WKC paths and reduce cleanup
da15623d6 perf(opus): add gfx942 bf16 workspace splitK path

Technical Details

This PR adds a gfx942 OPUS pipeline family for bf16 GEMM and related codegen/config plumbing.

1. EM3EN4 LDS1/PGR2 splitK path

Adds a new gfx942 splitK pipeline:

opus_gemm_pipeline_a16w16_em3en4_lds1_pgr2_sk.cuh
New tuned kid:
- 50204: a16w16_em3en4_lds1_pgr2_sk

This path targets a 128x96 host/output tiling shape while using a physical 96x128 device-side orientation. It uses:

LDS depth 1
PGR2-style prefetching
EM3/EN4-style tile decomposition
deterministic splitK accumulation through the existing splitK reduce flow

The intent is to improve coverage for mid/large DSV4 bf16 shapes where the previous 50200/50203-style splitK paths had weaker tiling or staging behavior.

2. Wave-K cooperative gfx942 WKC family

Adds a new gfx942 wave-K cooperative pipeline:

opus_gemm_pipeline_a16w16_wave_k_coop.cuh

New tuned kids:

50300: WKC 16x16, BK64
50301: WKC 16x32, BK32
50302: WKC 32x16, BK64
50303: WKC 32x32, BK64

This family is designed for skinny/small-M or skinny/small-N bf16 GEMM shapes where deterministic in-kernel wave cooperation is more competitive than the older splitK +
separate reduce-launch path.

The WKC pipeline uses:

one workgroup with multiple waves cooperating over the K dimension
deterministic LDS partial accumulation/reduction inside the kernel
fixed mfma scheduling for the supported 16x16-based tile geometries
no atomic reduce path
no asm fallback routing

The old fused-reduce pipeline was removed because the deterministic fused splitK structure depended on expensive cross-wave/workgroup synchronization and did not show a
viable win path for the target shapes.

3. SplitK reduce cleanup

The reduce path was cleaned up and moved toward gfx942-specific generated support.

Notable changes include:

gfx942-specific reduce codegen moved into codegen/gen_instances_gfx942.py
exact-N rowblock reduce support for selected splitK reduce shapes
reduce instantiation cleanup for the supported gfx942 splitK configurations
retention of baseline fallback behavior where needed, instead of replacing every reduce path at once

This keeps the fast reduce support scoped to the gfx942 cases that need it while avoiding unnecessary changes to unrelated architectures.

4. bf16 workspace splitK path

Adds a bf16 workspace variant for legacy splitK:

New tuned kid:
- 50206: splitk_legacy_bf16ws_512x128x128x64

The splitK workspace dtype is now threaded through the gfx942 traits/codegen path, allowing splitK main kernels and reduce kernels to use either fp32 or bf16 workspace
storage depending on the selected kid.

The bf16 workspace path is gated to bf16 output, since the target production use case is bf16 input/output GEMM. This avoids silently using bf16 intermediate storage for
output types where fp32 workspace may still be required for accuracy or semantics.

This reduces workspace bandwidth and improves performance on several larger bf16 DSV4 shapes, especially cases such as:

512x256x4096
512x512x4096
256x512x4096

5. DSV4 config update

aiter/configs/model_configs/dsv4_bf16_tuned_gemm.csv was refreshed with OPUS-only global winner rows from the latest DSV4 bf16 retune.

Important routing policy:

OPUS winners are written.
Non-OPUS winners are not written.
asm/torch/triton/flydsl results are used only for comparison.
asm is not routed through this OPUS config.
OPUS rows written by this update have err_ratio=0.0.

Test Plan

Validation was done in several layers:

Python/codegen sanity

Verified the modified Python/codegen files compile.
Regenerated OPUS gfx942 instances.
Verified generated gfx942 splitK/WKC instances build successfully.

Kernel correctness

Ran OPUS runtime correctness checks for the new gfx942 kids.
Verified selected WKC and splitK paths produce valid bf16 GEMM output.
Checked OPUS winner rows used for DSV4 config have err_ratio=0.0.

Performance validation

Re-ran gradlib tuning for the full DSV4 bf16 GEMM shape set:
- Input: /mnt/raid0/yifehuan/dsv4/all_bf16_mm.csv
- 51 shapes
- bf16 input/output
- no bias
- no scaleAB
- --iters 31
- --shape_grouped
- hipblaslt excluded

Library comparison

Compared OPUS against:

asm
flydsl
torch
triton
skinny

hipblaslt was intentionally excluded from this retune, matching the requested DSV4 comparison setup.

Config validation

Only shapes where OPUS is the global winner were written back to dsv4_bf16_tuned_gemm.csv.
Non-OPUS winners were left out of the OPUS model config.
Verified written OPUS rows are deterministic OPUS kernel selections, not asm fallback routes.

Test Result

Full DSV4 bf16 no-hipblaslt gradlib sweep completed successfully.

Sweep summary:

Input shapes: 51
Completed shapes: 51 / 51
Invalid timing rows: 0
Libs swept: asm, flydsl, opus, torch, triton, skinny
hipblaslt: excluded
Skinny had no valid bf16 solutions in this run.

Winner distribution:

OPUS: 29
asm: 16
torch: 5
triton: 1

The DSV4 config was refreshed with only the 29 OPUS global-winner shapes.

Accuracy summary:

Written OPUS config rows: err_ratio=0.0
Full sweep max err_ratio=0.0332, from non-OPUS comparison rows in the all-lib sweep.

Result artifacts are saved under:

/mnt/raid0/yifehuan/performance/dsv4_gemm/all_bf16_nohipblaslt_20260608/

Key files:

dsv4_all_bf16_nohipblaslt_20260608.md
dsv4_all_bf16_nohipblaslt_per_lib_best.csv
dsv4_all_bf16_nohipblaslt_winners.csv
dsv4_all_bf16_nohipblaslt_lib_best_long.csv
tuned_opus_dsv4_all_bf16_nohipblaslt.csv
dsv4_opus_behind_priority_top10.csv

Representative OPUS winner examples from the final sweep:

512x256x4096: OPUS 50206, 20.5522us
512x512x4096: OPUS 50206, 25.7569us
256x1024x4096: OPUS, 25.87us
128x2048x4096: OPUS, 29.55us
64x256x4096: OPUS 50300, 7.37us
32x512x4096: OPUS, 7.79us
1x2048x4096: OPUS 50301, 11.7529us
512x64x4096: OPUS 50302, 10.0625us

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

github-actions · 2026-06-08T05:32:17Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label	Tests
`ci:triton-300x`	Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
`ci:sglang`	SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
`ci:atom`	ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
`ci:atom_full`	ATOM accuracy suite for PR and main models from ATOM `models_accuracy.json`
`ci:vllm`	vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
`ci:all`	All standard extended tests (excludes `ci:atom_full`)

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3594 --add-label <label>

yifehuan requested a review from a team June 8, 2026 05:31

yifehuan force-pushed the yifehuan/opus_perf_v2 branch from da15623 to bdc12b7 Compare June 8, 2026 05:33

yifehuan added 3 commits June 8, 2026 16:20

feat(opus): add gfx942 EM3EN4 LDS1 PGR2 splitK path

61323e7

feat(opus): add gfx942 WKC paths and reduce cleanup

dbc822a

perf(opus): add gfx942 bf16 workspace splitK path

ea63649

yifehuan force-pushed the yifehuan/opus_perf_v2 branch from bdc12b7 to ea63649 Compare June 8, 2026 08:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[OPUS] gfx942 a16w16 bf16 GEMM splitK and WKC pipeline family#3594

[OPUS] gfx942 a16w16 bf16 GEMM splitK and WKC pipeline family#3594
yifehuan wants to merge 3 commits into
mainfrom
yifehuan/opus_perf_v2

yifehuan commented Jun 8, 2026

Uh oh!

github-actions Bot commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

yifehuan commented Jun 8, 2026

Motivation

Technical Details

1. EM3EN4 LDS1/PGR2 splitK path

2. Wave-K cooperative gfx942 WKC family

3. SplitK reduce cleanup

4. bf16 workspace splitK path

5. DSV4 config update

Test Plan

Test Result

Submission Checklist

Uh oh!

github-actions Bot commented Jun 8, 2026

🏷️ CI Guide

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant