Skip to content

[OPUS] gfx942 a16w16 bf16 GEMM splitK and WKC pipeline family#3594

Open
yifehuan wants to merge 3 commits into
mainfrom
yifehuan/opus_perf_v2
Open

[OPUS] gfx942 a16w16 bf16 GEMM splitK and WKC pipeline family#3594
yifehuan wants to merge 3 commits into
mainfrom
yifehuan/opus_perf_v2

Conversation

@yifehuan
Copy link
Copy Markdown
Contributor

@yifehuan yifehuan commented Jun 8, 2026

Motivation

This PR improves OPUS coverage and performance for gfx942 a16w16 bf16 GEMM shapes used by the DSV4 bf16 workload.

The main goals are:

  • Add deterministic OPUS pipeline coverage for DSV4 bf16 GEMM shapes where the existing splitK paths were not competitive.
  • Improve small/skinny and medium-shape coverage with a gfx942 wave-K cooperative family.
  • Add a bf16-workspace splitK variant for bf16-in/bf16-out workloads to reduce splitK workspace traffic and improve latency on larger bf16 shapes.
  • Clean up obsolete fused-reduce / reduce-path code and keep the generated DSV4 model config restricted to OPUS global winners only.

The branch contains three commits:

  1. aa3cc29b3 feat(opus): add gfx942 EM3EN4 LDS1 PGR2 splitK path
  2. 605f18d4c feat(opus): add gfx942 WKC paths and reduce cleanup
  3. da15623d6 perf(opus): add gfx942 bf16 workspace splitK path

Technical Details

This PR adds a gfx942 OPUS pipeline family for bf16 GEMM and related codegen/config plumbing.

1. EM3EN4 LDS1/PGR2 splitK path

Adds a new gfx942 splitK pipeline:

  • opus_gemm_pipeline_a16w16_em3en4_lds1_pgr2_sk.cuh
  • New tuned kid:
    • 50204: a16w16_em3en4_lds1_pgr2_sk

This path targets a 128x96 host/output tiling shape while using a physical 96x128 device-side orientation. It uses:

  • LDS depth 1
  • PGR2-style prefetching
  • EM3/EN4-style tile decomposition
  • deterministic splitK accumulation through the existing splitK reduce flow

The intent is to improve coverage for mid/large DSV4 bf16 shapes where the previous 50200/50203-style splitK paths had weaker tiling or staging behavior.

2. Wave-K cooperative gfx942 WKC family

Adds a new gfx942 wave-K cooperative pipeline:

  • opus_gemm_pipeline_a16w16_wave_k_coop.cuh

New tuned kids:

  • 50300: WKC 16x16, BK64
  • 50301: WKC 16x32, BK32
  • 50302: WKC 32x16, BK64
  • 50303: WKC 32x32, BK64

This family is designed for skinny/small-M or skinny/small-N bf16 GEMM shapes where deterministic in-kernel wave cooperation is more competitive than the older splitK +
separate reduce-launch path.

The WKC pipeline uses:

  • one workgroup with multiple waves cooperating over the K dimension
  • deterministic LDS partial accumulation/reduction inside the kernel
  • fixed mfma scheduling for the supported 16x16-based tile geometries
  • no atomic reduce path
  • no asm fallback routing

The old fused-reduce pipeline was removed because the deterministic fused splitK structure depended on expensive cross-wave/workgroup synchronization and did not show a
viable win path for the target shapes.

3. SplitK reduce cleanup

The reduce path was cleaned up and moved toward gfx942-specific generated support.

Notable changes include:

  • gfx942-specific reduce codegen moved into codegen/gen_instances_gfx942.py
  • exact-N rowblock reduce support for selected splitK reduce shapes
  • reduce instantiation cleanup for the supported gfx942 splitK configurations
  • retention of baseline fallback behavior where needed, instead of replacing every reduce path at once

This keeps the fast reduce support scoped to the gfx942 cases that need it while avoiding unnecessary changes to unrelated architectures.

4. bf16 workspace splitK path

Adds a bf16 workspace variant for legacy splitK:

  • New tuned kid:
    • 50206: splitk_legacy_bf16ws_512x128x128x64

The splitK workspace dtype is now threaded through the gfx942 traits/codegen path, allowing splitK main kernels and reduce kernels to use either fp32 or bf16 workspace
storage depending on the selected kid.

The bf16 workspace path is gated to bf16 output, since the target production use case is bf16 input/output GEMM. This avoids silently using bf16 intermediate storage for
output types where fp32 workspace may still be required for accuracy or semantics.

This reduces workspace bandwidth and improves performance on several larger bf16 DSV4 shapes, especially cases such as:

  • 512x256x4096
  • 512x512x4096
  • 256x512x4096

5. DSV4 config update

aiter/configs/model_configs/dsv4_bf16_tuned_gemm.csv was refreshed with OPUS-only global winner rows from the latest DSV4 bf16 retune.

Important routing policy:

  • OPUS winners are written.
  • Non-OPUS winners are not written.
  • asm/torch/triton/flydsl results are used only for comparison.
  • asm is not routed through this OPUS config.
  • OPUS rows written by this update have err_ratio=0.0.

Test Plan

Validation was done in several layers:

  1. Python/codegen sanity
  • Verified the modified Python/codegen files compile.
  • Regenerated OPUS gfx942 instances.
  • Verified generated gfx942 splitK/WKC instances build successfully.
  1. Kernel correctness
  • Ran OPUS runtime correctness checks for the new gfx942 kids.
  • Verified selected WKC and splitK paths produce valid bf16 GEMM output.
  • Checked OPUS winner rows used for DSV4 config have err_ratio=0.0.
  1. Performance validation
  • Re-ran gradlib tuning for the full DSV4 bf16 GEMM shape set:
    • Input: /mnt/raid0/yifehuan/dsv4/all_bf16_mm.csv
    • 51 shapes
    • bf16 input/output
    • no bias
    • no scaleAB
    • --iters 31
    • --shape_grouped
    • hipblaslt excluded
  1. Library comparison

Compared OPUS against:

  • asm
  • flydsl
  • torch
  • triton
  • skinny

hipblaslt was intentionally excluded from this retune, matching the requested DSV4 comparison setup.

  1. Config validation
  • Only shapes where OPUS is the global winner were written back to dsv4_bf16_tuned_gemm.csv.
  • Non-OPUS winners were left out of the OPUS model config.
  • Verified written OPUS rows are deterministic OPUS kernel selections, not asm fallback routes.

Test Result

Full DSV4 bf16 no-hipblaslt gradlib sweep completed successfully.

Sweep summary:

  • Input shapes: 51
  • Completed shapes: 51 / 51
  • Invalid timing rows: 0
  • Libs swept: asm, flydsl, opus, torch, triton, skinny
  • hipblaslt: excluded
  • Skinny had no valid bf16 solutions in this run.

Winner distribution:

  • OPUS: 29
  • asm: 16
  • torch: 5
  • triton: 1

The DSV4 config was refreshed with only the 29 OPUS global-winner shapes.

Accuracy summary:

  • Written OPUS config rows: err_ratio=0.0
  • Full sweep max err_ratio=0.0332, from non-OPUS comparison rows in the all-lib sweep.

Result artifacts are saved under:

/mnt/raid0/yifehuan/performance/dsv4_gemm/all_bf16_nohipblaslt_20260608/

Key files:

  • dsv4_all_bf16_nohipblaslt_20260608.md
  • dsv4_all_bf16_nohipblaslt_per_lib_best.csv
  • dsv4_all_bf16_nohipblaslt_winners.csv
  • dsv4_all_bf16_nohipblaslt_lib_best_long.csv
  • tuned_opus_dsv4_all_bf16_nohipblaslt.csv
  • dsv4_opus_behind_priority_top10.csv

Representative OPUS winner examples from the final sweep:

  • 512x256x4096: OPUS 50206, 20.5522us
  • 512x512x4096: OPUS 50206, 25.7569us
  • 256x1024x4096: OPUS, 25.87us
  • 128x2048x4096: OPUS, 29.55us
  • 64x256x4096: OPUS 50300, 7.37us
  • 32x512x4096: OPUS, 7.79us
  • 1x2048x4096: OPUS 50301, 11.7529us
  • 512x64x4096: OPUS 50302, 10.0625us

Submission Checklist

@yifehuan yifehuan requested a review from a team June 8, 2026 05:31
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 8, 2026

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label Tests
ci:triton-300x Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
ci:sglang SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
ci:atom ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
ci:atom_full ATOM accuracy suite for PR and main models from ATOM models_accuracy.json
ci:vllm vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
ci:all All standard extended tests (excludes ci:atom_full)

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3594 --add-label <label>

@yifehuan yifehuan force-pushed the yifehuan/opus_perf_v2 branch from da15623 to bdc12b7 Compare June 8, 2026 05:33
@yifehuan yifehuan force-pushed the yifehuan/opus_perf_v2 branch from bdc12b7 to ea63649 Compare June 8, 2026 08:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant