Skip to content

Avoid per-row SDMA fallback for narrow/tall hipMemcpy2D rects#277

Open
chun-wan wants to merge 1 commit into
ROCm:developfrom
chun-wan:fix/copy-buffer-rect-avoid-per-row-sdma
Open

Avoid per-row SDMA fallback for narrow/tall hipMemcpy2D rects#277
chun-wan wants to merge 1 commit into
ROCm:developfrom
chun-wan:fix/copy-buffer-rect-avoid-per-row-sdma

Conversation

@chun-wan

@chun-wan chun-wan commented Jun 8, 2026

Copy link
Copy Markdown

KernelBlitManager::copyBufferRect routes host<->device rect copies to the SDMA rect path (DmaBlitManager::copyBufferRect) first. The hardware rect copy hsa_amd_memory_async_copy_rect requires dword-aligned row/slice pitches; when they are not aligned, DmaBlitManager::copyBufferRect degenerates into one hsa_amd_memory_async_copy per row.

For tall, narrow rects such as hipMemcpy2D(width=1, height=1M) this submits ~1,048,576 single-byte SDMA copies. Measured on gfx942 / ROCm 7.2, a 1 MB width=1 H2D copy took ~8.6 s (up to ~17 s in a separate sweep), versus ~4 ms for width=4 and ~0.03 ms for a flat hipMemcpy of the same bytes. Pinning the host buffer does not help, because the cost is per-row submission overhead, not page-locking (pinned vs pageable differ by ~0.1%, and warmup does not change it). rocprofv3 confirms one MEMORY_COPY record per row for width=1 versus a single record for width=4.

Detect this case (row/slice pitch not dword-aligned, host<->device direction, row count > 256) and skip the SDMA rect path so the existing shader BlitCopyBufferRect kernel (alignment list {16,4,1}) handles the copy in a single dispatch instead of issuing one SDMA descriptor per row.

Associated JIRA ticket number/Github issue number

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update
  • Continuous Integration

What were the changes?

Why are these changes needed?

Updated CHANGELOG?

  • Yes
  • No, Does not apply to this PR.

Added/Updated documentation?

  • Yes
  • No, Does not apply to this PR.

Additional Checks

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally.
  • Any dependent changes have been merged.

KernelBlitManager::copyBufferRect routes host<->device rect copies to the
SDMA rect path (DmaBlitManager::copyBufferRect) first. The hardware rect copy
hsa_amd_memory_async_copy_rect requires dword-aligned row/slice pitches; when
they are not aligned, DmaBlitManager::copyBufferRect degenerates into one
hsa_amd_memory_async_copy per row.

For tall, narrow rects such as hipMemcpy2D(width=1, height=1M) this submits
~1,048,576 single-byte SDMA copies. Measured on gfx942 / ROCm 7.2, a 1 MB
width=1 H2D copy took ~8.6 s (up to ~17 s in a separate sweep), versus ~4 ms
for width=4 and ~0.03 ms for a flat hipMemcpy of the same bytes. Pinning the
host buffer does not help, because the cost is per-row submission overhead,
not page-locking (pinned vs pageable differ by ~0.1%, and warmup does not
change it). rocprofv3 confirms one MEMORY_COPY record per row for width=1
versus a single record for width=4.

Detect this case (row/slice pitch not dword-aligned, host<->device direction,
row count > 256) and skip the SDMA rect path so the existing shader
BlitCopyBufferRect kernel (alignment list {16,4,1}) handles the copy in a
single dispatch instead of issuing one SDMA descriptor per row.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant