Avoid per-row SDMA fallback for narrow/tall hipMemcpy2D rects by chun-wan · Pull Request #277 · ROCm/clr

chun-wan · 2026-06-08T10:35:05Z

KernelBlitManager::copyBufferRect routes host<->device rect copies to the SDMA rect path (DmaBlitManager::copyBufferRect) first. The hardware rect copy hsa_amd_memory_async_copy_rect requires dword-aligned row/slice pitches; when they are not aligned, DmaBlitManager::copyBufferRect degenerates into one hsa_amd_memory_async_copy per row.

For tall, narrow rects such as hipMemcpy2D(width=1, height=1M) this submits ~1,048,576 single-byte SDMA copies. Measured on gfx942 / ROCm 7.2, a 1 MB width=1 H2D copy took ~8.6 s (up to ~17 s in a separate sweep), versus ~4 ms for width=4 and ~0.03 ms for a flat hipMemcpy of the same bytes. Pinning the host buffer does not help, because the cost is per-row submission overhead, not page-locking (pinned vs pageable differ by ~0.1%, and warmup does not change it). rocprofv3 confirms one MEMORY_COPY record per row for width=1 versus a single record for width=4.

Detect this case (row/slice pitch not dword-aligned, host<->device direction, row count > 256) and skip the SDMA rect path so the existing shader BlitCopyBufferRect kernel (alignment list {16,4,1}) handles the copy in a single dispatch instead of issuing one SDMA descriptor per row.

Associated JIRA ticket number/Github issue number

What type of PR is this? (check all applicable)

What were the changes?

Why are these changes needed?

Updated CHANGELOG?

Yes
No, Does not apply to this PR.

Added/Updated documentation?

Yes
No, Does not apply to this PR.

Additional Checks

I have added tests relevant to the introduced functionality, and the unit tests are passing locally.
Any dependent changes have been merged.

KernelBlitManager::copyBufferRect routes host<->device rect copies to the SDMA rect path (DmaBlitManager::copyBufferRect) first. The hardware rect copy hsa_amd_memory_async_copy_rect requires dword-aligned row/slice pitches; when they are not aligned, DmaBlitManager::copyBufferRect degenerates into one hsa_amd_memory_async_copy per row. For tall, narrow rects such as hipMemcpy2D(width=1, height=1M) this submits ~1,048,576 single-byte SDMA copies. Measured on gfx942 / ROCm 7.2, a 1 MB width=1 H2D copy took ~8.6 s (up to ~17 s in a separate sweep), versus ~4 ms for width=4 and ~0.03 ms for a flat hipMemcpy of the same bytes. Pinning the host buffer does not help, because the cost is per-row submission overhead, not page-locking (pinned vs pageable differ by ~0.1%, and warmup does not change it). rocprofv3 confirms one MEMORY_COPY record per row for width=1 versus a single record for width=4. Detect this case (row/slice pitch not dword-aligned, host<->device direction, row count > 256) and skip the SDMA rect path so the existing shader BlitCopyBufferRect kernel (alignment list {16,4,1}) handles the copy in a single dispatch instead of issuing one SDMA descriptor per row.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Avoid per-row SDMA fallback for narrow/tall hipMemcpy2D rects#277

Avoid per-row SDMA fallback for narrow/tall hipMemcpy2D rects#277
chun-wan wants to merge 1 commit into
ROCm:developfrom
chun-wan:fix/copy-buffer-rect-avoid-per-row-sdma

chun-wan commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

chun-wan commented Jun 8, 2026

Associated JIRA ticket number/Github issue number

What type of PR is this? (check all applicable)

What were the changes?

Why are these changes needed?

Updated CHANGELOG?

Added/Updated documentation?

Additional Checks

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant