Avoid per-row SDMA fallback for narrow/tall hipMemcpy2D rects#277
Open
chun-wan wants to merge 1 commit into
Open
Avoid per-row SDMA fallback for narrow/tall hipMemcpy2D rects#277chun-wan wants to merge 1 commit into
chun-wan wants to merge 1 commit into
Conversation
KernelBlitManager::copyBufferRect routes host<->device rect copies to the
SDMA rect path (DmaBlitManager::copyBufferRect) first. The hardware rect copy
hsa_amd_memory_async_copy_rect requires dword-aligned row/slice pitches; when
they are not aligned, DmaBlitManager::copyBufferRect degenerates into one
hsa_amd_memory_async_copy per row.
For tall, narrow rects such as hipMemcpy2D(width=1, height=1M) this submits
~1,048,576 single-byte SDMA copies. Measured on gfx942 / ROCm 7.2, a 1 MB
width=1 H2D copy took ~8.6 s (up to ~17 s in a separate sweep), versus ~4 ms
for width=4 and ~0.03 ms for a flat hipMemcpy of the same bytes. Pinning the
host buffer does not help, because the cost is per-row submission overhead,
not page-locking (pinned vs pageable differ by ~0.1%, and warmup does not
change it). rocprofv3 confirms one MEMORY_COPY record per row for width=1
versus a single record for width=4.
Detect this case (row/slice pitch not dword-aligned, host<->device direction,
row count > 256) and skip the SDMA rect path so the existing shader
BlitCopyBufferRect kernel (alignment list {16,4,1}) handles the copy in a
single dispatch instead of issuing one SDMA descriptor per row.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
KernelBlitManager::copyBufferRect routes host<->device rect copies to the SDMA rect path (DmaBlitManager::copyBufferRect) first. The hardware rect copy hsa_amd_memory_async_copy_rect requires dword-aligned row/slice pitches; when they are not aligned, DmaBlitManager::copyBufferRect degenerates into one hsa_amd_memory_async_copy per row.
For tall, narrow rects such as hipMemcpy2D(width=1, height=1M) this submits ~1,048,576 single-byte SDMA copies. Measured on gfx942 / ROCm 7.2, a 1 MB width=1 H2D copy took ~8.6 s (up to ~17 s in a separate sweep), versus ~4 ms for width=4 and ~0.03 ms for a flat hipMemcpy of the same bytes. Pinning the host buffer does not help, because the cost is per-row submission overhead, not page-locking (pinned vs pageable differ by ~0.1%, and warmup does not change it). rocprofv3 confirms one MEMORY_COPY record per row for width=1 versus a single record for width=4.
Detect this case (row/slice pitch not dword-aligned, host<->device direction, row count > 256) and skip the SDMA rect path so the existing shader BlitCopyBufferRect kernel (alignment list {16,4,1}) handles the copy in a single dispatch instead of issuing one SDMA descriptor per row.
Associated JIRA ticket number/Github issue number
What type of PR is this? (check all applicable)
What were the changes?
Why are these changes needed?
Updated CHANGELOG?
Added/Updated documentation?
Additional Checks