[perf] Reduce memory peak time for putting regular tensor #54
0oshowero0 merged 6 commits into Ascend:main from
Conversation
Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>
CLA Signature: Pass. 0oshowero0, thanks for your pull request. All authors of the commits have signed the CLA. 👍
Pull request overview
This PR targets the async simple storage backend “put” path by changing how _select_by_positions slices regular (non-nested) torch.Tensor inputs to avoid a memory copy that was triggered by fancy indexing.
Changes:
- Updated `_select_by_positions` to return per-position tensor views (a list) for non-nested `torch.Tensor` inputs.
- Kept nested tensor selection via `unbind()` + `itemgetter`, returning a list.
- Updated the unit test expectations for regular tensor selection.
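For illustration, the `unbind()` + `itemgetter` selection pattern can be sketched on a regular tensor, whose `unbind()` likewise yields per-row views. The variable names below are illustrative and not the PR's actual code:

```python
import torch
from operator import itemgetter

t = torch.arange(12).reshape(4, 3)

# unbind(0) returns a tuple of row views; no data is copied here.
rows = t.unbind(0)

# itemgetter picks the requested positions, still as views.
picked = list(itemgetter(0, 2)(rows))

# The selected views share storage with the original tensor.
assert picked[0].data_ptr() == t.data_ptr()
assert torch.equal(picked[1], t[2])
```

Because every element of `picked` is a view, this path defers any allocation to whoever consumes the list (e.g., a serializer that requires contiguity).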
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| transfer_queue/storage/managers/simple_backend_manager.py | Changes tensor selection for put-routing to avoid fancy-indexing copies by returning per-position views for regular tensors. |
| tests/test_async_simple_storage_manager.py | Updates the `_select_by_positions` test for regular tensors to match the new (list-based) return behavior. |
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
> This method attempts to preserve zero-copy views whenever possible, while
> falling back to memory-copied single tensors when indices are irregular.
> This prevents severe network fragmentation (emitting too many ZMQ frames)
> during serialization.
>
> Supported data types:
> - Nested tensors: unbind → select → return as a list of views (zero-copy).
> - Regular tensors: Checks for constant-stride to return a single sliced view.
>   Falls back to `index_select` (memory copy) to ensure a single buffer.
The docstring/commentary suggests the constant-stride path is “zero-copy” and avoids memory allocation during serialization, but a strided slice (step > 1) produces a non-contiguous view and MsgpackEncoder._encode_regular_tensor() forces contiguous() before buffer extraction, causing a copy anyway. Consider clarifying the docstring to distinguish “shares storage” vs “end-to-end zero-copy”, and/or limiting the no-copy path to contiguous slices (step == 1) if the goal is to avoid downstream copies.
Suggested change:

```diff
-This method attempts to preserve zero-copy views whenever possible, while
-falling back to memory-copied single tensors when indices are irregular.
-This prevents severe network fragmentation (emitting too many ZMQ frames)
-during serialization.
-Supported data types:
-- Nested tensors: unbind → select → return as a list of views (zero-copy).
-- Regular tensors: Checks for constant-stride to return a single sliced view.
-  Falls back to `index_select` (memory copy) to ensure a single buffer.
+This method attempts to preserve views that share storage with the original
+data whenever possible, while falling back to memory-copied tensors when
+indices are irregular or a contiguous layout is required. The goal is to
+reduce severe network fragmentation (emitting too many ZMQ frames) during
+serialization, not to guarantee end-to-end zero-copy encoding.
+Supported data types:
+- Nested tensors: unbind → select → return as a list of views (shared
+  storage with the original nested tensor; downstream encoders may still
+  materialize copies if they require contiguity).
+- Regular tensors: Checks for constant-stride to return a single sliced
+  view that shares storage with the original tensor. Falls back to
+  `index_select` (memory copy) to ensure a single contiguous buffer when
+  a view would be too fragmented or incompatible with the encoder.
```
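The reviewer's distinction between "shares storage" and "end-to-end zero-copy" is easy to verify in isolation: a slice with step > 1 is a view over the same buffer, but it is non-contiguous, so a later `.contiguous()` call (as the encoder performs) must allocate and copy. A minimal check:

```python
import torch

t = torch.arange(12).reshape(6, 2)

# A strided slice (step > 1) is a view: it shares the original buffer.
v = t[0:6:2]
assert v.data_ptr() == t.data_ptr()

# But the view is non-contiguous, so contiguous() allocates a fresh buffer.
assert not v.is_contiguous()
c = v.contiguous()
assert c.data_ptr() != t.data_ptr()
```

So the strided-slice path postpones the copy rather than eliminating it, which is exactly the distinction the suggested docstring wording makes.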
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 5 comments.
Description
Problem:

In the previous implementation, the `_select_by_positions` method used advanced indexing (e.g., `tensor[positions]`) to select elements from regular tensors. By design, advanced indexing with non-contiguous indices triggers implicit memory allocation and an immediate data copy, causing unnecessary memory overhead. Additionally, naive workarounds such as returning a list of single-item views generate excessive multipart ZMQ frames, leading to severe network fragmentation during downstream serialization.

Solution:

This PR optimizes the selection logic with a smart slicing strategy. Instead of pursuing strict end-to-end zero-copy (which is usually broken during serialization anyway), the strategy delays the memory copy and shortens the peak-memory window:

- Constant-stride positions: use plain slicing (`tensor[start:stop:step]`) to return a strided view that shares storage. Although downstream serialization (e.g., `MsgpackEncoder`) will eventually force a `.contiguous()` copy, this avoids the immediate allocation overhead of `index_select` and significantly shortens the peak-memory period.
- Irregular positions: fall back to `torch.index_select` to assemble a single contiguous tensor. This incurs an immediate memory copy, but it prevents the network degradation that would result from sending numerous tiny ZMQ frames.
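A sketch of the idea, assuming a hypothetical `as_slice` helper (not the PR's actual function) that maps a constant-stride position list to an equivalent slice and returns `None` for irregular positions:

```python
import torch

t = torch.arange(10)

# Advanced (fancy) indexing materializes a copy immediately.
copy = t[torch.tensor([2, 4, 6])]
assert copy.data_ptr() != t.data_ptr()

# An equivalent constant-stride slice returns a view sharing storage,
# offset into the original buffer by the start position.
view = t[2:7:2]
assert view.data_ptr() == t.data_ptr() + 2 * t.element_size()


def as_slice(positions):
    """Hypothetical helper: return a slice equivalent to `positions` when
    they form a constant-stride run, else None (caller would then fall
    back to torch.index_select)."""
    if len(positions) == 1:
        return slice(positions[0], positions[0] + 1, 1)
    step = positions[1] - positions[0]
    if step <= 0 or any(b - a != step for a, b in zip(positions, positions[1:])):
        return None
    return slice(positions[0], positions[-1] + 1, step)


s = as_slice([2, 4, 6])
assert torch.equal(t[s], copy)  # same elements as fancy indexing...
assert t[s].data_ptr() == t.data_ptr() + 2 * t.element_size()  # ...but a view
assert as_slice([0, 1, 5]) is None  # irregular: caller copies via index_select
```

The copy still happens eventually (when the encoder calls `.contiguous()`), but it is deferred until serialization, so the original tensor and its copy coexist for a shorter time.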
Add fallback logic for nested tensor packing.
Memory Profile
For a regular 2 GB tensor:
Before:

After:
