[perf] Reduce memory peak time for putting regular tensor #54
0oshowero0 merged 6 commits into Ascend:main from
Conversation
Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>
CLA Signature: Pass. 0oshowero0, thanks for your pull request. All authors of the commits have signed the CLA. 👍
Pull request overview
This PR targets the async simple storage backend “put” path by changing how _select_by_positions slices regular (non-nested) torch.Tensor inputs to avoid a memory copy that was triggered by fancy indexing.
Changes:
- Updated `_select_by_positions` to return per-position tensor views (a list) for non-nested `torch.Tensor` inputs.
- Kept nested tensor selection via `unbind()` + `itemgetter`, returning a list.
- Updated the unit test expectations for regular tensor selection.
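For illustration, the `unbind()` + `itemgetter` selection pattern can be sketched on a regular tensor, whose `unbind()` likewise yields per-row views. The variable names below are illustrative and not the PR's actual code:

```python
import torch
from operator import itemgetter

t = torch.arange(12).reshape(4, 3)

# unbind(0) returns a tuple of row views; no data is copied here.
rows = t.unbind(0)

# itemgetter picks the requested positions, still as views.
picked = list(itemgetter(0, 2)(rows))

# The selected views share storage with the original tensor.
assert picked[0].data_ptr() == t.data_ptr()
assert torch.equal(picked[1], t[2])
```

Because every element of `picked` is a view, this path defers any allocation to whoever consumes the list (e.g., a serializer that requires contiguity).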
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| transfer_queue/storage/managers/simple_backend_manager.py | Changes tensor selection for put-routing to avoid fancy-indexing copies by returning per-position views for regular tensors. |
| tests/test_async_simple_storage_manager.py | Updates the `_select_by_positions` test for regular tensors to match the new (list-based) return behavior. |
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
> This method attempts to preserve zero-copy views whenever possible, while
> falling back to memory-copied single tensors when indices are irregular.
> This prevents severe network fragmentation (emitting too many ZMQ frames)
> during serialization.
>
> Supported data types:
> - Nested tensors: unbind → select → return as a list of views (zero-copy).
> - Regular tensors: Checks for constant-stride to return a single sliced view.
>   Falls back to `index_select` (memory copy) to ensure a single buffer.
The docstring/commentary suggests the constant-stride path is “zero-copy” and avoids memory allocation during serialization, but a strided slice (step > 1) produces a non-contiguous view and MsgpackEncoder._encode_regular_tensor() forces contiguous() before buffer extraction, causing a copy anyway. Consider clarifying the docstring to distinguish “shares storage” vs “end-to-end zero-copy”, and/or limiting the no-copy path to contiguous slices (step == 1) if the goal is to avoid downstream copies.
Suggested change:

```diff
-This method attempts to preserve zero-copy views whenever possible, while
-falling back to memory-copied single tensors when indices are irregular.
-This prevents severe network fragmentation (emitting too many ZMQ frames)
-during serialization.
-Supported data types:
-- Nested tensors: unbind → select → return as a list of views (zero-copy).
-- Regular tensors: Checks for constant-stride to return a single sliced view.
-  Falls back to `index_select` (memory copy) to ensure a single buffer.
+This method attempts to preserve views that share storage with the original
+data whenever possible, while falling back to memory-copied tensors when
+indices are irregular or a contiguous layout is required. The goal is to
+reduce severe network fragmentation (emitting too many ZMQ frames) during
+serialization, not to guarantee end-to-end zero-copy encoding.
+Supported data types:
+- Nested tensors: unbind → select → return as a list of views (shared
+  storage with the original nested tensor; downstream encoders may still
+  materialize copies if they require contiguity).
+- Regular tensors: Checks for constant-stride to return a single sliced
+  view that shares storage with the original tensor. Falls back to
+  `index_select` (memory copy) to ensure a single contiguous buffer when
+  a view would be too fragmented or incompatible with the encoder.
```
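The reviewer's distinction between "shares storage" and "end-to-end zero-copy" is easy to verify in isolation: a slice with step > 1 is a view over the same buffer, but it is non-contiguous, so a later `.contiguous()` call (as the encoder performs) must allocate and copy. A minimal check:

```python
import torch

t = torch.arange(12).reshape(6, 2)

# A strided slice (step > 1) is a view: it shares the original buffer.
v = t[0:6:2]
assert v.data_ptr() == t.data_ptr()

# But the view is non-contiguous, so contiguous() allocates a fresh buffer.
assert not v.is_contiguous()
c = v.contiguous()
assert c.data_ptr() != t.data_ptr()
```

So the strided-slice path postpones the copy rather than eliminating it, which is exactly the distinction the suggested docstring wording makes.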
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 5 comments.
Description
Problem:

In the previous implementation, the `_select_by_positions` method used advanced indexing (e.g., `tensor[positions]`) to select elements from regular tensors. By design, advanced indexing with non-contiguous indices triggers implicit memory allocation and an immediate data copy, causing unnecessary memory overhead. Additionally, naive workarounds such as returning a list of single-item views generate excessive multipart ZMQ frames, leading to severe network fragmentation during downstream serialization.

Solution:

This PR optimizes the selection logic with a smart slicing strategy. Instead of pursuing strict end-to-end zero-copy (which is usually broken during serialization anyway), the strategy delays the memory copy and shortens the peak-memory window:

- Constant-stride positions: use plain slicing (`tensor[start:stop:step]`) to return a strided view that shares storage. Although downstream serialization (e.g., `MsgpackEncoder`) will eventually force a `.contiguous()` copy, this avoids the immediate allocation overhead of `index_select` and significantly shortens the peak-memory period.
- Irregular positions: fall back to `torch.index_select` to assemble a single contiguous tensor. This incurs an immediate memory copy, but it prevents the network degradation that would result from sending numerous tiny ZMQ frames.
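A sketch of the idea, assuming a hypothetical `as_slice` helper (not the PR's actual function) that maps a constant-stride position list to an equivalent slice and returns `None` for irregular positions:

```python
import torch

t = torch.arange(10)

# Advanced (fancy) indexing materializes a copy immediately.
copy = t[torch.tensor([2, 4, 6])]
assert copy.data_ptr() != t.data_ptr()

# An equivalent constant-stride slice returns a view sharing storage,
# offset into the original buffer by the start position.
view = t[2:7:2]
assert view.data_ptr() == t.data_ptr() + 2 * t.element_size()


def as_slice(positions):
    """Hypothetical helper: return a slice equivalent to `positions` when
    they form a constant-stride run, else None (caller would then fall
    back to torch.index_select)."""
    if len(positions) == 1:
        return slice(positions[0], positions[0] + 1, 1)
    step = positions[1] - positions[0]
    if step <= 0 or any(b - a != step for a, b in zip(positions, positions[1:])):
        return None
    return slice(positions[0], positions[-1] + 1, step)


s = as_slice([2, 4, 6])
assert torch.equal(t[s], copy)  # same elements as fancy indexing...
assert t[s].data_ptr() == t.data_ptr() + 2 * t.element_size()  # ...but a view
assert as_slice([0, 1, 5]) is None  # irregular: caller copies via index_select
```

The copy still happens eventually (when the encoder calls `.contiguous()`), but it is deferred until serialization, so the original tensor and its copy coexist for a shorter time.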
Add fallback logic for nested tensor packing.
Memory Profile
For a regular 2 GB tensor:
Before:

After:
