[perf] Reduce memory peak time for putting regular tensor #54

Merged
0oshowero0 merged 6 commits into Ascend:main from 0oshowero0:mem_opt
Mar 20, 2026

@0oshowero0 0oshowero0 commented Mar 20, 2026

Description

Problem:

In the previous implementation, the _select_by_positions method used advanced indexing (e.g., tensor[positions]) to select elements from regular tensors. By design, advanced indexing returns a new tensor rather than a view, so it triggers an implicit memory allocation and an immediate data copy, leading to unnecessary memory overhead. Additionally, naive workarounds such as returning a list of single-item views can generate excessive multipart ZMQ frames, leading to severe network fragmentation during downstream serialization.
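
To make the copy-versus-view difference concrete, here is a minimal sketch using only standard PyTorch calls. The tensor shape and the positions are illustrative assumptions, not values from this PR.

```python
import torch

# Illustrative stand-in for a "regular tensor" (shape and values are assumptions).
tensor = torch.arange(12).reshape(6, 2)
positions = [0, 2, 4]

# Advanced indexing allocates a new tensor and copies the selected rows immediately.
copied = tensor[positions]

# A basic slice over the same rows returns a view onto the original storage.
view = tensor[0:6:2]

same_values = torch.equal(copied, view)
copy_shares_storage = copied.data_ptr() == tensor.data_ptr()
view_shares_storage = view.data_ptr() == tensor.data_ptr()

# Writing through the original is visible in the view but not in the copy.
tensor[0, 0] = 99
view_sees_write = view[0, 0].item() == 99
copy_sees_write = copied[0, 0].item() == 99
```

Both selections hold the same rows, but only the slice shares memory with the source tensor, which is why the slice does not add to peak memory at selection time.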

Solution:

This PR optimizes the selection logic with a smart slicing strategy. Rather than aiming for strict end-to-end zero-copy (which serialization often breaks anyway), the strategy delays the memory copy and shortens the time spent at peak memory usage:

  • Constant-Stride Slicing: Checks if the provided indices form a perfectly regular sequence (constant stride). If so, it leverages Python's built-in slicing (tensor[start:stop:step]) to return a strided view that shares storage. While downstream serialization (e.g., MsgpackEncoder) will eventually force a .contiguous() copy, this pure Python slicing avoids the immediate allocation overhead of index_select and significantly reduces the peak memory period.
  • Single Element Selection: Automatically slices and returns a natively contiguous single-row view when only one item is selected.
  • Fallback for Irregular Indices: If indices are irregular (which should not happen in practice), it falls back to torch.index_select to assemble a single contiguous tensor. While this incurs an immediate memory copy, it prevents the network degradation that would result from sending numerous tiny ZMQ frames.
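
The constant-stride check described in the bullets above can be sketched in plain Python. The helper name `positions_to_slice` is hypothetical and this is not the PR's actual code. It returns a slice when the positions form a regular, increasing sequence (including the single-element case) and `None` otherwise, in which case a caller would take the `torch.index_select` fallback.

```python
from typing import Optional, Sequence


def positions_to_slice(positions: Sequence[int]) -> Optional[slice]:
    """Return an equivalent slice for constant-stride positions, else None.

    Hypothetical helper for illustration only. Only non-negative, strictly
    increasing positions are accepted, because tensor slicing in PyTorch
    requires a positive step.
    """
    if len(positions) == 0 or positions[0] < 0:
        return None
    start = positions[0]
    if len(positions) == 1:
        # Single element: a one-row slice, which is contiguous on its own.
        return slice(start, start + 1, 1)
    step = positions[1] - positions[0]
    if step <= 0:
        return None
    for prev, cur in zip(positions, positions[1:]):
        if cur - prev != step:
            return None  # Irregular spacing: caller should fall back to a copy.
    return slice(start, positions[-1] + 1, step)


# The same slice object can index a list here or the first dimension of a tensor.
rows = list(range(10, 20))
regular = positions_to_slice([1, 4, 7])
single = positions_to_slice([5])
irregular = positions_to_slice([0, 1, 3])
selected = rows[regular]
```

Returning a `slice` rather than the sliced data keeps the check independent of the tensor library, so the regular and irregular branches can be tested without allocating any tensors.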

Other Changes

Add fallback logic for nested tensor packing

Memory Profile

For a 2 GB regular tensor:

Before:
[image: memory profile before this change]

After:
[image: memory profile after this change]

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>
Copilot AI review requested due to automatic review settings March 20, 2026 04:36
@ascend-robot

CLA Signature Pass

0oshowero0, thanks for your pull request. All authors of the commits have signed the CLA. 👍


Copilot AI left a comment

Pull request overview

This PR targets the async simple storage backend “put” path by changing how _select_by_positions slices regular (non-nested) torch.Tensor inputs to avoid a memory copy that was triggered by fancy indexing.

Changes:

  • Updated _select_by_positions to return per-position tensor views (list) for non-nested torch.Tensor inputs.
  • Kept nested tensor selection via unbind() + itemgetter, returning a list.
  • Updated the unit test expectations for regular tensor selection.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| transfer_queue/storage/managers/simple_backend_manager.py | Changes tensor selection for put-routing to avoid fancy-indexing copies by returning per-position views for regular tensors. |
| tests/test_async_simple_storage_manager.py | Updates _select_by_positions test for regular tensors to match the new (list-based) return behavior. |

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.


Comment on lines +212 to +220
This method attempts to preserve zero-copy views whenever possible, while
falling back to memory-copied single tensors when indices are irregular.
This prevents severe network fragmentation (emitting too many ZMQ frames)
during serialization.

Supported data types:
- Nested tensors: unbind → select → return as a list of views (zero-copy).
- Regular tensors: Checks for constant-stride to return a single sliced view.
Falls back to `index_select` (memory copy) to ensure a single buffer.

Copilot AI Mar 20, 2026

The docstring/commentary suggests the constant-stride path is “zero-copy” and avoids memory allocation during serialization, but a strided slice (step > 1) produces a non-contiguous view and MsgpackEncoder._encode_regular_tensor() forces contiguous() before buffer extraction, causing a copy anyway. Consider clarifying the docstring to distinguish “shares storage” vs “end-to-end zero-copy”, and/or limiting the no-copy path to contiguous slices (step == 1) if the goal is to avoid downstream copies.

Suggested change
- This method attempts to preserve zero-copy views whenever possible, while
- falling back to memory-copied single tensors when indices are irregular.
- This prevents severe network fragmentation (emitting too many ZMQ frames)
- during serialization.
-
- Supported data types:
- - Nested tensors: unbind → select → return as a list of views (zero-copy).
- - Regular tensors: Checks for constant-stride to return a single sliced view.
-   Falls back to `index_select` (memory copy) to ensure a single buffer.
+ This method attempts to preserve views that share storage with the original
+ data whenever possible, while falling back to memory-copied tensors when
+ indices are irregular or a contiguous layout is required. The goal is to
+ reduce severe network fragmentation (emitting too many ZMQ frames) during
+ serialization, not to guarantee end-to-end zero-copy encoding.
+
+ Supported data types:
+ - Nested tensors: unbind → select → return as a list of views (shared
+   storage with the original nested tensor; downstream encoders may still
+   materialize copies if they require contiguity).
+ - Regular tensors: Checks for constant-stride to return a single sliced
+   view that shares storage with the original tensor. Falls back to
+   `index_select` (memory copy) to ensure a single contiguous buffer when
+   a view would be too fragmented or incompatible with the encoder.
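
The distinction this comment draws between "shares storage" and "end-to-end zero-copy" can be checked with standard PyTorch calls. The sketch below is illustrative: the tensor shape is an assumption, and it does not reproduce `MsgpackEncoder` itself, only the `contiguous()` call the comment refers to.

```python
import torch

# Illustrative tensor; the shape is an assumption for this sketch.
tensor = torch.arange(12, dtype=torch.float32).reshape(6, 2)

strided = tensor[0:6:2]  # step > 1: shares storage but skips rows
dense = tensor[1:4]      # step == 1: shares storage and stays contiguous

strided_is_contiguous = strided.is_contiguous()
dense_is_contiguous = dense.is_contiguous()

# contiguous() has to materialize a new buffer for the strided view...
strided_was_copied = strided.contiguous().data_ptr() != strided.data_ptr()

# ...but returns the same memory when the layout is already contiguous.
dense_was_copied = dense.contiguous().data_ptr() != dense.data_ptr()
```

In this sketch the strided view defers the copy until `contiguous()` is called, while the step-1 slice never needs one, which matches the comment's point that only contiguous slices avoid downstream copies.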

@0oshowero0 0oshowero0 changed the title from "[perf] Prevent memory copy for putting regular tensor" to "[perf] Reduce memory peak time for putting regular tensor" Mar 20, 2026

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 5 comments.


@0oshowero0 0oshowero0 merged commit c4e86ed into Ascend:main Mar 20, 2026
5 checks passed
@0oshowero0 0oshowero0 deleted the mem_opt branch April 2, 2026 07:22