[Performance] RDT vs Direct YuanrongStore Backend Performance Analysis #31

@Evelynn-V

Background

Currently, TransferQueue supports multiple transmission backends, including:

  • RDT (Ray Direct Transport): Ray's native distributed transmission mechanism, which usually uses NIXL as the underlying transport between GPUs

  • YR (YuanrongStore): A high-performance data transmission system optimized for the NPU architecture, similar in function to NIXL but better suited to Ascend hardware

The purpose of this issue is to benchmark and analyze the performance overhead introduced by RDT (Ray Direct Transport) when YuanrongStore is used as the transmission backend. We ran a comparative test of two scenarios in a cross-node NPU environment:

  • TQ-RDT-YR: TransferQueue using RDT as the backend, where RDT uses YuanrongStore for data transmission.

  • TQ-YR: TransferQueue using YuanrongStore directly for data transmission.

Key Objectives

Evaluate the performance cost of the RDT layer.

Test Results

TQ-RDT-YR Get Cost

| seq_length | total_size | TQ Client get (s) | Ray Storage Client get (s) | Bandwidth (GB/s) | BW_RayStorageClient (GB/s) |
|---|---|---|---|---|---|
| 32 KB | 64 MB | 0.86 | 0.85 | 0.07 | 0.07 |
| 128 KB | 256 MB | 0.85 | 0.83 | 0.30 | 0.30 |
| 512 KB | 1 GB | 0.93 | 0.91 | 1.07 | 1.10 |
| 1 MB | 2 GB | 0.95 | 0.94 | 2.10 | 2.13 |
| 2 MB | 4 GB | 1.13 | 1.11 | 3.55 | 3.61 |
| 4 MB | 8 GB | 1.43 | 1.41 | 5.58 | 5.68 |
| 8 MB | 16 GB | 13.60 | 1.80 | 1.18 | 8.88 |

TQ-YR Get Cost

| seq_length | total_size | TQ Client get (s) | Ray Storage Client get (s) | Bandwidth (GB/s) | BW_RayStorageClient (GB/s) |
|---|---|---|---|---|---|
| 32 KB | 64 MB | 0.14 | 0.12 | 0.46 | 0.52 |
| 128 KB | 256 MB | 0.15 | 0.13 | 4.69 | 1.87 |
| 512 KB | 1 GB | 0.14 | 0.12 | 7.18 | 8.07 |
| 1 MB | 2 GB | 0.17 | 0.15 | 12.10 | 13.35 |
| 2 MB | 4 GB | 0.26 | 0.25 | 15.20 | 16.21 |
| 4 MB | 8 GB | 0.51 | 0.47 | 15.71 | 17.13 |
| 8 MB | 16 GB | 12.77 | 0.90 | 1.25 | 17.85 |

Image
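
The bandwidth columns appear to be total_size divided by the corresponding get time. The sketch below recomputes the TQ Client bandwidth from the two tables and lists rows whose reported value deviates from size / time. It assumes 1 GB = 1024 MB (the issue does not state the convention), and the 10% tolerance is an arbitrary choice for illustration.

```python
# Rows copied from the tables above:
# (seq_length, total_size in GB, TQ Client get in s, reported Bandwidth in GB/s).
# Assumption: 1 GB = 1024 MB, so 64 MB = 0.0625 GB and 256 MB = 0.25 GB.
ROWS = {
    "TQ-RDT-YR": [
        ("32 KB", 0.0625, 0.86, 0.07),
        ("128 KB", 0.25, 0.85, 0.30),
        ("512 KB", 1, 0.93, 1.07),
        ("1 MB", 2, 0.95, 2.10),
        ("2 MB", 4, 1.13, 3.55),
        ("4 MB", 8, 1.43, 5.58),
        ("8 MB", 16, 13.60, 1.18),
    ],
    "TQ-YR": [
        ("32 KB", 0.0625, 0.14, 0.46),
        ("128 KB", 0.25, 0.15, 4.69),
        ("512 KB", 1, 0.14, 7.18),
        ("1 MB", 2, 0.17, 12.10),
        ("2 MB", 4, 0.26, 15.20),
        ("4 MB", 8, 0.51, 15.71),
        ("8 MB", 16, 12.77, 1.25),
    ],
}


def flag_inconsistent(rows, tolerance=0.10):
    """Return rows whose reported bandwidth differs from size / time by more than `tolerance`."""
    flagged = []
    for scenario, entries in rows.items():
        for seq_length, size_gb, seconds, reported in entries:
            computed = size_gb / seconds
            if abs(reported - computed) / computed > tolerance:
                flagged.append((scenario, seq_length, round(computed, 2), reported))
    return flagged


for row in flag_inconsistent(ROWS):
    print(row)
```

With these inputs only the TQ-YR 128 KB row is flagged: 256 MB in 0.15 s gives about 1.67 GB/s rather than the reported 4.69 GB/s, so that cell may be worth re-checking against the raw measurements.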

Proposed Analysis

We speculate that the difference in RayStorageClient.get time between the two scenarios arises because YRStore fetches all items together in a single batch, whereas RayStore processes them individually: each key is a separate reference, so even when yr.get is called, only one item can be transferred at a time.
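
One way to probe this hypothesis with the numbers above is to model the storage client get time as a fixed cost plus a per-GB cost. The sketch below is a plain least-squares fit over the seven measured sizes. It assumes 1 GB = 1024 MB and that seq_length is the per-item size, so that every row holds total_size / seq_length = 2048 items; neither assumption is stated in the issue. A larger fixed cost with a similar per-GB cost would be consistent with per-key overhead, although it would not prove it.

```python
# Storage client get times (s) copied from the two tables, indexed by total size in GB.
# Assumption: 1 GB = 1024 MB.
SIZES_GB = [0.0625, 0.25, 1, 2, 4, 8, 16]
GET_SECONDS = {
    "TQ-RDT-YR": [0.85, 0.83, 0.91, 0.94, 1.11, 1.41, 1.80],
    "TQ-YR": [0.12, 0.13, 0.12, 0.15, 0.25, 0.47, 0.90],
}
# Assumption: seq_length is the size of one item, so each row has
# total_size / seq_length = 2048 items (e.g. 64 MB / 32 KB).
N_ITEMS = 2048


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


fits = {name: fit_line(SIZES_GB, ys) for name, ys in GET_SECONDS.items()}
for name, (slope, intercept) in fits.items():
    print(f"{name}: fixed cost ~ {intercept:.2f} s, per-GB cost ~ {slope:.3f} s/GB")

# Difference in fixed cost, spread over the assumed number of items.
fixed_gap = fits["TQ-RDT-YR"][1] - fits["TQ-YR"][1]
print(f"extra fixed cost per item ~ {fixed_gap / N_ITEMS * 1e3:.2f} ms")
```

On this data the fit gives a fixed cost of roughly 0.84 s for TQ-RDT-YR against roughly 0.08 s for TQ-YR, while the per-GB costs are close (about 0.06 s/GB and 0.05 s/GB). Under the 2048-item assumption, the gap works out to a few tenths of a millisecond per item.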

Remaining Doubts

  • When the total data volume reaches 16 GB, the time spent between the TQ Client and RayStore/YRStore increases sharply in both scenarios: TQ Client get takes 13.60 s (TQ-RDT-YR) and 12.77 s (TQ-YR), while the storage client get takes only 1.80 s and 0.90 s. The green section in the figure below shows this jump at 16 GB.
Image
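
To help localize the jump, the time spent outside the storage client can be estimated as TQ Client get minus storage client get, using the values from the tables above. This is only a rough sketch: the subtraction assumes the storage client call is nested inside the TQ Client call, which the column names suggest but the issue does not state.

```python
# (seq_length, TQ Client get in s, storage client get in s), copied from the tables above.
TIMINGS = {
    "TQ-RDT-YR": [
        ("32 KB", 0.86, 0.85), ("128 KB", 0.85, 0.83), ("512 KB", 0.93, 0.91),
        ("1 MB", 0.95, 0.94), ("2 MB", 1.13, 1.11), ("4 MB", 1.43, 1.41),
        ("8 MB", 13.60, 1.80),
    ],
    "TQ-YR": [
        ("32 KB", 0.14, 0.12), ("128 KB", 0.15, 0.13), ("512 KB", 0.14, 0.12),
        ("1 MB", 0.17, 0.15), ("2 MB", 0.26, 0.25), ("4 MB", 0.51, 0.47),
        ("8 MB", 12.77, 0.90),
    ],
}


def client_overhead(timings):
    """Time spent outside the storage client, per scenario and seq_length.

    Assumption: the storage client get is nested inside the TQ Client get.
    """
    return {
        scenario: {seq: round(client - storage, 2) for seq, client, storage in rows}
        for scenario, rows in timings.items()
    }


overhead = client_overhead(TIMINGS)
for scenario, per_size in overhead.items():
    print(scenario, per_size)
```

Up to a seq_length of 4 MB the difference stays at or below 0.04 s in both scenarios; at 8 MB (16 GB total) it is about 11.80 s for TQ-RDT-YR and 11.87 s for TQ-YR. Because the jump is nearly the same size on both paths, the extra time may come from a layer shared by both scenarios rather than from RDT itself, though profiling would be needed to confirm this.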
