[Performance] RDT vs Direct YuanrongStore Backend Performance Analysis #31
Description
Background
Currently, TransferQueue supports multiple transmission backends, including:
- RDT (Ray Direct Transport): Ray's native distributed transmission mechanism, which typically uses NIXL as the underlying GPU-to-GPU transport
- YR (YuanrongStore): a high-performance data transmission system optimized for the NPU architecture; functionally similar to NIXL, but better suited to Ascend hardware
The purpose of this issue is to benchmark and analyze the performance overhead introduced by the RDT (Ray Direct Transport) layer when YuanrongStore serves as the underlying transmission backend. We compared two scenarios in a cross-node NPU environment:
- TQ-RDT-YR: TransferQueue uses RDT as the backend, and RDT in turn uses YuanrongStore for data transmission.
- TQ-YR: TransferQueue uses YuanrongStore directly for data transmission.
Key Objectives
Evaluate the performance cost of the RDT layer.
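To make the comparison concrete, the measurement itself can be sketched as a simple timing loop around the client's `get` calls. The sketch below is an assumption about the test shape, not the actual harness; `_DummyClient`, `benchmark_get`, and the 1 KB payload are hypothetical stand-ins for the real TQ/YR client APIs.

```python
import time

class _DummyClient:
    """Hypothetical stand-in for a real TQ or YR client."""
    def get(self, key):
        return b"\x00" * 1024  # pretend each key holds a 1 KB payload

def benchmark_get(client, keys, total_bytes):
    """Time get() over all keys and derive throughput in GB/s."""
    start = time.perf_counter()
    for key in keys:
        client.get(key)
    elapsed = time.perf_counter() - start
    return elapsed, total_bytes / elapsed / 1e9

elapsed, bw = benchmark_get(_DummyClient(), range(100), 100 * 1024)
```

Running the same loop once against the TQ-RDT-YR stack and once against TQ-YR isolates the cost added by the RDT layer.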
Test Results
TQ-RDT-YR Get Cost
| seq_length | total_size | TQ Client get (s) | RayStorageClient get (s) | TQ Client BW (GB/s) | RayStorageClient BW (GB/s) |
|---|---|---|---|---|---|
| 32 KB | 64 MB | 0.86 | 0.85 | 0.07 | 0.07 |
| 128 KB | 256 MB | 0.85 | 0.83 | 0.30 | 0.30 |
| 512 KB | 1 GB | 0.93 | 0.91 | 1.07 | 1.10 |
| 1 MB | 2 GB | 0.95 | 0.94 | 2.10 | 2.13 |
| 2 MB | 4 GB | 1.13 | 1.11 | 3.55 | 3.61 |
| 4 MB | 8 GB | 1.43 | 1.41 | 5.58 | 5.68 |
| 8 MB | 16 GB | 13.60 | 1.80 | 1.18 | 8.88 |
TQ-YR Get Cost
| seq_length | total_size | TQ Client get (s) | RayStorageClient get (s) | TQ Client BW (GB/s) | RayStorageClient BW (GB/s) |
|---|---|---|---|---|---|
| 32 KB | 64 MB | 0.14 | 0.12 | 0.46 | 0.52 |
| 128 KB | 256 MB | 0.15 | 0.13 | 4.69 | 1.87 |
| 512 KB | 1 GB | 0.14 | 0.12 | 7.18 | 8.07 |
| 1 MB | 2 GB | 0.17 | 0.15 | 12.10 | 13.35 |
| 2 MB | 4 GB | 0.26 | 0.25 | 15.20 | 16.21 |
| 4 MB | 8 GB | 0.51 | 0.47 | 15.71 | 17.13 |
| 8 MB | 16 GB | 12.77 | 0.90 | 1.25 | 17.85 |
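Most bandwidth entries in the tables appear to be derived simply as `total_size / get time`; spot-checking the 8 MB / 16 GB row of the TQ-RDT-YR table reproduces both columns to within rounding. A minimal sketch of that derivation:

```python
def bandwidth_gbps(total_gb: float, seconds: float) -> float:
    """Bandwidth in GB/s from payload size (GB) and get() latency (s)."""
    return total_gb / seconds

# Spot-check the 8 MB / 16 GB row of the TQ-RDT-YR table:
tq_bw = bandwidth_gbps(16, 13.60)   # ~1.18 GB/s (TQ Client column)
ray_bw = bandwidth_gbps(16, 1.80)   # ~8.89 GB/s (RayStorageClient column)
```

This back-of-envelope check assumes decimal GB; a few entries (e.g. the 256 MB TQ Client bandwidth of 4.69 in the TQ-YR table) do not fit this formula and may reflect a different measurement window.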
Proposed Analysis
Regarding the time difference in `RayStorageClient.get` between the two scenarios, we speculate that it is because YuanrongStore performs batch processing, fetching all items together in a single batch, while RayStore processes them individually: each key is a separate object reference, so even when `yr.get` is invoked, only one object can be transferred at a time.
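The batching hypothesis above can be illustrated with a toy cost model: a per-key `get` pays the round-trip latency once per key, while a batched `get` pays it once in total. All numbers here are illustrative assumptions, not measurements; the ~2048 key count matches the 8 MB / 16 GB row (16 GB / 8 MB = 2048 keys).

```python
def per_key_cost(n_keys, rtt, per_byte, bytes_per_key):
    """N separate get() calls: pay the round-trip latency once per key."""
    return n_keys * (rtt + per_byte * bytes_per_key)

def batched_cost(n_keys, rtt, per_byte, bytes_per_key):
    """One batched get(): a single round trip amortized over all keys."""
    return rtt + per_byte * n_keys * bytes_per_key

# Illustrative parameters: 2048 keys of 8 MB each, 0.4 ms round trip,
# transfer time proportional to payload size over a ~17 GB/s link.
n, rtt = 2048, 0.4e-3
per_byte, size = 1 / 17e9, 8 * 2**20
slow = per_key_cost(n, rtt, per_byte, size)   # latency paid n times
fast = batched_cost(n, rtt, per_byte, size)   # latency paid once
```

Under this model the gap between the two strategies is exactly `(n_keys - 1) * rtt`, which grows linearly with the number of keys and would explain why the per-key path falls further behind as `seq_length` shrinks or the batch grows.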
Remaining Doubts
- When the total data volume reaches 16 GB, the time taken from the TQ Client to RayStore/YRStore increases sharply. Looking at the green section in the figure below, you can see that it expands significantly once the capacity reaches 16 GB.
