
[Perf] Refactor performance test for different kv store backends#52

Merged
0oshowero0 merged 30 commits intoAscend:mainfrom
tianyi-ge:feat/perftest-refactor
Mar 28, 2026

Conversation

@tianyi-ge
Contributor

@tianyi-ge tianyi-ge commented Mar 19, 2026

Description

  1. Support different KV store backends.
  2. Support intra-node and inter-node client placement for Yuanrong.
  3. Output results to CSV.
  4. Remove the non-tensor part when creating the complex test case.
  5. Remove the Ray bandwidth test.
  6. Add a README for the perf test.
  7. Run each test 3 times to mitigate variance (with warmup).
  8. Use the KV client to simplify usage.
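
Item 7 (repeat runs with a warmup) can be sketched as follows; `benchmark`, `run_once`, and the warmup count are illustrative, not the PR's actual code:

```python
import statistics
import time


def benchmark(run_once, num_iterations=3, num_warmup=1):
    """Time `run_once` several times, excluding warmup runs from the stats.

    The warmup runs absorb one-off costs (connection setup, allocator
    growth) so the reported numbers reflect steady-state throughput.
    """
    for _ in range(num_warmup):
        run_once()
    timings = []
    for _ in range(num_iterations):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return {"mean_s": statistics.mean(timings), "stdev_s": statistics.pstdev(timings)}
```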

Usage

usage: perftest.py [-h] --backend_config BACKEND_CONFIG [--backend BACKEND] [--device {cpu,npu,gpu}] [--global_batch_size GLOBAL_BATCH_SIZE] [--field_num FIELD_NUM]
                   [--seq_len SEQ_LEN] [--num_test_iterations NUM_TEST_ITERATIONS] --head_node_ip HEAD_NODE_IP [--worker_node_ip WORKER_NODE_IP]
                   [--output_csv OUTPUT_CSV] [--use_complex_case]

TransferQueue Throughput Test

options:
  -h, --help            show this help message and exit
  --backend_config BACKEND_CONFIG
                        Path to backend config YAML file
  --backend BACKEND     Override storage_backend in config (e.g. SimpleStorage, Yuanrong, MooncakeStore)
  --device {cpu,npu,gpu}
                        Device to use (default: cpu)
  --global_batch_size GLOBAL_BATCH_SIZE
                        Global batch size (default: 1024)
  --field_num FIELD_NUM
                        Number of fields (default: 10)
  --seq_len SEQ_LEN     Sequence length (default: 8192)
  --num_test_iterations NUM_TEST_ITERATIONS
                        Number of test iterations (default: 4)
  --head_node_ip HEAD_NODE_IP
                        Head node IP address
  --worker_node_ip WORKER_NODE_IP
                        Worker node IP address (required for Yuanrong)
  --output_csv OUTPUT_CSV
                        Path to output CSV file (optional)
  --use_complex_case    Use complex test case with nested tensors and nontensor fields (default: False, simple case)
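
The options above map directly onto an argparse parser. A sketch consistent with the help text (defaults and choices are taken from the usage output; the builder function name is hypothetical):

```python
import argparse


def build_parser():
    """Build a parser matching the usage text above (a sketch, not the PR's exact code)."""
    p = argparse.ArgumentParser(description="TransferQueue Throughput Test")
    p.add_argument("--backend_config", required=True, help="Path to backend config YAML file")
    p.add_argument("--backend", help="Override storage_backend in config")
    p.add_argument("--device", choices=["cpu", "npu", "gpu"], default="cpu")
    p.add_argument("--global_batch_size", type=int, default=1024)
    p.add_argument("--field_num", type=int, default=10)
    p.add_argument("--seq_len", type=int, default=8192)
    p.add_argument("--num_test_iterations", type=int, default=4)
    p.add_argument("--head_node_ip", required=True)
    p.add_argument("--worker_node_ip", help="Required for Yuanrong")
    p.add_argument("--output_csv", help="Path to output CSV file (optional)")
    p.add_argument("--use_complex_case", action="store_true")
    return p
```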

closes #51

@ascend-robot

CLA Signature Guide

@tianyi-ge , thanks for your pull request.

The following commit(s) are not associated with a signed Contributor License Agreement (CLA).

Commit Reason
1e71d496 refactor perftest 1. support dif... the email used in the commit is not linked to a signed CLA!
please verify that it matches the email you used when signing the CLA.

To sign CLA, click here.

To check if your email is configured correctly, refer to the FAQs.

Once you've signed the CLA or updated your email, please comment /check-cla to revalidate the CLA status.

@tianyi-ge tianyi-ge force-pushed the feat/perftest-refactor branch from 1e71d49 to c8e038d Compare March 20, 2026 01:05
@ascend-robot

CLA Signature Guide

@tianyi-ge , thanks for your pull request.

The following commit(s) are not associated with a signed Contributor License Agreement (CLA).

Commit Reason
c8e038dd refactor perftest 1. support dif... the email used in the commit is not linked to a signed CLA!
please verify that it matches the email you used when signing the CLA.

To sign CLA, click here.

To check if your email is configured correctly, refer to the FAQs.

Once you've signed the CLA or updated your email, please comment /check-cla to revalidate the CLA status.

@tianyi-ge
Contributor Author

/check-cla

@ascend-robot

CLA Signature Pass

tianyi-ge, thanks for your pull request. All authors of the commits have signed the CLA. 👍

Comment on lines +323 to +324
if self.device in ["npu", "gpu"]:
device_resource = {self.device: 1}
Contributor

should device be uppercase?

Contributor Author

Yes, I fixed it, and handled the gpu case as well.
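
The fix discussed here amounts to uppercasing the resource key when requesting an accelerator from Ray. Ray's built-in GPU resource is spelled "GPU"; "NPU" as a custom resource is an assumption about the Ascend setup. A minimal sketch:

```python
def device_resources(device):
    """Map a lowercase --device flag to a Ray resource request (a sketch).

    Ray's built-in accelerator resource is uppercase ("GPU"); "NPU" is
    assumed to be registered as a custom resource on Ascend nodes.
    """
    if device in ("npu", "gpu"):
        return {device.upper(): 1}
    return {}  # cpu: no accelerator resource needed
```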

Comment on lines +136 to +137
self.test_data, self.total_data_size_gb = create_complex_test_case(batch_size, seq_length, field_num, device)
return list(self.test_data.keys()), self.total_data_size_gb
Contributor

Is create_complex_test_case placed inside TQClientActor to avoid an extra copy?

Contributor Author

Yes. This way we don't need to pass the large dataset from the driver to the writer.

Comment on lines +335 to +339
# Initialize storage managers
logger.info(f"Using {self.manager_type} as storage backend.")

w = self.writer.initialize_storage_manager.remote(manager_type=self.manager_type, config=self.writer_config)
r = self.reader.initialize_storage_manager.remote(manager_type=self.manager_type, config=self.reader_config)
Contributor

Perhaps we should move initialize_storage_manager to TQClientActor.__init__

Contributor Author

I think it's fine so far

Comment on lines +82 to +87
### Inter-node test with yuanrong backend
```bash
python perftest.py --backend=yuanrong --client_placement=inter_node \
--backend_config=configs/yuanrong.yaml \
--head_node_ip=192.168.0.1 --worker_node_ip=192.168.0.2
```
Contributor

Yuanrong needs some pre-operations, like starting etcd and the datasystem.

Contributor Author

It's briefly described in the prerequisites; this doc focuses on perftest usage.

Comment on lines +38 to +41
| `--backend` | Backend type: default, yuanrong, mooncake | default |
| `--client_placement` | Client placement: intra_node or inter_node | intra_node |
| `--backend_config` | Path to YAML config file (optional) | None |
| `--device` | Device: cpu, npu, gpu | cpu |
Contributor

We should list which devices each backend supports.

Contributor Author

I've described it below



| Argument | Description | Default |
|----------|-------------|---------|
| `--backend` | Backend type: default, yuanrong, mooncake | default |
Collaborator

For --backend, we should also align with the main config names: SimpleStorage, Yuanrong, and MooncakeStore.

--backend=[default|yuanrong|mooncake] \
--client_placement=[intra_node|inter_node] \
--backend_config=xxx.yaml \
--device=[cpu|npu|gpu] \
Collaborator

Do we have to distinguish npu and gpu?

Contributor Author

It's used to determine the target devices of tensors during creation, like device="cuda:0" or device="npu:0"
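
Concretely, the flag selects the tensor's target device string at creation time: "gpu" maps to CUDA ("cuda:0") and "npu" to Ascend ("npu:0"). A sketch of that mapping (the helper name is hypothetical, not the PR's code):

```python
def torch_device_string(device, index=0):
    """Translate the --device flag into a torch device string (a sketch)."""
    if device == "gpu":
        return f"cuda:{index}"  # CUDA devices are addressed as "cuda:N"
    if device == "npu":
        return f"npu:{index}"   # Ascend devices are addressed as "npu:N"
    return "cpu"
```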


## Examples

### Intra-node test with default backend
Collaborator

We may need to explain what's the expected behavior for each kind of backend.

# Tensor field
tensor_data = torch.randn(batch_size, seq_length, dtype=torch.float32, device=torch_device)
fields[field_name] = tensor_data
else:
Collaborator

I'm thinking of deleting the non-tensor part. As a performance test, we cannot cover every scenario; we can just illustrate tensor performance and let users implement and test their own cases.

"""Ray actor that holds a TransferQueueClient."""

def __init__(self, client_id: str, controller_info: Any):
self.client = TransferQueueClient(
Collaborator

Use tq.init(config)?

"""Put data to storage."""
self.client.put(data=self.test_data, partition_id=partition_id)

def get_meta(
Collaborator

I think we'd better illustrate the high-level KV API to reduce the cost of understanding TQ.

Comment on lines +227 to +236
def _get_manager_type(self) -> str:
"""Get the storage manager type based on backend."""
if self.backend == "default":
return "AsyncSimpleStorageManager"
elif self.backend == "yuanrong":
return "YuanrongStorageManager"
elif self.backend == "mooncake":
return "MooncakeStorageManager"
else:
raise ValueError(f"Unknown backend: {self.backend}")
Collaborator

Just align with the main config
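
Aligning the CLI values with the main config names would reduce _get_manager_type to a lookup keyed on SimpleStorage / Yuanrong / MooncakeStore. A sketch (the manager class names are taken from the snippet above; the lookup itself is illustrative):

```python
# Config-style backend name -> storage manager class name
MANAGER_TYPES = {
    "SimpleStorage": "AsyncSimpleStorageManager",
    "Yuanrong": "YuanrongStorageManager",
    "MooncakeStore": "MooncakeStorageManager",
}


def get_manager_type(backend):
    """Look up the storage manager class for a config-style backend name."""
    try:
        return MANAGER_TYPES[backend]
    except KeyError:
        raise ValueError(f"Unknown backend: {backend}") from None
```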


self.data_system_storage_units = {}

if storage_unit_placement == "remote":
Collaborator

For the SimpleStorage backend, we can just illustrate the common use case where we distribute data across all the nodes. We can provide another script to manually validate bandwidth efficiency.

def _initialize_clients(self) -> None:
"""Initialize writer and reader TQClientActors."""
# Determine node placement
if self.client_placement == "intra_node":
Collaborator

Should we preserve this config, since:

  1. We only demonstrate the normal usage for the SimpleStorage backend
  2. It doesn't affect MooncakeStore
  3. For Yuanrong, only inter_node is reasonable since it prefers local storage by default?
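
For reference, the placement decision can be expressed with Ray custom resources of the form node:&lt;ip&gt; (the same labels used by the ray start commands later in this thread). The helper below is a sketch; which client moves to the worker node under inter_node is my assumption, not the PR's code:

```python
def client_node_resources(placement, head_ip, worker_ip=None):
    """Return (writer_label, reader_label) Ray custom-resource names (a sketch).

    intra_node pins both clients to the head node; inter_node places the
    reader on the worker node (an assumption about the PR's behavior).
    """
    head = f"node:{head_ip}"
    if placement == "intra_node":
        return head, head
    if placement == "inter_node":
        if worker_ip is None:
            raise ValueError("inter_node placement requires a worker node IP")
        return head, f"node:{worker_ip}"
    raise ValueError(f"Unknown placement: {placement}")
```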


ray start --head --resources='{"node:192.168.0.1":1}'

# On worker node
ray start --address=192.168.0.1 --resources='{"node:192.168.0.2":1}'
Collaborator

Need port

Collaborator

I will modify it later


1. support different kv backends
2. support intra-node and inter-node client placement
3. remove ray bandwidth test

Signed-off-by: tianyi-ge <tianyig@outlook.com>
Signed-off-by: tianyi-ge <tianyig@outlook.com>

2. remove delete time stats

Signed-off-by: tianyi-ge <tianyig@outlook.com>

@tianyi-ge
Contributor Author

dscli start --cpunodebind 6 --localalloc --timeout 600 -w --worker_address 10.170.27.237:31501 --etcd_address 10.170.27.237:2379 --arena_per_tenant 1 --max_log_size 1024 --max_log_file_num 10 --node_timeout_s 600 --node_dead_timeout_s 1800 --enable_fallocate false --enable_worker_worker_batch_get true --shared_memory_populate true --shared_memory_size_mb 40960 --remote_h2d_device_ids "0" --enable_huge_tlb true
dscli start --cpunodebind 6 --localalloc --timeout 600 -w --worker_address 10.170.27.158:31501 --etcd_address 10.170.27.237:2379 --arena_per_tenant 1 --max_log_size 1024 --max_log_file_num 10 --node_timeout_s 600 --node_dead_timeout_s 1800 --enable_fallocate false --enable_worker_worker_batch_get true --shared_memory_populate true --shared_memory_size_mb 40960 --remote_h2d_device_ids "0" --enable_huge_tlb true

…s will connect to the head node

Signed-off-by: tianyi-ge <tianyig@outlook.com>

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

pyproject.toml Outdated
mooncake = [
"mooncake-transfer-engine"
]
perftest = [
Collaborator

These are not needed.

Signed-off-by: tianyi-ge <tianyig@outlook.com>
self.backend = self.full_config["backend"]["storage_backend"]

# For Yuanrong, always use inter_node
self.use_inter_node = self.backend == "Yuanrong"
Collaborator

Actually, we can use inter-node as the default for all backends.


return env_value_lower in true_values


def get_local_ip_addresses() -> list[str]:
Collaborator

We can move these functions into yuanrong_client.py, since only Yuanrong requires these utils now.

Comment on lines +39 to +42
# Memory segment size in bytes for mounting (default: 4GB)
global_segment_size: 4294967296
# Local buffer size in bytes (default: 1GB)
local_buffer_size: 1073741824
Collaborator

Suggested change
  Before: # Memory segment size in bytes for mounting (default: 4GB)
          global_segment_size: 4294967296
          # Local buffer size in bytes (default: 1GB)
          local_buffer_size: 1073741824
  After:  # Memory segment size in bytes for mounting
          global_segment_size: 86294967296
          # Local buffer size in bytes
          local_buffer_size: 86294967296

# Address of local host. Set to "" to use Ray IP as local host address
local_hostname: ""
# Protocol for transmission. Choose from: tcp, rdma. (default: tcp)
protocol: tcp
Collaborator

Suggested change
  Before: protocol: tcp
  After:  protocol: rdma

2. modify default mooncake store perftest config

Signed-off-by: tianyi-ge <tianyig@outlook.com>

except ValueError:
logger.info("Some other rank has initialized TransferQueueController. Try to connect to existing controller.")
_init_from_existing()
return
Collaborator

This should not be deleted

Signed-off-by: tianyi-ge <tianyig@outlook.com>

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

declare -a SETTINGS=(
"1024,9,8192,Small"
"4096,15,32768,Medium"
"8192,21,128000,Large"
Collaborator

Suggested change
  Before: "8192,21,128000,Large"
  After:  "8192,18,100000,Large"

### Test Matrix

- **Backends**: SimpleStorage, Yuanrong, MooncakeStore, Ray (baseline)
- **Data sizes**: Small (batch=1024, fields=9, seq=8192), Medium (batch=4096, fields=15, seq=32768), Large (batch=8192, fields=21, seq=128000)
Collaborator

Suggested change
  Before: - **Data sizes**: Small (batch=1024, fields=9, seq=8192), Medium (batch=4096, fields=15, seq=32768), Large (batch=8192, fields=21, seq=128000)
  After:  - **Data sizes**: Small (batch=1024, fields=9, seq=8192), Medium (batch=4096, fields=15, seq=32768), Large (batch=8192, fields=18, seq=100000)
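
Assuming the simple case generates one float32 tensor of shape (batch, seq_len) per field, the payload for each setting works out with a few lines of arithmetic (the Large row uses the suggested fields=18, seq=100000; the helper name is illustrative):

```python
def payload_gb(batch, fields, seq_len, bytes_per_elem=4):
    """Total data size in GiB for `fields` float32 tensors of shape (batch, seq_len)."""
    return batch * fields * seq_len * bytes_per_elem / 1024**3


# Settings from the test matrix above
for name, (b, f, s) in {
    "Small": (1024, 9, 8192),
    "Medium": (4096, 15, 32768),
    "Large": (8192, 18, 100000),
}.items():
    print(f"{name}: {payload_gb(b, f, s):.2f} GiB")
```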

- `Yuanrong`: `cpu`, `npu`
- `MooncakeStore`: `cpu`, `gpu`

## Test Data Format
Collaborator

Now we have 2 scenarios, and we need to illustrate both the simple case and the complex case: https://www.yuque.com/haomingzi-lfse7/lhp4el/tml8ke0zkgn6roey?singleDoc# 《TransferQueue Performance Test - 0.1.6》

Signed-off-by: tianyi-ge <tianyig@outlook.com>

@0oshowero0 0oshowero0 merged commit 0c3ac24 into Ascend:main Mar 28, 2026
8 checks passed


Development

Successfully merging this pull request may close these issues.

[RFC][Perf] Refactor performance test for different kv store backends

4 participants