[Perf] Refactor performance test for different kv store backends #52
0oshowero0 merged 30 commits into Ascend:main from
CLA Signature Guide: @tianyi-ge, thanks for your pull request. The following commit(s) are not associated with a signed Contributor License Agreement (CLA).
To sign the CLA, click here. To check whether your email is configured correctly, refer to the FAQs. Once you've signed the CLA or updated your email, please comment
1e71d49 to c8e038d
/check-cla
CLA Signature Pass: tianyi-ge, thanks for your pull request. All authors of the commits have signed the CLA. 👍
scripts/perftest.py
Outdated
if self.device in ["npu", "gpu"]:
    device_resource = {self.device: 1}
Should the device name be uppercase?
Yes, I fixed it, and I handled the gpu case as well.
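For illustration, a minimal sketch of why the casing matters: Ray requests GPUs through its built-in `num_gpus` option, while other accelerators go through the `resources` dict, whose keys are matched as exact, case-sensitive strings. The helper name `ray_options_for_device` and the uppercase `"NPU"` key are assumptions, not code from this PR; the function only builds the options dict and does not import Ray.

```python
def ray_options_for_device(device: str) -> dict:
    """Build keyword options for a Ray actor from a device string.

    Hypothetical helper: the "NPU" key is an assumption and must match the
    resource name the cluster actually advertises.
    """
    device = device.lower()
    if device == "cpu":
        return {}  # no accelerator requested
    if device == "gpu":
        return {"num_gpus": 1}  # GPUs use Ray's built-in num_gpus option
    if device == "npu":
        return {"resources": {"NPU": 1}}  # custom resource keys are case-sensitive
    raise ValueError(f"Unknown device: {device}")


print(ray_options_for_device("npu"))
```

Under these assumptions the result could be passed along as `TQClientActor.options(**ray_options_for_device(device))`.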
scripts/perftest.py
Outdated
self.test_data, self.total_data_size_gb = create_complex_test_case(batch_size, seq_length, field_num, device)
return list(self.test_data.keys()), self.total_data_size_gb
Is create_complex_test_case called inside TQClientActor to avoid one extra copy?
Yes. That way we don't need to pass this large dataset from the driver to the writer.
scripts/perftest.py
Outdated
# Initialize storage managers
logger.info(f"Using {self.manager_type} as storage backend.")

w = self.writer.initialize_storage_manager.remote(manager_type=self.manager_type, config=self.writer_config)
r = self.reader.initialize_storage_manager.remote(manager_type=self.manager_type, config=self.reader_config)
Perhaps we should move initialize_storage_manager into TQClientActor.__init__.
I think it's fine as it is for now.
scripts/README_PERFTEST.md
Outdated
### Inter-node test with yuanrong backend
```bash
python perftest.py --backend=yuanrong --client_placement=inter_node \
    --backend_config=configs/yuanrong.yaml \
    --head_node_ip=192.168.0.1 --worker_node_ip=192.168.0.2
```
Yuanrong needs some setup steps first, such as starting etcd and the datasystem.
That is briefly described in the prerequisites. This doc focuses on perftest usage.
scripts/README_PERFTEST.md
Outdated
| `--backend` | Backend type: default, yuanrong, mooncake | default |
| `--client_placement` | Client placement: intra_node or inter_node | intra_node |
| `--backend_config` | Path to YAML config file (optional) | None |
| `--device` | Device: cpu, npu, gpu | cpu |
We should list which devices each backend supports.
I've described that below.
scripts/README_PERFTEST.md
Outdated
| Argument | Description | Default |
|----------|-------------|---------|
| `--backend` | Backend type: default, yuanrong, mooncake | default |
We'd better align the config structure with https://github.com/Ascend/TransferQueue/blob/main/transfer_queue/config.yaml
For --backend, we should also align with the main config and use SimpleStorage, Yuanrong, and MooncakeStore.
--backend=[default|yuanrong|mooncake] \
--client_placement=[intra_node|inter_node] \
--backend_config=xxx.yaml \
--device=[cpu|npu|gpu] \
Do we have to distinguish between npu and gpu?
It determines the target device of the tensors when they are created, such as device="cuda:0" or device="npu:0".
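As a rough sketch of that mapping, the function below turns the `--device` value into the device string used at tensor creation. The name `torch_device_string` and the fixed index argument are assumptions for illustration; the function builds only the string and does not import torch (an `npu` device typically also needs the Ascend PyTorch extension installed).

```python
def torch_device_string(device: str, index: int = 0) -> str:
    """Map a perftest --device value to a torch-style device string.

    Hypothetical helper: "gpu" maps to the "cuda" device type, "npu" to "npu".
    """
    if device == "cpu":
        return "cpu"
    prefixes = {"gpu": "cuda", "npu": "npu"}
    if device not in prefixes:
        raise ValueError(f"Unknown device: {device}")
    return f"{prefixes[device]}:{index}"


for name in ("cpu", "gpu", "npu"):
    print(name, "->", torch_device_string(name))
```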
scripts/README_PERFTEST.md
Outdated
## Examples

### Intra-node test with default backend
We may need to explain the expected behavior for each kind of backend.
scripts/perftest.py
Outdated
    # Tensor field
    tensor_data = torch.randn(batch_size, seq_length, dtype=torch.float32, device=torch_device)
    fields[field_name] = tensor_data
else:
I'm thinking of deleting the non-tensor part. A performance test cannot cover every scenario. We can just illustrate tensor performance and let users implement and test their own cases.
scripts/perftest.py
Outdated
"""Ray actor that holds a TransferQueueClient."""

def __init__(self, client_id: str, controller_info: Any):
    self.client = TransferQueueClient(
scripts/perftest.py
Outdated
    """Put data to storage."""
    self.client.put(data=self.test_data, partition_id=partition_id)

def get_meta(
I think we'd better illustrate the high-level KV API to lower the cost of understanding TQ.
scripts/perftest.py
Outdated
def _get_manager_type(self) -> str:
    """Get the storage manager type based on backend."""
    if self.backend == "default":
        return "AsyncSimpleStorageManager"
    elif self.backend == "yuanrong":
        return "YuanrongStorageManager"
    elif self.backend == "mooncake":
        return "MooncakeStorageManager"
    else:
        raise ValueError(f"Unknown backend: {self.backend}")
Just align this with the main config.
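One way this alignment could look, as a sketch only: key the lookup on the backend names mentioned in this review (SimpleStorage, Yuanrong, MooncakeStore) instead of the script-local aliases. The pairing of each name with a manager type below is an assumption carried over from the snippet above, and `MANAGER_TYPES` and `get_manager_type` are hypothetical names.

```python
# Assumed pairing of main-config backend names with the manager types from the snippet.
MANAGER_TYPES = {
    "SimpleStorage": "AsyncSimpleStorageManager",
    "Yuanrong": "YuanrongStorageManager",
    "MooncakeStore": "MooncakeStorageManager",
}


def get_manager_type(backend: str) -> str:
    """Return the storage manager type for a backend name from the main config."""
    try:
        return MANAGER_TYPES[backend]
    except KeyError:
        expected = ", ".join(sorted(MANAGER_TYPES))
        raise ValueError(f"Unknown backend: {backend!r} (expected one of: {expected})") from None


print(get_manager_type("Yuanrong"))
```

A dict keeps the accepted names in one place, so the error message and the lookup cannot drift apart.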
scripts/perftest.py
Outdated
self.data_system_storage_units = {}

if storage_unit_placement == "remote":
For the SimpleStorage backend, we can just illustrate the common use case, where data is distributed across all nodes. We can provide another script to manually validate the bandwidth efficiency.
scripts/perftest.py
Outdated
def _initialize_clients(self) -> None:
    """Initialize writer and reader TQClientActors."""
    # Determine node placement
    if self.client_placement == "intra_node":
Should we keep this config, given that:
- We only demonstrate the normal usage for the SimpleStorage backend
- It doesn't affect MooncakeStore
- For Yuanrong, only inter_node is reasonable, since it prefers local storage by default?
ray start --head --resources='{"node:192.168.0.1":1}'

# On worker node
ray start --address=192.168.0.1 --resources='{"node:192.168.0.2":1}'
I will modify it later.
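For context, a hedged sketch of how a `node:<ip>` resource like the ones in these commands can pin an actor to a specific node: the actor requests a small fraction of that resource, so Ray can only place it where the resource exists. The helper name `node_affinity_options` and the `0.001` fraction are assumptions; the function only builds the options dict and does not import Ray.

```python
def node_affinity_options(node_ip: str, fraction: float = 0.001) -> dict:
    """Build Ray options that place an actor on the node owning `node:<ip>`.

    Hypothetical helper: a tiny fraction is requested so that many actors
    can share the same node resource.
    """
    return {"resources": {f"node:{node_ip}": fraction}}


writer_options = node_affinity_options("192.168.0.1")  # head node
reader_options = node_affinity_options("192.168.0.2")  # worker node
print(writer_options, reader_options)
```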
1. support different kv backends
2. support intra-node and inter-node client placement
3. remove ray bandwidth test
Signed-off-by: tianyi-ge <tianyig@outlook.com>
Signed-off-by: tianyi-ge <tianyig@outlook.com>
2. remove delete time stats Signed-off-by: tianyi-ge <tianyig@outlook.com>
dscli start --cpunodebind 6 --localalloc --timeout 600 -w --worker_address 10.170.27.237:31501 --etcd_address 10.170.27.237:2379 --arena_per_tenant 1 --max_log_size 1024 --max_log_file_num 10 --node_timeout_s 600 --node_dead_timeout_s 1800 --enable_fallocate false --enable_worker_worker_batch_get true --shared_memory_populate true --shared_memory_size_mb 40960 --remote_h2d_device_ids "0" --enable_huge_tlb true
dscli start --cpunodebind 6 --localalloc --timeout 600 -w --worker_address 10.170.27.158:31501 --etcd_address 10.170.27.237:2379 --arena_per_tenant 1 --max_log_size 1024 --max_log_file_num 10 --node_timeout_s 600 --node_dead_timeout_s 1800 --enable_fallocate false --enable_worker_worker_batch_get true --shared_memory_populate true --shared_memory_size_mb 40960 --remote_h2d_device_ids "0" --enable_huge_tlb true
…s will connect to the head node Signed-off-by: tianyi-ge <tianyig@outlook.com>
Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>
pyproject.toml
Outdated
mooncake = [
    "mooncake-transfer-engine"
]
perftest = [
Signed-off-by: tianyi-ge <tianyig@outlook.com>
self.backend = self.full_config["backend"]["storage_backend"]

# For Yuanrong, always use inter_node
self.use_inter_node = self.backend == "Yuanrong"
Actually, we can use inter_node as the default for all backends.
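As a sketch of how the backend is resolved, assuming the loaded YAML has the `backend.storage_backend` key shown above and that an optional command-line value overrides it: the helper name `apply_backend_override` is hypothetical, and the config is a plain dict here so the example runs without a YAML file.

```python
import copy
from typing import Optional


def apply_backend_override(full_config: dict, backend: Optional[str] = None) -> dict:
    """Return a copy of the config with storage_backend replaced when an override is given."""
    config = copy.deepcopy(full_config)  # leave the loaded config untouched
    if backend is not None:
        config.setdefault("backend", {})["storage_backend"] = backend
    return config


loaded = {"backend": {"storage_backend": "SimpleStorage"}}
config = apply_backend_override(loaded, backend="Yuanrong")
print(config["backend"]["storage_backend"])
```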
transfer_queue/utils/common.py
Outdated
    return env_value_lower in true_values


def get_local_ip_addresses() -> list[str]:
We can put these functions in yuanrong_client.py, since only Yuanrong needs these utilities for now.
# Memory segment size in bytes for mounting (default: 4GB)
global_segment_size: 4294967296
# Local buffer size in bytes (default: 1GB)
local_buffer_size: 1073741824
Suggested change:
- # Memory segment size in bytes for mounting (default: 4GB)
- global_segment_size: 4294967296
- # Local buffer size in bytes (default: 1GB)
- local_buffer_size: 1073741824
+ # Memory segment size in bytes for mounting
+ global_segment_size: 86294967296
+ # Local buffer size in bytes
+ local_buffer_size: 86294967296
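To make the raw byte counts in this suggestion easier to compare, a small conversion sketch (the helper name `to_gib` is an assumption, and sizes are expressed in binary GiB):

```python
def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return num_bytes / 2**30


sizes = {
    "global_segment_size (old)": 4294967296,
    "local_buffer_size (old)": 1073741824,
    "suggested value": 86294967296,
}
for name, value in sizes.items():
    print(f"{name}: {value} bytes = {to_gib(value):.2f} GiB")
```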
# Address of local host. Set to "" to use Ray IP as local host address
local_hostname: ""
# Protocol for transmission. Choose from: tcp, rdma. (default: tcp)
protocol: tcp
Suggested change:
- protocol: tcp
+ protocol: rdma
2. modify default mooncake store perftest config Signed-off-by: tianyi-ge <tianyig@outlook.com>
except ValueError:
    logger.info("Some other rank has initialized TransferQueueController. Try to connect to existing controller.")
    _init_from_existing()
    return
This should not be deleted.
Signed-off-by: tianyi-ge <tianyig@outlook.com>
Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>
declare -a SETTINGS=(
    "1024,9,8192,Small"
    "4096,15,32768,Medium"
    "8192,21,128000,Large"
Suggested change:
-     "8192,21,128000,Large"
+     "8192,18,100000,Large"
### Test Matrix

- **Backends**: SimpleStorage, Yuanrong, MooncakeStore, Ray (baseline)
- **Data sizes**: Small (batch=1024, fields=9, seq=8192), Medium (batch=4096, fields=15, seq=32768), Large (batch=8192, fields=21, seq=128000)
Suggested change:
- - **Data sizes**: Small (batch=1024, fields=9, seq=8192), Medium (batch=4096, fields=15, seq=32768), Large (batch=8192, fields=21, seq=128000)
+ - **Data sizes**: Small (batch=1024, fields=9, seq=8192), Medium (batch=4096, fields=15, seq=32768), Large (batch=8192, fields=18, seq=100000)
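For a sense of scale, a back-of-the-envelope sketch of the payload behind each setting. It assumes the simple case where every field is a float32 tensor of shape (batch, seq), as in the `torch.randn(batch_size, seq_length, dtype=torch.float32)` snippet earlier in this review; nested or non-tensor fields are not modeled, and `payload_gib` is a hypothetical helper.

```python
BYTES_PER_FLOAT32 = 4


def payload_gib(batch_size: int, field_num: int, seq_len: int) -> float:
    """Total payload in GiB, assuming each field is a (batch, seq) float32 tensor."""
    return batch_size * field_num * seq_len * BYTES_PER_FLOAT32 / 2**30


# Same "batch,fields,seq,name" layout as the SETTINGS entries, using the suggested Large values.
SETTINGS = ["1024,9,8192,Small", "4096,15,32768,Medium", "8192,18,100000,Large"]

for setting in SETTINGS:
    batch, fields, seq, name = setting.split(",")
    size = payload_gib(int(batch), int(fields), int(seq))
    print(f"{name}: {size:.2f} GiB")
```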
- `Yuanrong`: `cpu`, `npu`
- `MooncakeStore`: `cpu`, `gpu`

## Test Data Format
We now have two scenarios, so we need to illustrate both the simple case and the complex case: https://www.yuque.com/haomingzi-lfse7/lhp4el/tml8ke0zkgn6roey?singleDoc# 《TransferQueue Performance Test - 0.1.6》
Signed-off-by: tianyi-ge <tianyig@outlook.com>
Description
Usage
usage: perftest.py [-h] --backend_config BACKEND_CONFIG [--backend BACKEND]
                   [--device {cpu,npu,gpu}] [--global_batch_size GLOBAL_BATCH_SIZE]
                   [--field_num FIELD_NUM] [--seq_len SEQ_LEN]
                   [--num_test_iterations NUM_TEST_ITERATIONS]
                   --head_node_ip HEAD_NODE_IP [--worker_node_ip WORKER_NODE_IP]
                   [--output_csv OUTPUT_CSV] [--use_complex_case]

TransferQueue Throughput Test

options:
  -h, --help            show this help message and exit
  --backend_config BACKEND_CONFIG
                        Path to backend config YAML file
  --backend BACKEND     Override storage_backend in config (e.g. SimpleStorage, Yuanrong, MooncakeStore)
  --device {cpu,npu,gpu}
                        Device to use (default: cpu)
  --global_batch_size GLOBAL_BATCH_SIZE
                        Global batch size (default: 1024)
  --field_num FIELD_NUM
                        Number of fields (default: 10)
  --seq_len SEQ_LEN     Sequence length (default: 8192)
  --num_test_iterations NUM_TEST_ITERATIONS
                        Number of test iterations (default: 4)
  --head_node_ip HEAD_NODE_IP
                        Head node IP address
  --worker_node_ip WORKER_NODE_IP
                        Worker node IP address (required for Yuanrong)
  --output_csv OUTPUT_CSV
                        Path to output CSV file (optional)
  --use_complex_case    Use complex test case with nested tensors and nontensor fields (default: False, simple case)

closes #51
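For readers who want to see how such a command line could be declared, here is a minimal `argparse` sketch reconstructed from the help text above. It is not the script's actual code: the `int` types and the `None` defaults for optional paths are assumptions, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the options listed in the usage text (types are assumed)."""
    parser = argparse.ArgumentParser(prog="perftest.py", description="TransferQueue Throughput Test")
    parser.add_argument("--backend_config", required=True, help="Path to backend config YAML file")
    parser.add_argument("--backend", default=None, help="Override storage_backend in config")
    parser.add_argument("--device", choices=["cpu", "npu", "gpu"], default="cpu", help="Device to use")
    parser.add_argument("--global_batch_size", type=int, default=1024, help="Global batch size")
    parser.add_argument("--field_num", type=int, default=10, help="Number of fields")
    parser.add_argument("--seq_len", type=int, default=8192, help="Sequence length")
    parser.add_argument("--num_test_iterations", type=int, default=4, help="Number of test iterations")
    parser.add_argument("--head_node_ip", required=True, help="Head node IP address")
    parser.add_argument("--worker_node_ip", default=None, help="Worker node IP address")
    parser.add_argument("--output_csv", default=None, help="Path to output CSV file")
    parser.add_argument("--use_complex_case", action="store_true", help="Use the complex test case")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(
    ["--backend_config", "configs/yuanrong.yaml", "--head_node_ip", "192.168.0.1"]
)
print(args.device, args.global_batch_size, args.use_complex_case)
```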