Merged
30 commits
221fd70
refactor perftest
tianyi-ge Mar 19, 2026
bf8478a
fixed review comments
tianyi-ge Mar 20, 2026
06ffebb
1. adjust default storage unit number to 1
tianyi-ge Mar 20, 2026
83c9fb0
1. the current `backend` arg is "default", "yuanrong", and "mooncake"…
tianyi-ge Mar 23, 2026
a9f70bd
reduce num_cpus for ci
tianyi-ge Mar 23, 2026
c43519c
reduce perftest ci timeout to 10 min
tianyi-ge Mar 23, 2026
c3f69a8
fix ci
tianyi-ge Mar 23, 2026
a641e95
1. use transfer_queue/config.yaml instead of new configs
tianyi-ge Mar 23, 2026
eb9112b
squash all commits
0oshowero0 Mar 24, 2026
d93c7aa
Merge pull request #1 from 0oshowero0/han/performance_test
tianyi-ge Mar 24, 2026
fa5b131
add license to draw_figure.py
tianyi-ge Mar 25, 2026
537b7c6
simplify run_perf_test.sh
tianyi-ge Mar 25, 2026
8a17c19
change client host for yuanrong
tianyi-ge Mar 25, 2026
60cdcaa
use d2h and h2d instead of d2d
tianyi-ge Mar 25, 2026
cdccf6d
fix nested tensor for NPU
0oshowero0 Mar 25, 2026
d043cf9
1. delete old samples
tianyi-ge Mar 25, 2026
dc51c26
kv_batch_delete -> kv_clear
tianyi-ge Mar 25, 2026
8d621d0
clean test data
tianyi-ge Mar 25, 2026
590af45
update test scenario and optimize data gen speed
0oshowero0 Mar 26, 2026
5446dfe
update readme
0oshowero0 Mar 26, 2026
b918ec5
do not remove test data since it's being reused
tianyi-ge Mar 26, 2026
ca530af
update readme for perftest
tianyi-ge Mar 26, 2026
dbd830f
1. fix bar order in draw_figure.py
tianyi-ge Mar 26, 2026
eb380c2
fix incorrect init yr client from controller; otherwise all yr client…
tianyi-ge Mar 27, 2026
b68d267
add simple case
0oshowero0 Mar 27, 2026
a020706
remove host config for yuanrong; auto-detect instead
tianyi-ge Mar 28, 2026
28313dd
1. move find reachable ip to yuanrong client
tianyi-ge Mar 28, 2026
f278f8e
fix comments
tianyi-ge Mar 28, 2026
4797705
fix figure drawing
0oshowero0 Mar 28, 2026
49d1139
update large test config
tianyi-ge Mar 28, 2026
60 changes: 60 additions & 0 deletions .github/workflows/perftest.yml (new file)

```yaml
# This workflow runs the SimpleStorage performance test
name: Performance Test

on:
  push:
    branches:
      - main
      - v0.*
  pull_request:
    branches:
      - main
      - v0.*

jobs:
  perftest:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.11"]

    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
          pip install -e .
      - name: Start Ray cluster
        run: |
          # Get the host IP
          HOST_IP=$(hostname -I | awk '{print $1}')
          echo "Host IP: $HOST_IP"
          # Start Ray with node resource
          ray start --head --resources="{\"node:$HOST_IP\":1}"
          ray status
      - name: Run SimpleStorage performance test
        run: |
          # Get the host IP
          HOST_IP=$(hostname -I | awk '{print $1}')
          echo "Host IP: $HOST_IP"
          # Run the perftest with small batch size for quick test
          cd scripts/performance_test
          python perftest.py \
            --backend_config=../../transfer_queue/config.yaml \
            --device=cpu \
            --global_batch_size=128 \
            --field_num=4 \
            --seq_len=1024 \
            --head_node_ip=$HOST_IP \
            --output_csv=results.csv
      - name: Stop Ray cluster
        run: |
          ray stop
        if: always()
```
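The CI steps above derive the head-node address with `hostname -I | awk '{print $1}'`, and several commits in this PR (e.g. "move find reachable ip to yuanrong client") replace explicit host configuration with the same kind of auto-detection. A minimal stdlib-only sketch of reachable-IP detection; the function name and probe address are illustrative, not the TransferQueue API:

```python
import socket

def find_reachable_ip(probe_addr=("8.8.8.8", 80), fallback="127.0.0.1"):
    """Return the IP of the interface the OS would use for outbound traffic.

    Connecting a UDP socket sends no packets; it only asks the kernel to
    select a route, so no real network round-trip happens.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(probe_addr)
        return s.getsockname()[0]
    except OSError:
        # No route available (e.g. offline machine): fall back to loopback.
        return fallback
    finally:
        s.close()

print(find_reachable_ip())
```

Unlike `hostname -I`, which may list several addresses, this picks the one the kernel would actually route through, which is what a multi-node Ray cluster needs.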
3 changes: 3 additions & 0 deletions .gitignore

```diff
@@ -220,3 +220,6 @@ __marimo__/

 #MacOS
 **/.DS_Store
+
+# Perftest
+scripts/performance_test/results/
```
16 changes: 6 additions & 10 deletions docs/storage_backends/openyuanrong_datasystem.md

```diff
@@ -132,11 +132,11 @@ from transfer_queue import (
     TransferQueueController,
     process_zmq_server_info,
 )
-# host, port, manager_type and client_name are the config for booting the datasystem.
+# port, manager_type and client_name are the config for booting the datasystem.
+# host will be auto-detected by checking local IP addresses.
 config_str = """
 manager_type: YuanrongStorageManager
 client_name: YuanrongStorageClient
-host: 127.0.0.1
 port: 31501
 """
 dict_conf = OmegaConf.create(config_str, flags={"allow_objects": True})
@@ -360,26 +360,22 @@ def main():
     config_str = """
     manager_type: YuanrongStorageManager
     client_name: YuanrongStorageClient
-    host: 10.170.27.24
     port: 31501
     """
     dict_conf = OmegaConf.create(config_str, flags={"allow_objects": True})
     # It is important to pay attention to the controller's lifecycle.
     controller, dict_conf.controller_info = initialize_controller()

-    conf_writer = dict_conf.copy()
-    conf_writer.host = HEAD_NODE_IP
-    conf_reader = dict_conf.copy()
-    conf_reader.host = WORKER_NODE_IP
-
+    # Note: host is auto-detected on each node, no need to configure explicitly
     data = TensorDict({ "prompt": torch.ones(3, 512), "big_tensor": torch.randn(3,1024,1024)}, batch_size=[3])
     # you could assign npu or gpu devices by 'resources'
+    # resources={f"node:{HEAD_NODE_IP}": 0.001} could Force the actor to run on HEAD_NODE
     writer = TransferQueueClientActor.options(
         resources={f"node:{HEAD_NODE_IP}": 0.001},
-    ).remote(conf_writer, "train")
+    ).remote(dict_conf, "train")
     reader = TransferQueueClientActor.options(
         resources={f"node:{WORKER_NODE_IP}": 0.001}
-    ).remote(conf_reader, "rollout")
+    ).remote(dict_conf, "rollout")

     ray.get(writer.put.remote(data=data, partition_id="train_0"))
```
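The docs change drops the per-node `conf_writer`/`conf_reader` copies: once each client resolves its own host, both actors can share one config object. A stdlib-only sketch of that before/after, where plain dicts stand in for the OmegaConf object and all names and addresses are illustrative:

```python
import copy

base_conf = {
    "manager_type": "YuanrongStorageManager",
    "client_name": "YuanrongStorageClient",
    "port": 31501,
}

# Before: each node needed its own copy with the host pinned explicitly.
def conf_for_node(conf, host):
    node_conf = copy.deepcopy(conf)
    node_conf["host"] = host
    return node_conf

writer_conf = conf_for_node(base_conf, "10.0.0.1")  # head node
reader_conf = conf_for_node(base_conf, "10.0.0.2")  # worker node

# After: no host key at all -- each client auto-detects it locally,
# so every actor can receive the very same shared config object.
shared_conf = base_conf
assert "host" not in shared_conf
print(writer_conf["host"], reader_conf["host"])
```

The design win is that the config becomes node-agnostic: adding a node no longer requires minting another copy with a hand-maintained address.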