[WIP][experimental] multi turn chat benchmark#821

Draft
cquil11 wants to merge 176 commits into main from experimental/multi-turn-benchmark

Conversation

cquil11 (Collaborator) commented Feb 27, 2026

No description provided.

Rohan138 and others added 30 commits January 26, 2026 17:15
* fix AITER flags for v0.14.0 release

* drop mi325 triton gemm env var

* Add changes to perf changelog
…wont be erroneous negative diff [skip-sweep] (#571)
* remove assign

* initial

* update perf

* fix perf changelog

* trigger test sweep

* trigger test sweep pt 2

* rebase for evals only

* Update perf-changelog.yaml

* remove newline

* update perf changelog

---------

Co-authored-by: Cam Quilici <cjquilici@gmail.com>
* b300 srt slurm

* update generated srtslurm yaml

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* fix image

* add uv and sqsh file

* change partition

* change slurm account

* use regular srt

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* update perf changelog

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* fix runner

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* correct account

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* qos support

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* fix get checkout

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* update runner label and partition

* undo branch checkout

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* debug info

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* cleanup logging

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* use local model dir

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* checkout specific commit

Signed-off-by: jthomson04 <jothomson@nvidia.com>

---------

Signed-off-by: jthomson04 <jothomson@nvidia.com>
Co-authored-by: Sahithi Chigurupati <schigurupati@nvidia.com>
Co-authored-by: Sahithi Chigurupati <chigurupati.sahithi@gmail.com>
…wont be erroneous negative diff [skip-sweep] (#577)
* Update SGLang Docker Image for MI355 to v0.5.8

1. activate FP8 KV cache
2. use the MLA persistent kernel

* Do not activate FP8 KV cache and the MLA persistent kernel explicitly

* Add config-keys (v0.5.5.post3 --> v0.5.8)

* Update perf-changelog.yaml with key fix description for v0.5.8

Add description: Disables mla persistent kernel when not using fp8 kv_cache

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
30s default to 300s
* chore: save server log as artifact after single node runs

* test flaky eval

* test flaky eval

* test flaky eval

* rebase

* rebase pt 2

* add trap to upload server logs on exit

* rebase pt 3

* make server log in gha workspace

* export result filename at runtime so it is present

* revert perf changelog
* chore: add pre-merge check for newline in perf-changelog.yaml

Add a validation step in run-sweep.yml that ensures perf-changelog.yaml
ends with a newline character. This prevents negative diff issues in
subsequent PRs when the file is appended to.

Closes #578

Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>

* test

* change logic of newline check

* trigger test check

* remove test perf changelog

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
cquil11 force-pushed the experimental/multi-turn-benchmark branch from 243e96d to f8ca118 on March 24, 2026 19:18
cquil11 and others added 27 commits March 25, 2026 12:53
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Loads detailed_results.csv from kv-cache-tester trace replayer
and converts to the same per-request schema used by AIPerf,
enabling Pareto frontier plotting from trace replay sweeps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- analyze_benchmark_distributions.py now auto-detects trace replay CSV
  format in addition to AIPerf JSONL
- trace replay benchmark script calls the analysis after benchmark

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixes aggregation failing with "No experiments found" when running
trace replay sweeps. Loads detailed_results.csv from trace_replay/
directory alongside existing AIPerf JSONL and client CSV formats.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…dvancement

- Enable --parallel-subagents in benchmark script
- Change time_scale from 0.1 to 0.05 (2x faster replay)
- Update submodule to feature/parallel-subagents branch
- Exponential distribution for trace advancement (favors turn 0)
- Remove token budget limit (999999999)
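
One way an exponential distribution can favor turn 0 is sketched below. The submodule's actual implementation is not shown on this page, so this is only an illustration under assumptions: the function name `sample_start_turn`, the `rate` parameter, and the clamping of the tail onto the last turn are all hypothetical.

```python
import random
from collections import Counter


def sample_start_turn(num_turns: int, rate: float, rng: random.Random) -> int:
    """Draw a starting turn index in [0, num_turns - 1], biased toward 0."""
    if num_turns < 1:
        raise ValueError("num_turns must be >= 1")
    # expovariate(rate) has mean 1/rate; flooring it gives an integer index
    # whose probability decays with each later turn.
    turn = int(rng.expovariate(rate))
    # Clamp the long tail onto the last available turn.
    return min(turn, num_turns - 1)


rng = random.Random(0)  # fixed seed so the run is repeatable
counts = Counter(sample_start_turn(10, 0.5, rng) for _ in range(10_000))
for turn in sorted(counts):
    print(turn, counts[turn])
```

With `rate=0.5`, roughly 39% of draws land on turn 0 (1 - e^-0.5), and each later turn is less likely than the one before it, except for a small bump at the last turn from clamping.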

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- multiturn_fp4_b200_trace_replay.sh: B200 variant with Blackwell arch
  and FP4 compilation config
- multiturn-agentic-trace.yaml: restructured with top-level keys for
  different hardware/model combos (h200-fp8-llama70b, b200-fp4-dsr1)
- multiturn-sweep.yml: added config_key input to select which config
  section to use

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- benchmark-multiturn-tmpl.yml: add ep input, pass as EP_SIZE env var
- multiturn-sweep.yml: add ep input, pass through to template
- multiturn_fp4_b200_trace_replay.sh: add --expert-parallel-size when EP_SIZE > 0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…llel-size

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Config entries can now specify ep per TP group. Matrix generator passes
ep per entry, falling back to global input. B200 DSR1 config uses
tp4ep4 and tp8ep8.
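
The fallback described in this commit message can be sketched as a small helper. This is an illustrative sketch, not the PR's code: `resolve_ep` is a hypothetical name, and the user counts in the example entries are made up.

```python
def resolve_ep(entry: dict, global_ep: int) -> int:
    """Return the entry's own ep if it set a positive one, else the global input."""
    ep = entry.get("ep", 0)
    return ep if ep > 0 else global_ep


# Example matrix entries: two set ep explicitly, one relies on the global value.
entries = [
    {"tp": 4, "users": 8, "offload": "on", "ep": 4},
    {"tp": 8, "users": 8, "offload": "on", "ep": 8},
    {"tp": 2, "users": 8, "offload": "off"},
]
resolved = [resolve_ep(e, global_ep=0) for e in entries]
print(resolved)  # [4, 8, 0]
```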

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- multiturn_fp8_mi355x_trace_replay.sh: vLLM-based trace replay for MI355X
- multiturn-agentic-trace.yaml: add mi355x-fp8-llama70b sweep config
- launch_mi355x-amds.sh: add SCRIPT_SUFFIX support for single-node
- benchmark-multiturn-tmpl.yml: clean stale git index.lock before checkout

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- plot_sweep_overview.py: generates throughput_vs_concurrency.png and
  workload_consistency.png from sweep results
- collect_sweep_results.py: calls overview plots after Pareto analysis

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment on lines +97 to +171
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.gen.outputs.matrix }}
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
        if: ${{ inputs.config_file != '' }}
        with:
          token: ${{ secrets.REPO_PAT }}
          fetch-depth: 1
          ref: ${{ inputs.ref || github.ref }}
          sparse-checkout: ${{ inputs.config_file }}

      - id: gen
        run: |
          pip install -q pyyaml
          python3 << 'PYEOF'
          import json, os, sys

          config_file = "${{ inputs.config_file }}".strip()

          if config_file:
              import yaml
              with open(config_file) as f:
                  full_config = yaml.safe_load(f)

              config_key = "${{ inputs.config_key }}".strip()

              # If config_key specified, use that section; otherwise auto-detect
              if config_key and config_key in full_config:
                  config = full_config[config_key]
              elif config_key:
                  print(f"ERROR: config_key '{config_key}' not found. Available: {list(full_config.keys())}")
                  sys.exit(1)
              elif len(full_config) == 1:
                  config = next(iter(full_config.values()))
              else:
                  # Check if top-level keys look like tp entries (tp2, tp4, etc.)
                  if all(k.startswith("tp") for k in full_config):
                      config = full_config
                  else:
                      print(f"ERROR: Multiple entries in config, specify --config_key. Available: {list(full_config.keys())}")
                      sys.exit(1)

              includes = []
              for key, settings in config.items():
                  tp = int(key.replace("tp", ""))
                  users = settings.get("users", [])
                  offloads = settings.get("offload", ["on", "off"])
                  ep = settings.get("ep", 0)
                  for u in users:
                      for o in offloads:
                          entry = {"tp": tp, "users": u, "offload": o}
                          if ep > 0:
                              entry["ep"] = ep
                          includes.append(entry)
          else:
              tp_values = json.loads('${{ inputs.tp_values }}')
              user_values = json.loads('${{ inputs.user_values }}')
              offload_values = json.loads('${{ inputs.offload_values }}')
              includes = []
              for tp in tp_values:
                  for u in user_values:
                      for o in offload_values:
                          includes.append({"tp": tp, "users": u, "offload": o})

          matrix = {"include": includes}
          print(f"Generated {len(includes)} matrix entries")
          with open(os.environ["GITHUB_OUTPUT"], "a") as f:
              f.write(f"matrix={json.dumps(matrix)}\n")
          PYEOF

  # ---------------------------------------------------------------------------
  # Matrix benchmark jobs — each cell calls the multiturn template
  # ---------------------------------------------------------------------------
  sweep:
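
To make the config-file branch of the matrix generator easier to follow, the expansion loop can be pulled into a standalone function and run on a small input. This is a sketch for illustration only: `build_includes` is a hypothetical name, and the example section and its user counts are made up rather than taken from multiturn-agentic-trace.yaml.

```python
import json


def build_includes(config: dict) -> list:
    """Expand {"tpN": {"users": [...], "offload": [...], "ep": int}} into matrix entries."""
    includes = []
    for key, settings in config.items():
        tp = int(key.replace("tp", ""))  # "tp4" -> 4
        offloads = settings.get("offload", ["on", "off"])  # default: both modes
        ep = settings.get("ep", 0)
        for users in settings.get("users", []):
            for offload in offloads:
                entry = {"tp": tp, "users": users, "offload": offload}
                if ep > 0:
                    entry["ep"] = ep  # only emitted when the section sets ep
                includes.append(entry)
    return includes


# Illustrative config section; the values are invented for this example.
example = {
    "tp4": {"users": [8, 16], "offload": ["off"], "ep": 4},
    "tp8": {"users": [32]},
}
matrix = {"include": build_includes(example)}
print(json.dumps(matrix, indent=2))
```

Each `tpN` key contributes `len(users) * len(offload)` entries, so the example yields four: two for tp4 (offload off only, with ep=4) and two for tp8 (offload on and off, no ep key).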

Check warning

Code scanning / CodeQL

Workflow does not contain permissions Medium

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}

Copilot Autofix

AI about 5 hours ago

In general, to fix this class of issue you add an explicit permissions: block to the workflow (or to each job) that grants only the minimal scopes needed. If the jobs only need to read repository contents and use artifacts, a typical baseline is permissions: contents: read at the workflow level. Additional fine-grained permissions are added only if some step needs them.

For this specific workflow, nothing in the provided snippet requires write access via GITHUB_TOKEN: repository writes are done via a separate secrets.REPO_PAT, and artifact operations work with the default contents: read (artifact permissions are managed separately and are allowed with minimal contents scope). Therefore, the best change that does not alter existing functionality is to add a workflow-level permissions block directly under the name/run-name section, before on:, setting contents: read. This will apply to all jobs that don’t override permissions, including generate-matrix, sweep, and collect, and will ensure the token is limited to read-only access to repository contents.

Concretely: in .github/workflows/multiturn-sweep.yml, insert:

permissions:
  contents: read

after line 2 (right after run-name:) and before the on: block. No additional imports or methods are required.
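
A quick local check for the presence of such a block can be sketched as follows. This is a rough line-based scan under stated assumptions, not a YAML parser and not how CodeQL evaluates the rule: `has_top_level_permissions` is a hypothetical helper, it only looks at the workflow level, and the two workflow strings are trimmed-down stand-ins for the real file.

```python
def has_top_level_permissions(workflow_text: str) -> bool:
    """Return True if an unindented line starts a `permissions:` key."""
    for line in workflow_text.splitlines():
        # Indented (job-level) or commented-out keys do not match here.
        if line.startswith("permissions:"):
            return True
    return False


# Stand-in for the workflow before the suggested change.
before = (
    "name: Multi-Turn Benchmark Sweep\n"
    "on:\n"
    "  workflow_dispatch:\n"
    "jobs:\n"
    "  collect:\n"
    "    runs-on: ubuntu-latest\n"
)

# Stand-in for the workflow after inserting the block ahead of `on:`.
after = (
    "name: Multi-Turn Benchmark Sweep\n"
    "permissions:\n"
    "  contents: read\n"
    "\n"
    "on:\n"
    "  workflow_dispatch:\n"
    "jobs:\n"
    "  collect:\n"
    "    runs-on: ubuntu-latest\n"
)

print(has_top_level_permissions(before), has_top_level_permissions(after))
```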

Suggested changeset 1
.github/workflows/multiturn-sweep.yml

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/.github/workflows/multiturn-sweep.yml b/.github/workflows/multiturn-sweep.yml
--- a/.github/workflows/multiturn-sweep.yml
+++ b/.github/workflows/multiturn-sweep.yml
@@ -1,5 +1,7 @@
 name: Multi-Turn Benchmark Sweep
 run-name: "${{ inputs.run_name || format('Multi-Turn Sweep - tp={0} users={1} offload={2}', inputs.tp_values, inputs.user_values, inputs.offload_values) }}"
+permissions:
+  contents: read
 
 on:
   # push:
EOF
Comment on lines +172 to +198
    needs: generate-matrix
    uses: ./.github/workflows/benchmark-multiturn-tmpl.yml
    name: sweep /
    strategy:
      fail-fast: false
      matrix: ${{ fromJson(needs.generate-matrix.outputs.matrix) }}
    secrets: inherit
    with:
      runner: ${{ inputs.runner }}
      image: ${{ inputs.image }}
      model: ${{ inputs.model }}
      precision: ${{ inputs.precision }}
      exp-name: "multiturn_tp${{ matrix.tp }}_users${{ matrix.users }}_offload${{ matrix.offload }}"
      tp: "${{ matrix.tp }}"
      users: "${{ matrix.users }}"
      offload-mode: ${{ matrix.offload }}
      duration: ${{ inputs.duration }}
      request-rate: ${{ inputs.request_rate }}
      total-cpu-dram-gb: ${{ inputs.total_cpu_dram_gb }}
      script-suffix: ${{ inputs.script_suffix }}
      ep: "${{ matrix.ep || inputs.ep }}"
      ref: ${{ inputs.ref }}

  # ---------------------------------------------------------------------------
  # Collect & aggregate results
  # ---------------------------------------------------------------------------
  collect:

Check warning

Code scanning / CodeQL

Workflow does not contain permissions Medium

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {}

Copilot Autofix

AI about 5 hours ago

In general, you fix this by explicitly defining a permissions: block either at the top level of the workflow (to apply to all jobs) or individually per job, granting only the minimal scopes required (typically contents: read for reading the repo and actions: read for artifact access; the artifact actions don’t require additional special scopes).

For this specific workflow, the minimal, safe change without altering functionality is:

  • Add a top-level permissions: block after the run-name: and before on: that sets contents: read. This allows jobs to read the repository with GITHUB_TOKEN if needed, while not granting write.
  • No per-job overrides are necessary because none of the shown jobs need to write to the repository or interact with issues/PRs; they only check out code (already handled by PAT) and handle artifacts (which work with default GITHUB_TOKEN + actions implied). GitHub recommends at least setting contents: read explicitly to mirror a read-only default.

Concretely, in .github/workflows/multiturn-sweep.yml, insert:

permissions:
  contents: read

between line 2 (run-name: ...) and line 4 (on:).

Suggested changeset 1
.github/workflows/multiturn-sweep.yml

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/.github/workflows/multiturn-sweep.yml b/.github/workflows/multiturn-sweep.yml
--- a/.github/workflows/multiturn-sweep.yml
+++ b/.github/workflows/multiturn-sweep.yml
@@ -1,12 +1,14 @@
 name: Multi-Turn Benchmark Sweep
 run-name: "${{ inputs.run_name || format('Multi-Turn Sweep - tp={0} users={1} offload={2}', inputs.tp_values, inputs.user_values, inputs.offload_values) }}"
+permissions:
+  contents: read
 
 on:
   # push:
   #   branches:
   #     - experimental/multi-turn-benchmark
   #   paths:
-  #     - .github/workflows/multiturn-sweep.yml
+    #     - .github/workflows/multiturn-sweep.yml
   workflow_dispatch:
     inputs:
       run_name:
EOF
Comment on lines +199 to +231
    runs-on: ubuntu-latest
    needs: sweep
    if: always()
    name: Collect results
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
        with:
          token: ${{ secrets.REPO_PAT }}
          fetch-depth: 1
          ref: ${{ inputs.ref || github.ref }}

      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install pandas matplotlib numpy

      - name: Download all artifacts
        uses: actions/download-artifact@v4
        with:
          pattern: 'multiturn_*'
          path: results/

      - name: Run aggregation
        run: |
          python experimental/multiturn/vllm_benchmark/scripts/collect_sweep_results.py results/ aggregated/

      - name: Upload aggregated results
        uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
        with:
          name: multiturn_aggregated
          path: aggregated/

Check warning

Code scanning / CodeQL

Workflow does not contain permissions Medium

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}

Copilot Autofix

AI about 5 hours ago

In general, the fix is to explicitly declare a permissions: block that limits the default GITHUB_TOKEN to the minimal privileges required, typically at the workflow root so it applies to all jobs unless overridden. For this workflow, none of the jobs need to write to the repository; they only check out code, run benchmarks, and upload artifacts. The artifact actions do not require repository write permissions, and code checkout only needs contents: read. Therefore the best fix is to add a root-level permissions: block with contents: read near the top of .github/workflows/multiturn-sweep.yml.

Concretely, add:

permissions:
  contents: read

right after the run-name: (line 2) and before the on: block (line 4). This will apply the restricted permissions to all jobs (generate-matrix, sweep, collect) without changing any existing job logic or secrets usage. No additional methods, imports, or definitions are required, since this is purely a YAML configuration change.

Suggested changeset 1
.github/workflows/multiturn-sweep.yml

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/.github/workflows/multiturn-sweep.yml b/.github/workflows/multiturn-sweep.yml
--- a/.github/workflows/multiturn-sweep.yml
+++ b/.github/workflows/multiturn-sweep.yml
@@ -1,5 +1,7 @@
 name: Multi-Turn Benchmark Sweep
 run-name: "${{ inputs.run_name || format('Multi-Turn Sweep - tp={0} users={1} offload={2}', inputs.tp_values, inputs.user_values, inputs.offload_values) }}"
+permissions:
+  contents: read
 
 on:
   # push:
EOF
6 participants