CloudAI Autotune is a lightweight experiment manager for LLM serving benchmarks. It sits on top of NVIDIA CloudAI: CloudAI runs the benchmark, while Autotune records what was tried, parses the result, stores the metrics, and recommends the next config value to test.
It is designed for the common tuning loop:

    try a config -> measure throughput/latency -> compare history -> choose next config

Example:

    Run 1: batch_size = 1 -> 120 tok/s, 90 ms latency
    Run 2: batch_size = 4 -> 330 tok/s, 160 ms latency
    Run 3: batch_size = 8 -> 430 tok/s, 260 ms latency
    Recommendation: try batch_size = 6 because 8 crossed the latency budget and 6
    has not been tested yet.

Autotune is:
- a CLI for running or ingesting CloudAI benchmark experiments
- a parser for JSON, JSONL, and text benchmark outputs
- a SQLite experiment database
- a simple recommender for the next knob value to try
- a Streamlit dashboard for browsing experiment history
Autotune is not:
- a benchmark engine by itself
- a replacement for CloudAI
- a full multi-variable optimizer yet
- a storage trace, POSIX/S3, or checkpoint I/O benchmark tool

Data flow:

    flowchart TD
        A[CloudAI TOML config] --> B[autotune run]
        B --> C[CloudAI CLI]
        C --> D[runs/run_id/stdout.log]
        C --> E[runs/run_id/report.json]
        F[Existing report JSON/JSONL/log] --> G[autotune ingest]
        D --> H[Parser]
        E --> H
        G --> H
        H --> I[Normalized metrics]
        I --> J[(autotune.db SQLite)]
        A --> J
        J --> K[autotune list]
        J --> L[autotune recommend]
        J --> M[Streamlit dashboard]
        L --> N[Next untested knob value]

The important boundary is that CloudAI owns benchmark execution. Autotune owns experiment tracking and recommendation.

| Input | Example | Used by |
|---|---|---|
| CloudAI config | configs/examples/vllm_baseline.toml | run, derive, ingest |
| Existing report | reports/examples/vllm_batch4.json | ingest, demo |
| Tuning knob | serving.batch_size | recommend, demo |
| Latency budget | --latency-budget-ms 200 | recommend, demo |

| Output | Example | Contents |
|---|---|---|
| Experiment DB | autotune.db | configs, status, metrics, report paths |
| Run directory | runs/0001_vllm_baseline_.../ | captured CloudAI artifacts |
| Log file | runs/.../stdout.log | CloudAI output or failure details |
| Recommendation | Suggested: 6.0 | next untested knob value |
| Dashboard | Streamlit app | tables, charts, recommendation view |

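As a rough illustration of what the experiment database holds, the sketch below stores one row per experiment using Python's built-in sqlite3 module. The table name, column names, and helper functions are hypothetical and are not taken from Autotune's actual database.py; only the running, completed, and failed status values come from this README.

```python
import json
import sqlite3

# Hypothetical schema: names are illustrative, not Autotune's real layout.
SCHEMA = """
CREATE TABLE IF NOT EXISTS experiments (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    scenario     TEXT NOT NULL,
    status       TEXT NOT NULL,
    config_json  TEXT NOT NULL,
    metrics_json TEXT,
    report_path  TEXT
)
"""


def open_store(path=":memory:"):
    """Open (or create) the experiment store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def start_experiment(conn, scenario, config):
    """Insert a row with status=running and return its id."""
    cur = conn.execute(
        "INSERT INTO experiments (scenario, status, config_json) "
        "VALUES (?, 'running', ?)",
        (scenario, json.dumps(config)),
    )
    conn.commit()
    return cur.lastrowid


def finish_experiment(conn, exp_id, metrics, report_path, ok=True):
    """Record metrics and mark the experiment completed or failed."""
    conn.execute(
        "UPDATE experiments SET status = ?, metrics_json = ?, report_path = ? "
        "WHERE id = ?",
        ("completed" if ok else "failed", json.dumps(metrics), report_path, exp_id),
    )
    conn.commit()
```

Storing the config and metrics as JSON text keeps the sketch short; a real store might use separate columns or tables so that metrics can be queried directly.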
Reports are normalized into a small stable metric set:

| Metric | Meaning |
|---|---|
| latency_ms | latency in milliseconds |
| ttft_ms | time to first token in milliseconds |
| throughput_tokens_per_sec | generated token throughput |
| runtime_sec | benchmark runtime |
| failure_rate | failed request ratio |

The parser accepts common aliases from different report formats. For example,
tokens_per_second, request_throughput, and output_throughput can all map
to throughput_tokens_per_sec.
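
A minimal sketch of alias normalization, assuming a flat report dictionary. Only the three throughput aliases are taken from the text above; the function name, the rule that the first matching key wins, and the decision to skip non-numeric values are assumptions rather than the behavior of Autotune's parser.py.

```python
# The five canonical metric names listed in the table above.
CANONICAL = (
    "latency_ms",
    "ttft_ms",
    "throughput_tokens_per_sec",
    "runtime_sec",
    "failure_rate",
)

# Only these three aliases are documented; a real alias table is likely larger.
ALIASES = {
    "tokens_per_second": "throughput_tokens_per_sec",
    "request_throughput": "throughput_tokens_per_sec",
    "output_throughput": "throughput_tokens_per_sec",
}


def normalize_metrics(report):
    """Map a flat report dict onto the canonical metric names."""
    normalized = {}
    for key, value in report.items():
        name = ALIASES.get(key, key)
        if name not in CANONICAL:
            continue  # not a metric this sketch knows about
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue  # ignore strings, nested objects, and booleans
        # Assumption: if several keys map to one metric, keep the first seen.
        normalized.setdefault(name, float(value))
    return normalized


print(normalize_metrics({"tokens_per_second": 330, "latency_ms": 160, "model": "demo"}))
# -> {'throughput_tokens_per_sec': 330.0, 'latency_ms': 160.0}
```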
After recording runs, check them against simple performance budgets:

    autotune check \
      --latency-budget-ms 200 \
      --ttft-budget-ms 50 \
      --min-throughput-tokens-per-sec 300 \
      --max-failure-rate 0.05

Use --strict in scripts or CI to exit non-zero if any experiment fails a budget or cannot be evaluated because a required metric is missing.
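
The budget check can be pictured as a per-metric comparison with three outcomes: pass, fail, or missing. The sketch below is an illustration under stated assumptions, not the actual implementation: the BUDGETS table, the function names, and the choice that a value exactly at the limit passes are all hypothetical.

```python
# Hypothetical budget table mirroring the flags above:
# metric name -> ("max" or "min", limit).
BUDGETS = {
    "latency_ms": ("max", 200.0),
    "ttft_ms": ("max", 50.0),
    "throughput_tokens_per_sec": ("min", 300.0),
    "failure_rate": ("max", 0.05),
}


def check_budgets(metrics, budgets):
    """Return {metric: 'pass' | 'fail' | 'missing'} for one experiment."""
    results = {}
    for name, (kind, limit) in budgets.items():
        value = metrics.get(name)
        if value is None:
            results[name] = "missing"  # cannot be evaluated
        elif kind == "max":
            results[name] = "pass" if value <= limit else "fail"
        else:
            results[name] = "pass" if value >= limit else "fail"
    return results


def exit_code(results, strict=False):
    """In strict mode, any fail or missing result yields a non-zero code."""
    has_problem = any(verdict != "pass" for verdict in results.values())
    return 1 if (strict and has_problem) else 0


run3 = {"latency_ms": 260, "throughput_tokens_per_sec": 430, "failure_rate": 0.0}
results = check_budgets(run3, BUDGETS)
print(results["latency_ms"], results["ttft_ms"], exit_code(results, strict=True))
# -> fail missing 1
```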
Project layout:

    cloudai-autotune/
      autotune/
        cli.py              # command-line interface
        config_mutator.py   # load and derive TOML configs
        runner.py           # CloudAI subprocess wrapper
        parser.py           # report/log -> normalized metrics
        database.py         # SQLite experiment store
        recommender.py      # next-value recommendation heuristic
      configs/examples/     # sample CloudAI configs
      reports/examples/     # sample benchmark reports
      dashboard/app.py      # Streamlit dashboard
      runs/                 # captured run artifacts
      tests/                # unit tests

Install into a virtual environment:

    python3 -m venv .venv
    source .venv/bin/activate
    pip install -e .

Check the CLI:

    autotune --help

CloudAI is only required for real benchmark runs. The local demo works without CloudAI, GPUs, or cluster access.
The fastest way to see the project work is:

    autotune demo

This command:
- loads bundled sample reports from reports/examples/
- writes them to autotune-demo.db
- recommends a next value for serving.batch_size

Useful options:

    autotune demo --db /tmp/my-demo.db
    autotune demo --scenario vllm_baseline
    autotune demo --knob serving.batch_size --latency-budget-ms 200

When CloudAI is installed and available as cloudai:

    autotune run path/to/test_scenario.toml \
      --notes "baseline before tensor-parallel change" \
      --metadata hardware.gpu=A100 \
      --metadata run.nodes=1 \
      --system-config path/to/system.toml \
      --tests-dir path/to/tests \
      --hook-dir path/to/hooks

Autotune will:
- create a database row with status=running
- call cloudai run --config ... --output ...
- capture stdout/stderr under runs/<run_id>/stdout.log
- parse report.json or a common summary artifact such as cloudai-summary.json, summary.json, results.json, metrics.json, or JSONL equivalents
- mark the experiment completed or failed

CloudAI stdout and stderr are preserved in the run's stdout.log. Autotune
also appends a diagnostic for launch failures, timeouts, non-zero exits, and
unreadable report artifacts. Failed runs exit non-zero so shell scripts and CI
do not mistake them for successful benchmarks.
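
The failure handling described above can be sketched with the standard subprocess module. This is a minimal illustration, not the code in runner.py: the function name, the diagnostic wording, and the "[diagnostic]" prefix are assumptions.

```python
import subprocess
from pathlib import Path


def run_and_log(cmd, log_path, timeout_sec=None):
    """Run cmd, send stdout and stderr to log_path, and append a diagnostic on failure.

    Returns True when the command exits with code 0, False otherwise.
    """
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    diagnostic = None

    with log_path.open("wb") as log:
        try:
            proc = subprocess.run(
                cmd,
                stdout=log,
                stderr=subprocess.STDOUT,  # keep both streams in one log
                timeout=timeout_sec,
            )
            if proc.returncode != 0:
                diagnostic = f"command exited with code {proc.returncode}"
        except OSError as exc:
            # Covers a missing or non-executable binary.
            diagnostic = f"launch failed: {exc}"
        except subprocess.TimeoutExpired:
            diagnostic = f"timed out after {timeout_sec} seconds"

    if diagnostic is not None:
        # Reopen in append mode so the note lands after the captured output.
        with log_path.open("a", encoding="utf-8") as log:
            log.write(f"\n[diagnostic] {diagnostic}\n")
    return diagnostic is None
```

A caller could translate the boolean into a process exit status, for example `sys.exit(0 if run_and_log(cmd, log) else 1)`, so that shell scripts see the failure.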
Use a custom CloudAI binary if needed:

    autotune run path/to/test_scenario.toml \
      --cloudai-bin /path/to/cloudai \
      --timeout-sec 3600 \
      --system-config path/to/system.toml \
      --tests-dir path/to/tests \
      --hook-dir path/to/hooks

Use CloudAI dry-run mode to validate config wiring without launching a real benchmark:

    autotune run path/to/test_scenario.toml \
      --cloudai-bin /path/to/cloudai \
      --dry-run \
      --system-config path/to/system.toml \
      --tests-dir path/to/tests \
      --hook-dir path/to/hooks

For a direct CloudAI CLI-contract smoke check without writing an experiment record:

    autotune smoke-cloudai path/to/test_scenario.toml \
      --cloudai-bin /path/to/cloudai \
      --system-config path/to/system.toml \
      --tests-dir path/to/tests \
      --hook-dir path/to/hooks

If a benchmark report already exists, record it without launching CloudAI:

    autotune ingest reports/examples/vllm_batch4.json \
      --config configs/examples/vllm_batch4.toml \
      --notes "baseline batch size 4" \
      --metadata hardware.gpu=A100

For a first pass when you only have a report artifact, provide the scenario, backend, and any config values you want Autotune to track:

    autotune ingest reports/examples/vllm_batch4.json \
      --scenario vllm_baseline \
      --backend vllm \
      --set serving.batch_size=4

Create a new config by overriding dotted TOML keys:

    autotune derive configs/examples/vllm_baseline.toml configs/derived/batch8.toml \
      --set serving.batch_size=8

Then run it:

    autotune run configs/derived/batch8.toml

List recorded experiments:

    autotune list

Filter by scenario:

    autotune list --scenario vllm_baseline

Show config and metric differences between two recorded runs:

    autotune diff 1 2

Write experiment summaries to CSV, JSON, or Markdown for sharing in issues, pull requests, or benchmark notes:

    autotune export --format csv --out reports/summary.csv
    autotune export --format json --scenario vllm_baseline --out reports/vllm.json
    autotune export --format markdown --out reports/summary.md
    autotune export --format markdown --template issue
    autotune export --format markdown --template pr

Without --out, the export prints to the terminal.
Ask for the next value to try:

    autotune recommend --knob serving.batch_size --latency-budget-ms 200

The recommender compares completed experiments for one or more knobs. It tries to avoid suggesting a value that was already tested. If 4 was good and 8 crossed the latency budget, it may suggest 6 as the next untested point.
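
One way to arrive at a suggestion like 6 is to bisect between the largest value that met the budget and the smallest larger value that exceeded it. The sketch below illustrates that idea only; it is an assumed heuristic, not necessarily what recommender.py does, and the function name and the (value, latency) history format are hypothetical. It also assumes latency grows with the knob value.

```python
def recommend_next(history, latency_budget_ms):
    """Suggest an untested knob value between the best passing and first failing run.

    history is a list of (knob_value, latency_ms) pairs from completed runs.
    Returns None when there is nothing to bracket or the midpoint was already tested.
    """
    tested = {value for value, _ in history}
    passing = [v for v, latency in history if latency <= latency_budget_ms]
    failing = [v for v, latency in history if latency > latency_budget_ms]
    if not passing or not failing:
        return None  # need one run on each side of the budget

    low = max(passing)
    above = [v for v in failing if v > low]
    if not above:
        return None  # no failing value above the best passing one
    high = min(above)

    candidate = (low + high) / 2
    return None if candidate in tested else candidate


history = [(1, 90), (4, 160), (8, 260)]
print(recommend_next(history, latency_budget_ms=200))
# -> 6.0
```

A midpoint suits continuous or finely grained knobs; for integer-only settings the result would need rounding, which this sketch leaves out.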
To write that suggestion directly into a new config, pass a base config and an output path:

    autotune recommend \
      --knob serving.batch_size \
      --knob serving.num_requests \
      --latency-budget-ms 200 \
      --derive-from configs/examples/vllm_baseline.toml \
      --out-config configs/derived/batch6.toml

This prints one recommendation per knob and writes configs/derived/batch6.toml with the suggested values applied.
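
Applying a value to a dotted key such as serving.batch_size amounts to walking nested TOML tables. The sketch below shows that step on a plain dictionary, as produced by a TOML loader such as the standard-library tomllib (Python 3.11+). The function name is hypothetical and this is not the code in config_mutator.py; writing the result back to a .toml file is left out because the standard library has no TOML writer.

```python
import copy


def set_dotted(config, dotted_key, value):
    """Return a copy of config with dotted_key set to value.

    Intermediate tables are created as needed; the input is not modified.
    """
    result = copy.deepcopy(config)
    node = result
    *parents, leaf = dotted_key.split(".")
    for part in parents:
        child = node.setdefault(part, {})
        if not isinstance(child, dict):
            raise ValueError(f"{part!r} in {dotted_key!r} is not a table")
        node = child
    node[leaf] = value
    return result


base = {"serving": {"batch_size": 4, "num_requests": 100}}
derived = set_dotted(base, "serving.batch_size", 6)
print(derived["serving"]["batch_size"], base["serving"]["batch_size"])
# -> 6 4
```

Returning a copy keeps the base config untouched, which matches the idea of deriving a new config file rather than editing the original.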
Launch the dashboard:

    streamlit run dashboard/app.py

The dashboard reads the local SQLite database and shows experiment history, best/latest run comparison, metric charts, and the current recommendation.
Run tests:

    .venv/bin/python -m pytest -q

Current test coverage includes:
- config derivation
- report parsing
- runner failure handling
- SQLite persistence
- CLI ingest/demo behavior
- recommendation logic
Goal: make Autotune the small, reliable companion for CloudAI performance tuning — easy enough for a first benchmark, useful enough for repeated production-readiness checks.
- Make the first-run path obvious: one command for demo, one command for an existing CloudAI report, and one command for a real CloudAI run.
- Support a stable CloudAI machine-readable summary artifact when CloudAI provides one, while keeping workload-specific parsers as fallbacks.
- Continue improving multi-knob recommendations beyond independent knob suggestions toward budget-aware search across interacting backend settings.
- Track experiment intent, environment, hardware, and config diffs so results are explainable later.
- Expand pass/fail budgets to include time to first token and richer policy reporting.
- Make the dashboard useful for comparison: best run, latest run, regressions, and suggested next config.
- Expand export templates for issue, pull request, and benchmark-report summaries.
- Keep the tool local-first: SQLite by default, no service required, and clean failure messages when CloudAI or benchmark artifacts are missing.