fix(datafusion): Update docs, skip concurrent_qps, Skip re-downloading partitioned files#943
fix(datafusion): Update docs, skip concurrent_qps, Skip re-downloading partitioned files#943alamb wants to merge 2 commits into
Conversation
5374b3c to
0ad6165
Compare
0ad6165 to
9e4afc0
Compare
| export BENCH_RESTARTABLE=no | ||
| # skip concurrent_qps tests by default as datafusion-cli is configured | ||
| # for single user/single process usage | ||
| export BENCH_CONCURRENT_DURATION="${BENCH_CONCURRENT_DURATION:-0}" |
There was a problem hiding this comment.
When I ran the rewritten script on a c6a.4xlarge, the kernel repeatedly killed datafusion-cli
$ sudo dmesg
...
[ 2474.473278] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service/tmux-spawn-15c64bd8-c60d-4843-bf48-c65eac5ced5e.scope,task=datafusion-cli,pid=19689,uid=1000
[ 2474.473439] Out of memory: Killed process 19689 (datafusion-cli) total-vm:18148748kB, anon-rss:7049888kB, file-rss:2872kB, shmem-rss:0kB, UID:1000 pgtables:15880kB oom_score_adj:0There was a problem hiding this comment.
The newly rewritten "benchmark-common.sh" appears to include new code to do a "concurrency" test which launches a bunch of concurrent datafusion-cli processes
ClickBench/lib/benchmark-common.sh
Lines 385 to 418 in a377499
I don't think this type of test makes sense for a stateless engine like DataFusion and doesn't reflect how people use this. Each datafusion-cli process thinks it has the resources of the entire machine, and doesn't coordinate between each other (by design).
Systems built with DataFusion absolutely can and do implement various multi-user resource management policies, but each picks the policy that is best suited for its requirements. DataFusion does not mandate any particular one.
If we want to include a qps number for datafusion-cli the runner should measure how fast the query can be run one at a time (set concurrency to 1)
There was a problem hiding this comment.
Fascinatingly there are recent results checked in for a c6a.4xlarge that have a qps_concurrency measurement:
ClickBench/datafusion-partitioned/results/20260511/c6a.4xlarge.json
Lines 12 to 13 in a377499
These appear to be added in 6d5ee0a by @alexey-milovidov, as part of PR #860 and I don't see any comments / additional context of how it was generated) 🤔
Given that the kernel just OOM kills my runner, I don't know how to reproduce this number. Thus I think it is better and more accurate to disable the qps measurement for DataFusion and report "null" instead
| 3. `git clone https://github.com/ClickHouse/ClickBench` | ||
| 4. `cd ClickBench/datafusion` | ||
| 5. `vi benchmark.sh` and modify the following line to target the DataFusion version | ||
| 5. `vi install` and modify the following line to target the DataFusion version |
There was a problem hiding this comment.
#860 changes the script that need to be updated
| export BENCH_DOWNLOAD_SCRIPT="download-hits-parquet-partitioned" | ||
| export BENCH_DURABLE=yes | ||
| export BENCH_RESTARTABLE=no | ||
| # skip concurrent_qps tests by default as datafusion-cli is configured |
Rationale
While trying to gather results for a new DataFusion release, I could not complete the benchmark scripts because the new concurrent QPS test caused the
bash benchmark.shrun to be OOM-killed by the kernel.I also hit two rough edges:
datafusion-partitioned/benchmark.shnow re-downloads the partitioned Parquet source files on each run, which makes it harder to iterate on the scripts locally.Changes
This PR addresses those issues and lets me run the DataFusion benchmarks again:
datafusion-cliis configured here for single-user/single-process benchmark runs.partitioned/directory instead of moving them, so reruns do not re-download the source files.installscript when changing the DataFusion version.I plan to submit a separate PR with updated results once the DataFusion release is finalized.
Question: should contributors still submit updated results?
It is not clear to me after the recent refactors whether you prefer contributors to run and submit benchmark results as part of release-update PRs, or whether you prefer to use the ClickBench automation to regenerate the numbers.
The DataFusion results were recently updated in:
The benchmark scripts were also substantially refactored in:
Updating ClickBench results takes me several hours for each DataFusion release, and I would like to decrease that. I looked into using
./run-benchmark.sh, which appears to be the automation entry point, but from an external contributor perspective:sinkuser, and I do not know how to retrieve the generated result JSON from that pipeline.prepare-database.sqlandcollect-results.sh, but the setup for using them with a contributor-owned ClickHouse instance is not documented.If contributors are expected to run the benchmarks themselves, it would be helpful to document the recommended end-to-end workflow for launching the machines and collecting the generated JSON files.