fix(datafusion): Update docs, skip concurrent_qps, Skip re-downloading partitioned files by alamb · Pull Request #943 · ClickHouse/ClickBench

alamb · 2026-06-01T14:08:24Z

Rationale

While trying to gather results for a new DataFusion release, I could not complete the benchmark scripts because the new concurrent QPS test caused the bash benchmark.sh run to be OOM-killed by the kernel.

I also hit two rough edges:

datafusion-partitioned/benchmark.sh now re-downloads the partitioned Parquet source files on each run, which makes it harder to iterate on the scripts locally.
The DataFusion documentation still referred to the old script layout.

Changes

This PR addresses those issues and lets me run the DataFusion benchmarks again:

Disable the concurrent QPS test by default for DataFusion entries, since datafusion-cli is configured here for single-user/single-process benchmark runs.
Hard-link the partitioned Parquet files into the partitioned/ directory instead of moving them, so reruns do not re-download the source files.
Update the DataFusion README to point to the new install script when changing the DataFusion version.
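The hard-link change in the second item can be sketched as follows. This is a minimal illustration rather than the PR's actual diff: the function name `link_partitioned` and the `hits_*.parquet` file pattern are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch only: link_partitioned and the hits_*.parquet pattern are
# illustrative assumptions, not names taken from the PR.
link_partitioned() {
    local src_dir="$1" dest_dir="$2" f dest
    mkdir -p "$dest_dir" || return 1
    for f in "$src_dir"/hits_*.parquet; do
        [ -e "$f" ] || continue                  # the glob matched nothing
        dest="$dest_dir/$(basename "$f")"
        # Already linked by an earlier run: both names refer to the same inode.
        if [ "$f" -ef "$dest" ]; then
            continue
        fi
        # ln without -s creates a hard link, so no data is copied and the
        # downloaded file stays where the download step expects to find it.
        ln -f "$f" "$dest" || return 1
    done
}
```

Because the source files are never moved, a rerun can detect that they are already present and skip the download. Hard links only work within a single filesystem, so this assumes the download directory and `partitioned/` live on the same volume.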

I plan to submit a separate PR with updated results once the DataFusion release is finalized.

Question: should contributors still submit updated results?

It is not clear to me after the recent refactors whether you prefer contributors to run and submit benchmark results as part of release-update PRs, or whether you prefer to use the ClickBench automation to regenerate the numbers.

The DataFusion results were recently updated in:

The benchmark scripts were also substantially refactored in:

Refactor: standard install/start/check/stop/load/query interface per system #860

Updating ClickBench results takes me several hours for each DataFusion release, and I would like to reduce that time. I looked into using ./run-benchmark.sh, which appears to be the automation entry point, but from an external contributor's perspective:

It posts logs to the ClickHouse Play sink user, and I do not know how to retrieve the generated result JSON from that pipeline.
The supporting pieces appear to be in prepare-database.sql and collect-results.sh, but the setup for using them with a contributor-owned ClickHouse instance is not documented.

If contributors are expected to run the benchmarks themselves, it would be helpful to document the recommended end-to-end workflow for launching the machines and collecting the generated JSON files.

CLAassistant · 2026-06-01T14:08:46Z

All committers have signed the CLA.

alamb · 2026-06-01T19:36:58Z

 export BENCH_RESTARTABLE=no
+# skip concurrent_qps tests by default as datafusion-cli is configured
+# for single user/single process usage
+export BENCH_CONCURRENT_DURATION="${BENCH_CONCURRENT_DURATION:-0}"
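The `${VAR:-default}` expansion in the added line keeps the setting overridable: the value is 0 only when the caller has not already set it. The sketch below shows that behavior in isolation. The helper `effective_duration` is hypothetical, and the example assumes (without confirming it from `benchmark-common.sh`) that a duration of 0 is what disables the concurrent test.

```shell
#!/usr/bin/env bash
# effective_duration is a hypothetical helper; it only demonstrates the
# ${VAR:-default} expansion used in the diff above.
effective_duration() {
    # Expands to the caller's value when the variable is set and non-empty,
    # otherwise to 0.
    echo "${BENCH_CONCURRENT_DURATION:-0}"
}

unset BENCH_CONCURRENT_DURATION
effective_duration                                # prints 0
BENCH_CONCURRENT_DURATION=60 effective_duration   # prints 60
```

Under that assumption, someone who still wants the concurrent test for a DataFusion entry could set `BENCH_CONCURRENT_DURATION` in the environment before invoking the benchmark script; the value 60 here is arbitrary.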


When I ran the rewritten script on a c6a.4xlarge, the kernel repeatedly OOM-killed datafusion-cli:

$ sudo dmesg
...
[ 2474.473278] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service/tmux-spawn-15c64bd8-c60d-4843-bf48-c65eac5ced5e.scope,task=datafusion-cli,pid=19689,uid=1000
[ 2474.473439] Out of memory: Killed process 19689 (datafusion-cli) total-vm:18148748kB, anon-rss:7049888kB, file-rss:2872kB, shmem-rss:0kB, UID:1000 pgtables:15880kB oom_score_adj:0

The newly rewritten benchmark-common.sh appears to include a new "concurrency" test, which launches several datafusion-cli processes at the same time:

ClickBench/lib/benchmark-common.sh

Lines 385 to 418 in a377499

for c in $(seq 1 "$connections"); do
    (
        # Deterministic per-connection permutation. The seed combines
        # BENCH_CONCURRENT_SEED with the connection index via SHA-256
        # so cross-version Python hash randomization can't shift it —
        # connection 3 on clickhouse must hit the exact same query
        # order as connection 3 on duckdb for the numbers to be
        # comparable.
        local perm
        mapfile -t perm < <(SEED="$seed" CONN="$c" NQ="$nq" python3 - <<'PY'
import hashlib, os, random
seed_str = f"{os.environ['SEED']}-{os.environ['CONN']}"
seed_int = int.from_bytes(hashlib.sha256(seed_str.encode()).digest()[:8], 'big')
r = random.Random(seed_int)
xs = list(range(int(os.environ['NQ'])))
r.shuffle(xs)
print('\n'.join(map(str, xs)))
PY
        )

        local ok=0 err=0 idx=0 qi q_text
        while [ "$(date +%s)" -lt "$deadline" ]; do
            qi="${perm[$idx]}"
            q_text="${queries[$qi]}"
            if printf '%s\n' "$q_text" | ./query >/dev/null 2>&1; then
                ok=$((ok + 1))
            else
                err=$((err + 1))
            fi
            idx=$(( (idx + 1) % nq ))
        done
        printf '%s %s\n' "$ok" "$err" > "$stats_dir/$c"
    ) &
done
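The seed derivation in the Python heredoc above takes the first 8 bytes of the SHA-256 digest of the string `SEED-CONN`. Those same bytes can be inspected from the shell, which makes it easy to see that the value depends only on the seed and the connection index. This is a sketch that assumes `sha256sum` from GNU coreutils is available; `conn_seed_hex` is a made-up name.

```shell
#!/usr/bin/env bash
# conn_seed_hex is a hypothetical helper, not part of benchmark-common.sh.
conn_seed_hex() {
    local seed="$1" conn="$2"
    # printf without a trailing newline matches f"{SEED}-{CONN}" in the
    # Python code. The first 16 hex characters of the digest are its
    # first 8 bytes, which Python then reads as a big-endian integer.
    printf '%s-%s' "$seed" "$conn" | sha256sum | cut -c1-16
}

conn_seed_hex 42 3   # same seed and connection always give the same value
conn_seed_hex 42 3
conn_seed_hex 42 4   # a different connection index gives a different value
```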

I don't think this type of test makes sense for a stateless engine like DataFusion, and it doesn't reflect how people use it. Each datafusion-cli process assumes it has the resources of the entire machine, and the processes do not coordinate with each other (by design).
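As a rough illustration of why uncoordinated processes exhaust memory: the OOM report above shows one datafusion-cli process with an anonymous RSS of 7049888 kB, and a c6a.4xlarge has 32 GiB of RAM. Assuming (and it is only an assumption) that every concurrent process reaches a similar resident size, the arithmetic below estimates how many fit before memory runs out, ignoring the operating system and page cache.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope estimate; the "similar RSS per process" premise is
# an assumption, not something measured in this PR.
total_kb=$(( 32 * 1024 * 1024 ))   # 32 GiB expressed in kB
rss_kb=7049888                     # anon-rss from the dmesg output
echo $(( total_kb / rss_kb ))      # prints 4
```

So under that assumption only about four such processes fit in memory at once, and any additional concurrent process risks triggering the OOM killer.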

Systems built with DataFusion absolutely can and do implement various multi-user resource management policies, but each picks the policy that is best suited for its requirements. DataFusion does not mandate any particular one.

If we want to include a QPS number for datafusion-cli, the runner should measure how fast queries run one at a time (that is, with concurrency set to 1).
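A one-at-a-time measurement could look roughly like the sketch below. It is not taken from `benchmark-common.sh`: `run_sequential` is a hypothetical helper, and the query command is passed in as an argument rather than assumed to be `./query`.

```shell
#!/usr/bin/env bash
# Sketch only: run_sequential is a hypothetical helper.
# Usage: run_sequential <query-command> <query> [<query> ...]
run_sequential() {
    local query_cmd="$1"; shift
    local ok=0 err=0 q start end
    start=$(date +%s)
    for q in "$@"; do
        # One query at a time: the next starts only after this one exits,
        # so a single process has the whole machine to itself.
        if printf '%s\n' "$q" | "$query_cmd" >/dev/null 2>&1; then
            ok=$((ok + 1))
        else
            err=$((err + 1))
        fi
    done
    end=$(date +%s)
    # Prints: successful queries, failed queries, elapsed whole seconds.
    printf '%s %s %s\n' "$ok" "$err" "$((end - start))"
}
```

Dividing the first field by the third gives a sequential QPS figure. The whole-second resolution of `date +%s` is coarse, so this only makes sense for runs that last well over a second.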

Interestingly, recent results checked in for a c6a.4xlarge include a concurrent_qps measurement:

ClickBench/datafusion-partitioned/results/20260511/c6a.4xlarge.json

Lines 12 to 13 in a377499

"concurrent_qps": 0.097,
"concurrent_error_ratio": 0,

These appear to have been added in 6d5ee0a by @alexey-milovidov as part of PR #860, and I don't see any comments or additional context on how they were generated. 🤔

Given that the kernel OOM-kills my runner, I don't know how to reproduce this number. I therefore think it is better and more accurate to disable the QPS measurement for DataFusion and report "null" instead.

alamb · 2026-06-01T19:39:35Z

 3. `git clone https://github.com/ClickHouse/ClickBench`
 4. `cd ClickBench/datafusion`
-5. `vi benchmark.sh` and modify the following line to target the DataFusion version
+5. `vi install` and modify the following line to target the DataFusion version


#860 changed which script needs to be updated.

alamb · 2026-06-01T19:39:45Z

 export BENCH_DOWNLOAD_SCRIPT="download-hits-parquet-partitioned"
 export BENCH_DURABLE=yes
 export BENCH_RESTARTABLE=no
+# skip concurrent_qps tests by default as datafusion-cli is configured


See rationale below

alamb mentioned this pull request Jun 1, 2026

Update ClickBench benchmarks with DataFusion 54.0.0 (when released) apache/datafusion#21505

Open

alamb changed the title ~~WIP: fix(datafusion): Avoid re-downloading partitioned files on each run~~ WIP: fix(datafusion): Skip re-downloading partitioned files, skip concurrent_qps Jun 1, 2026

alamb changed the title ~~WIP: fix(datafusion): Skip re-downloading partitioned files, skip concurrent_qps~~ fix(datafusion): Skip re-downloading partitioned files, skip concurrent_qps Jun 2, 2026

alamb changed the title ~~fix(datafusion): Skip re-downloading partitioned files, skip concurrent_qps~~ fix(datafusion): Update docs, skip concurrent_qps, Skip re-downloading partitioned files Jun 2, 2026

alamb force-pushed the alamb/improve_scripts branch from 5374b3c to 0ad6165 Compare June 2, 2026 13:30

fix(datafusion): Avoid re-downloading partitioned files on each run

9e4afc0

alamb force-pushed the alamb/improve_scripts branch from 0ad6165 to 9e4afc0 Compare June 2, 2026 13:32

fix(datafusion): hard-link partitioned parquet files

0c0c1e9


alamb marked this pull request as ready for review June 2, 2026 14:01

alamb wants to merge 2 commits into ClickHouse:main from alamb:alamb/improve_scripts


