Skip to content

fix(datafusion): Update docs, skip concurrent_qps, Skip re-downloading partitioned files#943

Open
alamb wants to merge 2 commits into
ClickHouse:mainfrom
alamb:alamb/improve_scripts
Open

fix(datafusion): Update docs, skip concurrent_qps, Skip re-downloading partitioned files#943
alamb wants to merge 2 commits into
ClickHouse:mainfrom
alamb:alamb/improve_scripts

Conversation

@alamb
Copy link
Copy Markdown
Contributor

@alamb alamb commented Jun 1, 2026

Rationale

While trying to gather results for a new DataFusion release, I could not complete the benchmark scripts because the new concurrent QPS test caused the bash benchmark.sh run to be OOM-killed by the kernel.

I also hit two rough edges:

  1. datafusion-partitioned/benchmark.sh now re-downloads the partitioned Parquet source files on each run, which makes it harder to iterate on the scripts locally.
  2. The DataFusion documentation still referred to the old script layout.

Changes

This PR addresses those issues and lets me run the DataFusion benchmarks again:

  1. Disable the concurrent QPS test by default for DataFusion entries, since datafusion-cli is configured here for single-user/single-process benchmark runs.
  2. Hard-link the partitioned Parquet files into the partitioned/ directory instead of moving them, so reruns do not re-download the source files.
  3. Update the DataFusion README to point to the new install script when changing the DataFusion version.

I plan to submit a separate PR with updated results once the DataFusion release is finalized.

Question: should contributors still submit updated results?

It is not clear to me after the recent refactors whether you prefer contributors to run and submit benchmark results as part of release-update PRs, or whether you prefer to use the ClickBench automation to regenerate the numbers.

The DataFusion results were recently updated in:

The benchmark scripts were also substantially refactored in:

Updating ClickBench results takes me several hours for each DataFusion release, and I would like to decrease that. I looked into using ./run-benchmark.sh, which appears to be the automation entry point, but from an external contributor perspective:

  1. It posts logs to the ClickHouse Play sink user, and I do not know how to retrieve the generated result JSON from that pipeline.
  2. The supporting pieces appear to be in prepare-database.sql and collect-results.sh, but the setup for using them with a contributor-owned ClickHouse instance is not documented.

If contributors are expected to run the benchmarks themselves, it would be helpful to document the recommended end-to-end workflow for launching the machines and collecting the generated JSON files.

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Jun 1, 2026

CLA assistant check
All committers have signed the CLA.

@alamb alamb changed the title WIP: fix(datafusion): Avoid re-downloading partitioned files on each run WIP: fix(datafusion): Skip re-downloading partitioned files, skip concurrent_qps Jun 1, 2026
@alamb alamb changed the title WIP: fix(datafusion): Skip re-downloading partitioned files, skip concurrent_qps fix(datafusion): Skip re-downloading partitioned files, skip concurrent_qps Jun 2, 2026
@alamb alamb changed the title fix(datafusion): Skip re-downloading partitioned files, skip concurrent_qps fix(datafusion): Update docs, skip concurrent_qps, Skip re-downloading partitioned files Jun 2, 2026
@alamb alamb force-pushed the alamb/improve_scripts branch from 5374b3c to 0ad6165 Compare June 2, 2026 13:30
@alamb alamb force-pushed the alamb/improve_scripts branch from 0ad6165 to 9e4afc0 Compare June 2, 2026 13:32
Comment thread datafusion/benchmark.sh
export BENCH_RESTARTABLE=no
# skip concurrent_qps tests by default as datafusion-cli is configured
# for single user/single process usage
export BENCH_CONCURRENT_DURATION="${BENCH_CONCURRENT_DURATION:-0}"
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I ran the rewritten script on a c6a.4xlarge, the kernel repeatedly killed datafusion-cli

$ sudo dmesg 
...
[ 2474.473278] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service/tmux-spawn-15c64bd8-c60d-4843-bf48-c65eac5ced5e.scope,task=datafusion-cli,pid=19689,uid=1000
[ 2474.473439] Out of memory: Killed process 19689 (datafusion-cli) total-vm:18148748kB, anon-rss:7049888kB, file-rss:2872kB, shmem-rss:0kB, UID:1000 pgtables:15880kB oom_score_adj:0

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The newly rewritten "benchmark-common.sh" appears to include new code to do a "concurrency" test which launches a bunch of concurrent datafusion-cli processes

for c in $(seq 1 "$connections"); do
(
# Deterministic per-connection permutation. The seed combines
# BENCH_CONCURRENT_SEED with the connection index via SHA-256
# so cross-version Python hash randomization can't shift it —
# connection 3 on clickhouse must hit the exact same query
# order as connection 3 on duckdb for the numbers to be
# comparable.
local perm
mapfile -t perm < <(SEED="$seed" CONN="$c" NQ="$nq" python3 - <<'PY'
import hashlib, os, random
seed_str = f"{os.environ['SEED']}-{os.environ['CONN']}"
seed_int = int.from_bytes(hashlib.sha256(seed_str.encode()).digest()[:8], 'big')
r = random.Random(seed_int)
xs = list(range(int(os.environ['NQ'])))
r.shuffle(xs)
print('\n'.join(map(str, xs)))
PY
)
local ok=0 err=0 idx=0 qi q_text
while [ "$(date +%s)" -lt "$deadline" ]; do
qi="${perm[$idx]}"
q_text="${queries[$qi]}"
if printf '%s\n' "$q_text" | ./query >/dev/null 2>&1; then
ok=$((ok + 1))
else
err=$((err + 1))
fi
idx=$(( (idx + 1) % nq ))
done
printf '%s %s\n' "$ok" "$err" > "$stats_dir/$c"
) &
done

I don't think this type of test makes sense for a stateless engine like DataFusion and doesn't reflect how people use this. Each datafusion-cli process thinks it has the resources of the entire machine, and doesn't coordinate between each other (by design).

Systems built with DataFusion absolutely can and do implement various multi-user resource management policies, but each picks the policy that is best suited for its requirements. DataFusion does not mandate any particular one.

If we want to include a qps number for datafusion-cli the runner should measure how fast the query can be run one at a time (set concurrency to 1)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fascinatingly there are recent results checked in for a c6a.4xlarge that have a qps_concurrency measurement:

"concurrent_qps": 0.097,
"concurrent_error_ratio": 0,

These appear to be added in 6d5ee0a by @alexey-milovidov, as part of PR #860 and I don't see any comments / additional context of how it was generated) 🤔

Given that the kernel just OOM kills my runner, I don't know how to reproduce this number. Thus I think it is better and more accurate to disable the qps measurement for DataFusion and report "null" instead

Comment thread datafusion/README.md
3. `git clone https://github.com/ClickHouse/ClickBench`
4. `cd ClickBench/datafusion`
5. `vi benchmark.sh` and modify the following line to target the DataFusion version
5. `vi install` and modify the following line to target the DataFusion version
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#860 changes the script that need to be updated

export BENCH_DOWNLOAD_SCRIPT="download-hits-parquet-partitioned"
export BENCH_DURABLE=yes
export BENCH_RESTARTABLE=no
# skip concurrent_qps tests by default as datafusion-cli is configured
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See rationale below

@alamb alamb marked this pull request as ready for review June 2, 2026 14:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants