3 changes: 3 additions & 0 deletions datafusion-partitioned/benchmark.sh
@@ -3,4 +3,7 @@
export BENCH_DOWNLOAD_SCRIPT="download-hits-parquet-partitioned"
export BENCH_DURABLE=yes
export BENCH_RESTARTABLE=no
# skip concurrent_qps tests by default as datafusion-cli is configured
Contributor Author

See rationale below

# for single user/single process usage
export BENCH_CONCURRENT_DURATION="${BENCH_CONCURRENT_DURATION:-0}"
exec ../lib/benchmark-common.sh
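
The `${BENCH_CONCURRENT_DURATION:-0}` expansion above keeps any value the caller has already exported and falls back to 0 otherwise. A minimal sketch of that behavior follows; `resolve_duration` is a hypothetical helper used only for illustration, and the sketch does not show how benchmark-common.sh interprets a value of 0.

```bash
#!/bin/bash
# Hypothetical helper (not part of ClickBench): mirrors the expansion used in benchmark.sh.
resolve_duration() {
    # ${VAR:-0} yields $VAR when it is set and non-empty, otherwise 0.
    echo "${BENCH_CONCURRENT_DURATION:-0}"
}

unset BENCH_CONCURRENT_DURATION
resolve_duration                                 # no value set, so the default applies
BENCH_CONCURRENT_DURATION=60 resolve_duration    # the caller's value wins
```

Because the default only applies when the variable is unset or empty, anyone who still wants the concurrency test can opt back in by exporting a duration before running the script.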
9 changes: 7 additions & 2 deletions datafusion-partitioned/load
@@ -1,9 +1,14 @@
#!/bin/bash
# datafusion queries the parquet files via an external table at LOCATION
# 'partitioned' (see create.sql). The shared bench_download fetches the
# parquet files into CWD; move them into the expected subdir.
# parquet files into CWD; hard-link them into the expected subdir.
#
# Note: don't move them to avoid re-downloading each time
set -e

mkdir -p partitioned
mv hits_*.parquet partitioned/ 2>/dev/null || true
for f in hits_*.parquet; do
ln -f "$f" "partitioned/$f"
done

sync
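
The hard-link approach in `load` can be tried in isolation. The sketch below uses a hypothetical helper, `link_into`, against a scratch directory with empty stand-in files. The `nullglob` guard is an addition that is not in the diff: it is assumed here so that the loop is skipped when no files match, instead of running on the literal pattern.

```bash
#!/bin/bash
set -e

# Hypothetical helper: hard-link every hits_*.parquet in $1 into $1/partitioned.
link_into() {
    local dir="$1" f
    mkdir -p "$dir/partitioned"
    (
        cd "$dir"
        shopt -s nullglob                 # assumption: skip the loop when nothing matches
        for f in hits_*.parquet; do
            ln -f "$f" "partitioned/$f"   # second name for the same inode, no data copied
        done
    )
}

# Usage on throwaway files standing in for the downloaded parquet files.
scratch="$(mktemp -d)"
touch "$scratch/hits_0.parquet" "$scratch/hits_1.parquet"
link_into "$scratch"
ls "$scratch/partitioned"
```

The original files stay where the download step left them, which matches the reason given in the script comment for linking rather than moving.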
2 changes: 1 addition & 1 deletion datafusion/README.md
@@ -25,7 +25,7 @@ All with no EBS optimization and no instance store.
2. Wait for the status checks to pass, then ssh to EC2: `ssh ubuntu@{ip}`
3. `git clone https://github.com/ClickHouse/ClickBench`
4. `cd ClickBench/datafusion`
5. `vi benchmark.sh` and modify the following line to target the DataFusion version
5. `vi install` and modify the following line to target the DataFusion version
Contributor Author

#860 changed which script needs to be updated


```bash
git checkout 53.1.0
3 changes: 3 additions & 0 deletions datafusion/benchmark.sh
@@ -3,4 +3,7 @@
export BENCH_DOWNLOAD_SCRIPT="download-hits-parquet-single"
export BENCH_DURABLE=yes
export BENCH_RESTARTABLE=no
# skip concurrent_qps tests by default as datafusion-cli is configured
# for single user/single process usage
export BENCH_CONCURRENT_DURATION="${BENCH_CONCURRENT_DURATION:-0}"
Contributor Author

When I ran the rewritten script on a c6a.4xlarge, the kernel repeatedly OOM-killed datafusion-cli:

$ sudo dmesg 
...
[ 2474.473278] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service/tmux-spawn-15c64bd8-c60d-4843-bf48-c65eac5ced5e.scope,task=datafusion-cli,pid=19689,uid=1000
[ 2474.473439] Out of memory: Killed process 19689 (datafusion-cli) total-vm:18148748kB, anon-rss:7049888kB, file-rss:2872kB, shmem-rss:0kB, UID:1000 pgtables:15880kB oom_score_adj:0
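
The `anon-rss` figure in the kill message gives a rough sense of scale. The sketch below extracts it and estimates how many processes of that size would fit in memory. `anon_rss_kb` and `procs_that_fit` are hypothetical helpers, the 32 GiB total is the published memory size of a c6a.4xlarge, and the estimate ignores page cache and every other process on the machine.

```bash
#!/bin/bash
# Hypothetical helper: pull the anon-rss value (in kB) out of an OOM-killer log line.
anon_rss_kb() {
    sed -n 's/.*anon-rss:\([0-9][0-9]*\)kB.*/\1/p' <<<"$1"
}

# Hypothetical helper: how many processes of a given RSS fit in total memory (integer division).
procs_that_fit() {
    local total_kb="$1" rss_kb="$2"
    echo $(( total_kb / rss_kb ))
}

line='Out of memory: Killed process 19689 (datafusion-cli) total-vm:18148748kB, anon-rss:7049888kB, file-rss:2872kB, shmem-rss:0kB, UID:1000 pgtables:15880kB oom_score_adj:0'
rss="$(anon_rss_kb "$line")"
total_kb=$(( 32 * 1024 * 1024 ))   # assumption: 32 GiB of RAM, nothing else resident
echo "anon-rss: ${rss} kB; about $(procs_that_fit "$total_kb" "$rss") such processes fit"
```

The resident size at kill time is only a snapshot, so this is an order-of-magnitude check rather than a measurement of peak usage.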

Contributor Author

The newly rewritten "benchmark-common.sh" appears to include a new "concurrency" test that launches several datafusion-cli processes at the same time:

for c in $(seq 1 "$connections"); do
    (
        # Deterministic per-connection permutation. The seed combines
        # BENCH_CONCURRENT_SEED with the connection index via SHA-256
        # so cross-version Python hash randomization can't shift it —
        # connection 3 on clickhouse must hit the exact same query
        # order as connection 3 on duckdb for the numbers to be
        # comparable.
        local perm
        mapfile -t perm < <(SEED="$seed" CONN="$c" NQ="$nq" python3 - <<'PY'
import hashlib, os, random
seed_str = f"{os.environ['SEED']}-{os.environ['CONN']}"
seed_int = int.from_bytes(hashlib.sha256(seed_str.encode()).digest()[:8], 'big')
r = random.Random(seed_int)
xs = list(range(int(os.environ['NQ'])))
r.shuffle(xs)
print('\n'.join(map(str, xs)))
PY
        )
        local ok=0 err=0 idx=0 qi q_text
        while [ "$(date +%s)" -lt "$deadline" ]; do
            qi="${perm[$idx]}"
            q_text="${queries[$qi]}"
            if printf '%s\n' "$q_text" | ./query >/dev/null 2>&1; then
                ok=$((ok + 1))
            else
                err=$((err + 1))
            fi
            idx=$(( (idx + 1) % nq ))
        done
        printf '%s %s\n' "$ok" "$err" > "$stats_dir/$c"
    ) &
done

I don't think this type of test makes sense for a stateless engine like DataFusion, and it doesn't reflect how people use it. Each datafusion-cli process assumes it has the resources of the entire machine, and the processes don't coordinate with each other (by design).

Systems built with DataFusion absolutely can and do implement various multi-user resource management policies, but each picks the policy that is best suited for its requirements. DataFusion does not mandate any particular one.

If we want to include a QPS number for datafusion-cli, the runner should measure how fast queries run one at a time (that is, with concurrency set to 1).
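
A one-at-a-time measurement along those lines might look like the sketch below. `sequential_qps` is a hypothetical helper, not part of benchmark-common.sh: it runs a command a fixed number of times back to back in a single process and prints the success count, the error count, and the resulting queries per second. The command passed in stands in for a real query invocation such as `./query`, and `date +%s.%N` assumes GNU date.

```bash
#!/bin/bash
# Hypothetical helper: run "$@" n times sequentially and print "ok err qps".
sequential_qps() {
    local n="$1"; shift
    local ok=0 err=0 i start end
    start="$(date +%s.%N)"
    for (( i = 0; i < n; i++ )); do
        if "$@" >/dev/null 2>&1; then
            ok=$((ok + 1))
        else
            err=$((err + 1))
        fi
    done
    end="$(date +%s.%N)"
    # Guard against a zero-length interval before dividing.
    awk -v ok="$ok" -v err="$err" -v s="$start" -v e="$end" \
        'BEGIN { d = e - s; if (d <= 0) d = 1e-9; printf "%d %d %.3f\n", ok, err, ok / d }'
}

# Usage with a stand-in command; replace `true` with the real query invocation.
sequential_qps 5 true
```

Only one process is alive at any moment, so memory use stays at what a single datafusion-cli run needs.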

Contributor Author

Fascinatingly, there are recent results checked in for a c6a.4xlarge that include a concurrent_qps measurement:

"concurrent_qps": 0.097,
"concurrent_error_ratio": 0,

These appear to have been added in 6d5ee0a by @alexey-milovidov as part of PR #860 (I don't see any comments or additional context on how they were generated) 🤔

Given that the kernel just OOM-kills my runner, I don't know how to reproduce this number. I therefore think it is more accurate to disable the QPS measurement for DataFusion and report "null" instead.

exec ../lib/benchmark-common.sh