bench: Compare ETHDebug overhead across branches#13
Conversation
|
Usage: uv run solc-bench run \
--solc /path/to/develop/solc/solc \
--benchmark-dir ./benchmark_data \
--tags med \
--iterations 5 \
--ethdebug-overhead \
--output-dir ./ethdebug-overhead-develop
uv run solc-bench run \
--solc /path/to/current/solc/solc \
--benchmark-dir ./benchmark_data \
--tags med \
--iterations 5 \
--ethdebug-overhead \
--output-dir ./ethdebug-overhead-currentThen compare: uv run solc-bench compare \
./ethdebug-overhead-develop/bench-results.json \
./ethdebug-overhead-current/bench-results.json \
--ethdebug-branches \
--baseline-label develop \
--target-label feature/internal-specThe output: Benchmark Metric feature/internal-spec ir-ethdebug develop ir-ethdebug Δ ir-ethdebug feature/internal-spec ir develop ir Δ ir feature/internal-spec overhead develop overhead Δ overhead
----------------- ------------- --------------------------------- -------------------- ------------- ------------------------ ------------------- ------ ------------------------------ ---------------- ----------
prb-math-4.1.1 cpu_time 7.8683s ± 0.2517s 7.6545s ± 0.2175s +2.79% 6.8656s ± 0.0663s 6.7416s ± 0.1290s +1.84% +14.61% +13.54% +1.07 pp
creation_size 592,270 ± 0 592,270 ± 0 0.0% 592,270 ± 0 592,270 ± 0 0.0% 0.0% 0.0% 0.0 pp
peak_rss 2169072 ± 118595 MiB 2175808 ± 158756 MiB -0.31% 1351712 ± 33354 MiB 1348928 ± 26619 MiB +0.21% +60.47% +61.3% -0.83 pp
runtime_size 589,640 ± 0 589,640 ± 0 0.0% 589,640 ± 0 589,640 ± 0 0.0% 0.0% 0.0% 0.0 pp
wall_time 7.8968s ± 0.3190s 7.6873s ± 0.2856s +2.73% 6.9299s ± 0.0886s 6.7658s ± 0.1669s +2.43% +13.95% +13.62% +0.33 pp
ethdebug_size 44.80 MiB ± 0.00 MiB 44.76 MiB ± 0.00 MiB +0.08% n/a n/a N/A N/A N/A N/A
forge-std-1.16.1 cpu_time 7.9549s ± 0.0795s 7.9994s ± 0.0906s -0.56% 6.8365s ± 0.0742s 6.9321s ± 0.0928s -1.38% +16.36% +15.4% +0.96 pp
creation_size 725,280 ± 0 725,280 ± 0 0.0% 725,280 ± 0 725,280 ± 0 0.0% 0.0% 0.0% 0.0 pp
peak_rss 1568768 ± 36955 MiB 1612816 ± 63618 MiB -2.73% 841536 ± 3475 MiB 825392 ± 75444 MiB +1.96% +86.42% +95.4% -8.98 pp
runtime_size 698,851 ± 0 698,851 ± 0 0.0% 698,851 ± 0 698,851 ± 0 0.0% 0.0% 0.0% 0.0 pp
wall_time 7.9942s ± 0.1036s 8.0431s ± 0.0981s -0.61% 6.8578s ± 0.0758s 6.9591s ± 0.1023s -1.46% +16.57% +15.58% +0.99 pp
ethdebug_size 44.94 MiB ± 0.00 MiB 44.88 MiB ± 0.00 MiB +0.13% n/a n/a N/A N/A N/A N/A
morpho-blue-1.0.0 cpu_time 11.1818s ± 0.2293s 11.0610s ± 0.1896s +1.09% 9.5450s ± 0.1405s 9.6678s ± 0.1083s -1.27% +17.15% +14.41% +2.74 pp
creation_size 1,261,421 ± 0 1,261,421 ± 0 0.0% 1,261,421 ± 0 1,261,421 ± 0 0.0% 0.0% 0.0% 0.0 pp
peak_rss 1893616 ± 71660 MiB 1703536 ± 21941 MiB +11.16% 957168 ± 28410 MiB 948432 ± 46672 MiB +0.92% +97.84% +79.62% +18.22 pp
runtime_size 1,251,246 ± 0 1,251,246 ± 0 0.0% 1,251,246 ± 0 1,251,246 ± 0 0.0% 0.0% 0.0% 0.0 pp
wall_time 11.3639s ± 0.2788s 11.1654s ± 0.2197s +1.78% 9.5693s ± 0.1873s 9.7115s ± 0.1305s -1.46% +18.75% +14.97% +3.78 pp
ethdebug_size 57.38 MiB ± 0.00 MiB 57.29 MiB ± 0.00 MiB +0.15% n/a n/a N/A N/A N/A N/A
solmate-6 cpu_time 4.7389s ± 0.0616s 4.6239s ± 0.0854s +2.49% 4.0220s ± 0.0828s 3.9918s ± 0.1613s +0.76% +17.83% +15.84% +1.99 pp
creation_size 540,758 ± 0 540,758 ± 0 0.0% 540,758 ± 0 540,758 ± 0 0.0% 0.0% 0.0% 0.0 pp
peak_rss 985184 ± 29731 MiB 968240 ± 3574 MiB +1.75% 385376 ± 5713 MiB 372192 ± 1248 MiB +3.54% +155.64% +160.15% -4.51 pp
runtime_size 523,577 ± 0 523,577 ± 0 0.0% 523,577 ± 0 523,577 ± 0 0.0% 0.0% 0.0% 0.0 pp
wall_time 4.7608s ± 0.0685s 4.6550s ± 0.0867s +2.27% 4.0436s ± 0.1090s 4.0118s ± 0.1698s +0.79% +17.74% +16.03% +1.71 pp
ethdebug_size 30.92 MiB ± 0.00 MiB 30.89 MiB ± 0.00 MiB +0.07% n/a n/a N/A N/A N/A N/A |
|
I am not a fan of introducing yet another two flags here. Basically what we want is four different result datasets on
The current approach conflates the concepts of which datasets and which pairwise comparisons over them to run. These should be orthogonal, IMO. Ideally we have something like to produce data and then compare whatever I understand this is not (just) on you, still, this PR adds onto it. |
ir-ethdebug is a regular pipeline now, so the run mode that bundled an unoptimized ir baseline and an ir-ethdebug run into one result file is redundant and conflates dataset selection with a special mode. Each run produces exactly the dataset its flags describe, and compare runs the pairwise comparisons requested via --vs: solc-bench run --solc ./solc --pipeline ir --no-optimize -o ir.json solc-bench run --solc ./solc --pipeline ir-ethdebug -o ed.json solc-bench compare ir.json ed.json --vs ed ir
@clonker sure, I agree with this. Please check it now: Comparison: feat-ir vs dev-ir
Benchmark Metric feat-ir dev-ir Δ% winner
----------------- ------------- ------------------- ------------------- ------- ------
prb-math-4.1.1 cpu_time 7.4697s ± 0.1706s 7.2882s ± 0.1161s +2.49% ~noise
creation_size 592,270 ± 0 592,270 ± 0 0.0% tie
peak_rss 1283664 ± 18758 MiB 1300368 ± 89882 MiB -1.28% ~noise
runtime_size 589,640 ± 0 589,640 ± 0 0.0% tie
wall_time 7.6108s ± 0.2776s 7.3393s ± 0.1567s +3.7% ~noise
forge-std-1.16.1 cpu_time 7.3679s ± 0.1564s 7.5212s ± 0.0777s -2.04% ~noise
creation_size 725,280 ± 0 725,280 ± 0 0.0% tie
peak_rss 835728 ± 26586 MiB 757408 ± 40316 MiB +10.34% ~noise
runtime_size 698,851 ± 0 698,851 ± 0 0.0% tie
wall_time 7.4271s ± 0.1986s 7.6105s ± 0.0980s -2.41% ~noise
morpho-blue-1.0.0 cpu_time 10.3606s ± 0.2164s 10.2282s ± 0.1335s +1.29% ~noise
creation_size 1,261,421 ± 0 1,261,421 ± 0 0.0% tie
peak_rss 925232 ± 11855 MiB 885600 ± 40382 MiB +4.48% ~noise
runtime_size 1,251,246 ± 0 1,251,246 ± 0 0.0% tie
wall_time 10.5072s ± 0.6548s 10.3461s ± 0.1563s +1.56% ~noise
solmate-6 cpu_time 4.2158s ± 0.1264s 4.1597s ± 0.0604s +1.35% ~noise
creation_size 540,758 ± 0 540,758 ± 0 0.0% tie
peak_rss 387104 ± 4444 MiB 373008 ± 1400 MiB +3.78% dev-ir
runtime_size 523,577 ± 0 523,577 ± 0 0.0% tie
wall_time 4.2523s ± 0.1784s 4.1910s ± 0.0587s +1.46% ~noise
Comparison: feat-ed vs dev-ed
Benchmark Metric feat-ed dev-ed Δ% winner
----------------- ------------- -------------------- -------------------- ------ ------
prb-math-4.1.1 cpu_time 8.6285s ± 0.1877s 8.3880s ± 0.2046s +2.87% ~noise
creation_size 592,270 ± 0 592,270 ± 0 0.0% tie
ethdebug_size 44.80 MiB ± 0.00 MiB 44.76 MiB ± 0.00 MiB +0.08% ~noise
peak_rss 1984880 ± 121188 MiB 1867712 ± 87870 MiB +6.27% ~noise
runtime_size 589,640 ± 0 589,640 ± 0 0.0% tie
wall_time 8.8527s ± 0.2882s 8.4716s ± 0.2892s +4.5% ~noise
forge-std-1.16.1 cpu_time 8.7960s ± 0.4597s 8.3687s ± 0.0730s +5.11% ~noise
creation_size 725,280 ± 0 725,280 ± 0 0.0% tie
ethdebug_size 44.94 MiB ± 0.00 MiB 44.88 MiB ± 0.00 MiB +0.13% dev-ed
peak_rss 1609776 ± 81821 MiB 1571776 ± 121598 MiB +2.42% ~noise
runtime_size 698,851 ± 0 698,851 ± 0 0.0% tie
wall_time 8.9539s ± 1.0165s 8.4514s ± 0.0777s +5.95% ~noise
morpho-blue-1.0.0 cpu_time 11.8989s ± 0.3171s 11.5593s ± 0.1716s +2.94% ~noise
creation_size 1,261,421 ± 0 1,261,421 ± 0 0.0% tie
ethdebug_size 57.38 MiB ± 0.00 MiB 57.29 MiB ± 0.00 MiB +0.15% dev-ed
peak_rss 1853424 ± 87738 MiB 1832544 ± 49277 MiB +1.14% ~noise
runtime_size 1,251,246 ± 0 1,251,246 ± 0 0.0% tie
wall_time 12.0327s ± 0.3882s 11.7122s ± 0.2232s +2.74% ~noise
solmate-6 cpu_time 4.8951s ± 0.0925s 4.9858s ± 0.0576s -1.82% ~noise
creation_size 540,758 ± 0 540,758 ± 0 0.0% tie
ethdebug_size 30.92 MiB ± 0.00 MiB 30.89 MiB ± 0.00 MiB +0.07% ~noise
peak_rss 988512 ± 30735 MiB 971360 ± 2383 MiB +1.77% ~noise
runtime_size 523,577 ± 0 523,577 ± 0 0.0% tie
wall_time 4.9308s ± 0.1116s 5.0301s ± 0.0607s -1.98% ~noise
Comparison: dev-ed vs dev-ir
Benchmark Metric dev-ed dev-ir Δ% winner
----------------- ------------- -------------------- ------------------- -------- ------
prb-math-4.1.1 cpu_time 8.3880s ± 0.2046s 7.2882s ± 0.1161s +15.09% dev-ir
creation_size 592,270 ± 0 592,270 ± 0 0.0% tie
peak_rss 1867712 ± 87870 MiB 1300368 ± 89882 MiB +43.63% dev-ir
runtime_size 589,640 ± 0 589,640 ± 0 0.0% tie
wall_time 8.4716s ± 0.2892s 7.3393s ± 0.1567s +15.43% dev-ir
ethdebug_size 44.76 MiB ± 0.00 MiB n/a N/A n/a
forge-std-1.16.1 cpu_time 8.3687s ± 0.0730s 7.5212s ± 0.0777s +11.27% dev-ir
creation_size 725,280 ± 0 725,280 ± 0 0.0% tie
peak_rss 1571776 ± 121598 MiB 757408 ± 40316 MiB +107.52% dev-ir
runtime_size 698,851 ± 0 698,851 ± 0 0.0% tie
wall_time 8.4514s ± 0.0777s 7.6105s ± 0.0980s +11.05% dev-ir
ethdebug_size 44.88 MiB ± 0.00 MiB n/a N/A n/a
morpho-blue-1.0.0 cpu_time 11.5593s ± 0.1716s 10.2282s ± 0.1335s +13.01% dev-ir
creation_size 1,261,421 ± 0 1,261,421 ± 0 0.0% tie
peak_rss 1832544 ± 49277 MiB 885600 ± 40382 MiB +106.93% dev-ir
runtime_size 1,251,246 ± 0 1,251,246 ± 0 0.0% tie
wall_time 11.7122s ± 0.2232s 10.3461s ± 0.1563s +13.2% dev-ir
ethdebug_size 57.29 MiB ± 0.00 MiB n/a N/A n/a
solmate-6 cpu_time 4.9858s ± 0.0576s 4.1597s ± 0.0604s +19.86% dev-ir
creation_size 540,758 ± 0 540,758 ± 0 0.0% tie
peak_rss 971360 ± 2383 MiB 373008 ± 1400 MiB +160.41% dev-ir
runtime_size 523,577 ± 0 523,577 ± 0 0.0% tie
wall_time 5.0301s ± 0.0607s 4.1910s ± 0.0587s +20.02% dev-ir
ethdebug_size 30.89 MiB ± 0.00 MiB n/a N/A n/a
Comparison: feat-ed vs feat-ir
Benchmark Metric feat-ed feat-ir Δ% winner
----------------- ------------- -------------------- ------------------- -------- -------
prb-math-4.1.1 cpu_time 8.6285s ± 0.1877s 7.4697s ± 0.1706s +15.51% feat-ir
creation_size 592,270 ± 0 592,270 ± 0 0.0% tie
peak_rss 1984880 ± 121188 MiB 1283664 ± 18758 MiB +54.63% feat-ir
runtime_size 589,640 ± 0 589,640 ± 0 0.0% tie
wall_time 8.8527s ± 0.2882s 7.6108s ± 0.2776s +16.32% feat-ir
ethdebug_size 44.80 MiB ± 0.00 MiB n/a N/A n/a
forge-std-1.16.1 cpu_time 8.7960s ± 0.4597s 7.3679s ± 0.1564s +19.38% feat-ir
creation_size 725,280 ± 0 725,280 ± 0 0.0% tie
peak_rss 1609776 ± 81821 MiB 835728 ± 26586 MiB +92.62% feat-ir
runtime_size 698,851 ± 0 698,851 ± 0 0.0% tie
wall_time 8.9539s ± 1.0165s 7.4271s ± 0.1986s +20.56% ~noise
ethdebug_size 44.94 MiB ± 0.00 MiB n/a N/A n/a
morpho-blue-1.0.0 cpu_time 11.8989s ± 0.3171s 10.3606s ± 0.2164s +14.85% feat-ir
creation_size 1,261,421 ± 0 1,261,421 ± 0 0.0% tie
peak_rss 1853424 ± 87738 MiB 925232 ± 11855 MiB +100.32% feat-ir
runtime_size 1,251,246 ± 0 1,251,246 ± 0 0.0% tie
wall_time 12.0327s ± 0.3882s 10.5072s ± 0.6548s +14.52% ~noise
ethdebug_size 57.38 MiB ± 0.00 MiB n/a N/A n/a
solmate-6 cpu_time 4.8951s ± 0.0925s 4.2158s ± 0.1264s +16.11% feat-ir
creation_size 540,758 ± 0 540,758 ± 0 0.0% tie
peak_rss 988512 ± 30735 MiB 387104 ± 4444 MiB +155.36% feat-ir
runtime_size 523,577 ± 0 523,577 ± 0 0.0% tie
wall_time 4.9308s ± 0.1116s 4.2523s ± 0.1784s +15.95% feat-ir
ethdebug_size 30.92 MiB ± 0.00 MiB n/a N/A n/aand I run (for different cases) it as: solc-bench run \
--solc /path/to/solidity/build_develop/solc/solc \
--benchmark-dir ./benchmark_data \
--pipeline ir \
--no-optimize \
--tags med \
--iterations 5 \
-o dev-ir.json |
clonker
left a comment
There was a problem hiding this comment.
Thanks! A few comments regarding things I noticed when doing a quick pass. I'll want to do a more thorough one later.
| if pipeline == "ir-ethdebug": | ||
| # ETHDebug program output does not support the optimizer yet, | ||
| # so this pipeline always compiles unoptimized IR. | ||
| runs.append( | ||
| ( | ||
| pipeline, | ||
| resolve_solc_settings("ir", True, ethdebug=True), | ||
| True, | ||
| ) | ||
| ) |
There was a problem hiding this comment.
this'll silently change behavior once it does support it, doesn't it? wouldn't it be better to just throw if ethdebug is requested but nooptimize is False?
There was a problem hiding this comment.
yeah, makes sense to raise if requested without no_optimize
| self.results, self.solc_version, self.iterations | ||
| ) | ||
| result_path = self.output_dir / "bench-results.json" | ||
| result_path = self.output_file or self.output_dir / "bench-results.json" |
There was a problem hiding this comment.
is there anything that relies on the output file being named bench-results.json?
There was a problem hiding this comment.
No :) introducing DEFAULT_RESULT_FILENAME, but also considering --output-file (if used)
| if args.output_file and args.output_dir == ".": | ||
| output_dir = str(Path(args.output_file).parent) |
There was a problem hiding this comment.
doesn't this break or at least lead to inconsistent behavior if i request an output dir like ./subdir?
There was a problem hiding this comment.
Yeah, good catch. Fixing it.
|
|
||
|
|
||
| def _result_label_and_path(spec): | ||
| if "=" in spec: |
There was a problem hiding this comment.
this seems brittle. what precisely is the spec? can we find a better variable name? what if the path or the label contain = themselves?
There was a problem hiding this comment.
spec is a positional results arg, either PATH or LABEL=PATH, so I renamed it result_arg and added a doc spelling that out. The = ambiguity was a bug: a path that happened to contain = got silently mis-split into a bogus label + path. Now I only treat = as the separator when the arg isn't an existing file, so real paths with = load fine; a label still can't contain =, which I documented.
| raise FileExistsError( | ||
| f"results file already exists: {result_path} " | ||
| "(remove it or choose a different --output-dir)" | ||
| "(remove it or choose a different output path)" |
There was a problem hiding this comment.
could we reference to the actual args here instead of removing --output-dir?
| "Compare two named datasets as TARGET vs REF. Repeat to compare " | ||
| "multiple pairs. Single-pipeline result files use their inferred " | ||
| "label; multi-pipeline files expose LABEL:PIPELINE datasets." |
There was a problem hiding this comment.
so now we have two ways of referencing datasets, via the filename, via label:pipeline. does label:pipeline still need a path qualifier?
There was a problem hiding this comment.
No path needed, since the path is given once as the positional arg (optionally LABEL=PATH), and --vs just references the name it got. Reworded the help to say that.
| help="Compare two pipelines in one file: TARGET:REF (e.g. ir:evmasm)", | ||
| ) | ||
| cmp_parser.add_argument( | ||
| "--vs", |
There was a problem hiding this comment.
compare a.json b.json c.json --vs a b loads c.json and ignores it, there's no validation
|
Something seems off here? I added this Welch t-test with AI (my bad, sorry), but this: seems off? 10% change being "noise" seems wrong. Do you know what could be going on there? |
Thinking a bit more about this -- maybe we should print the MEDIAN not the AVERAGE, and then the difference should also be over the MEDIAN. I have a feeling we'd get better values? What's |
Hmmm, yeah, good catch. What was weird here is that the table was showing medians, but the Welch t-test was testing means. So the Please check it. |
No description provided.