Skip to content

bench: Compare ETHDebug overhead across branches#13

Open
djolertrk wants to merge 13 commits into
argotorg:mainfrom
walnuthq:feat/ethdebug
Open

bench: Compare ETHDebug overhead across branches#13
djolertrk wants to merge 13 commits into
argotorg:mainfrom
walnuthq:feat/ethdebug

Conversation

@djolertrk

Copy link
Copy Markdown
Contributor

No description provided.

@djolertrk

Copy link
Copy Markdown
Contributor Author

cc @nikola-matic @clonker

@djolertrk

Copy link
Copy Markdown
Contributor Author

Usage:

  uv run solc-bench run \
    --solc /path/to/develop/solc/solc \
    --benchmark-dir ./benchmark_data \
    --tags med \
    --iterations 5 \
    --ethdebug-overhead \
    --output-dir ./ethdebug-overhead-develop

  uv run solc-bench run \
    --solc /path/to/current/solc/solc \
    --benchmark-dir ./benchmark_data \
    --tags med \
    --iterations 5 \
    --ethdebug-overhead \
    --output-dir ./ethdebug-overhead-current

Then compare:

  uv run solc-bench compare \
    ./ethdebug-overhead-develop/bench-results.json \
    ./ethdebug-overhead-current/bench-results.json \
    --ethdebug-branches \
    --baseline-label develop \
    --target-label feature/internal-spec

The output:

Benchmark          Metric         feature/internal-spec ir-ethdebug  develop ir-ethdebug   Δ ir-ethdebug  feature/internal-spec ir  develop ir           Δ ir    feature/internal-spec overhead  develop overhead  Δ overhead
-----------------  -------------  ---------------------------------  --------------------  -------------  ------------------------  -------------------  ------  ------------------------------  ----------------  ----------
prb-math-4.1.1     cpu_time       7.8683s ± 0.2517s                  7.6545s ± 0.2175s     +2.79%         6.8656s ± 0.0663s         6.7416s ± 0.1290s    +1.84%  +14.61%                         +13.54%           +1.07 pp
                   creation_size  592,270 ± 0                        592,270 ± 0           0.0%           592,270 ± 0               592,270 ± 0          0.0%    0.0%                            0.0%              0.0 pp
                   peak_rss       2169072 ± 118595 MiB               2175808 ± 158756 MiB  -0.31%         1351712 ± 33354 MiB       1348928 ± 26619 MiB  +0.21%  +60.47%                         +61.3%            -0.83 pp
                   runtime_size   589,640 ± 0                        589,640 ± 0           0.0%           589,640 ± 0               589,640 ± 0          0.0%    0.0%                            0.0%              0.0 pp
                   wall_time      7.8968s ± 0.3190s                  7.6873s ± 0.2856s     +2.73%         6.9299s ± 0.0886s         6.7658s ± 0.1669s    +2.43%  +13.95%                         +13.62%           +0.33 pp
                   ethdebug_size  44.80 MiB ± 0.00 MiB               44.76 MiB ± 0.00 MiB  +0.08%         n/a                       n/a                  N/A     N/A                             N/A               N/A

forge-std-1.16.1   cpu_time       7.9549s ± 0.0795s                  7.9994s ± 0.0906s     -0.56%         6.8365s ± 0.0742s         6.9321s ± 0.0928s    -1.38%  +16.36%                         +15.4%            +0.96 pp
                   creation_size  725,280 ± 0                        725,280 ± 0           0.0%           725,280 ± 0               725,280 ± 0          0.0%    0.0%                            0.0%              0.0 pp
                   peak_rss       1568768 ± 36955 MiB                1612816 ± 63618 MiB   -2.73%         841536 ± 3475 MiB         825392 ± 75444 MiB   +1.96%  +86.42%                         +95.4%            -8.98 pp
                   runtime_size   698,851 ± 0                        698,851 ± 0           0.0%           698,851 ± 0               698,851 ± 0          0.0%    0.0%                            0.0%              0.0 pp
                   wall_time      7.9942s ± 0.1036s                  8.0431s ± 0.0981s     -0.61%         6.8578s ± 0.0758s         6.9591s ± 0.1023s    -1.46%  +16.57%                         +15.58%           +0.99 pp
                   ethdebug_size  44.94 MiB ± 0.00 MiB               44.88 MiB ± 0.00 MiB  +0.13%         n/a                       n/a                  N/A     N/A                             N/A               N/A

morpho-blue-1.0.0  cpu_time       11.1818s ± 0.2293s                 11.0610s ± 0.1896s    +1.09%         9.5450s ± 0.1405s         9.6678s ± 0.1083s    -1.27%  +17.15%                         +14.41%           +2.74 pp
                   creation_size  1,261,421 ± 0                      1,261,421 ± 0         0.0%           1,261,421 ± 0             1,261,421 ± 0        0.0%    0.0%                            0.0%              0.0 pp
                   peak_rss       1893616 ± 71660 MiB                1703536 ± 21941 MiB   +11.16%        957168 ± 28410 MiB        948432 ± 46672 MiB   +0.92%  +97.84%                         +79.62%           +18.22 pp
                   runtime_size   1,251,246 ± 0                      1,251,246 ± 0         0.0%           1,251,246 ± 0             1,251,246 ± 0        0.0%    0.0%                            0.0%              0.0 pp
                   wall_time      11.3639s ± 0.2788s                 11.1654s ± 0.2197s    +1.78%         9.5693s ± 0.1873s         9.7115s ± 0.1305s    -1.46%  +18.75%                         +14.97%           +3.78 pp
                   ethdebug_size  57.38 MiB ± 0.00 MiB               57.29 MiB ± 0.00 MiB  +0.15%         n/a                       n/a                  N/A     N/A                             N/A               N/A

solmate-6          cpu_time       4.7389s ± 0.0616s                  4.6239s ± 0.0854s     +2.49%         4.0220s ± 0.0828s         3.9918s ± 0.1613s    +0.76%  +17.83%                         +15.84%           +1.99 pp
                   creation_size  540,758 ± 0                        540,758 ± 0           0.0%           540,758 ± 0               540,758 ± 0          0.0%    0.0%                            0.0%              0.0 pp
                   peak_rss       985184 ± 29731 MiB                 968240 ± 3574 MiB     +1.75%         385376 ± 5713 MiB         372192 ± 1248 MiB    +3.54%  +155.64%                        +160.15%          -4.51 pp
                   runtime_size   523,577 ± 0                        523,577 ± 0           0.0%           523,577 ± 0               523,577 ± 0          0.0%    0.0%                            0.0%              0.0 pp
                   wall_time      4.7608s ± 0.0685s                  4.6550s ± 0.0867s     +2.27%         4.0436s ± 0.1090s         4.0118s ± 0.1698s    +0.79%  +17.74%                         +16.03%           +1.71 pp
                   ethdebug_size  30.92 MiB ± 0.00 MiB               30.89 MiB ± 0.00 MiB  +0.07%         n/a                       n/a                  N/A     N/A                             N/A               N/A

@clonker

clonker commented Jun 11, 2026

Copy link
Copy Markdown
Member

I am not a fan of introducing yet another two flags here. Basically what we want is four different result datasets on

  • develop without ethdebug
  • develop with ethdebug
  • feat without ethdebug
  • feat with ethdebug

The current approach conflates the concepts of which datasets and which pairwise comparisons over them to run. These should be orthogonal, IMO.

Ideally we have something like

solc-bench run --solc ./solc-develop --pipeline ir          -o dev-ir.json
solc-bench run --solc ./solc-develop --pipeline ir-ethdebug -o dev-ed.json
solc-bench run --solc ./solc-feat    --pipeline ir          -o feat-ir.json
solc-bench run --solc ./solc-feat    --pipeline ir-ethdebug -o feat-ed.json

to produce data and then compare whatever

solc-bench compare dev-ir.json dev-ed.json feat-ir.json feat-ed.json \
  --vs feat-ir dev-ir   \   # delta ir across branches
  --vs feat-ed dev-ed   \   # delta ir-ethdebug across branches
  --vs dev-ed  dev-ir   \   # overhead on develop
  --vs feat-ed feat-ir      # overhead on feat

I understand this is not (just) on you, still, this PR adds onto it.

ir-ethdebug is a regular pipeline now, so the run mode that bundled an
unoptimized ir baseline and an ir-ethdebug run into one result file is
redundant and conflates dataset selection with a special mode. Each run
produces exactly the dataset its flags describe, and compare runs the
pairwise comparisons requested via --vs:

  solc-bench run --solc ./solc --pipeline ir --no-optimize -o ir.json
  solc-bench run --solc ./solc --pipeline ir-ethdebug -o ed.json
  solc-bench compare ir.json ed.json --vs ed ir
@djolertrk

Copy link
Copy Markdown
Contributor Author

I am not a fan of introducing yet another two flags here. Basically what we want is four different result datasets on

  • develop without ethdebug
  • develop with ethdebug
  • feat without ethdebug
  • feat with ethdebug

The current approach conflates the concepts of which datasets and which pairwise comparisons over them to run. These should be orthogonal, IMO.

Ideally we have something like

solc-bench run --solc ./solc-develop --pipeline ir          -o dev-ir.json
solc-bench run --solc ./solc-develop --pipeline ir-ethdebug -o dev-ed.json
solc-bench run --solc ./solc-feat    --pipeline ir          -o feat-ir.json
solc-bench run --solc ./solc-feat    --pipeline ir-ethdebug -o feat-ed.json

to produce data and then compare whatever

solc-bench compare dev-ir.json dev-ed.json feat-ir.json feat-ed.json \
  --vs feat-ir dev-ir   \   # delta ir across branches
  --vs feat-ed dev-ed   \   # delta ir-ethdebug across branches
  --vs dev-ed  dev-ir   \   # overhead on develop
  --vs feat-ed feat-ir      # overhead on feat

I understand this is not (just) on you, still, this PR adds onto it.

@clonker sure, I agree with this. Please check it now:

Comparison: feat-ir vs dev-ir

Benchmark          Metric         feat-ir              dev-ir               Δ%       winner
-----------------  -------------  -------------------  -------------------  -------  ------
prb-math-4.1.1     cpu_time       7.4697s ± 0.1706s    7.2882s ± 0.1161s    +2.49%   ~noise
                   creation_size  592,270 ± 0          592,270 ± 0          0.0%     tie
                   peak_rss       1283664 ± 18758 MiB  1300368 ± 89882 MiB  -1.28%   ~noise
                   runtime_size   589,640 ± 0          589,640 ± 0          0.0%     tie
                   wall_time      7.6108s ± 0.2776s    7.3393s ± 0.1567s    +3.7%    ~noise

forge-std-1.16.1   cpu_time       7.3679s ± 0.1564s    7.5212s ± 0.0777s    -2.04%   ~noise
                   creation_size  725,280 ± 0          725,280 ± 0          0.0%     tie
                   peak_rss       835728 ± 26586 MiB   757408 ± 40316 MiB   +10.34%  ~noise
                   runtime_size   698,851 ± 0          698,851 ± 0          0.0%     tie
                   wall_time      7.4271s ± 0.1986s    7.6105s ± 0.0980s    -2.41%   ~noise

morpho-blue-1.0.0  cpu_time       10.3606s ± 0.2164s   10.2282s ± 0.1335s   +1.29%   ~noise
                   creation_size  1,261,421 ± 0        1,261,421 ± 0        0.0%     tie
                   peak_rss       925232 ± 11855 MiB   885600 ± 40382 MiB   +4.48%   ~noise
                   runtime_size   1,251,246 ± 0        1,251,246 ± 0        0.0%     tie
                   wall_time      10.5072s ± 0.6548s   10.3461s ± 0.1563s   +1.56%   ~noise

solmate-6          cpu_time       4.2158s ± 0.1264s    4.1597s ± 0.0604s    +1.35%   ~noise
                   creation_size  540,758 ± 0          540,758 ± 0          0.0%     tie
                   peak_rss       387104 ± 4444 MiB    373008 ± 1400 MiB    +3.78%   dev-ir
                   runtime_size   523,577 ± 0          523,577 ± 0          0.0%     tie
                   wall_time      4.2523s ± 0.1784s    4.1910s ± 0.0587s    +1.46%   ~noise

Comparison: feat-ed vs dev-ed

Benchmark          Metric         feat-ed               dev-ed                Δ%      winner
-----------------  -------------  --------------------  --------------------  ------  ------
prb-math-4.1.1     cpu_time       8.6285s ± 0.1877s     8.3880s ± 0.2046s     +2.87%  ~noise
                   creation_size  592,270 ± 0           592,270 ± 0           0.0%    tie
                   ethdebug_size  44.80 MiB ± 0.00 MiB  44.76 MiB ± 0.00 MiB  +0.08%  ~noise
                   peak_rss       1984880 ± 121188 MiB  1867712 ± 87870 MiB   +6.27%  ~noise
                   runtime_size   589,640 ± 0           589,640 ± 0           0.0%    tie
                   wall_time      8.8527s ± 0.2882s     8.4716s ± 0.2892s     +4.5%   ~noise

forge-std-1.16.1   cpu_time       8.7960s ± 0.4597s     8.3687s ± 0.0730s     +5.11%  ~noise
                   creation_size  725,280 ± 0           725,280 ± 0           0.0%    tie
                   ethdebug_size  44.94 MiB ± 0.00 MiB  44.88 MiB ± 0.00 MiB  +0.13%  dev-ed
                   peak_rss       1609776 ± 81821 MiB   1571776 ± 121598 MiB  +2.42%  ~noise
                   runtime_size   698,851 ± 0           698,851 ± 0           0.0%    tie
                   wall_time      8.9539s ± 1.0165s     8.4514s ± 0.0777s     +5.95%  ~noise

morpho-blue-1.0.0  cpu_time       11.8989s ± 0.3171s    11.5593s ± 0.1716s    +2.94%  ~noise
                   creation_size  1,261,421 ± 0         1,261,421 ± 0         0.0%    tie
                   ethdebug_size  57.38 MiB ± 0.00 MiB  57.29 MiB ± 0.00 MiB  +0.15%  dev-ed
                   peak_rss       1853424 ± 87738 MiB   1832544 ± 49277 MiB   +1.14%  ~noise
                   runtime_size   1,251,246 ± 0         1,251,246 ± 0         0.0%    tie
                   wall_time      12.0327s ± 0.3882s    11.7122s ± 0.2232s    +2.74%  ~noise

solmate-6          cpu_time       4.8951s ± 0.0925s     4.9858s ± 0.0576s     -1.82%  ~noise
                   creation_size  540,758 ± 0           540,758 ± 0           0.0%    tie
                   ethdebug_size  30.92 MiB ± 0.00 MiB  30.89 MiB ± 0.00 MiB  +0.07%  ~noise
                   peak_rss       988512 ± 30735 MiB    971360 ± 2383 MiB     +1.77%  ~noise
                   runtime_size   523,577 ± 0           523,577 ± 0           0.0%    tie
                   wall_time      4.9308s ± 0.1116s     5.0301s ± 0.0607s     -1.98%  ~noise

Comparison: dev-ed vs dev-ir

Benchmark          Metric         dev-ed                dev-ir               Δ%        winner
-----------------  -------------  --------------------  -------------------  --------  ------
prb-math-4.1.1     cpu_time       8.3880s ± 0.2046s     7.2882s ± 0.1161s    +15.09%   dev-ir
                   creation_size  592,270 ± 0           592,270 ± 0          0.0%      tie
                   peak_rss       1867712 ± 87870 MiB   1300368 ± 89882 MiB  +43.63%   dev-ir
                   runtime_size   589,640 ± 0           589,640 ± 0          0.0%      tie
                   wall_time      8.4716s ± 0.2892s     7.3393s ± 0.1567s    +15.43%   dev-ir
                   ethdebug_size  44.76 MiB ± 0.00 MiB  n/a                  N/A       n/a

forge-std-1.16.1   cpu_time       8.3687s ± 0.0730s     7.5212s ± 0.0777s    +11.27%   dev-ir
                   creation_size  725,280 ± 0           725,280 ± 0          0.0%      tie
                   peak_rss       1571776 ± 121598 MiB  757408 ± 40316 MiB   +107.52%  dev-ir
                   runtime_size   698,851 ± 0           698,851 ± 0          0.0%      tie
                   wall_time      8.4514s ± 0.0777s     7.6105s ± 0.0980s    +11.05%   dev-ir
                   ethdebug_size  44.88 MiB ± 0.00 MiB  n/a                  N/A       n/a

morpho-blue-1.0.0  cpu_time       11.5593s ± 0.1716s    10.2282s ± 0.1335s   +13.01%   dev-ir
                   creation_size  1,261,421 ± 0         1,261,421 ± 0        0.0%      tie
                   peak_rss       1832544 ± 49277 MiB   885600 ± 40382 MiB   +106.93%  dev-ir
                   runtime_size   1,251,246 ± 0         1,251,246 ± 0        0.0%      tie
                   wall_time      11.7122s ± 0.2232s    10.3461s ± 0.1563s   +13.2%    dev-ir
                   ethdebug_size  57.29 MiB ± 0.00 MiB  n/a                  N/A       n/a

solmate-6          cpu_time       4.9858s ± 0.0576s     4.1597s ± 0.0604s    +19.86%   dev-ir
                   creation_size  540,758 ± 0           540,758 ± 0          0.0%      tie
                   peak_rss       971360 ± 2383 MiB     373008 ± 1400 MiB    +160.41%  dev-ir
                   runtime_size   523,577 ± 0           523,577 ± 0          0.0%      tie
                   wall_time      5.0301s ± 0.0607s     4.1910s ± 0.0587s    +20.02%   dev-ir
                   ethdebug_size  30.89 MiB ± 0.00 MiB  n/a                  N/A       n/a

Comparison: feat-ed vs feat-ir

Benchmark          Metric         feat-ed               feat-ir              Δ%        winner
-----------------  -------------  --------------------  -------------------  --------  -------
prb-math-4.1.1     cpu_time       8.6285s ± 0.1877s     7.4697s ± 0.1706s    +15.51%   feat-ir
                   creation_size  592,270 ± 0           592,270 ± 0          0.0%      tie
                   peak_rss       1984880 ± 121188 MiB  1283664 ± 18758 MiB  +54.63%   feat-ir
                   runtime_size   589,640 ± 0           589,640 ± 0          0.0%      tie
                   wall_time      8.8527s ± 0.2882s     7.6108s ± 0.2776s    +16.32%   feat-ir
                   ethdebug_size  44.80 MiB ± 0.00 MiB  n/a                  N/A       n/a

forge-std-1.16.1   cpu_time       8.7960s ± 0.4597s     7.3679s ± 0.1564s    +19.38%   feat-ir
                   creation_size  725,280 ± 0           725,280 ± 0          0.0%      tie
                   peak_rss       1609776 ± 81821 MiB   835728 ± 26586 MiB   +92.62%   feat-ir
                   runtime_size   698,851 ± 0           698,851 ± 0          0.0%      tie
                   wall_time      8.9539s ± 1.0165s     7.4271s ± 0.1986s    +20.56%   ~noise
                   ethdebug_size  44.94 MiB ± 0.00 MiB  n/a                  N/A       n/a

morpho-blue-1.0.0  cpu_time       11.8989s ± 0.3171s    10.3606s ± 0.2164s   +14.85%   feat-ir
                   creation_size  1,261,421 ± 0         1,261,421 ± 0        0.0%      tie
                   peak_rss       1853424 ± 87738 MiB   925232 ± 11855 MiB   +100.32%  feat-ir
                   runtime_size   1,251,246 ± 0         1,251,246 ± 0        0.0%      tie
                   wall_time      12.0327s ± 0.3882s    10.5072s ± 0.6548s   +14.52%   ~noise
                   ethdebug_size  57.38 MiB ± 0.00 MiB  n/a                  N/A       n/a

solmate-6          cpu_time       4.8951s ± 0.0925s     4.2158s ± 0.1264s    +16.11%   feat-ir
                   creation_size  540,758 ± 0           540,758 ± 0          0.0%      tie
                   peak_rss       988512 ± 30735 MiB    387104 ± 4444 MiB    +155.36%  feat-ir
                   runtime_size   523,577 ± 0           523,577 ± 0          0.0%      tie
                   wall_time      4.9308s ± 0.1116s     4.2523s ± 0.1784s    +15.95%   feat-ir
                   ethdebug_size  30.92 MiB ± 0.00 MiB  n/a                  N/A       n/a

and I run (for different cases) it as:

solc-bench run \
    --solc /path/to/solidity/build_develop/solc/solc \
    --benchmark-dir ./benchmark_data \
    --pipeline ir \
    --no-optimize \
    --tags med \
    --iterations 5 \
    -o dev-ir.json

@clonker clonker left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! A few comments regarding things I noticed when doing a quick pass. I'll want to do a more thorough one later.

Comment on lines +303 to +312
if pipeline == "ir-ethdebug":
# ETHDebug program output does not support the optimizer yet,
# so this pipeline always compiles unoptimized IR.
runs.append(
(
pipeline,
resolve_solc_settings("ir", True, ethdebug=True),
True,
)
)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this'll silently change behavior once it does support it, doesn't it? wouldn't it be better to just throw if ethdebug is requested but nooptimize is False?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, makes sense to raise if requested without no_optimize

Comment thread src/solc_bench/benchmark.py Outdated
self.results, self.solc_version, self.iterations
)
result_path = self.output_dir / "bench-results.json"
result_path = self.output_file or self.output_dir / "bench-results.json"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there anything that relies on the output file being named bench-results.json?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No :) introducing DEFAULT_RESULT_FILENAME, but also considering --output-file (if used)

Comment thread src/solc_bench/cli.py Outdated
Comment on lines +90 to +91
if args.output_file and args.output_dir == ".":
output_dir = str(Path(args.output_file).parent)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't this break or at least lead to inconsistent behavior if i request an output dir like ./subdir?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good catch. Fixing it.

Comment thread src/solc_bench/cli.py Outdated


def _result_label_and_path(spec):
if "=" in spec:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems brittle. what precisely is the spec? can we find a better variable name? what if the path or the label contain = themselves?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spec is a positional results arg, either PATH or LABEL=PATH, so I renamed it result_arg and added a doc spelling that out. The = ambiguity was a bug: a path that happened to contain = got silently mis-split into a bogus label + path. Now I only treat = as the separator when the arg isn't an existing file, so real paths with = load fine; a label still can't contain =, which I documented.

Comment thread src/solc_bench/cli.py Outdated
raise FileExistsError(
f"results file already exists: {result_path} "
"(remove it or choose a different --output-dir)"
"(remove it or choose a different output path)"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we reference to the actual args here instead of removing --output-dir?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, thanks

Comment thread src/solc_bench/cli.py Outdated
Comment on lines +524 to +526
"Compare two named datasets as TARGET vs REF. Repeat to compare "
"multiple pairs. Single-pipeline result files use their inferred "
"label; multi-pipeline files expose LABEL:PIPELINE datasets."

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so now we have two ways of referencing datasets, via the filename, via label:pipeline. does label:pipeline still need a path qualifier?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No path needed, since the path is given once as the positional arg (optionally LABEL=PATH), and --vs just references the name it got. Reworded the help to say that.

Comment thread src/solc_bench/cli.py
help="Compare two pipelines in one file: TARGET:REF (e.g. ir:evmasm)",
)
cmp_parser.add_argument(
"--vs",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compare a.json b.json c.json --vs a b loads c.json and ignores it, there's no validation

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks

@djolertrk djolertrk requested a review from clonker June 15, 2026 11:25
@msooseth

Copy link
Copy Markdown
Contributor

Something seems off here? I added this Welch t-test with AI (my bad, sorry), but this:

                   peak_rss       835728 ± 26586 MiB   757408 ± 40316 MiB   +10.34%  ~noise

seems off? 10% change being "noise" seems wrong. Do you know what could be going on there?

@msooseth

Copy link
Copy Markdown
Contributor
                   peak_rss       835728 ± 26586 MiB   757408 ± 40316 MiB   +10.34%  ~noise

seems off? 10% change being "noise" seems wrong. Do you know what could be going on there?

Thinking a bit more about this -- maybe we should print the MEDIAN not the AVERAGE, and then the difference should also be over the MEDIAN. I have a feeling we'd get better values? What's n by default right now? Would be best to make it an odd number, like 3 or 5, or 7.

@djolertrk

Copy link
Copy Markdown
Contributor Author
                   peak_rss       835728 ± 26586 MiB   757408 ± 40316 MiB   +10.34%  ~noise

seems off? 10% change being "noise" seems wrong. Do you know what could be going on there?

Thinking a bit more about this -- maybe we should print the MEDIAN not the AVERAGE, and then the difference should also be over the MEDIAN. I have a feeling we'd get better values? What's n by default right now? Would be best to make it an odd number, like 3 or 5, or 7.

Hmmm, yeah, good catch. What was weird here is that the table was showing medians, but the Welch t-test was testing means. So the +10.34% number wasn’t actually the value used for the ~noise decision. For that sample, the means are 821085 vs 782778, so the mean delta is +4.89%, and Welch gives p ~= 0.12, so ~noise is reasonable there. n defaults to 3, and this run used 5. I changed the comparison output to show mean values/deltas so it matches the t-test, and I made the output say n=... so this is less confusing.

Comparison: feat-ir vs dev-ir

Benchmark          Metric         feat-ir              dev-ir               Δ%      winner
-----------------  -------------  -------------------  -------------------  ------  ------
prb-math-4.1.1     cpu_time       7.4721s ± 0.1706s    7.2767s ± 0.1161s    +2.69%  ~noise
                   creation_size  592,270 ± 0          592,270 ± 0          0.0%    tie
                   peak_rss       1277715 ± 18758 MiB  1271731 ± 89882 MiB  +0.47%  ~noise
                   runtime_size   589,640 ± 0          589,640 ± 0          0.0%    tie
                   wall_time      7.6527s ± 0.2776s    7.3529s ± 0.1567s    +4.08%  ~noise

forge-std-1.16.1   cpu_time       7.4001s ± 0.1564s    7.5066s ± 0.0777s    -1.42%  ~noise
                   creation_size  725,280 ± 0          725,280 ± 0          0.0%    tie
                   peak_rss       821085 ± 26586 MiB   782778 ± 40316 MiB   +4.89%  ~noise
                   runtime_size   698,851 ± 0          698,851 ± 0          0.0%    tie
                   wall_time      7.4746s ± 0.1986s    7.6138s ± 0.0980s    -1.83%  ~noise

morpho-blue-1.0.0  cpu_time       10.3839s ± 0.2164s   10.2912s ± 0.1335s   +0.9%   ~noise
                   creation_size  1,261,421 ± 0        1,261,421 ± 0        0.0%    tie
                   peak_rss       930016 ± 11855 MiB   890010 ± 40382 MiB   +4.5%   ~noise
                   runtime_size   1,251,246 ± 0        1,251,246 ± 0        0.0%    tie
                   wall_time      10.7231s ± 0.6548s   10.3930s ± 0.1563s   +3.18%  ~noise

solmate-6          cpu_time       4.2701s ± 0.1264s    4.1882s ± 0.0604s    +1.96%  ~noise
                   creation_size  540,758 ± 0          540,758 ± 0          0.0%    tie
                   peak_rss       385434 ± 4444 MiB    373802 ± 1400 MiB    +3.11%  dev-ir
                   runtime_size   523,577 ± 0          523,577 ± 0          0.0%    tie
                   wall_time      4.3283s ± 0.1784s    4.2186s ± 0.0587s    +2.6%   ~noise

Please check it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants