argotorg · djolertrk · Jun 11, 2026 · Jun 11, 2026 · Jun 12, 2026 · Jun 14, 2026
diff --git a/README.md b/README.md
@@ -55,6 +55,7 @@ the pipelines in each benchmark's TOML entry (or all if unspecified);
 | `evmasm` | `"viaIR": false` — EVM assembly codegen |
 | `ir` | `"viaIR": true` — IR-based codegen |
 | `ir-ssacfg` | `"viaIR": true, "viaSSACFG": true` — SSA-CFG experimental codegen |
+| `ir-ethdebug` | `"viaIR": true`, optimizer disabled — unoptimized IR codegen with ETHDebug outputs requested (see [ETHDebug overhead](#ethdebug-overhead)) |
 
 ## Metrics
 
@@ -142,7 +143,7 @@ solc-bench fetch develop --output ./solc --force
 
 Benchmarks a suite, or a single `.sol`/`.json` `input_file` (which bypasses
 the suite and needs no `--benchmark-dir`). Results land in
-`bench-results.json` in `--output-dir`.
+`bench-results.json` in `--output-dir`, unless `-o`/`--output-file` names another file.
 
 | Flag | Default | Description |
 |------|---------|-------------|
@@ -152,8 +153,9 @@ the suite and needs no `--benchmark-dir`). Results land in
 | `--tags TAGS` | (none) | Comma-separated tags, AND'd with `--only` |
 | `--iterations N` | `3` | Number of iterations |
 | `--output-dir DIR` | current dir | Where to write results + logs |
+| `-o, --output-file FILE` | (none) | Write the result JSON to `FILE` instead of `bench-results.json` in `--output-dir` |
 | `--stdout` | off | Also print results to stdout |
-| `--pipeline P` | (all) | Single pipeline: `evmasm`/`ir`/`ir-ssacfg` |
+| `--pipeline P` | (all) | Single pipeline: `evmasm`/`ir`/`ir-ssacfg`/`ir-ethdebug` |
 | `--no-optimize` | off | Disable the optimizer |
 
 ```bash
@@ -163,44 +165,102 @@ solc-bench run --solc ./solc contract.sol --pipeline ir       # single file
 
 ### ETHDebug overhead
 
-`--ethdebug-overhead` measures the extra compilation cost of producing
-ETHDebug output with the same compiler. It runs every selected benchmark twice:
-`ir` is the unoptimized IR baseline, and `ir-ethdebug` is the same unoptimized
-IR compilation with `evm.bytecode.ethdebug`,
+`ir-ethdebug` is a regular pipeline: the same unoptimized IR compilation as
+`ir` with `--no-optimize`, plus `evm.bytecode.ethdebug`,
 `evm.deployedBytecode.ethdebug`, `ethdebug.resources`, and
-`ethdebug.compilation` requested. This mode intentionally disables the
-optimizer because ETHDebug program output does not support optimization yet,
-and skips gas benchmarks because it is intended to measure compilation cost.
-The `ir-ethdebug` results also include `ethdebug_size`, the serialized byte
-size of all requested ETHDebug artifacts. It is stored as bytes in the result
-JSON and rendered as MiB in comparison tables.
+`ethdebug.compilation` requested. It requires `--no-optimize` because
+ETHDebug program output does not support optimization yet, and gas
+benchmarks are skipped because the pipeline measures compilation cost. Its
+results include `ethdebug_size`, the serialized byte size of all requested
+ETHDebug artifacts, stored as bytes in the result JSON and rendered as MiB in
+comparison tables.
+
+Producing datasets and comparing them are orthogonal: each `run` produces one
+result dataset, and `compare` runs whatever pairwise comparisons you ask for
+with `--vs`. Use `--no-optimize` for the plain `ir` datasets so the baseline
+matches the unoptimized IR that `ir-ethdebug` compiles.
+
+ETHDebug overhead of a single compiler:
 
 ```bash
+solc-bench run --solc ./solc --benchmark-dir ./benchmark_data \
+  --tags med --iterations 5 --pipeline ir --no-optimize -o ./ir.json
+solc-bench run --solc ./solc --benchmark-dir ./benchmark_data \
+  --tags med --iterations 5 --pipeline ir-ethdebug --no-optimize -o ./ed.json
+
+solc-bench compare ./ir.json ./ed.json --vs ed ir
+solc-bench compare ./ir.json ./ed.json --vs ed ir --max-regression cpu_time:30
+```
+
+To review an ETHDebug PR against `develop`, produce four datasets and compare
+the pairs you care about:
+
+```bash
+solc-bench run \
+  --solc ./solc-develop \
+  --benchmark-dir ./benchmark_data \
+  --pipeline ir \
+  --no-optimize \
+  --tags med \
+  --iterations 5 \
+  -o ./dev-ir.json
+
+solc-bench run \
+  --solc ./solc-develop \
+  --benchmark-dir ./benchmark_data \
+  --pipeline ir-ethdebug \
+  --no-optimize \
+  --tags med \
+  --iterations 5 \
+  -o ./dev-ed.json
+
 solc-bench run \
-  --solc ./solc \
+  --solc ./solc-current \
   --benchmark-dir ./benchmark_data \
+  --pipeline ir \
+  --no-optimize \
   --tags med \
   --iterations 5 \
-  --ethdebug-overhead \
-  --output-dir ./ethdebug-overhead
+  -o ./feat-ir.json
 
-solc-bench compare ./ethdebug-overhead/bench-results.json --pipelines ir-ethdebug:ir
-solc-bench compare ./ethdebug-overhead/bench-results.json --pipelines ir-ethdebug:ir --max-regression cpu_time:30
+solc-bench run \
+  --solc ./solc-current \
+  --benchmark-dir ./benchmark_data \
+  --pipeline ir-ethdebug \
+  --no-optimize \
+  --tags med \
+  --iterations 5 \
+  -o ./feat-ed.json
+
+solc-bench compare \
+  ./dev-ir.json ./dev-ed.json ./feat-ir.json ./feat-ed.json \
+  --vs feat-ir dev-ir \
+  --vs feat-ed dev-ed \
+  --vs dev-ed dev-ir \
+  --vs feat-ed feat-ir
 ```
 
-### `solc-bench compare <baseline> [target]`
+This reports `ir` across branches, `ir-ethdebug` across branches, ETHDebug
+overhead on `develop`, and ETHDebug overhead on the feature branch.
+
+### `solc-bench compare <results...>`
 
-Compares two result files (cross-version), or two pipelines within one file
-via `--pipelines TARGET:REF`. The output shows each metric's signed percent
-delta; every metric is lower-is-better, so negative is an improvement. The
-`winner` column names the better side, but shows `~noise` unless the gap
-passes a Welch t-test and exceeds 0.10% (statistically real and large enough
-to act on). `--per-function` adds a per-function gas delta table when both
-files have gas data.
+Compares two result files (cross-version), two pipelines within one file via
+`--pipelines TARGET:REF`, or any number of named result datasets via repeated
+`--vs TARGET REF` pairs. `--vs` takes no path itself; it references the
+datasets defined by the positional files: a single-pipeline file by its
+label, a multi-pipeline file by `LABEL:PIPELINE`. The output shows each
+metric's signed percent delta; every metric is lower-is-better, so a
+negative delta is an improvement. The `winner` column names the better
+side, but shows `~noise` unless the gap passes a Welch t-test and exceeds
+0.10% (statistically real and large enough to act on). `--per-function`
+adds a per-function gas delta table when both sides of a comparison have
+gas data.
 
 | Flag | Default | Description |
 |------|---------|-------------|
 | `--pipelines TARGET:REF` | cross-version | Compare two pipelines in one file (e.g. `ir:evmasm`) |
+| `--vs TARGET REF` | off | Compare two named datasets; repeatable |
 | `--format table`/`json` | `table` | Output format |
 | `--output FILE` | (none) | Write comparison JSON to file |
 | `--per-function STAT` | `median` | Per-function gas deltas: `min`/`mean`/`median`/`max` |
@@ -211,6 +271,8 @@ files have gas data.
 ```bash
 solc-bench compare baseline/bench-results.json target/bench-results.json --per-function
 solc-bench compare bench-results.json --pipelines ir:evmasm --plot diff.png
+solc-bench compare dev-ir.json feat-ir.json --vs feat-ir dev-ir
+solc-bench compare dev=dev/bench-results.json feat=feat/bench-results.json --vs feat:ir dev:ir
 ```
 
 ### `solc-bench extract`

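The README examples above name datasets in two ways: a bare path such as `dev-ir.json` is referenced as `dev-ir`, and `dev=dev/bench-results.json` is referenced as `dev` (or `dev:ir` for one pipeline of a multi-pipeline file). The sketch below shows one way that resolution could work. It is inferred from those examples only, not taken from the solc-bench source: `parse_dataset_arg`, `resolve_ref`, and the rule that a bare path is labelled by its filename stem are all assumptions.

```python
from pathlib import Path


def parse_dataset_arg(arg):
    """Split a positional argument into (label, path).

    Assumption: `LABEL=PATH` names the dataset explicitly; a bare path
    is labelled by its filename stem (`./dev-ir.json` -> `dev-ir`).
    """
    if "=" in arg:
        label, path = arg.split("=", 1)
        return label, path
    return Path(arg).stem, arg


def resolve_ref(ref, datasets):
    """Resolve a `--vs` operand to (label, pipeline).

    `datasets` maps each label to the list of pipeline names found in
    that result file (a hypothetical in-memory shape).
    """
    label, _, pipeline = ref.partition(":")
    if label not in datasets:
        raise KeyError(f"unknown dataset label: {label}")
    pipelines = datasets[label]
    if pipeline:
        if pipeline not in pipelines:
            raise KeyError(f"{label} has no pipeline {pipeline}")
        return label, pipeline
    # A bare label is only unambiguous for a single-pipeline file.
    if len(pipelines) != 1:
        raise ValueError(f"{label} holds several pipelines; use LABEL:PIPELINE")
    return label, pipelines[0]
```

Under these assumptions, `--vs feat-ir dev-ir` resolves each operand through the single-pipeline branch, while `--vs feat:ir dev:ir` goes through the explicit `LABEL:PIPELINE` branch.
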
diff --git a/src/solc_bench/benchmark.py b/src/solc_bench/benchmark.py
@@ -7,6 +7,7 @@
 
 from solc_bench.config import (
     DEFAULT_PIPELINES,
+    DEFAULT_RESULT_FILENAME,
     load_benchmarks,
 )
 from solc_bench.gas import ensure_project, run_gas_benchmark
@@ -35,6 +36,16 @@ def perf_available():
         return False
 
 
+def _ru_maxrss_mib(ru_maxrss):
+    """Normalize resource.ru_maxrss to MiB.
+
+    Linux reports ru_maxrss in KiB, while macOS reports it in bytes.
+    """
+    if sys.platform == "darwin":
+        return ru_maxrss / (1024 * 1024)
+    return ru_maxrss / 1024
+
+
 class Benchmark:
     """Runs solc and collects all metrics."""
 
@@ -101,7 +112,7 @@ def invoke_solc(self, input_file):
         metrics = {
             "cpu_time": rusage.ru_utime + rusage.ru_stime,
             "wall_time": wall_time,
-            "peak_rss": rusage.ru_maxrss / 1024,  # KiB -> MiB
+            "peak_rss": _ru_maxrss_mib(rusage.ru_maxrss),
             "exit_code": proc.returncode,
         }
 
@@ -114,11 +125,19 @@ def invoke_solc(self, input_file):
 class BenchmarkSuite:
     """Orchestrates benchmarks across pipelines and inputs."""
 
-    def __init__(self, solc, iterations, output_dir, keep_inputs=False):
+    def __init__(
+        self,
+        solc,
+        iterations,
+        output_dir,
+        keep_inputs=False,
+        output_file=None,
+    ):
         self.solc_version = get_solc_version(solc)
         self.benchmark = Benchmark(solc)
         self.output_dir = Path(output_dir)
         self.output_dir.mkdir(parents=True, exist_ok=True)
+        self.output_file = Path(output_file) if output_file else None
         self.iterations = iterations
         self.keep_inputs = keep_inputs
         self.results = {}
@@ -186,7 +205,7 @@ def _write_error_log(self, result, name, pipeline):
         log_path.write_text("\n".join(error_messages), encoding="utf-8")
         return str(log_path)
 
-    def run_file(self, input_file, pipeline, no_optimize, ethdebug_overhead=False):
+    def run_file(self, input_file, pipeline, no_optimize):
         """Run benchmark on a single .sol or .json input file.
 
         pipeline is a pipeline name (str) or None for all pipelines.
@@ -195,7 +214,6 @@ def run_file(self, input_file, pipeline, no_optimize, ethdebug_overhead=False):
         pipeline_runs = self._pipeline_runs(
             [pipeline] if pipeline else DEFAULT_PIPELINES,
             no_optimize,
-            ethdebug_overhead,
         )
 
         for label, solc_settings, ethdebug in pipeline_runs:
@@ -214,7 +232,6 @@ def run_suite(
         pipeline,
         no_optimize,
         tags=None,
-        ethdebug_overhead=False,
     ):
         """Run configured benchmarks from benchmarks.toml.
 
@@ -253,7 +270,7 @@ def run_suite(
                 pipelines = config.get("pipelines", DEFAULT_PIPELINES)
 
             gas_project_dir = None
-            if config.get("gas") and not ethdebug_overhead:
+            if config.get("gas"):
                 try:
                     gas_project_dir = ensure_project(
                         benchmark_dir,
@@ -270,15 +287,18 @@ def run_suite(
             for label, solc_settings, ethdebug in self._pipeline_runs(
                 pipelines,
                 no_optimize,
-                ethdebug_overhead,
             ):
                 with override_json_settings(
                     input_file,
                     solc_settings,
                     ethdebug,
                 ) as tmp_file:
                     self.run_pipeline(
-                        tmp_file, name, label, solc_settings, gas_project_dir
+                        tmp_file,
+                        name,
+                        label,
+                        solc_settings,
+                        None if ethdebug else gas_project_dir,
                     )
 
         if (selected or tag_set) and not matched_any:
@@ -288,21 +308,28 @@ def run_suite(
             )
 
     @staticmethod
-    def _pipeline_runs(pipelines, no_optimize, ethdebug_overhead=False):
-        if not ethdebug_overhead:
-            return [
-                (p, resolve_solc_settings(p, no_optimize), False)
-                for p in pipelines
-            ]
-
-        return [
-            ("ir", resolve_solc_settings("ir", True), False),
-            (
-                "ir-ethdebug",
-                resolve_solc_settings("ir", True, ethdebug=True),
-                True,
-            ),
-        ]
+    def _pipeline_runs(pipelines, no_optimize):
+        runs = []
+        for pipeline in pipelines:
+            if pipeline == "ir-ethdebug":
+                # ETHDebug program output does not support the optimizer yet;
+                # resolve_solc_settings requires --no-optimize for this pipeline.
+                runs.append(
+                    (
+                        pipeline,
+                        resolve_solc_settings("ir", no_optimize, ethdebug=True),
+                        True,
+                    )
+                )
+            else:
+                runs.append(
+                    (
+                        pipeline,
+                        resolve_solc_settings(pipeline, no_optimize),
+                        False,
+                    )
+                )
+        return runs
 
     def write_results(self, stdout=False):
         """Write results JSON to output dir, optionally also to stdout."""
@@ -313,7 +340,7 @@ def write_results(self, stdout=False):
         output = reporter.build_result_json(
             self.results, self.solc_version, self.iterations
         )
-        result_path = self.output_dir / "bench-results.json"
+        result_path = self.output_file or (self.output_dir / DEFAULT_RESULT_FILENAME)
         reporter.write_result_json(output, result_path, stdout=stdout)