Merged
24 changes: 23 additions & 1 deletion README.md
@@ -39,6 +39,12 @@ It simulates what actually happens when you deploy coding agents: multi-turn con

Works against any OpenAI-compatible endpoint — vLLM, TGI, OpenAI, Ollama, LMStudio.

> **Inspired by** Fabian Wesner's [One-Shot Shop Challenge](https://agentic-engineers.dev)
> ([announcement](https://www.linkedin.com/posts/fabian-wesner_oneshotshop-share-7442096217976897536-SRI9/)) —
> the study that showed orchestration architecture beats model choice (Team Mode 85% vs Sub-Agents 57% on the same model).
> Pawbench's orchestration × complexity matrix operationalizes that finding inside a reproducible benchmark.
> See [spec 009](https://github.com/zenprocess/switchyard/blob/main/specs/009-pawbench-orchestration-axis/spec.md).

<br>

## Meet Lola
@@ -84,14 +90,30 @@ pawbench --scenario my_scenario.json

## What It Measures

### 4 Dimensions
### 4 Dimensions + Spec 009 Matrix

| Dimension | Metrics |
|---|---|
| **Throughput** | Single-agent tok/s, parallel saturation curve (1->N), TTFT, peak concurrency |
| **Quality** | Tool call accuracy, instruction following, format compliance, keyword matching |
| **Efficiency** | Useful token ratio (code in tool args vs filler preamble), tokens per turn |
| **Adaptability** | Steering event response, mid-conversation context injection, nudge quality delta |
| **Artifact Quality** *(spec 009)* | Static analysis over changed files (ruff/mypy/radon for Python, generic fallback otherwise). Orthogonal to AC pass. |
| **Complexity Tier** *(spec 009)* | Per-task tagging — `display` / `crud` / `transactional` / `cross_cutting` — with stratified `quality_by_tier` reporting. |
| **Orchestration Shape** *(spec 009)* | Same scenario × 5 shapes (`flat` / `waves` / `scatter-gather` / `team-mode` / `subagents`) → `orchestration_dqs_spread` SLI. |
| **DQS** *(spec 009)* | Composite Dispatch Quality Score v1.0.0 with auditable weights + post-hoc ablation matrix. |

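The stratified `quality_by_tier` reporting named in the table could look roughly like the sketch below. It assumes each task result carries one tier tag and a quality score in [0, 1]. The `TaskResult` class, the function signature, and the choice of a plain mean per tier are illustrative assumptions, not Pawbench's actual implementation; only the four tier names come from the table.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# Tier names are taken from the README table. Everything else here
# (TaskResult, the plain-mean aggregation) is a hypothetical illustration.
TIERS = ("display", "crud", "transactional", "cross_cutting")


@dataclass
class TaskResult:
    task_id: str
    tier: str
    quality: float  # assumed to lie in [0, 1]


def quality_by_tier(results: list[TaskResult]) -> dict[str, float]:
    """Group quality scores by complexity tier and return the mean per tier."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for r in results:
        if r.tier not in TIERS:
            raise ValueError(f"unknown tier: {r.tier!r}")
        buckets[r.tier].append(r.quality)
    # Tiers with no tasks are omitted rather than reported as 0.0.
    return {t: round(mean(buckets[t]), 4) for t in TIERS if t in buckets}


results = [
    TaskResult("t1", "display", 0.9),
    TaskResult("t2", "display", 0.7),
    TaskResult("t3", "crud", 0.6),
    TaskResult("t4", "cross_cutting", 0.3),
]
report = quality_by_tier(results)
print(report)
```

Reporting per tier rather than one global average keeps a strong score on easy `display` tasks from masking a weak score on `cross_cutting` ones.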
### New flags (spec 009)

```bash
pawbench --orchestration flat,waves,scatter-gather,team-mode,subagents
pawbench --ablate quality,format_compliance,tool_accuracy,useful_ratio,steering_rate
pawbench --context-tier manifest-only
pawbench --verification-runs 2
pawbench --no-quality-analysis
```

The orchestration matrix scenario (`pawstyle-orchestration-matrix`) is designed to differentiate shapes — four independent feature blocks, one per complexity tier. Inspired by Fabian Wesner's [One-Shot Shop Challenge](https://agentic-engineers.dev) (orchestration > model). See [switchyard spec 009](https://github.com/zenprocess/switchyard/blob/main/specs/009-pawbench-orchestration-axis/spec.md).

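One way to read the `orchestration_dqs_spread` SLI is as the gap between the best and worst shape on the same scenario. The sketch below assumes that definition (max minus min), which is a guess and is not confirmed by the README; the DQS values are made-up placeholders, not benchmark results. Only the five shape names come from the table above.

```python
# Shape names are taken from the README. The "max - min" definition of the
# spread is an assumption, and the scores below are placeholder values.
SHAPES = ("flat", "waves", "scatter-gather", "team-mode", "subagents")


def orchestration_dqs_spread(dqs_by_shape: dict[str, float]) -> float:
    """Return the gap between the highest and lowest per-shape DQS."""
    missing = [s for s in SHAPES if s not in dqs_by_shape]
    if missing:
        raise ValueError(f"missing shapes: {missing}")
    scores = [dqs_by_shape[s] for s in SHAPES]
    return round(max(scores) - min(scores), 4)


placeholder_scores = {
    "flat": 0.50,
    "waves": 0.60,
    "scatter-gather": 0.55,
    "team-mode": 0.75,
    "subagents": 0.45,
}
spread = orchestration_dqs_spread(placeholder_scores)
print(spread)
```

Under this reading, a spread near zero would mean the scenario failed to differentiate the shapes.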
<br>

150 changes: 150 additions & 0 deletions src/pawbench/ablation.py
@@ -0,0 +1,150 @@
"""Ablation matrix (spec 009 / B7).

The counter-intuitive, publishable result: Fabian's study showed that
quality-focused instructions *decreased* performance and that the fastest
build scored worst. Pawbench has accumulated five scoring/quality
components over time, none of which has ever been measured in isolation.

The ablation runner takes the component scores of a base BenchmarkReport
and recomputes DQS with one component pinned to 1.0 at a time, returning
the delta. Components whose delta is consistently zero (or negative)
across consecutive runs are removal candidates.

This module is **pure**: it operates on already-collected results and
never re-runs the benchmark. That keeps ablation cheap (no extra GPU time)
and deterministic (same inputs → same deltas).
"""

from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any

from pawbench.dqs import DQSBreakdown, compute_dqs

# Components that can be ablated. Each maps to a kwarg of compute_dqs that
# gets pinned to a "neutral" value (1.0 = component contributes its max,
# i.e., it's invisible to the comparison; 0.0 = component is silenced).
# We use 1.0-pinning so the ablation answers "what if this component
# always reported success", which is the meaningful counterfactual: a
# component is dead weight if pinning it to 1.0 leaves the score unchanged.
ABLATABLE_COMPONENTS: dict[str, str] = {
    "format_compliance": "format_compliance",
    "tool_accuracy": "tool_accuracy",
    "useful_ratio": "useful_ratio",
    "steering_rate": "steering_rate",
    "quality": "quality",
}


@dataclass
class AblationDelta:
    """Per-component DQS delta from disabling that component."""

    component: str
    baseline_dqs: float
    ablated_dqs: float
    delta: float
    interpretation: str

    def to_dict(self) -> dict[str, Any]:
        return {
            "component": self.component,
            "baseline_dqs": round(self.baseline_dqs, 4),
            "ablated_dqs": round(self.ablated_dqs, 4),
            "delta": round(self.delta, 4),
            "interpretation": self.interpretation,
        }


@dataclass
class AblationReport:
    """Complete ablation matrix for a single scenario or scenario aggregate."""

    scenario_id: str
    baseline: DQSBreakdown
    deltas: list[AblationDelta] = field(default_factory=list)

    def to_dict(self) -> dict[str, Any]:
        return {
            "scenario_id": self.scenario_id,
            "baseline": self.baseline.to_dict(),
            "deltas": [d.to_dict() for d in self.deltas],
            # Skipped (unknown) components carry a placeholder delta of 0.0,
            # so filter them out rather than flagging them for removal.
            "removal_candidates": [
                d.component
                for d in self.deltas
                if d.delta <= 0.0 and d.component in ABLATABLE_COMPONENTS
            ],
        }


def _interpret(delta: float) -> str:
    """Plain-English read of a delta. Conservative thresholds."""
    if delta > 0.05:
        return "load-bearing — component contributes meaningfully"
    if delta > 0.01:
        return "marginal — small but real contribution"
    if delta > -0.01:
        return "neutral — within noise floor"
    # Not reachable with pin-to-1.0 under the current additive formula
    # (see the ablate() docstring); kept as a guard if the formula changes.
    return "DEAD WEIGHT — score improves when component is silenced"


def ablate(
    *,
    scenario_id: str,
    quality: float,
    format_compliance: float,
    tool_accuracy: float,
    useful_ratio: float,
    steering_rate: float,
    components: list[str] | None = None,
) -> AblationReport:
    """Compute the ablation delta for each requested component.

    `components=None` (or an empty list) means "all ablatable components".
    Unknown components are not scored; each gets a zero-delta entry whose
    interpretation field records that it was skipped.

    Pinning convention: when a component is "ablated", its input is set to
    1.0, i.e., we ask "what if this signal were perfect?". A negative
    delta would mean that making the signal perfect made the score *worse*,
    which is impossible under the current additive formula. What this
    counterfactual surfaces instead is *zero-information* components: if
    pinning one to 1.0 gives the same score as its real value, it is
    contributing nothing in this run.
    """
    base_inputs = {
        "quality": quality,
        "format_compliance": format_compliance,
        "tool_accuracy": tool_accuracy,
        "useful_ratio": useful_ratio,
        "steering_rate": steering_rate,
    }
    baseline = compute_dqs(**base_inputs)

    requested = components if components else list(ABLATABLE_COMPONENTS)
    deltas: list[AblationDelta] = []

    for comp in requested:
        if comp not in ABLATABLE_COMPONENTS:
            deltas.append(
                AblationDelta(
                    component=comp,
                    baseline_dqs=baseline.composite,
                    ablated_dqs=baseline.composite,
                    delta=0.0,
                    interpretation=f"unknown component '{comp}' — skipped",
                )
            )
            continue

        ablated_inputs = dict(base_inputs)
        ablated_inputs[comp] = 1.0  # pin to perfect
        ablated = compute_dqs(**ablated_inputs)
        delta = ablated.composite - baseline.composite
        deltas.append(
            AblationDelta(
                component=comp,
                baseline_dqs=baseline.composite,
                ablated_dqs=ablated.composite,
                delta=delta,
                interpretation=_interpret(delta),
            )
        )

    return AblationReport(scenario_id=scenario_id, baseline=baseline, deltas=deltas)
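
To make the pin-to-1.0 convention concrete, here is a minimal self-contained sketch. Since `pawbench.dqs.compute_dqs` is not part of this diff, it uses a hypothetical stand-in, `toy_dqs`, that is a weighted sum with made-up equal weights. The real weights and formula may differ, and the input scores are placeholders.

```python
# Hypothetical stand-in for pawbench.dqs.compute_dqs: a weighted sum with
# made-up equal weights. The real DQS weights are not shown in this diff.
WEIGHTS = {
    "quality": 0.2,
    "format_compliance": 0.2,
    "tool_accuracy": 0.2,
    "useful_ratio": 0.2,
    "steering_rate": 0.2,
}


def toy_dqs(inputs: dict[str, float]) -> float:
    """Additive composite: sum of weight * component score."""
    return sum(WEIGHTS[name] * value for name, value in inputs.items())


def pin_to_one_deltas(inputs: dict[str, float]) -> dict[str, float]:
    """Change in the toy score when each component is pinned to 1.0 in turn."""
    baseline = toy_dqs(inputs)
    deltas = {}
    for name in inputs:
        pinned = {**inputs, name: 1.0}  # leave the other components untouched
        deltas[name] = round(toy_dqs(pinned) - baseline, 4)
    return deltas


# Placeholder component scores, not measured results.
inputs = {
    "quality": 0.5,
    "format_compliance": 1.0,
    "tool_accuracy": 0.9,
    "useful_ratio": 0.75,
    "steering_rate": 1.0,
}
deltas = pin_to_one_deltas(inputs)
print(deltas)
```

With an additive formula like this one, each delta equals `weight * (1 - value)`, so it can never be negative. Components already reporting 1.0 come out with a delta of exactly zero, which is the case the `delta <= 0.0` check in `removal_candidates` picks up.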