Workspace setup: wire setup/teardown scripts, per-case overrides, reproducibility

## Problem

AgentV's workspace system has gaps that hurt reproducibility and limit flexibility:

1. **Setup/teardown scripts are implemented but NOT wired into the orchestrator** — `executeWorkspaceSetup`/`executeWorkspaceTeardown` exist in `packages/core/src/evaluation/workspace/script-executor.ts` with full tests, but the orchestrator never calls them. Users can't use `setup_script`/`teardown_script` in YAML configs.
2. **No per-case workspace overrides** — workspace config is target-level only; heterogeneous eval suites (different repos/commits per case, like SWE-bench) aren't possible.
3. **No workspace reproducibility** — no fingerprinting or lockfile to verify workspace state across runs.
4. **Template copy ignores .gitignore** — copies everything including `node_modules/`, `dist/`, etc.
5. **No case metadata passed to scripts** — setup scripts can't access per-case info (repo, commit, etc.).

## Industry Research

We analyzed workspace patterns across 10+ evaluation frameworks. The table below summarizes a representative subset:

| Framework | Workspace Method | Isolation | Reproducibility |
|-----------|-----------------|-----------|----------------|
| **SWE-bench** | 3-layer Docker images (base→env→instance), `environment_setup_commit` separation | Container | Pre-built image registry, conda env pinning |
| **OpenCode-Bench** | `mkdtemp` → shallow git clone at pinned commit | Filesystem | Fresh clone per episode, 3 episodes per task |
| **terminal-bench** | Docker container per trial from compose template | Container + tmux | Docker images + asciinema recordings |
| **sniffbench** | Pluggable `SandboxManager` with `SandboxConfig` | Container (security-hardened) | Variant containers with baked-in config |
| **convex-evals** | `mkdtemp` + file writes + dual Convex backend processes | Process-level | Deterministic tasks + downloaded binary |
| **Aider** | Git repo clone into Docker container | Container | Fixed exercism set, Docker images |
| **HumanEval/MBPP** | In-memory Python process | Subprocess | No workspace state needed |

**Key patterns worth adopting:**
- SWE-bench's `environment_setup_commit` (separate commit for deps vs code state)
- sniffbench's pluggable sandbox interface (`SandboxConfig` type system)
- OpenCode-Bench's fresh-clone-per-episode pattern (already covered by our trials system)
- The dominant coding eval pattern: `mkdtemp → clone at commit → setup → agent → git diff → score → cleanup`

## Design

### Approach: Declarative Workspace Spec (Isolation-Agnostic)

Define **what** a workspace needs without prescribing **how** isolation is achieved. Three modes:

### 1. Workspace Spec Schema

**Simple mode** — template directory copy (existing behavior, unchanged):
```yaml
workspace:
  template: ./workspace-template
  workspace_type: directory           # or vscode-workspace
  env:
    CI: "1"
```

**Scripted mode** — setup script creates the workspace (including cloning, deps, etc.):
```yaml
workspace:
  workspace_type: directory
  setup_script:
    script: ["bun", "run", "setup.ts"]
    timeout_ms: 120000
    cwd: ./scripts
  teardown_script:
    script: ["bun", "run", "teardown.ts"]
    timeout_ms: 30000
  env:
    CI: "1"
  resources:
    memory: 512MB
    timeout: 300s
```

**Combined mode** — template copied first, then setup script runs:
```yaml
workspace:
  template: ./workspace-template
  setup_script:
    script: ["bun", "run", "setup.ts"]
    timeout_ms: 120000
```

Setup script receives context on stdin:
```json
{
  "workspace_path": "/home/user/.agentv/workspaces/run-123/case-01",
  "eval_case_id": "case-01",
  "eval_run_id": "run-123",
  "case_input": "Implement the add function...",
  "case_metadata": { "repo": "owner/repo", "ref": "abc123" }
}
```
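
On the script side, consuming that payload could look roughly like the sketch below. The field names come from the JSON example above. The `SetupContext` type, the `parseSetupContext` helper, and the choice to treat `case_input` and `case_metadata` as optional are assumptions for illustration, not part of the existing code.

```typescript
// Shape of the stdin payload, mirroring the JSON example above.
interface SetupContext {
  workspace_path: string;
  eval_case_id: string;
  eval_run_id: string;
  case_input: string;
  case_metadata: Record<string, unknown>;
}

// Parse the raw stdin text and check the fields a setup script cannot work without.
// Assumption: case_input and case_metadata may be absent and get empty defaults.
function parseSetupContext(raw: string): SetupContext {
  const data = JSON.parse(raw) as Partial<SetupContext>;
  for (const key of ["workspace_path", "eval_case_id", "eval_run_id"] as const) {
    if (typeof data[key] !== "string" || data[key] === "") {
      throw new Error(`setup context is missing "${key}"`);
    }
  }
  return {
    workspace_path: data.workspace_path as string,
    eval_case_id: data.eval_case_id as string,
    eval_run_id: data.eval_run_id as string,
    case_input: typeof data.case_input === "string" ? data.case_input : "",
    case_metadata: data.case_metadata ?? {},
  };
}
```

A real `setup.ts` would pass the full stdin text to this helper, for example `parseSetupContext(readFileSync(0, "utf8"))` with `readFileSync` from `node:fs`, and then clone `case_metadata.repo` into `workspace_path`.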

### 2. Per-Case Workspace Overrides

Target-level workspace config serves as the default. Cases can override:

```yaml
# targets.yaml
targets:
  - name: coding_agent
    provider: cli
    workspace:
      setup_script:
        script: ["bun", "run", "setup.ts"]
      env:
        CI: "1"

# evals/swebench.yaml
cases:
  - id: sympy-20590
    input: "Fix the bug in issue #20590..."
    metadata:
      repo: sympy/sympy
      base_commit: 9aabb237
      env_commit: latest-stable
    workspace:                      # overrides target
      setup_script:
        script: ["bun", "run", "swebench-setup.ts"]
      env:
        PYTHON_VERSION: "3.9"

  - id: simple-function
    input: "Implement the add function"
    # No workspace override — uses target default
```

**Merge strategy:**

| Field | Merge behavior |
|-------|---------------|
| `template` | Case replaces target |
| `setup_script` | Case replaces target |
| `teardown_script` | Case replaces target |
| `env` | Deep merge (case extends target) |
| `resources` | Deep merge |
| `workspace_type` | Case replaces target |
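
The merge table can be sketched as a small function. The `ScriptConfigSketch` and `WorkspaceConfigSketch` types here are hypothetical stand-ins (the real `WorkspaceScriptConfig` lives in `types.ts` and may differ), and the sketch assumes `env` and `resources` are flat string maps, so a one-level spread is equivalent to a deep merge.

```typescript
// Hypothetical config shapes, derived from the YAML examples in this issue.
interface ScriptConfigSketch {
  script: string[];
  timeout_ms?: number;
  cwd?: string;
}

interface WorkspaceConfigSketch {
  template?: string;
  workspace_type?: string;
  setup_script?: ScriptConfigSketch;
  teardown_script?: ScriptConfigSketch;
  env?: Record<string, string>;
  resources?: Record<string, string>;
}

// Apply the merge table: scalar and script fields are replaced by the case,
// env and resources are merged with case values winning on key conflicts.
function mergeWorkspaceConfig(
  target: WorkspaceConfigSketch = {},
  caseOverride: WorkspaceConfigSketch = {},
): WorkspaceConfigSketch {
  return {
    template: caseOverride.template ?? target.template,
    workspace_type: caseOverride.workspace_type ?? target.workspace_type,
    setup_script: caseOverride.setup_script ?? target.setup_script,
    teardown_script: caseOverride.teardown_script ?? target.teardown_script,
    env: { ...target.env, ...caseOverride.env },
    resources: { ...target.resources, ...caseOverride.resources },
  };
}
```

Applied to the `sympy-20590` case above, this would yield the `swebench-setup.ts` script together with both `CI` and `PYTHON_VERSION` in `env`. Note that a replaced `setup_script` does not inherit `timeout_ms` or `cwd` from the target.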

### 3. Reproducibility

**Workspace fingerprint** — after setup, hash the workspace state:
```yaml
workspace_fingerprint:
  hash: sha256:abc123...
  source_ref: "owner/repo@abc123"
  setup_script_hash: sha256:def...
```
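
One possible way to compute the `hash` field is to hash the sorted list of relative paths together with each file's content digest. This is a minimal sketch under assumptions: the function names are hypothetical, `.git` is skipped because git metadata contains timestamps that vary between runs, and symlinks and file modes are ignored.

```typescript
import { createHash } from "node:crypto";
import { mkdtempSync, readdirSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join, relative, sep } from "node:path";

// Collect workspace-relative file paths (with "/" separators) in a stable order.
// Assumption: ".git" is excluded so the git baseline does not affect the hash.
function listFiles(root: string, dir: string = root): string[] {
  const files: string[] = [];
  for (const entry of readdirSync(dir, { withFileTypes: true })) {
    if (entry.name === ".git") continue;
    const full = join(dir, entry.name);
    if (entry.isDirectory()) {
      files.push(...listFiles(root, full));
    } else if (entry.isFile()) {
      files.push(relative(root, full).split(sep).join("/"));
    }
  }
  return files.sort();
}

// Hash "path NUL content-digest" records so both renames and edits change the result.
function fingerprintWorkspace(root: string): string {
  const tree = createHash("sha256");
  for (const rel of listFiles(root)) {
    const contentHash = createHash("sha256").update(readFileSync(join(root, rel))).digest("hex");
    tree.update(`${rel}\0${contentHash}\n`);
  }
  return `sha256:${tree.digest("hex")}`;
}

// Usage: fingerprint a throwaway directory containing one file.
const demoRoot = mkdtempSync(join(tmpdir(), "agentv-fp-"));
writeFileSync(join(demoRoot, "add.ts"), "export const add = (a: number, b: number) => a + b;\n");
const fingerprint = fingerprintWorkspace(demoRoot);
console.log(fingerprint);
```

Hashing per-file digests rather than raw bytes keeps record boundaries unambiguous. Whether ignored files such as `node_modules/` should count toward the fingerprint is left open here.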

**Workspace lockfile** — optional `workspace.lock` pinning exact versions:
```yaml
sources:
  - repo: owner/repo
    resolved_ref: abc123def456...  # Full SHA even if config said "main"
setup_script:
  hash: sha256:...
  output_hash: sha256:...
```

**Trials** — existing pass@k/mean/confidence_interval system already provides multi-episode with fresh workspaces. No changes needed.

## Implementation Plan

### 4a. Wire setup/teardown into orchestrator (critical gap fix)
- Import `executeWorkspaceSetup`/`executeWorkspaceTeardown` in `orchestrator.ts`
- Call setup after `createTempWorkspace()` and before `initializeBaseline()`
- Call teardown after evaluation completes
- Store `setupOutput`/`teardownOutput` in `EvaluationResult`

### 4b. Add workspace config parsing
- Extend YAML schema to accept `workspace:` block at both target and case level
- Implement merge strategy (template/scripts replaced, env/resources deep-merged)
- Parse `setup_script`/`teardown_script` from YAML into `WorkspaceScriptConfig`

### 4c. Pass case metadata to setup scripts
- Extend `ScriptExecutionContext` to include `caseMetadata` and `caseInput`
- Setup scripts receive full case context on stdin
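
On the orchestrator side, the extension might look like the following sketch. The existing fields of `ScriptExecutionContext` are not shown in this issue, so the `ScriptExecutionContextSketch` shape and the `toStdinPayload` helper are assumptions. Only `caseInput` and `caseMetadata` are named above, and the snake_case keys follow the stdin example in the design section.

```typescript
// Hypothetical view of the execution context after adding the two new fields.
interface ScriptExecutionContextSketch {
  workspacePath: string;
  evalCaseId: string;
  evalRunId: string;
  caseInput?: string; // new
  caseMetadata?: Record<string, unknown>; // new
}

// Serialize the context into the snake_case JSON that scripts read from stdin.
function toStdinPayload(ctx: ScriptExecutionContextSketch): string {
  return JSON.stringify({
    workspace_path: ctx.workspacePath,
    eval_case_id: ctx.evalCaseId,
    eval_run_id: ctx.evalRunId,
    case_input: ctx.caseInput ?? "",
    case_metadata: ctx.caseMetadata ?? {},
  });
}
```

Always emitting `case_input` and `case_metadata`, even when empty, would let setup scripts read them without existence checks.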

### 4d. Workspace fingerprinting
- After setup + baseline init, compute SHA-256 of workspace file tree
- Store in `EvaluationResult.workspaceFingerprint`
- Optionally write `workspace.lock`

### 4e. Improve template copy
- Respect `.gitignore` patterns during copy
- Parallel file copy for large templates
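
A deliberately simplified sketch of an ignore-aware copy follows. It only honors plain names from the template's top-level `.gitignore` (such as `node_modules/` or `dist`) and skips globs, negations, anchored paths, and nested `.gitignore` files; a real implementation would likely use a dedicated gitignore matcher. The function names are hypothetical, and the copy is sequential rather than parallel.

```typescript
import { copyFileSync, existsSync, mkdirSync, mkdtempSync, readdirSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Read plain file or directory names from the top-level .gitignore.
// Lines with globs, "!" negation, or inner slashes are not handled in this sketch.
function readIgnoredNames(templateDir: string): Set<string> {
  const file = join(templateDir, ".gitignore");
  if (!existsSync(file)) return new Set();
  const names = readFileSync(file, "utf8")
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line !== "" && !line.startsWith("#") && !line.startsWith("!"))
    .filter((line) => !/[*?[\]]/.test(line))
    .map((line) => line.replace(/\/+$/, ""))
    .filter((name) => name !== "" && !name.includes("/"));
  return new Set(names);
}

// Recursively copy src into dest, skipping any entry whose name is ignored.
function copyTemplate(src: string, dest: string, ignored: Set<string> = readIgnoredNames(src)): void {
  mkdirSync(dest, { recursive: true });
  for (const entry of readdirSync(src, { withFileTypes: true })) {
    if (ignored.has(entry.name)) continue;
    const from = join(src, entry.name);
    const to = join(dest, entry.name);
    if (entry.isDirectory()) copyTemplate(from, to, ignored);
    else if (entry.isFile()) copyFileSync(from, to);
  }
}

// Usage: build a tiny template, copy it, and check what arrived.
const template = mkdtempSync(join(tmpdir(), "agentv-template-"));
writeFileSync(join(template, ".gitignore"), "node_modules/\n# build output\ndist\n");
mkdirSync(join(template, "node_modules"));
writeFileSync(join(template, "node_modules", "dep.js"), "");
writeFileSync(join(template, "index.ts"), "export {};\n");
const copy = join(mkdtempSync(join(tmpdir(), "agentv-copy-")), "workspace");
copyTemplate(template, copy);
console.log(existsSync(join(copy, "index.ts")), existsSync(join(copy, "node_modules"))); // true false
```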

### Order of operations
```
1. Copy template (if specified)
2. Execute setup_script (if specified)
3. Initialize git baseline
4. Compute workspace fingerprint
5. Invoke agent
6. Capture file changes
7. Evaluate
8. Execute teardown_script (if specified)
9. Cleanup (conditional)
```
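
The sequence above could be wired roughly as in this sketch. Each step is injected as a function so the ordering is visible without depending on the real orchestrator. The `CaseSteps` interface and `runCase` are hypothetical, and running teardown and cleanup in `finally` blocks (so they also run when the agent or evaluation throws) is an assumption that goes beyond what this issue specifies.

```typescript
// Hypothetical step hooks; in the real orchestrator these would map to
// createTempWorkspace, executeWorkspaceSetup, initializeBaseline, and so on.
interface CaseSteps {
  copyTemplate?: () => Promise<void>;
  setup?: () => Promise<void>;
  initializeBaseline: () => Promise<void>;
  fingerprint: () => Promise<string>;
  invokeAgent: () => Promise<void>;
  captureFileChanges: () => Promise<string>;
  evaluate: (diff: string) => Promise<number>;
  teardown?: () => Promise<void>;
  cleanup: () => Promise<void>; // may decide internally whether to keep the workspace
}

// Run steps 1-9 in order. Optional steps are skipped when not configured.
async function runCase(steps: CaseSteps): Promise<{ fingerprint: string; score: number }> {
  try {
    await steps.copyTemplate?.();
    await steps.setup?.();
    await steps.initializeBaseline();
    const fingerprint = await steps.fingerprint();
    await steps.invokeAgent();
    const diff = await steps.captureFileChanges();
    const score = await steps.evaluate(diff);
    return { fingerprint, score };
  } finally {
    // Assumption: teardown runs even on failure, and cleanup runs even if teardown fails.
    try {
      await steps.teardown?.();
    } finally {
      await steps.cleanup();
    }
  }
}
```

Taking the fingerprint after the baseline step but before the agent runs means it describes the starting state only, which is what a lockfile comparison across runs would need.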

## Key Files

```
packages/core/src/evaluation/workspace/
├── index.ts                          # Public exports
├── manager.ts                        # createTempWorkspace, cleanup*, getWorkspacePath
├── file-changes.ts                   # initializeBaseline, captureFileChanges
└── script-executor.ts                # executeWorkspaceSetup/Teardown (NOT WIRED IN)

packages/core/src/evaluation/
├── orchestrator.ts                   # Main eval runner — needs setup/teardown integration
├── types.ts                          # WorkspaceScriptConfig, EvaluationResult
├── baseline.ts                       # Result trimming (already strips setup/teardown fields)
└── loaders/config-loader.ts          # YAML parsing — needs workspace block support

packages/core/test/evaluation/workspace/
├── manager.test.ts                   # ✅ Tests exist
├── file-changes.test.ts              # ✅ Tests exist
└── script-executor.test.ts           # ✅ Tests exist
```

## What's Deferred

- **Built-in Docker/container support**: setup scripts handle this
- **Pluggable SandboxManager interface**: can be added later
- **Pre-built image registries**: setup scripts can pull images
- **Workspace caching/layering**: SWE-bench's 3-layer approach is complex; start simple

