Workspace setup: wire setup/teardown scripts, per-case overrides, reproducibility #221

Problem

AgentV's workspace system has gaps that hurt reproducibility and limit flexibility:

  1. Setup/teardown scripts are implemented but NOT wired into the orchestrator — executeWorkspaceSetup/executeWorkspaceTeardown exist in packages/core/src/evaluation/workspace/script-executor.ts with full tests, but the orchestrator never calls them. Users can't use setup_script/teardown_script in YAML configs.
  2. No per-case workspace overrides — workspace config is target-level only; heterogeneous eval suites (different repos/commits per case, like SWE-bench) aren't possible.
  3. No workspace reproducibility — no fingerprinting or lockfile to verify workspace state across runs.
  4. Template copy ignores .gitignore — copies everything including node_modules/, dist/, etc.
  5. No case metadata passed to scripts — setup scripts can't access per-case info (repo, commit, etc.).

Industry Research

We analyzed workspace patterns across 10+ evaluation frameworks. Here's what the industry does:

| Framework | Workspace Method | Isolation | Reproducibility |
| --- | --- | --- | --- |
| SWE-bench | 3-layer Docker images (base→env→instance), environment_setup_commit separation | Container | Pre-built image registry, conda env pinning |
| OpenCode-Bench | mkdtemp → shallow git clone at pinned commit | Filesystem | Fresh clone per episode, 3 episodes per task |
| terminal-bench | Docker container per trial from compose template | Container + tmux | Docker images + asciinema recordings |
| sniffbench | Pluggable SandboxManager with SandboxConfig | Container (security-hardened) | Variant containers with baked-in config |
| convex-evals | mkdtemp + file writes + dual Convex backend processes | Process-level | Deterministic tasks + downloaded binary |
| Aider | Git repo clone into Docker container | Container | Fixed exercism set, Docker images |
| HumanEval/MBPP | In-memory Python process | Subprocess | No workspace state needed |

Key patterns worth adopting:

  • SWE-bench's environment_setup_commit (separate commit for deps vs code state)
  • sniffbench's pluggable sandbox interface (SandboxConfig type system)
  • OpenCode-Bench's fresh-clone-per-episode pattern (already covered by our trials system)
  • The dominant coding eval pattern: mkdtemp → clone at commit → setup → agent → git diff → score → cleanup

Design

Approach: Declarative Workspace Spec (Isolation-Agnostic)

Define what a workspace needs without prescribing how isolation is achieved. Two base modes, which can also be combined:

1. Workspace Spec Schema

Simple mode — template directory copy (existing behavior, unchanged):

workspace:
  template: ./workspace-template
  workspace_type: directory           # or vscode-workspace
  env:
    CI: "1"

Scripted mode — setup script creates the workspace (including cloning, deps, etc.):

workspace:
  workspace_type: directory
  setup_script:
    script: ["bun", "run", "setup.ts"]
    timeout_ms: 120000
    cwd: ./scripts
  teardown_script:
    script: ["bun", "run", "teardown.ts"]
    timeout_ms: 30000
  env:
    CI: "1"
  resources:
    memory: 512MB
    timeout: 300s

Combined mode — template copied first, then setup script runs:

workspace:
  template: ./workspace-template
  setup_script:
    script: ["bun", "run", "setup.ts"]
    timeout_ms: 120000

Setup script receives context on stdin:

{
  "workspace_path": "/home/user/.agentv/workspaces/run-123/case-01",
  "eval_case_id": "case-01",
  "eval_run_id": "run-123",
  "case_input": "Implement the add function...",
  "case_metadata": { "repo": "owner/repo", "ref": "abc123" }
}
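
For illustration, a setup script might consume this context roughly as follows. This is a sketch in TypeScript under Bun; the field names mirror the JSON above, while the clone-and-install steps are just one possible setup, not a prescribed API:

// setup.ts (sketch): read the workspace context from stdin, clone the case's
// repo at the pinned ref into the workspace, and install dependencies.
import { $ } from "bun";

interface SetupContext {
  workspace_path: string;
  eval_case_id: string;
  eval_run_id: string;
  case_input: string;
  case_metadata?: { repo?: string; ref?: string };
}

const context: SetupContext = JSON.parse(await Bun.stdin.text());
const { repo, ref } = context.case_metadata ?? {};

if (repo && ref) {
  // Clone and pin to the exact commit the case asks for.
  await $`git clone https://github.com/${repo} ${context.workspace_path}`;
  await $`git -C ${context.workspace_path} checkout ${ref}`;
}

// Install dependencies inside the freshly prepared workspace.
await $`bun install`.cwd(context.workspace_path);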

2. Per-Case Workspace Overrides

Target-level workspace config serves as the default. Cases can override:

# targets.yaml
targets:
  - name: coding_agent
    provider: cli
    workspace:
      setup_script:
        script: ["bun", "run", "setup.ts"]
      env:
        CI: "1"

# evals/swebench.yaml
cases:
  - id: sympy-20590
    input: "Fix the bug in issue #20590..."
    metadata:
      repo: sympy/sympy
      base_commit: 9aabb237
      env_commit: latest-stable
    workspace:                      # overrides target
      setup_script:
        script: ["bun", "run", "swebench-setup.ts"]
      env:
        PYTHON_VERSION: "3.9"

  - id: simple-function
    input: "Implement the add function"
    # No workspace override — uses target default

Merge strategy:

| Field | Merge behavior |
| --- | --- |
| template | Case replaces target |
| setup_script | Case replaces target |
| teardown_script | Case replaces target |
| env | Deep merge (case extends target) |
| resources | Deep merge |
| workspace_type | Case replaces target |
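
A sketch of that merge, assuming a WorkspaceConfig shape that mirrors the YAML above (the type and function names here are illustrative stand-ins, not the existing AgentV types):

// Illustrative shapes; the real WorkspaceScriptConfig lives in packages/core/src/evaluation/types.ts.
interface WorkspaceScriptConfig {
  script: string[];
  timeout_ms?: number;
  cwd?: string;
}

interface WorkspaceConfig {
  template?: string;
  workspace_type?: "directory" | "vscode-workspace";
  setup_script?: WorkspaceScriptConfig;
  teardown_script?: WorkspaceScriptConfig;
  env?: Record<string, string>;
  resources?: Record<string, string>;
}

// Case-level config overrides the target-level default: template, scripts, and
// workspace_type are replaced wholesale; env and resources are merged key by key.
function mergeWorkspaceConfig(
  target: WorkspaceConfig = {},
  caseOverride: WorkspaceConfig = {},
): WorkspaceConfig {
  return {
    template: caseOverride.template ?? target.template,
    workspace_type: caseOverride.workspace_type ?? target.workspace_type,
    setup_script: caseOverride.setup_script ?? target.setup_script,
    teardown_script: caseOverride.teardown_script ?? target.teardown_script,
    env: { ...target.env, ...caseOverride.env },
    resources: { ...target.resources, ...caseOverride.resources },
  };
}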

3. Reproducibility

Workspace fingerprint — after setup, hash the workspace state:

workspace_fingerprint:
  hash: sha256:abc123...
  source_ref: "owner/repo@abc123"
  setup_script_hash: sha256:def..
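
A sketch of how that hash could be computed over the workspace file tree (plain Node fs/crypto APIs; the function name and the choice to skip the .git directory are assumptions):

import { createHash } from "node:crypto";
import { readdir, readFile } from "node:fs/promises";
import { join, relative } from "node:path";

// Recursively list files, skipping the .git directory created by the baseline step.
async function listFiles(dir: string): Promise<string[]> {
  const files: string[] = [];
  for (const entry of await readdir(dir, { withFileTypes: true })) {
    const full = join(dir, entry.name);
    if (entry.isDirectory()) {
      if (entry.name === ".git") continue;
      files.push(...(await listFiles(full)));
    } else if (entry.isFile()) {
      files.push(full);
    }
  }
  return files;
}

// Hash paths and contents in sorted order so identical workspace states
// always produce the same fingerprint.
async function fingerprintWorkspace(root: string): Promise<string> {
  const hash = createHash("sha256");
  for (const file of (await listFiles(root)).sort()) {
    hash.update(relative(root, file));
    hash.update(await readFile(file));
  }
  return `sha256:${hash.digest("hex")}`;
}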

Workspace lockfile — optional workspace.lock pinning exact versions:

sources:
  - repo: owner/repo
    resolved_ref: abc123def456...  # Full SHA even if config said "main"
setup_script:
  hash: sha256:...
  output_hash: sha256:...
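
In TypeScript terms, the lockfile contents could map to a shape like this (field names mirror the YAML above and are illustrative):

// Sketch of a workspace.lock shape; written after setup, compared on later runs.
interface WorkspaceLockfile {
  sources: Array<{
    repo: string;          // e.g. "owner/repo"
    resolved_ref: string;  // full SHA, even if the config said "main"
  }>;
  setup_script?: {
    hash: string;          // sha256 of the setup script itself
    output_hash: string;   // sha256 of the workspace tree after setup
  };
}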

Trials — the existing pass@k/mean/confidence_interval system already provides multi-episode runs with fresh workspaces. No changes needed.

Implementation Plan

4a. Wire setup/teardown into orchestrator (critical gap fix)

  • Import executeWorkspaceSetup/executeWorkspaceTeardown in orchestrator.ts
  • Call setup after createTempWorkspace() and before initializeBaseline()
  • Call teardown after evaluation completes
  • Store setupOutput/teardownOutput in EvaluationResult
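
A minimal sketch of that wiring (the imported function names exist in the workspace module, but their signatures, the local stand-in types, and the surrounding runCase structure are assumptions):

// orchestrator.ts (sketch)
import { executeWorkspaceSetup, executeWorkspaceTeardown } from "./workspace/script-executor";
import { createTempWorkspace } from "./workspace/manager";
import { initializeBaseline } from "./workspace/file-changes";

// Stand-ins for the real types in types.ts.
type WorkspaceScriptConfig = { script: string[]; timeout_ms?: number; cwd?: string };
type WorkspaceConfig = { setup_script?: WorkspaceScriptConfig; teardown_script?: WorkspaceScriptConfig };
type EvalCase = { id: string; input: string; metadata?: Record<string, unknown> };

async function runCase(runId: string, evalCase: EvalCase, workspace: WorkspaceConfig) {
  const workspacePath = await createTempWorkspace(runId, evalCase.id);
  const scriptContext = {
    workspacePath,
    evalCaseId: evalCase.id,
    evalRunId: runId,
    caseInput: evalCase.input,
    caseMetadata: evalCase.metadata,
  };

  // Setup runs after workspace creation and before the git baseline.
  const setupOutput = workspace.setup_script
    ? await executeWorkspaceSetup(workspace.setup_script, scriptContext)
    : undefined;

  await initializeBaseline(workspacePath);
  try {
    // ... invoke agent, capture file changes, evaluate ...
  } finally {
    // Teardown runs after evaluation; setupOutput/teardownOutput are then
    // stored on the EvaluationResult before one of manager.ts's cleanup*
    // helpers removes the workspace (conditionally).
    const teardownOutput = workspace.teardown_script
      ? await executeWorkspaceTeardown(workspace.teardown_script, scriptContext)
      : undefined;
  }
}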

4b. Add workspace config parsing

  • Extend YAML schema to accept workspace: block at both target and case level
  • Implement merge strategy (template/scripts replaced, env/resources deep-merged)
  • Parse setup_script/teardown_script from YAML into WorkspaceScriptConfig

4c. Pass case metadata to setup scripts

  • Extend ScriptExecutionContext to include caseMetadata and caseInput
  • Setup scripts receive full case context on stdin
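
The context extension could look roughly like this (only the new fields are shown; the names mirror the stdin payload in the design section and are assumptions):

// Sketch of the fields added to ScriptExecutionContext in script-executor.ts.
interface ScriptExecutionContextAdditions {
  // Forwarded to the script as case_input on stdin.
  caseInput?: string;
  // Arbitrary per-case metadata (repo, base_commit, ...), forwarded as case_metadata.
  caseMetadata?: Record<string, unknown>;
}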

4d. Workspace fingerprinting

  • After setup + baseline init, compute SHA-256 of workspace file tree
  • Store in EvaluationResult.workspaceFingerprint
  • Optionally write workspace.lock

4e. Improve template copy

  • Respect .gitignore patterns during copy
  • Parallel file copy for large templates
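
A sketch of a .gitignore-aware template copy, assuming the `ignore` npm package is an acceptable dependency (the function name and the decision to read only the template's top-level .gitignore are assumptions):

import ignore from "ignore";
import { existsSync } from "node:fs";
import { cp, readFile } from "node:fs/promises";
import { join, relative, sep } from "node:path";

// Copy the template into the workspace, skipping anything the template's
// .gitignore would exclude (node_modules/, dist/, ...).
async function copyTemplate(templateDir: string, workspacePath: string): Promise<void> {
  const ig = ignore();
  const gitignorePath = join(templateDir, ".gitignore");
  if (existsSync(gitignorePath)) {
    ig.add(await readFile(gitignorePath, "utf8"));
  }

  await cp(templateDir, workspacePath, {
    recursive: true,
    filter: (src) => {
      const rel = relative(templateDir, src);
      if (rel === "") return true; // always copy the template root itself
      const posix = rel.split(sep).join("/"); // ignore() expects posix-style paths
      // Check both file and directory forms so patterns like "node_modules/" match.
      return !ig.ignores(posix) && !ig.ignores(`${posix}/`);
    },
  });
}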

Order of operations

1. Copy template (if specified)
2. Execute setup_script (if specified)
3. Initialize git baseline
4. Compute workspace fingerprint
5. Invoke agent
6. Capture file changes
7. Evaluate
8. Execute teardown_script (if specified)
9. Cleanup (conditional)

Key Files

packages/core/src/evaluation/workspace/
├── index.ts                          # Public exports
├── manager.ts                        # createTempWorkspace, cleanup*, getWorkspacePath
├── file-changes.ts                   # initializeBaseline, captureFileChanges
└── script-executor.ts                # executeWorkspaceSetup/Teardown (NOT WIRED IN)

packages/core/src/evaluation/
├── orchestrator.ts                   # Main eval runner — needs setup/teardown integration
├── types.ts                          # WorkspaceScriptConfig, EvaluationResult
├── baseline.ts                       # Result trimming (already strips setup/teardown fields)
└── loaders/config-loader.ts          # YAML parsing — needs workspace block support

packages/core/test/evaluation/workspace/
├── manager.test.ts                   # ✅ Tests exist
├── file-changes.test.ts              # ✅ Tests exist
└── script-executor.test.ts           # ✅ Tests exist

What's Deferred

  • Built-in Docker/container support: setup scripts handle this
  • Pluggable SandboxManager interface: can be added later
  • Pre-built image registries: setup scripts can pull images
  • Workspace caching/layering: SWE-bench's 3-layer approach is complex; start simple
