Problem
AgentV's workspace system has gaps that hurt reproducibility and limit flexibility:
- Setup/teardown scripts are implemented but NOT wired into the orchestrator: `executeWorkspaceSetup`/`executeWorkspaceTeardown` exist in `packages/core/src/evaluation/workspace/script-executor.ts` with full tests, but the orchestrator never calls them. Users can't use `setup_script`/`teardown_script` in YAML configs.
- No per-case workspace overrides: workspace config is target-level only; heterogeneous eval suites (different repos/commits per case, like SWE-bench) aren't possible.
- No workspace reproducibility: no fingerprinting or lockfile to verify workspace state across runs.
- Template copy ignores `.gitignore`: copies everything including `node_modules/`, `dist/`, etc.
- No case metadata passed to scripts: setup scripts can't access per-case info (repo, commit, etc.).
Industry Research
We analyzed workspace patterns across 10+ evaluation frameworks. Here's what the industry does:
| Framework | Workspace Method | Isolation | Reproducibility |
|---|---|---|---|
| SWE-bench | 3-layer Docker images (base→env→instance), `environment_setup_commit` separation | Container | Pre-built image registry, conda env pinning |
| OpenCode-Bench | `mkdtemp` → shallow git clone at pinned commit | Filesystem | Fresh clone per episode, 3 episodes per task |
| terminal-bench | Docker container per trial from compose template | Container + tmux | Docker images + asciinema recordings |
| sniffbench | Pluggable `SandboxManager` with `SandboxConfig` | Container (security-hardened) | Variant containers with baked-in config |
| convex-evals | `mkdtemp` + file writes + dual Convex backend processes | Process-level | Deterministic tasks + downloaded binary |
| Aider | Git repo clone into Docker container | Container | Fixed exercism set, Docker images |
| HumanEval/MBPP | In-memory Python process | Subprocess | No workspace state needed |
Key patterns worth adopting:
- SWE-bench's `environment_setup_commit` (separate commit for deps vs code state)
- sniffbench's pluggable sandbox interface (`SandboxConfig` type system)
- OpenCode-Bench's fresh-clone-per-episode pattern (already covered by our trials system)
- The dominant coding eval pattern: `mkdtemp → clone at commit → setup → agent → git diff → score → cleanup`
Design
Approach: Declarative Workspace Spec (Isolation-Agnostic)
Define what a workspace needs without prescribing how isolation is achieved. The spec supports three modes: simple, scripted, and combined.
1. Workspace Spec Schema
Simple mode (template directory copy; existing behavior, unchanged):

```yaml
workspace:
  template: ./workspace-template
  workspace_type: directory # or vscode-workspace
  env:
    CI: "1"
```

Scripted mode (the setup script creates the workspace, including cloning, deps, etc.):
```yaml
workspace:
  workspace_type: directory
  setup_script:
    script: ["bun", "run", "setup.ts"]
    timeout_ms: 120000
    cwd: ./scripts
  teardown_script:
    script: ["bun", "run", "teardown.ts"]
    timeout_ms: 30000
  env:
    CI: "1"
  resources:
    memory: 512MB
    timeout: 300s
```

Combined mode (template copied first, then the setup script runs):
```yaml
workspace:
  template: ./workspace-template
  setup_script:
    script: ["bun", "run", "setup.ts"]
    timeout_ms: 120000
```

The setup script receives context on stdin:
```json
{
  "workspace_path": "/home/user/.agentv/workspaces/run-123/case-01",
  "eval_case_id": "case-01",
  "eval_run_id": "run-123",
  "case_input": "Implement the add function...",
  "case_metadata": { "repo": "owner/repo", "ref": "abc123" }
}
```

2. Per-Case Workspace Overrides
Target-level workspace config serves as the default. Cases can override:
```yaml
# targets.yaml
targets:
  - name: coding_agent
    provider: cli
    workspace:
      setup_script:
        script: ["bun", "run", "setup.ts"]
      env:
        CI: "1"

# evals/swebench.yaml
cases:
  - id: sympy-20590
    input: "Fix the bug in issue #20590..."
    metadata:
      repo: sympy/sympy
      base_commit: 9aabb237
      env_commit: latest-stable
    workspace: # overrides target
      setup_script:
        script: ["bun", "run", "swebench-setup.ts"]
      env:
        PYTHON_VERSION: "3.9"
  - id: simple-function
    input: "Implement the add function"
    # No workspace override: uses target default
```

Merge strategy:
| Field | Merge behavior |
|---|---|
| `template` | Case replaces target |
| `setup_script` | Case replaces target |
| `teardown_script` | Case replaces target |
| `env` | Deep merge (case extends target) |
| `resources` | Deep merge |
| `workspace_type` | Case replaces target |
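
The merge table above could be implemented roughly as follows. This is a minimal sketch, not AgentV's actual code: the `WorkspaceConfig` and `ScriptConfig` interfaces and the `mergeWorkspaceConfig` name are hypothetical, and it assumes `env` and `resources` are flat string maps, so a one-level spread is equivalent to a deep merge.

```typescript
// Hypothetical shapes: field names mirror the YAML keys in this proposal,
// but these interfaces are illustrative, not AgentV's real types.
interface ScriptConfig {
  script: string[];
  timeout_ms?: number;
  cwd?: string;
}

interface WorkspaceConfig {
  template?: string;
  workspace_type?: string;
  setup_script?: ScriptConfig;
  teardown_script?: ScriptConfig;
  env?: Record<string, string>;
  resources?: Record<string, string>;
}

function mergeWorkspaceConfig(
  target: WorkspaceConfig = {},
  override: WorkspaceConfig = {},
): WorkspaceConfig {
  // template, workspace_type, setup_script, teardown_script:
  // a key present on the case replaces the target's value wholesale.
  const merged: WorkspaceConfig = { ...target, ...override };

  // env and resources: the case extends the target, case keys win on conflict.
  if (target.env || override.env) {
    merged.env = { ...target.env, ...override.env };
  }
  if (target.resources || override.resources) {
    merged.resources = { ...target.resources, ...override.resources };
  }
  return merged;
}

// Usage with the sympy-20590 case from the example above.
const effective = mergeWorkspaceConfig(
  { setup_script: { script: ["bun", "run", "setup.ts"] }, env: { CI: "1" } },
  {
    setup_script: { script: ["bun", "run", "swebench-setup.ts"] },
    env: { PYTHON_VERSION: "3.9" },
  },
);
console.log(JSON.stringify(effective, null, 2));
```

With these inputs the case's `setup_script` replaces the target's, while `env` ends up containing both `CI` and `PYTHON_VERSION`.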
3. Reproducibility
Workspace fingerprint: after setup, hash the workspace state:

```yaml
workspace_fingerprint:
  hash: sha256:abc123...
  source_ref: "owner/repo@abc123"
  setup_script_hash: sha256:def...
```

Workspace lockfile: an optional `workspace.lock` pinning exact versions:
```yaml
sources:
  - repo: owner/repo
    resolved_ref: abc123def456... # Full SHA even if config said "main"
setup_script:
  hash: sha256:...
  output_hash: sha256:...
```

Trials: the existing pass@k/mean/confidence_interval system already provides multi-episode runs with fresh workspaces. No changes needed.
Implementation Plan
4a. Wire setup/teardown into orchestrator (critical gap fix)
- Import `executeWorkspaceSetup`/`executeWorkspaceTeardown` in `orchestrator.ts`
- Call setup after `createTempWorkspace()` and before `initializeBaseline()`
- Call teardown after evaluation completes
- Store `setupOutput`/`teardownOutput` in `EvaluationResult`
4b. Add workspace config parsing
- Extend YAML schema to accept a `workspace:` block at both target and case level
- Implement merge strategy (template/scripts replaced, env/resources deep-merged)
- Parse `setup_script`/`teardown_script` from YAML into `WorkspaceScriptConfig`
4c. Pass case metadata to setup scripts
- Extend `ScriptExecutionContext` to include `caseMetadata` and `caseInput`
- Setup scripts receive full case context on stdin
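
On the script side, consuming that stdin payload might look like the sketch below. The `SetupContext` interface copies the field names from the JSON example in the spec section; `parseSetupContext` and `main` are hypothetical names, and the decision to treat `workspace_path`, `eval_case_id`, and `eval_run_id` as required is an assumption of this sketch.

```typescript
import { text } from "node:stream/consumers";

// Field names follow the stdin JSON example in this proposal.
interface SetupContext {
  workspace_path: string;
  eval_case_id: string;
  eval_run_id: string;
  case_input?: string;
  case_metadata?: Record<string, unknown>;
}

// Parse and minimally validate the context (assumed-required fields only).
function parseSetupContext(raw: string): SetupContext {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("setup context must be a JSON object");
  }
  const ctx = data as Record<string, unknown>;
  for (const key of ["workspace_path", "eval_case_id", "eval_run_id"]) {
    if (typeof ctx[key] !== "string") {
      throw new Error(`missing or non-string field: ${key}`);
    }
  }
  return ctx as unknown as SetupContext;
}

// Entry point a setup.ts could use: read all of stdin, then act on the context.
async function main(): Promise<void> {
  const ctx = parseSetupContext(await text(process.stdin));
  const repo = ctx.case_metadata?.repo;
  // Log to stderr so stdout stays free for whatever the orchestrator captures.
  console.error(
    `preparing ${ctx.workspace_path} for ${ctx.eval_case_id} (repo: ${String(repo ?? "n/a")})`,
  );
  // Cloning, dependency installation, etc. would go here.
}

// A real script would end with:
//   main().catch((err) => { console.error(err); process.exit(1); });
```

Failing fast on a malformed context keeps a misconfigured case from silently running against an empty workspace.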
4d. Workspace fingerprinting
- After setup + baseline init, compute SHA-256 of workspace file tree
- Store in `EvaluationResult.workspaceFingerprint`
- Optionally write `workspace.lock`
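
One way to compute such a fingerprint is sketched below. The proposal only says "SHA-256 of workspace file tree", so the details here are assumptions: entries are visited in sorted order so the result is independent of directory listing order, each file contributes its relative path plus a digest of its contents, `.git` is skipped, and symlinks and empty directories are ignored. `fingerprintWorkspace` is a hypothetical name.

```typescript
import { createHash } from "node:crypto";
import { mkdirSync, mkdtempSync, readdirSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

function fingerprintWorkspace(
  root: string,
  skip: ReadonlySet<string> = new Set([".git"]),
): string {
  const tree = createHash("sha256");

  const walk = (dir: string, rel: string): void => {
    // Sort by name so the hash does not depend on filesystem ordering.
    const entries = readdirSync(dir, { withFileTypes: true })
      .filter((entry) => !skip.has(entry.name))
      .sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));

    for (const entry of entries) {
      const relPath = rel ? `${rel}/${entry.name}` : entry.name;
      const absPath = join(dir, entry.name);
      if (entry.isDirectory()) {
        walk(absPath, relPath);
      } else if (entry.isFile()) {
        // Hash the contents separately so path and content cannot blur together.
        const content = createHash("sha256").update(readFileSync(absPath)).digest("hex");
        tree.update(`${relPath}\0${content}\n`);
      }
    }
  };

  walk(root, "");
  return `sha256:${tree.digest("hex")}`;
}

// Usage: the fingerprint is stable across calls and changes when a file is added.
const workspace = mkdtempSync(join(tmpdir(), "agentv-fp-"));
mkdirSync(join(workspace, "src"));
writeFileSync(join(workspace, "src", "add.ts"), "export const add = (a: number, b: number) => a + b;\n");
const before = fingerprintWorkspace(workspace);
const again = fingerprintWorkspace(workspace);
writeFileSync(join(workspace, "README.md"), "changed\n");
const after = fingerprintWorkspace(workspace);
console.log(before === again, before === after);
```

Reading every file synchronously is fine for a sketch; a large workspace would likely want streaming hashes and a skip list that also covers dependency directories.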
4e. Improve template copy
- Respect `.gitignore` patterns during copy
- Parallel file copy for large templates
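
As a rough illustration of the first item, the sketch below filters a recursive copy using Node's `fs.cpSync` and its `filter` option. It handles only a deliberately simplified subset of `.gitignore`: bare names such as `node_modules/` or `dist`, matched at any depth. Globs, negation (`!`), root-anchored patterns, nested `.gitignore` files, and the directory-only meaning of a trailing slash are not handled, so a real implementation would more likely use a dedicated ignore-matching library. `readIgnoredNames` and `copyTemplate` are hypothetical names.

```typescript
import { cpSync, existsSync, mkdirSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { basename, join } from "node:path";

// Collect bare names from the template's top-level .gitignore.
// Anything containing a glob character or an inner slash is skipped (unsupported here).
function readIgnoredNames(templateDir: string): Set<string> {
  const file = join(templateDir, ".gitignore");
  if (!existsSync(file)) return new Set();
  const names = readFileSync(file, "utf8")
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line !== "" && !line.startsWith("#") && !line.startsWith("!"))
    .map((line) => line.replace(/\/+$/, ""))
    .filter((name) => name !== "" && !/[*?\[\/]/.test(name));
  return new Set(names);
}

function copyTemplate(templateDir: string, dest: string): void {
  const ignored = readIgnoredNames(templateDir);
  cpSync(templateDir, dest, {
    recursive: true,
    // Returning false for a directory skips its whole subtree.
    filter: (source) => source === templateDir || !ignored.has(basename(source)),
  });
}

// Usage: node_modules/ is listed in .gitignore, so only src/ is copied.
const template = mkdtempSync(join(tmpdir(), "agentv-template-"));
const workspaceDir = join(mkdtempSync(join(tmpdir(), "agentv-ws-")), "case-01");
writeFileSync(join(template, ".gitignore"), "node_modules/\ndist/\n");
mkdirSync(join(template, "node_modules"));
writeFileSync(join(template, "node_modules", "dep.js"), "");
mkdirSync(join(template, "src"));
writeFileSync(join(template, "src", "add.ts"), "export {};\n");
copyTemplate(template, workspaceDir);
console.log(existsSync(join(workspaceDir, "src", "add.ts")));
console.log(existsSync(join(workspaceDir, "node_modules")));
```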
Order of operations
1. Copy template (if specified)
2. Execute setup_script (if specified)
3. Initialize git baseline
4. Compute workspace fingerprint
5. Invoke agent
6. Capture file changes
7. Evaluate
8. Execute teardown_script (if specified)
9. Cleanup (conditional)
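
The ordering above can be sketched as a small driver that takes each step as an injected function. The step names are placeholders and do not reflect AgentV's actual function signatures. Running teardown and cleanup in `finally` blocks, so they execute even when an earlier step throws, is an assumption of this sketch; the list itself only fixes the order on the success path.

```typescript
type Step = () => Promise<void>;

// Placeholder step names for this sketch; optional steps mirror the "(if specified)" notes.
interface CaseLifecycle {
  copyTemplate?: Step; // 1. only when `template` is set
  runSetup?: Step; // 2. only when `setup_script` is set
  initBaseline: Step; // 3.
  fingerprint: Step; // 4.
  invokeAgent: Step; // 5.
  captureChanges: Step; // 6.
  evaluate: Step; // 7.
  runTeardown?: Step; // 8. only when `teardown_script` is set
  cleanup?: Step; // 9. conditional
}

async function runCaseLifecycle(steps: CaseLifecycle): Promise<void> {
  try {
    await steps.copyTemplate?.();
    await steps.runSetup?.();
    await steps.initBaseline();
    await steps.fingerprint();
    await steps.invokeAgent();
    await steps.captureChanges();
    await steps.evaluate();
  } finally {
    // Assumption: teardown and cleanup run even if an earlier step failed.
    try {
      await steps.runTeardown?.();
    } finally {
      await steps.cleanup?.();
    }
  }
}

// Usage: record the order in which steps fire for a scripted-mode case (no template).
const order: string[] = [];
const record = (name: string): Step => async () => {
  order.push(name);
};

const finished = runCaseLifecycle({
  runSetup: record("setup"),
  initBaseline: record("baseline"),
  fingerprint: record("fingerprint"),
  invokeAgent: record("agent"),
  captureChanges: record("changes"),
  evaluate: record("evaluate"),
  runTeardown: record("teardown"),
  cleanup: record("cleanup"),
}).then(() => console.log(order.join(" -> ")));
```

Computing the fingerprint after the baseline but before the agent runs means it describes the workspace the agent actually started from.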
Key Files
```
packages/core/src/evaluation/workspace/
├── index.ts # Public exports
├── manager.ts # createTempWorkspace, cleanup*, getWorkspacePath
├── file-changes.ts # initializeBaseline, captureFileChanges
└── script-executor.ts # executeWorkspaceSetup/Teardown (NOT WIRED IN)
packages/core/src/evaluation/
├── orchestrator.ts # Main eval runner; needs setup/teardown integration
├── types.ts # WorkspaceScriptConfig, EvaluationResult
├── baseline.ts # Result trimming (already strips setup/teardown fields)
└── loaders/config-loader.ts # YAML parsing; needs workspace block support
packages/core/test/evaluation/workspace/
├── manager.test.ts # ✅ Tests exist
├── file-changes.test.ts # ✅ Tests exist
└── script-executor.test.ts # ✅ Tests exist
```
What's Deferred
- Built-in Docker/container support: setup scripts handle this
- Pluggable SandboxManager interface: can be added later
- Pre-built image registries: setup scripts can pull images
- Workspace caching/layering: SWE-bench's 3-layer approach is complex; start simple