fix(eval): deterministic per-test sandbox cleanup via owner attribution by samcm · Pull Request #237 · ethpandaops/panda

samcm · 2026-06-19T05:36:18Z

The eval leaked sandbox session containers because the provider reaped them per-run by scraping session ids out of the agent transcript with regexes, which was fragile and broke. This replaces that with deterministic, server-authoritative per-test cleanup: pkg/server/api.go authOwnerID now falls back to the caller attribution (X-Panda-On-Behalf-Of) when there's no auth user (backward-compatible — authed → GitHub id, unauthenticated+attribution → the value, neither → ""), so the eval gives each worker process a unique owner id (set as PANDA_ON_BEHALF_OF, inherited by the host serve and passed -e into the docker serve) and, after each test, lists and deletes that worker's sessions by attribution; since tests run serially within a worker this tears down exactly the just-finished test's sandboxes. The short 5m session TTL stays only as a backstop for crashed or missed workers, and authOwnerID's fallback is also a real improvement (caller-scoped sandbox sessions in local mode).

The eval harness leaked sandbox session containers. Each agent run that calls `panda execute` makes the server create a persistent session container. The provider tried to clean these up per-run by scraping session ids out of the agent transcript with regexes (_SESSION_IN_ARGS/_SESSION_IN_OUTPUT/_LIST_CMD). That scraping was broken: _SESSION_IN_OUTPUT matched the literal `session_id<sep><hex>`, but panda actually prints `[session] <id> (ttl: …)` on the CLI route (pkg/cli/execute.go:113) and `[session] id=<id> ttl=… …` on the MCP route (pkg/tool/execute_python.go:209). Neither matches, so a session was only torn down if a later command reused its id in args. Every single-`execute` run leaked its container; they piled up to the max_sessions cap and the host tapped out at low concurrency. Fix: stop scraping logs and let the server's idle reaper own session lifecycle. The reaper already destroys idle session containers (pkg/sandbox/session.go: cleanupLoop -> cleanupExpired -> cleanupCallback), never reaps a session with an active execution (activeExecs guard in cleanupExpired), and bumps lastUsed on each use (unmarkExecuting) so an actively-used multi-turn session is safe. - _panda_env.py: set a short 5m sandbox.sessions.ttl so leaked single-shot sessions self-reap fast (well longer than any inter-turn gap, far shorter than the 30m default); lower max_sessions 150 -> 100. - provider.py: delete the log-scraping teardown, its regexes, the call site, the destroyed_sandbox_sessions metadata field, and the now-unused _server_base helper / yaml / urllib imports. purge_sessions (server-shutdown final sweep) is kept.

github-actions · 2026-06-19T05:39:43Z

🐼 Smoke eval — `f65384b`: ✅ 6/6 pass

📊 Interactive report — tokens p50 13,543 · tokens/solve 13,723.

Reference points: master@3d7dfaa 100% · fix/eval-sandbox-session-reaper@1f5519b 100%.

question	result	tokens	tools
`forky_node_coverage`	✅	11,995	2
`tracoor_node_coverage`	✅	13,274	4
`mainnet_block_arrival_p50`	✅	15,799	10
`list_datasources`	✅	11,447	1
`block_count_24h`	✅	16,009	10
`missed_slots_24h`	✅	13,812	6

🔭 Langfuse traces (6 runs; ⚠️ = failed)

_{The report walks this branch's commits against the master baseline and the most recent release. A self-contained copy is in the run's eval-smoke-* artifact.}

The eval leaked sandbox session containers. The prior fix leaned on the server's idle TTL reaper alone — lazy GC that lets a finished test's sandboxes linger up to the TTL. Replace that with deterministic per-test cleanup, done server-authoritatively (not by scraping the agent log). Sessions are already owner-scoped server-side (CreateSession/ListSessions/ DestroySession key on an owner id). The owner is authOwnerID(r): the GitHub user id, or "" when unauthenticated — so an unauthenticated server (local mode, the eval) couldn't ask "which sessions are mine". - server: authOwnerID falls back to the caller attribution (X-Panda-On-Behalf-Of) when there's no auth user. Backward-compatible (authed -> GitHubID; unauth+attribution -> the value; neither -> "") and a genuine improvement: caller-scoped sessions in local mode. - eval: mint a unique owner id per worker process (set into PANDA_ON_BEHALF_OF at import so the host serve inherits it; passed -e into the docker serve). Each worker's panda CLI calls now carry that attribution. - eval: after each test, GET /api/v1/sessions then DELETE each, both carrying the worker's attribution header. Workers run tests serially, so this tears down exactly the just-finished test's sandboxes. Returns destroyed_sandbox_sessions for debugging. - the short 5m TTL + max_sessions stay as a BACKSTOP for crashed/missed workers; comments updated to make primary-vs-backstop clear. Adds a unit test for authOwnerID's attribution fallback.

samcm changed the title ~~fix(eval): let server idle reaper own sandbox session lifecycle~~ fix(eval): deterministic per-test sandbox cleanup via owner attribution Jun 19, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(eval): deterministic per-test sandbox cleanup via owner attribution#237

fix(eval): deterministic per-test sandbox cleanup via owner attribution#237
samcm wants to merge 2 commits into
masterfrom
fix/eval-sandbox-session-reaper

samcm commented Jun 19, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 19, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

samcm commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🐼 Smoke eval — f65384b: ✅ 6/6 pass

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

samcm commented Jun 19, 2026 •

edited

Loading

github-actions Bot commented Jun 19, 2026 •

edited

Loading

🐼 Smoke eval — `f65384b`: ✅ 6/6 pass