Skip to content

fix(eval): deterministic per-test sandbox cleanup via owner attribution#237

Open
samcm wants to merge 2 commits into
masterfrom
fix/eval-sandbox-session-reaper
Open

fix(eval): deterministic per-test sandbox cleanup via owner attribution#237
samcm wants to merge 2 commits into
masterfrom
fix/eval-sandbox-session-reaper

Conversation

@samcm

@samcm samcm commented Jun 19, 2026

Copy link
Copy Markdown
Member

The eval leaked sandbox session containers because the provider reaped them per-run by scraping session ids out of the agent transcript with regexes, which was fragile and broke. This replaces that with deterministic, server-authoritative per-test cleanup: pkg/server/api.go authOwnerID now falls back to the caller attribution (X-Panda-On-Behalf-Of) when there's no auth user (backward-compatible — authed → GitHub id, unauthenticated+attribution → the value, neither → ""), so the eval gives each worker process a unique owner id (set as PANDA_ON_BEHALF_OF, inherited by the host serve and passed -e into the docker serve) and, after each test, lists and deletes that worker's sessions by attribution; since tests run serially within a worker this tears down exactly the just-finished test's sandboxes. The short 5m session TTL stays only as a backstop for crashed or missed workers, and authOwnerID's fallback is also a real improvement (caller-scoped sandbox sessions in local mode).

The eval harness leaked sandbox session containers. Each agent run that
calls `panda execute` makes the server create a persistent session
container. The provider tried to clean these up per-run by scraping
session ids out of the agent transcript with regexes
(_SESSION_IN_ARGS/_SESSION_IN_OUTPUT/_LIST_CMD).

That scraping was broken: _SESSION_IN_OUTPUT matched the literal
`session_id<sep><hex>`, but panda actually prints `[session] <id>
(ttl: …)` on the CLI route (pkg/cli/execute.go:113) and `[session]
id=<id> ttl=… …` on the MCP route (pkg/tool/execute_python.go:209).
Neither matches, so a session was only torn down if a later command
reused its id in args. Every single-`execute` run leaked its container;
they piled up to the max_sessions cap and the host tapped out at low
concurrency.

Fix: stop scraping logs and let the server's idle reaper own session
lifecycle. The reaper already destroys idle session containers
(pkg/sandbox/session.go: cleanupLoop -> cleanupExpired ->
cleanupCallback), never reaps a session with an active execution
(activeExecs guard in cleanupExpired), and bumps lastUsed on each use
(unmarkExecuting) so an actively-used multi-turn session is safe.

- _panda_env.py: set a short 5m sandbox.sessions.ttl so leaked
  single-shot sessions self-reap fast (well longer than any inter-turn
  gap, far shorter than the 30m default); lower max_sessions 150 -> 100.
- provider.py: delete the log-scraping teardown, its regexes, the call
  site, the destroyed_sandbox_sessions metadata field, and the now-unused
  _server_base helper / yaml / urllib imports.

purge_sessions (server-shutdown final sweep) is kept.
@github-actions

github-actions Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

🐼 Smoke eval — f65384b: ✅ 6/6 pass

📊 Interactive report — tokens p50 13,543 · tokens/solve 13,723.

Reference points: master@3d7dfaa 100% · fix/eval-sandbox-session-reaper@1f5519b 100%.

question result tokens tools
forky_node_coverage 11,995 2
tracoor_node_coverage 13,274 4
mainnet_block_arrival_p50 15,799 10
list_datasources 11,447 1
block_count_24h 16,009 10
missed_slots_24h 13,812 6
🔭 Langfuse traces (6 runs; ⚠️ = failed)

The report walks this branch's commits against the master baseline and the most recent release. A self-contained copy is in the run's eval-smoke-* artifact.

The eval leaked sandbox session containers. The prior fix leaned on the
server's idle TTL reaper alone — lazy GC that lets a finished test's
sandboxes linger up to the TTL. Replace that with deterministic per-test
cleanup, done server-authoritatively (not by scraping the agent log).

Sessions are already owner-scoped server-side (CreateSession/ListSessions/
DestroySession key on an owner id). The owner is authOwnerID(r): the GitHub
user id, or "" when unauthenticated — so an unauthenticated server (local
mode, the eval) couldn't ask "which sessions are mine".

- server: authOwnerID falls back to the caller attribution
  (X-Panda-On-Behalf-Of) when there's no auth user. Backward-compatible
  (authed -> GitHubID; unauth+attribution -> the value; neither -> "") and
  a genuine improvement: caller-scoped sessions in local mode.
- eval: mint a unique owner id per worker process (set into PANDA_ON_BEHALF_OF
  at import so the host serve inherits it; passed -e into the docker serve).
  Each worker's panda CLI calls now carry that attribution.
- eval: after each test, GET /api/v1/sessions then DELETE each, both carrying
  the worker's attribution header. Workers run tests serially, so this tears
  down exactly the just-finished test's sandboxes. Returns
  destroyed_sandbox_sessions for debugging.
- the short 5m TTL + max_sessions stay as a BACKSTOP for crashed/missed
  workers; comments updated to make primary-vs-backstop clear.

Adds a unit test for authOwnerID's attribution fallback.
@samcm samcm changed the title fix(eval): let server idle reaper own sandbox session lifecycle fix(eval): deterministic per-test sandbox cleanup via owner attribution Jun 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant