fix(eval): deterministic per-test sandbox cleanup via owner attribution#237
Open
samcm wants to merge 2 commits into
Open
fix(eval): deterministic per-test sandbox cleanup via owner attribution#237samcm wants to merge 2 commits into
samcm wants to merge 2 commits into
Conversation
The eval harness leaked sandbox session containers. Each agent run that calls `panda execute` makes the server create a persistent session container. The provider tried to clean these up per-run by scraping session ids out of the agent transcript with regexes (_SESSION_IN_ARGS/_SESSION_IN_OUTPUT/_LIST_CMD). That scraping was broken: _SESSION_IN_OUTPUT matched the literal `session_id<sep><hex>`, but panda actually prints `[session] <id> (ttl: …)` on the CLI route (pkg/cli/execute.go:113) and `[session] id=<id> ttl=… …` on the MCP route (pkg/tool/execute_python.go:209). Neither matches, so a session was only torn down if a later command reused its id in args. Every single-`execute` run leaked its container; they piled up to the max_sessions cap and the host tapped out at low concurrency. Fix: stop scraping logs and let the server's idle reaper own session lifecycle. The reaper already destroys idle session containers (pkg/sandbox/session.go: cleanupLoop -> cleanupExpired -> cleanupCallback), never reaps a session with an active execution (activeExecs guard in cleanupExpired), and bumps lastUsed on each use (unmarkExecuting) so an actively-used multi-turn session is safe. - _panda_env.py: set a short 5m sandbox.sessions.ttl so leaked single-shot sessions self-reap fast (well longer than any inter-turn gap, far shorter than the 30m default); lower max_sessions 150 -> 100. - provider.py: delete the log-scraping teardown, its regexes, the call site, the destroyed_sandbox_sessions metadata field, and the now-unused _server_base helper / yaml / urllib imports. purge_sessions (server-shutdown final sweep) is kept.
Contributor
🐼 Smoke eval —
|
| question | result | tokens | tools |
|---|---|---|---|
forky_node_coverage |
✅ | 11,995 | 2 |
tracoor_node_coverage |
✅ | 13,274 | 4 |
mainnet_block_arrival_p50 |
✅ | 15,799 | 10 |
list_datasources |
✅ | 11,447 | 1 |
block_count_24h |
✅ | 16,009 | 10 |
missed_slots_24h |
✅ | 13,812 | 6 |
🔭 Langfuse traces (6 runs; ⚠️ = failed)
The report walks this branch's commits against the master baseline and the most recent release. A self-contained copy is in the run's eval-smoke-* artifact.
The eval leaked sandbox session containers. The prior fix leaned on the server's idle TTL reaper alone — lazy GC that lets a finished test's sandboxes linger up to the TTL. Replace that with deterministic per-test cleanup, done server-authoritatively (not by scraping the agent log). Sessions are already owner-scoped server-side (CreateSession/ListSessions/ DestroySession key on an owner id). The owner is authOwnerID(r): the GitHub user id, or "" when unauthenticated — so an unauthenticated server (local mode, the eval) couldn't ask "which sessions are mine". - server: authOwnerID falls back to the caller attribution (X-Panda-On-Behalf-Of) when there's no auth user. Backward-compatible (authed -> GitHubID; unauth+attribution -> the value; neither -> "") and a genuine improvement: caller-scoped sessions in local mode. - eval: mint a unique owner id per worker process (set into PANDA_ON_BEHALF_OF at import so the host serve inherits it; passed -e into the docker serve). Each worker's panda CLI calls now carry that attribution. - eval: after each test, GET /api/v1/sessions then DELETE each, both carrying the worker's attribution header. Workers run tests serially, so this tears down exactly the just-finished test's sandboxes. Returns destroyed_sandbox_sessions for debugging. - the short 5m TTL + max_sessions stay as a BACKSTOP for crashed/missed workers; comments updated to make primary-vs-backstop clear. Adds a unit test for authOwnerID's attribution fallback.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The eval leaked sandbox session containers because the provider reaped them per-run by scraping session ids out of the agent transcript with regexes, which was fragile and broke. This replaces that with deterministic, server-authoritative per-test cleanup:
pkg/server/api.goauthOwnerIDnow falls back to the caller attribution (X-Panda-On-Behalf-Of) when there's no auth user (backward-compatible — authed → GitHub id, unauthenticated+attribution → the value, neither → ""), so the eval gives each worker process a unique owner id (set asPANDA_ON_BEHALF_OF, inherited by the host serve and passed-einto the docker serve) and, after each test, lists and deletes that worker's sessions by attribution; since tests run serially within a worker this tears down exactly the just-finished test's sandboxes. The short 5m session TTL stays only as a backstop for crashed or missed workers, andauthOwnerID's fallback is also a real improvement (caller-scoped sandbox sessions in local mode).