feat: add eval awareness experiments infrastructure for browser-use b… by jasmineee-li · Pull Request #4 · jasmineee-li/browser-sim

jasmineee-li · 2026-04-06T23:13:15Z

…enchmarks

Implements the full infrastructure for measuring how distinguishable browser-use evaluation environments are from real deployment websites. Adapts eval-awareness detection methods from Needham et al. (2025) to the browser/website setting with HTML, AXTree, and screenshot formats.

Key components:

- Copied and adapted judges from eval-awareness-testbed (no inspect_ai dep)
- Browser-adapted P(eval) prompts (5 binary + 5 probability variants)
- 4 experiment types: P(eval), comparative, transcript awareness, open-ended
- Playwright scraper for real websites, WorldSim data extractor
- YAML-config-driven CLI runner with incremental JSONL output
- Analysis module with aggregation and matplotlib plotting
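
For reviewers unfamiliar with the "incremental JSONL output" pattern, here is a minimal sketch of what it typically looks like. It is not the code in this PR: `load_done_ids`, `run_incremental`, the `sample_id` and `p_eval` field names, and the `judge` callable are all hypothetical names chosen for illustration. The idea is to append one record per sample and flush immediately, so an interrupted run can resume without re-querying finished samples.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def load_done_ids(path: Path) -> set[str]:
    """Return the sample ids already recorded in the JSONL file, if it exists."""
    if not path.exists():
        return set()
    done: set[str] = set()
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["sample_id"])
            except (json.JSONDecodeError, KeyError):
                # Ignore malformed lines instead of aborting the resume.
                continue
    return done


def run_incremental(
    samples: Iterable[dict],
    judge: Callable[[dict], float],
    out_path: Path,
) -> int:
    """Score each unseen sample and append one JSON record per line.

    Returns the number of records written in this call.
    """
    done = load_done_ids(out_path)
    written = 0
    with out_path.open("a", encoding="utf-8") as fh:
        for sample in samples:
            if sample["sample_id"] in done:
                continue  # already scored in a previous run
            record = {"sample_id": sample["sample_id"], "p_eval": judge(sample)}
            fh.write(json.dumps(record) + "\n")
            fh.flush()  # persist each record as soon as it is produced
            written += 1
    return written
```

Calling `run_incremental` a second time with the same output path only scores samples whose ids are not yet in the file.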

https://claude.ai/code/session_01NRRJgjqCNw79P2iEKNp1Cn


…ution

Addresses all 13 items from docs/handoffs/codex-handoff-setup-hardening.md so a fresh host bring-up is one command (bootstrap_ec2.sh) and Phase 4 can actually run to completion.

Key changes:

- seeding.py: adversarial data_seed template vars now resolve from a chained seed_context (prior call responses + site-specific helpers for reddit forum/submission lookup via db_connection and map way_id via Nominatim). Fixes Phase 4 "0 complied, 0 variant_success, 7 error" failure where URLs were sent with literal {placeholder} values. (item #13, P0)
- build-webarena-amd64-images.sh (new): build 4 patched amd64 site images (shopping, shopping_admin, gitlab, reddit) with env-ctrl base_url fallback baked into the source, so the fix survives docker compose up --force-recreate. Obsoletes the per-run patcher dance. (item #6)
- webarena-compose-override.yml: parameterized by WORLDSIM_{ADVERTISE,BIND,DB_BIND}_HOST, binds DB ports for reward eval, pins wikipedia to worldsim/webarena-verified-wikipedia:amd64, sets WA_ENV_CTRL_EXTERNAL_SITE_URL on every service, and overrides map volume names to avoid the docker compose project prefix trap. (items #2, #6, #7, #8)
- configs/benchmark_hosts/r5.yaml (new): per-host config consumed by bootstrap_ec2.sh + preflight scripts. Captures advertise_host, bind_host, DB bind, SG id, region, SSH user.
- bootstrap_ec2.sh: now invokes build-webarena-amd64-images.sh and build-wikipedia-amd64.sh as mandatory build steps; idempotent; skips when images already present. (items #3, #6)
- patch_webarena_containers.sh: hardcoded path /usr/local/environment_control/ops/sites/<site>.py (importlib find_spec returned None under bare docker exec python3). (item #4)
- setup-map-robust.sh: ensure_empty_volume creates the 3 map volumes that S3 archives never populated (tiles, style, website-db). (item #2)
- storage_state_preflight.py (new) + task_reset_cache.py (new): Phase 3 preflight catches host-bound storage_state mismatches before running; per-instance reset cache uses _MUTATING_HTTP_METHODS to skip reconfigure when no mutation occurred (cuts gitlab's ~10s per task). (items #9, #10)
- instance_selection.py (new) + agent_config.py: bind_task_to_instance now routes to select_task_site_instance, laying the foundation for per-task host rewriting. (item #11)
- configure_db_access.sh / deploy_proxy_r5.sh / generate_scale_r5.sh / bootstrap_r5.sh / export_host_config_env.py / preflight_*.py: one-command host bring-up + security group preflight.

Tests: 701 passed, 2 skipped. 3 new test modules directly covering the Phase 4 seed failures (chained response, reddit submission, map way_id).

Includes the handoff doc itself and the prior orchestrator handoff for context, plus scripts/login_gitlab_r5.py (one-shot gitlab storage_state refresh used in the migration session).
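
To illustrate the seeding.py fix described above (template vars resolved from a chained context, no literal {placeholder} left in a URL), here is a rough sketch. It is not the actual seeding.py code: `chain_context`, `resolve_template`, `UnresolvedPlaceholderError`, the example URL, and the field names are hypothetical, and it assumes placeholders are plain `{name}` fields. The point is to fail loudly before a request is sent if any placeholder has no value.

```python
import string


class UnresolvedPlaceholderError(ValueError):
    """Raised when a template still contains a placeholder with no value."""


def chain_context(*responses: dict) -> dict:
    """Merge prior call responses in order; later values override earlier ones."""
    context: dict = {}
    for response in responses:
        context.update(response)
    return context


def resolve_template(template: str, seed_context: dict) -> str:
    """Fill {name} placeholders from seed_context, refusing partial fills."""
    # string.Formatter().parse yields (literal, field_name, spec, conversion).
    fields = {name for _, name, _, _ in string.Formatter().parse(template) if name}
    missing = sorted(fields - set(seed_context))
    if missing:
        raise UnresolvedPlaceholderError(f"unresolved placeholders: {missing}")
    return template.format(**seed_context)


# Hypothetical usage: two earlier responses feed one URL template.
context = chain_context({"forum": "books"}, {"submission_id": 42})
url = resolve_template("http://example.test/f/{forum}/{submission_id}", context)
print(url)  # http://example.test/f/books/42
```

Raising on a missing key, instead of sending the URL as-is, turns the silent failure mode into an explicit error at seed time.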

jasmineee-li wants to merge 1 commit into restore-placeholder from claude/plan-eval-experiments-infra-aE5ng

2 participants
