feat: add eval awareness experiments infrastructure for browser-use b…#4

Open
jasmineee-li wants to merge 1 commit into restore-placeholder from claude/plan-eval-experiments-infra-aE5ng

@jasmineee-li

…enchmarks

Implements the full infrastructure for measuring how distinguishable browser-use evaluation environments are from real deployment websites. Adapts eval-awareness detection methods from Needham et al. (2025) to the browser/website setting with HTML, AXTree, and screenshot formats.

Key components:

  • Copied and adapted judges from eval-awareness-testbed (no inspect_ai dep)
  • Browser-adapted P(eval) prompts (5 binary + 5 probability variants)
  • 4 experiment types: P(eval), comparative, transcript awareness, open-ended
  • Playwright scraper for real websites, WorldSim data extractor
  • YAML-config-driven CLI runner with incremental JSONL output
  • Analysis module with aggregation and matplotlib plotting
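
The incremental JSONL output mentioned above could look roughly like the following. This is only a sketch under assumptions: the record fields (`sample_id`, `p_eval`), the `judge` callable, and the function names are hypothetical and are not taken from the PR's code. It shows why appending one line per sample makes an interrupted run resumable.

```python
import json
from pathlib import Path


def load_done_ids(path: Path) -> set:
    """Return the sample ids already recorded in the JSONL file, if it exists."""
    if not path.exists():
        return set()
    done = set()
    with path.open() as fh:
        for line in fh:
            line = line.strip()
            if line:
                done.add(json.loads(line)["sample_id"])
    return done


def run_incremental(samples, judge, out_path: Path) -> int:
    """Score each unseen sample and append one JSON line per result.

    `judge` is a hypothetical callable returning a P(eval) estimate.
    Returns the number of records written during this call.
    """
    done = load_done_ids(out_path)
    written = 0
    with out_path.open("a") as fh:
        for sample in samples:
            if sample["sample_id"] in done:
                continue  # already scored in an earlier run
            record = {"sample_id": sample["sample_id"], "p_eval": judge(sample)}
            fh.write(json.dumps(record) + "\n")
            fh.flush()  # hand each record to the OS before scoring the next sample
            written += 1
    return written
```

Re-running with the same output path skips every sample that already has a line in the file, so a crashed or cancelled run only repeats the work that was not yet written.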

https://claude.ai/code/session_01NRRJgjqCNw79P2iEKNp1Cn

ashtonchew added a commit that referenced this pull request Apr 17, 2026
…ution

Addresses all 13 items from docs/handoffs/codex-handoff-setup-hardening.md
so a fresh host bring-up is one command (bootstrap_ec2.sh) and Phase 4 can
actually run to completion.

Key changes:

- seeding.py: adversarial data_seed template vars now resolve from a
  chained seed_context (prior call responses + site-specific helpers for
  reddit forum/submission lookup via db_connection and map way_id via
  Nominatim). Fixes Phase 4 "0 complied, 0 variant_success, 7 error"
  failure where URLs were sent with literal {placeholder} values.
  (item #13, P0)

- build-webarena-amd64-images.sh (new): build 4 patched amd64 site
  images (shopping, shopping_admin, gitlab, reddit) with env-ctrl
  base_url fallback baked into the source, so the fix survives
  docker compose up --force-recreate. Obsoletes the per-run patcher
  dance. (item #6)

- webarena-compose-override.yml: parameterized by
  WORLDSIM_{ADVERTISE,BIND,DB_BIND}_HOST, binds DB ports for reward
  eval, pins wikipedia to worldsim/webarena-verified-wikipedia:amd64,
  sets WA_ENV_CTRL_EXTERNAL_SITE_URL on every service, and overrides
  map volume names to avoid the docker compose project prefix trap.
  (items #2, #6, #7, #8)

- configs/benchmark_hosts/r5.yaml (new): per-host config consumed by
  bootstrap_ec2.sh + preflight scripts. Captures advertise_host,
  bind_host, DB bind, SG id, region, SSH user.

- bootstrap_ec2.sh: now invokes build-webarena-amd64-images.sh and
  build-wikipedia-amd64.sh as mandatory build steps; idempotent;
  skips when images already present. (items #3, #6)

- patch_webarena_containers.sh: now uses the hardcoded path
  /usr/local/environment_control/ops/sites/<site>.py, because importlib
  find_spec returned None under bare docker exec python3. (item #4)

- setup-map-robust.sh: ensure_empty_volume creates the 3 map volumes
  that S3 archives never populated (tiles, style, website-db). (item #2)

- storage_state_preflight.py (new) + task_reset_cache.py (new):
  Phase 3 preflight catches host-bound storage_state mismatches before
  running; the per-instance reset cache uses _MUTATING_HTTP_METHODS to
  skip reconfigure when no mutation occurred (saves ~10s per task on
  gitlab). (items #9, #10)

- instance_selection.py (new) + agent_config.py: bind_task_to_instance
  now routes to select_task_site_instance, laying the foundation for
  per-task host rewriting. (item #11)

- configure_db_access.sh / deploy_proxy_r5.sh / generate_scale_r5.sh /
  bootstrap_r5.sh / export_host_config_env.py / preflight_*.py:
  one-command host bring-up + security group preflight.
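
For illustration, the reset-cache idea from the task_reset_cache.py item can be sketched as below. Only the name `_MUTATING_HTTP_METHODS` comes from this commit message; its contents, the `ResetCache` class, and its methods are assumptions made for the example, not the actual module.

```python
# Assumed contents: the real set in task_reset_cache.py may differ.
_MUTATING_HTTP_METHODS = frozenset({"POST", "PUT", "PATCH", "DELETE"})


class ResetCache:
    """Hypothetical per-instance tracker of whether a task changed site state."""

    def __init__(self):
        self._dirty = set()

    def record_request(self, instance_id: str, method: str) -> None:
        # Only state-changing requests mark the instance as needing a reset.
        if method.upper() in _MUTATING_HTTP_METHODS:
            self._dirty.add(instance_id)

    def needs_reset(self, instance_id: str) -> bool:
        return instance_id in self._dirty

    def mark_reset(self, instance_id: str) -> None:
        # Call after a reconfigure so the next read-only task can skip it.
        self._dirty.discard(instance_id)
```

Under this design a task that only issues GET requests leaves `needs_reset` false, so the reconfigure step can be skipped for the next task on that instance.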

Tests: 701 passed, 2 skipped. 3 new test modules directly covering the
Phase 4 seed failures (chained response, reddit submission, map way_id).
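
As a rough sketch of the chained seed_context behaviour those tests target: every name below (`chain_context`, `resolve_template`, `UnresolvedPlaceholderError`, and the example URL and keys) is hypothetical and not taken from seeding.py. The point is that a template with an unresolved variable raises instead of being sent with a literal {placeholder}.

```python
import string


class UnresolvedPlaceholderError(ValueError):
    """Raised when a template variable has no value in the seed context."""


def chain_context(seed_context: dict, response: dict) -> dict:
    """Return a new context with a prior call's response merged in."""
    return {**seed_context, **response}


def resolve_template(template: str, seed_context: dict) -> str:
    """Fill {name} placeholders, refusing to emit a literal placeholder.

    Assumes simple field names (no attribute or index lookups).
    """
    fields = [name for _, name, _, _ in string.Formatter().parse(template) if name]
    missing = [name for name in fields if name not in seed_context]
    if missing:
        raise UnresolvedPlaceholderError(f"unresolved placeholders: {missing}")
    return template.format_map(seed_context)


# Usage: each seeding call contributes values that later templates can use.
ctx = chain_context({}, {"forum": "books"})
ctx = chain_context(ctx, {"submission_id": 42})
url = resolve_template("/f/{forum}/{submission_id}", ctx)
```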

Includes the handoff doc itself and the prior orchestrator handoff for
context, plus scripts/login_gitlab_r5.py (one-shot gitlab storage_state
refresh used in the migration session).