feat: add eval awareness experiments infrastructure for browser-use b…#4
Open
jasmineee-li wants to merge 1 commit into restore-placeholder
…enchmarks

Implements the full infrastructure for measuring how distinguishable browser-use evaluation environments are from real deployment websites. Adapts eval-awareness detection methods from Needham et al. (2025) to the browser/website setting with HTML, AXTree, and screenshot formats.

Key components:
- Copied and adapted judges from eval-awareness-testbed (no inspect_ai dep)
- Browser-adapted P(eval) prompts (5 binary + 5 probability variants)
- 4 experiment types: P(eval), comparative, transcript awareness, open-ended
- Playwright scraper for real websites, WorldSim data extractor
- YAML-config-driven CLI runner with incremental JSONL output
- Analysis module with aggregation and matplotlib plotting

https://claude.ai/code/session_01NRRJgjqCNw79P2iEKNp1Cn
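To make the "incremental JSONL output" item concrete, here is a minimal sketch of how a runner might append one result per line and skip samples that were already written on a previous run. It is not the code from this PR: the names `load_done_ids`, `append_result`, `run_experiment`, and the `sample_id` / `p_eval` fields are hypothetical, and the judge is any callable that returns a score.

```python
import json
from pathlib import Path


def load_done_ids(path: Path) -> set:
    """Return the sample ids already present in the JSONL file (empty if absent)."""
    if not path.exists():
        return set()
    done = set()
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines
                done.add(json.loads(line)["sample_id"])
    return done


def append_result(path: Path, record: dict) -> None:
    """Append one result as a single JSON line and flush it immediately."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
        fh.flush()


def run_experiment(samples, judge, path: Path) -> None:
    """Score each sample once; results from earlier runs are not recomputed."""
    done = load_done_ids(path)
    for sample in samples:
        if sample["sample_id"] in done:
            continue  # already written by a previous (possibly interrupted) run
        record = {"sample_id": sample["sample_id"], "p_eval": judge(sample)}
        append_result(path, record)
        done.add(sample["sample_id"])
```

Because every record is written as soon as it is produced, an interrupted run loses at most the sample in flight, and rerunning the same command resumes where it stopped.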
ashtonchew added a commit that referenced this pull request on Apr 17, 2026
…ution
Addresses all 13 items from docs/handoffs/codex-handoff-setup-hardening.md
so a fresh host bring-up is one command (bootstrap_ec2.sh) and Phase 4 can
actually run to completion.
Key changes:
- seeding.py: adversarial data_seed template vars now resolve from a
chained seed_context (prior call responses + site-specific helpers for
reddit forum/submission lookup via db_connection and map way_id via
Nominatim). Fixes Phase 4 "0 complied, 0 variant_success, 7 error"
failure where URLs were sent with literal {placeholder} values.
(item #13, P0)
- build-webarena-amd64-images.sh (new): build 4 patched amd64 site
images (shopping, shopping_admin, gitlab, reddit) with env-ctrl
base_url fallback baked into the source, so the fix survives
docker compose up --force-recreate. Obsoletes the per-run patcher
dance. (item #6)
- webarena-compose-override.yml: parameterized by
WORLDSIM_{ADVERTISE,BIND,DB_BIND}_HOST, binds DB ports for reward
eval, pins wikipedia to worldsim/webarena-verified-wikipedia:amd64,
sets WA_ENV_CTRL_EXTERNAL_SITE_URL on every service, and overrides
map volume names to avoid the docker compose project prefix trap.
(items #2, #6, #7, #8)
- configs/benchmark_hosts/r5.yaml (new): per-host config consumed by
bootstrap_ec2.sh + preflight scripts. Captures advertise_host,
bind_host, DB bind, SG id, region, SSH user.
- bootstrap_ec2.sh: now invokes build-webarena-amd64-images.sh and
build-wikipedia-amd64.sh as mandatory build steps; idempotent;
skips when images already present. (items #3, #6)
- patch_webarena_containers.sh: now uses the hardcoded path
/usr/local/environment_control/ops/sites/<site>.py, because importlib
find_spec returned None under bare docker exec python3. (item #4)
- setup-map-robust.sh: ensure_empty_volume creates the 3 map volumes
that S3 archives never populated (tiles, style, website-db). (item #2)
- storage_state_preflight.py (new) + task_reset_cache.py (new):
Phase 3 preflight catches host-bound storage_state mismatches before
running; per-instance reset cache uses _MUTATING_HTTP_METHODS to
skip reconfigure when no mutation occurred (cuts gitlab's ~10s per
task). (items #9, #10)
- instance_selection.py (new) + agent_config.py: bind_task_to_instance
now routes to select_task_site_instance, laying the foundation for
per-task host rewriting. (item #11)
- configure_db_access.sh / deploy_proxy_r5.sh / generate_scale_r5.sh /
bootstrap_r5.sh / export_host_config_env.py / preflight_*.py:
one-command host bring-up + security group preflight.
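The seeding.py item above describes template variables resolving from a chained seed_context, with the earlier failure being URLs sent with literal {placeholder} values. A minimal sketch of that idea, under assumptions: `resolve_template` and `UnresolvedPlaceholderError` are hypothetical names, placeholders are simple `{name}` fields, and the example URL is invented. The real seeding.py may differ.

```python
import string
from collections import ChainMap


class UnresolvedPlaceholderError(ValueError):
    """Raised when a template still has placeholders that no context can fill."""


def resolve_template(template: str, *contexts: dict) -> str:
    """Fill {name} placeholders from several contexts, earliest context first.

    Refuses to return a string that would still contain a literal placeholder.
    """
    chained = ChainMap(*contexts)
    # string.Formatter().parse yields (literal, field_name, format_spec, conversion)
    fields = [name for _, name, _, _ in string.Formatter().parse(template) if name]
    missing = [name for name in fields if name not in chained]
    if missing:
        raise UnresolvedPlaceholderError(f"unresolved placeholders: {missing}")
    return template.format_map(chained)


# Hypothetical usage: one context from a site-specific helper,
# one from a prior call's response.
site_helpers = {"forum": "books"}
prior_response = {"submission_id": 42}
url = resolve_template(
    "http://example.test/f/{forum}/{submission_id}", site_helpers, prior_response
)
```

Failing loudly on a missing key, rather than sending the request, is the design choice that would surface this class of bug at seed time instead of as a task error later.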
Tests: 701 passed, 2 skipped. 3 new test modules directly covering the
Phase 4 seed failures (chained response, reddit submission, map way_id).
Includes the handoff doc itself and the prior orchestrator handoff for
context, plus scripts/login_gitlab_r5.py (one-shot gitlab storage_state
refresh used in the migration session).
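The per-instance reset cache mentioned above skips reconfiguration when no mutation occurred. A minimal sketch of that bookkeeping, under assumptions: the contents of `_MUTATING_HTTP_METHODS` are a guess (the commit message names the constant but not its members), and `ResetCache` with its methods is a hypothetical interface, not the one in task_reset_cache.py.

```python
# Assumed contents; the commit message does not list the actual members.
_MUTATING_HTTP_METHODS = frozenset({"POST", "PUT", "PATCH", "DELETE"})


class ResetCache:
    """Tracks which site instances are known to be in a clean state."""

    def __init__(self) -> None:
        self._clean = set()  # instances reset since their last mutation

    def record_request(self, instance: str, method: str) -> None:
        """Mark the instance dirty if the request could have changed state."""
        if method.upper() in _MUTATING_HTTP_METHODS:
            self._clean.discard(instance)

    def needs_reset(self, instance: str) -> bool:
        """An instance never seen before is treated as dirty."""
        return instance not in self._clean

    def mark_reset(self, instance: str) -> None:
        """Call after a successful reset or reconfigure."""
        self._clean.add(instance)


cache = ResetCache()
first = cache.needs_reset("gitlab-0")        # unknown instance, so reset
cache.mark_reset("gitlab-0")
cache.record_request("gitlab-0", "GET")      # read-only, stays clean
after_get = cache.needs_reset("gitlab-0")
cache.record_request("gitlab-0", "post")     # mutation, becomes dirty
after_post = cache.needs_reset("gitlab-0")
```

Treating unknown instances as dirty keeps the optimization conservative: the reset is skipped only when the cache has positive evidence that nothing changed.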