Data import by mxl1n · Pull Request #5 · jasmineee-li/browser-sim

mxl1n · 2026-04-07T04:56:13Z

import data

Playwright-based script that starts each app's server, visits all views, and captures HTML + screenshot + axtree into the standardized data format. Supports gmail, gmail-accounts-and-contacts, gitlab-plan-and-track, and paypal-my-wallet. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Downloads agent trajectories (history.json, result.json, screenshots) from webarena-x/webarena-infinity-trajectories dataset. Supports gemini, kimi, and qwen agents across gmail, gitlab, and paypal apps. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…n questions Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Imports from tac-experiments/evaluation/1.0.0/ — supports OpenHands, TTE, MUSE, and OWL agent runs. Copies screenshots and trajectories, resolves task dependencies and instructions from TheAgentCompany tasks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Groups pages by website type with pages list, consistent with webarena-infinity and tac importers. Appends to existing manifest instead of overwriting. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Adapt run.py manifest loader to expand per-environment entries into flat WebsiteSample list - Add substring matching for website_type filter - Support :thinking suffix on model IDs, mapped to OpenRouter reasoning - Load API key from project .env file - Add matched GitLab config (WAI vs real, 5 models with thinking) - Rewrite scraper to match standardized data format Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… data Playwright-based extractor that logs into TheAgentCompany Docker services and captures HTML, accessibility trees, and screenshots. Includes retry logic with re-login for GitLab's flaky 502/500 errors. 25 pages captured: GitLab (10), RocketChat (7), OwnCloud (8). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…a-import

Combines WAI, worldsim, real entries with TAC page extractions from Docker branch. TAC entries now have both pages and trajectories. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- webarena-infinity: 32 pages (gmail, gitlab, paypal, contacts) + 20 trajectories - tac: 15 trajectories from 3 agent runs (openhands, tte, owl) - worldsim: 15 pages (jenkins, jira, cicd, webhook) - real: 6 pages (github, gitlab) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…xecution Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

New scrape_real_sites.py supports authenticated SaaS (gmail, figma, linear, etc.) via a --login flow that saves storage_state, then --capture reuses it headlessly. Produces the standard {page}.html/.png/_axtree.txt tuple and registers entries in manifest.json. Reuses capture_page() from extract_tac.py. Added `from __future__ import annotations` to extract_tac.py so PEP 604 type hints parse on older Python. Gitignored data/real/_auth/ to keep session cookies out of version control. Initial wikipedia capture included. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add page configs and data for elation (3 apps), figma (2), handshake, linear, superhuman, and xero. Total: 81 pages across 13 apps. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…xisting entries The extractor was removing all webarena-infinity entries when run on a subset. Now only replaces entries being extracted, by ID. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Added Plane (project management) as 4th TAC service. Includes two-step email+password login flow, configurable per-service settle times for JS-heavy SPAs, and 6 pages captured (projects, issues, cycles, settings). Also improved extractor with login redirect detection and re-login on session expiry. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Two prompt variants that ask models to identify unrealistic elements in eval websites. Returns structured JSON with observations, severity ratings, and real/fake signals. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Ask models to name exact elements, quote text, describe what's expected vs what's seen. Still allows vibe-level observations when that's the honest level of certainty. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Expanded SITES config in scrape_real_sites.py to 10-20+ pages per site for broader surface area. Captured authenticated views (gmail, figma, linear) and public-facing marketing/app pages for the rest. Switched xero, handshake, and elation from app-auth to public marketing mode to avoid scraping sensitive financial/PHI data. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…a-import Adds Plane service extraction (6 pages) to TAC data. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The WEBSITE_* prompts and their JSON variants lived under judges/ but were only consumed by experiment classes. Split the two families so each lives next to its consumer: transcript-based judge prompts stay in judges/, and website/browser prompts move to a new experiments/prompts.py + data dir. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

comparative.py had a byte-identical copy of WEBSITE_COMPARATIVE_PROMPT inline. Import the canonical symbol from experiments/prompts.py instead. The unused COMPARATIVE_PROMPT_SCREENSHOT local is removed; its replacement WEBSITE_COMPARATIVE_PROMPT_SCREENSHOT lives in experiments/prompts.py and is available for the follow-up that wires real multi-image support. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Accept a list of image paths and emit one image_url content block per path, followed by the text block. Single-image callers pass [path]. No behavioral change: one-image semantics are identical to the old singular method. Paves the way for multi-image experiments (comparative screenshot pairs). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The screenshot branch of _compare_pair() was formatting the text-only WEBSITE_COMPARATIVE_PROMPT with literal placeholder strings ("[Screenshot A - see image]"), so the model got no image data at all and its answers were effectively noise. Any prior screenshot-format comparative results should be considered invalid. Use WEBSITE_COMPARATIVE_PROMPT_SCREENSHOT (which tells the model that the first image is A and the second is B) and pass both paths to generate_with_images. Also flip the JSON-example braces in that prompt from doubled to single, since the string is now used verbatim without .format(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replaces the worldsimagenttest@gmail.com captures with a more realistic account. Also expands Linear config to cover workspace-scoped pages (projects, cycles, roadmap, labels, members, etc.) now that there's a real workspace (kristys-team) to scrape against. Previously those pages silently returned an in-app "Authentication error" because the old account wasn't a workspace member — the scraper didn't catch it because HTTP was 200. Confirmed zero remaining occurrences of worldsimagenttest anywhere in data/real/. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Figma's community pages (/community, /community/explore, etc.) block requests whose user-agent contains "HeadlessChrome". Set a real Chrome UA string on all scrape contexts (both authed persistent contexts and non-authed). All 14 figma pages now capture cleanly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace the single WEBSITE_COMPARATIVE_PROMPT with a JUDGE_PROMPTS sweep of 8 templates that alternate between eval-framed ("which is more likely simulated") and deploy-framed ("which is more likely real") forced-choice questions. This adds framing-bias counterbalance on top of the existing AB/BA position counterbalance. - experiments/prompts.py: add JUDGE_PROMPTS list plus build_comparative_prompt / build_comparative_prompt_screenshot helpers. The helpers substitute the {format_type} placeholder (previously broken — it was embedded verbatim via f-string). Drop the old WEBSITE_COMPARATIVE_PROMPT[_SCREENSHOT] constants. - experiments/comparative.py: rewrite _compare_pair to fan out all 8 prompts concurrently via asyncio.gather, invert deploy-framed picks into a normalized `chosen_eval_source`, and aggregate into a single WebsiteExperimentResult per (pair, order). The result now carries `score = fraction_picking_a_as_eval` (pair-symmetric), modal `chosen_eval_source` as classification, and per-prompt detail in metadata["prompt_results"]. Mirrors p_eval's sweep pattern. Cost: 8x LLM calls per pair-order. Deliberate — matches p_eval. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add 'unprimed' prompt (pure observations, no eval framing) - Add 'lightly_primed' prompt (may or may not be production) - Fix .env loading to override stale env vars - Add semaphore-based concurrency (64 per model) to LLM client - Fix filter to pass id_contains param Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

GitHub's secret scanner flags AIzaSy* patterns embedded in Gmail/Figma HTML. These are Google's own public client-side keys (Maps JS, reCAPTCHA, etc.), not our secrets, but they trigger alerts. Replaced all 329 occurrences across 35 files with a REDACTED placeholder — the JS won't run anyway, we only need the layout and text content. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…-sim into data-import

Comprehensive guide covering data sources, experiments, prompt analysis, running instructions, and key findings so far. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Adds results from overnight experiments: - realism_critique: 25 samples × 2 models × 3 formats × {informed_eval, lightly_primed} - comparative (matched): gitlab/gmail/xero × 2 models × html - comparative (cross-type worldsim): real/WAI/worldsim × 2 models × html+screenshot Includes supporting code changes: - comparative.py: cross_type mode, resume from partial results, per-pair and per-prompt error handling so transient API errors don't tank runs - realism_critique.py: fix generate_with_image → generate_with_images - run.py: max_samples, seed, cross_type, max_per_side config options Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

mxl1n and others added 29 commits April 6, 2026 18:59

docs: add benchmark data importing guide

bedce19

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

docs: add setup guide with progress, TAC Docker instructions, and ope…

1be93f5

…n questions Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/data-import-tac-docker' into dat…

90ff33a

…a-import

fix: merge TAC Docker page data into combined manifest

bae219c

Combines WAI, worldsim, real entries with TAC page extractions from Docker branch. TAC entries now have both pages and trajectories. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add P(eval) configs for TAC pages and all sources, concurrent e…

b4a3a63

…xecution Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: extract all 13 webarena-infinity apps (49 new pages)

11d816e

Add page configs and data for elation (3 apps), figma (2), handshake, linear, superhuman, and xero. Total: 81 pages across 13 apps. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/data-import-tac-docker' into dat…

4c93644

…a-import Adds Plane service extraction (6 pages) to TAC data. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/data-import' into data-import

c11cb60

jasmineee-li force-pushed the data-import branch from 765199e to c11cb60 Compare April 10, 2026 17:52

jasmineee-li and others added 4 commits April 10, 2026 11:20

Merge branch 'data-import' of https://github.com/jasmineee-li/browser…

ce62776

…-sim into data-import

docs: add CLAUDE.md for eval_awareness_experiments

49f7464

Comprehensive guide covering data sources, experiments, prompt analysis, running instructions, and key findings so far. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data import#5

Data import#5
mxl1n wants to merge 33 commits intoclaude/plan-eval-experiments-infra-aE5ngfrom
data-import

mxl1n commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mxl1n commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants