Playwright-based script that starts each app's server, visits all views, and captures HTML + screenshot + axtree into the standardized data format. Supports gmail, gmail-accounts-and-contacts, gitlab-plan-and-track, and paypal-my-wallet. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
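A capture step like the one described could look roughly like this. The function names and the exact axtree serialization are assumptions; only the `{page}.html` / `{page}.png` / `{page}_axtree.txt` naming is taken from the commits below.

```python
from pathlib import Path

def artifact_record(out_dir: str, page_id: str) -> dict:
    """Paths for one view's artifacts in the standardized data format.
    The key names here are assumptions."""
    base = Path(out_dir) / page_id
    return {
        "html": str(base) + ".html",
        "screenshot": str(base) + ".png",
        "axtree": str(base) + "_axtree.txt",
    }

def capture_view(page, out_dir: str, page_id: str) -> dict:
    """Save HTML, a full-page screenshot, and the accessibility tree
    for one already-loaded Playwright page object."""
    rec = artifact_record(out_dir, page_id)
    Path(rec["html"]).write_text(page.content(), encoding="utf-8")
    page.screenshot(path=rec["screenshot"], full_page=True)
    # Playwright exposes the accessibility snapshot as nested dicts;
    # how it is flattened to text is an assumption.
    ax = page.accessibility.snapshot()
    Path(rec["axtree"]).write_text(repr(ax), encoding="utf-8")
    return rec
```

`capture_view` would be called once per view after the per-app server is up and the page has settled.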
Downloads agent trajectories (history.json, result.json, screenshots) from webarena-x/webarena-infinity-trajectories dataset. Supports gemini, kimi, and qwen agents across gmail, gitlab, and paypal apps. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
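A minimal sketch of such a download, assuming a `{agent}/{app}/**` layout inside the dataset repo (the layout is an assumption; the repo ID and the agent/app names are from the commit). `huggingface_hub.snapshot_download` can restrict the fetch with `allow_patterns`:

```python
from itertools import product

AGENTS = ("gemini", "kimi", "qwen")
APPS = ("gmail", "gitlab", "paypal")

def allow_patterns(agents=AGENTS, apps=APPS):
    """Glob patterns selecting only the (agent, app) runs we want.
    The {agent}/{app}/** directory layout is an assumption."""
    return [f"{agent}/{app}/**" for agent, app in product(agents, apps)]

def download(dest="data/trajectories"):
    # Fetches history.json, result.json, and screenshots for the
    # matching runs only.
    from huggingface_hub import snapshot_download
    return snapshot_download(
        repo_id="webarena-x/webarena-infinity-trajectories",
        repo_type="dataset",
        local_dir=dest,
        allow_patterns=allow_patterns(),
    )
```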
…n questions Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Imports from tac-experiments/evaluation/1.0.0/ — supports OpenHands, TTE, MUSE, and OWL agent runs. Copies screenshots and trajectories, resolves task dependencies and instructions from TheAgentCompany tasks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Groups pages by website type with pages list, consistent with webarena-infinity and tac importers. Appends to existing manifest instead of overwriting. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Adapt run.py manifest loader to expand per-environment entries into a flat WebsiteSample list
- Add substring matching for the website_type filter
- Support :thinking suffix on model IDs, mapped to OpenRouter reasoning
- Load the API key from the project .env file
- Add matched GitLab config (WAI vs real, 5 models with thinking)
- Rewrite scraper to match the standardized data format
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
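The `:thinking` suffix handling could be as simple as the sketch below. The split on the suffix follows the commit; the exact shape of the OpenRouter reasoning payload is an assumption.

```python
def parse_model_id(model_id: str):
    """Split an optional ':thinking' suffix off a model ID.

    Returns (base_id, extra_body) where extra_body carries the
    reasoning flag to pass through to OpenRouter. The payload
    shape {"reasoning": {"enabled": True}} is an assumption.
    """
    if model_id.endswith(":thinking"):
        base = model_id[: -len(":thinking")]
        return base, {"reasoning": {"enabled": True}}
    return model_id, {}
```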
… data Playwright-based extractor that logs into TheAgentCompany Docker services and captures HTML, accessibility trees, and screenshots. Includes retry logic with re-login for GitLab's flaky 502/500 errors. 25 pages captured: GitLab (10), RocketChat (7), OwnCloud (8). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Combines WAI, worldsim, real entries with TAC page extractions from Docker branch. TAC entries now have both pages and trajectories. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- webarena-infinity: 32 pages (gmail, gitlab, paypal, contacts) + 20 trajectories
- tac: 15 trajectories from 3 agent runs (openhands, tte, owl)
- worldsim: 15 pages (jenkins, jira, cicd, webhook)
- real: 6 pages (github, gitlab)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…xecution Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New scrape_real_sites.py supports authenticated SaaS (gmail, figma, linear, etc.)
via a --login flow that saves storage_state, then --capture reuses it headlessly.
Produces the standard {page}.html/.png/_axtree.txt tuple and registers entries
in manifest.json. Reuses capture_page() from extract_tac.py.
Added `from __future__ import annotations` to extract_tac.py so PEP 604 type
hints parse on older Python. Gitignored data/real/_auth/ to keep session
cookies out of version control. Initial wikipedia capture included.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
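The login/capture split above maps naturally onto Playwright's `storage_state`. A sketch, assuming one state file per site under `data/real/_auth/` (the directory is from the commit; function names and the interactive prompt are assumptions):

```python
from pathlib import Path

AUTH_DIR = Path("data/real/_auth")  # gitignored, per the commit

def state_path(site: str) -> Path:
    return AUTH_DIR / f"{site}.json"

def login(site: str, url: str) -> None:
    """Open a headed browser, let a human complete the login,
    then persist cookies/localStorage as storage_state."""
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context()
        context.new_page().goto(url)
        input("Log in in the browser window, then press Enter... ")
        AUTH_DIR.mkdir(parents=True, exist_ok=True)
        context.storage_state(path=str(state_path(site)))
        browser.close()

def capture_context(p, site: str):
    """Headless context that reuses the saved session, if any."""
    browser = p.chromium.launch(headless=True)
    state = state_path(site)
    if state.exists():
        return browser.new_context(storage_state=str(state))
    return browser.new_context()
```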
Add page configs and data for elation (3 apps), figma (2), handshake, linear, superhuman, and xero. Total: 81 pages across 13 apps. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…xisting entries The extractor was removing all webarena-infinity entries when run on a subset. Now only replaces entries being extracted, by ID. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added Plane (project management) as 4th TAC service. Includes two-step email+password login flow, configurable per-service settle times for JS-heavy SPAs, and 6 pages captured (projects, issues, cycles, settings). Also improved extractor with login redirect detection and re-login on session expiry. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two prompt variants that ask models to identify unrealistic elements in eval websites. Returns structured JSON with observations, severity ratings, and real/fake signals. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ask models to name exact elements, quote text, describe what's expected vs what's seen. Still allows vibe-level observations when that's the honest level of certainty. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Expanded SITES config in scrape_real_sites.py to 10-20+ pages per site for broader surface area. Captured authenticated views (gmail, figma, linear) and public-facing marketing/app pages for the rest. Switched xero, handshake, and elation from app-auth to public marketing mode to avoid scraping sensitive financial/PHI data. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…a-import Adds Plane service extraction (6 pages) to TAC data. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The WEBSITE_* prompts and their JSON variants lived under judges/ but were only consumed by experiment classes. Split the two families so each lives next to its consumer: transcript-based judge prompts stay in judges/, and website/browser prompts move to a new experiments/prompts.py + data dir. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
comparative.py had a byte-identical copy of WEBSITE_COMPARATIVE_PROMPT inline. Import the canonical symbol from experiments/prompts.py instead. The unused COMPARATIVE_PROMPT_SCREENSHOT local is removed; its replacement WEBSITE_COMPARATIVE_PROMPT_SCREENSHOT lives in experiments/prompts.py and is available for the follow-up that wires real multi-image support. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Accept a list of image paths and emit one image_url content block per path, followed by the text block. Single-image callers pass [path]. No behavioral change: one-image semantics are identical to the old singular method. Paves the way for multi-image experiments (comparative screenshot pairs). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
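The content-block construction described above could be sketched as follows, using the OpenAI-style chat content format with base64 data URLs (whether the client inlines images this way is an assumption):

```python
import base64
import mimetypes
from pathlib import Path

def build_content(image_paths: list, text: str) -> list:
    """One image_url block per path, then the text block.
    Single-image callers pass [path]."""
    blocks = []
    for path in image_paths:
        mime = mimetypes.guess_type(str(path))[0] or "image/png"
        b64 = base64.b64encode(Path(path).read_bytes()).decode("ascii")
        blocks.append({
            "type": "image_url",
            "image_url": {"url": f"data:{mime};base64,{b64}"},
        })
    blocks.append({"type": "text", "text": text})
    return blocks
```

With one path the result is one image block followed by the text block, matching the old single-image semantics.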
The screenshot branch of _compare_pair() was formatting the text-only
WEBSITE_COMPARATIVE_PROMPT with literal placeholder strings ("[Screenshot
A - see image]"), so the model got no image data at all and its answers
were effectively noise. Any prior screenshot-format comparative results
should be considered invalid.
Use WEBSITE_COMPARATIVE_PROMPT_SCREENSHOT (which tells the model that the
first image is A and the second is B) and pass both paths to
generate_with_images. Also flip the JSON-example braces in that prompt
from doubled to single, since the string is now used verbatim without
.format().
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replaces the worldsimagenttest@gmail.com captures with a more realistic account. Also expands Linear config to cover workspace-scoped pages (projects, cycles, roadmap, labels, members, etc.) now that there's a real workspace (kristys-team) to scrape against. Previously those pages silently returned an in-app "Authentication error" because the old account wasn't a workspace member — the scraper didn't catch it because HTTP was 200. Confirmed zero remaining occurrences of worldsimagenttest anywhere in data/real/. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Figma's community pages (/community, /community/explore, etc.) block requests whose user-agent contains "HeadlessChrome". Set a real Chrome UA string on all scrape contexts (both authed persistent contexts and non-authed). All 14 figma pages now capture cleanly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
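The fix amounts to passing the same realistic `user_agent` to every context. A sketch; the exact UA string is arbitrary, and the kwargs helper is an assumption about how the scraper shares context options:

```python
from __future__ import annotations

# A realistic desktop Chrome UA. The key property is that it does
# NOT contain "HeadlessChrome"; the version number is arbitrary.
CHROME_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/124.0.0.0 Safari/537.36"
)

def context_kwargs(storage_state: str | None = None) -> dict:
    """Shared kwargs for every scrape context, so authed and
    non-authed contexts present the same non-headless UA."""
    kwargs = {"user_agent": CHROME_UA}
    if storage_state:
        kwargs["storage_state"] = storage_state
    return kwargs
```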
Replace the single WEBSITE_COMPARATIVE_PROMPT with a JUDGE_PROMPTS sweep
of 8 templates that alternate between eval-framed ("which is more
likely simulated") and deploy-framed ("which is more likely real")
forced-choice questions. This adds framing-bias counterbalance on top
of the existing AB/BA position counterbalance.
- experiments/prompts.py: add JUDGE_PROMPTS list plus
build_comparative_prompt / build_comparative_prompt_screenshot
helpers. The helpers substitute the {format_type} placeholder
(previously broken — it was embedded verbatim via f-string). Drop
the old WEBSITE_COMPARATIVE_PROMPT[_SCREENSHOT] constants.
- experiments/comparative.py: rewrite _compare_pair to fan out all
8 prompts concurrently via asyncio.gather, invert deploy-framed
picks into a normalized `chosen_eval_source`, and aggregate into a
single WebsiteExperimentResult per (pair, order). The result now
carries `score = fraction_picking_a_as_eval` (pair-symmetric),
modal `chosen_eval_source` as classification, and per-prompt detail
in metadata["prompt_results"]. Mirrors p_eval's sweep pattern.
Cost: 8x LLM calls per pair-order. Deliberate — matches p_eval.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
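The inversion-and-aggregation step can be sketched as pure logic. The per-prompt record shape (`framing`, `pick`) is an assumption; the score definition (fraction of votes picking A as the eval) and the modal classification are from the commit:

```python
from collections import Counter

def aggregate(prompt_results: list) -> dict:
    """Fold per-prompt forced choices into one pair-level result.

    Each entry is assumed to look like
    {"framing": "eval" | "deploy", "pick": "A" | "B"}.
    A deploy-framed pick ("more likely real") is inverted so every
    vote answers the same question: which side looks like the eval?
    """
    votes = []
    for r in prompt_results:
        pick = r["pick"]
        if r["framing"] == "deploy":  # invert deploy-framed picks
            pick = "B" if pick == "A" else "A"
        votes.append(pick)
    counts = Counter(votes)
    return {
        "score": counts["A"] / len(votes),  # fraction picking A as eval
        "chosen_eval_source": counts.most_common(1)[0][0],  # modal vote
    }
```

Because the score counts "A picked as eval" regardless of framing, it is symmetric across the AB/BA position counterbalance.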
- Add 'unprimed' prompt (pure observations, no eval framing)
- Add 'lightly_primed' prompt (may or may not be production)
- Fix .env loading to override stale env vars
- Add semaphore-based concurrency (64 per model) to the LLM client
- Fix filter to pass the id_contains param
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
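Per-model semaphore throttling is a small wrapper around `asyncio.Semaphore`. The class name and interface below are assumptions; the 64-call cap is from the commit:

```python
import asyncio

class ThrottledClient:
    """Cap concurrent in-flight calls for one model (64 per the
    commit). Wraps any async callable; callers gather as many
    requests as they like and the semaphore does the pacing."""

    def __init__(self, call, max_concurrency: int = 64):
        self._call = call
        self._sem = asyncio.Semaphore(max_concurrency)

    async def generate(self, *args, **kwargs):
        async with self._sem:
            return await self._call(*args, **kwargs)
```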
GitHub's secret scanner flags AIzaSy* patterns embedded in Gmail/Figma HTML. These are Google's own public client-side keys (Maps JS, reCAPTCHA, etc.), not our secrets, but they trigger alerts. Replaced all 329 occurrences across 35 files with a REDACTED placeholder — the JS won't run anyway, we only need the layout and text content. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
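The redaction pass is a single regex substitution over each captured HTML file. A sketch; the pattern is anchored on the `AIzaSy` prefix the scanner flags, and the minimum-length tail is an assumption chosen to avoid matching prose:

```python
import re

# Anchored on the AIzaSy prefix GitHub's scanner flags; requiring a
# long URL-safe tail (the {20,} bound is an assumption) avoids false
# hits on ordinary text.
KEY_RE = re.compile(r"AIzaSy[0-9A-Za-z_\-]{20,}")

def redact(html: str) -> str:
    """Replace embedded client-side Google keys with a placeholder.
    The pages are static captures, so breaking the JS is fine."""
    return KEY_RE.sub("REDACTED", html)
```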
Comprehensive guide covering data sources, experiments, prompt analysis, running instructions, and key findings so far. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds results from overnight experiments:
- realism_critique: 25 samples × 2 models × 3 formats × {informed_eval, lightly_primed}
- comparative (matched): gitlab/gmail/xero × 2 models × html
- comparative (cross-type worldsim): real/WAI/worldsim × 2 models × html+screenshot
Includes supporting code changes:
- comparative.py: cross_type mode, resume from partial results, per-pair
and per-prompt error handling so transient API errors don't tank runs
- realism_critique.py: fix generate_with_image → generate_with_images
- run.py: max_samples, seed, cross_type, max_per_side config options
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>