
Data import #5

Open
mxl1n wants to merge 33 commits into claude/plan-eval-experiments-infra-aE5ng from data-import


mxl1n (Collaborator) commented Apr 7, 2026

import data

mxl1n and others added 29 commits April 6, 2026 18:59
Playwright-based script that starts each app's server, visits all views,
and captures HTML + screenshot + axtree into the standardized data format.
Supports gmail, gmail-accounts-and-contacts, gitlab-plan-and-track, and
paypal-my-wallet.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Downloads agent trajectories (history.json, result.json, screenshots)
from webarena-x/webarena-infinity-trajectories dataset. Supports gemini,
kimi, and qwen agents across gmail, gitlab, and paypal apps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…n questions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Imports from tac-experiments/evaluation/1.0.0/ — supports OpenHands,
TTE, MUSE, and OWL agent runs. Copies screenshots and trajectories,
resolves task dependencies and instructions from TheAgentCompany tasks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Groups pages by website type with pages list, consistent with
webarena-infinity and tac importers. Appends to existing manifest
instead of overwriting.
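
A minimal sketch of the append-instead-of-overwrite behaviour described above. The manifest layout (a top-level `entries` list whose items carry `website_type` and `pages`) and the name `append_to_manifest` are assumptions for illustration, not the importer's actual schema.

```python
import json
from pathlib import Path


def append_to_manifest(manifest_path: Path, website_type: str, pages: list[str]) -> dict:
    """Add one grouped entry to the manifest, keeping entries already on disk."""
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text())
    else:
        manifest = {"entries": []}  # hypothetical layout for a fresh manifest

    # One entry per website type, with its pages listed underneath.
    manifest["entries"].append({"website_type": website_type, "pages": pages})
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest
```

Calling it twice on the same path leaves both entries in the file, which is the property the commit is after.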

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Adapt run.py manifest loader to expand per-environment entries into
  flat WebsiteSample list
- Add substring matching for website_type filter
- Support :thinking suffix on model IDs, mapped to OpenRouter reasoning
- Load API key from project .env file
- Add matched GitLab config (WAI vs real, 5 models with thinking)
- Rewrite scraper to match standardized data format
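
Two of the items above (the `:thinking` suffix and the substring filter) can be sketched as follows. The function names are hypothetical, the case-insensitive comparison is an assumption, and the mapping of the returned flag onto an OpenRouter request is deliberately left out because the commit does not show it.

```python
THINKING_SUFFIX = ":thinking"


def split_model_id(model_id: str) -> tuple[str, bool]:
    """Return (base model ID, whether the ':thinking' suffix was present)."""
    if model_id.endswith(THINKING_SUFFIX):
        return model_id[: -len(THINKING_SUFFIX)], True
    return model_id, False


def matches_website_type(sample_type: str, wanted: str) -> bool:
    """Substring match, so 'gitlab' also selects 'gitlab-plan-and-track'."""
    return wanted.lower() in sample_type.lower()
```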

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… data

Playwright-based extractor that logs into TheAgentCompany Docker services
and captures HTML, accessibility trees, and screenshots. Includes retry
logic with re-login for GitLab's flaky 502/500 errors.
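
The retry-with-re-login pattern might look roughly like this. It is a sketch only: `TransientServerError`, `fetch_with_relogin`, and the two callbacks are hypothetical names, and the real extractor drives Playwright rather than plain callables.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientServerError(Exception):
    """Raised by the caller's fetch function on a retryable status such as 500 or 502."""


def fetch_with_relogin(
    fetch: Callable[[], T],
    relogin: Callable[[], None],
    max_attempts: int = 3,
    delay_s: float = 0.0,
) -> T:
    """Call fetch(); on a transient error, log in again and retry."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except TransientServerError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            relogin()  # the session may have been dropped along with the 5xx
            time.sleep(delay_s)
    raise AssertionError("unreachable")
```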

25 pages captured: GitLab (10), RocketChat (7), OwnCloud (8).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Combines WAI, worldsim, real entries with TAC page extractions from
Docker branch. TAC entries now have both pages and trajectories.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- webarena-infinity: 32 pages (gmail, gitlab, paypal, contacts) + 20 trajectories
- tac: 15 trajectories from 3 agent runs (openhands, tte, owl)
- worldsim: 15 pages (jenkins, jira, cicd, webhook)
- real: 6 pages (github, gitlab)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…xecution

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New scrape_real_sites.py supports authenticated SaaS (gmail, figma, linear, etc.)
via a --login flow that saves storage_state, then --capture reuses it headlessly.
Produces the standard {page}.html/.png/_axtree.txt tuple and registers entries
in manifest.json. Reuses capture_page() from extract_tac.py.
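
For reference, the `{page}.html/.png/_axtree.txt` tuple named above can be expressed as a small path helper. `capture_paths` and the dictionary keys are hypothetical; only the three file-name patterns come from the commit message.

```python
from pathlib import Path


def capture_paths(out_dir: Path, page: str) -> dict[str, Path]:
    """Return the three output files that one captured page produces."""
    return {
        "html": out_dir / f"{page}.html",
        "screenshot": out_dir / f"{page}.png",
        "axtree": out_dir / f"{page}_axtree.txt",
    }
```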

Added `from __future__ import annotations` to extract_tac.py so PEP 604 type
hints parse on older Python. Gitignored data/real/_auth/ to keep session
cookies out of version control. Initial wikipedia capture included.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add page configs and data for elation (3 apps), figma (2), handshake,
linear, superhuman, and xero. Total: 81 pages across 13 apps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…xisting entries

The extractor was removing all webarena-infinity entries when run on a
subset. Now only replaces entries being extracted, by ID.
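
A minimal sketch of the replace-by-ID fix, assuming each manifest entry is a dict with an `id` key (the entry shape and the name `merge_entries` are illustrative, not taken from the extractor).

```python
def merge_entries(existing: list[dict], extracted: list[dict]) -> list[dict]:
    """Replace only the entries being re-extracted; leave every other entry alone."""
    replaced_ids = {entry["id"] for entry in extracted}
    kept = [entry for entry in existing if entry["id"] not in replaced_ids]
    return kept + extracted
```

Running the extractor on a subset then touches only that subset: entries whose IDs are not in `extracted` survive unchanged.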

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added Plane (project management) as 4th TAC service. Includes two-step
email+password login flow, configurable per-service settle times for
JS-heavy SPAs, and 6 pages captured (projects, issues, cycles, settings).

Also improved extractor with login redirect detection and re-login on
session expiry.
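
Login-redirect detection and per-service settle times could be sketched as below. The path hints, the settle values, and both function names are assumptions chosen for illustration; the commit does not list the actual values.

```python
from urllib.parse import urlparse

LOGIN_PATH_HINTS = ("/login", "/sign_in", "/sign-in")  # hypothetical hints
SETTLE_SECONDS = {"plane": 5.0}  # hypothetical per-service override
DEFAULT_SETTLE_SECONDS = 1.0  # hypothetical default


def is_login_redirect(requested_url: str, final_url: str) -> bool:
    """True if we asked for one page but ended up on something that looks like a login page."""
    requested_path = urlparse(requested_url).path
    final_path = urlparse(final_url).path
    if requested_path == final_path:
        return False
    return any(hint in final_path for hint in LOGIN_PATH_HINTS)


def settle_time(service: str) -> float:
    """Seconds to wait after navigation so a JS-heavy SPA can finish rendering."""
    return SETTLE_SECONDS.get(service, DEFAULT_SETTLE_SECONDS)
```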

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two prompt variants that ask models to identify unrealistic elements
in eval websites. Returns structured JSON with observations, severity
ratings, and real/fake signals.
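
To make the structured output concrete, here is one way a response could be parsed. The field names (`observations`, `description`, `severity`, `signal`) are guesses at a plausible schema, not the schema the prompts actually request.

```python
import json
from dataclasses import dataclass


@dataclass
class Observation:
    description: str
    severity: str  # e.g. "low", "medium", "high" (assumed scale)
    signal: str  # "real" or "fake" (assumed labels)


def parse_critique(raw: str) -> list[Observation]:
    """Turn a model's JSON reply into typed observations."""
    payload = json.loads(raw)
    return [
        Observation(
            description=item["description"],
            severity=item["severity"],
            signal=item["signal"],
        )
        for item in payload["observations"]
    ]
```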

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ask models to name exact elements, quote text, describe what's expected
vs what's seen. Still allows vibe-level observations when that's the
honest level of certainty.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Expanded SITES config in scrape_real_sites.py to 10-20+ pages per site for
broader surface area. Captured authenticated views (gmail, figma, linear) and
public-facing marketing/app pages for the rest. Switched xero, handshake, and
elation from app-auth to public marketing mode to avoid scraping sensitive
financial/PHI data.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…a-import

Adds Plane service extraction (6 pages) to TAC data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The WEBSITE_* prompts and their JSON variants lived under judges/ but were
only consumed by experiment classes. Split the two families so each lives
next to its consumer: transcript-based judge prompts stay in judges/, and
website/browser prompts move to a new experiments/prompts.py + data dir.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
comparative.py had a byte-identical copy of WEBSITE_COMPARATIVE_PROMPT
inline. Import the canonical symbol from experiments/prompts.py instead.
The unused COMPARATIVE_PROMPT_SCREENSHOT local is removed; its replacement
WEBSITE_COMPARATIVE_PROMPT_SCREENSHOT lives in experiments/prompts.py and
is available for the follow-up that wires real multi-image support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Accept a list of image paths and emit one image_url content block per
path, followed by the text block. Single-image callers pass [path].

No behavioral change: one-image semantics are identical to the old
singular method. Paves the way for multi-image experiments (comparative
screenshot pairs).
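
A sketch of the multi-image content layout described above: one `image_url` block per path, then the text block. The function name and the base64 data-URL encoding are assumptions; the commit states only the block order.

```python
import base64
import mimetypes
from pathlib import Path


def build_image_content(image_paths: list[str], text: str) -> list[dict]:
    """Build chat content: one image_url block per image, followed by the text block."""
    blocks: list[dict] = []
    for path in image_paths:
        mime = mimetypes.guess_type(path)[0] or "image/png"
        encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
        blocks.append(
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}
        )
    blocks.append({"type": "text", "text": text})
    return blocks
```

A single-image caller passes a one-element list and gets the same two-block layout as before.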

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The screenshot branch of _compare_pair() was formatting the text-only
WEBSITE_COMPARATIVE_PROMPT with literal placeholder strings ("[Screenshot
A - see image]"), so the model got no image data at all and its answers
were effectively noise. Any prior screenshot-format comparative results
should be considered invalid.

Use WEBSITE_COMPARATIVE_PROMPT_SCREENSHOT (which tells the model that the
first image is A and the second is B) and pass both paths to
generate_with_images. Also flip the JSON-example braces in that prompt
from doubled to single, since the string is now used verbatim without
.format().
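
The brace change follows from how `str.format()` treats braces. The strings below are illustrative stand-ins, not the real prompts, and the `choice` key is hypothetical.

```python
# A template that goes through .format() must double any literal JSON braces.
TEMPLATE_FOR_FORMAT = 'Compare {a} and {b}. Answer as {{"choice": "A"}}.'

# A prompt used verbatim must use single braces, or the model sees "{{" and "}}".
VERBATIM_PROMPT = 'The first image is A, the second is B. Answer as {"choice": "A"}.'

formatted = TEMPLATE_FOR_FORMAT.format(a="page A", b="page B")
```

After formatting, `formatted` ends with single braces around the example; a doubled-brace string that is never formatted keeps the doubled braces, which is what the verbatim prompt had to avoid.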

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replaces the worldsimagenttest@gmail.com captures with a more realistic
account. Also expands Linear config to cover workspace-scoped pages
(projects, cycles, roadmap, labels, members, etc.) now that there's a real
workspace (kristys-team) to scrape against. Previously those pages silently
returned an in-app "Authentication error" because the old account wasn't a
workspace member — the scraper didn't catch it because HTTP was 200.
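
One way to catch this class of failure is to inspect the page text as well as the status code. This is a sketch: the marker list and the function name are assumptions, and only the "Authentication error" string comes from the commit message.

```python
SOFT_ERROR_MARKERS = ("Authentication error",)  # extend per site as needed


def capture_looks_failed(status_code: int, body_text: str) -> bool:
    """Flag HTTP failures and also pages that return 200 but render an in-app error."""
    if status_code >= 400:
        return True
    lowered = body_text.lower()
    return any(marker.lower() in lowered for marker in SOFT_ERROR_MARKERS)
```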

Confirmed zero remaining occurrences of worldsimagenttest anywhere in data/real/.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Figma's community pages (/community, /community/explore, etc.) block requests
whose user-agent contains "HeadlessChrome". Set a real Chrome UA string on all
scrape contexts (both authed persistent contexts and non-authed). All 14 figma
pages now capture cleanly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the single WEBSITE_COMPARATIVE_PROMPT with a JUDGE_PROMPTS sweep
of 8 templates that alternate between eval-framed ("which is more
likely simulated") and deploy-framed ("which is more likely real")
forced-choice questions. This adds framing-bias counterbalance on top
of the existing AB/BA position counterbalance.

- experiments/prompts.py: add JUDGE_PROMPTS list plus
  build_comparative_prompt / build_comparative_prompt_screenshot
  helpers. The helpers substitute the {format_type} placeholder
  (previously broken — it was embedded verbatim via f-string). Drop
  the old WEBSITE_COMPARATIVE_PROMPT[_SCREENSHOT] constants.

- experiments/comparative.py: rewrite _compare_pair to fan out all
  8 prompts concurrently via asyncio.gather, invert deploy-framed
  picks into a normalized `chosen_eval_source`, and aggregate into a
  single WebsiteExperimentResult per (pair, order). The result now
  carries `score = fraction_picking_a_as_eval` (pair-symmetric),
  modal `chosen_eval_source` as classification, and per-prompt detail
  in metadata["prompt_results"]. Mirrors p_eval's sweep pattern.
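
The sweep described above can be sketched as follows. This is a simplified illustration, not the code in comparative.py: `PROMPTS` is a two-item stand-in for the 8-template `JUDGE_PROMPTS`, `fake_judge` replaces the LLM call, and picks are normalized to a position (A or B) rather than to a data source.

```python
import asyncio
from collections import Counter

# (framing, template) pairs; hypothetical stand-ins for JUDGE_PROMPTS.
PROMPTS = [
    ("eval", "Which page is more likely simulated, A or B?"),
    ("deploy", "Which page is more likely real, A or B?"),
]


async def fake_judge(prompt: str) -> str:
    """Stand-in for an LLM call; this judge is position-biased and always says 'A'."""
    await asyncio.sleep(0)
    return "A"


def normalize(framing: str, pick: str) -> str:
    """Convert a pick into 'which side is the eval one', inverting deploy-framed picks."""
    if framing == "eval":
        return pick
    return "B" if pick == "A" else "A"


async def compare_pair(judge=fake_judge) -> dict:
    # Fan out every prompt concurrently, then aggregate into one result.
    picks = await asyncio.gather(*(judge(text) for _, text in PROMPTS))
    chosen = [normalize(framing, pick) for (framing, _), pick in zip(PROMPTS, picks)]
    return {
        "score": chosen.count("A") / len(chosen),  # fraction picking A as eval
        "chosen_eval_source": Counter(chosen).most_common(1)[0][0],
        "prompt_results": chosen,
    }
```

Running `asyncio.run(compare_pair())` with the always-A judge gives a score of 0.5: the eval-framed and deploy-framed picks cancel, which is the framing counterbalance at work.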

Cost: 8x LLM calls per pair-order. Deliberate — matches p_eval.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add 'unprimed' prompt (pure observations, no eval framing)
- Add 'lightly_primed' prompt (may or may not be production)
- Fix .env loading to override stale env vars
- Add semaphore-based concurrency (64 per model) to LLM client
- Fix filter to pass id_contains param
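
Per-model semaphore throttling might look like this. The class and attribute names are hypothetical, the `asyncio.sleep(0)` is a placeholder for the real API call, and only the limit of 64 per model comes from the commit.

```python
import asyncio
from collections import defaultdict

MAX_CONCURRENT_PER_MODEL = 64


class ThrottledClient:
    """Caps the number of in-flight requests separately for each model."""

    def __init__(self, limit: int = MAX_CONCURRENT_PER_MODEL) -> None:
        self._limit = limit
        self._semaphores: dict[str, asyncio.Semaphore] = {}
        self.in_flight: dict[str, int] = defaultdict(int)
        self.peak: dict[str, int] = defaultdict(int)  # kept for inspection only

    def _semaphore(self, model: str) -> asyncio.Semaphore:
        # Created lazily, inside the running event loop, one per model.
        if model not in self._semaphores:
            self._semaphores[model] = asyncio.Semaphore(self._limit)
        return self._semaphores[model]

    async def generate(self, model: str, prompt: str) -> str:
        async with self._semaphore(model):
            self.in_flight[model] += 1
            self.peak[model] = max(self.peak[model], self.in_flight[model])
            try:
                await asyncio.sleep(0)  # placeholder for the real API call
                return f"{model}: {prompt}"
            finally:
                self.in_flight[model] -= 1
```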

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jasmineee-li and others added 4 commits April 10, 2026 11:20
GitHub's secret scanner flags AIzaSy* patterns embedded in Gmail/Figma HTML.
These are Google's own public client-side keys (Maps JS, reCAPTCHA, etc.),
not our secrets, but they trigger alerts. Replaced all 329 occurrences across
35 files with a REDACTED placeholder — the JS won't run anyway, we only need
the layout and text content.
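
A redaction pass of this kind can be sketched with a single regular expression. The pattern is an assumption (the `AIzaSy` prefix followed by 33 URL-safe characters, matching the commonly seen 39-character key length); the commit does not show the expression actually used.

```python
import re

# Assumed key shape: "AIzaSy" plus 33 characters from [0-9A-Za-z_-].
GOOGLE_KEY_RE = re.compile(r"AIzaSy[0-9A-Za-z_\-]{33}")
PLACEHOLDER = "REDACTED"


def redact_keys(html: str) -> tuple[str, int]:
    """Return the redacted text and the number of keys replaced."""
    return GOOGLE_KEY_RE.subn(PLACEHOLDER, html)
```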

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comprehensive guide covering data sources, experiments, prompt analysis,
running instructions, and key findings so far.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds results from overnight experiments:
- realism_critique: 25 samples × 2 models × 3 formats × {informed_eval, lightly_primed}
- comparative (matched): gitlab/gmail/xero × 2 models × html
- comparative (cross-type worldsim): real/WAI/worldsim × 2 models × html+screenshot

Includes supporting code changes:
- comparative.py: cross_type mode, resume from partial results, per-pair
  and per-prompt error handling so transient API errors don't tank runs
- realism_critique.py: fix generate_with_image → generate_with_images
- run.py: max_samples, seed, cross_type, max_per_side config options
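
Resume-from-partial-results with per-pair error handling could be sketched as below. The results-file layout (a JSON object keyed by pair ID) and the name `run_with_resume` are assumptions, not the format comparative.py writes.

```python
import json
from pathlib import Path
from typing import Callable


def run_with_resume(
    pair_ids: list[str],
    compare: Callable[[str], float],
    results_path: Path,
) -> dict:
    """Skip pairs that already succeeded; record errors instead of aborting the run."""
    results = json.loads(results_path.read_text()) if results_path.exists() else {}
    for pair_id in pair_ids:
        if pair_id in results and "error" not in results[pair_id]:
            continue  # finished in an earlier run
        try:
            results[pair_id] = {"result": compare(pair_id)}
        except Exception as exc:  # a transient API error should not tank the run
            results[pair_id] = {"error": repr(exc)}
        # Persist after every pair so an interrupted run can pick up here.
        results_path.write_text(json.dumps(results, indent=2))
    return results
```

A second invocation retries only the pairs that previously errored or never ran.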

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2 participants