Merged
33 commits
3aad6d7
feat(init): one-command setup + zero-config Vite plugin
divshekhar Jun 17, 2026
1e71782
fix(vite-plugin): serve connect() as a virtual module, not an inline …
divshekhar Jun 17, 2026
3420a9d
fix(init): survive jsonc .mcp.json, honor --port everywhere, fix bin …
divshekhar Jun 17, 2026
8984036
feat(init): register the MCP server globally (user scope), not per pr…
divshekhar Jun 17, 2026
1f85bce
feat(init): also register the MCP server in Cursor's global config
divshekhar Jun 17, 2026
a538846
feat(flows): grade flow assertions — flag assertion-free / presence-o…
divshekhar Jun 17, 2026
840c65a
feat(flows): evaluate flow.success in the live iris_flow_replay (shar…
divshekhar Jun 17, 2026
00dcbfe
feat(tools): cost preview on iris_snapshot/iris_query so agents bail …
divshekhar Jun 17, 2026
2db5eb9
feat(tools): diffed snapshots — iris_snapshot diff:true returns only …
divshekhar Jun 17, 2026
346f5c2
test(bench): M0 — measure the M2 token wins on real functions
divshekhar Jun 17, 2026
414ca24
feat(domain): iris_domain — learn the app's flows + untested intent b…
divshekhar Jun 18, 2026
1d91fea
feat(heal): never auto-heal an ambiguous drift (heals the locator, ne…
divshekhar Jun 18, 2026
3c30edb
docs: surface the new capabilities (iris_domain, snapshot diff/cost, …
divshekhar Jun 18, 2026
eeb8ead
feat(domain): risk-rank flows in iris_domain (run history × assertion…
divshekhar Jun 18, 2026
70b9979
feat(domain): headline the riskiest flow in the iris_domain summary
divshekhar Jun 18, 2026
25ed01f
fix(flows): iris_flow_list returns {name,path} objects to match its o…
divshekhar Jun 18, 2026
52cf56f
feat(observe): nudge the agent to scope a large timeline (cost.recomm…
divshekhar Jun 18, 2026
814bbf0
fix(browser): restore original fetch identity on network observer tea…
divshekhar Jun 18, 2026
e9a6670
fix(browser): restore original identity on route + console observer t…
divshekhar Jun 18, 2026
a7775a6
feat(server): add deterministic 'settled' predicate (M3 — kills sleep…
divshekhar Jun 18, 2026
20b4007
feat(server): act_and_wait defaults to settle when `until` is omitted…
divshekhar Jun 18, 2026
f86c676
feat(server): flow_heal verifies the consequence before persisting (M…
divshekhar Jun 18, 2026
112c366
feat(server): iris_query limit + count_only for token efficiency (M2)
divshekhar Jun 18, 2026
c8278db
feat(server): limit + cost hint on iris_network and iris_console (M2)
divshekhar Jun 18, 2026
63e6ede
feat(server): domain model surfaces mustHold — what must hold per flo…
divshekhar Jun 18, 2026
c2bd0c8
docs(changelog): capture the Unreleased deterministic-waiting, token-…
divshekhar Jun 18, 2026
131ddeb
feat(server): iris_assert nudges presence-only assertions toward a co…
divshekhar Jun 18, 2026
3284f0a
docs(cheatsheet): teach deterministic waiting, consequence-first asse…
divshekhar Jun 18, 2026
80f2d80
fix(server): settled predicate ignores ambient dom.text/animation churn
divshekhar Jun 18, 2026
0b43710
fix(server): floor the flow success oracle at each replay (no stale-s…
divshekhar Jun 18, 2026
84c296c
test(e2e): live verification of the new features against the real demo
divshekhar Jun 18, 2026
3b0e8e0
chore(release): 0.6.10 — deterministic waiting, safe healing, token c…
divshekhar Jun 18, 2026
777a60b
fix(security): bump vite-plugin's dev vite to ^7 (clears GHSA-fx2h-pf…
divshekhar Jun 18, 2026
53 changes: 41 additions & 12 deletions CHANGELOG.md
@@ -4,6 +4,41 @@ All notable changes to **`@syrin/iris`** are documented here. The format follows
[Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and the project follows
[Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.6.10] — 2026-06-18

### Added

- **Deterministic waiting — the `settled` predicate** (`packages/server`). A new predicate
`{ kind: "settled", quietMs }` passes once network + structural-DOM activity has been quiet for
`quietMs` (default 500ms); ambient `dom.text`/animation churn (count-ups, spinners) is ignored so
an animated page can still settle. Usable in `iris_wait_for` and `iris_assert`, and composable inside
`allOf` with the consequence you expect. Replaces fixed sleeps — the #1 cause of flaky agent tests.
- **`iris_act_and_wait` auto-settle** (`packages/server`). Omit `until` and the tool waits for the page
to settle instead of requiring a predicate — "act, then wait for quiet" is now a single zero-config
call, the documented alternative to a sleep.
- **`iris_query` token controls** (`packages/server`) — `limit` (cap returned descriptors; reports
`total` + `truncated` so a trim is never silent) and `count_only` (return just the match count).
- **`iris_network` / `iris_console` token controls** (`packages/server`) — `limit` (keep the most
recent N matches, reporting `total` + `droppedOldest`) and a `cost:{bytes,tokens}` hint, matching the
other read tools so the agent can self-budget everywhere.
- **`iris_domain` `mustHold` per flow** (`packages/server`) — each flow now reports the success
consequence that must hold for it (signal name / net URL), so an agent can answer "what are the
critical flows and what must hold for each?" from the domain model alone.
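The token controls listed above can be illustrated with a minimal sketch. The helper names `capQueryResult` and `keepMostRecent` are hypothetical and not part of `@syrin/iris`; the field names (`count`, `elements`, `total`, `truncated`, `droppedOldest`) are the ones the changelog and the e2e spec mention. Two assumptions: `droppedOldest` is treated as a count, and `total`/`truncated` are always reported, whereas the real tools may omit them when nothing was trimmed.

```javascript
// Illustrative only: these helpers are hypothetical, not exports of @syrin/iris.

// Mirrors the iris_query controls: count_only drops descriptors, limit caps them.
function capQueryResult(matches, { limit, count_only = false } = {}) {
  if (count_only) return { count: matches.length };
  const capped = limit === undefined ? matches : matches.slice(0, limit);
  return {
    elements: capped,
    total: matches.length,
    // Assumption: always reported here so a trim is never silent.
    truncated: capped.length < matches.length,
  };
}

// Mirrors the iris_network / iris_console control: keep the most recent N entries.
// Assumption: droppedOldest is the number of discarded (older) entries.
function keepMostRecent(entries, limit) {
  const kept =
    limit === undefined ? entries : entries.slice(Math.max(0, entries.length - limit));
  return { entries: kept, total: entries.length, droppedOldest: entries.length - kept.length };
}
```

Capping from the front for queries and from the back for logs matches the descriptions above: a query keeps the first matches, while a network or console read keeps the newest ones.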

### Changed

- **Self-healing now verifies the consequence before persisting** (`packages/server`). `iris_flow_heal`
with `apply:true` re-replays the healed flow and re-asserts its success consequence; if a rebound
locator resolves but the flow no longer satisfies its intent, the write is **refused**
(`status:consequence_broken`, file untouched). It heals the locator, never the intent.
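The verify-before-persist rule above might look roughly like the following sketch. Everything here is hypothetical except the `consequence_broken` status: `applyHeal`, the injected `replay` and `write` callbacks, the `consequenceHeld` field, and the `applied` status are illustrative names, not the `iris_flow_heal` implementation.

```javascript
// Hypothetical sketch of "heal the locator, never the intent".
// Only the `consequence_broken` status comes from the changelog; the rest is assumed.
async function applyHeal(healedFlow, { replay, write }) {
  // Re-run the healed flow and re-check its success consequence before touching disk.
  const result = await replay(healedFlow);
  if (!result.consequenceHeld) {
    // The rebound locator resolved, but the flow no longer proves its intent: refuse.
    return { status: 'consequence_broken', written: false };
  }
  await write(healedFlow);
  return { status: 'applied', written: true };
}
```

Passing `replay` and `write` in as callbacks keeps the sketch self-contained and makes the ordering easy to see: the write happens only after the consequence is re-asserted.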

### Fixed

- **Browser observers fully restore patched globals on teardown** (`packages/browser`). The network,
route, and console observers stored a bound copy and assigned it back on teardown, so `window.fetch`
/ `history.pushState` / `console.*` were never restored to their original identity. They now keep the
true original for restore and a bound copy only for invocation.
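The identity bug described above is easy to reproduce outside a browser. The sketch below uses a stand-in object rather than `window`, and `patchMethod` is a hypothetical helper, not the observers' actual code. It keeps the true original for restore and uses the bound copy only to call through.

```javascript
// Hypothetical helper; not part of @syrin/iris.
function patchMethod(target, key, makeWrapper) {
  const original = target[key];          // true original: the only value assigned back
  const invoke = original.bind(target);  // bound copy: used solely to call through
  target[key] = makeWrapper(invoke);
  return function restore() {
    // Assigning `invoke` here instead would leave a different function object in place.
    target[key] = original;
  };
}

// Usage with a stand-in object (a real observer would patch window.fetch and friends).
const fakeWindow = { fetch(url) { return `fetched ${url}`; } };
const originalFetch = fakeWindow.fetch;
const seen = [];
const restore = patchMethod(fakeWindow, 'fetch', (invoke) => (url) => {
  seen.push(url); // observe the call, then delegate
  return invoke(url);
});
fakeWindow.fetch('/api');
restore();
console.log(fakeWindow.fetch === originalFetch); // true
```

`Function.prototype.bind` always returns a new function, so restoring the bound copy can never satisfy an identity check against the original.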

## [0.5.0] — 2026-06-15

### Added
@@ -25,24 +60,18 @@ All notable changes to **`@syrin/iris`** are documented here. The format follows
dev-only HUD overlay that the agent can control: `iris_narrate` shows a caption, `iris_highlight`
draws a ring around any element. The HUD is excluded from snapshots and tree-shaken in production.
- **Unified `SKILL.md` at repo root** — a single skill file auto-detects mode: setup wizard on first
-run (no `.iris.json`), live-app testing on every run after. Covers Claude Code, OpenCode, Codex CLI,
-Cursor, Windsurf, VS Code, and Zed MCP config formats.
+run (no `.iris.json`), live-app testing on every run after. Covers Claude Code, OpenCode, Codex CLI, Cursor, Windsurf, VS Code, and Zed MCP config formats.
- **`.iris.json` project config** — written after first-run setup; persists `port`, `headed`,
`framework`, and `harnesses` so subsequent runs need zero questions.
-- **`dev:iris` script** in `apps/demo` — second Vite dev server on port 4310, isolated from the user's
-normal dev port.
+- **`dev:iris` script** in `apps/demo` — second Vite dev server on port 4310, isolated from the user's normal dev port.

### Fixed

- **All-throttled session auto-selection** (`packages/server`). When every connected tab is hidden
-(e.g. user is in VS Code with Chrome on another desktop), `SessionManager.resolve()` now picks the
-session with the freshest heartbeat instead of throwing `"multiple sessions connected"`.
-- **Presenter HUD shows on bridge connect** — the overlay now mounts as soon as the SDK connects to the
-bridge, not only after the first `iris_narrate` call.
-- **`iris_narrate` MCP schema validation** — relaxed the output schema so the tool no longer rejects
-responses from narration calls.
-- **`iris_inspect` / `iris_clock` output schemas** — relaxed to pass through extra fields instead of
-stripping them, fixing spurious validation errors.
+(e.g. user is in VS Code with Chrome on another desktop), `SessionManager.resolve()` now picks the session with the freshest heartbeat instead of throwing `"multiple sessions connected"`.
+- **Presenter HUD shows on bridge connect** — the overlay now mounts as soon as the SDK connects to the bridge, not only after the first `iris_narrate` call.
+- **`iris_narrate` MCP schema validation** — relaxed the output schema so the tool no longer rejects responses from narration calls.
+- **`iris_inspect` / `iris_clock` output schemas** — relaxed to pass through extra fields instead of stripping them, fixing spurious validation errors.

---

91 changes: 91 additions & 0 deletions apps/e2e/specs/new-features-test.mjs
@@ -0,0 +1,91 @@
// Live verification of the features added in the [Unreleased] CHANGELOG section, against the real
// showcase dashboard (apps/demo :4310 + apps/api :8787). The existing battery proves no regression;
// this spec positively exercises the NEW surfaces end-to-end in a real browser:
// - settled predicate + iris_act_and_wait auto-settle (incl. the ambient-animation fix: the demo's
// count-up counters emit dom.text every frame, which must NOT prevent settling)
// - iris_query limit / count_only token controls
// - iris_assert presence-only `advice` nudge
import { chromium } from 'playwright';
import {
  start,
  TOOLS,
  BaselineStore,
  RecordingStore,
  FlowStore,
  ProjectStore,
  AnnotationStore,
  createNodeFileSystem,
} from '@syrin/iris-server';
import os from 'node:os';
import path from 'node:path';
const sleep = (ms) => new Promise((r) => setTimeout(r, ms));
let pass = 0,
  fail = 0;
// Record one check: print a ✅/❌ line with an optional detail and bump the matching counter.
const chk = (l, o, d = '') => {
  console.log(` ${o ? '✅' : '❌'} ${l}${d ? ' — ' + d : ''}`);
  o ? pass++ : fail++;
};

const irisRoot = path.join(os.tmpdir(), `iris-nf-${process.pid}`, '.iris');
const fsp = createNodeFileSystem();
const now = () => Date.now();
const server = await start({ port: 4400, mcp: false });
const deps = {
  sessions: server.bridge.sessions,
  baselines: new BaselineStore(),
  recordings: new RecordingStore(),
  flows: new FlowStore(fsp, irisRoot, { now }),
  project: new ProjectStore(fsp, irisRoot, { now }),
  annotations: new AnnotationStore(),
  fs: fsp,
  irisRoot,
  now,
};
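// Invoke an Iris tool handler by name against the 'demo' session.
const T = (n, a = {}) => TOOLS.find((t) => t.name === n).handler(deps, { sessionId: 'demo', ...a });
// Poll iris_query (up to 40 tries, 100ms apart) until a match exposes a ref; null on timeout.
const refOf = async (by, value) => {
  for (let i = 0; i < 40; i++) {
    const r = (await T('iris_query', { by, value })).elements?.[0]?.ref;
    if (r) return r;
    await sleep(100);
  }
  return null;
};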

const b = await chromium.launch({ headless: true });
const p = await b.newPage();
await p.goto('http://localhost:4310/?session=demo', { waitUntil: 'networkidle' });
for (let i = 0; i < 200 && server.bridge.sessions.count() === 0; i++) await sleep(50);

console.log('\n=== Iris × new features (:4310) ===');
chk('dashboard SDK connected', server.bridge.sessions.count() > 0);

// count_only — just the match count, no descriptors.
const co = await T('iris_query', { by: 'role', value: 'button', count_only: true });
chk('iris_query count_only returns a count, drops elements', typeof co.count === 'number' && co.count >= 1 && co.elements === undefined, `count=${co.count}`);

// limit — cap descriptors; when more matched, total + truncated flag it.
const lim = await T('iris_query', { by: 'role', value: 'button', limit: 1 });
const moreThanOne = (co.count ?? 0) > 1;
chk('iris_query limit caps descriptors (truncated when more)', (lim.elements?.length ?? 0) <= 1 && (!moreThanOne || (lim.truncated === true && lim.total === co.count)), `returned=${lim.elements?.length}, total=${lim.total ?? 'n/a'}`);

// Auth (pre-filled) → dashboard with its count-up animations.
await T('iris_act_and_wait', { ref: await refOf('testid', 'login-submit'), action: 'click', until: { kind: 'signal', name: 'auth:granted' }, timeout_ms: 5000 });
chk('login → dashboard', (await refOf('testid', 'nav-deployments')) !== null);

// settled wait — the dashboard's count-up counters emit dom.text every frame; settle must STILL
// resolve (the ambient-animation fix). Pre-fix this would time out at 4s with pass:false.
const settled = await T('iris_wait_for', { predicate: { kind: 'settled', quietMs: 300 }, timeout_ms: 4000 });
chk('settled resolves despite count-up animation churn', settled.pass === true, JSON.stringify(settled.evidence ?? {}));

// act_and_wait with NO `until` → auto-settle after a nav click; verdict carries settled evidence.
const aw = await T('iris_act_and_wait', { ref: await refOf('testid', 'nav-deployments'), action: 'click' });
chk('act_and_wait (no until) auto-settles', aw.verdict?.pass === true && aw.verdict?.evidence?.settled === true, JSON.stringify(aw.verdict?.evidence ?? {}));

// presence-only advice — a PASSING element assertion is nudged toward a consequence.
const adv = await T('iris_assert', { predicate: { kind: 'element', query: { testid: 'deploy-list' } } });
chk('iris_assert presence-only attaches advice', adv.pass === true && typeof adv.advice === 'string' && adv.advice.includes('consequence'), adv.advice ? 'advice present' : 'no advice');

console.log(`\n${fail === 0 ? '✅ NEW FEATURES VERIFIED' : '❌ FAILED'} (${pass} passed, ${fail} failed)`);
await b.close();
await server.close();
process.exit(fail === 0 ? 0 : 1);
Comment on lines +54 to +91

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guarantee teardown with try/finally around the e2e run.

If any awaited call in the flow throws, b.close()/server.close() never run, which can leak resources and hang CI jobs.

🛠️ Suggested hardening
-const b = await chromium.launch({ headless: true });
-const p = await b.newPage();
-await p.goto('http://localhost:4310/?session=demo', { waitUntil: 'networkidle' });
-for (let i = 0; i < 200 && server.bridge.sessions.count() === 0; i++) await sleep(50);
+let b;
+let p;
+try {
+  b = await chromium.launch({ headless: true });
+  p = await b.newPage();
+  await p.goto('http://localhost:4310/?session=demo', { waitUntil: 'networkidle' });
+  for (let i = 0; i < 200 && server.bridge.sessions.count() === 0; i++) await sleep(50);
@@
-console.log(`\n${fail === 0 ? '✅ NEW FEATURES VERIFIED' : '❌ FAILED'} (${pass} passed, ${fail} failed)`);
-await b.close();
-await server.close();
+  console.log(`\n${fail === 0 ? '✅ NEW FEATURES VERIFIED' : '❌ FAILED'} (${pass} passed, ${fail} failed)`);
+} finally {
+  await p?.close();
+  await b?.close();
+  await server.close();
+}
 process.exit(fail === 0 ? 0 : 1);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/e2e/specs/new-features-test.mjs` around lines 54 - 91, The test code
lacks proper error handling and resource cleanup. If any await statement throws
an error before reaching b.close() and server.close(), resources will leak and
hang the CI. Wrap all the test logic (from chromium.launch through the final chk
call) in a try block, and move the b.close() and server.close() calls into a
finally block that runs regardless of success or failure. Keep the process.exit
call after the finally block to ensure it only executes after cleanup is
complete.
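A generic version of the teardown pattern the comment asks for, as a minimal sketch. `runWithTeardown` is a hypothetical helper, not code from the spec; it goes slightly beyond the suggested diff by running every closer even if one of them throws, and by turning a thrown error into a non-zero exit code.

```javascript
// Hypothetical helper, not part of the spec under review.
// Runs `body`, then always runs every closer, and returns a process exit code.
async function runWithTeardown(body, closers) {
  let failed = false;
  try {
    await body();
  } catch (err) {
    failed = true;
    console.error('run failed:', err);
  } finally {
    // allSettled: one failing close() must not skip the others.
    // The async wrapper also converts a synchronous throw into a rejection.
    const results = await Promise.allSettled(closers.map(async (close) => close()));
    if (results.some((r) => r.status === 'rejected')) failed = true;
  }
  return failed ? 1 : 0;
}
```

In the spec this could be used as `process.exit(await runWithTeardown(runChecks, [() => b?.close(), () => server.close()]))`, where `runChecks` is an assumed wrapper around the existing checks that throws when `fail > 0`.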

40 changes: 35 additions & 5 deletions docs/agent-cheatsheet.md
@@ -23,6 +23,20 @@ pointer sequence on the element (no coordinate gesture for the HUD to intercept)
`occluded:true` when something covers the target, and stays synthetic even with CDP configured
(use `args:{ native:true }` for a trusted native click).

**Never sleep — wait deterministically.** Fixed sleeps are the #1 cause of flaky agent tests. Instead:

- `iris_act_and_wait({ ref, action })` with **no `until`** waits for the page to _settle_ (network +
structural DOM idle; ambient count-up/spinner churn is ignored so an animated page still settles)
before returning — the one-call replacement for "click then sleep 500ms".
- Need to wait without acting? `iris_wait_for({ predicate: { kind: "settled", quietMs } })`.
- Waiting for a specific outcome? Pass that consequence as the predicate (`{ signal }` / `{ net }`),
or `allOf` it with `{ kind: "settled" }` to wait for both the event _and_ the page going quiet.
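The settle rule in the bullets above can be sketched as a pure function over a timeline of events. This is an illustration of the idea, not the server's implementation: `isSettled`, the event shape `{ kind, at }`, and every kind name other than `dom.text` are assumptions. The 500ms default mirrors the documented `quietMs` default.

```javascript
// Hypothetical sketch; kind names other than `dom.text` are assumed.
const AMBIENT_KINDS = new Set(['dom.text', 'animation']);

// events: [{ kind, at }] with `at` in milliseconds.
// Returns true once no network or structural-DOM event has occurred for quietMs.
function isSettled(events, now, quietMs = 500) {
  let lastRelevant = -Infinity;
  for (const e of events) {
    if (AMBIENT_KINDS.has(e.kind)) continue; // count-ups and spinners never block settling
    if (e.at > lastRelevant) lastRelevant = e.at;
  }
  // Assumption: with no relevant events at all, the page counts as settled.
  return now - lastRelevant >= quietMs;
}
```

Because the quiet window is measured from the last relevant event rather than from a fixed delay, the wait ends as early as the page allows and no earlier, which is what makes it a deterministic replacement for a sleep.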

**Assert a consequence, not just presence.** `{ signal }` / `{ net }` prove the feature actually did
something; `{ element }` / `{ text }` only prove something is on screen — which a stale render or a
locator healed to the wrong element can fake. A _passing_ presence-only `iris_assert` returns
`advice` nudging you to a consequence; heed it on anything that matters.
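One way to picture the presence-versus-consequence distinction is a small grader over a flow's predicates, using the grade names the flow tools report (`asserted`, `presence-only`, `assertion-free`). This is a hypothetical sketch: `gradeAssertions` is illustrative, and it assumes that only `signal` and `net` predicates count as consequences.

```javascript
// Hypothetical grader; assumes only `signal` and `net` predicates prove a consequence.
const CONSEQUENCE_KINDS = new Set(['signal', 'net']);

function gradeAssertions(predicates) {
  if (predicates.length === 0) return 'assertion-free';
  // A single consequence is enough to make the flow a real test.
  return predicates.some((p) => CONSEQUENCE_KINDS.has(p.kind)) ? 'asserted' : 'presence-only';
}
```

For example, a flow that only checks `{ kind: 'element' }` grades as `presence-only`, while adding `{ kind: 'signal', name: 'auth:granted' }` moves it to `asserted`.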

## The 4-layer cross-check — never trust a green the state contradicts

A claim is real only when the layers agree. Check more than the UI:
@@ -54,9 +68,9 @@ tree; a wrong `path` returns `{ found:false, availableKeys }` so it's self-corre

Sessions/perception/verify — what you'll use 90% of the time:

-`iris_sessions` · `iris_snapshot` · `iris_query` · `iris_act` · `iris_act_and_wait` ·
-`iris_observe` · `iris_wait_for` · `iris_assert` · `iris_state` · `iris_diff` ·
-`iris_capabilities` · `iris_narrate` (show intent on-page) · `iris_project` (run-history, see below).
+`iris_sessions` · `iris_domain` (learn the app + gaps, read first) · `iris_snapshot` · `iris_query` ·
+`iris_act` · `iris_act_and_wait` · `iris_observe` · `iris_wait_for` · `iris_assert` · `iris_state` ·
+`iris_diff` · `iris_capabilities` · `iris_narrate` (show intent on-page) · `iris_project` (run-history).

**Reach past core when…** you need to record/replay a journey (`iris_record_start/stop`,
`iris_replay`), persist a self-healing golden flow (`iris_flow_save*` / `iris_flow_replay` /
@@ -89,15 +103,31 @@ Both need a **driven browser** (`iris drive <url>` / `IRIS_CDP_URL`); without on
## Start here

1. `iris_sessions` — find the connected tab (omit `sessionId` if there's only one).
-2. `iris_capabilities` — learn the app's testable surface (`testids`, `signals`, `stores`, `flows`)
-so you assert on facts without reading source. (`iris_sessions` flags `hasCapabilities`.)
+2. `iris_domain` — learn the app BEFORE testing: the saved flows, what each asserts, and the **gaps**
+(declared signals/testids that no flow verifies — untested intent). Tells you what to test and
+where the real risk is without crawling the whole app. Falls back to `iris_capabilities` for the
+raw testable surface (`testids`, `signals`, `stores`, `flows`).
3. Run the loop: **look → act → observe → assert**, cross-checking the 4 layers on anything that matters.
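The "gaps" idea in step 2 is a set difference: declared signals and testids minus the ones some saved flow verifies. The sketch below assumes hypothetical input shapes (`declared.signals`, `declared.testids`, and per-flow `signals` / `testids` arrays); it is not the `iris_domain` schema, and `findGaps` and the `deploy:created` signal are illustrative names.

```javascript
// Hypothetical shapes; not the iris_domain output schema.
function findGaps(declared, flows) {
  const verifiedSignals = new Set(flows.flatMap((f) => f.signals ?? []));
  const verifiedTestids = new Set(flows.flatMap((f) => f.testids ?? []));
  return {
    // Declared intent that no saved flow checks.
    signals: declared.signals.filter((s) => !verifiedSignals.has(s)),
    testids: declared.testids.filter((t) => !verifiedTestids.has(t)),
  };
}

const gaps = findGaps(
  { signals: ['auth:granted', 'deploy:created'], testids: ['login-submit', 'deploy-list'] },
  [{ name: 'login', signals: ['auth:granted'], testids: ['login-submit'] }],
);
console.log(gaps); // { signals: [ 'deploy:created' ], testids: [ 'deploy-list' ] }
```

Whatever remains in the result is a candidate for the next test, which is why reading the domain model first is cheaper than crawling the app.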

## Token note

- **Keep the eyes cheap.** Prefer `iris_query` / scoped or `interactive` `iris_snapshot` /
`iris_assert` over dumping the full tree. A full verify loop is ~100 tokens; see
[token-efficiency.md](token-efficiency.md) (~73× leaner than full-tree snapshots).
- **Re-look with `iris_snapshot({ diff:true })`** after an action — it returns only what changed
(`mode:delta`/`unchanged`), ~99% fewer tokens than a full re-snapshot and no stale tree to
mis-read. Every snapshot/query result carries `cost:{ bytes, tokens }` — re-scope before reading
if it's large.
- **Cap broad reads.** `iris_query` takes `limit` (caps descriptors; reports `total`/`truncated`) and
`count_only` (just the match count). `iris_network` / `iris_console` take `limit` (most-recent-N,
reports `droppedOldest`) and carry the same `cost` hint — so a busy page or wide window never floods
your context unnoticed.
- **A saved flow tells you if it's a real test.** `iris_flow_save` returns `assertions.grade`
(`asserted` / `presence-only` / `assertion-free`); if it's not `asserted`, add a consequence
(`iris_annotate` assert-signal/assert-net or a success-state) so it can't pass while broken. On
replay, an ambiguous heal (two testids tie) is surfaced, never auto-applied — and an `apply` heal
re-replays the rebound flow and **refuses to write** if the success consequence no longer fires
(`status:consequence_broken`): it heals the locator, never the intent.
- **Predicate schema is not bloated.** The recursive predicate DSL used by `iris_assert` /
`iris_wait_for` / `iris_act_and_wait` is **factored, not inlined**: when converted to the
JSON Schema MCP sends, the predicate body is emitted **once** (~2.7k chars ≈ **~685 tokens**