test(mock-llm): strengthen E2E coverage for conversation and automation flows#940
Conversation
…on flows Conversation test additions (mock-llm-conversation.spec.ts): - Step 2: verify settings API reflects active profile's llm.model and base_url - Step 3: intercept POST /api/conversations and assert worktree:true in payload - Step 3: verify user message is visible in a user-message element - Step 3: verify conversation appears in sidebar with correct link - Step 4 (new): resume conversation from sidebar after navigating away, verify agent reply and user message are still visible Automation test additions (mock-llm-automation.spec.ts): - Step 3: verify active-status-badge-active is visible on detail page - Step 3: verify cron schedule (or human-readable equivalent) on detail page Addresses coverage gaps from issue #511 'I can' statements: - I can start the Agent Canvas (conversation in sidebar) - worktree flag in conversation creation payload - I can create conversations, list and resume them - Activate profile -> settings API reflects model - I can run automations on a schedule (UI verification) Co-authored-by: openhands <openhands@all-hands.dev>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
✅ Mock-LLM E2E Tests7/7 passed Commit:
Posted by the Mock-LLM E2E workflow · results are deterministic (scripted LLM responses) |
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
✅ Mock-LLM E2E Tests7/7 passed Commit:
Posted by the Mock-LLM E2E workflow · results are deterministic (scripted LLM responses) |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review — test(mock-llm): strengthen E2E coverage for conversation and automation flows
Summary
This PR does a good job filling the gaps identified in #511. The new checks are well-structured, the afterAll safety net is a thoughtful addition, and the explanation of the LIFO conflict between page.on and page.route in the comments is exactly the kind of context future maintainers need. A few items below are worth addressing before merge.
Issues
1. Broad URL filter may capture unintended POST requests
req.url().includes("/api/conversations") will match any path containing that substring — including future paths like /api/conversations/{id}/messages or /api/conversations/{id}/events. Since only the conversation-creation POST is relevant here, tighten the match:
const url = new URL(req.url());
if (req.method() === "POST" && url.pathname === "/api/conversations") {or, simpler:
req.url().endsWith("/api/conversations")As-is, if the app ever issues a second POST to a /api/conversations/* sub-path before the user message is sent, capturedConversationPayload would be silently overwritten and the assertion could pass with stale data.
2. Hardcoded user message string in step 4
The user message "Please run a quick terminal command and then reply." is defined as userMessage in step 3 (line 288) but then hardcoded again in step 4 (line 450). In serial tests these two strings must stay in sync, but there is no compile-time guarantee.
Extract it to a module-level constant so a single change keeps both steps aligned:
// Near PROFILE_NAME, BASH_TOKEN, etc. at the top of the file
const USER_MESSAGE = "Please run a quick terminal command and then reply.";Then reference USER_MESSAGE in both step 3 and step 4.
3. Brittle fallback strings in the cron schedule assertion
text.toLowerCase().includes("9:00") and text.toLowerCase().includes("every day") scan the entire document.body.innerText. Any clock widget, table cell, or debug output on the page containing "9:00" or "every day" would cause a false-positive pass. The first branch (text.includes(schedule) where schedule = CRON_SCHEDULE) already covers the raw-cron fallback; the second and third branches add fragility without much additional signal.
Prefer a scoped Playwright assertion instead:
// Use getByText if CRON_SCHEDULE is always rendered on the page
await expect(page.getByText(CRON_SCHEDULE)).toBeVisible({ timeout: 10_000 });If the UI always renders a human-readable form and never the raw cron, consider querying the API for schedule_human and asserting against that value — either way, tie the assertion to the actual rendered element rather than a full-page text scan.
Nits / Non-blocking
- The
page.on("request", ...)listener is never removed viapage.off(). In practice this is fine because Playwright disposes the page after each test, but for completeness a named handler variable with a correspondingpage.off()call would be cleaner. conversationIds.add(step3ConversationId!)in step 4 uses a non-null assertion after theexpect(...).toBeTruthy()guard — that pattern is correct and readable.- The
afterAllsafety-net cleanup is a good practice; no issues there.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
1. Tighten URL filter: use exact pathname match (new URL(...).pathname ===
'/api/conversations') instead of broad .includes() to avoid capturing
sub-path POSTs like /api/conversations/{id}/messages.
2. Extract hardcoded user message to module-level USER_MESSAGE constant
so step 3 and step 4 stay in sync automatically.
3. Replace overly broad cron schedule assertion (full-page text scan with
fallback strings) with a scoped page.getByText(CRON_SCHEDULE) check.
Co-authored-by: openhands <openhands@all-hands.dev>
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
✅ Mock-LLM E2E Tests7/7 passed Commit:
Posted by the Mock-LLM E2E workflow · results are deterministic (scripted LLM responses) |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review — test(mock-llm): strengthen E2E coverage for conversation and automation flows
Prior Review Resolution
All three issues raised in the previous review have been fully addressed in this head commit:
- ✅ URL filter tightened to an exact
pathname === "/api/conversations"match — no longer captures sub-paths like/api/conversations/{id}/messages. - ✅ User message string extracted to the module-level
USER_MESSAGEconstant (line 44), eliminating the silent drift risk between step 3 and step 4. - ✅ Cron schedule assertion replaced with
page.getByText(CRON_SCHEDULE).toBeVisible()— scoped element query instead of a full-page text scan.
The new checks integrate cleanly into the serial test flow and close the gaps identified in #511.
Remaining Issues
1. Misleading comment and redundant add() call at step 4 setup (mock-llm-conversation.spec.ts:407–408)
// Re-create the conversation ID tracking so afterEach can clean up
conversationIds.add(step3ConversationId!);The comment is wrong on two counts:
- The
add()call is a no-op.step3ConversationIdwas already added toconversationIdsin step 3 (line 315) and was intentionally preserved byafterEach’sif (id === step3ConversationId) continueguard. Adding it again to aSetthat already contains it has no effect. afterEachdoes not clean upstep3ConversationId— it skips it. TheafterAllsafety net is what actually callsdeleteConversation. The comment therefore misrepresents which hook performs the cleanup.
Suggested fix — remove both the line and the comment (the conversation is already tracked and afterAll handles cleanup), or replace the comment with an accurate description:
// step3ConversationId is already in conversationIds (added in step 3).
// afterEach skips it; afterAll handles the final cleanup.2. page.on("request", …) listener never removed (mock-llm-conversation.spec.ts:261) — nit
The listener registered with page.on("request", ...) is not removed after the assertions complete. The prior review flagged this as non-blocking — Playwright disposes the page after each test so there is no functional risk. A named handler with a corresponding page.off() call would be cleaner but is not required before merge.
Summary
A well-structured PR that systematically closes the coverage gaps from #511. The serial test orchestration — shared step3ConversationId, skip-in-afterEach, afterAll cleanup — is logically correct, and the inline comments explaining design decisions (particularly the LIFO note for page.on vs page.route) are exactly the kind of context future maintainers need.
The misleading comment at step 4 (issue 1 above) should be corrected before merge; everything else is in good shape.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
- Remove no-op conversationIds.add(step3ConversationId) in step 4 — the
ID is already tracked from step 3 and afterAll handles its cleanup.
- Extract page.on('request') handler to a named function and call
page.off() after the conversation URL is captured.
Co-authored-by: openhands <openhands@all-hands.dev>
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
✅ Mock-LLM E2E Tests7/7 passed Commit:
Posted by the Mock-LLM E2E workflow · results are deterministic (scripted LLM responses) |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review — test(mock-llm): strengthen E2E coverage for conversation and automation flows
Summary
All three issues and both nits raised in the previous review have been addressed cleanly:
- URL filter now uses exact
pathname === "/api/conversations"✅ USER_MESSAGEextracted to module-level constant ✅- Cron schedule assertion uses scoped
page.getByText(CRON_SCHEDULE)✅ - Named
captureConversationPayloadhandler withpage.off()✅ - Redundant
conversationIds.add(step3ConversationId)removed ✅
The step 2 settings-API verification is thorough (checks both model and base_url), the afterAll cleanup pattern is correct Playwright practice, and the LIFO-conflict explanation in the comment is exactly the right context for future maintainers. Two minor observations below.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
…ge check
- Replace expect().toBeTruthy() with test.skip() so step 4 is skipped
(not failed) when step 3 didn't complete.
- Replace expect.poll + page.evaluate with Playwright-native
locator().filter({ hasText }).toBeVisible() for the user message
assertion in step 4.
Co-authored-by: openhands <openhands@all-hands.dev>
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
✅ Mock-LLM E2E Tests7/7 passed Commit:
Posted by the Mock-LLM E2E workflow · results are deterministic (scripted LLM responses) |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review — test(mock-llm): strengthen E2E coverage for conversation and automation flows
Summary
The PR is in good shape. All seven threads from previous rounds are resolved, and the current diff is clean:
- Settings API verification (step 2): checks both
modelandbase_url— thorough. ✅ - Payload capture (
captureConversationPayload): exact-pathname filter, named handler,page.off()after navigation — correct. ✅ worktree: trueassertion: properly gates on the captured payload being non-null before asserting the field. ✅- User-message and sidebar checks: native Playwright locators (
filter({ hasText }),getByTestId). ✅ - Resume test (step 4):
test.skip(!step3ConversationId)for graceful skip, then verifies reply token, user message, and no error banners. ✅ - Cleanup lifecycle:
afterEachskip guard +afterAllsafety-net covers all exit paths. ✅ - Automation spec:
getByTestId("active-status-badge-active")andgetByText(CRON_SCHEDULE)are well-scoped. ✅
One minor inline comment below (error message wording only — non-blocking).
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
Co-authored-by: openhands <openhands@all-hands.dev>
|
| Status | Test | Duration |
|---|
Posted by the Mock-LLM E2E workflow · results are deterministic (scripted LLM responses)
✅ Mock-LLM E2E Tests7/7 passed Commit:
Posted by the Mock-LLM E2E workflow · results are deterministic (scripted LLM responses) |
📸 Snapshot Test ReportWarning Snapshot comparison step crashed (timeout, OOM, or runner error) — diff results below may be incomplete or absent. Warning One or more snapshot tests crashed during generation — some snapshots below may be incomplete. ❌ 8 snapshots differ from the main branch baselines. Add the
🔴 Changed snapshots (8)
|
| Expected (main) | Actual (PR) | Diff |
|---|---|---|
![]() |
![]() |
![]() |
backend-manage-after-removal
| Expected (main) | Actual (PR) | Diff |
|---|---|---|
![]() |
![]() |
![]() |
mcp-page — 5 snapshots
mcp-custom-server-1-editor-open
| Expected (main) | Actual (PR) | Diff |
|---|---|---|
![]() |
![]() |
![]() |
mcp-custom-server-editor
| Expected (main) | Actual (PR) | Diff |
|---|---|---|
![]() |
![]() |
![]() |
mcp-empty-installed
| Expected (main) | Actual (PR) | Diff |
|---|---|---|
![]() |
![]() |
![]() |
mcp-search-filtered
| Expected (main) | Actual (PR) | Diff |
|---|---|---|
![]() |
![]() |
![]() |
mcp-slack-install-1-marketplace
| Expected (main) | Actual (PR) | Diff |
|---|---|---|
![]() |
![]() |
![]() |
settings-page
settings-page
| Expected (main) | Actual (PR) | Diff |
|---|---|---|
![]() |
![]() |
![]() |
✅ Unchanged snapshots (65)
archived-conversation
- conversation-panel-with-archived-badges
- conversation-view-archived
- conversation-view-sandbox-error
automations
- automations-delete-modal
- automations-list-active-inactive
- automations-no-automations
- automations-search-no-results
backends-extended
- backend-add-blank-disabled
- backend-add-cloud-advanced-open
- backend-add-cloud-no-key-disabled
- backend-add-cloud-with-key-enabled
- backend-add-form-partially-filled
- backend-add-invalid-url-disabled
- backend-add-local-ready
- backend-add-name-only-disabled
- backend-add-two-column-layout
- backend-add-whitespace-host-disabled
- backend-cancel-nothing-saved
- backend-dropdown-two-backends
- backend-edit-prefilled
- backend-manage-two-listed
- backend-remove-cancelled
- backend-remove-confirmation
- backend-switch-overlay
backends
- backend-add-modal
- backend-manage-modal
- backend-selector-open
changes-tab
- changes-deleted-file
- changes-diff-viewer
- changes-empty
collapsible-thinking
- reasoning-content-collapsed
- reasoning-content-expanded
- think-action-collapsed
- think-action-expanded
mcp-page
- mcp-custom-server-2-url-filled
- mcp-custom-server-3-all-filled
- mcp-custom-server-4-installed
- mcp-slack-install-2-modal
- mcp-slack-install-3-filled
- mcp-slack-install-4-installed
onboarding
- onboarding-step-0-choose-agent
- onboarding-step-1-check-backend
- onboarding-step-2-setup-llm
- onboarding-step-3-say-hello
projects-workspace-browser
- projects-workspace-browser
settings-page
- add-backend-modal
- analytics-consent-modal
- home-screen
- settings-app-page
settings-secrets
- secrets-add-form-filled
- secrets-add-form
- secrets-after-save
- secrets-delete-confirm
- secrets-list
settings-verification
- condenser-settings
- verification-settings-off
- verification-settings-on
sidebar
- sidebar-collapsed
- sidebar-conversation-panel
- sidebar-filter-menu
skills-page
- skills-empty
- skills-loaded
- skills-no-match
- skills-search-filtered
- skills-type-filter
Generated by the Snapshot Tests workflow. This comment was created by an AI agent (OpenHands) on behalf of the repo maintainers.
























Why
Issue #511 identified gaps in mock-LLM E2E test coverage. The existing tests exercised happy-path agent output but did not verify several aspects users depend on: sidebar presence, user message rendering, payload correctness (
worktree: true), settings API consistency after profile activation, conversation resume, or automation detail page UI elements.Summary
test.stepchecks in existing steps (settings API model verification, worktree payload interception, user message visibility, sidebar link) and 1 entirely new step (resume conversation from sidebar after navigating away).test.stepchecks (active status badge and cron schedule display on the detail page).afterAllsafety-net cleanup.Issue Number
#511
How to Test
Run the mock-LLM E2E tests:
npm run build:app # if build/ is missing npm run test:e2e:mock-llmAll 7 tests (4 conversation steps + 3 automation steps) should pass. The new checks are
test.stepblocks within existing serial tests, so they run as part of the existing flow.Type
This PR was created by an AI agent (OpenHands) on behalf of the user.
@malhotra5 can click here to continue refining the PR
🐳 Docker images for this PR
• GHCR package: https://github.com/OpenHands/agent-canvas/pkgs/container/agent-canvas
ghcr.io/openhands/agent-canvasghcr.io/openhands/agent-server:1.24.0-pythonopenhands-automation==1.0.0a5a6a2c5c19a2486d858cd440bb956c085b339e4b3Pull (multi-arch manifest)
# Multi-arch manifest — Docker automatically pulls the correct architecture docker pull ghcr.io/openhands/agent-canvas:sha-a6a2c5cRun
All tags pushed for this build
About Multi-Architecture Support
sha-a6a2c5c) is a multi-arch manifest supporting both amd64 and arm64sha-a6a2c5c-amd64) are also available if needed