[audit-workflows] Agentic Workflow Audit — 2026-06-02 Evening (16:51–19:34Z): 97.8% success, but copilot-sdk session.idle timeout reaches producti [Content truncated due to length] #36517

2026-06-02T19:48:38Z

github-actions[bot]
Bot Jun 2, 2026

Summary

Evening incremental audit of the 2026-06-02 16:51–19:34Z window (50 runs). Fleet health is strong — 97.8% success among completed runs (45/46) with the safe-output partial-failure class and token-budget-429 both absent this window. The single failure is the headline finding: the copilot-sdk session.idle 600s timeout has escalated from experimental feature branches to a production scheduled workflow on main.

| Metric | Value |
| --- | --- |
| Runs | 50 (46 completed, 3 in-progress, 1 queued) |
| Failures (completed) | 1 → 97.8% success |
| Engines | copilot 40 · claude 9 · codex 1 |
| Tokens / effective | 23.0M / 140.9M |
| Action-minutes / turns | 407 / 473 |
| Missing tools / missing data / MCP failures | 0 / 0 / 0 |
| Safe outputs (write-capable runs) | 3 (47 read-only) |
| Firewall blocked | 513 / 2527 (20.3%), no failure-causing blocks |
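
For readers who want to re-derive the headline percentages, the arithmetic can be sketched as follows. The helper name `pct` is illustrative and not part of the audit tooling; the inputs are the counts from the table above.

```python
def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100.0 * part / whole, 1)


# Counts taken from the metrics table in this report.
completed_runs = 46
failed_runs = 1
firewall_blocked = 513
firewall_total = 2527

# Success is measured against completed runs only (in-progress/queued excluded).
success_rate = pct(completed_runs - failed_runs, completed_runs)
blocked_rate = pct(firewall_blocked, firewall_total)

print(f"success: {success_rate}%  firewall blocked: {blocked_rate}%")
```

Note that the denominator is the 46 completed runs, not all 50, which is why a single failure yields 97.8% rather than 98%.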

🔴 Critical: copilot-sdk `session.idle` timeout now hits production `main`

Daily Security Observability Report (run 26836111514, copilot / claude-sonnet-4.6, schedule/main) was the only failure this window. On the new copilot-sdk sdk-driver path the session was created, the custom provider resolved with auth (provider=copilot baseUrl=api-proxy:10002), the prompt was sent — then:

[sdk-driver] error: Timeout after 600000ms waiting for session.idle
attempt 1 failed: exitCode=1 ... hasOutput=false retriesRemaining=3
attempt 1: no output produced — not retrying
done: exitCode=1 totalDuration=10m 1s
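
A minimal sketch of how this signature could be picked out of an agent-stdio.log file. The regular expressions are inferred only from the four lines quoted above, so the real sdk-driver log format may differ; `summarize` and the dictionary keys are hypothetical names.

```python
import re

# Patterns inferred from the excerpt above (assumption: the format is stable).
TIMEOUT_RE = re.compile(r"Timeout after (\d+)ms waiting for session\.idle")
ATTEMPT_RE = re.compile(
    r"attempt (\d+) failed: exitCode=(\d+).*?"
    r"hasOutput=(true|false)\s+retriesRemaining=(\d+)"
)


def summarize(log_text: str) -> dict:
    """Extract the idle-timeout value and per-attempt failure fields."""
    summary = {"idle_timeout_ms": None, "attempts": []}
    for line in log_text.splitlines():
        timeout = TIMEOUT_RE.search(line)
        if timeout:
            summary["idle_timeout_ms"] = int(timeout.group(1))
        attempt = ATTEMPT_RE.search(line)
        if attempt:
            summary["attempts"].append({
                "attempt": int(attempt.group(1)),
                "exit_code": int(attempt.group(2)),
                "has_output": attempt.group(3) == "true",
                "retries_remaining": int(attempt.group(4)),
            })
    return summary


# The excerpt from this report, including its literal "..." elision.
SAMPLE = """[sdk-driver] error: Timeout after 600000ms waiting for session.idle
attempt 1 failed: exitCode=1 ... hasOutput=false retriesRemaining=3
attempt 1: no output produced — not retrying
done: exitCode=1 totalDuration=10m 1s"""

result = summarize(SAMPLE)
```

On the sample, the interesting combination is a non-null `idle_timeout_ms` together with `has_output` false and `retries_remaining` greater than zero: retries were available but unused.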

Why this matters:

This is not the earlier auth-info variant (provider resolved with auth) and is not firewall-caused (only 1 trivial block, patch-diff.githubusercontent.com).
Previous occurrences of this exact session.idle signature were confined to experimental branches (ab-advisor max-turns campaign). This is the first time it hit a production scheduled workflow on main.
The sdk-driver path is now progressively rolled out to ~14 of 48 copilot+codex runs (~13 genuine production copilot workflows this window). 1 of ~13 prod sdk runs hung (~7%), while the legacy copilot CLI path had 0 failures in 34 runs.
The harness treated hasOutput=false as a startup crash and did not retry — but here the session was created and the prompt was sent, so this is a mid-session hang where a retry would likely have succeeded.

Recommended fix (High)

Retry at least once after a session.idle timeout. The current `no output produced → not retrying` rule misclassifies a mid-session hang as a startup crash. The session existed and the prompt was sent, so a retry is warranted.
Root-cause the missing session.idle signal from the headless Copilot server for some prompts (long tool-loop, lost SDK event, or never-emitted idle).
De-risk the rollout: until the sdk-driver session lifecycle is stable, lower the rollout percentage or keep heavy/long-running production workflows (Daily Security Observability Report ran 27.5m end-to-end) on the legacy copilot path.
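
The first recommendation can be sketched as a retry decision that separates a mid-session hang from a startup crash. This is not the harness's actual code: `AttemptResult`, its fields, `should_retry`, and `max_idle_retries` are all hypothetical, and the branch for attempts that did produce output is an assumption about existing behaviour.

```python
from dataclasses import dataclass


@dataclass
class AttemptResult:
    """Hypothetical summary of one sdk-driver attempt."""
    exit_code: int
    has_output: bool
    session_created: bool
    prompt_sent: bool
    idle_timeout: bool  # True if the session.idle wait timed out


def should_retry(result: AttemptResult, retries_remaining: int,
                 idle_retries_used: int, max_idle_retries: int = 1) -> bool:
    """Decide whether a failed attempt deserves another try."""
    if result.exit_code == 0 or retries_remaining <= 0:
        return False
    # Mid-session hang: the session existed and the prompt went out, but the
    # idle signal never arrived. Retry a bounded number of times.
    if result.session_created and result.prompt_sent and result.idle_timeout:
        return idle_retries_used < max_idle_retries
    # No output and no evidence of a live session: treat as a startup crash,
    # which matches the current "no output produced, not retrying" rule.
    if not result.has_output:
        return False
    # Assumption: failures that produced output stay retryable as before.
    return True


# The failure described in this report: session created, prompt sent, idle timeout.
incident = AttemptResult(exit_code=1, has_output=False, session_created=True,
                         prompt_sent=True, idle_timeout=True)
startup_crash = AttemptResult(exit_code=1, has_output=False, session_created=False,
                              prompt_sent=False, idle_timeout=False)
```

Bounding idle retries separately from the general retry budget keeps a workflow that hangs on every attempt from consuming several 600 s timeouts in a row.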

✅ Positives this window

safe-output partial-failure-intolerance did NOT recur (0 safe-output failures) — the dominant class across recent windows was quiet.
token-budget-429 did NOT recur — no isMaxEffectiveTokensExceededError signatures.
Legacy copilot CLI path: 0/34 failures — stable.
0 missing tools, 0 missing data, 0 MCP failures.

📊 Trend Charts (last ~14 days)

Daily success/failure volume with the success-rate line. After the 05-23 dip (41.6%, a bad-window outlier) the fleet has held a healthy ~84–96% band; the 06-02 full-day point (84.2%) reflects the morning token-429 + safe-output failures, while this evening slice recovers to 97.8%.

Daily token volume with a 7-day moving average. Usage is trending down — the MA has eased from ~48M to ~39M/day, and 06-02's 32.7M full-day total sits comfortably below the 05-31 spike (68.8M). No cost-runaway signal; heavy daily-aggregation workflows remain the variance drivers.
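
For reference, a 7-day moving average of the kind plotted here can be computed as below. This sketch assumes a trailing window (each point averages that day and up to six preceding days); the report does not state whether the chart uses a trailing or centered window, and `trailing_mean` is an illustrative name.

```python
from collections import deque


def trailing_mean(values, window=7):
    """Return the trailing moving average of `values`.

    Early points average over however many values are available so far.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # this value is about to be evicted by append()
        buf.append(v)
        total += v
        out.append(total / len(buf))
    return out


# Toy input for illustration only; these are not the audit's token counts.
demo = trailing_mean([1, 2, 3, 4], window=2)
```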

Recommendations

[High] Make the copilot-sdk harness retry on session.idle timeout and investigate the missing idle signal; de-risk the sdk-driver rollout for heavy production workflows (tracked under copilot-sdk-session-auth, now escalated).
[High — carried] Harden safe-output partial-failure tolerance — keep first-N-allowed items + warn instead of red-failing the agent step (quiet this window, still the dominant recent class).
[High — carried] Token-budget-429: chunk heavy daily-aggregation workflows and fail-fast on the max-effective-tokens signature (quiet this window). Tracked in #35661, "[aw-failures] Token-budget exhaustion (25M effective-tokens cap) recurring across 6+ scheduled workflows — 2026-05-29 02:00–07:32 UTC".
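
One possible reading of the safe-output recommendation ("keep first-N-allowed items + warn") is sketched below: when an agent emits more items than a configured limit, keep the first N and surface a warning instead of failing the step. The function name, the `max_allowed` parameter, and the item shape are hypothetical and do not reflect the actual safe-output implementation.

```python
def keep_first_allowed(items: list, max_allowed: int) -> tuple:
    """Keep at most `max_allowed` items and report the rest as warnings.

    Returns (kept_items, warnings) rather than raising, so the caller can
    mark the step as succeeded-with-warnings.
    """
    if max_allowed < 0:
        raise ValueError("max_allowed must be non-negative")
    kept = items[:max_allowed]
    dropped = items[max_allowed:]
    warnings = []
    if dropped:
        warnings.append(
            f"{len(dropped)} safe-output item(s) over the limit of "
            f"{max_allowed} were dropped"
        )
    return kept, warnings


# Illustrative items only.
kept, warnings = keep_first_allowed(["issue-a", "issue-b", "issue-c"], max_allowed=2)
```

Returning the warnings rather than logging them directly leaves the choice of how loudly to report truncation to the calling workflow.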

Methodology & scope

Source: agentic-workflows MCP logs tool, start_date: -1d (window 16:51–19:34Z). Engine classification from aw_info.json engine_id (not lock-file scanning).
Failure root cause confirmed from per-run agent-stdio.log + firewall-summary.json; the lone audit-agent self-match on the failure grep was a false positive (claude engine analyzing other runs' logs).
Repo memory updated: audit-history.jsonl, known-issues.json (copilot-sdk escalated to High), recommendations.json, anomalies.json, metrics-summary.json.

References:

§26836111514 — Daily Security Observability Report (the failure)
§26840296477 — On-Demand Token Trend Audit (longest healthy run, 21.2m)

Generated by 🔍 Agentic Workflow Audit Agent · opus48 3.4M · ◷

expires on Jun 3, 2026, 7:48 PM UTC
