@enyst — I investigated the integration/behavior-test model history from origin/main.
Short answer
We lost Claude Sonnet from the default integration/behavior model matrix in PR #3257, merged to main on 2026-05-17 as commit f942e34f (Update default LLM model to gpt-5.5).
That PR was primarily about changing the SDK's default LLM.model from Claude to gpt-5.5, but it also changed the integration workflow default matrix.
So the Claude removal was not caused by a specific integration-test failure or model-availability change that I found. It appears to have been swept into the broad “replace Claude Sonnet defaults with GPT-5.5” cleanup.
Why behavior tests and integration tests use the same 4 LLMs
Yes, the behavior-test and integration-test PR labels use the same model matrix by default.
In .github/workflows/integration-runner.yml, the setup-matrix job resolves one shared DEFAULT_MODEL_IDS list. Later, the label only changes TEST_TYPE_ARGS:
integration-test → --test-type integration
behavior-test → --test-type behavior
Unless workflow_dispatch passes explicit model_ids, both labels consume the same default matrix.
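The resolution described above can be sketched in a few lines of Python. This is only an illustration of the shape of the logic, not the workflow's actual code: the function name, the placeholder model IDs, and the assumption that model_ids arrives as a comma-separated string are all hypothetical.

```python
# Hypothetical sketch of the shared-matrix behavior; not copied from
# .github/workflows/integration-runner.yml.

# Placeholder IDs standing in for the single shared default list.
DEFAULT_MODEL_IDS = ["model-a", "model-b", "model-c", "model-d"]

# The PR label only selects the test type, never the models.
LABEL_TO_TEST_TYPE = {
    "integration-test": "integration",
    "behavior-test": "behavior",
}


def resolve_run(label, model_ids=None):
    """Return (models, test_type_args) for a given PR label.

    model_ids mimics an explicit workflow_dispatch input and is assumed
    here to be a comma-separated string; when it is empty or missing,
    the shared default list is used.
    """
    if label not in LABEL_TO_TEST_TYPE:
        raise ValueError(f"unknown label: {label}")
    if model_ids:
        models = [m.strip() for m in model_ids.split(",") if m.strip()]
    else:
        models = list(DEFAULT_MODEL_IDS)
    return models, ["--test-type", LABEL_TO_TEST_TYPE[label]]
```

Under these assumptions, resolve_run("integration-test") and resolve_run("behavior-test") return the same model list and differ only in the test-type arguments, which is why removing a model from the shared default removes it from both kinds of run.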
The current origin/main default matches the four models in the release PR result comment: GPT-5.5, DeepSeek V4 Flash, MiniMax M2.7, and Gemini 3.1 Pro.
Timeline of the default matrix
git log -L 53,53:.github/workflows/integration-runner.yml origin/main shows this line's recent evolution:
2026-02-06: feat(ci): add model_ids and issue_number inputs to integration-runner (#1883 / 9fed9a38). [...] DEFAULT_MODEL_IDS while adding the model_ids and issue_number workflow inputs.
2026-02-21: Update integration tests to use claude-sonnet-4-6 (#2113 / f33e328b)
2026-03-30: fix(ci): switch Gemini defaults to 3.1 Pro (#2615 / 00b36b4a)
2026-05-07: ci: replace kimi-k2-thinking with kimi-k2.6 in integration test defaults (#3102 / b2122dd1)
2026-05-17: Update default LLM model to gpt-5.5 (#3257 / f942e34f). Within #3257, the direct intermediate commit was ecd0414d (chore: update remaining default LLM references), made after a request on the PR (source: #3257 (comment)).
2026-05-25: ci: use MiniMax M2.7 in integration default matrix (#3383 / d0776e95)
Additional finding: README is stale
tests/integration/README.md still says:
Runs across 4 LLM models (Claude Sonnet 4.6, DeepSeek V4 Flash, Kimi K2.6, Gemini 3.1 Pro)
That line was last updated in #3102 and was not updated by #3257 or #3383, so it no longer matches the workflow. This stale doc likely makes the Claude removal less obvious.
Interpretation
My read: the release-gate matrix lost Anthropic coverage as a side effect of the project-wide default-model migration to GPT-5.5, not through an explicit “remove Claude from integration/behavior coverage” decision.
If we want Anthropic coverage in release integration/behavior tests again, the most direct follow-up would be one of:
replace one current model with claude-sonnet-4-6 (or the preferred current Sonnet ID), or
expand the default matrix to 5 models and add Claude back, accepting the extra runtime/cost.
In either case, update tests/integration/README.md so the documented model coverage is generated from or kept in sync with .github/workflows/integration-runner.yml.
This issue was created by an AI agent (OpenHands) on behalf of @enyst.
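For the README follow-up, one possible approach is a small check that renders the coverage sentence from the model list and compares it with what the README says. The sketch below is hypothetical: the helper names are invented for illustration, the model names are the display names quoted in this issue rather than the workflow's model IDs, and mapping workflow IDs to display names is left out.

```python
import re

# Display names reported in this issue for the current default matrix.
CURRENT_MODELS = ["GPT-5.5", "DeepSeek V4 Flash", "MiniMax M2.7", "Gemini 3.1 Pro"]

# The README sentence quoted in this issue, which is stale.
STALE_README = (
    "Runs across 4 LLM models "
    "(Claude Sonnet 4.6, DeepSeek V4 Flash, Kimi K2.6, Gemini 3.1 Pro)"
)

# Matches the coverage sentence and captures the parenthesized model list.
LINE_PATTERN = re.compile(r"Runs across (\d+) LLM models \(([^)]*)\)")


def render_readme_line(models):
    """Build the coverage sentence from a list of display names."""
    return f"Runs across {len(models)} LLM models ({', '.join(models)})"


def documented_models(readme_text):
    """Return the model names listed in the README sentence, or None."""
    match = LINE_PATTERN.search(readme_text)
    if match is None:
        return None
    return [name.strip() for name in match.group(2).split(",")]


def readme_is_in_sync(readme_text, models):
    """True when the README lists exactly the given models, in order."""
    return documented_models(readme_text) == list(models)
```

With the sentence quoted above, readme_is_in_sync(STALE_README, CURRENT_MODELS) is False, while a README regenerated via render_readme_line(CURRENT_MODELS) passes. A check like this could run in CI so that a future matrix change cannot silently drift from the documentation.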