Investigation: Claude was removed from integration/behavior defaults in #3257 (#3408)

@enyst

@enyst — I investigated the integration/behavior-test model history from origin/main.

Short answer

We lost Claude Sonnet from the default integration/behavior model matrix in PR #3257, merged to main on 2026-05-17 as commit f942e34f (Update default LLM model to gpt-5.5).

That PR was primarily about changing the SDK's default LLM.model from Claude to gpt-5.5, but it also changed the integration workflow default matrix:

- DEFAULT_MODEL_IDS: claude-sonnet-4-6,deepseek-v4-flash,kimi-k2.6,gemini-3.1-pro
+ DEFAULT_MODEL_IDS: gpt-5.5,deepseek-v4-flash,kimi-k2.6,gemini-3.1-pro

Within #3257, the direct intermediate commit was ecd0414d (chore: update remaining default LLM references). It was made after this request on the PR:

@openhands check all other places where we use claude sonnet as the default and switch them to gpt-5.5

Source: #3257 (comment)

I found no specific integration-test failure or model-availability change behind the Claude removal. It appears to have been swept into the broad “replace Claude Sonnet defaults with GPT-5.5” cleanup.

Why behavior tests and integration tests use the same 4 LLMs

Yes, the behavior-test and integration-test PR labels use the same model matrix by default.

In .github/workflows/integration-runner.yml, the setup-matrix job resolves one shared DEFAULT_MODEL_IDS list. Later, the label only changes TEST_TYPE_ARGS:

  • integration-test → --test-type integration
  • behavior-test → --test-type behavior

Unless workflow_dispatch passes explicit model_ids, both labels consume the same default matrix.
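
The resolution logic described above can be sketched in a few lines of Python. This is an illustration only, not the workflow's actual implementation: resolve_run is a hypothetical helper, and the label mapping and default list are copied from this issue rather than read from integration-runner.yml.

```python
from typing import Optional

# Assumed copy of the current origin/main default list quoted in this issue.
DEFAULT_MODEL_IDS = "gpt-5.5,deepseek-v4-flash,minimax-m2.7,gemini-3.1-pro"

# PR label -> test-type arguments, as described above.
TEST_TYPE_ARGS = {
    "integration-test": "--test-type integration",
    "behavior-test": "--test-type behavior",
}


def resolve_run(label: str, model_ids: Optional[str] = None) -> dict:
    """Hypothetical helper: return the models and test-type args for one run."""
    # Explicit model_ids (e.g. from workflow_dispatch) override the shared default.
    raw = model_ids if model_ids else DEFAULT_MODEL_IDS
    models = [m.strip() for m in raw.split(",") if m.strip()]
    return {"models": models, "test_type_args": TEST_TYPE_ARGS[label]}


integration = resolve_run("integration-test")
behavior = resolve_run("behavior-test")

# Both labels resolve to the same matrix; only the test-type args differ.
print(integration["models"] == behavior["models"])
print(integration["test_type_args"], "|", behavior["test_type_args"])
```

Because the model list is resolved once and independently of the label, any change to DEFAULT_MODEL_IDS affects both test types at the same time.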

Current origin/main default:

DEFAULT_MODEL_IDS: gpt-5.5,deepseek-v4-flash,minimax-m2.7,gemini-3.1-pro

This matches the four models in the release PR result comment: GPT-5.5, DeepSeek V4 Flash, MiniMax M2.7, and Gemini 3.1 Pro.

Timeline of the default matrix

git log -L 53,53:.github/workflows/integration-runner.yml origin/main shows this line's recent evolution:

  1. 2026-02-06 — feat(ci): add model_ids and issue_number inputs to integration-runner #1883 / 9fed9a38

    • Introduced DEFAULT_MODEL_IDS while adding model_ids and issue_number workflow inputs.
    • Defaults became:
      claude-sonnet-4-5-20250929,deepseek-v3.2-reasoner,kimi-k2-thinking,gemini-3-pro
      
  2. 2026-02-21 — Update integration tests to use claude-sonnet-4-6 #2113 / f33e328b

    • Updated Claude from Sonnet 4.5 to Sonnet 4.6:
      claude-sonnet-4-6,deepseek-v3.2-reasoner,kimi-k2-thinking,gemini-3-pro
      
  3. 2026-03-30 — fix(ci): switch Gemini defaults to 3.1 Pro #2615 / 00b36b4a

    • Switched Gemini default to 3.1 Pro:
      claude-sonnet-4-6,deepseek-v3.2-reasoner,kimi-k2-thinking,gemini-3.1-pro
      
  4. 2026-05-07 — ci: replace kimi-k2-thinking with kimi-k2.6 in integration test defaults #3102 / b2122dd1

    • Replaced Kimi/DeepSeek variants, but Claude was still present:
      claude-sonnet-4-6,deepseek-v4-flash,kimi-k2.6,gemini-3.1-pro
      
  5. 2026-05-17 — Update default LLM model to gpt-5.5 #3257 / f942e34f

    • This is where Claude was removed:
      gpt-5.5,deepseek-v4-flash,kimi-k2.6,gemini-3.1-pro
      
  6. 2026-05-25 — ci: use MiniMax M2.7 in integration default matrix #3383 / d0776e95

    • Replaced Kimi K2.6 with MiniMax M2.7:
      gpt-5.5,deepseek-v4-flash,minimax-m2.7,gemini-3.1-pro
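
The timeline above can also be checked mechanically. The sketch below hard-codes the six matrix values from this issue (it does not call git) and uses a hypothetical helper, first_without, to find the first entry with no model ID starting with a given prefix.

```python
# (date, PR, default matrix) copied from the timeline in this issue.
TIMELINE = [
    ("2026-02-06", "#1883",
     "claude-sonnet-4-5-20250929,deepseek-v3.2-reasoner,kimi-k2-thinking,gemini-3-pro"),
    ("2026-02-21", "#2113",
     "claude-sonnet-4-6,deepseek-v3.2-reasoner,kimi-k2-thinking,gemini-3-pro"),
    ("2026-03-30", "#2615",
     "claude-sonnet-4-6,deepseek-v3.2-reasoner,kimi-k2-thinking,gemini-3.1-pro"),
    ("2026-05-07", "#3102",
     "claude-sonnet-4-6,deepseek-v4-flash,kimi-k2.6,gemini-3.1-pro"),
    ("2026-05-17", "#3257",
     "gpt-5.5,deepseek-v4-flash,kimi-k2.6,gemini-3.1-pro"),
    ("2026-05-25", "#3383",
     "gpt-5.5,deepseek-v4-flash,minimax-m2.7,gemini-3.1-pro"),
]


def first_without(prefix, timeline):
    """Hypothetical helper: first (date, PR) whose matrix has no model with this prefix."""
    for date, pr, matrix in timeline:
        models = matrix.split(",")
        if not any(model.startswith(prefix) for model in models):
            return date, pr
    return None


print(first_without("claude", TIMELINE))  # → ('2026-05-17', '#3257')
```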
      

Additional finding: README is stale

tests/integration/README.md still says:

Runs across 4 LLM models (Claude Sonnet 4.6, DeepSeek V4 Flash, Kimi K2.6, Gemini 3.1 Pro)

That line was last updated in #3102 and was not updated by #3257 or #3383, so it no longer matches the workflow. This stale doc likely makes the Claude removal less obvious.

Interpretation

My read: the release-gate matrix lost Anthropic coverage as a side effect of the project-wide default-model migration to GPT-5.5, not through an explicit “remove Claude from integration/behavior coverage” decision.

If we want Anthropic coverage in release integration/behavior tests again, the most direct follow-up would be one of:

  • replace one current model with claude-sonnet-4-6 (or the preferred current Sonnet ID), or
  • expand the default matrix to 5 models and add Claude back, accepting the extra runtime/cost.

In either case, tests/integration/README.md should also be updated so the documented model coverage is generated from, or kept in sync with, .github/workflows/integration-runner.yml.

This issue was created by an AI agent (OpenHands) on behalf of @enyst.
