Investigation: Claude was removed from integration/behavior defaults in #3257

@enyst — I investigated the integration/behavior-test model history from `origin/main`.

## Short answer

We lost Claude Sonnet from the default integration/behavior model matrix in **PR #3257**, merged to `main` on **2026-05-17** as commit [`f942e34f`](https://github.com/OpenHands/software-agent-sdk/commit/f942e34f046d1a6bb70c8b7a2e79a762b7acca7b) (`Update default LLM model to gpt-5.5`).

That PR was primarily about changing the SDK's default `LLM.model` from Claude to `gpt-5.5`, but it also changed the integration workflow default matrix:

```diff
- DEFAULT_MODEL_IDS: claude-sonnet-4-6,deepseek-v4-flash,kimi-k2.6,gemini-3.1-pro
+ DEFAULT_MODEL_IDS: gpt-5.5,deepseek-v4-flash,kimi-k2.6,gemini-3.1-pro
```

Within #3257, the intermediate commit that made this change was [`ecd0414d`](https://github.com/OpenHands/software-agent-sdk/pull/3257/commits/ecd0414d19d62b870ffece6998fe5cf0f3016426) (`chore: update remaining default LLM references`). It was made after this request on the PR:

> `@openhands check all other places where we use claude sonnet as the default and switch them to gpt-5.5`

Source: https://github.com/OpenHands/software-agent-sdk/pull/3257#issuecomment-4453482429

I found no specific integration-test failure or model-availability change behind the Claude removal. It appears to have been swept into the broad “replace Claude Sonnet defaults with GPT-5.5” cleanup.

## Why behavior tests and integration tests use the same 4 LLMs

Yes, the `behavior-test` and `integration-test` PR labels use the same model matrix by default.

In `.github/workflows/integration-runner.yml`, the `setup-matrix` job resolves one shared `DEFAULT_MODEL_IDS` list. Later, the label only changes `TEST_TYPE_ARGS`:

- `integration-test` → `--test-type integration`
- `behavior-test` → `--test-type behavior`

Unless `workflow_dispatch` passes explicit `model_ids`, both labels consume the same default matrix.

Current `origin/main` default:

```yaml
DEFAULT_MODEL_IDS: gpt-5.5,deepseek-v4-flash,minimax-m2.7,gemini-3.1-pro
```

This matches the four models in the release PR result comment: GPT-5.5, DeepSeek V4 Flash, MiniMax M2.7, and Gemini 3.1 Pro.
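
As a rough illustration of the behavior described in this section, the resolution can be sketched in Python. This is not the workflow's actual code: the function `resolve_run` and the `TEST_TYPE_ARGS` dictionary are hypothetical names, and `model_ids` stands in for the optional `workflow_dispatch` input. Only the label names, the `--test-type` arguments, and the default list are taken from the workflow.

```python
from __future__ import annotations

# Hypothetical sketch of the setup-matrix behavior; not the real workflow code.
DEFAULT_MODEL_IDS = "gpt-5.5,deepseek-v4-flash,minimax-m2.7,gemini-3.1-pro"

# The PR label only selects the test type; it never touches the model list.
TEST_TYPE_ARGS = {
    "integration-test": "--test-type integration",
    "behavior-test": "--test-type behavior",
}


def resolve_run(label: str, model_ids: str | None = None) -> tuple[list[str], str]:
    """Return (models, test-type args) for a PR label.

    `model_ids` mimics an explicit workflow_dispatch input; when it is
    missing or empty, the shared default list is used.
    """
    raw = model_ids if model_ids else DEFAULT_MODEL_IDS
    models = [m.strip() for m in raw.split(",") if m.strip()]
    return models, TEST_TYPE_ARGS[label]
```

Under these assumptions, `resolve_run("integration-test")` and `resolve_run("behavior-test")` return the same four models and differ only in the test-type argument, which is why both labels lost Claude at the same time.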

## Timeline of the default matrix

`git log -L 53,53:.github/workflows/integration-runner.yml origin/main` shows this line's recent evolution:

1. **2026-02-06 — #1883 / [`9fed9a38`](https://github.com/OpenHands/software-agent-sdk/commit/9fed9a38c9b06a585bcf905bd27e1cbee6326dfd)**
   - Introduced `DEFAULT_MODEL_IDS` while adding `model_ids` and `issue_number` workflow inputs.
   - Defaults became:
     ```text
     claude-sonnet-4-5-20250929,deepseek-v3.2-reasoner,kimi-k2-thinking,gemini-3-pro
     ```

2. **2026-02-21 — #2113 / [`f33e328b`](https://github.com/OpenHands/software-agent-sdk/commit/f33e328b6d40d63924d78d7ff87d1b22a824b8d4)**
   - Updated Claude from Sonnet 4.5 to Sonnet 4.6:
     ```text
     claude-sonnet-4-6,deepseek-v3.2-reasoner,kimi-k2-thinking,gemini-3-pro
     ```

3. **2026-03-30 — #2615 / [`00b36b4a`](https://github.com/OpenHands/software-agent-sdk/commit/00b36b4ab83de7511a01892c892f5efcc1e49c29)**
   - Switched Gemini default to 3.1 Pro:
     ```text
     claude-sonnet-4-6,deepseek-v3.2-reasoner,kimi-k2-thinking,gemini-3.1-pro
     ```

4. **2026-05-07 — #3102 / [`b2122dd1`](https://github.com/OpenHands/software-agent-sdk/commit/b2122dd1657eeaad6aa8716763b36a664070c597)**
   - Replaced Kimi/DeepSeek variants, but **Claude was still present**:
     ```text
     claude-sonnet-4-6,deepseek-v4-flash,kimi-k2.6,gemini-3.1-pro
     ```

5. **2026-05-17 — #3257 / [`f942e34f`](https://github.com/OpenHands/software-agent-sdk/commit/f942e34f046d1a6bb70c8b7a2e79a762b7acca7b)**
   - This is where Claude was removed:
     ```text
     gpt-5.5,deepseek-v4-flash,kimi-k2.6,gemini-3.1-pro
     ```

6. **2026-05-25 — #3383 / [`d0776e95`](https://github.com/OpenHands/software-agent-sdk/commit/d0776e959a8488fdbbc6ee1c0da8f148a8451be0)**
   - Replaced Kimi K2.6 with MiniMax M2.7:
     ```text
     gpt-5.5,deepseek-v4-flash,minimax-m2.7,gemini-3.1-pro
     ```
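
The timeline can be cross-checked mechanically. The sketch below copies the six default lists from the entries above and looks for the first PR whose list has no model ID starting with a given prefix. The helper name `first_pr_without` is hypothetical, and the prefix match is only an assumed heuristic for "model family".

```python
# Default matrices copied from the timeline above, oldest first.
TIMELINE = [
    ("#1883", "claude-sonnet-4-5-20250929,deepseek-v3.2-reasoner,kimi-k2-thinking,gemini-3-pro"),
    ("#2113", "claude-sonnet-4-6,deepseek-v3.2-reasoner,kimi-k2-thinking,gemini-3-pro"),
    ("#2615", "claude-sonnet-4-6,deepseek-v3.2-reasoner,kimi-k2-thinking,gemini-3.1-pro"),
    ("#3102", "claude-sonnet-4-6,deepseek-v4-flash,kimi-k2.6,gemini-3.1-pro"),
    ("#3257", "gpt-5.5,deepseek-v4-flash,kimi-k2.6,gemini-3.1-pro"),
    ("#3383", "gpt-5.5,deepseek-v4-flash,minimax-m2.7,gemini-3.1-pro"),
]


def first_pr_without(prefix):
    """Return the first PR whose matrix has no model ID starting with `prefix`."""
    for pr, ids in TIMELINE:
        if not any(model.startswith(prefix) for model in ids.split(",")):
            return pr
    return None  # the family is present in every recorded matrix


print(first_pr_without("claude"))  # #3257
```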

## Additional finding: README is stale

`tests/integration/README.md` still says:

```text
Runs across 4 LLM models (Claude Sonnet 4.6, DeepSeek V4 Flash, Kimi K2.6, Gemini 3.1 Pro)
```

That line was last updated in #3102 and was not updated by #3257 or #3383, so it no longer matches the workflow. This stale doc likely makes the Claude removal less obvious.

## Interpretation

My read: the release-gate matrix lost Anthropic coverage as a side effect of the project-wide default-model migration to GPT-5.5, not through an explicit “remove Claude from integration/behavior coverage” decision.

If we want Anthropic coverage in release integration/behavior tests again, the most direct follow-up would be one of:

- replace one current model with `claude-sonnet-4-6` (or the preferred current Sonnet ID), or
- expand the default matrix to 5 models and add Claude back, accepting the extra runtime/cost.

Either way, we should also update `tests/integration/README.md` so the documented model coverage is generated from or kept in sync with `.github/workflows/integration-runner.yml`.
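
One possible way to keep the README in sync is a small check that compares the workflow's default list against the README text. The following is a minimal sketch, not an existing script in the repository: all three function names are hypothetical, it assumes `DEFAULT_MODEL_IDS` appears as an unquoted, comma-separated value on a single line (as quoted in this issue), and it assumes the README spells model names so that they match the IDs once punctuation and case are stripped.

```python
import re


def workflow_models(workflow_text):
    """Extract model IDs from the DEFAULT_MODEL_IDS line of the workflow text."""
    match = re.search(r"DEFAULT_MODEL_IDS:\s*(\S+)", workflow_text)
    if match is None:
        raise ValueError("DEFAULT_MODEL_IDS not found")
    return [m for m in match.group(1).split(",") if m]


def normalize(name):
    """Reduce a model ID or display name to lowercase alphanumerics."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def missing_from_readme(workflow_text, readme_text):
    """Return workflow model IDs whose normalized form is absent from the README."""
    haystack = normalize(readme_text)
    return [m for m in workflow_models(workflow_text) if normalize(m) not in haystack]


workflow = "DEFAULT_MODEL_IDS: gpt-5.5,deepseek-v4-flash,minimax-m2.7,gemini-3.1-pro"
readme = "Runs across 4 LLM models (Claude Sonnet 4.6, DeepSeek V4 Flash, Kimi K2.6, Gemini 3.1 Pro)"
print(missing_from_readme(workflow, readme))  # ['gpt-5.5', 'minimax-m2.7']
```

Run against the two lines quoted in this issue, the check flags `gpt-5.5` and `minimax-m2.7` as undocumented. A CI step built on this idea would read the real files and fail when the returned list is non-empty.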

_This issue was created by an AI agent (OpenHands) on behalf of @enyst._

