Release v1.24.0 #3401

Merged: simonrosenberg merged 3 commits into main from rel-1.24.0 on May 27, 2026

@all-hands-bot
Collaborator

@all-hands-bot all-hands-bot commented May 27, 2026

Release v1.24.0

This PR prepares the release for version 1.24.0.

Release Checklist

  • Version set to 1.24.0
  • Fix any deprecation deadlines if they exist
  • Integration tests pass (tagged with integration-test)
  • Behavior tests pass (tagged with behavior-test)
  • Example tests pass (tagged with test-examples)
  • Evaluation on OpenHands Index
  • Confirm any release-note-required PRs are accurately called out in the final release notes

What happens on merge

When this PR is merged, the create-release.yml workflow will automatically:

  1. Create a GitHub release with tag v1.24.0 and auto-generated notes, plus an explicit preamble for merged release-note-required PRs
  2. Trigger pypi-release.yml to publish all packages to PyPI
  3. Trigger version-bump-prs.yml to create downstream version bump PRs

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---|---|---|---|
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.13-nodejs22-slim | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:17cb597-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-17cb597-python \
  ghcr.io/openhands/agent-server:17cb597-python

All tags pushed for this build

ghcr.io/openhands/agent-server:17cb597-golang-amd64
ghcr.io/openhands/agent-server:17cb597e0db541d110c10dbcde7aeab3ffa77c22-golang-amd64
ghcr.io/openhands/agent-server:rel-1.24.0-golang-amd64
ghcr.io/openhands/agent-server:17cb597-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:17cb597-golang-arm64
ghcr.io/openhands/agent-server:17cb597e0db541d110c10dbcde7aeab3ffa77c22-golang-arm64
ghcr.io/openhands/agent-server:rel-1.24.0-golang-arm64
ghcr.io/openhands/agent-server:17cb597-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:17cb597-java-amd64
ghcr.io/openhands/agent-server:17cb597e0db541d110c10dbcde7aeab3ffa77c22-java-amd64
ghcr.io/openhands/agent-server:rel-1.24.0-java-amd64
ghcr.io/openhands/agent-server:17cb597-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:17cb597-java-arm64
ghcr.io/openhands/agent-server:17cb597e0db541d110c10dbcde7aeab3ffa77c22-java-arm64
ghcr.io/openhands/agent-server:rel-1.24.0-java-arm64
ghcr.io/openhands/agent-server:17cb597-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:17cb597-python-amd64
ghcr.io/openhands/agent-server:17cb597e0db541d110c10dbcde7aeab3ffa77c22-python-amd64
ghcr.io/openhands/agent-server:rel-1.24.0-python-amd64
ghcr.io/openhands/agent-server:17cb597-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:17cb597-python-arm64
ghcr.io/openhands/agent-server:17cb597e0db541d110c10dbcde7aeab3ffa77c22-python-arm64
ghcr.io/openhands/agent-server:rel-1.24.0-python-arm64
ghcr.io/openhands/agent-server:17cb597-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:17cb597-golang
ghcr.io/openhands/agent-server:17cb597e0db541d110c10dbcde7aeab3ffa77c22-golang
ghcr.io/openhands/agent-server:rel-1.24.0-golang
ghcr.io/openhands/agent-server:17cb597-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:17cb597-java
ghcr.io/openhands/agent-server:17cb597e0db541d110c10dbcde7aeab3ffa77c22-java
ghcr.io/openhands/agent-server:rel-1.24.0-java
ghcr.io/openhands/agent-server:17cb597-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:17cb597-python
ghcr.io/openhands/agent-server:17cb597e0db541d110c10dbcde7aeab3ffa77c22-python
ghcr.io/openhands/agent-server:rel-1.24.0-python
ghcr.io/openhands/agent-server:17cb597-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

  • Each variant tag (e.g., 17cb597-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 17cb597-python-amd64) are also available if needed
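The tag list above follows a regular pattern, which can be sketched in a few lines. This is an illustration only: the constants are copied from the tags in this comment, while the function names and the base-image sanitizing rule (`/` becomes `_s_`, `:` becomes `_tag_`) are assumptions inferred from the published tags, not taken from the build scripts.

```python
from itertools import product
from typing import List, Optional

# Values copied from the tag list above.
IMAGE = "ghcr.io/openhands/agent-server"
SHORT_SHA = "17cb597"
FULL_SHA = "17cb597e0db541d110c10dbcde7aeab3ffa77c22"
BRANCH = "rel-1.24.0"
VARIANTS = {
    "golang": "golang:1.21-bookworm",
    "java": "eclipse-temurin:17-jdk",
    "python": "nikolaik/python-nodejs:python3.13-nodejs22-slim",
}
ARCHES = ("amd64", "arm64")


def sanitize_base_image(base_image: str) -> str:
    """Turn a base image reference into a tag-safe string (inferred mapping)."""
    return base_image.replace("/", "_s_").replace(":", "_tag_")


def tags_for(variant: str, base_image: str, arch: Optional[str] = None) -> List[str]:
    """Return the four tags for one variant; arch=None means the multi-arch manifest."""
    suffix = f"-{arch}" if arch else ""
    stems = [
        f"{SHORT_SHA}-{variant}",
        f"{FULL_SHA}-{variant}",
        f"{BRANCH}-{variant}",
        f"{SHORT_SHA}-{sanitize_base_image(base_image)}",
    ]
    return [f"{IMAGE}:{stem}{suffix}" for stem in stems]


def all_tags() -> List[str]:
    """Per-architecture tags first, then the multi-arch manifest tags."""
    per_arch = [
        tag
        for (variant, base), arch in product(VARIANTS.items(), ARCHES)
        for tag in tags_for(variant, base, arch)
    ]
    manifests = [
        tag for variant, base in VARIANTS.items() for tag in tags_for(variant, base)
    ]
    return per_arch + manifests


if __name__ == "__main__":
    print("\n".join(all_tags()))
```

Three variants, each with two architecture-specific builds plus one manifest and four tags per build, gives the 36 tags listed.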

@all-hands-bot added the integration-test, test-examples, and behavior-test labels on May 27, 2026
@github-actions
Contributor

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@github-actions
Contributor

Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.

@github-actions
Contributor

github-actions Bot commented May 27, 2026

Python API breakage checks — ✅ PASSED

Result: PASSED

Action log

@github-actions
Contributor

github-actions Bot commented May 27, 2026

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: PASSED

Action log

Collaborator Author

@all-hands-bot all-hands-bot left a comment

🟡 Acceptable release bump, but I can’t approve this release PR yet.

Issues to address before approval:

  • Deprecation deadlines is failing for 1.24.0: register_tool(callable_factory) and the openhands.sdk.settings import shim have reached their removed in: 1.24.0 deadline.
  • Required release validation is not complete/current yet: Run tests still has in-progress jobs and no coverage report comment, Run Examples Scripts is still in progress/no result comment, and Run Integration Tests is still in progress/no final results comment. Please wait for these to pass on 19e34c7a376eb21734385de2074b629061313c00, then have a human maintainer review.

Risk: 🟡 Medium — release publication should not proceed with failed deprecation cleanup and incomplete release validation.

This review was created by an AI agent (OpenHands) on behalf of the user.


Workflow run: https://github.com/OpenHands/software-agent-sdk/actions/runs/26508259686

@github-actions
Contributor

🧪 Integration Tests Results

Overall Success Rate: 100.0%
Total Cost: $1.04
Models Tested: 4
Timestamp: 2026-05-27 11:29:25 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Overall | Tests Passed | Skipped | Total | Cost | Tokens |
|---|---|---|---|---|---|---|
| litellm_proxy_minimax_MiniMax_M2.7 | 100.0% | 8/8 | 1 | 9 | $0.00 | 370,316 |
| litellm_proxy_gemini_3.1_pro_preview | 100.0% | 9/9 | 0 | 9 | $0.16 | 321,838 |
| litellm_proxy_deepseek_deepseek_v4_flash | 100.0% | 8/8 | 1 | 9 | $0.00 | 447,556 |
| litellm_proxy_openai_gpt_5.5 | 100.0% | 9/9 | 0 | 9 | $0.88 | 291,083 |

📋 Detailed Results

litellm_proxy_minimax_MiniMax_M2.7

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.00
  • Token Usage: prompt: 364,766, completion: 5,550, cache_read: 195,783, reasoning: 34
  • Run Suffix: litellm_proxy_minimax_MiniMax_M2.7_19e34c7_minimax_m2_7_run_N9_20260527_112652
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gemini_3.1_pro_preview

  • Success Rate: 100.0% (9/9)
  • Total Cost: $0.16
  • Token Usage: prompt: 317,473, completion: 4,365, cache_read: 293,199, reasoning: 2,684
  • Run Suffix: litellm_proxy_gemini_3.1_pro_preview_19e34c7_gemini_3_1_pro_run_N9_20260527_112657

litellm_proxy_deepseek_deepseek_v4_flash

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.00
  • Token Usage: prompt: 442,309, completion: 5,247, cache_read: 395,520, reasoning: 1,127
  • Run Suffix: litellm_proxy_deepseek_deepseek_v4_flash_19e34c7_deepseek_v4_flash_run_N9_20260527_112656
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_openai_gpt_5.5

  • Success Rate: 100.0% (9/9)
  • Total Cost: $0.88
  • Token Usage: prompt: 286,233, completion: 4,850, cache_read: 154,624, reasoning: 1,784
  • Run Suffix: litellm_proxy_openai_gpt_5.5_19e34c7_gpt_5_5_run_N9_20260527_112701

Collaborator Author

@all-hands-bot all-hands-bot left a comment

⚠️ QA Report: PASS WITH ISSUES

Release-version behavior works in real execution, but the PR is not fully release-ready because the deprecation-deadline gate is failing and several release checks are still pending.

Does this PR achieve its stated goal?

Partially. I verified the user-visible release outputs: package metadata/imports, built wheels, the running agent-server /server_info endpoint, and the Run Eval workflow default all moved from 1.23.1/v1.23.1 on main to 1.24.0/v1.24.0 on this PR. However, the stated release checklist includes deprecation-deadline cleanup, and CI currently fails that gate for two SDK deprecations with removed in: 1.24.0, so I would not consider the release fully prepared yet.

| Phase | Result |
|---|---|
| Environment Setup | make build completed and installed editable packages at 1.24.0 |
| CI Status | ⚠️ 1 failing check (Deprecation deadlines) and pending release checks (Run tests, Run Examples Scripts, Run Integration Tests) at review time |
| Functional Verification | ✅ Actual imports, server endpoint, workflow config parse, and distribution build all reported 1.24.0 |

Functional Verification

Test 1: Installed package metadata and imports

Step 1 — Establish baseline on main:
Ran uv run --project /tmp/qa-sdk-main-baseline python /tmp/qa_versions.py:

openhands-sdk=1.23.1
openhands-tools=1.23.1
openhands-workspace=1.23.1
openhands-agent-server=1.23.1
sdk_import=ok:Agent
tools_import=ok:file_editor

This shows the pre-release baseline exposes 1.23.1 package metadata while core SDK/tools imports work.

Step 2 — Apply the PR's changes:
Checked out rel-1.24.0 at 19e34c7a376eb21734385de2074b629061313c00 and ran the repository setup with make build.

Step 3 — Re-run with the PR in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_versions.py:

openhands-sdk=1.24.0
openhands-tools=1.24.0
openhands-workspace=1.24.0
openhands-agent-server=1.24.0
sdk_import=ok:Agent
tools_import=ok:file_editor

This confirms the installed packages a Python user imports now expose 1.24.0 and still import successfully.
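The contents of /tmp/qa_versions.py are not shown in this report. A minimal sketch of a script that prints output in the same shape, assuming it relies on `importlib.metadata` and a plain import probe, might look like the following. The helper names are hypothetical, and `Agent` being importable from `openhands.sdk` is an assumption based on the `sdk_import=ok:Agent` line above.

```python
from importlib import import_module
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the output above.
PACKAGES = (
    "openhands-sdk",
    "openhands-tools",
    "openhands-workspace",
    "openhands-agent-server",
)


def package_line(name: str) -> str:
    """Format 'name=version' from installed distribution metadata."""
    try:
        return f"{name}={version(name)}"
    except PackageNotFoundError:
        return f"{name}=not-installed"


def import_line(label: str, module: str, attr: str) -> str:
    """Import a module, look up one attribute, and report the outcome."""
    try:
        getattr(import_module(module), attr)
    except (ImportError, AttributeError) as exc:
        return f"{label}=failed:{type(exc).__name__}"
    return f"{label}=ok:{attr}"


if __name__ == "__main__":
    for name in PACKAGES:
        print(package_line(name))
    # Assumed import target; adjust to whatever the real script probes.
    print(import_line("sdk_import", "openhands.sdk", "Agent"))
```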

Test 2: Running agent-server reports release versions

Step 1 — Establish baseline on main:
Started the server with uv run python -m openhands.agent_server --host 127.0.0.1 --port 8765 and requested /server_info:

{"version":"1.23.1","sdk_version":"1.23.1","tools_version":"1.23.1","workspace_version":"1.23.1"}

This shows the real server API previously surfaced 1.23.1 across the server, SDK, tools, and workspace versions.

Step 2 — Apply the PR's changes:
Started the same server command from the PR checkout on port 8766.

Step 3 — Re-run with the PR in place:
Requested http://127.0.0.1:8766/server_info:

{"version":"1.24.0","sdk_version":"1.24.0","tools_version":"1.24.0","workspace_version":"1.24.0"}

This confirms a real API consumer sees the intended 1.24.0 release versions.
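The comparison in this test can be expressed as a small check over the /server_info payload. The sketch below works on the JSON bodies quoted above and makes no network call; the function name is hypothetical.

```python
import json
from typing import Dict, Optional

# Field names taken from the /server_info responses quoted above.
VERSION_FIELDS = ("version", "sdk_version", "tools_version", "workspace_version")


def mismatched_fields(payload: str, expected: str) -> Dict[str, Optional[str]]:
    """Return the version fields whose value differs from the expected release."""
    info = json.loads(payload)
    return {
        field: info.get(field)
        for field in VERSION_FIELDS
        if info.get(field) != expected
    }


if __name__ == "__main__":
    pr_payload = (
        '{"version":"1.24.0","sdk_version":"1.24.0",'
        '"tools_version":"1.24.0","workspace_version":"1.24.0"}'
    )
    print(mismatched_fields(pr_payload, "1.24.0"))
```

An empty result means every reported component is on the expected release; a missing field shows up with the value `None`.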

Test 3: Release distribution build metadata

Step 1 — Establish baseline:
The baseline package metadata check above established the current release line as 1.23.1.

Step 2 — Apply the PR's changes:
Built release artifacts from the PR with uv build --all-packages --out-dir /tmp/qa-dist-1.24.0.

Step 3 — Inspect built wheels:
Read wheel METADATA from the generated artifacts:

openhands_agent_server-1.24.0-py3-none-any.whl -> openhands-agent-server 1.24.0
openhands_sdk-1.24.0-py3-none-any.whl -> openhands-sdk 1.24.0
openhands_tools-1.24.0-py3-none-any.whl -> openhands-tools 1.24.0
openhands_workspace-1.24.0-py3-none-any.whl -> openhands-workspace 1.24.0

This confirms the artifacts that would be published carry the intended release version.
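How the wheel METADATA was read is not shown. One way to do it with only the standard library, offered here as a sketch with a hypothetical function name, is to open the wheel as a zip archive and parse the `*.dist-info/METADATA` file as RFC 822 style headers:

```python
import zipfile
from email.parser import Parser
from pathlib import Path
from typing import Tuple


def wheel_name_and_version(wheel_path: Path) -> Tuple[str, str]:
    """Read Name and Version from the METADATA file inside a wheel."""
    with zipfile.ZipFile(wheel_path) as wheel:
        # A wheel keeps its core metadata at <dist>-<version>.dist-info/METADATA.
        metadata_name = next(
            name for name in wheel.namelist() if name.endswith(".dist-info/METADATA")
        )
        headers = Parser().parsestr(wheel.read(metadata_name).decode("utf-8"))
    return headers["Name"], headers["Version"]


if __name__ == "__main__":
    # Output directory taken from the uv build command above.
    for path in sorted(Path("/tmp/qa-dist-1.24.0").glob("*.whl")):
        name, version = wheel_name_and_version(path)
        print(f"{path.name} -> {name} {version}")
```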

Test 4: Run Eval workflow default

Step 1 — Establish baseline on main:
Parsed .github/workflows/run-eval.yml with yaml.BaseLoader:

main run-eval sdk_ref default=v1.23.1

This shows manual eval dispatch defaulted to the prior release.

Step 2 — Apply the PR's changes:
Parsed the same workflow in the PR checkout.

Step 3 — Re-run with the PR in place:

PR run-eval sdk_ref default=v1.24.0

This confirms a workflow-dispatch user gets the new release tag by default.

CI Evidence

Latest gh pr checks 3401 --repo OpenHands/software-agent-sdk summary at review time:

bucket_counts={'skipping': 16, 'pending': 14, 'pass': 19, 'fail': 1}
[Run tests]
- IN_PROGRESS (pending): tools-tests
- SUCCESS (pass): sdk-tests, workspace-tests, cross-tests, agent-server-tests, Test directory allowlist, windows-tests
[Run Examples Scripts]
- IN_PROGRESS (pending): test-examples
[Run Integration Tests]
- pending jobs remain for gpt-5.5, minimax-m2.7, and deepseek-v4-flash; one gemini job passed
[Deprecation deadlines]
- FAILURE (fail): check
[Version bump guard]
- SUCCESS (pass): Check package versions

Failing Deprecation deadlines job excerpt:

The following deprecated features have passed their removal deadline:

- [openhands-sdk] 'register_tool(callable_factory)' (warn_call)
  deprecated in: 1.19.1
  removed in:    1.24.0
  defined at:    openhands-sdk/openhands/sdk/tool/registry.py:163

- [openhands-sdk] f'Importing {name!r} from openhands.sdk.settings' (warn_call)
  deprecated in: 1.19.0
  removed in:    1.24.0
  defined at:    openhands-sdk/openhands/sdk/settings/__init__.py:122

Update or remove the listed features before publishing a version that meets or exceeds their removal deadline.
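The rule the job enforces, as described in its own message, is that a feature fails once the version being published meets or exceeds its removal deadline. A minimal sketch of that comparison follows; the class and function names are hypothetical and this is not the repository's actual checker.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Deprecation:
    feature: str
    deprecated_in: str
    removed_in: str


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse a plain dotted release such as '1.24.0' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def overdue(deprecations: List[Deprecation], current: str) -> List[Deprecation]:
    """Return features whose removal deadline is met or exceeded by `current`."""
    now = parse_version(current)
    return [d for d in deprecations if parse_version(d.removed_in) <= now]


# The two entries reported in the job excerpt above.
REPORTED = [
    Deprecation("register_tool(callable_factory)", "1.19.1", "1.24.0"),
    Deprecation("Importing {name!r} from openhands.sdk.settings", "1.19.0", "1.24.0"),
]

if __name__ == "__main__":
    for item in overdue(REPORTED, "1.24.0"):
        print(f"{item.feature}: removed in {item.removed_in}")
```

Under this rule both entries are clean at 1.23.1 and overdue at 1.24.0, which matches the check passing on main and failing on the release branch.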

Issues Found

  • 🟠 Issue: The release-version behavior is correct, but the PR is not fully release-ready while the Deprecation deadlines workflow fails for two SDK deprecations whose removal target is 1.24.0.
  • 🟡 Minor / Status: Release-critical CI was still pending at review time (Run tests tools job, Run Examples Scripts, and multiple Run Integration Tests jobs), so final release readiness still depends on those completing successfully.

This QA review was created by an AI agent (OpenHands) on behalf of the user.

Comment thread openhands-sdk/pyproject.toml
@github-actions
Contributor

github-actions Bot commented May 27, 2026

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| openhands-sdk/openhands/sdk/__init__.py | 28 | 2 | 92% | 111–112 |
| openhands-sdk/openhands/sdk/settings/model.py | 562 | 50 | 91% | 83, 108, 113, 352, 362–365, 368, 381, 385, 391, 401, 407, 412, 602, 615, 626, 636, 640, 642, 644, 646, 648, 650, 652, 927, 929, 1042, 1226, 1295, 1334, 1361, 1397–1400, 1426, 1550, 1595, 1627, 1637, 1639, 1644, 1662, 1675, 1677, 1679, 1681, 1688 |
| openhands-sdk/openhands/sdk/tool/registry.py | 90 | 10 | 88% | 39, 59–60, 71, 84, 106–107, 110, 118, 156 |
| TOTAL | 29189 | 6411 | 78% | |

@github-actions
Contributor

github-actions Bot commented May 27, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-05-27 11:55:16 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 24.4s $0.03
01_standalone_sdk/03_activate_skill.py ✅ PASS 24.6s $0.03
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 11.6s $0.01
01_standalone_sdk/07_mcp_integration.py ✅ PASS 35.4s $0.03
01_standalone_sdk/09_pause_example.py ✅ PASS 12.1s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 32.9s $0.03
01_standalone_sdk/11_async.py ✅ PASS 26.6s $0.03
01_standalone_sdk/12_custom_secrets.py ✅ PASS 8.4s $0.00
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 36.6s $0.03
01_standalone_sdk/14_context_condenser.py ✅ PASS 2m 27s $0.17
01_standalone_sdk/17_image_input.py ✅ PASS 20.0s $0.02
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 17.3s $0.02
01_standalone_sdk/19_llm_routing.py ✅ PASS 16.0s $0.02
01_standalone_sdk/20_stuck_detector.py ✅ PASS 11.1s $0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 10.2s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 18.1s $0.02
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 1m 31s $0.02
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 8m 17s $0.62
01_standalone_sdk/25_agent_delegation.py ✅ PASS 1m 27s $0.09
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 16.3s $0.03
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 33.8s $0.02
01_standalone_sdk/29_llm_streaming.py ✅ PASS 33.5s $0.02
01_standalone_sdk/30_tom_agent.py ✅ PASS 17.8s $0.02
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 5m 23s $0.42
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 16.7s $0.02
01_standalone_sdk/33_hooks/main.py ✅ PASS 31.7s $0.04
01_standalone_sdk/34_critic_example.py ✅ PASS 8m 52s $0.61
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 9.6s $0.00
01_standalone_sdk/37_llm_profile_store/main.py ✅ PASS 4.2s $0.00
01_standalone_sdk/38_browser_session_recording.py ✅ PASS 38.0s $0.03
01_standalone_sdk/39_llm_fallback.py ✅ PASS 9.5s $0.00
01_standalone_sdk/40_acp_agent_example.py ✅ PASS 31.5s $0.32
01_standalone_sdk/41_task_tool_set.py ✅ PASS 45.8s $0.04
01_standalone_sdk/42_file_based_subagents.py ✅ PASS 28.8s $0.04
01_standalone_sdk/43_mixed_marketplace_skills/main.py ✅ PASS 3.3s $0.00
01_standalone_sdk/44_model_switching_in_convo.py ✅ PASS 7.1s $0.01
01_standalone_sdk/45_parallel_tool_execution.py ✅ PASS 7m 8s $0.53
01_standalone_sdk/46_agent_settings.py ✅ PASS 12.0s $0.01
01_standalone_sdk/47_defense_in_depth_security.py ✅ PASS 4.9s $0.00
01_standalone_sdk/48_conversation_fork.py ✅ PASS 13.7s $0.00
01_standalone_sdk/49_switch_llm_tool.py ✅ PASS 7.2s $0.03
01_standalone_sdk/50_async_cancellation.py ✅ PASS 12.1s $0.01
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 40.3s $0.03
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ✅ PASS 1m 38s $0.06
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ✅ PASS 1m 32s $0.08
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ✅ PASS 1m 46s $0.03
02_remote_agent_server/06_custom_tool/main.py ✅ PASS 4m 29s $0.03
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 42.6s $0.04
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ✅ PASS 3m 26s $0.02
02_remote_agent_server/09_acp_agent_with_remote_runtime.py ✅ PASS 47.1s $0.11
02_remote_agent_server/10_cloud_workspace_share_credentials.py ✅ PASS 30.9s $0.04
02_remote_agent_server/11_conversation_fork.py ✅ PASS 1m 44s $0.00
02_remote_agent_server/12_settings_and_secrets_api.py ✅ PASS 2m 27s $0.02
02_remote_agent_server/13_workspace_get_llm.py ✅ PASS 31.0s $0.02
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 20.1s $0.02
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 43.6s $0.05
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 16.6s $0.02
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 2m 51s $0.02

✅ All tests passed!

Total: 58 | Passed: 58 | Failed: 0 | Total Cost: $4.01

View full workflow run

@github-actions
Contributor

🧪 Integration Tests Results

Overall Success Rate: 100.0%
Total Cost: $17.19
Models Tested: 4
Timestamp: 2026-05-27 11:49:12 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Overall | Tests Passed | Skipped | Total | Cost | Tokens |
|---|---|---|---|---|---|---|
| litellm_proxy_minimax_MiniMax_M2.7 | 100.0% | 5/5 | 0 | 5 | $0.17 | 3,734,601 |
| litellm_proxy_gemini_3.1_pro_preview | 100.0% | 5/5 | 0 | 5 | $13.25 | 7,597,614 |
| litellm_proxy_deepseek_deepseek_v4_flash | 100.0% | 5/5 | 0 | 5 | $0.17 | 3,423,284 |
| litellm_proxy_openai_gpt_5.5 | 100.0% | 5/5 | 0 | 5 | $3.59 | 2,871,859 |

📋 Detailed Results

litellm_proxy_minimax_MiniMax_M2.7

  • Success Rate: 100.0% (5/5)
  • Total Cost: $0.17
  • Token Usage: prompt: 3,696,713, completion: 37,888, cache_read: 3,323,267, reasoning: 58
  • Run Suffix: litellm_proxy_minimax_MiniMax_M2.7_19e34c7_minimax_m2_7_run_N5_20260527_112703

litellm_proxy_gemini_3.1_pro_preview

  • Success Rate: 100.0% (5/5)
  • Total Cost: $13.25
  • Token Usage: prompt: 7,566,162, completion: 31,452, cache_read: 1,171,255, reasoning: 13,775
  • Run Suffix: litellm_proxy_gemini_3.1_pro_preview_19e34c7_gemini_3_1_pro_run_N5_20260527_112702

litellm_proxy_deepseek_deepseek_v4_flash

  • Success Rate: 100.0% (5/5)
  • Total Cost: $0.17
  • Token Usage: prompt: 3,386,146, completion: 37,138, cache_read: 3,131,008, reasoning: 10,076
  • Run Suffix: litellm_proxy_deepseek_deepseek_v4_flash_19e34c7_deepseek_v4_flash_run_N5_20260527_112718

litellm_proxy_openai_gpt_5.5

  • Success Rate: 100.0% (5/5)
  • Total Cost: $3.59
  • Token Usage: prompt: 2,835,368, completion: 36,491, cache_read: 2,492,416, reasoning: 11,257
  • Run Suffix: litellm_proxy_openai_gpt_5.5_19e34c7_gpt_5_5_run_N5_20260527_112655

@simonrosenberg force-pushed the rel-1.24.0 branch 2 times, most recently from 82599a2 to 1da3dc3, on May 27, 2026 12:40
@enyst
Member

enyst commented May 27, 2026

  • Total Cost: $13.25
  • Token Usage: prompt: 7,566,162, completion: 31,452, cache_read: 1,171,255, reasoning: 13,775
  • Run Suffix: litellm_proxy_gemini_3.1_pro_preview_19e34c7_gemini_3_1_pro_run_N5_20260527_112702

Wow this is a funny one, 13.25. Everything else is much less 🤷

I'm guessing it's nothing, just random fluke, Gemini wants to think round-n-round

@simonrosenberg
Collaborator

Deprecation-deadline cleanup in this release

Cutting 1.24.0 tripped the Deprecation deadlines check for two features marked removed_in: 1.24.0. Both are now removed in this PR:

  1. register_tool(callable_factory) — removed; the 16 callable-factory call sites across the test suite were migrated to ToolDefinition subclasses/instances.
  2. LLMAgentSettings public import aliases — removed from openhands.sdk and openhands.sdk.settings. The class is retained at openhands.sdk.settings.model (it's a live agent_kind="llm" member of the settings union, so legacy payloads still deserialize and the API-breakage field-value check is unchanged).

Note on the commit sequence: an interim commit deferred LLMAgentSettings to 1.25.0, because removing it failed the Python API (api-breakage) gate — the gate couldn't see its deprecation (it lives in the _DEPRECATED_SDK_EXPORTS registry + an f-string __getattr__, not a decorator/literal call). That gap was fixed in #3402 (now merged to main) and is cherry-picked here (commit 6807dfb) so this branch's api-breakage check passes deterministically; it dedupes on merge. A later commit then performs the actual removal. Net diff = both deprecations removed.

api-breakage now reports the LLMAgentSettings removal as a sanctioned scheduled removal (::notice, exit 0).

@enyst
Member

enyst commented May 27, 2026

@OpenHands /codereview this PR, note that it's a release PR so pay attention to specifics. Post directly on the PR.

@openhands-ai

openhands-ai Bot commented May 27, 2026

I'm on it! enyst can track my progress at all-hands.dev

Member

@enyst enyst left a comment

🟡 Taste Rating: Acceptable — the code changes are pragmatic for a release cleanup, but I’m not approving yet because the release-specific validation is stale on the current head.

[TESTING GAPS]

  • [Release validation] Stale required release workflows: current PR head is bd02939bef29ccc894b246bcadd2484805fd6edd. I verified Run tests passed on that head (26514653367, coverage comment updated to link bd02939). However:
    • Run Examples Scripts last passed on 19e34c7a376eb21734385de2074b629061313c00 (26508259690), before the deprecation-deadline cleanup commits.
    • Run Integration Tests last passed on 19e34c7a376eb21734385de2074b629061313c00 for both the integration-test and behavior-test label runs (26508259747, 26508259849).

Per the release PR review guideline, I can’t approve a release PR until the latest PR-specific Run tests, Run Examples Scripts, and Run Integration Tests results all match the current PR state. Please rerun examples + integration/behavior against bd02939 (or have a human maintainer explicitly accept the stale validation before merging).
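The freshness rule stated above amounts to comparing each workflow's last passing commit with the current PR head. A minimal sketch using the SHAs from this review follows; the function name is hypothetical and the data is entered by hand rather than fetched from GitHub.

```python
from typing import Dict, List


def stale_workflows(head_sha: str, last_passing: Dict[str, str]) -> List[str]:
    """Return workflows whose last passing run was not on the current head."""
    return sorted(name for name, sha in last_passing.items() if sha != head_sha)


# SHAs quoted in the review above.
HEAD = "bd02939bef29ccc894b246bcadd2484805fd6edd"
LAST_PASSING = {
    "Run tests": "bd02939bef29ccc894b246bcadd2484805fd6edd",
    "Run Examples Scripts": "19e34c7a376eb21734385de2074b629061313c00",
    "Run Integration Tests": "19e34c7a376eb21734385de2074b629061313c00",
}

if __name__ == "__main__":
    for name in stale_workflows(HEAD, LAST_PASSING):
        print(f"stale: {name}")
```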

Code-wise, I didn’t find a blocker in the actual cleanup: register_tool(callable_factory) is removed with tests migrated to ToolDefinition subclasses/instances, and the LLMAgentSettings public aliases are removed while the model class remains in settings.model for legacy payload deserialization. The API-breakage checker update for _DEPRECATED_SDK_EXPORTS is narrow and covered by targeted tests.

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM

This is a release PR that publishes package versions and removes two deprecated public surfaces. The implementation looks intentional and the API/REST/unit gates are green on current head, but stale examples/integration/behavior validation means the release checklist is not fully proven for the final commit.

VERDICT:
⚠️ Comment / not approving yet: core logic looks sound, but release validation needs to be refreshed or explicitly accepted by a human maintainer before approval.

KEY INSIGHT:
The deprecation removals are aligned with the scheduled 1.24.0 cleanup; the remaining risk is release-process validation freshness, not code structure.

This review was created by an AI agent (OpenHands) on behalf of @enyst.


Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

  1. Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.
  2. Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
  3. When your PR is merged, the guideline file goes through normal code review by repository maintainers.

Resolve with AI? Install the iterate skill in your agent and run /iterate to automatically drive this PR through CI, review, and QA until it's merge-ready.


@openhands-ai

openhands-ai Bot commented May 27, 2026

OpenHands encountered an error: Request timeout after 30 seconds to https://balwusrsvebknnow.prod-runtime.all-hands.dev/api/conversations/e31d3d2e-aef6-4df1-890d-2d43d9eb60cd/ask_agent

See the conversation for more information.

@simonrosenberg
Collaborator

Rebased onto latest main (now includes #3347, #3323, #3247, #3398, #3329, #3346, and #3402). The interim cherry-pick of #3402 was auto-dropped during the rebase since main now provides the api-breakage checker fix directly — so the Python API gate still sanctions the LLMAgentSettings removal. Re-validated locally on the rebased base: deprecation-deadline check clean, api-breakage exit 0, affected suites pass.

github-actions Bot and others added 3 commits May 27, 2026 18:21
Co-authored-by: openhands <openhands@all-hands.dev>
…moval to 1.25.0

Cutting v1.24.0 trips the deprecation-deadline check for two features whose
removal was scheduled for 1.24.0. Handle each per its actual removability:

* register_tool(callable_factory) (deprecated 1.19.1): REMOVED. register_tool
  now accepts only a ToolDefinition instance or subclass; the dead
  _resolver_from_callable / _usability_from_callable helpers and the callable
  branch are dropped. 16 call sites across 6 test files still used the callable
  form and are migrated to register the ToolDefinition subclass directly, a
  prebuilt instance, or -- where conv_state is needed at resolve time -- a small
  ToolDefinition subclass. Clean under the SDK api-breakage gate: register_tool
  stays exported, only its accepted-arg union narrows.

* LLMAgentSettings import aliases (deprecated 1.19.0): KEPT; deadline moved to
  1.25.0. Removing them now fails the api-breakage gate -- the published 1.23.1
  baseline deprecated them via a module __getattr__ calling
  warn_deprecated(f"Importing {name!r} ..."), an f-string the breakage checker
  cannot statically detect, so the removal reads as unsanctioned and there is no
  override. Re-expressed with a literal warn_deprecated("LLMAgentSettings", ...)
  feature name so the 1.25.0 baseline carries a detectable record and the removal
  passes the gate next minor.

Tests: callable-factory registration now asserts TypeError; the qualname
callable test is dropped; the 16 callers above are migrated. LLMAgentSettings
alias test is unchanged (still warns).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
With the api-breakage checker now recognizing the _DEPRECATED_SDK_EXPORTS
registry (cherry-picked from #3402), removing LLMAgentSettings -- deprecated
in 1.19.0 with removed_in 1.24.0 -- is sanctioned and lands in this release
instead of being deferred.

- Drop the public import aliases from `openhands.sdk` and
  `openhands.sdk.settings` (the __all__ entries, TYPE_CHECKING imports, and the
  __getattr__ / _DEPRECATED_SDK_EXPORTS shims). `from openhands.sdk import
  LLMAgentSettings` and the settings-level import now raise.
- Retain the LLMAgentSettings *class* at `openhands.sdk.settings.model`: it is
  a live member of the settings discriminated union (agent_kind="llm") so
  legacy payloads still deserialize and the API-breakage field-value check is
  unchanged.
- Rewrite the alias test to assert removal; update class/union docstrings.

Verified: deprecation-deadline check clean; api-breakage reports a sanctioned
scheduled removal (::notice, exit 0); pyright clean; 79 affected tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@enyst
Member

enyst commented May 27, 2026

@OpenHands Remove and re-add the 3 release test labels, so that they run on the current state of the release PR. Wait until all three are done, then tell us directly on the PR as a comment WDYT about the results.

@openhands-ai

openhands-ai Bot commented May 27, 2026

I'm on it! enyst can track my progress at all-hands.dev

@enyst removed the integration-test, behavior-test, and test-examples labels on May 27, 2026
@enyst added the integration-test, behavior-test, and test-examples labels on May 27, 2026, with OpenHands AI
@github-actions
Contributor

Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.

@github-actions
Contributor

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@github-actions
Contributor

github-actions Bot commented May 27, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-05-27 16:57:34 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 48.7s $0.07
01_standalone_sdk/03_activate_skill.py ✅ PASS 20.8s $0.03
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 11.1s $0.01
01_standalone_sdk/07_mcp_integration.py ✅ PASS 26.2s $0.02
01_standalone_sdk/09_pause_example.py ✅ PASS 10.4s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 29.2s $0.02
01_standalone_sdk/11_async.py ✅ PASS 28.4s $0.04
01_standalone_sdk/12_custom_secrets.py ✅ PASS 8.7s $0.00
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 57.7s $0.06
01_standalone_sdk/14_context_condenser.py ✅ PASS 2m 14s $0.16
01_standalone_sdk/17_image_input.py ✅ PASS 25.4s $0.02
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 23.9s $0.02
01_standalone_sdk/19_llm_routing.py ✅ PASS 14.2s $0.02
01_standalone_sdk/20_stuck_detector.py ✅ PASS 15.6s $0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 8.7s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 19.0s $0.01
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 1m 8s $0.01
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 5m 20s $0.37
01_standalone_sdk/25_agent_delegation.py ✅ PASS 1m 7s $0.07
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 16.9s $0.03
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 38.7s $0.04
01_standalone_sdk/29_llm_streaming.py ✅ PASS 29.0s $0.02
01_standalone_sdk/30_tom_agent.py ✅ PASS 7.9s $0.01
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 4m 49s $0.36
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 22.7s $0.03
01_standalone_sdk/33_hooks/main.py ✅ PASS 46.7s $0.05
01_standalone_sdk/34_critic_example.py ✅ PASS 8m 0s $0.68
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 8.4s $0.00
01_standalone_sdk/37_llm_profile_store/main.py ✅ PASS 3.5s $0.00
01_standalone_sdk/38_browser_session_recording.py ✅ PASS 39.1s $0.03
01_standalone_sdk/39_llm_fallback.py ✅ PASS 9.0s $0.01
01_standalone_sdk/40_acp_agent_example.py ✅ PASS 29.1s $0.32
01_standalone_sdk/41_task_tool_set.py ✅ PASS 23.0s $0.03
01_standalone_sdk/42_file_based_subagents.py ✅ PASS 41.6s $0.05
01_standalone_sdk/43_mixed_marketplace_skills/main.py ✅ PASS 2.9s $0.00
01_standalone_sdk/44_model_switching_in_convo.py ✅ PASS 7.1s $0.01
01_standalone_sdk/45_parallel_tool_execution.py ✅ PASS 6m 29s $0.54
01_standalone_sdk/46_agent_settings.py ✅ PASS 11.4s $0.01
01_standalone_sdk/47_defense_in_depth_security.py ✅ PASS 2.8s $0.00
01_standalone_sdk/48_conversation_fork.py ✅ PASS 11.7s $0.00
01_standalone_sdk/49_switch_llm_tool.py ✅ PASS 11.5s $0.03
01_standalone_sdk/50_async_cancellation.py ✅ PASS 12.0s $0.00
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 30.7s $0.02
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ✅ PASS 1m 51s $0.04
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ✅ PASS 1m 5s $0.06
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ✅ PASS 1m 49s $0.05
02_remote_agent_server/06_custom_tool/main.py ✅ PASS 6m 3s $0.04
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 38.2s $0.03
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ✅ PASS 4m 19s $0.02
02_remote_agent_server/09_acp_agent_with_remote_runtime.py ✅ PASS 1m 1s $0.14
02_remote_agent_server/10_cloud_workspace_share_credentials.py ✅ PASS 37.1s $0.06
02_remote_agent_server/11_conversation_fork.py ✅ PASS 1m 43s $0.00
02_remote_agent_server/12_settings_and_secrets_api.py ✅ PASS 2m 24s $0.02
02_remote_agent_server/13_workspace_get_llm.py ✅ PASS 1m 16s $0.04
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 19.9s $0.03
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 1m 1s $0.09
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 13.8s $0.02
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 22.7s $0.02

✅ All tests passed!

Total: 58 | Passed: 58 | Failed: 0 | Total Cost: $3.91

View full workflow run

@github-actions
Contributor

🧪 Integration Tests Results

Overall Success Rate: 100.0%
Total Cost: $0.87
Models Tested: 4
Timestamp: 2026-05-27 16:40:55 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Overall | Tests Passed | Skipped | Total | Cost | Tokens |
|---|---|---|---|---|---|---|
| litellm_proxy_minimax_MiniMax_M2.7 | 100.0% | 8/8 | 1 | 9 | $0.00 | 332,490 |
| litellm_proxy_deepseek_deepseek_v4_flash | 100.0% | 8/8 | 1 | 9 | $0.00 | 425,687 |
| litellm_proxy_gemini_3.1_pro_preview | 100.0% | 9/9 | 0 | 9 | $0.16 | 326,829 |
| litellm_proxy_openai_gpt_5.5 | 100.0% | 9/9 | 0 | 9 | $0.72 | 281,330 |

📋 Detailed Results

litellm_proxy_minimax_MiniMax_M2.7

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.00
  • Token Usage: prompt: 327,726, completion: 4,764, cache_read: 257,306, reasoning: 345
  • Run Suffix: litellm_proxy_minimax_MiniMax_M2.7_17cb597_minimax_m2_7_run_N9_20260527_163836
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_deepseek_deepseek_v4_flash

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.00
  • Token Usage: prompt: 420,110, completion: 5,577, cache_read: 375,424, reasoning: 1,530
  • Run Suffix: litellm_proxy_deepseek_deepseek_v4_flash_17cb597_deepseek_v4_flash_run_N9_20260527_163831
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gemini_3.1_pro_preview

  • Success Rate: 100.0% (9/9)
  • Total Cost: $0.16
  • Token Usage: prompt: 322,242, completion: 4,587, cache_read: 301,124, reasoning: 2,615
  • Run Suffix: litellm_proxy_gemini_3.1_pro_preview_17cb597_gemini_3_1_pro_run_N9_20260527_163841

litellm_proxy_openai_gpt_5.5

  • Success Rate: 100.0% (9/9)
  • Total Cost: $0.72
  • Token Usage: prompt: 276,603, completion: 4,727, cache_read: 179,712, reasoning: 1,507
  • Run Suffix: litellm_proxy_openai_gpt_5.5_17cb597_gpt_5_5_run_N9_20260527_163845
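
A note on reading the figures above: the summary "Tokens" column appears to be prompt plus completion tokens, with `cache_read` and `reasoning` reported as breakdowns rather than added on top. The report does not state this; it is inferred from the numbers, and the short check below uses only values copied from this comment.

```python
# Figures copied from the detailed results above:
# model -> (prompt tokens, completion tokens, value in the summary "Tokens" column)
integration_runs = {
    "litellm_proxy_minimax_MiniMax_M2.7": (327_726, 4_764, 332_490),
    "litellm_proxy_deepseek_deepseek_v4_flash": (420_110, 5_577, 425_687),
    "litellm_proxy_gemini_3.1_pro_preview": (322_242, 4_587, 326_829),
    "litellm_proxy_openai_gpt_5.5": (276_603, 4_727, 281_330),
}

# Assumption being checked: Tokens == prompt + completion.
mismatches = [
    model
    for model, (prompt, completion, reported) in integration_runs.items()
    if prompt + completion != reported
]

print(mismatches or "Tokens column matches prompt + completion for every model")
```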

@enyst
Member

enyst commented May 27, 2026

@OpenHands see the comment here #3401 (comment) and I think the same 4 LLMs are for behavior tests and integration tests. Make an investigation in recent history to tell me: we used to run them for a Claude, probably a Sonnet IIRC; where/when did we “lose” that Claude? This is a side question, not part of the release PR, so investigate deeply from main branch and make a new issue on the repo where you tag me and explain to me what happened with those LLMs. Do NOT touch the release PR.

@openhands-ai

openhands-ai Bot commented May 27, 2026

I'm on it! enyst can track my progress at all-hands.dev

@openhands-ai

openhands-ai Bot commented May 27, 2026

OpenHands encountered an error: Request timeout after 30 seconds to https://yqvmoyebykrkpsdo.prod-runtime.all-hands.dev/api/conversations/6375e3a7-fc47-411f-8e0f-4b8b7952fe95/ask_agent

See the conversation for more information.

@github-actions
Contributor

🧪 Integration Tests Results

Overall Success Rate: 95.0%
Total Cost: $13.33
Models Tested: 4
Timestamp: 2026-05-27 16:50:03 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Overall | Tests Passed | Skipped | Total | Cost | Tokens |
|---|---|---|---|---|---|---|
| litellm_proxy_minimax_MiniMax_M2.7 | 100.0% | 5/5 | 0 | 5 | $0.15 | 3,507,102 |
| litellm_proxy_deepseek_deepseek_v4_flash | 100.0% | 5/5 | 0 | 5 | $0.17 | 2,937,345 |
| litellm_proxy_gemini_3.1_pro_preview | 80.0% | 4/5 | 0 | 5 | $9.34 | 5,759,103 |
| litellm_proxy_openai_gpt_5.5 | 100.0% | 5/5 | 0 | 5 | $3.68 | 2,659,870 |

📋 Detailed Results

litellm_proxy_minimax_MiniMax_M2.7

  • Success Rate: 100.0% (5/5)
  • Total Cost: $0.15
  • Token Usage: prompt: 3,475,595, completion: 31,507, cache_read: 3,212,533
  • Run Suffix: litellm_proxy_minimax_MiniMax_M2.7_17cb597_minimax_m2_7_run_N5_20260527_163826

litellm_proxy_deepseek_deepseek_v4_flash

  • Success Rate: 100.0% (5/5)
  • Total Cost: $0.17
  • Token Usage: prompt: 2,901,332, completion: 36,013, cache_read: 2,642,048, reasoning: 10,838
  • Run Suffix: litellm_proxy_deepseek_deepseek_v4_flash_17cb597_deepseek_v4_flash_run_N5_20260527_163834

litellm_proxy_gemini_3.1_pro_preview

  • Success Rate: 80.0% (4/5)
  • Total Cost: $9.34
  • Token Usage: prompt: 5,721,376, completion: 37,727, cache_read: 1,337,698, reasoning: 17,695
  • Run Suffix: litellm_proxy_gemini_3.1_pro_preview_17cb597_gemini_3_1_pro_run_N5_20260527_163836

Failed Tests:

  • b02_no_oververification: Agent did not satisfy the truncation task criteria. Judge reasoning: The agent successfully completed the core task: (1) updated MAX_CMD_OUTPUT_SIZE from 30000 to 20_000, (2) correctly identified that test files don't need modification since they import the constant dynamically, (3) ran test_observation_truncation.py which passed, and (4) verified the change with git diff.

However, the agent's approach violated the evaluation criterion "without unnecessary verification." Specifically:

Over-verification Issues:

  • After successfully running test_observation_truncation.py (the targeted, sufficient test), the agent attempted to run the broader uv run pytest tests/tools/terminal/ suite multiple times
  • These broader attempts repeatedly caused tmux session crashes
  • The agent made at least 3-4 separate attempts to run broader test suites despite encountering failures each time
  • This consumed many iterations (iterations 7-15+) dealing with tmux recovery, background processes, and log file inspection

What Should Have Happened:

  1. Make the change ✅ (done efficiently)
  2. Run targeted test: test_observation_truncation.py ✅ (done)
  3. Report success and stop ✅ (eventually done, but after excessive iterations)

Positive Aspects:

  • Correctly understood that tests import MAX_CMD_OUTPUT_SIZE dynamically, so no test modifications needed
  • Properly used uv as instructed
  • Eventually recognized the issue and moved forward
  • Final summary message was accurate

Problematic Aspects:

  • Repeated unnecessary attempts to run broader test suites
  • Spent significant effort on tmux crash handling rather than moving forward
  • Did not cleanly stop after the targeted test passed
  • The evaluation criteria explicitly noted over-verification should be avoided; the agent's multiple attempts at broader suites constitute exactly that

The task was accomplishable in ~10-15 iterations; the agent took ~50+ iterations largely due to over-verification and repeated failed attempts at broader testing. (confidence=0.65) (Cost: $0.77)

litellm_proxy_openai_gpt_5.5

  • Success Rate: 100.0% (5/5)
  • Total Cost: $3.68
  • Token Usage: prompt: 2,623,692, completion: 36,178, cache_read: 2,232,320, reasoning: 9,854
  • Run Suffix: litellm_proxy_openai_gpt_5.5_17cb597_gpt_5_5_run_N5_20260527_163841
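
As a sanity check on the headline numbers in this report, the sketch below recomputes the overall success rate from the per-model results. It assumes the overall rate is the pooled pass ratio across all models, which the report does not state; the mean of the four per-model rates also comes to 95.0% here, so these figures cannot distinguish the two. The per-model costs as displayed sum to one cent more than the reported $13.33, which would be consistent with the total being computed from unrounded costs, though that is also an inference.

```python
from decimal import Decimal

# Figures copied from the summary table above: model -> (passed, total, cost in USD).
behavior_runs = {
    "litellm_proxy_minimax_MiniMax_M2.7": (5, 5, Decimal("0.15")),
    "litellm_proxy_deepseek_deepseek_v4_flash": (5, 5, Decimal("0.17")),
    "litellm_proxy_gemini_3.1_pro_preview": (4, 5, Decimal("9.34")),
    "litellm_proxy_openai_gpt_5.5": (5, 5, Decimal("3.68")),
}

passed = sum(p for p, _, _ in behavior_runs.values())
total = sum(t for _, t, _ in behavior_runs.values())

# Assumed definition: pooled pass ratio over every test run by every model.
overall_rate = 100 * passed / total

# Sum of the per-model costs exactly as displayed (already rounded to cents).
summed_cost = sum((c for _, _, c in behavior_runs.values()), Decimal("0"))

print(f"{passed}/{total} passed = {overall_rate:.1f}%")
print(f"sum of displayed per-model costs: ${summed_cost}")
```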

Member

enyst commented May 27, 2026

I removed and re-added the three release-test labels at ~16:36 UTC and waited for the fresh runs to finish.

Results from the new runs:

WDYT: integration and examples look release-ready to me. I would not call the behavior-test checklist item fully green yet, because the latest behavior report still has a real reported failure. That said, the failure looks like a model/agent behavioral flake around over-verification and tmux recovery loops, not an obvious SDK release-code regression: Gemini completed the requested code change and targeted test, then failed the behavior criterion by continuing with unnecessary broader verification. If maintainers are comfortable accepting that as known/flaky behavior, I think the release can proceed; otherwise I’d rerun behavior-test once more or ask for explicit maintainer sign-off before checking that box.

This PR comment was created by an AI agent (OpenHands) on behalf of the requester.

@openhands-ai

openhands-ai Bot commented May 27, 2026

OpenHands encountered an error: Request timeout after 30 seconds to https://owkbyecunjtnbgml.prod-runtime.all-hands.dev/api/conversations/abde2eab-bfaa-454e-96d6-2bdf28600047/ask_agent

See the conversation for more information.

@simonrosenberg
Collaborator

> I removed and re-added the three release-test labels at ~16:36 UTC and waited for the fresh runs to finish.
>
> Results from the new runs:
>
> WDYT: integration and examples look release-ready to me. I would not call the behavior-test checklist item fully green yet, because the latest behavior report still has a real reported failure. That said, the failure looks like a model/agent behavioral flake around over-verification and tmux recovery loops, not an obvious SDK release-code regression: Gemini completed the requested code change and targeted test, then failed the behavior criterion by continuing with unnecessary broader verification. If maintainers are comfortable accepting that as known/flaky behavior, I think the release can proceed; otherwise I’d rerun behavior-test once more or ask for explicit maintainer sign-off before checking that box.
>
> This PR comment was created by an AI agent (OpenHands) on behalf of the requester.

I think it's safe to cut the release?

Comment thread openhands-sdk/openhands/sdk/settings/model.py
@simonrosenberg simonrosenberg merged commit fdc2bdf into main May 27, 2026
109 of 110 checks passed
@simonrosenberg simonrosenberg deleted the rel-1.24.0 branch May 27, 2026 17:38

Labels

behavior-test; integration-test (Runs the integration tests and comments the results); test-examples (Run all applicable "examples/" files. Expensive operation.)


3 participants