[Bug] SDK effective_max_output_tokens exceeds actual model limits for non-Bedrock providers (moonshot, deepseek via custom base_url) #3317

@cjb1234567

Problem

When using models like moonshot/kimi-k2.5 or deepseek/deepseek-v3.2 through a custom base_url (e.g., Baidu Qianfan gateway), the SDK's auto-inferred effective_max_output_tokens exceeds the model's actual API limit, causing BadRequestError on every request.

MoonshotException - parameter check failed, max_tokens range is [1, 98304]
DeepseekException - parameter check failed, max_completion_tokens range is [1, 65536]
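When triaging these failures, it can help to pull the accepted range out of the error text. The following is a minimal sketch that assumes the gateway keeps the message format shown above; `parse_token_range` is a hypothetical helper, not part of the SDK or litellm.

```python
import re
from typing import Optional, Tuple

# Matches the "<param> range is [lo, hi]" fragment seen in the errors above.
# Assumption: the provider keeps this exact wording.
_RANGE_RE = re.compile(
    r"(max_tokens|max_completion_tokens) range is \[(\d+),\s*(\d+)\]"
)


def parse_token_range(message: str) -> Optional[Tuple[str, int, int]]:
    """Return (parameter_name, lower, upper), or None if no range is found."""
    match = _RANGE_RE.search(message)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), int(match.group(3))


moonshot_error = (
    "MoonshotException - parameter check failed, "
    "max_tokens range is [1, 98304]"
)
deepseek_error = (
    "DeepseekException - parameter check failed, "
    "max_completion_tokens range is [1, 65536]"
)

moonshot_range = parse_token_range(moonshot_error)
deepseek_range = parse_token_range(deepseek_error)
```

The upper bound of the returned tuple is the largest value the endpoint accepts for that parameter.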

Root Cause

In openhands/sdk/llm/llm.py (lines 1249-1330), when the user does not set max_output_tokens, the SDK falls back to self._model_info from litellm's model database. For these models, the litellm metadata is inaccurate:

| Model | SDK-inferred effective_max_output_tokens | Actual API limit | Result |
| --- | --- | --- | --- |
| moonshot/kimi-k2.5 | 131072 | 98304 | BadRequestError |
| deepseek/deepseek-v3.2 | 81920 | 65536 | BadRequestError |

This is the same class of bug as #2247 (Bedrock models with incorrect litellm metadata), but that fix only guarded bedrock/-prefixed models. Non-Bedrock models routed through custom base_url gateways are still affected.

Additionally, the SDK's default extended_thinking_budget=200000 is passed as max_tokens even for models that do not support extended thinking, which also exceeds their limits (for example, 98304 for kimi-k2.5).

Steps to Reproduce

from openhands.sdk.llm.llm import LLM

llm = LLM(
    model="moonshot/kimi-k2.5",
    base_url="https://qianfan.baidubce.com/v2/coding",
    api_key="sk-...",
    reasoning_effort="medium",
)
print(llm.effective_max_output_tokens)
# Output: 131072 — but the model's actual max_tokens limit is 98304

Workaround

Explicitly set max_output_tokens and extended_thinking_budget when constructing the LLM:

llm = LLM(
    model="moonshot/kimi-k2.5",
    base_url="https://qianfan.baidubce.com/v2/coding",
    api_key="sk-...",
    reasoning_effort="medium",
    max_output_tokens=16384,
    extended_thinking_budget=None,
    enable_encrypted_reasoning=False,
)
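If several models go through the same gateway, the workaround arguments can be generated in one place. This is a sketch only: `OBSERVED_MAX_OUTPUT_TOKENS` and `workaround_kwargs` are hypothetical names, and the limits are the ones reported by the errors in this issue, not values taken from provider documentation.

```python
from typing import Any, Dict

# Upper limits reported by the gateway errors in this issue (assumption:
# other models or gateways may differ and would need their own entries).
OBSERVED_MAX_OUTPUT_TOKENS: Dict[str, int] = {
    "moonshot/kimi-k2.5": 98304,
    "deepseek/deepseek-v3.2": 65536,
}

# Same conservative value used in the workaround above.
CONSERVATIVE_DEFAULT = 16384


def workaround_kwargs(
    model: str, requested: int = CONSERVATIVE_DEFAULT
) -> Dict[str, Any]:
    """Build the extra LLM(...) keyword arguments used in the workaround.

    The requested output budget is clamped to the observed limit when one
    is known for the model; otherwise it is passed through unchanged.
    """
    limit = OBSERVED_MAX_OUTPUT_TOKENS.get(model)
    max_out = requested if limit is None else min(requested, limit)
    return {
        "max_output_tokens": max_out,
        "extended_thinking_budget": None,
        "enable_encrypted_reasoning": False,
    }


kimi_kwargs = workaround_kwargs("moonshot/kimi-k2.5")
```

The result can be splatted into the constructor, for example `LLM(model="moonshot/kimi-k2.5", base_url=..., api_key=..., reasoning_effort="medium", **kimi_kwargs)`.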

Suggested Fix

The guard added in #2264 for Bedrock should be generalized to all providers: when litellm's reported max_output_tokens cannot be verified as accurate for the actual endpoint, the SDK should either:

  1. Apply the existing DEFAULT_MAX_OUTPUT_TOKENS_CAP (16384) as a universal safety cap, or
  2. Omit max_completion_tokens / max_tokens entirely and let the provider use its default (as suggested in #2247, "Bedrock models fail when litellm reports max_output_tokens equal to context window size").
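The two options above could be sketched roughly as follows. This is not the SDK's actual code: `resolve_max_output_tokens`, `metadata_trusted`, and `omit_when_untrusted` are hypothetical names for illustration, and only the cap value 16384 comes from this issue.

```python
from typing import Optional

DEFAULT_MAX_OUTPUT_TOKENS_CAP = 16384  # cap value quoted in this issue


def resolve_max_output_tokens(
    user_value: Optional[int],
    metadata_value: Optional[int],
    metadata_trusted: bool,
    omit_when_untrusted: bool = False,
) -> Optional[int]:
    """Pick the output-token limit to send, or None to omit the parameter.

    - An explicit user setting always wins.
    - Trusted metadata is used as-is.
    - Untrusted metadata is either capped (option 1) or dropped so the
      provider applies its own default (option 2).
    """
    if user_value is not None:
        return user_value
    if metadata_value is None:
        return None
    if metadata_trusted:
        return metadata_value
    if omit_when_untrusted:
        return None
    return min(metadata_value, DEFAULT_MAX_OUTPUT_TOKENS_CAP)


# kimi-k2.5 case from this issue: metadata says 131072, endpoint unverified.
capped = resolve_max_output_tokens(None, 131072, metadata_trusted=False)
omitted = resolve_max_output_tokens(
    None, 131072, metadata_trusted=False, omit_when_untrusted=True
)
```

How "trusted" is decided (for example, whether a custom base_url is set) is left open here; that is the design question the fix would need to settle.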

Environment

  • openhands-sdk version: 1.22.1
  • Python: 3.12
  • litellm: bundled with openhands-sdk 1.22.1
