fix(llm): preserve fragmented agent tool calls#10985
Conversation
Signed-off-by: Matej Kosec <mkosec@nvidia.com>
This comment has been minimized.
This comment has been minimized.
WalkthroughAnthropic and OpenAI Responses stream converters now defer tool-call completion events to stream end. The PR also adds shared HTTP test harness support, replay fixtures, and end-to-end Anthropic and Responses replay tests. ChangesTool-call buffering and replay coverage
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/llm/src/protocols/anthropic/stream_converter.rs (1)
305-318: 🎯 Functional Correctness | 🟡 MinorHandle id/name-only tool-call deltas
startedonly flips whenfunc.argumentsis present, so a tool call that arrives withid/namebut never sends arguments is dropped at end-of-stream. The codebase already hasarguments: Nonetool-call chunks, so consider marking the call as started earlier or otherwise emitting an emptytool_useblock instead of omitting it.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/llm/src/protocols/anthropic/stream_converter.rs` around lines 305 - 318, The tool-call delta handling in the Anthropic stream converter only marks a call as started when `func.arguments` is present, so id/name-only chunks can be dropped at end-of-stream. Update `stream_converter.rs` in the tool-call state update logic to mark the call as started as soon as an `id` or `name` is received, or otherwise ensure `ToolCallState` emits an empty `tool_use` block when `arguments` stays `None`, so `id`/`name`-only deltas are preserved.
🧹 Nitpick comments (1)
lib/llm/tests/data/replays/agent-harness/README.md (1)
3-5: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueAdd the matching GitHub link for
DYN-2764.This doc references the internal Linear ticket
DYN-2764but omits a GitHub link. External contributors can't resolve Linear IDs, so include the matching GitHub issue (e.g.#NNNN) alongside it if one exists, as you already do for PR#8284.As per coding guidelines: "Markdown documentation files may reference Linear tickets for internal context, but should prefer to also include the matching GitHub link when one exists".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/llm/tests/data/replays/agent-harness/README.md` around lines 3 - 5, The README currently mentions the internal Linear ticket DYN-2764 without a corresponding public GitHub reference; update the fixture description to include the matching GitHub issue link or issue number alongside DYN-2764, similar to how PR `#8284` is referenced. Use the existing README text near the DYN-2764 mention and keep the wording concise so external contributors can identify the related GitHub item without needing Linear access.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@lib/llm/tests/responses_http_replay.rs`:
- Around line 326-347: The round-trip test in responses_http_replay.rs hardcodes
the function call output `call_id` instead of reusing the one returned in the
first response. Update the test around `first_body["output"]` and the
`post_responses` payload so the `function_call_output` uses the captured
`function_call`’s `call_id`, and adjust the later assertions to compare against
that same dynamic value rather than `call_list_directory`.
---
Outside diff comments:
In `@lib/llm/src/protocols/anthropic/stream_converter.rs`:
- Around line 305-318: The tool-call delta handling in the Anthropic stream
converter only marks a call as started when `func.arguments` is present, so
id/name-only chunks can be dropped at end-of-stream. Update
`stream_converter.rs` in the tool-call state update logic to mark the call as
started as soon as an `id` or `name` is received, or otherwise ensure
`ToolCallState` emits an empty `tool_use` block when `arguments` stays `None`,
so `id`/`name`-only deltas are preserved.
---
Nitpick comments:
In `@lib/llm/tests/data/replays/agent-harness/README.md`:
- Around line 3-5: The README currently mentions the internal Linear ticket
DYN-2764 without a corresponding public GitHub reference; update the fixture
description to include the matching GitHub issue link or issue number alongside
DYN-2764, similar to how PR `#8284` is referenced. Use the existing README text
near the DYN-2764 mention and keep the wording concise so external contributors
can identify the related GitHub item without needing Linear access.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5a7ff35b-eda4-4eaa-924c-2d80c1054e93
⛔ Files ignored due to path filters (4)
lib/llm/tests/snapshots/anthropic_http_replay__anthropic_streaming_text.snapis excluded by!**/*.snaplib/llm/tests/snapshots/anthropic_http_replay__anthropic_unary_text.snapis excluded by!**/*.snaplib/llm/tests/snapshots/responses_http_replay__responses_streaming_text.snapis excluded by!**/*.snaplib/llm/tests/snapshots/responses_http_replay__responses_unary_text.snapis excluded by!**/*.snap
📒 Files selected for processing (12)
lib/llm/src/protocols/anthropic/stream_converter.rslib/llm/src/protocols/openai/chat_completions/tool_parser_v2.rslib/llm/src/protocols/openai/responses/stream_converter.rslib/llm/tests/anthropic_http_replay.rslib/llm/tests/common/http_harness.rslib/llm/tests/common/scripted_chat_engine.rslib/llm/tests/data/replays/agent-harness/README.mdlib/llm/tests/data/replays/agent-harness/fragmented-tool.sselib/llm/tests/data/replays/agent-harness/parallel-tools.sselib/llm/tests/data/replays/agent-harness/text.sselib/llm/tests/data/replays/agent-harness/thinking-tool.sselib/llm/tests/responses_http_replay.rs
| let function_call = first_body["output"] | ||
| .as_array() | ||
| .unwrap() | ||
| .iter() | ||
| .find(|item| item["type"] == "function_call") | ||
| .expect("first response did not contain a function call") | ||
| .clone(); | ||
|
|
||
| let second_response = post_responses( | ||
| &svc, | ||
| &json!({ | ||
| "model": MODEL, | ||
| "stream": false, | ||
| "max_output_tokens": 64, | ||
| "tools": [tool("list_directory")], | ||
| "input": [ | ||
| {"role": "user", "content": "List /tmp"}, | ||
| function_call, | ||
| { | ||
| "type": "function_call_output", | ||
| "call_id": "call_list_directory", | ||
| "output": "[\"a.txt\"]" |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Use the first response’s call_id instead of hardcoding it.
This test is meant to validate round-trip propagation, but the second request and final assertions bake in call_list_directory rather than reusing whatever call_id the first response actually returned. That makes the test fixture-dependent and can fail even when the implementation preserves IDs correctly.
Suggested change
let function_call = first_body["output"]
.as_array()
.unwrap()
.iter()
.find(|item| item["type"] == "function_call")
.expect("first response did not contain a function call")
.clone();
+ let call_id = function_call["call_id"]
+ .as_str()
+ .expect("function call missing call_id")
+ .to_owned();
let second_response = post_responses(
&svc,
&json!({
@@
function_call,
{
"type": "function_call_output",
- "call_id": "call_list_directory",
+ "call_id": call_id.clone(),
"output": "[\"a.txt\"]"
}
]
}),
@@
let calls = assistant.tool_calls.as_deref().expect("tool calls missing");
assert_eq!(calls.len(), 1);
- assert_eq!(calls[0].id, "call_list_directory");
+ assert_eq!(calls[0].id, call_id);
assert_eq!(calls[0].function.name, "list_directory");
assert_eq!(calls[0].function.arguments, r#"{"path":"/tmp"}"#);
- assert_eq!(tool_result.tool_call_id, "call_list_directory");
+ assert_eq!(tool_result.tool_call_id, call_id);Also applies to: 373-376
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@lib/llm/tests/responses_http_replay.rs` around lines 326 - 347, The
round-trip test in responses_http_replay.rs hardcodes the function call output
`call_id` instead of reusing the one returned in the first response. Update the
test around `first_body["output"]` and the `post_responses` payload so the
`function_call_output` uses the captured `function_call`’s `call_id`, and adjust
the later assertions to compare against that same dynamic value rather than
`call_list_directory`.
|
Superseded by #10988, which includes the finish-signal completion timing and subsequent ordering corrections. |
Summary
llm_metricstest initializer needed by the full library suite.This replaces the stale approach in #8284 and implements DYN-2764 without a model, GPU, Docker, or a new dependency.
Validation
Validation commands run on a Linux build host with CUDA toolchain:
cargo fmt --all -- --checkcargo clippy -p dynamo-llm --tests -- -D warningscargo test -p dynamo-llm --lib stream_converter::tests— 20 passedINSTA_UPDATE=no cargo test -p dynamo-llm --test anthropic_http_replay --test responses_http_replay— 11 passedcargo test -p dynamo-llm -- --test-threads=1— all 1,375 active library tests passed, followed by the integration binaries until ten existingpreprocessorcases received HTTP 401 while downloading gatedmeta-llama/Llama-3.1-70B-InstructassetsTwo independent AI reviews examined protocol correctness and harness/test design. Their block-ordering, terminal-SSE, completion-assertion, fixture-provenance, ID-canonicalization, and duplicate-call findings were addressed; both reviewers re-checked the revised diff and reported no remaining actionable findings.
Summary by CodeRabbit
New Features
Bug Fixes
Tests