Skip to content

fix(llm): preserve fragmented agent tool calls#10988

Open
MatejKosec wants to merge 2 commits into
mainfrom
codex/dyn-2764-replay-harness-v2
Open

fix(llm): preserve fragmented agent tool calls#10988
MatejKosec wants to merge 2 commits into
mainfrom
codex/dyn-2764-replay-harness-v2

Conversation

@MatejKosec

@MatejKosec MatejKosec commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add deterministic full-HTTP replay coverage for the Anthropic Messages and OpenAI Responses compatibility endpoints, including fragmented and parallel tool calls and tool-result round trips.
  • Accumulate fragmented tool-call state until the backend emits an explicit tool_calls or legacy function_call finish reason, then publish completion events immediately; retain end-of-stream completion only for backends that omit the finish reason.
  • Preserve call identity, argument ordering, contiguous Anthropic block indices, and Responses output ordering when tool-call fragments arrive out of order.
  • Add gated streaming tests that withhold the usage tail and prove tool completion reaches the client first.

The replay fixtures are minimized reconstructions of recorded event ordering rather than backend-specific captures. The harness uses the production HTTP handlers and converters without a model, GPU, Docker, timing delay, or new dependency.

Validation

Run on computelab-build:

  • cargo fmt --all -- --check
  • cargo clippy -p dynamo-llm --tests -- -D warnings
  • cargo test -p dynamo-llm --lib stream_converter::tests — 29 passed
  • INSTA_UPDATE=no cargo test -p dynamo-llm --test anthropic_http_replay --test responses_http_replay — 13 passed

Reviewer guidance

The finish-reason chunk is the first explicit indication that all tool-call fragments are complete. Usage-only chunks may follow it, so tool completion is emitted on the finish reason while the usage-bearing terminal response event remains at end of stream.

This supersedes #8284 and #10985.


Open in Devin Review

Summary by CodeRabbit

  • New Features

    • Improved streaming handling for tool/function calls so partial arguments are delivered in the correct order, even when identifiers arrive late.
    • Better support for parallel tool calls, with cleaner event sequencing and completion behavior.
  • Bug Fixes

    • Tool and function call blocks now finish reliably at stream end, even when a response omits a terminal finish signal.
    • Fixed duplicate or out-of-order completion events in streamed responses.
  • Documentation

    • Added replay fixture notes and expanded test coverage for common streaming scenarios.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Signed-off-by: Matej Kosec <mkosec@nvidia.com>
@MatejKosec MatejKosec requested a review from a team as a code owner June 26, 2026 14:14
@MatejKosec MatejKosec temporarily deployed to external_collaborator June 26, 2026 14:14 — with GitHub Actions Inactive
@github-actions github-actions Bot added fix documentation Improvements or additions to documentation frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` labels Jun 26, 2026
@github-actions

Copy link
Copy Markdown
Contributor

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Walkthrough

Adds buffered tool-call handling in Anthropic and OpenAI stream converters, and adds shared HTTP replay harnesses, scripted engine support, and replay tests/fixtures for Anthropic messages and OpenAI Responses compatibility.

Changes

Tool-call streaming and replay coverage

Layer / File(s) Summary
Shared test harness and scripted engine
lib/llm/tests/common/scripted_chat_engine.rs, lib/llm/tests/common/http_harness.rs
Adds a scripted chat engine with gated tails, HTTP server startup/shutdown helpers, SSE parsing, and JSON canonicalization for replay tests.
Anthropic buffered tool-call emission
lib/llm/src/protocols/anthropic/stream_converter.rs
Buffers tool-call ids, names, and argument fragments, and flushes tool-use blocks on finish reasons or EOF.
Anthropic buffering tests and tagged helpers
lib/llm/src/protocols/anthropic/stream_converter.rs
Mirrors the buffered flush path in tagged helpers and updates assertions for finish-driven flush, EOF fallback, text/tool ordering, fragmented arguments, incomplete identities, and parallel or duplicate tool calls.
OpenAI function-call buffering
lib/llm/src/protocols/openai/responses/stream_converter.rs
Buffers function-call arguments until identity is complete, emits start and done events on finish reasons or EOF, and keeps completed output construction ordered.
OpenAI tests and parser helper
lib/llm/src/protocols/openai/chat_completions/tool_parser_v2.rs, lib/llm/src/protocols/openai/responses/stream_converter.rs
Updates the stream test helper to include llm_metrics: None and revises unit tests for deferred done timing, identity and argument ordering, EOF closeout, completed output ordering, and parallel call sequencing.
Anthropic HTTP replay suite
lib/llm/tests/anthropic_http_replay.rs, lib/llm/tests/data/replays/agent-harness/*
Adds replay fixtures and Tokio HTTP tests for /v1/messages and /v1/messages/count_tokens covering text, fragmented tool calls, gated finish ordering, parallel tools, and tool-result round-trips.
OpenAI Responses HTTP replay suite
lib/llm/tests/responses_http_replay.rs
Adds HTTP tests for /v1/responses covering unary text, streaming text, fragmented and parallel function calls, gated finish ordering, and function_call_output round-trips.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The description covers the main changes and validation, but it omits required template sections and the required issue section. Add the required Overview, Details, Where should reviewer start, and Related Issues sections, with either a linked issue or the no-issue confirmation.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly matches the main change: preserving fragmented agent tool calls.
Docstring Coverage ✅ Passed Docstring coverage is 83.52% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/llm/src/protocols/openai/responses/stream_converter.rs (1)

306-420: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Normalize legacy delta.function_call chunks before closing on FunctionCall.

Line 418 handles FinishReason::FunctionCall, but Lines 306-416 only record delta.tool_calls. A backend that streams the legacy function_call delta shape will reach the finish reason with no FunctionCallState, so the Responses stream drops the call instead of emitting response.output_item.added/argument/done events. Please route delta.function_call through the same buffering path, likely as index 0 with a generated call id when the legacy shape has none. Based on PR objectives, legacy function-call finish behavior is part of this change.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/llm/src/protocols/openai/responses/stream_converter.rs` around lines 306
- 420, The stream conversion in ResponseStreamConverter currently buffers only
delta.tool_calls, so legacy delta.function_call chunks are lost before
FinishReason::FunctionCall closes the item. Update the buffering logic in the
tool-call handling path to also normalize delta.function_call into the same
FunctionCallState flow, likely using index 0 and generating a call_id when
missing, so the existing response.output_item.added and argument/done event
emission still occurs for legacy callers.
🧹 Nitpick comments (2)
lib/llm/src/protocols/anthropic/stream_converter.rs (1)

98-124: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

record_tool_call buffering is correct; started is effectively redundant.

started is set whenever id/name/arguments are present, but is_emit_ready() already gates on non-empty id and name, so a state can never be is_emit_ready() without started == true. Harmless, but the flag adds no signal beyond the id/name checks.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/llm/src/protocols/anthropic/stream_converter.rs` around lines 98 - 124,
The `record_tool_call` state machine in `stream_converter.rs` uses `started`
redundantly, since `is_emit_ready()` already depends on the presence of `id` and
`name`. Update `ToolCallState` handling and `record_tool_call` so the readiness
logic relies on the actual fields (`id`, `name`, and `argument_fragments`)
instead of maintaining `started`, and remove any remaining writes/checks tied to
that flag.
lib/llm/tests/data/replays/agent-harness/README.md (1)

3-5: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add the public issue reference for DYN-2764 if one exists.

External readers cannot resolve the Linear ID on its own. Please add the matching GitHub issue reference/link next to DYN-2764 (or replace it if the Linear ID is no longer needed). As per coding guidelines, Markdown docs should "prefer to also include the matching GitHub link when one exists".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/llm/tests/data/replays/agent-harness/README.md` around lines 3 - 5, The
README for the agent-harness replay fixtures references DYN-2764 but does not
include a resolvable public issue link. Update the text in this document to add
the matching GitHub issue reference/link next to the DYN-2764 mention, or
replace the Linear ID if it is no longer needed, so external readers can follow
it; keep the existing mention of PR `#8284` intact.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@lib/llm/tests/anthropic_http_replay.rs`:
- Around line 219-239: The test in `anthropic_http_replay.rs` is asserting the
wrong streaming order by requiring `message_delta` to be absent before
`gate.release()`. Update the `saw_message_delta` expectation in the
`tokio::time::timeout` loop around the `parser.push` handling so it allows a
finish-driven `message_delta` to appear before the gated usage tail, while still
verifying `content_block_stop` is observed before releasing the gate.

---

Outside diff comments:
In `@lib/llm/src/protocols/openai/responses/stream_converter.rs`:
- Around line 306-420: The stream conversion in ResponseStreamConverter
currently buffers only delta.tool_calls, so legacy delta.function_call chunks
are lost before FinishReason::FunctionCall closes the item. Update the buffering
logic in the tool-call handling path to also normalize delta.function_call into
the same FunctionCallState flow, likely using index 0 and generating a call_id
when missing, so the existing response.output_item.added and argument/done event
emission still occurs for legacy callers.

---

Nitpick comments:
In `@lib/llm/src/protocols/anthropic/stream_converter.rs`:
- Around line 98-124: The `record_tool_call` state machine in
`stream_converter.rs` uses `started` redundantly, since `is_emit_ready()`
already depends on the presence of `id` and `name`. Update `ToolCallState`
handling and `record_tool_call` so the readiness logic relies on the actual
fields (`id`, `name`, and `argument_fragments`) instead of maintaining
`started`, and remove any remaining writes/checks tied to that flag.

In `@lib/llm/tests/data/replays/agent-harness/README.md`:
- Around line 3-5: The README for the agent-harness replay fixtures references
DYN-2764 but does not include a resolvable public issue link. Update the text in
this document to add the matching GitHub issue reference/link next to the
DYN-2764 mention, or replace the Linear ID if it is no longer needed, so
external readers can follow it; keep the existing mention of PR `#8284` intact.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6f33e136-09ad-4cbd-95a4-7484d0d2d6e9

📥 Commits

Reviewing files that changed from the base of the PR and between c0bc4a0 and f839a4e.

⛔ Files ignored due to path filters (4)
  • lib/llm/tests/snapshots/anthropic_http_replay__anthropic_streaming_text.snap is excluded by !**/*.snap
  • lib/llm/tests/snapshots/anthropic_http_replay__anthropic_unary_text.snap is excluded by !**/*.snap
  • lib/llm/tests/snapshots/responses_http_replay__responses_streaming_text.snap is excluded by !**/*.snap
  • lib/llm/tests/snapshots/responses_http_replay__responses_unary_text.snap is excluded by !**/*.snap
📒 Files selected for processing (12)
  • lib/llm/src/protocols/anthropic/stream_converter.rs
  • lib/llm/src/protocols/openai/chat_completions/tool_parser_v2.rs
  • lib/llm/src/protocols/openai/responses/stream_converter.rs
  • lib/llm/tests/anthropic_http_replay.rs
  • lib/llm/tests/common/http_harness.rs
  • lib/llm/tests/common/scripted_chat_engine.rs
  • lib/llm/tests/data/replays/agent-harness/README.md
  • lib/llm/tests/data/replays/agent-harness/fragmented-tool.sse
  • lib/llm/tests/data/replays/agent-harness/parallel-tools.sse
  • lib/llm/tests/data/replays/agent-harness/text.sse
  • lib/llm/tests/data/replays/agent-harness/thinking-tool.sse
  • lib/llm/tests/responses_http_replay.rs

Comment on lines +219 to +239
let mut saw_tool_stop = false;
let mut saw_message_delta = false;

tokio::time::timeout(Duration::from_secs(2), async {
while !saw_tool_stop {
let bytes = body
.next()
.await
.expect("response ended before tool block completion")
.expect("failed to read response SSE bytes");
for event in parser.push(&bytes).expect("failed to parse response SSE") {
saw_tool_stop |= event == "content_block_stop";
saw_message_delta |= event == "message_delta";
}
}
})
.await
.expect("tool block completion did not arrive before the gated usage tail");

assert!(!saw_message_delta);
gate.release();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Don't assert that message_delta is delayed behind the usage tail.

Because the gate starts at the first usage chunk, a finish-driven message_delta should already be observable before gate.release(). Line 238 currently locks in the opposite contract and would fail once the stream publishes the stop reason immediately, which is the behavior described in this PR.

Suggested fix
         let mut body = response.bytes_stream();
         let mut parser = IncrementalSseParser::default();
         let mut saw_tool_stop = false;
         let mut saw_message_delta = false;

         tokio::time::timeout(Duration::from_secs(2), async {
-            while !saw_tool_stop {
+            while !saw_tool_stop || !saw_message_delta {
                 let bytes = body
                     .next()
                     .await
                     .expect("response ended before tool block completion")
                     .expect("failed to read response SSE bytes");
                 for event in parser.push(&bytes).expect("failed to parse response SSE") {
                     saw_tool_stop |= event == "content_block_stop";
                     saw_message_delta |= event == "message_delta";
                 }
             }
         })
         .await
-        .expect("tool block completion did not arrive before the gated usage tail");
+        .expect("tool completion did not arrive before the gated usage tail");

-        assert!(!saw_message_delta);
+        assert!(saw_message_delta);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let mut saw_tool_stop = false;
let mut saw_message_delta = false;
tokio::time::timeout(Duration::from_secs(2), async {
while !saw_tool_stop {
let bytes = body
.next()
.await
.expect("response ended before tool block completion")
.expect("failed to read response SSE bytes");
for event in parser.push(&bytes).expect("failed to parse response SSE") {
saw_tool_stop |= event == "content_block_stop";
saw_message_delta |= event == "message_delta";
}
}
})
.await
.expect("tool block completion did not arrive before the gated usage tail");
assert!(!saw_message_delta);
gate.release();
let mut saw_tool_stop = false;
let mut saw_message_delta = false;
tokio::time::timeout(Duration::from_secs(2), async {
while !saw_tool_stop || !saw_message_delta {
let bytes = body
.next()
.await
.expect("response ended before tool block completion")
.expect("failed to read response SSE bytes");
for event in parser.push(&bytes).expect("failed to parse response SSE") {
saw_tool_stop |= event == "content_block_stop";
saw_message_delta |= event == "message_delta";
}
}
})
.await
.expect("tool completion did not arrive before the gated usage tail");
assert!(saw_message_delta);
gate.release();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/llm/tests/anthropic_http_replay.rs` around lines 219 - 239, The test in
`anthropic_http_replay.rs` is asserting the wrong streaming order by requiring
`message_delta` to be absent before `gate.release()`. Update the
`saw_message_delta` expectation in the `tokio::time::timeout` loop around the
`parser.push` handling so it allows a finish-driven `message_delta` to appear
before the gated usage tail, while still verifying `content_block_stop` is
observed before releasing the gate.

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

Open in Devin Review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation fix frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` size/XXL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant