Skip to content

fix(llm): preserve fragmented agent tool calls#10985

Closed
MatejKosec wants to merge 1 commit into
ai-dynamo:mainfrom
MatejKosec:codex/dyn-2764-replay-harness
Closed

fix(llm): preserve fragmented agent tool calls#10985
MatejKosec wants to merge 1 commit into
ai-dynamo:mainfrom
MatejKosec:codex/dyn-2764-replay-harness

Conversation

@MatejKosec

@MatejKosec MatejKosec commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add a deterministic, request-capturing chat engine and real HTTP replay coverage for the Anthropic Messages and OpenAI Responses APIs.
  • Cover unary and streaming text, fragmented and parallel tool calls, two-request tool-result exchanges, and Anthropic token counting with eleven endpoint tests and four stable snapshots.
  • Delay Responses function-call completion until the backend stream ends, and buffer Anthropic tool fragments so parallel calls are emitted as sequential, non-overlapping content blocks.
  • Document the minimized fixture provenance and repair the current-main llm_metrics test initializer needed by the full library suite.

This replaces the stale approach in #8284 and implements DYN-2764 without a model, GPU, Docker, or a new dependency.

Validation

Validation commands run on a Linux build host with CUDA toolchain:

  • cargo fmt --all -- --check
  • cargo clippy -p dynamo-llm --tests -- -D warnings
  • cargo test -p dynamo-llm --lib stream_converter::tests — 20 passed
  • INSTA_UPDATE=no cargo test -p dynamo-llm --test anthropic_http_replay --test responses_http_replay — 11 passed
  • cargo test -p dynamo-llm -- --test-threads=1 — all 1,375 active library tests passed, followed by the integration binaries until ten existing preprocessor cases received HTTP 401 while downloading gated meta-llama/Llama-3.1-70B-Instruct assets

Two independent AI reviews examined protocol correctness and harness/test design. Their block-ordering, terminal-SSE, completion-assertion, fixture-provenance, ID-canonicalization, and duplicate-call findings were addressed; both reviewers re-checked the revised diff and reported no remaining actionable findings.

Summary by CodeRabbit

  • New Features

    • Added end-to-end HTTP coverage for AI chat and responses streaming, including text, tool calls, parallel tools, and tool-result round trips.
    • Added replay fixtures for common streamed scenarios, including fragmented tool arguments and reasoning + tool-use flows.
  • Bug Fixes

    • Improved tool-call streaming so fragmented arguments are finalized consistently at stream end.
    • Fixed completion event timing so tool-call “done” events are emitted after the full stream, matching expected output order.
  • Tests

    • Expanded integration tests and shared harness utilities for more reliable, deterministic protocol validation.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
@github-actions

github-actions Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot added external-contribution Pull request is from an external contributor fix documentation Improvements or additions to documentation frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` labels Jun 26, 2026
@datadog-official

This comment has been minimized.

@MatejKosec MatejKosec marked this pull request as ready for review June 26, 2026 10:35
@MatejKosec MatejKosec requested a review from a team as a code owner June 26, 2026 10:35
@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Walkthrough

Anthropic and OpenAI Responses stream converters now defer tool-call completion events to stream end. The PR also adds shared HTTP test harness support, replay fixtures, and end-to-end Anthropic and Responses replay tests.

Changes

Tool-call buffering and replay coverage

Layer / File(s) Summary
Anthropic buffering and tests
lib/llm/src/protocols/anthropic/stream_converter.rs
Tool-call arguments are buffered as fragments and emitted at stream end, and the unit tests were updated for buffered stop timing, ordering, and duplicate-id behavior.
OpenAI Responses buffering and tests
lib/llm/src/protocols/openai/chat_completions/tool_parser_v2.rs, lib/llm/src/protocols/openai/responses/stream_converter.rs
Function-call completion is deferred to stream end, and the related tests and chunk helper were updated to match the new event timing.
HTTP harness and replay fixtures
lib/llm/tests/common/*, lib/llm/tests/data/replays/agent-harness/*
The shared harness starts scripted chat services, parses and canonicalizes SSE bodies, and adds the replay README plus the text, tool, parallel-tool, and thinking-tool fixtures.
Anthropic HTTP replay tests
lib/llm/tests/anthropic_http_replay.rs
Anthropic Messages replay tests cover unary, streaming, fragmented tool arguments, parallel tools, tool-result round trips, and token counting.
Responses HTTP replay tests
lib/llm/tests/responses_http_replay.rs
Responses replay tests cover unary, streaming, fragmented function calls, parallel tool calls, and function-call output round trips.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description covers summary and validation, but it does not follow the required template and is missing Overview, Details, Where to start, and Related Issues sections. Rewrite it using the repository template and add the missing sections, including a Related Issues entry such as Closes #XXXX or the no-issue confirmation.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly describes the main change: preserving fragmented agent tool calls.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/llm/src/protocols/anthropic/stream_converter.rs (1)

305-318: 🎯 Functional Correctness | 🟡 Minor

Handle id/name-only tool-call deltas

started only flips when func.arguments is present, so a tool call that arrives with id/name but never sends arguments is dropped at end-of-stream. The codebase already has arguments: None tool-call chunks, so consider marking the call as started earlier or otherwise emitting an empty tool_use block instead of omitting it.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/llm/src/protocols/anthropic/stream_converter.rs` around lines 305 - 318,
The tool-call delta handling in the Anthropic stream converter only marks a call
as started when `func.arguments` is present, so id/name-only chunks can be
dropped at end-of-stream. Update `stream_converter.rs` in the tool-call state
update logic to mark the call as started as soon as an `id` or `name` is
received, or otherwise ensure `ToolCallState` emits an empty `tool_use` block
when `arguments` stays `None`, so `id`/`name`-only deltas are preserved.
🧹 Nitpick comments (1)
lib/llm/tests/data/replays/agent-harness/README.md (1)

3-5: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Add the matching GitHub link for DYN-2764.

This doc references the internal Linear ticket DYN-2764 but omits a GitHub link. External contributors can't resolve Linear IDs, so include the matching GitHub issue (e.g. #NNNN) alongside it if one exists, as you already do for PR #8284.

As per coding guidelines: "Markdown documentation files may reference Linear tickets for internal context, but should prefer to also include the matching GitHub link when one exists".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/llm/tests/data/replays/agent-harness/README.md` around lines 3 - 5, The
README currently mentions the internal Linear ticket DYN-2764 without a
corresponding public GitHub reference; update the fixture description to include
the matching GitHub issue link or issue number alongside DYN-2764, similar to
how PR `#8284` is referenced. Use the existing README text near the DYN-2764
mention and keep the wording concise so external contributors can identify the
related GitHub item without needing Linear access.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@lib/llm/tests/responses_http_replay.rs`:
- Around line 326-347: The round-trip test in responses_http_replay.rs hardcodes
the function call output `call_id` instead of reusing the one returned in the
first response. Update the test around `first_body["output"]` and the
`post_responses` payload so the `function_call_output` uses the captured
`function_call`’s `call_id`, and adjust the later assertions to compare against
that same dynamic value rather than `call_list_directory`.

---

Outside diff comments:
In `@lib/llm/src/protocols/anthropic/stream_converter.rs`:
- Around line 305-318: The tool-call delta handling in the Anthropic stream
converter only marks a call as started when `func.arguments` is present, so
id/name-only chunks can be dropped at end-of-stream. Update
`stream_converter.rs` in the tool-call state update logic to mark the call as
started as soon as an `id` or `name` is received, or otherwise ensure
`ToolCallState` emits an empty `tool_use` block when `arguments` stays `None`,
so `id`/`name`-only deltas are preserved.

---

Nitpick comments:
In `@lib/llm/tests/data/replays/agent-harness/README.md`:
- Around line 3-5: The README currently mentions the internal Linear ticket
DYN-2764 without a corresponding public GitHub reference; update the fixture
description to include the matching GitHub issue link or issue number alongside
DYN-2764, similar to how PR `#8284` is referenced. Use the existing README text
near the DYN-2764 mention and keep the wording concise so external contributors
can identify the related GitHub item without needing Linear access.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5a7ff35b-eda4-4eaa-924c-2d80c1054e93

📥 Commits

Reviewing files that changed from the base of the PR and between 15bdb11 and 22b1a22.

⛔ Files ignored due to path filters (4)
  • lib/llm/tests/snapshots/anthropic_http_replay__anthropic_streaming_text.snap is excluded by !**/*.snap
  • lib/llm/tests/snapshots/anthropic_http_replay__anthropic_unary_text.snap is excluded by !**/*.snap
  • lib/llm/tests/snapshots/responses_http_replay__responses_streaming_text.snap is excluded by !**/*.snap
  • lib/llm/tests/snapshots/responses_http_replay__responses_unary_text.snap is excluded by !**/*.snap
📒 Files selected for processing (12)
  • lib/llm/src/protocols/anthropic/stream_converter.rs
  • lib/llm/src/protocols/openai/chat_completions/tool_parser_v2.rs
  • lib/llm/src/protocols/openai/responses/stream_converter.rs
  • lib/llm/tests/anthropic_http_replay.rs
  • lib/llm/tests/common/http_harness.rs
  • lib/llm/tests/common/scripted_chat_engine.rs
  • lib/llm/tests/data/replays/agent-harness/README.md
  • lib/llm/tests/data/replays/agent-harness/fragmented-tool.sse
  • lib/llm/tests/data/replays/agent-harness/parallel-tools.sse
  • lib/llm/tests/data/replays/agent-harness/text.sse
  • lib/llm/tests/data/replays/agent-harness/thinking-tool.sse
  • lib/llm/tests/responses_http_replay.rs

Comment on lines +326 to +347
let function_call = first_body["output"]
.as_array()
.unwrap()
.iter()
.find(|item| item["type"] == "function_call")
.expect("first response did not contain a function call")
.clone();

let second_response = post_responses(
&svc,
&json!({
"model": MODEL,
"stream": false,
"max_output_tokens": 64,
"tools": [tool("list_directory")],
"input": [
{"role": "user", "content": "List /tmp"},
function_call,
{
"type": "function_call_output",
"call_id": "call_list_directory",
"output": "[\"a.txt\"]"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Use the first response’s call_id instead of hardcoding it.

This test is meant to validate round-trip propagation, but the second request and final assertions bake in call_list_directory rather than reusing whatever call_id the first response actually returned. That makes the test fixture-dependent and can fail even when the implementation preserves IDs correctly.

Suggested change
         let function_call = first_body["output"]
             .as_array()
             .unwrap()
             .iter()
             .find(|item| item["type"] == "function_call")
             .expect("first response did not contain a function call")
             .clone();
+        let call_id = function_call["call_id"]
+            .as_str()
+            .expect("function call missing call_id")
+            .to_owned();

         let second_response = post_responses(
             &svc,
             &json!({
@@
                     function_call,
                     {
                         "type": "function_call_output",
-                        "call_id": "call_list_directory",
+                        "call_id": call_id.clone(),
                         "output": "[\"a.txt\"]"
                     }
                 ]
             }),
@@
                 let calls = assistant.tool_calls.as_deref().expect("tool calls missing");
                 assert_eq!(calls.len(), 1);
-                assert_eq!(calls[0].id, "call_list_directory");
+                assert_eq!(calls[0].id, call_id);
                 assert_eq!(calls[0].function.name, "list_directory");
                 assert_eq!(calls[0].function.arguments, r#"{"path":"/tmp"}"#);
-                assert_eq!(tool_result.tool_call_id, "call_list_directory");
+                assert_eq!(tool_result.tool_call_id, call_id);

Also applies to: 373-376

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/llm/tests/responses_http_replay.rs` around lines 326 - 347, The
round-trip test in responses_http_replay.rs hardcodes the function call output
`call_id` instead of reusing the one returned in the first response. Update the
test around `first_body["output"]` and the `post_responses` payload so the
`function_call_output` uses the captured `function_call`’s `call_id`, and adjust
the later assertions to compare against that same dynamic value rather than
`call_list_directory`.

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

Open in Devin Review

@MatejKosec

Copy link
Copy Markdown
Contributor Author

Superseded by #10988, which includes the finish-signal completion timing and subsequent ordering corrections.

@MatejKosec MatejKosec closed this Jun 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation external-contribution Pull request is from an external contributor fix frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` size/XXL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant