Durable execution: minimal advanced-template + chatbot integration #207
Open
dhruv0811 wants to merge 9 commits into
Conversation
Lets the advanced templates use `LongRunningAgentServer` end-to-end so an
in-flight agent run survives a pod restart without losing the user's turn.
Chatbot side (`e2e-chatbot-app-next/packages/ai-sdk-providers/src/providers-server.ts`):
the AI SDK's custom `databricksFetch` is the single boundary every agent
request flows through, so we put all the durable-execution glue here:
- inject `background: true` on streaming requests when `API_PROXY` points
at a long-running server,
- capture the rotated `conversation_id` emitted in `response.resumed`
sentinels and replay it on subsequent turns so the next request lands
on the post-resume session,
- on a stream that closes without `[DONE]`, transparently re-stream from
`GET /responses/{id}?stream=true&starting_after=<seq>` (capped retries).
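The actual glue is TypeScript inside `databricksFetch`; as a rough illustration of the same request/resume protocol, here is a minimal Python sketch. The endpoint shapes (`POST` with `background: true`, `GET /responses/{id}?stream=true&starting_after=<seq>`) follow the description above, while the base URL, token, POST path, event field names (`response.id`, `sequence_number`), and the retry cap are placeholder assumptions.

```python
import json
import requests  # assumed HTTP client; the real chatbot glue uses TypeScript fetch

BASE = "https://<app-url>"            # hypothetical long-running agent server URL
HEADERS = {"Authorization": "Bearer <token>"}
MAX_RESUME_RETRIES = 5                # capped retries, as described above


def stream_with_resume(body: dict):
    """POST a streaming request with background=true; resume if the stream drops."""
    body = {**body, "background": True, "stream": True}
    # POST path is an assumption; API_PROXY would normally route this request.
    resp = requests.post(f"{BASE}/responses", json=body, headers=HEADERS, stream=True)

    response_id, last_seq, retries = None, 0, 0
    while True:
        for line in resp.iter_lines():
            if not line.startswith(b"data: "):
                continue
            payload = line[len(b"data: "):]
            if payload == b"[DONE]":
                return  # clean finish, nothing to resume
            event = json.loads(payload)
            # Field names are assumptions; events carry the response id and a sequence cursor.
            response_id = event.get("response", {}).get("id", response_id)
            last_seq = event.get("sequence_number", last_seq)
            yield event
        # Stream closed without [DONE]: re-attach from the last seen cursor.
        if response_id is None or retries >= MAX_RESUME_RETRIES:
            return
        retries += 1
        resp = requests.get(
            f"{BASE}/responses/{response_id}",
            params={"stream": "true", "starting_after": last_seq},
            headers={"Authorization": HEADERS["Authorization"]},  # resume GET carries only auth
            stream=True,
        )
```

The cursor matters because the UI does not dedupe text/reasoning deltas (see the review discussion below); replaying from the beginning would double-render prose.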
Template side:
- factor the LangGraph dedup helper into `agent_server/utils.py` as
  `deduplicate_input` (matches the OpenAI template's existing helper),
  and call it from `stream_handler` so checkpointer-backed history
  isn't double-counted with the chatbot's UI-echoed history (see the
  sketch after this list),
- pin `databricks-ai-bridge` to the durable-execution branch via
`[tool.uv.sources]` until that work ships in a stable release.
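For the LangGraph side, a minimal sketch of what such a `deduplicate_input` helper could look like is below; the real helper in `agent_server/utils.py` may use a different signature and comparison strategy. It only illustrates trusting checkpointer-backed history (e.g. from `await agent.aget_state(config)`) over the chatbot's UI echo.

```python
from typing import Any


def deduplicate_input(
    ui_messages: list[dict[str, Any]],
    checkpointed_messages: list[Any],
) -> list[dict[str, Any]]:
    """Return only the input the checkpointer hasn't already seen.

    ui_messages:           full history echoed by the chatbot UI each turn
    checkpointed_messages: history already persisted by the LangGraph checkpointer
    """
    if not checkpointed_messages:
        # First turn (or no checkpointer state): forward the whole input.
        return ui_messages
    # Cross-turn history is already in the checkpointer; forward only the
    # latest user message so it isn't double-counted.
    return [m for m in ui_messages if m.get("role") == "user"][-1:]
```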
Plumbing:
- `start_app.py` now honors `APP_TEMPLATES_BRANCH` (default `main`) when
cloning the chatbot, so a non-main branch can be tested end-to-end.
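A sketch of the branch-selection behavior described above, assuming a hypothetical repository URL and a plain `git clone`; the real `start_app.py` may wire this differently.

```python
import os
import subprocess

CHATBOT_REPO = "https://github.com/<org>/<chatbot-repo>.git"  # placeholder URL


def clone_chatbot(dest: str) -> None:
    # Default to main; override with APP_TEMPLATES_BRANCH for end-to-end testing.
    branch = os.environ.get("APP_TEMPLATES_BRANCH", "main")
    subprocess.run(
        ["git", "clone", "--depth", "1", "--branch", branch, CHATBOT_REPO, dest],
        check=True,
    )
```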
Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
Uvicorn's default logging config drops INFO records from non-uvicorn loggers, which silently swallows all the durable-execution lifecycle breadcrumbs (task spawn, resume, prose-recovery build, terminal-status CAS, stale-scan claims). When debugging a deployed app the only signal left was raw uvicorn access logs — not enough to tell whether the durable path was even firing. Attach a stream handler to the `databricks_ai_bridge` logger explicitly so its lifecycle logs reach app stdout. Long-term this belongs in the bridge's `LongRunningAgentServer.__init__`, but doing it in the templates means we don't have to wait for a bridge release.
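A minimal sketch of that handler attachment; the format string and level are assumptions, and the handler check keeps it idempotent so reloads don't double-log.

```python
import logging
import sys

bridge_logger = logging.getLogger("databricks_ai_bridge")
bridge_logger.setLevel(logging.INFO)
if not bridge_logger.handlers:
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    bridge_logger.addHandler(handler)
```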
The prior heuristic compared `session_items >= messages - 1` to decide whether to forward only the latest user message. Under prose-recovery + always-rotate the rotated session has FEWER items than the chatbot's accumulated UI echo (attempt N+1's session is fresh while the UI accumulated events from both attempts), so the heuristic was returning all messages — duplicates of attempt N+1's tool_calls plus the orphan tool_use from attempt N.
The Runner then combined session+input, producing duplicate function_call items that the OpenAI SDK grouped into a malformed assistant.tool_calls block. Anthropic-backed models (databricks-claude-*) rejected the request with 400 "tool_use ids were found without tool_result blocks immediately after". gpt-* tolerated it; LangGraph templates were unaffected because their dedup uses checkpointer state, not a count heuristic.
Fix: if the session has any items at all, treat it as authoritative for cross-turn history and forward only the latest message. First-turn path (empty session) still returns the full input so MLflow evaluation works.
This was originally fixed in the prior templates PR (commit 31d87d6, "agent-openai-advanced: trust session as authoritative for cross-turn dedup") and was inadvertently dropped when re-imagining the PR from main.
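In code, the fixed rule is roughly the sketch below. The real `deduplicate_input` in `agent-openai-advanced/agent_server/utils.py` may differ in signature and in how it extracts the latest user message; `session.get_items()` is the session call named in the PR description.

```python
async def deduplicate_input(input_items: list[dict], session) -> list[dict]:
    """Trust a non-empty SDK session as authoritative for cross-turn history."""
    session_items = await session.get_items()
    if session_items:
        # The session already holds prior turns (including any orphaned tool_use
        # from a recovered attempt); forwarding the full UI echo would duplicate them.
        return [m for m in input_items if m.get("role") == "user"][-1:]
    # Empty session: forward everything so first-turn / MLflow evaluation flows keep working.
    return input_items
```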
Force-pushed from cfba80e to 6e3b3be
dhruv0811 added a commit that referenced this pull request on May 1, 2026
#425 merged so the durable-execution bits are on bridge main now. Pin both advanced templates to the bridge main branch (instead of the now-deleted feature branch); remove this block entirely once a release ships. Also revert the APP_TEMPLATES_BRANCH support in start_app.py — that was a testing-only hack to let deployed templates clone the chatbot from a non-main branch while #207 was open. Once #207 lands, mainline is the right default.
Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
Force-pushed from 754242e to d13064e
#425 merged so the durable-execution bits are on bridge main now. Templates only need PyPI databricks-ai-bridge — `LongRunningAgentServer` already exists in 0.19.0 (older flavor), so without the new release the templates degrade gracefully (chatbot's alias capture stays dormant because no `response.resumed` sentinels arrive). To exercise the new prose-recovery features end-to-end before 0.20.0 ships, add a temporary `[tool.uv.sources]` block pinning the bridge to its main branch. Also revert the APP_TEMPLATES_BRANCH support in start_app.py — that was a testing-only hack to let deployed templates clone the chatbot from a non-main branch while #207 was open. Once #207 lands, mainline is the right default.
Force-pushed from d13064e to 432d654
dhruv0811 commented on May 1, 2026
Companion to bridge PR adding `conversation_aliases` table. With server-side resolution the chatbot always sends the original chat.id as conversation_id and the bridge maps it to the post-rotation SDK session internally. This makes the durable path safe across:
- chatbot server restarts (the prior in-memory Map evaporated)
- multi-pod chatbot deployments (each pod started with its own blank Map)
- new browser tabs opening a chat that was rotated by a different session
What stays:
- background:true injection on streaming requests when API_PROXY is set
- auto-resume on stream close-without-DONE (capped retries)
What goes:
- module-scope `conversationAliasMap`
- `captureRotation` helper
- alias substitution in databricksFetch body mutation
- `originalChatId` plumbing through to `wrapDurableSseStream`
The `response.resumed` SSE sentinel keeps flowing for visibility / debug but the chatbot no longer reads it.
Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
Companion to bridge #427 which adds the `agent_server.conversation_aliases` table for server-side rotation alias resolution. Templates' grant script needs to know about it so the app's SP gets read/write access on first deploy; otherwise the bridge fails on every POST with `InsufficientPrivilege: permission denied for table conversation_aliases`.
Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
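A sketch of the kind of statement the grant script needs to add, assuming a psycopg connection to the Lakebase Postgres instance and the app service principal's ID as the role name; the real script's connection handling and privilege list may differ.

```python
import psycopg  # assumed Postgres driver for the Lakebase instance

GRANT_SQL = """
GRANT SELECT, INSERT, UPDATE, DELETE
ON TABLE agent_server.conversation_aliases
TO "{app_sp_id}";
"""


def grant_alias_table(conn: psycopg.Connection, app_sp_id: str) -> None:
    """Give the app's service principal read/write on the bridge alias table."""
    with conn.cursor() as cur:
        cur.execute(GRANT_SQL.format(app_sp_id=app_sp_id))
    conn.commit()
```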
Bridge PR moves `_attach_bridge_logger()` into `LongRunningAgentServer.__init__`, so the duplicated 14-line block in each template's start_server.py is no longer needed. Idempotent on the bridge side; opt out with `DATABRICKS_AI_BRIDGE_LOG_QUIET=1`. Addresses Bryan's review on #207. Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
bbqiu reviewed May 11, 2026
bbqiu (Contributor) left a comment
thanks for doing this! overall looks super close, but we may wanna revisit the comments about the UI resumption strat
Code context (`providers-server.ts`):
    // Create a new response with the logging stream
    return new Response(loggingStream, {
    const wrapped = wrapDurableSseStream(
bbqiu (Contributor):
do we want to do this always? or should we only do this for templates that have API_PROXY set in the response?
also is there a way for us to just replay from the beginning instead of tracking sequence numbers? i remember either the ai-sdk or the frontend logic handles dedupes for us
dhruv0811 (Author):
We have dedupe logic for tool calls + tool results, but critically NOT for text-deltas, reasoning-deltas, start-step chunks. I think the cursor solution is the cleanest here
- Gate the SSE wrap on API_PROXY at the call site, so standard Databricks
serving endpoints skip the wrap entirely instead of paying per-frame
parse cost. The resume branch can't fire there anyway.
- Pass only the Authorization header to the resume fetch — it's a GET,
and copying the whole request init was carrying along stale content
headers + the mlflow trace header that don't apply.
- Read API_PROXY at call sites (`getApiProxy()`) instead of capturing
at module load so unit tests can flip it per-case via process.env.
- Drop the misleading rationale comment in agent-openai-advanced's
deduplicate_input — the code is self-explanatory.
- Add tests in tests/ai-sdk-provider/durable-fetch.test.ts covering:
* background:true injection only when API_PROXY set + stream=true
* non-streaming + no-API_PROXY paths leave background alone
* SSE wrap fires GET retrieve with starting_after=<lastSeq> on
close-without-DONE, including the URL shape
* resume request carries ONLY Authorization (no content-type,
mlflow trace, or other request headers)
* SSE responses are NOT wrapped when API_PROXY is unset
Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
Integ test run: https://github.com/databricks-eng/ai-oss-integration-tests-runner/actions/runs/25755599119
Summary
Wires the advanced templates (`agent-langgraph-advanced`, `agent-openai-advanced`) and the e2e chatbot up to the bridge's `LongRunningAgentServer` so an in-flight agent run survives a pod restart without losing the user's turn. Companion to bridge #425 (merged) and #427 (open — server-side rotation alias resolution + bridge logger setup). This is a re-imagined, minimal version of #204.
The keystone simplification: the express `/invocations` proxy is gone. Vercel AI SDK's `databricksFetch` (in `providers-server.ts`) is already the single boundary every agent request flows through, so the only durable-execution glue we need (`background: true` injection + auto-resume on stream close) lives there. Conversation-rotation tracking — which was the most complex piece in the prior PR — is now handled server-side by the bridge (#427); the chatbot doesn't need to know rotation exists.
What the chatbot does (`e2e-chatbot-app-next`)
In `packages/ai-sdk-providers/src/providers-server.ts`:
- `body.background = true` on streaming requests when `API_PROXY` points at a long-running server, so the server persists every SSE frame to its durable store.
- On a stream that closes without `[DONE]`, transparently re-stream from `GET /responses/{id}?stream=true&starting_after=<seq>` (capped at 5 retries).
That's the whole chatbot delta — no new express route, no module-scope alias map, no UI message rewriting.
What the templates do
- Input dedup: LangGraph trusts the checkpointer (`agent.aget_state(config)`); OpenAI treats the session as authoritative whenever non-empty (`session.get_items()`). When the SDK already holds the prior turns, only the latest user message is forwarded. Lives in `agent_server/utils.py::deduplicate_input` per template.
- `scripts/grant_lakebase_permissions.py` (synced to all templates) grants on `agent_server.conversation_aliases` so the app's SP can read/write the new bridge-side alias table without `InsufficientPrivilege` on first POST.
- `LongRunningAgentServer` already exists in `databricks-ai-bridge` 0.19.0 (older flavor), so without the new release the templates degrade gracefully — durable resume / alias resolution stay dormant until 0.20.0 ships. We'll bump the dependency floor then.
What is NOT in this PR
- No express `/invocations` proxy. All its responsibilities collapsed into `databricksFetch` or moved server-side to the bridge.
- No `AsyncDatabricksSession.aget_tuple` overrides, no `DatabricksSaver.aget_tuple` overrides. The bridge contract is "agent owns its session/checkpointer state"; templates abide by it.
- No per-template logger wiring — the bridge attaches the `databricks_ai_bridge` stream handler in `LongRunningAgentServer.__init__` directly. Templates' `start_server.py` are back to a clean shape.
- No `databricks.yml` customizations. Bundle and app names stay at the template defaults.
Files
- `e2e-chatbot-app-next/packages/ai-sdk-providers/src/providers-server.ts`
- `agent-langgraph-advanced/agent_server/utils.py::deduplicate_input` and the `stream_handler` in `agent-langgraph-advanced/agent_server/agent.py`
- `agent-openai-advanced/agent_server/utils.py::deduplicate_input`
- `conversation_aliases` grant (synced to all templates): `scripts/source/grant_lakebase_permissions.py`
Testing the durable path before 0.20.0 ships
Two pieces of pre-release setup are needed if you want to exercise the new prose-recovery + alias resolution flow end-to-end (after 0.20.0 they go away):
- Pin the bridge to the `dhruv0811/durable-bridge-side-alias-resolution` branch via `[tool.uv.sources]` in either advanced template's `pyproject.toml`.
- Set `LONG_RUNNING_ENABLE_DEBUG_KILL=1` on the deployed app (env block in `databricks.yml`). This gates `POST /_debug/kill_task/{response_id}`, the test-only endpoint that simulates a pod crash mid-stream. Leave it unset in production.
Test plan
- `agent-langgraph-advanced` UI testing on `dhruv-lg-claude-durable` (Claude Sonnet 4.5) — multi-tool kill mid-`deep_research`, multi-turn ✅
- `agent-openai-advanced` UI testing on `dhruv-oai-gpt-durable` (GPT-5) — same matrix ✅
- `agent-openai-advanced` UI testing on `dhruv-oai-claude-durable` (Claude Sonnet 4.5) — same matrix ✅
- `dhruv-oai-gpt-durable` after a recovered turn — next turn correctly resolved through the new bridge alias table ✅
- `agent-openai-agents-sdk` as `dhruv-oai-base-newui` against the same chatbot branch — confirms the new `providers-server.ts` doesn't break templates that don't use durable execution ✅
Companion PRs
- Bridge #425 — `LongRunningAgentServer` durable prose-recovery + always-rotate (merged).
- Bridge #427 — server-side rotation alias resolution + bridge logger setup (open).
Known follow-ups (non-blocking)
- Bump the `databricks-ai-bridge` dependency floor in both advanced templates once a release containing #425 + #427 ships on PyPI.
- Revisit the `@databricks/ai-sdk-provider` package once the durable contract stabilizes.
Text.Only.-.Prose.mov
Tool.Calling.Multiturn.-.Prose.mov