Durable execution: minimal advanced-template + chatbot integration #207
Open
dhruv0811 wants to merge 9 commits into
Conversation
Lets the advanced templates use `LongRunningAgentServer` end-to-end so an
in-flight agent run survives a pod restart without losing the user's turn.
Chatbot side (`e2e-chatbot-app-next/packages/ai-sdk-providers/src/providers-server.ts`):
the AI SDK's custom `databricksFetch` is the single boundary every agent
request flows through, so we put all the durable-execution glue here:
- inject `background: true` on streaming requests when `API_PROXY` points
at a long-running server,
- capture the rotated `conversation_id` emitted in `response.resumed`
sentinels and replay it on subsequent turns so the next request lands
on the post-resume session,
- on a stream that closes without `[DONE]`, transparently re-stream from
`GET /responses/{id}?stream=true&starting_after=<seq>` (capped retries).
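The actual glue is TypeScript inside `databricksFetch`; as a rough illustration of the same request/resume protocol, here is a minimal Python sketch. The endpoint shapes (`POST` with `background: true`, `GET /responses/{id}?stream=true&starting_after=<seq>`) follow the description above, while the base URL, token, POST path, event field names (`response.id`, `sequence_number`), and the retry cap are placeholder assumptions.

```python
import json
import requests  # assumed HTTP client; the real chatbot glue uses TypeScript fetch

BASE = "https://<app-url>"            # hypothetical long-running agent server URL
HEADERS = {"Authorization": "Bearer <token>"}
MAX_RESUME_RETRIES = 5                # capped retries, as described above


def stream_with_resume(body: dict):
    """POST a streaming request with background=true; resume if the stream drops."""
    body = {**body, "background": True, "stream": True}
    # POST path is an assumption; API_PROXY would normally route this request.
    resp = requests.post(f"{BASE}/responses", json=body, headers=HEADERS, stream=True)

    response_id, last_seq, retries = None, 0, 0
    while True:
        for line in resp.iter_lines():
            if not line.startswith(b"data: "):
                continue
            payload = line[len(b"data: "):]
            if payload == b"[DONE]":
                return  # clean finish, nothing to resume
            event = json.loads(payload)
            # Field names are assumptions; events carry the response id and a sequence cursor.
            response_id = event.get("response", {}).get("id", response_id)
            last_seq = event.get("sequence_number", last_seq)
            yield event
        # Stream closed without [DONE]: re-attach from the last seen cursor.
        if response_id is None or retries >= MAX_RESUME_RETRIES:
            return
        retries += 1
        resp = requests.get(
            f"{BASE}/responses/{response_id}",
            params={"stream": "true", "starting_after": last_seq},
            headers={"Authorization": HEADERS["Authorization"]},  # resume GET carries only auth
            stream=True,
        )
```

The cursor matters because the UI does not dedupe text/reasoning deltas (see the review discussion below); replaying from the beginning would double-render prose.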
Template side:
- factor the LangGraph dedup helper into `agent_server/utils.py` as
  `deduplicate_input` (matches the OpenAI template's existing helper),
  and call it from `stream_handler` so checkpointer-backed history
  isn't double-counted with the chatbot's UI-echoed history (see the
  sketch after this list),
- pin `databricks-ai-bridge` to the durable-execution branch via
`[tool.uv.sources]` until that work ships in a stable release.
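For the LangGraph side, a minimal sketch of what such a `deduplicate_input` helper could look like is below; the real helper in `agent_server/utils.py` may use a different signature and comparison strategy. It only illustrates trusting checkpointer-backed history (e.g. from `await agent.aget_state(config)`) over the chatbot's UI echo.

```python
from typing import Any


def deduplicate_input(
    ui_messages: list[dict[str, Any]],
    checkpointed_messages: list[Any],
) -> list[dict[str, Any]]:
    """Return only the input the checkpointer hasn't already seen.

    ui_messages:           full history echoed by the chatbot UI each turn
    checkpointed_messages: history already persisted by the LangGraph checkpointer
    """
    if not checkpointed_messages:
        # First turn (or no checkpointer state): forward the whole input.
        return ui_messages
    # Cross-turn history is already in the checkpointer; forward only the
    # latest user message so it isn't double-counted.
    return [m for m in ui_messages if m.get("role") == "user"][-1:]
```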
Plumbing:
- `start_app.py` now honors `APP_TEMPLATES_BRANCH` (default `main`) when
cloning the chatbot, so a non-main branch can be tested end-to-end.
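A sketch of the branch-selection behavior described above, assuming a hypothetical repository URL and a plain `git clone`; the real `start_app.py` may wire this differently.

```python
import os
import subprocess

CHATBOT_REPO = "https://github.com/<org>/<chatbot-repo>.git"  # placeholder URL


def clone_chatbot(dest: str) -> None:
    # Default to main; override with APP_TEMPLATES_BRANCH for end-to-end testing.
    branch = os.environ.get("APP_TEMPLATES_BRANCH", "main")
    subprocess.run(
        ["git", "clone", "--depth", "1", "--branch", branch, CHATBOT_REPO, dest],
        check=True,
    )
```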
Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
Uvicorn's default logging config drops INFO records from non-uvicorn loggers, which silently swallows all the durable-execution lifecycle breadcrumbs (task spawn, resume, prose-recovery build, terminal-status CAS, stale-scan claims). When debugging a deployed app the only signal left was raw uvicorn access logs — not enough to tell whether the durable path was even firing. Attach a stream handler to the `databricks_ai_bridge` logger explicitly so its lifecycle logs reach app stdout. Long-term this belongs in the bridge's `LongRunningAgentServer.__init__`, but doing it in the templates means we don't have to wait for a bridge release.
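A minimal sketch of that handler attachment; the format string and level are assumptions, and the handler check keeps it idempotent so reloads don't double-log.

```python
import logging
import sys

bridge_logger = logging.getLogger("databricks_ai_bridge")
bridge_logger.setLevel(logging.INFO)
if not bridge_logger.handlers:
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    bridge_logger.addHandler(handler)
```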
The prior heuristic compared `session_items >= messages - 1` to decide whether to forward only the latest user message. Under prose-recovery + always-rotate the rotated session has FEWER items than the chatbot's accumulated UI echo (attempt N+1's session is fresh while the UI accumulated events from both attempts), so the heuristic was returning all messages — duplicates of attempt N+1's tool_calls plus the orphan tool_use from attempt N.
The Runner then combined session+input, producing duplicate function_call items that the OpenAI SDK grouped into a malformed assistant.tool_calls block. Anthropic-backed models (databricks-claude-*) rejected the request with 400 "tool_use ids were found without tool_result blocks immediately after". gpt-* tolerated it; LangGraph templates were unaffected because their dedup uses checkpointer state, not a count heuristic.
Fix: if the session has any items at all, treat it as authoritative for cross-turn history and forward only the latest message. First-turn path (empty session) still returns the full input so MLflow evaluation works.
This was originally fixed in the prior templates PR (commit 31d87d6, "agent-openai-advanced: trust session as authoritative for cross-turn dedup") and was inadvertently dropped when re-imagining the PR from main.
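In code, the fixed rule is roughly the sketch below. The real `deduplicate_input` in `agent-openai-advanced/agent_server/utils.py` may differ in signature and in how it extracts the latest user message; `session.get_items()` is the session call named in the PR description.

```python
async def deduplicate_input(input_items: list[dict], session) -> list[dict]:
    """Trust a non-empty SDK session as authoritative for cross-turn history."""
    session_items = await session.get_items()
    if session_items:
        # The session already holds prior turns (including any orphaned tool_use
        # from a recovered attempt); forwarding the full UI echo would duplicate them.
        return [m for m in input_items if m.get("role") == "user"][-1:]
    # Empty session: forward everything so first-turn / MLflow evaluation flows keep working.
    return input_items
```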
Force-pushed from cfba80e to 6e3b3be
dhruv0811 added a commit that referenced this pull request on May 1, 2026
#425 merged so the durable-execution bits are on bridge main now. Pin both advanced templates to the bridge main branch (instead of the now-deleted feature branch); remove this block entirely once a release ships. Also revert the APP_TEMPLATES_BRANCH support in start_app.py — that was a testing-only hack to let deployed templates clone the chatbot from a non-main branch while #207 was open. Once #207 lands, mainline is the right default.
Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
Force-pushed from 754242e to d13064e
#425 merged so the durable-execution bits are on bridge main now. Templates only need PyPI databricks-ai-bridge — `LongRunningAgentServer` already exists in 0.19.0 (older flavor), so without the new release the templates degrade gracefully (chatbot's alias capture stays dormant because no `response.resumed` sentinels arrive). To exercise the new prose-recovery features end-to-end before 0.20.0 ships, add a temporary `[tool.uv.sources]` block pinning the bridge to its main branch. Also revert the APP_TEMPLATES_BRANCH support in start_app.py — that was a testing-only hack to let deployed templates clone the chatbot from a non-main branch while #207 was open. Once #207 lands, mainline is the right default.
Force-pushed from d13064e to 432d654
dhruv0811 commented on May 1, 2026
Companion to bridge PR adding `conversation_aliases` table. With server-side resolution the chatbot always sends the original chat.id as conversation_id and the bridge maps it to the post-rotation SDK session internally. This makes the durable path safe across:
- chatbot server restarts (the prior in-memory Map evaporated)
- multi-pod chatbot deployments (each pod started with its own blank Map)
- new browser tabs opening a chat that was rotated by a different session
What stays:
- background:true injection on streaming requests when API_PROXY is set
- auto-resume on stream close-without-DONE (capped retries)
What goes:
- module-scope `conversationAliasMap`
- `captureRotation` helper
- alias substitution in databricksFetch body mutation
- `originalChatId` plumbing through to `wrapDurableSseStream`
The `response.resumed` SSE sentinel keeps flowing for visibility / debug but the chatbot no longer reads it.
Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
Companion to bridge #427 which adds the `agent_server.conversation_aliases` table for server-side rotation alias resolution. Templates' grant script needs to know about it so the app's SP gets read/write access on first deploy; otherwise the bridge fails on every POST with `InsufficientPrivilege: permission denied for table conversation_aliases`.
Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
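A sketch of the kind of statement the grant script needs to add, assuming a psycopg connection to the Lakebase Postgres instance and the app service principal's ID as the role name; the real script's connection handling and privilege list may differ.

```python
import psycopg  # assumed Postgres driver for the Lakebase instance

GRANT_SQL = """
GRANT SELECT, INSERT, UPDATE, DELETE
ON TABLE agent_server.conversation_aliases
TO "{app_sp_id}";
"""


def grant_alias_table(conn: psycopg.Connection, app_sp_id: str) -> None:
    """Give the app's service principal read/write on the bridge alias table."""
    with conn.cursor() as cur:
        cur.execute(GRANT_SQL.format(app_sp_id=app_sp_id))
    conn.commit()
```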
Bridge PR moves `_attach_bridge_logger()` into `LongRunningAgentServer.__init__`, so the duplicated 14-line block in each template's start_server.py is no longer needed. Idempotent on the bridge side; opt out with `DATABRICKS_AI_BRIDGE_LOG_QUIET=1`. Addresses Bryan's review on #207. Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
bbqiu reviewed May 11, 2026
bbqiu (Contributor) left a comment
thanks for doing this! overall looks super close, but we may wanna revisit the comments about the UI resumption strat
Code context (`providers-server.ts`):
    // Create a new response with the logging stream
    return new Response(loggingStream, {
    const wrapped = wrapDurableSseStream(
bbqiu (Contributor):
do we want to do this always? or should we only do this for templates that have API_PROXY set in the response?
also is there a way for us to just replay from the beginning instead of tracking sequence numbers? i remember either the ai-sdk or the frontend logic handles dedupes for us
dhruv0811 (Author):
We have dedupe logic for tool calls + tool results, but critically NOT for text-deltas, reasoning-deltas, start-step chunks. I think the cursor solution is the cleanest here
- Gate the SSE wrap on API_PROXY at the call site, so standard Databricks
serving endpoints skip the wrap entirely instead of paying per-frame
parse cost. The resume branch can't fire there anyway.
- Pass only the Authorization header to the resume fetch — it's a GET,
and copying the whole request init was carrying along stale content
headers + the mlflow trace header that don't apply.
- Read API_PROXY at call sites (`getApiProxy()`) instead of capturing
at module load so unit tests can flip it per-case via process.env.
- Drop the misleading rationale comment in agent-openai-advanced's
deduplicate_input — the code is self-explanatory.
- Add tests in tests/ai-sdk-provider/durable-fetch.test.ts covering:
* background:true injection only when API_PROXY set + stream=true
* non-streaming + no-API_PROXY paths leave background alone
* SSE wrap fires GET retrieve with starting_after=<lastSeq> on
close-without-DONE, including the URL shape
* resume request carries ONLY Authorization (no content-type,
mlflow trace, or other request headers)
* SSE responses are NOT wrapped when API_PROXY is unset
Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
Integ test run: https://github.com/databricks-eng/ai-oss-integration-tests-runner/actions/runs/25755599119
Summary
Wires the advanced templates (`agent-langgraph-advanced`, `agent-openai-advanced`) and the e2e chatbot up to the bridge's `LongRunningAgentServer` so an in-flight agent run survives a pod restart without losing the user's turn. Companion to bridge #425 (merged) and #427 (open — server-side rotation alias resolution + bridge logger setup). This is a re-imagined, minimal version of #204.
The keystone simplification: the express `/invocations` proxy is gone. Vercel AI SDK's `databricksFetch` (in `providers-server.ts`) is already the single boundary every agent request flows through, so the only durable-execution glue we need (`background: true` injection + auto-resume on stream close) lives there. Conversation-rotation tracking — which was the most complex piece in the prior PR — is now handled server-side by the bridge (#427); the chatbot doesn't need to know rotation exists.
What the chatbot does (`e2e-chatbot-app-next`)
In `packages/ai-sdk-providers/src/providers-server.ts`:
- `body.background = true` on streaming requests when `API_PROXY` points at a long-running server, so the server persists every SSE frame to its durable store.
- On a stream that closes without `[DONE]`, transparently re-stream from `GET /responses/{id}?stream=true&starting_after=<seq>` (capped at 5 retries).
That's the whole chatbot delta — no new express route, no module-scope alias map, no UI message rewriting.
What the templates do
- Input dedup: LangGraph trusts the checkpointer (`agent.aget_state(config)`); OpenAI treats the session as authoritative whenever non-empty (`session.get_items()`). When the SDK already holds the prior turns, only the latest user message is forwarded. Lives in `agent_server/utils.py::deduplicate_input` per template.
- `scripts/grant_lakebase_permissions.py` (synced to all templates) grants on `agent_server.conversation_aliases` so the app's SP can read/write the new bridge-side alias table without `InsufficientPrivilege` on first POST.
- `LongRunningAgentServer` already exists in `databricks-ai-bridge` 0.19.0 (older flavor), so without the new release the templates degrade gracefully — durable resume / alias resolution stay dormant until 0.20.0 ships. We'll bump the dependency floor then.
What is NOT in this PR
- No express `/invocations` proxy. All its responsibilities collapsed into `databricksFetch` or moved server-side to the bridge.
- No `AsyncDatabricksSession.aget_tuple` overrides, no `DatabricksSaver.aget_tuple` overrides. The bridge contract is "agent owns its session/checkpointer state"; templates abide by it.
- No per-template logger wiring — the bridge attaches the `databricks_ai_bridge` stream handler in `LongRunningAgentServer.__init__` directly. Templates' `start_server.py` are back to a clean shape.
- No `databricks.yml` customizations. Bundle and app names stay at the template defaults.
Files
- `e2e-chatbot-app-next/packages/ai-sdk-providers/src/providers-server.ts`
- `agent-langgraph-advanced/agent_server/utils.py::deduplicate_input` and the `stream_handler` in `agent-langgraph-advanced/agent_server/agent.py`
- `agent-openai-advanced/agent_server/utils.py::deduplicate_input`
- `conversation_aliases` grant (synced to all templates): `scripts/source/grant_lakebase_permissions.py`
Testing the durable path before 0.20.0 ships
Two pieces of pre-release setup are needed if you want to exercise the new prose-recovery + alias resolution flow end-to-end (after 0.20.0 they go away):
- Pin the bridge to the `dhruv0811/durable-bridge-side-alias-resolution` branch via `[tool.uv.sources]` in either advanced template's `pyproject.toml`.
- Set `LONG_RUNNING_ENABLE_DEBUG_KILL=1` on the deployed app (env block in `databricks.yml`). This gates `POST /_debug/kill_task/{response_id}`, the test-only endpoint that simulates a pod crash mid-stream. Leave it unset in production.
Test plan
- `agent-langgraph-advanced` UI testing on `dhruv-lg-claude-durable` (Claude Sonnet 4.5) — multi-tool kill mid-`deep_research`, multi-turn ✅
- `agent-openai-advanced` UI testing on `dhruv-oai-gpt-durable` (GPT-5) — same matrix ✅
- `agent-openai-advanced` UI testing on `dhruv-oai-claude-durable` (Claude Sonnet 4.5) — same matrix ✅
- `dhruv-oai-gpt-durable` after a recovered turn — next turn correctly resolved through the new bridge alias table ✅
- `agent-openai-agents-sdk` as `dhruv-oai-base-newui` against the same chatbot branch — confirms the new `providers-server.ts` doesn't break templates that don't use durable execution ✅
Companion PRs
- Bridge #425 — `LongRunningAgentServer` durable prose-recovery + always-rotate (merged).
- Bridge #427 — server-side rotation alias resolution + bridge logger setup (open).
Known follow-ups (non-blocking)
- Bump the `databricks-ai-bridge` dependency floor in both advanced templates once a release containing #425 + #427 ships on PyPI.
- Revisit the `@databricks/ai-sdk-provider` package once the durable contract stabilizes.
Text.Only.-.Prose.mov
Tool.Calling.Multiturn.-.Prose.mov