
Durable execution: minimal advanced-template + chatbot integration#207

Open
dhruv0811 wants to merge 9 commits into main from prose-min-templates

Conversation

Contributor

@dhruv0811 dhruv0811 commented May 1, 2026

Integ test run: https://github.com/databricks-eng/ai-oss-integration-tests-runner/actions/runs/25755599119

Summary

Wires the advanced templates (agent-langgraph-advanced, agent-openai-advanced) and the e2e chatbot up to the bridge's LongRunningAgentServer so an in-flight agent run survives a pod restart without losing the user's turn. Companion to bridge #425 (merged) and #427 (open — server-side rotation alias resolution + bridge logger setup).

This is a re-imagined, minimal version of #204. The keystone simplification: the express /invocations proxy is gone. Vercel AI SDK's databricksFetch (in providers-server.ts) is already the single boundary every agent request flows through, so the only durable-execution glue we need (background: true injection + auto-resume on stream close) lives there. Conversation-rotation tracking — which was the most complex piece in the prior PR — is now handled server-side by the bridge (#427); the chatbot doesn't need to know rotation exists.

What the chatbot does (e2e-chatbot-app-next)

In packages/ai-sdk-providers/src/providers-server.ts:

  • Inject body.background = true on streaming requests when API_PROXY points at a long-running server, so the server persists every SSE frame to its durable store.
  • On a stream that closes without [DONE], transparently re-stream from GET /responses/{id}?stream=true&starting_after=<seq> (capped at 5 retries).

That's the whole chatbot delta — no new express route, no module-scope alias map, no UI message rewriting.
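The actual glue lives in TypeScript (`providers-server.ts`), but the resume contract is language-neutral: track the last sequence number seen on the durable SSE stream, and on a close without `[DONE]` re-fetch from that cursor. Here is a minimal Python sketch of that contract; the frame shape (a `sequence_number` field on each event) and the helper names are assumptions for illustration, not the real implementation.

```python
import json

DONE_SENTINEL = "[DONE]"
MAX_RESUME_ATTEMPTS = 5  # matches the 5-retry cap described above


def resume_url(base_url: str, response_id: str, last_seq: int) -> str:
    """URL shape used to re-stream missed frames after a dropped connection."""
    return f"{base_url}/responses/{response_id}?stream=true&starting_after={last_seq}"


def scan_stream(payloads):
    """Consume SSE `data:` payloads, returning (saw_done, last_seq).

    If `saw_done` is False when the stream closes, the caller would retry
    via `resume_url(...)` up to MAX_RESUME_ATTEMPTS times.
    """
    saw_done, last_seq = False, 0
    for payload in payloads:
        if payload.strip() == DONE_SENTINEL:
            saw_done = True
            break
        event = json.loads(payload)
        last_seq = event.get("sequence_number", last_seq)
    return saw_done, last_seq
```

A stream that closes after sequence 2 without `[DONE]` would be resumed from `.../responses/{id}?stream=true&starting_after=2`, so already-delivered frames are never replayed.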

What the templates do

  • Cross-turn UI-echo dedup. Both advanced templates dedupe the chatbot's full-history echo against the SDK's own state. LangGraph asks the checkpointer (agent.aget_state(config)); OpenAI treats the session as authoritative whenever non-empty (session.get_items()). When the SDK already holds the prior turns, only the latest user message is forwarded. Lives in agent_server/utils.py::deduplicate_input per template.
  • Lakebase grant script knows about the new table. scripts/grant_lakebase_permissions.py (synced to all templates) grants on agent_server.conversation_aliases so the app's SP can read/write the new bridge-side alias table without InsufficientPrivilege on first POST.
  • No bridge dependency pin. LongRunningAgentServer already exists in databricks-ai-bridge 0.19.0 (older flavor), so without the new release the templates degrade gracefully — durable resume / alias resolution stay dormant until 0.20.0 ships. We'll bump the dependency floor then.
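The first bullet's session-is-authoritative heuristic can be sketched as follows. This is a simplified illustration of the OpenAI-template `deduplicate_input`, not the shipped helper; the message representation is an assumption.

```python
def deduplicate_input(messages, session_items):
    """Dedupe the chatbot's full-history UI echo against SDK session state.

    If the session already holds any prior turns, it is treated as
    authoritative for cross-turn history and only the latest message is
    forwarded. On the first turn (empty session) the full input is
    returned so offline evaluation still sees complete history.
    """
    if session_items:  # session covers cross-turn history -> drop the UI echo
        return messages[-1:]
    return messages
```

The LangGraph variant follows the same shape but consults the checkpointer (`agent.aget_state(config)`) instead of counting session items.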

What is NOT in this PR

  • No express proxy. The previous iteration added a 272-line /invocations proxy. All its responsibilities collapsed into databricksFetch or moved server-side to the bridge.
  • No per-SDK adapter wrappers. No AsyncDatabricksSession.aget_tuple overrides, no DatabricksSaver.aget_tuple overrides. The bridge contract is "agent owns its session/checkpointer state"; templates abide by it.
  • No bridge logger setup in templates. Bridge #427 attaches the databricks_ai_bridge stream handler in LongRunningAgentServer.__init__ directly. Templates' start_server.py are back to a clean shape.
  • No bundled databricks.yml customizations. Bundle and app names stay at the template defaults.

Files

| Component | File |
| --- | --- |
| Background injection + auto-resume | `e2e-chatbot-app-next/packages/ai-sdk-providers/src/providers-server.ts` |
| LangGraph UI-echo dedup helper | `agent-langgraph-advanced/agent_server/utils.py::deduplicate_input` |
| LangGraph dedup wired into stream_handler | `agent-langgraph-advanced/agent_server/agent.py` |
| OpenAI UI-echo dedup heuristic (session is authoritative) | `agent-openai-advanced/agent_server/utils.py::deduplicate_input` |
| Lakebase grant for conversation_aliases (synced to all templates) | `.scripts/source/grant_lakebase_permissions.py` |

Testing the durable path before 0.20.0 ships

Two pieces of pre-release setup are needed if you want to exercise the new prose-recovery + alias resolution flow end-to-end (after 0.20.0 they go away):

  1. Pin to bridge dhruv0811/durable-bridge-side-alias-resolution branch via [tool.uv.sources] in either advanced template's pyproject.toml:
    [tool.uv.sources]
    databricks-ai-bridge = { git = "https://github.com/databricks/databricks-ai-bridge.git", branch = "dhruv0811/durable-bridge-side-alias-resolution" }
  2. Enable the debug kill endpoint. Set LONG_RUNNING_ENABLE_DEBUG_KILL=1 on the deployed app (env block in databricks.yml). This gates POST /_debug/kill_task/{response_id}, the test-only endpoint that simulates a pod crash mid-stream. Leave it unset in production.

Test plan

  • agent-langgraph-advanced UI testing on dhruv-lg-claude-durable (Claude Sonnet 4.5) — multi-tool kill mid-deep_research, multi-turn ✅
  • agent-openai-advanced UI testing on dhruv-oai-gpt-durable (GPT-5) — same matrix ✅
  • agent-openai-advanced UI testing on dhruv-oai-claude-durable (Claude Sonnet 4.5) — same matrix ✅
  • Restart-path test: stopped + restarted dhruv-oai-gpt-durable after a recovered turn — next turn correctly resolved through the new bridge alias table ✅
  • Base-template back-test: deployed unmodified agent-openai-agents-sdk as dhruv-oai-base-newui against the same chatbot branch — confirms the new providers-server.ts doesn't break templates that don't use durable execution ✅

Companion PRs

  • databricks-ai-bridge #425 (merged) — durable-execution support, now on bridge main
  • databricks-ai-bridge #427 (open) — server-side rotation alias resolution + bridge logger setup

Known follow-ups (non-blocking)

  • Bump the databricks-ai-bridge dependency floor in both advanced templates once a release containing #425 + #427 ships on PyPI.
  • Lift the chatbot's auto-resume into the upstream @databricks/ai-sdk-provider package once the durable contract stabilizes.
Demo recordings:
  • Text.Only.-.Prose.mov
  • Tool.Calling.Multiturn.-.Prose.mov

dhruv0811 added 3 commits May 1, 2026 21:25
Lets the advanced templates use `LongRunningAgentServer` end-to-end so an
in-flight agent run survives a pod restart without losing the user's turn.

Chatbot side (`e2e-chatbot-app-next/packages/ai-sdk-providers/src/providers-server.ts`):
the AI SDK's custom `databricksFetch` is the single boundary every agent
request flows through, so we put all the durable-execution glue here:
  - inject `background: true` on streaming requests when `API_PROXY` points
    at a long-running server,
  - capture the rotated `conversation_id` emitted in `response.resumed`
    sentinels and replay it on subsequent turns so the next request lands
    on the post-resume session,
  - on a stream that closes without `[DONE]`, transparently re-stream from
    `GET /responses/{id}?stream=true&starting_after=<seq>` (capped retries).

Template side:
  - factor the LangGraph dedup helper into `agent_server/utils.py` as
    `deduplicate_input` (matches the OpenAI template's existing helper),
    and call it from `stream_handler` so checkpointer-backed history
    isn't double-counted with the chatbot's UI-echoed history,
  - pin `databricks-ai-bridge` to the durable-execution branch via
    `[tool.uv.sources]` until that work ships in a stable release.

Plumbing:
  - `start_app.py` now honors `APP_TEMPLATES_BRANCH` (default `main`) when
    cloning the chatbot, so a non-main branch can be tested end-to-end.
Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
Uvicorn's default logging config drops INFO records from non-uvicorn
loggers, which silently swallows all the durable-execution lifecycle
breadcrumbs (task spawn, resume, prose-recovery build, terminal-status
CAS, stale-scan claims). When debugging a deployed app the only signal
left was raw uvicorn access logs — not enough to tell whether the
durable path was even firing.

Attach a stream handler to the `databricks_ai_bridge` logger explicitly
so its lifecycle logs reach app stdout. Long-term this belongs in the
bridge's `LongRunningAgentServer.__init__`, but doing it in the
templates means we don't have to wait for a bridge release.
The prior heuristic compared `session_items >= messages - 1` to decide
whether to forward only the latest user message. Under prose-recovery
+ always-rotate the rotated session has FEWER items than the chatbot's
accumulated UI echo (attempt N+1's session is fresh while the UI
accumulated events from both attempts), so the heuristic was returning
all messages — duplicates of attempt N+1's tool_calls plus the orphan
tool_use from attempt N.

The Runner then combined session+input, producing duplicate function_call
items that the OpenAI SDK grouped into a malformed assistant.tool_calls
block. Anthropic-backed models (databricks-claude-*) rejected the
request with 400 "tool_use ids were found without tool_result blocks
immediately after". gpt-* tolerated it; LangGraph templates were unaffected
because their dedup uses checkpointer state, not a count heuristic.

Fix: if the session has any items at all, treat it as authoritative for
cross-turn history and forward only the latest message. First-turn path
(empty session) still returns the full input so MLflow evaluation works.

This was originally fixed in the prior templates PR (commit 31d87d6,
"agent-openai-advanced: trust session as authoritative for cross-turn
dedup") and was inadvertently dropped when re-imagining the PR from main.
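A hedged numeric sketch of the two heuristics makes the failure mode concrete (function and message names are illustrative only): under always-rotate, attempt N+1's fresh session holds fewer items than the chatbot's accumulated echo, so the count comparison forwards everything.

```python
def old_heuristic(messages, session_items):
    # prior count-based check: assume the session covers history only when
    # it holds at least len(messages) - 1 items
    if len(session_items) >= len(messages) - 1:
        return messages[-1:]
    return messages


def new_heuristic(messages, session_items):
    # fixed check: any non-empty session is authoritative
    return messages[-1:] if session_items else messages


# UI echo accumulated across both attempts of turn 2 plus the new user turn:
echo = ["u1", "a1", "u2", "a2_tool_call", "a2", "u3"]   # 6 messages
# freshly rotated session only holds the post-resume turn:
session = ["u2", "a2"]                                   # 2 items
```

With these counts, `2 >= 6 - 1` is false, so the old heuristic forwards all six messages (duplicating attempt N+1's tool calls); the fixed heuristic forwards only `"u3"`.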
@dhruv0811 dhruv0811 force-pushed the prose-min-templates branch from cfba80e to 6e3b3be Compare May 1, 2026 21:59
@dhruv0811 dhruv0811 requested a review from bbqiu May 1, 2026 22:08
dhruv0811 added a commit that referenced this pull request May 1, 2026
#425 merged so the durable-execution bits are on bridge main now. Pin
both advanced templates to the bridge main branch (instead of the now-
deleted feature branch); remove this block entirely once a release ships.

Also revert the APP_TEMPLATES_BRANCH support in start_app.py — that was
a testing-only hack to let deployed templates clone the chatbot from a
non-main branch while #207 was open. Once #207 lands, mainline is the
right default.

Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
@dhruv0811 dhruv0811 force-pushed the prose-min-templates branch from 754242e to d13064e Compare May 1, 2026 22:12
#425 merged so the durable-execution bits are on bridge main now.
Templates only need PyPI databricks-ai-bridge — `LongRunningAgentServer`
already exists in 0.19.0 (older flavor), so without the new release the
templates degrade gracefully (chatbot's alias capture stays dormant
because no `response.resumed` sentinels arrive). To exercise the new
prose-recovery features end-to-end before 0.20.0 ships, add a temporary
`[tool.uv.sources]` block pinning the bridge to its main branch.

Also revert the APP_TEMPLATES_BRANCH support in start_app.py — that was
a testing-only hack to let deployed templates clone the chatbot from a
non-main branch while #207 was open. Once #207 lands, mainline is the
right default.
@dhruv0811 dhruv0811 force-pushed the prose-min-templates branch from d13064e to 432d654 Compare May 1, 2026 22:13
Comment thread agent-langgraph-advanced/agent_server/start_server.py Outdated
Companion to bridge PR adding `conversation_aliases` table. With
server-side resolution the chatbot always sends the original chat.id as
conversation_id and the bridge maps it to the post-rotation SDK session
internally. This makes the durable path safe across:
  - chatbot server restarts (the prior in-memory Map evaporated)
  - multi-pod chatbot deployments (each pod started with its own blank Map)
  - new browser tabs opening a chat that was rotated by a different session

What stays:
  - background:true injection on streaming requests when API_PROXY is set
  - auto-resume on stream close-without-DONE (capped retries)
What goes:
  - module-scope `conversationAliasMap`
  - `captureRotation` helper
  - alias substitution in databricksFetch body mutation
  - `originalChatId` plumbing through to `wrapDurableSseStream`

The `response.resumed` SSE sentinel keeps flowing for visibility / debug
but the chatbot no longer reads it.

Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
dhruv0811 added 2 commits May 6, 2026 22:40
Companion to bridge#427 which adds the `agent_server.conversation_aliases`
table for server-side rotation alias resolution. Templates' grant script
needs to know about it so the app's SP gets read/write access on first
deploy; otherwise the bridge fails on every POST with
`InsufficientPrivilege: permission denied for table conversation_aliases`.

Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
Bridge PR moves `_attach_bridge_logger()` into `LongRunningAgentServer.__init__`,
so the duplicated 14-line block in each template's start_server.py is no
longer needed. Idempotent on the bridge side; opt out with
`DATABRICKS_AI_BRIDGE_LOG_QUIET=1`.

Addresses Bryan's review on #207.

Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
Contributor

@bbqiu bbqiu left a comment


thanks for doing this! overall looks super close, but we may wanna revisit the comments about the UI resumption strat

Comment thread e2e-chatbot-app-next/packages/ai-sdk-providers/src/providers-server.ts Outdated
Comment thread agent-openai-advanced/agent_server/utils.py Outdated

// Create a new response with the logging stream
return new Response(loggingStream, {
const wrapped = wrapDurableSseStream(
Contributor


do we want to do this always? or should we only do this for templates that have API_PROXY set in the response?

also is there a way for us to just replay from the beginning instead of tracking sequence numbers? i remember either the ai-sdk or the frontend logic handles dedupes for us

Contributor Author


We have dedupe logic for tool calls + tool results, but critically NOT for text-deltas, reasoning-deltas, start-step chunks. I think the cursor solution is the cleanest here

dhruv0811 added 2 commits May 12, 2026 18:20
- Gate the SSE wrap on API_PROXY at the call site, so standard Databricks
  serving endpoints skip the wrap entirely instead of paying per-frame
  parse cost. The resume branch can't fire there anyway.
- Pass only the Authorization header to the resume fetch — it's a GET,
  and copying the whole request init was carrying along stale content
  headers + the mlflow trace header that don't apply.
- Read API_PROXY at call sites (`getApiProxy()`) instead of capturing
  at module load so unit tests can flip it per-case via process.env.
- Drop the misleading rationale comment in agent-openai-advanced's
  deduplicate_input — the code is self-explanatory.
- Add tests in tests/ai-sdk-provider/durable-fetch.test.ts covering:
    * background:true injection only when API_PROXY set + stream=true
    * non-streaming + no-API_PROXY paths leave background alone
    * SSE wrap fires GET retrieve with starting_after=<lastSeq> on
      close-without-DONE, including the URL shape
    * resume request carries ONLY Authorization (no content-type,
      mlflow trace, or other request headers)
    * SSE responses are NOT wrapped when API_PROXY is unset

Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
@dhruv0811 dhruv0811 requested a review from bbqiu May 13, 2026 00:45