LongRunningAgentServer: server-side rotation alias resolution by dhruv0811 · Pull Request #427 · databricks/databricks-ai-bridge

dhruv0811 · 2026-05-06T22:20:58Z

Summary

Follow-up to #425 with two changes that came out of UI testing the durable path against the chatbot in app-templates#207:

Server-side rotation alias resolution. Move the rotated conversation_id bookkeeping out of the chatbot's in-memory Map (lost on chatbot restart, blank on each new chatbot pod, missing in any new browser tab) and into a conversation_aliases table the bridge owns. Clients always send the original chat id; the bridge resolves it forward to the post-rotation SDK session. With the in-memory map, a user who crashed-and-resumed a conversation would silently lose history the next time their chatbot pod cycled — the next turn re-sent the original id and landed on the orphan-poisoned pre-rotation session.
Bridge logger setup in LongRunningAgentServer.__init__. Templates were each duplicating ~14 lines to attach a stream handler to the databricks_ai_bridge logger so durable lifecycle logs ([durable] resume…, stale-scan, etc.) reach app stdout — uvicorn's logging config drops INFO from non-uvicorn loggers. Move that into the bridge as _attach_bridge_logger() called from __init__. Idempotent. Opt out with DATABRICKS_AI_BRIDGE_LOG_QUIET=1. Addresses Bryan's review on app-templates#207.

Files

Component	File
`ConversationAlias` model + table	`long_running/models.py`
`resolve_conversation_alias` + `upsert_conversation_alias`	`long_running/repository.py`
Resolve at POST entry, upsert on rotate, rotation suffix change, logger attach	`long_running/server.py`
Tests	`tests/databricks_ai_bridge/test_long_running_{db,server}.py`

Notable details

Rotation suffix changed. From {base}::attempt-N to {base}::r-{response_id_short}-N. The old format would collide across turns (turn 1 attempt-2 and turn 2 attempt-2 both mint the same id, re-poisoning the just-rotated session). New test test_rotate_no_collision_across_turns_at_same_attempt covers this.
original_request keeps the BASE id, so multi-crash chains stay flat (chat-123 → chat-123::r-…-3, never chat-123::r-…-2::r-…-3).
response.resumed SSE sentinel still fires for visibility / debug — clients no longer need to act on it for cross-turn alias tracking.

Test plan

109 unit tests pass for long_running/
ruff check, ruff format --check clean
UI tested on dhruv-lg-claude-durable, dhruv-oai-gpt-durable, dhruv-oai-claude-durable: multi-tool kill mid-deep_research, multi-turn after resume, chatbot pod restart between turns ✅

Companion PR

app-templates#207 — drops the chatbot's conversationAliasMap, the duplicated bridge logger blocks in each template, and adds conversation_aliases to the Lakebase grant script.

Known follow-ups (non-blocking)

Prune conversation_aliases rows whose underlying responses are all in a terminal state (probably as part of the same scanner loop that handles stale heartbeats).
Once shipped on PyPI, drop the temp [tool.uv.sources] block in app-templates#207 and bump the dependency floor.

mlflow 3.12 (released 2026-05-05) flipped OSS-default trace logging from synchronous to asynchronous (mlflow/mlflow#22304). After the bump, ``get_trace(get_last_active_trace_id())`` reads the span store before the BatchSpanProcessor has flushed the just-ended span and returns None, so the next ``trace.search_spans(...)`` call raises: AttributeError: 'NoneType' object has no attribute 'search_spans' This shows up on both ``openai_test`` and ``langchain_test`` matrix variants on ``uv: highest`` (where mlflow resolves to 3.12); the ``lowest-direct`` and pinned-version variants resolve an older mlflow and pass. mlflow's own fix is to pass ``flush=True`` to ``get_trace``, which calls ``_flush_pending_async_trace_writes()`` and retries before returning. Two-line patch — no version pin, no production-code change. Surfaced first by #427 because it was the first CI run after the mlflow bump; main hasn't run CI since 2026-05-01 so it'd manifest there too. Co-authored-by: Isaac Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>

Companion to app-templates#207. Moves the rotated-conversation_id bookkeeping from the chatbot's in-memory `Map` (lost on chatbot pod restart, blank on each new chatbot pod, missing in any new browser tab) into a `conversation_aliases` table the bridge owns. Clients always send the original chat id; the bridge resolves it forward at request entry to the post-rotation SDK session. What changed: - New `ConversationAlias` model and `agent_server.conversation_aliases` table (PK base_conversation_id → current_conversation_id). - `repository.resolve_conversation_alias(base)` (cheap forward lookup, miss returns the input) and `repository.upsert_conversation_alias(...)` (Postgres ON CONFLICT DO UPDATE so the table stays one row per logical conversation regardless of rotation count). - `_handle_background_request` resolves on the way in: `original_request` keeps the BASE id (anchor for future rotations stays flat — `chat-123 → chat-123::r-…-3`, never `chat-123::r-…-2::r-…-3`); the dispatched copy uses the resolved current id so the SDK lands on the rotated session. - `_try_claim_and_resume` upserts `base → rotated` after rotating, so alias persistence survives both bridge and chatbot restarts. - Rotation suffix changed from `{base}::attempt-N` to `{base}::r-{response_id_short}-N` so a second turn that crashes at the same attempt number can't collide with a prior turn's already-rotated session and re-poison it. The new test `test_rotate_no_collision_across_turns_at_same_attempt` covers this. - Existing rotation-format assertions updated; resume-path tests mock `upsert_conversation_alias` so they don't try to hit the DB. - New unit tests for `resolve` (hit + passthrough miss) and `upsert` (verifies the SQL is INSERT ... ON CONFLICT and binds correctly). The `response.resumed` SSE sentinel still fires for visibility / debug; the chatbot no longer reads it (companion PR drops the alias capture). 109 unit tests pass. Co-authored-by: Isaac Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>

Templates currently each duplicate ~14 lines of `databricks_ai_bridge` logger setup in their start_server.py to surface durable-execution lifecycle logs (uvicorn's logging config drops INFO from non-uvicorn loggers, so without an explicit handler the breadcrumbs are silently swallowed and a deployed app looks dead even when working correctly). Move that to a single `_attach_bridge_logger()` called from LongRunningAgentServer.__init__. Idempotent — leaves existing handlers alone — so a consumer that ships their own logging config wins. Set DATABRICKS_AI_BRIDGE_LOG_QUIET=1 to opt out. Addresses Bryan's review on app-templates#207: "we can just gate this logging on an env var, but it feels like this belongs in ai-bridge instead." Companion app-templates change deletes the duplicated blocks. Co-authored-by: Isaac Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>

Setting propagate=False on the parent `databricks_ai_bridge` logger silently broke any test using caplog on a child logger (test_model_serving_obo_credential_strategy::test_logging_statements fails as soon as ANY test in the suite instantiates LongRunningAgentServer, since the logger is a process-level singleton). Propagation can stay on: in uvicorn deployments the root logger has no handlers, so propagation produces no double-log; in pytest, propagation is what lets caplog capture in the first place. Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>

This was referenced May 6, 2026

Durable execution: minimal advanced-template + chatbot integration databricks/app-templates#207

Open

test_vector_search_retriever_tool: flush async traces before assertion #428

Merged

dhruv0811 added 3 commits May 7, 2026 00:21

dhruv0811 force-pushed the dhruv0811/durable-bridge-side-alias-resolution branch from ab9ffb5 to d931140 Compare May 7, 2026 00:21

bbqiu approved these changes May 11, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LongRunningAgentServer: server-side rotation alias resolution#427

LongRunningAgentServer: server-side rotation alias resolution#427
dhruv0811 wants to merge 3 commits into
mainfrom
dhruv0811/durable-bridge-side-alias-resolution

dhruv0811 commented May 6, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dhruv0811 commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Files

Notable details

Test plan

Companion PR

Known follow-ups (non-blocking)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dhruv0811 commented May 6, 2026 •

edited

Loading