LongRunningAgentServer: server-side rotation alias resolution#427
Open
dhruv0811 wants to merge 3 commits into
Open
LongRunningAgentServer: server-side rotation alias resolution#427dhruv0811 wants to merge 3 commits into
dhruv0811 wants to merge 3 commits into
Conversation
This was referenced May 6, 2026
Open
dhruv0811
added a commit
that referenced
this pull request
May 7, 2026
mlflow 3.12 (released 2026-05-05) flipped OSS-default trace logging from synchronous to asynchronous (mlflow/mlflow#22304). After the bump, ``get_trace(get_last_active_trace_id())`` reads the span store before the BatchSpanProcessor has flushed the just-ended span and returns None, so the next ``trace.search_spans(...)`` call raises: AttributeError: 'NoneType' object has no attribute 'search_spans' This shows up on both ``openai_test`` and ``langchain_test`` matrix variants on ``uv: highest`` (where mlflow resolves to 3.12); the ``lowest-direct`` and pinned-version variants resolve an older mlflow and pass. mlflow's own fix is to pass ``flush=True`` to ``get_trace``, which calls ``_flush_pending_async_trace_writes()`` and retries before returning. Two-line patch — no version pin, no production-code change. Surfaced first by #427 because it was the first CI run after the mlflow bump; main hasn't run CI since 2026-05-01 so it'd manifest there too. Co-authored-by: Isaac Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
dhruv0811
added a commit
that referenced
this pull request
May 7, 2026
mlflow 3.12 (released 2026-05-05) flipped OSS-default trace logging from synchronous to asynchronous (mlflow/mlflow#22304). After the bump, ``get_trace(get_last_active_trace_id())`` reads the span store before the BatchSpanProcessor has flushed the just-ended span and returns None, so the next ``trace.search_spans(...)`` call raises: AttributeError: 'NoneType' object has no attribute 'search_spans' This shows up on both ``openai_test`` and ``langchain_test`` matrix variants on ``uv: highest`` (where mlflow resolves to 3.12); the ``lowest-direct`` and pinned-version variants resolve an older mlflow and pass. mlflow's own fix is to pass ``flush=True`` to ``get_trace``, which calls ``_flush_pending_async_trace_writes()`` and retries before returning. Two-line patch — no version pin, no production-code change. Surfaced first by #427 because it was the first CI run after the mlflow bump; main hasn't run CI since 2026-05-01 so it'd manifest there too. Co-authored-by: Isaac Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
Companion to app-templates#207. Moves the rotated-conversation_id
bookkeeping from the chatbot's in-memory `Map` (lost on chatbot pod
restart, blank on each new chatbot pod, missing in any new browser tab)
into a `conversation_aliases` table the bridge owns. Clients always send
the original chat id; the bridge resolves it forward at request entry to
the post-rotation SDK session.
What changed:
- New `ConversationAlias` model and `agent_server.conversation_aliases`
table (PK base_conversation_id → current_conversation_id).
- `repository.resolve_conversation_alias(base)` (cheap forward lookup,
miss returns the input) and `repository.upsert_conversation_alias(...)`
(Postgres ON CONFLICT DO UPDATE so the table stays one row per logical
conversation regardless of rotation count).
- `_handle_background_request` resolves on the way in: `original_request`
keeps the BASE id (anchor for future rotations stays flat —
`chat-123 → chat-123::r-…-3`, never `chat-123::r-…-2::r-…-3`); the
dispatched copy uses the resolved current id so the SDK lands on the
rotated session.
- `_try_claim_and_resume` upserts `base → rotated` after rotating, so
alias persistence survives both bridge and chatbot restarts.
- Rotation suffix changed from `{base}::attempt-N` to
`{base}::r-{response_id_short}-N` so a second turn that crashes at the
same attempt number can't collide with a prior turn's already-rotated
session and re-poison it. The new test
`test_rotate_no_collision_across_turns_at_same_attempt` covers this.
- Existing rotation-format assertions updated; resume-path tests mock
`upsert_conversation_alias` so they don't try to hit the DB.
- New unit tests for `resolve` (hit + passthrough miss) and `upsert`
(verifies the SQL is INSERT ... ON CONFLICT and binds correctly).
The `response.resumed` SSE sentinel still fires for visibility / debug;
the chatbot no longer reads it (companion PR drops the alias capture).
109 unit tests pass.
Co-authored-by: Isaac
Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
Templates currently each duplicate ~14 lines of `databricks_ai_bridge` logger setup in their start_server.py to surface durable-execution lifecycle logs (uvicorn's logging config drops INFO from non-uvicorn loggers, so without an explicit handler the breadcrumbs are silently swallowed and a deployed app looks dead even when working correctly). Move that to a single `_attach_bridge_logger()` called from LongRunningAgentServer.__init__. Idempotent — leaves existing handlers alone — so a consumer that ships their own logging config wins. Set DATABRICKS_AI_BRIDGE_LOG_QUIET=1 to opt out. Addresses Bryan's review on app-templates#207: "we can just gate this logging on an env var, but it feels like this belongs in ai-bridge instead." Companion app-templates change deletes the duplicated blocks. Co-authored-by: Isaac Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
Setting propagate=False on the parent `databricks_ai_bridge` logger silently broke any test using caplog on a child logger (test_model_serving_obo_credential_strategy::test_logging_statements fails as soon as ANY test in the suite instantiates LongRunningAgentServer, since the logger is a process-level singleton). Propagation can stay on: in uvicorn deployments the root logger has no handlers, so propagation produces no double-log; in pytest, propagation is what lets caplog capture in the first place. Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
ab9ffb5 to
d931140
Compare
bbqiu
approved these changes
May 11, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Follow-up to #425 with two changes that came out of UI testing the durable path against the chatbot in app-templates#207:
Server-side rotation alias resolution. Move the rotated
conversation_idbookkeeping out of the chatbot's in-memoryMap(lost on chatbot restart, blank on each new chatbot pod, missing in any new browser tab) and into aconversation_aliasestable the bridge owns. Clients always send the original chat id; the bridge resolves it forward to the post-rotation SDK session. With the in-memory map, a user who crashed-and-resumed a conversation would silently lose history the next time their chatbot pod cycled — the next turn re-sent the original id and landed on the orphan-poisoned pre-rotation session.Bridge logger setup in
LongRunningAgentServer.__init__. Templates were each duplicating ~14 lines to attach a stream handler to thedatabricks_ai_bridgelogger so durable lifecycle logs ([durable] resume…, stale-scan, etc.) reach app stdout — uvicorn's logging config drops INFO from non-uvicorn loggers. Move that into the bridge as_attach_bridge_logger()called from__init__. Idempotent. Opt out withDATABRICKS_AI_BRIDGE_LOG_QUIET=1. Addresses Bryan's review on app-templates#207.Files
ConversationAliasmodel + tablelong_running/models.pyresolve_conversation_alias+upsert_conversation_aliaslong_running/repository.pylong_running/server.pytests/databricks_ai_bridge/test_long_running_{db,server}.pyNotable details
{base}::attempt-Nto{base}::r-{response_id_short}-N. The old format would collide across turns (turn 1 attempt-2 and turn 2 attempt-2 both mint the same id, re-poisoning the just-rotated session). New testtest_rotate_no_collision_across_turns_at_same_attemptcovers this.original_requestkeeps the BASE id, so multi-crash chains stay flat (chat-123 → chat-123::r-…-3, neverchat-123::r-…-2::r-…-3).response.resumedSSE sentinel still fires for visibility / debug — clients no longer need to act on it for cross-turn alias tracking.Test plan
long_running/ruff check,ruff format --checkcleandhruv-lg-claude-durable,dhruv-oai-gpt-durable,dhruv-oai-claude-durable: multi-tool kill mid-deep_research, multi-turn after resume, chatbot pod restart between turns ✅Companion PR
app-templates#207 — drops the chatbot's
conversationAliasMap, the duplicated bridge logger blocks in each template, and addsconversation_aliasesto the Lakebase grant script.Known follow-ups (non-blocking)
conversation_aliasesrows whose underlying responses are all in a terminal state (probably as part of the same scanner loop that handles stale heartbeats).[tool.uv.sources]block in app-templates#207 and bump the dependency floor.