Skip to content

LongRunningAgentServer: server-side rotation alias resolution#427

Open
dhruv0811 wants to merge 3 commits into
mainfrom
dhruv0811/durable-bridge-side-alias-resolution
Open

LongRunningAgentServer: server-side rotation alias resolution#427
dhruv0811 wants to merge 3 commits into
mainfrom
dhruv0811/durable-bridge-side-alias-resolution

Conversation

@dhruv0811
Copy link
Copy Markdown
Contributor

@dhruv0811 dhruv0811 commented May 6, 2026

Summary

Follow-up to #425 with two changes that came out of UI testing the durable path against the chatbot in app-templates#207:

  1. Server-side rotation alias resolution. Move the rotated conversation_id bookkeeping out of the chatbot's in-memory Map (lost on chatbot restart, blank on each new chatbot pod, missing in any new browser tab) and into a conversation_aliases table the bridge owns. Clients always send the original chat id; the bridge resolves it forward to the post-rotation SDK session. With the in-memory map, a user who crashed-and-resumed a conversation would silently lose history the next time their chatbot pod cycled — the next turn re-sent the original id and landed on the orphan-poisoned pre-rotation session.

  2. Bridge logger setup in LongRunningAgentServer.__init__. Templates were each duplicating ~14 lines to attach a stream handler to the databricks_ai_bridge logger so durable lifecycle logs ([durable] resume…, stale-scan, etc.) reach app stdout — uvicorn's logging config drops INFO from non-uvicorn loggers. Move that into the bridge as _attach_bridge_logger() called from __init__. Idempotent. Opt out with DATABRICKS_AI_BRIDGE_LOG_QUIET=1. Addresses Bryan's review on app-templates#207.

Files

Component File
ConversationAlias model + table long_running/models.py
resolve_conversation_alias + upsert_conversation_alias long_running/repository.py
Resolve at POST entry, upsert on rotate, rotation suffix change, logger attach long_running/server.py
Tests tests/databricks_ai_bridge/test_long_running_{db,server}.py

Notable details

  • Rotation suffix changed. From {base}::attempt-N to {base}::r-{response_id_short}-N. The old format would collide across turns (turn 1 attempt-2 and turn 2 attempt-2 both mint the same id, re-poisoning the just-rotated session). New test test_rotate_no_collision_across_turns_at_same_attempt covers this.
  • original_request keeps the BASE id, so multi-crash chains stay flat (chat-123 → chat-123::r-…-3, never chat-123::r-…-2::r-…-3).
  • response.resumed SSE sentinel still fires for visibility / debug — clients no longer need to act on it for cross-turn alias tracking.

Test plan

  • 109 unit tests pass for long_running/
  • ruff check, ruff format --check clean
  • UI tested on dhruv-lg-claude-durable, dhruv-oai-gpt-durable, dhruv-oai-claude-durable: multi-tool kill mid-deep_research, multi-turn after resume, chatbot pod restart between turns ✅

Companion PR

app-templates#207 — drops the chatbot's conversationAliasMap, the duplicated bridge logger blocks in each template, and adds conversation_aliases to the Lakebase grant script.

Known follow-ups (non-blocking)

  • Prune conversation_aliases rows whose underlying responses are all in a terminal state (probably as part of the same scanner loop that handles stale heartbeats).
  • Once shipped on PyPI, drop the temp [tool.uv.sources] block in app-templates#207 and bump the dependency floor.

dhruv0811 added a commit that referenced this pull request May 7, 2026
mlflow 3.12 (released 2026-05-05) flipped OSS-default trace logging
from synchronous to asynchronous (mlflow/mlflow#22304). After the bump,
``get_trace(get_last_active_trace_id())`` reads the span store before
the BatchSpanProcessor has flushed the just-ended span and returns None,
so the next ``trace.search_spans(...)`` call raises:

    AttributeError: 'NoneType' object has no attribute 'search_spans'

This shows up on both ``openai_test`` and ``langchain_test`` matrix
variants on ``uv: highest`` (where mlflow resolves to 3.12); the
``lowest-direct`` and pinned-version variants resolve an older mlflow
and pass.

mlflow's own fix is to pass ``flush=True`` to ``get_trace``, which
calls ``_flush_pending_async_trace_writes()`` and retries before
returning. Two-line patch — no version pin, no production-code change.

Surfaced first by #427
because it was the first CI run after the mlflow bump; main hasn't run
CI since 2026-05-01 so it'd manifest there too.

Co-authored-by: Isaac
Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
dhruv0811 added a commit that referenced this pull request May 7, 2026
mlflow 3.12 (released 2026-05-05) flipped OSS-default trace logging
from synchronous to asynchronous (mlflow/mlflow#22304). After the bump,
``get_trace(get_last_active_trace_id())`` reads the span store before
the BatchSpanProcessor has flushed the just-ended span and returns None,
so the next ``trace.search_spans(...)`` call raises:

    AttributeError: 'NoneType' object has no attribute 'search_spans'

This shows up on both ``openai_test`` and ``langchain_test`` matrix
variants on ``uv: highest`` (where mlflow resolves to 3.12); the
``lowest-direct`` and pinned-version variants resolve an older mlflow
and pass.

mlflow's own fix is to pass ``flush=True`` to ``get_trace``, which
calls ``_flush_pending_async_trace_writes()`` and retries before
returning. Two-line patch — no version pin, no production-code change.

Surfaced first by #427
because it was the first CI run after the mlflow bump; main hasn't run
CI since 2026-05-01 so it'd manifest there too.

Co-authored-by: Isaac
Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
dhruv0811 added 3 commits May 7, 2026 00:21
Companion to app-templates#207. Moves the rotated-conversation_id
bookkeeping from the chatbot's in-memory `Map` (lost on chatbot pod
restart, blank on each new chatbot pod, missing in any new browser tab)
into a `conversation_aliases` table the bridge owns. Clients always send
the original chat id; the bridge resolves it forward at request entry to
the post-rotation SDK session.

What changed:
- New `ConversationAlias` model and `agent_server.conversation_aliases`
  table (PK base_conversation_id → current_conversation_id).
- `repository.resolve_conversation_alias(base)` (cheap forward lookup,
  miss returns the input) and `repository.upsert_conversation_alias(...)`
  (Postgres ON CONFLICT DO UPDATE so the table stays one row per logical
  conversation regardless of rotation count).
- `_handle_background_request` resolves on the way in: `original_request`
  keeps the BASE id (anchor for future rotations stays flat —
  `chat-123 → chat-123::r-…-3`, never `chat-123::r-…-2::r-…-3`); the
  dispatched copy uses the resolved current id so the SDK lands on the
  rotated session.
- `_try_claim_and_resume` upserts `base → rotated` after rotating, so
  alias persistence survives both bridge and chatbot restarts.
- Rotation suffix changed from `{base}::attempt-N` to
  `{base}::r-{response_id_short}-N` so a second turn that crashes at the
  same attempt number can't collide with a prior turn's already-rotated
  session and re-poison it. The new test
  `test_rotate_no_collision_across_turns_at_same_attempt` covers this.
- Existing rotation-format assertions updated; resume-path tests mock
  `upsert_conversation_alias` so they don't try to hit the DB.
- New unit tests for `resolve` (hit + passthrough miss) and `upsert`
  (verifies the SQL is INSERT ... ON CONFLICT and binds correctly).

The `response.resumed` SSE sentinel still fires for visibility / debug;
the chatbot no longer reads it (companion PR drops the alias capture).

109 unit tests pass.

Co-authored-by: Isaac
Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
Templates currently each duplicate ~14 lines of `databricks_ai_bridge`
logger setup in their start_server.py to surface durable-execution
lifecycle logs (uvicorn's logging config drops INFO from non-uvicorn
loggers, so without an explicit handler the breadcrumbs are silently
swallowed and a deployed app looks dead even when working correctly).

Move that to a single `_attach_bridge_logger()` called from
LongRunningAgentServer.__init__. Idempotent — leaves existing handlers
alone — so a consumer that ships their own logging config wins. Set
DATABRICKS_AI_BRIDGE_LOG_QUIET=1 to opt out.

Addresses Bryan's review on app-templates#207: "we can just gate this
logging on an env var, but it feels like this belongs in ai-bridge
instead." Companion app-templates change deletes the duplicated blocks.

Co-authored-by: Isaac
Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
Setting propagate=False on the parent `databricks_ai_bridge` logger
silently broke any test using caplog on a child logger
(test_model_serving_obo_credential_strategy::test_logging_statements
fails as soon as ANY test in the suite instantiates
LongRunningAgentServer, since the logger is a process-level singleton).

Propagation can stay on: in uvicorn deployments the root logger has no
handlers, so propagation produces no double-log; in pytest, propagation
is what lets caplog capture in the first place.

Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
@dhruv0811 dhruv0811 force-pushed the dhruv0811/durable-bridge-side-alias-resolution branch from ab9ffb5 to d931140 Compare May 7, 2026 00:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants