[arch] Design discussion: shared pluggable service for cross-agent state (feedback, sessions, traces) #112

@rdwj

Background

The v0.12.0 enterprise feature track added several stateful surfaces to BaseAgent's server layer:

  • POST/GET /v1/sessions — conversation persistence
  • GET /v1/traces — trace inspection
  • POST/GET/PATCH /v1/feedback — user feedback collection
  • GET /metrics — Prometheus

All four follow the same shape: BaseAgent owns a pluggable store (Null / SQLite / Postgres), the server layer exposes REST endpoints, and the gateway-template proxies through. This was the right call when most deployments had one or two agents.
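To make the "pluggable store" shape concrete, here is a minimal sketch of what such a store could look like. The `FeedbackStore` name appears later in this issue, but the method names, the `Feedback` fields, and the `make_store` factory are assumptions for illustration, not BaseAgent's actual interface. Postgres is omitted to keep the example self-contained.

```python
import sqlite3
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Feedback:
    # Hypothetical record shape; the real schema is not shown in this issue.
    agent_id: str
    session_id: str
    rating: int  # assumed encoding: +1 thumbs-up, -1 thumbs-down
    comment: str = ""


class FeedbackStore(ABC):
    """Assumed interface: one write path and one read path."""

    @abstractmethod
    def add(self, fb: Feedback) -> None: ...

    @abstractmethod
    def list_all(self) -> list[Feedback]: ...


class NullFeedbackStore(FeedbackStore):
    """Discards writes; useful when the feature is disabled."""

    def add(self, fb: Feedback) -> None:
        pass

    def list_all(self) -> list[Feedback]:
        return []


class SQLiteFeedbackStore(FeedbackStore):
    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS feedback "
            "(agent_id TEXT, session_id TEXT, rating INTEGER, comment TEXT)"
        )

    def add(self, fb: Feedback) -> None:
        self._conn.execute(
            "INSERT INTO feedback VALUES (?, ?, ?, ?)",
            (fb.agent_id, fb.session_id, fb.rating, fb.comment),
        )
        self._conn.commit()

    def list_all(self) -> list[Feedback]:
        rows = self._conn.execute(
            "SELECT agent_id, session_id, rating, comment FROM feedback"
        ).fetchall()
        return [Feedback(*row) for row in rows]


def make_store(backend: str) -> FeedbackStore:
    """Hypothetical factory: pick a backend from a config string."""
    if backend == "sqlite":
        return SQLiteFeedbackStore()
    return NullFeedbackStore()
```

With ten agents, each process would construct its own store like this against the same database, which is where the duplication described below comes from.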

In multi-agent deployments (e.g., 10 agents fronted by a single gateway and UI), the per-agent ownership starts to chafe:

  • Duplication — 10 Postgres pools, 10 schema migrations, 10 housekeeping loops, all writing the same tables
  • Fan-out for cross-agent queries — "show me all thumbs-down feedback this week" requires the dashboard to hit 10 endpoints and merge client-side, OR query the shared Postgres directly out-of-band (the schema becomes a de-facto API)
  • Schema becomes a contract — once N agents write the same table, schema changes need coordinated rollouts
  • No auth boundary between agents sharing storage
  • Sessions are conceptually cross-agent — a user talks to "the system," not to agent #4. Today there's no clean way to follow a conversation that gets routed to different agents
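As a rough illustration of the fan-out problem, the sketch below shows the client-side merge a dashboard would need for a query like "all thumbs-down feedback". The row fields (`rating`, `created_at`) and the `merge_feedback` helper are hypothetical; in a real deployment each fetcher would be an HTTP GET against one agent's `/v1/feedback` endpoint.

```python
from typing import Callable, Iterable


def merge_feedback(
    fetchers: dict[str, Callable[[], Iterable[dict]]],
    max_rating: int = -1,
) -> list[dict]:
    """Call every agent's fetcher and merge matching rows, newest first.

    `fetchers` maps an agent id to a zero-argument callable returning that
    agent's feedback rows. Rows are assumed to carry an integer `rating`
    and an ISO-8601 `created_at` string.
    """
    merged: list[dict] = []
    for agent_id, fetch in fetchers.items():
        for row in fetch():
            if row["rating"] <= max_rating:
                # Tag each row so the dashboard knows where it came from.
                merged.append({**row, "agent_id": agent_id})
    # ISO-8601 timestamps in the same timezone sort correctly as strings.
    merged.sort(key=lambda r: r["created_at"], reverse=True)
    return merged
```

Every dashboard query pays N round trips here, and filtering, sorting, and pagination all have to be reimplemented outside the database.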

What to design

Open question: what does the shape of a shared 'agent platform' service look like, and which surfaces move there?

Initial options to discuss (not a decision, a starting point):

  1. Status quo + documentation. Document that multi-agent deployments should point all agents at the same Postgres and treat the shared schema as a stable join point. Cheapest, but the rough edges remain.

  2. Full extraction. A new FastAPI service (working name: `fipsagents-platform`) owns sessions + traces + feedback. BaseAgent becomes a thin client. Gateway routes `/v1/sessions`, `/v1/traces`, `/v1/feedback` to the platform service rather than fanning out to per-agent endpoints. One Postgres pool, one REST surface, one dashboard backend.

  3. Partial extraction. Move feedback + sessions (genuinely cross-agent) but leave traces in BaseAgent shipping to an OTel collector (industry-standard answer, already partially done via `OTELTraceStore`). Fewer moving parts, addresses the highest-value duplication.

  4. Something else. Maybe BaseAgent keeps everything but grows a 'remote store' adapter for each — same code, configurable backend (in-process vs HTTP). Lets a deployer choose per-feature without forcing a topology.
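To give option 4 something concrete to react to, here is one possible shape for a "remote store" adapter: the same interface backed either in-process or over HTTP, selected by config. Everything here is an assumption for discussion: the `SessionStore` method names, the `/v1/sessions/{id}` path, the JSON payload shape, the config keys, and the injectable `transport` are all hypothetical.

```python
import json
import urllib.request
from abc import ABC, abstractmethod
from typing import Callable, Optional

# (method, url, json_body) -> parsed JSON response
Transport = Callable[[str, str, Optional[dict]], dict]


def urllib_transport(method: str, url: str, body: Optional[dict] = None) -> dict:
    """Default transport using only the standard library."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


class SessionStore(ABC):
    @abstractmethod
    def save(self, session_id: str, messages: list[dict]) -> None: ...

    @abstractmethod
    def load(self, session_id: str) -> list[dict]: ...


class InProcessSessionStore(SessionStore):
    def __init__(self) -> None:
        self._data: dict[str, list[dict]] = {}

    def save(self, session_id: str, messages: list[dict]) -> None:
        self._data[session_id] = list(messages)

    def load(self, session_id: str) -> list[dict]:
        return self._data[session_id]


class RemoteSessionStore(SessionStore):
    """Forwards the same calls to a shared service over HTTP."""

    def __init__(self, base_url: str, transport: Transport = urllib_transport) -> None:
        self._base_url = base_url.rstrip("/")
        self._transport = transport

    def save(self, session_id: str, messages: list[dict]) -> None:
        self._transport(
            "POST", f"{self._base_url}/v1/sessions",
            {"id": session_id, "messages": messages},
        )

    def load(self, session_id: str) -> list[dict]:
        reply = self._transport("GET", f"{self._base_url}/v1/sessions/{session_id}", None)
        return reply["messages"]


def store_from_config(cfg: dict, transport: Transport = urllib_transport) -> SessionStore:
    """Hypothetical config switch: {"mode": "remote", "url": ...} or in-process."""
    if cfg.get("mode") == "remote":
        return RemoteSessionStore(cfg["url"], transport)
    return InProcessSessionStore()
```

The agent code calling `save` and `load` stays identical in both modes, so a deployer could move sessions to a shared backend while leaving other features in-process.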

Things to think about during the discussion

  • Migration story. The longer we wait, the more deployments depend on the per-agent endpoints. Cheap to do now while there's effectively one production user; a later migration will be visible to every deployment that has adopted those endpoints
  • Memory is intentionally NOT in this list — `self.memory` is per-agent by design, and MemoryHub already provides the centralized option
  • Metrics is also separate — Prometheus scrape targets are inherently per-pod, that's fine
  • Auth — if multiple agents share a backend, who's allowed to write what? Today there's no model for this
  • Deployment friction — every service we extract is another Helm chart, another readiness probe, another thing for ops to think about. Worth it iff the cross-agent benefits land
  • Pluggability shape — same `FeedbackStore`/`SessionStore`/`TraceStore` ABCs we have today, just running in a different process? Or a different abstraction entirely?

Out of scope for this issue

This is a design discussion issue, not an implementation. The goal is to come out with a written architecture decision (in `docs/architecture.md` or similar) that we can point at when implementing.

Captured during the v0.12.0 feedback feature track. Conversation context: the per-agent ownership felt fine for one or two agents but the smell got louder once we considered the 10-agent case.

Labels: enhancement (New feature or request)
