Inspiration
A company deploys an AI customer support agent system.
It has:
Agent A: Intake agent → reads customer messages, classifies intent.Agent B: Knowledge agent → searches company docs for answers.Agent C: Resolution agent → drafts replies and tracks follow-ups.
They handle thousands of tickets daily, often from returning customers. Each agent is stateless, they don’t “remember” what happened yesterday.
Reality
Each agent call is isolated. There’s no shared state beyond a central database of tickets.
Example
Customer: "I still haven't received my refund."-
Agent Aclassifies it:refund inquiry. But it doesn’t remember that this same customer already had 3 refund interactions last week. -
Agent Bfetches refund policy docs again (re-runs the same search). That's costly and redundant, thousands of similar lookups happen daily. -
Agent Cwrites:Please provide your order number, because it can’t recall that the user already sent it in a previous thread.
Solution
Now we plug in a semantic memory layer shared across agents. Each agent reads/writes facts into this store in near-real-time.
Agent Awrites semantic memory into theSynapseDB.
memory.write("user:4321", {
"content": "Refund requested for order #1458 last week",
"salience": 0.9,
"ts": "2025-10-24T10:00:00Z"
})Agent Breuses past reasoning. WhenAgent Breceives the new query:
context = memory.query("user:4321 refund status")Memory returns semantically similar entries:
[
"Customer refund for order #1458 initiated on 10/20",
"Refund confirmation email sent 10/21"
]Agent B immediately concludes: Refund already processed → customer following up.
No need to re-run search over documentation or ask for order details.
Agent Cadds an update. After resolving, it appends:
memory.append("user:4321", "Refund confirmed delivered on 10/28.")- Next time the customer says:
Why is my refund taking so long?”
All agents can instantly “remember” past interactions, without re-querying databases or rerunning prompts.
| Before | After |
|---|---|
| Agents forgot user context | Agents recall prior summaries instantly |
| Repeated LLM calls for the same info | Cached semantic context → 70–90% fewer LLM tokens |
| Fragmented reasoning | Shared, real-time semantic memory graph |
| No auditability | Memory log shows exactly what each agent knew & when |
| Poor UX | Feels like one coherent, continuous assistant |
🎬 From Memento.
Solution
A low-latency, in-memory semantic state store purpose-built for LLM/agent workloads:
- Hybrid data model:
KV/JSONfor facts,VECTORfor embeddings, andSTREAMfor temporal events, all namespaced per agent/team/tenant. - Semantic retrieval: HNSW-based ANN with metadata filters + optional reranking—query by meaning, not just keys.
- Temporal versioning: Time-travel reads (e.g.,
memory as of 2025-10-24T10:00Z) and append-only event streams per entity/session. - Semantic eviction: Forget using salience (recency × access × downstream impact), not only TTL/LRU.
- Observability: Memory lineage + OpenTelemetry traces show which memories influenced which answers.
- Clustered & safe: Raft replication, shard-local leaders, mTLS, per-tenant quotas.
- Runtime & API:
FastAPI+uvicorn(asyncio/uvloop) — async HTTP JSON service exposingkv,vector,stream,hybridendpoints. - Storage: In-memory Python dicts for hot data +
aiosqlite(WAL mode) for durable snapshots. - Vector Search:
hnswlib(cosine ANN) + lightweight Python metadata filters; caller provides embeddings. - Eviction & Policy: Background
asynciotask computes semantic score (α·recency + β·freq + γ·salience); supports pinned keys and namespace quotas. - Observability & Ops:
prometheus_clientmetrics, basicOpenTelemetrytracing,/healthzand/metricsroutes, single Docker container deploy.
All libraries are open source (Apache 2.0 / MIT) and run locally with no proprietary dependencies.
Ship an MVP that lets any LLM agent:
- Write & read structured memory and semantic entries for queries.
- Append & replay temporal streams per user/session and read memory “as of” a timestamp.
- Run hybrid semantic queries with metadata filters and return ranked, deduped memories.
- Provide lineage & traces so teams can debug why an answer happened given the memory at that moment.
