LycheeMem is a compact memory framework for LLM agents. It starts from efficient conversational memory—through structured organization, lightweight consolidation, and adaptive retrieval—and gradually extends toward action-aware, usage-aware memory for more capable agentic systems.
- [04/03/2026] The project now supports installation via pip install lycheemem. You can start the service from anywhere using lycheemem-cli!
- [03/30/2026] We evaluated LycheeMem on PinchBench with the OpenClaw plugin: compared to OpenClaw's native memory, it achieved a ~6% score improvement while reducing token consumption by ~71% and cost by ~55%!
- [03/28/2026] Semantic memory has been upgraded to Compact Semantic Memory (SQLite + LanceDB), no Neo4j required. See /quick-start for details.
- [03/27/2026] OpenClaw Plugin is now available at /openclaw-plugin ! Setup guide →
- [03/26/2026] MCP support is available at /mcp !
- [03/23/2026] LycheeMem is now open source: GitHub Repository →
LycheeMem is part of the 3rd-generation Lychee (立知) large model series, which focuses on memory intelligence, continual learning, and long-context reasoning.
We welcome you to explore our related works:
- LycheeMemory: a unified framework for implicit long-term memory and explicit working memory collaboration in large language models
- LycheeMem (this project): long-term memory infrastructure for LLM-based agents
- LycheeDecode: selective recall from massive KV-cache context memory
- LycheeCluster: structured organization and hierarchical indexing for context memory
- Python 3.9+
- An LLM API key (OpenAI, Gemini, or any litellm-compatible provider)
You can install LycheeMem directly via pip:
pip install lycheemem
Once installed, you can start the backend server instantly using the CLI:
lycheemem-cli
For development, or if you prefer to run from source:
git clone https://github.com/LycheeMem/LycheeMem.git
cd LycheeMem
pip install -e .
Create a .env file in your working directory and fill in your values. The full template in .env.example also includes session/user DB paths, JWT settings, and working-memory thresholds; the snippet below shows the most important ones:
# LLM — litellm format: provider/model
LLM_MODEL=openai/gpt-4o-mini
LLM_API_KEY=sk-...
LLM_API_BASE= # optional
# Embedder
EMBEDDING_MODEL=openai/text-embedding-3-small
EMBEDDING_DIM=1536
EMBEDDING_API_KEY= # optional
EMBEDDING_API_BASE= # optional
Supported LLM providers (via litellm):
openai/gpt-4o-mini · gemini/gemini-2.0-flash · ollama_chat/qwen2.5 · any OpenAI-compatible endpoint
If you installed via pip, you can start the LycheeMem background service from anywhere using:
lycheemem-cli
(If running from source, you can also use python main.py to start the server.)
The API is served at http://localhost:8000, with interactive docs at /docs.
main.py currently starts Uvicorn without live reload. For reload during development, run Uvicorn directly, for example:
uvicorn src.api.server:create_app --factory --reload
A frontend demo is included under web-demo/. It provides a chat interface alongside live views of the semantic memory tree, skill library, and working memory state.
cd web-demo
npm install
npm run dev # served at http://localhost:5173
Make sure the backend is running on port 8000 (or update the proxy settings in web-demo/vite.config.ts) before starting the frontend.
LycheeMem ships a native OpenClaw plugin that gives any OpenClaw session persistent long-term memory with zero manual wiring.
What the plugin provides:
- lychee_memory_smart_search: default long-term memory retrieval entry point
- Automatic turn mirroring via hooks; the model does not need to call append_turn manually
  - User messages are appended automatically
  - Assistant messages are appended automatically
- /new, /reset, /stop, and session_end automatically trigger boundary consolidation
- Proactive consolidation on strong long-term knowledge signals
Under normal operation:
- The model only calls lychee_memory_smart_search when recalling long-term context
- The model may call lychee_memory_consolidate manually when an immediate persist is warranted
- The model does not need to call lychee_memory_append_turn at all
openclaw plugins install "/path/to/LycheeMem/openclaw-plugin"
openclaw gateway restart
See the full setup guide: openclaw-plugin/INSTALL_OPENCLAW.md
LycheeMem also exposes an HTTP MCP endpoint at http://localhost:8000/mcp.
- Available tools: lychee_memory_smart_search, lychee_memory_search, lychee_memory_append_turn, lychee_memory_synthesize, lychee_memory_consolidate
- Use Authorization: Bearer <token> if you want per-user memory isolation
- lychee_memory_consolidate works for sessions that already contain mirrored turns from /chat, /memory/reason, or lychee_memory_append_turn
- POST /mcp handles JSON-RPC requests
- GET /mcp exposes the SSE stream used by some MCP clients
- The server returns Mcp-Session-Id during initialize; reuse that header on later requests
If you want isolated memory per user, first obtain a JWT token from /auth/register or /auth/login, then send:
Authorization: Bearer <token>
Without a token, requests run with an empty user_id, so anonymous traffic shares the same namespace.
For any MCP client that supports remote HTTP servers, configure the MCP URL as:
http://localhost:8000/mcp
Generic config example:
{
"mcpServers": {
"lycheemem": {
"url": "http://localhost:8000/mcp",
"headers": {
"Authorization": "Bearer <token>"
}
}
}
}
- Call initialize
- Reuse the returned Mcp-Session-Id
- Send initialized
- Call tools/list
- Call tools/call
Initialize example:
curl -i -X POST http://localhost:8000/mcp \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <token>" \
-d '{
"jsonrpc": "2.0",
"id": 1,
"method": "initialize",
"params": {
"protocolVersion": "2025-03-26",
"capabilities": {},
"clientInfo": {
"name": "debug-client",
"version": "0.1.0"
}
}
}'
Tool call example:
curl -X POST http://localhost:8000/mcp \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <token>" \
-H "Mcp-Session-Id: <session-id>" \
-d '{
"jsonrpc": "2.0",
"id": 2,
"method": "tools/call",
"params": {
"name": "lychee_memory_smart_search",
"arguments": {
"query": "what tools do I use for database backups",
"top_k": 5,
"mode": "compact",
"include_graph": true,
"include_skills": true
}
}
}'
- Use /chat or /memory/reason with a stable session_id to write conversation turns, or mirror external host turns with lychee_memory_append_turn.
- Use lychee_memory_smart_search in compact mode for the default one-shot recall path.
- Use lychee_memory_search + lychee_memory_synthesize only when you explicitly want search and synthesis as separate stages.
- After the conversation ends, call lychee_memory_consolidate with the same session_id.
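The initialize and tool-call requests shown in the curl examples can also be built from Python. The following is a minimal sketch using only the standard library; the helper names `jsonrpc_request` and `build_mcp_request` are hypothetical, and the payload fields simply mirror the curl examples rather than any official client.

```python
import json
import urllib.request

MCP_URL = "http://localhost:8000/mcp"  # endpoint documented in this README


def jsonrpc_request(method, params, request_id):
    """Build a JSON-RPC 2.0 request body (hypothetical helper)."""
    return {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}


def build_mcp_request(body, token=None, session_id=None, url=MCP_URL):
    """Wrap a JSON-RPC body in a POST request with the documented headers."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    if session_id:
        headers["Mcp-Session-Id"] = session_id
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


init_body = jsonrpc_request(
    "initialize",
    {
        "protocolVersion": "2025-03-26",
        "capabilities": {},
        "clientInfo": {"name": "debug-client", "version": "0.1.0"},
    },
    request_id=1,
)
search_body = jsonrpc_request(
    "tools/call",
    {
        "name": "lychee_memory_smart_search",
        "arguments": {
            "query": "what tools do I use for database backups",
            "top_k": 5,
            "mode": "compact",
        },
    },
    request_id=2,
)

# Sending requires a running server, so it is left commented out:
# with urllib.request.urlopen(build_mcp_request(init_body, token="<token>")) as resp:
#     session_id = resp.headers.get("Mcp-Session-Id")
```

Keeping request construction separate from sending makes it easy to inspect headers and payloads before pointing the client at a live server.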
LycheeMem organizes memory into three complementary stores:
| Working Memory | Semantic Memory | Procedural Memory |
|---|---|---|
| (Episodic) | (Typed Action Store) | (Skills) |
The working memory window holds the active conversation context for a session. It operates under a dual-threshold token budget:
- Warn threshold (70%) — triggers asynchronous background pre-compression; the current request is not blocked.
- Block threshold (90%) — the pipeline pauses and flushes older turns to a compressed summary before proceeding.
Compression produces summary anchors (past context, distilled) + raw recent turns (last N turns, verbatim). Both are passed downstream as the conversation history.
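The dual-threshold check can be pictured as a small classifier over the current token usage. This is a minimal sketch, not the project's implementation: the function name `budget_action`, its return labels, and the use of inclusive comparisons are assumptions; only the 70% and 90% ratios come from the description above.

```python
def budget_action(used_tokens, max_tokens, warn_ratio=0.70, block_ratio=0.90):
    """Classify working-memory usage against the warn and block thresholds."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    usage = used_tokens / max_tokens
    if usage >= block_ratio:
        return "block"  # pause and flush older turns into a summary first
    if usage >= warn_ratio:
        return "warn"   # schedule background pre-compression, do not block
    return "ok"


for used in (5000, 6000, 7600):
    print(used, budget_action(used, max_tokens=8000))
```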
Semantic memory is organised around typed MemoryRecords plus action-grounded retrieval state. The storage layer is SQLite (FTS5 full-text search) + LanceDB (vector index), while retrieval is conditioned on recent context, tentative action, constraints, and missing slots.
Each memory entry is stored as a MemoryRecord. The memory_type field distinguishes seven semantic categories:
| Type | Description |
|---|---|
| fact | Objective facts about the user, environment, or world |
| preference | User preferences (style, habits, likes/dislikes) |
| event | Specific events that have occurred |
| constraint | Conditions that must be respected |
| procedure | Reusable step-by-step procedures / methods |
| failure_pattern | Previously failed action paths and their causes |
| tool_affordance | Capabilities and applicable scenarios of tools/APIs |
Beyond text, every MemoryRecord carries action-facing metadata (tool_tags, constraint_tags, failure_tags, affordance_tags) and usage statistics (retrieval_count, action_success_count, etc.) to seed future reinforcement-learning signals. Retrieval logs also persist retrieval_plan, action_state, response excerpts, and later user feedback so the system can close a lightweight action-outcome loop without training.
Related MemoryRecords can be fused online by the Record Fusion Engine into denser CompositeRecords. Composite entries persist direct child_composite_ids, so long-term semantic memory is organised as a hierarchical memory tree instead of a flat bag of summaries.
A single-pass pipeline that converts conversation turns into a list of MemoryRecords:
- Typed extraction: the LLM extracts self-contained facts and assigns a semantic category to each record.
- Decontextualization: pronouns and context-dependent phrases are expanded into full expressions, so each record is understandable without the original dialogue.
- Action metadata annotation: the LLM annotates each record with memory_type, tool_tags, constraint_tags, failure_tags, affordance_tags, and other structured labels.
record_id = SHA256(normalized_text) — naturally idempotent; duplicate content is deduplicated automatically.
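A content-addressed ID of this kind can be sketched in a few lines. The normalization rule below (NFKC, lowercasing, whitespace collapsing) is an assumption made for illustration; the actual normalization used by LycheeMem is not specified here.

```python
import hashlib
import unicodedata


def normalize_text(text):
    """Hypothetical normalization: NFKC, lowercase, collapse whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).lower().split())


def record_id(text):
    """SHA-256 hex digest of the normalized text."""
    return hashlib.sha256(normalize_text(text).encode("utf-8")).hexdigest()


a = record_id("User backs up PostgreSQL  with pg_dump.")
b = record_id("user backs up postgresql with pg_dump.")
print(a == b)  # same normalized content, same ID
```

Because the ID depends only on the normalized text, writing the same record twice maps to the same key, which is what makes the write idempotent.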
Triggered online after each consolidation:
- FTS / vector recall gathers related existing atomic records around the new records (candidate pool).
- The existing synthesis judge prompt decides whether each candidate set should produce a new CompositeRecord or perform a conflict_update against an existing atomic record.
- On conflict_update, the existing anchor record is updated in place, conflicting incoming records are soft-expired, and composites covering affected source records are invalidated.
- On synthesis, the engine writes a new CompositeRecord to SQLite + LanceDB.
- Additional hierarchy rounds can synthesize record -> composite and composite -> composite, persisting child_composite_ids so the memory tree can keep growing upward.
Before retrieval, ActionAwareRetrievalPlanner analyses the user query + recent context + ActionState and emits a SearchPlan:
- mode: answer (factual Q&A) / action (needs execution) / mixed
- semantic_queries: content-facing search terms
- pragmatic_queries: action/tool/constraint-facing search terms
- tool_hints: tools likely needed for this request
- required_constraints: constraints that must be respected
- required_affordances: capabilities the retrieved memory should provide
- missing_slots: parameters / slots that are absent
- tree_retrieval_mode / tree_expansion_depth / include_leaf_records: whether retrieval should stay at high-level composites (root_only) or descend into child composites / direct leaf records (balanced / descend)
ActionState can carry fields such as current_subgoal, tentative_action, known_constraints, available_tools, failure_signal, and a recent-context excerpt. The planner merges this state with the LLM-produced plan so retrieval is conditioned on the current decision state rather than the query alone.
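To make the shape of these objects concrete, here is a rough sketch using dataclasses. The field names follow the lists above, but the class layout, the defaults, and especially the merge rule in `merge_state` are assumptions for illustration; the real planner may combine state and plan differently.

```python
from dataclasses import dataclass, field, replace
from typing import List


@dataclass
class ActionState:
    current_subgoal: str = ""
    tentative_action: str = ""
    known_constraints: List[str] = field(default_factory=list)
    available_tools: List[str] = field(default_factory=list)
    failure_signal: str = ""


@dataclass
class SearchPlan:
    mode: str = "answer"  # "answer", "action", or "mixed"
    semantic_queries: List[str] = field(default_factory=list)
    pragmatic_queries: List[str] = field(default_factory=list)
    tool_hints: List[str] = field(default_factory=list)
    required_constraints: List[str] = field(default_factory=list)
    required_affordances: List[str] = field(default_factory=list)
    missing_slots: List[str] = field(default_factory=list)
    tree_retrieval_mode: str = "root_only"  # or "balanced" / "descend"


def _union(first, second):
    """Order-preserving union of two lists."""
    return list(dict.fromkeys([*first, *second]))


def merge_state(plan, state):
    """Hypothetical merge: fold known constraints and the tentative action into the plan."""
    extra_queries = [state.tentative_action] if state.tentative_action else []
    return replace(
        plan,
        required_constraints=_union(plan.required_constraints, state.known_constraints),
        pragmatic_queries=_union(plan.pragmatic_queries, extra_queries),
    )
```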
The plan drives multi-channel recall:
- FTS channel: SQLite FTS5 keyword recall over MemoryRecord + CompositeRecord
- Semantic vector channel: LanceDB ANN over semantic_text embeddings
- Normalised vector channel: LanceDB ANN over normalized_text embeddings (for pragmatic queries)
- Tag filter channel: exact filter by tool_hints / required_constraints / required_affordances
- Temporal channel: filter by SearchPlan.temporal_filter time window
- Slot-hint supplementation: when missing_slots is non-empty, extra FTS/tag recall is triggered to find records that can fill missing parameters
After base recall, retrieval can also expand along the memory tree. root_only keeps high-level composite summaries, balanced descends one level when tree hints match, and descend pulls child composites plus direct leaf records when the current action needs finer-grained detail.
Candidates from all channels are de-duplicated and ranked by MemoryScorer using a weighted linear combination. Final top-k selection is composite-first: covering parent composites are preferred, covered child records are folded away unless they add unique value, and near-duplicate fragments are suppressed.
| Weight | Meaning | Default |
|---|---|---|
| α | SemanticRelevance (vector distance -> similarity) | 0.25 |
| β | ActionUtility (tag match score, mode-aware) | 0.25 |
| κ | SlotUtility (whether the memory helps fill missing action slots) | 0.15 |
| γ | TemporalFit (temporal reference match) | 0.15 |
| δ | Recency (memory freshness) | 0.10 |
| η | EvidenceDensity (evidence span density) | 0.10 |
| λ | TokenCost penalty (text length penalty) | 0.10 |
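The table can be read as a weighted sum of six feature scores minus a token-cost penalty. The sketch below assumes that form, assumes every feature and the token cost are already scaled to [0, 1], and uses hypothetical feature keys; it is an illustration of the weighting, not the MemoryScorer source.

```python
DEFAULT_WEIGHTS = {
    "semantic_relevance": 0.25,  # α
    "action_utility": 0.25,      # β
    "slot_utility": 0.15,        # κ
    "temporal_fit": 0.15,        # γ
    "recency": 0.10,             # δ
    "evidence_density": 0.10,    # η
}
TOKEN_COST_PENALTY = 0.10        # λ


def memory_score(features, token_cost, weights=DEFAULT_WEIGHTS, penalty=TOKEN_COST_PENALTY):
    """Weighted sum of feature scores minus a token-cost penalty.

    Missing features count as 0.0; token_cost is assumed to be in [0, 1].
    """
    positive = sum(w * features.get(name, 0.0) for name, w in weights.items())
    return positive - penalty * token_cost


candidate = {"semantic_relevance": 0.8, "action_utility": 0.6, "recency": 0.5}
print(round(memory_score(candidate, token_cost=0.2), 3))
```

Note that the six positive weights sum to 1.0, so with the assumed scaling a score can range from -0.10 to 1.0.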
The skill store preserves reusable how-to knowledge as structured skill entries, each carrying:
- Intent: a short description of what the skill does.
- doc_markdown: a full Markdown document describing the procedure, commands, parameters, and caveats.
- Embedding: a dense vector of the intent text, used for similarity search.
- Metadata: usage counters, last-used timestamp, preconditions.
Skill retrieval uses HyDE (Hypothetical Document Embeddings): the query is first expanded into a hypothetical ideal answer by the LLM, then that draft text is embedded to produce a query vector that matches well against stored procedure descriptions, even when the user's original phrasing is vague.
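The HyDE flow described above can be sketched with the LLM and the embedder passed in as plain callables. Everything named here (`hyde_search`, `toy_generate`, `toy_embed`, the tiny vocabulary) is hypothetical; in particular the toy embedder is a bag-of-words stand-in, not a real embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hyde_search(query, generate, embed, skills, top_k=3):
    """Embed a hypothetical answer instead of the raw query, then rank skills."""
    draft = generate(query)    # LLM writes an ideal-looking answer
    query_vec = embed(draft)   # the draft, not the query, is embedded
    ranked = sorted(skills, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [skill_id for skill_id, _ in ranked[:top_k]]


VOCAB = ["pg_dump", "cron", "s3", "nginx"]


def toy_embed(text):
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in VOCAB]


def toy_generate(query):
    return "run pg_dump from cron and upload the dump to s3"


skills = [
    ("backup-to-s3", toy_embed("pg_dump cron s3")),
    ("reverse-proxy", toy_embed("nginx")),
]
print(hyde_search("how do I keep my database safe?", toy_generate, toy_embed, skills, top_k=1))
# ['backup-to-s3']
```

In this toy setup the raw query shares no vocabulary with any skill, so embedding it directly would give a zero vector; the hypothetical draft is what carries the matching terms.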
Every request passes through a fixed sequence of five agents. Four are synchronous stages in the LangGraph pipeline; one is a background post-processing task.
Rule-based agent (no LLM prompt). Appends the user turn to the session log, counts tokens, and fires compression if either threshold is crossed. Produces compressed_history and raw_recent_turns for downstream stages.
SearchCoordinator first builds recent_context from compressed summaries + raw recent turns, then derives an ActionState from the current query, constraints, recent failures, token budget, and recent tool use. ActionAwareRetrievalPlanner uses that state to produce a SearchPlan containing mode, semantic_queries, pragmatic_queries, tool_hints, required_affordances, missing_slots, tree-traversal strategy, and more. Multi-channel recall (FTS, semantic vector, normalised vector, tag/affordance filter, temporal filter, slot-hint supplementation, plus tree expansion when needed) then queries SQLite + LanceDB. This stage returns raw semantic fragments, skill hits, retrieval provenance, and a dedicated novelty_retrieved_context built from pre-synthesis semantic fragments for later novelty checking; it does not build the final background_context yet. Skill retrieval is mode-aware (answer / action / mixed) and uses HyDE against the skill store only when it is likely to help.
When a new user turn arrives, SearchCoordinator also tries to apply lightweight feedback to the most recent unresolved action/mixed retrieval log, so the next turn can mark the prior memory usage as success / fail / correction.
Acts as an LLM-as-Judge: scores every retrieved memory fragment on an absolute 0-1 relevance scale, discards fragments below the threshold (default 0.6), and fuses the survivors into a single dense background_context string. It also identifies skill_reuse_plan entries that can directly guide the final response. This stage is where the final answer-time context is built; it outputs provenance — a citation list containing scoring breakdown and source references for each kept memory item.
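The threshold step of this stage can be sketched as follows, assuming the judge has already assigned each fragment a score in [0, 1]. The function name `fuse_fragments` is hypothetical, and joining the survivors with newlines is only a stand-in for the LLM-driven fusion described above.

```python
def fuse_fragments(scored_fragments, threshold=0.6):
    """Keep fragments scoring at or above the threshold and join them.

    scored_fragments: list of (text, score) pairs produced by the judge.
    """
    kept = [(text, score) for text, score in scored_fragments if score >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)  # most relevant first
    return {
        "background_context": "\n".join(text for text, _ in kept),
        "kept_count": len(kept),
        "dropped_count": len(scored_fragments) - len(kept),
    }


result = fuse_fragments([
    ("Use pg_dump with cron", 0.91),
    ("Prefers dark mode", 0.40),
    ("Backups go to S3", 0.60),
])
print(result["kept_count"], result["dropped_count"])
```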
Receives compressed_history, background_context, and skill_reuse_plan and generates the final assistant reply. It appends the assistant turn back to the session store, and the pipeline finalizes the semantic usage log with a response excerpt so the next user turn can provide outcome feedback.
Triggered immediately after ReasoningAgent completes; it runs in a thread pool and does not block the response. It:
- Performs a novelty check: an LLM judges whether the conversation introduced new information worth persisting. Pure retrieval exchanges skip consolidation.
- Compact consolidation: calls CompactSemanticEngine.ingest_conversation(), which runs a single-pass encoder (typed extraction → decontextualization → action metadata annotation), writes MemoryRecords to SQLite + LanceDB, then triggers conflict-aware Record Fusion. The novelty check uses the search-stage novelty_retrieved_context (raw semantic fragments), not the answer-time background_context, so query-conditioned synthesis does not suppress valid new-memory ingestion.
- Skill extraction: identifies successful tool-usage patterns in the conversation and adds skill entries to the skill store. Runs in parallel with compact consolidation (ThreadPoolExecutor).
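The control flow of these three steps can be sketched with the standard library's ThreadPoolExecutor. The callables `is_novel`, `ingest`, and `extract_skills` are hypothetical stand-ins for the LLM-backed components, and the "skipped" / "done" status labels are assumptions; only the novelty gate and the parallel execution come from the description above.

```python
from concurrent.futures import ThreadPoolExecutor


def consolidate(turns, retrieved_context, is_novel, ingest, extract_skills):
    """Gate on novelty, then run ingestion and skill extraction in parallel."""
    if not is_novel(turns, retrieved_context):
        return {"status": "skipped", "facts_added": 0, "skills_added": 0}
    with ThreadPoolExecutor(max_workers=2) as pool:
        facts_future = pool.submit(ingest, turns)
        skills_future = pool.submit(extract_skills, turns)
        return {
            "status": "done",
            "facts_added": facts_future.result(),
            "skills_added": skills_future.result(),
        }


turns = ["I usually back up PostgreSQL with pg_dump to S3."]
outcome = consolidate(
    turns,
    retrieved_context="",
    is_novel=lambda t, ctx: all(line not in ctx for line in t),
    ingest=lambda t: len(t),
    extract_skills=lambda t: 0,
)
print(outcome)
```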
Query both the semantic memory channel and the skill store in a single call. New integrations should prefer semantic_results; graph_results is kept as a backward-compatible alias. The response also includes novelty_retrieved_context, which is the correct input for later /memory/consolidate calls.
// Request
{
"query": "what tools do I use for database backups",
"top_k": 5,
"include_graph": true,
"include_skills": true
}
// Response
{
"query": "...",
"graph_results": [
{
"anchor": {
"node_id": "compact_context",
"name": "CompactSemanticMemory",
"label": "SemanticContext",
"score": 1.0
},
"constructed_context": "...",
"provenance": [ { "record_id": "...", "source": "record", "semantic_source_type": "record", "score": 0.91, ... } ]
}
],
"semantic_results": [
{
"anchor": { "node_id": "compact_context", "name": "CompactSemanticMemory", "label": "SemanticContext", "score": 1.0 },
"constructed_context": "...",
"provenance": [ { "record_id": "...", "source": "record", "semantic_source_type": "record", "score": 0.91, ... } ]
}
],
"novelty_retrieved_context": "[1] (procedure, source=record) Use pg_dump with cron ...",
"skill_results": [ { "id": "...", "intent": "pg_dump backup to S3", "score": 0.87, ... } ],
"total": 6
}
Runs search and, optionally, synthesis in one API call. mode=compact is the default integration path when you want a concise background_context without handling intermediate payloads yourself. Even in compact mode, the response still returns novelty_retrieved_context so a host can consolidate against raw retrieved memory instead of answer-time synthesis.
// Request
{
"query": "what tools do I use for database backups",
"top_k": 5,
"synthesize": true,
"mode": "compact"
}
// Response
{
"query": "...",
"mode": "compact",
"synthesized": true,
"background_context": "User regularly uses pg_dump with a cron job...",
"skill_reuse_plan": [ { "skill_id": "...", "intent": "...", "doc_markdown": "..." } ],
"provenance": [ { "record_id": "...", "source": "record", "score": 0.91, ... } ],
"novelty_retrieved_context": "[1] (procedure, source=record) Use pg_dump with cron ...",
"kept_count": 4,
"dropped_count": 2,
"total": 6
}
Takes raw retrieval results and produces a fused memory context using LLM-as-Judge.
// Request
{
"user_query": "what tools do I use for database backups",
"semantic_results": [...], // preferred from /memory/search
"graph_results": [...], // compatibility alias also accepted
"skill_results": [...]
}
// Response
{
"background_context": "User regularly uses pg_dump with a cron job...",
"skill_reuse_plan": [ { "skill_id": "...", "intent": "...", "doc_markdown": "..." } ],
"provenance": [ { "record_id": "...", "source": "semantic", "semantic_source_type": "record", "score": 0.91, ... } ],
"kept_count": 4,
"dropped_count": 2
}
Runs the ReasoningAgent given pre-synthesized context. Can be chained after /memory/synthesize for full pipeline control.
// Request
{
"session_id": "my-session",
"user_query": "what tools do I use for database backups",
"background_context": "User regularly uses pg_dump...",
"skill_reuse_plan": [...],
"append_to_session": true // write result to session history (default: true)
}
// Response
{
"response": "You typically use pg_dump scheduled via cron...",
"session_id": "my-session",
"wm_token_usage": 3412
}
Appends one user or assistant turn to LycheeMem's session store so it can be consolidated later.
// Request
{
"session_id": "my-session",
"role": "user",
"content": "I usually back up PostgreSQL with pg_dump to S3."
}
// Response
{
"status": "appended",
"session_id": "my-session",
"turn_count": 3
}
Manually triggers memory consolidation for a session. This is the primary consolidation endpoint and supports both background and synchronous modes.
retrieved_context should preferably be the novelty_retrieved_context returned by /memory/search or /memory/smart-search, i.e. the search-stage raw semantic fragments, not /memory/synthesize's background_context.
// Request
{
"session_id": "my-session",
"retrieved_context": "[1] (procedure, source=record) Use pg_dump with cron ...",
"background": true
}
// Response (background mode)
{
"status": "started",
"entities_added": 0,
"skills_added": 0,
"facts_added": 0
}
Legacy compatibility endpoint: POST /memory/consolidate/{session_id}.
Returns the current semantic memory as a hierarchy. mode=cleaned (default) emits tree_roots plus direct tree edges for the frontend memory-tree view; mode=debug exposes the lower-level flattened relations for inspection.
Use these endpoints for operational checks and background consolidation polling:
- GET /pipeline/status returns aggregate counts for sessions, semantic memory, and skills.
- GET /pipeline/last-consolidation?session_id=<id> returns the latest consolidation result for a session, or pending if the background task has not finished yet.
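Polling for a background consolidation result might look like the sketch below. The endpoint path comes from the list above, but the helper names are hypothetical and the response shape is an assumption: the code treats a JSON object whose "status" field equals "pending" as unfinished, which may not match the actual payload.

```python
import json
import time
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"


def last_consolidation_url(session_id, base_url=BASE_URL):
    """Build the polling URL with a safely encoded session_id."""
    query = urllib.parse.urlencode({"session_id": session_id})
    return f"{base_url}/pipeline/last-consolidation?{query}"


def fetch_json(url):
    """GET a URL and decode the JSON body (requires a running server)."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def wait_for_consolidation(session_id, fetch=fetch_json, attempts=10, delay=1.0):
    """Poll until the result is no longer pending (assumed "status" field)."""
    url = last_consolidation_url(session_id)
    for _ in range(attempts):
        result = fetch(url)
        if result.get("status") != "pending":
            return result
        time.sleep(delay)
    raise TimeoutError(f"consolidation still pending for session {session_id!r}")
```

Passing `fetch` as a parameter keeps the loop testable without a server, for example by supplying a function that returns canned responses.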
# Basic single-turn demo (automatically registers 'demo_user')
python examples/api_pipeline_demo.py
# Multi-turn chat demo (3 consecutive turns, followed by consolidation)
python examples/api_pipeline_demo.py --multi-turn
# Custom query and user credentials
python examples/api_pipeline_demo.py --username alice --password secret123 \
--query "How do I backup my database with pg_dump?"
# Use a fixed session_id (useful for accumulating history across multiple runs)
python examples/api_pipeline_demo.py --session-id my-test-session