Feed it PDFs. Ask questions. It remembers what you asked before.
Built with
LangChain · LangGraph · BM25 · LLMLingua · MCP Tools · LangSmith · FastAPI · Docker · Redis · Qdrant
MemoryForge is an agentic RAG API that answers questions about research documents — not with a fixed retrieve-then-generate chain, but with a LangGraph state machine that:
- 🧠 Decides retrieval strategy per query (BM25, dense, or external tool)
- ✂️ Compresses context up to 6× before hitting the LLM
- 🌐 Falls back to live arXiv search when the local index is insufficient
- 💾 Remembers full conversation history across sessions via Redis
```
User Query → FastAPI
  → LangGraph Agent
      → load memory (Redis session history)
      → BM25 Retriever (keyword + metadata filter)
      → LLMLingua (4–6× context compression)
      → MCP Tools (arXiv — when local index insufficient)
      → LLM (GPT-4o-mini / any OpenAI-compatible)
  → Answer + Sources + save turn to memory
```
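A minimal, self-contained sketch of the LangGraph pattern this describes; the node names, state fields, and routing heuristic below are illustrative, not the repo's actual agent.py:

```python
# Sketch of a LangGraph decision node routing between retrieval and direct
# generation. Node names and the toy heuristic are illustrative only.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class AgentState(TypedDict):
    query: str
    context: str
    answer: str

def retrieve(state: AgentState) -> dict:
    # Placeholder for the BM25/dense lookup (retriever.py in the repo)
    return {"context": f"docs matching: {state['query']}"}

def generate(state: AgentState) -> dict:
    # Placeholder for the compressed-context LLM call
    return {"answer": f"answer grounded in: {state['context']}"}

def route(state: AgentState) -> str:
    # Decision node: a toy heuristic standing in for the real strategy choice
    return "retrieve" if "paper" in state["query"].lower() else "generate"

graph = StateGraph(AgentState)
graph.add_node("retrieve", retrieve)
graph.add_node("generate", generate)
graph.add_conditional_edges(START, route, {"retrieve": "retrieve", "generate": "generate"})
graph.add_edge("retrieve", "generate")
graph.add_edge("generate", END)
agent = graph.compile()

print(agent.invoke({"query": "Summarize the transformer paper", "context": "", "answer": ""}))
```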
Async document ingestion — uploads never block the API:
```
POST /ingest → Redis Queue → Worker → BM25 Index (persisted to disk)
```
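One common way to implement that handoff is an RQ-style queue on top of the Redis service already in docker-compose. A sketch under that assumption; `worker.index_document` is a hypothetical job name, not necessarily what queue_worker.py registers:

```python
# Enqueue side of async ingestion, sketched with the `rq` library.
# `worker.index_document` is a hypothetical job name, not the repo's actual one.
from redis import Redis
from rq import Queue

queue = Queue("ingest", connection=Redis(host="redis"))

def enqueue_ingest(path: str) -> str:
    # Returns immediately; the worker chunks and indexes the file later.
    job = queue.enqueue("worker.index_document", path)
    return job.id  # surfaced to the client as job_id
```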
The full agent graph:

```mermaid
flowchart TD
    A[User Query / API Call] --> B[LangGraph Agent]
    B --> C{Decision Node}
    C -- Retrieval Needed? --> D[BM25 + Dense Hybrid Retrieval]
    C -- Use External Tool? --> E[MCP Tool - arXiv]
    D --> F[LLMLingua Compression]
    E --> F
    F --> G[LLM Generation]
    G --> H[Answer + Sources]
    B --> I["Memory: Redis - short/long-term"]
    I --> B
    subgraph Observability
        G --> J[LangSmith / OpenTelemetry]
        D --> J
        F --> J
        C --> J
    end
```
How the decision node routes a single query, including fallback paths:

```mermaid
flowchart TD
    Q[User Query] --> D{Decision Node}
    D -- No Retrieval --> G[LLM Generation]
    D -- BM25/Dense --> R[Hybrid Retrieval]
    D -- arXiv --> T[MCP Tool]
    R --> C[Compression]
    T --> C
    C --> G
    G --> A[Answer]
    subgraph Memory
        M[Short/Long-term Memory]
        G --> M
        M --> D
    end
    subgraph Fallbacks
        R -- No Docs --> T
        C -- Compression Fail --> T
    end
```
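The fallback edges above are where live arXiv search enters. A sketch of what such a tool can look like, using the `arxiv` Python package directly; the repo exposes this through MCP in mcp_tools.py, so treat this as an illustrative stand-in:

```python
# Illustrative arXiv fallback tool, using the `arxiv` Python package directly
# rather than the MCP wiring in mcp_tools.py.
import arxiv

def search_arxiv(query: str, max_results: int = 3) -> list[dict]:
    client = arxiv.Client()
    search = arxiv.Search(query=query, max_results=max_results)
    # Each result carries the title, abstract, and a stable arXiv URL
    return [
        {"title": r.title, "summary": r.summary, "url": r.entry_id}
        for r in client.results(search)
    ]
```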
| Layer | Technology | Why |
|---|---|---|
| 🌐 API | FastAPI | Async, Pydantic validation |
| 🤖 Agent Orchestration | LangGraph | Stateful decision-making, not a fixed chain |
| 🔍 Retrieval | BM25 (rank_bm25) + metadata filtering | Keyword search + filter by year/author/topic |
| ✂️ Context Compression | LLMLingua 2 | 4–6× token reduction before LLM call |
| 💾 Memory | Redis (dual-layer) | Short-term MessagesState + long-term session store |
| 🛠️ External Tools | MCP (arXiv API) | Live paper search when the local index is insufficient |
| 📦 Vector Store | Qdrant (ready) | Hybrid BM25 + dense retrieval at scale |
| 📊 Observability | LangSmith + OpenTelemetry | Full agent trace — latency, tokens, retrieval quality |
| 🐳 Deployment | Docker Compose | 4 services: app + worker + redis + qdrant |
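For the Retrieval row above, a minimal sketch of BM25 keyword search combined with a metadata filter using rank_bm25; the document layout and filter keys are illustrative, not retriever.py's exact shape:

```python
# Sketch of BM25 keyword retrieval with a metadata pre-filter, via rank_bm25.
# Corpus layout and filter keys are illustrative, not the repo's exact shape.
from rank_bm25 import BM25Okapi

docs = [
    {"text": "Attention is all you need ...", "year": "2017", "author": "Vaswani"},
    {"text": "BERT: pre-training of deep bidirectional transformers ...", "year": "2018", "author": "Devlin"},
]

def search(query: str, filters: dict | None = None, top_k: int = 5) -> list[dict]:
    # Apply metadata filters first, then rank the survivors with BM25
    pool = [d for d in docs if not filters or all(d.get(k) == v for k, v in filters.items())]
    if not pool:
        return []
    bm25 = BM25Okapi([d["text"].lower().split() for d in pool])
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(zip(pool, scores), key=lambda p: p[1], reverse=True)
    return [d for d, _ in ranked[:top_k]]
```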
Prerequisites: Docker + Docker Compose, an OpenAI API key, a LangChain API key (for LangSmith tracing).
```bash
# 1. Clone and configure
git clone https://github.com/divyat2605/MemoryForge
cd MemoryForge

# 2. Set your keys
cp .env.example .env
# → Add OPENAI_API_KEY and LANGCHAIN_API_KEY to .env

# 3. Start all 4 services
docker compose up --build
```

API available at http://localhost:8000 · Docs at http://localhost:8000/docs
⚠️ First run note: The LLMLingua model (~400MB) downloads on first startup. An extractive fallback activates automatically in the meantime — no code change needed.
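A sketch of that compress-or-fall-back behavior using LLMLingua-2's PromptCompressor; the checkpoint shown is the public LLMLingua-2 model, and compressor.py's actual fallback logic may differ:

```python
# Sketch of compression with an extractive fallback, using llmlingua.
# The checkpoint is the public LLMLingua-2 model; compressor.py may differ.
from llmlingua import PromptCompressor

try:
    compressor = PromptCompressor(
        model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
        use_llmlingua2=True,
    )
except Exception:
    compressor = None  # model not downloaded yet

def compress(context: str, question: str) -> str:
    if compressor is None:
        # Extractive fallback: keep sentences that share words with the question
        keep = [s for s in context.split(". ")
                if set(question.lower().split()) & set(s.lower().split())]
        return ". ".join(keep) or context
    # rate=0.2 keeps ~20% of tokens, i.e. roughly 5x reduction
    result = compressor.compress_prompt(context, question=question, rate=0.2)
    return result["compressed_prompt"]
```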
Upload a research PDF, TXT, or MD for async indexing.
```bash
curl -X POST http://localhost:8000/ingest \
  -F "file=@attention_is_all_you_need.pdf"
# → { "job_id": "abc123", "status": "queued" }
```

Check ingestion progress.
```bash
curl http://localhost:8000/ingest/status/abc123
# → { "status": "done", "chunks": 42, "doc": "attention..." }
```

Query indexed documents. Pass session_id for multi-turn memory.
```bash
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is the transformer attention mechanism?",
    "session_id": "my-session",
    "filters": { "year": "2017" },
    "top_k": 5
  }'
```

Follow-up queries with the same session_id retain full conversation context:
```bash
# agent remembers the previous turn
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "How does that compare to RNNs?", "session_id": "my-session"}'
```

List all indexed documents and their metadata.
```bash
curl http://localhost:8000/documents
```

Clear conversation history for a session.
```bash
curl -X DELETE http://localhost:8000/memory/my-session
```

MemoryForge uses a dual-layer memory system — Redis is already in docker-compose, so this is zero extra infra.
| Layer | Implementation | Scope |
|---|---|---|
| ⚡ Short-term | LangGraph MessagesState | Last 6 turns, in-state per request |
| 🗄️ Long-term | Redis, keyed by session_id | Full history, 7-day TTL, survives restarts |
Redis serves two roles in one service: async ingestion job queue + session memory store.
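A sketch of the long-term layer under those assumptions; the key naming and JSON shape are illustrative, not memory.py's exact scheme:

```python
# Sketch of the long-term layer: turns appended per session with a 7-day TTL.
# Key naming and JSON shape are illustrative, not memory.py's exact scheme.
import json
from redis import Redis

r = Redis(host="redis", decode_responses=True)
SESSION_TTL = 7 * 24 * 3600  # 7 days, matching the table above

def save_turn(session_id: str, question: str, answer: str) -> None:
    key = f"memory:{session_id}"
    r.rpush(key, json.dumps({"q": question, "a": answer}))
    r.expire(key, SESSION_TTL)  # rolling TTL, refreshed on every turn

def load_history(session_id: str) -> list[dict]:
    return [json.loads(t) for t in r.lrange(f"memory:{session_id}", 0, -1)]
```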
Add these to .env — zero code changes needed:
```
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=your_key_here
```

LangSmith automatically traces every LangGraph node (retrieve → compress → generate), capturing LLM token usage, per-query latency, and retrieval quality across all agent steps.
Current implementation uses in-memory BM25 (rank_bm25). Here's the upgrade path:
| Bottleneck | Current | At Scale |
|---|---|---|
| 🔍 Keyword search | rank_bm25 (in-memory) | Elasticsearch — BM25 built-in, distributed |
| 📦 Vector search | Single Qdrant container | Qdrant cluster — hybrid BM25 + dense ANN |
| 🌐 API throughput | Single FastAPI instance | docker compose up --scale app=3 |
| ⚙️ Ingestion speed | Redis queue (async) | Increase worker replicas |
```
MemoryForge/
├── main.py             # FastAPI — /ingest /query /memory /documents
├── agent.py            # LangGraph state machine — retrieve → compress → generate
├── memory.py           # Dual-layer memory — LangGraph MessagesState + Redis
├── retriever.py        # BM25 keyword search + metadata filtering
├── compressor.py       # LLMLingua context compression + extractive fallback
├── mcp_tools.py        # MCP tool definitions (arXiv search + paper fetch)
├── ingestor.py         # Doc chunking + auto metadata extraction
├── queue_worker.py     # Redis async ingestion worker
├── monitoring.py       # LangSmith + OpenTelemetry tracing
├── Dockerfile
├── docker-compose.yml  # app + worker + redis + qdrant
├── requirements.txt
├── .env.example
├── LICENSE             # Apache 2.0
└── README.md
```
Licensed under the Apache License 2.0.
Built with ❤️ by Divya Tripathi
⭐ Star this repo if you found it useful!