Perplexity-style answers · Native image understanding · Near-infinite memory · Streaming
An open framework that turns any local LLM into a Perplexity-like AI assistant —
with smart web search, multi-agent orchestration, multimodal vision, conversation memory,
and a plug-and-play tool system. Your infrastructure, your models, your data.
🇻🇳 Tiếng Việt — see the quick-start guide for Vietnamese users below.
Send images alongside your questions. Vision models see the raw pixels — no pre-description, no lossy OCR, no information loss.
You: img:images/prescription.jpg img:images/pill.png Thuốc gì đây? Liều dùng?
📎 images/prescription.jpg
📎 images/pill.png
⏳ Generating search queries.. done!
🔍 3 queries → 12 results merged
A: Đây là thuốc Amoxicillin 500mg, kháng sinh nhóm Penicillin.
Liều dùng thông thường cho người lớn: 500mg mỗi 8 giờ [S1].
Toa thuốc ghi liều 3 lần/ngày × 7 ngày, phù hợp với
hướng dẫn điều trị nhiễm khuẩn đường hô hấp [S2][S3].
Nguồn:
[S1] MIMS Vietnam — https://mims.com/vietnam/...
[S2] BYT Hướng dẫn sử dụng kháng sinh — https://...
[S3] Drugs.com Amoxicillin — https://drugs.com/...
💾 Memory saved · 🧠 +5 entities +3 relations +4 facts
(24.7s)
The model saw both the prescription scan and the pill photo, cross-referenced with web search results, and composed a cited Vietnamese answer — all in one turn.
Cloud AI search tools (Perplexity, ChatGPT Search, Gemini) are powerful but closed, expensive, and can't be customized. This framework gives you the same experience running on your own infrastructure with any LLM.
| | This Framework | Basic RAG Chatbot |
|---|---|---|
| 🖼️ Vision | Native multimodal — VLM sees raw images via ChatML | ❌ Text only |
| 🤖 Architecture | 5 specialized agents collaborate | 1 monolithic prompt |
| 🔍 Search | 3 expanded queries × RRF merge = 10-15 diverse results | 1 query, top-3 |
| 💬 Memory | 3-tier: facts (permanent) + summary + recent turns | Last N messages |
| 🔌 Tools | Plug-and-play: web, APIs, RAG, databases | Web search only |
| 🌊 Streaming | Token-by-token via CLI + SSE API | Usually blocking |
| ⏰ Time-aware | Agents know today's date, reason about "72h tới" | Static |
| 🧠 Thinking models | Native <think> tag handling (Qwen3, QwQ) | ❌ |
User: text + images
│
┌─────────────┼─────────────┐
│ │
┌────▼─────┐ ┌──────▼──────┐
│ Router │ │ img:a.jpg │
│ Agent │ │ img:b.png │
└────┬──────┘ │ ↓ base64 │
│ └──────┬──────┘
┌────────┼────────┐ │
▼ ▼ ▼ │
┌─────────┐ ┌──────┐ ┌─────┐ │
│ Web │ │Stud. │ │ RAG │ │
│Research.│ │ API │ │ │ │
└────┬────┘ └──┬───┘ └──┬──┘ │
└─────────┼────────┘ │
▼ │
┌──────────────┐ │
│ Composer │◄───────────────────┘
│ Agent │ images injected via
└──────┬───────┘ litellm vision patch
│
▼
┌───────────────────────┐
│ vLLM / Ollama │
│ (Qwen2.5-VL, etc.) │
│ Sees text + images │
└───────────┬───────────┘
│
▼
Streaming answer with
[S1][S2] citations
│
▼
┌───────────────────────┐
│ 3-Tier Memory Store │
│ Facts · Summary · │
│ Recent (persistent) │
└───────────────────────┘
Agents at a glance:
| Agent | Job | Tools | Sees images? |
|---|---|---|---|
| Router | Classify intent, pick agents | get_datetime | ✅ |
| Web Researcher | Multi-source search + verify | web_search_deep, web_search_expanded, web_crawl_url | ✅ |
| Student Data | Internal API lookups | student_get_profile, student_get_grades, student_get_attendance | — |
| Policy RAG | Policy/regulation vector search | rag_query_policy | — |
| Composer | Synthesize evidence → cited answer | (synthesis only) | ✅ |
git clone https://github.com/YOUR_USERNAME/simple-multi-agent-crewai.git
cd simple-multi-agent-crewai
git checkout feature/multimodal
pip install -e .
cp .env.example .env

# .env
LLM_API_KEY=your-key-here
TAVILY_API_KEY=tvly-xxxxxxxxxxxxx

Edit src/school_agents/config/llm.yaml:
llm:
  model: "qwen/qwen2.5-vl-32b"          # vision model for image support
  # model: "qwen/qwen3-32b"             # text-only (images will be ignored)
  base_url: "http://localhost:1234/v1"  # LMStudio / vLLM / Ollama
  api_key: "${LLM_API_KEY}"
  temperature: 0.4
  max_tokens: 8192
  structured_max_tokens: 16384          # thinking models need more headroom

# Interactive REPL with streaming (recommended)
python -m school_agents.run_chat --session demo --interactive --stream
# Single query with images
python -m school_agents.run_chat \
-q "Compare these two prescriptions" \
--image images/rx1.jpg images/rx2.jpg \
--stream
# FastAPI server
uvicorn school_agents.server:app --host 0.0.0.0 --port 8000

Images are injected at the lowest possible layer: a one-time monkey-patch on litellm.completion. When images are present, the last user message transforms from a plain string into the standard OpenAI Vision / ChatML multimodal array:
# Without images (business as usual)
{"role": "user", "content": "What's the weather?"}
# With images (automatic transformation)
{"role": "user", "content": [
{"type": "text", "text": "What medicine is this?"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4A..."}},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBO..."}},
]}

# Flag syntax (single query mode)
python -m school_agents.run_chat \
-q "What's in these photos?" \
--image photo1.jpg photo2.png \
--stream
# Inline syntax (REPL mode)
You: img:images/thuoc.jpg What is this medicine?
📎 images/thuoc.jpg
You: img:a.jpg img:b.jpg img:c.png Compare all three
📎 a.jpg
📎 b.jpg
📎 c.png
You: Thanks, summarize our conversation        ← text only, no images

# Multipart (file upload — web forms, curl)
curl -X POST http://localhost:8000/chat \
-F 'query=Identify this medicine' \
-F 'session_id=pharma01' \
-F 'images=@pill_front.jpg' \
-F 'images=@pill_back.jpg'
# JSON with base64 (mobile apps, programmatic clients)
curl -X POST http://localhost:8000/chat/json \
-H 'Content-Type: application/json' \
-d '{
"query": "What is this?",
"session_id": "pharma01",
"images": [
{"b64": "/9j/4AAQ...", "mime": "image/jpeg"},
{"b64": "iVBORw0K...", "mime": "image/png"}
]
}'

| Model | vLLM | Ollama | LMStudio | Notes |
|---|---|---|---|---|
| Qwen2.5-VL (7B/32B/72B) | ✅ | ✅ | ✅ | Recommended — strong Vietnamese + English |
| InternVL2.5 | ✅ | — | — | Strong multilingual vision |
| LLaVA-NeXT | ✅ | ✅ | ✅ | Good general-purpose |
| Gemma 3 | ✅ | ✅ | ✅ | Google's latest multimodal |
Text-only models (Qwen3, Llama) will ignore image inputs or error — use a VLM for vision features.
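For example, pointing llm.yaml at an Ollama-served VLM might look like the fragment below — an illustrative sketch, not a tested config: the base_url is Ollama's default OpenAI-compatible endpoint, and the exact model tag depends on what you pulled.

```
llm:
  model: "qwen2.5-vl"                   # adjust to the tag you pulled
  base_url: "http://localhost:11434/v1" # Ollama's OpenAI-compatible API
  api_key: "ollama"                     # Ollama accepts any placeholder key
  temperature: 0.4
  max_tokens: 8192
```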
One question becomes 3 targeted searches for maximum recall:
User: "Doanh số FPT Retail phụ thuộc Long Châu đúng không?"
Expanded:
1. "FPT Retail revenue dependence Long Châu 2026"
2. "doanh số FPT Retail phụ thuộc Long Châu"
3. "FPT Retail financial analysis Long Châu revenue share"
→ Tavily search × 3 queries × 5 results each
→ Reciprocal Rank Fusion merge + dedup
→ 12 unique results injected as [Pre-searched results]
Date-aware: the expander knows today's date and adds temporal context. Three modes: auto (silent), confirm (ask before searching), or off.
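Reciprocal Rank Fusion itself is only a few lines. The sketch below is a generic illustration (not the framework's actual `query_expander.py`): each result contributes 1/(k + rank) for every query list it appears in, with the URL as the dedup key, so items ranked well by several queries float to the top.

```python
from collections import defaultdict

def rrf_merge(result_lists, k=60, limit=12):
    """Merge ranked result lists with Reciprocal Rank Fusion.

    Each result scores 1 / (k + rank) per list it appears in;
    duplicates are collapsed by URL (first occurrence wins).
    """
    scores = defaultdict(float)
    seen = {}  # url -> result dict
    for results in result_lists:
        for rank, item in enumerate(results, start=1):
            url = item["url"]
            scores[url] += 1.0 / (k + rank)
            seen.setdefault(url, item)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [seen[url] for url in ranked[:limit]]

# Three query result lists collapse into one deduped ranking:
merged = rrf_merge([
    [{"url": "a"}, {"url": "b"}, {"url": "c"}],
    [{"url": "b"}, {"url": "d"}],
    [{"url": "b"}, {"url": "a"}],
])
# "b" appears in all three lists, so it ranks first
```

The constant k=60 is the conventional RRF damping value; it keeps a single top-1 hit from dominating results that appear consistently across queries.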
┌────────────────────────────────────────────────────────────┐
│ Tier 1: FACTS Never deleted, permanent │
│ Entities + relations + facts extracted every turn. │
│ Up to 10 per category per turn. Accumulates forever. │
│ │
│ "FPT_Retail → revenue_68%_from → Long_Châu" │
│ "Amoxicillin → treats → respiratory_infection" │
├────────────────────────────────────────────────────────────┤
│ Tier 2: SUMMARY LLM-compressed old turns │
│ Scales with conversation length: │
│ 2-4 turns → 3-5 sentences │
│ 5-10 turns → 1-2 paragraphs │
│ 10+ turns → 2-4 paragraphs organized by topic │
├────────────────────────────────────────────────────────────┤
│ Tier 3: RECENT TURNS Last 4, verbatim │
│ Size-guarded: auto-truncated if exceeds token budget. │
│ Full fidelity for immediate conversation context. │
└────────────────────────────────────────────────────────────┘
Persistent on disk (sessions.jsonl). Survives restarts.
Sessions load instantly on reconnect. Nothing is ever deleted.
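How the three tiers might be assembled into prompt context can be sketched as follows — a simplified illustration, not the actual `conversation_memory.py` API:

```python
def build_context(facts, summary, recent_turns, max_recent=4):
    """Assemble prompt context from the three memory tiers.

    facts        -- permanent "subject -> relation -> object" strings (Tier 1)
    summary      -- LLM-compressed digest of older turns (Tier 2)
    recent_turns -- (user, assistant) pairs kept verbatim (Tier 3)
    """
    parts = []
    if facts:
        parts.append("Known facts:\n" + "\n".join(f"- {f}" for f in facts))
    if summary:
        parts.append("Conversation so far:\n" + summary)
    for user, assistant in recent_turns[-max_recent:]:
        parts.append(f"User: {user}\nAssistant: {assistant}")
    return "\n\n".join(parts)

ctx = build_context(
    facts=["FPT_Retail -> revenue_68%_from -> Long_Châu"],
    summary="The user asked about FPT Retail's revenue mix.",
    recent_turns=[("Chi tiết hơn?", "Long Châu đóng góp 68%...")],
)
```

The key property is that only Tier 3 grows with raw turn length; Tiers 1 and 2 keep old context available at a roughly constant token cost.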
Full-featured API server with session management and multimodal support.
uvicorn school_agents.server:app --host 0.0.0.0 --port 8000

| Method | Endpoint | Description |
|---|---|---|
| POST | /chat | Multimodal chat (multipart: text + image files) |
| POST | /chat/stream | SSE streaming (multipart: text + image files) |
| POST | /chat/json | JSON-only chat (base64 images in body) |
| GET | /sessions/{id} | Session stats: turns, facts, summary |
| DELETE | /sessions/{id} | Clear session history |
| GET | /health | Health check + model info |
| Field | Type | Required | Description |
|---|---|---|---|
| query | string | ✅ | User question |
| session_id | string | — | Session ID (auto-generated if omitted) |
| images | file(s) | — | Image files — JPEG, PNG, WebP. Multiple OK. |
| student_id | string | — | For student API lookups |
| from_date | string | — | Date filter (YYYY-MM-DD) |
| to_date | string | — | Date filter (YYYY-MM-DD) |
{
  "answer": "Đây là thuốc Amoxicillin 500mg...",
  "session_id": "pharma01",
  "routes": ["web"],
  "elapsed_seconds": 24.7,
  "turn_count": 3
}

event: chunk
data: {"content": "Đây", "agent": "composer"}

event: chunk
data: {"content": " là", "agent": "composer"}

event: chunk
data: {"content": " thuốc", "agent": "composer"}

...

event: done
data: {"answer": "...", "session_id": "pharma01", ...}
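A client can consume the stream with any SSE library; the minimal hand-rolled parser below is a hypothetical helper (not part of the framework) that you could feed from `requests.post(..., stream=True).iter_lines(decode_unicode=True)`:

```python
import json

def parse_sse(lines):
    """Yield (event, data) pairs from an SSE line stream."""
    event = None
    for line in lines:
        line = line.strip()
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            yield event, json.loads(line[len("data:"):].strip())

# Example with a captured stream from /chat/stream:
sample = [
    'event: chunk', 'data: {"content": "Đây", "agent": "composer"}',
    'event: done', 'data: {"answer": "...", "session_id": "pharma01"}',
]
for event, data in parse_sse(sample):
    if event == "chunk":
        print(data["content"], end="", flush=True)
```

A production client would also handle blank keep-alive lines and reconnects, which this sketch omits.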
# Text-only query
curl -X POST http://localhost:8000/chat \
-F 'query=Tin tức AI mới nhất' \
-F 'session_id=news01'
# With images
curl -X POST http://localhost:8000/chat \
-F 'query=So sánh 2 toa thuốc' \
-F 'session_id=pharma01' \
-F 'images=@images/rx1.jpg' \
-F 'images=@images/rx2.jpg'
# JSON body (mobile/programmatic clients)
curl -X POST http://localhost:8000/chat/json \
-H 'Content-Type: application/json' \
-d '{"query":"Thời tiết Hà Nội","session_id":"weather01"}'
# SSE streaming
curl -N -X POST http://localhost:8000/chat/stream \
-F 'query=US-Iran situation analysis' \
-F 'session_id=geo01'
# Session management
curl http://localhost:8000/sessions/pharma01
curl -X DELETE http://localhost:8000/sessions/pharma01

Token-by-token streaming from the Composer agent only — intermediate agents (router, researcher) are filtered out. The Final Answer: prefix required by CrewAI is auto-stripped before display.
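Stripping that prefix from a live stream is slightly subtle because it can arrive split across several chunks. A hypothetical generator illustrating the idea (not the framework's actual code):

```python
def strip_final_answer(chunks, prefix="Final Answer:"):
    """Strip a leading 'Final Answer:' prefix from a token stream.

    Buffers tokens only until the prefix is confirmed or ruled out,
    then passes every later chunk through untouched.
    """
    buffer = ""
    checking = True
    for chunk in chunks:
        if not checking:
            yield chunk
            continue
        buffer += chunk
        stripped = buffer.lstrip()
        if stripped.startswith(prefix):
            yield stripped[len(prefix):].lstrip()
            checking = False
        elif not prefix.startswith(stripped):
            # can no longer match the prefix: flush and stop checking
            yield buffer
            checking = False
```

A real implementation would also flush any leftover buffer when the stream ends; this sketch keeps only the happy paths.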
⏳ Generating search queries.. done!
🔍 Query expansion (3 queries):
1. ...
2. ...
3. ...
Searching 3 queries...
✅ 12 results merged.
🔄 Web Researcher working...
A: [streaming tokens appear here...]
💾 Memory saved
🧠 +5E +3R +4F
(24.7s)
The framework is designed to be extended. Add a new data source in 5 steps:
1. Create the tool in tools/my_tools.py:
import json

from crewai.tools import tool

@tool("my_database_search")
def my_database_search(query: str) -> str:
    """Search the internal database for relevant records."""
    results = my_db.search(query)  # your database client
    return json.dumps(results)

2. Add the agent in config/agents.yaml:
my_specialist:
  role: "Database Specialist"
  goal: "Search internal database and return structured results."
  backstory: >
    You search the internal database. Call my_database_search ONCE,
    then return JSON prefixed with "Final Answer:" (system requirement).
  allow_delegation: false

3. Add the task in config/tasks.yaml:
my_task:
  description: |
    Search the database for: {user_query}
    You MUST prefix your JSON with "Final Answer:" — SYSTEM REQUIREMENT.
  expected_output: "Final Answer: JSON with results."
  agent: my_specialist

4. Wire it up — Add your agent to _make_agents() and route in crew_runner.py
5. Update routing — Add "my_route" to the router logic in agents.yaml and tasks.yaml
simple-multi-agent-crewai/
├── .env.example # API key template
├── pyproject.toml
├── README.md # ← you are here
├── ARCHITECTURE.md # Detailed flow diagrams (491 lines)
│
└── src/school_agents/
│
├── server.py # FastAPI: /chat, /chat/stream, /chat/json
├── run_chat.py # Interactive CLI: REPL + images + streaming
├── run.py # Legacy single-shot CLI (no memory)
│
├── crew_runner.py # CrewAI orchestration + vision litellm patch
├── image_context.py # Thread-local image store (set/get/clear)
├── config.py # YAML loader with ${ENV_VAR} resolution
├── tool_context.py # Thread-local tool config
├── llm_utils.py # <think> tag handling, JSON extraction
│
├── conversation_memory.py # 3-tier memory: facts + summary + recent
├── memory_bank.py # JSONL persistent session backend
├── fact_store.py # Entity / relation / fact extraction
├── context_compressor.py # LLM summary compression (scales by length)
├── query_expander.py # 3-query expansion + reciprocal rank fusion
│
├── config/
│ ├── llm.yaml # Model, base_url, temperature, max_tokens
│ ├── agents.yaml # 5 agents: Perplexity-style prompts
│ ├── tasks.yaml # 5 tasks: output formats + Final Answer rules
│ ├── tools.yaml # Tavily, student API, RAG endpoints
│ └── memory.yaml # Compression, expansion, facts settings
│
├── tools/
│ ├── web_tools.py # Tavily search, crawl, expand+search
│ ├── student_tools.py # Student REST API (example)
│ ├── rag_tools.py # Policy RAG / vector search (example)
│ ├── datetime_tools.py # Vietnamese-aware date/time
│ └── speech_tools.py # STT/TTS skeleton (optional)
│
└── images/ # Test images folder
All config lives in src/school_agents/config/. Secrets use ${ENV_VAR} syntax, auto-resolved from .env.
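A minimal resolver for that ${ENV_VAR} pattern might look like the following — an illustrative sketch, not the framework's actual `config.py`:

```python
import os
import re

_VAR = re.compile(r"\$\{([A-Z0-9_]+)\}")

def resolve_env(value):
    """Recursively replace ${VAR} placeholders with os.environ values.

    Unknown variables are left as-is so missing keys fail loudly
    downstream instead of silently becoming empty strings.
    """
    if isinstance(value, str):
        return _VAR.sub(lambda m: os.environ.get(m.group(1), m.group(0)), value)
    if isinstance(value, dict):
        return {k: resolve_env(v) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve_env(v) for v in value]
    return value

os.environ["LLM_API_KEY"] = "sk-demo"  # normally loaded from .env
cfg = resolve_env({"llm": {"api_key": "${LLM_API_KEY}", "temperature": 0.4}})
# cfg["llm"]["api_key"] == "sk-demo"
```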
| File | Purpose | Key Settings |
|---|---|---|
| llm.yaml | LLM connection | model, base_url, max_tokens, structured_max_tokens |
| tools.yaml | External APIs | Tavily (search_depth: advanced, max_results: 5), student API, RAG |
| agents.yaml | Agent prompts | Perplexity-style routing, multi-source verification, claim-level citations |
| tasks.yaml | Task prompts | Output formats, Final Answer: requirement for CrewAI parser |
| memory.yaml | Memory system | Turn limits, compressor strategy, expand mode (auto/confirm/off), fact extraction |
| Backend | Text | Vision | Best For |
|---|---|---|---|
| vLLM | ✅ | ✅ | Production — continuous batching, PagedAttention, H100/H200 optimized |
| Ollama | ✅ | ✅ | Quick start — ollama pull qwen2.5-vl and go |
| LMStudio | ✅ | ✅ | Local dev — GUI model management |
| OpenAI | ✅ | ✅ | Cloud fallback — GPT-4o for vision |
| Any OpenAI-compatible | ✅ | ⚠️ | Vision requires ChatML multimodal format support |
| Component | Role |
|---|---|
| CrewAI | Multi-agent orchestration, ReAct loop, sequential task pipeline |
| Tavily | Web search API — deep search, crawl, extract (handles PDFs natively) |
| FastAPI | Async API server — SSE streaming, multipart upload, session management |
| LiteLLM | Universal LLM gateway — patched for multimodal vision injection |
| json-repair | Robust JSON parsing from LLM output with thinking tags |
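The cleanup step that makes thinking-model output parseable can be approximated with the stdlib alone — the real pipeline uses json-repair for malformed JSON, so this hypothetical `extract_json` handles only well-formed objects:

```python
import json
import re

def extract_json(raw: str):
    """Drop <think>...</think> blocks, then parse the first JSON object.

    Sketch only: json-repair would additionally fix truncated or
    slightly malformed JSON that plain json.loads rejects.
    """
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    start = cleaned.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    decoder = json.JSONDecoder()
    obj, _ = decoder.raw_decode(cleaned[start:])  # ignores trailing text
    return obj

reply = '<think>The user wants routing.</think>\nFinal Answer: {"routes": ["web"]}'
print(extract_json(reply))  # {'routes': ['web']}
```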
An open-source framework that turns any local LLM into a Perplexity-style AI assistant — smart web search, source-cited answers, realtime streaming, direct image understanding through vision models, and near-infinite conversation memory.
Runs on your infrastructure, with your models.
| | Feature | Details |
|---|---|---|
| 🖼️ | Image understanding | Send images with your question — the VLM sees the raw image, no intermediate description |
| 🔍 | Smart search | 1 question → 3 queries (English + Vietnamese + analytical) → RRF merge → 10-15 results |
| 🤖 | Multi-agent | Router → Researcher → Composer, Perplexity-style with [S1][S2] citations |
| 💬 | Conversation memory | 3 tiers: permanent facts + compressed summary + last 4 turns. Persists across sessions. |
| 🔌 | Easy to extend | Add internal APIs, RAG, and databases alongside web search |
| 🌊 | Streaming | Token-by-token via CLI and SSE for web/mobile clients |
| ⏰ | Time-aware | Auto-injects the current time; reasons about "next 72h" and "this week" accurately |
| 🧠 | Thinking models | Supports Qwen3, QwQ — <think> tag handling, robust JSON extraction |
git clone https://github.com/YOUR_USERNAME/simple-multi-agent-crewai.git
cd simple-multi-agent-crewai && git checkout feature/multimodal
pip install -e .
cp .env.example .env && nano .env        # Fill in: LLM_API_KEY, TAVILY_API_KEY
nano src/school_agents/config/llm.yaml   # Set model + base_url

# Interactive CLI
python -m school_agents.run_chat --session test --interactive --stream
# With images
python -m school_agents.run_chat -q "Thuốc gì?" --image images/thuoc.jpg --stream

# REPL — use the img: prefix
You: img:images/thuoc.jpg Thuốc gì đây?
📎 images/thuoc.jpg
You: img:images/a.jpg img:images/b.png So sánh 2 toa thuốc
📎 images/a.jpg
📎 images/b.png
You: Tóm tắt cuộc trò chuyện        ← text only, no images

# API server
uvicorn school_agents.server:app --host 0.0.0.0 --port 8000
# Text
curl -X POST http://localhost:8000/chat \
-F 'query=Tin tức hôm nay' -F 'session_id=test'
# With images
curl -X POST http://localhost:8000/chat \
-F 'query=Thuốc gì đây?' -F 'session_id=test' \
-F 'images=@images/thuoc.jpg'
# Session management
curl http://localhost:8000/sessions/test             # view stats
curl -X DELETE http://localhost:8000/sessions/test   # delete

You: Doanh số FPT Retail phụ thuộc chủ yếu vào Long Châu đúng không?
⏳ Generating search queries.. done!
🔍 Query expansion (3 queries):
1. FPT Retail revenue dependence on Long Châu 2026
2. Doanh số và thu nhập của FPT Retail phụ thuộc vào Long Châu
3. FPT Retail financial performance analysis Long Châu impact
Search all? [Y/n/edit]
Searching 3 queries...
✅ 12 results merged.
🔄 Web Researcher working...
A: Đúng, Long Châu đóng góp 68% doanh thu FPT Retail năm 2025 [S4].
Kinh doanh dược phẩm chiếm 91% tổng doanh thu dự kiến [S1].
FPT Retail dự kiến tăng trưởng 15% năm 2026 nhờ mở rộng Long Châu [S3].
Nguồn:
[S1] FPT Retail (FRT) BUY — vietcap.com.vn/...
[S3] Vietnam Consumer 2026 — gtjai.com.vn/...
[S4] FPT Retail nhóm tốt nhất châu Á — vnexpress.net/...
💾 Saving to memory...
🧠 Facts: +5E +5R +5F
(89.2s)
The framework is designed to make adding tools easy. Bundled examples: student_tools.py (REST API), rag_tools.py (vector search), speech_tools.py (STT/TTS). See Adding Your Own Tools for details and ARCHITECTURE.md for the system architecture.
MIT
PRs welcome. Please open an issue first to discuss what you would like to change.