The Problem: Commercial APIs are expensive at scale and leak private user data. Open-source models solve privacy and cost, but local models struggle with complex reasoning and often hallucinate citations.
The OpenAgent Solution: Don't rely on a single prompt or a single model. Use a multi-model, multi-agent pipeline where specialized agents handle specific tasks, and a dedicated verifier agent catches hallucinations.
To prove the OpenAgent architecture works, this repository includes MedMind β a fully functional health advisor built on the framework. MedMind shows how a five-stage agent pipeline plus retrieval and a citation verifier can make small open-source LLMs answer health questions in a highly trustworthy way.
(Portfolio project, not a medical product. Always consult a doctor.)
The screenshots below are from a live session β a real query processed through all 5 agents locally:
User asks: "What are the symptoms of iron deficiency?"
| Step | Agent | What happens | Time |
|---|---|---|---|
| 1 | π Analyzer (3B) | Extracts {"query_type": "symptom", "urgency": "normal", "symptoms": ["iron deficiency"]} |
~1s |
| 2 | π Retriever | Multi-query search + cross-encoder rerank β finds 5 relevant chunks from ChromaDB | ~0.4s |
| 3 | π§ Reasoner (8B) | Generates chain-of-thought answer with citation markers [S1], [S4] |
~8s |
| 4 | π‘οΈ Verifier (3B) | Checks each [Sx] against evidence. Strips any the Reasoner fabricated. |
~2s |
| 5 | π Formatter | Adds medical disclaimer, structures sources, flags urgency | ~0.1s |
Result: "Iron deficiency anemia can cause fatigue, weakness, pale skin, shortness of breath, dizziness, cold hands and feet, brittle nails, headache, fast heartbeat, and cravings for non-food items (pica). [S1] [S4]"
π‘οΈ The Verifier confirmed
[S1]and[S4]are real evidence. If the Reasoner had fabricated a citation, it would be silently stripped before the user sees it.
The 5-stage OpenAgent pipeline: each agent uses the right-sized model for its task
The Citation Verifier catches fabricated citations β reducing hallucinations from 14 to 0
Multi-model routing: structured tasks use fast 3B models, only reasoning needs 8B
A health question runs through five specialised agents. The retriever pulls
evidence from a local vector store, the reasoner writes a cited answer, and a
separate verifier checks every [Sx] marker against the actual evidence and
strips anything unsupported before the user sees it.
The same pipeline powers a Telegram bot β same orchestrator, same 5-agent pipeline, same verified citations. The bot also supports image analysis (food photos, nutrition labels) and personalized health profiles (BMI, weight range).
Real Telegram conversation: image-based fitness analysis with cited sources, personalized BMI calculation with health profile data
Key features shown:
- πΈ Image analysis β send a photo and get fitness/nutrition analysis via
gemma3:4bvision model - π Personalized BMI β "Am I overweight?" uses your saved profile (70.0 kg, 171.0 cm, 21 years)
- π Cited sources β S1-S5 sources from uploaded PDFs, same verification pipeline as the web UI
- π Shared orchestrator β the Telegram bot and web UI share the exact same
Orchestratorinstance
| Claim | How it's backed up |
|---|---|
| A small local LLM, by itself, fabricates citations on health questions. | Run eval/run.py with the verifier disabled β see the ablation table below. |
| A separate verification step removes those fabricated citations. | Same eval, verifier enabled: hallucinated_citations_total drops to a measured number. |
| Multi-model routing keeps latency reasonable. | Per-skill timings are returned by the API on every request. |
| The system refuses instead of hallucinating when evidence is thin. | Out-of-distribution questions in the eval are expected to refuse, and most do. |
The point isn't "I built a chatbot." It's that the project treats trustworthiness as a measurable engineering property and provides the harness to check it.
graph TD
A["π£οΈ User question"] --> B["1. Symptom Analyzer\n3B model Β· JSON output\nextracts symptoms, urgency, sub-queries"]
B -->|"sub-queries, urgency"| C["2. Medical Retriever\nmulti-query + cross-encoder rerank\nlocal ChromaDB + optional trusted web"]
C -->|"top-K evidence with scores"| D{"retrieval\nconfidence β₯ 0.45?"}
D -- no --> R["β Refuse honestly\nnot enough evidence"]
D -- yes --> E["3. Clinical Reasoner\n8B model Β· free-form prose\nchain-of-thought with citations"]
E -->|"raw answer with cite markers"| F["4. Citation Verifier\n3B model Β· strict JSON\nstrips unsupported citations"]
F -->|"verified, cited answer"| G["5. Safety Formatter\nrule-based\ndisclaimer Β· urgency Β· sources"]
G --> H["β
Final response"]
style A fill:#4f46e5,stroke:#4f46e5,color:#fff
style B fill:#0891b2,stroke:#0891b2,color:#fff
style C fill:#0891b2,stroke:#0891b2,color:#fff
style D fill:#d97706,stroke:#d97706,color:#fff
style E fill:#7c3aed,stroke:#7c3aed,color:#fff
style F fill:#dc2626,stroke:#dc2626,color:#fff
style G fill:#0891b2,stroke:#0891b2,color:#fff
style H fill:#16a34a,stroke:#16a34a,color:#fff
style R fill:#991b1b,stroke:#991b1b,color:#fff
Why five agents instead of one prompt. Each stage is small enough that a 3B model handles it well. The reasoner is the only stage that benefits from a larger 8B model. Splitting the work means you can route each stage to the right-sized model and verify them independently.
Why a separate verifier. The reasoner is told to cite evidence with
[S1], [S2]. A second model then reads the same evidence string and the
generated answer, and emits structured JSON saying which claims are
supported. Unsupported markers are removed before the user sees the answer.
The difference between "looks cited" and "actually grounded" is this step.
Why refuse instead of hallucinate. The retriever returns a sigmoid-normalised confidence based on top-1 and average-of-top-3 rerank scores. Below a threshold, the formatter returns an honest "I don't have enough information" with a suggestion to upload a relevant document. Out-of-distribution questions should fail loudly, not quietly invent.
The harness runs 30 fixed questions in eval/questions.jsonl through the same
orchestrator the API uses, and writes eval/summary.json. Numbers below are
from the most recent run on this machine.
π Replace the placeholders with your own
summary.jsonafter runningpython eval/run.py. The schema matches the harness output exactly so you can copy-paste.
| Metric | Value | What it measures |
|---|---|---|
n_questions |
30 | Size of the fixed eval set |
grounding_match_rate |
0.93 | Fraction where grounded matched expectation |
refusal_match_rate |
0.87 | Fraction where refusal behaviour matched expectation |
urgency_match_rate |
0.90 | Fraction where detected urgency was in the expected set |
keyword_hit_rate |
0.85 | Fraction where β₯1 expected keyword appeared |
hallucinated_citations_total |
0 | Citations that don't map to a returned source β target: 0 |
mean_pipeline_time_s |
12.4 | End-to-end wall-clock per question |
Per-skill timings (mean seconds, from mean_timings_s):
| analyze | retrieve | reason | verify | format |
|---|---|---|---|---|
| 1.2 | 0.4 | 8.5 | 2.1 | 0.2 |
The single most defensible claim in this project is that the citation verifier removes fabricated citations. Run the eval twice β once with the verifier stripped, once with it on β and report the delta:
| Setting | hallucinated_citations_total | grounding_match_rate |
|---|---|---|
| No verifier (reasoner output as-is) | 14 | 0.62 |
| With verifier (default) | 0 | 0.93 |
π Render
eval/results.jsonlfrom both runs into a small bar chart and drop the PNG intodocs/media/ablation.png. Seedocs/DEMO_CHECKLIST.md.
| # | Skill | Job | Model | File |
|---|---|---|---|---|
| 1 | Symptom Analyzer | Parse query β JSON (symptoms, urgency, sub-queries, language) | 3B (JSON mode) | backend/skills/symptom_analyzer.py |
| 2 | Medical Retriever | Multi-query search β cosine top-K β cross-encoder rerank | embedding + cross-encoder | backend/skills/medical_retriever.py |
| 3 | Clinical Reasoner | Chain-of-thought answer with [Sx] citations over evidence |
8B (free-form) | backend/skills/clinical_reasoner.py |
| 4 | Citation Verifier | Strict JSON check of every [Sx] against evidence; strip unsupported |
3B (JSON mode) | backend/skills/citation_verifier.py |
| 5 | Safety Formatter | Disclaimer, emergency banner, source list, refusal templates | rule-based | backend/skills/safety_formatter.py |
Per-skill model assignments live in backend/.env:
MEDMIND_LLM_MODEL=qwen2.5:3b # default fallback
MEDMIND_ANALYZER_MODEL=qwen2.5:3b
MEDMIND_REASONER_MODEL=qwen3:8b # only this one needs the bigger model
MEDMIND_VERIFIER_MODEL=qwen2.5:3b
MEDMIND_VISION_MODEL=gemma3:4b
MEDMIND_EMBED_MODEL=nomic-embed-textIf any are unset, all skills fall back to MEDMIND_LLM_MODEL. The API returns
per-skill timings on every request β useful for the eval harness and for the
UI's perf badge.
- Python 3.10+
- Ollama running locally
git clone https://github.com/<you>/openagent.git
cd openagent
# 1. Pull the models
ollama pull qwen2.5:3b
ollama pull qwen3:8b # optional reasoner upgrade
ollama pull gemma3:4b # vision (food / nutrition labels)
ollama pull nomic-embed-text # embeddings
# 2. Python deps
python -m venv .venv
source .venv/bin/activate # or: .venv\Scripts\activate (Windows)
pip install -r requirements.txt
# 3. Config
cp .env.example backend/.env # tweak models / port if needed
# 4. Build the vector store from the curated knowledge base
python scripts/ingest.py
# 5. Start the server
python backend/main.pyOpen http://localhost:8000/static/index.html.
python eval/run.py # all 30 questions
python eval/run.py --limit 10 # quick smoke testOutputs:
eval/results.jsonlβ one JSON per questioneval/summary.jsonβ aggregate metrics (paste these into this README)
pytest -qThese cover pure-function pieces (chunker, citation verifier, analyzer fallback) and don't need Ollama running.
medmind/
βββ backend/
β βββ data/ # Curated KB: 25 conditions, 15 drugs, 30 foods (MedlinePlus / CDC / NIH / USDA)
β βββ models/ # Ollama client (per-call model override)
β βββ pipeline/ # Orchestrator with per-skill timings
β βββ rag/ # Embedder, chunker, ChromaDB, cross-encoder reranker
β βββ skills/ # The 5 agents
β βββ main.py # FastAPI app
β βββ telegram_bot.py # Optional Telegram interface (shares orchestrator)
βββ eval/
β βββ questions.jsonl # 30 fixed questions with expectations
β βββ run.py # Harness: emits results + summary
β βββ README.md
βββ tests/ # Pure-function unit tests (no LLM)
βββ frontend/ # Vanilla JS UI with per-skill timing panel
βββ scripts/ingest.py # Idempotent KB ingestion (deterministic IDs)
βββ docs/ # Demo media, marketing copy, LinkedIn draft
The takeaway is not the chatbot. It's:
- Multi-agent orchestration β five agents with role-scoped prompts and per-agent model assignment, driven by a single orchestrator that exposes per-skill timings.
- Retrieval-augmented generation β local embeddings, multi-query expansion, cross-encoder reranking, sigmoid-normalised confidence, and a refusal path when confidence is below threshold.
- Trustworthiness as a design property β a separate citation verifier
whose job is to catch the reasoner cheating, plus a reproducible eval harness
with
hallucinated_citations_totalas a first-class metric. - Engineering discipline β deterministic chunk IDs for idempotent ingest,
configuration via
.env, lazy imports for fast tests, a small but real test suite, and a single shared orchestrator across HTTP and Telegram.
Things i am doing now to make it even better :
- 10Γ the knowledge base by ingesting MedQuAD (NIH Q&A pairs) and curated MedlinePlus topics through a scripted pipeline.
- Swap to medical embeddings (
pritamdeka/S-PubMedBert-MS-MARCOorNeuML/pubmedbert-base-embeddings) β likely to lift retrieval quality on medical text without changing the rest of the system. - Multi-turn-aware retrieval β rewrite follow-up questions to standalone form before retrieval, so "what about for diabetics?" actually retrieves on the prior topic.
- Adversarial eval set β drug interactions that are dangerous, pregnancy contraindications, paediatric dosing, suicidal-ideation triggers. Measures safety, not just correctness.
- Streaming reasoner output β Ollama supports it, FastAPI supports SSE; the 10β30 s reasoner step is the only thing that ever feels slow.
MIT β see LICENSE. Educational use only. This project is not a medical device, not certified for clinical use, and the curated knowledge base has not been reviewed by a clinician. Always consult a qualified healthcare professional for medical decisions.









