Multimodal RAG system on Azure. Documents go in, cited answers come out.
- File uploaded via FastAPI, SHA256-deduped into Azure Blob
- Event Grid triggers Storage Queue message
- KEDA-scaled ACA worker picks it up
- Adaptive 2-pass Docling: fast pass on all pages, expensive OCR only on pages with tables/images
- Semantic chunking by headers, image-to-text association via OCR caption overlap
- Batch embed via Azure embed-v-4-0 (text 96/batch, images 20/batch)
- Multi-tenant Weaviate ingest with bidirectional TextChunk/ImageChunk cross-refs
- Blob tags track state: processing / embedded / embedding_failed
- Query classified into RAG / MATH / TOOL / NO_CONTEXT
- Query rewritten for retrieval; RetrievalPlanner outputs hybrid alpha + k
- Custom Weaviate retrieval (WeaviateRM doesn't support BYOV + multi-tenancy)
- BGE-reranker-v2-m3 via Replicate
- LLM generates cited answer, SelfCritic retries if below threshold
- CosmosDB for conversation memory, MLflow for tracing
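The ingest side of the pipeline above leans on two small mechanics: content-addressed blob names for dedupe and fixed-size batches for the embedding API. A minimal sketch (the `uploads/` prefix is a hypothetical layout; the batch sizes are the ones stated above):

```python
import hashlib
from typing import Iterator


def blob_name_for(data: bytes) -> str:
    """Content-addressed blob name: identical uploads hash to the same
    name, so re-uploads dedupe for free. Prefix is illustrative."""
    return f"uploads/{hashlib.sha256(data).hexdigest()}"


def batched(items: list, size: int) -> Iterator[list]:
    """Split work into API-sized batches; embed-v-4-0 takes
    96 texts/batch and 20 images/batch per the pipeline above."""
    for i in range(0, len(items), size):
        yield items[i : i + size]
```

Because the name is a pure function of the bytes, the upload endpoint can short-circuit with a blob-exists check before writing.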
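On the query side, the SelfCritic step reduces to a score-and-retry wrapper. A sketch under stated assumptions: `generate` and `critique` stand in for the real DSPy modules, and the feedback string and 0.7 threshold are illustrative, not the system's actual values:

```python
from typing import Callable


def answer_with_critic(
    generate: Callable[[str], str],
    critique: Callable[[str], float],
    feedback: str = "Cite sources for every claim.",
    threshold: float = 0.7,
    max_retries: int = 2,
) -> str:
    """Generate an answer, score it with a critic, and regenerate with
    critic feedback while the score stays below threshold."""
    answer = generate("")  # first pass, no feedback
    for _ in range(max_retries):
        if critique(answer) >= threshold:
            break
        answer = generate(feedback)  # retry with critic feedback
    return answer
```

The retry cap matters: without `max_retries`, a critic that never clears the threshold would loop on LLM calls indefinitely.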
| Layer | What |
|---|---|
| Processing | Docling (adaptive 2-pass PDF), MarkItDown (rest) |
| Chunking | Custom semantic, header-aware, OCR-based image association |
| Embedding | Azure AI embed-v-4-0, 1536d, BYOV via LiteLLM |
| Vector DB | Weaviate Cloud, multi-tenant, HNSW dynamic, bidirectional cross-refs |
| Reranker | BGE-reranker-v2-m3 via Replicate |
| LLM | Azure OpenAI via LiteLLM |
| Inference | DSPy, 13 signatures, custom reranker wrapper |
| Chat memory | CosmosDB, per-user partitioned |
| Observability | MLflow on Azure ML Studio |
| Infra | Azure Container Apps (KEDA), Blob, Queue, Event Grid |
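The header-aware chunking in the table can be sketched as a markdown splitter that closes a chunk at every header and carries the header path as metadata. This is a simplified stand-in for the custom chunker, not its actual implementation (the `header_path` field name is hypothetical):

```python
import re


def chunk_by_headers(markdown: str) -> list[dict]:
    """Split markdown on headers so each chunk stays inside one section,
    tagging it with its full header path for retrieval metadata."""
    chunks: list[dict] = []
    path: list[str] = []  # current header hierarchy, e.g. ["A", "B"]
    buf: list[str] = []

    def flush() -> None:
        text = "\n".join(buf).strip()
        if text:
            chunks.append({"header_path": " > ".join(path), "text": text})
        buf.clear()

    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            flush()  # a new header closes the previous chunk
            level = len(m.group(1))
            path[:] = path[: level - 1] + [m.group(2).strip()]
        else:
            buf.append(line)
    flush()
    return chunks
```

A production version would additionally cap chunk size and overlap long sections, but the section boundary logic stays the same.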
- Adaptive 2-pass Docling -- fast pass first, expensive OCR only on pages that need it. Cut per-document processing from 86s to 5s on mixed PDFs.
- BYOV over native vectorizer -- embeddings computed via LiteLLM, passed to Weaviate manually. Full control over model swaps.
- Single-stage reranking -- ColBERT dropped; it added latency without improving relevance at our scale. BGE alone at 0.0015s/call via Replicate.
- Deterministic IDs -- uuid5 from node_id. Re-ingesting same file produces same objects. Idempotent.
- Image URLs, not base64 -- images uploaded to Blob, only URL stored in ImageChunk. Avoids index bloat.
- Bidirectional cross-refs -- TextChunk.hasImages / ImageChunk.belongsToText, wired in second pass. Preserved across backup/restore.
- All models via LiteLLM -- one SDK, swap any provider by changing a string.
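The deterministic-ID decision is just `uuid.uuid5`, which is a pure function of (namespace, name). A minimal sketch, assuming a fixed namespace (the one below is illustrative; the real pipeline pins its own):

```python
import uuid

# Hypothetical namespace; the real pipeline fixes whatever constant it chose.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "rag-ingest")


def object_id(node_id: str) -> str:
    """uuid5 is deterministic: the same node_id always maps to the same
    object ID, so re-ingesting a file upserts instead of duplicating."""
    return str(uuid.uuid5(NAMESPACE, node_id))
```

This is what makes re-ingestion idempotent: Weaviate inserts keyed by these IDs overwrite the previous objects rather than accumulating copies.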