Python service for generating multi-modal embeddings for social media content.
- CLIP Visual Embeddings (512-dim)
- Text Embeddings (768-dim) from captions and OCR
- OCR Extraction using EasyOCR
- NSFW Classification and Content Type Detection
- Video Support with frame extraction
- Batch Processing
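How the visual and text vectors combine into one multimodal representation is not spelled out above; a minimal sketch, assuming simple concatenation of the 512-dim CLIP vector with the 768-dim text vector (the fusion strategy and function name are illustrative assumptions, not the service's documented behavior):

```python
# Hypothetical fusion step: concatenate the 512-dim CLIP visual embedding
# with the 768-dim text embedding into a single 1280-dim vector.
def fuse_embeddings(visual: list[float], text: list[float]) -> list[float]:
    if len(visual) != 512 or len(text) != 768:
        raise ValueError("expected 512-dim visual and 768-dim text vectors")
    return visual + text  # 1280-dim combined vector
```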
pip install -r requirements.txt
uvicorn app.main:app --reload --port 8000

GET /health
POST /extract-multimodal
- files: Image/Video files
- caption: Optional text
POST /extract-features (Images)
POST /extract-features-video (Videos)
POST /extract-features-text (Text only)
POST /extract-ocr (OCR only)
POST /classify-nsfw (NSFW only)
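A standard-library client sketch for the text-only endpoint, assuming the service runs locally on the default port and accepts a JSON body with a `caption` field (the payload shape is an assumption; the route name comes from the list above):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local deployment

def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for one of the service endpoints."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending requires a running service, e.g.:
# with urllib.request.urlopen(req) as resp: print(json.load(resp))
req = build_request("/extract-features-text", {"caption": "sunset at the beach"})
```

The image/video endpoints take file uploads, so a real client would use multipart form data (e.g. via the `requests` library) rather than a JSON body.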
Environment variables:
- QDRANT_URL: Qdrant connection string
- MEDIA_STORAGE_PATH: Path to media files
- USE_GPU: Enable GPU acceleration (default: false)
- PORT: Service port (default: 8000)
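The variables above can be read into a config dict along these lines; the `USE_GPU` and `PORT` defaults come from the list, while the fallback values for `QDRANT_URL` and `MEDIA_STORAGE_PATH` are illustrative assumptions:

```python
import os

def load_config(env=os.environ) -> dict:
    """Read service configuration from environment variables."""
    return {
        "qdrant_url": env.get("QDRANT_URL", "http://localhost:6333"),  # assumed fallback
        "media_storage_path": env.get("MEDIA_STORAGE_PATH", "./media"),  # assumed fallback
        "use_gpu": env.get("USE_GPU", "false").lower() == "true",  # default: false
        "port": int(env.get("PORT", "8000")),  # default: 8000
    }
```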
- CLIP: openai/clip-vit-base-patch32
- Text: sentence-transformers/all-mpnet-base-v2
- NSFW: JanadaSroor/vit-nsfw-classifier
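One way to keep these IDs in a single registry alongside the embedding dimensions from the feature list; the registry itself is a sketch, not part of the documented API:

```python
# Hypothetical model registry; Hugging Face IDs are taken from the list above.
MODELS = {
    "clip": "openai/clip-vit-base-patch32",
    "text": "sentence-transformers/all-mpnet-base-v2",
    "nsfw": "JanadaSroor/vit-nsfw-classifier",
}

# Dimensions match the feature list: 512-dim CLIP visual, 768-dim text.
EMBEDDING_DIMS = {"clip": 512, "text": 768}
```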
- Image: ~500ms
- OCR: ~1-2s
- Text: ~50ms
- Video: ~5-10s (10 frames)
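The video figure is consistent with the image figure: a back-of-envelope check, assuming frames are embedded sequentially at the quoted per-image latency:

```python
# 10 frames at ~0.5 s per image embedding gives ~5 s, the low end of the
# quoted ~5-10 s video latency (extra time covers frame extraction and I/O).
frames = 10
per_image_s = 0.5
video_estimate_s = frames * per_image_s  # 5.0
```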
Apache License 2.0.