AutoCap is a premium, enterprise-grade multimodal AI dataset creation and management platform. It lets machine learning researchers and data engineers import images in batches, run automated image captioning through multiple deep learning architectures, evaluate semantic alignment scores, moderate content via a customizable flagging engine, and package curated image-caption pairs into deployable datasets.
- Key Features & User Flows
- System Architecture & Data Flow
- Technology Stack
- Project Directory Structure
- Database Schema Overview
- Getting Started (Local Setup)
- AI Core: Model Hub, VRAM Management & Flagging Engine
- API Integration & Callbacks
- UI/UX Design System Guidelines
- Production Deployment
- License & Contributing
VisionCaption splits workflows into two key journeys: Standard User Portal and Admin Control Panel.
- Main Dashboard: Drag-and-drop or select up to 50 images (max 10MB each, JPEG/PNG/WebP). Configure hyper-parameters for caption generation (temperature, length bounds, beam count, repetition penalty).
- Asynchronous Pipeline Tracker: Watch pipeline progress in real time through sequential steps: UPLOADING ➔ QUEUED ➔ PROCESSING ➔ GENERATING ➔ SCORING ➔ COMPLETE.
- Dataset Explorer: Deep-dive into completed datasets. View image thumbnails alongside generated captions, edit captions manually to fix errors, see detailed evaluation metrics, and download the entire dataset as a zipped archive containing the image files, captions.json, and captions.csv.
- Global Search: Search and filter public and personal datasets by category, metadata tags, model variant, date, and semantic similarity range.
- Feedback Portal: Rate generated captions, report bugs, and submit feature requests.
- Insights & Statistics Dashboard: Monitor system-wide KPIs, including active users, total datasets, images uploaded, flagging rates, and performance statistics across different model variants.
- User Management: Access accounts, assign roles (USER/ADMIN), and lock or unlock access.
- Metadata Configuration: Perform full CRUD operations on dataset Categories, Tags, and Tokenizer configurations.
- Feedback & Content Moderation Hub: Audit flagged caption records, review user feedback, and manage system tickets (New ➔ Reviewed ➔ Closed).
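The upload limits listed for the Main Dashboard (up to 50 images, 10MB each, JPEG/PNG/WebP) could be enforced with a check along these lines. This is a minimal sketch, not the project's actual validator: `UploadCandidate` and `validate_batch` are hypothetical names, and reading "10MB" as 10 MiB is an assumption.

```python
from dataclasses import dataclass

MAX_IMAGES = 50
MAX_BYTES = 10 * 1024 * 1024  # "10MB" read as 10 MiB (assumption)
ALLOWED_TYPES = {"image/jpeg", "image/png", "image/webp"}


@dataclass(frozen=True)
class UploadCandidate:
    """Hypothetical description of one file selected for upload."""
    filename: str
    content_type: str
    size_bytes: int


def validate_batch(files: list[UploadCandidate]) -> list[str]:
    """Return human-readable problems; an empty list means the batch is acceptable."""
    problems: list[str] = []
    if not files:
        problems.append("batch is empty")
    if len(files) > MAX_IMAGES:
        problems.append(f"too many images: {len(files)} > {MAX_IMAGES}")
    for f in files:
        if f.content_type not in ALLOWED_TYPES:
            problems.append(f"{f.filename}: unsupported type {f.content_type}")
        if f.size_bytes > MAX_BYTES:
            problems.append(f"{f.filename}: exceeds the 10MB limit")
    return problems
```

Returning every problem at once, rather than failing on the first, lets a client show all rejected files in a single pass.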
VisionCaption uses a modular, decoupled architecture. The React frontend interacts solely with the Spring Boot orchestration backend. Spring Boot delegates long-running AI inference tasks to the FastAPI service and handles database state, storage integrations, and JWT authentication.
sequenceDiagram
autonumber
actor User as User Browser
participant FE as React Frontend
participant BE as Spring Boot (8080)
participant DB as PostgreSQL (Supabase)
participant ST as Supabase Storage
participant AI as FastAPI Service (8000)
User->>FE: Upload Images & Set BLIP Config
FE->>BE: POST /api/images/upload (FormData + JWT)
Note over BE: Validate payload & check authorization
BE->>ST: Upload image files to storage bucket
ST-->>BE: Return public storage URLs
BE->>DB: INSERT images ('uploaded'), datasets, & captioning_jobs ('QUEUED')
BE->>AI: Async POST /api/jobs/process (jobId, imageUrls, blipConfig)
BE-->>FE: Return 202 Accepted (jobId)
loop Status Polling (Every 2.5s)
FE->>BE: GET /api/jobs/{id}/status
BE->>DB: SELECT job status
DB-->>BE: Return current status
BE-->>FE: Return JSON status tracker
end
Note over AI: VRAM Check & Swap Model
loop Inference Pipeline
AI->>AI: Generate candidate captions
AI->>AI: CLIP cosine similarity scoring
AI->>AI: Run content flagging checks
end
AI->>BE: POST /api/jobs/{id}/callback (captions + similarity scores + flagging)
Note over BE: Update DB & finalize statistics
BE->>DB: INSERT captions & UPDATE job status to 'COMPLETE'
User->>FE: View explorer page / Trigger download
FE->>BE: GET /api/datasets/{id}/download
BE-->>FE: Stream ZIP (images + captions.csv + captions.json)
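The status-polling loop in the diagram can be sketched as follows. The real client is the React frontend, so this Python version only illustrates the control flow; `poll_job`, the `max_polls` budget, and the injected `fetch_status` callable (standing in for GET /api/jobs/{id}/status) are assumptions made for the example.

```python
import time
from typing import Callable

# Terminal job states taken from the captioning_jobs description.
TERMINAL_STATES = {"COMPLETE", "FAILED"}


def poll_job(
    fetch_status: Callable[[], str],
    interval_s: float = 2.5,
    max_polls: int = 240,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal state or the poll budget runs out."""
    for attempt in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        # Skip the final sleep: there is no further poll to wait for.
        if attempt < max_polls - 1:
            sleep(interval_s)
    raise TimeoutError(f"job still running after {max_polls} polls")
```

Injecting `sleep` keeps the loop testable without real delays, and the poll budget prevents a stuck job from being polled indefinitely.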
- Core: React 18, TypeScript, Vite.
- Routing & State: React Router DOM (v6), Axios for API requests with cookie-based authorization interceptors.
- Styling: Custom Vanilla CSS utilizing variable-driven layouts and premium dark mode values.
- Core: Java 17, Spring Boot 3.x, Spring Security (JWT authentication and CORS control).
- Persistence: Spring Data JPA, Hibernate, PostgreSQL driver.
- File Processing: Spring Multipart (with large stream swallow overrides).
- Hosting: Supabase.
- Database: PostgreSQL (with Row-Level Security enabled).
- Storage: Supabase Storage bucket for securely hosting dataset images.
- Core: Python 3.10+, FastAPI, Uvicorn, PyTorch, Pillow.
- Deep Learning Frameworks: HuggingFace Transformers, HuggingFace Accelerate, bitsandbytes (for model weights quantization).
- Evaluation Metrics: CLIP (Contrastive Language-Image Pretraining) model for semantic similarity calculation.
AutoCap-Application/
├── ai-service/ # FastAPI AI Inference microservice
│ ├── flagging/ # Content moderation & auto-flagging engine
│ │ ├── domain_vocab.py # Problematic keyword lists & regex filters
│ │ ├── flag_engine.py # Moderation supervisor
│ │ └── rules.py # Evaluation rules (vocab & CLIP score bounds)
│ ├── models/ # Deep learning model architectures
│ │ ├── baseline.py # EncoderCNN + DecoderRNN
│ │ ├── caption.py # GPT-2 caption decoder
│ │ ├── clip_evaluator.py # CLIP model scorer & ranker
│ │ └── vit_model.py # Vision Transformer + LLaMA architecture
│ ├── services/
│ │ └── caption_service.py # Core generation orchestration & VRAM manager
│ ├── main.py # FastAPI entrypoint, HTTP router & background tasks
│ └── requirements.txt # Python packages list
│
├── backend/ # Spring Boot API Gateway & Orchestrator
│ ├── src/main/java/com/autocap/backend/
│ │ ├── config/ # Web Security, CORS, Password Encryptor
│ │ ├── controller/ # REST Endpoints (Auth, Dataset, Admin, Feedback)
│ │ ├── dto/ # Data Transfer Objects (requests & callbacks)
│ │ ├── entity/ # JPA Database Entities
│ │ ├── repository/ # Spring Data JPA Repository Interfaces
│ │ └── service/ # Business Logic (JWT tokenizing, storage, auth)
│ ├── src/main/resources/
│ │ ├── application.properties # Primary configurations
│ │ └── application-local.properties # Local dev parameters (profile: local)
│ ├── pom.xml # Maven dependency tree
│ └── mvnw / mvnw.cmd # Maven wrapper scripts
│
├── frontend/ # React TypeScript Client
│ ├── src/
│ │ ├── api/ # Axios HTTP instance & endpoints wrapper
│ │ ├── components/ # Reusable UI widgets & admin layouts
│ │ ├── hooks/ # React Hooks (e.g., job polling helpers)
│ │ ├── pages/ # Views (Dashboard, Admin, Explorer, Search, Auth)
│ │ ├── index.css # Global styling rules & Outfit typeface variables
│ │ └── App.tsx # React client routes
│ ├── package.json # Node.js configurations
│ └── vite.config.ts # Vite build settings
│
├── dashboard_module_spec.md # Technical specification for main dashboard
└── ui-guidelines.md # Theme styles and aesthetic design system tokens
The Supabase PostgreSQL database contains the following critical tables:
- users: Contains authenticated user records (supabase_uid maps to incoming tokens, is_active blocks suspended users).
- roles: User role configuration (e.g., USER, ADMIN).
- datasets: Metadata container for groups of images (average_similarity tracks CLIP alignment, total_items tracks size).
- images: Represents uploaded image items (file_path points to Supabase storage, status flags the pipeline stage).
- captions: Holds generated or edited captions, along with similarity score, model name, and BLEU/METEOR/CIDEr metric scores.
- dataset_items: Many-to-many junction table connecting datasets, images, and captions.
- captioning_jobs: Job queue record tracking batch captioning progress and task status (QUEUED, PROCESSING, COMPLETE, FAILED).
- feedback: Stores bug reports, rating data, and feature requests.
- docs / doc_categories / doc_tags: Store system documentation pages managed by administrators.
Note on Migrations: Most tables are generated automatically by Hibernate or managed by Flyway. Be aware that with spring.jpa.hibernate.ddl-auto=validate, Hibernate only checks the existing schema against the entity mappings and does not create or alter tables, so a missing table causes startup to fail. In particular, the captioning_jobs table and its associated Row-Level Security (RLS) policies must be created manually in the Supabase SQL Editor before running the backend. See dashboard_module_spec.md for the exact SQL commands.
Make sure your development machine has the following tools installed:
- Java Development Kit (JDK) 17
- Node.js 18+ & npm
- Python 3.10+ (with pip and virtualenv)
- A Supabase Account with an active project and storage bucket.
- Open a terminal and navigate to the ai-service directory:
    cd ai-service
- Create and activate a Python virtual environment:
    python -m venv venv
    # On Windows:
    .\venv\Scripts\activate
    # On macOS/Linux:
    source venv/bin/activate
- Install dependencies:
    pip install -r requirements.txt
- Place the required model checkpoints inside the ai-service/models/ folder:
  - caption_model_baseline.pth
  - caption_model.pth
  - ViT_1_1_model.pt / ViT_1_4_model.pt
- Create a .env file in the ai-service directory:
    HF_TOKEN=your_huggingface_access_token
- Start the FastAPI development server:
    uvicorn main:app --host 127.0.0.1 --port 8000 --reload
  The service will start running on http://127.0.0.1:8000.
- Navigate to the backend directory:
    cd backend
- Create your local configuration file by copying the template:
    cp .env.example .env
- Open .env and fill in your Supabase project properties. Make sure to use the Session Pooler (port 5432) to allow Hibernate schema validation:
    DB_URL=jdbc:postgresql://<your-supabase-db-host>:5432/postgres?sslmode=require
    DB_USERNAME=postgres.<your-project-ref>
    DB_PASSWORD=your_supabase_db_password
    SUPABASE_URL=https://<your-project-ref>.supabase.co
    SUPABASE_SERVICE_KEY=your_supabase_service_role_key
    JWT_SECRET=your_base64_encoded_jwt_secret_key
- Build the Spring Boot application using Maven:
    ./mvnw clean install
- Run the application with the local profile enabled:
    ./mvnw spring-boot:run -Dspring-boot.run.profiles=local
  The backend API will start running on http://localhost:8080.
- Navigate to the frontend directory:
    cd frontend
- Install the package dependencies:
    npm install
- Create a .env file in the frontend directory:
    VITE_SPRING_BOOT_URL=http://localhost:8080
- Start the Vite local development server:
    npm run dev
  Open your browser and navigate to the local address displayed (usually http://localhost:5173).
VisionCaption includes three built-in model variants to balance speed and accuracy:
- baseline_model: A fast, lightweight Encoder-Decoder model combining a CNN visual encoder (ResNet/Inception) with an RNN (LSTM) text decoder.
- caption_model: An advanced architecture pairing a transformer visual encoder with a GPT-2 decoder. It generates multiple candidate captions and selects the best one using CLIP evaluation.
- vit_model (vit_1_1 / vit_1_4): A heavyweight architecture pairing a Vision Transformer (ViT) encoder with a LLaMA-based language model. It offers maximum precision and uses CLIP voting for final selection.
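Selecting the best candidate by CLIP score comes down to ranking captions by the cosine similarity between the image embedding and each caption embedding. The sketch below shows only that ranking step on plain Python lists; it assumes the embeddings have already been produced by CLIP, and `cosine_similarity` and `pick_best_caption` are hypothetical names rather than functions from clip_evaluator.py.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as "no alignment"
    return dot / (norm_a * norm_b)


def pick_best_caption(
    image_emb: Sequence[float],
    candidates: dict[str, Sequence[float]],
) -> tuple[str, float]:
    """Return the candidate caption whose embedding best matches the image."""
    if not candidates:
        raise ValueError("no candidate captions to rank")
    scored = [(cosine_similarity(image_emb, emb), text) for text, emb in candidates.items()]
    score, text = max(scored)
    return text, score
```

In the real service the embeddings would be tensors and the scoring would run on the GPU in a batch; the ranking logic stays the same.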
Deep learning models consume significant VRAM. To run multiple heavy architectures on consumer-grade hardware without causing Out-of-Memory (OOM) crashes, the CaptionService implements an eviction system:
- Shared CLIP Resident: The CLIP model is relatively small (~600MB) and is kept permanently loaded in VRAM for fast scoring and ranking.
- Lazy Loading: Model weights are kept on disk and loaded into VRAM only when requested.
- Eviction Cycle: Before swapping to a new variant, the VRAM manager severs all references, runs double Python garbage collection (gc.collect()), clears the PyTorch CUDA cache (torch.cuda.empty_cache()), and synchronizes the GPU (torch.cuda.synchronize()).
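The lazy-loading and eviction steps above can be sketched as a single-slot model holder. This is an illustration under assumptions, not the actual CaptionService code: `ModelSlot` and its loader callables are hypothetical, and `torch` is imported optionally so the sketch also runs on machines without PyTorch or a GPU.

```python
import gc
from typing import Any, Callable, Optional

try:  # torch is optional here so the sketch also runs on CPU-only machines
    import torch
except ImportError:
    torch = None


class ModelSlot:
    """Holds at most one heavy captioning model at a time (hypothetical helper)."""

    def __init__(self, loaders: dict[str, Callable[[], Any]]):
        self._loaders = loaders  # variant name -> function that loads weights from disk
        self._name: Optional[str] = None
        self._model: Any = None

    def get(self, name: str) -> Any:
        """Return the requested variant, evicting the current one if it differs."""
        if name == self._name:
            return self._model  # already resident, no swap needed
        if name not in self._loaders:
            raise KeyError(name)
        self._evict()
        self._model = self._loaders[name]()  # lazy load on first request
        self._name = name
        return self._model

    def _evict(self) -> None:
        # Sever references first so the garbage collector can reclaim the weights.
        self._model = None
        self._name = None
        gc.collect()
        gc.collect()
        if torch is not None and torch.cuda.is_available():
            torch.cuda.empty_cache()   # release cached blocks back to the driver
            torch.cuda.synchronize()   # wait for pending GPU work to finish
```

Note that `torch.cuda.empty_cache()` can only release memory that no live tensor still references, which is why dropping the references and collecting garbage come first. Callers must also let go of any model reference they hold for the eviction to be effective.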
The AI pipeline includes a real-time moderation engine that runs immediately after caption generation:
- LowAlignmentRule: Checks the cosine similarity score computed by CLIP. If the similarity between the image embeddings and the caption text is below 0.68, the caption is flagged as low alignment.
- VocabRule: Scans the text using regex rules and matching lists (defined in domain_vocab.py) to block and flag problematic, offensive, or disallowed vocabulary.
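The two rules can be sketched as one evaluation function. Only the 0.68 threshold comes from the description above; `FlagResult`, `evaluate_caption`, the reason labels, and the placeholder pattern are assumptions, since the real lists in domain_vocab.py and the classes in rules.py are not shown here.

```python
import re
from dataclasses import dataclass, field

LOW_ALIGNMENT_THRESHOLD = 0.68  # threshold quoted in the rule description

# Placeholder pattern only; the real vocabulary lives in domain_vocab.py.
BLOCKED_PATTERNS = [re.compile(r"\bblockedword\b", re.IGNORECASE)]


@dataclass
class FlagResult:
    """Hypothetical result object: whether a caption is flagged, and why."""
    is_flagged: bool
    reasons: list[str] = field(default_factory=list)


def evaluate_caption(text: str, similarity: float, patterns=BLOCKED_PATTERNS) -> FlagResult:
    """Apply the low-alignment and vocabulary checks to one caption."""
    reasons: list[str] = []
    if similarity < LOW_ALIGNMENT_THRESHOLD:  # strictly below the threshold
        reasons.append("low_alignment")
    if any(p.search(text) for p in patterns):
        reasons.append("vocab")
    return FlagResult(is_flagged=bool(reasons), reasons=reasons)
```

For example, `evaluate_caption("an oil painting of a sunset over the valley", 0.8412)` is not flagged, which matches the `isFlagged: false` value in the sample callback payload.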
The backend dispatches jobs asynchronously to prevent HTTP timeouts.
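The accept-then-callback pattern behind this dispatch can be sketched as follows: the receiver acknowledges the job at once, does the work in the background, and reports the outcome through a callback. A plain thread stands in for FastAPI's background tasks here. `accept_job`, the injected `run_inference` and `send_callback` callables, the acknowledgement shape, and the "FAILED" callback status are assumptions; only the "SUCCESS" body mirrors the callback payload documented in this section.

```python
import threading
from typing import Any, Callable


def accept_job(
    payload: dict[str, Any],
    run_inference: Callable[[dict[str, Any]], list[dict[str, Any]]],
    send_callback: Callable[[dict[str, Any]], None],
) -> tuple[dict[str, Any], threading.Thread]:
    """Acknowledge a job immediately and process it on a background thread."""
    job_id = payload["jobId"]

    def worker() -> None:
        try:
            results = run_inference(payload)
            body = {"jobId": job_id, "status": "SUCCESS",
                    "results": results, "errorMessage": None}
        except Exception as exc:  # report the failure instead of dying silently
            body = {"jobId": job_id, "status": "FAILED",
                    "results": [], "errorMessage": str(exc)}
        send_callback(body)  # stands in for POST /api/jobs/{id}/callback

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    # The caller gets this acknowledgement without waiting for inference.
    return {"jobId": job_id, "accepted": True}, thread
```

Because the acknowledgement returns before inference starts, the original HTTP request finishes quickly regardless of how long captioning takes, and the job status is updated later when the callback arrives.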
POST http://127.0.0.1:8000/api/jobs/process
{
"jobId": 123,
"userId": 42,
"datasetName": "Museum Gallery Batch",
"datasetDescription": "Paintings from the gallery",
"modelVariant": "caption_model",
"temperature": 1.0,
"maxLength": 50,
"minLength": 5,
"numBeams": 4,
"repetitionPenalty": 1.0,
"topP": 0.9,
"images": [
{
"id": 1001,
"storageUrl": "https://mztbiewiqjnairxnurfk.supabase.co/storage/v1/object/public/images/42/painting.jpg"
}
]
}

POST http://127.0.0.1:8080/api/jobs/123/callback
{
"jobId": 123,
"status": "SUCCESS",
"results": [
{
"imageId": 1001,
"captionText": "an oil painting of a sunset over the valley",
"similarityScore": 0.8412,
"isFlagged": false,
"bleu1": null,
"bleu2": null,
"bleu3": null,
"bleu4": null,
"meteor": null,
"cider": null,
"modelName": "caption_model",
"modelVersion": "1.0"
}
],
"errorMessage": null
}

VisionCaption features a premium dark theme. Developers should follow these color and design rules when updating styles in .css modules:
- Base Background: #110F17 (used for standard views and full layouts).
- Layered Surface Panels: #1F1E29 (used for cards, inputs, and modular panels).
- Structured Borders: #28272F (subtle separators) and #3E3E47 (interactive outlines).
- Interactive Highlights:
  - Primary Accent: #194BFF (used for main actions, sliders, and toggles).
  - Secondary Accent: #D8EE10 (used for badges, analytics values, and selected tags).
- Alert States:
  - Success Indicators: #89EB79
  - Errors / Deactivation alerts: #E84A34

- Main Headers: 32px Semibold, Primary Color (#F6F6F6)
- Sub-Headers: 20px Medium, Secondary Color (#C5C4C7)
- Large Metric Displays: 38px Medium, Lime Accent (#D8EE10) or Blue (#194BFF)
- Body / Descriptions: 16px or 18px Medium, Secondary Color (#C5C4C7)
- Metadata & Label Tags: 14px Regular, Muted Gray (#98979D)
- Hints / Placeholders: 12px Regular
Build the static bundle for production:
    cd frontend
    npm run build
Deploy the generated dist/ directory to any static host (e.g., Vercel, Netlify, or Nginx).

Package the application into a standalone JAR file:
    cd backend
    ./mvnw clean package -DskipTests
Run the JAR on your production server:
    java -jar target/backend-0.0.1-SNAPSHOT.jar --spring.profiles.active=prod

For production, use an ASGI server like Uvicorn managed by Gunicorn:
    cd ai-service
    gunicorn main:app -w 4 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000
Ensure that your production server has adequate GPU VRAM to handle model swapping. Each Gunicorn worker is a separate process that loads its own copy of the models, so lower the -w value if VRAM is limited.
This project is licensed under the MIT License. See the LICENSE file in the root directory for details.
Contributions are welcome! Please follow these steps:
- Fork the repository.
- Create a new branch (git checkout -b feature/your-feature-name).
- Commit your changes (git commit -m 'Add some feature').
- Push to the branch (git push origin feature/your-feature-name).
- Open a Pull Request.
Please adhere to the coding standards and UI guidelines (referenced in ui-guidelines.md).
