AutoCap - Domain-Specific Image Captioning & Validation System

AutoCap is a premium, enterprise-grade multimodal AI platform for creating and managing datasets. It lets machine learning researchers and data engineers import images in batches, run automated image captioning through several deep learning architectures, score the semantic alignment between each image and its caption, moderate content with a customizable flagging engine, and package curated image-caption pairs into deployable datasets.

Table of Contents

1. Key Features & User Flows
2. System Architecture & Data Flow
3. Technology Stack
4. Project Directory Structure
5. Database Schema Overview
6. Getting Started (Local Setup)
7. AI Core: Model Hub, VRAM Management & Flagging Engine
8. API Integration & Callbacks
9. UI/UX Design System Guidelines
10. Production Deployment
11. License & Contributing

1. Key Features & User Flows

AutoCap splits its workflows into two journeys: the Standard User Portal and the Admin Control Panel.

👤 Standard User Journey

Main Dashboard: Drag and drop or select up to 50 images (max 10MB each, JPEG/PNG/WebP). Configure the caption generation parameters (temperature, length bounds, beam count, repetition penalty).
Asynchronous Pipeline Tracker: Watch pipeline progress in real time through sequential steps: UPLOADING ➔ QUEUED ➔ PROCESSING ➔ GENERATING ➔ SCORING ➔ COMPLETE.
Dataset Explorer: Deep-dive into completed datasets. View image thumbnails alongside generated captions, edit captions manually to fix errors, see detailed evaluation metrics, and download the entire dataset as a zipped archive containing the image files, captions.json, and captions.csv.
Global Search: Search and filter public and personal datasets by category, metadata tags, model variant, date, and semantic similarity range.
Feedback Portal: Rate generated captions, report bugs, and submit feature requests.
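
The upload limits listed for the Main Dashboard can be expressed as a small pre-flight check. The following is a minimal sketch, not the project's actual validation code: `UploadCandidate` and `validate_batch` are hypothetical names, formats are checked by file extension only, and 10MB is assumed to mean 10 * 1024 * 1024 bytes.

```python
from dataclasses import dataclass

MAX_IMAGES = 50                       # batch limit stated above
MAX_BYTES = 10 * 1024 * 1024          # 10MB per image (assumed binary megabytes)
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


@dataclass
class UploadCandidate:
    """Hypothetical stand-in for one selected file."""
    name: str
    size_bytes: int


def validate_batch(files):
    """Return a list of problems; an empty list means the batch is acceptable."""
    problems = []
    if not files:
        problems.append("no images selected")
    if len(files) > MAX_IMAGES:
        problems.append(f"too many images: {len(files)} > {MAX_IMAGES}")
    for f in files:
        dot = f.name.rfind(".")
        ext = f.name[dot:].lower() if dot != -1 else ""
        if ext not in ALLOWED_EXTENSIONS:
            problems.append(f"{f.name}: unsupported format")
        if f.size_bytes > MAX_BYTES:
            problems.append(f"{f.name}: exceeds 10MB")
    return problems
```

Returning a list of problems rather than raising on the first one lets a client report every rejected file in a single pass.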

🔑 Admin Journey

Insights & Statistics Dashboard: Monitor system-wide KPIs, including active users, total datasets, images uploaded, flagging rates, and performance statistics across different model variants.
User Management: Access accounts, assign roles (USER / ADMIN), and lock or unlock access.
Metadata Configuration: Perform full CRUD operations on dataset Categories, Tags, and Tokenizer configurations.
Feedback & Content Moderation Hub: Audit flagged caption records, review user feedback, and manage system tickets (New ➔ Reviewed ➔ Closed).

2. System Architecture & Data Flow

AutoCap uses a modular, decoupled architecture. The React frontend talks only to the Spring Boot orchestration backend. Spring Boot delegates long-running AI inference tasks to the FastAPI service and handles database state, storage integrations, and JWT authentication.

sequenceDiagram
    autonumber
    actor User as User Browser
    participant FE as React Frontend
    participant BE as Spring Boot (8080)
    participant DB as PostgreSQL (Supabase)
    participant ST as Supabase Storage
    participant AI as FastAPI Service (8000)

    User->>FE: Upload Images & Set BLIP Config
    FE->>BE: POST /api/images/upload (FormData + JWT)
    Note over BE: Validate payload & check authorization
    BE->>ST: Upload image files to storage bucket
    ST-->>BE: Return public storage URLs
    BE->>DB: INSERT images ('uploaded'), datasets, & captioning_jobs ('QUEUED')
    BE->>AI: Async POST /api/jobs/process (jobId, imageUrls, blipConfig)
    BE-->>FE: Return 202 Accepted (jobId)

    loop Status Polling (Every 2.5s)
        FE->>BE: GET /api/jobs/{id}/status
        BE->>DB: SELECT job status
        DB-->>BE: Return current status
        BE-->>FE: Return JSON status tracker
    end

    Note over AI: VRAM Check & Swap Model
    loop Inference Pipeline
        AI->>AI: Generate candidate captions
        AI->>AI: CLIP cosine similarity scoring
        AI->>AI: Run content flagging checks
    end

    AI->>BE: POST /api/jobs/{id}/callback (captions + similarity scores + flagging)
    Note over BE: Update DB & finalize statistics
    BE->>DB: INSERT captions & UPDATE job status to 'COMPLETE'

    User->>FE: View explorer page / Trigger download
    FE->>BE: GET /api/datasets/{id}/download
    BE-->>FE: Stream ZIP (images + captions.csv + captions.json)
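
The status polling loop in the diagram can be sketched as follows. This is an illustration under stated assumptions rather than the frontend's actual hook: `fetch_status` is a hypothetical callable that wraps GET /api/jobs/{id}/status and returns the status string (the response's JSON shape is not documented here), and the `max_attempts` cap is an added safeguard, not something the diagram specifies.

```python
import time

TERMINAL_STATUSES = {"COMPLETE", "FAILED"}


def poll_job(fetch_status, job_id, interval_s=2.5, max_attempts=240, sleep=time.sleep):
    """Poll until the job reaches a terminal status.

    Returns the distinct statuses observed, in order. fetch_status and sleep
    are injected so the loop can be exercised without a network or real delays.
    """
    seen = []
    for _ in range(max_attempts):
        status = fetch_status(job_id)
        if not seen or seen[-1] != status:
            seen.append(status)          # record each status change once
        if status in TERMINAL_STATUSES:
            return seen
        sleep(interval_s)                # 2.5s between polls, as in the diagram
    raise TimeoutError(f"job {job_id} did not finish after {max_attempts} polls")
```

Capping the number of attempts keeps a client from polling forever if a job is lost before it reaches COMPLETE or FAILED.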

3. Technology Stack

Frontend

Core: React 18, TypeScript, Vite.
Routing & State: React Router DOM (v6), Axios for API requests with cookie-based authorization interceptors.
Styling: Custom vanilla CSS with variable-driven layouts and a premium dark theme.

Backend

Core: Java 17, Spring Boot 3.x, Spring Security (JWT authentication and CORS control).
Persistence: Spring Data JPA, Hibernate, PostgreSQL driver.
File Processing: Spring Multipart (with large stream swallow overrides).

Database & Storage

Hosting: Supabase.
Database: PostgreSQL (with Row-Level Security enabled).
Storage: Supabase Storage bucket for securely hosting dataset images.

AI / ML Service

Core: Python 3.10+, FastAPI, Uvicorn, PyTorch, Pillow.
Deep Learning Frameworks: HuggingFace Transformers, HuggingFace Accelerate, bitsandbytes (for model weights quantization).
Evaluation Metrics: CLIP (Contrastive Language-Image Pretraining) model for semantic similarity calculation.

4. Project Directory Structure

AutoCap-Application/
├── ai-service/                       # FastAPI AI Inference microservice
│   ├── flagging/                     # Content moderation & auto-flagging engine
│   │   ├── domain_vocab.py           # Problematic keyword lists & regex filters
│   │   ├── flag_engine.py            # Moderation supervisor
│   │   └── rules.py                  # Evaluation rules (vocab & CLIP score bounds)
│   ├── models/                       # Deep learning model architectures
│   │   ├── baseline.py               # EncoderCNN + DecoderRNN
│   │   ├── caption.py                # GPT-2 caption decoder
│   │   ├── clip_evaluator.py         # CLIP model scorer & ranker
│   │   └── vit_model.py              # Vision Transformer + LLaMA architecture
│   ├── services/
│   │   └── caption_service.py        # Core generation orchestration & VRAM manager
│   ├── main.py                       # FastAPI entrypoint, HTTP router & background tasks
│   └── requirements.txt              # Python packages list
│
├── backend/                          # Spring Boot API Gateway & Orchestrator
│   ├── src/main/java/com/autocap/backend/
│   │   ├── config/                   # Web Security, CORS, Password Encryptor
│   │   ├── controller/               # REST Endpoints (Auth, Dataset, Admin, Feedback)
│   │   ├── dto/                      # Data Transfer Objects (requests & callbacks)
│   │   ├── entity/                   # JPA Database Entities
│   │   ├── repository/               # Spring Data JPA Repository Interfaces
│   │   └── service/                  # Business Logic (JWT tokenizing, storage, auth)
│   ├── src/main/resources/
│   │   ├── application.properties    # Primary configurations
│   │   └── application-local.properties # Local dev parameters (profile: local)
│   ├── pom.xml                       # Maven dependency tree
│   └── mvnw / mvnw.cmd               # Maven wrapper scripts
│
├── frontend/                         # React TypeScript Client
│   ├── src/
│   │   ├── api/                      # Axios HTTP instance & endpoints wrapper
│   │   ├── components/               # Reusable UI widgets & admin layouts
│   │   ├── hooks/                    # React Hooks (e.g., job polling helpers)
│   │   ├── pages/                    # Views (Dashboard, Admin, Explorer, Search, Auth)
│   │   ├── index.css                 # Global styling rules & Outfit typeface variables
│   │   └── App.tsx                   # React client routes
│   ├── package.json                  # Node.js configurations
│   └── vite.config.ts                # Vite build settings
│
├── dashboard_module_spec.md          # Technical specification for main dashboard
└── ui-guidelines.md                  # Theme styles and aesthetic design system tokens

5. Database Schema Overview

The Supabase PostgreSQL database contains the following critical tables:

users: Contains authenticated user records (supabase_uid maps to incoming tokens, is_active blocks suspended users).
roles: User role configuration (e.g., USER, ADMIN).
datasets: Metadata container for groups of images (average_similarity tracks CLIP alignment, total_items tracks size).
images: Represents uploaded image items (file_path points to Supabase storage, status flags pipeline stage).
captions: Holds generated or edited captions, along with similarity score, model name, and BLEU/METEOR/CIDEr metric scores.
dataset_items: Many-to-many junction table connecting datasets, images, and captions.
captioning_jobs: Job queue record tracking batch captioning progress and task status (QUEUED, PROCESSING, COMPLETE, FAILED).
feedback: Stores bug reports, rating data, and feature requests.
docs / doc_categories / doc_tags: Stores system documentation pages managed by administrators.
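
The four task statuses stored in captioning_jobs suggest a simple state machine. The sketch below is illustrative only: the status names come from the table description, but the allowed transitions and the `advance` helper are assumptions, not rules taken from the backend.

```python
from enum import Enum


class JobStatus(Enum):
    QUEUED = "QUEUED"
    PROCESSING = "PROCESSING"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"


# Assumed transitions: a job moves forward or fails, and terminal states are final.
ALLOWED_TRANSITIONS = {
    JobStatus.QUEUED: {JobStatus.PROCESSING, JobStatus.FAILED},
    JobStatus.PROCESSING: {JobStatus.COMPLETE, JobStatus.FAILED},
    JobStatus.COMPLETE: set(),
    JobStatus.FAILED: set(),
}


def advance(current, target):
    """Return target if the transition is allowed, otherwise raise ValueError."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Guarding transitions this way would, for example, stop a late callback from reopening a job that has already been marked FAILED.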

Note on Migrations: With spring.jpa.hibernate.ddl-auto=validate, Hibernate only checks at startup that the existing schema matches the JPA entities; it does not create or alter tables, so the tables must already exist (for example, created by Flyway migrations). In addition, the captioning_jobs table and its associated Row-Level Security (RLS) policies must be created manually in the Supabase SQL Editor before running the backend. See dashboard_module_spec.md for the exact SQL commands.

6. Getting Started (Local Setup)

Prerequisites

Make sure your development machine has the following tools installed:

Java Development Kit (JDK) 17
Node.js 18+ & npm
Python 3.10+ (with pip and virtualenv)
A Supabase Account with an active project and storage bucket.

1. AI/ML Service Setup

Open a terminal and navigate to the ai-service directory:
```
cd ai-service
```

Create and activate a Python virtual environment:

python -m venv venv
# On Windows:
.\venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

Install dependencies:
```
pip install -r requirements.txt
```
Place the required model checkpoints inside the ai-service/models/ folder:
- caption_model_baseline.pth
- caption_model.pth
- ViT_1_1_model.pt / ViT_1_4_model.pt
Create a .env file in the ai-service directory:
```
HF_TOKEN=your_huggingface_access_token
```
Start the FastAPI development server:
```
uvicorn main:app --host 127.0.0.1 --port 8000 --reload
```
The service will start running on http://127.0.0.1:8000.

2. Backend Service Setup

Navigate to the backend directory:
```
cd backend
```
Create your local configuration file by copying the template:
```
cp .env.example .env
```

Open .env and fill in your Supabase project properties. Make sure to use the Session Pooler (port 5432) to allow Hibernate schema validations:

DB_URL=jdbc:postgresql://<your-supabase-db-host>:5432/postgres?sslmode=require
DB_USERNAME=postgres.<your-project-ref>
DB_PASSWORD=your_supabase_db_password
SUPABASE_URL=https://<your-project-ref>.supabase.co
SUPABASE_SERVICE_KEY=your_supabase_service_role_key
JWT_SECRET=your_base64_encoded_jwt_secret_key
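
One way to produce a value for JWT_SECRET is to Base64-encode random bytes from the operating system. This is a minimal sketch, assuming a 32-byte (256-bit) key is suitable; this README does not state which signing algorithm the backend uses, so adjust the length to match it. `generate_jwt_secret` is a hypothetical helper name.

```python
import base64
import secrets


def generate_jwt_secret(num_bytes=32):
    """Return num_bytes of cryptographically secure randomness, Base64-encoded."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


if __name__ == "__main__":
    # Prints a line that can be pasted into the .env file.
    print(f"JWT_SECRET={generate_jwt_secret()}")
```

Generate a separate secret for each environment and keep it out of version control.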

Build the Spring Boot application using Maven:
```
./mvnw clean install
```
Run the application with the local profile enabled:
```
./mvnw spring-boot:run -Dspring-boot.run.profiles=local
```
The backend API will start running on http://localhost:8080.

3. Frontend Service Setup

Navigate to the frontend directory:
```
cd frontend
```
Install the package dependencies:
```
npm install
```

Create a .env file in the frontend directory:

VITE_SPRING_BOOT_URL=http://localhost:8080

Start the Vite local development server:
```
npm run dev
```
Open your browser and navigate to the local address displayed (usually http://localhost:5173).

7. AI Core: Model Hub, VRAM Management & Flagging Engine

Supported Model Architectures

AutoCap includes three built-in model variants to balance speed and accuracy:

baseline_model: A fast, lightweight Encoder-Decoder model combining a CNN visual encoder (ResNet/Inception) with an RNN (LSTM) text decoder.
caption_model: An advanced architecture pairing a transformer visual encoder with a GPT-2 decoder. It generates multiple candidate captions and selects the best one using CLIP evaluation.
vit_model (vit_1_1 / vit_1_4): A heavyweight architecture pairing a Vision Transformer (ViT) encoder with a LLaMA-based language model. It offers maximum precision and uses CLIP voting for final selection.
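
The CLIP-based selection used by caption_model and vit_model amounts to scoring each candidate caption against the image and keeping the highest score. The sketch below shows only that ranking step on plain lists of floats; in the real service the embeddings would come from the CLIP model, and `select_best_caption` is a hypothetical name rather than a function from clip_evaluator.py.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_best_caption(image_embedding, candidates):
    """candidates is a list of (caption_text, text_embedding) pairs.

    Returns (best_text, best_score) for the candidate most similar to the image.
    """
    if not candidates:
        raise ValueError("no candidate captions to rank")
    scored = [(cosine_similarity(image_embedding, emb), text) for text, emb in candidates]
    best_score, best_text = max(scored, key=lambda pair: pair[0])
    return best_text, best_score
```

The winning score can double as the similarity score reported for the caption, so the ranking pass and the alignment check can share one computation.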

Nuclear VRAM Eviction Mechanism

Deep learning models consume significant VRAM. To run multiple heavy architectures on consumer-grade hardware without causing Out-of-Memory (OOM) crashes, the CaptionService implements an eviction system:

Shared CLIP Resident: The CLIP model is relatively small (~600MB) and is kept permanently loaded in VRAM for fast scoring and ranking.
Lazy Loading: Model weights are kept on disk and loaded into VRAM only when requested.
Eviction Cycle: Before swapping to a new variant, the VRAM manager severs all references, runs double Python garbage collection (gc.collect()), clears the PyTorch CUDA cache (torch.cuda.empty_cache()), and synchronizes the GPU (torch.cuda.synchronize()).
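
The lazy loading and eviction cycle described above can be sketched as a single-slot model holder. This is an assumption-laden illustration, not the actual CaptionService code: `ModelSlot` and its loader callables are hypothetical, and PyTorch is imported optionally so the sketch also runs on machines without it.

```python
import gc

try:
    import torch  # optional: CUDA steps are skipped when PyTorch is unavailable
except ImportError:
    torch = None


class ModelSlot:
    """Keeps at most one heavy model resident at a time."""

    def __init__(self, loaders):
        self._loaders = loaders   # variant name -> zero-argument function returning a model
        self._name = None
        self._model = None

    def get(self, name):
        if name == self._name:
            return self._model    # already resident, no swap needed
        self.evict()              # free the previous variant before loading
        self._model = self._loaders[name]()   # lazy load on first request
        self._name = name
        return self._model

    def evict(self):
        self._model = None        # sever the slot's reference to the weights
        self._name = None
        gc.collect()              # double collection, as described above
        gc.collect()
        if torch is not None and torch.cuda.is_available():
            torch.cuda.empty_cache()   # release cached blocks held by the allocator
            torch.cuda.synchronize()   # wait for pending GPU work to finish
```

Memory is only reclaimed if nothing else still references the old model, so callers should avoid holding on to the object returned by `get` across a swap.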

Automated Flagging Rules

The AI pipeline includes a real-time moderation engine that runs immediately after caption generation:

LowAlignmentRule: Checks the cosine similarity score computed by CLIP. If the similarity between the image embedding and the caption text embedding is below 0.68, the caption is flagged as low alignment.
VocabRule: Scans the text using regex rules and matching lists (defined in domain_vocab.py) to block and flag problematic, offensive, or disallowed vocabulary.
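
The two rules can be sketched as plain functions that each return a reason string or None. This is a minimal illustration, assuming a simple rule interface: the 0.68 threshold comes from the description above, but the function names, the `reasons` field, and the placeholder pattern are hypothetical, and the real vocabulary lives in domain_vocab.py.

```python
import re

LOW_ALIGNMENT_THRESHOLD = 0.68   # from the LowAlignmentRule description

# Placeholder pattern for illustration; the real lists are defined in domain_vocab.py.
BLOCKED_PATTERNS = [re.compile(r"\bforbidden\b", re.IGNORECASE)]


def low_alignment_rule(caption, similarity):
    """Flag captions whose CLIP similarity falls below the threshold."""
    if similarity < LOW_ALIGNMENT_THRESHOLD:
        return f"low alignment: {similarity:.2f} < {LOW_ALIGNMENT_THRESHOLD}"
    return None


def vocab_rule(caption, similarity):
    """Flag captions that match any blocked pattern."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(caption):
            return f"blocked vocabulary: {pattern.pattern}"
    return None


RULES = [low_alignment_rule, vocab_rule]


def evaluate(caption, similarity):
    """Run every rule and collect the reasons from those that fired."""
    reasons = [r for r in (rule(caption, similarity) for rule in RULES) if r is not None]
    return {"isFlagged": bool(reasons), "reasons": reasons}
```

Running every rule instead of stopping at the first hit gives moderators the full list of reasons a caption was flagged.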

8. API Integration & Callbacks

The backend dispatches jobs asynchronously to prevent HTTP timeouts.

1. Spring Boot ➔ FastAPI dispatch

POST http://127.0.0.1:8000/api/jobs/process
{
  "jobId": 123,
  "userId": 42,
  "datasetName": "Museum Gallery Batch",
  "datasetDescription": "Paintings from the gallery",
  "modelVariant": "caption_model",
  "temperature": 1.0,
  "maxLength": 50,
  "minLength": 5,
  "numBeams": 4,
  "repetitionPenalty": 1.0,
  "topP": 0.9,
  "images": [
    {
      "id": 1001,
      "storageUrl": "https://mztbiewiqjnairxnurfk.supabase.co/storage/v1/object/public/images/42/painting.jpg"
    }
  ]
}
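
On the receiving side, the camelCase generation fields in this payload need to be mapped to whatever the model code expects. The sketch below maps them to the snake_case keyword names used by Hugging Face Transformers' `generate()`; whether the service forwards them this way is an assumption, and `generation_kwargs` is a hypothetical helper.

```python
# Payload field -> keyword name accepted by Hugging Face generate()
FIELD_TO_KWARG = {
    "temperature": "temperature",
    "maxLength": "max_length",
    "minLength": "min_length",
    "numBeams": "num_beams",
    "repetitionPenalty": "repetition_penalty",
    "topP": "top_p",
}


def generation_kwargs(payload):
    """Extract and rename the generation settings from a dispatch payload dict."""
    missing = [field for field in FIELD_TO_KWARG if field not in payload]
    if missing:
        raise ValueError(f"missing generation fields: {missing}")
    kwargs = {kwarg: payload[field] for field, kwarg in FIELD_TO_KWARG.items()}
    if kwargs["min_length"] > kwargs["max_length"]:
        raise ValueError("minLength must not exceed maxLength")
    return kwargs
```

In Transformers, `temperature` and `top_p` only influence the output when sampling is enabled (`do_sample=True`); with pure beam search they have no effect.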

2. FastAPI ➔ Spring Boot callback

POST http://127.0.0.1:8080/api/jobs/123/callback
{
  "jobId": 123,
  "status": "SUCCESS",
  "results": [
    {
      "imageId": 1001,
      "captionText": "an oil painting of a sunset over the valley",
      "similarityScore": 0.8412,
      "isFlagged": false,
      "bleu1": null,
      "bleu2": null,
      "bleu3": null,
      "bleu4": null,
      "meteor": null,
      "cider": null,
      "modelName": "caption_model",
      "modelVersion": "1.0"
    }
  ],
  "errorMessage": null
}
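
A callback body with these field names could be assembled as follows. This is a sketch under stated assumptions: `caption_result` and `build_callback` are hypothetical helpers, the reference-based metrics are left as None to mirror the example, and the "FAILED" status value for error callbacks is a guess, since only "SUCCESS" appears above.

```python
import json

METRIC_FIELDS = ("bleu1", "bleu2", "bleu3", "bleu4", "meteor", "cider")


def caption_result(image_id, text, similarity, flagged, model_name, model_version="1.0"):
    """Build one entry of the results array."""
    result = {
        "imageId": image_id,
        "captionText": text,
        "similarityScore": round(similarity, 4),
        "isFlagged": flagged,
    }
    for metric in METRIC_FIELDS:
        result[metric] = None            # null in the example payload
    result["modelName"] = model_name
    result["modelVersion"] = model_version
    return result


def build_callback(job_id, results=None, error=None):
    """Build the callback body; passing error produces a failure callback."""
    failed = error is not None
    return {
        "jobId": job_id,
        "status": "FAILED" if failed else "SUCCESS",   # "FAILED" is an assumed value
        "results": [] if failed else list(results or []),
        "errorMessage": error,
    }


if __name__ == "__main__":
    item = caption_result(1001, "an oil painting of a sunset over the valley",
                          0.8412, False, "caption_model")
    print(json.dumps(build_callback(123, [item]), indent=2))
```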

9. UI/UX Design System Guidelines

AutoCap features a premium dark theme. Developers should follow these color and design rules when updating styles in .css modules:

Harmony Color Palette

Base Background: #110F17 (used for standard views and full layouts).
Layered Surface Panels: #1F1E29 (used for cards, inputs, and modular panels).
Structured Borders: #28272F (subtle separators) and #3E3E47 (interactive outlines).
Interactive Highlights:
- Primary Accent: #194BFF (used for main actions, sliders, and toggles).
- Secondary Accent: #D8EE10 (used for badges, analytics values, and selected tags).
Alert States:
- Success Indicators: #89EB79
- Errors / Deactivation alerts: #E84A34

Typography Hierarchy (Outfit Typeface)

Main Headers: 32px Semibold, Primary Color (#F6F6F6)
Sub-Headers: 20px Medium, Secondary Color (#C5C4C7)
Large Metric Displays: 38px Medium, Lime Accent (#D8EE10) or Blue (#194BFF)
Body / Descriptions: 16px or 18px Medium, Secondary Color (#C5C4C7)
Metadata & Label Tags: 14px Regular, Muted Gray (#98979D)
Hints / Placeholders: 12px Regular

10. Production Deployment

Frontend (React/Vite)

Build the static bundle for production:

cd frontend
npm run build

Deploy the generated dist/ directory to any static host (e.g., Vercel, Netlify, or Nginx).

Backend (Spring Boot)

Package the application into a standalone JAR file:

cd backend
./mvnw clean package -DskipTests

Run the JAR on your production server:

java -jar target/backend-0.0.1-SNAPSHOT.jar --spring.profiles.active=prod

AI Service (FastAPI)

For production, use an ASGI server like Uvicorn managed by Gunicorn:

cd ai-service
gunicorn main:app -w 4 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000

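Ensure that your production server has enough GPU VRAM to handle model swapping. Each Gunicorn worker is a separate process, so if every worker loads its own copy of the models, -w 4 multiplies VRAM usage; lower the worker count on memory-constrained GPUs.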

11. License & Contributing

License

This project is licensed under the MIT License. See the LICENSE file in the root directory for details.

Contributing

Contributions are welcome! Please follow these steps:

Fork the repository.
Create a new branch (git checkout -b feature/your-feature-name).
Commit your changes (git commit -m 'Add some feature').
Push to the branch (git push origin feature/your-feature-name).
Open a Pull Request.

Please adhere to the coding standards and UI guidelines (referenced in ui-guidelines.md).
