Doc-VLMs-v2-Localization

A comprehensive multi-modal AI application that combines document analysis, optical character recognition (OCR), video understanding, and object detection capabilities using state-of-the-art vision-language models.

Features

Core Capabilities

  • Document Analysis: Extract and convert document content to structured formats (text, tables, markdown)
  • OCR Processing: Advanced optical character recognition for various document types
  • Video Understanding: Analyze and describe video content with temporal awareness
  • Object Detection: Locate and identify objects in images with bounding box annotations
  • Multi-Model Support: Choose from four specialized vision-language models

Supported Models

  1. Camel-Doc-OCR-062825: Fine-tuned Qwen2.5-VL model optimized for document retrieval and content extraction
  2. OCRFlux-3B: Specialized 3B parameter model for OCR tasks with high accuracy
  3. ViLaSR-7B: Advanced spatial reasoning model with visual drawing capabilities
  4. ShotVL-7B: Cinematic language understanding model trained on high-quality video datasets

Installation

Prerequisites

  • Python 3.8+
  • CUDA-compatible GPU (recommended)
  • Git

Setup

# Clone the repository
git clone https://github.com/PRITHIVSAKTHIUR/Doc-VLMs-v2-Localization.git
cd Doc-VLMs-v2-Localization

# Install dependencies
pip install -r requirements.txt

Required Dependencies

gradio
spaces
torch
numpy
pillow
opencv-python
transformers
qwen-vl-utils

Usage

Running the Application

python app.py

The application will launch a Gradio interface accessible through your web browser.

Interface Overview

Image Inference Tab

  • Upload images for analysis
  • Query the model with natural language
  • Get structured outputs including text extraction and document conversion

Video Inference Tab

  • Upload video files for analysis
  • Generate detailed descriptions of video content
  • Temporal understanding with frame-by-frame analysis

Object Detection Tab

  • Upload images for object localization
  • Specify objects to detect using natural language
  • View annotated images with bounding boxes
  • Get precise coordinate information

Advanced Configuration

  • Max New Tokens: Control response length (1-2048)
  • Temperature: Adjust creativity/randomness (0.1-4.0)
  • Top-p: Nucleus sampling parameter (0.05-1.0)
  • Top-k: Token selection threshold (1-1000)
  • Repetition Penalty: Prevent repetitive outputs (1.0-2.0)
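
These sliders map naturally onto Hugging Face generation kwargs. A minimal sketch of that mapping, assuming a hypothetical `build_generation_kwargs` helper (not from the repository) that clamps each value to the slider ranges listed above:

```python
# Hypothetical helper mapping the UI sliders to Hugging Face generation
# kwargs; the clamping ranges mirror the slider bounds listed above.

def clamp(value, lo, hi):
    """Keep a slider value inside its allowed range."""
    return max(lo, min(hi, value))

def build_generation_kwargs(max_new_tokens=1024, temperature=0.6,
                            top_p=0.9, top_k=50, repetition_penalty=1.2):
    return {
        "max_new_tokens": int(clamp(max_new_tokens, 1, 2048)),
        "temperature": clamp(temperature, 0.1, 4.0),
        "top_p": clamp(top_p, 0.05, 1.0),
        "top_k": int(clamp(top_k, 1, 1000)),
        "repetition_penalty": clamp(repetition_penalty, 1.0, 2.0),
        "do_sample": True,  # sampling must be enabled for temperature/top-p/top-k to apply
    }
```

The resulting dict can be passed straight through to `model.generate(**kwargs)` alongside the processed inputs.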

API Reference

Core Functions

generate_image(model_name, text, image, **kwargs)

Processes an image input with the selected model

  • Parameters: Model selection, query text, PIL image, generation parameters
  • Returns: Streaming text response

generate_video(model_name, text, video_path, **kwargs)

Analyze video content with temporal understanding

  • Parameters: Model selection, query text, video file path, generation parameters
  • Returns: Streaming analysis response

run_example(image, text_input, system_prompt)

Perform object detection with bounding box output

  • Parameters: PIL image, detection query, system prompt
  • Returns: Detection results, parsed coordinates, annotated image
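
The detection results have to be parsed out of the model's raw text before they can be drawn. A hedged sketch of that step, assuming the model returns a JSON list of `{"bbox_2d": [x1, y1, x2, y2], "label": "..."}` objects (a common Qwen2.5-VL convention; the exact format depends on the system prompt):

```python
import json
import re

# Sketch only: assumes detections arrive as a JSON array of objects with
# "bbox_2d" and "label" keys, possibly wrapped in surrounding prose or a
# code fence. The app's actual parsing logic may differ.

def parse_detections(raw_text):
    """Extract (label, box) pairs from the model's raw response."""
    match = re.search(r"\[.*\]", raw_text, re.DOTALL)  # grab the JSON array
    if not match:
        return []
    items = json.loads(match.group(0))
    return [(item["label"], item["bbox_2d"]) for item in items]
```

For example, `parse_detections('[{"bbox_2d": [10, 20, 100, 200], "label": "car"}]')` yields `[("car", [10, 20, 100, 200])]`.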

Helper Functions

downsample_video(video_path)

Extract representative frames from video files

  • Parameters: Path to video file
  • Returns: List of PIL images with timestamps
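
The sampling logic can be sketched in isolation. Assuming 10 evenly spaced frames per video (as noted under Performance Notes), a hypothetical index selector might look like this; the real helper additionally decodes the frames with OpenCV:

```python
# Sketch of the frame-sampling arithmetic only: pick ~10 evenly spaced
# frame indices and pair each with its timestamp. Decoding the actual
# frames (e.g. via cv2.VideoCapture) is omitted here.

def sample_frame_indices(total_frames, fps, num_frames=10):
    """Return (frame_index, timestamp_seconds) pairs, evenly spaced."""
    if total_frames <= 0 or fps <= 0:
        return []
    step = max(total_frames // num_frames, 1)
    indices = list(range(0, total_frames, step))[:num_frames]
    return [(i, round(i / fps, 2)) for i in indices]
```

For a 10-second clip at 30 fps (300 frames), this selects one frame per second.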

rescale_bounding_boxes(boxes, width, height)

Convert normalized coordinates to image dimensions

  • Parameters: Bounding box coordinates, image dimensions
  • Returns: Rescaled coordinate arrays
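
A minimal sketch of the rescaling, assuming boxes come back in the model's normalized 512x512 space (see Performance Notes) as `[x1, y1, x2, y2]` lists; the repository's implementation may differ in details:

```python
# Map boxes from the normalized 512x512 grid to actual image pixels.
# Assumption: each box is [x1, y1, x2, y2] in the 0-512 range.

def rescale_bounding_boxes(boxes, width, height, norm=512):
    """Scale normalized box coordinates to the image's pixel dimensions."""
    scale_x, scale_y = width / norm, height / norm
    return [
        [round(x1 * scale_x), round(y1 * scale_y),
         round(x2 * scale_x), round(y2 * scale_y)]
        for x1, y1, x2, y2 in boxes
    ]
```

The rescaled boxes can then be drawn directly onto the original image for the annotated output.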

Model Information

Camel-Doc-OCR-062825

  • Base Model: Qwen2.5-VL-7B-Instruct
  • Specialization: Document comprehension and OCR
  • Use Cases: Text extraction, table conversion, document analysis

OCRFlux-3B

  • Base Model: Qwen2.5-VL-3B-Instruct
  • Specialization: Optical character recognition
  • Use Cases: Text recognition, document digitization

ViLaSR-7B

  • Architecture: Advanced spatial reasoning with visual drawing operations
  • Specialization: Visual drawing and spatial understanding
  • Use Cases: Complex visual reasoning tasks

ShotVL-7B

  • Base Model: Qwen2.5-VL-7B-Instruct
  • Specialization: Cinematic content understanding
  • Use Cases: Video analysis, shot detection, narrative understanding

Examples

Document Analysis

# Query: "convert this page to doc [text] precisely for markdown"
# Input: Document image
# Output: Structured markdown format

Object Detection

# Query: "detect red and yellow cars"
# Input: Street scene image
# Output: Bounding boxes around detected vehicles

Video Understanding

# Query: "explain the ad video in detail"
# Input: Advertisement video file
# Output: Comprehensive video content analysis
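
For video queries, the sampled frames are combined with the text prompt into a chat message. A hedged sketch of one plausible packing, interleaving each frame with its timestamp in the Qwen2.5-VL chat-message style (the app's exact prompt construction may differ); `frames` is a list of `(pil_image, timestamp)` pairs from the downsampling step:

```python
# Sketch: build a single user message whose content interleaves the query,
# per-frame timestamp markers, and the frames themselves. The dict layout
# follows the common Qwen2.5-VL messages convention; this is an assumption,
# not the repository's verbatim code.

def build_video_messages(query, frames):
    content = [{"type": "text", "text": query}]
    for image, ts in frames:
        content.append({"type": "text", "text": f"Frame at {ts}s:"})
        content.append({"type": "image", "image": image})
    return [{"role": "user", "content": content}]
```

The resulting list is what a processor's chat template (e.g. via `qwen-vl-utils`) would typically consume.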

Performance Notes

  • GPU acceleration recommended for optimal performance
  • Video processing involves frame sampling (10 frames per video)
  • Object detection uses normalized 512x512 coordinate system
  • Streaming responses provide real-time feedback

Limitations

  • Video inference performance may vary across models
  • GPU memory requirements scale with model size
  • Processing time depends on input complexity and hardware

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

License

This project is licensed under the MIT License. See LICENSE file for details.

Acknowledgments

  • Built on Hugging Face Transformers
  • Powered by Gradio for the web interface
  • Utilizes Qwen vision-language model architecture
  • Integrated with Spaces GPU acceleration

Support

For issues and questions:

  • Open an issue on GitHub
  • Check the Hugging Face model documentation
  • Review the examples and documentation

About

Doc-VLMs-v2-Localization is a demo app for the Camel-Doc-OCR-062825 model, fine-tuned from Qwen2.5-VL-7B-Instruct for advanced document retrieval, extraction, and analysis. It focuses on enhanced document understanding and also integrates the other notable Hugging Face models listed above.
