A searchable transparency database for DOJ Epstein Files, providing public access to documents, emails, flight logs, and AI-powered search.
Live Site: epsteinsuite.com
- About
- Features
- Screenshots
- Tech Stack
- Getting Started
- Database Schema
- Data Pipeline
- Contributing
- License
- Acknowledgments
Epstein Suite is a web application that makes public records related to the Epstein case searchable and accessible. It combines traditional document management with modern AI-powered features to help researchers, journalists, and the public explore this important dataset.
After the DOJ released thousands of pages of Epstein-related documents, they were difficult to search and navigate. This project makes them:
- Searchable - Full-text search across all documents and OCR'd pages
- Organized - Entity extraction (people, organizations, locations)
- Interactive - AI-powered Q&A interface
- Transparent - Open source code, public mission
The live site at epsteinsuite.com currently indexes:
- 4,700+ documents from DOJ, FBI Vault, House Oversight
- Millions of OCR'd pages with full-text search
- Thousands of extracted entities (people, organizations, locations)
- Flight logs with geographic mapping
- Email threads with relationship analysis
- Full-Text Search - MySQL FULLTEXT search across documents and OCR'd pages (see the query sketch after this list)
- Entity Browser - Explore people, organizations, and locations
- Advanced Filters - Filter by source, date, file type, status
- Document Timeline - Chronological view of documents
- Ask AI - Natural language Q&A powered by OpenAI GPT-5-nano
- Document Summaries - AI-generated summaries for complex documents
- Entity Extraction - Automatic identification of key people/organizations
- Semantic Search - Vector embeddings for similarity-based search
- Flight Logs - Searchable flight manifests with map visualization
- Email Client - Thread view for email collections
- Photo Gallery - Media browser with metadata
- Network Graphs - Entity relationship visualizations with D3.js
- File-Based Caching - Fast page loads with intelligent cache invalidation
- Responsive Design - Mobile-first UI with TailwindCSS
- Admin Dashboard - Operations console for monitoring
- API Endpoints - JSON APIs for integrations
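For a sense of how the full-text feature maps onto MySQL, here is a minimal PDO sketch of a FULLTEXT query against the `documents` table. It assumes a FULLTEXT index over `(title, description)` and placeholder credentials; the production query and index definition live in the application code and `config/schema.sql`, so treat this as illustrative only.

```php
<?php
declare(strict_types=1);

// Minimal sketch of a FULLTEXT search over documents.
// Assumes a FULLTEXT index on (title, description); adjust to match
// the actual index definition in config/schema.sql.
$pdo = new PDO(
    'mysql:host=localhost;dbname=epstein_db;charset=utf8mb4',
    'root',
    'your_password',
    [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]
);

$stmt = $pdo->prepare(
    'SELECT id, title, description,
            MATCH(title, description) AGAINST(:q1 IN NATURAL LANGUAGE MODE) AS score
     FROM documents
     WHERE MATCH(title, description) AGAINST(:q2 IN NATURAL LANGUAGE MODE)
     ORDER BY score DESC
     LIMIT 20'
);
$stmt->execute(['q1' => 'flight logs', 'q2' => 'flight logs']);

foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    printf("%s (score %.2f)\n", $row['title'], (float) $row['score']);
}
```

The same pattern applies to per-page OCR search, swapping in the page table's FULLTEXT-indexed `ocr_text` column.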
Note: Add screenshots of your live site here once you've added them to the repo
- PHP 8.4 - Strict typing, PSR-12 coding standard
- MySQL 8.0 - InnoDB engine, FULLTEXT indexes, utf8mb4
- Apache/PHP-FPM - Production web server
- TailwindCSS 3.4 - Utility-first CSS framework
- Vanilla JavaScript - No framework dependencies
- D3.js - Network graph visualizations
- Leaflet.js - Flight log mapping
- OpenAI GPT-5-nano - Document summaries, entity extraction, Q&A
- text-embedding-3-small - Vector embeddings for semantic search
- PHP Vector Search - Cosine similarity computed in PHP (see the sketch after this list)
- Flat PHP Routing - No framework, direct file-to-URL mapping
- PDO Singleton - Centralized database access
- File-Based Cache - Simple, fast caching layer
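The "PHP Vector Search" item above refers to similarity scoring done in application code rather than in the database. As a rough sketch (not the production implementation), cosine similarity over two embedding arrays can be computed like this; how the embeddings are generated (e.g. with text-embedding-3-small) and where they are stored is outside the snippet.

```php
<?php
declare(strict_types=1);

/**
 * Cosine similarity between two embedding vectors, computed in plain PHP.
 *
 * @param float[] $a
 * @param float[] $b
 */
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;

    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }

    if ($normA === 0.0 || $normB === 0.0) {
        return 0.0;
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

// Usage: rank stored document embeddings against a query embedding.
$queryEmbedding = [0.12, -0.08, 0.33];         // stand-in values
$documents = [
    42 => [0.10, -0.05, 0.30],                 // document_id => embedding
    57 => [-0.40, 0.22, 0.01],
];

$scores = [];
foreach ($documents as $id => $embedding) {
    $scores[$id] = cosineSimilarity($queryEmbedding, $embedding);
}
arsort($scores); // highest similarity first
```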
- PHP 8.4 or higher
- MySQL 8.0 or higher
- Git
- (Optional) OpenAI API key for AI features
- Clone the repository

  ```bash
  git clone https://github.com/YOUR_USERNAME/epstein-suite.git
  cd epstein-suite
  ```

- Create database

  ```bash
  mysql -u root -p
  CREATE DATABASE epstein_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
  exit
  ```

- Import database schema

  ```bash
  mysql -u root -p epstein_db < config/schema.sql
  ```

- Configure environment

  ```bash
  cp .env.example .env
  # Edit .env with your settings
  nano .env
  ```

  Minimum required settings:

  ```
  DB_HOST=localhost
  DB_NAME=epstein_db
  DB_USERNAME=root
  DB_PASSWORD=your_password
  ADMIN_PASSWORD=your_admin_password
  ```

- Start development server

  ```bash
  php -S localhost:8000
  ```

- Visit http://localhost:8000
The production database is not included. For local development, you can:
Option A: Work with empty database
- Good for UI/UX development
- Test edge cases with no data
Option B: Create test data

```sql
-- Create sample documents
INSERT INTO documents (title, description, status, file_url, source, created_at) VALUES
('Sample Document 1', 'A test document for development', 'processed', 'https://example.com/doc1.pdf', 'TEST', NOW()),
('Sample Document 2', 'Another test document', 'processed', 'https://example.com/doc2.pdf', 'TEST', NOW());

-- Create sample entities
INSERT INTO entities (name, type, created_at) VALUES
('John Doe', 'PERSON', NOW()),
('Acme Corporation', 'ORG', NOW()),
('New York', 'LOCATION', NOW());

-- Link entities to documents
INSERT INTO document_entities (document_id, entity_id) VALUES
(1, 1), (1, 2), (2, 1), (2, 3);
```

For production deployment:
- Configure `.env` with production credentials
- Set up Apache/Nginx with PHP-FPM
- Enable HTTPS with Let's Encrypt
- Configure file permissions:

  ```bash
  chmod 755 *.php
  chmod 775 cache/
  ```

- Set up admin authentication (HTTP Basic Auth)
- Configure caching headers in `.htaccess`
See TECH.md for detailed production setup (if you include it).
The application uses these core tables:
`documents` - Main document metadata with full lifecycle tracking:
- id, title, description, file_url, source
- status (pending → downloaded → processed)
- ai_summary, created_at, updated_at

OCR pages table - OCR text per page with a FULLTEXT index:
- document_id, page_number, ocr_text
- FULLTEXT INDEX(ocr_text)

`entities` - People, organizations, locations:
- id, name, type (PERSON/ORG/LOCATION)
- created_at

`document_entities` - Many-to-many document-entity relationships:
- document_id, entity_id

Additional tables:
- `emails` - Email threads (FULLTEXT indexed)
- `flight_logs` - Flight manifest records
- `passengers` - Flight passenger details
- `ai_sessions` - AI chat session tracking
- `ai_messages` - AI conversation history
- `ai_citations` - Document citations in AI responses
See config/schema.sql for complete schema definition.
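To make the many-to-many link concrete, here is a hedged PDO sketch that lists the entities attached to one document via `document_entities`. Table and column names follow the notes above and the sample data in Getting Started; `config/schema.sql` remains the authoritative definition.

```php
<?php
declare(strict_types=1);

// Sketch: fetch all entities linked to one document through the
// document_entities join table. Column names follow the schema notes above.
$pdo = new PDO('mysql:host=localhost;dbname=epstein_db;charset=utf8mb4', 'root', 'your_password');

$stmt = $pdo->prepare(
    'SELECT e.id, e.name, e.type
     FROM entities e
     JOIN document_entities de ON de.entity_id = e.id
     WHERE de.document_id = :document_id
     ORDER BY e.type, e.name'
);
$stmt->execute(['document_id' => 1]);

foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $entity) {
    printf("%s (%s)\n", $entity['name'], $entity['type']);
}
```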
Note: The data ingestion pipeline is not included in this open source release.
The production site uses a proprietary pipeline that:
- Discovers documents from DOJ, FBI Vault, and House Oversight sources
- Downloads and OCRs PDF documents (pdf2image + Tesseract)
- Generates AI summaries and extracts entities (OpenAI GPT-5-nano)
- Analyzes flight logs for significance scoring
- Generates vector embeddings for semantic search
To use this application with your own data, you'll need to:
- Populate the `documents` table with your dataset (see the sketch below)
- Run your own OCR/processing pipeline
- Generate AI summaries and entity extractions
- Follow the database schema in `config/schema.sql`
The web application works with any dataset that follows the schema structure.
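If you are bringing your own dataset, a minimal starting point (a sketch, not the project's ingestion code) is a PDO prepared insert into `documents`, mirroring the sample rows shown earlier; adapt the columns to whatever `config/schema.sql` actually defines.

```php
<?php
declare(strict_types=1);

// Sketch: load one record of your own dataset into the documents table.
// Columns mirror the sample data above; verify against config/schema.sql.
$pdo = new PDO(
    'mysql:host=localhost;dbname=epstein_db;charset=utf8mb4',
    'root',
    'your_password',
    [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]
);

$stmt = $pdo->prepare(
    'INSERT INTO documents (title, description, status, file_url, source, created_at)
     VALUES (:title, :description, :status, :file_url, :source, NOW())'
);

$stmt->execute([
    'title'       => 'My Dataset Document',
    'description' => 'Imported from a local archive',
    'status'      => 'pending',          // pending -> downloaded -> processed
    'file_url'    => 'https://example.com/my-doc.pdf',
    'source'      => 'MY_SOURCE',
]);
```

From there, your own pipeline can fill in OCR text, AI summaries, and entity links following the same schema.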
We welcome contributions! See CONTRIBUTING.md for detailed guidelines.
- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Make your changes
- Test locally: `php -S localhost:8000`
- Commit: `git commit -m 'Add amazing feature'`
- Push: `git push origin feature/amazing-feature`
- Open a Pull Request
- UI/UX improvements (mobile, accessibility, dark mode)
- Search enhancements (better filters, faceted search)
- Performance optimizations
- Data visualizations
- Automated testing (we have none!)
- Documentation improvements
- Follow PSR-12 coding standard
- Use strict typing: `declare(strict_types=1);`
- Always use PDO prepared statements
- Test your changes locally
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).
This means:
- You can use, modify, and distribute this code
- If you run a modified version as a web service, you must release your source code
- All derivative works must also be AGPL-3.0 licensed
- You must credit the original project
Data Pipeline Exception: The data ingestion pipeline, scrapers, and automation scripts are proprietary and not included in this release.
See LICENSE for full details.
Kevin Champlin
- Website: kevinchamplin.com
- Email: info@epsteinsuite.com
- Production Site: epsteinsuite.com
- DOJ, FBI, and House Oversight for releasing these public records
- OpenAI for GPT-5 and embedding APIs
- The open source community for tools like TailwindCSS, D3.js, and Leaflet.js
- Tesseract OCR project
- Everyone committed to transparency and accountability
- Bug Reports: Open an issue
- Feature Requests: Open an issue with [FEATURE] tag
- Security Issues: Email info@epsteinsuite.com privately
- General Questions: GitHub Discussions
This project is designed with strict privacy protections:
- AI prompts explicitly forbid un-redacting victim names
- All data sources are already-public records
- Focus is on investigative leads, not victim identification
- Victim privacy is paramount
If you find this project useful, please consider giving it a star on GitHub!
Built for transparency. Designed for accountability.



