High-performance Unix file deduplication engine with tiered short-circuit logic.
bgate-unix is a fingerprinting gatekeeper that performs strict binary identity deduplication using tiered short-circuit logic. Designed for high-volume Unix pipelines where disk I/O is the bottleneck.
Key Features:
- Sub-millisecond duplicate rejection via O(1) index lookups
- Journaled file moves with crash recovery
- BLOB-based xxHash128 storage for collision-proof identity
- Atomic `link`/`unlink` moves (no TOCTOU races)
```
Incoming File
      │
      ▼
┌─────────────────────────────────────────┐
│ TIER 0: Empty Check                     │
│ file_size == 0 → SKIP                   │
│ Cost: stat() only                       │
└─────────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────────┐
│ TIER 1: Size Uniqueness                 │
│ Size not in DB → UNIQUE                 │
│ Cost: SQLite lookup                     │
└─────────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────────┐
│ TIER 2: Fringe Hash (xxh64)             │
│ First 64KB + Last 64KB + size           │
│ (Last 64KB overlaps if file < 128KB)    │
│ Hash not in DB → UNIQUE                 │
│ Cost: 128KB read max                    │
└─────────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────────┐
│ TIER 3: Full Hash (xxh128)              │
│ Entire file in 256KB chunks             │
│ Hash in DB → DUPLICATE                  │
│ Hash not in DB → UNIQUE                 │
│ Cost: Full file read                    │
└─────────────────────────────────────────┘
```
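The tiered short-circuit can be sketched in a few lines of Python. This is an illustrative simplification, not the engine's actual code: `hashlib.blake2b` stands in for xxHash, in-memory sets stand in for the SQLite indexes, and unique files are registered eagerly rather than via the engine's "Move-then-Hash" deferral.

```python
import hashlib
import os

FRINGE = 64 * 1024  # 64 KB read from each end of the file

def fringe_hash(path: str, size: int) -> bytes:
    """Tier 2: hash of first 64KB + last 64KB + size (regions overlap if < 128KB)."""
    h = hashlib.blake2b(digest_size=8)  # stand-in for xxh64
    with open(path, "rb") as f:
        h.update(f.read(FRINGE))
        f.seek(max(size - FRINGE, 0))
        h.update(f.read(FRINGE))
    h.update(size.to_bytes(8, "little"))
    return h.digest()

def full_hash(path: str) -> bytes:
    """Tier 3: hash of the entire file, read in 256KB chunks."""
    h = hashlib.blake2b(digest_size=16)  # stand-in for xxh128
    with open(path, "rb") as f:
        while chunk := f.read(256 * 1024):
            h.update(chunk)
    return h.digest()

def check(path: str, sizes: set, fringes: set, fulls: set) -> str:
    size = os.stat(path).st_size
    if size == 0:                                        # Tier 0: empty check
        return "SKIP"
    duplicate = (
        size in sizes                                    # Tier 1: size lookup
        and (fringe_hash(path, size), size) in fringes   # Tier 2: fringe hash
        and full_hash(path) in fulls                     # Tier 3: full hash
    )
    if duplicate:
        return "DUPLICATE"
    # Register all tiers eagerly for simplicity; the real engine defers
    # hashing of unique files ("Move-then-Hash").
    sizes.add(size)
    fringes.add((fringe_hash(path, size), size))
    fulls.add(full_hash(path))
    return "UNIQUE"
```

Python's short-circuiting `and` mirrors the tier ladder: each more expensive check runs only if the cheaper one failed to prove uniqueness.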
Install globally in an isolated environment using uv:

```shell
uv tool install bgate-unix
```

Verify it works:

```shell
bgate --help
```

Or add it as a project dependency:

```shell
uv add bgate-unix
# or
pip install bgate-unix
```

Requirements: Unix-based OS (Linux, macOS, BSD). Windows is not supported.
bgate-unix provides a high-performance CLI for pipeline integration.
```shell
# Scan and move unique files to vault (Active Mode)
bgate scan ./incoming --into ./vault --recursive --move

# Read-only scan (default behavior)
bgate scan ./incoming --recursive

# Show index statistics
bgate stats --db dedupe.db

# Recover from an interrupted session
bgate recover --db dedupe.db
```

Quick start:

```shell
# Install
uv tool install bgate-unix

# Scan and move unique files to tiered storage (Active Mode)
bgate scan ./incoming --into ./vault --recursive --move
```

```python
from bgate_unix import FileDeduplicator
from bgate_unix.engine import DedupeResult

with FileDeduplicator("dedupe.db") as deduper:
    result = deduper.process_file("incoming/document.pdf")
    match result.result:
        case DedupeResult.UNIQUE:
            print(f"New file (tier {result.tier})")
        case DedupeResult.DUPLICATE:
            print(f"Duplicate of {result.duplicate_of}")
        case DedupeResult.SKIPPED:
            print(f"Skipped: {result.error or 'empty'}")
```

Unique files are atomically moved to a processing directory:
```python
from pathlib import Path
from bgate_unix import FileDeduplicator
from bgate_unix.engine import DedupeResult

with FileDeduplicator("index.db", processing_dir=Path("processed/")) as deduper:
    for result in deduper.process_directory("inbound/", recursive=True):
        if result.result == DedupeResult.UNIQUE:
            # result.path is the new location in processed/
            # result.original_path is the source location
            # result.stored_path is also the new location (explicit field)
            print(f"Moved: {result.original_path.name} -> {result.stored_path.name}")
```

Important: `processing_dir` must be on the same filesystem as the source files (required for atomic `os.link`).
```python
from bgate_unix import FileDeduplicator
from bgate_unix.engine import DedupeResult

with FileDeduplicator("index.db") as deduper:
    results = list(deduper.process_directory("incoming/", recursive=True))
    unique = sum(1 for r in results if r.result == DedupeResult.UNIQUE)
    dupes = sum(1 for r in results if r.result == DedupeResult.DUPLICATE)
    print(f"Unique: {unique}, Duplicates: {dupes}")
    print(f"Stats: {deduper.stats}")
```

- Strict Schema Enforcement: The engine hard-stops if a database schema version mismatch is detected.
- Orphan Recovery: If a crash occurs during file moves, orphaned files are automatically recovered on the next connect.
- Emergency Logging: If the database becomes unavailable during a critical I/O operation, orphan records are written to an atomic `.jsonl` log file for manual recovery.
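An append-only `.jsonl` emergency log of this kind can be sketched as follows (illustrative only; the package's actual record format and field names may differ):

```python
import json
import os
import time

def log_orphan(log_path: str, original_path: str, orphan_path: str) -> None:
    """Append one orphan record as a single JSON line, fsynced for durability."""
    record = {
        "original_path": original_path,
        "orphan_path": orphan_path,
        "created_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    # O_APPEND makes each small write land as one contiguous line.
    fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, (json.dumps(record) + "\n").encode())
        os.fsync(fd)  # persist the record before returning
    finally:
        os.close(fd)
```

One record per line means a partially written trailing line (from a crash mid-write) can be discarded without corrupting earlier records.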
bgate-unix is designed for trusted internal pipelines.
- xxHash128: Used as an extremely low-collision identifier for high-volume data (a 2^128 space). For trusted inputs, accidental collisions are treated as statistically negligible.
- Deduplication Priority: Speed and durability are prioritized over security.
- Not for Adversarial Input: If you are processing untrusted/malicious files where hash collisions could be intentionally engineered, use a cryptographically secure mode (like BLAKE3 or SHA-256) which may be added in future versions.
Unique files are stored in a 2-level hex-sharded structure inside `processing_dir`:

- Path: `{processing_dir}/{id[0:2]}/{id[2:16]}{original_suffix}`
- Note: `id` is the full content hash when available (Tier 3), otherwise a unique UUID (Tier 1/2) to preserve "Move-then-Hash" performance.
- Example: `processed/a3/bc4f91e2d0f8.pdf`
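The sharding rule can be expressed directly (a sketch; `shard_path` is not part of the package's public API):

```python
from pathlib import Path

def shard_path(processing_dir: Path, file_id: str, suffix: str) -> Path:
    """2-level hex shard: {processing_dir}/{id[0:2]}/{id[2:16]}{suffix}."""
    return processing_dir / file_id[0:2] / (file_id[2:16] + suffix)

# e.g. with a full 128-bit content hash rendered as hex (Tier 3)
print(shard_path(Path("processed"), "a3bc4f91e2d0f8ab12cd34ef56ab78cd", ".pdf"))
```

The 2-hex-character first level caps the top directory at 256 shards, keeping directory listings small even with millions of stored files.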
SQLite with BLOB-based hash storage:

```sql
-- Tier 1: Size lookup (existence set)
CREATE TABLE size_index (
    file_size INTEGER PRIMARY KEY
) WITHOUT ROWID;

-- Tier 2: Fringe hash (BLOB)
CREATE TABLE fringe_index (
    fringe_hash BLOB NOT NULL,
    file_size INTEGER NOT NULL,
    file_path TEXT NOT NULL,
    PRIMARY KEY (fringe_hash, file_size)
) WITHOUT ROWID;

-- Tier 3: Full hash (BLOB)
CREATE TABLE full_index (
    full_hash BLOB PRIMARY KEY,
    file_path TEXT NOT NULL
) WITHOUT ROWID;

-- Crash recovery tables
CREATE TABLE orphan_registry (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    original_path TEXT NOT NULL,
    orphan_path TEXT NOT NULL,
    file_size INTEGER NOT NULL,
    created_at TEXT NOT NULL,
    recovered_at TEXT,
    status TEXT NOT NULL DEFAULT 'pending',
    UNIQUE(orphan_path)
);

CREATE TABLE move_journal (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_path TEXT NOT NULL,
    dest_path TEXT NOT NULL,
    file_size INTEGER NOT NULL,
    created_at TEXT NOT NULL,
    phase TEXT NOT NULL DEFAULT 'planned',
    completed_at TEXT
);

CREATE TABLE schema_version (
    version INTEGER PRIMARY KEY,
    applied_at TEXT NOT NULL
);
```

Pragmas: WAL mode, synchronous=FULL, 64MB cache, 256MB mmap.
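The pragma configuration can be applied with Python's built-in `sqlite3` module (a sketch; the engine's actual connection setup may differ):

```python
import sqlite3

conn = sqlite3.connect("dedupe.db")
conn.execute("PRAGMA journal_mode=WAL")      # readers don't block the writer
conn.execute("PRAGMA synchronous=FULL")      # fsync on every commit: durability first
conn.execute("PRAGMA cache_size=-65536")     # 64 MB page cache (negative means KiB)
conn.execute("PRAGMA mmap_size=268435456")   # 256 MB memory-mapped I/O
```

`synchronous=FULL` under WAL trades some write throughput for the guarantee that committed index entries survive power loss, which matters here because the index is the source of truth for whether a file was already moved.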
Uses hard-link + unlink (`os.link` / `Path.unlink`) for atomic same-filesystem moves.

- Signal Deferral: SIGINT/SIGTERM signals are deferred during critical move operations using `critical_section()`.
- Fsync Ordering: File and directory durability is strictly enforced:
  1. After linking the destination, newly created parent directories are fsynced (top-down).
  2. The destination directory is fsynced to persist the new link.
  3. The source file is unlinked.
  4. The source directory is fsynced to persist the removal.
- FS Enforcement: Cross-device moves are explicitly rejected (`EXDEV` error) to maintain atomicity.
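The link/unlink sequence can be sketched as follows (a simplified illustration of the pattern; journaling, signal deferral, and top-down parent-directory fsyncs are omitted):

```python
import errno
import os

def atomic_move(src: str, dst: str) -> None:
    """Same-filesystem move via hard-link + unlink; rejects cross-device moves."""
    try:
        os.link(src, dst)                     # atomically create the destination name
    except OSError as e:
        if e.errno == errno.EXDEV:
            raise RuntimeError("cross-device move rejected to keep atomicity") from e
        raise
    # Fsync the destination directory so the new link survives a crash...
    dfd = os.open(os.path.dirname(dst) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
    os.unlink(src)                            # ...then drop the source name
    sfd = os.open(os.path.dirname(src) or ".", os.O_RDONLY)
    try:
        os.fsync(sfd)                         # persist the removal
    finally:
        os.close(sfd)
```

Because the file always has at least one directory entry at every step, a crash at any point leaves either the source or the destination (or briefly both) intact, never neither.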
Move operations use phase-based journaling: planned → moving → completed.
On startup, the engine automatically recovers incomplete entries:
- `planned`: Move never started → marked as `failed`.
- `moving`: File may have been moved but not yet indexed → the engine attempts an atomic rollback (link back to source + fsync + unlink destination).
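A rollback for an interrupted `moving` entry could look roughly like this (a sketch of the idea, not the engine's code):

```python
import os

def rollback_move(source_path: str, dest_path: str) -> None:
    """Undo a half-finished move: link the destination back to the source
    name, fsync the source directory, then unlink the destination."""
    if os.path.exists(dest_path) and not os.path.exists(source_path):
        os.link(dest_path, source_path)   # restore the original name
        dfd = os.open(os.path.dirname(source_path) or ".", os.O_RDONLY)
        try:
            os.fsync(dfd)                 # persist the restored link first
        finally:
            os.close(dfd)
        os.unlink(dest_path)              # then remove the moved copy
```

Fsyncing the restored link before unlinking the destination preserves the same invariant as the forward move: the file is never left without a durable directory entry.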
Performance benchmarks on production datasets demonstrate bgate-unix's efficiency for enterprise workloads.
Hardware: AWS EC2 ARM64 instance
Storage: Amazon Elastic Block Store (NVMe SSD)
OS: Debian GNU/Linux (ARM64)
Dataset: 24.68 GB production data pipeline files
| Metric | Value |
|---|---|
| Dataset Size | 24.68 GB, 9,174 files |
| Processing Time | 274.96 seconds (~4.6 minutes) |
| Bandwidth | 89.1 MB/sec |
| File Throughput | 28.8 files/sec (moved) |
| Files Moved | 7,932 unique files (23.92 GB) |
| Deduplication | 13.5% duplicates found (1,242 files) |
| Idempotency | ✅ 0 files moved on subsequent runs |
- Excellent bandwidth on large datasets (89.1 MB/sec)
- Consistent throughput across different file sizes
- Production-ready performance for enterprise workloads
- Perfect idempotency - no unnecessary operations on re-runs
- Effective deduplication with 13.5% duplicate detection
- I/O optimized - performance bottleneck is disk throughput, not CPU cycles (as designed)
Use the included benchmark script to test performance on your data:
```shell
# Run benchmark with idempotency test
./scripts/benchmark.sh /path/to/source /path/to/vault

# Example output:
# 🚀 bgate-unix Move Operation Benchmark
# FIRST RUN: 89.1 MB/sec, 7,932 files moved
# IDEMPOTENCY TEST: ✅ 0 files moved (perfect idempotency)
```

Note: Source and vault must be on the same filesystem for atomic operations.
```shell
git clone https://github.com/mr3od/bgate-unix.git
cd bgate-unix
uv sync --dev

# Run tests
uv run pytest

# Lint
uv run ruff check .
uv run ruff format .

# Type check
uv run ty check src/
```

MIT