Production-Ready ML Data Loading Library
TurboLoader is a high-performance data loading library for machine learning workflows. Built with C++20 and featuring Python bindings, it provides efficient data loading with SIMD-accelerated transforms, custom binary formats, and distributed training support.
- Decoded Tensor Caching (v2.7.0) -
cache_decoded=Truefor 100K+ img/s on subsequent epochs - Multiple Loader Types - FastDataLoader (8-12% faster), MemoryEfficientDataLoader, standard DataLoader
- Distributed Training Support - Multi-node data loading with deterministic sharding
- SIMD-Accelerated Transforms - 19 vectorized transforms using AVX2/AVX-512/NEON
- TBL v2 Binary Format - Custom format with LZ4 compression for reduced storage
- Framework Integration - Seamless support for PyTorch, TensorFlow, and JAX
- Memory-Mapped I/O - Zero-copy file access for improved throughput
- Lock-Free Queues - Concurrent data structures for efficient multi-threading
- GPU JPEG Decoding - Optional NVIDIA nvJPEG support for accelerated decoding
pip install turboloadergit clone https://github.com/ALJainProjects/TurboLoader.git
cd TurboLoader
pip install -e .- Python: 3.10 or higher
- Compiler: C++20 capable (GCC 10+, Clang 12+, MSVC 19.29+)
- OS: macOS, Linux, Windows
Install for enhanced performance:
# macOS
brew install jpeg-turbo libpng libwebp lz4
# Ubuntu/Debian
sudo apt-get install libjpeg-turbo8-dev libpng-dev libwebp-dev liblz4-devimport turboloader
# Create DataLoader
loader = turboloader.DataLoader(
'imagenet.tar',
batch_size=128,
num_workers=8
)
# Iterate over batches
for batch in loader:
for sample in batch:
image = sample['image'] # NumPy array (H, W, C)
label = sample['label']
# Train your model...import turboloader
# Create transforms
resize = turboloader.Resize(224, 224)
normalize = turboloader.ImageNetNormalize()
flip = turboloader.RandomHorizontalFlip(p=0.5)
# Apply transforms
loader = turboloader.DataLoader('data.tar', batch_size=64, num_workers=8)
for batch in loader:
for sample in batch:
img = sample['image']
img = resize.apply(img)
img = flip.apply(img)
img = normalize.apply(img)
# Ready for trainingimport turboloader
import torch
loader = turboloader.DataLoader('imagenet.tar', batch_size=64, num_workers=8)
# Convert to PyTorch tensors
to_tensor = turboloader.ToTensor(
format=turboloader.TensorFormat.PYTORCH_CHW
)
for batch in loader:
images = []
for sample in batch:
img = to_tensor.apply(sample['image'])
images.append(torch.from_numpy(img))
batch_tensor = torch.stack(images)
# Train model...import turboloader
import torch.distributed as dist
# Initialize distributed training
dist.init_process_group(backend='nccl')
# Create loader with distributed support
loader = turboloader.DataLoader(
data_path="/data/imagenet.tar",
batch_size=64,
num_workers=4,
shuffle=True,
enable_distributed=True,
world_rank=dist.get_rank(),
world_size=dist.get_world_size(),
drop_last=True
)
# Each rank automatically gets its shard
for batch in loader:
# Your training code
passTurboLoader includes 19 SIMD-accelerated transforms:
- Resize - Bilinear/Bicubic/Lanczos interpolation
- Normalize - Mean/std normalization with SIMD
- CenterCrop - Center region extraction
- RandomCrop - Random crop with padding
- RandomHorizontalFlip - SIMD horizontal flip
- RandomVerticalFlip - SIMD vertical flip
- ColorJitter - Brightness/contrast/saturation/hue
- RandomRotation - Arbitrary angle rotation
- GaussianBlur - Separable convolution
- RandomErasing - Cutout augmentation
- Pad - Border padding (CONSTANT/EDGE/REFLECT)
- RandomPosterize - Bit-depth reduction
- RandomSolarize - Threshold inversion
- RandomPerspective - Perspective warp
- AutoAugment - Learned policies (ImageNet/CIFAR10/SVHN)
- ToTensor - PyTorch CHW or TensorFlow HWC format
TurboLoader includes a custom binary format optimized for ML workloads:
- LZ4 compression for reduced storage
- Memory-mapped access for fast loading
- O(1) random access via indexed structure
- Data integrity validation with CRC checksums
- Cached image dimensions for filtered loading
import turboloader
writer = turboloader.TblWriterV2(
output_path="/data/imagenet.tbl",
compression=True
)
reader = turboloader.TarReader("/data/imagenet.tar")
for sample in reader:
writer.add_sample(
data=sample.data,
format=sample.format,
metadata={"label": sample.label}
)
writer.finalize()- Quick Start Notebook - Interactive tutorial for beginners
- Installation Guide - Detailed setup instructions
- Quick Start - Getting started examples
- Troubleshooting Guide - Common issues and solutions
- API Reference - Complete API documentation
- Transforms API - All 19 transforms with examples
- PyTorch Integration Guide - Complete PyTorch guide
- TensorFlow Integration Guide - Complete TensorFlow/Keras guide
- PyTorch Lightning Example - Production-ready Lightning integration
- Distributed Training (DDP) - Multi-GPU PyTorch DDP example
- ImageNet ResNet50 Training - Complete training pipeline with AMP, checkpointing, TensorBoard
- Distributed Training - Multi-node setup guide
Head-to-head comparison with optimized PyTorch DataLoader (persistent_workers=True, prefetch_factor=4). Both loaders tested under identical conditions.
| Configuration | TurboLoader | PyTorch | Speedup |
|---|---|---|---|
| uint8 CHW (resize only) | 8,027 img/s | 2,457 img/s | 3.3x |
| float32 CHW (0-1 normalize) | 8,456 img/s | 2,040 img/s | 4.1x |
| float32 CHW + ImageNet mean/std | 8,029 img/s | 2,039 img/s | 3.9x |
| Configuration | Epoch 2 Throughput |
|---|---|
| uint8 HWC (from cache) | 57,692,695 img/s |
| float32 CHW (from cache) | 42,933,573 img/s |
| float32 CHW + ImageNet (from cache) | 39,853,643 img/s |
| Workers | TurboLoader | PyTorch | Speedup |
|---|---|---|---|
| 1 worker | 1,585 img/s | 625 img/s | 2.5x |
| 2 workers | 3,383 img/s | 1,184 img/s | 2.9x |
| 4 workers | 7,744 img/s | 2,016 img/s | 3.8x |
| 8 workers | 13,327 img/s | 3,047 img/s | 4.4x |
| Batch Size | TurboLoader | PyTorch | Speedup |
|---|---|---|---|
| 8 | 7,997 img/s | 2,342 img/s | 3.4x |
| 16 | 8,280 img/s | 2,261 img/s | 3.7x |
| 32 | 7,418 img/s | 1,946 img/s | 3.8x |
| 64 | 7,896 img/s | 1,765 img/s | 4.5x |
| 128 | 7,841 img/s | 1,521 img/s | 5.2x |
Test conditions: Apple M4 Pro, 5000 JPEG images (640x480), best of 3 trials, 100 batches per trial. PyTorch uses persistent_workers=True, prefetch_factor=4.
- OpenMP parallelism for batch assembly (decode, resize, transpose, convert)
- Fused SIMD deinterleave: NEON
vld3q_u8for HWC→CHW + u8→f32 + normalize in a single pass - Thread-local buffers to eliminate per-sample heap allocation under OpenMP
- Pipeline reset reuses buffer pools, decoders, and memory maps across epochs
- LTO (thin) for cross-TU inlining of SIMD functions
- GIL released during all C++ processing
Note: Actual throughput depends on your hardware, image sizes, and pipeline configuration. Run the benchmark on your setup for precise figures.
TurboLoader uses a multi-threaded pipeline architecture:
┌─────────────────────────────────────────────┐
│ Memory-Mapped Reader │
│ (TAR/TBL v2 with zero-copy access) │
└──────────────┬──────────────────────────────┘
│
┌──────▼──────┐
│Worker Pool │
│ (N threads)│
├─────────────┤
│ Decode │
│ Transform │
│ Convert │
└──────┬──────┘
│
┌──────▼──────────────┐
│ Lock-Free Queue │
└──────┬──────────────┘
│
┌──────▼──────┐
│Python API │
└─────────────┘
- Memory-Mapped I/O - Zero-copy file access
- Worker Thread Pool - Parallel processing with per-thread decoders
- SIMD Transforms - Vectorized operations (AVX2/AVX-512/NEON)
- Lock-Free Queues - High-performance concurrent data structures
TurboLoader is released under the MIT License.
If you use TurboLoader in your research:
@software{turboloader2025,
author = {Jain, Arnav},
title = {TurboLoader: Production-Ready ML Data Loading},
year = {2025},
version = {2.7.0},
url = {https://github.com/ALJainProjects/TurboLoader}
}- Documentation: https://github.com/ALJainProjects/TurboLoader/tree/main/docs
- Troubleshooting: https://github.com/ALJainProjects/TurboLoader/blob/main/docs/TROUBLESHOOTING.md
- Verification Script: Run
python scripts/verify_installation.pyto check your setup - Issues: GitHub Issues
- Discussions: GitHub Discussions
- PyPI: https://pypi.org/project/turboloader/
TurboLoader - Production-ready ML data loading. 2.5-5.2x faster than optimized PyTorch DataLoader.