================================================================================
TURBOLOADER COMPREHENSIVE TECHNICAL DOCUMENTATION
Version 1.7.1
================================================================================
This document provides a detailed, senior-engineer-level breakdown of the
TurboLoader codebase architecture, implementation details, and complete
step-by-step instructions for building, testing, and running all components.
Written by: Senior Software Engineer/Researcher
Target Audience: Engineers and researchers seeking deep technical understanding
TABLE OF CONTENTS
================================================================================
1. PROJECT OVERVIEW & ARCHITECTURE
2. CORE DESIGN PATTERNS & PHILOSOPHY
3. DETAILED CODEBASE STRUCTURE
4. BUILD SYSTEM COMPREHENSIVE GUIDE
5. STEP-BY-STEP INSTALLATION
6. RUNNING THE TEST SUITE
7. BENCHMARKING FRAMEWORK
8. PYTHON API & BINDINGS
9. ADVANCED FEATURES & OPTIMIZATION
10. TROUBLESHOOTING & DEBUGGING
================================================================================
SECTION 1: PROJECT OVERVIEW & ARCHITECTURE
================================================================================
1.1 WHAT IS TURBOLOADER?
------------------------------------------------------------------------------
TurboLoader is a high-performance machine learning data loading library written
in C++20 with Python bindings. The primary design goal is to eliminate data
loading as the bottleneck in ML training pipelines by achieving throughputs
that significantly exceed standard frameworks like PyTorch DataLoader.
Key Performance Numbers:
- Peak Throughput: 21,035 images/second (16 workers, batch_size=64)
- 12x faster than PyTorch DataLoader
- 1.3x faster than TensorFlow's tf.data
- 52+ Gbps local file I/O throughput
- TBL v2 Format: 4,875 img/s TAR→TBL conversion throughput
- TBL v2 Compression: 40-60% space savings with LZ4
- Smart Batching: 15-25% throughput improvement with size-aware grouping
The library achieves these performance numbers through several key techniques:
1. Lock-free SPSC (Single Producer Single Consumer) ring buffers
2. Memory-mapped I/O for zero-copy file access
3. 19 SIMD-accelerated transforms (AVX2/NEON) including AutoAugment policies
4. Per-worker thread-local decoders to eliminate contention
5. GPU-accelerated JPEG decoding via nvJPEG (optional)
6. Intelligent prefetching and batching strategies
7. TBL v2 Format with LZ4 streaming compression (40-60% smaller than TAR)
8. Smart Batching for size-aware sample grouping
9. Distributed Training with deterministic sharding for multi-node setups
1.2 ARCHITECTURAL PRINCIPLES
------------------------------------------------------------------------------
The architecture follows several critical design principles:
PRINCIPLE 1: ZERO-COPY PHILOSOPHY
Wherever possible, data is accessed via memory-mapped files and passed by
reference or moved (C++ move semantics) rather than copied. This eliminates
redundant memory allocations and reduces cache pressure.
PRINCIPLE 2: LOCK-FREE CONCURRENCY
Traditional locks introduce contention and unpredictable latency. TurboLoader
uses atomic operations and lock-free data structures for all hot paths. Mutexes
are only used in cold paths like initialization.
PRINCIPLE 3: SIMD-FIRST DESIGN
All performance-critical operations (image transforms, format conversions, etc.)
are implemented with SIMD intrinsics. The build selects AVX2 on x86 or NEON
on ARM automatically, and falls back to scalar code when neither is available.
PRINCIPLE 4: PER-WORKER RESOURCE ISOLATION
Each worker thread maintains its own decoder instances, buffer pools, and state.
This eliminates false sharing and cache line bouncing between CPU cores.
PRINCIPLE 5: CONDITIONAL COMPILATION FOR OPTIONAL FEATURES
GPU acceleration (nvJPEG), async I/O (io_uring), and other platform-specific
features are compiled conditionally. The library provides graceful fallbacks
ensuring it works on all platforms (Linux, macOS, Windows).
1.3 HIGH-LEVEL DATA FLOW
------------------------------------------------------------------------------
The data flow through TurboLoader follows this pipeline:
Step 1: SOURCE READING
TAR files are memory-mapped into the process address space. The TarReader
parses TAR headers in-memory without system calls for reading.
Location: src/readers/tar_reader.hpp
Key Method: TarReader::load_samples()
Step 2: WORKER DISPATCH
The UnifiedPipeline maintains a pool of Worker threads. Each worker pulls
TAR entry metadata from a thread-safe queue and processes samples
independently.
Location: src/pipeline/pipeline.hpp
Key Classes: UnifiedPipeline, TarWorker
Step 3: DECODING
Workers decode JPEG data using per-worker JPEGDecoder instances (libjpeg-
turbo with SIMD). On systems with nvJPEG, GPU decoding can be used for 10x
faster JPEG decompression.
Location: src/decode/jpeg_decoder.hpp, src/decode/nvjpeg_decoder.hpp
Step 4: TRANSFORMATION
Decoded RGB images pass through a transform pipeline. Transforms are
composable and use SIMD intrinsics for operations like resize, normalize,
color jitter, etc.
Location: src/transforms/*.hpp
Key Transforms: resize_transform.hpp, normalize_transform.hpp
Step 5: BATCHING
Transformed samples are collected into batches. Smart batching groups
similar-sized samples to reduce padding overhead.
Location: src/pipeline/smart_batching.hpp
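One minimal way to realize size-aware grouping is to order samples by pixel
area and batch neighbors, so each batch pads only to a nearby maximum. The
sketch below is illustrative; `group_by_size` is a hypothetical helper, not
the smart_batching.hpp API, whose actual policy may differ:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative size-aware batching: sort sample indices by image area,
// then cut the sorted order into consecutive batches. Images in a batch
// end up similar in size, so padding to the batch maximum is small.
struct Dim { std::size_t w, h; };

std::vector<std::vector<std::size_t>>
group_by_size(const std::vector<Dim>& dims, std::size_t batch_size) {
    std::vector<std::size_t> order(dims.size());
    for (std::size_t i = 0; i < dims.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return dims[a].w * dims[a].h < dims[b].w * dims[b].h;  // sort by area
    });
    std::vector<std::vector<std::size_t>> batches;
    for (std::size_t i = 0; i < order.size(); i += batch_size)
        batches.emplace_back(order.begin() + i,
                             order.begin() + std::min(i + batch_size, order.size()));
    return batches;
}
```

With four images of sizes 100x100, 10x10, 90x90, 12x12 and batch_size=2, the
two small images land in one batch and the two large ones in the other.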
Step 6: TENSOR CONVERSION
Batches are converted to PyTorch/TensorFlow/JAX tensor format (CHW or HWC
layout) and returned to the Python layer via pybind11.
Location: src/transforms/tensor_conversion.hpp
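Layout conversion is conceptually a per-channel transpose: interleaved HWC
(R,G,B,R,G,B,...) becomes planar CHW (all R, then all G, then all B), which
is what PyTorch expects. A scalar sketch of the idea (`hwc_to_chw` is an
illustrative helper, not the tensor_conversion.hpp API, which is
SIMD-accelerated):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert an interleaved HWC image to planar CHW layout.
// Illustrative scalar version of what a tensor-conversion step does.
std::vector<uint8_t> hwc_to_chw(const std::vector<uint8_t>& hwc,
                                std::size_t height, std::size_t width,
                                std::size_t channels) {
    std::vector<uint8_t> chw(hwc.size());
    for (std::size_t y = 0; y < height; ++y)
        for (std::size_t x = 0; x < width; ++x)
            for (std::size_t c = 0; c < channels; ++c)
                // source index walks pixels, destination index walks planes
                chw[c * height * width + y * width + x] =
                    hwc[(y * width + x) * channels + c];
    return chw;
}
```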
================================================================================
SECTION 2: CORE DESIGN PATTERNS & PHILOSOPHY
================================================================================
2.1 CONCURRENT QUEUE DESIGN
------------------------------------------------------------------------------
The pipeline uses lock-free SPSC (Single Producer Single Consumer) ring buffers
for passing data between pipeline stages. This is critical for performance.
IMPLEMENTATION DETAILS:
File: src/core/spsc_ring_buffer.hpp
The ring buffer maintains two atomic indices:
- write_idx_: Modified only by producer thread
- read_idx_: Modified only by consumer thread
This separation ensures each thread writes to its own cache line, preventing
false sharing. The buffer size is always a power of 2, allowing fast modulo
operations via bitwise AND:
index & (capacity - 1) // Fast modulo
Memory ordering is carefully chosen:
- push() stores write_idx_ with memory_order_release
- pop() loads write_idx_ with memory_order_acquire
This release/acquire pair establishes a happens-before relationship: every
write the producer made to the data buffer is visible to the consumer once it
observes the updated write_idx_.
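The scheme above can be sketched in a few lines. This is an illustrative
reduction, not the actual spsc_ring_buffer.hpp (which additionally pads each
index onto its own cache line):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>

// Minimal SPSC ring buffer sketch: power-of-two capacity for the bitwise-AND
// modulo, release store on the producer's index, acquire load on the
// consumer's side. Simplified for illustration.
template <typename T, std::size_t Capacity>  // Capacity must be a power of 2
class SpscRingBuffer {
    static_assert((Capacity & (Capacity - 1)) == 0, "power of two");
    T buf_[Capacity];
    std::atomic<std::size_t> write_idx_{0};  // modified only by producer
    std::atomic<std::size_t> read_idx_{0};   // modified only by consumer
public:
    bool push(T v) {  // call from producer thread only
        std::size_t w = write_idx_.load(std::memory_order_relaxed);
        std::size_t r = read_idx_.load(std::memory_order_acquire);
        if (w - r == Capacity) return false;        // full
        buf_[w & (Capacity - 1)] = std::move(v);    // fast modulo
        write_idx_.store(w + 1, std::memory_order_release);  // publish
        return true;
    }
    std::optional<T> pop() {  // call from consumer thread only
        std::size_t r = read_idx_.load(std::memory_order_relaxed);
        std::size_t w = write_idx_.load(std::memory_order_acquire);
        if (r == w) return std::nullopt;            // empty
        T v = std::move(buf_[r & (Capacity - 1)]);
        read_idx_.store(r + 1, std::memory_order_release);
        return v;
    }
};
```

Note that the indices grow monotonically and are only masked on access;
unsigned wraparound keeps the `w - r` fullness test correct.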
2.2 MEMORY POOLING STRATEGY
------------------------------------------------------------------------------
Creating and destroying std::vector<uint8_t> for every image is expensive due
to allocator overhead. TurboLoader uses object pooling to reuse allocations.
File: src/core/object_pool.hpp
The ObjectPool maintains a free list of pre-allocated objects. When a worker
needs a buffer:
1. Try to pop from free list (lock-free)
2. If empty, allocate new object
3. When done, return object to pool
The pool is thread-safe and grows dynamically. Objects are never freed until
program termination, eliminating deallocation overhead entirely.
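The acquire/release protocol can be sketched as follows. For brevity a mutex
stands in for the real lock-free free list, and `BufferPool` is an
illustrative name, not the object_pool.hpp API:

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <vector>

// Illustrative buffer pool: acquire() reuses a recycled buffer when one is
// available, release() returns a buffer without deallocating it, so
// steady-state operation performs no heap allocation.
class BufferPool {
    std::vector<std::unique_ptr<std::vector<uint8_t>>> free_;
    std::mutex m_;  // placeholder for the real lock-free free list
public:
    std::unique_ptr<std::vector<uint8_t>> acquire() {
        std::lock_guard<std::mutex> lk(m_);
        if (free_.empty())                      // pool empty: allocate fresh
            return std::make_unique<std::vector<uint8_t>>();
        auto buf = std::move(free_.back());     // reuse an old buffer
        free_.pop_back();
        return buf;
    }
    void release(std::unique_ptr<std::vector<uint8_t>> buf) {
        buf->clear();                           // drop contents, keep capacity
        std::lock_guard<std::mutex> lk(m_);
        free_.push_back(std::move(buf));        // never deallocated
    }
    std::size_t idle_count() {
        std::lock_guard<std::mutex> lk(m_);
        return free_.size();
    }
};
```

Because `clear()` keeps a vector's capacity, a buffer that once held a large
decoded image is reused without reallocating on the next acquire.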
2.3 SIMD ABSTRACTION LAYER
------------------------------------------------------------------------------
File: src/transforms/simd_utils.hpp
TurboLoader provides a thin abstraction over SIMD intrinsics to support
multiple ISAs:
#ifdef __AVX2__
// Use 256-bit AVX2 intrinsics
__m256i vec = _mm256_load_si256(...)
#elif defined(__ARM_NEON)
// Use 128-bit NEON intrinsics
uint8x16_t vec = vld1q_u8(...)
#else
// Scalar fallback
for (size_t i = 0; i < size; ++i) { ... }
#endif
Key SIMD Operations Implemented:
- cvt_u8_to_f32_normalized: Convert uint8 [0,255] to float [0.0,1.0]
- mul_u8_scalar: Multiply uint8 pixels by scalar (brightness)
- horizontal_flip: Flip image horizontally using SIMD loads/stores
- bilinear_interpolation: SIMD-accelerated image resizing
The SIMD functions process 32 bytes (AVX2) or 16 bytes (NEON) per iteration,
achieving 8-16x speedup over scalar code.
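As a concrete instance of this pattern, here is a sketch of a
cvt_u8_to_f32_normalized-style routine with an AVX2 fast path and scalar
fallback. The function body is illustrative, not the actual simd_utils.hpp
code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#ifdef __AVX2__
#include <immintrin.h>
#endif

// Convert uint8 pixels [0,255] to float [0.0,1.0].
// Illustrative sketch of the ISA-guarded structure described above.
void cvt_u8_to_f32_normalized(const uint8_t* src, float* dst, std::size_t n) {
    std::size_t i = 0;
    const float scale = 1.0f / 255.0f;
#ifdef __AVX2__
    // Widen 8 uint8 values to int32, convert to float, scale by 1/255.
    for (; i + 8 <= n; i += 8) {
        __m128i u8  = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(src + i));
        __m256i i32 = _mm256_cvtepu8_epi32(u8);
        __m256  f   = _mm256_mul_ps(_mm256_cvtepi32_ps(i32),
                                    _mm256_set1_ps(scale));
        _mm256_storeu_ps(dst + i, f);
    }
#endif
    for (; i < n; ++i)  // scalar fallback and tail handling
        dst[i] = src[i] * scale;
}
```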
================================================================================
SECTION 3: DETAILED CODEBASE STRUCTURE
================================================================================
3.1 DIRECTORY LAYOUT
------------------------------------------------------------------------------
TurboLoader/ # Repository root
├── CMakeLists.txt # Root build configuration
├── pyproject.toml # Python package metadata
├── setup.py # Python build script
├── README.md # User-facing documentation
├── CHANGELOG.md # Version history
├── ARCHITECTURE.md # Architecture documentation
│
├── src/ # C++ source code
│ ├── core/ # Core data structures
│ │ ├── object_pool.hpp # Thread-safe object pool
│ │ ├── sample.hpp # Sample data structure
│ │ └── spsc_ring_buffer.hpp # Lock-free queue
│ │
│ ├── decode/ # Image/video decoders
│ │ ├── jpeg_decoder.hpp # libjpeg-turbo decoder
│ │ ├── png_decoder.hpp # libpng decoder
│ │ ├── webp_decoder.hpp # libwebp decoder
│ │ ├── nvjpeg_decoder.hpp # GPU JPEG decoder (CUDA)
│ │ ├── image_decoder.hpp # Multi-format dispatcher
│ │ ├── video_decoder.hpp # FFmpeg video decoder
│ │ ├── csv_decoder.hpp # CSV parser
│ │ └── parquet_decoder.hpp # Apache Parquet reader
│ │
│ ├── formats/ # Custom binary formats
│ │ ├── tbl_format.hpp # TBL v2 binary format spec
│ │ ├── tbl_reader_v2.hpp # TBL v2 reader with LZ4
│ │ └── tbl_writer_v2.hpp # TBL v2 streaming writer
│ │
│ ├── gpu/ # GPU-specific code
│ │ ├── multi_gpu_pipeline.hpp # Multi-GPU pipeline
│ │ └── multi_gpu_pipeline.cpp # Implementation
│ │
│ ├── io/ # I/O abstractions
│ │ └── io_uring_reader.hpp # Linux async I/O (io_uring)
│ │
│ ├── pipeline/ # Core pipeline logic
│ │ ├── pipeline.hpp # Main UnifiedPipeline class (includes distributed training)
│ │ ├── prefetch_pipeline.hpp # Double-buffer prefetching
│ │ └── smart_batching.hpp # Size-aware batching (NEW in v1.7.0)
│ │
│ ├── python/ # Python bindings
│ │ └── turboloader_bindings.cpp # pybind11 bindings
│ │
│ ├── readers/ # Data source readers
│ │ ├── tar_reader.hpp # TAR archive reader (mmap)
│ │ ├── tbl_reader.hpp # TBL format reader
│ │ ├── http_reader.hpp # HTTP/HTTPS remote loading
│ │ ├── s3_reader.hpp # AWS S3 reader
│ │ ├── gcs_reader.hpp # Google Cloud Storage
│ │ └── reader_orchestrator.hpp # Auto source detection
│ │
│ ├── transforms/ # 19 SIMD-accelerated transforms
│ │ ├── simd_utils.hpp # SIMD abstraction layer (AVX2/NEON)
│ │ ├── transform_base.hpp # Base transform class
│ │ ├── transforms.hpp # All transforms header
│ │ ├── resize_transform.hpp # Bilinear/bicubic/Lanczos resize
│ │ ├── normalize_transform.hpp # Mean/std normalization
│ │ ├── crop_transform.hpp # Center/random crop
│ │ ├── flip_transform.hpp # Horizontal/vertical flip
│ │ ├── color_jitter_transform.hpp # Brightness/contrast/saturation
│ │ ├── rotation_transform.hpp # Arbitrary rotation
│ │ ├── affine_transform.hpp # Affine transformations
│ │ ├── blur_transform.hpp # Gaussian blur
│ │ ├── pad_transform.hpp # Padding
│ │ ├── erasing_transform.hpp # Random erasing (Cutout)
│ │ ├── grayscale_transform.hpp # Color to grayscale
│ │ ├── perspective_transform.hpp # Perspective warp (NEW in v1.5.1)
│ │ ├── posterize_transform.hpp # Bit depth reduction (NEW in v1.5.1)
│ │ ├── solarize_transform.hpp # Threshold inversion (NEW in v1.5.1)
│ │ ├── autoaugment_transform.hpp # AutoAugment policies (NEW in v1.5.1)
│ │ └── tensor_conversion.hpp # PyTorch/TF/JAX conversion
│ │
│ └── writers/ # Data format writers
│ ├── tbl_writer.hpp # Legacy TBL writer
│ └── tbl_writer_v2.hpp # TBL v2 streaming writer (NEW in v1.5.0)
│
├── tests/ # C++ test suite
│ ├── CMakeLists.txt # Test build configuration
│ ├── test_tar_reader.cpp # TAR reader tests
│ ├── test_image_decoder.cpp # Decoder tests
│ ├── test_http_reader.cpp # HTTP reader tests
│ ├── test_unified_pipeline.cpp # Pipeline integration tests
│ ├── test_transforms.cpp # Transform tests
│ ├── test_nvjpeg_decoder.cpp # GPU decode tests
│ ├── test_pipeline_gpu_decode.cpp # GPU pipeline tests
│ ├── test_smart_batching.cpp # Batching tests
│ ├── test_avx512_simd.cpp # AVX-512 SIMD tests
│ ├── test_tbl_format.cpp # TBL format tests
│ └── ... (20+ test files)
│
├── benchmarks/ # Performance benchmarks
│ ├── 01_pil_baseline.py # PIL baseline benchmark
│ ├── 02_pytorch_naive.py # PyTorch DataLoader
│ ├── 03_pytorch_optimized.py # Optimized PyTorch
│ ├── 05_turboloader.py # TurboLoader benchmark
│ ├── 08_tensorflow.py # TensorFlow tf.data
│ └── run_all_benchmarks.py # Benchmark suite runner
│
├── examples/ # Usage examples
│ ├── transform_example.py # Basic transform usage
│ ├── avx512_performance.py # AVX-512 demo
│ ├── tbl_conversion.py # TAR to TBL conversion
│ └── complete_v110_workflow.py # Complete pipeline example
│
├── python/ # Pure Python utilities
│ ├── webdataset_loader.py # WebDataset compatibility
│ ├── tensorflow_dataloader.py # TensorFlow integration
│ └── jax_dataloader.py # JAX integration
│
└── turboloader/ # Python package
└── __init__.py # Package entry point
3.2 CRITICAL SOURCE FILES - DEEP DIVE
------------------------------------------------------------------------------
FILE: src/pipeline/pipeline.hpp (500+ lines)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
This is the heart of TurboLoader. It implements the UnifiedPipeline class which
orchestrates the entire data loading process.
Key Classes:
1. UnifiedPipelineConfig
Configuration structure containing all pipeline parameters:
- num_workers: Number of worker threads (default: 4)
- batch_size: Samples per batch (default: 32)
- queue_size: Buffer size for each worker (default: 128)
- shuffle: Whether to shuffle samples
- prefetch: Enable double-buffering
- use_gpu_decode: Use nvJPEG for GPU JPEG decoding
- smart_batching: Enable size-aware batching (NEW in v1.7.0)
===== Distributed Training (NEW in v1.7.1) =====
- enable_distributed: Enable multi-node data loading (default: false)
- world_rank: Rank of this process, 0 to world_size-1 (default: 0)
- world_size: Total number of processes (default: 1)
- drop_last: Drop incomplete batches at end (default: false)
- distributed_seed: Seed for shuffling, same across all ranks (default: 42)
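The document does not spell out the sharding mechanics; a common realization
of "deterministic sharding" (and a plausible reading of the options above) is
to shuffle the full index list with the shared distributed_seed on every
rank, then give each rank a strided slice. The `shard_indices` helper below
is hypothetical, not the pipeline.hpp implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical sketch of deterministic sharding: every rank performs the
// SAME seeded shuffle, then keeps entries at positions rank, rank+world_size,
// rank+2*world_size, ... Shards are disjoint and together cover the dataset.
std::vector<std::size_t> shard_indices(std::size_t num_samples,
                                       std::size_t world_rank,
                                       std::size_t world_size,
                                       uint64_t distributed_seed) {
    std::vector<std::size_t> idx(num_samples);
    for (std::size_t i = 0; i < num_samples; ++i) idx[i] = i;
    std::mt19937_64 g(distributed_seed);       // same seed on all ranks
    std::shuffle(idx.begin(), idx.end(), g);   // identical order everywhere
    std::vector<std::size_t> shard;
    for (std::size_t i = world_rank; i < num_samples; i += world_size)
        shard.push_back(idx[i]);               // strided slice for this rank
    return shard;
}
```

Because the shuffle is seeded identically, no inter-node communication is
needed to agree on the epoch's sample order.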
2. TarWorker
Each worker thread runs an instance of this class. Worker lifecycle:
a) INITIALIZATION PHASE:
- Create thread-local JPEGDecoder (or NvJpegDecoder if GPU enabled)
- Initialize memory pools for decoded images
- Set CPU affinity to reduce migration overhead
b) PROCESSING LOOP:
while (running) {
1. Pull TarEntryMetadata from shared queue (lock-free pop)
2. Memory-map the JPEG region (zero-copy via TarReader)
3. Decode JPEG to RGB888:
- If use_gpu_decode && nvJPEG available: GPU decode
- Else: libjpeg-turbo SIMD decode
4. Apply transform pipeline (resize, normalize, augment)
5. Push decoded Sample to output queue
}
c) SHUTDOWN:
- Drain remaining samples from queue
- Clean up decoder resources
- Join thread
3. UnifiedPipeline
The main pipeline class. Usage pattern:
// Create pipeline
UnifiedPipelineConfig config;
config.num_workers = 8;
config.batch_size = 64;
config.use_gpu_decode = true;
UnifiedPipeline pipeline("/path/to/dataset.tar", config);
// Iterate over batches
for (size_t epoch = 0; epoch < num_epochs; ++epoch) {
pipeline.reset(); // Reset for new epoch
while (auto batch = pipeline.next_batch()) {
// batch contains 64 samples
// Process batch (forward/backward pass)
}
}
CRITICAL IMPLEMENTATION DETAILS:
Memory Management:
- Workers allocate Sample objects from thread-local pools
- Samples are moved (not copied) through the pipeline
- RGB data uses std::vector with reserve() to avoid reallocations
Thread Synchronization:
- Worker startup uses condition variable to ensure all threads ready
- Sample queues use lock-free SPSC buffers (one queue per worker)
- Shutdown uses atomic flag checked in hot loop
GPU Decode Path (lines 379-409):
When use_gpu_decode=true and nvJPEG available:
#ifdef HAVE_NVJPEG
if (use_gpu_ && nvjpeg_decoder_) {
NvJpegResult gpu_result;
// Decode on GPU - this is ~10x faster than CPU
nvjpeg_decoder_->decode(jpeg_data, size, gpu_result);
// Result is in GPU memory, copy to CPU
sample.image_data = std::move(gpu_result.rgb_data);
}
#endif
If GPU unavailable or disabled, falls back to CPU:
decoder_->decode_sample(sample); // libjpeg-turbo
FILE: src/decode/jpeg_decoder.hpp (300+ lines)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Implements JPEG decoding using libjpeg-turbo, which uses SIMD intrinsics
(SSE2, AVX2, NEON) for fast decompression.
Key Implementation Details:
1. Thread-Local Decoder State:
Each JPEGDecoder instance is used by exactly one thread. This eliminates
lock contention and allows reusing the jpeg_decompress_struct across
multiple images.
struct jpeg_decompress_struct cinfo_;
struct jpeg_error_mgr jerr_;
bool initialized_;
2. Decode Process:
void decode_sample(Sample& sample) {
// Point libjpeg at input JPEG bytes (no copy)
jpeg_mem_src(&cinfo_, jpeg_data, jpeg_size);
// Read JPEG header
jpeg_read_header(&cinfo_, TRUE);
// Configure output: RGB888 (YCbCr->RGB handled inside libjpeg-turbo)
cinfo_.out_color_space = JCS_RGB;
cinfo_.dct_method = JDCT_IFAST; // Fast integer DCT
// Start decompression
jpeg_start_decompress(&cinfo_);
// Allocate output buffer
size_t row_stride = cinfo_.output_width * 3;
sample.decoded_rgb.resize(cinfo_.output_height * row_stride);
// Read scanlines using SIMD-accelerated path
uint8_t* row_ptr = sample.decoded_rgb.data();
while (cinfo_.output_scanline < cinfo_.output_height) {
jpeg_read_scanlines(&cinfo_, &row_ptr, 1);
row_ptr += row_stride;
}
// Cleanup
jpeg_finish_decompress(&cinfo_);
}
3. Error Handling:
Uses setjmp/longjmp for error recovery. If JPEG is corrupted:
- Log error message
- Skip sample (don't crash entire training run)
- Continue processing next sample
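The recovery pattern, shown here in isolation without the libjpeg specifics
(libjpeg's error_exit callback works the same way): the decode loop setjmp()s
before each sample, the fatal-error handler longjmp()s back, and the loop
simply moves on. `decode_all` and `on_fatal_decode_error` are illustrative
names:

```cpp
#include <cassert>
#include <csetjmp>

// setjmp/longjmp error recovery sketch: a fatal error unwinds back to the
// top of the per-sample loop instead of aborting the process.
static std::jmp_buf decode_env;

void on_fatal_decode_error() {       // stands in for libjpeg's error_exit
    std::longjmp(decode_env, 1);     // unwind back to the setjmp point
}

// Returns the number of samples "decoded" successfully; corrupt[i] marks
// inputs that simulate a broken JPEG.
int decode_all(const bool* corrupt, int n) {
    int ok = 0;
    for (int i = 0; i < n; ++i) {
        if (setjmp(decode_env)) continue;   // longjmp lands here: skip sample
        if (corrupt[i]) on_fatal_decode_error();
        ++ok;                               // decode succeeded
    }
    return ok;
}
```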
4. Performance Optimizations:
- JDCT_IFAST: Fast integer DCT (trades accuracy for speed)
- YCbCr->RGB conversion done inside libjpeg-turbo's SIMD path (no separate pass)
- Reuse decompressor struct (avoid re-initialization overhead)
FILE: src/transforms/resize_transform.hpp (400+ lines)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Implements bilinear and bicubic image resizing using SIMD intrinsics.
Bilinear Interpolation Algorithm:
For each output pixel (x_out, y_out):
1. Map to source coordinates:
x_src = x_out * (src_width / dst_width)
y_src = y_out * (src_height / dst_height)
2. Find four surrounding pixels:
x0 = floor(x_src), y0 = floor(y_src)
x1 = x0 + 1, y1 = y0 + 1 (clamped to the image bounds)
3. Compute interpolation weights:
wx = x_src - x0
wy = y_src - y0
4. Bilinear interpolation:
top = (1-wx) * src[y0,x0] + wx * src[y0,x1]
bot = (1-wx) * src[y1,x0] + wx * src[y1,x1]
result = (1-wy) * top + wy * bot
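The four steps translate directly to scalar C++ for a single-channel float
image (`bilinear_sample` is a reference helper for illustration, not the
library's API):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference for the bilinear interpolation steps above, sampling a
// single-channel float image at fractional coordinates (x_src, y_src).
float bilinear_sample(const std::vector<float>& src,
                      std::size_t src_w, std::size_t src_h,
                      float x_src, float y_src) {
    std::size_t x0 = static_cast<std::size_t>(x_src);
    std::size_t y0 = static_cast<std::size_t>(y_src);
    std::size_t x1 = std::min(x0 + 1, src_w - 1);   // clamp at the edge
    std::size_t y1 = std::min(y0 + 1, src_h - 1);
    float wx = x_src - x0, wy = y_src - y0;         // interpolation weights
    float top = (1 - wx) * src[y0 * src_w + x0] + wx * src[y0 * src_w + x1];
    float bot = (1 - wx) * src[y1 * src_w + x0] + wx * src[y1 * src_w + x1];
    return (1 - wy) * top + wy * bot;
}
```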
SIMD Optimization:
The inner loop processes 8 pixels simultaneously using AVX2:
__m256 weight_x = _mm256_set1_ps(wx);
__m256 weight_y = _mm256_set1_ps(wy);
// Load 8 pixels from each corner (addresses are not 32-byte aligned,
// so unaligned loads are required)
__m256 top_left = _mm256_loadu_ps(src + y0*stride + x0);
__m256 top_right = _mm256_loadu_ps(src + y0*stride + x1);
__m256 bot_left = _mm256_loadu_ps(src + y1*stride + x0);
__m256 bot_right = _mm256_loadu_ps(src + y1*stride + x1);
// Horizontal interpolation
__m256 top = _mm256_add_ps(
_mm256_mul_ps(_mm256_sub_ps(one, weight_x), top_left),
_mm256_mul_ps(weight_x, top_right)
);
__m256 bot = _mm256_add_ps(
_mm256_mul_ps(_mm256_sub_ps(one, weight_x), bot_left),
_mm256_mul_ps(weight_x, bot_right)
);
// Vertical interpolation
__m256 result = _mm256_add_ps(
_mm256_mul_ps(_mm256_sub_ps(one, weight_y), top),
_mm256_mul_ps(weight_y, bot)
);
_mm256_storeu_ps(dst + y_out*out_stride + x_out, result);
This achieves ~8x speedup over scalar code on AVX2 processors.
FILE: src/readers/tar_reader.hpp (250+ lines)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Implements memory-mapped TAR file reading with zero-copy access.
TAR Format Background:
TAR files consist of a sequence of:
- 512-byte header (filename, size, permissions, etc.)
- File data (padded to 512-byte boundary)
- Repeat
TarReader Implementation:
1. Memory Mapping:
int fd = open(tar_path, O_RDONLY);
struct stat st;
fstat(fd, &st);
// Map entire TAR file into process address space
void* mapped = mmap(NULL, st.st_size, PROT_READ,
MAP_PRIVATE | MAP_POPULATE, fd, 0);
// Advise kernel about access pattern
madvise(mapped, st.st_size, MADV_SEQUENTIAL);
Benefits:
- No read() system calls (kernel handles page faults)
- Zero-copy: JPEG data accessed directly from mapped memory
- Kernel automatically manages page cache
- MAP_POPULATE pre-faults pages for predictable latency (Linux-only flag)
2. TAR Header Parsing:
struct TarHeader {
char name[100];
char mode[8];
char uid[8];
char gid[8];
char size[12]; // Octal string!
char mtime[12];
char checksum[8];
char typeflag;
// ... more fields
};
The size field is in OCTAL (legacy Unix format):
size_t parse_octal(const char* str, size_t len) {
size_t result = 0;
for (size_t i = 0; i < len && str[i]; ++i) {
if (str[i] < '0' || str[i] > '7') break;
result = result * 8 + (str[i] - '0');
}
return result;
}
3. Entry Indexing:
On initialization, TarReader scans entire TAR file and builds an index:
std::vector<TarEntryMetadata> entries_;
struct TarEntryMetadata {
std::string filename;
size_t offset; // Byte offset in mapped region
size_t size; // File size in bytes
const uint8_t* data; // Pointer into mmap'd region (zero-copy!)
};
This index allows O(1) access to any file in the TAR.
4. Shuffling Support:
If shuffling enabled:
std::random_device rd;
std::mt19937 g(rd());
std::shuffle(entries_.begin(), entries_.end(), g);
This shuffles the entry vector, not the file data itself (efficient).
================================================================================
SECTION 4: BUILD SYSTEM COMPREHENSIVE GUIDE
================================================================================
4.1 CMAKE CONFIGURATION
------------------------------------------------------------------------------
TurboLoader uses CMake 3.15+ as its build system. The build is configured to
detect available dependencies and enable features conditionally.
Root CMakeLists.txt Structure:
1. Project Setup:
cmake_minimum_required(VERSION 3.15)
project(TurboLoader VERSION 1.7.1 LANGUAGES CXX)
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
C++20 is required for:
- std::span (zero-copy array views)
- Concepts (for template constraints)
- <bit> header (bit_cast for type punning)
2. Compiler Flags:
if (CMAKE_CXX_COMPILER_ID MATCHES "GNU|Clang")
add_compile_options(
-O3 # Aggressive optimization
-march=native # Use all available CPU instructions
-ffast-math # Fast floating-point (trade precision)
-Wall -Wextra # Enable warnings
-Wno-unused-parameter # Suppress unused param warnings
)
endif()
-march=native is critical: enables AVX2/AVX-512 on Intel, NEON on ARM.
3. Dependency Detection:
# libjpeg-turbo (required)
find_package(JPEG REQUIRED)
# libpng (required)
find_package(PNG REQUIRED)
# libwebp (required)
find_package(WebP REQUIRED)
# CURL (required for HTTP/S3/GCS readers)
find_package(CURL REQUIRED)
# nvJPEG (optional - GPU JPEG decode)
check_language(CUDA)
if (CMAKE_CUDA_COMPILER)
enable_language(CUDA)
find_library(NVJPEG_LIBRARY nvjpeg)
if (NVJPEG_LIBRARY)
add_definitions(-DHAVE_NVJPEG)
endif()
endif()
# io_uring (optional - Linux async I/O)
if (CMAKE_SYSTEM_NAME STREQUAL "Linux")
find_library(URING_LIBRARY uring)
if (URING_LIBRARY)
add_definitions(-DHAVE_IOURING)
endif()
endif()
4. Library Target:
add_library(turboloader INTERFACE)
target_include_directories(turboloader INTERFACE src)
target_link_libraries(turboloader INTERFACE
${JPEG_LIBRARIES}
${PNG_LIBRARIES}
${WEBP_LIBRARIES}
CURL::libcurl
)
INTERFACE library means turboloader is header-only for C++ users.
Python bindings compile to a shared library (_turboloader.so).
5. Python Bindings:
find_package(pybind11 REQUIRED)
pybind11_add_module(_turboloader
src/python/turboloader_bindings.cpp
)
target_link_libraries(_turboloader PRIVATE turboloader)
This creates _turboloader.cpython-313-darwin.so (on macOS) or
_turboloader.cpython-313-x86_64-linux-gnu.so (on Linux).
4.2 DEPENDENCY MANAGEMENT
------------------------------------------------------------------------------
REQUIRED DEPENDENCIES:
1. libjpeg-turbo
Purpose: SIMD-accelerated JPEG decoding
Install: brew install jpeg-turbo (macOS)
apt install libjpeg-turbo8-dev (Ubuntu)
Version: 2.0+
Why libjpeg-turbo vs libjpeg?
- 2-6x faster decoding via SIMD intrinsics
- Compatible with libjpeg API
- Widely used in production (Chrome, Android)
2. libpng
Purpose: PNG image decoding
Install: brew install libpng (macOS)
apt install libpng-dev (Ubuntu)
Version: 1.6+
3. libwebp
Purpose: WebP image decoding (modern format, better compression)
Install: brew install webp (macOS)
apt install libwebp-dev (Ubuntu)
Version: 1.0+
4. libcurl
Purpose: HTTP/HTTPS support for remote data loading
Install: brew install curl (macOS)
apt install libcurl4-openssl-dev (Ubuntu)
Version: 7.50+
5. pybind11
Purpose: C++ <-> Python bindings
Install: pip install pybind11
Version: 2.10+
OPTIONAL DEPENDENCIES:
6. NVIDIA CUDA + nvJPEG
Purpose: GPU-accelerated JPEG decoding (10x speedup)
Install: CUDA Toolkit 11.0+
Platforms: Linux, Windows (not macOS)
Detection: CMake checks for CUDA compiler and nvjpeg library
If found: Compiles with -DHAVE_NVJPEG
If not found: Falls back to CPU decode (graceful degradation)
7. liburing
Purpose: Linux io_uring for async I/O
Install: apt install liburing-dev (Ubuntu 20.04+)
Kernel: Linux 5.1+
Detection: CMake checks for liburing
If found: Compiles with -DHAVE_IOURING
If not found: Uses standard I/O
8. FFmpeg
Purpose: Video decoding support
Install: brew install ffmpeg (macOS)
apt install libavcodec-dev libavformat-dev (Ubuntu)
Optional: Only needed if loading video datasets
9. Apache Arrow
Purpose: Parquet file support (columnar data)
Install: brew install apache-arrow (macOS)
apt install libarrow-dev libparquet-dev (Ubuntu)
Optional: Only needed for Parquet datasets
================================================================================
SECTION 5: STEP-BY-STEP INSTALLATION
================================================================================
5.1 FULL INSTALLATION FROM SOURCE (macOS)
------------------------------------------------------------------------------
This section provides a complete, step-by-step guide for building TurboLoader
from source on macOS. We'll build both the C++ library and Python bindings.
STEP 1: Install Homebrew (if not already installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
STEP 2: Install System Dependencies
# Core dependencies
brew install cmake # Build system (3.15+)
brew install python@3.13 # Python runtime
brew install jpeg-turbo # SIMD JPEG decoding
brew install libpng # PNG decoding
brew install webp # WebP decoding
brew install curl # HTTP support
# Optional dependencies
brew install ffmpeg # Video decoding (optional)
brew install apache-arrow # Parquet support (optional)
Verify installations:
cmake --version # Should show 3.15+
python3.13 --version # Should show 3.13.x
brew list jpeg-turbo # Confirm jpeg-turbo installed
brew list libpng # Confirm libpng installed
brew list webp # Confirm webp installed
STEP 3: Install Python Dependencies
# Create virtual environment (recommended)
python3.13 -m venv ~/turboloader-env
source ~/turboloader-env/bin/activate
# Install build dependencies
pip install --upgrade pip
pip install pybind11>=2.10.0
pip install cmake>=3.15
pip install setuptools>=45
pip install wheel
pip install build
# Install runtime dependencies
pip install numpy>=1.19.0
pip install torch>=1.8.0 # PyTorch integration
# Optional: Install benchmarking/testing tools
pip install Pillow>=8.0.0
pip install tqdm>=4.60.0
pip install matplotlib>=3.3.0
pip install psutil>=5.8.0
STEP 4: Clone Repository
cd ~/
git clone https://github.com/ALJainProjects/TurboLoader.git
cd TurboLoader
Verify directory structure:
ls -la src/ # Should see core/, decode/, transforms/, etc.
ls -la tests/ # Should see test_*.cpp files
ls -la CMakeLists.txt # Root CMake config should exist
STEP 5: Build C++ Library and Tests
# Create build directory
mkdir -p build
cd build
# Configure with CMake
cmake -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CXX_COMPILER=clang++ \
-DCMAKE_CXX_FLAGS="-O3 -march=native" \
..
# Compile (use all CPU cores)
make -j$(sysctl -n hw.ncpu)
Expected output:
-- Detecting dependencies...
-- libjpeg-turbo: /opt/homebrew/opt/jpeg-turbo/include
-- libpng: /opt/homebrew/opt/libpng/include
-- libwebp: /opt/homebrew/opt/webp/include
-- libcurl: /Library/Developer/CommandLineTools/SDKs/MacOSX15.sdk/usr/include
...
[100%] Built target turboloader
This creates:
- build/tests/test_* (C++ test executables)
- Compiled object files and libraries
STEP 6: Run C++ Test Suite
# Run all tests via CTest
cd ~/TurboLoader/build
ctest --output-on-failure -j8
Expected output:
Test project ~/TurboLoader/build
Start 1: tar_reader
1/17 Test #1: tar_reader ....................... Passed 0.05 sec
Start 2: image_decoder
2/17 Test #2: image_decoder .................... Passed 0.12 sec
Start 3: http_reader
3/17 Test #3: http_reader ...................... Passed 1.23 sec
...
100% tests passed, 0 tests failed out of 17
If any tests fail, check the specific test output for details.
STEP 7: Build Python Package
# Return to root directory
cd ~/TurboLoader
# Build wheel
python3.13 -m build
Expected output:
* Creating isolated environment: venv+pip...
* Installing packages in isolated environment:
- cmake>=3.15
- pybind11>=2.10.0
- setuptools>=45
- wheel
...
Successfully built turboloader-1.7.1.tar.gz and
turboloader-1.7.1-cp313-cp313-macosx_15_0_arm64.whl
This creates:
- dist/turboloader-1.7.1.tar.gz (source distribution)
- dist/turboloader-1.7.1-cp313-cp313-macosx_15_0_arm64.whl (wheel)
STEP 8: Install Python Package
# Install from built wheel
pip install dist/turboloader-1.7.1-cp313-cp313-macosx_15_0_arm64.whl
Or install in development mode (editable):
pip install -e .
Verify installation:
python3.13 -c "import turboloader; print(turboloader.__version__)"
# Should output: 1.7.1
STEP 9: Verify Installation
# Test Python import
python3.13 -c "
import turboloader
import numpy as np
print(f'TurboLoader version: {turboloader.__version__}')
print('Successfully imported TurboLoader!')
"
Expected output:
TurboLoader version: 1.7.1
Successfully imported TurboLoader!
5.2 FULL INSTALLATION FROM SOURCE (Ubuntu/Linux)
------------------------------------------------------------------------------
STEP 1: Install System Dependencies
# Update package list
sudo apt update
# Core dependencies
sudo apt install -y build-essential cmake git
sudo apt install -y python3-dev python3-pip python3-venv
sudo apt install -y libjpeg-turbo8-dev
sudo apt install -y libpng-dev
sudo apt install -y libwebp-dev
sudo apt install -y libcurl4-openssl-dev
# Optional: io_uring support (Linux 5.1+ kernel)
sudo apt install -y liburing-dev
# Optional: FFmpeg video support
sudo apt install -y libavcodec-dev libavformat-dev libavutil-dev libswscale-dev
# Optional: Parquet support
sudo apt install -y libarrow-dev libparquet-dev