
Phase 2: Distributed Training Correctness #3

Merged
ALJainProjects merged 2 commits into main from phase2/distributed-training-fixes on Feb 9, 2026
Conversation

@ALJainProjects
Owner

Summary

  • Replace the biased custom shuffle with a proper Fisher-Yates shuffle using std::mt19937 in DistributedSampler (first sketch below)
  • Upgrade HashBasedSharding::hash_index() from weak FNV-1a to splitmix64 for uniform shard distribution across GPUs (second sketch below)
  • Fix checkpoint resume for large shuffled datasets: regenerate the shuffle from seed+epoch instead of storing the full permutation (third sketch below)
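
A minimal sketch of the shuffle fix, assuming the sampler derives its RNG state from seed+epoch; the helper name `make_epoch_indices` and the seeding scheme are illustrative, not the PR's exact DistributedSampler API:

```cpp
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Sketch: unbiased Fisher-Yates shuffle seeded per epoch.
// `make_epoch_indices` is a hypothetical helper; seed+epoch
// combination shown here is an assumption.
std::vector<uint64_t> make_epoch_indices(uint64_t num_samples,
                                         uint64_t seed, uint64_t epoch) {
    std::vector<uint64_t> indices(num_samples);
    std::iota(indices.begin(), indices.end(), uint64_t{0});

    std::mt19937 rng(static_cast<uint32_t>(seed + epoch));
    // Classic Fisher-Yates: swap position i-1 with a uniformly chosen
    // position in [0, i-1]. Every permutation is equally likely, unlike
    // ad-hoc swap loops that bias toward certain orderings.
    for (uint64_t i = num_samples; i > 1; --i) {
        std::uniform_int_distribution<uint64_t> dist(0, i - 1);
        std::swap(indices[i - 1], indices[dist(rng)]);
    }
    return indices;
}
```

Because the permutation is a pure function of (seed, epoch), every rank can regenerate the identical order independently; nothing has to be stored or broadcast.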
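For the sharding change, splitmix64's finalizer avalanches all 64 input bits, so consecutive dataset indices spread evenly across ranks; byte-oriented FNV-1a mixes small integer keys poorly by comparison. A sketch, where `shard_for_index` is a hypothetical wrapper standing in for HashBasedSharding::hash_index():

```cpp
#include <cstdint>

// splitmix64 finalizer with the standard constants.
uint64_t splitmix64(uint64_t x) {
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

// Hypothetical wrapper mirroring the hash-based shard assignment.
uint32_t shard_for_index(uint64_t index, uint32_t world_size) {
    return static_cast<uint32_t>(splitmix64(index) % world_size);
}
```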
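And a sketch of the checkpoint-resume idea, reusing `make_epoch_indices` from the first sketch; the struct and field names are illustrative, not the PR's actual serialization format:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Persist only what is needed to reproduce the epoch's order.
struct SamplerCheckpoint {
    uint64_t seed;
    uint64_t epoch;
    uint64_t cursor;  // samples already consumed in this epoch
};

// Rebuild the remaining order instead of loading a stored permutation.
std::vector<uint64_t> resume(const SamplerCheckpoint& ckpt,
                             uint64_t num_samples) {
    // Deterministic regeneration: same seed+epoch yields the same
    // permutation, so dropping the first `cursor` entries resumes
    // exactly where training stopped, with O(1) checkpoint size.
    auto order = make_epoch_indices(num_samples, ckpt.seed, ckpt.epoch);
    order.erase(order.begin(),
                order.begin() + static_cast<std::ptrdiff_t>(ckpt.cursor));
    return order;
}
```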

Test plan

  • Added hash balance test: 10000 samples across 8 ranks, each rank's count within ±5% of the expected share (sketch after this list)
  • Added Fisher-Yates permutation test: no duplicates in shuffled output
  • Added epoch determinism tests: same seed+epoch → same order, different epochs → different order
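
A sketch of what the balance check could look like, reusing `shard_for_index` from the sharding sketch above; the test-harness names are hypothetical and the PR's tests may use a different framework:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical balance check mirroring the PR's stated tolerance:
// 10000 indices over 8 ranks, each rank within ±5% of the expected
// 1250 samples.
void check_hash_balance() {
    constexpr uint64_t kSamples = 10000;
    constexpr uint32_t kRanks = 8;
    std::vector<uint64_t> counts(kRanks, 0);
    for (uint64_t i = 0; i < kSamples; ++i) {
        ++counts[shard_for_index(i, kRanks)];
    }
    const double expected = static_cast<double>(kSamples) / kRanks;
    for (uint64_t c : counts) {
        assert(std::abs(static_cast<double>(c) - expected)
               <= 0.05 * expected);
    }
}
```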

…resume

- Replace biased shuffle with proper Fisher-Yates using std::mt19937
- Upgrade hash_index() from weak FNV-1a to splitmix64 for uniform shard distribution
- Fix checkpoint resume for large shuffled datasets via seed+epoch regeneration
- Add tests for hash balance, shuffle permutation correctness, and epoch determinism
The shuffle tests reference turboloader::distributed::DistributedConfig
and DistributedSampler, which live in distributed_dataloader.hpp, not
sharding_strategies.hpp.
@ALJainProjects ALJainProjects merged commit 491d46a into main Feb 9, 2026
7 checks passed
@ALJainProjects ALJainProjects deleted the phase2/distributed-training-fixes branch February 9, 2026 00:51