
Draft: Miles30/nikoli fixes#35

Open
PatrickRMiles wants to merge 9 commits into LBANN:main from PatrickRMiles:miles30/nikoli_fixes

Conversation

@PatrickRMiles
Collaborator

Summary

This PR applies most of Nikoli's feedback and adds further updates: it improves training-step behavior, data-loading performance, dataset-format efficiency, and warmup/profiling visibility.

Changes

  • Switched optimizer updates from once per epoch to once per batch in the training loop.
  • Corrected AMP gradient handling so gradients are unscaled before clip_grad_norm_.
  • Extracted training lifecycle phases so prepare_training, warmup, and main train run as separate regions in traces.
  • Changed warmup from epoch-based to batch-based with new warmup_batches support, defaulting to 5 batches per rank.
  • Updated warmup to better match real training:
    • runs in model.train()
    • follows the same DDP + DistConv tensor distribution path as the main training loop
    • performs backward passes without optimizer steps
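The AMP ordering fix above (unscale before `clip_grad_norm_`, step once per batch) can be sketched roughly as follows. This is a minimal illustration, not the PR's actual training loop; the `model`, `opt`, and `scaler` names are placeholders, and the scaler is disabled so the sketch runs on CPU while keeping the same call order:

```python
import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
# GradScaler disabled so this runs without a GPU; the call order is unchanged.
scaler = torch.cuda.amp.GradScaler(enabled=False)

x = torch.randn(8, 4)
loss = model(x).sum()

scaler.scale(loss).backward()
scaler.unscale_(opt)  # unscale gradients BEFORE clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(opt)      # optimizer step once per batch, not once per epoch
scaler.update()
```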

Data Pipeline

  • Added configurable dataloader_num_workers in config and CLI.
  • Enabled persistent_workers and prefetch_factor when worker processes are used.
  • Optimized generated dataset format:
    • images now save in final float32 CDHW layout
    • masks now save in final int64 training dtype
  • Added dual-format dataset loading:
    • fast path for new optimized datasets
    • fallback path for legacy datasets that still need transpose/remap work
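The loader flags above can be wired up roughly like this. A minimal sketch only: the dataset, batch size, and `num_workers` value are placeholders standing in for the new `dataloader_num_workers` config option, not the PR's actual pipeline:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder standing in for the configurable dataloader_num_workers option.
num_workers = 2
ds = TensorDataset(torch.randn(16, 1, 4, 4, 4), torch.zeros(16, dtype=torch.int64))

loader = DataLoader(
    ds,
    batch_size=4,
    num_workers=num_workers,
    # Both options are only valid when worker processes are used:
    persistent_workers=num_workers > 0,
    prefetch_factor=2 if num_workers > 0 else None,
)
```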

Dataset Reuse

  • Added dataset_format_version metadata for generated datasets.
  • Included dataset_format_version in dataset-reuse validation and in the hashed dataset identity, so old and new dataset formats never share a cache key.
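Folding the format version into the hashed identity could look roughly like this. The helper name and parameter dict are hypothetical, not the PR's actual code:

```python
import hashlib
import json

def dataset_cache_key(params: dict, dataset_format_version: int) -> str:
    # Mixing the format version into the hashed identity guarantees that
    # old-format and new-format datasets never collide on a cache key.
    payload = dict(params, dataset_format_version=dataset_format_version)
    blob = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

old_key = dataset_cache_key({"size": 64}, dataset_format_version=1)
new_key = dataset_cache_key({"size": 64}, dataset_format_version=2)
```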

Config / CLI Additions

  • dataloader_num_workers
  • warmup_batches

@michaelmckinsey1 self-requested a review on March 26, 2026 at 16:40
return {
-    "image": torch.as_tensor(img.copy()).float().contiguous(),
-    "mask": torch.as_tensor(mask.copy()).long().contiguous(),
+    "image": torch.from_numpy(img),
Collaborator Author


Do we need .contiguous() here even though volumegen now saves images/masks in contiguous format?

Collaborator

@michaelmckinsey1 Mar 26, 2026


@ndryden Do we still need .float().contiguous()?

Collaborator

@michaelmckinsey1 Mar 26, 2026


We need:

    torch.from_numpy(img).contiguous().float()
    torch.from_numpy(mask).contiguous().long()
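For context, a quick demonstration of why the legacy path can still need `.contiguous()` while the optimized format does not: `torch.from_numpy` is zero-copy, so it inherits the NumPy array's memory layout. The array shapes here are illustrative, not taken from volumegen:

```python
import numpy as np
import torch

img = np.random.rand(2, 4, 4, 4).astype(np.float32)  # already saved in CDHW layout
legacy = img.transpose(1, 0, 2, 3)                    # legacy data requiring a transpose

t_fast = torch.from_numpy(img)       # zero-copy, already contiguous
t_legacy = torch.from_numpy(legacy)  # zero-copy view of the transpose, NOT contiguous
```

So the fast path for the new format can skip the copy, while legacy datasets still need `.contiguous()` after the transpose/remap work.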

