
Draft: Miles30/nikoli fixes#35

Open
PatrickRMiles wants to merge 9 commits into LBANN:main from PatrickRMiles:miles30/nikoli_fixes

Conversation

@PatrickRMiles
Collaborator

Summary

This PR applies most of Nikoli's feedback and adds further updates: it improves training-step behavior, data-loading performance, dataset-format efficiency, and warmup/profiling visibility.

Changes

  • Switched optimizer updates from once per epoch to once per batch in the training loop.
  • Corrected AMP gradient handling so gradients are unscaled before clip_grad_norm_.
  • Extracted training lifecycle phases so prepare_training, warmup, and main train run as separate regions in traces.
  • Changed warmup from epoch-based to batch-based with new warmup_batches support, defaulting to 5 batches per rank.
  • Updated warmup to better match real training:
    • runs in model.train()
    • follows the same DDP + DistConv tensor distribution path as the main training loop
    • performs backward passes without optimizer steps
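The AMP ordering fix above (unscale before `clip_grad_norm_`, step once per batch) can be sketched roughly as follows. This is a minimal illustration, not the PR's actual training loop; the `model`, `opt`, and `scaler` names are placeholders, and the scaler is disabled so the sketch runs on CPU while keeping the same call order:

```python
import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
# GradScaler disabled so this runs without a GPU; the call order is unchanged.
scaler = torch.cuda.amp.GradScaler(enabled=False)

x = torch.randn(8, 4)
loss = model(x).sum()

scaler.scale(loss).backward()
scaler.unscale_(opt)  # unscale gradients BEFORE clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(opt)      # optimizer step once per batch, not once per epoch
scaler.update()
```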

Data Pipeline

  • Added configurable dataloader_num_workers in config and CLI.
  • Enabled persistent_workers and prefetch_factor when worker processes are used.
  • Optimized generated dataset format:
    • images now save in final float32 CDHW layout
    • masks now save in final int64 training dtype
  • Added dual-format dataset loading:
    • fast path for new optimized datasets
    • fallback path for legacy datasets that still need transpose/remap work
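The loader flags above can be wired up roughly like this. A minimal sketch only: the dataset, batch size, and `num_workers` value are placeholders standing in for the new `dataloader_num_workers` config option, not the PR's actual pipeline:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder standing in for the configurable dataloader_num_workers option.
num_workers = 2
ds = TensorDataset(torch.randn(16, 1, 4, 4, 4), torch.zeros(16, dtype=torch.int64))

loader = DataLoader(
    ds,
    batch_size=4,
    num_workers=num_workers,
    # Both options are only valid when worker processes are used:
    persistent_workers=num_workers > 0,
    prefetch_factor=2 if num_workers > 0 else None,
)
```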

Dataset Reuse

  • Added dataset_format_version metadata for generated datasets.
  • Included dataset_format_version in dataset-reuse validation and in the hashed dataset identity, so old and new dataset formats never share a cache key.
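Folding the format version into the hashed identity could look roughly like this. The helper name and parameter dict are hypothetical, not the PR's actual code:

```python
import hashlib
import json

def dataset_cache_key(params: dict, dataset_format_version: int) -> str:
    # Mixing the format version into the hashed identity guarantees that
    # old-format and new-format datasets never collide on a cache key.
    payload = dict(params, dataset_format_version=dataset_format_version)
    blob = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

old_key = dataset_cache_key({"size": 64}, dataset_format_version=1)
new_key = dataset_cache_key({"size": 64}, dataset_format_version=2)
```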

Config / CLI Additions

  • dataloader_num_workers
  • warmup_batches

@michaelmckinsey1 self-requested a review on March 26, 2026 at 16:40
return {
-    "image": torch.as_tensor(img.copy()).float().contiguous(),
-    "mask": torch.as_tensor(mask.copy()).long().contiguous(),
+    "image": torch.from_numpy(img),
Collaborator Author


Do we need .contiguous() here even though volumegen now saves images/masks in contiguous format?

Collaborator

@michaelmckinsey1 Mar 26, 2026


@ndryden Do we still need .float().contiguous()?

Collaborator

@michaelmckinsey1 Mar 26, 2026


We need:

    torch.from_numpy(img).contiguous().float()
    torch.from_numpy(mask).contiguous().long()
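For context, a quick demonstration of why the legacy path can still need `.contiguous()` while the optimized format does not: `torch.from_numpy` is zero-copy, so it inherits the NumPy array's memory layout. The array shapes here are illustrative, not taken from volumegen:

```python
import numpy as np
import torch

img = np.random.rand(2, 4, 4, 4).astype(np.float32)  # already saved in CDHW layout
legacy = img.transpose(1, 0, 2, 3)                    # legacy data requiring a transpose

t_fast = torch.from_numpy(img)       # zero-copy, already contiguous
t_legacy = torch.from_numpy(legacy)  # zero-copy view of the transpose, NOT contiguous
```

So the fast path for the new format can skip the copy, while legacy datasets still need `.contiguous()` after the transpose/remap work.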

