
Feat/gpu augmentations #2944

Closed
etienne87 wants to merge 10 commits into MIC-DKFZ:master from etienne87:feat/gpu_augmentations

Conversation

@etienne87

In an effort toward GPU augmentation (see #2911 (comment)), I add an option "gpu_augmentation" and a new iterator called "ThreadedGPUAugmenter" that applies data augmentation after transfer to the GPU.
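For readers of this thread, here is a minimal sketch of the idea behind such an iterator. Only the class name comes from the PR; the constructor signature, the `gpu_transform` callable, and the dict-of-tensors batch format are my assumptions, not the PR's actual code.

```python
# Hedged sketch: prefetch batches in a background thread, move each batch to
# the GPU first, and apply the augmentation pipeline on-device afterwards.
# Assumes the wrapped loader yields dicts of CPU torch tensors.
import queue
import threading

import torch


class ThreadedGPUAugmenter:
    def __init__(self, data_loader, gpu_transform, device="cuda", max_prefetch=2):
        self.data_loader = data_loader
        self.gpu_transform = gpu_transform  # callable: dict of GPU tensors -> dict
        self.device = torch.device(device)
        self.queue = queue.Queue(maxsize=max_prefetch)
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        for batch in self.data_loader:
            # transfer first, augment on the GPU afterwards
            on_gpu = {k: (v.to(self.device, non_blocking=True)
                          if torch.is_tensor(v) else v)
                      for k, v in batch.items()}
            self.queue.put(self.gpu_transform(on_gpu))
        self.queue.put(None)  # sentinel: underlying loader is exhausted

    def __iter__(self):
        return self

    def __next__(self):
        item = self.queue.get()
        if item is None:
            raise StopIteration
        return item
```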

@FabianIsensee
Member

Dear Etienne,
thanks for this PR! The current data augmentation pipeline is not GPU compatible. I think this would require major additional effort for limited benefit, as one usually has sufficient CPU power to run data augmentation in the background, and even when that is not the case, the SegOrd0 variants of the trainers can cut down on CPU utilization. Can you please outline situations in which this approach offers substantial advantages over the current workflow, given that integrating it would require changes to batchgenerators to make everything GPU compatible and would increase overall complexity?
Thanks a lot!
Best,
Fabian

@etienne87
Author

etienne87 commented Apr 15, 2026

Hello Fabian!

This MR is indeed a big change, probably not worth it in the nominal case. I see improvements when the CPU is busy with other trainings: people crowd onto one server, everybody uses the CPU for augmentation, and each GPU starves for data.

In our team everybody uses nnUNetv2, which already switched to nearest neighbor, if I am not mistaken?

At the same time, this GPU-only solution introduces its own caveats, as you mentioned in the other MR:

  • Some GPU time is now dedicated to data augmentation.
  • It requires the introduction of this new class, because multiprocessing seems to fail when patches are too big to be shared via the Queue (since I rotate on the GPU, raw patches need to be extracted larger to avoid padding values; see the sketch below).
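To make the patch-size point concrete, here is a small illustration of my own (not code from the PR) of how much larger the raw crop must be so that an on-device rotation produces no padded voxels:

```python
# Hedged sketch: a patch rotated by angle theta only stays free of padding
# values if the source crop is enlarged so the rotated patch fits inside it.
# 2D reasoning applied per axis; a conservative estimate, not nnU-Net's code.
import numpy as np

def enlarged_patch_size(final_size, max_rot_rad):
    """Per-axis size of the raw crop needed so that rotating by up to
    max_rot_rad leaves no out-of-bounds (padded) voxels in the final patch."""
    final_size = np.asarray(final_size, dtype=float)
    factor = abs(np.cos(max_rot_rad)) + abs(np.sin(max_rot_rad))
    return np.ceil(final_size * factor).astype(int)

print(enlarged_patch_size((128, 128, 128), np.deg2rad(30)))
# -> [175 175 175]: about 2.6x the voxel count of the final patch,
# which is the kind of payload that can choke the multiprocessing Queue.
```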

Realistically, I agree that other, simpler solutions can fix our problems:

  • accelerating grid_sample on the CPU (the PyTorch C++ CPU implementation of grid_sample seems to use only one thread for all voxels);
  • using cheaper interpolation, as you say (see the sketch after this list).
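For the cheaper-interpolation option, here is a hedged sketch using PyTorch's grid_sample: the image is resampled (tri)linearly while the segmentation uses mode="nearest". This is conceptually what the SegOrd0 trainer variants do; the exact nnU-Net internals may differ.

```python
import torch
import torch.nn.functional as F

image = torch.randn(1, 1, 64, 64, 64)
seg = torch.randint(0, 3, (1, 1, 64, 64, 64)).float()

# identity affine as a stand-in for an arbitrary spatial augmentation grid
theta = torch.eye(3, 4).unsqueeze(0)  # (N, 3, 4) affine matrix
grid = F.affine_grid(theta, image.shape, align_corners=False)

image_aug = F.grid_sample(image, grid, mode="bilinear", align_corners=False)
seg_aug = F.grid_sample(seg, grid, mode="nearest", align_corners=False)
# "bilinear" on 5D inputs is trilinear; "nearest" skips the expensive
# interpolation for the label map and keeps its values integral.
```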

Anyway, it has been some time since I worked on this topic, but if it helps, I can come back with raw numbers for some specific training configurations.

@FabianIsensee
Member

Hey Etienne,
it's similar on our cluster: nnU-Net-style workloads (nnU-Net + its variants) make up a huge fraction of jobs. But we don't encounter CPU issues, and when we do, we simply switch to nearest-neighbor interpolation for the segmentation augmentation (nnUNetTrainer_DASegOrd0) to address that. I would very much prefer to keep things as they are, since switching to GPU augmentation would have only marginal benefits for a small group of users while increasing complexity by a lot.
Best,
Fabian

@etienne87
Author

OK, agreed, closing for now.

etienne87 closed this Apr 16, 2026
