Add ability to shuffle (and reshuffle) batches #170
arbennett wants to merge 2 commits into xarray-contrib:main
I've been using this long enough in my own work that I think it's behaving as intended. If the code/approach looks good, I would be happy to add some tests.
Going on a bit of a tangent, but continuing on from #176 (comment), have you tried
I'll give it a shot! Apparently I need to dig into the torchdata docs a bit more closely 😅
I tried using the built-in torchdata shuffler and, at least for subsampling from a large zarr file, it is extremely slow. The method implemented here is much faster and more lightweight.
Hmm yes, that's what I expected. You could change the

What you've done in this xbatcher PR is essentially shuffling of the indexes (lightweight on RAM). With torchdata's Shuffler, you would be shuffling the arrays (heavy on RAM), unless you find a way to get in between the slicing and batching part. This sort of ties in to my proposal at #172 on decomposing xbatcher into a
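To make the distinction between shuffling indexes and shuffling arrays concrete, here is a minimal sketch. It is not code from this PR or from torchdata. The in-memory `data` array stands in for a large zarr-backed array, and `batch_slices` and `iter_shuffled_batches` are hypothetical names chosen for illustration. Only the small list of slice positions is permuted, and each batch is read when it is requested.

```python
import numpy as np

# Stand-in for a large array; in practice this could be lazily loaded from zarr (assumption).
data = np.arange(1_000).reshape(100, 10)

batch_size = 4
# One slice per batch: cheap metadata describing where each batch lives.
batch_slices = [
    slice(start, start + batch_size)
    for start in range(0, data.shape[0], batch_size)
]

# Shuffle the order in which slices are visited, not the underlying data.
rng = np.random.default_rng(seed=0)
order = rng.permutation(len(batch_slices))


def iter_shuffled_batches(array, slices, order):
    """Yield batches in permuted order, indexing the array one batch at a time."""
    for i in order:
        yield array[slices[i]]


batches = list(iter_shuffled_batches(data, batch_slices, order))
```

The memory cost of the shuffle itself scales with the number of batches rather than with the size of the array, which is the property described above.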
Description of proposed changes
This relatively simple addition just adds the `shuffle` flag and `reshuffle` method to allow for randomizing the ordering of batches. This can be useful to reduce the effect of auto-correlation between samples that are nearby in space/time. The way I've implemented it is to simply preemptively turn the `patch_selectors` into a list, which might not be optimal. But, in my testing, these are usually explicitly loaded at some point before the batch generator is iterated over anyhow, so hopefully that's not a huge blocker.

Fixes # <--- I thought there used to be an issue around this, but I was unsuccessful in finding it. I'll update this if someone links the relevant issue.
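The general shape of the approach described above can be sketched as follows. This is a self-contained illustration, not the code in this PR: the class `ShuffledBatches`, its constructor arguments, and the `"sample"` dimension name are all hypothetical, and only the names `shuffle`, `reshuffle`, and `patch_selectors` come from the description.

```python
import random
from typing import Dict, Iterator, List, Optional


class ShuffledBatches:
    """Hypothetical batch generator that shuffles selectors, not data."""

    def __init__(
        self,
        n_samples: int,
        batch_size: int,
        shuffle: bool = False,
        seed: Optional[int] = None,
    ) -> None:
        self._rng = random.Random(seed)
        # Materialize the selectors eagerly as a list so they can be reordered in place.
        self.patch_selectors: List[Dict[str, slice]] = [
            {"sample": slice(start, min(start + batch_size, n_samples))}
            for start in range(0, n_samples, batch_size)
        ]
        if shuffle:
            self.reshuffle()

    def reshuffle(self) -> None:
        """Randomize batch order again, e.g. between training epochs."""
        self._rng.shuffle(self.patch_selectors)

    def __len__(self) -> int:
        return len(self.patch_selectors)

    def __iter__(self) -> Iterator[Dict[str, slice]]:
        return iter(self.patch_selectors)


gen = ShuffledBatches(n_samples=10, batch_size=3, shuffle=True, seed=0)
first_epoch = [sel["sample"].start for sel in gen]
gen.reshuffle()
second_epoch = [sel["sample"].start for sel in gen]
```

Each selector could then be applied to a dataset to load one batch at a time. Calling `reshuffle()` between epochs changes only the visiting order, while the set of batches stays the same.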