Add new generic EasyBlock HuggingFaceDataset #4141
Test report by @VRehnberg
Overview of tested easyconfigs (in order)
Build succeeded for 1 out of 1 (total: 34 secs) (1 easyconfigs in total)
Building the CIFAR-10 dataset with 2023a and 2025a produces .arrow files that are not identical. I will have to investigate some more; it might be that 2.18 to 4.5 is simply too big a jump.
It is fine between 4.0.0 and 4.5.0, and 2.18.0 against a separate 2.18.0 build gives the same results. Good enough for me.
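For anyone who wants to repeat this kind of check, here is a minimal sketch of comparing the .arrow files of two builds byte for byte. It uses only the Python standard library; the function names `file_digests` and `compare_builds` and the install paths in the usage note are hypothetical and not part of the easyblock.

```python
import hashlib
from pathlib import Path


def file_digests(root, pattern="*.arrow"):
    """Map each matching file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob(pattern)):
        if not path.is_file():
            continue
        sha = hashlib.sha256()
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large Arrow files are not loaded into memory at once.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                sha.update(chunk)
        digests[path.relative_to(root).as_posix()] = sha.hexdigest()
    return digests


def compare_builds(dir_a, dir_b, pattern="*.arrow"):
    """Return relative paths whose contents differ or that exist in only one tree."""
    a = file_digests(dir_a, pattern)
    b = file_digests(dir_b, pattern)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

Calling `compare_builds("/path/to/build-a", "/path/to/build-b")` (placeholder paths) returns an empty list when every .arrow file matches. Note that this is a strict byte-level check: two files could hold the same records and still differ in bytes, so a mismatch here does not by itself prove that the data differs.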
load_dataset wants to reproduce the dataset in HF_HOME each time; using save_to_disk and load_from_disk should be more canonical
2e60632 to ca8f013
There is some potential future work for this easyblock (not planned for this PR):
Adapting it for model weights
This is no longer especially feasible now that I have switched to using save_to_disk. Model weights would have worked better with just running with HF_MODEL="%(installdir)s". On the other hand, providing model weights might be simple enough to just use the base Dataset block.

Reducing disk footprint during build
For large datasets this can currently be an issue. The disk footprint is:
So at one point the dataset exists as four copies in different places in the filesystem. There are some possible options to mitigate this, but nothing I've thought of seems as robust.
Extends the generic Dataset easyblock for any dataset from https://huggingface.co. Currently it only handles datasets (not models etc.), but I imagine it wouldn't be too hard to extend it.