Add new generic EasyBlock HuggingFaceDataset#4141

Open
VRehnberg wants to merge 8 commits into
easybuilders:develop from
VRehnberg:generic_huggingfacedataset

@VRehnberg
Contributor

Adds a generic HuggingFaceDataset easyblock that extends the generic Dataset easyblock to install any dataset from https://huggingface.co. It currently handles only datasets (not models etc.), but I imagine it wouldn't be too hard to extend.

@VRehnberg
Contributor Author

Test report by @VRehnberg

Overview of tested easyconfigs (in order)

  • SUCCESS CIFAR-10-data-20240104-hf-gfbf-2025a.eb

Build succeeded for 1 out of 1 (total: 34 secs) (1 easyconfigs in total)
vera-icelake-build - Linux Rocky Linux 9.6, x86_64, Intel(R) Xeon(R) Silver 4316 CPU @ 2.30GHz, Python 3.9.21
See https://gist.github.com/VRehnberg/4d7f04b08044aecfe264819399ddbc75 for a full test report.

@VRehnberg VRehnberg marked this pull request as draft May 23, 2026 15:14
@VRehnberg
Contributor Author

Building the CIFAR-10 dataset with 2023a and with 2025a produces .arrow files that are not identical. I will have to investigate some more; it might be that 2.18 to 4.5 is simply too big a jump.
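
A minimal sketch of how such a comparison could be done, assuming both builds are available as local directories. The function names `file_digests` and `differing_files` are hypothetical helpers for illustration and not part of the easyblock. Note that this only detects byte-level differences; it says nothing about whether the decoded records differ.

```python
import hashlib
from pathlib import Path


def file_digests(root, pattern="*.arrow"):
    """Map each matching file (path relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob(pattern)):
        if not path.is_file():
            continue
        sha = hashlib.sha256()
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large files never have to fit in memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                sha.update(chunk)
        digests[path.relative_to(root).as_posix()] = sha.hexdigest()
    return digests


def differing_files(dir_a, dir_b, pattern="*.arrow"):
    """Return relative paths that are missing on one side or differ in content."""
    a = file_digests(dir_a, pattern)
    b = file_digests(dir_b, pattern)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

Calling `differing_files` on the two install directories returns an empty list when every .arrow file is byte-identical, and otherwise lists the files worth a closer look.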

@VRehnberg
Contributor Author

The results are identical between 4.0.0 and 4.5.0, and a 2.18.0 build compared against a separate 2.18.0 build also gives the same results. Good enough for me.

@VRehnberg VRehnberg marked this pull request as ready for review May 23, 2026 15:37
@VRehnberg VRehnberg marked this pull request as draft May 24, 2026 07:43
VRehnberg added 4 commits May 25, 2026 07:32
load_dataset wants to reproduce the dataset in HF_HOME each time

using save_to_disk and load_from_disk should be more canonical
@VRehnberg VRehnberg force-pushed the generic_huggingfacedataset branch from 2e60632 to ca8f013 Compare May 25, 2026 05:51
@VRehnberg VRehnberg marked this pull request as ready for review May 25, 2026 06:31
@VRehnberg
Contributor Author

VRehnberg commented May 25, 2026

There is some potential future work for this easyblock (not planned for this PR):

  • adapting it for model weights
  • reducing disk footprint during build

Adapting it for model weights

This is no longer very feasible now that I have switched to using save_to_disk. Model weights would have worked better by just running with HF_MODEL="%(installdir)s" set.

On the other hand, providing model weights might be simple enough that the base Dataset easyblock is all that is needed.

Reducing disk footprint during build

For large datasets this can currently be an issue. The disk footprint is:

  • Data source directory:
    • Copy of dataset (usually in parquet format)
  • Build directory:
    • Copy of source files
    • Processed dataset in arrow files in load cache
    • Processed dataset from save_to_disk
  • Install directory:
    • Processed dataset from save_to_disk, moved here from the build directory

So at one point the dataset exists as four copies in different places in the filesystem. There are some possible ways to mitigate this, but none that I've thought of seems as robust as the current approach.
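
To put numbers on the footprint for a given dataset, something like the following could be run at the peak of the build. This is only a sketch: `dir_size_bytes` and `footprint_report` are hypothetical helpers, the stage labels and paths are whatever the caller supplies, and the sizes are apparent file sizes (symlinks are skipped, hard-linked files are counted once per link), not blocks actually allocated on disk.

```python
import os


def dir_size_bytes(root):
    """Sum the apparent sizes of all regular files below root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def footprint_report(stages):
    """Given a {label: directory} mapping, return {label: bytes} plus a 'total' entry."""
    sizes = {label: dir_size_bytes(path) for label, path in stages.items()}
    sizes["total"] = sum(sizes.values())
    return sizes
```

Passing a mapping with one entry each for the data source, build, and install directories gives a per-stage breakdown, which makes it easier to judge whether a mitigation is worth its added complexity.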
