Add new generic EasyBlock HuggingFaceDataset by VRehnberg · Pull Request #4141 · easybuilders/easybuild-easyblocks

VRehnberg · 2026-05-23T14:07:18Z

Extends the generic Dataset easyblock to handle any dataset from https://huggingface.co. It currently only handles datasets (not models etc.), but I imagine it wouldn't be too hard to extend.

VRehnberg · 2026-05-23T14:26:51Z

Test report by @VRehnberg

Overview of tested easyconfigs (in order)

SUCCESS CIFAR-10-data-20240104-hf-gfbf-2025a.eb

Build succeeded for 1 out of 1 (total: 34 secs) (1 easyconfigs in total)
vera-icelake-build - Linux Rocky Linux 9.6, x86_64, Intel(R) Xeon(R) Silver 4316 CPU @ 2.30GHz, Python 3.9.21
See https://gist.github.com/VRehnberg/4d7f04b08044aecfe264819399ddbc75 for a full test report.

VRehnberg · 2026-05-23T15:16:20Z

Running the CIFAR-10 dataset with 2023a and 2025a leads to .arrow files which are not identical. I will have to investigate some more; it might be that 2.18 to 4.5 is simply too big a jump.
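
For reference, a comparison like this can be sketched with the standard library alone. This is a minimal sketch, not what the easyblock does: the helper names `file_digests` and `differing_files` are hypothetical, and it assumes that "identical" means byte-for-byte equal .arrow files at the same relative paths in the two install directories.

```python
import hashlib
from pathlib import Path


def file_digests(root, pattern="*.arrow"):
    """Map each matching file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob(pattern)):
        if not path.is_file():
            continue
        sha = hashlib.sha256()
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large arrow files are not held in memory at once.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                sha.update(chunk)
        digests[path.relative_to(root).as_posix()] = sha.hexdigest()
    return digests


def differing_files(dir_a, dir_b, pattern="*.arrow"):
    """Return sorted relative paths that are missing on one side or differ in content."""
    a = file_digests(dir_a, pattern)
    b = file_digests(dir_b, pattern)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty list from `differing_files(install_dir_2023a, install_dir_2025a)` would mean the two builds produced the same files; the two directory names here are placeholders.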

VRehnberg · 2026-05-23T15:37:45Z

The output is identical between 4.0.0 and 4.5.0, and 2.18.0 against a separate 2.18.0 build gives the same results. Good enough for me.

load_dataset wants to reproduce the dataset in HF_HOME each time; using save_to_disk and load_from_disk should be more canonical.
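
The difference between the two approaches can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `hf_build_env`, `export_dataset`, `read_exported` and the `hf_home` subdirectory name are hypothetical, and it relies on the third-party Hugging Face `datasets` package being importable. The imports are done lazily because HF_HOME should be set before the library is first imported.

```python
import os


def hf_build_env(build_dir, base_env=None):
    """Return a copy of the environment with HF_HOME pointing inside build_dir."""
    env = dict(os.environ if base_env is None else base_env)
    env["HF_HOME"] = os.path.join(build_dir, "hf_home")  # hypothetical cache location
    return env


def export_dataset(source, target_dir, build_dir):
    """Load a dataset once, then write a standalone copy with save_to_disk."""
    # Set HF_HOME before the first import of the library so the load cache lands in build_dir.
    os.environ["HF_HOME"] = hf_build_env(build_dir)["HF_HOME"]
    import datasets  # third-party: Hugging Face `datasets`

    dataset = datasets.load_dataset(source)  # processes the source into the HF_HOME cache
    dataset.save_to_disk(target_dir)  # writes arrow files plus metadata to target_dir


def read_exported(target_dir):
    """Open the exported copy directly, without going through load_dataset."""
    import datasets

    return datasets.load_from_disk(target_dir)
```

With this layout a user only needs `read_exported(path)` on the installed directory, so nothing has to be regenerated under their own HF_HOME.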

VRehnberg · 2026-05-25T06:41:39Z

There is some potential future work for this easyblock (not planned for this PR):

- adapting it for model weights
- reducing disk footprint during build

Adapting it for model weights

This is no longer especially feasible now that I have switched to using save_to_disk. Model weights would have worked better by just running with HF_MODEL="%(installdir)s".

On the other hand, providing model weights might be simple enough that the base Dataset block can be used directly.

Reducing disk footprint during build

For large datasets this can currently be an issue. The disk footprint is:

Data source directory:
- Copy of dataset (usually in parquet format)
Build directory:
- Copy of source files
- Processed dataset in arrow files in load cache
- Processed dataset from save_to_disk
Install directory:
- Moving processed dataset result from save_to_disk

So at one point the dataset exists as four copies in different places in the filesystem. There are some possible options to mitigate this, but nothing I've thought of seems as robust.
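
To put rough numbers on the list above, here is a small back-of-the-envelope sketch. The function name `peak_footprint` and the sizes in the usage line are made up for illustration; it assumes the final step is a move within one filesystem, so the install directory adds no extra copy at the peak.

```python
def peak_footprint(source_size, processed_size):
    """Estimate space used per location at the moment all four copies exist.

    source_size is the size of the downloaded files (e.g. parquet) and
    processed_size is the size of the dataset once converted to arrow.
    """
    footprint = {
        # Downloaded copy kept in the data source directory.
        "source_dir": source_size,
        # Copied sources + load cache + save_to_disk output, all in the build directory.
        "build_dir": source_size + 2 * processed_size,
    }
    footprint["total"] = footprint["source_dir"] + footprint["build_dir"]
    return footprint


# Hypothetical sizes in GiB: 10 for the sources, 12 for the processed arrow files.
print(peak_footprint(10, 12))
```

For those example sizes the build directory alone needs 34 GiB and the whole build 44 GiB, compared with 12 GiB for the final install.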

VRehnberg mentioned this pull request May 23, 2026

{dataset}[gfbf/2025a,gfbf/2024a,gfbf/2023b,gfbf/2023a] CIFAR-10-data v20240104 w/ hf easybuilders/easybuild-easyconfigs#26080

Draft

VRehnberg marked this pull request as draft May 23, 2026 15:14

VRehnberg marked this pull request as ready for review May 23, 2026 15:37

VRehnberg marked this pull request as draft May 24, 2026 07:43

VRehnberg added 4 commits May 25, 2026 07:32

Add new generic EasyBlock HuggingFaceDataset

c8b91d7

Listen to hound

72db361

Move tmp env var usage into run_shell_cmd

b2923b3

Use save_to_disk over load_dataset

ca8f013

load_dataset wants to reproduce the dataset in HF_HOME each time; using save_to_disk and load_from_disk should be more canonical.

VRehnberg force-pushed the generic_huggingfacedataset branch from 2e60632 to ca8f013 Compare May 25, 2026 05:51

VRehnberg added 3 commits May 25, 2026 08:04

Clean up after hound

cba9cea

Clean up after hound

0495e39

Fix style

ebfcf5c

VRehnberg marked this pull request as ready for review May 25, 2026 06:31

Use download_filename for hash_url

390ee5e

VRehnberg wants to merge 8 commits into easybuilders:develop from VRehnberg:generic_huggingfacedataset
