Add new pipeline to manage Hugging Face Hub uploads #1967
base: main
Conversation
Added import and registration for the data_publication pipeline in pipeline_registry.py to enable its use within the project.
This pipeline is a minimal variant of @pascalwhoop's draft supplied in #1932. The pipeline basically takes the released integrated graph (edges and nodes separately) and publishes it on the Hugging Face Hub.
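As a rough sketch (dataset and node names here are assumptions, not the PR's exact code), such a minimal Kedro pipeline might look like this:

```python
from kedro.pipeline import Pipeline, node, pipeline


def publish_dataset_to_hf(df):
    """Hypothetical node: hand the data to the catalog, which performs the upload on save."""
    return df


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=publish_dataset_to_hf,
                inputs="integration.prm.unified_edges",
                outputs="data_publication.prm.kg_edges_hf_published",
                name="publish_kg_edges",
            ),
            node(
                func=publish_dataset_to_hf,
                inputs="integration.prm.unified_nodes",
                outputs="data_publication.prm.kg_nodes_hf_published",
                name="publish_kg_nodes",
            ),
        ]
    )
```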
Eliminated the 'credentials: hf' field from multiple HuggingFace dataset entries in catalog.yml to simplify configuration and rely on default authentication mechanisms.
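For context, the default mechanism assumed here is standard huggingface_hub token resolution (the HF_TOKEN environment variable or the token cached by `huggingface-cli login`), so no explicit credentials entry is needed. A quick way to check which account would be used:

```python
from huggingface_hub import HfApi

# Token is resolved implicitly from the HF_TOKEN env var or the local login cache.
api = HfApi()
print(api.whoami()["name"])  # the account the uploads would be attributed to
```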
Eliminated the publish_dataset_to_hf function from nodes.py and replaced its usage in the pipeline with a passthrough lambda function. This simplifies the pipeline since dataset publishing is handled via catalog configuration.
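The replacement node, roughly (a sketch with assumed dataset names):

```python
from kedro.pipeline import node

# Passthrough: the node itself does nothing; saving the output dataset
# triggers the Hugging Face upload configured in catalog.yml.
node(
    func=lambda df: df,
    inputs="integration.prm.unified_edges",
    outputs="data_publication.prm.kg_edges_hf_published",
    name="publish_kg_edges",
)
```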
Eliminates verification nodes and related config for published datasets in the data publication pipeline. Adds upload verification logic to HFIterableDataset. This streamlines the pipeline by removing redundant read-back checks and consolidates verification within the dataset class.
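A sketch of what the consolidated check inside the dataset class could look like (method and parameter names are assumptions, not the PR's actual code):

```python
from typing import Optional

from huggingface_hub import HfApi


def _verify_upload(repo_id: str, expected_files: list[str], token: Optional[str] = None) -> None:
    """Confirm the pushed repo contains the expected files, instead of re-reading the data back."""
    present = set(HfApi(token=token).list_repo_files(repo_id, repo_type="dataset"))
    missing = [f for f in expected_files if f not in present]
    if missing:
        raise RuntimeError(f"Upload verification failed for {repo_id}: missing {missing}")
```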
Renamed dataset keys in catalog.yml and pipeline.py from 'kg_edges_hf_published' and 'kg_nodes_hf_published' to 'data_publication.kg_edges_hf_published' and 'data_publication.kg_nodes_hf_published' to fulfill project requirements.
Empty parameter files not allowed in Matrix monorepo
Renamed catalog and pipeline output keys to include 'prm.' prefix for consistency.
Updated the input name for the filter_unified_kg_edges node to use 'filtering.prm.prefiltered_nodes' instead of 'filtering.prm.prefiltered_nodes@spark'. (Typo fix.)
Changed the repo_id values for kg_edges_hf_published and kg_nodes_hf_published from 'matentzn' test repositories to 'everycure' production repositories in the data publication catalog configuration.
```python
try:
    from pyspark.sql import SparkSession
except Exception as exc:  # pragma: no cover
    raise RuntimeError("Spark is not installed but dataframe_type='spark'.") from exc
```
I'm not sure how often we would see spark not being installed. Is it an issue you faced?
No (Pascal's code). This is very defensive programming! I don't think it is a problem per se; it is probably more robust this way. But yes, it adds 3 lines of code that are not, strictly speaking, necessary. Want me to remove it? I would probably not have thought of adding this myself, but I think it's not bad to have.
Yes please remove it
I did the same for the other "defensive" try/except block, see cadf99e. OK?
```yaml
  <<: [*_spark_parquet, *_layer_prm]
  filepath: ${globals:paths.integration}/prm/unified/edges

integration.prm.unified_edges@pandas:
```
Why is the @pandas version required if we can push to HuggingFace using Spark instead? I remember you saying that pandas is the recommended HF approach, but your dataset also allows for Spark and polars.
or maybe I missed something
This feedback will result in a lot of code changes. I was misled by AI (I triple-checked it on three chat tools and they all said the same thing!) into believing that pandas is (1) simpler and (2) preferred, but the truth is that the choice between pandas and Spark is arbitrary. So the only argument I can now give here is that the explicit transcoding suffixes introduced here (@spark, @pandas) protect the code from FUTURE format changes (where, say, the primary storage format of integration.prm.unified_edges changes). There is no other technical reason; I could remove the entire transcoding logic again and JUST refer to the Spark dataset, and the result would be technically the same. Please advise!
After discussion, we decided to revert the transcoding and use the Spark dataframe (or another type if suitable).
I reverted the transcoding logic: 2c25634
You will rarely have seen such a PR: it is purely additive!! I find this strangely satisfying.
Thanks Nico! One could argue that a PR that is only removing lines is more elegant, but let's not turn this into a philosophical debate 😄
Replaces multiple 'if' statements with 'elif' and 'else' to ensure only one dataframe type branch is executed and to improve error handling for unsupported types.
Refactored the Hugging Face dataset loading logic to remove try/except blocks and fallback conversions for Spark and Polars dataframe types. Now assumes required libraries are installed and uses direct conversion methods, improving code clarity and reducing complexity.
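Taken together, the load path after these two changes could look roughly like this (a sketch; names like `to_requested_type` and `hf_dataset` are assumptions, not the PR's exact code):

```python
import polars as pl
from pyspark.sql import SparkSession


def to_requested_type(hf_dataset, dataframe_type: str):
    """Convert a loaded `datasets.Dataset` into the requested dataframe type."""
    if dataframe_type == "pandas":
        return hf_dataset.to_pandas()
    elif dataframe_type == "spark":
        # Direct conversion via pandas; assumes pyspark is installed.
        spark = SparkSession.builder.getOrCreate()
        return spark.createDataFrame(hf_dataset.to_pandas())
    elif dataframe_type == "polars":
        return pl.from_pandas(hf_dataset.to_pandas())
    else:
        raise ValueError(f"Unsupported dataframe_type: {dataframe_type!r}")
```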
Removes the Kedro transcoding logic by dropping the '@spark' and '@pandas' suffixes from 'integration.prm.unified_nodes' and 'integration.prm.unified_edges' across configuration files and pipeline code. Updates documentation and all pipeline references to use the unified dataset names, simplifying catalog management and usage.
Updated the dataframe_type for both kg_edges_hf_published and kg_nodes_hf_published datasets from 'pandas' to 'spark' to enable Spark-based processing.
Deleted an unnecessary blank line between integration.prm.unified_edges and integration.prm.unified_edges_simplified for improved readability.
Updated internal methods to accept and use an optional token parameter for authenticated API requests to the HuggingFace Hub. This improves support for private datasets and ensures all verification steps can operate with proper authorization.
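A minimal sketch of that token plumbing (the function name is assumed for illustration):

```python
from typing import Optional

from huggingface_hub import HfApi


def _dataset_exists(repo_id: str, token: Optional[str] = None) -> bool:
    """Check repo visibility with explicit authorization, so private datasets verify too."""
    try:
        HfApi(token=token).dataset_info(repo_id)
        return True
    except Exception:
        return False
```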
@JacquesVergine I have:
This pull request has been automatically marked as stale because it has had no activity in the last 14 days. If this PR is still relevant:
Otherwise, it will be closed in 2 weeks if no further activity occurs.
JacquesVergine left a comment
Thanks Nico, good work!
Description of the changes
The PR adds a new pipeline, data_publication, to the matrix pipeline system. It takes the `${globals:paths.integration}/prm/unified/edges` (and nodes) outputs from the `data_release` pipeline and uploads them to Hugging Face via a new dataset type. Note that @pascalwhoop created 99% of the code in #1932; this PR is a reduction of his work to the absolute minimum and its integration into the regular pipeline system.
I have tested the release locally, and it works fine (9-10 minutes runtime with a good internet connection). Right now, due to access issues, the pipeline is set to upload to my private account, e.g.
So before merging we should make sure we create corresponding datasets in the https://huggingface.co/everycure organisation
Furthermore we should agree on how / when the pipeline is run:
Fixes / Resolves the following issues:
Checklist:
- Labeled the PR appropriately (enhancement or bug)
- If breaking changes occur or you need everyone to run a command locally after pulling in latest main, uncomment the below "Merge Notification" section and describe steps necessary for people
- Ran on sample data using `kedro run -e sample -p test_sample` (see sample environment guide)

Review instructions