Conversation

@matentzn (Contributor) commented Nov 28, 2025

Description of the changes

The PR adds a new pipeline, data_publication, to the matrix pipeline system. It takes the ${globals:paths.integration}/prm/unified/edges (and nodes) from the data_release pipeline and uploads them to Hugging Face via a new dataset type.
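Since the upload is handled via catalog configuration rather than pipeline code (see the later commits), the new dataset type does the actual work on save. A minimal sketch of what such a dataset type could look like, assuming kedro>=0.19 and the datasets library — the class name, constructor arguments, and write-only behaviour are assumptions, not the code in this PR:

import os

import pandas as pd
from datasets import Dataset
from kedro.io import AbstractDataset


class HFDatasetSketch(AbstractDataset):
    """Hypothetical dataset type that pushes a dataframe to the Hugging Face Hub on save."""

    def __init__(self, repo_id: str):
        self._repo_id = repo_id

    def _save(self, df: pd.DataFrame) -> None:
        # Kedro calls this when the pipeline writes the catalog entry, so the
        # upload is driven entirely by catalog configuration.
        Dataset.from_pandas(df).push_to_hub(self._repo_id, token=os.environ.get("HF_TOKEN"))

    def _load(self) -> pd.DataFrame:
        raise NotImplementedError("This sketch is write-only.")

    def _describe(self) -> dict:
        return {"repo_id": self._repo_id}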

Note that @pascalwhoop created 99% of the code in #1932 - this PR is a reduction of his work to the absolute minimum and its integration into the regular pipeline system.

I have tested the release locally, and it works fine (9-10 minutes runtime with a good internet connection). Right now, due to access issues, the pipeline is set to upload to my private account.

So before merging, we should make sure we create corresponding datasets in the https://huggingface.co/everycure organisation.

Furthermore, we should agree on how and when the pipeline is run:

  • Add a CI job to run the pipeline once in a while (say, every two weeks, or on demand)

Fixes / Resolves the following issues:

Checklist:

  • Added label to PR (e.g. enhancement or bug)
  • Ensured the PR is named descriptively. FYI: This name is used as part of our changelog & release notes.
  • Looked at the diff on GitHub to make sure no unwanted files have been committed.
  • Made corresponding changes to the documentation
  • Added tests that prove my fix is effective or that my feature works
  • Any dependent changes have been merged and published in downstream modules
  • If breaking changes occur or you need everyone to run a command locally after pulling in latest main, uncomment the below "Merge Notification" section and describe steps necessary for people
  • Ran on sample data using kedro run -e sample -p test_sample (see sample environment guide)

Review instructions

  1. You can run the code like this:

     export HF_TOKEN="hf_...YOURTOKEN..."
     uv run kedro run -e cloud -p data_publication "$@"

  2. I have not added the tests @pascalwhoop created in #1932 (WIP: Add Hugging Face publication pipeline to Kedro), but I am happy to migrate them over if desired!

Added import and registration for the data_publication pipeline in pipeline_registry.py to enable its use within the project.
This pipeline is a minimal variant of @pascalwhoop's draft supplied in #1932.

The pipeline takes the released integrated graph (edges and nodes separately) and publishes it on the Hugging Face Hub.
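For illustration, a minimal sketch of the shape this takes, based on the commit messages below — dataset and node names are assumptions, not the exact code in this PR:

from kedro.pipeline import Pipeline, node, pipeline


def create_pipeline(**kwargs) -> Pipeline:
    # Passthrough nodes: the actual upload happens when Kedro saves the
    # outputs through the Hugging Face dataset type configured in the catalog.
    return pipeline(
        [
            node(
                func=lambda df: df,
                inputs="integration.prm.unified_nodes",
                outputs="data_publication.prm.kg_nodes_hf_published",
                name="publish_kg_nodes_to_hf",
            ),
            node(
                func=lambda df: df,
                inputs="integration.prm.unified_edges",
                outputs="data_publication.prm.kg_edges_hf_published",
                name="publish_kg_edges_to_hf",
            ),
        ]
    )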
matentzn requested a review from a team as a code owner November 28, 2025 23:24
matentzn requested a review from piotrkan November 28, 2025 23:24
matentzn added the enhancement label Nov 28, 2025
Eliminated the 'credentials: hf' field from multiple HuggingFace dataset entries in catalog.yml to simplify configuration and rely on default authentication mechanisms.
Removed pandas dataset definitions from data_publication catalog and added them to integration catalog with '@pandas' suffix to leverage Kedro transcoding. Updated pipeline to use new '@pandas' dataset names for publishing nodes and edges to Hugging Face.
Eliminated the publish_dataset_to_hf function from nodes.py and replaced its usage in the pipeline with a passthrough lambda function. This simplifies the pipeline since dataset publishing is handled via catalog configuration.
Eliminates verification nodes and related config for published datasets in the data publication pipeline. Adds upload verification logic to HFIterableDataset. This streamlines the pipeline by removing redundant read-back checks and consolidates verification within the dataset class.
Renamed dataset keys in catalog.yml and pipeline.py from 'kg_edges_hf_published' and 'kg_nodes_hf_published' to 'data_publication.kg_edges_hf_published' and 'data_publication.kg_nodes_hf_published' to fulfill project naming requirements.
Empty parameter files not allowed in Matrix monorepo
Renamed catalog and pipeline output keys to include 'prm.' prefix for consistency.
Updated the input name for filter_unified_kg_edges node to use 'filtering.prm.prefiltered_nodes' instead of 'filtering.prm.prefiltered_nodes@spark'. (Typo)
Changed the repo_id values for kg_edges_hf_published and kg_nodes_hf_published from 'matentzn' test repositories to 'everycure' production repositories in the data publication catalog configuration.
try:
    from pyspark.sql import SparkSession
except Exception as exc:  # pragma: no cover
    raise RuntimeError("Spark is not installed but dataframe_type='spark'.") from exc
Collaborator:

I'm not sure how often we would see spark not being installed. Is it an issue you faced?

Contributor Author:

No (it's Pascal's code). This is very defensive programming! I don't think it is a problem per se - it is probably more robust this way. But yes, it adds three lines of code that are not strictly necessary. Want me to remove them? I would probably not have thought of adding this, but I think it's not bad to have.

Collaborator:

Yes, please remove it.

Contributor Author:

I did the same for the other "defensive" try/except block, see cadf99e.

OK?

  <<: [*_spark_parquet, *_layer_prm]
  filepath: ${globals:paths.integration}/prm/unified/edges

integration.prm.unified_edges@pandas:
Collaborator:

Why is the @pandas version required if we can push to Hugging Face using Spark instead? I remember you saying that pandas is the recommended HF approach, but your dataset also allows for Spark and Polars.

Collaborator:

or maybe I missed something

Contributor Author:

This feedback will result in a lot of code changes. I was misled by AI (I triple-checked on three chat tools and they all said the same thing!) into believing that pandas is (1) simpler and (2) preferred, but the truth is that the choice between pandas and Spark is arbitrary. So the only argument I can give now is that the explicit transcoding suffixes introduced here (@spark, @pandas) protect the code from future changes of format (where, say, the primary storage format of integration.prm.unified_edges changes). There is no other technical reason - I could remove the entire transcoding logic and just refer to the Spark dataset, and the result would technically be the same. Please advise!

Collaborator:

After discussion, we decided to revert the transcoding and use the Spark dataframe (or another type, if suitable).

Contributor Author:

I reverted the transcoding logic: 2c25634

You will rarely have seen such a PR:

[screenshot of the diff stats]

It is purely additive!! I find this strangely satisfying.

Collaborator:

Thanks Nico! One could argue that a PR that is only removing lines is more elegant, but let's not turn this into a philosophical debate 😄

Replaces multiple 'if' statements with 'elif' and 'else' to ensure only one dataframe type branch is executed and to improve error handling for unsupported types.
Refactored the Hugging Face dataset loading logic to remove try/except blocks and fallback conversions for Spark and Polars dataframe types. Now assumes required libraries are installed and uses direct conversion methods, improving code clarity and reducing complexity.
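For clarity, a hedged sketch of what that dispatch might look like after the refactor — the function and its parameters are assumptions, not the actual methods in HFIterableDataset:

from datasets import Dataset
from pyspark.sql import SparkSession


def to_dataframe(hf_dataset: Dataset, dataframe_type: str, spark: SparkSession | None = None):
    # Exactly one branch runs; unsupported types fail loudly.
    if dataframe_type == "pandas":
        return hf_dataset.to_pandas()
    elif dataframe_type == "spark":
        # Direct conversion via pandas; assumes pyspark is installed.
        return spark.createDataFrame(hf_dataset.to_pandas())
    elif dataframe_type == "polars":
        # Assumes a recent `datasets` release that provides to_polars().
        return hf_dataset.to_polars()
    else:
        raise ValueError(f"Unsupported dataframe_type: {dataframe_type!r}")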
Removes kedro transcoding logic by removing '@spark' and '@pandas' suffixes from 'integration.prm.unified_nodes' and 'integration.prm.unified_edges' across configuration files and pipeline code. Updates documentation and all pipeline references to use the new unified dataset names, simplifying catalog management and usage.
Updated the dataframe_type for both kg_edges_hf_published and kg_nodes_hf_published datasets from 'pandas' to 'spark' to enable Spark-based processing.
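Assuming the Spark save path uses the datasets library's native Spark support (Dataset.from_spark, available since datasets 2.11), it could look roughly like this; the function name is hypothetical:

from datasets import Dataset
from pyspark.sql import DataFrame as SparkDataFrame


def push_spark_df_to_hub(spark_df: SparkDataFrame, repo_id: str, token: str | None = None) -> None:
    # Build the Hugging Face dataset directly from Spark (no pandas
    # round-trip), then push it to the Hub.
    Dataset.from_spark(spark_df).push_to_hub(repo_id, token=token)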
Deleted an unnecessary blank line between integration.prm.unified_edges and integration.prm.unified_edges_simplified for improved readability.
Updated internal methods to accept and use an optional token parameter for authenticated API requests to the HuggingFace Hub. This improves support for private datasets and ensures all verification steps can operate with proper authorization.
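A sketch of what token-aware verification can look like with huggingface_hub — the function and its parameters are assumptions, not the exact methods in HFIterableDataset:

from huggingface_hub import HfApi


def verify_upload(repo_id: str, expected_files: list[str], token: str | None = None) -> None:
    # Pass the token through so private datasets can be verified too.
    api = HfApi(token=token)
    present = set(api.list_repo_files(repo_id=repo_id, repo_type="dataset"))
    missing = [f for f in expected_files if f not in present]
    if missing:
        raise RuntimeError(f"Upload verification failed; missing files: {missing}")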
@matentzn (Contributor Author):

@JacquesVergine I have:

github-actions bot commented Jan 2, 2026

This pull request has been automatically marked as stale because it has had no activity in the last 14 days.

If this PR is still relevant:

  • Please update it to the latest main/master branch
  • Address any existing review comments
  • Leave a comment to keep it open

Otherwise, it will be closed in 2 weeks if no further activity occurs.

github-actions bot added the stale label Jan 2, 2026
@JacquesVergine (Collaborator) left a comment:
Thanks Nico, good work!
