Add new pipeline to manage Hugging Face Hub uploads #1967
base: main
Conversation
Added import and registration for the data_publication pipeline in pipeline_registry.py to enable its use within the project.
This pipeline is a minimal variant of @pascalwhoop's draft supplied in #1932. The pipeline basically takes the released integrated graph (edges and nodes separately) and publishes it on the Hugging Face Hub.
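As a rough sketch (dataset and node names here are assumptions, not the PR's exact code), such a minimal Kedro pipeline might look like this:

```python
from kedro.pipeline import Pipeline, node, pipeline


def publish_dataset_to_hf(df):
    """Hypothetical node: hand the data to the catalog, which performs the upload on save."""
    return df


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=publish_dataset_to_hf,
                inputs="integration.prm.unified_edges",
                outputs="data_publication.prm.kg_edges_hf_published",
                name="publish_kg_edges",
            ),
            node(
                func=publish_dataset_to_hf,
                inputs="integration.prm.unified_nodes",
                outputs="data_publication.prm.kg_nodes_hf_published",
                name="publish_kg_nodes",
            ),
        ]
    )
```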
Eliminated the 'credentials: hf' field from multiple HuggingFace dataset entries in catalog.yml to simplify configuration and rely on default authentication mechanisms.
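For context, the default mechanism assumed here is standard huggingface_hub token resolution (the HF_TOKEN environment variable or the token cached by `huggingface-cli login`), so no explicit credentials entry is needed. A quick way to check which account would be used:

```python
from huggingface_hub import HfApi

# Token is resolved implicitly from the HF_TOKEN env var or the local login cache.
api = HfApi()
print(api.whoami()["name"])  # the account the uploads would be attributed to
```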
Eliminated the publish_dataset_to_hf function from nodes.py and replaced its usage in the pipeline with a passthrough lambda function. This simplifies the pipeline since dataset publishing is handled via catalog configuration.
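The replacement node, roughly (a sketch with assumed dataset names):

```python
from kedro.pipeline import node

# Passthrough: the node itself does nothing; saving the output dataset
# triggers the Hugging Face upload configured in catalog.yml.
node(
    func=lambda df: df,
    inputs="integration.prm.unified_edges",
    outputs="data_publication.prm.kg_edges_hf_published",
    name="publish_kg_edges",
)
```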
Eliminates verification nodes and related config for published datasets in the data publication pipeline. Adds upload verification logic to HFIterableDataset. This streamlines the pipeline by removing redundant read-back checks and consolidates verification within the dataset class.
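A sketch of what the consolidated check inside the dataset class could look like (method and parameter names are assumptions, not the PR's actual code):

```python
from typing import Optional

from huggingface_hub import HfApi


def _verify_upload(repo_id: str, expected_files: list[str], token: Optional[str] = None) -> None:
    """Confirm the pushed repo contains the expected files, instead of re-reading the data back."""
    present = set(HfApi(token=token).list_repo_files(repo_id, repo_type="dataset"))
    missing = [f for f in expected_files if f not in present]
    if missing:
        raise RuntimeError(f"Upload verification failed for {repo_id}: missing {missing}")
```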
Renamed dataset keys in catalog.yml and pipeline.py from 'kg_edges_hf_published' and 'kg_nodes_hf_published' to 'data_publication.kg_edges_hf_published' and 'data_publication.kg_nodes_hf_published' to fulfill project requirements.
Empty parameter files not allowed in Matrix monorepo
Renamed catalog and pipeline output keys to include 'prm.' prefix for consistency.
Updated the input name for the filter_unified_kg_edges node to use 'filtering.prm.prefiltered_nodes' instead of 'filtering.prm.prefiltered_nodes@spark'. (Typo fix.)
Changed the repo_id values for kg_edges_hf_published and kg_nodes_hf_published from 'matentzn' test repositories to 'everycure' production repositories in the data publication catalog configuration.
```python
try:
    from pyspark.sql import SparkSession
except Exception as exc:  # pragma: no cover
    raise RuntimeError("Spark is not installed but dataframe_type='spark'.") from exc
```
I'm not sure how often we would see spark not being installed. Is it an issue you faced?
No (Pascal's code). This is very defensive programming! I don't think it is a problem per se; it is probably more robust this way. But yes, it adds 3 lines of code that are not, strictly speaking, necessary. Want me to remove it? I would probably not have thought of adding this myself, but I think it's not bad to have.
Yes please remove it
I did the same for the other "defensive" try/except block, see cadf99e. OK?
```yaml
  <<: [*_spark_parquet, *_layer_prm]
  filepath: ${globals:paths.integration}/prm/unified/edges

integration.prm.unified_edges@pandas:
```
Why is the @pandas version required if we can push to HuggingFace using Spark instead? I remember you saying that pandas is the recommended HF approach, but your dataset also allows for Spark and polars.
or maybe I missed something
This feedback will result in a lot of code changes. I was misled by AI (I triple-checked it on three chat tools and they all said the same thing!) into believing that pandas is (1) simpler and (2) preferred, but the truth is that the choice between pandas and Spark is arbitrary. So the only argument I can now give here is that the explicit transcoding suffixes introduced here (@spark, @pandas) protect the code from FUTURE format changes (where, say, the primary storage format of integration.prm.unified_edges changes). There is no other technical reason; I could remove the entire transcoding logic again and JUST refer to the Spark dataset, and the result would be technically the same. Please advise!
After discussion, we decided to revert the transcoding and use the Spark dataframe (or another type if suitable).
I reverted the transcoding logic: 2c25634
You will rarely have seen such a PR: it is purely additive!! I find this strangely satisfying.
Thanks Nico! One could argue that a PR that is only removing lines is more elegant, but let's not turn this into a philosophical debate 😄
Replaces multiple 'if' statements with 'elif' and 'else' to ensure only one dataframe type branch is executed and to improve error handling for unsupported types.
Refactored the Hugging Face dataset loading logic to remove try/except blocks and fallback conversions for Spark and Polars dataframe types. Now assumes required libraries are installed and uses direct conversion methods, improving code clarity and reducing complexity.
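Taken together, the load path after these two changes could look roughly like this (a sketch; names like `to_requested_type` and `hf_dataset` are assumptions, not the PR's exact code):

```python
import polars as pl
from pyspark.sql import SparkSession


def to_requested_type(hf_dataset, dataframe_type: str):
    """Convert a loaded `datasets.Dataset` into the requested dataframe type."""
    if dataframe_type == "pandas":
        return hf_dataset.to_pandas()
    elif dataframe_type == "spark":
        # Direct conversion via pandas; assumes pyspark is installed.
        spark = SparkSession.builder.getOrCreate()
        return spark.createDataFrame(hf_dataset.to_pandas())
    elif dataframe_type == "polars":
        return pl.from_pandas(hf_dataset.to_pandas())
    else:
        raise ValueError(f"Unsupported dataframe_type: {dataframe_type!r}")
```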
Removes the Kedro transcoding logic by dropping the '@spark' and '@pandas' suffixes from 'integration.prm.unified_nodes' and 'integration.prm.unified_edges' across configuration files and pipeline code. Updates documentation and all pipeline references to use the unified dataset names, simplifying catalog management and usage.
Updated the dataframe_type for both kg_edges_hf_published and kg_nodes_hf_published datasets from 'pandas' to 'spark' to enable Spark-based processing.
Deleted an unnecessary blank line between integration.prm.unified_edges and integration.prm.unified_edges_simplified for improved readability.
Updated internal methods to accept and use an optional token parameter for authenticated API requests to the HuggingFace Hub. This improves support for private datasets and ensures all verification steps can operate with proper authorization.
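A minimal sketch of that token plumbing (the function name is assumed for illustration):

```python
from typing import Optional

from huggingface_hub import HfApi


def _dataset_exists(repo_id: str, token: Optional[str] = None) -> bool:
    """Check repo visibility with explicit authorization, so private datasets verify too."""
    try:
        HfApi(token=token).dataset_info(repo_id)
        return True
    except Exception:
        return False
```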
@JacquesVergine I have:
This pull request has been automatically marked as stale because it has had no activity in the last 14 days. If this PR is still relevant:
Otherwise, it will be closed in 2 weeks if no further activity occurs.
JacquesVergine left a comment
Thanks Nico, good work!
Description of the changes
The PR adds a new pipeline, data_publication, to the matrix pipeline system. It takes the `${globals:paths.integration}/prm/unified/edges` (and nodes) outputs from the `data_release` pipeline and uploads them to Hugging Face via a new dataset type. Note that @pascalwhoop created 99% of the code in #1932; this PR is a reduction of his work to the absolute minimum and its integration into the regular pipeline system.
I have tested the release locally, and it works fine (9-10 minutes runtime with a good internet connection). Right now, due to access issues, the pipeline is set to upload to my private account, e.g.
So before merging we should make sure we create corresponding datasets in the https://huggingface.co/everycure organisation
Furthermore we should agree on how / when the pipeline is run:
Fixes / Resolves the following issues:
Checklist:
- Labeled the PR appropriately (enhancement or bug)
- If breaking changes occur or you need everyone to run a command locally after pulling in latest main, uncomment the below "Merge Notification" section and describe steps necessary for people
- Ran on sample data using `kedro run -e sample -p test_sample` (see sample environment guide)

Review instructions