Integrate a dataset to reformat into Zarr.
Integrating a dataset in dynamical.org reformatters is done by subclassing a trio of base classes, customizing their behavior based on the unique characteristics of your dataset.
There are three core base classes to subclass.
- `TemplateConfig` defines the dataset structure.
- `RegionJob` defines the process by which a region of that dataset is reformatted: downloading, reading, rewriting.
- `DynamicalDataset` brings together a `TemplateConfig` and a `RegionJob` and defines the compute resources to operationally update and validate a dataset.
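To see how these pieces fit together before diving into the real code, here is a toy, self-contained sketch. The classes below are stand-ins for illustration only; the real base classes come from the reformatters package and the generated placeholder files define their actual interfaces.

```python
class ToyTemplateConfig:
    """Stand-in: describes dimensions, coordinates and data variables."""
    dims = ("init_time", "latitude", "longitude")
    data_vars = ("temperature_2m",)


class ToyRegionJob:
    """Stand-in: downloads, reads and writes one region of the dataset."""
    def __init__(self, template: ToyTemplateConfig) -> None:
        self.template = template

    def process(self) -> None:
        for var in self.template.data_vars:
            print(f"download -> read -> write: {var}")


class ToyDynamicalDataset:
    """Stand-in: ties the template and region job together and owns operational config."""
    template_config = ToyTemplateConfig()

    def backfill(self) -> None:
        ToyRegionJob(self.template_config).process()


ToyDynamicalDataset().backfill()
```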
- Provider - the agency or organization that publishes the source data. e.g. ECMWF
- Model - the model or system that produced the data. e.g. GFS
- Variant - the specific subset and structure of data from the model. e.g. forecast, analysis, climatology. Variant may include any other information needed to distinguish datasets from the same model.
- Dataset - a specific provider-model-variant. e.g. noaa-gfs-forecast
Before getting started, follow the brief setup steps in README.md > Local development > Setup.
Explore the source dataset to understand the nuances of what's available and how to access it. See docs/source_data_exploration_guide.md.
uv run main initialize-new-integration <provider> <model> <variant>

Provider, model and variant can contain letters, numbers and dashes (e.g. ICON-EU or analysis-hourly). Capitalization will be normalized for you.
This will add a number of files within src/reformatters/<provider>/<model>/<variant> and tests/<provider>/<model>/<variant>.
These files will contain placeholder implementations of the subclasses referenced above. Follow the rest of this doc for guidance on how to complete the implementations to integrate your new dataset.
Add an instance of your DynamicalDataset subclass to the DYNAMICAL_DATASETS constant in src/reformatters/__main__.py:
```python
from reformatters.provider.model.variant import ProviderModelVariantDataset

DYNAMICAL_DATASETS = [
    ...,
    ProviderModelVariantDataset(
        primary_storage_config=ProviderModelIcechunkAwsOpenDataDatasetStorageConfig(),
    ),
]
```

If you plan to write this dataset to a location not maintained by dynamical.org, you can instantiate and pass your own StorageConfig; contact feedback@dynamical.org for support.
Work through src/reformatters/$DATASET_PATH/template_config.py, setting the attributes and method definitions to describe the structure of your dataset. The report generated by following the source_data_exploration_guide.md will be helpful here.
Read the chunk/shard layout tool docs and use the tool to find chunk and shard sizes for your data variables.
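If you want to sanity-check the tool's output by hand, an uncompressed chunk or shard is just the product of its per-dimension lengths times the dtype size. A quick illustrative calculation; the shapes below are made-up placeholders, not recommendations for your dataset:

```python
# Back-of-the-envelope check of uncompressed chunk and shard sizes.
# Dimension names and lengths here are illustrative placeholders.
import numpy as np

dtype = np.dtype("float32")
chunk_shape = {"init_time": 1, "lead_time": 48, "latitude": 361, "longitude": 720}
shard_shape = {"init_time": 1, "lead_time": 192, "latitude": 721, "longitude": 1440}

chunk_mb = dtype.itemsize * np.prod(list(chunk_shape.values())) / 1e6
shard_mb = dtype.itemsize * np.prod(list(shard_shape.values())) / 1e6

print(f"uncompressed chunk ~ {chunk_mb:.0f} MB, shard ~ {shard_mb:.0f} MB")
```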
Using the information in the TemplateConfig, reformatters writes the Zarr metadata for your dataset to src/reformatters/$DATASET_PATH/templates/latest.zarr. Run this command in your terminal to create or update the template based on your TemplateConfig subclass:
uv run main $DATASET_ID update-template
git add src/reformatters/$DATASET_PATH/templates/latest.zarr

Tracking the template in git lets us review diffs of any changes to the structure of our dataset.
Run the tests, making any changes necessary.
uv run pytest tests/$DATASET_PATH/template_config_test.py

Work through src/reformatters/$DATASET_PATH/region_job.py, implementing the attributes and method definitions based on the unique structure and processing required for your dataset.
There are four required methods:
- `generate_source_file_coords` lists all the files of source data that will be processed to complete the `RegionJob`.
- `download_file` retrieves a specific source file and writes it to local disk.
- `read_data` loads data from a local path and returns a numpy array.
- `operational_update_jobs` is a factory method that returns the `RegionJob`s necessary to update the dataset with the latest available data. You can skip this until you're ready to implement dataset updates; a dataset backfill can be run with just the first three methods.
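As a rough illustration of the kind of logic the download and read steps contain, here is a standalone sketch of fetching one source file and reading a variable from it. This is not the real RegionJob interface; the generated placeholders define the actual signatures, and the URL format, GRIB engine and variable name here are assumptions.

```python
# Standalone sketch only: the real method signatures come from the RegionJob
# placeholders generated by initialize-new-integration. The URL, engine and
# variable name are hypothetical.
from pathlib import Path

import numpy as np
import requests
import xarray as xr


def download_file(url: str, local_path: Path) -> Path:
    """Stream a single source file to local disk."""
    local_path.parent.mkdir(parents=True, exist_ok=True)
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        with open(local_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=2**20):
                f.write(chunk)
    return local_path


def read_data(local_path: Path, variable: str) -> np.ndarray:
    """Load one variable from a downloaded GRIB file as a numpy array."""
    ds = xr.open_dataset(local_path, engine="cfgrib")  # requires cfgrib installed
    return ds[variable].values
```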
There are a few optional, additional methods which are described in the example code. Implement them if required for your dataset, otherwise remove them to use the base class RegionJob implementations.
Write tests for any custom logic you've created.
uv run pytest tests/$DATASET_PATH/region_job_test.py

You've reached the point where you can run the reformatter locally!
uv run main $DATASET_ID backfill-local <append_dim_end> --filter-variable-names <data var name>

Reformatting locally can be slow. Choosing an <append_dim_end> not long after your template's append_dim_start and selecting a single variable to process with --filter-variable-names can limit the amount of work.
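After a local backfill completes, it can help to open the resulting store and eyeball a value or two before moving on. A minimal sketch; the store path and variable name below are placeholders for wherever your backfill-local run wrote data:

```python
# Quick spot-check of a locally backfilled store.
# The path and variable name are placeholders specific to your run.
import xarray as xr

ds = xr.open_zarr("path/to/local/dataset.zarr")
print(ds)  # dimensions, coordinates and data variables
print(ds["temperature_2m"].isel(init_time=0).mean().compute())
```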
To operationalize your dataset and have the update and validate Kubernetes cron jobs be deployed automatically by GitHub CI, implement the two methods in src/reformatters/$DATASET_PATH/dynamical_dataset.py.
Kubernetes resource values:
- shared memory: Round the value calculated in the chunk/shard size tool output up to the nearest half GB.
- memory: 1.5x shared memory.
- cpu: the number of spatial dimension shards minus 1 to account for kubernetes headroom. e.g. if 2 latitude shards * 4 longitude shards = 8, choose 7 cpu to schedule on an 8 cpu node.
- ephemeral_storage: 20GB is a good starting point.
Parallelism: Set workers_total and parallelism on the ReformatCronJob using self.num_variable_groups(). Multiply by 2 if operational_update_jobs reprocesses the most recent time slice (see GEFS datasets for examples).
The update cron schedule should run shortly after the source data is expected to be available and the validate cron should run at update cron start + update pod_active_deadline.
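To make the arithmetic above concrete, here is a small worked example with made-up inputs; the shared memory figure would come from the chunk/shard layout tool output for your dataset:

```python
# Worked example of the resource and parallelism rules above, with illustrative inputs.
import math

shared_memory_from_tool_gb = 3.2   # from the chunk/shard layout tool output (example)
lat_shards, lon_shards = 2, 4      # spatial dimension shard counts (example)
num_variable_groups = 6            # stand-in for self.num_variable_groups()
reprocesses_recent_slice = True    # True if operational_update_jobs redoes the latest slice

shared_memory_gb = math.ceil(shared_memory_from_tool_gb * 2) / 2  # round up to nearest 0.5 GB
memory_gb = 1.5 * shared_memory_gb
cpu = lat_shards * lon_shards - 1                                 # leave headroom for kubernetes
workers = num_variable_groups * (2 if reprocesses_recent_slice else 1)

print(shared_memory_gb, memory_gb, cpu, workers)  # 3.5 5.25 7 12
```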
In dynamical_dataset_test.py create a test that runs backfill_local followed by update for a couple data variables and a minimal number of time steps, lead times and ensemble members. Include snapshot value assertions for every data variable that the test processes — check specific known values at specific coordinates (e.g. assert_allclose(point["temperature_2m"].values, [28.75, 29.23])). Snapshot values catch silent regressions in data reading, unit conversion, or coordinate alignment that other tests miss.
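A sketch of the snapshot-assertion portion of such a test; the store path, variable names, coordinates and expected values are all placeholders you would replace with known-good values from your dataset:

```python
# Snapshot-style value assertions. Paths, names, coordinates and expected
# values below are placeholders; use real known-good values from your dataset.
import numpy as np
import xarray as xr


def assert_snapshot_values(store_path: str) -> None:
    """Check a handful of known values at specific coordinates for each data variable."""
    ds = xr.open_zarr(store_path)
    point = ds.sel(latitude=40.0, longitude=255.0, method="nearest").isel(init_time=0)
    np.testing.assert_allclose(point["temperature_2m"].values, [28.75, 29.23], rtol=1e-6)
    np.testing.assert_allclose(point["precipitation_surface"].values, [0.0, 0.2], rtol=1e-6)
```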
uv run pytest tests/$DATASET_PATH/dynamical_dataset_test.py

The details here depend on the computing resources and the Zarr storage location you'll be using. Get in touch with feedback@dynamical.org for support at this point if you haven't already.
- Run a backfill on your local computer: `DYNAMICAL_ENV=prod uv run main $DATASET_ID backfill-local <append-dim-end>`. If this is fast enough and you have the disk space, it is a nice and simple approach.
- If you're working to create a public dynamical.org dataset, run `./deploy/aws/create_new_aws_open_data_bucket.sh <provider>-<model>`.
- Run a backfill on a kubernetes cluster:
  - This supports parallelism across servers to process much larger datasets.
  - Complete the steps in README.md > Deploying to the cloud > Setup.
  - Run `DYNAMICAL_ENV=prod uv run main $DATASET_ID backfill-kubernetes <append-dim-end> <jobs-per-pod> <max-parallelism>`, then track the job with `kubectl get jobs`.
- See operational cronjobs in your kubernetes cluster and check their schedule: `kubectl get cronjobs`.
- To enable issue reporting and cron monitoring with the error reporting service Sentry, create a secret in your kubernetes cluster with your Sentry account's DSN: `kubectl create secret generic sentry --from-literal='DYNAMICAL_SENTRY_DSN=xxx'`.
Follow docs/validation.md — it walks through running run-all, reading validation_summary.md, inspecting every plot, and the full data quality checklist.
Update the dataset catalog docs on dynamical.org by adding entries to catalog.js, rebuilding (npm run build), and merging the updates to main in https://github.com/dynamical-org/dynamical.org.