Integrate a dataset to reformat into Zarr.
Integrating a dataset in dynamical.org reformatters is done by subclassing a trio of base classes, customizing their behavior based on the unique characteristics of your dataset.
There are three core base classes to subclass.
- `TemplateConfig` defines the dataset structure.
- `RegionJob` defines the process by which a region of that dataset is reformatted: downloading, reading, rewriting.
- `DynamicalDataset` brings together a `TemplateConfig` and a `RegionJob` and defines the compute resources to operationally update and validate a dataset.
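To see how these pieces fit together before diving into the real code, here is a toy, self-contained sketch. The classes below are stand-ins for illustration only; the real base classes come from the reformatters package and the generated placeholder files define their actual interfaces.

```python
class ToyTemplateConfig:
    """Stand-in: describes dimensions, coordinates and data variables."""
    dims = ("init_time", "latitude", "longitude")
    data_vars = ("temperature_2m",)


class ToyRegionJob:
    """Stand-in: downloads, reads and writes one region of the dataset."""
    def __init__(self, template: ToyTemplateConfig) -> None:
        self.template = template

    def process(self) -> None:
        for var in self.template.data_vars:
            print(f"download -> read -> write: {var}")


class ToyDynamicalDataset:
    """Stand-in: ties the template and region job together and owns operational config."""
    template_config = ToyTemplateConfig()

    def backfill(self) -> None:
        ToyRegionJob(self.template_config).process()


ToyDynamicalDataset().backfill()
```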
- Provider - the agency or organization that publishes the source data. e.g. ECMWF
- Model - the model or system that produced the data. e.g. GFS
- Variant - the specific subset and structure of data from the model. e.g. forecast, analysis, climatology. Variant may include any other information needed to distinguish datasets from the same model.
- Dataset - a specific provider-model-variant. e.g. noaa-gfs-forecast
Before getting started, follow the brief setup steps in README.md > Local development > Setup.
Explore the source dataset to understand the nuances of what's available and how to access it. See docs/source_data_exploration_guide.md.
uv run main initialize-new-integration <provider> <model> <variant>

Provider, model and variant can contain letters, numbers and dashes (e.g. ICON-EU or analysis-hourly). Capitalization will be normalized for you.
This will add a number of files within src/reformatters/<provider>/<model>/<variant> and tests/<provider>/<model>/<variant>.
These files will contain placeholder implementations of the subclasses referenced above. Follow the rest of this doc for guidance on how to complete the implementations to integrate your new dataset.
Add an instance of your DynamicalDataset subclass to the DYNAMICAL_DATASETS constant in src/reformatters/__main__.py:
```python
from reformatters.provider.model.variant import ProviderModelVariantDataset

DYNAMICAL_DATASETS = [
    ...,
    ProviderModelVariantDataset(
        primary_storage_config=ProviderModelIcechunkAwsOpenDataDatasetStorageConfig(),
    ),
]
```

If you plan to write this dataset to a location not maintained by dynamical.org, you can instantiate and pass your own StorageConfig; contact feedback@dynamical.org for support.
Work through src/reformatters/$DATASET_PATH/template_config.py, setting the attributes and method definitions to describe the structure of your dataset. The report generated by following the source_data_exploration_guide.md will be helpful here.
Read the chunk/shard layout tool docs and use the tool to find chunk and shard sizes for your data variables.
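If you want to sanity-check the tool's output by hand, an uncompressed chunk or shard is just the product of its per-dimension lengths times the dtype size. A quick illustrative calculation; the shapes below are made-up placeholders, not recommendations for your dataset:

```python
# Back-of-the-envelope check of uncompressed chunk and shard sizes.
# Dimension names and lengths here are illustrative placeholders.
import numpy as np

dtype = np.dtype("float32")
chunk_shape = {"init_time": 1, "lead_time": 48, "latitude": 361, "longitude": 720}
shard_shape = {"init_time": 1, "lead_time": 192, "latitude": 721, "longitude": 1440}

chunk_mb = dtype.itemsize * np.prod(list(chunk_shape.values())) / 1e6
shard_mb = dtype.itemsize * np.prod(list(shard_shape.values())) / 1e6

print(f"uncompressed chunk ~ {chunk_mb:.0f} MB, shard ~ {shard_mb:.0f} MB")
```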
Using the information in the TemplateConfig, reformatters writes the Zarr metadata for your dataset to src/reformatters/$DATASET_PATH/templates/latest.zarr. Run this command in your terminal to create or update the template based on your TemplateConfig subclass:
uv run main $DATASET_ID update-template
git add src/reformatters/$DATASET_PATH/templates/latest.zarr

Tracking the template in git lets us review diffs of any changes to the structure of our dataset.
Run the tests, making any changes necessary.
uv run pytest tests/$DATASET_PATH/template_config_test.py

Work through src/reformatters/$DATASET_PATH/region_job.py, implementing the attributes and method definitions based on the unique structure and processing required for your dataset.
There are four required methods:
- `generate_source_file_coords` lists all the files of source data that will be processed to complete the `RegionJob`.
- `download_file` retrieves a specific source file and writes it to local disk.
- `read_data` loads data from a local path and returns a numpy array.
- `operational_update_jobs` is a factory method that returns the `RegionJob`s necessary to update the dataset with the latest available data. You can skip this until you're ready to implement dataset updates; a dataset backfill can be run with just the first three methods.
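As a rough illustration of the kind of logic the download and read steps contain, here is a standalone sketch of fetching one source file and reading a variable from it. This is not the real RegionJob interface; the generated placeholders define the actual signatures, and the URL format, GRIB engine and variable name here are assumptions.

```python
# Standalone sketch only: the real method signatures come from the RegionJob
# placeholders generated by initialize-new-integration. The URL, engine and
# variable name are hypothetical.
from pathlib import Path

import numpy as np
import requests
import xarray as xr


def download_file(url: str, local_path: Path) -> Path:
    """Stream a single source file to local disk."""
    local_path.parent.mkdir(parents=True, exist_ok=True)
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        with open(local_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=2**20):
                f.write(chunk)
    return local_path


def read_data(local_path: Path, variable: str) -> np.ndarray:
    """Load one variable from a downloaded GRIB file as a numpy array."""
    ds = xr.open_dataset(local_path, engine="cfgrib")  # requires cfgrib installed
    return ds[variable].values
```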
There are a few optional, additional methods which are described in the example code. Implement them if required for your dataset, otherwise remove them to use the base class RegionJob implementations.
Write tests for any custom logic you've created.
uv run pytest tests/$DATASET_PATH/region_job_test.py

You've reached the point where you can run the reformatter locally!
uv run main $DATASET_ID backfill-local <append_dim_end> --filter-variable-names <data var name>

Reformatting locally can be slow. Choosing an <append_dim_end> not long after your template's append_dim_start and selecting a single variable to process with --filter-variable-names can limit the amount of work.
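After a local backfill completes, it can help to open the resulting store and eyeball a value or two before moving on. A minimal sketch; the store path and variable name below are placeholders for wherever your backfill-local run wrote data:

```python
# Quick spot-check of a locally backfilled store.
# The path and variable name are placeholders specific to your run.
import xarray as xr

ds = xr.open_zarr("path/to/local/dataset.zarr")
print(ds)  # dimensions, coordinates and data variables
print(ds["temperature_2m"].isel(init_time=0).mean().compute())
```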
To operationalize your dataset and have the update and validate Kubernetes cron jobs be deployed automatically by GitHub CI, implement the two methods in src/reformatters/$DATASET_PATH/dynamical_dataset.py.
Kubernetes resource values:
- shared memory: Round the value calculated in the chunk/shard size tool output up to the nearest half GB.
- memory: 1.5x shared memory.
- cpu: the number of spatial dimension shards minus 1 to account for kubernetes headroom. e.g. if 2 latitude shards * 4 longitude shards = 8, choose 7 cpu to schedule on an 8 cpu node.
- ephemeral_storage: 20GB is a good starting point.
Parallelism: Set workers_total and parallelism on the ReformatCronJob using self.num_variable_groups(). Multiply by 2 if operational_update_jobs reprocesses the most recent time slice (see GEFS datasets for examples).
The update cron schedule should run shortly after the source data is expected to be available and the validate cron should run at update cron start + update pod_active_deadline.
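To make the arithmetic above concrete, here is a small worked example with made-up inputs; the shared memory figure would come from the chunk/shard layout tool output for your dataset:

```python
# Worked example of the resource and parallelism rules above, with illustrative inputs.
import math

shared_memory_from_tool_gb = 3.2   # from the chunk/shard layout tool output (example)
lat_shards, lon_shards = 2, 4      # spatial dimension shard counts (example)
num_variable_groups = 6            # stand-in for self.num_variable_groups()
reprocesses_recent_slice = True    # True if operational_update_jobs redoes the latest slice

shared_memory_gb = math.ceil(shared_memory_from_tool_gb * 2) / 2  # round up to nearest 0.5 GB
memory_gb = 1.5 * shared_memory_gb
cpu = lat_shards * lon_shards - 1                                 # leave headroom for kubernetes
workers = num_variable_groups * (2 if reprocesses_recent_slice else 1)

print(shared_memory_gb, memory_gb, cpu, workers)  # 3.5 5.25 7 12
```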
In dynamical_dataset_test.py create a test that runs backfill_local followed by update for a couple data variables and a minimal number of time steps, lead times and ensemble members. Include snapshot value assertions for every data variable that the test processes — check specific known values at specific coordinates (e.g. assert_allclose(point["temperature_2m"].values, [28.75, 29.23])). Snapshot values catch silent regressions in data reading, unit conversion, or coordinate alignment that other tests miss.
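A sketch of the snapshot-assertion portion of such a test; the store path, variable names, coordinates and expected values are all placeholders you would replace with known-good values from your dataset:

```python
# Snapshot-style value assertions. Paths, names, coordinates and expected
# values below are placeholders; use real known-good values from your dataset.
import numpy as np
import xarray as xr


def assert_snapshot_values(store_path: str) -> None:
    """Check a handful of known values at specific coordinates for each data variable."""
    ds = xr.open_zarr(store_path)
    point = ds.sel(latitude=40.0, longitude=255.0, method="nearest").isel(init_time=0)
    np.testing.assert_allclose(point["temperature_2m"].values, [28.75, 29.23], rtol=1e-6)
    np.testing.assert_allclose(point["precipitation_surface"].values, [0.0, 0.2], rtol=1e-6)
```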
uv run pytest tests/$DATASET_PATH/dynamical_dataset_test.py

The details here depend on the computing resources and the Zarr storage location you'll be using. Get in touch with feedback@dynamical.org for support at this point if you haven't already.
- Run a backfill on your local computer: `DYNAMICAL_ENV=prod uv run main $DATASET_ID backfill-local <append-dim-end>`. If this is fast enough and you have the disk space, it is a nice and simple approach.
- If you're working to create a public dynamical.org dataset, run `./deploy/aws/create_new_aws_open_data_bucket.sh <provider>-<model>`.
- Run a backfill on a kubernetes cluster:
  - This supports parallelism across servers to process much larger datasets.
  - Complete the steps in README.md > Deploying to the cloud > Setup.
  - Run `DYNAMICAL_ENV=prod uv run main $DATASET_ID backfill-kubernetes <append-dim-end> <jobs-per-pod> <max-parallelism>`, then track the job with `kubectl get jobs`.
- See operational cronjobs in your kubernetes cluster and check their schedule: `kubectl get cronjobs`.
- To enable issue reporting and cron monitoring with the error reporting service Sentry, create a secret in your kubernetes cluster with your Sentry account's DSN: `kubectl create secret generic sentry --from-literal='DYNAMICAL_SENTRY_DSN=xxx'`.
Follow docs/validation.md — it walks through running run-all, reading validation_summary.md, inspecting every plot, and the full data quality checklist.
Update the dataset catalog docs on dynamical.org by adding entries to catalog.js, rebuilding (npm run build), and merging the updates to main in https://github.com/dynamical-org/dynamical.org.