Integrate ‘ingestion’ scripts from CKAN set-up contractors #34

@khaeru

As part of the contract to develop transport-data/tdc-data-portal, the contractor wrote some “data ingestion scripts”. These are two directories in that repo:

Unfortunately:

  • The scripts for the JRC IDEES source and Eurostat provider duplicate the contents of transport_data.jrc and transport_data.estat; the script for GFEI data seems to be a fixed mirror (i.e. not reusable) of the GFEI Zenodo record.
  • The code seems extremely verbose (the JRC file is 4400 lines without formatting; data-integration/process_tdc.py is 20000 lines) and involves a lot of copy-and-paste duplication.
  • SDMX metadata are not generated; metadata are fed directly into CKAN via API calls.

More info:

  • The scripts do serve as a complete, working example of how to interact with CKAN through its APIs, although they call requests directly rather than going through a CKAN API client (Add a CKAN client #3).
  • According to the contractor, the scripts either create records or skip those that exist; they do not update metadata on existing records if it has changed.
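
For orientation, here is a minimal sketch of what a thin wrapper around the CKAN Action API could look like, as an alternative to scattering raw HTTP calls through the scripts. It is not taken from the contractor's code or from any existing client. The names `CKANError`, `action_url`, `unwrap` and `call_action` are hypothetical, and it assumes the standard Action API layout (`/api/3/action/<name>`, JSON envelope with `success` and `result`). It uses only the standard library so that it runs anywhere; the same shape would work with requests.

```python
import json
from urllib import request as urlrequest


class CKANError(RuntimeError):
    """Hypothetical exception: CKAN answered with success=false."""


def action_url(base_url: str, action: str) -> str:
    """Build the Action API URL for `action`, tolerating a trailing slash."""
    return f"{base_url.rstrip('/')}/api/3/action/{action}"


def unwrap(envelope: dict):
    """Return the `result` of a CKAN response envelope, or raise CKANError."""
    if not envelope.get("success"):
        raise CKANError(envelope.get("error", {}))
    return envelope["result"]


def call_action(base_url: str, action: str, payload: dict, token=None):
    """POST `payload` as JSON to one action and return the unwrapped result.

    Note: urlopen raises urllib.error.HTTPError for 4xx/5xx responses
    (e.g. a 404 for a missing dataset); callers need to handle that.
    """
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: the API token is sent in the Authorization header.
        headers["Authorization"] = token
    req = urlrequest.Request(
        action_url(base_url, action),
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    with urlrequest.urlopen(req) as resp:
        return unwrap(json.load(resp))
```

Keeping URL construction and envelope handling in two small pure functions means they can be tested without a running CKAN instance.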

To resolve, likely in multiple issues/PRs:

  • Integrate the functions of the scripts into existing modules in the current package.
  • Replace the workflow that calls the scripts with a workflow calling, e.g. tdc jrc refresh-ckan
  • Add functionality to identify existing records and update as needed.
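
The last item could be approached roughly as follows. This is only a sketch under stated assumptions, not a proposed final design: `upsert_dataset`, `changed_fields`, `NotFound` and `FakeCKAN` are hypothetical names, the dataset name `jrc-idees` is made up for illustration, and the `call` argument stands in for whatever function ends up talking to CKAN. The action names `package_show`, `package_create` and `package_patch` are the ones in CKAN's Action API.

```python
from typing import Any, Callable, Dict

Caller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


class NotFound(Exception):
    """Hypothetical: raised by `call` when CKAN has no such dataset."""


def changed_fields(existing: dict, desired: dict) -> dict:
    """Return the entries of `desired` that differ from `existing`."""
    return {k: v for k, v in desired.items() if existing.get(k) != v}


def upsert_dataset(call: Caller, desired: dict) -> str:
    """Create the dataset if missing; patch only the fields that changed."""
    try:
        existing = call("package_show", {"id": desired["name"]})
    except NotFound:
        call("package_create", desired)
        return "created"
    diff = changed_fields(existing, desired)
    if not diff:
        return "unchanged"
    call("package_patch", {"id": existing["id"], **diff})
    return "updated"


class FakeCKAN:
    """In-memory stand-in for CKAN, for demonstration only (id == name)."""

    def __init__(self) -> None:
        self.records: Dict[str, dict] = {}

    def __call__(self, action: str, payload: dict) -> dict:
        if action == "package_show":
            if payload["id"] not in self.records:
                raise NotFound(payload["id"])
            return dict(self.records[payload["id"]])
        if action == "package_create":
            self.records[payload["name"]] = {"id": payload["name"], **payload}
            return dict(self.records[payload["name"]])
        if action == "package_patch":
            self.records[payload["id"]].update(payload)
            return dict(self.records[payload["id"]])
        raise ValueError(f"unsupported action: {action}")


ckan = FakeCKAN()
meta = {"name": "jrc-idees", "title": "JRC IDEES"}  # illustrative metadata
print(upsert_dataset(ckan, meta))
print(upsert_dataset(ckan, meta))
print(upsert_dataset(ckan, {**meta, "title": "JRC IDEES (revised)"}))
```

Passing the caller in as an argument keeps the create/skip/update decision separate from the HTTP layer, so the same logic could sit behind a command such as the `tdc jrc refresh-ckan` suggested above.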
