Replace dlt framework and overhaul ingestion code #321

@martyngigg

Summary

The dlt package provides great abstractions for dealing with data sources and running extractions within the ingestion pipelines. However, I've come across several issues:

  • the default processing model (extract everything to files, process, and then write, all in separate processes) is heavy for most of our use cases and in fact slows things down.
  • debugging is difficult due to layers of decorators
  • upgrades have broken pipelines silently, although this has not happened so much of late.
    Most recent: warning: `dlt==1.27.0` is yanked (reason: "Data-loss bug: incremental merge truncates the destination table. Fixed in 1.27.2")
  • the local cache of pipeline data in ~/.dlt is hidden and causes unexpected behavior when rerunning pipelines or backfilling; most of the time we just wipe the cache and start again.
  • each of the elt scripts is a standalone script, which makes managing dependencies challenging

Many of the source abstractions in dlt are thin, e.g.:

  • database interactions are all handled by sqlalchemy
  • requests/httpx can easily be used for REST apis
  • we already had to write our own code for sharepoint access

For destinations, we had to write our own Iceberg destination implementation. The abstractions required by the framework made that code quite complicated, whereas what is actually required is not too challenging.
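
To illustrate how thin such abstractions could be, here is a minimal sketch that streams batches of rows from a source to a destination in a single process, rather than extracting everything to files first. All names here (`Source`, `Destination`, `SqlTableSource`, `InMemoryDestination`, `ingest`) are hypothetical and not taken from elt-common. The standard-library `sqlite3` module stands in for sqlalchemy, and an in-memory dictionary stands in for Iceberg, purely to keep the example self-contained.

```python
import sqlite3
from typing import Any, Iterator, Protocol

Batch = list[dict[str, Any]]


class Source(Protocol):
    """Hypothetical interface: anything that can yield batches of rows."""

    def batches(self) -> Iterator[Batch]: ...


class Destination(Protocol):
    """Hypothetical interface: anything that can accept batches of rows."""

    def write(self, table: str, batch: Batch) -> None: ...


class SqlTableSource:
    """Reads a whole table in fixed-size batches (sqlite3 stands in for sqlalchemy)."""

    def __init__(self, connection: sqlite3.Connection, table: str, batch_size: int = 1000):
        self.connection = connection
        self.table = table
        self.batch_size = batch_size

    def batches(self) -> Iterator[Batch]:
        # The table name is assumed to come from trusted pipeline configuration.
        cursor = self.connection.execute(f'SELECT * FROM "{self.table}"')
        columns = [description[0] for description in cursor.description]
        while rows := cursor.fetchmany(self.batch_size):
            yield [dict(zip(columns, row)) for row in rows]


class InMemoryDestination:
    """Collects rows in a dict; a real destination would write to Iceberg."""

    def __init__(self) -> None:
        self.tables: dict[str, Batch] = {}

    def write(self, table: str, batch: Batch) -> None:
        self.tables.setdefault(table, []).extend(batch)


def ingest(source: Source, destination: Destination, table: str) -> int:
    """Stream batches from source to destination and return the row count."""
    total = 0
    for batch in source.batches():
        destination.write(table, batch)
        total += len(batch)
    return total
```

Because each batch is written as soon as it is read, there are no intermediate files and no hidden local cache to manage, and the plain classes are straightforward to step through in a debugger.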

Action

We have our common code within elt-common. Create our own thin abstractions in this package, along with a centralized command called elt, so that the ingestion and transform steps (keep using dbt for the latter) can be run as a single command, e.g.:

>elt run [PIPELINE_ROOT_DIR] domain_name.source_name
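
As a rough sketch of how such a command could be wired up, the following uses `argparse` from the standard library. It is not the code on the branch. The function names are hypothetical, the root directory is treated as a required argument for simplicity, and the `ingest/<domain_name>/<source_name>` lookup assumes the directory layout proposed later in this issue.

```python
import argparse
from pathlib import Path
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build the parser for a hypothetical `elt` command with a `run` subcommand."""
    parser = argparse.ArgumentParser(prog="elt")
    subcommands = parser.add_subparsers(dest="command", required=True)
    run = subcommands.add_parser("run", help="ingest a source, then transform")
    run.add_argument("pipeline_root_dir", type=Path)
    run.add_argument("source", help="identifier of the form domain_name.source_name")
    return parser


def resolve_source_dir(pipeline_root_dir: Path, source: str) -> Path:
    """Map domain_name.source_name to a directory under the pipeline root.

    Assumes sources live in <root>/ingest/<domain_name>/<source_name>.
    """
    domain_name, separator, source_name = source.partition(".")
    if not (domain_name and separator and source_name) or "." in source_name:
        raise ValueError(f"expected domain_name.source_name, got {source!r}")
    return pipeline_root_dir / "ingest" / domain_name / source_name


def main(argv: Optional[Sequence[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    source_dir = resolve_source_dir(args.pipeline_root_dir, args.source)
    # Placeholder: a real implementation would load and run the source here.
    print(f"would run ingestion from {source_dir}")
    return 0
```

A console-script entry point in elt-common could then point at `main`, so that `elt run elt-pipelines/facility_ops domain_a.source_a` resolves to the matching source directory.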

The branch elt-command-without-dlt contains code that has made some progress with this.

What works

What is needed

  • Finishing elt-common rework:

    • Sources/ingestion:

      • The latest commit has some code for pulling from an SQL database. This is basically complete but there may be a use case for determining the table names dynamically, for example if you just want to reflect a whole schema. Check with Chi Kai.
      • The m365 code was ported using Claude as a test. It looks okay but the tests are certainly missing. One option is to just start again by porting the current m365 dlt source and removing the dlt bits...
      • For the REST APIs we could leave that for now: each ingest job handles it itself, as both requests and httpx are pretty easy to use. If lots more use cases develop, we could implement some thin wrappers to pull out any commonality.
    • To completely replace the cron script the run command needs to also run the dbt models dependent on those sources. Add code to execute dbt via the elt command.

    • I'd also like there to be a common entrypoint for running unit tests on the sources that get implemented. Some may have quite complex logic that would be nice to test in isolation.

  • Switching existing pipelines to the new framework. This also involves combining the current warehouse directories for landing and transform (landing is really an implementation detail). I think the new top-level name should be elt-pipelines rather than warehouses. My thinking was to have a directory structure as follows:

    .
    |-- elt-common/
    |-- elt-pipelines/
        |-- facility_ops  # this is the root of pipeline for the `facility_ops` warehouse (where the dbt transformed tables end up)
            |-- ingest/
            |   |-- domain_a/
            |   |   |-- source_a/
            |   |   |-- source_b/
            |   |-- domain_b/
            |-- transform/
    
    • We also need to consider how we manage dependencies for the different source modules. My thought was using a pyproject.toml file in the root of the facility_ops directory where the separate sources could be optional dependencies. This possibly requires more thought but the current approach of script headers won't scale well.
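
For the requirement above that the run command also executes the dbt models dependent on an ingested source, one possible approach is to shell out to the dbt CLI with a `source:` selector and the `+` graph operator, which selects everything downstream of a source. The sketch below assumes the dbt project lives in the `transform/` directory of the layout above and that the dbt source name is known to the caller. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def dbt_build_command(transform_dir: Path, dbt_source_name: str) -> list[str]:
    """Return a dbt invocation that builds everything downstream of a source.

    `source:<name>+` selects the named dbt source and all nodes that depend on it.
    """
    return [
        "dbt",
        "build",
        "--project-dir",
        str(transform_dir),
        "--select",
        f"source:{dbt_source_name}+",
    ]


def run_dbt(transform_dir: Path, dbt_source_name: str) -> int:
    """Run dbt as a subprocess and return its exit code (requires dbt on PATH)."""
    command = dbt_build_command(transform_dir, dbt_source_name)
    return subprocess.run(command, check=False).returncode
```

Keeping the command construction separate from the subprocess call means the selection logic can be unit tested without dbt installed.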

Labels

pipelines (relates to the ingestion/modelling pipelines)
