Replace dlt framework and overhaul ingestion code

## Summary

The dlt package provides great abstractions for dealing with data sources and running extractions within the ingestion pipelines. However, I've come across several issues:

- the default processing model (extract everything to files, process, then write, all in separate processes) is heavy for most of our use cases and in fact slows things down.
- debugging is difficult due to layers of decorators
- upgrades have broken pipelines silently, although this has happened less often of late. Most recent:

  ```
  warning: `dlt==1.27.0` is yanked (reason: "Data-loss bug: incremental merge truncates the destination table. Fixed in 1.27.2")
  ```
- the local cache of pipeline data in `~/.dlt` is hidden and makes rerunning pipelines and backfilling behave unexpectedly; most of the time we just wipe the cache and start again.
- each of the elt scripts is standalone, which makes managing dependencies challenging

Many of the source abstractions in dlt are thin, e.g.:

- database interactions are all handled by SQLAlchemy
- `requests`/`httpx` can easily be used for REST APIs
- we already had to write our own code for SharePoint access

For destinations we had to write our own Iceberg destination implementation. The abstractions required by the framework made that code quite complicated, whereas what is actually required is not too challenging.

## Action

We have our common code within `elt-common`. Create our own thin abstractions in this package, along with a centralized command called `elt`, so that the ingestion and transform steps (keep using dbt for the latter) can be run as a single command, e.g.:

```
>elt run [PIPELINE_ROOT_DIR] domain_name.source_name
```
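
To make the shape of that command concrete, here is a minimal sketch of how the arguments could be parsed. It uses the standard library's `argparse` purely for illustration; the real `cli.py` on the branch may use a different library and different names. The function names, the default of the current directory for `PIPELINE_ROOT_DIR`, and the validation rule are all assumptions.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a hypothetical `elt` command with `ls` and `run`."""
    parser = argparse.ArgumentParser(prog="elt")
    subcommands = parser.add_subparsers(dest="command", required=True)

    ls = subcommands.add_parser("ls", help="List the available pipelines")
    ls.add_argument("pipeline_root", nargs="?", type=Path, default=Path("."))

    run = subcommands.add_parser("run", help="Run a single pipeline")
    # PIPELINE_ROOT_DIR is optional; falling back to "." is an assumption.
    run.add_argument("pipeline_root", nargs="?", type=Path, default=Path("."))
    run.add_argument("pipeline", help="Name in the form domain_name.source_name")
    return parser


def parse_pipeline_name(name: str) -> tuple[str, str]:
    """Split 'domain_name.source_name' into its two parts."""
    parts = name.split(".")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected domain_name.source_name, got {name!r}")
    return parts[0], parts[1]


if __name__ == "__main__":
    args = build_parser().parse_args()
    if args.command == "run":
        domain, source = parse_pipeline_name(args.pipeline)
        print(f"would run {domain}.{source} from {args.pipeline_root}")
```

Keeping the name parsing in its own function means it can be unit tested without going through the command line.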

The branch [elt-command-without-dlt](https://github.com/ISISNeutronMuon/analytics-data-platform/tree/elt-command-without-dlt) contains code that has made some progress with this.

### What works

- The basic `ls` & `run` (just ingest): <https://github.com/ISISNeutronMuon/analytics-data-platform/blob/elt-command-without-dlt/elt-common/src/elt_common/cli.py>. An [e2e test](https://github.com/ISISNeutronMuon/analytics-data-platform/blob/elt-command-without-dlt/elt-common/tests/e2e_tests/test_ingest.py) exists.
   - This supports writing to Iceberg, including updating schemas
   - Incremental loading using watermarks saved as Iceberg properties
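
For readers unfamiliar with the watermark approach, the following is a rough sketch of the idea rather than the code on the branch. A plain `dict` of strings stands in for the Iceberg properties, and the property key `elt.watermark`, the function names, and the `updated_at` cursor column are all hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional

WATERMARK_KEY = "elt.watermark"  # hypothetical property name


def read_watermark(properties: dict) -> Optional[datetime]:
    """Return the stored watermark, or None if nothing has been loaded yet."""
    raw = properties.get(WATERMARK_KEY)
    return datetime.fromisoformat(raw) if raw is not None else None


def select_new_rows(rows: list, cursor_column: str, watermark: Optional[datetime]) -> list:
    """Keep only rows whose cursor value is strictly newer than the watermark."""
    if watermark is None:
        return list(rows)
    return [row for row in rows if row[cursor_column] > watermark]


def next_watermark(rows: list, cursor_column: str, current: Optional[datetime]) -> Optional[datetime]:
    """Advance the watermark to the newest cursor value seen; never move it back."""
    values = [row[cursor_column] for row in rows]
    if not values:
        return current
    newest = max(values)
    return newest if current is None else max(current, newest)


def write_watermark(properties: dict, watermark: datetime) -> dict:
    """Return a copy of the properties with the watermark stored as an ISO string."""
    updated = dict(properties)
    updated[WATERMARK_KEY] = watermark.isoformat()
    return updated


rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2024, 1, 3, tzinfo=timezone.utc)},
]
properties: dict = {}
new_rows = select_new_rows(rows, "updated_at", read_watermark(properties))
properties = write_watermark(properties, next_watermark(new_rows, "updated_at", None))
```

In this sketch a second run over the same rows selects nothing, because both rows are no newer than the stored watermark. Storing the watermark alongside the table avoids a separate local state cache like `~/.dlt`.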

### What is needed

- Finishing elt-common rework:

   - Sources/ingestion:

      - The latest commit has some code for pulling from an SQL database. This is basically complete but there may be a use case for determining the table names dynamically, for example if you just want to reflect a whole schema. Check with Chi Kai.
      - The [m365](https://github.com/ISISNeutronMuon/analytics-data-platform/tree/elt-command-without-dlt/elt-common/src/elt_common/sources/m365) code was ported using Claude as a test. It looks okay, but the tests are certainly missing. An alternative is to start again by porting the current m365 dlt source and removing the dlt bits.
      - For the REST APIs we could leave things as they are for now, with each ingest job handling that itself, as both `requests` and `httpx` are pretty easy to use. If lots more use cases develop then we could implement some thin wrappers to pull out any commonality.

   - To completely replace the [cron script](https://github.com/ISISNeutronMuon/analytics-data-platform/blob/elt-command-without-dlt/infra/ansible/roles/elt/templates/cron/elt_task.sh.j2) the `run` command needs to also run the `dbt` models dependent on those sources. Add code to execute dbt via the elt command.

   - I'd also like there to be a common entrypoint for running unit tests on the sources that get implemented. Some may have quite complex logic that it would be nice to test in isolation.

- Switching existing pipelines to the new framework. This also involves combining the current warehouse directories for landing and transform (landing is really an implementation detail). I think the new top-level name should be `elt-pipelines` rather than `warehouses`. My thinking was to have a directory structure as follows:

  ```text
  .
  |-- elt-common/
  |-- elt-pipelines/
      |-- facility_ops  # this is the root of the pipeline for the `facility_ops` warehouse (where the dbt-transformed tables end up)
          |-- ingest/
          |   |-- domain_a/
          |   |   |-- source_a/
          |   |   |-- source_b/
          |   |-- domain_b/
          |-- transform/
  ```

   - We also need to consider how we manage dependencies for the different source modules. My thought was using a `pyproject.toml` file in the root of the `facility_ops` directory where the separate sources could be optional dependencies. This possibly requires more thought but the current approach of script headers won't scale well.
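
As an illustration of how the proposed directory layout could map onto the `domain_name.source_name` names passed to `elt run`, here is a small sketch that walks `ingest/<domain>/<source>/`. It is not the `ls` implementation on the branch; the function name and the rule that every second-level directory counts as a source are assumptions.

```python
from pathlib import Path


def list_sources(pipeline_root: Path) -> list:
    """Return 'domain.source' names found under <pipeline_root>/ingest."""
    ingest_dir = pipeline_root / "ingest"
    if not ingest_dir.is_dir():
        return []

    names = []
    # First level below ingest/ is the domain, second level is the source.
    for domain_dir in sorted(p for p in ingest_dir.iterdir() if p.is_dir()):
        for source_dir in sorted(p for p in domain_dir.iterdir() if p.is_dir()):
            names.append(f"{domain_dir.name}.{source_dir.name}")
    return names


if __name__ == "__main__":
    for name in list_sources(Path("elt-pipelines/facility_ops")):
        print(name)
```

With the example tree above, this would list `domain_a.source_a` and `domain_a.source_b`; `domain_b` contributes nothing until it contains a source directory.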
