Skip to content

Fix: dataset deletion cleanup gaps#18

Closed
AlexianMasson wants to merge 15 commits into
mainfrom
fix/dataset-deletion-cleanup
Closed

Fix: dataset deletion cleanup gaps#18
AlexianMasson wants to merge 15 commits into
mainfrom
fix/dataset-deletion-cleanup

Conversation

@AlexianMasson
Copy link
Copy Markdown
Collaborator

@AlexianMasson AlexianMasson commented Jun 2, 2026

@AlexianMasson AlexianMasson force-pushed the fix/dataset-deletion-cleanup branch 7 times, most recently from 76fbbbc to c8f3ef1 Compare June 3, 2026 08:30
Dataset deletion removed the feature type but left the layer's
.r/.w ACL rules in GeoServer. Since the final table name becomes
reusable once the row is deleted, a stale rule could silently apply
to a future dataset published under the same name.

Add GeoServerService.delete_layer_acl (best-effort, 404-tolerant)
and call it from DatasetDeletionService next to delete_layer.
The generator gained a normalize_nan import that the stub never
provided, so the test module failed at collection.
In refresh mode (scheduled re-ingestion), clean_staging_table_task ran
only on transform success, so every failed scheduled run leaked an
untracked temp_<hex> table in the staging schema that dataset deletion
could never clean up.

Give the task group a clean_on_failure flag that switches the cleanup
task to TriggerRule.ALL_DONE, enabled for transform_refresh only.
Direct mode keeps ALL_SUCCESS so the user's tracked staging table
survives a failed transform (it is dropped on dataset deletion). In
direct-mode runs the refresh cleanup task now executes but exits early
via its existing XCom-None guard, and a succeeding ALL_DONE leaf does
not flip a failed DAG run to success, so failure callbacks still fire.
DAG deletion was guarded by the schedule still being set. When the
schedule had been cleared before the dataset was deleted, the stale
ingestion_<id> DAG (metadata + run history) stayed in Airflow forever.

Call delete_dag unconditionally — it already treats 404 as success.
Clearing a recurrence schedule (DELETE/PATCH /schedule, or
re-processing without recurrence) only cancelled runs / updated the
DB. The DAG generator then stopped emitting ingestion_<id>, leaving
stale DAG metadata and run history in Airflow.

Add remove_ingestion_dag (cancel runs + delete DAG, best-effort) and
call it from every code path that clears an existing schedule.
The temp upload file (FILE imports) was only removed in the staging
success callback. On staging failure the IntegrityLink row was deleted
but the file stayed in TMP_UPLOAD_PATH with nothing referencing it
anymore.
A staging_dag or process_dag run still in flight when the dataset was
deleted would recreate the staging/final table after cleanup (its
callbacks 404, leaving an orphaned table with no GS/GN artifacts).

Force-fail all runs of the dataset first (best-effort): the scheduled
ingestion DAG, process_dag runs (prefix '<id>_') and staging_dag runs
(prefix '<id>' — the first staging run id is exactly the uuid).
staging_dag/process_dag run records (with task instances and XComs
holding staging/final table names) outlived the dataset forever.

Add purge_dataset_dag_runs, matching run ids by SQL LIKE prefix
('<id>%' for staging_dag, '<id>_%' for process_dag), called best-effort
before the IntegrityLink row is deleted.

Known limitation: failed-run log files under /opt/airflow/logs live on
the Airflow volume and cannot be reached from the backend; success-run
logs are already removed by the DAG success callback.
The per-org GeoServer workspace ({org}) and datastore ({org}_ds) and
the org PostgreSQL schema survived forever, even after the org's last
dataset was deleted.

When no IntegrityLink remains for the organization, delete the
datastore and workspace via raw non-recursive REST calls (GeoServer
refuses to delete non-empty resources, so anything still in use
survives; geoservercloud's own delete helpers hardcode recurse=true and
are deliberately avoided) and DROP SCHEMA ... RESTRICT the org schema.
The shared 'staging' and default 'data' schemas are never touched.
Run-id formats were built and matched with hand-written literals at
seven sites across staging, process, log routes and the airflow client,
with nothing keeping them in sync. One module now owns the make/match
helpers; the elt generator (which cannot import backend code) gets a
cross-reference comment. No behavior change.
_force_fail_dag_runs fetched a single unfiltered page (server default
100) of the shared process_dag/staging_dag history and filtered
client-side: once a shared DAG accumulates more than a page of runs,
in-flight runs of the target dataset can sit past page 1 and silently
survive cancellation — defeating the cancel-before-delete protection.

Both helpers now share one paginated iterator (_for_each_dag_run);
force-fail passes state=[running,queued] and the run-id LIKE pattern so
the server returns only relevant runs, and re-queries until none remain.
The literal-prefix re-check guards against LIKE '_' wildcard overmatch.
Clearing a recurrence cancelled every in-flight run of the dataset —
including manual process runs and staging runs (which are always
user-initiated) — when the user only meant to stop scheduled
ingestions. Split the cancel:

- cancel_dataset_runs: everything (ingestion + process + staging),
  used by dataset deletion where no run may keep writing to the tables
- cancel_scheduled_runs: ingestion runs + non-'_manual' process runs
  only, used by remove_ingestion_dag on schedule-clear
Clearing a recurrence was implemented three times (DELETE /schedule,
update_schedule with a had_schedule flag, process_staging_data) with
divergent guards, each having to remember the remove_ingestion_dag
call. One helper now pairs the field reset with the DAG removal; a
future caller cannot forget it.
The FILE-only cleanup block was byte-for-byte duplicated between the
success and failure callbacks; the eligibility predicate already
drifted once, so keep it in one place.
The never-drop guard hardcoded the 'staging' literal inside the
deletion service; when get_staging_schema becomes configurable the
guard would silently stop covering it. is_shared_schema now derives
from get_staging_schema/DEFAULT_DATA_SCHEMA in core.config.
@AlexianMasson AlexianMasson force-pushed the fix/dataset-deletion-cleanup branch from c8f3ef1 to 0e7ce13 Compare June 3, 2026 09:09
@tonio
Copy link
Copy Markdown
Contributor

tonio commented Jun 3, 2026

Not tested everything, but seems to globally work fine. Code review in splitted PRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants