
Persist workspace volume cleanup state to stop repeated GC removal attempts #1351

@rowan-stein

Description


Summary

The GC logs show repeated attempts to remove workspace volumes for closed threads on every sweep, and again after each server restart. Example:

[Nest] ... LOG [ContainerService] Removing volume name=ha_ws_<id> force=true
[Nest] ... DEBUG [ContainerService] Volume already removed name=ha_ws_<id>

This indicates the system does not persist the fact that a thread’s workspace volume was already removed (or is absent). The volume GC reprocesses the same closed threads indefinitely because candidate selection includes every thread with status='closed' and the cooldown lives only in memory.

User report

  • A growing number of log entries related to volume removal.
  • After a server restart the same logs appear again, implying the DB retains outdated state.

Research specification (Emerson Gray)

Root cause

  • VolumeGcService selects candidates with where: { status: 'closed' } and never persists a per-thread cleanup state.
  • Cooldown is in-memory; restarts reset it and cause retries on the same closed threads.
  • Docker removal is idempotent (404 swallowed) but the DB doesn’t reflect “already cleaned”.
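The interaction of the last two bullets can be sketched as follows. The `cooldown` map, `COOLDOWN_MS`, and `isThrottled` are hypothetical names for illustration, not the actual internals of VolumeGcService:

```typescript
// Hypothetical sketch of the current (buggy) behavior described above.
// The in-memory cooldown map is lost on restart, so every closed thread
// becomes a fresh candidate on the first sweep after a redeploy.
const cooldown = new Map<string, number>(); // threadId -> last attempt (ms epoch)
const COOLDOWN_MS = 60 * 60 * 1000; // illustrative value only

function isThrottled(threadId: string, nowMs: number): boolean {
  const last = cooldown.get(threadId);
  return last !== undefined && nowMs - last < COOLDOWN_MS;
}

// Candidate selection has no "already cleaned" filter, e.g.:
//   prisma.thread.findMany({ where: { status: 'closed' } })
// so the same closed threads are returned on every sweep.
```

Because the map starts empty after a restart, `isThrottled` returns false for every thread and the GC retries all of them, producing the log pattern shown above.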

Fix plan

  1. Schema: add a nullable timestamp field on Thread:

    • workspaceVolumeRemovedAt DateTime?
    • Migration: add column; backward compatible; optional index (status, workspaceVolumeRemovedAt).
  2. VolumeGcService:

    • Change candidate selection to where: { status: 'closed', workspaceVolumeRemovedAt: null }.
    • When no containers reference the volume:
      • On successful removal OR 404/not_found, set workspaceVolumeRemovedAt = now.
      • On other errors, do not mark; allow retry.
    • Keep in-memory cooldown as secondary throttle.
    • Use idempotent guarded update: updateMany({ where: { id, workspaceVolumeRemovedAt: null }, data: { workspaceVolumeRemovedAt: now } }).
  3. ThreadCleanupCoordinator + Provider:

    • deleteWorkspaceVolume(threadId): after removal or confirmed absence, mark workspaceVolumeRemovedAt.
    • Adjust provider contract to return a small outcome enum (recommended):
      • { outcome: 'removed' | 'not_found' | 'referenced' }.
    • container.service.ts removeVolume: return 'removed' | 'not_found'; throw on other errors; keep swallowing 404.
  4. Concurrency & idempotency:

    • Removal remains idempotent at Docker.
    • DB marking is race-safe via conditional update; GC and coordinator won’t conflict.
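The steps above can be sketched as one piece. `ThreadStore` below is a tiny in-memory stand-in for the Prisma queries (names like `workspaceVolumeRemovedAt` come from the plan; `handleSweepResult` is a hypothetical helper), so the filter and the race-safe guarded update can be demonstrated without a database:

```typescript
// Sketch of the proposed GC marking logic under the assumptions above.
type RemovalOutcome = 'removed' | 'not_found' | 'referenced';

interface ThreadRow {
  id: string;
  status: 'open' | 'closed';
  workspaceVolumeRemovedAt: Date | null;
}

class ThreadStore {
  constructor(private rows: ThreadRow[]) {}

  // Mirrors: findMany({ where: { status: 'closed', workspaceVolumeRemovedAt: null } })
  candidates(): ThreadRow[] {
    return this.rows.filter(
      (r) => r.status === 'closed' && r.workspaceVolumeRemovedAt === null,
    );
  }

  // Mirrors the guarded updateMany: only a row still NULL is touched, so
  // concurrent markers (GC vs. coordinator) cannot conflict; returns the
  // affected-row count like updateMany does.
  markCleaned(id: string, now: Date): number {
    const row = this.rows.find(
      (r) => r.id === id && r.workspaceVolumeRemovedAt === null,
    );
    if (!row) return 0;
    row.workspaceVolumeRemovedAt = now;
    return 1;
  }
}

function handleSweepResult(
  store: ThreadStore,
  id: string,
  outcome: RemovalOutcome,
  now: Date,
): void {
  // 'removed' and 'not_found' both mean the volume is gone: persist that.
  // 'referenced' (or a thrown error upstream) leaves the row for retry.
  if (outcome === 'removed' || outcome === 'not_found') {
    store.markCleaned(id, now);
  }
}
```

After one sweep, threads whose volumes were removed or already absent drop out of `candidates()` permanently, which is exactly the persistence the in-memory cooldown could not provide.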

Tests

  • VolumeGcService:
    • Filters candidates to closed threads with workspaceVolumeRemovedAt=NULL.
    • Marks the timestamp on removed and not_found; does not mark when referenced; does not mark on non-404 errors; does not mark when the container runner is down.
  • ThreadCleanupCoordinator:
    • Marks on success and not_found; skips marking on referenced; does not mark on error.
  • Optional: provider/container.service mapping of 404 to not_found.
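The optional 404-mapping test could look roughly like this. `mapRemoveError` and the `statusCode` error shape are assumptions for illustration, since the issue does not show the Docker client's actual error type:

```typescript
// Hypothetical helper mirroring the proposed container.service.ts contract:
// 404 ("volume already removed") maps to 'not_found'; anything else rethrows,
// so callers never mark workspaceVolumeRemovedAt on a real failure.
type RemoveResult = 'removed' | 'not_found';

function mapRemoveError(err: { statusCode?: number }): RemoveResult {
  if (err.statusCode === 404) return 'not_found';
  throw err;
}
```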

Acceptance criteria

  • After first sweep post-deploy, threads whose volumes are removed/absent are marked and no longer retried.
  • Server restart does not re-trigger volume removal attempts for already cleaned threads.
  • Existing behavior for open threads and referenced volumes is preserved.
