Summary
Repeated GC logs show attempts to remove workspace volumes for closed threads on every sweep and after server restarts. Example:
[Nest] ... LOG [ContainerService] Removing volume name=ha_ws_<id> force=true
[Nest] ... DEBUG [ContainerService] Volume already removed name=ha_ws_<id>
This indicates the system doesn’t persist the fact that a thread’s workspace volume was already removed (or is absent). The Volume GC reprocesses the same closed threads indefinitely because candidate selection includes every thread with status='closed' and the cooldown is held only in memory, so it is lost on restart.
User report
- Increasing logs related to removal of volumes.
- After a server restart the same logs appear again, implying the DB contains outdated state.
Research specification (Emerson Gray)
Root cause
- VolumeGcService selects candidates with where: { status: 'closed' } and never persists a per-thread cleanup state.
- Cooldown is in-memory; restarts reset it and cause retries on the same closed threads.
- Docker removal is idempotent (404 swallowed) but the DB doesn’t reflect “already cleaned”.
Fix plan
- Schema: add a nullable timestamp field on Thread: workspaceVolumeRemovedAt DateTime?
  - Migration: add column; backward compatible; optional index (status, workspaceVolumeRemovedAt).
- VolumeGcService:
  - Change candidate selection to where: { status: 'closed', workspaceVolumeRemovedAt: null }.
  - When no containers reference the volume:
    - On successful removal OR 404/not_found, set workspaceVolumeRemovedAt = now.
    - On other errors, do not mark; allow retry.
  - Keep in-memory cooldown as secondary throttle.
  - Use idempotent guarded update: updateMany({ where: { id, workspaceVolumeRemovedAt: null }, data: { workspaceVolumeRemovedAt: now } }).
- ThreadCleanupCoordinator + Provider:
  - deleteWorkspaceVolume(threadId): after removal or confirmed absence, mark workspaceVolumeRemovedAt.
  - Adjust provider contract to return a small outcome enum (recommended): { outcome: 'removed' | 'not_found' | 'referenced' }.
  - container.service.ts removeVolume: return 'removed' | 'not_found'; throw on other errors; keep swallowing 404.
- Concurrency & idempotency:
  - Removal remains idempotent at Docker.
  - DB marking is race-safe via conditional update; GC and coordinator won’t conflict.
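A minimal sketch of the sweep logic described in the fix plan, under stated assumptions. FakeThreadStore is a hypothetical in-memory stand-in for the Thread table (the real code would go through the ORM's updateMany), sweepOnce and RemoveOutcome are illustrative names, and the calls are synchronous only to keep the example short; real Docker and DB calls would be awaited.

```typescript
// Outcome values taken from the proposed provider contract.
type RemoveOutcome = 'removed' | 'not_found' | 'referenced';

interface ThreadRow {
  id: string;
  status: 'open' | 'closed';
  workspaceVolumeRemovedAt: Date | null;
}

// Hypothetical in-memory stand-in for the Thread table.
class FakeThreadStore {
  private rows: ThreadRow[];

  constructor(rows: ThreadRow[]) {
    this.rows = rows;
  }

  // Mirrors where: { status: 'closed', workspaceVolumeRemovedAt: null }.
  findCandidates(): ThreadRow[] {
    return this.rows.filter(
      (r) => r.status === 'closed' && r.workspaceVolumeRemovedAt === null,
    );
  }

  // Mirrors the guarded updateMany: only a row that is still unmarked is
  // touched, so a second caller gets count 0 instead of overwriting.
  markRemoved(id: string, now: Date): { count: number } {
    let count = 0;
    for (const r of this.rows) {
      if (r.id === id && r.workspaceVolumeRemovedAt === null) {
        r.workspaceVolumeRemovedAt = now;
        count++;
      }
    }
    return { count };
  }
}

// One GC pass. Returns the ids that this pass marked as cleaned.
function sweepOnce(
  store: FakeThreadStore,
  deleteWorkspaceVolume: (threadId: string) => { outcome: RemoveOutcome },
  now: Date = new Date(),
): string[] {
  const marked: string[] = [];
  for (const thread of store.findCandidates()) {
    try {
      const { outcome } = deleteWorkspaceVolume(thread.id);
      if (outcome === 'referenced') continue; // still in use; retry later
      // 'removed' and 'not_found' both mean the volume is gone.
      if (store.markRemoved(thread.id, now).count > 0) marked.push(thread.id);
    } catch {
      // Any other failure: leave the row unmarked so a later sweep retries.
    }
  }
  return marked;
}
```

Because the "already cleaned" state lives in the row rather than in process memory, a second call to sweepOnce against the same store only revisits threads whose volume was referenced or whose removal failed, which is the behavior expected after a restart.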
Tests
- VolumeGcService:
  - Filters candidates to closed threads with workspaceVolumeRemovedAt=NULL.
  - Marks timestamp on removed and not_found; does not mark when referenced; does not mark on non-404 error; respects runner down.
- ThreadCleanupCoordinator:
  - Marks on success and not_found; skips marking on referenced; does not mark on error.
- Optional: provider/container.service mapping of 404 to not_found.
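The optional 404 mapping test could target a small helper along these lines. This is a sketch, not the actual container.service.ts code: removeVolumeOutcome and isNotFound are hypothetical names, the remove callback stands in for the real Docker call, and the error shape (an object with a numeric statusCode) is an assumption about the Docker client in use. It is synchronous for brevity.

```typescript
type VolumeRemoval = 'removed' | 'not_found';

// Assumption: the Docker client signals a missing volume with an error
// object that carries statusCode === 404.
function isNotFound(err: unknown): boolean {
  return (
    typeof err === 'object' &&
    err !== null &&
    (err as { statusCode?: unknown }).statusCode === 404
  );
}

// `remove` is a hypothetical callback wrapping the real volume removal.
function removeVolumeOutcome(remove: () => void): VolumeRemoval {
  try {
    remove();
    return 'removed';
  } catch (err) {
    // Keep swallowing 404, but report it so the caller can mark the thread.
    if (isNotFound(err)) return 'not_found';
    // Everything else propagates so the thread stays unmarked and is retried.
    throw err;
  }
}
```

Returning a value for the 404 case instead of silently swallowing it is what lets the GC and the coordinator tell "confirmed absent" apart from "failed".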
Acceptance criteria
- After first sweep post-deploy, threads whose volumes are removed/absent are marked and no longer retried.
- Server restart does not re-trigger volume removal attempts for already cleaned threads.
- Existing behavior for open threads and referenced volumes is preserved.