feat: capture checkpoints via Snapshot/SnapshotContent (Component C) by Ronkahn21 · Pull Request #3 · Ronkahn21/dynamo

Ronkahn21 · 2026-06-15T14:40:40Z

Overview:

Component C of the Snapshot/SnapshotContent migration: move container-checkpoint capture onto the new CRDs using a CSI VolumeSnapshot / external-snapshotter split. Restore is untouched (stays on DynamoCheckpoint + pod annotations). External DGD API is unchanged.

Details:

Operator SnapshotReconciler (new, cluster-wide) ≈ snapshot-controller: creates the cluster-scoped SnapshotContent work order (source podRef{name,uid} + nodeName + checkpointID/artifactVersion metadata), binds Snapshot.status.boundSnapshotContentName, mirrors SnapshotContent.status → Snapshot.status, and owns finalizers on both. Cluster→namespaced ownerRef is impossible, so the link is spec.snapshotRef + a finalizer; the status-mirror mapper unwraps cache.DeletedFinalStateUnknown so deletes don't wedge.
DynamoCheckpoint controller: Owns(&Snapshot{}) (update-only predicate); completion reads Snapshot.status; deadline + Job-failure hang guards.
Agent SnapshotContentReconciler (new, per-node, action-only) ≈ CSI driver: controller-runtime manager with a node-label-scoped cache; dumps via a NodeCheckpointer interface; writes only SnapshotContent.status. UID provenance check, inFlight guard + lease-renewal goroutine. The capture-side pod informer + reconcileCheckpointPod are removed (the work order is the sole trigger; no double-dump).
API: SnapshotContent.spec.source = {podRef{name,uid}, nodeName} (both required); storage coords ride as existing label/annotation; protocol.ObserveCheckpointJob/observation types removed.

Verification: operator controller envtest + api tests pass (envtest run outside sandbox); agent internal/controller + protocol tests pass in a Linux container; both modules build; codegen has zero drift.

Where should the reviewer start?

deploy/operator/internal/controller/snapshot_reconciler.go
deploy/operator/internal/controller/dynamocheckpoint_controller.go (observeSnapshot)
deploy/snapshot/internal/controller/snapshotcontent_reconciler.go

Related Issues

🚫 This PR is NOT linked to an issue:

Confirmed — no related issue (intermediate component PR into the feat/snapshot-crd fork base; tracked in the migration's internal state doc — upstream issue is created at the base→upstream PR)

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

Ronkahn21 · 2026-06-15T16:26:11Z

Addressed the review comments in 65208a9 + 3e95529. Per-comment:

dynamocheckpoint_controller.go

ptr.To/new → CreatedAt = ptr.To(metav1.Now()) in the new markCheckpointReady.
IsSnapshotSucceeded util + move update out → added IsSnapshotSucceeded/IsSnapshotFailed in api/v1alpha1; observeSnapshot uses them and the status write moved into markCheckpointReady.
mixes observe + status update → split: observeSnapshot only decides; markCheckpointReady/failCreating write.
job status should expose jobFailed → checkpointJobFailed(job) reports the failed condition; checkpoint fails when the Job failed and the Snapshot has not succeeded.
also deleted should be watched → Owns(&Snapshot{}) predicate DeleteFunc -> true.
use SSA → kept Status().Update here. The other ~8 status writes in this controller use Update; mixing SSA on the same status subresource risks silently dropping conditions (e.g. JobCreated) and is not exercised by the fake-client/envtest suite. Can do a full SSA-status migration as a separate follow-up if you want it.

snapshot_reconciler.go

separate get + validate → getSourcePod + validateSourcePod.
follow ensureSnapshotContent get-or-create / fold findBoundContent in → ensureSnapshotContent now get-or-creates (Get, else build + SSA) and returns the content, mirroring ensureSnapshot in checkpoint_snapshot.go.
get once, not in every function → Reconcile resolves the content once and passes it to propagateStatus; no re-Get.
status as a separate step / call it propagateStatus → renamed bindAndMirror -> propagateStatus, now a separate reconcile step taking the content.
delete without pre-Get, block to wait → handleDelete deletes by name with IgnoreNotFound (no pre-Get), then requeues until a Get confirms it is gone before dropping the Snapshot finalizer.
start without finalizer on the SnapshotContent → removed the finalizer from buildSnapshotContent.
start without the predicate → removed the Watches(&SnapshotContent{}) predicate; kept the mapper + cache.DeletedFinalStateUnknown tombstone unwrap.

Verified: go build/go vet clean, go test ./api/... + controller envtest (64 specs + table tests) pass, agent module untouched, no codegen drift.

Note: per-thread replies were blocked by an existing pending review on this PR.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

The agent read the checkpoint id from the source pod label but also reverse-parsed it out of the SnapshotContent name and failed on mismatch. That re-introduced the name-encoding coupling removed from Snapshot.spec. The pod is the sole source of truth; the content name is now opaque. Naming stays the operator's producer-side concern. Signed-off-by: Ron Kahn <rkahn@nvidia.com>

Restore the pre-migration active unstick in the agent capture path: when a checkpoint container exits non-zero, SIGKILL the source pod's still-running containers so a quiesced/CUDA-locked workload cannot hang forever, then fail the SnapshotContent. The kill runs while holding the in-flight key, so it never races a live dump. Signed-off-by: Ron Kahn <rkahn@nvidia.com>

The capture path was driven only by the SnapshotContent informer (10s resync), so pod status changes (a checkpoint container crashing, the target becoming ready) were only acted on at resync. Add a source-pod informer plus a podRef index on the content informer: a pod event maps O(1) to its work order and re-drives reconcileSnapshotContent, choosing the oldest non-terminal content. The resync stays as a backstop. Signed-off-by: Ron Kahn <rkahn@nvidia.com>

Both informers funneled into one content-centric reconcile that mixed pre-bind validation and capture. Split into reconcileSourcePod (the one capture flow, pod-driven, picks the oldest active work order) and a thin reconcileSnapshotContent pre-bind gate that validates the source pod (NotFound/UID/gone -> writeFailed) and triggers the pod reconcile. The unstick runs before the gone-guard so a crashed container is always SIGKILLed. classifySourcePod replaces resolveSourcePod. Signed-off-by: Ron Kahn <rkahn@nvidia.com>

) * feat(operator): add SnapshotContent work-order source fields Signed-off-by: Ron Kahn <rkahn@nvidia.com> * feat(operator): capture via SnapshotReconciler and Snapshot status Signed-off-by: Ron Kahn <rkahn@nvidia.com> * refactor(snapshot): remove Job-observation checkpoint completion Signed-off-by: Ron Kahn <rkahn@nvidia.com> * feat(snapshot): capture checkpoints via SnapshotContent work order Signed-off-by: Ron Kahn <rkahn@nvidia.com> * feat(operator): add Snapshot status condition helpers Signed-off-by: Ron Kahn <rkahn@nvidia.com> * refactor(operator): apply Component C PR review feedback Signed-off-by: Ron Kahn <rkahn@nvidia.com> * refactor(snapshot): drop manager, capture from pod annotations Signed-off-by: Ron Kahn <rkahn@nvidia.com> * fix(operator): fail checkpoint on unbound Snapshot failure Signed-off-by: Ron Kahn <rkahn@nvidia.com> * fix(snapshot): add helm RBAC document separator Signed-off-by: Ron Kahn <rkahn@nvidia.com> * refactor(operator): drop Snapshot.spec.checkpointID, use pod label Signed-off-by: Ron Kahn <rkahn@nvidia.com> * refactor(snapshot): drop agent content-name id cross-check The agent read the checkpoint id from the source pod label but also reverse-parsed it out of the SnapshotContent name and failed on mismatch. That re-introduced the name-encoding coupling removed from Snapshot.spec. The pod is the sole source of truth; the content name is now opaque. Naming stays the operator's producer-side concern. Signed-off-by: Ron Kahn <rkahn@nvidia.com> * feat(snapshot): unstick pod on failed checkpoint container Restore the pre-migration active unstick in the agent capture path: when a checkpoint container exits non-zero, SIGKILL the source pod's still-running containers so a quiesced/CUDA-locked workload cannot hang forever, then fail the SnapshotContent. The kill runs while holding the in-flight key, so it never races a live dump. Signed-off-by: Ron Kahn <rkahn@nvidia.com> * feat(snapshot): drive capture on source-pod events via podRef index The capture path was driven only by the SnapshotContent informer (10s resync), so pod status changes (a checkpoint container crashing, the target becoming ready) were only acted on at resync. Add a source-pod informer plus a podRef index on the content informer: a pod event maps O(1) to its work order and re-drives reconcileSnapshotContent, choosing the oldest non-terminal content. The resync stays as a backstop. Signed-off-by: Ron Kahn <rkahn@nvidia.com> * refactor(snapshot): single pod-driven capture path + pre-bind gate Both informers funneled into one content-centric reconcile that mixed pre-bind validation and capture. Split into reconcileSourcePod (the one capture flow, pod-driven, picks the oldest active work order) and a thin reconcileSnapshotContent pre-bind gate that validates the source pod (NotFound/UID/gone -> writeFailed) and triggers the pod reconcile. The unstick runs before the gone-guard so a crashed container is always SIGKILLed. classifySourcePod replaces resolveSourcePod. Signed-off-by: Ron Kahn <rkahn@nvidia.com> --------- Signed-off-by: Ron Kahn <rkahn@nvidia.com>

Ronkahn21 added 4 commits June 15, 2026 17:37

feat(operator): add SnapshotContent work-order source fields

4721aea

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

feat(operator): capture via SnapshotReconciler and Snapshot status

101a87a

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

refactor(snapshot): remove Job-observation checkpoint completion

c007071

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

feat(snapshot): capture checkpoints via SnapshotContent work order

e5a7d78

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

github-actions Bot added the deployment::k8s label Jun 15, 2026

Ronkahn21 commented Jun 15, 2026

View reviewed changes

Comment thread deploy/operator/internal/controller/dynamocheckpoint_controller.go Outdated

Ronkahn21 commented Jun 15, 2026

View reviewed changes

Ronkahn21 added 2 commits June 15, 2026 19:22

feat(operator): add Snapshot status condition helpers

65208a9

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

refactor(operator): apply Component C PR review feedback

3e95529

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

Ronkahn21 added 8 commits June 15, 2026 23:10

refactor(snapshot): drop manager, capture from pod annotations

927d989

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

fix(operator): fail checkpoint on unbound Snapshot failure

935c5b0

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

fix(snapshot): add helm RBAC document separator

7d5659b

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

refactor(operator): drop Snapshot.spec.checkpointID, use pod label

2c0647a

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

Ronkahn21 merged commit f32ec82 into feat/snapshot-crd Jun 22, 2026
11 of 16 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: capture checkpoints via Snapshot/SnapshotContent (Component C)#3

feat: capture checkpoints via Snapshot/SnapshotContent (Component C)#3
Ronkahn21 merged 14 commits into
feat/snapshot-crdfrom
feat/snapshot-crd-c-agent

Ronkahn21 commented Jun 15, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Ronkahn21 commented Jun 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Ronkahn21 commented Jun 15, 2026

Overview:

Details:

Where should the reviewer start?

Related Issues

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Ronkahn21 commented Jun 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant