Skip to content

feat: capture checkpoints via Snapshot/SnapshotContent (Component C)#3

Merged
Ronkahn21 merged 14 commits into
feat/snapshot-crdfrom
feat/snapshot-crd-c-agent
Jun 22, 2026
Merged

feat: capture checkpoints via Snapshot/SnapshotContent (Component C)#3
Ronkahn21 merged 14 commits into
feat/snapshot-crdfrom
feat/snapshot-crd-c-agent

Conversation

@Ronkahn21

Copy link
Copy Markdown
Owner

Overview:

Component C of the Snapshot/SnapshotContent migration: move container-checkpoint capture onto the new CRDs using a CSI VolumeSnapshot / external-snapshotter split. Restore is untouched (stays on DynamoCheckpoint + pod annotations). External DGD API is unchanged.

Details:

  • Operator SnapshotReconciler (new, cluster-wide) ≈ snapshot-controller: creates the cluster-scoped SnapshotContent work order (source podRef{name,uid} + nodeName + checkpointID/artifactVersion metadata), binds Snapshot.status.boundSnapshotContentName, mirrors SnapshotContent.statusSnapshot.status, and owns finalizers on both. Cluster→namespaced ownerRef is impossible, so the link is spec.snapshotRef + a finalizer; the status-mirror mapper unwraps cache.DeletedFinalStateUnknown so deletes don't wedge.
  • DynamoCheckpoint controller: Owns(&Snapshot{}) (update-only predicate); completion reads Snapshot.status; deadline + Job-failure hang guards.
  • Agent SnapshotContentReconciler (new, per-node, action-only) ≈ CSI driver: controller-runtime manager with a node-label-scoped cache; dumps via a NodeCheckpointer interface; writes only SnapshotContent.status. UID provenance check, inFlight guard + lease-renewal goroutine. The capture-side pod informer + reconcileCheckpointPod are removed (the work order is the sole trigger; no double-dump).
  • API: SnapshotContent.spec.source = {podRef{name,uid}, nodeName} (both required); storage coords ride as existing label/annotation; protocol.ObserveCheckpointJob/observation types removed.

Verification: operator controller envtest + api tests pass (envtest run outside sandbox); agent internal/controller + protocol tests pass in a Linux container; both modules build; codegen has zero drift.

Where should the reviewer start?

  • deploy/operator/internal/controller/snapshot_reconciler.go
  • deploy/operator/internal/controller/dynamocheckpoint_controller.go (observeSnapshot)
  • deploy/snapshot/internal/controller/snapshotcontent_reconciler.go

Related Issues

🚫 This PR is NOT linked to an issue:

  • Confirmed — no related issue (intermediate component PR into the feat/snapshot-crd fork base; tracked in the migration's internal state doc — upstream issue is created at the base→upstream PR)

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Comment thread deploy/operator/internal/controller/dynamocheckpoint_controller.go Outdated
Comment thread deploy/operator/internal/controller/dynamocheckpoint_controller.go Outdated
Comment thread deploy/operator/internal/controller/dynamocheckpoint_controller.go Outdated
Comment thread deploy/operator/internal/controller/dynamocheckpoint_controller.go
Comment thread deploy/operator/internal/controller/dynamocheckpoint_controller.go
Comment thread deploy/operator/internal/controller/dynamocheckpoint_controller.go
Comment thread deploy/operator/internal/controller/snapshot_reconciler.go Outdated
Comment thread deploy/operator/internal/controller/snapshot_reconciler.go Outdated
Comment thread deploy/operator/internal/controller/snapshot_reconciler.go Outdated
Comment thread deploy/operator/internal/controller/snapshot_reconciler.go Outdated
Comment thread deploy/operator/internal/controller/snapshot_reconciler.go Outdated
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
@Ronkahn21

Copy link
Copy Markdown
Owner Author

Addressed the review comments in 65208a9 + 3e95529. Per-comment:

dynamocheckpoint_controller.go

  • ptr.To/newCreatedAt = ptr.To(metav1.Now()) in the new markCheckpointReady.
  • IsSnapshotSucceeded util + move update out → added IsSnapshotSucceeded/IsSnapshotFailed in api/v1alpha1; observeSnapshot uses them and the status write moved into markCheckpointReady.
  • mixes observe + status update → split: observeSnapshot only decides; markCheckpointReady/failCreating write.
  • job status should expose jobFailedcheckpointJobFailed(job) reports the failed condition; checkpoint fails when the Job failed and the Snapshot has not succeeded.
  • also deleted should be watchedOwns(&Snapshot{}) predicate DeleteFunc -> true.
  • use SSA → kept Status().Update here. The other ~8 status writes in this controller use Update; mixing SSA on the same status subresource risks silently dropping conditions (e.g. JobCreated) and is not exercised by the fake-client/envtest suite. Can do a full SSA-status migration as a separate follow-up if you want it.

snapshot_reconciler.go

  • separate get + validategetSourcePod + validateSourcePod.
  • follow ensureSnapshotContent get-or-create / fold findBoundContent inensureSnapshotContent now get-or-creates (Get, else build + SSA) and returns the content, mirroring ensureSnapshot in checkpoint_snapshot.go.
  • get once, not in every function → Reconcile resolves the content once and passes it to propagateStatus; no re-Get.
  • status as a separate step / call it propagateStatus → renamed bindAndMirror -> propagateStatus, now a separate reconcile step taking the content.
  • delete without pre-Get, block to waithandleDelete deletes by name with IgnoreNotFound (no pre-Get), then requeues until a Get confirms it is gone before dropping the Snapshot finalizer.
  • start without finalizer on the SnapshotContent → removed the finalizer from buildSnapshotContent.
  • start without the predicate → removed the Watches(&SnapshotContent{}) predicate; kept the mapper + cache.DeletedFinalStateUnknown tombstone unwrap.

Verified: go build/go vet clean, go test ./api/... + controller envtest (64 specs + table tests) pass, agent module untouched, no codegen drift.

Note: per-thread replies were blocked by an existing pending review on this PR.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
The agent read the checkpoint id from the source pod label but also
reverse-parsed it out of the SnapshotContent name and failed on
mismatch. That re-introduced the name-encoding coupling removed from
Snapshot.spec. The pod is the sole source of truth; the content name
is now opaque. Naming stays the operator's producer-side concern.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Restore the pre-migration active unstick in the agent capture path:
when a checkpoint container exits non-zero, SIGKILL the source pod's
still-running containers so a quiesced/CUDA-locked workload cannot hang
forever, then fail the SnapshotContent. The kill runs while holding the
in-flight key, so it never races a live dump.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
The capture path was driven only by the SnapshotContent informer (10s
resync), so pod status changes (a checkpoint container crashing, the
target becoming ready) were only acted on at resync. Add a source-pod
informer plus a podRef index on the content informer: a pod event maps
O(1) to its work order and re-drives reconcileSnapshotContent, choosing
the oldest non-terminal content. The resync stays as a backstop.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Both informers funneled into one content-centric reconcile that mixed
pre-bind validation and capture. Split into reconcileSourcePod (the one
capture flow, pod-driven, picks the oldest active work order) and a thin
reconcileSnapshotContent pre-bind gate that validates the source pod
(NotFound/UID/gone -> writeFailed) and triggers the pod reconcile. The
unstick runs before the gone-guard so a crashed container is always
SIGKILLed. classifySourcePod replaces resolveSourcePod.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
@Ronkahn21 Ronkahn21 merged commit f32ec82 into feat/snapshot-crd Jun 22, 2026
11 of 16 checks passed
Ronkahn21 added a commit that referenced this pull request Jun 22, 2026
)

* feat(operator): add SnapshotContent work-order source fields

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* feat(operator): capture via SnapshotReconciler and Snapshot status

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* refactor(snapshot): remove Job-observation checkpoint completion

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* feat(snapshot): capture checkpoints via SnapshotContent work order

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* feat(operator): add Snapshot status condition helpers

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* refactor(operator): apply Component C PR review feedback

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* refactor(snapshot): drop manager, capture from pod annotations

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* fix(operator): fail checkpoint on unbound Snapshot failure

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* fix(snapshot): add helm RBAC document separator

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* refactor(operator): drop Snapshot.spec.checkpointID, use pod label

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* refactor(snapshot): drop agent content-name id cross-check

The agent read the checkpoint id from the source pod label but also
reverse-parsed it out of the SnapshotContent name and failed on
mismatch. That re-introduced the name-encoding coupling removed from
Snapshot.spec. The pod is the sole source of truth; the content name
is now opaque. Naming stays the operator's producer-side concern.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* feat(snapshot): unstick pod on failed checkpoint container

Restore the pre-migration active unstick in the agent capture path:
when a checkpoint container exits non-zero, SIGKILL the source pod's
still-running containers so a quiesced/CUDA-locked workload cannot hang
forever, then fail the SnapshotContent. The kill runs while holding the
in-flight key, so it never races a live dump.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* feat(snapshot): drive capture on source-pod events via podRef index

The capture path was driven only by the SnapshotContent informer (10s
resync), so pod status changes (a checkpoint container crashing, the
target becoming ready) were only acted on at resync. Add a source-pod
informer plus a podRef index on the content informer: a pod event maps
O(1) to its work order and re-drives reconcileSnapshotContent, choosing
the oldest non-terminal content. The resync stays as a backstop.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* refactor(snapshot): single pod-driven capture path + pre-bind gate

Both informers funneled into one content-centric reconcile that mixed
pre-bind validation and capture. Split into reconcileSourcePod (the one
capture flow, pod-driven, picks the oldest active work order) and a thin
reconcileSnapshotContent pre-bind gate that validates the source pod
(NotFound/UID/gone -> writeFailed) and triggers the pod reconcile. The
unstick runs before the gone-guard so a crashed container is always
SIGKILLed. classifySourcePod replaces resolveSourcePod.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

---------

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant