feat: capture checkpoints via Snapshot/SnapshotContent (Component C)#3
Merged
Conversation
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Ronkahn21
commented
Jun 15, 2026
Ronkahn21
commented
Jun 15, 2026
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Owner
Author
|
Addressed the review comments in dynamocheckpoint_controller.go
snapshot_reconciler.go
Verified: Note: per-thread replies were blocked by an existing pending review on this PR. |
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
The agent read the checkpoint id from the source pod label but also reverse-parsed it out of the SnapshotContent name and failed on mismatch. That re-introduced the name-encoding coupling removed from Snapshot.spec. The pod is the sole source of truth; the content name is now opaque. Naming stays the operator's producer-side concern. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Restore the pre-migration active unstick in the agent capture path: when a checkpoint container exits non-zero, SIGKILL the source pod's still-running containers so a quiesced/CUDA-locked workload cannot hang forever, then fail the SnapshotContent. The kill runs while holding the in-flight key, so it never races a live dump. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
The capture path was driven only by the SnapshotContent informer (10s resync), so pod status changes (a checkpoint container crashing, the target becoming ready) were only acted on at resync. Add a source-pod informer plus a podRef index on the content informer: a pod event maps O(1) to its work order and re-drives reconcileSnapshotContent, choosing the oldest non-terminal content. The resync stays as a backstop. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Both informers funneled into one content-centric reconcile that mixed pre-bind validation and capture. Split into reconcileSourcePod (the one capture flow, pod-driven, picks the oldest active work order) and a thin reconcileSnapshotContent pre-bind gate that validates the source pod (NotFound/UID/gone -> writeFailed) and triggers the pod reconcile. The unstick runs before the gone-guard so a crashed container is always SIGKILLed. classifySourcePod replaces resolveSourcePod. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Ronkahn21
added a commit
that referenced
this pull request
Jun 22, 2026
) * feat(operator): add SnapshotContent work-order source fields Signed-off-by: Ron Kahn <rkahn@nvidia.com> * feat(operator): capture via SnapshotReconciler and Snapshot status Signed-off-by: Ron Kahn <rkahn@nvidia.com> * refactor(snapshot): remove Job-observation checkpoint completion Signed-off-by: Ron Kahn <rkahn@nvidia.com> * feat(snapshot): capture checkpoints via SnapshotContent work order Signed-off-by: Ron Kahn <rkahn@nvidia.com> * feat(operator): add Snapshot status condition helpers Signed-off-by: Ron Kahn <rkahn@nvidia.com> * refactor(operator): apply Component C PR review feedback Signed-off-by: Ron Kahn <rkahn@nvidia.com> * refactor(snapshot): drop manager, capture from pod annotations Signed-off-by: Ron Kahn <rkahn@nvidia.com> * fix(operator): fail checkpoint on unbound Snapshot failure Signed-off-by: Ron Kahn <rkahn@nvidia.com> * fix(snapshot): add helm RBAC document separator Signed-off-by: Ron Kahn <rkahn@nvidia.com> * refactor(operator): drop Snapshot.spec.checkpointID, use pod label Signed-off-by: Ron Kahn <rkahn@nvidia.com> * refactor(snapshot): drop agent content-name id cross-check The agent read the checkpoint id from the source pod label but also reverse-parsed it out of the SnapshotContent name and failed on mismatch. That re-introduced the name-encoding coupling removed from Snapshot.spec. The pod is the sole source of truth; the content name is now opaque. Naming stays the operator's producer-side concern. Signed-off-by: Ron Kahn <rkahn@nvidia.com> * feat(snapshot): unstick pod on failed checkpoint container Restore the pre-migration active unstick in the agent capture path: when a checkpoint container exits non-zero, SIGKILL the source pod's still-running containers so a quiesced/CUDA-locked workload cannot hang forever, then fail the SnapshotContent. The kill runs while holding the in-flight key, so it never races a live dump. Signed-off-by: Ron Kahn <rkahn@nvidia.com> * feat(snapshot): drive capture on source-pod events via podRef index The capture path was driven only by the SnapshotContent informer (10s resync), so pod status changes (a checkpoint container crashing, the target becoming ready) were only acted on at resync. Add a source-pod informer plus a podRef index on the content informer: a pod event maps O(1) to its work order and re-drives reconcileSnapshotContent, choosing the oldest non-terminal content. The resync stays as a backstop. Signed-off-by: Ron Kahn <rkahn@nvidia.com> * refactor(snapshot): single pod-driven capture path + pre-bind gate Both informers funneled into one content-centric reconcile that mixed pre-bind validation and capture. Split into reconcileSourcePod (the one capture flow, pod-driven, picks the oldest active work order) and a thin reconcileSnapshotContent pre-bind gate that validates the source pod (NotFound/UID/gone -> writeFailed) and triggers the pod reconcile. The unstick runs before the gone-guard so a crashed container is always SIGKILLed. classifySourcePod replaces resolveSourcePod. Signed-off-by: Ron Kahn <rkahn@nvidia.com> --------- Signed-off-by: Ron Kahn <rkahn@nvidia.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Overview:
Component C of the
Snapshot/SnapshotContentmigration: move container-checkpoint capture onto the new CRDs using a CSI VolumeSnapshot / external-snapshotter split. Restore is untouched (stays onDynamoCheckpoint+ pod annotations). External DGD API is unchanged.Details:
SnapshotReconciler(new, cluster-wide) ≈ snapshot-controller: creates the cluster-scopedSnapshotContentwork order (sourcepodRef{name,uid}+nodeName+checkpointID/artifactVersionmetadata), bindsSnapshot.status.boundSnapshotContentName, mirrorsSnapshotContent.status→Snapshot.status, and owns finalizers on both. Cluster→namespaced ownerRef is impossible, so the link isspec.snapshotRef+ a finalizer; the status-mirror mapper unwrapscache.DeletedFinalStateUnknownso deletes don't wedge.DynamoCheckpointcontroller:Owns(&Snapshot{})(update-only predicate); completion readsSnapshot.status; deadline + Job-failure hang guards.SnapshotContentReconciler(new, per-node, action-only) ≈ CSI driver: controller-runtime manager with a node-label-scoped cache; dumps via aNodeCheckpointerinterface; writes onlySnapshotContent.status. UID provenance check,inFlightguard + lease-renewal goroutine. The capture-side pod informer +reconcileCheckpointPodare removed (the work order is the sole trigger; no double-dump).SnapshotContent.spec.source = {podRef{name,uid}, nodeName}(both required); storage coords ride as existing label/annotation;protocol.ObserveCheckpointJob/observation types removed.Verification: operator controller envtest + api tests pass (envtest run outside sandbox); agent
internal/controller+protocoltests pass in a Linux container; both modules build; codegen has zero drift.Where should the reviewer start?
deploy/operator/internal/controller/snapshot_reconciler.godeploy/operator/internal/controller/dynamocheckpoint_controller.go(observeSnapshot)deploy/snapshot/internal/controller/snapshotcontent_reconciler.goRelated Issues
🚫 This PR is NOT linked to an issue:
feat/snapshot-crdfork base; tracked in the migration's internal state doc — upstream issue is created at the base→upstream PR)