feat(snapshot): capture checkpoints via PodSnapshot + node agent #10951
Open
Ronkahn21 wants to merge 19 commits into
3d49052 to cca8329
The snapshot module imports the operator module via replace => ../operator, so the operator module must be present in the image build. Pass it as an 'operator' named build context (build/check-deploy-component actions) and COPY it to /operator in the Dockerfile go-base stage before 'go mod download', so go can resolve the local replace. Fixes the snapshot/agent/placeholder image builds. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Record status.podSnapshotName when observing an already-created PodSnapshot so a checkpoint never reaches Ready with an empty pointer, and remove the unused delete verb on podsnapshots (nothing deletes them; cleanup is owner-reference GC). Signed-off-by: Ron Kahn <rkahn@nvidia.com>
f3f200e to 9b6236b
- Use the context logger (seeded in Run) instead of w.log on the capture path
- reconcileSourcePod returns error; informer handlers log it
- Drop the unreachable contentIndexer nil guard
- Remove the snapshot-content back-ref label (helper, const, tests)
- Rename writeReady/writeFailed -> setSnapshotContentSucceeded/Failed
- Patch only labels (maps.Clone) instead of deep-copying the whole pod
- Trim redundant comments

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
799af37 to 367521f
Overview:
Wire the PodSnapshot/PodSnapshotContent CRDs from #10820 into the end-to-end checkpoint capture path: the operator drives a PodSnapshot from each DynamoCheckpoint, and the per-node snapshot-agent fulfils the cluster-scoped PodSnapshotContent work order via CRIU. This is PR 2 + 3 of 3 for the capture-flow migration (#10819), combining the DynamoCheckpoint → PodSnapshot integration (PR 2) and the node-agent capture (PR 3).

Capture flow (after this PR)

    sequenceDiagram
      autonumber
      participant DC as DynamoCheckpoint ctrl
      participant PSR as PodSnapshotReconciler (#10820)
      participant SNAP as PodSnapshot
      participant PSC as PodSnapshotContent (cluster-scoped)
      participant AG as Node agent (NodeController)
      participant POD as Source (checkpoint Job) pod
      DC->>POD: create checkpoint Job (workload carrier)
      DC->>SNAP: create PodSnapshot (name=ckpt.Name, label snapshot-owner; Owns)
      DC->>DC: record status.podSnapshotName
      PSR->>POD: resolve podRef -> nodeName (wait until scheduled)
      PSR->>PSC: create work order {source.podRef, nodeName} + label snapshot-node
      PSR->>SNAP: status.boundPodSnapshotContentName + Pending
      AG->>PSC: content informer (snapshot-node filter) picks up work order (trigger only)
      AG->>POD: gate validates pod -> label snapshot-capture-eligible
      AG->>POD: read capture params from pod labels/annotations (source of truth)
      AG->>POD: acquire lease, wait Ready, CRIU dump, write sentinel
      AG->>PSC: Status().Patch Ready/Failed (status only)
      PSC-->>PSR: Watches(PodSnapshotContent) map-fn re-enqueues PodSnapshot
      PSR->>SNAP: mirror Ready/Failed
      SNAP-->>DC: Owns(PodSnapshot) update re-enqueues DynamoCheckpoint
      DC->>DC: Ready (+status.checkpointID) / Failed
      note over POD,PSC: PodSnapshotContent is only the trigger + routing record; capture params come from the source pod
      note over PSC,DC: status cascades up two watch hops (PSC->PodSnapshot map-fn, PodSnapshot->DC Owns); DC never watches PSC

1. The DynamoCheckpoint controller creates a PodSnapshot named after the checkpoint (ckpt.Name) and stamped with nvidia.com/snapshot-owner; it records the name in status.podSnapshotName and observes the PodSnapshot by owner label (never a reconstructed name).
2. PodSnapshotReconciler (feat(operator): add PodSnapshot/PodSnapshotContent CRDs + reconciler, #10820) waits for the source pod to be scheduled, then creates the cluster-scoped PodSnapshotContent work order carrying spec.source.podRef + nodeName and the nvidia.com/snapshot-node=<node> routing label, and sets status.boundPodSnapshotContentName (Pending).
3. The node agent's content informer (filtered by snapshot-node) picks up the work order; the gate validates the source pod and promotes it by labeling it nvidia.com/snapshot-capture-eligible -- it never runs the dump directly.
4. The capture path reads its parameters from the source pod's labels/annotations (the source of truth): checkpoint ID (nvidia.com/snapshot-checkpoint-id label), target container (nvidia.com/snapshot-target-containers annotation), and destination (storage annotations + agent base path). The PodSnapshotContent is only a trigger + routing record (spec.source.podRef + nodeName) -- a content event merely nudges the agent, which then picks the oldest active work order for that pod. The agent acquires the per-work-order lease, waits for the target container to be Ready, runs the CRIU dump via the executor, writes the snapshot-complete sentinel, and patches terminal status on the PodSnapshotContent (Status().Patch only -- Ready/Failed; it never writes the content's spec).
5. Status cascades up two watch hops: the PodSnapshotReconciler (re-enqueued by the Watches(PodSnapshotContent) map-fn) mirrors Ready/Failed onto the PodSnapshot, and the DynamoCheckpoint controller (re-enqueued via Owns) sets the checkpoint Ready (+status.checkpointID) / Failed. The DynamoCheckpoint never watches PodSnapshotContent directly.

Details:
Operator (DynamoCheckpoint → PodSnapshot):
- DynamoCheckpoint controller creates and observes a PodSnapshot (replacing the old ObserveCheckpointJob switch).
- PodSnapshot Ready/Failed → DynamoCheckpoint Ready/Failed (+status.checkpointID).
- PodSnapshot is looked up by the nvidia.com/snapshot-owner label + ownership (never by reconstructed name); its name is recorded in status.podSnapshotName.
- AlreadyExists-requeue (no SSA force-ownership); an event-driven, CheckpointSourceLabel-scoped source-pod watch replaces the previous poll.
- activeDeadlineSeconds is left to Kubernetes (which sets JobFailed/DeadlineExceeded) + the Owns(&Job) watch; no operator-side deadline math.

Node agent (PodSnapshotContent capture):
- NodeController fulfils PodSnapshotContent work orders: a pre-bind gate validates the source pod and promotes it with nvidia.com/snapshot-capture-eligible; the capture path drives the CRIU dump via the executor and writes terminal PodSnapshotContent status (status-only patches).
- nvidia.com/snapshot-content back-ref is stamped on the source pod.
- ObserveCheckpointJob/CheckpointObservation*).

Where should the reviewer start?
- deploy/operator/internal/controller/dynamocheckpoint_controller.go, checkpoint_podsnapshot.go
- deploy/snapshot/internal/controller/podsnapshotcontent.go, controller.go

Related Issues
🔗 This PR is linked to an issue:
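As an illustration of the operator-side status mirroring described above (PodSnapshot Ready/Failed → DynamoCheckpoint Ready/Failed, lookup by owner label, and always recording status.podSnapshotName), here is a minimal standard-library Go sketch. All types and function names (`PodSnapshotView`, `CheckpointStatus`, `findByOwner`, `mirrorSnapshotStatus`) are hypothetical simplifications, not the controller's actual API, and the checkpoint ID is simply passed in as a parameter.

```go
package main

import "fmt"

// SnapshotPhase is a hypothetical stand-in for the PodSnapshot status phase.
type SnapshotPhase string

const (
	SnapshotPending SnapshotPhase = "Pending"
	SnapshotReady   SnapshotPhase = "Ready"
	SnapshotFailed  SnapshotPhase = "Failed"
)

// PodSnapshotView is a simplified, hypothetical view of a PodSnapshot.
type PodSnapshotView struct {
	Name  string
	Owner string // value of the nvidia.com/snapshot-owner label
	Phase SnapshotPhase
}

// CheckpointStatus is a simplified, hypothetical DynamoCheckpoint status.
type CheckpointStatus struct {
	Phase           string
	PodSnapshotName string
	CheckpointID    string
}

// findByOwner looks a snapshot up by owner label, never by a reconstructed name.
func findByOwner(snaps []PodSnapshotView, owner string) (PodSnapshotView, bool) {
	for _, s := range snaps {
		if s.Owner == owner {
			return s, true
		}
	}
	return PodSnapshotView{}, false
}

// mirrorSnapshotStatus copies the terminal phase of the snapshot onto the
// checkpoint status. The snapshot name is recorded on every observation so
// the checkpoint cannot become Ready with an empty podSnapshotName.
func mirrorSnapshotStatus(st CheckpointStatus, snap PodSnapshotView, checkpointID string) CheckpointStatus {
	st.PodSnapshotName = snap.Name
	switch snap.Phase {
	case SnapshotReady:
		st.Phase = "Ready"
		st.CheckpointID = checkpointID
	case SnapshotFailed:
		st.Phase = "Failed"
	}
	// Any other phase leaves the checkpoint phase untouched.
	return st
}

func main() {
	snaps := []PodSnapshotView{{Name: "ckpt-a", Owner: "ckpt-a", Phase: SnapshotReady}}
	if snap, ok := findByOwner(snaps, "ckpt-a"); ok {
		st := mirrorSnapshotStatus(CheckpointStatus{Phase: "Pending"}, snap, "abc123")
		fmt.Printf("%+v\n", st)
	}
}
```

Keeping the mapping a pure function of the observed PodSnapshot fits the event-driven design in the description: each re-enqueue through Owns(PodSnapshot) can recompute the checkpoint status without any polling state.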