feat(snapshot): capture checkpoints via PodSnapshot + node agent #10951

Open
Ronkahn21 wants to merge 19 commits into ai-dynamo:main from Ronkahn21:feat/podsnapshot-capture

@Ronkahn21 Ronkahn21 commented Jun 25, 2026

Contributor

Overview:

Wire the PodSnapshot / PodSnapshotContent CRDs from #10820 into the end-to-end checkpoint capture path: the operator drives a PodSnapshot from each DynamoCheckpoint, and the per-node snapshot-agent fulfils the cluster-scoped PodSnapshotContent work order via CRIU. This is PR 2 + 3 of 3 for the capture-flow migration (#10819), combining the DynamoCheckpoint → PodSnapshot integration (PR 2) and the node-agent capture (PR 3).

Stacking: #10820 (feat/podsnapshot-api, the PodSnapshot/PodSnapshotContent CRDs) has merged; this branch is rebased onto main and now contains only the capture-backend changes.

Capture flow (after this PR)

sequenceDiagram
    autonumber
    participant DC as DynamoCheckpoint ctrl
    participant PSR as PodSnapshotReconciler (#10820)
    participant SNAP as PodSnapshot
    participant PSC as PodSnapshotContent (cluster-scoped)
    participant AG as Node agent (NodeController)
    participant POD as Source (checkpoint Job) pod
    DC->>POD: create checkpoint Job (workload carrier)
    DC->>SNAP: create PodSnapshot (name=ckpt.Name, label snapshot-owner; Owns)
    DC->>DC: record status.podSnapshotName
    PSR->>POD: resolve podRef -> nodeName (wait until scheduled)
    PSR->>PSC: create work order {source.podRef, nodeName} + label snapshot-node
    PSR->>SNAP: status.boundPodSnapshotContentName + Pending
    AG->>PSC: content informer (snapshot-node filter) picks up work order (trigger only)
    AG->>POD: gate validates pod -> label snapshot-capture-eligible
    AG->>POD: read capture params from pod labels/annotations (source of truth)
    AG->>POD: acquire lease, wait Ready, CRIU dump, write sentinel
    AG->>PSC: Status().Patch Ready/Failed (status only)
    PSC-->>PSR: Watches(PodSnapshotContent) map-fn re-enqueues PodSnapshot
    PSR->>SNAP: mirror Ready/Failed
    SNAP-->>DC: Owns(PodSnapshot) update re-enqueues DynamoCheckpoint
    DC->>DC: Ready (+status.checkpointID) / Failed
    note over POD,PSC: PodSnapshotContent is only the trigger + routing record; capture params come from the source pod
    note over PSC,DC: status cascades up two watch hops (PSC->PodSnapshot map-fn, PodSnapshot->DC Owns); DC never watches PSC
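
The two watch hops in the diagram can be sketched without any Kubernetes dependency. This is a minimal illustration, not the controllers' code: the `content`, `snapshot`, and `checkpoint` structs and all function names are hypothetical stand-ins for the real CRD types and controller-runtime wiring.

```go
package main

import "fmt"

type phase string

const (
	phasePending phase = "Pending"
	phaseReady   phase = "Ready"
	phaseFailed  phase = "Failed"
)

// Hypothetical, trimmed-down stand-ins for the three resources.
type content struct {
	Name  string
	Phase phase
}

type snapshot struct {
	Name         string
	Owner        string // name of the owning DynamoCheckpoint
	BoundContent string // mirrors status.boundPodSnapshotContentName
	Phase        phase
}

type checkpoint struct {
	Name  string
	Phase phase
}

func terminal(p phase) bool { return p == phaseReady || p == phaseFailed }

// snapshotsForContent plays the role of the content-watch map function:
// given a changed content, it returns the snapshots to re-enqueue.
func snapshotsForContent(c content, snaps []*snapshot) []*snapshot {
	var out []*snapshot
	for _, s := range snaps {
		if s.BoundContent == c.Name {
			out = append(out, s)
		}
	}
	return out
}

// mirrorContent is hop 1: copy a terminal content phase onto the snapshot.
// It reports whether the snapshot changed.
func mirrorContent(c content, s *snapshot) bool {
	if !terminal(c.Phase) || s.Phase == c.Phase {
		return false
	}
	s.Phase = c.Phase
	return true
}

// mirrorSnapshot is hop 2: the checkpoint only ever looks at its own
// snapshot, never at the content.
func mirrorSnapshot(s snapshot, ck *checkpoint) {
	if s.Owner == ck.Name && terminal(s.Phase) {
		ck.Phase = s.Phase
	}
}

func main() {
	c := content{Name: "psc-1", Phase: phaseReady}
	s := &snapshot{Name: "ckpt-a", Owner: "ckpt-a", BoundContent: "psc-1", Phase: phasePending}
	ck := &checkpoint{Name: "ckpt-a", Phase: phasePending}

	for _, hit := range snapshotsForContent(c, []*snapshot{s}) {
		if mirrorContent(c, hit) {
			mirrorSnapshot(*hit, ck)
		}
	}
	fmt.Println("checkpoint phase:", ck.Phase)
}
```

Keeping the second hop blind to `content` is the point of the sketch: the checkpoint reacts only to a change on the snapshot it owns.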
  1. DynamoCheckpoint -> PodSnapshot. The DynamoCheckpoint controller creates the checkpoint Job (the workload-carrier pod) and a PodSnapshot named after the checkpoint (ckpt.Name) and stamped nvidia.com/snapshot-owner; it records the name in status.podSnapshotName and observes the PodSnapshot by owner label (never a reconstructed name).
  2. PodSnapshot -> PodSnapshotContent. The PodSnapshotReconciler (#10820) waits for the source pod to be scheduled, then creates the cluster-scoped PodSnapshotContent work order carrying spec.source.podRef + nodeName and the nvidia.com/snapshot-node=<node> routing label, and sets status.boundPodSnapshotContentName (Pending).
  3. Agent pick-up + pre-bind gate. The node agent's dynamic content informer (filtered to its node via snapshot-node) picks up the work order; the gate validates the source pod and promotes it by labeling it nvidia.com/snapshot-capture-eligible -- it never runs the dump directly.
  4. Capture -- the source pod is the single source of truth. The source-pod informer (keyed on the eligible label) drives the single capture path. All capture parameters are read from the source pod's labels/annotations, not from the work order: the checkpoint ID (nvidia.com/snapshot-checkpoint-id label), target container (nvidia.com/snapshot-target-containers annotation), and destination (storage annotations + agent base path). The PodSnapshotContent is only a trigger + routing record (spec.source.podRef + nodeName) -- a content event merely nudges the agent, which then picks the oldest active work order for that pod. The agent acquires the per-work-order lease, waits for the target container to be Ready, runs the CRIU dump via the executor, writes the snapshot-complete sentinel, and patches terminal status on the PodSnapshotContent (Status().Patch only -- Ready/Failed; it never writes the content's spec).
  5. Status cascades up. PodSnapshotContent -> PodSnapshot (the reconciler's content-watch map-fn) mirrors Ready/Failed; PodSnapshot -> DynamoCheckpoint (Owns) sets the checkpoint Ready (+status.checkpointID) / Failed. The DynamoCheckpoint never watches PodSnapshotContent directly.
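
Step 4 can be sketched as two small functions: one reads the capture parameters from the source pod only, the other picks the oldest active work order for that pod. This is an illustration under assumptions, not the agent's code: the `sourcePod` and `workOrder` structs are hypothetical stand-ins, the comma-separated format of the target-containers annotation is assumed, and only the two label/annotation keys come from the description above.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
)

// Keys named in the PR description.
const (
	checkpointIDLabel          = "nvidia.com/snapshot-checkpoint-id"
	targetContainersAnnotation = "nvidia.com/snapshot-target-containers"
)

// sourcePod is a hypothetical, trimmed-down stand-in for a Kubernetes Pod.
type sourcePod struct {
	Name        string
	Labels      map[string]string
	Annotations map[string]string
}

// workOrder is a hypothetical stand-in for a PodSnapshotContent.
type workOrder struct {
	Name     string
	PodRef   string
	Created  int64 // creation time, Unix seconds
	Terminal bool  // true once Ready or Failed has been written
}

type captureParams struct {
	CheckpointID string
	Containers   []string
}

// paramsFromPod reads capture parameters from the pod, never from the work order.
// Assumption: the annotation holds a comma-separated container list.
func paramsFromPod(p sourcePod) (captureParams, error) {
	id := p.Labels[checkpointIDLabel]
	if id == "" {
		return captureParams{}, errors.New("missing checkpoint ID label")
	}
	var containers []string
	for _, c := range strings.Split(p.Annotations[targetContainersAnnotation], ",") {
		if c = strings.TrimSpace(c); c != "" {
			containers = append(containers, c)
		}
	}
	if len(containers) == 0 {
		return captureParams{}, errors.New("missing target containers annotation")
	}
	return captureParams{CheckpointID: id, Containers: containers}, nil
}

// oldestActive returns the oldest non-terminal work order that references podName.
func oldestActive(orders []workOrder, podName string) (workOrder, bool) {
	var active []workOrder
	for _, o := range orders {
		if o.PodRef == podName && !o.Terminal {
			active = append(active, o)
		}
	}
	if len(active) == 0 {
		return workOrder{}, false
	}
	sort.SliceStable(active, func(i, j int) bool { return active[i].Created < active[j].Created })
	return active[0], true
}

func main() {
	pod := sourcePod{
		Name:        "ckpt-job-abc",
		Labels:      map[string]string{checkpointIDLabel: "ckpt-123"},
		Annotations: map[string]string{targetContainersAnnotation: "main"},
	}
	params, err := paramsFromPod(pod)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	orders := []workOrder{
		{Name: "psc-new", PodRef: pod.Name, Created: 200},
		{Name: "psc-old", PodRef: pod.Name, Created: 100},
	}
	if wo, ok := oldestActive(orders, pod.Name); ok {
		fmt.Printf("capture %s for work order %s, containers %v\n", params.CheckpointID, wo.Name, params.Containers)
	}
}
```

Because the work order carries no capture parameters, a content event in this sketch can only select which order to fulfil; everything about the dump itself is derived from the pod.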

Details:

Operator (DynamoCheckpoint → PodSnapshot):

  • The DynamoCheckpoint controller creates and observes a PodSnapshot (replacing the old ObserveCheckpointJob switch). PodSnapshot Ready/Failed → DynamoCheckpoint Ready/Failed (+ status.checkpointID).
  • The PodSnapshot is looked up by the nvidia.com/snapshot-owner label + ownership (never by reconstructed name); its name is recorded in status.podSnapshotName.
  • PodSnapshot creation is a plain Create that requeues on AlreadyExists (no SSA force-ownership); an event-driven source-pod watch scoped by CheckpointSourceLabel replaces the previous poll.
  • Enforcement of the checkpoint Job's activeDeadlineSeconds is left to Kubernetes, which sets the JobFailed condition with reason DeadlineExceeded; the Owns(&Job) watch surfaces that to the controller, so there is no operator-side deadline math.
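
The operator-side behaviour in the bullets above (lookup by owner, plain create, requeue on AlreadyExists, mirror terminal phase) can be sketched with an in-memory store. This is a hedged illustration: `fakeStore`, `errAlreadyExists`, and `reconcile` are hypothetical names standing in for the API server, the apimachinery AlreadyExists error, and the real reconciler.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists stands in for the API server's AlreadyExists error.
var errAlreadyExists = errors.New("already exists")

type podSnapshot struct {
	Name  string
	Owner string // value of the snapshot-owner label
	Phase string // "Pending", "Ready" or "Failed"
}

// fakeStore is a hypothetical in-memory replacement for the API server.
type fakeStore struct {
	byName map[string]*podSnapshot
}

func (s *fakeStore) create(ps podSnapshot) error {
	if _, ok := s.byName[ps.Name]; ok {
		return errAlreadyExists
	}
	s.byName[ps.Name] = &ps
	return nil
}

// findByOwner looks the snapshot up by owner, never by a reconstructed name.
func (s *fakeStore) findByOwner(owner string) *podSnapshot {
	for _, ps := range s.byName {
		if ps.Owner == owner {
			return ps
		}
	}
	return nil
}

type checkpoint struct {
	Name            string
	PodSnapshotName string // mirrors status.podSnapshotName
	Phase           string
}

// reconcile returns true when the caller should requeue.
func reconcile(s *fakeStore, c *checkpoint) (bool, error) {
	ps := s.findByOwner(c.Name)
	if ps == nil {
		err := s.create(podSnapshot{Name: c.Name, Owner: c.Name, Phase: "Pending"})
		switch {
		case errors.Is(err, errAlreadyExists):
			// The name is taken by an object we do not own (or have not
			// observed yet): retry later instead of forcing ownership.
			return true, nil
		case err != nil:
			return false, err
		}
		c.PodSnapshotName = c.Name
		return false, nil
	}
	// Record the pointer on observe too, so Ready never has an empty name.
	c.PodSnapshotName = ps.Name
	if ps.Phase == "Ready" || ps.Phase == "Failed" {
		c.Phase = ps.Phase
	}
	return false, nil
}

func main() {
	store := &fakeStore{byName: map[string]*podSnapshot{}}
	ck := &checkpoint{Name: "ckpt-a", Phase: "Pending"}

	requeue, err := reconcile(store, ck)
	fmt.Println("after create:", ck.PodSnapshotName, requeue, err)

	store.byName["ckpt-a"].Phase = "Ready"
	requeue, err = reconcile(store, ck)
	fmt.Println("after observe:", ck.Phase, requeue, err)
}
```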

Node agent (PodSnapshotContent capture):

  • The per-node NodeController fulfils PodSnapshotContent work orders: a pre-bind gate validates the source pod and promotes it with nvidia.com/snapshot-capture-eligible; the capture path drives the CRIU dump via the executor and writes terminal PodSnapshotContent status (status-only patches).
  • Cancellation is detection-only (in-flight dumps run to completion); an nvidia.com/snapshot-content back-ref is stamped on the source pod.
  • Removes the legacy Job-annotation observe protocol (ObserveCheckpointJob / CheckpointObservation*).
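
A rough sketch of the pre-bind gate follows, assuming it checks routing and the checkpoint ID label before promoting the pod. The specific validation rules, the `pod` and `workOrder` structs, and the `"true"` label value are assumptions for illustration; only the label keys come from this description. `maps.Clone` requires Go 1.21 or newer.

```go
package main

import (
	"errors"
	"fmt"
	"maps"
)

const (
	eligibleLabel     = "nvidia.com/snapshot-capture-eligible"
	checkpointIDLabel = "nvidia.com/snapshot-checkpoint-id"
)

// pod and workOrder are hypothetical, trimmed-down stand-ins.
type pod struct {
	Name     string
	NodeName string
	Labels   map[string]string
}

type workOrder struct {
	PodRef   string
	NodeName string
}

// validate is an assumed set of gate checks; the real gate may check more.
func validate(p pod, wo workOrder, agentNode string) error {
	switch {
	case wo.NodeName != agentNode:
		return errors.New("work order is routed to another node")
	case wo.PodRef != p.Name:
		return errors.New("work order references a different pod")
	case p.NodeName != agentNode:
		return errors.New("source pod is not scheduled on this node")
	case p.Labels[checkpointIDLabel] == "":
		return errors.New("source pod has no checkpoint ID label")
	}
	return nil
}

// promote returns the label set to patch onto the pod. It clones only the
// labels and leaves the pod it was given untouched.
func promote(p pod) map[string]string {
	labels := maps.Clone(p.Labels)
	if labels == nil { // maps.Clone preserves nil
		labels = map[string]string{}
	}
	labels[eligibleLabel] = "true" // assumed value
	return labels
}

func main() {
	p := pod{
		Name:     "ckpt-job-abc",
		NodeName: "node-1",
		Labels:   map[string]string{checkpointIDLabel: "ckpt-123"},
	}
	wo := workOrder{PodRef: p.Name, NodeName: "node-1"}

	if err := validate(p, wo, "node-1"); err != nil {
		fmt.Println("gate rejected pod:", err)
		return
	}
	fmt.Println("labels to patch:", promote(p))
}
```

The gate only labels the pod; in this sketch, as in the description, the dump itself is left to the capture path that watches for the eligible label.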

Where should the reviewer start?

  • deploy/operator/internal/controller/dynamocheckpoint_controller.go, checkpoint_podsnapshot.go
  • deploy/snapshot/internal/controller/podsnapshotcontent.go, controller.go

Related Issues

🔗 This PR is linked to an issue:

Summary by CodeRabbit

  • New Features

    • Checkpointing now tracks snapshot state more directly, improving reliability and recovery.
    • Added clearer pod and snapshot labels to support capture and reconciliation workflows.
    • Improved Helm/RBAC setup for snapshot-related permissions and resources.
  • Bug Fixes

    • Better handling of snapshot creation, missing resources, and retryable failures.
    • Checkpoint and snapshot controllers now handle transient errors and requeues more consistently.
    • Updated startup/shutdown flow for the snapshot agent to be more predictable.


github-actions Bot commented Jun 25, 2026

Contributor

@github-actions github-actions Bot added the feat, external-contribution, documentation, and deployment::k8s labels Jun 25, 2026

datadog-official Bot commented Jun 25, 2026


Pipelines

⚠️ Warnings

🚦 1 Pipeline job failed

Docs link check | lychee

Commit SHA: 367521f

@Ronkahn21 Ronkahn21 temporarily deployed to external_collaborator June 25, 2026 11:54 — with GitHub Actions Inactive
@Ronkahn21 Ronkahn21 temporarily deployed to external_collaborator June 25, 2026 12:44 — with GitHub Actions Inactive
@Ronkahn21 Ronkahn21 temporarily deployed to external_collaborator June 25, 2026 13:33 — with GitHub Actions Inactive
@Ronkahn21 Ronkahn21 temporarily deployed to external_collaborator June 25, 2026 14:44 — with GitHub Actions Inactive
@Ronkahn21 Ronkahn21 temporarily deployed to external_collaborator June 25, 2026 14:59 — with GitHub Actions Inactive
@Ronkahn21 Ronkahn21 temporarily deployed to external_collaborator June 25, 2026 16:48 — with GitHub Actions Inactive
@Ronkahn21 Ronkahn21 force-pushed the feat/podsnapshot-capture branch from 3d49052 to cca8329 Compare June 25, 2026 17:05
@Ronkahn21 Ronkahn21 temporarily deployed to external_collaborator June 25, 2026 17:05 — with GitHub Actions Inactive
Ronkahn21 added 10 commits June 28, 2026 14:43
The snapshot module imports the operator module via replace => ../operator, so
the operator module must be present in the image build. Pass it as an 'operator'
named build context (build/check-deploy-component actions) and COPY it to
/operator in the Dockerfile go-base stage before 'go mod download', so go can
resolve the local replace. Fixes the snapshot/agent/placeholder image builds.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Record status.podSnapshotName when observing an already-created PodSnapshot so a
checkpoint never reaches Ready with an empty pointer, and remove the unused delete
verb on podsnapshots (nothing deletes them; cleanup is owner-reference GC).

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Use the context logger (seeded in Run) instead of w.log on the capture path
- reconcileSourcePod returns error; informer handlers log it
- Drop the unreachable contentIndexer nil guard
- Remove the snapshot-content back-ref label (helper, const, tests)
- Rename writeReady/writeFailed -> setSnapshotContentSucceeded/Failed
- Patch only labels (maps.Clone) instead of deep-copying the whole pod
- Trim redundant comments

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
@Ronkahn21 Ronkahn21 temporarily deployed to external_collaborator June 28, 2026 13:06 — with GitHub Actions Inactive
@Ronkahn21 Ronkahn21 temporarily deployed to external_collaborator June 28, 2026 14:40 — with GitHub Actions Inactive
@Ronkahn21 Ronkahn21 force-pushed the feat/podsnapshot-capture branch from 799af37 to 367521f Compare June 28, 2026 14:41
@Ronkahn21 Ronkahn21 deployed to external_collaborator June 28, 2026 14:41 — with GitHub Actions Active
Labels

actions, deployment::k8s, documentation, external-contribution, feat, size/XXL
