Add OCI skill image mounting to AgentRuntime by cooktheryan · Pull Request #332 · kagenti/kagenti-operator

cooktheryan · 2026-05-06T15:27:30Z

Summary

Adds a skills field to AgentRuntimeSpec for declaring OCI skill images to mount into agent pods as Kubernetes ImageVolumes
Gated behind a skillImageVolumes feature gate (default off); requires Kubernetes 1.31+
Uses the skillimage OCI format: FROM scratch images with skill.yaml + SKILL.md
Each skill specifies a mountPath, making the feature framework-agnostic (Claude, Cursor, custom agents, etc.)

Example

apiVersion: agent.kagenti.dev/v1alpha1
kind: AgentRuntime
metadata:
  name: resume-agent-runtime
spec:
  type: agent
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: resume-agent
  skills:
    - name: resume-reviewer
      image: ghcr.io/redhat-et/skillimage/resume-reviewer:v1.0.0
      mountPath: /agent/skills/resume-reviewer
    - name: blog-writer
      image: ghcr.io/redhat-et/skillimage/blog-writer:latest
      mountPath: /app/.claude/skills/blog-writer
      pullPolicy: Always
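
As a rough illustration of how the `skills` entries in the manifest above might map onto Go types, here is a minimal sketch. The type names `SkillImageRef` and `SkillPullPolicy` come from the Changes table; the exact fields, the JSON tags, and the `decodeSkills` helper are assumptions inferred from the YAML keys, not the operator's actual definitions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SkillPullPolicy mirrors the pullPolicy key in the example manifest.
// Only "Always" appears in the PR; no other values are assumed here.
type SkillPullPolicy string

// SkillImageRef is a hypothetical shape for one entry under spec.skills.
// The real type may carry more fields and validation markers.
type SkillImageRef struct {
	Name       string          `json:"name"`
	Image      string          `json:"image"`
	MountPath  string          `json:"mountPath"`
	PullPolicy SkillPullPolicy `json:"pullPolicy,omitempty"`
}

// decodeSkills is an illustrative helper that parses the JSON form of spec.skills.
func decodeSkills(data []byte) ([]SkillImageRef, error) {
	var skills []SkillImageRef
	if err := json.Unmarshal(data, &skills); err != nil {
		return nil, fmt.Errorf("decode skills: %w", err)
	}
	return skills, nil
}

func main() {
	raw := []byte(`[
	  {"name": "resume-reviewer",
	   "image": "ghcr.io/redhat-et/skillimage/resume-reviewer:v1.0.0",
	   "mountPath": "/agent/skills/resume-reviewer"},
	  {"name": "blog-writer",
	   "image": "ghcr.io/redhat-et/skillimage/blog-writer:latest",
	   "mountPath": "/app/.claude/skills/blog-writer",
	   "pullPolicy": "Always"}
	]`)
	skills, err := decodeSkills(raw)
	if err != nil {
		panic(err)
	}
	for _, s := range skills {
		fmt.Printf("%s -> %s (image %s)\n", s.Name, s.MountPath, s.Image)
	}
}
```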

Changes

Area	Files	What
CRD types	`api/v1alpha1/agentruntime_types.go`	`SkillImageRef`, `SkillPullPolicy`, `Skills` field
Feature gate	`internal/webhook/config/feature_gates.go`	`SkillImageVolumes` (default false)
Controller	`internal/controller/agentruntime_controller.go`, `agentruntime_skills.go`	Reconcile ImageVolumes on Deployment/StatefulSet, cleanup on deletion
Config hash	`internal/controller/agentruntime_config.go`	Skills in hash → rolling updates on change
Webhook	`internal/webhook/v1alpha1/agentruntime_webhook.go`	Validate duplicate names, reserved volume collisions
Wiring	`cmd/main.go`	Pass feature gate loader to reconciler
Docs	`docs/api-reference.md`, `docs/architecture.md`	SkillImageRef reference, conditions, examples
Samples	`config/samples/agent_v1alpha1_agentruntime_skills.yaml`, updated `_full.yaml`	New and updated sample manifests
Helm	`charts/kagenti-operator/values.yaml`, CRD YAML	Feature gate + CRD schema
Tests	`agentruntime_skills_test.go`, `agentruntime_webhook_test.go`	Volume reconciliation, config hash, validation
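
The "Config hash" row says that skills feed into a hash so that a change triggers a rolling update. A minimal sketch of that idea, assuming a SHA-256 over the JSON encoding of the skill list; the `skill` struct and `skillsHash` function are hypothetical names, and the operator may hash different inputs or use a different encoding.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// skill is a stand-in for the fields that would influence the pod template.
type skill struct {
	Name       string `json:"name"`
	Image      string `json:"image"`
	MountPath  string `json:"mountPath"`
	PullPolicy string `json:"pullPolicy,omitempty"`
}

// skillsHash returns a hex SHA-256 digest of the JSON-encoded skill list.
// Any change to a name, image, mountPath, or pullPolicy changes the digest.
// The digest is order-sensitive: reordering the list also changes it.
func skillsHash(skills []skill) (string, error) {
	b, err := json.Marshal(skills)
	if err != nil {
		return "", fmt.Errorf("marshal skills: %w", err)
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	before := []skill{{Name: "blog-writer", Image: "example.invalid/skills/blog-writer:v1", MountPath: "/app/skills/blog-writer"}}
	after := []skill{{Name: "blog-writer", Image: "example.invalid/skills/blog-writer:v2", MountPath: "/app/skills/blog-writer"}}

	h1, err := skillsHash(before)
	if err != nil {
		panic(err)
	}
	h2, err := skillsHash(after)
	if err != nil {
		panic(err)
	}
	// A differing hash is what would be stamped on the pod template to force a rollout.
	fmt.Println("hash changed:", h1 != h2)
}
```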

Relationship to ConfigMap-based skill linking (kagenti/kagenti#1440)

This feature complements the ConfigMap-based skill mounting in kagenti/kagenti#1440. Both deliver skill files into agent pods, but target different maturity stages from kagenti/kagenti#1342:

	#1440 (ConfigMap)	This PR (OCI ImageVolume)
Storage	Kubernetes ConfigMap (~1MB limit)	OCI registry (no size limit)
Versioning	None (mutable ConfigMap)	OCI tags + digests (immutable)
Lifecycle	Create/delete ConfigMap	draft → testing → published → deprecated → archived
Declaration	Backend API at deploy time	AgentRuntime CR (declarative, GitOps-friendly)
Mount path	Hardcoded `/app/skills/<name>`	User-specified per skill
K8s version	Any	1.31+ (ImageVolume feature gate)

Integration opportunities for discussion

SKILL_FOLDERS env var — #1440 sets SKILL_FOLDERS so agents discover mounted skills. This operator feature could inject the same env var so agents work transparently with both delivery mechanisms.
kagenti.io/skills annotation — #1440 stores linked skills in this annotation. The operator could write this annotation when skills are declared on the AgentRuntime CR, enabling the UI/backend to display OCI-mounted skills alongside ConfigMap-mounted ones.
Coexistence — Both mechanisms can coexist on the same pod. ConfigMap volumes use names like skill-0, skill-1; OCI ImageVolumes use skill-<name>. Different volume types, different names, no conflicts.
Migration path — ConfigMap skills work today on any K8s version. OCI ImageVolume skills are the upgrade path when clusters reach K8s 1.31+. Teams can adopt incrementally.
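
The kagenti.io/skills annotation idea above could look roughly like the following sketch. It assumes the annotation value is a JSON array of skill names and that the annotation is removed when no skills remain; the helper name `applySkillsAnnotation` is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

const skillsAnnotationKey = "kagenti.io/skills"

// applySkillsAnnotation writes the skill names as a JSON array into the
// annotations map, or removes the key when the list is empty.
// The caller must pass a non-nil map.
func applySkillsAnnotation(annotations map[string]string, names []string) error {
	if len(names) == 0 {
		delete(annotations, skillsAnnotationKey)
		return nil
	}
	b, err := json.Marshal(names)
	if err != nil {
		return fmt.Errorf("marshal skill names: %w", err)
	}
	annotations[skillsAnnotationKey] = string(b)
	return nil
}

func main() {
	annotations := map[string]string{}
	if err := applySkillsAnnotation(annotations, []string{"resume-reviewer", "blog-writer"}); err != nil {
		panic(err)
	}
	fmt.Println(annotations[skillsAnnotationKey])

	// Clearing the skills removes the annotation again.
	if err := applySkillsAnnotation(annotations, nil); err != nil {
		panic(err)
	}
	_, present := annotations[skillsAnnotationKey]
	fmt.Println("annotation present:", present)
}
```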

Test plan

Unit tests: volume reconciliation (add/remove/update/multi-container), config hash, webhook validation
make manifests generate — CRD and deepcopy regenerated
go build ./... — compiles cleanly
go test ./internal/controller/ ./internal/webhook/... — all tests pass
Kind cluster (K8s 1.31): CRD installs, schema validation works, fields round-trip correctly
E2E: Full operator deployment with skill ImageVolumes on K8s 1.33+ cluster (requires kind v0.29.0+ with containerd 2.1.1 for runtime-level ImageVolume support)

Assisted-By: Claude Code

cooktheryan · 2026-05-06T15:31:16Z

DO NOT MERGE at the current time. I would like feedback based on kagenti/kagenti#1342

cwiklik

Solid implementation with proper feature gating, comprehensive tests (unit + E2E), clean separation of concerns (controller, webhook, config hash), and thorough docs. The ImageVolume K8s requirement (1.31+) is well-documented and the graceful degradation (condition + event when gate is disabled) is good UX.

Areas reviewed: Go (types, controller, webhook), Helm, CRD, Docs, Tests
Commits: 3 commits, all signed-off: yes
CI status: all passing (E2E pending manual trigger)

Suggestions (non-blocking)

1. PR body attribution (nit)
PR body ends with "Generated with Claude Code" — per repo conventions this should be "Assisted-By: Claude Code".

2. Commit hygiene (suggestion)
Commits 45e06bc ("include e2e tests for oci") and 4bd2f5a ("fixes due to code review") don't follow the imperative commit convention and are vague. Consider squashing into the main commit before merge.

3. Skill mounts applied to all containers (suggestion)
Skills are currently mounted into ALL containers including sidecars (envoy-proxy, spiffe-helper). For pods with AuthBridge injection, sidecars don't need skill files. Consider targeting only the agent container in a follow-up. Not a blocker for alpha — the extra read-only mounts are harmless — but worth tracking to avoid clutter in complex pod specs.

pavelanni · 2026-05-06T17:39:31Z

It's important to make sure that the mounted skills are listed in the AgentCard exposed by the agent running in Agent Runtime. There is a section in the AgentCard spec for that.

https://agent2agent.info/docs/concepts/agentcard/

In my agent harness (https://github.com/redhat-et/docsclaw) it is implemented by the agent itself, but it would be good to have it implemented at the runtime level to make it agent-agnostic.

Another important thing is to ensure that images are mounted in containers read-only, to avoid any risk of malicious agents mutating them. If the Operator mounts them, this should be part of its logic.

pavelanni · 2026-05-06T21:57:10Z

Please take a look at the SkillCard schema that I use in Skill Image: https://github.com/redhat-et/skillimage/blob/main/schemas/skillcard-v1.json
It might be used as a prototype for Kagenti skills.

cooktheryan · 2026-05-07T19:58:21Z

@kevincogan one thing my brain is stuck on right now: when we build the container image for an agent, we build it with the agentcard, and that agentcard is r/o. The sticking point is that the OCI mounting for skills may be dynamic, but the skills section of an agentcard is pretty much locked with our mechanism. Any advice or thoughts here?

cooktheryan · 2026-05-07T20:25:55Z

Additionally, @Ladas, do you have any opinions here based on your work launching Claude Code and similar tools using the OCI mounting mechanisms?

kevincogan · 2026-05-07T21:07:12Z

> @kevincogan one thing my brain is stuck on right now: when we build the container image for an agent, we build it with the agentcard, and that agentcard is r/o. The sticking point is that the OCI mounting for skills may be dynamic, but the skills section of an agentcard is pretty much locked with our mechanism. Any advice or thoughts here?

@cooktheryan I don't think we should touch the signed card the agent serves. That stays locked down. But the AgentCardReconciler already fetches and caches the card into status, so we can just append the runtime skills to that cached copy after verification completes. One flow, one CR, just an enriched status at the end.

Security-wise nothing changes. Verification (JWS or mTLS) still runs against the original card before any merging happens. The Verified condition, NetworkPolicy, and identity binding are all driven by the original signed card. The appended skills are purely informational for discovery and the UI.

Your kagenti.io/skills annotation is basically all I'd need on my side. The AgentCard controller reads that and appends anything not already in the card.

Let me know if I am missing anything. If not I can pick this up as a follow-up once yours lands.
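
The "append anything not already in the card" step described in this comment can be sketched as a simple order-preserving merge. This is only an illustration of the proposal; the function name `mergeSkills` and the use of plain skill names as the comparison key are assumptions.

```go
package main

import "fmt"

// mergeSkills returns the card's skills followed by any mounted skills
// that the card does not already list. The card's own entries are never
// modified or reordered, matching the idea that the signed card stays untouched.
func mergeSkills(card, mounted []string) []string {
	seen := make(map[string]struct{}, len(card)+len(mounted))
	out := make([]string, 0, len(card)+len(mounted))
	for _, s := range card {
		seen[s] = struct{}{}
		out = append(out, s)
	}
	for _, s := range mounted {
		if _, ok := seen[s]; ok {
			continue // already advertised by the card (or already appended)
		}
		seen[s] = struct{}{}
		out = append(out, s)
	}
	return out
}

func main() {
	card := []string{"resume-reviewer"}
	mounted := []string{"resume-reviewer", "blog-writer"}
	fmt.Println(mergeSkills(card, mounted))
}
```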

pdettori · 2026-05-07T23:01:58Z

@cooktheryan should we set this PR as draft until it is ready to merge?

cooktheryan · 2026-05-08T00:01:55Z

@pdettori yes, for sure... I was feeling confident in the PR early on, then I realized how many pieces we have to tie in.

eranra · 2026-05-10T10:03:06Z

@cooktheryan @pavelanni @pdettori are you in sync with the initial community effort around OCI and skills here: https://github.com/agentskills/agentskills/discussions/292?ref=thomasvitale.com ? It would be best to make Kagenti as "generic" as possible, and to join forces with the community effort and align the code.

pavelanni · 2026-05-10T15:01:21Z

@eranra Yes, I reached out to Thomas Vitale on Slack and we are working on organizing a meeting. There is also a CNCF initiative around that: cncf/toc#1740 which I am participating in as well.
I'm also in contact with the Lola project: https://github.com/LobsterTrap/lola where we are adding OCI extension to their toolset.

eranra · 2026-05-11T16:05:30Z

> @eranra Yes, I reached out to Thomas Vitale on Slack and we are working on organizing a meeting. There is also a CNCF initiative around that: cncf/toc#1740 which I am participating in as well. I'm also in contact with the Lola project: https://github.com/LobsterTrap/lola where we are adding OCI extension to their toolset.

@pavelanni Thanks for sharing ;-)

I looked at the link/initiative, and it is indeed very interesting. I think we should also consider a more “shift-right” approach that automates processes and moves more of the intelligence and optimization into the runtime space.

Focusing on the AI developer persona makes a lot of sense today, but as the skills and AI ecosystem evolves toward greater automation and iterative optimization, the outer loop will become just as important. In particular, the ability to automatically improve, adapt, and incorporate new skills over time will be critical for long-term scalability and operational efficiency. I think that dynamic interaction with skills is a characteristic we need to consider in the interface between agents and skills.

kevincogan · 2026-05-25T09:41:27Z

The AgentCard concern raised above is no longer an issue because the AgentCard CRD has been deprecated (#371/#372). The agent serves its card dynamically at /.well-known/agent-card.json, and the kagenti.io/skills annotation provides discovery for the backend/UI. No AgentCard integration is needed; this PR is self-contained.

Confirming the other open feedback is already addressed in code:

Read-only mounts (@pavelanni): All skill mounts have ReadOnly: true
Skill mounts on sidecars (@cwiklik): Skills mount only to Containers[0] (the agent container)
Community alignment (@eranra): Feature is behind skillImageVolumes gate (default off). Pavel is coordinating with the agentskills community.
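
The first two points (read-only mounts, agent container only) can be illustrated with the sketch below. To keep it dependency-free it uses small local stand-in types rather than the real Kubernetes API types, and `mountSkills` is a hypothetical name; the `skill-<name>` volume naming, `ReadOnly: true`, and `Containers[0]` targeting are taken from this thread.

```go
package main

import "fmt"

// Skill, VolumeMount, and Container are local stand-ins for illustration only.
type Skill struct {
	Name      string
	MountPath string
}

type VolumeMount struct {
	Name      string
	MountPath string
	ReadOnly  bool
}

type Container struct {
	Name         string
	VolumeMounts []VolumeMount
}

// mountSkills adds one read-only mount per skill to the first container only,
// leaving any sidecars untouched.
func mountSkills(containers []Container, skills []Skill) {
	if len(containers) == 0 {
		return
	}
	agent := &containers[0]
	for _, s := range skills {
		agent.VolumeMounts = append(agent.VolumeMounts, VolumeMount{
			Name:      "skill-" + s.Name, // volume naming described earlier in the PR
			MountPath: s.MountPath,
			ReadOnly:  true,
		})
	}
}

func main() {
	containers := []Container{{Name: "agent"}, {Name: "envoy-proxy"}}
	mountSkills(containers, []Skill{{Name: "resume-reviewer", MountPath: "/agent/skills/resume-reviewer"}})
	for _, c := range containers {
		fmt.Printf("%s: %+v\n", c.Name, c.VolumeMounts)
	}
}
```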

I have rebased onto current main and resolved merge conflicts; the build and all tests pass. Ready for review @cooktheryan @pdettori.

cooktheryan · 2026-05-25T14:38:11Z

@pdettori @kevincogan took over this PR and it is ready for review

Ladas

Code Review + Security Review

Reviewer: @Ladas (via Claude Code)
Scope: Full code review + security review with external context research
CI: All 15 checks passing ✓

Overall Assessment

Solid implementation. Clean API design, proper feature gating, good test coverage (27+ specs across unit/integration/webhook/E2E), thorough documentation. The Containers[0]-only mount targeting (addressing @cwiklik's earlier suggestion), read-only enforcement, and declarative reconciliation all follow good patterns.

The feature is appropriately scoped for alpha behind a feature gate. Findings below are suggestions — none are merge-blockers given the gate defaults to off.

Findings

1. MountPath denylist — defense in depth (non-blocking, follow-up)

Files: api/v1alpha1/agentruntime_types.go (pattern), webhook/v1alpha1/agentruntime_webhook.go (validateSkills)

The CRD validates ^/.* (any absolute path). A user with AgentRuntime create permissions could mount over sensitive paths like /var/run/secrets/kubernetes.io/serviceaccount.

ImageVolumes are kubelet-enforced read-only, so this can't inject content — but it CAN overlay the real SA token with garbage, breaking workload auth. The blast radius is limited to pods the user already has RBAC to target, but AgentRuntime permissions may be delegated to teams without direct Deployment access.

Suggestion: Add a denylist in validateSkills:

var deniedPrefixes = []string{"/proc", "/sys", "/dev", "/var/run/secrets"}
// Bind the index as i so the error message can reference spec.skills[i].
for i, s := range skills {
    for _, p := range deniedPrefixes {
        if strings.HasPrefix(s.MountPath, p) {
            return fmt.Errorf("spec.skills[%d]: mountPath %q overlaps protected path %q", i, s.MountPath, p)
        }
    }
}

Also reject .. segments: strings.Contains(s.MountPath, "..").
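
One caveat with a plain strings.HasPrefix check is that it does not stop at a path separator, so "/dev" would also reject an unrelated path such as /devtools. The sketch below is one possible variant, not the code that was merged: it rejects ".." as a whole path segment and matches denied prefixes only on segment boundaries. The function name `validateMountPath` is hypothetical.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

var deniedPrefixes = []string{"/proc", "/sys", "/dev", "/var/run/secrets"}

// validateMountPath rejects relative paths, ".." segments, and paths that
// equal or sit beneath one of the denied prefixes.
func validateMountPath(p string) error {
	if !strings.HasPrefix(p, "/") {
		return fmt.Errorf("mountPath %q must be absolute", p)
	}
	// Check whole segments so that a name like "v1..2" is not rejected by accident.
	for _, seg := range strings.Split(p, "/") {
		if seg == ".." {
			return fmt.Errorf("mountPath %q must not contain .. segments", p)
		}
	}
	clean := path.Clean(p) // normalizes trailing and duplicate slashes
	for _, d := range deniedPrefixes {
		if clean == d || strings.HasPrefix(clean, d+"/") {
			return fmt.Errorf("mountPath %q overlaps protected path %q", p, d)
		}
	}
	return nil
}

func main() {
	for _, p := range []string{
		"/agent/skills/resume-reviewer",
		"/var/run/secrets/kubernetes.io/serviceaccount",
		"/devtools/skills",
		"/agent/../proc",
	} {
		fmt.Printf("%-50s %v\n", p, validateMountPath(p))
	}
}
```

This variant still does not catch a mount at a parent of a protected path (for example /var), which would also hide the service account token; covering that case would need an additional check.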

2. PR description claims "reserved volume collisions" validation — but it doesn't exist (nit)

The PR description table says the webhook validates "reserved volume collisions," but validateSkills only checks duplicate names and duplicate mountPaths within the same CR. No validation against external volume name collisions.

In practice the skill- prefix prevents collision, so this is a documentation accuracy issue, not a code issue. Suggest updating the PR description to match.

3. `json.Marshal` error silently discarded (non-blocking)

File: internal/controller/agentruntime_controller.go, in applyWorkloadConfig:

b, _ := json.Marshal(names)

[]string marshaling can't practically fail, but silently swallowing errors is a code smell. Consider:

b, err := json.Marshal(names)
if err != nil {
    logger.Error(err, "failed to marshal skill names")
}

4. `combinedSidecar` feature gate is unrelated scope (observation)

The PR adds both combinedSidecar and skillImageVolumes to FeatureGates struct and values.yaml. combinedSidecar is unrelated to skill image volumes and not on main. If this was intentional (bundling two feature gates in one PR), no issue — just noting for reviewers.

5. Image reference format not validated at admission (non-blocking, follow-up)

The image field has only minLength: 1. Malformed references pass admission but fail at pod scheduling with cryptic kubelet errors. A basic OCI ref regex in the webhook would improve UX. Not a security issue since kubelet handles validation.
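
A "basic OCI ref regex" along these lines might look like the sketch below. The pattern is a deliberately simplified assumption: it is stricter than the full image reference grammar in some places (for example, it does not accept repeated separators such as "__"), it only recognizes sha256 digests, and it is not what the operator ships. The name `looksLikeImageRef` is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// imageRefPattern is a simplified approximation of an image reference:
// host-or-component[:port], further /components, optional :tag, optional @sha256 digest.
var imageRefPattern = regexp.MustCompile(
	`^[a-z0-9]+(?:[._-][a-z0-9]+)*(?::[0-9]+)?` + // registry host or first component, optional port
		`(?:/[a-z0-9]+(?:[._-][a-z0-9]+)*)*` + // remaining path components
		`(?::[A-Za-z0-9_][A-Za-z0-9._-]{0,127})?` + // optional tag
		`(?:@sha256:[a-f0-9]{64})?$`) // optional digest

// looksLikeImageRef reports whether ref passes the simplified check above.
func looksLikeImageRef(ref string) bool {
	return imageRefPattern.MatchString(ref)
}

func main() {
	for _, ref := range []string{
		"ghcr.io/redhat-et/skillimage/resume-reviewer:v1.0.0",
		"registry.k8s.io/pause:3.9",
		"ghcr.io/foo:",
		"Has Space/image",
	} {
		fmt.Printf("%-55s %v\n", ref, looksLikeImageRef(ref))
	}
}
```

A check like this only catches obviously malformed input early; the kubelet remains the authority on whether a reference can actually be pulled.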

6. Commit attribution (nit per repo conventions)

The commit has Co-authored-by: Cursor <cursoragent@cursor.com>. Per repo CLAUDE.md conventions, this should be Assisted-By. The PR body also ends with "Assisted-By: Claude Code" which is correct, but @cwiklik noted the commit trailer in the original review.

Security Assessment

Control	Status	Notes
ReadOnly mounts	✅	All mounts have `ReadOnly: true` + kubelet-enforced
Container targeting	✅	`Containers[0]` only, sidecars excluded
Feature gate default	✅	Off by default, platform-wide policy
Webhook validation	✅	Duplicate names/paths rejected at admission
Volume prefix isolation	✅	`skill-` prefix prevents collision
Cleanup on deletion	✅	Finalizer removes volumes + annotation
No RBAC escalation	✅	Uses existing Deployment/StatefulSet permissions
MountPath denylist	⚠️	Missing — see finding #1 (non-critical due to read-only)
Image signature verification	ℹ️	Not in scope — planned for skillimage phase 2
Registry allowlist	ℹ️	Not in scope — can use K8s ImagePolicyWebhook

No high-severity exploitable vulnerabilities found. The primary gap (mountPath denylist) is mitigated by K8s ImageVolume's inherent read-only enforcement and the requirement for AgentRuntime RBAC.

Test Coverage

Excellent for alpha:

Layer	Specs	What
Unit (skills)	9	Volume add/remove/update, multi-container, sidecar exclusion, pull policy, framework paths
Unit (config hash)	4	Hash changes on skill add/image/pullPolicy/mountPath
Controller integration	3	Annotation set/not-set/removed on deletion
Webhook	4	Duplicate names/paths rejected on create/update
E2E	7	Feature gate disabled/enabled, mount/update/remove, deletion cleanup, webhook

Smart use of registry.k8s.io/pause:3.9 in E2E — tiny, cached, available.

Recommendation

Approve for merge with the mountPath denylist as a tracked follow-up issue. The feature is behind a default-off gate and the implementation is solid.

Assisted-By: Claude Code

Ladas

👍 looks good, items for followup identified

Add kagenti.io/skills annotation on target workload metadata with a JSON array of mounted skill names for downstream discovery (agent card controllers, UI). The annotation is set when the skillImageVolumes feature gate is enabled and removed on skill clearing or AgentRuntime deletion.

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ryan Cook <rcook@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

- Add mountPath denylist rejecting /proc, /sys, /dev, /var/run/secrets and path traversal (..) segments
- Handle json.Marshal error instead of silently discarding
- Remove unrelated combinedSidecar feature gate (out of scope)
- Add unit tests for denylist validation

Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>

kevincogan · 2026-05-25T16:22:05Z

> 👍 looks good, items for followup identified

Thanks for the review @Ladas. I've pushed a follow-up addressing your feedback.

cooktheryan requested a review from a team as a code owner May 6, 2026 15:27

rubambiza added this to Kagenti Issue Prioritization May 6, 2026

github-project-automation Bot moved this to Backlog in Kagenti Issue Prioritization May 6, 2026

cooktheryan force-pushed the feat/skill-image-volumes branch from b32eaa8 to d3257c1 Compare May 6, 2026 15:29

cooktheryan mentioned this pull request May 6, 2026

Proposal: Skills Management for Kagenti kagenti/kagenti#1342

Closed

1 task

cooktheryan force-pushed the feat/skill-image-volumes branch from d3257c1 to 274bd62 Compare May 6, 2026 15:42

cwiklik approved these changes May 6, 2026


cooktheryan force-pushed the feat/skill-image-volumes branch 2 times, most recently from bd605a8 to 2961db3 Compare May 6, 2026 18:45

cooktheryan force-pushed the feat/skill-image-volumes branch 2 times, most recently from 2f3bf03 to 58f3815 Compare May 7, 2026 19:57

cooktheryan changed the title ~~Feat: Add OCI skill image mounting to AgentRuntime~~ WIP: Add OCI skill image mounting to AgentRuntime May 8, 2026

pdettori marked this pull request as draft May 8, 2026 02:19

xjacka mentioned this pull request May 11, 2026

Weekly Report 2026-05-11 kagenti/kagenti#1533

Closed

This was referenced May 18, 2026

Weekly Report 2026-05-18 kagenti/kagenti#1608

Closed

Weekly Report 2026-05-25 kagenti/kagenti#1662

Closed

kevincogan force-pushed the feat/skill-image-volumes branch from 58f3815 to 8e5fb79 Compare May 25, 2026 09:40

cooktheryan changed the title ~~WIP: Add OCI skill image mounting to AgentRuntime~~ Add OCI skill image mounting to AgentRuntime May 25, 2026

cooktheryan marked this pull request as ready for review May 25, 2026 14:37

Ladas reviewed May 25, 2026


Ladas approved these changes May 25, 2026


cooktheryan and others added 2 commits May 25, 2026 17:11

kevincogan force-pushed the feat/skill-image-volumes branch from 8e5fb79 to 613566c Compare May 25, 2026 16:17

Ladas merged commit 3dc01b8 into kagenti:main May 26, 2026
15 checks passed

github-project-automation Bot moved this from New /:ToDo to Done in Kagenti Issue Prioritization May 26, 2026

cooktheryan mentioned this pull request May 29, 2026

feat: replace spec.skills injection with passive skill discovery #388

Open

7 tasks

clawgenti mentioned this pull request Jun 1, 2026

Weekly Report 2026-06-01 kagenti/kagenti#1756

Open
