Add OCI skill image mounting to AgentRuntime#332
Conversation
b32eaa8 to
d3257c1
Compare
|
DO NOT MERGE at the current time. I would like feedback based on kagenti/kagenti#1342 |
d3257c1 to
274bd62
Compare
cwiklik
left a comment
There was a problem hiding this comment.
Solid implementation with proper feature gating, comprehensive tests (unit + E2E), clean separation of concerns (controller, webhook, config hash), and thorough docs. The ImageVolume K8s requirement (1.31+) is well-documented and the graceful degradation (condition + event when gate is disabled) is good UX.
Areas reviewed: Go (types, controller, webhook), Helm, CRD, Docs, Tests
Commits: 3 commits, all signed-off: yes
CI status: all passing (E2E pending manual trigger)
Suggestions (non-blocking)
1. PR body attribution (nit)
PR body ends with "Generated with Claude Code" — per repo conventions this should be "Assisted-By: Claude Code".
2. Commit hygiene (suggestion)
Commits 45e06bc ("include e2e tests for oci") and 4bd2f5a ("fixes due to code review") don't follow the imperative commit convention and are vague. Consider squashing into the main commit before merge.
3. Skill mounts applied to all containers (suggestion)
Skills are currently mounted into ALL containers including sidecars (envoy-proxy, spiffe-helper). For pods with AuthBridge injection, sidecars don't need skill files. Consider targeting only the agent container in a follow-up. Not a blocker for alpha — the extra read-only mounts are harmless — but worth tracking to avoid clutter in complex pod specs.
|
It's important to make sure that the mounted skills are listed in the AgentCard exposed by the agent running in Agent Runtime. There is a section in the AgentCard spec for that. https://agent2agent.info/docs/concepts/agentcard/ In my agent harness (https://github.com/redhat-et/docsclaw) it is implemented by the agent itself, but it would be good to have it implemented at the runtime level to make it agent-agnostic. Another important thing is ensure that images are mounted in containers read-only to avoid any risk of mutating them my malicious agents. If the Operator mounts them, it should be in its logic. |
bd605a8 to
2961db3
Compare
|
Please take a look at the SkillCard schema that I use in Skill Image: https://github.com/redhat-et/skillimage/blob/main/schemas/skillcard-v1.json |
2f3bf03 to
58f3815
Compare
|
@kevincogan one thing my brain is stuck on right now when we build the container image for an agent we build it with the agentcard and that agentcard is r/o. The stuck point I have is the OCI mounting for skills may be dynamic but the SKILL section of an agentcard is pretty much locked with our mechanism. Any advice or thoughts here? |
|
additionally @Ladas do you have any opinions here based on your work launching claude code and etc using the OCI mounting mechanisms |
@cooktheryan I don't think we should touch the signed card the agent serves. That stays locked down. But the Security-wise nothing changes. Verification (JWS or mTLS) still runs against the original card before any merging happens. The Your Let me know if I am missing anything. If not I can pick this up as a follow-up once yours lands. |
|
@cooktheryan should we set this PR as draft until ready to merge ? |
|
@pdettori yes for sure...i was feeling confident in the PR early then I realized how many pieces we have to tie in |
|
@cooktheryan @pavelanni @pdettori are you guys in sync with the initial community effort around OCI and skills here: https://github.com/agentskills/agentskills/discussions/292?ref=thomasvitale.com --- if will be best if we can make Kagnti as "generic" as possible and if we can join forces with the community effort and align the code it will be best. |
|
@eranra Yes, I reached out to Thomas Vitale on Slack and we are working on organizing a meeting. There is also a CNCF initiative around that: cncf/toc#1740 which I am participating in as well. |
@pavelanni Thanks for sharing ;-) I looked at the link/initiative, and it is indeed very interesting. I think we should also consider a more “shift-right” approach that automates processes and moves more of the intelligence and optimization into the runtime space. Focusing on the AI developer persona makes a lot of sense today, but as the skills and AI ecosystem evolves toward greater automation and iterative optimization, the outer loop will become just as important. In particular, the ability to automatically improve, adapt, and incorporate new skills over time will be critical for long-term scalability and operational efficiency. I think that dynamic interaction with skills is a characteristic we need to consider in the interface between agents and skills. |
58f3815 to
8e5fb79
Compare
|
The AgentCard concern raised above isn't an issue anymore due to the deprecation of the AgentCard CRD (#371/#372). The agent serves its card dynamically at Confirming the other open feedback is already addressed in code:
I have rebased onto current main and resolved merge conflicts, build and all tests pass. Ready for review @cooktheryan @pdettori . |
|
@pdettori @kevincogan took over this PR and it is ready for review |
Ladas
left a comment
There was a problem hiding this comment.
Code Review + Security Review
Reviewer: @Ladas (via Claude Code)
Scope: Full code review + security review with external context research
CI: All 15 checks passing ✓
Overall Assessment
Solid implementation. Clean API design, proper feature gating, good test coverage (27+ specs across unit/integration/webhook/E2E), thorough documentation. The Containers[0]-only mount targeting (addressing @cwiklik's earlier suggestion), read-only enforcement, and declarative reconciliation all follow good patterns.
The feature is appropriately scoped for alpha behind a feature gate. Findings below are suggestions — none are merge-blockers given the gate defaults to off.
Findings
1. MountPath denylist — defense in depth (non-blocking, follow-up)
Files: api/v1alpha1/agentruntime_types.go (pattern), webhook/v1alpha1/agentruntime_webhook.go (validateSkills)
The CRD validates ^/.* (any absolute path). A user with AgentRuntime create permissions could mount over sensitive paths like /var/run/secrets/kubernetes.io/serviceaccount.
ImageVolumes are kubelet-enforced read-only, so this can't inject content — but it CAN overlay the real SA token with garbage, breaking workload auth. The blast radius is limited to pods the user already has RBAC to target, but AgentRuntime permissions may be delegated to teams without direct Deployment access.
Suggestion: Add a denylist in validateSkills:
var deniedPrefixes = []string{"/proc", "/sys", "/dev", "/var/run/secrets"}
for _, s := range skills {
for _, p := range deniedPrefixes {
if strings.HasPrefix(s.MountPath, p) {
return fmt.Errorf("spec.skills[%d]: mountPath %q overlaps protected path %q", i, s.MountPath, p)
}
}
}Also reject .. segments: strings.Contains(s.MountPath, "..").
2. PR description claims "reserved volume collisions" validation — but it doesn't exist (nit)
The PR description table says the webhook validates "reserved volume collisions," but validateSkills only checks duplicate names and duplicate mountPaths within the same CR. No validation against external volume name collisions.
In practice the skill- prefix prevents collision, so this is a documentation accuracy issue, not a code issue. Suggest updating the PR description to match.
3. json.Marshal error silently discarded (non-blocking)
File: internal/controller/agentruntime_controller.go, in applyWorkloadConfig:
b, _ := json.Marshal(names)[]string marshaling can't practically fail, but silently swallowing errors is a code smell. Consider:
b, err := json.Marshal(names)
if err != nil {
logger.Error(err, "failed to marshal skill names")
}4. combinedSidecar feature gate is unrelated scope (observation)
The PR adds both combinedSidecar and skillImageVolumes to FeatureGates struct and values.yaml. combinedSidecar is unrelated to skill image volumes and not on main. If this was intentional (bundling two feature gates in one PR), no issue — just noting for reviewers.
5. Image reference format not validated at admission (non-blocking, follow-up)
The image field has only minLength: 1. Malformed references pass admission but fail at pod scheduling with cryptic kubelet errors. A basic OCI ref regex in the webhook would improve UX. Not a security issue since kubelet handles validation.
6. Commit attribution (nit per repo conventions)
The commit has Co-authored-by: Cursor <cursoragent@cursor.com>. Per repo CLAUDE.md conventions, this should be Assisted-By. The PR body also ends with "Assisted-By: Claude Code" which is correct, but @cwiklik noted the commit trailer in the original review.
Security Assessment
| Control | Status | Notes |
|---|---|---|
| ReadOnly mounts | ✅ | All mounts have ReadOnly: true + kubelet-enforced |
| Container targeting | ✅ | Containers[0] only, sidecars excluded |
| Feature gate default | ✅ | Off by default, platform-wide policy |
| Webhook validation | ✅ | Duplicate names/paths rejected at admission |
| Volume prefix isolation | ✅ | skill- prefix prevents collision |
| Cleanup on deletion | ✅ | Finalizer removes volumes + annotation |
| No RBAC escalation | ✅ | Uses existing Deployment/StatefulSet permissions |
| MountPath denylist | Missing — see finding #1 (non-critical due to read-only) | |
| Image signature verification | ℹ️ | Not in scope — planned for skillimage phase 2 |
| Registry allowlist | ℹ️ | Not in scope — can use K8s ImagePolicyWebhook |
No high-severity exploitable vulnerabilities found. The primary gap (mountPath denylist) is mitigated by K8s ImageVolume's inherent read-only enforcement and the requirement for AgentRuntime RBAC.
Test Coverage
Excellent for alpha:
| Layer | Specs | What |
|---|---|---|
| Unit (skills) | 9 | Volume add/remove/update, multi-container, sidecar exclusion, pull policy, framework paths |
| Unit (config hash) | 4 | Hash changes on skill add/image/pullPolicy/mountPath |
| Controller integration | 3 | Annotation set/not-set/removed on deletion |
| Webhook | 4 | Duplicate names/paths rejected on create/update |
| E2E | 7 | Feature gate disabled/enabled, mount/update/remove, deletion cleanup, webhook |
Smart use of registry.k8s.io/pause:3.9 in E2E — tiny, cached, available.
Recommendation
Approve for merge with the mountPath denylist as a tracked follow-up issue. The feature is behind a default-off gate and the implementation is solid.
Assisted-By: Claude Code
Add kagenti.io/skills annotation on target workload metadata with a JSON array of mounted skill names for downstream discovery (agent card controllers, UI). The annotation is set when the skillImageVolumes feature gate is enabled and removed on skill clearing or AgentRuntime deletion. Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com> Signed-off-by: Ryan Cook <rcook@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com>
- Add mountPath denylist rejecting /proc, /sys, /dev, /var/run/secrets and path traversal (..) segments - Handle json.Marshal error instead of silently discarding - Remove unrelated combinedSidecar feature gate (out of scope) - Add unit tests for denylist validation Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
8e5fb79 to
613566c
Compare
Thanks for the review @Ladas. I've pushed a follow-up addressing your feedback. |
Summary
skillsfield toAgentRuntimeSpecfor declaring OCI skill images to mount into agent pods as Kubernetes ImageVolumesskillImageVolumesfeature gate (default off), requires Kubernetes 1.31+FROM scratchimages withskill.yaml+SKILL.mdmountPath, making the feature framework-agnostic (Claude, Cursor, custom agents, etc.)Example
Changes
api/v1alpha1/agentruntime_types.goSkillImageRef,SkillPullPolicy,Skillsfieldinternal/webhook/config/feature_gates.goSkillImageVolumes(default false)internal/controller/agentruntime_controller.go,agentruntime_skills.gointernal/controller/agentruntime_config.gointernal/webhook/v1alpha1/agentruntime_webhook.gocmd/main.godocs/api-reference.md,docs/architecture.mdconfig/samples/agent_v1alpha1_agentruntime_skills.yaml, updated_full.yamlcharts/kagenti-operator/values.yaml, CRD YAMLagentruntime_skills_test.go,agentruntime_webhook_test.goRelationship to ConfigMap-based skill linking (kagenti/kagenti#1440)
This feature complements the ConfigMap-based skill mounting in kagenti/kagenti#1440. Both deliver skill files into agent pods, but target different maturity stages from kagenti/kagenti#1342:
/app/skills/<name>Integration opportunities for discussion
SKILL_FOLDERSenv var — #1440 setsSKILL_FOLDERSso agents discover mounted skills. This operator feature could inject the same env var so agents work transparently with both delivery mechanisms.kagenti.io/skillsannotation — #1440 stores linked skills in this annotation. The operator could write this annotation when skills are declared on the AgentRuntime CR, enabling the UI/backend to display OCI-mounted skills alongside ConfigMap-mounted ones.Coexistence — Both mechanisms can coexist on the same pod. ConfigMap volumes use names like
skill-0,skill-1; OCI ImageVolumes useskill-<name>. Different volume types, different names, no conflicts.Migration path — ConfigMap skills work today on any K8s version. OCI ImageVolume skills are the upgrade path when clusters reach K8s 1.31+. Teams can adopt incrementally.
Test plan
make manifests generate— CRD and deepcopy regeneratedgo build ./...— compiles cleanlygo test ./internal/controller/ ./internal/webhook/...— all tests passAssisted-By: Claude Code