WIP: replace spec.skills injection with passive skill discovery by cooktheryan · Pull Request #388 · kagenti/kagenti-operator

cooktheryan · 2026-05-29T18:12:24Z

Summary

Remove spec.skills from the AgentRuntime CRD — the operator no longer mutates target Deployments to inject OCI skill ImageVolumes, eliminating GitOps drift with ArgoCD/Flux
Skills are now declared directly in the Deployment manifest (OCI ImageVolumes or ConfigMap volumes) and discovered by the operator through the kagenti.io/skills annotation
New skillDiscovery feature gate (default: false) controls whether the operator reads the annotation and populates status.linkedSkills
The agent discovers mounted skills via SKILL_FOLDERS and reports them in its A2A card — the operator reads the card and the annotation, surfacing both in status
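The annotation-to-status step described above can be sketched roughly as follows. This is an illustration, not the operator's code: the function name `linkedSkills` is hypothetical, and the PR does not state the annotation's value format, so a comma-separated list of skill names is assumed here.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// skillsAnnotation is the annotation key named in this PR.
const skillsAnnotation = "kagenti.io/skills"

// linkedSkills returns the names to report in status.linkedSkills.
// Assumption: the annotation value is a comma-separated list of names.
// It returns nil when the gate is off or the annotation is absent, so the
// caller clears the status field instead of keeping stale entries.
func linkedSkills(annotations map[string]string, gateEnabled bool) []string {
	if !gateEnabled {
		return nil
	}
	raw, ok := annotations[skillsAnnotation]
	if !ok {
		return nil
	}
	seen := map[string]bool{}
	var out []string
	for _, part := range strings.Split(raw, ",") {
		name := strings.TrimSpace(part)
		if name == "" || seen[name] {
			continue // skip blanks and duplicates
		}
		seen[name] = true
		out = append(out, name)
	}
	sort.Strings(out) // stable order avoids spurious status updates
	return out
}

func main() {
	ann := map[string]string{skillsAnnotation: "summarizer, openshift-review"}
	fmt.Println(linkedSkills(ann, true))  // [openshift-review summarizer]
	fmt.Println(linkedSkills(ann, false)) // []
}
```

The function only reads the Deployment's annotations and returns a value for status, which matches the intent that the operator observes the workload without writing to it.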

Context

PR #332 added spec.skills to AgentRuntime, which instructed the operator to inject OCI ImageVolumes into the target Deployment via r.Update(). This caused GitOps drift (ArgoCD/Flux fight with the operator over the Deployment spec) and was architecturally overreaching — the operator was modifying a workload it doesn't own.

This PR changes the operator's role from injector to observer:

Deployment owns its skill volumes (user-managed, in Git)
Agent discovers mounted skills at runtime via SKILL_FOLDERS and advertises them in its A2A card
AgentRuntime reads the kagenti.io/skills annotation and the card, reports both in status
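On the agent side, runtime discovery through SKILL_FOLDERS might look something like the sketch below. The PR does not specify the variable's format or the on-disk layout, so two assumptions are made: folders are separated by the OS path-list separator, and each immediate subdirectory is one skill named after the directory. `discoverSkills` is a hypothetical name.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"
)

// discoverSkills lists skill names found under the given folder list.
// Assumptions (not from the PR): folders are separated by the OS path-list
// separator, and every immediate subdirectory is one skill.
func discoverSkills(folders string) []string {
	seen := map[string]bool{}
	names := []string{}
	for _, dir := range filepath.SplitList(folders) {
		entries, err := os.ReadDir(dir)
		if err != nil {
			continue // missing or unreadable folder: zero skills is a valid state
		}
		for _, e := range entries {
			name := e.Name()
			// Skip dot-prefixed entries, such as the "..data" bookkeeping
			// directories Kubernetes creates inside ConfigMap volumes.
			if !e.IsDir() || strings.HasPrefix(name, ".") || seen[name] {
				continue
			}
			seen[name] = true
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	// The agent would advertise these names in its A2A card.
	fmt.Println(discoverSkills(os.Getenv("SKILL_FOLDERS")))
}
```

Because the scan tolerates empty or missing folders, an agent can start with no skills mounted and pick up added or removed volumes on its next start without being rebuilt.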

This aligns with the ConfigMap-based skill path in kagenti/kagenti#1440, where the kagenti backend sets the same annotation. Both skill delivery mechanisms (ConfigMap and OCI ImageVolume) are now visible through a single annotation and the agent's card.

Test plan

go build ./... — compiles cleanly
go vet ./... — no issues
golangci-lint run ./... — no new issues
Kind cluster validation: Deployment with both OCI ImageVolume and ConfigMap skills, AgentRuntime discovers both via annotation, agent card reflects all mounted skills
Feature gate validation: skillDiscovery: true populates status.linkedSkills, skillDiscovery: false clears it
Operator does NOT mutate the Deployment PodSpec
E2E tests (skill injection block removed, needs new skill discovery e2e)

Assisted-By: Claude Code

Remove spec.skills from AgentRuntime CRD: the operator no longer mutates target Deployments to inject OCI skill ImageVolumes. This eliminates GitOps drift with ArgoCD/Flux.

Skills are now declared directly in the Deployment manifest (OCI ImageVolumes or ConfigMap volumes) and discovered by the operator through the kagenti.io/skills annotation, gated behind the skillDiscovery feature flag.

The agent discovers mounted skills via SKILL_FOLDERS and reports them in its A2A card. The operator reads the card and the annotation, surfacing both in status.

Removed:
- spec.skills, SkillImageRef, SkillPullPolicy types
- reconcileSkillVolumes controller mutation code
- skillImageVolumes feature gate
- Webhook skill validation (name/path checks)
- E2E tests, fixtures, and utils for skill injection

Added:
- status.linkedSkills populated from kagenti.io/skills annotation
- SkillsDiscovered condition
- skillDiscovery feature gate (default: false)
- Skill discovery sample manifest with OCI + ConfigMap examples

Assisted-By: Claude Code
Signed-off-by: Ryan Cook <rcook@redhat.com>

cooktheryan · 2026-05-29T18:30:27Z

@pavelanni @kevincogan @pdettori @eranra I wanted to re-roll the way skills are defined. I started blogging about the skill work and quickly realized that modifying the Deployment's volume mounts would create a really painful scenario with GitOps/ArgoCD. I wanted to try to pull in the skills dynamically while also referencing what was brought in with kagenti/kagenti#1440. I am open to any feedback and conversation. @kevincogan I also want to ensure this does not cause issues with any security-related ideas or concepts you had.

One of the big things I am thinking about with skills is that a user may start with 0 skills attached, then add a skill or two, lifecycle them for a while, and later discover that a skill is no longer required. All of this could and should occur without the need to rewrite or rebuild the Agent unless it is necessary.

apiVersion: agent.kagenti.dev/v1alpha1
kind: AgentRuntime
metadata:
  annotations:
    agent.kagenti.dev/last-card-fetch-hash: "5"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"agent.kagenti.dev/v1alpha1","kind":"AgentRuntime","metadata":{"annotations":{},"labels":{"app.kubernetes.io/name":"a2a-currency-converter"},"name":"a2a-currency-converter","namespace":"default"},"spec":{"targetRef":{"apiVersion":"apps/v1","kind":"Deployment","name":"a2a-currency-converter"},"type":"agent"}}
  creationTimestamp: "2026-05-29T17:55:10Z"
  finalizers:
  - kagenti.io/cleanup
  generation: 1
  labels:
    app.kubernetes.io/name: a2a-currency-converter
  name: a2a-currency-converter
  namespace: default
  resourceVersion: "5688"
  uid: fccb4363-7873-4a05-992e-8dca085404c4
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: a2a-currency-converter
  type: agent
status:
  card:
    capabilities:
      pushNotifications: false
      streaming: false
    cardId: 294e46254525e6aa0e0a10969044437cd7fabfcd9b97ff76385a1f534588d5d3
    description: Converts between currencies using real-time exchange rates
    fetchedAt: "2026-05-29T17:55:10Z"
    name: Currency Converter
    protocol: a2a
    skills:
    - description: Convert an amount from one currency to another
      examples:
      - Convert 100 USD to EUR
      - How much is 50 GBP in JPY?
      id: convert-currency
      name: Currency Conversion
    - description: Review Kubernetes manifests for OpenShift readiness
      id: openshift-review
      name: openshift-review
    - description: Summarize long text into concise structured summaries
      id: summarizer
      name: summarizer
    url: http://a2a-currency-converter.default.svc.cluster.local:8080/
    version: 1.0.0
  conditions:
  - lastTransitionTime: "2026-05-29T17:55:10Z"
    message: Deployment a2a-currency-converter resolved
    observedGeneration: 1
    reason: TargetFound
    status: "True"
    type: TargetResolved
  - lastTransitionTime: "2026-05-29T17:55:10Z"
    message: Configuration resolved successfully
    observedGeneration: 1
    reason: ConfigResolved
    status: "True"
    type: ConfigResolved
  - lastTransitionTime: "2026-05-29T17:55:10Z"
    message: Pod template unchanged; existing card data still valid
    observedGeneration: 1
    reason: FetchSkipped
    status: "True"
    type: CardSynced
  - lastTransitionTime: "2026-05-29T17:55:10Z"
    message: 2 linked skill(s) discovered from workload annotation
    observedGeneration: 1
    reason: SkillsFound
    status: "True"
    type: SkillsDiscovered
  - lastTransitionTime: "2026-05-29T17:55:10Z"
    message: Workload a2a-currency-converter configured with config-hash 57de3ab75a7e
    observedGeneration: 1
    reason: Configured
    status: "True"
    type: Ready
  configuredPods: 2
  linkedSkills:
  - openshift-review
  - summarizer
  phase: Active

pavelanni · 2026-05-29T18:41:22Z

Adding skills dynamically is possible (you can add a skill-puller tool to AgentRuntime), but that breaks the idea of having read-only skills, which prevents malicious agents/skills from editing existing skills. When we mount them upfront read-only, we are safe. When the agent pulls them during its activity, we have to make sure that 1) it verifies the skill (based on its signature or SHA) and 2) it locks the pulled skill so it can't change it after that. The second part is tricky. Let me think about it and investigate.
Alternatively, we can define different modes in which AgentRuntime operates: a flexible/development/unsafe mode, and a locked-down production mode (where all skills are mounted read-only upfront). This is also good from a positioning perspective: you give the user a choice and the user understands the risks.

cooktheryan · 2026-05-29T18:58:00Z

Looking forward to what you find, @pavelanni. I think one of the pieces I really overlooked was how people manage their running applications. We definitely need to present and provide our solution with as much security around it as possible.

pavelanni · 2026-05-30T14:58:58Z

After reviewing several options, I think the right solution for dynamic skill discovery and deployment is a skill-puller sidecar that can search for, verify, and pull skills into a shared volume mounted read-write for the sidecar and read-only for the main agent. The sidecar itself will be written in Go, so it won't add much to the pod's total resource consumption. The agent Deployment can be configured with or without this sidecar, depending on whether we need dynamic agents.
I'll create a prototype next week.
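For illustration only, the "verify" step of such a sidecar could be sketched as a digest check before anything is written to the shared volume. This is not the planned prototype: `verifySkill` and the `sha256:<hex>` pin format are assumptions, and a real implementation might verify signatures instead of, or in addition to, a digest. The "lock" step is not shown; in the design above it comes from mounting the volume read-only in the agent container.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// verifySkill checks pulled skill content against a pinned digest of the
// form "sha256:<hex>". Hypothetical helper for illustration.
func verifySkill(content []byte, pinned string) error {
	const prefix = "sha256:"
	if !strings.HasPrefix(pinned, prefix) {
		return errors.New("unsupported digest algorithm")
	}
	want, err := hex.DecodeString(strings.TrimPrefix(pinned, prefix))
	if err != nil || len(want) != sha256.Size {
		return errors.New("malformed digest")
	}
	got := sha256.Sum256(content)
	if subtle.ConstantTimeCompare(got[:], want) != 1 {
		return errors.New("digest mismatch")
	}
	return nil
}

func main() {
	// SHA-256 of the bytes "hello".
	pinned := "sha256:2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"
	fmt.Println(verifySkill([]byte("hello"), pinned))    // <nil>
	fmt.Println(verifySkill([]byte("tampered"), pinned)) // digest mismatch
}
```

The sidecar would copy content into the shared volume only after this check returns nil, so the agent never sees unverified skills.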

cooktheryan requested a review from a team as a code owner May 29, 2026 18:12

rubambiza added this to Kagenti Issue Prioritization May 29, 2026

github-project-automation Bot moved this to New /:ToDo in Kagenti Issue Prioritization May 29, 2026

cooktheryan force-pushed the worktree-poc-skill-observer branch from a9d3fe7 to 177ed3b Compare May 29, 2026 18:26

cooktheryan changed the title ~~feat: replace spec.skills injection with passive skill discovery~~ WIP: replace spec.skills injection with passive skill discovery May 29, 2026

cooktheryan wants to merge 1 commit into kagenti:main from cooktheryan:worktree-poc-skill-observer

3 participants
