WIP: replace spec.skills injection with passive skill discovery#388

Open
cooktheryan wants to merge 1 commit into kagenti:main from cooktheryan:worktree-poc-skill-observer

@cooktheryan
Contributor

Summary

  • Remove spec.skills from the AgentRuntime CRD — the operator no longer mutates target Deployments to inject OCI skill ImageVolumes, eliminating GitOps drift with ArgoCD/Flux
  • Skills are now declared directly in the Deployment manifest (OCI ImageVolumes or ConfigMap volumes) and discovered by the operator through the kagenti.io/skills annotation
  • New skillDiscovery feature gate (default: false) controls whether the operator reads the annotation and populates status.linkedSkills
  • The agent discovers mounted skills via SKILL_FOLDERS and reports them in its A2A card — the operator reads the card and the annotation, surfacing both in status
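
The operator-side behaviour described above could look roughly like the following sketch. It is not the code from this PR. The annotation key `kagenti.io/skills` and the gate semantics come from the description; the comma-separated value format, the function name `linkedSkills`, and the de-duplication and sorting are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// skillsAnnotation is the annotation key named in the PR description.
const skillsAnnotation = "kagenti.io/skills"

// linkedSkills returns the names to report in status.linkedSkills.
// Assumption: the annotation value is a comma-separated list of skill names.
// When the gate is off it returns nil, so any earlier status value is cleared.
func linkedSkills(annotations map[string]string, gateEnabled bool) []string {
	if !gateEnabled {
		return nil
	}
	seen := map[string]bool{}
	var out []string
	for _, name := range strings.Split(annotations[skillsAnnotation], ",") {
		name = strings.TrimSpace(name)
		if name == "" || seen[name] {
			continue // skip empty entries and duplicates
		}
		seen[name] = true
		out = append(out, name)
	}
	sort.Strings(out) // stable order avoids needless status updates
	return out
}

func main() {
	ann := map[string]string{skillsAnnotation: "summarizer, openshift-review"}
	fmt.Println(linkedSkills(ann, true))  // prints [openshift-review summarizer]
	fmt.Println(linkedSkills(ann, false)) // prints []
}
```

The function only reads the Deployment's annotations and never touches the PodSpec, which is the observer role this PR is aiming for.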

Context

PR #332 added spec.skills to AgentRuntime, which instructed the operator to inject OCI ImageVolumes into the target Deployment via r.Update(). This caused GitOps drift (ArgoCD/Flux fight with the operator over the Deployment spec) and was architecturally overreaching — the operator was modifying a workload it doesn't own.

This PR changes the operator's role from injector to observer:

  • Deployment owns its skill volumes (user-managed, in Git)
  • Agent discovers mounted skills at runtime via SKILL_FOLDERS and advertises them in its A2A card
  • AgentRuntime reads the kagenti.io/skills annotation and the card, reports both in status

This aligns with the ConfigMap-based skill path in kagenti/kagenti#1440, where the kagenti backend sets the same annotation. Both skill delivery mechanisms (ConfigMap and OCI ImageVolume) are now visible through a single annotation and the agent's card.
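
On the agent side, runtime discovery via `SKILL_FOLDERS` might be sketched as below. The PR does not specify the variable's format or the on-disk layout, so both are assumptions here: the value is treated as an OS path list, and a skill is taken to be any immediate subdirectory that contains a `SKILL.md` file. The helpers `discoverSkills` and `demoTree` are hypothetical names.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// discoverSkills lists skill names found under the directories in folders.
// Assumptions (not confirmed by the PR): folders uses the OS path-list
// separator, and a skill is an immediate subdirectory containing SKILL.md.
func discoverSkills(folders string) []string {
	var names []string
	for _, root := range filepath.SplitList(folders) {
		entries, err := os.ReadDir(root)
		if err != nil {
			continue // a missing or unreadable folder contributes no skills
		}
		for _, e := range entries {
			if !e.IsDir() {
				continue
			}
			if _, err := os.Stat(filepath.Join(root, e.Name(), "SKILL.md")); err == nil {
				names = append(names, e.Name())
			}
		}
	}
	sort.Strings(names)
	return names
}

// demoTree builds a throwaway folder with two skills and one non-skill
// directory, purely to exercise discoverSkills.
func demoTree() (string, error) {
	root, err := os.MkdirTemp("", "skills")
	if err != nil {
		return "", err
	}
	for _, name := range []string{"summarizer", "openshift-review"} {
		dir := filepath.Join(root, name)
		if err := os.MkdirAll(dir, 0o755); err != nil {
			return "", err
		}
		if err := os.WriteFile(filepath.Join(dir, "SKILL.md"), []byte("# "+name+"\n"), 0o644); err != nil {
			return "", err
		}
	}
	return root, os.MkdirAll(filepath.Join(root, "not-a-skill"), 0o755)
}

func main() {
	// Use SKILL_FOLDERS when set; otherwise fall back to the demo tree.
	folders := os.Getenv("SKILL_FOLDERS")
	if folders == "" {
		root, err := demoTree()
		if err != nil {
			fmt.Println("demo setup failed:", err)
			return
		}
		defer os.RemoveAll(root)
		folders = root
	}
	fmt.Println(discoverSkills(folders))
}
```

Because the scan depends only on what is mounted, the same code path would cover both OCI ImageVolume and ConfigMap skills, and adding or removing a volume changes the advertised skills without rebuilding the agent.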

Test plan

  • go build ./... — compiles cleanly
  • go vet ./... — no issues
  • golangci-lint run ./... — no new issues
  • Kind cluster validation: Deployment with both OCI ImageVolume and ConfigMap skills, AgentRuntime discovers both via annotation, agent card reflects all mounted skills
  • Feature gate validation: skillDiscovery: true populates status.linkedSkills, skillDiscovery: false clears it
  • Operator does NOT mutate the Deployment PodSpec
  • E2E tests (skill injection block removed, needs new skill discovery e2e)

Assisted-By: Claude Code

Remove spec.skills from AgentRuntime CRD — the operator no longer
mutates target Deployments to inject OCI skill ImageVolumes. This
eliminates GitOps drift with ArgoCD/Flux.

Skills are now declared directly in the Deployment manifest (OCI
ImageVolumes or ConfigMap volumes) and discovered by the operator
through the kagenti.io/skills annotation, gated behind the
skillDiscovery feature flag. The agent discovers mounted skills
via SKILL_FOLDERS and reports them in its A2A card. The operator
reads the card and the annotation, surfacing both in status.

Removed:
- spec.skills, SkillImageRef, SkillPullPolicy types
- reconcileSkillVolumes controller mutation code
- skillImageVolumes feature gate
- Webhook skill validation (name/path checks)
- E2E tests, fixtures, and utils for skill injection

Added:
- status.linkedSkills populated from kagenti.io/skills annotation
- SkillsDiscovered condition
- skillDiscovery feature gate (default: false)
- Skill discovery sample manifest with OCI + ConfigMap examples

Assisted-By: Claude Code
Signed-off-by: Ryan Cook <rcook@redhat.com>
@cooktheryan force-pushed the worktree-poc-skill-observer branch from a9d3fe7 to 177ed3b on May 29, 2026 18:26
@cooktheryan changed the title from "feat: replace spec.skills injection with passive skill discovery" to "WIP: replace spec.skills injection with passive skill discovery" on May 29, 2026
@cooktheryan
Contributor Author

@pavelanni @kevincogan @pdettori @eranra I wanted to re-roll the way skills are defined. When I started blogging about the skill work, I quickly realized that having the operator modify the Deployment's volume mounts would make for a really painful scenario with GitOps/ArgoCD. I wanted to try pulling in the skills dynamically while also referencing what was brought in with kagenti/kagenti#1440. I am open to any feedback and conversation. @kevincogan I also want to ensure this does not cause issues with any security-related ideas or concepts you had.

One of the big things I am thinking about with skills is that a user may start with zero skills attached, then add a skill or two, lifecycle them for a while, and then discover that a skill is no longer required. All of this could and should occur without the need to rewrite or rebuild the Agent unless it is necessary.

apiVersion: agent.kagenti.dev/v1alpha1
kind: AgentRuntime
metadata:
  annotations:
    agent.kagenti.dev/last-card-fetch-hash: "5"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"agent.kagenti.dev/v1alpha1","kind":"AgentRuntime","metadata":{"annotations":{},"labels":{"app.kubernetes.io/name":"a2a-currency-converter"},"name":"a2a-currency-converter","namespace":"default"},"spec":{"targetRef":{"apiVersion":"apps/v1","kind":"Deployment","name":"a2a-currency-converter"},"type":"agent"}}
  creationTimestamp: "2026-05-29T17:55:10Z"
  finalizers:
  - kagenti.io/cleanup
  generation: 1
  labels:
    app.kubernetes.io/name: a2a-currency-converter
  name: a2a-currency-converter
  namespace: default
  resourceVersion: "5688"
  uid: fccb4363-7873-4a05-992e-8dca085404c4
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: a2a-currency-converter
  type: agent
status:
  card:
    capabilities:
      pushNotifications: false
      streaming: false
    cardId: 294e46254525e6aa0e0a10969044437cd7fabfcd9b97ff76385a1f534588d5d3
    description: Converts between currencies using real-time exchange rates
    fetchedAt: "2026-05-29T17:55:10Z"
    name: Currency Converter
    protocol: a2a
    skills:
    - description: Convert an amount from one currency to another
      examples:
      - Convert 100 USD to EUR
      - How much is 50 GBP in JPY?
      id: convert-currency
      name: Currency Conversion
    - description: Review Kubernetes manifests for OpenShift readiness
      id: openshift-review
      name: openshift-review
    - description: Summarize long text into concise structured summaries
      id: summarizer
      name: summarizer
    url: http://a2a-currency-converter.default.svc.cluster.local:8080/
    version: 1.0.0
  conditions:
  - lastTransitionTime: "2026-05-29T17:55:10Z"
    message: Deployment a2a-currency-converter resolved
    observedGeneration: 1
    reason: TargetFound
    status: "True"
    type: TargetResolved
  - lastTransitionTime: "2026-05-29T17:55:10Z"
    message: Configuration resolved successfully
    observedGeneration: 1
    reason: ConfigResolved
    status: "True"
    type: ConfigResolved
  - lastTransitionTime: "2026-05-29T17:55:10Z"
    message: Pod template unchanged; existing card data still valid
    observedGeneration: 1
    reason: FetchSkipped
    status: "True"
    type: CardSynced
  - lastTransitionTime: "2026-05-29T17:55:10Z"
    message: 2 linked skill(s) discovered from workload annotation
    observedGeneration: 1
    reason: SkillsFound
    status: "True"
    type: SkillsDiscovered
  - lastTransitionTime: "2026-05-29T17:55:10Z"
    message: Workload a2a-currency-converter configured with config-hash 57de3ab75a7e
    observedGeneration: 1
    reason: Configured
    status: "True"
    type: Ready
  configuredPods: 2
  linkedSkills:
  - openshift-review
  - summarizer
  phase: Active

@pavelanni
Contributor

Adding skills dynamically is possible (you can add a skill-puller tool to AgentRuntime), but that breaks the idea of read-only skills, which keeps malicious agents or skills from editing existing skills. When we mount them upfront read-only, we are safe. When the agent pulls them during its activity, we have to make sure that 1) it verifies the skill (based on its signature or SHA) and 2) it locks the pulled skill so the agent can't change it after that. The second part is tricky. Let me think about it and investigate.
Alternatively, we can define different modes in which AgentRuntime operates: a flexible/development/unsafe mode, and a locked-down production mode (where all skills are mounted read-only upfront). This is also good from a positioning perspective: the user gets a choice and understands the risks.

@cooktheryan
Contributor Author

Looking forward to what you find, @pavelanni. I think one of the pieces I really overlooked was how people manage their running applications. We definitely need to present our solution with as much security around it as possible.

@pavelanni
Contributor

After reviewing several options, I think the right solution for dynamic skill discovery and deployment is a skill-puller sidecar that can search for, verify, and pull skills into a shared volume that is mounted read-write for the sidecar and read-only for the main agent. The sidecar itself will be written in Go, so it won't add much to the pod's total resource consumption. The agent Deployment can be configured with or without this sidecar, depending on whether we need dynamic agents.
I'll create a prototype next week.
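
To make the verify-then-publish step concrete, here is a minimal sketch of what the sidecar's core check might look like. It is not the planned prototype. It assumes verification by SHA-256 digest only (signature verification is not shown), and the names `verifyAndPublish` and `digestHex` are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// digestHex returns the lowercase hex SHA-256 digest of content.
func digestHex(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}

// verifyAndPublish writes content to sharedDir/name only if its SHA-256
// digest equals wantHex (lowercase hex). The file is created without write
// bits as a secondary guard; the real protection is that the agent container
// mounts the shared volume read-only.
func verifyAndPublish(sharedDir, name string, content []byte, wantHex string) error {
	// Reject names that could escape the shared directory.
	if name == "" || name == "." || name == ".." || strings.ContainsAny(name, "/\\") {
		return errors.New("invalid skill file name: " + name)
	}
	if digestHex(content) != wantHex {
		return errors.New("digest mismatch: refusing to publish " + name)
	}
	tmp := filepath.Join(sharedDir, "."+name+".tmp")
	if err := os.WriteFile(tmp, content, 0o444); err != nil {
		return err
	}
	// Rename is atomic within one filesystem, so readers never see a partial file.
	return os.Rename(tmp, filepath.Join(sharedDir, name))
}

func main() {
	dir, err := os.MkdirTemp("", "shared-skills")
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.RemoveAll(dir)

	content := []byte("# summarizer\n")
	fmt.Println(verifyAndPublish(dir, "SKILL.md", content, digestHex(content))) // prints <nil>
	fmt.Println(verifyAndPublish(dir, "SKILL.md", content, "not-the-digest"))   // prints the digest mismatch error
}
```

Checking the digest before anything is written means unverified content never lands on the shared volume, and the read-write/read-only mount split covers the "lock after pull" concern raised earlier in the thread.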
