Skip to content

Consolidate AgentCard Data Into AgentRuntime Status [Spec]#372

Merged
pdettori merged 11 commits into
kagenti:mainfrom
rhuss:001-agentcard-into-status
May 25, 2026
Merged

Consolidate AgentCard Data Into AgentRuntime Status [Spec]#372
pdettori merged 11 commits into
kagenti:mainfrom
rhuss:001-agentcard-into-status

Conversation

@rhuss
Copy link
Copy Markdown
Contributor

@rhuss rhuss commented May 21, 2026

Review Guide for full context: motivation, key decisions, and scope boundaries.

Introducing Specification-Driven Development (SDD)

This is our first PR using a spec-driven development workflow. The idea is simple: write a specification, get it reviewed, and then implement. This inverts the usual flow where design decisions are discovered (and debated) during code review.

Why SDD? Design issues caught during code review are expensive. The code is already written, tests are passing, and the author is invested. Feedback at that stage feels like rework. By reviewing the design first, we align on scope, technical approach, and key decisions before anyone writes a line of Go. Code review then focuses on correctness and quality, not "should we have done this differently?"

What an SDD PR looks like:

  • Phase 1 (now): The PR contains only spec artifacts (spec, plan, tasks, review guide). Review these for scope, approach, and missing requirements. Leave comments on the spec, not on hypothetical code.
  • Phase 2 (after approval): Implementation commits land on this same branch. The spec artifacts serve as the contract. Code review checks the implementation against the approved spec.
  • Merge: When both spec and implementation are approved.

Do not merge this PR yet. We are in Phase 1.

What to review

Focus on decisions and scope, not implementation details (those do not exist yet):

  1. spec.md: 4 user stories covering card discovery (P1), mTLS verification (P1), deprecation warnings (P2), and feature gating (P2). Are the requirements correct? Is anything missing or over-scoped?
  2. plan.md: Technical approach. Does extending the existing AgentRuntime controller make sense? Are the reuse decisions (fetcher interfaces, PR ✨ feat: Add mTLS runtime identity verification for AgentCards #284 mTLS infrastructure) sound?
  3. tasks.md: 22 tasks across 7 phases. Is the phasing and dependency ordering reasonable? Any tasks missing?
  4. REVIEWERS.md: Summarizes key decisions and areas needing attention. Start here for a quick overview.

Context

This builds on the consolidation path outlined in issue #371. The key insight: AgentCard is a pure read-only CR whose content is entirely controller-managed, with no room for admin-authored policy fields. Its data fits naturally into AgentRuntime status.

The mTLS verified fetch path reuses the infrastructure @kevincogan built in PR #284 (mTLS verified fetch for AgentCard, merged 2026-05-20).

Artifacts

File Purpose
spec.md Feature specification (4 user stories, 13 functional requirements)
plan.md Implementation plan (technical context, project structure, file map)
tasks.md 22 implementation tasks across 7 phases
research.md 6 research decisions (service resolution, fetch triggers, reuse strategy)
data-model.md CardStatus struct design and state transitions
contracts/ CRD status field contract with example YAML
REVIEWERS.md Review guide for this PR

Assisted-By: 🤖 Claude Code

rhuss added 2 commits May 21, 2026 09:04
First step of the AgentCard-to-AgentRuntime consolidation
(kagenti-operator#371). Covers card fetch in the AgentRuntime
controller, A2A-compliant status schema, and deprecation path.

Assisted-By: 🤖 Claude Code
Signed-off-by: Roland Huß <rhuss@redhat.com>
Assisted-By: 🤖 Claude Code
Signed-off-by: Roland Huß <rhuss@redhat.com>
Specification, implementation plan, and task breakdown for moving A2A
agent card discovery into the AgentRuntime controller reconcile loop.
Key design: reuse existing fetcher/verifier infrastructure, feature-gated
with --enable-card-discovery, mTLS included via PR kagenti#284 infrastructure.

Assisted-By: 🤖 Claude Code
Signed-off-by: Roland Huß <rhuss@redhat.com>
@rhuss rhuss force-pushed the 001-agentcard-into-status branch from 6d9d5a4 to 33a4905 Compare May 21, 2026 09:24
rhuss added 3 commits May 21, 2026 11:25
The .specify directory contains local spec-kit tooling (templates,
scripts, extensions) that should not be committed. Ignore the entire
directory instead of individual files.

Assisted-By: 🤖 Claude Code
Signed-off-by: Roland Huß <rhuss@redhat.com>
Brainstorm documents are local working notes, not PR artifacts.
The spec.md captures the refined requirements.

Assisted-By: 🤖 Claude Code
Signed-off-by: Roland Huß <rhuss@redhat.com>
The speckit plan reference in CLAUDE.md is a local development aid,
not part of the spec PR.

Assisted-By: 🤖 Claude Code
Signed-off-by: Roland Huß <rhuss@redhat.com>
Copy link
Copy Markdown
Contributor

@r3v5 r3v5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @rhuss for detailed spec with explanations.


**Applies when**:
- The operator is started with `--enable-card-discovery=true`
- An AgentRuntime targets a Deployment whose Pods serve `/.well-known/agent-card.json`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this addressing only deployment ? Shouldn't this be "a Workload" to cover also StatefulSets, Jobs and AgentSandbox ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Generalized to "workload (Deployment, StatefulSet, or Sandbox)" throughout REVIEWERS.md.

Comment thread specs/001-agentcard-into-status/spec.md Outdated

### Session 2026-05-21

- Q: How should the controller discover the Service endpoint for a given AgentRuntime's targetRef Deployment? → A: Selector match: resolve Deployment Pod selector, find Services whose selector matches, use the first match.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted also in REVIEWERS.md - shouldn't this be more general than Deployment ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Generalized all references from "Deployment" to "workload" in spec.md, covering Deployments, StatefulSets, and Sandboxes. Updated clarifications, acceptance scenarios, edge cases, FR-008, and assumptions.

Copy link
Copy Markdown
Contributor

@pdettori pdettori left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Design Review — Blind Spots & Missing Pieces

The SDD approach is excellent and the spec is well-structured. The technical decisions are sound for the read-only discovery use case. However, there are meaningful gaps around what gets lost or deferred that should be addressed before implementation begins.


Must-Address

1. Identity Binding Policy — Where Does It Go?

The current AgentCard has spec.identityBinding with trustDomain (per-agent override) and strict (enforcement trigger). The spec moves card data into AgentRuntime status but explicitly defers "mTLS policy fields" to a future iteration.

Gap: If AgentCard is deprecated, there's no spec-level place for an admin to say "this agent must be bound to trust domain X, and enforce network isolation on failure." The spec should state where identityBinding lands (AgentRuntime.spec? A new PolicyBinding CRD?) and when — even if deferred, the migration path needs to be visible so early adopters know what to expect.

2. Enforcement Story During Coexistence

Today: identityBinding.strict=true + --enforce-network-policies=true → binding failure removes signature-verified label → NetworkPolicy blocks traffic.

The new CardStatus has attestedAgentSpiffeID and validSignature but no enforcement action — it's purely observational. The spec doesn't say:

  • Does NetworkPolicy enforcement still work during the coexistence period?
  • What happens to enforcement when AgentCard is eventually removed?
  • Is enforcement being intentionally severed from card discovery?

Should-Address

3. lastPodTemplateHash Leaks Implementation Into the API

Storing lastPodTemplateHash in CRD status couples the implementation mechanism (hash-based change detection) to the public API surface. If you later switch to generation-based triggers or periodic re-fetch, this field becomes legacy baggage requiring backward compat. Consider an annotation (agent.kagenti.dev/last-card-fetch-hash) instead.

4. Card Fetch Timeout / Reconcile Blocking

The spec doesn't specify timeout or context-cancellation requirements for the HTTP card fetch. A hung agent serving /.well-known/agent-card.json slowly could block the reconcile loop (or cause excessive requeues). Add a non-functional requirement: max fetch duration, behavior on timeout, and whether fetch errors requeue the entire AgentRuntime reconcile.


Consider

5. Signatures — Dead Weight or Future Investment?

Issue #371 argues card signing adds no value beyond mTLS for in-cluster communication, and cross-cluster federation is not on the roadmap. Yet CardStatus carries signatures, validSignature, signatureKeyID, and signatureVerificationDetails. If cross-cluster is out of scope, carrying signature verification in a purely in-cluster status is dead weight. Either drop it or add a user story explaining when it provides value.

6. AgentCardSyncReconciler Fate

The repo has an AgentCardSyncReconciler that auto-creates AgentCards for workloads. The spec doesn't mention what happens to this controller. If AgentCard is being deprecated, the sync controller also needs a deprecation/removal plan.

7. Dynamic Card Changes (Non-Rollout Updates)

The design assumes "agent card content is static for the lifetime of a Pod." This is an assumption, not an A2A protocol guarantee. Agents that hot-reload skills, register tools dynamically, or update capabilities based on loaded models will have stale status.card until the next rollout. Even if the answer is "not supported," the constraint should be stated explicitly in the spec.


What This Gets Right

  • One CR per agent — correct direction, genuine maintenance burden reduction
  • Reusing PR #284 mTLS infrastructure — no rewrites
  • Feature flag disabled by default — safe rollout
  • Additive status change, no API version bump — backward compatible
  • Pod template hash trigger — efficient
  • Retain stale data on failure — good operator UX

Verdict: REQUEST_CHANGES — the spec needs the identity binding migration path (#1) and enforcement story (#2) addressed before implementation proceeds. These are the one policy concern AgentCard has, and they drive real NetworkPolicy enforcement. Implementing the read-only discovery path without a plan for the enforcement path risks architectural orphans.

Assisted-By: Claude Code

- Add explicit Out of Scope section with migration paths for identity
  binding policy, enforcement actions, AgentCardSyncReconciler, and
  AgentCard CRD removal
- Generalize "Deployment" to "workload" throughout spec and review guide
  to cover StatefulSet and Sandbox
- Add edge cases for fetch timeout (10s) and dynamic card changes
- Add brainstorm document for identity binding migration (Phase 2)
- Restore brainstorm/ in git tracking

Addresses review feedback from @pdettori on PR kagenti#372.

Assisted-By: 🤖 Claude Code
Signed-off-by: Roland Huß <rhuss@redhat.com>
Copy link
Copy Markdown
Contributor Author

@rhuss rhuss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @pdettori. Pushed updates addressing all points:

Must-address

1. Identity binding policy migration path — Added an "Out of Scope (with migration path)" section to the spec. Identity binding is orthogonal to card discovery. It validates whether the agent's SPIFFE ID belongs to a trust domain. The card fetch surfaces the SPIFFE ID, but binding evaluation and enforcement are about the workload, not the card content.

During coexistence, identity binding stays on AgentCard. It moves to AgentRuntime.spec.identity.binding in a follow-up iteration. See brainstorm/02-identity-binding-migration.md for the full migration analysis.

2. Enforcement during coexistence — Covered in the same Out of Scope section. The AgentCard controller keeps handling all enforcement (label propagation, NetworkPolicy). status.card on AgentRuntime is read-only observation. Nothing changes for enforcement until identity binding moves to AgentRuntime.spec.

Should-address

3. lastPodTemplateHash in API surface — Agreed. Will use an annotation instead during implementation.

4. Card fetch timeout — Added edge case: 10-second timeout (reuses DefaultFetchTimeout from the fetcher). Timeout counts as a fetch failure, stale data retained, CardSynced=False.

Consider

5. Signature fieldsattestedAgentSpiffeID pulls its weight for mTLS. The JWS fields (validSignature, signatureKeyID) tag along because we reuse the existing SignatureProvider. If cross-cluster stays out of scope, we can drop them later without breaking anything since they are optional status fields.

6. AgentCardSyncReconciler fate — Added to Out of Scope. Keeps running during coexistence, gets removed with the AgentCard CRD in Phase 4.

7. Dynamic card changes — Added as an explicit constraint: card data reflects the state at the last rollout. No polling. Agents that hot-reload skills will show stale data until the next rollout. That is the tradeoff for not hammering the API server.

Inline comments

8-9. Deployment → workload — Generalized to "workload (Deployment, StatefulSet, or Sandbox)" throughout spec.md and REVIEWERS.md.

rhuss added 2 commits May 21, 2026 19:21
- Move lastPodTemplateHash from CRD status to annotation
- Generalize Deployment references to workload throughout
- Add fetch timeout condition to data model
- Add annotation note to research decisions

Assisted-By: 🤖 Claude Code
Signed-off-by: Roland Huß <rhuss@redhat.com>
Add CardStatus type to AgentRuntimeStatus for storing fetched A2A agent
card data with content hashing, signature verification, and SPIFFE
attestation fields. Implement the AgentCard controller that discovers
card endpoints from workload Services and syncs card data into the
AgentRuntime status. Wire up the controller with a --enable-card-discovery
flag and add comprehensive unit tests.

Assisted-By: 🤖 Claude Code
Signed-off-by: Roland Huß <rhuss@redhat.com>
@rhuss rhuss force-pushed the 001-agentcard-into-status branch from aaf7003 to 33d745e Compare May 21, 2026 19:52
rhuss added 2 commits May 22, 2026 10:19
persistCardFetchAnnotation calls r.Patch() which refreshes the rt object
from the API server, wiping all in-memory status mutations (card data,
CardSynced, TargetResolved, ConfigResolved conditions) that have not yet
been persisted via Status().Update. Save and restore rt.Status across the
Patch call so the caller's subsequent Status().Update persists everything.

Add regression test with a stub fetcher that verifies card data and all
conditions survive the annotation patch.

Assisted-By: 🤖 Claude Code
Signed-off-by: Roland Huß <rhuss@redhat.com>
…principles

Codifies lessons from the status-wipe bug into project governance:
Principle I (Reconciler Status Integrity), Principle III (Controller-Runtime
Safety), and a Controller-Runtime Gotchas reference section. Un-ignores
.specify/memory/constitution.md so the constitution is tracked in git.

Assisted-By: 🤖 Claude Code
Signed-off-by: Roland Huß <rhuss@redhat.com>
@rhuss
Copy link
Copy Markdown
Contributor Author

rhuss commented May 22, 2026

@pdettori I went ahead, implemented the feature and tested it locally on a kind cluster (where I found an issue again, but I updated the constitution.md to prevent this is in the future).

@rhuss
Copy link
Copy Markdown
Contributor Author

rhuss commented May 25, 2026

@pdettori All review feedback has been addressed (see my response from May 21). Since then, the implementation is complete with the card discovery fix and regression test. Could you re-review when you get a chance?

Copy link
Copy Markdown
Contributor

@pdettori pdettori left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All review feedback addressed. Verified each item:

Must-address — both resolved:

  • Identity binding migration path documented (spec Out of Scope section + brainstorm/02-identity-binding-migration.md)
  • Enforcement during coexistence: AgentCard controller continues handling enforcement; status.card is read-only observation

Should-address — both resolved:

  • lastPodTemplateHash moved to annotation (agent.kagenti.dev/last-card-fetch-hash)
  • 10s fetch timeout with defined failure behavior (stale data retained, CardSynced=False)

Consider — addressed:

  • AgentCardSyncReconciler fate documented (keeps running, removed with CRD)
  • Dynamic card changes constraint explicit (reflects last rollout, no polling)
  • Deployment → workload generalized throughout

Implementation includes regression test for the status-wipe bug. CI all green.

Assisted-By: Claude Code

@pdettori pdettori merged commit 4f822d3 into kagenti:main May 25, 2026
15 checks passed
@github-project-automation github-project-automation Bot moved this from New /:ToDo to Done in Kagenti Issue Prioritization May 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants