Consolidate AgentCard Data Into AgentRuntime Status [Spec] by rhuss · Pull Request #372 · kagenti/kagenti-operator

rhuss · 2026-05-21T07:45:23Z

Review Guide for full context: motivation, key decisions, and scope boundaries.

Introducing Specification-Driven Development (SDD)

This is our first PR using a spec-driven development workflow. The idea is simple: write a specification, get it reviewed, and then implement. This inverts the usual flow where design decisions are discovered (and debated) during code review.

Why SDD? Design issues caught during code review are expensive. The code is already written, tests are passing, and the author is invested. Feedback at that stage feels like rework. By reviewing the design first, we align on scope, technical approach, and key decisions before anyone writes a line of Go. Code review then focuses on correctness and quality, not "should we have done this differently?"

What an SDD PR looks like:

Phase 1 (now): The PR contains only spec artifacts (spec, plan, tasks, review guide). Review these for scope, approach, and missing requirements. Leave comments on the spec, not on hypothetical code.
Phase 2 (after approval): Implementation commits land on this same branch. The spec artifacts serve as the contract. Code review checks the implementation against the approved spec.
Merge: When both spec and implementation are approved.

Do not merge this PR yet. We are in Phase 1.

What to review

Focus on decisions and scope, not implementation details (those do not exist yet):

spec.md: 4 user stories covering card discovery (P1), mTLS verification (P1), deprecation warnings (P2), and feature gating (P2). Are the requirements correct? Is anything missing or over-scoped?
plan.md: Technical approach. Does extending the existing AgentRuntime controller make sense? Are the reuse decisions (fetcher interfaces, PR ✨ feat: Add mTLS runtime identity verification for AgentCards #284 mTLS infrastructure) sound?
tasks.md: 22 tasks across 7 phases. Is the phasing and dependency ordering reasonable? Any tasks missing?
REVIEWERS.md: Summarizes key decisions and areas needing attention. Start here for a quick overview.

Context

This builds on the consolidation path outlined in issue #371. The key insight: AgentCard is a pure read-only CR whose content is entirely controller-managed, with no room for admin-authored policy fields. Its data fits naturally into AgentRuntime status.

The mTLS verified fetch path reuses the infrastructure @kevincogan built in PR #284 (mTLS verified fetch for AgentCard, merged 2026-05-20).

Artifacts

File	Purpose
`spec.md`	Feature specification (4 user stories, 13 functional requirements)
`plan.md`	Implementation plan (technical context, project structure, file map)
`tasks.md`	22 implementation tasks across 7 phases
`research.md`	6 research decisions (service resolution, fetch triggers, reuse strategy)
`data-model.md`	CardStatus struct design and state transitions
`contracts/`	CRD status field contract with example YAML
`REVIEWERS.md`	Review guide for this PR

Assisted-By: 🤖 Claude Code

First step of the AgentCard-to-AgentRuntime consolidation (kagenti-operator#371). Covers card fetch in the AgentRuntime controller, A2A-compliant status schema, and deprecation path. Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>

Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>

Specification, implementation plan, and task breakdown for moving A2A agent card discovery into the AgentRuntime controller reconcile loop. Key design: reuse existing fetcher/verifier infrastructure, feature-gated with --enable-card-discovery, mTLS included via PR kagenti#284 infrastructure. Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>

The .specify directory contains local spec-kit tooling (templates, scripts, extensions) that should not be committed. Ignore the entire directory instead of individual files. Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>

Brainstorm documents are local working notes, not PR artifacts. The spec.md captures the refined requirements. Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>

The speckit plan reference in CLAUDE.md is a local development aid, not part of the spec PR. Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>

r3v5

LGTM, thanks @rhuss for detailed spec with explanations.

pdettori · 2026-05-21T13:58:26Z

+
+**Applies when**:
+- The operator is started with `--enable-card-discovery=true`
+- An AgentRuntime targets a Deployment whose Pods serve `/.well-known/agent-card.json`


Is this addressing only deployment ? Shouldn't this be "a Workload" to cover also StatefulSets, Jobs and AgentSandbox ?

Fixed. Generalized to "workload (Deployment, StatefulSet, or Sandbox)" throughout REVIEWERS.md.

pdettori · 2026-05-21T14:00:28Z

+
+### Session 2026-05-21
+
+- Q: How should the controller discover the Service endpoint for a given AgentRuntime's targetRef Deployment? → A: Selector match: resolve Deployment Pod selector, find Services whose selector matches, use the first match.


As noted also in REVIEWERS.md - shouldn't this be more general than Deployment ?

Fixed. Generalized all references from "Deployment" to "workload" in spec.md, covering Deployments, StatefulSets, and Sandboxes. Updated clarifications, acceptance scenarios, edge cases, FR-008, and assumptions.

pdettori

Design Review — Blind Spots & Missing Pieces

The SDD approach is excellent and the spec is well-structured. The technical decisions are sound for the read-only discovery use case. However, there are meaningful gaps around what gets lost or deferred that should be addressed before implementation begins.

Must-Address

1. Identity Binding Policy — Where Does It Go?

The current AgentCard has spec.identityBinding with trustDomain (per-agent override) and strict (enforcement trigger). The spec moves card data into AgentRuntime status but explicitly defers "mTLS policy fields" to a future iteration.

Gap: If AgentCard is deprecated, there's no spec-level place for an admin to say "this agent must be bound to trust domain X, and enforce network isolation on failure." The spec should state where identityBinding lands (AgentRuntime.spec? A new PolicyBinding CRD?) and when — even if deferred, the migration path needs to be visible so early adopters know what to expect.

2. Enforcement Story During Coexistence

Today: identityBinding.strict=true + --enforce-network-policies=true → binding failure removes signature-verified label → NetworkPolicy blocks traffic.

The new CardStatus has attestedAgentSpiffeID and validSignature but no enforcement action — it's purely observational. The spec doesn't say:

Does NetworkPolicy enforcement still work during the coexistence period?
What happens to enforcement when AgentCard is eventually removed?
Is enforcement being intentionally severed from card discovery?

Should-Address

3. `lastPodTemplateHash` Leaks Implementation Into the API

Storing lastPodTemplateHash in CRD status couples the implementation mechanism (hash-based change detection) to the public API surface. If you later switch to generation-based triggers or periodic re-fetch, this field becomes legacy baggage requiring backward compat. Consider an annotation (agent.kagenti.dev/last-card-fetch-hash) instead.

4. Card Fetch Timeout / Reconcile Blocking

The spec doesn't specify timeout or context-cancellation requirements for the HTTP card fetch. A hung agent serving /.well-known/agent-card.json slowly could block the reconcile loop (or cause excessive requeues). Add a non-functional requirement: max fetch duration, behavior on timeout, and whether fetch errors requeue the entire AgentRuntime reconcile.

Consider

5. Signatures — Dead Weight or Future Investment?

Issue #371 argues card signing adds no value beyond mTLS for in-cluster communication, and cross-cluster federation is not on the roadmap. Yet CardStatus carries signatures, validSignature, signatureKeyID, and signatureVerificationDetails. If cross-cluster is out of scope, carrying signature verification in a purely in-cluster status is dead weight. Either drop it or add a user story explaining when it provides value.

6. `AgentCardSyncReconciler` Fate

The repo has an AgentCardSyncReconciler that auto-creates AgentCards for workloads. The spec doesn't mention what happens to this controller. If AgentCard is being deprecated, the sync controller also needs a deprecation/removal plan.

7. Dynamic Card Changes (Non-Rollout Updates)

The design assumes "agent card content is static for the lifetime of a Pod." This is an assumption, not an A2A protocol guarantee. Agents that hot-reload skills, register tools dynamically, or update capabilities based on loaded models will have stale status.card until the next rollout. Even if the answer is "not supported," the constraint should be stated explicitly in the spec.

What This Gets Right

One CR per agent — correct direction, genuine maintenance burden reduction
Reusing PR #284 mTLS infrastructure — no rewrites
Feature flag disabled by default — safe rollout
Additive status change, no API version bump — backward compatible
Pod template hash trigger — efficient
Retain stale data on failure — good operator UX

Verdict: REQUEST_CHANGES — the spec needs the identity binding migration path (#1) and enforcement story (#2) addressed before implementation proceeds. These are the one policy concern AgentCard has, and they drive real NetworkPolicy enforcement. Implementing the read-only discovery path without a plan for the enforcement path risks architectural orphans.

Assisted-By: Claude Code

@pdettori

- Add explicit Out of Scope section with migration paths for identity binding policy, enforcement actions, AgentCardSyncReconciler, and AgentCard CRD removal - Generalize "Deployment" to "workload" throughout spec and review guide to cover StatefulSet and Sandbox - Add edge cases for fetch timeout (10s) and dynamic card changes - Add brainstorm document for identity binding migration (Phase 2) - Restore brainstorm/ in git tracking Addresses review feedback from @pdettori on PR kagenti#372. Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>

rhuss

Thanks for the review @pdettori. Pushed updates addressing all points:

Must-address

1. Identity binding policy migration path — Added an "Out of Scope (with migration path)" section to the spec. Identity binding is orthogonal to card discovery. It validates whether the agent's SPIFFE ID belongs to a trust domain. The card fetch surfaces the SPIFFE ID, but binding evaluation and enforcement are about the workload, not the card content.

During coexistence, identity binding stays on AgentCard. It moves to AgentRuntime.spec.identity.binding in a follow-up iteration. See brainstorm/02-identity-binding-migration.md for the full migration analysis.

2. Enforcement during coexistence — Covered in the same Out of Scope section. The AgentCard controller keeps handling all enforcement (label propagation, NetworkPolicy). status.card on AgentRuntime is read-only observation. Nothing changes for enforcement until identity binding moves to AgentRuntime.spec.

Should-address

3. lastPodTemplateHash in API surface — Agreed. Will use an annotation instead during implementation.

4. Card fetch timeout — Added edge case: 10-second timeout (reuses DefaultFetchTimeout from the fetcher). Timeout counts as a fetch failure, stale data retained, CardSynced=False.

Consider

5. Signature fields — attestedAgentSpiffeID pulls its weight for mTLS. The JWS fields (validSignature, signatureKeyID) tag along because we reuse the existing SignatureProvider. If cross-cluster stays out of scope, we can drop them later without breaking anything since they are optional status fields.

6. AgentCardSyncReconciler fate — Added to Out of Scope. Keeps running during coexistence, gets removed with the AgentCard CRD in Phase 4.

7. Dynamic card changes — Added as an explicit constraint: card data reflects the state at the last rollout. No polling. Agents that hot-reload skills will show stale data until the next rollout. That is the tradeoff for not hammering the API server.

Inline comments

8-9. Deployment → workload — Generalized to "workload (Deployment, StatefulSet, or Sandbox)" throughout spec.md and REVIEWERS.md.

- Move lastPodTemplateHash from CRD status to annotation - Generalize Deployment references to workload throughout - Add fetch timeout condition to data model - Add annotation note to research decisions Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>

Add CardStatus type to AgentRuntimeStatus for storing fetched A2A agent card data with content hashing, signature verification, and SPIFFE attestation fields. Implement the AgentCard controller that discovers card endpoints from workload Services and syncs card data into the AgentRuntime status. Wire up the controller with a --enable-card-discovery flag and add comprehensive unit tests. Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>

persistCardFetchAnnotation calls r.Patch() which refreshes the rt object from the API server, wiping all in-memory status mutations (card data, CardSynced, TargetResolved, ConfigResolved conditions) that have not yet been persisted via Status().Update. Save and restore rt.Status across the Patch call so the caller's subsequent Status().Update persists everything. Add regression test with a stub fetcher that verifies card data and all conditions survive the annotation patch. Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>

…principles Codifies lessons from the status-wipe bug into project governance: Principle I (Reconciler Status Integrity), Principle III (Controller-Runtime Safety), and a Controller-Runtime Gotchas reference section. Un-ignores .specify/memory/constitution.md so the constitution is tracked in git. Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>

rhuss · 2026-05-22T08:54:57Z

@pdettori I went ahead, implemented the feature and tested it locally on a kind cluster (where I found an issue again, but I updated the constitution.md to prevent this is in the future).

rhuss · 2026-05-25T13:41:12Z

@pdettori All review feedback has been addressed (see my response from May 21). Since then, the implementation is complete with the card discovery fix and regression test. Could you re-review when you get a chance?

pdettori

All review feedback addressed. Verified each item:

Must-address — both resolved:

Identity binding migration path documented (spec Out of Scope section + brainstorm/02-identity-binding-migration.md)
Enforcement during coexistence: AgentCard controller continues handling enforcement; status.card is read-only observation

Should-address — both resolved:

lastPodTemplateHash moved to annotation (agent.kagenti.dev/last-card-fetch-hash)
10s fetch timeout with defined failure behavior (stale data retained, CardSynced=False)

Consider — addressed:

AgentCardSyncReconciler fate documented (keeps running, removed with CRD)
Dynamic card changes constraint explicit (reflects last rollout, no polling)
Deployment → workload generalized throughout

Implementation includes regression test for the status-wipe bug. CI all green.

Assisted-By: Claude Code

rhuss added 2 commits May 21, 2026 09:04

Update brainstorm overview

4b01e71

Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>

rhuss requested a review from a team as a code owner May 21, 2026 07:45

rubambiza added this to Kagenti Issue Prioritization May 21, 2026

github-project-automation Bot moved this to Backlog in Kagenti Issue Prioritization May 21, 2026

rhuss force-pushed the 001-agentcard-into-status branch from 6d9d5a4 to 33a4905 Compare May 21, 2026 09:24

rhuss added 3 commits May 21, 2026 11:25

Remove brainstorm directory from tracking

a49542d

Brainstorm documents are local working notes, not PR artifacts. The spec.md captures the refined requirements. Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>

Revert CLAUDE.md speckit pointer

421db64

The speckit plan reference in CLAUDE.md is a local development aid, not part of the spec PR. Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>

r3v5 approved these changes May 21, 2026

View reviewed changes

pdettori reviewed May 21, 2026

View reviewed changes

pdettori requested changes May 21, 2026

View reviewed changes

rhuss commented May 21, 2026

View reviewed changes

rhuss added 2 commits May 21, 2026 19:21

rhuss force-pushed the 001-agentcard-into-status branch from aaf7003 to 33d745e Compare May 21, 2026 19:52

rhuss added 2 commits May 22, 2026 10:19

clawgenti mentioned this pull request May 25, 2026

Weekly Report 2026-05-25 kagenti/kagenti#1662

Closed

kevincogan mentioned this pull request May 25, 2026

Add OCI skill image mounting to AgentRuntime #332

Merged

6 tasks

pdettori approved these changes May 25, 2026

View reviewed changes

pdettori merged commit 4f822d3 into kagenti:main May 25, 2026
15 checks passed

github-project-automation Bot moved this from New /:ToDo to Done in Kagenti Issue Prioritization May 25, 2026

pdettori mentioned this pull request May 27, 2026

Feat: Implementing Sigstore-A2A verification for A2A Agent Cards #355

Open

rhuss mentioned this pull request May 27, 2026

Card Discovery Refinement Alignment [Spec + Impl] #380

Open

clawgenti mentioned this pull request Jun 1, 2026

Weekly Report 2026-06-01 kagenti/kagenti#1756

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consolidate AgentCard Data Into AgentRuntime Status [Spec]#372

Consolidate AgentCard Data Into AgentRuntime Status [Spec]#372
pdettori merged 11 commits into
kagenti:mainfrom
rhuss:001-agentcard-into-status

rhuss commented May 21, 2026 •

edited

Loading

Uh oh!

r3v5 left a comment

Uh oh!

pdettori May 21, 2026

Uh oh!

rhuss May 21, 2026

Uh oh!

pdettori May 21, 2026

Uh oh!

rhuss May 21, 2026

Uh oh!

pdettori left a comment

Uh oh!

rhuss left a comment •

edited

Loading

Uh oh!

rhuss commented May 22, 2026

Uh oh!

rhuss commented May 25, 2026

Uh oh!

pdettori left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants


		### Session 2026-05-21

		- Q: How should the controller discover the Service endpoint for a given AgentRuntime's targetRef Deployment? → A: Selector match: resolve Deployment Pod selector, find Services whose selector matches, use the first match.

Conversation

rhuss commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Introducing Specification-Driven Development (SDD)

What to review

Context

Artifacts

Uh oh!

r3v5 left a comment

Choose a reason for hiding this comment

Uh oh!

pdettori May 21, 2026

Choose a reason for hiding this comment

Uh oh!

rhuss May 21, 2026

Choose a reason for hiding this comment

Uh oh!

pdettori May 21, 2026

Choose a reason for hiding this comment

Uh oh!

rhuss May 21, 2026

Choose a reason for hiding this comment

Uh oh!

pdettori left a comment

Choose a reason for hiding this comment

Design Review — Blind Spots & Missing Pieces

Must-Address

1. Identity Binding Policy — Where Does It Go?

2. Enforcement Story During Coexistence

Should-Address

3. lastPodTemplateHash Leaks Implementation Into the API

4. Card Fetch Timeout / Reconcile Blocking

Consider

5. Signatures — Dead Weight or Future Investment?

6. AgentCardSyncReconciler Fate

7. Dynamic Card Changes (Non-Rollout Updates)

What This Gets Right

Uh oh!

rhuss left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Must-address

Should-address

Consider

Inline comments

Uh oh!

rhuss commented May 22, 2026

Uh oh!

rhuss commented May 25, 2026

Uh oh!

pdettori left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

rhuss commented May 21, 2026 •

edited

Loading

3. `lastPodTemplateHash` Leaks Implementation Into the API

6. `AgentCardSyncReconciler` Fate

rhuss left a comment •

edited

Loading