Consolidate AgentCard Data Into AgentRuntime Status [Spec]#372
Conversation
First step of the AgentCard-to-AgentRuntime consolidation (kagenti-operator#371). Covers card fetch in the AgentRuntime controller, A2A-compliant status schema, and deprecation path. Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>
Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>
Specification, implementation plan, and task breakdown for moving A2A agent card discovery into the AgentRuntime controller reconcile loop. Key design: reuse existing fetcher/verifier infrastructure, feature-gated with --enable-card-discovery, mTLS included via PR kagenti#284 infrastructure. Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>
6d9d5a4 to
33a4905
Compare
The .specify directory contains local spec-kit tooling (templates, scripts, extensions) that should not be committed. Ignore the entire directory instead of individual files. Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>
Brainstorm documents are local working notes, not PR artifacts. The spec.md captures the refined requirements. Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>
The speckit plan reference in CLAUDE.md is a local development aid, not part of the spec PR. Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>
|
|
||
| **Applies when**: | ||
| - The operator is started with `--enable-card-discovery=true` | ||
| - An AgentRuntime targets a Deployment whose Pods serve `/.well-known/agent-card.json` |
There was a problem hiding this comment.
Is this addressing only deployment ? Shouldn't this be "a Workload" to cover also StatefulSets, Jobs and AgentSandbox ?
There was a problem hiding this comment.
Fixed. Generalized to "workload (Deployment, StatefulSet, or Sandbox)" throughout REVIEWERS.md.
|
|
||
| ### Session 2026-05-21 | ||
|
|
||
| - Q: How should the controller discover the Service endpoint for a given AgentRuntime's targetRef Deployment? → A: Selector match: resolve Deployment Pod selector, find Services whose selector matches, use the first match. |
There was a problem hiding this comment.
As noted also in REVIEWERS.md - shouldn't this be more general than Deployment ?
There was a problem hiding this comment.
Fixed. Generalized all references from "Deployment" to "workload" in spec.md, covering Deployments, StatefulSets, and Sandboxes. Updated clarifications, acceptance scenarios, edge cases, FR-008, and assumptions.
pdettori
left a comment
There was a problem hiding this comment.
Design Review — Blind Spots & Missing Pieces
The SDD approach is excellent and the spec is well-structured. The technical decisions are sound for the read-only discovery use case. However, there are meaningful gaps around what gets lost or deferred that should be addressed before implementation begins.
Must-Address
1. Identity Binding Policy — Where Does It Go?
The current AgentCard has spec.identityBinding with trustDomain (per-agent override) and strict (enforcement trigger). The spec moves card data into AgentRuntime status but explicitly defers "mTLS policy fields" to a future iteration.
Gap: If AgentCard is deprecated, there's no spec-level place for an admin to say "this agent must be bound to trust domain X, and enforce network isolation on failure." The spec should state where identityBinding lands (AgentRuntime.spec? A new PolicyBinding CRD?) and when — even if deferred, the migration path needs to be visible so early adopters know what to expect.
2. Enforcement Story During Coexistence
Today: identityBinding.strict=true + --enforce-network-policies=true → binding failure removes signature-verified label → NetworkPolicy blocks traffic.
The new CardStatus has attestedAgentSpiffeID and validSignature but no enforcement action — it's purely observational. The spec doesn't say:
- Does NetworkPolicy enforcement still work during the coexistence period?
- What happens to enforcement when AgentCard is eventually removed?
- Is enforcement being intentionally severed from card discovery?
Should-Address
3. lastPodTemplateHash Leaks Implementation Into the API
Storing lastPodTemplateHash in CRD status couples the implementation mechanism (hash-based change detection) to the public API surface. If you later switch to generation-based triggers or periodic re-fetch, this field becomes legacy baggage requiring backward compat. Consider an annotation (agent.kagenti.dev/last-card-fetch-hash) instead.
4. Card Fetch Timeout / Reconcile Blocking
The spec doesn't specify timeout or context-cancellation requirements for the HTTP card fetch. A hung agent serving /.well-known/agent-card.json slowly could block the reconcile loop (or cause excessive requeues). Add a non-functional requirement: max fetch duration, behavior on timeout, and whether fetch errors requeue the entire AgentRuntime reconcile.
Consider
5. Signatures — Dead Weight or Future Investment?
Issue #371 argues card signing adds no value beyond mTLS for in-cluster communication, and cross-cluster federation is not on the roadmap. Yet CardStatus carries signatures, validSignature, signatureKeyID, and signatureVerificationDetails. If cross-cluster is out of scope, carrying signature verification in a purely in-cluster status is dead weight. Either drop it or add a user story explaining when it provides value.
6. AgentCardSyncReconciler Fate
The repo has an AgentCardSyncReconciler that auto-creates AgentCards for workloads. The spec doesn't mention what happens to this controller. If AgentCard is being deprecated, the sync controller also needs a deprecation/removal plan.
7. Dynamic Card Changes (Non-Rollout Updates)
The design assumes "agent card content is static for the lifetime of a Pod." This is an assumption, not an A2A protocol guarantee. Agents that hot-reload skills, register tools dynamically, or update capabilities based on loaded models will have stale status.card until the next rollout. Even if the answer is "not supported," the constraint should be stated explicitly in the spec.
What This Gets Right
- One CR per agent — correct direction, genuine maintenance burden reduction
- Reusing PR #284 mTLS infrastructure — no rewrites
- Feature flag disabled by default — safe rollout
- Additive status change, no API version bump — backward compatible
- Pod template hash trigger — efficient
- Retain stale data on failure — good operator UX
Verdict: REQUEST_CHANGES — the spec needs the identity binding migration path (#1) and enforcement story (#2) addressed before implementation proceeds. These are the one policy concern AgentCard has, and they drive real NetworkPolicy enforcement. Implementing the read-only discovery path without a plan for the enforcement path risks architectural orphans.
Assisted-By: Claude Code
- Add explicit Out of Scope section with migration paths for identity binding policy, enforcement actions, AgentCardSyncReconciler, and AgentCard CRD removal - Generalize "Deployment" to "workload" throughout spec and review guide to cover StatefulSet and Sandbox - Add edge cases for fetch timeout (10s) and dynamic card changes - Add brainstorm document for identity binding migration (Phase 2) - Restore brainstorm/ in git tracking Addresses review feedback from @pdettori on PR kagenti#372. Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>
There was a problem hiding this comment.
Thanks for the review @pdettori. Pushed updates addressing all points:
Must-address
1. Identity binding policy migration path — Added an "Out of Scope (with migration path)" section to the spec. Identity binding is orthogonal to card discovery. It validates whether the agent's SPIFFE ID belongs to a trust domain. The card fetch surfaces the SPIFFE ID, but binding evaluation and enforcement are about the workload, not the card content.
During coexistence, identity binding stays on AgentCard. It moves to AgentRuntime.spec.identity.binding in a follow-up iteration. See brainstorm/02-identity-binding-migration.md for the full migration analysis.
2. Enforcement during coexistence — Covered in the same Out of Scope section. The AgentCard controller keeps handling all enforcement (label propagation, NetworkPolicy). status.card on AgentRuntime is read-only observation. Nothing changes for enforcement until identity binding moves to AgentRuntime.spec.
Should-address
3. lastPodTemplateHash in API surface — Agreed. Will use an annotation instead during implementation.
4. Card fetch timeout — Added edge case: 10-second timeout (reuses DefaultFetchTimeout from the fetcher). Timeout counts as a fetch failure, stale data retained, CardSynced=False.
Consider
5. Signature fields — attestedAgentSpiffeID pulls its weight for mTLS. The JWS fields (validSignature, signatureKeyID) tag along because we reuse the existing SignatureProvider. If cross-cluster stays out of scope, we can drop them later without breaking anything since they are optional status fields.
6. AgentCardSyncReconciler fate — Added to Out of Scope. Keeps running during coexistence, gets removed with the AgentCard CRD in Phase 4.
7. Dynamic card changes — Added as an explicit constraint: card data reflects the state at the last rollout. No polling. Agents that hot-reload skills will show stale data until the next rollout. That is the tradeoff for not hammering the API server.
Inline comments
8-9. Deployment → workload — Generalized to "workload (Deployment, StatefulSet, or Sandbox)" throughout spec.md and REVIEWERS.md.
- Move lastPodTemplateHash from CRD status to annotation - Generalize Deployment references to workload throughout - Add fetch timeout condition to data model - Add annotation note to research decisions Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>
Add CardStatus type to AgentRuntimeStatus for storing fetched A2A agent card data with content hashing, signature verification, and SPIFFE attestation fields. Implement the AgentCard controller that discovers card endpoints from workload Services and syncs card data into the AgentRuntime status. Wire up the controller with a --enable-card-discovery flag and add comprehensive unit tests. Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>
aaf7003 to
33d745e
Compare
persistCardFetchAnnotation calls r.Patch() which refreshes the rt object from the API server, wiping all in-memory status mutations (card data, CardSynced, TargetResolved, ConfigResolved conditions) that have not yet been persisted via Status().Update. Save and restore rt.Status across the Patch call so the caller's subsequent Status().Update persists everything. Add regression test with a stub fetcher that verifies card data and all conditions survive the annotation patch. Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>
…principles Codifies lessons from the status-wipe bug into project governance: Principle I (Reconciler Status Integrity), Principle III (Controller-Runtime Safety), and a Controller-Runtime Gotchas reference section. Un-ignores .specify/memory/constitution.md so the constitution is tracked in git. Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>
|
@pdettori I went ahead, implemented the feature and tested it locally on a |
|
@pdettori All review feedback has been addressed (see my response from May 21). Since then, the implementation is complete with the card discovery fix and regression test. Could you re-review when you get a chance? |
pdettori
left a comment
There was a problem hiding this comment.
All review feedback addressed. Verified each item:
Must-address — both resolved:
- Identity binding migration path documented (spec Out of Scope section + brainstorm/02-identity-binding-migration.md)
- Enforcement during coexistence: AgentCard controller continues handling enforcement; status.card is read-only observation
Should-address — both resolved:
lastPodTemplateHashmoved to annotation (agent.kagenti.dev/last-card-fetch-hash)- 10s fetch timeout with defined failure behavior (stale data retained, CardSynced=False)
Consider — addressed:
- AgentCardSyncReconciler fate documented (keeps running, removed with CRD)
- Dynamic card changes constraint explicit (reflects last rollout, no polling)
- Deployment → workload generalized throughout
Implementation includes regression test for the status-wipe bug. CI all green.
Assisted-By: Claude Code
Introducing Specification-Driven Development (SDD)
This is our first PR using a spec-driven development workflow. The idea is simple: write a specification, get it reviewed, and then implement. This inverts the usual flow where design decisions are discovered (and debated) during code review.
Why SDD? Design issues caught during code review are expensive. The code is already written, tests are passing, and the author is invested. Feedback at that stage feels like rework. By reviewing the design first, we align on scope, technical approach, and key decisions before anyone writes a line of Go. Code review then focuses on correctness and quality, not "should we have done this differently?"
What an SDD PR looks like:
Do not merge this PR yet. We are in Phase 1.
What to review
Focus on decisions and scope, not implementation details (those do not exist yet):
Context
This builds on the consolidation path outlined in issue #371. The key insight: AgentCard is a pure read-only CR whose content is entirely controller-managed, with no room for admin-authored policy fields. Its data fits naturally into AgentRuntime status.
The mTLS verified fetch path reuses the infrastructure @kevincogan built in PR #284 (mTLS verified fetch for AgentCard, merged 2026-05-20).
Artifacts
spec.mdplan.mdtasks.mdresearch.mddata-model.mdcontracts/REVIEWERS.mdAssisted-By: 🤖 Claude Code