spec: introduce AgentObservation for agent-originated observability records #6

@ravisantoshgudimetla

Problem

Every AIP adopter will have an observation phase before acting. A security scanner observes vulnerabilities. A cost optimizer observes resource waste. A debugging agent observes unhealthy pods. Currently the spec has no home for these observations.

Without a standard, every adopter builds a different CRD with a different schema. The ecosystem fragments — no shared dashboards, no cross-agent analysis, no consistent audit trail linking why an agent acted to what it did.

What was considered and rejected

Extending AuditRecord (add source: agent | control-plane):

  • Breaks the single-writer guarantee: AuditRecord's value to SIEM and compliance auditors is that exactly one principal (the control plane) writes it
  • No enforcement mechanism for the source field — a compromised agent could write source: control-plane
  • Volume mismatch — governance AuditRecords are bounded (~5–10 per AgentRequest); observations are unbounded

scope: observationOnly on AgentRequest:

  • Violates the core Kubernetes API convention: Kind determines semantics, not a field within an object
  • Makes action and target (currently required) conditional on scope — an explicit antipattern in K8s API conventions (conditional required fields)
  • Branches the control plane reconciler everywhere (if scope == observationOnly) — OpsLock, SafetyPolicy evaluation, phase machine all need special-casing
  • Creates a policy bypass surface — observationOnly requests skip SafetyPolicy evaluation

Proposed solution: AgentObservation Kind

A new Kind in the same API group (governance.aip.io/v1alpha1). Agent-written directly — no controller involved. Immutable after creation.

Key design decisions

No controller required. The Kubernetes precedents are Lease (written directly by the holder, no Lease controller) and v1.Event (written directly by controllers, no Event controller). AgentObservation fits the same pattern: the API server validates the schema, RBAC controls who can write, and the agent writes the object once with nothing left to reconcile.

metadata.creationTimestamp is the authoritative timestamp. The API server sets it on creation, so the agent cannot fake it, and no controller-set recordedAt field is needed.

Immutable after creation via CEL validation rule:

x-kubernetes-validations:
  - rule: self == oldSelf
    message: "AgentObservation is immutable after creation"
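For readers less familiar with CEL transition rules, the following is a small Python model of what the rule above is expected to do. It is only an illustration: the function name validate_update is hypothetical, and the real check is performed by the API server, not by agent or controller code. The point it shows is that a rule referencing oldSelf is evaluated on updates only, so creation always passes this particular rule.

```python
IMMUTABLE_MESSAGE = "AgentObservation is immutable after creation"


def validate_update(old_value, new_value):
    """Model of the CEL rule `self == oldSelf` (illustrative, not the real evaluator).

    old_value is None on create, because there is no oldSelf yet and
    transition rules are skipped. On update, any difference is rejected.
    """
    if old_value is None:
        return []  # create: the transition rule is not evaluated
    if new_value != old_value:
        return [IMMUTABLE_MESSAGE]
    return []
```

Which part of the object is actually frozen depends on where the rule is attached in the CRD schema; this sketch simply compares whatever values it is given.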

Cross-referencing via aip.io/correlationID label — not field updates. The agent generates a UUID before acting, sets it on the AgentObservation and as a label on the subsequent AgentRequest. No back-link field on AgentObservation required — one label query retrieves the full incident chain.
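The correlation flow described above could look roughly like this on the agent side. This is a minimal sketch, not part of the proposal: the helper names (new_correlation_id, observation_manifest, request_labels) and the obs- name prefix are hypothetical, and the AgentRequest spec is left out because its fields are defined elsewhere in the spec.

```python
import uuid

API_VERSION = "governance.aip.io/v1alpha1"
CORRELATION_LABEL = "aip.io/correlationID"


def new_correlation_id() -> str:
    """Generate the ID once, before acting. A UUID string is 36 characters,
    which fits within the 63-character limit for label values."""
    return str(uuid.uuid4())


def observation_manifest(correlation_id, agent_identity, event_type, details):
    """Build an AgentObservation manifest carrying the correlation ID
    both as a label (for queries) and in spec (as in the example CR)."""
    return {
        "apiVersion": API_VERSION,
        "kind": "AgentObservation",
        "metadata": {
            "name": f"obs-{correlation_id}",  # hypothetical naming scheme
            "labels": {CORRELATION_LABEL: correlation_id},
        },
        "spec": {
            "agentIdentity": agent_identity,
            "eventType": event_type,
            "correlationID": correlation_id,
            "details": details,
        },
    }


def request_labels(correlation_id):
    """Labels to attach to the subsequent AgentRequest."""
    return {CORRELATION_LABEL: correlation_id}
```

The agent would call new_correlation_id() once per incident and pass the same value to both helpers, so neither object needs to be updated later to link them.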

AgentObservation is NOT visible to SafetyPolicy CEL expressions. Policies only see request.spec.reasoningTrace.* — the agent-attested summary baked into the AgentRequest. traceReference is opaque to the control plane; it is for auditors and tooling only. Allowing policies to JOIN against agent-authored observations would give agent-authored data governance authority, collapsing the trust boundary.

Example CR

apiVersion: governance.aip.io/v1alpha1
kind: AgentObservation
metadata:
  name: diag-abc123
  namespace: production
  creationTimestamp: "2026-03-26T10:00:01Z"  # authoritative, API server set
  labels:
    aip.io/correlationID: diag-abc123
    aip.io/agentIdentity: sre-agent-v2
    aip.io/eventType:     diagnosis
spec:
  agentIdentity: sre-agent-v2
  eventType:     diagnosis    # observation | diagnosis | escalation | signal
  correlationID: diag-abc123
  details:                    # open JSON — same extensibility model as parameters
    rootCause:  OOMKilled
    confidence: 0.91
    alternativesConsidered:
      - action: restart
        selected: true
      - action: "scale out"
        rejected: "memory leak affects all replicas"
    evidenceSources:
      - type: metrics
        ref:  "prometheus://production/container_oom_events"
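A conformance-style check for objects shaped like the example above might be sketched as follows. The eventType values come from the comment in the example; treating agentIdentity and correlationID as required, and requiring the label to match spec.correlationID, are assumptions made for illustration rather than rules stated in this proposal. The function name check_observation is hypothetical.

```python
EVENT_TYPES = {"observation", "diagnosis", "escalation", "signal"}
CORRELATION_LABEL = "aip.io/correlationID"


def check_observation(obj):
    """Return a list of problems found in an AgentObservation dict (empty if none)."""
    problems = []
    if obj.get("kind") != "AgentObservation":
        problems.append("kind must be AgentObservation")

    spec = obj.get("spec", {})
    labels = obj.get("metadata", {}).get("labels", {})

    if not spec.get("agentIdentity"):
        problems.append("spec.agentIdentity is missing")  # assumed required
    if spec.get("eventType") not in EVENT_TYPES:
        problems.append("spec.eventType is not one of the known values")

    correlation_id = spec.get("correlationID")
    if not correlation_id:
        problems.append("spec.correlationID is missing")  # assumed required
    elif labels.get(CORRELATION_LABEL) != correlation_id:
        # Assumed convention: the queryable label mirrors the spec field.
        problems.append("correlationID label does not match spec.correlationID")
    return problems
```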

Querying the full incident chain

kubectl get agentobservations,agentrequests,auditrecords \
  -n production \
  -l aip.io/correlationID=diag-abc123 \
  --sort-by=.metadata.creationTimestamp

Gives the complete story: what the agent observed → what it requested → what the control plane decided → what happened.
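Tooling that already holds the objects in memory (for example, after listing them through a client) could rebuild the same chain with a label filter and a timestamp sort. This is a sketch under that assumption; incident_chain is a hypothetical helper and the objects are plain dicts in the shape the API server returns.

```python
from datetime import datetime

CORRELATION_LABEL = "aip.io/correlationID"


def _created(obj):
    """Parse metadata.creationTimestamp (RFC 3339, UTC with a trailing Z)."""
    timestamp = obj["metadata"]["creationTimestamp"]
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def incident_chain(objects, correlation_id):
    """Return the objects labelled with correlation_id, oldest first."""
    matching = [
        obj for obj in objects
        if obj.get("metadata", {}).get("labels", {}).get(CORRELATION_LABEL) == correlation_id
    ]
    return sorted(matching, key=_created)
```

Sorting on the server-set creationTimestamp keeps the ordering independent of any time value the agent reports itself.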

Spec changes required

  1. New §3.x — AgentObservation Kind: schema, immutability contract, eventType enum, correlationID convention
  2. New §A.x — aip.io/correlationID label standard: agent-generated UUID, propagated on AgentObservation, AgentRequest (as label), and AuditRecord (as label)
  3. §9 JSON Schema — AgentObservation schema
  4. A.4 Conformance Checklist — assertions for AgentObservation immutability and correlationID propagation
  5. Clarify that AgentObservation details are NOT accessible in SafetyPolicy CEL expressions

Trust model summary

| Resource | Author | Trust level |
|---|---|---|
| AgentObservation | Agent | Informational (agent-attested) |
| AgentRequest spec | Agent | Informational (agent-attested) |
| AgentRequest status | Control plane | Authoritative (governance decision) |
| AuditRecord | Control plane | Authoritative (tamper-evident) |

The control plane only touches resources where a governance decision is being made. It has zero involvement with AgentObservation. That is the right separation.
