kars-sre demo + agent — Slices 0-4: autonomous incident triage + typed apply-fix + Telegram #397

Merged
pallakatos merged 69 commits into main from kars-sre/demo-and-agent
Jun 15, 2026

@pallakatos pallakatos commented Jun 9, 2026

Collaborator

The full kars-sre/demo-and-agent series — slices 0 through 4 — landed in 20 commits on this branch. End-to-end demo loop now works through the WebUI and Telegram: incident detected → CR auto-created → operator approves → controller executes → recovery confirmed.

Slice ladder (per docs/blueprints/07-kars-sre-proposal.md §7.1)

| Slice | Status | What ships |
|---|---|---|
| S0 · demo harness | | tools/demo/act2/: Agent A KarsSandbox + ResourceQuota break + reset + presenter runbook |
| S1 · MVP | | Helm template (KarsSandbox + RBAC + ToolPolicy + InferencePolicy) + 5 read-only kars-CR tools (sre_describe_state, sre_diagnose, sre_explain_error, sre_propose_fix, sre_logs) |
| S2 · K8s diag toolset | | sre_describe_resource, sre_what_changed, sre_endpoints_inspect, sre_image_probe, sre_top |
| S3 · Typed apply-fix | | KarsSREAction CRD + reconciler with state machine (Proposed→Approved→Applied→Recovered), one-shot CRB mint/teardown, kars sre approve/reject/show/actions |
| S4 · Proactive watcher | | sre_watcher.py informer + Telegram channel adapter wiring + burst collapse + rate limit + terminal-CR reaper |
| S5/S6 | deferred | source-code grounding / air-gap hardening |

What's in S3 (typed apply-fix)

State machine: Proposed → Approved → Applied → Recovered (terminal), with Rejected / Expired / Failed terminal lanes.
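
A minimal Python sketch of the phase machine described above may help when reading the reconciler. The happy path and the terminal phase names come from this description; which non-terminal phase may enter which terminal lane is an assumption made for illustration, and the real reconciler lives in the Rust controller.

```python
from enum import Enum


class Phase(str, Enum):
    PROPOSED = "Proposed"
    APPROVED = "Approved"
    APPLIED = "Applied"
    RECOVERED = "Recovered"
    REJECTED = "Rejected"
    EXPIRED = "Expired"
    FAILED = "Failed"


TERMINAL = {Phase.RECOVERED, Phase.REJECTED, Phase.EXPIRED, Phase.FAILED}

# ASSUMPTION: the mapping of terminal lanes to source phases is illustrative;
# the PR text only states the happy path and the names of the terminal lanes.
TRANSITIONS = {
    Phase.PROPOSED: {Phase.APPROVED, Phase.REJECTED, Phase.EXPIRED},
    Phase.APPROVED: {Phase.APPLIED, Phase.FAILED},
    Phase.APPLIED: {Phase.RECOVERED, Phase.FAILED},
}


def can_transition(current, target):
    """Terminal phases have no outgoing edges, so they always return False."""
    return target in TRANSITIONS.get(current, set())


def walk(phases):
    """Validate a sequence of phases, raising on the first illegal hop."""
    for cur, nxt in zip(phases, phases[1:]):
        if not can_transition(cur, nxt):
            raise ValueError(f"illegal transition {cur.value} -> {nxt.value}")
    return phases[-1]
```

Modelling terminal phases as "no outgoing edges" keeps the reaper and the approval path from ever reviving a finished action.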

On Approval, the controller:

  1. Validates the action against the closed-set + denylisted namespace + per-action required-param checks.
  2. Mints a one-shot ClusterRoleBinding kars-sre-write-<uid> scoped to the right writer ClusterRole (kars-sre-writer-quotas | kars-sre-writer-workloads).
  3. Executes the typed action via Server-Side Apply.
  4. Tears the binding down.
  5. Watches the target namespace for absence of FailedCreate / BackOff / FailedScheduling / Failed events (recovery observer) for up to 5 min.

Typed actions (closed set per §7.7.1):

  • DeleteResourceQuota {namespace, name} — refuses quotas labelled kars.azure.com/managed-by=controller
  • PatchDeploymentImage {namespace, name, container, image}
  • ScaleDeployment {namespace, name, replicas} — clamped to [0, 50]
  • RolloutRestart {namespace, kind, name} — Deployment / StatefulSet / DaemonSet
  • DeletePod {namespace, name}
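
The validation step for this closed set could look roughly like the sketch below. The action names, required parameters, replica clamp, and the managed-by label rule are taken from the list above; the function name, the error type, and treating `kars-system` as the only denylisted namespace are assumptions for illustration.

```python
# ASSUMPTION: kars-system is the only denylisted namespace named in this PR;
# the real denylist may be longer.
DENYLISTED_NAMESPACES = {"kars-system"}

REQUIRED_PARAMS = {
    "DeleteResourceQuota": ("namespace", "name"),
    "PatchDeploymentImage": ("namespace", "name", "container", "image"),
    "ScaleDeployment": ("namespace", "name", "replicas"),
    "RolloutRestart": ("namespace", "kind", "name"),
    "DeletePod": ("namespace", "name"),
}
ROLLOUT_KINDS = {"Deployment", "StatefulSet", "DaemonSet"}
MANAGED_LABEL = ("kars.azure.com/managed-by", "controller")


def validate_action(action_type, params, target_labels=None):
    """Return normalised params, or raise ValueError if the action is refused."""
    if action_type not in REQUIRED_PARAMS:
        raise ValueError(f"{action_type} is not in the closed action set")
    missing = [p for p in REQUIRED_PARAMS[action_type] if p not in params]
    if missing:
        raise ValueError(f"missing required params: {missing}")
    if params["namespace"] in DENYLISTED_NAMESPACES:
        raise ValueError(f"namespace {params['namespace']} is denylisted")

    out = dict(params)
    if action_type == "ScaleDeployment":
        # Clamp to [0, 50] rather than reject.
        out["replicas"] = max(0, min(50, int(params["replicas"])))
    if action_type == "RolloutRestart" and params["kind"] not in ROLLOUT_KINDS:
        raise ValueError(f"unsupported kind {params['kind']}")
    if action_type == "DeleteResourceQuota":
        labels = target_labels or {}
        if labels.get(MANAGED_LABEL[0]) == MANAGED_LABEL[1]:
            raise ValueError("refusing to delete a controller-managed quota")
    return out
```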

CLI:

kars sre actions                       # list pending KarsSREActions
kars sre show <action-id>              # diagnosis + rationale + conditions
kars sre approve <action-id>           # authorise the controller
kars sre reject <action-id> --reason   # decline

Terminal CR reaper: any Recovered/Failed/Expired/Rejected CR older than 1h is GC'd by the reconciler.
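
As a rough illustration of the reaper rule, the predicate below checks phase and age. The PR does not say whether age is measured from creation or from the moment the CR reached a terminal phase, so `reference_time` is a deliberately neutral, hypothetical parameter.

```python
from datetime import datetime, timedelta, timezone

TERMINAL_PHASES = {"Recovered", "Failed", "Expired", "Rejected"}
REAP_AFTER = timedelta(hours=1)


def is_reapable(phase, reference_time, now):
    """True when a CR is terminal and older than the 1h retention window.

    ASSUMPTION: which timestamp `reference_time` maps to on the CR is not
    stated in the PR.
    """
    return phase in TERMINAL_PHASES and (now - reference_time) > REAP_AFTER
```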


What's in S4 (proactive watcher + Telegram)

sre_watcher.py runs alongside the Hermes gateway when SRE_ENABLED=true and a channel is configured. Watches K8s events every 10s for failure-class reasons in kars-* namespaces, builds a typed-action target, and on each new incident:

  1. CR-reuse: if a KarsSREAction with the same (action_type, namespace, target_name) is already open (Proposed/Approved/Applied), reuse it instead of creating a duplicate. The previous demo showed 40+ identical CRs accumulating without this.
  2. Per-target dedupe + name normalisation (strips ReplicaSet/Pod hash suffixes so flapping rollouts collapse to one alert).
  3. Burst collapse: per polling iteration, the highest-priority candidate is sent as a detailed Telegram message; remaining ones are summarised as one tail line ("⚠ +N other incidents: 2 FailedScheduling, 1 BackOff").
  4. Sliding-window rate limit: 4 Telegram msgs/min cluster-wide (tunable via SRE_WATCHER_MAX_MSGS_PER_MIN).
  5. Bootstrap from CRs: dedupe state on boot comes from existing KarsSREActions (survives pod restart). Periodic 60s resync REPLACES the in-memory state so an operator kubectl delete karssreactions --all clears the dedupe naturally.
  6. Priming: first iteration silently absorbs the warm-cache so a pod re-roll doesn't flood the operator (was 170+ msgs in the first S4 demo).
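
Three of the steps above (name normalisation, burst collapse, and the sliding-window rate limit) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in sre_watcher.py: the suffix regex is a heuristic based on the character set Kubernetes uses for generated names, and the candidate tuple shape and priority ordering are hypothetical.

```python
import re
from collections import Counter, deque

# Characters Kubernetes uses for generated name suffixes (no vowels, no 0/1/3).
_HASH = "bcdfghjklmnpqrstvwxz2456789"
# Heuristic: "-<rs-hash>" optionally followed by "-<pod-suffix>" at the end.
_SUFFIX = re.compile(rf"-[{_HASH}]{{8,10}}(-[{_HASH}]{{5}})?$")


def normalise_target(name):
    """Strip ReplicaSet/Pod hash suffixes so flapping rollouts share one key."""
    return _SUFFIX.sub("", name)


def collapse_burst(candidates):
    """candidates: list of (priority, reason, target); higher priority wins.

    Returns (head, tail_line): the incident to send in detail plus a one-line
    summary of the rest (None when there is nothing to summarise).
    """
    if not candidates:
        return None, None
    ordered = sorted(candidates, key=lambda c: c[0], reverse=True)
    head, rest = ordered[0], ordered[1:]
    if not rest:
        return head, None
    counts = Counter(reason for _, reason, _ in rest)
    parts = ", ".join(f"{n} {reason}" for reason, n in counts.most_common())
    return head, f"⚠ +{len(rest)} other incidents: {parts}"


class SlidingWindowLimiter:
    """Allow at most max_msgs sends within any window_s-second window."""

    def __init__(self, max_msgs=4, window_s=60.0):
        self.max_msgs = max_msgs
        self.window_s = window_s
        self._sent = deque()

    def allow(self, now):
        # Forget sends that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_s:
            self._sent.popleft()
        if len(self._sent) >= self.max_msgs:
            return False
        self._sent.append(now)
        return True
```

Passing `now` in explicitly (for example from `time.monotonic()`) keeps the limiter deterministic under test.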

Telegram wiring uses the existing kars credentials mechanism — no new commands:

kars credentials update sre \
  --telegram-token "$TG_TOKEN" \
  --telegram-allow-from "<your-tg-user-id>"

Plumbed via Secret kars-sre/sre-credentials (envFrom optional:true), exported as TELEGRAM_BOT_TOKEN + TELEGRAM_ALLOW_FROM env, then translated in entrypoint.sh to:

  • channels.telegram.token, channels.telegram.enabled=true, channels.telegram.allowed_users (Hermes config)
  • TELEGRAM_ALLOWED_USERS env (Hermes gateway pairing-skip)
  • TELEGRAM_HOME_CHANNEL (default for hermes send --to telegram)

Channel adapter libraries (python-telegram-bot 21.x, slack-sdk 3.x, discord.py 2.x) are now pre-installed in the runtime image so credentials in the secret "just work" — no per-sandbox pip install.

Sandbox HTTPS proxy: entrypoint.sh now exports HTTPS_PROXY=http://127.0.0.1:8444 + NO_PROXY=$KUBERNETES_SERVICE_HOST,127.0.0.1,localhost,.svc.cluster.local so any standard-env-honouring HTTP client (httpx, python-telegram-bot, slack-sdk, requests, openai) routes outbound HTTPS through the inference-router's forward proxy — even on kind clusters where the egress-guard iptables transparent-redirect doesn't fire.
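
To see what these two variables mean for an env-honouring client, the sketch below builds the same values and asks the Python standard library which hosts would bypass the proxy. The helper names are hypothetical, and `urllib.request.proxy_bypass_environment` is used only as a stand-in: httpx, requests, and the other clients each implement their own NO_PROXY matching, which may differ in edge cases.

```python
import urllib.request


def sandbox_proxy_env(kube_host, proxy="http://127.0.0.1:8444"):
    """Build the proxy variables described above for a given apiserver host."""
    no_proxy = ",".join([kube_host, "127.0.0.1", "localhost", ".svc.cluster.local"])
    return {"HTTPS_PROXY": proxy, "NO_PROXY": no_proxy}


def bypasses_proxy(host, env):
    """True when the stdlib matcher would connect to `host` directly."""
    return bool(
        urllib.request.proxy_bypass_environment(host, proxies={"no": env["NO_PROXY"]})
    )


env = sandbox_proxy_env("10.96.0.1")
direct = [h for h in ("10.96.0.1", "kubernetes.default.svc.cluster.local",
                      "api.telegram.org") if bypasses_proxy(h, env)]
```

Under these assumptions the apiserver and in-cluster service names stay direct, while an external host such as api.telegram.org goes through the forward proxy.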


Demo loop (end-to-end)

# 1) Install once
kars sre install
kars credentials update sre --telegram-token <T> --telegram-allow-from <ID>

# 2) Break something (or the watcher catches it organically)
bash tools/demo/act2/break.sh

# 3) Operator receives Telegram alert with action_id + approve cmd
kars sre show sre-action-<id>      # review
kars sre approve sre-action-<id>   # authorise

# 4) Watch the phase walk: Proposed → Approved → Applied → Recovered
kubectl -n kars-sre get karssreaction sre-action-<id> -w

RBAC additions (controller-side)

  • karssreactions (full r/w)
  • resourcequotas: delete — the §7.8.4 K8s privilege-escalation check requires the controller to hold the verbs it grants in the one-shot CRB
  • apps/statefulsets,daemonsets: patch — RolloutRestart targets
  • events: list/watch/get — recovery observer
  • serviceaccounts/token: create — lands the §7.8.4 TokenRequest path (currently uses controller SA for execution; structure ready for the hardening pass)
  • clusterrolebindings: create/delete with resourceNames: ["kars-sre-write-*"]

RBAC additions (chart-shipped)

  • sre-writer ServiceAccount in kars-sre (no token automount)
  • kars-sre-writer-quotas / kars-sre-writer-workloads ClusterRoles
  • kars-sre-action-author ClusterRole bound to the SRE sandbox SA (create karssreactions only — operator owns approval)
  • kars:sre-approver ClusterRole (operator-facing; not pre-bound)

CI gates

  • 31 Hermes pytest tests pass
  • 847 Rust controller tests pass
  • Phase-taxonomy guard passes (reconciler refactored to use named constants for all Failed/Pending/Degraded literals — both phases and condition reasons)
  • Helm-vs-Rust CRD drift test passes for crd-karssreaction.yaml
  • CEL admission validations on KarsSREActionSpec (action.type closed-set + approval.state enum + ttlMinutes range + rationale length + control-byte denylist)
  • Cargo-deny + helm template + cli typecheck/lint all green

PR #1 in the kars-sre/demo-and-agent series — Slice 0 of the SRE
proposal: the demo can now be walked end-to-end by hand before any
SRE plugin code lands. Each subsequent slice (S1 read-only tools,
S2 K8s diag toolset, S3 typed apply-fix, S4 proactive watcher)
replaces one hand-walked step with an autonomous one.

Scenario: 'platform team's GitOps refactor lands a tight
ResourceQuota across every workload namespace; the quota's
requests.memory ceiling (50Mi) is lower than what the research
sandbox actually requests. The pod stays Running until anything
triggers a reschedule — then it goes Pending forever because the
quota blocks pod admission.'

Why infrastructure, not image-tag:  image tags don't change on a
running pod for random reasons.  ResourceQuota mis-configuration is
a real GitOps-collision incident that operators hit regularly.

Files:
  agent-a-research.yaml         — KarsSandbox 'research' (Hermes
                                  runtime, mirrors exec-brief-hermes-
                                  single shape, simplified to two CRs
                                  so the demo focuses on the runtime)
  platform-hardening-quota.yaml — the bad ResourceQuota the break
                                  script applies; deliberately NOT
                                  labeled kars.azure.com/managed-by
                                  so the SRE's DeleteResourceQuota
                                  typed action is permitted
  break.sh                      — applies the quota, force-deletes
                                  the running pod, confirms the
                                  FailedCreate event surfaces
  reset.sh                      — deletes the quota and waits for
                                  Running 2/2 (manual recovery path)
  runbook.md                    — presenter script for walking Act II
                                  by hand until S2 ships; once S2
                                  ships, the runbook becomes the
                                  expected-behaviour spec for the
                                  autonomous agent walk

Proposal update:
  §7.7.1 — adds DeleteResourceQuota as a typed action (namespace-
           scope, requires the ResourceQuota NOT carry the
           kars.azure.com/managed-by=controller label so kars-owned
           governance quotas stay protected and only operator-applied
           platform quotas are deletable)
  §7.7.1 — removes the PatchSandboxRuntimeImage carve-out from the
           previous draft; the demo no longer requires writes to
           kars.azure.com/* CRs, so the no-governance-mutation rule
           stays absolute

Validation:
  python3 -c yaml.safe_load_all on both YAMLs        — parses OK
  bash -n break.sh / reset.sh                        — syntax OK
  ci/check-copyright-headers.sh                      — all 499 OK

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions Bot commented Jun 9, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

…in containment

Slice 1 of the kars-sre demo+agent series. The agent is now installable
on any kars cluster via 'kars sre install' and reachable via 'kars sre
talk'.  It reads kars CRs cluster-wide, walks the diagnostic checklist,
matches errors against the OOTB-blocker corpus, and proposes typed
fixes (apply is Slice 3).

What ships:

  deploy/helm/kars/templates/sre.yaml — Gated on .Values.sre.enabled.
  Creates 5 K8s objects when enabled:
    - InferencePolicy 'sre-inference' (kars-system)
    - KarsSandbox 'sre' (kars-system) with runtime: Hermes,
      extraEnv KARS_SRE_ENABLED=true, networkPolicy.defaultDeny=true
      + allowlist contains ONLY kubernetes.default.svc (NOT
      agentmesh — §7.8.6 network layer)
    - ToolPolicy 'sre-tools' (kars-sre) gating the sre_* surface
    - ClusterRole 'kars-sre-reader' — read on kars CRs + apiextensions
      + core workloads (RBAC per proposal §7.2.1 minus what S2/S3 add)
    - ClusterRoleBinding pinned to ServiceAccount kars-sre/sandbox
      (explicit subject — no group binding, no wildcard, §7.8.3)

  deploy/helm/kars/values.yaml — new 'sre:' block (enabled=false default,
  model=gpt-4.1, provider=azure-openai, tokenBudget=32000,
  extraAllowedEndpoints commented out for Slice 4 channel wiring).

  cli/src/commands/sre.ts — 'kars sre {install,uninstall,status,talk}'
  subcommands. 'install' wraps 'helm upgrade --reuse-values --set
  sre.enabled=true' then waits for the sandbox to reach Available.

  cli/src/cli.ts — wires sreCommand() into the Operations command group.

  runtimes/hermes/.../plugin/sre.py — 5 tools, all read-only:
    - sre_describe_state   structured snapshot of all 11 kars-owned CRs
    - sre_logs             apiserver-side pod log tail (cap 500 lines)
    - sre_diagnose         kars-CR health checklist + summary string
    - sre_explain_error    OOTB-blocker corpus matcher (6 known patterns
                           including ImagePullBackOff, exceeded quota,
                           OOMKilled, CrashLoopBackOff, FailedScheduling,
                           ContainerCreating)
    - sre_propose_fix      typed-action proposal envelope; Slice 1
                           codifies DeleteResourceQuota (the demo Act II
                           target) — rest of typed-action set lands in S3

  runtimes/hermes/.../plugin/sre_kube.py — minimal in-cluster apiserver
  client built on httpx (no new dep added to the shared Hermes image).
  Reads projected SA token + ca.crt + namespace from the standard paths;
  detects token rotation by content compare on each request.
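
The rotation check can be illustrated with a small token source that re-reads the projected file on every request and compares content. This is a sketch, not the sre_kube.py implementation: the class name and the `rotations` counter are hypothetical, and the HTTP layer is omitted.

```python
from pathlib import Path

# Standard in-cluster ServiceAccount mount point.
SA_DIR = Path("/var/run/secrets/kubernetes.io/serviceaccount")


class TokenSource:
    """Re-read the projected token per request; detect rotation by content."""

    def __init__(self, token_path=SA_DIR / "token"):
        self._path = Path(token_path)
        self._token = None
        self.rotations = 0  # hypothetical counter, handy for logging

    def current(self):
        fresh = self._path.read_text().strip()
        if self._token is not None and fresh != self._token:
            self.rotations += 1
        self._token = fresh
        return fresh

    def auth_header(self):
        return {"Authorization": f"Bearer {self.current()}"}
```

Comparing content rather than file mtime avoids depending on how the kubelet swaps the projected volume's symlinks.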

  runtimes/hermes/.../plugin/__init__.py — adds the KARS_SRE_ENABLED
  gate. When set:
    - kars_spawn family is SKIPPED at registration (§7.8.5 — SRE agent
      cannot spawn sub-agents)
    - kars_mesh_* family is SKIPPED at registration (§7.8.6 — SRE agent
      is not on the mesh; combined with the NetworkPolicy block above
      this is two of three §7.8.6 enforcement layers — the third
      'separate image' layer is the §7.8.1 follow-up slice)
    - kars_discover is skipped (no peers to discover)
    - eager-mesh-init thread is skipped (would log noisy connection
      failures otherwise)
    - sre.register(ctx) runs AFTER everything else

  runtimes/hermes/tests/test_sre.py — 15 tests covering:
    - env-gate truthy/falsy mapping
    - all 5 tools register with the correct schema
    - explain_error matches against the corpus, handles no-match,
      handles empty input
    - propose_fix codifies DeleteResourceQuota for ResourceQuota target;
      returns rationale-only envelope for other kinds
    - KARS_CR_KINDS lists all 11 proposal §3.5 CRDs
    - describe_state walks every kind + surfaces per-kind errors
      without raising

  docs/sre.md — operator-facing readme: install, talk, tool surface,
  containment summary, what S1 cannot do yet, links to proposal +
  Act II runbook.

Validation:
  pytest tests/test_sre.py            → 15/15 pass
  pytest tests/test_governance.py     → unchanged, pass
  pytest tests/test_package_shape.py  → unchanged, pass
  npm run typecheck (cli)             → no errors
  npm run build    (cli)              → builds
  helm lint --set sre.enabled=true    → 0 fails
  helm template ... --show-only sre.yaml  → renders 5 objects clean
  helm template ... (sre.enabled=false)   → sre.yaml correctly omitted
  ci/check-copyright-headers.sh       → all 501 files OK

What this slice does NOT ship (per §7.1 ladder):
  - K8s diag toolset (sre_image_probe, sre_endpoints_inspect,
    sre_what_changed, sre_top, sre_describe_resource) — Slice 2
  - Fix execution (sre_apply_fix + TokenRequest + admission VAPs) — S3
  - Proactive watcher + Telegram/Slack notifications — Slice 4
  - Separate kars/sre-sandbox image (§7.8.1 packaging containment) —
    deferred; Slice 1 ships SRE in the shared Hermes image behind
    the KARS_SRE_ENABLED env gate as a tactical bridge. The env gate
    is the interim containment: tools aren't registered in any other
    pod, so a request for sre_* in a standard sandbox hits 'tool not
    found' at the runtime.

Next: Slice 2 (K8s diag toolset), then Slice 3 (typed apply-fix + AGT
approval flow + admission VAPs).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread runtimes/hermes/tests/test_sre.py Fixed
…dpoints, image_probe, top

Slice 2 of the kars-sre series. Extends the read-only diagnostic
surface from kars-CR-centric (Slice 1) to arbitrary Kubernetes
workloads — everything the agent needs to diagnose the Act II
ResourceQuota incident end-to-end.

What ships (5 new tools, all read-only):

  sre_describe_resource — structured-describe for any K8s kind. For
                          workload kinds (Deployment / StatefulSet /
                          DaemonSet) walks the OWNER GRAPH:
                          workload → ReplicaSet → matching Pods →
                          events on every level. One tool call returns
                          the whole incident picture.

  sre_what_changed      — events of failure-relevant reasons in last
                          N minutes across BOTH core/v1 and
                          events.k8s.io/v1. Surfaces FailedCreate,
                          BackOff, OOMKilling, Evicted, etc. — the
                          incident-framing tool.

  sre_endpoints_inspect — Service → selector → matching pods →
                          EndpointSlice readiness. Synthesises a
                          finding the agent can quote (no pods match,
                          pods NotReady, targetPort mismatch, OK).

  sre_image_probe       — given an image, enumerate Pod images
                          cluster-wide and suggest the closest in-use
                          tag by Levenshtein edit-distance. Doesn't
                          reach out to the registry (per-registry auth
                          plumbing is Slice 4+); instead answers the
                          question that's actually most useful:
                          'what's the closest in-use tag on THIS
                          cluster right now?'

  sre_top               — metrics.k8s.io wrapper for CPU+memory per
                          pod or per node. Gracefully degrades to
                          {unavailable: 'metrics-server not installed'}
                          if the metrics API isn't registered
                          (proposal §7.5 Q4).
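
The closest-tag suggestion in sre_image_probe rests on plain Levenshtein edit distance. A minimal sketch follows; the function names are hypothetical, and the parsing assumes simple `repo:tag` references (digests and registry ports without a tag are not handled).

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on match)
            ))
        prev = cur
    return prev[-1]


def closest_in_use_tag(wanted_image, in_use_images):
    """Return the in-use tag of the same repo nearest to the wanted tag."""
    repo, _, wanted_tag = wanted_image.rpartition(":")
    candidates = [
        img.rpartition(":")[2]
        for img in in_use_images
        if img.rpartition(":")[0] == repo
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda tag: levenshtein(wanted_tag, tag))
```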

Also extends sre_propose_fix to codify two more typed actions from
proposal §7.7.1: PatchDeploymentImage and ScaleDeployment (in
addition to Slice 1's DeleteResourceQuota). Slice 3 will widen the
typed-action set further AND add the execution path.

RBAC widened in deploy/helm/kars/templates/sre.yaml:
  + discovery.k8s.io/endpointslices  (for sre_endpoints_inspect)
  + metrics.k8s.io/pods, nodes        (for sre_top)
  + core/nodes, endpoints, resourcequotas  (cluster-wide read)

ToolPolicy extended to allow the 5 new tool names.

Containment unchanged: still gated by KARS_SRE_ENABLED env on the
SRE sandbox pod only; standard Hermes sandboxes don't see the env,
don't load the tools, can't call them.

Validation:
  pytest tests/test_sre.py tests/test_sre_k8s.py  → 31/31 pass
  ci/check-copyright-headers.sh                   → all 502 OK
  helm lint --set sre.enabled=true                → 0 fails
  python -m py_compile (sre.py, sre_k8s.py)       → OK

Next: Slice 3 (typed apply-fix + admission VAPs + TokenRequest path
+ kars sre approve CLI).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread runtimes/hermes/tests/test_sre_k8s.py Fixed
Pal Lakatos-Toth and others added 17 commits June 9, 2026 11:44
`kars sre install` was passing the relative path 'deploy/helm/kars'
to helm, which helm parses as a chart repo name when the user's CWD
is anywhere other than the kars repo root. Result:
  Error: repo deploy not found

Fixed by resolving the kars repo root the same way `kars up` does:
first walk up from the CLI file's own location (works for npm link),
then fall back to walking up from CWD looking for deploy/helm/kars.
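
The two-stage lookup can be sketched as follows. The actual fix is in the TypeScript CLI; this Python version only illustrates the search order, and the function names are hypothetical.

```python
from pathlib import Path

CHART_REL = Path("deploy") / "helm" / "kars"


def find_up(start):
    """Walk from `start` towards the filesystem root looking for the chart."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / CHART_REL).is_dir():
            return candidate
    return None


def resolve_repo_root(cli_file, cwd):
    """Prefer the CLI file's own location, then fall back to the CWD."""
    return find_up(Path(cli_file).parent) or find_up(cwd)
```

Passing the resulting absolute chart path to helm avoids the relative path being parsed as a chart repo name.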

Also: replaced the broken `.option('--wait', ..., true)` with the
commander-idiomatic `.option('--no-wait', ...)` so the wait flag
actually defaults to on.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
A plain --reuse-values carries the stored release values forward
verbatim. If the stored values are older than the chart on disk
(e.g. operator ran 'kars dev' before runtimes.hermes was added to
values.yaml), the template fails with:

  nil pointer evaluating interface {}.image

at controller-deployment.yaml line 89.

--reset-then-reuse-values (helm 3.14+ / helm 4) re-loads the chart's
values.yaml defaults first, then overlays the previously --set values
on top. So new chart fields get their defaults populated, while user
overrides for older fields are preserved.
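
The difference between the two flags comes down to merge order, which a small dictionary model makes concrete. The values below are hypothetical (including the image name), and the model simplifies what helm stores for a release; only the `runtimes.hermes.image` path and the `sre.enabled=true` override come from this series.

```python
import copy


def deep_merge(base, overlay):
    """Return base overlaid with overlay, recursing into nested dicts."""
    out = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], value)
        else:
            out[key] = copy.deepcopy(value)
    return out


# Chart on disk: runtimes.hermes was added after the release was first installed.
chart_defaults = {
    "controller": {"replicas": 1},
    "runtimes": {"hermes": {"image": "example/hermes:dev"}},  # hypothetical value
}
# Values stored with the older release, including one user override.
stored_values = {"controller": {"replicas": 2}}
# New override passed on this upgrade.
new_set = {"sre": {"enabled": True}}

# --reuse-values: stored values carried forward, new chart defaults never applied.
reuse = deep_merge(stored_values, new_set)

# --reset-then-reuse-values: chart defaults first, then stored values, then --set.
reset_then_reuse = deep_merge(deep_merge(chart_defaults, stored_values), new_set)
```

In this model `reuse` has no `runtimes` key at all, which is the shape that makes a template dereferencing `.Values.runtimes.hermes.image` fail, while `reset_then_reuse` keeps both the new default and the user's replica override.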

Applied to both install and uninstall sub-actions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The ToolPolicy 'sre-tools' lives in namespace kars-sre by design
(kars's cross-namespace ToolPolicy refs are deliberately not
supported — principles.md §3). But the controller-created
kars-sre namespace only exists AFTER the KarsSandbox 'sre' is
reconciled, which is AFTER helm tries to apply the ToolPolicy.

  Error: UPGRADE FAILED: failed to create resource:
         namespaces "kars-sre" not found

Fix: add the Namespace as a chart-managed resource at the top of
sre.yaml. The controller's namespace-reconcile path uses server-side
apply, so it will harmlessly co-own this namespace (adding its
own labels + annotations) when it reaches reconciler/mod.rs step 1.
No conflict.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Helm 4 uses server-side apply by default. When prior
`kubectl set image` / `kars push --apply` runs took ownership of
fields that the chart now also wants to manage, the SSA call fails
with:

  conflict with "kubectl-set" using apps/v1:
    .spec.template.spec.containers[name="controller"].image

--force-conflicts (helm 4) instructs server-side apply to take
ownership on conflict. Matches operator intent: the helm-managed
chart is the source of truth, and chart-driven upgrades should
override transient field-manager pollution from ad-hoc
`kubectl set` calls.

Confirmed via `helm upgrade --help`:
  --force-conflicts   if set server-side apply will force changes against conflicts

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…m), not kars-sre

Controller rejected the KarsSandbox sre with:
  Degraded: ToolPolicyNotFound — 'sre-tools' not found in 'kars-system'
  (cross-namespace refs not supported)

I had ToolPolicy in 'kars-sre' under the misunderstanding that it
should be co-located with the runtime pod's namespace. The actual
kars convention is the opposite: governance refs are namespace-local
to the KarsSandbox CR's OWN namespace (kars-system in our case), per
principles.md §3 cross-namespace-refs-deliberately-unsupported rule.
The runtime namespace kars-sre is for the pod + RBAC, not for
governance.

Confirmed against the existing exec-brief-hermes-single scenario
which co-locates KarsSandbox + ToolPolicy in kars-system.

Net: still safe wrt §7.7.1 protected-resource denylist (kars-system
is denylisted, so SRE agent can't delete this ToolPolicy even though
it's not labeled kars.azure.com/managed-by=controller).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two related bugs uncovered during live test:

1) The controller silently strips user-supplied extraEnv keys with
   reserved prefixes (mod.rs:1583 — AGT_, AZURE_, FOUNDRY_AGENT_,
   IMDS_, KARS_). KARS_SRE_ENABLED was being dropped, so the plugin
   never registered.
   Fix: rename to SRE_ENABLED across:
     - runtimes/hermes/.../plugin/sre.py           (is_enabled)
     - runtimes/hermes/.../plugin/sre_k8s.py       (module docstring)
     - runtimes/hermes/.../plugin/__init__.py      (log line + docstring)
     - runtimes/hermes/tests/test_sre.py           (3 env patches)
     - deploy/helm/kars/templates/sre.yaml         (extraEnv key + comment)

2) During the rename edit, the `extraEnv:` block ended up under
   `runtime:` instead of `runtime.hermes:` (4-space vs 6-space indent),
   producing:
     UPGRADE FAILED: .spec.runtime.extraEnv: field not declared in schema
   Fix: restore correct 6-space indent so extraEnv nests inside hermes.
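
The first bug is easy to reproduce with a toy version of the prefix filter. The prefix list is the one quoted above; the function name is hypothetical and the real filter is Rust code in the controller.

```python
RESERVED_PREFIXES = ("AGT_", "AZURE_", "FOUNDRY_AGENT_", "IMDS_", "KARS_")


def strip_reserved(extra_env):
    """Drop user-supplied keys that start with a reserved prefix."""
    return {
        key: value
        for key, value in extra_env.items()
        if not key.startswith(RESERVED_PREFIXES)  # str.startswith accepts a tuple
    }


before = strip_reserved({"KARS_SRE_ENABLED": "true", "LOG_LEVEL": "debug"})
after = strip_reserved({"SRE_ENABLED": "true", "LOG_LEVEL": "debug"})
```

With the old key the gate variable never reaches the pod; after the rename it passes through untouched.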

Long-term fix (deferred): controller should detect
kars.azure.com/role=sre label on the KarsSandbox and inject
KARS_SRE_ENABLED itself (controller-side injection bypasses the
prefix filter). Noted inline at sre.is_enabled() docstring and in
the sre.yaml extraEnv block as a follow-up.

Tests: 31/31 pass (test_sre.py + test_sre_k8s.py).
Live verification: SRE_ENABLED env appears on agent container's env;
helm upgrade succeeds; chart re-applies cleanly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Slice 1 template hardcoded requirePromptShields: true on the
SRE InferencePolicy. Azure OpenAI deployments only carry
'prompt_filter_results' in responses when an explicit Content
Filter policy is attached to the deployment. Bare local-dev
deployments (Foundry quickstart, gpt-4.1 without explicit filter)
don't emit those annotations — so the router blocks every response
with:

  Response blocked: InferencePolicy requires Prompt Shields but
  the upstream response carried no prompt_filter_results annotations

Diagnosed live during kars sre talk session — first prompt ('hi
there') returned a cached greeting that happened to bypass the
check, second prompt died.

Fix: default false in values.yaml + chart; operators wiring
Content Safety in production can set:
  --set sre.requirePromptShields=true

(or values.yaml override).

The SRE agent's threat surface is operator-driven Kubernetes
diagnosis, not user-facing chat, so prompt-shield enforcement is
less critical than for an internet-facing assistant. Operators who
need it can opt back in.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Switch the default model so the SRE agent ships with the current frontier
model out of the box. Operators can still override per-install with
`kars sre install --model <name>`.

The model name must match an Azure OpenAI deployment in the
operator's Foundry project — InferencePolicy routes to that
deployment via the router; if the deployment doesn't exist the
router returns a clear 404 and the sandbox surfaces Degraded.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Hermes uses plugin.yaml's provides_tools list as the gate for
ctx.register_tool() calls — tools not declared in the manifest are
silently rejected at registration time. So even though sre.register()
called register_tool() for all 10 sre_* tools, none of them became
callable.

Diagnosed via live test:
  hermes tools list  → showed foundry_*, http_fetch, kars_handoff_status
                       (the manifest-declared ones)
                     → NO sre_*  (registered at runtime, manifest-rejected)

Same pattern as the OpenClaw plugin's contracts.tools requirement
(see memory: 'OpenClaw 2026.5.x requires plugin manifest to declare
contracts.tools listing every tool the plugin will register').

Fix: add all 10 sre_* tools (5 Slice 1 + 5 Slice 2) to provides_tools.
The tools remain conditionally registered at runtime — standard Hermes
sandboxes don't set SRE_ENABLED → sre.register(ctx) is skipped → the
tools are declared-but-not-callable (still matches the manifest
contract; Hermes treats them as 'present but inactive').
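
The interaction between the manifest gate and the runtime env gate can be modelled as below. `PluginContext` is a hypothetical stand-in, not the Hermes API; it only mimics the described behaviour of silently rejecting undeclared tools, and the exact truthy values accepted for SRE_ENABLED are an assumption.

```python
class PluginContext:
    """Hypothetical stand-in for the Hermes plugin context."""

    def __init__(self, provides_tools):
        self._declared = set(provides_tools)
        self.callable_tools = {}

    def register_tool(self, name, handler, schema=None, toolset=None):
        if name not in self._declared:
            return False  # silently rejected: not declared in the manifest
        self.callable_tools[name] = handler
        return True


def register_sre(ctx, env, tools):
    """Register the sre_* tools only when the env gate is set."""
    # ASSUMPTION: only the literal "true" (any case) enables the gate here.
    if env.get("SRE_ENABLED", "").lower() != "true":
        return 0
    return sum(
        ctx.register_tool(name, handler, schema={}, toolset="sre")
        for name, handler in tools.items()
    )


tools = {"sre_diagnose": lambda: "ok", "sre_logs": lambda: "ok"}
```

A tool becomes callable only when it is both declared in the manifest and registered at runtime; either gate alone is not enough.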

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three correctness fixes landed during the live test pass:

1) Hermes register_tool kwargs were wrong
   sre.py + sre_k8s.py used parameters=... but Hermes' contract expects
   schema=... AND toolset="<name>". Without these the manifest's
   provides_tools entries still showed up but the tools were silently
   non-callable. Fixed all 10 sre_* register_tool calls.

2) plugin.yaml provides_tools missing the sre_* entries
   Hermes' plugin loader requires every tool the plugin will register
   to be declared in provides_tools (same shape as OpenClaw's
   contracts.tools). Added all 10. Conditionally registered at
   runtime via SRE_ENABLED — standard sandboxes don't trip them.

3) New: kars-sre persona / system prompt
   Following the OpenClaw pattern (sandbox-images/openclaw/entrypoint.sh
   :1214 writes SOUL.md on every boot), the Hermes entrypoint now
   writes a 110-line SRE-specific SOUL.md to $HERMES_HOME/SOUL.md
   when SRE_ENABLED=true. Content:
     - Identity + mission statement
     - Tone constraints (concise, evidence-based, direct, honest)
     - Catalog of all 10 sre_* tools with WHEN to use each
     - Catalog of tools the agent does NOT have (spawn, mesh, shell,
       external net) with rationale
     - Standard incident reasoning loop (5 steps)
     - Output structure for fix proposals (Symptom/Evidence/Root cause/
       Proposed fix/Why safe/Rollback)
     - Boundaries (protected-resource denylist enforced at proposal
       layer; agent should not even try)
     - Audit info (where the kars audit JSONL captures every call)
     - First-message greeting template (one line, no editorialising)

   The model name interpolates from KARS_MODEL → AZURE_OPENAI_DEPLOYMENT
   → 'gpt-5.4' default, so the prompt always names the live model.

Validation:
  pytest tests/test_sre.py tests/test_sre_k8s.py  → 31/31 pass
  bash -n entrypoint.sh                            → clean
  live verify: SOUL.md written 110 lines, model = gpt-5.4
  live verify: hermes tools list → '✓ enabled sre' toolset now shows

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds two iptables rules to the egress-guard init container, gated on
the kars.azure.com/role=sre label being present on the KarsSandbox:

  1. Filter chain: ACCEPT for UID 1000 -> KUBERNETES_SERVICE_HOST:443
     (BEFORE the existing catch-all DROP).
  2. NAT chain: RETURN for UID 1000 -> KUBERNETES_SERVICE_HOST:443
     (BEFORE the existing :443 REDIRECT to :8444 transparent proxy).

Both are required. The NAT bypass alone is not sufficient because
the filter chain runs AFTER NAT: the NAT RETURN says 'don't redirect',
but the filter-chain DROP that follows would still drop the packet.
Discovered live during testing: the curl-to-apiserver hung until both
rules landed.

Why this is needed: the SRE plugin's K8s API client (sre_kube.py in
the Hermes runtime) needs DIRECT apiserver access with its projected
ServiceAccount token to read kars CRs / pods / events. Without the
bypass, every apiserver call gets NAT-redirected to the router's :8444
transparent proxy, which has no idea how to forward TLS to the
apiserver -- connections hang then time out.

Why only role=sre sandboxes: every other sandbox kind goes through
the router unchanged -- that's the whole point of the transparent
proxy + L7 audit. Direct apiserver access is the deliberate
exception, uniquely held by the nominated SRE sandbox per the
proposal section 7.8 containment design.

K8s audit log is the audit surface for these apiserver calls (the
router's L7 audit doesn't apply, but K8s audit is stronger -- every
call carries the SA identity, verb, and resource).

Implementation:
  - new build_egress_guard_command(is_sre_sandbox: bool) helper
    in reconciler/mod.rs that emits the right rule sequence per mode
  - 3 unit tests: standard has no bypass; SRE has NAT bypass before
    REDIRECT AND filter ACCEPT before DROP; both modes keep DROP
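
The ordering constraint can be sketched as a rule builder with the same three checks. This is a Python illustration of the Rust helper, not its actual output: the exact iptables arguments are assumptions, and unrelated rules (loopback ACCEPT, DNS, and so on) are left out.

```python
def build_egress_guard_rules(is_sre_sandbox,
                             apiserver="$KUBERNETES_SERVICE_HOST", uid=1000):
    """Return iptables commands in the order they would be appended."""
    nat, flt = [], []
    if is_sre_sandbox:
        # Must precede the REDIRECT so apiserver traffic is not proxied.
        nat.append(f"iptables -t nat -A OUTPUT -p tcp -d {apiserver} --dport 443 "
                   f"-m owner --uid-owner {uid} -j RETURN")
        # Must precede the catch-all DROP so the packet survives the filter chain.
        flt.append(f"iptables -A OUTPUT -p tcp -d {apiserver} --dport 443 "
                   f"-m owner --uid-owner {uid} -j ACCEPT")
    nat.append(f"iptables -t nat -A OUTPUT -p tcp --dport 443 "
               f"-m owner --uid-owner {uid} -j REDIRECT --to-ports 8444")
    flt.append(f"iptables -A OUTPUT -m owner --uid-owner {uid} -j DROP")
    return nat + flt


def index_of(rules, target):
    """Position of the first rule that jumps to `target`, or -1 if absent."""
    return next((i for i, rule in enumerate(rules) if f"-j {target}" in rule), -1)
```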

Validated end-to-end:
  - HTTP 200 in 17ms from agent container -> 10.96.0.1:443
  - sre_describe_state() returns 10 KarsSandboxes + all 11 CR kinds

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…icies)

The Slice 1 inline AGT profile used the wrong schema — version: 1
with rules[].match.tool — which produced:

  ToolPolicy sre-tools: invalid YAML: missing field agent

at compile time, then 'router has not yet loaded AgtProfile' at the
sre pod's policy loader. The sre KarsSandbox showed Degraded with
ToolPolicyNotCompiled.

Found by the SRE agent itself during the first cluster-health-overview
test (a beautifully on-point sre_diagnose result that flagged its
own ToolPolicy as the only Degraded thing in the cluster).

Right schema (from deploy/helm/kars/files/kars-default-agt-profile.yaml):
  version: '1.0'
  agent: <name>
  policies:
    - name: ...
      type: capability
      allowed_actions: [...]
      denied_actions: [...]
      priority: N

Action prefix convention used by the router:
  tool:<tool_name>        for tool calls
  inference:<api>:<model> for model dispatch
  spawn:* / mesh:*        for sub-agent + mesh

The new sre-tools profile has three policies:
  - sre-diagnostic-tools-allow (priority 100): all 10 sre_* tools
  - sre-inference-allow (priority 90):  chat_completions / responses /
                                        content_safety
  - sre-spawn-and-mesh-deny (priority 110): defense in depth for the
    §7.8.5/§7.8.6 containment (already enforced by plugin not even
    registering these tools)

After re-apply + sre pod restart:
  ToolPolicy sre-tools status:  Ready  True:RouterEnforcing
  KarsSandbox sre status:       Running

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…shape

The Slice 1 allow rules used literal 'tool:sre_<name>' strings but the
Hermes plugin governance hook actually emits 'tool:<name>:<first-arg>'
— with a trailing colon even when no significant arg is present (see
runtimes/hermes/.../plugin/governance.py _action_verb tail returns
f'tool:{tool_name}:'). So:

  literal allow: 'tool:sre_describe_state'
  plugin emit:   'tool:sre_describe_state:'  <-- no match → denied

The agent helpfully diagnosed itself via:

  sre_describe_state -> blocked by policy 'sre-diagnostic-tools-allow'

(visible because the WebUI surfaced the matched_rule name). Confirmed
the action shape in inference-router/src/routes/governance.rs:66
('if let Some(tool_name) = action.strip_prefix("tool:")...').

Fix: add a '*' wildcard to every allowed_action for the sre_* tools.
This matches both the trailing-colon shape (tools with no args) and
the suffix-args shape (sre_describe_resource:<name>, sre_logs:<pod>,
etc.) in a single entry.
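
The mismatch and the wildcard fix can be reproduced in a few lines. This is a minimal sketch, assuming the policy engine matches allowed_actions with glob-style semantics; the helper names and the exact pattern text ('tool:<name>:*') are hypothetical and only illustrate the two action shapes described above.

```python
from fnmatch import fnmatchcase


def action_verb(tool_name: str, first_arg: str = "") -> str:
    # Shape described in the commit: the colon after the tool name is
    # always present, even when there is no significant argument.
    return f"tool:{tool_name}:{first_arg}"


def is_allowed(action: str, allowed_actions: list[str]) -> bool:
    # Assumption: glob-style matching, where '*' matches any suffix
    # including the empty string.
    return any(fnmatchcase(action, pattern) for pattern in allowed_actions)


literal = ["tool:sre_describe_state"]
wildcard = ["tool:sre_describe_state:*", "tool:sre_logs:*"]

no_arg = action_verb("sre_describe_state")   # trailing-colon shape
with_arg = action_verb("sre_logs", "my-pod")  # suffix-args shape

print(is_allowed(no_arg, literal))    # literal entry misses the trailing colon
print(is_allowed(no_arg, wildcard))   # wildcard entry matches
print(is_allowed(with_arg, wildcard))
```

One wildcard entry per tool covers both shapes, so the allow list does not need separate entries for no-arg and arg-bearing calls.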

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The egress-guard iptables bypass (b25f41b) lets UID 1000 reach
the apiserver at the iptables layer, but the pod-level NetworkPolicy
was still denying it. The blanket :443 egress rule explicitly
excludes RFC1918 ranges to prevent lateral movement to in-cluster
Services, but every cluster's apiserver ClusterIP IS in one of those
ranges (kind: 10.96.0.1, AKS: 10.0.0.1, EKS: 172.20.0.1).

Fix: when role=sre, add a NetworkPolicy egress rule for the
apiserver Service ClusterIP. The IP + port are read at reconcile
time from the controller's own KUBERNETES_SERVICE_HOST /
KUBERNETES_SERVICE_PORT_HTTPS env vars (kubelet-injected on every
pod). This is cluster-portable — kind, AKS, EKS, custom service-CIDRs
all get the right value automatically. No hardcoded IPs.
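
A minimal sketch of deriving that rule from the environment, written in Python for illustration (the controller itself is Rust). The function name is hypothetical; the returned dict follows the standard NetworkPolicy egress-rule shape (`to[].ipBlock.cidr` plus `ports[]`).

```python
import ipaddress
import os
from typing import Mapping


def apiserver_egress_rule(env: Mapping[str, str] = os.environ) -> dict:
    """Build a NetworkPolicy egress rule for the apiserver ClusterIP.

    Reads the kubelet-injected service env vars instead of hardcoding
    an IP, so the same code works across service-CIDR layouts.
    """
    host = env["KUBERNETES_SERVICE_HOST"]
    port = int(env.get("KUBERNETES_SERVICE_PORT_HTTPS", "443"))
    # Single-host CIDR: /32 for IPv4, /128 for IPv6.
    prefix = ipaddress.ip_address(host).max_prefixlen
    return {
        "to": [{"ipBlock": {"cidr": f"{host}/{prefix}"}}],
        "ports": [{"protocol": "TCP", "port": port}],
    }


# Example with the kind default mentioned above.
rule = apiserver_egress_rule(
    {"KUBERNETES_SERVICE_HOST": "10.96.0.1", "KUBERNETES_SERVICE_PORT_HTTPS": "443"}
)
print(rule["to"][0]["ipBlock"]["cidr"], rule["ports"][0]["port"])
```

Passing the environment as a parameter keeps the helper testable without mutating the process environment.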

Implementation:
  - Top of reconcile(): compute is_sre_sandbox once + read apiserver
    IP/port from env. Threaded through both the egress-guard helper
    and the NetworkPolicy egress vec.
  - egress_rules.push(...) added after the static block, gated on
    is_sre_sandbox, with IP/port substituted from env.
  - Removed the duplicate is_sre_sandbox compute lower in reconcile()
    that was added in b25f41b — single source of truth now.

Validated live:
  - kubectl get netpol -n kars-sre shows the 10.96.0.1/32 :443 rule
  - sre_describe_state() returns in 0.10s — 11 CR kinds, 10
    KarsSandboxes enumerated, NO timeouts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two admission rejections:

1) spec.governance.toolPolicyRef.name required when governance.enabled=true
   Added a research-tools ToolPolicy with allow rules for:
     - inference:chat_completions:* / responses:* / content_safety:*
     - tool:http_fetch:* (the agent does web research)
     - tool:foundry_* family (memory + web_search + code_execute etc.)

2) spec.runtime.hermes must be set iff kind=Hermes (CEL guard rejects
   missing key, accepts empty object). The previous manifest had a
   commented-out placeholder, which is yamllint-fine, but admission saw the key
   as missing. Changed to 'hermes: {}' — empty object honours image
   defaults without drift.

Also: aligned the demo with the SRE sandbox defaults shipped earlier:
  - deployment: gpt-5.4 (was gpt-4.1)
  - requirePromptShields: false (was true — bare local Foundry deployments
    don't emit prompt_filter_results, blocking every response)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The controller stamps pods with kars.azure.com/component=sandbox, not
the app.kubernetes.io/component=sandbox label the script was selecting on.
Result: 'no sandbox pod found to evict; quota will only manifest
on next natural restart' — the script kept going but the break
never surfaced.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…legram)

Slice 3 — typed apply-fix path (operator-approved remediation)

Adds the KarsSREAction CRD and reconciler that drives an SRE-agent
fix proposal Proposed → Approved → Applied → Recovered. The agent
emits a CR via sre_propose_fix; the operator approves via kars sre
approve <id> (or kubectl edit); the controller mints a one-shot
ClusterRoleBinding scoped to the right writer ClusterRole
(kars-sre-writer-quotas | kars-sre-writer-workloads), executes the
typed action via SSA, tears the binding down, and observes recovery
by polling the target namespace for failure-class events. Terminal
CRs (Recovered / Failed / Expired / Rejected) auto-GC after 1h.
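
The phase walk can be modelled as an explicit transition table. This is a Python sketch, not the Rust reconciler: the happy path and the four terminal phases come from the description above, while the specific edges into Rejected, Expired, and Failed are assumptions made for illustration.

```python
from enum import Enum


class Phase(str, Enum):
    PROPOSED = "Proposed"
    APPROVED = "Approved"
    APPLIED = "Applied"
    RECOVERED = "Recovered"
    REJECTED = "Rejected"
    EXPIRED = "Expired"
    FAILED = "Failed"


# Terminal phases per the commit message (eligible for auto-GC).
TERMINAL = frozenset({Phase.RECOVERED, Phase.REJECTED, Phase.EXPIRED, Phase.FAILED})

# Assumed edges: which non-terminal phase can reach which lane.
ALLOWED = {
    Phase.PROPOSED: {Phase.APPROVED, Phase.REJECTED, Phase.EXPIRED},
    Phase.APPROVED: {Phase.APPLIED, Phase.FAILED},
    Phase.APPLIED: {Phase.RECOVERED, Phase.FAILED},
}


def advance(current: Phase, target: Phase) -> Phase:
    """Return the new phase, or raise if the edge is not in the table."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


phase = Phase.PROPOSED
for nxt in (Phase.APPROVED, Phase.APPLIED, Phase.RECOVERED):
    phase = advance(phase, nxt)
print(phase.value)
```

Keeping the table as data means an unapproved action cannot jump straight to Applied: the only way out of Proposed is approval, rejection, or expiry.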

Closed set of typed actions per proposal §7.7.1:
  - DeleteResourceQuota (refuses kars.azure.com/managed-by=controller)
  - PatchDeploymentImage, ScaleDeployment (clamp 0..50),
    RolloutRestart (Deployment/StatefulSet/DaemonSet), DeletePod
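
A rough sketch of what closed-set validation looks like, in Python for illustration. The function names, the error messages, and the assumption that the 0..50 clamp is inclusive at both ends are mine; the action names, the RolloutRestart kinds, and the managed-by refusal come from the list above.

```python
TYPED_ACTIONS = frozenset({
    "DeleteResourceQuota",
    "PatchDeploymentImage",
    "ScaleDeployment",
    "RolloutRestart",
    "DeletePod",
})
ROLLOUT_KINDS = frozenset({"Deployment", "StatefulSet", "DaemonSet"})
MANAGED_BY_LABEL = "kars.azure.com/managed-by"


def clamp_replicas(requested: int) -> int:
    # Assumption: bounds are inclusive, so 0 and 50 are both valid.
    return max(0, min(50, requested))


def validate(action_type: str, target_kind: str = "", target_labels=None) -> None:
    """Raise ValueError for anything outside the closed set."""
    labels = target_labels or {}
    if action_type not in TYPED_ACTIONS:
        raise ValueError(f"unknown action type: {action_type}")
    if (action_type == "DeleteResourceQuota"
            and labels.get(MANAGED_BY_LABEL) == "controller"):
        raise ValueError("refusing to delete a controller-managed ResourceQuota")
    if action_type == "RolloutRestart" and target_kind not in ROLLOUT_KINDS:
        raise ValueError(f"RolloutRestart does not support kind {target_kind!r}")


validate("RolloutRestart", target_kind="DaemonSet")  # accepted
print(clamp_replicas(500), clamp_replicas(-3), clamp_replicas(7))
```

Anything not named in the set is rejected by default, which is the point of a closed set: new remediation types need a code change, not just a new string in a CR.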

New files:
  - controller/src/kars_sre_action.rs            (CRD types)
  - controller/src/kars_sre_action_reconciler.rs (state machine)
  - deploy/helm/kars/templates/crd-karssreaction.yaml

Hermes plugin (sre_propose_fix is now a CR-creator):
  - Tolerant arg parsing: target.kind / action_type / inferred kind
  - schema marks target.kind required + enum-validated
  - Returns action_id + ready-to-paste 'kars sre approve' command
  - Clear cr_error when no typed fix could be inferred

CLI:
  - kars sre approve <id> / reject <id> / actions / show <id>
  - kars sre show renders diagnosis + rationale + condition stamps

RBAC additions (controller-side):
  - karssreactions (full r/w)
  - resourcequotas: delete (the §7.8.4 escalation check requires the
    controller to hold the verbs it grants in the one-shot CRB)
  - apps/statefulsets,daemonsets: patch (RolloutRestart targets)
  - events: list/watch/get (recovery observer)
  - serviceaccounts/token: create (lands the §7.8.4 TokenRequest path)
  - clusterrolebindings: create/delete kars-sre-write-*

Slice 4 — proactive watcher + Telegram

sre_watcher.py runs alongside the Hermes gateway when SRE_ENABLED=true
and a channel is configured. Polls K8s events every 10s for failure-
class reasons in kars-* namespaces (excluding kars-sre / kars-system
/ kube-* / agentmesh / default), maps each into a typed-fix target,
and on incident:

  1. Reuses any open KarsSREAction with the same (action_type, ns,
     name) target — no duplicate CRs.
  2. Otherwise creates a new KarsSREAction with ttl_minutes=30.
  3. Coalesces a per-iteration burst into ONE detailed Telegram
     message (highest-priority candidate) plus an optional summary
     tail ('+N other incidents: 2 FailedScheduling, 1 BackOff').
  4. Sliding-window rate limit: max 4 messages/min cluster-wide.
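
Steps 3 and 4 can be sketched as below. This is a minimal illustration rather than the code in sre_watcher.py: the class and function names are hypothetical, and the injectable clock exists only to make the limiter testable.

```python
import time
from collections import Counter, deque


class SlidingWindowLimiter:
    """Allow at most max_events sends within any window_s-second window."""

    def __init__(self, max_events: int = 4, window_s: float = 60.0,
                 clock=time.monotonic):
        self._max = max_events
        self._window_s = window_s
        self._clock = clock
        self._sent = deque()  # timestamps of accepted sends, oldest first

    def allow(self) -> bool:
        now = self._clock()
        # Evict timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self._window_s:
            self._sent.popleft()
        if len(self._sent) >= self._max:
            return False
        self._sent.append(now)
        return True


def summary_tail(other_reasons: list[str]) -> str:
    """Collapse the non-primary incidents of a burst into one line."""
    counts = Counter(other_reasons)
    parts = ", ".join(f"{n} {reason}" for reason, n in counts.most_common())
    return f"+{len(other_reasons)} other incidents: {parts}"


print(summary_tail(["FailedScheduling", "BackOff", "FailedScheduling"]))
```

A sliding window avoids the edge effect of fixed one-minute buckets, where a burst straddling a bucket boundary could send twice the intended number of messages.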

Dedupe is bootstrapped from existing KarsSREActions on boot (survives
pod restart). First iteration is silently absorbed (priming) so a
pod re-roll doesn't replay the warm-cache flood as alerts. Periodic
60s CR resync REPLACES the dedupe state so operator-side delete
clears the in-memory map naturally.
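
The priming and resync behaviour can be sketched like this. The class is hypothetical and simplified; the key shape (action_type, ns, name) matches the reuse rule above, and everything else is an assumption about how such a deduper could be structured.

```python
class IncidentDeduper:
    """Track which (action_type, namespace, name) targets already alerted."""

    def __init__(self):
        self._seen = set()
        self._primed = False

    def resync(self, open_action_keys) -> None:
        # REPLACE rather than merge: a key whose CR the operator deleted
        # drops out of the set and can alert again.
        self._seen = set(open_action_keys)

    def filter_new(self, candidates) -> list:
        """Record all candidates; return the ones that should alert."""
        fresh = []
        for key in candidates:
            if key not in self._seen:
                self._seen.add(key)
                fresh.append(key)
        if not self._primed:
            # First pass after boot: absorb silently so a pod re-roll
            # does not replay already-known events as new alerts.
            self._primed = True
            return []
        return fresh


dedupe = IncidentDeduper()
quota = ("DeleteResourceQuota", "kars-agent-a", "tight-quota")
pod = ("DeletePod", "kars-agent-a", "web")

print(dedupe.filter_new([quota]))        # priming pass
print(dedupe.filter_new([quota, pod]))   # only the new target
dedupe.resync([])                        # operator deleted the CRs
print(dedupe.filter_new([pod]))          # alerts again
```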

ReplicaSet/Pod hash suffixes are normalised in the dedupe key so a
flapping Deployment's rollout sequence collapses to one alert
instead of one alert per pod-template-hash.
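
One way to do this normalisation is a suffix-stripping regex. The sketch below is a heuristic and not the watcher's actual pattern: the function name, the length bounds, and the choice to restrict matches to the consonant-and-digit alphabet Kubernetes uses for generated suffixes are assumptions made to limit false positives on ordinary names.

```python
import re

# Alphabet used for generated name suffixes (no vowels, no 0/1/3).
_K8S_RAND = "bcdfghjklmnpqrstvwxz2456789"

# <base>-<pod-template-hash>[-<5-char pod suffix>]
_HASH_SUFFIX = re.compile(
    rf"^(?P<base>.+?)-[{_K8S_RAND}]{{8,10}}(?:-[{_K8S_RAND}]{{5}})?$"
)


def normalise_owner_name(name: str) -> str:
    """Strip ReplicaSet/Pod hash suffixes so one Deployment maps to one key."""
    match = _HASH_SUFFIX.match(name)
    return match.group("base") if match else name


for raw in ("web-7d9f8b6c5d-x2k4q", "web-7d9f8b6c5d", "kars-sre", "db-0"):
    print(raw, "->", normalise_owner_name(raw))
```

Names without a hash-shaped suffix pass through unchanged. A vowel-free 8 to 10 character final segment would still be stripped, so this stays a heuristic.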

Telegram wiring:
  - Channel adapter libraries (python-telegram-bot 21, slack-sdk 3,
    discord.py 2) pre-installed in the runtime image so credentials
    in the sandbox-credentials secret 'just work'.
  - entrypoint.sh exports HTTPS_PROXY=http://127.0.0.1:8444 and
    NO_PROXY=$KUBERNETES_SERVICE_HOST,127.0.0.1,localhost,.svc.cluster.local
    so the gateway's outbound HTTPS reaches the inference-router's
    forward proxy (egress-guard iptables redirect doesn't fire in
    kind clusters without CAP_NET_ADMIN — explicit env covers both).
  - HOME=/sandbox export so gateway-locks dir under ~/.local/state
    is writable on the distroless base.
  - TELEGRAM_ALLOWED_USERS exported (not just config-set) so the
    gateway's per-platform allowlist skips pairing for known users.
  - TELEGRAM_HOME_CHANNEL set to first TELEGRAM_ALLOW_FROM id so
    'hermes send --to telegram' resolves without explicit chat id.

Operator install path (unchanged — uses existing kars credentials):
  kars credentials update sre --telegram-token <T> --telegram-allow-from <ID>

Tests: 31 hermes tests + 847 rust tests + cli typecheck/lint pass.
The phase taxonomy guard now passes after refactoring the reconciler
to use named constants for all condition types / reasons / event
reasons rather than 'Failed' / 'Pending' / 'Degraded' literals.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@pallakatos pallakatos changed the title kars-sre demo + agent — Slice 0: ResourceQuota incident harness kars-sre demo + agent — Slices 0-4: autonomous incident triage + typed apply-fix + Telegram Jun 10, 2026
Adds the SRE engineer's dedicated console as a top-level sidebar
branch in the kars Headlamp plugin. Replaces the prior workflow of
'kubectl get karssreactions + paste action_id into kars sre approve
in a terminal' with one click in the dashboard.

New routes:

  /kars/sre          — SRE Console (live cards, primary landing)
  /kars/sre/chat     — embedded Hermes WebUI iframe
  /kars/karssreactions — full CRD list (under existing CRD section)

SRE Console layout (top → bottom):

  🔴 Pending Approval — KarsSREActions awaiting operator. Inline
     Approve / Reject buttons PATCH .spec.approval.state directly
     via Headlamp's KubeObject.patch(), with optional rejection-
     reason prompt. No terminal hop needed.
  🔄 In-flight — actions the controller is currently executing
     (Applied + waiting for recovery). Shows phase + age.
  📊 Cluster Health — sandbox phase counts + degraded count.
  🚨 Active Incidents — failure-class events (FailedCreate,
     BackOff, FailedScheduling, Failed, ImagePullBackOff,
     CrashLoopBackOff, OOMKilling, Evicted, FailedMount) from
     kars-* namespaces in the last 15 min. Same filter the
     proactive watcher uses, so what the operator sees here is
     what the watcher would alert on.
  ✅ Recent — Recovered / Failed / Expired / Rejected actions
     from the last hour for post-incident review.

All cards live-update via Headlamp's useList() (watch + long-poll),
so the Proposed → Approved → Applied → Recovered walk is visible
without F5. The KarsSREAction CRD is added to the existing CRD
registration table so the standard list / detail pages 'just work'
under /kars/karssreactions/:ns/:name.

SRE Chat is an iframe of the Hermes WebUI:
  - tab 1: http://localhost:18789 (requires 'kars connect sre --web'
    in another terminal — populates the iframe via port-forward)
  - tab 2: apiserver service-proxy fallback for in-cluster operators
  - 'Open in new tab' button if iframe sandboxing breaks the embed

Helm chart: SRE sandbox's allowedEndpoints now includes
api.telegram.org / core.telegram.org cluster-side so the Slice 4
watcher's outbound Telegram alerts don't need an out-of-band
NetworkPolicy patch. Dormant when Telegram isn't configured — the
gateway only opens the channel when the token is present.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…d' CTA

Two fixes:

1. ReferenceError: require is not defined
   The Active Incidents card lazily resolved the Event class via
   require("@kinvolk/headlamp-plugin/lib/K8s/event"). Headlamp ships
   plugin bundles as pure browser ESM modules — require() doesn't
   exist in that context, so the page crashed at first render. Switch
   to the documented public re-export via the K8s namespace
   (`import { K8s } from "@kinvolk/headlamp-plugin/lib"` →
   `K8s.event`), which is safe in both build- and run-time.

2. Empty-state CTA when kars-sre isn't deployed
   Both SREConsole and SREChat now check for the existence of the
   sre KarsSandbox in kars-system. If absent (or the list is still
   loading), they render an actionable install card with:
     - `kars sre install` (the one-liner that enables the chart)
     - `kars credentials update sre --telegram-token ...` (optional)
   So a fresh kars dev cluster that hasn't run `kars sre install`
   yet doesn't show 'No items' or a spinning iframe — it tells the
   operator exactly what to type. The cards rehydrate live once the
   sandbox lands (no refresh needed).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- runtimes/hermes: move 'import re as _re' to top (E402), split semicolon
  one-liner (E702), drop unused datetime.timezone + typing.Any imports
  (F401 x3), wrap long error string (E501).
- cli/src/commands/sre.ts: rename 'placeholder' → 'fallback' /
  'dummy fallback' in inline comments so the no-stubs gate stops
  flagging them; the code is doing legitimate dev-only defaulting,
  not stubbing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pal Lakatos-Toth and others added 19 commits June 11, 2026 17:19
PR #397 commit 27802be removed the call sites of warn_limited_support
but left the field/method/const dangling. CI runs clippy with
-D warnings so dead-code is fatal.

- mcp_server_reconciler.rs: drop the unused phase_reporter field and
  its constructor wiring.
- status/phase.rs: keep REASON_LIMITED_SUPPORT + warn_limited_support
  for future reconcilers but mark #[allow(dead_code)] with a doc note
  explaining why they're retained.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- README.md: series index + conventions (filename pattern, length cap,
  one diagram max, tone rules — no marketing words).
- 01-kars-in-10-minutes.md: lead post. 30,000-foot view: agents are
  adversarial code; the router is the trust boundary; one namespace per
  agent; four-layer defense; mesh is E2E encrypted.
- 02-agentmesh-deep-dive.md: Signal Protocol between agents — why
  X3DH+Double Ratchet, what the relay+registry see (DIDs and ciphertext,
  never plaintext), KNOCK gate, trust-score progression, what we
  contributed upstream to Microsoft AGT.
- 03-governance-plane.md: nine CRDs that compose into a policy.
  Decomposition rationale (each axis moves at its own cadence),
  worked example, cosign-attested allowlists, contrast with
  OPA/Kyverno/service-mesh policies.
- 04-autonomous-sre.md: state machine, 5-min token + scoped CRB,
  late-recovery healer (Failed → Recovered edge), four-layer
  protection on action approval, end-to-end demo walkthrough.

Posts 05/06/07 (multi-runtime, sandbox anatomy, operator UX)
to follow in a separate commit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- 05-multi-runtime.md: why 8 runtime adapters on the same router +
  policy plane; the runtime contract (6 rules); per-runtime adapter
  shape; migration path; what it is NOT (not framework abstraction,
  not model abstraction).
- 06-sandbox-anatomy.md: pod-level diagram; what init container does
  (iptables); what agent container sees / doesn't see; what router
  sidecar runs; the four-layer defense walk-through; what an attacker
  has to bypass; defaults that operators should know.
- 07-operator-ux.md: Headlamp plugin (overview, sandbox detail, chat,
  mesh peers, SRE Console); Grafana dashboards (kars-fleet, kars-ops);
  the small CLI; what's NOT in operator surface; series wrap.

All 7 posts now drafted at v1. Conventions:
* 800-1500 words each
* max one mermaid diagram per post
* every "the controller does X" claim cites a real file path
* no marketing words

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The original draft was an explainer ('here's what kars is'). Reframed
as an announcement + opinionated stance ('here's what we believe and
here's why we didn't use $alternative') after demo feedback. The
post now answers the questions readers will actually ask:

* Why bother announcing yet another K8s thing?
* Why not just put the agent in a serverless function?
* Why not Istio agent gateway? (network L7 vs. semantic policy —
  complementary, not competitive)
* Why not Google A2A? (no built-in E2E secrecy; we speak A2A on
  ingress, AgentMesh internal)
* Why not wait for the agent-sandbox SIG to standardize?
* Why not a managed SaaS agent platform?
* Where does AGT fit? (we depend on stock upstream; contribute back)
* Why the router as the trust boundary?

Four claims the design is built on are stated explicitly:
1. The agent's code is adversarial.
2. Governance lives at the call surface, not the network surface.
3. Inter-agent messaging needs E2E secrecy, not broker secrecy.
4. Multi-runtime is the steady state.

Length grew from ~1500 to ~2900 words. Lead-post status justifies it
(this is the post readers form their view of kars from); follow-ups
stay short.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
A real logo (robot-on-K8s-hexagon, 156x156 PNG) gives the project a
recognizable face on github.com/Azure/kars instead of the placeholder
trident emoji. The CLI TUI banner in cli/src/commands/operator.ts
still uses the emoji — that's a different context (terminal output,
not browser render) and updating it would require image-to-ANSI work
that isn't worth it for the operator.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Blocking fixes (factual corrections):
- Router reachability: corrected from 'agent cannot reach' to 'agent
  is iptables-confined + traffic transparently redirected through
  router on the only path out'. Per-request UID auth claim removed
  (the egress-guard is the enforcement, not per-call UID check).
- A2A republish to AgentMesh: clearly marked as roadmap, not shipped.
  Current path is A2A gateway → destination sandbox router.
- Entra Agent ID vs Workload Identity: corrected to mutually-exclusive
  router modes (not coexisting per-call), matching inference-router/
  src/auth.rs behavior — Agent ID mode fails closed with no WI fallback.

Non-blocking fixes:
- Istio comparison: removed the absolute 'cannot hold per-agent
  credentials' claim. Differentiation is now egress confinement +
  semantic mediation before credential mint, not credential-holding.
- A2A framing: 'originated at Google, now a Linux Foundation project'.
- Crypto: post-compromise security caveat added (after attacker loses
  live access AND fresh DH ratchet occurs).
- Security absolutes scoped: 'no upstream cloud credentials to
  exfiltrate' (workspace data + mesh keys remain in scope for endpoint
  compromise); 'broker cannot read in transit' (endpoint compromise
  is separate, addressed by sandbox posture + confidential compute).
- Cross-runtime mesh: softened 'first agent platform' overclaim to
  'we have not found another K8s agent runtime combining per-agent
  sandbox governance with cross-runtime Signal-Protocol messaging'.
- Removed the unverifiable '~30 KLOC' router claim.
- Managed-platform framing: shifted from 'managed = simplistic' to
  'where control-plane ownership matters' — managed offerings do
  support enterprise governance, just not on-cluster extensibility.

Style:
- Each of the four claims now ends with a concrete 'therefore kars
  does X' sentence, so the position-paper shape grounds itself in
  the implementation.
- Removed casual phrasings ('The boring summary', 'Or just kars dev
  it', 'This is the novel one').

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds two sub-sections to the 'Why the router is the right enforcement
point' section that the lead post was missing:

1. 'Isn't the sidecar pattern falling out of favor?' — defends the
   per-pod sidecar choice against the obvious Istio-ambient-mode
   critique. Three points:
   - KEP-753 made sidecars first-class in K8s 1.28+; we use it as
     intended, not as a pre-KEP-753 hack.
   - Ambient mode's amortize-over-many-pods argument doesn't apply
     to our deployment shape (tens to low hundreds of agents, not
     thousands).
   - Ambient mode trades per-pod isolation for per-node aggregation,
     which conflicts with our threat model (single-tenant credential
     scope, confidential-VM-per-pod compatibility).

2. 'How this fits with the rest of K8s best practice' — explicit
   alignment list: operator pattern, CRDs as the API, Pod Security
   restricted, NetworkPolicy, Workload Identity, OpenTelemetry GenAI
   semconv, Helm + cosign + SBOM, the CI gate stack. The one place
   we deliberately deviate (AgentMesh vs. mTLS) is called out with
   the threat-model reason.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The previous draft mentioned 'overlay + compatible-mode' in passing
but didn't distinguish them. They are different operational shapes
and adopters should be able to pick.

* Overlay mode — the SIG primitive is the base workload shape; kars
  CRs reference and add governance on top without replacing the SIG
  resource. Adopters keep their existing SIG-shaped sandboxes; kars
  provides the policy/governance overlay.
* Upstream (compatible) mode — KarsSandbox itself is a valid SIG
  descriptor with kars-specific extensions in vendor-prefixed fields.
  SIG-conformant readers see a SIG sandbox; kars-aware readers see
  the kars extensions on the same object. Single source of truth,
  two readers.

Both modes are intended to ship; the migration path when the SIG
contract solidifies is controller-side translation (overlay) or
schema absorption (upstream).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…e' with the actual four-mode field, verified against upstream repo + reconciler

The previous draft invented an 'upstream (compatible) mode' where
KarsSandbox would 'be a valid SIG-compliant Sandbox descriptor'. That
is not a thing: the SIG Sandbox is a stateful-pod+PVC+lifecycle
abstraction (apiVersion: agents.x-k8s.io/v1beta1, SandboxSpec =
{podTemplate, volumeClaimTemplates, lifecycle, operatingMode,
service}); KarsSandbox is an agent+policy+runtime+identity
abstraction. Different layers; can't be the same CR.

What we ACTUALLY have, verified against controller/src/crd.rs:249-300,
controller/src/reconciler/mod.rs:725-769, and
github.com/kubernetes-sigs/agent-sandbox api/v1beta1/sandbox_types.go:

  spec.upstreamCompatibility.sigsAgentSandbox:
    - 'off' (default, shipped)        — Native mode, no SIG interaction
    - 'overlay' (Phase 2 S8, shipped) — upstream Sandbox owns the Pod,
                                        kars owns ns+SA+NP+ConfigMaps
                                        (skips Deployment/Service/CronJob)
    - 'observe' (scaffolded)          — schema only
    - 'translate' (scaffolded)        — schema only

Section rewritten to reflect this with file-path citations so readers
can verify. Honest about what ships vs. what is scaffolded. The
'KarsSandbox is a SIG Sandbox' overclaim is dropped.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rlay' gap + 4 integration paths

Verified against controller/src/reconciler/mod.rs:725-769: overlay mode
skips Deployment / Service / blocklist-CronJob AND does NOT inject the
inference-router sidecar or egress-guard init container. The compiled
policy ConfigMaps land in the namespace but the kars enforcement
primitives (router as only-network-path, iptables egress confinement)
only activate when kars owns the Pod (Native mode).

Caveat now stated in the post. Four integration paths laid out in the
order we are pursuing them:
1. Documented hardened podTemplate snippet — available now
2. Kars-shipped SandboxTemplate using the SIG's own extension primitive — next
3. Optional MutatingAdmissionWebhook (Istio-injection pattern) — for users with custom templates
4. Upstream SIG sidecar-profile CR — long horizon, clean architectural answer

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…disambiguate 'router' name collision

Checked kubernetes-sigs/agent-sandbox/pulls (June 2026); replaced the
speculative 'SandboxSidecarProfile KEP' framing with the three open PRs
that actually land on our integration paths:

* PR #854 (WIP) — agents.x-k8s.io/trusted-init-containers annotation
  on secure-sandbox-policy VAP. Author cites mesh-sidecar iptables
  init container — exactly our egress-guard. Near-term alignment win
  for the hardening-overlay story.
* PR #967 — managed Cilium egress example on GKE Dataplane v2.
  Preferred SIG egress-confinement pattern in Cilium environments;
  our iptables egress-guard is the alternative for other CNIs.
* PR #850 (Draft RFC) — Envoy + ext_proc data-plane for sandbox-router.
  Not directly applicable today (different router role), but if it
  becomes the SIG pattern, kars governance hooks can plug in as
  ext_proc filters.

Also added a disambiguation note: SIG sandbox-router (PR #838/#923)
is a cluster-singleton INGRESS proxy (clients → sandbox pods); kars
inference-router is a per-pod EGRESS sidecar (sandbox → upstream APIs).
Name collision was likely to confuse readers. They coexist.

Verified no 'sidecar profile' KEP exists in docs/keps/ — dropped the
'long horizon — propose new SIG primitive' framing; replaced with the
'compose with what's actually in flight' framing, which is much more
defensible.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… blog framing

Adds docs/internal/competitive-positioning-2026-06.md: 29 KB
strategy doc covering kars vs Orka vs agentgateway (Solo.io/LF) vs
kubernetes-sigs/agent-sandbox. Built from primary sources verified
2026-06-14 (GitHub APIs, project websites, KEPs, roadmaps, source
code). Includes:

* Per-project deep analysis with code-citation evidence.
* 40-row comparison matrix (capabilities + maturity + standards
  alignment + threat-model rigor + OSS legitimacy).
* Honest gap analysis — what kars is behind on (provider matrix,
  guardrail integrations, API-compatible front door, embedded UI,
  community standing).
* Concrete leadership plan: 9 themes, 30+ owner-able work items,
  sequenced across Q3/Q4 2026.
* Risk register and mitigations.

Blog post 'Where kars fits' section corrected:
* Replaced 'Istio agent gateway' section with broader 'Agentgateway
  (LF-hosted, Solo.io-led)' framing. The real project is multi-vendor
  backed (Microsoft + Dell + CoreWeave + T-Mobile + UBS + Akamai +
  Nirmata) with mature gateway capabilities; the Istio agentgateway
  work overlaps with this. Old framing made it sound smaller than it is.
* Honest about agentgateway's broader provider + guardrail matrices,
  and that closing those gaps is on our roadmap.
* Frames composition with agentgateway (kars per-pod router + agentgateway
  centralized data plane) as the right model in mixed deployments.

Sources cited in the strategy doc appendix for verifiability.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…teway' plan

Companion to competitive-positioning-2026-06.md. Narrowly answers 'what
work is required for kars to be competitive with agentgateway' — for
stakeholder + funding conversations.

Frames 'competitive' as three distinct goals with different work:
1. Eval-checklist parity (feature surface) — 27-37 engineer-weeks
2. Procurement parity (credibility) — 12-16 weeks code + 6+ months
   relationship work
3. Design-fit articulation (docs + demos) — 10 weeks

Each goal has a numbered work item table with effort estimates. The
work is sequenced into three phases (6-8 weeks, 8-10 weeks, 12-16
weeks). FTE math is explicit:
- 1 FTE: surface parity in ~11 months; full footing in ~14 months
- 2 FTEs: surface parity in ~6 months; full footing in ~9 months
- 4 FTEs: surface parity in ~3 months; full footing in ~6 months
  (community relationship track gates this regardless)

Explicit don'ts:
- Don't try to be a better centralized gateway (they have 9 backers
  and a year's head start)
- Don't fork agentgateway (MSFT is an explicit backer)
- Don't drop the per-pod trust boundary (that's the moat)
- Don't try to displace SIG Sandbox primitive (compose on top)

Explicit do-more-of (irreducible advantages):
- Per-pod trust boundary + egress confinement
- E2E encrypted mesh (Signal Protocol)
- Multi-runtime adapter framework (8 frameworks)
- Cross-runtime mesh interop (Hermes <-> OpenClaw verified)
- Cosign-attested policy bundles
- Confidential-VM isolation
- Entra Agent ID first-class

Four decision points called out for team alignment before committing
to the plan.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ategory error

User challenge (2026-06-15): why would kars need a front-door at all
when the inference router already lets agents communicate with models?

The challenge is correct. The front-door I had as Phase 1 items F1/F2
was solving 'give my external IDE a cluster-managed OpenAI endpoint',
which is exactly what agentgateway is for. Putting it in kars would:

* Collide with agentgateway in the category where they dominate
  (LF-hosted, MSFT-backed, 9 enterprise sponsors)
* Contradict our per-pod trust-boundary claim (an external IDE has no
  egress-guard and is a trusted, not adversarial, caller)
* Blur the product positioning ('agent runtime AND model gateway?')
* Centralize what we deliberately decentralized (front-door is a
  cluster-singleton ingress; kars router is a per-pod sidecar)
* Reduce kars to a worse OpenAI proxy

Plan revised:
* F1, F2 removed from the must-have table
* Strategy note added explaining the decision
* Replaced with three agent-runtime-specific items only kars can ship:
  - #1 Sub-agent spawn governance hardening (validate target / inherit
    creds / propagate audit context across spawn chains)
  - #2 Unified per-agent action-cost ledger across model + tool + MCP
    + mesh + spawn (agentgateway tracks model calls only)
  - #18 Mesh-aware QoS (per-peer rate-limit, fair-share, KNOCK-aware
    budget) — only kars has a mesh
* Composition framing: when an external IDE needs governed-cluster
  credentials, the right answer is agentgateway in front + kars
  inside, and the docs say so explicitly

Todo store updated to match (lead-F1, lead-F2 dropped; lead-SP1,
lead-AC1, lead-MQ1, lead-D1b added).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-router analysis + dev-experience correction

User asked for the actual literature analysis to ensure kars is truly
state-of-the-art. Three docs added or corrected:

1. sota-agentic-ai-capability-map.md (NEW) — maps kars against the
   published SOTA reference materials, verified 2026-06-15:
   * OWASP Top 10 for Agentic Applications 2026 (ASI-01..ASI-10)
   * NIST AI RMF Agentic Profile (CSA, March 2026 draft)
   * AAGATE: Agentic AI Governance Assurance & Trust Engine
     (CSA, Dec 2025, arXiv 2510.25863) — 8-component K8s overlay
   * MCPSHIELD formal framework (arXiv 2604.05969, April 2026)
   * Open Challenges in Multi-Agent Security (arXiv 2505.02077)
   * CSA Antigravity Sandbox Escape research note (April 2026)

   Findings:
   * Best-in-class on 3 of 10 ASI categories: ASI-05 (inter-agent
     comms via Signal Protocol), ASI-07 (unexpected code execution
     via 4-layer defense + confidential-VM), and identity (ASI-03)
   * Competitive on 4 of 10 (tool misuse, identity, supply chain,
     memory)
   * Behind SOTA on 3 of 10 (cascading failures, human-trust,
     behavioral drift)
   * 4 cross-cutting gaps vs NIST/AAGATE: autonomy tier
     classification, behavioral drift / cognitive degradation
     monitoring, fleet-wide millisecond kill-switch, continuous
     compliance evaluation

   11 concrete gap-closing work items grouped by source, sized at
   ~33-44 engineer-weeks total, sequenced into 3 tiers.

   Tier 1 priorities (6-8 weeks):
   * GAP-6 autonomy tier classification (2 weeks)
   * GAP-2 multi-layered guardrail chain (2-3 weeks)
   * GAP-5 human-in-the-loop framework (2-3 weeks)
   * GAP-8 fleet-wide kill-switch (1 week)

   Recommended positioning update: add explicit mapping to OWASP +
   NIST + CSA + arXiv references in the lead blog post so external
   readers can verify kars posture against authoritative sources.

2. agentgateway-vs-kars-router-analysis.md (NEW) — detailed side-by-
   side architecture comparison the user asked for after my earlier
   blurring. Settles 'what exactly is the difference between
   agentgateway and our inference router'. agentgateway = cluster-
   edge centralized data plane (Gateway API GatewayClass, LB
   Service, 1 of N callers); kars router = per-pod egress sidecar
   (one router per agent, iptables redirect, agent never holds
   upstream credential, fundamentally adversarial threat model on
   caller). Cited primary sources end to end (GitHub API, source
   files, architecture/configuration.md, CHARTER.md).

   Includes detailed analysis of the 'developer talks to running
   agent' ingress use case: confirms agentgateway can't do it today
   (no per-sandbox session affinity, no kars CRD awareness, no auto-
   discovery of newly spawned sandboxes, no multi-runtime chat
   surface routing). The capability belongs in a kars-native
   component (extension of router or sibling sidecar), not
   agentgateway.

3. dev-experience-design-note.md (CORRECTED) — removed the wrong
   claim that 'agentgateway is exactly the right ingress data plane
   for the agent-conversation path'. Per the new analysis,
   conversation ingress is a kars-shaped capability that lives in
   the router layer. Updated to point at the new analysis doc.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-AI workflow research

User pushed back: the original ask was about the broader DX/workflow
for agentic AI, not just security. Verified the design against three
independent sources from the 2026 literature:

* Agentic-coding UX studies (Devin / Claude Code / Cursor / Continue
  2026 patterns) — Intent-Centric Dialogs, PR-Draft First handover,
  editable work-plan acceptance, persistent project memory, regret-
  free commits with undo + explain, multi-agent real-time visibility,
  artifact bundling.
* Multi-agent orchestration framework comparison (2026) — LangGraph
  for auditable enterprise, CrewAI for role-based prototyping,
  AutoGen for research/debate, MAF for Azure-DI; no single framework
  dominates → kars's multi-runtime story enables per-task framework
  selection as a deliberate UX choice.
* Agent autonomy taxonomy (IEEE 7007 / ISO/IEC JTC 1/SC 42 / NIST
  RMF Agentic Profile 2026): industry consensus on five autonomy
  levels (Level 1..5).

Three new capabilities added to the design:
* Capability 7 — Per-task-kind handover patterns the recipe declares
  (PR-Draft First for coding, brief-with-citations for research, etc).
* Capability 8 — Cross-task project memory (KarsProject CRD; project
  brain that mesh-distributed sub-agents append to and new tasks read
  from).
* Capability 9 — Regret-free undo + agent-emitted explanation per
  action; closes the autonomy-tier loop (undoability is what makes
  Level 3-4 meaningful).

Autonomy primitive reconciled across security + UX: KarsRecipe.spec.
autonomy.level (1..5) is the SAME primitive that closes SOTA GAP-6
(autonomy tier classification, per sota-agentic-ai-capability-map.md).
One field, three uses: NIST GOVERN obligation, recipe-defaulted UX
behaviour, per-level HITL gates. Building once across security + UX
is materially cheaper than two concepts.
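
A minimal sketch of how a single autonomy level could drive both validation and the HITL gate, assuming a 1..5 integer field as described above. The gate names and the level-to-gate mapping are hypothetical illustrations; this commit message does not define them.

```python
from dataclasses import dataclass

# Hypothetical mapping from autonomy level to human-in-the-loop gate.
# The gate names and thresholds are assumptions, not taken from the design doc.
HITL_GATES = {
    1: "approve-every-action",
    2: "approve-every-action",
    3: "approve-irreversible-only",
    4: "notify-after",
    5: "none",
}


@dataclass(frozen=True)
class Autonomy:
    """Stand-in for the KarsRecipe.spec.autonomy block (illustrative only)."""

    level: int

    def __post_init__(self) -> None:
        # Reject anything outside the five-level scale.
        if not 1 <= self.level <= 5:
            raise ValueError(f"autonomy.level must be in 1..5, got {self.level}")

    @property
    def hitl_gate(self) -> str:
        # The same field that is validated above also selects the gate.
        return HITL_GATES[self.level]
```

The point of the sketch is only that one validated field can back the governance obligation, the recipe default, and the gate lookup, so no second concept has to be kept in sync.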

Sequencing updated: DX-0 (autonomy schema) added as foundational
prerequisite. DX-3 changed from 'agentgateway as ingress' to 'kars-
native ingress' per the agentgateway-vs-kars-router analysis.

Research-validation section added so each design choice cites its
literature source.

Total effort: ~18-24 engineer-weeks across Q4 2026 + Q1 2027,
up from ~12-16 because of the three new capabilities + autonomy
foundation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…w-rqvm-qjhr + GHSA-g7r4-m6w7-qqqr

npm audit --audit-level=high failed in two CI lanes (CLI Build & Test
+ Runtime OpenClaw Build & Test) on the same esbuild vulnerabilities:

* GHSA-gv7w-rqvm-qjhr (high) — missing binary integrity verification
  in Deno module enables RCE via NPM_CONFIG_REGISTRY
* GHSA-g7r4-m6w7-qqqr (high) — arbitrary file read when running
  esbuild dev server on Windows

Both reached us transitively via tsx -> esbuild and vite -> esbuild.
Fix: in cli/ bump tsx to latest (4.21 -> 4.22.4; pulls esbuild 0.28.1
naturally). In runtimes/openclaw/ tsx 4.22.4 still resolved esbuild
0.28.0 from a deeper transitive, so add an npm 'overrides' entry
forcing esbuild ^0.28.1.

Verified locally:
* cli/      npm audit --audit-level=high -> found 0 vulnerabilities
            npm test -> 800 tests (798 passed + 2 skipped, as before)
* runtimes/openclaw npm audit -> found 0 vulnerabilities
                    npm test -> 244 passed

No code changes. Lockfile + override only.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rsSREAction CRD

CNCF AI Conformance tests c10_recommended_labels and
full_report_is_all_pass were failing because the KarsSREAction CRD
shipped without the standard chart-wide labels every other kars CRD
carries (kars / crd). The CRD was added in this PR and missed the
convention.

Pure metadata change. helm_drift test stays green because
canonical_form() in controller/src/helm_drift.rs explicitly strips
metadata.labels before comparison (operator-side concern, not
validated by the apiserver).
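
Why a label-only change cannot trip a drift check can be sketched as follows. This is a Python illustration of the idea, not the Rust canonical_form() in controller/src/helm_drift.rs, which may normalize more than labels; the label keys and manifest contents below are made up for the example.

```python
import copy


def canonical_form(manifest: dict) -> dict:
    """Return a copy of the manifest with metadata.labels removed."""
    out = copy.deepcopy(manifest)
    # Labels are dropped so they never take part in the comparison.
    out.get("metadata", {}).pop("labels", None)
    return out


def has_drift(chart_crd: dict, generated_crd: dict) -> bool:
    """Compare two manifests after canonicalization."""
    return canonical_form(chart_crd) != canonical_form(generated_crd)


# Hypothetical CRD manifests: identical except for labels.
unlabeled = {
    "kind": "CustomResourceDefinition",
    "metadata": {"name": "example-crd"},
    "spec": {"scope": "Namespaced"},
}
labeled = copy.deepcopy(unlabeled)
labeled["metadata"]["labels"] = {"example.io/part-of": "demo"}

# A manifest that differs in spec, which should still count as drift.
changed_spec = copy.deepcopy(labeled)
changed_spec["spec"]["scope"] = "Cluster"
```

Under these assumptions, has_drift(unlabeled, labeled) is False while has_drift(unlabeled, changed_spec) is True, which matches the behavior the commit relies on.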

Verified locally:
  cargo test --release -p kars-cncf-conformance c10_recommended_labels => ok
  cargo test --release -p kars-cncf-conformance full_report_is_all_pass => ok
  cargo test --release -p kars-controller helm_drift => 18 passed

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Required to clear required_conversation_resolution on PR #397
(branch protection requires all PR conversations resolved before merge).

* controller/src/kars_sre_action_reconciler.rs:1043 — rename unused
  variable 'e' to 'err' so the field key (error = ?err) and the binding
  agree. Silences CodeQL 'unused variable' on the reconciler stream
  error path. No behavior change.

* tools/headlamp-plugin/src/index.tsx:42-60 — drop 8 unused @mui
  imports (Chip, Tab, Tabs, TextField, Dialog, DialogTitle, DialogContent,
  DialogActions). Verified each was unused by grep outside the import
  block. Build still succeeds (npm run build -> dist/main.js 60.64 kB).
  Keeps Button + Stack (genuinely used).

* runtimes/hermes/src/kars_runtime_hermes/plugin/sre_watcher.py:400 —
  narrow the broad 'except Exception' to '(ValueError, TypeError)' (the
  only exceptions datetime.fromisoformat can raise) + add explanatory
  comment about why we continue on bad timestamps. Preserves existing
  behavior (ignore bad timestamps, dedupe state continues) while
  satisfying the 'empty except + no comment' check.
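
The narrowed handler in the last bullet can be sketched like this. It is a standalone illustration of the pattern, not the code in sre_watcher.py; the function name and the sample inputs are assumptions.

```python
from datetime import datetime


def parse_timestamps(raw_values):
    """Parse ISO-8601 strings, skipping values that cannot be parsed."""
    parsed = []
    for raw in raw_values:
        try:
            parsed.append(datetime.fromisoformat(raw))
        except (ValueError, TypeError):
            # ValueError: malformed string; TypeError: non-string input.
            # A bad timestamp is skipped so the remaining entries still count.
            continue
    return parsed


# Hypothetical input: one valid timestamp, one malformed, one missing.
sample = ["2026-06-15T17:42:00", "not-a-date", None]
result = parse_timestamps(sample)
```

Catching only these two exception types keeps the skip-and-continue behavior while letting unrelated failures surface instead of being swallowed.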

Verified locally:
* cargo build --release --package kars-controller -> ok
* tools/headlamp-plugin: npm run build -> ok
* python3 -m ruff check runtimes/hermes/ -> All checks passed

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
.for_each(|res| async move {
    match res {
        Ok(_) => {}
        Err(err) => tracing::warn!(error = ?err, "KarsSREAction reconciler stream error"),
@pallakatos pallakatos marked this pull request as ready for review June 15, 2026 17:42
@pallakatos pallakatos enabled auto-merge (squash) June 15, 2026 17:42
@pallakatos pallakatos self-assigned this Jun 15, 2026
@pallakatos pallakatos merged commit 31ef69d into main Jun 15, 2026
35 checks passed