Azure AI Agents: add next-step guidance and doctor diagnostics#8198
Azure AI Agents: add next-step guidance and doctor diagnostics#8198antriksh30 wants to merge 85 commits into
Conversation
401ca65 to
829063a
Compare
jongio
left a comment
There was a problem hiding this comment.
Solid foundation for the doctor + nextstep subsystems - clean module boundaries, interface-driven testability, and well-structured check pipeline. A few things to sort out before this leaves draft:
invoke.go:541-543 is the one I'd prioritize - it silently changes CLI semantics for --name. The rest are lower-stakes but worth a look.
The remote.auth doctor check surfaced the raw user principal name in
both the Message string and the structured Details map regardless of
the --unredacted flag, in contrast with the rest of the doctor checks
(checks_rbac.go, checks_agent_identity_roles.go) which already gate
identity values on opts.Unredacted.
This change threads Options into the auth check function and adds two
small helpers that reuse the existing redactedPlaceholder constant:
- redactUPN(upn, unredacted) returns the value to surface in Message
text: raw when --unredacted, the shared <redacted> placeholder when
a UPN was discovered but should be scrubbed, and empty when none
was found so composeAuthMessage cleanly drops the prefix.
- authDetails(upn, minutes, unredacted) builds the Details map and
omits the "upn" key entirely unless --unredacted is set, so
machine consumers do not see the raw value by default.
PASS, WARN, and expired-FAIL branches now compose their messages from
the redacted display value. Existing tests that asserted the raw UPN
were updated to pass Options{Unredacted: true}; new table tests cover
the default-redacted and --unredacted contracts on every branch, the
empty-UPN drop, and both helpers in isolation.
Resolves PR Azure#8198 review comment from @jongio.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
TestErrorCodeWireValues pins the lowercase JSON wire values of every exported enum the nextstep package consumes from the Agents API, but the AgentVersionStatus map was missing the "idle" entry. The idle status is actively read at show.go:207 and nextstep/resolver.go:248, so silent drift on that one literal would regress the show command's idle branch and the resolver's deployment-pending hint without any test failure. Add the missing "idle": string(AgentVersionIdle) case. Resolves PR Azure#8198 Copilot review comments (ids 3246075889, 3246075800). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add the foundation for context-aware `Next:` guidance described in PR Azure#8057: - New `internal/cmd/nextstep` package with `Suggestion`, `State`, `ServiceState`, `AuthState` types and a format-agnostic `PrintNext` writer that aligns commands on the longest entry and caps output at one primary + one secondary line. - Add an `isTerminal(fd uintptr) bool` helper in `internal/cmd/helpers.go` wrapping `golang.org/x/term`; promote that module from indirect to direct in `go.mod`. - Register `nextstep` in the repo cspell dictionary. No callers yet; resolvers, state assembly, and command wiring land in subsequent commits. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… C12)
Implements Check 8 from the doctor remote-checks design as the second
populated entry in the remote chain. The check proves that the
configured Foundry project endpoint is reachable AND that the bearer
token minted by C11 actually authorizes against it — a combined
"can we talk to the project at all" signal that gates every other
remote check (models / agent status / RBAC) from the design's
dependency matrix.
The probe issues a single `GET <endpoint>/agents?api-version=<v>&limit=1`
with a 10s timeout, no retries, using the same credential + scope
as the production runtime path (`agent_context.go:newAgentCredential`
+ `agent_api/operations.go`). The `limit=1` parameter matches the
production agent_api client exactly so a Pass here proves the same
query shape the runtime invoke flow uses (the earlier `$top=1`
choice was a divergence flagged by reviewers).
The status-code response is mapped 1:1 to user-actionable outcomes:
- 200 → Pass: "endpoint reachable (HTTP 200)"
- 401 → Fail: "token expired or scope mismatch" +
suggest `azd auth login` (link: auth login docs)
- 403 → Fail: "wrong tenant or insufficient RBAC" +
suggest re-auth in correct tenant and let
`remote.rbac` flag the role-assignment gap
(deliberately NO `azd auth login` here, but DO
carry a docs Link to the Foundry RBAC quickstart
so the suggestion is actionable — matches the
C11 "every actionable Fail carries Links"
convention)
- 404 → Fail: "endpoint is wrong or project is gone" +
suggest `azd provision` / `azd env set
AZURE_AI_PROJECT_ENDPOINT`
- 5xx → Fail: "service-side error" + suggest retry
- other → Fail: "unexpected HTTP <N>" + verbose hint
- transport err → Fail: "could not reach <host>: <first line>" +
network / VPN / firewall guidance
- ctx canceled → Skip (user aborted)
- 10s elapsed → Fail: "did not respond within 10s" + retry hint
Skip-cascade: `local.project-endpoint-set` AND `remote.auth`. The
former gives us the endpoint to probe; the latter gives us the
token to authenticate with. Skipping when either failed prevents
double-reporting the same root cause. `local.environment-selected`
is transitively cascaded via `local.project-endpoint-set`.
Implementation notes:
- The api-version is read from the package-level
`DefaultAgentAPIVersion` constant by the Cobra wiring in
`cmd/doctor.go` and passed in through `Dependencies.AgentAPIVersion`.
This honors the design's "single source of truth" requirement
while keeping the doctor package self-contained (no import cycle
against `internal/cmd`).
- `makeRealProbeFoundryEndpoint(apiVersion string) func(...)` is
a closure factory rather than a top-level function so the
api-version is captured at construction time without becoming a
global.
- `buildFoundryProbeURL` parses the user-supplied endpoint FIRST,
then mutates `u.Path` (trim-right + "/agents"), clears
`u.Fragment` / `u.RawFragment`, and only then sets RawQuery.
This prevents a stray `?api-version=evil` or `#fragment` in the
endpoint from displacing the `/agents` path segment — a
silent-misdiagnosis bug the prior `endpoint + "/agents"` string
concatenation could trigger. Regression tests now positively
assert `/agents?` is present in every successful build output.
The builder also returns an error for any endpoint that is not
an absolute HTTPS URL with a non-empty host, so a malformed
env value cannot leak a bearer token to the wrong scheme/host.
- A new `validateFoundryEndpoint` helper runs at the check-level
BEFORE the probe is invoked, BEFORE any token is acquired. A
non-HTTPS, relative, or otherwise-malformed
`AZURE_AI_PROJECT_ENDPOINT` surfaces a precise Fail with an
`azd env set AZURE_AI_PROJECT_ENDPOINT <https://...>` suggestion
instead of either a generic transport error (with the token
leak that would have come with it) or the builder's defensive
error wrapped in a less-helpful network-VPN-firewall message.
- Cancellation classification mirrors C11's pattern:
`errors.Is(ctx.Err(), context.Canceled)` → StatusSkip (user
aborted); `errors.Is(probeCtx.Err(), context.DeadlineExceeded)`
→ StatusFail (we hit our own 10s bound, not the parent ctx).
- Multi-line transport errors are reduced to their first line
via the shared `firstLine` helper from C11 so the resulting
Message stays readable.
- The `Details` map carries the endpoint, request URL, and
HTTP status code (when available) for `--output json` consumers
and `--unredacted` debugging. No raw tokens, no response body
excerpts.
- 24 tests cover skip-cascade (env-not-selected, endpoint-not-set,
auth-failed, AgentAPIVersion-missing), every status-code branch,
cancellation vs timeout disambiguation, URL builder safety
(junk query in endpoint, trailing slashes, fragment, blank
api-version, non-HTTPS / relative / malformed endpoint),
endpoint validation (HTTPS-only, non-empty host, well-formed),
helper functions (`endpointHost`, `readProjectEndpoint`,
`firstLine` reuse), a TLS httptest server smoke test asserting
the built URL lands on `/agents` on the wire, and a token-leak
sanity check on Details / Message / Suggestion strings.
Behavior: with this commit, `azd ai agent doctor` now produces two
remote checks (`remote.auth` from C11 + `remote.foundry-endpoint`
from C12) instead of just one. The full remote chain still requires
C13+ to be useful end-to-end, but every subsequent check can now
take a Pass on this one as proof that the project URL works.
Refs: Azure#7975, PR Azure#8057 design-spec
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds the third remote doctor check, `remote.rbac`, implementing Check Azure#10 of issue Azure#7975 / Check 10 of the design doc (`azd-ai-agent-doctor-remote-checks.md` lines 159-180): "Developer has required role on Foundry project". The check queries the developer's role assignments on the Foundry project's ARM scope via the Microsoft Graph + ARM stack, then classifies the result: * Pass: "<display> has the required role on project '<acct>/<proj>'" (display name replaced with "the current principal" in the default redacted mode for UPN safety). * Fail: templated `az role assignment create --role "Azure AI User" --assignee <oid> --scope <scope>` + a learn.microsoft.com link to the RBAC concepts page. Principal ID and scope ARN are substituted with shell-safe ALL_CAPS placeholders (OBJECT_ID / PROJECT_SCOPE) in redacted mode — NOT `<redacted>`, because bash/zsh interpret `<word>` as input redirection. * Skip: precondition unmet (no AzdClient, no env, auth Failed, `AZURE_AI_PROJECT_ID` missing/malformed/unset, service-principal token detected, transient Graph/ARM error, cancellation). Architecture ------------ Two new files: internal/project/developer_rbac_query.go (~175 LoC) - QueryDeveloperRBAC: side-effect-free wrapper returning a structured DeveloperRBACResult. Reuses package-private parseAgentIdentityInfo / hasAnyRoleAssignment / sufficientAIUserRoles from developer_rbac_check.go. - ValidateProjectResourceID: shape check for AZURE_AI_PROJECT_ID, returns wrapped ErrInvalidProjectResourceID sentinel. - ErrInvalidProjectResourceID, ErrSPNDelegatedAuthRequired: sentinel errors for diagnostic classification. - Graph /me failure detection: routes app-only/SPN token rejections onto ErrSPNDelegatedAuthRequired (case-insensitive message match) so doctor surfaces a SPN-aware Skip. internal/cmd/doctor/checks_rbac.go (~430 LoC) - newCheckRBAC: skip-cascade + projectID lookup + upfront ValidateProjectResourceID gate + probe dispatch. - classifyRBACProbeError: sentinel-keyed error classification (Canceled / SPN / InvalidProjectID / generic transient). - classifyRBACResult: pure Pass/Fail mapping for diagnostic consumers. - sanitizeScopeARNs: regex-based scope+GUID scrubber for probe error text. - readProjectResourceID: AZURE_AI_PROJECT_ID env lookup via gRPC EnvironmentService. - redactID / redactScope / redactDisplay: centralized placeholder substitution helpers. Two new seams on Dependencies: probeDeveloperRBAC - replaces project.QueryDeveloperRBAC readProjectResourceIDFn - replaces readProjectResourceID Skip-cascade rationale ---------------------- Per design dependency matrix (line 115), `remote.rbac` cascades against `local.environment-selected` + `remote.auth` but NOT `remote.foundry-endpoint`. RBAC reads ARM, not the data plane; a transient DNS / proxy / outage on the data-plane check must not prevent the user from learning their role assignment is missing. `TestCheckRBAC_DoesNotSkipOnFoundryEndpointFail` pins this. Probe errors → Skip (not Fail) to avoid false alarms on transient Graph/ARM hiccups. Cancellation similarly Skips with a clean message rather than rendering an error trace. Review fixes applied -------------------- Following the 3-reviewer pass (Opus xhigh + Sonnet 4.6 + GPT-5.5) of an earlier draft (commit 0c4d5ee), the following findings were addressed: * MEDIUM (Opus + GPT-5.5): probe-error path leaked raw subscription/scope IDs via azcore.ResponseError.Error()'s first line (`GET https://management.azure.com/subscriptions/...`) into Message + Details. Fix: sanitizeScopeARNs regex pass strips ARM scopes + bare GUIDs from Message in redacted mode; Details["probeError"] is OMITTED entirely unless --unredacted. * MEDIUM (Sonnet 4.6): malformed AZURE_AI_PROJECT_ID got "check your network" Suggestion despite no network call. Fix: upfront ValidateProjectResourceID gate runs before the probe; surfaces "is not a valid Foundry project ARM resource ID" with an `azd env set` Suggestion. * MEDIUM (GPT-5.5): PrincipalDisplay rendered verbatim in Message even in default redacted mode; display names can contain UPN fragments (e.g., "Alice (alice@contoso.com)"). Fix: redactedDisplayLabel ("the current principal") substitutes for raw display in default mode; unredacted mode still echoes the real display. * MEDIUM (GPT-5.5): service-principal `azd auth login` cannot use Graph /me — confusing generic transient Skip. Fix: ErrSPNDelegatedAuthRequired sentinel + case-insensitive detection of the canonical "delegated authentication flow" Graph response; doctor surfaces a SPN-aware Skip with user-delegated guidance. * LOW/Nit (Opus): `<redacted>` is a bash input-redirection token; `--assignee <redacted>` would fail with `redacted: No such file or directory` on copy-paste. Fix: shellSafePlaceholderID/Scope constants render `OBJECT_ID` / `PROJECT_SCOPE` in the templated az command. Verified clean (no action): skip-cascade structure, firstLine helper, AZURE_AI_PROJECT_ID provenance (set by init_foundry_resources_helpers.go:327), seam design, account-scope check missing (acceptable per `assignedTo()` inheriting parent- scope assignments). Testing ------- 22 unit tests in checks_rbac_test.go and developer_rbac_query covering: skip cascades, probe-error branches (transient, parse, SPN, defensive sentinel, cancellation), end-to-end probe Pass/ Fail, classifyRBACResult mapping with both redaction modes, display-name fallback, sensitive-identifier leak prevention, shell-safe placeholder substitution, ValidateProjectResourceID shape coverage, sanitizeScopeARNs regex coverage, all three redaction helpers (redactID/Scope/Display) with empty-input + flag permutations. Preflight (from cli/azd/extensions/azure.ai.agents) --------------------------------------------------- gofmt -s -w . - clean go vet ./... - clean go build ./... - clean go test ./... -count=1 - all packages green golangci-lint run ./internal/cmd/... - 0 issues ./internal/project/... npx cspell lint - 0 issues "internal/cmd/doctor/**/*.go" "internal/project/developer_rbac_query.go" Refs: Azure#7975 (PR Azure#8057, Phase 5 / C16) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds the fourth (and final per the current phase-5 scope) remote
doctor check: `remote.agent-status`. For each hosted-agent service
declared in azure.yaml, the check resolves the deployed agent
name + version from the active azd environment (the AGENT_<KEY>_NAME
and AGENT_<KEY>_VERSION values written by service_target_agent.go
on a successful `azd deploy`) and probes Foundry's
`GET /agents/{name}/versions/{version}` endpoint, mapping the
lifecycle status to a doctor classification.
Per-service classification
- Active → Pass (one of N agents active)
- Creating → Warn (azd ai agent monitor --follow)
- Failed → Fail (azd ai agent monitor --follow,
NOT azd deploy — read logs first)
- 404 / Deleting / Deleted → Fail (azd deploy)
- AGENT_<KEY>_NAME unset, OR NAME set but AGENT_<KEY>_VERSION
unset → Fail (azd deploy — the service is
declared but has not been
deployed cleanly; the post-deploy
hook writes both env vars
atomically so a half-pair is a
deterministic config issue)
- Unrecognized status → Fail with raw status surfaced so the
user can search; suggests the Foundry
portal
- Probe error / 401-403 → service-scoped transient skip; does
NOT fail the aggregate when other
services are healthy
Aggregate
The per-service entries fold into a single doctor Result via a
ranked classifier (`agentClassRank`). Worst-class drives the
aggregate Status:
- All Active → Pass
- Worst is Failed → Fail (monitor)
- Worst is Missing (404 / Deleted / Deleting) → Fail (deploy)
- Worst is NotDeployed (no env var / half-pair) → Fail (deploy)
- Worst is Unknown → Fail (portal)
- Worst is Deploying / Creating → Warn (monitor)
- Worst is Transient AND ≥1 Active → Pass with
Message note
- All Transient → Skip (retry)
When multiple failing classes coexist (e.g., one Failed agent and
one Missing agent in the same project) the dominant class drives
the headline and Suggestion. Detail lines are filtered to the
dominant class only so the headline count and the rendered body
always agree; a short "other agents have additional issues" hint
is appended to the Suggestion telling the user to read Details for
the secondary fix path.
The Message lists up to 3 failing services (with `(N more)` after
that). Active services are listed in the all-active Pass message.
Details["services"] always carries the full per-service list for
JSON consumers.
Fan-out
Probes execute concurrently across services with a bounded worker
pool (probeConcurrency = 4) so wall-clock cost is bounded by the
slowest probe rather than the sum of probes. Each probe still
enforces its own 6 s timeout via agentStatusProbeTimeout; the
parent ctx propagates cancellation to all in-flight workers.
Skip-cascade (per design dependency matrix lines 110-117 of
.tmp/pr-8057/azd-ai-agent-doctor-remote-checks.md):
- local.environment-selected (env reads would fail)
- local.agent-service-detected (no service list ⇒ Pass = bug)
- remote.auth (no token ⇒ every probe 401s)
- remote.foundry-endpoint (no reachability ⇒ N transports)
We deliberately do NOT cascade from `remote.rbac`: agent-list /
agent-get is Reader-level, and developers with read-only access
on the Foundry project still benefit from knowing whether their
agents are healthy. Pinned by
TestCheckAgentStatus_DoesNotSkipOnRBACFail.
Review findings applied (Opus 4.7 xhigh + Sonnet 4.6 + GPT-5.5)
- MEDIUM (Sonnet Azure#1 + Opus Azure#1): doc/code/test disagreement on
Active + Transient mix. The godoc promised "Pass with note"
but the code returned Skip and the test pinned the wrong
behavior. Implemented the documented Pass-with-note so a
transient probe failure for one service does not mask the
healthy state of the rest. Test renamed to
TestCheckAgentStatus_Aggregate_ActiveAndTransient_PassWithNote
and asserts the Message text.
- MEDIUM (Sonnet Azure#2 + Opus Azure#2 + GPT-5.5 Azure#3): headline count and
rendered detail body diverged when entries from multiple
failing classes coexisted (e.g., "1 of 2 agents are in a
failed state" with 2 detail lines including a Missing entry).
The aggregate now filters detail lines to the dominant class
via `detailsForClass(worst)`, and the Suggestion is enriched
with an "Other agents have additional issues (<classes>); see
the per-service Details" hint when other failing classes are
present. Pinned by
TestCheckAgentStatus_Aggregate_FailedDominatesMissing.
- MEDIUM (GPT-5.5 Azure#1): AGENT_<KEY>_NAME present with
AGENT_<KEY>_VERSION empty was classified as `transient`,
surfacing "retry doctor" — wrong for a deterministic config
issue. Reclassified as `not-deployed` so the user is told to
`azd deploy`. Test
TestCheckAgentStatus_MismatchedNameVersion_NotDeployedForService
+ TestProbeOneService_NameSetVersionEmpty_NotDeployed.
- MEDIUM (GPT-5.5 Azure#2): serial probes — 100 services × 6 s could
drag the doctor run past 10 minutes. The design spec calls it
"fan out", so probes now run in parallel via a bounded
4-worker pool (probeConcurrency). Order preservation guarantees
deterministic Details rendering.
- LOW (Sonnet Azure#3): added TestProbeOneService_DeletingStatus_Missing
covering the `Deleting` lifecycle branch that previously had no
direct test (only `Deleted` was covered).
Files
- internal/cmd/doctor/checks_agent_status.go
* newCheckAgentStatus + the Check shape
* probeOneService — per-service classification body
(now-corrected name-set-version-empty path)
* probeAllServices — bounded-concurrency fan-out helper
* classifyAgentStatusAggregate — folds entries into one
Result with class-filtered detail lines + mixed-class
Suggestion hint + Active+Transient Pass-with-note branch
* makeRealProbeAgentStatus — production probe closure (uses
agent_api.GetAgentVersion + azidentity.NewAzureDeveloperCLI
Credential, the same auth path the runtime invoke flow uses)
* readAgentNameVersion + readAgentServices helpers
* doctorServiceKey (mirrors cmd.toServiceKey; duplicated to
avoid an import cycle, same rationale as agentHost in
checks_project.go)
* Lifecycle constants (Active / Creating / Failed / Deleted /
Deleting) sourced from
vienna:Contracts/V2/Generated/Agents/AgentVersionStatus.cs
* truncateLines / serviceNamesByClass / firstTransient
- internal/cmd/doctor/checks_agent_status_test.go (~640 LoC, 34
tests; +2 new tests over v1)
* Skip-cascade gates × 9
* Per-service classification × 7 (incl. Deleting, NameNoVersion)
* Status case-insensitive matching
* Aggregate ranking × 5 (incl. Active+Transient Pass-with-note,
FailedDominatesMissing with Message-text assertion)
* Aggregate truncation at 3 + "(N more)"
* probeOneService transport branches × 6 (incl. context.Canceled
and context.DeadlineExceeded handling)
* Service-key edge cases, rank fallback, truncateLines
boundary, makeRealProbeAgentStatus closure check, plus a
ServerHandler-based smoke test that the
azcore.ResponseError code is surfaced via statusCode
- internal/cmd/doctor/checks_local.go
* Dependencies struct grows two new seams:
probeAgentStatus + readAgentNameVersionFn (mirrors the
probeDeveloperRBAC + readProjectResourceIDFn pattern from C16)
- internal/cmd/doctor/checks_remote.go
* NewRemoteChecks adds newCheckAgentStatus(deps) as the 4th
entry, after auth / foundry-endpoint / rbac
- internal/cmd/doctor/checks_remote_test.go
* TestNewRemoteChecks_HasAuthFoundryEndpointRBACAndAgentStatus
pins the 4-entry shape
Out of scope (deferred)
- Sharing the credential across services: each probe currently
constructs its own azidentity.NewAzureDeveloperCLICredential.
Since the credential is essentially a thin shell around
`azd auth token`, the cost is negligible (a single in-process
call per probe) and threading it through complicates the
test-seam shape. Will revisit if benchmarks surface it.
- Re-using readAgentNameVersion for the doctor's eventual ENV
pretty-print mode. Out of scope for the check itself; the
helper is unexported and can be promoted when the renderer
needs it.
Preflight (from cli/azd/extensions/azure.ai.agents)
- gofmt -s -w . clean
- go vet ./... clean
- go build ./... clean
- go test ./... -count=1 32 doctor tests + full ext suite PASS
- golangci-lint run ./... 0 issues
- cspell lint ... (cspell.yaml) 0 issues
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes two pre-existing UX bugs in `azd ai agent run`:
B5 (Next: race): pre-C8 the Next: block was rendered BEFORE
`proc.Start()`, so its "in another terminal, try: <example>" line could
land in the user's terminal seconds before the agent's port was bound.
Following the suggestion verbatim while the agent was still starting
yielded a connection-refused.
B6 (stale cached OpenAPI): pre-C8 the OpenAPI probe was strictly
cache-only. The first `run` ever issued — and every `run` against an
agent whose schema had drifted — surfaced a protocol-generic
`<payload>` literal instead of the agent's actual example.
Scope — only `run.go` uses the live probe. `show.go` (`WithOpenAPIProbe
(name, "remote")`), `deploy`/artifact-note path, `init.go`, and
`init_from_code.go` remain cache-only by design.
Changes
-------
nextstep/state.go
- New functional option `WithLiveOpenAPIProbe(fetch func(context.
Context) ([]byte, error))`. Stores the fetcher in
`cfg.openAPILiveFetch`.
- `populateOpenAPIPayload` now takes `(ctx, cfg, projectPath, envName,
state)`. Order of resolution: (1) live, on success → use it;
(2) cache (existing `WithOpenAPIProbe` path), on success → use it;
(3) leave HasOpenAPI=false. Every failure is silent.
- `assembleState` threads `ctx` through to the new signature.
- Doc comment on `WithOpenAPIProbe` updated to note that
`WithLiveOpenAPIProbe` overrides it when both are supplied.
run.go
- Removed the pre-Start `nextstep.AssembleState` + `PrintNext` block
that was emitting Next: before the agent bound its port.
- After `proc.Start()`, spawn `emitNextAfterBind` in a goroutine with
a `nextDone` channel. After `proc.Wait()` returns, the main flow
calls `cancel()` and waits on `<-nextDone` so the goroutine is
fully joined before stdout writes from `runRun`'s caller resume —
closes a stdout race on shutdown.
- `emitNextAfterBind` early-returns when stdout is not a terminal,
honoring the documented nextstep call-site TTY-gating contract
(matches `invoke.go:217`, `show.go:381`). Redirected stdout
(`run > log`) no longer receives the banner or Next: block.
- After waitForPortReady succeeds, builds a closure that wraps
`fetchLiveOpenAPI(ctx, port)` and passes BOTH
`WithOpenAPIProbe(serviceName, "local")` (cache fallback) and
`WithLiveOpenAPIProbe(...)` to `AssembleState`. The state
assembler picks live first, falls back to cache silently if the
live fetch fails.
- Re-checks `ctx.Err()` after `AssembleState` returns so a Ctrl+C
arriving mid-call doesn't surface "Agent ready" after
"Agent stopped." was already printed.
- Four new constants: `portReadyBudget` (5 s),
`portReadyPollInterval` (100 ms), `portReadyDialTimeout` (50 ms),
`liveOpenAPITimeout` (3 s).
- `waitForPortReady(ctx, port, budget) bool`: bounded TCP dial-loop
that honors ctx.
- `fetchLiveOpenAPI(ctx, port) ([]byte, error)`: uses
`http.NewRequestWithContext` to GET
`http://localhost:<port>/invocations/docs/openapi.json`. The route
matches the cache-side fetcher in `helpers.go:368` and the
user-facing curl tip in `nextstep/resolver.go:226`. Non-200
responses are returned as errors so the assembler falls back to
cache rather than ingesting a 404 body via `ExtractInvokeExample`.
Tests
-----
state_test.go (+5 new TestAssembleState_WithLiveOpenAPIProbe_* cases +
expanded `TestOptionsApplyCleanly`):
- PrefersLiveOverCache, FallsBackToCacheOnError,
FallsBackToCacheOnEmptyBody, LiveWorksEvenWithoutCacheProbe,
LiveFailureWithoutCacheLeavesUnset.
run_test.go (+8 new tests + `listenLoopback` helper):
- 3× waitForPortReady (bound port, budget elapse, ctx cancel).
- 3× fetchLiveOpenAPI (200 body asserts
`/invocations/docs/openapi.json` path, non-200 error, ctx
deadline).
- 2× emitNextAfterBind (never-binds, ctx cancelled — both pass nil
azdClient through the safe early-return paths to verify the helper
exits silently without panic or goroutine leak).
Preflight clean: gofmt -s -w, go vet, go build,
go test ./internal/cmd/... ./internal/cmd/nextstep/... -count=1
(cmd 10.5 s, doctor 1.6 s, nextstep 1.9 s), golangci-lint run
./internal/cmd/... (0 issues), cspell on the four modified files (0).
Review fixes (3-reviewer pass)
------------------------------
Three independent reviewers (Opus xhigh, Sonnet 4.6, GPT-5.5) reached
consensus on three correctness findings before merge:
- Live probe URL corrected to /invocations/docs/openapi.json
(matches existing cache fetcher and user-facing curl tip).
- Banner + PrintNext now gated on isTerminal(os.Stdout.Fd()) to
honor the nextstep call-site contract.
- emitNextAfterBind goroutine is now joined via nextDone channel
after proc.Wait, and re-checks ctx.Err() before printing so the
banner cannot land after "Agent stopped."
- Replaced misleading "ReturnsSilentlyWhenPortNeverBinds" test that
only exercised waitForPortReady with two tests that actually call
emitNextAfterBind with nil azdClient on the safe early-return
paths.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…P5.1 C12)
Adds the final doctor check (P5.1 C12) that surfaces deployed-agent
managed-identity role assignments at three ARM scopes — project,
account, and resource group. For each agent classified active by
the upstream `remote.agent-status` check, the new
`remote.agent-identity-roles` check fetches the agent's
`instance_identity.principal_id` from Foundry and lists ARM role
assignments at all three scopes via the existing
`armauthorization` SDK already pulled in for the developer-RBAC
check.
## What lands
`internal/project/agent_identity_query.go` (new, ~340 LoC):
- Public API `QueryAgentIdentityRoles(ctx, azdClient,
projectResourceID, principals) (*AgentIdentityRolesResult,
error)` reuses `parseAgentIdentityInfo` from
`agent_identity_rbac.go` to derive the three scope ARNs, looks
up the user-access tenant via `LookupTenant`, builds an
`AzureDeveloperCLICredential` pinned to that tenant, and fans
out per-principal role-assignment listings with `wg.Go`.
- Public types `AgentPrincipal`, `AgentScopeRoles`,
`AgentIdentityRolesEntry`, `AgentIdentityRolesScopes`,
`AgentIdentityRolesResult` form the structured listing the
doctor renders.
- `queryAgentIdentityRolesWithLister` separates the per-scope
listing strategy from credential acquisition so unit tests
drive the inner classifier without standing up ARM fakes.
- `listRoleNamesAtScope` lists role assignments with ARM's
server-side `assignedTo('<principal>')` filter, then resolves
role-definition IDs to human-readable names via
`RoleDefinitions.Get`. Failures on individual role-name
resolution downgrade gracefully (omitted from the listing).
`internal/cmd/doctor/checks_agent_identity_roles.go` (new, ~640 LoC
including doc comments):
- `newCheckAgentIdentityRoles(deps)` builds the check Closure.
Skip cascade against `local.environment-selected`,
`local.agent-service-detected`, `remote.auth`,
`remote.foundry-endpoint`, and `remote.agent-status`'s Pass
(per the design's "for each active agent found in check 11").
- `readActiveAgents(prior)` enumerates agents reachable to this
check by reading the upstream `remote.agent-status` Details'
`services` slice and filtering to Classification == "active".
- `classifyOneAgent` buckets a single agent into fine /
underscoped / empty / unknown per the design's pass condition
("project + (account|RG) covered"). `describeOneAgent`
renders the one-line per-agent breakdown
(`<agent>: project=N, account=M, resource-group=K`, with `?`
on probe-error scopes).
- `classifyAgentIdentityRolesAggregate` folds per-agent entries
into a single doctor Result: all "fine" → Info; any "empty"
→ Fail (smoking-gun for "every tool call 403s"); worst
"underscoped" → Warn; worst "unknown" → Warn.
- `makeRealProbeAgentPrincipal` builds the production probe
closure (mirrors `makeRealProbeAgentStatus` byte-for-byte
apart from the field consumed — `InstanceIdentity.PrincipalID`
vs `Status`).
## Renderer additions
`StatusInfo` joins the existing Pass/Warn/Fail/Skip status set so the
"all agents are fine" case can surface as an informational role
listing without flagging the run yellow.
- `types.go`: `StatusInfo Status = "info"` + `Summary.Info int`
(JSON tag added).
- `runner.go`: canonical validation switch + summarize switch
extended for Info; `ExitCode` treats Info as a "useful
diagnostic completed" status (matches Pass for exit-code
purposes).
- `doctor_format.go`: glyph "ⓘ" and label "INFO" added to
`statusGlyphAndLabel`; `writeSummaryLine` appends ", N info"
when Info > 0 (preserves existing test assertions otherwise).
## Dependencies wiring
`internal/cmd/doctor/checks_local.go` adds two test seams on
`Dependencies`:
- `probeAgentPrincipal` — replaces the production
`GetAgentVersion` call with a unit-test fake. Same signature
shape as `probeAgentStatus`.
- `queryAgentIdentityRoles` — replaces the production
`project.QueryAgentIdentityRoles` call. Signature mirrors the
public API so wiring is a single substitution.
`internal/cmd/doctor/checks_remote.go` appends
`newCheckAgentIdentityRoles(deps)` after the existing
`remote.agent-status` entry in `NewRemoteChecks`. The append-after
ordering is load-bearing — every skip-cascade guard in C12 reads
`remote.agent-status`'s Result from `prior []Result`, and the
local-then-remote ordering invariant remains intact (verified by
the existing `TestNewLocalAndRemoteChecks_ProductionCompositionLocalsFirst`).
## Tests
`internal/cmd/doctor/checks_agent_identity_roles_test.go` (new, 16 KB):
- Skip-cascade gates: nil AzdClient, `remote.agent-status` not
Passed, project endpoint missing, no active agents,
project-resource-ID unset, project-resource-ID malformed.
- Aggregate classification: Info when all fine; Fail when any
agent has zero roles; Warn when worst is underscoped; Warn on
transient query error.
- Per-agent classifier table (six cases: project+account,
project+RG, project-only, account-only, all-empty,
all-errored).
- Detail formatting: scope counts and `?` for probe-error
scopes.
- Missing-principal degradation: agent with no
`instance_identity` surfaces as a warning rather than a fail.
- `readActiveAgents` filtering invariants (active-only,
missing-name dropped, nil-return on missing upstream).
`internal/cmd/doctor/checks_remote_test.go` updated: the
`NewRemoteChecks` contract test now pins five entries (auth →
foundry-endpoint → rbac → agent-status → agent-identity-roles)
with their ID / Name / Remote / Fn invariants.
## Preflight
- gofmt -s -w . clean
- go vet ./... clean
- go build ./... clean
- go test ./... -count=1 all green
- internal/cmd 14.6s
- internal/cmd/doctor 1.6s
- internal/cmd/nextstep 3.8s
- internal/pkg/agents/agent_api 10.9s
- internal/pkg/agents/agent_yaml 1.0s
- internal/pkg/azure 12.7s
- internal/project 5.5s
- golangci-lint run ./internal/cmd/... ./internal/project/... 0 issues
- cspell on new files (after "underscoped" added to cspell.yaml) 0 issues
- copyright-check.sh on extension clean
## Design notes
- The spec at `.tmp/pr-8057/azd-ai-agent-doctor-remote-checks.md`
lines 193–223 specifies a per-agent fan-out at three scopes
with the "fine" pass condition (project + (account|RG)). Renders
as INFO rather than PASS because the design's intent is a
diagnostic listing — operators inspect it on `--output json`
and confirm no MI is starved; the check should not flip the
doctor green on its own.
- C12 uses the `wg.Go` Go 1.26 idiom for per-principal fan-out;
per-scope probes within one principal run sequentially (3
scopes × 1 ARM listing each is well under budget and avoids
the goroutine-per-scope-explosion).
- `probeAgentPrincipal` deliberately does NOT extend C17's
`probeAgentStatus` surface — extending it would couple two
independent checks. The mirror cost is one ~40-line factory
function shared by both.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…P5.1 C2)
Adds a best-effort manifest walker that surfaces model / toolbox /
connection resources from each service's `agent.manifest.yaml`
into `nextstep.State`. Unblocks the doctor checks C13 (model
deployments), C14 (toolboxes), and C15 (connections), all of which
need to know whether the relevant resource kinds are declared
before they can decide to run or skip.
## State additions
- `State.HasModels`, `State.HasToolboxes`, `State.HasConnections` —
aggregate boolean flags. True when at least one matching resource
is found across all `azure.ai.agent` services.
- `State.ModelRefs`, `State.Toolboxes`, `State.Connections` —
sorted `[]ResourceRef` per kind.
- `ResourceRef{Name, ServiceName, Detail}` — slim doctor-facing
shape. Detail carries the kind-specific identifier (model id,
connection `<Category> | <Target>`, empty for toolboxes).
## Walker semantics
- File names probed (in order): `agent.manifest.yaml`,
`agent.manifest.yml`. `agent.yaml` is deliberately excluded —
it describes the container, not declared resources.
- Uses `agent_yaml.ExtractResourceDefinitions` directly (NOT
`LoadAndValidateAgentManifest`) so a manifest with an absent /
partial `template` block — common during init — still surfaces
its `resources:` declarations.
- Best-effort: missing file, unreadable bytes, malformed YAML,
zero resources, and unknown resource kinds all silently degrade
(Has* flags stay false; lists stay nil). Walker never adds to
the `errs` slice so a manifest-in-flight (which init re-writes
mid-flow) never blocks the rest of state assembly.
- Dedup key is `(ServiceName, Name)`. Same name twice in one
service collapses to one entry (first-occurrence wins, matching
agent_yaml's manifest semantics). Same name under two services
surfaces twice so per-service doctor failures remain
individually addressable.
- Result slices are sorted by `(Name, ServiceName)` so doctor
output snapshots and downstream renderers are deterministic.
## Why this is its own commit
The walker is a pure data-collection step with no resolver-side
consumers in this commit. Doctor checks C13/C14/C15 (following
commits) gate-skip themselves on `state.Has{Models,Toolboxes,
Connections}` and iterate the matching ref slice. Landing the
walker first keeps each downstream commit focused on its single
check.
## Tests
8 new tests in `manifest_test.go`:
- All three kinds present → flags + lists populated, sort order
+ detail formatting locked.
- Missing manifest → silent, no errors logged through the
walker.
- Malformed YAML → silent, no errors.
- Manifest with no `resources:` key → silent, flags false.
- Multi-service aggregation → entries sorted by Name, ties
broken by ServiceName.
- Duplicate `(service, name)` within one manifest → first
occurrence wins.
- `.yaml` wins over `.yml` when both exist.
- `agent.yaml` (not a manifest) is ignored even if its content
happens to parse as one.
- `connectionDetail` table-driven test covers all four
category/target combinations.
## Preflight
- `gofmt -s -w .` — clean
- `go vet ./...` — clean
- `go test ./... -count=1` — full extension suite green
- `golangci-lint run ./internal/cmd/...` — 0 issues
- `cspell` over the touched files — 0 issues
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds the seventh doctor remote check (`remote.model-deployments`) which verifies that every model resource declared in any service's `agent.manifest.yaml` (collected by the C2 manifest walker into `State.ModelRefs`) has a corresponding Cognitive Services deployment on the Foundry project's underlying account. # What it does For each project run: 1. Skip-cascade gates (in order): `AzdClient` nil → `local.environment-selected` → `local.azure-yaml` / `local.agent-service-detected` → `remote.auth` → `remote.foundry-endpoint` → `!state.HasModels` → `AZURE_AI_PROJECT_ID` unreadable / cannot be parsed. Every gate produces a single, actionable Skip message that points the user at the upstream check. 2. Parse `AZURE_AI_PROJECT_ID` (Foundry project ARM resource ID) for `(subscription, resourceGroup, accountName)` via the new `parseAccountFromProjectID` helper. Deployments live at the Cognitive Services *account* level, NOT the project level, so the account name is the load-bearing parameter here. 3. Issue exactly one `armcognitiveservices/v2.DeploymentsClient .NewListPager` round trip via the new `realProbeModelDeployments` helper, capped at 10s (matches the design's per-probe budget in `.tmp/pr-8057/azd-ai-agent-doctor-remote-checks.md`). Returns `[]string` of deployment names. Transport errors short-circuit the check to Skip with the error verbatim plus an actionable retry suggestion — we can not distinguish "deployment missing" from "ARM unreachable" without a successful round trip. 4. `classifyModelDeployments` joins `state.ModelRefs` to the deployment set on name. All match → Pass with the matched count. One or more missing → Fail with the missing names listed in the Message and structured under `Details["missingModels"]` (each entry carries both `name` and `service` so the user can locate the offending manifest entry). Suggestion: `azd provision` to create the missing deployments, or update `agent.manifest.yaml` `resources[].name` to match deployments that already exist. # Aggregation The walker may surface `ModelRefs` from multiple services. Every service in an azd project belongs to the same Foundry project (and therefore the same Cognitive Services account), so the check issues exactly one deployments list per run regardless of how many services / model refs exist. The same model referenced by two services collapses to a single match check; a missing model referenced by two services surfaces as two `missingModels` entries (one per service) so the user can pinpoint each affected manifest. # Test seam `Dependencies.probeModelDeployments` (lowercase, package-internal) matches the established pattern from `probeAuth`, `probeFoundryEndpoint`, `probeDeveloperRBAC`, `probeAgentStatus`, `probeAgentPrincipal`. Production wiring leaves it nil; tests inject a closure that returns canned `(names, err)` tuples and (optionally) captures the `(subscription, resourceGroup, accountName)` it was called with. `Dependencies.assembleState` and `Dependencies.readProjectResourceIDFn` are reused from earlier checks; no new top-level seam is added besides the probe. # Files - `internal/cmd/doctor/checks_model_deployments.go` — new check factory `newCheckModelDeployments`, `parseAccountFromProjectID`, `classifyModelDeployments`, `realProbeModelDeployments`, `listDeploymentNames`. 363 lines. - `internal/cmd/doctor/checks_model_deployments_test.go` — 11 tests: skip-cascade (1 + table of 5 upstream-blocked permutations), no manifest models, unset project ID, unparsable project ID, probe transport error, all-match Pass, partial-match Fail, all-missing Fail, parser table (canonical / mixed-case / missing segments / garbage), factory shape pin. - `internal/cmd/doctor/checks_local.go` — adds the `probeModelDeployments` field to `Dependencies` next to its same-shape siblings. - `internal/cmd/doctor/checks_remote.go` — appends `newCheckModelDeployments` after `newCheckAgentIdentityRoles` in `NewRemoteChecks`. - `internal/cmd/doctor/checks_remote_test.go` — extends the composition pin test to assert 6 checks (was 5) including the new `remote.model-deployments` slot. # Preflight - `gofmt -s -w .` clean. - `go vet ./...` clean. - `go build ./...` clean. - Full extension test suite: green (`cmd`, `cmd/doctor`, `cmd/nextstep`, `exterrors`, `agents/agent_api`, `agents/agent_yaml`, `pkg/azure`, `project` — all pass). - `golangci-lint run ./internal/cmd/doctor/...` 0 issues. - `cspell` 0 issues on production file. - No `go.mod` or `go.sum` changes (uses already-imported `armcognitiveservices/v2`). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds the eighth local doctor check, `local.toolboxes`, which verifies
that every ToolboxResource declared in any service's
agent.manifest.yaml has its canonical MCP endpoint env var
(TOOLBOX_<NORMALIZED_NAME>_MCP_ENDPOINT) set in the active azd
environment.
Why local (Remote: false). The check only reads the active azd
environment via the existing gRPC env service — no ARM / Foundry
round trips. Tagging it local means `--local-only` still runs it
(which is exactly what we want: a missing TOOLBOX_*_MCP_ENDPOINT is
diagnosable without network access).
Skip cascade. Skips on AzdClient==nil, local.environment-selected,
local.azure-yaml, local.agent-service-detected, and when
state.HasToolboxes is false. Deliberately NOT gated on remote.auth /
remote.foundry-endpoint — a down Foundry must not poison a local env
diagnostic.
Classification.
- All endpoints set → Pass with matchedCount.
- One or more missing → Fail with the missing toolbox names + env
var keys in the Message, Suggestion points at `azd provision`
(the canonical fix) or `azd env set` as the manual override,
Details["missingToolboxes"] carries a structured list (name,
service, envVar) for JSON consumers.
- Env lookup transport error → Fail (NOT Skip). Divergent from
C13's model-deployments which Skips on probe error, because env
lookup is local; a transport failure here means the user's azd
config / extension is broken and a Skip would silently swallow
that signal. Suggestion points at `azd env list` / `azd env
get-values`.
- Empty / whitespace-only value → treated as missing (matches
detectMissingVars semantics in nextstep/state.go).
Convention. TOOLBOX_<NORMALIZED_NAME>_MCP_ENDPOINT, with name
upper-cased and `-` / `.` / ` ` mapped to `_`. Matches the
hosted-toolbox Bicep sample output names. The prefix and suffix are
pinned in code (not derived from the env) so the Fail message can
name the exact env var the user must grep their Bicep template
for.
Dedup. classifyToolboxEndpoints dedupes on the canonical env key
because the C2 manifest walker dedupes on (ServiceName, Name) — the
same toolbox referenced by two services would otherwise produce two
env lookups and two missing-list entries. Exposed
`dedupToolboxKeys` for callers (renderer / future telemetry) that
want the expected-key list up front; the classifier does its own
inline dedup so it does not depend on this helper.
Test seam. `Dependencies.lookupToolboxEnv toolboxEnvLookupFn`
matches the established seam pattern (probeAuth,
probeFoundryEndpoint, probeModelDeployments). Production wiring
leaves it nil; the check binds `makeRealToolboxEnvLookup(deps
.AzdClient)` on first call, which calls
`client.Environment().GetValue` — the canonical one-key env reader
used by service_target_agent.go and checks_rbac.go.
Tests (15). Skip cascade (azdClient nil + 3 priors), not-gated-on-
remote-priors invariant, state emptiness (no toolboxes / nil
state), 3 classifier paths (all-set / partial / all-missing),
whitespace-as-missing, transport-error-is-Fail, cross-service
dedup, normalizeToolboxName table (8 cases), toolboxEndpointKey
roundtrip, dedupToolboxKeys table, factory-shape pin.
Wired into NewLocalChecks (now 8 entries); local-checks pin test
updated. NewRemoteChecks unchanged (still 6 entries).
Preflight clean: gofmt, vet, build, full extension test suite green
(cmd 16.7s, doctor 2.9s, nextstep 6.7s, etc.), golangci-lint 0
issues, cspell 0 issues on production files.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Issue: Doctor previously had no way to verify that Foundry connections
referenced by agent manifests (e.g. `bing-grounding`, key-vault-backed
auth connections) actually exist on the project. Failure surfaced
later at invoke time as 401/403 from the upstream tool, with no clear
path back to the missing connection.
Approach: New `remote.connections` doctor check that enumerates manifest
ConnectionResource entries (already discovered by the C2 manifest
walker into `state.Connections`), calls
`FoundryProjectsClient.GetAllConnections(ctx)` on the active project,
and reports any manifest-declared connection that isn't present on the
project as a Fail. Missing-entry rendering format:
`<name> [<detail>] (service <svc>)` when Detail is non-empty, falling
back to `<name> (service <svc>)` to avoid a bare `[]`. Detail
typically renders as `<Category> | <Target>` for connections
(types.go:183-189; manifest.go:152-162).
Classification: REMOTE check (Remote: true). Calls the Foundry API.
Skip cascade mirrors C13 (`remote.model-deployments`):
AzdClient → environment-selected → azure.yaml / agent-service-detected
→ remote.auth → remote.foundry-endpoint → !state.HasConnections
→ unparsable project ID
Probe error → Skip (matching C13's pattern — distinguishing transport
failure from "missing connection" requires a successful round-trip).
10s probe timeout.
Wiring: `newCheckConnections(deps)` added as the 7th and final entry
in `NewRemoteChecks` (after C12 agent-identity-roles, C13 model-
deployments). Pin test `TestNewRemoteChecks_HasAuthFoundryEndpointRBAC
AgentStatusIdentityRolesModelDeployments` renamed to
`...ModelDeploymentsConnections`, Len bumped 6 → 7, 4 new
index-6 assertions for ID / Title / Description / Remote.
Test seam: New `probeFoundryConnections` field appended to
`Dependencies` matching the existing seam pattern from
`probeModelDeployments` (C13). Production wiring uses
`realProbeFoundryConnections` which constructs a credential via
`azidentity.NewAzureDeveloperCLICredential` (matching the rest of
the extension per `agent_context.go:101-109`).
New helper: `parseAccountProjectFromProjectID(projectID) (account,
project, err)` — sibling of C13's `parseAccountFromProjectID`. Returns
two segments instead of the four C13 needs; kept separate to avoid
churning C13's signature for a single new caller. Both case-insensitive
on segment markers. Follow-up: consolidate into a single parser when
a third caller appears.
Tests: `checks_connections_test.go` — 13 tests mirroring C13 patterns:
- Skip cascade table (5 rows: AzdClient, environment, azure.yaml /
agent-service-detected, auth, foundry-endpoint).
- State emptiness (HasConnections false → Skip).
- Project ID unset → Skip.
- Project ID unparsable → Skip.
- Probe error → Skip.
- All-match → Pass.
- Partial mismatch → Fail with missing names + service tags.
- All-missing → Fail.
- Empty-Detail rendering omits `[]`.
- Parser table (5 cases: canonical, mixed case, missing project,
missing account, garbage).
- Factory shape pin (Remote: true, ID, Title, Description).
Preflight:
- gofmt -s -w . (clean)
- go vet ./... (clean)
- go build ./... (clean)
- go test ./... -count=1 (all packages pass; doctor 5.474s)
- golangci-lint run ./internal/cmd/doctor/... (0 issues)
- cspell lint "internal/cmd/doctor/*.go" (14 files, 0 issues)
- copyright header verified on both new files
Files:
- internal/cmd/doctor/checks_connections.go (NEW, +332)
- internal/cmd/doctor/checks_connections_test.go (NEW, +326)
- internal/cmd/doctor/checks_local.go (probeFoundryConnections seam +5)
- internal/cmd/doctor/checks_remote.go (wire +newCheckConnections +1)
- internal/cmd/doctor/checks_remote_test.go (pin test 6 → 7, +14)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two reviewer-consensus findings from the batched code review of commits 385b147 (C13 remote.model-deployments), 87e1dcc (C14 local.toolboxes), and 1b3143f (C15 remote.connections): Fix 1 (MEDIUM, Opus + GPT-5.5): toolbox env-key normalizer divergence. C14's `normalizeToolboxName` only mapped `-`, `.`, and ` ` to `_` rune-by-rune, while the production helpers `init.go:toolboxMCPEndpointEnvKey` (manifest injection) and `listen.go:toolboxMCPEndpointEnvKey` (runtime env write) both use the regex `[^A-Z0-9]+` -> `_` (run-collapsing, all non-alphanumerics). The two algorithms agreed only on the subset of inputs the test table exercised (`web-search-tools`, `my.toolbox.v2`, `my toolbox`, ...) and diverged on inputs like `my--tool`, `my+tool`, `my:tool`, `my(tool)`, `my\ttool`. A user with such a toolbox name would see the doctor flag a missing endpoint under a key (`TOOLBOX_MY__TOOL_MCP_ENDPOINT`, `TOOLBOX_MY+TOOL_MCP_ENDPOINT`, ...) that nothing in the system ever writes. Resolution: hoist the canonical helper into a shared `internal/pkg/envkey` package so the producing and diagnostic sides cannot drift again. - new internal/pkg/envkey/envkey.go -- `ToolboxMCPEndpoint` - new internal/pkg/envkey/envkey_test.go -- 13 cases incl. double-hyphen run, `+`, `:`, `/`, tab, parens, empty - internal/cmd/listen.go -- drop local helper, drop `regexp` import, route through envkey - internal/cmd/init.go -- route through envkey - internal/cmd/init_test.go -- delete duplicated table (covered by envkey package test) - internal/cmd/doctor/checks_toolboxes.go -- drop local normalizeToolboxName / toolboxEndpointKey /toolbox{Prefix, Suffix}, route 2 callsites through envkey - internal/cmd/doctor/checks_toolboxes_test.go -- replace the normalize-table test with a thin pin test verifying the doctor's renderer helper routes through envkey - cspell.yaml -- allowlist `envkey` Fix 2 (MEDIUM, Sonnet): assembler errors silently swallowed. C13/C14/C15 all used `state, _ := assembler(...)` and reported a `Skip` with "no X declared in any service's agent.manifest.yaml" whenever `state == nil || !state.HasX`. The existing pattern at `checks_manual_env.go:95-109` instead captures `errs` and Fails with the actual cause when `state == nil` (defensive against a future contract change where AssembleState may return a nil state with a populated errs slice). Resolution: mirror the established pattern in all three new checks. The Skip for `!state.HasX` is preserved; only the `state == nil` branch becomes a Fail surfacing `errs[0].Error()`. - checks_model_deployments.go -- Fail-on-nil with cause - checks_toolboxes.go -- Fail-on-nil with cause - checks_connections.go -- Fail-on-nil with cause - checks_model_deployments_test.go -- new test: nil state surfaces errs[0] - checks_toolboxes_test.go -- update existing `SkipsWhenAssemblerReturnsNil` to `FailsWhenAssembler ReturnsNilState` plus new test asserting errs[0] surfaces in the Fail message - checks_connections_test.go -- new test: nil state surfaces errs[0] Not addressed (deferred): LOW (GPT-5.5): `parseAccountProjectFromProjectID` (C15) accepts partial paths; `parseAccountFromProjectID` (C13) does not. Opus reviewed and called the dual-parser duplication "defensible for two callers"; commit 1b3143f's message already notes the follow-up to consolidate when a third caller appears. Preflight: - gofmt -s -w . (clean) - go build ./... (clean) - go vet ./... (clean) - go test ./... -count=1 (all packages pass; envkey 1.837s, doctor 6.276s, cmd 14.401s) - golangci-lint run ./... (0 issues) - cspell lint <new+touched> (17 files, 0 issues) Files (10): - internal/pkg/envkey/envkey.go (NEW) - internal/pkg/envkey/envkey_test.go (NEW) - internal/cmd/listen.go (MOD) - internal/cmd/init.go (MOD) - internal/cmd/init_test.go (MOD) - internal/cmd/doctor/checks_toolboxes.go (MOD) - internal/cmd/doctor/checks_toolboxes_test.go (MOD) - internal/cmd/doctor/checks_model_deployments.go (MOD) - internal/cmd/doctor/checks_model_deployments_test.go (MOD) - internal/cmd/doctor/checks_connections.go (MOD) - internal/cmd/doctor/checks_connections_test.go (MOD) - cspell.yaml (MOD) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When Doctor/post-deploy guidance has no cached OpenAPI-derived sample payload but a service README is present, don't suggest a concrete protocol-generic payload that may fail for that sample's schema. Emit the README pointer first, then an invoke command with an explicit '<payload>' placeholder. Cached OpenAPI payloads still produce runnable invoke commands, and services without a README still get the protocol-generic fallback payload with a generic label. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Active/idle show output should stay an inspection view of the hosted agent resource. Remove the active-state invoke suggestion from ResolveAfterShow and avoid attaching next_step in show JSON when the agent is already healthy. Non-active states keep actionable guidance: - creating -> monitor --type system --follow - failed/empty -> monitor --follow - deleting/deleted -> azd deploy - unknown -> azd ai agent show <service> This also avoids state/OpenAPI assembly work for active show output because no active-state guidance is rendered. Validation: - go test ./internal/cmd ./internal/cmd/nextstep -run 'TestResolveAfterShow|TestResolveNextStepFromStatus|TestShowResultJSON|TestPrintAgentVersionJSON|TestPrintAgentVersionTable|TestResolveAfterInvoke_Success|TestResolveAfterInit_UnresolvedPlaceholders|TestResolveAfterRun' -count=1 - go test ./internal/cmd/... -count=1 - go vet ./internal/cmd/... - golangci-lint run ./internal/cmd/... - cspell lint touched show/nextstep files Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Stream text-mode doctor output by observing each finalized check result, while keeping JSON output buffered and unchanged. Split the text formatter into header/check/footer pieces so the streaming path preserves the existing report shape. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove unnecessary explanatory comments across the Azure AI Agents extension and shorten the remaining comments to focus on non-obvious contracts and behavior. Also drops the unused .tmp/ ignore entry. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove the reserved doctor --output flag path, apply go fix modernizations, and reword cspell-sensitive text. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Restore comments that predated this PR in existing files so the cleanup only affects new or changed comment surfaces. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add project-root validation for service-relative file probes and reads, including symlink-aware containment checks and root-service handling.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove the unused nextstep auth probe option and state fields that were added by the PR but never consumed by the resolver or doctor wiring. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Route PR-added nextstep stdout emission through one cmd helper so TTY gating stays consistent across init, invoke, run, show, and doctor text output. This intentionally applies the nextstep call-site TTY contract to the PR-added init next-step blocks as well, keeping redirected output free of human-only guidance. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The remote.auth doctor check surfaced the raw user principal name in
both the Message string and the structured Details map regardless of
the --unredacted flag, in contrast with the rest of the doctor checks
(checks_rbac.go, checks_agent_identity_roles.go) which already gate
identity values on opts.Unredacted.
This change threads Options into the auth check function and adds two
small helpers that reuse the existing redactedPlaceholder constant:
- redactUPN(upn, unredacted) returns the value to surface in Message
text: raw when --unredacted, the shared <redacted> placeholder when
a UPN was discovered but should be scrubbed, and empty when none
was found so composeAuthMessage cleanly drops the prefix.
- authDetails(upn, minutes, unredacted) builds the Details map and
omits the "upn" key entirely unless --unredacted is set, so
machine consumers do not see the raw value by default.
PASS, WARN, and expired-FAIL branches now compose their messages from
the redacted display value. Existing tests that asserted the raw UPN
were updated to pass Options{Unredacted: true}; new table tests cover
the default-redacted and --unredacted contracts on every branch, the
empty-UPN drop, and both helpers in isolation.
Resolves PR Azure#8198 review comment from @jongio.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
TestErrorCodeWireValues pins the lowercase JSON wire values of every exported enum the nextstep package consumes from the Agents API, but the AgentVersionStatus map was missing the "idle" entry. The idle status is actively read at show.go:207 and nextstep/resolver.go:248, so silent drift on that one literal would regress the show command's idle branch and the resolver's deployment-pending hint without any test failure. Add the missing "idle": string(AgentVersionIdle) case. Resolves PR Azure#8198 Copilot review comments (ids 3246075889, 3246075800). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Restores the .tmp/ entry in .gitignore (accidentally trimmed during a mid-rebase comment-cleanup commit) and removes the design-spec mirror and PR comment scratch files that were swept into the index by an overbroad `git add -A` during conflict resolution. The .tmp/ folder remains on disk locally as documented scratch space, but must not be tracked by git. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
6312729 to
da2efb5
Compare
wbreza
left a comment
There was a problem hiding this comment.
Delta review of the 5 commits pushed today (a017e251..da2efb5d2). Prior reviewer threads from Copilot and @jongio look resolved; this pass only checks the new fix-ups for regression and completeness.
✅ Verified
4bbe4d0ecentralize nextstep stdout gating —nextstep_output.gocorrectly delegateswriterIsTerminal()tow == os.Stdout && isTerminal(), preserving the TTY-gate semantics introduced inb9c1349a. All call sites ininvoke.go,run.go,show.go,init.go,init_from_code.goroute through the new helper; no stale stdout/TTY checks remain in the nextstep path. No conflict with the existingisTerminal()inhelpers.go.0d35d47bredact doctor auth UPN by default —redactUPN()andauthDetails()redact bothMessageandDetails; the "upn" key is omitted entirely fromDetailswhen not--unredacted. The newTestCheckAuth_RedactsUPNByDefault,TestCheckAuth_UnredactedKeepsUPN, andTestCheckAuth_RedactionWithoutUPNDropsPrefixcover PASS/WARN/FAIL × redacted/unredacted matrices.a017e251remove unused nextstep auth state —AuthState/AuthUnknown|Authed|Unauthed,State.IsAuthenticated, andWithAuthProbe()removed cleanly; no remaining references in the package; matched test deletions instate_test.go(no orphan assertions).1b0b6a73pinAgentVersionIdlewire value —"idle": string(AgentVersionIdle)is now inTestErrorCodeWireValues; the consumers atshow.go(shouldResolveShowNextStep) andnextstep/resolver.goboth still key offAgentVersionIdle, so silent literal drift will fail loudly.da2efb5drestore.tmp/gitignore —.gitignoreentry restored with comment; all scratch files removed from the tracked tree; no leaked artifacts.
Nit (non-blocking)
cli/azd/extensions/azure.ai.agents/internal/cmd/nextstep_output_test.go— the new tests verify the suppression path (writerIsTerminal == false), but no test asserts the happy path where the helper actually invokes the underlying print when stdout is a real terminal. The gate logic is trivial (w == os.Stdout && isTerminal()) and the end-to-end behavior is covered by the per-command tests, so this is genuinely optional. If you want full unit coverage of the helper in isolation, the typical pattern is to makeisTerminala package-levelvarso a test can swap it out.
Nothing else from my side — looks ready to merge once @jongio's resolutions are reflected in his review state.
therealjohn
left a comment
There was a problem hiding this comment.
For the doctor command:
Capitalize the beginning of a sentence. for example:
✓ PASS Foundry project endpoint reachable
endpoint reachable (HTTP 200)
Should become:
✓ PASS Foundry project endpoint reachable
Endpoint reachable (HTTP 200)
Improve the output formatting and break down into categories. Include support for a --debug flag for more verbose output, but the default is more concise. Include clear next steps at the end.
Current:
azd ai agent doctor
✓ PASS azd extension reachable
azd extension reachable (version 0.1.32-preview).
✓ PASS azure.yaml present and valid
azure.yaml parsed (project: ai-foundry-starter-basic).
✓ PASS azd environment selected
environment selected: cpu-test-dev.
✓ PASS agent service in azure.yaml
1 agent service(s) in azure.yaml: hello-world-python-responses
✓ PASS AZURE_AI_PROJECT_ENDPOINT set
AZURE_AI_PROJECT_ENDPOINT = https://ai-account-asdf.services.ai.azure.com/api/projects/ai-project-adc-test-1
✓ PASS agent.yaml valid (per service)
agent.yaml valid for 1 service(s)
✓ PASS manual env vars set
no manual env vars are missing
- SKIP Manifest toolboxes have endpoint env vars set
skipped: no toolbox resources declared in any service's agent.manifest.yaml.
✓ PASS authentication
<redacted> · token valid for 58 minutes
✓ PASS Foundry project endpoint reachable
endpoint reachable (HTTP 200)
✓ PASS Developer has required role on Foundry project
the current principal has the required role on project 'ai-account-asdf/ai-project-adc-test-1'.
✗ FAIL Hosted agents are active
1 of 1 agents have not been deployed:
hello-world-python-responses: service "hello-world-python-responses" has not been deployed yet (AGENT_HELLO_WORLD_PYTHON_RESPONSES_NAME is unset).
fix: Run `azd deploy` to deploy the missing agents.
- SKIP Agent identity role assignments
skipped: agent status check did not pass (see check `remote.agent-status`).
✗ FAIL Manifest model deployments exist in Foundry
1 model deployment(s) referenced by agent.manifest.yaml are missing on account ai-account-g4uvbu4lxidzu: AZURE_AI_MODEL_DEPLOYMENT_NAME (service hello-world-python-responses)
fix: Run `azd provision` to create the missing deployment(s), or update the agent.manifest.yaml `resources[].name` entries to match deployments that already exist in Foundry.
- SKIP Manifest connections exist on Foundry project
skipped: no connection resources declared in any service's agent.manifest.yaml.
Summary: 10 passed, 2 failed, 3 skipped, 0 warned
Expected (default mode, concise):
Local
(✓) azd extension reachable (version 0.1.32-preview)
(✓) azure.yaml present and valid
(✓) azd environment selected: cpu-test-dev
(✓) agent service in azure.yaml (1 service)
(✓) agent.yaml valid (1 service)
(✓) manual env vars set
(-) toolbox endpoint env vars -- skipped (no toolbox resources declared)
Authentication
(✓) credentials valid (token expires in 58 minutes)
Remote
(✓) AZURE_AI_PROJECT_ENDPOINT reachable
(✓) developer has required Foundry role
(x) hosted agents are active
1 agent not deployed: hello-world-python-responses
(AGENT_HELLO_WORLD_PYTHON_RESPONSES_NAME is unset)
(-) agent identity role assignments -- skipped (depends on agent status)
(x) manifest model deployments exist in Foundry
1 missing: AZURE_AI_MODEL_DEPLOYMENT_NAME (hello-world-python-responses)
(-) manifest connections -- skipped (no connections declared)
10 passed, 2 failed, 3 skipped
To fix, run these commands in order:
1. azd provision -- create the missing model deployment(s)
2. azd deploy -- deploy the agent(s)
with --debug (verbose):
Local
(✓) azd extension reachable
azd extension reachable (version 0.1.32-preview).
(✓) azure.yaml present and valid
azure.yaml parsed (project: ai-foundry-starter-basic).
(✓) azd environment selected
environment selected: cpu-test-dev.
(✓) agent service in azure.yaml
1 agent service(s) in azure.yaml: hello-world-python-responses
(✓) AZURE_AI_PROJECT_ENDPOINT set
AZURE_AI_PROJECT_ENDPOINT = https://ai-account-g4uvbu4lxidzu.services...
(✓) agent.yaml valid (per service)
agent.yaml valid for 1 service(s)
(✓) manual env vars set
no manual env vars are missing
(-) Manifest toolboxes have endpoint env vars set
skipped: no toolbox resources declared in any service's agent.manifest.yaml.
Authentication
(✓) authentication
<redacted> -- token valid for 58 minutes
Remote
(✓) Foundry project endpoint reachable
endpoint reachable (HTTP 200)
(✓) Developer has required role on Foundry project
the current principal has the required role on project
'ai-account-g4uvbu4lxidzu/ai-project-adc-test-1'.
(x) Hosted agents are active
1 of 1 agents have not been deployed:
hello-world-python-responses: service "hello-world-python-responses"
has not been deployed yet (AGENT_HELLO_WORLD_PYTHON_RESPONSES_NAME is
unset).
(-) Agent identity role assignments
skipped: agent status check did not pass (see check
`remote.agent-status`).
(x) Manifest model deployments exist in Foundry
1 model deployment(s) referenced by agent.manifest.yaml are missing on
account ai-account-g4uvbu4lxidzu: AZURE_AI_MODEL_DEPLOYMENT_NAME
(service hello-world-python-responses)
(-) Manifest connections exist on Foundry project
skipped: no connection resources declared in any service's
agent.manifest.yaml.
10 passed, 2 failed, 3 skipped
To fix, run these commands in order:
1. azd provision -- create the missing model deployment(s)
2. azd deploy -- deploy the agent(s)
Then re-run `azd ai agent doctor` to verify.
When everything passes, provide the next step to invoke it:
Local
(✓) azd extension reachable (version 0.1.32-preview)
(✓) azure.yaml present and valid
...
Authentication
(✓) credentials valid (token expires in 58 minutes)
Remote
(✓) all agents active
(✓) model deployments exist
...
All 13 checks passed.
Next: azd ai agent invoke <payload> -- send a message to hello-world-python-responses
@trangevi what if we move this into a separate extension so it could be used for any of them, not just agent? azure.ai.doctor and use azd ai doctor?
|
I agree with John, let's pull |
…e output Responds to therealjohn's UX review on PR Azure#8198 by rewriting the doctor text renderer around five contracts: 1. Default = concise. PASS shows just the check name; FAIL shows a one-line Message + one-line `fix:` Suggestion; SKIP inlines the skip reason after `-- skipped`. Use `--debug` to surface the verbose path (full Message + Suggestion + Links). 2. Section grouping. Checks render under Local, Authentication, and Remote headers, derived from the check ID prefix. `remote.auth` gets its own Authentication section per the literal mock. 3. New glyph format: `(✓)`, `(x)`, `(-)`, `(!)`, `(ⓘ)`. ASCII `x` for FAIL matches the mock literally. 4. "To fix" footer on failure. When at least one failure maps to a canonical remediation (`remediationForCheckID`), the footer is a numbered, deduplicated command list in execution order (auth → init → provision → deploy). When all failures are unmapped (or any are unmapped alongside mapped ones), the footer defers to the per-check `fix:` notes rendered in the body. The re-run instruction always closes the block so the user is never left without an actionable next step. 5. First-letter capitalization at render time, with a brand-name blocklist so `azd`, `azure.yaml`, `agent.yaml`, `agent.manifest. yaml`, and `skipped:` leads stay lowercase. Source check strings are untouched. Summary line simplified from `Summary: 1 passed, 1 failed, 1 skipped, 0 warned` to `1 passed, 1 failed, 1 skipped`. Warn/info segments are appended only when non-zero. The streaming render path (`runAndRenderDoctorText` → `renderer.write Check` per result) and the buffered path (`printDoctorReportText`, used by tests) share a single `doctorRenderState` so they produce byte-identical output. The parity test exercises both concise and verbose modes with a fixture covering Message detail, multi-line Suggestion (so `writeIndentedBlock` runs), and Links. The trailing `Next:` block (via `nextstep.PrintAllNext`) is suppressed when the To-fix footer fires, because the failure block is the actionable next step. The `--debug` flag is the existing persistent root flag provided by the azdext SDK; we read it via `isDebug(cmd.Flags())` and thread it through `runAndRenderDoctorText` and `newDoctorRenderer`. No new flag is registered. The JSON output path (`--output json`) does not traverse this renderer and is unchanged. Test additions pin every new contract: concise defaults (including zero-suppression of warn/info), verbose `--debug`, section transitions, streaming/buffered parity in both modes, trailing `Next:`, To-fix footer with mapped failures, To-fix footer with all-unmapped failures (deferred to per-check `fix:` notes), To-fix footer with mixed mapped and unmapped failures (numbered list plus per-check pointer), summary line with non-zero warn/info, empty report, status glyphs, category routing, capitalize edges, and the `firstLine` helper. Three-model code review consensus reached (Opus 4.7 xhigh, Sonnet 4.6, GPT-5.5). All flagged issues addressed. Out of scope (deferred to a follow-up after user input): the trangevi + therealjohn proposal to move doctor into a separate `azure.ai.doctor` extension. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Doctor text-mode UX redesign landed in 804913a in response to @therealjohn's review. Summary of what changed: 1. Default = concise. PASS shows just the check name; FAIL shows a one-line Message + 2. Section grouping. Checks now render under 3. New glyph format. 4. "To fix" footer on failure. When at least one failure maps to a canonical remediation, the footer is a numbered, deduplicated command list in execution order (auth init provision deploy). When all failures are unmapped (or any are unmapped alongside mapped ones), the footer defers to the per-check 5. First-letter capitalization at render time with a brand-name blocklist so Summary line simplified from The trailing JSON output ( |
|
Thanks for the delta review @wbreza, much appreciated. On the optional |
Summary
Next:guidance for the Azure AI Agents extension across init, run, invoke, show, deploy-hook, and doctor flowsazd ai agent doctorwith local and remote checks for project setup, environment variables, authentication, Foundry reachability/RBAC, hosted agent status, identity roles, model deployments, connections, and toolboxesshownext steps, and stream text-mode doctor checks as they complete