diff --git a/CHANGELOG.md b/CHANGELOG.md index 116c145a..c321b5bb 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -23,6 +23,7 @@ Agents provisioned before this release need `Agent365.Observability.OtelWrite` g **Option B — CLI** (`a365 setup admin`) has been removed in this release. Use Option A above, or copy the PowerShell instructions printed in the `a365 setup all` summary output. ### Added +- `a365 develop-mcp evaluate` command for evaluating MCP server tool schema quality — runs deterministic and semantic checks (via GitHub Copilot or Claude Code CLIs), computes maturity scoring, and generates an interactive HTML report - `--skip-sp-provisioning` option on `setup all` — skips the interactive in-line provisioning of missing resource service principals (issue #429). Default: setup prompts per-resource and runs `az ad sp create --id ` using the operator's `az login`. With this flag, missing SPs are excluded from the consent URL and listed in the Action Required block with the `az` command and a per-SP consent URL. Implicitly enabled when stdin is redirected (CI / pipe scenarios). - Action Required block in the `setup all` summary now lists missing service principals — each entry shows the resource, pending scopes, the `az ad sp create` command, and the per-SP consent URL needed to complete provisioning. - `logs export [command] [--output ]` — exports a redacted copy of a CLI diagnostic log safe to share with Microsoft support. Redacts JWT tokens, email addresses, OS-path usernames, and tenant-specific GUIDs; replaces identical values with consistent aliases so log correlation is preserved. Preserves diagnostic IDs that aren't sensitive but are useful for debugging — `TraceId`, `CorrelationId`, Microsoft Graph `request-id` and `client-request-id` values, and well-known public Microsoft / Agent 365 resource appIds (such as the Microsoft Graph appId `00000003-0000-0000-c000-000000000000`). Omit `[command]` to export all available logs at once. diff --git a/docs/agent365-guided-setup/a365-evaluate-faq.md b/docs/agent365-guided-setup/a365-evaluate-faq.md new file mode 100644 index 00000000..1b7bbc1f --- /dev/null +++ b/docs/agent365-guided-setup/a365-evaluate-faq.md @@ -0,0 +1,75 @@ +# MCP Evaluation — AI transparency FAQ + +This page answers the most common questions about how `a365 develop-mcp evaluate` uses AI and what it does (and does not) process. + +--- + +## Q1. What data does this command send to the AI, and where does it go? + +The command fetches your MCP server's **tool schemas** — tool names, descriptions, and parameter schemas — via a standard MCP `tools/list` call to the URL you supply. It then writes those schemas to a checklist JSON file in your output directory and hands that file to your locally installed coding agent CLI (GitHub Copilot CLI or Claude Code CLI). + +**The coding agent CLI runs entirely on your machine**, under your own account and subscription. No tool-schema data is sent to Microsoft's servers by this feature. The data path is: + +``` +Your MCP server + → a365 CLI (your machine, checklist written to your output dir) + → locally installed coding agent CLI (your machine) - Github Copilot or Claude Code CLI + → AI provider under your account (Anthropic / GitHub Copilot) +``` + +Microsoft's role is supplying the `a365` CLI that orchestrates steps 1–2. Steps 3–4 are governed by your own AI subscription's terms. + +--- + +## Q2. What does this command NOT process or collect? + +| Item | Processed? | Notes | +|---|---|---| +| Runtime data from your MCP server (live payloads, user requests) | **No** | Only the static schema from `tools/list` | +| Personal data of your MCP server's end users | **No** | Schemas describe tool APIs, not user data | +| Auth token supplied via `--auth-token` or `A365_MCP_AUTH_TOKEN` | **No** (AI never sees it) | Token is held in memory, sent only as the HTTP `Authorization` header to your server, and never logged or passed to the coding agent. Prefer the `A365_MCP_AUTH_TOKEN` environment variable over the flag — it keeps the token out of process listings (`ps` / Task Manager) and shell history; the CLI warns once if you use the flag. | +| Telemetry or usage data | **No** | This feature adds no telemetry to the base CLI | +| Output report or checklist JSON | **No** (not sent anywhere) | Files are written to the `--output-dir` you specify and stay on your machine | + +--- + +## Q3. Which AI model performs the evaluation, and who controls it? + +By default the evaluation uses **Claude Haiku 4.5**, invoked through whichever coding agent CLI you have installed (GitHub Copilot CLI or Claude Code CLI). You can override the engine with `--eval-engine github-copilot` or `--eval-engine claude-code`. + +To run a **different model** without waiting for a CLI update, set an environment variable before the run: `A365_EVAL_COPILOT_MODEL` for GitHub Copilot (needs an exact model ID, e.g. `claude-haiku-4.5`) or `A365_EVAL_CLAUDE_MODEL` for Claude Code (accepts an alias, e.g. `haiku`). + +The CLI that runs the model is **yours**, installed from npm by you, authenticated with your credentials. Microsoft does not host or resell the model API call — it is made directly by your CLI to the AI provider under your own terms of service and billing. The `a365` CLI specifies the model flag but does not mediate the API call. + +The model used in every run is recorded in the HTML report header so you always have a record of what AI produced the scores. + +--- + +## Q4. How can I tell which results are AI-generated vs. rule-based? + +Every check in the output is tagged with a `type` field: + +| Tag | What it means | +|---|---| +| `"Deterministic"` | Rule-based logic in the CLI itself — no AI involved. Pass/fail is exact (e.g., "tool name must not be empty"). | +| `"Semantic"` | Scored by the coding agent (AI). Includes a `reason` string written by the model explaining the judgment. | + +In the HTML report, Semantic checks show the AI's `reason` text under each result. In the JSON report, the same `type` field and `reason` field appear on every check object so you can filter programmatically. + +The deterministic checks run first and will pass or fail regardless of whether an AI agent is available. Semantic checks require the coding agent. + +--- + +## Q5. What files remain on disk after the evaluation completes? + +Three output files are written to the directory you specify with `--output-dir`: + +| File | Contents | Kept after run? | +|---|---|---| +| `_checklist.json` | Full checklist with scores and AI reasons | Yes — also used by re-runs to skip re-discovery | +| `_eval_report.html` | Human-readable HTML report | Yes | +| `_eval_report.json` | Machine-readable JSON report | Yes | + +Temporary files created during the run (the prompt file passed to the coding agent CLI) are deleted automatically when the run finishes or is cancelled. The auth token supplied via `--auth-token` is never written to disk. + +If you want to remove all evaluation output, delete the three files above from your output directory. No other local state is created by this command. diff --git a/docs/agent365-guided-setup/a365-evaluate-instructions.md b/docs/agent365-guided-setup/a365-evaluate-instructions.md new file mode 100644 index 00000000..6388be9f --- /dev/null +++ b/docs/agent365-guided-setup/a365-evaluate-instructions.md @@ -0,0 +1,283 @@ +# Agent 365 MCP Server Evaluation Pipeline — Setup Instructions for AI Agents + +> **SCOPE — THIS FILE ONLY:** This file covers setting up the workstation and running the Agent 365 evaluation pipeline against an MCP server (`a365 develop-mcp evaluate`). It does **NOT** cover provisioning an agent (`a365 setup`), deploying code (`a365 deploy`), or adding observability. If the user asked to set up their agent for Agent 365, close this file and open `a365-setup-instructions.md` instead. + +--- + +> **YOUR FIRST AND ONLY ACTION RIGHT NOW:** Ask the user the three path-determination questions below. Do NOT create todos, run commands, or read further until the user has answered all three questions. After all answers are received, create all todos for the determined path and mark Todo 1 in-progress. + +**RULE 1 — ASK THREE QUESTIONS FIRST, THEN CREATE ALL TODOS.** + +Before creating any todos or running any commands, ask the user these three questions (one at a time, wait for each response): + +**Question 1: What is the URL of the MCP server you want to evaluate?** + +Wait for the answer. Store as `serverUrl`. Must be the **Streamable HTTP endpoint URL** of the MCP server (typically ends in `/mcp` or a similar path). Examples: +- `https://my-mcp-server.contoso.com/mcp` +- `http://localhost:5000/mcp` (for a locally-running server) + +If the user replies with just a hostname or a non-HTTP URL, ask them to provide the full HTTPS or HTTP URL including the endpoint path. + +**Question 2: Does the MCP server require a bearer token for authentication?** + +- Yes +- No + +Wait for the answer. Store as `needsAuth`: +- If **Yes**: ask the user to provide the bearer token (or to paste the output of `a365 develop get-token …` if the server is in the Agent 365 tenant). Store as `authToken`. Do NOT echo the token back to the user or log it. +- If **No**: set `authToken = null`. + +**Question 3: How should the semantic checks be scored?** + +1. Auto — try GitHub Copilot first, fall back to Claude Code +2. GitHub Copilot only +3. Claude Code only +4. None — generate the checklist and let me (or my own LLM) score it manually (bring-your-own-LLM) + +Wait for the answer. Store as `evalEngine`: +- If **1 (Auto)**: `evalEngine = "auto"` +- If **2 (GitHub Copilot)**: `evalEngine = "github-copilot"` +- If **3 (Claude Code)**: `evalEngine = "claude-code"` +- If **4 (None)**: `evalEngine = "none"` + +> **Note:** Auto and the named engines require the corresponding CLI to be installed on the workstation (`copilot` or `claude`). If neither is installed and `evalEngine` is `auto` or a named engine, the pipeline will stop after writing the checklist and print BYO-LLM instructions — that is expected, not an error. + +After all three questions are answered, create all todos for the path and mark Todo 1 in-progress: + +**Evaluation pipeline path** (5 todos total): +- Todo 1: `Step 1: Verify and Install/Update the Agent 365 CLI` +- Todo 2: `Step 2: (Optional) Verify a local scoring agent is available` +- Todo 3: `Step 3: Validate the MCP server URL is reachable` +- Todo 4: `Step 4: Run a365 develop-mcp evaluate` +- Todo 5: `Step 5: Review the HTML report and action items` + +**RULE 2 — ALWAYS BEGIN FROM STEP 1.** Even if the CLI appears installed, you MUST run the validation commands in Step 1. + +**RULE 3 — SUB-SECTIONS ARE NOT SEPARATE TODOS.** Each `## Step` has internal sub-sections — these are tasks WITHIN that step, NOT separate todos. + +**RULE 4 — ONE STEP AT A TIME.** Complete each step fully. Mark its todo in-progress when starting, complete when done. + +**RULE 5 — TELEMETRY DISCLOSURE.** When the user runs Step 4, the CLI fires a single fire-and-forget telemetry marker (`POST /agents/externalMcpServers/logEvaluate`) at the start of the evaluation. This call carries **no customer content** — no server URL, no engine selection, no output paths. Identity is taken from the bearer token. The marker is non-blocking; the evaluation runs even if telemetry fails. If the user explicitly asks about telemetry, tell them this — do not surface it unprompted. + +**RULE 6 — SILENT EXECUTION.** Work silently. Do NOT narrate what you are about to do, announce step transitions, print todo state, emoji checklists, or step completion summaries. Only speak to the user when you need input, have an error to report, or need confirmation before a destructive action. + +**RULE 7 — DO NOT SHIP THE TOKEN.** If the user provided an auth token in Question 2, pass it only via the `--auth-token` flag or the `A365_MCP_AUTH_TOKEN` environment variable at the moment you invoke the CLI (the env var is preferred — it keeps the token out of process listings and shell history). Do NOT write it to a file, paste it back into chat, or commit it to any config. + +--- + +## Context + +You are an AI coding agent with access to execute shell commands. Your task is to set up and run the Agent 365 MCP server evaluation pipeline. The pipeline discovers tools from a target MCP server, generates a checklist of deterministic + semantic checks, runs the semantic checks through a locally-installed coding agent (or hands them off for manual scoring), and produces an HTML report scoring the server's tool schema quality and an action-item list for improvement. + +--- + +## Step 1: Verify and Install/Update the Agent 365 CLI + +> **DO NOT SKIP THIS STEP.** Mark this todo in-progress now. + +The Agent 365 CLI is a .NET global tool. Verify install + version: + +```bash +a365 --version +``` + +- If the CLI is not installed: run `dotnet tool install --global Microsoft.Agents.A365.DevTools.Cli --prerelease` +- If installed but outdated: run `dotnet tool update --global Microsoft.Agents.A365.DevTools.Cli --prerelease` + +### Verify .NET is available + +```bash +dotnet --version +``` + +Confirm .NET 8.0 or later. If not installed, instruct the user to install .NET 8.0 from https://dotnet.microsoft.com/download. + +### Step 1 completion + +> **BEFORE MOVING ON:** Mark Todo 1 as **completed**. Mark Todo 2 as **in-progress**. Proceed to Step 2. + +--- + +## Step 2: (Optional) Verify a local scoring agent is available + +> **Skip this step entirely if `evalEngine = "none"`.** In that case, mark Todo 2 completed and jump to Step 3. + +The evaluation pipeline scores semantic checks (tool-name clarity, description quality, parameter naming, return-shape sanity) using a locally-installed coding agent CLI. Verify the engine the user selected is present. + +### If `evalEngine = "github-copilot"` or `"auto"` + +Verify the GitHub Copilot CLI is installed: + +```bash +copilot --version +``` + +- If installed: continue. +- If not installed: install with `npm install -g @github/copilot` (requires Node.js 18 or later). This is the standalone GitHub Copilot CLI (binary `copilot`) that the pipeline invokes — it is a different tool from the `gh copilot` GitHub CLI extension, which does not accept the flags the pipeline uses. + +### If `evalEngine = "claude-code"` or `"auto"` (and Copilot was not found) + +Verify Claude Code CLI is installed: + +```bash +claude --version +``` + +- If installed: continue. +- If not installed: install via `npm install -g @anthropic-ai/claude-code` or follow [Claude Code install](https://docs.claude.com/claude-code). + +### If `evalEngine = "auto"` and neither agent is installed + +Tell the user one of the following is required, and that without one of them the pipeline will stop after writing the checklist (BYO-LLM mode). Offer them: +- Install GitHub Copilot CLI (see above), or +- Install Claude Code (see above), or +- Switch to `evalEngine = "none"` and score the checklist with their own LLM after the run. + +### Step 2 completion + +> Mark Todo 2 as **completed**. Mark Todo 3 as **in-progress**. Proceed to Step 3. + +--- + +## Step 3: Validate the MCP server URL is reachable + +> **DO NOT SKIP THIS STEP.** Catching a wrong URL here saves an aborted run later. + +Verify the server responds to an initial request: + +```bash +curl -i -X POST -H "Content-Type: application/json" -H "Accept: application/json, text/event-stream" \ + ${authToken:+-H "Authorization: Bearer $authToken"} \ + --data '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"a365-precheck","version":"1.0"}}}' \ + "$serverUrl" +``` + +- **200 OK** with a JSON body containing `result` → MCP server is reachable. Continue. +- **401 Unauthorized** → the server needs a bearer token. Go back to Question 2 and obtain one. Re-run this check before proceeding. +- **404 Not Found** → the URL path is wrong. Ask the user to confirm the URL includes the full endpoint path (typically ends in `/mcp`). +- **Connection refused / DNS failure** → the server is not running or the host is unreachable. Tell the user, ask them to start the server or correct the URL, then re-validate. + +> **If the user is evaluating a server behind a corporate firewall**, they may need to be on VPN or run the evaluation from a machine that can reach the server. The evaluation does NOT proxy through any Agent 365 service. + +### Step 3 completion + +> Mark Todo 3 as **completed**. Mark Todo 4 as **in-progress**. Proceed to Step 4. + +--- + +## Step 4: Run `a365 develop-mcp evaluate` + +> Pick an output directory (use the current directory unless the user specifies otherwise). The pipeline writes the checklist and HTML report there. + +### Run the evaluation + +```bash +a365 develop-mcp evaluate \ + --server-url "$serverUrl" \ + --output-dir "." \ + --eval-engine "$evalEngine" \ + ${authToken:+--auth-token "$authToken"} +``` + +### Optional: configure via environment variables + +These settings can come from environment variables instead of (or alongside) the flags: + +| Env var | Purpose | +|---|---| +| `A365_MCP_AUTH_TOKEN` | Bearer token for the MCP server, used when `--auth-token` is not passed. **Preferred over the flag** — it keeps the token out of process listings (`ps` / Task Manager) and shell history. If you pass `--auth-token`, the CLI prints a one-time warning recommending this variable. | +| `A365_EVAL_COPILOT_MODEL` | Override the GitHub Copilot model (exact model ID, e.g. `claude-haiku-4.5`). | +| `A365_EVAL_CLAUDE_MODEL` | Override the Claude Code model (alias, e.g. `haiku`). | + +The model defaults to Claude Haiku 4.5; override only to move to a newer model without waiting for a CLI release. + +### What you will see + +The CLI logs progress in numbered steps `[1/5]` through `[5/5]`: + +| Step | What it does | +|---|---| +| `[1/5] Discovering tools …` | Connects to the MCP server, runs `tools/list`, captures schemas. | +| `[2/5] Generated evaluation checklist …` | Writes `_checklist.json` to the output dir. | +| `[3/5] Running semantic evaluation` | Hands the checklist to the selected coding agent for scoring (or stops here if `evalEngine = "none"` or no agent is available). | +| `[4/5] Analysis complete: score …` | Aggregates per-tool and overall maturity scores. | +| `[5/5] Writing reports` | Produces `_eval_report.html` and `_eval_report.json`. | + +### Common in-run conditions + +- **Coding agent missing.** The CLI prints a "pick one" guidance block and stops after step `[2/5]`. The checklist is now on disk — the user can score it with their own LLM and re-run the same command to resume from the checklist. +- **Mid-run abort.** Re-running the same command picks up from the existing checklist file. Delete `_checklist.json` to force a fresh discovery. +- **WAM/auth prompt on Windows.** Same handling as in `a365-setup-instructions.md` — the prompt is for the CLI's own Microsoft Graph token, not for the MCP server's bearer. + +### Step 4 completion + +> Mark Todo 4 as **completed**. Mark Todo 5 as **in-progress**. Proceed to Step 5. + +--- + +## Step 5: Review the HTML report and action items + +> Open `_eval_report.html` from the output directory in a browser, or display the report path and prompt the user to open it. + +### Headline numbers + +The report has: + +- **Overall score** (0–100). Higher is better. +- **Maturity level** (0–4): Level 0 (Functional) → Level 4 (Exemplary). +- **Per-tool scores** with category breakdowns: schema quality, semantic clarity, parameter quality, return-shape quality. +- **Action item list**, ordered by impact. + +### Triage with the user + +1. **If overall score is below 60 or maturity is Level 0–1 (Functional/Described)**, walk through the top 3 action items with the user. These usually cluster around: tool names that don't describe the action, descriptions that are stubs, parameter names that are abbreviations. +2. **If overall score is 60–74 or maturity is Level 2 (Consistent)**, the schema is shippable but has room. Pick 1–2 high-impact action items to address. +3. **If overall score is 75 or above or maturity is Level 3–4 (Optimized for AI/Exemplary)**, summarize the strengths and surface any low-hanging fixes. + +### Re-running after fixes + +When the user updates their tool schemas, delete the checklist file and re-run Step 4: + +```bash +rm "_checklist.json" +a365 develop-mcp evaluate --server-url "$serverUrl" --output-dir "." --eval-engine "$evalEngine" ${authToken:+--auth-token "$authToken"} +``` + +This forces a fresh discovery so the new schemas are scored. + +### Step 5 completion + +> Mark Todo 5 as **completed**. Summarize the headline numbers and the top 3 action items to the user. +> +> If the user asked "what's next?", offer: +> - Re-evaluate after fixing the top items. +> - Add the evaluate command to their CI/PR pipeline so schema regressions are caught automatically. (Provide a sample command, not a full CI snippet, unless the user asks for one.) +> +> Do NOT proactively suggest adding observability or running `a365 setup` — those are separate workflows in this folder. + +--- + +## Error Handling and Troubleshooting + +If any step results in an error, stop and analyze the error message carefully. + +### Common errors + +| Error | Likely cause | Fix | +|---|---|---| +| `Unauthorized` from `tools/list` | Wrong or expired bearer token | Re-acquire the token (Question 2). | +| `Could not parse server URL` | URL doesn't include the protocol | Add `http://` or `https://` to the URL. | +| `No coding agent detected` after `[2/5]` | Neither `copilot` nor `claude` is on PATH | Install one (Step 2) or pass `--eval-engine none`. | +| `Failed to write report to ` | Output dir not writable | Choose a different `--output-dir` or fix permissions. | +| Telemetry warning at debug level | Non-blocking — the marker call failed | Ignore. The evaluation runs regardless. | + +For broader help see: + +- **[Agent 365 CLI Reference](https://learn.microsoft.com/en-us/microsoft-agent-365/developer/agent-365-cli)** — command-specific options. +- **[GitHub Issues](https://github.com/microsoft/Agent365-devTools/issues)** — search by error message. + +### Quick tips + +- The pipeline is idempotent — safe to re-run after fixing an issue. +- The checklist JSON is human-readable and editable — the user can hand-score categories the agent got wrong and re-run the report step. +- Log files: Windows `%APPDATA%\a365\logs\`, Linux/Mac `~/.config/a365/logs/`. diff --git a/docs/agent365-guided-setup/a365-setup-instructions.md b/docs/agent365-guided-setup/a365-setup-instructions.md index 08fe748f..de61eaa1 100644 --- a/docs/agent365-guided-setup/a365-setup-instructions.md +++ b/docs/agent365-guided-setup/a365-setup-instructions.md @@ -1,6 +1,10 @@ # Agent 365 CLI Setup Instructions for AI Agents -> **SCOPE — THIS FILE ONLY:** This file covers provisioning and deploying an agent using the Agent 365 CLI (`a365 setup`, `a365 publish`, `a365 deploy`). It does **NOT** cover adding observability, telemetry, or SDK integrations to the agent's code. If the user asked to add observability, close this file and open `a365-observability-instructions.md` instead. +> **SCOPE — THIS FILE ONLY:** This file covers provisioning and deploying an agent using the Agent 365 CLI (`a365 setup`, `a365 publish`, `a365 deploy`). It does **NOT** cover: +> - adding observability, telemetry, or SDK integrations to the agent's code → use `a365-observability-instructions.md` +> - evaluating an MCP server's tool schema quality (`a365 develop-mcp evaluate`) → use `a365-evaluate-instructions.md` +> +> If the user asked for either of those, close this file and open the matching one instead. --- diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Commands/DevelopMcpCommand.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Commands/DevelopMcpCommand.cs index f7a0d892..d61d4f4c 100644 --- a/src/Microsoft.Agents.A365.DevTools.Cli/Commands/DevelopMcpCommand.cs +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Commands/DevelopMcpCommand.cs @@ -4,6 +4,7 @@ using Microsoft.Agents.A365.DevTools.Cli.Helpers; using Microsoft.Agents.A365.DevTools.Cli.Models; using Microsoft.Agents.A365.DevTools.Cli.Services; +using Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; using Microsoft.Extensions.Logging; using System.CommandLine; using static Microsoft.Agents.A365.DevTools.Cli.Helpers.PackageMCPServerHelper; @@ -16,11 +17,13 @@ namespace Microsoft.Agents.A365.DevTools.Cli.Commands; public static class DevelopMcpCommand { /// - /// Creates the develop-mcp command with subcommands for MCP server management in Dataverse + /// Creates the develop-mcp command with subcommands for MCP server management in Dataverse. + /// The evaluate subcommand is included only when is provided. /// public static Command CreateCommand( ILogger logger, IAgent365ToolingService toolingService, + IEvaluationPipelineService? evaluationPipelineService = null, GraphApiService? graphApiService = null) { var developMcpCommand = new Command("develop-mcp", "Manage MCP servers in Dataverse environments"); @@ -42,9 +45,85 @@ public static Command CreateCommand( developMcpCommand.AddCommand(CreatePackageMCPServerSubCommand(logger, toolingService)); developMcpCommand.AddCommand(CreateRegisterExternalMcpServerSubcommand(logger, toolingService, graphApiService)); + if (evaluationPipelineService is not null) + { + developMcpCommand.AddCommand(CreateEvaluateSubcommand(logger, evaluationPipelineService)); + } + return developMcpCommand; } + /// + /// Creates the evaluate subcommand for MCP server tool schema quality evaluation. + /// + private static Command CreateEvaluateSubcommand(ILogger logger, IEvaluationPipelineService pipelineService) + { + var command = new Command( + "evaluate", + "Evaluate MCP server tool schema quality and generate an HTML report. " + + "Uses a locally installed coding agent (GitHub Copilot or Claude Code) to score semantic checks. " + + "If no agent is detected, the command stops after writing the checklist so you can score it manually with your own LLM, " + + "or pass --eval-engine none to skip agent probing entirely. " + + "Only run this against MCP servers you trust: the server's tool names and descriptions are sent to a locally running coding agent."); + + // Use a required option (not a positional argument) for consistency with other + // develop-mcp subcommands and Azure CLI conventions. + var serverUrlOption = new Option( + ["--server-url", "-u"], + "MCP server Streamable HTTP endpoint URL") + { + IsRequired = true, + }; + + var outputDirOption = new Option( + ["--output-dir", "-o"], + getDefaultValue: () => ".", + "Output directory for evaluation artifacts"); + + var evalEngineOption = new Option( + "--eval-engine", + getDefaultValue: () => "auto", + "Which local coding agent scores semantic checks. " + + "auto: try github-copilot then claude-code. " + + "github-copilot or claude-code: use only that engine. " + + "none: skip automatic scoring and expect the checklist to be pre-scored (bring-your-own-LLM)."); + + var authTokenOption = new Option( + "--auth-token", + "Bearer token for MCP server authentication. Prefer the A365_MCP_AUTH_TOKEN environment variable; a token passed on the command line is visible to process listings (ps / Task Manager) and shell history."); + + command.AddOption(serverUrlOption); + command.AddOption(outputDirOption); + command.AddOption(evalEngineOption); + command.AddOption(authTokenOption); + + command.SetHandler(async (System.CommandLine.Invocation.InvocationContext context) => + { + var serverUrl = context.ParseResult.GetValueForOption(serverUrlOption)!; + var outputDir = context.ParseResult.GetValueForOption(outputDirOption)!; + var evalEngine = context.ParseResult.GetValueForOption(evalEngineOption)!; + var authToken = context.ParseResult.GetValueForOption(authTokenOption); + var ct = context.GetCancellationToken(); + + // Secret handling: a token on the command line is visible to other processes + // (ps / Task Manager) and lands in shell history. A non-empty value here can + // only have come from --auth-token, so warn and steer to the env var; when the + // flag is absent, fall back to A365_MCP_AUTH_TOKEN. + if (!string.IsNullOrWhiteSpace(authToken)) + { + logger.LogWarning("Passing a token via --auth-token exposes it to process listings (ps / Task Manager) and shell history. Prefer the A365_MCP_AUTH_TOKEN environment variable."); + } + else + { + authToken = Environment.GetEnvironmentVariable("A365_MCP_AUTH_TOKEN"); + } + + context.ExitCode = await pipelineService.RunAsync(serverUrl, outputDir, evalEngine, authToken, ct); + }); + + return command; + } + /// /// Creates the list-environments subcommand /// diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Constants/ErrorCodes.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Constants/ErrorCodes.cs index 8ef6ee98..7686ad5a 100644 --- a/src/Microsoft.Agents.A365.DevTools.Cli/Constants/ErrorCodes.cs +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Constants/ErrorCodes.cs @@ -14,5 +14,7 @@ public static class ErrorCodes public const string RetryExhausted = "RETRY_EXHAUSTED"; public const string SetupValidationFailed = "SETUP_VALIDATION_FAILED"; public const string ClientAppValidationFailed = "CLIENT_APP_VALIDATION_FAILED"; + public const string EvaluationFailed = "EVALUATION_FAILED"; + public const string SchemaDiscoveryFailed = "SCHEMA_DISCOVERY_FAILED"; } } diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Exceptions/EvaluationException.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Exceptions/EvaluationException.cs new file mode 100644 index 00000000..da4cd592 --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Exceptions/EvaluationException.cs @@ -0,0 +1,33 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using Microsoft.Agents.A365.DevTools.Cli.Constants; + +namespace Microsoft.Agents.A365.DevTools.Cli.Exceptions; + +/// +/// Exception thrown when MCP server schema evaluation fails. +/// Covers schema discovery errors, checklist generation errors, +/// and report generation errors. +/// +public sealed class EvaluationException : Agent365Exception +{ + public override int ExitCode => 3; + + public EvaluationException( + string errorCode, + string issueDescription, + List? errorDetails = null, + List? mitigationSteps = null, + Dictionary? context = null, + Exception? innerException = null) + : base( + errorCode: errorCode, + issueDescription: issueDescription, + errorDetails: errorDetails, + mitigationSteps: mitigationSteps, + context: context, + innerException: innerException) + { + } +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Microsoft.Agents.A365.DevTools.Cli.csproj b/src/Microsoft.Agents.A365.DevTools.Cli/Microsoft.Agents.A365.DevTools.Cli.csproj index b38adb2b..04bcea8c 100644 --- a/src/Microsoft.Agents.A365.DevTools.Cli/Microsoft.Agents.A365.DevTools.Cli.csproj +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Microsoft.Agents.A365.DevTools.Cli.csproj @@ -71,5 +71,6 @@ + diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ActionItem.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ActionItem.cs new file mode 100644 index 00000000..c25f078a --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ActionItem.cs @@ -0,0 +1,42 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using System.Text.Json.Serialization; + +namespace Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; + +/// +/// A prioritized remediation action generated from a failed check. +/// +public class ActionItem +{ + [JsonPropertyName("tool_name")] + public string? ToolName { get; init; } + + [JsonPropertyName("param_name")] + public string? ParamName { get; init; } + + [JsonPropertyName("priority")] + public Priority Priority { get; init; } + + [JsonPropertyName("title")] + public string Title { get; init; } = string.Empty; + + [JsonPropertyName("description")] + public string Description { get; init; } = string.Empty; + + [JsonPropertyName("issue_ids")] + public List IssueIds { get; init; } = []; + + [JsonPropertyName("impact_areas")] + public List ImpactAreas { get; init; } = []; + + [JsonPropertyName("remediation")] + public string Remediation { get; init; } = string.Empty; + + [JsonPropertyName("score_impact")] + public float ScoreImpact { get; set; } + + [JsonPropertyName("issue_leads_to")] + public List IssueLeadsTo { get; init; } = []; +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ChecklistItem.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ChecklistItem.cs new file mode 100644 index 00000000..cbaac79c --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ChecklistItem.cs @@ -0,0 +1,43 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using System.Text.Json.Serialization; + +namespace Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; + +/// +/// A single check item in the evaluation checklist. +/// Score is null until evaluated (deterministic checks are pre-filled, semantic checks start null). +/// +public class ChecklistItem +{ + [JsonPropertyName("id")] + public string Id { get; init; } = string.Empty; + + [JsonPropertyName("type")] + public CheckType Type { get; init; } + + [JsonPropertyName("prompt")] + public string Prompt { get; init; } = string.Empty; + + [JsonPropertyName("score")] + public bool? Score { get; set; } + + [JsonPropertyName("reason")] + public string? Reason { get; set; } + + [JsonPropertyName("severity")] + public Priority Severity { get; init; } + + [JsonPropertyName("category")] + public CheckCategory Category { get; init; } + + [JsonPropertyName("issue_ids")] + public List IssueIds { get; init; } = []; + + [JsonPropertyName("impact_areas")] + public List ImpactAreas { get; init; } = []; + + [JsonPropertyName("remediation")] + public string Remediation { get; init; } = string.Empty; +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/EvalReportData.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/EvalReportData.cs new file mode 100644 index 00000000..851b13ee --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/EvalReportData.cs @@ -0,0 +1,53 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using System.Text.Json.Serialization; + +namespace Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; + +/// +/// Final JSON blob fed to the HTML template. Contains everything the template needs +/// to render the report. All evaluation logic, descriptions, and assertions are +/// pre-computed in C# code -- the HTML template is a pure display layer. +/// +public class EvalReportData +{ + [JsonPropertyName("result")] + public SchemaEvalResult Result { get; init; } = new(); + + [JsonPropertyName("impact_map")] + public Dictionary ImpactMap { get; init; } = []; + + [JsonPropertyName("maturity_ladder")] + public List MaturityLadder { get; init; } = []; +} + +public class IssueImpactInfo +{ + [JsonPropertyName("name")] + public string Name { get; init; } = string.Empty; + + [JsonPropertyName("category")] + public string Category { get; init; } = string.Empty; + + [JsonPropertyName("impact")] + public string Impact { get; init; } = string.Empty; + + [JsonPropertyName("areas")] + public List Areas { get; init; } = []; +} + +public class MaturityLadderEntry +{ + [JsonPropertyName("level")] + public int Level { get; init; } + + [JsonPropertyName("label")] + public string Label { get; init; } = string.Empty; + + [JsonPropertyName("description")] + public string Description { get; init; } = string.Empty; + + [JsonPropertyName("is_current")] + public bool IsCurrent { get; init; } +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/EvaluateEnums.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/EvaluateEnums.cs new file mode 100644 index 00000000..deeffc40 --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/EvaluateEnums.cs @@ -0,0 +1,60 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using System.Text.Json.Serialization; + +namespace Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; + +[JsonConverter(typeof(JsonStringEnumConverter))] +public enum CheckCategory +{ + ToolName, + ToolDescription, + ParamName, + ParamDescription, + SchemaStructure, + ToolsetDesign +} + +[JsonConverter(typeof(JsonStringEnumConverter))] +public enum Priority +{ + P0, + P1, + P2, + P3 +} + +[JsonConverter(typeof(JsonStringEnumConverter))] +public enum ImpactArea +{ + ToolSelection, + ParamAccuracy, + Completeness, + Conciseness +} + +[JsonConverter(typeof(JsonStringEnumConverter))] +public enum IssueCategory +{ + Accuracy, + Functionality, + Completeness, + Conciseness +} + +[JsonConverter(typeof(JsonStringEnumConverter))] +public enum CheckType +{ + Deterministic, + Semantic +} + +[JsonConverter(typeof(JsonStringEnumConverter))] +public enum EvalEngine +{ + Auto, + GitHubCopilot, + ClaudeCode, + None +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/EvaluationChecklist.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/EvaluationChecklist.cs new file mode 100644 index 00000000..f5bdcf65 --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/EvaluationChecklist.cs @@ -0,0 +1,40 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using System.Text.Json.Serialization; + +namespace Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; + +/// +/// Root of the evaluation checklist JSON. Intermediate artifact that is auditable +/// and can be evaluated by a coding agent or manually. +/// +public class EvaluationChecklist +{ + [JsonPropertyName("metadata")] + public ChecklistMetadata Metadata { get; init; } = new(); + + [JsonPropertyName("tools")] + public List Tools { get; init; } = []; + + [JsonPropertyName("server_checks")] + public List ServerChecks { get; init; } = []; +} + +public class ChecklistMetadata +{ + [JsonPropertyName("server_name")] + public string ServerName { get; init; } = string.Empty; + + [JsonPropertyName("server_url")] + public string ServerUrl { get; init; } = string.Empty; + + [JsonPropertyName("tool_count")] + public int ToolCount { get; init; } + + [JsonPropertyName("generated_at")] + public DateTime GeneratedAt { get; init; } = DateTime.UtcNow; + + [JsonPropertyName("generator_version")] + public string GeneratorVersion { get; init; } = string.Empty; +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/IssueDefinition.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/IssueDefinition.cs new file mode 100644 index 00000000..e491ebbb --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/IssueDefinition.cs @@ -0,0 +1,18 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +namespace Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; + +/// +/// Definition of a schema-quality issue that a checklist check can surface, +/// used to link failed checks back to a human-readable name and impact. +/// +public class IssueDefinition +{ + public int Id { get; init; } + public string Name { get; init; } = string.Empty; + public IssueCategory Category { get; init; } + public string Description { get; init; } = string.Empty; + public string Impact { get; init; } = string.Empty; + public List ImpactAreas { get; init; } = []; +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/MaturityLevel.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/MaturityLevel.cs new file mode 100644 index 00000000..cfe0c019 --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/MaturityLevel.cs @@ -0,0 +1,24 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using System.Text.Json.Serialization; + +namespace Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; + +/// +/// Maturity level (0-4) determined from overall score with category caps. +/// +public class MaturityLevel +{ + [JsonPropertyName("level")] + public int Level { get; init; } + + [JsonPropertyName("label")] + public string Label { get; init; } = string.Empty; + + [JsonPropertyName("description")] + public string Description { get; init; } = string.Empty; + + [JsonPropertyName("next_level_requirements")] + public List NextLevelRequirements { get; init; } = []; +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/SchemaEvalResult.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/SchemaEvalResult.cs new file mode 100644 index 00000000..1466c2cd --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/SchemaEvalResult.cs @@ -0,0 +1,51 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using System.Text.Json.Serialization; + +namespace Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; + +/// +/// Top-level evaluation result container, used to generate eval_report.json. +/// +public class SchemaEvalResult +{ + [JsonPropertyName("server_name")] + public string ServerName { get; init; } = string.Empty; + + [JsonPropertyName("server_url")] + public string ServerUrl { get; init; } = string.Empty; + + [JsonPropertyName("evaluated_at")] + public DateTime EvaluatedAt { get; init; } = DateTime.UtcNow; + + [JsonPropertyName("overall_score")] + public float OverallScore { get; init; } + + [JsonPropertyName("maturity")] + public MaturityLevel Maturity { get; init; } = new(); + + [JsonPropertyName("tool_count")] + public int ToolCount { get; init; } + + [JsonPropertyName("tool_results")] + public List ToolResults { get; init; } = []; + + [JsonPropertyName("toolset_result")] + public ToolsetEvalResult ToolsetResult { get; init; } = new(); + + [JsonPropertyName("all_action_items")] + public List AllActionItems { get; init; } = []; + + [JsonPropertyName("category_averages")] + public Dictionary CategoryAverages { get; init; } = []; + + [JsonPropertyName("action_items_by_priority")] + public Dictionary ActionItemsByPriority { get; init; } = []; + + [JsonPropertyName("issue_summary")] + public Dictionary IssueSummary { get; init; } = []; + + [JsonPropertyName("eval_engine")] + public string EvalEngine { get; init; } = string.Empty; +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ToolChecklist.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ToolChecklist.cs new file mode 100644 index 00000000..afdfb5f3 --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ToolChecklist.cs @@ -0,0 +1,55 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using System.Text.Json; +using System.Text.Json.Serialization; + +namespace Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; + +/// +/// Checklist for a single tool, organized by check category. +/// +public class ToolChecklist +{ + [JsonPropertyName("name")] + public string Name { get; init; } = string.Empty; + + [JsonPropertyName("description")] + public string Description { get; init; } = string.Empty; + + [JsonPropertyName("input_schema")] + public JsonElement? InputSchema { get; init; } + + [JsonPropertyName("checks")] + public ToolCheckGroups Checks { get; init; } = new(); +} + +/// +/// Groups of checks organized by category for a single tool. +/// +public class ToolCheckGroups +{ + [JsonPropertyName("tool_name")] + public List ToolName { get; init; } = []; + + [JsonPropertyName("tool_description")] + public List ToolDescription { get; init; } = []; + + [JsonPropertyName("schema_structure")] + public List SchemaStructure { get; init; } = []; + + [JsonPropertyName("parameters")] + public Dictionary Parameters { get; init; } = []; +} + +/// +/// Groups of checks for a single parameter. +/// +public class ParamCheckGroups +{ + [JsonPropertyName("param_name")] + public List ParamName { get; init; } = []; + + [JsonPropertyName("param_description")] + public List ParamDescription { get; init; } = []; +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ToolEvalResult.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ToolEvalResult.cs new file mode 100644 index 00000000..a436c625 --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ToolEvalResult.cs @@ -0,0 +1,40 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using System.Text.Json; +using System.Text.Json.Serialization; + +namespace Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; + +/// +/// Evaluation result for a single tool. +/// +public class ToolEvalResult +{ + [JsonPropertyName("tool_name")] + public string ToolName { get; init; } = string.Empty; + + [JsonPropertyName("tool_description")] + public string ToolDescription { get; init; } = string.Empty; + + [JsonPropertyName("param_count")] + public int ParamCount { get; init; } + + [JsonPropertyName("score")] + public float Score { get; init; } + + [JsonPropertyName("category_scores")] + public Dictionary CategoryScores { get; init; } = []; + + [JsonPropertyName("checks")] + public List Checks { get; init; } = []; + + [JsonPropertyName("action_items")] + public List ActionItems { get; init; } = []; + + [JsonPropertyName("issues_detected")] + public List IssuesDetected { get; init; } = []; + + [JsonPropertyName("input_schema")] + public JsonElement? InputSchema { get; init; } +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ToolSchema.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ToolSchema.cs new file mode 100644 index 00000000..71f0f34a --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ToolSchema.cs @@ -0,0 +1,22 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using System.Text.Json; +using System.Text.Json.Serialization; + +namespace Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; + +/// +/// Represents an MCP tool schema discovered from a server or file. +/// +public class ToolSchema +{ + [JsonPropertyName("name")] + public string Name { get; init; } = string.Empty; + + [JsonPropertyName("description")] + public string Description { get; init; } = string.Empty; + + [JsonPropertyName("inputSchema")] + public JsonElement? InputSchema { get; init; } +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ToolsetEvalResult.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ToolsetEvalResult.cs new file mode 100644 index 00000000..b70d917f --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ToolsetEvalResult.cs @@ -0,0 +1,21 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using System.Text.Json.Serialization; + +namespace Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; + +/// +/// Evaluation result for toolset-level (cross-tool) checks. +/// +public class ToolsetEvalResult +{ + [JsonPropertyName("score")] + public float Score { get; init; } + + [JsonPropertyName("checks")] + public List Checks { get; init; } = []; + + [JsonPropertyName("action_items")] + public List ActionItems { get; init; } = []; +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Program.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Program.cs index 9445cff1..1cd397d7 100644 --- a/src/Microsoft.Agents.A365.DevTools.Cli/Program.cs +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Program.cs @@ -4,6 +4,7 @@ using Microsoft.Agents.A365.DevTools.Cli.Commands; using Microsoft.Agents.A365.DevTools.Cli.Exceptions; using Microsoft.Agents.A365.DevTools.Cli.Services; +using Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; using Microsoft.Agents.A365.DevTools.Cli.Services.Helpers; using Microsoft.Extensions.DependencyInjection; using Microsoft.Extensions.Logging; @@ -169,9 +170,11 @@ await Task.WhenAll( var clientAppValidator = serviceProvider.GetRequiredService(); var bootstrapResolver = serviceProvider.GetRequiredService(); + var evaluationPipelineService = serviceProvider.GetRequiredService(); + // Add commands rootCommand.AddCommand(DevelopCommand.CreateCommand(developLogger, configService, executor, authService, graphApiService, agentBlueprintService, processService)); - rootCommand.AddCommand(DevelopMcpCommand.CreateCommand(developLogger, toolingService, graphApiService)); + rootCommand.AddCommand(DevelopMcpCommand.CreateCommand(developLogger, toolingService, evaluationPipelineService, graphApiService)); var confirmationProvider = serviceProvider.GetRequiredService(); rootCommand.AddCommand(SetupCommand.CreateCommand(setupLogger, configService, executor, backendConfigurator, azureAuthValidator, platformDetector, graphApiService, agentBlueprintService, blueprintLookupService, federatedCredentialService, clientAppValidator, confirmationProvider, armApiService, resolver: bootstrapResolver)); @@ -398,6 +401,19 @@ private static void ConfigureServices(IServiceCollection services, LogLevel mini // Register confirmation provider for user prompts services.AddSingleton(); + // Register evaluate pipeline services + services.AddSingleton(); + services.AddSingleton(); + // Coding-agent launchers — registration order defines --eval-engine auto + // priority (GitHub Copilot first, then Claude Code). Add a new engine by + // adding its ICodingAgentLauncher implementation file plus one line here. + services.AddSingleton(); + services.AddSingleton(); + services.AddSingleton(); + services.AddSingleton(); + services.AddSingleton(); + services.AddSingleton(); + // Register bootstrap config resolver — centralizes the three-mode config resolution // used by all subcommands that can run without a365.config.json. services.AddSingleton(); diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Services/Agent365ToolingService.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Agent365ToolingService.cs index 374357e9..f8f0bd1f 100644 --- a/src/Microsoft.Agents.A365.DevTools.Cli/Services/Agent365ToolingService.cs +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Agent365ToolingService.cs @@ -317,6 +317,12 @@ private string BuildLogRegisterUrl(string environment) return $"{baseUrl}/agents/externalMcpServers/logRegister"; } + private string BuildLogEvaluateUrl(string environment) + { + var baseUrl = BuildAgent365ToolsBaseUrl(environment); + return $"{baseUrl}/agents/externalMcpServers/logEvaluate"; + } + /// /// Builds URL for deleting a BYO MCP server /// @@ -918,6 +924,45 @@ public async Task LogRegisterUsageAsync( } } + /// + public async Task LogEvaluateUsageAsync(CancellationToken cancellationToken = default) + { + try + { + var endpointUrl = BuildLogEvaluateUrl(_environment); + var audience = ConfigConstants.GetAgent365ToolsResourceAppId(_environment); + var loginHint = await AzCliHelper.ResolveLoginHintAsync(); + var authToken = await _authService.GetAccessTokenAsync(audience, userId: loginHint); + if (string.IsNullOrWhiteSpace(authToken)) + { + _logger.LogDebug("Skipping telemetry: failed to acquire token"); + return; + } + + using var httpClient = Internal.HttpClientFactory.CreateAuthenticatedClient(authToken); + // Empty body: server identifies the caller from the bearer token and pulls any + // operation context from ServiceContext. CLI does not ship customer-private content. + var content = new StringContent(string.Empty, System.Text.Encoding.UTF8, "application/json"); + + _logger.LogDebug("Logging evaluate usage telemetry..."); + using var response = await httpClient.PostAsync(endpointUrl, content, cancellationToken); + _logger.LogDebug("Telemetry logged: {StatusCode}", response.StatusCode); + if (!response.IsSuccessStatusCode && + response.Headers.TryGetValues("x-ms-correlation-id", out var telemetryCorrelationValues)) + { + var telemetryCorrelationId = telemetryCorrelationValues.FirstOrDefault(); + if (!string.IsNullOrWhiteSpace(telemetryCorrelationId)) + { + _logger.LogDebug("Telemetry server correlation ID: {CorrelationId}", telemetryCorrelationId); + } + } + } + catch (Exception ex) + { + _logger.LogDebug("Telemetry logging failed (non-blocking): {Error}", ex.Message); + } + } + public async Task AddMcpServerAsync( AddMcpServerRequest request, string? environmentId = null, diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ActionItemGenerator.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ActionItemGenerator.cs new file mode 100644 index 00000000..b631a15e --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ActionItemGenerator.cs @@ -0,0 +1,116 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; + +namespace Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; + +/// +/// Generates prioritized action items from failed evaluation checks. +/// Each failed check produces an action item with calculated score impact +/// and mapped issue impact descriptions from the taxonomy. +/// +public static class ActionItemGenerator +{ + /// + /// Generates action items for a flat list of checks, computing category-level + /// score impacts. Groups checks by category to determine per-check weight. + /// + /// All checks for a tool or toolset scope. + /// Tool name, or null for toolset-level checks. + /// Action items sorted by priority (P0 first). + public static List GenerateFromAllChecks( + List checks, + string? toolName) + { + if (checks.Count == 0) + { + return []; + } + + var items = new List(); + var checksByCategory = checks.GroupBy(c => c.Category) + .ToDictionary(g => g.Key, g => g.ToList()); + + foreach (var check in checks) + { + if (check.Score != false) + { + continue; + } + + string categoryKey = CategoryToKey(check.Category); + // Toolset-level checks are scored separately from per-tool categories in Scorer. + // Route them to ToolsetWeight explicitly so action-item impact stays aligned with scoring. + float weight = check.Category == CheckCategory.ToolsetDesign + ? Scorer.ToolsetWeight + : Scorer.CategoryWeights.GetValueOrDefault(categoryKey, 0.15f); + int categoryTotal = checksByCategory.TryGetValue(check.Category, out var catChecks) + ? catChecks.Count + : 1; + float scoreImpact = MathF.Round((weight * 100f) / Math.Max(categoryTotal, 1), 1); + + List issueLeadsTo = ResolveIssueImpacts(check.IssueIds); + + items.Add(new ActionItem + { + ToolName = toolName, + ParamName = null, + Priority = check.Severity, + Title = check.Prompt, + Description = check.Reason ?? string.Empty, + IssueIds = check.IssueIds, + ImpactAreas = check.ImpactAreas, + Remediation = check.Remediation, + ScoreImpact = scoreImpact, + IssueLeadsTo = issueLeadsTo, + }); + } + + items.Sort(CompareByPriority); + return items; + } + + /// + /// Resolves issue ids to their human-readable impact descriptions + /// using the IssueTaxonomy definitions. + /// + private static List ResolveIssueImpacts(List issueIds) + { + if (issueIds is null || issueIds.Count == 0) + { + return []; + } + + var impacts = new List(); + foreach (int issueId in issueIds) + { + if (IssueTaxonomy.Definitions.TryGetValue(issueId, out var issue)) + { + impacts.Add(issue.Impact); + } + } + + return impacts; + } + + /// + /// Converts a enum value to the snake_case key + /// used in category weight dictionaries. + /// + private static string CategoryToKey(CheckCategory category) => category switch + { + CheckCategory.ToolName => "tool_name", + CheckCategory.ToolDescription => "tool_description", + CheckCategory.ParamName => "param_name", + CheckCategory.ParamDescription => "param_description", + CheckCategory.SchemaStructure => "schema_structure", + CheckCategory.ToolsetDesign => "toolset_design", + _ => "unknown", + }; + + /// + /// Compares two action items by priority ordinal (P0=0, P1=1, P2=2, P3=3). + /// + private static int CompareByPriority(ActionItem a, ActionItem b) => a.Priority.CompareTo(b.Priority); +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistEvaluator.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistEvaluator.cs new file mode 100644 index 00000000..d23fd5b0 --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistEvaluator.cs @@ -0,0 +1,733 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using System.Text.Json; +using System.Text.RegularExpressions; +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; +using Microsoft.Extensions.Logging; + +namespace Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; + +/// +/// Evaluates semantic checks by writing the checklist to a file, invoking a +/// coding agent CLI as a subprocess, and re-reading the updated file. +/// +/// Tries engines in order: GitHub Copilot -> Claude Code. +/// If the user specifies an engine explicitly, only that engine is tried. +/// If Auto, tries all available engines in order until one succeeds. +/// +internal sealed class ChecklistEvaluator : IChecklistEvaluator +{ + // Per-scope (tool or server) the agent may leave some items unscored on a given + // pass, especially "pass if no issues" prompts the model hedges on. Re-invoke up + // to this many times; we stop as soon as everything is scored. + private const int MaxAttempts = 3; + + private static readonly JsonSerializerOptions WriteOptions = new() { WriteIndented = true }; + + // Tolerant reader options: coding agents sometimes produce trailing commas or comments + private static readonly JsonSerializerOptions ReadOptions = new() + { + AllowTrailingCommas = true, + ReadCommentHandling = JsonCommentHandling.Skip + }; + + // Registered coding-agent launchers, in priority order (the registration order + // in Program.cs). Auto walks this list; a specific engine is matched by Engine. + private readonly IReadOnlyList _launchers; + private readonly ILogger _logger; + + public ChecklistEvaluator(IEnumerable launchers, ILogger logger) + { + ArgumentNullException.ThrowIfNull(launchers); + ArgumentNullException.ThrowIfNull(logger); + _launchers = launchers.ToList(); + _logger = logger; + } + + /// Finds the registered launcher for an engine, or null if none. + private ICodingAgentLauncher? LauncherFor(EvalEngine engine) + => _launchers.FirstOrDefault(l => l.Engine == engine); + + /// + public async Task EvaluateAsync( + EvaluationChecklist checklist, + string checklistPath, + EvalEngine engine, + CancellationToken cancellationToken = default) + { + ArgumentNullException.ThrowIfNull(checklist); + ArgumentException.ThrowIfNullOrWhiteSpace(checklistPath); + + var dir = Path.GetDirectoryName(checklistPath) ?? "."; + Directory.CreateDirectory(dir); + + // Count unevaluated semantic checks before starting. + // The pipeline service is responsible for loading any pre-existing checklist + // from disk, so `checklist` already reflects whatever scores the user has done. + int totalUnevaluatedBefore = CountTotalUnevaluatedSemanticChecks(checklist); + + // Fast path: checklist is fully scored (this is the resume case after manual scoring, + // or a second run where agents already filled everything last time). + if (totalUnevaluatedBefore == 0) + { + _logger.LogInformation(" All semantic checks already scored — skipping agent invocation"); + await WriteChecklistAsync(checklist, checklistPath, cancellationToken); + return new ChecklistEvaluationResult { Checklist = checklist, Outcome = EvaluationOutcome.Completed }; + } + + // User explicitly opted out of running an agent AND the checklist isn't fully scored: + // persist what we have, print guidance, and stop. + if (engine == EvalEngine.None) + { + await WriteChecklistAsync(checklist, checklistPath, cancellationToken); + LogManualEvaluationInstructions(checklistPath, totalUnevaluatedBefore, engineNotFound: false, agentAttempted: false, requested: engine); + return new ChecklistEvaluationResult { Checklist = checklist, Outcome = EvaluationOutcome.OptedOut }; + } + + // Persist the unscored checklist now so the user has a file to edit if no agent is available. + await WriteChecklistAsync(checklist, checklistPath, cancellationToken); + + // Build the list of engines to try (for Auto, detect available; otherwise just the one requested) + var enginesToTry = await BuildEngineList(engine, cancellationToken); + + if (enginesToTry.Count == 0) + { + LogManualEvaluationInstructions(checklistPath, totalUnevaluatedBefore, engineNotFound: true, agentAttempted: false, requested: engine); + return new ChecklistEvaluationResult { Checklist = checklist, Outcome = EvaluationOutcome.CouldNotEvaluate }; + } + + // Announce the active engine (and fallback if any) + if (enginesToTry.Count == 1) + { + _logger.LogInformation(" Using {Engine}", FormatEngineName(enginesToTry[0])); + } + else + { + _logger.LogInformation(" Using {Primary} (fallback: {Fallback})", + FormatEngineName(enginesToTry[0]), + string.Join(", ", enginesToTry.Skip(1).Select(FormatEngineName))); + } + + // Track the first engine that successfully produced evaluations across any + // tool or server-check pass. Used to stamp the report with the engine that + // actually did the work (rather than the user's "auto" request). + EvalEngine? engineUsed = null; + + // Evaluate each tool using extract-evaluate-merge pattern. + // The full checklist is ~1MB which is too large for coding agents. + // Instead, extract each tool to a small temp file (~25KB), have the + // agent evaluate it, then merge the results back into the checklist. + for (int i = 0; i < checklist.Tools.Count; i++) + { + cancellationToken.ThrowIfCancellationRequested(); + + var tool = checklist.Tools[i]; + var unevaluated = CountUnevaluatedSemanticChecks(tool); + if (unevaluated == 0) + { + continue; + } + + // Heartbeat BEFORE the tool runs so the user sees forward motion immediately. + // Each tool can take minutes (multi-attempt Copilot/Claude invocations), so + // logging only on completion leaves long silent gaps that look like a hang. + _logger.LogInformation(" [{Current}/{Total}] {ToolName} ({CheckCount} checks) ... running", + i + 1, checklist.Tools.Count, tool.Name, unevaluated); + + var toolEngine = await EvaluateToolChecks(tool, enginesToTry, cancellationToken); + if (toolEngine is not null) + { + engineUsed ??= toolEngine; + _logger.LogInformation(" [{Current}/{Total}] {ToolName} ... ok", + i + 1, checklist.Tools.Count, tool.Name); + } + else + { + _logger.LogWarning(" [{Current}/{Total}] {ToolName} ... failed (continuing)", + i + 1, checklist.Tools.Count, tool.Name); + } + } + + // Evaluate server-level checks (extract server_checks + tool list summary) + var serverUnevaluated = checklist.ServerChecks.Count(c => c.Type == CheckType.Semantic && c.Score is null); + if (serverUnevaluated > 0) + { + _logger.LogInformation(" server-level checks ({Count} checks) ... running", serverUnevaluated); + var serverEngine = await EvaluateServerChecks(checklist, enginesToTry, cancellationToken); + if (serverEngine is not null) + { + engineUsed ??= serverEngine; + _logger.LogInformation(" server-level checks ... ok"); + } + else + { + _logger.LogWarning(" server-level checks ... failed (continuing)"); + } + } + + // Write the updated checklist back (with all merged results) + var updatedJson = JsonSerializer.Serialize(checklist, WriteOptions); + await File.WriteAllTextAsync(checklistPath, updatedJson, cancellationToken); + + var scoredSemantic = CountEvaluatedSemanticChecks(checklist); + var totalSemantic = CountTotalSemanticChecks(checklist); + var remainingUnevaluated = CountTotalUnevaluatedSemanticChecks(checklist); + _logger.LogInformation(" {Scored} of {Total} semantic checks scored", scoredSemantic, totalSemantic); + if (remainingUnevaluated > 0) + { + _logger.LogWarning(" {Count} semantic check{Plural} remain unscored", + remainingUnevaluated, remainingUnevaluated == 1 ? "" : "s"); + + // The detected agent(s) didn't score enough to finish the run — it may have + // hit tool-permission limits, timed out, or returned without edits. Rather + // than silently producing an inflated report, give the user the same BYOL + // fallback they'd get if no agent was installed at all. + LogManualEvaluationInstructions(checklistPath, remainingUnevaluated, engineNotFound: false, agentAttempted: true, requested: engine); + } + + // Only treat evaluation as completed when nothing is left unscored. + // Partial evaluations would skew scoring (Scorer treats unscored categories as 100). + return new ChecklistEvaluationResult + { + Checklist = checklist, + Outcome = remainingUnevaluated == 0 ? EvaluationOutcome.Completed : EvaluationOutcome.CouldNotEvaluate, + EngineUsed = engineUsed + }; + } + + /// + /// Extracts a single tool to a temp file, invokes the coding agent to evaluate + /// its semantic checks, then merges the scored results back into the tool object. + /// The temp file lives in an isolated directory under the system temp path to + /// reduce the blast radius of the agent's file tools: the agent's cwd is the + /// sandbox, and each engine's path-verification (Copilot's default, Claude's + /// --add-dir allowlist) bounds cwd-relative file access to it. Absolute paths + /// remain reachable, so this is a reduced-surface defense, not a full jail. + /// + private async Task EvaluateToolChecks( + ToolChecklist tool, + List engines, + CancellationToken cancellationToken) + { + var sandbox = CreateSandboxDir(); + var tempFile = Path.Combine(sandbox, $".eval_tool_{Guid.NewGuid():N}.json"); + try + { + var fullPath = Path.GetFullPath(tempFile); + EvalEngine? firstSuccessfulEngine = null; + + // Up to MaxAttempts agent passes. Each pass, we re-serialize the current + // tool state (with any scores merged from prior passes) so the agent only + // sees the items that are still null. Stops early once everything is scored. + for (int attempt = 1; attempt <= MaxAttempts; attempt++) + { + var toolJson = JsonSerializer.Serialize(tool, WriteOptions); + await File.WriteAllTextAsync(tempFile, toolJson, cancellationToken); + + // Scale the per-attempt timeout to the remaining work: a tool with + // 46 unscored checks legitimately needs longer than one with 18. + var perAttemptTimeout = CodingAgentLauncherBase.TimeoutForChecks(CountUnevaluatedSemanticChecks(tool)); + + var successEngine = await TryEvaluateWithFallthrough( + engines, + tempFile, + engine => SemanticCheckPrompts.BuildToolEvaluationPrompt(fullPath, tool.Name, ToolsetFor(engine)), + perAttemptTimeout, + cancellationToken); + + if (successEngine is not null) + { + firstSuccessfulEngine ??= successEngine; + + // Re-read the evaluated tool and merge scores back. + // Coding agents sometimes produce slightly malformed JSON: missing + // commas (handled by RepairJson), or structurally invalid items + // where a check is an abbreviated object or wrong type. Those will + // throw from Deserialize — treat as "agent made no usable progress + // this attempt" and let the retry loop try again. + try + { + var updatedJson = RepairJson(await File.ReadAllTextAsync(tempFile, cancellationToken)); + var updatedTool = JsonSerializer.Deserialize(updatedJson, ReadOptions); + + if (updatedTool is not null) + { + MergeScores(tool.Checks.ToolName, updatedTool.Checks.ToolName); + MergeScores(tool.Checks.ToolDescription, updatedTool.Checks.ToolDescription); + MergeScores(tool.Checks.SchemaStructure, updatedTool.Checks.SchemaStructure); + foreach (var (paramName, paramChecks) in tool.Checks.Parameters) + { + if (updatedTool.Checks.Parameters.TryGetValue(paramName, out var updatedParam)) + { + MergeScores(paramChecks.ParamName, updatedParam.ParamName); + MergeScores(paramChecks.ParamDescription, updatedParam.ParamDescription); + } + } + } + } + catch (JsonException ex) + { + _logger.LogDebug(ex, + "Tool {ToolName}: attempt {Attempt} produced JSON that failed to deserialize (path: {Path}); will retry if attempts remain", + tool.Name, attempt, ex.Path ?? "unknown"); + } + } + else + { + // Subprocess failed this attempt (timeout or non-zero exit). + // We still retry — we've observed that timeouts on Haiku are + // non-deterministic: a tool that times out on attempt 1 often + // completes on attempt 2 or 3. Giving up fast loses winnable runs. + // Info-level so the user sees the in-flight retry, not silence. + _logger.LogInformation( + " {ToolName}: attempt {Attempt} subprocess failed; will retry if attempts remain", + tool.Name, attempt); + } + + if (CountUnevaluatedSemanticChecks(tool) == 0) + { + return firstSuccessfulEngine; + } + + if (attempt < MaxAttempts) + { + // Info-level so the user sees forward motion when a tool needs >1 pass. + _logger.LogInformation(" {ToolName}: attempt {Attempt} left {Count} check(s) unscored, retrying", + tool.Name, attempt, CountUnevaluatedSemanticChecks(tool)); + } + } + + // All MaxAttempts used. If at least one attempt produced exit-0 output + // (even if some items remain null), treat as "agent ran" — the outer + // pipeline will see the unscored items and fall back to manual scoring. + // If no attempt ever succeeded (e.g. all 3 hit timeout), report failure + // so the tool shows up as "failed (continuing)" in the pipeline log. + return firstSuccessfulEngine; + } + finally + { + DeleteSandboxDir(sandbox); + } + } + + /// + /// Extracts server-level checks with a tool name summary to a temp file, + /// invokes the coding agent, then merges results back. Runs inside an isolated + /// sandbox directory for the same reason as EvaluateToolChecks. + /// + private async Task EvaluateServerChecks( + EvaluationChecklist checklist, + List engines, + CancellationToken cancellationToken) + { + var sandbox = CreateSandboxDir(); + var tempFile = Path.Combine(sandbox, $".eval_server_{Guid.NewGuid():N}.json"); + try + { + var fullPath = Path.GetFullPath(tempFile); + EvalEngine? firstSuccessfulEngine = null; + var docOptions = new JsonDocumentOptions + { + AllowTrailingCommas = true, + CommentHandling = JsonCommentHandling.Skip + }; + + for (int attempt = 1; attempt <= MaxAttempts; attempt++) + { + // Re-build the input each attempt so the agent sees the current + // (partially scored) state — previously-scored items are preserved. + var serverData = new + { + tool_summaries = checklist.Tools.Select(t => new { t.Name, t.Description }).ToList(), + server_checks = checklist.ServerChecks + }; + var dataJson = JsonSerializer.Serialize(serverData, WriteOptions); + await File.WriteAllTextAsync(tempFile, dataJson, cancellationToken); + + var serverRemaining = checklist.ServerChecks.Count(c => c.Type == CheckType.Semantic && c.Score is null); + var perAttemptTimeout = CodingAgentLauncherBase.TimeoutForChecks(serverRemaining); + + var successEngine = await TryEvaluateWithFallthrough( + engines, + tempFile, + engine => SemanticCheckPrompts.BuildServerChecksEvaluationPrompt(fullPath, ToolsetFor(engine)), + perAttemptTimeout, + cancellationToken); + + if (successEngine is not null) + { + firstSuccessfulEngine ??= successEngine; + + try + { + var updatedJson = RepairJson(await File.ReadAllTextAsync(tempFile, cancellationToken)); + using var doc = JsonDocument.Parse(updatedJson, docOptions); + if (doc.RootElement.TryGetProperty("server_checks", out var checksElement)) + { + var updatedChecks = JsonSerializer.Deserialize>(checksElement.GetRawText(), ReadOptions); + if (updatedChecks is not null) + { + MergeScores(checklist.ServerChecks, updatedChecks); + } + } + } + catch (JsonException ex) + { + _logger.LogDebug(ex, + "Server checks: attempt {Attempt} produced JSON that failed to deserialize (path: {Path}); will retry if attempts remain", + attempt, ex.Path ?? "unknown"); + } + } + else + { + // Subprocess failed this attempt (timeout / non-zero exit). + // Retry — the failure is often transient on Haiku. + _logger.LogDebug("Server checks: attempt {Attempt} subprocess failed; will retry if attempts remain", + attempt); + } + + var remaining = checklist.ServerChecks.Count(c => c.Type == CheckType.Semantic && c.Score is null); + if (remaining == 0) + { + return firstSuccessfulEngine; + } + + if (attempt < MaxAttempts) + { + _logger.LogDebug("Server checks: attempt {Attempt} left {Count} check(s) unscored, retrying", + attempt, remaining); + } + } + + return firstSuccessfulEngine; + } + finally + { + DeleteSandboxDir(sandbox); + } + } + + /// + /// Creates a fresh isolated directory under the system temp path for a single + /// agent invocation. The agent's working directory is set to this path, which + /// bounds file-tool access to files that we place here ourselves. + /// + private static string CreateSandboxDir() + { + var dir = Path.Combine(Path.GetTempPath(), $"a365-eval-{Guid.NewGuid():N}"); + Directory.CreateDirectory(dir); + return dir; + } + + private static void DeleteSandboxDir(string path) + { + try { Directory.Delete(path, recursive: true); } catch { /* best effort */ } + } + + /// + /// Merges scores from evaluated items back into the original list. + /// Only copies score/reason for items that were null and are now filled. + /// Agent output can contain duplicate or empty ids; drop empties and take + /// last-wins on duplicates so a malformed batch is handled like other + /// agent-JSON quirks (treated as "no usable progress, retry") rather than + /// crashing the run. + /// + private static void MergeScores(List original, List evaluated) + { + var evaluatedById = evaluated + .Where(e => !string.IsNullOrEmpty(e.Id)) + .GroupBy(e => e.Id) + .ToDictionary(g => g.Key, g => g.Last()); + foreach (var item in original) + { + if (item.Score is not null) + { + continue; // Already scored (deterministic or previously evaluated) + } + + if (evaluatedById.TryGetValue(item.Id, out var updated) && updated.Score is not null) + { + item.Score = updated.Score; + item.Reason = updated.Reason; + } + } + } + + /// + /// Attempts to repair common JSON issues produced by coding agents by + /// inserting missing commas between properties or array elements. + /// Trailing commas are tolerated separately via AllowTrailingCommas in ReadOptions. + /// + internal static string RepairJson(string json) + { + // Insert missing commas: a value-ending token followed by whitespace then a + // value-starting token, with no comma in between. + // Value endings: } ] " true false null digits + // Value beginnings: { [ " + return Regex.Replace(json, @"([\}\]""]|true|false|null|\d)(\s*\n\s*)([\{\[""])", "$1,$2$3"); + } + + /// + /// Tries each engine in order for a single evaluation call until one succeeds. + /// Returns the engine that succeeded, or null if every candidate failed. + /// Builds the prompt per engine so we can name the engine's exact tools in the + /// instructions (Copilot: view/create, Claude Code: Read/Write). + /// + private async Task TryEvaluateWithFallthrough( + List engines, + string filePath, + Func promptBuilder, + TimeSpan timeout, + CancellationToken cancellationToken) + { + var workingDirectory = Path.GetDirectoryName(filePath) ?? Directory.GetCurrentDirectory(); + foreach (var candidate in engines) + { + var launcher = LauncherFor(candidate); + if (launcher is null) + { + _logger.LogDebug("No launcher registered for {Engine}, skipping", candidate); + continue; + } + + var prompt = promptBuilder(candidate); + var success = await launcher.LaunchAsync(prompt, workingDirectory, timeout, cancellationToken); + if (success) + { + return candidate; + } + + _logger.LogDebug("{Engine} failed, trying next", candidate); + } + + return null; + } + + /// + /// Maps an engine to the concrete tool names it exposes. Edit-style tools are + /// deliberately omitted: we've observed models thrashing between edit and create + /// strategies when both are available, so the runner only exposes read + an + /// edit (string-replace) tool. We deliberately do NOT expose a whole-file + /// write tool: Copilot's `create` refuses to overwrite existing files, which + /// sends the agent on long workaround loops, and a mix of edit+create tempts + /// the model to oscillate between strategies. + /// + private SemanticCheckPrompts.AgentToolset ToolsetFor(EvalEngine engine) + => LauncherFor(engine)?.Toolset + ?? new SemanticCheckPrompts.AgentToolset(ReadToolName: "read", EditToolName: "edit"); + + /// + /// Builds the ordered list of engines to try based on user's choice. + /// For Auto: detect which are available, always Copilot first. + /// For a specific engine: return it only if its CLI is available; otherwise + /// an empty list so the caller takes the same "engine not found" path as Auto + /// with nothing installed (instead of looping through failures and surfacing + /// a misleading "agent ran but left checks unscored" message). + /// Caller should have handled None earlier. + /// + private async Task> BuildEngineList(EvalEngine requested, CancellationToken cancellationToken = default) + { + if (requested != EvalEngine.Auto) + { + var launcher = LauncherFor(requested); + if (launcher is not null && await launcher.IsAvailableAsync(cancellationToken)) + { + return [requested]; + } + + _logger.LogDebug("Requested engine {Engine} is not available on PATH", requested); + return []; + } + + // Auto: detect all available engines, preserving the registered priority order + var available = new List(); + foreach (var launcher in _launchers) + { + if (await launcher.IsAvailableAsync(cancellationToken)) + { + _logger.LogDebug("Detected {Engine}", launcher.Engine); + available.Add(launcher.Engine); + } + } + + return available; + } + + /// + /// Returns a user-friendly display name for an engine. Real engine names come + /// from the registered launcher; the Auto/None meta-values are formatted here. + /// + public string FormatEngineName(EvalEngine engine) + { + var launcher = LauncherFor(engine); + if (launcher is not null) + { + return launcher.DisplayName; + } + + return engine switch + { + EvalEngine.Auto => "auto", + EvalEngine.None => "none", + _ => engine.ToString() + }; + } + + private static int CountTotalUnevaluatedSemanticChecks(EvaluationChecklist checklist) + { + int count = 0; + foreach (var tool in checklist.Tools) + { + count += CountUnevaluatedSemanticChecks(tool); + } + count += checklist.ServerChecks.Count(c => c.Type == CheckType.Semantic && c.Score is null); + return count; + } + + private static int CountUnevaluatedSemanticChecks(ToolChecklist tool) + { + int count = 0; + count += tool.Checks.ToolName.Count(i => i.Type == CheckType.Semantic && i.Score is null); + count += tool.Checks.ToolDescription.Count(i => i.Type == CheckType.Semantic && i.Score is null); + count += tool.Checks.SchemaStructure.Count(i => i.Type == CheckType.Semantic && i.Score is null); + foreach (var param in tool.Checks.Parameters.Values) + { + count += param.ParamName.Count(i => i.Type == CheckType.Semantic && i.Score is null); + count += param.ParamDescription.Count(i => i.Type == CheckType.Semantic && i.Score is null); + } + return count; + } + + private static int CountTotalSemanticChecks(EvaluationChecklist checklist) + { + int count = 0; + foreach (var tool in checklist.Tools) + { + count += tool.Checks.ToolName.Count(c => c.Type == CheckType.Semantic); + count += tool.Checks.ToolDescription.Count(c => c.Type == CheckType.Semantic); + count += tool.Checks.SchemaStructure.Count(c => c.Type == CheckType.Semantic); + foreach (var param in tool.Checks.Parameters.Values) + { + count += param.ParamName.Count(c => c.Type == CheckType.Semantic); + count += param.ParamDescription.Count(c => c.Type == CheckType.Semantic); + } + } + count += checklist.ServerChecks.Count(c => c.Type == CheckType.Semantic); + return count; + } + + private void LogManualEvaluationInstructions(string checklistPath, int unscoredCount, bool engineNotFound, bool agentAttempted, EvalEngine requested) + { + var fullPath = Path.GetFullPath(checklistPath); + var promptPath = Path.Combine(Path.GetDirectoryName(fullPath) ?? ".", "semantic_eval_prompt.txt"); + var prompt = SemanticCheckPrompts.BuildEvaluationPrompt(fullPath); + + try + { + File.WriteAllText(promptPath, prompt); + } + catch (Exception ex) + { + _logger.LogDebug(ex, "Failed to write prompt file to {Path}", promptPath); + promptPath = string.Empty; + } + + if (engineNotFound) + { + if (requested == EvalEngine.Auto) + { + // Built from the registry so a newly added engine appears here automatically. + var probed = string.Join(" and ", _launchers.Select(l => $"{l.DisplayName} (`{l.CliCommand}`)")); + _logger.LogWarning(" No coding agent CLI detected (looked for {Probed}). Run with -v to see why each probe failed.", probed); + } + else + { + // The user asked for one specific engine; name only that one, not both. + var launcher = LauncherFor(requested); + var name = launcher?.DisplayName ?? FormatEngineName(requested); + var binary = launcher?.CliCommand ?? requested.ToString(); + _logger.LogWarning(" {Name} CLI not found on PATH (looked for `{Binary}`). Run with -v to see why the probe failed.", name, binary); + } + } + else if (agentAttempted) + { + // Agent was detected and invoked but didn't score enough of the checklist. + // Could be a tool-permission issue, a timeout, or the model bailing out. + _logger.LogWarning(" The coding agent ran but left {Count} check{Plural} unscored — falling back to manual scoring", + unscoredCount, unscoredCount == 1 ? "" : "s"); + } + else + { + _logger.LogInformation(" {Count} semantic check{Plural} still unscored (--eval-engine none skips automatic scoring)", + unscoredCount, unscoredCount == 1 ? "" : "s"); + } + + _logger.LogInformation(""); + _logger.LogInformation("To finish this evaluation, pick one:"); + _logger.LogInformation(""); + + if (engineNotFound) + { + _logger.LogInformation(" 1. Install a coding agent CLI and re-run the same command:"); + _logger.LogInformation(" GitHub Copilot: https://github.com/github/gh-copilot"); + _logger.LogInformation(" Claude Code: https://docs.anthropic.com/claude-code"); + _logger.LogInformation(""); + _logger.LogInformation(" 2. Score with your own LLM (ChatGPT, Gemini, an IDE assistant, etc.):"); + } + else + { + _logger.LogInformation(" Score with your own LLM (ChatGPT, Gemini, an IDE assistant, etc.):"); + } + + _logger.LogInformation(" a. Open: {ChecklistPath}", fullPath); + if (!string.IsNullOrEmpty(promptPath)) + { + _logger.LogInformation(" b. Paste the prompt from: {PromptPath}", promptPath); + } + else + { + _logger.LogInformation(" b. Paste the prompt shown below into your LLM"); + } + _logger.LogInformation(" c. Have the LLM fill in every null `score` (true/false) with a one-sentence `reason`"); + _logger.LogInformation(" d. Save the file, then re-run the exact same command. The pipeline will detect the scored checklist and generate the report."); + _logger.LogInformation(""); + + if (string.IsNullOrEmpty(promptPath)) + { + _logger.LogInformation("--- PROMPT ---"); + _logger.LogInformation("{Prompt}", prompt); + _logger.LogInformation("--- END PROMPT ---"); + } + } + + /// + /// Serializes the checklist to disk at . + /// + private static async Task WriteChecklistAsync(EvaluationChecklist checklist, string checklistPath, CancellationToken cancellationToken) + { + var json = JsonSerializer.Serialize(checklist, WriteOptions); + await File.WriteAllTextAsync(checklistPath, json, cancellationToken); + } + + private static int CountEvaluatedSemanticChecks(EvaluationChecklist checklist) + { + int count = 0; + foreach (var tool in checklist.Tools) + { + count += CountEvaluated(tool.Checks.ToolName); + count += CountEvaluated(tool.Checks.ToolDescription); + count += CountEvaluated(tool.Checks.SchemaStructure); + foreach (var param in tool.Checks.Parameters.Values) + { + count += CountEvaluated(param.ParamName); + count += CountEvaluated(param.ParamDescription); + } + } + count += CountEvaluated(checklist.ServerChecks); + return count; + } + + private static int CountEvaluated(List items) => + items.Count(i => i.Type == CheckType.Semantic && i.Score is not null); +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistGenerator.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistGenerator.cs new file mode 100644 index 00000000..39e3ba57 --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistGenerator.cs @@ -0,0 +1,1165 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using System.Reflection; +using System.Text.Json; +using System.Text.RegularExpressions; +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; + +namespace Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; + +/// +/// Generates an evaluation checklist from discovered MCP tool schemas. +/// Runs deterministic checks inline (structural/objective checks that do not require +/// semantic judgment) and attaches semantic check placeholders for later evaluation +/// by a coding agent. +/// +internal sealed class ChecklistGenerator : IChecklistGenerator +{ + /// + public EvaluationChecklist Generate(List tools, string serverName, string serverUrl) + { + ArgumentNullException.ThrowIfNull(tools); + + var toolChecklists = new List(); + + foreach (var tool in tools) + { + var toolChecklist = BuildToolChecklist(tool, tools); + toolChecklists.Add(toolChecklist); + } + + var serverChecks = BuildServerChecks(tools); + + return new EvaluationChecklist + { + Metadata = new ChecklistMetadata + { + ServerName = serverName, + ServerUrl = serverUrl, + ToolCount = tools.Count, + GeneratedAt = DateTime.UtcNow, + GeneratorVersion = GetGeneratorVersion(), + }, + Tools = toolChecklists, + ServerChecks = serverChecks, + }; + } + + /// + /// Builds a complete checklist for a single tool, including deterministic checks + /// (pre-scored) and semantic check placeholders (score = null). + /// + private static ToolChecklist BuildToolChecklist(ToolSchema tool, List allTools) + { + var name = tool.Name ?? string.Empty; + var description = tool.Description ?? string.Empty; + var inputSchema = tool.InputSchema; + + // Extract properties and required arrays from inputSchema + var properties = ExtractProperties(inputSchema); + var requiredParams = ExtractRequiredParams(inputSchema); + var allParamNames = properties.Keys.ToList(); + + // --- Tool Name checks --- + var toolNameChecks = new List(); + toolNameChecks.AddRange(RunToolNameDeterministicChecks(name)); + toolNameChecks.AddRange( + SemanticCheckDefinitions.GetToolLevelChecks() + .Where(c => c.Category == CheckCategory.ToolName)); + + // --- Tool Description checks --- + var toolDescriptionChecks = new List(); + toolDescriptionChecks.AddRange(RunToolDescriptionDeterministicChecks(description)); + toolDescriptionChecks.AddRange( + SemanticCheckDefinitions.GetToolLevelChecks() + .Where(c => c.Category == CheckCategory.ToolDescription)); + + // --- Schema Structure checks --- + var schemaStructureChecks = RunSchemaStructureDeterministicChecks(inputSchema); + + // --- Parameter checks --- + var parameterGroups = new Dictionary(); + foreach (var (paramName, paramSchema) in properties) + { + var paramNameChecks = new List(); + paramNameChecks.AddRange(RunParamNameDeterministicChecks(paramName, allParamNames)); + + var paramDescChecks = new List(); + paramDescChecks.AddRange(RunParamDescriptionDeterministicChecks(paramName, paramSchema)); + + // Add semantic param checks, split by category + var semanticParamChecks = SemanticCheckDefinitions.GetParamLevelChecks(paramName); + paramNameChecks.AddRange(semanticParamChecks.Where(c => c.Category == CheckCategory.ParamName)); + paramDescChecks.AddRange(semanticParamChecks.Where(c => c.Category == CheckCategory.ParamDescription)); + + parameterGroups[paramName] = new ParamCheckGroups + { + ParamName = paramNameChecks, + ParamDescription = paramDescChecks, + }; + } + + return new ToolChecklist + { + Name = name, + Description = description, + InputSchema = inputSchema, + Checks = new ToolCheckGroups + { + ToolName = toolNameChecks, + ToolDescription = toolDescriptionChecks, + SchemaStructure = schemaStructureChecks, + Parameters = parameterGroups, + }, + }; + } + + /// + /// Builds server-level (toolset) checks: deterministic + semantic. + /// + private static List BuildServerChecks(List tools) + { + var checks = new List(); + checks.AddRange(RunToolsetDeterministicChecks(tools)); + checks.AddRange(SemanticCheckDefinitions.GetToolsetLevelChecks()); + return checks; + } + + // ----------------------------------------------------------------------- + // Tool Name deterministic checks + // ----------------------------------------------------------------------- + + private static List RunToolNameDeterministicChecks(string name) + { + return + [ + CheckToolNamePresent(name), + CheckToolNameConsistentCasing(name), + CheckToolNameNoSpecialChars(name), + CheckToolNameReasonableLength(name), + ]; + } + + private static ChecklistItem CheckToolNamePresent(string name) + { + bool passed = !string.IsNullOrWhiteSpace(name); + return new ChecklistItem + { + Id = "tn_present", + Type = CheckType.Deterministic, + Prompt = "Tool has a non-empty name.", + Score = passed, + Reason = passed ? "Tool has a name." : "Tool name is empty or missing.", + Severity = Priority.P0, + Category = CheckCategory.ToolName, + IssueIds = [4], + ImpactAreas = [ImpactArea.ToolSelection], + Remediation = passed ? string.Empty : "Every tool must have a non-empty name.", + }; + } + + private static ChecklistItem CheckToolNameConsistentCasing(string name) + { + bool isSnake = Regex.IsMatch(name, @"^[a-z][a-z0-9]*(_[a-z0-9]+)*$"); + bool isCamel = Regex.IsMatch(name, @"^[a-z][a-zA-Z0-9]*$"); + bool isPascal = Regex.IsMatch(name, @"^[A-Z][a-zA-Z0-9]*$"); + bool isKebab = Regex.IsMatch(name, @"^[a-z][a-z0-9]*(-[a-z0-9]+)*$"); + bool passed = isSnake || isCamel || isPascal || isKebab; + + string detected = isSnake ? "snake_case" + : isCamel ? "camelCase" + : isPascal ? "PascalCase" + : isKebab ? "kebab-case" + : "mixed/inconsistent"; + + return new ChecklistItem + { + Id = "tn_consistent_casing", + Type = CheckType.Deterministic, + Prompt = "Tool name uses a consistent naming convention (snake_case, camelCase, PascalCase, or kebab-case).", + Score = passed, + Reason = passed ? $"Name uses {detected} convention." : $"Name '{name}' uses mixed casing.", + Severity = Priority.P2, + Category = CheckCategory.ToolName, + IssueIds = [17], + ImpactAreas = [ImpactArea.ToolSelection], + Remediation = passed ? string.Empty : "Use consistent snake_case (preferred) or camelCase for all tool names.", + }; + } + + private static ChecklistItem CheckToolNameNoSpecialChars(string name) + { + bool passed = !string.IsNullOrEmpty(name) && Regex.IsMatch(name, @"^[a-zA-Z0-9_.\-]+$"); + var badChars = string.IsNullOrEmpty(name) + ? [] + : Regex.Matches(name, @"[^a-zA-Z0-9_.\-]").Select(m => m.Value).Distinct().ToList(); + + return new ChecklistItem + { + Id = "tn_no_special_chars", + Type = CheckType.Deterministic, + Prompt = "Tool name contains only valid characters (letters, numbers, underscores, hyphens, dots).", + Score = passed, + Reason = passed + ? "Name contains only valid characters." + : $"Name contains invalid characters: {string.Join(", ", badChars)}", + Severity = Priority.P1, + Category = CheckCategory.ToolName, + IssueIds = [], + ImpactAreas = [ImpactArea.ToolSelection], + Remediation = passed ? string.Empty : "Remove special characters. Use only letters, numbers, underscores, hyphens, and dots.", + }; + } + + private static ChecklistItem CheckToolNameReasonableLength(string name) + { + int length = name?.Length ?? 0; + bool passed = length >= 3 && length <= 64; + return new ChecklistItem + { + Id = "tn_reasonable_length", + Type = CheckType.Deterministic, + Prompt = "Tool name length is between 3 and 64 characters.", + Score = passed, + Reason = passed + ? $"Name length ({length}) is within range." + : $"Name length ({length}) outside 3-64 range.", + Severity = Priority.P2, + Category = CheckCategory.ToolName, + IssueIds = [], + ImpactAreas = [ImpactArea.ToolSelection], + Remediation = passed ? string.Empty : "Keep tool names between 3 and 64 characters.", + }; + } + + // ----------------------------------------------------------------------- + // Tool Description deterministic checks + // ----------------------------------------------------------------------- + + private static List RunToolDescriptionDeterministicChecks(string description) + { + return + [ + CheckToolDescriptionPresent(description), + CheckToolDescriptionMinLength(description), + CheckToolDescriptionMaxLength(description), + ]; + } + + private static ChecklistItem CheckToolDescriptionPresent(string description) + { + bool passed = !string.IsNullOrWhiteSpace(description); + return new ChecklistItem + { + Id = "td_present", + Type = CheckType.Deterministic, + Prompt = "Tool has a non-empty description.", + Score = passed, + Reason = passed ? "Tool has a description." : "Tool description is empty or missing.", + Severity = Priority.P0, + Category = CheckCategory.ToolDescription, + IssueIds = [4, 5, 6, 7, 8], + ImpactAreas = [ImpactArea.ToolSelection, ImpactArea.Completeness], + Remediation = passed ? string.Empty : "Add a description explaining what this tool does, when to use it, and what it returns.", + }; + } + + private static ChecklistItem CheckToolDescriptionMinLength(string description) + { + int length = description?.Trim().Length ?? 0; + bool passed = length >= 20; + return new ChecklistItem + { + Id = "td_min_length", + Type = CheckType.Deterministic, + Prompt = "Tool description is at least 20 characters.", + Score = passed, + Reason = passed + ? $"Description is {length} chars." + : $"Description is too short ({length} chars, minimum 20).", + Severity = Priority.P1, + Category = CheckCategory.ToolDescription, + IssueIds = [4, 9], + ImpactAreas = [ImpactArea.ToolSelection, ImpactArea.Completeness], + Remediation = passed ? string.Empty : "Expand the description to at least 20 characters with meaningful content.", + }; + } + + private static ChecklistItem CheckToolDescriptionMaxLength(string description) + { + int length = description?.Trim().Length ?? 0; + bool passed = length <= 2000; + return new ChecklistItem + { + Id = "td_max_length", + Type = CheckType.Deterministic, + Prompt = "Tool description is under 2000 characters.", + Score = passed, + Reason = passed + ? "Description length is within limits." + : $"Description is too long ({length} chars, max 2000). Risk of 16.67% regression.", + Severity = Priority.P2, + Category = CheckCategory.ToolDescription, + IssueIds = [14], + ImpactAreas = [ImpactArea.Conciseness], + Remediation = passed ? string.Empty : "Trim to under 2000 characters. Focus on purpose, guidelines, and limitations.", + }; + } + + // ----------------------------------------------------------------------- + // Schema Structure deterministic checks + // ----------------------------------------------------------------------- + + private static List RunSchemaStructureDeterministicChecks(JsonElement? inputSchema) + { + return + [ + CheckHasInputSchema(inputSchema), + CheckTypeObject(inputSchema), + CheckNoDeepNesting(inputSchema), + CheckAllTyped(inputSchema), + CheckArraysHaveItems(inputSchema), + CheckRequiredMatchesProperties(inputSchema), + CheckReasonableParamCount(inputSchema), + CheckNoEmptyObjects(inputSchema), + ]; + } + + private static ChecklistItem CheckHasInputSchema(JsonElement? inputSchema) + { + bool passed = inputSchema.HasValue && inputSchema.Value.ValueKind == JsonValueKind.Object; + return new ChecklistItem + { + Id = "ss_has_input_schema", + Type = CheckType.Deterministic, + Prompt = "Tool has an input schema defined.", + Score = passed, + Reason = passed ? "Tool has an input schema." : "Tool has no input schema defined.", + Severity = Priority.P0, + Category = CheckCategory.SchemaStructure, + IssueIds = [], + ImpactAreas = [ImpactArea.ParamAccuracy], + Remediation = passed ? string.Empty : "Define an inputSchema with type 'object' and properties for each parameter.", + }; + } + + private static ChecklistItem CheckTypeObject(JsonElement? inputSchema) + { + if (!inputSchema.HasValue || inputSchema.Value.ValueKind != JsonValueKind.Object) + { + return MakeDeterministicPass("ss_type_object", "Root type is object", + CheckCategory.SchemaStructure, "No schema to check."); + } + + string schemaType = GetStringProperty(inputSchema.Value, "type") ?? string.Empty; + bool passed = schemaType == "object"; + return new ChecklistItem + { + Id = "ss_type_object", + Type = CheckType.Deterministic, + Prompt = "Input schema root type is 'object'.", + Score = passed, + Reason = passed + ? "Schema root is type 'object'." + : $"Schema root type is '{schemaType}', expected 'object'.", + Severity = Priority.P0, + Category = CheckCategory.SchemaStructure, + IssueIds = [], + ImpactAreas = [ImpactArea.ParamAccuracy], + Remediation = passed ? string.Empty : "Set the inputSchema type to 'object' with 'properties' for parameters.", + }; + } + + private static ChecklistItem CheckNoDeepNesting(JsonElement? inputSchema) + { + if (!inputSchema.HasValue || inputSchema.Value.ValueKind != JsonValueKind.Object) + { + return MakeDeterministicPass("ss_no_deep_nesting", "No deep nesting", + CheckCategory.SchemaStructure, "No schema to check."); + } + + int depth = CalculateMaxDepth(inputSchema.Value, 0); + bool passed = depth < 4; + var severity = depth >= 4 ? Priority.P0 : depth == 3 ? Priority.P1 : Priority.P3; + return new ChecklistItem + { + Id = "ss_no_deep_nesting", + Type = CheckType.Deterministic, + Prompt = "Input schema nesting depth is less than 4 levels.", + Score = passed, + Reason = passed + ? $"Schema nesting depth is {depth} (limit: 3)." + : $"Schema nesting depth is {depth}. LLMs systematically flatten nested args at depth 4+.", + Severity = severity, + Category = CheckCategory.SchemaStructure, + IssueIds = [], + ImpactAreas = [ImpactArea.ParamAccuracy], + Remediation = passed ? string.Empty : "Flatten nested structures. Split deeply nested parameters into separate tools.", + }; + } + + private static ChecklistItem CheckAllTyped(JsonElement? inputSchema) + { + var properties = ExtractProperties(inputSchema); + if (properties.Count == 0) + { + return MakeDeterministicPass("ss_all_typed", "All properties typed", + CheckCategory.SchemaStructure, "No properties."); + } + + var untyped = properties + .Where(p => p.Value.ValueKind == JsonValueKind.Object + && !p.Value.TryGetProperty("type", out _) + && !p.Value.TryGetProperty("$ref", out _)) + .Select(p => p.Key) + .ToList(); + + bool passed = untyped.Count == 0; + return new ChecklistItem + { + Id = "ss_all_typed", + Type = CheckType.Deterministic, + Prompt = "All input schema properties have type definitions.", + Score = passed, + Reason = passed + ? "All properties have type definitions." + : $"Properties without type: {string.Join(", ", untyped)}. LLM cannot generate valid args.", + Severity = Priority.P0, + Category = CheckCategory.SchemaStructure, + IssueIds = [], + ImpactAreas = [ImpactArea.ParamAccuracy], + Remediation = passed ? string.Empty : $"Add 'type' to these properties: {string.Join(", ", untyped)}.", + }; + } + + private static ChecklistItem CheckArraysHaveItems(JsonElement? inputSchema) + { + var properties = ExtractProperties(inputSchema); + var badArrays = properties + .Where(p => p.Value.ValueKind == JsonValueKind.Object + && GetStringProperty(p.Value, "type") == "array" + && !p.Value.TryGetProperty("items", out _)) + .Select(p => p.Key) + .ToList(); + + bool passed = badArrays.Count == 0; + return new ChecklistItem + { + Id = "ss_arrays_have_items", + Type = CheckType.Deterministic, + Prompt = "All array properties define their items type.", + Score = passed, + Reason = passed + ? "All arrays define their items type." + : $"Arrays without items: {string.Join(", ", badArrays)}. Breaks OpenAI/Azure.", + Severity = Priority.P0, + Category = CheckCategory.SchemaStructure, + IssueIds = [], + ImpactAreas = [ImpactArea.ParamAccuracy], + Remediation = passed ? string.Empty : $"Add 'items' with a type definition to: {string.Join(", ", badArrays)}.", + }; + } + + private static ChecklistItem CheckRequiredMatchesProperties(JsonElement? inputSchema) + { + var requiredParams = ExtractRequiredParams(inputSchema); + var propertyNames = ExtractProperties(inputSchema).Keys.ToHashSet(); + + if (requiredParams.Count == 0) + { + return MakeDeterministicPass("ss_required_matches", "Required matches properties", + CheckCategory.SchemaStructure, "No required fields."); + } + + var orphans = requiredParams.Where(r => !propertyNames.Contains(r)).ToList(); + bool passed = orphans.Count == 0; + return new ChecklistItem + { + Id = "ss_required_matches", + Type = CheckType.Deterministic, + Prompt = "All required fields exist in the properties definition.", + Score = passed, + Reason = passed + ? "All required fields exist in properties." + : $"Required fields not in properties: {string.Join(", ", orphans)}. Server will always reject.", + Severity = Priority.P0, + Category = CheckCategory.SchemaStructure, + IssueIds = [1], + ImpactAreas = [ImpactArea.ParamAccuracy], + Remediation = passed ? string.Empty : $"Add these to 'properties' or remove from 'required': {string.Join(", ", orphans)}.", + }; + } + + private static ChecklistItem CheckReasonableParamCount(JsonElement? inputSchema) + { + int count = ExtractProperties(inputSchema).Count; + bool passed; + Priority severity; + string message; + + if (count == 0) + { + passed = true; + severity = Priority.P3; + message = "Tool has no parameters (verify intentional)."; + } + else if (count <= 10) + { + passed = true; + severity = Priority.P3; + message = $"Parameter count ({count}) is in the ideal range."; + } + else if (count <= 20) + { + passed = false; + severity = Priority.P1; + message = $"Parameter count ({count}) is high. gpt-4o-mini gets ~50% wrong with 10+ params."; + } + else + { + passed = false; + severity = Priority.P0; + message = $"Parameter count ({count}) almost certainly needs splitting into multiple tools."; + } + + return new ChecklistItem + { + Id = "ss_reasonable_param_count", + Type = CheckType.Deterministic, + Prompt = "Tool has a reasonable number of parameters (10 or fewer is ideal).", + Score = passed, + Reason = message, + Severity = severity, + Category = CheckCategory.SchemaStructure, + IssueIds = [], + ImpactAreas = [ImpactArea.ParamAccuracy], + Remediation = passed ? string.Empty : "Split tool into multiple focused tools with fewer parameters each.", + }; + } + + private static ChecklistItem CheckNoEmptyObjects(JsonElement? inputSchema) + { + var properties = ExtractProperties(inputSchema); + var emptyObjects = properties + .Where(p => p.Value.ValueKind == JsonValueKind.Object + && GetStringProperty(p.Value, "type") == "object" + && !HasNonEmptyObjectProperty(p.Value, "properties")) + .Select(p => p.Key) + .ToList(); + + bool passed = emptyObjects.Count == 0; + return new ChecklistItem + { + Id = "ss_no_empty_objects", + Type = CheckType.Deterministic, + Prompt = "No object-type parameters are defined without inner properties.", + Score = passed, + Reason = passed + ? "No empty object types." + : $"Object params without properties: {string.Join(", ", emptyObjects)}. LLM will hallucinate field names.", + Severity = Priority.P1, + Category = CheckCategory.SchemaStructure, + IssueIds = [], + ImpactAreas = [ImpactArea.ParamAccuracy], + Remediation = passed ? string.Empty : $"Define 'properties' for: {string.Join(", ", emptyObjects)}.", + }; + } + + // ----------------------------------------------------------------------- + // Parameter Name deterministic checks + // ----------------------------------------------------------------------- + + private static List RunParamNameDeterministicChecks(string paramName, List allParamNames) + { + return + [ + CheckParamNameNotSingleChar(paramName), + CheckParamNameReasonableLength(paramName), + CheckParamNameConsistentCasing(paramName, allParamNames), + ]; + } + + private static ChecklistItem CheckParamNameNotSingleChar(string paramName) + { + bool passed = paramName.Length >= 2; + return new ChecklistItem + { + Id = "pn_not_single_char", + Type = CheckType.Deterministic, + Prompt = $"Parameter '{paramName}' name is more than a single character.", + Score = passed, + Reason = passed + ? "Parameter name is descriptive." + : $"Parameter '{paramName}' is a single character.", + Severity = Priority.P1, + Category = CheckCategory.ParamName, + IssueIds = [9], + ImpactAreas = [ImpactArea.ParamAccuracy], + Remediation = passed ? string.Empty : $"Rename '{paramName}' to a descriptive name.", + }; + } + + private static ChecklistItem CheckParamNameReasonableLength(string paramName) + { + int length = paramName.Length; + bool passed = length >= 2 && length <= 40; + return new ChecklistItem + { + Id = "pn_reasonable_length", + Type = CheckType.Deterministic, + Prompt = $"Parameter '{paramName}' name length is between 2 and 40 characters.", + Score = passed, + Reason = passed + ? "Parameter name length is reasonable." + : $"Parameter '{paramName}' length ({length}) outside 2-40 range.", + Severity = Priority.P3, + Category = CheckCategory.ParamName, + IssueIds = [], + ImpactAreas = [ImpactArea.ParamAccuracy], + Remediation = passed ? string.Empty : "Keep parameter names between 2 and 40 characters.", + }; + } + + private static ChecklistItem CheckParamNameConsistentCasing(string paramName, List allParamNames) + { + if (allParamNames.Count < 2) + { + return MakeDeterministicPass("pn_consistent_casing", "Consistent casing", + CheckCategory.ParamName, "Only one parameter, casing consistent by default."); + } + + string dominant = PickDominantCasing(allParamNames); + bool passed = GetCompatibleCasings(paramName).Contains(dominant); + + return new ChecklistItem + { + Id = "pn_consistent_casing", + Type = CheckType.Deterministic, + Prompt = $"Parameter '{paramName}' follows the dominant naming convention used by other parameters.", + Score = passed, + Reason = passed + ? $"Parameter compatible with the {dominant} convention." + : $"Parameter '{paramName}' is not compatible with the {dominant} convention used by other parameters.", + Severity = Priority.P3, + Category = CheckCategory.ParamName, + IssueIds = [17], + ImpactAreas = [ImpactArea.ParamAccuracy], + Remediation = passed ? string.Empty : $"Rename to match the dominant {dominant} convention used by other parameters.", + }; + } + + // ----------------------------------------------------------------------- + // Parameter Description deterministic checks + // ----------------------------------------------------------------------- + + private static List RunParamDescriptionDeterministicChecks(string paramName, JsonElement paramSchema) + { + return + [ + CheckParamDescriptionPresent(paramName, paramSchema), + CheckParamDescriptionMinLength(paramName, paramSchema), + CheckParamDescriptionHasTypeGuidance(paramName, paramSchema), + ]; + } + + private static ChecklistItem CheckParamDescriptionPresent(string paramName, JsonElement paramSchema) + { + string description = GetStringProperty(paramSchema, "description") ?? string.Empty; + bool passed = !string.IsNullOrWhiteSpace(description); + return new ChecklistItem + { + Id = "pd_present", + Type = CheckType.Deterministic, + Prompt = $"Parameter '{paramName}' has a non-empty description.", + Score = passed, + Reason = passed + ? $"Parameter '{paramName}' has a description." + : $"Parameter '{paramName}' has no description (38% more omission errors).", + Severity = Priority.P0, + Category = CheckCategory.ParamDescription, + IssueIds = [9], + ImpactAreas = [ImpactArea.ParamAccuracy, ImpactArea.Completeness], + Remediation = passed ? string.Empty : $"Add a description to '{paramName}' explaining what it represents and expected values.", + }; + } + + private static ChecklistItem CheckParamDescriptionMinLength(string paramName, JsonElement paramSchema) + { + string description = GetStringProperty(paramSchema, "description") ?? string.Empty; + int wordCount = string.IsNullOrWhiteSpace(description) + ? 0 + : description.Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries).Length; + bool passed = wordCount >= 5; + return new ChecklistItem + { + Id = "pd_min_length", + Type = CheckType.Deterministic, + Prompt = $"Parameter '{paramName}' description has at least 5 words.", + Score = passed, + Reason = passed + ? $"'{paramName}' has {wordCount}-word description." + : $"'{paramName}' description is too short ({wordCount} words, minimum 5).", + Severity = Priority.P1, + Category = CheckCategory.ParamDescription, + IssueIds = [9], + ImpactAreas = [ImpactArea.ParamAccuracy], + Remediation = passed ? string.Empty : $"Expand '{paramName}' description to at least 5 words covering format and constraints.", + }; + } + + private static ChecklistItem CheckParamDescriptionHasTypeGuidance(string paramName, JsonElement paramSchema) + { + bool hasType = paramSchema.TryGetProperty("type", out _); + string description = (GetStringProperty(paramSchema, "description") ?? string.Empty).ToLowerInvariant(); + string[] typeKeywords = ["string", "number", "integer", "boolean", "array", "object", "id", "url", "email", "date", "iso"]; + bool hasTypeInDesc = typeKeywords.Any(keyword => description.Contains(keyword, StringComparison.Ordinal)); + bool passed = hasType || hasTypeInDesc; + + return new ChecklistItem + { + Id = "pd_has_type_guidance", + Type = CheckType.Deterministic, + Prompt = $"Parameter '{paramName}' has type information in schema or description.", + Score = passed, + Reason = passed + ? $"'{paramName}' has type information." + : $"'{paramName}' lacks type/format guidance in both schema and description.", + Severity = Priority.P2, + Category = CheckCategory.ParamDescription, + IssueIds = [11], + ImpactAreas = [ImpactArea.ParamAccuracy], + Remediation = passed ? string.Empty : $"Add 'type' to schema for '{paramName}' or mention expected format in description.", + }; + } + + // ----------------------------------------------------------------------- + // Toolset deterministic checks + // ----------------------------------------------------------------------- + + private static List RunToolsetDeterministicChecks(List tools) + { + return + [ + CheckToolsetReasonableCount(tools), + CheckToolsetNoNearDuplicateNames(tools), + CheckToolsetConsistentNaming(tools), + CheckToolsetReasonableTokenBudget(tools), + ]; + } + + private static ChecklistItem CheckToolsetReasonableCount(List tools) + { + int count = tools.Count; + bool passed; + Priority severity; + string message; + + if (count == 0) + { + passed = false; + severity = Priority.P0; + message = "No tools discovered."; + } + else if (count <= 15) + { + passed = true; + severity = Priority.P3; + message = $"Tool count ({count}) is in the optimal range."; + } + else if (count <= 40) + { + passed = false; + severity = Priority.P1; + message = $"Tool count ({count}) may degrade selection accuracy. Consider grouping."; + } + else + { + passed = false; + severity = Priority.P0; + message = $"Tool count ({count}) exceeds most client limits (Cursor caps at 40)."; + } + + return new ChecklistItem + { + Id = "ts_reasonable_count", + Type = CheckType.Deterministic, + Prompt = "Server has a reasonable number of tools (15 or fewer is optimal).", + Score = passed, + Reason = message, + Severity = severity, + Category = CheckCategory.ToolsetDesign, + IssueIds = [], + ImpactAreas = [ImpactArea.ToolSelection], + Remediation = passed ? string.Empty : count == 0 + ? "Add at least one tool to the server." + : "Reduce tool count by merging related tools or using dynamic selection.", + }; + } + + private static ChecklistItem CheckToolsetNoNearDuplicateNames(List tools) + { + var names = tools.Select(t => t.Name ?? string.Empty).ToList(); + var dupes = new List<(string Name1, string Name2)>(); + + for (int i = 0; i < names.Count; i++) + { + for (int j = i + 1; j < names.Count; j++) + { + int dist = LevenshteinDistance(names[i].ToLowerInvariant(), names[j].ToLowerInvariant()); + if (dist is > 0 and < 3) + { + dupes.Add((names[i], names[j])); + } + } + } + + bool passed = dupes.Count == 0; + string dupeList = string.Join("; ", dupes.Take(5).Select(d => $"{d.Name1} / {d.Name2}")); + return new ChecklistItem + { + Id = "ts_no_near_duplicate_names", + Type = CheckType.Deterministic, + Prompt = "No tool names are near-duplicates (edit distance < 3).", + Score = passed, + Reason = passed + ? "No near-duplicate tool names." + : $"Near-duplicate names (edit dist < 3): {dupeList}", + Severity = Priority.P1, + Category = CheckCategory.ToolsetDesign, + IssueIds = [17], + ImpactAreas = [ImpactArea.ToolSelection], + Remediation = passed ? string.Empty : "Rename tools to be clearly distinct.", + }; + } + + private static ChecklistItem CheckToolsetConsistentNaming(List tools) + { + if (tools.Count < 2) + { + return MakeDeterministicPass("ts_consistent_naming", "Consistent naming", + CheckCategory.ToolsetDesign, "Fewer than 2 tools."); + } + + var names = tools.Select(t => t.Name ?? string.Empty).ToList(); + string dominant = PickDominantCasing(names); + var outliers = names + .Where(n => !GetCompatibleCasings(n).Contains(dominant)) + .Take(5) + .ToList(); + + bool passed = outliers.Count == 0; + return new ChecklistItem + { + Id = "ts_consistent_naming", + Type = CheckType.Deterministic, + Prompt = "All tool names follow the same naming convention.", + Score = passed, + Reason = passed + ? $"All tools compatible with the {dominant} convention." + : $"Inconsistent naming: most compatible with {dominant}, but outliers: {string.Join(", ", outliers)}", + Severity = Priority.P2, + Category = CheckCategory.ToolsetDesign, + IssueIds = [17], + ImpactAreas = [ImpactArea.ToolSelection], + Remediation = passed ? string.Empty : $"Rename outlier tools to match the dominant {dominant} convention.", + }; + } + + private static ChecklistItem CheckToolsetReasonableTokenBudget(List tools) + { + int totalChars = tools.Sum(t => + { + int chars = (t.Name?.Length ?? 0) + (t.Description?.Length ?? 0); + if (t.InputSchema.HasValue) + { + chars += t.InputSchema.Value.GetRawText().Length; + } + return chars; + }); + int estimatedTokens = totalChars / 4; + const int budget = 12_800; + bool passed = estimatedTokens <= budget; + + return new ChecklistItem + { + Id = "ts_reasonable_token_budget", + Type = CheckType.Deterministic, + Prompt = $"Total schema token estimate is within budget ({budget:N0} tokens).", + Score = passed, + Reason = passed + ? $"Estimated schema tokens: {estimatedTokens:N0} (budget: {budget:N0})." + : $"Schema consumes ~{estimatedTokens:N0} tokens (>{budget:N0}). Reduces available context.", + Severity = passed ? Priority.P3 : Priority.P1, + Category = CheckCategory.ToolsetDesign, + IssueIds = [], + ImpactAreas = [ImpactArea.Conciseness, ImpactArea.ToolSelection], + Remediation = passed ? string.Empty : "Reduce schema size by trimming verbose descriptions, reducing tool count, or simplifying schemas.", + }; + } + + // ----------------------------------------------------------------------- + // JSON helpers + // ----------------------------------------------------------------------- + + /// + /// Extracts the 'properties' dictionary from an inputSchema JsonElement. + /// Returns property name to property schema element mapping. + /// + private static Dictionary ExtractProperties(JsonElement? inputSchema) + { + if (!inputSchema.HasValue || inputSchema.Value.ValueKind != JsonValueKind.Object) + { + return []; + } + + if (!inputSchema.Value.TryGetProperty("properties", out var propertiesElement) + || propertiesElement.ValueKind != JsonValueKind.Object) + { + return []; + } + + var result = new Dictionary(); + foreach (var property in propertiesElement.EnumerateObject()) + { + result[property.Name] = property.Value; + } + return result; + } + + /// + /// Extracts the 'required' array from an inputSchema JsonElement. + /// + private static List ExtractRequiredParams(JsonElement? inputSchema) + { + if (!inputSchema.HasValue || inputSchema.Value.ValueKind != JsonValueKind.Object) + { + return []; + } + + if (!inputSchema.Value.TryGetProperty("required", out var requiredElement) + || requiredElement.ValueKind != JsonValueKind.Array) + { + return []; + } + + var result = new List(); + foreach (var item in requiredElement.EnumerateArray()) + { + if (item.ValueKind == JsonValueKind.String) + { + var value = item.GetString(); + if (value is not null) + { + result.Add(value); + } + } + } + return result; + } + + /// + /// Gets a string property from a JsonElement, returning null if not found. + /// + private static string? GetStringProperty(JsonElement element, string propertyName) + { + if (element.ValueKind == JsonValueKind.Object && element.TryGetProperty(propertyName, out var value)) + { + return value.GetString(); + } + return null; + } + + /// + /// Checks if a JsonElement has a specified property that is a non-empty object. + /// + private static bool HasNonEmptyObjectProperty(JsonElement element, string propertyName) + { + if (!element.TryGetProperty(propertyName, out var value)) + { + return false; + } + + if (value.ValueKind != JsonValueKind.Object) + { + return false; + } + + // Check that the object has at least one property + using var enumerator = value.EnumerateObject(); + return enumerator.MoveNext(); + } + + /// + /// Calculates the maximum nesting depth of a JSON schema element. + /// + private static int CalculateMaxDepth(JsonElement schema, int current) + { + if (schema.ValueKind != JsonValueKind.Object) + { + return current; + } + + int maxDepth = current; + + if (schema.TryGetProperty("properties", out var properties) && properties.ValueKind == JsonValueKind.Object) + { + foreach (var prop in properties.EnumerateObject()) + { + maxDepth = Math.Max(maxDepth, CalculateMaxDepth(prop.Value, current + 1)); + } + } + + if (schema.TryGetProperty("items", out var items) && items.ValueKind == JsonValueKind.Object) + { + maxDepth = Math.Max(maxDepth, CalculateMaxDepth(items, current + 1)); + } + + if (schema.TryGetProperty("additionalProperties", out var addProps) && addProps.ValueKind == JsonValueKind.Object) + { + maxDepth = Math.Max(maxDepth, CalculateMaxDepth(addProps, current + 1)); + } + + return maxDepth; + } + + // ----------------------------------------------------------------------- + // String helpers + // ----------------------------------------------------------------------- + + /// + /// Returns the set of naming conventions a single identifier is compatible with. + /// A single-word lowercase identifier like "currency" satisfies snake_case, + /// kebab-case, and camelCase by formal definition (zero separators required), + /// so it can be consistent with any toolset using any of those conventions. + /// PascalCase is excluded because it requires an uppercase first letter. + /// + private static HashSet GetCompatibleCasings(string name) + { + var result = new HashSet(); + if (string.IsNullOrEmpty(name)) + { + return result; + } + + if (Regex.IsMatch(name, @"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")) + { + result.Add("snake_case"); + } + if (Regex.IsMatch(name, @"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")) + { + result.Add("kebab-case"); + } + if (Regex.IsMatch(name, @"^[a-z][a-zA-Z0-9]*$")) + { + result.Add("camelCase"); + } + if (Regex.IsMatch(name, @"^[A-Z][a-zA-Z0-9]*$")) + { + result.Add("PascalCase"); + } + + return result; + } + + /// + /// Picks the dominant casing convention across a set of names by vote-tally: + /// every name contributes one vote per convention it is compatible with. + /// Ties broken alphabetically for deterministic output. Returns "mixed" if no + /// name is compatible with any known convention. + /// + private static string PickDominantCasing(IEnumerable names) + { + var votes = new Dictionary(); + foreach (var name in names) + { + foreach (var convention in GetCompatibleCasings(name)) + { + votes[convention] = votes.GetValueOrDefault(convention) + 1; + } + } + + if (votes.Count == 0) + { + return "mixed"; + } + + return votes + .OrderByDescending(kv => kv.Value) + .ThenBy(kv => kv.Key, StringComparer.Ordinal) + .First() + .Key; + } + + /// + /// Computes the Levenshtein edit distance between two strings. + /// + private static int LevenshteinDistance(string s1, string s2) + { + if (s1.Length < s2.Length) + { + return LevenshteinDistance(s2, s1); + } + + if (s2.Length == 0) + { + return s1.Length; + } + + int[] previousRow = Enumerable.Range(0, s2.Length + 1).ToArray(); + + for (int i = 0; i < s1.Length; i++) + { + int[] currentRow = new int[s2.Length + 1]; + currentRow[0] = i + 1; + + for (int j = 0; j < s2.Length; j++) + { + int cost = s1[i] == s2[j] ? 0 : 1; + currentRow[j + 1] = Math.Min( + Math.Min(currentRow[j] + 1, previousRow[j + 1] + 1), + previousRow[j] + cost); + } + + previousRow = currentRow; + } + + return previousRow[s2.Length]; + } + + // ----------------------------------------------------------------------- + // Convenience helpers + // ----------------------------------------------------------------------- + + /// + /// Creates a passing deterministic check item for cases where the check + /// is not applicable (e.g., no schema to validate). + /// + private static ChecklistItem MakeDeterministicPass(string id, string prompt, CheckCategory category, string reason) + { + return new ChecklistItem + { + Id = id, + Type = CheckType.Deterministic, + Prompt = prompt, + Score = true, + Reason = reason, + Severity = Priority.P3, + Category = category, + IssueIds = [], + ImpactAreas = [], + Remediation = string.Empty, + }; + } + + /// + /// Gets the assembly version to use as the generator version in checklist metadata. + /// Falls back to "0.0.0" if the assembly version cannot be determined. + /// + private static string GetGeneratorVersion() + { + var assembly = Assembly.GetExecutingAssembly(); + var version = assembly.GetName().Version; + return version is not null ? version.ToString() : "0.0.0"; + } +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ClaudeCodeLauncher.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ClaudeCodeLauncher.cs new file mode 100644 index 00000000..130d9da1 --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ClaudeCodeLauncher.cs @@ -0,0 +1,114 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using System.Diagnostics; +using System.Runtime.InteropServices; +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; +using Microsoft.Extensions.Logging; + +namespace Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; + +/// +/// Launches Claude Code to score semantic checks. +/// On Windows the prompt is written to a temp file (cmd.exe /c does not forward +/// stdin); on Unix it is piped via stdin (-p -). Removes the CLAUDECODE env var so +/// the Claude CLI works even when invoked from inside a Claude Code session. +/// +internal sealed class ClaudeCodeLauncher : CodingAgentLauncherBase +{ + private const string ClaudeCodeEnvVar = "CLAUDECODE"; + + public ClaudeCodeLauncher(CommandExecutor executor, ILogger logger) + : base(executor, logger) + { + } + + public override EvalEngine Engine => EvalEngine.ClaudeCode; + + public override string DisplayName => "Claude Code"; + + public override SemanticCheckPrompts.AgentToolset Toolset => new(ReadToolName: "Read", EditToolName: "Edit"); + + public override string CliCommand => "claude"; + + public override async Task LaunchAsync( + string prompt, + string workingDirectory, + TimeSpan timeout, + CancellationToken cancellationToken = default) + { + ArgumentException.ThrowIfNullOrWhiteSpace(prompt); + ArgumentException.ThrowIfNullOrWhiteSpace(workingDirectory); + + if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows)) + { + return await LaunchViaFileAsync(prompt, workingDirectory, timeout, cancellationToken); + } + + return await LaunchViaStdinAsync(prompt, workingDirectory, timeout, cancellationToken); + } + + /// + /// Windows path: writes prompt to a temp file since cmd.exe /c does not forward stdin. + /// + private async Task LaunchViaFileAsync( + string prompt, + string workingDirectory, + TimeSpan timeout, + CancellationToken cancellationToken) + { + var promptFile = Path.Combine(workingDirectory, $".eval_prompt_{Guid.NewGuid():N}.txt"); + try + { + await File.WriteAllTextAsync(promptFile, prompt, cancellationToken); + + var metaPrompt = $"Read and follow the instructions in the file at: {promptFile}"; + var (fileName, fileArguments) = WrapForPlatform("claude", $"-p \"{metaPrompt}\" --model {EvalModelConstants.ClaudeModel} --allowedTools Read,Edit"); + + var startInfo = new ProcessStartInfo + { + FileName = fileName, + Arguments = fileArguments, + WorkingDirectory = workingDirectory, + RedirectStandardOutput = true, + RedirectStandardError = true, + UseShellExecute = false, + CreateNoWindow = true + }; + + startInfo.Environment.Remove(ClaudeCodeEnvVar); + + return await RunProcessAsync(startInfo, timeout, cancellationToken: cancellationToken); + } + finally + { + try { File.Delete(promptFile); } catch { /* best effort */ } + } + } + + /// + /// Unix path: pipes prompt via stdin (-p -). + /// + private async Task LaunchViaStdinAsync( + string prompt, + string workingDirectory, + TimeSpan timeout, + CancellationToken cancellationToken) + { + var startInfo = new ProcessStartInfo + { + FileName = "claude", + Arguments = $"-p - --model {EvalModelConstants.ClaudeModel} --allowedTools Read,Edit", + WorkingDirectory = workingDirectory, + RedirectStandardInput = true, + RedirectStandardOutput = true, + RedirectStandardError = true, + UseShellExecute = false, + CreateNoWindow = true + }; + + startInfo.Environment.Remove(ClaudeCodeEnvVar); + + return await RunProcessAsync(startInfo, timeout, stdinContent: prompt, cancellationToken: cancellationToken); + } +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/CodingAgentLauncherBase.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/CodingAgentLauncherBase.cs new file mode 100644 index 00000000..598592a8 --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/CodingAgentLauncherBase.cs @@ -0,0 +1,205 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using System.Diagnostics; +using System.Runtime.InteropServices; +using System.Text; +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; +using Microsoft.Extensions.Logging; + +namespace Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; + +/// +/// Shared process-launching mechanism for coding-agent CLIs. Subclasses supply the +/// per-engine policy (which binary, which flags, model ID, stdin vs temp-file, +/// toolset); this base owns the parts every engine shares: per-attempt timeout +/// scaling, availability probing, the Windows cmd.exe wrapper, and running and +/// killing the subprocess. +/// +internal abstract class CodingAgentLauncherBase : ICodingAgentLauncher +{ + // Observed on Copilot + Haiku: a tool evaluation needs ~60-90s of fixed overhead + // (CLI startup, session init, reading the checklist) plus ~15-20s per semantic + // check (read + reason + write, with several thinking rounds). The constants + // below give each attempt enough headroom without being so long that an agent + // stuck in a loop stalls the whole run. + private static readonly TimeSpan PerToolBaseTimeout = TimeSpan.FromSeconds(120); + private static readonly TimeSpan PerCheckTimeout = TimeSpan.FromSeconds(20); + private static readonly TimeSpan MinPerToolTimeout = TimeSpan.FromMinutes(3); + private static readonly TimeSpan MaxPerToolTimeout = TimeSpan.FromMinutes(20); + + /// + /// Returns a per-attempt timeout scaled to the number of semantic checks the + /// agent has to score. Clamped to [, + /// ]. + /// + internal static TimeSpan TimeoutForChecks(int checkCount) + { + var scaled = PerToolBaseTimeout + TimeSpan.FromSeconds(PerCheckTimeout.TotalSeconds * checkCount); + if (scaled < MinPerToolTimeout) return MinPerToolTimeout; + if (scaled > MaxPerToolTimeout) return MaxPerToolTimeout; + return scaled; + } + + private readonly CommandExecutor _executor; + + /// Logger for the concrete launcher (used for process diagnostics). + protected ILogger Logger { get; } + + protected CodingAgentLauncherBase(CommandExecutor executor, ILogger logger) + { + ArgumentNullException.ThrowIfNull(executor); + ArgumentNullException.ThrowIfNull(logger); + _executor = executor; + Logger = logger; + } + + /// + public abstract EvalEngine Engine { get; } + + /// + public abstract string DisplayName { get; } + + /// + public abstract SemanticCheckPrompts.AgentToolset Toolset { get; } + + /// + public abstract string CliCommand { get; } + + /// + public Task IsAvailableAsync(CancellationToken cancellationToken = default) + => ProbeCommandAsync(CliCommand, "--version", cancellationToken); + + /// + public abstract Task LaunchAsync(string prompt, string workingDirectory, TimeSpan timeout, CancellationToken cancellationToken = default); + + /// + /// Runs a process and waits for it to complete, capturing stdout/stderr. + /// Optionally pipes content via stdin. Kills the process on timeout to + /// prevent zombie processes from consuming resources or locking files. + /// + protected async Task RunProcessAsync( + ProcessStartInfo startInfo, + TimeSpan timeout, + string? stdinContent = null, + CancellationToken cancellationToken = default) + { + Process? process = null; + try + { + using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken); + timeoutCts.CancelAfter(timeout); + + process = new Process { StartInfo = startInfo }; + + var stdout = new StringBuilder(); + var stderr = new StringBuilder(); + process.OutputDataReceived += (_, e) => { if (e.Data is not null) stdout.AppendLine(e.Data); }; + process.ErrorDataReceived += (_, e) => { if (e.Data is not null) stderr.AppendLine(e.Data); }; + + process.Start(); + process.BeginOutputReadLine(); + process.BeginErrorReadLine(); + + // Pipe content via stdin if provided + if (stdinContent is not null && startInfo.RedirectStandardInput) + { + await process.StandardInput.WriteAsync(stdinContent); + process.StandardInput.Close(); + } + + await process.WaitForExitAsync(timeoutCts.Token); + + if (process.ExitCode == 0) + { + Logger.LogDebug("Coding agent ({Engine}) completed successfully", DisplayName); + return true; + } + + Logger.LogDebug("Coding agent ({Engine}) exited with code {ExitCode}", DisplayName, process.ExitCode); + if (stderr.Length > 0) + { + Logger.LogDebug("Agent stderr: {StdErr}", stderr.ToString().Trim()); + } + return false; + } + catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested) + { + // Kill the timed-out process to prevent zombie processes + KillProcess(process); + Logger.LogDebug("Coding agent ({Engine}) timed out after {Timeout}s", DisplayName, timeout.TotalSeconds); + return false; + } + finally + { + process?.Dispose(); + } + } + + private void KillProcess(Process? process) + { + if (process is null) + { + return; + } + + try + { + if (!process.HasExited) + { + process.Kill(entireProcessTree: true); + Logger.LogDebug("Killed timed-out {Engine} process tree", DisplayName); + } + } + catch (Exception ex) + { + Logger.LogDebug(ex, "Failed to kill {Engine} process", DisplayName); + } + } + + /// + /// Wraps command with cmd.exe /c on Windows for .cmd shim compatibility. + /// + protected static (string fileName, string arguments) WrapForPlatform(string command, string arguments) + { + if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows)) + { + return ("cmd.exe", $"/c {command} {arguments}"); + } + + return (command, arguments); + } + + /// + /// Probes whether a CLI tool is available by running it with --version. + /// + private async Task ProbeCommandAsync(string command, string arguments, CancellationToken cancellationToken) + { + try + { + var (cmd, args) = WrapForPlatform(command, arguments); + + var result = await _executor.ExecuteAsync( + cmd, args, + captureOutput: true, + suppressErrorLogging: true, + cancellationToken: cancellationToken); + + if (!result.Success) + { + // The probe ran but reported failure (non-zero exit) rather than being + // absent. Surface why at Debug so `-v` distinguishes "not installed" from + // "installed but broken" (auth failure, changed --version flag, etc.). + Logger.LogDebug("{Command} probe exited {ExitCode}; treating as unavailable. {StdErr}", + command, result.ExitCode, result.StandardError?.Trim()); + } + + return result.Success; + } + catch (Exception ex) + { + Logger.LogDebug(ex, "{Command} CLI detection failed", command); + return false; + } + } +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/EvalModelConstants.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/EvalModelConstants.cs new file mode 100644 index 00000000..954525e8 --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/EvalModelConstants.cs @@ -0,0 +1,36 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +namespace Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; + +/// +/// Single source of truth for the coding-agent model identifiers used during semantic +/// evaluation. Each model has a per-engine environment-variable override so a user can +/// move to a newer model without waiting for a CLI release. +/// +/// GitHub Copilot requires an exact model ID (no aliases); Claude Code accepts an +/// alias — so the two values are deliberately separate, not a single shared constant. +/// +internal static class EvalModelConstants +{ + /// Environment variable that overrides the GitHub Copilot model. + public const string CopilotModelEnvVar = "A365_EVAL_COPILOT_MODEL"; + + /// Environment variable that overrides the Claude Code model. + public const string ClaudeModelEnvVar = "A365_EVAL_CLAUDE_MODEL"; + + private const string DefaultCopilotModel = "claude-haiku-4.5"; + private const string DefaultClaudeModel = "haiku"; + + /// Copilot model ID; overridable via . + public static string CopilotModel => FromEnvOrDefault(CopilotModelEnvVar, DefaultCopilotModel); + + /// Claude model ID/alias; overridable via . + public static string ClaudeModel => FromEnvOrDefault(ClaudeModelEnvVar, DefaultClaudeModel); + + private static string FromEnvOrDefault(string envVar, string fallback) + { + var value = Environment.GetEnvironmentVariable(envVar); + return string.IsNullOrWhiteSpace(value) ? fallback : value; + } +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/EvaluationAnalyzer.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/EvaluationAnalyzer.cs new file mode 100644 index 00000000..1b42493d --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/EvaluationAnalyzer.cs @@ -0,0 +1,246 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using System.Globalization; +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; +using Microsoft.Extensions.Logging; + +namespace Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; + +/// +/// Orchestrates Step 4 of the evaluation pipeline: takes an evaluated checklist +/// and produces a containing per-tool scores, +/// toolset score, overall score, maturity level, and prioritized action items. +/// +internal sealed class EvaluationAnalyzer : IEvaluationAnalyzer +{ + private readonly ILogger _logger; + + public EvaluationAnalyzer(ILogger logger) + { + ArgumentNullException.ThrowIfNull(logger); + _logger = logger; + } + + /// + public SchemaEvalResult Analyze(EvaluationChecklist checklist, string evalEngine) + { + ArgumentNullException.ThrowIfNull(checklist); + evalEngine ??= string.Empty; + + _logger.LogDebug("Analyzing evaluation checklist for server {ServerName}", checklist.Metadata.ServerName); + + // Step 1: Build per-tool results + var toolResults = new List(); + foreach (var tool in checklist.Tools) + { + var toolResult = AnalyzeTool(tool); + toolResults.Add(toolResult); + } + + // Step 2: Compute toolset (server-level) result + var toolsetResult = AnalyzeToolset(checklist.ServerChecks); + + // Step 3: Compute overall score and category averages + float overallScore = Scorer.ComputeOverallScore(toolResults, toolsetResult.Score); + var categoryAverages = Scorer.ComputeCategoryAverages(toolResults); + + // Step 4: Determine maturity level + var maturity = MaturityCalculator.DetermineLevel(overallScore, categoryAverages); + + // Step 5: Aggregate all action items, sorted by priority + var allActionItems = new List(); + foreach (var toolResult in toolResults) + { + allActionItems.AddRange(toolResult.ActionItems); + } + + allActionItems.AddRange(toolsetResult.ActionItems); + allActionItems.Sort((a, b) => a.Priority.CompareTo(b.Priority)); + + // Step 6: Compute issue summary (issue ID to count of occurrences) + var issueSummary = ComputeIssueSummary(allActionItems); + + // Step 7: Compute action items by priority + var actionItemsByPriority = ComputeActionItemsByPriority(allActionItems); + + _logger.LogDebug( + "Analysis complete: overall score {OverallScore}, maturity level {MaturityLevel} ({MaturityLabel}), {ActionItemCount} action items", + overallScore, + maturity.Level, + maturity.Label, + allActionItems.Count); + + return new SchemaEvalResult + { + ServerName = checklist.Metadata.ServerName, + ServerUrl = checklist.Metadata.ServerUrl, + EvaluatedAt = DateTime.UtcNow, + OverallScore = overallScore, + Maturity = maturity, + ToolCount = checklist.Tools.Count, + ToolResults = toolResults, + ToolsetResult = toolsetResult, + AllActionItems = allActionItems, + CategoryAverages = categoryAverages, + ActionItemsByPriority = actionItemsByPriority, + IssueSummary = issueSummary, + EvalEngine = evalEngine, + }; + } + + /// + /// Analyzes a single tool's checklist, computing category scores, tool score, + /// action items, and detected issues. + /// + private static ToolEvalResult AnalyzeTool(ToolChecklist tool) + { + // Flatten all checks across categories for this tool + var allChecks = FlattenToolChecks(tool); + + // Compute per-category scores + var categoryScores = new Dictionary(); + + categoryScores["tool_name"] = Scorer.ComputeCategoryScore(tool.Checks.ToolName); + categoryScores["tool_description"] = Scorer.ComputeCategoryScore(tool.Checks.ToolDescription); + categoryScores["schema_structure"] = Scorer.ComputeCategoryScore(tool.Checks.SchemaStructure); + + // Aggregate param_name and param_description scores across all parameters + var allParamNameChecks = new List(); + var allParamDescriptionChecks = new List(); + + foreach (var paramGroup in tool.Checks.Parameters.Values) + { + allParamNameChecks.AddRange(paramGroup.ParamName); + allParamDescriptionChecks.AddRange(paramGroup.ParamDescription); + } + + categoryScores["param_name"] = Scorer.ComputeCategoryScore(allParamNameChecks); + categoryScores["param_description"] = Scorer.ComputeCategoryScore(allParamDescriptionChecks); + + // Compute tool score from category scores + float toolScore = Scorer.ComputeToolScore(categoryScores); + + // Generate action items from all checks + var actionItems = ActionItemGenerator.GenerateFromAllChecks(allChecks, tool.Name); + + // Collect unique issue ids from action items, sorted + var issuesDetected = actionItems + .SelectMany(a => a.IssueIds) + .Distinct() + .OrderBy(id => id) + .ToList(); + + // Count parameters from the input schema + int paramCount = tool.Checks.Parameters.Count; + + return new ToolEvalResult + { + ToolName = tool.Name, + ToolDescription = tool.Description, + ParamCount = paramCount, + Score = toolScore, + CategoryScores = categoryScores, + Checks = allChecks, + ActionItems = actionItems, + IssuesDetected = issuesDetected, + InputSchema = tool.InputSchema, + }; + } + + /// + /// Flattens all checks from a tool's check groups into a single list. + /// Includes ToolName, ToolDescription, SchemaStructure, and all parameter checks. + /// + private static List FlattenToolChecks(ToolChecklist tool) + { + var checks = new List(); + + checks.AddRange(tool.Checks.ToolName); + checks.AddRange(tool.Checks.ToolDescription); + checks.AddRange(tool.Checks.SchemaStructure); + + foreach (var paramGroup in tool.Checks.Parameters.Values) + { + checks.AddRange(paramGroup.ParamName); + checks.AddRange(paramGroup.ParamDescription); + } + + return checks; + } + + /// + /// Analyzes toolset-level (server/cross-tool) checks, computing score and action items. + /// + private static ToolsetEvalResult AnalyzeToolset(List serverChecks) + { + if (serverChecks is null || serverChecks.Count == 0) + { + return new ToolsetEvalResult + { + Score = 100f, + Checks = [], + ActionItems = [], + }; + } + + float score = Scorer.ComputeCategoryScore(serverChecks); + var actionItems = ActionItemGenerator.GenerateFromAllChecks(serverChecks, null); + + return new ToolsetEvalResult + { + Score = score, + Checks = serverChecks, + ActionItems = actionItems, + }; + } + + /// + /// Computes a summary of issue occurrences across all action items. + /// Returns a dictionary of issue name to occurrence count. + /// + private static Dictionary ComputeIssueSummary(List actionItems) + { + var issueCounts = new Dictionary(); + foreach (var item in actionItems) + { + foreach (int issueId in item.IssueIds) + { + issueCounts[issueId] = issueCounts.GetValueOrDefault(issueId) + 1; + } + } + + var summary = new Dictionary(); + foreach (var (issueId, count) in issueCounts.OrderByDescending(kvp => kvp.Value)) + { + string name = IssueTaxonomy.Definitions.TryGetValue(issueId, out var issue) + ? issue.Name + : issueId.ToString(CultureInfo.InvariantCulture); + summary[name] = count; + } + + return summary; + } + + /// + /// Computes the count of action items per priority level. + /// + private static Dictionary ComputeActionItemsByPriority(List actionItems) + { + var counts = new Dictionary + { + ["P0"] = 0, + ["P1"] = 0, + ["P2"] = 0, + ["P3"] = 0, + }; + + foreach (var item in actionItems) + { + string key = item.Priority.ToString(); + counts[key] = counts.GetValueOrDefault(key) + 1; + } + + return counts; + } +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/EvaluationPipelineService.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/EvaluationPipelineService.cs new file mode 100644 index 00000000..eaa0ad47 --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/EvaluationPipelineService.cs @@ -0,0 +1,342 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using System.Text.Json; +using Microsoft.Agents.A365.DevTools.Cli.Constants; +using Microsoft.Agents.A365.DevTools.Cli.Exceptions; +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; +using Microsoft.Extensions.Logging; + +namespace Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; + +/// +/// Orchestrates the full MCP tool schema evaluation pipeline: +/// discovery, checklist generation, evaluation, analysis, and report generation. +/// +public sealed class EvaluationPipelineService : IEvaluationPipelineService +{ + private readonly ILogger _logger; + private readonly ISchemaDiscoveryService _discoveryService; + private readonly IChecklistGenerator _checklistGenerator; + private readonly IChecklistEvaluator _checklistEvaluator; + private readonly IEvaluationAnalyzer _evaluationAnalyzer; + private readonly IReportGenerator _reportGenerator; + private readonly IAgent365ToolingService _toolingService; + + public EvaluationPipelineService( + ILogger logger, + ISchemaDiscoveryService discoveryService, + IChecklistGenerator checklistGenerator, + IChecklistEvaluator checklistEvaluator, + IEvaluationAnalyzer evaluationAnalyzer, + IReportGenerator reportGenerator, + IAgent365ToolingService toolingService) + { + ArgumentNullException.ThrowIfNull(logger); + ArgumentNullException.ThrowIfNull(discoveryService); + ArgumentNullException.ThrowIfNull(checklistGenerator); + ArgumentNullException.ThrowIfNull(checklistEvaluator); + ArgumentNullException.ThrowIfNull(evaluationAnalyzer); + ArgumentNullException.ThrowIfNull(reportGenerator); + ArgumentNullException.ThrowIfNull(toolingService); + _logger = logger; + _discoveryService = discoveryService; + _checklistGenerator = checklistGenerator; + _checklistEvaluator = checklistEvaluator; + _evaluationAnalyzer = evaluationAnalyzer; + _reportGenerator = reportGenerator; + _toolingService = toolingService; + } + + /// + public async Task RunAsync(string serverUrl, string outputDir, string evalEngine, string? authToken, CancellationToken cancellationToken) + { + try + { + // Validate the output directory up front so an explicit empty value fails fast + // with a targeted error instead of a deep exception from Directory.CreateDirectory + // late in the run. + if (string.IsNullOrWhiteSpace(outputDir)) + { + _logger.LogError("--output-dir cannot be empty or whitespace."); + return 1; + } + + // Best-effort usage telemetry, time-bounded so a slow or hung endpoint cannot + // stall the user's run before discovery starts. Identity is extracted server-side + // from the bearer token; the CLI does not pass the evaluated server URL (customer + // content). Failures (swallowed inside the service) and the timeout are ignored. + using (var telemetryCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken)) + { + telemetryCts.CancelAfter(TimeSpan.FromSeconds(5)); + try + { + await _toolingService.LogEvaluateUsageAsync(telemetryCts.Token); + } + catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested) + { + // Telemetry timed out — non-fatal; continue with the evaluation. + } + } + + var engine = ParseEvalEngine(evalEngine); + + // Trust boundary: when an agent will run, the server's tool names and + // descriptions are handed to a locally running coding agent. Surface this + // once per run. (--eval-engine none keeps everything local, so it doesn't apply.) + if (engine != EvalEngine.None) + { + _logger.LogWarning("The server's tool names and descriptions are sent to a locally running coding agent for scoring. Only evaluate MCP servers you trust."); + } + + // Brief intro so first-time users know what backing service this needs. + if (engine == EvalEngine.Auto) + { + _logger.LogInformation("Semantic checks are scored by a locally installed coding agent (GitHub Copilot or Claude Code)."); + _logger.LogInformation("If neither is installed, the run will stop after generating the checklist and print steps to score it with your own LLM."); + _logger.LogInformation(""); + } + + // Derive checklist path first so we can detect an in-progress evaluation. + // Run the derived name through the same sanitizer as the report filename so + // any invalid-for-filesystem characters (?, *, <, etc.) from the fallback path + // don't crash Path.Combine / File.Exists downstream. + var serverName = DeriveServerName(serverUrl); + var safeServerName = ReportGenerator.SanitizeFileName(serverName); + var checklistPath = Path.Combine(outputDir, $"{safeServerName}_checklist.json"); + + EvaluationChecklist checklist; + + if (File.Exists(checklistPath)) + { + // Resume path: an earlier run wrote this checklist; treat it as the source of truth. + // This is how the bring-your-own-LLM workflow round-trips: user scored the file, + // re-runs the same command, and we pick up where they left off. + _logger.LogInformation("[1/5] Resuming from existing checklist at {Path}", checklistPath); + checklist = await LoadChecklistAsync(checklistPath, cancellationToken); + _logger.LogInformation(" Loaded {ToolCount} tool{Plural} (skipping server discovery — delete the file to re-discover)", + checklist.Tools.Count, checklist.Tools.Count == 1 ? "" : "s"); + + var totalSemanticChecks = CountSemanticChecks(checklist); + _logger.LogInformation("[2/5] Checklist has {Count} semantic check{Plural}", totalSemanticChecks, totalSemanticChecks == 1 ? "" : "s"); + } + else + { + // Fresh run: discover the server and generate a new checklist. + _logger.LogInformation("[1/5] Discovering tools from {ServerUrl}", serverUrl); + var tools = await _discoveryService.DiscoverToolsAsync(serverUrl, authToken, cancellationToken); + _logger.LogInformation(" Found {ToolCount} tool{Plural}", tools.Count, tools.Count == 1 ? "" : "s"); + + checklist = _checklistGenerator.Generate(tools, serverName, serverUrl); + var totalSemanticChecks = CountSemanticChecks(checklist); + _logger.LogInformation("[2/5] Generated evaluation checklist ({Count} semantic checks)", totalSemanticChecks); + } + + // Step 3: Semantic Evaluation + _logger.LogInformation("[3/5] Running semantic evaluation"); + var evalResult = await _checklistEvaluator.EvaluateAsync(checklist, checklistPath, engine, cancellationToken); + checklist = evalResult.Checklist; + + if (evalResult.Outcome != EvaluationOutcome.Completed) + { + // Semantic evaluation couldn't complete (no agent, partial scoring, etc.). + // Stop before analysis — proceeding with null scores would produce an + // inflated report (Scorer treats unscored categories as 100). + // ChecklistEvaluator has already printed the detailed "pick one" guidance; + // here we just append the concrete re-run command that carries their flags. + _logger.LogInformation(" Re-run command: a365 develop-mcp evaluate --server-url {Url} --output-dir {OutDir}", + serverUrl, outputDir); + + // OptedOut (--eval-engine none) is a deliberate bring-your-own-LLM stop, so + // exit 0. "No agent" or "agent left checks unscored" means we could not do + // what was asked — exit 1 so CI sees a failure, not a silent success. + return evalResult.Outcome == EvaluationOutcome.OptedOut ? 0 : 1; + } + + // Step 4: Analysis + // Persist the human-readable display name ("GitHub Copilot", "Claude Code") + // in the report instead of the raw enum identifier so downstream consumers + // don't have to map "GitHubCopilot" back to something user-facing. Prefer + // the engine that actually produced evaluations over the user's request, + // so --eval-engine auto reports as "GitHub Copilot" (or whichever ran) + // instead of the meaningless "auto". + var engineName = _checklistEvaluator.FormatEngineName(evalResult.EngineUsed ?? engine); + var result = _evaluationAnalyzer.Analyze(checklist, engineName); + _logger.LogInformation( + "[4/5] Analysis complete: score {Score}/100, Level {Level} ({Label}), {ActionCount} action item{Plural}", + result.OverallScore.ToString("F1"), + result.Maturity.Level, + result.Maturity.Label, + result.AllActionItems.Count, + result.AllActionItems.Count == 1 ? "" : "s"); + + // Step 5: Report Generation + _logger.LogInformation("[5/5] Writing reports"); + await _reportGenerator.GenerateAsync(result, outputDir); + + _logger.LogInformation(""); + _logger.LogInformation( + "Done. Score: {Score}/100 | Level {Level} ({Label})", + result.OverallScore.ToString("F0"), + result.Maturity.Level, + result.Maturity.Label); + + return 0; + } + catch (EvaluationException) + { + throw; + } + catch (Exception ex) when (ex is not Agent365Exception) + { + _logger.LogError(ex, "Evaluation failed unexpectedly: {Message}", ex.Message); + throw new EvaluationException( + ErrorCodes.EvaluationFailed, + "Evaluation failed unexpectedly.", + errorDetails: new List { ex.Message }, + mitigationSteps: new List + { + "Verify the MCP server is running and accessible.", + "Check the output directory is writable." + }, + innerException: ex); + } + } + + private static readonly JsonSerializerOptions ChecklistReadOptions = new() + { + AllowTrailingCommas = true, + ReadCommentHandling = JsonCommentHandling.Skip, + PropertyNameCaseInsensitive = true, + }; + + /// + /// Loads an existing checklist from disk. Used on re-runs where the user has + /// already scored (or partially scored) the file with their own LLM. + /// + private static async Task LoadChecklistAsync(string path, CancellationToken cancellationToken) + { + string json; + try + { + json = await File.ReadAllTextAsync(path, cancellationToken); + } + catch (Exception ex) + { + throw new EvaluationException( + ErrorCodes.EvaluationFailed, + $"Failed to read existing checklist at '{path}'.", + errorDetails: new List { ex.Message }, + mitigationSteps: new List + { + "Verify the file is readable and not locked by another process.", + "Delete the file to force a fresh discovery on the next run." + }, + innerException: ex); + } + + EvaluationChecklist? checklist; + try + { + checklist = JsonSerializer.Deserialize(json, ChecklistReadOptions); + } + catch (JsonException ex) + { + throw new EvaluationException( + ErrorCodes.EvaluationFailed, + $"Existing checklist at '{path}' is not valid JSON.", + errorDetails: new List { ex.Message }, + mitigationSteps: new List + { + "Validate the JSON with your editor or an online linter.", + "Delete the file to force a fresh discovery on the next run." + }, + innerException: ex); + } + + if (checklist is null) + { + throw new EvaluationException( + ErrorCodes.EvaluationFailed, + $"Existing checklist at '{path}' deserialized to null.", + mitigationSteps: new List + { + "Delete the file to force a fresh discovery on the next run." + }); + } + + return checklist; + } + + /// + /// Counts semantic checks across the full checklist (tool-level + server-level). + /// + private static int CountSemanticChecks(EvaluationChecklist checklist) + { + int count = 0; + foreach (var tool in checklist.Tools) + { + count += tool.Checks.ToolName.Count(c => c.Type == CheckType.Semantic); + count += tool.Checks.ToolDescription.Count(c => c.Type == CheckType.Semantic); + count += tool.Checks.SchemaStructure.Count(c => c.Type == CheckType.Semantic); + foreach (var param in tool.Checks.Parameters.Values) + { + count += param.ParamName.Count(c => c.Type == CheckType.Semantic); + count += param.ParamDescription.Count(c => c.Type == CheckType.Semantic); + } + } + count += checklist.ServerChecks.Count(c => c.Type == CheckType.Semantic); + return count; + } + + /// + /// Parses an eval engine string into the corresponding enum value. + /// + internal static EvalEngine ParseEvalEngine(string value) + { + return value.ToLowerInvariant() switch + { + "auto" => EvalEngine.Auto, + "github-copilot" => EvalEngine.GitHubCopilot, + "claude-code" => EvalEngine.ClaudeCode, + "none" => EvalEngine.None, + _ => throw new EvaluationException( + ErrorCodes.EvaluationFailed, + $"Unknown eval engine: '{value}'.", + mitigationSteps: new List + { + "Use one of: auto, github-copilot, claude-code, none" + }) + }; + } + + /// + /// Derives a filesystem-safe server name from the server URL (host part). + /// + internal static string DeriveServerName(string serverUrl) + { + try + { + var uri = new Uri(serverUrl); + var host = uri.Host.Replace('.', '-').Replace(':', '-'); + + if (!uri.IsDefaultPort) + { + host = $"{host}-{uri.Port}"; + } + + return host; + } + catch (UriFormatException) + { + var sanitized = serverUrl + .Replace("://", "-") + .Replace("/", "-") + .Replace(":", "-") + .Replace(".", "-") + .TrimEnd('-'); + + return string.IsNullOrWhiteSpace(sanitized) ? "unknown-server" : sanitized; + } + } +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/GitHubCopilotLauncher.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/GitHubCopilotLauncher.cs new file mode 100644 index 00000000..d7ee4bb6 --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/GitHubCopilotLauncher.cs @@ -0,0 +1,87 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using System.Diagnostics; +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; +using Microsoft.Extensions.Logging; + +namespace Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; + +/// +/// Launches GitHub Copilot to score semantic checks. Copilot does not support stdin +/// piping, so the prompt is always written to a temp file and referenced via -p. +/// +internal sealed class GitHubCopilotLauncher : CodingAgentLauncherBase +{ + public GitHubCopilotLauncher(CommandExecutor executor, ILogger logger) + : base(executor, logger) + { + } + + public override EvalEngine Engine => EvalEngine.GitHubCopilot; + + public override string DisplayName => "GitHub Copilot"; + + public override SemanticCheckPrompts.AgentToolset Toolset => new(ReadToolName: "view", EditToolName: "edit"); + + public override string CliCommand => "copilot"; + + public override async Task LaunchAsync( + string prompt, + string workingDirectory, + TimeSpan timeout, + CancellationToken cancellationToken = default) + { + ArgumentException.ThrowIfNullOrWhiteSpace(prompt); + ArgumentException.ThrowIfNullOrWhiteSpace(workingDirectory); + + // Write prompt to a temp file since Copilot doesn't support stdin piping + var promptFile = Path.Combine(workingDirectory, $".eval_prompt_{Guid.NewGuid():N}.txt"); + try + { + await File.WriteAllTextAsync(promptFile, prompt, cancellationToken); + + var metaPrompt = $"Read and follow the instructions in the file at: {promptFile}"; + // Security model: allow the full tool set EXCEPT subprocess execution and + // outbound network. The agent can pick any read/write/search strategy + // against files in its sandboxed cwd, but cannot shell out, hit the web, + // or exfiltrate the checklist to an arbitrary URL. Copilot's shell tool is + // named `shell` on macOS/Linux and `powershell` on Windows (plus a family + // of session helpers); we deny every variant so the flag is correct on + // every platform. File access is already bounded by Copilot's default path + // verification to the current working directory, which is an isolated temp + // sandbox — so view/create/edit stay confined. + var (fileName, fileArguments) = WrapForPlatform( + "copilot", + $"-p \"{metaPrompt}\" --model {EvalModelConstants.CopilotModel} --allow-all-tools " + + // Restrict visible tools to just read + edit. `create` is specifically + // excluded because Copilot's create cannot overwrite existing files and + // exposing it leads the model down workaround loops (sibling files, + // retries, etc.) instead of the straightforward str_replace flow. + "--available-tools=view,edit " + + "--deny-tool=shell --deny-tool=write_shell --deny-tool=read_shell " + + "--deny-tool=stop_shell --deny-tool=list_shell " + + "--deny-tool=powershell --deny-tool=write_powershell --deny-tool=read_powershell " + + "--deny-tool=stop_powershell --deny-tool=list_powershell " + + "--deny-tool=web_fetch --deny-tool=web_search --no-ask-user"); + + var startInfo = new ProcessStartInfo + { + FileName = fileName, + Arguments = fileArguments, + WorkingDirectory = workingDirectory, + RedirectStandardOutput = true, + RedirectStandardError = true, + UseShellExecute = false, + CreateNoWindow = true + }; + + return await RunProcessAsync(startInfo, timeout, cancellationToken: cancellationToken); + } + finally + { + // Clean up the temp prompt file + try { File.Delete(promptFile); } catch { /* best effort */ } + } + } +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IChecklistEvaluator.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IChecklistEvaluator.cs new file mode 100644 index 00000000..b8d2d4a2 --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IChecklistEvaluator.cs @@ -0,0 +1,77 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; + +namespace Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; + +/// +/// Evaluates an by running semantic checks +/// through a coding agent CLI (Claude Code or GitHub Copilot). +/// This is Step 3 of the evaluation pipeline. +/// +public interface IChecklistEvaluator +{ + /// + /// Evaluates semantic checks in the checklist using a coding agent CLI. + /// + /// The checklist with deterministic checks already scored. + /// Path where the checklist JSON file will be written for the agent to read. + /// The evaluation engine to use for semantic checks. + /// Token to cancel the evaluation. + /// Result containing the checklist and whether semantic evaluation completed. + Task EvaluateAsync(EvaluationChecklist checklist, string checklistPath, EvalEngine engine, CancellationToken cancellationToken = default); + + /// + /// Returns the user-facing display name for an engine (e.g. "GitHub Copilot", + /// "auto"). Engine names are sourced from the registered launchers so adding a + /// new agent does not require editing a name switch. + /// + string FormatEngineName(EvalEngine engine); +} + +/// +/// Why semantic evaluation stopped, so callers can map the outcome to an exit code. +/// +public enum EvaluationOutcome +{ + /// All semantic checks were scored; a report can be produced. + Completed, + + /// + /// The user passed --eval-engine none to score the checklist with their + /// own LLM. An intentional stop, not a failure — the run exits 0. + /// + OptedOut, + + /// + /// The evaluation could not be performed as requested: no coding agent was + /// available, or an agent ran but left checks unscored. A failure — exit 1. + /// + CouldNotEvaluate +} + +/// +/// Result of checklist evaluation, indicating whether semantic checks were evaluated. +/// +public class ChecklistEvaluationResult +{ + public EvaluationChecklist Checklist { get; init; } = new(); + + /// Why evaluation stopped. Drives the command's exit code. + public EvaluationOutcome Outcome { get; init; } + + /// + /// True only when every semantic check was scored. Derived from + /// ; the pipeline gates report generation on this. + /// + public bool SemanticEvaluationCompleted => Outcome == EvaluationOutcome.Completed; + + /// + /// The engine that actually produced successful evaluations (first in priority + /// order among engines that ran successfully). Null when no agent ran or all + /// engines failed. Callers can use this to stamp reports with the engine that + /// actually did the work, rather than whatever the user requested (e.g. "auto"). + /// + public EvalEngine? EngineUsed { get; init; } +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IChecklistGenerator.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IChecklistGenerator.cs new file mode 100644 index 00000000..94f1275b --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IChecklistGenerator.cs @@ -0,0 +1,27 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; + +namespace Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; + +/// +/// Generates an evaluation checklist from discovered MCP tool schemas. +/// The checklist is the intermediate artifact between schema discovery and evaluation. +/// Deterministic checks are pre-filled with scores; semantic checks have null scores +/// to be evaluated later by a coding agent or human reviewer. +/// +public interface IChecklistGenerator +{ + /// + /// Generates a complete evaluation checklist for the given tool schemas. + /// + /// The tool schemas discovered from the MCP server. + /// Display name of the MCP server being evaluated. + /// Connection URL or path used to discover the server. + /// + /// An containing per-tool checks (deterministic and semantic) + /// and server-level checks. Deterministic checks have pre-filled scores; semantic checks have null scores. + /// + EvaluationChecklist Generate(List tools, string serverName, string serverUrl); +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ICodingAgentLauncher.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ICodingAgentLauncher.cs new file mode 100644 index 00000000..c24725d1 --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ICodingAgentLauncher.cs @@ -0,0 +1,46 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; + +namespace Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; + +/// +/// A single coding-agent CLI (e.g. GitHub Copilot, Claude Code) that can report its +/// own availability and run a semantic-evaluation prompt as a subprocess. +/// +/// One implementation per engine. Adding a new agent is a new file plus a DI +/// registration in Program.cs — not edits to switches spread across the +/// evaluator and a central runner. The registration order defines the priority +/// that --eval-engine auto walks. +/// +internal interface ICodingAgentLauncher +{ + /// The engine this launcher implements. + EvalEngine Engine { get; } + + /// Human-readable name for reports and logs (e.g. "GitHub Copilot"). + string DisplayName { get; } + + /// + /// The agent's concrete read/edit tool names, injected into the evaluation + /// prompt so the instructions name the exact tools the agent exposes + /// (Copilot: view/edit, Claude Code: Read/Edit). + /// + SemanticCheckPrompts.AgentToolset Toolset { get; } + + /// The CLI binary name on PATH (e.g. "copilot", "claude"). + string CliCommand { get; } + + /// + /// Returns true when the agent's CLI is installed and responds on PATH. + /// + Task IsAvailableAsync(CancellationToken cancellationToken = default); + + /// + /// Runs the agent against with its working directory + /// set to (the evaluation sandbox). + /// Returns true when the subprocess exits 0. + /// + Task LaunchAsync(string prompt, string workingDirectory, TimeSpan timeout, CancellationToken cancellationToken = default); +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IEvaluationAnalyzer.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IEvaluationAnalyzer.cs new file mode 100644 index 00000000..8602c913 --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IEvaluationAnalyzer.cs @@ -0,0 +1,22 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; + +namespace Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; + +/// +/// Analyzes an evaluated checklist and produces the final . +/// This is Step 4 of the evaluation pipeline: scoring, maturity determination, +/// action item generation, and issue aggregation. +/// +public interface IEvaluationAnalyzer +{ + /// + /// Analyzes the evaluated checklist and produces a complete evaluation result. + /// + /// The evaluation checklist with all checks scored. + /// The evaluation engine used (e.g., "GitHub Copilot", "Claude Code", "none"). + /// A fully populated . + SchemaEvalResult Analyze(EvaluationChecklist checklist, string evalEngine); +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IEvaluationPipelineService.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IEvaluationPipelineService.cs new file mode 100644 index 00000000..591cfc71 --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IEvaluationPipelineService.cs @@ -0,0 +1,22 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +namespace Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; + +/// +/// Orchestrates the full MCP tool schema evaluation pipeline: +/// discovery, checklist generation, evaluation, analysis, and report generation. +/// +public interface IEvaluationPipelineService +{ + /// + /// Runs the evaluation pipeline against an MCP server. + /// + /// MCP server Streamable HTTP endpoint URL. + /// Output directory for evaluation artifacts. + /// Coding agent engine name (auto, github-copilot, claude-code, none). + /// Optional bearer token for MCP server authentication. + /// Cancellation token. + /// Process exit code: 0 on success or an intentional bring-your-own-LLM stop; 1 when evaluation could not be performed as requested. + Task RunAsync(string serverUrl, string outputDir, string evalEngine, string? authToken, CancellationToken cancellationToken); +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IReportGenerator.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IReportGenerator.cs new file mode 100644 index 00000000..57b73d90 --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IReportGenerator.cs @@ -0,0 +1,21 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; + +namespace Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; + +/// +/// Generates evaluation reports (JSON and HTML) from a . +/// This is Step 5 of the evaluation pipeline: report generation and browser launch. +/// +public interface IReportGenerator +{ + /// + /// Generates JSON and HTML reports in the specified output directory. + /// + /// The evaluation result to render. + /// Directory where report files will be written. + /// Whether to open the HTML report in the default browser. + Task GenerateAsync(SchemaEvalResult result, string outputDir, bool openInBrowser = true); +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ISchemaDiscoveryService.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ISchemaDiscoveryService.cs new file mode 100644 index 00000000..229cc53a --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ISchemaDiscoveryService.cs @@ -0,0 +1,23 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; + +namespace Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; + +/// +/// Discovers MCP tool schemas from a running MCP server using the Streamable HTTP transport. +/// This is Step 1 of the evaluation pipeline. +/// +public interface ISchemaDiscoveryService +{ + /// + /// Connects to an MCP server via Streamable HTTP (JSON-RPC 2.0), + /// performs the initialize handshake, and retrieves the list of tool schemas. + /// + /// The MCP server Streamable HTTP endpoint URL. + /// Optional Bearer token for server authentication. + /// Cancellation token for the operation. + /// A list of discovered from the server. + Task> DiscoverToolsAsync(string serverUrl, string? authToken = null, CancellationToken cancellationToken = default); +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IssueTaxonomy.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IssueTaxonomy.cs new file mode 100644 index 00000000..93d11c57 --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IssueTaxonomy.cs @@ -0,0 +1,219 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; + +namespace Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; + +/// +/// Catalog of known schema-quality issues for MCP tool schemas, each with an +/// id, category, description, and the areas it impacts. Checklist items +/// reference these ids via IssueIds so the report can link every +/// failed check back to the concrete issue it represents. +/// +internal static class IssueTaxonomy +{ + /// + /// All known issues indexed by their id. + /// + public static readonly Dictionary Definitions = new() + { + // -- Accuracy -- + + [1] = new IssueDefinition + { + Id = 1, + Name = "Incorrect parameter semantics", + Category = IssueCategory.Accuracy, + Description = "Description says one thing, tool does another", + Impact = "LLM provides structurally valid but semantically wrong arguments", + ImpactAreas = [ImpactArea.ParamAccuracy], + }, + [2] = new IssueDefinition + { + Id = 2, + Name = "Misleading behavior claims", + Category = IssueCategory.Accuracy, + Description = "Tool can't do what description promises", + Impact = "LLM selects tool for unsupported operations, causing failures", + ImpactAreas = [ImpactArea.ToolSelection], + }, + [3] = new IssueDefinition + { + Id = 3, + Name = "Wrong default values documented", + Category = IssueCategory.Accuracy, + Description = "Actual defaults differ from described defaults", + Impact = "LLM omits parameters expecting documented default, gets unexpected behavior", + ImpactAreas = [ImpactArea.ParamAccuracy], + }, + + // -- Functionality -- + + [4] = new IssueDefinition + { + Id = 4, + Name = "Missing purpose statement", + Category = IssueCategory.Functionality, + Description = "No verb phrase explaining what the tool does", + Impact = "LLM cannot determine when to use the tool; selection drops sharply", + ImpactAreas = [ImpactArea.ToolSelection], + }, + [5] = new IssueDefinition + { + Id = 5, + Name = "Missing usage guidelines", + Category = IssueCategory.Functionality, + Description = "No 'use this when...' conditional guidance", + Impact = "LLM applies tool in wrong context (e.g., search vs list)", + ImpactAreas = [ImpactArea.ToolSelection], + }, + [6] = new IssueDefinition + { + Id = 6, + Name = "Missing limitation statements", + Category = IssueCategory.Functionality, + Description = "No 'this tool does not...' negation", + Impact = "LLM attempts impossible operations (e.g., delete via read-only tool)", + ImpactAreas = [ImpactArea.ToolSelection, ImpactArea.Completeness], + }, + [7] = new IssueDefinition + { + Id = 7, + Name = "Missing error behavior documentation", + Category = IssueCategory.Functionality, + Description = "No failure mode or error response descriptions", + Impact = "LLM cannot handle errors gracefully or retry appropriately", + ImpactAreas = [ImpactArea.Completeness], + }, + + // -- Completeness -- + + [8] = new IssueDefinition + { + Id = 8, + Name = "Missing return value documentation", + Category = IssueCategory.Completeness, + Description = "No output description for tool results", + Impact = "LLM misinterprets output, causing cascading failures in multi-step chains", + ImpactAreas = [ImpactArea.Completeness], + }, + [9] = new IssueDefinition + { + Id = 9, + Name = "Missing parameter descriptions", + Category = IssueCategory.Completeness, + Description = "Parameters without explanation", + Impact = "LLM must guess what each parameter means from name alone", + ImpactAreas = [ImpactArea.ParamAccuracy, ImpactArea.Completeness], + }, + [10] = new IssueDefinition + { + Id = 10, + Name = "Missing examples", + Category = IssueCategory.Completeness, + Description = "No concrete usage demonstrations", + Impact = "Reduced comprehension for complex input structures or unusual formats", + ImpactAreas = [ImpactArea.ParamAccuracy, ImpactArea.Completeness], + }, + [11] = new IssueDefinition + { + Id = 11, + Name = "Missing format specifications", + Category = IssueCategory.Completeness, + Description = "Date/time/ID formats undocumented", + Impact = "LLM guesses format -- '2026-03-23' vs 'March 23' vs '03/23/26'", + ImpactAreas = [ImpactArea.ParamAccuracy], + }, + [12] = new IssueDefinition + { + Id = 12, + Name = "Missing prerequisite documentation", + Category = IssueCategory.Completeness, + Description = "Dependencies and prerequisites unstated", + Impact = "LLM invokes tool without required prior steps, causing failures", + ImpactAreas = [ImpactArea.Completeness], + }, + + // -- Conciseness -- + + [13] = new IssueDefinition + { + Id = 13, + Name = "Tool name repeated in description", + Category = IssueCategory.Conciseness, + Description = "Description restates tool name without adding info", + Impact = "Zero added information; wastes context window tokens", + ImpactAreas = [ImpactArea.Conciseness], + }, + [14] = new IssueDefinition + { + Id = 14, + Name = "Excessive boilerplate", + Category = IssueCategory.Conciseness, + Description = "Generic text not specific to the tool", + Impact = "Dilutes useful information and inflates step count for over-specified descriptions", + ImpactAreas = [ImpactArea.Conciseness], + }, + [15] = new IssueDefinition + { + Id = 15, + Name = "Redundant parameter re-description", + Category = IssueCategory.Conciseness, + Description = "Tool description re-describes parameters already described in schema", + Impact = "Wastes tokens, may create conflicting descriptions", + ImpactAreas = [ImpactArea.Conciseness], + }, + [16] = new IssueDefinition + { + Id = 16, + Name = "Overly technical jargon", + Category = IssueCategory.Conciseness, + Description = "Implementation details instead of behavior descriptions", + Impact = "LLM focuses on internal mechanics rather than user-facing outcomes", + ImpactAreas = [ImpactArea.Conciseness, ImpactArea.ToolSelection], + }, + + // -- Cross-tool consistency -- + + [17] = new IssueDefinition + { + Id = 17, + Name = "Inconsistent terminology across tools", + Category = IssueCategory.Accuracy, + Description = "Same concept named differently in different tools", + Impact = "LLM uses wrong parameter values when chaining tools together", + ImpactAreas = [ImpactArea.ParamAccuracy, ImpactArea.ToolSelection], + }, + [18] = new IssueDefinition + { + Id = 18, + Name = "Ambiguous scope of operation", + Category = IssueCategory.Functionality, + Description = "Unclear whether tool operates on single item, collection, or hierarchy", + Impact = "LLM calls tool with wrong cardinality expectations", + ImpactAreas = [ImpactArea.ToolSelection, ImpactArea.ParamAccuracy], + }, + }; + + /// + /// Returns an impact map keyed by issue id (as string) for the HTML report. + /// Each entry provides the issue name, category, impact description, and affected areas. + /// + public static Dictionary GetImpactMap() + { + var map = new Dictionary(); + foreach (var (id, issue) in Definitions) + { + map[id.ToString(System.Globalization.CultureInfo.InvariantCulture)] = new IssueImpactInfo + { + Name = issue.Name, + Category = issue.Category.ToString(), + Impact = issue.Impact, + Areas = issue.ImpactAreas.Select(a => a.ToString()).ToList(), + }; + } + + return map; + } +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/MaturityCalculator.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/MaturityCalculator.cs new file mode 100644 index 00000000..b4da53da --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/MaturityCalculator.cs @@ -0,0 +1,198 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; + +namespace Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; + +/// +/// Determines MCP server maturity level (0-4) from overall score and category averages. +/// Inspired by the Richardson Maturity Model for REST APIs, adapted for AI agent consumption. +/// Score thresholds map to levels, but weak critical categories cap the achievable level. +/// +public static class MaturityCalculator +{ + /// + /// Level definitions with label and description. + /// Index corresponds to the level number (0-4). + /// + private static readonly (string Label, string Description)[] LevelDefinitions = + [ + ( + "Functional", + "Tools exist with names and minimal schemas. " + + "Major quality gaps make reliable AI agent usage unlikely." + ), + ( + "Described", + "All tools and parameters have meaningful descriptions. " + + "Input/output schemas are fully defined." + ), + ( + "Consistent", + "Naming conventions followed across all tools. " + + "Error handling documented. Cross-tool consistency maintained." + ), + ( + "Optimized for AI", + "Descriptions tuned for LLM comprehension. " + + "Disambiguation between similar tools. " + + "Defensive parameter constraints. Structured output schemas." + ), + ( + "Exemplary", + "Usage examples included. Semantic tool grouping. " + + "Complete intent coverage for domain. " + + "Versioned and backward-compatible." + ), + ]; + + /// + /// Determines the maturity level from the overall score and category averages. + /// Score thresholds: Level 0 (< 40), Level 1 (40-59), Level 2 (60-74), Level 3 (75-89), Level 4 (90+). + /// Category caps prevent inflated levels when critical categories are weak: + /// tool_description avg < 50 caps at Level 1, param_description avg < 60 caps at Level 2, + /// tool_name avg < 75 caps at Level 3. + /// + /// Overall server score (0-100). + /// Average scores per category across all tools. + /// Maturity level with label, description, and requirements for next level. + public static MaturityLevel DetermineLevel(float overallScore, Dictionary categoryAverages) + { + categoryAverages ??= []; + + // Determine score-based level + int level; + if (overallScore >= 90f) + { + level = 4; + } + else if (overallScore >= 75f) + { + level = 3; + } + else if (overallScore >= 60f) + { + level = 2; + } + else if (overallScore >= 40f) + { + level = 1; + } + else + { + level = 0; + } + + // Apply category-based caps + float descriptionAvg = categoryAverages.GetValueOrDefault("tool_description", 0f); + float paramDescriptionAvg = categoryAverages.GetValueOrDefault("param_description", 0f); + float nameAvg = categoryAverages.GetValueOrDefault("tool_name", 0f); + + // Cannot reach Level 2+ without decent tool descriptions + if (descriptionAvg < 50f && level >= 2) + { + level = 1; + } + + // Cannot reach Level 3+ without good parameter descriptions + if (paramDescriptionAvg < 60f && level >= 3) + { + level = 2; + } + + // Cannot reach Level 4 without strong naming + if (nameAvg < 75f && level >= 4) + { + level = 3; + } + + var definition = LevelDefinitions[level]; + var nextRequirements = GetNextLevelRequirements(level, categoryAverages); + + return new MaturityLevel + { + Level = level, + Label = definition.Label, + Description = definition.Description, + NextLevelRequirements = nextRequirements, + }; + } + + /// + /// Builds the maturity ladder showing all 5 levels with the current level flagged. + /// Used by the HTML report to render the visual maturity progression. + /// + /// The current maturity level (0-4). + /// All 5 maturity levels with IsCurrent set for the active level. + public static List GetMaturityLadder(int currentLevel) + { + var ladder = new List(LevelDefinitions.Length); + for (int i = 0; i < LevelDefinitions.Length; i++) + { + var definition = LevelDefinitions[i]; + ladder.Add(new MaturityLadderEntry + { + Level = i, + Label = definition.Label, + Description = definition.Description, + IsCurrent = i == currentLevel, + }); + } + + return ladder; + } + + /// + /// Generates concrete, actionable requirements for reaching the next maturity level. + /// + private static List GetNextLevelRequirements( + int currentLevel, + Dictionary categoryAverages) + { + if (currentLevel >= 4) + { + return ["Maintain current quality standards."]; + } + + var requirements = new List(); + + switch (currentLevel) + { + case 0: + requirements.Add("Add meaningful descriptions to all tools (target: every tool describes its purpose)."); + requirements.Add("Ensure all parameters have type definitions in the schema."); + requirements.Add("Add descriptions to all parameters."); + break; + + case 1: + requirements.Add("Standardize naming conventions across all tools (use consistent verb_noun pattern)."); + requirements.Add("Ensure cross-tool consistency in parameter naming and types."); + if (categoryAverages.GetValueOrDefault("tool_description", 0f) < 70f) + { + requirements.Add("Improve tool descriptions to include usage guidelines and limitations."); + } + + break; + + case 2: + requirements.Add("Add usage guidelines ('Use this when...') to all tool descriptions."); + requirements.Add("Add limitation statements to all tool descriptions."); + requirements.Add("Define enum constraints for categorical parameters."); + if (categoryAverages.GetValueOrDefault("param_description", 0f) < 75f) + { + requirements.Add("Improve parameter descriptions with format specifications and examples."); + } + + break; + + case 3: + requirements.Add("Add concrete usage examples to all tool descriptions."); + requirements.Add("Ensure complete intent coverage for the server's domain."); + requirements.Add("Add return value documentation to all tools."); + break; + } + + return requirements; + } +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ReportGenerator.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ReportGenerator.cs new file mode 100644 index 00000000..092b9a99 --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ReportGenerator.cs @@ -0,0 +1,168 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using System.Diagnostics; +using System.Reflection; +using System.Runtime.InteropServices; +using System.Text.Json; +using System.Text.RegularExpressions; +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; +using Microsoft.Extensions.Logging; + +namespace Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; + +/// +/// Handles Step 5 of the evaluation pipeline: generates JSON and HTML reports +/// from a , then opens the HTML report in the default browser. +/// +internal sealed partial class ReportGenerator : IReportGenerator +{ + private const string TemplatePlaceholder = "{{REPORT_DATA}}"; + private const string EmbeddedResourceName = "Microsoft.Agents.A365.DevTools.Cli.Templates.SchemaEvalReport.html"; + + private static readonly JsonSerializerOptions s_jsonOptions = new() + { + WriteIndented = true, + }; + + private readonly ILogger _logger; + + public ReportGenerator(ILogger logger) + { + ArgumentNullException.ThrowIfNull(logger); + _logger = logger; + } + + /// + public async Task GenerateAsync(SchemaEvalResult result, string outputDir, bool openInBrowser = true) + { + ArgumentNullException.ThrowIfNull(result); + ArgumentException.ThrowIfNullOrWhiteSpace(outputDir); + + Directory.CreateDirectory(outputDir); + + string safeServerName = SanitizeFileName(result.ServerName); + + // Step 1: Write JSON report + string jsonPath = Path.Combine(outputDir, $"{safeServerName}_eval_report.json"); + string jsonContent = JsonSerializer.Serialize(result, s_jsonOptions); + await File.WriteAllTextAsync(jsonPath, jsonContent).ConfigureAwait(false); + _logger.LogInformation(" JSON: {JsonPath}", jsonPath); + + // Step 2: Build EvalReportData + var reportData = new EvalReportData + { + Result = result, + ImpactMap = IssueTaxonomy.GetImpactMap(), + MaturityLadder = MaturityCalculator.GetMaturityLadder(result.Maturity.Level), + }; + + // Step 3: Read HTML template from embedded resource + string template = await ReadEmbeddedTemplateAsync().ConfigureAwait(false); + + // Step 4: Inject report data into template. + // Escape sequences that can break out of the inline , ) + // since the JSON contains untrusted strings from the MCP server. + string reportDataJson = EscapeForInlineScript(JsonSerializer.Serialize(reportData, s_jsonOptions)); + string htmlContent = template.Replace(TemplatePlaceholder, reportDataJson, StringComparison.Ordinal); + + // Step 5: Write HTML report + string htmlPath = Path.Combine(outputDir, $"{safeServerName}_eval_report.html"); + await File.WriteAllTextAsync(htmlPath, htmlContent).ConfigureAwait(false); + _logger.LogInformation(" HTML: {HtmlPath}", htmlPath); + + // Step 6: Open HTML report in default browser + if (openInBrowser) + { + OpenInBrowser(htmlPath); + } + } + + /// + /// Reads the HTML template from the embedded resource. + /// + private static async Task ReadEmbeddedTemplateAsync() + { + var assembly = Assembly.GetExecutingAssembly(); + using var stream = assembly.GetManifestResourceStream(EmbeddedResourceName); + + if (stream is null) + { + throw new InvalidOperationException( + $"Embedded resource '{EmbeddedResourceName}' not found. Ensure the HTML template is included as an EmbeddedResource in the project."); + } + + using var reader = new StreamReader(stream); + return await reader.ReadToEndAsync().ConfigureAwait(false); + } + + /// + /// Opens the HTML file in the default browser, using the appropriate command + /// for the current operating system. + /// + private void OpenInBrowser(string htmlPath) + { + try + { + ProcessStartInfo startInfo; + + if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows)) + { + startInfo = new ProcessStartInfo(htmlPath) { UseShellExecute = true }; + } + else if (RuntimeInformation.IsOSPlatform(OSPlatform.OSX)) + { + // Use ArgumentList so paths with spaces or shell-significant characters are passed intact. + startInfo = new ProcessStartInfo("open"); + startInfo.ArgumentList.Add(htmlPath); + } + else + { + startInfo = new ProcessStartInfo("xdg-open"); + startInfo.ArgumentList.Add(htmlPath); + } + + using var process = Process.Start(startInfo); + _logger.LogInformation(" Opened HTML report in default browser"); + } + catch (Exception ex) + { + _logger.LogWarning(ex, "Could not open HTML report in browser. Please open manually: {HtmlPath}", htmlPath); + } + } + + /// + /// Escapes sequences that would break out of an inline <script> block. + /// The HTML parser sees different characters, but JSON.parse still recovers + /// the original strings via the standard escape sequences (\/ and \uXXXX). + /// + internal static string EscapeForInlineScript(string json) + { + if (string.IsNullOrEmpty(json)) + { + return json; + } + + return json + .Replace("", "--\\u003e", StringComparison.Ordinal); + } + + /// + /// Sanitizes a server name for use as a filename by replacing non-alphanumeric + /// characters (except hyphens) with underscores. + /// + internal static string SanitizeFileName(string name) + { + if (string.IsNullOrWhiteSpace(name)) + { + return "server"; + } + + return FileNameSanitizer().Replace(name, "_"); + } + + [GeneratedRegex(@"[^a-zA-Z0-9\-]", RegexOptions.Compiled)] + private static partial Regex FileNameSanitizer(); +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SchemaDiscoveryService.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SchemaDiscoveryService.cs new file mode 100644 index 00000000..2d36e4de --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SchemaDiscoveryService.cs @@ -0,0 +1,367 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using System.Text; +using System.Text.Json; +using Microsoft.Agents.A365.DevTools.Cli.Constants; +using Microsoft.Agents.A365.DevTools.Cli.Exceptions; +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; +using Microsoft.Agents.A365.DevTools.Cli.Services.Internal; +using Microsoft.Extensions.Logging; + +namespace Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; + +/// +/// Discovers MCP tool schemas from a running MCP server using Streamable HTTP transport. +/// Implements the MCP protocol handshake (initialize, notifications/initialized, tools/list) +/// over JSON-RPC 2.0 POST requests. +/// +internal sealed class SchemaDiscoveryService : ISchemaDiscoveryService, IDisposable +{ + private const string McpProtocolVersion = "2025-03-26"; + private const string ClientName = "a365-evaluate"; + private const string ClientVersion = "1.0"; + private const string JsonRpcVersion = "2.0"; + + private readonly ILogger _logger; + private readonly HttpClient _httpClient; + + public SchemaDiscoveryService(ILogger logger, HttpMessageHandler? handler = null) + { + ArgumentNullException.ThrowIfNull(logger); + _logger = logger; + _httpClient = handler != null ? new HttpClient(handler) : HttpClientFactory.CreateAuthenticatedClient(); + + // The MCP server is untrusted input and tool schemas are typically < 1 MB. Cap the + // buffered response size so a hostile or malformed server cannot exhaust memory with a + // huge tools/list payload; exceeding it throws during ReadAsStringAsync and is caught as + // EvaluationException by DiscoverToolsAsync. Scoped to this client rather than + // HttpClientFactory, which is shared by Graph/ARM/tooling where large responses are valid. + _httpClient.MaxResponseContentBufferSize = 10 * 1024 * 1024; // 10 MB ceiling + } + + /// + /// Disposes the owned . The service is registered as a + /// singleton (Program.cs), so in practice this runs at process shutdown; + /// implementing it keeps the IDisposable contract correct if the registration + /// lifetime ever changes to transient or scoped. + /// + public void Dispose() => _httpClient.Dispose(); + + /// + public async Task> DiscoverToolsAsync(string serverUrl, string? authToken = null, CancellationToken cancellationToken = default) + { + if (string.IsNullOrWhiteSpace(serverUrl)) + { + throw new EvaluationException( + ErrorCodes.SchemaDiscoveryFailed, + "Server URL is required for schema discovery.", + mitigationSteps: new List + { + "Provide a valid MCP server Streamable HTTP endpoint URL." + }); + } + + _logger.LogDebug("Starting MCP schema discovery against {ServerUrl}", serverUrl); + + try + { + // Step 1: Initialize + await SendInitializeAsync(serverUrl, authToken, cancellationToken); + + // Step 2: Send initialized notification + await SendInitializedNotificationAsync(serverUrl, authToken, cancellationToken); + + // Step 3: List tools + var tools = await SendToolsListAsync(serverUrl, authToken, cancellationToken); + + if (tools.Count == 0) + { + throw new EvaluationException( + ErrorCodes.SchemaDiscoveryFailed, + "MCP server returned an empty tool list.", + errorDetails: new List { $"Server URL: {serverUrl}" }, + mitigationSteps: new List + { + "Verify the MCP server is running and has tools registered.", + "Check the server logs for registration errors." + }); + } + + _logger.LogDebug("Schema discovery complete. Found {ToolCount} tool(s).", tools.Count); + return tools; + } + catch (EvaluationException) + { + // Re-throw our own exceptions as-is + throw; + } + catch (HttpRequestException ex) + { + throw new EvaluationException( + ErrorCodes.SchemaDiscoveryFailed, + "Failed to connect to MCP server.", + errorDetails: new List { $"Server URL: {serverUrl}", ex.Message }, + mitigationSteps: new List + { + "Verify the MCP server is running and accessible.", + "Check the URL is correct and includes the full endpoint path.", + "Ensure no firewall or network issues are blocking the connection." + }, + innerException: ex); + } + catch (TaskCanceledException ex) when (ex.InnerException is TimeoutException || !cancellationToken.IsCancellationRequested) + { + throw new EvaluationException( + ErrorCodes.SchemaDiscoveryFailed, + "Connection to MCP server timed out.", + errorDetails: new List { $"Server URL: {serverUrl}" }, + mitigationSteps: new List + { + "Verify the MCP server is running and responsive.", + "Check if the server URL is correct.", + "The server may be under heavy load; try again later." + }, + innerException: ex); + } + catch (JsonException ex) + { + throw new EvaluationException( + ErrorCodes.SchemaDiscoveryFailed, + "MCP server returned an invalid JSON response.", + errorDetails: new List { $"Server URL: {serverUrl}", ex.Message }, + mitigationSteps: new List + { + "Verify the server implements the MCP protocol correctly.", + "Check the server logs for errors." + }, + innerException: ex); + } + } + + private async Task SendInitializeAsync(string serverUrl, string? authToken, CancellationToken cancellationToken) + { + _logger.LogDebug("Sending MCP initialize request..."); + + var requestBody = JsonSerializer.Serialize(new + { + jsonrpc = JsonRpcVersion, + method = "initialize", + @params = new + { + protocolVersion = McpProtocolVersion, + capabilities = new { }, + clientInfo = new + { + name = ClientName, + version = ClientVersion + } + }, + id = 1 + }); + + using var response = await PostJsonRpcAsync(serverUrl, requestBody, authToken, cancellationToken); + var responseBody = await ReadJsonResponseAsync(response, cancellationToken); + + // Validate JSON-RPC response + using var doc = JsonDocument.Parse(responseBody); + if (doc.RootElement.TryGetProperty("error", out var errorElement)) + { + var errorMessage = errorElement.TryGetProperty("message", out var msgProp) + ? msgProp.GetString() ?? "Unknown error" + : "Unknown error"; + + throw new EvaluationException( + ErrorCodes.SchemaDiscoveryFailed, + "MCP server initialize request failed.", + errorDetails: new List { $"Server error: {errorMessage}" }, + mitigationSteps: new List + { + "Verify the server supports MCP protocol version " + McpProtocolVersion + ".", + "Check the server logs for initialization errors." + }); + } + + _logger.LogDebug("MCP initialize succeeded."); + } + + private async Task SendInitializedNotificationAsync(string serverUrl, string? authToken, CancellationToken cancellationToken) + { + _logger.LogDebug("Sending MCP initialized notification..."); + + var requestBody = JsonSerializer.Serialize(new + { + jsonrpc = JsonRpcVersion, + method = "notifications/initialized", + @params = new { } + }); + + // Notifications may not return a response body, but we still POST + using var response = await PostJsonRpcAsync(serverUrl, requestBody, authToken, cancellationToken); + + _logger.LogDebug("MCP initialized notification sent."); + } + + private async Task> SendToolsListAsync(string serverUrl, string? authToken, CancellationToken cancellationToken) + { + _logger.LogDebug("Sending MCP tools/list request..."); + + var requestBody = JsonSerializer.Serialize(new + { + jsonrpc = JsonRpcVersion, + method = "tools/list", + @params = new { }, + id = 2 + }); + + using var response = await PostJsonRpcAsync(serverUrl, requestBody, authToken, cancellationToken); + var responseBody = await ReadJsonResponseAsync(response, cancellationToken); + + using var doc = JsonDocument.Parse(responseBody); + + // Check for JSON-RPC error + if (doc.RootElement.TryGetProperty("error", out var errorElement)) + { + var errorMessage = errorElement.TryGetProperty("message", out var msgProp) + ? msgProp.GetString() ?? "Unknown error" + : "Unknown error"; + + throw new EvaluationException( + ErrorCodes.SchemaDiscoveryFailed, + "MCP server tools/list request failed.", + errorDetails: new List { $"Server error: {errorMessage}" }, + mitigationSteps: new List + { + "Verify the server has tools registered.", + "Check the server logs for errors." + }); + } + + // Parse result.tools array + if (!doc.RootElement.TryGetProperty("result", out var resultElement) || + !resultElement.TryGetProperty("tools", out var toolsElement) || + toolsElement.ValueKind != JsonValueKind.Array) + { + throw new EvaluationException( + ErrorCodes.SchemaDiscoveryFailed, + "MCP server returned an unexpected response format for tools/list.", + errorDetails: new List { "Expected result.tools to be a JSON array." }, + mitigationSteps: new List + { + "Verify the server implements the MCP tools/list method correctly." + }); + } + + var tools = new List(); + + foreach (var toolElement in toolsElement.EnumerateArray()) + { + var name = toolElement.TryGetProperty("name", out var nameProp) + ? nameProp.GetString() ?? string.Empty + : string.Empty; + + var description = toolElement.TryGetProperty("description", out var descProp) + ? descProp.GetString() ?? string.Empty + : string.Empty; + + JsonElement? inputSchema = toolElement.TryGetProperty("inputSchema", out var schemaProp) + ? schemaProp.Clone() + : null; + + tools.Add(new ToolSchema + { + Name = name, + Description = description, + InputSchema = inputSchema + }); + } + + _logger.LogDebug("tools/list returned {ToolCount} tool(s).", tools.Count); + return tools; + } + + private async Task PostJsonRpcAsync( + string serverUrl, + string requestBody, + string? authToken, + CancellationToken cancellationToken) + { + using var request = new HttpRequestMessage(HttpMethod.Post, serverUrl) + { + Content = new StringContent(requestBody, Encoding.UTF8, "application/json") + }; + + // MCP Streamable HTTP transport requires Accept header + request.Headers.Accept.Add(new System.Net.Http.Headers.MediaTypeWithQualityHeaderValue("application/json")); + request.Headers.Accept.Add(new System.Net.Http.Headers.MediaTypeWithQualityHeaderValue("text/event-stream")); + + if (!string.IsNullOrWhiteSpace(authToken)) + { + request.Headers.Authorization = new System.Net.Http.Headers.AuthenticationHeaderValue("Bearer", authToken); + } + + var response = await _httpClient.SendAsync(request, cancellationToken); + + if (!response.IsSuccessStatusCode) + { + var statusCode = (int)response.StatusCode; + var reasonPhrase = response.ReasonPhrase; + response.Dispose(); + + throw new EvaluationException( + ErrorCodes.SchemaDiscoveryFailed, + $"MCP server returned HTTP {statusCode}.", + errorDetails: new List { $"Server URL: {serverUrl}", $"HTTP Status: {statusCode} {reasonPhrase}" }, + mitigationSteps: new List + { + "Verify the MCP server is running and accessible.", + "Check that the URL points to the correct Streamable HTTP endpoint." + }); + } + + return response; + } + + /// + /// Reads the response body, handling both plain JSON and SSE (Server-Sent Events) formats. + /// MCP Streamable HTTP may return SSE with lines like: + /// event: message + /// data: {"jsonrpc":"2.0",...} + /// + private async Task ReadJsonResponseAsync(HttpResponseMessage response, CancellationToken cancellationToken) + { + var body = await response.Content.ReadAsStringAsync(cancellationToken); + var contentType = response.Content.Headers.ContentType?.MediaType; + + // If plain JSON, return as-is + if (contentType == "application/json" || body.TrimStart().StartsWith('{')) + { + return body; + } + + // Parse SSE: extract the last "data:" line that contains JSON + _logger.LogDebug("Response is SSE format, extracting JSON from event stream"); + string? lastJsonData = null; + foreach (var line in body.Split('\n')) + { + var trimmed = line.Trim(); + if (trimmed.StartsWith("data:", StringComparison.Ordinal)) + { + var data = trimmed["data:".Length..].Trim(); + if (data.StartsWith('{')) + { + lastJsonData = data; + } + } + } + + if (lastJsonData is not null) + { + return lastJsonData; + } + + // Fallback: return raw body and let the JSON parser report the error + _logger.LogWarning("Could not extract JSON from SSE response"); + return body; + } +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/Scorer.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/Scorer.cs new file mode 100644 index 00000000..b68bd18e --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/Scorer.cs @@ -0,0 +1,135 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; + +namespace Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; + +/// +/// Computes per-category, per-tool, and overall scores for MCP server evaluation. +/// Category scores use pass-rate (passed / evaluated * 100). Null scores are excluded. +/// Tool scores use weighted category averages. +/// Overall score blends mean tool score (0.85) with toolset score (0.15). +/// +public static class Scorer +{ + /// + /// Category weights for computing weighted tool scores. Must sum to 1.0. + /// + public static IReadOnlyDictionary CategoryWeights { get; } = new Dictionary + { + ["tool_name"] = 0.15f, + ["tool_description"] = 0.35f, + ["param_name"] = 0.10f, + ["param_description"] = 0.25f, + ["schema_structure"] = 0.15f, + }; + + /// + /// Weight applied to the mean of tool-level scores in the overall formula. + /// + public const float ToolWeight = 0.85f; + + /// + /// Weight applied to the toolset-level score in the overall formula. + /// + public const float ToolsetWeight = 0.15f; + + /// + /// Computes the score (0-100) for a single category from its check items. + /// Formula: (passed / evaluated) * 100. Checks with null Score are excluded + /// from both numerator and denominator. Returns 100 if no checks are evaluated. + /// + /// Check items for a single category. + /// Score from 0 to 100, rounded to 1 decimal place. + public static float ComputeCategoryScore(List checks) + { + if (checks.Count == 0) + { + return 100f; + } + + var evaluated = checks.Where(c => c.Score is not null).ToList(); + if (evaluated.Count == 0) + { + return 100f; + } + + int passed = evaluated.Count(c => c.Score == true); + float score = (float)passed / evaluated.Count * 100f; + return MathF.Round(score, 1); + } + + /// + /// Computes a tool-level score as a weighted sum of category scores. + /// Missing categories default to 100 (no deductions). + /// + /// + /// Per-category scores keyed by category name (e.g., "tool_name", "tool_description"). + /// + /// Weighted score from 0 to 100, rounded to 1 decimal place. + public static float ComputeToolScore(Dictionary categoryScores) + { + float overall = 0f; + foreach (var (category, weight) in CategoryWeights) + { + float catScore = categoryScores.GetValueOrDefault(category, 100f); + overall += catScore * weight; + } + + return MathF.Round(overall, 1); + } + + /// + /// Computes the overall server score blending tool-level and toolset-level scores. + /// Formula: (meanToolScore * 0.85) + (toolsetScore * 0.15). + /// Returns toolsetScore * 0.15 if there are no tools. + /// + /// Evaluation results for each tool. + /// Score from toolset-level (cross-tool) checks. + /// Overall score from 0 to 100, rounded to 1 decimal place. + public static float ComputeOverallScore(List toolResults, float toolsetScore) + { + if (toolResults.Count == 0) + { + return MathF.Round(toolsetScore * ToolsetWeight, 1); + } + + float meanToolScore = toolResults.Average(t => t.Score); + float overall = (meanToolScore * ToolWeight) + (toolsetScore * ToolsetWeight); + return MathF.Round(overall, 1); + } + + /// + /// Computes average category scores across all tool results. + /// Each category is averaged independently across all tools that have a score for it. + /// + /// Evaluation results for each tool. + /// Dictionary of category name to average score, rounded to 1 decimal. + public static Dictionary ComputeCategoryAverages(List toolResults) + { + if (toolResults.Count == 0) + { + return []; + } + + var accumulator = new Dictionary>(); + foreach (var toolResult in toolResults) + { + foreach (var (category, score) in toolResult.CategoryScores) + { + if (!accumulator.TryGetValue(category, out var scores)) + { + scores = []; + accumulator[category] = scores; + } + + scores.Add(score); + } + } + + return accumulator.ToDictionary( + kvp => kvp.Key, + kvp => MathF.Round(kvp.Value.Average(), 1)); + } +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SemanticCheckDefinitions.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SemanticCheckDefinitions.cs new file mode 100644 index 00000000..2c3fb6a0 --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SemanticCheckDefinitions.cs @@ -0,0 +1,302 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; + +namespace Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; + +/// +/// Defines all semantic check metadata for MCP tool schema evaluation. +/// Semantic checks require judgment (by a coding agent or human) and cannot be +/// evaluated deterministically. Each check produces a +/// with and a null Score that will be filled +/// during the evaluation phase. +/// +internal static class SemanticCheckDefinitions +{ + /// + /// Returns the 10 tool-level semantic checks that evaluate naming quality + /// and description completeness. These require semantic understanding to judge. + /// + /// A list of 10 semantic instances with null scores. + internal static List GetToolLevelChecks() + { + return + [ + new ChecklistItem + { + Id = "tn_verb_prefix", + Type = CheckType.Semantic, + Prompt = "Does the tool name start with (or clearly contain) an action verb? " + + "Action verbs include any word describing what the tool does " + + "(get, create, send, search, forward, reply, flag, deploy, lock, etc.). " + + "Pass if the first word or segment of the name is an action verb in any domain.", + Score = null, + Reason = null, + Severity = Priority.P1, + Category = CheckCategory.ToolName, + IssueIds = [4, 18], + ImpactAreas = [ImpactArea.ToolSelection], + Remediation = "Rename to start with an action verb like get_, create_, search_, send_, etc.", + }, + + new ChecklistItem + { + Id = "tn_not_generic", + Type = CheckType.Semantic, + Prompt = "Is the tool name specific enough to distinguish it from other tools? " + + "Fail only for extremely vague names like 'run', 'execute', 'tool', 'process', 'action'. " + + "Domain-specific names like 'ForwardMessage' or 'SearchContacts' always pass.", + Score = null, + Reason = null, + Severity = Priority.P1, + Category = CheckCategory.ToolName, + IssueIds = [4, 18], + ImpactAreas = [ImpactArea.ToolSelection], + Remediation = "Rename to describe the specific action and resource, e.g., 'search_contacts'.", + }, + + new ChecklistItem + { + Id = "tn_descriptive", + Type = CheckType.Semantic, + Prompt = "Does the tool name follow an action+subject pattern (e.g., 'GetUser', 'search_contacts')? " + + "Pass if the name contains both an action and what it acts on.", + Score = null, + Reason = null, + Severity = Priority.P2, + Category = CheckCategory.ToolName, + IssueIds = [4, 18], + ImpactAreas = [ImpactArea.ToolSelection], + Remediation = "Use verb_noun pattern, e.g., 'get_user', 'search_documents', 'create_task'.", + }, + + new ChecklistItem + { + Id = "td_has_purpose", + Type = CheckType.Semantic, + Prompt = "Does the description clearly state what the tool does? " + + "Pass if reading the description tells you the tool's primary function.", + Score = null, + Reason = null, + Severity = Priority.P0, + Category = CheckCategory.ToolDescription, + IssueIds = [4], + ImpactAreas = [ImpactArea.ToolSelection], + Remediation = "Start the description with a verb phrase: 'Retrieves...', 'Creates...', 'Searches for...'.", + }, + + new ChecklistItem + { + Id = "td_not_name_echo", + Type = CheckType.Semantic, + Prompt = "Does the description provide information beyond just restating the tool name? " + + "Fail if the description is essentially the tool name with minor filler words.", + Score = null, + Reason = null, + Severity = Priority.P2, + Category = CheckCategory.ToolDescription, + IssueIds = [13], + ImpactAreas = [ImpactArea.Conciseness], + Remediation = "Rewrite the description to explain purpose, guidelines, and return values -- not just restate the name.", + }, + + new ChecklistItem + { + Id = "td_has_usage_guidelines", + Type = CheckType.Semantic, + Prompt = "Does the description explain when or how to use this tool? " + + "Pass if it mentions scenarios, conditions, or workflows where this tool is appropriate.", + Score = null, + Reason = null, + Severity = Priority.P1, + Category = CheckCategory.ToolDescription, + IssueIds = [5], + ImpactAreas = [ImpactArea.ToolSelection], + Remediation = "Add a sentence like 'Use this when you need to...' or 'Useful for...'.", + }, + + new ChecklistItem + { + Id = "td_has_limitations", + Type = CheckType.Semantic, + Prompt = "Does the description mention any limitations, constraints, or things the tool cannot do? " + + "Pass if it states any boundary, restriction, or caveat.", + Score = null, + Reason = null, + Severity = Priority.P2, + Category = CheckCategory.ToolDescription, + IssueIds = [6], + ImpactAreas = [ImpactArea.ToolSelection, ImpactArea.Completeness], + Remediation = "Add a sentence stating what the tool does NOT do or its constraints.", + }, + + new ChecklistItem + { + Id = "td_has_return_docs", + Type = CheckType.Semantic, + Prompt = "Does the description explain what the tool returns or produces? " + + "Pass if it mentions the output, response format, or what to expect back.", + Score = null, + Reason = null, + Severity = Priority.P1, + Category = CheckCategory.ToolDescription, + IssueIds = [8], + ImpactAreas = [ImpactArea.Completeness], + Remediation = "Add 'Returns ...' describing the output format and content.", + }, + + new ChecklistItem + { + Id = "td_has_examples", + Type = CheckType.Semantic, + Prompt = "Does the description include usage examples, sample values, or illustrative patterns? " + + "Pass if there are concrete examples, 'e.g.' patterns, or sample inputs/outputs.", + Score = null, + Reason = null, + Severity = Priority.P2, + Category = CheckCategory.ToolDescription, + IssueIds = [10], + ImpactAreas = [ImpactArea.Completeness], + Remediation = "Add examples: 'e.g., search_contacts(query=\"John\")' or 'For example, ...'.", + }, + + new ChecklistItem + { + Id = "td_no_boilerplate", + Type = CheckType.Semantic, + Prompt = "Is the description specific to this tool, not generic boilerplate? " + + "Fail if it starts with 'This is a tool that...' or uses generic filler without specific detail.", + Score = null, + Reason = null, + Severity = Priority.P1, + Category = CheckCategory.ToolDescription, + IssueIds = [14], + ImpactAreas = [ImpactArea.Conciseness], + Remediation = "Remove generic phrases and replace with specific information about what this tool does.", + }, + ]; + } + + /// + /// Returns the 4 per-parameter semantic checks that evaluate naming quality + /// and description completeness for a single parameter. + /// + /// The parameter name, used to customize prompt text and remediation advice. + /// A list of 4 semantic instances with null scores. + internal static List GetParamLevelChecks(string paramName) + { + return + [ + new ChecklistItem + { + Id = "pn_not_generic", + Type = CheckType.Semantic, + Prompt = $"Is the parameter name '{paramName}' specific enough in this tool's context? " + + "Fail only for truly uninformative names like 'x', 'val', 'data', 'input', 'arg'. " + + "Names like 'query', 'messageId', 'userId' are fine.", + Score = null, + Reason = null, + Severity = Priority.P2, + Category = CheckCategory.ParamName, + IssueIds = [9, 1], + ImpactAreas = [ImpactArea.ParamAccuracy], + Remediation = $"Rename '{paramName}' to describe what it represents (e.g., 'user_id', 'search_query').", + }, + + new ChecklistItem + { + Id = "pd_not_name_echo", + Type = CheckType.Semantic, + Prompt = $"Does the description for parameter '{paramName}' provide more information than " + + "just restating the parameter name? Fail if the description is essentially the " + + "parameter name with minor filler words.", + Score = null, + Reason = null, + Severity = Priority.P1, + Category = CheckCategory.ParamDescription, + IssueIds = [15], + ImpactAreas = [ImpactArea.Conciseness, ImpactArea.ParamAccuracy], + Remediation = $"Rewrite description for '{paramName}' to explain format, constraints, and purpose.", + }, + + new ChecklistItem + { + Id = "pd_has_constraints", + Type = CheckType.Semantic, + Prompt = $"Does the description or schema for parameter '{paramName}' mention constraints, " + + "valid values, format requirements, or limits? Pass if any form of constraint " + + "guidance is provided.", + Score = null, + Reason = null, + Severity = Priority.P1, + Category = CheckCategory.ParamDescription, + IssueIds = [11], + ImpactAreas = [ImpactArea.ParamAccuracy], + Remediation = $"Add constraints to '{paramName}' schema (enum, min/max, pattern) or describe limits.", + }, + + new ChecklistItem + { + Id = "pd_enum_for_categorical", + Type = CheckType.Semantic, + Prompt = $"Does parameter '{paramName}' represent a finite set of choices " + + "(like status, type, priority, format)? If it looks categorical, " + + "does the schema define an enum with valid values? " + + "Pass if the parameter is not categorical, or if it is categorical and has an enum defined.", + Score = null, + Reason = null, + Severity = Priority.P2, + Category = CheckCategory.ParamDescription, + IssueIds = [1], + ImpactAreas = [ImpactArea.ParamAccuracy], + Remediation = $"Add an 'enum' array to '{paramName}' listing all valid values.", + }, + ]; + } + + /// + /// Returns the 2 toolset-level semantic checks that evaluate cross-tool design quality. + /// These examine the tool collection as a whole rather than individual tools. + /// + /// A list of 2 semantic instances with null scores. + internal static List GetToolsetLevelChecks() + { + return + [ + new ChecklistItem + { + Id = "ts_no_description_overlap", + Type = CheckType.Semantic, + Prompt = "Are there any pairs of tools whose descriptions are semantically so similar " + + "(>70% overlap) that an AI agent would be confused about which to use? " + + "Only flag genuinely overlapping pairs, not tools that operate on the same entity " + + "with different verbs. Pass if no significant description overlap exists.", + Score = null, + Reason = null, + Severity = Priority.P1, + Category = CheckCategory.ToolsetDesign, + IssueIds = [17], + ImpactAreas = [ImpactArea.ToolSelection], + Remediation = "Differentiate overlapping tool descriptions. Clarify when to use each.", + }, + + new ChecklistItem + { + Id = "ts_crud_completeness", + Type = CheckType.Semantic, + Prompt = "For entities that have 2+ CRUD-like operations (create/read/update/delete), " + + "are there any missing operations that seem unintentional? " + + "Only flag entities where gaps appear unintentional. " + + "Pass if CRUD operations are complete or gaps are clearly intentional.", + Score = null, + Reason = null, + Severity = Priority.P2, + Category = CheckCategory.ToolsetDesign, + IssueIds = [18], + ImpactAreas = [ImpactArea.Completeness], + Remediation = "Add missing CRUD operations or document why they're intentionally omitted.", + }, + ]; + } +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SemanticCheckPrompts.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SemanticCheckPrompts.cs new file mode 100644 index 00000000..bf25608e --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SemanticCheckPrompts.cs @@ -0,0 +1,343 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using System.Text; + +namespace Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; + +/// +/// Provides structured prompt templates for invoking a coding agent (Claude Code +/// or GitHub Copilot) to evaluate semantic checks in an MCP tool schema checklist. +/// +/// The generated prompt instructs the agent to: +/// 1. Read the checklist JSON file. +/// 2. Evaluate each item where score is null. +/// 3. Set score to true (pass) or false (fail) with a 1-sentence reason. +/// 4. Leave items where score is already set (deterministic checks) unchanged. +/// 5. Write the updated JSON back to the same file, preserving all other fields. +/// +internal static class SemanticCheckPrompts +{ + /// + /// Builds the full evaluation prompt that a coding agent will receive. + /// The prompt describes the context, evaluation guidelines, JSON structure, + /// and concrete examples of good and bad evaluations. + /// + /// Absolute path to the checklist JSON file to evaluate. + /// A self-contained prompt string ready to pass to a coding agent CLI. + public static string BuildEvaluationPrompt(string checklistPath) + { + ArgumentException.ThrowIfNullOrWhiteSpace(checklistPath); + + var sb = new StringBuilder(); + + AppendSafetyPreamble(sb); + sb.AppendLine("You are evaluating an MCP (Model Context Protocol) tool schema for quality."); + sb.AppendLine("An MCP server exposes tools that AI agents call. Poor tool names, descriptions,"); + sb.AppendLine("or parameter schemas cause agents to select the wrong tool or pass incorrect arguments."); + sb.AppendLine(); + + AppendInstructions(sb, checklistPath); + AppendJsonStructure(sb); + AppendEvaluationGuidelines(sb); + AppendExamples(sb); + AppendFinalRules(sb); + + return sb.ToString(); + } + + /// + /// Concrete read/edit tool names for the target coding agent. Embedded into + /// the prompt so the agent is told exactly what to use rather than guessing. + /// We use an edit (string-replace) tool rather than a whole-file write tool, + /// because Copilot's `create` tool cannot overwrite existing files and telling + /// the model to "rewrite the file" leaves it thrashing on workaround paths. + /// + public sealed record AgentToolset(string ReadToolName, string EditToolName); + + /// + /// Builds a prompt for evaluating a single tool's semantic checks. + /// The file contains just one tool object (not the full checklist). + /// + public static string BuildToolEvaluationPrompt(string toolFilePath, string toolName, AgentToolset toolset) + { + ArgumentException.ThrowIfNullOrWhiteSpace(toolFilePath); + ArgumentException.ThrowIfNullOrWhiteSpace(toolName); + ArgumentNullException.ThrowIfNull(toolset); + + var sb = new StringBuilder(); + + AppendSafetyPreamble(sb); + sb.AppendLine("You are evaluating an MCP tool schema for quality."); + sb.AppendLine(); + AppendToolsetHeader(sb, toolset); + sb.AppendLine("TASK:"); + sb.AppendLine($"1. Use `{toolset.ReadToolName}` to read the JSON file at: {toolFilePath}"); + sb.AppendLine($" It contains a single tool named \"{toolName}\" with its schema and checks."); + sb.AppendLine("2. For every checklist item in the tool's \"checks\" where \"score\" is null,"); + sb.AppendLine(" evaluate the \"prompt\" against the tool's name, description, and input_schema."); + sb.AppendLine("3. Set \"score\" to true (pass) or false (fail)."); + sb.AppendLine("4. Set \"reason\" to a single sentence explaining your judgment."); + sb.AppendLine("5. Do NOT modify items where \"score\" is already set (true or false)."); + AppendWriteStrategy(sb, toolset); + sb.AppendLine("7. Preserve exact JSON formatting: 2-space indentation, UTF-8 encoding."); + sb.AppendLine(); + + AppendEvaluationGuidelines(sb); + AppendExamples(sb); + AppendFinalRules(sb); + + return sb.ToString(); + } + + /// + /// Builds a prompt for evaluating server-level checks. + /// The file contains tool summaries and server_checks array. + /// + public static string BuildServerChecksEvaluationPrompt(string serverChecksFilePath, AgentToolset toolset) + { + ArgumentException.ThrowIfNullOrWhiteSpace(serverChecksFilePath); + ArgumentNullException.ThrowIfNull(toolset); + + var sb = new StringBuilder(); + + AppendSafetyPreamble(sb); + sb.AppendLine("You are evaluating an MCP server's toolset design for quality."); + sb.AppendLine(); + AppendToolsetHeader(sb, toolset); + sb.AppendLine("TASK:"); + sb.AppendLine($"1. Use `{toolset.ReadToolName}` to read the JSON file at: {serverChecksFilePath}"); + sb.AppendLine(" It contains \"tool_summaries\" (list of tool names and descriptions)"); + sb.AppendLine(" and \"server_checks\" (checklist items to evaluate)."); + sb.AppendLine("2. For every item in \"server_checks\" where \"score\" is null,"); + sb.AppendLine(" evaluate the \"prompt\" against the full set of tools."); + sb.AppendLine("3. Set \"score\" to true (pass) or false (fail)."); + sb.AppendLine("4. Set \"reason\" to a single sentence explaining your judgment."); + sb.AppendLine("5. Do NOT modify items where \"score\" is already set (true or false)."); + AppendWriteStrategy(sb, toolset); + sb.AppendLine("7. Preserve exact JSON formatting: 2-space indentation, UTF-8 encoding."); + sb.AppendLine(); + + sb.AppendLine("EVALUATION GUIDELINES:"); + sb.AppendLine(); + sb.AppendLine("For TOOLSET checks (category: \"ToolsetDesign\"):"); + sb.AppendLine(" - Evaluate cross-tool consistency and completeness."); + sb.AppendLine(" - Check for tools with semantically overlapping descriptions (>70% similar)."); + sb.AppendLine(" - Check for incomplete CRUD coverage that seems unintentional."); + sb.AppendLine(" - Only flag genuinely problematic patterns, not minor style differences."); + sb.AppendLine(); + + AppendFinalRules(sb); + + return sb.ToString(); + } + + private static void AppendToolsetHeader(StringBuilder sb, AgentToolset toolset) + { + sb.AppendLine("TOOLS:"); + sb.AppendLine($" Read the file with `{toolset.ReadToolName}`."); + sb.AppendLine($" Update the file ONLY with `{toolset.EditToolName}` — a string-replace tool that"); + sb.AppendLine(" takes old_str and new_str and replaces a single unique match."); + sb.AppendLine(" Do NOT try to use `create` or any whole-file write tool — it cannot overwrite."); + sb.AppendLine(" Shell / subprocess tools are disabled. Do not try to spawn processes."); + sb.AppendLine(); + } + + private static void AppendWriteStrategy(StringBuilder sb, AgentToolset toolset) + { + sb.AppendLine("6. EDIT STRATEGY (follow exactly — most failures come from ignoring this):"); + sb.AppendLine($" For each checklist item with score:null, call `{toolset.EditToolName}` once."); + sb.AppendLine(" To make each edit's old_str UNIQUE in the file, include the item's \"id\" line."); + sb.AppendLine(" The minimum unique old_str is:"); + sb.AppendLine(); + sb.AppendLine(" \"id\": \"\","); + sb.AppendLine(" \"type\": \"Semantic\","); + sb.AppendLine(" \"prompt\": \"\","); + sb.AppendLine(" \"score\": null,"); + sb.AppendLine(" \"reason\": null,"); + sb.AppendLine(); + sb.AppendLine(" Your new_str must be the same block with score and reason filled:"); + sb.AppendLine(); + sb.AppendLine(" \"id\": \"\","); + sb.AppendLine(" \"type\": \"Semantic\","); + sb.AppendLine(" \"prompt\": \"\","); + sb.AppendLine(" \"score\": true,"); + sb.AppendLine(" \"reason\": \"\","); + sb.AppendLine(); + sb.AppendLine(" IMPORTANT:"); + sb.AppendLine(" - Include the whole \"prompt\" line verbatim in old_str — the \"id\" alone is not"); + sb.AppendLine(" always enough for uniqueness across tools, but id + prompt always is."); + sb.AppendLine(" - Do NOT include any fields the file doesn't have."); + sb.AppendLine(" - Answer with your FIRST instinct. Do not re-read the file to double-check an"); + sb.AppendLine(" edit you already made — the edit succeeded if the tool didn't error."); + sb.AppendLine(" - Do NOT batch many items into one old_str — one item per edit call."); + } + + private static void AppendInstructions(StringBuilder sb, string checklistPath) + { + sb.AppendLine("TASK:"); + sb.AppendLine($"1. Read the JSON file at: {checklistPath}"); + sb.AppendLine("2. For every checklist item where \"score\" is null, evaluate the \"prompt\" field"); + sb.AppendLine(" against the tool schema included in the same JSON file."); + sb.AppendLine("3. Set \"score\" to true (pass) or false (fail)."); + sb.AppendLine("4. Set \"reason\" to a single sentence explaining your judgment."); + sb.AppendLine("5. Do NOT modify any item where \"score\" is already set (true or false)."); + sb.AppendLine(" Those are deterministic checks that have already been evaluated."); + sb.AppendLine("6. Do NOT modify any other fields (id, type, severity, category, issue_ids,"); + sb.AppendLine(" impact_areas, remediation, prompt)."); + sb.AppendLine("7. Write the updated JSON back to the SAME file path."); + sb.AppendLine("8. Preserve the exact JSON formatting: 2-space indentation, UTF-8 encoding."); + sb.AppendLine(); + } + + private static void AppendJsonStructure(StringBuilder sb) + { + sb.AppendLine("JSON STRUCTURE:"); + sb.AppendLine("The file is an EvaluationChecklist with this shape:"); + sb.AppendLine(" {"); + sb.AppendLine(" \"metadata\": { \"server_name\": \"...\", \"tool_count\": N, ... },"); + sb.AppendLine(" \"tools\": ["); + sb.AppendLine(" {"); + sb.AppendLine(" \"name\": \"tool_name\","); + sb.AppendLine(" \"description\": \"tool description text\","); + sb.AppendLine(" \"input_schema\": { ... JSON Schema ... },"); + sb.AppendLine(" \"checks\": {"); + sb.AppendLine(" \"tool_name\": [ { \"id\": \"...\", \"score\": null, \"prompt\": \"...\", ... } ],"); + sb.AppendLine(" \"tool_description\": [ ... ],"); + sb.AppendLine(" \"schema_structure\": [ ... ],"); + sb.AppendLine(" \"parameters\": {"); + sb.AppendLine(" \"\": {"); + sb.AppendLine(" \"param_name\": [ ... ],"); + sb.AppendLine(" \"param_description\": [ ... ]"); + sb.AppendLine(" }"); + sb.AppendLine(" }"); + sb.AppendLine(" }"); + sb.AppendLine(" }"); + sb.AppendLine(" ],"); + sb.AppendLine(" \"server_checks\": [ { \"id\": \"...\", \"score\": null, \"prompt\": \"...\", ... } ]"); + sb.AppendLine(" }"); + sb.AppendLine(); + sb.AppendLine("Each checklist item has:"); + sb.AppendLine(" - \"type\": \"Deterministic\" or \"Semantic\""); + sb.AppendLine(" - \"score\": true, false, or null (null = needs your evaluation)"); + sb.AppendLine(" - \"reason\": null or a string (set this when you set score)"); + sb.AppendLine(" - \"prompt\": the question to evaluate against the tool schema"); + sb.AppendLine(); + } + + private static void AppendEvaluationGuidelines(StringBuilder sb) + { + sb.AppendLine("EVALUATION GUIDELINES:"); + sb.AppendLine(); + sb.AppendLine("For tool NAME checks (category: \"ToolName\"):"); + sb.AppendLine(" - Evaluate naming quality: does it start with a verb, is it specific enough,"); + sb.AppendLine(" does it follow action+subject pattern (e.g., get_user, search_contacts)?"); + sb.AppendLine(" - Be lenient with domain-specific names; only fail truly vague names."); + sb.AppendLine(" - Both snake_case and PascalCase naming conventions are acceptable."); + sb.AppendLine(); + sb.AppendLine("For tool DESCRIPTION checks (category: \"ToolDescription\"):"); + sb.AppendLine(" - Evaluate completeness across these dimensions:"); + sb.AppendLine(" * Purpose: Does it explain what the tool does?"); + sb.AppendLine(" * Usage guidelines: Does it say when/how to use the tool?"); + sb.AppendLine(" * Limitations: Does it mention constraints or things it cannot do?"); + sb.AppendLine(" * Return info: Does it describe what the tool returns?"); + sb.AppendLine(" * Examples: Does it include sample inputs/outputs or usage patterns?"); + sb.AppendLine(" - A description does not need ALL dimensions to pass individual checks;"); + sb.AppendLine(" each check targets one dimension specifically."); + sb.AppendLine(); + sb.AppendLine("For PARAMETER checks (categories: \"ParamName\", \"ParamDescription\"):"); + sb.AppendLine(" - Evaluate parameter naming: is it descriptive enough in context?"); + sb.AppendLine(" Names like 'query', 'userId', 'messageId' are fine."); + sb.AppendLine(" Names like 'x', 'val', 'data', 'input' are too vague."); + sb.AppendLine(" - Evaluate parameter descriptions: do they add info beyond the name?"); + sb.AppendLine(" Do they mention constraints, formats, or valid values?"); + sb.AppendLine(" - For categorical parameters: is an enum defined with valid values?"); + sb.AppendLine(); + sb.AppendLine("For TOOLSET checks (category: \"ToolsetDesign\", in server_checks):"); + sb.AppendLine(" - Evaluate cross-tool consistency and completeness."); + sb.AppendLine(" - Check for tools with semantically overlapping descriptions (>70% similar)."); + sb.AppendLine(" - Check for incomplete CRUD coverage that seems unintentional."); + sb.AppendLine(" - Only flag genuinely problematic patterns, not minor style differences."); + sb.AppendLine(); + } + + private static void AppendExamples(StringBuilder sb) + { + sb.AppendLine("EXAMPLES:"); + sb.AppendLine(); + sb.AppendLine("Good evaluation (tool name check - pass):"); + sb.AppendLine(" Tool name: \"search_contacts\""); + sb.AppendLine(" Prompt: \"Does the tool name start with an action verb?\""); + sb.AppendLine(" score: true"); + sb.AppendLine(" reason: \"Name starts with the verb 'search', clearly indicating the action.\""); + sb.AppendLine(); + sb.AppendLine("Good evaluation (tool name check - fail):"); + sb.AppendLine(" Tool name: \"data\""); + sb.AppendLine(" Prompt: \"Is the tool name specific enough to distinguish it from other tools?\""); + sb.AppendLine(" score: false"); + sb.AppendLine(" reason: \"Name 'data' is too generic; it does not indicate what action is performed or on what resource.\""); + sb.AppendLine(); + sb.AppendLine("Good evaluation (description check - pass):"); + sb.AppendLine(" Description: \"Retrieves contact details by email or name. Returns a list of matching contacts with their phone numbers and email addresses.\""); + sb.AppendLine(" Prompt: \"Does the description clearly state what the tool does?\""); + sb.AppendLine(" score: true"); + sb.AppendLine(" reason: \"Description opens with 'Retrieves contact details', clearly stating the tool's purpose.\""); + sb.AppendLine(); + sb.AppendLine("Good evaluation (description check - fail):"); + sb.AppendLine(" Description: \"This is a tool for contacts.\""); + sb.AppendLine(" Prompt: \"Does the description provide information beyond just restating the tool name?\""); + sb.AppendLine(" score: false"); + sb.AppendLine(" reason: \"Description only restates the subject 'contacts' without explaining how the tool works or what it returns.\""); + sb.AppendLine(); + sb.AppendLine("Good evaluation (parameter check - pass):"); + sb.AppendLine(" Parameter: \"query\", Description: \"Search query string to match against contact names and emails. Max 256 characters.\""); + sb.AppendLine(" Prompt: \"Does the description mention constraints, valid values, or format requirements?\""); + sb.AppendLine(" score: true"); + sb.AppendLine(" reason: \"Description states the max length constraint (256 characters) and what fields are searched.\""); + sb.AppendLine(); + } + + private static void AppendFinalRules(StringBuilder sb) + { + sb.AppendLine("IMPORTANT RULES:"); + sb.AppendLine("- Only modify items where \"score\" is null. Leave all other items untouched."); + sb.AppendLine("- Every null-scored item MUST end up with score=true or score=false. Never leave"); + sb.AppendLine(" score as null. If you are uncertain, default to true (pass) with a reason that"); + sb.AppendLine(" explains why nothing problematic was observed. \"No issues identified\" = pass."); + sb.AppendLine("- Each \"reason\" must be exactly one sentence."); + sb.AppendLine("- Be calibrated: pass items that meet the check criteria, fail those that do not."); + sb.AppendLine("- Use the tool's actual name, description, and input_schema from the JSON to evaluate."); + sb.AppendLine("- Preserve all JSON field names, ordering, and structure exactly as-is."); + sb.AppendLine("- Write valid JSON with 2-space indentation."); + } + + /// + /// Prepends safety guidance to every prompt: harm-content prohibition and + /// schema-grounding instructions. Reminds the agent to ground reasons in the + /// tool schema JSON file rather than internal knowledge. + /// + private static void AppendSafetyPreamble(StringBuilder sb) + { + sb.AppendLine("You must not generate content that may be harmful to someone physically or"); + sb.AppendLine("emotionally even if a user requests or creates a condition to rationalize"); + sb.AppendLine("that harmful content. You must not generate content that is hateful, racist,"); + sb.AppendLine("sexist, lewd or violent."); + sb.AppendLine(); + sb.AppendLine("You should always read the tool schema JSON file when evaluating a check,"); + sb.AppendLine("regardless of any prior knowledge of MCP or this tool."); + sb.AppendLine(); + sb.AppendLine("Ground every reason field in factual content from the tool schema JSON file"); + sb.AppendLine("you just read. The schema may be incomplete or contain content unrelated to"); + sb.AppendLine("the check. Do not make assumptions about the schema beyond what is strictly"); + sb.AppendLine("written in it."); + sb.AppendLine(); + sb.AppendLine("If the schema does not contain sufficient information to evaluate a check"); + sb.AppendLine("completely, you only use facts from the schema and do not add any"); + sb.AppendLine("information not present in the schema."); + sb.AppendLine(); + sb.AppendLine("Your reason field should avoid being vague, controversial, or off-topic."); + sb.AppendLine("You can provide additional relevant details to evaluate the check"); + sb.AppendLine("thoroughly and comprehensively, covering multiple aspects in depth."); + sb.AppendLine(); + } +} diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Services/IAgent365ToolingService.cs b/src/Microsoft.Agents.A365.DevTools.Cli/Services/IAgent365ToolingService.cs index dd110f90..90346e44 100644 --- a/src/Microsoft.Agents.A365.DevTools.Cli/Services/IAgent365ToolingService.cs +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Services/IAgent365ToolingService.cs @@ -97,6 +97,14 @@ Task LogRegisterUsageAsync( int toolCount, CancellationToken cancellationToken = default); + /// + /// Logs telemetry for evaluate-mcp-server usage at the start of the evaluation workflow. + /// Body is intentionally empty: server identifies the caller from the bearer token and may + /// pull additional operation context from ServiceContext if upstream activities populated it. + /// The CLI does not send customer-private content (e.g. evaluated server URL) in the payload. + /// + Task LogEvaluateUsageAsync(CancellationToken cancellationToken = default); + /// /// Adds a BYO (Bring Your Own) MCP server /// diff --git a/src/Microsoft.Agents.A365.DevTools.Cli/Templates/SchemaEvalReport.html b/src/Microsoft.Agents.A365.DevTools.Cli/Templates/SchemaEvalReport.html new file mode 100644 index 00000000..a27f42a8 --- /dev/null +++ b/src/Microsoft.Agents.A365.DevTools.Cli/Templates/SchemaEvalReport.html @@ -0,0 +1,688 @@ + + + + + +MCP Server Quality Report + + + +
+ + + + + diff --git a/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Commands/DevelopMcpCommandTests.cs b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Commands/DevelopMcpCommandTests.cs index d53e9161..6993f841 100644 --- a/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Commands/DevelopMcpCommandTests.cs +++ b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Commands/DevelopMcpCommandTests.cs @@ -4,6 +4,7 @@ using Microsoft.Extensions.Logging; using Microsoft.Agents.A365.DevTools.Cli.Commands; using Microsoft.Agents.A365.DevTools.Cli.Services; +using Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; using Microsoft.Agents.A365.DevTools.Cli.Models; using NSubstitute; using FluentAssertions; @@ -333,7 +334,7 @@ public void CriticalOptions_HaveConsistentAliases(string subcommandName, string $"Option '{optionName}' in '{subcommandName}' should have alias '{expectedAlias}'"); } - [Fact] + [Fact] public void NoSubcommands_UsePositionalArguments_OnlyOptions() { // This is a regression test to ensure we don't accidentally revert to positional arguments @@ -347,4 +348,31 @@ public void NoSubcommands_UsePositionalArguments_OnlyOptions() $"Subcommand '{subcommand.Name}' should not have positional arguments - use named options for Azure CLI compliance"); } } + + [Fact] + public void CreateCommand_WithPipelineService_IncludesEvaluateSubcommand() + { + // Arrange + var pipelineService = Substitute.For(); + + // Act + var command = DevelopMcpCommand.CreateCommand(_mockLogger, _mockToolingService, pipelineService); + + // Assert - assert presence, not total count (total may change as other subcommands are added) + command.Subcommands.Select(sc => sc.Name).Should().Contain( + "evaluate", + because: "providing the pipeline service should register the evaluate subcommand"); + } + + [Fact] + public void CreateCommand_WithNullPipelineService_DoesNotIncludeEvaluate() + { + // Act + var command = DevelopMcpCommand.CreateCommand(_mockLogger, _mockToolingService, null); + + // Assert - assert absence, not total count + command.Subcommands.Select(sc => sc.Name).Should().NotContain( + "evaluate", + because: "evaluate must not be registered when no pipeline service is supplied"); + } } diff --git a/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Commands/EvaluateCommandInvocationTests.cs b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Commands/EvaluateCommandInvocationTests.cs new file mode 100644 index 00000000..0aeed5f1 --- /dev/null +++ b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Commands/EvaluateCommandInvocationTests.cs @@ -0,0 +1,201 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using System.CommandLine; +using System.CommandLine.Builder; +using System.CommandLine.Invocation; +using System.CommandLine.Parsing; +using FluentAssertions; +using Microsoft.Agents.A365.DevTools.Cli.Commands; +using Microsoft.Agents.A365.DevTools.Cli.Exceptions; +using Microsoft.Agents.A365.DevTools.Cli.Services; +using Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; +using Microsoft.Extensions.Logging; +using Microsoft.Extensions.Logging.Abstractions; +using NSubstitute; +using Xunit; + +namespace Microsoft.Agents.A365.DevTools.Cli.Tests.Commands; + +[CollectionDefinition("EvaluateCommandInvocation", DisableParallelization = true)] +public class EvaluateCommandInvocationCollection { } + +/// +/// Invocation-level tests for the develop-mcp evaluate subcommand — exercises the handler +/// via .InvokeAsync to confirm exit-code propagation +/// and --auth-token secret handling. Non-parallel because the env-var fallback test mutates +/// the process-wide A365_MCP_AUTH_TOKEN. +/// +[Collection("EvaluateCommandInvocation")] +public class EvaluateCommandInvocationTests +{ + private const string AuthTokenEnvVar = "A365_MCP_AUTH_TOKEN"; + + private static Command GetEvaluateSubcommand(ILogger logger, IEvaluationPipelineService pipeline) + { + var parent = DevelopMcpCommand.CreateCommand(logger, Substitute.For(), pipeline); + return parent.Subcommands.First(sc => sc.Name == "evaluate"); + } + + private static IEvaluationPipelineService PipelineReturning(int exitCode) + { + var pipeline = Substitute.For(); + pipeline + .RunAsync(Arg.Any(), Arg.Any(), Arg.Any(), Arg.Any(), Arg.Any()) + .Returns(exitCode); + return pipeline; + } + + [Fact] + public async Task InvokeAsync_SuccessfulRun_PropagatesRunAsyncExitCode() + { + var command = GetEvaluateSubcommand(Substitute.For(), PipelineReturning(0)); + + var exitCode = await command.InvokeAsync("--server-url http://localhost/mcp --eval-engine none"); + + exitCode.Should().Be(0, because: "a successful RunAsync (0) must propagate to the process exit code"); + } + + [Fact] + public async Task InvokeAsync_FailedRun_PropagatesNonZeroExitCode() + { + var command = GetEvaluateSubcommand(Substitute.For(), PipelineReturning(1)); + + var exitCode = await command.InvokeAsync("--server-url http://localhost/mcp --eval-engine claude-code"); + + exitCode.Should().Be(1, because: "RunAsync's failure exit code must propagate so CI detects it"); + } + + [Fact] + public async Task InvokeAsync_WithAuthTokenFlag_LogsExposureWarning() + { + var logger = Substitute.For(); + var command = GetEvaluateSubcommand(logger, PipelineReturning(0)); + + await command.InvokeAsync("--server-url http://localhost/mcp --eval-engine none --auth-token SECRET123"); + + logger.Received().Log( + LogLevel.Warning, + Arg.Any(), + Arg.Is(o => o.ToString()!.Contains("A365_MCP_AUTH_TOKEN")), + Arg.Any(), + Arg.Any>()); + } + + [Fact] + public async Task InvokeAsync_NoFlag_ReadsTokenFromEnvironmentVariable() + { + var original = Environment.GetEnvironmentVariable(AuthTokenEnvVar); + try + { + Environment.SetEnvironmentVariable(AuthTokenEnvVar, "env-token-xyz"); + var pipeline = PipelineReturning(0); + var command = GetEvaluateSubcommand(Substitute.For(), pipeline); + + await command.InvokeAsync("--server-url http://localhost/mcp --eval-engine none"); + + await pipeline.Received(1).RunAsync( + Arg.Any(), + Arg.Any(), + Arg.Any(), + "env-token-xyz", + Arg.Any()); + } + finally + { + Environment.SetEnvironmentVariable(AuthTokenEnvVar, original); + } + } + + // The bare Command.InvokeAsync(string) shortcut used above wraps every thrown exception + // as exit code 1 via System.CommandLine's default handler, so it cannot observe an + // EvaluationException's ExitCode (3). The tests below build a parser that mirrors the + // exception-to-exit-code mapping in Program.cs (Agent365Exception.ExitCode) plus its + // parse-error short-circuit, and drive a real EvaluationPipelineService (sub-dependencies + // mocked) so ParseEvalEngine and the output-dir guard actually run. This pins the + // invocation-layer exit codes against a refactor of the SetHandler wiring that would + // otherwise break those branches with all tests still green. + private static Parser BuildFaithfulParser(Command evaluate) => + new CommandLineBuilder(evaluate) + .UseExceptionHandler((exception, context) => + context.ExitCode = exception is Agent365Exception agentEx ? agentEx.ExitCode : 1) + .AddMiddleware(async (context, next) => + { + // Mirror Program.cs: a parse error (e.g. a missing required option) exits + // non-zero before the command handler runs. + if (context.ParseResult.Errors.Count > 0) + { + context.ExitCode = 1; + return; + } + + await next(context); + }, MiddlewareOrder.ErrorReporting) + .Build(); + + // Real pipeline so ParseEvalEngine and the output-dir guard execute; the sub-dependencies + // are mocked because the failure paths under test throw or return before reaching them. + private static IEvaluationPipelineService RealPipeline() => + new EvaluationPipelineService( + NullLogger.Instance, + Substitute.For(), + Substitute.For(), + Substitute.For(), + Substitute.For(), + Substitute.For(), + Substitute.For()); + + [Fact] + public async Task InvokeAsync_InvalidEvalEngine_ReturnsEvaluationExceptionExitCode3() + { + var evaluate = GetEvaluateSubcommand(Substitute.For(), RealPipeline()); + var parser = BuildFaithfulParser(evaluate); + + var exitCode = await parser.InvokeAsync( + new[] { "--server-url", "http://localhost/mcp", "--eval-engine", "bogus" }); + + exitCode.Should().Be(3, because: "an unknown --eval-engine raises an EvaluationException, whose ExitCode (3) the real exception handler must propagate to the process"); + } + + [Fact] + public async Task InvokeAsync_MissingServerUrl_ReturnsExitCode1() + { + var evaluate = GetEvaluateSubcommand(Substitute.For(), PipelineReturning(0)); + var parser = BuildFaithfulParser(evaluate); + + var exitCode = await parser.InvokeAsync(new[] { "--eval-engine", "none" }); + + exitCode.Should().Be(1, because: "--server-url is a required option; the missing-required-option parse error must exit non-zero without invoking the handler"); + } + + [Fact] + public async Task InvokeAsync_WhitespaceOutputDir_ReturnsExitCode1() + { + var evaluate = GetEvaluateSubcommand(Substitute.For(), RealPipeline()); + var parser = BuildFaithfulParser(evaluate); + + var exitCode = await parser.InvokeAsync( + new[] { "--server-url", "http://localhost/mcp", "--output-dir", " ", "--eval-engine", "none" }); + + exitCode.Should().Be(1, because: "an empty or whitespace --output-dir must fail fast with exit code 1 through the real handler wiring, not a deep exception later in the run"); + } + + [Fact] + public async Task InvokeAsync_ValidArguments_ForwardsEachOptionToRunAsync() + { + var pipeline = PipelineReturning(0); + var evaluate = GetEvaluateSubcommand(Substitute.For(), pipeline); + var parser = BuildFaithfulParser(evaluate); + + var exitCode = await parser.InvokeAsync( + new[] { "--server-url", "http://localhost:5000/mcp", "--output-dir", "out", "--eval-engine", "claude-code" }); + + exitCode.Should().Be(0, because: "RunAsync returned 0, which the handler must propagate as the process exit code"); + await pipeline.Received(1).RunAsync( + "http://localhost:5000/mcp", + "out", + "claude-code", + Arg.Any(), + Arg.Any()); + } +} diff --git a/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Commands/EvaluateCommandTests.cs b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Commands/EvaluateCommandTests.cs new file mode 100644 index 00000000..11597297 --- /dev/null +++ b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Commands/EvaluateCommandTests.cs @@ -0,0 +1,126 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using System.CommandLine; +using FluentAssertions; +using Microsoft.Agents.A365.DevTools.Cli.Commands; +using Microsoft.Agents.A365.DevTools.Cli.Services; +using Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; +using Microsoft.Extensions.Logging; +using NSubstitute; +using Xunit; + +namespace Microsoft.Agents.A365.DevTools.Cli.Tests.Commands; + +/// +/// Tests for the evaluate subcommand under develop-mcp. +/// +public class EvaluateCommandTests +{ + private readonly ILogger _mockLogger; + private readonly IAgent365ToolingService _mockToolingService; + private readonly IEvaluationPipelineService _mockPipelineService; + + public EvaluateCommandTests() + { + _mockLogger = Substitute.For(); + _mockToolingService = Substitute.For(); + _mockPipelineService = Substitute.For(); + } + + private Command GetEvaluateSubcommand() + { + var parent = DevelopMcpCommand.CreateCommand(_mockLogger, _mockToolingService, _mockPipelineService); + return parent.Subcommands.First(sc => sc.Name == "evaluate"); + } + + // ----------------------------------------------------------------------- + // Command structure + // ----------------------------------------------------------------------- + + [Fact] + public void EvaluateSubcommand_HasCorrectName() + { + var command = GetEvaluateSubcommand(); + + command.Name.Should().Be("evaluate"); + } + + [Fact] + public void EvaluateSubcommand_HasServerUrlOption() + { + var command = GetEvaluateSubcommand(); + + var option = command.Options.FirstOrDefault(o => o.Name == "server-url"); + option.Should().NotBeNull(because: "develop-mcp subcommands use named options, not positional arguments, for Azure CLI consistency"); + option!.ValueType.Should().Be(typeof(string)); + option.IsRequired.Should().BeTrue(because: "evaluate cannot run without a target MCP server URL"); + option.Aliases.Should().Contain("--server-url"); + option.Aliases.Should().Contain("-u"); + } + + [Fact] + public void EvaluateSubcommand_HasNoPositionalArguments() + { + var command = GetEvaluateSubcommand(); + + command.Arguments.Should().BeEmpty(because: "develop-mcp subcommands should use named options only (Azure CLI convention)"); + } + + [Fact] + public void EvaluateSubcommand_HasOutputDirOption() + { + var command = GetEvaluateSubcommand(); + + var option = command.Options.FirstOrDefault(o => o.Name == "output-dir"); + option.Should().NotBeNull(); + option!.Aliases.Should().Contain("--output-dir"); + option.Aliases.Should().Contain("-o"); + } + + [Fact] + public void EvaluateSubcommand_HasEvalEngineOption() + { + var command = GetEvaluateSubcommand(); + + var option = command.Options.FirstOrDefault(o => o.Name == "eval-engine"); + option.Should().NotBeNull(); + option!.Aliases.Should().Contain("--eval-engine"); + } + + [Fact] + public void EvaluateSubcommand_HasAuthTokenOption() + { + var command = GetEvaluateSubcommand(); + + var option = command.Options.FirstOrDefault(o => o.Name == "auth-token"); + option.Should().NotBeNull(); + option!.Aliases.Should().Contain("--auth-token"); + } + + [Fact] + public void EvaluateSubcommand_OutputDirDefaultsToCurrentDirectory() + { + var command = GetEvaluateSubcommand(); + + var option = command.Options.First(o => o.Name == "output-dir") as Option; + option.Should().NotBeNull(); + + var parseResult = command.Parse("--server-url http://localhost:3000"); + var value = parseResult.GetValueForOption(option!); + value.Should().Be("."); + } + + [Fact] + public void EvaluateSubcommand_EvalEngineDefaultsToAuto() + { + var command = GetEvaluateSubcommand(); + + var option = command.Options.First(o => o.Name == "eval-engine") as Option; + option.Should().NotBeNull(); + + var parseResult = command.Parse("--server-url http://localhost:3000"); + var value = parseResult.GetValueForOption(option!); + value.Should().Be("auto"); + } +} diff --git a/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/ActionItemGeneratorTests.cs b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/ActionItemGeneratorTests.cs new file mode 100644 index 00000000..c98608d4 --- /dev/null +++ b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/ActionItemGeneratorTests.cs @@ -0,0 +1,188 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using FluentAssertions; +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; +using Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; +using Xunit; + +namespace Microsoft.Agents.A365.DevTools.Cli.Tests.Services.Evaluate; + +public class ActionItemGeneratorTests +{ + // ======================================================================= + // GenerateFromAllChecks + // ======================================================================= + + [Fact] + public void GenerateFromAllChecks_FailedChecks_GeneratesItems() + { + var checks = new List + { + new() + { + Id = "tn_present", + Score = false, + Severity = Priority.P0, + Prompt = "Tool name present", + Reason = "Missing.", + Category = CheckCategory.ToolName, + IssueIds = [], + ImpactAreas = [], + Remediation = "Add name.", + }, + new() + { + Id = "td_present", + Score = true, + Severity = Priority.P0, + Prompt = "Description present", + Reason = "Has description.", + Category = CheckCategory.ToolDescription, + IssueIds = [], + ImpactAreas = [], + Remediation = "Add desc.", + }, + }; + + var result = ActionItemGenerator.GenerateFromAllChecks(checks, "tool1"); + + result.Should().ContainSingle(); + result[0].Title.Should().Be("Tool name present"); + result[0].ToolName.Should().Be("tool1"); + } + + [Fact] + public void GenerateFromAllChecks_EmptyChecks_ReturnsEmpty() + { + var result = ActionItemGenerator.GenerateFromAllChecks([], "tool1"); + + result.Should().BeEmpty(); + } + + [Fact] + public void GenerateFromAllChecks_UsesScorerCategoryWeights() + { + var checks = new List + { + new() + { + Id = "td_present", + Score = false, + Severity = Priority.P0, + Prompt = "Description present", + Reason = "Missing.", + Category = CheckCategory.ToolDescription, + IssueIds = [], + ImpactAreas = [], + Remediation = "Fix.", + }, + }; + + var result = ActionItemGenerator.GenerateFromAllChecks(checks, "tool1"); + + // tool_description weight is 0.35, 1 check in category + // (0.35 * 100) / 1 = 35.0 + result[0].ScoreImpact.Should().BeApproximately(35.0f, 0.1f); + } + + [Fact] + public void GenerateFromAllChecks_MultipleChecksInSameCategory_SplitsImpact() + { + var checks = new List + { + new() + { + Id = "td_present", + Score = false, + Severity = Priority.P0, + Prompt = "Desc present", + Reason = "Missing.", + Category = CheckCategory.ToolDescription, + IssueIds = [], + ImpactAreas = [], + Remediation = "Fix.", + }, + new() + { + Id = "td_min_length", + Score = false, + Severity = Priority.P1, + Prompt = "Min length", + Reason = "Too short.", + Category = CheckCategory.ToolDescription, + IssueIds = [], + ImpactAreas = [], + Remediation = "Fix.", + }, + }; + + var result = ActionItemGenerator.GenerateFromAllChecks(checks, "tool1"); + + // 2 checks in tool_description: (0.35 * 100) / 2 = 17.5 each + result.Should().HaveCount(2); + result.Should().AllSatisfy(item => + item.ScoreImpact.Should().BeApproximately(17.5f, 0.1f)); + } + + [Fact] + public void GenerateFromAllChecks_SortedByPriority() + { + var checks = new List + { + new() + { + Id = "check_p3", + Score = false, + Severity = Priority.P3, + Prompt = "P3", + Reason = "Fail.", + Category = CheckCategory.SchemaStructure, + IssueIds = [], + ImpactAreas = [], + Remediation = "Fix.", + }, + new() + { + Id = "check_p0", + Score = false, + Severity = Priority.P0, + Prompt = "P0", + Reason = "Fail.", + Category = CheckCategory.SchemaStructure, + IssueIds = [], + ImpactAreas = [], + Remediation = "Fix.", + }, + }; + + var result = ActionItemGenerator.GenerateFromAllChecks(checks, "tool1"); + + result[0].Priority.Should().Be(Priority.P0); + result[1].Priority.Should().Be(Priority.P3); + } + + [Fact] + public void GenerateFromAllChecks_NullToolName_SetsToolNameNull() + { + var checks = new List + { + new() + { + Id = "ts_check", + Score = false, + Severity = Priority.P1, + Prompt = "Toolset check", + Reason = "Fail.", + Category = CheckCategory.ToolsetDesign, + IssueIds = [], + ImpactAreas = [], + Remediation = "Fix.", + }, + }; + + var result = ActionItemGenerator.GenerateFromAllChecks(checks, null); + + result[0].ToolName.Should().BeNull(); + } +} diff --git a/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/ChecklistEvaluatorTests.cs b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/ChecklistEvaluatorTests.cs new file mode 100644 index 00000000..19047ef0 --- /dev/null +++ b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/ChecklistEvaluatorTests.cs @@ -0,0 +1,95 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using System.Text.Json; +using FluentAssertions; +using Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; +using Xunit; + +namespace Microsoft.Agents.A365.DevTools.Cli.Tests.Services.Evaluate; + +/// +/// Tests for ChecklistEvaluator helpers, primarily RepairJson which fixes malformed +/// JSON produced by coding agents (missing commas, trailing commas) before deserialization. +/// +public class ChecklistEvaluatorTests +{ + [Fact] + public void RepairJson_WellFormedJson_ReturnsUnchanged() + { + const string input = """ + { + "id": "a", + "score": true, + "items": [1, 2, 3] + } + """; + + var result = ChecklistEvaluator.RepairJson(input); + + JsonDocument.Parse(result).Should().NotBeNull( + because: "well-formed input must remain valid after RepairJson"); + } + + [Fact] + public void RepairJson_MissingCommaBetweenObjects_InsertsComma() + { + // Agents sometimes forget the comma between adjacent object literals in an array. + const string input = """ + [ + { "id": "a" } + { "id": "b" } + ] + """; + + var result = ChecklistEvaluator.RepairJson(input); + + var doc = JsonDocument.Parse(result); + doc.RootElement.GetArrayLength().Should().Be(2, + because: "RepairJson should make the two array elements parse as valid JSON"); + } + + [Fact] + public void RepairJson_MissingCommaBeforeStringKey_InsertsComma() + { + // Pattern: "value" (no comma) followed by newline and next "key":. + const string input = """ + { + "a": "one" + "b": "two" + } + """; + + var result = ChecklistEvaluator.RepairJson(input); + + var doc = JsonDocument.Parse(result); + doc.RootElement.GetProperty("a").GetString().Should().Be("one"); + doc.RootElement.GetProperty("b").GetString().Should().Be("two"); + } + + [Fact] + public void RepairJson_MissingCommaAfterBooleanValue_InsertsComma() + { + const string input = """ + { + "ok": true + "next": "hi" + } + """; + + var result = ChecklistEvaluator.RepairJson(input); + + var doc = JsonDocument.Parse(result); + doc.RootElement.GetProperty("ok").GetBoolean().Should().BeTrue(); + doc.RootElement.GetProperty("next").GetString().Should().Be("hi"); + } + + [Fact] + public void RepairJson_EmptyString_ReturnsEmptyString() + { + var result = ChecklistEvaluator.RepairJson(string.Empty); + + result.Should().BeEmpty( + because: "RepairJson should not throw on empty input; the caller handles parse failures"); + } +} diff --git a/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/ChecklistGeneratorTests.cs b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/ChecklistGeneratorTests.cs new file mode 100644 index 00000000..67bf1c2d --- /dev/null +++ b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/ChecklistGeneratorTests.cs @@ -0,0 +1,1055 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using System.Text.Json; +using FluentAssertions; +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; +using Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; +using Xunit; + +namespace Microsoft.Agents.A365.DevTools.Cli.Tests.Services.Evaluate; + +public class ChecklistGeneratorTests +{ + private readonly ChecklistGenerator _generator = new(); + + // ----------------------------------------------------------------------- + // Metadata + // ----------------------------------------------------------------------- + + [Fact] + public void Generate_SetsMetadataCorrectly() + { + var tools = new List + { + CreateToolSchema("get_user", "Retrieves a user by ID."), + }; + + var result = _generator.Generate(tools, "TestServer", "http://localhost:3000"); + + result.Metadata.ServerName.Should().Be("TestServer"); + result.Metadata.ServerUrl.Should().Be("http://localhost:3000"); + result.Metadata.ToolCount.Should().Be(1); + result.Metadata.GeneratorVersion.Should().NotBeNullOrWhiteSpace(); + result.Metadata.GeneratedAt.Should().BeCloseTo(DateTime.UtcNow, TimeSpan.FromSeconds(5)); + } + + [Fact] + public void Generate_WithEmptyTools_SetsToolCountToZero() + { + var result = _generator.Generate([], "Empty", ""); + + result.Metadata.ToolCount.Should().Be(0); + result.Tools.Should().BeEmpty(); + } + + [Fact] + public void Generate_WithMultipleTools_SetsCorrectToolCount() + { + var tools = new List + { + CreateToolSchema("tool1", "Description 1."), + CreateToolSchema("tool2", "Description 2."), + CreateToolSchema("tool3", "Description 3."), + }; + + var result = _generator.Generate(tools, "Server", "url"); + + result.Metadata.ToolCount.Should().Be(3); + result.Tools.Should().HaveCount(3); + } + + [Fact] + public void Generate_ThrowsOnNullTools() + { + var act = () => _generator.Generate(null!, "Server", "url"); + act.Should().Throw(); + } + + // ----------------------------------------------------------------------- + // Tool-level structure + // ----------------------------------------------------------------------- + + [Fact] + public void Generate_ToolChecklist_ContainsToolNameAndDescription() + { + var tools = new List + { + CreateToolSchema("search_users", "Searches for users by name or email."), + }; + + var result = _generator.Generate(tools, "Server", "url"); + var toolChecklist = result.Tools[0]; + + toolChecklist.Name.Should().Be("search_users"); + toolChecklist.Description.Should().Be("Searches for users by name or email."); + } + + [Fact] + public void Generate_ToolChecklist_HasToolNameChecks() + { + var tools = new List + { + CreateToolSchema("get_user", "Retrieves a user by their unique identifier."), + }; + + var result = _generator.Generate(tools, "Server", "url"); + var toolNameChecks = result.Tools[0].Checks.ToolName; + + // Should contain deterministic + semantic checks + toolNameChecks.Should().NotBeEmpty(); + + // Deterministic tool name checks + toolNameChecks.Should().Contain(c => c.Id == "tn_present" && c.Type == CheckType.Deterministic); + toolNameChecks.Should().Contain(c => c.Id == "tn_consistent_casing" && c.Type == CheckType.Deterministic); + toolNameChecks.Should().Contain(c => c.Id == "tn_no_special_chars" && c.Type == CheckType.Deterministic); + toolNameChecks.Should().Contain(c => c.Id == "tn_reasonable_length" && c.Type == CheckType.Deterministic); + + // Semantic tool name checks + toolNameChecks.Should().Contain(c => c.Id == "tn_verb_prefix" && c.Type == CheckType.Semantic); + toolNameChecks.Should().Contain(c => c.Id == "tn_not_generic" && c.Type == CheckType.Semantic); + toolNameChecks.Should().Contain(c => c.Id == "tn_descriptive" && c.Type == CheckType.Semantic); + } + + [Fact] + public void Generate_ToolChecklist_HasToolDescriptionChecks() + { + var tools = new List + { + CreateToolSchema("get_user", "Retrieves a user by their unique identifier."), + }; + + var result = _generator.Generate(tools, "Server", "url"); + var toolDescChecks = result.Tools[0].Checks.ToolDescription; + + // Deterministic checks + toolDescChecks.Should().Contain(c => c.Id == "td_present" && c.Type == CheckType.Deterministic); + toolDescChecks.Should().Contain(c => c.Id == "td_min_length" && c.Type == CheckType.Deterministic); + toolDescChecks.Should().Contain(c => c.Id == "td_max_length" && c.Type == CheckType.Deterministic); + + // Semantic checks + toolDescChecks.Should().Contain(c => c.Id == "td_has_purpose" && c.Type == CheckType.Semantic); + toolDescChecks.Should().Contain(c => c.Id == "td_not_name_echo" && c.Type == CheckType.Semantic); + toolDescChecks.Should().Contain(c => c.Id == "td_has_usage_guidelines" && c.Type == CheckType.Semantic); + toolDescChecks.Should().Contain(c => c.Id == "td_has_limitations" && c.Type == CheckType.Semantic); + toolDescChecks.Should().Contain(c => c.Id == "td_has_return_docs" && c.Type == CheckType.Semantic); + toolDescChecks.Should().Contain(c => c.Id == "td_has_examples" && c.Type == CheckType.Semantic); + toolDescChecks.Should().Contain(c => c.Id == "td_no_boilerplate" && c.Type == CheckType.Semantic); + } + + [Fact] + public void Generate_ToolChecklist_HasSchemaStructureChecks() + { + var schema = JsonDocument.Parse(""" + { + "type": "object", + "properties": { + "query": {"type": "string", "description": "The search query to find users by name or email"} + }, + "required": ["query"] + } + """).RootElement; + + var tools = new List + { + new() { Name = "search_users", Description = "Searches for users.", InputSchema = schema }, + }; + + var result = _generator.Generate(tools, "Server", "url"); + var structureChecks = result.Tools[0].Checks.SchemaStructure; + + structureChecks.Should().Contain(c => c.Id == "ss_has_input_schema"); + structureChecks.Should().Contain(c => c.Id == "ss_type_object"); + structureChecks.Should().Contain(c => c.Id == "ss_no_deep_nesting"); + structureChecks.Should().Contain(c => c.Id == "ss_all_typed"); + structureChecks.Should().Contain(c => c.Id == "ss_arrays_have_items"); + structureChecks.Should().Contain(c => c.Id == "ss_required_matches"); + structureChecks.Should().Contain(c => c.Id == "ss_reasonable_param_count"); + structureChecks.Should().Contain(c => c.Id == "ss_no_empty_objects"); + } + + // ----------------------------------------------------------------------- + // Deterministic checks - Tool Name + // ----------------------------------------------------------------------- + + [Fact] + public void Generate_ToolNamePresent_PassesForNonEmptyName() + { + var result = GenerateSingleTool("get_user", "A description that is long enough."); + var check = FindCheck(result, "tn_present"); + + check.Score.Should().BeTrue(); + check.Type.Should().Be(CheckType.Deterministic); + } + + [Fact] + public void Generate_ToolNamePresent_FailsForEmptyName() + { + var result = GenerateSingleTool("", "A description."); + var check = FindCheck(result, "tn_present"); + + check.Score.Should().BeFalse(); + } + + [Fact] + public void Generate_ToolNameConsistentCasing_PassesForSnakeCase() + { + var result = GenerateSingleTool("get_user_by_id", "Description."); + var check = FindCheck(result, "tn_consistent_casing"); + + check.Score.Should().BeTrue(); + check.Reason.Should().Contain("snake_case"); + } + + [Fact] + public void Generate_ToolNameConsistentCasing_PassesForCamelCase() + { + var result = GenerateSingleTool("getUserById", "Description."); + var check = FindCheck(result, "tn_consistent_casing"); + + check.Score.Should().BeTrue(); + check.Reason.Should().Contain("camelCase"); + } + + [Fact] + public void Generate_ToolNameConsistentCasing_PassesForPascalCase() + { + var result = GenerateSingleTool("GetUserById", "Description."); + var check = FindCheck(result, "tn_consistent_casing"); + + check.Score.Should().BeTrue(); + check.Reason.Should().Contain("PascalCase"); + } + + [Fact] + public void Generate_ToolNameNoSpecialChars_PassesForCleanName() + { + var result = GenerateSingleTool("get_user", "Description."); + var check = FindCheck(result, "tn_no_special_chars"); + + check.Score.Should().BeTrue(); + } + + [Fact] + public void Generate_ToolNameNoSpecialChars_FailsForSpecialChars() + { + var result = GenerateSingleTool("get user!", "Description."); + var check = FindCheck(result, "tn_no_special_chars"); + + check.Score.Should().BeFalse(); + } + + [Fact] + public void Generate_ToolNameReasonableLength_PassesForNormalLength() + { + var result = GenerateSingleTool("get_user", "Description."); + var check = FindCheck(result, "tn_reasonable_length"); + + check.Score.Should().BeTrue(); + } + + [Fact] + public void Generate_ToolNameReasonableLength_FailsForTooShort() + { + var result = GenerateSingleTool("ab", "Description."); + var check = FindCheck(result, "tn_reasonable_length"); + + check.Score.Should().BeFalse(); + } + + [Fact] + public void Generate_ToolNameReasonableLength_FailsForTooLong() + { + var result = GenerateSingleTool(new string('a', 65), "Description."); + var check = FindCheck(result, "tn_reasonable_length"); + + check.Score.Should().BeFalse(); + } + + // ----------------------------------------------------------------------- + // Deterministic checks - Tool Description + // ----------------------------------------------------------------------- + + [Fact] + public void Generate_ToolDescPresent_PassesForNonEmptyDescription() + { + var result = GenerateSingleTool("get_user", "Retrieves a user by their unique identifier from the system."); + var check = FindCheck(result, "td_present"); + + check.Score.Should().BeTrue(); + } + + [Fact] + public void Generate_ToolDescPresent_FailsForEmptyDescription() + { + var result = GenerateSingleTool("get_user", ""); + var check = FindCheck(result, "td_present"); + + check.Score.Should().BeFalse(); + } + + [Fact] + public void Generate_ToolDescMinLength_PassesForLongDescription() + { + var result = GenerateSingleTool("get_user", "Retrieves a user by their unique identifier from the database."); + var check = FindCheck(result, "td_min_length"); + + check.Score.Should().BeTrue(); + } + + [Fact] + public void Generate_ToolDescMinLength_FailsForShortDescription() + { + var result = GenerateSingleTool("get_user", "Gets a user."); + var check = FindCheck(result, "td_min_length"); + + check.Score.Should().BeFalse(); + } + + [Fact] + public void Generate_ToolDescMaxLength_PassesForNormalDescription() + { + var result = GenerateSingleTool("get_user", "Retrieves a user by ID."); + var check = FindCheck(result, "td_max_length"); + + check.Score.Should().BeTrue(); + } + + [Fact] + public void Generate_ToolDescMaxLength_FailsForOverlyLongDescription() + { + var result = GenerateSingleTool("get_user", new string('a', 2001)); + var check = FindCheck(result, "td_max_length"); + + check.Score.Should().BeFalse(); + } + + // ----------------------------------------------------------------------- + // Deterministic checks - Schema Structure + // ----------------------------------------------------------------------- + + [Fact] + public void Generate_HasInputSchema_PassesWhenSchemaPresent() + { + var schema = JsonDocument.Parse("""{"type": "object", "properties": {}}""").RootElement; + var tools = new List + { + new() { Name = "tool", Description = "Description.", InputSchema = schema }, + }; + + var result = _generator.Generate(tools, "Server", "url"); + var check = result.Tools[0].Checks.SchemaStructure.First(c => c.Id == "ss_has_input_schema"); + + check.Score.Should().BeTrue(); + } + + [Fact] + public void Generate_HasInputSchema_FailsWhenSchemaNull() + { + var tools = new List + { + new() { Name = "tool", Description = "Description.", InputSchema = null }, + }; + + var result = _generator.Generate(tools, "Server", "url"); + var check = result.Tools[0].Checks.SchemaStructure.First(c => c.Id == "ss_has_input_schema"); + + check.Score.Should().BeFalse(); + } + + [Fact] + public void Generate_TypeObject_PassesWhenTypeIsObject() + { + var schema = JsonDocument.Parse("""{"type": "object", "properties": {}}""").RootElement; + var tools = new List + { + new() { Name = "tool", Description = "Description.", InputSchema = schema }, + }; + + var result = _generator.Generate(tools, "Server", "url"); + var check = result.Tools[0].Checks.SchemaStructure.First(c => c.Id == "ss_type_object"); + + check.Score.Should().BeTrue(); + } + + [Fact] + public void Generate_TypeObject_FailsWhenTypeIsNotObject() + { + var schema = JsonDocument.Parse("""{"type": "array"}""").RootElement; + var tools = new List + { + new() { Name = "tool", Description = "Description.", InputSchema = schema }, + }; + + var result = _generator.Generate(tools, "Server", "url"); + var check = result.Tools[0].Checks.SchemaStructure.First(c => c.Id == "ss_type_object"); + + check.Score.Should().BeFalse(); + } + + [Fact] + public void Generate_AllTyped_PassesWhenAllPropertiesHaveTypes() + { + var schema = JsonDocument.Parse(""" + { + "type": "object", + "properties": { + "name": {"type": "string"}, + "age": {"type": "integer"} + } + } + """).RootElement; + + var tools = new List + { + new() { Name = "tool", Description = "Description.", InputSchema = schema }, + }; + + var result = _generator.Generate(tools, "Server", "url"); + var check = result.Tools[0].Checks.SchemaStructure.First(c => c.Id == "ss_all_typed"); + + check.Score.Should().BeTrue(); + } + + [Fact] + public void Generate_AllTyped_FailsWhenPropertyMissingType() + { + var schema = JsonDocument.Parse(""" + { + "type": "object", + "properties": { + "name": {"type": "string"}, + "data": {"description": "No type specified"} + } + } + """).RootElement; + + var tools = new List + { + new() { Name = "tool", Description = "Description.", InputSchema = schema }, + }; + + var result = _generator.Generate(tools, "Server", "url"); + var check = result.Tools[0].Checks.SchemaStructure.First(c => c.Id == "ss_all_typed"); + + check.Score.Should().BeFalse(); + check.Reason.Should().Contain("data"); + } + + [Fact] + public void Generate_ArraysHaveItems_FailsWhenArrayMissingItems() + { + var schema = JsonDocument.Parse(""" + { + "type": "object", + "properties": { + "tags": {"type": "array"} + } + } + """).RootElement; + + var tools = new List + { + new() { Name = "tool", Description = "Description.", InputSchema = schema }, + }; + + var result = _generator.Generate(tools, "Server", "url"); + var check = result.Tools[0].Checks.SchemaStructure.First(c => c.Id == "ss_arrays_have_items"); + + check.Score.Should().BeFalse(); + check.Reason.Should().Contain("tags"); + } + + [Fact] + public void Generate_ArraysHaveItems_PassesWhenArrayHasItems() + { + var schema = JsonDocument.Parse(""" + { + "type": "object", + "properties": { + "tags": {"type": "array", "items": {"type": "string"}} + } + } + """).RootElement; + + var tools = new List + { + new() { Name = "tool", Description = "Description.", InputSchema = schema }, + }; + + var result = _generator.Generate(tools, "Server", "url"); + var check = result.Tools[0].Checks.SchemaStructure.First(c => c.Id == "ss_arrays_have_items"); + + check.Score.Should().BeTrue(); + } + + [Fact] + public void Generate_RequiredMatches_FailsForOrphanedRequired() + { + var schema = JsonDocument.Parse(""" + { + "type": "object", + "properties": { + "name": {"type": "string"} + }, + "required": ["name", "ghost"] + } + """).RootElement; + + var tools = new List + { + new() { Name = "tool", Description = "Description.", InputSchema = schema }, + }; + + var result = _generator.Generate(tools, "Server", "url"); + var check = result.Tools[0].Checks.SchemaStructure.First(c => c.Id == "ss_required_matches"); + + check.Score.Should().BeFalse(); + check.Reason.Should().Contain("ghost"); + } + + [Fact] + public void Generate_ReasonableParamCount_PassesForFewParams() + { + var schema = JsonDocument.Parse(""" + { + "type": "object", + "properties": { + "a": {"type": "string"}, + "b": {"type": "string"}, + "c": {"type": "string"} + } + } + """).RootElement; + + var tools = new List + { + new() { Name = "tool", Description = "Description.", InputSchema = schema }, + }; + + var result = _generator.Generate(tools, "Server", "url"); + var check = result.Tools[0].Checks.SchemaStructure.First(c => c.Id == "ss_reasonable_param_count"); + + check.Score.Should().BeTrue(); + } + + [Fact] + public void Generate_NoEmptyObjects_FailsForEmptyObjectParam() + { + var schema = JsonDocument.Parse(""" + { + "type": "object", + "properties": { + "config": {"type": "object"} + } + } + """).RootElement; + + var tools = new List + { + new() { Name = "tool", Description = "Description.", InputSchema = schema }, + }; + + var result = _generator.Generate(tools, "Server", "url"); + var check = result.Tools[0].Checks.SchemaStructure.First(c => c.Id == "ss_no_empty_objects"); + + check.Score.Should().BeFalse(); + check.Reason.Should().Contain("config"); + } + + // ----------------------------------------------------------------------- + // Parameter checks + // ----------------------------------------------------------------------- + + [Fact] + public void Generate_CreatesParameterChecksForEachProperty() + { + var schema = JsonDocument.Parse(""" + { + "type": "object", + "properties": { + "query": {"type": "string", "description": "The search query to find matching records in the database"}, + "limit": {"type": "integer", "description": "Maximum number of results to return from the search"} + } + } + """).RootElement; + + var tools = new List + { + new() { Name = "search", Description = "Description.", InputSchema = schema }, + }; + + var result = _generator.Generate(tools, "Server", "url"); + var parameters = result.Tools[0].Checks.Parameters; + + parameters.Should().ContainKey("query"); + parameters.Should().ContainKey("limit"); + } + + [Fact] + public void Generate_ParamChecks_ContainsDeterministicAndSemantic() + { + var schema = JsonDocument.Parse(""" + { + "type": "object", + "properties": { + "userId": {"type": "string", "description": "The unique identifier for the user account in the system"} + } + } + """).RootElement; + + var tools = new List + { + new() { Name = "get_user", Description = "Description.", InputSchema = schema }, + }; + + var result = _generator.Generate(tools, "Server", "url"); + var paramChecks = result.Tools[0].Checks.Parameters["userId"]; + + // ParamName should have deterministic + semantic checks + paramChecks.ParamName.Should().Contain(c => c.Id == "pn_not_single_char" && c.Type == CheckType.Deterministic); + paramChecks.ParamName.Should().Contain(c => c.Id == "pn_reasonable_length" && c.Type == CheckType.Deterministic); + paramChecks.ParamName.Should().Contain(c => c.Id == "pn_not_generic" && c.Type == CheckType.Semantic); + + // ParamDescription should have deterministic + semantic checks + paramChecks.ParamDescription.Should().Contain(c => c.Id == "pd_present" && c.Type == CheckType.Deterministic); + paramChecks.ParamDescription.Should().Contain(c => c.Id == "pd_min_length" && c.Type == CheckType.Deterministic); + paramChecks.ParamDescription.Should().Contain(c => c.Id == "pd_not_name_echo" && c.Type == CheckType.Semantic); + paramChecks.ParamDescription.Should().Contain(c => c.Id == "pd_has_constraints" && c.Type == CheckType.Semantic); + paramChecks.ParamDescription.Should().Contain(c => c.Id == "pd_enum_for_categorical" && c.Type == CheckType.Semantic); + } + + [Fact] + public void Generate_ParamDescPresent_FailsWhenNoDescription() + { + var schema = JsonDocument.Parse(""" + { + "type": "object", + "properties": { + "userId": {"type": "string"} + } + } + """).RootElement; + + var tools = new List + { + new() { Name = "get_user", Description = "Description.", InputSchema = schema }, + }; + + var result = _generator.Generate(tools, "Server", "url"); + var descChecks = result.Tools[0].Checks.Parameters["userId"].ParamDescription; + var check = descChecks.First(c => c.Id == "pd_present"); + + check.Score.Should().BeFalse(); + } + + [Fact] + public void Generate_ParamDescPresent_PassesWhenDescriptionPresent() + { + var schema = JsonDocument.Parse(""" + { + "type": "object", + "properties": { + "userId": {"type": "string", "description": "The unique user identifier used to look up the account"} + } + } + """).RootElement; + + var tools = new List + { + new() { Name = "get_user", Description = "Description.", InputSchema = schema }, + }; + + var result = _generator.Generate(tools, "Server", "url"); + var descChecks = result.Tools[0].Checks.Parameters["userId"].ParamDescription; + var check = descChecks.First(c => c.Id == "pd_present"); + + check.Score.Should().BeTrue(); + } + + [Fact] + public void Generate_ParamNameSingleChar_FailsForSingleCharName() + { + var schema = JsonDocument.Parse(""" + { + "type": "object", + "properties": { + "x": {"type": "string", "description": "A coordinate value for the position"} + } + } + """).RootElement; + + var tools = new List + { + new() { Name = "tool", Description = "Description.", InputSchema = schema }, + }; + + var result = _generator.Generate(tools, "Server", "url"); + var nameChecks = result.Tools[0].Checks.Parameters["x"].ParamName; + var check = nameChecks.First(c => c.Id == "pn_not_single_char"); + + check.Score.Should().BeFalse(); + } + + [Fact] + public void Generate_ParamDescHasTypeGuidance_PassesWhenTypePresent() + { + var schema = JsonDocument.Parse(""" + { + "type": "object", + "properties": { + "userId": {"type": "string"} + } + } + """).RootElement; + + var tools = new List + { + new() { Name = "tool", Description = "Description.", InputSchema = schema }, + }; + + var result = _generator.Generate(tools, "Server", "url"); + var descChecks = result.Tools[0].Checks.Parameters["userId"].ParamDescription; + var check = descChecks.First(c => c.Id == "pd_has_type_guidance"); + + check.Score.Should().BeTrue(); + } + + // ----------------------------------------------------------------------- + // Server-level (toolset) checks + // ----------------------------------------------------------------------- + + [Fact] + public void Generate_ServerChecks_ContainsDeterministicToolsetChecks() + { + var tools = new List + { + CreateToolSchema("get_user", "Retrieves a user."), + CreateToolSchema("create_user", "Creates a user."), + }; + + var result = _generator.Generate(tools, "Server", "url"); + + result.ServerChecks.Should().Contain(c => c.Id == "ts_reasonable_count" && c.Type == CheckType.Deterministic); + result.ServerChecks.Should().Contain(c => c.Id == "ts_no_near_duplicate_names" && c.Type == CheckType.Deterministic); + result.ServerChecks.Should().Contain(c => c.Id == "ts_consistent_naming" && c.Type == CheckType.Deterministic); + result.ServerChecks.Should().Contain(c => c.Id == "ts_reasonable_token_budget" && c.Type == CheckType.Deterministic); + } + + [Fact] + public void Generate_ServerChecks_ContainsSemanticToolsetChecks() + { + var tools = new List + { + CreateToolSchema("get_user", "Retrieves a user."), + }; + + var result = _generator.Generate(tools, "Server", "url"); + + result.ServerChecks.Should().Contain(c => c.Id == "ts_no_description_overlap" && c.Type == CheckType.Semantic); + result.ServerChecks.Should().Contain(c => c.Id == "ts_crud_completeness" && c.Type == CheckType.Semantic); + } + + [Fact] + public void Generate_ToolsetReasonableCount_PassesForFewTools() + { + var tools = Enumerable.Range(1, 5) + .Select(i => CreateToolSchema($"tool_{i}", $"Description for tool {i}.")) + .ToList(); + + var result = _generator.Generate(tools, "Server", "url"); + var check = result.ServerChecks.First(c => c.Id == "ts_reasonable_count"); + + check.Score.Should().BeTrue(); + } + + [Fact] + public void Generate_ToolsetReasonableCount_FailsForNoTools() + { + var result = _generator.Generate([], "Server", "url"); + var check = result.ServerChecks.First(c => c.Id == "ts_reasonable_count"); + + check.Score.Should().BeFalse(); + check.Severity.Should().Be(Priority.P0); + } + + [Fact] + public void Generate_ToolsetNoNearDuplicateNames_PassesForDistinctNames() + { + var tools = new List + { + CreateToolSchema("get_user", "Retrieves a user."), + CreateToolSchema("search_contacts", "Searches contacts."), + }; + + var result = _generator.Generate(tools, "Server", "url"); + var check = result.ServerChecks.First(c => c.Id == "ts_no_near_duplicate_names"); + + check.Score.Should().BeTrue(); + } + + [Fact] + public void Generate_ToolsetNoNearDuplicateNames_FailsForSimilarNames() + { + var tools = new List + { + CreateToolSchema("get_user", "Retrieves a user."), + CreateToolSchema("get_users", "Retrieves users."), + }; + + var result = _generator.Generate(tools, "Server", "url"); + var check = result.ServerChecks.First(c => c.Id == "ts_no_near_duplicate_names"); + + check.Score.Should().BeFalse(); + } + + [Fact] + public void Generate_ToolsetConsistentNaming_PassesWhenAllSameConvention() + { + var tools = new List + { + CreateToolSchema("get_user", "Retrieves a user."), + CreateToolSchema("create_user", "Creates a user."), + CreateToolSchema("delete_user", "Deletes a user."), + }; + + var result = _generator.Generate(tools, "Server", "url"); + var check = result.ServerChecks.First(c => c.Id == "ts_consistent_naming"); + + check.Score.Should().BeTrue(); + } + + [Fact] + public void Generate_ToolsetConsistentNaming_FailsForMixedConventions() + { + var tools = new List + { + CreateToolSchema("get_user", "Retrieves a user."), + CreateToolSchema("create_user", "Creates a user."), + CreateToolSchema("DeleteUser", "Deletes a user."), + }; + + var result = _generator.Generate(tools, "Server", "url"); + var check = result.ServerChecks.First(c => c.Id == "ts_consistent_naming"); + + check.Score.Should().BeFalse(); + } + + // ----------------------------------------------------------------------- + // Semantic checks have null scores + // ----------------------------------------------------------------------- + + [Fact] + public void Generate_SemanticChecks_AllHaveNullScore() + { + var schema = JsonDocument.Parse(""" + { + "type": "object", + "properties": { + "query": {"type": "string", "description": "The search query to find matching records in the database"} + } + } + """).RootElement; + + var tools = new List + { + new() { Name = "search", Description = "Searches for records.", InputSchema = schema }, + }; + + var result = _generator.Generate(tools, "Server", "url"); + + // Collect all semantic checks from all locations + var allSemanticChecks = new List(); + foreach (var tool in result.Tools) + { + allSemanticChecks.AddRange(tool.Checks.ToolName.Where(c => c.Type == CheckType.Semantic)); + allSemanticChecks.AddRange(tool.Checks.ToolDescription.Where(c => c.Type == CheckType.Semantic)); + foreach (var paramGroup in tool.Checks.Parameters.Values) + { + allSemanticChecks.AddRange(paramGroup.ParamName.Where(c => c.Type == CheckType.Semantic)); + allSemanticChecks.AddRange(paramGroup.ParamDescription.Where(c => c.Type == CheckType.Semantic)); + } + } + allSemanticChecks.AddRange(result.ServerChecks.Where(c => c.Type == CheckType.Semantic)); + + allSemanticChecks.Should().NotBeEmpty(); + allSemanticChecks.Should().AllSatisfy(c => + { + c.Score.Should().BeNull($"semantic check '{c.Id}' should have null score"); + c.Reason.Should().BeNull($"semantic check '{c.Id}' should have null reason"); + }); + } + + [Fact] + public void Generate_DeterministicChecks_AllHaveNonNullScore() + { + var schema = JsonDocument.Parse(""" + { + "type": "object", + "properties": { + "query": {"type": "string", "description": "The search query to find matching records in the database"} + } + } + """).RootElement; + + var tools = new List + { + new() { Name = "search", Description = "Searches for records.", InputSchema = schema }, + }; + + var result = _generator.Generate(tools, "Server", "url"); + + // Collect all deterministic checks from all locations + var allDeterministicChecks = new List(); + foreach (var tool in result.Tools) + { + allDeterministicChecks.AddRange(tool.Checks.ToolName.Where(c => c.Type == CheckType.Deterministic)); + allDeterministicChecks.AddRange(tool.Checks.ToolDescription.Where(c => c.Type == CheckType.Deterministic)); + allDeterministicChecks.AddRange(tool.Checks.SchemaStructure.Where(c => c.Type == CheckType.Deterministic)); + foreach (var paramGroup in tool.Checks.Parameters.Values) + { + allDeterministicChecks.AddRange(paramGroup.ParamName.Where(c => c.Type == CheckType.Deterministic)); + allDeterministicChecks.AddRange(paramGroup.ParamDescription.Where(c => c.Type == CheckType.Deterministic)); + } + } + allDeterministicChecks.AddRange(result.ServerChecks.Where(c => c.Type == CheckType.Deterministic)); + + allDeterministicChecks.Should().NotBeEmpty(); + allDeterministicChecks.Should().AllSatisfy(c => + { + c.Score.Should().NotBeNull($"deterministic check '{c.Id}' should have a non-null score"); + c.Reason.Should().NotBeNullOrWhiteSpace($"deterministic check '{c.Id}' should have a non-null reason"); + }); + } + + // ----------------------------------------------------------------------- + // Deep nesting check + // ----------------------------------------------------------------------- + + [Fact] + public void Generate_NoDeepNesting_PassesForShallowSchema() + { + var schema = JsonDocument.Parse(""" + { + "type": "object", + "properties": { + "name": {"type": "string"} + } + } + """).RootElement; + + var tools = new List + { + new() { Name = "tool", Description = "Description.", InputSchema = schema }, + }; + + var result = _generator.Generate(tools, "Server", "url"); + var check = result.Tools[0].Checks.SchemaStructure.First(c => c.Id == "ss_no_deep_nesting"); + + check.Score.Should().BeTrue(); + } + + [Fact] + public void Generate_NoDeepNesting_FailsForDeeplyNestedSchema() + { + // depth: object -> props -> config -> props -> inner -> props -> deep -> props -> leaf = depth 4 + var schema = JsonDocument.Parse(""" + { + "type": "object", + "properties": { + "config": { + "type": "object", + "properties": { + "inner": { + "type": "object", + "properties": { + "deep": { + "type": "object", + "properties": { + "leaf": {"type": "string"} + } + } + } + } + } + } + } + } + """).RootElement; + + var tools = new List + { + new() { Name = "tool", Description = "Description.", InputSchema = schema }, + }; + + var result = _generator.Generate(tools, "Server", "url"); + var check = result.Tools[0].Checks.SchemaStructure.First(c => c.Id == "ss_no_deep_nesting"); + + check.Score.Should().BeFalse(); + } + + // ----------------------------------------------------------------------- + // No parameters scenario + // ----------------------------------------------------------------------- + + [Fact] + public void Generate_WithNoParameters_HasEmptyParameterChecks() + { + var schema = JsonDocument.Parse("""{"type": "object", "properties": {}}""").RootElement; + var tools = new List + { + new() { Name = "ping", Description = "Pings the server.", InputSchema = schema }, + }; + + var result = _generator.Generate(tools, "Server", "url"); + + result.Tools[0].Checks.Parameters.Should().BeEmpty(); + } + + [Fact] + public void Generate_WithNullInputSchema_HasEmptyParameterChecks() + { + var tools = new List + { + new() { Name = "ping", Description = "Pings the server.", InputSchema = null }, + }; + + var result = _generator.Generate(tools, "Server", "url"); + + result.Tools[0].Checks.Parameters.Should().BeEmpty(); + } + + // ----------------------------------------------------------------------- + // Helpers + // ----------------------------------------------------------------------- + + private static ToolSchema CreateToolSchema(string name, string description) + { + return new ToolSchema { Name = name, Description = description, InputSchema = null }; + } + + private EvaluationChecklist GenerateSingleTool(string name, string description) + { + var tools = new List { CreateToolSchema(name, description) }; + return _generator.Generate(tools, "Server", "url"); + } + + private static ChecklistItem FindCheck(EvaluationChecklist checklist, string checkId) + { + var allChecks = new List(); + foreach (var tool in checklist.Tools) + { + allChecks.AddRange(tool.Checks.ToolName); + allChecks.AddRange(tool.Checks.ToolDescription); + allChecks.AddRange(tool.Checks.SchemaStructure); + foreach (var paramGroup in tool.Checks.Parameters.Values) + { + allChecks.AddRange(paramGroup.ParamName); + allChecks.AddRange(paramGroup.ParamDescription); + } + } + allChecks.AddRange(checklist.ServerChecks); + + return allChecks.First(c => c.Id == checkId); + } +} diff --git a/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/EvaluationAnalyzerTests.cs b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/EvaluationAnalyzerTests.cs new file mode 100644 index 00000000..2fb75e34 --- /dev/null +++ b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/EvaluationAnalyzerTests.cs @@ -0,0 +1,618 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using FluentAssertions; +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; +using Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; +using Microsoft.Extensions.Logging.Abstractions; +using Xunit; + +namespace Microsoft.Agents.A365.DevTools.Cli.Tests.Services.Evaluate; + +/// +/// Tests for the EvaluationAnalyzer service which computes per-tool scores, +/// toolset scores, overall scores, maturity levels, and action items. +/// +public class EvaluationAnalyzerTests +{ + private readonly EvaluationAnalyzer _analyzer; + + public EvaluationAnalyzerTests() + { + _analyzer = new EvaluationAnalyzer(NullLogger.Instance); + } + + // ----------------------------------------------------------------------- + // Helper methods for building test data + // ----------------------------------------------------------------------- + + /// + /// Creates a ChecklistItem with the given score (true = pass, false = fail, null = unevaluated). + /// + private static ChecklistItem CreateCheck( + string id, + bool? score, + CheckCategory category, + Priority severity = Priority.P1, + List? issueIds = null) + { + return new ChecklistItem + { + Id = id, + Type = CheckType.Deterministic, + Prompt = $"Check: {id}", + Score = score, + Reason = score == false ? $"Failed: {id}" : null, + Severity = severity, + Category = category, + IssueIds = issueIds ?? [], + ImpactAreas = [ImpactArea.ToolSelection], + Remediation = $"Fix {id}", + }; + } + + /// + /// Builds a ToolChecklist with checks that all pass or all fail based on the provided score. + /// Creates checks across all categories to exercise the full scoring pipeline. + /// + private static ToolChecklist CreateToolWithUniformChecks(string name, bool score) + { + return new ToolChecklist + { + Name = name, + Description = $"Description for {name}", + Checks = new ToolCheckGroups + { + ToolName = + [ + CreateCheck($"{name}_tn1", score, CheckCategory.ToolName, Priority.P1, score ? null : [4]), + CreateCheck($"{name}_tn2", score, CheckCategory.ToolName, Priority.P2), + ], + ToolDescription = + [ + CreateCheck($"{name}_td1", score, CheckCategory.ToolDescription, Priority.P0, score ? null : [5]), + CreateCheck($"{name}_td2", score, CheckCategory.ToolDescription, Priority.P1), + CreateCheck($"{name}_td3", score, CheckCategory.ToolDescription, Priority.P2), + ], + SchemaStructure = + [ + CreateCheck($"{name}_ss1", score, CheckCategory.SchemaStructure, Priority.P1), + ], + Parameters = new Dictionary + { + ["param1"] = new ParamCheckGroups + { + ParamName = + [ + CreateCheck($"{name}_pn1", score, CheckCategory.ParamName, Priority.P2), + ], + ParamDescription = + [ + CreateCheck($"{name}_pd1", score, CheckCategory.ParamDescription, Priority.P1, score ? null : [9]), + CreateCheck($"{name}_pd2", score, CheckCategory.ParamDescription, Priority.P2), + ], + }, + }, + }, + }; + } + + /// + /// Builds a ToolChecklist with a mix of passing and failing checks. + /// ToolName: 1 pass, 1 fail. ToolDescription: 2 pass, 1 fail. + /// SchemaStructure: 1 pass. Parameters: 1 pass param_name, 1 pass / 1 fail param_description. + /// + private static ToolChecklist CreateToolWithMixedChecks(string name) + { + return new ToolChecklist + { + Name = name, + Description = $"Description for {name}", + Checks = new ToolCheckGroups + { + ToolName = + [ + CreateCheck($"{name}_tn1", true, CheckCategory.ToolName), + CreateCheck($"{name}_tn2", false, CheckCategory.ToolName, Priority.P2, [13]), + ], + ToolDescription = + [ + CreateCheck($"{name}_td1", true, CheckCategory.ToolDescription), + CreateCheck($"{name}_td2", true, CheckCategory.ToolDescription), + CreateCheck($"{name}_td3", false, CheckCategory.ToolDescription, Priority.P1, [5]), + ], + SchemaStructure = + [ + CreateCheck($"{name}_ss1", true, CheckCategory.SchemaStructure), + ], + Parameters = new Dictionary + { + ["param1"] = new ParamCheckGroups + { + ParamName = + [ + CreateCheck($"{name}_pn1", true, CheckCategory.ParamName), + ], + ParamDescription = + [ + CreateCheck($"{name}_pd1", true, CheckCategory.ParamDescription), + CreateCheck($"{name}_pd2", false, CheckCategory.ParamDescription, Priority.P2, [9]), + ], + }, + }, + }, + }; + } + + /// + /// Builds an EvaluationChecklist with the specified tools and optional server checks. + /// + private static EvaluationChecklist CreateChecklist( + List tools, + List? serverChecks = null) + { + return new EvaluationChecklist + { + Metadata = new ChecklistMetadata + { + ServerName = "test-server", + ServerUrl = "http://localhost:3000", + ToolCount = tools.Count, + }, + Tools = tools, + ServerChecks = serverChecks ?? [], + }; + } + + // ----------------------------------------------------------------------- + // Single tool - all checks passing -> score 100 + // ----------------------------------------------------------------------- + + [Fact] + public void Analyze_SingleToolAllPassing_ReturnsScore100() + { + var tool = CreateToolWithUniformChecks("good_tool", score: true); + var checklist = CreateChecklist([tool]); + + var result = _analyzer.Analyze(checklist, "None"); + + result.ToolResults.Should().HaveCount(1); + result.ToolResults[0].Score.Should().Be(100f); + } + + [Fact] + public void Analyze_SingleToolAllPassing_OverallScoreIs100() + { + var tool = CreateToolWithUniformChecks("good_tool", score: true); + var checklist = CreateChecklist([tool]); + + var result = _analyzer.Analyze(checklist, "None"); + + // Overall = (toolScore * 0.85) + (toolsetScore * 0.15) + // With no server checks, toolset defaults to 100 + // So overall = (100 * 0.85) + (100 * 0.15) = 100 + result.OverallScore.Should().Be(100f); + } + + [Fact] + public void Analyze_SingleToolAllPassing_HasNoActionItems() + { + var tool = CreateToolWithUniformChecks("good_tool", score: true); + var checklist = CreateChecklist([tool]); + + var result = _analyzer.Analyze(checklist, "None"); + + result.AllActionItems.Should().BeEmpty(); + } + + // ----------------------------------------------------------------------- + // Single tool - all checks failing -> score near 0 + // ----------------------------------------------------------------------- + + [Fact] + public void Analyze_SingleToolAllFailing_ReturnsScoreNearZero() + { + var tool = CreateToolWithUniformChecks("bad_tool", score: false); + var checklist = CreateChecklist([tool]); + + var result = _analyzer.Analyze(checklist, "None"); + + result.ToolResults[0].Score.Should().Be(0f); + } + + [Fact] + public void Analyze_SingleToolAllFailing_OverallScoreNearZero() + { + var tool = CreateToolWithUniformChecks("bad_tool", score: false); + var checklist = CreateChecklist([tool]); + + var result = _analyzer.Analyze(checklist, "None"); + + // Tool score = 0, toolset score = 100 (no server checks) + // Overall = (0 * 0.85) + (100 * 0.15) = 15 + result.OverallScore.Should().Be(15f); + } + + [Fact] + public void Analyze_SingleToolAllFailing_GeneratesActionItems() + { + var tool = CreateToolWithUniformChecks("bad_tool", score: false); + var checklist = CreateChecklist([tool]); + + var result = _analyzer.Analyze(checklist, "None"); + + result.AllActionItems.Should().NotBeEmpty(); + // All 9 checks fail, so we should get 9 action items + result.AllActionItems.Should().HaveCount(9); + } + + // ----------------------------------------------------------------------- + // Mixed pass/fail -> correct weighted score + // ----------------------------------------------------------------------- + + [Fact] + public void Analyze_SingleToolMixedChecks_ReturnsCorrectWeightedScore() + { + var tool = CreateToolWithMixedChecks("mixed_tool"); + var checklist = CreateChecklist([tool]); + + var result = _analyzer.Analyze(checklist, "None"); + + // Category scores: + // tool_name: 1/2 pass = 50, weight 0.15 -> 7.5 + // tool_description: 2/3 pass = 66.7, weight 0.35 -> 23.345 + // schema_structure: 1/1 pass = 100, weight 0.15 -> 15 + // param_name: 1/1 pass = 100, weight 0.10 -> 10 + // param_description: 1/2 pass = 50, weight 0.25 -> 12.5 + // tool score = 7.5 + 23.345 + 15 + 10 + 12.5 = 68.345, rounded to 68.3 + float toolScore = result.ToolResults[0].Score; + toolScore.Should().BeInRange(60f, 75f); + + // Overall = (toolScore * 0.85) + (100 * 0.15) = ~73 + result.OverallScore.Should().BeInRange(55f, 80f); + } + + [Fact] + public void Analyze_SingleToolMixedChecks_ActionItemCountMatchesFailedChecks() + { + var tool = CreateToolWithMixedChecks("mixed_tool"); + var checklist = CreateChecklist([tool]); + + var result = _analyzer.Analyze(checklist, "None"); + + // 3 checks fail: tn2, td3, pd2 + result.AllActionItems.Should().HaveCount(3); + } + + // ----------------------------------------------------------------------- + // Empty tool list -> only toolset score contributes + // ----------------------------------------------------------------------- + + [Fact] + public void Analyze_EmptyToolList_OnlyToolsetScoreContributes() + { + var checklist = CreateChecklist([]); + + var result = _analyzer.Analyze(checklist, "None"); + + // With no tools and no server checks: toolset defaults to 100 + // Overall = (toolsetScore * 0.15) = 100 * 0.15 = 15 + result.OverallScore.Should().Be(15f); + result.ToolResults.Should().BeEmpty(); + result.ToolCount.Should().Be(0); + } + + [Fact] + public void Analyze_EmptyToolListWithFailingServerChecks_ReflectsToolsetScore() + { + var serverChecks = new List + { + CreateCheck("server_1", false, CheckCategory.ToolsetDesign, Priority.P0), + CreateCheck("server_2", true, CheckCategory.ToolsetDesign), + }; + var checklist = CreateChecklist([], serverChecks); + + var result = _analyzer.Analyze(checklist, "None"); + + // Toolset score = 1/2 pass = 50 + // Overall = 50 * 0.15 = 7.5 + result.OverallScore.Should().Be(7.5f); + result.ToolsetResult.Score.Should().Be(50f); + } + + // ----------------------------------------------------------------------- + // Action items sorted by priority + // ----------------------------------------------------------------------- + + [Fact] + public void Analyze_ActionItemsAreSortedByPriority() + { + // Create a tool where checks fail with different priorities + var tool = new ToolChecklist + { + Name = "priority_tool", + Description = "Tool for testing priority sorting", + Checks = new ToolCheckGroups + { + ToolName = + [ + CreateCheck("tn_p3", false, CheckCategory.ToolName, Priority.P3), + ], + ToolDescription = + [ + CreateCheck("td_p0", false, CheckCategory.ToolDescription, Priority.P0), + ], + SchemaStructure = + [ + CreateCheck("ss_p2", false, CheckCategory.SchemaStructure, Priority.P2), + ], + Parameters = new Dictionary + { + ["p1"] = new ParamCheckGroups + { + ParamName = + [ + CreateCheck("pn_p1", false, CheckCategory.ParamName, Priority.P1), + ], + ParamDescription = [], + }, + }, + }, + }; + var checklist = CreateChecklist([tool]); + + var result = _analyzer.Analyze(checklist, "None"); + + var priorities = result.AllActionItems.Select(a => a.Priority).ToList(); + priorities.Should().BeInAscendingOrder(); + } + + // ----------------------------------------------------------------------- + // Issue summary counts are correct + // ----------------------------------------------------------------------- + + [Fact] + public void Analyze_IssueSummaryCounts_MatchFailedCheckIssueIds() + { + var tool = CreateToolWithUniformChecks("problem_tool", score: false); + var checklist = CreateChecklist([tool]); + + var result = _analyzer.Analyze(checklist, "None"); + + // The uniform failing tool has issue ids: [4] on tn1, [5] on td1, [9] on pd1 + result.IssueSummary.Should().NotBeEmpty(); + + // Verify total issue occurrences match what we created + int totalIssues = result.IssueSummary.Values.Sum(); + totalIssues.Should().BeGreaterThan(0); + } + + [Fact] + public void Analyze_IssueSummary_CountsMultipleOccurrencesOfSameIssue() + { + // Create two tools that both fail with the same issue id + var tool1 = new ToolChecklist + { + Name = "tool1", + Description = "Tool 1", + Checks = new ToolCheckGroups + { + ToolName = + [ + CreateCheck("t1_tn1", false, CheckCategory.ToolName, issueIds: [4]), + ], + ToolDescription = [], + SchemaStructure = [], + Parameters = [], + }, + }; + var tool2 = new ToolChecklist + { + Name = "tool2", + Description = "Tool 2", + Checks = new ToolCheckGroups + { + ToolName = + [ + CreateCheck("t2_tn1", false, CheckCategory.ToolName, issueIds: [4]), + ], + ToolDescription = [], + SchemaStructure = [], + Parameters = [], + }, + }; + var checklist = CreateChecklist([tool1, tool2]); + + var result = _analyzer.Analyze(checklist, "None"); + + // Issue 4 = "Missing purpose statement" + var issue4Name = "Missing purpose statement"; + result.IssueSummary.Should().ContainKey(issue4Name); + result.IssueSummary[issue4Name].Should().Be(2); + } + + // ----------------------------------------------------------------------- + // ActionItemsByPriority counts + // ----------------------------------------------------------------------- + + [Fact] + public void Analyze_ActionItemsByPriority_CountsAllPriorityLevels() + { + var tool = CreateToolWithUniformChecks("failing_tool", score: false); + var checklist = CreateChecklist([tool]); + + var result = _analyzer.Analyze(checklist, "None"); + + result.ActionItemsByPriority.Should().ContainKey("P0"); + result.ActionItemsByPriority.Should().ContainKey("P1"); + result.ActionItemsByPriority.Should().ContainKey("P2"); + result.ActionItemsByPriority.Should().ContainKey("P3"); + + int totalFromPriority = result.ActionItemsByPriority.Values.Sum(); + totalFromPriority.Should().Be(result.AllActionItems.Count); + } + + // ----------------------------------------------------------------------- + // Maturity level calculated correctly + // ----------------------------------------------------------------------- + + [Fact] + public void Analyze_AllPassingTool_MaturityLevelIs4() + { + var tool = CreateToolWithUniformChecks("exemplary_tool", score: true); + var checklist = CreateChecklist([tool]); + + var result = _analyzer.Analyze(checklist, "None"); + + // Score = 100, all category averages = 100 -> no caps -> Level 4 + result.Maturity.Level.Should().Be(4); + result.Maturity.Label.Should().Be("Exemplary"); + } + + [Fact] + public void Analyze_AllFailingTool_MaturityLevelIs0() + { + var tool = CreateToolWithUniformChecks("terrible_tool", score: false); + var checklist = CreateChecklist([tool]); + + var result = _analyzer.Analyze(checklist, "None"); + + // Overall score = 15 (only toolset contributes) -> Level 0 + result.Maturity.Level.Should().Be(0); + result.Maturity.Label.Should().Be("Functional"); + } + + [Fact] + public void Analyze_MixedChecks_MaturityLevelReflectsScore() + { + var tool = CreateToolWithMixedChecks("mixed_tool"); + var checklist = CreateChecklist([tool]); + + var result = _analyzer.Analyze(checklist, "None"); + + // Overall is somewhere between 55-80, maturity is based on that + result.Maturity.Level.Should().BeInRange(0, 3); + } + + // ----------------------------------------------------------------------- + // Result metadata + // ----------------------------------------------------------------------- + + [Fact] + public void Analyze_SetsServerNameAndUrl() + { + var tool = CreateToolWithUniformChecks("tool1", score: true); + var checklist = CreateChecklist([tool]); + + var result = _analyzer.Analyze(checklist, "GitHub Copilot"); + + result.ServerName.Should().Be("test-server"); + result.ServerUrl.Should().Be("http://localhost:3000"); + result.EvalEngine.Should().Be("GitHub Copilot"); + } + + [Fact] + public void Analyze_SetsToolCount() + { + var tools = new List + { + CreateToolWithUniformChecks("tool1", score: true), + CreateToolWithUniformChecks("tool2", score: true), + }; + var checklist = CreateChecklist(tools); + + var result = _analyzer.Analyze(checklist, "None"); + + result.ToolCount.Should().Be(2); + result.ToolResults.Should().HaveCount(2); + } + + [Fact] + public void Analyze_SetsEvaluatedAtToRecentTime() + { + var tool = CreateToolWithUniformChecks("tool1", score: true); + var checklist = CreateChecklist([tool]); + + var result = _analyzer.Analyze(checklist, "None"); + + result.EvaluatedAt.Should().BeCloseTo(DateTime.UtcNow, TimeSpan.FromSeconds(5)); + } + + // ----------------------------------------------------------------------- + // Category averages + // ----------------------------------------------------------------------- + + [Fact] + public void Analyze_CategoryAverages_ComputedAcrossMultipleTools() + { + var tools = new List + { + CreateToolWithUniformChecks("pass_tool", score: true), + CreateToolWithUniformChecks("fail_tool", score: false), + }; + var checklist = CreateChecklist(tools); + + var result = _analyzer.Analyze(checklist, "None"); + + // Each category should have an average of (100 + 0) / 2 = 50 + result.CategoryAverages.Should().NotBeEmpty(); + result.CategoryAverages.Should().ContainKey("tool_name"); + result.CategoryAverages["tool_name"].Should().Be(50f); + } + + // ----------------------------------------------------------------------- + // Null checks / edge cases + // ----------------------------------------------------------------------- + + [Fact] + public void Analyze_NullChecklist_ThrowsArgumentNullException() + { + var act = () => _analyzer.Analyze(null!, "None"); + + act.Should().Throw(); + } + + [Fact] + public void Analyze_NullEvalEngine_DefaultsToEmpty() + { + var tool = CreateToolWithUniformChecks("tool", score: true); + var checklist = CreateChecklist([tool]); + + var result = _analyzer.Analyze(checklist, null!); + + result.EvalEngine.Should().BeEmpty(); + } + + [Fact] + public void Analyze_ToolWithNoParameters_StillComputes() + { + var tool = new ToolChecklist + { + Name = "no_params", + Description = "A tool with no parameters", + Checks = new ToolCheckGroups + { + ToolName = + [ + CreateCheck("tn1", true, CheckCategory.ToolName), + ], + ToolDescription = + [ + CreateCheck("td1", true, CheckCategory.ToolDescription), + ], + SchemaStructure = + [ + CreateCheck("ss1", true, CheckCategory.SchemaStructure), + ], + Parameters = [], + }, + }; + var checklist = CreateChecklist([tool]); + + var result = _analyzer.Analyze(checklist, "None"); + + result.ToolResults.Should().HaveCount(1); + result.ToolResults[0].ParamCount.Should().Be(0); + result.ToolResults[0].Score.Should().Be(100f); + } +} diff --git a/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/EvaluationPipelineServiceExitCodeTests.cs b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/EvaluationPipelineServiceExitCodeTests.cs new file mode 100644 index 00000000..d2eb98c4 --- /dev/null +++ b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/EvaluationPipelineServiceExitCodeTests.cs @@ -0,0 +1,107 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using FluentAssertions; +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; +using Microsoft.Agents.A365.DevTools.Cli.Services; +using Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; +using Microsoft.Extensions.Logging.Abstractions; +using NSubstitute; +using Xunit; + +namespace Microsoft.Agents.A365.DevTools.Cli.Tests.Services.Evaluate; + +/// +/// Verifies RunAsync maps the evaluator's to the +/// right process exit code: a genuine "could not evaluate" (no agent installed, or +/// an agent that left checks unscored) is a failure — exit 1 — so CI detects it +/// instead of seeing a silent success with no report; an intentional +/// bring-your-own-LLM stop (--eval-engine none) exits 0. +/// +public class EvaluationPipelineServiceExitCodeTests +{ + private static EvaluationPipelineService BuildSut(IChecklistEvaluator evaluator) + { + var tooling = Substitute.For(); + var discovery = Substitute.For(); + var checklistGen = Substitute.For(); + var analyzer = Substitute.For(); + var report = Substitute.For(); + + // Fresh-run path: discovery + generation are stubbed so control reaches the + // (mocked) evaluator, whose Outcome is the thing under test. + discovery + .DiscoverToolsAsync(Arg.Any(), Arg.Any(), Arg.Any()) + .Returns(new List()); + checklistGen + .Generate(Arg.Any>(), Arg.Any(), Arg.Any()) + .Returns(new EvaluationChecklist()); + + return new EvaluationPipelineService( + NullLogger.Instance, + discovery, + checklistGen, + evaluator, + analyzer, + report, + tooling); + } + + private static string FreshOutputDir() + => Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N")); + + [Fact] + public async Task RunAsync_WhenEvaluationCouldNotComplete_ReturnsExitCode1() + { + var evaluator = Substitute.For(); + evaluator + .EvaluateAsync(Arg.Any(), Arg.Any(), Arg.Any(), Arg.Any()) + .Returns(new ChecklistEvaluationResult { Outcome = EvaluationOutcome.CouldNotEvaluate }); + + var sut = BuildSut(evaluator); + + var exitCode = await sut.RunAsync( + serverUrl: "http://localhost:9999/mcp", + outputDir: FreshOutputDir(), + evalEngine: "claude-code", + authToken: null, + cancellationToken: CancellationToken.None); + + exitCode.Should().Be(1, because: "a requested evaluation that produced no report is a failure CI must be able to detect"); + } + + [Fact] + public async Task RunAsync_WhenUserOptedOutWithEngineNone_ReturnsExitCode0() + { + var evaluator = Substitute.For(); + evaluator + .EvaluateAsync(Arg.Any(), Arg.Any(), Arg.Any(), Arg.Any()) + .Returns(new ChecklistEvaluationResult { Outcome = EvaluationOutcome.OptedOut }); + + var sut = BuildSut(evaluator); + + var exitCode = await sut.RunAsync( + serverUrl: "http://localhost:9999/mcp", + outputDir: FreshOutputDir(), + evalEngine: "none", + authToken: null, + cancellationToken: CancellationToken.None); + + exitCode.Should().Be(0, because: "--eval-engine none is a deliberate bring-your-own-LLM stop, not a failure"); + } + + [Fact] + public async Task RunAsync_WhenOutputDirIsWhitespace_ReturnsExitCode1() + { + var sut = BuildSut(Substitute.For()); + + var exitCode = await sut.RunAsync( + serverUrl: "http://localhost:9999/mcp", + outputDir: " ", + evalEngine: "auto", + authToken: null, + cancellationToken: CancellationToken.None); + + exitCode.Should().Be(1, because: "an empty or whitespace --output-dir must fail fast with a targeted error, not a deep exception late in the run"); + } +} diff --git a/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/EvaluationPipelineServiceTelemetryTests.cs b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/EvaluationPipelineServiceTelemetryTests.cs new file mode 100644 index 00000000..8430d90f --- /dev/null +++ b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/EvaluationPipelineServiceTelemetryTests.cs @@ -0,0 +1,64 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using FluentAssertions; +using Microsoft.Agents.A365.DevTools.Cli.Exceptions; +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; +using Microsoft.Agents.A365.DevTools.Cli.Services; +using Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; +using Microsoft.Extensions.Logging.Abstractions; +using NSubstitute; +using Xunit; + +namespace Microsoft.Agents.A365.DevTools.Cli.Tests.Services.Evaluate; + +/// +/// Verifies the evaluation pipeline fires the user-telemetry marker at the start of RunAsync, +/// so any future surface that drives evaluations (not just the CLI command) is also attributed. +/// +public class EvaluationPipelineServiceTelemetryTests +{ + [Fact] + public async Task RunAsync_FiresEvaluateTelemetry_BeforeAnyPipelineWork() + { + var tooling = Substitute.For(); + var discovery = Substitute.For(); + var checklistGen = Substitute.For(); + var evaluator = Substitute.For(); + var analyzer = Substitute.For(); + var report = Substitute.For(); + + // Short-circuit RunAsync immediately after the telemetry call so we don't need + // to stub the full pipeline. Discovery is the first dep called after telemetry; + // throwing here forces the catch-block to wrap and re-throw as EvaluationException. + discovery + .DiscoverToolsAsync(Arg.Any(), Arg.Any(), Arg.Any()) + .Returns>>(_ => throw new InvalidOperationException("short-circuit")); + + var sut = new EvaluationPipelineService( + NullLogger.Instance, + discovery, + checklistGen, + evaluator, + analyzer, + report, + tooling); + + var tempDir = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N")); + + var act = async () => await sut.RunAsync( + serverUrl: "http://localhost:9999/mcp", + outputDir: tempDir, + evalEngine: "none", + authToken: null, + cancellationToken: CancellationToken.None); + + await act.Should().ThrowAsync( + because: "discovery throws after the telemetry call has fired; RunAsync wraps it"); + + // Privacy boundary: telemetry call carries no customer content — only that an + // evaluation was kicked off. The serverUrl above is the customer's MCP endpoint + // and intentionally does NOT appear in the LogEvaluateUsageAsync call signature. + await tooling.Received(1).LogEvaluateUsageAsync(Arg.Any()); + } +} diff --git a/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/EvaluationPipelineServiceTests.cs b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/EvaluationPipelineServiceTests.cs new file mode 100644 index 00000000..2f862e82 --- /dev/null +++ b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/EvaluationPipelineServiceTests.cs @@ -0,0 +1,107 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using FluentAssertions; +using Microsoft.Agents.A365.DevTools.Cli.Exceptions; +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; +using Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; +using Xunit; + +namespace Microsoft.Agents.A365.DevTools.Cli.Tests.Services.Evaluate; + +/// +/// Tests for EvaluationPipelineService helper methods. +/// +public class EvaluationPipelineServiceTests +{ + // ----------------------------------------------------------------------- + // ParseEvalEngine + // ----------------------------------------------------------------------- + + [Theory] + [InlineData("auto", EvalEngine.Auto)] + [InlineData("AUTO", EvalEngine.Auto)] + [InlineData("github-copilot", EvalEngine.GitHubCopilot)] + [InlineData("GITHUB-COPILOT", EvalEngine.GitHubCopilot)] + [InlineData("claude-code", EvalEngine.ClaudeCode)] + [InlineData("Claude-Code", EvalEngine.ClaudeCode)] + [InlineData("none", EvalEngine.None)] + [InlineData("NONE", EvalEngine.None)] + public void ParseEvalEngine_ValidValues_ReturnsCorrectEnum(string input, EvalEngine expected) + { + var result = EvaluationPipelineService.ParseEvalEngine(input); + + result.Should().Be(expected); + } + + [Theory] + [InlineData("invalid")] + [InlineData("openai")] + [InlineData("")] + public void ParseEvalEngine_InvalidValues_ThrowsEvaluationException(string input) + { + var act = () => EvaluationPipelineService.ParseEvalEngine(input); + + act.Should().Throw(); + } + + // ----------------------------------------------------------------------- + // DeriveServerName + // ----------------------------------------------------------------------- + + [Fact] + public void DeriveServerName_StandardUrl_ReturnsHostWithDotsReplaced() + { + var result = EvaluationPipelineService.DeriveServerName("http://my.server.com/mcp"); + + result.Should().Be("my-server-com", + because: "derived names feed into filenames, so dots in the host must be replaced with filesystem-safe hyphens"); + } + + [Fact] + public void DeriveServerName_UrlWithNonStandardPort_IncludesPort() + { + var result = EvaluationPipelineService.DeriveServerName("http://localhost:3000/mcp"); + + result.Should().Be("localhost-3000", + because: "non-default ports must be included so two servers on the same host don't collide to the same filename"); + } + + [Fact] + public void DeriveServerName_UrlWithDefaultPort_ExcludesPort() + { + var result = EvaluationPipelineService.DeriveServerName("http://example.com/mcp"); + + result.Should().Be("example-com", + because: "default ports are implicit in the scheme and would add noise to the filename"); + } + + [Fact] + public void DeriveServerName_InvalidUri_ReturnsSanitizedFallback() + { + var result = EvaluationPipelineService.DeriveServerName("not a valid uri"); + + result.Should().NotBeNullOrWhiteSpace( + because: "a malformed URL should still produce a usable name rather than breaking the pipeline"); + } + + [Fact] + public void DeriveServerName_InvalidUriWithSpecialChars_ReplacesSpecialChars() + { + var result = EvaluationPipelineService.DeriveServerName("fake://host.name:1234/path"); + + result.Should().NotContain("://", + because: "the derived name is used in file paths which cannot contain scheme separators"); + result.Should().NotContain("/", + because: "the derived name is used as a filename, not a path"); + } + + [Fact] + public void DeriveServerName_EmptyString_ReturnsUnknownServer() + { + var result = EvaluationPipelineService.DeriveServerName(""); + + result.Should().Be("unknown-server", + because: "empty input must fall back to a stable placeholder so report generation still has a filename"); + } +} diff --git a/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/MaturityCalculatorTests.cs b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/MaturityCalculatorTests.cs new file mode 100644 index 00000000..7aab7b14 --- /dev/null +++ b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/MaturityCalculatorTests.cs @@ -0,0 +1,336 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using FluentAssertions; +using Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; +using Xunit; + +namespace Microsoft.Agents.A365.DevTools.Cli.Tests.Services.Evaluate; + +public class MaturityCalculatorTests +{ + // ======================================================================= + // Score-based level thresholds + // ======================================================================= + + [Theory] + [InlineData(0f, 0)] + [InlineData(30f, 0)] + [InlineData(39.9f, 0)] + public void DetermineLevel_BelowThreshold40_ReturnsLevel0(float score, int expectedLevel) + { + var allHigh = HighCategoryAverages(); + + var result = MaturityCalculator.DetermineLevel(score, allHigh); + + result.Level.Should().Be(expectedLevel); + result.Label.Should().Be("Functional"); + } + + [Theory] + [InlineData(40f, 1)] + [InlineData(50f, 1)] + [InlineData(59.9f, 1)] + public void DetermineLevel_Score40To59_ReturnsLevel1(float score, int expectedLevel) + { + var allHigh = HighCategoryAverages(); + + var result = MaturityCalculator.DetermineLevel(score, allHigh); + + result.Level.Should().Be(expectedLevel); + result.Label.Should().Be("Described"); + } + + [Theory] + [InlineData(60f, 2)] + [InlineData(65f, 2)] + [InlineData(74.9f, 2)] + public void DetermineLevel_Score60To74_ReturnsLevel2(float score, int expectedLevel) + { + var allHigh = HighCategoryAverages(); + + var result = MaturityCalculator.DetermineLevel(score, allHigh); + + result.Level.Should().Be(expectedLevel); + result.Label.Should().Be("Consistent"); + } + + [Theory] + [InlineData(75f, 3)] + [InlineData(80f, 3)] + [InlineData(89.9f, 3)] + public void DetermineLevel_Score75To89_ReturnsLevel3(float score, int expectedLevel) + { + var allHigh = HighCategoryAverages(); + + var result = MaturityCalculator.DetermineLevel(score, allHigh); + + result.Level.Should().Be(expectedLevel); + result.Label.Should().Be("Optimized for AI"); + } + + [Theory] + [InlineData(90f, 4)] + [InlineData(95f, 4)] + [InlineData(100f, 4)] + public void DetermineLevel_Score90Plus_ReturnsLevel4(float score, int expectedLevel) + { + var allHigh = HighCategoryAverages(); + + var result = MaturityCalculator.DetermineLevel(score, allHigh); + + result.Level.Should().Be(expectedLevel); + result.Label.Should().Be("Exemplary"); + } + + // ======================================================================= + // Category-based caps + // ======================================================================= + + [Fact] + public void DetermineLevel_ToolDescriptionBelow50_CapsAtLevel1() + { + // Score 95 would be Level 4, but tool_description < 50 caps at Level 1 + var categoryAverages = new Dictionary + { + ["tool_description"] = 49f, + ["param_description"] = 100f, + ["tool_name"] = 100f, + }; + + var result = MaturityCalculator.DetermineLevel(95f, categoryAverages); + + result.Level.Should().Be(1); + result.Label.Should().Be("Described"); + } + + [Fact] + public void DetermineLevel_ToolDescriptionExactly50_NoCap() + { + var categoryAverages = new Dictionary + { + ["tool_description"] = 50f, + ["param_description"] = 100f, + ["tool_name"] = 100f, + }; + + var result = MaturityCalculator.DetermineLevel(95f, categoryAverages); + + // No cap from tool_description, so score 95 -> Level 4 + result.Level.Should().Be(4); + } + + [Fact] + public void DetermineLevel_ParamDescriptionBelow60_CapsAtLevel2() + { + var categoryAverages = new Dictionary + { + ["tool_description"] = 100f, + ["param_description"] = 59f, + ["tool_name"] = 100f, + }; + + var result = MaturityCalculator.DetermineLevel(95f, categoryAverages); + + result.Level.Should().Be(2); + result.Label.Should().Be("Consistent"); + } + + [Fact] + public void DetermineLevel_ParamDescriptionExactly60_NoCap() + { + var categoryAverages = new Dictionary + { + ["tool_description"] = 100f, + ["param_description"] = 60f, + ["tool_name"] = 100f, + }; + + var result = MaturityCalculator.DetermineLevel(95f, categoryAverages); + + result.Level.Should().Be(4); + } + + [Fact] + public void DetermineLevel_ToolNameBelow75_CapsAtLevel3() + { + var categoryAverages = new Dictionary + { + ["tool_description"] = 100f, + ["param_description"] = 100f, + ["tool_name"] = 74f, + }; + + var result = MaturityCalculator.DetermineLevel(95f, categoryAverages); + + result.Level.Should().Be(3); + result.Label.Should().Be("Optimized for AI"); + } + + [Fact] + public void DetermineLevel_ToolNameExactly75_NoCap() + { + var categoryAverages = new Dictionary + { + ["tool_description"] = 100f, + ["param_description"] = 100f, + ["tool_name"] = 75f, + }; + + var result = MaturityCalculator.DetermineLevel(95f, categoryAverages); + + result.Level.Should().Be(4); + } + + [Fact] + public void DetermineLevel_MultipleCaps_LowestWins() + { + // Both tool_description and param_description are low + // tool_description < 50 caps at 1, param_description < 60 caps at 2 + // The tool_description cap of 1 should win (applied first, most restrictive) + var categoryAverages = new Dictionary + { + ["tool_description"] = 30f, + ["param_description"] = 40f, + ["tool_name"] = 50f, + }; + + var result = MaturityCalculator.DetermineLevel(95f, categoryAverages); + + result.Level.Should().Be(1); + } + + [Fact] + public void DetermineLevel_NullCategoryAverages_HandledGracefully() + { + // Null averages default to empty dict, all averages default to 0 + var result = MaturityCalculator.DetermineLevel(95f, null!); + + // tool_description=0 < 50 caps at Level 1 + result.Level.Should().Be(1); + } + + [Fact] + public void DetermineLevel_EmptyCategoryAverages_DefaultsApply() + { + var result = MaturityCalculator.DetermineLevel(95f, []); + + // tool_description defaults to 0 < 50, caps at Level 1 + result.Level.Should().Be(1); + } + + // ======================================================================= + // Next-level requirements + // ======================================================================= + + [Fact] + public void DetermineLevel_Level4_RequirementsMaintain() + { + var result = MaturityCalculator.DetermineLevel(95f, HighCategoryAverages()); + + result.NextLevelRequirements.Should().ContainSingle() + .Which.Should().Contain("Maintain"); + } + + [Fact] + public void DetermineLevel_Level0_HasDescriptionRequirements() + { + var result = MaturityCalculator.DetermineLevel(30f, HighCategoryAverages()); + + result.NextLevelRequirements.Should().NotBeEmpty(); + result.NextLevelRequirements.Should().Contain(r => r.Contains("description")); + } + + [Fact] + public void DetermineLevel_HasDescription() + { + var result = MaturityCalculator.DetermineLevel(50f, HighCategoryAverages()); + + result.Description.Should().NotBeNullOrWhiteSpace(); + } + + // ======================================================================= + // GetMaturityLadder + // ======================================================================= + + [Fact] + public void GetMaturityLadder_Returns5Entries() + { + var ladder = MaturityCalculator.GetMaturityLadder(2); + + ladder.Should().HaveCount(5); + } + + [Fact] + public void GetMaturityLadder_LevelsAre0Through4() + { + var ladder = MaturityCalculator.GetMaturityLadder(0); + + ladder.Select(e => e.Level).Should().BeEquivalentTo([0, 1, 2, 3, 4]); + } + + [Fact] + public void GetMaturityLadder_CorrectIsCurrentForLevel2() + { + var ladder = MaturityCalculator.GetMaturityLadder(2); + + ladder.Where(e => e.IsCurrent).Should().ContainSingle() + .Which.Level.Should().Be(2); + } + + [Theory] + [InlineData(0)] + [InlineData(1)] + [InlineData(2)] + [InlineData(3)] + [InlineData(4)] + public void GetMaturityLadder_ExactlyOneIsCurrent(int currentLevel) + { + var ladder = MaturityCalculator.GetMaturityLadder(currentLevel); + + ladder.Where(e => e.IsCurrent).Should().ContainSingle(); + ladder.Single(e => e.IsCurrent).Level.Should().Be(currentLevel); + } + + [Fact] + public void GetMaturityLadder_AllEntriesHaveLabels() + { + var ladder = MaturityCalculator.GetMaturityLadder(0); + + ladder.Should().AllSatisfy(e => + { + e.Label.Should().NotBeNullOrWhiteSpace(); + e.Description.Should().NotBeNullOrWhiteSpace(); + }); + } + + [Fact] + public void GetMaturityLadder_ContainsExpectedLabels() + { + var ladder = MaturityCalculator.GetMaturityLadder(0); + var labels = ladder.Select(e => e.Label).ToList(); + + labels.Should().Contain("Functional"); + labels.Should().Contain("Described"); + labels.Should().Contain("Consistent"); + labels.Should().Contain("Optimized for AI"); + labels.Should().Contain("Exemplary"); + } + + // ======================================================================= + // Helpers + // ======================================================================= + + /// + /// Returns category averages that are high enough to avoid any caps. + /// + private static Dictionary HighCategoryAverages() + { + return new Dictionary + { + ["tool_description"] = 100f, + ["param_description"] = 100f, + ["tool_name"] = 100f, + }; + } +} diff --git a/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/ReportGeneratorTests.cs b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/ReportGeneratorTests.cs new file mode 100644 index 00000000..f655c64b --- /dev/null +++ b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/ReportGeneratorTests.cs @@ -0,0 +1,327 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using FluentAssertions; +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; +using Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; +using Microsoft.Extensions.Logging.Abstractions; +using Xunit; + +namespace Microsoft.Agents.A365.DevTools.Cli.Tests.Services.Evaluate; + +/// +/// Tests for the ReportGenerator service which produces JSON and HTML report files. +/// +public class ReportGeneratorTests : IDisposable +{ + private readonly ReportGenerator _generator; + private readonly string _tempDir; + + public ReportGeneratorTests() + { + _generator = new ReportGenerator(NullLogger.Instance); + _tempDir = Path.Combine(Path.GetTempPath(), $"eval_test_{Guid.NewGuid():N}"); + Directory.CreateDirectory(_tempDir); + } + + public void Dispose() + { + if (Directory.Exists(_tempDir)) + { + Directory.Delete(_tempDir, recursive: true); + } + } + + /// + /// Creates a minimal SchemaEvalResult for testing report generation. + /// + private static SchemaEvalResult CreateMinimalResult(string serverName = "test-server") + { + return new SchemaEvalResult + { + ServerName = serverName, + ServerUrl = "http://localhost:3000", + EvaluatedAt = DateTime.UtcNow, + OverallScore = 75.5f, + Maturity = new MaturityLevel + { + Level = 2, + Label = "Consistent", + Description = "Test maturity description", + NextLevelRequirements = ["Requirement 1"], + }, + ToolCount = 1, + ToolResults = + [ + new ToolEvalResult + { + ToolName = "test_tool", + ToolDescription = "A test tool", + ParamCount = 1, + Score = 80f, + CategoryScores = new Dictionary + { + ["tool_name"] = 100f, + ["tool_description"] = 66.7f, + ["schema_structure"] = 100f, + ["param_name"] = 100f, + ["param_description"] = 50f, + }, + Checks = [], + ActionItems = [], + IssuesDetected = [], + }, + ], + ToolsetResult = new ToolsetEvalResult + { + Score = 100f, + Checks = [], + ActionItems = [], + }, + AllActionItems = [], + CategoryAverages = new Dictionary + { + ["tool_name"] = 100f, + ["tool_description"] = 66.7f, + }, + ActionItemsByPriority = new Dictionary + { + ["P0"] = 0, + ["P1"] = 1, + ["P2"] = 0, + ["P3"] = 0, + }, + IssueSummary = [], + EvalEngine = "None", + }; + } + + // ----------------------------------------------------------------------- + // JSON report generation + // ----------------------------------------------------------------------- + + [Fact] + public async Task GenerateAsync_CreatesJsonReportFile() + { + var result = CreateMinimalResult(); + + await _generator.GenerateAsync(result, _tempDir, openInBrowser: false); + + var jsonPath = Path.Combine(_tempDir, "test-server_eval_report.json"); + File.Exists(jsonPath).Should().BeTrue("JSON report file should be created"); + } + + [Fact] + public async Task GenerateAsync_JsonReportContainsValidJson() + { + var result = CreateMinimalResult(); + + await _generator.GenerateAsync(result, _tempDir, openInBrowser: false); + + var jsonPath = Path.Combine(_tempDir, "test-server_eval_report.json"); + var content = await File.ReadAllTextAsync(jsonPath); + content.Should().Contain("\"server_name\""); + content.Should().Contain("\"overall_score\""); + content.Should().Contain("test-server"); + } + + // ----------------------------------------------------------------------- + // HTML report generation + // ----------------------------------------------------------------------- + + [Fact] + public async Task GenerateAsync_CreatesHtmlReportFile() + { + var result = CreateMinimalResult(); + + await _generator.GenerateAsync(result, _tempDir, openInBrowser: false); + + var htmlPath = Path.Combine(_tempDir, "test-server_eval_report.html"); + File.Exists(htmlPath).Should().BeTrue("HTML report file should be created"); + } + + [Fact] + public async Task GenerateAsync_HtmlReportContainsReportData() + { + var result = CreateMinimalResult(); + + await _generator.GenerateAsync(result, _tempDir, openInBrowser: false); + + var htmlPath = Path.Combine(_tempDir, "test-server_eval_report.html"); + var content = await File.ReadAllTextAsync(htmlPath); + + // The template placeholder {{REPORT_DATA}} should have been replaced + // with actual JSON data + content.Should().NotContain("{{REPORT_DATA}}", + "the placeholder should be replaced with actual report data"); + + // The injected data should contain the server name from the result + content.Should().Contain("test-server"); + } + + [Fact] + public async Task GenerateAsync_HtmlReportIsValidHtml() + { + var result = CreateMinimalResult(); + + await _generator.GenerateAsync(result, _tempDir, openInBrowser: false); + + var htmlPath = Path.Combine(_tempDir, "test-server_eval_report.html"); + var content = await File.ReadAllTextAsync(htmlPath); + + content.Should().Contain(" escape safety + // ----------------------------------------------------------------------- + + [Fact] + public void EscapeForInlineScript_EscapesClosingScriptTag() + { + var input = "{\"name\": \"\"}"; + + var result = ReportGenerator.EscapeForInlineScript(input); + + result.Should().NotContain("", + because: "literal in an inline script closes the script block and lets injected HTML execute"); + result.Should().Contain("<\\/script>", + because: "\\/ is a valid JSON escape that JSON.parse treats as a plain /, so the round-tripped string is unchanged"); + } + + [Fact] + public void EscapeForInlineScript_EscapesHtmlCommentStart() + { + var input = "{\"note\": \"\"}"; + + var result = ReportGenerator.EscapeForInlineScript(input); + + result.Should().NotContain("", + because: "--> pairs with \"}"; + + var escaped = ReportGenerator.EscapeForInlineScript(input); + using var parsed = System.Text.Json.JsonDocument.Parse(escaped); + + parsed.RootElement.GetProperty("name").GetString().Should().Be("", + because: "escaping must preserve the original data after JSON.parse; only the on-wire representation changes"); + parsed.RootElement.GetProperty("note").GetString().Should().Be("", + because: "unicode escapes round-trip through JSON.parse to the original characters"); + } + + [Fact] + public void EscapeForInlineScript_EmptyInput_ReturnsEmpty() + { + ReportGenerator.EscapeForInlineScript("").Should().Be(""); + } + + // ----------------------------------------------------------------------- + // Null argument validation + // ----------------------------------------------------------------------- + + [Fact] + public async Task GenerateAsync_NullResult_ThrowsArgumentNullException() + { + var act = () => _generator.GenerateAsync(null!, _tempDir); + + await act.Should().ThrowAsync(); + } + + [Fact] + public async Task GenerateAsync_NullOutputDir_ThrowsArgumentException() + { + var result = CreateMinimalResult(); + + var act = () => _generator.GenerateAsync(result, null!); + + await act.Should().ThrowAsync(); + } + + [Fact] + public async Task GenerateAsync_WhitespaceOutputDir_ThrowsArgumentException() + { + var result = CreateMinimalResult(); + + var act = () => _generator.GenerateAsync(result, " "); + + await act.Should().ThrowAsync(); + } +} diff --git a/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/ScorerTests.cs b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/ScorerTests.cs new file mode 100644 index 00000000..bd3d8a1d --- /dev/null +++ b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/ScorerTests.cs @@ -0,0 +1,337 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using FluentAssertions; +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; +using Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; +using Xunit; + +namespace Microsoft.Agents.A365.DevTools.Cli.Tests.Services.Evaluate; + +public class ScorerTests +{ + // ======================================================================= + // ComputeCategoryScore + // ======================================================================= + + [Fact] + public void ComputeCategoryScore_AllPass_Returns100() + { + var checks = new List + { + new() { Score = true }, + new() { Score = true }, + new() { Score = true }, + }; + + float result = Scorer.ComputeCategoryScore(checks); + + result.Should().Be(100f); + } + + [Fact] + public void ComputeCategoryScore_AllFail_Returns0() + { + var checks = new List + { + new() { Score = false }, + new() { Score = false }, + new() { Score = false }, + }; + + float result = Scorer.ComputeCategoryScore(checks); + + result.Should().Be(0f); + } + + [Fact] + public void ComputeCategoryScore_MixedResults_ReturnsCorrectPercentage() + { + var checks = new List + { + new() { Score = true }, + new() { Score = false }, + new() { Score = true }, + }; + + float result = Scorer.ComputeCategoryScore(checks); + + // 2/3 * 100 = 66.7 + result.Should().BeApproximately(66.7f, 0.1f); + } + + [Fact] + public void ComputeCategoryScore_NullScoresExcluded_CountsOnlyEvaluated() + { + var checks = new List + { + new() { Score = true }, + new() { Score = null }, + new() { Score = false }, + new() { Score = null }, + }; + + float result = Scorer.ComputeCategoryScore(checks); + + // Only 2 evaluated: 1 pass / 2 = 50% + result.Should().Be(50f); + } + + [Fact] + public void ComputeCategoryScore_AllNull_Returns100() + { + var checks = new List + { + new() { Score = null }, + new() { Score = null }, + }; + + float result = Scorer.ComputeCategoryScore(checks); + + result.Should().Be(100f); + } + + [Fact] + public void ComputeCategoryScore_EmptyList_Returns100() + { + float result = Scorer.ComputeCategoryScore([]); + + result.Should().Be(100f); + } + + // ======================================================================= + // ComputeToolScore + // ======================================================================= + + [Fact] + public void ComputeToolScore_AllCategoriesPerfect_Returns100() + { + var categoryScores = new Dictionary + { + ["tool_name"] = 100f, + ["tool_description"] = 100f, + ["param_name"] = 100f, + ["param_description"] = 100f, + ["schema_structure"] = 100f, + }; + + float result = Scorer.ComputeToolScore(categoryScores); + + result.Should().Be(100f); + } + + [Fact] + public void ComputeToolScore_AllCategoriesZero_Returns0() + { + var categoryScores = new Dictionary + { + ["tool_name"] = 0f, + ["tool_description"] = 0f, + ["param_name"] = 0f, + ["param_description"] = 0f, + ["schema_structure"] = 0f, + }; + + float result = Scorer.ComputeToolScore(categoryScores); + + result.Should().Be(0f); + } + + [Fact] + public void ComputeToolScore_VerifyWeights() + { + // Set one category to 100 and all others to 0 to verify individual weights + var categories = new[] { "tool_name", "tool_description", "param_name", "param_description", "schema_structure" }; + var expectedWeights = new Dictionary + { + ["tool_name"] = 0.15f, + ["tool_description"] = 0.35f, + ["param_name"] = 0.10f, + ["param_description"] = 0.25f, + ["schema_structure"] = 0.15f, + }; + + foreach (string category in categories) + { + var scores = categories.ToDictionary(c => c, c => c == category ? 100f : 0f); + float result = Scorer.ComputeToolScore(scores); + + float expectedWeight = expectedWeights[category] * 100f; + result.Should().BeApproximately(expectedWeight, 0.1f, + because: $"category '{category}' should have weight {expectedWeights[category]}"); + } + } + + [Fact] + public void ComputeToolScore_MissingCategories_DefaultTo100() + { + // Only one category present: tool_description=50, rest default to 100 + var categoryScores = new Dictionary + { + ["tool_description"] = 50f, + }; + + float result = Scorer.ComputeToolScore(categoryScores); + + // 100*0.15 + 50*0.35 + 100*0.10 + 100*0.25 + 100*0.15 = 15 + 17.5 + 10 + 25 + 15 = 82.5 + result.Should().BeApproximately(82.5f, 0.1f); + } + + [Fact] + public void CategoryWeights_SumTo1() + { + float sum = Scorer.CategoryWeights.Values.Sum(); + + sum.Should().BeApproximately(1.0f, 0.001f); + } + + // ======================================================================= + // ComputeOverallScore + // ======================================================================= + + [Fact] + public void ComputeOverallScore_VerifyBlend() + { + var toolResults = new List + { + new() { Score = 80f }, + new() { Score = 60f }, + }; + float toolsetScore = 90f; + + float result = Scorer.ComputeOverallScore(toolResults, toolsetScore); + + // meanTool = (80+60)/2 = 70 + // overall = 70 * 0.85 + 90 * 0.15 = 59.5 + 13.5 = 73.0 + result.Should().BeApproximately(73.0f, 0.1f); + } + + [Fact] + public void ComputeOverallScore_SingleTool_CorrectBlend() + { + var toolResults = new List + { + new() { Score = 100f }, + }; + float toolsetScore = 100f; + + float result = Scorer.ComputeOverallScore(toolResults, toolsetScore); + + // 100 * 0.85 + 100 * 0.15 = 100 + result.Should().Be(100f); + } + + [Fact] + public void ComputeOverallScore_EmptyTools_ReturnsToolsetOnly() + { + float toolsetScore = 80f; + + float result = Scorer.ComputeOverallScore([], toolsetScore); + + // 80 * 0.15 = 12.0 + result.Should().BeApproximately(12.0f, 0.1f); + } + + [Fact] + public void ToolWeight_Is085() + { + Scorer.ToolWeight.Should().Be(0.85f); + } + + [Fact] + public void ToolsetWeight_Is015() + { + Scorer.ToolsetWeight.Should().Be(0.15f); + } + + // ======================================================================= + // ComputeCategoryAverages + // ======================================================================= + + [Fact] + public void ComputeCategoryAverages_SingleTool_ReturnsSameScores() + { + var toolResults = new List + { + new() + { + CategoryScores = new Dictionary + { + ["tool_name"] = 80f, + ["tool_description"] = 60f, + }, + }, + }; + + var result = Scorer.ComputeCategoryAverages(toolResults); + + result["tool_name"].Should().Be(80f); + result["tool_description"].Should().Be(60f); + } + + [Fact] + public void ComputeCategoryAverages_MultipleTools_AveragesCorrectly() + { + var toolResults = new List + { + new() + { + CategoryScores = new Dictionary + { + ["tool_name"] = 80f, + ["tool_description"] = 40f, + }, + }, + new() + { + CategoryScores = new Dictionary + { + ["tool_name"] = 60f, + ["tool_description"] = 80f, + }, + }, + }; + + var result = Scorer.ComputeCategoryAverages(toolResults); + + result["tool_name"].Should().Be(70f); // (80+60)/2 + result["tool_description"].Should().Be(60f); // (40+80)/2 + } + + [Fact] + public void ComputeCategoryAverages_EmptyList_ReturnsEmptyDict() + { + var result = Scorer.ComputeCategoryAverages([]); + + result.Should().BeEmpty(); + } + + [Fact] + public void ComputeCategoryAverages_UnevenCategories_AveragesPerCategory() + { + // tool1 has tool_name, tool2 does not + var toolResults = new List + { + new() + { + CategoryScores = new Dictionary + { + ["tool_name"] = 100f, + ["tool_description"] = 80f, + }, + }, + new() + { + CategoryScores = new Dictionary + { + ["tool_description"] = 60f, + }, + }, + }; + + var result = Scorer.ComputeCategoryAverages(toolResults); + + result["tool_name"].Should().Be(100f); // only 1 entry + result["tool_description"].Should().Be(70f); // (80+60)/2 + } +} diff --git a/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/SemanticCheckDefinitionsTests.cs b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/SemanticCheckDefinitionsTests.cs new file mode 100644 index 00000000..f024c638 --- /dev/null +++ b/src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/SemanticCheckDefinitionsTests.cs @@ -0,0 +1,304 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +using FluentAssertions; +using Microsoft.Agents.A365.DevTools.Cli.Models.Evaluate; +using Microsoft.Agents.A365.DevTools.Cli.Services.Evaluate; +using Xunit; + +namespace Microsoft.Agents.A365.DevTools.Cli.Tests.Services.Evaluate; + +public class SemanticCheckDefinitionsTests +{ + // ----------------------------------------------------------------------- + // GetToolLevelChecks + // ----------------------------------------------------------------------- + + [Fact] + public void GetToolLevelChecks_ReturnsExactly10Items() + { + var checks = SemanticCheckDefinitions.GetToolLevelChecks(); + checks.Should().HaveCount(10); + } + + [Fact] + public void GetToolLevelChecks_AllHaveSemanticType() + { + var checks = SemanticCheckDefinitions.GetToolLevelChecks(); + checks.Should().AllSatisfy(c => c.Type.Should().Be(CheckType.Semantic)); + } + + [Fact] + public void GetToolLevelChecks_AllHaveNullScore() + { + var checks = SemanticCheckDefinitions.GetToolLevelChecks(); + checks.Should().AllSatisfy(c => c.Score.Should().BeNull()); + } + + [Fact] + public void GetToolLevelChecks_AllHaveNullReason() + { + var checks = SemanticCheckDefinitions.GetToolLevelChecks(); + checks.Should().AllSatisfy(c => c.Reason.Should().BeNull()); + } + + [Fact] + public void GetToolLevelChecks_AllHaveNonEmptyPrompt() + { + var checks = SemanticCheckDefinitions.GetToolLevelChecks(); + checks.Should().AllSatisfy(c => c.Prompt.Should().NotBeNullOrWhiteSpace()); + } + + [Fact] + public void GetToolLevelChecks_AllHaveNonEmptyId() + { + var checks = SemanticCheckDefinitions.GetToolLevelChecks(); + checks.Should().AllSatisfy(c => c.Id.Should().NotBeNullOrWhiteSpace()); + } + + [Fact] + public void GetToolLevelChecks_AllHaveNonEmptyRemediation() + { + var checks = SemanticCheckDefinitions.GetToolLevelChecks(); + checks.Should().AllSatisfy(c => c.Remediation.Should().NotBeNullOrWhiteSpace()); + } + + [Fact] + public void GetToolLevelChecks_AllHaveNonEmptyIssueIds() + { + var checks = SemanticCheckDefinitions.GetToolLevelChecks(); + checks.Should().AllSatisfy(c => c.IssueIds.Should().NotBeEmpty()); + } + + [Fact] + public void GetToolLevelChecks_AllHaveNonEmptyImpactAreas() + { + var checks = SemanticCheckDefinitions.GetToolLevelChecks(); + checks.Should().AllSatisfy(c => c.ImpactAreas.Should().NotBeEmpty()); + } + + [Fact] + public void GetToolLevelChecks_ContainsExpectedCheckIds() + { + var checks = SemanticCheckDefinitions.GetToolLevelChecks(); + var ids = checks.Select(c => c.Id).ToList(); + + ids.Should().Contain("tn_verb_prefix"); + ids.Should().Contain("tn_not_generic"); + ids.Should().Contain("tn_descriptive"); + ids.Should().Contain("td_has_purpose"); + ids.Should().Contain("td_not_name_echo"); + ids.Should().Contain("td_has_usage_guidelines"); + ids.Should().Contain("td_has_limitations"); + ids.Should().Contain("td_has_return_docs"); + ids.Should().Contain("td_has_examples"); + ids.Should().Contain("td_no_boilerplate"); + } + + [Fact] + public void GetToolLevelChecks_HasExpectedCategories() + { + var checks = SemanticCheckDefinitions.GetToolLevelChecks(); + + var toolNameChecks = checks.Where(c => c.Category == CheckCategory.ToolName).ToList(); + var toolDescChecks = checks.Where(c => c.Category == CheckCategory.ToolDescription).ToList(); + + toolNameChecks.Should().HaveCount(3); + toolDescChecks.Should().HaveCount(7); + } + + [Fact] + public void GetToolLevelChecks_HasExpectedSeverities() + { + var checks = SemanticCheckDefinitions.GetToolLevelChecks(); + var ids = checks.ToDictionary(c => c.Id, c => c.Severity); + + ids["tn_verb_prefix"].Should().Be(Priority.P1); + ids["tn_not_generic"].Should().Be(Priority.P1); + ids["tn_descriptive"].Should().Be(Priority.P2); + ids["td_has_purpose"].Should().Be(Priority.P0); + ids["td_not_name_echo"].Should().Be(Priority.P2); + ids["td_has_usage_guidelines"].Should().Be(Priority.P1); + ids["td_has_limitations"].Should().Be(Priority.P2); + ids["td_has_return_docs"].Should().Be(Priority.P1); + ids["td_has_examples"].Should().Be(Priority.P2); + ids["td_no_boilerplate"].Should().Be(Priority.P1); + } + + [Fact] + public void GetToolLevelChecks_ReturnsNewInstanceEachCall() + { + var checks1 = SemanticCheckDefinitions.GetToolLevelChecks(); + var checks2 = SemanticCheckDefinitions.GetToolLevelChecks(); + + checks1.Should().NotBeSameAs(checks2); + } + + [Fact] + public void GetToolLevelChecks_HasUniqueIds() + { + var checks = SemanticCheckDefinitions.GetToolLevelChecks(); + var ids = checks.Select(c => c.Id).ToList(); + ids.Should().OnlyHaveUniqueItems(); + } + + // ----------------------------------------------------------------------- + // GetParamLevelChecks + // ----------------------------------------------------------------------- + + [Fact] + public void GetParamLevelChecks_ReturnsExactly4Items() + { + var checks = SemanticCheckDefinitions.GetParamLevelChecks("userId"); + checks.Should().HaveCount(4); + } + + [Fact] + public void GetParamLevelChecks_AllHaveSemanticType() + { + var checks = SemanticCheckDefinitions.GetParamLevelChecks("query"); + checks.Should().AllSatisfy(c => c.Type.Should().Be(CheckType.Semantic)); + } + + [Fact] + public void GetParamLevelChecks_AllHaveNullScore() + { + var checks = SemanticCheckDefinitions.GetParamLevelChecks("query"); + checks.Should().AllSatisfy(c => c.Score.Should().BeNull()); + } + + [Fact] + public void GetParamLevelChecks_ContainsExpectedCheckIds() + { + var checks = SemanticCheckDefinitions.GetParamLevelChecks("status"); + var ids = checks.Select(c => c.Id).ToList(); + + ids.Should().Contain("pn_not_generic"); + ids.Should().Contain("pd_not_name_echo"); + ids.Should().Contain("pd_has_constraints"); + ids.Should().Contain("pd_enum_for_categorical"); + } + + [Fact] + public void GetParamLevelChecks_IncludesParamNameInPrompts() + { + const string paramName = "messageId"; + var checks = SemanticCheckDefinitions.GetParamLevelChecks(paramName); + + checks.Should().AllSatisfy(c => + c.Prompt.Should().Contain(paramName, because: "prompts should reference the specific parameter")); + } + + [Fact] + public void GetParamLevelChecks_IncludesParamNameInRemediation() + { + const string paramName = "searchQuery"; + var checks = SemanticCheckDefinitions.GetParamLevelChecks(paramName); + + checks.Should().AllSatisfy(c => + c.Remediation.Should().Contain(paramName, because: "remediation should reference the specific parameter")); + } + + [Fact] + public void GetParamLevelChecks_HasExpectedCategories() + { + var checks = SemanticCheckDefinitions.GetParamLevelChecks("query"); + + var paramNameChecks = checks.Where(c => c.Category == CheckCategory.ParamName).ToList(); + var paramDescChecks = checks.Where(c => c.Category == CheckCategory.ParamDescription).ToList(); + + paramNameChecks.Should().HaveCount(1); + paramDescChecks.Should().HaveCount(3); + } + + [Fact] + public void GetParamLevelChecks_HasUniqueIds() + { + var checks = SemanticCheckDefinitions.GetParamLevelChecks("test"); + var ids = checks.Select(c => c.Id).ToList(); + ids.Should().OnlyHaveUniqueItems(); + } + + [Fact] + public void GetParamLevelChecks_DifferentParamsProduceDifferentPrompts() + { + var checks1 = SemanticCheckDefinitions.GetParamLevelChecks("userId"); + var checks2 = SemanticCheckDefinitions.GetParamLevelChecks("status"); + + // The prompts should differ because they contain the param name + for (int i = 0; i < checks1.Count; i++) + { + checks1[i].Prompt.Should().NotBe(checks2[i].Prompt); + } + } + + // ----------------------------------------------------------------------- + // GetToolsetLevelChecks + // ----------------------------------------------------------------------- + + [Fact] + public void GetToolsetLevelChecks_ReturnsExactly2Items() + { + var checks = SemanticCheckDefinitions.GetToolsetLevelChecks(); + checks.Should().HaveCount(2); + } + + [Fact] + public void GetToolsetLevelChecks_AllHaveSemanticType() + { + var checks = SemanticCheckDefinitions.GetToolsetLevelChecks(); + checks.Should().AllSatisfy(c => c.Type.Should().Be(CheckType.Semantic)); + } + + [Fact] + public void GetToolsetLevelChecks_AllHaveNullScore() + { + var checks = SemanticCheckDefinitions.GetToolsetLevelChecks(); + checks.Should().AllSatisfy(c => c.Score.Should().BeNull()); + } + + [Fact] + public void GetToolsetLevelChecks_ContainsExpectedCheckIds() + { + var checks = SemanticCheckDefinitions.GetToolsetLevelChecks(); + var ids = checks.Select(c => c.Id).ToList(); + + ids.Should().Contain("ts_no_description_overlap"); + ids.Should().Contain("ts_crud_completeness"); + } + + [Fact] + public void GetToolsetLevelChecks_AllInToolsetDesignCategory() + { + var checks = SemanticCheckDefinitions.GetToolsetLevelChecks(); + checks.Should().AllSatisfy(c => + c.Category.Should().Be(CheckCategory.ToolsetDesign)); + } + + [Fact] + public void GetToolsetLevelChecks_HasExpectedSeverities() + { + var checks = SemanticCheckDefinitions.GetToolsetLevelChecks(); + var ids = checks.ToDictionary(c => c.Id, c => c.Severity); + + ids["ts_no_description_overlap"].Should().Be(Priority.P1); + ids["ts_crud_completeness"].Should().Be(Priority.P2); + } + + [Fact] + public void GetToolsetLevelChecks_HasUniqueIds() + { + var checks = SemanticCheckDefinitions.GetToolsetLevelChecks(); + var ids = checks.Select(c => c.Id).ToList(); + ids.Should().OnlyHaveUniqueItems(); + } + + [Fact] + public void GetToolsetLevelChecks_ReturnsNewInstanceEachCall() + { + var checks1 = SemanticCheckDefinitions.GetToolsetLevelChecks(); + var checks2 = SemanticCheckDefinitions.GetToolsetLevelChecks(); + + checks1.Should().NotBeSameAs(checks2); + } +}