diff --git a/labs/deployment-guard/README.md b/labs/deployment-guard/README.md new file mode 100644 index 000000000..a70b85a93 --- /dev/null +++ b/labs/deployment-guard/README.md @@ -0,0 +1,238 @@ +# Deployment Guard Lab + +Shift-left reliability with SRE Agent: catch breaking changes in PRs **before** they reach production. This lab sets up an SRE Agent with an HTTP trigger that receives GitHub PR events, deploys changes to staging, compares health metrics against production, and posts a risk assessment as a PR comment. + +## What You'll Learn + +1. Deploy an SRE Agent with the `law-dynatrace-httptrigger` recipe +2. Wire a GitHub repo to the agent via Logic App webhook bridge +3. Create a PR with a subtle breaking change and watch the agent catch it +4. Understand how deployment guard analysis works end-to-end + +## Architecture + +``` +┌─────────────────┐ PR event ┌──────────────────┐ webhook ┌──────────────┐ +│ GitHub Repo │ ──────────────→ │ GitHub Actions │ ────────────→ │ Logic App │ +│ (contoso-trading)│ │ (PR workflow) │ │ (bridge) │ +└─────────────────┘ └──────────────────┘ └──────┬───────┘ + │ + HTTP trigger + │ + ▼ + ┌──────────────────┐ + │ SRE Agent │ + │ deployment-guard │ + │ subagent │ + └────────┬─────────┘ + │ + ┌───────────────────────────────┼───────────────────────────────┐ + │ │ │ + ▼ ▼ ▼ + Read PR diff from Deploy PR changes to Query Dynatrace + + connected GitHub repo staging environment LAW baselines + │ + ▼ + Run canary traffic + for 2-3 minutes + │ + ▼ + Compare staging vs prod + health metrics + │ + ▼ + Post risk assessment + comment on PR +``` + +## Prerequisites + +- Azure subscription with Contributor access +- Dynatrace environment with MCP gateway access +- Tools: `az`, `azd`, `gh`, `jq` + +## Step 0 — Deploy the Sample App (contoso-trading) + +Fork and deploy [contoso-trading](https://github.com/dm-chelupati/contoso-trading) to two environments — production and staging. 
The app is a microservices trading platform (gateway, order-service, payment-service) running on Azure Container Apps. + +```bash +# Fork the repo +gh repo fork dm-chelupati/contoso-trading --clone + +cd contoso-trading + +# Deploy production +azd env new contoso-prod +azd env set AZURE_LOCATION eastus2 +azd up + +# Deploy staging (same app, separate resource group) +azd env new contoso-staging +azd env set AZURE_LOCATION eastus2 +azd up +``` + +After both environments are running, note: +- **Production RG**: `rg-contoso-prod` (or whatever `azd` created) +- **Staging RG**: `rg-contoso-staging` +- **LAW resource ID**: Find it in the production RG — `az resource list --resource-group rg-contoso-prod --resource-type Microsoft.OperationalInsights/workspaces --query "[0].id" -o tsv` + +## Step 1 — Deploy the SRE Agent + +Use the `law-dynatrace-httptrigger` recipe from the templates: + +```bash +cd sreagent-templates + +./bin/new-agent.sh --recipe law-dynatrace-httptrigger --non-interactive \ + --set agentName=deployment-guard-lab \ + --set resourceGroup=rg-deployment-guard-lab \ + --set location=eastus2 \ + --set lawId=/subscriptions//resourceGroups//providers/Microsoft.OperationalInsights/workspaces/ \ + --set dtTenant= \ + --set dtToken= \ + --set githubRepo=/contoso-trading \ + --set targetRGs=rg-contoso-prod,rg-contoso-staging \ + -o deployment-guard-lab/ + +./bin/deploy.sh deployment-guard-lab/ +``` + +## Step 2 — Get the Webhook URL + +After deployment, the agent has a Logic App webhook bridge. 
Get the trigger URL: + +```bash +# Find the Logic App in the agent's resource group +LOGIC_APP=$(az resource list \ + --resource-group rg-deployment-guard-lab \ + --resource-type Microsoft.Logic/workflows \ + --query "[0].name" -o tsv) + +# Get the callback URL for the HTTP trigger +WEBHOOK_URL=$(az rest --method POST \ + --url "https://management.azure.com/subscriptions/$(az account show --query id -o tsv)/resourceGroups/rg-deployment-guard-lab/providers/Microsoft.Logic/workflows/${LOGIC_APP}/triggers/manual/listCallbackUrl?api-version=2016-06-01" \ + --query "value" -o tsv) + +echo "Webhook URL: $WEBHOOK_URL" +``` + +## Step 3 — Wire GitHub to the Agent + +### Option A: Use the setup script + +```bash +cd labs/deployment-guard +bash scripts/setup-github-workflow.sh \ + --repo /contoso-trading \ + --webhook-url "$WEBHOOK_URL" +``` + +### Option B: Manual setup + +1. Copy the workflow to your contoso-trading fork: + +```bash +cp sreagent-templates/recipes/law-dynatrace-httptrigger/data/sample-github-workflow.yml \ + /path/to/contoso-trading/.github/workflows/sre-agent-pr-guard.yml +cd /path/to/contoso-trading +git add .github/workflows/sre-agent-pr-guard.yml +git commit -m "Add SRE Agent PR deployment guard" +git push +``` + +2. 
Add the webhook URL as a GitHub secret: + +```bash +gh secret set SRE_AGENT_WEBHOOK_URL \ + --repo /contoso-trading \ + --body "$WEBHOOK_URL" +``` + +## Step 4 — Test with a Risky PR + +Now create a PR that introduces a subtle breaking change: + +```bash +cd /path/to/contoso-trading +git checkout main && git pull +git checkout -b config-cleanup + +# Rename a database env var — looks like a cleanup but breaks payment-service (-i.bak works with both GNU and BSD sed) +sed -i.bak 's|DATABASE_URL|DB_CONNECTION_URL|g' payment-service/Program.cs && rm -f payment-service/Program.cs.bak + +git add -A +git commit -m "Standardize database env var naming" +git push origin config-cleanup + +# Create the PR +gh pr create \ + --title "Standardize database env var naming" \ + --body "Renamed DATABASE_URL to DB_CONNECTION_URL for consistency with other services." \ + --base main \ + --head config-cleanup +``` + +### What happens next + +1. GitHub Actions fires the `sre-agent-pr-guard` workflow +2. The workflow sends the PR event to the Logic App webhook URL +3. The Logic App forwards it to the SRE Agent's HTTP trigger +4. 
The `deployment-guard` subagent activates and: + - Reads the PR diff (sees `DATABASE_URL` → `DB_CONNECTION_URL`) + - Captures production baselines from Dynatrace + LAW + - Deploys the PR changes to staging + - Sends canary traffic to staging endpoints + - Detects that payment-service can't connect to the database (env var mismatch) + - Posts a **CRITICAL** risk assessment as a PR comment + +### Expected PR Comment + +The agent should post something like: + +> **🔴 CRITICAL Risk — Do not merge** +> +> | Check | Result | +> |---|---| +> | Static Analysis | `DATABASE_URL` renamed to `DB_CONNECTION_URL` in payment-service — env var mismatch with deployment config | +> | Staging Deploy | ✅ Deployed | +> | Canary Tests | ❌ payment-service returning 500 — database connection failed | +> | Health Comparison | Production: 0 errors, Staging: 100% error rate on /api/payments | +> +> **Root Cause**: The `DATABASE_URL` environment variable is defined in the Container App configuration but the code now reads `DB_CONNECTION_URL`. The payment service cannot connect to the database. +> +> **Recommendation**: Either update the Container App env var to `DB_CONNECTION_URL` or revert the code change. + +## Step 5 — Clean Up + +```bash +# Close the test PR +gh pr close config-cleanup --repo /contoso-trading --delete-branch + +# Delete the agent (optional) +az group delete --name rg-deployment-guard-lab --yes --no-wait +``` + +## Lab Scenarios + +### Scenario 1: Safe change (LOW risk) +Update a log message or comment — agent should report LOW risk. + +### Scenario 2: Performance regression (MEDIUM risk) +Add a `Thread.Sleep(500)` or `await Task.Delay(500)` to a hot path — agent should detect latency increase. + +### Scenario 3: Breaking change (CRITICAL risk) +Rename an env var or remove a health check endpoint — agent should flag it. + +### Scenario 4: Silent data corruption (HIGH risk) +Change a calculation or data mapping — app returns 200 but wrong data. 
Agent compares response payloads against baselines and catches the difference. + +## Troubleshooting + +| Issue | Fix | +|---|---| +| Webhook not firing | Check GitHub Actions logs — is `SRE_AGENT_WEBHOOK_URL` secret set? | +| Agent not responding | Check Logic App run history in Azure portal | +| No PR comment | Verify GitHub repo is connected in SRE Agent portal (Settings → Repos) | +| Staging deploy fails | Check agent has `RunAzCliWriteCommands` tool and Contributor role on staging RG | +| Dynatrace queries empty | Verify Dynatrace MCP connector is connected (Settings → Connectors) | diff --git a/labs/deployment-guard/azure.yaml b/labs/deployment-guard/azure.yaml new file mode 100644 index 000000000..3843f179a --- /dev/null +++ b/labs/deployment-guard/azure.yaml @@ -0,0 +1,17 @@ +# yaml-language-server: $schema=https://raw.githubusercontent.com/Azure/azure-dev/main/schemas/v1.0/azure.yaml.json + +name: deployment-guard-demo +metadata: + template: deployment-guard-demo@1.0.0 + +# This lab uses an existing application repo (contoso-trading) and deploys +# an SRE Agent configured with the law-dynatrace-httptrigger recipe. +# The agent's HTTP trigger receives GitHub PR webhooks via a Logic App bridge +# and runs deployment guard analysis on every PR. +# +# Prerequisites: +# - Fork or clone https://github.com/dm-chelupati/contoso-trading +# - Dynatrace environment with MCP gateway access +# +# The lab does NOT provision the app infrastructure (contoso-trading has its own). +# It provisions only the SRE Agent + webhook bridge. diff --git a/labs/deployment-guard/docs/blog-shift-left-deployment-guard.md b/labs/deployment-guard/docs/blog-shift-left-deployment-guard.md new file mode 100644 index 000000000..b929be995 --- /dev/null +++ b/labs/deployment-guard/docs/blog-shift-left-deployment-guard.md @@ -0,0 +1,168 @@ +# Shift Left with Azure SRE Agent: An Agent That Guards Every PR + +## Azure SRE Agent can do more than investigate production incidents. 
With HTTP triggers and a deployment guard skill, it analyzes pull requests by deploying changes to staging, comparing health metrics against production baselines, and posting risk assessments directly on the PR — before the code is merged. + +## The Gap Between Code Review and Production + +Most teams have two reliability checkpoints: code review (before merge) and monitoring (after deployment). The gap between them is where subtle breaking changes slip through. + +A renamed environment variable, a removed health check endpoint, a changed database schema — these changes pass code review because they look correct in isolation. They pass CI because nobody wrote a test for the specific interaction between the code change and the deployment configuration. They reach production, and the first signal is an alert at 2 AM. + +The challenge is cross-referencing: a human reviewer would need to compare the PR diff against the live infrastructure config, the deployment environment variables, and the production health baselines. In practice, this doesn't happen for routine changes. + +Azure SRE Agent's HTTP trigger capability fills this gap by inserting an automated reliability check into the PR workflow. 
+ +## How It Works + +The deployment guard uses a webhook bridge pattern: + +``` +GitHub PR → GitHub Actions workflow → Logic App webhook bridge → SRE Agent HTTP trigger + ↓ + deployment-guard subagent + ↓ + ┌───────────────────┼───────────────────┐ + ↓ ↓ ↓ + Read PR diff Deploy to staging Query Dynatrace + ↓ + LAW baselines + Canary traffic + ↓ + Compare health + ↓ + Post risk assessment + comment on PR +``` + +Here's the agent configured with the deployment guard skill, HTTP trigger, and connectors: + + +![Agent builder canvas with deployment guard configuration](images/agent-builder-canvas.png) + +The HTTP trigger receives PR events from GitHub via a Logic App webhook bridge and routes them to the deployment-guard subagent in autonomous mode: + + +![HTTP trigger configuration for pr-deployment-guard](images/http-trigger-config.png) + +When a developer opens a PR, GitHub Actions sends the event to the SRE Agent via the Logic App bridge. The agent's deployment guard subagent runs a 9-step analysis: + +1. **Read the PR diff** from the connected GitHub repo — identify what changed (app code, IaC, config, DB schema, dependencies) +2. **Static analysis** — check for breaking patterns: renamed env vars, removed endpoints, changed schemas, missing error handling +3. **Capture production baselines** — query Dynatrace and Log Analytics for current error rates, latency percentiles, and throughput. Send test requests to production endpoints and record response structure +4. **Deploy to staging** — use `az containerapp update` to deploy the PR's changes to the staging environment +5. **Canary traffic** — send synthetic HTTP requests to staging endpoints for 2-3 minutes to exercise affected code paths +6. **Validate responses** — compare staging API responses against production baselines. Catch cases where the app returns 200 OK but serves degraded or incorrect data +7. **Monitor health** — query Dynatrace and LAW for staging metrics over 5 minutes. 
Compare against production +8. **Risk assessment** — classify as LOW, MEDIUM, HIGH, or CRITICAL +9. **Post PR comment** — structured report with risk level, static analysis findings, canary test results, health comparison table, and recommendation + +## Risk Levels + +| Risk | Criteria | Example | +|---|---|---| +| LOW | No functional or performance changes detected | Updated a log message or code comment | +| MEDIUM | Minor changes, no regressions in staging | Added a new optional query parameter | +| HIGH | Behavioral regression detected, staging still functional | Response payload changed, latency increased 2x | +| CRITICAL | Staging failing or data integrity compromised | Database connection failed, endpoints returning 500 | + +## Example: Environment Variable Rename + +A PR titled "Standardize database env var naming" renames `DATABASE_URL` to `DB_CONNECTION_URL` in the payment service. The commit is clean, the description is clear, and the change looks like responsible housekeeping. + +The deployment guard: +- Reads the diff and flags `DATABASE_URL` → `DB_CONNECTION_URL` as a potential env var mismatch +- Deploys to staging — the Container App's environment variables still define `DATABASE_URL` +- Sends canary traffic to the payment-service endpoint +- Gets 500 errors — the service can't find `DB_CONNECTION_URL` and fails to connect to the database +- Posts a CRITICAL risk assessment on the PR: + +| Check | Result | +|---|---| +| Static Analysis | `DATABASE_URL` renamed to `DB_CONNECTION_URL` — env var mismatch with deployment config | +| Staging Deploy | Deployed | +| Canary Tests | payment-service returning 500 — database connection failed | +| Health Comparison | Production: 0 errors / Staging: 100% error rate on /api/payments | + +**Recommendation**: Update the Container App env var to `DB_CONNECTION_URL` or revert the code change. + +The developer sees this before merging. No production incident. 
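
The static-analysis step in this example boils down to a cross-reference: which environment variables does the code read, and which does the deployment config actually define? The sketch below is only an illustration of that idea, not the agent's implementation. The file contents are made up, and it assumes the service reads its settings through `Environment.GetEnvironmentVariable`, which may not match the real `Program.cs`.

```shell
#!/bin/sh
# Sketch of the static cross-check: which env vars does the code read
# that the deployment config never defines? All file contents are made up.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical payment-service code after the PR (assumed to read env vars
# via Environment.GetEnvironmentVariable).
cat > "$workdir/Program.cs" <<'EOF'
var db = Environment.GetEnvironmentVariable("DB_CONNECTION_URL");
var port = Environment.GetEnvironmentVariable("PORT");
EOF

# Hypothetical env var names defined on the staging Container App.
printf '%s\n' DATABASE_URL PORT | sort -u > "$workdir/defined.txt"

# Env var names the code reads, one per line, sorted for comm.
grep -o 'GetEnvironmentVariable("[A-Za-z0-9_]*")' "$workdir/Program.cs" \
  | sed 's/.*("\(.*\)")/\1/' | sort -u > "$workdir/read.txt"

# comm -23 keeps lines that appear only in the first file:
# names read by the code but missing from the config.
missing=$(comm -23 "$workdir/read.txt" "$workdir/defined.txt")
echo "Read by code but not defined in config: $missing"
```

Against a real environment, the list of defined names could come from something like `az containerapp show` with a `--query` on the container's `env[].name` entries instead of a hard-coded file. A static check like this only flags the mismatch; the staging deploy and canary traffic are what confirm the service actually fails.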
+ +## How It Compares to CI Tests + +| Capability | CI Tests | Deployment Guard | +|---|---|---| +| Catches what you wrote tests for | Yes | N/A | +| Catches unanticipated regressions | No | Yes — compares against live production baselines | +| Compares response payloads against production | No | Yes — detects silent data degradation | +| Cross-references code against infrastructure config | No | Yes — reads the diff and checks env vars, endpoints, schemas | +| Requires pre-written test cases | Yes | No — uses real traffic against a real staging deployment | + +The deployment guard complements CI — it doesn't replace it. CI validates correctness against known expectations. The deployment guard validates behavior against the live production environment. + +## Setting It Up + +### Step 1 — Deploy an agent with the `law-dynatrace-httptrigger` recipe + +```bash +cd sreagent-templates + +./bin/new-agent.sh --recipe law-dynatrace-httptrigger --non-interactive \ + --set agentName=my-deployment-guard \ + --set resourceGroup=rg-sre-guard \ + --set location=eastus2 \ + --set lawId=/subscriptions//resourceGroups//providers/Microsoft.OperationalInsights/workspaces/ \ + --set dtTenant= \ + --set dtToken= \ + --set githubRepo=/ \ + --set targetRGs=rg-prod,rg-staging \ + -o my-deployment-guard/ + +./bin/deploy.sh my-deployment-guard/ +``` + +The recipe includes: + +| Component | What it does | +|---|---| +| **deployment-guard-analysis** skill | 9-step PR analysis workflow | +| **deployment-guard** subagent | Autonomous agent with access to az CLI, Dynatrace, LAW, GitHub | +| **pr-deployment-guard** HTTP trigger | Receives webhook events and routes to the subagent | +| **Log Analytics connector** | Azure-side logs and metrics | +| **Dynatrace MCP connector** | Application performance data | +| **Safety hooks** | deny-prod-deletes, require-approval-for-restarts | + +### Step 2 — Copy the GitHub workflow to your app repo + +The recipe generates a sample workflow at 
`data/sample-github-workflow.yml`. Copy it to your app repo: + +```bash +cp my-deployment-guard/data/sample-github-workflow.yml \ + /path/to/your-app/.github/workflows/sre-agent-pr-guard.yml +``` + +### Step 3 — Set the webhook secret + +Get the Logic App trigger URL from the agent's webhook bridge and add it as a GitHub secret: + +```bash +gh secret set SRE_AGENT_WEBHOOK_URL --repo / --body "" +``` + +### Step 4 — Open a PR and watch the agent analyze it + +Every PR on the app repo now triggers the deployment guard. The agent posts its risk assessment as a PR comment within 5-10 minutes (baseline capture + canary testing + analysis). + +## Lab and Recipe + +| Resource | Description | +|---|---| +| [law-dynatrace-httptrigger recipe](https://github.com/microsoft/sre-agent/tree/main/sreagent-templates/recipes/law-dynatrace-httptrigger) | Deploy an agent with LAW + Dynatrace + HTTP trigger + deployment guard pre-configured | +| [deployment-guard lab](https://github.com/microsoft/sre-agent/tree/main/labs/deployment-guard) | End-to-end walkthrough using [contoso-trading](https://github.com/dm-chelupati/contoso-trading) as the target app — includes a demo script that creates a risky PR and shows the agent's response | +| [Inside SRE Agent Live](https://www.youtube.com/@InsideSREAgent) | Live demo recordings | + +## Learn More + +- [HTTP Triggers](https://sre.azure.com/docs/capabilities/http-triggers) — Configuring webhook-based automation +- [Skills](https://sre.azure.com/docs/capabilities/skills) — Creating custom analysis workflows +- [Subagents](https://sre.azure.com/docs/capabilities/subagents) — Dedicated agents with scoped tools and instructions +- [Connectors](https://sre.azure.com/docs/capabilities/connectors) — Connecting Log Analytics, Dynatrace, and other data sources +- [SRE Agent Templates](https://github.com/microsoft/sre-agent) — Recipes, labs, and deployment tooling diff --git a/labs/deployment-guard/scripts/demo.sh b/labs/deployment-guard/scripts/demo.sh 
new file mode 100644 index 000000000..6511e32d1 --- /dev/null +++ b/labs/deployment-guard/scripts/demo.sh @@ -0,0 +1,137 @@ +#!/bin/bash +# ============================================================ +# demo.sh — Run the deployment guard demo end-to-end +# +# This script creates a risky PR on contoso-trading and watches +# the SRE Agent analyze it via the HTTP trigger. +# +# Usage: +# bash demo.sh --repo [--app-dir ] +# +# Prerequisites: +# - SRE Agent deployed with law-dynatrace-httptrigger recipe +# - GitHub workflow + webhook secret configured (setup-github-workflow.sh) +# - contoso-trading cloned locally +# ============================================================ +set -euo pipefail + +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[1;33m' +BLUE='\033[0;34m' +NC='\033[0m' + +REPO="" +APP_DIR="" + +while [[ $# -gt 0 ]]; do + case "$1" in + --repo) REPO="$2"; shift 2 ;; + --app-dir) APP_DIR="$2"; shift 2 ;; + *) echo "Unknown option: $1"; exit 1 ;; + esac +done + +if [[ -z "$REPO" ]]; then + echo "Usage: $0 --repo [--app-dir ]" + exit 1 +fi + +# Default app dir to ~/contoso-trading +APP_DIR="${APP_DIR:-$HOME/contoso-trading}" + +if [[ ! 
-d "$APP_DIR" ]]; then + echo -e "${RED}contoso-trading not found at $APP_DIR${NC}" + echo "Clone it first: gh repo clone $REPO $APP_DIR" + exit 1 +fi + +echo -e "${BLUE}═══════════════════════════════════════════════════════════${NC}" +echo -e "${BLUE} Deployment Guard Demo${NC}" +echo -e "${BLUE}═══════════════════════════════════════════════════════════${NC}" + +# ───────────────────────────────────────────────────────── +# PREP: Clean up any previous demo branches +# ───────────────────────────────────────────────────────── +echo -e "\n${YELLOW}[PREP] Cleaning up previous demo state...${NC}" +cd "$APP_DIR" +git checkout main 2>/dev/null && git pull +git branch -D config-cleanup 2>/dev/null || true +git push origin --delete config-cleanup 2>/dev/null || true + +# Close any existing demo PRs +EXISTING_PR=$(gh pr list --repo "$REPO" --head config-cleanup --json number -q '.[0].number' 2>/dev/null || echo "") +if [[ -n "$EXISTING_PR" ]]; then + gh pr close "$EXISTING_PR" --repo "$REPO" --delete-branch 2>/dev/null || true +fi +echo -e "${GREEN} ✓ Clean state${NC}" + +# ───────────────────────────────────────────────────────── +# ACT 1: Create a risky change +# ───────────────────────────────────────────────────────── +echo -e "\n${YELLOW}[ACT 1] Creating a subtle breaking change...${NC}" +git checkout -b config-cleanup + +# Rename DATABASE_URL to DB_CONNECTION_URL — looks like a cleanup +# but breaks payment-service because the env var is still DATABASE_URL +sed -i '' 's|DATABASE_URL|DB_CONNECTION_URL|g' payment-service/Program.cs 2>/dev/null \ + || sed -i 's|DATABASE_URL|DB_CONNECTION_URL|g' payment-service/Program.cs 2>/dev/null + +git add -A +git commit -m "Standardize database env var naming" +git push origin config-cleanup --force +echo -e "${GREEN} ✓ Pushed config-cleanup branch${NC}" + +# ───────────────────────────────────────────────────────── +# ACT 2: Open the PR — this triggers the webhook +# ───────────────────────────────────────────────────────── 
+echo -e "\n${YELLOW}[ACT 2] Creating PR...${NC}" +PR_URL=$(gh pr create \ + --repo "$REPO" \ + --title "Standardize database env var naming" \ + --body "Renamed DATABASE_URL to DB_CONNECTION_URL for consistency with other services." \ + --base main \ + --head config-cleanup \ + 2>/dev/null || \ + gh pr view config-cleanup --repo "$REPO" --json url -q '.url') + +echo -e "${GREEN} ✓ PR created: $PR_URL${NC}" +echo "" +echo -e "${BLUE}The GitHub Actions workflow is now sending the PR event to the SRE Agent.${NC}" +echo -e "${BLUE}Watch the PR for the agent's risk assessment comment.${NC}" +echo "" +echo -e "${YELLOW}Check progress:${NC}" +echo " GitHub Actions: gh run list --repo $REPO --limit 3" +echo " PR comments: gh pr view config-cleanup --repo $REPO --comments" +echo "" + +# ───────────────────────────────────────────────────────── +# ACT 3: Wait and show the result +# ───────────────────────────────────────────────────────── +echo -e "${YELLOW}[ACT 3] Waiting for agent to analyze the PR...${NC}" +echo " This typically takes 5-10 minutes (baseline capture + canary testing)." +echo "" +echo " To check manually:" +echo " gh pr view config-cleanup --repo $REPO --comments" +echo "" + +# Poll for PR comment (up to 15 minutes) +for i in $(seq 1 30); do + COMMENTS=$(gh pr view config-cleanup --repo "$REPO" --json comments --jq '.comments | length' 2>/dev/null || echo "0") + if [[ "$COMMENTS" -gt 0 ]]; then + echo -e "\n${GREEN} ✓ Agent posted a comment on the PR!${NC}" + echo "" + gh pr view config-cleanup --repo "$REPO" --comments 2>/dev/null | tail -40 + break + fi + echo " Waiting... 
($(( (i - 1) * 30 ))s elapsed, $COMMENTS comments so far)" + sleep 30 +done + +# ───────────────────────────────────────────────────────── +# CLEANUP +# ───────────────────────────────────────────────────────── +echo "" +echo -e "${YELLOW}[CLEANUP] To clean up after the demo:${NC}" +echo " gh pr close config-cleanup --repo $REPO --delete-branch" +echo " cd $APP_DIR && git checkout main" diff --git a/labs/deployment-guard/scripts/prereqs.sh b/labs/deployment-guard/scripts/prereqs.sh new file mode 100644 index 000000000..eb6827350 --- /dev/null +++ b/labs/deployment-guard/scripts/prereqs.sh @@ -0,0 +1,56 @@ +#!/bin/bash +# ============================================================ +# prereqs.sh — Check prerequisites for Deployment Guard Lab +# ============================================================ +set -euo pipefail + +echo "" +echo "=============================================" +echo " Deployment Guard Lab — Prerequisites" +echo "=============================================" +echo "" + +MISSING=0 + +check_tool() { + local name="$1" + local cmd="$2" + if command -v "$cmd" &>/dev/null; then + version=$("$cmd" --version 2>&1 | head -1 || true) + echo " ✅ $name: $version" + else + echo " ❌ $name: NOT FOUND" + MISSING=$((MISSING + 1)) + fi +} + +check_tool "Azure CLI" "az" +check_tool "GitHub CLI" "gh" +check_tool "jq" "jq" + +echo "" + +# Check az login +if az account show &>/dev/null; then + ACCOUNT=$(az account show --query name -o tsv) + echo " ✅ Logged into Azure: $ACCOUNT" +else + echo " ❌ Not logged into Azure (run: az login)" + MISSING=$((MISSING + 1)) +fi + +# Check gh auth +if gh auth status &>/dev/null; then + echo " ✅ Logged into GitHub" +else + echo " ❌ Not logged into GitHub (run: gh auth login)" + MISSING=$((MISSING + 1)) +fi + +echo "" +if [[ $MISSING -eq 0 ]]; then + echo " All prerequisites met! ✅" +else + echo " $MISSING prerequisite(s) missing. Fix them before proceeding." 
+ exit 1 +fi diff --git a/labs/deployment-guard/scripts/setup-github-workflow.sh b/labs/deployment-guard/scripts/setup-github-workflow.sh new file mode 100644 index 000000000..8f8f41703 --- /dev/null +++ b/labs/deployment-guard/scripts/setup-github-workflow.sh @@ -0,0 +1,68 @@ +#!/bin/bash +# ============================================================ +# setup-github-workflow.sh — Wire a GitHub repo to the SRE Agent +# Copies the PR guard workflow and sets the webhook secret. +# +# Usage: +# bash setup-github-workflow.sh \ +# --repo \ +# --webhook-url +# ============================================================ +set -euo pipefail + +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[1;33m' +NC='\033[0m' + +REPO="" +WEBHOOK_URL="" + +while [[ $# -gt 0 ]]; do + case "$1" in + --repo) REPO="$2"; shift 2 ;; + --webhook-url) WEBHOOK_URL="$2"; shift 2 ;; + *) echo "Unknown option: $1"; exit 1 ;; + esac +done + +if [[ -z "$REPO" || -z "$WEBHOOK_URL" ]]; then + echo "Usage: $0 --repo --webhook-url " + exit 1 +fi + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +RECIPE_DIR="$(cd "$SCRIPT_DIR/../../../sreagent-templates/recipes/law-dynatrace-httptrigger" && pwd)" +WORKFLOW_SRC="$RECIPE_DIR/data/sample-github-workflow.yml" + +if [[ ! -f "$WORKFLOW_SRC" ]]; then + echo -e "${RED}Workflow template not found at $WORKFLOW_SRC${NC}" + exit 1 +fi + +# Clone the repo to a temp dir, add workflow, push +TMPDIR=$(mktemp -d) +echo -e "${YELLOW}Cloning $REPO...${NC}" +gh repo clone "$REPO" "$TMPDIR/repo" -- --depth 1 + +mkdir -p "$TMPDIR/repo/.github/workflows" +cp "$WORKFLOW_SRC" "$TMPDIR/repo/.github/workflows/sre-agent-pr-guard.yml" + +cd "$TMPDIR/repo" +git add .github/workflows/sre-agent-pr-guard.yml +if git diff --cached --quiet; then + echo -e "${GREEN}Workflow already exists. 
Skipping.${NC}" +else + git commit -m "Add SRE Agent PR deployment guard workflow" + git push + echo -e "${GREEN}✓ Workflow pushed to $REPO${NC}" +fi + +# Set the webhook secret +echo -e "${YELLOW}Setting SRE_AGENT_WEBHOOK_URL secret...${NC}" +gh secret set SRE_AGENT_WEBHOOK_URL --repo "$REPO" --body "$WEBHOOK_URL" +echo -e "${GREEN}✓ Secret set on $REPO${NC}" + +# Clean up +rm -rf "$TMPDIR" +echo -e "${GREEN}Done! PRs on $REPO will now trigger the SRE Agent deployment guard.${NC}" diff --git a/sreagent-templates/bin/new-agent.sh b/sreagent-templates/bin/new-agent.sh index fd0580253..a960cfb8d 100755 --- a/sreagent-templates/bin/new-agent.sh +++ b/sreagent-templates/bin/new-agent.sh @@ -223,14 +223,14 @@ done < "$VALUES_FILE" mv "$MAPPED_FILE" "$VALUES_FILE" # Replace {{placeholders}} with user values in all JSON and YAML files -for file in $(find "$OUTPUT" -name '*.json' -o -name '*.yaml' -type f); do +for file in $(find "$OUTPUT" -type f \( -name '*.json' -o -name '*.yaml' \)); do content=$(cat "$file") while IFS="=" read -r key val || [[ -n "$key" ]]; do # Handle {{key:bool}} — converts non-empty to true, empty to false content=$(echo "$content" | sed "s|\"{{${key}:bool}}\"|$(if [[ -n "$val" ]]; then echo "true"; else echo "false"; fi)|g") # Handle {{key}} that's a comma-separated list → JSON array if [[ "$key" == "targetRGs" && "$val" == *,* ]]; then - json_array=$(echo "$val" | tr ',' '\n' | sed 's/^ */"/;s/ *$/"/;' | paste -sd, | sed 's/^/[/;s/$/]/') + json_array=$(echo "$val" | tr ',' '\n' | sed 's/^ */"/;s/ *$/"/;' | paste -s -d, - | sed 's/^/[/;s/$/]/') content=$(echo "$content" | sed "s|\"{{${key}}}\"| ${json_array}|g") else content=$(echo "$content" | sed "s|{{${key}}}|${val}|g") diff --git a/sreagent-templates/recipes/law-dynatrace-httptrigger/.gitignore b/sreagent-templates/recipes/law-dynatrace-httptrigger/.gitignore new file mode 100644 index 000000000..ba950ef32 --- /dev/null +++ b/sreagent-templates/recipes/law-dynatrace-httptrigger/.gitignore @@ 
-0,0 +1,3 @@ +connectors.secrets.env +*.secrets.env +data/ diff --git a/sreagent-templates/recipes/law-dynatrace-httptrigger/README.md b/sreagent-templates/recipes/law-dynatrace-httptrigger/README.md new file mode 100644 index 000000000..8c6fbea0e --- /dev/null +++ b/sreagent-templates/recipes/law-dynatrace-httptrigger/README.md @@ -0,0 +1,126 @@ +# law-dynatrace-httptrigger + +SRE Agent with Log Analytics + Dynatrace MCP connectors, GitHub repo integration, and an HTTP trigger that enables **PR deployment guard** — automated PR reviews that deploy to staging, run canary tests, and post risk assessments as PR comments. + +## Use Case + +Shift-left reliability: instead of catching production issues after deployment, the agent reviews every PR by deploying changes to a staging environment, comparing health metrics against production baselines, and flagging regressions before merge. + +## Prerequisites + +- Azure subscription with **production** and **staging** resource groups +- Log Analytics workspace connected to your Container Apps / App Services +- Dynatrace environment with MCP gateway access and API token +- GitHub repo with app source code +- All [CLI tools](../../README.md#prerequisites) installed (`./bin/install-prerequisites.sh --check`) + +## Quick Start + +### Step 1 — Generate agent config + +```bash +./bin/new-agent.sh --recipe law-dynatrace-httptrigger --non-interactive \ + --set agentName=contoso-sre \ + --set resourceGroup=rg-sre-contoso \ + --set location=eastus2 \ + --set lawId=/subscriptions//resourceGroups//providers/Microsoft.OperationalInsights/workspaces/ \ + --set dtTenant=abc12345 \ + --set dtToken=dt0c01.xxx \ + --set githubRepo=contoso/trading-app \ + --set targetRGs=rg-contoso-prod,rg-contoso-staging \ + -o contoso-sre/ +``` + +### Step 2 — Deploy + +| Backend | Command | +|---|---| +| Bicep | `./bin/deploy.sh contoso-sre/` | +| Terraform | `./bin/deploy-tf.sh contoso-sre/` | +| PowerShell | `./bin/ps/Deploy-Agent.ps1 -InputPath 
contoso-sre/` | + +### Step 3 — Set up the Dynatrace secret + +```bash +echo "DYNATRACE_BEARER_TOKEN=dt0c01.your-token-here" > contoso-sre/connectors.secrets.env +``` + +Then rerun `./bin/deploy.sh contoso-sre/` (or the deploy command for your backend) to apply the secret. + +### Step 4 — Wire up GitHub PR workflow + +Copy the sample workflow to your app repo: + +```bash +cp contoso-sre/data/sample-github-workflow.yml \ + /path/to/your-app/.github/workflows/sre-agent-pr-guard.yml +``` + +Add the webhook URL as a GitHub secret: + +```bash +# Get the Logic App trigger URL from the agent's webhook bridge +WEBHOOK_URL=$(az resource show \ + --resource-group rg-sre-contoso \ + --resource-type Microsoft.Logic/workflows \ + --name \ + --query "properties.accessEndpoint" -o tsv) + +gh secret set SRE_AGENT_WEBHOOK_URL --repo contoso/trading-app --body "$WEBHOOK_URL" +``` + +### Step 5 — Test it + +Open a PR on your app repo — the GitHub workflow sends the PR event to the agent, which triggers the deployment guard. The agent will: + +1. Read the PR diff +2. Capture production baseline metrics from Dynatrace + LAW +3. Deploy changes to staging +4. Send synthetic canary traffic +5. Compare staging health against production +6. 
Post a risk assessment comment on the PR + +## Parameters + +| Param | Required | Example | Description | +|---|---|---|---| +| agentName | ✅ | `contoso-sre` | Agent name (lowercase, hyphens) | +| resourceGroup | ✅ | `rg-sre-contoso` | Resource group for the agent | +| location | ✅ | `eastus2` | Azure region | +| targetRGs | ✅ | `rg-contoso-prod,rg-contoso-staging` | Resource groups the agent monitors | +| lawId | ✅ | `/subscriptions/.../workspaces/...` | Log Analytics workspace resource ID | +| dtTenant | ✅ | `abc12345` | Dynatrace tenant ID | +| dtToken | ✅ | `dt0c01.xxx` | Dynatrace API token (stored as secret) | +| githubRepo | ✅ | `contoso/trading-app` | GitHub org/repo | +| modelProvider | | `Anthropic` | AI model provider (Anthropic or Azure OpenAI) | + +## What You Get + +| Category | Items | +|---|---| +| **Connectors** | Log Analytics, Dynatrace MCP | +| **Skills** | deployment-guard-analysis, investigate-app-errors | +| **Subagents** | deployment-guard, error-investigator | +| **HTTP Trigger** | pr-deployment-guard (receives GitHub PR webhooks) | +| **Hooks** | deny-prod-deletes, require-approval-for-restarts | +| **Common Prompts** | investigation-guidelines, safety-rules | +| **GitHub Repo** | Connected for diff analysis and PR comments | + +## Architecture + +``` +GitHub PR → GitHub Actions workflow → Logic App webhook bridge → SRE Agent HTTP trigger + ↓ + deployment-guard subagent + ↓ + ┌───────────────┼───────────────┐ + ↓ ↓ ↓ + Read PR diff Deploy to staging Query Dynatrace + ↓ + LAW baselines + Canary traffic + ↓ + Compare health + ↓ + Post PR comment with + risk assessment +``` diff --git a/sreagent-templates/recipes/law-dynatrace-httptrigger/agent.json b/sreagent-templates/recipes/law-dynatrace-httptrigger/agent.json new file mode 100644 index 000000000..086592135 --- /dev/null +++ b/sreagent-templates/recipes/law-dynatrace-httptrigger/agent.json @@ -0,0 +1,89 @@ +{ + "_scenario": "law-dynatrace-httptrigger", + "_description": "SRE Agent with 
LAW + Dynatrace MCP connectors, GitHub repo, HTTP trigger for PR deployment guard, and a deployment-guard subagent that validates PRs by deploying to staging and comparing health metrics.", + "_prerequisites": [ + "Azure subscription with production and staging resource groups", + "Log Analytics workspace connected to your Container Apps or App Services", + "Dynatrace environment with MCP gateway access and API token", + "GitHub repo with app source code", + "GitHub Actions or equivalent CI/CD to send PR webhooks" + ], + "_prompts": { + "agentName": { + "ask": "Agent name (lowercase, hyphens ok)", + "default": "my-sre-agent" + }, + "resourceGroup": { + "ask": "Resource group for the agent", + "default": "sre-agent-rg" + }, + "location": { + "ask": "Region", + "options": [ + "eastus2", + "swedencentral", + "uksouth", + "australiaeast" + ], + "required": true + }, + "targetRGs": { + "ask": "Resource groups the agent can access (comma-separated). Include your app's prod and staging RGs — the agent needs these to deploy to staging and read container app config. The LAW/AppInsights RG is NOT needed here if you provided the full resource ID above.", + "required": true + }, + "lawId": { + "ask": "Log Analytics workspace resource ID", + "required": true + }, + "dtTenant": { + "ask": "Dynatrace tenant ID (e.g. abc12345)", + "required": true + }, + "dtToken": { + "ask": "Dynatrace API token", + "required": true, + "secret": true + }, + "githubRepo": { + "ask": "GitHub repo (org/repo format, e.g. 
contoso/trading-app)", + "required": true + }, + "existingUamiId": { + "ask": "Existing UAMI resource ID (leave blank to create new)", + "default": "" + }, + "modelProvider": { + "ask": "AI model provider", + "options": [ + "Anthropic", + "Azure OpenAI" + ], + "default": "Anthropic" + }, + "existingAgentAppInsightsId": { + "ask": "Existing App Insights resource ID for agent telemetry (leave blank to create new)", + "default": "" + } + }, + "identity": { + "agentName": "{{agentName}}", + "resourceGroup": "{{resourceGroup}}", + "subscription": "", + "location": "{{location}}", + "targetResourceGroups": "{{targetRGs}}" + }, + "access": { + "accessLevel": "High", + "actionMode": "Review" + }, + "upgradeChannel": "Preview", + "defaultModelProvider": "{{modelProvider}}", + "monthlyAgentUnitLimit": 10000, + "tags": {}, + "toggles": { + "enableWebhookBridge": true, + "webhookBridgeTriggerUrl": "" + }, + "existingUamiId": "{{existingUamiId}}", + "existingAgentAppInsightsId": "{{existingAgentAppInsightsId}}" +} diff --git a/sreagent-templates/recipes/law-dynatrace-httptrigger/automations/http-triggers/pr-deployment-guard.yaml b/sreagent-templates/recipes/law-dynatrace-httptrigger/automations/http-triggers/pr-deployment-guard.yaml new file mode 100644 index 000000000..df617cdfb --- /dev/null +++ b/sreagent-templates/recipes/law-dynatrace-httptrigger/automations/http-triggers/pr-deployment-guard.yaml @@ -0,0 +1,10 @@ +name: pr-deployment-guard +spec: + description: Receives PR webhooks from GitHub and triggers the deployment guard + to analyze changes for production safety. + prompt: A PR webhook has been received from the connected GitHub repo. Use the deployment-guard-analysis + skill to read the PR diff, deploy changes to the staging environment, monitor + health for 5 minutes comparing against production, then post a risk assessment + comment on the PR. 
+ handlingAgent: deployment-guard + agentMode: autonomous diff --git a/sreagent-templates/recipes/law-dynatrace-httptrigger/config/common-prompts/investigation-guidelines.yaml b/sreagent-templates/recipes/law-dynatrace-httptrigger/config/common-prompts/investigation-guidelines.yaml new file mode 100644 index 000000000..d7c1b4b8d --- /dev/null +++ b/sreagent-templates/recipes/law-dynatrace-httptrigger/config/common-prompts/investigation-guidelines.yaml @@ -0,0 +1,15 @@ +metadata: + name: investigation-guidelines +spec: + prompt: '## Investigation guidelines + + + - Always check the last 3 deployments for correlation + + - Include timestamp, affected resource, and severity in all summaries + + - Never take destructive actions without explicit approval + + - Prefer read-only investigation before recommending changes + + - Always provide an impact assessment (users affected, blast radius)' diff --git a/sreagent-templates/recipes/law-dynatrace-httptrigger/config/common-prompts/safety-rules.yaml b/sreagent-templates/recipes/law-dynatrace-httptrigger/config/common-prompts/safety-rules.yaml new file mode 100644 index 000000000..efa6dd631 --- /dev/null +++ b/sreagent-templates/recipes/law-dynatrace-httptrigger/config/common-prompts/safety-rules.yaml @@ -0,0 +1,15 @@ +metadata: + name: safety-rules +spec: + prompt: '## Safety rules + + + - Never delete resources in production without explicit approval + + - Always prefer read-only investigation before taking action + + - Escalate to human if confidence is below 80% + + - Do not modify network security groups or firewall rules + + - Do not access or display secrets, keys, or connection strings' diff --git a/sreagent-templates/recipes/law-dynatrace-httptrigger/config/hooks/deny-prod-deletes.yaml b/sreagent-templates/recipes/law-dynatrace-httptrigger/config/hooks/deny-prod-deletes.yaml new file mode 100644 index 000000000..4545f0aae --- /dev/null +++ 
b/sreagent-templates/recipes/law-dynatrace-httptrigger/config/hooks/deny-prod-deletes.yaml @@ -0,0 +1,11 @@ +metadata: + name: deny-prod-deletes +spec: + eventType: PreToolUse + hook: + type: prompt + prompt: If the tool targets a production resource (name contains 'prod' or 'prd'), + deny the action. Otherwise allow. + matcher: ^(delete_|remove_).* + permissionDecision: deny + enabled: true diff --git a/sreagent-templates/recipes/law-dynatrace-httptrigger/config/hooks/require-approval-for-restarts.yaml b/sreagent-templates/recipes/law-dynatrace-httptrigger/config/hooks/require-approval-for-restarts.yaml new file mode 100644 index 000000000..3eae406c9 --- /dev/null +++ b/sreagent-templates/recipes/law-dynatrace-httptrigger/config/hooks/require-approval-for-restarts.yaml @@ -0,0 +1,11 @@ +metadata: + name: require-approval-for-restarts +spec: + eventType: PreToolUse + hook: + type: prompt + prompt: If this action will restart or scale a resource, require human approval + before proceeding. + matcher: ^(restart_|scale_).* + permissionDecision: allow + enabled: true diff --git a/sreagent-templates/recipes/law-dynatrace-httptrigger/config/repos/github-repo.yaml b/sreagent-templates/recipes/law-dynatrace-httptrigger/config/repos/github-repo.yaml new file mode 100644 index 000000000..b29c262d3 --- /dev/null +++ b/sreagent-templates/recipes/law-dynatrace-httptrigger/config/repos/github-repo.yaml @@ -0,0 +1,5 @@ +name: github-repo +spec: + url: "{{githubRepo}}" + branch: main + description: Connected GitHub repository diff --git a/sreagent-templates/recipes/law-dynatrace-httptrigger/config/skills/deployment-guard-analysis.md b/sreagent-templates/recipes/law-dynatrace-httptrigger/config/skills/deployment-guard-analysis.md new file mode 100644 index 000000000..9fd80e96a --- /dev/null +++ b/sreagent-templates/recipes/law-dynatrace-httptrigger/config/skills/deployment-guard-analysis.md @@ -0,0 +1,21 @@ +You are a deployment guard. 
When triggered by a PR webhook, you assess if the change is safe for production. + +Step 1: Read the PR diff from the connected GitHub repo. Identify what changed — app code, IaC, config, DB schema, dependencies. + +Step 2: Static analysis — check for breaking changes: API contract changes, removed endpoints, changed DB schemas, renamed env vars, missing error handling. + +Step 3: Capture production baseline. Use Dynatrace DQL to query current error rates, latency p50/p95/p99, throughput. Use az CLI to check ContainerAppConsoleLogs_CL in LAW. Also capture baseline API responses by sending test requests to production endpoints and recording the response structure, status codes, and key data fields. + +Step 4: Deploy the PR changes to the STAGING environment using az containerapp update. This is a separate environment from production — deploy the new image there. + +Step 5: Send synthetic test traffic to the staging services to exercise the code paths affected by the PR. Use ExecutePythonCode to send HTTP requests to the staging endpoints (e.g. GET /orders, POST /orders, GET /health) for 2-3 minutes. This is canary testing — you need real traffic to surface regressions like timeouts, 500s, or latency spikes. + +Step 6: Validate response correctness — compare staging API responses against the production baseline captured in Step 3. Look for any differences in response bodies, status codes, data fields, or behavior. The app may return 200 OK but serve degraded or incorrect data. + +Step 7: Monitor staging health for 5 minutes. Query Dynatrace and LAW for the staging services. Compare all metrics and response patterns against the production baseline. Use PlotAreaChartWithCorrelation to visualize. + +Step 8: Risk assessment — LOW (no functional or performance changes), MEDIUM (minor changes), HIGH (behavioral or performance regression detected), CRITICAL (staging failing or data integrity compromised). 
+ +Step 9: Post a structured PR comment with: risk level, changes analyzed, static analysis findings, canary test results, any behavioral regressions found, health comparison table (prod baseline vs staging), and recommendation. + +Tools to use: RunAzCliReadCommands, RunAzCliWriteCommands, ExecutePythonCode, PlotAreaChartWithCorrelation, PlotBarChart, CreateGithubIssue, FindConnectedGitHubRepo, and all dynatrace MCP tools. diff --git a/sreagent-templates/recipes/law-dynatrace-httptrigger/config/skills/deployment-guard-analysis.yaml b/sreagent-templates/recipes/law-dynatrace-httptrigger/config/skills/deployment-guard-analysis.yaml new file mode 100644 index 000000000..1eef7bac0 --- /dev/null +++ b/sreagent-templates/recipes/law-dynatrace-httptrigger/config/skills/deployment-guard-analysis.yaml @@ -0,0 +1,9 @@ +metadata: + name: deployment-guard-analysis + description: Deployment guard that assesses PR safety for production by analyzing + diffs, capturing baselines, deploying to staging, running canary tests, validating + response correctness, and comparing health metrics. + spec: + tools: [] +skillContent: skills/deployment-guard-analysis.md +additionalFiles: [] diff --git a/sreagent-templates/recipes/law-dynatrace-httptrigger/config/skills/investigate-app-errors.md b/sreagent-templates/recipes/law-dynatrace-httptrigger/config/skills/investigate-app-errors.md new file mode 100644 index 000000000..508a81608 --- /dev/null +++ b/sreagent-templates/recipes/law-dynatrace-httptrigger/config/skills/investigate-app-errors.md @@ -0,0 +1,20 @@ +You are an application error investigator. When errors are reported, follow this workflow: + +1. **Identify the error**: Get the error details — HTTP status codes, exception types, affected endpoints, timestamps. + +2. **Check recent deployments**: Use az CLI to list recent Container App revisions or deployments. Correlate error start time with deployment timestamps. + +3. 
**Query Dynatrace**: Use DQL to query error rates, response times, and throughput for the affected services. Look for anomalies that started around the same time. + +4. **Query Log Analytics**: Check ContainerAppConsoleLogs_CL and ContainerAppSystemLogs_CL for exceptions, crash loops, or OOM kills. + +5. **Check dependencies**: Query Dynatrace for dependency health — databases, external APIs, message queues. An upstream failure may be the root cause. + +6. **Correlate findings**: Build a timeline of events — deployment, config change, traffic spike, dependency failure — and identify the most likely root cause. + +7. **Recommend fix**: Provide actionable recommendations — rollback, config change, scaling, or code fix with the specific file/line if the GitHub repo is connected. + +Always include: +- Impact assessment (users affected, error rate, duration) +- Root cause confidence level +- Recommended action with rollback option diff --git a/sreagent-templates/recipes/law-dynatrace-httptrigger/config/skills/investigate-app-errors.yaml b/sreagent-templates/recipes/law-dynatrace-httptrigger/config/skills/investigate-app-errors.yaml new file mode 100644 index 000000000..669dddcb8 --- /dev/null +++ b/sreagent-templates/recipes/law-dynatrace-httptrigger/config/skills/investigate-app-errors.yaml @@ -0,0 +1,16 @@ +metadata: + name: investigate-app-errors + description: Investigate application errors using Dynatrace DQL and Log Analytics + to correlate errors with deployments, infrastructure changes, and dependencies. 
+ spec: + tools: + - RunAzCliReadCommands + - QueryLogAnalyticsByWorkspaceId + - dynatrace_execute-dql + - dynatrace_create-dql + - dynatrace_query-problems + - dynatrace_get-entity-id + - ExecutePythonCode + - PlotAreaChartWithCorrelation +skillContent: skills/investigate-app-errors.md +additionalFiles: [] diff --git a/sreagent-templates/recipes/law-dynatrace-httptrigger/config/subagents/deployment-guard.instructions.md b/sreagent-templates/recipes/law-dynatrace-httptrigger/config/subagents/deployment-guard.instructions.md new file mode 100644 index 000000000..28019290a --- /dev/null +++ b/sreagent-templates/recipes/law-dynatrace-httptrigger/config/subagents/deployment-guard.instructions.md @@ -0,0 +1 @@ +You are the best engineer who guards production deployments operating in autonomous mode. Use the deployment-guard-analysis skill to assess PRs for production safety. Follow the full 9-step workflow: analyze the PR diff, perform static analysis, capture production baselines from Dynatrace and LAW, deploy to staging, send synthetic canary traffic, validate response correctness, monitor staging health, assess risk, and post a structured PR comment with your findings. 
diff --git a/sreagent-templates/recipes/law-dynatrace-httptrigger/config/subagents/deployment-guard.yaml b/sreagent-templates/recipes/law-dynatrace-httptrigger/config/subagents/deployment-guard.yaml new file mode 100644 index 000000000..99ead646b --- /dev/null +++ b/sreagent-templates/recipes/law-dynatrace-httptrigger/config/subagents/deployment-guard.yaml @@ -0,0 +1,31 @@ +metadata: + name: deployment-guard +spec: + instructions: subagents/deployment-guard.instructions.md + handoffDescription: Analyzes PRs by deploying to staging, comparing health against + production via Dynatrace + LAW, and posting risk assessment as a PR comment + handoffs: [] + tools: + - RunAzCliReadCommands + - RunAzCliWriteCommands + - ExecutePythonCode + - PlotAreaChartWithCorrelation + - PlotBarChart + - CreateGithubIssue + - FindConnectedGitHubRepo + - QueryLogAnalyticsByWorkspaceId + - dynatrace_adaptive-anomaly-detector + - dynatrace_create-dql + - dynatrace_execute-dql + - dynatrace_explain-dql + - dynatrace_get-entity-id + - dynatrace_get-entity-name + - dynatrace_query-problems + - dynatrace_seasonal-baseline-anomaly-detector + - dynatrace_static-threshold-analyzer + - dynatrace_timeseries-forecast + - dynatrace_timeseries-novelty-detection + maxReflectionCount: 0 + customReflectionNote: '' + commonPrompts: [] + enableVanillaMode: false diff --git a/sreagent-templates/recipes/law-dynatrace-httptrigger/config/subagents/error-investigator.instructions.md b/sreagent-templates/recipes/law-dynatrace-httptrigger/config/subagents/error-investigator.instructions.md new file mode 100644 index 000000000..41b3c7a46 --- /dev/null +++ b/sreagent-templates/recipes/law-dynatrace-httptrigger/config/subagents/error-investigator.instructions.md @@ -0,0 +1 @@ +You are an application error investigator. When errors are reported, use the investigate-app-errors skill to systematically diagnose the issue. Correlate Dynatrace metrics with Log Analytics data and deployment history. 
Always provide impact assessment and actionable recommendations with rollback options. diff --git a/sreagent-templates/recipes/law-dynatrace-httptrigger/config/subagents/error-investigator.yaml b/sreagent-templates/recipes/law-dynatrace-httptrigger/config/subagents/error-investigator.yaml new file mode 100644 index 000000000..61fb2f3a4 --- /dev/null +++ b/sreagent-templates/recipes/law-dynatrace-httptrigger/config/subagents/error-investigator.yaml @@ -0,0 +1,23 @@ +metadata: + name: error-investigator +spec: + instructions: subagents/error-investigator.instructions.md + handoffDescription: Investigates application errors by correlating Dynatrace metrics, + LAW logs, and deployment history to identify root cause + handoffs: [] + tools: + - RunAzCliReadCommands + - QueryLogAnalyticsByWorkspaceId + - ExecutePythonCode + - PlotAreaChartWithCorrelation + - PlotBarChart + - FindConnectedGitHubRepo + - dynatrace_execute-dql + - dynatrace_create-dql + - dynatrace_query-problems + - dynatrace_get-entity-id + - dynatrace_get-entity-name + maxReflectionCount: 0 + customReflectionNote: '' + commonPrompts: [] + enableVanillaMode: false diff --git a/sreagent-templates/recipes/law-dynatrace-httptrigger/connectors.json b/sreagent-templates/recipes/law-dynatrace-httptrigger/connectors.json new file mode 100644 index 000000000..0818d8bed --- /dev/null +++ b/sreagent-templates/recipes/law-dynatrace-httptrigger/connectors.json @@ -0,0 +1,30 @@ +{ + "toggles": { + "enableAppInsightsConnector": false, + "appInsightsResourceId": "", + "appInsightsAppId": "", + "enableLogAnalyticsConnector": "{{lawId:bool}}", + "lawResourceId": "{{lawId}}", + "enableAzureMonitorConnector": false, + "azureMonitorLookbackDays": 7, + "grafanaUrl": "", + "grafanaApiKey": "" + }, + "connectors": [ + { + "name": "dynatrace", + "properties": { + "dataConnectorType": "Mcp", + "dataSource": "placeholder", + "extendedProperties": { + "type": "http", + "endpoint": 
"https://{{dtTenant}}.apps.dynatrace.com/platform-reserved/mcp-gateway/v0.1/servers/dynatrace-mcp/mcp", + "authType": "BearerToken", + "bearerToken": "${DYNATRACE_BEARER_TOKEN}", + "partnerType": "DynatraceMcp" + }, + "identity": "system" + } + } + ] +} diff --git a/sreagent-templates/recipes/law-dynatrace-httptrigger/docs/sample-github-workflow.yml b/sreagent-templates/recipes/law-dynatrace-httptrigger/docs/sample-github-workflow.yml new file mode 100644 index 000000000..82883626a --- /dev/null +++ b/sreagent-templates/recipes/law-dynatrace-httptrigger/docs/sample-github-workflow.yml @@ -0,0 +1,40 @@ +# Sample GitHub Actions workflow for your application repo. +# This sends PR events to the SRE Agent via the Logic App webhook bridge, +# which triggers the deployment-guard-analysis skill. +# +# Setup: +# 1. Copy this file to your app repo: .github/workflows/sre-agent-pr-guard.yml +# 2. Add a repo secret SRE_AGENT_WEBHOOK_URL with the Logic App trigger URL +# (find it in the Azure portal under the Logic App's trigger settings, +# or run: az resource show ... 
 to get the callback URL) + +name: SRE Agent — PR Deployment Guard + +on: + pull_request: + types: [opened, synchronize, reopened] + +jobs: + notify-sre-agent: + runs-on: ubuntu-latest + steps: + - name: Trigger SRE Agent via webhook bridge + env: + WEBHOOK_URL: ${{ secrets.SRE_AGENT_WEBHOOK_URL }} + run: | + # Build the payload from the event file with jq so PR-controlled text (title, branch name) is never expanded by the shell + jq '{ + event: "pull_request", + action: .action, + pr_number: .pull_request.number, + pr_title: .pull_request.title, + pr_url: .pull_request.html_url, + pr_diff_url: .pull_request.diff_url, + pr_author: .pull_request.user.login, + repo: .repository.full_name, + head_ref: .pull_request.head.ref, + base_ref: .pull_request.base.ref, + head_sha: .pull_request.head.sha + }' "$GITHUB_EVENT_PATH" | curl -sS --fail -X POST "$WEBHOOK_URL" \ + -H "Content-Type: application/json" --data-binary @- + echo "Webhook sent to SRE Agent" diff --git a/sreagent-templates/recipes/law-dynatrace-httptrigger/expected-config.json b/sreagent-templates/recipes/law-dynatrace-httptrigger/expected-config.json new file mode 100644 index 000000000..f1a36a4a0 --- /dev/null +++ b/sreagent-templates/recipes/law-dynatrace-httptrigger/expected-config.json @@ -0,0 +1,47 @@ +{ + "_description": "Expected configuration for law-dynatrace-httptrigger recipe. 
Used by verify-agent.sh to validate deployments.", + "_scenario": "law-dynatrace-httptrigger", + + "agent": { + "accessLevel": "High", + "actionMode": "Review", + "upgradeChannel": "Preview", + "defaultModelProvider": "Anthropic", + "incidentPlatform": "None" + }, + + "connectors": [ + { "name": "log-analytics", "type": "LogAnalytics" }, + { "name": "dynatrace", "type": "Mcp" } + ], + + "skills": [ + "deployment-guard-analysis", + "investigate-app-errors" + ], + + "subagents": [ + "deployment-guard", + "error-investigator" + ], + + "hooks": [ + "deny-prod-deletes", + "require-approval-for-restarts" + ], + + "commonPrompts": [ + "investigation-guidelines", + "safety-rules" + ], + + "scheduledTasks": [], + + "responsePlans": [], + + "httpTriggers": [ + { "name": "pr-deployment-guard", "handlingAgent": "deployment-guard" } + ], + + "repos": [] +} diff --git a/sreagent-templates/recipes/law-dynatrace-httptrigger/roles.yaml b/sreagent-templates/recipes/law-dynatrace-httptrigger/roles.yaml new file mode 100644 index 000000000..9ec1aa266 --- /dev/null +++ b/sreagent-templates/recipes/law-dynatrace-httptrigger/roles.yaml @@ -0,0 +1,19 @@ +# Required roles/credentials for the law-dynatrace-httptrigger recipe. +# deploy.sh processes this after the UAMI is created. + +roles: + # GitHub repos — prints OAuth URL or uses GITHUB_PAT env var + - name: GitHub OAuth + type: manual + instructions: | + To connect GitHub repos, either: + 1. Set GITHUB_PAT env var before deploy: export GITHUB_PAT=ghp_xxx + 2. 
Or after deploy, open the OAuth URL printed by apply-extras.sh + + # Dynatrace MCP — requires bearer token in connectors.secrets.env + - name: Dynatrace MCP + type: manual + instructions: | + Create a Dynatrace API token with scopes: entities.read, events.read, metrics.read, problems.read + Save it in connectors.secrets.env: + DYNATRACE_BEARER_TOKEN=dt0c01.xxx diff --git a/sreagent-templates/tests/test-dry-run-law-dt-httptrigger.sh b/sreagent-templates/tests/test-dry-run-law-dt-httptrigger.sh new file mode 100644 index 000000000..4a381a375 --- /dev/null +++ b/sreagent-templates/tests/test-dry-run-law-dt-httptrigger.sh @@ -0,0 +1,29 @@ +#!/usr/bin/env bash +# tests/test-dry-run-law-dt-httptrigger.sh — law-dynatrace-httptrigger: 4 backends × dry-run +set -uo pipefail +cd "$(dirname "$0")/.." +REPORT="/tmp/test-dry-run-law-dt-httptrigger.txt"; > "$REPORT" +source tests/lib/test-helpers.sh + +RECIPE="law-dynatrace-httptrigger" +EXTRA_SETS="lawId=/sub/fake;dtTenant=fake;dtToken=fake;githubRepo=https://github.com/fake/repo" +EXP_SKILLS=2 EXP_SA=2 EXP_HOOKS=2 EXP_PROMPTS=2 EXP_SCHED=0 EXP_FILTERS=0 EXP_PLAT=0 EXP_HT=1 +OUT="/tmp/dryrun-${RECIPE}" + +log "═══ $RECIPE ═══" +log "── bash new-agent ──" +rm -rf "$OUT" +SET_ARGS="--set agentName=dry-${RECIPE} --set resourceGroup=rg-dry --set location=swedencentral --set targetRGs=rg-fake-prod,rg-fake-staging" +IFS_OLD="$IFS"; IFS=';'; for s in $EXTRA_SETS; do [[ -n "$s" ]] && SET_ARGS="$SET_ARGS --set $s"; done; IFS="$IFS_OLD" +eval "./bin/new-agent.sh --recipe $RECIPE --non-interactive $SET_ARGS -o $OUT" > /tmp/dryrun-new.log 2>&1 +if [[ -f "$OUT/agent.json" ]]; then pass "new-agent"; else fail "new-agent"; print_summary "$RECIPE"; exit 1; fi + +validate_config_dir "$OUT" $EXP_SKILLS $EXP_SA $EXP_HOOKS $EXP_PROMPTS $EXP_SCHED $EXP_FILTERS $EXP_PLAT $EXP_HT +validate_assembled_content "$OUT" +validate_bicep_dryrun "$OUT" +validate_tf_dryrun "$OUT" $EXP_SKILLS $EXP_SA $EXP_PROMPTS +validate_ps_newagent "$RECIPE" "$EXTRA_SETS" 
+validate_azd_dryrun "$OUT" + +print_summary "$RECIPE" +exit $?
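
The dry-run test above expands the `;`-separated `EXTRA_SETS` string into repeated `--set` flags by concatenating a string and passing it through `eval`. The same expansion can be sketched with a bash array, which avoids `eval` and keeps each `key=value` pair as a single argument. This is a minimal sketch, not part of the PR: `build_set_args` and the global `SET_ARGS` array are hypothetical names and do not exist in `tests/lib/test-helpers.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): expand "k1=v1;k2=v2" into
# an array of --set flags. Populates the global array SET_ARGS.
build_set_args() {
  local extra_sets="$1" pair
  local -a pairs=()
  SET_ARGS=()
  # Split on ';' for this one read only, leaving the global IFS untouched.
  IFS=';' read -r -a pairs <<< "$extra_sets"
  for pair in "${pairs[@]}"; do
    # Skip empty segments such as those produced by ";;" or a trailing ";".
    [[ -n "$pair" ]] && SET_ARGS+=(--set "$pair")
  done
  return 0
}

# Same sample values the dry-run test uses.
build_set_args "lawId=/sub/fake;dtTenant=fake;dtToken=fake;githubRepo=https://github.com/fake/repo"
printf '%s\n' "${SET_ARGS[@]}"
```

With the array in hand, the call could be written as `./bin/new-agent.sh --recipe "$RECIPE" --non-interactive "${SET_ARGS[@]}" -o "$OUT"`, so no value is re-parsed by the shell.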