[prompt-clustering] Copilot Agent Prompt Clustering Analysis — 2026-05-30 #35903

2026-05-30T10:40:51Z

github-actions[bot]
Bot May 30, 2026

Summary

NLP clustering analysis of 1,000 Copilot agent PRs in github/gh-aw from the last 30 days (2026-05-06 → 2026-05-24). Prompts were extracted from PR bodies, cleaned (markdown/code/firewall-warning stripped), TF-IDF vectorized (uni+bigram), and grouped with K-means. Cluster count k=8 was selected by silhouette score.

Total PRs analyzed: 1,000 (997 with usable prompt text)
Clusters identified: 8 (silhouette 0.036 — weak-but-interpretable separation, expected for short task text)
Overall merge-success rate: 81.0%
Spread: highest-success theme 89.3% (prompt/agent tuning) → lowest 65.4% (AWF/version regeneration)
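
For readers unfamiliar with the silhouette figure quoted above, the following is a minimal hand-rolled sketch of how a mean silhouette coefficient is computed. It is not the code used for this report. It assumes Euclidean distance and four toy 2-D points; the report's score was computed on TF-IDF vectors, and its distance metric is not stated.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance assumed)."""

    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # common convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data (illustrative only): two tight pairs, ten units apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
tight = silhouette_score(points, [0, 0, 1, 1])  # pairs grouped correctly, about 0.90
mixed = silhouette_score(points, [0, 1, 0, 1])  # pairs split across clusters, about -0.45
print(round(tight, 2), round(mixed, 2))
```

A score near 0, such as the 0.036 reported here, sits between these two extremes: on average a prompt is about as close to a neighbouring cluster as to its own.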

Key Findings

Mechanical regeneration is the riskiest work. The two lowest-success clusters — AWF/version/golden-file bumps (65.4%) and workflow recompile/lock-file imports (72.2%) — are dominated by large, generated diffs (avg 106 and 31 files changed; lock-file PRs reach +21,746 lines). These are closed-without-merge far more often, usually because a newer regeneration supersedes them.
Prompt/agent self-tuning is the safest work. The prompt/experiment/report cluster (gh-aw tuning its own workflows) merges at 89.3% with small diffs (avg +225/−108 lines) and few iterations (3.1 commits per PR): well-scoped, low-blast-radius changes.
PR-review bot work iterates the most. The sous-chef / PR-review cluster averages 6.9 commits and 2.2 reviews per PR, the highest of any cluster (the others average 3.1–4.7 commits and 0.9–1.8 reviews), reflecting back-and-forth refinement rather than one-shot delivery.
[WIP] Fix failing GitHub Actions job PRs are a distinct low-value cluster. 27 near-duplicate auto-generated attempts (74% merge, 0.9 reviews) — many are throwaway retries against the same failing lint/test/build jobs.

Methodology & Limitations

Source: /tmp/gh-aw/agent/prompt-cache/pr-full-data/pr-*.json (1,000 indexed PRs with full body/commits/reviews).
Prompt cleaning: removed fenced/inline code, URLs, HTML tags, markdown punctuation, and the [!WARNING] Firewall rules blocked... boilerplate that appears in many bodies (would otherwise dominate TF-IDF). 3 PRs dropped for <30 chars of usable text.
Vectorizer: TfidfVectorizer(max_features=600, ngram_range=(1,2), min_df=3, max_df=0.6, stop_words='english').
k selection: silhouette was scanned over k=3..8 and rose monotonically (0.026→0.036), so k=8, the upper bound of the scan, was chosen as the most differentiated split that is still interpretable.
Limitation — no turn-count enrichment: per-run aw_info.json workflow metrics (turns/duration/cost) were not joined — these PRs originate from many different workflows and don't map 1:1 to retrievable run logs in this context. Commits, reviews, files-changed, and diff size are used as complexity proxies instead.
Low silhouette caveat: short, templated task text clusters weakly; treat cluster boundaries as thematic tendencies, not hard partitions. Cluster 5 (37%) is a broad "general feature/test/fix" catch-all.
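
The prompt-cleaning step listed above can be sketched roughly as follows. This is an illustration, not the report's implementation: the regular expressions, the function name `clean_prompt`, and the assumption that the firewall notice is a `> [!WARNING]` blockquote are all hypothetical, and the sample body is made up.

```python
import re

FENCED_CODE = re.compile(r"```.*?```", re.DOTALL)
INLINE_CODE = re.compile(r"`[^`\n]*`")
URL = re.compile(r"https?://\S+")
HTML_TAG = re.compile(r"<[^>]+>")
# Assumption: the firewall notice is a blockquote whose first line carries
# [!WARNING]; this pattern drops that line and the quoted lines that follow.
WARNING_BLOCK = re.compile(r"^>\s*\[!WARNING\].*(?:\n>.*)*", re.MULTILINE)
MARKDOWN_PUNCT = re.compile(r"[#*>\[\]()|~]+")
MIN_CHARS = 30  # the report drops prompts with fewer than 30 usable characters


def clean_prompt(body):
    """Return cleaned prompt text, or None if too little text survives."""
    text = WARNING_BLOCK.sub(" ", body)
    text = FENCED_CODE.sub(" ", text)  # fenced blocks first, then inline spans
    text = INLINE_CODE.sub(" ", text)
    text = URL.sub(" ", text)
    text = HTML_TAG.sub(" ", text)
    text = MARKDOWN_PUNCT.sub(" ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text if len(text) >= MIN_CHARS else None


# Made-up PR body for illustration.
sample = (
    "## Fix lock-file drift\n\n"
    "> [!WARNING]\n"
    "> Firewall rules blocked ... (sample text)\n\n"
    "Recompile `gh aw compile` output, see https://example.com/x "
    "<details>info</details>\n"
    "```yaml\non: push\n```\n"
)
print(clean_prompt(sample))     # Fix lock-file drift Recompile output, see info
print(clean_prompt("Fix `x`"))  # None: fewer than 30 characters remain
```

Stripping the warning block before vectorizing matters because text repeated across many bodies would otherwise produce high-frequency terms unrelated to the task itself.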

Cluster Analysis

All 8 clusters (sorted by size)

Cluster 5 — General feature + test changes (catch-all)

Size: 368 (36.9%) · Success: 84.0% · avg 19.6 files, +267/−147, 4.3 commits
Keywords: added, test, updated, behavior, changes, error, safe, new
Broad bucket of well-scoped feature/fix/test PRs (e.g. "Add emoji frontmatter field", context-propagation fixes).

Cluster 4 — Workflow recompile / lock-file / shared imports

Size: 151 (15.1%) · Success: 72.2% · avg 31.1 files, +614/−590, 4.7 commits
Keywords: workflow, workflows, shared, import, source, compiled, files, lock
Contains the largest single diffs in the sample, from .lock.yml regeneration (e.g. "Recompile workflows..." +21,746). Frequently superseded, hence the below-average merge rate.

Cluster 2 — Bug fixes

Size: 113 (11.3%) · Success: 85.0% · avg 26.7 files, +253/−152, 4.2 commits
Keywords: bug, fix, bug fix, testing, added
Targeted defect fixes with validation/fuzz coverage.

Cluster 1 — Prompt / agent / experiment tuning ⭐ highest success

Size: 103 (10.3%) · Success: 89.3% · avg 20.5 files, +225/−108, 3.1 commits
Keywords: prompt, agent, workflow, experiment, report, step, issue
gh-aw tuning its own agentic workflows (A/B experiments, report formatting). Small, low-risk changes with few commits per PR (3.1).

Cluster 0 — PR-review / sous-chef bots 🔁 most iterative

Size: 91 (9.1%) · Success: 86.8% · avg 30.7 files, +447/−107, 6.9 commits, 2.2 reviews
Keywords: pr, branch, run, review, comment, sous chef, chef
PR-quality/review automation; heaviest review back-and-forth.

Cluster 7 — AWF / firewall / version & golden-file bumps ⚠️ lowest success

Size: 81 (8.1%) · Success: 65.4% · avg 106.2 files, +1217/−1009, 3.6 commits
Keywords: awf, version, golden, copilot, engine, default, generated
Mechanical dependency/firewall version bumps + golden-test regeneration. Huge diffs, often duplicated/superseded.

Cluster 6 — Observability / OTLP spans & model config

Size: 63 (6.3%) · Success: 81.0% · avg 41.1 files, +280/−55, 3.8 commits
Keywords: span, alias, model, spans, conclusion, multiplier, setup, aliases
OTEL span attributes, model aliases, cost multipliers.

Cluster 3 — `[WIP] Fix failing GitHub Actions job`

Size: 27 (2.7%) · Success: 74.1% · avg 49.7 files, +255/−58, 3.1 commits, 0.9 reviews
Keywords: actions, job, fix, implement, form plan
Near-duplicate auto-generated CI-repair attempts; many closed unmerged.

Success Rate by Cluster

| Cluster | Theme | PRs | Success | Avg files | Avg commits | Avg reviews |
|---|---|---|---|---|---|---|
| 1 | Prompt/agent tuning | 103 | 89.3% | 20.5 | 3.1 | 1.5 |
| 0 | PR-review bots | 91 | 86.8% | 30.7 | 6.9 | 2.2 |
| 2 | Bug fixes | 113 | 85.0% | 26.7 | 4.2 | 1.3 |
| 5 | General feature/test | 368 | 84.0% | 19.6 | 4.3 | 1.7 |
| 6 | Observability/OTLP | 63 | 81.0% | 41.1 | 3.8 | 1.8 |
| 3 | [WIP] CI-job fixes | 27 | 74.1% | 49.7 | 3.1 | 0.9 |
| 4 | Workflow recompile/lock | 151 | 72.2% | 31.1 | 4.7 | 1.6 |
| 7 | AWF/version/golden bumps | 81 | 65.4% | 106.2 | 3.6 | 1.5 |
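
As a consistency check on the table above, the per-cluster rates can be combined into a size-weighted overall rate. The sketch below only re-does arithmetic on the published figures. The weighted result lands near, not exactly on, the reported 81.0%; one possible explanation (an assumption, not stated in the report) is that the headline figure counts all 1,000 PRs, including the 3 that were never clustered.

```python
# (cluster id, PR count, merge-success %) copied from the table above.
CLUSTERS = [
    (1, 103, 89.3), (0, 91, 86.8), (2, 113, 85.0), (5, 368, 84.0),
    (6, 63, 81.0), (3, 27, 74.1), (4, 151, 72.2), (7, 81, 65.4),
]


def weighted_success(rows):
    """Return (total PRs, size-weighted success %) for the given rows."""
    total = sum(count for _, count, _ in rows)
    merged = sum(count * pct / 100 for _, count, pct in rows)
    return total, 100 * merged / total


total, overall = weighted_success(CLUSTERS)
print(total, round(overall, 1))  # 997 PRs, about 81.2%

# Regeneration-heavy clusters (4 and 7) taken together.
regen_total, regen_rate = weighted_success(
    [row for row in CLUSTERS if row[0] in (4, 7)]
)
regen_share = 100 * regen_total / total
print(regen_total, round(regen_rate, 1), round(regen_share, 1))  # 232 PRs, about 69.8% success, about 23.3% of PRs
```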

Sample PRs by cluster (top 3 by diff size)

| PR # | Title | Cluster | Outcome | Files | Additions |
|---|---|---|---|---|---|
| #33852 | Add `create-check-run` safe output type | 0 | Merged | 29 | +3660 |
| #31614 | Auto-detect ARC/DinD, emit AWF docker-host-path-prefix | 0 | Merged | 224 | +2430 |
| #31605 | Centralized slash-command trigger strategy | 0 | Merged | 60 | +1938 |
| #31225 | Optimize `aw-failure-investigator` | 1 | Merged | 219 | +2239 |
| #31981 | Dynamic agent-of-the-day blog entry | 1 | Merged | 3 | +2179 |
| #32904 | UK AI operational resilience workflow | 1 | Merged | 2 | +1686 |
| #31820 | Use `aw_context` fallbacks for prompt context | 2 | Merged | 232 | +5670 |
| #32375 | Allow `${{ experiments.* }}` in runtime-import | 2 | Merged | 231 | +3278 |
| #32186 | Download activation artifact before detection | 2 | Closed | 233 | +3030 |
| #32190 | [WIP] Fix failing GitHub Actions job 'test' | 3 | Closed | 232 | +1703 |
| #32188 | [WIP] Fix failing GitHub Actions job test | 3 | Closed | 231 | +1689 |
| #32189 | [WIP] Fix failing job 'build-wasm' | 3 | Closed | 230 | +1686 |
| #32254 | Recompile workflows: remove deprecated props | 4 | Closed | 457 | +21746 |
| #31223 | Recompile lock files to restore source/lock | 4 | Closed | 218 | +10851 |
| #30995 | Import shared/observability-otlp.md | 4 | Merged | 390 | +7821 |
| #32200 | Add `emoji` frontmatter field | 5 | Merged | 463 | +3763 |
| #33658 | Add engine.permission-mode Claude config | 5 | Closed | 11 | +3424 |
| #31049 | Decompose oversized functions in pkg/workflow | 5 | Closed | 195 | +3369 |
| #30683 | Add cache token multipliers + model cost | 6 | Closed | 215 | +1782 |
| #32196 | Frontmatter source/hash metadata to OTEL spans | 6 | Merged | 224 | +1740 |
| #32298 | Propagate GH_AW_INFO_ENGINE_ID into setup | 6 | Closed | 236 | +1549 |
| #33219 | Bind Node toolcache into AWF chroot | 7 | Open | 250 | +7844 |
| #34321 | Bump AWF firewall to v0.25.53 | 7 | Merged | 249 | +4267 |
| #34324 | Bump gh-aw-firewall to v0.25.53 | 7 | Closed | 245 | +4158 |

Recommendations

Deduplicate regeneration/version-bump work (Clusters 7 & 4). These account for ~23% of PRs but have the lowest merge rates (65–72%), largely because concurrent agents open competing regenerations that supersede each other. Serialize or debounce "recompile" / "bump version" workflows, or collapse them into a single scheduled batch PR, to reduce the number of PRs closed unmerged.
Suppress duplicate [WIP] Fix failing GitHub Actions job PRs (Cluster 3). Near-identical retries against the same job add noise (0.9 reviews, 26% closed). Gate these on "no existing open WIP PR for this job" before opening a new one.
Promote the prompt-tuning pattern (Cluster 1). Its 89.3% success + small diffs show well-scoped, single-purpose tasks merge best — favor narrow task framing in prompts over broad multi-file asks.
Add turn-count/cost enrichment next run. Joining aw_info.json metrics would let us correlate iteration count with prompt cluster directly, replacing the commit/review proxies used here.
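
The gate proposed for Cluster 3 ("no existing open WIP PR for this job") could be sketched as below. This is a hypothetical illustration: the title pattern is inferred from the sample titles in this report, the function names are invented, and how the list of open PR titles is fetched is left out.

```python
import re

# Hypothetical pattern inferred from sample titles in this report, e.g.
# "[WIP] Fix failing GitHub Actions job 'test'" or "[WIP] Fix failing job 'build-wasm'".
WIP_TITLE = re.compile(
    r"^\[WIP\] Fix failing (?:GitHub Actions )?job\s+'?(?P<job>[\w.-]+)'?\s*$"
)


def failing_job(title):
    """Return the lower-cased job name from a WIP CI-repair title, or None."""
    match = WIP_TITLE.match(title.strip())
    return match.group("job").lower() if match else None


def should_open_repair_pr(job, open_pr_titles):
    """True only if no open WIP repair PR already targets this job."""
    target = job.lower()
    return all(failing_job(title) != target for title in open_pr_titles)


# Titles shaped like the samples in this report, treated here as currently open.
open_titles = [
    "[WIP] Fix failing GitHub Actions job 'test'",
    "[WIP] Fix failing job 'build-wasm'",
    "Add `emoji` frontmatter field",
]
print(should_open_repair_pr("test", open_titles))  # False: a repair PR for 'test' exists
print(should_open_repair_pr("lint", open_titles))  # True: nothing open for 'lint'
```

Keying on the job name rather than the full title treats the quoted and unquoted variants ("job 'test'" and "job test") as the same target, which is the near-duplicate pattern visible in the sample PRs.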

References: §26681361951

Generated by 📊 Copilot Agent Prompt Clustering Analysis · opus48 1M · ◷

expires on May 31, 2026, 10:40 AM UTC

2026-05-31T10:45:31Z

github-actions[bot]
Bot May 31, 2026
Author

This discussion has been marked as outdated by Copilot Agent Prompt Clustering Analysis.

A newer discussion is available at Discussion #36103.
