Improve skills based on an external benchmark for DevHub prompts by pkosiec · Pull Request #68 · databricks/databricks-agent-skills

pkosiec · 2026-05-07T14:16:48Z

Summary

Skill improvements based on the internal report — a benchmark of Claude Opus 4.6 vs Codex 5.2 across 25 DevHub templates. The report identified error recovery patterns, troubleshooting gaps, and factual errors in skills.

Report-driven additions

Token passthrough error — added to Common Errors with correct fix (workspace admin enables user authorization)
Lakehouse Sync — documented as UI-only, added REPLICA IDENTITY prerequisite, fixed Azure support status, added Postgres 17 requirement and limitations
Off-platform TypeScript — added @databricks/lakebase standalone pattern in connectivity.md
CLI fallback — added REST API exception for sandboxed environments
PostgreSQL extensions — brief note with link to official docs

Bug fixes (verified against AppKit source)

Lakebase API pattern — replaced outdated createLakebasePool() + pool.query() with AppKit.lakebase.query() plugin pattern
ORM integration — updated to use AppKit.lakebase.pool / getOrmConfig()
Genie resource — restored correct genie_space_name variable (confirmed via apps init output)
Model serving — replaced verbose streaming/AI Gateway sections with AppKit docs pointer; noted AI Gateway (beta) endpoints not directly supported

False positives from initial analysis (reverted)

Scaffolding "Go template bug" — intentional AppKit template design, not a bug
OBO vs SP-only guidance — misleading since apps init handles scopes correctly
Verify Deployment / Deployment Recovery sections — redundant (apps deploy already reports status)
CLI version bump to v0.296.0 — unjustified
Jobs/Pipelines troubleshooting tables — not report-driven
Unity Catalog skill — separate effort
databricks apps logs PAT error — actually about querying app endpoints, not the logs command

Test plan

python3 scripts/skills.py validate passes
No real credentials or workspace IDs in examples
Key claims verified against official Databricks docs and AppKit source

Co-authored-by: Isaac

JIRA

LKB-12465 — Cookbook codegen: agent generates wrong AppKit API signatures
LKB-12428 — Mode B: as unknown as double-assertions (Done)
LKB-12614 — AppKit version scatter
LKB-12159 — Umbrella regression hunt

Add error recovery patterns, troubleshooting, and coverage gaps identified by the April 2026 DevHub agent benchmark across 25 tasks. Key additions:
- Token passthrough workaround and deployment recovery chain
- Off-platform TypeScript/Node.js patterns with REST API fallback
- pgvector, streaming AI chat, multi-space Genie patterns
- Lakehouse Sync UI-only note and REPLICA IDENTITY prerequisite
- Jobs and Pipelines troubleshooting tables
- CLI STOP directive carve-out for off-platform tasks

Co-authored-by: Isaac

- Revert scaffolding bug "Known Issue" (Go template syntax is intentional AppKit design, not a bug)
- Drop Unity Catalog skill (separate PR)
- Replace raw pg driver + REST API curl patterns with @databricks/lakebase package (standalone, auto token refresh)
- Replace detailed pgvector section with brief PostgreSQL extensions note linking to official docs

Co-authored-by: Isaac

- Replace outdated createLakebasePool() + pool.query() pattern with AppKit plugin pattern: AppKit.lakebase.query()
- Fix Genie databricks.yml: remove nonexistent `name` field and `genie_space_name` variable from genie_space resource
- Add missing user_api_scopes (files.files) for files plugin
- Improve model serving streaming docs (SSE proxy, not AI SDK)
- Bump CLI version to >= v0.296.0 across all 7 skills for consistency
- Add multi-environment deploy note in Lakebase scaffolding

Co-authored-by: Isaac

- Revert redundant user_api_scopes in files.md (auto-generated by apps init)
- Restore genie_space_name variable (confirmed in actual scaffolding output)
- Replace multi-space Genie section with pointer to AppKit docs
- Revert Jobs and Pipelines troubleshooting tables (not report-driven)
- Revert CLI version bumps to original values
- Update connectivity.md cross-reference to match new plugin pattern

Co-authored-by: Isaac

- Clarify Lakebase plugin pattern requires scaffolding first
- Update ORM integration to use AppKit.lakebase.pool / getOrmConfig()
- Fix stale pool.query() references in synced tables section
- Replace streaming/AI Gateway sections with AppKit docs pointer
- Add AI Gateway note linking to official docs
- Remove app-focused Getting Started from databricks-core

Co-authored-by: Isaac

- Remove misleading OBO vs SP-only note (apps init handles scopes)
- Fix AI Gateway: note beta endpoints unsupported, point to databricks-model-serving skill instead of incompatible docs
- Fix CLI exception: link to REST API docs, not Lakebase skill
- Trim off-platform Pattern 5 to minimal example + npm view readme

Co-authored-by: Isaac

- Remove Verify Deployment + Deployment Recovery subsections (apps deploy already reports status; it's Option A, not a fallback)
- Remove duplicate file size error and incorrect apps logs PAT entry
- Fix token passthrough error: point to workspace admin enablement, not stripping OBO scopes
- Fix Lakehouse Sync: Azure now supported, add Postgres 17 requirement, destination naming, permissions, partitioned table limitation
- Simplify CLI exception: remove "does NOT require deploying" condition

Co-authored-by: Isaac

keugenek · 2026-05-08T13:13:36Z

Hey Pawel, how have you tested this? Can you please attach a test report?

keugenek · 2026-05-08T16:52:32Z

@@ -1,12 +1,12 @@
 {
  "version": "2",
-  "updated_at": "2026-04-30T11:02:41Z",


Can these be autogenerated on each push? Or removed entirely, since they are unnecessary?

Yeah, let's remove that but on a separate PR 👍

keugenek

approved, with comments, check for eval run results before merging please

keugenek

Nitpicker review — 3 perspectives (correctness / completeness / conflicts)

Triggered a dev eval run with skills_ref=pkosiec/report-compare to validate empirically: run 882501631168304. Results in ~60-90 min.

HIGH (must fix before merge)

1. AppKit.lakebase.query() / AppKit.server.router() — unverified API shapes (lakebase.md)

The PR replaces createLakebasePool() + pool.query() with AppKit.lakebase.query() and AppKit.server.router() / AppKit.server.procedure. Our own SKILL.md line 43 warns: "Training data has stale shapes; a single invented signature fails tsc --noEmit during validate."

The existing trpc.md uses initTRPC + t.router / t.procedure, not AppKit.server.*. If these aren't real exports, every generated Lakebase app will fail compilation. Can you attach a passing tsc --noEmit log or npx @databricks/appkit docs output confirming the new shapes?

The dev eval run above will also validate this empirically — if Lakebase apps fail tsc with the new skill text, we'll know.

2. Stale references not updated — overview.md still says "createLakebasePool, tRPC patterns"; trpc.md still says it "provides createLakebasePool for PostgreSQL CRUD". These contradict the new Lakebase plugin API pattern in the same skill. They should be updated in this PR to avoid agent confusion.

MEDIUM (should fix)

3. Genie multi-space pointer (genie.md) — hardcodes npx @databricks/appkit docs ./docs/plugins/genie.md. The existing pattern in this file uses component-name lookups (npx @databricks/appkit docs "GenieChat") which are version-agnostic. Prefer the component-name form.

4. databricks-core/SKILL.md REST fallback — adds an exception to the existing "STOP — do not work around a missing CLI" guardrail. This reverses a deliberate safety rule without discussing the trade-offs (raw REST calls with PATs bypass workspace auth flows). Worth explicit discussion.

LOW

5. Lakehouse Sync prerequisite (synced-tables.md) — the requirement that tables must reside in the databricks_postgres database is significant but buried in the prerequisites. It should appear higher in the section.

Verdict

REQUEST CHANGES on HIGHs #1 and #2. The Lakebase API replacement is the load-bearing change and needs evidence that AppKit.server.router and AppKit.lakebase.query are real exports. Happy to approve once those are addressed. The dev eval will give us a data point either way.

…house Sync prereq
- Update stale createLakebasePool references in overview.md and trpc.md
- Broaden Genie docs pointer to not be limited to multi-space apps
- REST fallback now asks user instead of auto-falling back
- Surface databricks_postgres requirement earlier in Lakehouse Sync section
- Regenerate manifest

Co-authored-by: Isaac

… real Express pattern

AppKit.server.router() and AppKit.server.procedure do not exist in the AppKit server plugin. Confirmed by:
- tsc --noEmit failure against AppKit 0.24.0 scaffold
- AppKit source (ServerPlugin.exports() only exposes extend/getServer/getConfig)

Replace with the correct pattern: server.extend() with Express routes, matching the scaffold-generated code. Also:
- Rename "tRPC CRUD Pattern" → "CRUD Routes Pattern"
- Use .then() callback pattern (compatible with current AppKit 0.24.0)
- Update all code examples to use lowercase appkit variable
- Fix synced tables example route
- Update Key Differences table

Co-authored-by: Isaac

pkosiec · 2026-05-12T21:26:54Z

Skill Test Report: Main vs PR (`pkosiec/report-compare`)

Setup

CLI: Databricks CLI v0.298.0
Profile: eng-nephos-dust-oregon
Lakebase: project pkosiec3, branch production
Main branch: 6402677 (latest origin/main, worktree at ~/.../databricks-agent-skills/main/)
PR branch: pkosiec/report-compare (rebased on latest main, 9 commits ahead)
AppKit template: template-v0.24.0 (default)
Eval prompts: All 5 Lakebase OLTP prompts from generation_prompts.py
Model: Claude Opus 4.6 (subagents)

Methodology

Skills were properly symlinked to ~/.claude/skills/ before each batch, matching the real eval pipeline:

Symlink databricks-apps + databricks-lakebase → main branch worktree
Verify content (confirmed tRPC pattern in lakebase.md)
Launch 5 main-branch agents with eval-format prompts (app name + directory + task only)
Wait for completion
Swap symlinks → PR branch worktree
Verify content (confirmed server.extend() pattern in lakebase.md)
Launch 5 PR-branch agents with identical prompts
Wait for completion
Restore original symlinks
Run databricks apps validate on all 10 apps

Agent prompts matched the real eval pipeline format — no scaffold commands, no file paths, no "read these files." The skill system injected context automatically.

Key difference between branches

	Main	PR
Documented backend pattern	tRPC + `createLakebasePool()`	Express via `appkit.server.extend()` + `appkit.lakebase.query()`
Scaffold template ships	Express (`server.extend`)	Express (`server.extend`)

Build Results (`databricks apps validate --skip-tests`)

Runs: typegen → ast-grep lint → typecheck → build

App	Main	PR
`todo_app`	PASS	PASS
`user_feedback_collector`	PASS	PASS
`inventory_tracker`	PASS	PASS
`event_registration`	PASS	PASS
`bookmark_manager`	PASS	PASS
Total	5/5	5/5

Full Validation Results (`databricks apps validate` with smoke tests)

App	Main	PR	Notes
`todo_app`	PASS	PASS
`user_feedback_collector`	PASS	FAIL	Playwright strict mode violation (see below)
`inventory_tracker`	PASS	PASS
`event_registration`	PASS	PASS
`bookmark_manager`	PASS	PASS
Total	5/5	4/5
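
The per-branch totals and the single differing app in the table above can be recomputed with a small TypeScript sketch. The `ValidationRow` type and the `summarize` helper are hypothetical names introduced only for illustration; the data is copied from the table, and nothing here calls the Databricks CLI.

```typescript
type Outcome = "PASS" | "FAIL";

// One row of the full-validation table (hypothetical shape for illustration).
interface ValidationRow {
  app: string;
  main: Outcome;
  pr: Outcome;
}

const fullValidation: ValidationRow[] = [
  { app: "todo_app", main: "PASS", pr: "PASS" },
  { app: "user_feedback_collector", main: "PASS", pr: "FAIL" },
  { app: "inventory_tracker", main: "PASS", pr: "PASS" },
  { app: "event_registration", main: "PASS", pr: "PASS" },
  { app: "bookmark_manager", main: "PASS", pr: "PASS" },
];

function summarize(rows: ValidationRow[]) {
  // Count passing apps for one branch column.
  const passed = (key: "main" | "pr"): number =>
    rows.filter((row) => row[key] === "PASS").length;

  return {
    main: `${passed("main")}/${rows.length}`,
    pr: `${passed("pr")}/${rows.length}`,
    // Apps that pass on main but fail on the PR branch need a root-cause check.
    differing: rows
      .filter((row) => row.main === "PASS" && row.pr === "FAIL")
      .map((row) => row.app),
  };
}

const summary = summarize(fullValidation);
```

Here `summary.differing` contains only `user_feedback_collector`, which is the app analysed in the root-cause section that follows.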

PR feedback_app smoke test failure — root cause

The PR agent wrote an ambiguous Playwright selector that resolves to more than one element, which violates strict mode:

// Line 78 — FAILS: matches both <label>Category</label> AND <div>Category Breakdown</div>
await expect(page.getByText('Category')).toBeVisible();

The main agent avoided this by using { exact: true } and .first():

// Main version — PASSES
await expect(page.getByText('Category Breakdown', { exact: true })).toBeVisible();

This is agent authoring variance, not caused by skill content differences. The testing.md skill doc (identical on both branches) warns about Playwright strict mode. One agent followed it better than the other.
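
To make the strict-mode failure concrete, here is a minimal TypeScript sketch that models why a substring text match resolves to two elements while an exact match resolves to one. It is a simplified stand-in, not Playwright's implementation: `FakeElement`, `matchText`, and `resolveStrict` are hypothetical names introduced for illustration. The only behaviour assumed from Playwright is that `getByText` matches substrings case-insensitively by default, that `{ exact: true }` requires a full case-sensitive match, and that strict mode rejects a locator resolving to more than one element.

```typescript
// Simplified model of text matching under Playwright-style strict mode.
// All names are hypothetical; this code does not call Playwright.

interface FakeElement {
  tag: string;
  text: string;
}

// The two elements named in the failure comment above.
const pageElements: FakeElement[] = [
  { tag: "label", text: "Category" },
  { tag: "div", text: "Category Breakdown" },
];

// Default: case-insensitive substring match.
// With exact: true, the whole string must match, case-sensitively.
function matchText(
  elements: FakeElement[],
  query: string,
  options: { exact?: boolean } = {},
): FakeElement[] {
  const normalize = (s: string): string => s.trim().replace(/\s+/g, " ");
  const wanted = normalize(query);
  return elements.filter((el) => {
    const text = normalize(el.text);
    return options.exact
      ? text === wanted
      : text.toLowerCase().includes(wanted.toLowerCase());
  });
}

// Strict mode: an assertion needs exactly one resolved element.
function resolveStrict(matches: FakeElement[]): FakeElement {
  const [only] = matches;
  if (matches.length !== 1 || only === undefined) {
    throw new Error(
      `strict mode violation: resolved to ${matches.length} elements`,
    );
  }
  return only;
}

// 'Category' is a substring of both texts, so two elements match.
const ambiguous = matchText(pageElements, "Category");

// The exact query matches only the <div>.
const exact = matchText(pageElements, "Category Breakdown", { exact: true });
```

In this model `resolveStrict(ambiguous)` throws because two elements match, while `resolveStrict(exact)` returns the `<div>`.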

Observations

No regressions from the PR. Build pass rate is 5/5 on both branches. The single smoke test failure is unrelated to skill content.
Scaffold template dominates backend pattern. The template ships createApp + server.extend() + appkit.lakebase.query(). Both branches' agents used this pattern because it's what the scaffold generates. The PR aligns the skill docs with the template's actual output.
CardTitle renders as <div>. The @databricks/appkit-ui CardTitle component renders as <div data-slot="card-title">, not a heading element. This caused the strict mode violation above and is a known gap — the component's JSDoc says "Title heading for the card" but the implementation is a <div>.
All agents discovered Lakebase resources via CLI. Given just the project name pkosiec3, agents used databricks postgres list-branches and databricks postgres list-databases to resolve the full resource names. The databricks-lakebase skill guided this correctly on both branches.
Schema patterns consistent. All 10 apps used CREATE SCHEMA IF NOT EXISTS app_data (or app) with qualified table names, matching the skill guidance on both branches.

Conclusion

The PR introduces zero regressions. Build quality is 5/5 on both branches. The PR's main value is aligning the Lakebase skill documentation (lakebase.md) with what the AppKit scaffold template actually generates — server.extend() instead of tRPC.

The validation logic already stripped these during comparison, so they served no functional purpose — just diff churn on every regeneration. Git history is the source of truth for timestamps. Addresses review feedback from PR #68. Co-authored-by: Isaac

pkosiec added 3 commits May 7, 2026 16:08

pkosiec changed the title ~~Improve skills based on CAO pilot report findings~~ Improve skills based on an external benchmark for DevHub prompts May 7, 2026

pkosiec added 4 commits May 7, 2026 17:20

pkosiec requested a review from keugenek May 7, 2026 16:26

pkosiec marked this pull request as ready for review May 7, 2026 16:26

pkosiec requested review from a team, lennartkats-db and simonfaltum as code owners May 7, 2026 16:26

simonfaltum approved these changes May 8, 2026

keugenek reviewed May 8, 2026

Comment thread skills/databricks-apps/references/appkit/genie.md Outdated

keugenek reviewed May 8, 2026

Comment thread skills/databricks-core/SKILL.md Outdated

keugenek reviewed May 8, 2026

keugenek approved these changes May 8, 2026

keugenek reviewed May 8, 2026

pkosiec added 2 commits May 11, 2026 12:23

jamesbroadhead mentioned this pull request May 12, 2026

feat: import 17 skills from app-templates (ML-63273) #74

Closed

4 tasks

pkosiec mentioned this pull request May 12, 2026

Remove updated_at fields from manifest #75

Open

3 tasks

pkosiec merged commit 2865b9f into main May 12, 2026
1 check passed

pkosiec merged 9 commits into main from pkosiec/report-compare
