Improve skills based on an external benchmark for DevHub prompts#68
Add error recovery patterns, troubleshooting, and coverage gaps identified by the April 2026 DevHub agent benchmark across 25 tasks. Key additions:
- Token passthrough workaround and deployment recovery chain
- Off-platform TypeScript/Node.js patterns with REST API fallback
- pgvector, streaming AI chat, multi-space Genie patterns
- Lakehouse Sync UI-only note and REPLICA IDENTITY prerequisite
- Jobs and Pipelines troubleshooting tables
- CLI STOP directive carve-out for off-platform tasks
Co-authored-by: Isaac

- Revert scaffolding bug "Known Issue" (Go template syntax is intentional AppKit design, not a bug)
- Drop Unity Catalog skill (separate PR)
- Replace raw pg driver + REST API curl patterns with @databricks/lakebase package (standalone, auto token refresh)
- Replace detailed pgvector section with brief PostgreSQL extensions note linking to official docs
Co-authored-by: Isaac

- Replace outdated createLakebasePool() + pool.query() pattern with AppKit plugin pattern: AppKit.lakebase.query()
- Fix Genie databricks.yml: remove nonexistent `name` field and `genie_space_name` variable from genie_space resource
- Add missing user_api_scopes (files.files) for files plugin
- Improve model serving streaming docs (SSE proxy, not AI SDK)
- Bump CLI version to >= v0.296.0 across all 7 skills for consistency
- Add multi-environment deploy note in Lakebase scaffolding
Co-authored-by: Isaac

- Revert redundant user_api_scopes in files.md (auto-generated by apps init)
- Restore genie_space_name variable (confirmed in actual scaffolding output)
- Replace multi-space Genie section with pointer to AppKit docs
- Revert Jobs and Pipelines troubleshooting tables (not report-driven)
- Revert CLI version bumps to original values
- Update connectivity.md cross-reference to match new plugin pattern
Co-authored-by: Isaac

- Clarify Lakebase plugin pattern requires scaffolding first
- Update ORM integration to use AppKit.lakebase.pool / getOrmConfig()
- Fix stale pool.query() references in synced tables section
- Replace streaming/AI Gateway sections with AppKit docs pointer
- Add AI Gateway note linking to official docs
- Remove app-focused Getting Started from databricks-core
Co-authored-by: Isaac

- Remove misleading OBO vs SP-only note (apps init handles scopes)
- Fix AI Gateway: note beta endpoints unsupported, point to databricks-model-serving skill instead of incompatible docs
- Fix CLI exception: link to REST API docs, not Lakebase skill
- Trim off-platform Pattern 5 to minimal example + npm view readme
Co-authored-by: Isaac

- Remove Verify Deployment + Deployment Recovery subsections (apps deploy already reports status; it's Option A, not a fallback)
- Remove duplicate file size error and incorrect apps logs PAT entry
- Fix token passthrough error: point to workspace admin enablement, not stripping OBO scopes
- Fix Lakehouse Sync: Azure now supported, add Postgres 17 requirement, destination naming, permissions, partitioned table limitation
- Simplify CLI exception: remove "does NOT require deploying" condition
Co-authored-by: Isaac
hey Pawel, how have you tested this? Can you please attach a test report?
@@ -1,12 +1,12 @@
{
"version": "2",
"updated_at": "2026-04-30T11:02:41Z",
can these be autogenerated on each push? or removed entirely, since they are unnecessary
Yeah, let's remove that but on a separate PR 👍
keugenek
left a comment
approved, with comments, check for eval run results before merging please
keugenek
left a comment
Nitpicker review — 3 perspectives (correctness / completeness / conflicts)
Triggered a dev eval run with skills_ref=pkosiec/report-compare to validate empirically: run 882501631168304. Results in ~60-90 min.
HIGH (must fix before merge)
1. AppKit.lakebase.query() / AppKit.server.router() — unverified API shapes (lakebase.md)
The PR replaces createLakebasePool() + pool.query() with AppKit.lakebase.query() and AppKit.server.router() / AppKit.server.procedure. Our own SKILL.md line 43 warns: "Training data has stale shapes; a single invented signature fails tsc --noEmit during validate."
The existing trpc.md uses initTRPC + t.router / t.procedure, not AppKit.server.*. If these aren't real exports, every generated Lakebase app will fail compilation. Can you attach a passing tsc --noEmit log or npx @databricks/appkit docs output confirming the new shapes?
The dev eval run above will also validate this empirically — if Lakebase apps fail tsc with the new skill text, we'll know.
2. Stale references not updated — overview.md still says createLakebasePool, tRPC patterns; trpc.md still says provides createLakebasePool for PostgreSQL CRUD. These contradict the new Lakebase plugin API pattern in the same skill. Should be updated in this PR to avoid agent confusion.
MEDIUM (should fix)
3. Genie multi-space pointer (genie.md) — hardcodes npx @databricks/appkit docs ./docs/plugins/genie.md. The existing pattern in this file uses component-name lookups (npx @databricks/appkit docs "GenieChat") which are version-agnostic. Prefer the component-name form.
4. databricks-core/SKILL.md REST fallback — adds an exception to the existing "STOP — do not work around a missing CLI" guardrail. This reverses a deliberate safety rule without discussing the trade-offs (raw REST calls with PATs bypass workspace auth flows). Worth explicit discussion.
LOW
5. Lakehouse Sync prerequisite (synced-tables.md) — requirement that tables must reside in databricks_postgres database is significant but buried in prerequisites. Should be higher in the section.
Verdict
REQUEST CHANGES on HIGHs #1 and #2. The Lakebase API replacement is the load-bearing change and needs evidence that AppKit.server.router and AppKit.lakebase.query are real exports. Happy to approve once those are addressed. The dev eval will give us a data point either way.
…house Sync prereq
- Update stale createLakebasePool references in overview.md and trpc.md
- Broaden Genie docs pointer to not be limited to multi-space apps
- REST fallback now asks user instead of auto-falling back
- Surface databricks_postgres requirement earlier in Lakehouse Sync section
- Regenerate manifest
Co-authored-by: Isaac

… real Express pattern
AppKit.server.router() and AppKit.server.procedure do not exist in the AppKit server plugin. Confirmed by:
- tsc --noEmit failure against AppKit 0.24.0 scaffold
- AppKit source (ServerPlugin.exports() only exposes extend/getServer/getConfig)
Replace with the correct pattern: server.extend() with Express routes, matching the scaffold-generated code. Also:
- Rename "tRPC CRUD Pattern" → "CRUD Routes Pattern"
- Use .then() callback pattern (compatible with current AppKit 0.24.0)
- Update all code examples to use lowercase appkit variable
- Fix synced tables example route
- Update Key Differences table
Co-authored-by: Isaac
Skill Test Report: Main vs PR (
| | Main | PR |
|---|---|---|
| Documented backend pattern | tRPC + createLakebasePool() | Express via appkit.server.extend() + appkit.lakebase.query() |
| Scaffold template ships | Express (server.extend) | Express (server.extend) |
Build Results (databricks apps validate --skip-tests)
Runs: typegen → ast-grep lint → typecheck → build
| App | Main | PR |
|---|---|---|
| todo_app | PASS | PASS |
| user_feedback_collector | PASS | PASS |
| inventory_tracker | PASS | PASS |
| event_registration | PASS | PASS |
| bookmark_manager | PASS | PASS |
| Total | 5/5 | 5/5 |
Full Validation Results (databricks apps validate with smoke tests)
| App | Main | PR | Notes |
|---|---|---|---|
| todo_app | PASS | PASS | |
| user_feedback_collector | PASS | FAIL | Playwright strict mode violation (see below) |
| inventory_tracker | PASS | PASS | |
| event_registration | PASS | PASS | |
| bookmark_manager | PASS | PASS | |
| Total | 5/5 | 4/5 | |
PR feedback_app smoke test failure — root cause
The PR agent wrote a non-strict Playwright selector:

// Line 78 - FAILS: matches both <label>Category</label> AND <div>Category Breakdown</div>
await expect(page.getByText('Category')).toBeVisible();

The main agent avoided this by using { exact: true } and .first():

// Main version - PASSES
await expect(page.getByText('Category Breakdown', { exact: true })).toBeVisible();

This is agent authoring variance, not caused by skill content differences. The testing.md skill doc (identical on both branches) warns about Playwright strict mode. One agent followed it better than the other.
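
To make the strict mode failure concrete, here is a minimal TypeScript sketch that models how text matching behaves. It is a simplified stand-in, not Playwright itself: `matchesText`, `resolveStrict`, and the `pageTexts` array are hypothetical names invented for illustration. The only behavior assumed from Playwright is that `getByText` does a case-insensitive substring match by default, that `{ exact: true }` requires a case-sensitive whole-string match, and that strict mode rejects a locator resolving to more than one element.

```typescript
// Simplified model of text matching; hypothetical helpers, not the Playwright API.
type TextMatchOptions = { exact?: boolean };

// Trim and collapse whitespace before comparing.
function normalize(text: string): string {
  return text.trim().replace(/\s+/g, " ");
}

// Default: case-insensitive substring match.
// exact: case-sensitive match against the whole string.
function matchesText(
  elementText: string,
  query: string,
  options: TextMatchOptions = {},
): boolean {
  const text = normalize(elementText);
  const wanted = normalize(query);
  if (options.exact) {
    return text === wanted;
  }
  return text.toLowerCase().includes(wanted.toLowerCase());
}

// Strict mode: the query must resolve to a single element.
function resolveStrict(
  elementTexts: string[],
  query: string,
  options: TextMatchOptions = {},
): string {
  const matches = elementTexts.filter((t) => matchesText(t, query, options));
  const [only] = matches;
  if (only === undefined) {
    throw new Error(`no element found for "${query}"`);
  }
  if (matches.length > 1) {
    throw new Error(
      `strict mode violation: "${query}" resolved to ${matches.length} elements`,
    );
  }
  return only;
}

// Hypothetical page content mirroring the failure described above.
const pageTexts = ["Category", "Category Breakdown", "Submit feedback"];
```

In this model, `resolveStrict(pageTexts, "Category")` throws because the substring matches both the label and the card title, while `resolveStrict(pageTexts, "Category Breakdown", { exact: true })` resolves to a single element.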
Observations
- No regressions from the PR. Build pass rate is 5/5 on both branches. The single smoke test failure is unrelated to skill content.
- Scaffold template dominates backend pattern. The template ships `createApp` + `server.extend()` + `appkit.lakebase.query()`. Both branches' agents used this pattern because it's what the scaffold generates. The PR aligns the skill docs with the template's actual output.
- `CardTitle` renders as `<div>`. The `@databricks/appkit-ui` `CardTitle` component renders as `<div data-slot="card-title">`, not a heading element. This caused the strict mode violation above and is a known gap: the component's JSDoc says "Title heading for the card" but the implementation is a `<div>`.
- All agents discovered Lakebase resources via CLI. Given just the project name `pkosiec3`, agents used `databricks postgres list-branches` and `databricks postgres list-databases` to resolve the full resource names. The `databricks-lakebase` skill guided this correctly on both branches.
- Schema patterns consistent. All 10 apps used `CREATE SCHEMA IF NOT EXISTS app_data` (or `app`) with qualified table names, matching the skill guidance on both branches.
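
The schema pattern in the last observation can be sketched as a small TypeScript helper that builds idempotent setup statements with schema-qualified table names. This is only an illustration under stated assumptions: the `todos` table, its columns, and the helper names (`quoteIdent`, `qualified`, `setupStatements`) are hypothetical and do not come from the generated apps or the skill docs.

```typescript
// Quote a PostgreSQL identifier, doubling any embedded double quotes.
function quoteIdent(name: string): string {
  return `"${name.replace(/"/g, '""')}"`;
}

// Build a schema-qualified table name, e.g. "app_data"."todos".
function qualified(schema: string, table: string): string {
  return `${quoteIdent(schema)}.${quoteIdent(table)}`;
}

// Idempotent setup: both statements are safe to run on every app start.
// The column list is a hypothetical example, not taken from the report.
function setupStatements(schema: string, table: string): string[] {
  return [
    `CREATE SCHEMA IF NOT EXISTS ${quoteIdent(schema)}`,
    `CREATE TABLE IF NOT EXISTS ${qualified(schema, table)} ` +
      `(id SERIAL PRIMARY KEY, title TEXT NOT NULL)`,
  ];
}

// Example: statements for a hypothetical "todos" table in the app_data schema.
const statements = setupStatements("app_data", "todos");
```

Qualifying every table name keeps queries independent of the connection's `search_path`, which is presumably why the skill guidance recommends it.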
Conclusion
The PR introduces zero regressions. Build quality is 5/5 on both branches. The PR's main value is aligning the Lakebase skill documentation (lakebase.md) with what the AppKit scaffold template actually generates — server.extend() instead of tRPC.
The validation logic already stripped these during comparison, so they served no functional purpose — just diff churn on every regeneration. Git history is the source of truth for timestamps. Addresses review feedback from PR #68. Co-authored-by: Isaac
Summary
Skill improvements based on the internal report — a benchmark of Claude Opus 4.6 vs Codex 5.2 across 25 DevHub templates. The report identified error recovery patterns, troubleshooting gaps, and factual errors in skills.
Report-driven additions
- `@databricks/lakebase` standalone pattern in connectivity.md

Bug fixes (verified against AppKit source)
- Replace `createLakebasePool()` + `pool.query()` with `AppKit.lakebase.query()` plugin pattern
- Update ORM integration to use `AppKit.lakebase.pool` / `getOrmConfig()`
- Restore `genie_space_name` variable (confirmed via `apps init` output)

False positives from initial analysis (reverted)
- OBO vs SP-only note: `apps init` handles scopes correctly
- Verify Deployment + Deployment Recovery subsections (`apps deploy` already reports status)
- `databricks apps logs` PAT error: actually about querying app endpoints, not the logs command

Test plan
- `python3 scripts/skills.py validate` passes

Co-authored-by: Isaac
JIRA
Related:
`as unknown as` double-assertions (Done)