Improve skills based on an external benchmark for DevHub prompts #68

Merged
pkosiec merged 9 commits into main from pkosiec/report-compare
May 12, 2026

@pkosiec
Member

@pkosiec pkosiec commented May 7, 2026

Summary

Skill improvements based on the internal report — a benchmark of Claude Opus 4.6 vs Codex 5.2 across 25 DevHub templates. The report identified error recovery patterns, troubleshooting gaps, and factual errors in skills.

Report-driven additions

  • Token passthrough error — added to Common Errors with correct fix (workspace admin enables user authorization)
  • Lakehouse Sync — documented as UI-only, added REPLICA IDENTITY prerequisite, fixed Azure support status, added Postgres 17 requirement and limitations
  • Off-platform TypeScript — added @databricks/lakebase standalone pattern in connectivity.md
  • CLI fallback — added REST API exception for sandboxed environments
  • PostgreSQL extensions — brief note with link to official docs

Bug fixes (verified against AppKit source)

  • Lakebase API pattern — replaced outdated createLakebasePool() + pool.query() with AppKit.lakebase.query() plugin pattern
  • ORM integration — updated to use AppKit.lakebase.pool / getOrmConfig()
  • Genie resource — restored correct genie_space_name variable (confirmed via apps init output)
  • Model serving — replaced verbose streaming/AI Gateway sections with AppKit docs pointer; noted AI Gateway (beta) endpoints not directly supported

False positives from initial analysis (reverted)

  • Scaffolding "Go template bug" — intentional AppKit template design, not a bug
  • OBO vs SP-only guidance — misleading since apps init handles scopes correctly
  • Verify Deployment / Deployment Recovery sections — redundant (apps deploy already reports status)
  • CLI version bump to v0.296.0 — unjustified
  • Jobs/Pipelines troubleshooting tables — not report-driven
  • Unity Catalog skill — separate effort
  • databricks apps logs PAT error — actually about querying app endpoints, not the logs command

Test plan

  • python3 scripts/skills.py validate passes
  • No real credentials or workspace IDs in examples
  • Key claims verified against official Databricks docs and AppKit source

Co-authored-by: Isaac


JIRA

Related:

  • LKB-12465 — Cookbook codegen: agent generates wrong AppKit API signatures
  • LKB-12428 — Mode B: as unknown as double-assertions (Done)
  • LKB-12614 — AppKit version scatter
  • LKB-12159 — Umbrella regression hunt

pkosiec added 3 commits May 7, 2026 16:08
Add error recovery patterns, troubleshooting, and coverage gaps
identified by the April 2026 DevHub agent benchmark across 25 tasks.

Key additions:
- Token passthrough workaround and deployment recovery chain
- Off-platform TypeScript/Node.js patterns with REST API fallback
- pgvector, streaming AI chat, multi-space Genie patterns
- Lakehouse Sync UI-only note and REPLICA IDENTITY prerequisite
- Jobs and Pipelines troubleshooting tables
- CLI STOP directive carve-out for off-platform tasks

Co-authored-by: Isaac
- Revert scaffolding bug "Known Issue" (Go template syntax is
  intentional AppKit design, not a bug)
- Drop Unity Catalog skill (separate PR)
- Replace raw pg driver + REST API curl patterns with
  @databricks/lakebase package (standalone, auto token refresh)
- Replace detailed pgvector section with brief PostgreSQL
  extensions note linking to official docs

Co-authored-by: Isaac
- Replace outdated createLakebasePool() + pool.query() pattern with
  AppKit plugin pattern: AppKit.lakebase.query()
- Fix Genie databricks.yml: remove nonexistent `name` field and
  `genie_space_name` variable from genie_space resource
- Add missing user_api_scopes (files.files) for files plugin
- Improve model serving streaming docs (SSE proxy, not AI SDK)
- Bump CLI version to >= v0.296.0 across all 7 skills for consistency
- Add multi-environment deploy note in Lakebase scaffolding

Co-authored-by: Isaac
@pkosiec pkosiec changed the title Improve skills based on CAO pilot report findings Improve skills based on an external benchmark for DevHub prompts May 7, 2026
pkosiec added 4 commits May 7, 2026 17:20
- Revert redundant user_api_scopes in files.md (auto-generated by apps init)
- Restore genie_space_name variable (confirmed in actual scaffolding output)
- Replace multi-space Genie section with pointer to AppKit docs
- Revert Jobs and Pipelines troubleshooting tables (not report-driven)
- Revert CLI version bumps to original values
- Update connectivity.md cross-reference to match new plugin pattern

Co-authored-by: Isaac
- Clarify Lakebase plugin pattern requires scaffolding first
- Update ORM integration to use AppKit.lakebase.pool / getOrmConfig()
- Fix stale pool.query() references in synced tables section
- Replace streaming/AI Gateway sections with AppKit docs pointer
- Add AI Gateway note linking to official docs
- Remove app-focused Getting Started from databricks-core

Co-authored-by: Isaac
- Remove misleading OBO vs SP-only note (apps init handles scopes)
- Fix AI Gateway: note beta endpoints unsupported, point to
  databricks-model-serving skill instead of incompatible docs
- Fix CLI exception: link to REST API docs, not Lakebase skill
- Trim off-platform Pattern 5 to minimal example + npm view readme

Co-authored-by: Isaac
- Remove Verify Deployment + Deployment Recovery subsections (apps deploy
  already reports status; it's Option A, not a fallback)
- Remove duplicate file size error and incorrect apps logs PAT entry
- Fix token passthrough error: point to workspace admin enablement, not
  stripping OBO scopes
- Fix Lakehouse Sync: Azure now supported, add Postgres 17 requirement,
  destination naming, permissions, partitioned table limitation
- Simplify CLI exception: remove "does NOT require deploying" condition

Co-authored-by: Isaac
@pkosiec pkosiec requested a review from keugenek May 7, 2026 16:26
@pkosiec pkosiec marked this pull request as ready for review May 7, 2026 16:26
@pkosiec pkosiec requested review from a team, lennartkats-db and simonfaltum as code owners May 7, 2026 16:26
@keugenek
Contributor

keugenek commented May 8, 2026

Hey Pawel, how have you tested this? Can you please attach a test report?

Comment thread skills/databricks-apps/references/appkit/genie.md Outdated
Comment thread skills/databricks-core/SKILL.md Outdated
Comment thread manifest.json
@@ -1,12 +1,12 @@
{
"version": "2",
"updated_at": "2026-04-30T11:02:41Z",
Contributor

Can these be autogenerated on each push? Or removed entirely, since they are unnecessary?

Member Author

Yeah, let's remove that but on a separate PR 👍

Member Author

#75

Contributor

@keugenek keugenek left a comment

approved, with comments, check for eval run results before merging please

Contributor

@keugenek keugenek left a comment

Nitpicker review — 3 perspectives (correctness / completeness / conflicts)

Triggered a dev eval run with skills_ref=pkosiec/report-compare to validate empirically: run 882501631168304. Results in ~60-90 min.

HIGH (must fix before merge)

1. AppKit.lakebase.query() / AppKit.server.router() — unverified API shapes (lakebase.md)

The PR replaces createLakebasePool() + pool.query() with AppKit.lakebase.query() and AppKit.server.router() / AppKit.server.procedure. Our own SKILL.md line 43 warns: "Training data has stale shapes; a single invented signature fails tsc --noEmit during validate."

The existing trpc.md uses initTRPC + t.router / t.procedure, not AppKit.server.*. If these aren't real exports, every generated Lakebase app will fail compilation. Can you attach a passing tsc --noEmit log or npx @databricks/appkit docs output confirming the new shapes?

The dev eval run above will also validate this empirically — if Lakebase apps fail tsc with the new skill text, we'll know.

2. Stale references not updated (overview.md, trpc.md)

overview.md still says createLakebasePool, tRPC patterns; trpc.md still says provides createLakebasePool for PostgreSQL CRUD. These contradict the new Lakebase plugin API pattern in the same skill. They should be updated in this PR to avoid agent confusion.

MEDIUM (should fix)

3. Genie multi-space pointer (genie.md) — hardcodes npx @databricks/appkit docs ./docs/plugins/genie.md. The existing pattern in this file uses component-name lookups (npx @databricks/appkit docs "GenieChat") which are version-agnostic. Prefer the component-name form.

4. databricks-core/SKILL.md REST fallback — adds an exception to the existing "STOP — do not work around a missing CLI" guardrail. This reverses a deliberate safety rule without discussing the trade-offs (raw REST calls with PATs bypass workspace auth flows). Worth explicit discussion.

LOW

5. Lakehouse Sync prerequisite (synced-tables.md) — requirement that tables must reside in databricks_postgres database is significant but buried in prerequisites. Should be higher in the section.

Verdict

REQUEST CHANGES on HIGHs #1 and #2. The Lakebase API replacement is the load-bearing change and needs evidence that AppKit.server.router and AppKit.lakebase.query are real exports. Happy to approve once those are addressed. The dev eval will give us a data point either way.

pkosiec added 2 commits May 11, 2026 12:23
…house Sync prereq

- Update stale createLakebasePool references in overview.md and trpc.md
- Broaden Genie docs pointer to not be limited to multi-space apps
- REST fallback now asks user instead of auto-falling back
- Surface databricks_postgres requirement earlier in Lakehouse Sync section
- Regenerate manifest

Co-authored-by: Isaac
… real Express pattern

AppKit.server.router() and AppKit.server.procedure do not exist in the
AppKit server plugin. Confirmed by:
- tsc --noEmit failure against AppKit 0.24.0 scaffold
- AppKit source (ServerPlugin.exports() only exposes extend/getServer/getConfig)

Replace with the correct pattern: server.extend() with Express routes,
matching the scaffold-generated code. Also:
- Rename "tRPC CRUD Pattern" → "CRUD Routes Pattern"
- Use .then() callback pattern (compatible with current AppKit 0.24.0)
- Update all code examples to use lowercase appkit variable
- Fix synced tables example route
- Update Key Differences table

Co-authored-by: Isaac
@pkosiec
Member Author

pkosiec commented May 12, 2026

Skill Test Report: Main vs PR (pkosiec/report-compare)

Setup

  • CLI: Databricks CLI v0.298.0
  • Profile: eng-nephos-dust-oregon
  • Lakebase: project pkosiec3, branch production
  • Main branch: 6402677 (latest origin/main, worktree at ~/.../databricks-agent-skills/main/)
  • PR branch: pkosiec/report-compare (rebased on latest main, 9 commits ahead)
  • AppKit template: template-v0.24.0 (default)
  • Eval prompts: All 5 Lakebase OLTP prompts from generation_prompts.py
  • Model: Claude Opus 4.6 (subagents)

Methodology

Skills were properly symlinked to ~/.claude/skills/ before each batch, matching the real eval pipeline:

  1. Symlink databricks-apps + databricks-lakebase → main branch worktree
  2. Verify content (confirmed tRPC pattern in lakebase.md)
  3. Launch 5 main-branch agents with eval-format prompts (app name + directory + task only)
  4. Wait for completion
  5. Swap symlinks → PR branch worktree
  6. Verify content (confirmed server.extend() pattern in lakebase.md)
  7. Launch 5 PR-branch agents with identical prompts
  8. Wait for completion
  9. Restore original symlinks
  10. Run databricks apps validate on all 10 apps

Agent prompts matched the real eval pipeline format — no scaffold commands, no file paths, no "read these files." The skill system injected context automatically.

Key difference between branches

| | Main | PR |
|---|---|---|
| Documented backend pattern | tRPC + createLakebasePool() | Express via appkit.server.extend() + appkit.lakebase.query() |
| Scaffold template ships | Express (server.extend) | Express (server.extend) |
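
To make the PR column more concrete, the sketch below mimics the shape of that pattern using local stand-in types. It deliberately does not import `@databricks/appkit`: the `AppKitLike`, `RouteApp`, and `makeFakeAppKit` names, the `{ rows }` result shape, the `app_data.todos` table, and the `/api/todos` path are all assumptions made for illustration. The scaffold output or `npx @databricks/appkit docs` remains the place to confirm the real signatures.

```typescript
// Stand-in types only: they approximate the pattern named in the table,
// not the real @databricks/appkit exports or signatures.
type Handler = (req: unknown, res: { json(body: unknown): void }) => Promise<void> | void;

interface RouteApp {
  get(path: string, handler: Handler): void;
}

interface AppKitLike {
  server: { extend(register: (app: RouteApp) => void): void };
  lakebase: { query(sql: string, params?: unknown[]): Promise<{ rows: unknown[] }> };
}

// Hypothetical in-memory fake so the sketch runs without a workspace.
function makeFakeAppKit(rows: unknown[]): { appkit: AppKitLike; routes: Map<string, Handler> } {
  const routes = new Map<string, Handler>();
  const app: RouteApp = {
    get: (path, handler) => {
      routes.set(path, handler);
    },
  };
  const appkit: AppKitLike = {
    server: { extend: (register) => register(app) },
    lakebase: { query: async () => ({ rows }) },
  };
  return { appkit, routes };
}

// The pattern itself: register Express-style routes inside server.extend()
// and read from Lakebase through the plugin instead of a hand-built pool.
function registerTodoRoutes(appkit: AppKitLike): void {
  appkit.server.extend((app) => {
    app.get("/api/todos", async (_req, res) => {
      const result = await appkit.lakebase.query(
        "SELECT id, title FROM app_data.todos ORDER BY id",
      );
      res.json(result.rows);
    });
  });
}
```

Calling `registerTodoRoutes(makeFakeAppKit([]).appkit)` registers a single `/api/todos` handler on the fake; in a real app the `appkit` value would come from the scaffold-generated entry point rather than from a fake.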

Build Results (databricks apps validate --skip-tests)

Runs: typegen → ast-grep lint → typecheck → build

| App | Main | PR |
|---|---|---|
| todo_app | PASS | PASS |
| user_feedback_collector | PASS | PASS |
| inventory_tracker | PASS | PASS |
| event_registration | PASS | PASS |
| bookmark_manager | PASS | PASS |
| Total | 5/5 | 5/5 |

Full Validation Results (databricks apps validate with smoke tests)

| App | Main | PR | Notes |
|---|---|---|---|
| todo_app | PASS | PASS | |
| user_feedback_collector | PASS | FAIL | Playwright strict mode violation (see below) |
| inventory_tracker | PASS | PASS | |
| event_registration | PASS | PASS | |
| bookmark_manager | PASS | PASS | |
| Total | 5/5 | 4/5 | |
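
The build and full-validation results can be read together mechanically. The sketch below encodes the outcomes reported in the two tables and lists the apps that pass on main but fail on the PR at each stage. The `AppResult` type and the `passCount` and `regressions` helpers are illustrative names, not part of the eval pipeline.

```typescript
type Outcome = "PASS" | "FAIL";
type Stage = "build" | "full";
type Branch = "main" | "pr";

interface AppResult {
  app: string;
  build: Record<Branch, Outcome>; // validate --skip-tests
  full: Record<Branch, Outcome>; // validate with smoke tests
}

const bothPass: Record<Branch, Outcome> = { main: "PASS", pr: "PASS" };

// Outcomes copied from the two result tables in this report.
const results: AppResult[] = [
  { app: "todo_app", build: bothPass, full: bothPass },
  { app: "user_feedback_collector", build: bothPass, full: { main: "PASS", pr: "FAIL" } },
  { app: "inventory_tracker", build: bothPass, full: bothPass },
  { app: "event_registration", build: bothPass, full: bothPass },
  { app: "bookmark_manager", build: bothPass, full: bothPass },
];

// Number of apps passing a given stage on a given branch.
function passCount(stage: Stage, branch: Branch): number {
  return results.filter((r) => r[stage][branch] === "PASS").length;
}

// Apps that pass on main but fail on the PR for a given stage.
function regressions(stage: Stage): string[] {
  return results
    .filter((r) => r[stage].main === "PASS" && r[stage].pr === "FAIL")
    .map((r) => r.app);
}
```

On this data, `regressions("build")` is empty and `regressions("full")` contains only `user_feedback_collector`, the case analysed in the root-cause section that follows.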

PR user_feedback_collector smoke test failure: root cause

The PR agent wrote a non-strict Playwright selector:

// Line 78 — FAILS: matches both <label>Category</label> AND <div>Category Breakdown</div>
await expect(page.getByText('Category')).toBeVisible();

The main agent avoided this by using { exact: true } and .first():

// Main version — PASSES
await expect(page.getByText('Category Breakdown', { exact: true })).toBeVisible();

This is agent authoring variance, not caused by skill content differences. The testing.md skill doc (identical on both branches) warns about Playwright strict mode. One agent followed it better than the other.
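
The failure mode can be reproduced without a browser. Playwright's `getByText` matches substrings case-insensitively by default, and `{ exact: true }` switches to a case-sensitive whole-string match; an assertion such as `toBeVisible()` then fails with a strict mode violation when the locator resolves to more than one element. The sketch below is a simplified model of that rule over plain strings. The `matchByText` and `assertStrict` helpers are hypothetical and are not Playwright APIs.

```typescript
// Simplified model of Playwright text matching; not the real implementation.
function normalize(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

// Default: case-insensitive substring match. exact: case-sensitive full match.
function matchByText(
  elements: string[],
  query: string,
  opts: { exact?: boolean } = {},
): string[] {
  const q = normalize(query);
  return elements.filter((el) => {
    const t = normalize(el);
    return opts.exact ? t === q : t.toLowerCase().includes(q.toLowerCase());
  });
}

// Mirrors strict mode: a locator that resolves to several elements is an error.
function assertStrict(matches: string[]): string {
  const [only] = matches;
  if (matches.length !== 1 || only === undefined) {
    throw new Error(`strict mode violation: resolved to ${matches.length} elements`);
  }
  return only;
}

// Text of the <label> and the card title described in the report.
const pageTexts = ["Category", "Category Breakdown"];

const loose = matchByText(pageTexts, "Category"); // matches both elements
const exact = matchByText(pageTexts, "Category Breakdown", { exact: true }); // matches one
```

Here `assertStrict(loose)` throws because two elements match, while `assertStrict(exact)` returns the single card title, which is the difference between the two selectors quoted above.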

Observations

  1. No regressions from the PR. Build pass rate is 5/5 on both branches. The single smoke test failure is unrelated to skill content.

  2. Scaffold template dominates backend pattern. The template ships createApp + server.extend() + appkit.lakebase.query(). Both branches' agents used this pattern because it's what the scaffold generates. The PR aligns the skill docs with the template's actual output.

  3. CardTitle renders as <div>. The @databricks/appkit-ui CardTitle component renders as <div data-slot="card-title">, not a heading element. This caused the strict mode violation above and is a known gap — the component's JSDoc says "Title heading for the card" but the implementation is a <div>.

  4. All agents discovered Lakebase resources via CLI. Given just the project name pkosiec3, agents used databricks postgres list-branches and databricks postgres list-databases to resolve the full resource names. The databricks-lakebase skill guided this correctly on both branches.

  5. Schema patterns consistent. All 10 apps used CREATE SCHEMA IF NOT EXISTS app_data (or app) with qualified table names, matching the skill guidance on both branches.
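
As an illustration of the schema convention in item 5, the sketch below builds idempotent, schema-qualified DDL strings. Only `CREATE SCHEMA IF NOT EXISTS app_data` and the use of qualified table names come from the report; the `todos` table, its columns, and the `bootstrapStatements` helper are hypothetical.

```typescript
// Accept only simple lowercase identifiers so they can be interpolated safely.
function safeIdent(name: string): string {
  if (!/^[a-z_][a-z0-9_]*$/.test(name)) {
    throw new Error(`unsafe identifier: ${name}`);
  }
  return name;
}

// Hypothetical helper: returns DDL that can be re-run on every app start.
function bootstrapStatements(schema: string, table: string, columns: string[]): string[] {
  const qualified = `${safeIdent(schema)}.${safeIdent(table)}`;
  return [
    `CREATE SCHEMA IF NOT EXISTS ${safeIdent(schema)}`,
    `CREATE TABLE IF NOT EXISTS ${qualified} (${columns.join(", ")})`,
  ];
}

// Example table and columns are assumptions, not taken from the generated apps.
const ddl = bootstrapStatements("app_data", "todos", [
  "id SERIAL PRIMARY KEY",
  "title TEXT NOT NULL",
]);
```

Qualifying every table with its schema keeps queries independent of the connection's `search_path`, which is one plausible reason for the convention.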

Conclusion

The PR introduces zero regressions. Build quality is 5/5 on both branches. The PR's main value is aligning the Lakebase skill documentation (lakebase.md) with what the AppKit scaffold template actually generates — server.extend() instead of tRPC.

@pkosiec pkosiec merged commit 2865b9f into main May 12, 2026
1 check passed
pkosiec added a commit that referenced this pull request May 15, 2026
The validation logic already stripped these during comparison,
so they served no functional purpose — just diff churn on every
regeneration. Git history is the source of truth for timestamps.

Addresses review feedback from PR #68.

Co-authored-by: Isaac
3 participants