diff --git a/README.md b/README.md
index 7fae229..7428a88 100644
--- a/README.md
+++ b/README.md
@@ -50,7 +50,7 @@ See [`integrations/README.md`](integrations/README.md), [`automations/README.md`
 
 ## Extensions Catalog
 
-This repository contains **2 marketplace(s)** with **50 extensions** (40 skills, 10 plugins).
+This repository contains **2 marketplace(s)** with **51 extensions** (41 skills, 10 plugins).
 
 ### large-codebase
 
@@ -69,7 +69,7 @@ OpenHands skills for interacting, improving, and refactoring large codebases
 
 Official skills and plugins for OpenHands — the open-source AI software engineer.
 
-**46 extensions** (38 skills, 8 plugins)
+**47 extensions** (39 skills, 8 plugins)
 
 | Name | Type | Description | Commands |
 |------|------|-------------|----------|
@@ -98,6 +98,7 @@ Official skills and plugins for OpenHands — the open-source AI software engine
 | learn-from-code-review | skill | Distill code review feedback from GitHub PRs into reusable skills and guidelines. Use when users ask to learn from co... | `/learn-from-reviews` |
 | linear | skill | Interact with Linear project management - query issues, update status, create tickets using the Linear GraphQL API. | — |
 | magic-test | plugin | A simple test plugin for verifying plugin loading. Triggers on magic words (alakazam, abracadabra) and returns a spec... | — |
+| model-router | skill | Recommend the most cost-efficient LLM for a given task category (research, bug fixing, planning, frontend, testing, b... | — |
 | notion | skill | Create, search, and update Notion pages/databases using the Notion API. Use for documenting work, generating runbooks... | — |
 | npm | skill | Handle npm package installation in non-interactive environments by piping confirmations. Use when installing Node.js ... | — |
 | onboarding | plugin | Assess repository agent-readiness across five pillars, propose high-impact fixes, and generate repo-specific AGENTS.m... | — |
diff --git a/marketplaces/openhands-extensions.json b/marketplaces/openhands-extensions.json
index 550a86b..e67d3b7 100644
--- a/marketplaces/openhands-extensions.json
+++ b/marketplaces/openhands-extensions.json
@@ -590,6 +590,20 @@
         "pull-request",
         "iterate"
       ]
+    },
+    {
+      "name": "model-router",
+      "source": "./skills/model-router",
+      "description": "Recommend the most cost-efficient LLM for a given task category (research, bug fixing, planning, frontend, testing, bulk repetitive work) based on the OpenHands Index benchmark data. Use when picking a model, configuring a sub-agent, or optimizing cost vs. quality.",
+      "category": "productivity",
+      "keywords": [
+        "model",
+        "llm",
+        "cost",
+        "routing",
+        "benchmark",
+        "openhands-index"
+      ]
     }
   ]
 }
diff --git a/skills/model-router/README.md b/skills/model-router/README.md
new file mode 100644
index 0000000..798e55a
--- /dev/null
+++ b/skills/model-router/README.md
@@ -0,0 +1,57 @@
+# model-router
+
+A small OpenHands skill that recommends the most cost-efficient LLM for a given
+task category, using benchmark data from the public [OpenHands
+Index](https://index.openhands.dev).
+
+## What it does
+
+When the agent (or user) needs to pick a model, this skill provides a quick
+lookup table mapping task type to:
+
+- **Cost pick** - best score-per-dollar on the Pareto frontier
+- **Balanced pick** - good score at moderate cost
+- **Premium pick** - top raw score, ignore cost
+
+Categories covered (matching the OpenHands Index):
+
+- Research / Information gathering
+- Bug fixing / Issue resolution
+- Planning / Architecture / Greenfield builds
+- Frontend / UI
+- Testing / Test generation
+- Bulk repetitive work (cheap reasoning at scale)
+
+## When it triggers
+
+This skill activates on keyword cues like "which model", "pick a model",
+"cost-efficient model", "best model for ...", or "model router". The agent can
+also load it on demand when it needs to recommend or switch models, for example
+when configuring a sub-agent or delegating a cloud conversation.
+
+## Example output
+
+If the user asks "which model should I use for deep research across multiple
+sites?", the agent (with this skill loaded) should answer with something like:
+
+> Use **Gemini-3.1-Pro** ($0.12 per task, score 76.4 on GAIA via OpenHands
+> Index). If you need the absolute best quality and don't mind paying ~6x more,
+> switch to GPT-5.5 ($0.74, 86.1). Source: https://index.openhands.dev/information-gathering
+
+## Why this exists
+
+LLM cost per task varies by 10x or more across the Pareto frontier. Defaulting
+to a single premium model for every step of an agent run leaves a lot of money
+on the table when cheaper models reach 90%+ of the quality on most subtasks.
+This skill encodes a few sensible defaults so the agent can route work to the
+right model without having to re-derive the tradeoff each time.
+
+## Caveats
+
+- Numbers are a May 2026 snapshot from index.openhands.dev. Re-check the index
+  before committing to a model for a long-running production job.
+- The benchmark measures full agent runs, not raw completions. For one-shot
+  classifications or short summaries, almost any capable cheap model is fine.
+- Whether your runtime can actually switch models mid-conversation depends on
+  how OpenHands is deployed (sub-agents, delegated cloud conversations, or just
+  the user choosing a model when starting a session).
diff --git a/skills/model-router/SKILL.md b/skills/model-router/SKILL.md
new file mode 100644
index 0000000..c3d2ac3
--- /dev/null
+++ b/skills/model-router/SKILL.md
@@ -0,0 +1,98 @@
+---
+name: model-router
+description: Recommend the most cost-efficient LLM for a given task category (research, bug fixing, planning, frontend, testing, bulk repetitive work) based on the OpenHands Index benchmark data. Use when the user asks "which model should I use", wants to pick a model, configure a sub-agent, route work to another LLM, or optimize for cost vs. quality.
+triggers:
+- which model
+- model selection
+- pick a model
+- model router
+- cost efficient model
+- cheapest model
+- best model for
+---
+
+# Model Router
+
+Pick the right LLM for the task instead of paying premium prices for every step.
+Recommendations below come from the public [OpenHands Index](https://index.openhands.dev)
+benchmark (May 2026 snapshot). For each category we list a **cost pick** (best
+score-per-dollar on the Pareto frontier), a **balanced pick**, and a **premium
+pick** (top raw score).
+
+If your runtime supports model switching (sub-agents, delegated cloud
+conversations, or just user choice), default to the cost pick and only escalate
+to balanced/premium when the task warrants it.
+
+## Quick decision table
+
+| Task type | Cost pick (cheap & good) | Balanced | Premium (top score) |
+| --- | --- | --- | --- |
+| Research / Information gathering | **Gemini-3.1-Pro** ($0.12, 76.4) | claude-opus-4-6 ($0.44, 80.0) | GPT-5.5 ($0.74, 86.1) |
+| Bug fixing / Issue resolution | **Minimax-2.7** ($0.18, 75.6) | claude-opus-4-6 ($0.77, 76.8) | GPT-5.5 ($1.52, 78.2) |
+| Planning / Architecture / New apps (greenfield) | **GPT-5.4** ($4.04, 56.2) | claude-opus-4-7 ($5.69, 56.2) | claude-opus-4-7 ($5.69, 56.2) |
+| Frontend / UI | **Gemini-3.1-Pro** ($1.24, 44.1) | claude-opus-4-6 ($2.37, 41.8) | claude-opus-4-7 ($2.83, 48.5) |
+| Testing / Test generation | **Minimax-2.7** ($0.13, 69.1) | claude-opus-4-6 ($0.43, 78.8) | GPT-5.5 ($0.92, 83.4) |
+| Bulk repetitive lifting (cheap reasoning at scale) | **DeepSeek-V3.2-Reasoner** ($0.57) | Minimax-2.7 ($0.13-0.18) | n/a - escalate if quality matters |
+
+Numbers are **average cost per problem (USD)** and **score** from the index.
+Lower cost and higher score are both better.
+
+## How to use this skill
+
+1. **Identify the task category** from the user request. Map it to a row above:
+   - "find/research/look up/answer with sources" -> Research
+   - "fix bug / failing test / SWE-bench style issue / debug" -> Bug fixing
+   - "design / plan / build new app or feature from scratch" -> Planning
+   - "build UI / page / component / styling" -> Frontend
+   - "write tests / increase coverage / add unit tests" -> Testing
+   - "rename across N files / run the same edit many times / mechanical refactor" -> Bulk
+
+2. **Default to the cost pick.** Tell the user (or pick programmatically) the
+   cost pick model, citing roughly how much you expect to spend per task.
+
+3. **Escalate when justified.** Switch to the balanced or premium pick if any
+   of these are true:
+   - The task is high-stakes (production bug, security fix, breaking change).
+   - The cost pick has clearly failed once on the same task.
+   - The user explicitly asked for the best possible quality.
+   - The task spans multiple categories (e.g., "design + build + test a new
+     service") - prefer a strong all-rounder like **claude-opus-4-7** or
+     **GPT-5.5**.
+
+4. **Mixed pipelines** are encouraged. Example flow for a feature request:
+   - Research existing code & docs -> Gemini-3.1-Pro
+   - Plan the change -> claude-opus-4-7
+   - Implement the fix -> claude-opus-4-6 or Minimax-2.7
+   - Generate tests -> claude-opus-4-6
+   - Repetitive lint/format/codemod cleanup -> DeepSeek-V3.2-Reasoner
+
+5. **Show your work.** When you recommend a model to the user, include:
+   - The chosen model and a one-line justification.
+   - Expected cost per task (from the table) and any quality tradeoff.
+   - A link to https://index.openhands.dev so they can verify or pick differently.
+
+## Heuristics & caveats
+
+- **Pareto frontier matters more than headline score.** A model that is 95% as
+  accurate at 20% of the cost is almost always the right starting point.
+- **Numbers age fast.** Treat the table as a current-as-of-May-2026 snapshot.
+  When in doubt, re-check the relevant category page on
+  https://index.openhands.dev before committing to a model for a long-running
+  job.
+- **One-shot vs. multi-turn.** The index measures agent runs, not raw
+  completions. If you only need a single classification or a short summary, the
+  cost gap shrinks and any capable cheap model is fine.
+- **Open vs. closed models.** Minimax-2.7, DeepSeek-V3.2-Reasoner, GLM-5/5.1,
+  Qwen3.6, and Kimi are open-weights options if self-hosting or data residency
+  matters. Gemini, GPT-5.x, and claude-opus are closed.
+- **Greenfield is expensive everywhere.** Even the cost pick is ~$4 per problem
+  because these are long, multi-step builds. Budget accordingly or scope down.
+
+## Category pages on the OpenHands Index
+
+- Issue Resolution: https://index.openhands.dev/issue-resolution
+- Greenfield: https://index.openhands.dev/greenfield
+- Frontend: https://index.openhands.dev/frontend
+- Testing: https://index.openhands.dev/testing
+- Information Gathering: https://index.openhands.dev/information-gathering
+- Overall leaderboard: https://index.openhands.dev/
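
Note on the routing logic: the "Quick decision table" and the "default to the cost pick, escalate when justified" steps in SKILL.md could be sketched in code roughly as follows. This is only an illustration, not part of the patch. `Pick`, `TABLE`, and `recommend` are hypothetical names, and the mapping of escalation triggers to tiers (explicit quality request -> premium, high stakes or a prior failure -> balanced) is an assumption, since the skill only says to switch to "the balanced or premium pick". The figures are copied from the May 2026 table; the bulk row is omitted because it lists no scores and no premium pick.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pick:
    model: str
    cost_usd: float  # average cost per problem (USD), from the table
    score: float     # benchmark score, from the table


# Rows copied from the "Quick decision table" (May 2026 snapshot).
# Category keys are hypothetical short names for the table rows.
TABLE = {
    "research": {
        "cost": Pick("Gemini-3.1-Pro", 0.12, 76.4),
        "balanced": Pick("claude-opus-4-6", 0.44, 80.0),
        "premium": Pick("GPT-5.5", 0.74, 86.1),
    },
    "bug_fixing": {
        "cost": Pick("Minimax-2.7", 0.18, 75.6),
        "balanced": Pick("claude-opus-4-6", 0.77, 76.8),
        "premium": Pick("GPT-5.5", 1.52, 78.2),
    },
    "planning": {
        "cost": Pick("GPT-5.4", 4.04, 56.2),
        "balanced": Pick("claude-opus-4-7", 5.69, 56.2),
        "premium": Pick("claude-opus-4-7", 5.69, 56.2),
    },
    "frontend": {
        "cost": Pick("Gemini-3.1-Pro", 1.24, 44.1),
        "balanced": Pick("claude-opus-4-6", 2.37, 41.8),
        "premium": Pick("claude-opus-4-7", 2.83, 48.5),
    },
    "testing": {
        "cost": Pick("Minimax-2.7", 0.13, 69.1),
        "balanced": Pick("claude-opus-4-6", 0.43, 78.8),
        "premium": Pick("GPT-5.5", 0.92, 83.4),
    },
}


def recommend(category, high_stakes=False, cost_pick_failed=False,
              wants_best_quality=False):
    """Return (tier, Pick) for a category, defaulting to the cost pick.

    Assumed escalation policy (not specified by the skill):
    an explicit quality request goes to premium; a high-stakes task
    or a failed cost-pick attempt goes to balanced.
    """
    row = TABLE[category]
    if wants_best_quality:
        tier = "premium"
    elif high_stakes or cost_pick_failed:
        tier = "balanced"
    else:
        tier = "cost"
    return tier, row[tier]


tier, pick = recommend("research")
print(f"{tier}: {pick.model} (${pick.cost_usd:.2f}, score {pick.score})")
```

Keeping the table as plain data separate from the policy function means a refreshed snapshot from the index only touches `TABLE`, while the escalation rule can be tuned independently.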