From 99eb1f1ceea6ce585420d33d3095863658a25d82 Mon Sep 17 00:00:00 2001 From: openhands Date: Wed, 25 Mar 2026 00:57:55 -0400 Subject: [PATCH 01/10] Create human-readable and agent-readable tutorials separately --- content/docs/datasets/adapters-human.mdx | 332 +++++++++++++++++++++++ content/docs/datasets/adapters.mdx | 51 ++-- content/docs/datasets/meta.json | 1 + 3 files changed, 359 insertions(+), 25 deletions(-) create mode 100644 content/docs/datasets/adapters-human.mdx diff --git a/content/docs/datasets/adapters-human.mdx b/content/docs/datasets/adapters-human.mdx new file mode 100644 index 0000000..ae55874 --- /dev/null +++ b/content/docs/datasets/adapters-human.mdx @@ -0,0 +1,332 @@ +--- +title: Adapters (Human Guide) +description: A concise guide for human readers to create a Harbor adapter for your benchmark. +--- + +import { Callout } from 'fumadocs-ui/components/callout'; +import { File, Folder, Files } from 'fumadocs-ui/components/files'; + + +AI agents should follow the spec at [Adapter AI Guideline](/docs/datasets/adapters) +instead of this page. That document contains the complete schema, +all edge cases, and machine-verifiable examples. +Do not use the tutorial below as your source of truth. + + +Harbor supports running various benchmarks and datasets via a simple, unified interface. SWE-Bench, LiveCodeBench, and more benchmarks are integrated into Harbor, and our team is actively working to adapt additional benchmarks to the framework. +To add a new benchmark or dataset, you need to create an [adapter](https://github.com/laude-institute/harbor/tree/main/adapters) that translates the original benchmark's tasks into the Harbor format. + +We welcome the open source community to contribute adapters for new benchmarks and datasets. If you have a benchmark or a dataset of tasks that you want to adapt (e.g., using Harbor's evaluation harness), please follow the steps below to develop your adapter and get it merged. 
+ + +If you are thinking about adapting your benchmark or contributing one from our [Adapter List](https://docs.google.com/spreadsheets/d/1mJbiASPm32DDNzEnV6eDGwpEf3FlMUe5dhkZmFjjSoo/edit?gid=0#gid=0), please join our [Discord](https://discord.com/invite/6xWPKhGDbA) and reach out to [Lin Shi](mailto:ls2282@cornell.edu) from the `#adapters-announcements` channel. + + + +Join our [Discord](https://discord.com/invite/6xWPKhGDbA) and reach out in `#adapters-announcements`. Check the [Adapter List](https://docs.google.com/spreadsheets/d/1mJbiASPm32DDNzEnV6eDGwpEf3FlMUe5dhkZmFjjSoo/edit?gid=0#gid=0) for available benchmarks. We cover API costs for parity experiments. + + +## Quick Start + +```bash +# List available datasets +harbor dataset list + +# Scaffold a new adapter interactively +harbor adapters init + +# Or with arguments +harbor adapters init my-adapter --name "My Benchmark" +``` + +## Steps at a Glance + +| # | Step | Goal | +|---|------|------| +| 1 | [Understand the benchmark](#1-understand-the-original-benchmark) | Identify instructions, environments, tests, and solutions | +| 2 | [Write the adapter code](#2-write-the-adapter-code) | Generate Harbor-format task directories | +| 3 | [Verify oracle solutions](#3-verify-oracle-solutions) | All oracle solutions pass at 100% reward | +| 4 | [Plan parity & implement agents](#4-plan-parity--implement-agents) | Coordinate with the team; set up agents on both sides | +| 5 | [Run parity experiments](#5-run-parity-experiments) | Compare Harbor vs. original benchmark scores | +| 6 | [Record parity results](#6-record-parity-results) | Save results to `parity_experiment.json` | +| 7 | [Upload results](#7-upload-results) | Push to HuggingFace parity dataset | +| 8 | [Register the dataset](#8-register-the-dataset) | Prepare dataset with `harbor init` and `dataset.toml`, submit for publishing | +| 9 | [Document & submit](#9-document--submit) | Write README, submit PR for review | + +--- + +## 1. 
Understand the Original Benchmark

Before coding, study the original benchmark and identify the four key components Harbor needs from every task:

1. **Task instructions:** What the agent is asked to do, in natural language (becomes `instruction.md`).
2. **Environments:** The container setup each task runs in, including dependencies, files, and services (becomes `environment/`).
3. **Tests:** How a candidate solution is verified and scored (becomes `tests/`).
4. **Oracle solutions:** Reference solutions that should pass the tests at 100% reward (becomes `solution/`).

## 2. Write the Adapter Code

### 2.0 Read the README template first

The [adapter README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md) doubles as a requirements checklist. Read it before writing code — it tells you what you'll need to provide.

### 2.1 Fork and branch

```bash
git clone https://github.com/{you}/harbor.git
cd harbor
git checkout -b {adapter-name}-adapter
```

### 2.2 Target task directory structure

Each generated task should look like this:

<Files>
  <File name="task.toml" />
  <File name="instruction.md" />
  <Folder name="environment" defaultOpen>
    <File name="Dockerfile" />
  </Folder>
  <Folder name="solution" defaultOpen>
    <File name="solve.sh" />
  </Folder>
  <Folder name="tests" defaultOpen>
    <File name="test.sh" />
  </Folder>
</Files>

See the [hello-world example](https://github.com/laude-institute/harbor/tree/main/examples/tasks/hello-world) for a concrete reference.

### 2.3 Adapter code structure

Your adapter lives in `harbor/adapters/{adapter-name}/`:

| File | Purpose |
|------|---------|
| `adapter.py` | Core logic: parse benchmark data, generate task dirs |
| `run_adapter.py` | CLI entry point (supports `--output-path`) |
| `template/` | Template files copied into each task |
| `parity_experiment.json` | Parity results (filled in later) |
| `run_{name}.yaml` | Reference config for reproducibility |
| `README.md` | Final documentation (written last) |
| `adapter_metadata.json` | Structured metadata about the adapter |

**Requirements for `run_adapter.py`:**
- Support cloning the source benchmark temporarily (with cleanup)
- Support using an already-cloned repo
- Default output to `datasets/{adapter-name}`, with `--output-path` override

**Tips:**
- Minor prompt tweaks (e.g., "write files in place without asking") are fine, as long as they apply to both the original benchmark and Harbor sides.
- Adapting only a subset of tasks is acceptable if documented in the README.

## 3.
Verify Oracle Solutions

Run your adapter with the oracle agent and confirm **100% reward on all tasks**.

```bash
# Single task
harbor trials start -p datasets/<adapter_name>/<task_name>

# Entire dataset
harbor run -p datasets/<adapter_name>

# With a config file (recommended for reproducibility)
harbor run -c adapters/<adapter_name>/<config_name>.yaml -a <agent> -m <model>
```

Once oracle passes, create a **WIP PR** titled `[WIP] Adapter: {adapter_name}` with a screenshot of the 100% pass results.

## 4. Plan Parity & Implement Agents

Reach out to the team (e.g., **Lin Shi**) on [Discord](https://discord.com/invite/6xWPKhGDbA) **before** running parity experiments. (If the benchmark is relatively straightforward, this conversation can happen right after you sign up for an adapter, even before Step 1.) They will help decide:
- Which agents and models to use
- How many runs are needed
- API key provisioning

Depending on your benchmark, you'll fall into one of three scenarios:

**Scenario A — Compatible agents exist:** The original benchmark already supports Harbor-compatible agents (OpenHands, Codex, Claude-Code, etc.). No extra work needed.

**Scenario B — LLM-based, no compatible agents:** Fork the original benchmark, implement a Harbor-compatible agent there, and document it. See the [EvoEval example](https://github.com/laude-institute/harbor/blob/main/adapters/evoeval/parity_experiment.json).

**Scenario C — Custom agents:** Implement the custom agent in Harbor under `adapters/{name}/{agent}.py` and run parity experiments with it. Also run experiments with standard agents (Codex, Claude-Code) to show the adapter generalizes.

<Callout>
For expensive benchmarks, you can run parity on a representative subset. Discuss the sampling strategy with the team first. Use `--split parity` in your adapter and register with `"version": "parity"` so users can run `-d {name}@parity`.
</Callout>

## 5. Run Parity Experiments

Run the **same agents, models, and settings** on both the original benchmark and your Harbor adapter, multiple times each. Compare average scores and standard deviations — they should be **comparable** to demonstrate equivalence.
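As a rough illustration of the comparison step, the check boils down to descriptive statistics over per-run scores. The numbers below are invented for the example, not real parity data:

```python
from statistics import mean, stdev

# Hypothetical pass@1 scores from five runs per side (illustrative only).
original_runs = [0.62, 0.64, 0.61, 0.63, 0.65]
harbor_runs = [0.63, 0.61, 0.64, 0.62, 0.64]

# Report each side as "average ± sample standard deviation".
for label, runs in [("original", original_runs), ("harbor", harbor_runs)]:
    print(f"{label}: {mean(runs):.3f} ± {stdev(runs):.3f}")

# The two means should fall within roughly one standard deviation of each
# other; a large gap suggests the adapter diverges from the original harness.
```

There is no hard threshold in the docs; the team reviews the averages and spreads together during the parity discussion.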
```bash
# Harbor side
harbor run -p datasets/<adapter_name> -a <agent> -m <model>
```

## 6. Record Parity Results

Create `parity_experiment.json` in your adapter directory:

```json
[
  {
    "adapter_name": "<adapter_name>",
    "agent": "<agent_name>@<version>",
    "model": "<model_name>",
    "date": "<YYYY-MM-DD>",
    "adapted_benchmark_size": "<full_set_size>",
    "parity_benchmark_size": "<parity_set_size>",
    "number_of_runs": "<number_of_runs>",
    "notes": "<special_treatments_if_any>",
    "original_parity_repo": "<link_to_original_side_repo>",
    "adapter_pr": ["<adapter_pr_link>"],
    "dataset_pr": ["<dataset_pr_link>"],
    "parity_pr": ["<parity_pr_link>"],
    "metrics": [
      {
        "benchmark_name": "<benchmark_name>",
        "metric": "<metric_name>",
        "original": "<original_average_score>",
        "harbor": "<harbor_average_score>",
        "original_runs": ["<run_1_score>", "<run_2_score>", "..."],
        "harbor_runs": ["<run_1_score>", "<run_2_score>", "..."]
      }
    ]
  }
]
```

Also include a summary table in your README:

```markdown
| Agent | Model | Metric | Runs | Dataset Size | Original | Harbor |
|-------|-------|--------|------|--------------|----------|--------|
| codex | gpt-5 | pass@1 | 5 | 2000 (100%) | X ± Y | X ± Y |
```

## 7. Upload Results

Upload parity and oracle results to the [HuggingFace parity-experiments dataset](https://huggingface.co/datasets/harborframework/parity-experiments):

```
adapters/<adapter_name>/
├── README.md
├── config.yaml
├── original_parity/
├── harbor_parity/
├── oracle/
└── results_collection/
    ├── result_{original/harbor}_trial1.json
    └── ...
```

## 8. Register the Dataset

### 8.1 Generate dataset

```bash
git clone https://github.com/{you}/harbor-datasets.git
cd harbor/adapters/<adapter_name>
uv run run_adapter.py --output-dir /path/to/harbor-datasets/datasets/<adapter_name>
```

Generate `dataset.toml`:

```bash
cd harbor-datasets/datasets/<adapter_name>
harbor init
# Select "dataset" when prompted
```

Edit the generated `dataset.toml` to fill in metadata: parity results summary, adapter author credits, and any acknowledgments.

**Version naming:** Use `"1.0"` by default. Follow the original benchmark's naming if it has versions (e.g., "verified", "lite"). Use `"parity"` for parity subsets so users can run `-d <dataset_name>@parity`.

Create a PR to `harbor-datasets`. Request `@Slimshilin` for review.
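For orientation, a filled-in `dataset.toml` might look roughly like the sketch below. Treat every field name here as an assumption; `harbor init` generates the authoritative skeleton, and its prompts may differ:

```toml
# Illustrative sketch only; run `harbor init` for the real skeleton.
name = "my-benchmark"
version = "1.0"  # or "parity" for a parity subset
description = "Harbor adaptation of My Benchmark"

# Free-form metadata you fill in by hand after generation.
[metadata]
adapter_authors = ["Your Name"]
parity_summary = "codex + gpt-5, pass@1 over 5 runs: original X ± Y vs. Harbor X ± Y"
acknowledgments = "Thanks to the original benchmark authors."
```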
### 8.2 Test locally

Before submitting for publishing, verify with the `-p` path parameter:

```bash
harbor run -p /path/to/your/dataset
```

<Callout>
You cannot test against the registry (using `-d`) until the dataset has been published. Use `-p` (local path) for all pre-publish testing.
</Callout>

### 8.3 Submit for publishing

Include your tasks directory and `dataset.toml` in your adapter PR. Once approved, the Harbor team will publish the dataset to the registry.

### 8.4 Verify post-publish

Once published, verify it loads and runs correctly:

```bash
harbor run -d <dataset_name>
```

## 9. Document & Submit

Fill out the [README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md) covering:
- Benchmark bugs discovered and how they were handled
- Special treatments (prompt tweaks, environment adjustments)
- Deviations from the original and why
- Agent implementation details
- Known limitations

Create `adapter_metadata.json` ([see format in full docs](/docs/datasets/adapters#9-document-and-submit)).

When ready, update your PR title from `[WIP]` to `[Ready for Review]` and request review from `@Slimshilin`.
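Before flipping the PR to ready, a quick self-check over `parity_experiment.json` can catch copy-paste slips, such as recorded averages that don't match the per-run scores. This helper is hypothetical, not part of Harbor's tooling:

```python
import json
from statistics import mean


def check_parity_file(path: str) -> None:
    """Assert that recorded averages and run counts match the per-run scores."""
    with open(path) as f:
        entries = json.load(f)
    for entry in entries:
        n = int(entry["number_of_runs"])
        for m in entry["metrics"]:
            for side in ("original", "harbor"):
                runs = [float(r) for r in m[f"{side}_runs"]]
                # Run count should agree with number_of_runs.
                assert len(runs) == n, f"{side}: expected {n} runs, got {len(runs)}"
                # Recorded average should agree with the mean of the runs.
                recorded = float(m[side])
                assert abs(mean(runs) - recorded) < 1e-3, (
                    f"{side} average {recorded} != mean of runs {mean(runs):.4f}"
                )
```

Running `check_parity_file("adapters/<adapter_name>/parity_experiment.json")` should complete silently when the file is internally consistent.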
+ +--- + +## Appendix: Terminal-Bench Migration + +If you're converting a Terminal-Bench adapter, here are the key differences: + +| Aspect | Terminal-Bench | Harbor | +|--------|---------------|--------| +| Config | `task.yaml` | `task.toml` | +| Instruction | In `task.yaml` | Separate `instruction.md` | +| Dockerfile | Root level | `environment/Dockerfile` | +| Solution | `solution.sh` | `solution/solve.sh` | +| Tests | `run-tests.sh` + `tests/` | `tests/test.sh` | +| Verification | Exit code (pytest) | Reward file: `/logs/verifier/reward.txt` | +| Output dir | `tasks/` | `datasets/` | +| Registry | Dataset-level `dataset_path` | `dataset.toml` + `harbor init` publishing workflow | +| CLI | `tb run --dataset` | `harbor jobs start -d` / `harbor runs start -p` | +| Metrics | Binary pass/fail | Float rewards, multiple metrics | + +**Important:** If Terminal-Bench used a tweaked metric, re-implement to support the **original** benchmark metrics — Harbor supports multiple metrics as rewards. + +Migration checklist: +1. Convert `task.yaml` → `task.toml` + `instruction.md` +2. Reorganize files into `environment/`, `solution/`, `tests/` subdirs +3. Update test scripts to write rewards to `/logs/verifier/reward.txt` +4. Change output directory from `tasks/` to `datasets/` +5. Update registry format using `harbor init` and `dataset.toml` + +--- + +## Resources + +- [Harbor docs](/docs/getting-started) — Running tasks and jobs +- [Harbor repo](https://github.com/laude-institute/harbor) — Examples and configs +- [Agent tutorial](/docs/agents) — Creating custom agents +- [Discord](https://discord.com/invite/6xWPKhGDbA) — Ask questions in `#adapters-spam` diff --git a/content/docs/datasets/adapters.mdx b/content/docs/datasets/adapters.mdx index c428ab0..ff9251a 100644 --- a/content/docs/datasets/adapters.mdx +++ b/content/docs/datasets/adapters.mdx @@ -1,9 +1,10 @@ --- title: Adapters -description: How to create a new adapter for a new benchmark using Harbor. 
+description: Comprehensive adapter reference with full specs for creating a Harbor adapter for your benchmark.
 ---
 
 import { Accordion, Accordions } from 'fumadocs-ui/components/accordion';
+import { Callout } from 'fumadocs-ui/components/callout';
 import { File, Folder, Files } from 'fumadocs-ui/components/files';
 
 Harbor supports running various benchmarks and datasets via a simple, unified interface. SWE-Bench, LiveCodeBench, and more benchmarks are integrated into Harbor, and our team is actively working to adapt additional benchmarks to the framework.
@@ -23,16 +24,16 @@ See [this section](#translating-terminal-bench-adapters-to-harbor) to learn abou
 
 ```bash
 # List available datasets
-harbor dataset list
+harbor datasets list
 
 # Start the interactive wizard to create a new adapter
-harbor adapter init
+harbor adapters init
 
 # Initialize with specific arguments (skipping some prompts)
-harbor adapter init my-adapter --name "My Benchmark"
+harbor adapters init my-adapter --name "My Benchmark"
 ```
 
-Use the above commands to view our supported datasets and start creating your new ones. The `harbor adapter init` command will create starter code and template files.
+Use the above commands to view our supported datasets and start creating your new ones. The `harbor adapters init` command will create starter code and template files.
 
 For more details about what adapters are and how we ensure equivalence between the original benchmark and its Harbor adapter, please continue reading.
@@ -193,32 +194,32 @@ There are several ways to run Harbor harness on your adapter:

**Option 1: Using individual trials (for testing single tasks)**
```bash
# Run oracle agent on a single task
-harbor trial start -p datasets/<adapter_name>/<task_name>
+uv run harbor trials start -p datasets/<adapter_name>/<task_name>

# Run with specific agent and model
-harbor trial start -p datasets/<adapter_name>/<task_name> -a <agent> -m <model>
+uv run harbor trials start -p datasets/<adapter_name>/<task_name> -a <agent> -m <model>
```

**Option 2: Using jobs with local dataset path**
```bash
# Run on entire local dataset
-harbor run -p datasets/<adapter_name> -a <agent> -m <model>
+uv run harbor jobs start -p datasets/<adapter_name> -a <agent> -m <model>
```

**Option 3: Using jobs with configuration file**. Refer to [harbor/examples/configs](https://github.com/laude-institute/harbor/tree/main/examples/configs) for configuration examples. It's highly recommended to write a reference config file for your adapter to ensure reproducibility.
```bash
# Create a job config YAML (see harbor/examples/configs/ for examples)
-harbor run -c adapters/<adapter_name>/<config_name>.yaml -a <agent> -m <model>
+uv run harbor jobs start -c adapters/<adapter_name>/<config_name>.yaml -a <agent> -m <model>
```

-**Option 4: Using registry dataset (after [publishing](#8-publish-to-the-registry))**. Registry testing is only available after the dataset has been published, which ensures the correct data structure.
+**Option 4: Using registry dataset (after [publishing](#8-register-the-dataset))**. Registry testing is only available after the dataset has been published, which ensures the correct data structure.
```bash
# Run from registry
# Single task
-harbor run -t terminal-bench/adaptive-rejection-sampler -a <agent> -m <model>
+harbor run -t <dataset_name>/<task_name> -a <agent> -m <model>

# Entire dataset
-harbor run -d terminal-bench/terminal-bench-2 -a <agent> -m <model>
+harbor run -d <dataset_name> -a <agent> -m <model>
```

You should include instructions for running in multiple ways in the `README.md` for your adapter, following the [Harbor adapter README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md). **Note that the order of these options is organized differently in the final adapter README**.
This is because from the user's perspective, Option 4 (registry) is the primary way to run the adapter without needing to prepare task directories; the adapter code and other running methods are mainly used for development and reproduction. @@ -228,7 +229,7 @@ You should include instructions for running in multiple ways in the `README.md` Before proceeding further, you must ensure that all oracle solutions pass with a 100% reward. Run the oracle agent on your entire dataset: ```bash -harbor run -p datasets/ +uv run harbor jobs start -p datasets/ ``` Once you've verified that all oracle solutions pass, you can create a Work-In-Progress (WIP) pull request to the Harbor repository: @@ -240,7 +241,7 @@ This WIP PR allows the team to review your adapter structure early and provide f ### 4. Discuss Parity Plans and Implement Agents -After your oracle solutions pass and you've created a WIP PR, reach out to the team (e.g., **Lin Shi**) through Discord to discuss your parity experiment plans before running them. We will help you determine which agents and models to use, how many trials are needed, and we can provide API keys for running parity experiments. Based on your benchmark's characteristics, you'll need to implement agents accordingly. There are three main scenarios: +After your oracle solutions pass and you've created a WIP PR, reach out to the team (e.g., **Lin Shi**) through Discord to discuss your parity experiment plans before running them. We will help you determine which agents and models to use, how many runs are needed, and we can provide API keys for running parity experiments. Based on your benchmark's characteristics, you'll need to implement agents accordingly. There are three main scenarios: If the original benchmark already supports agents that are also supported in Harbor (e.g., OpenHands, Codex, Claude-Code, Gemini-CLI), you can run parity experiments using identical agent and model settings on both sides. 
No additional agent implementation is needed. @@ -284,7 +285,7 @@ This approach has two important implications: uv run run_adapter.py --output-dir /path/to/output ``` -2. **Registry Version Naming:** When publishing the dataset to the [Harbor Registry](https://registry.harborframework.com), use the version name `"parity"` instead of `"1.0"` or `"2.0"` in your `dataset.toml` to avoid confusion. This allows users to run parity reproduction using `-d @parity` while keeping the full dataset available separately. +2. **Registry Version Naming:** When uploading the dataset to the registry, use the version name `"parity"` instead of `"1.0"` or `"2.0"` to avoid confusion. This allows users to run parity reproduction using `-d @parity` while keeping the full dataset available separately. @@ -294,7 +295,7 @@ This approach has two important implications: Once you've implemented the necessary agents (if needed), run parity experiments to verify your adapter. Use the Harbor harness (see [Section 3](#3-running-harbor-harness-and-verify-oracle-solutions)) with the same set of agents and models that you used (or will use) on the original benchmark side. Ensure the config and parameter settings are identical as well (e.g., codex version). Run them multiple times on each side to compare average scores and standard deviations. -The average scores across multiple trials should be **comparable to demonstrate equivalence of adaptation** (i.e., running the benchmark with Harbor is equivalent to running it with the original harness). +The average scores across multiple runs should be **comparable to demonstrate equivalence of adaptation** (i.e., running the benchmark with Harbor is equivalent to running it with the original harness). ### 6. 
Record Parity Results @@ -309,7 +310,7 @@ To formally store and track the performance parity between the original benchmar "date": , "adapted_benchmark_size": // Full set size "parity_benchmark_size": , // Same as adapted_benchmark_size if we ran parity on full set - "number_of_trials": // Unless special case, this should be identical for original and harbor runs. + "number_of_runs": // Unless special case, this should be identical for original and harbor runs. "notes": , // additional explanations on special treatments, etc. "original_parity_repo": , // For reproducing the parity experiments on the original benchmark side; usually this is a fork of the original benchmark repo whose README includes instructions + scripts for running the parity experiments "adapter_pr": [, ...], // Adapter PR link(s) in the `harbor` repo; show all PR links related to the adapter, including later fixes. @@ -321,16 +322,16 @@ To formally store and track the performance parity between the original benchmar "metric": , "original": , // Average scores obtained from the original benchmark "harbor": , // Average scores obtained from Harbor adapter - "original_trials": [, , , ...], // Individual trial scores - "harbor_trials": [, , , ...], // Individual trial scores - }, + "original_runs": [, , , ...], // Individual run scores + "harbor_runs": [, , , ...], // Individual run scores + }, { "benchmark_name": , "metric": , "original": , // Average scores obtained from the original benchmark "harbor": , // Average scores obtained from Harbor adapter - "original_trials": [, , , ...], // Individual trial scores - "harbor_trials": [, , , ...], // Individual trial scores + "original_runs": [, , , ...], // Individual run scores + "harbor_runs": [, , , ...], // Individual run scores }, // ... 
more metrics ] }, @@ -519,8 +520,8 @@ The following table summarizes the main differences between Terminal-Bench and H | **Test Verification** | Exit code based (pytest) | Reward-based: write to `/logs/verifier/reward.txt` | | **Docker Compose** | `docker-compose.yaml` in task root | Not typically used per-task | | **Default Output Directory** | `tasks/` | `datasets/` | -| **Registry Format** | Dataset-level with `dataset_path` | Task-level with `git_url` and `path` per task | -| **CLI Commands** | `tb run --dataset` / `tb run --dataset-path` | `harbor run -d` / `harbor run -t` / `harbor run -p` | +| **Registry Format** | Dataset-level with `dataset_path` | `dataset.toml` + `harbor init` publishing workflow | +| **CLI Commands** | `tb run --dataset` / `tb run --dataset-path` | `harbor jobs start -d` / `harbor runs start -p` | | **Metrics** | Resolved rate (binary pass/fail per task) | Rewards that support multiple metrics and float-type values | **IMPORTANT:** If the Terminal-Bench adapter used a tweaked metric (e.g., threshold-based scoring, ignoring certain metrics), then you'll need to re-implement the adapter for Harbor to support the **original** metrics used by the benchmark, as Harbor now supports multiple metrics as rewards. @@ -654,7 +655,7 @@ fi #### Step 5: Update Registry Format -Terminal-Bench uses a `registry.json` file with dataset-level entries. Harbor uses the [Harbor Registry](https://registry.harborframework.com) with `dataset.toml` configuration and the `harbor publish` workflow. +Terminal-Bench uses a `registry.json` file with dataset-level entries. Harbor uses the [Harbor Registry](https://registry.harborframework.com) with `dataset.toml` configuration and the `harbor init` workflow. 
**Terminal-Bench registry.json:**
```json
diff --git a/content/docs/datasets/meta.json b/content/docs/datasets/meta.json
index 153e4fe..565fc19 100644
--- a/content/docs/datasets/meta.json
+++ b/content/docs/datasets/meta.json
@@ -5,6 +5,7 @@
     "registering-datasets",
     "publishing",
     "adapters",
+    "adapters-human",
     "metrics"
   ]
 }

From 9f797bed03307facf229883e4b20b97129cc34f3 Mon Sep 17 00:00:00 2001
From: Crystal Zhou
Date: Sat, 28 Mar 2026 01:48:22 -0400
Subject: [PATCH 02/10] Add back conflicts from registry updates

---
 content/docs/datasets/adapters-human.mdx |  6 +++-
 content/docs/datasets/adapters.mdx       | 40 ++++++++++++++++--------
 2 files changed, 32 insertions(+), 14 deletions(-)

diff --git a/content/docs/datasets/adapters-human.mdx b/content/docs/datasets/adapters-human.mdx
index ae55874..0a164b6 100644
--- a/content/docs/datasets/adapters-human.mdx
+++ b/content/docs/datasets/adapters-human.mdx
@@ -148,6 +148,10 @@ harbor run -c adapters/<adapter_name>/<config_name>.yaml -a <agent> -m <model>

Once oracle passes, create a **WIP PR** titled `[WIP] Adapter: {adapter_name}` with a screenshot of the 100% pass results.

+<Callout>
+If the original benchmark ships broken oracle solutions, don't fix them on the Harbor side. Document the affected tasks, file bugs to the upstream benchmark, and exclude those tasks if they can't be reliably verified. This keeps Harbor adapters faithful to the original.
+</Callout>
+
## 4. Plan Parity & Implement Agents

Reach out to the team (e.g., **Lin Shi**) on [Discord](https://discord.com/invite/6xWPKhGDbA) **before** running parity experiments. They will help decide:
- Which agents and models to use
@@ -219,7 +223,7 @@ Also include a summary table in your README:

## 7. Upload Results

-Upload parity and oracle results to the [HuggingFace parity-experiments dataset](https://huggingface.co/datasets/harborframework/parity-experiments):
+Upload parity and oracle results to the [HuggingFace parity-experiments dataset](https://huggingface.co/datasets/harborframework/parity-experiments). The [parity upload skill](https://github.com/harbor-framework/harbor/pull/1286) can automate this workflow.
```
adapters/<adapter_name>/
├── README.md
├── config.yaml
├── original_parity/
├── harbor_parity/
├── oracle/
└── results_collection/
    ├── result_{original/harbor}_trial1.json
    └── ...

diff --git a/content/docs/datasets/adapters.mdx b/content/docs/datasets/adapters.mdx
index ff9251a..0423c78 100644
--- a/content/docs/datasets/adapters.mdx
+++ b/content/docs/datasets/adapters.mdx
@@ -24,16 +24,16 @@ See [this section](#translating-terminal-bench-adapters-to-harbor) to learn abou

```bash
# List available datasets
-harbor datasets list
+harbor dataset list

# Start the interactive wizard to create a new adapter
-harbor adapters init
+harbor adapter init

# Initialize with specific arguments (skipping some prompts)
-harbor adapters init my-adapter --name "My Benchmark"
+harbor adapter init my-adapter --name "My Benchmark"
```

-Use the above commands to view our supported datasets and start creating your new ones. The `harbor adapters init` command will create starter code and template files.
+Use the above commands to view our supported datasets and start creating your new ones. The `harbor adapter init` command will create starter code and template files.

For more details about what adapters are and how we ensure equivalence between the original benchmark and its Harbor adapter, please continue reading.
@@ -194,22 +194,22 @@ There are several ways to run Harbor harness on your adapter:

**Option 1: Using individual trials (for testing single tasks)**
```bash
# Run oracle agent on a single task
-uv run harbor trials start -p datasets/<adapter_name>/<task_name>
+harbor trial start -p datasets/<adapter_name>/<task_name>

# Run with specific agent and model
-uv run harbor trials start -p datasets/<adapter_name>/<task_name> -a <agent> -m <model>
+harbor trial start -p datasets/<adapter_name>/<task_name> -a <agent> -m <model>
```

**Option 2: Using jobs with local dataset path**
```bash
# Run on entire local dataset
-uv run harbor jobs start -p datasets/<adapter_name> -a <agent> -m <model>
+harbor run -p datasets/<adapter_name> -a <agent> -m <model>
```

**Option 3: Using jobs with configuration file**. Refer to [harbor/examples/configs](https://github.com/laude-institute/harbor/tree/main/examples/configs) for configuration examples. It's highly recommended to write a reference config file for your adapter to ensure reproducibility.
```bash # Create a job config YAML (see harbor/examples/configs/ for examples) -uv run harbor jobs start -c adapters//.yaml -a -m +harbor run -c adapters//.yaml -a -m ``` **Option 4: Using registry dataset (after [publishing](#8-register-the-dataset))**. Registry testing is only available after the dataset has been published, which ensures the correct data structure. @@ -229,7 +229,7 @@ You should include instructions for running in multiple ways in the `README.md` Before proceeding further, you must ensure that all oracle solutions pass with a 100% reward. Run the oracle agent on your entire dataset: ```bash -uv run harbor jobs start -p datasets/ +harbor run -p datasets/ ``` Once you've verified that all oracle solutions pass, you can create a Work-In-Progress (WIP) pull request to the Harbor repository: @@ -239,6 +239,16 @@ Once you've verified that all oracle solutions pass, you can create a Work-In-Pr This WIP PR allows the team to review your adapter structure early and provide feedback before you proceed with parity experiments. + +If the original benchmark has tasks with broken or flawed oracle solutions, **do not attempt to fix them on the Harbor side**. Instead: + +1. **Document** which tasks have oracle issues in your adapter's README. +2. **File bugs** to the upstream benchmark repository so the original maintainers can address them. +3. **Exclude** those tasks from your adapter if they cannot be reliably verified, and note the exclusion in your README. + +This ensures Harbor adapters faithfully reflect the original benchmark rather than silently diverging. + + ### 4. Discuss Parity Plans and Implement Agents After your oracle solutions pass and you've created a WIP PR, reach out to the team (e.g., **Lin Shi**) through Discord to discuss your parity experiment plans before running them. We will help you determine which agents and models to use, how many runs are needed, and we can provide API keys for running parity experiments. 
Based on your benchmark's characteristics, you'll need to implement agents accordingly. There are three main scenarios: @@ -285,7 +295,7 @@ This approach has two important implications: uv run run_adapter.py --output-dir /path/to/output ``` -2. **Registry Version Naming:** When uploading the dataset to the registry, use the version name `"parity"` instead of `"1.0"` or `"2.0"` to avoid confusion. This allows users to run parity reproduction using `-d @parity` while keeping the full dataset available separately. +2. **Registry Version Naming:** When publishing the dataset to the [Harbor Registry](https://registry.harborframework.com), use the version name `"parity"` instead of `"1.0"` or `"2.0"` in your `dataset.toml` to avoid confusion. This allows users to run parity reproduction using `-d @parity` while keeping the full dataset available separately. @@ -358,6 +368,10 @@ Then include the following links: After recording your parity results, you need to upload both the parity experiment results and oracle results to the [Harbor Parity Experiments HuggingFace dataset](https://huggingface.co/datasets/harborframework/parity-experiments). This allows the community to track adapter quality and helps estimate costs for each adapter on diverse agents and models. + +Uploading to the HuggingFace parity dataset can be tricky (large repos, LFS requirements, HF-specific refs). The [parity upload skill](https://github.com/harbor-framework/harbor/pull/1286) automates this workflow — it handles sparse checkouts, LFS tracking, and pushing to the correct HF PR ref. Use it to avoid common upload pitfalls. + + Follow the README instructions in the HuggingFace dataset repository to upload your results. 
The dataset expects results to be organized in the following format: ``` @@ -468,7 +482,7 @@ Next, you need to write a `harbor/adapters/{adapter_name}/adapter_metadata.json` "split": , // if there's no split or subset name, use "full"; if the adapter code works for all splits and we ran parity collectively, we can just write "full" without needing to split them one by one; however, if different splits are registered / validated in different ways, we need to split them out. "adapted_benchmark_size": , // this may be different than the size of the original benchmark's corresponding split, because we might exclude certain tasks for sufficient reasons documented in the README. "parity_benchmark_size": , // same as adapted_benchmark_size if we ran parity on full set - "parity_sampling_rate": adapted_benchmark_size / parity_benchmark_size + "parity_sampling_rate": parity_benchmark_size / adapted_benchmark_size "registry_benchmark_size": // we will match this number with adapted_benchmark_size or parity_benchmark_size to determine whether the full set or parity set is being registered. Please use the exact match integer-value count here. "added_agents": [custom_agent1, custom_agent2], // custom agents added by the adapter to align with the original benchmark. "parity_matching_agents": [agent_1@version+model, agent_1@version+model, ...] // agents (including custom ones) used for parity experiment AND achieved comparable scores to original benchmark. 
@@ -521,7 +535,7 @@ The following table summarizes the main differences between Terminal-Bench and H | **Docker Compose** | `docker-compose.yaml` in task root | Not typically used per-task | | **Default Output Directory** | `tasks/` | `datasets/` | | **Registry Format** | Dataset-level with `dataset_path` | `dataset.toml` + `harbor init` publishing workflow | -| **CLI Commands** | `tb run --dataset` / `tb run --dataset-path` | `harbor jobs start -d` / `harbor runs start -p` | +| **CLI Commands** | `tb run --dataset` / `tb run --dataset-path` | `harbor run -d` / `harbor run -t` / `harbor run -p` | | **Metrics** | Resolved rate (binary pass/fail per task) | Rewards that support multiple metrics and float-type values | **IMPORTANT:** If the Terminal-Bench adapter used a tweaked metric (e.g., threshold-based scoring, ignoring certain metrics), then you'll need to re-implement the adapter for Harbor to support the **original** metrics used by the benchmark, as Harbor now supports multiple metrics as rewards. @@ -655,7 +669,7 @@ fi #### Step 5: Update Registry Format -Terminal-Bench uses a `registry.json` file with dataset-level entries. Harbor uses the [Harbor Registry](https://registry.harborframework.com) with `dataset.toml` configuration and the `harbor init` workflow. +Terminal-Bench uses a `registry.json` file with dataset-level entries. Harbor uses the [Harbor Registry](https://registry.harborframework.com) with `dataset.toml` configuration and the `harbor publish` workflow. 
**Terminal-Bench registry.json:** ```json From 97db1c19f4c85727b1616f15ad6164798b43d3f3 Mon Sep 17 00:00:00 2001 From: Crystal Zhou Date: Sat, 28 Mar 2026 02:19:58 -0400 Subject: [PATCH 03/10] Update formats --- content/docs/datasets/adapters-human.mdx | 40 +- content/docs/datasets/adapters.mdx | 869 +++++++++-------------- 2 files changed, 364 insertions(+), 545 deletions(-) diff --git a/content/docs/datasets/adapters-human.mdx b/content/docs/datasets/adapters-human.mdx index 0a164b6..9073fad 100644 --- a/content/docs/datasets/adapters-human.mdx +++ b/content/docs/datasets/adapters-human.mdx @@ -6,6 +6,8 @@ description: A concise guide for human readers to create a Harbor adapter for yo import { Callout } from 'fumadocs-ui/components/callout'; import { File, Folder, Files } from 'fumadocs-ui/components/files'; +To add a new benchmark or dataset to Harbor, you create an [adapter](https://github.com/laude-institute/harbor/tree/main/adapters) that translates the original benchmark's tasks into Harbor format. + AI agents should follow the spec at [Adapter AI Guideline](/docs/datasets/adapters) instead of this page. That document contains the complete schema, @@ -13,17 +15,8 @@ all edge cases, and machine-verifiable examples. Do not use the tutorial below as your source of truth. -Harbor supports running various benchmarks and datasets via a simple, unified interface. SWE-Bench, LiveCodeBench, and more benchmarks are integrated into Harbor, and our team is actively working to adapt additional benchmarks to the framework. -To add a new benchmark or dataset, you need to create an [adapter](https://github.com/laude-institute/harbor/tree/main/adapters) that translates the original benchmark's tasks into the Harbor format. - -We welcome the open source community to contribute adapters for new benchmarks and datasets. 
If you have a benchmark or a dataset of tasks that you want to adapt (e.g., using Harbor's evaluation harness), please follow the steps below to develop your adapter and get it merged. - - -If you are thinking about adapting your benchmark or contributing one from our [Adapter List](https://docs.google.com/spreadsheets/d/1mJbiASPm32DDNzEnV6eDGwpEf3FlMUe5dhkZmFjjSoo/edit?gid=0#gid=0), please join our [Discord](https://discord.com/invite/6xWPKhGDbA) and reach out to [Lin Shi](mailto:ls2282@cornell.edu) from the `#adapters-announcements` channel. - - - -Join our [Discord](https://discord.com/invite/6xWPKhGDbA) and reach out in `#adapters-announcements`. Check the [Adapter List](https://docs.google.com/spreadsheets/d/1mJbiASPm32DDNzEnV6eDGwpEf3FlMUe5dhkZmFjjSoo/edit?gid=0#gid=0) for available benchmarks. We cover API costs for parity experiments. + +Join our [Discord](https://discord.com/invite/6xWPKhGDbA) (`#adapters-announcements`) and reach out to [Lin Shi](mailto:ls2282@cornell.edu). Check the [Adapter List](https://docs.google.com/spreadsheets/d/1mJbiASPm32DDNzEnV6eDGwpEf3FlMUe5dhkZmFjjSoo/edit?gid=0#gid=0) for available benchmarks. We cover API costs for parity experiments. ## Quick Start @@ -33,10 +26,10 @@ Join our [Discord](https://discord.com/invite/6xWPKhGDbA) and reach out in `#ada harbor dataset list # Scaffold a new adapter interactively -harbor adapters init +harbor adapter init # Or with arguments -harbor adapters init my-adapter --name "My Benchmark" +harbor adapter init my-adapter --name "My Benchmark" ``` ## Steps at a Glance @@ -57,17 +50,12 @@ harbor adapters init my-adapter --name "My Benchmark" ## 1. Understand the Original Benchmark -Before coding, study the original benchmark and identify its four key components: +Before coding, study the original benchmark and identify four key components: -1. 
**[Understand the Original Benchmark](#1-understand-the-original-benchmark):** First, you'll analyze the original benchmark to identify the task's four key factors required by Harbor: task instructions, environments, tests, and solutions. -2. **[Fork Harbor Repository and Develop Adapter Code](#2-fork-harbor-repository-and-develop-adapter-code):** Fork the Harbor repository and write Python adapter code that translates the original benchmark's tasks into the Harbor format. -3. **[Running Harbor Harness and Verify Oracle Solutions](#3-running-harbor-harness-and-verify-oracle-solutions):** Run Harbor harness on your adapter and ensure all oracle solutions pass with 100% reward. Create a WIP PR with a screenshot showing oracle success. -4. **[Discuss Parity Plans and Implement Agents](#4-discuss-parity-plans-and-implement-agents):** Reach out to the team to discuss parity experiment plans, then implement the corresponding agents on the original benchmark side or in Harbor, depending on the benchmark setting. This could happen right after you sign up for an adapter and before Step 1 as well, if the benchmark is relatively straightforward. -5. **[Run Parity Experiments](#5-run-parity-experiments):** Run parity experiments to verify your adapter's performance against the original benchmark baseline results. -6. **[Record Parity Results](#6-record-parity-results):** Formally document the performance comparison in `parity_experiment.json`. -7. **[Upload Parity Results](#7-upload-parity-results):** Upload parity and oracle results to the HuggingFace dataset repository. -8. **[Register the Dataset](#8-register-the-dataset):** Prepare your dataset for the [Harbor Registry](https://registry.harborframework.com) using `harbor init` and `dataset.toml`, then coordinate with the Harbor team to publish. -9. 
**[Document and Submit](#9-document-and-submit):** Document your adapter's usage, parity results, and comprehensive adaptation details in a `README.md`, then submit your work through a pull request. +1. **Task Instructions** — How are tasks described? What do agents need? +2. **Environments** — What setup is required? (Docker, dependencies, file structures) +3. **Tests** — How are solutions evaluated? (unit tests, LLM-as-a-Judge, etc.) +4. **Solutions** — What are the oracle/reference solutions? ## 2. Write the Adapter Code @@ -137,7 +125,7 @@ Run your adapter with the oracle agent and confirm **100% reward on all tasks**. ```bash # Single task -harbor trials start -p datasets// +harbor trial start -p datasets// # Entire dataset harbor run -p datasets/ @@ -223,7 +211,7 @@ Also include a summary table in your README: ## 7. Upload Results -Upload parity and oracle results to the [HuggingFace parity-experiments dataset](https://huggingface.co/datasets/harborframework/parity-experiments). The [parity upload skill](https://github.com/harbor-framework/harbor/pull/1286) can automate this workflow. +Upload parity and oracle results to the [HuggingFace parity-experiments dataset](https://huggingface.co/datasets/harborframework/parity-experiments). The [parity upload skill](https://github.com/harbor-framework/harbor/tree/main/skills/upload-parity-experiments) can automate this workflow. 
``` adapters// @@ -314,7 +302,7 @@ If you're converting a Terminal-Bench adapter, here are the key differences: | Verification | Exit code (pytest) | Reward file: `/logs/verifier/reward.txt` | | Output dir | `tasks/` | `datasets/` | | Registry | Dataset-level `dataset_path` | `dataset.toml` + `harbor init` publishing workflow | -| CLI | `tb run --dataset` | `harbor jobs start -d` / `harbor runs start -p` | +| CLI | `tb run --dataset` | `harbor run -d` / `harbor run -t` /`harbor run -p` | | Metrics | Binary pass/fail | Float rewards, multiple metrics | **Important:** If Terminal-Bench used a tweaked metric, re-implement to support the **original** benchmark metrics — Harbor supports multiple metrics as rewards. diff --git a/content/docs/datasets/adapters.mdx b/content/docs/datasets/adapters.mdx index 0423c78..d784c2e 100644 --- a/content/docs/datasets/adapters.mdx +++ b/content/docs/datasets/adapters.mdx @@ -1,84 +1,97 @@ --- -title: Adapters -description: Comprehensive adapter reference with full specs for creating a Harbor adapter for your benchmark. +title: Adapters (Agent Guide) +description: Comprehensive adapter spec for AI agents building Harbor adapters. Contains full schemas, directory structures, commands, and validation criteria. --- -import { Accordion, Accordions } from 'fumadocs-ui/components/accordion'; import { Callout } from 'fumadocs-ui/components/callout'; -import { File, Folder, Files } from 'fumadocs-ui/components/files'; -Harbor supports running various benchmarks and datasets via a simple, unified interface. SWE-Bench, LiveCodeBench, and more benchmarks are integrated into Harbor, and our team is actively working to adapt additional benchmarks to the framework. -To add a new benchmark or dataset, you need to create an [adapter](https://github.com/laude-institute/harbor/tree/main/adapters) that translates the original benchmark's tasks into the Harbor format. + +This page is the comprehensive spec optimized for AI agents. 
For a concise walkthrough, see the [Adapters (Human Guide)](/docs/datasets/adapters-human). + -We welcome the open source community to contribute adapters for new benchmarks and datasets. If you have a benchmark or a dataset of tasks that you want to adapt (e.g., using Harbor's evaluation harness), please follow the steps below to develop your adapter and get it merged. +## Purpose - -If you are thinking about adapting your benchmark or contributing one from our [Adapter List](https://docs.google.com/spreadsheets/d/1mJbiASPm32DDNzEnV6eDGwpEf3FlMUe5dhkZmFjjSoo/edit?gid=0#gid=0), please join our [Discord](https://discord.com/invite/6xWPKhGDbA) and reach out to [Lin Shi](mailto:ls2282@cornell.edu) from the `#adapters-announcements` channel. - +An adapter translates an existing benchmark into Harbor's task format. This document is the authoritative reference for building one. Follow steps 1–9 in order. - -See [this section](#translating-terminal-bench-adapters-to-harbor) to learn about the requirements and differences between Terminal-Bench and Harbor. - +Check the [Adapter List](https://docs.google.com/spreadsheets/d/1mJbiASPm32DDNzEnV6eDGwpEf3FlMUe5dhkZmFjjSoo/edit?gid=0#gid=0) for available benchmarks. Contact [Lin Shi](mailto:ls2282@cornell.edu) or join [Discord](https://discord.com/invite/6xWPKhGDbA) `#adapters-announcements` for coordination. The team covers API costs for parity experiments. ## Quick Start ```bash -# List available datasets -harbor dataset list - -# Start the interactive wizard to create a new adapter -harbor adapter init - -# Initialize with specific arguments (skipping some prompts) -harbor adapter init my-adapter --name "My Benchmark" +harbor dataset list # list available datasets +harbor adapter init # interactive scaffold +harbor adapter init my-adapter --name "My Name" # non-interactive scaffold ``` -Use the above commands to view our supported datasets and start creating your new ones. 
The `harbor adapter init` command will create starter code and template files. +## Required Directory Structures -For more details about what adapters are and how we ensure equivalance between the original benchmark and its harbor adapter, please continue reading. +### Generated task directory (one per task) -## Overview +``` +/ +└── / + ├── task.toml # task configuration and metadata + ├── instruction.md # task instructions for the agent + ├── environment/ + │ └── Dockerfile # container environment definition + ├── solution/ + │ └── solve.sh # oracle solution script + └── tests/ + ├── test.sh # test execution script + └── test_*.py # (optional) pytest test files +``` -Adapting a benchmark to Harbor is a straightforward process designed to ensure consistency and quality. This guide will walk you through everything you need to know. However, since each benchmark is unique, the exact process and special requirements may vary slightly depending on the benchmark. Please contact our team to understand the specific requirements and considerations for your benchmark. We will support API costs for running parity experiments :-) +Reference implementation: [hello-world example](https://github.com/laude-institute/harbor/tree/main/examples/tasks/hello-world) -Here's a quick look at the typical steps: +### Adapter code directory -1. **[Understand the Original Benchmark](#1-understand-the-original-benchmark):** First, you'll analyze the original benchmark to identify the task's four key factors required by Harbor: task instructions, environments, tests, and solutions. -2. **[Fork Harbor Repository and Develop Adapter Code](#2-fork-harbor-repository-and-develop-adapter-code):** Fork the Harbor repository and write Python adapter code that translates the original benchmark's tasks into the Harbor format. -3. 
**[Running Harbor Harness and Verify Oracle Solutions](#3-running-harbor-harness-and-verify-oracle-solutions):** Run Harbor harness on your adapter and ensure all oracle solutions pass with 100% reward. Create a WIP PR with a screenshot showing oracle success. -4. **[Discuss Parity Plans and Implement Agents](#4-discuss-parity-plans-and-implement-agents):** Reach out to the team to discuss parity experiment plans, then implement the corresponding agents on the original benchmark side or in Harbor, depending on the benchmark setting. This could happen right after you sign up for an adapter and before Step 1 as well, if the benchmark is relatively straightforward. -5. **[Run Parity Experiments](#5-run-parity-experiments):** Run parity experiments to verify your adapter's performance against the original benchmark baseline results. -6. **[Record Parity Results](#6-record-parity-results):** Formally document the performance comparison in `parity_experiment.json`. -7. **[Upload Parity Results](#7-upload-parity-results):** Upload parity and oracle results to the HuggingFace dataset repository. -8. **[Register the Dataset](#8-register-the-dataset):** Prepare your dataset for the [Harbor Registry](https://registry.harborframework.com) using `harbor init` and `dataset.toml`, then coordinate with the Harbor team to publish. -9. **[Document and Submit](#9-document-and-submit):** Document your adapter's usage, parity results, and comprehensive adaptation details in a `README.md`, then submit your work through a pull request. 
+``` +harbor/adapters// +├── adapter.py # main logic: parse benchmark, generate task dirs +├── run_adapter.py # CLI entry point (must support --output-path) +├── parity_experiment.json # parity results (step 6) +├── run_.yaml # reference config for reproducibility +├── README.md # final documentation (step 9) +├── adapter_metadata.json # structured metadata (step 9) +└── template/ # template files copied into each task + ├── task.toml + ├── instruction.md + ├── environment/ + │ └── Dockerfile + ├── solution/ + │ └── solve.sh + └── tests/ + └── test.sh +``` -We'll break down each step in detail below. Let's get started! +### Key requirements for `run_adapter.py` -## The Adapter Development Workflow +- Must support temporarily cloning the source benchmark, preparing tasks, and cleaning up the clone. +- Must support generating tasks from an already-cloned repo without deleting it. +- Default output directory: `datasets/`, overridable via `--output-path`. -Creating a high-quality adapter involves several key steps. Following this workflow ensures that the adapted benchmark is a faithful and reliable implementation of the original. +--- -### 1. Understand the Original Benchmark +## Step 1. Understand the Original Benchmark -Before writing any adapter code, it's crucial to deeply understand the original benchmark. Your goal is to identify and understand the four key factors required by Harbor: +Identify these four components for every task in the benchmark: -1. **Task Instructions:** How are tasks described? What information do agents need to solve each task? -2. **Environments:** What environment setup is required? (e.g., Docker containers, system dependencies, file structures) -3. **Tests:** How are solutions evaluated? What test scripts or verification mechanisms are used? Deterministic unit tests or LLM-as-a-Judge? -4. **Solutions:** What are the oracle/reference solutions? 
If there's no oracle solution in the original benchmark, is it possible to create them using LLM? +| Component | What to find | +|-----------|-------------| +| **Instructions** | How tasks are described; what information agents receive | +| **Environments** | Docker setup, system dependencies, file structures | +| **Tests** | Evaluation method: deterministic unit tests, LLM-as-a-Judge, etc. | +| **Solutions** | Oracle/reference solutions; if none exist, whether LLM generation is feasible | -Study the original benchmark's repository, documentation, and code structure to understand these components. This understanding will guide your adapter development and ensure you capture all necessary information when converting tasks to Harbor format. +Study the benchmark's repository, documentation, and code structure. -### 2. Fork Harbor Repository and Develop Adapter Code +**Step complete when:** You can describe, for each task, the instruction text, environment setup, test/verification method, and oracle solution. -With a solid understanding of the original benchmark, you can now create the adapter itself within the [harbor](https://github.com/laude-institute/harbor) repository. +--- -#### 2.0 Read the README template -The [Harbor adapter README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md) serves as the template for the final README file that you will create for your submitted adapter. However, it is more than just a template: it includes essential instructions to help you understand the requirements that will facilitate the development and review processes. Reading it will give you a sense of what to provide and will guide your code, experiments, and documentation. +## Step 2. Fork and Develop Adapter Code -#### 2.1 Fork the Harbor repository -Fork the Harbor repository and create a new branch for your adapter (e.g., `{adapter-name}-adapter`). 
+Read the [adapter README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md) first — it doubles as a requirements checklist. ```bash git clone https://github.com/{your-github-username}/harbor.git @@ -86,499 +99,371 @@ cd harbor git checkout -b {your-adapter-name}-adapter ``` -#### 2.2 Develop the adapter code -Develop the adapter under `adapters/{adapter-name}`. You may refer to the existing adapters in the `adapters/` directory and follow the patterns. The adapter's primary job is to parse the original benchmark's data and generate task directories in the standard Harbor format. Here is an example architecture of the task directory: - - - - - - - - - - - - - - - - - - - - -[Here](https://github.com/laude-institute/harbor/tree/main/examples/tasks/hello-world) is an example task directory. Your code should prepare task directories locally following a similar format. - - -#### 2.3 Requirements and Tips for the Adapter Code -Your adapter code is used to generate task directories. A typical directory structure for your adapter code is as follows: - - - - - - - - - - - - - - - - - - - - - - - - - -More details (expand to view): - - - Harbor supports multiple metrics represented as rewards to seamlessly serve for RL. Reward can be float values. We will further support aggregation of metrics across dataset (e.g., average or custom ones). - - This allows you to use the same metrics of any type as the original benchmark and convert them to RL-compatible formats. - - - - - - It should support: - - Temporarily cloning the source benchmark, preparing the tasks, and cleaning up the temporary clone. - - Generating tasks from an existing, already-cloned benchmark repository without deleting it. - - Also, by default, your adapter should create tasks in `datasets/`, but you should also allow users to specify a custom output path via command-line arguments `--output-path`. 
- - - - - - The `template/` directory stores the template files required for the tasks. For your reference, all files [above](#22-develop-the-adapter-code) or in the [hello-world example](https://github.com/laude-institute/harbor/tree/main/examples/tasks/hello-world) are recommended to be included in the `template/` directory. Then your adapter code would use the templates to generate the actual task directories. - - - - - - A file to store the parity experiment results (i.e., comparison between the original benchmark and the Harbor adapter). More details are provided in the [Recording Parity Results](#6-record-parity-results) section. - - - - - - This is the last thing you should work on before PR submission. More details are provided in the [Document and Submit](#9-document-and-submit) section. You can follow the [Harbor adapter README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md). - - - - - - - - - It is acceptable to make prompt modifications to the task description to support CLI agents. For example, if adding prompts like "directly write the files in place without asking for my approval" would be helpful, it's fine to do so. **You just need to ensure that they apply to both the forked original benchmark repository and the Harbor adapter.** - - It is acceptable to adapt only part of the original benchmark (e.g., only SWE-Bench-Verified). Excluding certain tasks for valid reasons is also understandable (e.g., extensive GPU requirements). **You just need to ensure that the relevant information is included in the README.** - - - - - - -### 3. Running Harbor Harness and Verify Oracle Solutions - -There are several ways to run Harbor harness on your adapter: - -**Option 1: Using individual trials (for testing single tasks)** -```bash -# Run oracle agent on a single task -harbor trial start -p datasets// +Develop your adapter under `adapters/{adapter-name}/`. Refer to existing adapters in that directory. 
-# Run with specific agent and model -harbor trial start -p datasets// -a -m -``` +### Adapter component reference -**Option 2: Using jobs with local dataset path** -```bash -# Run on entire local dataset -harbor run -p datasets/ -a -m -``` +| Component | Description | +|-----------|-------------| +| `adapter.py` / `run_adapter.py` | Must support: (1) temporary clone + cleanup; (2) generating from existing clone. Default output: `datasets/`, with `--output-path` override. | +| `template/` | Template files for task generation. Include all files from the task directory structure above. | +| `parity_experiment.json` | Parity results — see [Step 6](#step-6-record-parity-results) for full schema. | +| `README.md` | Write last before PR submission. Follow the [README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md). | +| Metrics / Rewards | Harbor supports multiple float-valued metrics as rewards (RL-compatible). Use the same metrics as the original benchmark. | -**Option 3: Using jobs with configuration file**. Refer to [harbor/examples/configs](https://github.com/laude-institute/harbor/tree/main/examples/configs) for configuration examples. It's highly recommended to write a reference config file for your adapter to ensure reproducibility. -```bash -# Create a job config YAML (see harbor/examples/configs/ for examples) -harbor run -c adapters//.yaml -a -m -``` - -**Option 4: Using registry dataset (after [publishing](#8-register-the-dataset))**. Registry testing is only available after the dataset has been published, which ensures the correct data structure. -```bash -# Run from registry -# Single task -harbor run -t / -a -m - -# Entire dataset -harbor run -d -a -m -``` - -You should include instructions for running in multiple ways in the `README.md` for your adapter, following the [Harbor adapter README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md). 
**Note that the order of these options is organized differently in the final adapter README**. This is because from the user's perspective, Option 4 (registry) is the primary way to run the adapter without needing to prepare task directories; the adapter code and other running methods are mainly used for development and reproduction. +### Rules -#### 3.1 Verify Oracle Solutions Pass 100% +- Prompt modifications (e.g., "write files in place without asking") are acceptable **if applied to both the original benchmark and Harbor adapter**. +- Adapting a subset of tasks is acceptable (e.g., only SWE-Bench-Verified). **Document all exclusions in the README.** -Before proceeding further, you must ensure that all oracle solutions pass with a 100% reward. Run the oracle agent on your entire dataset: +**Step complete when:** `run_adapter.py` produces a valid task directory for each task containing `task.toml`, `instruction.md`, `environment/Dockerfile`, `solution/solve.sh`, and `tests/test.sh`. -```bash -harbor run -p datasets/ -``` +--- -Once you've verified that all oracle solutions pass, you can create a Work-In-Progress (WIP) pull request to the Harbor repository: +## Step 3. Verify Oracle Solutions -1. **Create a WIP PR:** Push your branch and create a pull request with the title `[WIP] Adapter: {adapter_name}`. -2. **Include a screenshot:** Paste a screenshot of your terminal showing the oracle solution 100% pass results. This demonstrates that your adapter correctly generates tasks and that the oracle solutions work as expected. +### Run commands -This WIP PR allows the team to review your adapter structure early and provide feedback before you proceed with parity experiments. 
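For illustration, the WIP PR step could be scripted with the GitHub CLI; the adapter name and PR body below are placeholders, and this assumes the adapter branch is already pushed to your fork:

```python
# Hypothetical helper: open the WIP pull request described above via gh.
import subprocess


def wip_pr_command(adapter_name: str) -> list[str]:
    """Build a `gh pr create` invocation following the [WIP] title convention."""
    return [
        "gh", "pr", "create",
        "--draft",
        "--title", f"[WIP] Adapter: {adapter_name}",
        "--body", "Oracle solutions pass at 100% reward (screenshot attached).",
    ]


if __name__ == "__main__":
    subprocess.run(wip_pr_command("my-adapter"), check=True)
```

Attach the oracle screenshot manually on GitHub afterward; the body text here is only a stand-in.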
+| Method | Command | When to use | +|--------|---------|-------------| +| Single task | `harbor trial start -p datasets// -a -m ` | Testing individual tasks | +| Entire dataset | `harbor run -p datasets/ -a -m ` | Full oracle verification | +| Config file | `harbor run -c adapters//.yaml -a -m ` | Reproducible runs (see [example configs](https://github.com/laude-institute/harbor/tree/main/examples/configs)) | +| Registry: single task | `harbor run -t / -a -m ` | Post-publish single task | +| Registry: full dataset | `harbor run -d -a -m ` | Post-publish full dataset (after [Step 8](#step-8-register-the-dataset)) | - -If the original benchmark has tasks with broken or flawed oracle solutions, **do not attempt to fix them on the Harbor side**. Instead: +Write a reference config YAML for your adapter to ensure reproducibility. -1. **Document** which tasks have oracle issues in your adapter's README. -2. **File bugs** to the upstream benchmark repository so the original maintainers can address them. -3. **Exclude** those tasks from your adapter if they cannot be reliably verified, and note the exclusion in your README. +**README ordering note:** In the final adapter README, list the registry method (Option 5) first — it is the primary user-facing run method. Adapter code and local-path methods are for development/reproduction. -This ensures Harbor adapters faithfully reflect the original benchmark rather than silently diverging. - +### After oracle passes -### 4. Discuss Parity Plans and Implement Agents +1. Create a WIP PR titled `[WIP] Adapter: {adapter_name}`. +2. Include a screenshot of the terminal showing 100% oracle pass results. -After your oracle solutions pass and you've created a WIP PR, reach out to the team (e.g., **Lin Shi**) through Discord to discuss your parity experiment plans before running them. We will help you determine which agents and models to use, how many runs are needed, and we can provide API keys for running parity experiments. 
Based on your benchmark's characteristics, you'll need to implement agents accordingly. There are three main scenarios: +### Broken oracles in the original benchmark - -If the original benchmark already supports agents that are also supported in Harbor (e.g., OpenHands, Codex, Claude-Code, Gemini-CLI), you can run parity experiments using identical agent and model settings on both sides. No additional agent implementation is needed. - - - -If the original benchmark is LLM-based but doesn't have Harbor-compatible agents implemented, you'll need to: +Do **not** fix broken oracles on the Harbor side. Instead: +1. Document which tasks have oracle issues in the README. +2. File bugs to the upstream benchmark repository. +3. Exclude those tasks and note the exclusion in the README. -1. **Fork the original benchmark repository** and create a branch for your adaptation work (e.g., `harbor-adapter`). -2. **Implement Harbor-compatible agents** (e.g., codex) in the forked repository to enable fair comparisons. -3. **Document the implementation** in a `README.md` file in your fork. +**Step complete when:** All oracle solutions pass with 100% reward, and a WIP PR titled `[WIP] Adapter: {adapter_name}` is created with a screenshot of the passing results. -For an example, see the [EvoEval adapter's parity experiment configuration](https://github.com/laude-institute/harbor/blob/main/adapters/evoeval/parity_experiment.json), which shows how agents were implemented in a fork of the original benchmark. - +--- - -If the original benchmark uses custom agents that aren't available in Harbor, you'll need to: +## Step 4. Discuss Parity Plans and Implement Agents -1. **Implement the custom agent in Harbor** under your adapter directory (e.g., `adapters//.py`). This is adapter-specific and doesn't need to be installed as a general Harbor agent. -2. **Run parity experiments** using this custom agent to ensure equivalence with the original benchmark. -3. 
**Additionally run experiments** with other Harbor-supported agents (e.g., Codex, Claude-Code) to demonstrate that the adaptation works well for multiple agent types. In other words, show that "using other supported agents to run the adapter makes sense". - +Contact the team (e.g., **Lin Shi** on [Discord](https://discord.com/invite/6xWPKhGDbA)) **before** running parity experiments. They determine agents, models, number of runs, and API key provisioning. -Keep a link to any forked repositories, and document your agent implementation approach in your adapter's README. +### Agent implementation scenarios - -If the original benchmark is very large and expensive to run, you may want to run parity experiments on a fixed, representative subset of samples instead of the full dataset. Please discuss with the team to confirm sampling and parity plans! +| Scenario | Condition | Action required | +|----------|-----------|-----------------| +| **A: Compatible agents exist** | Original benchmark supports Harbor-compatible agents (OpenHands, Codex, Claude-Code, Gemini-CLI) | None — run parity with identical settings on both sides | +| **B: LLM-based, no compatible agents** | Original benchmark is LLM-based but lacks Harbor agents | Fork the original repo, implement Harbor-compatible agents, document in fork's README. Example: [EvoEval parity config](https://github.com/laude-institute/harbor/blob/main/adapters/evoeval/parity_experiment.json) | +| **C: Custom agents** | Original benchmark uses custom agents unavailable in Harbor | Implement custom agent in `adapters//.py`. Also run with standard agents (Codex, Claude-Code) to show generalization | -This approach has two important implications: +Keep links to any forked repositories and document the approach in the README. -1. **README Documentation:** In your adapter's README, you must clearly: - - State how the parity subset was selected (e.g., random seed, "stratified sample across difficulty levels", etc.) 
- - Explicitly indicate that parity experiments were run on a subset - - Provide instructions for users on how to use the full dataset with the adapter code, typically using an argument like `--split parity` (or similar) to generate only the parity subset - ```bash - # Example of adapter code usage - # Generate only the parity subset - uv run run_adapter.py --split parity --output-dir /path/to/output +### Large or expensive benchmarks - # Generate the full dataset - uv run run_adapter.py --output-dir /path/to/output - ``` +If running the full benchmark is too expensive, run parity on a representative subset. Requirements: +- Document in README how the subset was selected and that parity ran on a subset. +- Support `--split parity` in `run_adapter.py` to generate only the parity subset. +- Use version `"parity"` in `dataset.toml` so users can run `-d @parity`. -2. **Registry Version Naming:** When publishing the dataset to the [Harbor Registry](https://registry.harborframework.com), use the version name `"parity"` instead of `"1.0"` or `"2.0"` in your `dataset.toml` to avoid confusion. This allows users to run parity reproduction using `-d @parity` while keeping the full dataset available separately. +```bash +uv run run_adapter.py --split parity --output-dir /path/to/output # parity subset +uv run run_adapter.py --output-dir /path/to/output # full dataset +``` +**Step complete when:** Parity plan is agreed with the team (agents, models, number of runs), and any required agent implementations are working on both the original benchmark and Harbor sides. - +--- -### 5. Run Parity Experiments +## Step 5. Run Parity Experiments +Run the **same agents, models, and config settings** on both the original benchmark and Harbor adapter, multiple times each. Compare average scores and standard deviations — they must be comparable to demonstrate equivalence. -Once you've implemented the necessary agents (if needed), run parity experiments to verify your adapter. 
Use the Harbor harness (see [Section 3](#3-running-harbor-harness-and-verify-oracle-solutions)) with the same set of agents and models that you used (or will use) on the original benchmark side. Ensure the config and parameter settings are identical as well (e.g., codex version). Run them multiple times on each side to compare average scores and standard deviations. +```bash +harbor run -p datasets/ -a -m +``` -The average scores across multiple runs should be **comparable to demonstrate equivalence of adaptation** (i.e., running the benchmark with Harbor is equivalent to running it with the original harness). +**Step complete when:** Multiple runs on both sides produce scores within each other's standard error, demonstrating equivalence. -### 6. Record Parity Results +--- -To formally store and track the performance parity between the original benchmark and your adapter, create a `parity_experiment.json` file in your adapter's directory. A typical file would look like this: +## Step 6. Record Parity Results + +Create `parity_experiment.json` in your adapter directory. The file is a JSON array; each entry is one agent+model parity experiment. + +### `parity_experiment.json` field reference + +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `adapter_name` | `string` | Yes | Adapter name (e.g., `"swe-bench"`) | +| `agent` | `string` | Yes | Agent with version (e.g., `"codex@1.0"`) | +| `model` | `string` | Yes | Full model identifier (e.g., `"gpt-5-2025-06-01"`) | +| `date` | `string` | Yes | Experiment date (e.g., `"2025-06-15"`) | +| `adapted_benchmark_size` | `integer` | Yes | Total tasks converted by adapter (full set) | +| `parity_benchmark_size` | `integer` | Yes | Tasks used for parity. Equals `adapted_benchmark_size` if full set | +| `number_of_runs` | `integer` | Yes | Runs per side. 
Should be identical for original and Harbor | +| `notes` | `string` | No | Additional explanations | +| `original_parity_repo` | `string` | Yes | Fork URL for reproducing parity on original benchmark | +| `adapter_pr` | `string[]` | Yes | All adapter PR links in `harbor` repo | +| `dataset_pr` | `string[]` | Yes | All PR links in `harbor-datasets` repo | +| `parity_pr` | `string[]` | Yes | All PR links to HuggingFace parity dataset | +| `metrics` | `object[]` | Yes | Metric comparison objects (see below) | + +### `metrics` entry fields + +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `benchmark_name` | `string` | Yes | Original benchmark name | +| `metric` | `string` | Yes | Metric name (e.g., `"pass@1"`, `"resolve_rate"`) | +| `original` | `string` | Yes | Mean ± stderr on original (e.g., `"45.2 ± 1.3"`) | +| `harbor` | `string` | Yes | Mean ± stderr on Harbor (e.g., `"44.8 ± 1.1"`) | +| `original_runs` | `number[]` | Yes | Individual scores per run on original | +| `harbor_runs` | `number[]` | Yes | Individual scores per run on Harbor | + +### Example ```json [ { - "adapter_name": , - "agent": @, - "model": , - "date": , - "adapted_benchmark_size": // Full set size - "parity_benchmark_size": , // Same as adapted_benchmark_size if we ran parity on full set - "number_of_runs": // Unless special case, this should be identical for original and harbor runs. - "notes": , // additional explanations on special treatments, etc. - "original_parity_repo": , // For reproducing the parity experiments on the original benchmark side; usually this is a fork of the original benchmark repo whose README includes instructions + scripts for running the parity experiments - "adapter_pr": [, ...], // Adapter PR link(s) in the `harbor` repo; show all PR links related to the adapter, including later fixes. - "dataset_pr": [, ...], // All PR link(s) in `harbor-datasets` repo that are registering the adapter. 
- "parity_pr": [, ...], // All PR link(s) to the HuggingFace parity experiment dataset (instructions below)) + "adapter_name": "my-benchmark", + "agent": "codex@1.0", + "model": "gpt-5-2025-06-01", + "date": "2025-06-15", + "adapted_benchmark_size": 500, + "parity_benchmark_size": 500, + "number_of_runs": 3, + "notes": "None", + "original_parity_repo": "https://github.com/user/my-benchmark-fork", + "adapter_pr": ["https://github.com/laude-institute/harbor/pull/123"], + "dataset_pr": ["https://github.com/laude-institute/harbor-datasets/pull/45"], + "parity_pr": ["https://huggingface.co/datasets/harborframework/parity-experiments/discussions/12"], "metrics": [ { - "benchmark_name": , - "metric": , - "original": , // Average scores obtained from the original benchmark - "harbor": , // Average scores obtained from Harbor adapter - "original_runs": [, , , ...], // Individual run scores - "harbor_runs": [, , , ...], // Individual run scores - }, - { - "benchmark_name": , - "metric": , - "original": , // Average scores obtained from the original benchmark - "harbor": , // Average scores obtained from Harbor adapter - "original_runs": [, , , ...], // Individual run scores - "harbor_runs": [, , , ...], // Individual run scores - }, // ... more metrics + "benchmark_name": "my-benchmark", + "metric": "pass@1", + "original": "45.2 ± 1.3", + "harbor": "44.8 ± 1.1", + "original_runs": [44.0, 45.5, 46.1], + "harbor_runs": [43.8, 45.0, 45.6] + } ] - }, - ... + } ] ``` -You should also include the parity experiment results in the `README.md` of your adapter. 
For example, you can add the following table: +### README parity table + +Include this table in the adapter README: + ```markdown -| Agent | Model | Metric | Number of Trials | Dataset Size | Original Benchmark Performance | Harbor Adapter Performance | -|-------|-------|--------|------------------|--------------|--------------------------------|----------------------------| -| claude-code | claude-4-opus | Metric | 3 | 100 tasks (5% of full set) | Score ± Std | Score ± Std | -| codex | gpt-5 | Metric | 5 | 2000 tasks (100% of full set) | Score ± Std | Score ± Std | -| ... | ... | ... | ... | ... | ... | ... | +| Agent | Model | Metric | Runs | Dataset Size | Original | Harbor | +|-------|-------|--------|------|--------------|----------|--------| +| codex | gpt-5 | pass@1 | 5 | 2000 (100%) | 45.2±1.3 | 44.8±1.1 | ``` -Then include the following links: -- The link to the original benchmark's GitHub repository -- The link to the forked repo of the original benchmark (if applicable) from [Step 4](#4-discuss-parity-plans-and-implement-agents) -- The link to the dataset PR from [Step 8](#8-register-the-dataset) -- The link to the parity experiment PR to the HuggingFace parity experiment dataset (instructions below in [Section 7](#7-upload-parity-results)) -- The link to the adapter PR -### 7. Upload Parity Results +Also include links to: original benchmark repo, forked repo (if applicable), dataset PR, HuggingFace parity PR, adapter PR. -After recording your parity results, you need to upload both the parity experiment results and oracle results to the [Harbor Parity Experiments HuggingFace dataset](https://huggingface.co/datasets/harborframework/parity-experiments). This allows the community to track adapter quality and helps estimate costs for each adapter on diverse agents and models. 
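Collecting the per-trial `result.json` files into the expected `results_collection/` naming scheme can be scripted. The sketch below uses assumed trial-directory paths, and the dummy input files exist only to make it runnable:

```shell
#!/bin/bash
# Sketch: copy each trial's result.json into results_collection/ using
# the result_{original|harbor}_trial{N}.json naming convention.
# The setup lines below create dummy inputs purely for illustration.
mkdir -p harbor_parity/trial1 harbor_parity/trial2 results_collection
echo '{"reward": 1.0}' > harbor_parity/trial1/result.json
echo '{"reward": 0.0}' > harbor_parity/trial2/result.json

for trial_dir in harbor_parity/trial*/; do
  # Extract the trial number from the directory name (e.g., trial1 -> 1)
  n="$(basename "$trial_dir" | sed 's/^trial//')"
  cp "${trial_dir}result.json" "results_collection/result_harbor_trial${n}.json"
done
```

Run the same loop over your original-benchmark trial directories with the `original` prefix to complete the collection.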
+**Step complete when:** `parity_experiment.json` is valid JSON, all required fields are populated, `original` and `harbor` scores are comparable (within standard error), and the README includes the parity summary table with links. If scores diverge significantly, investigate before proceeding. - -Uploading to the HuggingFace parity dataset can be tricky (large repos, LFS requirements, HF-specific refs). The [parity upload skill](https://github.com/harbor-framework/harbor/pull/1286) automates this workflow — it handles sparse checkouts, LFS tracking, and pushing to the correct HF PR ref. Use it to avoid common upload pitfalls. - +--- + +## Step 7. Upload Parity Results -Follow the README instructions in the HuggingFace dataset repository to upload your results. The dataset expects results to be organized in the following format: +Upload parity and oracle results to [harborframework/parity-experiments](https://huggingface.co/datasets/harborframework/parity-experiments) on HuggingFace. + +**Recommended:** Use the [parity upload skill](https://github.com/harbor-framework/harbor/tree/main/skills/upload-parity-experiments) to automate this — it handles sparse checkouts, LFS tracking, and HF-specific PR refs. + +### Required directory structure ``` adapters/ - └── {adapter_name}/ - ├── README.md # Results overview, interpretation, notes, etc. - ├── config.yaml # The yaml file that can be directly used to run parity experiments in Harbor. - ├── original_parity/ - ├── harbor_parity/ - ├── oracle/ - └── results_collection/ # copy the valid result.json files from parity to this directory - ├── result_{original/harbor}_trial1.json - ├── result_{original/harbor}_trial2.json - ├── ... 
- └── result_{original/harbor}_trial{N}.json +└── {adapter_name}/ + ├── README.md + ├── config.yaml + ├── original_parity/ + ├── harbor_parity/ + ├── oracle/ + └── results_collection/ + ├── result_{original/harbor}_trial1.json + ├── result_{original/harbor}_trial2.json + └── result_{original/harbor}_trial{N}.json ``` +**Step complete when:** PR to the HuggingFace parity-experiments dataset is submitted with all result files in the expected directory structure. + +--- -### 8. Register the Dataset - -#### 8.1 Generate dataset -Once your adapter correctly generates tasks and you verify the parity experiments, you should add them to the official [Harbor datasets repository](https://github.com/laude-institute/harbor-datasets). - -- **Fork and clone the dataset repository:** - ```bash - git clone https://github.com/{your-github-username}/harbor-datasets.git - ``` -- **Add your tasks:** Place the generated task directories under `datasets//`. For example, if you follow the adapter development instructions above correctly, you should be able to run the following example commands to add your tasks to the dataset repository: - ```bash - cd harbor/adapters/ - - # Specify custom path to the harbor-datasets repo - uv run run_adapter.py --output-dir /path/to/harbor-datasets/datasets/ - ``` -- Generate `dataset.toml`: - ```bash - # Initialize the dataset (creates dataset.toml, auto-detects tasks in the directory) - cd harbor-datasets/datasets/ - harbor init - # Select "dataset" when prompted - ``` -- Edit the generated `dataset.toml` to fill in the required metadata. Your dataset description should include: - - **Parity experiment results:** A summary of your parity findings (see [Step 6](#6-record-parity-results)) - - **Adapter author credits:** Names and contact information for the adapter contributors - - **Any other acknowledgment:** i.e. funding support -- **Pull Request:** Create a pull request to the `harbor-datasets` repository. 
It's recommended to link the original benchmark's GitHub repository in your PR. Request @Slimshilin for review, and he will merge it so you can try `--registry-path` for the `harbor` harness. You may always submit another PR to update the dataset registry. - -**Version naming:** Use `"1.0"` by default. If the original benchmark has named versions (e.g., "verified", "lite"), follow their naming. If you ran parity experiments on a subset (see [Section 4](#4-discuss-parity-plans-and-implement-agents)), use `"parity"` for the parity subset to allow users to run `-d @parity` for parity reproduction. - -#### 8.2 Test Locally -Before submitting for publishing, verify your dataset works correctly using the `-p` path parameter: +## Step 8. Register the Dataset + +### 8.1 Generate dataset ```bash -# Run oracle agent on your local dataset -harbor run -p /path/to/your/dataset +# Fork and clone +git clone https://github.com/{your-github-username}/harbor-datasets.git + +# Generate tasks +cd harbor/adapters/ +uv run run_adapter.py --output-dir /path/to/harbor-datasets/datasets/ + +# Create dataset.toml +cd /path/to/harbor-datasets/datasets/ +harbor init # select "dataset" when prompted ``` - -You cannot test against the registry (using `-d`) until the dataset has been published. This ensures the published data structure is correct. Use `-p` (local path) for all pre-publish testing. - +Edit `dataset.toml` to include: parity results summary, adapter author credits, acknowledgments. + +**Version naming:** Use `"1.0"` by default. Follow original benchmark naming if applicable (e.g., "verified", "lite"). Use `"parity"` for parity subsets. + +Create a PR to `harbor-datasets`. Request `@Slimshilin` for review. -#### 8.3 Submit for Publishing -Include your tasks directory and `dataset.toml` in your adapter PR. +### 8.2 Test locally -Once your adapter PR gets approved, the Harbor team will review and publish the dataset to the registry. 
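As a concrete reference, a filled-in `dataset.toml` might look like the following sketch. `harbor init` generates the authoritative template, so treat every field name below as an assumption:

```toml
# Hypothetical dataset.toml sketch — harbor init generates the real
# template; the field names here are illustrative, not authoritative.
name = "my-benchmark"
version = "1.0"   # use "parity" when registering a parity subset

description = """
Adapted from My Benchmark (https://github.com/user/my-benchmark).
Parity (codex + gpt-5, pass@1): 45.2 ± 1.3 original vs 44.8 ± 1.1 Harbor.
Adapter authors: Jane Doe (jane@example.com). Supported by Example Lab.
"""
```

Whatever the real schema, the description should carry the parity summary, author credits, and acknowledgments required above.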
+```bash +harbor run -p /path/to/your/dataset +``` -#### 8.4 Verify Post-Publish +**Note:** Registry testing (`-d`) is only available after publishing. Use `-p` for all pre-publish testing. -Once the dataset is published to the registry, verify that it loads and runs correctly: +### 8.3 Submit for publishing + +Include tasks directory and `dataset.toml` in your adapter PR. The Harbor team publishes after approval. + +### 8.4 Verify post-publish ```bash -# Run oracle agent from the registry harbor run -d ``` -### 9. Document and Submit +**Step complete when:** Dataset is published to the registry, `harbor run -d ` passes oracle tests, and the PR to `harbor-datasets` is merged. -Follow the [Harbor adapter README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md) to draft comprehensive documentation for your adapter. +--- -Your README must clearly and comprehensively document all adaptation details, including: -- **Benchmark bugs or issues** that were discovered and how they were handled -- **Special treatments for agent adaptation** (e.g., prompt modifications, environment adjustments) -- **Any deviations from the original benchmark** and the rationale behind them -- **Agent implementation details** (if custom agents were created) -- **Known limitations or constraints** +## Step 9. Document and Submit -The documentation should be detailed enough for other community users to understand your adaptation choices and reproduce your work. +### README requirements -Next, you need to write a `harbor/adapters/{adapter_name}/adapter_metadata.json` that follows the format below: -```json -[ - { - "adapter_name": , - "adapter_builders": [ (), ...] - "original_benchmark": [ - { - "split": , // if there's no split or subset name, use "full". - "size": , // "task" may mean different things in different benchmarks; for term consistency, we count tasks in Harbor context. 
- "harness": // choose between "agent", "llm", or `None`, depending on whether the benchmark has scripts for agent / llm inference. - "supported_agents": [agent_1, agent_2, ...], // supported agents (including custom agents) in the original harness; if no agents are originally supported, use `None`. Please use agent@version if version is available. - "adaptable": , // if this split can be converted to Harbor tasks with the provided adapter code. - "notes": , // e.g., term explanation, special task structures or requirements on machine or compute. Fill `None` if not applicable. - }, - ... // more splits or subsets if there exist. - ], - "harbor_adapter": [ - { - "split": , // if there's no split or subset name, use "full"; if the adapter code works for all splits and we ran parity collectively, we can just write "full" without needing to split them one by one; however, if different splits are registered / validated in different ways, we need to split them out. - "adapted_benchmark_size": , // this may be different than the size of the original benchmark's corresponding split, because we might exclude certain tasks for sufficient reasons documented in the README. - "parity_benchmark_size": , // same as adapted_benchmark_size if we ran parity on full set - "parity_sampling_rate": parity_benchmark_size / adapted_benchmark_size - "registry_benchmark_size": // we will match this number with adapted_benchmark_size or parity_benchmark_size to determine whether the full set or parity set is being registered. Please use the exact match integer-value count here. - "added_agents": [custom_agent1, custom_agent2], // custom agents added by the adapter to align with the original benchmark. - "parity_matching_agents": [agent_1@version+model, agent_1@version+model, ...] // agents (including custom ones) used for parity experiment AND achieved comparable scores to original benchmark. - "parity_unmatching_agents": [agent_1@version+model, agent_1@version+model, ...] 
// agents used for parity experiment BUT didn't achieve comparable scores to original benchmark. This may happen for some weak models. Fill `None` if there's no unmatching parity results. - "parity_costs": // total expense used for running parity experiments on the adapter - "notes": , // e.g., special treatment on the adapter. Fill `None` if not applicable. - }, - ... // more splits or subsets if necessary. - ], - }, - ... // if the adapter ran parity between Harbor Adapter <--> Terminal Bench Adapter <--> Original Benchmark, then substitute "harbor_adapter" with "tb_adapter" above and copy paste the dictionary below to include corresponding information for "tb_adapter" and "harbor_adapter" comparison. -] -``` +Follow the [adapter README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md). Must document: +- Benchmark bugs discovered and how they were handled +- Special treatments (prompt modifications, environment adjustments) +- Deviations from the original benchmark and rationale +- Agent implementation details (if custom agents were created) +- Known limitations -Once everything is ready for review (all steps completed, documentation finalized, screenshots added), update your Harbor adapter PR: +### `adapter_metadata.json` schema -1. **Change the PR title** from `[WIP] Adapter: {adapter_name}` to `[Ready for Review] Adapter: {adapter_name}` -2. **Request review** from `@Slimshilin` in the PR +Create `harbor/adapters/{adapter_name}/adapter_metadata.json`. -This signals to the team that your adapter is complete and ready for final review and merge. +**Top-level fields:** -### Other Useful Resources -- The [Harbor documentation](/docs/getting-started) provides detailed information about running tasks and jobs with Harbor. -- The [Harbor repository](https://github.com/laude-institute/harbor) contains example tasks and configurations. 
-- The [agent tutorial](/docs/agents) provides instructions on how to create and use your customized agent in Harbor. +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `adapter_name` | `string` | Yes | Adapter name | +| `adapter_builders` | `string[]` | Yes | Builder names with email, e.g., `["Jane Doe (jane@example.com)"]` | +| `original_benchmark` | `object[]` | Yes | Original benchmark split descriptors | +| `harbor_adapter` | `object[]` | Yes | Harbor adapter split descriptors | -### Getting Help -Thank you for your interest in Harbor and building an adapter! If you have any questions, please ask in the `#adapters-spam` channel in our [Discord](https://discord.com/invite/6xWPKhGDbA). +**`original_benchmark` entry fields:** ---- +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `split` | `string` | Yes | Split name (use `"full"` if none) | +| `size` | `integer` | Yes | Number of tasks in Harbor context | +| `harness` | `string` | Yes | `"agent"`, `"llm"`, or `"None"` | +| `supported_agents` | `string[]` | Yes | Use `agent@version` format. `["None"]` if none | +| `adaptable` | `boolean` | Yes | Whether this split can be converted | +| `notes` | `string` | No | Additional clarification. `"None"` if N/A | + +**`harbor_adapter` entry fields:** -## Translating Terminal-Bench Adapters to Harbor +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `split` | `string` | Yes | Corresponding split. `"full"` if collective | +| `adapted_benchmark_size` | `integer` | Yes | Tasks convertible by adapter | +| `parity_benchmark_size` | `integer` | Yes | Tasks used for parity | +| `parity_sampling_rate` | `number` | Yes | `parity_benchmark_size / adapted_benchmark_size` | +| `registry_benchmark_size` | `integer` | Yes | Exact task count in registry | +| `added_agents` | `string[]` | Yes | Custom agents added. 
`["None"]` if none | +| `parity_matching_agents` | `string[]` | Yes | Agents with comparable scores (`agent@version+model`) | +| `parity_unmatching_agents` | `string[]` | Yes | Agents without comparable scores. `["None"]` if all matched | +| `parity_costs` | `string` | Yes | Total USD (e.g., `"$150"`) | +| `notes` | `string` | No | `"None"` if N/A | -If you have an existing [Terminal-Bench adapter](https://github.com/laude-institute/terminal-bench/tree/main/adapters) and want to convert it to Harbor format, this section outlines the key differences and migration steps. Harbor maintains the same core principles as Terminal-Bench but uses a different file structure and configuration format. +If parity ran across three systems (Harbor ↔ Terminal-Bench ↔ Original), include a `"tb_adapter"` key with the same structure. -Note that the Harbor adapter should be isolated from the Terminal-Bench repo. You are expected to write adapter code following the same process as for Terminal-Bench instead of applying a direct translation script. Fortunately, with a good Terminal-Bench adapter, it is relatively easy to create a Harbor adapter by handling a slightly different task format. +### Submit -### Key Format Differences +1. Change PR title from `[WIP] Adapter: {adapter_name}` to `[Ready for Review] Adapter: {adapter_name}`. +2. Request review from `@Slimshilin`. + +**Step complete when:** PR title is `[Ready for Review] Adapter: {adapter_name}`, README covers all required sections, `adapter_metadata.json` passes schema validation, and review is requested from `@Slimshilin`. + +--- -The following table summarizes the main differences between Terminal-Bench and Harbor task formats: +## Reference: Terminal-Bench Migration + +**Important:** The Harbor adapter must be isolated from the Terminal-Bench repo. Do not write a mechanical translation script — write fresh adapter code following the Harbor process. 
| Aspect | Terminal-Bench | Harbor | |--------|----------------|---------| -| **Task Configuration** | `task.yaml` (YAML format) | `task.toml` (TOML format) | -| **Instruction** | Embedded in `task.yaml` as `instruction` field | Separate `instruction.md` file | -| **Dockerfile Location** | Root level: `Dockerfile` | Subdirectory: `environment/Dockerfile` | -| **Solution Script** | Root level: `solution.sh` | Subdirectory: `solution/solve.sh` | -| **Test Scripts** | Root level: `run-tests.sh` + `tests/test_outputs.py` | Subdirectory: `tests/test.sh` | -| **Test Verification** | Exit code based (pytest) | Reward-based: write to `/logs/verifier/reward.txt` | -| **Docker Compose** | `docker-compose.yaml` in task root | Not typically used per-task | -| **Default Output Directory** | `tasks/` | `datasets/` | -| **Registry Format** | Dataset-level with `dataset_path` | `dataset.toml` + `harbor init` publishing workflow | -| **CLI Commands** | `tb run --dataset` / `tb run --dataset-path` | `harbor run -d` / `harbor run -t` / `harbor run -p` | -| **Metrics** | Resolved rate (binary pass/fail per task) | Rewards that support multiple metrics and float-type values | - -**IMPORTANT:** If the Terminal-Bench adapter used a tweaked metric (e.g., threshold-based scoring, ignoring certain metrics), then you'll need to re-implement the adapter for Harbor to support the **original** metrics used by the benchmark, as Harbor now supports multiple metrics as rewards. 
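Because Harbor rewards are plain numbers written by the verifier, a float-valued metric can be emitted directly. This Python sketch assumes only the `/logs/verifier/reward.txt` contract from the table above — the pass-ratio scoring itself is an illustrative choice, not the benchmark's required metric:

```python
# Sketch: write a fractional (float) reward instead of binary pass/fail.
# Only the /logs/verifier/reward.txt path follows Harbor's verifier
# contract; the pass-ratio scoring below is an illustrative choice.
from pathlib import Path

def write_reward(passed: int, total: int, out_dir: str = "/logs/verifier") -> float:
    """Compute a pass ratio and persist it where the Harbor verifier reads it."""
    reward = passed / total if total else 0.0
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "reward.txt").write_text(f"{reward}\n")
    return reward
```

A binary benchmark degenerates to `write_reward(1, 1)` or `write_reward(0, 1)`, so the same hook covers both metric styles.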
- -### File Structure Migration - -**Terminal-Bench structure:** -``` -task-id/ -├── task.yaml -├── Dockerfile -├── docker-compose.yaml -├── run-tests.sh -├── solution.sh -└── tests/ - └── test_outputs.py +| Config | `task.yaml` | `task.toml` | +| Instruction | In `task.yaml` | Separate `instruction.md` | +| Dockerfile | Root level | `environment/Dockerfile` | +| Solution | `solution.sh` | `solution/solve.sh` | +| Tests | `run-tests.sh` + `tests/test_outputs.py` | `tests/test.sh` | +| Docker Compose | `docker-compose.yaml` in task root | Not used per-task | +| Verification | Exit code (pytest) | Reward file: `/logs/verifier/reward.txt` | +| Output dir | `tasks/` | `datasets/` | +| Registry | Dataset-level `dataset_path` | Task-level via `dataset.toml` + `harbor init` | +| CLI | `tb run --dataset` | `harbor run -d` / `-t` / `-p` | +| Metrics | Binary pass/fail | Float rewards, multiple metrics | + +**Important:** If Terminal-Bench used a tweaked metric, re-implement for the **original** benchmark metrics. + +### Migration steps + +1. Convert `task.yaml` to `task.toml` + `instruction.md` +2. Move files: `Dockerfile` → `environment/`, `solution.sh` → `solution/solve.sh`, `run-tests.sh` → `tests/test.sh` +3. Remove `docker-compose.yaml` (not needed per-task in Harbor) +4. Update test scripts to write rewards to `/logs/verifier/reward.txt` (Harbor mounts `/logs/verifier` at runtime) +5. Update adapter code: change output dir from `tasks/` to `datasets/`, create subdirectories (`environment/`, `solution/`, `tests/`), split instruction into `instruction.md`, convert YAML generation to TOML +6. 
Use `harbor init` + `dataset.toml` for registry (replaces the old `registry.json`) + +### Registry format conversion + +**Before (Terminal-Bench registry.json):** +```json +{ + "name": "my-adapter", + "version": "head", + "description": "...", + "github_url": "https://github.com/laude-institute/terminal-bench-datasets.git", + "dataset_path": "datasets/my-adapter", + "task_id_subset": null +} ``` -**Harbor structure:** -``` -task-id/ -├── task.toml -├── instruction.md -├── environment/ -│ └── Dockerfile -├── solution/ -│ └── solve.sh -└── tests/ - ├── test.sh - └── test_*.py (optional) +**After (Harbor):** +```bash +harbor init # select "dataset", creates dataset.toml +# Edit dataset.toml with descriptions, authors, credits +# Then submit to Harbor team for publishing ``` -### Migration Steps - -#### Step 1: Update Task Configuration Format +See [Step 8](#step-8-register-the-dataset) for the full publishing workflow. -Convert `task.yaml` to `task.toml` and extract the instruction: +### task.yaml → task.toml conversion example -**Before (task.yaml):** +**Before:** ```yaml instruction: | Your task instruction here... - Multiple lines... author_email: example@email.com author_name: Author Name difficulty: hard @@ -609,91 +494,37 @@ timeout_sec = 3000.0 timeout_sec = 3000.0 ``` -**And create instruction.md:** +**After (instruction.md):** ```markdown Your task instruction here... -Multiple lines... ``` -#### Step 2: Reorganize Files into Subdirectories - -- Move `Dockerfile` → `environment/Dockerfile` -- Move `solution.sh` → `solution/solve.sh` -- Move `run-tests.sh` → `tests/test.sh` -- Remove `docker-compose.yaml` (usually not needed per-task in Harbor) - -#### Step 3: Update Test Scripts for Reward-Based System +### test.sh conversion example -**Before (run-tests.sh in Terminal-Bench):** +**Before (Terminal-Bench):** ```bash #!/bin/bash -# Run tests and create marker file pytest tests/ > test_results.txt -if [ $? 
-eq 0 ]; then - echo "PASSED" > /tmp/test_marker.txt -else - echo "FAILED" > /tmp/test_marker.txt -fi +if [ $? -eq 0 ]; then echo "PASSED" > /tmp/test_marker.txt; else echo "FAILED" > /tmp/test_marker.txt; fi ``` -**After (tests/test.sh in Harbor):** +**After (Harbor):** ```bash #!/bin/bash -# Install dependencies if needed -apt-get update && apt-get install -y python3-pip -pip3 install pytest - -# Run tests pytest /tests/test_*.py - -# Write reward based on test results -if [ $? -eq 0 ]; then - echo 1 > /logs/verifier/reward.txt -else - echo 0 > /logs/verifier/reward.txt -fi -``` - -**Key changes:** -- Harbor mounts `/logs/verifier` for test outputs -- Write numeric reward (can be float type) to `/logs/verifier/reward.txt` -- Can still use pytest, but final output must be the reward file - -#### Step 4: Update Adapter Code - -- Change default output directory from `tasks/` to `datasets/` -- Update template directory to match Harbor structure -- Modify file generation logic to create subdirectories (`environment/`, `solution/`, `tests/`) -- Split instruction extraction into separate `instruction.md` file -- Convert YAML generation to TOML generation - -#### Step 5: Update Registry Format - -Terminal-Bench uses a `registry.json` file with dataset-level entries. Harbor uses the [Harbor Registry](https://registry.harborframework.com) with `dataset.toml` configuration and the `harbor publish` workflow. - -**Terminal-Bench registry.json:** -```json -{ - "name": "my-adapter", - "version": "head", - "description": "...", - "github_url": "https://github.com/laude-institute/terminal-bench-datasets.git", - "dataset_path": "datasets/my-adapter", - "task_id_subset": null -} +if [ $? 
-eq 0 ]; then echo 1 > /logs/verifier/reward.txt; else echo 0 > /logs/verifier/reward.txt; fi
```

-**Harbor registry (dataset.toml + publish):**
-```bash
-# Initialize dataset configuration (auto-detects tasks)
-harbor init  # select "dataset"
+Key differences:
+- Harbor mounts `/logs/verifier` for test outputs at runtime.
+- Write numeric reward (can be float) to `/logs/verifier/reward.txt`.
+- Can still use pytest, but final output must be the reward file.

-# Edit dataset.toml with descriptions, authors, credits
-# Then submit to Harbor team for publishing
-```
-
-See [Step 8: Register the Dataset](#8-register-the-dataset) for the full publishing workflow.
+---

-### Getting Help
+## Resources

-If you have any questions about translating your Terminal-Bench adapter to Harbor, please ask in the `#adapters-spam` channel in our [Discord](https://discord.com/invite/6xWPKhGDbA) or reach out to [Lin Shi](mailto:ls2282@cornell.edu).
+- [Harbor docs](/docs/getting-started) — running tasks and jobs
+- [Harbor repo](https://github.com/laude-institute/harbor) — examples and configs
+- [Agent tutorial](/docs/agents) — creating custom agents
+- [Discord](https://discord.com/invite/6xWPKhGDbA) — `#adapters-spam` for questions

From 36348035566c7575f7ee841101f549c2eb6ab791 Mon Sep 17 00:00:00 2001
From: Crystal Zhou
Date: Sat, 28 Mar 2026 02:23:50 -0400
Subject: [PATCH 04/10] fix registry command

---
 content/docs/datasets/adapters-human.mdx | 2 +-
 content/docs/datasets/adapters.mdx       | 8 ++++----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/content/docs/datasets/adapters-human.mdx b/content/docs/datasets/adapters-human.mdx
index 9073fad..f7cb1fd 100644
--- a/content/docs/datasets/adapters-human.mdx
+++ b/content/docs/datasets/adapters-human.mdx
@@ -270,7 +270,7 @@ Include your tasks directory and `dataset.toml` in your adapter PR. Once approve
Once published, verify it loads and runs correctly:

```bash
-harbor run -d <dataset-name>
+harbor run -d <org>/<dataset-name>
```

## 9. Document & Submit

diff --git a/content/docs/datasets/adapters.mdx b/content/docs/datasets/adapters.mdx
index d784c2e..87f61d4 100644
--- a/content/docs/datasets/adapters.mdx
+++ b/content/docs/datasets/adapters.mdx
@@ -129,8 +129,8 @@ Develop your adapter under `adapters/{adapter-name}/`. Refer to existing adapter
| Single task | `harbor trial start -p datasets/<dataset-name>/<task-id> -a <agent-name> -m <model-name>` | Testing individual tasks |
| Entire dataset | `harbor run -p datasets/<dataset-name> -a <agent-name> -m <model-name>` | Full oracle verification |
| Config file | `harbor run -c adapters/<adapter-name>/<config-name>.yaml -a <agent-name> -m <model-name>` | Reproducible runs (see [example configs](https://github.com/laude-institute/harbor/tree/main/examples/configs)) |
-| Registry: single task | `harbor run -t <dataset-name>/<task-id> -a <agent-name> -m <model-name>` | Post-publish single task |
-| Registry: full dataset | `harbor run -d <dataset-name> -a <agent-name> -m <model-name>` | Post-publish full dataset (after [Step 8](#step-8-register-the-dataset)) |
+| Registry: single task | `harbor run -t <org>/<dataset-name>/<task-id> -a <agent-name> -m <model-name>` | Post-publish single task |
+| Registry: full dataset | `harbor run -d <org>/<dataset-name> -a <agent-name> -m <model-name>` | Post-publish full dataset (after [Step 8](#step-8-register-the-dataset)) |

Write a reference config YAML for your adapter to ensure reproducibility.

@@ -338,10 +338,10 @@ Include tasks directory and `dataset.toml` in your adapter PR. The Harbor team p
### 8.4 Verify post-publish

```bash
-harbor run -d <dataset-name>
+harbor run -d <org>/<dataset-name>
```

-**Step complete when:** Dataset is published to the registry, `harbor run -d <dataset-name>` passes oracle tests, and the PR to `harbor-datasets` is merged.
+**Step complete when:** Dataset is published to the registry, `harbor run -d <org>/<dataset-name>` passes oracle tests, and the PR to `harbor-datasets` is merged.
---

From a8e708779cdd1b40f67806e6b23a32f0cf53a98a Mon Sep 17 00:00:00 2001
From: Crystal Zhou
Date: Sat, 28 Mar 2026 22:42:49 -0400
Subject: [PATCH 05/10] Address comments, update adapter structure, add
 examples and

---
 content/docs/datasets/adapters-human.mdx | 60 +++++++++++++++-----
 content/docs/datasets/adapters.mdx       | 72 ++++++++++++++----------
 2 files changed, 90 insertions(+), 42 deletions(-)

diff --git a/content/docs/datasets/adapters-human.mdx b/content/docs/datasets/adapters-human.mdx
index f7cb1fd..1ea39c2 100644
--- a/content/docs/datasets/adapters-human.mdx
+++ b/content/docs/datasets/adapters-human.mdx
@@ -98,26 +98,60 @@ See the [hello-world example](https://github.com/laude-institute/harbor/tree/mai

### 2.3 Adapter code structure

-Your adapter lives in `harbor/adapters/{adapter-name}/`:
+Your adapter lives in `harbor/adapters/{adapter-name}/` as a Python package (generated by `harbor adapter init`):
+
+<Files>
+  <File name="pyproject.toml" />
+  <File name="README.md" />
+  <File name="adapter_metadata.json" />
+  <File name="parity_experiment.json" />
+  <File name="run_{adapter-name}.yaml" />
+  <Folder name="src" defaultOpen>
+    <Folder name="{pkg_name}" defaultOpen>
+      <File name="__init__.py" />
+      <File name="adapter.py" />
+      <File name="main.py" />
+      <Folder name="task-template" defaultOpen>
+        <File name="task.toml" />
+        <File name="instruction.md" />
+        <Folder name="environment">
+          <File name="Dockerfile" />
+        </Folder>
+        <Folder name="solution">
+          <File name="solve.sh" />
+        </Folder>
+        <Folder name="tests">
+          <File name="test.sh" />
+        </Folder>
+      </Folder>
+    </Folder>
+  </Folder>
+</Files>
+
+Where `{pkg_name}` is your adapter name with dashes replaced by underscores (e.g., `my-adapter` becomes `my_adapter`).

| File | Purpose |
|------|---------|
-| `adapter.py` | Core logic: parse benchmark data, generate task dirs |
-| `run_adapter.py` | CLI entry point (supports `--output-path`) |
-| `template/` | Template files copied into each task |
+| `src/{pkg_name}/adapter.py` | Core logic: parse benchmark data, generate task dirs |
+| `src/{pkg_name}/main.py` | CLI entry point (supports `--output-dir`, `--limit`, `--overwrite`, `--task-ids`) |
+| `src/{pkg_name}/task-template/` | Template files copied into each generated task |
| `parity_experiment.json` | Parity results (filled in later) |
-| `run_{name}.yaml` | Reference config for reproducibility |
+| `run_{adapter-name}.yaml` | Reference config to run the full adapted dataset |
| `README.md` | Final documentation (written last) |
| `adapter_metadata.json` | Structured metadata about the adapter |

-**Requirements for `run_adapter.py`:**
-- Support cloning the source benchmark temporarily (with cleanup)
-- Support using an already-cloned repo
-- Default output to `datasets/{adapter-name}`, with `--output-path` override
+**Running the adapter:**
+```bash
+uv run python -m {pkg_name}.main --output-dir <output-dir>
+```

**Tips:**
- Minor prompt tweaks (e.g., "write files in place without asking") are fine, as long as they apply to both the original benchmark and Harbor sides.
- Adapting only a subset of tasks is acceptable if documented in the README.
+- If your benchmark requires GPU, add a `docker-compose.yaml` with nvidia device reservations in the task's `environment/` directory for Docker runs. For cloud/Modal runs, also set `gpus` in `task.toml`. See the [featurebench adapter](https://github.com/harbor-framework/harbor/tree/main/adapters/featurebench) for a comprehensive example with separate CPU/GPU/Modal configs.

## 3. 
Verify Oracle Solutions @@ -149,11 +183,11 @@ Reach out to the team (e.g., **Lin Shi**) on [Discord](https://discord.com/invit Depending on your benchmark, you'll fall into one of three scenarios: -**Scenario A — Compatible agents exist:** The original benchmark already supports Harbor-compatible agents (OpenHands, Codex, Claude-Code, etc.). No extra work needed. +**Scenario A — Compatible agents exist:** The original benchmark already supports Harbor-compatible agents (OpenHands, Codex, Claude-Code, etc.). No extra work needed. Example: [ADEBench](https://github.com/laude-institute/harbor/tree/main/adapters/adebench) — the original benchmark already supports Claude Code. -**Scenario B — LLM-based, no compatible agents:** Fork the original benchmark, implement a Harbor-compatible agent there, and document it. See the [EvoEval example](https://github.com/laude-institute/harbor/blob/main/adapters/evoeval/parity_experiment.json). +**Scenario B — LLM-based, no compatible agents:** Fork the original benchmark, implement a Harbor-compatible agent there, and document it. Example: [EvoEval](https://github.com/laude-institute/harbor/tree/main/adapters/evoeval) — forked the repo to add codex agent support for parity. -**Scenario C — Custom agents:** Implement the custom agent in Harbor under `adapters/{name}/{agent}.py` and run parity experiments with the custom agents. Also run experiments with standard agents (Codex, Claude-Code) to show the adapter generalizes. +**Scenario C — Custom agents:** Implement the custom agent in Harbor under `adapters/{name}/` and run parity experiments with the custom agents. Also run experiments with standard agents (Codex, Claude-Code) to show the adapter generalizes. Example: [MedAgentBench](https://github.com/laude-institute/harbor/tree/main/adapters/medagentbench) — implements a custom HTTPAgent matching the original GET/POST/FINISH interaction semantics. For expensive benchmarks, you can run parity on a representative subset. 
Discuss sampling strategy with the team first. Use `--split parity` in your adapter and register with `"version": "parity"` so users can run `-d {name}@parity`.

@@ -232,7 +266,7 @@ adapters/<adapter-name>/
```bash
git clone https://github.com/{you}/harbor-datasets.git
cd harbor/adapters/<adapter-name>
-uv run run_adapter.py --output-dir /path/to/harbor-datasets/datasets/<dataset-name>
+uv run python -m <pkg_name>.main --output-dir /path/to/harbor-datasets/datasets/<dataset-name>
```

Generate `dataset.toml`:
diff --git a/content/docs/datasets/adapters.mdx b/content/docs/datasets/adapters.mdx
index 87f61d4..167bd28 100644
--- a/content/docs/datasets/adapters.mdx
+++ b/content/docs/datasets/adapters.mdx
@@ -45,30 +45,39 @@ Reference implementation: [hello-world example](https://github.com/laude-institu

### Adapter code directory

+Generated by `harbor adapter init`, this is a Python package using `src` layout:
+
```
harbor/adapters/<adapter-name>/
-├── adapter.py              # main logic: parse benchmark, generate task dirs
-├── run_adapter.py          # CLI entry point (must support --output-path)
-├── parity_experiment.json  # parity results (step 6)
-├── run_<adapter-name>.yaml # reference config for reproducibility
+├── .python-version         # Python version (optional, created by uv init)
+├── pyproject.toml          # Python package config (created by uv init)
├── README.md               # final documentation (step 9)
├── adapter_metadata.json   # structured metadata (step 9)
+├── parity_experiment.json  # parity results (step 6)
+├── run_<adapter-name>.yaml # reference config to run the full adapted dataset
+└── src/
+    └── <pkg_name>/         # adapter-name with dashes → underscores
+        ├── __init__.py
+        ├── adapter.py      # main logic: parse benchmark, generate task dirs
+        ├── main.py         # CLI entry point (must support --output-dir)
+        └── task-template/  # template files copied into each task
+            ├── task.toml
+            ├── instruction.md
+            ├── environment/
+            │   └── Dockerfile
+            ├── solution/
+            │   └── solve.sh
+            └── tests/
+                └── test.sh
```

-### Key requirements for `run_adapter.py`
+Reference implementation: [helloworld adapter](https://github.com/laude-institute/harbor/tree/main/adapters/helloworld)
+
+### Key requirements for `main.py`

-- Must support temporarily cloning the source benchmark, preparing tasks, and cleaning up the clone.
-- Must support generating tasks from an already-cloned repo without deleting it.
-- Default output directory: `datasets/<adapter-name>`, overridable via `--output-path`.
+- Must support `--output-dir` to specify where generated tasks are written.
+- Must support `--limit`, `--overwrite`, and `--task-ids` flags.
+- Run via `uv run python -m <pkg_name>.main --output-dir <output-dir>`.

---

@@ -105,18 +114,23 @@ Develop your adapter under `adapters/{adapter-name}/`. Refer to existing adapter

| Component | Description |
|-----------|-------------|
-| `adapter.py` / `run_adapter.py` | Must support: (1) temporary clone + cleanup; (2) generating from existing clone. Default output: `datasets/<adapter-name>`, with `--output-path` override. |
-| `template/` | Template files for task generation. Include all files from the task directory structure above. |
+| `src/<pkg_name>/adapter.py` | Core logic: parse benchmark data, generate task directories. |
+| `src/<pkg_name>/main.py` | CLI entry point. Must support `--output-dir`, `--limit`, `--overwrite`, `--task-ids`. |
+| `src/<pkg_name>/task-template/` | Template files copied into each generated task. |
| `parity_experiment.json` | Parity results — see [Step 6](#step-6-record-parity-results) for full schema. |
| `README.md` | Write last before PR submission. Follow the [README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md). |
| Metrics / Rewards | Harbor supports multiple float-valued metrics as rewards (RL-compatible). Use the same metrics as the original benchmark. 
| +### GPU tasks + +If your benchmark includes tasks that require GPU (e.g., CUDA, Triton kernels), add a `docker-compose.yaml` in the task's `environment/` directory with nvidia device reservations for Docker runs. For cloud/Modal runs, also set `gpus` in `task.toml`. See the [featurebench adapter](https://github.com/laude-institute/harbor/tree/main/adapters/featurebench) for a comprehensive example — it handles 44 GPU tasks across multiple repos with separate CPU/GPU/Modal config files. + ### Rules - Prompt modifications (e.g., "write files in place without asking") are acceptable **if applied to both the original benchmark and Harbor adapter**. - Adapting a subset of tasks is acceptable (e.g., only SWE-Bench-Verified). **Document all exclusions in the README.** -**Step complete when:** `run_adapter.py` produces a valid task directory for each task containing `task.toml`, `instruction.md`, `environment/Dockerfile`, `solution/solve.sh`, and `tests/test.sh`. +**Step complete when:** `main.py` produces a valid task directory for each task containing `task.toml`, `instruction.md`, `environment/Dockerfile`, `solution/solve.sh`, and `tests/test.sh`. --- @@ -158,11 +172,11 @@ Contact the team (e.g., **Lin Shi** on [Discord](https://discord.com/invite/6xWP ### Agent implementation scenarios -| Scenario | Condition | Action required | -|----------|-----------|-----------------| -| **A: Compatible agents exist** | Original benchmark supports Harbor-compatible agents (OpenHands, Codex, Claude-Code, Gemini-CLI) | None — run parity with identical settings on both sides | -| **B: LLM-based, no compatible agents** | Original benchmark is LLM-based but lacks Harbor agents | Fork the original repo, implement Harbor-compatible agents, document in fork's README. 
Example: [EvoEval parity config](https://github.com/laude-institute/harbor/blob/main/adapters/evoeval/parity_experiment.json) | -| **C: Custom agents** | Original benchmark uses custom agents unavailable in Harbor | Implement custom agent in `adapters//.py`. Also run with standard agents (Codex, Claude-Code) to show generalization | +| Scenario | Condition | Action required | Example | +|----------|-----------|-----------------|---------| +| **A: Compatible agents exist** | Original benchmark supports Harbor-compatible agents (OpenHands, Codex, Claude-Code, Gemini-CLI) | None — run parity with identical settings on both sides | [ADEBench](https://github.com/laude-institute/harbor/tree/main/adapters/adebench) — original benchmark already supports Claude Code | +| **B: LLM-based, no compatible agents** | Original benchmark is LLM-based but lacks Harbor agents | Fork the original repo, implement Harbor-compatible agents, document in fork's README | [EvoEval](https://github.com/laude-institute/harbor/tree/main/adapters/evoeval) — forked repo to add codex agent support | +| **C: Custom agents** | Original benchmark uses custom agents unavailable in Harbor | Implement custom agent in `adapters//`. Also run with standard agents (Codex, Claude-Code) to show generalization | [MedAgentBench](https://github.com/laude-institute/harbor/tree/main/adapters/medagentbench) — implements custom HTTPAgent matching original GET/POST/FINISH semantics | Keep links to any forked repositories and document the approach in the README. @@ -170,12 +184,12 @@ Keep links to any forked repositories and document the approach in the README. If running the full benchmark is too expensive, run parity on a representative subset. Requirements: - Document in README how the subset was selected and that parity ran on a subset. -- Support `--split parity` in `run_adapter.py` to generate only the parity subset. +- Support `--split parity` in `main.py` to generate only the parity subset. 
- Use version `"parity"` in `dataset.toml` so users can run `-d <dataset-name>@parity`.

```bash
-uv run run_adapter.py --split parity --output-dir /path/to/output # parity subset
-uv run run_adapter.py --output-dir /path/to/output # full dataset
+uv run python -m <pkg_name>.main --split parity --output-dir /path/to/output # parity subset
+uv run python -m <pkg_name>.main --output-dir /path/to/output # full dataset
```

**Step complete when:** Parity plan is agreed with the team (agents, models, number of runs), and any required agent implementations are working on both the original benchmark and Harbor sides.
@@ -310,7 +324,7 @@ git clone https://github.com/{your-github-username}/harbor-datasets.git

# Generate tasks
cd harbor/adapters/<adapter-name>
-uv run run_adapter.py --output-dir /path/to/harbor-datasets/datasets/<dataset-name>
+uv run python -m <pkg_name>.main --output-dir /path/to/harbor-datasets/datasets/<dataset-name>

# Create dataset.toml
cd /path/to/harbor-datasets/datasets/<dataset-name>

From 33b3184c0b8bab06eb3762b764326f274875fecb Mon Sep 17 00:00:00 2001
From: Crystal Zhou
Date: Sat, 28 Mar 2026 22:43:45 -0400
Subject: [PATCH 06/10] Remove broken urls

---
 content/docs/datasets/adapters-human.mdx | 2 --
 content/docs/datasets/adapters.mdx       | 2 --
 2 files changed, 4 deletions(-)

diff --git a/content/docs/datasets/adapters-human.mdx b/content/docs/datasets/adapters-human.mdx
index 1ea39c2..bb19e1e 100644
--- a/content/docs/datasets/adapters-human.mdx
+++ b/content/docs/datasets/adapters-human.mdx
@@ -94,8 +94,6 @@ Each generated task should look like this:



-See the [hello-world example](https://github.com/laude-institute/harbor/tree/main/examples/tasks/hello-world) for a concrete reference. 
- ### 2.3 Adapter code structure Your adapter lives in `harbor/adapters/{adapter-name}/` as a Python package (generated by `harbor adapter init`): diff --git a/content/docs/datasets/adapters.mdx b/content/docs/datasets/adapters.mdx index 167bd28..86b65fa 100644 --- a/content/docs/datasets/adapters.mdx +++ b/content/docs/datasets/adapters.mdx @@ -71,8 +71,6 @@ harbor/adapters// └── test.sh ``` -Reference implementation: [helloworld adapter](https://github.com/laude-institute/harbor/tree/main/adapters/helloworld) - ### Key requirements for `main.py` - Must support `--output-dir` to specify where generated tasks are written. From 8e5416e5ba6fb40b086efbafa8bbab650fd488be Mon Sep 17 00:00:00 2001 From: Crystal Zhou Date: Sat, 28 Mar 2026 22:55:06 -0400 Subject: [PATCH 07/10] More changes --- content/docs/datasets/adapters-human.mdx | 1 + content/docs/datasets/adapters.mdx | 41 ++++++++++++++++++++++++ 2 files changed, 42 insertions(+) diff --git a/content/docs/datasets/adapters-human.mdx b/content/docs/datasets/adapters-human.mdx index bb19e1e..b3dbb91 100644 --- a/content/docs/datasets/adapters-human.mdx +++ b/content/docs/datasets/adapters-human.mdx @@ -147,6 +147,7 @@ uv run python -m {pkg_name}.main --output-dir ``` **Tips:** +- For `run_{adapter-name}.yaml`, keep oracle as the default agent and comment out alternatives (codex, claude-code, etc.) so anyone can quickly switch. Add separate config files for different scenarios if needed (parity subsets, CPU/GPU splits, cloud providers). See the [agent guide](/docs/datasets/adapters#writing-run_adapter-nameyaml) for a full example. - Minor prompt tweaks (e.g., "write files in place without asking") are fine, as long as they apply to both the original benchmark and Harbor sides. - Adapting only a subset of tasks is acceptable if documented in the README. - If your benchmark requires GPU, add a `docker-compose.yaml` with nvidia device reservations in the task's `environment/` directory for Docker runs. 
For cloud/Modal runs, also set `gpus` in `task.toml`. See the [featurebench adapter](https://github.com/harbor-framework/harbor/tree/main/adapters/featurebench) for a comprehensive example with separate CPU/GPU/Modal configs.

diff --git a/content/docs/datasets/adapters.mdx b/content/docs/datasets/adapters.mdx
index 86b65fa..c929a6a 100644
--- a/content/docs/datasets/adapters.mdx
+++ b/content/docs/datasets/adapters.mdx
@@ -116,6 +116,7 @@ Develop your adapter under `adapters/{adapter-name}/`. Refer to existing adapter
| `src/<pkg_name>/main.py` | CLI entry point. Must support `--output-dir`, `--limit`, `--overwrite`, `--task-ids`. |
| `src/<pkg_name>/task-template/` | Template files copied into each generated task. |
| `parity_experiment.json` | Parity results — see [Step 6](#step-6-record-parity-results) for full schema. |
+| `run_{adapter-name}.yaml` | Reference config to run the full adapted dataset |
| `README.md` | Write last before PR submission. Follow the [README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md). |
| Metrics / Rewards | Harbor supports multiple float-valued metrics as rewards (RL-compatible). Use the same metrics as the original benchmark. |

@@ -123,6 +124,46 @@
If your benchmark includes tasks that require GPU (e.g., CUDA, Triton kernels), add a `docker-compose.yaml` in the task's `environment/` directory with nvidia device reservations for Docker runs. For cloud/Modal runs, also set `gpus` in `task.toml`. See the [featurebench adapter](https://github.com/laude-institute/harbor/tree/main/adapters/featurebench) for a comprehensive example — it handles 44 GPU tasks across multiple repos with separate CPU/GPU/Modal config files.

+### Writing `run_{adapter-name}.yaml`
+
+This config file serves as the single entry point for all experiments — oracle verification, parity runs, and general benchmarking. Keep the oracle agent as the default (uncommented) and include other agents as commented-out alternatives so anyone can quickly switch.
+
+```yaml
+datasets:
+  - path: datasets/<dataset-name>
+
+# Default: oracle agent for verification
+agents:
+  - name: oracle
+
+# Uncomment to run with other agents:
+# agents:
+#   - name: codex
+#     model_name: openai/gpt-5-mini
+#
+# agents:
+#   - name: claude-code
+#     model_name: claude-sonnet-4-5-20250929
+
+environment:
+  type: docker
+  delete: true
+
+orchestrator:
+  type: local
+  n_concurrent_trials: 4
+```
+
+You can also create additional config files for different scenarios (e.g., parity subsets, CPU-only vs GPU, Modal). For example, featurebench provides `featurebench_docker_cpu.yaml`, `featurebench_docker_gpu.yaml`, `featurebench_modal.yaml`, and `featurebench_parity.yaml`.
+
+Usage:
+```bash
+# Oracle verification (default)
+harbor run -c adapters/<adapter-name>/run_<adapter-name>.yaml
+
+# Switch agent by uncommenting the desired agent block
+```
+
### Rules

- Prompt modifications (e.g., "write files in place without asking") are acceptable **if applied to both the original benchmark and Harbor adapter**. 
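The nvidia device reservation that the GPU notes in this patch describe is easiest to see in a concrete compose file. Below is a minimal sketch of what a GPU task's `environment/docker-compose.yaml` might look like — the service name, build context, and GPU count are illustrative assumptions, not taken from the patches:

```yaml
# Hypothetical environment/docker-compose.yaml for a GPU task.
# Service name and build context are assumptions; adapt to your task layout.
services:
  main:
    build: .
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

This only covers local Docker runs; as the patch notes, cloud/Modal runs additionally need `gpus` set in `task.toml`.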
From c13ec7171aa90de564e12d19aee675c9c5ee8d6f Mon Sep 17 00:00:00 2001 From: Crystal Zhou Date: Sat, 28 Mar 2026 23:08:40 -0400 Subject: [PATCH 08/10] Update urls --- content/docs/datasets/adapters-human.mdx | 14 +++++++------- content/docs/datasets/adapters.mdx | 22 ++++++++++------------ 2 files changed, 17 insertions(+), 19 deletions(-) diff --git a/content/docs/datasets/adapters-human.mdx b/content/docs/datasets/adapters-human.mdx index b3dbb91..a88f1a0 100644 --- a/content/docs/datasets/adapters-human.mdx +++ b/content/docs/datasets/adapters-human.mdx @@ -6,7 +6,7 @@ description: A concise guide for human readers to create a Harbor adapter for yo import { Callout } from 'fumadocs-ui/components/callout'; import { File, Folder, Files } from 'fumadocs-ui/components/files'; -To add a new benchmark or dataset to Harbor, you create an [adapter](https://github.com/laude-institute/harbor/tree/main/adapters) that translates the original benchmark's tasks into Harbor format. +To add a new benchmark or dataset to Harbor, you create an [adapter](https://github.com/harbor-framework/harbor/tree/main/adapters) that translates the original benchmark's tasks into Harbor format. AI agents should follow the spec at [Adapter AI Guideline](/docs/datasets/adapters) @@ -61,7 +61,7 @@ Before coding, study the original benchmark and identify four key components: ### 2.0 Read the README template first -The [adapter README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md) doubles as a requirements checklist. Read it before writing code — it tells you what you'll need to provide. +The [adapter README template](https://github.com/harbor-framework/harbor/blob/main/docs/adapters/templates/README.md) doubles as a requirements checklist. Read it before writing code — it tells you what you'll need to provide. 
### 2.1 Fork and branch @@ -182,11 +182,11 @@ Reach out to the team (e.g., **Lin Shi**) on [Discord](https://discord.com/invit Depending on your benchmark, you'll fall into one of three scenarios: -**Scenario A — Compatible agents exist:** The original benchmark already supports Harbor-compatible agents (OpenHands, Codex, Claude-Code, etc.). No extra work needed. Example: [ADEBench](https://github.com/laude-institute/harbor/tree/main/adapters/adebench) — the original benchmark already supports Claude Code. +**Scenario A — Compatible agents exist:** The original benchmark already supports Harbor-compatible agents (OpenHands, Codex, Claude-Code, etc.). No extra work needed. Example: [ADEBench](https://github.com/harbor-framework/harbor/tree/main/adapters/adebench) — the original benchmark already supports Claude Code. -**Scenario B — LLM-based, no compatible agents:** Fork the original benchmark, implement a Harbor-compatible agent there, and document it. Example: [EvoEval](https://github.com/laude-institute/harbor/tree/main/adapters/evoeval) — forked the repo to add codex agent support for parity. +**Scenario B — LLM-based, no compatible agents:** Fork the original benchmark, implement a Harbor-compatible agent there, and document it. Example: [EvoEval](https://github.com/harbor-framework/harbor/tree/main/adapters/evoeval) — forked the repo to add codex agent support for parity. -**Scenario C — Custom agents:** Implement the custom agent in Harbor under `adapters/{name}/` and run parity experiments with the custom agents. Also run experiments with standard agents (Codex, Claude-Code) to show the adapter generalizes. Example: [MedAgentBench](https://github.com/laude-institute/harbor/tree/main/adapters/medagentbench) — implements a custom HTTPAgent matching the original GET/POST/FINISH interaction semantics. +**Scenario C — Custom agents:** Implement the custom agent in Harbor under `adapters/{name}/` and run parity experiments with the custom agents. 
Also run experiments with standard agents (Codex, Claude-Code) to show the adapter generalizes. Example: [MedAgentBench](https://github.com/harbor-framework/harbor/tree/main/adapters/medagentbench) — implements a custom HTTPAgent matching the original GET/POST/FINISH interaction semantics. For expensive benchmarks, you can run parity on a representative subset. Discuss sampling strategy with the team first. Use `--split parity` in your adapter and register with `"version": "parity"` so users can run `-d {name}@parity`. @@ -308,7 +308,7 @@ harbor run -d / ## 9. Document & Submit -Fill out the [README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md) covering: +Fill out the [README template](https://github.com/harbor-framework/harbor/blob/main/docs/adapters/templates/README.md) covering: - Benchmark bugs discovered and how they were handled - Special treatments (prompt tweaks, environment adjustments) - Deviations from the original and why @@ -352,6 +352,6 @@ Migration checklist: ## Resources - [Harbor docs](/docs/getting-started) — Running tasks and jobs -- [Harbor repo](https://github.com/laude-institute/harbor) — Examples and configs +- [Harbor repo](https://github.com/harbor-framework/harbor) — Examples and configs - [Agent tutorial](/docs/agents) — Creating custom agents - [Discord](https://discord.com/invite/6xWPKhGDbA) — Ask questions in `#adapters-spam` diff --git a/content/docs/datasets/adapters.mdx b/content/docs/datasets/adapters.mdx index c929a6a..f6210f6 100644 --- a/content/docs/datasets/adapters.mdx +++ b/content/docs/datasets/adapters.mdx @@ -41,8 +41,6 @@ harbor adapter init my-adapter --name "My Name" # non-interactive scaffold └── test_*.py # (optional) pytest test files ``` -Reference implementation: [hello-world example](https://github.com/laude-institute/harbor/tree/main/examples/tasks/hello-world) - ### Adapter code directory Generated by `harbor adapter init`, this is a Python package using `src` 
layout: @@ -98,7 +96,7 @@ Study the benchmark's repository, documentation, and code structure. ## Step 2. Fork and Develop Adapter Code -Read the [adapter README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md) first — it doubles as a requirements checklist. +Read the [adapter README template](https://github.com/harbor-framework/harbor/blob/main/docs/adapters/templates/README.md) first — it doubles as a requirements checklist. ```bash git clone https://github.com/{your-github-username}/harbor.git @@ -117,12 +115,12 @@ Develop your adapter under `adapters/{adapter-name}/`. Refer to existing adapter | `src//task-template/` | Template files copied into each generated task. | | `parity_experiment.json` | Parity results — see [Step 6](#step-6-record-parity-results) for full schema. | | `run_{adapter-name}.yaml` | Reference config to run the full adapted dataset | -| `README.md` | Write last before PR submission. Follow the [README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md). | +| `README.md` | Write last before PR submission. Follow the [README template](https://github.com/harbor-framework/harbor/blob/main/docs/adapters/templates/README.md). | | Metrics / Rewards | Harbor supports multiple float-valued metrics as rewards (RL-compatible). Use the same metrics as the original benchmark. | ### GPU tasks -If your benchmark includes tasks that require GPU (e.g., CUDA, Triton kernels), add a `docker-compose.yaml` in the task's `environment/` directory with nvidia device reservations for Docker runs. For cloud/Modal runs, also set `gpus` in `task.toml`. See the [featurebench adapter](https://github.com/laude-institute/harbor/tree/main/adapters/featurebench) for a comprehensive example — it handles 44 GPU tasks across multiple repos with separate CPU/GPU/Modal config files. 
+If your benchmark includes tasks that require GPU (e.g., CUDA, Triton kernels), add a `docker-compose.yaml` in the task's `environment/` directory with nvidia device reservations for Docker runs. For cloud/Modal runs, also set `gpus` in `task.toml`. See the [featurebench adapter](https://github.com/harbor-framework/harbor/tree/main/adapters/featurebench) for a comprehensive example — it handles 44 GPU tasks across multiple repos with separate CPU/GPU/Modal config files.

### Writing `run_{adapter-name}.yaml`

@@ -181,7 +179,7 @@ harbor run -c adapters/<adapter-name>/run_<adapter-name>.yaml

| Method | Command | Description |
|--------|---------|-------------|
| Single task | `harbor trial start -p datasets/<dataset>/<task> -a <agent> -m <model>` | Testing individual tasks |
| Entire dataset | `harbor run -p datasets/<dataset> -a <agent> -m <model>` | Full oracle verification |
-| Config file | `harbor run -c adapters/<adapter-name>/<config>.yaml -a <agent> -m <model>` | Reproducible runs (see [example configs](https://github.com/laude-institute/harbor/tree/main/examples/configs)) |
+| Config file | `harbor run -c adapters/<adapter-name>/<config>.yaml -a <agent> -m <model>` | Reproducible runs (see [example configs](https://github.com/harbor-framework/harbor/tree/main/examples/configs)) |
| Registry: single task | `harbor run -t <org>/<task> -a <agent> -m <model>` | Post-publish single task |
| Registry: full dataset | `harbor run -d <org>/<dataset> -a <agent> -m <model>` | Post-publish full dataset (after [Step 8](#step-8-register-the-dataset)) |

@@ -213,9 +211,9 @@ Contact the team (e.g., **Lin Shi** on [Discord](https://discord.com/invite/6xWP

| Scenario | Condition | Action required | Example |
|----------|-----------|-----------------|---------|
-| **A: Compatible agents exist** | Original benchmark supports Harbor-compatible agents (OpenHands, Codex, Claude-Code, Gemini-CLI) | None — run parity with identical settings on both sides | [ADEBench](https://github.com/laude-institute/harbor/tree/main/adapters/adebench) — original benchmark already supports Claude Code |
-| **B: LLM-based, no compatible agents** | Original benchmark is LLM-based but lacks Harbor agents | Fork the original repo, implement Harbor-compatible agents, document in fork's README | [EvoEval](https://github.com/laude-institute/harbor/tree/main/adapters/evoeval) — forked repo to add codex agent support |
-| **C: Custom agents** | Original benchmark uses custom agents unavailable in Harbor | Implement custom agent in `adapters/<adapter-name>/`. Also run with standard agents (Codex, Claude-Code) to show generalization | [MedAgentBench](https://github.com/laude-institute/harbor/tree/main/adapters/medagentbench) — implements custom HTTPAgent matching original GET/POST/FINISH semantics |
+| **A: Compatible agents exist** | Original benchmark supports Harbor-compatible agents (OpenHands, Codex, Claude-Code, Gemini-CLI) | None — run parity with identical settings on both sides | [ADEBench](https://github.com/harbor-framework/harbor/tree/main/adapters/adebench) — original benchmark already supports Claude Code |
+| **B: LLM-based, no compatible agents** | Original benchmark is LLM-based but lacks Harbor agents | Fork the original repo, implement Harbor-compatible agents, document in fork's README | [EvoEval](https://github.com/harbor-framework/harbor/tree/main/adapters/evoeval) — forked repo to add codex agent support |
+| **C: Custom agents** | Original benchmark uses custom agents unavailable in Harbor | Implement custom agent in `adapters/<adapter-name>/`. Also run with standard agents (Codex, Claude-Code) to show generalization | [MedAgentBench](https://github.com/harbor-framework/harbor/tree/main/adapters/medagentbench) — implements custom HTTPAgent matching original GET/POST/FINISH semantics |

Keep links to any forked repositories and document the approach in the README.

@@ -294,7 +292,7 @@ Create `parity_experiment.json` in your adapter directory.
The file is a JSON ar "number_of_runs": 3, "notes": "None", "original_parity_repo": "https://github.com/user/my-benchmark-fork", - "adapter_pr": ["https://github.com/laude-institute/harbor/pull/123"], + "adapter_pr": ["https://github.com/harbor-framework/harbor/pull/123"], "dataset_pr": ["https://github.com/laude-institute/harbor-datasets/pull/45"], "parity_pr": ["https://huggingface.co/datasets/harborframework/parity-experiments/discussions/12"], "metrics": [ @@ -402,7 +400,7 @@ harbor run -d / ### README requirements -Follow the [adapter README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md). Must document: +Follow the [adapter README template](https://github.com/harbor-framework/harbor/blob/main/docs/adapters/templates/README.md). Must document: - Benchmark bugs discovered and how they were handled - Special treatments (prompt modifications, environment adjustments) - Deviations from the original benchmark and rationale @@ -578,6 +576,6 @@ Key differences: ## Resources - [Harbor docs](/docs/getting-started) — running tasks and jobs -- [Harbor repo](https://github.com/laude-institute/harbor) — examples and configs +- [Harbor repo](https://github.com/harbor-framework/harbor) — examples and configs - [Agent tutorial](/docs/agents) — creating custom agents - [Discord](https://discord.com/invite/6xWPKhGDbA) — `#adapters-spam` for questions From 965104ccc67a9892fa7ea717dfed595b7c4bfb4d Mon Sep 17 00:00:00 2001 From: Crystal Zhou Date: Sat, 28 Mar 2026 23:19:53 -0400 Subject: [PATCH 09/10] Fix a char --- content/docs/datasets/adapters.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/datasets/adapters.mdx b/content/docs/datasets/adapters.mdx index f6210f6..d01e07f 100644 --- a/content/docs/datasets/adapters.mdx +++ b/content/docs/datasets/adapters.mdx @@ -11,7 +11,7 @@ This page is the comprehensive spec optimized for AI agents. 
For a concise walkt ## Purpose -An adapter translates an existing benchmark into Harbor's task format. This document is the authoritative reference for building one. Follow steps 1–9 in order. +An adapter translates an existing benchmark into Harbor's task format. This document is the authoritative reference for building one. Follow steps 1-9 in order. Check the [Adapter List](https://docs.google.com/spreadsheets/d/1mJbiASPm32DDNzEnV6eDGwpEf3FlMUe5dhkZmFjjSoo/edit?gid=0#gid=0) for available benchmarks. Contact [Lin Shi](mailto:ls2282@cornell.edu) or join [Discord](https://discord.com/invite/6xWPKhGDbA) `#adapters-announcements` for coordination. The team covers API costs for parity experiments. From ca163ebb2874cdb6d6fca2013171d961047fefca Mon Sep 17 00:00:00 2001 From: Crystal Zhou Date: Mon, 6 Apr 2026 11:20:42 -0400 Subject: [PATCH 10/10] update registry dataset instructions --- content/docs/datasets/adapters-human.mdx | 43 ++++++++-------- content/docs/datasets/adapters.mdx | 64 +++++++++++++++++------- 2 files changed, 68 insertions(+), 39 deletions(-) diff --git a/content/docs/datasets/adapters-human.mdx b/content/docs/datasets/adapters-human.mdx index a88f1a0..106c61a 100644 --- a/content/docs/datasets/adapters-human.mdx +++ b/content/docs/datasets/adapters-human.mdx @@ -94,6 +94,10 @@ Each generated task should look like this: + +Every generated task must satisfy Harbor's [task format](/docs/tasks) — most importantly, **each `task.toml` needs a `name` field**. Harbor uses this name to identify the task when it's added to a dataset, so your adapter code must emit a valid, unique name for every task it generates. Sanitize upstream identifiers (lowercase, replace spaces/slashes/special characters with hyphens) so the resulting names are stable and registry-safe. See [§8 Tips](#8-register-the-dataset) for the full naming guidance. 
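The sanitization convention just described (lowercase; spaces, slashes, and special characters become hyphens) can be sketched in a few lines of Python. Note `sanitize_task_name` is a hypothetical helper shown only for illustration, not part of the Harbor API:

```python
import re

def sanitize_task_name(raw: str) -> str:
    """Normalize an upstream identifier into a stable, registry-safe
    task name: lowercase, collapse runs of non-alphanumeric characters
    into single hyphens, and strip leading/trailing separators."""
    return re.sub(r"[^a-z0-9]+", "-", raw.lower()).strip("-")
```

Because the transformation is deterministic, running the same upstream identifier through it always yields the same name, which keeps generated task names stable across adapter runs.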
+
+

### 2.3 Adapter code structure

Your adapter lives in `harbor/adapters/{adapter-name}/` as a Python package (generated by `harbor adapter init`):
@@ -189,7 +193,7 @@ Depending on your benchmark, you'll fall into one of three scenarios:

**Scenario C — Custom agents:** Implement the custom agent in Harbor under `adapters/{name}/` and run parity experiments with the custom agents. Also run experiments with standard agents (Codex, Claude-Code) to show the adapter generalizes. Example: [MedAgentBench](https://github.com/harbor-framework/harbor/tree/main/adapters/medagentbench) — implements a custom HTTPAgent matching the original GET/POST/FINISH interaction semantics.

-For expensive benchmarks, you can run parity on a representative subset. Discuss sampling strategy with the team first. Use `--split parity` in your adapter and register with `"version": "parity"` so users can run `-d {name}@parity`.
+For expensive benchmarks, you can run parity on a representative subset. Discuss sampling strategy with the team first. Use `--split parity` in your adapter and ask the team to publish the parity subset under the `parity` tag so users can run `-d {name}@parity`. See the versioning tip in [§8](#8-register-the-dataset).

## 5. Run Parity Experiments

@@ -260,7 +264,9 @@ adapters/<adapter-name>/

## 8. Register the Dataset

-### 8.1 Generate dataset
+A dataset is a collection of tasks, and the two have a many-to-many relationship: the same task can live in multiple datasets, and one dataset can aggregate tasks from multiple adapters. Both are namespaced as `{organization}/{name}` — a dataset as `{organization}/{dataset}`, and a task as `{organization}/{task-id}`.
+
+**Step 1:** Generate the dataset directory with your adapter code. Store it in the [GitHub repo](https://github.com/laude-institute/harbor-datasets), or in the [HuggingFace repo](https://huggingface.co/datasets/harborframework/harbor-datasets) if the dataset is too large for GitHub.
```bash
git clone https://github.com/{you}/harbor-datasets.git
cd harbor/adapters/<adapter-name>
uv run python -m <package>.main --output-dir /path/to/harbor-datasets/datasets/<dataset-name>
```

-Generate `dataset.toml`:
+**Step 2:** Generate `dataset.toml` once your generated tasks are finalized.

```bash
cd harbor-datasets/datasets/<dataset-name>
harbor init
-# Select "dataset" when prompted
+# Select "dataset" when prompted, then enter the dataset name as <organization>/<dataset>.
```

-Edit the generated `dataset.toml` to fill in metadata: parity results summary, adapter author credits, and any acknowledgments.
-
-**Version naming:** Use `"1.0"` by default. Follow the original benchmark's naming if it has versions (e.g., "verified", "lite"). Use `"parity"` for parity subsets so users can run `-d <org>/<name>@parity`.
-
-Create a PR to `harbor-datasets`. Request `@Slimshilin` for review.
-
-### 8.2 Test locally
+**Step 3:** Edit the generated `dataset.toml` to fill in the description. Include the parity results summary, adapter author credits, and any acknowledgments.

-Before submitting for publishing, verify with the `-p` path parameter:
+**Step 4:** Verify the dataset runs locally before submitting, using the `-p` (path) parameter:

```bash
harbor run -p /path/to/your/dataset
```

-You cannot test against the registry (using `-d`) until the dataset has been published. Use `-p` (local path) for all pre-publish testing.
+You cannot test against the registry (using `-d`) until the dataset has been published. Use `-p` for all pre-publish testing.

-### 8.3 Submit for publishing
+**Step 5:** Open a PR to `harbor-datasets` with the tasks directory and `dataset.toml`. Request `@Slimshilin` for review. Once approved, the Harbor team will publish the dataset to the registry.
-
-### 8.4 Verify post-publish
-
-Once published, verify it loads and runs correctly:
+**Step 6:** After publishing, verify the dataset loads and runs from the registry:

```bash
harbor run -d <org>/<name>
```

+**Tips:**
+
+- **Authors:** if there are many benchmark authors, list the first authors only.
+- **Organization:** the `organization` namespace disambiguates tasks that share a name across adapters. Prefer the benchmark's owning organization (e.g., `openai/mmmlu`). If there's no clear single owner or there are multiple, use the benchmark name itself as the org (e.g., `terminal-bench/terminal-bench`).
+- **Task names:** every task must have a `name` field in `task.toml` to be included in a dataset. If the original benchmark lacks stable identifiers, create your own deterministic scheme (e.g., `{dataset}-1`, `{dataset}-2`, ...).
+- **Versioning:** dataset versions are **publish-time tags**. Tell the Harbor team in your PR which tag you'd like (e.g., `v1.0`, `parity`) and they'll apply it. Users then resolve a specific version via `-d <organization>/<dataset>@<version>`.

## 9. Document & Submit

Fill out the [README template](https://github.com/harbor-framework/harbor/blob/main/docs/adapters/templates/README.md) covering:
@@ -315,7 +318,7 @@ Fill out the [README template](https://github.com/harbor-framework/harbor/blob/m
- Agent implementation details
- Known limitations

-Create `adapter_metadata.json` ([see format in full docs](/docs/datasets/adapters#9-document-and-submit)).
+Create `adapter_metadata.json` ([see format in full docs](/docs/datasets/adapters#step-9-document-and-submit)).

When ready, update your PR title from `[WIP]` to `[Ready for Review]` and request review from `@Slimshilin`.
diff --git a/content/docs/datasets/adapters.mdx b/content/docs/datasets/adapters.mdx
index d01e07f..c7c3837 100644
--- a/content/docs/datasets/adapters.mdx
+++ b/content/docs/datasets/adapters.mdx
@@ -41,6 +41,8 @@ harbor adapter init my-adapter --name "My Name" # non-interactive scaffold
└── test_*.py # (optional) pytest test files
```

+**Task naming requirement:** Every generated `task.toml` **must** contain a `name` field. Harbor uses this field to identify the task when it's added to a dataset — tasks without a `name` cannot be registered. Adapter code is responsible for deriving a valid, unique, registry-safe name for every task: sanitize upstream identifiers (lowercase, replace spaces/slashes/special characters with hyphens). See [§Step 8 Naming rules](#naming-rules) for the full naming contract, and the [task format](/docs/tasks) for the rest of the task structure. Each generated directory must contain at minimum `task.toml`, `instruction.md`, `environment/Dockerfile`, `solution/solve.sh`, and `tests/test.sh`.
+
### Adapter code directory

Generated by `harbor adapter init`, this is a Python package using `src` layout:
@@ -222,7 +224,7 @@ Keep links to any forked repositories and document the approach in the README.
If running the full benchmark is too expensive, run parity on a representative subset. Requirements:
- Document in README how the subset was selected and that parity ran on a subset.
- Support `--split parity` in `main.py` to generate only the parity subset.
-- Use version `"parity"` in `dataset.toml` so users can run `-d <name>@parity`.
+- Ask the team to publish the parity subset under the `parity` tag so users can run `-d <name>@parity`. See [Versioning](#versioning) below.

```bash
uv run python -m <package>.main --split parity --output-dir /path/to/output # parity subset
```

@@ -353,28 +355,26 @@ adapters/

## Step 8.
Register the Dataset

-### 8.1 Generate dataset
+A dataset is a collection of tasks with a **many-to-many** relationship: the same task can appear in multiple datasets, and a dataset can aggregate tasks from multiple adapters. Both datasets and tasks are namespaced as `{organization}/{name}` — a dataset as `{organization}/{dataset}`, and a task as `{organization}/{task-id}`.
+
+**Step 1.** Generate the dataset directory with your adapter. Store it in the [harbor-datasets GitHub repo](https://github.com/laude-institute/harbor-datasets), or the [HuggingFace mirror](https://huggingface.co/datasets/harborframework/harbor-datasets) if it's too large for GitHub.

```bash
-# Fork and clone
git clone https://github.com/{your-github-username}/harbor-datasets.git
-
-# Generate tasks
-cd harbor/adapters/<adapter-name>
-uv run python -m <package>.main --output-dir /path/to/harbor-datasets/datasets/<dataset-name>
-
-# Create dataset.toml
-cd /path/to/harbor-datasets/datasets/<dataset-name>
-harbor init # select "dataset" when prompted
+cd harbor/adapters/<adapter-name>
+uv run python -m <package>.main --output-dir /path/to/harbor-datasets/datasets/<dataset-name>
```

-Edit `dataset.toml` to include: parity results summary, adapter author credits, acknowledgments.
+**Step 2.** Create `dataset.toml` at the root of the dataset directory (e.g., `harbor-datasets/datasets/<dataset-name>/dataset.toml`).

-**Version naming:** Use `"1.0"` by default. Follow original benchmark naming if applicable (e.g., "verified", "lite"). Use `"parity"` for parity subsets.
+```bash
+cd /path/to/harbor-datasets/datasets/<dataset-name>
+harbor init # select "dataset"; enter the dataset name as <organization>/<dataset>
+```

-Create a PR to `harbor-datasets`. Request `@Slimshilin` for review.
+**Step 3.** Edit `dataset.toml` to fill in the description: parity results summary, adapter author credits (first authors only if the list is long), and acknowledgments.
-### 8.2 Test locally
+**Step 4.** Verify the dataset runs locally before submitting, using the `-p` (path) parameter:

```bash
harbor run -p /path/to/your/dataset
```

**Note:** Registry testing (`-d`) is only available after publishing. Use `-p` for all pre-publish testing.

-### 8.3 Submit for publishing
+**Step 5.** Open a PR to `harbor-datasets` with the tasks directory and `dataset.toml`. Request `@Slimshilin` for review. The Harbor team publishes the dataset to the registry after approval.

-Include tasks directory and `dataset.toml` in your adapter PR. The Harbor team publishes after approval.
-
-### 8.4 Verify post-publish
+**Step 6.** After publishing, verify the dataset loads and runs from the registry:

```bash
harbor run -d <org>/<name>
```

+#### Naming rules
+
+| Rule | Requirement |
+|------|-------------|
+| Dataset ID | `<organization>/<dataset>` — e.g., `openai/mmmlu`. Entered interactively during `harbor init`. |
+| Task ID | `<organization>/<task-name>` — every generated `task.toml` **must** contain a `name` field. Tasks without a name cannot be added to a dataset. |
+| Choosing `<organization>` | Prefer the benchmark's owning organization (e.g., `openai/mmmlu`). If there is no clear single owner or there are multiple, use the benchmark name itself as the organization (e.g., `terminal-bench/terminal-bench`). |
+| Name stability | Task names must be **unique** within the dataset and **stable** across adapter runs. Unstable names churn registry digests on republish. |
+| Fallback scheme | If the upstream benchmark lacks stable task identifiers, mint a deterministic scheme in adapter code (e.g., `{dataset}-1`, `{dataset}-2`, ...) derived from a reproducible sort of upstream tasks. |
+| Sanitization | Sanitize upstream identifiers before using them as names: lowercase, replace spaces/slashes/special characters with hyphens, avoid leading/trailing separators. |
+
+**Agent instruction:** before writing `dataset.toml`, verify every generated `task.toml` contains a `name` field. If any are missing, fix `main.py` and regenerate — do not hand-edit generated task directories. Treat `main.py` as the source of truth for task names.
+
+#### Versioning
+
+Dataset versions are **publish-time tags**, not a field in `dataset.toml`. The Harbor team applies tags when publishing to the registry. Users resolve a specific version with `-d <organization>/<dataset>@<version>`. Every publish also receives the `latest` tag automatically, so `-d <organization>/<dataset>` (no `@<version>`) always points at the newest release.
+
+| Tag | When to use |
+|-----|-------------|
+| `v1.0` | Default for the first release |
+| `v1.1`, `v2.0`, ... | Subsequent releases; previous tags stay pinned to their snapshots |
+| `verified`, `lite`, ... | Mirror upstream naming when the original benchmark has named splits |
+| `parity` | Parity subset (generated via `--split parity`) |
+
+To request a version, state the desired tag(s) in your adapter PR description. To cut a new version later (e.g., a bug fix), open a follow-up PR and request the new tag.
+
+**Agent instruction:** do **not** add a `version` key to `dataset.toml` to control the published version — that does nothing. Do **not** change `version = "1.0"` in `task.toml`; that's the task-config schema version and must stay `"1.0"`. The only way to select a version is to request a tag in the PR description.

**Step complete when:** Dataset is published to the registry, `harbor run -d <org>/<name>` passes oracle tests, and the PR to `harbor-datasets` is merged.

---
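The registry reference format used by `-d` (an organization and dataset name with an optional version tag, defaulting to `latest`) can be made concrete with a tiny parser. `parse_dataset_ref` is a hypothetical illustration of the format only, not Harbor's actual resolver:

```python
def parse_dataset_ref(ref: str) -> tuple[str, str, str]:
    """Split a registry reference of the form org/dataset[@version]
    into (org, dataset, version); version defaults to "latest"."""
    path, _, version = ref.partition("@")
    org, _, dataset = path.partition("/")
    if not (org and dataset):
        raise ValueError(f"expected org/dataset[@version], got {ref!r}")
    return org, dataset, version or "latest"
```

For example, `openai/mmmlu@parity` names the parity subset, while a bare `terminal-bench/terminal-bench` resolves to whatever the `latest` tag points at.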