Basic research harness

A small agent-engineering harness that turns a research topic into an inspectable JSON bundle of candidate sources and candidate claims, built two ways: a raw Python agent loop and a Claude Code SDK rebuild.

The portfolio claim is about engineering shape, not research output. The same workflow appears once with every decision point in code and once with the procedure, the review step, and the QA gate moved into Claude Code project assets. Putting the two side by side makes the responsibilities of an agent harness visible.

Why this exists

I wanted to see what an agent loop has to handle when nothing is hidden by a framework. A monolithic prompt that returns a finished bundle in one call would hide the moments where decisions actually happen: which queries to run, when to stop searching, which claims to keep, and how to attach provenance back to the sources that justified them.

Then I wanted to see what changes when those mechanics move out of Python and into Claude Code SDK project assets. A Skill encodes the bundle procedure before the agent writes anything. A sub-agent owns the review step. A PostToolUse hook enforces structural validation after every write. The comparison is the point of the project, not the topic any single run was asked about.

What it demonstrates

A three-stage agent loop with one tool per stage: search, extract, bundle
Source-to-claim provenance recorded structurally: every claim cites integer source IDs that must exist in the bundle's sources array
Explicit supported, weak, and unsupported confidence labels per claim, so model uncertainty stays visible in the output
A schema-versioned JSON bundle that any tool can read without custom parsing
An iteration guard plus an explicit stop tool, so the loop has a visible termination condition rather than an implicit one
A JSONL trace of every tool call, with the final line recording whether the run completed normally or hit the iteration cap
A Claude Code Skill, sub-agent, and PostToolUse hook that re-express the same procedure as project assets
A standalone bundle validator that checks structure and provenance integrity for either path
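
As a rough illustration of the trace item above, the final line of a JSONL trace can be read to tell how a run ended. This is a minimal sketch, not the harness's own code: the `status` field and the values `completed` and `iteration_cap` are assumptions made for the example, and the real trace format is whatever harness_raw.py writes.

```python
import json
from typing import Iterable


def final_trace_record(lines: Iterable[str]) -> dict:
    """Return the last non-empty JSONL record from an iterable of lines."""
    last = None
    for line in lines:
        if line.strip():
            last = line
    if last is None:
        raise ValueError("trace is empty")
    return json.loads(last)


def run_outcome(lines: Iterable[str]) -> str:
    """Report how the run ended, using a hypothetical 'status' field."""
    record = final_trace_record(lines)
    return record.get("status", "unknown")
```

An open file handle is an iterable of lines, so a trace file can be passed to `run_outcome` directly inside a `with open(...)` block.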

What this is and is not

This is a learning harness for agent engineering. It exercises search, extraction, provenance, and artifact validation end to end.

It does not verify that scientific claims are true. Confidence labels are model-assigned observations from gathered snippets. The hook validates structure and provenance integrity, not medical or scientific correctness. The bundle is a nomination artifact, not a truth verdict.

Implementations

harness_raw.py owns the loop directly. It defines tool schemas, stores state, calls Brave Search, extracts claims with Claude, validates claim structure, writes bundles, and logs JSONL traces.

harness_sdk.py delegates the workflow to the Claude Code SDK. The procedure lives in .claude/skills/research-bundler/SKILL.md. The reviewer lives in .claude/agents/source-auditor.md. The QA gate lives in .claude/hooks/validate_bundle.py.

Both paths keep the same bundle schema so their responsibilities can be compared directly.

Project structure

basic-research-harness/
├── harness_raw.py
├── harness_sdk.py
├── DECISIONS.md
├── SDK_DECISIONS.md
├── COMPARISON.md
├── examples/
│   └── fda-jak-inhibitors-boxed-warning.sample.json
├── .claude/
│   ├── skills/research-bundler/SKILL.md
│   ├── agents/source-auditor.md
│   ├── hooks/validate_bundle.py
│   └── settings.json
├── output/
└── traces/

output/ and traces/ are generated artifacts and are gitignored.

examples/ contains a tracked sample bundle that passes the validator.

Setup

Create and activate a virtual environment, then install Python dependencies:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Install Claude Code for the SDK path:

npm install -g @anthropic-ai/claude-code

Copy the environment template and add keys:

cp .env.example .env

ANTHROPIC_API_KEY is required. BRAVE_SEARCH_API_KEY is preferred for the raw harness. If Brave is missing or unavailable, the raw harness falls back to Anthropic hosted web search.

Run the raw harness

python harness_raw.py "FDA JAK inhibitors boxed warning"

The raw harness writes a bundle to output/ and a trace to traces/.

Run the SDK harness

python harness_sdk.py "FDA JAK inhibitors boxed warning"

The SDK harness asks the Claude Code SDK to produce a bundle using the project Skill, review it with the source-auditor sub-agent, and run the bundle validator before stopping.

Validate a bundle

Run the validator directly:

python .claude/hooks/validate_bundle.py output/<bundle>.json

The validator checks JSON structure, counts, source fields, claim fields, confidence labels, and whether every claim cites real source IDs.
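
The checks listed above could look roughly like the following. This is a sketch under assumptions, not the contents of validate_bundle.py: the function name `validate_bundle`, the error wording, and the choice to return a list of problems are all illustrative. The field names come from the bundle schema in this README.

```python
ALLOWED_CONFIDENCE = {"supported", "weak", "unsupported"}
SOURCE_FIELDS = ("id", "title", "url", "snippet", "search_provider")
CLAIM_FIELDS = ("text", "source_ids", "confidence")


def validate_bundle(bundle: dict) -> list[str]:
    """Return a list of problems; an empty list means the bundle passed."""
    sources = bundle.get("sources")
    claims = bundle.get("claims")
    if not isinstance(sources, list) or not isinstance(claims, list):
        return ["'sources' and 'claims' must both be lists"]

    errors = []
    # The declared counts must agree with the arrays they describe.
    if bundle.get("source_count") != len(sources):
        errors.append("source_count does not match len(sources)")
    if bundle.get("claim_count") != len(claims):
        errors.append("claim_count does not match len(claims)")

    known_ids = set()
    for i, source in enumerate(sources):
        if not isinstance(source, dict):
            errors.append(f"sources[{i}] is not an object")
            continue
        missing = [f for f in SOURCE_FIELDS if f not in source]
        if missing:
            errors.append(f"sources[{i}] missing fields: {missing}")
        if isinstance(source.get("id"), int):
            known_ids.add(source["id"])

    for i, claim in enumerate(claims):
        if not isinstance(claim, dict):
            errors.append(f"claims[{i}] is not an object")
            continue
        missing = [f for f in CLAIM_FIELDS if f not in claim]
        if missing:
            errors.append(f"claims[{i}] missing fields: {missing}")
        if claim.get("confidence") not in ALLOWED_CONFIDENCE:
            errors.append(f"claims[{i}] has an invalid confidence label")
        ids = claim.get("source_ids")
        if not isinstance(ids, list):
            errors.append(f"claims[{i}].source_ids must be a list")
            continue
        # Provenance integrity: every cited ID must exist in sources.
        for source_id in ids:
            if source_id not in known_ids:
                errors.append(f"claims[{i}] cites unknown source id {source_id}")
    return errors
```

Returning every problem at once, rather than stopping at the first, keeps a failed run inspectable in a single pass.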

Bundle schema

{
  "schema_version": "basic-harness-v0.1",
  "topic": "JAK inhibitor mechanism of action",
  "run_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "timestamp": "2026-05-11T21:30:00.000000+00:00",
  "source_count": 1,
  "claim_count": 1,
  "sources": [
    {
      "id": 0,
      "title": "JAK Inhibitors: Mechanism and Clinical Use",
      "url": "https://example.com/article",
      "snippet": "JAK inhibitors block the Janus kinase pathway.",
      "search_provider": "brave"
    }
  ],
  "claims": [
    {
      "text": "JAK inhibitors block the JAK-STAT signaling pathway",
      "source_ids": [0],
      "confidence": "supported"
    }
  ]
}
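
Because provenance is stored as integer IDs, a reader can join each claim back to its sources with a dictionary lookup. The helper below is a hypothetical illustration (the name `resolve_provenance` and its output shape are not part of the project) that works against the schema shown above.

```python
def resolve_provenance(bundle: dict) -> list[dict]:
    """Pair each claim with the URLs of the sources it cites."""
    sources_by_id = {source["id"]: source for source in bundle["sources"]}
    resolved = []
    for claim in bundle["claims"]:
        resolved.append({
            "text": claim["text"],
            "confidence": claim["confidence"],
            # IDs that do not match any source are skipped here;
            # flagging them is the validator's job, not this helper's.
            "urls": [
                sources_by_id[source_id]["url"]
                for source_id in claim["source_ids"]
                if source_id in sources_by_id
            ],
        })
    return resolved
```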

Example output

A trimmed version of examples/fda-jak-inhibitors-boxed-warning.sample.json (5 sources and 7 claims in full; 1 source and 2 claims shown here):

{
  "schema_version": "basic-harness-v0.1",
  "topic": "FDA JAK inhibitors boxed warning",
  "run_id": "fda-jak-inhibitors-boxed-warning-f8c62028",
  "timestamp": "2026-05-15T17:30:59Z",
  "source_count": 5,
  "claim_count": 7,
  "sources": [
    {
      "id": 0,
      "title": "Janus Kinase (JAK) inhibitors: Drug Safety Communication",
      "url": "https://www.fda.gov/safety/medical-product-safety-information/...",
      "snippet": "FDA required revisions to the Boxed Warning for Xeljanz...",
      "search_provider": "web_search"
    }
  ],
  "claims": [
    {
      "text": "In September 2021, the FDA required revised boxed warnings for Xeljanz, Olumiant, and Rinvoq.",
      "source_ids": [0, 1, 3, 4],
      "confidence": "supported"
    },
    {
      "text": "ORAL Surveillance enrolled RA patients aged 50+ with at least one cardiovascular risk factor.",
      "source_ids": [2],
      "confidence": "weak"
    }
  ]
}

The supported claim cites four sources. The weak claim cites one. Neither is verified as true. The bundle records what was gathered and how confident the model was, not what is correct.

What I learned

Two of the three failure modes I expected before building the loop showed up during development.

Context overflow hit first. The initial version kept full search snippets in the message history and ran out of room before the bundle was ready. The fix was the Write primitive from the context-engineering pattern: gathered sources are offloaded to the bundle file rather than carried in the message history. The loop reads from the bundle when it needs to, instead of from a growing transcript.
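
The offloading idea can be sketched as follows. This is an assumption-laden illustration rather than the code in harness_raw.py: the function name `offload_sources`, the sequential ID assignment, and the summary string are all hypothetical. The point is only that the full snippets go to disk and a one-line summary goes back into the message history.

```python
import json
from pathlib import Path


def offload_sources(bundle_path: Path, new_sources: list[dict]) -> str:
    """Append sources to the on-disk bundle and return a short summary."""
    if bundle_path.exists():
        bundle = json.loads(bundle_path.read_text(encoding="utf-8"))
    else:
        bundle = {"sources": [], "claims": []}

    # Assign sequential integer IDs so claims can cite sources later.
    next_id = len(bundle["sources"])
    for offset, source in enumerate(new_sources):
        bundle["sources"].append({**source, "id": next_id + offset})

    bundle_path.write_text(json.dumps(bundle, indent=2), encoding="utf-8")
    # Only this summary is returned to the conversation, not the snippets.
    return f"stored {len(new_sources)} sources; bundle now holds {len(bundle['sources'])}"
```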

The missing stop condition appeared next. Without an explicit stop tool and an iteration guard, the agent looped past its useful work and re-searched the same ground. Adding a write_bundle tool that returns a written status, paired with a hard iteration cap, made the termination condition inspectable in the trace: a normal run ends with write_bundle returning written, and a stuck run ends with the trace closing on the cap. Both outcomes are explicitly labeled.
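
A minimal sketch of that termination logic, assuming a `choose_tool` callable that stands in for the model call and a dictionary of tool functions. The cap value, the record fields, and the status strings are hypothetical; only the write_bundle tool name and its written status come from the description above.

```python
MAX_ITERATIONS = 10  # hypothetical cap; the real value lives in the harness


def run_loop(choose_tool, tools, max_iterations=MAX_ITERATIONS):
    """Run tool calls until write_bundle reports 'written' or the cap is hit.

    choose_tool(trace) returns (tool_name, kwargs) for the next call.
    The returned trace always ends with an explicitly labeled record.
    """
    trace = []
    for step in range(max_iterations):
        name, kwargs = choose_tool(trace)
        result = tools[name](**kwargs)
        trace.append({"step": step, "tool": name, "result": result})
        # Explicit stop condition: the bundle was written.
        if name == "write_bundle" and result == "written":
            trace.append({"event": "run_end", "status": "completed"})
            return trace
    # Iteration guard: the loop ran out of budget without stopping itself.
    trace.append({"event": "run_end", "status": "iteration_cap"})
    return trace
```

Both exits append a final record, so a reader of the trace never has to infer how the run ended from the absence of further entries.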

The third failure mode I expected, tool-description ambiguity, did not surface in the three-tool version. The names and descriptions are blunt enough that there is no overlap to confuse. I expect this becomes the real risk only when the tool set grows.

The SDK rebuild surfaced a different lesson. Moving the bundle procedure out of Python and into a Skill puts the standard in front of the model before it writes, instead of only catching mistakes at validation time. The PostToolUse hook then enforces the same standard after the artifact is produced. Pre-write specification plus post-write enforcement is more honest than either step alone, and it maps cleanly onto how validated systems are run in regulated work: define expected behavior, produce the artifact, check the artifact, and make any failure mode visible in the record.

Design records

DECISIONS.md records the design decisions for the raw harness.

SDK_DECISIONS.md records the decisions for the SDK rebuild.

COMPARISON.md explains what the raw harness implements manually and what the SDK path moves into project assets.
