A small agent-engineering harness that turns a research topic into an inspectable JSON bundle of candidate sources and candidate claims, built two ways: a raw Python agent loop and a Claude Code SDK rebuild.
The portfolio claim is the engineering shape. The same workflow appears once with every decision point in code and once with the procedure, the review step, and the QA gate moved into Claude Code project assets. Putting the two side by side is what makes the responsibilities of an agent harness visible.
I wanted to see what an agent loop has to handle when nothing is hidden by a framework. A monolithic prompt that returns a finished bundle in one call would hide the moments where decisions actually happen: which queries to run, when to stop searching, which claims to keep, and how to attach provenance back to the sources that justified them.
Then I wanted to see what changes when those mechanics move out of Python and into Claude Code SDK project assets. A Skill encodes the bundle procedure before the agent writes anything. A sub-agent owns the review step. A PostToolUse hook enforces structural validation after every write. The comparison is the point of the project, not the topic any single run was asked about.
- A three-stage agent loop with one tool per stage: search, extract, bundle
- Source-to-claim provenance recorded structurally, with every claim citing integer source IDs in the bundle's sources array
- Explicit supported, weak, and unsupported confidence labels per claim, so model uncertainty stays visible in the output
- A schema-versioned JSON bundle that any tool can read without custom parsing
- An iteration guard plus an explicit stop tool, so the loop has a visible termination condition rather than an implicit one
- A JSONL trace of every tool call, with the final line recording whether the run completed normally or hit the iteration cap
- A Claude Code Skill, sub-agent, and PostToolUse hook that re-express the same procedure as project assets
- A standalone bundle validator that checks structure and provenance integrity for either path
This is a learning harness for agent engineering. It exercises search, extraction, provenance, and artifact validation end to end.
It does not verify that scientific claims are true. Confidence labels are model-assigned observations from gathered snippets. The hook validates structure and provenance integrity, not medical or scientific correctness. The bundle is a nomination artifact, not a truth verdict.
harness_raw.py owns the loop directly. It defines tool schemas, stores state, calls Brave Search, extracts claims with Claude, validates claim structure, writes bundles, and logs JSONL traces.
harness_sdk.py delegates the workflow to the Claude Code SDK. The procedure lives in .claude/skills/research-bundler/SKILL.md. The reviewer lives in .claude/agents/source-auditor.md. The QA gate lives in .claude/hooks/validate_bundle.py.
Both paths keep the same bundle schema so their responsibilities can be compared directly.
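As a rough illustration of what "defines tool schemas" involves on the raw path, the three stages could be declared in the name / description / input_schema shape that the Anthropic Messages API accepts for tools. Only write_bundle is named in this README; search_web, extract_claims, and every parameter below are hypothetical placeholders, not the definitions in harness_raw.py.

```python
# Sketch only: tool names other than write_bundle, and all parameters,
# are assumptions made for illustration.
TOOLS = [
    {
        "name": "search_web",  # hypothetical name for the search stage
        "description": "Run one web search query and return candidate sources.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "extract_claims",  # hypothetical name for the extract stage
        "description": "Extract candidate claims from gathered sources, citing source IDs.",
        "input_schema": {
            "type": "object",
            "properties": {
                "source_ids": {"type": "array", "items": {"type": "integer"}},
            },
            "required": ["source_ids"],
        },
    },
    {
        "name": "write_bundle",  # named in this README as the explicit stop tool
        "description": "Write the final JSON bundle and end the run.",
        "input_schema": {"type": "object", "properties": {}, "required": []},
    },
]

# One tool per stage keeps the descriptions from overlapping.
TOOL_NAMES = [tool["name"] for tool in TOOLS]
```

Keeping one blunt tool per stage is what lets the trace show which stage the loop was in at each step.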
basic-research-harness/
├── harness_raw.py
├── harness_sdk.py
├── DECISIONS.md
├── SDK_DECISIONS.md
├── COMPARISON.md
├── examples/
│   └── fda-jak-inhibitors-boxed-warning.sample.json
├── .claude/
│   ├── skills/research-bundler/SKILL.md
│   ├── agents/source-auditor.md
│   ├── hooks/validate_bundle.py
│   └── settings.json
├── output/
└── traces/
output/ and traces/ are generated artifacts and are gitignored.
examples/ contains a tracked sample bundle that passes the validator.
Create and activate a virtual environment, then install Python dependencies:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Install Claude Code for the SDK path:
npm install -g @anthropic-ai/claude-code
Copy the environment template and add keys:
cp .env.example .env
ANTHROPIC_API_KEY is required. BRAVE_SEARCH_API_KEY is preferred for the raw harness. If Brave is missing or unavailable, the raw harness falls back to Anthropic's hosted web search.
Run the raw harness:
python harness_raw.py "FDA JAK inhibitors boxed warning"
The raw harness writes a bundle to output/ and a trace to traces/.
Run the SDK harness:
python harness_sdk.py "FDA JAK inhibitors boxed warning"
The SDK harness asks the Claude Code SDK to produce a bundle using the project Skill, review it with the source-auditor sub-agent, and run the bundle validator before stopping.
Run the validator directly:
python3 .claude/hooks/validate_bundle.py output/<bundle>.json
The validator checks JSON structure, counts, source fields, claim fields, confidence labels, and whether every claim cites only source IDs that exist in the bundle's sources array.
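The validator's checks could be sketched roughly as follows. This is not the code in .claude/hooks/validate_bundle.py: the function name validate_bundle, the error strings, and the rule that every claim must cite at least one source are assumptions made for illustration. The field names and confidence labels are taken from the bundle examples in this README.

```python
VALID_CONFIDENCE = {"supported", "weak", "unsupported"}
SOURCE_FIELDS = ("id", "title", "url", "snippet", "search_provider")
CLAIM_FIELDS = ("text", "source_ids", "confidence")


def validate_bundle(bundle: dict) -> list:
    """Return a list of problems; an empty list means the bundle passed."""
    errors = []
    for key in ("schema_version", "topic", "run_id", "timestamp"):
        if not isinstance(bundle.get(key), str):
            errors.append(f"missing or non-string field: {key}")

    sources = bundle.get("sources")
    claims = bundle.get("claims")
    if not isinstance(sources, list) or not isinstance(claims, list):
        errors.append("sources and claims must both be lists")
        return errors

    # Declared counts must agree with the arrays they describe.
    if bundle.get("source_count") != len(sources):
        errors.append("source_count does not match len(sources)")
    if bundle.get("claim_count") != len(claims):
        errors.append("claim_count does not match len(claims)")

    known_ids = set()
    for i, source in enumerate(sources):
        if not isinstance(source, dict):
            errors.append(f"sources[{i}] is not an object")
            continue
        missing = [f for f in SOURCE_FIELDS if f not in source]
        if missing:
            errors.append(f"sources[{i}] missing fields: {missing}")
        if isinstance(source.get("id"), int):
            known_ids.add(source["id"])

    for i, claim in enumerate(claims):
        if not isinstance(claim, dict):
            errors.append(f"claims[{i}] is not an object")
            continue
        missing = [f for f in CLAIM_FIELDS if f not in claim]
        if missing:
            errors.append(f"claims[{i}] missing fields: {missing}")
        confidence = claim.get("confidence")
        if not isinstance(confidence, str) or confidence not in VALID_CONFIDENCE:
            errors.append(f"claims[{i}] has invalid confidence: {confidence!r}")
        ids = claim.get("source_ids")
        if not isinstance(ids, list) or not ids:
            # Assumption: a claim with no citation is treated as a failure.
            errors.append(f"claims[{i}] must cite at least one source")
        else:
            unknown = [s for s in ids if not isinstance(s, int) or s not in known_ids]
            if unknown:
                errors.append(f"claims[{i}] cites unknown source ids: {unknown}")
    return errors
```

Returning a list of problems rather than raising on the first one keeps every structural failure visible in a single pass, which suits a QA gate that reports back to an agent.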
{
  "schema_version": "basic-harness-v0.1",
  "topic": "JAK inhibitor mechanism of action",
  "run_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "timestamp": "2026-05-11T21:30:00.000000+00:00",
  "source_count": 1,
  "claim_count": 1,
  "sources": [
    {
      "id": 0,
      "title": "JAK Inhibitors: Mechanism and Clinical Use",
      "url": "https://example.com/article",
      "snippet": "JAK inhibitors block the Janus kinase pathway.",
      "search_provider": "brave"
    }
  ],
  "claims": [
    {
      "text": "JAK inhibitors block the JAK-STAT signaling pathway",
      "source_ids": [0],
      "confidence": "supported"
    }
  ]
}
A trimmed version of examples/fda-jak-inhibitors-boxed-warning.sample.json (5 sources and 7 claims in full; 1 source and 2 claims shown here):
{
  "schema_version": "basic-harness-v0.1",
  "topic": "FDA JAK inhibitors boxed warning",
  "run_id": "fda-jak-inhibitors-boxed-warning-f8c62028",
  "timestamp": "2026-05-15T17:30:59Z",
  "source_count": 5,
  "claim_count": 7,
  "sources": [
    {
      "id": 0,
      "title": "Janus Kinase (JAK) inhibitors: Drug Safety Communication",
      "url": "https://www.fda.gov/safety/medical-product-safety-information/...",
      "snippet": "FDA required revisions to the Boxed Warning for Xeljanz...",
      "search_provider": "web_search"
    }
  ],
  "claims": [
    {
      "text": "In September 2021, the FDA required revised boxed warnings for Xeljanz, Olumiant, and Rinvoq.",
      "source_ids": [0, 1, 3, 4],
      "confidence": "supported"
    },
    {
      "text": "ORAL Surveillance enrolled RA patients aged 50+ with at least one cardiovascular risk factor.",
      "source_ids": [2],
      "confidence": "weak"
    }
  ]
}
The supported claim cites four sources. The weak claim cites one. Neither is verified as true. The bundle records what was gathered and how confident the model was, not what is correct.
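Because claims cite integer IDs rather than repeating URLs, a reader can resolve provenance with a small lookup. The sketch below assumes only the bundle fields shown in this README; the function name resolve_provenance, the output keys urls and missing_ids, and the example.com URLs in the demo bundle are hypothetical.

```python
def resolve_provenance(bundle: dict) -> list:
    """Pair each claim with the URLs of the sources it cites."""
    url_by_id = {source["id"]: source["url"] for source in bundle["sources"]}
    resolved = []
    for claim in bundle["claims"]:
        resolved.append({
            "text": claim["text"],
            "confidence": claim["confidence"],
            # IDs that resolve to a source in this bundle.
            "urls": [url_by_id[i] for i in claim["source_ids"] if i in url_by_id],
            # IDs that point at nothing, e.g. because sources were trimmed out.
            "missing_ids": [i for i in claim["source_ids"] if i not in url_by_id],
        })
    return resolved


# Hypothetical two-source bundle, reduced to the fields the lookup needs.
demo_bundle = {
    "sources": [
        {"id": 0, "url": "https://example.com/a"},
        {"id": 1, "url": "https://example.com/b"},
    ],
    "claims": [
        {"text": "first claim", "source_ids": [0, 1], "confidence": "supported"},
        {"text": "second claim", "source_ids": [1, 4], "confidence": "weak"},
    ],
}
demo_resolved = resolve_provenance(demo_bundle)
```

Run against the trimmed sample above, most cited IDs would land in missing_ids, since the sources they point to were cut from the excerpt.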
Two of the three failure modes I expected before building the loop showed up during development.
Context overflow hit first. The initial version kept full search snippets in the message history and ran out of room before the bundle was ready. The fix was the Write primitive from the context-engineering pattern: gathered sources are offloaded to the bundle file rather than carried in the message history. The loop reads from the bundle when it needs to, instead of from a growing transcript.
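A minimal sketch of that offloading step, assuming the bundle file is plain JSON with a sources array. The helper names offload_sources and load_sources, and the wording of the summary string, are hypothetical and not taken from harness_raw.py.

```python
import json
from pathlib import Path


def offload_sources(bundle_path: Path, new_sources: list) -> str:
    """Append sources to the bundle file and return a short note for the transcript."""
    if bundle_path.exists():
        bundle = json.loads(bundle_path.read_text(encoding="utf-8"))
    else:
        bundle = {"sources": []}

    next_id = len(bundle["sources"])
    for offset, source in enumerate(new_sources):
        # Assign sequential integer IDs so claims can cite them later.
        bundle["sources"].append({**source, "id": next_id + offset})
    bundle_path.write_text(json.dumps(bundle, indent=2), encoding="utf-8")

    added = list(range(next_id, next_id + len(new_sources)))
    # Only this one-line note goes into the message history, not the snippets.
    return f"stored {len(new_sources)} source(s) as ids {added} in {bundle_path.name}"


def load_sources(bundle_path: Path) -> list:
    """Read the gathered sources back from disk when a later stage needs them."""
    return json.loads(bundle_path.read_text(encoding="utf-8"))["sources"]
```

The message history then grows by one short line per search, while the full snippets stay on disk until the extract stage asks for them.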
The missing stop condition appeared next. Without an explicit stop tool and an iteration guard, the agent looped past its useful work and re-searched the same ground. Adding a write_bundle tool that returns a written status, paired with a hard iteration cap, made the termination condition inspectable in the trace: a normal run ends with write_bundle returning written, and a stuck run ends with the trace closing on the cap. Both outcomes are explicitly labeled.
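The termination logic can be sketched with the model call stubbed out. In the sketch below, choose_tool stands in for the model's decision and is an assumption, as are the function name run_loop, the default cap of 8, and the trace field names; only write_bundle and its written status come from this README.

```python
import json

MAX_ITERATIONS = 8  # hypothetical cap; the real value is not stated in this README


def run_loop(choose_tool, tools, trace_path, max_iterations=MAX_ITERATIONS):
    """Run tool calls until write_bundle reports 'written' or the cap is hit."""
    state = {"history": []}
    outcome = "iteration_cap"  # assumed until the stop tool proves otherwise
    with open(trace_path, "w", encoding="utf-8") as trace:
        for step in range(max_iterations):
            name, args = choose_tool(state)  # stand-in for the model's choice
            result = tools[name](**args)
            state["history"].append((name, result))
            # One JSONL line per tool call.
            record = {"step": step, "tool": name, "args": args, "result": result}
            trace.write(json.dumps(record) + "\n")
            if name == "write_bundle" and result.get("status") == "written":
                outcome = "completed"
                break
        # The final line labels how the run ended, whichever way it ended.
        trace.write(json.dumps({"event": "run_end", "outcome": outcome}) + "\n")
    return outcome
```

Defaulting outcome to the cap and overwriting it only on a confirmed write means a stuck run can never be mislabeled as complete.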
The third failure mode I expected, tool-description ambiguity, did not surface in the three-tool version. The names and descriptions are blunt enough that there is no overlap to confuse. I expect this becomes the real risk only when the tool set grows.
The SDK rebuild surfaced a different lesson. Moving the bundle procedure out of Python and into a Skill puts the standard in front of the model before it writes, instead of only catching mistakes at validation time. The PostToolUse hook then enforces the same standard after the artifact is produced. Pre-write specification plus post-write enforcement is more honest than either step alone, and it maps cleanly onto how validated systems are run in regulated work: define expected behavior, produce the artifact, check the artifact, and make any failure mode visible in the record.
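The post-write half of that pairing might look like the sketch below. It assumes the hook receives the tool call as JSON on stdin with a tool_input.file_path field, and that a non-zero exit code of 2 surfaces stderr back to the model; both details should be checked against the Claude Code hooks documentation. The function names and the single provenance check are hypothetical stand-ins, not the contents of .claude/hooks/validate_bundle.py.

```python
import json
import sys
from pathlib import Path
from typing import Optional


def written_bundle_path(event: dict) -> Optional[Path]:
    """Return the written file's path if it looks like a bundle under output/."""
    file_path = (event.get("tool_input") or {}).get("file_path") or ""
    path = Path(file_path)
    if path.suffix == ".json" and "output" in path.parts:
        return path
    return None


def provenance_problems(bundle: dict) -> list:
    """Stand-in check: every claim may cite only IDs present in sources."""
    known_ids = {s.get("id") for s in bundle.get("sources", []) if isinstance(s, dict)}
    problems = []
    for index, claim in enumerate(bundle.get("claims", [])):
        for source_id in claim.get("source_ids", []):
            if source_id not in known_ids:
                problems.append(f"claims[{index}] cites unknown source id {source_id}")
    return problems


def run_hook(raw_event: str) -> int:
    """Return the exit code the hook process should use."""
    event = json.loads(raw_event)
    path = written_bundle_path(event)
    if path is None:
        return 0  # not a bundle write, so there is nothing to check
    try:
        bundle = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        print(f"bundle unreadable: {exc}", file=sys.stderr)
        return 2
    if not isinstance(bundle, dict):
        print("bundle must be a JSON object", file=sys.stderr)
        return 2
    problems = provenance_problems(bundle)
    for problem in problems:
        print(problem, file=sys.stderr)
    return 2 if problems else 0

# A real hook script would finish with: sys.exit(run_hook(sys.stdin.read()))
```

Writing failures to stderr rather than silently fixing the file keeps the failure visible in the record, which is the property the paragraph above argues for.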
DECISIONS.md records the design decisions for the raw harness.
SDK_DECISIONS.md records the decisions for the SDK rebuild.
COMPARISON.md explains what the raw harness implements manually and what the SDK path moves into project assets.