Add 2026-05-25 knowledge ingest report by orban · Pull Request #38 · orban/intent-layer

orban · 2026-05-25T09:11:28Z

Summary

add the 2026-05-25 knowledge-intake ingestion run report
record per-source fetched counts: pinboard=0, papers=0, github=100, email=0, total=100
document the Gmail keychain timeout: the email source failed, but the overall run still exited with code 0
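The counts above, and the gap between a failing source and a zero exit code, can be sketched as follows. This is only an illustration: `SourceResult`, its `error` field, and `summarize` are hypothetical names, not taken from the knowledge-intake CLI.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SourceResult:
    name: str
    fetched: int
    error: Optional[str] = None  # hypothetical field: None means the source succeeded


def summarize(results: List[SourceResult]) -> dict:
    """Total the fetched counts and list sources that failed."""
    failed = [r.name for r in results if r.error is not None]
    return {
        "total": sum(r.fetched for r in results),
        "failed_sources": failed,
        "degraded": bool(failed),  # true even if the process exit code was 0
    }


# Counts from the 2026-05-25 run described above.
run = [
    SourceResult("pinboard", 0),
    SourceResult("papers", 0),
    SourceResult("github", 100),
    SourceResult("email", 0, error="keychain timeout"),
]
summary = summarize(run)
print(summary)
```

A report built this way would flag the run as degraded without relying on the exit code.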

Validation

ran `cd ~/dev/knowledge-intake && source .env && bun run src/cli.ts ingest`
ran `git diff --check -- docs/reports/2026-05-25-knowledge-intake-ingestion-run.md`
no repo-local automated test target was present for this docs-only change

Automated by nightshift

…to eval harness
- store run_config in checkpoints; warn on resume if config mismatches
- classify errors as infra/timeout/genuine; skip genuine failures on resume
- add --retry-all flag to override and re-run genuine failures too
- write per-trial JSON to results/trials/ for ls-level observability
- 13 new tests covering all three features

Entire-Checkpoint: 7fb98ee370d1
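A minimal sketch of the resume rules in this commit message, under stated assumptions: the marker strings, the result shape (a dict with an `error` key), and all function names are hypothetical, since the harness code is not shown here.

```python
from typing import Dict, List, Optional

# Hypothetical substrings treated as infrastructure problems.
INFRA_MARKERS = ("docker", "connection refused")


def classify_error(message: Optional[str]) -> str:
    """Return 'none', 'timeout', 'infra', or 'genuine' for an error message."""
    if message is None:
        return "none"
    lowered = message.lower()
    if "timeout" in lowered or "timed out" in lowered:
        return "timeout"
    if any(marker in lowered for marker in INFRA_MARKERS):
        return "infra"
    return "genuine"


def config_mismatches(saved: Dict[str, object], current: Dict[str, object]) -> List[str]:
    """Keys whose values differ between the checkpointed and current run_config."""
    return sorted(k for k in saved.keys() | current.keys() if saved.get(k) != current.get(k))


def should_rerun(result: Dict[str, Optional[str]], retry_all: bool = False) -> bool:
    """Re-run infra/timeout failures on resume; genuine ones only with retry_all."""
    kind = classify_error(result.get("error"))
    if kind == "none":
        return False
    if kind == "genuine":
        return retry_all
    return True
```

On resume, a non-empty `config_mismatches(...)` result would trigger the warning, and `retry_all` stands in for the `--retry-all` flag.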

Entire-Checkpoint: 5384ed119f2a

Entire-Checkpoint: 81278d45e27d

Pre-validation failures trip at task level (all conditions share Docker setup), other failures trip at task+condition level. Threshold of 2 accounts for in-flight parallel workers. Entire-Checkpoint: b2252e9cb3d3
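The two trip levels described above might look like this. It is a sketch only: the class and method names are hypothetical, and a pre-validation failure is keyed by task alone so that it trips every condition of that task.

```python
from collections import Counter
from typing import Optional, Tuple

Key = Tuple[str, Optional[str]]


class CircuitBreaker:
    """Counts failures per task (pre-validation) or per task+condition (everything else)."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self._failures: Counter = Counter()

    def record_failure(self, task_id: str, condition: str, pre_validation: bool = False) -> None:
        # Pre-validation failures share Docker setup, so they count against the whole task.
        key: Key = (task_id, None) if pre_validation else (task_id, condition)
        self._failures[key] += 1

    def is_open(self, task_id: str, condition: str) -> bool:
        """True when remaining reps for this task/condition should be skipped."""
        return (
            self._failures[(task_id, None)] >= self.threshold
            or self._failures[(task_id, condition)] >= self.threshold
        )

    def reset(self) -> None:
        self._failures.clear()
```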

After each batch, classifies failures as infra vs genuine. If infra failures detected: checks Docker health, restarts if needed, reduces parallelism, resets circuit breaker, and retries. Max 2 retry rounds. Entire-Checkpoint: 806ec810be18
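The retry loop described above can be sketched as below. Several details are assumptions: `run_batch` and `is_infra` are hypothetical callables, parallelism is halved on each round, and the Docker health check and circuit-breaker reset are reduced to a comment.

```python
from typing import Callable, Dict, List, Optional, Tuple


def run_with_recovery(
    run_batch: Callable[[List[str], int], Dict[str, Optional[str]]],
    tasks: List[str],
    workers: int,
    is_infra: Callable[[str], bool],
    max_rounds: int = 2,
) -> Tuple[Dict[str, Optional[str]], int]:
    """Run tasks, retrying only infra failures for at most max_rounds extra rounds."""
    results: Dict[str, Optional[str]] = {}
    pending = list(tasks)
    rounds = 0
    while True:
        batch = run_batch(pending, workers)  # maps task id -> error message or None
        results.update(batch)
        pending = [t for t in pending if batch[t] is not None and is_infra(batch[t])]
        if not pending or rounds >= max_rounds:
            break
        rounds += 1
        # A real harness would check Docker health and reset the breaker here.
        workers = max(1, workers // 2)
    return results, rounds
```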

Status file (.eval-status.json) updated after each result with machine-readable state: workers, pass/fail rates, paused flag. Control dir (.eval-control/) accepts commands: pause, resume, set-workers N, skip-task <id>. Commands consumed on read. Enables Ralph Loop or any external agent to manage running evals. Entire-Checkpoint: 5f55311d2e2a
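A rough sketch of the status file and the consume-on-read control directory. The on-disk command format is an assumption (one small file per command, processed in filename order), and both function names are hypothetical.

```python
import json
import os
from pathlib import Path
from typing import List


def write_status(path: Path, state: dict) -> None:
    """Write the status file atomically so external readers never see a partial file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)


def consume_commands(control_dir: Path) -> List[str]:
    """Read pending commands in filename order and delete each file once read."""
    commands: List[str] = []
    for entry in sorted(control_dir.iterdir()):
        if entry.is_file():
            commands.append(entry.read_text().strip())
            entry.unlink()  # consumed on read: each command is seen once
    return commands
```

An external agent would drop a file containing, for example, `set-workers 4` into `.eval-control/` and then poll `.eval-status.json` for the effect.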

- Wire fisher_exact_test into reporter for per-task significance testing
- Add _compute_recommendations() flagging ceiling/floor/infra-only tasks
- Add Per-Task Analysis table and Recommendations section to markdown output
- Add lib/monitor.py: polling-based eval supervisor with stall detection, Docker recovery, infra-task skipping, and worker scaling
- 79 new tests across stats, reporter, and monitor modules

Entire-Checkpoint: 722abb53d525
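For readers unfamiliar with the per-task significance test, here is a self-contained two-sided Fisher exact test on a 2x2 pass/fail table. It is a textbook sketch, not the harness's fisher_exact_test, whose signature is not shown here.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided p-value for the table [[a, b], [c, d]].

    Rows are conditions, columns are pass/fail counts. The p-value sums the
    hypergeometric probabilities of all tables (with the same margins) that
    are no more likely than the observed one.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1)) if p <= observed * (1 + 1e-7))
    return min(1.0, total)


# Hypothetical counts: condition A passes 3 of 3 reps, condition B passes 0 of 3.
p_value = fisher_exact_two_sided(3, 0, 0, 3)
print(round(p_value, 3))  # 0.1
```

Even a perfect 3-vs-0 split gives p = 0.1, which illustrates why few repetitions per task rarely reach significance.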

- AGENTbench loader (HuggingFace dataset) and runner (Docker-based eval)
- run-agentbench CLI command with 4 conditions (none/flat/human/intent_layer)
- Dynamic condition discovery in reporter (no more hardcoded condition list)
- Path traversal protection in write_test_infrastructure
- Docker --network none for test isolation
- Thread-safe temp dirs (PID + thread ID)
- Checkpoint batching (every 10 results vs every 1)
- Monitor uses Reporter.INFRA_ERROR_PREFIXES (no drift)
- Set-based infra result filtering (replaces fragile list.remove)
- Empty dict all() guard in pre-validation

Entire-Checkpoint: 815f2403cb52
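One common way to implement the path traversal protection mentioned above is to resolve the target and confirm it stays under the workspace root. The helper name `safe_join` is hypothetical; the actual check in write_test_infrastructure is not shown here.

```python
from pathlib import Path


def safe_join(root: Path, relative: str) -> Path:
    """Join relative onto root, rejecting results that escape root."""
    root = root.resolve()
    target = (root / relative).resolve()
    # An absolute `relative` or a chain of ".." segments lands outside root.
    if target != root and root not in target.parents:
        raise ValueError(f"path escapes workspace: {relative!r}")
    return target
```

A dataset-supplied filename such as `../../etc/passwd` would raise before any file is written.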

Entire-Checkpoint: 94115d70169e

--network none breaks setup commands that need pip install. Default to bridge (Docker's default) and expose the parameter so callers can opt into none for pure-test phases later. Entire-Checkpoint: 8118f4f3017a

- agentbench_loader: use split="train" (dataset has no "test" split)
- docker_runner: use bash instead of sh (AGENTbench setup uses `source`)

Entire-Checkpoint: 0afc9c1d9169

AGENTbench images install tools like uv to /root/.local/bin which only gets added to PATH via /etc/profile in login shells. bash -lc instead of bash -c fixes exit 127 for repos using uv. Entire-Checkpoint: 5d8c2af91c5d
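The network default and login-shell changes from the last few commits can be pictured as a small command builder. `docker_run_argv` is a hypothetical helper; the real run_in_docker signature is not shown here.

```python
from typing import List


def docker_run_argv(image: str, command: str, network: str = "bridge") -> List[str]:
    """Build a `docker run` invocation that executes command in a login shell.

    network defaults to "bridge" so setup steps such as pip install can reach
    the network; callers can pass "none" for isolated test phases.
    `bash -lc` sources /etc/profile, which puts /root/.local/bin on PATH.
    """
    return ["docker", "run", "--rm", "--network", network, image, "bash", "-lc", command]


print(docker_run_argv("python:3.11", "uv --version"))
```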

…ote docker support

Three fixes from run 4 analysis:
- Regression eval: only fail when a golden-passing test now fails (was requiring 100% pass rate, which is impossible when repos have 14-83 pre-existing failures in the baseline)
- strip_docs: preserve README.md variants (setup.py reads them)
- docker_runner: add EVAL_DOCKER_HOST support for remote x86 execution via rsync, avoiding QEMU emulation on Apple Silicon

Also removes .index-cache-preserve/ (context files now generated dynamically per run).

Entire-Checkpoint: 9f4d4a6248d7
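The revised regression rule reduces to a set comparison. In this sketch the function name and the dict shape (test id mapped to pass/fail) are hypothetical, and a test missing from the candidate run is assumed to count as a failure.

```python
from typing import Dict, List


def regressions(golden: Dict[str, bool], candidate: Dict[str, bool]) -> List[str]:
    """Tests that passed on the golden run but do not pass on the candidate run."""
    return sorted(
        test for test, passed in golden.items()
        if passed and not candidate.get(test, False)
    )


golden = {"test_a": True, "test_b": False, "test_c": True}      # test_b fails in the baseline
candidate = {"test_a": True, "test_b": False, "test_c": False}  # test_c is a new failure
print(regressions(golden, candidate))  # ['test_c']
```

Pre-existing baseline failures such as `test_b` no longer fail the evaluation; only `test_c` does.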

UGREEN NAS (chronos) runs an rsync daemon that intercepts all rsync connections and rejects paths outside configured modules. tar piped through SSH bypasses this entirely and works reliably. Entire-Checkpoint: e86fbbd52da2
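A sketch of the tar-over-SSH transfer, assuming `tar` and `ssh` are available on both ends. The helper names and the example remote path are hypothetical; `chronos` is the host named above.

```python
import shlex
import subprocess
from typing import List, Tuple


def tar_over_ssh_argv(local_dir: str, host: str, remote_dir: str) -> Tuple[List[str], List[str]]:
    """Return (pack, unpack) commands; pack's stdout feeds unpack's stdin."""
    pack = ["tar", "-C", local_dir, "-cf", "-", "."]
    remote = shlex.quote(remote_dir)
    unpack = ["ssh", host, f"mkdir -p {remote} && tar -C {remote} -xf -"]
    return pack, unpack


def push(local_dir: str, host: str, remote_dir: str, timeout: float = 300) -> int:
    """Stream local_dir to host:remote_dir; returns a non-zero code on failure."""
    pack, unpack = tar_over_ssh_argv(local_dir, host, remote_dir)
    with subprocess.Popen(pack, stdout=subprocess.PIPE) as producer:
        consumer = subprocess.run(unpack, stdin=producer.stdout, timeout=timeout)
    return producer.returncode or consumer.returncode


pack, unpack = tar_over_ssh_argv("./workspace", "chronos", "/srv/eval/ws")
print(pack, unpack)
```

Because the archive travels over the SSH session's stdin, no rsync connection is made and the daemon's module restrictions never apply.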

The pre-pull step was running `docker pull` locally even when Docker execution happens on a remote host via SSH. Now uses `ssh $host docker pull` when EVAL_DOCKER_HOST is configured. Entire-Checkpoint: cfe3bf0bccca

Prevents two issues from overnight eval runs:
- SSH agent key expiry caused all workers to hang indefinitely on stale connections. Added ConnectTimeout, ServerAliveInterval, and subprocess timeout=300s so failures surface within 5 minutes.
- sync_from_remote was transferring .venv dirs (4GB+) back from chronos. Added excludes for .venv, node_modules, __pycache__, etc.

Entire-Checkpoint: aa17a443b57d
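The SSH hardening can be sketched as a shared option list plus a bounded subprocess call, using the remote `docker pull` from the previous commit as the example command. The option values 10 and 15 are assumptions; only the 300-second subprocess timeout comes from the commit message, and the helper names are hypothetical.

```python
import shlex
import subprocess
from typing import List

# ConnectTimeout bounds the initial handshake; ServerAliveInterval makes ssh
# probe an idle connection so a dead link is eventually noticed.
SSH_OPTS = ["-o", "ConnectTimeout=10", "-o", "ServerAliveInterval=15"]


def ssh_argv(host: str, remote_command: str) -> List[str]:
    return ["ssh", *SSH_OPTS, host, remote_command]


def pull_argv(host: str, image: str) -> List[str]:
    """Pull the image on the remote Docker host rather than locally."""
    return ssh_argv(host, "docker pull " + shlex.quote(image))


def run_remote(host: str, remote_command: str, timeout: float = 300) -> subprocess.CompletedProcess:
    """Run a remote command, failing loudly instead of hanging past timeout."""
    try:
        return subprocess.run(
            ssh_argv(host, remote_command), capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(f"ssh to {host} exceeded {timeout}s") from exc
```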

Replace ephemeral docker run with persistent containers (docker run -d + docker exec) so setup runs once per task instead of 3x. Add start_container, exec_in_container, stop_container, copy_into_container to docker_runner.py. Fix git "dubious ownership" error caused by macOS tar overlay changing file UIDs inside containers (CVE-2022-24765). Add safe.directory config before overlay. Include stderr/stdout tail in setup error messages. Entire-Checkpoint: c11a4a4ced3d
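The persistent-container lifecycle can be outlined with plain command builders. These `*_argv` helpers are hypothetical stand-ins for start_container, exec_in_container, and stop_container, whose real signatures are not shown here; `sleep infinity` assumes an image with GNU coreutils.

```python
import shlex
from typing import List


def start_argv(image: str, name: str) -> List[str]:
    """Start a detached container that stays alive until removed."""
    return ["docker", "run", "-d", "--name", name, image, "sleep", "infinity"]


def exec_argv(name: str, command: str) -> List[str]:
    """Run a command in the already-running container (setup state is reused)."""
    return ["docker", "exec", name, "bash", "-lc", command]


def mark_safe_argv(name: str, workdir: str) -> List[str]:
    """Tell git to trust workdir even if file ownership changed during the copy."""
    return exec_argv(name, f"git config --global --add safe.directory {shlex.quote(workdir)}")


def stop_argv(name: str) -> List[str]:
    return ["docker", "rm", "-f", name]
```

Setup runs once through `exec_argv`, and each condition then reuses the same container instead of paying for a fresh `docker run`.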

Diagnostic findings (see moirai/scripts/analyze_harness_throughput.py):
* Docker daemon is NOT the serializer (queue_wait p50<1s at parallel=8)
* LLM generation owns 98% of Claude wall-clock; tool execution is 2%
* Trial duration heterogeneity (17s-335s) kills effective concurrency

Changes:
* task_runner/reporter: persist started_at, finished_at, cost_usd, docker_invocations, and stream_events on TaskResult so wall-clock concurrency and per-Claude-event timing can be reconstructed.
* docker_runner: capture invoked_at / first_byte_at / finished_at per DockerResult. Gap between invoked_at and first_byte_at is the Docker daemon queue wait; gap to finished_at is exec time.
* claude_runner: add stream_events field on ClaudeResult. The streaming path records (t, type, tools, token deltas) per NDJSON line. Extract _wait_for_proc helper that enforces both hard timeout and the new idle_timeout (kills on N seconds with no new stream events, so stalled trials stop pinning the wall-clock at the hard ceiling).
* cli: --schedule {fifo,lpt} with --prior-results-dir loads median wall_clock_seconds per task_id from prior trial JSONs and sorts the work queue longest-first. --idle-timeout wires through TaskRunner to claude_runner.
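The longest-first (LPT) ordering can be sketched as a sort on median prior duration. The function name and input shape are hypothetical, and placing tasks with no history first is an assumption rather than documented behaviour.

```python
from statistics import median
from typing import Dict, List


def lpt_order(task_ids: List[str], prior_durations: Dict[str, List[float]]) -> List[str]:
    """Sort tasks by median prior wall-clock seconds, longest first."""

    def predicted(task_id: str) -> float:
        walls = prior_durations.get(task_id)
        # Assumption: tasks with no prior data are treated as the longest.
        return median(walls) if walls else float("inf")

    return sorted(task_ids, key=predicted, reverse=True)


# Hypothetical prior durations in seconds.
history = {"a": [17.0, 20.0], "b": [335.0], "c": []}
print(lpt_order(["a", "b", "c"], history))  # ['c', 'b', 'a']
```

Starting the long trials first keeps workers busy at the end of the run instead of waiting on one late 335-second straggler.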

Review findings on a30b7aa:
* claude_runner: extract _kill_and_reap helper that swallows OSError from races where the child exits between poll() and kill(). The bare proc.kill()/proc.wait() in _wait_for_proc could leak an exception that skipped the reader-thread joins below, leaving drains stuck on closed pipes.
* claude_runner: warn when out_reader/err_reader join times out. A 5s join with no is_alive() check silently dropped partial stdout/stderr exactly at the diagnostic moment (idle-kill or hard-timeout) when the buffer is most useful.
* cli: switch _load_durations_from_dir to statistics.median. The previous sorted(walls)[len(walls)//2] returned the upper-middle value for even N, which biased LPT predictions upward (e.g. 87s vs 52s on the four-task ansible smoke). New test exercises both odd and even N to lock this in.
* cli: surface skipped trial JSONs from _load_durations_from_dir via stderr instead of swallowing per-file decode/IO errors. Partial corruption no longer silently biases LPT onto an unrepresentative subset.
* cli: replace \u26a0 unicode warning sign with plain "warn:" prefix per CLAUDE.md (no emoji unless requested).
* claude_runner: tighten _wait_for_proc docstring: the safety property is "append-only of immutable-after-append dicts," not a general thread-safety guarantee.
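The median bias is easy to reproduce with a small even-length sample. The durations below are hypothetical and are not the values from the ansible smoke run.

```python
from statistics import median

walls = [10.0, 20.0, 30.0, 40.0]  # hypothetical wall-clock seconds, even N

upper_middle = sorted(walls)[len(walls) // 2]  # previous approach: picks index 2
true_median = median(walls)                    # averages the two middle values

print(upper_middle, true_median)  # 30.0 25.0

# For odd N both approaches agree.
odd = [10.0, 20.0, 30.0]
assert sorted(odd)[len(odd) // 2] == median(odd) == 20.0
```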

Captures per-source fetched-item counts from the 2026-05-23 ingestion run: pinboard 20, papers 9, github 100, email 0 (gog keyring timeout). Total 129. Nightshift-Task: knowledge-ingest Nightshift-Ref: https://github.com/marcus/nightshift

Nightshift-Task: knowledge-ingest Nightshift-Ref: https://github.com/marcus/nightshift

orban added 21 commits February 23, 2026 20:30

default eval model to sonnet for reproducibility

4aae36a

Entire-Checkpoint: 5384ed119f2a

bump default parallelism from 2 to 8 workers

b981018

Entire-Checkpoint: 81278d45e27d

add circuit breaker to skip remaining reps after repeated failures

d5bc415

Pre-validation failures trip at task level (all conditions share Docker setup), other failures trip at task+condition level. Threshold of 2 accounts for in-flight parallel workers. Entire-Checkpoint: b2252e9cb3d3

fix stale test references to renamed setup_workspace method

785c257

Entire-Checkpoint: 94115d70169e

add network parameter to run_in_docker, default to bridge

b53c614

--network none breaks setup commands that need pip install. Default to bridge (Docker's default) and expose the parameter so callers can opt into none for pure-test phases later. Entire-Checkpoint: 8118f4f3017a

fix agentbench loader split and docker shell compatibility

2d42eab

- agentbench_loader: use split="train" (dataset has no "test" split)
- docker_runner: use bash instead of sh (AGENTbench setup uses `source`)

Entire-Checkpoint: 0afc9c1d9169

use login shell in docker to pick up ~/.local/bin PATH

9ab8632

AGENTbench images install tools like uv to /root/.local/bin which only gets added to PATH via /etc/profile in login shells. bash -lc instead of bash -c fixes exit 127 for repos using uv. Entire-Checkpoint: 5d8c2af91c5d

pull docker images on remote host when EVAL_DOCKER_HOST is set

2c9ea2e

The pre-pull step was running `docker pull` locally even when Docker execution happens on a remote host via SSH. Now uses `ssh $host docker pull` when EVAL_DOCKER_HOST is configured. Entire-Checkpoint: cfe3bf0bccca

add knowledge-intake ingestion run report

250a20c

Captures per-source fetched-item counts from the 2026-05-23 ingestion run: pinboard 20, papers 9, github 100, email 0 (gog keyring timeout). Total 129. Nightshift-Task: knowledge-ingest Nightshift-Ref: https://github.com/marcus/nightshift

Add 2026-05-25 knowledge ingest report

ea7e99d

Nightshift-Task: knowledge-ingest Nightshift-Ref: https://github.com/marcus/nightshift


Add 2026-05-25 knowledge ingest report #38
orban wants to merge 21 commits into main from nightshift/knowledge-ingest-report-2026-05-25-r3
