Add 2026-05-25 knowledge ingest report #38

Open
orban wants to merge 21 commits into main from nightshift/knowledge-ingest-report-2026-05-25-r3

@orban (Owner) commented May 25, 2026

Summary

  • add the 2026-05-25 knowledge-intake ingestion run report
  • record per-source fetched counts: pinboard=0, papers=0, github=100, email=0, total=100
  • document the Gmail keychain timeout: the email source failed, yet the overall run still exited with code 0

Validation

  • ran `cd ~/dev/knowledge-intake && source .env && bun run src/cli.ts ingest`
  • ran `git diff --check -- docs/reports/2026-05-25-knowledge-intake-ingestion-run.md`
  • no repo-local automated test target was present for this docs-only change

Automated by nightshift

orban added 21 commits February 23, 2026 20:30
…to eval harness

- store run_config in checkpoints; warn on resume if config mismatches
- classify errors as infra/timeout/genuine; skip genuine failures on resume
- add --retry-all flag to override and re-run genuine failures too
- write per-trial JSON to results/trials/ for ls-level observability
- 13 new tests covering all three features

Entire-Checkpoint: 7fb98ee370d1
Entire-Checkpoint: 5384ed119f2a
Entire-Checkpoint: 81278d45e27d
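
The resume behaviour described above (classify each error, skip genuine failures, re-run everything with --retry-all) might look roughly like the sketch below. The marker strings, the `Trial` record, and both function names are hypothetical stand-ins, not the harness's actual code.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical substrings; the real classification rules are not shown in this PR.
TIMEOUT_MARKERS = ("timeout", "timed out")
INFRA_MARKERS = ("docker", "connection refused", "no space left")


def classify_error(message: str) -> str:
    """Return 'timeout', 'infra', or 'genuine' for an error message."""
    lowered = message.lower()
    if any(marker in lowered for marker in TIMEOUT_MARKERS):
        return "timeout"
    if any(marker in lowered for marker in INFRA_MARKERS):
        return "infra"
    return "genuine"


@dataclass
class Trial:
    task_id: str
    error: Optional[str] = None  # None means the trial passed


def trials_to_rerun(checkpoint: List[Trial], retry_all: bool = False) -> List[str]:
    """Pick failed trials to re-run on resume.

    Genuine failures are skipped unless retry_all is set, mirroring
    the --retry-all flag described in the commit message.
    """
    rerun = []
    for trial in checkpoint:
        if trial.error is None:
            continue  # passed trials are never re-run
        if retry_all or classify_error(trial.error) != "genuine":
            rerun.append(trial.task_id)
    return rerun
```

Checking timeout markers before infra markers is a design choice in this sketch, so a message such as "docker exec timed out" lands in the timeout bucket.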
Pre-validation failures trip at task level (all conditions share
Docker setup), other failures trip at task+condition level.
Threshold of 2 accounts for in-flight parallel workers.

Entire-Checkpoint: b2252e9cb3d3
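
A minimal sketch of the two-level breaker described above, assuming failures are simply counted per key. The class and method names are illustrative and are not taken from the harness.

```python
from collections import Counter
from typing import Optional


class CircuitBreaker:
    """Counts failures and trips once a key reaches the threshold.

    Pre-validation failures are keyed by task alone (every condition shares
    the Docker setup); other failures are keyed by (task, condition).
    """

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self._failures: Counter = Counter()

    def record_failure(self, task_id: str, condition: Optional[str],
                       pre_validation: bool = False) -> None:
        key = (task_id, None) if pre_validation else (task_id, condition)
        self._failures[key] += 1

    def is_open(self, task_id: str, condition: str) -> bool:
        """True when work for this task and condition should be skipped."""
        task_level = self._failures[(task_id, None)]
        condition_level = self._failures[(task_id, condition)]
        return max(task_level, condition_level) >= self.threshold
```

With a threshold of 2, a single failure from a worker that was already in flight does not trip the breaker on its own.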
After each batch, classifies failures as infra vs genuine. If infra
failures detected: checks Docker health, restarts if needed, reduces
parallelism, resets circuit breaker, and retries. Max 2 retry rounds.

Entire-Checkpoint: 806ec810be18
Status file (.eval-status.json) updated after each result with
machine-readable state: workers, pass/fail rates, paused flag.
Control dir (.eval-control/) accepts commands: pause, resume,
set-workers N, skip-task <id>. Commands consumed on read.
Enables Ralph Loop or any external agent to manage running evals.

Entire-Checkpoint: 5f55311d2e2a
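
The "consumed on read" behaviour of the control directory could be sketched as follows. The on-disk layout is an assumption made for illustration: one `*.cmd` file per command, containing a line such as `set-workers 4`. The actual file format used by .eval-control/ is not described in the commit message.

```python
from pathlib import Path
from typing import List, Tuple


def consume_commands(control_dir: Path) -> List[Tuple[str, List[str]]]:
    """Read pending commands in filename order and delete each file after reading.

    Deleting on read means a command is applied at most once, even if the
    supervisor polls the directory repeatedly.
    """
    commands = []
    for path in sorted(control_dir.glob("*.cmd")):  # hypothetical naming scheme
        text = path.read_text().strip()
        path.unlink()  # consume the command
        if not text:
            continue  # ignore empty files
        name, *args = text.split()
        commands.append((name, args))
    return commands
```

An external agent would then drop a file into the directory, and the next poll returns entries such as `("set-workers", ["4"])`.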
- Wire fisher_exact_test into reporter for per-task significance testing
- Add _compute_recommendations() flagging ceiling/floor/infra-only tasks
- Add Per-Task Analysis table and Recommendations section to markdown output
- Add lib/monitor.py: polling-based eval supervisor with stall detection,
  Docker recovery, infra-task skipping, and worker scaling
- 79 new tests across stats, reporter, and monitor modules

Entire-Checkpoint: 722abb53d525
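
For readers unfamiliar with the per-task significance test mentioned above, here is a self-contained sketch of a two-sided Fisher exact test on a 2x2 pass/fail table. It is not the repository's fisher_exact_test; the function name and the table layout (rows are conditions, columns are pass and fail counts) are assumptions.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probability of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k: int) -> float:
        # Probability that the top-left cell equals k given fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    tolerance = 1 + 1e-9  # guards against float noise when comparing ties
    return sum(p for p in (prob(k) for k in range(lowest, highest + 1))
               if p <= observed * tolerance)
```

For example, 3 passes and 1 failure under one condition against 1 pass and 3 failures under another gives `fisher_exact_two_sided(3, 1, 1, 3)`, which evaluates to 34/70, so four trials per arm are far from enough to call that difference significant.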
- AGENTbench loader (HuggingFace dataset) and runner (Docker-based eval)
- run-agentbench CLI command with 4 conditions (none/flat/human/intent_layer)
- Dynamic condition discovery in reporter (no more hardcoded condition list)
- Path traversal protection in write_test_infrastructure
- Docker --network none for test isolation
- Thread-safe temp dirs (PID + thread ID)
- Checkpoint batching (every 10 results vs every 1)
- Monitor uses Reporter.INFRA_ERROR_PREFIXES (no drift)
- Set-based infra result filtering (replaces fragile list.remove)
- Empty dict all() guard in pre-validation

Entire-Checkpoint: 815f2403cb52
--network none breaks setup commands that need pip install.
Default to bridge (Docker's default) and expose the parameter
so callers can opt into none for pure-test phases later.

Entire-Checkpoint: 8118f4f3017a
- agentbench_loader: use split="train" (dataset has no "test" split)
- docker_runner: use bash instead of sh (AGENTbench setup uses `source`)
Entire-Checkpoint: 0afc9c1d9169
AGENTbench images install tools like uv to /root/.local/bin which
only gets added to PATH via /etc/profile in login shells. bash -lc
instead of bash -c fixes exit 127 for repos using uv.

Entire-Checkpoint: 5d8c2af91c5d
…ote docker support

Three fixes from run 4 analysis:

- Regression eval: only fail when a golden-passing test now fails (was
  requiring 100% pass rate, which is impossible when repos have 14-83
  pre-existing failures in the baseline)
- strip_docs: preserve README.md variants (setup.py reads them)
- docker_runner: add EVAL_DOCKER_HOST support for remote x86 execution
  via rsync, avoiding QEMU emulation on Apple Silicon

Also removes .index-cache-preserve/ (context files now generated
dynamically per run).

Entire-Checkpoint: 9f4d4a6248d7
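
The revised regression rule (fail only when a golden-passing test now fails) reduces to a set comparison. The sketch below assumes test results are available as name-to-passed mappings and treats a test missing from the current run as failing; both are assumptions, and the function names are illustrative.

```python
from typing import Dict, List


def regressions(golden: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Names of tests that passed on the golden baseline but do not pass now."""
    return sorted(
        name
        for name, passed in golden.items()
        # Assumption: a test absent from the current run counts as failing.
        if passed and not current.get(name, False)
    )


def regression_eval_passes(golden: Dict[str, bool], current: Dict[str, bool]) -> bool:
    """Pre-existing baseline failures are ignored; only new breakage fails the eval."""
    return not regressions(golden, current)
```

Under this rule a repo whose baseline already has failing tests can still pass the regression eval, which the earlier 100% pass-rate requirement made impossible.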
UGREEN NAS (chronos) runs an rsync daemon that intercepts all rsync
connections and rejects paths outside configured modules. tar piped
through SSH bypasses this entirely and works reliably.

Entire-Checkpoint: e86fbbd52da2
The pre-pull step was running `docker pull` locally even when Docker
execution happens on a remote host via SSH. Now uses `ssh $host docker
pull` when EVAL_DOCKER_HOST is configured.

Entire-Checkpoint: cfe3bf0bccca
Prevents two issues from overnight eval runs:
- SSH agent key expiry caused all workers to hang indefinitely on
  stale connections. Added ConnectTimeout, ServerAliveInterval, and
  subprocess timeout=300s so failures surface within 5 minutes.
- sync_from_remote was transferring .venv dirs (4GB+) back from
  chronos. Added excludes for .venv, node_modules, __pycache__, etc.

Entire-Checkpoint: aa17a443b57d
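
A sketch of how the SSH hardening above could be wired into a subprocess call. ConnectTimeout and ServerAliveInterval are real OpenSSH options and the 300 second subprocess timeout comes from the commit message; the specific option values, the added ServerAliveCountMax, and the helper names are assumptions.

```python
import subprocess
from typing import List

# Option values are illustrative; the commit does not state which ones it uses.
SSH_OPTS = [
    "-o", "ConnectTimeout=10",       # give up on unreachable hosts quickly
    "-o", "ServerAliveInterval=15",  # probe the connection every 15 seconds
    "-o", "ServerAliveCountMax=3",   # assumption: drop after 3 missed probes
]


def build_ssh_command(host: str, remote_cmd: str) -> List[str]:
    """Assemble the argv for running remote_cmd on host."""
    return ["ssh", *SSH_OPTS, host, remote_cmd]


def run_remote(host: str, remote_cmd: str,
               timeout: float = 300.0) -> subprocess.CompletedProcess:
    """Run a remote command; raises subprocess.TimeoutExpired after `timeout` seconds."""
    return subprocess.run(
        build_ssh_command(host, remote_cmd),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
```

The keepalive options catch a dead connection at the SSH layer, while the subprocess timeout acts as a backstop for anything that hangs for another reason.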
Replace ephemeral docker run with persistent containers (docker run -d +
docker exec) so setup runs once per task instead of 3x. Add
start_container, exec_in_container, stop_container, copy_into_container
to docker_runner.py.

Fix git "dubious ownership" error caused by macOS tar overlay changing
file UIDs inside containers (CVE-2022-24765). Add safe.directory config
before overlay. Include stderr/stdout tail in setup error messages.

Entire-Checkpoint: c11a4a4ced3d
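
To illustrate the switch from ephemeral to persistent containers, the sketch below builds the docker command lines only and does not execute them. The helper names echo the ones listed in the commit message, but their signatures, the `sleep infinity` keep-alive command, and the use of `docker rm -f` for teardown are assumptions.

```python
from typing import List


def start_container_cmd(image: str, name: str) -> List[str]:
    # Detached container kept alive by a long-running no-op (assumed approach).
    return ["docker", "run", "-d", "--name", name, image, "sleep", "infinity"]


def exec_in_container_cmd(name: str, command: str) -> List[str]:
    # Login shell so PATH entries added via /etc/profile are visible.
    return ["docker", "exec", name, "bash", "-lc", command]


def stop_container_cmd(name: str) -> List[str]:
    return ["docker", "rm", "-f", name]


def task_plan(image: str, name: str, setup: str, trials: List[str]) -> List[List[str]]:
    """Commands for one task: start once, set up once, run each trial, tear down."""
    plan = [start_container_cmd(image, name), exec_in_container_cmd(name, setup)]
    plan += [exec_in_container_cmd(name, trial) for trial in trials]
    plan.append(stop_container_cmd(name))
    return plan
```

With three trials the plan contains a single setup exec rather than three, which is the saving the commit describes.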
Diagnostic findings (see moirai/scripts/analyze_harness_throughput.py):
  * Docker daemon is NOT the serializer (queue_wait p50<1s at parallel=8)
  * LLM generation owns 98% of Claude wall-clock; tool execution is 2%
  * Trial duration heterogeneity (17s-335s) kills effective concurrency

Changes:
  * task_runner/reporter: persist started_at, finished_at, cost_usd,
    docker_invocations, and stream_events on TaskResult so wall-clock
    concurrency and per-Claude-event timing can be reconstructed.
  * docker_runner: capture invoked_at / first_byte_at / finished_at
    per DockerResult. Gap between invoked_at and first_byte_at is
    the Docker daemon queue wait; gap to finished_at is exec time.
  * claude_runner: add stream_events field on ClaudeResult. The
    streaming path records (t, type, tools, token deltas) per
    NDJSON line. Extract _wait_for_proc helper that enforces both
    hard timeout and the new idle_timeout (kills on N seconds
    with no new stream events, so stalled trials stop pinning the
    wall-clock at the hard ceiling).
  * cli: --schedule {fifo,lpt} with --prior-results-dir loads
    median wall_clock_seconds per task_id from prior trial JSONs
    and sorts the work queue longest-first. --idle-timeout wires
    through TaskRunner to claude_runner.
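
The longest-processing-time (LPT) ordering described in the cli bullet can be sketched as below. The field names task_id and wall_clock_seconds come from the commit message; the one-JSON-object-per-file layout, the function names, and the choice to schedule tasks with no prior data first are assumptions.

```python
import json
import statistics
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def load_median_durations(results_dir: Path) -> Dict[str, float]:
    """Median wall-clock seconds per task_id across prior trial JSON files."""
    walls = defaultdict(list)
    for path in results_dir.glob("*.json"):
        record = json.loads(path.read_text())
        walls[record["task_id"]].append(float(record["wall_clock_seconds"]))
    return {task: statistics.median(values) for task, values in walls.items()}


def lpt_order(task_ids: List[str], durations: Dict[str, float]) -> List[str]:
    """Sort the work queue longest-first.

    Assumption: tasks without a prior duration are treated as longest,
    so an unknown task cannot end up as a late straggler.
    """
    unknown = float("inf")
    return sorted(task_ids, key=lambda t: durations.get(t, unknown), reverse=True)
```

Starting the longest trials first keeps a 335 second trial from beginning last and holding the whole batch open while the other workers sit idle.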
Review findings on a30b7aa:

  * claude_runner: extract _kill_and_reap helper that swallows OSError
    from races where the child exits between poll() and kill(). The
    bare proc.kill()/proc.wait() in _wait_for_proc could leak an
    exception that skipped the reader-thread joins below, leaving
    drains stuck on closed pipes.
  * claude_runner: warn when out_reader/err_reader join times out.
    A 5s join with no is_alive() check silently dropped partial
    stdout/stderr exactly at the diagnostic moment (idle-kill or
    hard-timeout) when the buffer is most useful.
  * cli: switch _load_durations_from_dir to statistics.median.
    Previous sorted(walls)[len(walls)//2] returned the upper-middle
    for even N — biased LPT predictions upward (e.g. 87s vs 52s
    on the four-task ansible smoke). New test exercises both odd
    and even N to lock this in.
  * cli: surface skipped trial JSONs from _load_durations_from_dir
    via stderr instead of swallowing per-file decode/IO errors.
    Partial corruption no longer silently biases LPT onto an
    unrepresentative subset.
  * cli: replace the U+26A0 unicode warning sign with a plain "warn:"
    prefix per CLAUDE.md (no emoji unless requested).
  * claude_runner: tighten _wait_for_proc docstring — the safety
    property is "append-only of immutable-after-append dicts," not
    a general thread-safety guarantee.
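
The median fix in the review findings is easy to reproduce in isolation. The durations below are made-up values, not the ansible smoke numbers, and `upper_middle` is a hypothetical name for the previous expression.

```python
import statistics
from typing import List


def upper_middle(walls: List[float]) -> float:
    """The previous approach: index the sorted list at len // 2."""
    return sorted(walls)[len(walls) // 2]


even = [10, 20, 30, 40]
odd = [10, 20, 30]

# For even N the index lands on the upper of the two middle values,
# while statistics.median averages them.
print(upper_middle(even), statistics.median(even))  # 30 25.0
# For odd N both approaches agree.
print(upper_middle(odd), statistics.median(odd))    # 20 20
```

The upward bias only appears for even N, which is why the new test exercises both parities.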
Captures per-source fetched-item counts from the 2026-05-23 ingestion run:
pinboard 20, papers 9, github 100, email 0 (gog keyring timeout). Total 129.

Nightshift-Task: knowledge-ingest
Nightshift-Ref: https://github.com/marcus/nightshift
Nightshift-Task: knowledge-ingest
Nightshift-Ref: https://github.com/marcus/nightshift