fix(syncing): cut scheduler+discovery CPU regression by juligasa · Pull Request #575 · seed-hypermedia/seed

juligasa · 2026-05-07T15:06:05Z

Summary

A 12-min pprof comparison on main showed 10.5× user CPU, 7.4× heap allocs, 28× GC cycles vs the same daemon 12 minutes earlier. Two compounding sources, both confirmed against current source:

Scheduler rewrite (Apr 29, 81bdc70c). The two-tier priority heap + per-task preemption walked s.tasks linearly under s.mu every timer tick (preempt / cleanup / boundedWake), and allocated context.WithCancel + Progress per peeked task — including ones immediately discarded under saturation.
Discovery peer-select. DiscoverObjectWithProgress scanned the whole peers table and parsed every multiaddr just to filter on Connectedness, keeping at most 20 fallback peers and discarding the rest.

Changes

scheduler.go
- Indexed in-progress hot tasks (ascending heap by hotDeadline) and subscriptions (map by key) so preempt/cleanup/wake are O(log n)/O(1) instead of O(n) over s.tasks.
- markInProgress / markFinished keep the indexes consistent across dispatch, executeTask cleanup, preemption, and heartbeat-expiry cancel paths. markFinished is idempotent so popped-then-cancelled tasks don't double-remove.
- context.WithCancel and Progress allocation moved past the capacity check — saturated peeks no longer allocate.
- time.Timer.Reset moved out of s.mu in the run loop. executeTask no longer takes s.mu to read runCtx/progress — passed via dispatch closure.
discovery.go
- Connected peers come from host.Network().Peers() directly. The bounded DB query (ORDER BY updated_at DESC LIMIT ?) only runs when no peer is currently connected, so the per-row multiaddr parse is bounded too.
- Fallback dial addresses use peerstore.RecentlyConnectedAddrTTL instead of TempAddrTTL — dials may run many seconds later as syncs queue up.
/debug/network: Discovery scheduler section. Renders queue/in-progress sizes plus cumulative subscription/hot preempt counters (atomic.Uint64). The snapshot provider is plumbed through initHTTP as a function arg — no setter-installed Node state — since it's set exactly once at startup.

Test plan

go test -race -count=1 ./backend/hmnet/... clean
All existing scheduler_test.go tests pass (subscription preemption, hot-vs-hot preemption, heartbeat cleanup, hot-tier-beats-cold, retouch promotion, heap-index correctness)
New http_debug_network_test.go cases for the Scheduler section render
Smoke run a daemon, hit /debug/network, confirm the new section renders and the preempt counters update under load
pprof comparison vs the artifacts in .ai/ shows heap-alloc rate ≤ 1.2× baseline and GC count ≤ 1.5× baseline (was 28×)

Notes

Original plan included an AddrInfoFromStringsCached LRU helper and versioned scheduler/discovery benchmarks. Both dropped after review:
- The cache was redundant once the SQL query was bounded — the bounded-LIMIT path already kept parse cost bounded, and a global LRU singleton on a hot-path slice introduced an aliasing footgun for marginal gain.
- The benchmark files weren't wired into CI, so they'd have been documentation rather than enforcement.

Two compounding sources of post-Apr-29 CPU/GC growth: - The scheduler rewrite walked s.tasks linearly under s.mu on every timer tick (preempt/cleanup/wake) and allocated a context.WithCancel per dispatch attempt — including saturated ones that immediately discarded it. This commit: * indexes in-progress tasks separately: ascending heap by hotDeadline for ephemeral hot, map by key for subscriptions, so preempt/cleanup/wake decisions are O(log n)/O(1) instead of O(n); * defers context.WithCancel and Progress allocation until after the capacity check, so saturated peeks don't allocate; * moves time.Timer.Reset out of s.mu in the run loop and drops the redundant lock/unlock at executeTask entry by passing taskCtx and progress in via the dispatch closure. - DiscoverObjectWithProgress scanned the whole peers table and parsed every multiaddr just to filter on Connectedness. Replaced with host.Network().Peers() for the connected fast path; the DB query now runs only when no peers are connected, ORDER BY updated_at DESC LIMIT, so the multiaddr parse cost is bounded. The disconnected fallback uses peerstore.RecentlyConnectedAddrTTL since dials may run seconds later — TempAddrTTL was short enough that addresses could expire before we got to dial. Adds a Discovery scheduler section to /debug/network exposing queue/ in-progress sizes plus cumulative subscription/hot preempt counters (atomic), so we can verify whether preemption fires in production without a metrics scraper. Wired through initHTTP as a function arg rather than as setter-installed Node state, since the snapshot provider is set once at startup.

revive's context-as-argument rule wants context.Context as the first non-receiver parameter. Reorder executeTask(task, taskCtx, prog) → executeTask(taskCtx, task, prog).

juligasa added 2 commits May 7, 2026 17:54

fix(syncing): satisfy context-as-argument lint on executeTask

020aec8

revive's context-as-argument rule wants context.Context as the first non-receiver parameter. Reorder executeTask(task, taskCtx, prog) → executeTask(taskCtx, task, prog).

juligasa force-pushed the daemon-cpu-regression-scheduler-discovery branch from 5eee41d to 020aec8 Compare May 7, 2026 15:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(syncing): cut scheduler+discovery CPU regression#575

fix(syncing): cut scheduler+discovery CPU regression#575
juligasa wants to merge 2 commits into
mainfrom
daemon-cpu-regression-scheduler-discovery

juligasa commented May 7, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

juligasa commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Test plan

Notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

juligasa commented May 7, 2026 •

edited

Loading