fix: exempt non-autonomous agents from heartbeat inactivity timeout #708
Fail-Safe wants to merge 1 commit into RightNow-AI:main
Reactive (non-autonomous) agents wait indefinitely for incoming messages and have no expected self-trigger schedule. Applying an inactivity timeout to them was incorrect: they would be flagged as unresponsive after the default 180s simply for being idle, causing unnecessary crash/recovery cycles.

The fix makes timeout behaviour conditional on agent type:

- Autonomous agents retain the `heartbeat_interval_secs × 2` inactivity check, which is meaningful because they are expected to fire periodically.
- Non-autonomous agents are only flagged when their state is `Crashed`; idle time is irrelevant and no longer checked.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
jaberjaber23 left a comment
Real bug, clean fix, not slop. But needs tests before merge.
The fix correctly exempts reactive agents from inactivity timeouts, since they're designed to sit idle waiting for messages. However:
- No tests. The `check_agents` function is pure and trivially testable. Need at minimum: (a) a reactive agent idle for 5 min is NOT flagged unresponsive, (b) a reactive agent in the `Crashed` state IS flagged, (c) an autonomous agent idle beyond its timeout IS flagged.
- `default_timeout_secs` becomes dead code after this PR; it is never read at runtime. Should document why it's retained or remove it.
- The `warn!` macro now logs `timeout_secs=Some(60)` instead of `timeout_secs=60`. Should unwrap the `Some` in the log since it's guaranteed at that point.
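The three requested cases could be covered along the following lines. This is only a sketch under stated assumptions: `AgentState`, `AgentSnapshot`, and `check_agent` are hypothetical stand-ins rather than the actual types in `heartbeat.rs`, the strict `>` comparison is a guess, and a plain `String` stands in for the `warn!` call. Binding the inner timeout value before formatting also yields `timeout_secs=60` rather than `timeout_secs=Some(60)`.

```rust
// Hypothetical stand-ins: the real types in heartbeat.rs are not shown in this PR page.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AgentState {
    Running,
    Crashed,
}

struct AgentSnapshot {
    state: AgentState,
    /// `Some(interval)` for autonomous agents, `None` for reactive ones.
    heartbeat_interval_secs: Option<u64>,
    /// Seconds since the agent was last active.
    inactive_secs: u64,
}

/// Returns the warning text if the agent should be flagged, `None` if it is healthy.
fn check_agent(agent: &AgentSnapshot) -> Option<String> {
    // Assumption: a crashed agent is flagged regardless of its type.
    if agent.state == AgentState::Crashed {
        return Some("agent crashed".to_string());
    }
    match agent.heartbeat_interval_secs {
        // Autonomous: the inner value is bound here, so the message shows
        // `timeout_secs=60` rather than `timeout_secs=Some(60)`.
        Some(interval) => {
            let timeout_secs = interval.saturating_mul(2);
            if agent.inactive_secs > timeout_secs {
                Some(format!(
                    "agent unresponsive: inactive_secs={} timeout_secs={}",
                    agent.inactive_secs, timeout_secs
                ))
            } else {
                None
            }
        }
        // Reactive: idle time is never a reason to flag.
        None => None,
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    fn agent(state: AgentState, interval: Option<u64>, inactive_secs: u64) -> AgentSnapshot {
        AgentSnapshot { state, heartbeat_interval_secs: interval, inactive_secs }
    }

    #[test]
    fn reactive_idle_five_minutes_is_not_flagged() {
        assert_eq!(check_agent(&agent(AgentState::Running, None, 300)), None);
    }

    #[test]
    fn reactive_crashed_is_flagged() {
        assert!(check_agent(&agent(AgentState::Crashed, None, 0)).is_some());
    }

    #[test]
    fn autonomous_idle_beyond_timeout_is_flagged() {
        let msg = check_agent(&agent(AgentState::Running, Some(30), 61));
        assert_eq!(
            msg.as_deref(),
            Some("agent unresponsive: inactive_secs=61 timeout_secs=60")
        );
    }
}
```

Keeping the check a pure function of a snapshot is what makes these cases cheap to test: no clock or runtime is needed, only the inputs.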
Reviewed and approved. Has merge conflicts with recent heartbeat changes. Please rebase onto current main.
Problem
Non-autonomous (reactive) agents were being flagged as unresponsive and crash-recovered after sitting idle for `default_timeout_secs` (180s), even though idle is their normal state. They wait for incoming messages and have no expected self-trigger schedule.

This caused unnecessary crash/recovery cycles for healthy agents that simply hadn't received a message in the last 3 minutes.
Root Cause
`check_agents` in `heartbeat.rs` applied the same inactivity timeout to all Running agents regardless of type. For agents without an `autonomous` config block, it fell back to `config.default_timeout_secs` (180s). A reactive agent idle for 3+ minutes would be indistinguishable from an autonomous agent that had stalled.

Fix
Make the inactivity check conditional on agent type:
- Autonomous agents retain the `heartbeat_interval_secs × 2` inactivity check, which is meaningful because they are expected to fire periodically.
- Non-autonomous agents are only flagged when their state is `Crashed`; idle time is no longer checked at all.

Testing
Verified with a 5-agent setup (mix of autonomous and reactive). Before the fix, idle reactive agents were logged as unresponsive at `inactive_secs=210`. After the fix, they show `heartbeat OK` at 210s, 240s, and beyond with no crash/recovery cycles.
- `cargo build --workspace --lib` passes
- `cargo clippy --workspace --all-targets -- -D warnings` passes (zero warnings)
- `cargo test --workspace` passes
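The 210s observation can be reproduced with a small before/after comparison of how the timeout is resolved. This is an illustrative sketch, not the code in `heartbeat.rs`: the function names and the strict `>` comparison are assumptions, and an `Option<u64>` interval stands in for the presence or absence of the `autonomous` config block.

```rust
/// Pre-fix resolution (assumed shape): reactive agents fall back to the default timeout.
fn timeout_before(heartbeat_interval_secs: Option<u64>, default_timeout_secs: u64) -> u64 {
    heartbeat_interval_secs
        .map(|interval| interval.saturating_mul(2))
        .unwrap_or(default_timeout_secs)
}

/// Post-fix resolution (assumed shape): only autonomous agents have a timeout at all.
fn timeout_after(heartbeat_interval_secs: Option<u64>) -> Option<u64> {
    heartbeat_interval_secs.map(|interval| interval.saturating_mul(2))
}

/// Pre-fix: any running agent idle past its resolved timeout is flagged.
fn idle_flagged_before(
    heartbeat_interval_secs: Option<u64>,
    default_timeout_secs: u64,
    inactive_secs: u64,
) -> bool {
    inactive_secs > timeout_before(heartbeat_interval_secs, default_timeout_secs)
}

/// Post-fix: an agent with no timeout can never be flagged for idling.
fn idle_flagged_after(heartbeat_interval_secs: Option<u64>, inactive_secs: u64) -> bool {
    timeout_after(heartbeat_interval_secs).map_or(false, |timeout| inactive_secs > timeout)
}
```

For a reactive agent (`None`) with the 180s default, `idle_flagged_before(None, 180, 210)` is true, while `idle_flagged_after(None, 210)` and `idle_flagged_after(None, 240)` are both false. An autonomous agent with a hypothetical 30s interval is still flagged at 61s in both versions. Note that the post-fix path takes no default timeout at all, which is why the reviewer describes `default_timeout_secs` as dead code.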