fix(worker-ops): recover failed workers before scheduler gating#69
fix(worker-ops): recover failed workers before scheduler gating#69
Conversation
PR Review: fix(worker-ops): recover failed workers before scheduler gatingSummary: This PR fixes a liveness bug where a failed/stale in-flight worker request could remain stuck in PROCESSING indefinitely when the EVM queue is empty. The fix moves _checkForFailedWorkerRequests before the capacity/pending-work gating in executeTransaction(), ensuring recovery runs on every scheduler tick. Core Bug Fix - FlowYieldVaultsEVMWorkerOps.cdcThe fix is correct and well-motivated. Previously, _checkForFailedWorkerRequests was called inside _runScheduler(), which only executes when there are pending EVM requests. Now it runs unconditionally at the top of executeTransaction(), before the capacity and backlog checks. This ensures stale PROCESSING requests are recovered on every scheduler tick, even when the EVM queue is empty. The removal of the duplicate call from _runScheduler() is correct — recovery now runs exactly once per tick. Effort Budgeting - scheduleNextSchedulerExecutionThe new max-based budgeting is a sound addition. trackedRecoveryWorkload is read from scheduledRequests.length after _checkForFailedWorkerRequests() has already run and removed failed entries. This is correct — it budgets for entries still in flight that may need recovery on the next tick, not the ones just recovered. Minor concern: forNumberOfRequests is typed as UInt8, so requestedProcessingWorkload is naturally bounded by 255. scheduledRequests.length is an Int cast to UInt64, so trackedRecoveryWorkload could theoretically exceed a UInt8 (leading to under-budgeting when passed as data: forNumberOfRequests). In practice maxProcessingRequests caps in-flight workers, so this should never trigger, but worth documenting. Effort Constant Changes
The schedulerBaseEffort increase accounts for recovery work now running unconditionally per tick — reasonable. The 75% reduction in schedulerPerRequestEffort (1000 to 250) deserves a comment explaining why it is sufficient. Recovery per entry is lighter than full request processing (mainly a status lookup in the common non-failing case), but the rationale is not documented. A brief inline comment would help future maintainers. Were the workerWithdrawRequestEffort (+50%) and workerDepositRequestEffort (-25%) changes measured empirically on testnet? If so, noting that in the commit message would provide useful context. init_and_schedule.cdcGood: workerHandlerExecutionEffort is now dynamically computed as max(create, withdraw, deposit, close) instead of a hardcoded 5000. This keeps the bootstrap transaction in sync when effort constants change. Good: schedulerExecutionEffort now reads from SCHEDULER_BASE_EFFORT instead of a hardcoded 700. This eliminates a silent divergence between the contract constant and the init transaction. Pre-existing concern (not introduced by this PR): The init path uses only baseEffort for the first scheduler run. On a restart with existing scheduledRequests entries in flight, the first tick will not account for recovery workload in its effort budget. Worth tracking as a follow-up. Test CoverageNo new tests are added for the core scenario: a stale PROCESSING request being recovered when the EVM queue is empty. Given that cadence/tests/error_handling_test.cdc already covers error cases, a test that verifies the scheduler marks a panicked worker request as FAILED with no pending EVM requests would directly validate this fix and prevent regression. Summary
The fix is sound and addresses a real liveness issue. Main asks before merging: (1) a comment explaining the schedulerPerRequestEffort reduction rationale, and (2) ideally a test covering the no-pending-work recovery path. |
Summary
SchedulerHandler.executeTransaction(...)_runScheduler(...)Why
A failed worker request can remain stuck in
PROCESSINGuntil another pending request arrives because_checkForFailedWorkerRequests(...)only ran inside_runScheduler(...), and_runScheduler(...)only executes when the scheduler sees pending work to process.By moving failed-worker recovery ahead of capacity and pending-request gating, stale in-flight entries are cleaned up on every scheduler tick.
Testing
flow test cadence/tests/evm_bridge_lifecycle_test.cdc cadence/tests/access_control_test.cdc cadence/tests/error_handling_test.cdc cadence/tests/validation_test.cdc