
fix(worker-ops): recover failed workers before scheduler gating #69

Closed

liobrasil wants to merge 4 commits into main from lionel/fix-worker-recovery-before-gating
Conversation

@liobrasil
Collaborator

Summary

  • run failed-worker recovery at the top of SchedulerHandler.executeTransaction(...)
  • remove the duplicate recovery call from _runScheduler(...)
  • stop failed/stale in-flight worker entries from having to wait for fresh pending work before being marked failed

Why

A failed worker request could remain stuck in PROCESSING until another pending request arrived, because _checkForFailedWorkerRequests(...) only ran inside _runScheduler(...), and _runScheduler(...) only executes when the scheduler sees pending work to process.

By moving failed-worker recovery ahead of capacity and pending-request gating, stale in-flight entries are cleaned up on every scheduler tick.

Testing

  • flow test cadence/tests/evm_bridge_lifecycle_test.cdc cadence/tests/access_control_test.cdc cadence/tests/error_handling_test.cdc cadence/tests/validation_test.cdc

@liobrasil liobrasil requested a review from nvdtf March 11, 2026 23:41

claude bot commented Mar 11, 2026

PR Review: fix(worker-ops): recover failed workers before scheduler gating

Summary: This PR fixes a liveness bug where a failed/stale in-flight worker request could remain stuck in PROCESSING indefinitely when the EVM queue is empty. The fix moves _checkForFailedWorkerRequests before the capacity/pending-work gating in executeTransaction(), ensuring recovery runs on every scheduler tick.


Core Bug Fix - FlowYieldVaultsEVMWorkerOps.cdc

The fix is correct and well-motivated. Previously, _checkForFailedWorkerRequests was called inside _runScheduler(), which only executes when there are pending EVM requests. Now it runs unconditionally at the top of executeTransaction(), before the capacity and backlog checks. This ensures stale PROCESSING requests are recovered on every scheduler tick, even when the EVM queue is empty.

The removal of the duplicate call from _runScheduler() is correct — recovery now runs exactly once per tick.
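The reordering described above can be sketched as follows. This is a hypothetical reconstruction, not the actual contract code: the function and field names (executeTransaction, _checkForFailedWorkerRequests, _runScheduler, scheduledRequests, maxProcessingRequests, scheduleNextSchedulerExecution) are taken from the review, while the surrounding structure and the gating conditions are assumptions.

```cadence
access(all) fun executeTransaction(id: UInt64, data: AnyStruct?) {
    // 1. Recovery now runs on every tick, before any gating, so stale
    //    PROCESSING entries are cleaned up even when the EVM queue is empty.
    self._checkForFailedWorkerRequests()

    // 2. Capacity / pending-work gating happens only afterwards.
    if self.scheduledRequests.length >= Int(self.maxProcessingRequests) {
        self.scheduleNextSchedulerExecution()
        return
    }
    if self.pendingRequestCount() == 0 {
        self.scheduleNextSchedulerExecution()
        return
    }

    // 3. _runScheduler no longer calls recovery itself, so recovery
    //    executes exactly once per tick.
    self._runScheduler()
}
```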


Effort Budgeting - scheduleNextSchedulerExecution

The new max-based budgeting is a sound addition. trackedRecoveryWorkload is read from scheduledRequests.length after _checkForFailedWorkerRequests() has already run and removed failed entries. This is correct — it budgets for entries still in flight that may need recovery on the next tick, not the ones just recovered.

Minor concern: forNumberOfRequests is typed as UInt8, so requestedProcessingWorkload is naturally bounded by 255. scheduledRequests.length is an Int cast to UInt64, so trackedRecoveryWorkload could theoretically exceed a UInt8 (leading to under-budgeting when passed as data: forNumberOfRequests). In practice maxProcessingRequests caps in-flight workers, so this should never trigger, but worth documenting.
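The bound in question can be made explicit with a small sketch. The names follow the review; the enclosing function and the final clamp are assumptions (the clamp is a suggested hardening, not something the PR contains):

```cadence
access(self) fun budgetNextTick(forNumberOfRequests: UInt8) {
    // Entries still in flight after recovery ran; these may need
    // recovery on the next tick.
    let trackedRecoveryWorkload = UInt64(self.scheduledRequests.length)
    let requestedProcessingWorkload = UInt64(forNumberOfRequests)

    // max-based budgeting: reserve effort for whichever is larger.
    var budgeted = requestedProcessingWorkload
    if trackedRecoveryWorkload > budgeted {
        budgeted = trackedRecoveryWorkload
    }

    // Suggested defensive clamp: make the UInt8 bound explicit instead
    // of relying on maxProcessingRequests to keep the in-flight count
    // under 256.
    let clamped: UInt8 = budgeted > 255 ? UInt8.max : UInt8(budgeted)
    // ... pass `clamped` onward as data: forNumberOfRequests ...
}
```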


Effort Constant Changes

| Constant                            | Before | After | Delta     |
|-------------------------------------|--------|-------|-----------|
| schedulerBaseEffort                 | 700    | 950   | +36%      |
| schedulerPerRequestEffort           | 1000   | 250   | -75%      |
| workerCreateYieldVaultRequestEffort | 5000   | 6500  | +30%      |
| workerDepositRequestEffort          | 2000   | 1500  | -25%      |
| workerWithdrawRequestEffort         | 2000   | 3000  | +50%      |
| workerCloseYieldVaultRequestEffort  | 5000   | 5000  | unchanged |

The schedulerBaseEffort increase accounts for recovery work now running unconditionally per tick — reasonable. The 75% reduction in schedulerPerRequestEffort (1000 to 250) deserves a comment explaining why it is sufficient. Recovery per entry is lighter than full request processing (mainly a status lookup in the common non-failing case), but the rationale is not documented. A brief inline comment would help future maintainers.
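The missing rationale could be captured directly at the constant definition. A possible wording (the comment text is ours; only the value comes from the PR):

```cadence
// Per-entry recovery is a status lookup plus, in the rare failing case,
// a single state transition -- far cheaper than full request processing,
// which is why this sits well below the old 1000.
access(all) let schedulerPerRequestEffort: UInt64 = 250
```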

Were the workerWithdrawRequestEffort (+50%) and workerDepositRequestEffort (-25%) changes measured empirically on testnet? If so, noting that in the commit message would provide useful context.


init_and_schedule.cdc

Good: workerHandlerExecutionEffort is now dynamically computed as max(create, withdraw, deposit, close) instead of a hardcoded 5000. This keeps the bootstrap transaction in sync when effort constants change.

Good: schedulerExecutionEffort now reads from SCHEDULER_BASE_EFFORT instead of a hardcoded 700. This eliminates a silent divergence between the contract constant and the init transaction.
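The two improvements above can be sketched together. The constant names follow the review; the transaction shape and contract reference are assumptions:

```cadence
transaction {
    prepare(acct: auth(Storage) &Account) {
        let ops = FlowYieldVaultsEVMWorkerOps // assumed contract alias

        // Max of the four per-operation efforts instead of a hardcoded
        // 5000, so the bootstrap stays in sync with the contract.
        var workerHandlerExecutionEffort = ops.workerCreateYieldVaultRequestEffort
        for effort in [
            ops.workerDepositRequestEffort,
            ops.workerWithdrawRequestEffort,
            ops.workerCloseYieldVaultRequestEffort
        ] {
            if effort > workerHandlerExecutionEffort {
                workerHandlerExecutionEffort = effort
            }
        }

        // Read from the contract constant instead of hardcoding 700.
        let schedulerExecutionEffort = ops.SCHEDULER_BASE_EFFORT
        // ... schedule the first scheduler run with these budgets ...
    }
}
```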

Pre-existing concern (not introduced by this PR): The init path uses only baseEffort for the first scheduler run. On a restart with existing scheduledRequests entries in flight, the first tick will not account for recovery workload in its effort budget. Worth tracking as a follow-up.


Test Coverage

No new tests are added for the core scenario: a stale PROCESSING request being recovered when the EVM queue is empty. Given that cadence/tests/error_handling_test.cdc already covers error cases, a test that verifies the scheduler marks a panicked worker request as FAILED with no pending EVM requests would directly validate this fix and prevent regression.
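A regression test for that scenario might look like the following. Every helper here (submitWorkerRequest, panicWorker, pendingEVMRequestCount, tickScheduler, getRequestStatus) is a hypothetical name standing in for whatever setup the existing test suite provides; only the Test framework import and assertions are real Cadence testing API.

```cadence
import Test

access(all) fun testStaleProcessingRecoveredWithEmptyQueue() {
    // 1. Put one worker request in flight, then make its worker fail.
    let requestID = submitWorkerRequest() // hypothetical helper
    panicWorker(requestID)               // hypothetical helper

    // 2. Confirm the EVM queue is empty, then tick the scheduler once.
    Test.assertEqual(0, pendingEVMRequestCount())
    tickScheduler()

    // 3. Recovery must run despite the empty queue: the stale
    //    PROCESSING entry is marked FAILED on this very tick.
    Test.assertEqual("FAILED", getRequestStatus(requestID))
}
```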


Summary

| Area                               | Assessment                                          |
|------------------------------------|-----------------------------------------------------|
| Core fix correctness               | Correct                                             |
| Effort budgeting logic             | Correct                                             |
| Effort constant values             | Needs comment justifying perRequestEffort reduction |
| init_and_schedule.cdc improvements | Good                                                |
| Test coverage                      | No new test for the specific recovery scenario      |

The fix is sound and addresses a real liveness issue. Main asks before merging: (1) a comment explaining the schedulerPerRequestEffort reduction rationale, and (2) ideally a test covering the no-pending-work recovery path.

@nvdtf nvdtf closed this Mar 13, 2026
