
fix(worker-ops): recover failed workers before scheduler gating #69

Closed

liobrasil wants to merge 4 commits into main from lionel/fix-worker-recovery-before-gating
Conversation

@liobrasil
Collaborator

Summary

  • run failed-worker recovery at the top of SchedulerHandler.executeTransaction(...)
  • remove the duplicate recovery call from _runScheduler(...)
  • stop failed/stale in-flight worker entries from having to wait for fresh pending work before being marked failed

Why

A failed worker request could remain stuck in PROCESSING until another pending request arrived, because _checkForFailedWorkerRequests(...) only ran inside _runScheduler(...), and _runScheduler(...) only executes when the scheduler sees pending work to process.

By moving failed-worker recovery ahead of capacity and pending-request gating, stale in-flight entries are cleaned up on every scheduler tick.

Testing

  • flow test cadence/tests/evm_bridge_lifecycle_test.cdc cadence/tests/access_control_test.cdc cadence/tests/error_handling_test.cdc cadence/tests/validation_test.cdc

@liobrasil liobrasil requested a review from nvdtf March 11, 2026 23:41

claude bot commented Mar 11, 2026

PR Review: fix(worker-ops): recover failed workers before scheduler gating

Summary: This PR fixes a liveness bug where a failed/stale in-flight worker request could remain stuck in PROCESSING indefinitely when the EVM queue is empty. The fix moves _checkForFailedWorkerRequests before the capacity/pending-work gating in executeTransaction(), ensuring recovery runs on every scheduler tick.


Core Bug Fix - FlowYieldVaultsEVMWorkerOps.cdc

The fix is correct and well-motivated. Previously, _checkForFailedWorkerRequests was called inside _runScheduler(), which only executes when there are pending EVM requests. Now it runs unconditionally at the top of executeTransaction(), before the capacity and backlog checks. This ensures stale PROCESSING requests are recovered on every scheduler tick, even when the EVM queue is empty.

The removal of the duplicate call from _runScheduler() is correct — recovery now runs exactly once per tick.
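The reordering described above can be sketched as follows. This is a hypothetical reconstruction, not the actual contract code: the function and field names (executeTransaction, _checkForFailedWorkerRequests, _runScheduler, scheduledRequests, maxProcessingRequests, scheduleNextSchedulerExecution) are taken from the review, while the surrounding structure and the gating conditions are assumptions.

```cadence
access(all) fun executeTransaction(id: UInt64, data: AnyStruct?) {
    // 1. Recovery now runs on every tick, before any gating, so stale
    //    PROCESSING entries are cleaned up even when the EVM queue is empty.
    self._checkForFailedWorkerRequests()

    // 2. Capacity / pending-work gating happens only afterwards.
    if self.scheduledRequests.length >= Int(self.maxProcessingRequests) {
        self.scheduleNextSchedulerExecution()
        return
    }
    if self.pendingRequestCount() == 0 {
        self.scheduleNextSchedulerExecution()
        return
    }

    // 3. _runScheduler no longer calls recovery itself, so recovery
    //    executes exactly once per tick.
    self._runScheduler()
}
```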


Effort Budgeting - scheduleNextSchedulerExecution

The new max-based budgeting is a sound addition. trackedRecoveryWorkload is read from scheduledRequests.length after _checkForFailedWorkerRequests() has already run and removed failed entries. This is correct — it budgets for entries still in flight that may need recovery on the next tick, not the ones just recovered.

Minor concern: forNumberOfRequests is typed as UInt8, so requestedProcessingWorkload is naturally bounded by 255. scheduledRequests.length is an Int cast to UInt64, so trackedRecoveryWorkload could theoretically exceed a UInt8 (leading to under-budgeting when passed as data: forNumberOfRequests). In practice maxProcessingRequests caps in-flight workers, so this should never trigger, but worth documenting.
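The bound in question can be made explicit with a small sketch. The names follow the review; the enclosing function and the final clamp are assumptions (the clamp is a suggested hardening, not something the PR contains):

```cadence
access(self) fun budgetNextTick(forNumberOfRequests: UInt8) {
    // Entries still in flight after recovery ran; these may need
    // recovery on the next tick.
    let trackedRecoveryWorkload = UInt64(self.scheduledRequests.length)
    let requestedProcessingWorkload = UInt64(forNumberOfRequests)

    // max-based budgeting: reserve effort for whichever is larger.
    var budgeted = requestedProcessingWorkload
    if trackedRecoveryWorkload > budgeted {
        budgeted = trackedRecoveryWorkload
    }

    // Suggested defensive clamp: make the UInt8 bound explicit instead
    // of relying on maxProcessingRequests to keep the in-flight count
    // under 256.
    let clamped: UInt8 = budgeted > 255 ? UInt8.max : UInt8(budgeted)
    // ... pass `clamped` onward as data: forNumberOfRequests ...
}
```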


Effort Constant Changes

| Constant                            | Before | After | Delta     |
|-------------------------------------|--------|-------|-----------|
| schedulerBaseEffort                 | 700    | 950   | +36%      |
| schedulerPerRequestEffort           | 1000   | 250   | -75%      |
| workerCreateYieldVaultRequestEffort | 5000   | 6500  | +30%      |
| workerDepositRequestEffort          | 2000   | 1500  | -25%      |
| workerWithdrawRequestEffort         | 2000   | 3000  | +50%      |
| workerCloseYieldVaultRequestEffort  | 5000   | 5000  | unchanged |

The schedulerBaseEffort increase accounts for recovery work now running unconditionally per tick — reasonable. The 75% reduction in schedulerPerRequestEffort (1000 to 250) deserves a comment explaining why it is sufficient. Recovery per entry is lighter than full request processing (mainly a status lookup in the common non-failing case), but the rationale is not documented. A brief inline comment would help future maintainers.
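The missing rationale could be captured directly at the constant definition. A possible wording (the comment text is ours; only the value comes from the PR):

```cadence
// Per-entry recovery is a status lookup plus, in the rare failing case,
// a single state transition -- far cheaper than full request processing,
// which is why this sits well below the old 1000.
access(all) let schedulerPerRequestEffort: UInt64 = 250
```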

Were the workerWithdrawRequestEffort (+50%) and workerDepositRequestEffort (-25%) changes measured empirically on testnet? If so, noting that in the commit message would provide useful context.


init_and_schedule.cdc

Good: workerHandlerExecutionEffort is now dynamically computed as max(create, withdraw, deposit, close) instead of a hardcoded 5000. This keeps the bootstrap transaction in sync when effort constants change.

Good: schedulerExecutionEffort now reads from SCHEDULER_BASE_EFFORT instead of a hardcoded 700. This eliminates a silent divergence between the contract constant and the init transaction.
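The two improvements above can be sketched together. The constant names follow the review; the transaction shape and contract reference are assumptions:

```cadence
transaction {
    prepare(acct: auth(Storage) &Account) {
        let ops = FlowYieldVaultsEVMWorkerOps // assumed contract alias

        // Max of the four per-operation efforts instead of a hardcoded
        // 5000, so the bootstrap stays in sync with the contract.
        var workerHandlerExecutionEffort = ops.workerCreateYieldVaultRequestEffort
        for effort in [
            ops.workerDepositRequestEffort,
            ops.workerWithdrawRequestEffort,
            ops.workerCloseYieldVaultRequestEffort
        ] {
            if effort > workerHandlerExecutionEffort {
                workerHandlerExecutionEffort = effort
            }
        }

        // Read from the contract constant instead of hardcoding 700.
        let schedulerExecutionEffort = ops.SCHEDULER_BASE_EFFORT
        // ... schedule the first scheduler run with these budgets ...
    }
}
```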

Pre-existing concern (not introduced by this PR): The init path uses only baseEffort for the first scheduler run. On a restart with existing scheduledRequests entries in flight, the first tick will not account for recovery workload in its effort budget. Worth tracking as a follow-up.


Test Coverage

No new tests are added for the core scenario: a stale PROCESSING request being recovered when the EVM queue is empty. Given that cadence/tests/error_handling_test.cdc already covers error cases, a test that verifies the scheduler marks a panicked worker request as FAILED with no pending EVM requests would directly validate this fix and prevent regression.
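A regression test for that scenario might look like the following. Every helper here (submitWorkerRequest, panicWorker, pendingEVMRequestCount, tickScheduler, getRequestStatus) is a hypothetical name standing in for whatever setup the existing test suite provides; only the Test framework import and assertions are real Cadence testing API.

```cadence
import Test

access(all) fun testStaleProcessingRecoveredWithEmptyQueue() {
    // 1. Put one worker request in flight, then make its worker fail.
    let requestID = submitWorkerRequest() // hypothetical helper
    panicWorker(requestID)               // hypothetical helper

    // 2. Confirm the EVM queue is empty, then tick the scheduler once.
    Test.assertEqual(0, pendingEVMRequestCount())
    tickScheduler()

    // 3. Recovery must run despite the empty queue: the stale
    //    PROCESSING entry is marked FAILED on this very tick.
    Test.assertEqual("FAILED", getRequestStatus(requestID))
}
```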


Summary

| Area                               | Assessment                                          |
|------------------------------------|-----------------------------------------------------|
| Core fix correctness               | Correct                                             |
| Effort budgeting logic             | Correct                                             |
| Effort constant values             | Needs comment justifying perRequestEffort reduction |
| init_and_schedule.cdc improvements | Good                                                |
| Test coverage                      | No new test for the specific recovery scenario      |

The fix is sound and addresses a real liveness issue. Main asks before merging: (1) a comment explaining the schedulerPerRequestEffort reduction rationale, and (2) ideally a test covering the no-pending-work recovery path.

@nvdtf nvdtf closed this Mar 13, 2026
