fix(smart-router): do not panic when 1 static provider fails, trigger a background gorou…#2262
fix(smart-router): do not panic when 1 static provider fails, trigger a background gorou…#2262Tomelia1999 wants to merge 5 commits intomainfrom
Conversation
…tine that re verifies the fails providers every 3 minutes
Review Summary by QodoGraceful handling of failed static providers with background retry
WalkthroughsDescription• Prevent panic when static providers fail verification at startup • Exclude failed providers from session registration, allowing partial startup • Launch background retry goroutine to re-validate failed providers every 3 minutes • Add thread-safe copy-on-write merging when recovering providers during retries • Implement graceful degradation: endpoint serves with healthy providers while retrying failed ones Diagramflowchart LR
A["Provider Validation<br/>Phase 1"] --> B{"All Failed?"}
B -->|Yes| C["Return Error<br/>Block Startup"]
B -->|No| D["Filter Failed<br/>Providers"]
D --> E["Register Healthy<br/>Providers"]
E --> F["Launch Retry<br/>Goroutine"]
F --> G["Background Loop<br/>Every 3min"]
G --> H{"Provider<br/>Recovered?"}
H -->|Yes| I["Merge into Active<br/>Sessions"]
H -->|No| J["Keep Retrying"]
I --> K["Update Session<br/>Manager"]
File Changes1. protocol/rpcsmartrouter/rpcsmartrouter.go
|
Code Review by Qodo
|
| // Collect ALL WebSocket-capable endpoints from static providers for direct subscriptions | ||
| // WebSocket URLs are identified by ws:// or wss:// prefix | ||
| var wsEndpoints []*common.NodeUrl | ||
| for _, provider := range relevantStaticProviderList { | ||
| for i := range provider.NodeUrls { | ||
| url := strings.ToLower(provider.NodeUrls[i].Url) | ||
| if strings.HasPrefix(url, "ws://") || strings.HasPrefix(url, "wss://") { | ||
| wsEndpoints = append(wsEndpoints, &provider.NodeUrls[i]) | ||
| utils.LavaFormatInfo("Found WebSocket endpoint for direct subscriptions", | ||
| utils.LogAttr("url", provider.NodeUrls[i].Url), | ||
| utils.LogAttr("provider", provider.Name), | ||
| utils.LogAttr("chainID", provider.ChainID), | ||
| ) | ||
| } | ||
| } | ||
| } | ||
|
|
||
| // Create DirectWSSubscriptionManager if WebSocket endpoints are available | ||
| // Otherwise fall back to provider-based subscription manager | ||
| if len(wsEndpoints) > 0 { | ||
| directWSManager := NewDirectWSSubscriptionManager( | ||
| smartRouterMetricsManager, | ||
| spectypes.APIInterfaceJsonRPC, // WebSocket subscriptions use JSON-RPC | ||
| rpcEndpoint.ChainID, | ||
| rpcEndpoint.ApiInterface, | ||
| wsEndpoints, | ||
| optimizer, // Pass optimizer for endpoint selection | ||
| nil, // Use default WebSocket config (configurable via CLI flags later) | ||
| ) | ||
| // Start background cleanup goroutine | ||
| directWSManager.Start(ctx) | ||
| wsSubscriptionManager = directWSManager | ||
| utils.LavaFormatInfo("Using DirectWSSubscriptionManager for direct WebSocket subscriptions", | ||
| utils.LogAttr("chainID", rpcEndpoint.ChainID), | ||
| utils.LogAttr("apiInterface", rpcEndpoint.ApiInterface), | ||
| utils.LogAttr("wsEndpointCount", len(wsEndpoints)), | ||
| utils.LogAttr("optimizerEnabled", optimizer != nil), | ||
| ) | ||
| } else { | ||
| // No WebSocket endpoints configured - use NoOp manager that returns clear errors | ||
| // Smart router does NOT fall back to provider-based subscriptions (per implementation plan) | ||
| // Provider-based subscriptions are only for rpcconsumer, not rpcsmartrouter | ||
| wsSubscriptionManager = NewNoOpWSSubscriptionManager(rpcEndpoint.ChainID, rpcEndpoint.ApiInterface) | ||
| utils.LavaFormatInfo("No WebSocket endpoints configured for direct subscriptions", | ||
| utils.LogAttr("chainID", rpcEndpoint.ChainID), | ||
| utils.LogAttr("apiInterface", rpcEndpoint.ApiInterface), | ||
| utils.LogAttr("hint", "Add ws:// or wss:// URLs to static-providers-list to enable subscriptions"), | ||
| ) | ||
| } | ||
|
|
||
| // Create gRPC streaming subscription manager for gRPC server-streaming methods | ||
| // This supports Cosmos Event Streaming, Solana Geyser, and other gRPC streaming protocols | ||
| var grpcEndpoints []*common.NodeUrl | ||
| if rpcEndpoint.ApiInterface == spectypes.APIInterfaceGrpc { | ||
| // Collect gRPC endpoints from static providers | ||
| for _, provider := range relevantStaticProviderList { | ||
| if provider.ApiInterface == spectypes.APIInterfaceGrpc { | ||
| for i := range provider.NodeUrls { | ||
| grpcEndpoints = append(grpcEndpoints, &provider.NodeUrls[i]) | ||
| utils.LavaFormatInfo("Found gRPC endpoint for streaming subscriptions", | ||
| utils.LogAttr("url", provider.NodeUrls[i].Url), | ||
| utils.LogAttr("provider", provider.Name), | ||
| utils.LogAttr("chainID", provider.ChainID), | ||
| ) | ||
| } | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
2. Subscriptions use failed providers 🐞 Bug ≡ Correctness
CreateSmartRouterEndpoint filters static providers into healthyStaticProviders for session registration, but WebSocket and gRPC subscription endpoints are still collected from relevantStaticProviderList (unfiltered). This allows providers that failed validation (and were excluded from routing) to still be used for subscriptions, causing avoidable subscription failures against dead nodes.
Agent Prompt
### Issue description
WS/gRPC subscription endpoint collection uses the unfiltered static provider list, so endpoints that failed validation can still be used for subscriptions.
### Issue Context
Providers are filtered into `healthyStaticProviders` for session registration, but subscription endpoint collection still iterates `relevantStaticProviderList`.
### Fix Focus Areas
- protocol/rpcsmartrouter/rpcsmartrouter.go[821-849]
- protocol/rpcsmartrouter/rpcsmartrouter.go[921-988]
### Suggested fix
- Change the WS endpoint collection loop to iterate `healthyStaticProviders`.
- Change the gRPC endpoint collection loop to iterate `healthyStaticProviders` (and keep the `ApiInterface == grpc` check).
- (Optional) If you want subscriptions to automatically include recovered providers, consider rebuilding/updating the subscription managers when `retryFailedStaticProviders` re-registers recovered providers, or keep subscription endpoints tied strictly to currently-registered providers.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
There was a problem hiding this comment.
known but won't get handled on this PR, dedicated ticket will be opened
There was a problem hiding this comment.
There was a problem hiding this comment.
Commit 15d987c addressed this comment by collecting WebSocket and gRPC subscription endpoints from healthyStaticProviders instead of the unfiltered list, ensuring failed providers are excluded from subscriptions.
There was a problem hiding this comment.
▎ Issue: The condition len(failedStaticNames) >= validatedCount is semantically ambiguous based on whether validatedCount tracks all attempted providers or only successful ones. Based on the companion log healthy = validatedCount -
▎ len(failedStaticNames), the intent appears to be "total attempted", but the variable name says "validated" (successful). This can cause the endpoint to be killed when there are still healthy providers.
▎
▎ Suggestion:
▎ // Rename the loop counter and fix the check
▎ totalAttemptedCount := 0 // was: validatedCount
▎ // (increment on each iteration, not just on success)
▎
▎ healthyCount := totalAttemptedCount - len(failedStaticNames)
▎ if healthyCount == 0 {
▎ err := utils.LavaFormatError("all static providers failed — cannot serve endpoint", nil, ...)
▎ errCh <- err
▎ return err
▎ }
There was a problem hiding this comment.
logic is good, no changes needed.
renamed the param validatedCount -> totalAttemptedCount
There was a problem hiding this comment.
Commit 15d987c addressed this comment by renaming validatedCount to totalAttemptedCount, incrementing it for every provider attempt, and deriving healthyCount for failure detection so the endpoint only errors when all attempted providers fail while still logging healthy counts appropriately.
There was a problem hiding this comment.
WS endpoint collection loop (~line 960 in new file)
▎ Issue: relevantStaticProviderList is used to collect WebSocket and gRPC endpoints, but this is the unfiltered list including failed providers. This is a regression from the old panic-before-reaching-this-line behavior.
▎
▎ Suggestion: Change both loops to use healthyStaticProviders. Even if MAG-1569 tracks the full fix, using the unfiltered list here is strictly worse than before:
▎ for _, provider := range healthyStaticProviders { // not relevantStaticProviderList
There was a problem hiding this comment.
Commit 15d987c addressed this comment by switching the WebSocket and gRPC endpoint collection loops to iterate over healthyStaticProviders, ensuring only validated providers contribute endpoints instead of the unfiltered list.
There was a problem hiding this comment.
▎ Issue: A hung provider will block the retry loop indefinitely for all other failed providers sharing the same goroutine.
▎
▎ Suggestion:
▎ attemptCtx, attemptCancel := context.WithTimeout(ctx, 30*time.Second)
▎ defer attemptCancel()
There was a problem hiding this comment.
Commit 15d987c addressed this comment by wrapping each verification and retry attempt in a 30-second context.WithTimeout so that a hung provider can no longer block the shared goroutine loop, ensuring stalled connections time out instead of hanging indefinitely.
…tine that re verifies the fails providers every 3 minutes
Description
Closes: #XXXX
Author Checklist
All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.
I have...
!in the type prefix if API or client breaking changemainbranchReviewers Checklist
All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.
I have...