fix: health indicator #4636
Signed-off-by: Pablo Carle <pablo.carle@broadcom.com>
var zaasUp = !this.discoveryClient.getInstances(CoreService.ZAAS.getServiceId()).isEmpty();
var gatewayCount = this.discoveryClient.getInstances(CoreService.GATEWAY.getServiceId()).size();
var discoveryCount = this.discoveryClient.getInstances(CoreService.DISCOVERY.getServiceId()).size();
What about reading the expected number of instances from the configuration instead? The check itself does not validate against an exact count.
Also, if individual instances are not validated, how do we know whether only the local instances are up or all instances in the HA setup?
Yes, it makes sense. This is a partial fix for a specific scenario in which the second instance would print the API Mediation Layer ready message before Discovery and/or ZAAS are available.
I guess the question is more generic. It's true this does not cover scenarios with 3 or more instances. It's also true that the message in the second instance is correct even if there's only one Discovery and/or ZAAS.
I will try to refactor it
Just a note: there is a scenario in which the HA setup is functionally correct even though not all instances are fully set up.
As this is atypical, if it is supported at all, I believe we are ok with approving and merging the PR.
The scenario explanation
If an instance crashes during startup and never registers (in an HA scenario, an instance means, for example, one ZAAS service on one LPAR out of three), the counts will never match and onFullyUp() will never be called. The "started" message is never published. There is no timeout, no retry limit, and no fallback.
Potential suggestion: add a time-based fallback. If the counts haven't converged after a configurable timeout (e.g., 5 minutes), log a warning and publish anyway.
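A minimal sketch of how the time-based fallback suggested above might look, not the PR's implementation. `StartupFallback`, `Decision`, and `evaluate` are hypothetical names, and the sketch assumes the caller already computes whether the instance counts have converged and is the one that logs the warning and calls onFullyUp().

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper; none of these names exist in the PR.
final class StartupFallback {
    enum Decision { WAIT, PUBLISH, PUBLISH_WITH_WARNING }

    private final Instant startedAt;
    private final Duration timeout; // assumed to be configurable, e.g. 5 minutes

    StartupFallback(Instant startedAt, Duration timeout) {
        this.startedAt = startedAt;
        this.timeout = timeout;
    }

    // countsConverged: result of the existing count check (assumed input).
    // now: injected so the decision is deterministic and easy to test.
    Decision evaluate(boolean countsConverged, Instant now) {
        if (countsConverged) {
            return Decision.PUBLISH;
        }
        boolean timedOut = Duration.between(startedAt, now).compareTo(timeout) >= 0;
        // After the timeout, publish anyway; the caller should log a warning.
        return timedOut ? Decision.PUBLISH_WITH_WARNING : Decision.WAIT;
    }
}
```

Passing the current time in explicitly keeps the decision free of scheduler details, so the same logic could be driven by whatever periodic health check already exists.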
There will be additional work on this PR that will need full review afterwards


Description
In a multi-service deployment, the health indicator can show the API Mediation Layer started message ZWEAM001I before the instance's Discovery / ZAAS are available (especially in slow environments).
This PR fixes it by verifying the count of Discovery / ZAAS instances.
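A rough sketch of the kind of count check described above, under stated assumptions. The `lookup` function stands in for `discoveryClient.getInstances(serviceId)`, the service IDs are placeholders, and the rule that the Discovery count must equal the Gateway count is an assumption based on the review discussion about counts matching, not a confirmed detail of the PR.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper; the real check lives in the PR's health indicator.
final class InstanceCountCheck {
    // lookup: serviceId -> registered instances (stand-in for the discovery client).
    static boolean isFullyUp(Function<String, List<String>> lookup) {
        boolean zaasUp = !lookup.apply("zaas").isEmpty();
        int gatewayCount = lookup.apply("gateway").size();
        int discoveryCount = lookup.apply("discovery").size();
        // Assumed rule: at least one ZAAS, and one Discovery per Gateway.
        return zaasUp && gatewayCount > 0 && gatewayCount == discoveryCount;
    }

    public static void main(String[] args) {
        Map<String, List<String>> registry = Map.of(
                "gateway", List.of("gw-1", "gw-2"),
                "discovery", List.of("ds-1"),
                "zaas", List.of("zaas-1"));
        // Second Discovery has not registered yet, so the message should be held back.
        System.out.println(isFullyUp(id -> registry.getOrDefault(id, List.of())));
    }
}
```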