Today
zcash_local_net::validator::zebrad::wait_for_rpc_ready polls getblocktemplate until it returns success, treating success as proof that "zebrad is ready." This works in the steady-state case but fails opaquely when consensus is in an unexpected state (#244 is a concrete instance: a half-closed connection during submitblock is indistinguishable from a hung process).
Root cause: zebrad has no dedicated liveness/readiness contract. The harness has no choice but to overload a functional RPC as the readiness signal.
What zebrad exposes today
- JSON-RPC suite — getinfo, getnetworkinfo, getblockchaininfo, getblocktemplate, etc.
- Prometheus /metrics if enabled: observability, not health.
- Logs to stdout/stderr.
What zebrad does not expose:
- An HTTP /healthz / /readyz (Kubernetes-style probe).
- A dedicated RPC like getstatus / getreadiness.
- Any explicit contract distinguishing "RPC framework accepts connections" from "consensus/state/mining can serve a particular request."
That conflation is what made #244's failure opaque. getblocktemplate failing could mean (a) RPC down, (b) state unloaded, (c) mining service not yet started, or (d) consensus has rejected the next block — and the harness can't tell from a connection error.
Audit of the 4 RPCs available as readiness signals
Each endpoint exercises a successively larger slice of zebrad's internal state, so a failure at each one tells you something different:
| Endpoint | Empirical "success means" | First fails when |
|---|---|---|
| getinfo | RPC framework is up; the JSON-RPC dispatcher is registered | Process is dead, or the listener hasn't bound yet |
| getnetworkinfo | Network module is loaded; peer subsystem is reachable | Network module hasn't initialised, or zebrad died after RPC bind but before module loading completed |
| getblockchaininfo | State module is loaded; chain tip is readable from the database | State database init still in progress, database corrupted, or zebrad died during state load |
| getblocktemplate | Mining service is active and consensus is in a state where the next block can be mined | Mining service not started, mempool not ready, OR consensus is rejecting the next block (e.g. NU6.1 activation block with mismatched coinbase; see #244) |
Note the asymmetry: getinfo is the most permissive and getblocktemplate is the most restrictive. A process that's "up" by getinfo's standard may be "down" by getblocktemplate's.
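Because the endpoints are ordered from most permissive to most restrictive, the first failing endpoint in table order is a reasonable pointer to the layer that is not ready. The following is a minimal sketch of that mapping, not harness code: the `Layer` enum, the `first_unready_layer` function, and the layer names are hypothetical and introduced only for illustration, and the mapping reflects the empirical audit above rather than any documented zebrad contract.

```rust
/// Endpoints in audit order, from most permissive to most restrictive.
pub const ENDPOINTS: [&str; 4] = [
    "getinfo",
    "getnetworkinfo",
    "getblockchaininfo",
    "getblocktemplate",
];

/// Hypothetical names for the slice of zebrad that each endpoint exercises.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Layer {
    RpcFramework,
    Network,
    State,
    Mining,
}

/// One layer per endpoint, in the same order as `ENDPOINTS`.
const LAYERS: [Layer; 4] = [
    Layer::RpcFramework,
    Layer::Network,
    Layer::State,
    Layer::Mining,
];

/// Given one success flag per endpoint (in `ENDPOINTS` order), return the
/// first layer whose endpoint failed, or `None` if all four succeeded.
pub fn first_unready_layer(ok: [bool; 4]) -> Option<Layer> {
    ok.iter().position(|&succeeded| !succeeded).map(|i| LAYERS[i])
}
```

Under these assumptions, a result of `[true, true, true, false]` maps to `Layer::Mining`, which is the #244 situation: everything below the mining service answered, so the process is alive.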
Proposed informal "readiness" — AND of the four
Until upstream Zebra grows a proper probe contract, the harness can implement an informal readiness check defined as the conjunction:
informal_readiness(zebrad) = (getinfo OK) AND (getnetworkinfo OK) AND (getblockchaininfo OK) AND (getblocktemplate OK)
Sketch:
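    // Probes the four RPCs in order, from most permissive to most restrictive.
    // Returns Ok(()) only if all four succeed; otherwise returns one
    // "<endpoint>: <error>" entry per failed call.
    async fn informal_readiness(
        client: &RpcRequestClient,
    ) -> Result<(), Vec<String>> {
        let mut failures = Vec::new();
        for name in ["getinfo", "getnetworkinfo", "getblockchaininfo", "getblocktemplate"] {
            // Every probe is sent with an empty parameter list.
            match client
                .json_result_from_call::<serde_json::Value>(name, "[]".to_string())
                .await
            {
                Ok(_) => {}
                // Keep probing after a failure so the report covers all four endpoints.
                Err(e) => failures.push(format!("{name}: {e}")),
            }
        }
        if failures.is_empty() { Ok(()) } else { Err(failures) }
    }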
wait_for_rpc_ready becomes a poll on this. The error type includes which subset failed, so a timeout reports e.g.
RpcReadinessTimeout: getblocktemplate: rejected, others OK
instead of today's
RpcReadinessTimeout: <opaque RPC error>
With that change, a timeout no longer has to be read as an undifferentiated "zebrad is down": it can be reported as an evidence-supported "zebrad is up but the mining service is unhappy", which is exactly the diagnostic gap that bit us in #244.
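A rough sketch of the polling side follows, assuming a synchronous, std-only loop so that it runs without a zebrad instance; the real harness is async and would await `informal_readiness` instead. The `RpcReadinessTimeout` struct shape, the `poll_until_ready` name, and the `interval` parameter are assumptions made for illustration; the `probe` closure stands in for the conjunction check.

```rust
use std::fmt;
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical error shape: carries the failures seen on the last attempt,
/// so the timeout message names the endpoints that were still failing.
#[derive(Debug, PartialEq)]
pub struct RpcReadinessTimeout {
    pub last_failures: Vec<String>,
}

impl fmt::Display for RpcReadinessTimeout {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "RpcReadinessTimeout: {}", self.last_failures.join("; "))
    }
}

impl std::error::Error for RpcReadinessTimeout {}

/// Calls `probe` until it returns `Ok(())` or `timeout` has elapsed.
/// The probe always runs at least once, even with a zero timeout.
pub fn poll_until_ready<F>(
    mut probe: F,
    timeout: Duration,
    interval: Duration,
) -> Result<(), RpcReadinessTimeout>
where
    F: FnMut() -> Result<(), Vec<String>>,
{
    let deadline = Instant::now() + timeout;
    loop {
        match probe() {
            Ok(()) => return Ok(()),
            Err(last_failures) => {
                // Give up only after a failed attempt, keeping its evidence.
                if Instant::now() >= deadline {
                    return Err(RpcReadinessTimeout { last_failures });
                }
            }
        }
        sleep(interval);
    }
}
```

Keeping only the last attempt's failures is a design choice in this sketch: early attempts typically fail on every endpoint while zebrad starts, and the final attempt is the one that describes the state the process was stuck in.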
Why this is informal, not normative
Each endpoint's success criteria are determined by zebrad's implementation, not a documented contract. Behavior could change between zebrad versions. The conjunction is a defensible heuristic for our harness's purposes, not a Zcash-protocol-level definition of liveness.
The proper solution is upstream: zebrad exposing a documented /readyz endpoint (or equivalent RPC) whose contract is "this process is ready to serve any RPC." That belongs in a separate issue against ZcashFoundation/zebra, which still needs to be filed. Until then, the AND-of-4 is a forward-compatible approximation: when zebrad gets a real probe, the harness drops the conjunction and uses the probe directly.
Acceptance criteria
- informal_readiness (or equivalent) implemented in zcash_local_net::validator::zebrad.
- wait_for_rpc_ready uses the conjunction; timeout errors enumerate which endpoints failed.
- generate_blocks's retry loop: between attempts, check whether zebrad is still alive via the conjunction. Distinguish "connection died" from "submission rejected."
- Issue filed against ZcashFoundation/zebra proposing a documented readiness probe; linked here.
Cross-reference
- IncompleteMessage.
- ZcashFoundation/zebra#NNN: upstream proposal for a documented readiness probe.