harness wait_for_rpc_ready overloads getblocktemplate as a liveness probe — define an informal AND-of-4 readiness conjunction #245

@zancas

Description

Today

zcash_local_net::validator::zebrad::wait_for_rpc_ready polls getblocktemplate until it returns success, treating success as proof that "zebrad is ready." This works in the steady-state case but fails opaquely when consensus is in an unexpected state (#244 is a concrete instance: a half-closed connection during submitblock is indistinguishable from a hung process).

Root cause: zebrad has no dedicated liveness/readiness contract. The harness has no choice but to overload a functional RPC as the readiness signal.

What zebrad exposes today

  • JSON-RPC suite — getinfo, getnetworkinfo, getblockchaininfo, getblocktemplate, etc.
  • Prometheus /metrics if enabled — observability, not health.
  • Logs to stdout/stderr.

What zebrad does not expose:

  • An HTTP /healthz / /readyz (Kubernetes-style probe).
  • A dedicated RPC like getstatus / getreadiness.
  • Any explicit contract distinguishing "RPC framework accepts connections" from "consensus/state/mining can serve a particular request."

That conflation is what made #244's failure opaque. getblocktemplate failing could mean (a) RPC down, (b) state unloaded, (c) mining service not yet started, or (d) consensus has rejected the next block — and the harness can't tell from a connection error.

Audit of the 4 RPCs available as readiness signals

Each endpoint exercises a successively larger slice of zebrad's internal state. Failure of one tells you a different thing:

| Endpoint | Empirical "success means" | First fails when |
|---|---|---|
| getinfo | RPC framework is up; the JSON-RPC dispatcher is registered | Process is dead, or the listener hasn't bound yet |
| getnetworkinfo | Network module is loaded; peer subsystem is reachable | Network module hasn't initialised, or zebrad died after RPC bind but before module loading completed |
| getblockchaininfo | State module is loaded; chain tip is readable from the database | On-disk state database init still in progress, database corrupted, or zebrad died during state load |
| getblocktemplate | Mining service is active and consensus is in a state where the next block can be mined | Mining service not started, mempool not ready, or consensus is rejecting the next block (e.g. NU6.1 activation block with mismatched coinbase; see #244) |

Note the asymmetry: getinfo is the most permissive and getblocktemplate is the most restrictive. A process that's "up" by getinfo's standard may be "down" by getblocktemplate's.
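
Because the probes are ordered from most permissive to most restrictive, the first failing endpoint names the outermost layer that is not ready. The following is a minimal sketch of that reading, not harness code: `PROBES`, `first_failure`, and the hint strings are hypothetical, with the hints paraphrased from the audit above.

```rust
/// The four probes, ordered from most permissive to most restrictive.
pub const PROBES: [&str; 4] = [
    "getinfo",
    "getnetworkinfo",
    "getblockchaininfo",
    "getblocktemplate",
];

/// Coarse hints paraphrased from the audit; illustrative wording only.
const HINTS: [&str; 4] = [
    "process dead or RPC listener not bound",
    "network module not initialised",
    "state not loaded or database unreadable",
    "mining service, mempool, or consensus not ready",
];

/// Hypothetical helper: takes one pass/fail flag per probe (in `PROBES`
/// order) and returns the first failing endpoint together with its hint,
/// or `None` when all four succeeded.
pub fn first_failure(ok: [bool; 4]) -> Option<(&'static str, &'static str)> {
    ok.iter()
        .position(|&passed| !passed)
        .map(|i| (PROBES[i], HINTS[i]))
}
```

A result of `Some(("getblocktemplate", ..))` with the other three passing is the #244 shape: the process is up, but the most restrictive layer is refusing.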

Proposed informal "readiness" — AND of the four

Until upstream Zebra grows a proper probe contract, the harness can implement an informal readiness check defined as the conjunction:

informal_readiness(zebrad) = (getinfo OK) AND (getnetworkinfo OK) AND (getblockchaininfo OK) AND (getblocktemplate OK)

Sketch:

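// Assumed import path; adjust to wherever the harness gets RpcRequestClient.
use zebra_node_services::rpc_client::RpcRequestClient;

/// Calls all four RPCs in order and collects every failure (no short-circuit),
/// so the caller sees which subset failed rather than only the first error.
async fn informal_readiness(
    client: &RpcRequestClient,
) -> Result<(), Vec<String>> {
    let mut failures = Vec::new();
    for name in ["getinfo", "getnetworkinfo", "getblockchaininfo", "getblocktemplate"] {
        match client
            // Empty params; the response body is ignored, only success matters.
            .json_result_from_call::<serde_json::Value>(name, "[]".to_string())
            .await
        {
            Ok(_) => {}
            Err(e) => failures.push(format!("{name}: {e}")),
        }
    }
    if failures.is_empty() { Ok(()) } else { Err(failures) }
}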

wait_for_rpc_ready becomes a poll on this. The error type includes which subset failed, so a timeout reports e.g.

RpcReadinessTimeout: getblocktemplate: rejected, others OK

instead of today's

RpcReadinessTimeout: <opaque RPC error>

That single change turns an opaque "zebrad is down" diagnosis into an evidence-supported one such as "zebrad is up but the mining service is unhappy", which is exactly the diagnostic gap that bit us in #244.
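
The polling side could look roughly like the sketch below. It is synchronous and std-only so that it stands alone; the real harness is async and would await the probe instead. `wait_until_ready`, the `RpcReadinessTimeout` struct layout, and the exact message format are assumptions for illustration, not the harness's existing API.

```rust
use std::fmt;
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical timeout error: keeps the failures seen on the last poll.
#[derive(Debug, PartialEq)]
pub struct RpcReadinessTimeout {
    pub failures: Vec<String>,
}

impl fmt::Display for RpcReadinessTimeout {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // One entry per failed endpoint, e.g. "getblocktemplate: rejected".
        write!(f, "RpcReadinessTimeout: {}", self.failures.join("; "))
    }
}

/// Polls `probe` until it returns `Ok` or `timeout` has elapsed.
/// `probe` stands in for a call to the readiness conjunction.
pub fn wait_until_ready<F>(
    mut probe: F,
    timeout: Duration,
    interval: Duration,
) -> Result<(), RpcReadinessTimeout>
where
    F: FnMut() -> Result<(), Vec<String>>,
{
    let deadline = Instant::now() + timeout;
    loop {
        match probe() {
            Ok(()) => return Ok(()),
            // Out of time: surface the most recent set of failures.
            Err(failures) if Instant::now() >= deadline => {
                return Err(RpcReadinessTimeout { failures });
            }
            // Still within budget: wait and poll again.
            Err(_) => sleep(interval),
        }
    }
}
```

Reporting the failures from the last poll, rather than the first, reflects the state at the moment the harness gave up, which is usually what a timeout diagnosis needs.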

Why this is informal, not normative

Each endpoint's success criteria are determined by zebrad's implementation, not by a documented contract, so behavior could change between zebrad versions. The conjunction is a defensible heuristic for our harness's purposes, not a Zcash-protocol-level definition of liveness.

The proper solution is upstream: zebrad exposing a documented /readyz endpoint (or equivalent RPC) whose contract is "this process is ready to serve any RPC." That belongs in a separate issue against ZcashFoundation/zebra, which still needs to be filed. Until then, the AND-of-4 is a forward-compatible approximation: when zebrad gets a real probe, the harness drops the conjunction and uses the probe directly.
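
One way to keep that swap cheap is to hide the readiness check behind a small trait. This is a sketch under stated assumptions: `ReadinessProbe`, `Conjunction`, and `DedicatedProbe` are hypothetical names, the closures stand in for real RPC calls, and no dedicated probe exists in zebrad today.

```rust
/// Hypothetical abstraction over "how do we ask zebrad whether it is ready".
pub trait ReadinessProbe {
    /// `Ok(())` when ready; otherwise one message per failed check.
    fn check(&self) -> Result<(), Vec<String>>;
}

/// Today's heuristic: every listed RPC must succeed.
/// `call` is a stand-in for the real RPC client.
pub struct Conjunction<F: Fn(&str) -> Result<(), String>> {
    pub endpoints: Vec<&'static str>,
    pub call: F,
}

impl<F: Fn(&str) -> Result<(), String>> ReadinessProbe for Conjunction<F> {
    fn check(&self) -> Result<(), Vec<String>> {
        // Call every endpoint and keep only the failures, tagged by name.
        let failures: Vec<String> = self
            .endpoints
            .iter()
            .copied()
            .filter_map(|name| (self.call)(name).err().map(|e| format!("{name}: {e}")))
            .collect();
        if failures.is_empty() { Ok(()) } else { Err(failures) }
    }
}

/// Placeholder for a future documented upstream probe.
pub struct DedicatedProbe<F: Fn() -> Result<(), String>>(pub F);

impl<F: Fn() -> Result<(), String>> ReadinessProbe for DedicatedProbe<F> {
    fn check(&self) -> Result<(), Vec<String>> {
        // A single probe yields at most one failure message.
        (self.0)().map_err(|e| vec![e])
    }
}
```

Callers such as wait_for_rpc_ready would depend only on the trait, so replacing the conjunction with an upstream probe would not change their error handling.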

Acceptance criteria

  • informal_readiness (or equivalent) implemented in zcash_local_net::validator::zebrad.
  • wait_for_rpc_ready uses the conjunction; timeout errors enumerate which endpoints failed.
  • Equivalent extension in generate_blocks's retry loop: between attempts, check whether zebrad is still alive via the conjunction. Distinguish "connection died" from "submission rejected."
  • Upstream issue filed against ZcashFoundation/zebra proposing a documented readiness probe; linked here.
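
For the generate_blocks criterion, the distinction could be drawn by running the readiness check only after a submission fails. The sketch below is illustrative: `SubmitDiagnosis` and `diagnose_submit` are hypothetical names, and the `submit` result and `readiness` closure stand in for the real submitblock call and the conjunction.

```rust
/// Hypothetical classification of one block-submission attempt.
#[derive(Debug, PartialEq)]
pub enum SubmitDiagnosis {
    /// The submission succeeded.
    Accepted,
    /// The node still passes every readiness check, so the failure is
    /// about the block itself. Carries the submission error.
    SubmissionRejected(String),
    /// At least one readiness check failed as well. Carries the per-endpoint
    /// failures, which show how far down the node is unresponsive.
    NodeNotReady(Vec<String>),
}

/// Classifies a submission result, consulting `readiness` only on failure.
pub fn diagnose_submit(
    submit: Result<(), String>,
    readiness: impl FnOnce() -> Result<(), Vec<String>>,
) -> SubmitDiagnosis {
    match submit {
        Ok(()) => SubmitDiagnosis::Accepted,
        Err(reason) => match readiness() {
            Ok(()) => SubmitDiagnosis::SubmissionRejected(reason),
            Err(failures) => SubmitDiagnosis::NodeNotReady(failures),
        },
    }
}
```

A `NodeNotReady` result whose failures include getinfo suggests the connection or process died, while one listing only getblocktemplate suggests the node is alive but refusing to build the next block.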

Cross-reference
