harness wait_for_rpc_ready overloads getblocktemplate as a liveness probe — define an informal AND-of-4 readiness conjunction

## Today

`zcash_local_net::validator::zebrad::wait_for_rpc_ready` polls `getblocktemplate` until it returns success, treating success as proof that "zebrad is ready." This works in the steady-state case but fails opaquely when consensus is in an unexpected state (#244 is a concrete instance: a half-closed connection during `submitblock` is indistinguishable from a hung process).

Root cause: zebrad has no dedicated liveness/readiness contract. The harness has no choice but to overload a functional RPC as the readiness signal.

## What zebrad exposes today

- **JSON-RPC suite** — getinfo, getnetworkinfo, getblockchaininfo, getblocktemplate, etc.
- **Prometheus `/metrics`** if enabled — observability, not health.
- **Logs** to stdout/stderr.

What zebrad does *not* expose:

- An HTTP `/healthz` / `/readyz` (Kubernetes-style probe).
- A dedicated RPC like `getstatus` / `getreadiness`.
- Any explicit contract distinguishing "RPC framework accepts connections" from "consensus/state/mining can serve a particular request."

That conflation is what made #244's failure opaque. `getblocktemplate` failing could mean (a) RPC down, (b) state unloaded, (c) mining service not yet started, or (d) consensus has rejected the next block — and the harness can't tell from a connection error.
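One way to keep those cases apart inside the harness is to record whether the process answered at all, separately from what it answered. The following is a minimal sketch under that assumption; `ProbeOutcome` and its variants are hypothetical names, not existing `zcash_local_net` or zebrad types.

```rust
/// Hypothetical outcome of a single probe call (not an existing harness type).
#[derive(Debug, PartialEq)]
enum ProbeOutcome {
    /// The server returned a JSON-RPC result.
    Ok,
    /// The server answered, but with a JSON-RPC error object.
    RpcError { code: i64, message: String },
    /// No usable response: connection refused, reset, or closed mid-response.
    Transport(String),
}

impl ProbeOutcome {
    /// True when the process demonstrably answered the request,
    /// regardless of whether the answer was a result or an error.
    fn process_responded(&self) -> bool {
        !matches!(self, ProbeOutcome::Transport(_))
    }
}
```

With this split, cases (b) through (d) would surface as `RpcError` and only case (a) as `Transport`, assuming the client can tell the two apart.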

## Audit of the 4 RPCs available as readiness signals

Each endpoint exercises a successively larger slice of zebrad's internal state, so a failure at each one tells you something different:

| Endpoint | Empirical "success means" | First fails when |
|---|---|---|
| **`getinfo`** | RPC framework is up; the JSON-RPC dispatcher is registered | Process is dead, or the listener hasn't bound yet |
| **`getnetworkinfo`** | Network module is loaded; peer subsystem is reachable | Network module hasn't initialised, or zebrad died after RPC bind but before module loading completed |
| **`getblockchaininfo`** | State module is loaded; chain tip is readable from the database | State database (RocksDB) initialisation still in progress, database corrupted, or zebrad died during state load |
| **`getblocktemplate`** | Mining service is active and consensus is in a state where the *next* block can be mined | Mining service not started, mempool not ready, OR consensus is rejecting the next block (e.g. NU6.1 activation block with mismatched coinbase — see #244) |

Note the asymmetry: `getinfo` is the most permissive and `getblocktemplate` is the most restrictive. A process that's "up" by `getinfo`'s standard may be "down" by `getblocktemplate`'s.
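Because the probes are ordered from most permissive to most restrictive, the first one that fails points at a layer. The sketch below turns that ordering into a one-line diagnosis; the function names and the summary strings are illustrative assumptions drawn from the table, not documented zebrad behavior.

```rust
/// Probe order from the table, most permissive first.
const PROBES: [&str; 4] = [
    "getinfo",
    "getnetworkinfo",
    "getblockchaininfo",
    "getblocktemplate",
];

/// Returns the name of the first probe that failed, or `None` if all passed.
/// `ok[i]` is the result of `PROBES[i]`.
fn first_failing_probe(ok: [bool; 4]) -> Option<&'static str> {
    for (name, passed) in PROBES.iter().zip(ok) {
        if !passed {
            return Some(*name);
        }
    }
    None
}

/// Maps the first failing probe to the layer it implicates (heuristic only).
fn summarize(ok: [bool; 4]) -> &'static str {
    match first_failing_probe(ok) {
        None => "all four probes succeeded",
        Some("getinfo") => "RPC listener unreachable or process dead",
        Some("getnetworkinfo") => "RPC is up, network module not ready",
        Some("getblockchaininfo") => "RPC and network are up, state not readable",
        Some(_) => "node is up, mining or next-block consensus is failing",
    }
}
```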

## Proposed informal "readiness" — AND of the four

Until upstream Zebra grows a proper probe contract, the harness can implement an informal readiness check defined as the conjunction:

> **`informal_readiness(zebrad)` = (getinfo OK) AND (getnetworkinfo OK) AND (getblockchaininfo OK) AND (getblocktemplate OK)**

Sketch:

```rust
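// Assumed to be Zebra's JSON-RPC test client.
use zebra_node_services::rpc_client::RpcRequestClient;

/// Calls all four probes and collects every failure instead of stopping at
/// the first one, so the caller can report exactly which subset failed.
async fn informal_readiness(
    client: &RpcRequestClient,
) -> Result<(), Vec<String>> {
    let mut failures = Vec::new();
    // Ordered from most permissive to most restrictive.
    for name in ["getinfo", "getnetworkinfo", "getblockchaininfo", "getblocktemplate"] {
        match client
            .json_result_from_call::<serde_json::Value>(name, "[]".to_string())
            .await
        {
            Ok(_) => {}
            Err(e) => failures.push(format!("{name}: {e}")),
        }
    }
    if failures.is_empty() { Ok(()) } else { Err(failures) }
}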
```

`wait_for_rpc_ready` becomes a poll on this. The error type includes which subset failed, so a timeout reports e.g.

```
RpcReadinessTimeout: getblocktemplate: rejected, others OK
```

instead of today's

```
RpcReadinessTimeout: <opaque RPC error>
```

With that change, a timeout no longer collapses into a generic "zebrad is down" diagnosis. It can report, with evidence, that zebrad is up but the mining service is unhappy, which is exactly the diagnostic gap that bit us in #244.
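The polling side could look roughly like the sketch below. It is synchronous and takes the probe as a closure so that it stays self-contained; the real `wait_for_rpc_ready` is presumably async and would use an async sleep instead. `RpcReadinessTimeout` borrows its name from the example message above, but its fields and the `poll_until_ready` signature are assumptions.

```rust
use std::fmt;
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical timeout error carrying the failures from the last probe round.
#[derive(Debug)]
struct RpcReadinessTimeout {
    last_failures: Vec<String>,
}

impl fmt::Display for RpcReadinessTimeout {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "RpcReadinessTimeout: {}", self.last_failures.join("; "))
    }
}

impl std::error::Error for RpcReadinessTimeout {}

/// Runs `probe` until it succeeds or `timeout` elapses.
/// The probe is always run at least once, and the error reports
/// the failures observed in the final round.
fn poll_until_ready<F>(
    mut probe: F,
    timeout: Duration,
    interval: Duration,
) -> Result<(), RpcReadinessTimeout>
where
    F: FnMut() -> Result<(), Vec<String>>,
{
    let deadline = Instant::now() + timeout;
    loop {
        match probe() {
            Ok(()) => return Ok(()),
            Err(failures) if Instant::now() >= deadline => {
                return Err(RpcReadinessTimeout { last_failures: failures });
            }
            // Not ready yet and still within the deadline: wait and retry.
            Err(_) => sleep(interval),
        }
    }
}
```

Keeping only the last round's failures is a design choice: earlier rounds usually fail for startup reasons that are no longer relevant by the time the deadline hits.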

## Why this is informal, not normative

Each endpoint's success criteria are determined by zebrad's implementation, not a documented contract. Behavior could change between zebrad versions. The conjunction is a defensible heuristic for our harness's purposes, not a Zcash-protocol-level definition of liveness.

The proper solution is upstream: zebrad exposing a documented `/readyz` endpoint (or equivalent RPC) whose contract is "this process is ready to serve any RPC." That belongs in a separate issue against `ZcashFoundation/zebra`, which still needs to be filed. Until then, the AND-of-4 is a forward-compatible approximation: when zebrad gets a real probe, the harness drops the conjunction and uses the probe directly.
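To make that swap cheap, the harness could hide the check behind a small trait. This is only a sketch of the idea: `ReadinessProbe` and `Conjunction` are hypothetical names, and the `call` closure stands in for whatever RPC client the harness actually uses.

```rust
/// Hypothetical abstraction over "is the node ready?".
trait ReadinessProbe {
    /// `Ok(())` when ready, otherwise one human-readable reason per failure.
    fn check(&self) -> Result<(), Vec<String>>;
}

/// Today's heuristic: every listed RPC method must succeed.
struct Conjunction<F: Fn(&str) -> Result<(), String>> {
    methods: Vec<&'static str>,
    /// Stand-in for a real RPC call; returns `Err` with the failure text.
    call: F,
}

impl<F: Fn(&str) -> Result<(), String>> ReadinessProbe for Conjunction<F> {
    fn check(&self) -> Result<(), Vec<String>> {
        let failures: Vec<String> = self
            .methods
            .iter()
            .copied()
            .filter_map(|m| (self.call)(m).err().map(|e| format!("{m}: {e}")))
            .collect();
        if failures.is_empty() { Ok(()) } else { Err(failures) }
    }
}
```

A future upstream probe would then be a second implementor of the same trait, and callers such as `wait_for_rpc_ready` would not need to change.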

## Acceptance criteria

- [ ] `informal_readiness` (or equivalent) implemented in `zcash_local_net::validator::zebrad`.
- [ ] `wait_for_rpc_ready` uses the conjunction; timeout errors enumerate which endpoints failed.
- [ ] Equivalent extension in `generate_blocks`'s retry loop: between attempts, check whether zebrad is still alive via the conjunction. Distinguish "connection died" from "submission rejected."
- [ ] Upstream issue filed against `ZcashFoundation/zebra` proposing a documented readiness probe; linked here.
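For the `generate_blocks` criterion, the decision between attempts might be sketched as follows. All names here (`SubmitFailure`, `NextStep`, `next_step`) are hypothetical, and the policy of retrying only after a lost connection to a node that still passes the conjunction is an assumption, not current harness behavior.

```rust
/// How a block submission attempt failed (hypothetical classification).
#[derive(Debug, PartialEq)]
enum SubmitFailure {
    /// No complete response arrived, e.g. the connection closed mid-request.
    ConnectionLost(String),
    /// The node answered and refused the block.
    Rejected(String),
}

/// What the retry loop should do next.
#[derive(Debug, PartialEq)]
enum NextStep {
    /// The node still passes the readiness check; submitting again is reasonable.
    Retry,
    /// The node fails the readiness check; report which probes failed.
    NodeUnready(Vec<String>),
    /// The node is alive and rejected the block; retrying will not help.
    Rejected(String),
}

/// Decides the next step. `readiness` is only invoked when the connection
/// was lost, because a rejection already proves the node is responding.
fn next_step(
    failure: SubmitFailure,
    readiness: impl FnOnce() -> Result<(), Vec<String>>,
) -> NextStep {
    match failure {
        SubmitFailure::Rejected(reason) => NextStep::Rejected(reason),
        SubmitFailure::ConnectionLost(_) => match readiness() {
            Ok(()) => NextStep::Retry,
            Err(failures) => NextStep::NodeUnready(failures),
        },
    }
}
```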

## Cross-reference

- #244 — concrete instance where this would have given an actionable error message instead of a hyper `IncompleteMessage`.
- #243 — the parent NU6.1 lockbox-disbursement work that surfaced the diagnostic gap.
- (TBD) `ZcashFoundation/zebra#NNN` — upstream proposal for a documented readiness probe.
