full repo can fall behind meta repo, undetected by the monitor (meta-vs-full load gap) #117

@tkuhn

Summary

The full repo can fall behind its own meta repo, ending up with fewer nanopubs than meta reports. Because the monitor (and the Nanopub-Query-Loaded-Nanopub-Count/-Checksum headers) source their count/checksum from meta, this gap is invisible: the monitor shows full consensus while generic SPARQL queries against /repo/full silently return incomplete results.

Observed

All three query instances reported meta consensus (82210, checksum BLJdIPNk3QNCYys8…) — the value the monitor at https://monitor.knowledgepixels.com/ displays. But the full repo's own admin-graph count/checksum diverged:

repo   query.knowledgepixels.com   query.petapico.org   query.nanodash.net
meta   82210 BLJdIPNk…             82210 BLJdIPNk…      82210 BLJdIPNk…
full   82210 BLJdIPNk…             82134 ysRLX3H2…      82107 KIiaObIP…

(knowledgepixels' full matches only because it had just been FORCE_RELOADed.)

A per-week histogram of distinct ORCID signers confirms the gap is concentrated in recent (current-week) nanopubs — all prior weeks are byte-identical across instances. The same histogram run against /repo/meta is identical on all three (18 users in the current week); run against /repo/full it returns 18 / 14 / 2.

Why this is anomalous

executeLoading (NanopubLoader.java:402-499) submits the full/text/last30d/pubkey/type tasks, waits for all of them to succeed, and only then submits the meta task and bumps the count/checksum. By that ordering a nanopub reaches meta only after it is already in full, so full should always be ≥ meta. A full repo sitting below its own meta means full is missing nanopubs that meta already recorded: the "meta vs full load gap" already named in the comment at NanopubLoader.java:156.

So either:

  • the gap was accumulated under older loader code (before the meta-deferred-until-others-succeed ordering), and persists because nothing reconciles it, or
  • full-repo write failures are being swallowed somewhere without blocking the meta bump on those hosts.
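
To make the second hypothesis concrete, here is a toy Python model of the ordering described above. It is not the actual NanopubLoader code: `load_nanopub`, the `failing` and `swallow_errors` parameters, and the use of in-memory sets as repos are all made up for illustration. It only shows that the "full ≥ meta" guarantee holds while task failures propagate, and breaks as soon as a failed write is swallowed inside the task.

```python
from concurrent.futures import ThreadPoolExecutor

# Repo names taken from the issue text; everything else here is hypothetical.
DEPENDENT_REPOS = ["full", "text", "last30d", "pubkey", "type"]


def load_nanopub(np_id, repos, failing=(), swallow_errors=False):
    """Toy lock-step load: write to all dependent repos first, then to meta."""

    def write(repo_name):
        try:
            if repo_name in failing:
                raise OSError(f"write to {repo_name} failed")
            repos[repo_name].add(np_id)
        except OSError:
            if not swallow_errors:
                raise
            # Swallowed: the task reports success although nothing was written.

    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(write, name) for name in DEPENDENT_REPOS]
        for future in futures:
            future.result()  # re-raises a task failure, so the meta step is skipped

    # Reached only if every dependent task reported success.
    repos["meta"].add(np_id)


repos = {name: set() for name in DEPENDENT_REPOS + ["meta"]}

load_nanopub("np1", repos)  # normal load
try:
    load_nanopub("np2", repos, failing={"full"})  # failure propagates, meta untouched
except OSError:
    pass
load_nanopub("np3", repos, failing={"full"}, swallow_errors=True)  # failure hidden

print(sorted(repos["full"]), sorted(repos["meta"]))
```

In this model np2 never reaches meta, so the invariant survives a visible failure. np3 reaches meta but not full, which is the same shape as the gap observed on petapico and nanodash.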

Impact

  • /repo/full SPARQL results are silently incomplete on affected instances.
  • The monitor cannot detect it, since it only checks meta. "Consensus" on the dashboard does not imply full is synced.

Suggested actions

  1. Detection: have the monitor (or a health check) compare per-repo npa:hasNanopubCount/npa:hasNanopubChecksum across meta and full (and ideally text/last30d), and flag any repo whose count/checksum drifts from meta.
  2. Root cause: check whether full-repo load failures can complete without blocking the meta task (i.e. is the lock-step guarantee actually holding on all hosts?), and inspect loader logs on petapico/nanodash for swallowed full write errors around recent loads.
  3. Remediation: FORCE_RELOAD rebuilds full to match meta (already done on knowledgepixels); apply to petapico/nanodash. Consider an automatic per-repo reconciliation rather than a manual full reload.
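
A minimal sketch of the detection step in item 1, assuming the per-repo count and checksum have already been fetched for one instance. The function name `find_drift` and the input shape (a dict of repo name to `(count, checksum)`) are assumptions, not existing monitor code. The sample values are the ones reported in the Observed section, with checksums truncated exactly as they appear there.

```python
def find_drift(repo_states, reference="meta"):
    """Return the repos whose (count, checksum) differs from the reference repo.

    repo_states maps a repo name to a (count, checksum) tuple.
    """
    ref_state = repo_states[reference]
    return {
        name: state
        for name, state in repo_states.items()
        if name != reference and state != ref_state
    }


# Values from the Observed table; checksums are truncated prefixes as reported.
observed = {
    "query.knowledgepixels.com": {"meta": (82210, "BLJdIPNk"), "full": (82210, "BLJdIPNk")},
    "query.petapico.org": {"meta": (82210, "BLJdIPNk"), "full": (82134, "ysRLX3H2")},
    "query.nanodash.net": {"meta": (82210, "BLJdIPNk"), "full": (82107, "KIiaObIP")},
}

for host, states in observed.items():
    drift = find_drift(states)
    status = "OK" if not drift else f"DRIFT {drift}"
    print(host, status)
```

Comparing the full tuple rather than only the count would also catch the case where two repos hold the same number of nanopubs but different sets.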

Reproduction

# count + checksum, run against /repo/meta and /repo/full on each instance
PREFIX npa: <http://purl.org/nanopub/admin/>
SELECT ?count ?checksum WHERE {
  GRAPH npa:graph {
    ?repo npa:hasNanopubCount ?count .
    ?repo npa:hasNanopubChecksum ?checksum .
  }
}
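
For convenience, the query above can be scripted. The sketch below uses only the Python standard library and the standard SPARQL protocol (GET with a `query` parameter, JSON results). The endpoint URL shape `https://<host>/repo/<name>` is an assumption based on the /repo/meta and /repo/full paths mentioned in this issue, and the code assumes the count comes back as an integer literal; adjust both if the deployment differs.

```python
import json
import urllib.parse
import urllib.request

QUERY = """
PREFIX npa: <http://purl.org/nanopub/admin/>
SELECT ?count ?checksum WHERE {
  GRAPH npa:graph {
    ?repo npa:hasNanopubCount ?count .
    ?repo npa:hasNanopubChecksum ?checksum .
  }
}
"""


def build_request(endpoint, query=QUERY):
    """Build a SPARQL-protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def parse_rows(results_json):
    """Turn a SPARQL JSON results document into (count, checksum) tuples."""
    doc = json.loads(results_json)
    return [
        (int(b["count"]["value"]), b["checksum"]["value"])
        for b in doc["results"]["bindings"]
    ]


def fetch(endpoint):
    """Run the query against one repo endpoint (performs a network call)."""
    with urllib.request.urlopen(build_request(endpoint), timeout=30) as resp:
        return parse_rows(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Hosts from this issue; the URL layout is an assumption.
    for host in ("query.knowledgepixels.com", "query.petapico.org", "query.nanodash.net"):
        for repo in ("meta", "full"):
            print(host, repo, fetch(f"https://{host}/repo/{repo}"))
```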
