Merge pull request #453 from AdaWorldAPI/docs/cluster-asymmetry

AdaWorldAPI · web-flow · commit 16f879ba587f · 2026-06-03T13:48:50.000+02:00
docs: cluster asymmetry — capacity-forced vs availability-chosen clustering
diff --git a/docs/CLUSTER_ASYMMETRY.md b/docs/CLUSTER_ASYMMETRY.md
@@ -0,0 +1,284 @@
+# Cluster ≠ cluster: how lance-graph clusters differ from Cassandra-era clusters
+
+## TL;DR
+
+There are two distinct reasons to cluster a distributed system:
+
+1. **Capacity-forced clustering**: the data does not fit on one node.
+   You shard. Cross-node queries are required. (Cassandra, JanusGraph,
+   classic Elasticsearch deployments.)
+
+2. **Availability-chosen clustering**: the data fits on one node.
+   You replicate it across N nodes for HA + geo + load distribution.
+   Each node holds the full dataset. Cross-node queries are NOT
+   required for the hot path. (CockroachDB, etcd, peer-Raft Lance.)
+
+Lance-graph consumers are virtually never capacity-forced (the
+encoding cascade + columnar layout + radix-trie deduplication
+collapse storage by 1-3 orders of magnitude vs LSM-tree wide-column
+stores). Clustering is always availability-chosen.
+
+Don't import the Cassandra cluster operations playbook into a
+lance-graph deployment. The failure modes, the operational rhythms,
+the budgeting assumptions are all qualitatively different.
+
+> ### Scope: external architecture pattern, not a built-in lance-graph feature
+>
+> Per codex P1 review on PR #453: the deployment shapes described
+> below — particularly "peer-Raft + Lance-local-per-node" — describe
+> an EXTERNAL ARCHITECTURE PATTERN that adopters can build on top of
+> lance-graph, NOT a built-in lance-graph capability. Lance-graph
+> itself provides the columnar storage, the DataFusion query path,
+> the encoding crates, and the Rust API surface. The Raft layer, the
+> substrate binary, and the consensus-replication path are
+> downstream consumer code. Adopters implementing this pattern reach
+> for `openraft` (or `surreal-cluster` if their stack is
+> surrealdb-shaped) to provide the Raft layer; they would NOT
+> inherit one from lance-graph.
+>
+> The doc documents WHY this pattern works well WHEN built on
+> lance-graph (the append-only storage model, cheap anti-entropy,
+> per-node fit), not a feature lance-graph itself ships. Adopters
+> who only consume lance-graph's columnar + DataFusion path should
+> NOT assume their data is automatically replicated.
+
+
+## The two cluster shapes, side by side
+
+| Concern | Capacity-forced (Cassandra+JG) | Availability-chosen (peer-Raft Lance) |
+|---|---|---|
+| Reason to cluster | Data does not fit on one node | Data fits on one node; replicate for HA + geo |
+| Each node holds | 1/N of the data (consistent-hash shards) | 100% of the data |
+| Hot-path reads | Coordinator pattern; fan-out to shard owners | Local; no cross-node hop |
+| Replication factor + storage amplification | R=3-5 per replica; total cluster storage = N_shards × R (sharding COMPOUNDS replication amplification) | R=3 per replica; total cluster storage = R × dataset_size (same per-replica amplification; no sharding compound) |
+| Effective node count | 3N to 5N where N is shards needed for capacity | 3 (or more if geo distribution is required) |
+| Compaction shape | LSM SSTable: tombstone-reclaim + run-merge; coordinates with replication; lag spikes during | Lance file (`DatasetOptimizer.compact_files`): merge small fragments for layout; independent of consensus; no replication-lag interaction |
+| Cross-shard transactions | Hard; needs 2PC or careful avoidance | Not applicable (no shards) |
+| Anti-entropy | Merkle-tree comparison; expensive | Manifest hash compare; O(1) decision |
+| Read latency | Depends on coordinator + slowest shard | Local-replica latency; predictable |
+| Operator burden | High (capacity planning, compaction scheduling, rebalancing) | Low (HA discipline only; no shard management) |
+
+
+**Storage amplification honesty (per codex P2 review on PR #453):**
+Both shapes have replication storage amplification — three replicas
+of a 5GB dataset is 15GB of total cluster disk regardless of
+distribution shape. The architectural advantage of availability-chosen
+clustering is NOT that amplification disappears (it doesn't); it's
+that amplification does NOT compound with shard count. Cassandra at
+RF=3 with 10 capacity shards consumes ~30× the single-replica
+storage; peer-Raft at R=3 consumes exactly 3×. Same per-replica
+amplification, no shard multiplier on top.
+
+## Why lance-graph consumers fit on one node
+
+The encoding stack typical of lance-graph consumers compresses data
+by 1-3 orders of magnitude vs LSM-tree wide-column representations:
+
+- **highheelbgz SpiralAddress**: 3 integers (12 bytes or 6 bytes
+  for u16 variants) representing a φ-spiral walking address — NOT
+  a copy of the weight vector, an address into a deterministic
+  spiral. Per-row footprint approaches the information-theoretic
+  minimum for the addressing dimension.
+
+- **bgz-hhtl-d Slot D / Slot V**: 4 bytes total per row — 2-byte
+  Slot D (HEEL basin 2 bits / HIP family 4 bits / TWIG centroid 8
+  bits / polarity 1 bit / reserved 1 bit) + 2-byte Slot V (BF16
+  residual magnitude). The hierarchical addressing is in the bit
+  layout; the residual captures the per-row delta from the centroid.
+
+- **bgz17 palette256**: 256-archetype compose table for multi-hop
+  semantic relations. 8 bits per archetype reference; multi-hop
+  composition in O(1) via the compose table.
+
+- **CAM-PQ leaf vectors**: 6 bytes per row for the leaf-exact-match
+  vector projection. HHTL-banked variant (16 family subcodebooks)
+  reduces ANN scan space by ~16× under empirical intra-family
+  locality (98.6% per the `lance-graph` PR #444 probe).
+
+- **vort/vart adaptive radix trie**: structural deduplication of
+  shared HHTL prefixes. The heel + hip nibbles common across many
+  entities are stored ONCE per path segment, not N times. Adaptive
+  Radix Tree shape: O(k) lookup, prefix-sharing storage. The same
+  structure also serves as the time-axis index over Lance
+  `versions()` for cold-path queries.
+
+Concrete example: Wikidata (~115M entities). In Cassandra+JG, the
+indexed graph form is multi-TB with replication factor 3 → multi-TB
+× 3 across the cluster. In lance-graph with the above encoding stack,
+the same corpus compresses to low single-digit GB total (including
+indexes and the Lance version log). Fits on a modest single node.
+Each peer replica holds the FULL dataset.
+
+This is not an academic claim; it's the deployment shape proven in
+the `AdaWorldAPI/bardioc` B1 substrate-b reference implementation.
+
+## Knock-on consequences of availability-chosen clustering
+
+### 1. Three-node deployment is the starting recommendation, not the toy
+
+For Cassandra-era thinkers, three-node clusters are toys. Production
+Cassandra deployments often run 12-24 nodes. The reasoning: each node
+holds 1/N of the data; you need many N to fit the corpus; replication
+factor 3 multiplies again.
+
+For peer-Raft Lance: three-node clusters are production. Each node
+holds the full dataset. Three replicas give you majority quorum (one
+failure tolerated). More replicas are added for geographic distribution
+or read-load fanout, not for capacity.
+
+Default starting deployment: 3 substrate-b instances, one per AZ in
+the same region, peer-Raft replicated.
+
+### 2. No coordinator pattern; no fan-out lag
+
+Cassandra reads route through a coordinator node which fans out to
+the shard owners and aggregates the response. The query latency is
+bounded below by the slowest shard's response time. Hot shards drag
+the cluster.
+
+Peer-Raft Lance reads are LOCAL. The client connects to any node;
+the node has the full dataset; it returns the answer without contacting
+peers (for eventually-consistent reads) or with a single Raft read-index
+round (for linearizable reads). No fan-out, no aggregation, no
+slowest-shard lag.
+
+### 3. Compaction is qualitatively different (lighter coordination)
+
+Cassandra-style LSM compaction rewrites SSTables to reclaim tombstones
+and merge sorted runs. It is CPU + IO heavy and creates replication
+lag spikes; if too many nodes compact simultaneously, the cluster's
+effective replication factor temporarily drops. Operators schedule
+compactions across nodes to avoid that.
+
+Lance has compaction too, but of a different shape:
+`DatasetOptimizer.compact_files` merges small fragments into larger
+ones for query layout optimization (many small appends produce many
+small fragments which slow scans; periodic file compaction restores
+good layout). It is NOT a tombstone-reclaim cycle — Lance is
+append-only at the version level, so there are no tombstones to
+reclaim in the LSM sense.
+
+The qualitative difference:
+
+- **LSM compaction** is a CORRECTNESS + SPACE concern (tombstones must
+  be reclaimed to bound storage; runs must merge for read performance);
+  coordination with replication is unavoidable.
+- **Lance file compaction** is a LAYOUT OPTIMIZATION concern (queries
+  get faster when fragments are larger; correctness is unaffected);
+  it can be scheduled independently per node, produces new fragments
+  that are themselves append-only, and does NOT block or interact
+  with consensus replication.
+
+So Lance compaction exists and operators should plan for it (Lance's
+table-maintenance docs describe `DatasetOptimizer.compact_files`).
+The operational burden is lower than Cassandra LSM compaction because
+the coordination requirements are weaker — but it is not zero. (Per
+codex P2 review on PR #453.)
+
+### 4. Anti-entropy is cheap
+
+Cassandra anti-entropy (catching up a lagging replica) compares SSTable
+Merkle trees node-by-node and streams the diffs. This is expensive and
+creates load on both sides.
+
+Lance peer-Raft anti-entropy: compare the manifest hash between nodes.
+If equal, sync. If not, ship missing fragments + the new manifest. The
+IDENTIFICATION step is O(1). The streaming step is bounded by the actual
+divergence, not by the dataset size.
+
+### 5. Rebalancing is not a thing
+
+Cassandra rebalancing (when adding or removing nodes) requires data
+movement. Token ranges shift; data streams from old owners to new
+owners; the cluster operates in a degraded state during the rebalance.
+
+Peer-Raft Lance: adding a node = bring up a new substrate-b instance,
+let it catch up via Raft, mark it a voter. No data movement needed
+beyond the catch-up (and the catch-up is the same wire pattern as
+the per-write replication — no special "rebalance" mode). Removing
+a node: stop the substrate-b, mark it non-voter, retire it. No data
+movement.
+
+## Knock-on consequences for the Raft consensus tax
+
+Raft consensus IS on the per-request budget for writes and linearizable
+reads. This is irreducible for the distributed-OLTP property. (See
+the companion doc `append-only-raft-dovetail.md` for why this lands
+lighter in Lance than in LSM-based systems.)
+
+In an availability-chosen cluster, the consensus tax lands EVEN lighter
+than in a capacity-forced cluster:
+
+- **Smaller per-node datasets**: Raft logs are smaller; commits propagate
+  faster
+- **Fewer replicas needed**: 3 substrate-b instances vs 12-24+
+  Cassandra+JG nodes → less coordination overhead per write
+- **No cross-node fan-out**: linearizable read-index can be served by
+  the local leader of the relevant Raft group; doesn't require remote
+  round-trip when the local instance IS the leader
+
+## When you actually DO need capacity-forced sharding
+
+Rare for lance-graph consumers, but possible:
+
+- Corpora dramatically larger than Wikidata (multi-billion entities or
+  multi-PB raw data)
+- Workloads with hot-key distributions that exceed a single node's
+  IO bandwidth
+- Specific compliance requirements that mandate physical data isolation
+  per tenant
+
+In those cases, the appropriate response is application-level sharding
+or tenant-level partitioning — NOT a Cassandra-style consistent-hash
+ring. Each shard or tenant gets its own peer-Raft + Lance cluster.
+The capacity dimension is solved by horizontal application-layer
+partitioning; each underlying cluster remains availability-chosen.
+
+## What this doc does NOT claim
+
+- **Single-node deployments are sufficient for production.** They're
+  not. HA requires multiple replicas (or accepted downtime during
+  failures). Single-node is a development / staging shape.
+
+- **Three-node is universally optimal.** Geographic distribution
+  may require more (one replica per region). Specific availability
+  targets may justify more replicas.
+
+- **All Lance + Raft stacks ship the same compression.** The specific
+  encoding cascade described above is from the `bardioc` B1 reference
+  consumer; other consumers will have different per-row footprints.
+  The qualitative property (orders-of-magnitude vs LSM wide-column)
+  remains, but specific numbers vary.
+
+- **Cassandra+JG choose their shape wrongly.** The Cassandra design
+  is correct FOR THE STORAGE MODEL IT HAS. The architectural choice
+  that produces a different deployment shape is the choice to use
+  Lance + Raft, with append-only storage and columnar compression.
+  The doc names the consequence, not the wrongness of the alternative.
+
+## Recommended deployment pattern (reference)
+
+> **Reminder (per codex P1):** This pattern is the bardioc B1
+> substrate-b reference architecture, NOT a built-in lance-graph
+> feature. Adopters provide the Raft layer themselves (openraft /
+> surreal-cluster / external TiKV). Lance-graph contributes the
+> columnar storage + DataFusion + encoding crates that MAKE this
+> pattern cheap, not the pattern itself.
+
+See [reference consumer implementation in bardioc B1 substrate-b]
+(separate proposed doc) for a worked example. Briefly:
+
+- Three substrate-b instances, one per availability zone within a region
+- Each substrate-b is one Rust binary AND one Raft node (Raft impl
+  from openraft or surreal-cluster — NOT inherited from lance-graph)
+- Lance dataset local to each instance (full dataset, not a shard)
+- Reads serve from local Lance with no cross-node coordination
+  (eventually-consistent) or from a Raft read-index round (linearizable)
+- Writes serve via Raft quorum to the local leader; replicated as Lance
+  fragment appends
+- Add a fourth+ instance only for geographic distribution or read-load
+  fanout, not for capacity
+
+This is the shape proven against Wikidata-scale workloads in the
+bardioc reference consumer. Operational complexity is significantly
+lower than the Cassandra-era equivalent at the same availability target.