|
| 1 | +# Cluster ≠ cluster: how lance-graph clusters differ from Cassandra-era clusters |
| 2 | + |
| 3 | +## TL;DR |
| 4 | + |
| 5 | +There are two distinct reasons to cluster a distributed system: |
| 6 | + |
| 7 | +1. **Capacity-forced clustering**: the data does not fit on one node. |
| 8 | + You shard. Cross-node queries are required. (Cassandra, JanusGraph, |
| 9 | + classic Elasticsearch deployments.) |
| 10 | + |
| 11 | +2. **Availability-chosen clustering**: the data fits on one node. |
| 12 | + You replicate it across N nodes for HA + geo + load distribution. |
| 13 | + Each node holds the full dataset. Cross-node queries are NOT |
| 14 | + required for the hot path. (CockroachDB, etcd, peer-Raft Lance.) |
| 15 | + |
| 16 | +Lance-graph consumers are virtually never capacity-forced (the |
| 17 | +encoding cascade + columnar layout + radix-trie deduplication |
| 18 | +collapse storage by 1-3 orders of magnitude vs LSM-tree wide-column |
| 19 | +stores). Clustering is always availability-chosen. |
| 20 | + |
| 21 | +Don't import the Cassandra cluster operations playbook into a |
| 22 | +lance-graph deployment. The failure modes, the operational rhythms, |
| 23 | +the budgeting assumptions are all qualitatively different. |
| 24 | + |
| 25 | +> ### Scope: external architecture pattern, not a built-in lance-graph feature |
| 26 | +> |
| 27 | +> Per codex P1 review on PR #453: the deployment shapes described |
| 28 | +> below — particularly "peer-Raft + Lance-local-per-node" — describe |
| 29 | +> an EXTERNAL ARCHITECTURE PATTERN that adopters can build on top of |
| 30 | +> lance-graph, NOT a built-in lance-graph capability. Lance-graph |
| 31 | +> itself provides the columnar storage, the DataFusion query path, |
| 32 | +> the encoding crates, and the Rust API surface. The Raft layer, the |
| 33 | +> substrate binary, and the consensus-replication path are |
| 34 | +> downstream consumer code. Adopters implementing this pattern reach |
| 35 | +> for `openraft` (or `surreal-cluster` if their stack is |
| 36 | +> surrealdb-shaped) to provide the Raft layer; they would NOT |
| 37 | +> inherit one from lance-graph. |
| 38 | +> |
| 39 | +> The doc documents WHY this pattern works well WHEN built on |
| 40 | +> lance-graph (the append-only storage model, cheap anti-entropy, |
| 41 | +> per-node fit), not a feature lance-graph itself ships. Adopters |
| 42 | +> who only consume lance-graph's columnar + DataFusion path should |
| 43 | +> NOT assume their data is automatically replicated. |
| 44 | +
|
| 45 | + |
| 46 | +## The two cluster shapes, side by side |
| 47 | + |
| 48 | +| Concern | Capacity-forced (Cassandra+JG) | Availability-chosen (peer-Raft Lance) | |
| 49 | +|---|---|---| |
| 50 | +| Reason to cluster | Data does not fit on one node | Data fits on one node; replicate for HA + geo | |
| 51 | +| Each node holds | 1/N of the data (consistent-hash shards) | 100% of the data | |
| 52 | +| Hot-path reads | Coordinator pattern; fan-out to shard owners | Local; no cross-node hop | |
| 53 | +| Replication factor + storage amplification | R=3-5 per replica; total cluster storage = N_shards × R (sharding COMPOUNDS replication amplification) | R=3 per replica; total cluster storage = R × dataset_size (same per-replica amplification; no sharding compound) | |
| 54 | +| Effective node count | 3N to 5N where N is shards needed for capacity | 3 (or more if geo distribution is required) | |
| 55 | +| Compaction shape | LSM SSTable: tombstone-reclaim + run-merge; coordinates with replication; lag spikes during | Lance file (`DatasetOptimizer.compact_files`): merge small fragments for layout; independent of consensus; no replication-lag interaction | |
| 56 | +| Cross-shard transactions | Hard; needs 2PC or careful avoidance | Not applicable (no shards) | |
| 57 | +| Anti-entropy | Merkle-tree comparison; expensive | Manifest hash compare; O(1) decision | |
| 58 | +| Read latency | Depends on coordinator + slowest shard | Local-replica latency; predictable | |
| 59 | +| Operator burden | High (capacity planning, compaction scheduling, rebalancing) | Low (HA discipline only; no shard management) | |
| 60 | + |
| 61 | + |
| 62 | +**Storage amplification honesty (per codex P2 review on PR #453):** |
| 63 | +Both shapes have replication storage amplification — three replicas |
| 64 | +of a 5GB dataset is 15GB of total cluster disk regardless of |
| 65 | +distribution shape. The architectural advantage of availability-chosen |
| 66 | +clustering is NOT that amplification disappears (it doesn't); it's |
| 67 | +that amplification does NOT compound with shard count. Cassandra at |
| 68 | +RF=3 with 10 capacity shards consumes ~30× the single-replica |
| 69 | +storage; peer-Raft at R=3 consumes exactly 3×. Same per-replica |
| 70 | +amplification, no shard multiplier on top. |
| 71 | + |
| 72 | +## Why lance-graph consumers fit on one node |
| 73 | + |
| 74 | +The encoding stack typical of lance-graph consumers compresses data |
| 75 | +by 1-3 orders of magnitude vs LSM-tree wide-column representations: |
| 76 | + |
| 77 | +- **highheelbgz SpiralAddress**: 3 integers (12 bytes or 6 bytes |
| 78 | + for u16 variants) representing a φ-spiral walking address — NOT |
| 79 | + a copy of the weight vector, an address into a deterministic |
| 80 | + spiral. Per-row footprint approaches the information-theoretic |
| 81 | + minimum for the addressing dimension. |
| 82 | + |
| 83 | +- **bgz-hhtl-d Slot D / Slot V**: 4 bytes total per row — 2-byte |
| 84 | + Slot D (HEEL basin 2 bits / HIP family 4 bits / TWIG centroid 8 |
| 85 | + bits / polarity 1 bit / reserved 1 bit) + 2-byte Slot V (BF16 |
| 86 | + residual magnitude). The hierarchical addressing is in the bit |
| 87 | + layout; the residual captures the per-row delta from the centroid. |
| 88 | + |
| 89 | +- **bgz17 palette256**: 256-archetype compose table for multi-hop |
| 90 | + semantic relations. 8 bits per archetype reference; multi-hop |
| 91 | + composition in O(1) via the compose table. |
| 92 | + |
| 93 | +- **CAM-PQ leaf vectors**: 6 bytes per row for the leaf-exact-match |
| 94 | + vector projection. HHTL-banked variant (16 family subcodebooks) |
| 95 | + reduces ANN scan space by ~16× under empirical intra-family |
| 96 | + locality (98.6% per the `lance-graph` PR #444 probe). |
| 97 | + |
| 98 | +- **vort/vart adaptive radix trie**: structural deduplication of |
| 99 | + shared HHTL prefixes. The heel + hip nibbles common across many |
| 100 | + entities are stored ONCE per path segment, not N times. Adaptive |
| 101 | + Radix Tree shape: O(k) lookup, prefix-sharing storage. The same |
| 102 | + structure also serves as the time-axis index over Lance |
| 103 | + `versions()` for cold-path queries. |
| 104 | + |
| 105 | +Concrete example: Wikidata (~115M entities). In Cassandra+JG, the |
| 106 | +indexed graph form is multi-TB with replication factor 3 → multi-TB |
| 107 | +× 3 across the cluster. In lance-graph with the above encoding stack, |
| 108 | +the same corpus compresses to low single-digit GB total (including |
| 109 | +indexes and the Lance version log). Fits on a modest single node. |
| 110 | +Each peer replica holds the FULL dataset. |
| 111 | + |
| 112 | +This is not an academic claim; it's the deployment shape proven in |
| 113 | +the `AdaWorldAPI/bardioc` B1 substrate-b reference implementation. |
| 114 | + |
| 115 | +## Knock-on consequences of availability-chosen clustering |
| 116 | + |
| 117 | +### 1. Three-node deployment is the starting recommendation, not the toy |
| 118 | + |
| 119 | +For Cassandra-era thinkers, three-node clusters are toys. Production |
| 120 | +Cassandra deployments often run 12-24 nodes. The reasoning: each node |
| 121 | +holds 1/N of the data; you need many N to fit the corpus; replication |
| 122 | +factor 3 multiplies again. |
| 123 | + |
| 124 | +For peer-Raft Lance: three-node clusters are production. Each node |
| 125 | +holds the full dataset. Three replicas give you majority quorum (one |
| 126 | +failure tolerated). More replicas are added for geographic distribution |
| 127 | +or read-load fanout, not for capacity. |
| 128 | + |
| 129 | +Default starting deployment: 3 substrate-b instances, one per AZ in |
| 130 | +the same region, peer-Raft replicated. |
| 131 | + |
| 132 | +### 2. No coordinator pattern; no fan-out lag |
| 133 | + |
| 134 | +Cassandra reads route through a coordinator node which fans out to |
| 135 | +the shard owners and aggregates the response. The query latency is |
| 136 | +bounded below by the slowest shard's response time. Hot shards drag |
| 137 | +the cluster. |
| 138 | + |
| 139 | +Peer-Raft Lance reads are LOCAL. The client connects to any node; |
| 140 | +the node has the full dataset; it returns the answer without contacting |
| 141 | +peers (for eventually-consistent reads) or with a single Raft read-index |
| 142 | +round (for linearizable reads). No fan-out, no aggregation, no |
| 143 | +slowest-shard lag. |
| 144 | + |
| 145 | +### 3. Compaction is qualitatively different (lighter coordination) |
| 146 | + |
| 147 | +Cassandra-style LSM compaction rewrites SSTables to reclaim tombstones |
| 148 | +and merge sorted runs. It is CPU + IO heavy and creates replication |
| 149 | +lag spikes; if too many nodes compact simultaneously, the cluster's |
| 150 | +effective replication factor temporarily drops. Operators schedule |
| 151 | +compactions across nodes to avoid that. |
| 152 | + |
| 153 | +Lance has compaction too, but of a different shape: |
| 154 | +`DatasetOptimizer.compact_files` merges small fragments into larger |
| 155 | +ones for query layout optimization (many small appends produce many |
| 156 | +small fragments which slow scans; periodic file compaction restores |
| 157 | +good layout). It is NOT a tombstone-reclaim cycle — Lance is |
| 158 | +append-only at the version level, so there are no tombstones to |
| 159 | +reclaim in the LSM sense. |
| 160 | + |
| 161 | +The qualitative difference: |
| 162 | + |
| 163 | +- **LSM compaction** is a CORRECTNESS + SPACE concern (tombstones must |
| 164 | + be reclaimed to bound storage; runs must merge for read performance); |
| 165 | + coordination with replication is unavoidable. |
| 166 | +- **Lance file compaction** is a LAYOUT OPTIMIZATION concern (queries |
| 167 | + get faster when fragments are larger; correctness is unaffected); |
| 168 | + it can be scheduled independently per node, produces new fragments |
| 169 | + that are themselves append-only, and does NOT block or interact |
| 170 | + with consensus replication. |
| 171 | + |
| 172 | +So Lance compaction exists and operators should plan for it (Lance's |
| 173 | +table-maintenance docs describe `DatasetOptimizer.compact_files`). |
| 174 | +The operational burden is lower than Cassandra LSM compaction because |
| 175 | +the coordination requirements are weaker — but it is not zero. (Per |
| 176 | +codex P2 review on PR #453.) |
| 177 | + |
| 178 | +### 4. Anti-entropy is cheap |
| 179 | + |
| 180 | +Cassandra anti-entropy (catching up a lagging replica) compares SSTable |
| 181 | +Merkle trees node-by-node and streams the diffs. This is expensive and |
| 182 | +creates load on both sides. |
| 183 | + |
| 184 | +Lance peer-Raft anti-entropy: compare the manifest hash between nodes. |
| 185 | +If equal, sync. If not, ship missing fragments + the new manifest. The |
| 186 | +IDENTIFICATION step is O(1). The streaming step is bounded by the actual |
| 187 | +divergence, not by the dataset size. |
| 188 | + |
| 189 | +### 5. Rebalancing is not a thing |
| 190 | + |
| 191 | +Cassandra rebalancing (when adding or removing nodes) requires data |
| 192 | +movement. Token ranges shift; data streams from old owners to new |
| 193 | +owners; the cluster operates in a degraded state during the rebalance. |
| 194 | + |
| 195 | +Peer-Raft Lance: adding a node = bring up a new substrate-b instance, |
| 196 | +let it catch up via Raft, mark it a voter. No data movement needed |
| 197 | +beyond the catch-up (and the catch-up is the same wire pattern as |
| 198 | +the per-write replication — no special "rebalance" mode). Removing |
| 199 | +a node: stop the substrate-b, mark it non-voter, retire it. No data |
| 200 | +movement. |
| 201 | + |
| 202 | +## Knock-on consequences for the Raft consensus tax |
| 203 | + |
| 204 | +Raft consensus IS on the per-request budget for writes and linearizable |
| 205 | +reads. This is irreducible for the distributed-OLTP property. (See |
| 206 | +the companion doc `append-only-raft-dovetail.md` for why this lands |
| 207 | +lighter in Lance than in LSM-based systems.) |
| 208 | + |
| 209 | +In an availability-chosen cluster, the consensus tax lands EVEN lighter |
| 210 | +than in a capacity-forced cluster: |
| 211 | + |
| 212 | +- **Smaller per-node datasets**: Raft logs are smaller; commits propagate |
| 213 | + faster |
| 214 | +- **Fewer replicas needed**: 3 substrate-b instances vs 12-24+ |
| 215 | + Cassandra+JG nodes → less coordination overhead per write |
| 216 | +- **No cross-node fan-out**: linearizable read-index can be served by |
| 217 | + the local leader of the relevant Raft group; doesn't require remote |
| 218 | + round-trip when the local instance IS the leader |
| 219 | + |
| 220 | +## When you actually DO need capacity-forced sharding |
| 221 | + |
| 222 | +Rare for lance-graph consumers, but possible: |
| 223 | + |
| 224 | +- Corpora dramatically larger than Wikidata (multi-billion entities or |
| 225 | + multi-PB raw data) |
| 226 | +- Workloads with hot-key distributions that exceed a single node's |
| 227 | + IO bandwidth |
| 228 | +- Specific compliance requirements that mandate physical data isolation |
| 229 | + per tenant |
| 230 | + |
| 231 | +In those cases, the appropriate response is application-level sharding |
| 232 | +or tenant-level partitioning — NOT a Cassandra-style consistent-hash |
| 233 | +ring. Each shard or tenant gets its own peer-Raft + Lance cluster. |
| 234 | +The capacity dimension is solved by horizontal application-layer |
| 235 | +partitioning; each underlying cluster remains availability-chosen. |
| 236 | + |
| 237 | +## What this doc does NOT claim |
| 238 | + |
| 239 | +- **Single-node deployments are sufficient for production.** They're |
| 240 | + not. HA requires multiple replicas (or accepted downtime during |
| 241 | + failures). Single-node is a development / staging shape. |
| 242 | + |
| 243 | +- **Three-node is universally optimal.** Geographic distribution |
| 244 | + may require more (one replica per region). Specific availability |
| 245 | + targets may justify more replicas. |
| 246 | + |
| 247 | +- **All Lance + Raft stacks ship the same compression.** The specific |
| 248 | + encoding cascade described above is from the `bardioc` B1 reference |
| 249 | + consumer; other consumers will have different per-row footprints. |
| 250 | + The qualitative property (orders-of-magnitude vs LSM wide-column) |
| 251 | + remains, but specific numbers vary. |
| 252 | + |
| 253 | +- **Cassandra+JG choose their shape wrongly.** The Cassandra design |
| 254 | + is correct FOR THE STORAGE MODEL IT HAS. The architectural choice |
| 255 | + that produces a different deployment shape is the choice to use |
| 256 | + Lance + Raft, with append-only storage and columnar compression. |
| 257 | + The doc names the consequence, not the wrongness of the alternative. |
| 258 | + |
| 259 | +## Recommended deployment pattern (reference) |
| 260 | + |
| 261 | +> **Reminder (per codex P1):** This pattern is the bardioc B1 |
| 262 | +> substrate-b reference architecture, NOT a built-in lance-graph |
| 263 | +> feature. Adopters provide the Raft layer themselves (openraft / |
| 264 | +> surreal-cluster / external TiKV). Lance-graph contributes the |
| 265 | +> columnar storage + DataFusion + encoding crates that MAKE this |
| 266 | +> pattern cheap, not the pattern itself. |
| 267 | +
|
| 268 | +See [reference consumer implementation in bardioc B1 substrate-b] |
| 269 | +(separate proposed doc) for a worked example. Briefly: |
| 270 | + |
| 271 | +- Three substrate-b instances, one per availability zone within a region |
| 272 | +- Each substrate-b is one Rust binary AND one Raft node (Raft impl |
| 273 | + from openraft or surreal-cluster — NOT inherited from lance-graph) |
| 274 | +- Lance dataset local to each instance (full dataset, not a shard) |
| 275 | +- Reads serve from local Lance with no cross-node coordination |
| 276 | + (eventually-consistent) or from a Raft read-index round (linearizable) |
| 277 | +- Writes serve via Raft quorum to the local leader; replicated as Lance |
| 278 | + fragment appends |
| 279 | +- Add a fourth+ instance only for geographic distribution or read-load |
| 280 | + fanout, not for capacity |
| 281 | + |
| 282 | +This is the shape proven against Wikidata-scale workloads in the |
| 283 | +bardioc reference consumer. Operational complexity is significantly |
| 284 | +lower than the Cassandra-era equivalent at the same availability target. |
0 commit comments