Skip to content

Commit 16f879b

Browse files
authored
Merge pull request #453 from AdaWorldAPI/docs/cluster-asymmetry
docs: cluster asymmetry — capacity-forced vs availability-chosen clustering
2 parents 919ab92 + 8d8ad00 commit 16f879b

1 file changed

Lines changed: 284 additions & 0 deletions

File tree

docs/CLUSTER_ASYMMETRY.md

Lines changed: 284 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,284 @@
1+
# Cluster ≠ cluster: how lance-graph clusters differ from Cassandra-era clusters
2+
3+
## TL;DR
4+
5+
There are two distinct reasons to cluster a distributed system:
6+
7+
1. **Capacity-forced clustering**: the data does not fit on one node.
8+
You shard. Cross-node queries are required. (Cassandra, JanusGraph,
9+
classic Elasticsearch deployments.)
10+
11+
2. **Availability-chosen clustering**: the data fits on one node.
12+
You replicate it across N nodes for HA + geo + load distribution.
13+
Each node holds the full dataset. Cross-node queries are NOT
14+
required for the hot path. (CockroachDB, etcd, peer-Raft Lance.)
15+
16+
Lance-graph consumers are virtually never capacity-forced (the
17+
encoding cascade + columnar layout + radix-trie deduplication
18+
collapse storage by 1-3 orders of magnitude vs LSM-tree wide-column
19+
stores). Clustering is always availability-chosen.
20+
21+
Don't import the Cassandra cluster operations playbook into a
22+
lance-graph deployment. The failure modes, the operational rhythms,
23+
the budgeting assumptions are all qualitatively different.
24+
25+
> ### Scope: external architecture pattern, not a built-in lance-graph feature
26+
>
27+
> Per codex P1 review on PR #453: the deployment shapes described
28+
> below — particularly "peer-Raft + Lance-local-per-node" — describe
29+
> an EXTERNAL ARCHITECTURE PATTERN that adopters can build on top of
30+
> lance-graph, NOT a built-in lance-graph capability. Lance-graph
31+
> itself provides the columnar storage, the DataFusion query path,
32+
> the encoding crates, and the Rust API surface. The Raft layer, the
33+
> substrate binary, and the consensus-replication path are
34+
> downstream consumer code. Adopters implementing this pattern reach
35+
> for `openraft` (or `surreal-cluster` if their stack is
36+
> surrealdb-shaped) to provide the Raft layer; they would NOT
37+
> inherit one from lance-graph.
38+
>
39+
> The doc documents WHY this pattern works well WHEN built on
40+
> lance-graph (the append-only storage model, cheap anti-entropy,
41+
> per-node fit), not a feature lance-graph itself ships. Adopters
42+
> who only consume lance-graph's columnar + DataFusion path should
43+
> NOT assume their data is automatically replicated.
44+
45+
46+
## The two cluster shapes, side by side
47+
48+
| Concern | Capacity-forced (Cassandra+JG) | Availability-chosen (peer-Raft Lance) |
49+
|---|---|---|
50+
| Reason to cluster | Data does not fit on one node | Data fits on one node; replicate for HA + geo |
51+
| Each node holds | 1/N of the data (consistent-hash shards) | 100% of the data |
52+
| Hot-path reads | Coordinator pattern; fan-out to shard owners | Local; no cross-node hop |
53+
| Replication factor + storage amplification | R=3-5 per replica; total cluster storage = N_shards × R (sharding COMPOUNDS replication amplification) | R=3 per replica; total cluster storage = R × dataset_size (same per-replica amplification; no sharding compound) |
54+
| Effective node count | 3N to 5N where N is shards needed for capacity | 3 (or more if geo distribution is required) |
55+
| Compaction shape | LSM SSTable: tombstone-reclaim + run-merge; coordinates with replication; lag spikes during | Lance file (`DatasetOptimizer.compact_files`): merge small fragments for layout; independent of consensus; no replication-lag interaction |
56+
| Cross-shard transactions | Hard; needs 2PC or careful avoidance | Not applicable (no shards) |
57+
| Anti-entropy | Merkle-tree comparison; expensive | Manifest hash compare; O(1) decision |
58+
| Read latency | Depends on coordinator + slowest shard | Local-replica latency; predictable |
59+
| Operator burden | High (capacity planning, compaction scheduling, rebalancing) | Low (HA discipline only; no shard management) |
60+
61+
62+
**Storage amplification honesty (per codex P2 review on PR #453):**
63+
Both shapes have replication storage amplification — three replicas
64+
of a 5GB dataset is 15GB of total cluster disk regardless of
65+
distribution shape. The architectural advantage of availability-chosen
66+
clustering is NOT that amplification disappears (it doesn't); it's
67+
that amplification does NOT compound with shard count. Cassandra at
68+
RF=3 with 10 capacity shards consumes ~30× the single-replica
69+
storage; peer-Raft at R=3 consumes exactly 3×. Same per-replica
70+
amplification, no shard multiplier on top.
71+
72+
## Why lance-graph consumers fit on one node
73+
74+
The encoding stack typical of lance-graph consumers compresses data
75+
by 1-3 orders of magnitude vs LSM-tree wide-column representations:
76+
77+
- **highheelbgz SpiralAddress**: 3 integers (12 bytes or 6 bytes
78+
for u16 variants) representing a φ-spiral walking address — NOT
79+
a copy of the weight vector, an address into a deterministic
80+
spiral. Per-row footprint approaches the information-theoretic
81+
minimum for the addressing dimension.
82+
83+
- **bgz-hhtl-d Slot D / Slot V**: 4 bytes total per row — 2-byte
84+
Slot D (HEEL basin 2 bits / HIP family 4 bits / TWIG centroid 8
85+
bits / polarity 1 bit / reserved 1 bit) + 2-byte Slot V (BF16
86+
residual magnitude). The hierarchical addressing is in the bit
87+
layout; the residual captures the per-row delta from the centroid.
88+
89+
- **bgz17 palette256**: 256-archetype compose table for multi-hop
90+
semantic relations. 8 bits per archetype reference; multi-hop
91+
composition in O(1) via the compose table.
92+
93+
- **CAM-PQ leaf vectors**: 6 bytes per row for the leaf-exact-match
94+
vector projection. HHTL-banked variant (16 family subcodebooks)
95+
reduces ANN scan space by ~16× under empirical intra-family
96+
locality (98.6% per the `lance-graph` PR #444 probe).
97+
98+
- **vort/vart adaptive radix trie**: structural deduplication of
99+
shared HHTL prefixes. The heel + hip nibbles common across many
100+
entities are stored ONCE per path segment, not N times. Adaptive
101+
Radix Tree shape: O(k) lookup, prefix-sharing storage. The same
102+
structure also serves as the time-axis index over Lance
103+
`versions()` for cold-path queries.
104+
105+
Concrete example: Wikidata (~115M entities). In Cassandra+JG, the
106+
indexed graph form is multi-TB with replication factor 3 → multi-TB
107+
× 3 across the cluster. In lance-graph with the above encoding stack,
108+
the same corpus compresses to low single-digit GB total (including
109+
indexes and the Lance version log). Fits on a modest single node.
110+
Each peer replica holds the FULL dataset.
111+
112+
This is not an academic claim; it's the deployment shape proven in
113+
the `AdaWorldAPI/bardioc` B1 substrate-b reference implementation.
114+
115+
## Knock-on consequences of availability-chosen clustering
116+
117+
### 1. Three-node deployment is the starting recommendation, not the toy
118+
119+
For Cassandra-era thinkers, three-node clusters are toys. Production
120+
Cassandra deployments often run 12-24 nodes. The reasoning: each node
121+
holds 1/N of the data; you need many N to fit the corpus; replication
122+
factor 3 multiplies again.
123+
124+
For peer-Raft Lance: three-node clusters are production. Each node
125+
holds the full dataset. Three replicas give you majority quorum (one
126+
failure tolerated). More replicas are added for geographic distribution
127+
or read-load fanout, not for capacity.
128+
129+
Default starting deployment: 3 substrate-b instances, one per AZ in
130+
the same region, peer-Raft replicated.
131+
132+
### 2. No coordinator pattern; no fan-out lag
133+
134+
Cassandra reads route through a coordinator node which fans out to
135+
the shard owners and aggregates the response. The query latency is
136+
bounded below by the slowest shard's response time. Hot shards drag
137+
the cluster.
138+
139+
Peer-Raft Lance reads are LOCAL. The client connects to any node;
140+
the node has the full dataset; it returns the answer without contacting
141+
peers (for eventually-consistent reads) or with a single Raft read-index
142+
round (for linearizable reads). No fan-out, no aggregation, no
143+
slowest-shard lag.
144+
145+
### 3. Compaction is qualitatively different (lighter coordination)
146+
147+
Cassandra-style LSM compaction rewrites SSTables to reclaim tombstones
148+
and merge sorted runs. It is CPU + IO heavy and creates replication
149+
lag spikes; if too many nodes compact simultaneously, the cluster's
150+
effective replication factor temporarily drops. Operators schedule
151+
compactions across nodes to avoid that.
152+
153+
Lance has compaction too, but of a different shape:
154+
`DatasetOptimizer.compact_files` merges small fragments into larger
155+
ones for query layout optimization (many small appends produce many
156+
small fragments which slow scans; periodic file compaction restores
157+
good layout). It is NOT a tombstone-reclaim cycle — Lance is
158+
append-only at the version level, so there are no tombstones to
159+
reclaim in the LSM sense.
160+
161+
The qualitative difference:
162+
163+
- **LSM compaction** is a CORRECTNESS + SPACE concern (tombstones must
164+
be reclaimed to bound storage; runs must merge for read performance);
165+
coordination with replication is unavoidable.
166+
- **Lance file compaction** is a LAYOUT OPTIMIZATION concern (queries
167+
get faster when fragments are larger; correctness is unaffected);
168+
it can be scheduled independently per node, produces new fragments
169+
that are themselves append-only, and does NOT block or interact
170+
with consensus replication.
171+
172+
So Lance compaction exists and operators should plan for it (Lance's
173+
table-maintenance docs describe `DatasetOptimizer.compact_files`).
174+
The operational burden is lower than Cassandra LSM compaction because
175+
the coordination requirements are weaker — but it is not zero. (Per
176+
codex P2 review on PR #453.)
177+
178+
### 4. Anti-entropy is cheap
179+
180+
Cassandra anti-entropy (catching up a lagging replica) compares SSTable
181+
Merkle trees node-by-node and streams the diffs. This is expensive and
182+
creates load on both sides.
183+
184+
Lance peer-Raft anti-entropy: compare the manifest hash between nodes.
185+
If equal, sync. If not, ship missing fragments + the new manifest. The
186+
IDENTIFICATION step is O(1). The streaming step is bounded by the actual
187+
divergence, not by the dataset size.
188+
189+
### 5. Rebalancing is not a thing
190+
191+
Cassandra rebalancing (when adding or removing nodes) requires data
192+
movement. Token ranges shift; data streams from old owners to new
193+
owners; the cluster operates in a degraded state during the rebalance.
194+
195+
Peer-Raft Lance: adding a node = bring up a new substrate-b instance,
196+
let it catch up via Raft, mark it a voter. No data movement needed
197+
beyond the catch-up (and the catch-up is the same wire pattern as
198+
the per-write replication — no special "rebalance" mode). Removing
199+
a node: stop the substrate-b, mark it non-voter, retire it. No data
200+
movement.
201+
202+
## Knock-on consequences for the Raft consensus tax
203+
204+
Raft consensus IS on the per-request budget for writes and linearizable
205+
reads. This is irreducible for the distributed-OLTP property. (See
206+
the companion doc `append-only-raft-dovetail.md` for why this lands
207+
lighter in Lance than in LSM-based systems.)
208+
209+
In an availability-chosen cluster, the consensus tax lands EVEN lighter
210+
than in a capacity-forced cluster:
211+
212+
- **Smaller per-node datasets**: Raft logs are smaller; commits propagate
213+
faster
214+
- **Fewer replicas needed**: 3 substrate-b instances vs 12-24+
215+
Cassandra+JG nodes → less coordination overhead per write
216+
- **No cross-node fan-out**: linearizable read-index can be served by
217+
the local leader of the relevant Raft group; doesn't require remote
218+
round-trip when the local instance IS the leader
219+
220+
## When you actually DO need capacity-forced sharding
221+
222+
Rare for lance-graph consumers, but possible:
223+
224+
- Corpora dramatically larger than Wikidata (multi-billion entities or
225+
multi-PB raw data)
226+
- Workloads with hot-key distributions that exceed a single node's
227+
IO bandwidth
228+
- Specific compliance requirements that mandate physical data isolation
229+
per tenant
230+
231+
In those cases, the appropriate response is application-level sharding
232+
or tenant-level partitioning — NOT a Cassandra-style consistent-hash
233+
ring. Each shard or tenant gets its own peer-Raft + Lance cluster.
234+
The capacity dimension is solved by horizontal application-layer
235+
partitioning; each underlying cluster remains availability-chosen.
236+
237+
## What this doc does NOT claim
238+
239+
- **Single-node deployments are sufficient for production.** They're
240+
not. HA requires multiple replicas (or accepted downtime during
241+
failures). Single-node is a development / staging shape.
242+
243+
- **Three-node is universally optimal.** Geographic distribution
244+
may require more (one replica per region). Specific availability
245+
targets may justify more replicas.
246+
247+
- **All Lance + Raft stacks ship the same compression.** The specific
248+
encoding cascade described above is from the `bardioc` B1 reference
249+
consumer; other consumers will have different per-row footprints.
250+
The qualitative property (orders-of-magnitude vs LSM wide-column)
251+
remains, but specific numbers vary.
252+
253+
- **Cassandra+JG choose their shape wrongly.** The Cassandra design
254+
is correct FOR THE STORAGE MODEL IT HAS. The architectural choice
255+
that produces a different deployment shape is the choice to use
256+
Lance + Raft, with append-only storage and columnar compression.
257+
The doc names the consequence, not the wrongness of the alternative.
258+
259+
## Recommended deployment pattern (reference)
260+
261+
> **Reminder (per codex P1):** This pattern is the bardioc B1
262+
> substrate-b reference architecture, NOT a built-in lance-graph
263+
> feature. Adopters provide the Raft layer themselves (openraft /
264+
> surreal-cluster / external TiKV). Lance-graph contributes the
265+
> columnar storage + DataFusion + encoding crates that MAKE this
266+
> pattern cheap, not the pattern itself.
267+
268+
See [reference consumer implementation in bardioc B1 substrate-b]
269+
(separate proposed doc) for a worked example. Briefly:
270+
271+
- Three substrate-b instances, one per availability zone within a region
272+
- Each substrate-b is one Rust binary AND one Raft node (Raft impl
273+
from openraft or surreal-cluster — NOT inherited from lance-graph)
274+
- Lance dataset local to each instance (full dataset, not a shard)
275+
- Reads serve from local Lance with no cross-node coordination
276+
(eventually-consistent) or from a Raft read-index round (linearizable)
277+
- Writes serve via Raft quorum to the local leader; replicated as Lance
278+
fragment appends
279+
- Add a fourth+ instance only for geographic distribution or read-load
280+
fanout, not for capacity
281+
282+
This is the shape proven against Wikidata-scale workloads in the
283+
bardioc reference consumer. Operational complexity is significantly
284+
lower than the Cassandra-era equivalent at the same availability target.

0 commit comments

Comments
 (0)