Problem
During a migration using ZDM proxy, instability on the target cluster (node outages, compaction pressure, streaming) can have serious consequences:
- Write failures cascade to the client. The proxy waits for both clusters to respond before returning a result. If the target is slow or unresponsive, the proxy either returns the target's error or times out, even though the origin write succeeded. This can cause a full application outage driven by a cluster that isn't even the source of truth yet.
- No visibility into data divergence. When target writes fail, there is currently no way to determine which keyspaces or tables were affected. Operators have no per-table breakdown to understand the scope of the problem or plan targeted repairs.
- Unnecessary consistency requirements on the target. During dual-write migration, the target is being populated with data that can be repaired later. Using the same strong consistency level (e.g. LOCAL_QUORUM) on the target as on the origin means that target-side node failures can prevent writes that would otherwise succeed, even though temporary under-replication on the target is acceptable during migration.
- Superuser authentication increases failure risk. Superuser authentication in Cassandra requires QUORUM consistency internally, which increases the risk of authentication failures during node instability. There is currently no warning if the proxy is configured with a superuser account.
Solution
Three features to address this:
1. Target consistency level override (#163)
A new optional config property ZDM_TARGET_CONSISTENCY_LEVEL that overrides the consistency level for all requests forwarded to the target cluster. The origin cluster always receives the original client-requested consistency level, preserving the consistency contract on the source of truth.
For example, setting ZDM_TARGET_CONSISTENCY_LEVEL=LOCAL_ONE means the target accepts writes as long as a single local replica is available — significantly reducing the blast radius of target-side instability. Once migration is complete, a repair ensures all replicas are consistent.
To ensure correctness, a fully integrated CCM-based test sends traced writes (inline, prepared, and batch) through the proxy and asserts via system_traces.sessions on each cluster that the origin receives the original client-requested consistency level while the target receives the overridden value.
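The override rule described in this section can be illustrated with a minimal Go sketch. It is not the proxy's actual code: the `Cluster` type and the `effectiveConsistency` helper are hypothetical names, consistency levels are modelled as plain strings, and reading the property from an environment variable is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// Cluster identifies which side of the migration a request is forwarded to.
// Hypothetical type, used only for this illustration.
type Cluster int

const (
	Origin Cluster = iota
	Target
)

// effectiveConsistency returns the consistency level to send to a cluster.
// The origin always gets the level the client asked for; the target gets the
// override when one is configured (an empty string means "no override").
func effectiveConsistency(c Cluster, requested, targetOverride string) string {
	if c == Target && targetOverride != "" {
		return targetOverride
	}
	return requested
}

func main() {
	// Assumption: the property is supplied as an environment variable.
	override := os.Getenv("ZDM_TARGET_CONSISTENCY_LEVEL")
	fmt.Println("origin:", effectiveConsistency(Origin, "LOCAL_QUORUM", override))
	fmt.Println("target:", effectiveConsistency(Target, "LOCAL_QUORUM", override))
}
```

With the variable unset, both lines print the client's level, which matches the property being optional.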
2. Superuser startup warning (#163)
At startup, the proxy queries system_auth.roles on both origin and target control connections to check if the configured user is a superuser. If so, a WARN is logged advising against this practice and explaining the increased authentication failure risk.
The check is best-effort: it is silently skipped if auth is not enabled, if the query fails, or if the platform does not support it (e.g. DataStax Astra).
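A rough sketch of the best-effort decision logic, in Go. The `roleLookup` function type is a hypothetical abstraction over a control connection, and the CQL in the comment is an assumption about how the check could be phrased, not a quote from the implementation. The point is only that every failure path results in no warning.

```go
package main

import (
	"errors"
	"log"
)

// roleLookup is a hypothetical abstraction over a control connection. A real
// implementation might run something like:
//   SELECT is_superuser FROM system_auth.roles WHERE role = ?
type roleLookup func(role string) (isSuperuser bool, err error)

// errLookupFailed simulates any query failure (auth tables missing,
// insufficient permissions, unsupported platform).
var errLookupFailed = errors.New("role lookup failed")

// shouldWarnSuperuser reports whether the startup warning should be logged.
// It is best-effort: no configured user or any lookup error means "skip".
func shouldWarnSuperuser(lookup roleLookup, role string) bool {
	if role == "" { // auth not enabled, nothing to check
		return false
	}
	isSuper, err := lookup(role)
	if err != nil { // silently skipped
		return false
	}
	return isSuper
}

func main() {
	superuser := func(string) (bool, error) { return true, nil }
	failing := func(string) (bool, error) { return false, errLookupFailed }

	if shouldWarnSuperuser(superuser, "zdm_user") {
		log.Println("WARN: configured user is a superuser; consider a non-superuser role")
	}
	if !shouldWarnSuperuser(failing, "zdm_user") {
		log.Println("lookup failed, check skipped without error")
	}
}
```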
3. Per-table successful write metrics (#164)
A new Prometheus counter proxy_write_success_total{cluster, keyspace, table} that tracks successful writes per cluster, per keyspace, and per table. The counter is incremented independently when each cluster responds, so during a target outage, origin counters keep incrementing while target counters flatline.
This makes it straightforward to identify exactly which tables have diverged and by how much, enabling targeted repairs rather than full cluster-wide repair.
To ensure correctness, a fully integrated CCM-based test covers the complete write type permutation matrix — inline, prepared, and batch statements for INSERT, UPDATE, DELETE, and counter operations — and verifies the Prometheus metrics are correctly populated for both origin and target clusters.
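To make the "origin keeps incrementing, target flatlines" behaviour concrete, here is a small Go sketch that uses an in-memory map as a stand-in for the labelled Prometheus counter. The `writeSuccessCounter` type, its methods, and the label values `"origin"` and `"target"` are assumptions for illustration; the proxy itself exposes the metric through Prometheus rather than through a type like this.

```go
package main

import (
	"fmt"
	"sync"
)

// labels mirrors the {cluster, keyspace, table} label set of the counter.
type labels struct {
	Cluster, Keyspace, Table string
}

// writeSuccessCounter is an in-memory stand-in for a labelled counter.
type writeSuccessCounter struct {
	mu     sync.Mutex
	counts map[labels]uint64
}

func newWriteSuccessCounter() *writeSuccessCounter {
	return &writeSuccessCounter{counts: make(map[labels]uint64)}
}

// Inc records one successful write. It is called once per cluster response,
// so the two clusters are counted independently.
func (c *writeSuccessCounter) Inc(cluster, keyspace, table string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[labels{cluster, keyspace, table}]++
}

// Gap returns how many more writes succeeded on the origin than on the
// target for one table, which approximates the divergence to repair.
func (c *writeSuccessCounter) Gap(keyspace, table string) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	origin := c.counts[labels{"origin", keyspace, table}]
	target := c.counts[labels{"target", keyspace, table}]
	if target > origin {
		return 0
	}
	return origin - target
}

func main() {
	c := newWriteSuccessCounter()
	// Three writes to shop.orders; the target only acknowledges the first.
	for i := 0; i < 3; i++ {
		c.Inc("origin", "shop", "orders")
		if i == 0 {
			c.Inc("target", "shop", "orders")
		}
	}
	fmt.Println("shop.orders gap:", c.Gap("shop", "orders"))
}
```

In a real deployment the same comparison would be done by querying the two label sets of the metric in Prometheus, rather than in process.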
Related PRs
Implementation note: prepared statement metadata
The per-table write metrics leverage the existing prepared statement cache to track keyspace and table names efficiently. The table name is extracted once during PREPARE (via the existing ANTLR parser) and stored alongside the prepared statement data that the proxy already caches. At EXECUTE time, the table name is a direct field lookup — no re-parsing of the query string. If Cassandra evicts a prepared statement and the client driver re-prepares it, the metadata is refreshed through the normal re-prepare flow. No changes were made to the cache lifecycle or eviction behaviour.
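The parse-once, look-up-many pattern described above might look roughly like the following Go sketch. All names here (`preparedEntry`, `preparedCache`, `OnPrepare`, `OnExecute`) are hypothetical, and `naiveTableName` is a deliberately simplistic stand-in for the ANTLR parser that handles only basic INSERT, UPDATE, and DELETE shapes.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// preparedEntry holds the cached data for one prepared statement, with the
// keyspace and table stored next to the query text.
type preparedEntry struct {
	Query, Keyspace, Table string
}

// preparedCache is a hypothetical cache keyed by prepared statement ID.
type preparedCache struct {
	mu      sync.RWMutex
	entries map[string]preparedEntry
}

func newPreparedCache() *preparedCache {
	return &preparedCache{entries: make(map[string]preparedEntry)}
}

// naiveTableName is a toy stand-in for a real CQL parser: it takes the token
// after INTO, UPDATE, or FROM and splits it on the first dot.
func naiveTableName(query string) (keyspace, table string) {
	fields := strings.Fields(query)
	for i, f := range fields {
		kw := strings.ToUpper(f)
		if (kw == "INTO" || kw == "UPDATE" || kw == "FROM") && i+1 < len(fields) {
			name := fields[i+1]
			if p := strings.Index(name, "("); p >= 0 {
				name = name[:p] // drop a column list glued to the name
			}
			if ks, tbl, ok := strings.Cut(name, "."); ok {
				return ks, tbl
			}
			return "", name
		}
	}
	return "", ""
}

// OnPrepare parses the query once and stores the result. A re-prepare of the
// same ID simply overwrites the entry, refreshing the metadata.
func (c *preparedCache) OnPrepare(id, query string, parse func(string) (string, string)) {
	ks, tbl := parse(query)
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[id] = preparedEntry{Query: query, Keyspace: ks, Table: tbl}
}

// OnExecute is a plain map lookup; the query string is not parsed again.
func (c *preparedCache) OnExecute(id string) (preparedEntry, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e, ok := c.entries[id]
	return e, ok
}

func main() {
	cache := newPreparedCache()
	cache.OnPrepare("stmt-1", "INSERT INTO shop.orders (id, total) VALUES (?, ?)", naiveTableName)
	if e, ok := cache.OnExecute("stmt-1"); ok {
		fmt.Printf("keyspace=%s table=%s\n", e.Keyspace, e.Table)
	}
}
```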