Improve resilience and observability during migration with target cluster instability #165

## Problem

During a migration using the ZDM proxy, instability on the target cluster (node outages, compaction pressure, streaming) can have serious consequences:

1. **Write failures cascade to the client.** The proxy waits for both clusters to respond before returning a result. If the target is slow or unresponsive, the proxy either returns the target's error or times out — even though the origin write succeeded. This can cause a full application outage driven by a cluster that isn't even the source of truth yet.

2. **No visibility into data divergence.** When target writes fail, there is currently no way to determine which keyspaces or tables were affected. Operators have no per-table breakdown to understand the scope of the problem or plan targeted repairs.

3. **Unnecessary consistency requirements on the target.** During dual-write migration, the target is being populated with data that can be repaired later. Using the same strong consistency level (e.g. `LOCAL_QUORUM`) on the target as on the origin means that target-side node failures can prevent writes that would otherwise succeed — even though temporary under-replication on the target is acceptable during migration.

4. **Superuser authentication increases failure risk.** Superuser authentication in Cassandra requires `QUORUM` consistency internally, which increases the risk of authentication failures during node instability. There is currently no warning if the proxy is configured with a superuser account.

## Solution

Three features address these problems:

### 1. Target consistency level override (#163)

A new optional config property, `ZDM_TARGET_CONSISTENCY_LEVEL`, overrides the consistency level for all requests forwarded to the target cluster. The origin cluster always receives the original client-requested consistency level, preserving the consistency contract on the source of truth.

For example, setting `ZDM_TARGET_CONSISTENCY_LEVEL=LOCAL_ONE` means the target accepts writes as long as a single local replica is available — significantly reducing the blast radius of target-side instability. Once migration is complete, a repair ensures all replicas are consistent.
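The override rule described above can be sketched in a few lines. This is an illustrative Python model, not the proxy's actual code: the function names `parse_override` and `effective_consistency`, the `"origin"`/`"target"` cluster strings, and the set of accepted values are assumptions made for the example.

```python
from typing import Optional

# Consistency level names from the CQL native protocol. Which of these the
# real property accepts is an assumption of this sketch.
VALID_LEVELS = {
    "ANY", "ONE", "TWO", "THREE", "QUORUM", "ALL",
    "LOCAL_QUORUM", "EACH_QUORUM", "SERIAL", "LOCAL_SERIAL", "LOCAL_ONE",
}


def parse_override(raw: Optional[str]) -> Optional[str]:
    """Return the normalised override, or None when the property is unset."""
    if raw is None or raw.strip() == "":
        return None
    level = raw.strip().upper()
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown consistency level: {raw!r}")
    return level


def effective_consistency(cluster: str, requested: str,
                          override: Optional[str]) -> str:
    """Pick the consistency level to send to one cluster.

    The origin always keeps the client-requested level. The target uses the
    override when one is configured, and the requested level otherwise.
    """
    if cluster == "target" and override is not None:
        return override
    return requested
```

With `ZDM_TARGET_CONSISTENCY_LEVEL=LOCAL_ONE`, a client write at `LOCAL_QUORUM` would be forwarded as `LOCAL_QUORUM` to the origin and as `LOCAL_ONE` to the target. Leaving the property unset keeps both clusters on the client-requested level.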

To ensure correctness, a fully integrated CCM-based test sends traced writes (inline, prepared, and batch) through the proxy and asserts via `system_traces.sessions` on each cluster that the origin receives the original client-requested consistency level while the target receives the overridden value.

### 2. Superuser startup warning (#163)

At startup, the proxy queries `system_auth.roles` on both origin and target control connections to check if the configured user is a superuser. If so, a WARN is logged advising against this practice and explaining the increased authentication failure risk.

The check is best-effort: silently skipped if auth is not enabled, the query fails, or the platform does not support it (e.g. DataStax Astra).
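A minimal sketch of the best-effort check, assuming a hypothetical `run_query` callable that stands in for the control connection and returns rows as dictionaries. The helper names and the log message are illustrative; only the `system_auth.roles` table and its `is_superuser` column come from Cassandra itself.

```python
import logging
from typing import Callable, Optional, Sequence

log = logging.getLogger("superuser_check")

# Point lookup by role name; `?` is the CQL bind marker.
ROLE_QUERY = "SELECT is_superuser FROM system_auth.roles WHERE role = ?"

# Hypothetical signature: (query, params) -> sequence of row dicts.
QueryFn = Callable[[str, tuple], Sequence[dict]]


def is_superuser(run_query: QueryFn, username: str) -> Optional[bool]:
    """Return True/False when the role is found, None when the check is skipped."""
    try:
        rows = run_query(ROLE_QUERY, (username,))
    except Exception:
        # Auth disabled, table not readable, or platform without this table:
        # skip silently rather than fail startup.
        return None
    if not rows:
        return None
    return bool(rows[0].get("is_superuser"))


def warn_if_superuser(run_query: QueryFn, username: str, cluster: str) -> bool:
    """Log a warning and return True only when the role is a confirmed superuser."""
    if is_superuser(run_query, username) is True:
        log.warning(
            "user %r on %s cluster is a superuser; a dedicated non-superuser "
            "role reduces the risk of authentication failures",
            username, cluster,
        )
        return True
    return False
```

Returning `None` for every skipped case keeps "not a superuser" distinct from "could not tell", so the warning fires only on a positive result.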

### 3. Per-table successful write metrics (#164)

A new Prometheus counter, `proxy_write_success_total{cluster, keyspace, table}`, tracks successful writes per cluster, keyspace, and table. The counter is incremented independently when each cluster responds, so during a target outage, origin counters keep incrementing while target counters flatline.

This makes it straightforward to identify exactly which tables have diverged and by how much, enabling targeted repairs rather than a full cluster-wide repair.
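As a rough illustration of how an operator might use the counter, the sketch below computes the per-table gap between origin and target from already-scraped samples. The function name, the sample shape, and the label values `origin` and `target` are assumptions; only the label names come from the counter definition above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One scraped sample: (labels, counter value).
Sample = Tuple[Dict[str, str], float]


def write_gap_by_table(samples: Iterable[Sample]) -> Dict[Tuple[str, str], float]:
    """Return origin-minus-target successful writes per (keyspace, table).

    Tables with no gap are omitted. Samples from several proxy instances
    are summed per cluster before the difference is taken.
    """
    origin: Dict[Tuple[str, str], float] = defaultdict(float)
    target: Dict[Tuple[str, str], float] = defaultdict(float)
    for labels, value in samples:
        key = (labels["keyspace"], labels["table"])
        if labels["cluster"] == "origin":
            origin[key] += value
        elif labels["cluster"] == "target":
            target[key] += value

    gaps: Dict[Tuple[str, str], float] = {}
    for key in origin.keys() | target.keys():
        gap = origin.get(key, 0.0) - target.get(key, 0.0)
        if gap != 0:
            gaps[key] = gap
    return gaps
```

Prometheus counters reset when the exporting process restarts, so a raw difference like this is only meaningful for samples taken within one proxy process lifetime. Over longer windows, a rate- or increase-based query in Prometheus is the safer basis for the same comparison.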

To ensure correctness, a fully integrated CCM-based test covers the complete write type permutation matrix — inline, prepared, and batch statements for INSERT, UPDATE, DELETE, and counter operations — and verifies the Prometheus metrics are correctly populated for both origin and target clusters.

## Related PRs

- #163 — Target consistency level override + superuser startup warning
- #164 — Per-table successful write Prometheus metrics

## Implementation note: prepared statement metadata

The per-table write metrics leverage the existing prepared statement cache to track keyspace and table names efficiently. The table name is extracted once during PREPARE (via the existing ANTLR parser) and stored alongside the prepared statement data that the proxy already caches. At EXECUTE time, the table name is a direct field lookup — no re-parsing of the query string. If Cassandra evicts a prepared statement and the client driver re-prepares it, the metadata is refreshed through the normal re-prepare flow. No changes were made to the cache lifecycle or eviction behaviour.
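The parse-once, look-up-later flow can be modelled as below. This is a simplified sketch, not the proxy's implementation: a regular expression stands in for the ANTLR parser, quoted identifiers are not handled, and the `PreparedEntry` and `PreparedCache` names and the `session_keyspace` fallback are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Simplified stand-in for the real parser: finds the optional keyspace and
# the table name of an INSERT, UPDATE, or DELETE statement.
_WRITE_TARGET = re.compile(
    r"^\s*(?:INSERT\s+INTO|UPDATE|DELETE\b.*?\bFROM)\s+(?:(\w+)\.)?(\w+)",
    re.IGNORECASE | re.DOTALL,
)


def extract_write_target(query: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (keyspace, table), with None for parts that cannot be found."""
    match = _WRITE_TARGET.match(query)
    if match is None:
        return None, None
    return match.group(1), match.group(2)


@dataclass(frozen=True)
class PreparedEntry:
    query: str
    keyspace: Optional[str]
    table: Optional[str]


class PreparedCache:
    def __init__(self) -> None:
        self._entries: Dict[bytes, PreparedEntry] = {}
        self.parse_count = 0  # how many times a query string was parsed

    def on_prepare(self, prepared_id: bytes, query: str,
                   session_keyspace: Optional[str] = None) -> PreparedEntry:
        """Parse once at PREPARE; a re-prepare overwrites the stored entry."""
        self.parse_count += 1
        keyspace, table = extract_write_target(query)
        entry = PreparedEntry(query, keyspace or session_keyspace, table)
        self._entries[prepared_id] = entry
        return entry

    def on_execute(self, prepared_id: bytes) -> Optional[PreparedEntry]:
        """Plain dictionary lookup at EXECUTE; the query is never re-parsed."""
        return self._entries.get(prepared_id)
```

Because `on_prepare` simply overwrites the entry for a given ID, a re-prepare after eviction refreshes the metadata without any extra lifecycle handling, which mirrors the behaviour described above.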
