From 3605a2c43a0bb9bafcf46b97075de54342056d5c Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Mon, 13 Apr 2026 11:03:00 +0200 Subject: [PATCH 1/8] Add proposal 017: KafkaProxy threading and lifecycle contract Define the single-use lifecycle and threading contract for KafkaProxy, including idempotent startup/shutdown, the future-returning startup() API, and deprecation of block(). Assisted-by: Claude Opus 4.6 Signed-off-by: Sam Barker Signed-off-by: Sam Barker --- .../017-kafkaproxy-threading-and-lifecycle.md | 119 ++++++++++++++++++ 1 file changed, 119 insertions(+) create mode 100644 proposals/017-kafkaproxy-threading-and-lifecycle.md diff --git a/proposals/017-kafkaproxy-threading-and-lifecycle.md b/proposals/017-kafkaproxy-threading-and-lifecycle.md new file mode 100644 index 0000000..f580a94 --- /dev/null +++ b/proposals/017-kafkaproxy-threading-and-lifecycle.md @@ -0,0 +1,119 @@ +# Proposal 017: KafkaProxy Threading and Lifecycle Contract + +## Summary + +Define the threading and lifecycle contract for `KafkaProxy`: it is a single-use object with a linear lifecycle (`NEW` -> `STARTING` -> `STARTED` -> `STOPPING` -> `STOPPED`). Both `startup()` and `shutdown()` are idempotent — they may be called multiple times, but only one call performs the work. + +## Current Situation + +`KafkaProxy` has an `AtomicBoolean running` field that guards `startup()` and `shutdown()` against concurrent or repeated invocation. The field enforces a basic invariant — you cannot start twice or stop twice — but the contract is implicit: + +- There is no documentation of which threads may call which methods. +- The two-state boolean cannot distinguish "not yet started" from "stopped," so `startup()` after `shutdown()` silently succeeds in setting `running` back to `true` rather than being rejected. +- The `block()` method checks `running.get()` but does not define what happens if called before `startup()` or after `shutdown()`. More fundamentally, `block()` exists as a separate method only because `startup()` does not return a future — the caller has no other way to wait for shutdown. +- The `close()` method (from `AutoCloseable`) conditionally calls `shutdown()` based on `running.get()`, but the interaction between `close()` and explicit `shutdown()` calls is undocumented. + +The lack of a formal contract means that components within the proxy — such as the virtual cluster lifecycle state machine ([016](./016-virtual-cluster-lifecycle.md)) — cannot make informed decisions about their own concurrency models. + +## Motivation + +The proxy's lifecycle is simple by design: start once, run, stop once. The threading context is similarly simple but not obvious — `shutdown()` is called from a JVM shutdown hook, which runs on a dedicated thread. The virtual cluster lifecycle ([016](./016-virtual-cluster-lifecycle.md)) introduces per-cluster state machines that execute within this proxy-level lifecycle. Without a documented proxy-level contract, each component must independently guess at the concurrency model. + +Formalising the contract provides: + +- **Correctness**: the only genuinely invalid sequence (start-after-stop) is detected and rejected. All other repeated or concurrent calls are idempotent. +- **Guidance for implementors**: components within the proxy can choose appropriate synchronisation strategies based on documented guarantees rather than assumptions. +- **A foundation for [016](./016-virtual-cluster-lifecycle.md)**: the virtual cluster lifecycle sits inside the proxy lifecycle. Defining the outer contract first avoids circular reasoning. + +## Proposal + +### KafkaProxy is Single-Use + +A `KafkaProxy` instance follows a linear, non-repeatable lifecycle: + +``` +NEW ──> STARTING ──> STARTED ──> STOPPING ──> STOPPED + │ ▲ + └────────── (startup failure) ─────────┘ +``` + +- **NEW**: constructed, not yet started. +- **STARTING**: `startup()` is in progress. +- **STARTED**: the proxy is serving traffic. +- **STOPPING**: `shutdown()` is in progress. The caller may join the future returned by `startup()` to wait for completion. +- **STOPPED**: terminal. All resources released. + +If `startup()` fails partway through, the proxy transitions directly to `STOPPED` — partially-acquired resources are released before the exception propagates. There is no `FAILED` state at the proxy level; a failed startup is simply a proxy that went straight to `STOPPED`. This is intentional for two reasons: the proxy is not reconfigurable, so there is no recovery transition from a hypothetical `FAILED` state (unlike virtual clusters, which support `failed` -> `initializing` for retry with corrected configuration); and the failure is communicated directly to the caller via the exception from `startup()`, so there is no need to hold the error in state for later inspection. + +The lifecycle is **not restartable**. Once a proxy reaches `STOPPED`, it cannot return to any earlier state. To restart, create a new `KafkaProxy` instance. This holds across all deployment models: standalone, Kubernetes, sidecar, and embedded library. + +Dynamic configuration reload handles changes within a running proxy; there is no use case for tearing down and re-creating Netty event loops in-process. + +### `startup()` Returns a Future + +`startup()` returns a `CompletableFuture` that completes when the proxy reaches `STOPPED`. This lets the caller decide how to wait: + +```java +// Standalone — block the main thread until shutdown +proxy.startup().join(); + +// Embedded / tests — start and hold the future +var stopped = proxy.startup(); +// ... do other work ... +stopped.join(); +``` + +The proxy also holds the future internally. The existing `block()` method is deprecated and delegates to `join()` on the same future. + +### Threading Contract + +| Method | May be called from | Behaviour | +|---|---|---| +| `startup()` | Any thread | First call initialises the proxy and returns a future. Subsequent calls return the same future. Throws `IllegalStateException` if the proxy is `STOPPING` or `STOPPED`. | +| `shutdown()` | Any thread (including JVM shutdown hook) | First call initiates shutdown. Subsequent calls are no-ops. No-op if the proxy was never started. | +| `close()` | Any thread | Delegates to `shutdown()`. | + +Both `startup()` and `shutdown()` are idempotent — callers do not need to coordinate. Only one call actually does the work; additional calls are harmless. This avoids requiring defensive checks at every call site (e.g. a test calling `shutdown()` explicitly and `@AfterEach` calling it again). + +**Concurrency guarantees:** + +1. **At most one startup and one shutdown sequence will execute.** Multiple concurrent calls to `startup()` all receive the same future; only one thread performs initialisation. Multiple concurrent calls to `shutdown()` are similarly safe; only one thread performs the shutdown sequence. +2. **`shutdown()` may be called from a different thread than `startup()`.** This is the common case: `startup()` on the application thread, `shutdown()` from a JVM shutdown hook. +3. **The future returned by `startup()` is safe to join concurrently with `shutdown()`.** The future is completed at the end of the shutdown sequence, regardless of whether shutdown succeeds or fails. + +### Error Behaviour + +The only invalid sequence is attempting to start a proxy that is shutting down or has stopped: + +| Attempted operation | Current state | Result | +|---|---|---| +| `startup()` | `STOPPING` or `STOPPED` | `IllegalStateException` — the proxy is not restartable | + +All other repeated or concurrent calls are idempotent no-ops (or return the existing future). + +## Affected/Not Affected Projects + +**Affected:** +- **kroxylicious-proxy (runtime)**: `KafkaProxy` gains a formal lifecycle contract. `startup()` returns a `CompletableFuture`. `block()` is deprecated. +- **kroxylicious-proxy tests**: tests that exercise lifecycle edge cases (double-start, start-after-stop) should be added. + +**Not affected:** +- **kroxylicious-api**: the filter SPI does not interact with proxy lifecycle. +- **kroxylicious-operator**: interacts with the proxy process, not the `KafkaProxy` object directly. +- **kroxylicious-kms and plugin modules**: no changes needed. + +## Compatibility + +This proposal formalises existing behaviour rather than changing it. The proxy is already single-use in practice — no code path calls `startup()` twice or restarts after `shutdown()`. The change makes the contract explicit and enforces it. + +The one behavioural change is stricter error detection: calling `startup()` on a stopped proxy currently appears to succeed but would leave the proxy in a corrupt state. After this change, it throws `IllegalStateException`. This is a bug fix, not a compatibility break. + +## Rejected Alternatives + +### Public lifecycle state enum + +We considered making the lifecycle state enum part of the public API so that external code (operators, management endpoints) could query proxy state. However, the proxy is not an entity that external systems monitor at the object level — they monitor the process (health endpoints, readiness probes). The lifecycle state is an internal correctness mechanism, not an observability surface. If proxy-level observability is needed in the future, it should be designed as part of a proxy-level lifecycle proposal (identified as future work in [016](./016-virtual-cluster-lifecycle.md)) rather than prematurely exposing an implementation detail. + +### Restartable proxy + +We considered allowing `startup()` -> `shutdown()` -> `startup()` cycles on the same instance. This would require careful resource management (ensuring all state is fully cleaned up before re-initialisation), add complexity to the lifecycle model, and serves no practical use case. Dynamic configuration reload handles changes within a running proxy, and a full restart in the same process is adequately served by creating a new `KafkaProxy` instance. The added complexity is not justified. From 901b87534ae05aa283ec286d4f448769afc396f4 Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Mon, 13 Apr 2026 11:19:16 +0200 Subject: [PATCH 2/8] Renumber proposal from 017 to 019 017 and 018 were claimed by PR #97. Assisted-by: Claude Opus 4.6 Signed-off-by: Sam Barker Signed-off-by: Sam Barker --- ...d-lifecycle.md => 019-kafkaproxy-threading-and-lifecycle.md} | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) rename proposals/{017-kafkaproxy-threading-and-lifecycle.md => 019-kafkaproxy-threading-and-lifecycle.md} (99%) diff --git a/proposals/017-kafkaproxy-threading-and-lifecycle.md b/proposals/019-kafkaproxy-threading-and-lifecycle.md similarity index 99% rename from proposals/017-kafkaproxy-threading-and-lifecycle.md rename to proposals/019-kafkaproxy-threading-and-lifecycle.md index f580a94..6ce247e 100644 --- a/proposals/017-kafkaproxy-threading-and-lifecycle.md +++ b/proposals/019-kafkaproxy-threading-and-lifecycle.md @@ -1,4 +1,4 @@ -# Proposal 017: KafkaProxy Threading and Lifecycle Contract +# Proposal 019: KafkaProxy Threading and Lifecycle Contract ## Summary From 996402890e9d85919391ac888eff48103ba36dd7 Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Tue, 14 Apr 2026 16:22:10 +0200 Subject: [PATCH 3/8] Address review feedback: clarify API status, exit codes, and metrics - Note that KafkaProxy is not public API - Clarify sidecar/embedded are not supported deployment models today - Document exceptional completion of the startup future - Add process exit code contract for the two edges into STOPPED - Add metrics section (state gauge, startup duration) - Update rejected alternatives to reference metrics Assisted-by: Claude claude-opus-4-6 Signed-off-by: Sam Barker --- .../019-kafkaproxy-threading-and-lifecycle.md | 28 ++++++++++++++++--- 1 file changed, 24 insertions(+), 4 deletions(-) diff --git a/proposals/019-kafkaproxy-threading-and-lifecycle.md b/proposals/019-kafkaproxy-threading-and-lifecycle.md index 6ce247e..544c798 100644 --- a/proposals/019-kafkaproxy-threading-and-lifecycle.md +++ b/proposals/019-kafkaproxy-threading-and-lifecycle.md @@ -6,7 +6,7 @@ Define the threading and lifecycle contract for `KafkaProxy`: it is a single-use ## Current Situation -`KafkaProxy` has an `AtomicBoolean running` field that guards `startup()` and `shutdown()` against concurrent or repeated invocation. The field enforces a basic invariant — you cannot start twice or stop twice — but the contract is implicit: +`KafkaProxy` is an internal class — it is not part of the public API. It has an `AtomicBoolean running` field that guards `startup()` and `shutdown()` against concurrent or repeated invocation. The field enforces a basic invariant — you cannot start twice or stop twice — but the contract is implicit: - There is no documentation of which threads may call which methods. - The two-state boolean cannot distinguish "not yet started" from "stopped," so `startup()` after `shutdown()` silently succeeds in setting `running` back to `true` rather than being rejected. @@ -45,7 +45,7 @@ NEW ──> STARTING ──> STARTED ──> STOPPING ──> STOPPED If `startup()` fails partway through, the proxy transitions directly to `STOPPED` — partially-acquired resources are released before the exception propagates. There is no `FAILED` state at the proxy level; a failed startup is simply a proxy that went straight to `STOPPED`. This is intentional for two reasons: the proxy is not reconfigurable, so there is no recovery transition from a hypothetical `FAILED` state (unlike virtual clusters, which support `failed` -> `initializing` for retry with corrected configuration); and the failure is communicated directly to the caller via the exception from `startup()`, so there is no need to hold the error in state for later inspection. -The lifecycle is **not restartable**. Once a proxy reaches `STOPPED`, it cannot return to any earlier state. To restart, create a new `KafkaProxy` instance. This holds across all deployment models: standalone, Kubernetes, sidecar, and embedded library. +The lifecycle is **not restartable**. Once a proxy reaches `STOPPED`, it cannot return to any earlier state. To restart, create a new `KafkaProxy` instance. This holds across all deployment models: standalone and Kubernetes today, and sidecar or embedded library if those models are supported in the future. Dynamic configuration reload handles changes within a running proxy; there is no use case for tearing down and re-creating Netty event loops in-process. @@ -63,7 +63,7 @@ var stopped = proxy.startup(); stopped.join(); ``` -The proxy also holds the future internally. The existing `block()` method is deprecated and delegates to `join()` on the same future. +If `startup()` fails, the returned future completes exceptionally — the caller receives the failure whether they call `join()` immediately or later. The proxy also holds the future internally. The existing `block()` method is deprecated and delegates to `join()` on the same future. ### Threading Contract @@ -91,6 +91,26 @@ The only invalid sequence is attempting to start a proxy that is shutting down o All other repeated or concurrent calls are idempotent no-ops (or return the existing future). +### Process Exit Codes + +There are two edges into `STOPPED`, and they carry different meaning at the process level: + +1. **`STARTING → STOPPED`** (startup failure): `startup()` completes the future exceptionally. The caller (typically `main()`) receives the exception and should exit with a non-zero status code. +2. **`STOPPING → STOPPED`** (clean shutdown): the future completes normally. The caller exits with zero. + +`KafkaProxy` itself does not call `System.exit()` — translating the outcome into a process exit code is the responsibility of the application entry point. This keeps `KafkaProxy` usable in embedded and test contexts where process termination is not appropriate. + +### Metrics + +While the lifecycle state enum remains internal (see [Rejected Alternatives](#public-lifecycle-state-enum)), the proxy should expose metrics that reflect its lifecycle state. Standard Kubernetes metrics (`kube_pod_status_phase`, restart counts, container termination reasons) treat the process as a black box — they can tell an operator *that* a pod restarted but not *why*, and they cannot distinguish a startup failure from a crash during normal operation without heuristics. + +The proxy should expose: + +- **State gauge**: a metric indicating the current lifecycle state (`starting`, `started`, `stopping`, `stopped`). This gives dashboards and alerts a definitive signal rather than requiring inference from Kubernetes-level indicators. +- **Startup duration**: time elapsed from `STARTING` to `STARTED` (or failure). This helps operators identify configuration or environment issues that slow startup across restarts. + +These metrics are public API surface — once exposed, their names and semantics become a compatibility commitment. The specific metric names and labelling conventions should be consistent with any metrics framework adopted for virtual cluster lifecycle ([016](./016-virtual-cluster-lifecycle.md)). + ## Affected/Not Affected Projects **Affected:** @@ -112,7 +132,7 @@ The one behavioural change is stricter error detection: calling `startup()` on a ### Public lifecycle state enum -We considered making the lifecycle state enum part of the public API so that external code (operators, management endpoints) could query proxy state. However, the proxy is not an entity that external systems monitor at the object level — they monitor the process (health endpoints, readiness probes). The lifecycle state is an internal correctness mechanism, not an observability surface. If proxy-level observability is needed in the future, it should be designed as part of a proxy-level lifecycle proposal (identified as future work in [016](./016-virtual-cluster-lifecycle.md)) rather than prematurely exposing an implementation detail. +We considered making the lifecycle state enum part of the public API so that external code (operators, management endpoints) could query proxy state programmatically. The enum remains internal — external observability is better served by metrics (see [Metrics](#metrics)), which provide the same information through a standard interface without coupling consumers to an internal type. ### Restartable proxy From b000faefde59758a9eb2198f7b5693b5a1d538cc Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Wed, 15 Apr 2026 11:55:17 +0200 Subject: [PATCH 4/8] Clarify terminology per review feedback - Replace "implementors" with "Kroxylicious developers" - Reword "exception propagates" to "returning a failed future" Assisted-by: Claude claude-opus-4-6 Signed-off-by: Sam Barker --- proposals/019-kafkaproxy-threading-and-lifecycle.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/proposals/019-kafkaproxy-threading-and-lifecycle.md b/proposals/019-kafkaproxy-threading-and-lifecycle.md index 544c798..4a0ba9b 100644 --- a/proposals/019-kafkaproxy-threading-and-lifecycle.md +++ b/proposals/019-kafkaproxy-threading-and-lifecycle.md @@ -22,7 +22,7 @@ The proxy's lifecycle is simple by design: start once, run, stop once. The threa Formalising the contract provides: - **Correctness**: the only genuinely invalid sequence (start-after-stop) is detected and rejected. All other repeated or concurrent calls are idempotent. -- **Guidance for implementors**: components within the proxy can choose appropriate synchronisation strategies based on documented guarantees rather than assumptions. +- **Guidance for Kroxylicious developers**: components within the proxy can choose appropriate synchronisation strategies based on documented guarantees rather than assumptions. - **A foundation for [016](./016-virtual-cluster-lifecycle.md)**: the virtual cluster lifecycle sits inside the proxy lifecycle. Defining the outer contract first avoids circular reasoning. ## Proposal @@ -43,7 +43,7 @@ NEW ──> STARTING ──> STARTED ──> STOPPING ──> STOPPED - **STOPPING**: `shutdown()` is in progress. The caller may join the future returned by `startup()` to wait for completion. - **STOPPED**: terminal. All resources released. -If `startup()` fails partway through, the proxy transitions directly to `STOPPED` — partially-acquired resources are released before the exception propagates. There is no `FAILED` state at the proxy level; a failed startup is simply a proxy that went straight to `STOPPED`. This is intentional for two reasons: the proxy is not reconfigurable, so there is no recovery transition from a hypothetical `FAILED` state (unlike virtual clusters, which support `failed` -> `initializing` for retry with corrected configuration); and the failure is communicated directly to the caller via the exception from `startup()`, so there is no need to hold the error in state for later inspection. +If `startup()` fails partway through, the proxy transitions directly to `STOPPED` — partially-acquired resources are released before returning a failed future. There is no `FAILED` state at the proxy level; a failed startup is simply a proxy that went straight to `STOPPED`. This is intentional for two reasons: the proxy is not reconfigurable, so there is no recovery transition from a hypothetical `FAILED` state (unlike virtual clusters, which support `failed` -> `initializing` for retry with corrected configuration); and the failure is communicated directly to the caller via the exception from `startup()`, so there is no need to hold the error in state for later inspection. The lifecycle is **not restartable**. Once a proxy reaches `STOPPED`, it cannot return to any earlier state. To restart, create a new `KafkaProxy` instance. This holds across all deployment models: standalone and Kubernetes today, and sidecar or embedded library if those models are supported in the future. From f143d134dfd1c5e28375b5c3915e0e9131afee31 Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Wed, 29 Apr 2026 12:46:11 +1200 Subject: [PATCH 5/8] 019: drive the single lifecycle future model consistently - Rename 'startup() Returns a Future' section to 'The Lifecycle Future' - Lead with the key concept: one future represents the full proxy lifetime - Add three-outcome table (startup failure, clean shutdown, shutdown failure) - Document cancel() semantics: equivalent to shutdown(), flag ignored - Update concurrency guarantee #3 to reinforce the single-future model - Add shutdown failure as a third path in Process Exit Codes - Note that startup() return type changes from KafkaProxy to CompletableFuture - Remove Kubernetes-specific kube_pod_status_phase reference from Metrics - Add 'Separate futures per phase' rejected alternative - Add 'Hard shutdown via cancel(true)' rejected alternative - Fix stray typo in Motivation section Assisted-by: Claude claude-sonnet-4-6 Signed-off-by: Sam Barker --- .../019-kafkaproxy-threading-and-lifecycle.md | 47 ++++++++++++++----- 1 file changed, 34 insertions(+), 13 deletions(-) diff --git a/proposals/019-kafkaproxy-threading-and-lifecycle.md b/proposals/019-kafkaproxy-threading-and-lifecycle.md index 4a0ba9b..398fd4f 100644 --- a/proposals/019-kafkaproxy-threading-and-lifecycle.md +++ b/proposals/019-kafkaproxy-threading-and-lifecycle.md @@ -2,7 +2,7 @@ ## Summary -Define the threading and lifecycle contract for `KafkaProxy`: it is a single-use object with a linear lifecycle (`NEW` -> `STARTING` -> `STARTED` -> `STOPPING` -> `STOPPED`). Both `startup()` and `shutdown()` are idempotent — they may be called multiple times, but only one call performs the work. +Define the threading and lifecycle contract for `KafkaProxy`: it is a single-use object with a linear lifecycle (`NEW` -> `STARTING` -> `STARTED` -> `STOPPING` -> `STOPPED`). `startup()` returns a single `CompletableFuture` representing the full proxy lifetime — it completes only when the proxy reaches `STOPPED`. Both `startup()` and `shutdown()` are idempotent — they may be called multiple times, but only one call performs the work. ## Current Situation @@ -10,7 +10,7 @@ Define the threading and lifecycle contract for `KafkaProxy`: it is a single-use - There is no documentation of which threads may call which methods. - The two-state boolean cannot distinguish "not yet started" from "stopped," so `startup()` after `shutdown()` silently succeeds in setting `running` back to `true` rather than being rejected. -- The `block()` method checks `running.get()` but does not define what happens if called before `startup()` or after `shutdown()`. More fundamentally, `block()` exists as a separate method only because `startup()` does not return a future — the caller has no other way to wait for shutdown. +- The `block()` method checks `running.get()` but does not define what happens if called before `startup()` or after `shutdown()`. More fundamentally, `block()` exists as a separate method only because `startup()` returns `this` rather than a future — the caller has no other way to wait for shutdown. A `shutdown` `CompletableFuture` already exists internally and is what `block()` joins, but it is not surfaced to callers. - The `close()` method (from `AutoCloseable`) conditionally calls `shutdown()` based on `running.get()`, but the interaction between `close()` and explicit `shutdown()` calls is undocumented. The lack of a formal contract means that components within the proxy — such as the virtual cluster lifecycle state machine ([016](./016-virtual-cluster-lifecycle.md)) — cannot make informed decisions about their own concurrency models. @@ -43,27 +43,37 @@ NEW ──> STARTING ──> STARTED ──> STOPPING ──> STOPPED - **STOPPING**: `shutdown()` is in progress. The caller may join the future returned by `startup()` to wait for completion. - **STOPPED**: terminal. All resources released. -If `startup()` fails partway through, the proxy transitions directly to `STOPPED` — partially-acquired resources are released before returning a failed future. There is no `FAILED` state at the proxy level; a failed startup is simply a proxy that went straight to `STOPPED`. This is intentional for two reasons: the proxy is not reconfigurable, so there is no recovery transition from a hypothetical `FAILED` state (unlike virtual clusters, which support `failed` -> `initializing` for retry with corrected configuration); and the failure is communicated directly to the caller via the exception from `startup()`, so there is no need to hold the error in state for later inspection. +If `startup()` fails partway through, the proxy transitions directly to `STOPPED` — partially-acquired resources are released before returning a failed future. There is no `FAILED` state at the proxy level; a failed startup is simply a proxy that went straight to `STOPPED`. This is intentional for two reasons: the proxy is not reconfigurable, so there is no recovery transition from a hypothetical `FAILED` state (unlike virtual clusters, which support `failed` -> `initializing` for retry with corrected configuration); and the failure is communicated directly to the caller via the future's exceptional completion, so there is no need to hold the error in state for later inspection. The lifecycle is **not restartable**. Once a proxy reaches `STOPPED`, it cannot return to any earlier state. To restart, create a new `KafkaProxy` instance. This holds across all deployment models: standalone and Kubernetes today, and sidecar or embedded library if those models are supported in the future. -Dynamic configuration reload handles changes within a running proxy; there is no use case for tearing down and re-creating Netty event loops in-process. +Dynamic configuration reload will handle changes within a running proxy; there is no use case for tearing down and re-creating a proxy instance. -### `startup()` Returns a Future +### The Lifecycle Future -`startup()` returns a `CompletableFuture` that completes when the proxy reaches `STOPPED`. This lets the caller decide how to wait: +`startup()` returns a `CompletableFuture` that represents the proxy's **full lifetime** — it is the same future on every call, and completes only when the proxy reaches `STOPPED`. The caller always discovers the final outcome by joining the future received from `startup()`: ```java // Standalone — block the main thread until shutdown proxy.startup().join(); // Embedded / tests — start and hold the future -var stopped = proxy.startup(); +CompletableFuture stopped = proxy.startup(); // ... do other work ... stopped.join(); ``` -If `startup()` fails, the returned future completes exceptionally — the caller receives the failure whether they call `join()` immediately or later. The proxy also holds the future internally. The existing `block()` method is deprecated and delegates to `join()` on the same future. +The future has three terminal outcomes: + +| Path | Completion | +|---|---| +| Startup failure (`STARTING → STOPPED`) | Exceptionally — the startup error | +| Clean shutdown (`STOPPING → STOPPED`) | Normally | +| Shutdown failure (`STOPPING → STOPPED`) | Exceptionally — the shutdown error | + +Calling `cancel()` on the future is equivalent to calling `shutdown()` — it initiates a graceful shutdown. The `mayInterruptIfRunning` parameter is ignored; shutdown is always graceful regardless of its value. + +The existing `block()` method is deprecated and delegates to `join()` on the lifecycle future. ### Threading Contract @@ -79,7 +89,7 @@ Both `startup()` and `shutdown()` are idempotent — callers do not need to coor 1. **At most one startup and one shutdown sequence will execute.** Multiple concurrent calls to `startup()` all receive the same future; only one thread performs initialisation. Multiple concurrent calls to `shutdown()` are similarly safe; only one thread performs the shutdown sequence. 2. **`shutdown()` may be called from a different thread than `startup()`.** This is the common case: `startup()` on the application thread, `shutdown()` from a JVM shutdown hook. -3. **The future returned by `startup()` is safe to join concurrently with `shutdown()`.** The future is completed at the end of the shutdown sequence, regardless of whether shutdown succeeds or fails. +3. **The caller joins the same future for both startup and shutdown outcomes.** It is safe to join the future concurrently with `shutdown()` — `shutdown()` is what completes it, at the end of the shutdown sequence. ### Error Behaviour @@ -93,16 +103,17 @@ All other repeated or concurrent calls are idempotent no-ops (or return the exis ### Process Exit Codes -There are two edges into `STOPPED`, and they carry different meaning at the process level: +There are three paths into `STOPPED`, each reflected in the lifecycle future: -1. **`STARTING → STOPPED`** (startup failure): `startup()` completes the future exceptionally. The caller (typically `main()`) receives the exception and should exit with a non-zero status code. -2. **`STOPPING → STOPPED`** (clean shutdown): the future completes normally. The caller exits with zero. +1. **Startup failure** (`STARTING → STOPPED`): the future completes exceptionally. The caller (typically `main()`) receives the exception and should exit with a non-zero status code. +2. **Clean shutdown** (`STOPPING → STOPPED`): the future completes normally. The caller exits with zero. +3. **Shutdown failure** (`STOPPING → STOPPED`): the future completes exceptionally with the shutdown error. The caller should exit with a non-zero status code. `KafkaProxy` itself does not call `System.exit()` — translating the outcome into a process exit code is the responsibility of the application entry point. This keeps `KafkaProxy` usable in embedded and test contexts where process termination is not appropriate. ### Metrics -While the lifecycle state enum remains internal (see [Rejected Alternatives](#public-lifecycle-state-enum)), the proxy should expose metrics that reflect its lifecycle state. Standard Kubernetes metrics (`kube_pod_status_phase`, restart counts, container termination reasons) treat the process as a black box — they can tell an operator *that* a pod restarted but not *why*, and they cannot distinguish a startup failure from a crash during normal operation without heuristics. +While the lifecycle state enum remains internal (see [Rejected Alternatives](#public-lifecycle-state-enum)), the proxy should expose metrics that reflect its lifecycle state. Platform-level observability (restart counts, container termination reasons) treats the process as a black box — it can tell an operator *that* a process restarted but not *why*, and cannot distinguish a startup failure from a crash during normal operation without heuristics. The proxy should expose: @@ -126,6 +137,8 @@ These metrics are public API surface — once exposed, their names and semantics This proposal formalises existing behaviour rather than changing it. The proxy is already single-use in practice — no code path calls `startup()` twice or restarts after `shutdown()`. The change makes the contract explicit and enforces it. +The API change is that `startup()` changes its return type from `KafkaProxy` to `CompletableFuture`. Since `KafkaProxy` is an internal class, there are no external callers to migrate. + The one behavioural change is stricter error detection: calling `startup()` on a stopped proxy currently appears to succeed but would leave the proxy in a corrupt state. After this change, it throws `IllegalStateException`. This is a bug fix, not a compatibility break. ## Rejected Alternatives @@ -134,6 +147,14 @@ The one behavioural change is stricter error detection: calling `startup()` on a We considered making the lifecycle state enum part of the public API so that external code (operators, management endpoints) could query proxy state programmatically. The enum remains internal — external observability is better served by metrics (see [Metrics](#metrics)), which provide the same information through a standard interface without coupling consumers to an internal type. +### Separate futures per phase + +We considered a three-phase model: `startup()` returns a future completing at `STARTED`, `block()` keeps the main thread alive through the running phase, and `shutdown()` returns a future completing at `STOPPED`. This gives a clear signal when the proxy is ready to serve traffic, but it requires the caller to manage three separate handles. It also does not simplify the existing API — `block()` must remain because without it, the main thread returns as soon as startup completes and `shutdown()` is never called. The single lifetime future is simpler: joining it keeps the main thread alive through the running phase, and its completion mode (normal or exceptional) captures the full outcome. + +### Hard shutdown via `cancel(true)` + +`cancel(mayInterruptIfRunning)` could distinguish graceful shutdown (`false`) from hard shutdown (`true` — bypass drain, force-close connections immediately). However, hard shutdown is better expressed as a dedicated method on `KafkaProxy` itself, where the intent is explicit and the API can evolve independently of `Future` semantics. There is no current use case driving a hard shutdown capability, so this is left for a future proposal. + ### Restartable proxy We considered allowing `startup()` -> `shutdown()` -> `startup()` cycles on the same instance. This would require careful resource management (ensuring all state is fully cleaned up before re-initialisation), add complexity to the lifecycle model, and serves no practical use case. Dynamic configuration reload handles changes within a running proxy, and a full restart in the same process is adequately served by creating a new `KafkaProxy` instance. The added complexity is not justified. From a66af897dcee0f040acf7306c9509cc986f2502b Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Wed, 29 Apr 2026 12:47:47 +1200 Subject: [PATCH 6/8] Rename proposal to use PR number Signed-off-by: Sam Barker --- ...d-lifecycle.md => 098-kafkaproxy-threading-and-lifecycle.md} | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) rename proposals/{019-kafkaproxy-threading-and-lifecycle.md => 098-kafkaproxy-threading-and-lifecycle.md} (99%) diff --git a/proposals/019-kafkaproxy-threading-and-lifecycle.md b/proposals/098-kafkaproxy-threading-and-lifecycle.md similarity index 99% rename from proposals/019-kafkaproxy-threading-and-lifecycle.md rename to proposals/098-kafkaproxy-threading-and-lifecycle.md index 398fd4f..31921d1 100644 --- a/proposals/019-kafkaproxy-threading-and-lifecycle.md +++ b/proposals/098-kafkaproxy-threading-and-lifecycle.md @@ -1,4 +1,4 @@ -# Proposal 019: KafkaProxy Threading and Lifecycle Contract +# 98 - Proposal 019: KafkaProxy Threading and Lifecycle Contract ## Summary From 0c5111e1963b6f932ba4d2cd93f30442629a06c2 Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Wed, 13 May 2026 11:38:51 +1200 Subject: [PATCH 7/8] Address review feedback: clarify API status, exit codes, and metrics MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Fix proposal number in title (019 -> 98) - Fix wording: "before the startup future is failed exceptionally" - Remove block() — internal method, no external callers, join() is the replacement Signed-off-by: Sam Barker Signed-off-by: Sam Barker --- proposals/098-kafkaproxy-threading-and-lifecycle.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/proposals/098-kafkaproxy-threading-and-lifecycle.md b/proposals/098-kafkaproxy-threading-and-lifecycle.md index 31921d1..588c6a9 100644 --- a/proposals/098-kafkaproxy-threading-and-lifecycle.md +++ b/proposals/098-kafkaproxy-threading-and-lifecycle.md @@ -1,4 +1,4 @@ -# 98 - Proposal 019: KafkaProxy Threading and Lifecycle Contract +# 98 - KafkaProxy Threading and Lifecycle Contract ## Summary @@ -10,7 +10,7 @@ Define the threading and lifecycle contract for `KafkaProxy`: it is a single-use - There is no documentation of which threads may call which methods. - The two-state boolean cannot distinguish "not yet started" from "stopped," so `startup()` after `shutdown()` silently succeeds in setting `running` back to `true` rather than being rejected. -- The `block()` method checks `running.get()` but does not define what happens if called before `startup()` or after `shutdown()`. More fundamentally, `block()` exists as a separate method only because `startup()` returns `this` rather than a future — the caller has no other way to wait for shutdown. A `shutdown` `CompletableFuture` already exists internally and is what `block()` joins, but it is not surfaced to callers. +- The `block()` method checks `running.get()` but does not define what happens if called before `startup()` or after `shutdown()`. More fundamentally, `block()` exists as a separate method only because `startup()` returns `this` rather than a future — the caller has no other way to wait for shutdown. A `shutdown` `CompletableFuture` already exists internally and is what `block()` joins, but it is not surfaced to callers. Since `KafkaProxy` is an internal class, `block()` can be removed entirely once `startup()` returns the future directly. - The `close()` method (from `AutoCloseable`) conditionally calls `shutdown()` based on `running.get()`, but the interaction between `close()` and explicit `shutdown()` calls is undocumented. The lack of a formal contract means that components within the proxy — such as the virtual cluster lifecycle state machine ([016](./016-virtual-cluster-lifecycle.md)) — cannot make informed decisions about their own concurrency models. @@ -43,7 +43,7 @@ NEW ──> STARTING ──> STARTED ──> STOPPING ──> STOPPED - **STOPPING**: `shutdown()` is in progress. The caller may join the future returned by `startup()` to wait for completion. - **STOPPED**: terminal. All resources released. -If `startup()` fails partway through, the proxy transitions directly to `STOPPED` — partially-acquired resources are released before returning a failed future. There is no `FAILED` state at the proxy level; a failed startup is simply a proxy that went straight to `STOPPED`. This is intentional for two reasons: the proxy is not reconfigurable, so there is no recovery transition from a hypothetical `FAILED` state (unlike virtual clusters, which support `failed` -> `initializing` for retry with corrected configuration); and the failure is communicated directly to the caller via the future's exceptional completion, so there is no need to hold the error in state for later inspection. +If `startup()` fails partway through, the proxy transitions directly to `STOPPED` — partially-acquired resources are released before the startup future is failed exceptionally. There is no `FAILED` state at the proxy level; a failed startup is simply a proxy that went straight to `STOPPED`. This is intentional for two reasons: the proxy is not reconfigurable, so there is no recovery transition from a hypothetical `FAILED` state (unlike virtual clusters, which support `failed` -> `initializing` for retry with corrected configuration); and the failure is communicated directly to the caller via the future's exceptional completion, so there is no need to hold the error in state for later inspection. The lifecycle is **not restartable**. Once a proxy reaches `STOPPED`, it cannot return to any earlier state. To restart, create a new `KafkaProxy` instance. This holds across all deployment models: standalone and Kubernetes today, and sidecar or embedded library if those models are supported in the future. @@ -73,7 +73,7 @@ The future has three terminal outcomes: Calling `cancel()` on the future is equivalent to calling `shutdown()` — it initiates a graceful shutdown. The `mayInterruptIfRunning` parameter is ignored; shutdown is always graceful regardless of its value. -The existing `block()` method is deprecated and delegates to `join()` on the lifecycle future. +The existing `block()` method is removed — it is an internal method with no external callers, and `join()` on the lifecycle future is a direct replacement. ### Threading Contract @@ -125,7 +125,7 @@ These metrics are public API surface — once exposed, their names and semantics ## Affected/Not Affected Projects **Affected:** -- **kroxylicious-proxy (runtime)**: `KafkaProxy` gains a formal lifecycle contract. `startup()` returns a `CompletableFuture`. `block()` is deprecated. +- **kroxylicious-proxy (runtime)**: `KafkaProxy` gains a formal lifecycle contract. `startup()` returns a `CompletableFuture`. `block()` is removed. - **kroxylicious-proxy tests**: tests that exercise lifecycle edge cases (double-start, start-after-stop) should be added. **Not affected:** From 1c0e89bde326d5b9853b81cdd50cb398c2b1b1e7 Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Fri, 15 May 2026 10:41:12 +1200 Subject: [PATCH 8/8] Update proposals/098-kafkaproxy-threading-and-lifecycle.md Co-authored-by: Tom Bentley Signed-off-by: Sam Barker --- proposals/098-kafkaproxy-threading-and-lifecycle.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/proposals/098-kafkaproxy-threading-and-lifecycle.md b/proposals/098-kafkaproxy-threading-and-lifecycle.md index 588c6a9..65194b8 100644 --- a/proposals/098-kafkaproxy-threading-and-lifecycle.md +++ b/proposals/098-kafkaproxy-threading-and-lifecycle.md @@ -2,7 +2,7 @@ ## Summary -Define the threading and lifecycle contract for `KafkaProxy`: it is a single-use object with a linear lifecycle (`NEW` -> `STARTING` -> `STARTED` -> `STOPPING` -> `STOPPED`). `startup()` returns a single `CompletableFuture` representing the full proxy lifetime — it completes only when the proxy reaches `STOPPED`. Both `startup()` and `shutdown()` are idempotent — they may be called multiple times, but only one call performs the work. +Define the threading and lifecycle contract for `KafkaProxy`: it is a single-use object with a linear lifecycle (`NEW` -> `STARTING` -> `STARTED` -> `STOPPING` -> `STOPPED`). `startup()` will always return the same `CompletableFuture` representing the full proxy lifetime — it completes only when the proxy reaches `STOPPED`. Both `startup()` and `shutdown()` are idempotent — they may be called multiple times, but only one call performs the work. ## Current Situation