OCPEDGE-2280: mutable topology #2008
Conversation
@jeff-roche: This pull request references OCPEDGE-2280 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target either version "5.0." or "openshift-5.0.", but it targets "openshift-4.22" instead.
Introduce the Mutable Topology enhancement, which replaces the previous Adaptable Topology proposal. Instead of a new topology enum that all operators must interpret, this approach uses a dedicated operator (OTTO) to orchestrate transitions between existing fixed topology modes. Initial scope: SNO to HA compact on platform: none. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
brandisher left a comment
I'm missing a "why" statement covering why a day 2, out-of-payload operator is the right choice for this. The CVO section towards the bottom hints at the why a bit but more explicit detail is needed.
With that in mind, I haven't reviewed the EP fully because I don't understand why this is the approach we're taking. The assessment of CVO seems very light and not enough to exclude that as a potential option to meet the goals.
| * Provide a topology transition operator (OTTO) that owns the transition DAG and orchestrates transitions safely
| * Provide an `oc adm transition topology` CLI command for interactive transition management
Suggesting that these be dropped because they're more implementation details than goals. We know that the previous EP isn't the path forward but it seems too soon to determine that a new operator is the goal.
I'm going to push back on dropping these: they are the driver and target direction of this new proposal, and until/unless a different approach is determined, they are the current path and are goals of this proposal because they identify where the functionality should live. Going into a lot of detail around the operator internals is out of scope, but a high-level "this is how we think we can do this" is appropriate to give reviewers an idea of the approach we will take.
This isn't a blocker so they can of course stay. For context, the way I view this is the previous proposal was "Let's introduce a new topology" and now the proposal is "Let's make topology mutable". These implementation details are ideas for how to achieve the goal but not goals in and of themselves.
| - The CLI would need direct access to operator internals, violating separation of concerns
| - Error recovery and retry logic is better suited to an operator's reconciliation loop
| ### Controller in CVO
Is CVO the only option in the core operators where this might make sense?
I expanded to include some other operators, none of which fit the bill in my opinion. This is an entirely new process, and shoehorning it into another operator that wasn't designed for tackling this type of procedure seems irresponsible to me.
Which operator handles adding nodes to clusters?
| An alternative is to add transition controllers to the cluster-version-operator (CVO).
| **Why it was rejected**:
| - Overloads CVO with responsibilities unrelated to version management
CVO is the Cluster Version Operator. Why is this unrelated to version management?
CVO is related to version management; mutable topology is not. You don't mutate across versions. The version remains the same after transitioning the topology.
Reading this with fresh eyes, I realized I misread the statement. For whatever reason, GitHub isn't allowing me to resolve this thread, so ignore this one going forward 😄
| **Why it was rejected**:
| - Overloads CVO with responsibilities unrelated to version management
| - As the transition DAG grows, it would significantly increase CVO complexity
This seems to assume unbounded growth of the DAG. Given that we're talking about topologies, isn't this set pretty fixed, so it wouldn't actually increase complexity much?
Each leg of the DAG could have very different requirements and preconditions. While our current set (SNO -> HA) is a relatively straightforward graph, things can get complex quickly (e.g. the configuration for two-node with fencing complicates the logic considerably).
- Soften wording: "breaking" → "removing" the immutability assumption
- Broaden Cluster Administrator definition to include non-human entities
- Quote "mutable topology" as a defined term in summary
- Replace "solutions architect" persona with "cluster administrator"
- Simplify first goal bullet
- Add upfront justification for why a dedicated operator is the right vehicle for topology transitions
- Rewrite CLI-only alternative to acknowledge bounded topology set while emphasizing persistent state and recovery needs
- Reframe CVO alternative around stability risk and lifecycle mismatch instead of "unrelated to version management"
- Add CEO and MCO as evaluated alternative host operators

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of these files; approvers can indicate their approval by writing `/approve` in a comment. The full list of commands accepted by this bot can be found here.
@brandisher I've added a new paragraph under the …
…s not guaranteed in the future

Signed-off-by: Jeff Roche <jeroche@redhat.com>
@jeff-roche: all tests passed! Full PR test history. Your PR dashboard.
| Future support for HyperShift is not planned for this enhancement but is not ruled out.
| #### Standalone Clusters
What is a standalone cluster?
Traditionally this means general HA clusters, provisioned via IPI or UPI.
JoelSpeed left a comment
🤖 Generated with Claude Code
There are significant portions of this proposal that assume behaviour of OpenShift that either doesn't exist or doesn't work in the way proposed. I'm assuming here that this is a hallucination from Claude?
The EP as it stands today doesn't actually make sense for implementation. It also doesn't align with what I thought we had agreed on the architecture call.
Has anyone tried to manually take a cluster, scale it up, and transition from a single replica to multiple replicas? IMO this is the most important next step for this project.
What I thought we had agreed:
- To scale from SNO to HA, the user must create two new control plane nodes and join them to the cluster
- On HighlyAvailable topology, KAS, KCM, etcd, etc. all get scheduled automatically as static pods on these nodes. I don't see anything that prevents this on a SNO cluster today; this needs to be checked (it probably should be prevented)
- MCO still serves Ignition for control plane nodes on SNO, so the user needs to create the control plane nodes somehow to ignite from here
- New fields are added to the infrastructure spec to allow the user to say "I intend for this cluster to be HA going forward" (see the sketch after this list)
- A controller is added to cluster config operator
- This checks that the precondition of having additional control plane nodes in the cluster is met
- Once the precondition is met, it updates the status to reflect spec
- Operators now react to the change in status and transition from single to HA
- etcd operator promotes learners to full members, quorum goes from 1->3 (I don't know if this guard is in place today, we should add it if not)
- KAS/KCM - no change, it already scheduled new KAS/KCM pods
- Others - those that previously deployed a single replica of their operand now move to 2 replicas; other changes might be needed on a per-operator basis. I was expecting those details in the EP but don't see them yet
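To make the intent/status split in that flow concrete, here is a minimal sketch of what the Infrastructure object could look like mid-transition. The `desiredControlPlaneTopology` field and its placement are assumptions for illustration only; no such field exists in the API today.

```yaml
# Hypothetical sketch only: desiredControlPlaneTopology is NOT an existing
# field on config.openshift.io/v1 Infrastructure; the name is illustrative.
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
spec:
  # The user declares intent: "I intend for this cluster to be HA going forward"
  desiredControlPlaneTopology: HighlyAvailable   # assumed new field
status:
  # Stays SingleReplica until the cluster-config-operator controller verifies
  # the precondition (additional control plane nodes joined), then flips to
  # match spec; operators react to the status change, not the spec.
  controlPlaneTopology: SingleReplica
  infrastructureTopology: SingleReplica
```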
| The approach has two components:
| 1. **OpenShift Topology Transition Operator (OTTO)** — An operator that ships with the payload but is not installed by default. OTTO owns the transition graph, validates preconditions, orchestrates the transition sequence, and updates the Infrastructure config as the final step.
By which mechanism are you planning to install this only when needed? Once installed, what manages and updates this going forward?
I think the context that is missing here is that Jeff and I were discussing ways to make sure we aren't adding idling infrastructure on the control-plane nodes to handle transitions until they're ready.
The mechanism to install when needed is the same thing we do with NTO: the CLI tool would create a manifest to deploy a replica of the controller used to run the pre-flights and transitions.
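As a rough illustration of that NTO-style mechanism, the CLI could apply a manifest along these lines before a transition and scale it back down afterwards; the namespace, names, and image below are hypothetical placeholders, not a settled design:

```yaml
# Illustrative only: namespace, labels, and image are hypothetical placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: topology-transition-controller
  namespace: openshift-topology-transition   # hypothetical namespace
spec:
  replicas: 1   # 0 when idle; raised by the CLI only for the transition window
  selector:
    matchLabels:
      app: topology-transition-controller
  template:
    metadata:
      labels:
        app: topology-transition-controller
    spec:
      containers:
      - name: controller
        image: example.io/topology-transition-controller:latest   # placeholder
```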
| 2. The cluster administrator runs `oc adm transition topology` to begin the interactive transition flow
| 3. The CLI checks whether OTTO is installed; if not, it installs/activates it
| 4. The CLI guides the administrator through providing required configuration:
|    - Node identities and access details for the new control-plane nodes
They aren't already Nodes at this point then?
I think this is just poor wording. At this stage, the operator is looking to the node API to see if they exist. The user is assumed to have taken steps prior to step 4 to ensure that the nodes are present.
If the nodes are not present yet, OTTO fails the pre-flight checks.
| 4. The CLI guides the administrator through providing required configuration:
|    - Node identities and access details for the new control-plane nodes
|    - Any certificates or secrets required for the transition
|    - Load balancing configuration (on `platform: none`, the user manages their own VIPs/DNS)
So for the initial implementation, this doesn't actually do anything? On AWS/Azure/GCP, is there manual action required here? Typically when the node is created it is added to the load balancer by whatever creates it, or by the CCM depending on platform, so I wouldn't expect the transition operator to need to know anything about this external infrastructure.
Which platforms will need to update their config?
The reason we mention platform is that we only plan to support 1->3 on platform none, and possibly platform baremetal, for the first pass.
To add support for other platforms, the corresponding platform controllers will need to handle stitching up the networks on those nodes. This is why we're looking at baremetal networking and how we can update it to handle scale-up, since it's the only semi-automated platform that is available for all edge topologies (except SNO atm).
It seems like this step is just part of the node joining, which should have happened prior; and this line should be (re)moved.
That node joining could be UPI, which is the proposal here. Could it potentially be MAO in the future? I know we are trying to keep the scope to UPI joining, but if possible I would like to confirm the design does not prevent future MAO use.
| 3. The CLI checks whether OTTO is installed; if not, it installs/activates it
| 4. The CLI guides the administrator through providing required configuration:
|    - Node identities and access details for the new control-plane nodes
|    - Any certificates or secrets required for the transition
What might these secrets/certificates be used for?
I am not aware of any certificates, unless Jeff is talking about a pre-flight check for the certificate approvals for the new nodes - e.g. node bootstrapper, client, etc. But those will get sorted by the node controller. If the nodes are ready, we're good to fly.
I misunderstood part of the intent of the CLI. I thought we were going to need to step the user through configuring the nodes; it seems that is not the case, and we are assuming the new nodes are pre-configured and we just need to validate that.
| 7. OTTO signals that a transition is in progress (status on the transition CR)
| 8. OTTO triggers setup changes on dependent operators:
|    - cluster-etcd-operator (CEO) scales etcd members sequentially following the bootstrap pattern (1→2→3)
Does etcd operator need to be told to do this today? How do you tell it to scale up?
This is trying to capture the context from a design discussion we had on the previous EP. Essentially, there was a discussion about whether we needed the 1->3 transition to be atomic or whether we should proceed with sequential, single-learner promotion for etcd scale-up (which is done automatically during bootstrap). The outcome of that discussion was that we'd use the latter, because the atomicity would be a multi-year development effort in the upstream with a low probability of success, and would gain you minimal protection over what is already available using quorum-restore.sh today.
To answer your question, CEO handles this automatically, no changes here.
| - Installed either manually or via the `oc adm transition topology` command
| - Owns the transition graph — the directed graph defining which topology transitions are supported
| - Owns the validation criteria for each transition (required nodes, certificates, secrets, operator states)
| - Orchestrates transitions by interacting with cluster operators via their existing APIs
I don't think this actually exists
This statement is just objectively incorrect. :( I can see why you'd be confused reading this.
This never happens. What we can do is look to update things like ingress and console to be more adaptable, like etcd/api-server, such that they update their replicas when more infrastructure nodes become available, and do firmer pre-flight checks so that the "transition" piece becomes a no-op. But I think it's OK for some operators to continue to treat the topology field as the source of truth for desired behavior.
An alternative would be to key off a "desiredTopology" field on the Infrastructure topology and update the hooks for ingress to update its replicas when it detects a change to that field. Then the pre-flight checks actually verify that it has the right number of replicas, and we update the topology after it has already succeeded. I guess it depends on whether we're treating the topology field or the desired-topology field as the answer to "what should the operator be doing right now".
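Under that alternative, the ingress side of the reaction might be as simple as the ingress operator raising replicas on the default IngressController once the hypothetical desiredTopology field changes, with the pre-flight check then verifying the observed count:

```yaml
# Sketch of the expected end state, assuming a hypothetical desiredTopology
# trigger; IngressController and spec.replicas are existing API surface.
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  replicas: 2   # raised from 1 once desiredTopology indicates HighlyAvailable
```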
| #### Risk: Transition Fails Partway Through
| **Risk**: A transition may fail after some operators have begun reconfiguring but before the transition completes, leaving the cluster in an intermediate state.
Can you provide examples of failures?
@jeff-roche one specific example:
- etcd scales up to 2 nodes
- network partition between them
- both have lost quorum
- no API
| #### Risk: Platform Bare Metal May Not Support Single-Node Clusters
| **Risk**: If keepalived networking cannot be enabled, `platform: baremetal` will be limited to 2+ nodes, reducing the value of mutable topology for this platform.
limited to 2+? Isn't that the success criteria?
baremetal platform doesn't support SNO because having a load balancer for 1 node doesn't make sense.
In order for users to get the benefit of not having to manually deploy a load balancer (i.e. what they primarily save in terms of effort when deploying on platform: baremetal), we need to investigate whether we can allow baremetal as a platform for SNO first (with load balancing disabled), and change that operator so that load balancing can be introduced post-transition.
Otherwise we need to introduce a new, scarier feature: platform transitions.
That's a Pandora's box I don't want to look at.
| - Error recovery and retry logic is better suited to an operator's reconciliation loop than imperative CLI code
| - The CLI would need direct access to operator internals, violating separation of concerns
| ### Extending an Existing Core Operator
Or the cluster-config-operator, which would make a very natural home for this as long as we have a commitment of ownership from the folks writing the new controller.
At risk of going on a tangent, currently the installer has problems calculating topology when laying down manifests. We have a bug for this in the backlog, and I left #1905 (comment) on the previous enhancement.
I would like to see that calculation moved to the cluster config operator in bootkube during bootstrapping. That solution could co-exist with this one (and my team will push it forward as priorities allow); but it could potentially also tie into this solution.
I'm fine with us using CCO for this. I will take the blame for miscommunicating this to Jeff - it didn't strike me as obvious that a controller for this transition would belong there. My instinct was that new code in the core operators is expensive, especially for a controller that doesn't need to be running 99% of the time. That said, I think it's fine for this to be a controller that is installed with zero replicas and whose replicas are scaled up during transition events. That fits the main sentiment of what Jeff and I were trying to solve: minimizing the tax on clusters that will never use this feature (i.e. the vast majority of them).
IMHO the solution for the installer is to get the user to specify their intent in the install-config (this should be passed through to the cluster). This enhancement is a good opportunity to define what that input should look like.
patrickdillon left a comment
I know the scope is limited to baremetal/platform:none, but I know there is interest in mutable topologies in cloud platforms as well, so as much as is appropriate I would like to ensure the design leaves a path forward for those cloud platforms.
Also, like the other enhancement, I don't see any mention of `mastersSchedulable`, which affects the calculation of `infrastructureTopology`. How is the `mastersSchedulable` field handled/taken into account in this solution?
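For reference, `mastersSchedulable` lives on the cluster-scoped `Scheduler` config; any logic that recalculates `infrastructureTopology` during a transition would presumably need to read it:

```yaml
# Existing API surface: the cluster-scoped Scheduler config.
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
spec:
  # When true, control plane nodes accept regular workloads, which factors
  # into how infrastructureTopology is derived for the cluster.
  mastersSchedulable: true
```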
zaneb left a comment
This one looks directionally correct 👍
| ##### Pre-Transition
| 1. The cluster administrator prepares the additional control-plane nodes (hardware, network, OS)
Does 'OS' here imply that the user joins the hosts to the cluster as control plane nodes at this stage? If not, at what stage is that expected to happen?
| OTTO maintains a directed graph of supported transitions. For the initial implementation:
| SingleReplica (SNO, platform: none) → HighlyAvailable (3-node compact)
I think it's a mistake to define the supported topologies in terms of the controlPlaneTopology field. There are at least 6 use cases I can think of that users have articulated:
- single-node (1 schedulable control plane, 0+ workers, no load balancer)
- compact (3 schedulable control plane, 0+ workers)
- standby (3 non-schedulable control plane, 0 workers)
- HA (3 non-schedulable control plane, 2+ workers)
- TNA (2 non-schedulable control plane, 1 arbiter, 2+ workers)
- TNF (2 schedulable control plane w/ STONITH, 0 workers)
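For illustration, a transition graph keyed to use cases like these (rather than to controlPlaneTopology values) could be expressed roughly as below; the names, edges, and preconditions are hypothetical, not a proposal:

```yaml
# Hypothetical sketch: topology names, edges, and preconditions are
# illustrative only, not a committed transition graph.
topologies: [single-node, compact, standby, ha, tna, tnf]
transitions:
  - from: single-node
    to: compact
    preconditions:
      - two additional control plane nodes joined and Ready
      - etcd members promoted sequentially (1 -> 2 -> 3)
  - from: compact
    to: ha
    preconditions:
      - 2+ worker nodes joined
      - control plane nodes marked non-schedulable
```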
| The initial implementation targets `platform: none` clusters. On `platform: none`, the administrator is responsible for managing their own load balancing configuration (VIPs, DNS) when scaling beyond a single node.
| `platform: baremetal` support is planned for a subsequent phase. Bare metal networking uses keepalived for ingress load balancing, which is not useful and creates a point of failure for SNO deployments. The Bare Metal Networking team will be consulted to determine if this networking setup can be enabled for single-node clusters transitioning to HA.
I find it weird that we are going to add single-node support to platform:baremetal just so that we can say we are not preventing it from later transitioning to HA.
Who is asking for this?
I would prefer that any effort from the on-prem networking team were instead directed toward adding optional on-prem networking to platform:external.
| #### Risk: Platform Bare Metal May Not Support Single-Node Clusters
| **Risk**: If keepalived networking cannot be enabled, `platform: baremetal` will be limited to 2+ nodes, reducing the value of mutable topology for this platform.
This is out of date given the current scope, right?
| `platform: baremetal` support is planned for a subsequent phase. Bare metal networking uses keepalived for ingress load balancing, which is not useful and creates a point of failure for SNO deployments. The Bare Metal Networking team will be consulted to determine if this networking setup can be enabled for single-node clusters transitioning to HA.
| ### Risks and Mitigations
I would say one is the inability to validate external requirements, such as correct load balancer setup.
| 10. OTTO updates the Infrastructure status fields:
|     - `controlPlaneTopology` transitions from `SingleReplica` to `HighlyAvailable`
|     - `infrastructureTopology` transitions from `SingleReplica` to `HighlyAvailable`
| 11. Operators reconcile against the new topology values and adjust their deployment strategies, replica counts, and placement policies
Are we going to try to e.g. restart OLM operators (which previously have treated the topology as fixed)?
Do you have a view into how many/which OLM operators are reading this value? Are they reading it at startup, or watching the resource? The expected pattern would be that the operator sees the change and then reacts by updating the operand (e.g. scaling from 1 to 2 replicas now that it's been told the cluster is HA).
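Concretely, the change a watching operator would observe is the status flip from step 10 on the existing cluster Infrastructure singleton:

```yaml
# Post-transition status on the Infrastructure singleton
# (spec omitted; both fields were SingleReplica before step 10).
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  controlPlaneTopology: HighlyAvailable
  infrastructureTopology: HighlyAvailable
```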
| | cluster-etcd-operator | Coordinate with OTTO for sequential etcd scaling during transitions |
| | Ingress, networking, monitoring operators | Respond to OTTO coordination signals during transitions; reconcile on Infrastructure config changes |
| #### Platform Support Constraints
We need to mention that IBI clusters cannot be converted from SNO (and have some mechanism for preventing that).
What's the technical blocker there?
| ## Operational Aspects of API Extensions
| OTTO introduces a `TopologyTransition` CRD. This CRD:
How about just a Topology CRD? We should store the intent of the user, not just when the cluster is in transition.
Infrastructure spec should hold this rather than a separate CRD, since we already have topology in the Infrastructure status today.
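For context in this discussion, a minimal sketch of what a `TopologyTransition` CR might carry; the EP does not pin down a schema, so the group, version, and every field below are assumptions:

```yaml
# Hypothetical sketch: no schema is defined in the EP; all names below
# (group, version, fields, condition types) are illustrative assumptions.
apiVersion: topology.openshift.io/v1alpha1
kind: TopologyTransition
metadata:
  name: sno-to-ha
spec:
  fromTopology: SingleReplica
  toTopology: HighlyAvailable
status:
  phase: InProgress          # the "transition in progress" signal from step 7
  conditions:
  - type: PreflightChecksPassed
    status: "True"
```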