OCPEDGE-2280: mutable topology #2008
Conversation
@jeff-roche: This pull request references OCPEDGE-2280 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target either version "5.0." or "openshift-5.0.", but it targets "openshift-4.22" instead.
Introduce the Mutable Topology enhancement, which replaces the previous Adaptable Topology proposal. Instead of a new topology enum that all operators must interpret, this approach uses a dedicated operator (OTTO) to orchestrate transitions between existing fixed topology modes. Initial scope: SNO to HA compact on platform: none. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
brandisher left a comment
I'm missing a "why" statement covering why a day 2, out-of-payload operator is the right choice for this. The CVO section towards the bottom hints at the why a bit but more explicit detail is needed.
With that in mind, I haven't reviewed the EP fully because I don't understand why this is the approach we're taking. The assessment of CVO seems very light and not enough to exclude that as a potential option to meet the goals.
| * Provide a topology transition operator (OTTO) that owns the transition DAG and orchestrates transitions safely
| * Provide an `oc adm transition topology` CLI command for interactive transition management
Suggesting that these be dropped because they're more implementation details than goals. We know that the previous EP isn't the path forward but it seems too soon to determine that a new operator is the goal.
I'm going to push back on dropping these: they are the driver and target direction of this new proposal, and until/unless a different approach is determined, they are the current path and are goals of this proposal because they identify where the functionality should live. Going into a lot of detail around the operator internals is out of scope, but a high-level "this is how we think we can do this" is appropriate to give reviewers an idea of the approach we will take.
This isn't a blocker so they can of course stay. For context, the way I view this is the previous proposal was "Let's introduce a new topology" and now the proposal is "Let's make topology mutable". These implementation details are ideas for how to achieve the goal but not goals in and of themselves.
| - The CLI would need direct access to operator internals, violating separation of concerns
| - Error recovery and retry logic is better suited to an operator's reconciliation loop
| ### Controller in CVO
Is CVO the only option in the core operators where this might make sense?
I expanded to include some other operators, none of which fit the bill in my opinion. This is an entirely new process, and shoehorning it into another operator that wasn't designed for tackling this type of procedure seems irresponsible to me.
Which operator handles adding nodes to clusters?
| An alternative is to add transition controllers to the cluster-version-operator (CVO).
| **Why it was rejected**:
| - Overloads CVO with responsibilities unrelated to version management
CVO is the Cluster Version Operator. Why is this unrelated to version management?
CVO is related to version management; mutable topology is not. You don't mutate across versions. The version remains the same after transitioning the topology.
Reading this with fresh eyes, I realized I misread the statement. For whatever reason, GitHub isn't allowing me to resolve this thread, so ignore this one going forward 😄
| **Why it was rejected**:
| - Overloads CVO with responsibilities unrelated to version management
| - As the transition DAG grows, it would significantly increase CVO complexity
This seems to assume unbounded growth of the DAG. Given that we're talking about topologies, isn't this set pretty fixed, so it wouldn't actually increase complexity much?
Each leg of the DAG could have very different requirements and preconditions. While our current set (SNO -> HA) is a relatively straightforward graph, things can get complex quickly (e.g. the configuration for two-node with fencing complicates the logic considerably).
- Soften wording: "breaking" → "removing" the immutability assumption
- Broaden Cluster Administrator definition to include non-human entities
- Quote "mutable topology" as a defined term in summary
- Replace "solutions architect" persona with "cluster administrator"
- Simplify first goal bullet
- Add upfront justification for why a dedicated operator is the right vehicle for topology transitions
- Rewrite CLI-only alternative to acknowledge bounded topology set while emphasizing persistent state and recovery needs
- Reframe CVO alternative around stability risk and lifecycle mismatch instead of "unrelated to version management"
- Add CEO and MCO as evaluated alternative host operators

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of these files; approvers can indicate their approval by writing `/approve` in a comment. The full list of commands accepted by this bot can be found here.
@brandisher I've added a new paragraph under the …
…s not guaranteed in the future

Signed-off-by: Jeff Roche <jeroche@redhat.com>
@jeff-roche: all tests passed! Full PR test history. Your PR dashboard.
| Future support for HyperShift is not planned for this enhancement but is not ruled out.
| #### Standalone Clusters
What is a standalone cluster?
Traditionally this means general HA clusters, provisioned via IPI or UPI.
JoelSpeed left a comment
🤖 Generated with Claude Code
There are significant portions of this proposal that assume behaviour of OpenShift that either doesn't exist or doesn't work in the way proposed. I'm assuming here that this is a hallucination from Claude?
The EP as it stands today doesn't actually make sense for implementation. It also doesn't align with what I thought we had agreed on the architecture call.
Has anyone tried to manually take a cluster, scale it up, and transition from a single replica to multiple replicas? IMO this is the most important next step for this project.
What I thought we had agreed:
- To scale from SNO to HA, the user must create two new control plane nodes and join them to the cluster
- On HighlyAvailable topology, KAS, KCM, etcd, etc. all get scheduled automatically as static pods on these nodes. I don't see anything that prevents this on a SNO cluster today; this needs to be checked (it probably should be prevented)
- MCO still serves Ignition for control plane nodes on SNO, so the user needs to create the control plane nodes somehow to ignite from here
- New fields are added to the infrastructure spec to allow the user to say "I intend for this cluster to be HA going forward" (see the sketch after this list)
- A controller is added to cluster config operator
- This checks that the precondition of having additional control plane nodes in the cluster is met
- Once the precondition is met, it updates the status to reflect spec
- Operators now react to the change in status and transition from single to HA
- etcd operator promotes learners to full members, quorum goes from 1->3 (I don't know if this guard is in place today, we should add it if not)
- KAS/KCM - no change, it already scheduled new KAS/KCM pods
- Others - those that previously deployed a single replica of their operand now move to 2 replicas; other changes might be needed on a per-operator basis. I was expecting those details in the EP but don't see them yet
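To make the intent/status split in that flow concrete, here is a minimal sketch of what the Infrastructure object could look like mid-transition. The `desiredControlPlaneTopology` field and its placement are assumptions for illustration only; no such field exists in the API today.

```yaml
# Hypothetical sketch only: desiredControlPlaneTopology is NOT an existing
# field on config.openshift.io/v1 Infrastructure; the name is illustrative.
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
spec:
  # The user declares intent: "I intend for this cluster to be HA going forward"
  desiredControlPlaneTopology: HighlyAvailable   # assumed new field
status:
  # Stays SingleReplica until the cluster-config-operator controller verifies
  # the precondition (additional control plane nodes joined), then flips to
  # match spec; operators react to the status change, not the spec.
  controlPlaneTopology: SingleReplica
  infrastructureTopology: SingleReplica
```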
| The approach has two components:
| 1. **OpenShift Topology Transition Operator (OTTO)** — An operator that ships with the payload but is not installed by default. OTTO owns the transition graph, validates preconditions, orchestrates the transition sequence, and updates the Infrastructure config as the final step.
By which mechanism are you planning to install this only when needed? Once installed, what manages and updates this going forward?
I think the context that is missing here is that Jeff and I were discussing ways to make sure we aren't adding idling infrastructure on the control-plane nodes to handle transitions until they're ready.
The mechanism to install when needed is the same thing we do with NTO: the CLI tool would create a manifest to deploy a replica of the controller used to run the pre-flights and transitions.
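As a rough illustration of that NTO-style mechanism, the CLI could apply a manifest along these lines before a transition and scale it back down afterwards; the namespace, names, and image below are hypothetical placeholders, not a settled design:

```yaml
# Illustrative only: namespace, labels, and image are hypothetical placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: topology-transition-controller
  namespace: openshift-topology-transition   # hypothetical namespace
spec:
  replicas: 1   # 0 when idle; raised by the CLI only for the transition window
  selector:
    matchLabels:
      app: topology-transition-controller
  template:
    metadata:
      labels:
        app: topology-transition-controller
    spec:
      containers:
      - name: controller
        image: example.io/topology-transition-controller:latest   # placeholder
```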
| 2. The cluster administrator runs `oc adm transition topology` to begin the interactive transition flow
| 3. The CLI checks whether OTTO is installed; if not, it installs/activates it
| 4. The CLI guides the administrator through providing required configuration:
|    - Node identities and access details for the new control-plane nodes
They aren't already Nodes at this point then?
I think this is just poor wording. At this stage, the operator is looking to the node API to see if they exist. The user is assumed to have taken steps prior to step 4 to ensure that the nodes are present.
If the nodes are not present yet, OTTO fails the pre-flight checks.
| 4. The CLI guides the administrator through providing required configuration:
|    - Node identities and access details for the new control-plane nodes
|    - Any certificates or secrets required for the transition
|    - Load balancing configuration (on `platform: none`, the user manages their own VIPs/DNS)
So for the initial implementation, this doesn't actually do anything? On AWS/Azure/GCP, is there manual action required here? Typically when the node is created it is added to the load balancer by whatever creates it, or by the CCM depending on platform, so I wouldn't expect the transition operator to need to know anything about this external infrastructure.
Which platforms will need to update their config?
The reason we mention platform is that we only plan to support 1->3 on platform none, and possibly platform baremetal, for the first pass.
To add support for other platforms, the corresponding platform controllers will need to handle stitching up the networks on those nodes. This is why we're looking at baremetal networking and how we can update it to handle scale-up, since it's the only semi-automated platform that is available for all edge topologies (except SNO atm).
It seems like this step is just part of the node joining, which should have happened prior; and this line should be (re)moved.
That node joining could be UPI, which is the proposal here. Could it potentially be MAO in the future? I know we are trying to keep the scope to UPI joining, but if possible I would like to confirm the design does not prevent future MAO use.
| 3. The CLI checks whether OTTO is installed; if not, it installs/activates it
| 4. The CLI guides the administrator through providing required configuration:
|    - Node identities and access details for the new control-plane nodes
|    - Any certificates or secrets required for the transition
What might these secrets/certificates be used for?
I am not aware of any certificates, unless Jeff is talking about a pre-flight check for the certificate approvals for the new nodes - e.g. node bootstrapper, client, etc. But those will get sorted by the node controller. If the nodes are ready, we're good to fly.
I misunderstood part of the intent of the CLI. I thought we were going to need to step the user through configuring the nodes; it seems that is not the case, and we are assuming the new nodes are pre-configured and we just need to validate that.
| 7. OTTO signals that a transition is in progress (status on the transition CR)
| 8. OTTO triggers setup changes on dependent operators:
|    - cluster-etcd-operator (CEO) scales etcd members sequentially following the bootstrap pattern (1→2→3)
Does etcd operator need to be told to do this today? How do you tell it to scale up?
This is trying to capture the context from a design discussion we had on the previous EP. Essentially, there was a discussion about whether we needed the 1->3 transition to be atomic or whether we should proceed with sequential, single-learner promotion for etcd scale-up (which is done automatically during bootstrap). The outcome of that discussion was that we'd use the latter, because the atomicity would be a multi-year development effort in the upstream with a low probability of success, and would gain you minimal protection over what is already available using quorum-restore.sh today.
To answer your question, CEO handles this automatically, no changes here.
| - Installed either manually or via the `oc adm transition topology` command
| - Owns the transition graph — the directed graph defining which topology transitions are supported
| - Owns the validation criteria for each transition (required nodes, certificates, secrets, operator states)
| - Orchestrates transitions by interacting with cluster operators via their existing APIs
I don't think this actually exists
This statement is just objectively incorrect. :( I can see why you'd be confused reading this.
This never happens. What we can do is look to update things like ingress and console to be more adaptable, like etcd/api-server, such that they update their replicas when more infrastructure nodes become available, and do firmer pre-flight checks so that the "transition" piece becomes a no-op. But I think it's OK for some operators to continue to treat the topology field as the source of truth for desired behavior.
An alternative would be to key off a "desiredTopology" field on the Infrastructure topology and update the hooks for ingress to update its replicas when it detects a change to that field. Then the pre-flight checks actually verify that it has the right number of replicas, and we update the topology after it has already succeeded. I guess it depends on whether we're treating the topology field or the desired-topology field as the answer to "what should the operator be doing right now".
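Under that alternative, the ingress side of the reaction might be as simple as the ingress operator raising replicas on the default IngressController once the hypothetical desiredTopology field changes, with the pre-flight check then verifying the observed count:

```yaml
# Sketch of the expected end state, assuming a hypothetical desiredTopology
# trigger; IngressController and spec.replicas are existing API surface.
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  replicas: 2   # raised from 1 once desiredTopology indicates HighlyAvailable
```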
| #### Risk: Transition Fails Partway Through
| **Risk**: A transition may fail after some operators have begun reconfiguring but before the transition completes, leaving the cluster in an intermediate state.
Can you provide examples of failures?
@jeff-roche one specific example:
- etcd scales up to 2 nodes
- network partition between them
- both have lost quorum
- no API
| #### Risk: Platform Bare Metal May Not Support Single-Node Clusters
| **Risk**: If keepalived networking cannot be enabled, `platform: baremetal` will be limited to 2+ nodes, reducing the value of mutable topology for this platform.
limited to 2+? Isn't that the success criteria?
baremetal platform doesn't support SNO because having a load balancer for 1 node doesn't make sense.
In order for users to get the benefit of not having to manually deploy a load balancer (i.e. what they primarily save in terms of effort when deploying on platform: baremetal), we need to investigate whether we can allow baremetal as a platform for SNO first (with load balancing disabled), and change that operator so that load balancing can be introduced post-transition.
Otherwise we need to introduce a new, scarier feature: platform transitions.
That's a Pandora's box I don't want to look at.
| - Error recovery and retry logic is better suited to an operator's reconciliation loop than imperative CLI code
| - The CLI would need direct access to operator internals, violating separation of concerns
| ### Extending an Existing Core Operator
Or the cluster-config-operator, which would make a very natural home for this as long as we have a commitment of ownership from the folks writing the new controller.
At risk of going on a tangent, currently the installer has problems calculating topology when laying down manifests. We have a bug for this in the backlog, and I left #1905 (comment) on the previous enhancement.
I would like to see that calculation moved to the cluster config operator in bootkube during bootstrapping. That solution could co-exist with this one (and my team will push it forward as priorities allow); but it could potentially also tie into this solution.
I'm fine with us using CCO for this. I will take the blame for miscommunicating this to Jeff - it didn't strike me as obvious that a controller for this transition would belong there. My instinct was that new code in the core operators is expensive, especially for a controller that doesn't need to be running 99% of the time. That said, I think it's fine for this to be a controller that is installed with zero replicas and whose replicas are scaled up during transition events. That fits the main sentiment of what Jeff and I were trying to solve: minimizing the tax on clusters that will never use this feature (i.e. the vast majority of them).
IMHO the solution for the installer is to get the user to specify their intent in the install-config (this should be passed through to the cluster). This enhancement is a good opportunity to define what that input should look like.
patrickdillon left a comment
I know the scope is limited to baremetal/platform:none, but I know there is interest in mutable topologies in cloud platforms as well, so as much as is appropriate I would like to ensure the design leaves a path forward for those cloud platforms.
Also, like the other enhancement, I don't see any mention of `mastersSchedulable`, which affects the calculation of `infrastructureTopology`. How is the `mastersSchedulable` field handled/taken into account in this solution?
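For reference, `mastersSchedulable` lives on the cluster-scoped `Scheduler` config; any logic that recalculates `infrastructureTopology` during a transition would presumably need to read it:

```yaml
# Existing API surface: the cluster-scoped Scheduler config.
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
spec:
  # When true, control plane nodes accept regular workloads, which factors
  # into how infrastructureTopology is derived for the cluster.
  mastersSchedulable: true
```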
zaneb left a comment
This one looks directionally correct 👍
| ##### Pre-Transition
| 1. The cluster administrator prepares the additional control-plane nodes (hardware, network, OS)
Does 'OS' here imply that the user joins the hosts to the cluster as control plane nodes at this stage? If not, at what stage is that expected to happen?
| OTTO maintains a directed graph of supported transitions. For the initial implementation:
| SingleReplica (SNO, platform: none) → HighlyAvailable (3-node compact)
I think it's a mistake to define the supported topologies in terms of the controlPlaneTopology field. There are at least 6 use cases I can think of that users have articulated:
- single-node (1 schedulable control plane, 0+ workers, no load balancer)
- compact (3 schedulable control plane, 0+ workers)
- standby (3 non-schedulable control plane, 0 workers)
- HA (3 non-schedulable control plane, 2+ workers)
- TNA (2 non-schedulable control plane, 1 arbiter, 2+ workers)
- TNF (2 schedulable control plane w/ STONITH, 0 workers)
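For illustration, a transition graph keyed to use cases like these (rather than to controlPlaneTopology values) could be expressed roughly as below; the names, edges, and preconditions are hypothetical, not a proposal:

```yaml
# Hypothetical sketch: topology names, edges, and preconditions are
# illustrative only, not a committed transition graph.
topologies: [single-node, compact, standby, ha, tna, tnf]
transitions:
  - from: single-node
    to: compact
    preconditions:
      - two additional control plane nodes joined and Ready
      - etcd members promoted sequentially (1 -> 2 -> 3)
  - from: compact
    to: ha
    preconditions:
      - 2+ worker nodes joined
      - control plane nodes marked non-schedulable
```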
| The initial implementation targets `platform: none` clusters. On `platform: none`, the administrator is responsible for managing their own load balancing configuration (VIPs, DNS) when scaling beyond a single node.
| `platform: baremetal` support is planned for a subsequent phase. Bare metal networking uses keepalived for ingress load balancing, which is not useful and creates a point of failure for SNO deployments. The Bare Metal Networking team will be consulted to determine if this networking setup can be enabled for single-node clusters transitioning to HA.
I find it weird that we are going to add single-node support to platform:baremetal just so that we can say we are not preventing it from later transitioning to HA.
Who is asking for this?
I would prefer that any effort from the on-prem networking team were instead directed toward adding optional on-prem networking to platform:external.
| #### Risk: Platform Bare Metal May Not Support Single-Node Clusters
| **Risk**: If keepalived networking cannot be enabled, `platform: baremetal` will be limited to 2+ nodes, reducing the value of mutable topology for this platform.
This is out of date given the current scope, right?
| `platform: baremetal` support is planned for a subsequent phase. Bare metal networking uses keepalived for ingress load balancing, which is not useful and creates a point of failure for SNO deployments. The Bare Metal Networking team will be consulted to determine if this networking setup can be enabled for single-node clusters transitioning to HA.
| ### Risks and Mitigations
I would say one is the inability to validate external requirements, such as correct load balancer setup.
| 10. OTTO updates the Infrastructure status fields:
|     - `controlPlaneTopology` transitions from `SingleReplica` to `HighlyAvailable`
|     - `infrastructureTopology` transitions from `SingleReplica` to `HighlyAvailable`
| 11. Operators reconcile against the new topology values and adjust their deployment strategies, replica counts, and placement policies
Are we going to try to e.g. restart OLM operators (which previously have treated the topology as fixed)?
Do you have a view into how many/which OLM operators are reading this value? Are they reading it at startup, or watching the resource? The expected pattern would be that the operator sees the change and then reacts by updating the operand (e.g. scaling from 1 to 2 replicas now that it's been told the cluster is HA).
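Concretely, the change a watching operator would observe is the status flip from step 10 on the existing cluster Infrastructure singleton:

```yaml
# Post-transition status on the Infrastructure singleton
# (spec omitted; both fields were SingleReplica before step 10).
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  controlPlaneTopology: HighlyAvailable
  infrastructureTopology: HighlyAvailable
```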
| | cluster-etcd-operator | Coordinate with OTTO for sequential etcd scaling during transitions |
| | Ingress, networking, monitoring operators | Respond to OTTO coordination signals during transitions; reconcile on Infrastructure config changes |
| #### Platform Support Constraints
We need to mention that IBI clusters cannot be converted from SNO (and have some mechanism for preventing that).
What's the technical blocker there?
| ## Operational Aspects of API Extensions
| OTTO introduces a `TopologyTransition` CRD. This CRD:
How about just a Topology CRD? We should store the intent of the user, not just when the cluster is in transition.
Infrastructure spec should hold this rather than a separate CRD, since we already have topology in the Infrastructure status today.
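For context in this discussion, a minimal sketch of what a `TopologyTransition` CR might carry; the EP does not pin down a schema, so the group, version, and every field below are assumptions:

```yaml
# Hypothetical sketch: no schema is defined in the EP; all names below
# (group, version, fields, condition types) are illustrative assumptions.
apiVersion: topology.openshift.io/v1alpha1
kind: TopologyTransition
metadata:
  name: sno-to-ha
spec:
  fromTopology: SingleReplica
  toTopology: HighlyAvailable
status:
  phase: InProgress          # the "transition in progress" signal from step 7
  conditions:
  - type: PreflightChecksPassed
    status: "True"
```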