OCPEDGE-2280: Add Adaptable Topology, reorganize topology enhancements #1905
jaypoulz wants to merge 7 commits
|
@jaypoulz: This pull request explicitly references no jira issue. |
Force-pushed from 3aa08a3 to 40d4ff0
|
/retitle OCPEDGE-2280: Add Adaptable Topology, reorganize topology enhancements |
|
@jaypoulz: This pull request references OCPEDGE-2280 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target either version "4.21." or "openshift-4.21.", but it targets "openshift-4.22" instead. |
|
/retest |
Force-pushed from 40d4ff0 to 6e81004
|
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting If this proposal is safe to close now please do so with /lifecycle stale |
|
Stale enhancement proposals rot after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting If this proposal is safe to close now please do so with /lifecycle rotten |
|
/remove lifecycle-rotten |
|
/remove-lifecycle rotten |
|
Are there considerations needed for scaling the storage along with the nodes? This seems like a concern that would need to be tested. Adding nodes seems somewhat easy (I could be talking out of my hat), but scaling storage down as the nodes decrease seems like a challenge. |
|
Is the Edge Enablement team responsible for testing the additions to the other team areas, or does each specific team assist in the testing responsibility? |
| | Serial tests | Monthly | Standard test suite (openshift/conformance/serial) on Adaptable topology clusters | | ||
| | Upgrade between z-streams | Weekly | Test upgrades on clusters running Adaptable topology | | ||
| | Upgrade between y-streams | Weekly | Test upgrades across minor versions on clusters running Adaptable topology | | ||
|
|
Also account for whatever becomes of the openshift-test-private tests from QE. These are currently being looked at and moved into other areas.
Yeah, these are meant to replicate the lanes we have for SNO in particular. I think the standard conformance suite and upgrade workflows will be mostly applicable here and acceptable for a first pass at this functionality.
@jaypoulz can you resolve this one?
|
/remove-lifecycle rotten |
| #### Risk: etcd Data Loss If Transitions Are Not Atomic | ||
|
|
||
| **Risk**: If CEO cannot make 1→3 or 3→1 etcd member transitions truly atomic, | ||
| data loss or corruption could occur. |
I think in this case the thought was about the 2-node transient state and losing one of the etcd members during that state. That being said, we can address this on the other threads about atomicity and scaling down.
We should rephrase this a bit here. The concern is that "quorum loss could occur".
Data loss or corruption is really only a concern if split-brain happens, and that only happens if you lose quorum in the two node state and someone forces a new cluster on both sides.
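The quorum arithmetic behind this distinction can be sketched in a few lines. This is only an illustration of the standard majority-quorum rule (quorum = floor(n/2) + 1) that etcd follows; the function names are hypothetical and are not part of CEO or etcd.

```python
def quorum(members: int) -> int:
    """Majority quorum for a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one voting member")
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """Number of voting members that can be lost while quorum is kept."""
    return members - quorum(members)


if __name__ == "__main__":
    # Walk the sequential scale-up path 1 -> 2 -> 3 discussed in this PR.
    for n in (1, 2, 3):
        print(f"{n} member(s): quorum={quorum(n)}, "
              f"tolerates {failures_tolerated(n)} failure(s)")
```

With two members, quorum is 2 and no failure is tolerated, which is why the transient 2-member window risks quorum loss rather than, on its own, data loss.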
Add non-functional constraint clarifying no availability guarantee during topology transitions, and document compact clusters (dual-role nodes) as a supported transition path from SNO.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
…back
Drop atomic etcd member transitions in favor of the sequential bootstrap pattern (1→2→3) for the first iteration. Add dedicated etcd scaling mechanism section, platform:none scoping, scale-down CLI tooling, two-node enforcement, and controlPlaneNodeCount Infrastructure status field.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix all pre-existing markdownlint-cli2 errors:
- MD060: align table separator pipe positions with headers
- MD060: add spaces in separator rows for compact-style tables
- MD013: wrap over-length line in topology audit section
- MD051: fix broken link fragment for bare metal risk section
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/retest |
| before the next is added. | ||
| The 2-member state (quorum=2) is transient; losing either member | ||
| during this window is fatal. | ||
| When scaling down below 3 control-plane nodes, |
May want to axe this given the meeting today.
| * Workload-aware topology decisions | ||
| (e.g., transitioning based on application requirements rather than node count) | ||
| * Supporting topology transitions for HyperShift clusters | ||
| * Implementing topology transitions for MicroShift deployments |
Do we want to add scale-down here for the time being?
Yes, I think this is the natural spot for it currently
| * Automatically adjust infrastructure workload distribution based on | ||
| the number of worker nodes | ||
| * Provide a mechanism for operators to detect the topology behavior | ||
| through the Infrastructure API |
On the call today there was a mention of machine config being the authoritative source. Is the above line out of date, or am I misunderstanding the context?
We will expose controlPlaneNodeCount, which the machine config operator will be responsible for updating. I will clarify this point.
Actually I already did, it's below in the API Extensions > Infrastructure Config Changes section
| **cluster administrator** is managing a cluster already running Adaptable topology. | ||
|
|
||
| **Non-functional constraint**: There is no availability guarantee during topology | ||
| transitions. Scaling control-plane or worker nodes is an explicit operational |
Why would scaling worker counts be treated as an operational action? I can see control plane scaling, but curious why workers are included here.
Yeah, I think this was a case of thinking too quickly here. It only applies to CP nodes, as you mention, so I will drop the worker mention.
| - Each new node joins as a learner and is promoted to a voting member before the next is added | ||
| - The 2-member state is transient — quorum requires both members, so losing either is fatal during this window | ||
| - Other operators adjust their behavior to match HighlyAvailable control-plane topology | ||
| 5. When scaling down and crossing the 3→2 control-plane node threshold: |
|
/lgtm Feel free to release hold when you feel it's ready to merge and proceed to PoC |
…idation
Scale-down of control-plane and worker nodes is not a goal of this
enhancement. Changes:
- Add scale-down as a non-goal
- Remove scale-down details from current-scope sections:
  - etcd scale-down paragraph from "How Adaptable Topology Works"
  - 3→2 control-plane threshold from "Scaling Control-Plane Nodes"
  - 3→1 etcd scale-down sequence from "Behavior at Three Nodes"
  - Worker node scale-down steps from "Scaling Worker Nodes"
  - `oc adm topology scale-down` CLI command from "oc CLI Changes"
- Update non-functional constraint to scope to control-plane scaling only
- Narrow quorum loss risk to scale-up only (1→2→3)
- Remove scale-down test cases from the test plan
- Update workflow language from "adding or removing" to "adding"
- Add new "Future Considerations" section containing:
  - Scale-down operations (control-plane 3→1, worker thresholds, CLI)
  - Control-plane performance validation during scaling, mirroring
    assisted installer host validation checks (disk I/O, network
    latency, resource capacity)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@jaypoulz: all tests passed! |
|
/lgtm |
| **Ambiguous Target State**: Should transitioning to HighlyAvailable create a | ||
| 3-node compact cluster or provision compute nodes for a 5-node cluster? | ||
| The end state is unclear. |
What happens today if I take a SNO cluster and create two additional Machines and label them up as control plane? Do they turn into control plane nodes?
Wondering if a transition from SNO to HA is actually feasible as a one direction change? Do we ever see people asking for the reverse?
I'm not entirely sure that creating a new name (Adaptable) removes any of the issues you describe here
> What happens today if I take a SNO cluster and create two additional Machines and label them up as control plane? Do they turn into control plane nodes?
Yes, they become control-plane nodes. However, operators such as ingress and console do not respond to the changed topology. They maintain the same number of replicas and filtering as they set when the cluster was deployed. Some operators, such as etcd and api-server, do actually respond as expected.
> Wondering if a transition from SNO to HA is actually feasible as a one direction change? Do we ever see people asking for the reverse?
It is technically possible, but you need to change the topology field, delete the configuration of your existing operators, and effectively reinitialize the controllers as though it was a new cluster. I am aware of only 1 ask for scale-down, but we've decided to mark that flow as out of scope for the initial delivery of this feature.
> I'm not entirely sure that creating a new name (Adaptable) removes any of the issues you describe here
It doesn't address those particular issues.
The issues it does address are as follows:
- Mindset - Topologies have always been designed to be static. The goal of Adaptable topology is to be able to think of topology as fluid. No longer are we thinking of OpenShift in terms of SNO vs Two Node vs HA Compact vs HA with Workers - instead there is just a number of control-plane nodes and infrastructure nodes that have been confirmed by MCO, and each component in the cluster is responsible for behaving appropriately according to the nodes available.
- Layered Products/3rd Party Operators - How does a layered product or 3rd party operator identify that it should be expected to adapt to changing control-plane and infrastructure node conditions? We need a way to be able to distinguish between the operational expectations of OpenShift-with-Topologies and just OpenShift that behaves appropriately regardless of its node composition.
- It's a 1-way switch - once you go adaptable, there's no going back. You are opting in to a new paradigm where node-resource counts define cluster behavior instead of install-time defined configurations.
The main compelling alternative would be to add a new field to infrastructure that is not an enum per se, but essentially tracks if you are operating with mutable topology or not. The tradeoff with this approach is that we now have to compute the effective topology and update it in the infrastructure config. For 3rd party operators and layered products, this always breaks their expectations around something that was always an invariant. Additionally, it preserves a second limitation of topologies as we have them today: having to look at a second API if you have differences in behavior that are grouped together by topology definitions (e.g., a third party operator may want to define behavior differently on a cluster with 2 compute nodes (HA infrastructure topology) vs 3 nodes (HA infrastructure topology)). This last point is more of an academic concern related to storage offerings that have a majority quorum component like etcd, but I did want to include it since there is potential these could increase now that TNF is going GA.
Overall, I think if you see OpenShift for what it is today, it's easier to look at this and say: let's just architect this around the mechanic of updating the topology fields in-place because it's the least amount of disruption. What we are proposing is a greater vision for what OpenShift would become: eventually all clusters would be adaptable by default and would look at the replacement fields, controlPlaneNodes and infrastructureNodes, to determine their appropriate/desired topological behavior.
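To make the "node counts define behavior" idea concrete, here is a minimal sketch of how a component might derive its replica count from a reported node count instead of an install-time topology value. The function name, the default cap of two replicas, and the idea that the count comes from the Infrastructure config are assumptions for illustration; this is not the behavior of any existing operator.

```python
def desired_replicas(infrastructure_nodes: int, max_replicas: int = 2) -> int:
    """Pick a replica count from the number of nodes able to host the workload.

    Hypothetical helper: one replica per available node, capped at
    `max_replicas`, so a component scales up as nodes are added.
    """
    if infrastructure_nodes < 1:
        raise ValueError("at least one node is required to schedule a replica")
    if max_replicas < 1:
        raise ValueError("max_replicas must be at least 1")
    return min(infrastructure_nodes, max_replicas)


if __name__ == "__main__":
    for nodes in (1, 2, 5):
        print(f"{nodes} node(s) -> {desired_replicas(nodes)} replica(s)")
```

The point of the sketch is that the decision is recomputed whenever the node count changes, rather than being fixed when the cluster is deployed.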
FWIW I don't have an issue with clusters becoming adaptable - the idea of allowing transition from SNO to HA makes sense. My issue I think is around whether we need a new API for this.
> delete the configuration of your existing operators
Can you expand on this point? Which operators need config deleting and reconfiguring manually?
How does adding a new adaptable topology solve this? Or is it just you now have to go to these components and build logic that says "if you see adaptable, you must ..."
> How does a layered product or 3rd party operator identify that it should be expected to adapt to changing control-plane and infrastructure node conditions? We need a way to be able to distinguish between the operational expectations of OpenShift-with-Topologies and just OpenShift that behaves appropriately regardless of its node composition.
Have you conducted research into exactly who consumes the topology field as it is today? And with that, how do the consumers behave when they see a particular value? Do you know what they do if they see a value that they don't currently recognise?
Have you found any examples where, upon restart, they wouldn't just adjust if they saw a new topology? Any examples where a single to multi node transition needs to actually be implemented?
Why does a controller looking at the size of the cluster and updating the topology field between the current available enum values not suffice here?
> It's a 1-way switch - once you go adaptable, there's no going back. You are opting in to a new paradigm where node-resource counts define cluster behavior instead of install-time defined configurations.
My counter to this would be once you go from SNO to HA, there's no going back. I think it's equivalent here to be honest. Because that's effectively what you're saying adaptable is - you might start as SNO, and then move to HA, or you might not. And the adaptable flag just says to the operators, be prepared. What if they were just always prepared?
> but essentially tracks if you are operating with mutable topology or not.
Or we make every cluster mutable
> For 3rd party operators and layered products, this always breaks their expectations around something that was always an invariant.
This project will break that invariant no matter how it is implemented
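For reference, the controller alternative raised in this comment (watch the cluster size and write one of the existing enum values back to the topology field) could look roughly like the sketch below. The mapping is an assumption made for illustration: it treats two control-plane nodes as DualReplica and three or more as HighlyAvailable, and it ignores arbiter nodes and master schedulability. The names `topology_for` and `reconcile` are hypothetical.

```python
from enum import Enum


class TopologyMode(str, Enum):
    """Subset of the existing topology values named in this thread."""
    SINGLE_REPLICA = "SingleReplica"
    DUAL_REPLICA = "DualReplica"
    HIGHLY_AVAILABLE = "HighlyAvailable"


def topology_for(control_plane_nodes: int) -> TopologyMode:
    """Hypothetical mapping from control-plane node count to a topology value."""
    if control_plane_nodes < 1:
        raise ValueError("at least one control-plane node is required")
    if control_plane_nodes == 1:
        return TopologyMode.SINGLE_REPLICA
    if control_plane_nodes == 2:
        # Assumption: the two-node state is reported as DualReplica.
        return TopologyMode.DUAL_REPLICA
    return TopologyMode.HIGHLY_AVAILABLE


def reconcile(current: TopologyMode, control_plane_nodes: int) -> tuple[TopologyMode, bool]:
    """Return the topology to write back and whether an update is needed."""
    desired = topology_for(control_plane_nodes)
    return desired, desired != current


if __name__ == "__main__":
    # A SingleReplica cluster that has grown to three control-plane nodes.
    print(reconcile(TopologyMode.SINGLE_REPLICA, 3))
```

Consumers that already read the existing enum values would see only values they know, which is the backwards-compatibility argument made for this alternative elsewhere in the review.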
> Topologies have always been designed to be static. The goal of Adaptable topology is to be able to think of topology as fluid. No longer are we thinking of OpenShift in terms of SNO vs Two Node vs HA Compact vs HA with Workers - instead there is just a number of control-plane nodes and infrastructure nodes that have been confirmed by MCO, and each component in the cluster is responsible for behaving appropriately according to the nodes available.
The problem is that the correct behaviour always depends on the user's intent, which isn't known to us. A cluster with 3 control plane nodes and 1 worker could represent a normally operating cluster, a degraded or misconfigured cluster to which the user should be alerted, an intentional cost/risk balancing measure, or a dangerous security failure that should be avoided at all costs. The difference is what the user intended to happen.
Historically we have 'solved' this by taking the cluster topology at install time as an indication of the user's intent. This approach is deficient in two ways. First, it makes it hard to change after the fact. But more importantly, it represents only a lossy signal of the user's intent.
This proposal addresses only the first, imho lesser, problem. The second it makes worse, by discarding even such meagre information as we had. This seems unlikely to improve the situation.
| 5. The cluster installs with behavior matching the effective topology for the initial control-plane, arbiter, and worker node counts | ||
| 6. After installation completes, the cluster is ready to scale by adding nodes | ||
|
|
||
| *Note: The `adaptableTopology` flag is optional and defaults to `false`. Adaptable is intended to become the default topology mode in a future release once the feature reaches GA and has proven stable in production.* |
To @JoelSpeed's point, if adaptable becomes the default topology mode, it feels like topology just became somewhat meaningless. Is this not possible to implement with the existing topologies, without the need for a new one and a layer of indirection to determine the actual topology?
This is by design. I've explained my reasoning and the alternatives in my response to Joel.
That said, it wouldn't introduce a layer of indirection because it would be replaced by the infrastructure fields that just declare how many control-plane and infrastructure nodes the operators should be expecting. In time, topology as a field could be retired entirely, but it can be preserved for historical context.
|
It seems like we're missing consideration of master schedulability: topology isn't just a calculation of node count, it also takes into account As a side note, we have an unresolved bug for correctly calculating the topology; ideally we would like to move the infrastructure topology calculation out of the installer, because the calculation depends too much on manifest editing, and the dependency handling is not user friendly. |
|
Note this EP is going to be closed in favor of #2008 |
| 1. The cluster creator prepares an `install-config.yaml` with the desired initial node count | ||
| 2. The cluster creator sets `adaptableTopology: true` in the `install-config.yaml` (optional, defaults to `false`) | ||
| 3. The cluster creator runs `openshift-install create cluster` to complete the installation | ||
| 4. The installer validates the configuration and sets both `controlPlaneTopology` and `infrastructureTopology` to `Adaptable` in the Infrastructure config |
This requires code changes to every OLM operator, whereas if we had a separate field that a controller reads and then writes back the current topology to controlPlaneTopology and infrastructureTopology, it would be backwards-compatible.
Needless to say, many (most?) OLM operators are not controlled in any way by RH. And I think you could argue that this is in a sense breaking the API contract. Previously we've added new topology values for entirely new topologies, and there would have been OLM operators that could not yet operate on a new cluster of that type that had never existed before (e.g. if you have an OLM operator that runs on the control plane then we force you to explicitly decide whether it can run successfully on TNF now that that's a thing). But this would be the first example of changing the topology on cluster types that have long existed.
Furthermore, with all the logic residing in a library rather than a controller, any future bug fixes or improvements to the logic will also require a rollout to every OLM operator.
|
|
||
| `platform: none` will be supported for all node configurations. | ||
|
|
||
| `platform: baremetal` presents a challenge for single-node clusters. |
platform:baremetal is not supported on single-node clusters. You cannot install one. This is because it is not remotely useful for anything.
| which is not useful and creates a point of failure for SNO deployments. | ||
| The Bare Metal Networking team will be consulted to determine if this | ||
| networking setup can be disabled for single-node clusters. | ||
| The goal is to support `platform: baremetal` for all node configurations |
| ```yaml | ||
| apiVersion: v1 | ||
| baseDomain: example.com | ||
| adaptableTopology: true # optional, defaults to false |
I'm told that boolean fields always turn out to be a mistake.
|
|
||
| The console will be updated to: | ||
| - Display operator compatibility status for Adaptable topology | ||
| - Provide a marketplace filter to show only operators that support Adaptable topology |
This seems like an own-goal that will prevent adoption of the feature.
| Single Node OpenShift (SNO) clusters are candidates for | ||
| transitioning to Adaptable topology. | ||
| The primary use case is enabling SNO deployments to scale to | ||
| multi-node highly available configurations as requirements change. |
SNO clusters don't have load balancers, so there is a lot more to scaling up their control planes than simply adding nodes.
SNO clusters created via IBI depend on networking configuration within the node that cannot work in a multi-node cluster. You will need to have some way of dealing with this.
| MicroShift and OpenShift architectures closer together. | ||
| MicroShift would benefit from operators having an adaptable topology mode | ||
| that handles topology changes via node updates. | ||
| A follow-up enhancement will address MicroShift to SNO transitions, |
|
@zaneb @patrickdillon thanks for the reviews! After discussion at a recent arch call, we've shifted from this approach to a new one that is more robust and I think provides a way to address some of the concerns you both brought up: |
|
@jaypoulz: This pull request references OCPEDGE-2280 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target either version "5.0." or "openshift-5.0.", but it targets "openshift-4.22" instead. |
|
Closing as retired in favor of #2008. |
Retired in favor of #2008.
This enhancement introduces Adaptable topology, a new cluster-topology mode that enables clusters to dynamically adjust their behavior based on node count. This allows SingleReplica clusters to scale to multi-node configurations without redeployment.
Key features:
The proposal includes complete workflow descriptions, API extensions, test plans, and version skew strategy. Future stages will add AutomaticQuorumRecovery (AQR) to enable DualReplica-based resiliency for two-node configurations.