diff --git a/build-and-config/configuring-for-high-availability.mdx b/build-and-config/configuring-for-high-availability.mdx
index 0f0abc7..7626910 100644
--- a/build-and-config/configuring-for-high-availability.mdx
+++ b/build-and-config/configuring-for-high-availability.mdx
@@ -11,6 +11,23 @@ In order to achieve high availability for your application, you need multiple re
 Applications running Kubernetes v1.12 or lower do not support the multi-master feature on Cloud 66. If you have deployed an application via Cloud 66 before March 2019, you will need to **redeploy your application "with upgrades" and choose to perform a Kubernetes upgrade** (note that this will incur significant downtime as your cluster will be recreated). All applications deployed after **March 2019** on version v1.13 and above automatically support multi-master clusters.

+## Choosing between a shared master and dedicated workers
+
+A new Maestro Kubernetes cluster starts with a single master node that also runs your application workloads (a "shared master" topology). This is fine for development and small production loads, but it has a hard ceiling: the same node is responsible for both Kubernetes control-plane traffic (the API server, scheduler, controller manager, etcd) and your app's containers, and those two responsibilities compete for the same CPU, memory, and disk I/O.
+
+The right time to add a **dedicated worker** (a node that runs *only* application pods, with the master returned to control-plane duties) is when you start seeing any of:
+
+- **Slow `kubectl` responses or Dashboard timeline lag** when nothing else is changing. The API server is being starved by application containers.
+- **Scheduling decisions that take noticeably longer than they used to** (new pods sitting in `Pending` for tens of seconds before being placed).
+- **Pod evictions on the master node**: the kubelet evicting application pods because the node itself is under memory or disk pressure. These show up on the cluster's events feed and in the timeline.
+- **etcd warnings in the master's logs** about slow writes (`took too long`, `apply request took too long`). etcd is the most latency-sensitive part of the control plane.
+
+If you're hitting any of those on a single-master cluster, add one dedicated worker before chasing other tuning options; it's usually the fastest path back to a healthy cluster. The procedure is the same as [adding any node](#adding-nodes-to-an-application); pick *Worker* in step 5.
+
+
+Adding the first dedicated worker is a different decision from going to a full HA topology (three masters + workers). A single-master + single-worker cluster is not HA (losing the master still takes the cluster down), but it does take the workload pressure off the master and is often enough for small production apps. Move to three masters when you also need the cluster to survive a master node failure.
+
+
 ## Adding nodes to an application

 To add nodes to an existing application: