From c1a292d9daaaceb282a4553c35c974f9146091b8 Mon Sep 17 00:00:00 2001 From: Stu Doherty Date: Mon, 27 Apr 2026 09:24:21 -0400 Subject: [PATCH 01/14] docs: add postgres-ha-cluster.rst skeleton --- .../scale/postgres-ha-cluster.rst | 43 +++++++++++++++++++ 1 file changed, 43 insertions(+) create mode 100644 source/administration-guide/scale/postgres-ha-cluster.rst diff --git a/source/administration-guide/scale/postgres-ha-cluster.rst b/source/administration-guide/scale/postgres-ha-cluster.rst new file mode 100644 index 00000000000..168e6502236 --- /dev/null +++ b/source/administration-guide/scale/postgres-ha-cluster.rst @@ -0,0 +1,43 @@ +PostgreSQL high availability cluster +===================================== + +:nosearch: + +This guide describes how to deploy a high availability PostgreSQL cluster for +Mattermost using `repmgr `__ for replication management +and automatic failover, `HAProxy `__ for connection +routing, and `Keepalived `__ for Virtual IP (VIP) +management. + +This is infrastructure-level HA that operates independently of your Mattermost +edition. It is compatible with any self-hosted Mattermost deployment. + +.. note:: + + This guide has been validated on: **Ubuntu 24.04 LTS**, **PostgreSQL 17**, + **repmgr 5.5**, **HAProxy 2.8**, **Keepalived**. + +Architecture overview +--------------------- + +[stub] + +Before you begin +---------------- + +[stub] + +Setup guide +----------- + +[stub] + +Day-2 operations +---------------- + +[stub] + +Troubleshooting +--------------- + +[stub] From c11a8fce4279c8f1a0a1baa659fa25de5882cc99 Mon Sep 17 00:00:00 2001 From: Stu Doherty Date: Mon, 27 Apr 2026 09:25:29 -0400 Subject: [PATCH 02/14] =?UTF-8?q?docs:=20postgres-ha-cluster=20=E2=80=94?= =?UTF-8?q?=20architecture=20overview=20section?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../scale/postgres-ha-cluster.rst | 69 ++++++++++++++++++- 1 file changed, 68 insertions(+), 1 deletion(-) diff --git a/source/administration-guide/scale/postgres-ha-cluster.rst b/source/administration-guide/scale/postgres-ha-cluster.rst index 168e6502236..19c19a93189 100644 --- a/source/administration-guide/scale/postgres-ha-cluster.rst +++ b/source/administration-guide/scale/postgres-ha-cluster.rst @@ -20,7 +20,74 @@ edition. It is compatible with any self-hosted Mattermost deployment. Architecture overview --------------------- -[stub] +A PostgreSQL HA cluster for Mattermost consists of three nodes running in +parallel. Each node runs the full stack: PostgreSQL, repmgr daemon (repmgrd), +HAProxy, Keepalived, and a health-check service. A Virtual IP (VIP) floats +across nodes and always points to the current primary. + +.. code-block:: text + + VIP: + │ + ┌───────────────┼───────────────┐ + │ │ │ + ┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐ + │ pg1 │ │ pg2 │ │ pg3 │ + │ │ │ │ │ │ + │ HAProxy │ │ HAProxy │ │ HAProxy │ + │ Keepalived │ │ Keepalived │ │ Keepalived │ + │ pgchk.py │ │ pgchk.py │ │ pgchk.py │ + │ repmgrd │ │ repmgrd │ │ repmgrd │ + ├─────────────┤ ├─────────────┤ ├─────────────┤ + │ PostgreSQL │ │ PostgreSQL │ │ PostgreSQL │ + │ PRIMARY │ │ STANDBY │ │ STANDBY │ + └─────────────┘ └─────────────┘ └─────────────┘ + +**Components:** + +.. list-table:: + :widths: 20 10 70 + :header-rows: 1 + + * - Component + - Version + - Role + * - PostgreSQL + - 17 + - Primary database engine. Streaming replication with replication slots. + * - repmgr / repmgrd + - 5.5 + - Replication manager. Monitors cluster health and automatically promotes + a standby when the primary fails. + * - HAProxy + - 2.8 + - TCP load balancer. Routes write traffic to the primary and read traffic + to standbys via two ports. + * - Keepalived + - — + - Manages the VIP using VRRP. Moves the VIP to the new primary after + failover. + * - pgchk.py + - — + - HTTP health-check endpoint (port 8008). HAProxy queries this to + determine which node is the current primary. + +**HAProxy ports:** + +.. list-table:: + :widths: 15 85 + :header-rows: 1 + + * - Port + - Purpose + * - 5000 + - Write traffic — routes to the current primary only + * - 5001 + - Read traffic — load-balanced across all standbys + +**Sizing:** This architecture is appropriate for Mattermost deployments up to +approximately 2,000 concurrent users. For larger deployments, see +:doc:`Scaling for Enterprise `. Before you begin ---------------- From cbb742ad1122a281df54343ba9200c9b36deb135 Mon Sep 17 00:00:00 2001 From: Stu Doherty Date: Mon, 27 Apr 2026 09:26:42 -0400 Subject: [PATCH 03/14] =?UTF-8?q?docs:=20postgres-ha-cluster=20=E2=80=94?= =?UTF-8?q?=20before=20you=20begin=20/=20decision=20guidance=20section?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../scale/postgres-ha-cluster.rst | 109 +++++++++++++++++- 1 file changed, 108 insertions(+), 1 deletion(-) diff --git a/source/administration-guide/scale/postgres-ha-cluster.rst b/source/administration-guide/scale/postgres-ha-cluster.rst index 19c19a93189..a004691f1e2 100644 --- a/source/administration-guide/scale/postgres-ha-cluster.rst +++ b/source/administration-guide/scale/postgres-ha-cluster.rst @@ -92,7 +92,114 @@ approximately 2,000 concurrent users. For larger deployments, see Before you begin ---------------- -[stub] +Is this the right architecture for you? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. list-table:: + :widths: 30 35 35 + :header-rows: 1 + + * - Scenario + - Recommendation + - Why + * - Cloud-hosted on AWS/GCP/Azure + - Use managed RDS/Cloud SQL with Multi-AZ + - Managed failover, no infrastructure to operate + * - On-premises or private cloud, single site + - **This guide** — single-DC HA cluster + - Automatic failover within the datacenter, no cloud dependency + * - On-premises, two or more sites, DR required + - Single-DC HA (this guide) + Multi-DC DR guide (coming soon) + - Active/warm-standby across datacenters + +Requirements +~~~~~~~~~~~~ + +**Hardware (per node — minimum):** + +- Operating system: Ubuntu 24.04 LTS +- CPU: 2 cores +- RAM: 4 GB +- Disk: 50 GB + +**You need 3 nodes** and one spare IP address on the same subnet for the VIP. + +**Network — ports that must be open between all three nodes:** + +.. list-table:: + :widths: 15 85 + :header-rows: 1 + + * - Port + - Purpose + * - 22 + - SSH (administration) + * - 5432 + - PostgreSQL (replication, repmgr) + * - 8008 + - pgchk.py health check (HAProxy → database nodes) + * - VRRP (112) + - Keepalived VIP election between nodes + +**Ports that Mattermost application servers must reach:** + +.. list-table:: + :widths: 15 85 + :header-rows: 1 + + * - Port + - Purpose + * - 5000 + - Write connections (primary) + * - 5001 + - Read connections (standbys) + +**Software:** The following packages will be installed during setup. No +pre-installation is required. + +- ``postgresql-17`` +- ``postgresql-17-repmgr`` +- ``haproxy`` +- ``keepalived`` +- ``python3`` (for pgchk.py) + +Node planning worksheet +~~~~~~~~~~~~~~~~~~~~~~~ + +Complete this before starting. You will substitute these values throughout +the guide. + +.. list-table:: + :widths: 15 25 25 35 + :header-rows: 1 + + * - Node + - Hostname + - IP address + - Initial role + * - 1 + - pg1 + - _______________ + - Primary + * - 2 + - pg2 + - _______________ + - Standby + * - 3 + - pg3 + - _______________ + - Standby + * - VIP + - — + - _______________ + - Floating (always points to primary) + +**Subnet:** ``_______________`` (e.g. ``10.0.1.0``) + +Time estimate +~~~~~~~~~~~~~ + +Allow **2–3 hours** for a first-time setup on pre-provisioned servers. Setup guide ----------- From 3caf951d7c1fefd4c0ce98b3373fc93bc6365698 Mon Sep 17 00:00:00 2001 From: Stu Doherty Date: Mon, 27 Apr 2026 09:27:37 -0400 Subject: [PATCH 04/14] =?UTF-8?q?docs:=20postgres-ha-cluster=20=E2=80=94?= =?UTF-8?q?=20setup=20Phase=201=20(base=20install)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Sonnet 4.6 --- .../scale/postgres-ha-cluster.rst | 59 ++++++++++++++++++- 1 file changed, 58 insertions(+), 1 deletion(-) diff --git a/source/administration-guide/scale/postgres-ha-cluster.rst b/source/administration-guide/scale/postgres-ha-cluster.rst index a004691f1e2..ffdc8d9252d 100644 --- a/source/administration-guide/scale/postgres-ha-cluster.rst +++ b/source/administration-guide/scale/postgres-ha-cluster.rst @@ -204,7 +204,64 @@ Allow **2–3 hours** for a first-time setup on pre-provisioned servers. Setup guide ----------- -[stub] +.. note:: + + Throughout this guide, substitute the IP addresses and subnet you recorded + in the node planning worksheet above. + +.. warning:: + + Complete each phase in order. The checkpoint at the end of each phase must + pass before you proceed. + +Phase 1: Base installation (all nodes) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Run all steps in Phase 1 on **pg1, pg2, and pg3**. + +**Step 1.1 — Configure /etc/hosts** + +On each node, append to ``/etc/hosts``: + +.. code-block:: text + + pg1 + pg2 + pg3 + +Verify hostname resolution on each node: + +.. code-block:: bash + + ping -c 1 pg1 && ping -c 1 pg2 && ping -c 1 pg3 + +Expected: 3 successful pings. + +**Step 1.2 — Install PostgreSQL 17 and repmgr 5.5** + +.. code-block:: bash + + sudo apt update + sudo apt install -y curl ca-certificates + sudo install -d /usr/share/postgresql-common/pgdg + sudo curl -o /usr/share/postgresql-common/pgdg/apt.postgresql.org.asc \ + --fail https://www.postgresql.org/media/keys/ACCC4CF8.asc + sudo sh -c 'echo "deb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.asc] \ + https://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" \ + > /etc/apt/sources.list.d/pgdg.list' + sudo apt update + sudo apt install -y postgresql-17 postgresql-17-repmgr + +✅ **Phase 1 checkpoint** — run on every node: + +.. code-block:: bash + + sudo systemctl status postgresql | grep "active (running)" + /usr/lib/postgresql/17/bin/repmgr --version + +**Pass:** PostgreSQL shows ``active (running)``; repmgr prints ``repmgr 5.5.x``. + +**Fail:** If PostgreSQL did not start, check ``journalctl -u postgresql`` for errors. Day-2 operations ---------------- From ec40b00d403c2364ff71dd2401ed55f6448fc7e9 Mon Sep 17 00:00:00 2001 From: Stu Doherty Date: Mon, 27 Apr 2026 09:28:26 -0400 Subject: [PATCH 05/14] =?UTF-8?q?docs:=20postgres-ha-cluster=20=E2=80=94?= =?UTF-8?q?=20setup=20Phase=202=20(PostgreSQL=20config)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../scale/postgres-ha-cluster.rst | 58 +++++++++++++++++++ 1 file changed, 58 insertions(+) diff --git a/source/administration-guide/scale/postgres-ha-cluster.rst b/source/administration-guide/scale/postgres-ha-cluster.rst index ffdc8d9252d..a921c1c2c7c 100644 --- a/source/administration-guide/scale/postgres-ha-cluster.rst +++ b/source/administration-guide/scale/postgres-ha-cluster.rst @@ -263,6 +263,64 @@ Expected: 3 successful pings. **Fail:** If PostgreSQL did not start, check ``journalctl -u postgresql`` for errors. +Phase 2: PostgreSQL configuration (all nodes) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Run all steps in Phase 2 on **pg1, pg2, and pg3**. + +**Step 2.1 — Configure postgresql.conf** + +Append to ``/etc/postgresql/17/main/postgresql.conf``: + +.. code-block:: ini + + # Replication settings + listen_addresses = '*' + max_wal_senders = 10 + max_replication_slots = 10 + wal_level = replica + hot_standby = on + archive_mode = on + archive_command = '/bin/true' + shared_preload_libraries = 'repmgr' + wal_log_hints = on + wal_keep_size = 1024 + +**Step 2.2 — Configure pg_hba.conf** + +Append to ``/etc/postgresql/17/main/pg_hba.conf``: + +.. code-block:: text + + # repmgr access + host repmgr repmgr /24 trust + host repmgr repmgr 127.0.0.1/32 trust + # Replication connections + host replication repmgr /24 trust + host replication repmgr 127.0.0.1/32 trust + +.. note:: + + For production, replace ``trust`` with ``scram-sha-256`` and configure + ``.pgpass`` files on each node. + +**Step 2.3 — Restart PostgreSQL** + +.. code-block:: bash + + sudo systemctl restart postgresql + +✅ **Phase 2 checkpoint** — run on every node: + +.. code-block:: bash + + sudo -u postgres psql -c "SHOW wal_level;" + sudo -u postgres psql -c "SHOW shared_preload_libraries;" + +**Pass:** ``wal_level`` is ``replica``; ``shared_preload_libraries`` contains ``repmgr``. + +**Fail:** If PostgreSQL did not restart, check ``journalctl -u postgresql``. + Day-2 operations ---------------- From 755c7710510fd7b48a3111ddb28a1284fce10ef8 Mon Sep 17 00:00:00 2001 From: Stu Doherty Date: Mon, 27 Apr 2026 09:29:30 -0400 Subject: [PATCH 06/14] =?UTF-8?q?docs:=20postgres-ha-cluster=20=E2=80=94?= =?UTF-8?q?=20setup=20Phase=203=20(repmgr=20+=20standby=20clone)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../scale/postgres-ha-cluster.rst | 104 ++++++++++++++++++ 1 file changed, 104 insertions(+) diff --git a/source/administration-guide/scale/postgres-ha-cluster.rst b/source/administration-guide/scale/postgres-ha-cluster.rst index a921c1c2c7c..1128544ce9c 100644 --- a/source/administration-guide/scale/postgres-ha-cluster.rst +++ b/source/administration-guide/scale/postgres-ha-cluster.rst @@ -321,6 +321,110 @@ Append to ``/etc/postgresql/17/main/pg_hba.conf``: **Fail:** If PostgreSQL did not restart, check ``journalctl -u postgresql``. +Phase 3: repmgr configuration and cluster initialisation +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +**Step 3.1 — Create repmgr user and database (pg1 only)** + +.. code-block:: bash + + sudo -u postgres createuser --superuser repmgr + sudo -u postgres createdb --owner=repmgr repmgr + sudo -u postgres psql -c "ALTER USER repmgr SET search_path TO repmgr, public;" + +**Step 3.2 — Create /etc/repmgr.conf (all nodes)** + +Create ``/etc/repmgr.conf`` on each node. Adjust ``node_id``, ``node_name``, +and ``host`` for each node: + +**pg1:** + +.. code-block:: ini + + node_id=1 + node_name='pg1' + conninfo='host= user=repmgr dbname=repmgr connect_timeout=2' + data_directory='/var/lib/postgresql/17/main' + use_replication_slots=yes + monitoring_history=yes + log_level=INFO + pg_bindir='/usr/lib/postgresql/17/bin' + service_start_command='sudo /usr/bin/pg_ctlcluster 17 main start' + service_stop_command='sudo /usr/bin/pg_ctlcluster 17 main stop' + service_restart_command='sudo /usr/bin/pg_ctlcluster 17 main restart' + service_reload_command='sudo /usr/bin/pg_ctlcluster 17 main reload' + service_promote_command='sudo /usr/bin/pg_ctlcluster 17 main promote' + failover=automatic + promote_command='repmgr standby promote -f /etc/repmgr.conf --log-to-file' + follow_command='repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n' + reconnect_attempts=3 + reconnect_interval=5 + monitor_interval_secs=2 + +**pg2:** Same as above with ``node_id=2``, ``node_name='pg2'``, ``host=``. + +**pg3:** Same as above with ``node_id=3``, ``node_name='pg3'``, ``host=``. + +**Step 3.3 — Register primary (pg1 only)** + +.. code-block:: bash + + sudo -u postgres repmgr -f /etc/repmgr.conf primary register + +**Step 3.4 — Clone standbys (pg2 and pg3)** + +Run on **pg2**, then **pg3**: + +.. code-block:: bash + + sudo systemctl stop postgresql + sudo -u postgres repmgr -h -U repmgr -d repmgr \ + -f /etc/repmgr.conf standby clone --delete-existing-pgdata + sudo systemctl start postgresql + sudo -u postgres repmgr -f /etc/repmgr.conf standby register + +**Step 3.5 — Start repmgrd (all nodes)** + +Create ``/etc/systemd/system/repmgrd.service``: + +.. code-block:: ini + + [Unit] + Description=repmgr daemon + After=postgresql.service + Requires=postgresql.service + + [Service] + User=postgres + ExecStart=/usr/lib/postgresql/17/bin/repmgrd -f /etc/repmgr.conf --no-daemonize + Restart=on-failure + + [Install] + WantedBy=multi-user.target + +.. code-block:: bash + + sudo systemctl daemon-reload + sudo systemctl enable repmgrd + sudo systemctl start repmgrd + +✅ **Phase 3 checkpoint** — run on any node: + +.. code-block:: bash + + sudo -u postgres repmgr -f /etc/repmgr.conf cluster show + +**Pass:** Output shows all three nodes — pg1 as ``* running`` (primary), pg2 and +pg3 as ``running`` (standby). On pg1, the following query returns 2 rows: + +.. code-block:: bash + + sudo -u postgres psql -c "SELECT client_addr, state FROM pg_stat_replication;" + +**Fail:** A standby showing ``! running`` means replication did not establish. +Check ``journalctl -u postgresql`` on the failed standby. Common cause: firewall +blocking port 5432 between nodes. + Day-2 operations ---------------- From 30a6955f99979c537b0e9f6540f0745356f3bc49 Mon Sep 17 00:00:00 2001 From: Stu Doherty Date: Mon, 27 Apr 2026 09:30:45 -0400 Subject: [PATCH 07/14] =?UTF-8?q?docs:=20postgres-ha-cluster=20=E2=80=94?= =?UTF-8?q?=20setup=20Phase=204=20(HAProxy,=20pgchk,=20Keepalived)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../scale/postgres-ha-cluster.rst | 136 ++++++++++++++++++ 1 file changed, 136 insertions(+) diff --git a/source/administration-guide/scale/postgres-ha-cluster.rst b/source/administration-guide/scale/postgres-ha-cluster.rst index 1128544ce9c..8b54647fea8 100644 --- a/source/administration-guide/scale/postgres-ha-cluster.rst +++ b/source/administration-guide/scale/postgres-ha-cluster.rst @@ -425,6 +425,142 @@ pg3 as ``running`` (standby). On pg1, the following query returns 2 rows: Check ``journalctl -u postgresql`` on the failed standby. Common cause: firewall blocking port 5432 between nodes. +Phase 4: HAProxy, health check, and VIP (all nodes) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Run all steps in Phase 4 on **pg1, pg2, and pg3**. + +**Step 4.1 — Install HAProxy** + +.. code-block:: bash + + sudo apt install -y haproxy + +**Step 4.2 — Configure HAProxy** + +Replace ``/etc/haproxy/haproxy.cfg``: + +.. code-block:: text + + global + log /dev/log local0 + maxconn 4000 + + defaults + log global + mode tcp + timeout connect 5s + timeout client 30s + timeout server 30s + + frontend pg_write + bind *:5000 + default_backend pg_primary + + frontend pg_read + bind *:5001 + default_backend pg_replicas + + backend pg_primary + option tcp-check + server pg1 :5432 check port 8008 + server pg2 :5432 check port 8008 backup + server pg3 :5432 check port 8008 backup + + backend pg_replicas + balance roundrobin + option tcp-check + server pg2 :5432 check port 8008 + server pg3 :5432 check port 8008 + server pg1 :5432 check port 8008 backup + +**Step 4.3 — Deploy pgchk.py** + +``pgchk.py`` is a lightweight HTTP server that returns ``200 OK`` when the local +node is the primary and ``503`` otherwise. HAProxy queries port 8008 on each +node to determine where to route connections. + +Copy ``pgchk.py`` from the +`ha-postgres-reprmgr-haproxy repository `__ +to ``/usr/local/bin/pgchk.py`` on each node and make it executable: + +.. code-block:: bash + + sudo chmod +x /usr/local/bin/pgchk.py + +Create ``/etc/systemd/system/pgchk.service``: + +.. code-block:: ini + + [Unit] + Description=PostgreSQL Health Check for HAProxy + After=postgresql.service + + [Service] + ExecStart=/usr/bin/python3 /usr/local/bin/pgchk.py --port 8008 + Restart=always + + [Install] + WantedBy=multi-user.target + +.. code-block:: bash + + sudo systemctl daemon-reload + sudo systemctl enable pgchk + sudo systemctl start pgchk + sudo systemctl enable haproxy + sudo systemctl start haproxy + +**Step 4.4 — Install and configure Keepalived** + +.. code-block:: bash + + sudo apt install -y keepalived + +Create ``/etc/keepalived/keepalived.conf``. Set the ``priority`` field: pg1 gets +``101``, pg2 gets ``100``, pg3 gets ``99``. Set ``virtual_ipaddress`` to your VIP: + +.. code-block:: text + + vrrp_instance VI_1 { + state BACKUP + interface eth0 + virtual_router_id 51 + priority 101 + advert_int 1 + nopreempt + virtual_ipaddress { + /24 + } + } + +.. code-block:: bash + + sudo systemctl enable keepalived + sudo systemctl start keepalived + +✅ **Phase 4 checkpoint** — run on any node: + +.. code-block:: bash + + # VIP should be active on the primary node (pg1) + ip addr show | grep + + # Port 5000 should connect to primary + psql -h -p 5000 -U repmgr -d repmgr \ + -c "SELECT inet_server_addr(), pg_is_in_recovery();" + + # Port 5001 should connect to a standby + psql -h -p 5001 -U repmgr -d repmgr \ + -c "SELECT inet_server_addr(), pg_is_in_recovery();" + +**Pass:** VIP visible on pg1. Port 5000 returns ``pg_is_in_recovery = f`` (primary). +Port 5001 returns ``pg_is_in_recovery = t`` (standby). + +**Fail:** If the VIP is not on pg1, check ``journalctl -u keepalived``. If HAProxy +is not routing correctly, check ``journalctl -u haproxy`` and verify pgchk.py +is responding: ``curl http://:8008`` should return HTTP 200. + Day-2 operations ---------------- From 63688bbaf4c03b7246a760e9f09375256becd9f6 Mon Sep 17 00:00:00 2001 From: Stu Doherty Date: Mon, 27 Apr 2026 09:31:40 -0400 Subject: [PATCH 08/14] =?UTF-8?q?docs:=20postgres-ha-cluster=20=E2=80=94?= =?UTF-8?q?=20setup=20Phase=205=20(end-to-end=20validation)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../scale/postgres-ha-cluster.rst | 64 +++++++++++++++++++ 1 file changed, 64 insertions(+) diff --git a/source/administration-guide/scale/postgres-ha-cluster.rst b/source/administration-guide/scale/postgres-ha-cluster.rst index 8b54647fea8..dcbeac209d1 100644 --- a/source/administration-guide/scale/postgres-ha-cluster.rst +++ b/source/administration-guide/scale/postgres-ha-cluster.rst @@ -561,6 +561,70 @@ Port 5001 returns ``pg_is_in_recovery = t`` (standby). is not routing correctly, check ``journalctl -u haproxy`` and verify pgchk.py is responding: ``curl http://:8008`` should return HTTP 200. +Phase 5: End-to-end validation +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Run this phase after all four previous phases pass on all nodes. This confirms +the cluster behaves correctly under failure before you connect Mattermost. + +**Step 5.1 — Confirm healthy starting state** + +.. code-block:: bash + + sudo -u postgres repmgr -f /etc/repmgr.conf cluster show + +**Pass:** pg1 is ``* running`` (primary); pg2 and pg3 are ``running`` (standby). + +**Step 5.2 — Simulate primary failure** + +On **pg1**: + +.. code-block:: bash + + sudo systemctl stop postgresql + +Wait 30 seconds, then on **pg2** or **pg3**: + +.. code-block:: bash + + sudo -u postgres repmgr -f /etc/repmgr.conf cluster show + +**Pass:** One of pg2 or pg3 is now ``* running`` (primary). pg1 shows as ``! running`` +(unreachable — expected). + +**Step 5.3 — Verify HAProxy and VIP followed the new primary** + +.. code-block:: bash + + psql -h -p 5000 -U repmgr -d repmgr \ + -c "SELECT inet_server_addr(), pg_is_in_recovery();" + +**Pass:** Returns the IP of the newly promoted node with ``pg_is_in_recovery = f``. + +**Step 5.4 — Recover the old primary as a standby** + +On **pg1**: + +.. code-block:: bash + + sudo systemctl start postgresql + sudo -u postgres repmgr -f /etc/repmgr.conf node rejoin \ + --force-rewind --config-files=postgresql.conf,pg_hba.conf + +Then on any node: + +.. code-block:: bash + + sudo -u postgres repmgr -f /etc/repmgr.conf cluster show + +**Pass:** All three nodes show ``running``; pg1 is now a standby. + +.. note:: + + Your cluster is ready for production. Connect Mattermost using the VIP + address and port 5000 as the primary datasource. Optionally configure + port 5001 as a read replica in ``config.json``. + Day-2 operations ---------------- From 738ceefbb01ca55e606d14cb1c525bc6d4637a08 Mon Sep 17 00:00:00 2001 From: Stu Doherty Date: Mon, 27 Apr 2026 09:32:34 -0400 Subject: [PATCH 09/14] =?UTF-8?q?docs:=20postgres-ha-cluster=20=E2=80=94?= =?UTF-8?q?=20day-2=20operations=20section?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../scale/postgres-ha-cluster.rst | 72 ++++++++++++++++++- 1 file changed, 71 insertions(+), 1 deletion(-) diff --git a/source/administration-guide/scale/postgres-ha-cluster.rst b/source/administration-guide/scale/postgres-ha-cluster.rst index dcbeac209d1..0072fd6fe11 100644 --- a/source/administration-guide/scale/postgres-ha-cluster.rst +++ b/source/administration-guide/scale/postgres-ha-cluster.rst @@ -628,7 +628,77 @@ Then on any node: Day-2 operations ---------------- -[stub] +Check cluster status +~~~~~~~~~~~~~~~~~~~~ + +Run on any node: + +.. code-block:: bash + + sudo -u postgres repmgr -f /etc/repmgr.conf cluster show + +Expected healthy output shows one ``* running`` primary and two ``running`` standbys. + +Check replication lag +~~~~~~~~~~~~~~~~~~~~~ + +Run on the primary: + +.. code-block:: bash + + sudo -u postgres psql -c " + SELECT client_addr, application_name, state, + pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS lag + FROM pg_stat_replication;" + +Normal lag is under 1 MB during steady state. Lag growing continuously +indicates a replication problem — check network connectivity and standby +PostgreSQL logs. + +Controlled switchover (planned maintenance) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To move the primary role to a standby with zero data loss: + +.. code-block:: bash + + # Run on the TARGET standby (e.g. pg2) + sudo -u postgres repmgr -f /etc/repmgr.conf standby switchover + +repmgr will demote the old primary and promote this node. The VIP and HAProxy +will follow automatically. + +Add a standby node +~~~~~~~~~~~~~~~~~~ + +1. Provision a new server and complete Phases 1–2 of the setup guide. +2. Create ``/etc/repmgr.conf`` with the next available ``node_id``. +3. On the new node: + + .. code-block:: bash + + sudo systemctl stop postgresql + sudo -u postgres repmgr -h -U repmgr -d repmgr \ + -f /etc/repmgr.conf standby clone --delete-existing-pgdata + sudo systemctl start postgresql + sudo -u postgres repmgr -f /etc/repmgr.conf standby register + +4. Add the new node to ``/etc/haproxy/haproxy.cfg`` on all existing nodes and + reload HAProxy: ``sudo systemctl reload haproxy``. + +Rejoin a failed node +~~~~~~~~~~~~~~~~~~~~ + +After recovering a failed standby: + +.. code-block:: bash + + sudo -u postgres repmgr -f /etc/repmgr.conf node rejoin \ + --force-rewind --config-files=postgresql.conf,pg_hba.conf + +After rejoining a failed primary (after automatic failover has already promoted +a new primary), run the same command on the old primary to re-register it as a +standby. Troubleshooting --------------- From 9a95ced6ae461aec94ae2478785f2721a3776b9c Mon Sep 17 00:00:00 2001 From: Stu Doherty Date: Mon, 27 Apr 2026 09:33:36 -0400 Subject: [PATCH 10/14] =?UTF-8?q?docs:=20postgres-ha-cluster=20=E2=80=94?= =?UTF-8?q?=20troubleshooting=20section?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../scale/postgres-ha-cluster.rst | 90 ++++++++++++++++++- 1 file changed, 89 insertions(+), 1 deletion(-) diff --git a/source/administration-guide/scale/postgres-ha-cluster.rst b/source/administration-guide/scale/postgres-ha-cluster.rst index 0072fd6fe11..12abd582389 100644 --- a/source/administration-guide/scale/postgres-ha-cluster.rst +++ b/source/administration-guide/scale/postgres-ha-cluster.rst @@ -703,4 +703,92 @@ standby. Troubleshooting --------------- -[stub] +repmgrd is not starting +~~~~~~~~~~~~~~~~~~~~~~~~ + +**Symptom:** ``systemctl status repmgrd`` shows ``failed`` or ``activating``. + +**Likely cause:** PostgreSQL has not fully started yet, or the repmgr database +is not accessible. + +**Resolution:** + +.. code-block:: bash + + # Verify PostgreSQL is running first + sudo systemctl status postgresql + + # Check repmgrd logs + journalctl -u repmgrd -n 50 + + # Test repmgr connection manually + sudo -u postgres repmgr -f /etc/repmgr.conf cluster show + +Standby not replicating +~~~~~~~~~~~~~~~~~~~~~~~~ + +**Symptom:** ``repmgr cluster show`` shows a standby as ``! running``, or +``pg_stat_replication`` on the primary shows fewer than expected rows. + +**Likely cause:** Network connectivity issue on port 5432, or ``pg_hba.conf`` +not permitting the replication connection. + +**Resolution:** + +.. code-block:: bash + + # From the standby, test connectivity to the primary + pg_isready -h -p 5432 -U repmgr + + # Check PostgreSQL logs on the standby + sudo -u postgres tail -50 /var/log/postgresql/postgresql-17-main.log + +VIP not moving after failover +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +**Symptom:** After a primary failure and successful repmgr promotion, the VIP +remains on the failed node or does not appear on the new primary. + +**Likely cause:** Keepalived is not running, or VRRP traffic is blocked by a +firewall. + +**Resolution:** + +.. code-block:: bash + + sudo systemctl status keepalived + journalctl -u keepalived -n 50 + + # Verify VRRP traffic is not blocked — check cloud security groups or + # iptables rules for protocol 112 (VRRP) + +HAProxy routing to wrong node +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +**Symptom:** Connections on port 5000 land on a standby (writes fail), or +port 5001 routes to the primary. + +**Likely cause:** pgchk.py is not running or returning incorrect status. + +**Resolution:** + +.. code-block:: bash + + # Check health check response on each node + curl -v http://:8008 + + # Primary should return HTTP 200; standbys should return HTTP 503 + sudo systemctl status pgchk + journalctl -u pgchk -n 30 + +Split-brain prevention +~~~~~~~~~~~~~~~~~~~~~~ + +repmgr's ``failover=automatic`` setting and ``reconnect_attempts=3`` with +``reconnect_interval=5`` provide a brief delay before promoting a standby. +This prevents promotion during transient network blips. + +If you suspect a split-brain scenario (two nodes both believing they are +primary), **do not write to either node**. Check cluster status from a +third node and use ``repmgr node service --action=stop`` to fence the +unintended primary before recovering. From aa66119d84b46aba6027c53566b1c4017e3e7886 Mon Sep 17 00:00:00 2001 From: Stu Doherty Date: Mon, 27 Apr 2026 09:34:18 -0400 Subject: [PATCH 11/14] docs: add postgres-ha-cluster to scaling-for-enterprise toctree --- .../scale/scaling-for-enterprise.rst | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/source/administration-guide/scale/scaling-for-enterprise.rst b/source/administration-guide/scale/scaling-for-enterprise.rst index 18779aee9e2..5e4c0521dde 100644 --- a/source/administration-guide/scale/scaling-for-enterprise.rst +++ b/source/administration-guide/scale/scaling-for-enterprise.rst @@ -29,6 +29,14 @@ High availability A :doc:`high availability cluster-based deployment ` enables a Mattermost system to maintain service during outages and hardware failures through the use of redundant infrastructure. +PostgreSQL high availability cluster +-------------------------------------- + +For self-hosted deployments on bare-metal or VMs, a +:doc:`PostgreSQL HA cluster ` +provides automatic database failover using repmgr, HAProxy, and Keepalived — +without requiring a managed database service. + Redis ----- @@ -45,7 +53,8 @@ Available reference architectures Deployment architecture at scale Backing storage benchmarks Enterprise search - High availability + High availability + PostgreSQL HA cluster Redis Scale up to 200 users Scale up to 2000 users From 1e7d0606246ba725bc4e1c68643ce34ec8ba714b Mon Sep 17 00:00:00 2001 From: Stu Doherty Date: Mon, 27 Apr 2026 09:34:24 -0400 Subject: [PATCH 12/14] docs: cross-link to postgres-ha-cluster from HA cluster deployment page --- .../scale/high-availability-cluster-based-deployment.rst | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/source/administration-guide/scale/high-availability-cluster-based-deployment.rst b/source/administration-guide/scale/high-availability-cluster-based-deployment.rst index cfeb5b263c1..e0a32dd95f0 100644 --- a/source/administration-guide/scale/high-availability-cluster-based-deployment.rst +++ b/source/administration-guide/scale/high-availability-cluster-based-deployment.rst @@ -30,7 +30,9 @@ Moreover, search replicas are also supported to handle search queries. Deployment guide ---------------- -Set up and maintain a high availability cluster-based deployment on your Mattermost servers. This document doesn't cover the configuration of databases in terms of disaster recovery, however, you can refer to the `frequently asked questions (FAQ)`_ section for our recommendations. +Set up and maintain a high availability cluster-based deployment on your Mattermost servers. This document doesn't cover the configuration of databases in terms of +disaster recovery. For self-hosted deployments requiring database-level HA, +see :doc:`PostgreSQL high availability cluster `. To ensure your instance and configuration are compatible with a high availability cluster-based deployment, please review the `configuration and compatibility`_ section. From 7a6ae8cc294ede8ea0558c6aad5903c1a6020242 Mon Sep 17 00:00:00 2001 From: Stu Doherty Date: Mon, 27 Apr 2026 09:51:28 -0400 Subject: [PATCH 13/14] docs: address CodeRabbit review feedback - pg_hba.conf: default to scram-sha-256; trust moved to lab-only warning - pgchk.py: embed full script inline instead of linking to external repo - Keepalived: add ip link step to identify interface before hardcoding - repmgr createuser: add note explaining why superuser is required - high-availability-cluster-based-deployment.rst: clarify HA vs DR scope Co-Authored-By: Claude Sonnet 4.6 --- ...-availability-cluster-based-deployment.rst | 4 +- .../scale/postgres-ha-cluster.rst | 134 ++++++++++++++++-- 2 files changed, 123 insertions(+), 15 deletions(-) diff --git a/source/administration-guide/scale/high-availability-cluster-based-deployment.rst b/source/administration-guide/scale/high-availability-cluster-based-deployment.rst index e0a32dd95f0..6ca31b96911 100644 --- a/source/administration-guide/scale/high-availability-cluster-based-deployment.rst +++ b/source/administration-guide/scale/high-availability-cluster-based-deployment.rst @@ -30,8 +30,8 @@ Moreover, search replicas are also supported to handle search queries. Deployment guide ---------------- -Set up and maintain a high availability cluster-based deployment on your Mattermost servers. This document doesn't cover the configuration of databases in terms of -disaster recovery. For self-hosted deployments requiring database-level HA, +Set up and maintain a high availability cluster-based deployment on your Mattermost servers. Database-level HA and disaster recovery design are out of scope for this document. +For self-hosted deployments requiring database-level HA, see :doc:`PostgreSQL high availability cluster `. To ensure your instance and configuration are compatible with a high availability cluster-based deployment, please review the `configuration and compatibility`_ section. diff --git a/source/administration-guide/scale/postgres-ha-cluster.rst b/source/administration-guide/scale/postgres-ha-cluster.rst index 12abd582389..f215cf668ce 100644 --- a/source/administration-guide/scale/postgres-ha-cluster.rst +++ b/source/administration-guide/scale/postgres-ha-cluster.rst @@ -293,16 +293,26 @@ Append to ``/etc/postgresql/17/main/pg_hba.conf``: .. code-block:: text # repmgr access - host repmgr repmgr /24 trust - host repmgr repmgr 127.0.0.1/32 trust + host repmgr repmgr /24 scram-sha-256 + host repmgr repmgr 127.0.0.1/32 scram-sha-256 # Replication connections - host replication repmgr /24 trust - host replication repmgr 127.0.0.1/32 trust + host replication repmgr /24 scram-sha-256 + host replication repmgr 127.0.0.1/32 scram-sha-256 -.. note:: +Create a ``.pgpass`` file on each node so repmgr can authenticate without +an interactive password prompt: + +.. code-block:: bash - For production, replace ``trust`` with ``scram-sha-256`` and configure - ``.pgpass`` files on each node. + echo "*:*:repmgr:repmgr:" >> ~/.pgpass + chmod 600 ~/.pgpass + +.. warning:: + + **Lab and testing only:** If you want to skip password authentication for + initial setup, you can temporarily use ``trust`` instead of + ``scram-sha-256``. Do not use ``trust`` in production — it allows + passwordless connections from any host on the subnet. **Step 2.3 — Restart PostgreSQL** @@ -326,11 +336,21 @@ Phase 3: repmgr configuration and cluster initialisation **Step 3.1 — Create repmgr user and database (pg1 only)** +.. note:: + + repmgr requires superuser privileges to perform certain cluster operations + including ``pg_rewind`` (used to resync a failed primary as a standby) and + event notifications. If your security policy prohibits superuser accounts, + refer to the `repmgr documentation on permissions + `__ for the + minimum required grants. + .. code-block:: bash sudo -u postgres createuser --superuser repmgr sudo -u postgres createdb --owner=repmgr repmgr sudo -u postgres psql -c "ALTER USER repmgr SET search_path TO repmgr, public;" + sudo -u postgres psql -c "ALTER USER repmgr PASSWORD '';" **Step 3.2 — Create /etc/repmgr.conf (all nodes)** @@ -480,9 +500,87 @@ Replace ``/etc/haproxy/haproxy.cfg``: node is the primary and ``503`` otherwise. HAProxy queries port 8008 on each node to determine where to route connections. -Copy ``pgchk.py`` from the -`ha-postgres-reprmgr-haproxy repository `__ -to ``/usr/local/bin/pgchk.py`` on each node and make it executable: +On each node, create ``/usr/local/bin/pgchk.py`` with the following content: + +.. code-block:: python + + #!/usr/bin/env python3 + import subprocess + from http.server import BaseHTTPRequestHandler, HTTPServer + import argparse + + DEFAULT_PORT = 8008 + PG_USER = "postgres" + PG_DB = "postgres" + PG_PORT = "5432" + + class PostgresHealthCheckHandler(BaseHTTPRequestHandler): + def safe_write(self, data): + try: + self.wfile.write(data) + except (BrokenPipeError, ConnectionResetError): + pass + + def check_postgres_status(self): + try: + cmd = ["psql", "-U", PG_USER, "-d", PG_DB, "-p", PG_PORT, + "-t", "-c", "SELECT pg_is_in_recovery();"] + result = subprocess.run(cmd, capture_output=True, text=True, timeout=5) + if result.returncode != 0: + return None + output = result.stdout.strip() + if output == 't': + return True # Standby + elif output == 'f': + return False # Primary + return None + except Exception: + return None + + def do_GET(self): + status = self.check_postgres_status() + if status is None: + self.send_response(503) + self.end_headers() + self.safe_write(b"PostgreSQL Unreachable\n") + return + if self.path in ('/', '/master'): + if not status: + self.send_response(200); self.end_headers() + self.safe_write(b"OK - Primary\n") + else: + self.send_response(503); self.end_headers() + self.safe_write(b"Service Unavailable - Not Primary\n") + elif self.path == '/replica': + if status: + self.send_response(200); self.end_headers() + self.safe_write(b"OK - Replica\n") + else: + self.send_response(503); self.end_headers() + self.safe_write(b"Service Unavailable - Not Replica\n") + else: + self.send_response(404); self.end_headers() + self.safe_write(b"Not Found\n") + + def log_message(self, format, *args): + pass + + def run(port=DEFAULT_PORT): + httpd = HTTPServer(('', port), PostgresHealthCheckHandler) + print(f"Starting PostgreSQL Health Check on port {port}...") + try: + httpd.serve_forever() + except KeyboardInterrupt: + pass + httpd.server_close() + + if __name__ == '__main__': + parser = argparse.ArgumentParser(description='PostgreSQL Health Check for HAProxy') + parser.add_argument('--port', type=int, default=DEFAULT_PORT) + args = parser.parse_args() + run(port=args.port) + +Make the script executable: .. code-block:: bash @@ -517,14 +615,24 @@ Create ``/etc/systemd/system/pgchk.service``: sudo apt install -y keepalived -Create ``/etc/keepalived/keepalived.conf``. Set the ``priority`` field: pg1 gets -``101``, pg2 gets ``100``, pg3 gets ``99``. Set ``virtual_ipaddress`` to your VIP: +First, identify the name of the network interface that carries your node's IP address: + +.. code-block:: bash + + ip -o link show | awk '{print $2, $9}' | grep UP + +Note the interface name (e.g. ``ens3``, ``enp3s0``, ``eth0``). You will use it in +the next step. + +Create ``/etc/keepalived/keepalived.conf``. Replace ```` with the +interface name from the previous step. Set ``priority``: pg1 gets ``101``, +pg2 gets ``100``, pg3 gets ``99``. Set ``virtual_ipaddress`` to your VIP: .. code-block:: text vrrp_instance VI_1 { state BACKUP - interface eth0 + interface virtual_router_id 51 priority 101 advert_int 1 From 1fb0f5e0dddf749a537ae9ec6bb79d36f74696bf Mon Sep 17 00:00:00 2001 From: Stu Doherty Date: Mon, 27 Apr 2026 10:55:39 -0400 Subject: [PATCH 14/14] chore: trigger CodeRabbit re-review