diff --git a/source/administration-guide/scale/high-availability-cluster-based-deployment.rst b/source/administration-guide/scale/high-availability-cluster-based-deployment.rst index 169b6be00fd..214506eaf06 100644 --- a/source/administration-guide/scale/high-availability-cluster-based-deployment.rst +++ b/source/administration-guide/scale/high-availability-cluster-based-deployment.rst @@ -30,7 +30,9 @@ Moreover, search replicas are also supported to handle search queries. Deployment guide ---------------- -Set up and maintain a high availability cluster-based deployment on your Mattermost servers. This document doesn't cover the configuration of databases in terms of disaster recovery, however, you can refer to the `frequently asked questions (FAQ)`_ section for our recommendations. +Set up and maintain a high availability cluster-based deployment on your Mattermost servers. Database-level HA and disaster recovery design are out of scope for this document. +For self-hosted deployments requiring database-level HA, +see :doc:`PostgreSQL high availability cluster `. To ensure your instance and configuration are compatible with a high availability cluster-based deployment, please review the `configuration and compatibility`_ section. diff --git a/source/administration-guide/scale/postgres-ha-cluster.rst b/source/administration-guide/scale/postgres-ha-cluster.rst new file mode 100644 index 00000000000..f215cf668ce --- /dev/null +++ b/source/administration-guide/scale/postgres-ha-cluster.rst @@ -0,0 +1,902 @@ +PostgreSQL high availability cluster +===================================== + +:nosearch: + +This guide describes how to deploy a high availability PostgreSQL cluster for +Mattermost using `repmgr `__ for replication management +and automatic failover, `HAProxy `__ for connection +routing, and `Keepalived `__ for Virtual IP (VIP) +management. + +This is infrastructure-level HA that operates independently of your Mattermost +edition. 
It is compatible with any self-hosted Mattermost deployment. + +.. note:: + + This guide has been validated on: **Ubuntu 24.04 LTS**, **PostgreSQL 17**, + **repmgr 5.5**, **HAProxy 2.8**, **Keepalived**. + +Architecture overview +--------------------- + +A PostgreSQL HA cluster for Mattermost consists of three nodes running in +parallel. Each node runs the full stack: PostgreSQL, repmgr daemon (repmgrd), +HAProxy, Keepalived, and a health-check service. A Virtual IP (VIP) floats +across nodes and always points to the current primary. + +.. code-block:: text + + VIP: + │ + ┌───────────────┼───────────────┐ + │ │ │ + ┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐ + │ pg1 │ │ pg2 │ │ pg3 │ + │ │ │ │ │ │ + │ HAProxy │ │ HAProxy │ │ HAProxy │ + │ Keepalived │ │ Keepalived │ │ Keepalived │ + │ pgchk.py │ │ pgchk.py │ │ pgchk.py │ + │ repmgrd │ │ repmgrd │ │ repmgrd │ + ├─────────────┤ ├─────────────┤ ├─────────────┤ + │ PostgreSQL │ │ PostgreSQL │ │ PostgreSQL │ + │ PRIMARY │ │ STANDBY │ │ STANDBY │ + └─────────────┘ └─────────────┘ └─────────────┘ + +**Components:** + +.. list-table:: + :widths: 20 10 70 + :header-rows: 1 + + * - Component + - Version + - Role + * - PostgreSQL + - 17 + - Primary database engine. Streaming replication with replication slots. + * - repmgr / repmgrd + - 5.5 + - Replication manager. Monitors cluster health and automatically promotes + a standby when the primary fails. + * - HAProxy + - 2.8 + - TCP load balancer. Routes write traffic to the primary and read traffic + to standbys via two ports. + * - Keepalived + - — + - Manages the VIP using VRRP. Moves the VIP to the new primary after + failover. + * - pgchk.py + - — + - HTTP health-check endpoint (port 8008). HAProxy queries this to + determine which node is the current primary. + +**HAProxy ports:** + +.. 
list-table:: + :widths: 15 85 + :header-rows: 1 + + * - Port + - Purpose + * - 5000 + - Write traffic — routes to the current primary only + * - 5001 + - Read traffic — load-balanced across all standbys + +**Sizing:** This architecture is appropriate for Mattermost deployments up to +approximately 2,000 concurrent users. For larger deployments, see +:doc:`Scaling for Enterprise `. + +Before you begin +---------------- + +Is this the right architecture for you? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. list-table:: + :widths: 30 35 35 + :header-rows: 1 + + * - Scenario + - Recommendation + - Why + * - Cloud-hosted on AWS/GCP/Azure + - Use managed RDS/Cloud SQL with Multi-AZ + - Managed failover, no infrastructure to operate + * - On-premises or private cloud, single site + - **This guide** — single-DC HA cluster + - Automatic failover within the datacenter, no cloud dependency + * - On-premises, two or more sites, DR required + - Single-DC HA (this guide) + Multi-DC DR guide (coming soon) + - Active/warm-standby across datacenters + +Requirements +~~~~~~~~~~~~ + +**Hardware (per node — minimum):** + +- Operating system: Ubuntu 24.04 LTS +- CPU: 2 cores +- RAM: 4 GB +- Disk: 50 GB + +**You need 3 nodes** and one spare IP address on the same subnet for the VIP. + +**Network — ports that must be open between all three nodes:** + +.. list-table:: + :widths: 15 85 + :header-rows: 1 + + * - Port + - Purpose + * - 22 + - SSH (administration) + * - 5432 + - PostgreSQL (replication, repmgr) + * - 8008 + - pgchk.py health check (HAProxy → database nodes) + * - VRRP (112) + - Keepalived VIP election between nodes + +**Ports that Mattermost application servers must reach:** + +.. list-table:: + :widths: 15 85 + :header-rows: 1 + + * - Port + - Purpose + * - 5000 + - Write connections (primary) + * - 5001 + - Read connections (standbys) + +**Software:** The following packages will be installed during setup. No +pre-installation is required. 
+ +- ``postgresql-17`` +- ``postgresql-17-repmgr`` +- ``haproxy`` +- ``keepalived`` +- ``python3`` (for pgchk.py) + +Node planning worksheet +~~~~~~~~~~~~~~~~~~~~~~~ + +Complete this before starting. You will substitute these values throughout +the guide. + +.. list-table:: + :widths: 15 25 25 35 + :header-rows: 1 + + * - Node + - Hostname + - IP address + - Initial role + * - 1 + - pg1 + - _______________ + - Primary + * - 2 + - pg2 + - _______________ + - Standby + * - 3 + - pg3 + - _______________ + - Standby + * - VIP + - — + - _______________ + - Floating (always points to primary) + +**Subnet:** ``_______________`` (e.g. ``10.0.1.0``) + +Time estimate +~~~~~~~~~~~~~ + +Allow **2–3 hours** for a first-time setup on pre-provisioned servers. + +Setup guide +----------- + +.. note:: + + Throughout this guide, substitute the IP addresses and subnet you recorded + in the node planning worksheet above. + +.. warning:: + + Complete each phase in order. The checkpoint at the end of each phase must + pass before you proceed. + +Phase 1: Base installation (all nodes) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Run all steps in Phase 1 on **pg1, pg2, and pg3**. + +**Step 1.1 — Configure /etc/hosts** + +On each node, append to ``/etc/hosts``: + +.. code-block:: text + + pg1 + pg2 + pg3 + +Verify hostname resolution on each node: + +.. code-block:: bash + + ping -c 1 pg1 && ping -c 1 pg2 && ping -c 1 pg3 + +Expected: 3 successful pings. + +**Step 1.2 — Install PostgreSQL 17 and repmgr 5.5** + +.. 
code-block:: bash + + sudo apt update + sudo apt install -y curl ca-certificates + sudo install -d /usr/share/postgresql-common/pgdg + sudo curl -o /usr/share/postgresql-common/pgdg/apt.postgresql.org.asc \ + --fail https://www.postgresql.org/media/keys/ACCC4CF8.asc + sudo sh -c 'echo "deb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.asc] \ + https://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" \ + > /etc/apt/sources.list.d/pgdg.list' + sudo apt update + sudo apt install -y postgresql-17 postgresql-17-repmgr + +✅ **Phase 1 checkpoint** — run on every node: + +.. code-block:: bash + + sudo systemctl status postgresql | grep "active (running)" + /usr/lib/postgresql/17/bin/repmgr --version + +**Pass:** PostgreSQL shows ``active (running)``; repmgr prints ``repmgr 5.5.x``. + +**Fail:** If PostgreSQL did not start, check ``journalctl -u postgresql`` for errors. + +Phase 2: PostgreSQL configuration (all nodes) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Run all steps in Phase 2 on **pg1, pg2, and pg3**. + +**Step 2.1 — Configure postgresql.conf** + +Append to ``/etc/postgresql/17/main/postgresql.conf``: + +.. code-block:: ini + + # Replication settings + listen_addresses = '*' + max_wal_senders = 10 + max_replication_slots = 10 + wal_level = replica + hot_standby = on + archive_mode = on + archive_command = '/bin/true' + shared_preload_libraries = 'repmgr' + wal_log_hints = on + wal_keep_size = 1024 + +**Step 2.2 — Configure pg_hba.conf** + +Append to ``/etc/postgresql/17/main/pg_hba.conf``: + +.. code-block:: text + + # repmgr access + host repmgr repmgr /24 scram-sha-256 + host repmgr repmgr 127.0.0.1/32 scram-sha-256 + # Replication connections + host replication repmgr /24 scram-sha-256 + host replication repmgr 127.0.0.1/32 scram-sha-256 + +Create a ``.pgpass`` file on each node so repmgr can authenticate without +an interactive password prompt: + +.. 
code-block:: bash + + echo "*:*:repmgr:repmgr:" >> ~/.pgpass + chmod 600 ~/.pgpass + +.. warning:: + + **Lab and testing only:** If you want to skip password authentication for + initial setup, you can temporarily use ``trust`` instead of + ``scram-sha-256``. Do not use ``trust`` in production — it allows + passwordless connections from any host on the subnet. + +**Step 2.3 — Restart PostgreSQL** + +.. code-block:: bash + + sudo systemctl restart postgresql + +✅ **Phase 2 checkpoint** — run on every node: + +.. code-block:: bash + + sudo -u postgres psql -c "SHOW wal_level;" + sudo -u postgres psql -c "SHOW shared_preload_libraries;" + +**Pass:** ``wal_level`` is ``replica``; ``shared_preload_libraries`` contains ``repmgr``. + +**Fail:** If PostgreSQL did not restart, check ``journalctl -u postgresql``. + +Phase 3: repmgr configuration and cluster initialisation +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +**Step 3.1 — Create repmgr user and database (pg1 only)** + +.. note:: + + repmgr requires superuser privileges to perform certain cluster operations + including ``pg_rewind`` (used to resync a failed primary as a standby) and + event notifications. If your security policy prohibits superuser accounts, + refer to the `repmgr documentation on permissions + `__ for the + minimum required grants. + +.. code-block:: bash + + sudo -u postgres createuser --superuser repmgr + sudo -u postgres createdb --owner=repmgr repmgr + sudo -u postgres psql -c "ALTER USER repmgr SET search_path TO repmgr, public;" + sudo -u postgres psql -c "ALTER USER repmgr PASSWORD '';" + +**Step 3.2 — Create /etc/repmgr.conf (all nodes)** + +Create ``/etc/repmgr.conf`` on each node. Adjust ``node_id``, ``node_name``, +and ``host`` for each node: + +**pg1:** + +.. 
code-block:: ini + + node_id=1 + node_name='pg1' + conninfo='host= user=repmgr dbname=repmgr connect_timeout=2' + data_directory='/var/lib/postgresql/17/main' + use_replication_slots=yes + monitoring_history=yes + log_level=INFO + pg_bindir='/usr/lib/postgresql/17/bin' + service_start_command='sudo /usr/bin/pg_ctlcluster 17 main start' + service_stop_command='sudo /usr/bin/pg_ctlcluster 17 main stop' + service_restart_command='sudo /usr/bin/pg_ctlcluster 17 main restart' + service_reload_command='sudo /usr/bin/pg_ctlcluster 17 main reload' + service_promote_command='sudo /usr/bin/pg_ctlcluster 17 main promote' + failover=automatic + promote_command='repmgr standby promote -f /etc/repmgr.conf --log-to-file' + follow_command='repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n' + reconnect_attempts=3 + reconnect_interval=5 + monitor_interval_secs=2 + +**pg2:** Same as above with ``node_id=2``, ``node_name='pg2'``, ``host=``. + +**pg3:** Same as above with ``node_id=3``, ``node_name='pg3'``, ``host=``. + +**Step 3.3 — Register primary (pg1 only)** + +.. code-block:: bash + + sudo -u postgres repmgr -f /etc/repmgr.conf primary register + +**Step 3.4 — Clone standbys (pg2 and pg3)** + +Run on **pg2**, then **pg3**: + +.. code-block:: bash + + sudo systemctl stop postgresql + sudo -u postgres repmgr -h -U repmgr -d repmgr \ + -f /etc/repmgr.conf standby clone --delete-existing-pgdata + sudo systemctl start postgresql + sudo -u postgres repmgr -f /etc/repmgr.conf standby register + +**Step 3.5 — Start repmgrd (all nodes)** + +Create ``/etc/systemd/system/repmgrd.service``: + +.. code-block:: ini + + [Unit] + Description=repmgr daemon + After=postgresql.service + Requires=postgresql.service + + [Service] + User=postgres + ExecStart=/usr/lib/postgresql/17/bin/repmgrd -f /etc/repmgr.conf --no-daemonize + Restart=on-failure + + [Install] + WantedBy=multi-user.target + +.. 
code-block:: bash + + sudo systemctl daemon-reload + sudo systemctl enable repmgrd + sudo systemctl start repmgrd + +✅ **Phase 3 checkpoint** — run on any node: + +.. code-block:: bash + + sudo -u postgres repmgr -f /etc/repmgr.conf cluster show + +**Pass:** Output shows all three nodes — pg1 as ``* running`` (primary), pg2 and +pg3 as ``running`` (standby). On pg1, the following query returns 2 rows: + +.. code-block:: bash + + sudo -u postgres psql -c "SELECT client_addr, state FROM pg_stat_replication;" + +**Fail:** A standby showing ``! running`` means replication did not establish. +Check ``journalctl -u postgresql`` on the failed standby. Common cause: firewall +blocking port 5432 between nodes. + +Phase 4: HAProxy, health check, and VIP (all nodes) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Run all steps in Phase 4 on **pg1, pg2, and pg3**. + +**Step 4.1 — Install HAProxy** + +.. code-block:: bash + + sudo apt install -y haproxy + +**Step 4.2 — Configure HAProxy** + +Replace ``/etc/haproxy/haproxy.cfg``: + +.. code-block:: text + + global + log /dev/log local0 + maxconn 4000 + + defaults + log global + mode tcp + timeout connect 5s + timeout client 30s + timeout server 30s + + frontend pg_write + bind *:5000 + default_backend pg_primary + + frontend pg_read + bind *:5001 + default_backend pg_replicas + + backend pg_primary + # HTTP check against pgchk.py: only the current primary answers 200 on /master + option httpchk GET /master + http-check expect status 200 + server pg1 :5432 check port 8008 + server pg2 :5432 check port 8008 backup + server pg3 :5432 check port 8008 backup + + backend pg_replicas + balance roundrobin + # HTTP check against pgchk.py: only standbys answer 200 on /replica + option httpchk GET /replica + http-check expect status 200 + server pg2 :5432 check port 8008 + server pg3 :5432 check port 8008 + server pg1 :5432 check port 8008 backup + +**Step 4.3 — Deploy pgchk.py** + +``pgchk.py`` is a lightweight HTTP server. A request to ``/`` or ``/master`` returns ``200 OK`` when the local +node is the primary, and a request to ``/replica`` returns ``200 OK`` when it is a standby. Either path returns ``503`` when the node is in the other role or PostgreSQL is unreachable. HAProxy queries port 8008 on each +node to determine where to route connections.
+ +On each node, create ``/usr/local/bin/pgchk.py`` with the following content: + +.. code-block:: python + + #!/usr/bin/env python3 + import subprocess + from http.server import BaseHTTPRequestHandler, HTTPServer + import argparse + + DEFAULT_PORT = 8008 + PG_USER = "postgres" + PG_DB = "postgres" + PG_PORT = "5432" + + class PostgresHealthCheckHandler(BaseHTTPRequestHandler): + def safe_write(self, data): + try: + self.wfile.write(data) + except (BrokenPipeError, ConnectionResetError): + pass + + def check_postgres_status(self): + try: + cmd = ["psql", "-U", PG_USER, "-d", PG_DB, "-p", PG_PORT, + "-t", "-c", "SELECT pg_is_in_recovery();"] + result = subprocess.run(cmd, capture_output=True, text=True, timeout=5) + if result.returncode != 0: + return None + output = result.stdout.strip() + if output == 't': + return True # Standby + elif output == 'f': + return False # Primary + return None + except Exception: + return None + + def do_GET(self): + status = self.check_postgres_status() + if status is None: + self.send_response(503) + self.end_headers() + self.safe_write(b"PostgreSQL Unreachable\n") + return + if self.path in ('/', '/master'): + if not status: + self.send_response(200); self.end_headers() + self.safe_write(b"OK - Primary\n") + else: + self.send_response(503); self.end_headers() + self.safe_write(b"Service Unavailable - Not Primary\n") + elif self.path == '/replica': + if status: + self.send_response(200); self.end_headers() + self.safe_write(b"OK - Replica\n") + else: + self.send_response(503); self.end_headers() + self.safe_write(b"Service Unavailable - Not Replica\n") + else: + self.send_response(404); self.end_headers() + self.safe_write(b"Not Found\n") + + def log_message(self, format, *args): + pass + + def run(port=DEFAULT_PORT): + httpd = HTTPServer(('', port), PostgresHealthCheckHandler) + print(f"Starting PostgreSQL Health Check on port {port}...") + try: + httpd.serve_forever() + except KeyboardInterrupt: + pass + httpd.server_close() + + 
if __name__ == '__main__': + parser = argparse.ArgumentParser(description='PostgreSQL Health Check for HAProxy') + parser.add_argument('--port', type=int, default=DEFAULT_PORT) + args = parser.parse_args() + run(port=args.port) + +Make the script executable: + +.. code-block:: bash + + sudo chmod +x /usr/local/bin/pgchk.py + +Create ``/etc/systemd/system/pgchk.service``. The service runs as the ``postgres`` OS user so that ``psql`` can authenticate over the local socket with Ubuntu's default ``peer`` authentication: + +.. code-block:: ini + + [Unit] + Description=PostgreSQL Health Check for HAProxy + After=postgresql.service + + [Service] + # Run as postgres so psql -U postgres passes peer authentication + User=postgres + ExecStart=/usr/bin/python3 /usr/local/bin/pgchk.py --port 8008 + Restart=always + + [Install] + WantedBy=multi-user.target + +Start the health check, then restart HAProxy so that it loads the new configuration (the package starts HAProxy with its default configuration at install time): + +.. code-block:: bash + + sudo systemctl daemon-reload + sudo systemctl enable pgchk + sudo systemctl start pgchk + sudo systemctl enable haproxy + sudo systemctl restart haproxy + +**Step 4.4 — Install and configure Keepalived** + +.. code-block:: bash + + sudo apt install -y keepalived + +First, identify the name of the network interface that carries your node's IP address: + +.. code-block:: bash + + ip -o link show | awk '{print $2, $9}' | grep UP + +Note the interface name (e.g. ``ens3``, ``enp3s0``, ``eth0``). You will use it in +the next step. + +Create ``/etc/keepalived/keepalived.conf``. Replace ```` with the +interface name from the previous step. Set ``priority``: pg1 gets ``101``, +pg2 gets ``100``, pg3 gets ``99``. Set ``virtual_ipaddress`` to your VIP: + +.. code-block:: text + + vrrp_instance VI_1 { + state BACKUP + interface + virtual_router_id 51 + priority 101 + advert_int 1 + nopreempt + virtual_ipaddress { + /24 + } + } + +.. code-block:: bash + + sudo systemctl enable keepalived + sudo systemctl start keepalived + +✅ **Phase 4 checkpoint** — run on any node: + +..
code-block:: bash + + # VIP should be active on the primary node (pg1) + ip addr show | grep + + # Port 5000 should connect to primary + psql -h -p 5000 -U repmgr -d repmgr \ + -c "SELECT inet_server_addr(), pg_is_in_recovery();" + + # Port 5001 should connect to a standby + psql -h -p 5001 -U repmgr -d repmgr \ + -c "SELECT inet_server_addr(), pg_is_in_recovery();" + +**Pass:** VIP visible on pg1. Port 5000 returns ``pg_is_in_recovery = f`` (primary). +Port 5001 returns ``pg_is_in_recovery = t`` (standby). + +**Fail:** If the VIP is not on pg1, check ``journalctl -u keepalived``. If HAProxy +is not routing correctly, check ``journalctl -u haproxy`` and verify pgchk.py +is responding: ``curl http://:8008`` should return HTTP 200. + +Phase 5: End-to-end validation +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Run this phase after all four previous phases pass on all nodes. This confirms +the cluster behaves correctly under failure before you connect Mattermost. + +**Step 5.1 — Confirm healthy starting state** + +.. code-block:: bash + + sudo -u postgres repmgr -f /etc/repmgr.conf cluster show + +**Pass:** pg1 is ``* running`` (primary); pg2 and pg3 are ``running`` (standby). + +**Step 5.2 — Simulate primary failure** + +On **pg1**: + +.. code-block:: bash + + sudo systemctl stop postgresql + +Wait 30 seconds, then on **pg2** or **pg3**: + +.. code-block:: bash + + sudo -u postgres repmgr -f /etc/repmgr.conf cluster show + +**Pass:** One of pg2 or pg3 is now ``* running`` (primary). pg1 shows as ``! running`` +(unreachable — expected). + +**Step 5.3 — Verify HAProxy and VIP followed the new primary** + +.. code-block:: bash + + psql -h -p 5000 -U repmgr -d repmgr \ + -c "SELECT inet_server_addr(), pg_is_in_recovery();" + +**Pass:** Returns the IP of the newly promoted node with ``pg_is_in_recovery = f``. + +**Step 5.4 — Recover the old primary as a standby** + +On **pg1**: + +.. 
code-block:: bash + + sudo systemctl start postgresql + sudo -u postgres repmgr -f /etc/repmgr.conf node rejoin \ + --force-rewind --config-files=postgresql.conf,pg_hba.conf + +Then on any node: + +.. code-block:: bash + + sudo -u postgres repmgr -f /etc/repmgr.conf cluster show + +**Pass:** All three nodes show ``running``; pg1 is now a standby. + +.. note:: + + Your cluster is ready for production. Connect Mattermost using the VIP + address and port 5000 as the primary datasource. Optionally configure + port 5001 as a read replica in ``config.json``. + +Day-2 operations +---------------- + +Check cluster status +~~~~~~~~~~~~~~~~~~~~ + +Run on any node: + +.. code-block:: bash + + sudo -u postgres repmgr -f /etc/repmgr.conf cluster show + +Expected healthy output shows one ``* running`` primary and two ``running`` standbys. + +Check replication lag +~~~~~~~~~~~~~~~~~~~~~ + +Run on the primary: + +.. code-block:: bash + + sudo -u postgres psql -c " + SELECT client_addr, application_name, state, + pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS lag + FROM pg_stat_replication;" + +Normal lag is under 1 MB during steady state. Lag growing continuously +indicates a replication problem — check network connectivity and standby +PostgreSQL logs. + +Controlled switchover (planned maintenance) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To move the primary role to a standby with zero data loss: + +.. code-block:: bash + + # Run on the TARGET standby (e.g. pg2) + sudo -u postgres repmgr -f /etc/repmgr.conf standby switchover + +repmgr will demote the old primary and promote this node. The VIP and HAProxy +will follow automatically. + +Add a standby node +~~~~~~~~~~~~~~~~~~ + +1. Provision a new server and complete Phases 1–2 of the setup guide. +2. Create ``/etc/repmgr.conf`` with the next available ``node_id``. +3. On the new node: + + .. 
code-block:: bash + + sudo systemctl stop postgresql + sudo -u postgres repmgr -h -U repmgr -d repmgr \ + -f /etc/repmgr.conf standby clone --delete-existing-pgdata + sudo systemctl start postgresql + sudo -u postgres repmgr -f /etc/repmgr.conf standby register + +4. Add the new node to ``/etc/haproxy/haproxy.cfg`` on all existing nodes and + reload HAProxy: ``sudo systemctl reload haproxy``. + +Rejoin a failed node +~~~~~~~~~~~~~~~~~~~~ + +After recovering a failed standby: + +.. code-block:: bash + + sudo -u postgres repmgr -f /etc/repmgr.conf node rejoin \ + --force-rewind --config-files=postgresql.conf,pg_hba.conf + +After rejoining a failed primary (after automatic failover has already promoted +a new primary), run the same command on the old primary to re-register it as a +standby. + +Troubleshooting +--------------- + +repmgrd is not starting +~~~~~~~~~~~~~~~~~~~~~~~~ + +**Symptom:** ``systemctl status repmgrd`` shows ``failed`` or ``activating``. + +**Likely cause:** PostgreSQL has not fully started yet, or the repmgr database +is not accessible. + +**Resolution:** + +.. code-block:: bash + + # Verify PostgreSQL is running first + sudo systemctl status postgresql + + # Check repmgrd logs + journalctl -u repmgrd -n 50 + + # Test repmgr connection manually + sudo -u postgres repmgr -f /etc/repmgr.conf cluster show + +Standby not replicating +~~~~~~~~~~~~~~~~~~~~~~~~ + +**Symptom:** ``repmgr cluster show`` shows a standby as ``! running``, or +``pg_stat_replication`` on the primary shows fewer than expected rows. + +**Likely cause:** Network connectivity issue on port 5432, or ``pg_hba.conf`` +not permitting the replication connection. + +**Resolution:** + +.. 
code-block:: bash + + # From the standby, test connectivity to the primary + pg_isready -h -p 5432 -U repmgr + + # Check PostgreSQL logs on the standby + sudo -u postgres tail -50 /var/log/postgresql/postgresql-17-main.log + +VIP not moving after failover +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +**Symptom:** After a primary failure and successful repmgr promotion, the VIP +remains on the failed node or does not appear on the new primary. + +**Likely cause:** Keepalived is not running, or VRRP traffic is blocked by a +firewall. + +**Resolution:** + +.. code-block:: bash + + sudo systemctl status keepalived + journalctl -u keepalived -n 50 + + # Verify VRRP traffic is not blocked — check cloud security groups or + # iptables rules for protocol 112 (VRRP) + +HAProxy routing to wrong node +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +**Symptom:** Connections on port 5000 land on a standby (writes fail), or +port 5001 routes to the primary. + +**Likely cause:** pgchk.py is not running, or it is returning an incorrect status. + +**Resolution:** + +.. code-block:: bash + + # Check health check response on each node + curl -v http://:8008 + + # Primary should return HTTP 200; standbys should return HTTP 503 + sudo systemctl status pgchk + journalctl -u pgchk -n 30 + +Split-brain prevention +~~~~~~~~~~~~~~~~~~~~~~ + +With ``failover=automatic``, repmgrd retries the primary ``reconnect_attempts=3`` times at +``reconnect_interval=5`` second intervals, so roughly 15 seconds pass before a standby is promoted. +This delay avoids promotion during transient network blips. + +If you suspect a split-brain scenario (two nodes both believing they are +primary), **do not write to either node**. Check cluster status from a +third node and use ``repmgr node service --action=stop`` to fence the +unintended primary before recovering.
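The split-brain check described above can be sketched in Python: ask each node whether it is in recovery and flag the cluster if more than one node reports that it is a primary. This is a minimal sketch under stated assumptions, not part of repmgr. The function names (``query_in_recovery``, ``find_primaries``, ``classify``) are hypothetical, and the sketch assumes the ``pg1``, ``pg2``, and ``pg3`` hostnames, the ``repmgr`` user and database, and ``.pgpass`` authentication from this guide.

```python
import subprocess


def query_in_recovery(host, user="repmgr", dbname="repmgr", port=5432):
    """Return True (standby), False (primary), or None (unreachable).

    Assumes psql is on PATH and .pgpass supplies the password.
    """
    cmd = ["psql", "-h", host, "-p", str(port), "-U", user, "-d", dbname,
           "-At", "-c", "SELECT pg_is_in_recovery();"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=5)
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    # psql prints 't' for a standby and 'f' for a primary
    return {"t": True, "f": False}.get(result.stdout.strip())


def find_primaries(statuses):
    """Return the sorted names of nodes that report they are not in recovery."""
    return sorted(name for name, in_recovery in statuses.items()
                  if in_recovery is False)


def classify(statuses):
    """Summarise the cluster as 'healthy', 'split-brain', or 'no-primary'."""
    primaries = find_primaries(statuses)
    if len(primaries) == 1:
        return "healthy"
    if len(primaries) > 1:
        return "split-brain"
    return "no-primary"
```

For example, ``classify({h: query_in_recovery(h) for h in ("pg1", "pg2", "pg3")})``, run from a host that can reach all three nodes, returns ``"split-brain"`` when two nodes report they are not in recovery. An unreachable node is recorded as ``None`` and is not counted as a primary, so check a suspect node directly before fencing it.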
diff --git a/source/administration-guide/scale/scaling-for-enterprise.rst b/source/administration-guide/scale/scaling-for-enterprise.rst index 18779aee9e2..5e4c0521dde 100644 --- a/source/administration-guide/scale/scaling-for-enterprise.rst +++ b/source/administration-guide/scale/scaling-for-enterprise.rst @@ -29,6 +29,14 @@ High availability A :doc:`high availability cluster-based deployment ` enables a Mattermost system to maintain service during outages and hardware failures through the use of redundant infrastructure. +PostgreSQL high availability cluster +-------------------------------------- + +For self-hosted deployments on bare-metal or VMs, a +:doc:`PostgreSQL HA cluster ` +provides automatic database failover using repmgr, HAProxy, and Keepalived — +without requiring a managed database service. + Redis ----- @@ -45,7 +53,8 @@ Available reference architectures Deployment architecture at scale Backing storage benchmarks Enterprise search - High availability + High availability + PostgreSQL HA cluster Redis Scale up to 200 users Scale up to 2000 users