
High-availability Cluster with Pacemaker and Corosync, PCS and CRMSH Versions


In this tutorial I set up a high-availability cluster with Pacemaker and Corosync, once with PCS and once with CRMSH.

Prerequisites
#

My Setup
#

I use 3 Ubuntu 22.04 servers with the following hostnames and IP addresses:

192.168.30.31 jkw-han-1
192.168.30.32 jkw-han-2
192.168.30.33 jkw-han-3

And the following floating IP address: 192.168.30.30


Hosts Files
#

Make sure to remove the default 127.0.1.1 jkw-han-1 entry from the hosts file on each node, so that the hostname resolves to the node's real IP address instead of the loopback address!

# Open hosts file
sudo vi /etc/hosts
# Change hosts file
127.0.0.1 localhost

192.168.30.31 jkw-han-1
192.168.30.32 jkw-han-2
192.168.30.33 jkw-han-3
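
With the hosts files in place, it does not hurt to verify that each node can resolve and reach the other nodes by name (run from any node):

# Check name resolution and connectivity to the other nodes
ping -c 1 jkw-han-2
ping -c 1 jkw-han-3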

Nginx
#

Install nginx on all nodes and set up a unique HTML file on each node so that you can tell them apart.

# Install nginx
sudo apt install nginx -y


# Define the nginx website on node 1
echo '<h1>jkw-han-1</h1>' | sudo tee /var/www/html/index.html

# Define the nginx website on node 2
echo '<h1>jkw-han-2</h1>' | sudo tee /var/www/html/index.html

# Define the nginx website on node 3
echo '<h1>jkw-han-3</h1>' | sudo tee /var/www/html/index.html


# Stop nginx on all nodes
sudo systemctl stop nginx
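
Since Pacemaker will manage nginx as a cluster resource from now on, I recommend also disabling the systemd unit so that nginx does not start on its own at boot:

# Disable nginx autostart on all nodes
sudo systemctl disable nginx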

PCS Version
#

Install Pacemaker, Corosync and PCS
#

Run the following commands on all nodes:

# Install packages
sudo apt install pacemaker corosync pcs resource-agents -y

# Enable startup
sudo systemctl enable pacemaker corosync pcsd

# Stop Corosync and Pacemaker
sudo systemctl stop corosync && sudo systemctl stop pacemaker

# Start PCS
sudo systemctl start pcsd
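
pcsd listens on TCP port 2224 by default, so a quick way to confirm the daemon is up and reachable on each node is:

# Verify that pcsd is listening on port 2224
sudo ss -tlnp | grep 2224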

Packages Explanation
#

  • pacemaker Pacemaker manages the cluster resources (starting, stopping, and monitoring services) based on policies you define and on messages and events from Corosync.

  • corosync Corosync provides the messaging layer that allows the cluster nodes to communicate with each other. It ensures that each node knows about the other nodes and whether they are online or offline.

  • crmsh or pcs Command-line tools to manage and configure Pacemaker and Corosync clusters.

  • resource-agents Scripts or agents that Pacemaker uses to manage various services and resources. Each resource type (e.g., NGINX, Apache, Filesystem, Virtual IP, etc.) has a corresponding resource agent that knows how to start, stop, and monitor that specific resource (see the example after this list).

  • fence-agents Agents used for fencing (STONITH), which isolates failed or unreachable nodes to ensure data integrity and prevent a “split-brain” situation.
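
If you want to see which resource agents are available and which parameters they accept, pcs can list and describe them (the crmsh equivalents are crm ra list and crm ra info):

# List the available OCF resource agents from the heartbeat provider
sudo pcs resource agents ocf:heartbeat

# Show the parameters of a specific resource agent
sudo pcs resource describe ocf:heartbeat:IPaddr2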


Pacemaker Configuration System (PCS)
#

During the installation of PCS, a user named “hacluster” is created. Set the same password for this user on all nodes:

# Set password for hacluster user: On all nodes
sudo passwd hacluster

# Shell output:
New password:
Retype new password:
passwd: password updated successfully
# Authenticate / authorize the nodes: Run command on first node
sudo pcs host auth jkw-han-1 jkw-han-2 jkw-han-3 -u hacluster

# Shell Output:
Password:
jkw-han-2: Authorized
jkw-han-3: Authorized
jkw-han-1: Authorized
# Configure Cluster: Run command on first node
sudo pcs cluster setup jkw-cluster jkw-han-1 jkw-han-2 jkw-han-3 --force

Note: Use the --force option to overwrite the existing (default) configuration. Replace “jkw-cluster” with the name of your cluster.
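
The setup command writes the Corosync configuration for you; if you are curious what it generated, you can have a look at it (default path on Ubuntu):

# Show the generated Corosync configuration
sudo cat /etc/corosync/corosync.conf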

# Start the Cluster: Run command on first node
sudo pcs cluster start --all

# Shell output:
jkw-han-1: Starting Cluster...
jkw-han-3: Starting Cluster...
jkw-han-2: Starting Cluster...
# Enable auto start
sudo pcs cluster enable --all 

# Shell output:
jkw-han-1: Cluster Enabled
jkw-han-2: Cluster Enabled
jkw-han-3: Cluster Enabled
# Check Cluster Status:
sudo pcs cluster status

# Shell output:
Cluster Status:
 Cluster Summary:
   * Stack: corosync
   * Current DC: jkw-han-3 (version 2.1.2-ada5c3b36e2) - partition with quorum
   * Last updated: Sat Aug 19 22:35:06 2023
   * Last change:  Sat Aug 19 22:34:25 2023 by hacluster via crmd on jkw-han-3
   * 3 nodes configured
   * 0 resource instances configured
 Node List:
   * Online: [ jkw-han-1 jkw-han-2 jkw-han-3 ]

PCSD Status:
  jkw-han-1: Online
  jkw-han-2: Online
  jkw-han-3: Online
# Configure Cluster Options: Run command on first node
sudo pcs property set stonith-enabled=false # Disable fencing / STONITH
sudo pcs property set no-quorum-policy=ignore  # Ignore loss of quorum
# Check Cluster Options
sudo pcs property config

# Shell output:
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: jkw-cluster
 dc-version: 2.1.2-ada5c3b36e2
 have-watchdog: false
 no-quorum-policy: ignore
 stonith-enabled: false
# Add nginx as cluster resource
sudo pcs resource create webserver ocf:heartbeat:nginx configfile="/etc/nginx/nginx.conf" op monitor interval="30s"

# Create floating IP resource
sudo pcs resource create floatingip ocf:heartbeat:IPaddr2 ip=192.168.30.30 cidr_netmask=32 op monitor interval=30s

# Create resource group
sudo pcs resource group add nginx_group webserver floatingip
# Check the resources
sudo pcs resource status

# Shell output:
* Resource Group: nginx_group:
  * webserver  (ocf:heartbeat:nginx):   Started jkw-han-1
  * floatingip (ocf:heartbeat:IPaddr2): Started jkw-han-1
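
To double-check that the floating IP is actually bound on the node that currently runs the resource group (jkw-han-1 in my case), have a look at the interface addresses on that node:

# Check that the floating IP is assigned on the active node
ip -4 addr show | grep 192.168.30.30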

Cluster Status
#

# List node status
sudo pcs status nodes

# Shell output:
Pacemaker Nodes:
 Online: jkw-han-1 jkw-han-2 jkw-han-3
 Standby:
 Standby with resource(s) running:
 Maintenance:
 Offline:
Pacemaker Remote Nodes:
 Online:
 Standby:
 Standby with resource(s) running:
 Maintenance:
 Offline:
# List node IP addresses
sudo corosync-cmapctl | grep members

# Shell output:
runtime.members.1.config_version (u64) = 0
runtime.members.1.ip (str) = r(0) ip(192.168.30.31)
runtime.members.1.join_count (u32) = 1
runtime.members.1.status (str) = joined
runtime.members.2.config_version (u64) = 0
runtime.members.2.ip (str) = r(0) ip(192.168.30.32)
runtime.members.2.join_count (u32) = 1
runtime.members.2.status (str) = joined
runtime.members.3.config_version (u64) = 0
runtime.members.3.ip (str) = r(0) ip(192.168.30.33)
runtime.members.3.join_count (u32) = 1
runtime.members.3.status (str) = joined
# List Corosync members
sudo pcs status corosync

# Shell output:
Membership information
----------------------
    Nodeid      Votes Name
         1          1 jkw-han-1 (local)
         2          1 jkw-han-2
         3          1 jkw-han-3

Useful Commands
#

# Disable / enable resource
sudo pcs resource disable webserver
sudo pcs resource enable webserver

# Delete resource
sudo pcs resource delete webserver

# List resource details
sudo pcs resource config
# Stop / start the cluster
sudo pcs cluster stop --all
sudo pcs cluster start --all
# Pacemaker logs
sudo journalctl -u pacemaker

# Pacemaker status
sudo systemctl status pacemaker

# Corosync status
sudo systemctl status corosync
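
Two more commands I find handy for maintenance: putting a node into standby temporarily drains its resources, and a move relocates the resource group to another node (the move creates a location constraint that should be cleared afterwards):

# Put a node into standby / take it out of standby
sudo pcs node standby jkw-han-1
sudo pcs node unstandby jkw-han-1

# Move the resource group to a specific node
sudo pcs resource move nginx_group jkw-han-2

# Remove the location constraint created by the move
sudo pcs resource clear nginx_group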

Testing
#

# Open the floating IP in a browser
http://192.168.30.30/

# Browser output:
jkw-han-1
# Stop node 1
sudo pcs cluster stop jkw-han-1

# Shell output:
jkw-han-1: Stopping Cluster (pacemaker)...
jkw-han-1: Stopping Cluster (corosync)...
# Open the floating IP in a browser
http://192.168.30.30/

# Browser output:
jkw-han-2
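
Instead of a browser you can also test with curl, and once the failover test is done, start node 1 again so it rejoins the cluster:

# Check the floating IP from the command line
curl http://192.168.30.30/

# Start node 1 again
sudo pcs cluster start jkw-han-1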

CRMSH Version
#

Install Pacemaker, Corosync and CRMSH
#

Run the following commands on all nodes:

# Install packages
sudo apt install pacemaker corosync crmsh resource-agents -y

# Enable startup
sudo systemctl enable pacemaker corosync

# Stop Corosync and Pacemaker
sudo systemctl stop corosync && sudo systemctl stop pacemaker

Create Corosync Key
#

# Install haveged for the Corosync key generation
sudo apt install haveged -y

# Create the Corosync key on the first node (it will be copied to the other nodes later)
sudo corosync-keygen

# Shell output:
Writing corosync key to /etc/corosync/authkey
# Check if key exists
ls -la /etc/corosync/

# Shell output:
-r--------   1 root root  256 Aug 20 13:22 authkey
-rw-r--r--   1 root root 2001 Jan 12  2022 corosync.conf
drwxr-xr-x   2 root root 4096 Jan 12  2022 uidgid.d

Corosync Configuration
#

# Edit the corosync configuration
sudo vi /etc/corosync/corosync.conf
# Define Corosync Configuration
totem {
    version: 2
    cluster_name: jkw-cluster
    transport: knet
    crypto_cipher: aes256
    crypto_hash: sha256
}

nodelist {
    node {
        ring0_addr: jkw-han-1
        name: jkw-han-1
        nodeid: 1
    }

    node {
        ring0_addr: jkw-han-2
        name: jkw-han-2
        nodeid: 2
    }

    node {
        ring0_addr: jkw-han-3
        name: jkw-han-3
        nodeid: 3
    }
}

quorum {
    provider: corosync_votequorum
}

logging {
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
    to_syslog: yes
    timestamp: on
}

Copy Corosync Configuration
#

Copy the configuration to the other nodes:

# Copy the corosync configuration to jkw-han-2
sudo scp -r /etc/corosync/* ubuntu@jkw-han-2:/home/ubuntu

# Copy the corosync configuration to jkw-han-3
sudo scp -r /etc/corosync/* ubuntu@jkw-han-3:/home/ubuntu


# SSH into jkw-han-2 / jkw-han-3 and copy the configuration into place
cd
sudo cp -r * /etc/corosync
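
After copying, it is worth confirming on each node that the authkey is owned by root and still has restrictive permissions; if not, fix ownership and mode:

# Check ownership and permissions of the copied files
ls -la /etc/corosync/

# Fix them if necessary
sudo chown root:root /etc/corosync/authkey
sudo chmod 400 /etc/corosync/authkey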

Start Cluster
#

# Start and enable Corosync on all nodes
sudo systemctl start corosync && sudo systemctl enable corosync

# Shell output:
Synchronizing state of corosync.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable corosync
# Start and enable Pacemaker on all nodes
sudo systemctl start pacemaker && sudo systemctl enable pacemaker

# Shell output:
Synchronizing state of pacemaker.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable pacemaker
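
Before looking at Pacemaker, you can verify on each node that Corosync has formed its ring and the links are healthy:

# Show the Corosync link status
sudo corosync-cfgtool -s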

Cluster Status
#

# Check Server status on all nodes
sudo crm status

# Shell output:
Cluster Summary:
  * Stack: corosync
  * Current DC: jkw-han-2 (version 2.1.2-ada5c3b36e2) - partition with quorum
  * Last updated: Sun Aug 20 13:35:19 2023
  * Last change:  Sun Aug 20 13:34:51 2023 by hacluster via crmd on jkw-han-2
  * 3 nodes configured
  * 0 resource instances configured

Node List:
  * Online: [ jkw-han-1 jkw-han-2 jkw-han-3 ]

Full List of Resources:
  * No resources
# List node IPs
sudo corosync-cmapctl | grep members

# Shell output
runtime.members.1.config_version (u64) = 0
runtime.members.1.ip (str) = r(0) ip(192.168.30.31)
runtime.members.1.join_count (u32) = 1
runtime.members.1.status (str) = joined
runtime.members.2.config_version (u64) = 0
runtime.members.2.ip (str) = r(0) ip(192.168.30.32)
runtime.members.2.join_count (u32) = 1
runtime.members.2.status (str) = joined
runtime.members.3.config_version (u64) = 0
runtime.members.3.ip (str) = r(0) ip(192.168.30.33)
runtime.members.3.join_count (u32) = 1
runtime.members.3.status (str) = joined

Cluster Resources
#

# Configure Cluster Options: Run command on first node
sudo crm configure property stonith-enabled=false # Disable fencing / STONITH
sudo crm configure property no-quorum-policy=ignore # Ignore loss of quorum
# Check the policy status
sudo crm configure show

# Shell output:
node 1: jkw-han-1
node 2: jkw-han-2
node 3: jkw-han-3
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=2.1.2-ada5c3b36e2 \
        cluster-infrastructure=corosync \
        cluster-name=debian \
        stonith-enabled=false \
        no-quorum-policy=ignore
# Create a virtual IP resource for the floating IP 192.168.30.30
sudo crm configure primitive floatingip \
ocf:heartbeat:IPaddr2 params ip="192.168.30.30" \
cidr_netmask="32" op monitor interval="10s" \
meta migration-threshold="10"

# Shell output
WARNING: (unpack_config)        warning: Blind faith: not fencing unseen nodes
# Create webserver resource for nginx
sudo crm configure primitive webserver \
ocf:heartbeat:nginx configfile=/etc/nginx/nginx.conf \
op start timeout="40s" interval="0" \
op stop timeout="60s" interval="0" \
op monitor interval="10s" timeout="60s" \
meta migration-threshold="10"

# Shell output
WARNING: (unpack_config)        warning: Blind faith: not fencing unseen nodes
# Create resource group
sudo crm configure group jkw-cluster floatingip webserver

# Shell output:
WARNING: (unpack_config)        warning: Blind faith: not fencing unseen nodes
# Check resources
sudo crm resource status

# Shell output:
Full List of Resources:
  * Resource Group: jkw-cluster:
    * floatingip (ocf:heartbeat:IPaddr2):  Started
    * webserver  (ocf:heartbeat:nginx):    Started
# Check status
sudo crm status

# Shell output:
Cluster Summary:
  * Stack: corosync
  * Current DC: jkw-han-2 (version 2.1.2-ada5c3b36e2) - partition with quorum
  * Last updated: Sun Aug 20 13:44:10 2023
  * Last change:  Sun Aug 20 13:42:52 2023 by root via cibadmin on jkw-han-1
  * 3 nodes configured
  * 2 resource instances configured

Node List:
  * Online: [ jkw-han-1 jkw-han-2 jkw-han-3 ]

Full List of Resources:
  * Resource Group: jkw-cluster:
    * floatingip (ocf:heartbeat:IPaddr2): Started jkw-han-1
    * webserver  (ocf:heartbeat:nginx):   Started jkw-han-1

Testing
#

# Open the floating IP in a browser
http://192.168.30.30/

# Browser output:
jkw-han-1
# Stop node 1: Run command on node 1
sudo crm cluster stop

# Shell output:
INFO: Cluster services stopped
# Open the floating IP in a browser
http://192.168.30.30/

# Browser output:
jkw-han-2
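
As with the PCS version, the test also works with curl from the command line; afterwards, start the cluster services on node 1 again so it rejoins:

# Check the floating IP from the command line
curl http://192.168.30.30/

# Start the cluster services again: Run command on node 1
sudo crm cluster start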