Fading Coder


Building a Two-Node High Availability Cluster with DRBD and Heartbeat on CentOS 6

Tech Jun 13 1

Cluster Architecture and Prerequisites

A high availability (HA) setup using DRBD and Heartbeat relies on block-level data replication and automated failover management. DRBD mirrors storage across networked nodes, functioning as a network-based RAID 1. Heartbeat monitors node health via network pulses and triggers resource migration when the primary node becomes unresponsive.

Target Environment:

  • OS: CentOS 6.x (x86_64)
  • DRBD Version: 8.4.x
  • Node A (Primary): ha-node-01.local | Public: 10.0.50.10 | Sync/Heartbeat: 10.10.10.10
  • Node B (Secondary): ha-node-02.local | Public: 10.0.50.11 | Sync/Heartbeat: 10.10.10.11
  • Virtual IP (VIP): 10.0.50.100
  • Replication Volume: /dev/vdb1 (10GB raw partition on both nodes)
  • Mount Point: /var/lib/ha-data

System Preparation

Configure network interfaces, host resolution, and system security parameters on both nodes before installing cluster software.

1. Hostname and DNS Resolution

Update /etc/sysconfig/network and /etc/hosts on both nodes so that names resolve consistently across the cluster. Set HOSTNAME to each node's own name; the example below is for Node A.

# /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=ha-node-01.local

# /etc/hosts
127.0.0.1   localhost
10.0.50.10  ha-node-01.local
10.0.50.11  ha-node-02.local
10.10.10.10 ha-node-01-sync.local
10.10.10.11 ha-node-02-sync.local
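
The name-to-address mapping above can be checked with a small script before moving on. This is a minimal sketch, not part of the original procedure: the function names hosts_lookup and check_cluster_hosts are hypothetical, and the expected pairs are copied from the hosts file shown above.

```shell
# Print the first address mapped to a name in a hosts-format file.
# Usage: hosts_lookup <hosts-file> <hostname>
hosts_lookup() {
    awk -v name="$2" '
        /^[[:space:]]*#/ { next }          # skip comment lines
        {
            sub(/#.*/, "")                 # drop trailing comments
            for (i = 2; i <= NF; i++)
                if ($i == name) { print $1; exit }
        }
    ' "$1"
}

# Compare each expected name/address pair (assumed from the article's
# example) against the file and report any mismatch.
# Usage: check_cluster_hosts <hosts-file>
check_cluster_hosts() {
    rc=0
    while read -r name want; do
        got=$(hosts_lookup "$1" "$name")
        if [ "$got" != "$want" ]; then
            echo "MISMATCH: $name -> '$got' (expected $want)"
            rc=1
        fi
    done <<EOF
ha-node-01.local 10.0.50.10
ha-node-02.local 10.0.50.11
ha-node-01-sync.local 10.10.10.10
ha-node-02-sync.local 10.10.10.11
EOF
    return $rc
}
```

Running check_cluster_hosts /etc/hosts on each node should print nothing and return 0 when both files agree with the table.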

2. Security and Time Synchronization

Disable the firewall and SELinux so that they cannot block cluster traffic. Synchronize the system clocks so that log timestamps on both nodes line up when you trace failover or split-brain events.

# Disable iptables and SELinux
/sbin/service iptables stop
/sbin/chkconfig iptables off
setenforce 0
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config

# Time synchronization
yum install -y ntpdate
ntpdate pool.ntp.org
hwclock --systohc

3. Storage Partitioning

Create an identical raw partition on both servers. Do not format or mount it manually.

fdisk /dev/vdb <<EOF
n
p
1

+10G
w
EOF
partprobe /dev/vdb
mkdir -p /var/lib/ha-data
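
Since the partitions must be identical on both servers, it can help to compare their sizes before DRBD touches them. The sketch below reads a /proc/partitions-style file; the function name partition_kib is hypothetical and the device name vdb1 is taken from the target environment above.

```shell
# Print the size in KiB of a named block device, as listed in a
# /proc/partitions-style file (columns: major minor #blocks name).
# Returns non-zero if the device is not listed.
# Usage: partition_kib <partitions-file> <device-name>
partition_kib() {
    awk -v dev="$2" '
        $4 == dev { print $3; found = 1; exit }
        END       { exit !found }
    ' "$1"
}
```

Run partition_kib /proc/partitions vdb1 on each node and confirm that both print the same number.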

DRBD Installation and Block Replication Setup

1. Compile and Install DRBD

Install build dependencies and compile the DRBD kernel module and user-space tools.

yum install -y gcc make flex kernel-devel kernel-headers
wget http://www.drbd.org/download/drbd/8.4/archive/drbd-8.4.3.tar.gz
tar -xf drbd-8.4.3.tar.gz
cd drbd-8.4.3
./configure --prefix=/opt/drbd --with-km
make KDIR=/usr/src/kernels/$(uname -r)/
make install

# Register service
mkdir -p /opt/drbd/var/run/drbd
cp /opt/drbd/etc/rc.d/init.d/drbd /etc/init.d/
chkconfig --add drbd
chkconfig drbd on

2. Load Kernel Module

modprobe drbd
lsmod | grep drbd

If the module fails to load, ensure the running kernel matches the installed kernel-devel package. A system reboot may be required after kernel updates.
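
A quick way to test the kernel match described above is to look for a build tree named after the running kernel release. This is a sketch under the assumption that kernel-devel installs its tree under /usr/src/kernels, as the make command earlier implies; the function names are hypothetical.

```shell
# Succeed if a kernel build tree for the given release exists.
# Usage: kernel_tree_present <kernel-release> [base-dir]
kernel_tree_present() {
    base="${2:-/usr/src/kernels}"
    [ -d "$base/$1" ]
}

# Print a one-line report and list the trees that are installed when the
# expected one is missing.
# Usage: report_kernel_tree <kernel-release> [base-dir]
report_kernel_tree() {
    base="${2:-/usr/src/kernels}"
    if kernel_tree_present "$1" "$base"; then
        echo "ok: build tree found for $1"
    else
        echo "missing: no build tree for $1; installed: $(ls "$base" 2>/dev/null | tr '\n' ' ')"
        return 1
    fi
}
```

Calling report_kernel_tree "$(uname -r)" before compiling shows whether a kernel update and reboot are needed first.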

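3. Resource Configuration

Define the replication resource in /etc/drbd.conf. Clear existing content and apply the following. Because the build above used --prefix=/opt/drbd without --sysconfdir=/etc, drbdadm may look for its configuration under /opt/drbd/etc instead; if it reports a missing configuration file, place or symlink drbd.conf there. The host names in the on sections must match the output of uname -n on each node.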

global { usage-count no; }
common { syncer { rate 100M; } }

resource cluster_data {
  protocol C;
  startup {
    wfc-timeout 0;
    degr-wfc-timeout 60;
  }
  net {
    timeout 60;
    connect-int 10;
    ping-int 10;
    max-buffers 2048;
  }
  disk { on-io-error detach; }

  on ha-node-01.local {
    device    /dev/drbd0;
    disk      /dev/vdb1;
    address   10.10.10.10:7789;
    meta-disk internal;
  }
  on ha-node-02.local {
    device    /dev/drbd0;
    disk      /dev/vdb1;
    address   10.10.10.11:7789;
    meta-disk internal;
  }
}
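
DRBD matches each on section against the local node name, so a mismatch between the configuration and uname -n is a common reason for a resource failing to start. The following sketch lists the host names in a configuration file and checks one of them. The function names are hypothetical, and the parser assumes the one-line "on <host> {" layout used above.

```shell
# List the host names used in "on <host> {" sections of a DRBD config file.
# Usage: drbd_conf_hosts <config-file>
drbd_conf_hosts() {
    awk '$1 == "on" && $3 == "{" { print $2 }' "$1"
}

# Succeed if the given node name has its own "on" section.
# Usage: drbd_conf_has_host <config-file> <node-name>
drbd_conf_has_host() {
    drbd_conf_hosts "$1" | grep -Fxq -- "$2"
}
```

For example, drbd_conf_has_host /etc/drbd.conf "$(uname -n)" || echo "no section for this node" can be run on both nodes after editing the file.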

4. Initialize and Activate Replication

Execute on both nodes:

mknod /dev/drbd0 b 147 0
drbdadm create-md cluster_data
service drbd start

Force Node A to become the primary source and format the device:

# Run ONLY on ha-node-01.local
drbdadm primary --force cluster_data
mkfs.ext4 /dev/drbd0
mount /dev/drbd0 /var/lib/ha-data

Verify synchronization status using cat /proc/drbd. During the initial sync the disk state reads UpToDate/Inconsistent; once it completes, the status line should read cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate.
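
For scripted checks, the status fields can be pulled out of /proc/drbd instead of read by eye. This is a minimal sketch assuming the 8.4 status line layout (a line starting with the minor number, followed by cs:, ro: and ds: fields); the function names are hypothetical.

```shell
# Print one status field (cs, ro or ds) for a DRBD minor from
# /proc/drbd-style text. Returns non-zero if the field is not found.
# Usage: drbd_status_field <file> <minor> <field>
drbd_status_field() {
    awk -v minor="$2" -v key="$3" '
        $1 == (minor ":") {
            for (i = 2; i <= NF; i++)
                if (index($i, key ":") == 1) {
                    print substr($i, length(key) + 2)
                    found = 1
                    exit
                }
        }
        END { exit !found }
    ' "$1"
}

# Succeed once the minor is connected and both disks are up to date.
# Usage: drbd_in_sync <file> <minor>
drbd_in_sync() {
    [ "$(drbd_status_field "$1" "$2" cs)" = "Connected" ] &&
    [ "$(drbd_status_field "$1" "$2" ds)" = "UpToDate/UpToDate" ]
}
```

A loop such as until drbd_in_sync /proc/drbd 0; do sleep 10; done can then block until the initial synchronization has finished.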

Application Data Migration

Relocate application directories to the replicated volume and create symbolic links. Perform this on the primary node while the DRBD device is mounted.

APP_BASE="/opt/custom_app"
HA_STORE="/var/lib/ha-data"

# Move directories
mv ${APP_BASE}/storage ${HA_STORE}/
mv ${APP_BASE}/database ${HA_STORE}/

# Create symlinks
ln -s ${HA_STORE}/storage ${APP_BASE}/storage
ln -s ${HA_STORE}/database ${APP_BASE}/database

# Fix ownership
chown -R appuser:appgroup ${HA_STORE}/storage
chown -R dbuser:dbgroup ${HA_STORE}/database

On the secondary node, rename original directories and create identical symlinks pointing to the mount path. Do not mount the DRBD device manually on the secondary node.
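
The move-and-link step on the primary node can be wrapped in a helper that is safe to run twice. This is a sketch only: relocate_dir is a hypothetical name, and it does not check that the DRBD device is mounted, so that remains the operator's responsibility.

```shell
# Move <src> under <dest-root> and leave a symlink at the old location.
# Does nothing if <src> is already a symlink, and refuses to overwrite
# an existing target.
# Usage: relocate_dir <src> <dest-root>
relocate_dir() {
    src="$1"
    dest="$2/$(basename "$1")"
    if [ -L "$src" ]; then
        echo "skip: $src is already a symlink"
        return 0
    fi
    if [ -e "$dest" ]; then
        echo "refuse: $dest already exists" >&2
        return 1
    fi
    mv "$src" "$dest" && ln -s "$dest" "$src"
}
```

With the paths from this article, the calls would be relocate_dir /opt/custom_app/storage /var/lib/ha-data and relocate_dir /opt/custom_app/database /var/lib/ha-data.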

Heartbeat Configuration and Resource Management

1. Install Heartbeat

yum install -y epel-release
yum install -y heartbeat heartbeat-libs

2. Cluster Communication (/etc/ha.d/ha.cf)

Configure node messaging and failure detection thresholds.

logfile /var/log/ha-cluster.log
keepalive 2
deadtime 15
warntime 5
initdead 30
udpport 694
ucast eth1 10.10.10.11  # On Node A, point to Node B's sync IP
ucast eth0 10.0.50.11   # Fallback public interface
auto_failback off
node ha-node-01.local
node ha-node-02.local
ping 10.0.50.1
respawn hacluster /usr/lib64/heartbeat/ipfail
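
The timer values only make sense relative to one another: keepalive should be shorter than warntime, warntime shorter than deadtime, and initdead is conventionally at least twice deadtime. The sketch below checks those relationships in an ha.cf-style file. The function names are hypothetical, and it assumes the values are plain integers in seconds, as in the example above.

```shell
# Print the value of a single-valued directive from an ha.cf-style file.
# Usage: ha_cf_value <file> <directive>
ha_cf_value() {
    awk -v key="$2" '
        { sub(/#.*/, "") }                 # strip comments
        $1 == key { print $2; exit }
    ' "$1"
}

# Succeed if keepalive < warntime < deadtime and initdead >= 2 * deadtime.
# Usage: check_ha_timers <file>
check_ha_timers() {
    keepalive=$(ha_cf_value "$1" keepalive)
    warntime=$(ha_cf_value "$1" warntime)
    deadtime=$(ha_cf_value "$1" deadtime)
    initdead=$(ha_cf_value "$1" initdead)
    [ "$keepalive" -lt "$warntime" ] &&
    [ "$warntime" -lt "$deadtime" ] &&
    [ "$initdead" -ge $((deadtime * 2)) ]
}
```

check_ha_timers /etc/ha.d/ha.cf returns 0 for the values used in this article (2, 5, 15 and 30).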

3. Authentication (/etc/ha.d/authkeys)

echo -e "auth 2\n2 sha1 CLUSTER_SECRET_KEY_HERE" > /etc/ha.d/authkeys
chmod 600 /etc/ha.d/authkeys
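
Rather than typing a secret by hand, the key can be derived from random data. This is a sketch assuming GNU coreutils (head -c and sha1sum) are available; write_authkeys is a hypothetical helper, and it keeps the key index 2 used above.

```shell
# Write an authkeys file with a random 40-hex-digit key and owner-only
# permissions.
# Usage: write_authkeys <path>
write_authkeys() {
    key=$(head -c 32 /dev/urandom | sha1sum | cut -d' ' -f1)
    # Create the file with a restrictive umask so the key is never
    # world-readable, even briefly.
    ( umask 077; printf 'auth 2\n2 sha1 %s\n' "$key" > "$1" ) || return 1
    chmod 600 "$1"
}
```

Generate the file once, for example with write_authkeys /etc/ha.d/authkeys on Node A, then copy the same file to Node B; both nodes need identical keys.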

4. Resource Definition (/etc/ha.d/haresources)

Define the failover chain. Heartbeat starts the resources from left to right and stops them in reverse order.

ha-node-01.local IPaddr::10.0.50.100/24/eth0 drbddisk::cluster_data Filesystem::/dev/drbd0::/var/lib/ha-data::ext4 custom_app_service
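
To see the sequence Heartbeat will follow, the line can be split into its start and stop order. This is an illustrative sketch; the function names are hypothetical, and the first field of each line is treated as the preferred node name.

```shell
# Print resources in start order (left to right), skipping the node name.
# Usage: haresources_start_order <haresources-file>
haresources_start_order() {
    awk 'NF && $1 !~ /^#/ { for (i = 2; i <= NF; i++) print $i }' "$1"
}

# Print resources in stop order (right to left).
# Usage: haresources_stop_order <haresources-file>
haresources_stop_order() {
    awk 'NF && $1 !~ /^#/ { for (i = NF; i >= 2; i--) print $i }' "$1"
}
```

For the line above, the start order begins with the IPaddr entry and ends with custom_app_service; the stop order is the reverse, so the application is stopped before the filesystem is unmounted and DRBD is demoted.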

5. Custom DRBD Resource Agent

Heartbeat requires a legacy R1 script to manage DRBD roles. Create /etc/ha.d/resource.d/drbddisk:

#!/bin/bash
# Heartbeat R1 Agent for DRBD Role Management

DRBD_CMD="/sbin/drbdadm"
RES_NAME="${1:-all}"
ACTION="${2:-$1}"

if [[ "$RES_NAME" == "start" || "$RES_NAME" == "stop" || "$RES_NAME" == "status" ]]; then
    ACTION="$RES_NAME"
    RES_NAME="all"
fi

handle_action() {
    case "$ACTION" in
        start)
            for i in {1..5}; do
                $DRBD_CMD primary "$RES_NAME" && return 0
                sleep 2
            done
            return 1
            ;;
        stop)
            for i in {1..3}; do
                $DRBD_CMD secondary "$RES_NAME" && return 0
                sleep 1
            done
            return 1
            ;;
        status)
            if [[ "$RES_NAME" == "all" ]]; then
                echo "Resource identifier required for status check."
                return 10
            fi
            CURRENT_ROLE=$($DRBD_CMD role "$RES_NAME" | cut -d'/' -f1)
            case "$CURRENT_ROLE" in
                Primary)   echo "running (Primary)"; return 0 ;;
                Secondary) echo "stopped (Secondary)"; return 3 ;;
                *)         echo "state unknown ($CURRENT_ROLE)"; return 4 ;;
            esac
            ;;
        *)
            echo "Usage: $0 <resource> {start|stop|status}"
            return 1
            ;;
    esac
}

handle_action
exit $?

Make it executable: chmod 755 /etc/ha.d/resource.d/drbddisk

6. Service Control and Startup

Disable standalone application services to let Heartbeat manage them:

SVC_LIST="custom_app_service app_cache app_queue"
for svc in $SVC_LIST; do
    /sbin/service "$svc" stop
    /sbin/chkconfig "$svc" off
done

Start Heartbeat on the primary node first, then the secondary:

service heartbeat start
chkconfig heartbeat on

Verify VIP assignment using ip addr show eth0 and test connectivity to 10.0.50.100.
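
The VIP check can also be scripted by filtering the output of ip addr show. This is a small sketch; has_inet_addr is a hypothetical helper that reads the command output on standard input.

```shell
# Succeed if the given IPv4 address appears on an "inet" line of
# `ip addr show` output supplied on stdin.
# Usage: ip addr show eth0 | has_inet_addr <address>
has_inet_addr() {
    awk -v ip="$1" '
        $1 == "inet" { split($2, a, "/"); if (a[1] == ip) found = 1 }
        END          { exit !found }
    '
}
```

On the active node, ip addr show eth0 | has_inet_addr 10.0.50.100 should return 0; on the standby node it should return non-zero.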

Failover Validation and Split-Brain Recovery

Testing Automatic Failover

Simulate a hard failure by cutting power to Node A. Node B should detect the missed heartbeats, promote itself to Primary, mount /dev/drbd0, assign the VIP, and start application services. Check /var/log/ha-cluster.log for resource acquisition events. When Node A recovers, it will join as Secondary and sync missing blocks.

Resolving Split-Brain Conditions

Split-brain occurs when both nodes believe they are Primary, usually due to network partitioning or simultaneous reboots. DRBD will halt replication and show StandAlone or WFConnection states.

To recover, identify the node with outdated or corrupted data (typically the former secondary) and discard its changes:

# On the node to become Secondary (its local changes are discarded)
drbdadm disconnect cluster_data
drbdadm secondary cluster_data
drbdadm connect --discard-my-data cluster_data

# On the node retaining authoritative data (Primary)
drbdadm connect cluster_data

Monitor resynchronization progress via watch cat /proc/drbd. The state will shift to SyncSource/SyncTarget and eventually return to UpToDate/UpToDate. Prevent future occurrences by ensuring dedicated heartbeat links, configuring proper deadtime values, and avoiding simultaneous cluster reboots.
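
Split-brain is easier to catch early if the connection state and the kernel log are checked automatically. The sketch below rests on two assumptions: that the DRBD kernel module logs a line containing "Split-Brain detected" when it drops the connection, and that kernel messages land in a file such as /var/log/messages. Both function names are hypothetical.

```shell
# Map a DRBD connection state to a coarse health label.
# Usage: classify_cstate <connection-state>
classify_cstate() {
    case "$1" in
        Connected)               echo healthy ;;
        SyncSource|SyncTarget)   echo resyncing ;;
        StandAlone|WFConnection) echo disconnected ;;
        *)                       echo other ;;
    esac
}

# Print the number of log lines that report a split-brain.
# Usage: count_split_brain_events <log-file>
count_split_brain_events() {
    grep -ci 'split-brain detected' "$1"
}
```

A monitoring job could alert when classify_cstate reports disconnected for longer than the configured deadtime, or when count_split_brain_events /var/log/messages prints a non-zero count.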
