GlusterFS Distributed Storage Architecture and Volume Management
Distributed Storage Concepts
Distributed storage aggregates multiple independent storage servers into a unified resource pool, presenting it over the network. This shared resource can manifest as file-level access (similar to NAS), block-level access (similar to SAN), or object-level access via API gateways. Prominent open-source platforms in this domain include Ceph, GlusterFS, HDFS, and MinIO.
Key benefits of distributed architectures include:
- Horizontal scalability reaching petabyte capacities.
- Enhanced read/write throughput and data redundancy.
- Resilience against single-node failures.
- Cost efficiency by utilizing commodity hardware instead of expensive proprietary SANs.
GlusterFS Overview
GlusterFS is an open-source, scale-out file-level storage system. It consolidates physically dispersed storage resources into a single global namespace, capable of handling petabytes of data and thousands of concurrent clients.
A defining characteristic of GlusterFS is the absence of a centralized metadata server, which removes the metadata bottleneck and improves performance, reliability, and stability. The architecture comprises storage servers exporting Bricks, clients, and optional NFS/Samba gateways. Most of the work (volume management, I/O scheduling, file location, and caching) happens on the client side, which mounts the remote cluster locally through the FUSE (Filesystem in Userspace) kernel module.
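Since the native client accesses volumes through FUSE, a quick optional sanity check on any machine that will mount a volume is to confirm the kernel module is available:
[root@client-01 ~]# modprobe fuse
[root@client-01 ~]# lsmod | grep fuse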
RAID Fundamentals
Standard RAID configurations underpin many distributed storage concepts:
- RAID 0: Striping. Excellent performance but zero fault tolerance. 100% capacity utilization.
- RAID 1: Mirroring. High reliability, decent read speed, slower writes. 50% capacity utilization.
- RAID 10: Mirrored stripes. Combines RAID 0 performance with RAID 1 redundancy.
- RAID 5: Block-level striping with distributed parity. Tolerates one disk failure. Capacity is (N-1)/N.
- RAID 6: Dual distributed parity. Tolerates two simultaneous disk failures.
In production environments, RAID 5 and RAID 10 are the most prevalent choices.
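As a quick worked example, six 2 TB disks yield 12 TB in RAID 0, 10 TB in RAID 5 ((6-1) x 2 TB, one disk's worth of parity), 8 TB in RAID 6, and 6 TB in RAID 10, trading capacity for increasing fault tolerance.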
GlusterFS Terminology
- Brick: The fundamental storage unit, a directory exported by a server in the trusted storage pool, written as hostname:directory_path (for example, storage-01:/mnt/brick_a).
- Volume: A logical aggregation of multiple Bricks, presented as a single mountable entity.
- FUSE: A loadable kernel module allowing user-space filesystem implementations without kernel modifications.
Volume Architectures
GlusterFS constructs volumes using distinct data placement strategies:
- Distribute Volume: The default mode. Files are placed on Bricks using a hash algorithm. It maximizes capacity but offers no redundancy. A single Brick failure results in data loss for files residing on that Brick.
- Replica Volume: Mirrors files across multiple Bricks (analogous to RAID 1). It improves read performance and fault tolerance but reduces write speed and usable capacity.
- Disperse Volume: Stripes data across multiple Bricks with redundancy chunks computed by erasure coding (similar to RAID 5/6). For example, a 4-Brick volume with 1 redundancy unit can survive one Brick failure while utilizing 75% of total capacity.
- Distributed Replica Volume: Combines hashing across multiple replica sets. Files are hashed to a replica set, then mirrored within that set; which Bricks form each set is determined by their order at creation time (see the note after this list).
Note: Stripe volumes were deprecated in GlusterFS 6.0 and are no longer recommended.
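One placement detail worth keeping in mind: for replica and distributed replica volumes, GlusterFS groups Bricks in the order they are listed at creation time. With replica 2 and four Bricks b1 b2 b3 b4, the pairs are (b1, b2) and (b3, b4); each file is hashed to one pair and then mirrored within it, so paired Bricks should sit on different servers.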
Cluster Deployment
We will deploy a cluster using four storage nodes and one client.
Lab Topology
| System | IP Address | Role |
|---|---|---|
| client-01 | 10.10.10.10 | Consumer |
| storage-01 | 10.10.10.11 | Storage Node |
| storage-02 | 10.10.10.12 | Storage Node |
| storage-03 | 10.10.10.13 | Storage Node |
| storage-04 | 10.10.10.14 | Storage Node |
Ensure firewalls and SELinux are configured appropriately, hostnames resolve on every node (via DNS or /etc/hosts), and time synchronization is active across all nodes.
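One common way to satisfy these prerequisites in a lab (not suitable for production, where the required ports should be opened instead) is to disable the firewall and SELinux and add static name resolution on every machine:
[root@storage-01 ~]# systemctl disable --now firewalld
[root@storage-01 ~]# setenforce 0
[root@storage-01 ~]# sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
[root@storage-01 ~]# cat >> /etc/hosts <<EOF
10.10.10.10 client-01
10.10.10.11 storage-01
10.10.10.12 storage-02
10.10.10.13 storage-03
10.10.10.14 storage-04
EOF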
Time Synchronization
On the time server (client-01 doubles as the time source in this lab), configure chrony:
[root@client-01 ~]# vim /etc/chrony.conf
server 10.10.10.10 iburst
allow 10.10.10.0/24
local stratum 10
[root@client-01 ~]# systemctl restart chronyd
On all storage nodes, point to the time server:
[root@storage-01 ~]# vim /etc/chrony.conf
server 10.10.10.10 iburst
[root@storage-01 ~]# systemctl restart chronyd
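To confirm that a storage node is tracking the time server, chronyc can be queried (10.10.10.10 should appear as the selected source):
[root@storage-01 ~]# chronyc sources -v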
Software Installation
Install the GlusterFS server packages on all storage nodes:
[root@storage-01 ~]# dnf install centos-release-gluster -y
[root@storage-01 ~]# dnf install glusterfs-server -y
[root@storage-01 ~]# systemctl enable --now glusterd
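Before forming the pool, it is worth confirming on each node that the daemon is running and noting the installed version:
[root@storage-01 ~]# systemctl status glusterd
[root@storage-01 ~]# gluster --version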
Trusted Storage Pool Formation
From any single node, probe the others to form the cluster. You do not need to probe from every node.
[root@storage-01 ~]# gluster peer probe storage-02
peer probe: success.
[root@storage-01 ~]# gluster peer probe storage-03
peer probe: success.
[root@storage-01 ~]# gluster peer probe storage-04
peer probe: success.
Verify the pool status:
[root@storage-01 ~]# gluster pool list
UUID Hostname State
e3b1c5a2-6b89-40fb-af53-d2d8a56cd41e storage-02 Connected
7396a19d-a2a7-4b27-86d3-12c89ac4df39 storage-03 Connected
b2ea8b19-658c-40ec-84b4-6568c627eefd storage-04 Connected
Brick Preparation
Create the backend directories on all storage nodes. Ideally, each Brick should be a separately mounted filesystem, but for testing, directories on the root partition can be used together with the force flag.
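As a sketch of the production-style layout, assuming each node has a spare disk /dev/sdb (a hypothetical device name), a dedicated XFS filesystem would be created and mounted as the Brick:
[root@storage-01 ~]# mkfs.xfs -i size=512 /dev/sdb
[root@storage-01 ~]# mkdir -p /mnt/brick_a
[root@storage-01 ~]# echo '/dev/sdb /mnt/brick_a xfs defaults 0 0' >> /etc/fstab
[root@storage-01 ~]# mount -a
For this lab, plain directories on the root filesystem are sufficient: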
[root@storage-01 ~]# mkdir -p /mnt/brick_a
[root@storage-01 ~]# mkdir -p /mnt/brick_b
[root@storage-01 ~]# mkdir -p /mnt/brick_c
[root@storage-01 ~]# mkdir -p /mnt/brick_d
Volume Creation
Replica Volume (vol-mirror)
Creates a 2-way mirror across two nodes.
[root@storage-01 ~]# gluster volume create vol-mirror replica 2 storage-01:/mnt/brick_a storage-02:/mnt/brick_a force
volume create: vol-mirror: success: please start the volume to access data
Distribute Volume (vol-spread)
Distributes files across three nodes based on hashing.
[root@storage-01 ~]# gluster volume create vol-spread storage-01:/mnt/brick_b storage-02:/mnt/brick_b storage-03:/mnt/brick_b force
volume create: vol-spread: success: please start the volume to access data
Disperse Volume (vol-erasure)
Erasure coding across 4 nodes with 1 redundancy unit (similar to RAID 5).
[root@storage-01 ~]# gluster volume create vol-erasure disperse 4 redundancy 1 storage-01:/mnt/brick_c storage-02:/mnt/brick_c storage-03:/mnt/brick_c storage-04:/mnt/brick_c force
volume create: vol-erasure: success: please start the volume to access data
Distributed Replica Volume (vol-spread-mirror)
Two replica sets distributed across four nodes.
[root@storage-01 ~]# gluster volume create vol-spread-mirror replica 2 storage-01:/mnt/brick_d storage-02:/mnt/brick_d storage-03:/mnt/brick_d storage-04:/mnt/brick_d force
volume create: vol-spread-mirror: success: please start the volume to access data
Starting Volumes
[root@storage-01 ~]# gluster volume start vol-mirror
volume start: vol-mirror: success
[root@storage-01 ~]# gluster volume start vol-spread
volume start: vol-spread: success
[root@storage-01 ~]# gluster volume start vol-erasure
volume start: vol-erasure: success
[root@storage-01 ~]# gluster volume start vol-spread-mirror
volume start: vol-spread-mirror: success
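Before moving to the client, optionally confirm that every Brick process is online:
[root@storage-01 ~]# gluster volume status
[root@storage-01 ~]# gluster volume info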
Client Configuration and Verification
Install the client packages on the consumer machine:
[root@client-01 ~]# dnf install glusterfs-fuse -y
Create mount points and attach the volumes:
[root@client-01 ~]# mkdir -p /data/mnt1 /data/mnt2 /data/mnt3 /data/mnt4
[root@client-01 ~]# mount -t glusterfs storage-01:/vol-mirror /data/mnt1
[root@client-01 ~]# mount -t glusterfs storage-01:/vol-spread /data/mnt2
[root@client-01 ~]# mount -t glusterfs storage-01:/vol-erasure /data/mnt3
[root@client-01 ~]# mount -t glusterfs storage-01:/vol-spread-mirror /data/mnt4
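To make the mounts persist across reboots, equivalent entries can be added to /etc/fstab on the client; the _netdev option defers mounting until the network is up (one line per volume, shown here for vol-mirror):
storage-01:/vol-mirror /data/mnt1 glusterfs defaults,_netdev 0 0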
Generate test files on each mount point:
[root@client-01 ~]# dd if=/dev/zero of=/data/mnt1/file.bin bs=1M count=100
[root@client-01 ~]# dd if=/dev/zero of=/data/mnt2/file.bin bs=1M count=100
[root@client-01 ~]# dd if=/dev/zero of=/data/mnt3/file.bin bs=1M count=100
[root@client-01 ~]# dd if=/dev/zero of=/data/mnt4/file.bin bs=1M count=100
Verification:
- vol-mirror: The 100MB file appears identically on both storage-01 and storage-02 inside /mnt/brick_a.
- vol-spread: The 100MB file exists entirely on one of the three nodes inside /mnt/brick_b, while the others remain empty.
- vol-erasure: The file is split, with roughly 33MB present on each of the four nodes inside /mnt/brick_c.
- vol-spread-mirror: The file is hashed to one replica pair. For instance, it might appear on storage-03 and storage-04 inside /mnt/brick_d, while storage-01 and storage-02 remain empty.
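These placements can be confirmed directly on the storage nodes by listing the Brick directories, for example:
[root@storage-01 ~]# ls -lh /mnt/brick_a/
[root@storage-02 ~]# ls -lh /mnt/brick_b/
[root@storage-03 ~]# du -sh /mnt/brick_c/file.bin
(For vol-spread, which node actually holds the file depends on the hash of its name.)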
Volume Lifecycle Management
Deleting a Volume
Unmount the volume from all clients first, then stop and delete it from any storage node:
[root@client-01 ~]# umount /data/mnt1
[root@storage-01 ~]# gluster volume stop vol-mirror
[root@storage-01 ~]# gluster volume delete vol-mirror
Expanding a Distribute Volume
Add a new Brick to an existing distributed volume to increase capacity:
[root@storage-01 ~]# gluster volume add-brick vol-spread storage-04:/mnt/brick_b force
volume add-brick: success
Verify the expansion:
[root@storage-01 ~]# gluster volume info vol-spread
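Adding a Brick does not automatically move existing files onto it; a rebalance redistributes data across the enlarged layout:
[root@storage-01 ~]# gluster volume rebalance vol-spread start
[root@storage-01 ~]# gluster volume rebalance vol-spread status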
Shrinking a Distribute Volume
Remove an empty Brick from a distributed volume. Removing a Brick that still holds data requires a migration step (shown after this example); the force flag bypasses migration, so any files on that Brick are lost.
[root@storage-01 ~]# gluster volume remove-brick vol-spread storage-04:/mnt/brick_b force
volume remove-brick commit force: success
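When the Brick still holds data, the graceful workflow migrates files off it before removal; commit only once the status shows the migration has completed:
[root@storage-01 ~]# gluster volume remove-brick vol-spread storage-04:/mnt/brick_b start
[root@storage-01 ~]# gluster volume remove-brick vol-spread storage-04:/mnt/brick_b status
[root@storage-01 ~]# gluster volume remove-brick vol-spread storage-04:/mnt/brick_b commit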
Replacing a Brick
Swap a faulty Brick with a new one in a replica volume:
[root@storage-01 ~]# gluster volume replace-brick vol-mirror storage-02:/mnt/brick_a storage-04:/mnt/brick_a commit force
volume replace-brick: success
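The replacement Brick is then populated from the surviving replica by the self-heal daemon; its progress can be checked with:
[root@storage-01 ~]# gluster volume heal vol-mirror info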
Adding Nodes to the Cluster
To integrate a fifth storage server, install the software, start the daemon, and probe it from an existing node:
[root@storage-05 ~]# systemctl enable --now glusterd
[root@storage-01 ~]# gluster peer probe storage-05
peer probe: success.
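Once storage-05 has joined the trusted pool, its Bricks can be used like any other, for example to expand an existing volume (assuming /mnt/brick_b has been created on storage-05):
[root@storage-01 ~]# gluster volume add-brick vol-spread storage-05:/mnt/brick_b force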