GlusterFS Distributed Storage Architecture and Volume Management

Distributed Storage Concepts

Distributed storage aggregates multiple independent storage servers into a unified resource pool, presenting it over the network. This shared resource can manifest as file-level access (similar to NAS), block-level access (similar to SAN), or object-level access via API gateways. Prominent open-source platforms in this domain include Ceph, GlusterFS, HDFS, and MinIO.

Key benefits of distributed architectures include:

  • Horizontal scalability reaching petabyte capacities.
  • Enhanced read/write throughput and data redundancy.
  • Resilience against single-node failures.
  • Cost efficiency by utilizing commodity hardware instead of expensive proprietary SANs.

GlusterFS Overview

GlusterFS is an open-source, scale-out file-level storage system. It consolidates physically dispersed storage resources into a single global namespace, capable of handling petabytes of data and thousands of concurrent clients.

A defining characteristic of GlusterFS is its absence of a centralized metadata server. This eliminates metadata bottlenecks, significantly boosting performance, reliability, and stability. The architecture comprises storage servers (Bricks), clients, and optional NFS/Samba gateways. Most of the intelligence sits on the client side: clients handle volume management, I/O scheduling, file location, and caching, and mount the remote cluster locally through the FUSE (Filesystem in Userspace) kernel module.

RAID Fundamentals

Standard RAID configurations underpin many distributed storage concepts:

  • RAID 0: Striping. Excellent performance but zero fault tolerance. 100% capacity utilization.
  • RAID 1: Mirroring. High reliability, decent read speed, slower writes. 50% capacity utilization.
  • RAID 10: Mirrored stripes. Combines RAID 0 performance with RAID 1 redundancy.
  • RAID 5: Block-level striping with distributed parity. Tolerates one disk failure. Capacity is (N-1)/N.
  • RAID 6: Dual distributed parity. Tolerates two simultaneous disk failures.

In production environments, RAID 5 and RAID 10 are the most prevalent choices.

GlusterFS Terminology

  • Brick: The fundamental storage unit, an export directory on a server in the trusted storage pool, identified as hostname:directory_path.
  • Volume: A logical aggregation of multiple Bricks, presented as a single mountable entity.
  • FUSE: A loadable kernel module allowing user-space filesystem implementations without kernel modifications.

Volume Architectures

GlusterFS constructs volumes using distinct data placement strategies:

  • Distribute Volume: The default mode. Files are placed on Bricks using a hash algorithm. It maximizes capacity but offers no redundancy. A single Brick failure results in data loss for files residing on that Brick.
  • Replica Volume: Mirrors files across multiple Bricks (analogous to RAID 1). It improves read performance and fault tolerance but reduces write speed and usable capacity.
  • Disperse Volume: Stripes data across multiple Bricks with dedicated redundancy chunks (similar to RAID 5/6). For example, a 4-Brick volume with 1 redundancy unit can survive one Brick failure while utilizing 75% of total capacity.
  • Distributed Replica Volume: Combines hashing across multiple replica sets. Files are hashed to a replica set, then mirrored within that set.

Note: Stripe volumes were deprecated in GlusterFS 6.0 and are no longer recommended.

Cluster Deployment

We will deploy a cluster using four storage nodes and one client.

Lab Topology

System        IP Address      Role
client-01     10.10.10.10     Consumer
storage-01    10.10.10.11     Storage Node
storage-02    10.10.10.12     Storage Node
storage-03    10.10.10.13     Storage Node
storage-04    10.10.10.14     Storage Node

Ensure firewalls, SELinux, and network configurations are properly set, and time synchronization is active across all nodes.
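
For a quick lab deployment, the simplest (though not production-grade) approach is to disable the firewall and switch SELinux to permissive mode on every node; in production you would instead open the GlusterFS management port (24007/tcp) and the Brick ports. A minimal lab-only sketch, to be repeated on each storage node and the client:

[root@storage-01 ~]# systemctl disable --now firewalld
[root@storage-01 ~]# setenforce 0
[root@storage-01 ~]# sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config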

Time Synchronization

On the time server, configure chrony:

[root@client-01 ~]# vim /etc/chrony.conf
allow 10.10.10.0/24
local stratum 10
[root@client-01 ~]# systemctl restart chronyd

On all storage nodes, point to the time server:

[root@storage-01 ~]# vim /etc/chrony.conf
server 10.10.10.10 iburst
[root@storage-01 ~]# systemctl restart chronyd
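
To confirm that a storage node is actually synchronizing against 10.10.10.10, check chrony's source list; the currently selected source is marked with ^*:

[root@storage-01 ~]# chronyc sources
[root@storage-01 ~]# chronyc tracking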

Software Installation

Install the GlusterFS server packages on all storage nodes:

[root@storage-01 ~]# dnf install centos-release-gluster -y
[root@storage-01 ~]# dnf install glusterfs-server -y
[root@storage-01 ~]# systemctl enable --now glusterd
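
As a quick sanity check before forming the pool, confirm the daemon is active and note the installed version (the exact version depends on what the centos-release-gluster repository provides):

[root@storage-01 ~]# systemctl is-active glusterd
[root@storage-01 ~]# gluster --version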

Trusted Storage Pool Formation

From any single node, probe the others to form the cluster. You do not need to probe from every node.

[root@storage-01 ~]# gluster peer probe storage-02
peer probe: success.
[root@storage-01 ~]# gluster peer probe storage-03
peer probe: success.
[root@storage-01 ~]# gluster peer probe storage-04
peer probe: success.

Verify the pool status:

[root@storage-01 ~]# gluster pool list
UUID                                  Hostname      State
e3b1c5a2-6b89-40fb-af53-d2d8a56cd41e  storage-02    Connected
7396a19d-a2a7-4b27-86d3-12c89ac4df39  storage-03    Connected
b2ea8b19-658c-40ec-84b4-6568c627eefd  storage-04    Connected
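
gluster peer status gives a similar view from the local node's perspective; note that it lists only the other peers, not the node it is run on:

[root@storage-01 ~]# gluster peer status
Number of Peers: 3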

Brick Preparation

Create the backend directories on all storage nodes. Ideally, each Brick should be a separately mounted filesystem (for example, an XFS partition mounted at the Brick path); for testing, directories on the root filesystem can be used by passing the force flag during volume creation.

[root@storage-01 ~]# mkdir -p /mnt/brick_a
[root@storage-01 ~]# mkdir -p /mnt/brick_b
[root@storage-01 ~]# mkdir -p /mnt/brick_c
[root@storage-01 ~]# mkdir -p /mnt/brick_d
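
If a dedicated disk is available for the Bricks, a more production-like preparation is to format it with XFS and mount it at the Brick path before creating volumes; the device name /dev/sdb below is only an assumption for illustration:

[root@storage-01 ~]# mkfs.xfs -i size=512 /dev/sdb
[root@storage-01 ~]# mount /dev/sdb /mnt/brick_a
[root@storage-01 ~]# echo '/dev/sdb /mnt/brick_a xfs defaults 0 0' >> /etc/fstab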

Volume Creation

Replica Volume (vol-mirror)

Creates a 2-way mirror across two nodes.

[root@storage-01 ~]# gluster volume create vol-mirror replica 2 storage-01:/mnt/brick_a storage-02:/mnt/brick_a force
volume create: vol-mirror: success: please start the volume to access data

Distribute Volume (vol-spread)

Distributes files across three nodes based on hashing.

[root@storage-01 ~]# gluster volume create vol-spread storage-01:/mnt/brick_b storage-02:/mnt/brick_b storage-03:/mnt/brick_b force
volume create: vol-spread: success: please start the volume to access data

Disperse Volume (vol-erasure)

Erasure coding across 4 nodes with 1 redundancy unit (similar to RAID 5).

[root@storage-01 ~]# gluster volume create vol-erasure disperse 4 redundancy 1 storage-01:/mnt/brick_c storage-02:/mnt/brick_c storage-03:/mnt/brick_c storage-04:/mnt/brick_c force
volume create: vol-erasure: success: please start the volume to access data

Distributed Replica Volume (vol-spread-mirror)

Two replica sets distributed across four nodes.

[root@storage-01 ~]# gluster volume create vol-spread-mirror replica 2 storage-01:/mnt/brick_d storage-02:/mnt/brick_d storage-03:/mnt/brick_d storage-04:/mnt/brick_d force
volume create: vol-spread-mirror: success: please start the volume to access data

Starting Volumes

[root@storage-01 ~]# gluster volume start vol-mirror
volume start: vol-mirror: success
[root@storage-01 ~]# gluster volume start vol-spread
volume start: vol-spread: success
[root@storage-01 ~]# gluster volume start vol-erasure
volume start: vol-erasure: success
[root@storage-01 ~]# gluster volume start vol-spread-mirror
volume start: vol-spread-mirror: success
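
Once started, the volumes can be inspected from any node in the pool; volume info shows the type and Brick layout, while volume status shows whether each Brick process is online:

[root@storage-01 ~]# gluster volume info vol-mirror
[root@storage-01 ~]# gluster volume status vol-mirror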

Client Configuration and Verification

Install the client packages on the consumer machine:

[root@client-01 ~]# dnf install glusterfs-fuse -y

Create mount points and attach the volumes:

[root@client-01 ~]# mkdir -p /data/mnt1 /data/mnt2 /data/mnt3 /data/mnt4
[root@client-01 ~]# mount -t glusterfs storage-01:/vol-mirror /data/mnt1
[root@client-01 ~]# mount -t glusterfs storage-01:/vol-spread /data/mnt2
[root@client-01 ~]# mount -t glusterfs storage-01:/vol-erasure /data/mnt3
[root@client-01 ~]# mount -t glusterfs storage-01:/vol-spread-mirror /data/mnt4
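
To make the mounts persist across reboots, they can also be recorded in /etc/fstab. One possible layout, using _netdev to delay mounting until the network is up and backup-volfile-servers so the client can fetch the volume file from another node if storage-01 is unreachable:

storage-01:/vol-mirror  /data/mnt1  glusterfs  defaults,_netdev,backup-volfile-servers=storage-02  0 0
storage-01:/vol-spread  /data/mnt2  glusterfs  defaults,_netdev,backup-volfile-servers=storage-02  0 0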

Generate test files on each mount point:

[root@client-01 ~]# dd if=/dev/zero of=/data/mnt1/file.bin bs=1M count=100
[root@client-01 ~]# dd if=/dev/zero of=/data/mnt2/file.bin bs=1M count=100
[root@client-01 ~]# dd if=/dev/zero of=/data/mnt3/file.bin bs=1M count=100
[root@client-01 ~]# dd if=/dev/zero of=/data/mnt4/file.bin bs=1M count=100

Verification:

  • vol-mirror: The 100MB file will appear identically on both storage-01 and storage-02 inside /mnt/brick_a.
  • vol-spread: The 100MB file will exist entirely on one of the three nodes inside /mnt/brick_b, while the others remain empty.
  • vol-erasure: The file will be split, with roughly 33MB present on each of the four nodes inside /mnt/brick_c.
  • vol-spread-mirror: The file will be hashed to one replica pair. For instance, it might appear on storage-03 and storage-04 inside /mnt/brick_d, while storage-01 and storage-02 remain empty.
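
These placements can be confirmed directly on the storage nodes by listing the Brick directories and comparing sizes, for example:

[root@storage-01 ~]# ls -lh /mnt/brick_a/
[root@storage-02 ~]# ls -lh /mnt/brick_a/
[root@storage-01 ~]# du -sh /mnt/brick_c/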

Volume Lifecycle Management

Deleting a Volume

Unmount the volume from all clients first, then stop and delete it from any storage node:

[root@client-01 ~]# umount /data/mnt1
[root@storage-01 ~]# gluster volume stop vol-mirror
[root@storage-01 ~]# gluster volume delete vol-mirror

Expanding a Distribute Volume

Add a new Brick to an existing distributed volume to increase capacity:

[root@storage-01 ~]# gluster volume add-brick vol-spread storage-04:/mnt/brick_b force
volume add-brick: success

Verify the expansion:

[root@storage-01 ~]# gluster volume info vol-spread
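
Adding a Brick does not move existing files; only newly written files will hash onto it. To spread existing data across the enlarged volume, trigger a rebalance and watch it until it reports completed:

[root@storage-01 ~]# gluster volume rebalance vol-spread start
[root@storage-01 ~]# gluster volume rebalance vol-spread status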

Shrinking a Distribute Volume

Remove an empty Brick from a distributed volume. Removing Bricks containing data requires a migration step, but the force flag bypasses this, leading to data loss for files on that Brick.

[root@storage-01 ~]# gluster volume remove-brick vol-spread storage-04:/mnt/brick_b force
volume remove-brick commit force: success
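
If the Brick being removed holds data that must be kept, the safer sequence is to start the removal (which migrates files to the remaining Bricks), wait for the status to show completed, and only then commit:

[root@storage-01 ~]# gluster volume remove-brick vol-spread storage-04:/mnt/brick_b start
[root@storage-01 ~]# gluster volume remove-brick vol-spread storage-04:/mnt/brick_b status
[root@storage-01 ~]# gluster volume remove-brick vol-spread storage-04:/mnt/brick_b commit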

Replacing a Brick

Swap a faulty Brick with a new one in a replica volume:

[root@storage-01 ~]# gluster volume replace-brick vol-mirror storage-02:/mnt/brick_a storage-04:/mnt/brick_a commit force
volume replace-brick: success
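
After the swap, the self-heal daemon copies data from the surviving replica onto the new Brick; the remaining entries to heal can be checked with:

[root@storage-01 ~]# gluster volume heal vol-mirror info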

Adding Nodes to the Cluster

To integrate a fifth storage server, install the software, start the daemon, and probe it from an existing node:

[root@storage-05 ~]# systemctl enable --now glusterd
[root@storage-01 ~]# gluster peer probe storage-05
peer probe: success.
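
Probing only joins the node to the trusted pool; its capacity is not used until Bricks on it are added to a volume. A hypothetical follow-up, assuming a /mnt/brick_b directory has been prepared on storage-05:

[root@storage-01 ~]# gluster pool list
[root@storage-01 ~]# gluster volume add-brick vol-spread storage-05:/mnt/brick_b force
[root@storage-01 ~]# gluster volume rebalance vol-spread start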
