
Understanding Redis Cluster: Architecture, Data Partitioning, and High Availability


Distributed Database Fundamentals

Distributed databases address three critical requirements:

Scalability: When data volume or read/write load exceeds a single machine's capacity, load must be distributed across multiple nodes. This horizontal scaling approach handles increased demand by adding more machines rather than upgrading existing hardware.

Fault Tolerance and High Availability: When failures occur—in individual machines, multiple nodes, network components, or even entire data centers—systems must continue operating. Redundancy through multiple machines ensures that when components fail, standby systems can take over seamlessly.

Latency Considerations: Global user distribution demands geographically distributed services. Deploying data centers closer to users reduces latency by avoiding cross-continental data requests.

Data Distribution Mechanisms

Two fundamental mechanisms govern data distribution in distributed databases: partitioning (sharding) and replication.

Partitioning aims to distribute data and query load evenly across all nodes, enabling horizontal scaling.

Replication serves multiple purposes:

  • Geographic proximity to users reduces access latency
  • Continues operation despite partial component failures
  • Increases read throughput by distributing access across multiple machines

Redis Cluster implements a shared-nothing architecture with horizontal scaling, using slots for automatic data sharding across nodes.

Why Redis Cluster Mode?

Redis standalone instances achieve high availability through master-slave replication and Sentinel-based failover. However, single instances face fundamental limitations: storage capacity constraints and traffic pressure ceilings.

Sentinel mode provides failover and read/write splitting, but each Redis server maintains identical data copies, wasting memory resources. Redis 3.0 introduced Cluster mode to address these issues by implementing distributed storage with data sharding—each node stores different content.

Cluster mode satisfies two critical requirements:

  • Guaranteed data partitioning across nodes
  • Clear ownership of data: each master holds a distinct portion of the keyspace, avoiding overlap between masters, with support for online migration of hot data

Data Partitioning Strategies

Uneven Partitioning Consequences

Non-uniform data distribution causes data skew, severely degrading partition efficiency. In extreme cases, all load concentrates on a single partition node, creating a system bottleneck—the infamous "hotspot" problem.

The simplest mitigation involves randomly distributing records across nodes. The trade-off: reading specific data requires querying all nodes to locate the target, increasing overhead significantly.

Range-Based Partitioning

This approach assigns continuous key ranges to each partition. Node assignment depends on whether a key falls within a node's defined range boundaries.

Range boundaries need not be evenly distributed—distribution should reflect actual data patterns, ensuring roughly equal data volumes per range.

Advantages: Ranges enable efficient interval queries when keys are sortable. For time-series sensor data, using timestamps as keys allows rapid retrieval of all measurements within a specific month.

Limitations: Certain access patterns create hotspots. For instance, daily partitions for sensor data cause all writes to concentrate on the current day's partition during active collection periods, leaving other partitions idle.

Mitigation: Combine multiple dimensions in the key by prefixing sensor names before timestamps (for example, sensor-17:2024-05-11T10:00). This distributes write load across nodes based on sensor activity, and each sensor's time-range queries remain efficient.

Databases using range partitioning include Bigtable, HBase, RethinkDB, and MongoDB versions prior to 2.4.

Hash-Based Partitioning

This strategy determines data placement by hashing key values, avoiding the skew inherent in range-based approaches.

Key considerations:

  • Hash function design critically impacts distribution quality. Well-designed functions distribute data uniformly regardless of input patterns
  • Cryptographic strength is unnecessary: MD5 serves well as a partitioning hash, producing 128-bit values (see the sketch after this list)
  • Built-in language hash functions (like Java's Object.hashCode) may behave inconsistently across processes, making them unsuitable for partitioning
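
As a rough illustration of these points, a partition function built on a stable digest behaves identically across processes and languages. The sketch below uses Python's hashlib; the function name, sample key, and partition count are arbitrary examples, not part of any particular database.

import hashlib

def md5_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition using a stable, well-distributed digest."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the 128-bit digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_partitions

# The mapping is deterministic across processes, unlike identity-based
# hashes such as Java's Object.hashCode, which may vary between runs.
print(md5_partition("sensor-17:2024-05-11T10:00", 8))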

Advantages: Keys distribute uniformly across partitions. Boundaries can use fixed intervals or pseudo-random selection (consistent hashing).

Limitations:

  • Range query support degrades: Adjacent keys hash to different partitions, losing spatial locality
  • Hotspots persist for single-key access patterns: Celebrity user activity (millions of followers) generates concentrated access to identical keys, defeating hash-based distribution

Hotspot mitigation: Append random suffixes to hot keys. A two-digit random suffix distributes writes across 100 partition variants. Read operations must query all variants and merge results—this technique suits only truly hot keys.
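
A minimal sketch of this suffixing technique, with a plain in-memory dict standing in for the partitioned store; the key names, suffix width, and helper functions are illustrative only.

import random

SUFFIXES = 100  # two-digit random suffix -> 100 variants of the hot key

def write_hot_key(store: dict, key: str, value) -> None:
    """Spread writes for a hot key across 100 key variants."""
    variant = f"{key}:{random.randrange(SUFFIXES):02d}"
    store.setdefault(variant, []).append(value)

def read_hot_key(store: dict, key: str) -> list:
    """Reads must scatter to every variant and merge the results."""
    merged = []
    for i in range(SUFFIXES):
        merged.extend(store.get(f"{key}:{i:02d}", []))
    return merged

# Illustrative usage: writes spread over variants, reads merge them back.
store = {}
for n in range(1000):
    write_hot_key(store, "celebrity:42:timeline", n)
print(len(read_hot_key(store, "celebrity:42:timeline")))  # 1000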

Distinction: Encryption vs Message Digests

MD5 is a message digest algorithm (producing 128-bit hashes) rather than encryption. While cryptographically broken for security purposes, it remains suitable for non-cryptographic applications like partition determination.

Consistent Hashing

Traditional hash tables require remapping nearly all keys when resizing—adding a single server invalidates most key-to-server mappings. Consistent hashing minimizes this disruption.

Mechanism: Keys and servers map to a circular hash space (0 to 2³²-1). Key lookup traverses clockwise from the key's position until finding a server. When adding or removing servers, only adjacent keys on the ring require remapping.

Virtual nodes improve distribution: Each physical server creates multiple virtual node replicas distributed around the ring. This ensures that when servers join or leave, key redistribution remains statistically even across remaining servers.

When a server joins or leaves, only about K/N keys need to be remapped on average (K = total keys, N = number of servers); virtual nodes ensure this redistribution is spread evenly across the remaining servers rather than falling entirely on a single neighbor.
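
Below is a small, self-contained sketch of a consistent hash ring with virtual nodes; the MD5-based ring hash, the virtual-node count, and the node addresses are arbitrary illustrative choices.

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes      # virtual nodes per physical server
        self._ring = []           # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value: str) -> int:
        # Map a string onto the 0 .. 2**32 - 1 circle.
        return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** 32)

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def get_node(self, key: str) -> str:
        # Walk clockwise from the key's position to the first virtual node.
        pos = self._hash(key)
        idx = bisect.bisect(self._ring, (pos,)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(ring.get_node("user:1000"))
ring.add_node("10.0.0.4")   # only keys adjacent to the new virtual nodes move
print(ring.get_node("user:1000"))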

Redis Cluster Architecture

Core Functionality

Redis Cluster provides distributed database capabilities with two primary features:

  • Automatic data sharding: Distributes dataset across multiple nodes, balancing load and enabling scaling
  • Fault tolerance: Continues operation when nodes fail or lose cluster connectivity

Source Code Structure (Redis 4.0)

Three fundamental structures manage cluster state:

typedef struct clusterNode {
    mstime_t ctime;                 /* Node object creation time */
    char name[CLUSTER_NAMELEN];     /* Node name (hex node ID) */
    int flags;                      /* CLUSTER_NODE_... flags: master, slave, fail, ... */
    uint64_t configEpoch;           /* Last configEpoch observed for this node */
    unsigned char slots[CLUSTER_SLOTS/8]; /* Bitmap of slots served by this node */
    int numslots;                   /* Number of slots handled by this node */
    int numslaves;                  /* Number of slaves, if this is a master */
    struct clusterNode **slaves;    /* Pointers to slave nodes */
    struct clusterNode *slaveof;    /* Pointer to the master node (NULL if master) */
    mstime_t ping_sent;             /* Unix time of the latest pending ping */
    mstime_t pong_received;         /* Unix time the latest pong was received */
    mstime_t fail_time;             /* Unix time when the FAIL flag was set */
    mstime_t voted_time;            /* Last time we voted for a slave of this master */
    mstime_t repl_offset_time;      /* Unix time the replication offset was received */
    mstime_t orphaned_time;         /* Start time of the orphaned-master condition */
    long long repl_offset;          /* Last known replication offset of this node */
    char ip[NET_IP_STR_LEN];        /* Latest known IP address of this node */
    int port;                       /* Latest known client port */
    int cport;                      /* Latest known cluster bus port */
    clusterLink *link;              /* TCP/IP link with this node */
    list *fail_reports;             /* List of nodes reporting this node as failing */
} clusterNode;

typedef struct clusterState {
    clusterNode *myself;            /* This node */
    uint64_t currentEpoch;          /* Cluster current epoch (used in elections) */
    int state;                      /* CLUSTER_OK or CLUSTER_FAIL */
    int size;                       /* Number of masters serving at least one slot */
    dict *nodes;                    /* Known nodes: name -> clusterNode */
    dict *nodes_black_list;         /* Nodes we refuse to re-add for some time */
    clusterNode *migrating_slots_to[CLUSTER_SLOTS];   /* Per slot: node it is migrating to */
    clusterNode *importing_slots_from[CLUSTER_SLOTS]; /* Per slot: node it is importing from */
    clusterNode *slots[CLUSTER_SLOTS];                /* Slot -> owning node */
    uint64_t slots_keys_count[CLUSTER_SLOTS];         /* Number of keys per slot */
    rax *slots_to_keys;             /* Radix tree mapping keys to slots */
    mstime_t failover_auth_time;    /* Time of the previous or next election */
    int failover_auth_count;        /* Votes received so far */
    int failover_auth_sent;         /* True if we already asked for votes */
    int failover_auth_rank;         /* This slave's rank for the current election */
    uint64_t failover_auth_epoch;   /* Epoch of the current election */
    int cant_failover_reason;       /* Why the slave currently cannot fail over */
    mstime_t mf_end;                /* Manual failover time limit (0 if none) */
    clusterNode *mf_slave;          /* Slave performing the manual failover */
    long long mf_master_offset;     /* Master offset the slave must reach to start */
    int mf_can_start;               /* Non-zero when the manual failover can start */
    uint64_t lastVoteEpoch;         /* Epoch of the last vote granted */
    int todo_before_sleep;          /* Work deferred to clusterBeforeSleep() */
    long long stats_bus_messages_sent[CLUSTERMSG_TYPE_COUNT];     /* Bus messages sent, by type */
    long long stats_bus_messages_received[CLUSTERMSG_TYPE_COUNT]; /* Bus messages received, by type */
    long long stats_pfail_nodes;    /* Number of nodes currently in PFAIL state */
} clusterState;

typedef struct clusterLink {
    mstime_t ctime;                 /* Link creation time */
    int fd;                         /* TCP socket file descriptor */
    sds sndbuf;                     /* Packet send buffer */
    sds rcvbuf;                     /* Packet receive buffer */
    struct clusterNode *node;       /* Node this link is associated with, or NULL */
} clusterLink;
  • clusterNode: Maintains individual node information
  • clusterState: Tracks overall cluster health and configuration
  • clusterLink: Encapsulates remote node communication resources

Gossip Protocol for Inter-Node Communication

Redis Cluster uses two distinct ports per node:

  • Standard port (e.g., 6379): Client access
  • Cluster bus port (+10000, e.g., 16379): Node-to-node communication using binary gossip protocol for efficient data exchange, failure detection, and configuration updates

Cluster metadata maintenance approaches:

  • Centralized: metadata lives in a dedicated store (Storm, for example, keeps cluster metadata in ZooKeeper); updates take effect immediately, but the central store bears all the read/write pressure
  • Gossip-based: every node keeps its own copy of the metadata and changes propagate gradually between nodes; this is the approach Redis Cluster uses

Gossip protocol commands:

  • MEET: invites a new node to join the cluster
  • PING: exchanges metadata between nodes
  • PONG: responds to MEET/PING with status updates; can also be broadcast to push information
  • FAIL: notifies the cluster when a node judges another node as unreachable

Fixed Hash Slot Mechanism

Design Principles

Redis Cluster divides the keyspace into 16,384 fixed slots rather than using consistent hashing. Each key belongs to exactly one slot, and slots distribute across cluster nodes.

Slot computation: Keys are hashed using CRC16, yielding a 16-bit checksum. Slot assignment uses modulo 16384.
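
A sketch of that calculation in Python, using the CRC16 variant (XMODEM, polynomial 0x1021, zero initial value) documented for Redis Cluster key hashing; this bit-by-bit version favors brevity over speed and omits hash-tag ({...}) handling.

def crc16_xmodem(data: bytes) -> int:
    """Bit-by-bit CRC16 (XMODEM variant: poly 0x1021, init 0x0000)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """HASH_SLOT = CRC16(key) mod 16384 (hash tags not handled here)."""
    return crc16_xmodem(key.encode("utf-8")) % 16384

print(key_slot("user:1000"))  # compare against CLUSTER KEYSLOT user:1000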

Why 16,384 slots?

  • Message size: a per-node slot bitmap of 16,384 slots occupies 2KB (16,384 / 8 = 2,048 bytes) in heartbeat messages; 65,536 slots would need 8KB
  • Scalability: Supports up to 1,000 nodes with reasonable slot-per-node ratios
  • State propagation: Full slot state transfers are idempotent and simpler to reason about in distributed scenarios

Initial Slot Allocation

Step 1: Establish inter-node connections

CLUSTER MEET <ip> <port>

When node A invites node B:

  1. A sends MEET command to B
  2. B creates clusterNode entry for A and returns PONG
  3. A sends PING to confirm; B acknowledges
  4. A broadcasts B's information via gossip; other nodes repeat the process

Step 2: Assign hash slots

Each node's clusterNode structure contains a bitmap (slots[]) indicating its assigned slots. Nodes propagate this information, ensuring every node knows slot-to-node mappings across the cluster.

Step 3: Cluster enters online state and accepts client connections.

Key-to-Slot Resolution

Client request → Compute CRC16(key) % 16384 → Determine slot owner
If local: process request
If remote: return MOVED error with target IP:port

Clients cache slot mappings and redirect directly to correct nodes upon MOVED errors.
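
An illustrative exchange (the slot number, ports, and key are placeholders): a node that does not own a key's slot answers with MOVED instead of the value.

127.0.0.1:7000> GET user:1000
(error) MOVED 5474 127.0.0.1:7001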

Resharding Process

Cluster scaling requires slot migration:

  1. Target node begins importing new slots
  2. Source node migrates slot data and key associations

During migration, requests sent to the source node follow these rules:

  • If the requested key still exists on the source node, the request is served there as usual
  • If the key is absent (it may already have been migrated), the source returns an ASK error pointing to the importing node
  • ASK handling: clients must send the ASKING command to the importing node before retrying the request

ASK vs MOVED distinction (an example exchange follows the list):

  • MOVED: Permanent redirection (slot ownership changed)
  • ASK: Temporary redirection (migration in progress)
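
A sketch of the ASK flow from the client's point of view (slot, ports, key, and value are placeholders): the client retries on the importing node, prefixing the request with ASKING, without updating its cached slot map.

127.0.0.1:7000> GET user:1000
(error) ASK 5474 127.0.0.1:7001

127.0.0.1:7001> ASKING
OK
127.0.0.1:7001> GET user:1000
"some-value"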

High Availability Implementation

Redis Cluster distributes 16,384 slots among master nodes. Each master typically has replica nodes.

Replica responsibilities:

  • Monitor master health
  • Replicate master data
  • Assume master duties upon failure

Failover Mechanism

When a master fails:

  1. Replicas detect the failure (based on cluster-node-timeout)
  2. Replicas request votes from other masters
  3. Upon receiving majority support (N/2 + 1 votes), the replica promotes to master
  4. Cluster reconfigures to route slots to the new master

This election process resembles Raft leader election, adapted for Redis Cluster's master-replica topology.

Practical Operations

Common Commands

# Cluster information
CLUSTER INFO
CLUSTER NODES

# Node management
CLUSTER MEET <ip> <port>
CLUSTER FORGET <node_id>
CLUSTER REPLICATE <master_id>
CLUSTER SAVECONFIG

# Slot management
CLUSTER ADDSLOTS <slot> [slot...]
CLUSTER DELSLOTS <slot> [slot...]
CLUSTER FLUSHSLOTS
CLUSTER SETSLOT <slot> NODE <node_id>
CLUSTER SETSLOT <slot> MIGRATING <node_id>
CLUSTER SETSLOT <slot> IMPORTING <node_id>
CLUSTER SETSLOT <slot> STABLE

# Key operations
CLUSTER KEYSLOT <key>
CLUSTER COUNTKEYSINSLOT <slot>
CLUSTER GETKEYSINSLOT <slot> <count>

Cluster Setup Example

Node configuration (redis-7000.conf):

port 7000
daemonize yes
dir /var/redis/data
appendonly yes
appendfsync everysec
cluster-enabled yes
cluster-config-file nodes-7000.conf
cluster-node-timeout 10000

Create cluster with three master-slave pairs:

redis-cli --cluster create \
  127.0.0.1:7000 127.0.0.1:7001 127.0.0.1:7002 \
  127.0.0.1:7003 127.0.0.1:7004 127.0.0.1:7005 \
  --cluster-replicas 1

The command allocates 16,384 slots across three masters: approximately 5,461, 5,461, and 5,462 slots respectively.

Client Access

redis-cli -c -p 7000

The -c flag enables cluster mode, automatically following MOVED and ASK redirections.
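
With -c, redis-cli follows the redirection itself; a typical session looks roughly like this (slot number and addresses are illustrative).

$ redis-cli -c -p 7000
127.0.0.1:7000> SET user:1000 "alice"
-> Redirected to slot [5474] located at 127.0.0.1:7001
OK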

Failure Scenarios

Replica failure: the master detects the replica's disconnection via the gossip protocol; once a quorum of nodes agrees, the replica is marked as failing and FAIL messages are broadcast to the cluster. When the replica rejoins, the master resumes synchronization and notifies the cluster.

Master failure: replicas attempt to reconnect during the cluster-node-timeout window (10 seconds in the configuration above). If the master has not recovered when the timeout expires, a replica initiates failover and is promoted to master; the failed master rejoins as a replica when it returns.

Configuration Options

cluster-enabled yes|no
cluster-config-file <filename>
cluster-node-timeout <milliseconds>
cluster-migration-barrier <count>

cluster-node-timeout determines failure detection and replica promotion timing—typically 30 seconds in production environments.

Cluster Management Tools

redis-cli --cluster (replaces the deprecated redis-trib.rb):

redis-cli --cluster add-node <new_node> <existing_node>
redis-cli --cluster del-node <host:port> <node_id>
redis-cli --cluster reshard <host:port>

Summary

Redis Cluster provides production-grade distributed storage through fixed hash slot partitioning, gossip-based metadata propagation, and master-replica failover. Understanding slot migration, redirection handling, and failure detection mechanisms enables reliable cluster operations at scale.
