Fading Coder

An Old Coder’s Final Dance


ClickHouse Shards and Replication Setup


A replica in ClickHouse consists of two or more tables (or table parts) holding identical data, maintained for redundancy and data integrity. Sharding, by contrast, splits a table's data horizontally across different servers, spreading the load over multiple nodes. ClickHouse relies on ZooKeeper (version 3.4.5 or newer) to coordinate replication and sharding.

Configuring ZooKeeper for ClickHouse

Adding ZooKeeper Configuration

The ZooKeeper configuration can either be embedded directly in the main configuration file (config.xml) or kept in an external file.

  1. Direct Configuration: Edit the ZooKeeper node definitions in the main configuration file:

    <zookeeper>
        <node>
            <host>server1</host>
            <port>2181</port>
        </node>
        <node>
            <host>server2</host>
            <port>2181</port>
        </node>
    </zookeeper>
    
  2. External Configuration: Create a separate configuration file, e.g., /etc/clickhouse-server/config.d/zookeeper.xml, with the following format:

    <?xml version="1.0"?>
    <clickhouse>
        <zookeeper>
            <node>
                <host>server1</host>
                <port>2181</port>
            </node>
        </zookeeper>
    </clickhouse>
    

    Files placed under /etc/clickhouse-server/config.d/ are merged into the main configuration automatically, so no further reference is needed. (Note that the <include_from> directive serves a different purpose: it points to a substitutions file whose elements are pulled in via incl attributes.)
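After restarting the server with the new configuration, you can verify the ZooKeeper connection from clickhouse-client. The system.zookeeper table is only available when ZooKeeper is configured, so a successful query confirms the setup:

    -- Lists the top-level znodes; fails if ZooKeeper is not reachable
    SELECT name FROM system.zookeeper WHERE path = '/';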
    

Setting Up Replicas

Using ReplicatedMergeTree Engine

Create the replicated table on each node with the ReplicatedMergeTree engine, using distributed DDL (ON CLUSTER) so a single statement takes effect cluster-wide:

CREATE TABLE cluster_node.replicated_table ON CLUSTER cluster_name (
    id Int32,
    name String
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/replicated_table', '{replica}') ORDER BY id;

The {shard} and {replica} macros are substituted from each node's macro configuration, giving every replica a unique path in ZooKeeper.
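These macros are defined per node, typically in a file such as /etc/clickhouse-server/config.d/macros.xml. The values below are illustrative; each node must carry its own:

    <clickhouse>
        <macros>
            <!-- Same value on all replicas of one shard, different across shards -->
            <shard>01</shard>
            <!-- Unique per node, e.g. the hostname -->
            <replica>server1</replica>
        </macros>
    </clickhouse>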

Sharding Configuration

To distribute data across servers, define shards:

<remote_servers>
    <cluster>
        <shard>
            <replica>
                <host>server1</host>
                <port>9000</port>
            </replica>
        </shard>
        <shard>
            <replica>
                <host>server2</host>
                <port>9000</port>
            </replica>
        </shard>
    </cluster>
</remote_servers>
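After reloading the configuration, you can confirm that ClickHouse sees the topology by querying the system.clusters table (the cluster name here matches the <cluster> tag above):

    -- One row per replica in the cluster definition
    SELECT cluster, shard_num, shard_weight, replica_num, host_name, port
    FROM system.clusters
    WHERE cluster = 'cluster';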

Adjust weights to control the proportion of data each shard receives. With the configuration below, this shard receives twice the share of rows of a shard with the default weight of 1:

<shard>
    <weight>2</weight>
    <replica>
        <host>server1</host>
        <port>9000</port>
    </replica>
</shard>

Creating and Managing Distributed Tables

Distributed tables act as a routing layer: inserts into them are forwarded to the underlying replicated or sharded local tables, split according to the sharding key (the last engine argument):

CREATE TABLE distributed_table AS base_table
ENGINE = Distributed(cluster_name, database_name, table_name, id);

Insert data and observe distribution across nodes:

INSERT INTO distributed_table VALUES (1, 'sample data');

Query the local tables on each node to see how rows were distributed according to the sharding key and shard weights.
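For example, assuming the local table behind distributed_table is base_table (as in the CREATE statement above), comparing row counts on each node shows the distribution:

    -- Total across the cluster, routed through the Distributed table
    SELECT count() FROM distributed_table;

    -- Rows stored locally on the node you are connected to
    SELECT count() FROM base_table;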

Combining Replication with Sharding

Combine sharding with replication to get both horizontal scalability and fault tolerance:

<remote_servers>
    <cluster_advanced>
        <shard>
            <internal_replication>true</internal_replication>
            <replica>
                <host>server1</host>
                <port>9000</port>
            </replica>
        </shard>
        <shard>
            <internal_replication>false</internal_replication>
            <replica>
                <host>server2</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>server3</host>
                <port>9000</port>
            </replica>
        </shard>
    </cluster_advanced>
</remote_servers>

When internal_replication is true, the Distributed table writes each block to only one replica of the shard and leaves replication to ReplicatedMergeTree; when false, the Distributed table itself writes to every replica, which is only appropriate for non-replicated tables. Create the table with shard-specific macros as before:

CREATE TABLE replicated_sharded_table ON CLUSTER cluster_advanced (
    id Int32,
    name String
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/replicated_sharded_table', '{replica}') ORDER BY id;

Distributed tables then route reads and writes across shards and replicas, yielding a scalable, fault-tolerant architecture with balanced load and high availability.
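As a sketch, a Distributed table over cluster_advanced could look like the following (the database name default and the rand() sharding key are assumptions; rand() spreads rows evenly when no natural key exists):

    CREATE TABLE distributed_advanced AS replicated_sharded_table
    ENGINE = Distributed(cluster_advanced, default, replicated_sharded_table, rand());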
