ClickHouse Shards and Replication Setup
Replication in ClickHouse keeps two or more copies of the same table data on different servers, primarily for redundancy and fault tolerance. Sharding, on the other hand, splits a table's data horizontally across servers, spreading query and insert load over multiple nodes. ClickHouse uses ZooKeeper (version 3.4.5 or newer) to coordinate replication; recent versions can use the built-in ClickHouse Keeper instead.
Configuring ZooKeeper for ClickHouse
Adding the ZooKeeper Configuration
The ZooKeeper configuration can either be embedded directly in the main configuration file (config.xml) or placed in an external file.

Direct configuration: add the ZooKeeper node definitions to the main configuration file:

<zookeeper>
    <node>
        <host>server1</host>
        <port>2181</port>
    </node>
    <node>
        <host>server2</host>
        <port>2181</port>
    </node>
</zookeeper>

External configuration: create a separate configuration file, e.g. /etc/clickhouse-server/config.d/zookeeper.xml, with the following format:

<?xml version="1.0"?>
<clickhouse>
    <zookeeper>
        <node>
            <host>server1</host>
            <port>2181</port>
        </node>
    </zookeeper>
</clickhouse>

Files placed under /etc/clickhouse-server/config.d/ are merged into the main configuration automatically. Alternatively, a substitutions file can be referenced explicitly from config.xml, in which case its elements are pulled in via the incl attribute:

<include_from>/etc/clickhouse-server/config.d/zookeeper.xml</include_from>
Setting Up Replicas
Using ReplicatedMergeTree Engine
Create the replicated table on every node with the ReplicatedMergeTree engine, using distributed DDL so the statement runs across the whole cluster:
CREATE TABLE cluster_node.replicated_table ON CLUSTER cluster_name (
id Int32,
name String
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/replicated_table', '{replica}') ORDER BY id;
The {shard} and {replica} placeholders are macros: ClickHouse expands them per server from the <macros> section of that server's configuration, so the same DDL can run unchanged on every node.
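For example, each server defines its own macro values in its configuration file (the specific values below, shard 01 and replica replica1, are illustrative and would differ per node):

```xml
<macros>
    <shard>01</shard>
    <replica>replica1</replica>
</macros>
```

With these macros, the ZooKeeper path above expands to /clickhouse/tables/01/replicated_table and the replica name to replica1 on this node.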
Sharding Configuration
To distribute data across servers, define the shards in the remote_servers section of the configuration:
<remote_servers>
<cluster>
<shard>
<replica>
<host>server1</host>
<port>9000</port>
</replica>
</shard>
<shard>
<replica>
<host>server2</host>
<port>9000</port>
</replica>
</shard>
</cluster>
</remote_servers>
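Once the cluster is defined, its layout can be verified from any node by querying the system.clusters table (the name cluster below matches the tag used in the configuration above):

```sql
SELECT cluster, shard_num, shard_weight, replica_num, host_name, port
FROM system.clusters
WHERE cluster = 'cluster';
```

Each row corresponds to one replica of one shard, which makes misconfigured hosts or ports easy to spot.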
Shard weights control distribution proportions: with the default weight of 1 on the other shard, a shard with weight 2 receives roughly twice as many rows (about 2/3 of the total in a two-shard cluster):
<shard>
<weight>2</weight>
<replica>
<host>server1</host>
<port>9000</port>
</replica>
</shard>
Creating and Managing Distributed Tables
Distributed tables act as intermediaries that route inserts (and fan out reads) to the underlying local replicated or sharded tables:
CREATE TABLE distributed_table AS base_table
ENGINE = Distributed(cluster_name, database_name, base_table, id);

The last argument, id, is the sharding key that determines which shard each row is sent to.
Insert data and observe distribution across nodes:
INSERT INTO distributed_table VALUES (1, 'sample data');
Rows are routed to shards according to the sharding key and the configured shard weights; querying the local tables on each node shows where each row landed.
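One way to see the distribution without logging in to each server is to query the Distributed table itself and include hostName(), a built-in function that returns the name of the server that processed each row:

```sql
SELECT hostName() AS server, *
FROM distributed_table
ORDER BY id;
```

Rows belonging to different shards will report different server names.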
Combining Replication with Sharding
Combining sharding with replication gives both horizontal scalability and redundancy. Each shard lists its replicas, and the internal_replication setting controls which component performs the replication:
<remote_servers>
<cluster_advanced>
<shard>
<internal_replication>true</internal_replication>
<replica>
<host>server1</host>
<port>9000</port>
</replica>
</shard>
<shard>
<internal_replication>true</internal_replication>
<replica>
<host>server2</host>
<port>9000</port>
</replica>
<replica>
<host>server3</host>
<port>9000</port>
</replica>
</shard>
</cluster_advanced>
</remote_servers>
With internal_replication set to true, a Distributed table writes each insert block to only one replica per shard and ReplicatedMergeTree propagates it to the others; with false, the Distributed table writes to every replica itself, which is only safe for non-replicated tables. Since the tables here use ReplicatedMergeTree, internal_replication should be true on every shard. Create the table using the same macros as before:
CREATE TABLE replicated_sharded_table ON CLUSTER cluster_advanced (
id Int32,
name String
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/replicated_sharded_table', '{replica}') ORDER BY id;
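On top of this, a Distributed table routes writes across the shards while replication happens within each shard. A minimal sketch, assuming the table lives in the default database and using a random sharding key (the table name replicated_sharded_table_dist is illustrative):

```sql
CREATE TABLE replicated_sharded_table_dist ON CLUSTER cluster_advanced
AS replicated_sharded_table
ENGINE = Distributed(cluster_advanced, default, replicated_sharded_table, rand());
```

rand() spreads rows evenly across shards; a column-based key such as id keeps related rows on the same shard instead.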
The Distributed table then spreads writes across shards while each shard's replicas keep identical copies of their portion of the data, yielding an architecture that is both scalable and fault-tolerant under read and write load.