Deploying a Distributed ZooKeeper and HBase Cluster
ZooKeeper and HBase Overview
ZooKeeper
ZooKeeper is an open-source coordination framework, originally developed at Yahoo!, that provides simple, robust coordination services to distributed applications. It abstracts complex and error-prone consensus protocols into an efficient and reliable set of primitives exposed through simple interfaces. Distributed systems leverage ZooKeeper for data publication/subscription, load distribution, naming services, coordination notifications, cluster administration, leader election, distributed locking, and queue management. Service providers register their endpoints within the ZooKeeper registry; consumers then query the registry to discover provider details before initiating direct communication.
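As a brief illustration of this registry pattern, the following zkCli.sh session (runnable once the ensemble deployed later in this guide is up; the path, service name, and endpoint are hypothetical) registers an ephemeral endpoint and reads it back:
[hadoopuser@primary ~]$ zkCli.sh -server primary:2181
[zk: primary:2181(CONNECTED) 0] create /services ""
Created /services
[zk: primary:2181(CONNECTED) 1] create -e /services/orders "worker1:9090"
Created /services/orders
[zk: primary:2181(CONNECTED) 2] ls /services
[orders]
[zk: primary:2181(CONNECTED) 3] get /services/orders
worker1:9090
Because the endpoint is an ephemeral znode, ZooKeeper deletes it automatically when the provider's session ends, so consumers never discover a stale endpoint.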
HBase
HBase is a highly reliable, performant, scalable, and column-oriented distributed storage system designed to run on commodity hardware. It targets the storage and processing of massive datasets, easily handling tables comprising billions of rows and millions of columns using standard server configurations.
HBase Characteristics
- Massive Storage: Capable of managing petabyte-scale data while still answering queries within tens to hundreds of milliseconds, thanks to its scale-out architecture.
- Column-Family Storage: Data is organized into column families, which must be defined during table creation; a family can then hold an unlimited number of columns (see the example after this list).
- Extreme Scalability: Expansion is supported both computationally (by adding RegionServers to handle more regions) and in storage capacity (by adding DataNodes to the underlying HDFS).
- High Concurrency: On the commodity hardware HBase typically runs on, a single I/O may take tens of milliseconds, but per-request latency degrades very little even under heavy concurrent access.
- Sparsity: Empty columns within a column family consume no physical storage, allowing sparse, flexible schemas without wasted disk space.
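A short hbase shell session (runnable once HBase is deployed below; the table, family, and column names are hypothetical) makes the column-family and sparsity points concrete:
[hadoopuser@primary ~]$ hbase shell
hbase> create 'orders', 'info'
hbase> put 'orders', 'row1', 'info:customer', 'alice'
hbase> put 'orders', 'row2', 'info:total', '42.50'
hbase> scan 'orders'
Because row1 never writes info:total and row2 never writes info:customer, those absent cells occupy no space on disk; only the cells actually written are stored.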
ZooKeeper Deployment
Extraction
Ensure firewall services are disabled across all nodes to prevent connection failures.
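On systemd-based distributions that ship firewalld, for example, the service can be stopped and prevented from starting at boot as follows (run the same two commands on worker1 and worker2, or adapt them to your distribution's firewall tooling):
[root@primary ~]# systemctl stop firewalld
[root@primary ~]# systemctl disable firewalld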
[root@primary ~]# tar -xzvf /opt/archives/apache-zookeeper-3.5.10-bin.tar.gz -C /opt/apps/
[root@primary ~]# mv /opt/apps/apache-zookeeper-3.5.10-bin /opt/apps/zk
Primary Node Configuration
Create required data and logging directories within the installation path.
[root@primary ~]# cd /opt/apps/zk
[root@primary zk]# mkdir zk_data && mkdir zk_logs
Assign a unique identifier for this node; the value must match this host's server.N entry in zoo.cfg (configured below).
[root@primary zk]# echo "1" > /opt/apps/zk/zk_data/myid
Generate the configuration file from the provided sample and modify it.
[root@primary zk]# cp /opt/apps/zk/conf/zoo_sample.cfg /opt/apps/zk/conf/zoo.cfg
[root@primary zk]# vi /opt/apps/zk/conf/zoo.cfg
Update the dataDir parameter, and point the transaction log at the logging directory created earlier:
dataDir=/opt/apps/zk/zk_data
dataLogDir=/opt/apps/zk/zk_logs
Append the cluster node definitions at the end of the file. In each server.N entry, the first port (2888) is used by followers to connect to the leader, and the second (3888) is used for leader election:
server.1=primary:2888:3888
server.2=worker1:2888:3888
server.3=worker2:2888:3888
Transfer ownership of the installation directory to the designated service account.
[root@primary zk]# chown -R hadoopuser:hadoopgroup /opt/apps/zk
Worker Node Configuration
Transfer the configured directory to the remaining nodes.
[root@primary ~]# scp -r /opt/apps/zk worker1:/opt/apps/
[root@primary ~]# scp -r /opt/apps/zk worker2:/opt/apps/
On worker1, set the appropriate permissions and update its identifier.
[root@worker1 ~]# chown -R hadoopuser:hadoopgroup /opt/apps/zk
[root@worker1 ~]# echo "2" > /opt/apps/zk/zk_data/myid
On worker2, apply the same permissions and set a distinct identifier.
[root@worker2 ~]# chown -R hadoopuser:hadoopgroup /opt/apps/zk
[root@worker2 ~]# echo "3" > /opt/apps/zk/zk_data/myid
Environment Variables
Append the following environment configurations to /etc/profile on all machines.
export ZK_HOME=/opt/apps/zk
export PATH=$PATH:$ZK_HOME/bin
Service Activation
Switch to the service account on all nodes, reload the profile, and start the daemon.
[hadoopuser@primary ~]$ source /etc/profile
[hadoopuser@primary ~]$ zkServer.sh start
[hadoopuser@worker1 ~]$ source /etc/profile
[hadoopuser@worker1 ~]$ zkServer.sh start
[hadoopuser@worker2 ~]$ source /etc/profile
[hadoopuser@worker2 ~]$ zkServer.sh start
Once all instances are active, verify the cluster state. One node will assume the leader role while the others become followers.
[hadoopuser@primary ~]$ zkServer.sh status
[hadoopuser@worker1 ~]$ zkServer.sh status
[hadoopuser@worker2 ~]$ zkServer.sh status
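Exact wording varies by version, but each node reports its role on a Mode line, and exactly one node should report leader. Sample output:
ZooKeeper JMX enabled by default
Using config: /opt/apps/zk/bin/../conf/zoo.cfg
Mode: follower
If a node instead reports that the service is probably not running, confirm that a quorum (at least two of the three nodes) is up and that the firewall is still disabled.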
HBase Deployment
Extraction and Relocation
[root@primary ~]# tar -xzvf /opt/archives/hbase-2.4.11-bin.tar.gz -C /opt/apps/
[root@primary ~]# mv /opt/apps/hbase-2.4.11 /opt/apps/hbase
Environment Variables
Add the HBase environment paths to /etc/profile across all nodes.
export HBASE_HOME=/opt/apps/hbase
export PATH=$HBASE_HOME/bin:$PATH
Apply the changes on every machine.
[root@primary ~]# source /etc/profile
[root@worker1 ~]# source /etc/profile
[root@worker2 ~]# source /etc/profile
Primary Node Configuration
Navigate to the configuration directory.
[root@primary ~]# cd /opt/apps/hbase/conf/
Edit hbase-env.sh to define the Java location, disable the embedded ZooKeeper in favor of the external ensemble deployed above, and point HBase at the Hadoop configuration directory.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
export HBASE_MANAGES_ZK=false
export HBASE_CLASSPATH=/opt/apps/hadoop/etc/hadoop/
Modify hbase-site.xml to define the distributed properties. The hbase.rootdir value must use the same NameNode host and port as fs.defaultFS in the Hadoop core-site.xml.
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://primary:8020/hbase_data</value>
  </property>
  <property>
    <name>hbase.master.info.port</name>
    <value>16010</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
  <property>
    <name>zookeeper.session.timeout</name>
    <value>90000</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>primary,worker1,worker2</value>
  </property>
  <property>
    <name>hbase.tmp.dir</name>
    <value>/opt/apps/hbase/temp_store</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
</configuration>
Update the regionservers file to list the worker nodes, one hostname per line.
worker1
worker2
Create the temporary directory specified in the configuration.
[root@primary conf]# mkdir /opt/apps/hbase/temp_store
Cluster Distribution and Permissions
Synchronize the installation folder to the worker nodes.
[root@primary conf]# scp -r /opt/apps/hbase/ worker1:/opt/apps/
[root@primary conf]# scp -r /opt/apps/hbase/ worker2:/opt/apps/
Assign proper ownership on all nodes.
[root@primary ~]# chown -R hadoopuser:hadoopgroup /opt/apps/hbase/
[root@worker1 ~]# chown -R hadoopuser:hadoopgroup /opt/apps/hbase/
[root@worker2 ~]# chown -R hadoopuser:hadoopgroup /opt/apps/hbase/
Service Activation
Log in as the service user on the primary node, ensure the environment is loaded, and launch the cluster; HDFS and the ZooKeeper ensemble must already be running. start-hbase.sh starts the HMaster locally and then starts a RegionServer over SSH on each host listed in regionservers, so it only needs to be run once.
[hadoopuser@primary ~]$ source /etc/profile
[hadoopuser@primary ~]$ start-hbase.sh
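To confirm the cluster is healthy, check the Java processes on each node (process IDs below are illustrative): HMaster should appear on the primary and HRegionServer on each worker. The web UI is then reachable on the hbase.master.info.port configured above, at http://primary:16010.
[hadoopuser@primary ~]$ jps | grep HMaster
2481 HMaster
[hadoopuser@worker1 ~]$ jps | grep HRegionServer
1932 HRegionServer
[hadoopuser@worker2 ~]$ jps | grep HRegionServer
2044 HRegionServer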