Prerequisites
Before proceeding with the integration, ensure that Hive, HBase, and Spark environments are properly set up. If you haven't successfully configured these components, refer to the Hadoop+Spark+Zookeeper+HBase+Hive cluster setup guide.
Hive Integration with HBase
The integration between Hive and HBase is achieved through API communication, facilitated by the hive-hbase-handler-*.jar utility located in Hive's lib directory. To complete the integration, copy this JAR file to the HBase lib directory:
cp hive-hbase-handler-*.jar /opt/hbase/hbase1.2/lib
Note: If version conflicts arise during Hive-HBase integration, prioritize the HBase version by replacing the corresponding JAR files in Hive with those from HBase.
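The note above can be sketched as follows. All paths, jar names, and version numbers are illustrative assumptions, and throwaway local directories stand in for the real Hive and HBase lib directories so the sketch runs anywhere:

```shell
# Illustrative sketch of resolving a Hive/HBase jar conflict by keeping HBase's copy.
# Throwaway local directories stand in for /opt/hive/.../lib and /opt/hbase/.../lib;
# the jar names and versions below are assumptions, not from a real cluster.
HBASE_LIB=./demo/hbase/lib
HIVE_LIB=./demo/hive/lib
mkdir -p "$HBASE_LIB" "$HIVE_LIB"
touch "$HBASE_LIB/hbase-client-1.2.0.jar"          # version shipped with HBase
touch "$HIVE_LIB/hbase-client-1.1.1.jar"           # older copy bundled with Hive
rm -f "$HIVE_LIB"/hbase-client-*.jar               # remove Hive's conflicting copy
cp "$HBASE_LIB"/hbase-client-*.jar "$HIVE_LIB"/    # install HBase's version instead
ls "$HIVE_LIB"
```

On a real cluster the same pattern applies to each conflicting jar: delete Hive's copy, then copy in the one from HBase/lib.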
For detailed testing procedures between Hive and HBase, refer to the "Hive Integration with HBase: A Step-by-Step Guide" article.
Hive Integration with Spark
Hive's integration with Spark involves using pre-compiled JAR packages. A critical consideration is version compatibility: each Hive release works only with specific Spark versions, so confirm a compatible pairing before proceeding with the integration.
Hive Configuration Changes
Navigate to the hive/conf directory and edit the hive-env.sh file to add the Spark environment variable:
export SPARK_HOME=/opt/spark/spark1.6-hadoop2.4-hive
Next, edit the hive-site.xml file and add the following configurations:
<property>
<name>hive.execution.engine</name>
<value>spark</value>
</property>
<property>
<name>spark.master</name>
<value>spark://master:7077</value>
</property>
<property>
<name>spark.home</name>
<value>/opt/spark/spark1.6-hadoop2.4-hive</value>
</property>
<property>
<name>spark.submit.deployMode</name>
<value>client</value>
</property>
<property>
<name>spark.serializer</name>
<value>org.apache.spark.serializer.KryoSerializer</value>
</property>
<property>
<name>spark.eventLog.enabled</name>
<value>true</value>
</property>
<property>
<name>spark.eventLog.dir</name>
<value>hdfs://master:9000/directory</value>
</property>
<property>
<name>spark.executor.memory</name>
<value>10G</value>
</property>
<property>
<name>spark.driver.memory</name>
<value>10G</value>
</property>
Configuration details:
- hive.execution.engine: The default execution engine for Hive; set to "spark". Alternatively, specify Spark per session in the Hive shell with:
set hive.execution.engine=spark;
- spark.master: Spark master node address
- spark.home: Spark installation path
- spark.submit.deployMode: Spark submission mode (client or cluster)
- spark.serializer: Spark serialization method
- spark.eventLog.enabled: Enables Spark event logging
- spark.eventLog.dir: Directory for Spark event logs (must be created in HDFS beforehand)
- spark.executor.memory: Memory allocated to each Spark executor
- spark.driver.memory: Memory allocated to the Spark driver
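Since the event log directory must exist before jobs run, it can be created up front. The path below mirrors the hdfs://master:9000/directory value from the configuration and assumes the hdfs client is on the PATH with the NameNode reachable:

```shell
# Create the Spark event log directory in HDFS (path taken from spark.eventLog.dir).
hdfs dfs -mkdir -p /directory
```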
After completing the configuration, enter the Hive shell and run a join query between two tables; the console output should show the job being submitted to Spark rather than MapReduce, confirming the engine is in use.
Testing Hive on HBase with Spark Engine
After successful integration and creation of two external Hive tables linked to HBase, conduct performance testing with data queries.
Table creation scripts:
create table t_student(id int, name string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties("hbase.columns.mapping" = ":key,st1:name")
tblproperties("hbase.table.name" = "t_student", "hbase.mapred.output.outputtable" = "t_student");

create table t_student_info(id int, age int, sex string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties("hbase.columns.mapping" = ":key,st1:age,st1:sex")
tblproperties("hbase.table.name" = "t_student_info", "hbase.mapred.output.outputtable" = "t_student_info");
Insert 1 million records into both tables for testing. In this example, data was inserted directly into HBase using the HBase API.
Once data is inserted, test query performance in the Hive shell:
Record Count Test:
Primary Key Query Test:
Non-Primary Key Query Test:
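The original article presents these tests as screenshots; the queries below are a hedged reconstruction of what each test might look like against the tables defined above (the rowkey value 1 and the name 'zhangsan' are illustrative assumptions):

```sql
-- Record count test
select count(*) from t_student;

-- Primary key query test: id maps to the HBase rowkey (:key)
select * from t_student where id = 1;

-- Non-primary key query test: name maps to column st1:name, forcing a scan
select * from t_student where name = 'zhangsan';
```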
Note: You can also use Hive's JDBC API with the following driver:
Class.forName("org.apache.hive.jdbc.HiveDriver");
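A minimal JDBC client might look like the sketch below. The host, port (10000 is HiveServer2's default), empty credentials, and queried table are assumptions, and the hive-jdbc jar with its dependencies must be on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (requires hive-jdbc on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // URL, port, and empty credentials are assumptions for this sketch.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://master:10000/default", "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select count(*) from t_student")) {
            while (rs.next()) {
                System.out.println("row count: " + rs.getLong(1));
            }
        }
    }
}
```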
Conclusion: Testing with Hive on Spark shows that querying the million-record tables by primary key (the HBase rowkey) returns in approximately 2.3 seconds, with Spark initialization consuming about 2 seconds of that. Non-primary key queries, however, are significantly slower because they must scan the table. Therefore, when using Hive on HBase, design queries around the rowkey for optimal performance.