Fading Coder

One Final Commit for the Last Sprint


Integrating Hive with Spark and HBase for Enhanced Big Data Query Performance

Tech · May 13

Prerequisites

Before proceeding with the integration, make sure the Hive, HBase, and Spark environments are properly set up. If these components are not yet configured, refer to the Hadoop + Spark + Zookeeper + HBase + Hive cluster setup guide.

Hive Integration with HBase

The integration between Hive and HBase is achieved through API calls, mediated by the hive-hbase-handler-*.jar library in Hive's lib directory. To complete the integration, copy this JAR into the HBase lib directory:
cp hive-hbase-handler-*.jar /opt/hbase/hbase1.2/lib
Note: If version conflicts arise during Hive-HBase integration, prioritize the HBase version by replacing the corresponding JAR files in Hive with those from HBase. For detailed testing procedures between Hive and HBase, refer to the "Hive Integration with HBase: A Step-by-Step Guide" article.
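A quick way to sanity-check the handler after copying the JAR is to create a small HBase-backed table from the Hive shell. This is a minimal sketch; the table name and column family (cf1) here are illustrative, not from the original setup:

```sql
-- Minimal smoke test for the HBase storage handler.
-- Table name and column family (cf1) are illustrative.
create table hbase_smoke(key int, value string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping" = ":key,cf1:val")
tblproperties ("hbase.table.name" = "hbase_smoke");
```

If the statement succeeds and the HBase shell's list command shows hbase_smoke, the handler JAR is wired up correctly.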

Hive Integration with Spark

Hive's integration with Spark relies on pre-compiled JAR packages. The critical consideration is version compatibility: each Hive release is built against a specific Spark version, so the two must be matched. After confirming a compatible pair of versions, proceed with the integration.

Hive Configuration Changes

Navigate to the hive/conf directory and edit the hive-env.sh file to add Spark environment variables:
export SPARK_HOME=/opt/spark/spark1.6-hadoop2.4-hive
Next, edit the hive-site.xml file and add the following configurations:

<property>
   <name>hive.execution.engine</name>
   <value>spark</value>
</property>

<property>
   <name>spark.master</name>
   <value>spark://master:7077</value>
</property>

<property>
   <name>spark.home</name>
   <value>/opt/spark/spark1.6-hadoop2.4-hive</value>
</property>

<property>
   <name>spark.submit.deployMode</name>
   <value>client</value>
</property>

<property>
   <name>spark.serializer</name>
   <value>org.apache.spark.serializer.KryoSerializer</value>
</property>

<property>
   <name>spark.eventLog.enabled</name>
   <value>true</value>
</property>

<property>
   <name>spark.eventLog.dir</name>
   <value>hdfs://master:9000/directory</value>
</property>

<property>
   <name>spark.executor.memory</name>
   <value>10G</value>
</property>

<property>
   <name>spark.driver.memory</name>
   <value>10G</value>
</property>
Configuration details:

- hive.execution.engine: The default execution engine for Hive; set to "spark". Alternatively, specify Spark manually in the Hive shell with: set hive.execution.engine=spark;
- spark.master: Spark master node address
- spark.home: Spark installation path
- spark.submit.deployMode: Spark submission mode (client or cluster)
- spark.serializer: Spark serialization method
- spark.eventLog.enabled: Enables Spark event logging
- spark.eventLog.dir: Directory for Spark event logs (must be created in HDFS beforehand)
- spark.executor.memory: Memory allocated to each Spark executor
- spark.driver.memory: Memory allocated to the Spark driver

After the configuration is in place, enter the Hive shell and run a table join query to verify that the Spark engine is being used.
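One way to verify the engine switch, sketched here: from the Hive shell, force the Spark engine for the session and run a join, which submits a Spark job. The table names are placeholders for any two joinable tables in your warehouse:

```sql
-- Force the Spark engine for this session (overrides the hive-site.xml default).
set hive.execution.engine=spark;
-- A join forces a distributed job; the console should show Spark stages being submitted.
select a.id, a.name, b.age
from t_student a
join t_student_info b on (a.id = b.id)
limit 10;
```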

Testing Hive on HBase with Spark Engine

After successful integration and creation of two external Hive tables linked to HBase, conduct performance testing with data queries. Table creation scripts:
create table t_student(id int, name string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties("hbase.columns.mapping" = ":key,st1:name")
tblproperties("hbase.table.name" = "t_student", "hbase.mapred.output.outputtable" = "t_student");

create table t_student_info(id int, age int, sex string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties("hbase.columns.mapping" = ":key,st1:age,st1:sex")
tblproperties("hbase.table.name" = "t_student_info", "hbase.mapred.output.outputtable" = "t_student_info");
Insert 1 million records into each table for testing; in this example, the data was written directly to HBase through the HBase API. Once the data is loaded, run three tests from the Hive shell: a record count, a primary-key (rowkey) query, and a non-primary-key query. Note: you can also connect through Hive's JDBC API by loading the following driver:
Class.forName("org.apache.hive.jdbc.HiveDriver");
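For reference, the three Hive-shell tests could be expressed as queries like the following; the literal id and age values are illustrative, not the ones used in the original measurements:

```sql
-- Record count test
select count(*) from t_student;

-- Primary key (rowkey) query test: translates to an HBase point lookup
select * from t_student where id = 500000;

-- Non-primary key query test: requires a full scan, so expect it to be much slower
select * from t_student_info where age = 20;
```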
Conclusion

Testing with Hive on Spark shows that a primary-key (HBase rowkey) query over 1 million records completes in roughly 2.3 seconds, about 2 seconds of which is Spark initialization. Non-primary-key queries, by contrast, are significantly slower. When using Hive on HBase, therefore, favor rowkey-based queries for optimal performance.
Tags: Big Data
