Fading Coder

One Final Commit for the Last Sprint


Integrating Hive with Spark and HBase for Enhanced Big Data Query Performance

Tech · May 13

Prerequisites

Before proceeding with the integration, make sure the Hive, HBase, and Spark environments are properly set up. If these components are not yet configured, refer to the Hadoop + Spark + Zookeeper + HBase + Hive cluster setup guide.

Hive Integration with HBase

The integration between Hive and HBase is achieved through API calls, mediated by the hive-hbase-handler-*.jar library in Hive's lib directory. To complete the integration, copy this JAR into the HBase lib directory:
cp hive-hbase-handler-*.jar /opt/hbase/hbase1.2/lib
Note: If version conflicts arise during Hive-HBase integration, prioritize the HBase version by replacing the corresponding JAR files in Hive with those from HBase. For detailed testing procedures between Hive and HBase, refer to the "Hive Integration with HBase: A Step-by-Step Guide" article.
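A quick way to sanity-check the handler after copying the JAR is to create a small HBase-backed table from the Hive shell. This is a minimal sketch; the table name and column family (cf1) here are illustrative, not from the original setup:

```sql
-- Minimal smoke test for the HBase storage handler.
-- Table name and column family (cf1) are illustrative.
create table hbase_smoke(key int, value string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping" = ":key,cf1:val")
tblproperties ("hbase.table.name" = "hbase_smoke");
```

If the statement succeeds and the HBase shell's list command shows hbase_smoke, the handler JAR is wired up correctly.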

Hive Integration with Spark

Hive's integration with Spark relies on pre-compiled JAR packages. The critical consideration is version compatibility: each Hive release is built against a specific Spark version, so the two must be matched. After confirming a compatible pair of versions, proceed with the integration.

Hive Configuration Changes

Navigate to the hive/conf directory and edit the hive-env.sh file to add Spark environment variables:
export SPARK_HOME=/opt/spark/spark1.6-hadoop2.4-hive
Next, edit the hive-site.xml file and add the following configurations:

<property>
   <name>hive.execution.engine</name>
   <value>spark</value>
</property>

<property>
   <name>spark.master</name>
   <value>spark://master:7077</value>
</property>

<property>
   <name>spark.home</name>
   <value>/opt/spark/spark1.6-hadoop2.4-hive</value>
</property>

<property>
   <name>spark.submit.deployMode</name>
   <value>client</value>
</property>

<property>
   <name>spark.serializer</name>
   <value>org.apache.spark.serializer.KryoSerializer</value>
</property>

<property>
   <name>spark.eventLog.enabled</name>
   <value>true</value>
</property>

<property>
   <name>spark.eventLog.dir</name>
   <value>hdfs://master:9000/directory</value>
</property>

<property>
   <name>spark.executor.memory</name>
   <value>10G</value>
</property>

<property>
   <name>spark.driver.memory</name>
   <value>10G</value>
</property>
Configuration details:

- hive.execution.engine: The default execution engine for Hive; set to "spark". Alternatively, specify Spark manually in the Hive shell with: set hive.execution.engine=spark;
- spark.master: Spark master node address
- spark.home: Spark installation path
- spark.submit.deployMode: Spark submission mode (client or cluster)
- spark.serializer: Spark serialization method
- spark.eventLog.enabled: Enables Spark event logging
- spark.eventLog.dir: Directory for Spark event logs (must be created in HDFS beforehand)
- spark.executor.memory: Memory allocated to each Spark executor
- spark.driver.memory: Memory allocated to the Spark driver

After the configuration is in place, enter the Hive shell and run a table join query to verify that the Spark engine is being used.
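One way to verify the engine switch, sketched here: from the Hive shell, force the Spark engine for the session and run a join, which submits a Spark job. The table names are placeholders for any two joinable tables in your warehouse:

```sql
-- Force the Spark engine for this session (overrides the hive-site.xml default).
set hive.execution.engine=spark;
-- A join forces a distributed job; the console should show Spark stages being submitted.
select a.id, a.name, b.age
from t_student a
join t_student_info b on (a.id = b.id)
limit 10;
```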

Testing Hive on HBase with Spark Engine

After successful integration and creation of two external Hive tables linked to HBase, conduct performance testing with data queries. Table creation scripts:
create table t_student(id int, name string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties("hbase.columns.mapping" = ":key,st1:name")
tblproperties("hbase.table.name" = "t_student", "hbase.mapred.output.outputtable" = "t_student");

create table t_student_info(id int, age int, sex string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties("hbase.columns.mapping" = ":key,st1:age,st1:sex")
tblproperties("hbase.table.name" = "t_student_info", "hbase.mapred.output.outputtable" = "t_student_info");
Insert 1 million records into each table for testing; in this example, the data was written directly to HBase through the HBase API. Once the data is loaded, run three tests from the Hive shell: a record count, a primary-key (rowkey) query, and a non-primary-key query. Note: you can also connect through Hive's JDBC API by loading the following driver:
Class.forName("org.apache.hive.jdbc.HiveDriver");
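For reference, the three Hive-shell tests could be expressed as queries like the following; the literal id and age values are illustrative, not the ones used in the original measurements:

```sql
-- Record count test
select count(*) from t_student;

-- Primary key (rowkey) query test: translates to an HBase point lookup
select * from t_student where id = 500000;

-- Non-primary key query test: requires a full scan, so expect it to be much slower
select * from t_student_info where age = 20;
```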
Conclusion

Testing with Hive on Spark shows that a primary-key (HBase rowkey) query over 1 million records completes in roughly 2.3 seconds, about 2 seconds of which is Spark initialization. Non-primary-key queries, by contrast, are significantly slower. When using Hive on HBase, therefore, favor rowkey-based queries for optimal performance.
Tags: Big Data
