Fading Coder

One Final Commit for the Last Sprint

Home > Tech > Content

Hive Practical Techniques and Common Pitfalls

Tech 1

Using explode for Column-to-Row Transformation

To convert a delimited string column into multiple rows:

SELECT other_cols, exploded_col
FROM source_table
LATERAL VIEW explode(split(target_column, ',')) tmp AS exploded_col;

The split() function breaks the string into an array, explode() turns each array element into a row, and LATERAL VIEW creates a temporary virtual table. The alias exploded_col becomes a usable column in the SELECT clause.

Important: Filtering conditions must appear after the LATERAL VIEW clause; placing them before causes syntax errors.

Aggregating Multiple Rows into a Single Delimited String

Use collect_set() (deduplicated) or collect_list() (with duplicates), combined with concat_ws():

-- Deduplicated result
SELECT name,
       concat_ws(',', collect_set(CAST(type AS STRING))) AS type_list
FROM sample_table
GROUP BY name;

-- With duplicates preserved
SELECT name,
       concat_ws(',', collect_list(CAST(type AS STRING))) AS type_list
FROM sample_table
GROUP BY name;

Both functions require string inputs—cast non-string columns explicitly. Neither guarantees order; to enforce sorting:

SELECT name,
       concat_ws(',', collect_list(CAST(type AS STRING))) AS type_list
FROM (
  SELECT name, type
  FROM sample_table
  DISTRIBUTE BY name
  SORT BY type
) t
GROUP BY name;

Parsing JSON Strings

For single-field extraction:

SELECT get_json_object(json_col, '$.name') AS name,
       get_json_object(json_col, '$.age')  AS age
FROM json_table;

For multi-field extraction more efficiently:

SELECT t.json_col,
       j.name,
       j.age
FROM json_table t
LATERAL VIEW json_tuple(t.json_col, 'name', 'age') j AS name, age;

Mitigating Data Skew with DISTRIBUTE BY

To avoid skewed partitions (e.g., when partitioning by date), add a random salt:

INSERT OVERWRITE TABLE target_table PARTITION(day)
SELECT *
FROM staging_table
DISTRIBUTE BY day, pmod(CAST(rand() * 1000 AS INT), 50);

This distributes data across 50 sub-buckets per day, preventing large files in a single reducer.

Efficient Set Combination: UNION vs UNION ALL

  • UNION: Removes duplicates and sorts results → slower.
  • UNION ALL: Appends results as-is → faster and preferred when duplicates aren’t expected.
-- Fast, no deduplication
SELECT col FROM table1
UNION ALL
SELECT col FROM table2;

Optimizing Joins with Early Filtering

Filter the right table within the ON clause to reduce join volume:

SELECT a.id, a.val, b.val
FROM left_table a
JOIN right_table b ON a.id = b.id AND b.val > 100
WHERE a.val > 50;

This optimization works only for inner joins. It does not apply to LEFT or RIGHT joins.

Window Functions and Aggregation Behavior

Key behaviors:

  • COUNT(*) includes NULLs; COUNT(col) excludes NULLs.
  • MIN, MAX, AVG ignore NULLs unless all values are NULL.

Window frame examples:

SUM(cost) OVER () -- entire dataset
SUM(cost) OVER (PARTITION BY user) -- per-user total
SUM(cost) OVER (PARTITION BY user ORDER BY dt ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) -- rolling 2-day sum

Default frame with ORDER BY is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

Offset and First/Last Value Functions

  • LAG(col, n, default): value n rows before current.
  • LEAD(col, n, default): value n rows after.
  • FIRST_VALUE(col): first in window.
  • LAST_VALUE(col): last in window—use full window spec to avoid partial results:
LAST_VALUE(id) OVER (PARTITION BY grp ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

Ranking Functions: ROW_NUMBER, RANK, DENSE_RANK

  • ROW_NUMBER(): 1,2,3,4… (no ties)
  • RANK(): 1,2,2,4… (gaps after ties)
  • DENSE_RANK(): 1,2,2,3… (no gaps)

All used with OVER(PARTITION BY ... ORDER BY ...).

Prefer GROUP BY Over DISTINCT for Large-Scale Deduplication

COUNT(DISTINCT col) forces all data into one reducer → severe skew. Instead:

SELECT COUNT(*)
FROM (SELECT col FROM big_table GROUP BY col) t;

This leverages multiple reducers via dynamic partitioning.

Controlling Map Task Count

  • Increase mappers: SET mapred.map.tasks = 2000;
  • Decrease mappers: increase split size:
    SET mapred.min.split.size = 268435456; -- 256 MB
    

LEFT JOIN Conditions: ON vs WHERE

Conditions on the left table in ON are ignored. Conditions on the right table filter it before joining:

-- Filters right table early
SELECT * FROM A LEFT JOIN B ON A.id = B.id AND B.status = 'active';

-- Equivalent post-join filter (less efficient)
SELECT * FROM A LEFT JOIN B ON A.id = B.id WHERE B.status = 'active';

But note: WHERE B.col IS NOT NULL converts a LEFT JOIN into an inner join.

Avoid Scientific Notation in Output

Cast numeric results to DECIMAL:

SELECT CAST(large_number AS DECIMAL(18,2)) FROM table;

Repair Partitions Added via HDFS

When partition directories are added directly to HDFS (e.g., hdfs dfs -put), sync metadata with:

MSCK REPAIR TABLE table_name;

Requires partition paths in /col=value/ format.

Understanding COUNT Variants

  • COUNT(*) and COUNT(1): count all rows, including those with NULLs.
  • COUNT(col): counts only non-NULL values in col.

Enable Column Headers in CLI Output

SET hive.cli.print.header=true;

String Search Functions

  • Hive: LOCATE('substring', column) → position (1-based)
  • Presto: STRPOS(column, 'substring')

Conditional Counting with NULL

SELECT COUNT(IF(condition, value, NULL)); -- counts only when condition is true

Since COUNT(NULL) = 0, this effectively filters.

Numeric Rounding and Truncation

  • CEIL(x): smallest integer ≥ x
  • FLOOR(x): largest integer ≤ x
  • ROUND(x, d): rounds to d decimal places
  • CAST(x AS INT): truncates decimals (no rounding)

Utility Functions

  • ARRAY_CONTAINS(arr, val): returns true if array contains value.
  • COALESCE(a, b, c): returns first non-NULL argument.

Performance Note on OR in WHERE

Using OR in filters often prevents predicate pushdown and column pruning, leading to full table scans. Consider rewriting with UNION ALL if feasible.

Tags: hivesql

Related Articles

Understanding Strong and Weak References in Java

Strong References Strong reference are the most prevalent type of object referencing in Java. When an object has a strong reference pointing to it, the garbage collector will not reclaim its memory. F...

Comprehensive Guide to SSTI Explained with Payload Bypass Techniques

Introduction Server-Side Template Injection (SSTI) is a vulnerability in web applications where user input is improper handled within the template engine and executed on the server. This exploit can r...

Implement Image Upload Functionality for Django Integrated TinyMCE Editor

Django’s Admin panel is highly user-friendly, and pairing it with TinyMCE, an effective rich text editor, simplifies content management significantly. Combining the two is particular useful for bloggi...

Leave a Comment

Anonymous

◎Feel free to join the discussion and share your thoughts.