Home > Tech > Content

Hive Practical Techniques and Common Pitfalls

Tech Apr 29 23

Using `explode` for Column-to-Row Transformation

To convert a delimited string column into multiple rows:

SELECT other_cols, exploded_col
FROM source_table
LATERAL VIEW explode(split(target_column, ',')) tmp AS exploded_col;

The split() function breaks the string into an array, explode() turns each array element into a row, and LATERAL VIEW creates a temporary virtual table. The alias exploded_col becomes a usable column in the SELECT clause.

Important: Filtering conditions must appear after the LATERAL VIEW clause; placing them before causes syntax errors.

Aggregating Multiple Rows into a Single Delimited String

Use collect_set() (deduplicated) or collect_list() (with duplicates), combined with concat_ws():

-- Deduplicated result
SELECT name,
       concat_ws(',', collect_set(CAST(type AS STRING))) AS type_list
FROM sample_table
GROUP BY name;

-- With duplicates preserved
SELECT name,
       concat_ws(',', collect_list(CAST(type AS STRING))) AS type_list
FROM sample_table
GROUP BY name;

Both functions require string inputs—cast non-string columns explicitly. Neither guarantees order; to enforce sorting:

SELECT name,
       concat_ws(',', collect_list(CAST(type AS STRING))) AS type_list
FROM (
  SELECT name, type
  FROM sample_table
  DISTRIBUTE BY name
  SORT BY type
) t
GROUP BY name;

Parsing JSON Strings

For single-field extraction:

SELECT get_json_object(json_col, '$.name') AS name,
       get_json_object(json_col, '$.age')  AS age
FROM json_table;

For multi-field extraction more efficiently:

SELECT t.json_col,
       j.name,
       j.age
FROM json_table t
LATERAL VIEW json_tuple(t.json_col, 'name', 'age') j AS name, age;

Mitigating Data Skew with `DISTRIBUTE BY`

To avoid skewed partitions (e.g., when partitioning by date), add a random salt:

INSERT OVERWRITE TABLE target_table PARTITION(day)
SELECT *
FROM staging_table
DISTRIBUTE BY day, pmod(CAST(rand() * 1000 AS INT), 50);

This distributes data across 50 sub-buckets per day, preventing large files in a single reducer.

Efficient Set Combination: `UNION` vs `UNION ALL`

UNION: Removes duplicates and sorts results → slower.
UNION ALL: Appends results as-is → faster and preferred when duplicates aren’t expected.

-- Fast, no deduplication
SELECT col FROM table1
UNION ALL
SELECT col FROM table2;

Optimizing Joins with Early Filtering

Filter the right table within the ON clause to reduce join volume:

SELECT a.id, a.val, b.val
FROM left_table a
JOIN right_table b ON a.id = b.id AND b.val > 100
WHERE a.val > 50;

This optimization works only for inner joins. It does not apply to LEFT or RIGHT joins.

Window Functions and Aggregation Behavior

Key behaviors:

COUNT(*) includes NULLs; COUNT(col) excludes NULLs.
MIN, MAX, AVG ignore NULLs unless all values are NULL.

Window frame examples:

SUM(cost) OVER () -- entire dataset
SUM(cost) OVER (PARTITION BY user) -- per-user total
SUM(cost) OVER (PARTITION BY user ORDER BY dt ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) -- rolling 2-day sum

Default frame with ORDER BY is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

Offset and First/Last Value Functions

LAG(col, n, default): value n rows before current.
LEAD(col, n, default): value n rows after.
FIRST_VALUE(col): first in window.
LAST_VALUE(col): last in window—use full window spec to avoid partial results:

LAST_VALUE(id) OVER (PARTITION BY grp ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

Ranking Functions: `ROW_NUMBER`, `RANK`, `DENSE_RANK`

ROW_NUMBER(): 1,2,3,4… (no ties)
RANK(): 1,2,2,4… (gaps after ties)
DENSE_RANK(): 1,2,2,3… (no gaps)

All used with OVER(PARTITION BY ... ORDER BY ...).

Prefer `GROUP BY` Over `DISTINCT` for Large-Scale Deduplication

COUNT(DISTINCT col) forces all data into one reducer → severe skew. Instead:

SELECT COUNT(*)
FROM (SELECT col FROM big_table GROUP BY col) t;

This leverages multiple reducers via dynamic partitioning.

Controlling Map Task Count

Increase mappers: SET mapred.map.tasks = 2000;

Decrease mappers: increase split size:

SET mapred.min.split.size = 268435456; -- 256 MB

`LEFT JOIN` Conditions: `ON` vs `WHERE`

Conditions on the left table in ON are ignored. Conditions on the right table filter it before joining:

-- Filters right table early
SELECT * FROM A LEFT JOIN B ON A.id = B.id AND B.status = 'active';

-- Equivalent post-join filter (less efficient)
SELECT * FROM A LEFT JOIN B ON A.id = B.id WHERE B.status = 'active';

But note: WHERE B.col IS NOT NULL converts a LEFT JOIN into an inner join.

Avoid Scientific Notation in Output

Cast numeric results to DECIMAL:

SELECT CAST(large_number AS DECIMAL(18,2)) FROM table;

Repair Partitions Added via HDFS

When partition directories are added directly to HDFS (e.g., hdfs dfs -put), sync metadata with:

MSCK REPAIR TABLE table_name;

Requires partition paths in /col=value/ format.

Understanding `COUNT` Variants

COUNT(*) and COUNT(1): count all rows, including those with NULLs.
COUNT(col): counts only non-NULL values in col.

Enable Column Headers in CLI Output

SET hive.cli.print.header=true;

String Search Functions

Hive: LOCATE('substring', column) → position (1-based)
Presto: STRPOS(column, 'substring')

Conditional Counting with `NULL`

SELECT COUNT(IF(condition, value, NULL)); -- counts only when condition is true

Since COUNT(NULL) = 0, this effectively filters.

Numeric Rounding and Truncation

CEIL(x): smallest integer ≥ x
FLOOR(x): largest integer ≤ x
ROUND(x, d): rounds to d decimal places
CAST(x AS INT): truncates decimals (no rounding)

Utility Functions

ARRAY_CONTAINS(arr, val): returns true if array contains value.
COALESCE(a, b, c): returns first non-NULL argument.

Performance Note on `OR` in `WHERE`

Using OR in filters often prevents predicate pushdown and column pruning, leading to full table scans. Consider rewriting with UNION ALL if feasible.

Tags: hive sql

Back to List

Prev: Foundational VTK Rendering and Geometry Examples

Next: Core Foundational Principles Governing Computer Operation

Fading Coder

Hive Practical Techniques and Common Pitfalls

Using `explode` for Column-to-Row Transformation

Aggregating Multiple Rows into a Single Delimited String

Parsing JSON Strings

Mitigating Data Skew with `DISTRIBUTE BY`

Efficient Set Combination: `UNION` vs `UNION ALL`

Optimizing Joins with Early Filtering

Window Functions and Aggregation Behavior

Offset and First/Last Value Functions

Ranking Functions: `ROW_NUMBER`, `RANK`, `DENSE_RANK`

Prefer `GROUP BY` Over `DISTINCT` for Large-Scale Deduplication

Controlling Map Task Count

`LEFT JOIN` Conditions: `ON` vs `WHERE`

Avoid Scientific Notation in Output

Repair Partitions Added via HDFS

Understanding `COUNT` Variants

Enable Column Headers in CLI Output

String Search Functions

Conditional Counting with `NULL`

Numeric Rounding and Truncation

Utility Functions

Performance Note on `OR` in `WHERE`

Related Articles

Understanding Strong and Weak References in Java

Comprehensive Guide to SSTI Explained with Payload Bypass Techniques

Implement Image Upload Functionality for Django Integrated TinyMCE Editor

Leave a Comment

Copyright © fadingcoder.top

Fading Coder

Hive Practical Techniques and Common Pitfalls

Using explode for Column-to-Row Transformation

Aggregating Multiple Rows into a Single Delimited String

Parsing JSON Strings

Mitigating Data Skew with DISTRIBUTE BY

Efficient Set Combination: UNION vs UNION ALL

Optimizing Joins with Early Filtering

Window Functions and Aggregation Behavior

Offset and First/Last Value Functions

Ranking Functions: ROW_NUMBER, RANK, DENSE_RANK

Prefer GROUP BY Over DISTINCT for Large-Scale Deduplication

Controlling Map Task Count

LEFT JOIN Conditions: ON vs WHERE

Avoid Scientific Notation in Output

Repair Partitions Added via HDFS

Understanding COUNT Variants

Enable Column Headers in CLI Output

String Search Functions

Conditional Counting with NULL

Numeric Rounding and Truncation

Utility Functions

Performance Note on OR in WHERE

Related Articles

Understanding Strong and Weak References in Java

Comprehensive Guide to SSTI Explained with Payload Bypass Techniques

Implement Image Upload Functionality for Django Integrated TinyMCE Editor

Leave a CommentCancel Reply

Copyright © fadingcoder.top

Using `explode` for Column-to-Row Transformation

Mitigating Data Skew with `DISTRIBUTE BY`

Efficient Set Combination: `UNION` vs `UNION ALL`

Ranking Functions: `ROW_NUMBER`, `RANK`, `DENSE_RANK`

Prefer `GROUP BY` Over `DISTINCT` for Large-Scale Deduplication

`LEFT JOIN` Conditions: `ON` vs `WHERE`

Understanding `COUNT` Variants

Conditional Counting with `NULL`

Performance Note on `OR` in `WHERE`

Leave a Comment