Hive Practical Techniques and Common Pitfalls
Using explode for Column-to-Row Transformation
To convert a delimited string column into multiple rows:
SELECT other_cols, exploded_col
FROM source_table
LATERAL VIEW explode(split(target_column, ',')) tmp AS exploded_col;
The split() function breaks the string into an array, explode() turns each array element into a row, and LATERAL VIEW creates a temporary virtual table. The alias exploded_col becomes a usable column in the SELECT clause.
Important: Filtering conditions must appear after the
LATERAL VIEWclause; placing them before causes syntax errors.
Aggregating Multiple Rows into a Single Delimited String
Use collect_set() (deduplicated) or collect_list() (with duplicates), combined with concat_ws():
-- Deduplicated result
SELECT name,
concat_ws(',', collect_set(CAST(type AS STRING))) AS type_list
FROM sample_table
GROUP BY name;
-- With duplicates preserved
SELECT name,
concat_ws(',', collect_list(CAST(type AS STRING))) AS type_list
FROM sample_table
GROUP BY name;
Both functions require string inputs—cast non-string columns explicitly. Neither guarantees order; to enforce sorting:
SELECT name,
concat_ws(',', collect_list(CAST(type AS STRING))) AS type_list
FROM (
SELECT name, type
FROM sample_table
DISTRIBUTE BY name
SORT BY type
) t
GROUP BY name;
Parsing JSON Strings
For single-field extraction:
SELECT get_json_object(json_col, '$.name') AS name,
get_json_object(json_col, '$.age') AS age
FROM json_table;
For multi-field extraction more efficiently:
SELECT t.json_col,
j.name,
j.age
FROM json_table t
LATERAL VIEW json_tuple(t.json_col, 'name', 'age') j AS name, age;
Mitigating Data Skew with DISTRIBUTE BY
To avoid skewed partitions (e.g., when partitioning by date), add a random salt:
INSERT OVERWRITE TABLE target_table PARTITION(day)
SELECT *
FROM staging_table
DISTRIBUTE BY day, pmod(CAST(rand() * 1000 AS INT), 50);
This distributes data across 50 sub-buckets per day, preventing large files in a single reducer.
Efficient Set Combination: UNION vs UNION ALL
UNION: Removes duplicates and sorts results → slower.UNION ALL: Appends results as-is → faster and preferred when duplicates aren’t expected.
-- Fast, no deduplication
SELECT col FROM table1
UNION ALL
SELECT col FROM table2;
Optimizing Joins with Early Filtering
Filter the right table within the ON clause to reduce join volume:
SELECT a.id, a.val, b.val
FROM left_table a
JOIN right_table b ON a.id = b.id AND b.val > 100
WHERE a.val > 50;
This optimization works only for inner joins. It does not apply to
LEFTorRIGHTjoins.
Window Functions and Aggregation Behavior
Key behaviors:
COUNT(*)includes NULLs;COUNT(col)excludes NULLs.MIN,MAX,AVGignore NULLs unless all values are NULL.
Window frame examples:
SUM(cost) OVER () -- entire dataset
SUM(cost) OVER (PARTITION BY user) -- per-user total
SUM(cost) OVER (PARTITION BY user ORDER BY dt ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) -- rolling 2-day sum
Default frame with ORDER BY is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
Offset and First/Last Value Functions
LAG(col, n, default): valuenrows before current.LEAD(col, n, default): valuenrows after.FIRST_VALUE(col): first in window.LAST_VALUE(col): last in window—use full window spec to avoid partial results:
LAST_VALUE(id) OVER (PARTITION BY grp ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
Ranking Functions: ROW_NUMBER, RANK, DENSE_RANK
ROW_NUMBER(): 1,2,3,4… (no ties)RANK(): 1,2,2,4… (gaps after ties)DENSE_RANK(): 1,2,2,3… (no gaps)
All used with OVER(PARTITION BY ... ORDER BY ...).
Prefer GROUP BY Over DISTINCT for Large-Scale Deduplication
COUNT(DISTINCT col) forces all data into one reducer → severe skew. Instead:
SELECT COUNT(*)
FROM (SELECT col FROM big_table GROUP BY col) t;
This leverages multiple reducers via dynamic partitioning.
Controlling Map Task Count
- Increase mappers:
SET mapred.map.tasks = 2000; - Decrease mappers: increase split size:
SET mapred.min.split.size = 268435456; -- 256 MB
LEFT JOIN Conditions: ON vs WHERE
Conditions on the left table in ON are ignored. Conditions on the right table filter it before joining:
-- Filters right table early
SELECT * FROM A LEFT JOIN B ON A.id = B.id AND B.status = 'active';
-- Equivalent post-join filter (less efficient)
SELECT * FROM A LEFT JOIN B ON A.id = B.id WHERE B.status = 'active';
But note: WHERE B.col IS NOT NULL converts a LEFT JOIN into an inner join.
Avoid Scientific Notation in Output
Cast numeric results to DECIMAL:
SELECT CAST(large_number AS DECIMAL(18,2)) FROM table;
Repair Partitions Added via HDFS
When partition directories are added directly to HDFS (e.g., hdfs dfs -put), sync metadata with:
MSCK REPAIR TABLE table_name;
Requires partition paths in /col=value/ format.
Understanding COUNT Variants
COUNT(*)andCOUNT(1): count all rows, including those with NULLs.COUNT(col): counts only non-NULL values incol.
Enable Column Headers in CLI Output
SET hive.cli.print.header=true;
String Search Functions
- Hive:
LOCATE('substring', column)→ position (1-based) - Presto:
STRPOS(column, 'substring')
Conditional Counting with NULL
SELECT COUNT(IF(condition, value, NULL)); -- counts only when condition is true
Since COUNT(NULL) = 0, this effectively filters.
Numeric Rounding and Truncation
CEIL(x): smallest integer ≥ xFLOOR(x): largest integer ≤ xROUND(x, d): rounds toddecimal placesCAST(x AS INT): truncates decimals (no rounding)
Utility Functions
ARRAY_CONTAINS(arr, val): returns true if array contains value.COALESCE(a, b, c): returns first non-NULL argument.
Performance Note on OR in WHERE
Using OR in filters often prevents predicate pushdown and column pruning, leading to full table scans. Consider rewriting with UNION ALL if feasible.