Unnesting Correlated Subqueries: Systematic Transformation to Dependent Joins
Correlated subqueries fundamentally constrain execution efficiency by requiring nested iteration semantics. When the outer relation contains substantial cardinality, the naive row-by-row evaluation strategy—analogous to a nested-loop join with arbitrary complexity on the inner side—becomes computationally prohibitive. Query optimizers must transform these dependent operations into set-oriented alternatives through a process variously termed decorrelation, unnesting, or flattening.
This transformation hinges on the Apply operator (also referred to as Dependent Join or Correlated Join), which explicitly models the execution of a parameterized inner query for each tuple from the outer input. While commercial optimizers differ in implementation specifics, the theoretical foundation rests on a unified set of algebraic equivalences that systematically eliminate correlation.
Classification of Subquery Expressions
Correlated subqueries reference attributes from enclosing query blocks, rendering them incomplete with out external context. Prior to optimization, subqueries typically manifest within expression trees embedded in filters, projections, or join predicates. The Apply operator extracts these into explicit relational algebra structures.
Subqueries categorize by their return cardinality and comparison semantics:
Scalar Subqueries return exactly one value (single row, single column), substituting NULL for empty results and raising runtime errors for multiple rows. They appear anywhere scalar expressions are valid.
SELECT s.store_id
FROM stores s
WHERE 100000 < (
SELECT SUM(t.amount)
FROM transactions t
WHERE t.store_id = s.store_id
);
Existential Tests (EXISTS) return boolean values indicating non-emptiness of the subquery result. In filter contexts, these correspond to semi-joins.
SELECT s.store_id
FROM stores s
WHERE s.region_code = 'NA'
AND EXISTS (
SELECT 1
FROM inventory i
WHERE i.store_id = s.store_id
);
Quantified Comparisons (IN, SOME, ANY, ALL) test membership or universal quantification across sets.
SELECT p.product_name
FROM products p
WHERE p.category_id <> ALL (
SELECT category_id
FROM discontinued_categories
);
The Apply Operator Semantics
Apply bridges the gap between expression evaluation and relational algebra. Formally, Apply accepts an outer relation (O) and a parameterized inner relation expression (I(o)), producing the union (bag semantics) of (o \otimes I(o)) for all (o \in O), where (\otimes) denotes a join variant:
- Cross Apply ((D^{\times})): Cartesian product of outer tuple with its corresponding inner results
- Outer Apply ((D^{\text{LOJ}})): Preserves outer tuples when inner results are empty, padding with NULLs
- Semi Apply ((D^{\exists})): Returns outer tuple if inner result is non-empty
- Anti-Semi Apply ((D^{\nexists})): Returns outer tuple if inner result is empty
Conversion to Apply form proceeds by hoisting subqueries from expression trees in to explicit operator nodes. When scalar subqueries cannot guarantee single-row output, a Max1Row assertion operator guards against cardinality violations.
Fundamentall Elimination
When the inner side of an Apply contains no references to the outer relation (accidental correlation or subsequent optimization renders it independent), the operator collapses to standard join variants:
[ D^{\times}(O, I) \equiv O \times I \quad \text{if } I \text{ is independent of } O ]
[ D^{\text{LOJ}}(O, I) \equiv O \leftrightarrow I \quad \text{if } I \text{ is independent of } O ]
This represents the trivial case where decorrelation yields conventional joins.
Pushing Apply Through Projections and Filters
The core transformation strategy pushes Apply operators toward the leaves of the query tree while lifting projections ((\pi)) and selections ((\sigma)) above the Apply node. This inversion appears counterintuitive compared to standard predicate pushdown, yet serves the critical purpose of exposing correlation parameters as regular attributes.
Given a projection above an Apply:
[ \pi_{f(a, b)}(D^{\times}(O, I)) \equiv D^{\times}(O, \pi_{f(o, b)}(I)) ]
Where (o) represents the correlation parameter drawn from (O), now accessible as a constant within the inner projection. Similarly, for selections:
[ \sigma_{\theta}(D^{\times}(O, I)) \equiv D^{\times}(O, \sigma_{\theta}(I)) ]
When the predicate (\theta) references only inner attributes. If (\theta) references both outer and inner attributes, it remains attached to the Apply or decomposes into a subsequent join predicate.
Handling Aggregate Operations
Aggregates present the most intricate transformation scenarios. When Apply encounters aggregation, the optimizer must distinguish between Scalar Aggregation (no explicit GROUP BY) and Group Aggregation.
For Scalar Aggregation within Apply:
[ D^{\text{LOJ}}(O, \mathcal{G}{\emptyset, F}(I)) \equiv \pi{O, F'}(D^{\text{LOJ}}(O, \mathcal{G}_{K, F}(I))) ]
Here, the transformation converts Scalar Aggregation to Group Aggregation ((\mathcal{G}_{K, F})) using the outer relation's key (K) as the grouping column. The Left Outer Apply ensures that outer tuples without matching inner data still produce aggregate results (typically NULL or zero depending on the function), preserving the semantics of scalar subqueries.
Critical Distinction for Count Aggregates
Consider a query counting employees per department:
SELECT d.dept_code, (
SELECT COUNT(*)
FROM employees e
WHERE e.dept_code = d.dept_code
) AS headcount
FROM departments d;
Naive transformation yields:
SELECT d.dept_code, COUNT(*) AS headcount
FROM departments d
LEFT JOIN employees e ON e.dept_code = d.dept_code
GROUP BY d.dept_code, d.dept_id;
This produces incorrect results: departments with zero employees report a count of 1 because the LEFT JOIN generates a single NULL-padded row, and COUNT(*) counts rows rather than non-NULL values. The correct transformation requires:
SELECT d.dept_code, COUNT(e.emp_id) AS headcount
FROM departments d
LEFT JOIN employees e ON e.dept_code = d.dept_code
GROUP BY d.dept_code, d.dept_id;
Or for non-nullable expressions, explicit NULL handling:
SELECT d.dept_code, COUNT(CASE WHEN e.dept_code IS NOT NULL THEN 1 END) AS headcount
...
Functions where (F(\emptyset) \neq F({\text{NULL}})) require careful rewriting. For COUNT(*), substitute COUNT(non_nullable_column). For expressions like MAX(COALESCE(column, 0)), decomposition into explicit projection followed by aggregation ensures semantic equivalence.
Set Operations and Complex Subqueries
Subqueries containing set operations (UNION ALL, EXCEPT ALL) require distributing Apply over the set operation:
[ D^{\times}(O, I_1 \cup I_2) \equiv D^{\times}(O, I_1) \cup D^{\times}(O, I_2) ]
[ D^{\times}(O, I_1 \setminus I_2) \equiv D^{\times}(O, I_1) \setminus D^{\times}(O, I_2) ]
These transformations duplicate the outer relation reference, potentially converting the query tree into a directed acyclic graph (DAG) structure to avoid physical redundancy. While theoretically necessary for completeness, such patterns rarely manifest in analytical workloads.
Advanced Optimization: Magic Sets
Beyond basic unnesting, the Magic Sets (or MiniMax) optimization reduces the effective input size to the Apply operator. When the outer relation contains duplicate values in the correlated columns, executing the inner query for each outer row wastes computation.
The optimization inserts a distinct projection on the correlation attributes prior to Apply:
-- Original correlated form
SELECT d.*
FROM departments d
WHERE d.budget < (
SELECT SUM(e.salary)
FROM employees e
WHERE e.division_id = d.division_id
);
-- With Magic Sets
WITH unique_divisions AS (
SELECT DISTINCT division_id FROM departments
),
division_costs AS (
SELECT d.division_id, SUM(e.salary) AS total_salary
FROM unique_divisions d
JOIN employees e ON e.division_id = d.division_id
GROUP BY d.division_id
)
SELECT d.*
FROM departments d
JOIN division_costs dc ON dc.division_id = d.division_id
WHERE d.budget < dc.total_salary;
This transformation proves particularly effective when the outer table is large but the domain of correlation values is small.
Cost-Based Considerations
Decorrelation does not universally improve performance. When the outer relation exhibits low cardinality—particularly single-row cases—preserving the Apply structure (effectively a nested-loop join with materialization) may outperform hash-based set operations. Similar, when highly selective indexes exist on the inner relation's correlated columns, row-by-row index probing can surpass scanning and hashing the entire inner relation.
Consequently, modern optimizers implement these transformations as cost-based alternatives rather than hard rules. The optimizer estimates cardinalities and access paths, potentially electing to retain corelation (nested iteration) when the outer input is sufficiently small or the inner access path is sufficiently efficient. This mirrors the traditional Hash Join versus Nested Loop Join decision, where correlation represents the extreme case of nested iteration with arbitrary inner complexity.