Optimizing MySQL Queries: Joins, Sorting, Pagination, Subqueries, and Group By
Optimizing Join Queries
1. Setup for Testing
CREATE TABLE IF NOT EXISTS category(
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
ref_code INT UNSIGNED NOT NULL,
PRIMARY KEY(id)
);
CREATE TABLE IF NOT EXISTS item(
item_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
ref_code INT UNSIGNED NOT NULL,
PRIMARY KEY(item_id)
);
INSERT INTO category(ref_code) VALUES (FLOOR(1 + (RAND() * 20)));
INSERT INTO item(ref_code) VALUES (FLOOR(1 + (RAND() * 20)));
Case 1: LEFT OUTER JOIN
EXPLAIN SELECT SQL_NO_CACHE * FROM category LEFT JOIN item ON category.ref_code = item.ref_code;
After adding an index:
CREATE INDEX idx_ref ON item(ref_code);
INNER JOIN
EXPLAIN SELECT SQL_NO_CACHE * FROM category INNER JOIN item ON category.ref_code = item.ref_code;
For inner joins, the optimizer selects the driving and driven tables. If only one table has an index on the join condition, that table becomes the driven table. With indexes on both tables, the smaller table is chosen as the driver.
Join Mechanisms
In a query like SELECT * FROM A JOIN B ON ..., A is not always the driver; the optimizer decides based on query analysis. Use EXPLAIN to identify roles.
For outer joins, such as SELECT * FROM A LEFT JOIN B ON ..., A typically drives, but this can vary.
Simple Nested-Loop Join
This method performs a full scan of the driven table for each row in the driver table, leading to high cost with large datasets.
Index Nested-Loop Join
Requires an index on the driven table's join column. It reduces matches by using the index for lookups, improving performance.
Block Nested-Loop Join
Uses a join buffer to store rows from the driver table, reducing disk I/O. Enable with block_nested_loop=on.
Check settings:
SHOW VARIABLES LIKE '%optimizer_switch%';
SHOW VARIABLES LIKE '%join_buffer_size%';
Join Summary
From MySQL 8.0.20, hash joins replace BNLJ by default. Ensure join columns are indexed, data types match, and use small tables as drivers for left joins. Prefer joins over subqueries to avoid temporary tables.
Subquery Optimization
Subqueries allow nesting SELECT statements but can be ineffficient due to temporary table creasion and lack of indexing. Replace with JOIN queries for better performence, especially avoiding NOT IN or NOT EXISTS in favor of LEFT JOIN with IS NULL checks.
Sorting Optimization
MySQL uses Index or FileSort for ordering. Index sorting is faster; FileSort occurs in memory or disk, consuming CPU. Optimize by:
- Using indexes on WHERE and ORDER BY clauses.
- Employing composite indexes for different columns.
- Avoiding SELECT * to enable index-only scans.
- Addding LIMIT clauses to reduce sorted data.
Index usage fails with incorrect column order or mixed sort directions. Ensure filtering exists to leverage indexes.
FileSort Algorithms
Two methods: two-pass (old) and single-pass (improved). Increase sort_buffer_size to handle large sorts and minimize I/O.
Group By Optimization
Similar to ORDER BY, GROUP BY benefits from indexes and follows leftmost prefix rules. Increase buffer sizes if indexes aren't used. Prefer WHERE over HAVING for filtering. Limit result sets to under 1000 rows to avoid slow queries.
Pagination Query Optimization
For queries like LIMIT 2000000,10, performance degrades due to sorting overhead. Optimize by:
- Sorting on an indexed column and joining back:
EXPLAIN SELECT * FROM item t, (SELECT item_id FROM item ORDER BY item_id LIMIT 2000000,10) s WHERE t.item_id = s.item_id;
- For auto-increment primary keys, use range queries:
EXPLAIN SELECT * FROM item WHERE item_id > 2000000 LIMIT 10;
EXISTS vs IN
Choose based on table size: use EXISTS for small driving tables and IN for large ones, considering index availability.
COUNT(*) vs COUNT(column)
COUNT(*) and COUNT(1) are generally efficient for row counts; COUNT(column) counts non-null values and may be slower if endexing differs.
Avoiding SELECT *
Specify columns instead of using * to reduce parsing overhead and enable covering indexes.
Using LIMIT 1
Add LIMIT 1 to full-table scans when only one row is expected to stop early. Not needed with unique indexed queries.
Frequent COMMIT Usage
Use COMMIT in transactions to release locks and improve concurrency.