Fading Coder


MySQL Index Design Best Practices and Performance Optimization

The Critical Role of Indexes

Proper indexing directly impacts MySQL database performance. Without appropriate indexes, even moderate-sized tables become vulnerable to full table scans that can cause response times to spike dramatically under concurrent load. In severe cases, this leads to database performance collapse.

The InnoDB storage engine, which dominates modern deployments, relies on indexes for its locking mechanism. Insufficient indexes expand lock scope, potentially causing table-level locks that severely impact SQL parallel execution and system scalability. Only with proper indexing do row-level locks function as intended.

Understanding EXPLAIN Output

Full Table Scans and Row Estimates

When analyzing query execution plans, the type: ALL indicates a full table scan. The rows estimate represents the optimizer's calculation of records to be examined, derived from dynamic sampling.

mysql> SELECT COUNT(1) FROM order_records;
+----------+
| COUNT(1) |
+----------+
| 8923451  |
+----------+

mysql> EXPLAIN SELECT * FROM order_records WHERE customer_id = 13301689388199959972\G
*************************** 1. row ***************************
           type: ALL
           rows: 8567123

The query requires scanning millions of rows because customer_id lacks an index. Full table scans on large datasets under concurrent conditions are performance killers.

Index Usage and Coverage

The key column shows which index the optimizer chose. Using index in the Extra column indicates a covering index scenario—the query retrieves all needed columns from the index without touching the table.

mysql> EXPLAIN SELECT id, book_name FROM novel_agg_info WHERE book_name = 'Test Novel'\G
*************************** 1. row ***************************
             type: ref
             key: book_name
           rows: 1
        Extra: Using index

mysql> EXPLAIN SELECT id, book_name, tag FROM novel_agg_info WHERE book_name = 'Test Novel'\G
*************************** 1. row ***************************
             type: ref
             key: book_name
           rows: 1
        Extra: NULL

The first query needs only id and book_name—both stored in the index leaf nodes alongside the primary key—so it avoids table access. The second query requires tag, which exists only in the table, forcing additional random I/O.

Determining Composite Index Field Usage via key_len

Composite index INDEX(a, b, c) field usage can be deduced from key_len values:

key_len Calculation Rules:

  • Nullable columns add 1 byte for null indicator
  • Variable-length types (VARCHAR) add 2 bytes for length storage
  • Fixed-length types (INT, CHAR, DATETIME) carry no length overhead
  • Character columns: GBK reserves 2 bytes per character; MySQL's utf8 (utf8mb3) reserves 3 bytes per character (utf8mb4 reserves 4)
  • INT occupies 4 bytes, BIGINT occupies 8 bytes
mysql> CREATE TABLE operation_menu (
    `id` INT AUTO_INCREMENT PRIMARY KEY,
    `app_id` INT NOT NULL,
    `created_at` INT NULL,
    `from_platform` VARCHAR(200),
    KEY `idx_appid_created_platform` (`app_id`, `created_at`, `from_platform`)
);

-- Query using first two fields
mysql> EXPLAIN SELECT * FROM operation_menu WHERE app_id = 927 AND created_at = 1494492062\G
           key_len: 9

-- Query using only first field
mysql> EXPLAIN SELECT * FROM operation_menu WHERE app_id = 927 AND from_platform = 'web'\G
           key_len: 4

The first query uses app_id (4 bytes) + created_at (4 bytes + 1 byte for the null indicator) = 9 bytes, confirming that two fields are used. The second query uses only app_id at 4 bytes: with no condition on created_at, from_platform cannot serve as an access condition.
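The rules above can be folded into a small calculator (a hypothetical helper for checking EXPLAIN output by hand, not anything MySQL provides):

```python
def key_len(columns):
    """Sum the bytes MySQL reports as key_len for the index columns used.

    Each column is (base_bytes, nullable, variable_length):
      - base_bytes: 4 for INT, 8 for BIGINT, bytes-per-char * N for VARCHAR(N)
      - nullable adds 1 byte for the null indicator
      - variable-length types add 2 bytes for length storage
    """
    total = 0
    for base, nullable, varlen in columns:
        total += base
        if nullable:
            total += 1
        if varlen:
            total += 2
    return total

# app_id INT NOT NULL (4) + created_at INT NULL (4 + 1) = 9,
# matching the first EXPLAIN above
print(key_len([(4, False, False), (4, True, False)]))  # 9
# app_id alone, matching the second EXPLAIN
print(key_len([(4, False, False)]))                    # 4
```

Reading key_len this way quickly reveals when a composite index is only partially used.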

Order By and Filesort

Queries with ORDER BY typically require physical sorting—the infamous Using filesort—which retrieves all matching rows and sorts them in memory or on disk. This is expensive and should be avoided where possible.

However, indexes maintain logical order via bidirectional linked leaf nodes. For WHERE a = ? ORDER BY b LIMIT 1, index (a, b) can return results without physical sorting by navigating directly to a = ? and reading the first b value from the ordered leaf chain.

mysql> EXPLAIN SELECT * FROM activity_log WHERE operator = 'admin' ORDER BY create_time DESC LIMIT 5\G
           type: ref
           key: idx_operator_createtime
           rows: 150
        Extra: Using where

No Using filesort appears because the index (operator, create_time) provides the required ordering. Compare this to:

mysql> EXPLAIN SELECT * FROM activity_log WHERE operator IN ('admin', 'guest') ORDER BY create_time DESC LIMIT 5\G
           Extra: Using where; Using filesort

The IN clause creates a range condition on operator, breaking the index's logical order and triggering physical sorting.
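Why the range breaks the ordering can be sketched: the index stores one create_time-ordered run per operator value, and the concatenation of several runs is not globally ordered, so MySQL must sort (or, conceptually, merge) them. A minimal illustration with made-up timestamps:

```python
import heapq

# Leaf entries of index (operator, create_time): sorted by operator first,
# then by create_time, so each operator contributes its own ordered run.
index_runs = {
    "admin": [1700000005, 1700000003, 1700000001],  # create_time DESC per run
    "guest": [1700000004, 1700000002],
}

# WHERE operator IN ('admin', 'guest') reads both runs; concatenating them
# does NOT yield global create_time DESC order, which is what forces filesort:
concatenated = index_runs["admin"] + index_runs["guest"]
print(concatenated == sorted(concatenated, reverse=True))  # False

# Producing the global DESC order requires merging the runs (a sort step):
merged = list(heapq.merge(*index_runs.values(), reverse=True))
print(merged)  # [1700000005, 1700000004, 1700000003, 1700000002, 1700000001]
```

With a single equality on operator there is only one run, so the index order is the answer and no sorting is needed.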

Index Design Guidelines

Fundamental InnoDB Index Properties:

  • Index leaf nodes store defined columns plus the primary key, enabling access to non-indexed columns
  • Leaf nodes use bidirectional linked lists to maintain logical ordering

PRIMARY KEY Recommendations for InnoDB

Use auto-increment BIGINT as the primary key with business keys as unique constraints:

  1. Insert Performance: Business keys often have random insertion patterns, causing page splits and fragmentation in the clustered index. Auto-increment keys ensure sequential inserts with compact, space-efficient pages.

  2. Index Space Efficiency: Non-clustered indexes store the primary key to locate rows. BIGINT (8 bytes) is far more efficient than typical business keys (string, multi-column), especially as the number of secondary indexes grows.

Column Selectivity Considerations

Field selectivity is calculated as COUNT(DISTINCT column) / COUNT(*). Values approaching 1 indicate high selectivity; values approaching 0 indicate low selectivity.

High-selectivity columns (like IDs) benefit from indexing. Low-selectivity columns (like status with few enum values) generally do not.

mysql> SELECT status, COUNT(1) FROM user_records GROUP BY status;
+---------------+----------+
| status        | count(1) |
+---------------+----------+
|             0 |  3511945 |
|             1 |   367234 |
|             2 |   419062 |
|            12 |       16 |
+---------------+----------+

mysql> EXPLAIN SELECT * FROM user_records WHERE status = 12\G
           type: ref
           key: idx_status
           rows: 15

mysql> EXPLAIN SELECT * FROM user_records WHERE status = 0\G
           type: ref
           key: idx_status
           rows: 1955112

While an index exists on status, queries for status = 0 scan nearly 2 million rows. Often, a full table scan outperforms index access when returning a large percentage of rows.
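Plugging the distribution above into the selectivity formula makes the problem concrete (a quick sketch using the counts shown):

```python
# Row counts per status value, taken from the GROUP BY output above
counts = {0: 3511945, 1: 367234, 2: 419062, 12: 16}
total = sum(counts.values())

# Selectivity = COUNT(DISTINCT status) / COUNT(*): far closer to 0 than to 1
selectivity = len(counts) / total
print(f"selectivity = {selectivity:.2e}")

# Fraction of the table each value's index lookup must visit:
# status = 0 touches the overwhelming majority of rows, status = 12 almost none
for status, n in counts.items():
    print(status, f"{n / total:.2%}")
```

The per-value fractions show why the same index is excellent for status = 12 and worse than a table scan for status = 0.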

Exception: When low-selectivity columns have extremely skewed distributions, indexing becomes valuable for queries targeting rare values. Task status fields in job queues exemplify this—"pending" status might be queried frequently despite low selectivity.

Index Overhead Reality

Indexes carry significant costs:

  • Every INSERT/UPDATE/DELETE must maintain all affected indexes, increasing write latency
  • Excessive indexes degrade write performance substantially
  • Unused or redundant indexes waste storage and increase maintenance overhead

Remove unused indexes during schema changes and SQL refactoring. Balance read optimization against write penalties.

Avoid Operations on Indexed Columns

Applying any operation to an indexed column disables the index, forcing full table scans:

-- Bad: triggers full table scan
SELECT * FROM transactions WHERE id + 1 = 10000;

-- Good
SELECT * FROM transactions WHERE id = 9999;

-- Bad: function on indexed column
SELECT * FROM audit_log WHERE FROM_UNIXTIME(event_time) = '2024-01-15 00:00:00';

-- Good: move calculation to the comparison value
SELECT * FROM audit_log WHERE event_time = UNIX_TIMESTAMP('2024-01-15 00:00:00');

-- Bad: implicit type conversion
SELECT * FROM user_sessions WHERE session_id = 'abc123';  -- session_id is BIGINT

-- Good: type-matched comparison
SELECT * FROM user_sessions WHERE session_id = 12345;

Type Conversion Rules: MySQL converts strings to numbers when comparing with integers. When the column is VARCHAR and the value is numeric, the conversion occurs on the column side, disabling the index. When the column is INT and the value is a string, the conversion happens on the value side—the index remains usable.
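The column-side case can be illustrated: a B-tree over a VARCHAR column is ordered by string comparison, which disagrees with numeric order, so once MySQL must compare numerically it cannot range-scan the string-ordered tree. A small sketch:

```python
# Values as a VARCHAR index stores them: ordered as strings
string_index = sorted(["1", "10", "2", "20", "3"])
print(string_index)    # ['1', '10', '2', '20', '3']

# The same values under numeric comparison (what WHERE col = 10 forces
# when the literal is a number and the column is VARCHAR):
numeric_order = sorted(string_index, key=int)
print(numeric_order)   # ['1', '2', '3', '10', '20']

# The orders disagree, so the string-ordered B-tree cannot be navigated
# for a numeric predicate; MySQL must convert every entry instead.
print(string_index == numeric_order)  # False
```

When the column is INT and the literal is a string, only the single literal is converted, the numeric order of the index is untouched, and the lookup proceeds normally.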

Recommendation: Use INT/BIGINT over numeric VARCHAR whenever possible—they're smaller (4-8 bytes) and immune to index-breaking type mismatches.

Index Field Type Changes: When converting column types, apply changes in the correct order:

  • From INT to VARCHAR: Update application code first, then modify the column
  • From VARCHAR to INT: Change the column first, wait for replica propagation, then update application code

Case-Sensitive Comparisons

MySQL string comparisons are case-insensitive by default because common collations such as utf8_general_ci and gbk_chinese_ci are case-insensitive. Using BINARY forces case sensitivity but disables indexes:

mysql> EXPLAIN SELECT * FROM config WHERE BINARY app_name = 'LbsPCommon' AND status = 1\G
           type: ALL
           key: NULL
           rows: 9156
        Extra: Using where

The BINARY keyword causes a full table scan. Instead, define the column with a case-sensitive collation:

`app_name` VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL

This preserves case sensitivity while maintaining index usability.

Avoid Leading Wildcards in LIKE Patterns

Patterns with a leading wildcard (%value or %value%) disable index usage or force full index scans:

-- Bad: full table scan
WHERE name LIKE '%zhao%'

-- Acceptable: can use index for range scan
WHERE name LIKE 'zhao%'

For frequent fuzzy search requirements, implement full-text search systems rather than relying on MySQL LIKE patterns. Often, seemingly necessary fuzzy searches can be replaced with precise matching.

Prefix Indexes and Redundant Indexes

Index INDEX(a, b, c) optimizes:

  • WHERE a = ? AND b = ? AND c = ?
  • WHERE a = ? AND b = ?
  • WHERE a = ?
  • WHERE a = ? ORDER BY b
  • WHERE a = ? AND b = ? ORDER BY c

It does NOT optimize:

  • WHERE b = ? AND c = ? (prefix column a missing)
  • WHERE c = ?

For queries WHERE b = ? AND c = ? and WHERE c = ?, define INDEX(c, b) instead of INDEX(b, c). The latter cannot optimize WHERE c = ?.

Redundant Indexes: Given INDEX(a, b, c), indexes INDEX(a) and INDEX(a, b) are typically redundant and should be dropped.

Exception: When queries like WHERE a = ? ORDER BY id LIMIT 1 exist, INDEX(a)—effectively INDEX(a, id), since secondary leaf nodes append the primary key—can satisfy the sort without a physical operation. INDEX(a, b, c) and INDEX(a, b)—effectively INDEX(a, b, c, id) and INDEX(a, b, id)—cannot, because id does not immediately follow the equality column a.
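The leftmost-prefix rules above can be expressed as a tiny checker (a hypothetical helper for reasoning about index usage, not a MySQL feature):

```python
def usable_prefix(index_cols, eq_cols, order_cols=()):
    """Return (n, ordered): n = how many leading index columns serve as
    equality access conditions; ordered = whether the ORDER BY columns
    continue that prefix, allowing the index to satisfy the sort."""
    used = 0
    for col in index_cols:
        if col in eq_cols:
            used += 1
        else:
            break
    # ORDER BY avoids filesort only if its columns immediately follow
    # the equality prefix, in index order
    ordered = list(index_cols[used:used + len(order_cols)]) == list(order_cols)
    return used, ordered

idx = ("a", "b", "c")
print(usable_prefix(idx, {"a", "b", "c"}))           # (3, True)
print(usable_prefix(idx, {"b", "c"}))                # (0, True): a missing, unusable
print(usable_prefix(idx, {"a"}, order_cols=("b",)))  # (1, True)  -> no filesort
print(usable_prefix(idx, {"a"}, order_cols=("c",)))  # (1, False) -> filesort
```

The last two calls mirror the WHERE a = ? ORDER BY b versus ORDER BY c cases: only the column immediately after the equality prefix preserves index order.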

Index Field Ordering

Place equality conditions first, range conditions last. Range conditions on preceding columns prevent subsequent columns from being used efficiently as access conditions.

mysql> CREATE TABLE audit_log (
    `id` INT AUTO_INCREMENT PRIMARY KEY,
    `operator` VARCHAR(255) NOT NULL,
    `created_time` INT NOT NULL,
    KEY `idx_operator_time` (`operator`, `created_time`),
    KEY `idx_time_operator` (`created_time`, `operator`)
);

-- Uses both fields efficiently
mysql> EXPLAIN SELECT COUNT(1) FROM audit_log 
    WHERE operator = 'zhangyu21' AND created_time BETWEEN 1492876800 AND 1495468800\G
           key: idx_operator_time
           rows: 0

-- Range on first field prevents efficient use of second field
mysql> EXPLAIN SELECT COUNT(1) FROM audit_log 
    FORCE INDEX(idx_time_operator)
    WHERE operator = 'zhangyu21' AND created_time BETWEEN 1492876800 AND 1495468800\G
           key: idx_time_operator
           rows: 22211

When forcing idx_time_operator (first field is range), every index entry in the time range must be evaluated for the operator condition. The optimizer correctly avoids this index.

Optimizing ORDER BY Queries

For WHERE a = ? ORDER BY b LIMIT N queries, INDEX(a, b) provides excellent optimization by scanning ordered index segments.

mysql> SELECT * FROM transaction_log WHERE client_id = '' ORDER BY timestamp DESC LIMIT 5;

Query_time: 0.001566  Rows_sent: 5  Rows_examined: 5

With 2.5 million matching rows, the query examines only 5 records by leveraging the index's logical ordering.

Critical Rule: Only when all index columns before the ORDER BY column use equality conditions can the index maintain ordering, avoiding physical sorting.

-- Cannot use index for sorting due to range condition on list_type
mysql> EXPLAIN SELECT * FROM transaction_log 
    WHERE client_id = '' AND list_type IN ('cronJob', 'cronJobNew') 
    ORDER BY timestamp DESC LIMIT 1\G
        Extra: Using index condition; Using where; Using filesort

-- Rewrite with union for equivalent results
SELECT * FROM (
    (SELECT * FROM transaction_log WHERE client_id = '' AND list_type = 'cronJob' ORDER BY timestamp DESC LIMIT 1)
    UNION ALL
    (SELECT * FROM transaction_log WHERE client_id = '' AND list_type = 'cronJobNew' ORDER BY timestamp DESC LIMIT 1)
) tmp ORDER BY timestamp DESC LIMIT 1;

Optimizer Quirks: MySQL sometimes makes poor index choices for ORDER BY queries, preferring indexes that avoid sorting but scan excessive rows:

mysql> EXPLAIN SELECT * FROM layer WHERE app_id = 2183 ORDER BY id ASC LIMIT 1\G
           type: index
           key: PRIMARY
           rows: 1986479

Using the PRIMARY KEY avoids filesort but scans nearly 2 million rows. Forcing idx_app_id:

mysql> EXPLAIN SELECT * FROM layer FORCE INDEX(idx_app_id) WHERE app_id = 2183 ORDER BY id ASC LIMIT 1\G
           type: ref
           key: idx_app_id
           rows: 1

Execution time drops from 2.5 seconds to 0.2 milliseconds. Use hints like FORCE INDEX and IGNORE INDEX cautiously—they bind to index names, so they silently stop working when indexes are renamed.

Single-Column vs Composite Indexes

Avoid creating separate single-column indexes for multi-condition queries:

-- Poor design
index(a), index(b), index(c)

-- Good design for high-selectivity 'a'
index(a, b, c)

When selectivity is distributed across columns, a composite index outperforms index merge:

For WHERE a = ? AND b = ?, if INDEX(a) returns 10,000 rows and INDEX(b) returns 10,000 rows, index merge must intersect 20,000 rows—a costly operation. INDEX(a, b) returns the 100 matching rows directly.

Note: For WHERE a = ? OR b = ? with high-selectivity columns, separate indexes may be necessary since composite indexes don't help with OR conditions.
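The cost gap between index merge and a composite index can be made concrete by counting touched rows (an illustrative sketch with synthetic row ids matching the 10,000/10,000/100 figures above):

```python
# Hypothetical row ids: 10,000 rows match a = 1, 10,000 match b = 1,
# but only 100 rows satisfy both conditions.
rows_a = set(range(0, 10_000))        # row ids where a = 1
rows_b = set(range(9_900, 19_900))    # row ids where b = 1

# Index merge: read both posting lists in full, then intersect them
touched_by_merge = len(rows_a) + len(rows_b)
matches = rows_a & rows_b
print(touched_by_merge, len(matches))  # 20000 100

# A composite INDEX(a, b) navigates straight to the (a = 1, b = 1)
# range and touches only the matching entries:
touched_by_composite = len(matches)
print(touched_by_composite)            # 100
```

The merge plan reads 200x more index entries than the composite index for the same 100-row result.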

Multi-Table Joins

MySQL traditionally supports only NESTED LOOP joins—there is no SORT MERGE JOIN, and HASH JOIN was only added in MySQL 8.0.18. A nested loop join is essentially:

FOR each row in outer_table:
    FOR each row in inner_table WHERE join_condition:
        emit combined row

NESTED LOOP excels when the outer table returns few rows (after filtering) and the inner table has an index on join columns. For large result sets from both sides, performance degrades significantly.
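The loop above can be made concrete in a short sketch (hypothetical table data; a dict stands in for the inner table's B-tree index on the join column):

```python
from collections import defaultdict

def nested_loop_join(outer, inner, outer_key, inner_key):
    """Index nested-loop join: probe a lookup structure on the inner
    table's join column once per outer row, instead of scanning the
    whole inner table for every outer row."""
    index = defaultdict(list)
    for row in inner:                  # build the "index" on the inner table
        index[row[inner_key]].append(row)
    for o in outer:                    # FOR each row in outer_table
        for i in index[o[outer_key]]:  # indexed inner lookup
            yield {**o, **i}           # emit combined row

# Illustrative tables; names are made up
orders = [{"order_id": 1, "cust": 10}, {"order_id": 2, "cust": 11}]
custs = [{"cust": 10, "name": "alice"}, {"cust": 11, "name": "bob"}]
print(list(nested_loop_join(orders, custs, "cust", "cust")))
```

Without the index, the inner loop degenerates to a full scan per outer row, which is exactly why joins on unindexed columns collapse as both sides grow.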

Recommendations:

  • Avoid complex multi-table joins in MySQL—suitable only for simple lookups
  • For complex analytics, use dedicated OLAP systems like Hadoop or HBase
  • Consider application-level joins for simple relationships, trading multiple queries for reduced database load

Slow Query Analysis

Using pt-query-digest

Percona's pt-query-digest analyzes slow query logs effectively:

pt-query-digest slow.log > slow_analysis.txt

It groups similar queries (after literal extraction), sorts by total execution time, and reports representative queries for each group.

Interpreting Results

Focus on queries consuming the most total time. The Rows examined / Rows sent ratio indicates efficiency—for single-row retrieval, target a ratio of 1.

# Profile
# Rank Query ID           Response time   Calls R/Call
# ==== ================== =============== ===== =======
#    1 0x426D0452190D3B9C 9629.1622 55.8%  5204  1.8503

This query class consumed 9,629 seconds (55.8% of total) across 5,204 executions, averaging 1.85 seconds per execution.

# Rows examine  18.54G   3.65M average per execution
# Rows sent      5.65k   1.11 average per execution

Ratio: 3,650,000 / 1.11 ≈ 3,290,000 rows scanned per row returned—extremely inefficient. Adding INDEX(app_id, created_at) dramatically improves this pattern.

Optimization Case Studies

Case 1: Range Scan with Aggregation

mysql> EXPLAIN SELECT version_name, SUM(request_pv) FROM version_stats 
    WHERE dtime >= 20170504 AND dtime <= 20170510 
    GROUP BY version_name ORDER BY request_pv DESC\G
           type: range
           key: INDEX_d
           rows: 2779470
        Extra: Using index condition; Using temporary; Using filesort

The range scan on dtime matches roughly 1.46 million rows (the EXPLAIN rows value of 2.78 million is the optimizer's sampled estimate). With INDEX(dtime) alone, each row requires a table fetch for version_name and request_pv—1.46 million random I/O operations on top of the index scan cost.

Solution: Create a covering index:

ALTER TABLE version_stats ADD KEY idx_dtime_version_request (dtime, version_name, request_pv);
mysql> EXPLAIN SELECT version_name, SUM(request_pv) FROM version_stats 
    WHERE dtime >= 20170504 AND dtime <= 20170510 
    GROUP BY version_name ORDER BY request_pv DESC\G
           type: range
           key: idx_dtime_version_request
        Extra: Using where; Using index; Using temporary; Using filesort

Using index indicates no table access. Execution time halves.

Further optimization: Transform range query to multiple equality conditions:

SELECT version_name, SUM(request_pv) AS request_pv
FROM (
    SELECT version_name, SUM(request_pv) AS request_pv FROM version_stats WHERE dtime = 20170504 GROUP BY version_name
    UNION ALL
    SELECT version_name, SUM(request_pv) AS request_pv FROM version_stats WHERE dtime = 20170505 GROUP BY version_name
    UNION ALL
    -- ... additional days
) tmp GROUP BY version_name ORDER BY request_pv DESC;

Each day holds ~210,000 rows, which the inner GROUP BY reduces to ~2,500. The final sort processes 2,500 × 7 = 17,500 rows instead of 1.46 million. Execution time drops from 2.35s to 1.01s.

Business-layer optimization: Since daily statistics rarely change after insertion, compute daily aggregations nightly and store in summary tables. Reports query thousands of rows instead of millions.

Case 2: Replacing Complex Self-Join with Loop

Original slow query:

SELECT s.state_id, s.transaction_id, s.ip, s.state_name, s.start_time, s.update_time, s.end_time
FROM machine_state s
INNER JOIN (
    SELECT MAX(state_id) AS state_id FROM machine_state GROUP BY ip
) latest ON s.state_id = latest.state_id
INNER JOIN machine m ON s.ip = m.ip
WHERE m.pool IN ('repair-pool')
LIMIT 0, 999999999999;

Execution time: 6-8 seconds, scanning 8+ million rows.

Replaced with loop-based approach:

$stmt1 = $pdo->prepare("SELECT ip FROM machine WHERE pool = ? AND ip > ? ORDER BY ip LIMIT 1000");
$stmt2 = $pdo->prepare("SELECT state_id, transaction_id, ip, state_name, start_time, update_time, end_time FROM machine_state WHERE ip = ? ORDER BY state_id DESC LIMIT 1");

$lastIp = '';  // keyset cursor: ip is compared as a string
while (true) {
    $stmt1->execute([$pool, $lastIp]);
    $ips = $stmt1->fetchAll(PDO::FETCH_ASSOC);

    foreach ($ips as $row) {
        $stmt2->execute([$row['ip']]);
        // Process the latest state row for this ip
    }

    if (count($ips) < 1000) {
        break;  // final partial page
    }
    $lastIp = $ips[count($ips) - 1]['ip'];  // advance the cursor past this page
}

Both queries use efficient index access:

  • First query: PRIMARY KEY range scan
  • Second query: idx_ip for single IP lookup

Execution time drops from 7+ seconds to milliseconds.

Case 3: Query Rewriting for Index Compatibility

Ranking query requiring mixed-direction sorting:

SELECT score FROM xiadou_user 
ORDER BY score DESC, last_add_score_time ASC LIMIT 1500, 1;
mysql> EXPLAIN SELECT score FROM xiadou_user 
    ORDER BY score DESC, last_add_score_time ASC LIMIT 1500, 1\G
           type: index
           key: SCORE_TIME_INDEX
           rows: 9255660
        Extra: Using index; Using filesort

Despite SCORE_TIME_INDEX(score, last_add_score_time) existing, the mixed sort directions disable index-based sorting (before MySQL 8.0's descending indexes, an index cannot serve a DESC/ASC mix). The query reads ~9 million rows, sorts them, then returns position 1500.

Observation: The actual requirement is the 1500th score value, not user identity at that position. Query rewriting:

SELECT score FROM xiadou_user ORDER BY score DESC LIMIT 1500, 1;

mysql> EXPLAIN SELECT score FROM xiadou_user ORDER BY score DESC LIMIT 1500, 1\G
           type: index
           key: SCORE_TIME_INDEX
           rows: 1501
        Extra: Using index

Same functional result, but now the index covers the query completely—no filesort, reads only 1,501 rows.

Further improvement: Execute the full ranking query once after the event ends, cache results, and serve from cache during the competition.

Summary of Key Principles

  • Index selectively: High-selectivity columns benefit most; avoid indexing low-selectivity columns
  • Cover queries: Include all retrieved columns in the index to eliminate table access
  • Order matters: Place equality conditions before range conditions in index definitions
  • Avoid column operations: Keep indexed columns unmodified in WHERE clauses
  • Choose composite indexes strategically: They outperform multiple single-column indexes for multi-condition queries
  • Leverage index ordering: Use indexes for ORDER BY to eliminate filesort operations
  • Measure before optimizing: Use EXPLAIN and slow query analysis to guide index decisions
