Core MySQL Concepts for Backend Engineering Interviews
Fundamental Database Concepts
Index Column Limits
A single index in MySQL can encompass a maximum of 16 columns.
Storage Engines
MySQL utilizes storage engines to manage data storage, indexing, and retrieval mechanisms. The primary engines include:
- InnoDB: The default engine since version 5.5.5. It supports ACID transactions, row-level locking, and foreign key constraints, designed for high-concurrency systems.
- MyISAM: The legacy default engine. It lacks transaction support and row-level locking, utilizing table-level locking instead.
- MEMORY: Stores data entirely in RAM for high-speed access, though data is lost upon restart.
InnoDB vs. MyISAM
Key distinctions include:
- Transactions: InnoDB supports them; MyISAM does not.
- Locking: InnoDB uses row-level locks by default; MyISAM uses table-level locks.
- Structure: InnoDB uses a clustered index structure where data files are index files. MyISAM separates data and index files.
- Foreign Keys: Supported only by InnoDB.
- Recovery: InnoDB supports crash-safe recovery; MyISAM does not.
Default Isolation Level
InnoDB operates under the REPEATABLE-READ isolation level by default.
Database Normalization
- 1NF: Ensures atomicity; every column contains indivisible values.
- 2NF: Requires all non-key attributes to be fully dependent on the primary key.
- 3NF: Eliminates transitive dependencies; non-key attributes must depend only on the primary key.
Drop, Delete, and Truncate
- DROP: A DDL operation that removes the table structure and data entirely.
- TRUNCATE: A DDL operation that clears data but retains the structure. It resets auto-increment counters and cannot be rolled back, since it triggers an implicit commit.
- DELETE: A DML operation that removes rows based on a condition. It can be rolled back and does not reset auto-increment counters unless all rows are removed.
IN vs. EXISTS
Both are used for subqueries. Performance depends on dataset size:
- EXISTS: Efficient when the outer table is small and the inner table is large.
- IN: Efficient when the outer table is large and the inner table is small.
Indexing Mechanisms
Data Structures
Indexes optimize retrieval speeds. Common structures include:
- Hash: Offers O(1) lookups but does not support range queries or sorting.
- B-Tree: A multi-way balanced tree where nodes store both keys and data.
- B+Tree: A variant where only leaf nodes store data, linked via a doubly linked list. Internal nodes store only keys.
B-Tree vs. B+Tree
- Data Storage: B-Tree stores data in all nodes; B+Tree stores data only in leaves.
- Traversal: B+Tree leaves are linked, facilitating range scans. B-Tree requires recursive traversal.
- IO Efficiency: B+Tree internal nodes contain more keys per page, reducing tree height and disk I/O.
B+Tree vs. Red-Black Tree
Databases prefer B+Trees because:
- Height: B+Trees have lower height for large datasets, minimizing disk I/O operations compared to binary trees like Red-Black trees.
- Locality: B+Tree nodes align with disk pages, leveraging spatial locality for pre-fetching.
- Range Queries: B+Trees support efficient range scans due to linked leaf nodes.
Index Algorithms
- B-Tree: Supports equality, range, and prefix matching (e.g., LIKE 'value%').
- Hash: Supports only equality checks (=).
InnoDB Search Flow
- Load the root page into memory.
- Perform binary search within the page to find the pointer to the next level.
- Repeat until the leaf node (data page) is reached.
- Retrieve the record from the leaf node.
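The descent above can be sketched as a toy B+Tree lookup. This is an illustrative model, not InnoDB's actual page format; the node layout and names are invented:

```python
# Toy B+Tree lookup mirroring the InnoDB search flow above.
# Internal nodes hold sorted separator keys plus child pointers; leaves hold records.
from bisect import bisect_right

class Node:
    def __init__(self, keys, children=None, records=None):
        self.keys = keys            # sorted keys within the "page"
        self.children = children    # child nodes (internal), None for leaves
        self.records = records      # row payloads (leaf), None for internal nodes

def search(node, key):
    # Descend from the root, binary-searching within each page.
    while node.children is not None:
        node = node.children[bisect_right(node.keys, key)]
    # At the leaf (data page): locate the record.
    i = bisect_right(node.keys, key) - 1
    return node.records[i] if i >= 0 and node.keys[i] == key else None

leaf1 = Node([1, 5], records=["row1", "row5"])
leaf2 = Node([10, 20], records=["row10", "row20"])
root = Node([10], children=[leaf1, leaf2])
print(search(root, 5))   # row5
print(search(root, 20))  # row20
```

Each loop iteration corresponds to loading one page into memory, which is why tree height dictates the number of disk reads.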
B+Tree Capacity Calculation
Assuming a 16KB page size and a row size of 1KB:
- Leaf Node: Holds 16 rows.
- Non-Leaf Node: Stores primary key (8-byte BIGINT) + pointer (6 bytes) = 14 bytes per entry. A page holds ~1170 pointers.
- Height 2: 1170 pointers * 16 rows = 18,720 rows.
- Height 3: 1170 * 1170 * 16 = ~21.9 million rows.
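The arithmetic above can be checked directly. These are back-of-envelope figures that ignore page headers and fill factor:

```python
PAGE = 16 * 1024    # 16KB page size
ROW = 1024          # assumed 1KB row size
ENTRY = 8 + 6       # 8-byte BIGINT key + 6-byte page pointer

rows_per_leaf = PAGE // ROW             # 16 rows per leaf page
fanout = PAGE // ENTRY                  # ~1170 pointers per internal page

print(fanout)                           # 1170
print(fanout * rows_per_leaf)           # 18720 rows at height 2
print(fanout ** 2 * rows_per_leaf)      # 21902400 (~21.9M) rows at height 3
```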
Index Creation Principles
- Index columns frequently used in WHERE, JOIN, or ORDER BY clauses.
- Avoid indexing low-cardinality columns (e.g., gender).
- Adhere to the Leftmost Prefix rule for composite indexes.
- Avoid indexing large text fields unless using prefix indexes.
- Extend existing indexes rather than creating duplicates.
Index Usage Scenarios
Indexes are utilized during filtering (WHERE), sorting (ORDER BY), and joining (JOIN) operations.
Index Types
- Primary Key: Unique, non-null identifier.
- Unique: Enforces uniqueness but allows multiple NULLs.
- Normal: Standard index for speed.
- Prefix: Indexes only the first N characters of a string.
- Full-Text: Used for text search capabilities.
Clustered vs. Non-Clustered
- Clustered: Data and index are stored together (InnoDB Primary Key).
- Non-Clustered: Index stores pointers to data (Secondary Indexes, MyISAM).
Leftmost Prefix Principle
For a composite index (A, B, C), queries must start with A. If a range query occurs on B, column C will not utilize the index. Equality checks can be reordered by the optimizer, but range queries break the chain.
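A rough model of that rule, simplifying the optimizer's actual logic (column names are illustrative):

```python
def usable_prefix(index_cols, eq_cols, range_cols):
    """Return how many leading index columns a query can use, given the
    set of columns under equality predicates and under range predicates."""
    used = 0
    for col in index_cols:
        if col in eq_cols:          # equality keeps the prefix chain going
            used += 1
        elif col in range_cols:     # a range uses this column, then breaks the chain
            used += 1
            break
        else:                       # gap in the prefix: stop
            break
    return used

idx = ("A", "B", "C")
print(usable_prefix(idx, {"A", "B", "C"}, set()))  # 3: all-equality uses the full index
print(usable_prefix(idx, {"A"}, {"B"}))            # 2: range on B stops C from being used
print(usable_prefix(idx, {"B", "C"}, set()))       # 0: missing leading column A
```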
Index Failure Conditions
Indexes may be ignored when:
- Using LIKE with a leading wildcard (%value).
- Performing implicit type conversions.
- Applying functions or calculations on indexed columns.
- Using OR where not all conditions are indexed.
- The optimizer estimates a full table scan is faster.
Index Condition Pushdown (ICP)
Introduced in MySQL 5.6, ICP allows the storage engine to filter rows using index data before returning them to the server layer, reducing table lookups.
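The effect can be illustrated with a toy model: without ICP, the engine performs a table lookup for every entry matching the leading index column; with ICP, it first applies the remaining indexed predicates inside the engine. This is a conceptual sketch, not MySQL internals:

```python
# Secondary index on (city, age): entries of (city, age, primary_key).
index = [("NY", 25, 1), ("NY", 40, 2), ("NY", 31, 3), ("LA", 25, 4)]

def table_lookups(city, max_age, icp):
    """Count clustered-index (table) lookups for: city = ? AND age < ?"""
    count = 0
    for c, a, pk in index:
        if c != city:
            continue                 # leading column filters the index range
        if icp and a >= max_age:
            continue                 # ICP: filtered from index data, no row fetch
        count += 1                   # table lookup to fetch the full row
    return count

print(table_lookups("NY", 30, icp=False))  # 3 lookups, filter applied at server layer
print(table_lookups("NY", 30, icp=True))   # 1 lookup, filter pushed down to the engine
```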
Adaptive Hash Index
InnoDB automatically creates hash indexes for frequently accessed B-Tree pages to speed up point lookups. This is managed internally.
Auto-Increment Primary Keys
Recommended because:
- Ensures sequential writes, reducing page splits.
- Facilitates efficient range queries.
- Keeps the index structure compact.
EXPLAIN Analysis
Key fields include:
- id: Execution order.
- type: Access method (system > const > eq_ref > ref > range > index > ALL).
- key: The index actually used.
- rows: Estimated rows examined.
- Extra: Details like Using index (covering) or Using filesort (performance warning).
Transaction Management
ACID Properties
- Atomicity: All or nothing.
- Consistency: Data remains valid according to rules.
- Isolation: Concurrent transactions do not interfere.
- Durability: Committed changes persist.
ACID Implementation in InnoDB
- Atomicity/Consistency: Managed via Undo Log.
- Durability: Managed via Redo Log.
- Isolation: Managed via Locks and MVCC.
Concurrency Anomalies
- Dirty Read: Reading uncommitted data.
- Non-Repeatable Read: Data changes between reads within a transaction.
- Phantom Read: New rows appear between reads within a transaction.
Isolation Levels
- Read Uncommitted: Allows dirty reads.
- Read Committed: Prevents dirty reads.
- Repeatable Read: Prevents non-repeatable reads (MySQL Default).
- Serializable: Prevents all anomalies via locking.
Transaction Workflow
- Load data page into Buffer Pool.
- Modify memory data.
- Write Undo Log for rollback.
- Write Redo Log for durability.
- Commit transaction (flush Redo Log).
Transaction Logs
- Redo Log: Physical log ensuring durability. Written continuously.
- Undo Log: Logical log ensuring atomicity. Stores previous versions for rollback and MVCC.
Binlog
Server-level binary log recording all data modifications (DDL/DML). Used for replication and point-in-time recovery. Formats include Statement, Row, and Mixed.
Locking Strategies
Lock Mechanisms
- Shared Lock (S): Permits concurrent reads; blocks writers.
- Exclusive Lock (X): Permits writing; blocks all other locks on the record.
- Intent Locks: Indicate intention to lock rows (IS/IX).
- Gap Lock: Locks a range between index records.
- Next-Key Lock: Combination of Record Lock + Gap Lock.
Isolation and Locks
Higher isolation levels utilize stricter locking. Serializable converts plain SELECTs into locking reads, while Repeatable Read uses gap locks to prevent phantoms.
Solving Phantom Reads
- Snapshot Read: Handled by MVCC.
- Current Read: Handled by Next-Key Locks.
MVCC (Multi-Version Concurrency Control)
Maintains multiple versions of a row using Undo Log chains. Each transaction sees a snapshot based on its start time.
- Read View: Determines visibility of data versions based on transaction IDs.
- RC Level: Creates a new Read View for each statement.
- RR Level: Creates a Read View at the first snapshot read and reuses it for the rest of the transaction.
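Visibility can be sketched as follows. This is simplified; the parameter names loosely follow the fields of an InnoDB read view:

```python
def visible(row_trx_id, creator_id, min_active, next_id, active_ids):
    """Decide whether a row version is visible to a read view.
    min_active: lowest active trx id when the view was created;
    next_id: next trx id to be assigned; active_ids: then-active transactions."""
    if row_trx_id == creator_id:
        return True                        # our own uncommitted change
    if row_trx_id < min_active:
        return True                        # committed before the view existed
    if row_trx_id >= next_id:
        return False                       # started after the view was created
    return row_trx_id not in active_ids    # in-range: visible iff already committed

# View created by trx 100 while trx 90 and 95 were still active.
print(visible(80, 100, 90, 101, {90, 95}))   # True: committed long before
print(visible(95, 100, 90, 101, {90, 95}))   # False: still active, invisible
print(visible(93, 100, 90, 101, {90, 95}))   # True: committed before the view
```

An invisible version sends the reader down the Undo Log chain to an older version of the row.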
Architecture and Optimization
Sharding Strategies
- Horizontal: Splitting rows across tables.
- Vertical: Splitting columns across tables.
- Routing: Hash modulo, Range-based, or Hybrid approaches.
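Hash-modulo routing can be sketched as below, assuming a numeric shard key and the 32-database x 32-table layout mentioned under Dynamic Scaling (all sizes illustrative):

```python
def route(user_id, num_dbs=32, num_tables=32):
    """Map a numeric shard key to a (database index, table index) pair."""
    slot = user_id % (num_dbs * num_tables)   # 1024 logical slots
    return slot // num_tables, slot % num_tables

print(route(12345))   # (1, 25): 12345 % 1024 = 57 -> db 57//32, table 57%32
print(route(0))       # (0, 0)
```

Using a power-of-two slot count up front lets shards be moved between physical hosts later without changing the routing function.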
Sharding Challenges
- Distributed Transactions: Require middleware or eventual consistency.
- Joins: Avoid cross-shard joins; use data redundancy or application-side assembly.
- Pagination/Sorting: Complex across shards; requires gathering and merging results.
- Global IDs: Use Snowflake algorithm instead of auto-increment.
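A Snowflake-style generator can be sketched as below: a 41-bit timestamp, 10-bit worker ID, and 12-bit per-millisecond sequence (the epoch constant is Twitter's conventional one; production versions must also handle clock rollback, which this sketch does not):

```python
import time, threading

class Snowflake:
    """Sketch of a Snowflake ID: 41-bit timestamp | 10-bit worker | 12-bit sequence."""
    def __init__(self, worker_id, epoch_ms=1288834974657):
        self.worker_id = worker_id & 0x3FF
        self.epoch = epoch_ms
        self.seq = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF
                if self.seq == 0:               # 4096 IDs this ms: wait for next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.seq = 0
            self.last_ms = now
            return ((now - self.epoch) << 22) | (self.worker_id << 12) | self.seq

gen = Snowflake(worker_id=7)
a, b = gen.next_id(), gen.next_id()
print(a < b)  # True: IDs stay monotonically increasing per worker
```

Because the timestamp occupies the high bits, IDs remain roughly time-ordered across shards, which preserves sequential-insert behavior for clustered primary keys.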
Deep Pagination Optimization
- Seek Method: WHERE id > last_seen_id LIMIT N.
- Covering Index: Select only IDs in the subquery, then join back.
- Business Limits: Restrict maximum page depth.
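The seek method can be demonstrated with SQLite standing in for MySQL; the SQL shape is the same (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO posts VALUES (?, ?)",
                 [(i, f"post-{i}") for i in range(1, 101)])

def page_after(last_seen_id, n=10):
    # Seek method: an index range scan starting at last_seen_id.
    # Unlike LIMIT offset, n there are no earlier rows to scan and discard.
    return conn.execute(
        "SELECT id, title FROM posts WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, n)).fetchall()

print(page_after(0)[0])    # (1, 'post-1'): first page
print(page_after(50)[0])   # (51, 'post-51'): next page after id 50
```

The caller passes the last id of the previous page, so cost stays constant no matter how deep the pagination goes.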
ORDER BY Optimization
- Avoid SELECT *.
- Use indexes to satisfy sorting.
- Tune sort_buffer_size.
Filesort Methods
- Single-Pass: Retrieves all columns, sorts in memory.
- Double-Pass: Retrieves sort keys and pointers, sorts, then retrieves rows.
Multi-Range Read (MRR)
Optimizes range queries by sorting primary keys retrieved from secondary indexes before accessing the clustered index, converting random I/O to sequential I/O.
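The idea can be sketched with a toy model: primary keys coming off a secondary index arrive in index order, which is effectively random with respect to the clustered index; MRR buffers and sorts them first. Counting page switches is a rough proxy for random I/O:

```python
# Primary keys fetched from a secondary index range scan, in index order:
pks = [88, 3, 85, 7, 81, 2]

def page_switches(keys, rows_per_page=10):
    """Proxy for random I/O: how often successive lookups hit a different page."""
    pages = [k // rows_per_page for k in keys]
    return sum(1 for a, b in zip(pages, pages[1:]) if a != b)

print(page_switches(pks))          # 5: unsorted access ping-pongs between pages
print(page_switches(sorted(pks)))  # 1: sorted access sweeps forward sequentially
```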
Update Statement Execution
- Connector: Handles authentication.
- Parser: Validates syntax.
- Optimizer: Chooses execution plan.
- Executor: Calls engine API.
- Engine: Writes Undo Log, updates Buffer Pool, writes Redo Log (Prepare), writes Binlog, commits Redo Log.
Large Table Operations
- Deletion: Drop indexes, batch delete, recreate indexes.
- Query Optimization: Use appropriate data types, indexing, read/write splitting, or archive cold data.
Replication
- Master: Dumps binlog.
- Slave I/O: Receives binlog to Relay Log.
- Slave SQL: Executes Relay Log.
- Modes: Async, Semi-Sync, Full-Sync.
Replication Lag
Caused by network latency or heavy slave load. Solutions include parallel replication, scaling read nodes, or forcing critical reads to the master.
Slow SQL Troubleshooting
- Hardware: Check CPU, RAM, Disk I/O.
- Architecture: Implement caching, sharding, or read/write splitting.
- Configuration: Tune buffer_pool, connections, and logs.
- SQL: Analyze EXPLAIN, remove functions on indexes, optimize joins.
CPU Spikes
Identify top processes. If MySQL is the cause, check SHOW PROCESSLIST for heavy queries. Optimize SQL or adjust concurrency limits.
WAL, LSN, Checkpoint
- WAL: Write-Ahead Logging ensures data is logged before writing to disk.
- LSN: Log Sequence Number tracks log progress.
- Checkpoint: Marks the point where dirty pages are flushed to disk, allowing old logs to be overwritten.
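Crash recovery follows from these three pieces: replay the redo records whose LSN lies beyond the last checkpoint. This is a toy model; the record format is invented:

```python
# Redo records as (lsn, page_id, new_value); `disk` is the on-disk page state.
redo_log = [(1, "p1", "a"), (2, "p2", "b"), (3, "p1", "c"), (4, "p3", "d")]
checkpoint_lsn = 2          # pages already reflect every change with lsn <= 2
disk = {"p1": "a", "p2": "b"}

def recover(pages, log, ckpt):
    # WAL guarantees every change reached the log before the data pages,
    # so replaying from the checkpoint reconstructs the post-crash state.
    for lsn, page, value in log:
        if lsn > ckpt:
            pages[page] = value
    return pages

print(recover(dict(disk), redo_log, checkpoint_lsn))
# {'p1': 'c', 'p2': 'b', 'p3': 'd'}
```

Advancing the checkpoint is what allows the circular redo log files to be safely overwritten.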
Migration to Sharded Architecture
- Offline: Downtime required to export/import data.
- Dual-Write: Write to both old and new systems, sync data, verify consistency, then switch read traffic.
Dynamic Scaling
Pre-allocate sufficient shards (e.g., 32 DBs * 32 Tables) to accommodate future growth without immediate re-sharding.
Log Types
- Redo Log: Crash recovery (Engine level).
- Undo Log: Rollback and MVCC (Engine level).
- Binlog: Replication and PITR (Server level).
- Relay Log: Temporary storage for replication.
- Slow Query Log: Tracks long-running queries.
- General Log: Tracks all connections and queries.
- Error Log: Records startup/shutdown and critical errors.
InnoDB Engine Internals
Logical Storage
- Tablespace: Contains data and indexes.
- Segment: Data, Index, or Rollback segments.
- Extent: 1MB unit consisting of 64 pages.
- Page: 16KB minimum storage unit.
- Row: Basic data unit.
Memory Structure
- Buffer Pool: Caches data and indexes.
- Change Buffer: Buffers changes to non-unique secondary indexes.
- Log Buffer: Stages redo logs before flushing.
- Adaptive Hash Index: Accelerates lookups for hot data.