Core MySQL Concepts for Backend Engineering Interviews
Fundamental Database Concepts
Index Column Limits
A single index in MySQL can encompass a maximum of 16 columns.
Storage Engines
MySQL utilizes storage engines to manage data storage, indexing, and retrieval mechanisms. The primary engines include:
- InnoDB: The default engine since version 5.5.5. It supports ACID transactions, row-level locking, and foreign key constraints, designed for high-concurrency systems.
- MyISAM: The legacy default engine. It lacks transaction support and row-level locking, utilizing table-level locking instead.
- MEMORY: Stores data entirely in RAM for high-speed access, though data is lost upon restart.
InnoDB vs. MyISAM
Key distinctions include:
- Transactions: InnoDB supports them; MyISAM does not.
- Locking: InnoDB uses row-level locks by default; MyISAM uses table-level locks.
- Structure: InnoDB uses a clustered index structure where data files are index files. MyISAM separates data and index files.
- Foreign Keys: Supported only by InnoDB.
- Recovery: InnoDB supports crash-safe recovery; MyISAM does not.
Default Isolation Level
InnoDB operates under the REPEATABLE-READ isolation level by default.
Database Normalization
- 1NF: Ensures atomicity; every column contains indivisible values.
- 2NF: Requires all non-key attributes to be fully dependent on the primary key.
- 3NF: Eliminates transitive dependencies; non-key attributes must depend only on the primary key.
Drop, Delete, and Truncate
- DROP: A DDL operation that removes the table structure and data entirely.
- TRUNCATE: A DDL operation that clears data but retains the structure. It resets auto-increment counters and cannot be rolled back, since it triggers an implicit commit.
- DELETE: A DML operation that removes rows based on a condition. It can be rolled back and does not reset auto-increment counters unless all rows are removed.
IN vs. EXISTS
Both are used for subqueries. Performance depends on dataset size:
- EXISTS: Efficient when the outer table is small and the inner table is large.
- IN: Efficient when the outer table is large and the inner table is small.
Indexing Mechanisms
Data Structures
Indexes optimize retrieval speeds. Common structures include:
- Hash: Offers O(1) lookups but does not support range queries or sorting.
- B-Tree: A multi-way balanced tree where nodes store both keys and data.
- B+Tree: A variant where only leaf nodes store data, linked via a doubly linked list. Internal nodes store only keys.
B-Tree vs. B+Tree
- Data Storage: B-Tree stores data in all nodes; B+Tree stores data only in leaves.
- Traversal: B+Tree leaves are linked, facilitating range scans. B-Tree requires recursive traversal.
- IO Efficiency: B+Tree internal nodes contain more keys per page, reducing tree height and disk I/O.
B+Tree vs. Red-Black Tree
Databases prefer B+Trees because:
- Height: B+Trees have lower height for large datasets, minimizing disk I/O operations compared to binary trees like Red-Black trees.
- Locality: B+Tree nodes align with disk pages, leveraging spatial locality for pre-fetching.
- Range Queries: B+Trees support efficient range scans due to linked leaf nodes.
Index Algorithms
- B-Tree: Supports equality, range, and prefix matching (e.g., LIKE 'value%').
- Hash: Supports only equality checks (=).
InnoDB Search Flow
- Load the root page into memory.
- Perform binary search within the page to find the pointer to the next level.
- Repeat until the leaf node (data page) is reached.
- Retrieve the record from the leaf node.
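The descent above can be sketched as a toy B+Tree lookup. This is an illustrative model, not InnoDB's actual page format; the node layout and names are invented:

```python
# Toy B+Tree lookup mirroring the InnoDB search flow above.
# Internal nodes hold sorted separator keys plus child pointers; leaves hold records.
from bisect import bisect_right

class Node:
    def __init__(self, keys, children=None, records=None):
        self.keys = keys            # sorted keys within the "page"
        self.children = children    # child nodes (internal), None for leaves
        self.records = records      # row payloads (leaf), None for internal nodes

def search(node, key):
    # Descend from the root, binary-searching within each page.
    while node.children is not None:
        node = node.children[bisect_right(node.keys, key)]
    # At the leaf (data page): locate the record.
    i = bisect_right(node.keys, key) - 1
    return node.records[i] if i >= 0 and node.keys[i] == key else None

leaf1 = Node([1, 5], records=["row1", "row5"])
leaf2 = Node([10, 20], records=["row10", "row20"])
root = Node([10], children=[leaf1, leaf2])
print(search(root, 5))   # row5
print(search(root, 20))  # row20
```

Each loop iteration corresponds to loading one page into memory, which is why tree height dictates the number of disk reads.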
B+Tree Capacity Calculation
Assuming a 16KB page size and a row size of 1KB:
- Leaf Node: Holds 16 rows.
- Non-Leaf Node: Stores primary key (8-byte BIGINT) + pointer (6 bytes) = 14 bytes per entry. A page holds ~1170 pointers.
- Height 2: 1170 pointers * 16 rows = 18,720 rows.
- Height 3: 1170 * 1170 * 16 = ~21.9 million rows.
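The arithmetic above can be checked directly. These are back-of-envelope figures that ignore page headers and fill factor:

```python
PAGE = 16 * 1024    # 16KB page size
ROW = 1024          # assumed 1KB row size
ENTRY = 8 + 6       # 8-byte BIGINT key + 6-byte page pointer

rows_per_leaf = PAGE // ROW             # 16 rows per leaf page
fanout = PAGE // ENTRY                  # ~1170 pointers per internal page

print(fanout)                           # 1170
print(fanout * rows_per_leaf)           # 18720 rows at height 2
print(fanout ** 2 * rows_per_leaf)      # 21902400 (~21.9M) rows at height 3
```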
Index Creation Principles
- Index columns frequently used in WHERE, JOIN, or ORDER BY clauses.
- Avoid indexing low-cardinality columns (e.g., gender).
- Adhere to the Leftmost Prefix rule for composite indexes.
- Avoid indexing large text fields unless using prefix indexes.
- Extend existing indexes rather than creating duplicates.
Index Usage Scenarios
Indexes are utilized during filtering (WHERE), sorting (ORDER BY), and joining (JOIN) operations.
Index Types
- Primary Key: Unique, non-null identifier.
- Unique: Enforces uniqueness but allows multiple NULLs.
- Normal: Standard index for speed.
- Prefix: Indexes only the first N characters of a string.
- Full-Text: Used for text search capabilities.
Clustered vs. Non-Clustered
- Clustered: Data and index are stored together (InnoDB Primary Key).
- Non-Clustered: Index stores pointers to data (Secondary Indexes, MyISAM).
Leftmost Prefix Principle
For a composite index (A, B, C), queries must start with A. If a range query occurs on B, column C will not utilize the index. Equality checks can be reordered by the optimizer, but range queries break the chain.
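A rough model of that rule, simplifying the optimizer's actual logic (column names are illustrative):

```python
def usable_prefix(index_cols, eq_cols, range_cols):
    """Return how many leading index columns a query can use, given the
    set of columns under equality predicates and under range predicates."""
    used = 0
    for col in index_cols:
        if col in eq_cols:          # equality keeps the prefix chain going
            used += 1
        elif col in range_cols:     # a range uses this column, then breaks the chain
            used += 1
            break
        else:                       # gap in the prefix: stop
            break
    return used

idx = ("A", "B", "C")
print(usable_prefix(idx, {"A", "B", "C"}, set()))  # 3: all-equality uses the full index
print(usable_prefix(idx, {"A"}, {"B"}))            # 2: range on B stops C from being used
print(usable_prefix(idx, {"B", "C"}, set()))       # 0: missing leading column A
```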
Index Failure Conditions
Indexes may be ignored when:
- Using LIKE with a leading wildcard (%value).
- Performing implicit type conversions.
- Applying functions or calculations on indexed columns.
- Using OR where not all conditions are indexed.
- The optimizer estimates a full table scan is faster.
Index Condition Pushdown (ICP)
Introduced in MySQL 5.6, ICP allows the storage engine to filter rows using index data before returning them to the server layer, reducing table lookups.
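The effect can be illustrated with a toy model: without ICP, the engine performs a table lookup for every entry matching the leading index column; with ICP, it first applies the remaining indexed predicates inside the engine. This is a conceptual sketch, not MySQL internals:

```python
# Secondary index on (city, age): entries of (city, age, primary_key).
index = [("NY", 25, 1), ("NY", 40, 2), ("NY", 31, 3), ("LA", 25, 4)]

def table_lookups(city, max_age, icp):
    """Count clustered-index (table) lookups for: city = ? AND age < ?"""
    count = 0
    for c, a, pk in index:
        if c != city:
            continue                 # leading column filters the index range
        if icp and a >= max_age:
            continue                 # ICP: filtered from index data, no row fetch
        count += 1                   # table lookup to fetch the full row
    return count

print(table_lookups("NY", 30, icp=False))  # 3 lookups, filter applied at server layer
print(table_lookups("NY", 30, icp=True))   # 1 lookup, filter pushed down to the engine
```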
Adaptive Hash Index
InnoDB automatically creates hash indexes for frequently accessed B-Tree pages to speed up point lookups. This is managed internally.
Auto-Increment Primary Keys
Recommended because:
- Ensures sequential writes, reducing page splits.
- Facilitates efficient range queries.
- Keeps the index structure compact.
EXPLAIN Analysis
Key fields include:
- id: Execution order.
- type: Access method (system > const > eq_ref > ref > range > index > ALL).
- key: The index actually used.
- rows: Estimated rows examined.
- Extra: Details like Using index (covering) or Using filesort (performance warning).
Transaction Management
ACID Properties
- Atomicity: All or nothing.
- Consistency: Data remains valid according to rules.
- Isolation: Concurrent transactions do not interfere.
- Durability: Committed changes persist.
ACID Implementation in InnoDB
- Atomicity/Consistency: Managed via Undo Log.
- Durability: Managed via Redo Log.
- Isolation: Managed via Locks and MVCC.
Concurrency Anomalies
- Dirty Read: Reading uncommitted data.
- Non-Repeatable Read: Data changes between reads within a transaction.
- Phantom Read: New rows appear between reads within a transaction.
Isolation Levels
- Read Uncommitted: Allows dirty reads.
- Read Committed: Prevents dirty reads.
- Repeatable Read: Prevents non-repeatable reads (MySQL Default).
- Serializable: Prevents all anomalies via locking.
Transaction Workflow
- Load data page into Buffer Pool.
- Modify memory data.
- Write Undo Log for rollback.
- Write Redo Log for durability.
- Commit transaction (flush Redo Log).
Transaction Logs
- Redo Log: Physical log ensuring durability. Written continuously.
- Undo Log: Logical log ensuring atomicity. Stores previous versions for rollback and MVCC.
Binlog
Server-level binary log recording all data modifications (DDL/DML). Used for replication and point-in-time recovery. Formats include Statement, Row, and Mixed.
Locking Strategies
Lock Mechanisms
- Shared Lock (S): Permits concurrent reads; blocks writers.
- Exclusive Lock (X): Permits writing; blocks all other locks on the record.
- Intent Locks: Indicate intention to lock rows (IS/IX).
- Gap Lock: Locks a range between index records.
- Next-Key Lock: Combination of Record Lock + Gap Lock.
Isolation and Locks
Higher isolation levels utilize stricter locking. Serializable converts plain SELECTs into locking reads, while Repeatable Read uses gap locks to prevent phantoms.
Solving Phantom Reads
- Snapshot Read: Handled by MVCC.
- Current Read: Handled by Next-Key Locks.
MVCC (Multi-Version Concurrency Control)
Maintains multiple versions of a row using Undo Log chains. Each transaction sees a snapshot based on its start time.
- Read View: Determines visibility of data versions based on transaction IDs.
- RC Level: Creates a new Read View for each statement.
- RR Level: Creates a Read View at the first snapshot read and reuses it for the rest of the transaction.
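Visibility can be sketched as follows. This is simplified; the parameter names loosely follow the fields of an InnoDB read view:

```python
def visible(row_trx_id, creator_id, min_active, next_id, active_ids):
    """Decide whether a row version is visible to a read view.
    min_active: lowest active trx id when the view was created;
    next_id: next trx id to be assigned; active_ids: then-active transactions."""
    if row_trx_id == creator_id:
        return True                        # our own uncommitted change
    if row_trx_id < min_active:
        return True                        # committed before the view existed
    if row_trx_id >= next_id:
        return False                       # started after the view was created
    return row_trx_id not in active_ids    # in-range: visible iff already committed

# View created by trx 100 while trx 90 and 95 were still active.
print(visible(80, 100, 90, 101, {90, 95}))   # True: committed long before
print(visible(95, 100, 90, 101, {90, 95}))   # False: still active, invisible
print(visible(93, 100, 90, 101, {90, 95}))   # True: committed before the view
```

An invisible version sends the reader down the Undo Log chain to an older version of the row.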
Architecture and Optimization
Sharding Strategies
- Horizontal: Splitting rows across tables.
- Vertical: Splitting columns across tables.
- Routing: Hash modulo, Range-based, or Hybrid approaches.
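Hash-modulo routing can be sketched as below, assuming a numeric shard key and the 32-database x 32-table layout mentioned under Dynamic Scaling (all sizes illustrative):

```python
def route(user_id, num_dbs=32, num_tables=32):
    """Map a numeric shard key to a (database index, table index) pair."""
    slot = user_id % (num_dbs * num_tables)   # 1024 logical slots
    return slot // num_tables, slot % num_tables

print(route(12345))   # (1, 25): 12345 % 1024 = 57 -> db 57//32, table 57%32
print(route(0))       # (0, 0)
```

Using a power-of-two slot count up front lets shards be moved between physical hosts later without changing the routing function.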
Sharding Challenges
- Distributed Transactions: Require middleware or eventual consistency.
- Joins: Avoid cross-shard joins; use data redundancy or application-side assembly.
- Pagination/Sorting: Complex across shards; requires gathering and merging results.
- Global IDs: Use Snowflake algorithm instead of auto-increment.
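A Snowflake-style generator can be sketched as below: a 41-bit timestamp, 10-bit worker ID, and 12-bit per-millisecond sequence (the epoch constant is Twitter's conventional one; production versions must also handle clock rollback, which this sketch does not):

```python
import time, threading

class Snowflake:
    """Sketch of a Snowflake ID: 41-bit timestamp | 10-bit worker | 12-bit sequence."""
    def __init__(self, worker_id, epoch_ms=1288834974657):
        self.worker_id = worker_id & 0x3FF
        self.epoch = epoch_ms
        self.seq = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF
                if self.seq == 0:               # 4096 IDs this ms: wait for next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.seq = 0
            self.last_ms = now
            return ((now - self.epoch) << 22) | (self.worker_id << 12) | self.seq

gen = Snowflake(worker_id=7)
a, b = gen.next_id(), gen.next_id()
print(a < b)  # True: IDs stay monotonically increasing per worker
```

Because the timestamp occupies the high bits, IDs remain roughly time-ordered across shards, which preserves sequential-insert behavior for clustered primary keys.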
Deep Pagination Optimization
- Seek Method: WHERE id > last_seen_id LIMIT N.
- Covering Index: Select only IDs in the subquery, then join back.
- Business Limits: Restrict maximum page depth.
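The seek method can be demonstrated with SQLite standing in for MySQL; the SQL shape is the same (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO posts VALUES (?, ?)",
                 [(i, f"post-{i}") for i in range(1, 101)])

def page_after(last_seen_id, n=10):
    # Seek method: an index range scan starting at last_seen_id.
    # Unlike LIMIT offset, n there are no earlier rows to scan and discard.
    return conn.execute(
        "SELECT id, title FROM posts WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, n)).fetchall()

print(page_after(0)[0])    # (1, 'post-1'): first page
print(page_after(50)[0])   # (51, 'post-51'): next page after id 50
```

The caller passes the last id of the previous page, so cost stays constant no matter how deep the pagination goes.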
ORDER BY Optimization
- Avoid SELECT *.
- Use indexes to satisfy sorting.
- Tune sort_buffer_size.
Filesort Methods
- Single-Pass: Retrieves all columns, sorts in memory.
- Double-Pass: Retrieves sort keys and pointers, sorts, then retrieves rows.
Multi-Range Read (MRR)
Optimizes range queries by sorting primary keys retrieved from secondary indexes before accessing the clustered index, converting random I/O to sequential I/O.
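The idea can be sketched with a toy model: primary keys coming off a secondary index arrive in index order, which is effectively random with respect to the clustered index; MRR buffers and sorts them first. Counting page switches is a rough proxy for random I/O:

```python
# Primary keys fetched from a secondary index range scan, in index order:
pks = [88, 3, 85, 7, 81, 2]

def page_switches(keys, rows_per_page=10):
    """Proxy for random I/O: how often successive lookups hit a different page."""
    pages = [k // rows_per_page for k in keys]
    return sum(1 for a, b in zip(pages, pages[1:]) if a != b)

print(page_switches(pks))          # 5: unsorted access ping-pongs between pages
print(page_switches(sorted(pks)))  # 1: sorted access sweeps forward sequentially
```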
Update Statement Execution
- Connector: Handles authentication.
- Parser: Validates syntax.
- Optimizer: Chooses execution plan.
- Executor: Calls engine API.
- Engine: Writes Undo Log, updates Buffer Pool, writes Redo Log (Prepare), writes Binlog, commits Redo Log.
Large Table Operations
- Deletion: Drop indexes, batch delete, recreate indexes.
- Query Optimization: Use appropriate data types, indexing, read/write splitting, or archive cold data.
Replication
- Master: Dumps binlog.
- Slave I/O: Receives binlog to Relay Log.
- Slave SQL: Executes Relay Log.
- Modes: Async, Semi-Sync, Full-Sync.
Replication Lag
Caused by network latency or heavy slave load. Solutions include parallel replication, scaling read nodes, or forcing critical reads to the master.
Slow SQL Troubleshooting
- Hardware: Check CPU, RAM, Disk I/O.
- Architecture: Implement caching, sharding, or read/write splitting.
- Configuration: Tune buffer_pool, connections, and logs.
- SQL: Analyze EXPLAIN, remove functions on indexes, optimize joins.
CPU Spikes
Identify top processes. If MySQL is the cause, check SHOW PROCESSLIST for heavy queries. Optimize SQL or adjust concurrency limits.
WAL, LSN, Checkpoint
- WAL: Write-Ahead Logging ensures data is logged before writing to disk.
- LSN: Log Sequence Number tracks log progress.
- Checkpoint: Marks the point where dirty pages are flushed to disk, allowing old logs to be overwritten.
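Crash recovery follows from these three pieces: replay the redo records whose LSN lies beyond the last checkpoint. This is a toy model; the record format is invented:

```python
# Redo records as (lsn, page_id, new_value); `disk` is the on-disk page state.
redo_log = [(1, "p1", "a"), (2, "p2", "b"), (3, "p1", "c"), (4, "p3", "d")]
checkpoint_lsn = 2          # pages already reflect every change with lsn <= 2
disk = {"p1": "a", "p2": "b"}

def recover(pages, log, ckpt):
    # WAL guarantees every change reached the log before the data pages,
    # so replaying from the checkpoint reconstructs the post-crash state.
    for lsn, page, value in log:
        if lsn > ckpt:
            pages[page] = value
    return pages

print(recover(dict(disk), redo_log, checkpoint_lsn))
# {'p1': 'c', 'p2': 'b', 'p3': 'd'}
```

Advancing the checkpoint is what allows the circular redo log files to be safely overwritten.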
Migration to Sharded Architecture
- Offline: Downtime required to export/import data.
- Dual-Write: Write to both old and new systems, sync data, verify consistency, then switch read traffic.
Dynamic Scaling
Pre-allocate sufficient shards (e.g., 32 DBs * 32 Tables) to accommodate future growth without immediate re-sharding.
Log Types
- Redo Log: Crash recovery (Engine level).
- Undo Log: Rollback and MVCC (Engine level).
- Binlog: Replication and PITR (Server level).
- Relay Log: Temporary storage for replication.
- Slow Query Log: Tracks long-running queries.
- General Log: Tracks all connections and queries.
- Error Log: Records startup/shutdown and critical errors.
InnoDB Engine Internals
Logical Storage
- Tablespace: Contains data and indexes.
- Segment: Data, Index, or Rollback segments.
- Extent: 1MB unit consisting of 64 pages.
- Page: 16KB minimum storage unit.
- Row: Basic data unit.
Memory Structure
- Buffer Pool: Caches data and indexes.
- Change Buffer: Buffers changes to non-unique secondary indexes.
- Log Buffer: Stages redo logs before flushing.
- Adaptive Hash Index: Accelerates lookups for hot data.