Internal Storage Architecture of TiDB
Key-Value Model
TiDB stores data using a key-value structure where both key and value are raw byte arrays. Conceptually, TiKV behaves like a massive ordered map: entries are sorted by the binary representation of keys, enabling efficient range scans and seeks. Two essential points define this model:
- Data exists as key-value pairs within a single logical map.
- Keys are ordered lexicographically, allowing iteration from any starting point via sequential access.
The storage abstraction is independent of SQL table definitions.
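To make the abstraction concrete, here is a minimal sketch in Go of the ordered-map interface described above. The names (Storage, Iterator, Seek) are illustrative and not TiKV's actual API; the point is that keys and values are raw byte slices, and that a seek positions an iterator at the first key not less than a given key, from which an ordered scan can proceed.

package kv

// Iterator walks entries in key order starting from a seek position.
type Iterator interface {
	Valid() bool
	Key() []byte
	Value() []byte
	Next() error
}

// Storage is an illustrative ordered key-value abstraction.
type Storage interface {
	// Get returns the value stored under key, or nil if absent.
	Get(key []byte) ([]byte, error)
	// Put writes or overwrites the value under key.
	Put(key, value []byte) error
	// Seek returns an iterator positioned at the first key >= key,
	// enabling range scans from any starting point.
	Seek(key []byte) (Iterator, error)
}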
Local Persistence with RocksDB
To handle on-disk durability, TiKV delegates actual persistence to RocksDB rather than implementing its own disk-writing layer. RocksDB operates as a high-performance embedded key-value store, satisfying TiKV's requirements for efficient local storage. It effectively acts as a standalone ordered map on a single machine, abstracting complex file management and optimization tasks.
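As an illustration only, the snippet below opens a local RocksDB instance and performs a put and a get through the tecbot/gorocksdb Go binding. TiKV itself is written in Rust and embeds RocksDB through its own bindings, and the binding API shown here may vary by version; the example merely shows the single-machine ordered-map role that RocksDB plays.

package main

import (
	"fmt"

	"github.com/tecbot/gorocksdb"
)

func main() {
	// Open (or create) a local RocksDB database on disk.
	opts := gorocksdb.NewDefaultOptions()
	opts.SetCreateIfMissing(true)
	db, err := gorocksdb.OpenDb(opts, "/tmp/rocksdb-demo")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	wo := gorocksdb.NewDefaultWriteOptions()
	ro := gorocksdb.NewDefaultReadOptions()

	// Put and Get operate on raw byte slices, mirroring the ordered-map model.
	if err := db.Put(wo, []byte("user/1"), []byte("alice")); err != nil {
		panic(err)
	}
	v, err := db.Get(ro, []byte("user/1"))
	if err != nil {
		panic(err)
	}
	defer v.Free()
	fmt.Println(string(v.Data()))
}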
Distributed Consistency via Raft
Local reliability is insufficient for fault tolerance; data must survive node failures. Raft is a consensus algorithm that provides fault-tolerance guarantees equivalent to Paxos, ensuring safe replication across nodes. Core features include:
- Leader election
- Configuration changes
- Log replication
In TiKV, every change is written as a Raft log entry, which the Leader replicates to the other replicas in the group; the entry is committed once a majority acknowledge it. Writes therefore go through Raft first and are applied to RocksDB only after commit. This design yields a distributed key-value store resilient to single-node crashes.
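The sketch below illustrates this ordering under stated assumptions: raftGroup, localEngine, and encodePut are hypothetical names, and Propose is assumed to block until the entry has been committed by a majority. It is not TiKV's code, only the shape of the write path.

package raftkv

import "encoding/binary"

// raftGroup and localEngine are illustrative interfaces, not TiKV internals.
type raftGroup interface {
	// Propose appends the entry to the Leader's log and returns once a
	// majority of replicas have acknowledged (committed) it.
	Propose(entry []byte) error
}

type localEngine interface {
	Put(key, value []byte) error
}

// encodePut serializes a mutation with a toy layout: 4-byte key length,
// then the key, then the value.
func encodePut(key, value []byte) []byte {
	buf := make([]byte, 4+len(key)+len(value))
	binary.BigEndian.PutUint32(buf, uint32(len(key)))
	copy(buf[4:], key)
	copy(buf[4+len(key):], value)
	return buf
}

// write goes through Raft first; only after the log entry is committed is
// the mutation applied to the local RocksDB-backed engine.
func write(rg raftGroup, db localEngine, key, value []byte) error {
	if err := rg.Propose(encodePut(key, value)); err != nil {
		return err // not committed: the write must not become visible
	}
	return db.Put(key, value)
}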
Data Sharding Through Regions
A single ordered map cannot scale horizontally without partitioning. TiKV splits the entire key space into contiguous segments called Regions, each covering a half-open interval [startKey, endKey). Regions are capped at a configurable size (default 96MB) to balance load and manageability.
Two major actions follow partitioning:
- Distribution: Regions are spread evenly across all nodes, achieving both capacity scaling and balanced workload. A control component (the Placement Driver, PD) tracks Region placement, enabling clients to locate the Region and node hosting any key.
- Replication: Each Region is replicated (default three replicas) across distinct nodes, forming a Raft group. One replica acts as Leader and by default handles all reads and writes; the others (Followers) replicate log entries from the Leader. If a Follower stops receiving heartbeats from the Leader within the election timeout, it becomes a Candidate and starts a new election.
Organizing data and replication by Region ensures horizontal scalability and resilience against node and disk failures. Excessive Regions, however, increase coordination overhead due to heartbeat traffic to PD.
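The following hypothetical routing sketch shows how a key can be mapped to its Region given the half-open [startKey, endKey) intervals. In TiDB this lookup is served by the placement component rather than computed client-side like this, and the Region and locate names are purely illustrative.

package route

import (
	"bytes"
	"sort"
)

// Region describes a contiguous key range [StartKey, EndKey).
type Region struct {
	StartKey []byte // inclusive
	EndKey   []byte // exclusive; empty means "to the end of the key space"
}

// locate returns the index of the Region containing key, assuming regions
// are sorted by StartKey and cover the key space without gaps.
func locate(regions []Region, key []byte) int {
	// First Region whose EndKey is strictly greater than key.
	return sort.Search(len(regions), func(i int) bool {
		end := regions[i].EndKey
		return len(end) == 0 || bytes.Compare(key, end) < 0
	})
}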
Multi-Version Concurrency Control (MVCC)
To avoid locking overhead under concurrent updates, TiKV implements MVCC by appending version stamps to keys. Without MVCC:
K1 -> V
K2 -> V
...
Kn -> V
With MVCC, keys embed version numbers, sorted descending by version:
K1_v3 -> V
K1_v2 -> V
K1_v1 -> V
...
K2_v4 -> V
K2_v3 -> V
K2_v2 -> V
K2_v1 -> V
...
Kn_v2 -> V
Kn_v1 -> V
Fetching a specific version involves constructing the composite key K_version and performing a seek to the first entry not less than it.
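Here is a minimal sketch of such a composite key, assuming the versioned-key scheme described above: the version is appended to the user key with its bits inverted, so newer versions sort first. TiKV's actual byte layout differs; this only illustrates why a single seek at K_version lands on the newest version visible at that snapshot.

package mvcc

import "encoding/binary"

// encodeKey builds the composite key K_version.
func encodeKey(key []byte, version uint64) []byte {
	buf := make([]byte, len(key)+8)
	copy(buf, key)
	// Invert the version so that larger (newer) versions produce smaller
	// suffixes and therefore sort first under the key's prefix.
	binary.BigEndian.PutUint64(buf[len(key):], ^version)
	return buf
}

// To read key at snapshot version v, seek to the first entry >= encodeKey(key, v);
// because versions are stored newest-first, that entry (if it shares the key
// prefix) is the latest version of key that is <= v.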
Transaction Handling via Optimistic Locking
TiKV adopts the Percolator transaction model with optimizations. It uses optimistic concurrency control: writes proceed without conflict checks until commit time; if a conflict is detected at commit, the transaction is aborted and retried. This excels when write contention is low (e.g., random updates spread over a large dataset), but suffers under high-contention scenarios such as frequent increments of the same counter.
Optimistic locking avoids upfront locks, improving throughput when conflicts are rare. In contrast, pessimistic locking acquires locks before modification, so conflicts surface as lock waits rather than commit-time aborts, at the cost of reduced performance under heavy lock contention.
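To illustrate the retry behavior, here is a simplified optimistic-transaction loop. Txn, Commit, and ErrConflict are hypothetical names rather than TiKV's client API, and the sketch omits Percolator's two-phase commit details: writes are buffered locally, validated only at commit, and the whole body is re-run when a conflict is reported.

package txn

import "errors"

// ErrConflict signals that another transaction wrote the same keys first.
var ErrConflict = errors.New("write conflict detected at commit")

type Txn interface {
	Set(key, value []byte) // buffer a write locally; no locks are taken
	Commit() error         // validate buffered writes; may return ErrConflict
}

// runOptimistic re-runs the transaction body whenever commit hits a conflict.
func runOptimistic(begin func() Txn, body func(Txn) error, maxRetries int) error {
	var err error
	for i := 0; i < maxRetries; i++ {
		t := begin()
		if err = body(t); err != nil {
			return err
		}
		if err = t.Commit(); !errors.Is(err, ErrConflict) {
			return err // nil on success, or a non-retryable error
		}
		// Conflict: retry the whole transaction from the beginning.
	}
	return err
}

Under low contention the loop almost always succeeds on the first pass, which is why this scheme outperforms lock-first approaches for sparse, random updates; under heavy contention the repeated retries are exactly the cost described above.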
Structural Overview
TiKV serves as TiDB’s distributed storage engine. Data resides in RocksDB, partitioned into Regions managed independently. Each Region maintains multiple replicas across nodes, coordinated through Raft. The Leader replica processes all read/write operations, while Followers maintain consistency via replicated logs.