Understanding the HBase Data Model
Core Concepts Overview
HBase's data model is a sparse, distributed, multidimensional sorted map.
Structurally:
Table → Row → Column Family → Column Qualifier → Timestamp → Value
It can be understood as a five-dimensional key-value mapping:
{Table, RowKey, ColumnFamily, ColumnQualifier, Timestamp} → Value
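To make the five-part key concrete, the mapping can be sketched as a flat dictionary keyed by the five coordinates. This is a toy in-memory model, not the HBase client API; the names `cells`, `put`, and `get_latest` are hypothetical and chosen only for illustration.

```python
# Toy in-memory model of the five-part key; this is not the HBase client API.
cells = {}

def put(table, row, family, qualifier, ts, value):
    """Store one cell version under its full five-part coordinate."""
    cells[(table, row, family, qualifier, ts)] = value

def get_latest(table, row, family, qualifier):
    """Return the value with the highest timestamp, or None if the cell is absent."""
    versions = [
        (key[4], value)
        for key, value in cells.items()
        if key[:4] == (table, row, family, qualifier)
    ]
    if not versions:
        return None
    return max(versions, key=lambda pair: pair[0])[1]

# Two versions of the same cell, distinguished only by timestamp.
put("students", "001", "info", "name", 1730500000000, b"Alicia")
put("students", "001", "info", "name", 1730600000000, b"Alice")
latest = get_latest("students", "001", "info", "name")
```

Values are written as bytes here because HBase stores values as uninterpreted byte arrays.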
Key Components
1. Table
- Similar to tables in relational databases, but without a fixed schema (no predefined columns).
- Different tables can hold data with different structures.
- A table consists of multiple rows, each uniquely identified by a RowKey.
2. Row
- Each row has a RowKey, sorted in lexicographical order.
- RowKeys are unique within a table.
- RowKey is central to HBase queries; data is read by a single RowKey, by a range of RowKeys, or by a full table scan.
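Because RowKeys are kept in lexicographical (byte) order, a range read amounts to a binary search for the two boundaries. The sketch below assumes a plain sorted Python list stands in for a table's keys; `scan` is a hypothetical helper, not an HBase call, and it follows the start-inclusive, stop-exclusive convention.

```python
import bisect

def scan(sorted_keys, start, stop):
    """Return the row keys in [start, stop) from an already sorted list."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]

# Byte-wise ordering: b"row10" sorts before b"row2".
row_keys = sorted([b"row1", b"row2", b"row3", b"row10", b"row20"])
in_range = scan(row_keys, b"row1", b"row2")
```

Note that `b"row10"` lands between `b"row1"` and `b"row2"`, which is why numeric key parts are often zero-padded.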
3. Column Family
- Column families must be predefined when designing a table.
- A table can have multiple column families.
- Column families are the physical storage units in HBase; data from the same column family is stored together in HFiles.
- Columns within a column family can be added dynamically.
Example:
Column Family: info
├── name
├── age
└── address
4. Column Qualifier
- The "specific column name" within a column family.
- A combination of column family name and column qualifier uniquely identifies a column.
- Column qualifiers are dynamically variable; HBase does not require them to be predefined.
Example:
info:name
info:email
info:phone
5. Timestamp
- Each cell can have multiple versions of its value, differentiated by timestamps.
- By default the timestamp is the system time in milliseconds at write time, but the client can supply its own value.
- The number of versions to retain can be configured per column family (e.g., keeping only the latest 3 versions).
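The version-retention rule can be illustrated with a small sketch. It assumes a single cell's versions are held in a dictionary from timestamp to value and trims eagerly on every write; real HBase applies the limit when reading and during compaction, so treat `put_version` and its `max_versions` parameter as hypothetical.

```python
import time

def put_version(versions, value, ts=None, max_versions=3):
    """Add a version and keep only the newest max_versions timestamps."""
    if ts is None:
        ts = int(time.time() * 1000)  # default: current time in milliseconds
    versions[ts] = value
    # Sort timestamps newest first and drop everything past the limit.
    for old_ts in sorted(versions, reverse=True)[max_versions:]:
        del versions[old_ts]
    return versions

name_versions = {}
for ts, value in [(1, b"v1"), (2, b"v2"), (3, b"v3"), (4, b"v4")]:
    put_version(name_versions, value, ts=ts)
```

After the fourth write, the oldest version (timestamp 1) is no longer retained.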
6. Cell
- The smallest data storage unit.
- Uniquely identified by {RowKey, ColumnFamily, ColumnQualifier, Timestamp}.
- Stores the actual Value.
Data Model Example
| RowKey | info:name | info:email | score:math | score:english |
|---|---|---|---|---|
| 001 | Alice | a@xx.com | 90 | 85 |
| 002 | Bob | b@xx.com | 78 | 92 |
Here:
- Table name: students
- Column families: info, score
- Column qualifiers: name, email, math, english
Model Characteristics
| Characteristic | Description |
|---|---|
| Sparsity | Non-existent columns do not occupy storage (ideal for semi-structured data) |
| Multi-versioning | Each cell can retain multiple versions of its value |
| Ordered | Rows are stored sorted by RowKey |
| Distributed | Automatically partitioned by RowKey range (Region) and distributed across the cluster |
| Column-oriented | Data from the same column family is physically stored together |
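Partitioning by RowKey range can be pictured as a lookup against a sorted list of split points. The sketch below uses made-up split keys and a hypothetical `region_for` helper; it only illustrates the range idea and does not reflect how HBase clients actually locate Regions.

```python
import bisect

# Hypothetical split points. Region 0 holds keys below b"g", region 1 holds
# [b"g", b"n"), region 2 holds [b"n", b"t"), region 3 holds b"t" and above.
split_keys = [b"g", b"n", b"t"]

def region_for(row_key, splits=split_keys):
    """Return the index of the region whose key range contains row_key."""
    return bisect.bisect_right(splits, row_key)

assignments = {key: region_for(key) for key in [b"alice", b"mike", b"n", b"zoe"]}
```

Keys that sort close together map to the same region, which is the root of the hotspotting problem discussed under RowKey design.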
RowKey Design Recommendations
RowKey is critical for HBase performance; avoid hotspotting:
- Avoid monotonically increasing keys such as auto-increment IDs; consecutive keys sort next to each other, so all new writes land in the same Region (hotspotting).
- Use strategies like hash prefix, reverse string, or reversed timestamp.
- Example: userID_hash_timestamp
- Example: reverse(phoneNumber)
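The three strategies can be sketched as small key-building helpers. The function names, the bucket count of 16, the use of SHA-256, and the `bucket_userID_timestamp` layout are all assumptions made for illustration; a real design would pick these to match its own access patterns.

```python
import hashlib

MAX_LONG = 2**63 - 1  # largest signed 64-bit value, used to invert timestamps

def salted_key(user_id, ts_ms, buckets=16):
    """Prefix the key with a hash-derived bucket so nearby IDs spread out."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}_{user_id}_{ts_ms}"

def reversed_key(phone_number):
    """Reverse the string so the fast-changing trailing digits lead the key."""
    return phone_number[::-1]

def reversed_timestamp(ts_ms):
    """Invert the timestamp so newer entries sort before older ones."""
    return MAX_LONG - ts_ms

key = salted_key("user42", 1730600000000)
```

The hash prefix is deterministic, so a reader who knows the user ID can recompute the bucket and still issue a point lookup.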
Type Codes in Cells
| Type | Meaning |
|---|---|
| Put | Normal write (insert/update) |
| Delete | Delete a single version |
| DeleteColumn | Delete all versions of a column |
| DeleteFamily | Delete all columns (and all their versions) in a column family |
Note: In HBase, a delete is actually writing a cell with a Delete marker; physical cleanup occurs during Major Compaction.
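The effect of delete markers can be sketched with a toy list of cells from one column family, each written as `(qualifier, timestamp, type, value)`. This layout and the helpers `live_puts` and `major_compact` are hypothetical simplifications, not HBase's on-disk format. The sketch assumes a Delete marker masks the Put with the same timestamp, while DeleteColumn and DeleteFamily mask Puts at or before the marker's timestamp.

```python
def live_puts(cells):
    """Return the Put cells that no delete marker masks."""
    result = []
    for qualifier, ts, kind, value in cells:
        if kind != "Put":
            continue
        masked = any(
            (k == "Delete" and dq == qualifier and dts == ts)
            or (k == "DeleteColumn" and dq == qualifier and dts >= ts)
            or (k == "DeleteFamily" and dts >= ts)
            for dq, dts, k, _ in cells
        )
        if not masked:
            result.append((qualifier, ts, kind, value))
    return result

def major_compact(cells):
    """Physically drop the markers together with the cells they mask."""
    return live_puts(cells)

cells = [
    ("name", 1, "Put", b"Alicia"),
    ("name", 2, "Put", b"Alice"),
    ("name", 2, "Delete", None),   # masks only the version at ts=2
    ("age", 1, "Put", b"20"),
]
remaining = major_compact(cells)
```

Deleting only the newest version makes the older value `b"Alicia"` visible again, which is a common surprise with single-version deletes.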
Store: The Physical Storage Unit
In HBase, a Store is a core component within a Region that manages all data for a column family. It handles caching, persistence, and merging.
Store Composition
Each Store consists of two parts:
- MemStore (in-memory buffer): Temporary storage for writes. Data is written to the MemStore first; when it reaches a threshold (default 128 MB), it is flushed to disk as an HFile.
- StoreFile (disk file collection): HFiles generated from MemStore flushes. Data is persisted on HDFS for reliability.
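The relationship between the two parts can be sketched as a small class. This is a toy model: the `Store` class, its attribute names, and the byte-count size estimate are assumptions, and the "files" are just sorted in-memory lists rather than HFiles on HDFS.

```python
class Store:
    """Toy Store: an in-memory buffer plus a list of flushed, sorted files."""

    def __init__(self, flush_threshold_bytes=128 * 1024 * 1024):
        self.flush_threshold = flush_threshold_bytes
        self.memstore = []        # pending (key, value) writes
        self.memstore_size = 0    # rough size estimate in bytes
        self.store_files = []     # each flush appends one sorted list

    def put(self, key, value):
        self.memstore.append((key, value))
        self.memstore_size += len(key) + len(value)
        if self.memstore_size >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the buffered entries out as one sorted file and reset the buffer."""
        if self.memstore:
            self.store_files.append(sorted(self.memstore))
        self.memstore = []
        self.memstore_size = 0

# A tiny threshold so the second write triggers a flush.
store = Store(flush_threshold_bytes=10)
store.put(b"row2", b"bb")
store.put(b"row1", b"aa")
```

Each flush produces a new file, which is why a busy Store accumulates many small files until compaction merges them.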
Relationship with Other Components
- Region: A Region contains multiple Stores, one per column family. For example, if a table user has two column families, info and address, each Region has two Stores.
- Column Family: The Store is the physical carrier of a column family. The column family is a logical grouping; the Store is the physical storage unit.
Store Workflow
- Write Path: Each write is recorded in the write-ahead log (HLog) for durability and applied to the MemStore. When the MemStore reaches its flush threshold, it is flushed to disk as a new HFile.
- Read Path: Check MemStore first (fast), then scan StoreFiles (HFiles). Merge results from both.
- Compaction:
- Minor Compaction: Merge several small HFiles into a larger one; does not remove expired/deleted data.
- Major Compaction: Merge all HFiles into one; remove expired and marked-deleted data permanently.
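The read path and the two compaction types can be sketched with dictionaries standing in for the MemStore and the HFiles. Everything here is a simplification under stated assumptions: one version per key, a `TOMBSTONE` sentinel for delete markers, and a minor-compaction policy that just merges the newest `count` files. The real file-selection policy and multi-version merge are more involved.

```python
TOMBSTONE = object()  # hypothetical sentinel standing in for a delete marker

def read(key, memstore, store_files):
    """Check the MemStore first, then the files from newest to oldest."""
    for layer in [memstore, *reversed(store_files)]:
        if key in layer:
            value = layer[key]
            return None if value is TOMBSTONE else value
    return None

def minor_compact(store_files, count=2):
    """Merge the newest `count` files into one; delete markers are kept."""
    untouched, chosen = store_files[:-count], store_files[-count:]
    merged = {}
    for hfile in chosen:          # oldest first, so newer values win
        merged.update(hfile)
    return untouched + [merged]

def major_compact(store_files):
    """Merge every file into one and drop deleted entries for good."""
    merged = {}
    for hfile in store_files:
        merged.update(hfile)
    return [{k: v for k, v in merged.items() if v is not TOMBSTONE}]

memstore = {b"a": b"9"}
files = [{b"a": b"1", b"b": b"2"}, {b"b": TOMBSTONE}, {b"c": b"3"}]
after_minor = minor_compact(files)
after_major = major_compact(files)
```

The marker for `b"b"` survives the minor compaction because an older file outside the merged set may still hold the value it masks.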
Summary Table
| Layer | Role | Characteristics |
|---|---|---|
| Table | Logical collection of data | No fixed schema |
| Row | Uniquely identified by RowKey | Sorted lexicographically by RowKey |
| Column Family | Logical grouping of columns | Predefined |
| Column Qualifier | Specific column name | Dynamically added |
| Timestamp | Version differentiation | Supports multi-versioning |
| Value | Actual data | Stored as bytes |
Example: Cell Storage Hierarchy
Table
└── RowKey: user1
    ├── ColumnFamily: info
    │   ├── Column: name
    │   │   ├── [ts=1730600000000] → "Alice" ← Cell 1
    │   │   └── [ts=1730500000000] → "Alicia" ← Cell 2
    │   └── Column: age
    │       └── [ts=1730600000000] → "20" ← Cell 3