Understanding the HBase Data Model

Core Concepts Overview

HBase's data model is a sparse, distributed, multidimensional sorted map.

Structurally:

Table → Row → Column Family → Column Qualifier → Timestamp → Value

It can be understood as a five-dimensional key-value mapping:

{Table, RowKey, ColumnFamily, ColumnQualifier, Timestamp} → Value
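
As a rough illustration, the mapping can be sketched as nested Python dictionaries. This is for intuition only: the helper names (new_table, put, get) are hypothetical and are not the HBase client API, and a real table also keeps its keys sorted and spreads them across Regions.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for ONE table:
# row key -> column family -> column qualifier -> timestamp -> value
def new_table():
    return defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

def put(table, row, family, qualifier, ts, value):
    """Store one version of one cell."""
    table[row][family][qualifier][ts] = value

def get(table, row, family, qualifier, ts=None):
    """Return the value at ts, or the newest version when ts is None."""
    versions = table.get(row, {}).get(family, {}).get(qualifier, {})
    if not versions:
        return None
    if ts is None:
        ts = max(versions)  # newest timestamp
    return versions.get(ts)

students = new_table()
put(students, b"001", "info", "name", 1730500000000, b"Alicia")
put(students, b"001", "info", "name", 1730600000000, b"Alice")

print(get(students, b"001", "info", "name"))                 # b'Alice'
print(get(students, b"001", "info", "name", 1730500000000))  # b'Alicia'
```

Each extra key in the path narrows the lookup by one dimension, which is why the full coordinate {Table, RowKey, ColumnFamily, ColumnQualifier, Timestamp} resolves to exactly one value.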

Key Components

1. Table

  • Similar to a table in a relational database, but without a fixed column schema: only column families are declared up front, not individual columns.
  • Different tables can hold data with different structures.
  • A table consists of multiple rows, each uniquely identified by a RowKey.

2. Row

  • Each row has a RowKey, sorted in lexicographical order.
  • RowKeys are unique within a table.
  • RowKey is central to HBase queries: efficient access is either a get by RowKey or a scan over a range of RowKeys, because HBase has no built-in secondary indexes.
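
Because RowKeys are compared as raw bytes, numeric IDs stored as text do not sort numerically. The sketch below shows the effect and a common workaround (zero-padding), plus a range scan over the sorted keys. The scan helper is hypothetical and only mimics an inclusive start and exclusive stop over a sorted list.

```python
import bisect

numeric_ids = [1, 2, 9, 10, 100]

# Byte-wise (lexicographic) order: b"10" sorts before b"2".
raw_keys = sorted(str(n).encode() for n in numeric_ids)
print(raw_keys)     # [b'1', b'10', b'100', b'2', b'9']

# Zero-padding to a fixed width restores numeric order.
padded_keys = sorted(f"{n:05d}".encode() for n in numeric_ids)
print(padded_keys)  # [b'00001', b'00002', b'00009', b'00010', b'00100']

def scan(sorted_keys, start, stop):
    """Return keys in [start, stop) from an already sorted list."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]

print(scan(padded_keys, b"00002", b"00010"))  # [b'00002', b'00009']
```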

3. Column Family

  • Column families must be predefined when designing a table.
  • A table can have multiple column families.
  • Column families are the physical storage units in HBase; data from the same column family is stored together in HFiles.
  • Columns within a column family can be added dynamically.

Example:

Column Family: info
   ├── name
   ├── age
   └── address

4. Column Qualifier

  • The "specific column name" within a column family.
  • A combination of column family name and column qualifier uniquely identifies a column.
  • Column qualifiers are dynamically variable; HBase does not require them to be predefined.

Example:

info:name
info:email
info:phone

5. Timestamp

  • Each cell can have multiple versions of its value, differentiated by timestamps.
  • If the client does not supply a timestamp, the server assigns its current system time in milliseconds; a client can also set the timestamp explicitly.
  • The number of versions to retain can be configured per column family (e.g., keeping only the latest 3 versions).
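
A minimal sketch of version retention for a single cell, assuming a limit of 3 versions. The VersionedCell class is hypothetical. It trims old versions eagerly on every write to keep the example short, whereas HBase applies the configured limit when serving reads and when compacting files.

```python
import time

class VersionedCell:
    """One (row, family, qualifier) coordinate holding several timestamped values."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = {}  # timestamp (ms) -> value

    def put(self, value, ts=None):
        if ts is None:
            ts = int(time.time() * 1000)  # default: current time in milliseconds
        self.versions[ts] = value
        # Keep only the newest max_versions timestamps.
        for old_ts in sorted(self.versions, reverse=True)[self.max_versions:]:
            del self.versions[old_ts]

    def get(self, n=1):
        """Return up to n (timestamp, value) pairs, newest first."""
        return sorted(self.versions.items(), reverse=True)[:n]

cell = VersionedCell(max_versions=3)
for ts in range(1, 6):
    cell.put(f"v{ts}", ts=ts)

print(cell.get())     # [(5, 'v5')]
print(cell.get(n=5))  # [(5, 'v5'), (4, 'v4'), (3, 'v3')]
```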

6. Cell

  • The smallest data storage unit.
  • Uniquely identified by {RowKey, ColumnFamily, ColumnQualifier, Timestamp}.
  • Stores the actual Value.

Data Model Example

RowKey | info:name | info:email | score:math | score:english
-------|-----------|------------|------------|--------------
001    | Alice     | a@xx.com   | 90         | 85
002    | Bob       | b@xx.com   | 78         | 92

Here:

  • Table name: students
  • Column families: info, score
  • Column qualifiers: name, email, math, english
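
The logical table above can be flattened into individual cells, one per (RowKey, ColumnFamily, ColumnQualifier, Timestamp) coordinate. The following sketch does that for the students example. The timestamp is an assumed value, and to_cells and group_by_family are hypothetical helpers, not HBase functions.

```python
rows = {
    "001": {"info:name": "Alice", "info:email": "a@xx.com",
            "score:math": "90", "score:english": "85"},
    "002": {"info:name": "Bob", "info:email": "b@xx.com",
            "score:math": "78", "score:english": "92"},
}

def to_cells(rows, ts):
    """Expand each logical row into (row, family, qualifier, ts, value) tuples."""
    cells = []
    for row_key, columns in rows.items():
        for column, value in columns.items():
            family, qualifier = column.split(":", 1)
            cells.append((row_key, family, qualifier, ts, value))
    return sorted(cells)  # ordered by row, then family, then qualifier

def group_by_family(cells):
    """Cells of one column family are kept together, separate from other families."""
    stores = {}
    for cell in cells:
        stores.setdefault(cell[1], []).append(cell)
    return stores

cells = to_cells(rows, ts=1730600000000)  # assumed write time
print(len(cells))  # 8
print(cells[0])    # ('001', 'info', 'email', 1730600000000, 'a@xx.com')
print({family: len(group) for family, group in group_by_family(cells).items()})
# {'info': 4, 'score': 4}
```

A column that a row never wrote simply produces no tuple, which is what makes the model sparse.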

Model Characteristics

Characteristic   | Description
-----------------|------------
Sparsity         | Non-existent columns do not occupy storage (ideal for semi-structured data)
Multi-versioning | Each cell can retain multiple versions of its value
Ordered          | Rows are stored sorted by RowKey
Distributed      | Automatically partitioned by RowKey range (Region) and distributed across the cluster
Column-oriented  | Data from the same column family is physically stored together

RowKey Design Recommendations

RowKey is critical for HBase performance; avoid hotspotting:

  • Avoid monotonically increasing RowKeys such as auto-increment IDs or raw timestamps; consecutive writes all land in the same Region and create a hotspot.
  • Use strategies like hash prefix, reverse string, or reversed timestamp.
  • Example: userID_hash_timestamp
  • Example: reverse(phoneNumber)
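
The three strategies can be sketched as small key-building helpers. Everything here is an illustrative assumption: the function names, the bucket count of 16, the use of MD5 for the prefix, and the 19-digit padding are choices made for the example, not HBase defaults.

```python
import hashlib

MAX_LONG = 2**63 - 1  # largest signed 64-bit value

def salted_key(user_id: str, buckets: int = 16) -> str:
    """Hash prefix: spread keys over a fixed number of buckets."""
    digest = hashlib.md5(user_id.encode()).digest()
    bucket = digest[0] % buckets
    return f"{bucket:02d}_{user_id}"

def reversed_key(phone_number: str) -> str:
    """Reverse string: move the fast-changing trailing digits to the front."""
    return phone_number[::-1]

def reverse_timestamp(ts_millis: int) -> str:
    """Reversed timestamp: newer events get smaller keys, so they sort first."""
    return f"{MAX_LONG - ts_millis:019d}"

print(reversed_key("13800001234"))                       # 43210000831
print(reverse_timestamp(2000) < reverse_timestamp(1000)) # True
print(salted_key("user42") == salted_key("user42"))      # True (deterministic)
```

One trade-off to keep in mind: a hash prefix spreads writes, but a range scan over the original key order then has to query every bucket.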

Type Codes in Cells

Type         | Meaning
-------------|--------
Put          | Normal write (insert/update)
Delete       | Delete a single version of a column
DeleteColumn | Delete all versions of a column
DeleteFamily | Delete all columns of a column family, in all versions

Note: In HBase, a delete does not remove data in place; it writes a new cell carrying a delete marker (a tombstone). The marked data is physically removed during Major Compaction.
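
A simplified sketch of how delete markers hide data until compaction. The Cell class and the visible and major_compact helpers are hypothetical. The masking rules assumed here are that Delete hides the Put with the same timestamp, while DeleteColumn and DeleteFamily hide Puts at or before the marker's timestamp.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    row: str
    family: str
    qualifier: str
    ts: int
    type: str        # "Put", "Delete", "DeleteColumn" or "DeleteFamily"
    value: str = ""

def is_masked(put, markers):
    """True if any delete marker hides this Put."""
    for m in markers:
        if m.row != put.row or m.family != put.family:
            continue
        if m.type == "DeleteFamily" and put.ts <= m.ts:
            return True
        if m.qualifier != put.qualifier:
            continue
        if m.type == "DeleteColumn" and put.ts <= m.ts:
            return True
        if m.type == "Delete" and put.ts == m.ts:
            return True
    return False

def visible(cells):
    """What a read returns: Puts that no marker hides."""
    markers = [c for c in cells if c.type != "Put"]
    return [c for c in cells if c.type == "Put" and not is_masked(c, markers)]

def major_compact(cells):
    """Rewrite the data without markers and without the Puts they hide."""
    return visible(cells)

cells = [
    Cell("user1", "info", "name", 1, "Put", "Alicia"),
    Cell("user1", "info", "name", 2, "Put", "Alice"),
    Cell("user1", "info", "age", 2, "Put", "20"),
    Cell("user1", "info", "name", 2, "Delete"),  # hides only the ts=2 version
]
print([c.value for c in visible(cells)])  # ['Alicia', '20']

cells.append(Cell("user1", "info", "name", 2, "DeleteColumn"))
print([c.value for c in visible(cells)])  # ['20']
print(len(cells), len(major_compact(cells)))  # 5 1
```

Until major_compact runs, all five cells still occupy storage; only afterwards does the data shrink to the single live cell.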

Store: The Physical Storage Unit

In HBase, a Store is a core component within a Region that manages all data for a column family. It handles caching, persistence, and merging.

Store Composition

Each Store consists of two parts:

  1. MemStore (in-memory buffer): Temporary storage for writes. Data is written to the MemStore first; when it reaches a threshold (128 MB by default), it is flushed to disk as an HFile.
  2. StoreFile (disk file collection): HFiles generated from MemStore flushes. Data is persisted on HDFS for reliability.

Relationship with Other Components

  • Region: A Region contains multiple Stores, one per column family. For example, if a table user has two column families info and address, each Region has two Stores.
  • Column Family: The Store is the physical carrier of a column family. The column family is a logical grouping; the Store is the physical storage unit.

Store Workflow

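  1. Write Path: Each write is appended to the WAL (write-ahead log, called HLog in older versions) for durability and applied to the MemStore. When the MemStore reaches its threshold, it is flushed to disk as a new HFile.
  2. Read Path: A read consults the MemStore (in memory, so fast) and the StoreFiles (HFiles), then merges the results from both so that the newest version is returned.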
  3. Compaction:
    • Minor Compaction: Merge several small HFiles into a larger one; does not remove expired/deleted data.
    • Major Compaction: Merge all HFiles into one; remove expired and marked-deleted data permanently.
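
The workflow above can be sketched as a toy in-memory Store. This is a simplified model under stated assumptions: the class and method names are hypothetical, the flush threshold counts cells rather than bytes, the "files" are plain dictionaries instead of HFiles on HDFS, and compaction only merges files (delete markers and expiry are left out).

```python
class Store:
    """Toy model of one Store: a WAL, a MemStore, and a list of flushed files."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # cell count; stand-in for 128 MB
        self.wal = []          # append-only log of every write
        self.memstore = {}     # (key, ts) -> value, in memory
        self.store_files = []  # flushed snapshots, oldest first

    def put(self, key, ts, value):
        self.wal.append((key, ts, value))  # record for durability
        self.memstore[(key, ts)] = value   # buffer in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the MemStore out as a new sorted, immutable file."""
        if self.memstore:
            self.store_files.append(dict(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, key):
        """Merge MemStore and all files; return the newest version of key."""
        candidates = [(ts, v) for (k, ts), v in self.memstore.items() if k == key]
        for store_file in self.store_files:
            candidates += [(ts, v) for (k, ts), v in store_file.items() if k == key]
        return max(candidates, key=lambda c: c[0])[1] if candidates else None

    def compact(self):
        """Merge every file into a single sorted file."""
        merged = {}
        for store_file in self.store_files:
            merged.update(store_file)
        self.store_files = [dict(sorted(merged.items()))] if merged else []

store = Store(flush_threshold=2)
store.put("r1", 1, "a")
store.put("r2", 1, "b")      # threshold reached: first flush
store.put("r1", 2, "c")      # newer version, still in the MemStore
print(store.get("r1"))       # c
store.put("r3", 1, "d")      # second flush
print(len(store.store_files))  # 2
store.compact()
print(len(store.store_files), store.get("r1"))  # 1 c
```

Reads get slower as flushed files accumulate, since each one may hold a version of the requested key; merging them back into fewer files is what compaction buys.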

Summary Table

Layer            | Role                          | Characteristics
-----------------|-------------------------------|----------------
Table            | Logical collection of data    | No fixed schema
Row              | Uniquely identified by RowKey | Ordered by RowKey
Column Family    | Logical grouping of columns   | Predefined
Column Qualifier | Specific column name          | Dynamically added
Timestamp        | Version differentiation       | Supports multi-versioning
Value            | Actual data                   | Stored as bytes

Example: Cell Storage Hierarchy

Table
 └── RowKey: user1
       ├── ColumnFamily: info
       │      ├── Column: name
       │      │      ├── [ts=1730600000000] → "Alice"    ← Cell 1
       │      │      └── [ts=1730500000000] → "Alicia"   ← Cell 2
       │      └── Column: age
       │             └── [ts=1730600000000] → "20"       ← Cell 3
