Fading Coder

One Final Commit for the Last Sprint


Choosing Between Append-Optimized Row and Column Storage in Greenplum

Tech · May 13

Storage Architecture Fundamentals

In the Greenplum (MPP PostgreSQL) ecosystem, Append-Optimized Row (AORO) and Append-Optimized Column (AOCO) tables represent specialized storage formats designed for analytical workloads. The core distinction lies in their physical organization:

  • AORO: Data is stored sequentially by row, optimized for writing full records.
  • AOCO: Data is stored sequentially by column, optimized for reading specific fields.

Unlike standard Heap tables (ordinary PostgreSQL tables with full MVCC support), the Append-Optimized formats are not designed for in-place updates: they are tailored for high-volume, append-mostly batch ingestion.

Physical Layout Comparison

AORO Structure

Complete rows are written contiguously to disk:

[ Record 1 (Col1, Col2, Col3) ]
[ Record 2 (Col1, Col2, Col3) ]
[ Record 3 (Col1, Col2, Col3) ]

AOCO Structure

Each column is stored in its own separate physical file or segment:

Column_1_File: Val1, Val2, Val3, Val4
Column_2_File: Val1, Val2, Val3, Val4
Column_3_File: Val1, Val2, Val3, Val4

Detailed Use Case Analysis

Standard Heap Tables

  • Role: Ideal for OLTP operations, dimension tables, and small lookup tables.
  • Scale: Best for datasets under 10GB or fewer than 1 million rows.
  • Behavior: High-frequency single-row updates/deletes, indexed point lookups, and concurrent small transactions.
  • Examples: User dimensions, configuration settings, product catalogs.
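For reference, a heap table needs no storage options, since heap is Greenplum's default format; a minimal dimension-table sketch (table and column names are illustrative):

```sql
-- Heap is Greenplum's default storage; no WITH clause needed.
-- Suited to small, frequently updated dimension tables.
CREATE TABLE dim_product (
    product_id   bigint PRIMARY KEY,
    product_name text,
    category     text,
    updated_at   timestamp DEFAULT now()
) DISTRIBUTED BY (product_id);

-- Single-row updates are cheap on heap, unlike on AO tables.
UPDATE dim_product SET category = 'electronics' WHERE product_id = 42;
```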

AORO (Row-Oriented)

  • Role: Storing large volumes of detailed records where full rows are often retrieved.
  • Scale: Hundreds of millions to billions of rows.
  • Behavior: Batch ETL inserts, minimal updates. Queries often involve SELECT * or retrieving most columns.
  • Examples: Application logs, raw event streams, audit trails.
  • Note: Adding a column with a default value requires a full table rewrite, which is expensive on large datasets.

AOCO (Column-Oriented)

  • Role: The powerhouse for Data Warehousing fact tables and OLAP analytics.
  • Scale: Terabyte or Petabyte scale, often with very wide schemas (100+ columns).
  • Behavior: Queries touch only a handful of columns (e.g., aggregations on specific metrics). Writes are bulk appends.
  • Examples: Financial transaction facts, user behavior analysis, wide feature tables for ML.
  • Note: Supports adding nullable columns instantly without rewriting data.
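The two schema-evolution notes above can be illustrated with ALTER TABLE, using the tables defined in the creation examples further down:

```sql
-- On an AOCO table: a nullable column only creates a new (empty)
-- column file, so the statement completes almost instantly.
ALTER TABLE sales_fact ADD COLUMN discount numeric;

-- On an AORO table: a column with a non-null default must be
-- materialized into every existing row, rewriting the whole table.
ALTER TABLE raw_metrics ADD COLUMN schema_version int DEFAULT 1;
```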

Decision Matrix for Workloads

When to choose AORO (Row)

  • Full Row Access: You frequently need every column in the row (e.g., exporting raw logs).
  • Moderate Width: Tables with fewer than 100 columns.
  • Write Heavy: The workload is primarily inserting data with few updates.
  • Example Query:
SELECT * FROM server_logs WHERE log_id = 12345;

When to choose AOCO (Column)

  • Massive Scale: Tables ranging from Terabytes to Petabytes.
  • Wide Tables: Schemas exceeding 100 columns (e.g., user profiles, tags).
  • Projected Reads: Queries access only a subset of columns.
  • Heavy Aggregation: Frequent use of SUM, COUNT, AVG, GROUP BY.
  • Compression Needs: Significant storage savings using ZSTD or RLE encoding.
  • Example Query:
SELECT SUM(revenue), AVG(latency) 
FROM network_stats 
WHERE capture_date BETWEEN '2023-01-01' AND '2023-01-31';
-- Only reads 3 columns out of 150

Performance Case Study

Scenario: 1 Billion rows, 120 columns, querying 3 columns.
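One way to reproduce such a comparison on your own cluster is to build both variants of the same wide table and compare plans and timings. A sketch, assuming a wide source table `wide_source` already exists (names are illustrative; actual numbers depend entirely on your hardware and data):

```sql
-- Same data, two storage formats.
CREATE TABLE facts_row WITH (appendonly=true, orientation=row)
    AS SELECT * FROM wide_source;
CREATE TABLE facts_col WITH (appendonly=true, orientation=column)
    AS SELECT * FROM wide_source;

-- Compare I/O: the column version only has to scan the files
-- for the 3 referenced columns, not all 120.
EXPLAIN ANALYZE
SELECT SUM(revenue), AVG(latency)
FROM facts_col
WHERE capture_date BETWEEN '2023-01-01' AND '2023-01-31';
```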

Creation Syntax Examples

Creating an AORO Table

CREATE TABLE raw_metrics (
    event_time timestamp,
    source_ip inet,
    payload text
) 
WITH (appendonly=true, orientation=row);

Creating an AOCO Table

CREATE TABLE sales_fact (
    transaction_id bigint,
    amount numeric,
    region text
) 
WITH (appendonly=true, orientation=column);

Creating AOCO with Compression

CREATE TABLE user_analytics (
    user_id bigint,
    session_count int,
    last_active date
) 
WITH (
    appendonly=true, 
    orientation=column, 
    compresstype=zstd, 
    compresslevel=7
);
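After loading data, it is worth verifying that the chosen compresstype actually pays off. Greenplum provides get_ao_compression_ratio() for append-optimized tables; combined with pg_size_pretty this gives a quick check on the table from the example above:

```sql
-- Effective compression ratio of the AO table
-- (e.g. 5.21 means the data compresses roughly 5x).
SELECT get_ao_compression_ratio('user_analytics');

-- Total on-disk size, including auxiliary relations.
SELECT pg_size_pretty(pg_total_relation_size('user_analytics'));
```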

Analyzing Database Workload Patterns

To determine if a table is read-heavy or write-heavy, query the statistics from pg_stat_user_tables.

SELECT
    schemaname,
    relname AS table_name,
    seq_scan + idx_scan AS total_reads,
    n_tup_ins + n_tup_upd + n_tup_del AS total_writes,
    pg_size_pretty(pg_total_relation_size(relid)) AS size
FROM pg_stat_user_tables
ORDER BY total_reads DESC;

Automated Workload Classification

SELECT
    schemaname,
    relname,
    CASE
        WHEN (seq_scan + idx_scan) > 10 * (n_tup_ins + n_tup_upd + n_tup_del) 
            THEN 'Read Intensive'
        WHEN (n_tup_ins + n_tup_upd + n_tup_del) > (seq_scan + idx_scan) 
            THEN 'Write Intensive'
        ELSE 'Mixed Workload'
    END AS workload_profile
FROM pg_stat_user_tables;

Storage Selection Strategy

In short: default to Heap for small, frequently updated dimension and lookup tables; choose AORO for large, append-mostly datasets that are read as full rows; choose AOCO, with compression, for very wide, very large fact tables queried a few columns at a time. Use the workload statistics above to validate the choice rather than guessing.
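Once the workload profile suggests a different format, an existing table can be migrated by rewriting it with CREATE TABLE ... AS and swapping names. A sketch, with a hypothetical `legacy_events` table (add an explicit DISTRIBUTED BY clause matching the original table's distribution key):

```sql
-- Rewrite a heap (or AORO) table into compressed AOCO.
CREATE TABLE legacy_events_aoco
WITH (appendonly=true, orientation=column,
      compresstype=zstd, compresslevel=5)
AS SELECT * FROM legacy_events;

-- Verify row counts match before renaming, then swap.
ALTER TABLE legacy_events RENAME TO legacy_events_old;
ALTER TABLE legacy_events_aoco RENAME TO legacy_events;
```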

Tags: Greenplum
