Fading Coder

An Old Coder’s Final Dance


Understanding Ceph Placement Group (PG) States and Operations


Placement Groups (PGs) are the core unit of placement, replication, and coordination in Ceph’s RADOS layer. A PG aggregates objects, maps them to OSDs via CRUSH, and coordinates replication, ordering, and recovery. Many pool-level behaviors (replication, failure domains, recovery) are realized through PG logic.

A healthy cluster reports all PGs as active+clean, meaning each PG is available and every replica in its acting set is in sync.
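As a quick orientation, you can ask Ceph how an object maps to a PG and its OSDs (the pool and object names below are illustrative, not from a real cluster):

```shell
# Show the PG and OSD set an object maps to via CRUSH
# (pool "pool_rw" and object "objA" are example names)
ceph osd map pool_rw objA
# Output resembles:
#   osdmap eN pool 'pool_rw' (3) object 'objA' -> pg 3.xxxxxxx (3.0)
#   -> up ([5,21,29], p5) acting ([5,21,29], p5)
```

The "up" set is the CRUSH-computed placement; "acting" is who currently serves the PG. The two diverge during the transitional states described below.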

Below are the major PG states you may encounter, what they mean, how to reproduce them safely in a lab, and how they affect client I/O.

Degraded and Undersized

Description

  • degraded: At least one replica in the PG’s acting set is missing or behind; some objects may lack full replica count.
  • undersized: The number of replicas is below the pool’s size setting.
  • When one OSD in a 3× replicated pool (size=3, min_size=2) fails, the PGs that included that OSD become active+undersized+degraded.

Example (lab)

  1. Stop one OSD
    sudo systemctl stop ceph-osd@1
    
  2. Inspect PG and cluster health
    ceph pg stat
    ceph health detail
    
    Example excerpts:
    • "active+undersized+degraded" on multiple PGs
    • Health warns about degraded data redundancy
  3. Verify client I/O is still available (min_size=2)
    # Write and read a small object in a test pool
    rados -p pool_rw put objA /etc/hosts
    rados -p pool_rw get objA /tmp/objA.copy
    ls -l /etc/hosts /tmp/objA.copy
    

Notes

  • degraded/undersized states are expected during transient OSD failures; client I/O proceeds as long as min_size is satisfied.
  • The "acting [a,b]" set shows which OSDs currently host available replicas.
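To see exactly which PGs are affected, you can filter the PG listing (the state filter to `ceph pg ls` is available on recent releases; column layout of `pg dump` varies slightly by version):

```shell
# List only PGs whose state includes degraded/undersized
ceph pg ls degraded
# Alternative: filter the brief dump by state string
ceph pg dump pgs_brief 2>/dev/null | egrep 'degraded|undersized'
```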

Peered (and I/O gating with min_size)

Description

  • peered: Peering completed (the replicas have negotiated a consistent view of history), but the number of available replicas is below min_size; the PG typically stays inactive and won’t serve client I/O.

Example (lab)

  1. Stop two OSDs in a 3x replicated pool
    sudo systemctl stop ceph-osd@1
    sudo systemctl stop ceph-osd@0
    
  2. Check health
    ceph health detail
    
    Example excerpts:
    • "undersized+degraded+peered"
    • "pg N is stuck inactive … last acting [2]"
  3. Observe I/O blocking with min_size=2
    # This read may hang because only 1 replica is present
    rados -p pool_rw get objA /tmp/objA.blocked
    
  4. Reduce the gating threshold (for lab only)
    ceph osd pool set pool_rw min_size 1
    
  5. Verify I/O proceeds
    rados -p pool_rw get objA /tmp/objA.ok
    ls -l /tmp/objA.ok
    

Notes

  • peered means the replicas have reached agreement, but if available replicas < min_size, the PG won’t serve I/O.
  • Lowering min_size allows service with fewer replicas but increases the risk of data loss; avoid it in production unless it is part of a controlled recovery plan.
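If you performed the lab above, remember to restore the gating threshold afterward; a sketch (pool name follows the example above):

```shell
# Check the pool's current gating threshold
ceph osd pool get pool_rw min_size
# Restore the safer default for a 3x replicated pool after the lab
ceph osd pool set pool_rw min_size 2
```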

Remapped

Description

  • After peering, if the computed up set (the OSDs that CRUSH currently assigns to the PG) differs from the current acting set, the PG appears remapped while replicas migrate to the correct OSDs.
  • This is common after OSDs go in/out, devices are reweighted, or the cluster is expanded.

Example (lab)

  1. Bounce an OSD to force movement
    sudo systemctl stop ceph-osd@x
    sleep 30
    sudo systemctl start ceph-osd@x
    
  2. Inspect PGs
    ceph pg stat
    ceph pg dump | grep remapped
    
    Example excerpts:
    • "active+clean+remapped"
    • or combined with recovery/backfill phases depending on what changed
  3. Client I/O continues normally
    rados -p pool_rw put objB /tmp/test.log
    

Notes

  • remapped indicates the acting set is transitioning toward the up set.
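You can watch that transition directly by comparing the two sets (the state filter and the example PG id are illustrative):

```shell
# List PGs currently remapped; compare the UP and ACTING columns
ceph pg ls remapped
# Or inspect one PG's view in detail (pg id is an example)
ceph pg 3.0 query | grep -A4 '"up"'
```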

Recovery

Description

  • Recovery replays PG logs to resync objects incrementally after a transient failure. If the divergence is within the retained PG log window, recovery uses those logs to bring replicas current.

Example (lab)

  1. Stop an OSD briefly, then restart it
    sudo systemctl stop ceph-osd@x
    sleep 60
    sudo systemctl start ceph-osd@x
    
  2. Watch health details
    ceph health detail
    
    Example excerpts:
    • "active+recovery_wait+degraded" or "active+recovering"

Notes

  • Recovery relies on PG logs. If the number of required entries is within osd_max_pg_log_entries (commonly ~10000 by default, varies by release), Ceph can perform incremental recovery.
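To see your cluster's actual log-retention settings, query the configuration (the `ceph config` form works on Mimic and later; on older releases, query a running daemon on its host instead):

```shell
# Cluster-wide configured values (recent releases)
ceph config get osd osd_max_pg_log_entries
ceph config get osd osd_min_pg_log_entries
# Older releases: ask a running OSD for its effective value
# (run on the host where osd.0 lives)
ceph daemon osd.0 config get osd_max_pg_log_entries
```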

Backfill

Description

  • If the divergence exceeds what the PG log retains, Ceph performs backfill: a full copy of objects from the authoritative replica to the out-of-date OSD.

Example (lab)

  1. Keep an OSD down long enough for logs to trim, then restart
    sudo systemctl stop ceph-osd@x
    # wait longer than your log retention horizon
    sleep 600
    sudo systemctl start ceph-osd@x
    
  2. Check health
    ceph health detail
    
    Example excerpts:
    • "active+undersized+degraded+remapped+backfilling"

Notes

  • backfill_wait/backfilling indicate full synchronization is scheduled or in progress.
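Backfill copies whole objects and can compete with client I/O. If it is impacting latency, it can be throttled or temporarily paused (defaults for these options vary by release):

```shell
# Throttle concurrent backfill/recovery operations per OSD
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
# Or pause data movement entirely during a maintenance window
ceph osd set nobackfill
# ... maintenance ...
ceph osd unset nobackfill
```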

Stale

Description

  • A PG is stale when monitors have not heard from its primary for a while, so the PG’s state is unknown. This can happen if:
    • The primary OSD is down or partitioned.
    • All replicas are down.
    • The primary cannot report PG state due to prolonged network issues.

Example (lab)

  1. Stop all three OSDs for a given PG
    sudo systemctl stop ceph-osd@23
    sudo systemctl stop ceph-osd@24
    sudo systemctl stop ceph-osd@10
    
  2. Inspect health
    ceph health detail
    
    Example excerpts:
    • "stale+undersized+degraded+peered" on the affected PG
  3. Client I/O against affected data blocks
    # I/O will block for objects that map to the stale PG
    ls -l /mnt/
    

Notes

  • When all replicas for a PG are down, the PG becomes stale/inactive and cannot serve I/O.
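Stale PGs are easiest to find with the stuck-PG report, cross-checked against which OSDs are down (the down filter to `ceph osd tree` is available on recent releases):

```shell
# List PGs stuck in the stale state
ceph pg dump_stuck stale
# Cross-check which OSDs are currently down
ceph osd tree down
```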

Inconsistent (Scrub/Deep-Scrub detected)

Description

  • Scrub/Deep-Scrub detects mismatches between replicas (e.g., missing object, size/omap/digest mismatch). The PG is marked inconsistent.

Example (lab)

  1. Simulate corruption on a non-primary replica (lab only)
    # WARNING: destructive; do not run on real data
    sudo rm -rf /var/lib/ceph/osd/ceph-34/current/3.0_head/DIR_0/1000000697c.0000122c__head_19785300__3
    
  2. Trigger a scrub on the PG
    ceph pg scrub 3.0
    
  3. Observe health
    ceph health detail
    
    Example excerpt:
    • "pg 3.0 is active+clean+inconsistent"
  4. Repair the PG
    ceph pg repair 3.0
    
    After repair and subsequent scrub/deep-scrub, health should return to OK if repair succeeds.

Notes

  • Repair fetches correct data from healthy replicas when possible.
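Before repairing, it is worth seeing exactly what the scrub found; the rados CLI can report per-object errors for an inconsistent PG (the pg id is the example one used above):

```shell
# Per-object scrub errors (size/digest/omap mismatches, missing shards)
rados list-inconsistent-obj 3.0 --format=json-pretty
# The summary error counters also appear in:
ceph health detail
```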

Down

Description

  • A PG goes down when peering cannot proceed: the surviving OSD(s) lack the authoritative history for a past interval in which writes may have been acknowledged, so the PG stays offline rather than risk discarding those writes.

Example (lab)

  1. Baseline: a PG with acting set [5,21,29]
    ceph pg dump | grep '^3.7f'
    
  2. Stop one OSD and write new data so the remaining replicas advance
    sudo systemctl stop ceph-osd@21
    # generate some I/O (any read/write workload suffices)
    fio --name=randwrite --filename=/mnt/testfile --rw=randwrite --bs=4M --size=2G --numjobs=8 \
        --iodepth=4 --direct=1 --ioengine=libaio --runtime=60 --time_based
    
  3. Stop the remaining up-to-date replicas
    sudo systemctl stop ceph-osd@29
    sudo systemctl stop ceph-osd@5
    
  4. Start the old OSD (21) alone and check PG
    sudo systemctl start ceph-osd@21
    ceph pg dump | grep '^3.7f'
    # Example: state shows "down"
    
  5. I/O to data in that PG will block
    ls -l /mnt/
    

Notes and remediation

  • Bring back at least one of the newer replicas (5 or 29) so the PG can recover.
  • If an OSD is permanently lost and peering cannot complete, you may need to remove and replace the OSD and handle unfound objects:
    # Last resort: mark specific unfound-lost objects after careful assessment
    ceph pg <pg-id> mark_unfound_lost revert|delete
    
    Use revert to roll unfound objects back to a previous version where one exists (objects with no prior version are deleted); delete discards them outright. Assess data impact before use.
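To see why a PG is held down, query it and look at its peering state; and if a lost OSD is truly unrecoverable, declaring it lost lets peering move on (OSD id 21 follows the example above; `ceph osd lost` is destructive):

```shell
# Look for "blocked_by" and "down_osds_we_would_probe" in the peering state
ceph pg 3.7f query | egrep -A2 'blocked|probe'
# Last resort: declare osd.21 permanently lost so peering can continue
# (acknowledged writes that only osd.21 held may be discarded)
ceph osd lost 21 --yes-i-really-mean-it
```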

Incomplete

Description

  • During peering, if the authoritative log cannot be determined and the available acting set cannot reconstruct a consistent history, the PG becomes incomplete. This often stems from repeated crashes or power loss during peering.

Minimal, riskier approach (data may be lost)

  1. Stop the primary OSD for the incomplete PG
  2. Mark PG complete on disk on that OSD
    ceph-objectstore-tool \
      --data-path /var/lib/ceph/osd/ceph-<id>/ \
      --pgid 1.1 \
      --op mark-complete
    
  3. Start the OSD and allow peering to proceed

Data-conserving approach (recommended)

  • Goal: reconcile replicas by exporting from the most complete copy and importing to others, then marking complete.
  1. Inspect PG details and OSDs’ local views
    ceph pg 7.123 query > /export/pg-7.123-query.txt
    
    # Run on each replica OSD (stop the OSD first before using the tool)
    ceph-objectstore-tool \
      --data-path /var/lib/ceph/osd/ceph-641/ \
      --type bluestore \
      --pgid 7.123 \
      --op info > /export/pg-7.123-info-osd641.txt
    
  2. Compare object inventories to choose the most complete copy
    # List objects per replica
    ceph-objectstore-tool \
      --data-path /var/lib/ceph/osd/ceph-641/ \
      --type bluestore \
      --pgid 7.123 \
      --op list > /export/pg-7.123-objlist-osd-641.txt
    
    wc -l /export/pg-7.123-objlist-osd-*.txt
    diff -u /export/pg-7.123-objlist-osd-641.txt /export/pg-7.123-objlist-osd-<other>.txt
    
  3. Export the best replica as backup
    ceph-objectstore-tool \
      --data-path /var/lib/ceph/osd/ceph-641/ \
      --type bluestore \
      --pgid 7.123 \
      --op export \
      --file /export/pg-7.123-osd-641.obj
    
  4. Import to less-complete replicas (remove PG first if required by your Ceph version)
    # Stop target OSD; remove PG if necessary
    ceph-objectstore-tool \
      --data-path /var/lib/ceph/osd/ceph-57/ \
      --type bluestore \
      --pgid 7.123 \
      --op remove --force
    
    # Import
    ceph-objectstore-tool \
      --data-path /var/lib/ceph/osd/ceph-57/ \
      --type bluestore \
      --pgid 7.123 \
      --op import \
      --file /export/pg-7.123-osd-641.obj
    
  5. Mark complete on all replicas, then start OSDs
    ceph-objectstore-tool \
      --data-path /var/lib/ceph/osd/ceph-57/ \
      --type bluestore \
      --pgid 7.123 \
      --op mark-complete
    
    # Restart the OSD service
    sudo systemctl start ceph-osd@57
    

Cautions

  • Always stop the OSD before using ceph-objectstore-tool.
  • Export every replica first as a safety backup.
  • Never run destructive operations on production without validated recovery plans.
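Per the cautions above, a simple pattern is to export every replica before any destructive step; a sketch using the illustrative OSD ids and paths from the walkthrough:

```shell
# Back up PG 7.123 from each replica OSD before remove/import/mark-complete
# (OSD ids 641 and 57 and /export paths follow the example above)
for id in 641 57; do
  sudo systemctl stop ceph-osd@${id}
  sudo ceph-objectstore-tool \
    --data-path /var/lib/ceph/osd/ceph-${id}/ \
    --type bluestore \
    --pgid 7.123 \
    --op export \
    --file /export/pg-7.123-osd-${id}.obj
done
```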

Additional Notes

  • active+clean is the steady state where all replicas are in sync and the PG can serve I/O.
  • During failures, peering and recovery/backfill orchestrate safe and consistent convergence. Client I/O availability depends on min_size, the number of healthy replicas, and peering success.
  • Health messages such as recovery_wait, backfill_wait, undersized, degraded, peered, remapped, inconsistent, stale, down, and incomplete indicate distinct phases or failure modes; understanding them helps pinpoint actions: wait, add capacity, fix networks, bring OSDs back, or perform operator recovery steps.
