Understanding Ceph Placement Group (PG) States and Operations
Placement Groups (PGs) are the core unit of placement, replication, and coordination in Ceph’s RADOS layer. A PG aggregates objects, maps them to OSDs via CRUSH, and coordinates replication, ordering, and recovery. Many pool-level behaviors (replication, failure domains, recovery) are realized through PG logic.
A healthy cluster reports all PGs as active+clean, meaning each PG is available and every replica in its acting set is in sync.
Below are the major PG states you may encounter, what they mean, how to reproduce them safely in a lab, and how they affect client I/O.
Degraded and Undersized
Description
- degraded: At least one replica in the PG’s acting set is missing or behind; some objects may lack full replica count.
- undersized: The number of replicas is below the pool’s size setting.
- When one OSD in a triplicated pool (size=3, min_size=2) fails, the PGs that included that OSD become active+undersized+degraded.
Example (lab)
- Stop one OSD

```shell
sudo systemctl stop ceph-osd@1
```

- Inspect PG and cluster health

```shell
ceph pg stat
ceph health detail
```

Example excerpts:
- "active+undersized+degraded" on multiple PGs
- Health warns about degraded data redundancy

- Verify client I/O is still available (min_size=2)

```shell
# Write and read a small object in a test pool
rados -p pool_rw put objA /etc/hosts
rados -p pool_rw get objA /tmp/objA.copy
ls -l /etc/hosts /tmp/objA.copy
```
Notes
- degraded/undersized states are expected during transient OSD failures; client I/O proceeds as long as min_size is satisfied.
- The "acting [a,b]" set shows which OSDs currently host available replicas.
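The relationship between the pool's size, the acting set, and these state flags can be sketched in a few lines. This is a conceptual model only; `pg_state_flags` is a hypothetical helper, not part of Ceph's actual state machine:

```python
def pg_state_flags(pool_size: int, acting: list[int]) -> set[str]:
    """Conceptual sketch: derive PG state flags from the acting set.

    A PG is 'undersized' when fewer replicas than the pool's size are
    acting, and such PGs are also 'degraded' until the missing copies
    are restored; otherwise it is 'clean'.
    """
    flags = {"active"}
    if len(acting) < pool_size:
        flags |= {"undersized", "degraded"}
    else:
        flags.add("clean")
    return flags

# A 3x pool with one failed OSD: acting set shrinks from [0, 1, 2] to [0, 2]
print(sorted(pg_state_flags(3, [0, 2])))     # undersized + degraded
print(sorted(pg_state_flags(3, [0, 1, 2])))  # healthy: active + clean
```

The sketch captures why a single OSD failure in a size=3 pool reports both flags at once: the acting set is short (undersized), so some objects are below full replica count (degraded).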
Peered (and I/O gating with min_size)
Description
- peered: Peering completed (the replicas have negotiated a consistent view of history), but the number of available replicas is below min_size; the PG typically stays inactive and won’t serve client I/O.
Example (lab)
- Stop two OSDs in a 3x replicated pool

```shell
sudo systemctl stop ceph-osd@1
sudo systemctl stop ceph-osd@0
```

- Check health

```shell
ceph health detail
```

Example excerpts:
- "undersized+degraded+peered"
- "pg N is stuck inactive … last acting [2]"

- Observe I/O blocking with min_size=2

```shell
# This read may hang because only 1 replica is present
rados -p pool_rw get objA /tmp/objA.blocked
```

- Reduce the gating threshold (for lab only)

```shell
ceph osd pool set pool_rw min_size 1
```

- Verify I/O proceeds

```shell
rados -p pool_rw get objA /tmp/objA.ok
ls -l /tmp/objA.ok
```
Notes
- peered means the replicas have reached agreement, but if available replicas < min_size, the PG won’t serve I/O.
- Lowering min_size allows service with fewer replicas but increases risk; avoid in production unless part of a controlled recovery plan.
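The min_size gate reduces to a simple predicate. The sketch below is a model of the behavior described above, not Ceph's code; `pg_serves_io` is a made-up name:

```python
def pg_serves_io(available_replicas: int, min_size: int) -> bool:
    """Sketch of the min_size gate: a peered PG only goes active and
    serves client I/O once at least min_size replicas are available."""
    return available_replicas >= min_size

# size=3, min_size=2: with one surviving replica the PG stays peered
print(pg_serves_io(1, min_size=2))  # False: I/O blocks
# After `ceph osd pool set pool_rw min_size 1` (lab only):
print(pg_serves_io(1, min_size=1))  # True: I/O proceeds at higher risk
```

This is why the blocked `rados get` in the lab unblocks as soon as min_size is lowered: nothing else about the PG changed, only the gating threshold.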
Remapped
Description
- After peering, if the computed up set (the CRUSH-computed placement for the PG) differs from the current acting set, the PG reports remapped while replicas transition to the correct OSDs.
- This is common after OSDs go in/out, devices are reweighted, or the cluster is expanded.
Example (lab)
- Bounce an OSD to force movement

```shell
sudo systemctl stop ceph-osd@x
sleep 30
sudo systemctl start ceph-osd@x
```

- Inspect PGs

```shell
ceph pg stat
ceph pg dump | grep remapped
```

Example excerpts:
- "active+clean+remapped"
- or combined with recovery/backfill phases depending on what changed

- Client I/O continues as normal

```shell
rados -p pool_rw put objB /tmp/test.log
```
Notes
- remapped indicates the acting set is transitioning toward the up set.
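The check itself is just a comparison of the two sets, sketched here with a hypothetical helper (`is_remapped` is not a Ceph API):

```python
def is_remapped(up: list[int], acting: list[int]) -> bool:
    """Sketch: a PG reports 'remapped' while the acting set (OSDs
    currently serving the PG) differs from the up set (where CRUSH says
    the PG should live). Order matters: the first OSD is the primary."""
    return up != acting

# CRUSH now wants [3, 7, 9], but [2, 7, 9] keeps serving while data moves
print(is_remapped(up=[3, 7, 9], acting=[2, 7, 9]))  # True
print(is_remapped(up=[3, 7, 9], acting=[3, 7, 9]))  # False
```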
Recovery
Description
- Recovery replays PG logs to resync objects incrementally after a transient failure. If the divergence is within the retained PG log window, recovery uses those logs to bring replicas current.
Example (lab)
- Stop an OSD briefly, then restart it

```shell
sudo systemctl stop ceph-osd@x
sleep 60
sudo systemctl start ceph-osd@x
```

- Watch health details

```shell
ceph health detail
```

Example excerpts:
- "active+recovery_wait+degraded" or "active+recovering"
Notes
- Recovery relies on PG logs. If the number of required entries is within osd_max_pg_log_entries (commonly ~10000 by default, varies by release), Ceph can perform incremental recovery.
Backfill
Description
- If the divergence exceeds what the PG log retains, Ceph performs backfill: a full copy of objects from the authoritative replica to the out-of-date OSD.
Example (lab)
- Keep an OSD down long enough for logs to trim, then restart

```shell
sudo systemctl stop ceph-osd@x
# wait longer than your log retention horizon
sleep 600
sudo systemctl start ceph-osd@x
```

- Check health

```shell
ceph health detail
```

Example excerpt:
- "active+undersized+degraded+remapped+backfilling"
Notes
- backfill_wait/backfilling indicate full synchronization is scheduled or in progress.
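The choice between the two resync paths can be sketched as a threshold decision. `resync_strategy` is an invented helper, and the 10000 default merely stands in for osd_max_pg_log_entries; the real decision depends on interval history, not a single counter:

```python
def resync_strategy(missed_updates: int, pg_log_entries: int = 10000) -> str:
    """Sketch of how Ceph picks a resync path: if the divergence still
    fits inside the retained PG log, replay it incrementally
    (recovery); otherwise copy the objects wholesale (backfill)."""
    return "recovery" if missed_updates <= pg_log_entries else "backfill"

print(resync_strategy(500))     # short outage -> incremental recovery
print(resync_strategy(250000))  # long outage, log trimmed -> full backfill
```

This is why a brief OSD bounce usually produces only "recovering" states, while a long outage (as in the lab above) triggers "backfilling".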
Stale
Description
- A PG is stale when monitors have not heard from its primary for a while, so the PG’s state is unknown. This can happen if:
- The primary OSD is down or partitioned.
- All replicas are down.
- The primary cannot report PG state due to prolonged network issues.
Example (lab)
- Stop all three OSDs for a given PG

```shell
sudo systemctl stop ceph-osd@23
sudo systemctl stop ceph-osd@24
sudo systemctl stop ceph-osd@10
```

- Inspect health

```shell
ceph health detail
```

Example excerpt:
- "stale+undersized+degraded+peered" on the affected PG

- Client I/O against affected data blocks

```shell
# I/O will block for objects that map to the stale PG
ls -l /mnt/
```
Notes
- When all replicas for a PG are down, the PG becomes stale/inactive and cannot serve I/O.
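The staleness check is essentially a timeout on the primary's reports. The sketch below models that idea only; `is_stale` and the 60-second grace window are illustrative assumptions, not the monitor's actual threshold or API:

```python
import time

def is_stale(last_primary_report: float, now: float,
             grace: float = 60.0) -> bool:
    """Sketch: monitors treat a PG as 'stale' when its primary has not
    reported state within a grace window (60s here is a stand-in)."""
    return (now - last_primary_report) > grace

now = time.time()
print(is_stale(now - 5, now))    # recent report -> not stale
print(is_stale(now - 300, now))  # primary silent for 5 min -> stale
```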
Inconsistent (Scrub/Deep-Scrub detected)
Description
- Scrub/Deep-Scrub detects mismatches between replicas (e.g., missing object, size/omap/digest mismatch). The PG is marked inconsistent.
Example (lab)
- Simulate corruption on a non-primary replica (lab only)

```shell
# WARNING: destructive; do not run on real data
sudo rm -rf /var/lib/ceph/osd/ceph-34/current/3.0_head/DIR_0/1000000697c.0000122c__head_19785300__3
```

- Trigger a scrub on the PG

```shell
ceph pg scrub 3.0
```

- Observe health

```shell
ceph health detail
```

Example excerpt:
- "pg 3.0 is active+clean+inconsistent"

- Repair the PG

```shell
ceph pg repair 3.0
```

After repair and a subsequent scrub/deep-scrub, health should return to OK if the repair succeeds.
Notes
- Repair fetches correct data from healthy replicas when possible.
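Deep-scrub's replica comparison can be sketched as a digest vote. This is a simplified model (`scrub_compare` is a made-up helper; Ceph compares sizes, omap, and attrs as well, and does not simply majority-vote):

```python
import zlib

def scrub_compare(replicas: dict[str, bytes]) -> list[str]:
    """Sketch of deep-scrub: checksum each replica's copy of an object
    and flag OSDs whose digest disagrees with the most common one."""
    digests = {osd: zlib.crc32(data) for osd, data in replicas.items()}
    counts: dict[int, int] = {}
    for d in digests.values():
        counts[d] = counts.get(d, 0) + 1
    authoritative = max(counts, key=counts.get)  # most common digest
    return [osd for osd, d in digests.items() if d != authoritative]

# osd.34's copy was corrupted; the scrub flags it as inconsistent
replicas = {"osd.5": b"payload", "osd.21": b"payload", "osd.34": b"payl0ad"}
print(scrub_compare(replicas))  # ['osd.34']
```

Repair then overwrites the flagged copy from a replica holding the authoritative version, which is why the PG returns to clean after a successful `ceph pg repair`.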
Down
Description
- A PG goes down when peering cannot progress because the only available OSD(s) hold state that is too old to serve as the authoritative history for the missed interval.
Example (lab)
- Baseline: a PG with acting set [5,21,29]

```shell
ceph pg dump | grep '^3.7f'
```

- Stop one OSD and write new data so the remaining replicas advance

```shell
sudo systemctl stop ceph-osd@21
# generate some I/O (any read/write workload suffices)
fio --name=randwrite --filename=/mnt/testfile --rw=randwrite --bs=4M --size=2G --numjobs=8 \
    --iodepth=4 --direct=1 --ioengine=libaio --runtime=60 --time_based
```

- Stop the remaining up-to-date replicas

```shell
sudo systemctl stop ceph-osd@29
sudo systemctl stop ceph-osd@5
```

- Start the old OSD (21) alone and check the PG

```shell
sudo systemctl start ceph-osd@21
ceph pg dump | grep '^3.7f'
# Example: state shows "down"
```

- I/O to data in that PG will block

```shell
ls -l /mnt/
```
Notes and remediation
- Bring back at least one of the newer replicas (5 or 29) so the PG can recover.
- If an OSD is permanently lost and peering cannot complete, you may need to remove and replace the OSD and handle unfound objects:

```shell
# Last resort: mark specific unfound-lost objects after careful assessment
ceph pg <pg-id> mark_unfound_lost revert|delete
```

Use revert to create placeholders that return EIO on access; delete discards the objects. Assess the data impact before use.
Incomplete
Description
- During peering, if the authoritative log cannot be determined and the available acting set cannot reconstruct a consistent history, the PG becomes incomplete. This often stems from repeated crashes or power loss during peering.
Minimal, riskier approach (data may be lost)
- Stop the primary OSD for the incomplete PG
- Mark the PG complete on disk on that OSD

```shell
ceph-objectstore-tool \
  --data-path /var/lib/ceph/osd/ceph-<id>/ \
  --pgid 1.1 \
  --op mark-complete
```

- Start the OSD and allow peering to proceed
Data-conserving approach (recommended)
- Goal: reconcile replicas by exporting from the most complete copy and importing to others, then marking complete.
- Inspect PG details and each OSD's local view

```shell
ceph pg 7.123 query > /export/pg-7.123-query.txt
# Run on each replica OSD (stop the OSD first before using the tool)
ceph-objectstore-tool \
  --data-path /var/lib/ceph/osd/ceph-641/ \
  --type bluestore \
  --pgid 7.123 \
  --op info > /export/pg-7.123-info-osd641.txt
```

- Compare object inventories to choose the most complete copy

```shell
# List objects per replica
ceph-objectstore-tool \
  --data-path /var/lib/ceph/osd/ceph-641/ \
  --type bluestore \
  --pgid 7.123 \
  --op list > /export/pg-7.123-objlist-osd-641.txt
wc -l /export/pg-7.123-objlist-osd-*.txt
diff -u /export/pg-7.123-objlist-osd-641.txt /export/pg-7.123-objlist-osd-<other>.txt
```

- Export the best replica as a backup

```shell
ceph-objectstore-tool \
  --data-path /var/lib/ceph/osd/ceph-641/ \
  --type bluestore \
  --pgid 7.123 \
  --op export \
  --file /export/pg-7.123-osd-641.obj
```

- Import to less-complete replicas (remove the PG first if required by your Ceph version)

```shell
# Stop the target OSD; remove the PG if necessary
ceph-objectstore-tool \
  --data-path /var/lib/ceph/osd/ceph-57/ \
  --type bluestore \
  --pgid 7.123 \
  --op remove --force
# Import
ceph-objectstore-tool \
  --data-path /var/lib/ceph/osd/ceph-57/ \
  --type bluestore \
  --pgid 7.123 \
  --op import \
  --file /export/pg-7.123-osd-641.obj
```

- Mark complete on all replicas, then start the OSDs

```shell
ceph-objectstore-tool \
  --data-path /var/lib/ceph/osd/ceph-57/ \
  --type bluestore \
  --pgid 7.123 \
  --op mark-complete
# Restart the OSD service
sudo systemctl start ceph-osd@57
```
Cautions
- Always stop the OSD before using ceph-objectstore-tool.
- Export every replica first as a safety backup.
- Never run destructive operations on production without validated recovery plans.
Additional Notes
- active+clean is the steady state where all replicas are in sync and the PG can serve I/O.
- During failures, peering and recovery/backfill orchestrate safe and consistent convergence. Client I/O availability depends on min_size, the number of healthy replicas, and peering success.
- Health messages such as recovery_wait, backfill_wait, undersized, degraded, peered, remapped, inconsistent, stale, down, and incomplete indicate distinct phases or failure modes; understanding them helps pinpoint actions: wait, add capacity, fix networks, bring OSDs back, or perform operator recovery steps.