Understanding Ceph Placement Group (PG) States and Operations
Placement Groups (PGs) are the core unit of placement, replication, and coordination in Ceph’s RADOS layer. A PG aggregates objects, maps them to OSDs via CRUSH, and coordinates replication, ordering, and recovery. Many pool-level behaviors (replication, failure domains, recovery) are realized through PG logic.
A healthy cluster reports all PGs as active+clean, meaning each PG is available and every replica in its acting set is in sync.
Below are the major PG states you may encounter, what they mean, how to reproduce them safely in a lab, and how they affect client I/O.
Degraded and Undersized
Description
- degraded: At least one replica in the PG’s acting set is missing or behind; some objects may lack full replica count.
- undersized: The number of replicas is below the pool’s size setting.
- When one OSD in a triplicated pool (size=3, min_size=2) fails, the PGs that included that OSD become active+undersized+degraded.
Example (lab)
- Stop one OSD

```shell
sudo systemctl stop ceph-osd@1
```

- Inspect PG and cluster health

```shell
ceph pg stat
ceph health detail
```

Example excerpts:
- "active+undersized+degraded" on multiple PGs
- Health warns about degraded data redundancy

- Verify client I/O is still available (min_size=2)

```shell
# Write and read a small object in a test pool
rados -p pool_rw put objA /etc/hosts
rados -p pool_rw get objA /tmp/objA.copy
ls -l /etc/hosts /tmp/objA.copy
```
Notes
- degraded/undersized states are expected during transient OSD failures; client I/O proceeds as long as min_size is satisfied.
- The "acting [a,b]" set shows which OSDs currently host available replicas.
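The relationship between the pool's size, the acting set, and these state flags can be sketched in a few lines. This is a conceptual model only; `pg_state_flags` is a hypothetical helper, not part of Ceph's actual state machine:

```python
def pg_state_flags(pool_size: int, acting: list[int]) -> set[str]:
    """Conceptual sketch: derive PG state flags from the acting set.

    A PG is 'undersized' when fewer replicas than the pool's size are
    acting, and such PGs are also 'degraded' until the missing copies
    are restored; otherwise it is 'clean'.
    """
    flags = {"active"}
    if len(acting) < pool_size:
        flags |= {"undersized", "degraded"}
    else:
        flags.add("clean")
    return flags

# A 3x pool with one failed OSD: acting set shrinks from [0, 1, 2] to [0, 2]
print(sorted(pg_state_flags(3, [0, 2])))     # undersized + degraded
print(sorted(pg_state_flags(3, [0, 1, 2])))  # healthy: active + clean
```

The sketch captures why a single OSD failure in a size=3 pool reports both flags at once: the acting set is short (undersized), so some objects are below full replica count (degraded).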
Peered (and I/O gating with min_size)
Description
- peered: Peering completed (the replicas have negotiated a consistent view of history), but the number of available replicas is below min_size; the PG typically stays inactive and won’t serve client I/O.
Example (lab)
- Stop two OSDs in a 3x replicated pool

```shell
sudo systemctl stop ceph-osd@1
sudo systemctl stop ceph-osd@0
```

- Check health

```shell
ceph health detail
```

Example excerpts:
- "undersized+degraded+peered"
- "pg N is stuck inactive … last acting [2]"

- Observe I/O blocking with min_size=2

```shell
# This read may hang because only 1 replica is present
rados -p pool_rw get objA /tmp/objA.blocked
```

- Reduce the gating threshold (for lab only)

```shell
ceph osd pool set pool_rw min_size 1
```

- Verify I/O proceeds

```shell
rados -p pool_rw get objA /tmp/objA.ok
ls -l /tmp/objA.ok
```
Notes
- peered means the replicas have reached agreement, but if available replicas < min_size, the PG won’t serve I/O.
- Lowering min_size allows service with fewer replicas but increases risk; avoid in production unless part of a controlled recovery plan.
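The min_size gate reduces to a simple predicate. The sketch below is a model of the behavior described above, not Ceph's code; `pg_serves_io` is a made-up name:

```python
def pg_serves_io(available_replicas: int, min_size: int) -> bool:
    """Sketch of the min_size gate: a peered PG only goes active and
    serves client I/O once at least min_size replicas are available."""
    return available_replicas >= min_size

# size=3, min_size=2: with one surviving replica the PG stays peered
print(pg_serves_io(1, min_size=2))  # False: I/O blocks
# After `ceph osd pool set pool_rw min_size 1` (lab only):
print(pg_serves_io(1, min_size=1))  # True: I/O proceeds at higher risk
```

This is why the blocked `rados get` in the lab unblocks as soon as min_size is lowered: nothing else about the PG changed, only the gating threshold.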
Remapped
Description
- After peering, if the computed up set (the CRUSH-computed placement for the PG) differs from the current acting set, the PG reports remapped while replicas transition to the correct OSDs.
- This is common after OSDs go in/out, devices are reweighted, or the cluster is expanded.
Example (lab)
- Bounce an OSD to force movement

```shell
sudo systemctl stop ceph-osd@x
sleep 30
sudo systemctl start ceph-osd@x
```

- Inspect PGs

```shell
ceph pg stat
ceph pg dump | grep remapped
```

Example excerpts:
- "active+clean+remapped"
- or combined with recovery/backfill phases depending on what changed

- Client I/O continues as normal

```shell
rados -p pool_rw put objB /tmp/test.log
```
Notes
- remapped indicates the acting set is transitioning toward the up set.
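The check itself is just a comparison of the two sets, sketched here with a hypothetical helper (`is_remapped` is not a Ceph API):

```python
def is_remapped(up: list[int], acting: list[int]) -> bool:
    """Sketch: a PG reports 'remapped' while the acting set (OSDs
    currently serving the PG) differs from the up set (where CRUSH says
    the PG should live). Order matters: the first OSD is the primary."""
    return up != acting

# CRUSH now wants [3, 7, 9], but [2, 7, 9] keeps serving while data moves
print(is_remapped(up=[3, 7, 9], acting=[2, 7, 9]))  # True
print(is_remapped(up=[3, 7, 9], acting=[3, 7, 9]))  # False
```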
Recovery
Description
- Recovery replays PG logs to resync objects incrementally after a transient failure. If the divergence is within the retained PG log window, recovery uses those logs to bring replicas current.
Example (lab)
- Stop an OSD briefly, then restart it

```shell
sudo systemctl stop ceph-osd@x
sleep 60
sudo systemctl start ceph-osd@x
```

- Watch health details

```shell
ceph health detail
```

Example excerpts:
- "active+recovery_wait+degraded" or "active+recovering"
Notes
- Recovery relies on PG logs. If the number of required entries is within osd_max_pg_log_entries (commonly ~10000 by default, varies by release), Ceph can perform incremental recovery.
Backfill
Description
- If the divergence exceeds what the PG log retains, Ceph performs backfill: a full copy of objects from the authoritative replica to the out-of-date OSD.
Example (lab)
- Keep an OSD down long enough for logs to trim, then restart

```shell
sudo systemctl stop ceph-osd@x
# wait longer than your log retention horizon
sleep 600
sudo systemctl start ceph-osd@x
```

- Check health

```shell
ceph health detail
```

Example excerpt:
- "active+undersized+degraded+remapped+backfilling"
Notes
- backfill_wait/backfilling indicate full synchronization is scheduled or in progress.
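The choice between the two resync paths can be sketched as a threshold decision. `resync_strategy` is an invented helper, and the 10000 default merely stands in for osd_max_pg_log_entries; the real decision depends on interval history, not a single counter:

```python
def resync_strategy(missed_updates: int, pg_log_entries: int = 10000) -> str:
    """Sketch of how Ceph picks a resync path: if the divergence still
    fits inside the retained PG log, replay it incrementally
    (recovery); otherwise copy the objects wholesale (backfill)."""
    return "recovery" if missed_updates <= pg_log_entries else "backfill"

print(resync_strategy(500))     # short outage -> incremental recovery
print(resync_strategy(250000))  # long outage, log trimmed -> full backfill
```

This is why a brief OSD bounce usually produces only "recovering" states, while a long outage (as in the lab above) triggers "backfilling".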
Stale
Description
- A PG is stale when monitors have not heard from its primary for a while, so the PG’s state is unknown. This can happen if:
- The primary OSD is down or partitioned.
- All replicas are down.
- The primary cannot report PG state due to prolonged network issues.
Example (lab)
- Stop all three OSDs for a given PG

```shell
sudo systemctl stop ceph-osd@23
sudo systemctl stop ceph-osd@24
sudo systemctl stop ceph-osd@10
```

- Inspect health

```shell
ceph health detail
```

Example excerpt:
- "stale+undersized+degraded+peered" on the affected PG

- Client I/O against affected data blocks

```shell
# I/O will block for objects that map to the stale PG
ls -l /mnt/
```
Notes
- When all replicas for a PG are down, the PG becomes stale/inactive and cannot serve I/O.
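The staleness check is essentially a timeout on the primary's reports. The sketch below models that idea only; `is_stale` and the 60-second grace window are illustrative assumptions, not the monitor's actual threshold or API:

```python
import time

def is_stale(last_primary_report: float, now: float,
             grace: float = 60.0) -> bool:
    """Sketch: monitors treat a PG as 'stale' when its primary has not
    reported state within a grace window (60s here is a stand-in)."""
    return (now - last_primary_report) > grace

now = time.time()
print(is_stale(now - 5, now))    # recent report -> not stale
print(is_stale(now - 300, now))  # primary silent for 5 min -> stale
```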
Inconsistent (Scrub/Deep-Scrub detected)
Description
- Scrub/Deep-Scrub detects mismatches between replicas (e.g., missing object, size/omap/digest mismatch). The PG is marked inconsistent.
Example (lab)
- Simulate corruption on a non-primary replica (lab only)

```shell
# WARNING: destructive; do not run on real data
sudo rm -rf /var/lib/ceph/osd/ceph-34/current/3.0_head/DIR_0/1000000697c.0000122c__head_19785300__3
```

- Trigger a scrub on the PG

```shell
ceph pg scrub 3.0
```

- Observe health

```shell
ceph health detail
```

Example excerpt:
- "pg 3.0 is active+clean+inconsistent"

- Repair the PG

```shell
ceph pg repair 3.0
```

After repair and a subsequent scrub/deep-scrub, health should return to OK if the repair succeeds.
Notes
- Repair fetches correct data from healthy replicas when possible.
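Deep-scrub's replica comparison can be sketched as a digest vote. This is a simplified model (`scrub_compare` is a made-up helper; Ceph compares sizes, omap, and attrs as well, and does not simply majority-vote):

```python
import zlib

def scrub_compare(replicas: dict[str, bytes]) -> list[str]:
    """Sketch of deep-scrub: checksum each replica's copy of an object
    and flag OSDs whose digest disagrees with the most common one."""
    digests = {osd: zlib.crc32(data) for osd, data in replicas.items()}
    counts: dict[int, int] = {}
    for d in digests.values():
        counts[d] = counts.get(d, 0) + 1
    authoritative = max(counts, key=counts.get)  # most common digest
    return [osd for osd, d in digests.items() if d != authoritative]

# osd.34's copy was corrupted; the scrub flags it as inconsistent
replicas = {"osd.5": b"payload", "osd.21": b"payload", "osd.34": b"payl0ad"}
print(scrub_compare(replicas))  # ['osd.34']
```

Repair then overwrites the flagged copy from a replica holding the authoritative version, which is why the PG returns to clean after a successful `ceph pg repair`.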
Down
Description
- A PG goes down when peering cannot progress because the only available OSD(s) hold state that is too old to serve as the authoritative history for the missed interval.
Example (lab)
- Baseline: a PG with acting set [5,21,29]

```shell
ceph pg dump | grep '^3.7f'
```

- Stop one OSD and write new data so the remaining replicas advance

```shell
sudo systemctl stop ceph-osd@21
# generate some I/O (any read/write workload suffices)
fio --name=randwrite --filename=/mnt/testfile --rw=randwrite --bs=4M --size=2G --numjobs=8 \
    --iodepth=4 --direct=1 --ioengine=libaio --runtime=60 --time_based
```

- Stop the remaining up-to-date replicas

```shell
sudo systemctl stop ceph-osd@29
sudo systemctl stop ceph-osd@5
```

- Start the old OSD (21) alone and check the PG

```shell
sudo systemctl start ceph-osd@21
ceph pg dump | grep '^3.7f'
# Example: state shows "down"
```

- I/O to data in that PG will block

```shell
ls -l /mnt/
```
Notes and remediation
- Bring back at least one of the newer replicas (5 or 29) so the PG can recover.
- If an OSD is permanently lost and peering cannot complete, you may need to remove and replace the OSD and handle unfound objects:

```shell
# Last resort: mark specific unfound-lost objects after careful assessment
ceph pg <pg-id> mark_unfound_lost revert|delete
```

Use revert to create placeholders that return EIO on access; delete discards the objects. Assess the data impact before use.
Incomplete
Description
- During peering, if the authoritative log cannot be determined and the available acting set cannot reconstruct a consistent history, the PG becomes incomplete. This often stems from repeated crashes or power loss during peering.
Minimal, riskier approach (data may be lost)
- Stop the primary OSD for the incomplete PG
- Mark the PG complete on disk on that OSD

```shell
ceph-objectstore-tool \
  --data-path /var/lib/ceph/osd/ceph-<id>/ \
  --pgid 1.1 \
  --op mark-complete
```

- Start the OSD and allow peering to proceed
Data-conserving approach (recommended)
- Goal: reconcile replicas by exporting from the most complete copy and importing to others, then marking complete.
- Inspect PG details and each OSD's local view

```shell
ceph pg 7.123 query > /export/pg-7.123-query.txt
# Run on each replica OSD (stop the OSD first before using the tool)
ceph-objectstore-tool \
  --data-path /var/lib/ceph/osd/ceph-641/ \
  --type bluestore \
  --pgid 7.123 \
  --op info > /export/pg-7.123-info-osd641.txt
```

- Compare object inventories to choose the most complete copy

```shell
# List objects per replica
ceph-objectstore-tool \
  --data-path /var/lib/ceph/osd/ceph-641/ \
  --type bluestore \
  --pgid 7.123 \
  --op list > /export/pg-7.123-objlist-osd-641.txt
wc -l /export/pg-7.123-objlist-osd-*.txt
diff -u /export/pg-7.123-objlist-osd-641.txt /export/pg-7.123-objlist-osd-<other>.txt
```

- Export the best replica as a backup

```shell
ceph-objectstore-tool \
  --data-path /var/lib/ceph/osd/ceph-641/ \
  --type bluestore \
  --pgid 7.123 \
  --op export \
  --file /export/pg-7.123-osd-641.obj
```

- Import to less-complete replicas (remove the PG first if required by your Ceph version)

```shell
# Stop the target OSD; remove the PG if necessary
ceph-objectstore-tool \
  --data-path /var/lib/ceph/osd/ceph-57/ \
  --type bluestore \
  --pgid 7.123 \
  --op remove --force
# Import
ceph-objectstore-tool \
  --data-path /var/lib/ceph/osd/ceph-57/ \
  --type bluestore \
  --pgid 7.123 \
  --op import \
  --file /export/pg-7.123-osd-641.obj
```

- Mark complete on all replicas, then start the OSDs

```shell
ceph-objectstore-tool \
  --data-path /var/lib/ceph/osd/ceph-57/ \
  --type bluestore \
  --pgid 7.123 \
  --op mark-complete
# Restart the OSD service
sudo systemctl start ceph-osd@57
```
Cautions
- Always stop the OSD before using ceph-objectstore-tool.
- Export every replica first as a safety backup.
- Never run destructive operations on production without validated recovery plans.
Additional Notes
- active+clean is the steady state where all replicas are in sync and the PG can serve I/O.
- During failures, peering and recovery/backfill orchestrate safe and consistent convergence. Client I/O availability depends on min_size, the number of healthy replicas, and peering success.
- Health messages such as recovery_wait, backfill_wait, undersized, degraded, peered, remapped, inconsistent, stale, down, and incomplete indicate distinct phases or failure modes; understanding them helps pinpoint actions: wait, add capacity, fix networks, bring OSDs back, or perform operator recovery steps.