2018-02-12 13:18:03 UTC
Warning, this is a long story! There's a TL;DR; close to the end.
We are replacing some of our spinning drives with SSDs. We have 14 OSD
nodes with 12 drives each. We are replacing 4 drives from each node
with SSDs. The cluster is running Ceph Jewel (10.2.7). The affected pool
had min_size=2 and size=3.
After removing some of the drives (from a single host) we noticed the
rebalancing/recovering process got stuck and we had 1 PG with 2 unfound
objects.
Most of our OpenStack VMs were having issues: they were unresponsive or
had other I/O problems.
We tried querying the PG but had no response after hours of waiting.
Trying to recover or delete the unfound objects did the same thing:
One of the two remaining OSD nodes that had the PG was experiencing
huge load spikes correlated with disk IO spikes: https://imgur.com/a/7g0eI
We had this OSD removed and after a while the other OSD started doing
the same thing - huge load spikes.
Tried doing a query on the affected PG and deleting the unfound objects.
Nothing had changed.
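
For anyone following along, the query and unfound-object steps above can
be sketched roughly as below. This is a minimal sketch, not a transcript
of what we ran: the PG id is hypothetical, the helper names are made up
for illustration, and `list_missing` is the subcommand name as I
understand it for Jewel (later releases call it `list_unfound`). The
timeout wrapper is there because the query never returned for us.

```python
import subprocess


def pg_command(pgid, *args):
    """Build the argument list for `ceph pg <pgid> <args...>`."""
    return ["ceph", "pg", pgid, *args]


def run_with_timeout(cmd, timeout_s=60):
    """Run cmd and return its stdout, or None if it does not finish in time."""
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    return done.stdout


PGID = "1.2f"  # hypothetical PG id, not the one from this incident

query_cmd = pg_command(PGID, "query")
# Assumption: Jewel-era name; newer releases use "list_unfound".
list_cmd = pg_command(PGID, "list_missing")
# Alternative to "delete" is "revert", which rolls back to a prior version.
delete_cmd = pg_command(PGID, "mark_unfound_lost", "delete")
```

Calling `run_with_timeout(query_cmd, timeout_s=300)` on a node with an
admin keyring returns None instead of hanging if the PG never answers.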
The OSDs this PG was supposed to be replicated to only had an empty
We removed the last OSD that had the PG with unfound objects. Now we had
an incomplete PG. Recovered the data from the OSD we removed before all
this started and tried exporting and importing the PG using the
Ceph Object Store Tool. Unfortunately nothing happened.
Also tried using the Ceph Object Store Tool to find and delete the
unfound objects from the last two OSDs we had removed and re-import the
PG but this also didn't work.
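
The export/import step can be sketched as follows. Again a minimal
sketch under assumptions: it assumes FileStore OSDs (hence the journal
path), the OSD ids, paths and PG id are hypothetical, and the helper
name is made up. Both the source and the target OSD daemon need to be
stopped while the tool runs.

```python
def objectstore_tool_cmd(data_path, journal_path, op, dump_file, pgid=None):
    """Build a ceph-objectstore-tool command line for a PG export or import."""
    if op not in ("export", "import"):
        raise ValueError("op must be 'export' or 'import'")
    cmd = [
        "ceph-objectstore-tool",
        "--data-path", data_path,
        "--journal-path", journal_path,
        "--op", op,
        "--file", dump_file,
    ]
    if op == "export":
        # Export needs to be told which PG to dump; the import reads the
        # PG id back from the dump file.
        if pgid is None:
            raise ValueError("export requires a pgid")
        cmd += ["--pgid", pgid]
    return cmd


# Hypothetical ids and paths, for illustration only.
export_cmd = objectstore_tool_cmd(
    "/var/lib/ceph/osd/ceph-12",
    "/var/lib/ceph/osd/ceph-12/journal",
    "export",
    "/root/pg-1.2f.export",
    pgid="1.2f",
)
import_cmd = objectstore_tool_cmd(
    "/var/lib/ceph/osd/ceph-40",
    "/var/lib/ceph/osd/ceph-40/journal",
    "import",
    "/root/pg-1.2f.export",
)
```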
*TL;DR;* we had 2 unfound objects on a PG after removing an OSD, cluster
status was healthy before this, pool has min_size=2 and size=3.
Had to delete the entire pool and recreate all the virtual machines.
If you have any idea why the PG was not being replicated on the other
two OSDs please let me know. Any suggestions on how to avoid this?
Just want to make sure this never happens again.
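
One kind of pre-removal check that could be considered is sketched
below: given the acting set of each PG, list the PGs that would fall
below min_size if a batch of OSDs were pulled at once. The function
name and the sample data are hypothetical, and the acting sets would
have to be collected from the cluster separately. It only counts
replicas, so it may not catch a case like ours where the remaining
copies turned out to hold no data.

```python
def pgs_at_risk(acting_sets, osds_to_remove, min_size):
    """Return {pgid: remaining_osds} for PGs left with fewer than min_size copies."""
    removing = set(osds_to_remove)
    at_risk = {}
    for pgid, acting in acting_sets.items():
        remaining = [osd for osd in acting if osd not in removing]
        if len(remaining) < min_size:
            at_risk[pgid] = remaining
    return at_risk


# Hypothetical acting sets (PG id -> OSD ids), size=3 as in our pool.
acting_sets = {
    "1.0": [3, 17, 42],
    "1.1": [3, 4, 50],
    "1.2": [10, 20, 30],
}

# Removing OSDs 3 and 4 together leaves PG 1.1 with a single copy.
risky = pgs_at_risk(acting_sets, osds_to_remove=[3, 4], min_size=2)
```

Removing OSDs one at a time and waiting for all PGs to return to
active+clean between removals keeps this set empty for a size=3 pool.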
Our story is similar to this one: