Magnus Grönlund
2018-07-11 18:30:41 UTC
Hi,
Started to upgrade a ceph cluster from Jewel (10.2.10) to Luminous (12.2.6).
After upgrading and restarting the mons everything looked OK: the mons had
quorum, all OSDs were up and in, and all the PGs were active+clean.
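(For completeness, that state can be confirmed with the standard status
commands; the lines below are only a sketch of that kind of check:)

    # confirm which release each daemon reports after the mon restart
    ceph versions

    # mon quorum, OSD up/in counts and PG states at a glance
    ceph -s
    ceph quorum_status
    ceph osd stat
    ceph pg stat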
But before I had time to start upgrading the OSDs it became obvious that
something had gone terribly wrong.
All of a sudden 1600 out of 4100 PGs were inactive and 40% of the data was
misplaced!
The mons appear OK and all OSDs are still up and in, but a few hours later
there were still 1483 PGs stuck inactive, essentially all of them in peering!
Investigating one of the stuck PGs, it appears to be looping between
'inactive', 'remapped+peering' and 'peering', and the epoch number is rising
fast; see the attached pg query outputs.
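(For anyone who wants the same view without the attachments, the stuck PGs
and the looping state can be seen with the usual pg commands, roughly as
below; <pgid> is a placeholder for one of the stuck PGs:)

    # list the PGs currently stuck inactive
    ceph pg dump_stuck inactive

    # repeated queries on one stuck PG show it cycling through
    # inactive -> remapped+peering -> peering while the epoch keeps climbing
    ceph pg <pgid> query | grep -E '"state"|"epoch"'

    # which OSDs the PG currently maps to
    ceph pg map <pgid>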
We really can't afford to lose the cluster or the data, so any help or
suggestions on how to debug or fix this issue would be very, very much
appreciated!
health: HEALTH_ERR
        1483 pgs are stuck inactive for more than 60 seconds
        542 pgs backfill_wait
        14 pgs backfilling
        11 pgs degraded
        1402 pgs peering
        3 pgs recovery_wait
        11 pgs stuck degraded
        1483 pgs stuck inactive
        2042 pgs stuck unclean
        7 pgs stuck undersized
        7 pgs undersized
        111 requests are blocked > 32 sec
        10586 requests are blocked > 4096 sec
        recovery 9472/11120724 objects degraded (0.085%)
        recovery 1181567/11120724 objects misplaced (10.625%)
        noout flag(s) set
        mon.eselde02u32 low disk space

services:
  mon: 3 daemons, quorum eselde02u32,eselde02u33,eselde02u34
  mgr: eselde02u32(active), standbys: eselde02u33, eselde02u34
  osd: 111 osds: 111 up, 111 in; 800 remapped pgs
       flags noout

data:
  pools:   18 pools, 4104 pgs
  objects: 3620k objects, 13875 GB
  usage:   42254 GB used, 160 TB / 201 TB avail
  pgs:     1.876% pgs unknown
           34.259% pgs not active
           9472/11120724 objects degraded (0.085%)
           1181567/11120724 objects misplaced (10.625%)
           2062 active+clean
           1221 peering
           535  active+remapped+backfill_wait
           181  remapped+peering
           77   unknown
           13   active+remapped+backfilling
           7    active+undersized+degraded+remapped+backfill_wait
           4    remapped
           3    active+recovery_wait+degraded+remapped
           1    active+degraded+remapped+backfilling

io:
  recovery: 298 MB/s, 77 objects/s