Discussion: [ceph-users] PGs stuck peering (looping?) after upgrade to Luminous.
Magnus Grönlund
2018-07-11 18:30:41 UTC
Hi,

Started to upgrade a ceph-cluster from Jewel (10.2.10) to Luminous (12.2.6)

After upgrading and restarting the mons everything looked OK: the mons had
quorum, all OSDs were up and in, and all the PGs were active+clean.
But before I had time to start upgrading the OSDs it became obvious that
something had gone terribly wrong.
All of a sudden 1600 out of 4100 PGs were inactive and 40% of the data was
misplaced!

The mons appear OK and all OSDs are still up and in, but a few hours later
there were still 1483 PGs stuck inactive, essentially all of them in peering!
Investigating one of the stuck PGs, it appears to be looping between
“inactive”, “remapped+peering” and “peering”, and the epoch number is rising
fast; see the attached pg query outputs.
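
For reference, something like the following was used to inspect the stuck PGs
(the PG id 2.e32 below is just a placeholder for one of the affected PGs):

  ceph pg dump_stuck inactive
  ceph pg map 2.e32                        # show the up/acting OSD sets
  ceph pg 2.e32 query > pg-2.e32-query.txt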

We really can’t afford to lose the cluster or the data, so any help or
suggestions on how to debug or fix this issue would be very, very
appreciated!


health: HEALTH_ERR
1483 pgs are stuck inactive for more than 60 seconds
542 pgs backfill_wait
14 pgs backfilling
11 pgs degraded
1402 pgs peering
3 pgs recovery_wait
11 pgs stuck degraded
1483 pgs stuck inactive
2042 pgs stuck unclean
7 pgs stuck undersized
7 pgs undersized
111 requests are blocked > 32 sec
10586 requests are blocked > 4096 sec
recovery 9472/11120724 objects degraded (0.085%)
recovery 1181567/11120724 objects misplaced (10.625%)
noout flag(s) set
mon.eselde02u32 low disk space

services:
mon: 3 daemons, quorum eselde02u32,eselde02u33,eselde02u34
mgr: eselde02u32(active), standbys: eselde02u33, eselde02u34
osd: 111 osds: 111 up, 111 in; 800 remapped pgs
flags noout

data:
pools: 18 pools, 4104 pgs
objects: 3620k objects, 13875 GB
usage: 42254 GB used, 160 TB / 201 TB avail
pgs: 1.876% pgs unknown
34.259% pgs not active
9472/11120724 objects degraded (0.085%)
1181567/11120724 objects misplaced (10.625%)
2062 active+clean
1221 peering
535 active+remapped+backfill_wait
181 remapped+peering
77 unknown
13 active+remapped+backfilling
7 active+undersized+degraded+remapped+backfill_wait
4 remapped
3 active+recovery_wait+degraded+remapped
1 active+degraded+remapped+backfilling

io:
recovery: 298 MB/s, 77 objects/s
Paul Emmerich
2018-07-11 18:39:03 UTC
Did you finish the upgrade of the OSDs? Are OSDs flapping? (ceph -w) Is
there anything weird in the OSDs' log files?
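
E.g. something along these lines (osd.0 is just an example, assuming the
default log location):

  ceph -w                                  # watch for OSDs flapping up/down
  ceph tell osd.* version                  # which daemons are still on Jewel
  grep -iE 'error|fail|assert' /var/log/ceph/ceph-osd.0.log | tail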


Paul
--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
Magnus Grönlund
2018-07-11 19:10:34 UTC
Hi Paul,

No, all OSDs are still on Jewel; the issue started before I had even started
to upgrade the first OSD, and they don't appear to be flapping.
ceph -w shows a lot of slow requests etc., but nothing unexpected as far as I
can tell considering the state the cluster is in.

2018-07-11 20:40:09.396642 osd.37 [WRN] 100 slow requests, 2 included
below; oldest blocked for > 25402.278824 secs
2018-07-11 20:40:09.396652 osd.37 [WRN] slow request 1920.957326 seconds
old, received at 2018-07-11 20:08:08.439214:
osd_op(client.73540057.0:8289463 2.e57b3e32 (undecoded)
ack+ondisk+retry+write+known_if_redirected e160294) currently waiting for
peered
2018-07-11 20:40:09.396660 osd.37 [WRN] slow request 1920.048094 seconds
old, received at 2018-07-11 20:08:09.348446:
osd_op(client.671628641.0:998704 2.42f88232 (undecoded)
ack+ondisk+retry+write+known_if_redirected e160475) currently waiting for
peered
2018-07-11 20:40:10.397008 osd.37 [WRN] 100 slow requests, 2 included
below; oldest blocked for > 25403.279204 secs
2018-07-11 20:40:10.397017 osd.37 [WRN] slow request 1920.043860 seconds
old, received at 2018-07-11 20:08:10.353060:
osd_op(client.231731103.0:1007729 3.e0ff5786 (undecoded)
ondisk+write+known_if_redirected e137428) currently waiting for peered
2018-07-11 20:40:10.397023 osd.37 [WRN] slow request 1920.034101 seconds
old, received at 2018-07-11 20:08:10.362819:
osd_op(client.207458703.0:2000292 3.a8143b86 (undecoded)
ondisk+write+known_if_redirected e137428) currently waiting for peered
2018-07-11 20:40:10.790573 mon.0 [INF] pgmap 4104 pgs: 5 down+peering, 1142
peering, 210 remapped+peering, 5 active+recovery_wait+degraded, 1551
active+clean, 2 activating+undersized+degraded+remapped, 15
active+remapped+backfilling, 178 unknown, 1 active+remapped, 3
activating+remapped, 78 active+undersized+degraded+remapped+backfill_wait,
6 active+recovery_wait+degraded+remapped, 3
undersized+degraded+remapped+backfill_wait+peered, 5
active+undersized+degraded+remapped+backfilling, 295
active+remapped+backfill_wait, 3 active+recovery_wait+undersized+degraded,
21 activating+undersized+degraded, 559 active+undersized+degraded, 4
remapped, 17 undersized+degraded+peered, 1
active+recovery_wait+undersized+degraded+remapped; 13439 GB data, 42395 GB
used, 160 TB / 201 TB avail; 4069 B/s rd, 746 kB/s wr, 5 op/s;
534753/10756032 objects degraded (4.972%); 779027/10756032 objects
misplaced (7.243%); 256 MB/s, 65 objects/s recovering



There are a lot of things in the OSD log files that I'm unfamiliar with, but
so far I haven't found anything that has given me a clue on how to fix the
issue.
BTW, restarting an OSD doesn't seem to help; on the contrary, it sometimes
results in PGs being stuck undersized!
I have attached an OSD log from the startup of one of the restarted OSDs.
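
I can also raise the debug level on an OSD and capture more if that would
help, e.g. something like (osd.37 just as an example, debug levels are a
guess at something reasonable):

  ceph tell osd.37 injectargs '--debug_osd 10 --debug_ms 1'
  # ...wait for a peering attempt, then turn it back down:
  ceph tell osd.37 injectargs '--debug_osd 1/5 --debug_ms 0'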

Best regards
/Magnus
Magnus Grönlund
2018-07-12 09:45:29 UTC
Hi list,

Things went from bad to worse. I tried to upgrade some OSDs to Luminous to
see if that could help, but that didn’t appear to make any difference.
But for each restarted OSD there were a few PGs that the OSD seemed to
“forget”, and the number of undersized PGs grew until some PGs had been
“forgotten” by all 3 acting OSDs and became stale, even though all OSDs
(and their disks) were available.
Then the OSDs grew so big that the servers ran out of memory (48 GB per
server, with 10 2 TB disks per server) and started killing the OSDs.

All OSDs were then shut down to try to preserve at least some data on the
disks, but maybe it is too late?
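
Before trying to bring them back up I was thinking of setting the
recovery-related flags so the OSDs don't immediately start moving data again,
something like:

  ceph osd set noout
  ceph osd set nobackfill
  ceph osd set norecover
  ceph osd set norebalance
  # (and the corresponding "ceph osd unset ..." once things look stable again)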

/Magnus
David Majchrzak
2018-07-12 10:05:44 UTC
Hi/Hej Magnus,

We had a similar issue going from latest Hammer to Jewel (so it might not be applicable for you), with PGs stuck peering / data misplaced, right after updating all mons to the latest Jewel at that time, 10.2.10.
Finally setting the require_jewel_osds flag put everything back in place (we were going to do this after restarting all OSDs, following the docs/changelogs).
What does your ceph health detail look like?
Did you perform any other commands after starting your mon upgrade? Any commands that might change the CRUSH map might cause issues AFAIK (correct me if I'm wrong, but I think we ran into this once) if your mons and OSDs are different versions.
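
For what it's worth, this is roughly what we checked and ran back then; the
Luminous equivalent is require-osd-release and should only be run once all
OSDs actually run Luminous:

  ceph osd dump | grep -E 'flags|require'   # shows which require_* flags are set
  ceph osd set require_jewel_osds           # what fixed it for us on Jewel
  # Luminous equivalent, only after *all* OSDs have been upgraded:
  ceph osd require-osd-release luminous
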
// david
Magnus Grönlund
2018-07-12 12:26:30 UTC
Hej David and thanks!

That was indeed the magic trick: no more peering, stale or down PGs.

Upgraded the Ceph packages on the hosts, restarted the OSDs, and then ran
"ceph osd require-osd-release luminous".
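
For anyone hitting the same thing, this is roughly how it can be verified
afterwards:

  ceph osd dump | grep require_osd_release   # should now say "luminous"
  ceph versions                              # all daemons should report 12.2.x
  ceph -s                                    # PGs settle back to active+clean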

/Magnus
Mark Schouten
2018-11-14 03:13:16 UTC
Hi,

This was indeed the trick. It has cost me a few years of my life in raised blood pressure. :)

Should this be documented somewhere? That even though the cluster does not seem to be recovering as it should, you should just continue to restart the OSDs and run 'ceph osd require-osd-release luminous'?
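
Even a short note on the overall order would help, e.g. roughly (as I
understand the Luminous release notes, details may differ):

  ceph osd set noout
  # upgrade and restart the mons one by one, then the mgrs
  # upgrade the packages on each OSD host and restart its OSDs:
  systemctl restart ceph-osd.target
  # only once every OSD reports 12.2.x:
  ceph osd require-osd-release luminous
  ceph osd unset noout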

Mark

--
Mark Schouten  | Tuxis Internet Engineering
KvK: 61527076  | http://www.tuxis.nl/
T: 0318 200208 | ***@tuxis.nl
 

Kevin Olbrich
2018-07-11 18:46:33 UTC
Sounds a little bit like the problem I had on OSDs:

[ceph-users] Blocked requests activating+remapped after extending pg(p)_num
<http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026680.html>
(Kevin Olbrich)

Follow-ups in that thread:
- <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026681.html> (Burkhard Linke)
- <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026682.html> (Kevin Olbrich)
- <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026683.html> (Kevin Olbrich)
- <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026685.html> (Kevin Olbrich)
- <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026689.html> (Kevin Olbrich)
- <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026692.html> (Paul Emmerich)
- <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026695.html> (Kevin Olbrich)

I ended up restarting the OSDs which were stuck in that state and they
immediately fixed themselves.
It should also work to just mark the problem OSDs "out" and immediately "in"
again to fix it.
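
I.e. something along these lines (osd.42 is just an example):

  systemctl restart ceph-osd@42
  # or, without restarting the daemon:
  ceph osd out 42
  ceph osd in 42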

- Kevin
Magnus Grönlund
2018-07-11 21:07:46 UTC
Hi Kevin,

Unfortunately restarting OSDs doesn't appear to help; instead it seems to
make things worse, with PGs getting stuck degraded.
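
E.g. this is roughly how I've been checking which PGs get stuck after a
restart (the PG id 2.e32 is just a placeholder):

  ceph pg dump_stuck degraded
  ceph pg dump_stuck undersized
  ceph pg map 2.e32            # shows the up/acting OSD sets for that PG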

Best regards
/Magnus