Discussion:
[ceph-users] Placement Groups undersized after adding OSDs
Wido den Hollander
2018-11-14 14:38:36 UTC
Hi,

I'm in the middle of expanding a Ceph cluster, and while watching 'ceph -s'
I suddenly saw a bunch of Placement Groups go undersized.

My first thought was that one or more OSDs had failed, but none did.

So I checked and I saw these Placement Groups undersized:

11.3b54 active+undersized+degraded+remapped+backfill_wait [1795,639,1422]  1795 [1795,639]  1795
11.362f active+undersized+degraded+remapped+backfill_wait [1431,1134,2217] 1431 [1134,1468] 1134
11.3e31 active+undersized+degraded+remapped+backfill_wait [1451,1391,1906] 1451 [1906,2053] 1906
11.50c  active+undersized+degraded+remapped+backfill_wait [1867,1455,1348] 1867 [1867,2036] 1867
11.421e active+undersized+degraded+remapped+backfilling   [280,117,1421]   280  [280,117]   280
11.700  active+undersized+degraded+remapped+backfill_wait [2212,1422,2087] 2212 [2055,2087] 2055
11.735  active+undersized+degraded+remapped+backfilling   [772,1832,1433]  772  [772,1832]  772
11.d5a  active+undersized+degraded+remapped+backfill_wait [423,1709,1441]  423  [423,1709]  423
11.a95  active+undersized+degraded+remapped+backfill_wait [1433,1180,978]  1433 [978,1180]  978
11.a67  active+undersized+degraded+remapped+backfill_wait [1154,1463,2151] 1154 [1154,2151] 1154
11.10ca active+undersized+degraded+remapped+backfill_wait [2012,486,1457]  2012 [2012,486]  2012
11.2439 active+undersized+degraded+remapped+backfill_wait [910,1457,1193]  910  [910,1193]  910
11.2f7e active+undersized+degraded+remapped+backfill_wait [1423,1356,2098] 1423 [1356,2098] 1356
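
(The columns follow the 'ceph pg dump pgs_brief' layout: PG, state, up set,
up primary, acting set, acting primary. A list like this can be reproduced
with, for example:)

$ ceph pg dump pgs_brief | grep undersized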

After searching I found that OSDs 1422, 1431, 1451, 1455, 1421, 1433, 1441,
1463, 1457 and 1423 are all running on the same, newly added host.
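
For reference, something like this maps each of those OSDs to its host
('ceph osd find' prints the CRUSH location; the jq path assumes the Luminous
output format):

$ for o in 1422 1431 1451 1455 1421 1433 1441 1463 1457 1423; do
>   echo -n "osd.$o -> "; ceph osd find $o | jq -r '.crush_location.host'
> done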

I checked:
- The host did not reboot
- The OSDs did not restart

The OSDs have been up_thru since map 646724, which is from 11:05 this morning
(4.5 hours ago), about the same time they were added.
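
(up_from/up_thru/down_at can be read straight from the OSD map for any of
them, e.g.:)

$ ceph osd dump | grep '^osd.1422 '
# the osd.N lines include "... up_from E up_thru E down_at E ..."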

So these PGs are currently running on *2* replicas while they should be
running on *3*.

We just added 8 nodes with 24 disks each to the cluster, but none of the
existing OSDs were touched.

When looking at PG 11.3b54 I see that 1422 is a backfill target:

$ ceph pg 11.3b54 query | jq '.recovery_state'

The 'enter time' for this state is about 30 minutes ago, which is about when
this happened.

'might_have_unfound' points to OSD 1982, which is in the same rack as 1422
(CRUSH replicates over racks), but that OSD is also online.
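
For completeness, the enter time and the might_have_unfound entries come out
of that same query (field names as they appear in the Luminous pg query JSON):

$ ceph pg 11.3b54 query | jq '.recovery_state[] | {name, enter_time, might_have_unfound}'
# might_have_unfound only shows up in the Started/Primary/Active entry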

Its up_thru is 647122, which is from about 30 minutes ago. That ceph-osd
process, however, has been running since September and seems to be
functioning fine.

This confuses me, as during such an expansion a PG would normally map to
size+1 copies (the old replica stays in the acting set) until the backfill
finishes.
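
A quick way to see the up vs. acting set for a single PG is 'ceph pg map';
the acting set is where I'd have expected the old third replica to stick
around:

$ ceph pg map 11.3b54
# prints roughly: osdmap e<epoch> pg 11.3b54 (11.3b54) -> up [...] acting [...]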

The cluster is running Luminous 12.2.8 on CentOS 7.5.

Any ideas on what this could be?

Wido
Gregory Farnum
2018-11-15 03:37:16 UTC
This is weird. Can you capture the pg query for one of them and narrow down
in which epoch it “lost” the previous replica and see if there’s any
evidence of why?
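
For example, something along these lines should show the interval history and
the epoch where the acting set changed (a sketch; JSON paths as in the
Luminous 'pg query' output):

$ ceph pg 11.3b54 query > pg.11.3b54.json
$ jq '.info.history, [.recovery_state[] | {name, enter_time}]' pg.11.3b54.json
$ ceph osd dump <suspect epoch> | grep '^osd.'   # compare the map before/after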
Post by Wido den Hollander
Hi,
I'm in the middle of expanding a Ceph cluster, and while watching 'ceph -s'
I suddenly saw a bunch of Placement Groups go undersized.
My first thought was that one or more OSDs had failed, but none did.
[... PG listing and details snipped ...]
The cluster is running Luminous 12.2.8 on CentOS 7.5.
Any ideas on what this could be?
Wido
Wido den Hollander
2018-11-15 09:55:12 UTC
Post by Gregory Farnum
This is weird. Can you capture the pg query for one of them and narrow
down in which epoch it “lost” the previous replica and see if there’s
any evidence of why?
So I dug further into the logs and found this on osd.1982:

2018-11-14 15:03:04.261689 7fde7b525700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.1982 down, but it is still running
2018-11-14 15:03:04.261713 7fde7b525700 0 log_channel(cluster) log [DBG] : map e647120 wrongly marked me down at e647120
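
(For the record, that was grepped out of the OSD's own log; the mon side
shows up in the cluster log, assuming the default log locations:)

$ grep 'wrongly marked me down' /var/log/ceph/ceph-osd.1982.log
$ grep 'osd.1982' /var/log/ceph/ceph.log | grep -i fail   # cluster log, on a mon host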

After searching further (Zabbix graphs), it seems that this machine had a
spike in CPU load around that time, which probably caused it to be marked
down.
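
If load spikes keep causing these spurious down-marks, the relevant knobs are
the OSD heartbeat grace and the mon reporter settings; the running values can
be checked over the admin sockets (run on the respective hosts; this assumes
the mon id is the short hostname):

$ ceph daemon osd.1982 config show | grep heartbeat_grace
$ ceph daemon mon.$(hostname -s) config show | grep mon_osd_min_down_reporters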

As OSD 1982 was involved with these PGs, they are now in an
undersized+degraded state.

Recovery didn't start; instead, Ceph chose to wait for the backfill to
happen, as the PG needed to be vacated from this OSD.

The side-effect is that it took 14 hours before these PGs started to
backfill.

I would say that a PG which is undersized+degraded should get the highest
possible priority so it is repaired ASAP.
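
For what it's worth, Luminous already has a manual escape hatch to push
specific PGs to the front of the queue, e.g. for a couple of the PGs above:

$ ceph pg force-recovery 11.3b54 11.362f
$ ceph pg force-backfill 11.421e 11.735

(and 'ceph pg cancel-force-recovery' / 'cancel-force-backfill' to undo it
again)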

Wido