Discussion:
pgs down after adding 260 OSDs & increasing PGs
Jake Grimmett
2018-01-29 12:46:07 UTC
Dear All,

Our Ceph Luminous (12.2.2) cluster has just broken, due to either adding
260 OSD drives in one go, or increasing the PG count from 1024 to
4096 in one go, or a combination of both...

Prior to the upgrade, the cluster consisted of 10 dual v4 Xeon nodes
running SL7.4; each node had 19 BlueStore OSDs (8TB Seagate Ironwolf) &
64GB RAM.

The cluster has just two pools:
1) an 8+2 EC pool with 1024 pg_num/pgp_num on 190 HDD OSDs.
2) a 3x replicated metadata (MDS) pool on NVMe SSDs (one SSD in each of
4 nodes).

The cluster provides 500TB of CephFS used for scratch space, with four
snapshots taken daily and kept for one week only.

Everything was working perfectly, until 26 OSDs were added to each
node, bringing the total HDD OSD count to 450 (all 8TB Ironwolf).

After adding all 260 OSDs with ceph-deploy, ceph health showed:

HEALTH_WARN noout flag(s) set;
732950716/1219068139 objects misplaced (60.124%);
Degraded data redundancy: 1024 pgs unclean;
too few PGs per OSD (23 < min 30)

So far so good: I expected to see the cluster rebalancing, and the
complaint about too few PGs per OSD seemed reasonable.
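For reference, the numbers in that warning can be reproduced with a quick
back-of-envelope calculation (a sketch; counting each EC shard as a PG
instance on its OSD is my assumption about how Ceph arrives at the figure):

```python
# Sketch of where "23 < min 30" comes from: an 8+2 EC pool places
# k+m = 10 shards per PG, so each PG counts against 10 OSDs.
def pg_instances_per_osd(num_pgs, shards_per_pg, num_osds):
    return round(num_pgs * shards_per_pg / num_osds)

# 1024 PGs x 10 shards spread over 450 HDD OSDs:
print(pg_instances_per_osd(1024, 10, 450))   # 23, below the min of 30
# ...and after the split to 4096 PGs:
print(pg_instances_per_osd(4096, 10, 450))   # 91
```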

Without waiting for the cluster to rebalance, I increased pg_num/pgp_num
to 4096. At this point, ceph health showed this:

HEALTH_ERR 135858073/1219068139 objects misplaced (11.144%);
Reduced data availability: 3119 pgs inactive;
Degraded data redundancy: 210609/1219068139 objects degraded (0.017%),
4088 pgs unclean, 1002 pgs degraded,
1002 pgs undersized; 5 stuck requests are blocked > 4096 sec

We then left the cluster to rebalance.

Next morning, two Ceph nodes were down, and I could see lots of
oom-killer messages in the logs.
Each node has only 64GB of RAM for 45 OSDs, which is probably the cause.
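A rough calculation shows why 64GB falls short (the per-OSD figures are
assumptions: roughly a 1 GiB default BlueStore cache per HDD OSD in
Luminous, plus extra per-daemon heap that grows during recovery/backfill):

```python
# Back-of-envelope RAM check for 45 OSDs in a 64 GiB node.
GIB = 1024 ** 3
osds_per_node = 45
default_cache = 1 * GIB        # assumed default bluestore_cache_size_hdd
recovery_overhead = 1.5 * GIB  # rough per-daemon heap under recovery

needed = osds_per_node * (default_cache + recovery_overhead)
print(needed / GIB)  # ~112 GiB wanted vs 64 GiB installed
```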

As a short-term fix, we limited RAM usage by adding this to ceph.conf:
bluestore_cache_size = 104857600
bluestore_cache_kv_max = 67108864

This appears to have stopped the OOM problems, so we waited while the
cluster rebalanced, until it stopped reporting "objects misplaced".
This took a couple of days...

The problem now is that although all of the OSDs are up, lots of PGs
are down, degraded, or unclean, and it is not clear how to fix this.

I have tried issuing osd scrub and pg repair commands, but these do not
appear to do anything.

CephFS will mount, but locks up when it hits a PG that is down.

I have tried sequentially restarting all OSDs on each node, slowly
walking through the cluster several times, but this does not fix things.

Current Status:
# ceph health
HEALTH_ERR Reduced data availability:
3021 pgs inactive, 23 pgs down, 23 pgs stale;
Degraded data redundancy: 3021 pgs unclean, 1879 pgs degraded,
1879 pgs undersized; 1 stuck requests are blocked > 4096 sec

ceph health detail (see http://p.ip.fi/Pwdb ) contains many lines such as:

pg 4.ffe is stuck unclean for 470551.768849, current state
activating+remapped, last acting [156,175,33,169,135,85,165,55,148,178]

pg 4.fff is stuck undersized for 49509.580577, current state
activating+undersized+degraded+remapped, last acting
[44,12,185,125,69,29,119,102,81,2147483647]

(Presumably the OSD number "2147483647" is due to Erasure Coding,
as per
<http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001660.html>
?)
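A quick check supports this reading: the value is the 32-bit sentinel
CRUSH uses for an unmapped position in an acting set (that it is named
CRUSH_ITEM_NONE is my understanding of the source):

```python
# The odd "OSD" id 2147483647 is 2**31 - 1, the sentinel CRUSH uses for
# "no OSD mapped here"; in an EC acting set it marks a shard with no
# OSD assigned, not a real OSD.
NONE_ID = 2**31 - 1

acting = [44, 12, 185, 125, 69, 29, 119, 102, 81, 2147483647]
missing = [i for i, osd in enumerate(acting) if osd == NONE_ID]
print(NONE_ID)    # 2147483647
print(missing)    # [9] -> shard 9 of the 8+2 PG has no OSD
```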

Tailing the stuck osd log with debug osd = 20 shows this:

2018-01-29 11:56:35.204391 7f0dab4fd700 20 osd.46 15482 share_map_peer
0x5647cb336800 already has epoch 15482
2018-01-29 11:56:35.213226 7f0da7537700 10 osd.46 15482
tick_without_osd_lock
2018-01-29 11:56:35.213252 7f0da7537700 20 osd.46 15482
scrub_random_backoff lost coin flip, randomly backing off
2018-01-29 11:56:35.213257 7f0da7537700 10 osd.46 15482
promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0
bytes; target 25 obj/sec or 5120 k bytes/sec
2018-01-29 11:56:35.213263 7f0da7537700 20 osd.46 15482
promote_throttle_recalibrate new_prob 1000
2018-01-29 11:56:35.213266 7f0da7537700 10 osd.46 15482
promote_throttle_recalibrate actual 0, actual/prob ratio 1, adjusted
new_prob 1000, prob 1000 -> 1000
2018-01-29 11:56:35.232884 7f0dab4fd700 20 osd.46 15482 share_map_peer
0x5647cabf3800 already has epoch 15482

Currently this cluster is just storing scratch data, so it could be
wiped; however, we would be more confident about using Ceph more widely
if we can fix errors like this...

thanks for reading, any advice appreciated,

Jake
Nick Fisk
2018-01-29 13:07:01 UTC
Hi Jake,

I suspect you have hit an issue that I and a few others have hit in
Luminous. By increasing the number of PGs before all the data has
re-balanced, you have probably exceeded the hard PG-per-OSD limit.

See this thread
https://www.spinics.net/lists/ceph-users/msg41231.html
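Roughly, the cap works like this (a sketch; that the Luminous defaults
are 200 and 2.0 is my reading of the 12.2.x docs):

```python
# Luminous refuses to activate PGs that would push an OSD past
# mon_max_pg_per_osd * osd_max_pg_per_osd_hard_ratio, so a big PG
# split during a rebalance can leave PGs stuck "activating".
def pg_hard_cap(mon_max_pg_per_osd=200, hard_ratio=2.0):
    return int(mon_max_pg_per_osd * hard_ratio)

print(pg_hard_cap())                 # 400 PG instances per OSD by default
print(pg_hard_cap(hard_ratio=3.0))   # 600 after raising the hard ratio
```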

Nick
-----Original Message-----
Jake Grimmett
Sent: 29 January 2018 12:46
Subject: [ceph-users] pgs down after adding 260 OSDs & increasing PGs
[snip]
_______________________________________________
ceph-users mailing list
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Wido den Hollander
2018-01-29 13:09:28 UTC
Post by Nick Fisk
Hi Jake,
I suspect you have hit an issue that I and a few others have hit in
Luminous. By increasing the number of PGs before all the data has
re-balanced, you have probably exceeded the hard PG-per-OSD limit.
See this thread
https://www.spinics.net/lists/ceph-users/msg41231.html
Indeed, I ran into this issue this morning as well during a FileStore ->
BlueStore migration.

I wrote a quick blog post about it so search engines may find it:
https://blog.widodh.nl/2018/01/placement-groups-with-ceph-luminous-stay-in-activating-state/
Jake Grimmett
2018-01-29 15:21:55 UTC
Hi Nick,

many thanks for the tip, I've set "osd_max_pg_per_osd_hard_ratio = 3"
and restarted the OSDs.

So far it's looking promising: I now have 56% objects misplaced rather
than 3021 pgs inactive. The cluster is now working hard to rebalance.

I will report back after things stabilise...

many, many thanks for the advice,

Jake

NB: For reference,

**********************************************
Soon after "osd_max_pg_per_osd_hard_ratio = 3":
**********************************************
health: HEALTH_ERR
691832326/1218771235 objects misplaced (56.765%)
Reduced data availability: 459 pgs inactive
Degraded data redundancy: 56110116/1218771235 objects
degraded (4.604%), 3021 pgs unclean, 2245 pgs degraded,
2245 pgs undersized
1 stuck requests are blocked > 4096 sec

services:
mon: 3 daemons, quorum ceph2,ceph1,ceph3
mgr: ceph1(active), standbys: ceph3, ceph2
mds: cephfs-1/1/1 up {0=ceph3=up:active}, 2 up:standby
osd: 454 osds: 454 up, 454 in; 3020 remapped pgs

data:
pools: 2 pools, 4128 pgs
objects: 116M objects, 411 TB
usage: 526 TB used, 2750 TB / 3277 TB avail
pgs: 0.024% pgs unknown
11.095% pgs not active
56110116/1218771235 objects degraded (4.604%)
691832326/1218771235 objects misplaced (56.765%)
1714 active+undersized+degraded+remapped+backfill_wait
1107 active+clean
492 active+remapped+backfill_wait
355 active+undersized+degraded+remapped+backfilling
282 activating+remapped
176 activating+undersized+degraded+remapped
1 unknown
1 active+remapped+backfilling

io:
recovery: 2404 MB/s, 683 objects/s

**********************************************
Before "osd_max_pg_per_osd_hard_ratio = 3":
**********************************************
health: HEALTH_ERR
Reduced data availability: 3021 pgs inactive,
10 pgs down, 10 pgs stale
Degraded data redundancy: 3021 pgs unclean,
2029 pgs degraded, 2029 pgs undersized
1 stuck requests are blocked > 4096 sec

services:
mon: 3 daemons, quorum ceph2,ceph1,ceph3
mgr: ceph1(active), standbys: ceph3, ceph2
mds: cephfs-1/1/1 up {0=ceph3=up:active}, 2 up:standby
osd: 454 osds: 454 up, 454 in; 2717 remapped pgs

data:
pools: 2 pools, 4128 pgs
objects: 108M objects, 380 TB
usage: 524 TB used, 2753 TB / 3277 TB avail
pgs: 7.122% pgs unknown
66.061% pgs not active
2029 activating+undersized+degraded+remapped
1107 active+clean
688 activating+remapped
294 unknown
10 stale+down
Wido den Hollander
2018-01-29 15:24:21 UTC
Post by Jake Grimmett
Hi Nick,
many thanks for the tip, I've set "osd_max_pg_per_osd_hard_ratio = 3"
and restarted the OSDs.
So far it's looking promising, I now have 56% objects misplaced rather
than 3021 pgs inactive. The cluster is now working hard to rebalance.
PGs shouldn't stay in the 'activating' state that long. You might want
to restart a few OSDs.

Although a restart isn't required, just marking them as down works:

$ ceph osd down X

You only need to mark down the OSDs that are primary for a PG stuck
in that state.
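To find those primaries, something like this against the output of
"ceph pg dump -f json" could work (a sketch; the field names are my
reading of the Luminous pg dump format, so verify against your cluster):

```python
import json

# List the primary OSD of every PG stuck in an "activating" state, so
# that "ceph osd down <id>" can be issued per primary.
def stuck_primaries(pg_dump_json):
    dump = json.loads(pg_dump_json)
    return {pg["acting_primary"]
            for pg in dump["pg_stats"]
            if "activating" in pg["state"]}

# Fabricated example input, for illustration only:
sample = json.dumps({"pg_stats": [
    {"pgid": "4.ffe", "state": "activating+remapped", "acting_primary": 156},
    {"pgid": "4.fff", "state": "activating+undersized+degraded+remapped",
     "acting_primary": 44},
    {"pgid": "4.001", "state": "active+clean", "acting_primary": 7},
]})
print(sorted(stuck_primaries(sample)))  # [44, 156]
```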

Wido
Jake Grimmett
2018-02-05 17:25:15 UTC
Dear Nick & Wido,

Many thanks for your helpful advice; our cluster has returned to HEALTH_OK

One caveat is that a small number of pgs remained at "activating".

By increasing mon_max_pg_per_osd from 500 to 1000, these few pgs
activated, allowing the cluster to rebalance fully.

i.e. this was needed
mon_max_pg_per_osd = 1000

Once the cluster returned to HEALTH_OK, the mon_max_pg_per_osd setting
was removed.
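For anyone else hitting this, the temporary override looked like the
following in ceph.conf (the [global] placement is an assumption; remember
to remove it once HEALTH_OK returns):

```ini
[global]
# temporary: allow more PG instances per OSD while rebalancing
mon_max_pg_per_osd = 1000
```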

again, many thanks...

Jake
Post by Nick Fisk
Hi Jake,
I suspect you have hit an issue that I and a few others have hit in
Luminous. By increasing the number of PGs before all the data has
re-balanced, you have probably exceeded the hard PG-per-OSD limit.
See this thread
https://www.spinics.net/lists/ceph-users/msg41231.html
Nick