Discussion:
[ceph-users] No recovery when "norebalance" flag set
Stefan Kooman
2018-11-25 19:41:47 UTC
Hi list,

During cluster expansion (adding extra disks to existing hosts) some
OSDs failed (FAILED assert(0 == "unexpected error", _txc_add_transaction
error (39) Directory not empty not handled on operation 21 (op 1,
counting from 0), full details: https://8n1.org/14078/c534). We had
"norebalance", "nobackfill", and "norecover" flags set. After we unset
nobackfill and norecover (to let Ceph fix the degraded PGs), it recovered
all but 12 objects (2 PGs). We queried those PGs and the OSDs that were
supposed to have a copy of them, and those OSDs had already been "probed".
A day later (~24 hours) the degraded objects still had not been
recovered. Only after we also unset the "norebalance" flag did it start
rebalancing, backfilling and recovering, and the 12 degraded objects were
recovered.
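
For completeness, the commands involved were along these lines (<pgid>
stands in for one of the stuck PGs):

  # flags set for the expansion
  ceph osd set norebalance
  ceph osd set nobackfill
  ceph osd set norecover

  # unset afterwards to let Ceph repair the degraded PGs
  ceph osd unset nobackfill
  ceph osd unset norecover

  # query a stuck PG to see which OSDs have been probed
  ceph pg <pgid> query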

Is this expected behaviour? I would expect Ceph to always try to fix
degraded things first and foremost. Even "pg force-recover" and "pg
force-backfill" could not force recovery.
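
(Those are the luminous force commands, i.e. roughly:

  ceph pg force-recovery <pgid>
  ceph pg force-backfill <pgid>

Neither changed anything for these two PGs.)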

Gr. Stefan
--
| BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / ***@bit.nl
Dan van der Ster
2018-11-26 11:20:38 UTC
Haven't seen that exact issue.

One thing to note though is that if osd_max_backfills is set to 1,
then it can happen that PGs get into backfill state, taking that
single reservation on a given OSD, and therefore the recovery_wait PGs
can't get a slot.
I suppose that backfill prioritization is supposed to prevent this,
but in my experience luminous v12.2.8 doesn't always get it right.
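
You can usually spot that situation by listing the PG states, something
like:

  # PGs stuck waiting for a recovery reservation
  ceph pg dump pgs_brief | grep recovery_wait

  # PGs currently holding a backfill reservation
  ceph pg dump pgs_brief | grep backfilling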

So next time I'd try injecting osd_max_backfills = 2 or 3 to kickstart
the recovering PGs.
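
Something like this, injected at runtime and dialled back afterwards:

  # temporarily allow an extra backfill reservation per OSD
  ceph tell osd.* injectargs '--osd_max_backfills 2'

  # revert once the degraded PGs have recovered
  ceph tell osd.* injectargs '--osd_max_backfills 1'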

-- dan
Stefan Kooman
2018-11-26 11:32:07 UTC
Post by Dan van der Ster
Haven't seen that exact issue.
One thing to note though is that if osd_max_backfills is set to 1,
then it can happen that PGs get into backfill state, taking that
single reservation on a given OSD, and therefore the recovery_wait PGs
can't get a slot.
I suppose that backfill prioritization is supposed to prevent this,
but in my experience luminous v12.2.8 doesn't always get it right.
That's also our experience. Even when the degraded PGs in backfill /
recovery state are given a higher (forced) priority, normal backfilling
still takes precedence.
Post by Dan van der Ster
So next time I'd try injecting osd_max_backfills = 2 or 3 to kickstart
the recovering PGs.
It was still on "1" indeed. We tend to crank that (and max recovery)
while keeping an eye on max read and write apply latency. In our setup we
can do 16 backfills concurrently, or 2 recoveries / 4 backfills. Recovery
speeds are ~4-5 GB/s; pushing it beyond that tends to crash OSDs.
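
In practice that means something along these lines (the numbers are just
examples), with "ceph osd perf" running alongside to watch the per-OSD
latencies:

  # raise the backfill / recovery limits on all OSDs at runtime
  ceph tell osd.* injectargs '--osd_max_backfills 4 --osd_recovery_max_active 2'

  # per-OSD commit/apply latency, to see when we are pushing too hard
  ceph osd perf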

We'll try your suggestion next time.

Thanks,

Stefan
--
| BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / ***@bit.nl
Gregory Farnum
2018-11-26 13:14:32 UTC
Post by Stefan Kooman
Is this expected behaviour? I would expect Ceph to always try to fix
degraded things first and foremost. Even "pg force-recover" and "pg
force-backfill" could not force recovery.
I haven't dug into how the norebalance flag works, but I think this is
expected — it presumably prevents OSDs from creating new copies of PGs,
which is what needed to happen here.
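
For the archives: the currently set flags show up in the osdmap, e.g.:

  # the flags line lists norebalance / nobackfill / norecover etc.
  ceph osd dump | grep flags

  # clear the flag once the cluster is back to normal
  ceph osd unset norebalance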
-Greg