Discussion:
[ceph-users] How do you handle failing/slow disks?
Arvydas Opulskis
2018-11-21 15:21:37 UTC
Hi all,

it's not the first time we've had this kind of problem, usually with HP RAID
controllers:

1. One disk starts failing and drags the whole controller into a slow state,
where its performance degrades dramatically
2. Some OSDs are reported as down by other OSDs and get marked down
3. At the same time, other OSDs on the same node are not detected as failed
and keep participating in the cluster. I think this is because the OSD is not
aware of the backend disk problems and still answers health checks
4. Because of this, requests to PGs on the problematic node become "slow" and
later "stuck"
5. The cluster struggles and client operations are not served, so the cluster
ends up in a kind of "locked" state
6. We have to mark the affected OSDs down manually (or stop the problematic
daemons) before the cluster starts to recover and process requests again

Is there any mechanism in Ceph that monitors which OSDs are implicated in slow
requests and marks them down after some kind of threshold?
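
For illustration, something like the external watchdog below is what I have in
mind. It is only a rough sketch: the way it greps OSD ids out of
"ceph health detail --format json", the JSON layout it expects, and the
thresholds are assumptions, not something we actually run:

#!/usr/bin/env python3
# Rough sketch of an external "mark slow OSDs down" watchdog.
# Assumptions: slow-request health messages contain "osd.N" and can be
# read from `ceph health detail --format json`; the 5-strike threshold
# and the 30 s poll interval are arbitrary.
import collections
import json
import re
import subprocess
import time

STRIKES_BEFORE_DOWN = 5   # consecutive polls an OSD must stay slow
POLL_SECONDS = 30

def implicated_osds():
    """Return the set of OSD ids mentioned in current health details."""
    out = subprocess.check_output(
        ["ceph", "health", "detail", "--format", "json"])
    health = json.loads(out)
    osds = set()
    for check in health.get("checks", {}).values():
        for item in check.get("detail", []):
            osds.update(int(n) for n in re.findall(r"osd\.(\d+)",
                                                   item.get("message", "")))
    return osds

def main():
    strikes = collections.Counter()
    while True:
        slow = implicated_osds()
        for osd in list(strikes):
            if osd not in slow:
                del strikes[osd]          # recovered, reset its counter
        for osd in slow:
            strikes[osd] += 1
            if strikes[osd] >= STRIKES_BEFORE_DOWN:
                # Mark the OSD down so peering moves on without it.
                subprocess.call(["ceph", "osd", "down", str(osd)])
                del strikes[osd]
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    main()

Of course something built into Ceph would be safer than a script like this,
which is why I'm asking.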

Thanks,
Arvydas
Paul Emmerich
2018-11-21 16:26:20 UTC
Yeah, we have also observed problems with HP RAID controllers misbehaving
when a single disk starts to fail. We would not recommend building a
Ceph cluster on HP RAID controllers until they fix that issue.

There are several features in Ceph that detect dead disks: there are
timeouts for OSDs checking each other, and there's a timeout for OSDs
checking in with the mons. But that's usually not enough in this
scenario. The good news is that recent Ceph versions will show which
OSDs are implicated in slow requests (check "ceph health detail"),
which at least gives you a way to figure out which OSDs are becoming
slow.

We have found it to be useful to monitor the op_*_latency values of
all OSDs (especially subop latencies) from the admin daemon to detect
such failures earlier.
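
A minimal sketch of that kind of polling, assuming the OSD admin sockets on
the node are reachable via "ceph daemon osd.<id> perf dump" and that the
latency counters are exposed as avgcount/sum pairs; the exact counter names
and layout vary between Ceph releases, and the 100 ms threshold is just an
example:

#!/usr/bin/env python3
# Sketch: read op/subop write latencies from the local OSD admin sockets.
# Assumptions: `ceph daemon osd.<id> perf dump` works on this node, the
# sockets live under /var/run/ceph/, and the counters are exposed as
# {"avgcount": N, "sum": seconds}; names/layout differ across releases.
import glob
import json
import re
import subprocess

WATCHED = ["op_w_latency", "op_r_latency", "subop_w_latency"]
WARN_MS = 100  # arbitrary example threshold

def avg_ms(counter):
    n = counter.get("avgcount", 0)
    return (counter.get("sum", 0.0) / n) * 1000.0 if n else 0.0

def local_osd_ids():
    ids = []
    for path in glob.glob("/var/run/ceph/ceph-osd.*.asok"):
        m = re.search(r"ceph-osd\.(\d+)\.asok$", path)
        if m:
            ids.append(int(m.group(1)))
    return sorted(ids)

for osd in local_osd_ids():
    dump = json.loads(subprocess.check_output(
        ["ceph", "daemon", "osd.%d" % osd, "perf", "dump"]))
    counters = dump.get("osd", {})
    for name in WATCHED:
        ms = avg_ms(counters.get(name, {}))
        flag = "  <-- slow?" if ms > WARN_MS else ""
        print("osd.%d %-16s %7.1f ms%s" % (osd, name, ms, flag))

Note that sum/avgcount is the average since the daemon started, so in practice
you want to sample periodically and look at the deltas to catch short bursts.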


Paul
--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

Alex Litvak
2018-11-22 21:13:19 UTC
Sorry for hijacking the thread, but do you have an idea of what to watch for?

I monitor the admin sockets of the OSDs and occasionally see a burst of both op_w_process_latency and op_w_latency up to 150 - 200 ms on 7200 RPM enterprise SAS drives.
For example, the load average on the node jumps up while the CPU is 97% idle, and out of 12 OSDs a few show an op_w_latency of 170 - 180 ms, 3 more are around 120 - 130 ms, and the rest
are at 100 ms or below. Does that say anything about a possible drive failure (the drives sit in a Dell PowerVault MD3400 and the storage unit shows them all as green/OK)? Unfortunately, smartmon
from outside the box tells me nothing other than that the health is OK.

The high load usually corresponds to moments when the elevated op_w_latency affects multiple OSDs (4 or more) at the same time.
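
To make the comparison concrete, here is a toy sketch of the check I do by
eye: flag an OSD whose write latency stands well above the node median. The
sample numbers and the 1.5x factor are made up for illustration:

# Sketch of the outlier check described above: compare each OSD's
# op_w_latency on a node against the node median.
from statistics import median

# latencies in ms per OSD on one node (e.g. collected via perf dump)
op_w_latency_ms = {
    "osd.0": 175.0, "osd.1": 92.0, "osd.2": 88.0, "osd.3": 125.0,
    "osd.4": 95.0,  "osd.5": 101.0, "osd.6": 180.0, "osd.7": 90.0,
    "osd.8": 99.0,  "osd.9": 130.0, "osd.10": 85.0, "osd.11": 97.0,
}

node_median = median(op_w_latency_ms.values())
FACTOR = 1.5  # arbitrary: "well above the rest of the node"

outliers = {osd: ms for osd, ms in op_w_latency_ms.items()
            if ms > FACTOR * node_median}

print("node median: %.0f ms" % node_median)
for osd, ms in sorted(outliers.items(), key=lambda kv: -kv[1]):
    print("%s looks suspicious: %.0f ms" % (osd, ms))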