Arvydas Opulskis
2018-11-21 15:21:37 UTC
Hi all,
This is not the first time we have had this kind of problem, usually with HP
RAID controllers:
1. One disk starts failing and drags the whole controller into a degraded
state where its performance drops dramatically.
2. Some OSDs are reported as failed by other OSDs and get marked down.
3. At the same time, other OSDs on the same node are not detected as failed
and keep participating in the cluster. I think this is because an OSD is not
aware of backend disk problems and still answers heartbeat/health checks.
4. Because of this, requests to PGs hosted on the problematic node become
"slow" and later "stuck".
5. The cluster struggles and client operations are not served, so the cluster
ends up in a kind of "locked" state.
6. We have to mark those OSDs down manually (or stop the problematic daemons)
before the cluster starts to recover and process requests again; a sketch of
what we do is below this list.
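
For reference, the manual intervention in step 6 boils down to something like
this sketch (Python, assuming systemd-managed OSD daemons; the OSD ids are
hypothetical examples read off "ceph health detail"):

import subprocess

# Hypothetical ids of the OSDs sitting on the problematic node.
for osd_id in (12, 13):
    # Mark the OSD down so primaries move away and requests stop piling up on it.
    subprocess.run(["ceph", "osd", "down", str(osd_id)], check=True)
    # Stop the daemon so it cannot immediately re-assert itself as up.
    subprocess.run(["systemctl", "stop", f"ceph-osd@{osd_id}"], check=True)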
Is there any mechanism in Ceph that monitors OSDs with slow requests and
marks them down after some kind of threshold?
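
If nothing built-in exists, we are thinking about an external watchdog along
the lines of the rough sketch below. It is only a sketch: it assumes
"ceph health detail -f json" lists slow/blocked requests per OSD in its
detail messages under check names like REQUEST_SLOW or SLOW_OPS (the exact
keys and message format may differ between releases), and the threshold and
poll interval are arbitrary example values.

import json
import re
import subprocess
import time
from collections import Counter

POLL_INTERVAL = 30       # seconds between health polls (example value)
STRIKE_THRESHOLD = 5     # consecutive slow polls before an OSD is marked down

def slow_osds():
    """Return the set of OSD ids mentioned in slow-request health details."""
    out = subprocess.run(["ceph", "health", "detail", "-f", "json"],
                         capture_output=True, text=True, check=True).stdout
    health = json.loads(out)
    osds = set()
    for name, check in health.get("checks", {}).items():
        if name not in ("REQUEST_SLOW", "SLOW_OPS"):
            continue
        for item in check.get("detail", []):
            # Detail messages are assumed to look like
            # "osd.12 has blocked requests > 32.768 sec".
            for osd in re.findall(r"osd\.(\d+)", item.get("message", "")):
                osds.add(int(osd))
    return osds

def main():
    strikes = Counter()
    while True:
        slow = slow_osds()
        for osd in list(strikes):
            if osd not in slow:
                del strikes[osd]   # recovered, forget its strikes
        for osd in slow:
            strikes[osd] += 1
            if strikes[osd] >= STRIKE_THRESHOLD:
                # Mark the OSD down so I/O is redirected away from it; an
                # operator still has to stop the daemon or replace the disk.
                subprocess.run(["ceph", "osd", "down", str(osd)], check=True)
                del strikes[osd]
        time.sleep(POLL_INTERVAL)

if __name__ == "__main__":
    main()

Since a running OSD will usually re-assert itself as up after being marked
down, a real version would probably also have to stop the daemon or set the
noup flag, which is why we would rather see this handled inside Ceph.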
Thanks,
Arvydas