HEALTH_ERR resulted from a bad sector
K.C. Wong
2018-02-08 00:35:29 UTC
Our Ceph cluster entered the HEALTH_ERR state last week.
We're running Infernalis and this was the first time I'd seen
it in that state; even when OSD instances dropped off, we'd
only ever seen HEALTH_WARN. The output of `ceph status` looked
like this:

[***@r01u02-b ~]# ceph status
cluster ed62b3b9-be4a-4ce2-8cd3-34854aa8d6c2
1 pgs inconsistent
1 scrub errors
monmap e1: 3 mons at {r01u01-a=,r01u02-b=,r01u03-c=}
election epoch 900, quorum 0,1,2 r01u01-a,r01u02-b,r01u03-c
mdsmap e744: 1/1/1 up {0=r01u01-a=up:active}, 2 up:standby
osdmap e533858: 48 osds: 48 up, 48 in
flags sortbitwise
pgmap v47571404: 3456 pgs, 14 pools, 16470 GB data, 18207 kobjects
33056 GB used, 56324 GB / 89381 GB avail
3444 active+clean
8 active+clean+scrubbing+deep
3 active+clean+scrubbing
1 active+clean+inconsistent
client io 1535 kB/s wr, 23 op/s

I tracked down the inconsistent PG and found that one of the
pair of OSDs serving it had kernel log messages like these:

[1773723.509386] sd 5:0:0:0: [sdb] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1773723.509390] sd 5:0:0:0: [sdb] Sense Key : Medium Error [current] [descriptor]
[1773723.509394] sd 5:0:0:0: [sdb] Add. Sense: Unrecovered read error - auto reallocate failed
[1773723.509398] sd 5:0:0:0: [sdb] CDB: Read(10) 28 00 01 4c 1b a0 00 00 08 00
[1773723.509401] blk_update_request: I/O error, dev sdb, sector 21765025
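In case it helps anyone hitting the same thing, the commands I used to
locate the PG and its OSDs were along these lines (the PG id `2.1a` and
OSD id `12` below are placeholders, not values from our cluster):

```shell
# List health problems in detail; inconsistent PGs show up here
# with their PG ids, e.g. "pg 2.1a is active+clean+inconsistent".
ceph health detail

# Show the up/acting OSD sets for a given PG (placeholder PG id):
ceph pg map 2.1a

# Find which host a suspect OSD lives on, then check that host's
# kernel log for medium errors like the ones above:
ceph osd find 12
```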

Replacing the disk on that OSD server eventually fixed the
problem, but it took a long time to get out of the error state:

[***@r01u01-a ~]# ceph status
cluster ed62b3b9-be4a-4ce2-8cd3-34854aa8d6c2
61 pgs backfill
2 pgs backfilling
1 pgs inconsistent
1 pgs repair
63 pgs stuck unclean
recovery 5/37908099 objects degraded (0.000%)
recovery 1244055/37908099 objects misplaced (3.282%)
1 scrub errors
monmap e1: 3 mons at {r01u01-a=,r01u02-b=,r01u03-c=}
election epoch 920, quorum 0,1,2 r01u01-a,r01u02-b,r01u03-c
mdsmap e759: 1/1/1 up {0=r01u02-b=up:active}, 2 up:standby
osdmap e534536: 48 osds: 48 up, 48 in; 63 remapped pgs
flags sortbitwise
pgmap v47590337: 3456 pgs, 14 pools, 16466 GB data, 18205 kobjects
33085 GB used, 56295 GB / 89381 GB avail
5/37908099 objects degraded (0.000%)
1244055/37908099 objects misplaced (3.282%)
3385 active+clean
61 active+remapped+wait_backfill
6 active+clean+scrubbing+deep
2 active+remapped+backfilling
1 active+clean+scrubbing+deep+inconsistent+repair
1 active+clean+scrubbing
client io 2720 kB/s wr, 16 op/s
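For anyone in the same spot: the sequence we followed was roughly the
standard repair-then-replace procedure. A sketch, with placeholder ids
(`2.1a` for the PG, `12` for the OSD):

```shell
# Kick off a repair of the inconsistent PG; this is what puts the
# PG into the ...+inconsistent+repair state shown above.
ceph pg repair 2.1a

# Standard disk-replacement sequence for the failing OSD:
ceph osd out 12               # start draining data off the OSD
systemctl stop ceph-osd@12    # stop the daemon on the OSD host
ceph osd crush remove osd.12  # remove it from the CRUSH map
ceph auth del osd.12          # delete its cephx key
ceph osd rm 12                # remove it from the OSD map
# ...swap the physical disk, then re-create the OSD, e.g. with the
# Infernalis-era tooling (later releases moved to ceph-volume):
ceph-disk prepare /dev/sdb
```

Most of the wall-clock time went to the backfill/remap traffic you can
see in the second `ceph status` output, not to the commands themselves.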

Here’s what I’m curious about:

* How did a bad sector result in more damage to the Ceph
cluster than a few downed OSD servers?
* Is this issue addressed in later releases? I’m in the
middle of setting up a Jewel instance.
* What can be done to avoid the `HEALTH_ERR` state in similar
failure scenarios? Increasing the default pool size from 2
to 3?
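On that last point, the per-pool replication settings can be inspected
and raised like this (`rbd` is just an example pool name):

```shell
# Check the current replication factor and write floor of a pool:
ceph osd pool get rbd size
ceph osd pool get rbd min_size

# Raise replication to 3 copies; Ceph backfills the extra replicas
# in the background, so expect a period of remapped PGs:
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2
```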

Many thanks for any input/insight you may have.


K.C. Wong
M: +1 (408) 769-8235

4096R/B8995EDE E527 CBE8 023E 79EA 8BBB 5C77 23A6 92E9 B899 5EDE