Discussion: [ceph-users] pg inconsistent, scrub stat mismatch on bytes
Adrian
2018-06-07 00:57:36 UTC
Update to this.

The affected pg didn't seem to have any inconsistent objects:

[***@admin-ceph1-qh2 ~]# ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 6.20 is active+clean+inconsistent, acting [114,26,44]
[***@admin-ceph1-qh2 ~]# rados list-inconsistent-obj 6.20 --format=json-pretty
{
    "epoch": 210034,
    "inconsistents": []
}
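
As an aside, rados can also enumerate which pgs in a given pool are flagged
inconsistent - substitute the name of whichever pool pg 6.20 belongs to:

rados list-inconsistent-pg <pool-name>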

Although pg query showed that the primary's info.stats.stat_sum.num_bytes
differed from the peers'.
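
For anyone wanting to check the same counters, something like the following
pulls them out of pg query for the primary and each peer (a sketch only - the
jq paths match the JSON our 12.2.5 cluster emits, so they may need adjusting
on other versions):

ceph pg 6.20 query | jq '.info.stats.stat_sum.num_bytes'
ceph pg 6.20 query | jq '.peer_info[] | {peer: .peer, num_bytes: .stats.stat_sum.num_bytes}'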

A pg repair on 6.20 seems to have resolved the issue for now, but the
info.stats.stat_sum.num_bytes still differs, so presumably it will become
inconsistent again the next time it scrubs.
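
The repair itself was just the standard command; forcing another deep-scrub
afterwards should show whether the mismatch comes back without waiting for
the schedule:

ceph pg repair 6.20
# after the repair completes and health returns to OK, force another
# deep-scrub to see whether the stat mismatch reappears
ceph pg deep-scrub 6.20
ceph health detail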

Adrian.
Hi Cephers,
We recently upgraded one of our clusters from hammer to jewel and then to
luminous (12.2.5, 5 mons/mgr, 21 storage nodes * 9 OSDs). After some
deep-scrubs we have an inconsistent pg with a log message we've not seen before:
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 6.20 is active+clean+inconsistent, acting [114,26,44]
The ceph log shows:
2018-06-03 06:53:35.467791 osd.114 osd.114 172.26.28.25:6825/40819 395 : cluster [ERR] 6.20 scrub stat mismatch, got 6526/6526 objects, 87/87 clones, 6526/6526 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 25952454144/25952462336 bytes, 0/0 hit_set_archive bytes.
2018-06-03 06:53:35.467799 osd.114 osd.114 172.26.28.25:6825/40819 396 : cluster [ERR] 6.20 scrub 1 errors
2018-06-03 06:53:40.701632 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0 41298 : cluster [ERR] Health check failed: 1 scrub errors (OSD_SCRUB_ERRORS)
2018-06-03 06:53:40.701668 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0 41299 : cluster [ERR] Health check failed: Possible data damage: 1 pg inconsistent (PG_DAMAGED)
2018-06-03 07:00:00.000137 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0 41345 : cluster [ERR] overall HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
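Those lines are from the cluster log - assuming the default log location on a
mon host, they can be dug out with:

grep 'scrub stat mismatch' /var/log/ceph/ceph.log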
There are no EC pools - it looks like it may be the same as
https://tracker.ceph.com/issues/22656, although as in comment #7 this is not a
cache pool.
Wondering if it's ok to issue a pg repair on 6.20, or if there's
something else we should be looking at first?
Thanks in advance,
Adrian.
---
Adrian : ***@gmail.com
If violence doesn't solve your problem, you're not using enough of it.
David Turner
2018-06-20 17:56:36 UTC
As part of the repair operation, it runs a deep-scrub on the PG. If the PG
showed active+clean after the repair and deep-scrub finished, then the next
scrub of the PG shouldn't change its status at all.
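
If you want to confirm before the next scheduled scrub, you can trigger one
manually and re-check afterwards (assuming the jq paths below match your
query output):

ceph pg deep-scrub 6.20
# wait for the deep-scrub timestamp to advance, then re-check state and health
ceph pg 6.20 query | jq '.info.stats.last_deep_scrub_stamp, .state'
ceph health detail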
Post by Adrian
Update to this.
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 6.20 is active+clean+inconsistent, acting [114,26,44]
--format=json-pretty
{
"epoch": 210034,
"inconsistents": []
}
Although pg query showed the primary info.stats.stat_sum.num_bytes
differed from the peers
A pg repair on 6.20 seems to have resolved the issue for now but the
info.stats.stat_sum.num_bytes still differs so presumably will become
inconsistent again next time it scrubs.
Adrian.
_______________________________________________
ceph-users mailing list
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com