Discussion:
[ceph-users] Scrub Error / How does ceph pg repair work?
Christian Eichelmann
2015-05-11 08:09:32 UTC
Hi all!

We are experiencing approximately one scrub error / inconsistent pg every
two days. As far as I know, to fix this you can issue a "ceph pg
repair", which works fine for us (example commands below). I have a few
questions regarding the behavior of the ceph cluster in such a case:

1. After ceph detects the scrub error, the pg is marked as inconsistent.
Does that mean that any IO to this pg is blocked until it is repaired?

2. Is this amount of scrub errors normal? We currently have only 150TB
in our cluster, distributed over 720 2TB disks.

3. As far as I know, a "ceph pg repair" just copies the content of the
primary pg to all replicas. Is this still the case? What if the primary
copy is the one having errors? We have a 4x replication level and it
would be cool if ceph used, as the recovery source, a replica whose
checksum matches the majority of the replicas.

4. Some of these errors happen at night. Since ceph reports this
as a critical error, our on-call shift gets woken up just to issue a
single command. Do you see any problem with triggering this command
automatically via a monitoring event? Is there a reason why ceph doesn't
resolve these errors itself when it has enough replicas to do so?
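
For reference, this is all we currently run when such an error shows up
(the PG id below is just an example):

    ceph health detail | grep inconsistent
    ceph pg repair 2.1f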

Regards,
Christian
Chris Hoy Poy
2015-05-11 09:03:38 UTC
Hi Christian

In my experience, inconsistent PGs are almost always related back to a bad drive somewhere. They are going to keep happening, and with that many drives you still need to be diligent/aggressive in dropping bad drives and replacing them.

If a drive returns an incorrect read, it can't be trusted from that point. Deep scrubs just serve to churn your bits and make sure you catch these errors early on.
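
If you want a quick way to confirm the bad-drive theory after each scrub error, something along these lines usually points at the culprit (the device name is a placeholder, and this assumes smartmontools is installed on the OSD hosts):

    # kernel-level read errors around the time of the scrub
    grep -i -e 'medium error' -e 'I/O error' /var/log/kern.log

    # SMART health plus reallocated/pending/uncorrectable sector counters
    smartctl -a /dev/sdX | grep -i -E 'result|realloc|pending|uncorrect'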

/Chris


Robert LeBlanc
2015-05-11 16:19:08 UTC
Christian Balzer
2015-05-12 06:20:18 UTC
Hello,

I can only nod emphatically to what Robert said: don't issue repairs
unless you
a) don't care about the data or
b) have verified that your primary OSD is good.

See this for some details on how to establish which replica(s) are
actually good:
http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/
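
In a nutshell, and assuming the FileStore layout that article uses (the
PG id, object name and paths below are only placeholders):

    # which PG is inconsistent and which OSDs hold it
    ceph health detail | grep inconsistent
    ceph pg map 2.1f

    # the offending object is usually named in the OSD logs
    grep ERR /var/log/ceph/ceph-osd.*.log

    # on each OSD host, checksum that object's file and compare
    md5sum /var/lib/ceph/osd/ceph-*/current/2.1f_head/<object file>

    # only once you know the primary holds a good copy
    ceph pg repair 2.1f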

Of course if you somehow wind up with more subtle data corruption and are
faced with 3 slightly differing data sets, you may have to resort to
rolling a die after all.

A word from the devs about the state of checksums and automatic repairs we
can trust would be appreciated.

Christian
Post by Robert LeBlanc
Personally I would not just run this command automatically because as you
stated, it only copies the primary PGs to the replicas and if the primary
is corrupt, you will corrupt your secondaries. I think the monitor log
shows which OSD has the problem so if it is not your primary, then just
issue the repair command.
There was talk, and I believe work towards, Ceph storing a hash of the
object so that it can be smarter about which replica has the correct data
and automatically replicate the good data no matter where it is. I think
the first part, creating the hash and storing it, has been included in
Hammer. I'm not an authority on this so take it with a grain of salt.
Right now our procedure is to find the PG files on the OSDs, perform an
MD5 on all of them, and overwrite the one that doesn't match, either by
issuing the PG repair command or by removing the bad PG files, rsyncing
them over with the -X argument and then instructing a deep-scrub on the
PG to clear it up in Ceph.
I've only tested this on an idle cluster, so I don't know how well it
will work on an active cluster. Since we issue a deep-scrub, if the PGs
of the replicas change during the rsync, it should come up with an
error. The idea is to keep rsyncing until the deep-scrub is clean. Be
warned that you may be aiming your gun at your foot with this!
----------------
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Christian Eichelmann
2015-05-12 07:00:43 UTC
Hi Christian, Hi Robert,

thank you for your replies!
I was already expecting something like this. But I am seriously worried
about that!

Just assume that this is happening at night. Our on-call shift does not
necessarily have enough knowledge to perform all the steps in Sebastien's
article. And if we always have to do that when a scrub error appears, we
are putting several hours per week into fixing such problems.

It is also very misleading that a command called "ceph pg repair" might
do quite the opposite and overwrite the "good" data in your cluster with
corrupt data. I don't know much about the internals of ceph, but if the
cluster can already recognize that checksums are not the same, why can't
it just build a quorum from the existing replicas if possible?

And again the question:
Are these placement groups (scrub error, inconsistent) blocking
read/write requests? Because if yes, we have a serious problem here...

Regards,
Christian
--
Christian Eichelmann
Systemadministrator

1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
Brauerstraße 48 · DE-76135 Karlsruhe
Phone: +49 721 91374-8026
***@1und1.de

Registered at Amtsgericht Montabaur / HRB 6484
Executive Board: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen
Chairman of the Supervisory Board: Michael Scheeren
Anthony D'Atri
2015-05-11 23:07:30 UTC
Agree that 99+% of the inconsistent PG's I see correlate directly to disk flern.

Check /var/log/kern.log*, /var/log/messages*, etc. and I'll bet you find errors correlating.

-- Anthony
Dan van der Ster
2015-05-12 07:07:40 UTC
Post by Anthony D'Atri
Agree that 99+% of the inconsistent PG's I see correlate directly to disk flern.
Check /var/log/kern.log*, /var/log/messages*, etc. and I'll bet you find errors correlating.
More to this... In the case that an inconsistent PG is caused by a
failed disk read, you don't need to run ceph pg repair at all.
Instead, since your drive is bad, stop the osd process, mark that osd
out. After backfilling has completed and the PG is re-scrubbed, you
will find it is consistent again.
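
In command form that is roughly the following (osd id and PG id are
placeholders, and the exact stop command depends on your init system):

    # stop the OSD on the failing disk and take it out
    service ceph stop osd.12   # or: stop ceph-osd id=12 / systemctl stop ceph-osd@12
    ceph osd out 12

    # watch backfilling finish, then re-scrub the formerly inconsistent PG
    ceph -w
    ceph pg deep-scrub 2.1f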

Cheers, Dan
Anthony D'Atri
2015-05-12 15:20:10 UTC
For me that's true about 1/3 the time, but often I do still have to repair the PG after removing the affected OSD. YMMV.
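
That is, if the deep-scrub after backfilling still flags the PG, I end up
running (PG id again just an example):

    ceph pg deep-scrub 2.1f
    # and only if it comes back inconsistent with the bad OSD already gone:
    ceph pg repair 2.1f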