Discussion:
[ceph-users] How safe is ceph pg repair these days?
Tracy Reed
2017-02-18 01:02:56 UTC
I have a 3 replica cluster. A couple times I have run into inconsistent
PGs. I googled it, and the Ceph docs and various blogs say to run a repair
first. But a couple of people on IRC and a mailing list thread from 2015
say that ceph blindly copies the primary over the secondaries and calls
it good.

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001370.html

I sure hope that isn't the case. If so it would seem highly
irresponsible to implement such a naive command called "repair". I have
recently learned how to properly analyze the OSD logs and manually fix
these things but not before having run repair on a dozen inconsistent
PGs. Now I'm worried about what sort of corruption I may have
introduced. Repairing things by hand is a simple heuristic based on
comparing the size or checksum (as indicated by the logs) for each of
the 3 copies and figuring out which is correct. Presumably matching two
out of three should win and the odd object out should be deleted since
having the exact same kind of error on two different OSDs is highly
improbable. I don't understand why ceph repair wouldn't have done this
all along.
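
For what it's worth, the by-hand process looks roughly like this (FileStore
layout; the PG id, OSD id and object name below are made up for illustration):

  # which PG is inconsistent, and which OSDs hold it
  ceph health detail | grep inconsistent
  ceph pg map 3.1a        # prints the acting set, e.g. [5,12,7]

  # on each of those OSD hosts, find the object's file and checksum it
  find /var/lib/ceph/osd/ceph-5/current/3.1a_head/ -name '*rbd_data.abc123*' -exec md5sum {} \;

  # two matching checksums win; the odd copy out is the one I treat as bad

The OSD log usually says whether it was a size or a digest mismatch, so
sometimes comparing sizes with ls -l is already enough.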

What is the current best practice in the use of ceph repair?

Thanks!
--
Tracy Reed
Shinobu Kinjo
2017-02-18 02:08:39 UTC
If ``ceph pg deep-scrub <pg id>`` does not work, then do ``ceph pg repair <pg id>``.
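
Something like this, with the pg id only as an example:

  ceph pg deep-scrub 3.1a
  # wait for the deep-scrub to finish, then see what it found
  ceph health detail
  rados list-inconsistent-obj 3.1a --format=json-pretty
  # if the PG is still inconsistent and the output looks sane
  ceph pg repair 3.1a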
Tracy Reed
2017-02-18 03:05:38 UTC
Well, that's the question...is that safe? Because the link to the
mailing list post (possibly outdated) says that what you just suggested
is definitely NOT safe. Is the mailing list post wrong? Has the
situation changed? Exactly what does ceph repair do now? I suppose I
could go dig into the code but I'm not an expert and would hate to get
it wrong and post possibly bogus info to the list for other newbies to
find and worry about and possibly lose their data.
--
Tracy Reed
Nick Fisk
2017-02-18 08:39:22 UTC
From what I understand, in Jewel+ Ceph has the concept of an authoritative
shard, so in the case of a 3x replica pool it will notice that 2 replicas
match and one doesn't and use one of the good replicas. However, in a 2x
pool you're out of luck.

However, if someone could confirm my suspicions, that would be good as well.
Gregory Farnum
2017-02-20 22:12:52 UTC
Hmm, I went digging in and sadly this isn't quite right. The code has
a lot of internal plumbing to allow more smarts than were previously
feasible and the erasure-coded pools make use of them for noticing
stuff like local corruption. Replicated pools make an attempt but it's
not as reliable as one would like and it still doesn't involve any
kind of voting mechanism.
A self-inconsistent replicated primary won't get chosen. A primary is
self-inconsistent when its digest doesn't match the data, which it can
notice when:
1) the object hasn't been written since it was last scrubbed, or
2) the object was written in full, or
3) the object has only been appended to since the last time its digest
was recorded, or
4) something has gone terribly wrong in/under LevelDB and the omap
entries don't match what the digest says should be there.

David knows more and can correct me if I'm missing something. He's also
working on interfaces for scrub that are more friendly in general and
allow administrators to make more fine-grained decisions about
recovery in ways that cooperate with RADOS.
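
In practice you can get a rough idea of which case you are in from what
scrub recorded, with something like this (the pg id is an example, and the
exact field names may differ between versions):

  rados list-inconsistent-obj 3.1a --format=json-pretty
  # per-shard error entries such as data_digest_mismatch, omap_digest_mismatch,
  # size_mismatch or read_error show what deep-scrub actually disagreed about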
-Greg
Christian Balzer
2017-02-21 00:24:02 UTC
Hello,
Post by Gregory Farnum
Hmm, I went digging in and sadly this isn't quite right. The code has
a lot of internal plumbing to allow more smarts than were previously
feasible and the erasure-coded pools make use of them for noticing
stuff like local corruption. Replicated pools make an attempt but it's
not as reliable as one would like and it still doesn't involve any
kind of voting mechanism.
I seem to recall a lot of that plumbing going/being talked about but
never going into full action; good to know that I didn't miss anything and
that my memory is still reliable. ^o^
Post by Gregory Farnum
David knows more and can correct me if I'm missing something. He's also
working on interfaces for scrub that are more friendly in general and
allow administrators to make more fine-grained decisions about
recovery in ways that cooperate with RADOS.
That is certainly appreciated, especially if it gets backported to
versions where people are stuck with FS based OSDs.

However I presume that the main goal and focus is still BlueStore with
live internal checksums that make scrubbing obsolete, right?


Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
Gregory Farnum
2017-02-21 01:15:59 UTC
Post by Christian Balzer
I seem to recall a lot of that plumbing going/being talked about, but
never going into full action, good to know that I didn't miss anything and
that my memory is still reliable. ^o^
Yeah. Mixed in with the subtlety are some good use cases, though. For
instance, anything written with RGW is always going to fit into the
cases where it will detect a bad primary. RBD is a lot less likely to fit
(unless you've done something crazy like set 4K objects and the VM
always sends down 4k writes), but since scrubbing fills in the data digests
you can count on your snapshots and golden images being
well-protected. Etc etc.
Post by Christian Balzer
That is certainly appreciated, especially if it gets backported to
versions where people are stuck with FS based OSDs.
However I presume that the main goal and focus is still BlueStore with
live internal checksums that make scrubbing obsolete, right?
I'm not sure what you mean. BlueStore certainly has a ton of work
going on, and we have plans to update scrub/repair to play nicely and
handle more of the use cases that BlueStore is likely to expose and
which FileStore did not. But just about all the scrub/repair
enhancements we're aiming at will work on both systems, and making
them handle the BlueStore cases may do a lot more proportionally for
FileStore.
-Greg
Christian Balzer
2017-02-21 02:38:21 UTC
Hello,
Post by Gregory Farnum
I'm not sure what you mean. BlueStore certainly has a ton of work
going on, and we have plans to update scrub/repair to play nicely and
handle more of the use cases that BlueStore is likely to expose and
which FileStore did not. But just about all the scrub/repair
enhancements we're aiming at will work on both systems, and making
them handle the BlueStore cases may do a lot more proportionally for
FileStore.
I'm talking about the various discussions here (google for "bluestore
checksum", which also shows talks on devel, unsurprisingly) as well as
Sage's various slides about Bluestore and checksums.

From those I take away that:

1. All BlueStore reads have 100% read-checksums all the time, completely
preventing the silent data corruption that is possible now.
Similar to ZFS/BTRFS, with in-flight delivery of a good replica and repair
of the broken one.

2. Scrubbing (deep) becomes something of a "feel good" thing that can be done
at much longer intervals (depending on the quality of your storage and
replication size) and with much lower priority, as its main (only?) benefit
will be to detect and correct corruption before all replicas of very
infrequently read data have become affected.
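
If that holds, I'd expect most of us to simply stretch the scrub knobs, along
these lines (runtime injection shown; the values are illustrative rather than
a recommendation, and some options may only fully apply after a restart or a
ceph.conf change):

  ceph tell osd.* injectargs '--osd_deep_scrub_interval 2419200'   # ~4 weeks instead of the weekly default
  ceph tell osd.* injectargs '--osd_scrub_priority 1'              # lowest impact on client I/O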

Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
Nick Fisk
2017-02-21 08:38:07 UTC
Thanks for the correction, Greg. So I'm guessing that the probability of
overwriting with an incorrect primary is reduced in later releases, but it
can still happen.

Quick question, and maybe this is a #5 on your list: what about
objects that are marked inconsistent on the primary due to a read error? I
would say 90% of my inconsistent PGs are caused by a read error and an
associated smartctl error.

"rados list-inconsistent-obj" shows that it knows that the primary had a
read error, so I assume a "pg repair" wouldn't try and read from the primary
again?
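
For reference, this is roughly what I end up looking at when one of those
shows up (the pg id and device are examples, and the JSON field names are
from memory):

  rados list-inconsistent-obj 3.1a --format=json-pretty | less
  # shards whose "errors" list contains "read_error" point at the OSD/disk at fault
  smartctl -a /dev/sdX | grep -Ei 'reallocated|pending|uncorrect'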
David Zafman
2017-02-22 01:20:08 UTC
Nick,

Yes, as you would expect, a read error would not be used as a source
for repair, no matter which OSD(s) are getting read errors.


David
Tracy Reed
2017-02-23 04:35:30 UTC
Post by Gregory Farnum
Hmm, I went digging in and sadly this isn't quite right.
Thanks for looking into this! This is the answer I was afraid of. Aren't
all of those blog entries which talk about using repair and the ceph
docs themselves putting people's data at risk? It seems like the only
responsible way to deal with inconsistent PGs is to dig into the osd
log, look at the reason for the inconsistency, examine the data on disk,
determine which one is good and which is bad, and delete the bad one?
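
Concretely, the procedure I've ended up with looks something like this (the
OSD id, PG id and object name are made up, and the paths are
FileStore-specific):

  # 1. see what deep-scrub actually complained about on the primary
  grep ERR /var/log/ceph/ceph-osd.5.log | grep 3.1a

  # 2. keep the cluster from rebalancing and stop the OSD holding the bad copy
  ceph osd set noout
  systemctl stop ceph-osd@5

  # 3. move the bad object file out of the way (after checksumming the copies)
  mkdir -p /root/bad-objects
  find /var/lib/ceph/osd/ceph-5/current/3.1a_head/ -name '*rbd_data.abc123*' -exec mv {} /root/bad-objects/ \;

  # 4. bring the OSD back and let repair copy in a good replica
  systemctl start ceph-osd@5
  ceph osd unset noout
  ceph pg repair 3.1a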
Post by Gregory Farnum
The code has a lot of internal plumbing to allow more smarts than were
previously feasible and the erasure-coded pools make use of them for
noticing stuff like local corruption. Replicated pools make an attempt
but it's not as reliable as one would like and it still doesn't
involve any kind of voting mechanism.
This is pretty surprising. I would have thought a best two out of three
voting mechanism in a triple replicated setup would be the obvious way
to go. It must be more difficult to implement than I suppose.
Post by Gregory Farnum
A self-inconsistent replicated primary won't get chosen. A primary is
self-inconsistent when its digest doesn't match the data, which it can notice when:
1) the object hasn't been written since it was last scrubbed, or
2) the object was written in full, or
3) the object has only been appended to since the last time its digest
was recorded, or
4) something has gone terribly wrong in/under LevelDB and the omap
entries don't match what the digest says should be there.
At least there's some sort of basic heuristic which attempts to do the
right thing even if the whole process isn't as thorough as it could be.
Post by Gregory Farnum
David knows more and can correct me if I'm missing something. He's also
working on interfaces for scrub that are more friendly in general and
allow administrators to make more fine-grained decisions about
recovery in ways that cooperate with RADOS.
These will be very welcome improvements!
--
Tracy Reed