Discussion:
Best practice K/M-parameters EC pool
Blair Bethwaite
2014-08-26 00:23:43 UTC
Message: 25
Date: Fri, 15 Aug 2014 15:06:49 +0200
From: Loic Dachary <loic at dachary.org>
To: Erik Logtenberg <erik at logtenberg.eu>, ceph-users at lists.ceph.com
Subject: Re: [ceph-users] Best practice K/M-parameters EC pool
Message-ID: <53EE05E9.1040105 at dachary.org>
Content-Type: text/plain; charset="iso-8859-1"
...
If the probability of losing a disk is 0.1%, the probability of losing two disks simultaneously (i.e. before the failure can be recovered) would be 0.1*0.1 = 0.01% and three disks becomes 0.1*0.1*0.1 = 0.001% and four disks becomes 0.0001%
I watched this conversation and an older similar one (Failure
probability with largish deployments) with interest as we are in the
process of planning a pretty large Ceph cluster (~3.5 PB), so I have
been trying to wrap my head around these issues.

Loic's reasoning (above) seems sound as a naive approximation assuming
independent probabilities for disk failures, which may not be quite
true given potential for batch production issues, but should be okay
for other sorts of correlations (assuming a sane crushmap that
eliminates things like controllers and nodes as sources of
correlation).

One of the things that came up in the "Failure probability with
largish deployments" thread and has raised its head again here is the
idea that striped data (e.g., RADOS-GW objects and RBD volumes) might
be somehow more prone to data-loss than non-striped. I don't think
anyone has so far provided an answer on this, so here's my thinking...

The level of atomicity that matters when looking at durability &
availability in Ceph is the Placement Group. For any non-trivial RBD
workload it is likely that the RBDs will span all/most PGs, e.g., even a
relatively small 50GiB volume would (with the default 4MiB object size)
comprise 12800 objects - more objects than there are PGs in many
production clusters obeying the 100-200 PGs per drive rule of thumb, so
it will have data in essentially every PG. <IMPORTANT>Losing any
one PG will cause data-loss. The failure-probability effects of
striping across multiple PGs are immaterial considering that loss of
any single PG is likely to damage all your RBDs</IMPORTANT>. This
might be why the reliability calculator doesn't consider the total number
of disks.
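
As a quick back-of-envelope check of those numbers (nothing assumed here
beyond the default 4MiB object size and the PGs-per-OSD rule of thumb
already mentioned):

$ # objects in a 50GiB RBD at 4MiB per object
$ echo "50 * 1024 / 4" | bc      # -> 12800 objects
$ # PG count of, e.g., a 96-OSD, 2-replica pool sized at ~100 PGs per OSD
$ echo "100 * 96 / 2" | bc       # -> 4800 PGs, far fewer than the object count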

Related to all this is the durability of 2 versus 3 replicas (or e.g.
M>=1 for Erasure Coding). It's easy to get caught up in the worrying
fallacy that losing any M OSDs will cause data-loss, but this isn't
true - they have to be members of the same PG for data-loss to occur.
So then it's tempting to think the chances of that happening are so
slim as to not matter, and to wonder why we would ever even need 3
replicas. I mean, what are the odds of exactly those 2 drives, out of
the 100, 200... in my cluster, failing within the recovery window?! But
therein lies the rub - you should be thinking about PGs. If a drive
fails, then the chance of a resulting data-loss event depends on the
chances of losing further drives from the affected/degraded PGs.

I've got a real cluster at hand, so let's use that as an example. We
have 96 drives/OSDs - 8 nodes, 12 OSDs per node, 2 replicas, top-down
failure domains: rack pairs (x2), nodes, OSDs... Let's say OSD 15
dies. How many PGs are now at risk:
$ grep "^10\." pg.dump | awk '{print $15}' | grep 15 | wc
109 109 861
(NB: 10 is the pool id, pg.dump is a text file dump of "ceph pg dump",
$15 is the acting set column)

109 PGs now "living on the edge". No surprises in that number as we
used 100 * 96 / 2 = 4800 to arrive at the PG count for this pool, so
on average any one OSD will be primary for 50 PGs and replica for
another 50. But this doesn't tell me how exposed I am; for that I need
to know how many "neighbouring" OSDs there are in these 109 PGs:
$ grep "^10\." pg.dump | awk '{print $15}' | grep 15 | sed
's/\[15,\(.*\)\]/\1/' | sed 's/\[\(.*\),15\]/\1/' | sort | uniq | wc
67 67 193
(NB: grep-ing for OSD "15" and using sed to remove it and surrounding
formatting to get just the neighbour id)
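
Grep-ing for a bare "15" happens to be safe here since all OSD ids are at
most two digits, but for reuse on a bigger cluster an awk variant that
matches acting-set members exactly (same pool id and acting-set column
assumptions as above) might look like:

$ OSD=15
$ grep "^10\." pg.dump | awk -v osd=$OSD '
    { gsub(/[][]/, "", $15); n = split($15, a, ",")   # strip [ ] and split the acting set
      hit = 0; for (i = 1; i <= n; i++) if (a[i] == osd) hit = 1
      if (hit) for (i = 1; i <= n; i++) if (a[i] != osd) print a[i] }' |
  sort -un | wc -l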

Yikes! So if any one of those 67 drives fails during recovery of OSD
15, then we've lost data. The expected neighbour count is determined by
our crushmap, which in this case splits the cluster up into 2 top-level
failure domains, so I'd guess the second failure would have to hit one
of roughly 48 specific drives on average for this cluster. But actually
looking at the numbers for each OSD the exposure is higher than that
here - the lowest distinct "neighbour" count we have is 50. Note that
we haven't tuned any of the options in our crushmap, so I guess maybe
Ceph favours fewer repeat sets by default when coming up with PGs(?).

Anyway, here's the average and top 10 neighbour counts (hope this
scripting is right! ;-):

$ for OSD in {0..95}; do echo -ne "$OSD\t"; grep "^10\." pg.dump | awk
'{print $15}' | grep "\[${OSD},\|,${OSD}\]" | sed
"s/\[$OSD,\(.*\)\]/\1/" | sed "s/\[\(.*\),$OSD\]/\1/" | sort | uniq |
wc -l; done | awk '{ total += $2 } END { print total/NR }'
58.5208

$ for OSD in {0..95}; do echo -ne "$OSD\t"; grep "^10\." pg.dump | awk
'{print $15}' | grep "\[${OSD},\|,${OSD}\]" | sed
"s/\[$OSD,\(.*\)\]/\1/" | sed "s/\[\(.*\),$OSD\]/\1/" | sort | uniq |
wc -l; done | sort -k2 -r | head
78 69
37 68
92 67
15 67
91 66
66 65
61 65
89 64
88 64
87 64
(OSD# Neighbour#)

So, if I am getting this right then at the end of the day __I think__
all this essentially boils down (sans CRUSH) to the number of possible
combinations (not permutations - order is irrelevant) of OSDs that can
be chosen. Shrinking the fraction of those combinations that your PGs
actually occupy is only possible by increasing r in nCr:
96 choose 2 = 4560
96 choose 3 = 142880

So basically with two replicas, if _any_ two disks fail within your
recovery window the chance of data-loss is high, because those two
OSDs are quite likely to share at least one of the PGs actually present
in the pool. With three replicas that tapers off hugely, as the pool
only occupies 4800 / 142880 * 100 ~= 3.4% of the possible 3-OSD
combinations.
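
For what it's worth, bc agrees with those figures:

$ echo "96 * 95 / 2" | bc                   # -> 4560 possible 2-OSD sets
$ echo "96 * 95 * 94 / 6" | bc              # -> 142880 possible 3-OSD sets
$ # fraction of the 3-OSD sets that 4800 PGs can actually occupy
$ echo "scale=6; 4800 / 142880 * 100" | bc  # -> ~3.36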

I guess to some extent that this holds true for M values in EC pools.

I hope some of this makes sense...? I'd love to see some of these
questions answered canonically by Inktank or Sage, if not then perhaps
I'll see how far I get sticking this diatribe into the ICE support
portal...
--
Cheers,
~Blairo
Christian Balzer
2014-08-26 09:17:53 UTC
Hello,
Post by Blair Bethwaite
I watched this conversation and an older similar one (Failure
probability with largish deployments) with interest as we are in the
process of planning a pretty large Ceph cluster (~3.5 PB), so I have
been trying to wrap my head around these issues.
As the OP of the "Failure probability with largish deployments" thread I
have to thank Blair for raising this issue again and doing the hard math
above, which looks fine to me.

At the end of that slightly inconclusive thread I walked away with the
same impression as Blair, namely that the survival of PGs is the key
factor and that they will likely be spread out over most, if not all the
OSDs.

Which in turn reinforced my decision to deploy our first production
Ceph cluster based on nodes with 2 OSDs backed by 11-disk RAID6 sets behind
a HW RAID controller with 4GB cache AND SSD journals.
I can live with the reduced performance (which is caused by the OSD code
running out of steam long before the SSDs or the RAIDs do), because not
only do I save 1/3rd of the space and 1/4th of the cost compared to a
replication 3 cluster, the total number of disks that need to fail within
the recovery window to cause data loss is now 4.

The next cluster I'm currently building is a classic Ceph design,
replication of 3, 8 OSD HDDs and 4 journal SSDs per node, because with
this cluster I won't have predictable I/O patterns and loads.
OTOH, I don't see it growing much beyond 48 OSDs, so I'm happy enough with
the odds here.

I think doing the exact maths for a cluster of the size you're planning
would be very interesting and also very much needed.
3.5PB usable space would be close to 3000 disks with a replication of 3,
but even if you meant that as a gross value it would probably mean that
you're looking at frequent, if not daily, disk failures.
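
To put a very rough number on that (4TB drives and an 8% AFR are just my
guesses here):

$ # raw drives for 3.5PB usable at replication 3 and 4TB per drive
$ echo "3.5 * 1000 * 3 / 4" | bc           # -> 2625 drives
$ # expected drive failures per day at an assumed 8% AFR
$ echo "scale=2; 2625 * 0.08 / 365" | bc   # -> ~0.6 a day, i.e. one every other day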


Regards,

Christian
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Loic Dachary
2014-08-26 13:25:30 UTC
Hi Blair,

Assuming that:

* The pool is configured for three replicas (size = 3 which is the default)
* It takes one hour for Ceph to recover from the loss of a single OSD
* Any other disk has a 0.001% chance to fail within the hour following the failure of the first disk (assuming AFR https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is 10%, divided by the number of hours during a year).
* A given disk does not participate in more than 100 PG

Each time an OSD is lost, there is a 0.001*0.001 = 0.000001% chance that two other disks are lost before recovery. Since the disk that failed initially participates in 100 PG, that is 0.000001% x 100 = 0.0001% chance that a PG is lost. Or the entire pool if it is used in a way that losing a PG means losing all data in the pool (as in your example, where it contains RBD volumes and each of the RBD volumes uses all the available PG).

If the pool is using at least two datacenters operated by two different organizations, this calculation makes sense to me. However, if the cluster is in a single datacenter, isn't it possible that some event independent of Ceph has a greater probability of permanently destroying the data? A month ago I lost three machines in a Ceph cluster and realized on that occasion that the crushmap was not configured properly and that PGs were lost as a result. Fortunately I was able to recover the disks and plug them into another machine to recover the lost PGs. I'm not a system administrator and the probability of me failing to do the right thing is higher than normal: this is just an example of a high probability event leading to data loss. In other words, I wonder if this 0.0001% chance of losing a PG within the hour following a disk failure matters or if it is dominated by other factors. What do you think?

Cheers
--
Loïc Dachary, Artisan Logiciel Libre

Loic Dachary
2014-08-26 14:12:11 UTC
Using percentages instead of numbers led me to calculation errors. Here it is again using fractions instead of percentages for clarity ;-)

Assuming that:

* The pool is configured for three replicas (size = 3 which is the default)
* It takes one hour for Ceph to recover from the loss of a single OSD
* Any other disk has a 1/100,000 chance to fail within the hour following the failure of the first disk (assuming the AFR https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is 8%, divided by the number of hours in a year: 0.08 / 8760 ~= 1/100,000)
* A given disk does not participate in more than 100 PG

Each time an OSD is lost, there is a 1/100,000*1/100,000 = 1/10,000,000,000 chance that two other disks are lost before recovery. Since the disk that failed initially participates in 100 PG, that is 1/10,000,000,000 x 100 = 1/100,000,000 chance that a PG is lost. Or the entire pool if it is used in a way that losing a PG means losing all data in the pool (as in your example, where it contains RBD volumes and each of the RBD volumes uses all the available PG).
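
The same arithmetic spelled out (the 8% AFR, one-hour window and 100 PGs
being the assumptions listed above):

$ # per-disk chance of failing within a given hour, at 8% AFR
$ echo "0.08 / 8760" | bc -l             # -> ~9.1e-6, i.e. roughly the 1/100,000 above
$ # two further failures within that hour, times the ~100 PGs on the failed disk
$ echo "(0.08 / 8760)^2 * 100" | bc -l   # -> ~8.3e-9, i.e. roughly the 1/100,000,000 above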

If the pool is using at least two datacenters operated by two different organizations, this calculation makes sense to me. However, if the cluster is in a single datacenter, isn't it possible that some event independent of Ceph has a greater probability of permanently destroying the data? A month ago I lost three machines in a Ceph cluster and realized on that occasion that the crushmap was not configured properly and that PGs were lost as a result. Fortunately I was able to recover the disks and plug them into another machine to recover the lost PGs. I'm not a system administrator and the probability of me failing to do the right thing is higher than normal: this is just an example of a high probability event leading to data loss. Another example would be if all disks in the same PG are part of the same batch and therefore likely to fail at the same time. In other words, I wonder if this 1/100,000,000 chance of losing a PG within the hour following a disk failure matters or if it is dominated by other factors. What do you think?

Cheers
--
Loïc Dachary, Artisan Logiciel Libre

Craig Lewis
2014-08-26 17:37:33 UTC
My OSD rebuild time is more like 48 hours (4TB disks, >60% full, osd max
backfills = 1). I believe that increases my risk of failure by 48^2.
Since your numbers are failure rate per hour per disk, I need to consider
the risk over the whole rebuild time for each disk. So more formally, the
risk scales with rebuild time to the power of (replicas - 1).

So I'm at 2304/100,000,000, or approximately 1/43,000. That's a much
higher risk than 1 / 10^8.
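
Same thing in shell, reusing Loic's per-failure figure and my 48 hour
window (all assumptions as above):

$ echo "48^2" | bc                # -> 2304
$ # scale the 1/100,000,000 per-failure chance by the 48h window squared
$ echo "100000000 / 2304" | bc    # -> 43402, i.e. roughly 1 in 43,000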


A risk of 1/43,000 means that I'm more likely to lose data due to human
error than disk failure. Still, I can put a small bit of effort in to
optimize recovery speed, and lower this number. Managing human error is
much harder.
Loic Dachary
2014-08-26 18:21:39 UTC
Hi Craig,

I assume the reason for the 48 hours recovery time is to keep the cost of the cluster low? I wrote "1h recovery time" because it is roughly the time it would take to move 4TB over a 10Gb/s link. Could you upgrade your hardware to reduce the recovery time to less than two hours? Or are there factors other than cost that prevent this?
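
The "1h" is just raw link speed, ignoring protocol and disk overhead:

$ # time to move 4TB over a 10Gb/s link
$ echo "4 * 10^12 * 8 / (10 * 10^9) / 60" | bc   # -> ~53 minutes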

Cheers
--
Loïc Dachary, Artisan Logiciel Libre

Christian Balzer
2014-08-27 02:34:43 UTC
Hello,
Post by Loic Dachary
Hi Craig,
I assume the reason for the 48 hours recovery time is to keep the cost
of the cluster low ? I wrote "1h recovery time" because it is roughly
the time it would take to move 4TB over a 10Gb/s link. Could you upgrade
your hardware to reduce the recovery time to less than two hours ? Or
are there factors other than cost that prevent this ?
I doubt Craig is operating on a shoestring budget.
And even if his network were to be just GbE, that would still make it only
10 hours according to your wishful thinking formula.

He probably has set the max_backfills to 1 because that is the level of
I/O his OSDs can handle w/o degrading cluster performance too much.
The network is unlikely to be the limiting factor.

The way I see it most Ceph clusters are in sort of steady state when
operating normally, i.e. a few hundred VM RBD images ticking over, most
actual OSD disk ops are writes, as nearly all hot objects that are being
read are in the page cache of the storage nodes.
Easy peasy.

Until something happens that breaks this routine, like a deep scrub, all
those VMs rebooting at the same time or a backfill caused by a failed OSD.
Now all of a sudden client ops compete with the backfill ops, page caches
are no longer hot, the spinners are seeking left and right.
Pandemonium.

I doubt very much that even with an SSD-backed cluster you would get away
with less than 2 hours for 4TB.

To give you some real life numbers, I currently am building a new cluster
but for the time being have only one storage node to play with.
It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs and 8 actual
OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it.

So I took out one OSD (reweight 0 first, then the usual removal steps)
because the actual disk was wonky. Replaced the disk and re-added the OSD.
Both operations took about the same time: 4 minutes for evacuating the OSD
(having 7 write targets clearly helped), for a measly 12GB or about 50MB/s,
and 5 minutes, or about 35MB/s, for refilling the OSD.
And that is on one node (thus no network latency) that has the default
parameters (so a max_backfill of 10) which was otherwise totally idle.

In other words, in this pretty ideal case it would have taken 22 hours
to re-distribute 4TB.
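
(That figure is simply the ~50MB/s evacuation rate extrapolated:)

$ # 4TB at ~50MB/s
$ echo "4 * 10^6 / 50 / 3600" | bc -l   # -> ~22 hours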

More in another reply.
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Loic Dachary
2014-08-27 11:04:48 UTC
Post by Christian Balzer
In other words, in this pretty ideal case it would have taken 22 hours
to re-distribute 4TB.
That makes sense to me :-)

When I wrote 1h, I thought about what happens when an OSD becomes unavailable with no planning in advance. In the scenario you describe the risk of a data loss does not increase since the objects are evicted gradually from the disk being decommissioned and the number of replicas stays the same at all times. There is not a sudden drop in the number of replicas, which is what I had in mind.

If the lost OSD was part of 100 PGs, the other disks (let's say 50 of them) will start transferring a new replica of the objects they hold to the new OSD in their PG. The replacement will not be a single OSD, although nothing prevents the same OSD from being used in more than one PG as a replacement for the lost one. If the cluster network is connected at 10Gb/s and is 50% busy at all times, that leaves 5Gb/s. Since the new replicas do not originate from a single OSD but from at least dozens of them, and since they target more than one OSD, I assume we can expect an actual throughput of 5Gb/s. I should have written 2h instead of 1h to account for the fact that the cluster network is never idle.
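
(Which is where the 2h figure comes from:)

$ # 4TB over an assumed 5Gb/s of spare cluster bandwidth
$ echo "4 * 10^12 * 8 / (5 * 10^9) / 3600" | bc -l   # -> ~1.8 hours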

Am I being too optimistic? Do you see another blocking factor that would significantly slow down recovery?

Cheers
--
Loïc Dachary, Artisan Logiciel Libre

Christian Balzer
2014-08-28 04:23:36 UTC
Post by Loic Dachary
That makes sense to me :-)
When I wrote 1h, I thought about what happens when an OSD becomes
unavailable with no planning in advance. In the scenario you describe
the risk of a data loss does not increase since the objects are evicted
gradually from the disk being decommissioned and the number of replica
stays the same at all times. There is not a sudden drop in the number of
replica which is what I had in mind.
That may be, but I'm rather certain that there is no difference in speed
and priority of a rebalancing caused by an OSD set to weight 0 or one
being set out.
Post by Loic Dachary
If the lost OSD was part of 100 PG, the other disks (let say 50 of them)
will start transferring a new replica of the objects they have to the
new OSD in their PG. The replacement will not be a single OSD although
nothing prevents the same OSD to be used in more than one PG as a
replacement for the lost one. If the cluster network is connected at
10Gb/s and is 50% busy at all times, that leaves 5Gb/s. Since the new
duplicates do not originate from a single OSD but from at least dozens
of them and since they target more than one OSD, I assume we can expect
an actual throughput of 5Gb/s. I should have written 2h instead of 1h to
account for the fact that the cluster network is never idle.
Am I being too optimistic ?
Vastly.
Post by Loic Dachary
Do you see another blocking factor that
would significantly slow down recovery ?
As Craig and I keep telling you, the network is not the limiting factor.
Concurrent disk IO is, as I pointed out in the other thread.

Another example if you please:
My shitty test cluster, 4 nodes, one OSD each, journal on disk, no SSDs.
1 GbE links for client and cluster respectively.
---
#ceph -s
cluster 25bb48ec-689d-4cec-8494-d1a62ca509be
health HEALTH_OK
monmap e1: 1 mons at {irt03=192.168.0.33:6789/0}, election epoch 1, quorum 0 irt03
osdmap e1206: 4 osds: 4 up, 4 in
pgmap v543045: 256 pgs, 3 pools, 62140 MB data, 15648 objects
141 GB used, 2323 GB / 2464 GB avail
256 active+clean
---
replication size is 2, it can do about 60MB/s writes with rados bench from
a client.

Setting one OSD out (the data distribution is nearly uniform) it took 12
minutes to recover on a completely idle (no clients connected) cluster.
The disk utilization was 70-90%, the cluster network hovered around 20%,
never exceeding 35% on the 3 "surviving" nodes. CPU was never an issue.
Given the ceph log numbers and the data size, I make this a recovery speed
of about 40MB/s or 13MB/s per OSD.
Better than I expected, but a far cry from what the OSDs could do
individually if they were not flooded with concurrent read and write
requests by the backfilling operation.

Now, more disks will help, but I very much doubt that this will scale
linearly, so 50 OSDs won't give you 500MB/s (somebody prove me wrong please).

And this was an IDLE cluster.

Doing this on a cluster with just about 10 client IOPS per OSD would be
far worse. Never mind that people don't like their client IO to stall for
more than a few seconds.

Something that might improve this both in terms of speed and impact on
the clients would be something akin to the MD (Linux software RAID)
recovery logic.
As in, only one backfill operation per OSD (read or write, not both!) at
the same time.
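
The closest existing knobs I know of are the recovery/backfill throttles;
the values below are purely illustrative, not a recommendation:

$ # limit concurrent backfill and recovery work per OSD at runtime
$ ceph tell osd.\* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'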

Regards,

Christian
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Loic Dachary
2014-08-28 08:48:18 UTC
Post by Christian Balzer
Post by Loic Dachary
Post by Christian Balzer
Hello,
Post by Loic Dachary
Hi Craig,
I assume the reason for the 48 hours recovery time is to keep the cost
of the cluster low ? I wrote "1h recovery time" because it is roughly
the time it would take to move 4TB over a 10Gb/s link. Could you
upgrade your hardware to reduce the recovery time to less than two
hours ? Or are there factors other than cost that prevent this ?
I doubt Craig is operating on a shoestring budget.
And even if his network were to be just GbE, that would still make it
only 10 hours according to your wishful thinking formula.
He probably has set the max_backfills to 1 because that is the level of
I/O his OSDs can handle w/o degrading cluster performance too much.
The network is unlikely to be the limiting factor.
The way I see it most Ceph clusters are in sort of steady state when
operating normally, i.e. a few hundred VM RBD images ticking over, most
actual OSD disk ops are writes, as nearly all hot objects that are
being read are in the page cache of the storage nodes.
Easy peasy.
Until something happens that breaks this routine, like a deep scrub,
all those VMs rebooting at the same time or a backfill caused by a
failed OSD. Now all of a sudden client ops compete with the backfill
ops, page caches are no longer hot, the spinners are seeking left and
right. Pandemonium.
I doubt very much that even with a SSD backed cluster you would get
away with less than 2 hours for 4TB.
To give you some real life numbers, I currently am building a new
cluster but for the time being have only one storage node to play with.
It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs and 8
actual OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it.
So I took out one OSD (reweight 0 first, then the usual removal steps)
because the actual disk was wonky. Replaced the disk and re-added the
OSD. Both operations took about the same time, 4 minutes for
evacuating the OSD (having 7 write targets clearly helped) for measly
12GB or about 50MB/s and 5 minutes or about 35MB/ for refilling the
OSD. And that is on one node (thus no network latency) that has the
default parameters (so a max_backfill of 10) which was otherwise
totally idle.
In other words, in this pretty ideal case it would have taken 22 hours
to re-distribute 4TB.
That makes sense to me :-)
When I wrote 1h, I thought about what happens when an OSD becomes
unavailable with no planning in advance. In the scenario you describe
the risk of a data loss does not increase since the objects are evicted
gradually from the disk being decommissioned and the number of replicas
stays the same at all times. There is no sudden drop in the number of
replicas, which is what I had in mind.
That may be, but I'm rather certain that there is no difference in speed
and priority of a rebalancing caused by an OSD set to weight 0 or one
being set out.
Post by Loic Dachary
If the lost OSD was part of 100 PG, the other disks (let's say 50 of them)
will start transferring a new replica of the objects they have to the
new OSD in their PG. The replacement will not be a single OSD, although
nothing prevents the same OSD from being used in more than one PG as a
replacement for the lost one. If the cluster network is connected at
10Gb/s and is 50% busy at all times, that leaves 5Gb/s. Since the new
duplicates do not originate from a single OSD but from at least dozens
of them and since they target more than one OSD, I assume we can expect
an actual throughput of 5Gb/s. I should have written 2h instead of 1h to
account for the fact that the cluster network is never idle.
Am I being too optimistic ?
Vastly.
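For what it is worth, the raw wire time for 4TB at various link speeds is easy to check (bash and bc; pure transfer time only, ignoring all the other bottlenecks discussed in this thread):

# 4 TB (decimal) over 10, 5 and 1 Gb/s links, in hours of pure transfer time
for gbps in 10 5 1; do
    echo "scale=2; 4 * 8 * 10^12 / ($gbps * 10^9) / 3600" | bc
done
# -> roughly 0.9, 1.8 and 8.9 hours respectively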
Post by Loic Dachary
Do you see another blocking factor that
would significantly slow down recovery ?
As Craig and I keep telling you, the network is not the limiting factor.
Concurrent disk IO is, as I pointed out in the other thread.
My shitty test cluster, 4 nodes, one OSD each, journal on disk, no SSDs.
1 GbE links for client and cluster respectively.
---
#ceph -s
cluster 25bb48ec-689d-4cec-8494-d1a62ca509be
health HEALTH_OK
monmap e1: 1 mons at {irt03=192.168.0.33:6789/0}, election epoch 1, quorum 0 irt03
osdmap e1206: 4 osds: 4 up, 4 in
pgmap v543045: 256 pgs, 3 pools, 62140 MB data, 15648 objects
141 GB used, 2323 GB / 2464 GB avail
256 active+clean
---
replication size is 2, it can do about 60MB/s writes with rados bench from
a client.
Setting one OSD out (the data distribution is nearly uniform) it took 12
minutes to recover on a completely idle (no clients connected) cluster.
The disk utilization was 70-90%, the cluster network hovered around 20%,
never exceeding 35% on the 3 "surviving" nodes. CPU was never an issue.
Given the ceph log numbers and the data size, I make this a recovery speed
of about 40MB/s or 13MB/s per OSD.
Better than I expected, but a far cry from what the OSDs could do
individually if they were not flooded with concurrent read and write
requests by the backfilling operation.
Hi Christian,

My apologies for not noticing you were running the test cluster with a journal collocated with the data on a spinner. In this case I would indeed expect I/O to be the blocking factor, because randomized operations can reduce the disk throughput by an order of magnitude. If you have the journal on an SSD, which is what is generally recommended, you should be able to observe a significant improvement. Such a setup also better reflects the architecture of a large cluster, so extrapolations will be more accurate.

Cheers
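For anyone unsure which situation their own OSDs are in, one hedged way to check (FileStore-era layout, assuming the default /var/lib/ceph paths) is to look at whether each OSD's journal is a symlink to a separate device or a plain file on the data disk:

# A symlink to a block device (e.g. an SSD partition) means a separate journal;
# a regular file here means the journal shares the spinner with the data.
for j in /var/lib/ceph/osd/ceph-*/journal; do
    printf '%s -> ' "$j"
    readlink "$j" || echo "(plain file, colocated with the data)"
done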
Post by Christian Balzer
Now, more disks will help, but I very much doubt that this will scale
linear, so 50 OSDs won't give you 500MB/s (somebody prove me wrong please).
And this was an IDLE cluster.
Doing this on a cluster with just about 10 client IOPS per OSD would be
far worse. Never mind that people don't like their client IO to stall for
more than a few seconds.
Something that might improve this both in terms of speed and impact to
the clients would be something akin to the MD (linux software raid)
recovery logic.
As in, only one backfill operation per OSD (read or write, not both!) at
the same time.
Regards,
Christian
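Ceph does not currently have the one-operation-per-OSD scheduler Christian describes, but the closest existing knobs are the backfill and recovery throttles; a hedged example of turning them down at runtime (the values are illustrative, not tuned recommendations):

# Limit concurrent backfills and recovery ops per OSD, and deprioritise
# recovery relative to client I/O, on all OSDs at once.
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'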
Post by Loic Dachary
Cheers
Post by Christian Balzer
More in another reply.
Post by Loic Dachary
Cheers
Post by Craig Lewis
My OSD rebuild time is more like 48 hours (4TB disks, >60% full, osd
max backfills = 1). I believe that increases my risk of failure by
48^2 . Since your numbers are failure rate per hour per disk, I need
to consider the risk for the whole time for each disk. So more
formally, rebuild time to the power of (replicas -1).
So I'm at 2304/100,000,000, or approximately 1/43,000. That's a
much higher risk than 1 / 10^8.
A risk of 1/43,000 means that I'm more likely to lose data due to
human error than disk failure. Still, I can put a small bit of
effort in to optimize recovery speed, and lower this number.
Managing human error is much harder.
On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary <loic at dachary.org
Using percentages instead of numbers led me to calculation
errors. Here it is again using 1/100 instead of % for clarity ;-)
* The pool is configured for three replicas (size = 3 which is the default)
* It takes one hour for Ceph to recover from the loss of a single OSD
* Any other disk has a 1/100,000 chance to fail within the hour
following the failure of the first disk (assuming AFR
https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk
is 8%, divided by the number of hours during a year == (0.08 / 8760)
~= 1/100,000
* A given disk does not participate in more than 100 PG
--
Loïc Dachary, Artisan Logiciel Libre
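Craig's arithmetic above can be reproduced in a couple of lines (bc; same assumptions: a 1/100,000 per-disk-hour failure rate, a 48 hour rebuild window, two further failures needed with three replicas, and ~100 PGs sharing the failed OSD):

# chance of two more failures hitting a degraded PG within the 48 h window,
# times the ~100 PGs the failed OSD participated in
echo 'scale=10; (48/100000)^2 * 100' | bc        # -> .0000230400  (= 2304/10^8)
echo 'scale=10; 1/((48/100000)^2 * 100)' | bc    # -> ~43402.8, i.e. roughly 1 in 43,000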

Mike Dawson
2014-08-28 14:29:20 UTC
Permalink
Post by Christian Balzer
Post by Loic Dachary
Do you see another blocking factor that
would significantly slow down recovery ?
As Craig and I keep telling you, the network is not the limiting factor.
Concurrent disk IO is, as I pointed out in the other thread.
Completely agree.

On a production cluster with OSDs backed by spindles, even with OSD
journals on SSDs, it is insufficient to calculate single-disk
replacement backfill time based solely on network throughput. IOPS will
likely be the limiting factor when backfilling a single failed spinner
in a production cluster.

Last week I replaced a 3TB 7200rpm drive that was ~75% full in a 72-osd
cluster, 24 hosts, rbd pool with 3 replicas, osd journals on SSDs (ratio
of 3:1), with dual 1GbE bonded NICs.

Using only the throughput math, backfill could theoretically have
completed in a bit over 2.5 hours, but it actually took 15 hours. I've
done this a few times with similar results.

Why? Spindle contention on the replacement drive. Graph the '%util'
metric from something like 'iostat -xt 2' during a single disk backfill
to get a very clear view that spindle contention is the true limiting
factor. It'll be pegged at or near 100% if spindle contention is the issue.

- Mike
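A minimal way to capture the metric Mike points at during a backfill might look like this (the device name sdb is a placeholder for the replacement OSD's disk):

# %util is the last column of the extended (-x) device lines; values pegged
# near 100 during backfill point at spindle contention rather than the network.
iostat -xt 2 sdb | awk '/^sdb / { print $NF; fflush() }' | tee sdb-util.log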
Loic Dachary
2014-08-28 15:17:44 UTC
Permalink
Post by Mike Dawson
Completely agree.
On a production cluster with OSDs backed by spindles, even with OSD journals on SSDs, it is insufficient to calculate single-disk replacement backfill time based solely on network throughput. IOPS will likely be the limiting factor when backfilling a single failed spinner in a production cluster.
Last week I replaced a 3TB 7200rpm drive that was ~75% full in a 72-osd cluster, 24 hosts, rbd pool with 3 replicas, osd journals on SSDs (ratio of 3:1), with dual 1GbE bonded NICs.
Using only the throughput math, backfill could theoretically have completed in a bit over 2.5 hours, but it actually took 15 hours. I've done this a few times with similar results.
Why? Spindle contention on the replacement drive. Graph the '%util' metric from something like 'iostat -xt 2' during a single disk backfill to get a very clear view that spindle contention is the true limiting factor. It'll be pegged at or near 100% if spindle contention is the issue.
Hi Mike,

Did you by any chance also measure how long it took for the 3 replicas to be restored on all PG in which the failed disk was participating ? I assume the following sequence happened:

A) The 3TB drive failed and contained ~2TB
B) The cluster recovered by creating new replicas
C) The new 3TB drive was installed
D) Backfilling completed

I'm interested in the time between A and B, i.e. when one copy is potentially lost forever, because this is when the probability of a permanent data loss increases. Although it is important to reduce the time between C and D to a minimum, it has no impact on the durability of the data.

Cheers
--
Loïc Dachary, Artisan Logiciel Libre
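For anyone who wants to actually measure the A-to-B window Loic asks about, a rough sketch (the poll interval is arbitrary; it simply times how long any PG stays degraded after the failure):

# Start this when the OSD drops out; it exits once no PG reports "degraded",
# i.e. once the missing replicas have been recreated elsewhere.
start=$(date +%s)
while ceph pg stat | grep -q degraded; do
    sleep 30
done
echo "replicas restored after $(( $(date +%s) - start )) seconds"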

Mike Dawson
2014-08-28 17:38:03 UTC
Permalink
Post by Loic Dachary
Hi Mike,
A) The 3TB drive failed and contained ~2TB
B) The cluster recovered by creating new replicas
C) The new 3TB drive was installed
D) Backfilling completed
I'm interested in the time between A and B, i.e. when one copy is potentially lost forever, because this is when the probability of a permanent data loss increases. Although it is important to reduce the time between C and D to a minimum, it has no impact on the durability of the data.
Loic,

We use 3x replication and have drives that have relatively high
steady-state IOPS. Therefore, we tend to prioritize client-side IO more
than a reduction from 3 copies to 2 during the loss of one disk. The
disruption to client io is so great on our cluster, we don't want our
cluster to be in a recovery state without operator-supervision.

Letting OSDs get marked out without operator intervention was a disaster
in the early going of our cluster. For example, an OSD daemon crash
would trigger automatic recovery where it was unneeded. Ironically,
the unneeded recovery would often trigger additional daemons
to crash, making a bad situation worse. During the recovery, rbd client
io would often drop to 0.

To deal with this issue, we set "mon osd down out interval = 14400", so
as operators we have 4 hours to intervene before Ceph attempts to
self-heal. When hardware is at fault, we remove the osd, replace the
drive, re-add the osd, then allow backfill to begin, thereby completely
skipping step B in your timeline above.

- Mike
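For reference, a hedged sketch of that setting, both in ceph.conf form and pushed to running monitors (14400 seconds = 4 hours):

# ceph.conf, [global] or [mon] section:
#   mon osd down out interval = 14400
# or injected at runtime:
ceph tell mon.* injectargs '--mon-osd-down-out-interval 14400'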
Craig Lewis
2014-08-28 20:17:34 UTC
Permalink
Mike Dawson
2014-08-28 20:49:43 UTC
Permalink
Post by Craig Lewis
My initial experience was similar to Mike's, causing a similar level of
paranoia. :-) I'm dealing with RadosGW though, so I can tolerate
higher latencies.
I was running my cluster with noout and nodown set for weeks at a time.
I'm sure Craig will agree, but wanted to add this for other readers:

I find value in the noout flag for temporary intervention, but prefer to
set "mon osd down out interval" for dealing with events that may occur
in the future to give an operator time to intervene.

The nodown flag is another beast altogether. The nodown flag tends to be
*a bad thing* when attempting to provide reliable client io. For our use
case, we want OSDs to be marked down quickly if they are in fact
unavailable for any reason, so client io doesn't hang waiting for them.

If OSDs are flapping during recovery (i.e. the "wrongly marked me down"
log messages), I've found far superior results by tuning the recovery
knobs than by permanently setting the nodown flag.

- Mike
Post by Craig Lewis
Recovery of a single OSD might cause other OSDs to crash. In the
primary cluster, I was always able to get it under control before it
cascaded too wide. In my secondary cluster, it did spiral out to 40% of
the OSDs, with 2-5 OSDs down at any time.
I traced my problems to a combination of osd max backfills being too high
for my cluster and mkfs.xfs arguments that were causing memory starvation
issues. I lowered osd max backfills, added SSD journals,
and reformatted every OSD with better mkfs.xfs arguments. Now both
clusters are stable, and I don't want to break it.
I only have 45 OSDs, so the risk with a 24-48 hours recovery time is
acceptable to me. It will be a problem as I scale up, but scaling up
will also help with the latency problems.
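A minimal example of the "temporary intervention" use of the flag Mike prefers over nodown (the maintenance step in the middle stands in for whatever work is actually being done):

# Stop OSDs from being marked out (and data from being rebalanced) during
# planned work, then return the cluster to normal behaviour afterwards.
ceph osd set noout
# ... reboot the node, swap the cable, etc. ...
ceph osd unset noout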
Christian Balzer
2014-08-28 15:23:40 UTC
Permalink
Post by Mike Dawson
Post by Christian Balzer
Post by Loic Dachary
Do you see another blocking factor that
would significantly slow down recovery ?
As Craig and I keep telling you, the network is not the limiting
factor. Concurrent disk IO is, as I pointed out in the other thread.
Completely agree.
Thank you for that voice of reason, backing things up by a real life
sizable cluster. ^o^
Post by Mike Dawson
On a production cluster with OSDs backed by spindles, even with OSD
journals on SSDs, it is insufficient to calculate single-disk
replacement backfill time based solely on network throughput. IOPS will
likely be the limiting factor when backfilling a single failed spinner
in a production cluster.
Last week I replaced a 3TB 7200rpm drive that was ~75% full in a 72-osd
cluster, 24 hosts, rbd pool with 3 replicas, osd journals on SSDs (ratio
of 3:1), with dual 1GbE bonded NICs.
You're generous with your SSDs. ^o^
Post by Mike Dawson
Using only the throughput math, backfill could theoretically have
completed in a bit over 2.5 hours, but it actually took 15 hours. I've
done this a few times with similar results.
And that makes it about 40MB/s. Similar to what Craig is seeing with
increased backfills and what I speculated from my tests. ^o^
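As a quick sanity check of that ballpark from Mike's numbers (~75% of a 3 TB drive backfilled in 15 hours, decimal units):

echo 'scale=1; 0.75 * 3 * 10^6 / (15 * 3600)' | bc    # MB moved / seconds -> ~41.6 MB/s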
Post by Mike Dawson
Why? Spindle contention on the replacement drive. Graph the '%util'
metric from something like 'iostat -xt 2' during a single disk backfill
to get a very clear view that spindle contention is the true limiting
factor. It'll be pegged at or near 100% if spindle contention is the issue.
Precisely.

Along those lines I give you:
http://www.engadget.com/2014/08/26/seagate-8tb-hard-drive/

Which also makes me smirk, because the people telling me that a RAID
backed OSD is bad often cite the size of it and that it will take ages to
backfill (if it were to fail in the first place, and if one hadn't set the
configuration so that such a failure results in an automatic
re-balancing).
Because nothing I have deployed in that fashion or would consider to do so
is more than 3 times the size of that single disk.

Christian
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Craig Lewis
2014-08-27 20:45:46 UTC
Permalink
I am using GigE. I'm building a cluster using existing hardware, and the
network hasn't been my bottleneck (yet).

I've benchmarked the single disk recovery speed as about 50 MB/s, using max
backfills = 4, with SSD journals. If I go higher, the disk bandwidth
increases slightly, and the latency starts increasing.
At max backfills = 10, I regularly see OSD latency hit the 1 second mark.
With max backfills = 4, OSD latency is pretty much the same as max
backfills = 1. I haven't tested 5-9 yet.

I'm tracking latency by polling the OSD perf numbers every minute,
recording the delta from the previous poll, and calculating the average
latency over the last minute. Given that it's an average over the last
minute, a 1 second average latency is way too high. That usually means one
operation took > 30 seconds, and the other operations were mostly ok. It's
common to see blocked operations in ceph -w when latency is this high.
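One hedged way to collect per-OSD latency snapshots over time (not necessarily Craig's exact method; ceph osd perf reports per-OSD commit and apply latencies in milliseconds):

# Append a timestamped latency snapshot every minute; spikes here tend to
# line up with the blocked-request warnings seen in "ceph -w".
while true; do
    echo "== $(date -u +%FT%TZ)"
    ceph osd perf
    sleep 60
done >> osd-perf.log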


Using 50 MB/s for a single disk, that takes at least 14 hours to rebuild my
disks (4TB disk, 60% full). If I'm not sitting in front of the computer, I
usually only run 2 backfills. I'm very paranoid, due to some problems I
had early in the production release. Most of these problems were caused by
64k XFS inodes, not Ceph. But I have things working now, so I'm hesitant
to change anything. :-)
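As a rough check of that rebuild estimate (60% of a 4 TB disk at ~50 MB/s, decimal units):

echo 'scale=1; 0.6 * 4 * 10^6 / 50 / 3600' | bc    # -> ~13.3 hours of pure copy time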
Christian Balzer
2014-08-27 03:33:19 UTC
Permalink
Hello,
Post by Loic Dachary
Using percentages instead of numbers led me to calculation errors.
Here it is again using 1/100 instead of % for clarity ;-)
* The pool is configured for three replicas (size = 3 which is the default)
* It takes one hour for Ceph to recover from the loss of a single OSD
I think Craig and I have debunked that number.
It will be something like "that depends on many things starting with the
amount of data, the disk speeds, the contention (client and other ops),
the network speed/utilization, the actual OSD process and queue handling
speed, etc.".
If you want to make an assumption that's not an order of magnitude wrong,
start with 24 hours.

It would be nice to hear from people with really huge clusters like Dan at
CERN how their recovery speeds are.
Post by Loic Dachary
* Any other disk has a 1/100,000 chance to fail within the hour
following the failure of the first disk (assuming AFR
https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is
8%, divided by the number of hours during a year == (0.08 / 8760) ~=
1/100,000
* A given disk does not participate in more than 100 PG
You will find that the smaller the cluster, the more likely it is to be
higher than 100, due to rounding up or just upping things because the
distribution is too uneven otherwise.
Post by Loic Dachary
Each time an OSD is lost, there is a 1/100,000*1/100,000 =
1/10,000,000,000 chance that two other disks are lost before recovery.
Since the disk that failed initially participates in 100 PG, that is
1/10,000,000,000 x 100 = 1/100,000,000 chance that a PG is lost. Or the
entire pool if it is used in a way that losing a PG means losing all
data in the pool (as in your example, where it contains RBD volumes and
each of the RBD volumes uses all the available PG).
If the pool is using at least two datacenters operated by two different
organizations, this calculation makes sense to me. However, if the
cluster is in a single datacenter, isn't it possible that some event
independent of Ceph has a greater probability of permanently destroying
the data ? A month ago I lost three machines in a Ceph cluster and
realized on that occasion that the crushmap was not configured properly
and that PG were lost as a result. Fortunately I was able to recover the
disks and plug them in another machine to recover the lost PGs. I'm not
a system administrator and the probability of me failing to do the right
thing is higher than normal: this is just an example of a high
probability event leading to data loss. Another example would be if all
disks in the same PG are part of the same batch and therefore likely to
fail at the same time. In other words, I wonder if this 1/100,000,000 chance
of losing a PG within the hour following a disk failure matters or if it
is dominated by other factors. What do you think ?
Batch failures are real, I'm seeing that all the time.
However they tend to be still spaced out widely enough most of the time.
Still something to consider in a complete calculation.

As for failures other than disks, these tend to be recoverable, as you
experienced yourself. A node, rack, whatever failure might make your
cluster temporarily inaccessible (and thus should be avoided by proper
CRUSH maps and other precautions), but it will not lead to actual data
loss.

Regards,

Christian
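To make the dependence on the recovery window explicit, here is a hedged parameter sweep over the thread's assumptions (a 1/100,000 per-disk-hour failure rate, size = 3 so two further failures are needed, ~100 PGs on the failed OSD); it reproduces both the 1/100,000,000 figure for a one-hour window and Craig's roughly 1 in 43,000 for 48 hours:

# P(loss per OSD failure) ~= (p*R)^2 * 100, with p the hourly failure rate
# and R the recovery window in hours.
p=0.00001
for R in 1 2 24 48; do
    awk -v p="$p" -v R="$R" 'BEGIN { printf "R = %2d h  ->  about 1 in %.0f\n", R, 1/((p*R)^2 * 100) }'
done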
Post by Loic Dachary
Cheers
Post by Loic Dachary
* The pool is configured for three replicas (size = 3 which is the default)
* It takes one hour for Ceph to recover from the loss of a single OSD
* Any other disk has a 0.001% chance to fail within the hour following
the failure of the first disk (assuming AFR
https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is
10%, divided by the number of hours during a year).
* A given disk does not participate in more than 100 PG
Each time an OSD is lost, there is a 0.001*0.001 = 0.000001% chance
that two other disks are lost before recovery. Since the disk that
failed initially participates in 100 PG, that is 0.000001% x 100 =
0.0001% chance that a PG is lost. Or the entire pool if it is used in
a way that losing a PG means losing all data in the pool (as in your
example, where it contains RBD volumes and each of the RBD volumes uses
all the available PG).
If the pool is using at least two datacenters operated by two
different organizations, this calculation makes sense to me. However,
if the cluster is in a single datacenter, isn't it possible that some
event independent of Ceph has a greater probability of permanently
destroying the data ? A month ago I lost three machines in a Ceph
cluster and realized on that occasion that the crushmap was not
configured properly and that PG were lost as a result. Fortunately I
was able to recover the disks and plug them in another machine to
recover the lost PGs. I'm not a system administrator and the
this is just an example of a high probability event leading to data
loss. In other words, I wonder if this 0.0001% chance of losing a PG
within the hour following a disk failure matters or if it is dominated
by other factors. What do you think ?
Cheers
On 26/08/2014 15:25, Loic Dachary wrote:> Hi Blair,
Post by Blair Bethwaite
Message: 25
Date: Fri, 15 Aug 2014 15:06:49 +0200
From: Loic Dachary <loic at dachary.org>
To: Erik Logtenberg <erik at logtenberg.eu>, ceph-users at lists.ceph.com
Subject: Re: [ceph-users] Best practice K/M-parameters EC pool
Message-ID: <53EE05E9.1040105 at dachary.org>
Content-Type: text/plain; charset="iso-8859-1"
...
If the probability of loosing a disk is 0.1%, the probability of
loosing two disks simultaneously (i.e. before the failure can be
recovered) would be 0.1*0.1 = 0.01% and three disks becomes
0.1*0.1*0.1 = 0.001% and four disks becomes 0.0001%
I watched this conversation and an older similar one (Failure
probability with largish deployments) with interest as we are in the
process of planning a pretty large Ceph cluster (~3.5 PB), so I have
been trying to wrap my head around these issues.
Loic's reasoning (above) seems sound as a naive approximation assuming
independent probabilities for disk failures, which may not be quite
true given potential for batch production issues, but should be okay
for other sorts of correlations (assuming a sane crushmap that
eliminates things like controllers and nodes as sources of
correlation).
One of the things that came up in the "Failure probability with
largish deployments" thread and has raised its head again here is the
idea that striped data (e.g., RADOS-GW objects and RBD volumes) might
be somehow more prone to data-loss than non-striped. I don't think
anyone has so far provided an answer on this, so here's my thinking...
The level of atomicity that matters when looking at durability &
availability in Ceph is the Placement Group. For any non-trivial RBD
it is likely that many RBDs will span all/most PGs, e.g., even a
relatively small 50GiB volume would (with default 4MiB object size)
span 12800 PGs - more than there are in many production clusters
obeying the 100-200 PGs per drive rule of thumb. <IMPORTANT>Losing any
one PG will cause data-loss. The failure-probability effects of
striping across multiple PGs are immaterial considering that loss of
any single PG is likely to damage all your RBDs</IMPORTANT>. This
might be why the reliability calculator doesn't consider total number
of disks.
Related to all this is the durability of 2 versus 3 replicas (or e.g.
M>=1 for Erasure Coding). It's easy to get caught up in the worrying
fallacy that losing any M OSDs will cause data-loss, but this isn't
true - they have to be members of the same PG for data-loss to occur.
So then it's tempting to think the chances of that happening are so
slim as to not matter and why would we ever even need 3 replicas. I
mean, what are the odds of exactly those 2 drives, out of the
100,200... in my cluster, failing in <recovery window>?! But therein
lays the rub - you should be thinking about PGs. If a drive fails then
the chance of a data-loss event resulting are dependent on the chances
of losing further drives from the affected/degraded PGs.
I've got a real cluster at hand, so let's use that as an example. We
have 96 drives/OSDs - 8 nodes, 12 OSDs per node, 2 replicas, top-down
failure domains: rack pairs (x2), nodes, OSDs... Let's say OSD 15
$
109 109 861
(NB: 10 is the pool id, pg.dump is a text file dump of "ceph pg dump",
$15 is the acting set column)
109 PGs now "living on the edge". No surprises in that number as we
used 100 * 96 / 2 = 4800 to arrive at the PG count for this pool, so
on average any one OSD will be primary for 50 PGs and replica for
another 50. But this doesn't tell me how exposed I am; for that I need
the number of distinct other OSDs that share a PG with OSD 15:
$ grep "^10\." pg.dump | awk '{print $15}' | grep 15 |
    sed 's/\[15,\(.*\)\]/\1/' | sed 's/\[\(.*\),15\]/\1/' |
    sort | uniq | wc
67 67 193
(NB: grep-ing for OSD "15" and using sed to strip it and the
surrounding formatting, leaving just the neighbour ids)
Yikes! So if any one of those 67 drives fails during recovery of OSD
15, then we've lost data. You would expect that number to be
determined by the crushmap, which in this case splits the cluster into
2 top-level failure domains, so I'd have guessed roughly 48 candidate
drives (the other half of the cluster) on average. But actually,
looking at the numbers for each OSD, it is higher than that here - the
lowest distinct "neighbour" count we have is 50. Note that we haven't
tuned any of the options in our crushmap, so I guess maybe Ceph
favours fewer repeated sets by default when placing PGs(?).
Anyway, here's the average and top 10 neighbour counts (hope this
formatting survives):
$ for OSD in {0..95}; do echo -ne "$OSD\t"; grep "^10\." pg.dump |
    awk '{print $15}' | grep "\[${OSD},\|,${OSD}\]" |
    sed "s/\[$OSD,\(.*\)\]/\1/" | sed "s/\[\(.*\),$OSD\]/\1/" |
    sort | uniq | wc -l; done | awk '{ total += $2 } END { print total/NR }'
58.5208
$ for OSD in {0..95}; do echo -ne "$OSD\t"; grep "^10\." pg.dump |
    awk '{print $15}' | grep "\[${OSD},\|,${OSD}\]" |
    sed "s/\[$OSD,\(.*\)\]/\1/" | sed "s/\[\(.*\),$OSD\]/\1/" |
    sort | uniq | wc -l; done | sort -k2 -r | head
78 69
37 68
92 67
15 67
91 66
66 65
61 65
89 64
88 64
87 64
(OSD# Neighbour#)
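The same numbers can be pulled out with a few lines of Python instead
of the sed gymnastics above. This is only a sketch under the same
assumptions as the shell commands: a plain-text dump saved as pg.dump,
pool id 10, and the acting set in column 15 formatted like [15,73]:

# For every OSD, count how many distinct other OSDs share at least one
# PG with it (its "neighbours"), from a saved "ceph pg dump".
from collections import defaultdict

neighbours = defaultdict(set)

with open("pg.dump") as f:
    for line in f:
        if not line.startswith("10."):   # only PGs of pool 10
            continue
        acting = line.split()[14]        # column 15: acting set, e.g. [15,73]
        osds = [int(x) for x in acting.strip("[]").split(",")]
        for osd in osds:
            neighbours[osd].update(o for o in osds if o != osd)

counts = {osd: len(peers) for osd, peers in neighbours.items()}
print("average neighbours: %.4f" % (sum(counts.values()) / len(counts)))
for osd, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(osd, n)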
So, if I am getting this right, then at the end of the day __I think__
all this essentially boils down (sans CRUSH) to the number of possible
combinations (not permutations - order is irrelevant) of OSDs that can
be chosen. Making these numbers smaller is only possible by increasing
the size of that combination space - most obviously by adding replicas:
96 choose 2 = 4560
96 choose 3 = 142880
So basically with two replicas, if _any_ two disks fail within your
recovery window the chance of data-loss is high, because with 4800 PGs
drawn from only 4560 possible pairs almost every pair of OSDs shares
at least one PG. With three replicas that tapers off hugely, as we're
only utilising 4800 / 142880 * 100 ~= 3.4% of the potential PG space.
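A quick way to see that difference is to compare the number of PGs in
the pool with the number of distinct OSD sets that could be chosen. A
small sketch, using the 96-OSD / 4800-PG figures from this cluster
purely as illustrative inputs:

# Fraction of the possible replica sets actually "used" by the pool's
# PGs. If it is near (or above) 1, almost any combination of
# <replica-count> simultaneous disk failures will hit some PG.
from math import comb

osds = 96
pgs = 4800

for replicas in (2, 3):
    possible_sets = comb(osds, replicas)
    coverage = pgs / possible_sets
    print("size=%d: %d possible sets, coverage ~ %.1f%%"
          % (replicas, possible_sets, 100 * coverage))

(A coverage above 100% for size=2 just means there are more PGs than
distinct pairs, so some pairs of OSDs necessarily share several PGs.)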
I guess to some extent this holds true for the M values in EC pools
too. I hope some of this makes sense...? I'd love to see some of these
questions answered canonically by Inktank or Sage; if not, then
perhaps I'll see how far I get sticking this diatribe into the ICE
support portal...
Blair Bethwaite
2014-08-28 14:38:58 UTC
Permalink
Loic Dachary
2014-08-28 16:04:09 UTC
Permalink
Hi Blair,
Post by Blair Bethwaite
Hi Loic,
Thanks for the reply and interesting discussion.
I'm learning a lot :-)
Post by Blair Bethwaite
Post by Loic Dachary
Each time an OSD is lost, there is a 0.001*0.001 = 0.000001% chance that two other disks are lost before recovery. Since the disk that failed initially participates in 100 PGs, that is a 0.000001% x 100 = 0.0001% chance that a PG is lost.
Seems okay, so you're just taking the max PG spread as the worst case
(noting, as demonstrated with my numbers, that the spread could be
lower).
...actually, I could be way off here, but if the chance of any one
disk failing in that time is 0.0001%, then, assuming the first failure
has already happened, shouldn't it be more like
(0.0001% / 2) * 99 * (0.0001% / 2) * 98 ?
As you're essentially calculating the probability of one more disk out
of the remaining 99 failing, and then another out of the remaining 98
(and so on), within the repair window (dividing by the number of
remaining replicas for which the probability is being calculated, as
otherwise you'd be counting their chance of failure in the recovery
window multiple times). And of course this all assumes the recovery
continues gracefully from the remaining replica/s when another failure
occurs...?
That makes sense. I arbitrarily chose to ignore the probability of the first failure happening, because that event is not bounded in time. The second failure only matters if it happens within the interval it takes for the cluster to re-create the missing copies, and that seemed the more important part.
Post by Blair Bethwaite
Taking your followup correcting the base chances of failure into
account, that becomes
(99 * 1/100000 / 2) * (98 * 1/100000 / 2)
= 2.43e-7
or about 1 in 4.1 million
If a disk participates in 100 PGs with replica 3, it means there is a maximum of 200 other disks involved (if the cluster is large enough and the odds of two disks being used together in more than one PG are very low). You are assuming that this total is 100, which seems a reasonable approximation. I guess it could be verified by tests on a crushmap. However, it also means that the second failing disk probably shares 2 PGs with the first failing disk, in which case the 98 should rather be 2 (i.e. the number of PGs that are down to one replica as a result of the double failure).
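To make the two readings concrete, here is a small sketch that plugs
both into the same shape of calculation: the version above, where the
third failure can be any of the remaining ~98 disks, and one way of
reading the refinement, where only the couple of PGs already down to
one replica matter. The neighbour and shared-PG counts are the rough
figures from this discussion, not measured values:

# Compare the two estimates of "data loss given one OSD already
# failed" discussed above. All inputs are rough thread assumptions.
p = 1.0 / 100000        # chance a given disk fails during the recovery window

second = 99 * p / 2               # some other disk fails during recovery
any_disk = second * (98 * p / 2)  # third failure: any remaining disk
shared_pg = second * (2 * p / 2)  # third failure: only the ~2 PGs now
                                  # down to one replica expose a disk

print("any-third-disk estimate: %.3g (about 1 in %d)"
      % (any_disk, round(1 / any_disk)))
print("shared-PG-only estimate: %.3g (about 1 in %d)"
      % (shared_pg, round(1 / shared_pg)))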
Post by Blair Bethwaite
I'm also skeptical about the 1h recovery time - at the very least the
issues around stalling client ops come into play here and may push
max_backfills down for operational reasons (after all, you can't have
a general-purpose volume storage service that periodically spikes
latency due to normal operational tasks like recoveries).
If the cluster is overloaded (disk I/O, cluster network), re-creating the lost copies in less than 2h does indeed seem unlikely.
Post by Blair Bethwaite
Post by Loic Dachary
Or the entire pool, if it is used in a way where losing a PG means losing all data in the pool (as in your example, where it contains RBD volumes and each RBD volume uses all the available PGs).
Well, there's actually another whole interesting conversation in here
- assuming a decent filesystem is sitting on top of those RBDs, it
should be possible to get those filesystems back into working order
and identify any lost inodes, and then, if you've got a tape backup,
recover the lost files from it. BUT, if you have just one pool for
these RBDs spread over the entire cluster, then the amount of work to
do that fsck-ing quickly becomes problematic - you'd have to fsck
every RBD! So I wonder if there is cause for partitioning large
clusters into multiple pools, so that such a failure would (hopefully)
have a more limited scope. Backups for DR purposes are only worth
having (and paying for) if the DR plan is actually practical.
Post by Loic Dachary
If the pool is using at least two datacenters operated by two different organizations, this calculation makes sense to me. However, if the cluster is in a single datacenter, isn't it possible that some event independent of Ceph has a greater probability of permanently destroying the data? A month ago I lost three machines in a Ceph cluster and realized on that occasion that the crushmap was not configured properly and that PGs were lost as a result. Fortunately I was able to recover the disks and plug them into another machine to recover the lost PGs. I'm not a system administrator and the probability of me failing to do the right thing is higher than normal: this is just an example of a high-probability event leading to data loss. In other words, I wonder if this 0.0001% chance of losing a PG within the hour following a disk failure matters, or if it is dominated by other factors. What do you think?
I wouldn't expect that number to be dominated by the chances of
total-loss/godzilla events, but I'm no datacentre reliability guru (at
least we don't have Godzilla here in Melbourne yet). I couldn't
quickly find any stats on "one-in-one-hundred-year" events that might
actually destroy a datacentre. Availability is another question
altogether - as you probably know, the Uptime Institute has specific
figures for tiers 1-4. But in my mind you should expect datacentre
power outages as an operational (rather than disaster) event, and
you'd want your Ceph cluster to survive them unscathed. If that
Copysets paper mentioned a while ago has any merit (see
http://hackingdistributed.com/2014/02/14/chainsets/ for more on that),
then it seems like the chances of drive loss following an availability
event are much higher than normal.
:-)
Cheers
--
Loïc Dachary, Artisan Logiciel Libre