Discussion:
What is the maximum theoretical and practical capacity of a Ceph cluster?
Mike
2014-10-27 15:30:23 UTC
Hello,
My company is planning to build a big Ceph cluster for archiving and
storing data.
Per the customer's requirements, 70% of the capacity is SATA and 30% SSD.
On the first day data is stored on SSD storage; the next day it is moved to SATA storage.

For now we have decided to use a SuperMicro SKU with 72 HDD bays = 22 SSDs +
50 SATA drives.
Each of our racks can hold 10 of these servers, with 50 such racks in the Ceph
cluster = 36,000 OSDs.
With 4 TB SATA drives, replica = 2 and a nearfull ratio of 0.8 we have 40
petabytes of usable capacity.
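A rough sanity check of those figures (a back-of-the-envelope sketch in Python;
it assumes every one of the 50x10 servers is fully populated as described above):

# Back-of-the-envelope check of the 36,000 OSD and 40 PB figures.
racks = 50
servers_per_rack = 10
sata_per_server = 50      # 4 TB SATA drives
ssd_per_server = 22

osds = racks * servers_per_rack * (sata_per_server + ssd_per_server)
sata_drives = racks * servers_per_rack * sata_per_server

raw_tb = sata_drives * 4                 # 100,000 TB raw SATA capacity
usable_pb = raw_tb * 0.8 / 2 / 1000      # nearfull ratio 0.8, replica count 2

print(osds, usable_pb)                   # 36000 OSDs, 40.0 PB usable SATA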

Is this too big, or a normal use case for Ceph?
Wido den Hollander
2014-10-27 16:07:21 UTC
Post by Mike
Hello,
My company is planning to build a big Ceph cluster for archiving and
storing data.
Per the customer's requirements, 70% of the capacity is SATA and 30% SSD.
On the first day data is stored on SSD storage; the next day it is moved to SATA storage.
How are you planning on moving this data? Do you expect Ceph to do this?

What kind of access to Ceph are you planning on using? RBD? Raw RADOS?
The RADOS Gateway (S3/Swift)?
Post by Mike
For now we have decided to use a SuperMicro SKU with 72 HDD bays = 22 SSDs +
50 SATA drives.
Those are some serious machines. It will require a LOT of CPU power in
those machines to run 72 OSDs. Probably 4 CPUs per machine.
Post by Mike
Each of our racks can hold 10 of these servers, with 50 such racks in the Ceph
cluster = 36,000 OSDs.
36,000 OSDs shouldn't really be a problem, but you are thinking at a really
big scale here.
Post by Mike
With 4 TB SATA drives, replica = 2 and a nearfull ratio of 0.8 we have 40
petabytes of usable capacity.
Is this too big, or a normal use case for Ceph?
No, it's not too big for Ceph. This is what it was designed for. But a
setup like this shouldn't be taken lightly.

Think about the network connectivity required to connect all these
machines and other decisions to be made.

--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
Dan van der Ster
2014-10-27 16:32:37 UTC
Hi,
Post by Wido den Hollander
Post by Mike
Hello,
My company is planning to build a big Ceph cluster for archiving and
storing data.
Per the customer's requirements, 70% of the capacity is SATA and 30% SSD.
On the first day data is stored on SSD storage; the next day it is moved to SATA storage.
How are you planning on moving this data? Do you expect Ceph to do this?
What kind of access to Ceph are you planning on using? RBD? Raw RADOS?
The RADOS Gateway (S3/Swift)?
Post by Mike
For now we have decided to use a SuperMicro SKU with 72 HDD bays = 22 SSDs +
50 SATA drives.
Those are some serious machines. It will require a LOT of CPU power in
those machines to run 72 OSDs. Probably 4 CPUs per machine.
Post by Mike
Each of our racks can hold 10 of these servers, with 50 such racks in the Ceph
cluster = 36,000 OSDs.
36,000 OSDs shouldn't really be a problem, but you are thinking at a really
big scale here.
AFAIK, the OSDs should scale, since they only peer with ~100 others regardless of the cluster size. I wonder about the mons though -- 36,000 OSDs will send a lot of pg_stats updates, so the mons will have some work to keep up. But the main issue I foresee is on the clients: don't be surprised when you see that each client needs close to 100k threads when connected to this cluster. A hypervisor with 10 VMs running would approach 1 million threads -- I have no idea if that will present any problems. There were discussions about limiting the number of client threads, but I don't know if there was any progress on that yet.
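For what it's worth, the 100k figure is easy to reproduce if you assume roughly
two messenger threads per open OSD connection (that per-connection count is an
assumption, used only as a rough sketch):

# Rough client-side thread estimate; assumes ~2 messenger threads per OSD
# connection (e.g. one reader and one writer), which is an approximation.
osds = 36000
threads_per_osd_connection = 2

threads_per_client = osds * threads_per_osd_connection   # ~72,000 ("close to 100k")
threads_per_hypervisor = 10 * threads_per_client          # ~720,000 for 10 VMs
print(threads_per_client, threads_per_hypervisor)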

Anyway, it would be good to know if there are any current installations even close to this size (even in test). We are in the early days of planning a 10k OSD test, but haven't exceeded ~1,200 yet.

Cheers, Dan
Post by Wido den Hollander
Post by Mike
With 4 TB SATA drives, replica = 2 and a nearfull ratio of 0.8 we have 40
petabytes of usable capacity.
Is this too big, or a normal use case for Ceph?
No, it's not too big for Ceph. This is what it was designed for. But a
setup like this shouldn't be taken lightly.
Think about the network connectivity required to connect all these
machines and other decisions to be made.
--
Wido den Hollander
42on B.V.
Ceph trainer and consultant
Phone: +31 (0)20 700 9902
Skype: contact42on
Wido den Hollander
2014-10-27 17:23:43 UTC
Post by Dan van der Ster
Hi,
Post by Wido den Hollander
Post by Mike
Hello,
My company is planning to build a big Ceph cluster for archiving and
storing data.
Per the customer's requirements, 70% of the capacity is SATA and 30% SSD.
On the first day data is stored on SSD storage; the next day it is moved to SATA storage.
How are you planning on moving this data? Do you expect Ceph to do this?
What kind of access to Ceph are you planning on using? RBD? Raw RADOS?
The RADOS Gateway (S3/Swift)?
Post by Mike
For now we have decided to use a SuperMicro SKU with 72 HDD bays = 22 SSDs +
50 SATA drives.
Those are some serious machines. It will require a LOT of CPU power in
those machines to run 72 OSDs. Probably 4 CPUs per machine.
Post by Mike
Each of our racks can hold 10 of these servers, with 50 such racks in the Ceph
cluster = 36,000 OSDs.
36,000 OSDs shouldn't really be a problem, but you are thinking at a really
big scale here.
AFAIK, the OSDs should scale, since they only peer with ~100 others regardless of the cluster size. I wonder about the mons though -- 36,000 OSDs will send a lot of pg_stats updates, so the mons will have some work to keep up. But the main issue I foresee is on the clients: don't be surprised when you see that each client needs close to 100k threads when connected to this cluster. A hypervisor with 10 VMs running would approach 1 million threads -- I have no idea if that will present any problems. There were discussions about limiting the number of client threads, but I don't know if there was any progress on that yet.
True about the mons. 3 monitors will not cut it here. I think you need at
least 9 MONs, on dedicated resources.
Post by Dan van der Ster
Anyway, it would be good to know if there are any current installations even close to this size (even in test). We are in the early days of planning a 10k OSD test, but haven't exceeded ~1,200 yet.
Cheers, Dan
Post by Wido den Hollander
Post by Mike
With 4 TB SATA drives, replica = 2 and a nearfull ratio of 0.8 we have 40
petabytes of usable capacity.
Is this too big, or a normal use case for Ceph?
No, it's not too big for Ceph. This is what it was designed for. But a
setup like this shouldn't be taken lightly.
Think about the network connectivity required to connect all these
machines and other decisions to be made.
--
Wido den Hollander
42on B.V.
Ceph trainer and consultant
Phone: +31 (0)20 700 9902
Skype: contact42on
--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
Christian Balzer
2014-10-28 02:32:34 UTC
Post by Mike
Hello,
My company is planning to build a big Ceph cluster for archiving and
storing data.
Per the customer's requirements, 70% of the capacity is SATA and 30% SSD.
On the first day data is stored on SSD storage; the next day it is moved to SATA storage.
Lots of data movement. Is the design to store data on SSDs for the first
day done to assure fast writes from the clients?
Knowing the reason for this requirement would really help to find a
potentially more appropriate solution.
Post by Mike
For now we have decided to use a SuperMicro SKU with 72 HDD bays = 22 SSDs +
50 SATA drives.
I suppose you're talking about these:
http://www.supermicro.com.tw/products/system/4U/6047/SSG-6047R-E1R72L2K.cfm

Which is the worst thing that ever came out of Supermicro, IMNSHO.

Have you actually read the documentation and/or talked to a Supermicro
representative?

Firstly and most importantly, if you have to replace a failed disk the
other one on the same dual disk tray will also get disconnected from the
system. That's why they require you to run RAID all the time so pulling a
tray doesn't destroy your data.
But even then, the other affected RAID will of course have to rebuild
itself once you re-insert the tray.
And you can substitute RAID with OSD, doubling the impact of a failed disk
on your cluster.
The fact that they make you buy the complete system with IT mode
controllers also means that if you would want to do something like RAID6,
you'd be forced to do it in software.

Secondly, CPU requirements.
A purely HDD-based OSD (journal on the same HDD) requires about 1 GHz of
CPU power. So to make sure the CPU isn't your bottleneck, you'd need about
3,000 USD worth of CPUs (2x 10-core 2.6 GHz), and that's ignoring your SSDs.
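A quick check of that claim with the numbers above (the ~1 GHz per HDD OSD
figure is the rule of thumb, and the SSD OSDs are left out entirely):

# CPU budget for the 50 HDD OSDs alone, using the ~1 GHz per OSD rule of thumb.
hdd_osds = 50
ghz_per_hdd_osd = 1.0

ghz_needed = hdd_osds * ghz_per_hdd_osd   # ~50 GHz, SSD OSDs not counted
ghz_available = 2 * 10 * 2.6              # 52 GHz from 2x 10-core 2.6 GHz CPUs
print(ghz_needed, ghz_available)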

To get even remotely close to utilizing the potential speed of the SSDs, you don't
want more than 10-12 SSD-based OSDs per node, and you want to give that server the
highest total CPU GHz count you can afford.

Look at the "anti-cephalod question" thread in the ML archives for a
discussion of dense servers and all the recent threads about SSD
performance.

Lastly, even just the 50 HDD based OSDs will saturate a 10GbE link, never
mind the 22 SSDs. Building a Ceph cluster is a careful balancing act
between storage, network speeds and CPU requirements while also taking
density and budget into consideration.
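To illustrate that last point (a rough estimate; the ~100 MB/s per-HDD figure
is an assumption for sequential streaming, and SSD and replication traffic are
ignored):

# Why 50 HDD OSDs alone can outrun a single 10GbE link.
hdds = 50
mb_per_s_per_hdd = 100                     # assumed sequential throughput per drive

aggregate_mb_s = hdds * mb_per_s_per_hdd   # ~5,000 MB/s from the HDDs
ten_gbe_mb_s = 10000 / 8                   # ~1,250 MB/s on a single 10GbE link
print(aggregate_mb_s / ten_gbe_mb_s)       # roughly 4x oversubscribed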
Post by Mike
Each of our racks can hold 10 of these servers, with 50 such racks in the Ceph
cluster = 36,000 OSDs.
Others have already pointed out that this number can have undesirable
effects, but see more below.
Post by Mike
With 4 TB SATA drives, replica = 2 and a nearfull ratio of 0.8 we have 40
petabytes of usable capacity.
A replica of 2 with a purely SSD based pool can work, if you constantly
monitor those SSDs for wear level and replace them early before they fail.
Deploying those SSDs staggered would be a good idea to prevent having them
all need to be replaced at the same time. A sufficiently fast network to
replicate the data in a very short period is also a must.
But with your deployment goal of 11000(!) SSDs all in the same pool the
statistics are stacked against you. I'm sure somebody more versed than me
in these matters can run the exact numbers (which SSDs are you planning to
use?), but I'd be terrified.

And with 25,000 HDDs a replication factor of 2 is GUARANTEED to make you
lose data, probably a lot earlier in the life of your cluster than you
think. You'll be replacing several disks per day on average.

If there is no real need for SSDs, build your cluster with a simple 4U
24-drive server, put a fast RAID card (I like ARECA) in it and create two
11-disk RAID6 sets with 2 global spares, thus 2 OSDs.
Add an NVMe drive like the DC P3700 400GB for journals and OS, which will limit
one node to 1 GB/s writes, and that in turn would be a nice match for a 10GbE
network.
The combination of RAID card and NVMe (or 2 fast SSDs) will make this a
pretty snappy/speedy beast and as a bonus you'll likely never have to deal
with a failed OSD, just easily replaced failed disks in a RAID.
This will also bring your HDD OSD count down from 25,000 to roughly 2,800.
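One way to arrive at that figure (a sketch that assumes you keep the same
number of 4 TB data drives and absorb the RAID6-plus-spares overhead by adding
chassis):

# From ~25,000 individual HDD OSDs to RAID6-backed OSDs in the 24-bay layout
# above (2x 11-disk RAID6 + 2 global spares, i.e. 18 data disks per chassis).
data_drives_needed = 25000                 # 4 TB drives in the original design
data_drives_per_chassis = 2 * (11 - 2)     # two RAID6 sets, 9 data disks each

chassis = data_drives_needed / data_drives_per_chassis   # ~1,389 servers
osds = chassis * 2                                        # 2 OSDs per chassis
print(round(osds))                                        # ~2,778, i.e. about 2,800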

If you need denser storage, look at something like this:
http://www.45drives.com/ (there are other, similar products).
With this particular case I'd again put RAID controllers and a (fast,
2 GB/s) NVMe (or 2 slower ones) in it, for four 10-disk RAID6 sets with 5 spares.
Given the speed of the storage you will want a 2x10GbE bonded or Infiniband
link.

And if you really need SSD-backed pools, but don't want to risk data
loss, get a 2U case with 24 2.5" hotswap bays and run 2 RAID5s (2x 12-port
RAID cards). Add some fast CPUs (but you can get away with much less than
what you would need with 24 distinct OSDs) and you're golden.
This will nicely reduce your SSD OSD count from 11,000 to something in the
1,000+ range AND allow for a low-risk deployment with a replica size of 2.
And while not giving you as much performance as individual OSDs, it will
definitely be faster than your original design.
With something dense and fast like this you will probably want 40GbE or
Infiniband on the network side, though.
Post by Mike
Is this too big, or a normal use case for Ceph?
Not too big, but definitely needs a completely different design and lots of
forethought, planning and testing.

Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Mariusz Gronczewski
2014-10-28 09:33:55 UTC
Post by Christian Balzer
The fact that they make you buy the complete system with IT mode
controllers also means that if you would want to do something like RAID6,
you'd be forced to do it in software.
If you are using cheap consumer drives, you definitely DO want to use an
IT-mode controller (and software RAID if needed); non-RAID-designed
drives perform very poorly behind a hardware-level abstraction. We had an
LSI SAS 2208 (no IT-mode flash available) and it just turned disks off and
had problems with disk timeouts (the disks were shitty Seagate *DM001s, no
TLER), so it dropped whole drives from the RAID. And using MegaCli for
everything is not exactly ergonomic.

But yeah, 72 drives in 4U only makes sense if you use it for bulk
storage
--
Mariusz Gronczewski, Administrator

Efigence S. A.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
F: [+48] 22 380 13 14
E: ***@efigence.com
<mailto:***@efigence.com>
Nick Fisk
2014-10-28 09:58:45 UTC
I've been looking at various categories of disks and how the
performance/reliability/cost varies.

There seem to be five main SATA categories (WD disks given as examples), plus SAS:


Budget (WD Green - 5400 RPM, no TLER)
Desktop Drives (WD Blue - 7200 RPM, no TLER)
NAS Drives (WD Red - 5400 RPM, TLER)
Enterprise Capacity (WD Se - 7200 RPM, TLER)
Enterprise Performance (WD Re - 7200 RPM, TLER)
SAS Enterprise Performance

I would definitely not use the Green drives as they seem to park their heads
very frequently and seem to suffer high failure rates in enterprise
workloads.

The Blue drives I'm not sure about; they definitely can't be used in RAID
as they have very high error timeouts, but I don't know how CEPH handles
this and if it's worth the risk for the cheaper cost.

The Red drives are interesting; they are very cheap, and if performance is
not of top importance (cold storage/archive) they would seem to be a good
choice, as they are designed for 24x7 use and support a 7-second error timeout.

The two enterprise drives vary by performance, with the latter also costing
more. To be honest I don't see the point of the Capacity version; if you
don't need the extra performance, you would be better off going with the Red
drive.

And finally the SAS drive. For CEPH I don't see this drive making much
sense. Most manufacturers' enterprise SATA drives are identical to the SAS
version with just a different interface. Performance seems identical in
all comparisons I have seen, apart from the fact that SATA can only queue up
to 32 IOs (not sure how important this is). But they also command a price
premium.

In terms of the 72-disk chassis, if I were to use one (which I probably
wouldn't) I would design the cluster to tolerate high numbers of failures
before requiring replacement and then do large batch replacements every few
months. This would probably involve setting noout and shutting down each
server in turn to replace the disks, to work around the two-disks-per-tray
design.

Nick

-----Original Message-----
From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf Of
Mariusz Gronczewski
Sent: 28 October 2014 09:34
To: Christian Balzer
Cc: ceph-***@lists.ceph.com
Subject: Re: [ceph-users] What is the maximum theoretical and practical capacity
of a Ceph cluster?
Post by Christian Balzer
The fact that they make you buy the complete system with IT mode
controllers also means that if you would want to do something like
RAID6, you'd be forced to do it in software.
If you are using cheap consumer drives, you definitely DO want to use an IT-mode
controller (and software RAID if needed); non-RAID-designed drives
perform very poorly behind a hardware-level abstraction. We had an LSI SAS 2208
(no IT-mode flash available) and it just turned disks off and had problems with
disk timeouts (the disks were shitty Seagate *DM001s, no
TLER), so it dropped whole drives from the RAID. And using MegaCli for everything
is not exactly ergonomic.

But yeah, 72 drives in 4U only makes sense if you use it for bulk storage


--
Mariusz Gronczewski, Administrator

Efigence S. A.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
F: [+48] 22 380 13 14
E: ***@efigence.com
<mailto:***@efigence.com>
Craig Lewis
2014-10-28 22:41:48 UTC
Post by Nick Fisk
And finally the SAS drive. For CEPH I don't see this drive making much
sense. Most manufacturers' enterprise SATA drives are identical to the SAS
version with just a different interface. Performance seems identical in
all comparisons I have seen, apart from the fact that SATA can only queue up
to 32 IOs (not sure how important this is). But they also command a price
premium.
Anecdotal: I got a good deal on some new systems, including WD
Nearline SAS disks. It wasn't amazing, but the whole system was
cheaper than manually assembling a SuperMicro with some HGST SATA
disks. The SATA nodes have a battery-backed RAID0 setup. The SAS
nodes are using a SAS HBA (no write cache). All nodes' journals are
the same model of Intel SATA SSD, with no write caching.

My load test was snapshot trimming, and I noticed it from watching
atop. Completely quantifiable and repeatable ;-).

The SAS disks would consistently finish sooner than the SATA disks.
For an rmsnap that took ~2 hours to trim, the SAS disks would finish up
about 15 minutes sooner. Regardless of uneven data distribution, all
SAS disks were completely done trimming before the first SATA disk
started to ramp down its IOPS.

This is something I just noticed, so I haven't (yet) spent any time
trying to actually quantify.

I only noticed because the load was high enough to make the cluster
completely unresponsive. I have no idea if the difference will show
up under normal loads. I'm not even sure how I'm going to quantify
this, since the lack of write cache on the SAS disks makes the graphs
much harder to compare.

So far, the best I can say is that the SAS disks are "faster", even
without a write cache.
Mariusz Gronczewski
2014-10-28 10:05:41 UTC
Post by Nick Fisk
The Red drives are interesting; they are very cheap, and if performance is
not of top importance (cold storage/archive) they would seem to be a good
choice, as they are designed for 24x7 use and support a 7-second error timeout.
WD also has the Red Pro, which is basically a Red but at 7200 RPM and
slightly less expensive than the Re. We've been replacing our Seagate
Barracuda DM001s with these (don't get those Seagates, they are
horrible...).
--
Mariusz Gronczewski, Administrator

Efigence S. A.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
F: [+48] 22 380 13 14
E: ***@efigence.com
<mailto:***@efigence.com>
Christian Balzer
2014-10-29 00:12:58 UTC
Post by Mariusz Gronczewski
Post by Christian Balzer
The fact that they make you buy the complete system with IT mode
controllers also means that if you would want to do something like
RAID6, you'd be forced to do it in software.
If you are using cheap consumer drives, you definitely DO want to use an
IT-mode controller (and software RAID if needed); non-RAID-designed
drives perform very poorly behind a hardware-level abstraction. We had an
LSI SAS 2208 (no IT-mode flash available) and it just turned disks off and
had problems with disk timeouts (the disks were shitty Seagate *DM001s, no
TLER), so it dropped whole drives from the RAID. And using MegaCli for
everything is not exactly ergonomic.
Well, he wasn't telling us exactly which HDDs he was going to use. There
are some cheap drives (HGST and certain Toshiba models come to mind) that
behave rather well.
But I totally agree on the Seagate DM001 drives; they are the reason we no
longer consider buying Seagate for at least 1-2 drive generations.
Post by Mariusz Gronczewski
But yeah, 72 drives in 4U only makes sense if you use it for bulk
storage
That was really my point: 72 OSDs will wear out any and all dual-CPU
combinations there are, without cycles to spare for a software RAID6.

Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Robert van Leeuwen
2014-10-28 07:25:23 UTC
Post by Mike
For now we have decided to use a SuperMicro SKU with 72 HDD bays = 22 SSDs +
50 SATA drives.
Each of our racks can hold 10 of these servers, with 50 such racks in the Ceph
cluster = 36,000 OSDs.
With 4 TB SATA drives, replica = 2 and a nearfull ratio of 0.8 we have 40
petabytes of usable capacity.
Is this too big, or a normal use case for Ceph?
I'm a bit worried about the replica count:
The chance of 2 disks out of 25,000 failing at the same time becomes very significant (or a disk + server failure).
Without doing any math, my gut feeling says that 3 replicas is still not very comfortable (especially if the disks come from the same batch).

Cheers,
Robert van Leeuwen
Dan Van Der Ster
2014-10-28 07:46:30 UTC
Post by Robert van Leeuwen
Post by Mike
For now we have decided to use a SuperMicro SKU with 72 HDD bays = 22 SSDs +
50 SATA drives.
Each of our racks can hold 10 of these servers, with 50 such racks in the Ceph
cluster = 36,000 OSDs.
With 4 TB SATA drives, replica = 2 and a nearfull ratio of 0.8 we have 40
petabytes of usable capacity.
Is this too big, or a normal use case for Ceph?
The chance of 2 disks out of 25,000 failing at the same time becomes very significant (or a disk + server failure).
Without doing any math, my gut feeling says that 3 replicas is still not very comfortable (especially if the disks come from the same batch).
It doesn’t quite work like that. You’re not going to lose data if _any_ two disks out of 25000 fail. You’ll only lose data if two disks that are coupled in a PG are lost. So, while there are 25000^2 ways to lose two disks, there are only nPGs disk pairs that matter for data loss. Said another way, suppose you have one disk failed, what is the probability of losing data? Well, the data loss scenario is going to happen if one of the ~100 disks coupled with the failed disk also fails. So you see, the chance of data loss with 2 replicas is roughly equivalent whether you have 1000 OSDs or 25000 OSDs.
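To put rough numbers on that (purely illustrative; the AFR and the recovery window below are assumed values, not measurements):

# P(data loss | one OSD already failed) with 2 replicas; only the ~100 OSDs
# that share a PG with the failed disk matter, not the other ~24,900.
coupled_peers = 100          # OSDs coupled with the failed one via PGs
annual_failure_rate = 0.04   # assumed per-drive AFR
recovery_hours = 24          # assumed time to re-replicate the lost copies

p_peer_fails = annual_failure_rate * recovery_hours / (365 * 24)
p_loss = 1 - (1 - p_peer_fails) ** coupled_peers
print(p_loss)   # ~1% per failure event, independent of the total OSD count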

Cheers, Dan
Post by Robert van Leeuwen
Cheers,
Robert van Leeuwen
Christian Balzer
2014-10-28 08:30:34 UTC
Post by Dan Van Der Ster
On 28 Oct 2014, at 08:25, Robert van Leeuwen
Post by Mike
For now we have decided to use a SuperMicro SKU with 72 HDD bays = 22 SSDs
+ 50 SATA drives.
Each of our racks can hold 10 of these servers, with 50 such racks in the Ceph
cluster = 36,000 OSDs.
With 4 TB SATA drives, replica = 2 and a nearfull ratio of 0.8 we have
40 petabytes of usable capacity.
Is this too big, or a normal use case for Ceph?
The chance of 2 disks out of 25,000 failing at the same time becomes very
significant (or a disk + server failure). Without doing any math, my
gut feeling says that 3 replicas is still not very comfortable
(especially if the disks come from the same batch).
It doesn’t quite work like that. You’re not going to lose data if _any_
two disks out of 25000 fail. You’ll only lose data if two disks that are
coupled in a PG are lost. So, while there are 25000^2 ways to lose two
disks, there are only nPGs disk pairs that matter for data loss. Said
another way, suppose you have one disk failed, what is the probability
of losing data? Well, the data loss scenario is going to happen if one
of the ~100 disks coupled with the failed disk also fails. So you see,
the chance of data loss with 2 replicas is roughly equivalent whether
you have 1000 OSDs or 25000 OSDs.
We keep having that discussion here and are still lacking a fully realistic
model for this scenario. ^^
Though I seem to recall work is being done on the Ceph reliability
calculator.

Let's just say that with a replica of 2 and a set of 100 disks, all the
models and calculators I checked predict a data loss within a year.
That DL probability goes down from 99.99% to just 0.04% in a year (which I
would still consider too high) with a replica of 3.
That's why I never use more than 22 HDDs in a RAID6 and keep this at 10-12
for anything mission critical.

And having likely multiple (even if unrelated) OSD failures at the same
time can't be good for recovery times (increased risk) and cluster
performance either.

Christian
Post by Dan Van Der Ster
Cheers, Dan
Cheers,
Robert van Leeuwen
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Dan Van Der Ster
2014-10-28 08:52:44 UTC
Post by Christian Balzer
Post by Dan Van Der Ster
On 28 Oct 2014, at 08:25, Robert van Leeuwen
Post by Mike
For now we have decided to use a SuperMicro SKU with 72 HDD bays = 22 SSDs
+ 50 SATA drives.
Each of our racks can hold 10 of these servers, with 50 such racks in the Ceph
cluster = 36,000 OSDs.
With 4 TB SATA drives, replica = 2 and a nearfull ratio of 0.8 we have
40 petabytes of usable capacity.
Is this too big, or a normal use case for Ceph?
The chance of 2 disks out of 25,000 failing at the same time becomes very
significant (or a disk + server failure). Without doing any math, my
gut feeling says that 3 replicas is still not very comfortable
(especially if the disks come from the same batch).
It doesn’t quite work like that. You’re not going to lose data if _any_
two disks out of 25000 fail. You’ll only lose data if two disks that are
coupled in a PG are lost. So, while there are 25000^2 ways to lose two
disks, there are only nPGs disk pairs that matter for data loss. Said
another way, suppose you have one disk failed, what is the probability
of losing data? Well, the data loss scenario is going to happen if one
of the ~100 disks coupled with the failed disk also fails. So you see,
the chance of data loss with 2 replicas is roughly equivalent whether
you have 1000 OSDs or 25000 OSDs.
We keep having that discussion here and are still lacking a fully realistic
model for this scenario. ^^
Though I seem to recall work is being done on the Ceph reliability
calculator.
Perhaps I’ve missed those discussions, but this principle has been in the reliability calculator since forever; see “declustering”. There is no place to enter the number of OSDs, because it is just not relevant.

Just run ceph pg dump and look at those combinations of OSDs — those are the only simultaneous failures that matter. Every other combination of failures will not cause data loss. You can even write a Monte Carlo simulation of this — generate random pairs/triplets/etc. of OSDs and see if they would cause data loss. The probability of an M-way failure causing data loss will be nPGs/(nOSDs^M).
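Something along those lines could look like the sketch below (hypothetical code: the PG-to-OSD sets are generated randomly here, whereas in practice they would be parsed out of ceph pg dump):

import random

# Toy Monte Carlo of the "only coupled OSDs matter" argument. In a real test
# pg_osd_sets would be built from the output of ceph pg dump; here we just
# generate n_pgs random acting sets of size m over n_osds OSDs.
n_osds, n_pgs, m = 25000, 100000, 2
pg_osd_sets = {frozenset(random.sample(range(n_osds), m)) for _ in range(n_pgs)}

trials, losses = 1000000, 0
for _ in range(trials):
    failed = frozenset(random.sample(range(n_osds), m))   # a random m-way failure
    if failed in pg_osd_sets:            # all copies of some PG are on failed disks
        losses += 1

print(losses / trials)   # on the order of nPGs / nOSDs**m, as described above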
Post by Christian Balzer
Let's just say that with a replica of 2 and a set of 100 disks, all the
models and calculators I checked predict a data loss within a year.
That DL probability goes down from 99.99% to just 0.04% in a year (which I
would still consider too high) with a replica of 3.
That's why I never use more than 22 HDDs in a RAID6 and keep this at 10-12
for anything mission critical.
Yeah, 2 replicas doesn’t cut it — I agree on that. 3 is the minimum; actually, tolerance to 2 failures is the minimum (if you use EC, for example).

We first used 4 replicas in our RBD pool, but after realizing that not all OSDs are coupled together, we decreased to 3 replicas.
Post by Christian Balzer
And having likely multiple (even if unrelated) OSD failures at the same
time can't be good for recovery times (increased risk) and cluster
performance either.
Yes, I agree. I don’t really like that we can only limit max_backfills/recoveries per OSD. What we need is a cluster-wide limit. I.e. I don’t want more than 30-40 OSDs backfilling at once in my 1000-OSD RBD cluster. Otherwise the latency penalty gets annoying.

Cheers, Dan