Post by Mike
My company is planning to build a big Ceph cluster for archiving and
Per the customer's requirements, 70% of capacity is SATA and 30% is SSD.
Data is stored on SSD for the first day and moved to SATA storage the next day.
Lots of data movement. Is the design to store data on SSDs for the first
day done to assure fast writes from the clients?
Knowing the reason for this requirement would really help to find a
potentially more appropriate solution.
Post by Mike
For now we have decided to use a Supermicro SKU with 72 HDD bays = 22 SSDs +
50 SATA drives.
I suppose you're talking about these:
Which is the worst thing that ever came out from Supermicro, IMNSHO.
Have you actually read the documentation and/or talked to a Supermicro
Firstly and most importantly, if you have to replace a failed disk the
other one on the same dual disk tray will also get disconnected from the
system. That's why they require you to run RAID all the time so pulling a
tray doesn't destroy your data.
But even then, the other affected RAID will of course have to rebuild
itself once you re-insert the tray.
The same applies if you substitute OSDs for RAID: pulling a tray takes two
OSDs offline, doubling the impact of a single failed disk on your cluster.
The fact that they make you buy the complete system with IT mode
controllers also means that if you would want to do something like RAID6,
you'd be forced to do it in software.
Secondly, CPU requirements.
A purely HDD based OSD (journal on the same HDD) requires about 1GHz of
CPU power. So to make sure the CPU isn't your bottleneck, you'd need about
3000USD worth of CPUs (2x 10core 2.6GHz) but that's ignoring your SSDs.
To get even remotely close to utilizing the potential speed of SSDs, you
don't want more than 10-12 SSD-based OSDs per node, and you want to give
that server the highest total CPU GHz you can afford.
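
As a rough check of the CPU sizing above, here is a minimal sketch that applies the 1GHz-per-HDD-OSD rule of thumb to the 50 HDD OSDs of the proposed node and compares it with 2x 10 cores at 2.6GHz. The function names are made up for illustration, and the SSD OSDs are deliberately left out, just as in the estimate above.

```python
# Rule of thumb from this post: about 1 GHz of CPU per HDD-based OSD
# (journal on the same HDD). SSD-based OSDs need more and are ignored here.
GHZ_PER_HDD_OSD = 1.0


def cpu_ghz_available(sockets, cores_per_socket, ghz_per_core):
    """Total clock budget of a node, summed over all cores."""
    return sockets * cores_per_socket * ghz_per_core


def cpu_ghz_needed(hdd_osds, ghz_per_osd=GHZ_PER_HDD_OSD):
    """Clock budget the HDD OSDs alone are expected to consume."""
    return hdd_osds * ghz_per_osd


available = cpu_ghz_available(sockets=2, cores_per_socket=10, ghz_per_core=2.6)
needed = cpu_ghz_needed(hdd_osds=50)
headroom = available - needed

print(f"available: {available:.1f} GHz, needed for HDD OSDs: {needed:.1f} GHz")
print(f"left over for 22 SSD OSDs and everything else: {headroom:.1f} GHz")
```

The small remainder is the point: the HDD OSDs alone use up nearly the whole budget before a single SSD OSD is counted.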
Look at the "anti-cephalopod question" thread in the ML archives for a
discussion of dense servers and all the recent threads about SSD
Lastly, even just the 50 HDD based OSDs will saturate a 10GbE link, never
mind the 22 SSDs. Building a Ceph cluster is a careful balancing act
between storage, network speeds and CPU requirements while also taking
density and budget into consideration.
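
To put a number on the network side of that balancing act, the sketch below compares the aggregate streaming rate of the HDDs with the raw capacity of a 10GbE link. The per-disk figure of 100 MB/s is an assumption chosen for illustration, not a measurement from this thread, and protocol overhead and replication traffic are ignored.

```python
# Assumption (not from the thread): sustained sequential rate of one SATA HDD.
ASSUMED_HDD_MB_S = 100


def link_capacity_mb_s(gbit_per_s):
    """Raw link capacity in MB/s, ignoring protocol overhead."""
    return gbit_per_s * 1000 / 8


def aggregate_disk_mb_s(disks, mb_s_per_disk):
    """Combined streaming rate if every disk is busy at once."""
    return disks * mb_s_per_disk


link = link_capacity_mb_s(10)
hdd_total = aggregate_disk_mb_s(50, ASSUMED_HDD_MB_S)
oversubscription = hdd_total / link
disks_to_fill_link = link / ASSUMED_HDD_MB_S

print(f"10GbE link: {link:.0f} MB/s, 50 HDDs: {hdd_total:.0f} MB/s")
print(f"oversubscription: {oversubscription:.1f}x")
print(f"HDDs needed to fill the link: {disks_to_fill_link:.1f}")
```

Under this assumption roughly a dozen busy HDDs already fill the link, so the remaining disks and all of the SSDs would be waiting on the network.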
Post by Mike
Our racks can hold 10 of these servers, and 50 of these racks in the Ceph cluster =
Others have already pointed out that this number can have undesirable
effects, but see more below.
Post by Mike
With 4TB SATA drives, replica = 2 and nearfull ratio = 0.8, we have 40
petabytes of usable capacity.
A replica of 2 with a purely SSD based pool can work, if you constantly
monitor those SSDs for wear level and replace them early before they fail.
Deploying those SSDs staggered would be a good idea to prevent having them
all needing to be replaced at the same time. A sufficiently fast network to
replicate the data in a very short period is also a must.
But with your deployment goal of 11000(!) SSDs all in the same pool the
statistics are stacked against you. I'm sure somebody more versed than me
in these matters can run the exact numbers (which SSDs are you planning to
use?), but I'd be terrified.
And with 25000 HDDs a replication factor of 2 is GUARANTEED to make you
lose data, probably a lot earlier in the life of your cluster than you
think. You'll be replacing several disks per day on average.
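
The scale behind these numbers can be reconstructed from the figures quoted in this thread (10 servers per rack, 50 racks, 22 SSDs and 50 4TB HDDs per server, replica 2, nearfull ratio 0.8). The sketch below does that arithmetic and adds a daily failure estimate. The annualized failure rate in it is a placeholder assumption, not a figure from the thread, so substitute the rate you actually observe for your drive model.

```python
# Figures quoted in the thread.
SERVERS_PER_RACK = 10
RACKS = 50
HDDS_PER_SERVER = 50
SSDS_PER_SERVER = 22
HDD_TB = 4
REPLICAS = 2
NEARFULL_RATIO = 0.8

# Assumption, NOT from the thread: annualized failure rate of one HDD.
ASSUMED_HDD_AFR = 0.04

servers = SERVERS_PER_RACK * RACKS
hdd_count = servers * HDDS_PER_SERVER
ssd_count = servers * SSDS_PER_SERVER

# Decimal units: 1 PB = 1000 TB.
raw_pb = hdd_count * HDD_TB / 1000
usable_pb = raw_pb / REPLICAS * NEARFULL_RATIO

# Expected number of HDD replacements per day at the assumed failure rate.
hdd_failures_per_day = hdd_count * ASSUMED_HDD_AFR / 365

print(f"{servers} servers, {hdd_count} HDDs, {ssd_count} SSDs")
print(f"raw {raw_pb:.0f} PB, usable {usable_pb:.0f} PB")
print(f"expected HDD failures per day: {hdd_failures_per_day:.1f}")
```

With only two replicas, each of those failures opens a window in which a second failure on the wrong disk loses data, and at this disk count those windows recur daily.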
If there is no real need for SSDs, build your cluster with a simple 4U
24-drive server, put a fast RAID card (I like ARECA) in it and create two
11-disk RAID6 arrays with 2 global spares, thus 2 OSDs.
Add an NVMe like the DC P3700 400GB for journals and OS, which will limit
one node to 1GB/s writes, and that in turn would be a nice match for a 10GbE
The combination of RAID card and NVMe (or 2 fast SSDs) will make this a
pretty snappy/speedy beast and as a bonus you'll likely never have to deal
with a failed OSD, just easily replaced failed disks in a RAID.
This will also drive your OSD count for HDDs from 25000 to about 2800+.
If you need more dense storage, look at something like this
http://www.45drives.com/ (there are other, similar products).
With this particular case I'd again put RAID controllers and a (fast,
2GB/s) NVMe (or 2 slower ones) in it, for four 10-disk RAID6 arrays with 5
spares.
Given the speed of the storage you will want a 2x10GbE bonded or Infiniband
And if you really need SSD-backed pools but don't want to risk data
loss, get a 2U case with 24 2.5" hotswap bays and run 2 RAID5s (2x 12-port
RAID cards). Add some fast CPUs (but you can get away with much less than
what you would need with 24 distinct OSDs) and you're golden.
This will nicely reduce your SSD OSD count from 11000 to something in the
1000+ range AND allow for a low risk deployment with a replica size of 2.
And while not giving you as much performance as individual OSDs, it will
definitely be faster than your original design.
With something dense and fast like this you will probably want 40GbE or
Infiniband on the network side, though.
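
The OSD counts quoted above (25000 down to about 2800 for HDDs, 11000 down to about 1000 for SSDs) come out if each RAID array is assumed to replace as many single-disk OSDs as it has data disks, so that usable capacity stays the same. That is one reading of the numbers, not a stated method, and the helper name below is made up for illustration.

```python
import math


def osds_after_raid(single_disk_osds, disks_per_array, parity_disks):
    """OSD count when each OSD becomes one RAID array of equal usable capacity.

    Assumption: every array contributes (disks_per_array - parity_disks)
    disks' worth of capacity; hot spares are not counted.
    """
    data_disks = disks_per_array - parity_disks
    return math.ceil(single_disk_osds / data_disks)


# HDD design: 11-disk RAID6 arrays (2 parity disks each).
hdd_osds = osds_after_raid(25000, disks_per_array=11, parity_disks=2)

# SSD design: 12-disk RAID5 arrays (1 parity disk each).
ssd_osds = osds_after_raid(11000, disks_per_array=12, parity_disks=1)

print(f"HDD OSDs: 25000 -> {hdd_osds}")
print(f"SSD OSDs: 11000 -> {ssd_osds}")
```

Fewer, larger OSDs also mean fewer daemons to feed with CPU, which is why these designs get away with less CPU than one OSD per disk.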
Post by Mike
Is this too big, or is it a normal use case for Ceph?
Not too big, but definitely needs a completely different design and lots of
forethought, planning and testing.
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Fusion Communications