Discussion:
[ceph-users] Low traffic Ceph cluster with consumer SSD.
Anton Aleksandrov
2018-11-24 17:09:51 UTC
Hello community,

We are building a Ceph cluster on pretty old (but free) hardware. We will
have 12 nodes with 1 OSD per node and will migrate data from a single RAID5
setup, so our traffic is not very intense; we basically need more space
and the possibility to expand it.

We plan to have the data on a dedicated disk in each node, and my question is
about the WAL/DB for BlueStore. How bad would it be to place it on the
consumer-grade system SSD? How big is the risk that everything will get
"slower than using a spinning HDD for the same purpose"? And how big is the
risk that our nodes will die because of SSD lifespan?
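
To be concrete, the layout I have in mind is roughly the following; this is
only a sketch and the device paths are placeholders, not our real disks:

# Rough sketch of the intended OSD layout (placeholder device paths).
import subprocess

subprocess.run([
    "ceph-volume", "lvm", "create", "--bluestore",
    "--data", "/dev/sdb",        # dedicated spinning data disk
    "--block.db", "/dev/sda4",   # partition on the consumer system SSD (WAL lands there too)
], check=True)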

I am sorry for such an untechnical question.

Regards,
Anton.
Ashley Merrick
2018-11-25 01:45:38 UTC
As it's old consumer hardware, I am guessing you'll only be using 1Gbps for
the network.

If so, that will definitely be your bottleneck across the whole environment,
with both client and replication traffic sharing a single 1Gbps link.

Your SSDs will sit mostly idle; if you have 10Gbps, it's a different story.
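
Back-of-the-envelope, with assumed numbers rather than measurements:

# Rough comparison of a 1Gbps NIC against an assumed mid-range consumer SATA SSD.
nic_mb_s = 1000 / 8   # ~125 MB/s of raw line rate
ssd_mb_s = 400        # assumed sequential throughput of the SSD
print(f"NIC ~{nic_mb_s:.0f} MB/s vs SSD ~{ssd_mb_s} MB/s: "
      f"the network caps you at ~{nic_mb_s / ssd_mb_s:.0%} of the SSD")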

,Ash
Post by Anton Aleksandrov
Hello community,
We are building a Ceph cluster on pretty old (but free) hardware. We will
have 12 nodes with 1 OSD per node and will migrate data from a single RAID5
setup, so our traffic is not very intense; we basically need more space
and the possibility to expand it.
We plan to have the data on a dedicated disk in each node, and my question is
about the WAL/DB for BlueStore. How bad would it be to place it on the
consumer-grade system SSD? How big is the risk that everything will get
"slower than using a spinning HDD for the same purpose"? And how big is the
risk that our nodes will die because of SSD lifespan?
I am sorry for such an untechnical question.
Regards,
Anton.
Jesper Krogh
2018-11-25 07:07:58 UTC
We plan to have the data on a dedicated disk in each node, and my question is about the WAL/DB for BlueStore. How bad would it be to place it on the consumer-grade system SSD? How big is the risk that everything will get "slower than using a spinning HDD for the same purpose"? And how big is the risk that our nodes will die because of SSD lifespan?
The real risk is the lack of power loss protection. Data can be corrupted on unclean shutdowns.

Disabling cache may help
Vitaliy Filippov
2018-11-25 12:00:42 UTC
Post by Jesper Krogh
Post by Anton Aleksandrov
We plan to have the data on a dedicated disk in each node, and my question is
about the WAL/DB for BlueStore. How bad would it be to place it on the
consumer-grade system SSD? How big is the risk that everything will get
"slower than using a spinning HDD for the same purpose"? And how big is the
risk that our nodes will die because of SSD lifespan?
Just try it and tell us :) I can't imagine it being slower than colocated
db+wal+data.

It also depends on the exact SSD models, but a lot of SSDs (even consumer
ones) in fact survive 10-20 times more writes than claimed by the
manufacturer. Only some really cheap Chinese ones don't...

There's an article on 3DNews about it: https://3dnews.ru/938764/
Post by Jesper Krogh
The real risk is the lack of power loss protection. Data can be
corrupted on unclean shutdowns.
It's not! Lack of "advanced power loss protection" only means lower iops
with fsync, not the possibility of data corruption.

"Advanced power loss protection" is basically a synonym for
"non-volatile cache".
Post by Jesper Krogh
Disabling cache may help
It won't help on consumer SSDs, because for them (write + fsync) performance
is roughly the same as (write with the cache disabled).

Ceph always issues at least as many fsyncs as writes, so it's basically
always operating in "disk cache disabled" mode.

At the same time, disabling the disk write cache on enterprise SSDs
(hdparm -W 0) often increases random write iops by an order of magnitude.
I'm not sure why; maybe the kernel flushes the disk queue on every sync if
it thinks the disk cache is enabled...
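
If you want to see this on your own drives, a single-threaded write+fsync
loop already shows it. A rough sketch (fio with --fsync=1 is the usual tool;
the device path here is a placeholder):

# Minimal write+fsync iops probe: every write is followed by an fsync,
# which is roughly what Ceph's write pattern looks like to the disk.
import os, random, time

BLOCK = os.urandom(4096)
fd = os.open("/dev/sdX", os.O_WRONLY)      # placeholder; use a scratch device or test file
count, start = 0, time.time()
while time.time() - start < 10:            # run for ~10 seconds
    os.lseek(fd, random.randrange(0, 1 << 30, 4096), os.SEEK_SET)
    os.write(fd, BLOCK)
    os.fsync(fd)                           # forces a flush on honest drives
    count += 1
elapsed = time.time() - start
os.close(fd)
print(f"{count / elapsed:.0f} iops with an fsync after every 4k write")

Compare the number before and after "hdparm -W 0": on consumer drives it
barely changes, on enterprise drives it can jump a lot.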
--
With best regards,
Vitaliy Filippov
j***@krogh.cc
2018-11-25 14:07:00 UTC
Post by Vitaliy Filippov
Post by Jesper Krogh
The real risk is the lack of power loss protection. Data can be
corrupted on unclean shutdowns.
It's not! Lack of "advanced power loss protection" only means lower iops
with fsync, not the possibility of data corruption.
"Advanced power loss protection" is basically a synonym for
"non-volatile cache".
A few years ago it was pretty common knowledge that if a drive didn't have
capacitors, and thus power-loss protection, then an unexpected power-off
could lead to data-loss situations. Perhaps I'm not up to date with recent
developments. Is this a solved problem in consumer-grade SSDs today?
Any links to insight/testing/etc. would be welcome.

https://arstechnica.com/civis/viewtopic.php?f=11&t=1383499
at least does not support that viewpoint.

Jesper
Vitaliy Filippov
2018-11-25 14:17:47 UTC
Post by j***@krogh.cc
Post by Vitaliy Filippov
Post by Jesper Krogh
The real risk is the lack of power loss protection. Data can be
corrupted on unclean shutdowns.
It's not! Lack of "advanced power loss protection" only means lower iops
with fsync, not the possibility of data corruption.
"Advanced power loss protection" is basically a synonym for
"non-volatile cache".
A few years ago it was pretty common knowledge that if a drive didn't have
capacitors, and thus power-loss protection, then an unexpected power-off
could lead to data-loss situations. Perhaps I'm not up to date with recent
developments. Is this a solved problem in consumer-grade SSDs today?
Any links to insight/testing/etc. would be welcome.
https://arstechnica.com/civis/viewtopic.php?f=11&t=1383499
at least does not support that viewpoint.
All disks (HDDs and SSDs) have a cache and may lose non-transactional writes
that are in flight. However, any adequate disk handles fsyncs (i.e. SATA
FLUSH CACHE commands), so transactional writes should never be lost, and
in Ceph ALL writes are transactional: Ceph issues fsyncs all the time.
Another example is DBMSes: they also issue an fsync when you COMMIT.
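
A tiny sketch of that commit pattern (file-based and simplified for
illustration; the helper name is made up):

# Data counts as committed only once fsync (i.e. the flush) has returned,
# which is the same idea as a DBMS COMMIT.
import os

def commit_write(path: str, payload: bytes) -> None:
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, payload)
        os.fsync(fd)          # flush the drive's volatile cache for the data
    finally:
        os.close(fd)
    os.rename(tmp, path)      # atomically publish the new version
    dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)       # persist the rename itself
    finally:
        os.close(dirfd)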
--
With best regards,
Vitaliy Filippov
Vitaliy Filippov
2018-11-25 14:22:48 UTC
Post by Vitaliy Filippov
Ceph issues fsyncs all the time
...and, of course, it has journaling :) (fsync alone is of course not
sufficient)

With enterprise SSDs that have capacitors, fsync just becomes a no-op, and
thus transactional write performance becomes the same as non-transactional
(i.e. 10+ times faster for 4k random writes).
--
With best regards,
Vitaliy Filippov
Jesper Krogh
2018-11-25 15:39:20 UTC
All disks (HDDs and SSDs) have a cache and may lose non-transactional writes that are in flight. However, any adequate disk handles fsyncs (i.e. SATA FLUSH CACHE commands), so transactional writes should never be lost, and in Ceph ALL writes are transactional: Ceph issues fsyncs all the time. Another example is DBMSes: they also issue an fsync when you COMMIT.
https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf

This may have changed since 2013, but the common understanding is that the cache needs to be disabled to ensure that flushes are persistent, and that disabling the cache on an SSD is either not honored by the firmware or makes write performance plummet.

Which is why enterprise disks have power loss protection in the form of capacitors.

Again, any links/info saying otherwise are very welcome.

Jesper
Виталий Филиппов
2018-11-25 17:23:20 UTC
OK... That's better than the previous thread with the file download, where the topic starter suffered from a normal only-metadata-journaled fs... Thanks for the link, it would be interesting to repeat similar tests. Although I suspect it shouldn't be that bad... at least not all desktop SSDs are that broken; for example, https://engineering.nordeus.com/power-failure-testing-with-ssds/ says the Samsung 840 Pro is OK.
--
With best regards,
Vitaliy Filippov
Eneko Lacunza
2018-11-26 11:02:44 UTC
Hi,
Post by Виталий Филиппов
OK... That's better than the previous thread with the file download, where
the topic starter suffered from a normal only-metadata-journaled fs...
Thanks for the link, it would be interesting to repeat similar tests.
Although I suspect it shouldn't be that bad... at least not all
desktop SSDs are that broken; for example,
https://engineering.nordeus.com/power-failure-testing-with-ssds/ says
the Samsung 840 Pro is OK.
Except that Ceph performance with that SSD model is very, very bad. We had
one of those repurposed for Ceph and had to run out and buy an Intel
enterprise SSD to replace it.

Don't even try :)

Cheers
Eneko
--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarraga bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
Vitaliy Filippov
2018-11-25 18:36:23 UTC
At least when I run a simple O_SYNC random 4k write test with a random
Intel 545s SSD plugged in through a USB3-SATA adapter (UASP), pull the USB
cable out and then recheck the written data, everything is good and nothing
is lost (however, iops are of course low, 1100-1200).
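
Roughly the kind of test script I mean (a sketch, not my exact one; the
device path and test span are placeholders):

# O_SYNC 4k random writes; every acknowledged block is logged with its CRC so
# that after the cable is pulled and the disk replugged it can be verified.
import os, random, zlib

DEV, SPAN = "/dev/sdX", 1 << 30                # placeholder device, 1 GiB span
fd = os.open(DEV, os.O_WRONLY | os.O_SYNC)
with open("acked.log", "a") as log:
    while True:                                # runs until the cable is pulled
        off = random.randrange(0, SPAN, 4096)
        block = os.urandom(4096)
        os.pwrite(fd, block, off)              # returns only after the O_SYNC write is acked
        log.write(f"{off} {zlib.crc32(block)}\n")
        log.flush()
        os.fsync(log.fileno())                 # the log itself must survive too

# After replugging, verify every acknowledged block (later writes to the same
# offset win):
#   latest = {}
#   for line in open("acked.log"):
#       off, crc = map(int, line.split())
#       latest[off] = crc
#   rfd = os.open(DEV, os.O_RDONLY)
#   for off, crc in latest.items():
#       assert zlib.crc32(os.pread(rfd, 4096, off)) == crc, f"lost write at {off}"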
--
With best regards,
Vitaliy Filippov
Martin Verges
2018-11-26 07:13:58 UTC
Hello Anton,

we have had some bad experience with consumer disks. They tend to fail quite
early and sometimes have extremely poor performance in Ceph workloads.
If possible, spend some money on reliable Samsung PM/SM863a SSDs. However,
one of our customers uses the WD Blue 1TB SSDs and seems to be quite happy with them.

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: ***@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Sat, 24 Nov 2018 at 18:10, Anton Aleksandrov wrote:
Post by Anton Aleksandrov
Hello community,
We are building a Ceph cluster on pretty old (but free) hardware. We will
have 12 nodes with 1 OSD per node and will migrate data from a single RAID5
setup, so our traffic is not very intense; we basically need more space
and the possibility to expand it.
We plan to have the data on a dedicated disk in each node, and my question is
about the WAL/DB for BlueStore. How bad would it be to place it on the
consumer-grade system SSD? How big is the risk that everything will get
"slower than using a spinning HDD for the same purpose"? And how big is the
risk that our nodes will die because of SSD lifespan?
I am sorry for such an untechnical question.
Regards,
Anton.