Discussion:
[ceph-users] SSDs for journals vs SSDs for a cache tier, which is better?
Piotr Wachowicz
2016-02-16 17:56:43 UTC
Hey,

Which one's "better": to use SSDs for storing journals, vs to use them as a
writeback cache tier? All other things being equal.

The use case is a 15 OSD-node cluster, with 6 HDDs and 1 SSD per node.
Used for block storage for a typical 20-hypervisor OpenStack cloud (with a
bunch of VMs running Linux). 10GigE public net + 10GigE replication
network.

Let's consider both cases:
Journals on SSDs - for writes, the write operation returns right after data
lands on the Journal's SSDs, but before it's written to the backing HDD.
So, for writes, SSD journal approach should be comparable to having a SSD
cache tier. In both cases we're writing to an SSD (and to replica's SSDs),
and returning to the client immediately after that. Data is only flushed to
HDD later on.

However for reads (of hot data) I would expect a SSD Cache Tier to be
faster/better. That's because, in the case of having journals on SSDs, even
if data is in the journal, it's always read from the (slow) backing disk
anyway, right? But with a SSD cache tier, if the data is hot, it would be
read from the (fast) SSD.

I'm sure both approaches have their own merits, and might be better for
some specific tasks, but with all other things being equal, I would expect
that using SSDs as the "Writeback" cache tier should, on average, provide
better performance than using the same SSDs for Journals. Specifically in
the area of read throughput/latency.

The main difference, I suspect, between the two approaches is that in the
case of multiple HDDs (multiple ceph-osd processes), all of those processes
share access to the same shared SSD storing their journals. Whereas it's
likely not the case with Cache tiering, right? Though I must say I failed
to find any detailed info on this. Any clarification will be appreciated.

So, is the above correct, or am I missing some pieces here? Any other major
differences between the two approaches?

Thanks.
P.
Christian Balzer
2016-02-17 04:22:01 UTC
Hello,
Post by Piotr Wachowicz
Hey,
Which one's "better": to use SSDs for storing journals, vs to use them
as a writeback cache tier? All other things being equal.
Pears are better than either oranges or apples. ^_-
Post by Piotr Wachowicz
The usecase is a 15 osd-node cluster, with 6 HDDs and 1 SSDs per node.
Used for block storage for a typical 20-hypervisor OpenStack cloud (with
bunch of VMs running Linux). 10GigE public net + 10 GigE replication
network.
Journals on SSDs - for writes, the write operation returns right after
data lands on the Journal's SSDs, but before it's written to the backing
HDD. So, for writes, SSD journal approach should be comparable to having
a SSD cache tier.
Not quite, see below.
Post by Piotr Wachowicz
In both cases we're writing to an SSD (and to
replica's SSDs), and returning to the client immediately after that.
Data is only flushed to HDD later on.
Correct; note that the flushing is done by the OSD process submitting
this write to the underlying device/FS.
It doesn't go from the journal to the OSD storage device, which has the
implication that with default settings and plain HDDs you quickly wind up
being limited to what your actual HDDs can handle in a sustained
manner.
Post by Piotr Wachowicz
However for reads (of hot data) I would expect a SSD Cache Tier to be
faster/better. That's because, in the case of having journals on SSDs,
even if data is in the journal, it's always read from the (slow) backing
disk anyway, right? But with a SSD cache tier, if the data is hot, it
would be read from the (fast) SSD.
It will be read from the even faster pagecache if it is a sufficiently hot
object and you have sufficient RAM.
Post by Piotr Wachowicz
I'm sure both approaches have their own merits, and might be better for
some specific tasks, but with all other things being equal, I would
expect that using SSDs as the "Writeback" cache tier should, on average,
provide better performance than using the same SSDs for Journals.
Specifically in the area of read throughput/latency.
Cache tiers (currently) only work well if all your hot data fits into them,
in which case you'd be even better off with a dedicated SSD pool for that
data.

Because (currently) Ceph has to promote a full object (4MB by default) to
the cache for each operation, be it read or write.
That means the first time you want to read a 2KB file in your RBD backed
VM, Ceph has to copy 4MB from the HDD pool to the SSD cache tier.
This of course has a significant impact on read performance; in my crappy
test cluster, reading cold data is half as fast as using the actual
non-cached HDD pool.

And once your cache pool has to evict objects because it is getting full,
it has to write out 4MB for each such object to the HDD pool.
Then read it back in later, etc.
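(For reference, a writeback tier of the kind discussed here is wired up
roughly as follows; the pool names and thresholds are made-up examples,
not a recommendation:)
---
# attach an SSD-backed pool as a writeback cache in front of an HDD pool
ceph osd tier add rbd rbd-cache
ceph osd tier cache-mode rbd-cache writeback
ceph osd tier set-overlay rbd rbd-cache
# the tiering agent needs hit sets and flush/evict targets to work from
ceph osd pool set rbd-cache hit_set_type bloom
ceph osd pool set rbd-cache target_max_bytes 500000000000
ceph osd pool set rbd-cache cache_target_dirty_ratio 0.4
ceph osd pool set rbd-cache cache_target_full_ratio 0.8
---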
Post by Piotr Wachowicz
The main difference, I suspect, between the two approaches is that in the
case of multiple HDDs (multiple ceph-osd processes), all of those
processes share access to the same shared SSD storing their journals.
Whereas it's likely not the case with Cache tiering, right? Though I
must say I failed to find any detailed info on this. Any clarification
will be appreciated.
In your specific case writes to the OSDs (HDDs) will be (at least) 50%
slower if your journals are on disk instead of the SSD.
(Which SSDs do you plan to use anyway?)
I don't think you'll be happy with the resulting performance.

Christian.
Post by Piotr Wachowicz
So, is the above correct, or am I missing some pieces here? Any other
major differences between the two approaches?
Thanks.
P.
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
Piotr Wachowicz
2016-02-17 09:04:11 UTC
Thanks for your reply.
Post by Christian Balzer
Post by Piotr Wachowicz
Journals on SSDs - for writes, the write operation returns right after
data lands on the Journal's SSDs, but before it's written to the backing
HDD. So, for writes, SSD journal approach should be comparable to having
a SSD cache tier.
Not quite, see below.
Could you elaborate a bit more?

Are you saying that with a Journal on a SSD writes from clients, before
they can return from the operation to the client, must end up on both the
SSD (Journal) *and* HDD (actual data store behind that journal)? I was
under the impression that one of the benefits of having a journal on a SSD
is deferring the write to the slow HDD to a later time, until after the
write call returns to the client. Is that not the case? If so, that would
mean SSD cache tier should be much faster in terms of write latency than
SSD journal.
Post by Christian Balzer
In your specific case writes to the OSDs (HDDs) will be (at least) 50%
slower if your journals are on disk instead of the SSD.
Is that because of the above -- with the Journal on the same disk (HDD) as
the data, writes have to be written twice (assuming no btrfs/zfs CoW) to the
HDD (journal, and data)? Whereas with a Journal on the SSD, the write to the
Journal can be done in parallel with the write to the HDD? (But still
both of those have to be completed before the write operation returns to
the client.)
Post by Christian Balzer
(Which SSDs do you plan to use anyway?)
Intel DC S3700


Thanks,
Piotr
Christian Balzer
2016-02-17 11:07:55 UTC
Hello,
Post by Piotr Wachowicz
Thanks for your reply.
Post by Christian Balzer
Post by Piotr Wachowicz
Journals on SSDs - for writes, the write operation returns right
after data lands on the Journal's SSDs, but before it's written to
the backing HDD. So, for writes, SSD journal approach should be
comparable to having a SSD cache tier.
Not quite, see below.
Could you elaborate a bit more?
Are you saying that with a Journal on a SSD writes from clients, before
they can return from the operation to the client, must end up on both the
SSD (Journal) *and* HDD (actual data store behind that journal)?
No, your initial statement is correct.

However that burst of speed doesn't last indefinitely.

Aside from the size of the journal (which is incidentally NOT the most
limiting factor) there are various "filestore" parameters in Ceph, in
particular the sync interval ones.
There was a more in-depth explanation of this by a developer on this ML;
try your google-fu.

For short bursts of activity, the journal helps a LOT.
If you send a huge number of for example 4KB writes to your cluster, the
speed will eventually (after a few seconds) go down to what your backing
storage (HDDs) are capable of sustaining.
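(For reference, the knobs in question live in the [osd] section of
ceph.conf; the values below are roughly the stock defaults as I recall
them, shown only to illustrate where the behaviour comes from, not as
tuning advice:)
---
[osd]
# how often the filestore flushes journaled writes to the backing FS
filestore min sync interval = 0.01
filestore max sync interval = 5
# per-journal write throttling
journal max write bytes   = 10485760
journal max write entries = 100
---
Raising the sync interval only buys you a longer burst before the same
HDD limit kicks in; it doesn't change the sustained rate.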
Post by Piotr Wachowicz
I was
under the impression that one of the benefits of having a journal on a
SSD is deferring the write to the slow HDD to a later time, until after
the write call returns to the client. Is that not the case? If so, that
would mean SSD cache tier should be much faster in terms of write
latency than SSD journal.
Post by Christian Balzer
In your specific case writes to the OSDs (HDDs) will be (at least) 50%
slower if your journals are on disk instead of the SSD.
Is that because of the above -- with Journal on the same disk (HDD) as
the data, writes have to be written twice (assuming no btrfs/zfs cow) to
the HDD (journal, and data). Whereas with a Journal on the SSD write to
the Journal and disk can be done in parallel with write to the HDD?
Yes, as far as the doubling of the I/O and thus the halving of speed is
concerned. Even with disk based journals the ACK of course happens when
ALL journal OSDs have done their writing.
Post by Piotr Wachowicz
(But
still both of those have to be completed before the write operation
returns to the client).
See above, eventually, kind-a-sorta.
Post by Piotr Wachowicz
Post by Christian Balzer
(Which SSDs do you plan to use anyway?)
Intel DC S3700
Good choice, with the 200GB model prefer the 3700 over the 3710 (higher
sequential write speed).

Christian
Post by Piotr Wachowicz
Thanks,
Piotr
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
Stephen Harker
2016-03-16 16:22:06 UTC
Post by Christian Balzer
Post by Piotr Wachowicz
Post by Christian Balzer
Post by Piotr Wachowicz
Journals on SSDs - for writes, the write operation returns right
after data lands on the Journal's SSDs, but before it's written to
the backing HDD. So, for writes, SSD journal approach should be
comparable to having a SSD cache tier.
Not quite, see below.
Could you elaborate a bit more?
Are you saying that with a Journal on a SSD writes from clients, before
they can return from the operation to the client, must end up on both the
SSD (Journal) *and* HDD (actual data store behind that journal)?
No, your initial statement is correct.
However that burst of speed doesn't last indefinitely.
Aside from the size of the journal (which is incidentally NOT the most
limiting factor) there are various "filestore" parameters in Ceph, in
particular the sync interval ones.
There was a more in-depth explanation by a developer about this in this ML,
try your google-foo.
For short bursts of activity, the journal helps a LOT.
If you send a huge number of for example 4KB writes to your cluster, the
speed will eventually (after a few seconds) go down to what your backing
storage (HDDs) are capable of sustaining.
Post by Piotr Wachowicz
Post by Christian Balzer
(Which SSDs do you plan to use anyway?)
Intel DC S3700
Good choice, with the 200GB model prefer the 3700 over the 3710 (higher
sequential write speed).
Hi All,

I am looking at using PCI-E SSDs as journals in our (4) Ceph OSD nodes,
each of which has 6 4TB SATA drives within. I had my eye on these:

400GB Intel P3500 DC AIC SSD, HHHL PCIe 3.0

but reading through this thread, it might be better to go with the P3700
given the improved iops. So a couple of questions.

* Are the PCI-E versions of these drives different in any other way than
the interface?

* Would one of these as a journal for 6 4TB OSDs be overkill
(connectivity is 10GE, or will be shortly anyway), or would the SATA S3700
be sufficient?

Given they're not hot-swappable, it'd be good if they didn't wear out in
6 months too.

I realise I've not given you much to go on and I'm Googling around as
well; I'm really just asking in case someone has tried this already and
has some feedback or advice.

Thanks! :)

Stephen
--
Stephen Harker
Chief Technology Officer
The Positive Internet Company.
--
All postal correspondence to:
The Positive Internet Company, 24 Ganton Street, London. W1F 7QY

*Follow us on Twitter* @posipeople

The Positive Internet Company Limited is registered in England and Wales.
Registered company number: 3673639. VAT no: 726 7072 28.
Registered office: Northside House, Mount Pleasant, Barnet, Herts, EN4 9EE.
Nick Fisk
2016-03-16 16:37:54 UTC
-----Original Message-----
Stephen Harker
Sent: 16 March 2016 16:22
Subject: Re: [ceph-users] SSDs for journals vs SSDs for a cache tier,
which is
better?
Post by Christian Balzer
Post by Piotr Wachowicz
Post by Christian Balzer
Post by Piotr Wachowicz
Journals on SSDs - for writes, the write operation returns right
after data lands on the Journal's SSDs, but before it's written
to the backing HDD. So, for writes, SSD journal approach should
be comparable to having a SSD cache tier.
Not quite, see below.
Could you elaborate a bit more?
Are you saying that with a Journal on a SSD writes from clients,
before they can return from the operation to the client, must end up
on both the SSD (Journal) *and* HDD (actual data store behind that
journal)?
No, your initial statement is correct.
However that burst of speed doesn't last indefinitely.
Aside from the size of the journal (which is incidentally NOT the most
limiting factor) there are various "filestore" parameters in Ceph, in
particular the sync interval ones.
There was a more in-depth explanation by a developer about this in
this ML, try your google-foo.
For short bursts of activity, the journal helps a LOT.
If you send a huge number of for example 4KB writes to your cluster,
the speed will eventually (after a few seconds) go down to what your
backing storage (HDDs) are capable of sustaining.
Post by Piotr Wachowicz
Post by Christian Balzer
(Which SSDs do you plan to use anyway?)
Intel DC S3700
Good choice, with the 200GB model prefer the 3700 over the 3710
(higher sequential write speed).
Hi All,
I am looking at using PCI-E SSDs as journals in our (4) Ceph OSD nodes,
each
400GB Intel P3500 DC AIC SSD, HHHL PCIe 3.0
but reading through this thread, it might be better to go with the P3700
given
the improved iops. So a couple of questions.
* Are the PCI-E versions of these drives different in any other way than
the
interface?
Yes and no. Internally they are probably not much different, but the
NVMe/PCIe interface is a lot faster than SATA/SAS, both in terms of minimum
latency and bandwidth.
* Would one of these as a journal for 6 4TB OSDs be overkill (connectivity
is
10GE, or will be shortly anyway), would the SATA S3700 be sufficient?
Again, it depends on your use case. The S3700 may suffer if you are doing
large sequential writes; it might not have a high enough sequential write
speed and will become the bottleneck. Six disks could potentially take around
500-700MB/s of writes. A P3700 will have enough and will give slightly lower
write latency as well, if this is important. You may even be able to run more
than 6 HDD OSDs on it if needed.
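(If in doubt, the usual sanity check before committing to a journal
device is a direct, synced small-block write test against it -- a sketch
below; adjust the device path, and note it overwrites whatever is on that
device. A journal does O_DSYNC writes, which is what this simulates, and
plenty of SSDs that look fast on paper fall over under exactly this
pattern.)
---
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=journal-test
---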
Given they're not hot-swappable, it'd be good if they didn't wear out in
6 months too.
They probably won't, unless you are doing some really extreme write
workloads, and even then I would imagine they would last 1-2 years.
I realise I've not given you much to go on and I'm Googling around as
well, I'm
really just asking in case someone has tried this already and has some
feedback or advice..
That's OK. I'm currently running S3700 100GBs on the current cluster, and
the new cluster that's in the planning stages will be using the 400GB
P3700s.
Thanks! :)
Stephen
--
Stephen Harker
Chief Technology Officer
The Positive Internet Company.
Heath Albritton
2016-03-16 16:58:15 UTC
The rule of thumb is to match the journal throughput to the OSD throughput. I'm seeing ~180MB/s sequential write on my OSDs and I'm using one of the P3700 400GB units per six OSDs. The 400GB P3700 yields around 1200MB/s* and has around 1/10th the latency of any SATA SSD I've tested.

I put a pair of them in a 12-drive chassis and get excellent performance. One could probably do the same in an 18-drive chassis without any issues. The failure domain for a journal starts to get pretty large at that point. I have dozens of the "Fultondale" SSDs deployed and have had zero failures. Endurance is excellent, etc.

*the larger units yield much better write throughput but don't make sense financially for journals.

-H
Post by Nick Fisk
-----Original Message-----
Stephen Harker
Sent: 16 March 2016 16:22
Subject: Re: [ceph-users] SSDs for journals vs SSDs for a cache tier,
which is
better?
Post by Christian Balzer
Post by Piotr Wachowicz
Post by Christian Balzer
Post by Piotr Wachowicz
Journals on SSDs - for writes, the write operation returns right
after data lands on the Journal's SSDs, but before it's written
to the backing HDD. So, for writes, SSD journal approach should
be comparable to having a SSD cache tier.
Not quite, see below.
Could you elaborate a bit more?
Are you saying that with a Journal on a SSD writes from clients,
before they can return from the operation to the client, must end up
on both the SSD (Journal) *and* HDD (actual data store behind that
journal)?
No, your initial statement is correct.
However that burst of speed doesn't last indefinitely.
Aside from the size of the journal (which is incidentally NOT the most
limiting factor) there are various "filestore" parameters in Ceph, in
particular the sync interval ones.
There was a more in-depth explanation by a developer about this in
this ML, try your google-foo.
For short bursts of activity, the journal helps a LOT.
If you send a huge number of for example 4KB writes to your cluster,
the speed will eventually (after a few seconds) go down to what your
backing storage (HDDs) are capable of sustaining.
Post by Piotr Wachowicz
Post by Christian Balzer
(Which SSDs do you plan to use anyway?)
Intel DC S3700
Good choice, with the 200GB model prefer the 3700 over the 3710
(higher sequential write speed).
Hi All,
I am looking at using PCI-E SSDs as journals in our (4) Ceph OSD nodes,
each
400GB Intel P3500 DC AIC SSD, HHHL PCIe 3.0
but reading through this thread, it might be better to go with the P3700
given
the improved iops. So a couple of questions.
* Are the PCI-E versions of these drives different in any other way than
the
interface?
Yes and no. Internally they are probably not much difference, but the
NVME/PCIE interface is a lot faster than SATA/SAS, both in terms of minimum
latency and bandwidth.
* Would one of these as a journal for 6 4TB OSDs be overkill (connectivity
is
10GE, or will be shortly anyway), would the SATA S3700 be sufficient?
Again depends on your use case. The S3700 may suffer if you are doing large
sequential writes, it might not have a high enough sequential write speed
and will become the bottleneck. 6 Disks could potentially take around
500-700MB/s of writes. A P3700 will have enough and will give slightly lower
write latency as well if this is important. You may even be able to run more
than 6 disk OSD's on it if needed.
Given they're not hot-swappable, it'd be good if they didn't wear out in
6 months too.
Probably won't unless you are doing some really extreme write workloads and
even then I would imagine they would last 1-2 years.
I realise I've not given you much to go on and I'm Googling around as
well, I'm
really just asking in case someone has tried this already and has some
feedback or advice..
That's ok, I'm currently running S3700 100GB's on current cluster and new
cluster that's in planning stages will be using the 400Gb P3700's.
Thanks! :)
Stephen
--
Stephen Harker
Chief Technology Officer
The Positive Internet Company.
Stephen Harker
2016-03-16 23:54:48 UTC
Thanks all for your suggestions and advice. I'll let you know how it
goes :)

Stephen
Post by Heath Albritton
The rule of thumb is to match the journal throughput to the OSD
throughput. I'm seeing ~180MB/s sequential write on my OSDs and I'm
using one of the P3700 400GB units per six OSDs. The 400GB P3700
yields around 1200MB/s* and has around 1/10th the latency of any SATA
SSD I've tested.
I put a pair of them in a 12-drive chassis and get excellent
performance. One could probably do the same in an 18-drive chassis
without any issues. Failure domain for a journal starts to get pretty
large at that point. I have dozens of the "Fultondale" SSDs deployed
and have had zero failures. Endurance is excellent, etc.
*the larger units yield much better write throughput but don't make
sense financially for journals.
-H
Post by Nick Fisk
-----Original Message-----
Stephen Harker
Sent: 16 March 2016 16:22
Subject: Re: [ceph-users] SSDs for journals vs SSDs for a cache tier,
which is
better?
Post by Christian Balzer
Post by Piotr Wachowicz
Post by Christian Balzer
Post by Piotr Wachowicz
Journals on SSDs - for writes, the write operation returns right
after data lands on the Journal's SSDs, but before it's written
to the backing HDD. So, for writes, SSD journal approach should
be comparable to having a SSD cache tier.
Not quite, see below.
Could you elaborate a bit more?
Are you saying that with a Journal on a SSD writes from clients,
before they can return from the operation to the client, must end up
on both the SSD (Journal) *and* HDD (actual data store behind that
journal)?
No, your initial statement is correct.
However that burst of speed doesn't last indefinitely.
Aside from the size of the journal (which is incidentally NOT the most
limiting factor) there are various "filestore" parameters in Ceph, in
particular the sync interval ones.
There was a more in-depth explanation by a developer about this in
this ML, try your google-foo.
For short bursts of activity, the journal helps a LOT.
If you send a huge number of for example 4KB writes to your cluster,
the speed will eventually (after a few seconds) go down to what your
backing storage (HDDs) are capable of sustaining.
Post by Piotr Wachowicz
Post by Christian Balzer
(Which SSDs do you plan to use anyway?)
Intel DC S3700
Good choice, with the 200GB model prefer the 3700 over the 3710
(higher sequential write speed).
Hi All,
I am looking at using PCI-E SSDs as journals in our (4) Ceph OSD nodes,
each
400GB Intel P3500 DC AIC SSD, HHHL PCIe 3.0
but reading through this thread, it might be better to go with the P3700
given
the improved iops. So a couple of questions.
* Are the PCI-E versions of these drives different in any other way than
the
interface?
Yes and no. Internally they are probably not much difference, but the
NVME/PCIE interface is a lot faster than SATA/SAS, both in terms of minimum
latency and bandwidth.
* Would one of these as a journal for 6 4TB OSDs be overkill
(connectivity
is
10GE, or will be shortly anyway), would the SATA S3700 be sufficient?
Again depends on your use case. The S3700 may suffer if you are doing large
sequential writes, it might not have a high enough sequential write speed
and will become the bottleneck. 6 Disks could potentially take around
500-700MB/s of writes. A P3700 will have enough and will give slightly lower
write latency as well if this is important. You may even be able to run more
than 6 disk OSD's on it if needed.
Given they're not hot-swappable, it'd be good if they didn't wear out in
6 months too.
Probably won't unless you are doing some really extreme write
workloads and
even then I would imagine they would last 1-2 years.
I realise I've not given you much to go on and I'm Googling around as
well, I'm
really just asking in case someone has tried this already and has some
feedback or advice..
That's ok, I'm currently running S3700 100GB's on current cluster and new
cluster that's in planning stages will be using the 400Gb P3700's.
Thanks! :)
Stephen
--
Stephen Harker
Chief Technology Officer
The Positive Internet Company.
Christian Balzer
2016-03-17 01:50:56 UTC
Hello,
Post by Stephen Harker
Post by Christian Balzer
Post by Piotr Wachowicz
Post by Christian Balzer
Post by Piotr Wachowicz
Journals on SSDs - for writes, the write operation returns right
after data lands on the Journal's SSDs, but before it's written to
the backing HDD. So, for writes, SSD journal approach should be
comparable to having a SSD cache tier.
Not quite, see below.
Could you elaborate a bit more?
Are you saying that with a Journal on a SSD writes from clients, before
they can return from the operation to the client, must end up on both the
SSD (Journal) *and* HDD (actual data store behind that journal)?
No, your initial statement is correct.
However that burst of speed doesn't last indefinitely.
Aside from the size of the journal (which is incidentally NOT the most
limiting factor) there are various "filestore" parameters in Ceph, in
particular the sync interval ones.
There was a more in-depth explanation by a developer about this in this ML,
try your google-foo.
For short bursts of activity, the journal helps a LOT.
If you send a huge number of for example 4KB writes to your cluster, the
speed will eventually (after a few seconds) go down to what your backing
storage (HDDs) are capable of sustaining.
Post by Piotr Wachowicz
Post by Christian Balzer
(Which SSDs do you plan to use anyway?)
Intel DC S3700
Good choice, with the 200GB model prefer the 3700 over the 3710 (higher
sequential write speed).
Hi All,
I am looking at using PCI-E SSDs as journals in our (4) Ceph OSD nodes,
400GB Intel P3500 DC AIC SSD, HHHL PCIe 3.0
but reading through this thread, it might be better to go with the P3700
given the improved iops. So a couple of questions.
The 3700s will also last significantly longer than the 3500s.
IOPS (of the device) are mostly irrelevant; sequential write speed is
where it's at.
In the same vein, remember that journals are never ever read from unless
there was a crash.
Post by Stephen Harker
* Are the PCI-E versions of these drives different in any other way than
the interface?
* Would one of these as a journal for 6 4TB OSDs be overkill
(connectivity is 10GE, or will be shortly anyway), would the SATA S3700
be sufficient?
Overkill, but not insanely so.

From my (not insignificant) experience you want to match your journal(s)
first to your network speed and then to the devices behind them.

A SATA HDD can indeed write about 180MB/s sequentially, but that's firmly
in the land of theory when it comes to Ceph.

Ceph/RBD writes are 4MB objects at the largest; they are spread out all
over the cluster and of course most likely interspersed with competing
(seeking) reads and other writes to the same OSD.
That is before all the IO, and thus seeks, needed for file system
operations, LevelDB updates, etc.
I thus spec my journals to 100MB/s of write speed per SATA-based HDD, and
that's already generous.
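Applied to your six-HDD nodes, that works out roughly as follows
(back-of-the-envelope only; the drive figures are from memory of Intel's
spec sheets, so double-check them):
---
# 6 HDD OSDs x ~100MB/s            = ~600MB/s of journal writes, worst case
# 10GbE client network             = ~1200MB/s theoretical ceiling
# DC S3700 200GB (~365MB/s seq wr) = bottleneck under sustained load
# DC P3700 400GB (~1000MB/s+)      = roughly matches the network
---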

Concrete case in point, 4 node cluster, 4 DC S3700 100GB SSDs with 2
journals each, 8 7.2k 3TB SATA HDDs, Infiniband network.
That cluster is very lightly loaded.

Doing this fio from a client VM:
---
fio --size=6G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4M --iodepth=32
---
and watching all 4 nodes simultaneously with atop shows us that the HDDs
are pushed up to around 80% utilization while writing only about 50MB/s.
The journal SSDs (which can handle 200MB/s writes) are consequently
semi-bored at about 45% utilization writing around 95MB/s.
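(The utilization figures above come from atop; plain iostat paints much
the same picture if you don't have atop installed:)
---
# watch per-device utilization and write throughput while the fio job runs
iostat -xm 2
---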

As others mentioned, the P series will give you significantly lower
latencies if that's important in your use case (small writes that in their
sum do not exceed the abilities of your backing storage and CPUs).

Also, a lot of this depends on your actual HW (cases): how many hot-swap
bays you have, how many free PCIe slots, etc.
With entirely new HW you could go for something that has 1-2 NVMe hot-swap
bays and get the best of both worlds.

Summing things up, the 400GB P3700 matches your network speed and thus can
deal with short bursts at full speed.
However it is overkill for your 6 HDDs, especially once they get busy
(like backfilling or tests as above).
I'd be surprised to see them handle more than 400MB/s writes combined.

If you're trying to economize, a single 200GB DC S3700 or 2 100GB ones
(smaller failure domains) should do the trick, too.
Post by Stephen Harker
Given they're not hot-swappable, it'd be good if they didn't wear out in
6 months too.
See above.
I haven't been able to make more than 1% impact in the media wearout of
200GB DC S3700s that receive a constant write stream of 3MB/s over 500
days of operation.
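(For the curious, that wearout figure comes from SMART; something along
these lines works for the Intel SATA drives, and a reasonably recent
smartmontools can read the NVMe models too -- attribute names vary by
vendor and firmware:)
---
# SATA (e.g. DC S3700): attribute 233 is the Media_Wearout_Indicator
smartctl -A /dev/sdX | grep -i wearout
# NVMe (e.g. DC P3700): look at the "Percentage Used" field
smartctl -a /dev/nvme0 | grep -i 'percentage used'
---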

Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
Nick Fisk
2016-02-17 09:23:11 UTC
-----Original Message-----
Christian Balzer
Sent: 17 February 2016 04:22
Subject: Re: [ceph-users] SSDs for journals vs SSDs for a cache tier,
which is
better?
Hello,
Post by Piotr Wachowicz
Hey,
Which one's "better": to use SSDs for storing journals, vs to use them
as a writeback cache tier? All other things being equal.
Pears are better than either oranges or apples. ^_-
Post by Piotr Wachowicz
The usecase is a 15 osd-node cluster, with 6 HDDs and 1 SSDs per node.
Used for block storage for a typical 20-hypervisor OpenStack cloud
(with bunch of VMs running Linux). 10GigE public net + 10 GigE
replication network.
Journals on SSDs - for writes, the write operation returns right after
data lands on the Journal's SSDs, but before it's written to the
backing HDD. So, for writes, SSD journal approach should be comparable
to having a SSD cache tier.
Not quite, see below.
Post by Piotr Wachowicz
In both cases we're writing to an SSD (and to replica's SSDs), and
returning to the client immediately after that.
Data is only flushed to HDD later on.
Correct, note that the flushing is happening by the OSD process submitting
this write to the underlying device/FS.
It doesn't go from the journal to the OSD storage device, which has the
implication that with default settings and plain HDDs you quickly wind up
being being limited to what your actual HDDs can handle in a sustained
manner.
Post by Piotr Wachowicz
However for reads (of hot data) I would expect a SSD Cache Tier to be
faster/better. That's because, in the case of having journals on SSDs,
even if data is in the journal, it's always read from the (slow)
backing disk anyway, right? But with a SSD cache tier, if the data is
hot, it would be read from the (fast) SSD.
It will be read from the even faster pagecache if it is a sufficiently hot
object
and you have sufficient RAM.
Post by Piotr Wachowicz
I'm sure both approaches have their own merits, and might be better
for some specific tasks, but with all other things being equal, I
would expect that using SSDs as the "Writeback" cache tier should, on
average, provide better performance than suing the same SSDs for
Journals.
Post by Piotr Wachowicz
Specifically in the area of read throughput/latency.
Cache tiers (currently) work only well if all your hot data fits into them.
In which case you'd even better off with with a dedicated SSD pool for
that
data.
Because (currently) Ceph has to promote a full object (4MB by default) to
the cache for each operation, be it read or or write.
That means the first time you want to read a 2KB file in your RBD backed
VM,
Ceph has to copy 4MB from the HDD pool to the SSD cache tier.
This has of course a significant impact on read performance, in my crappy
test
cluster reading cold data is half as fast as using the actual non-cached
HDD
pool.
Just an FYI, there will most likely be several fixes/improvements going into
Jewel which will address most of these problems with caching. Objects will
now only be promoted if they are hit several times (configurable) and, if it
makes it in time, there will be a promotion throttle to stop too many
promotions hindering cluster performance.

However, in the context of this thread, Christian is correct: SSD journals
first, and then caching if needed.
And once your cache pool has to evict objects because it is getting full,
it has
to write out 4MB for each such object to the HDD pool.
Then read it back in later, etc.
Post by Piotr Wachowicz
The main difference, I suspect, between the two approaches is that in
the case of multiple HDDs (multiple ceph-osd processes), all of those
processes share access to the same shared SSD storing their journals.
Whereas it's likely not the case with Cache tiering, right? Though I
must say I failed to find any detailed info on this. Any clarification
will be appreciated.
In your specific case writes to the OSDs (HDDs) will be (at least) 50%
slower if
your journals are on disk instead of the SSD.
(Which SSDs do you plan to use anyway?)
I don't think you'll be happy with the resulting performance.
Christian.
Post by Piotr Wachowicz
So, is the above correct, or am I missing some pieces here? Any other
major differences between the two approaches?
Thanks.
P.
--
Christian Balzer Network/Systems Engineer
http://www.gol.com/
Christian Balzer
2016-02-17 12:36:17 UTC
Hello,
Post by Nick Fisk
-----Original Message-----
Of Christian Balzer
Sent: 17 February 2016 04:22
Subject: Re: [ceph-users] SSDs for journals vs SSDs for a cache tier,
which is
better?
[snip]
Post by Nick Fisk
Post by Piotr Wachowicz
I'm sure both approaches have their own merits, and might be better
for some specific tasks, but with all other things being equal, I
would expect that using SSDs as the "Writeback" cache tier should, on
average, provide better performance than suing the same SSDs for
Journals.
Post by Piotr Wachowicz
Specifically in the area of read throughput/latency.
Cache tiers (currently) work only well if all your hot data fits into
them.
In which case you'd even better off with with a dedicated SSD pool for
that
data.
Because (currently) Ceph has to promote a full object (4MB by default)
to the cache for each operation, be it read or or write.
That means the first time you want to read a 2KB file in your RBD backed
VM,
Ceph has to copy 4MB from the HDD pool to the SSD cache tier.
This has of course a significant impact on read performance, in my crappy
test
cluster reading cold data is half as fast as using the actual
non-cached
HDD
pool.
Just a FYI, there will most likely be several fixes/improvements going
into Jewel which will address most of these problems with caching.
Objects will now only be promoted if they are hit several
times(configurable) and, if it makes it in time, a promotion throttle to
stop too many promotions hindering cluster performance.
Ah, both of these would be very nice indeed, especially since the first
one is something that's supposedly already present (but broken).

The 2nd one, if done right, will probably be a game changer.
Robert LeBlanc and I will be most pleased.
Post by Nick Fisk
However in the context of this thread, Christian is correct, SSD journals
first and then caching if needed.
Yeah, thus my overuse of "currently". ^o^

Christian
Post by Nick Fisk
And once your cache pool has to evict objects because it is getting full,
it has
to write out 4MB for each such object to the HDD pool.
Then read it back in later, etc.
Post by Piotr Wachowicz
The main difference, I suspect, between the two approaches is that in
the case of multiple HDDs (multiple ceph-osd processes), all of those
processes share access to the same shared SSD storing their journals.
Whereas it's likely not the case with Cache tiering, right? Though I
must say I failed to find any detailed info on this. Any
clarification will be appreciated.
In your specific case writes to the OSDs (HDDs) will be (at least) 50%
slower if
your journals are on disk instead of the SSD.
(Which SSDs do you plan to use anyway?)
I don't think you'll be happy with the resulting performance.
Christian.
Post by Piotr Wachowicz
So, is the above correct, or am I missing some pieces here? Any other
major differences between the two approaches?
Thanks.
P.
--
Christian Balzer Network/Systems Engineer
http://www.gol.com/
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
Mark Nelson
2016-02-17 13:00:38 UTC
Post by Christian Balzer
Hello,
Post by Nick Fisk
-----Original Message-----
Of Christian Balzer
Sent: 17 February 2016 04:22
Subject: Re: [ceph-users] SSDs for journals vs SSDs for a cache tier,
which is
better?
[snip]
Post by Nick Fisk
Post by Piotr Wachowicz
I'm sure both approaches have their own merits, and might be better
for some specific tasks, but with all other things being equal, I
would expect that using SSDs as the "Writeback" cache tier should, on
average, provide better performance than suing the same SSDs for
Journals.
Post by Piotr Wachowicz
Specifically in the area of read throughput/latency.
Cache tiers (currently) work only well if all your hot data fits into
them.
In which case you'd even better off with with a dedicated SSD pool for
that
data.
Because (currently) Ceph has to promote a full object (4MB by default)
to the cache for each operation, be it read or or write.
That means the first time you want to read a 2KB file in your RBD backed
VM,
Ceph has to copy 4MB from the HDD pool to the SSD cache tier.
This has of course a significant impact on read performance, in my crappy
test
cluster reading cold data is half as fast as using the actual
non-cached
HDD
pool.
Just a FYI, there will most likely be several fixes/improvements going
into Jewel which will address most of these problems with caching.
Objects will now only be promoted if they are hit several
times(configurable) and, if it makes it in time, a promotion throttle to
stop too many promotions hindering cluster performance.
Ah, both of these would be very nice indeed, especially since the first
one is something that's supposedly already present (but broken).
The 2nd one, if done right, will be probably a game changer.
Robert LeBlanc and me will be most pleased.
The branch is wip-promote-throttle and we need testing from more people
besides me to make sure it's the right path forward <hint hint>. :)

I'm including a link to the results we've gotten so far here.
There's still a degenerate case in small random mixed workloads, but
initial testing seems to indicate that the promotion throttling is
helping in many other cases, especially at *very* low promotion rates.
Small random read and write performance, for example, improves
dramatically. Highly skewed zipf-distribution writes are also much
improved (except for large writes).

https://drive.google.com/open?id=0B2gTBZrkrnpZUFV4OC1UaGVlTm8

Note: you will likely need to download the document and open it in
OpenOffice to see the graphs.

In the graphs I have different series labeled as VH, H, M, L, VL, 0,
etc. The throttle rates that correspond to those are:

#VH (ie, let everything through)
# osd tier promote max objects sec = 20000
# osd tier promote max bytes sec = 1610612736

#H (Almost allow the cache tier to be saturated with writes)
# osd tier promote max objects sec = 2000
# osd tier promote max bytes sec = 268435456

# M (Allow about 20% writes into the cache tier)
# osd tier promote max objects sec = 500
# osd tier promote max bytes sec = 67108864

# L (Allow about 5% writes into the cache tier)
# osd tier promote max objects sec = 125
# osd tier promote max bytes sec = 16777216

# VL (Only allow 4MB/sec to be promoted into the cache tier)
# osd tier promote max objects sec = 25
# osd tier promote max bytes sec = 4194304

# 0 (Technically not zero, something like 1/1000 still allowed through)
# osd tier promote max objects sec = 0
# osd tier promote max bytes sec = 0
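(If you want to try one of these profiles at runtime on a test cluster
running the wip-promote-throttle branch, injecting the values should
work -- a sketch, using the "L" profile:)
---
ceph tell osd.* injectargs '--osd_tier_promote_max_objects_sec 125 --osd_tier_promote_max_bytes_sec 16777216'
---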

Mark
Post by Christian Balzer
Post by Nick Fisk
However in the context of this thread, Christian is correct, SSD journals
first and then caching if needed.
Yeah, thus my overuse of "currently". ^o^
Christian
Post by Nick Fisk
And once your cache pool has to evict objects because it is getting full,
it has
to write out 4MB for each such object to the HDD pool.
Then read it back in later, etc.
Post by Piotr Wachowicz
The main difference, I suspect, between the two approaches is that in
the case of multiple HDDs (multiple ceph-osd processes), all of those
processes share access to the same shared SSD storing their journals.
Whereas it's likely not the case with Cache tiering, right? Though I
must say I failed to find any detailed info on this. Any
clarification will be appreciated.
In your specific case writes to the OSDs (HDDs) will be (at least) 50%
slower if
your journals are on disk instead of the SSD.
(Which SSDs do you plan to use anyway?)
I don't think you'll be happy with the resulting performance.
Christian.
Post by Piotr Wachowicz
So, is the above correct, or am I missing some pieces here? Any other
major differences between the two approaches?
Thanks.
P.
--
Christian Balzer Network/Systems Engineer
http://www.gol.com/
Christian Balzer
2016-02-18 02:11:18 UTC
Hello,
Post by Mark Nelson
Post by Christian Balzer
Hello,
Post by Nick Fisk
-----Original Message-----
Of Christian Balzer
Sent: 17 February 2016 04:22
Subject: Re: [ceph-users] SSDs for journals vs SSDs for a cache tier,
which is
better?
[snip]
Post by Nick Fisk
Post by Piotr Wachowicz
I'm sure both approaches have their own merits, and might be better
for some specific tasks, but with all other things being equal, I
would expect that using SSDs as the "Writeback" cache tier should,
on average, provide better performance than suing the same SSDs for
Journals.
Post by Piotr Wachowicz
Specifically in the area of read throughput/latency.
Cache tiers (currently) work only well if all your hot data fits into
them.
In which case you'd even better off with with a dedicated SSD pool for
that
data.
Because (currently) Ceph has to promote a full object (4MB by
default) to the cache for each operation, be it read or or write.
That means the first time you want to read a 2KB file in your RBD backed
VM,
Ceph has to copy 4MB from the HDD pool to the SSD cache tier.
This has of course a significant impact on read performance, in my crappy
test
cluster reading cold data is half as fast as using the actual
non-cached
HDD
pool.
Just a FYI, there will most likely be several fixes/improvements going
into Jewel which will address most of these problems with caching.
Objects will now only be promoted if they are hit several
times(configurable) and, if it makes it in time, a promotion throttle
to stop too many promotions hindering cluster performance.
Ah, both of these would be very nice indeed, especially since the first
one is something that's supposedly already present (but broken).
The 2nd one, if done right, will be probably a game changer.
Robert LeBlanc and me will be most pleased.
The branch is wip-promote-throttle and we need testing from more people
besides me to make sure it's the right path forward <hint hint>. :)
Well, supposedly I'll be getting some real testing/staging HW in the
future, which means I could use the current test cluster for really
experimental stuff.
Until then I need to keep it as the place to test scary procedures and to
be the first to test upgrades for the production clusters,
unfortunately.
Post by Mark Nelson
I'm including the a link to the results we've gotten so far here.
There's still a degenerate case in small random mixed workloads, but
initial testing seems to indicate that the promotion throttling is
helping in many other cases, especially at *very* low promotion rates.
Small random read and write performance for example improves
dramatically. Highly skewed zipf distribution writes are also much
improved except for large writes).
https://drive.google.com/open?id=0B2gTBZrkrnpZUFV4OC1UaGVlTm8
That looks very interesting and promising indeed.
Thanks for that link and the ongoing effort.

Christian
Post by Mark Nelson
Note: You will likely need to download the document and open it in open
office to see the graphs.
In the graphs I have different series labeled as VH, H, M, L, VL, 0,
#VH (ie, let everything through)
# osd tier promote max objects sec = 20000
# osd tier promote max bytes sec = 1610612736
#H (Almost allow the cache tier to be saturated with writes)
# osd tier promote max objects sec = 2000
# osd tier promote max bytes sec = 268435456
# M (Allow about 20% writes into the cache tier)
# osd tier promote max objects sec = 500
# osd tier promote max bytes sec = 67108864
# L (Allow about 5% writes into the cache tier)
# osd tier promote max objects sec = 125
# osd tier promote max bytes sec = 16777216
# VL (Only allow 4MB/sec to be promoted into the cache tier)
# osd tier promote max objects sec = 25
# osd tier promote max bytes sec = 4194304
# 0 (Technically not zero, something like 1/1000 still allowed through)
# osd tier promote max objects sec = 0
# osd tier promote max bytes sec = 0
Mark
Post by Christian Balzer
Post by Nick Fisk
However in the context of this thread, Christian is correct, SSD
journals first and then caching if needed.
Yeah, thus my overuse of "currently". ^o^
Christian
Post by Nick Fisk
And once your cache pool has to evict objects because it is getting full,
it has
to write out 4MB for each such object to the HDD pool.
Then read it back in later, etc.
Post by Piotr Wachowicz
The main difference, I suspect, between the two approaches is that
in the case of multiple HDDs (multiple ceph-osd processes), all of
those processes share access to the same shared SSD storing their
journals. Whereas it's likely not the case with Cache tiering,
right? Though I must say I failed to find any detailed info on
this. Any clarification will be appreciated.
In your specific case writes to the OSDs (HDDs) will be (at least) 50%
slower if
your journals are on disk instead of the SSD.
(Which SSDs do you plan to use anyway?)
I don't think you'll be happy with the resulting performance.
Christian.
Post by Piotr Wachowicz
So, is the above correct, or am I missing some pieces here? Any
other major differences between the two approaches?
Thanks.
P.
--
Christian Balzer Network/Systems Engineer
http://www.gol.com/
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/