Discussion:
[ceph-users] Poor performance on all SSD cluster
Greg Poirier
2014-06-20 18:08:21 UTC
Permalink
I recently created a 9-node Firefly cluster backed by all SSDs. We have had
some pretty severe performance degradation when using O_DIRECT in our tests
(as this is how MySQL will be interacting with RBD volumes, this makes the
most sense for a preliminary test). Running the following test:

dd if=/dev/zero of=testfilasde bs=16k count=65535 oflag=direct

779829248 bytes (780 MB) copied, 604.333 s, 1.3 MB/s

This shows only about 1.5 MB/s of throughput and roughly 100 IOPS from a
single dd thread. Running a second dd process does show increased aggregate
throughput, which is encouraging, but I am still concerned by the low
throughput of a single thread w/ O_DIRECT.
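For reference, the two-thread case below is just a second dd process
writing to its own file; a minimal sketch (file names are placeholders):

  dd if=/dev/zero of=testfile1 bs=16k count=65535 oflag=direct &
  dd if=/dev/zero of=testfile2 bs=16k count=65535 oflag=direct &
  wait    # each dd still has only one 16k write in flight at a time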

Two threads:
779829248 bytes (780 MB) copied, 604.333 s, 1.3 MB/s
126271488 bytes (126 MB) copied, 99.2069 s, 1.3 MB/s

I am testing with an RBD volume mounted with the kernel module (I have also
tested from within KVM, similar performance).

If we allow caching, we start to see reasonable numbers from a single dd
process:

dd if=/dev/zero of=testfilasde bs=16k count=65535
65535+0 records in
65535+0 records out
1073725440 bytes (1.1 GB) copied, 2.05356 s, 523 MB/s

I can get >1GB/s from a single host with three threads.
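For a fairer comparison against the O_DIRECT numbers, a buffered run that
still forces the file data to be flushed before dd exits would look
something like this (same placeholder file name):

  dd if=/dev/zero of=testfilasde bs=16k count=65535 conv=fdatasync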

Rados bench produces similar results.
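For reference, a comparable rados bench invocation (pool name is a
placeholder; -b is the write size in bytes, -t the number of concurrent
ops) would be something like:

  rados bench -p rbd 60 write -b 16384 -t 1    # one op in flight, latency bound
  rados bench -p rbd 60 write -b 16384 -t 16   # more parallelism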

Is there something I can do to increase the performance of O_DIRECT? I
expect performance degradation, but so much?

If I increase the blocksize to 4M, I'm able to get significantly higher
throughput:

3833593856 bytes (3.8 GB) copied, 44.2964 s, 86.5 MB/s

This still seems very low.

I'm using the deadline scheduler in all places. With noop scheduler, I do
not see a performance improvement.
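For reference, the scheduler can be checked and switched per device like
this (device name is a placeholder):

  cat /sys/block/sdb/queue/scheduler           # e.g. noop [deadline] cfq
  echo noop > /sys/block/sdb/queue/scheduler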

Suggestions?
Tyler Wilson
2014-06-20 21:13:14 UTC
Permalink
Greg,

Not a real fix for you, but I too run a full-SSD cluster and am able to get
112 MB/s with your command:

[root at plesk-test ~]# dd if=/dev/zero of=testfilasde bs=16k count=65535
oflag=direct
65535+0 records in
65535+0 records out
1073725440 bytes (1.1 GB) copied, 9.59092 s, 112 MB/s

This of course is in a VM, here is my ceph config

[global]
fsid = <hidden>
mon_initial_members = node-1 node-2 node-3
mon_host = 192.168.0.3 192.168.0.4 192.168.0.5
auth_supported = cephx
osd_journal_size = 2048
filestore_xattr_use_omap = true
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 1024
public_network = 192.168.0.0/24
osd_mkfs_type = xfs
cluster_network = 192.168.1.0/24



On Fri, Jun 20, 2014 at 11:08 AM, Greg Poirier <greg.poirier at opower.com>
wrote:

> I recently created a 9-node Firefly cluster backed by all SSDs. We have
> had some pretty severe performance degradation when using O_DIRECT in our
> tests (as this is how MySQL will be interacting with RBD volumes, this
> makes the most sense for a preliminary test). Running the following test:
>
> dd if=/dev/zero of=testfilasde bs=16k count=65535 oflag=direct
>
> 779829248 bytes (780 MB) copied, 604.333 s, 1.3 MB/s
>
> Shows us only about 1.5 MB/s throughput and 100 IOPS from the single dd
> thread. Running a second dd process does show increased throughput which is
> encouraging, but I am still concerned by the low throughput of a single
> thread w/ O_DIRECT.
>
> Two threads:
> 779829248 bytes (780 MB) copied, 604.333 s, 1.3 MB/s
> 126271488 bytes (126 MB) copied, 99.2069 s, 1.3 MB/s
>
> I am testing with an RBD volume mounted with the kernel module (I have
> also tested from within KVM, similar performance).
>
> If allow caching, we start to see reasonable numbers from a single dd
> process:
>
> dd if=/dev/zero of=testfilasde bs=16k count=65535
> 65535+0 records in
> 65535+0 records out
> 1073725440 bytes (1.1 GB) copied, 2.05356 s, 523 MB/s
>
> I can get >1GB/s from a single host with three threads.
>
> Rados bench produces similar results.
>
> Is there something I can do to increase the performance of O_DIRECT? I
> expect performance degradation, but so much?
>
> If I increase the blocksize to 4M, I'm able to get significantly higher
> throughput:
>
> 3833593856 bytes (3.8 GB) copied, 44.2964 s, 86.5 MB/s
>
> This still seems very low.
>
> I'm using the deadline scheduler in all places. With noop scheduler, I do
> not see a performance improvement.
>
> Suggestions?
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
Greg Poirier
2014-06-20 21:17:18 UTC
Permalink
Thanks Tyler. So, I'm not totally crazy. There is something weird going on.

I've looked into things about as much as I can:

- We have tested with co-located journals and with dedicated journal disks.
- We have bonded 10Gb NICs and have verified that the network configuration
and connectivity are sound.
- We have run dd independently on the SSDs in the cluster and they are
performing fine.
- We have tested both in a VM and with the RBD kernel module and get
identical performance.
- We have pool size = 3, pool min size = 2, and have tested with min size of
2 and 3 -- the performance impact is not bad.
- osd_op times are approximately 6-12 ms (pulled from the OSD admin sockets;
see the commands below).
- osd_sub_op times are 6-12 ms.
- iostat reports service times of 6-12 ms.
- Latency between the storage and the rbd client is approximately .1-.2 ms.
- Disabling replication entirely did not help significantly.
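One way to collect those op timings (osd.0 is a placeholder and must be
queried on the node hosting it; perf counter names vary a bit by release):

  ceph daemon osd.0 dump_historic_ops    # slowest recent ops with per-phase timings
  ceph daemon osd.0 perf dump | python -mjson.tool | grep -A 2 latency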




On Fri, Jun 20, 2014 at 2:13 PM, Tyler Wilson <kupo at linuxdigital.net> wrote:

> Greg,
>
> Not a real fix for you but I too run a full-ssd cluster and am able to get
> 112MB/s with your command;
>
> [root at plesk-test ~]# dd if=/dev/zero of=testfilasde bs=16k count=65535
> oflag=direct
> 65535+0 records in
> 65535+0 records out
> 1073725440 bytes (1.1 GB) copied, 9.59092 s, 112 MB/s
>
> This of course is in a VM, here is my ceph config
>
> [global]
> fsid = <hidden>
> mon_initial_members = node-1 node-2 node-3
> mon_host = 192.168.0.3 192.168.0.4 192.168.0.5
> auth_supported = cephx
> osd_journal_size = 2048
> filestore_xattr_use_omap = true
> osd_pool_default_size = 2
> osd_pool_default_min_size = 1
> osd_pool_default_pg_num = 1024
> public_network = 192.168.0.0/24
> osd_mkfs_type = xfs
> cluster_network = 192.168.1.0/24
>
>
>
> On Fri, Jun 20, 2014 at 11:08 AM, Greg Poirier <greg.poirier at opower.com>
> wrote:
>
>> I recently created a 9-node Firefly cluster backed by all SSDs. We have
>> had some pretty severe performance degradation when using O_DIRECT in our
>> tests (as this is how MySQL will be interacting with RBD volumes, this
>> makes the most sense for a preliminary test). Running the following test:
>>
>> dd if=/dev/zero of=testfilasde bs=16k count=65535 oflag=direct
>>
>> 779829248 bytes (780 MB) copied, 604.333 s, 1.3 MB/s
>>
>> Shows us only about 1.5 MB/s throughput and 100 IOPS from the single dd
>> thread. Running a second dd process does show increased throughput which is
>> encouraging, but I am still concerned by the low throughput of a single
>> thread w/ O_DIRECT.
>>
>> Two threads:
>> 779829248 bytes (780 MB) copied, 604.333 s, 1.3 MB/s
>> 126271488 bytes (126 MB) copied, 99.2069 s, 1.3 MB/s
>>
>> I am testing with an RBD volume mounted with the kernel module (I have
>> also tested from within KVM, similar performance).
>>
>> If allow caching, we start to see reasonable numbers from a single dd
>> process:
>>
>> dd if=/dev/zero of=testfilasde bs=16k count=65535
>> 65535+0 records in
>> 65535+0 records out
>> 1073725440 bytes (1.1 GB) copied, 2.05356 s, 523 MB/s
>>
>> I can get >1GB/s from a single host with three threads.
>>
>> Rados bench produces similar results.
>>
>> Is there something I can do to increase the performance of O_DIRECT? I
>> expect performance degradation, but so much?
>>
>> If I increase the blocksize to 4M, I'm able to get significantly higher
>> throughput:
>>
>> 3833593856 bytes (3.8 GB) copied, 44.2964 s, 86.5 MB/s
>>
>> This still seems very low.
>>
>> I'm using the deadline scheduler in all places. With noop scheduler, I do
>> not see a performance improvement.
>>
>> Suggestions?
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users at lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
Mark Kirkwood
2014-06-22 02:09:50 UTC
Permalink
I can reproduce this in:

ceph version 0.81-423-g1fb4574

on Ubuntu 14.04. I have a two-osd cluster with data on two sata spinners
(WD Blacks) and journals on two ssds (Crucial m4's). I'm getting about 3.5
MB/s (kernel and librbd) using your dd command with direct on. Leaving
off direct I'm seeing about 140 MB/s (librbd) and 90 MB/s (kernel 3.11
[2]). The ssds can do writes at about 180 MB/s each... which is
something to look at another day [1].

It would be interesting to know what version of Ceph Tyler is using, as
his setup seems to be nowhere near as impacted by adding direct. It might
also be useful to know what make and model of ssd you both are using (some
of 'em do not like a series of essentially sync writes). Having said that,
testing my Crucial m4's shows they can do the dd command (with direct *on*)
at about 180 MB/s... hmmm... so it *is* the Ceph layer, it seems.

Regards

Mark

[1] I set filestore_max_sync_interval = 100 (30G journal... ssd able to
do 180 MB/s etc.); however, I am still seeing writes to the spinners
during the 8s or so that the above dd tests take. See the ceph.conf
snippet below.
[2] Ubuntu 13.10 VM - I'll upgrade it to 14.04 and see if that helps at all.
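For reference, the sync-interval tweak in [1] corresponds to something
like this in ceph.conf (value as per the footnote; it only delays
filestore syncs -- the journal still sees every write):

[osd]
filestore_max_sync_interval = 100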

On 21/06/14 09:17, Greg Poirier wrote:
> Thanks Tyler. So, I'm not totally crazy. There is something weird going on.
>
> I've looked into things about as much as I can:
>
> - We have tested with collocated journals and dedicated journal disks.
> - We have bonded 10Gb nics and have verified network configuration and
> connectivity is sound
> - We have run dd independently on the SSDs in the cluster and they are
> performing fine
> - We have tested both in a VM and with the RBD kernel module and get
> identical performance
> - We have pool size = 3, pool min size = 2 and have tested with min size
> of 2 and 3 -- the performance impact is not bad
> - osd_op times are approximately 6-12ms
> - osd_sub_op times are 6-12 ms
> - iostat reports service time of 6-12ms
> - Latency between the storage and rbd client is approximately .1-.2ms
> - Disabling replication entirely did not help significantly
>
>
>
>
> On Fri, Jun 20, 2014 at 2:13 PM, Tyler Wilson <kupo at linuxdigital.net
> <mailto:kupo at linuxdigital.net>> wrote:
>
> Greg,
>
> Not a real fix for you but I too run a full-ssd cluster and am able
> to get 112MB/s with your command;
>
> [root at plesk-test ~]# dd if=/dev/zero of=testfilasde bs=16k
> count=65535 oflag=direct
> 65535+0 records in
> 65535+0 records out
> 1073725440 bytes (1.1 GB) copied, 9.59092 s, 112 MB/s
>
> This of course is in a VM, here is my ceph config
>
> [global]
> fsid = <hidden>
> mon_initial_members = node-1 node-2 node-3
> mon_host = 192.168.0.3 192.168.0.4 192.168.0.5
> auth_supported = cephx
> osd_journal_size = 2048
> filestore_xattr_use_omap = true
> osd_pool_default_size = 2
> osd_pool_default_min_size = 1
> osd_pool_default_pg_num = 1024
> public_network = 192.168.0.0/24 <http://192.168.0.0/24>
> osd_mkfs_type = xfs
> cluster_network = 192.168.1.0/24 <http://192.168.1.0/24>
>
>
>
> On Fri, Jun 20, 2014 at 11:08 AM, Greg Poirier
> <greg.poirier at opower.com <mailto:greg.poirier at opower.com>> wrote:
>
> I recently created a 9-node Firefly cluster backed by all SSDs.
> We have had some pretty severe performance degradation when
> using O_DIRECT in our tests (as this is how MySQL will be
> interacting with RBD volumes, this makes the most sense for a
> preliminary test). Running the following test:
>
> dd if=/dev/zero of=testfilasde bs=16k count=65535 oflag=direct
>
> 779829248 bytes (780 MB) copied, 604.333 s, 1.3 MB/s
>
> Shows us only about 1.5 MB/s throughput and 100 IOPS from the
> single dd thread. Running a second dd process does show
> increased throughput which is encouraging, but I am still
> concerned by the low throughput of a single thread w/ O_DIRECT.
>
> Two threads:
> 779829248 bytes (780 MB) copied, 604.333 s, 1.3 MB/s
> 126271488 bytes (126 MB) copied, 99.2069 s, 1.3 MB/s
>
> I am testing with an RBD volume mounted with the kernel module
> (I have also tested from within KVM, similar performance).
>
> If allow caching, we start to see reasonable numbers from a
> single dd process:
>
> dd if=/dev/zero of=testfilasde bs=16k count=65535
> 65535+0 records in
> 65535+0 records out
> 1073725440 bytes (1.1 GB) copied, 2.05356 s, 523 MB/s
>
> I can get >1GB/s from a single host with three threads.
>
> Rados bench produces similar results.
>
> Is there something I can do to increase the performance of
> O_DIRECT? I expect performance degradation, but so much?
>
> If I increase the blocksize to 4M, I'm able to get significantly
> higher throughput:
>
> 3833593856 bytes (3.8 GB) copied, 44.2964 s, 86.5 MB/s
>
> This still seems very low.
>
> I'm using the deadline scheduler in all places. With noop
> scheduler, I do not see a performance improvement.
>
> Suggestions?
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com <mailto:ceph-users at lists.ceph.com>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
Mark Kirkwood
2014-06-22 03:50:20 UTC
Permalink
On 22/06/14 14:09, Mark Kirkwood wrote:

Upgrading the VM to 14.04 and retesting the case *without* direct, I get:

- 164 MB/s (librbd)
- 115 MB/s (kernel 3.13)

So I'm managing to get almost native performance out of the librbd case. I
tweaked both filestore max and min sync intervals (100 and 10 resp) just
to see if I could actually avoid writing to the spinners while the test
was in progress (still seeing some, but clearly fewer).

However, there is no improvement at all *with* direct enabled. The output of
iostat on the host while the direct test is in progress is interesting:

avg-cpu: %user %nice %system %iowait %steal %idle
11.73 0.00 5.04 0.76 0.00 82.47

Device:  rrqm/s  wrqm/s   r/s     w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sda        0.00    0.00  0.00   11.00   0.00   4.02    749.09      0.14  12.36     0.00    12.36   6.55   7.20
sdb        0.00    0.00  0.00   11.00   0.00   4.02    749.09      0.14  12.36     0.00    12.36   5.82   6.40
sdc        0.00    0.00  0.00  435.00   0.00   4.29     20.21      0.53   1.21     0.00     1.21   1.21  52.80
sdd        0.00    0.00  0.00  435.00   0.00   4.29     20.21      0.52   1.20     0.00     1.20   1.20  52.40

(sda and sdb are the spinners, sdc and sdd the ssds). Something is making
the journal work very hard for its 4.29 MB/s!
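For reference, output like the above typically comes from something like:

  iostat -xm 1 sda sdb sdc sdd   # extended stats, MB units, 1-second samples

Note the avgrq-sz column: ~749 sectors (~375 KB) per request on the
spinners versus ~20 sectors (~10 KB) per request at 435 w/s on the
journal ssds.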

regards

Mark

> Leaving
> off direct I'm seeing about 140 MB/s (librbd) and 90 MB/s (kernel 3.11
> [2]). The ssd's can do writes at about 180 MB/s each... which is
> something to look at another day[1].
Haomai Wang
2014-06-22 07:02:40 UTC
Permalink
Hi Mark,

Do you have rbd cache enabled? I tested on my ssd cluster (only one ssd);
it seemed ok.

> dd if=/dev/zero of=test bs=16k count=65536 oflag=direct

82.3MB/s
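For reference, a minimal way to turn rbd cache on for librbd/KVM clients
is something like this on the client side (plus cache=writeback on the
guest's drive definition); note the kernel RBD client does not use it:

[client]
rbd_cache = true
rbd_cache_writethrough_until_flush = true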


On Sun, Jun 22, 2014 at 11:50 AM, Mark Kirkwood
<mark.kirkwood at catalyst.net.nz> wrote:
> On 22/06/14 14:09, Mark Kirkwood wrote:
>
> Upgrading the VM to 14.04 and restesting the case *without* direct I get:
>
> - 164 MB/s (librbd)
> - 115 MB/s (kernel 3.13)
>
> So managing to almost get native performance out of the librbd case. I
> tweaked both filestore max and min sync intervals (100 and 10 resp) just to
> see if I could actually avoid writing to the spinners while the test was in
> progress (still seeing some, but clearly fewer).
>
> However no improvement at all *with* direct enabled. The output of iostat on
> the host while the direct test is in progress is interesting:
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 11.73 0.00 5.04 0.76 0.00 82.47
>
> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz
> avgqu-sz await r_await w_await svctm %util
> sda 0.00 0.00 0.00 11.00 0.00 4.02 749.09
> 0.14 12.36 0.00 12.36 6.55 7.20
> sdb 0.00 0.00 0.00 11.00 0.00 4.02 749.09
> 0.14 12.36 0.00 12.36 5.82 6.40
> sdc 0.00 0.00 0.00 435.00 0.00 4.29 20.21
> 0.53 1.21 0.00 1.21 1.21 52.80
> sdd 0.00 0.00 0.00 435.00 0.00 4.29 20.21
> 0.52 1.20 0.00 1.20 1.20 52.40
>
> (sda,b are the spinners sdc,d the ssds). Something is making the journal
> work very hard for its 4.29 MB/s!
>
> regards
>
> Mark
>
>
>> Leaving
>> off direct I'm seeing about 140 MB/s (librbd) and 90 MB/s (kernel 3.11
>> [2]). The ssd's can do writes at about 180 MB/s each... which is
>> something to look at another day[1].
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Best Regards,

Wheat
Mark Nelson
2014-06-22 13:44:16 UTC
Permalink
On 06/22/2014 02:02 AM, Haomai Wang wrote:
> Hi Mark,
>
> Do you enable rbdcache? I test on my ssd cluster(only one ssd), it seemed ok.
>
>> dd if=/dev/zero of=test bs=16k count=65536 oflag=direct
>
> 82.3MB/s

RBD cache is definitely going to help in this use case. This test is
basically just sequentially writing a single 16k chunk of data out, one
at a time -- i.e., it is entirely latency bound. At least on OSDs backed
by XFS, you have to wait for that data to hit the journals of every OSD
associated with the object before the acknowledgement gets sent back to
the client. If you are using the default 4MB block size, you'll hit the
same OSDs over and over again and your other OSDs will sit there
twiddling their thumbs waiting for IO until you hit the next block, but
then it will just be a different set of OSDs getting hit. You should be
able to verify this by using iostat or collectl or something to look at
the behaviour of the SSDs during the test. Since this is all sequential,
though, switching to buffered IO (i.e. coalescing IOs at the buffer
cache layer) or using RBD cache for direct IO (coalescing IOs below the
block device) will dramatically improve things.
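As an illustration (mount point and file name here are just placeholders),
the same 16k sequential direct write issued with one outstanding IO versus
sixteen shows the latency-bound effect directly -- the only knob that
changes is how many writes are in flight, which is exactly what dd can't
vary:

  fio --name=seq16k --filename=/mnt/rbd/testfile --ioengine=libaio \
      --direct=1 --rw=write --bs=16k --size=1g --iodepth=1
  fio --name=seq16k --filename=/mnt/rbd/testfile --ioengine=libaio \
      --direct=1 --rw=write --bs=16k --size=1g --iodepth=16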

The real question here, though, is whether or not a synchronous stream of
sequential 16k writes is even remotely close to the IO patterns that
would be seen in actual use for MySQL. Most likely in actual use you'll
be seeing a big mix of random reads and writes, and hopefully at least
some parallelism (though this depends on the number of databases, number
of users, and the workload!).
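As a rough stand-in for that kind of workload (path and sizes are
placeholders; replaying the real MySQL IO is always better), something
like:

  fio --name=dbsim --filename=/mnt/rbd/testfile --ioengine=libaio --direct=1 \
      --rw=randrw --rwmixread=70 --bs=16k --size=4g --iodepth=32 \
      --numjobs=4 --group_reporting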

Ceph is pretty good at small random IO with lots of parallelism on
spinning-disk-backed OSDs (so long as you use SSD journals or
controllers with WB cache). It's much harder to get native-level IOPS
rates with SSD-backed OSDs, though. The latency involved in distributing
and processing all of that data becomes a much bigger deal. Having said
that, we are actively working on improving latency as much as we can. :)

Mark

>
>
> On Sun, Jun 22, 2014 at 11:50 AM, Mark Kirkwood
> <mark.kirkwood at catalyst.net.nz> wrote:
>> On 22/06/14 14:09, Mark Kirkwood wrote:
>>
>> Upgrading the VM to 14.04 and restesting the case *without* direct I get:
>>
>> - 164 MB/s (librbd)
>> - 115 MB/s (kernel 3.13)
>>
>> So managing to almost get native performance out of the librbd case. I
>> tweaked both filestore max and min sync intervals (100 and 10 resp) just to
>> see if I could actually avoid writing to the spinners while the test was in
>> progress (still seeing some, but clearly fewer).
>>
>> However no improvement at all *with* direct enabled. The output of iostat on
>> the host while the direct test is in progress is interesting:
>>
>> avg-cpu: %user %nice %system %iowait %steal %idle
>> 11.73 0.00 5.04 0.76 0.00 82.47
>>
>> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz
>> avgqu-sz await r_await w_await svctm %util
>> sda 0.00 0.00 0.00 11.00 0.00 4.02 749.09
>> 0.14 12.36 0.00 12.36 6.55 7.20
>> sdb 0.00 0.00 0.00 11.00 0.00 4.02 749.09
>> 0.14 12.36 0.00 12.36 5.82 6.40
>> sdc 0.00 0.00 0.00 435.00 0.00 4.29 20.21
>> 0.53 1.21 0.00 1.21 1.21 52.80
>> sdd 0.00 0.00 0.00 435.00 0.00 4.29 20.21
>> 0.52 1.20 0.00 1.20 1.20 52.40
>>
>> (sda,b are the spinners sdc,d the ssds). Something is making the journal
>> work very hard for its 4.29 MB/s!
>>
>> regards
>>
>> Mark
>>
>>
>>> Leaving
>>> off direct I'm seeing about 140 MB/s (librbd) and 90 MB/s (kernel 3.11
>>> [2]). The ssd's can do writes at about 180 MB/s each... which is
>>> something to look at another day[1].
>>
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users at lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
Greg Poirier
2014-06-22 19:14:38 UTC
Permalink
We actually do have a use pattern of large batch sequential writes, and
this dd is pretty similar to that use case.

A round-trip write with replication takes approximately 10-15ms to
complete. I've been looking at dump_historic_ops on a number of OSDs and
getting mean, min, and max for sub_op and ops. If these were on the order
of 1-2 seconds, I could understand this throughput... But we're talking
about fairly fast SSDs and a 20Gbps network with <1ms latency for TCP
round-trip between the client machine and all of the OSD hosts.

I've gone so far as disabling replication entirely (which had almost no
impact) and putting journals on separate SSDs from the data disks (which
are ALSO SSDs).

This just doesn't make sense to me.


On Sun, Jun 22, 2014 at 6:44 AM, Mark Nelson <mark.nelson at inktank.com>
wrote:

> On 06/22/2014 02:02 AM, Haomai Wang wrote:
>
>> Hi Mark,
>>
>> Do you enable rbdcache? I test on my ssd cluster(only one ssd), it seemed
>> ok.
>>
>> dd if=/dev/zero of=test bs=16k count=65536 oflag=direct
>>>
>>
>> 82.3MB/s
>>
>
> RBD Cache is definitely going to help in this use case. This test is
> basically just sequentially writing a single 16k chunk of data out, one at
> a time. IE, entirely latency bound. At least on OSDs backed by XFS, you
> have to wait for that data to hit the journals of every OSD associated with
> the object before the acknowledgement gets sent back to the client. If you
> are using the default 4MB block size, you'll hit the same OSDs over and
> over again and your other OSDs will sit there twiddling their thumbs
> waiting for IO until you hit the next block, but then it will just be a
> different set OSDs getting hit. You should be able to verify this by using
> iostat or collectl or something to look at the behaviour of the SSDs during
> the test. Since this is all sequential though, switching to buffered IO
> (ie coalesce IOs at the buffercache layer) or using RBD cache for direct IO
> (coalesce IOs below the block device) will dramatically improve things.
>
> The real question here though, is whether or not a synchronous stream of
> sequential 16k writes is even remotely close to the IO patterns that would
> be seen in actual use for MySQL. Most likely in actual use you'll be
> seeing a big mix of random reads and writes, and hopefully at least some
> parallelism (though this depends on the number of databases, number of
> users, and the workload!).
>
> Ceph is pretty good at small random IO with lots of parallelism on
> spinning disk backed OSDs (So long as you use SSD journals or controllers
> with WB cache). It's much harder to get native-level IOPS rates with SSD
> backed OSDs though. The latency involved in distributing and processing
> all of that data becomes a much bigger deal. Having said that, we are
> actively working on improving latency as much as we can. :)
>
> Mark
>
>
>
>>
>> On Sun, Jun 22, 2014 at 11:50 AM, Mark Kirkwood
>> <mark.kirkwood at catalyst.net.nz> wrote:
>>
>>> On 22/06/14 14:09, Mark Kirkwood wrote:
>>>
>>> Upgrading the VM to 14.04 and restesting the case *without* direct I get:
>>>
>>> - 164 MB/s (librbd)
>>> - 115 MB/s (kernel 3.13)
>>>
>>> So managing to almost get native performance out of the librbd case. I
>>> tweaked both filestore max and min sync intervals (100 and 10 resp) just
>>> to
>>> see if I could actually avoid writing to the spinners while the test was
>>> in
>>> progress (still seeing some, but clearly fewer).
>>>
>>> However no improvement at all *with* direct enabled. The output of
>>> iostat on
>>> the host while the direct test is in progress is interesting:
>>>
>>> avg-cpu: %user %nice %system %iowait %steal %idle
>>> 11.73 0.00 5.04 0.76 0.00 82.47
>>>
>>> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s
>>> avgrq-sz
>>> avgqu-sz await r_await w_await svctm %util
>>> sda 0.00 0.00 0.00 11.00 0.00 4.02 749.09
>>> 0.14 12.36 0.00 12.36 6.55 7.20
>>> sdb 0.00 0.00 0.00 11.00 0.00 4.02 749.09
>>> 0.14 12.36 0.00 12.36 5.82 6.40
>>> sdc 0.00 0.00 0.00 435.00 0.00 4.29 20.21
>>> 0.53 1.21 0.00 1.21 1.21 52.80
>>> sdd 0.00 0.00 0.00 435.00 0.00 4.29 20.21
>>> 0.52 1.20 0.00 1.20 1.20 52.40
>>>
>>> (sda,b are the spinners sdc,d the ssds). Something is making the journal
>>> work very hard for its 4.29 MB/s!
>>>
>>> regards
>>>
>>> Mark
>>>
>>>
>>> Leaving
>>>> off direct I'm seeing about 140 MB/s (librbd) and 90 MB/s (kernel 3.11
>>>> [2]). The ssd's can do writes at about 180 MB/s each... which is
>>>> something to look at another day[1].
>>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users at lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>>
>>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
Christian Balzer
2014-06-23 01:42:55 UTC
Permalink
On Sun, 22 Jun 2014 12:14:38 -0700 Greg Poirier wrote:

> We actually do have a use pattern of large batch sequential writes, and
> this dd is pretty similar to that use case.
>
> A round-trip write with replication takes approximately 10-15ms to
> complete. I've been looking at dump_historic_ops on a number of OSDs and
> getting mean, min, and max for sub_op and ops. If these were on the order
> of 1-2 seconds, I could understand this throughput... But we're talking
> about fairly fast SSDs and a 20Gbps network with <1ms latency for TCP
> round-trip between the client machine and all of the OSD hosts.
>
> I've gone so far as disabling replication entirely (which had almost no
> impact) and putting journals on separate SSDs as the data disks (which
> are ALSO SSDs).
>
> This just doesn't make sense to me.
>
A lot of this sounds like my "Slow IOPS on RBD compared to journal and
backing devices" thread a few weeks ago.
Though those results are even worse in a way than what I saw.

How many OSDs do you have per node and how many CPU cores?

When running this test, are the OSD processes very CPU intensive?
Do you see a good spread amongst the OSDs, or are there hotspots?

If you have the time/chance, could you run the fio job from that thread and
post the results? I'm very curious to find out whether the "no more than
400 IOPS per OSD" observation holds true for your cluster as well.

fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128


Regards,

Christian

>
> On Sun, Jun 22, 2014 at 6:44 AM, Mark Nelson <mark.nelson at inktank.com>
> wrote:
>
> > On 06/22/2014 02:02 AM, Haomai Wang wrote:
> >
> >> Hi Mark,
> >>
> >> Do you enable rbdcache? I test on my ssd cluster(only one ssd), it
> >> seemed ok.
> >>
> >> dd if=/dev/zero of=test bs=16k count=65536 oflag=direct
> >>>
> >>
> >> 82.3MB/s
> >>
> >
> > RBD Cache is definitely going to help in this use case. This test is
> > basically just sequentially writing a single 16k chunk of data out,
> > one at a time. IE, entirely latency bound. At least on OSDs backed
> > by XFS, you have to wait for that data to hit the journals of every
> > OSD associated with the object before the acknowledgement gets sent
> > back to the client. If you are using the default 4MB block size,
> > you'll hit the same OSDs over and over again and your other OSDs will
> > sit there twiddling their thumbs waiting for IO until you hit the next
> > block, but then it will just be a different set OSDs getting hit. You
> > should be able to verify this by using iostat or collectl or something
> > to look at the behaviour of the SSDs during the test. Since this is
> > all sequential though, switching to buffered IO (ie coalesce IOs at
> > the buffercache layer) or using RBD cache for direct IO (coalesce IOs
> > below the block device) will dramatically improve things.
> >
> > The real question here though, is whether or not a synchronous stream
> > of sequential 16k writes is even remotely close to the IO patterns
> > that would be seen in actual use for MySQL. Most likely in actual use
> > you'll be seeing a big mix of random reads and writes, and hopefully
> > at least some parallelism (though this depends on the number of
> > databases, number of users, and the workload!).
> >
> > Ceph is pretty good at small random IO with lots of parallelism on
> > spinning disk backed OSDs (So long as you use SSD journals or
> > controllers with WB cache). It's much harder to get native-level IOPS
> > rates with SSD backed OSDs though. The latency involved in
> > distributing and processing all of that data becomes a much bigger
> > deal. Having said that, we are actively working on improving latency
> > as much as we can. :)
> >
> > Mark
> >
> >
> >
> >>
> >> On Sun, Jun 22, 2014 at 11:50 AM, Mark Kirkwood
> >> <mark.kirkwood at catalyst.net.nz> wrote:
> >>
> >>> On 22/06/14 14:09, Mark Kirkwood wrote:
> >>>
> >>> Upgrading the VM to 14.04 and restesting the case *without* direct I
> >>> get:
> >>>
> >>> - 164 MB/s (librbd)
> >>> - 115 MB/s (kernel 3.13)
> >>>
> >>> So managing to almost get native performance out of the librbd case.
> >>> I tweaked both filestore max and min sync intervals (100 and 10
> >>> resp) just to
> >>> see if I could actually avoid writing to the spinners while the test
> >>> was in
> >>> progress (still seeing some, but clearly fewer).
> >>>
> >>> However no improvement at all *with* direct enabled. The output of
> >>> iostat on
> >>> the host while the direct test is in progress is interesting:
> >>>
> >>> avg-cpu: %user %nice %system %iowait %steal %idle
> >>> 11.73 0.00 5.04 0.76 0.00 82.47
> >>>
> >>> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s
> >>> avgrq-sz
> >>> avgqu-sz await r_await w_await svctm %util
> >>> sda 0.00 0.00 0.00 11.00 0.00 4.02
> >>> 749.09 0.14 12.36 0.00 12.36 6.55 7.20
> >>> sdb 0.00 0.00 0.00 11.00 0.00 4.02
> >>> 749.09 0.14 12.36 0.00 12.36 5.82 6.40
> >>> sdc 0.00 0.00 0.00 435.00 0.00 4.29
> >>> 20.21 0.53 1.21 0.00 1.21 1.21 52.80
> >>> sdd 0.00 0.00 0.00 435.00 0.00 4.29
> >>> 20.21 0.52 1.20 0.00 1.20 1.20 52.40
> >>>
> >>> (sda,b are the spinners sdc,d the ssds). Something is making the
> >>> journal work very hard for its 4.29 MB/s!
> >>>
> >>> regards
> >>>
> >>> Mark
> >>>
> >>>
> >>> Leaving
> >>>> off direct I'm seeing about 140 MB/s (librbd) and 90 MB/s (kernel
> >>>> 3.11 [2]). The ssd's can do writes at about 180 MB/s each... which
> >>>> is something to look at another day[1].
> >>>>
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> ceph-users mailing list
> >>> ceph-users at lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >>
> >>
> >>
> >>
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users at lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >


--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Greg Poirier
2014-06-23 17:26:32 UTC
Permalink
10 OSDs per node
12 physical cores hyperthreaded (24 logical cores exposed to OS)
64GB RAM

Negligible load

iostat shows the disks are largely idle except for occasional bursty
writes.

Results of fio from one of the SSDs in the cluster:

fiojob: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio,
iodepth=128
fio-2.1.3
Starting 1 process
fiojob: Laying out IO file(s) (1 file(s) / 400MB)
Jobs: 1 (f=1): [w] [-.-% done] [0KB/155.5MB/0KB /s] [0/39.8K/0 iops] [eta
00m:00s]
fiojob: (groupid=0, jobs=1): err= 0: pid=21845: Mon Jun 23 13:23:47 2014
write: io=409600KB, bw=157599KB/s, iops=39399, runt= 2599msec
slat (usec): min=6, max=2149, avg=22.13, stdev=23.08
clat (usec): min=70, max=10700, avg=3220.76, stdev=521.44
lat (usec): min=90, max=10722, avg=3243.13, stdev=523.70
clat percentiles (usec):
| 1.00th=[ 2736], 5.00th=[ 2864], 10.00th=[ 2896], 20.00th=[ 2928],
| 30.00th=[ 2960], 40.00th=[ 3024], 50.00th=[ 3056], 60.00th=[ 3184],
| 70.00th=[ 3344], 80.00th=[ 3440], 90.00th=[ 3504], 95.00th=[ 3632],
| 99.00th=[ 5856], 99.50th=[ 6240], 99.90th=[ 7136], 99.95th=[ 7584],
| 99.99th=[ 8160]
bw (KB /s): min=139480, max=173320, per=99.99%, avg=157577.60,
stdev=16122.77
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.08%, 4=95.89%, 10=3.98%, 20=0.01%
cpu : usr=14.05%, sys=46.73%, ctx=72243, majf=0, minf=186
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.1%
issued : total=r=0/w=102400/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
WRITE: io=409600KB, aggrb=157599KB/s, minb=157599KB/s, maxb=157599KB/s,
mint=2599msec, maxt=2599msec

Disk stats (read/write):
sda: ios=0/95026, merge=0/0, ticks=0/3016, in_queue=2972, util=82.27%

All of the disks are identical.

The same fio from the host with the RBD volume mounted:

fiojob: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio,
iodepth=128
fio-2.1.3
Starting 1 process
fiojob: Laying out IO file(s) (1 file(s) / 400MB)
Jobs: 1 (f=1): [w] [100.0% done] [0KB/5384KB/0KB /s] [0/1346/0 iops] [eta
00m:00s]
fiojob: (groupid=0, jobs=1): err= 0: pid=30070: Mon Jun 23 13:25:50 2014
write: io=409600KB, bw=9264.3KB/s, iops=2316, runt= 44213msec
slat (usec): min=17, max=154210, avg=84.83, stdev=535.40
clat (msec): min=10, max=1294, avg=55.17, stdev=103.43
lat (msec): min=10, max=1295, avg=55.25, stdev=103.43
clat percentiles (msec):
| 1.00th=[ 17], 5.00th=[ 21], 10.00th=[ 24], 20.00th=[ 28],
| 30.00th=[ 31], 40.00th=[ 34], 50.00th=[ 37], 60.00th=[ 40],
| 70.00th=[ 44], 80.00th=[ 50], 90.00th=[ 63], 95.00th=[ 103],
| 99.00th=[ 725], 99.50th=[ 906], 99.90th=[ 1106], 99.95th=[ 1172],
| 99.99th=[ 1237]
bw (KB /s): min= 3857, max=12416, per=100.00%, avg=9280.09,
stdev=1233.63
lat (msec) : 20=3.76%, 50=76.60%, 100=14.45%, 250=2.98%, 500=0.72%
lat (msec) : 750=0.56%, 1000=0.66%, 2000=0.27%
cpu : usr=3.50%, sys=19.31%, ctx=131358, majf=0, minf=986
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.1%
issued : total=r=0/w=102400/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
WRITE: io=409600KB, aggrb=9264KB/s, minb=9264KB/s, maxb=9264KB/s,
mint=44213msec, maxt=44213msec

Disk stats (read/write):
rbd2: ios=0/102499, merge=0/1818, ticks=0/5593828, in_queue=5599520,
util=99.85%


On Sun, Jun 22, 2014 at 6:42 PM, Christian Balzer <chibi at gol.com> wrote:

> On Sun, 22 Jun 2014 12:14:38 -0700 Greg Poirier wrote:
>
> > We actually do have a use pattern of large batch sequential writes, and
> > this dd is pretty similar to that use case.
> >
> > A round-trip write with replication takes approximately 10-15ms to
> > complete. I've been looking at dump_historic_ops on a number of OSDs and
> > getting mean, min, and max for sub_op and ops. If these were on the order
> > of 1-2 seconds, I could understand this throughput... But we're talking
> > about fairly fast SSDs and a 20Gbps network with <1ms latency for TCP
> > round-trip between the client machine and all of the OSD hosts.
> >
> > I've gone so far as disabling replication entirely (which had almost no
> > impact) and putting journals on separate SSDs as the data disks (which
> > are ALSO SSDs).
> >
> > This just doesn't make sense to me.
> >
> A lot of this sounds like my "Slow IOPS on RBD compared to journal and
> backing devices" thread a few weeks ago.
> Though those results are even worse in a way than what I saw.
>
> How many OSDs do you have per node and how many CPU cores?
>
> When running this test, are the OSDs very CPU intense?
> Do you see a good spread amongst the OSDs or are there hotspots?
>
> If you have the time/chance, could you run the fio from that thread and
> post the results, I'm very curious to find out if the no more than 400
> IOPS per OSD holds true for your cluster as well.
>
> fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
> --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
>
>
> Regards,
>
> Christian
>
> >
> > On Sun, Jun 22, 2014 at 6:44 AM, Mark Nelson <mark.nelson at inktank.com>
> > wrote:
> >
> > > On 06/22/2014 02:02 AM, Haomai Wang wrote:
> > >
> > >> Hi Mark,
> > >>
> > >> Do you enable rbdcache? I test on my ssd cluster(only one ssd), it
> > >> seemed ok.
> > >>
> > >> dd if=/dev/zero of=test bs=16k count=65536 oflag=direct
> > >>>
> > >>
> > >> 82.3MB/s
> > >>
> > >
> > > RBD Cache is definitely going to help in this use case. This test is
> > > basically just sequentially writing a single 16k chunk of data out,
> > > one at a time. IE, entirely latency bound. At least on OSDs backed
> > > by XFS, you have to wait for that data to hit the journals of every
> > > OSD associated with the object before the acknowledgement gets sent
> > > back to the client. If you are using the default 4MB block size,
> > > you'll hit the same OSDs over and over again and your other OSDs will
> > > sit there twiddling their thumbs waiting for IO until you hit the next
> > > block, but then it will just be a different set OSDs getting hit. You
> > > should be able to verify this by using iostat or collectl or something
> > > to look at the behaviour of the SSDs during the test. Since this is
> > > all sequential though, switching to buffered IO (ie coalesce IOs at
> > > the buffercache layer) or using RBD cache for direct IO (coalesce IOs
> > > below the block device) will dramatically improve things.
> > >
> > > The real question here though, is whether or not a synchronous stream
> > > of sequential 16k writes is even remotely close to the IO patterns
> > > that would be seen in actual use for MySQL. Most likely in actual use
> > > you'll be seeing a big mix of random reads and writes, and hopefully
> > > at least some parallelism (though this depends on the number of
> > > databases, number of users, and the workload!).
> > >
> > > Ceph is pretty good at small random IO with lots of parallelism on
> > > spinning disk backed OSDs (So long as you use SSD journals or
> > > controllers with WB cache). It's much harder to get native-level IOPS
> > > rates with SSD backed OSDs though. The latency involved in
> > > distributing and processing all of that data becomes a much bigger
> > > deal. Having said that, we are actively working on improving latency
> > > as much as we can. :)
> > >
> > > Mark
> > >
> > >
> > >
> > >>
> > >> On Sun, Jun 22, 2014 at 11:50 AM, Mark Kirkwood
> > >> <mark.kirkwood at catalyst.net.nz> wrote:
> > >>
> > >>> On 22/06/14 14:09, Mark Kirkwood wrote:
> > >>>
> > >>> Upgrading the VM to 14.04 and restesting the case *without* direct I
> > >>> get:
> > >>>
> > >>> - 164 MB/s (librbd)
> > >>> - 115 MB/s (kernel 3.13)
> > >>>
> > >>> So managing to almost get native performance out of the librbd case.
> > >>> I tweaked both filestore max and min sync intervals (100 and 10
> > >>> resp) just to
> > >>> see if I could actually avoid writing to the spinners while the test
> > >>> was in
> > >>> progress (still seeing some, but clearly fewer).
> > >>>
> > >>> However no improvement at all *with* direct enabled. The output of
> > >>> iostat on
> > >>> the host while the direct test is in progress is interesting:
> > >>>
> > >>> avg-cpu: %user %nice %system %iowait %steal %idle
> > >>> 11.73 0.00 5.04 0.76 0.00 82.47
> > >>>
> > >>> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s
> > >>> avgrq-sz
> > >>> avgqu-sz await r_await w_await svctm %util
> > >>> sda 0.00 0.00 0.00 11.00 0.00 4.02
> > >>> 749.09 0.14 12.36 0.00 12.36 6.55 7.20
> > >>> sdb 0.00 0.00 0.00 11.00 0.00 4.02
> > >>> 749.09 0.14 12.36 0.00 12.36 5.82 6.40
> > >>> sdc 0.00 0.00 0.00 435.00 0.00 4.29
> > >>> 20.21 0.53 1.21 0.00 1.21 1.21 52.80
> > >>> sdd 0.00 0.00 0.00 435.00 0.00 4.29
> > >>> 20.21 0.52 1.20 0.00 1.20 1.20 52.40
> > >>>
> > >>> (sda,b are the spinners sdc,d the ssds). Something is making the
> > >>> journal work very hard for its 4.29 MB/s!
> > >>>
> > >>> regards
> > >>>
> > >>> Mark
> > >>>
> > >>>
> > >>> Leaving
> > >>>> off direct I'm seeing about 140 MB/s (librbd) and 90 MB/s (kernel
> > >>>> 3.11 [2]). The ssd's can do writes at about 180 MB/s each... which
> > >>>> is something to look at another day[1].
> > >>>>
> > >>>
> > >>>
> > >>>
> > >>> _______________________________________________
> > >>> ceph-users mailing list
> > >>> ceph-users at lists.ceph.com
> > >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >>>
> > >>
> > >>
> > >>
> > >>
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users at lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
>
>
> --
> Christian Balzer Network/Systems Engineer
> chibi at gol.com Global OnLine Japan/Fusion Communications
> http://www.gol.com/
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
Christian Balzer
2014-06-24 02:36:03 UTC
Permalink
Hello,

On Mon, 23 Jun 2014 10:26:32 -0700 Greg Poirier wrote:

> 10 OSDs per node
So 90 OSDs in total.

> 12 physical cores hyperthreaded (24 logical cores exposed to OS)
Sounds good.

> 64GB RAM
With SSDs the effect of a large pagecache on the storage nodes isn't that
pronounced, but still nice. ^^

>
> Negligible load
>
> iostat shows the disks are largely idle except for bursty writes
> occasionally.
>
I suppose it is a bit of a drag to monitor this on 9 nodes at the same
time, but at least with atop it would be feasible.
You might want to check if specific OSDs (both disks and processes) are
getting busy while others remain idle for the duration of the test.
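A quick way to watch that without logging into every node, if your release
has it (latency columns are in ms):

  watch -n 1 'ceph osd perf'     # per-OSD fs_commit / fs_apply latency
  ceph osd tree                  # to map any hot OSDs back to their hosts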

As for the fio results, could you try it from a VM using userspace RBD as
well?

Either way, that result for the host is horrible, but unfortunately
totally on par with what you saw from your dd test.
I would have expected a cluster like yours to produce up to 40k IOPS (yes,
the amount a single SSD can do) from my experience.

Something more than the inherent latency of Ceph (OSDs) seems to be going
on here.

Christian

> Results of fio from one of the SSDs in the cluster:
>
> fiojob: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio,
> iodepth=128
> fio-2.1.3
> Starting 1 process
> fiojob: Laying out IO file(s) (1 file(s) / 400MB)
> Jobs: 1 (f=1): [w] [-.-% done] [0KB/155.5MB/0KB /s] [0/39.8K/0 iops] [eta
> 00m:00s]
> fiojob: (groupid=0, jobs=1): err= 0: pid=21845: Mon Jun 23 13:23:47 2014
> write: io=409600KB, bw=157599KB/s, iops=39399, runt= 2599msec
> slat (usec): min=6, max=2149, avg=22.13, stdev=23.08
> clat (usec): min=70, max=10700, avg=3220.76, stdev=521.44
> lat (usec): min=90, max=10722, avg=3243.13, stdev=523.70
> clat percentiles (usec):
> | 1.00th=[ 2736], 5.00th=[ 2864], 10.00th=[ 2896],
> 20.00th=[ 2928], | 30.00th=[ 2960], 40.00th=[ 3024], 50.00th=[ 3056],
> 60.00th=[ 3184], | 70.00th=[ 3344], 80.00th=[ 3440], 90.00th=[ 3504],
> 95.00th=[ 3632], | 99.00th=[ 5856], 99.50th=[ 6240], 99.90th=[ 7136],
> 99.95th=[ 7584], | 99.99th=[ 8160]
> bw (KB /s): min=139480, max=173320, per=99.99%, avg=157577.60,
> stdev=16122.77
> lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
> lat (msec) : 2=0.08%, 4=95.89%, 10=3.98%, 20=0.01%
> cpu : usr=14.05%, sys=46.73%, ctx=72243, majf=0, minf=186
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
> >=64=99.9%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.1%
> issued : total=r=0/w=102400/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
> WRITE: io=409600KB, aggrb=157599KB/s, minb=157599KB/s, maxb=157599KB/s,
> mint=2599msec, maxt=2599msec
>
> Disk stats (read/write):
> sda: ios=0/95026, merge=0/0, ticks=0/3016, in_queue=2972, util=82.27%
>
> All of the disks are identical.
>
> The same fio from the host with the RBD volume mounted:
>
> fiojob: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio,
> iodepth=128
> fio-2.1.3
> Starting 1 process
> fiojob: Laying out IO file(s) (1 file(s) / 400MB)
> Jobs: 1 (f=1): [w] [100.0% done] [0KB/5384KB/0KB /s] [0/1346/0 iops] [eta
> 00m:00s]
> fiojob: (groupid=0, jobs=1): err= 0: pid=30070: Mon Jun 23 13:25:50 2014
> write: io=409600KB, bw=9264.3KB/s, iops=2316, runt= 44213msec
> slat (usec): min=17, max=154210, avg=84.83, stdev=535.40
> clat (msec): min=10, max=1294, avg=55.17, stdev=103.43
> lat (msec): min=10, max=1295, avg=55.25, stdev=103.43
> clat percentiles (msec):
> | 1.00th=[ 17], 5.00th=[ 21], 10.00th=[ 24],
> 20.00th=[ 28], | 30.00th=[ 31], 40.00th=[ 34], 50.00th=[ 37],
> 60.00th=[ 40], | 70.00th=[ 44], 80.00th=[ 50], 90.00th=[ 63],
> 95.00th=[ 103], | 99.00th=[ 725], 99.50th=[ 906], 99.90th=[ 1106],
> 99.95th=[ 1172], | 99.99th=[ 1237]
> bw (KB /s): min= 3857, max=12416, per=100.00%, avg=9280.09,
> stdev=1233.63
> lat (msec) : 20=3.76%, 50=76.60%, 100=14.45%, 250=2.98%, 500=0.72%
> lat (msec) : 750=0.56%, 1000=0.66%, 2000=0.27%
> cpu : usr=3.50%, sys=19.31%, ctx=131358, majf=0, minf=986
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
> >=64=99.9%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.1%
> issued : total=r=0/w=102400/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
> WRITE: io=409600KB, aggrb=9264KB/s, minb=9264KB/s, maxb=9264KB/s,
> mint=44213msec, maxt=44213msec
>
> Disk stats (read/write):
> rbd2: ios=0/102499, merge=0/1818, ticks=0/5593828, in_queue=5599520,
> util=99.85%
>
>
> On Sun, Jun 22, 2014 at 6:42 PM, Christian Balzer <chibi at gol.com> wrote:
>
> > On Sun, 22 Jun 2014 12:14:38 -0700 Greg Poirier wrote:
> >
> > > We actually do have a use pattern of large batch sequential writes,
> > > and this dd is pretty similar to that use case.
> > >
> > > A round-trip write with replication takes approximately 10-15ms to
> > > complete. I've been looking at dump_historic_ops on a number of OSDs
> > > and getting mean, min, and max for sub_op and ops. If these were on
> > > the order of 1-2 seconds, I could understand this throughput... But
> > > we're talking about fairly fast SSDs and a 20Gbps network with <1ms
> > > latency for TCP round-trip between the client machine and all of the
> > > OSD hosts.
> > >
> > > I've gone so far as disabling replication entirely (which had almost
> > > no impact) and putting journals on separate SSDs as the data disks
> > > (which are ALSO SSDs).
> > >
> > > This just doesn't make sense to me.
> > >
> > A lot of this sounds like my "Slow IOPS on RBD compared to journal and
> > backing devices" thread a few weeks ago.
> > Though those results are even worse in a way than what I saw.
> >
> > How many OSDs do you have per node and how many CPU cores?
> >
> > When running this test, are the OSDs very CPU intense?
> > Do you see a good spread amongst the OSDs or are there hotspots?
> >
> > If you have the time/chance, could you run the fio from that thread and
> > post the results, I'm very curious to find out if the no more than 400
> > IOPS per OSD holds true for your cluster as well.
> >
> > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
> > --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
> >
> >
> > Regards,
> >
> > Christian
> >
> > >
> > > On Sun, Jun 22, 2014 at 6:44 AM, Mark Nelson
> > > <mark.nelson at inktank.com> wrote:
> > >
> > > > On 06/22/2014 02:02 AM, Haomai Wang wrote:
> > > >
> > > >> Hi Mark,
> > > >>
> > > >> Do you enable rbdcache? I test on my ssd cluster(only one ssd), it
> > > >> seemed ok.
> > > >>
> > > >> dd if=/dev/zero of=test bs=16k count=65536 oflag=direct
> > > >>>
> > > >>
> > > >> 82.3MB/s
> > > >>
> > > >
> > > > RBD Cache is definitely going to help in this use case. This test
> > > > is basically just sequentially writing a single 16k chunk of data
> > > > out, one at a time. IE, entirely latency bound. At least on OSDs
> > > > backed by XFS, you have to wait for that data to hit the journals
> > > > of every OSD associated with the object before the acknowledgement
> > > > gets sent back to the client. If you are using the default 4MB
> > > > block size, you'll hit the same OSDs over and over again and your
> > > > other OSDs will sit there twiddling their thumbs waiting for IO
> > > > until you hit the next block, but then it will just be a different
> > > > set OSDs getting hit. You should be able to verify this by using
> > > > iostat or collectl or something to look at the behaviour of the
> > > > SSDs during the test. Since this is all sequential though,
> > > > switching to buffered IO (ie coalesce IOs at the buffercache
> > > > layer) or using RBD cache for direct IO (coalesce IOs below the
> > > > block device) will dramatically improve things.
> > > >
> > > > The real question here though, is whether or not a synchronous
> > > > stream of sequential 16k writes is even remotely close to the IO
> > > > patterns that would be seen in actual use for MySQL. Most likely
> > > > in actual use you'll be seeing a big mix of random reads and
> > > > writes, and hopefully at least some parallelism (though this
> > > > depends on the number of databases, number of users, and the
> > > > workload!).
> > > >
> > > > Ceph is pretty good at small random IO with lots of parallelism on
> > > > spinning disk backed OSDs (So long as you use SSD journals or
> > > > controllers with WB cache). It's much harder to get native-level
> > > > IOPS rates with SSD backed OSDs though. The latency involved in
> > > > distributing and processing all of that data becomes a much bigger
> > > > deal. Having said that, we are actively working on improving
> > > > latency as much as we can. :)
> > > >
> > > > Mark
> > > >
> > > >
> > > >
> > > >>
> > > >> On Sun, Jun 22, 2014 at 11:50 AM, Mark Kirkwood
> > > >> <mark.kirkwood at catalyst.net.nz> wrote:
> > > >>
> > > >>> On 22/06/14 14:09, Mark Kirkwood wrote:
> > > >>>
> > > >>> Upgrading the VM to 14.04 and restesting the case *without*
> > > >>> direct I get:
> > > >>>
> > > >>> - 164 MB/s (librbd)
> > > >>> - 115 MB/s (kernel 3.13)
> > > >>>
> > > >>> So managing to almost get native performance out of the librbd
> > > >>> case. I tweaked both filestore max and min sync intervals (100
> > > >>> and 10 resp) just to
> > > >>> see if I could actually avoid writing to the spinners while the
> > > >>> test was in
> > > >>> progress (still seeing some, but clearly fewer).
> > > >>>
> > > >>> However no improvement at all *with* direct enabled. The output
> > > >>> of iostat on
> > > >>> the host while the direct test is in progress is interesting:
> > > >>>
> > > >>> avg-cpu: %user %nice %system %iowait %steal %idle
> > > >>> 11.73 0.00 5.04 0.76 0.00 82.47
> > > >>>
> > > >>> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s
> > > >>> avgrq-sz
> > > >>> avgqu-sz await r_await w_await svctm %util
> > > >>> sda 0.00 0.00 0.00 11.00 0.00 4.02
> > > >>> 749.09 0.14 12.36 0.00 12.36 6.55 7.20
> > > >>> sdb 0.00 0.00 0.00 11.00 0.00 4.02
> > > >>> 749.09 0.14 12.36 0.00 12.36 5.82 6.40
> > > >>> sdc 0.00 0.00 0.00 435.00 0.00 4.29
> > > >>> 20.21 0.53 1.21 0.00 1.21 1.21 52.80
> > > >>> sdd 0.00 0.00 0.00 435.00 0.00 4.29
> > > >>> 20.21 0.52 1.20 0.00 1.20 1.20 52.40
> > > >>>
> > > >>> (sda,b are the spinners sdc,d the ssds). Something is making the
> > > >>> journal work very hard for its 4.29 MB/s!
> > > >>>
> > > >>> regards
> > > >>>
> > > >>> Mark
> > > >>>
> > > >>>
> > > >>> Leaving
> > > >>>> off direct I'm seeing about 140 MB/s (librbd) and 90 MB/s
> > > >>>> (kernel 3.11 [2]). The ssd's can do writes at about 180 MB/s
> > > >>>> each... which is something to look at another day[1].
> > > >>>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> _______________________________________________
> > > >>> ceph-users mailing list
> > > >>> ceph-users at lists.ceph.com
> > > >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > >>>
> > > >>
> > > >>
> > > >>
> > > >>
> > > > _______________________________________________
> > > > ceph-users mailing list
> > > > ceph-users at lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > >
> >
> >
> > --
> > Christian Balzer Network/Systems Engineer
> > chibi at gol.com Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users at lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >


--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Greg Poirier
2014-06-23 17:54:59 UTC
Permalink
On Sun, Jun 22, 2014 at 6:44 AM, Mark Nelson <mark.nelson at inktank.com>
wrote:

> RBD Cache is definitely going to help in this use case. This test is
> basically just sequentially writing a single 16k chunk of data out, one at
> a time. IE, entirely latency bound. At least on OSDs backed by XFS, you
> have to wait for that data to hit the journals of every OSD associated with
> the object before the acknowledgement gets sent back to the client.
>

Again, I can reproduce this with replication disabled.


> If you are using the default 4MB block size, you'll hit the same OSDs
> over and over again and your other OSDs will sit there twiddling their
> thumbs waiting for IO until you hit the next block, but then it will just
> be a different set OSDs getting hit. You should be able to verify this by
> using iostat or collectl or something to look at the behaviour of the SSDs
> during the test. Since this is all sequential though, switching to
> buffered IO (ie coalesce IOs at the buffercache layer) or using RBD cache
> for direct IO (coalesce IOs below the block device) will dramatically
> improve things.
>

This makes sense.

Given the following scenario:

- No replication
- osd_op time average is .015 seconds (stddev ~.003 seconds)
- Network latency is approximately .000237 seconds on avg

I should be getting 60 IOPS from the OSD reporting this time, right?

So 60 * 16kB = 960kB. That's slightly slower than we're getting because
I'm only able to sample the slowest ops. We're getting closer to 100 IOPS.
But that does make sense, I suppose.

So the only way to improve performance would be to not use O_DIRECT (as
this should bypass rbd cache as well, right?).


> Ceph is pretty good at small random IO with lots of parallelism on
> spinning disk backed OSDs (So long as you use SSD journals or controllers
> with WB cache). It's much harder to get native-level IOPS rates with SSD
> backed OSDs though. The latency involved in distributing and processing
> all of that data becomes a much bigger deal. Having said that, we are
> actively working on improving latency as much as we can. :)


And this is true because flushing from the journal to spinning disks is
going to coalesce the writes into the appropriate blocks in a meaningful
way, right? Or I guess... Why is this?

Why doesn't that happen with SSD journals and SSD OSDs?
Mark Nelson
2014-06-23 19:03:17 UTC
Permalink
On 06/23/2014 12:54 PM, Greg Poirier wrote:
> On Sun, Jun 22, 2014 at 6:44 AM, Mark Nelson <mark.nelson at inktank.com
> <mailto:mark.nelson at inktank.com>> wrote:
>
> RBD Cache is definitely going to help in this use case. This test
> is basically just sequentially writing a single 16k chunk of data
> out, one at a time. IE, entirely latency bound. At least on OSDs
> backed by XFS, you have to wait for that data to hit the journals of
> every OSD associated with the object before the acknowledgement gets
> sent back to the client.
>
>
> Again, I can reproduce this with replication disabled.

Replication is the less important part of that statement; it's more
about the specific test you are running.

>
> If you are using the default 4MB block size, you'll hit the same
> OSDs over and over again and your other OSDs will sit there
> twiddling their thumbs waiting for IO until you hit the next block,
> but then it will just be a different set OSDs getting hit. You
> should be able to verify this by using iostat or collectl or
> something to look at the behaviour of the SSDs during the test.
> Since this is all sequential though, switching to buffered IO (ie
> coalesce IOs at the buffercache layer) or using RBD cache for direct
> IO (coalesce IOs below the block device) will dramatically improve
> things.
>
>
> This makes sense.
>
> Given the following scenario:
>
> - No replication
> - osd_op time average is .015 seconds (stddev ~.003 seconds)
> - Network latency is approximately .000237 seconds on avg
>
> I should be getting 60 IOPS from the OSD reporting this time, right?
>
> So 60 * 16kB = 960kB. That's slightly slower than we're getting because
> I'm only able to sample the slowest ops. We're getting closer to 100
> IOPS. But that does make sense, I suppose.
>
> So the only way to improve performance would be to not use O_DIRECT (as
> this should bypass rbd cache as well, right?).

RBD cache will actually still work. You can think of it like the cache
on a typical hard drive, with similar upsides and downsides. Remember
that O_DIRECT only tries to minimize caching effects by skipping the
Linux buffer cache; it doesn't make any guarantees about what happens
below the block level. Having said that, RBD cache should fully respect
flushes and barriers, but there's (typically) no battery, so you can't
make any other assumptions beyond that.

http://ceph.com/docs/master/rbd/rbd-config-ref/
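
(For anyone who wants to try it, the client-side knobs live in the
[client] section; a minimal, conservative sketch - the path and values
are illustrative, see the reference above for the actual defaults:)

# on the RBD client / hypervisor
cat >> /etc/ceph/ceph.conf <<'EOF'
[client]
    rbd cache = true
    # stay in writethrough mode until the guest sends its first flush,
    # so a guest that never flushes isn't silently given writeback semantics
    rbd cache writethrough until flush = true
EOF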

>
> Ceph is pretty good at small random IO with lots of parallelism on
> spinning disk backed OSDs (So long as you use SSD journals or
> controllers with WB cache). It's much harder to get native-level
> IOPS rates with SSD backed OSDs though. The latency involved in
> distributing and processing all of that data becomes a much bigger
> deal. Having said that, we are actively working on improving
> latency as much as we can. :)
>
>
> And this is true because flushing from the journal to spinning disks is
> going to coalesce the writes into the appropriate blocks in a meaningful
> way, right? Or I guess... Why is this?

Well, for random IO you often can't do much coalescing. You have to
bite the bullet and either parallelize things or reduce per-op latency.
Ceph already handles parallelism very well. You just throw more disks
at the problem and so long as there are enough client requests it more
or less just scales (limited by things like network bisection bandwidth
or other complications). On the latency side, spinning disks aren't
fast enough for Ceph's extra latency overhead to matter much, but with
SSDs the story is different. That's why we are very interested in
reducing latency.

Regarding journals: journal writes are always sequential (even for
random IO!), but they are O_DIRECT, so they'll skip the Linux buffer
cache. If you have hardware that is fast at writing small sequential IO
(say a controller with WB cache, or an SSD), you can do journal writes
very quickly. For bursts of small random IO, performance can be quite
good. The downside is that you can hit journal limits very quickly,
meaning you have to flush and wait for the underlying filestore to catch
up. This results in performance that starts out super fast, then stalls
once the journal limits are hit, goes back to super fast again for a
bit, then hits another stall, etc. This is less than ideal given the way
CRUSH distributes data across OSDs. The alternative is setting a soft
limit on how much data is in the journal and flushing smaller amounts of
data more quickly to limit the spiky behaviour. On the whole that can be
good, but it limits the burst potential and also limits the amount of
data that could potentially be coalesced in the journal.

Luckily with RBD you can (when applicable) coalesce on the client with
RBD cache instead, which is arguably better anyway since you can send
bigger IOs to the OSDs earlier in the write path. So long as you are ok
with what RBD cache does and does not guarantee, it's definitely worth
enabling imho.

>
> Why doesn't that happen with SSD journals and SSD OSDs?

SSD journals and SSD OSDs should be fine. I suspect in this case it's
just software latency.

Mark
Jake Young
2014-06-24 13:10:50 UTC
Permalink
On Mon, Jun 23, 2014 at 3:03 PM, Mark Nelson <mark.nelson at inktank.com>
wrote:

> Well, for random IO you often can't do much coalescing. You have to bite
> the bullet and either parallelize things or reduce per-op latency. Ceph
> already handles parallelism very well. You just throw more disks at the
> problem and so long as there are enough client requests it more or less
> just scales (limited by things like network bisection bandwidth or other
> complications). On the latency side, spinning disks aren't fast enough for
> Ceph's extra latency overhead to matter much, but with SSDs the story is
> different. That's why we are very interested in reducing latency.
>
> Regarding journals: Journal writes are always sequential (even for random
> IO!), but are O_DIRECT so they'll skip linux buffer cache. If you have
> hardware that is fast at writing sequential small IO (say a controller with
> WB cache or an SSD), you can do journal writes very quickly. For bursts of
> small random IO, performance can be quite good. The downsides is that you
> can hit journal limits very quickly, meaning you have to flush and wait for
> the underlying filestore to catch up. This results in performance that
> starts out super fast, then stalls once the journal limits are hit, back to
> super fast again for a bit, then another stall, etc. This is less than
> ideal given the way crush distributes data across OSDs. The alternative is
> setting a soft limit on how much data is in the journal and flushing
> smaller amounts of data more quickly to limit the spikey behaviour. On the
> whole, that can be good but limits the burst potential and also limits the
> amount of data that could potentially be coalesced in the journal.
>

Mark,

What settings are you suggesting for setting a soft limit on journal size
and flushing smaller amounts of data?

Something like this?
filestore_queue_max_bytes: 10485760
filestore_queue_committing_max_bytes: 10485760
journal_max_write_bytes: 10485760
journal_queue_max_bytes: 10485760
ms_dispatch_throttle_bytes: 10485760
objecter_inflight_op_bytes: 10485760

(see "Small bytes" in
http://ceph.com/community/ceph-bobtail-jbod-performance-tuning)
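
(For experimenting, values like these can also be injected into a running
cluster without editing ceph.conf - the option names are just the ones
listed above, and some of them may only fully apply after an OSD restart:)

ceph tell osd.\* injectargs '--filestore_queue_max_bytes 10485760 --journal_queue_max_bytes 10485760'
# confirm on one OSD via its admin socket
ceph daemon osd.0 config show | grep -E 'filestore_queue_max_bytes|journal_queue_max_bytes'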


>
> Luckily with RBD you can (when applicable) coalesce on the client with RBD
> cache instead, which is arguably better anyway since you can send bigger
> IOs to the OSDs earlier in the write path. So long as you are ok with what
> RBD cache does and does not guarantee, it's definitely worth enabling imho.
>
>
Thanks,

Jake
Mark Kirkwood
2014-06-23 04:36:08 UTC
Permalink
Good point, I had neglected to do that.

So, amending my ceph.conf [1]:

[client]
rbd cache = true
rbd cache size = 2147483648
rbd cache max dirty = 1073741824
rbd cache max dirty age = 100

and also amending the VM's XML definition to set cache to writeback:

<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='writeback' io='native'/>
  <auth username='admin'>
    <secret type='ceph' uuid='cd2d3ab1-2d31-41e0-ab08-3d0c6e2fafa0'/>
  </auth>
  <source protocol='rbd' name='rbd/vol1'>
    <host name='192.168.1.64' port='6789'/>
  </source>
  <target dev='vdb' bus='virtio'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
</disk>
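
(One quick way to confirm the cache mode actually took effect after
redefining the domain - the domain name is whatever yours is called:)

virsh dumpxml <your-domain> | grep "cache="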

Retesting from inside the VM:

$ dd if=/dev/zero of=/mnt/vol1/scratch/file bs=16k count=65535 oflag=direct
65535+0 records in
65535+0 records out
1073725440 bytes (1.1 GB) copied, 8.1686 s, 131 MB/s

Which is much better, so certainly for the librbd case enabling the rbd
cache seems to nail this particular issue.

Regards

Mark

[1] possibly somewhat aggressively set, but at least a noticeable
difference :-)

On 22/06/14 19:02, Haomai Wang wrote:
> Hi Mark,
>
> Do you enable rbdcache? I test on my ssd cluster(only one ssd), it seemed ok.
>
>> dd if=/dev/zero of=test bs=16k count=65536 oflag=direct
>
> 82.3MB/s
>
>
Greg Poirier
2014-06-23 06:27:01 UTC
Permalink
How does RBD cache work? I wasn't able to find an adequate explanation in
the docs.

On Sunday, June 22, 2014, Mark Kirkwood <mark.kirkwood at catalyst.net.nz>
wrote:

> Good point, I had neglected to do that.
>
> So, amending my conf.conf [1]:
>
> [client]
> rbd cache = true
> rbd cache size = 2147483648
> rbd cache max dirty = 1073741824
> rbd cache max dirty age = 100
>
> and also the VM's xml def to include cache to writeback:
>
> <disk type='network' device='disk'>
> <driver name='qemu' type='raw' cache='writeback' io='native'/>
> <auth username='admin'>
> <secret type='ceph' uuid='cd2d3ab1-2d31-41e0-ab08-3d0c6e2fafa0'/>
> </auth>
> <source protocol='rbd' name='rbd/vol1'>
> <host name='192.168.1.64' port='6789'/>
> </source>
> <target dev='vdb' bus='virtio'/>
> <address type='pci' domain='0x0000' bus='0x00' slot='0x07'
> function='0x0'/>
> </disk>
>
> Retesting from inside the VM:
>
> $ dd if=/dev/zero of=/mnt/vol1/scratch/file bs=16k count=65535 oflag=direct
> 65535+0 records in
> 65535+0 records out
> 1073725440 bytes (1.1 GB) copied, 8.1686 s, 131 MB/s
>
> Which is much better, so certainly for the librbd case enabling the rbd
> cache seems to nail this particular issue.
>
> Regards
>
> Mark
>
> [1] possibly somewhat agressively set, but at least a noticeable
> difference :-)
>
> On 22/06/14 19:02, Haomai Wang wrote:
>
>> Hi Mark,
>>
>> Do you enable rbdcache? I test on my ssd cluster(only one ssd), it seemed
>> ok.
>>
>> dd if=/dev/zero of=test bs=16k count=65536 oflag=direct
>>>
>>
>> 82.3MB/s
>>
>>
>>
Christian Balzer
2014-06-23 06:51:22 UTC
Permalink
Hello,

On Sun, 22 Jun 2014 23:27:01 -0700 Greg Poirier wrote:

> How does RBD cache work? I wasn't able to find an adequate explanation in
> the docs.
>
The mailing list (archive) is your friend; I asked pretty much the same
question in January.

In short, it mimics the cache on a typical hard disk: at its default
settings it is of a similar size to those, and it comes with the same
gotchas (it needs to be flushed at the right times, which any
non-ancient OS will do).

However keep reading below.

> On Sunday, June 22, 2014, Mark Kirkwood <mark.kirkwood at catalyst.net.nz>
> wrote:
>
> > Good point, I had neglected to do that.
> >
> > So, amending my conf.conf [1]:
> >
> > [client]
> > rbd cache = true
> > rbd cache size = 2147483648

Any Inktank engineers reading this: I really wish we could use K/M/G
instead of having to whip out a calculator every time we set values like
these in Ceph.

> > rbd cache max dirty = 1073741824
> > rbd cache max dirty age = 100
> >

Mark, you're giving it a 2GB cache.
For a write test that's 1GB in size.
"Aggressively set" is a bit of an understatement here. ^o^
Most people will not want to spend this much memory on write-only caching.

Of course with these settings that test will yield impressive results.

However, if you observe your storage nodes (OSDs), you will see that the
data still takes the same time until it is actually, finally written to
disk. The same goes for kernelspace RBD with caching enabled in the VM.
Doing similar tests with fio, I managed to fill the cache and got
fantastic IOPS, but it then took minutes to finally drain.

Resulting in hung task warnings for the jbd process(es) like this:
---
May 28 16:58:56 tvm-03 kernel: [ 960.320182] INFO: task jbd2/vda1-8:153 blocked
for more than 120 seconds.
May 28 16:58:56 tvm-03 kernel: [ 960.320866] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
---

Now this doesn't actively break things AFAICT, but it left me feeling
quite uncomfortable nevertheless.

Also, what happens if something "bad" happens to the VM or its host
before the cache is drained?

From where I'm standing the RBD cache is fine for merging really small
writes and that's it.
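
(For reference, the sort of fio run I mean - path and sizes are purely
illustrative; compare a run with --fsync=32 against one without any
fsync to see how much the cache is really buying once flushes are
honoured:)

fio --name=rbdcache-test --filename=/mnt/vol1/scratch/fio.dat \
    --rw=write --bs=16k --size=1g --ioengine=sync \
    --direct=0 --fsync=32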

Regards,

Christian
> > and also the VM's xml def to include cache to writeback:
> >
> > <disk type='network' device='disk'>
> > <driver name='qemu' type='raw' cache='writeback' io='native'/>
> > <auth username='admin'>
> > <secret type='ceph'
> > uuid='cd2d3ab1-2d31-41e0-ab08-3d0c6e2fafa0'/> </auth>
> > <source protocol='rbd' name='rbd/vol1'>
> > <host name='192.168.1.64' port='6789'/>
> > </source>
> > <target dev='vdb' bus='virtio'/>
> > <address type='pci' domain='0x0000' bus='0x00' slot='0x07'
> > function='0x0'/>
> > </disk>
> >
> > Retesting from inside the VM:
> >
> > $ dd if=/dev/zero of=/mnt/vol1/scratch/file bs=16k count=65535
> > oflag=direct 65535+0 records in
> > 65535+0 records out
> > 1073725440 bytes (1.1 GB) copied, 8.1686 s, 131 MB/s
> >
> > Which is much better, so certainly for the librbd case enabling the rbd
> > cache seems to nail this particular issue.
> >
> > Regards
> >
> > Mark
> >
> > [1] possibly somewhat agressively set, but at least a noticeable
> > difference :-)
> >
> > On 22/06/14 19:02, Haomai Wang wrote:
> >
> >> Hi Mark,
> >>
> >> Do you enable rbdcache? I test on my ssd cluster(only one ssd), it
> >> seemed ok.
> >>
> >> dd if=/dev/zero of=test bs=16k count=65536 oflag=direct
> >>>
> >>
> >> 82.3MB/s
> >>
> >>
> >>


--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Mark Kirkwood
2014-06-23 07:16:52 UTC
Permalink
On 23/06/14 18:51, Christian Balzer wrote:
>> On Sunday, June 22, 2014, Mark Kirkwood <mark.kirkwood at catalyst.net.nz>
>> rbd cache max dirty = 1073741824
>> rbd cache max dirty age = 100
>>
>
> Mark, you're giving it a 2GB cache.
> For a write test that's 1GB in size.
> "Aggressively set" is a bit of an understatement here. ^o^
> Most people will not want to spend this much memory on write-only caching.
>
> Of course with these settings that test will yield impressive results.
>
> However if you'd observe your storage nodes, OSDs, you will see that this
> is still going to take the same time until it is actually, finally written
> to disk. Same with using kernelspace RBD and caching enabled in the VM.
> Doing similar tests with fio I managed to fill the cache and got fantastic
> IOPS but then it took minutes to finally clean out.
>
> Resulting in hung task warnings for the jbd process(es) like this:
> ---
> May 28 16:58:56 tvm-03 kernel: [ 960.320182] INFO: task jbd2/vda1-8:153 blocked
> for more than 120 seconds.
> May 28 16:58:56 tvm-03 kernel: [ 960.320866] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> ---
>
> Now this doesn't actively break things AFAICT, but it left me feeling
> quite uncomfortable nevertheless.
>
> Also what happens if something "bad" happens to the VM or it's host before
> the cache is drained?
>
> From where I'm standing the RBD cache is fine for merging really small
> writes and that's it.

Yes! And thank you Christian for writing (something very similar to)
what I was about to write in response to Greg's question!

For database types (and yes, I'm one of those)... you want to know that
your writes (particularly your commit writes) are actually making it to
persistent storage (that ACID thing, you know). Now, I see RBD cache much
like battery-backed RAID cards: your commits (i.e. fsync or O_DIRECT
writes) are not actually written, but are cached, so to withstand node
failures you are depending on the reliability of a) your RAID controller
battery etc. in that case, or, more interestingly, b) your Ceph topology.
Given we usually design a Ceph cluster with these things in mind, it is
probably ok [1]!

Regards

Mark

[1] Obviously the setup in use here - 2 OSDs, 2 SATA and 2 SSD disks all
on the same host - is merely a play/benchmark config and is *not* a
topology designed with reliability in mind!
Mark Kirkwood
2014-06-24 09:46:52 UTC
Permalink
On 23/06/14 19:16, Mark Kirkwood wrote:
> For database types (and yes I'm one of those)...you want to know that
> your writes (particularly your commit writes) are actually making it to
> persistent storage (that ACID thing you know). Now I see RBD cache very
> like battery backed RAID cards - your commits (i.e fsync or O_DIRECT
> writes) are not actually written, but are cached - so you are depending
> on the reliability of a) your RAID controller battery etc in that case
> or more interestingly b) your Ceph topology - to withstand node
> failures. Given we usually design a Ceph cluster with these things in
> mind it is probably ok!
>

Thinking about this a bit more (and noting Mark N's comment too), this
is a bit more subtle than what I indicated above:

The rbd cache lives at the *client* level, so (thinking in OpenStack
terms): if your VM fails - no problem, the compute node still has the
write cache in memory... ok, but what if the compute node itself fails?
This is analogous to your battery-backed RAID card self-destructing. The
answer would appear to be data loss, so rbd cache reliability looks to be
dependent on the resilience of the client/compute design.

Regards

Mark
Mark Nelson
2014-06-24 12:28:14 UTC
Permalink
On 06/24/2014 04:46 AM, Mark Kirkwood wrote:
> On 23/06/14 19:16, Mark Kirkwood wrote:
>> For database types (and yes I'm one of those)...you want to know that
>> your writes (particularly your commit writes) are actually making it to
>> persistent storage (that ACID thing you know). Now I see RBD cache very
>> like battery backed RAID cards - your commits (i.e fsync or O_DIRECT
>> writes) are not actually written, but are cached - so you are depending
>> on the reliability of a) your RAID controller battery etc in that case
>> or more interestingly b) your Ceph topology - to withstand node
>> failures. Given we usually design a Ceph cluster with these things in
>> mind it is probably ok!
>>
>
> Thinking about this a bit more (and noting Mark N's comment too), this
> is a bit more subtle that what I indicated above:
>
> The rbd cache lives at the *client* level so (thinking in Openstack
> terms): if your VM fails - no problem, the compute node has the write
> cache in memory...ok, but how about if the compute node itself fails?
> This is analogous to: how about if your battery backed raid card self
> destructs? The answer would appear to be data loss, so rbd cache
> reliability looks to be dependent on the resilience of the
> client/compute design.

Well, it's the same problem you have with cache on most spinning disks.
You just have to assume that anything that wasn't flushed might not
have made it. Depending on the use case that might or might not be an
ok assumption.

In terms of data loss, the way I like to look at this is that there is
always a spectrum. Even with battery-backed RAID cards you don't have
any guarantee that any given write is going to make it out of RAM and to
the controller before a system crash. What's more important imho is
making sure you know exactly what the granularity is and what kind of
guarantees you do get.

>
> Regards
>
> Mark
Greg Poirier
2014-06-22 19:10:05 UTC
Permalink
I'm using Crucial M500s.


On Sat, Jun 21, 2014 at 7:09 PM, Mark Kirkwood <
mark.kirkwood at catalyst.net.nz> wrote:

> I can reproduce this in:
>
> ceph version 0.81-423-g1fb4574
>
> on Ubuntu 14.04. I have a two-OSD cluster with data on two SATA spinners
> (WD Blacks) and journals on two SSDs (Crucial m4's). I'm getting about 3.5
> MB/s (kernel and librbd) using your dd command with direct on. Leaving off
> direct I'm seeing about 140 MB/s (librbd) and 90 MB/s (kernel 3.11 [2]).
> The ssd's can do writes at about 180 MB/s each... which is something to
> look at another day[1].
>
> It would be interesting to know what version of Ceph Tyler is using, as his
> setup seems nowhere near as impacted by adding direct. Also it might be
> useful to know what make and model of SSD you both are using (some of 'em
> do not like a series of essentially sync writes). Having said that, testing
> my Crucial m4's shows they can do the dd command (with direct *on*) at
> about 180 MB/s... hmmm... so it *is* the Ceph layer, it seems.
>
> Regards
>
> Mark
>
> [1] I set filestore_max_sync_interval = 100 (30G journal... SSD able to do
> 180 MB/s etc), however I am still seeing writes to the spinners during the
> 8s or so that the above dd tests take.
> [2] Ubuntu 13.10 VM - I'll upgrade it to 14.04 and see if that helps at
> all.
>
>
> On 21/06/14 09:17, Greg Poirier wrote:
>
>> Thanks Tyler. So, I'm not totally crazy. There is something weird going
>> on.
>>
>> I've looked into things about as much as I can:
>>
>> - We have tested with collocated journals and dedicated journal disks.
>> - We have bonded 10Gb nics and have verified network configuration and
>> connectivity is sound
>> - We have run dd independently on the SSDs in the cluster and they are
>> performing fine
>> - We have tested both in a VM and with the RBD kernel module and get
>> identical performance
>> - We have pool size = 3, pool min size = 2 and have tested with min size
>> of 2 and 3 -- the performance impact is not bad
>> - osd_op times are approximately 6-12ms
>> - osd_sub_op times are 6-12 ms
>> - iostat reports service time of 6-12ms
>> - Latency between the storage and rbd client is approximately .1-.2ms
>> - Disabling replication entirely did not help significantly
Alexandre DERUMIER
2014-06-24 05:37:44 UTC
Permalink
Hi Greg,

>>So the only way to improve performance would be to not use O_DIRECT (as this should bypass rbd cache as well, right?).

Yes, indeed, O_DIRECT bypasses the cache.

BTW, do you need to use MySQL with O_DIRECT? The default
innodb_flush_method is fdatasync, so it should work with the cache.
(but you can lose some writes in case of a crash failure)
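
(For reference, checking what a given server is actually using is a
one-liner - credentials/host omitted, the variable names are the
standard InnoDB ones:)

mysql -e "SHOW VARIABLES LIKE 'innodb_flush_method'"
mysql -e "SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit'"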




Mark Kirkwood
2014-06-24 06:04:28 UTC
Permalink
On 24/06/14 17:37, Alexandre DERUMIER wrote:
> Hi Greg,
>
>>> So the only way to improve performance would be to not use O_DIRECT (as this should bypass rbd cache as well, right?).
>
> yes, indeed O_DIRECT bypass cache.
>
>
>
> BTW, Do you need to use mysql with O_DIRECT ? default innodb_flush_method is fdatasync, so it should work with cache.
> (but you can lose some write is case of a crash failure)
>

While this suggestion is good, I don't believe that the "you could lose
data" statement is correct with respect to fdatasync (or fsync) [1].
With all modern kernels I think you will find that fdatasync will
actually flush modified buffers to the device (i.e. write through the
file buffer cache).

All of which means that Mysql performance (looking at you binlog) may
still suffer due to lots of small block size sync writes.

regards

Mark

[1] See kernel archives concerning REQ_FLUSH and friends.
Robert van Leeuwen
2014-06-24 06:15:02 UTC
Permalink
> All of which means that Mysql performance (looking at you binlog) may
> still suffer due to lots of small block size sync writes.

Which begs the question:
Is anyone running a reasonably busy MySQL server on Ceph-backed storage?

We tried, and it did not perform well enough.
We have a small Ceph cluster: 3 machines with 2 SSD journals and 10 spinning disks each.
Using Ceph through KVM RBD we were seeing performance equal to about 1-2 spinning disks.

Reading this thread, it now looks a bit as if there are inherent architecture + latency issues that would prevent it from performing well as a MySQL database store.
I'd be interested in example setups where people are running busy databases on Ceph-backed volumes.

Cheers,
Robert
Mark Kirkwood
2014-06-24 08:45:07 UTC
Permalink
On 24/06/14 18:15, Robert van Leeuwen wrote:
>> All of which means that Mysql performance (looking at you binlog) may
>> still suffer due to lots of small block size sync writes.
>
> Which begs the question:
> Anyone running a reasonable busy Mysql server on Ceph backed storage?
>
> We tried and it did not perform good enough.
> We have a small ceph cluster: 3 machines with 2 SSD journals and 10 spinning disks each.
> Using ceph trough kvm rbd we were seeing performance equal to about 1-2 spinning disks.
>
> Reading this thread it now looks a bit if there are inherent architecture + latency issues that would prevent it from performing great as a Mysql database store.
> I'd be interested in example setups where people are running busy databases on Ceph backed volumes.

Yes indeed,

We have looked extensively at Postgres performance on rbd - and while it
is not MySQL, the underlying mechanism for durable writes (i.e. commit)
is essentially very similar (fsync, fdatasync and friends). We achieved
quite reasonable performance (by that I mean sufficiently encouraging to
be happy to host real datastores for our moderately busy systems - and
we are continuing to investigate using it for our really busy ones).

I have not experimented extensively with the various choices of flush
method (called sync method in Postgres, but the same idea), as we found
quite good performance with the default (fdatasync). However, this is
clearly an area that is worth investigation.
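
(If anyone wants to poke at this themselves, pg_test_fsync ships with
Postgres and benchmarks the various sync methods directly against a file
on the volume - the path below is just an example, and -s is seconds per
test on reasonably recent versions:)

pg_test_fsync -f /mnt/vol1/pgdata/fsync-test.dat -s 5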


Regards

Mark
Mark Nelson
2014-06-24 11:39:15 UTC
Permalink
On 06/24/2014 03:45 AM, Mark Kirkwood wrote:
> On 24/06/14 18:15, Robert van Leeuwen wrote:
>>> All of which means that Mysql performance (looking at you binlog) may
>>> still suffer due to lots of small block size sync writes.
>>
>> Which begs the question:
>> Anyone running a reasonable busy Mysql server on Ceph backed storage?
>>
>> We tried and it did not perform good enough.
>> We have a small ceph cluster: 3 machines with 2 SSD journals and 10
>> spinning disks each.
>> Using ceph trough kvm rbd we were seeing performance equal to about
>> 1-2 spinning disks.
>>
>> Reading this thread it now looks a bit if there are inherent
>> architecture + latency issues that would prevent it from performing
>> great as a Mysql database store.
>> I'd be interested in example setups where people are running busy
>> databases on Ceph backed volumes.
>
> Yes indeed,
>
> We have looked extensively at Postgres performance on rbd - and while it
> is not Mysql, the underlying mechanism for durable writes (i.e commit)
> is essentially very similar (fsync, fdatasync and friends). We achieved
> quite reasonable performance (by that I mean sufficiently encouraging to
> be happy to host real datastores for our moderately busy systems - and
> we are continuing to investigate using it for our really busy ones).
>
> I have not experimented exptensively with the various choices of flush
> method (called sync method in Postgres but the same idea), as we found
> quite good performance with the default (fdatasync). However this is
> clearly an area that is worth investigation.

FWIW, I ran through the DBT-3 benchmark suite on MariaDB on top of
qemu/kvm RBD, backed by a 3x replication pool on 30 OSDs. I kept buffer
sizes small to try to force disk IO and benchmarked against a local disk
passed through to the VM. We were typically about 3-4x faster on queries
than the local disk, but there were a couple of queries where we were
slower. I didn't look at how multiple databases scaled, though; that may
have its own benefits and challenges.

I'm encouraged overall, though. From your comments and from my own
testing, it looks like it's possible to have at least passable
performance with a single database, and potentially, as we reduce
latency in Ceph, to make it even better. With multiple databases, it's
entirely possible that we can do pretty well even now.

>
>
> Regards
>
> Mark
Mark Kirkwood
2014-06-24 22:27:33 UTC
Permalink
On 24/06/14 23:39, Mark Nelson wrote:
> On 06/24/2014 03:45 AM, Mark Kirkwood wrote:
>> On 24/06/14 18:15, Robert van Leeuwen wrote:
>>>> All of which means that Mysql performance (looking at you binlog) may
>>>> still suffer due to lots of small block size sync writes.
>>>
>>> Which begs the question:
>>> Anyone running a reasonable busy Mysql server on Ceph backed storage?
>>>
>>> We tried and it did not perform good enough.
>>> We have a small ceph cluster: 3 machines with 2 SSD journals and 10
>>> spinning disks each.
>>> Using ceph trough kvm rbd we were seeing performance equal to about
>>> 1-2 spinning disks.
>>>
>>> Reading this thread it now looks a bit if there are inherent
>>> architecture + latency issues that would prevent it from performing
>>> great as a Mysql database store.
>>> I'd be interested in example setups where people are running busy
>>> databases on Ceph backed volumes.
>>
>> Yes indeed,
>>
>> We have looked extensively at Postgres performance on rbd - and while it
>> is not Mysql, the underlying mechanism for durable writes (i.e commit)
>> is essentially very similar (fsync, fdatasync and friends). We achieved
>> quite reasonable performance (by that I mean sufficiently encouraging to
>> be happy to host real datastores for our moderately busy systems - and
>> we are continuing to investigate using it for our really busy ones).
>>
>> I have not experimented exptensively with the various choices of flush
>> method (called sync method in Postgres but the same idea), as we found
>> quite good performance with the default (fdatasync). However this is
>> clearly an area that is worth investigation.
>
> FWIW, I ran through the DBT-3 benchmark suite on MariaDB ontop of
> qemu/kvm RBD with a 3X replication pool on 30 OSDs with 3x replication.
> I kept buffer sizes small to try to force disk IO and benchmarked
> against a local disk passed through to the VM. We typically did about
> 3-4x faster on queries than the local disk, but there were a couple of
> queries were we were slower. I didn't look at how multiple databases
> scaled though. That may have it's own benefits and challenges.
>
> I'm encouraged overall though. It looks like from your comments and
> from my own testing it's possible to have at least passable performance
> with a single database and potentially as we reduce latency in Ceph make
> it even better. With multiple databases, it's entirely possible that we
> can do pretty good even now.
>

Yes - same kind of findings, specifically:

- random read and write (e.g. index access) faster than local disk
- sequential write (e.g. batch inserts) similar to or faster than local disk
- sequential read (e.g. table scan) slower than local disk

Regards

Mark
Josef Johansson
2014-06-25 15:15:36 UTC
Permalink
Hi,

On 25/06/14 00:27, Mark Kirkwood wrote:
> On 24/06/14 23:39, Mark Nelson wrote:
>> On 06/24/2014 03:45 AM, Mark Kirkwood wrote:
>>> On 24/06/14 18:15, Robert van Leeuwen wrote:
>>>>> All of which means that Mysql performance (looking at you binlog) may
>>>>> still suffer due to lots of small block size sync writes.
>>>>
>>>> Which begs the question:
>>>> Anyone running a reasonable busy Mysql server on Ceph backed storage?
>>>>
>>>> We tried and it did not perform good enough.
>>>> We have a small ceph cluster: 3 machines with 2 SSD journals and 10
>>>> spinning disks each.
>>>> Using ceph trough kvm rbd we were seeing performance equal to about
>>>> 1-2 spinning disks.
>>>>
>>>> Reading this thread it now looks a bit if there are inherent
>>>> architecture + latency issues that would prevent it from performing
>>>> great as a Mysql database store.
>>>> I'd be interested in example setups where people are running busy
>>>> databases on Ceph backed volumes.
>>>
>>> Yes indeed,
>>>
>>> We have looked extensively at Postgres performance on rbd - and
>>> while it
>>> is not Mysql, the underlying mechanism for durable writes (i.e commit)
>>> is essentially very similar (fsync, fdatasync and friends). We achieved
>>> quite reasonable performance (by that I mean sufficiently
>>> encouraging to
>>> be happy to host real datastores for our moderately busy systems - and
>>> we are continuing to investigate using it for our really busy ones).
>>>
>>> I have not experimented exptensively with the various choices of flush
>>> method (called sync method in Postgres but the same idea), as we found
>>> quite good performance with the default (fdatasync). However this is
>>> clearly an area that is worth investigation.
>>
>> FWIW, I ran through the DBT-3 benchmark suite on MariaDB ontop of
>> qemu/kvm RBD with a 3X replication pool on 30 OSDs with 3x replication.
>> I kept buffer sizes small to try to force disk IO and benchmarked
>> against a local disk passed through to the VM. We typically did about
>> 3-4x faster on queries than the local disk, but there were a couple of
>> queries were we were slower. I didn't look at how multiple databases
>> scaled though. That may have it's own benefits and challenges.
>>
>> I'm encouraged overall though. It looks like from your comments and
>> from my own testing it's possible to have at least passable performance
>> with a single database and potentially as we reduce latency in Ceph make
>> it even better. With multiple databases, it's entirely possible that we
>> can do pretty good even now.
>>
>
> Yes - same kind of findings, specifically:
>
> - random read and write (e.g index access) faster than local disk
> - sequential write (e.g batch inserts) similar or faster than local disk
> - sequential read (e.g table scan) slower than local disk
>
Regarding sequential read, I think it was
https://software.intel.com/en-us/blogs/2013/11/20/measure-ceph-rbd-performance-in-a-quantitative-way-part-ii
that did some tuning around that.
Has anyone tried optimizing it the way they did in the article?

Cheers,
Josef
> Regards
>
> Mark
Mark Kirkwood
2014-06-26 08:08:43 UTC
Permalink
On 26/06/14 03:15, Josef Johansson wrote:
> Hi,
>
> On 25/06/14 00:27, Mark Kirkwood wrote:
>
>> Yes - same kind of findings, specifically:
>>
>> - random read and write (e.g index access) faster than local disk
>> - sequential write (e.g batch inserts) similar or faster than local disk
>> - sequential read (e.g table scan) slower than local disk
>>
> Regarding sequential read, I think it was
> https://software.intel.com/en-us/blogs/2013/11/20/measure-ceph-rbd-performance-in-a-quantitative-way-part-ii
> that did some tuning with that.
> Anyone tried to optimize it the way they did in the article?
>
>

In a similar vein, enabling striping in the rbd volume might be worth
experimenting with (I just thought of it after reading the 'How to improve
performance of ceph objcect storage cluster' thread).
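
(For anyone wanting to try that, something along these lines should
create a striped image - the size, stripe unit and count are just a
starting point, it needs image format 2, and as far as I know kernel rbd
does not support fancy striping, so this is for librbd/qemu use:)

rbd create --size 10240 --image-format 2 \
    --stripe-unit 65536 --stripe-count 16 rbd/vol2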

Regards

Mark
Alexandre DERUMIER
2014-06-24 06:45:04 UTC
Permalink
I don't know if it's related, but:

"[Performance] Improvement on DB Performance"
http://www.spinics.net/lists/ceph-devel/msg19062.html

There is a patch here:
https://github.com/ceph/ceph/pull/1848

It has already been pushed to master.

