Discussion:
New Ceph-cluster and performance "questions"
Patrik Martinsson
2018-02-05 18:15:37 UTC
Hello,

I'm not a "storage-guy" so please excuse me if I'm missing /
overlooking something obvious.

My question is in the area "what kind of performance am I to expect
with this setup". We have bought servers, disks and networking for our
future ceph-cluster and are now in our "testing-phase" and I simply
want to understand if our numbers line up, or if we are missing
something obvious.

Background,
- cephmon1, DELL R730, 1 x E5-2643, 64 GB
- cephosd1-6, DELL R730, 1 x E5-2697, 64 GB
- each server is connected to a dedicated 50 GbE network, with
Mellanox-4 Lx cards (teamed into one interface, team0).

In our test we only have one monitor. This will of course not be the
case later on.

Each OSD node has the following SSDs configured as pass-through (not
RAID 0 through the RAID controller):

- 2 x Dell 1.6TB 2.5" SATA MLC MU 6Gbps SSD (THNSF81D60CSE); the only
spec I can find on Dell's homepage says "Data Transfer Rate 600 Mbps"
- 4 x Intel SSD DC S3700 (https://ark.intel.com/products/71916/Intel-SSD-DC-S3700-Series-800GB-2_5in-SATA-6Gbs-25nm-MLC)
- 3 HDDs, which are uninteresting here. At the moment I'm only
interested in the performance of the SSD pool.

The Ceph cluster is created with ceph-ansible with "default params"
(i.e. we have not added / changed anything except what was necessary).

When the ceph-cluster is up, we have 54 OSDs (36 SSD, 18 HDD).
The min_size is 3 on the pool.

Rules are created as follows,

$ > ceph osd crush rule create-replicated ssd-rule default host ssd
$ > ceph osd crush rule create-replicated hdd-rule default host hdd
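
As a quick sanity check before creating the pools, the device classes
and the new rules can be verified with standard Luminous commands:

$ > ceph osd crush rule ls             # should list ssd-rule and hdd-rule
$ > ceph osd crush class ls            # should show "hdd" and "ssd"
$ > ceph osd crush tree --show-shadow  # shows the per-class shadow buckets the rules select from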

Testing is done on a separate node (same NIC and network though),

$ > ceph osd pool create ssd-bench 512 512 replicated ssd-rule

$ > ceph osd pool application enable ssd-bench rbd

$ > rbd create ssd-image --size 1T --pool ssd-bench

$ > rbd map ssd-image --pool ssd-bench

$ > mkfs.xfs /dev/rbd/ssd-bench/ssd-image

$ > mount /dev/rbd/ssd-bench/ssd-image /ssd-bench

Fio is then run like this,
$ >
actions="read randread write randwrite"
blocksizes="4k 128k 8m"
tmp_dir="/tmp/"
suffix=""                                # optional tag appended to the output file names

for blocksize in ${blocksizes}; do
  for action in ${actions}; do
    rm -f ${tmp_dir}${action}_${blocksize}_${suffix}
    fio --directory=/ssd-bench \
        --time_based \
        --direct=1 \
        --rw=${action} \
        --bs=${blocksize} \
        --size=1G \
        --numjobs=100 \
        --runtime=120 \
        --group_reporting \
        --name=testfile \
        --output=${tmp_dir}${action}_${blocksize}_${suffix}
  done
done
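
Side note: the numbers below were read off the plain-text fio output. A
small sketch of how the same values could be pulled out automatically,
assuming a reasonably recent fio with JSON output and jq installed:

$ > fio --directory=/ssd-bench --time_based --direct=1 --rw=read --bs=4k \
        --size=1G --numjobs=100 --runtime=120 --group_reporting --name=testfile \
        --output-format=json --output=/tmp/read_4k.json
$ > jq '.jobs[0].read.iops, .jobs[0].read.bw' /tmp/read_4k.json   # bw is reported in KB/s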

After running this, we end up with these numbers

read_4k iops : 159266 throughput : 622 MB / sec
randread_4k iops : 151887 throughput : 593 MB / sec

read_128k iops : 31705 throughput : 3963.3 MB / sec
randread_128k iops : 31664 throughput : 3958.5 MB / sec

read_8m iops : 470 throughput : 3765.5 MB / sec
randread_8m iops : 463 throughput : 3705.4 MB / sec

write_4k iops : 50486 throughput : 197 MB / sec
randwrite_4k iops : 42491 throughput : 165 MB / sec

write_128k iops : 15907 throughput : 1988.5 MB / sec
randwrite_128k iops : 15558 throughput : 1944.9 MB / sec

write_8m iops : 347 throughput : 2781.2 MB / sec
randwrite_8m iops : 347 throughput : 2777.2 MB / sec


Ok, if you've read all the way here, the million-dollar question is of
course whether the numbers above are in the ballpark of what to expect,
or if they are low.

The main reason I'm a bit uncertain about the numbers above is, and this
may sound fuzzy, that we did a POC a couple of months ago with fewer
OSD's (if I remember the configuration correctly; unfortunately we only
saved the numbers, not the *exact* configuration *sigh*, though the
networking was the same), and those numbers were

read 4k         iops : 282303   throughput : 1102.8 MB / sec  (b)
randread 4k     iops : 253453   throughput : 990.52 MB / sec  (b)

read 128k       iops : 31298    throughput : 3912 MB / sec    (w)
randread 128k   iops : 9013     throughput : 1126.8 MB / sec  (w)

read 8m         iops : 405      throughput : 3241.4 MB / sec  (w)
randread 8m     iops : 369      throughput : 2957.8 MB / sec  (w)

write 4k        iops : 80644    throughput : 315 MB / sec     (b)
randwrite 4k    iops : 53178    throughput : 207 MB / sec     (b)

write 128k      iops : 17126    throughput : 2140.8 MB / sec  (b)
randwrite 128k  iops : 11654    throughput : 2015.9 MB / sec  (b)

write 8m        iops : 258      throughput : 2067.1 MB / sec  (w)
randwrite 8m    iops : 251      throughput : 1456.9 MB / sec  (w)

Where (b) means the POC number is higher than the current setup and (w)
means it is lower. What I would have expected after adding more OSD's
was an increase in *all* numbers. The read_4k throughput and iops
numbers in the current setup are not even close to the POC, which makes
me wonder whether these "new" numbers are what they are "supposed to be"
or if I'm missing something obvious.

Ehm, in this new setup we are running with MTU 1500; I think we had the
POC at 9000, but the difference on read_4k is roughly 400 MB/sec and I
wonder whether the MTU alone would make up for that.
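
A rough sketch of how the MTU theory could be tested, assuming team0 is
the interface on all nodes and the switch ports accept jumbo frames:

$ > ip link set dev team0 mtu 9000    # on every OSD node and the benchmark client
$ > ping -M do -s 8972 cephosd1       # 8972 bytes payload + 28 bytes headers = 9000, must not fragment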

Is the above a good way of measuring our cluster, or are there better,
more reliable ways of measuring it?

Is there a way to calculate this "theoretically" (i.e. with 6 nodes and
36 SSD's we should get these numbers) and then compare it to reality?
Again, not a storage guy and I haven't really done this before, so
please excuse my layman's terms.

Thanks for Ceph and keep up the awesome work!

Best regards,
Patrik Martinsson
Sweden
Christian Balzer
2018-02-06 01:47:37 UTC
Hello,
Post by Patrik Martinsson
I'm not a "storage-guy" so please excuse me if I'm missing /
overlooking something obvious.
My question is in the area "what kind of performance am I to expect
with this setup". We have bought servers, disks and networking for our
future ceph-cluster and are now in our "testing-phase" and I simply
want to understand if our numbers line up, or if we are missing
something obvious.
A myriad of variables will make for a myriad of results, expected and
otherwise.

For example, you say nothing about the Ceph version, how the OSDs are
created (filestore, bluestore, details), OS and kernel (PTI!!) version.
Post by Patrik Martinsson
Background,
- cephmon1, DELL R730, 1 x E5-2643, 64 GB
- cephosd1-6, DELL R730, 1 x E5-2697, 64 GB
Unless you're planning on having 16 SSDs per node, a CPU with fewer and
faster cores would be better (see archives).

In general, you will want to run atop or something similar on your ceph
and client nodes during these tests to see where and if any resources
(CPU, DISK, NET) are getting stressed.
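
Something along these lines on each OSD node and on the client while fio
runs (any similar tool will do):

$ > atop 2          # per-interval view of busy CPUs, disks and NICs
$ > iostat -xm 2    # per-device %util, await and MB/s
$ > sar -n DEV 2    # per-interface rx/tx throughput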
Post by Patrik Martinsson
- each server is connected to a dedicated 50 Gbe network, with
Mellanox-4 Lx cards (teamed into one interface, team0).
In our test we only have one monitor. This will of course not be the
case later on.
Each OSD, has the following SSD's configured as pass-through (not raid
0 through the raid-controller),
- 2 x Dell 1.6TB 2.5" SATA MLC MU 6Gbs SSD (THNSF81D60CSE), only spec I
can find on Dell's homepage says "Data Transfer Rate 600 Mbps"
- 4 x Intel SSD DC S3700 (https://ark.intel.com/products/71916/Intel-SSD-DC-S3700-Series-800GB-2_5in-SATA-6Gbs-25nm-MLC)
When and where did you get those?
I wonder if they're available again, had 0 luck getting any last year.
Post by Patrik Martinsson
- 3 HDD's, which is uninteresting here. At the moment I'm only
interested in the performance of the SSD-pool.
Ceph-cluster is created with ceph-ansible with "default params" (ie.
have not added / changed anything except the necessary).
When ceph-cluster is up, we have 54 OSD's (36 SSD, 18HDD).
The min_size is 3 on the pool.
Any reason for that?
It will make any OSD failure result in a cluster lockup with a size of 3.
Unless you did set your size to 4, in which case you wrecked performance.
Post by Patrik Martinsson
Rules are created as follows,
$ > ceph osd crush rule create-replicated ssd-rule default host ssd
$ > ceph osd crush rule create-replicated hdd-rule default host hdd
Testing is done on a separate node (same nic and network though),
$ > ceph osd pool create ssd-bench 512 512 replicated ssd-rule
$ > ceph osd pool application enable ssd-bench rbd
$ > rbd create ssd-image --size 1T --pool ssd-pool
$ > rbd map ssd-image --pool ssd-bench
$ > mkfs.xfs /dev/rbd/ssd-bench/ssd-image
$ > mount /dev/rbd/ssd-bench/ssd-image /ssd-bench
Unless you're planning on using the Ceph cluster in this fashion (kernel
mounted images), you'd be better off testing in an environment that
matches the use case, i.e. from a VM.
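
Short of a full VM, fio can also talk to librbd directly (the same path
qemu uses), which at least takes krbd and the filesystem out of the
picture; a sketch, assuming fio was built with rbd support and
client.admin can reach the pool:

$ > fio --ioengine=rbd --clientname=admin --pool=ssd-bench --rbdname=ssd-image \
        --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 --direct=1 \
        --time_based --runtime=120 --name=rbd-bench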
Post by Patrik Martinsson
Fio is then run like this,
$ >
actions="read randread write randwrite"
blocksizes="4k 128k 8m"
tmp_dir="/tmp/"
for blocksize in ${blocksizes}; do
for action in ${actions}; do
rm -f ${tmp_dir}${action}_${blocksize}_${suffix}
fio --directory=/ssd-bench \
--time_based \
--direct=1 \
--rw=${action} \
--bs=$blocksize \
--size=1G \
--numjobs=100 \
--runtime=120 \
--group_reporting \
--name=testfile \
--output=${tmp_dir}${action}_${blocksize}_${suffix}
done
done
After running this, we end up with these numbers
read_4k iops : 159266 throughput : 622 MB / sec
randread_4k iops : 151887 throughput : 593 MB / sec
These are very nice numbers.
Too nice, in my book.
I have a test cluster with a cache-tier based on 2 nodes with 3 DC S3610s
400GB each, obviously with size 2 and min_size=1; just based on that it
will be faster than a size=3 pool. It runs Jewel with Filestore.
Network is IPoIB (40Gb), so in that aspect similar to yours,
64k MTU though.
Ceph nodes have E5-2620 v3 @ 2.40GHz CPUs and 32GB RAM.
I've run the following fio (with different rw actions of course) from a
KVM/qemu VM and am also showing how busy the SSDs, OSD processes, qemu
process on the comp node and the fio inside the VM are:
"fio --size=4G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
--rw=read --name=fiojob --blocksize=4K --iodepth=64"

READ
read : io=4096.0MB, bw=81361KB/s, iops=20340, runt= 51552msec
SSDs: 0% (pagecache on the nodes), OSDs: 45%, qemu: 330%, fio_in_VM: 19%

RANDREAD
read : io=4096.0MB, bw=62760KB/s, iops=15689, runt= 66831msec
SSDs: 0% (pagecache on the nodes), OSDs: 50%, qemu: 550%!!, fio_in_VM: 23%

WRITE
write: io=4096.0MB, bw=256972KB/s, iops=64243, runt= 16322msec
SSDs: 40%, OSDs: 20%, qemu: 150%, fio_in_VM: 45%

RANDWRITE
write: io=4096.0MB, bw=43981KB/s, iops=10995, runt= 95366msec
SSDs: 38%, OSDs: 250%!!, qemu: 480%, fio_in_VM: 23%

Note especially the OSD CPU usage in the randwrite fio, this is where
faster (and non-powersaving mode) CPUs will be significant.
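
Checking and forcing the governor is cheap to try:

$ > cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
$ > cpupower frequency-set -g performance    # on all OSD nodes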
I'm not seeing the same level of performance reductions with rand actions
in your results.

We can roughly compare the reads as the SSDs and pool size play little to
no part in it.
20k * 6 (to compensate for your OSD numbers) is 120k, definitely in the
same ballpark as your 158k.
It doesn't explain the 282k with your old setup, unless the MTU is
really so significant (see below) or other things changed as well.

For non-random writes you're basically looking at latency (numjobs is
meaningless), so that's why my 62k (remember, size 2) is comparable to
your 50k or 80k respectively.
For randwrite the larger number of OSDs in your case nicely explains the
difference seen.
Post by Patrik Martinsson
read_128k iops : 31705 throughput : 3963.3 MB / sec
randread_128k iops : 31664 throughput : 3958.5 MB / sec
read_8m iops : 470 throughput : 3765.5 MB / sec
randread_8m iops : 463 throughput : 3705.4 MB / sec
write_4k iops : 50486 throughput : 197 MB / sec
randwrite_4k iops : 42491 throughput : 165 MB / sec
write_128k iops : 15907 throughput : 1988.5 MB / sec
randwrite_128k iops : 15558 throughput : 1944.9 MB / sec
write_8m iops : 347 throughput : 2781.2 MB / sec
randwrite_8m iops : 347 throughput : 2777.2 MB / sec
Ok, if you read all way here, the million dollar question is of course
if the numbers above are in the ballpark of what to expect, or if they
are low.
The main reason I'm a bit uncertain on the numbers above are, and this
may sound fuzzy but, because we did POC a couple of months ago with (if
I remember the configuration correctly, unfortunately we only saved the
numbers, not the *exact* configuration *sigh* (networking still the
same though)) with fewer OSD's and those numbers were
Which unfortunately basically means that these results are... questionable
when comparing them with your current setup.
Post by Patrik Martinsson
read 4k         iops : 282303   throughput : 1102.8 MB / sec  (b)
randread 4k     iops : 253453   throughput : 990.52 MB / sec  (b)
read 128k       iops : 31298    throughput : 3912 MB / sec    (w)
randread 128k   iops : 9013     throughput : 1126.8 MB / sec  (w)
read 8m         iops : 405      throughput : 3241.4 MB / sec  (w)
randread 8m     iops : 369      throughput : 2957.8 MB / sec  (w)
write 4k        iops : 80644    throughput : 315 MB / sec     (b)
randwrite 4k    iops : 53178    throughput : 207 MB / sec     (b)
write 128k      iops : 17126    throughput : 2140.8 MB / sec  (b)
randwrite 128k  iops : 11654    throughput : 2015.9 MB / sec  (b)
write 8m        iops : 258      throughput : 2067.1 MB / sec  (w)
randwrite 8m    iops : 251      throughput : 1456.9 MB / sec  (w)
Where (b) is higher number and (w) is lower. What I would expect since
adding more OSD's was an increase on *all* numbers. The read_4k_
throughput and iops number in current setup is not even close to the
POC which makes me wonder if these "new" numbers are what they "are
suppose to" or if I'm missing something obvious.
Ehm, in this new setup we are running with MTU 1500, I think we had the
POC to 9000, but the difference on the read_4k is roughly 400 MB/sec
and I wonder if the MTU will make up for that.
You're in the best position of everybody here to verify this by changing
your test cluster to use the other MTU and compare...
Post by Patrik Martinsson
Is the above a good way of measuring our cluster, or is it better more
reliable ways of measuring it ?
See above.
A fio test is definitely a closer thing to reality compared to OSD or
RADOS benches.
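
That said, a quick rados bench against the same pool can still serve as
a rough cross-check that takes the filesystem and krbd out of the
equation, e.g.:

$ > rados bench -p ssd-bench 60 write --no-cleanup
$ > rados bench -p ssd-bench 60 rand
$ > rados -p ssd-bench cleanup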
Post by Patrik Martinsson
Is there a way to calculate this "theoretically" (ie with with 6 nodes
and 36 SSD's we should get these numbers) and then compare it to the
reality. Again, not a storage guy and haven't really done this before
so please excuse me for my laymen terms.
People have tried in the past and AFAIR nothing really conclusive came
about, it really is a game of too many variables.

Regards,

Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Rakuten Communications
Konstantin Shalygin
2018-02-06 05:04:02 UTC
/offtopic
Post by Christian Balzer
When and where did you get those?
I wonder if they're available again, had 0 luck getting any last year.
I have been seeing the P3700 in Russia since December 2017 with real
quantities in stock, not just a "price with out of stock".

https://market.yandex.ru/catalog/55316/list?text=intel%20p3700&cvredirect=3&track=srch_ddl&glfilter=7893318%3A453797&local-offers-first=0&deliveryincluded=0&onstock=0&how=aprice



k
Patrik Martinsson
2018-02-08 10:58:43 UTC
Hi Christian,

First of all, thanks for all the great answers and sorry for the late
reply.
Post by Christian Balzer
Hello,
Post by Patrik Martinsson
I'm not a "storage-guy" so please excuse me if I'm missing /
overlooking something obvious.
My question is in the area "what kind of performance am I to expect
with this setup". We have bought servers, disks and networking for our
future ceph-cluster and are now in our "testing-phase" and I simply
want to understand if our numbers line up, or if we are missing
something obvious.
A myriad of variables will make for a myriad of results, expected and
otherwise.
For example, you say nothing about the Ceph version, how the OSDs are
created (filestore, bluestore, details), OS and kernel (PTI!!)
version.
Good catch, I totally forgot this.

$ > ceph version 12.2.1-40.el7cp
(c6d85fd953226c9e8168c9abe81f499d66cc2716) luminous (stable), deployed
via Red Hat Ceph Storage 3 (ceph-ansible). Bluestore is enabled, and
osd_scenario is set to collocated.

$ > cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.4 (Maipo)

$ > uname -r
3.10.0-693.11.6.el7.x86_64 (PTI *not* disabled at boot)
Post by Christian Balzer
Post by Patrik Martinsson
Background,
- cephmon1, DELL R730, 1 x E5-2643, 64 GB
- cephosd1-6, DELL R730, 1 x E5-2697, 64 GB
Unless you're planning on having 16 SSDs per node, a CPU with less and
faster cores would be better (see archives).
In general, you will want to run atop or something similar on your ceph
and client nodes during these tests to see where and if any resources
(CPU, DISK, NET) are getting stressed.
Understood, thanks!
Post by Christian Balzer
Post by Patrik Martinsson
- each server is connected to a dedicated 50 Gbe network, with
Mellanox-4 Lx cards (teamed into one interface, team0).
In our test we only have one monitor. This will of course not be the
case later on.
Each OSD, has the following SSD's configured as pass-through (not raid
0 through the raid-controller),
- 2 x Dell 1.6TB 2.5" SATA MLC MU 6Gbs SSD (THNSF81D60CSE), only spec I
can find on Dell's homepage says "Data Transfer Rate 600 Mbps"
- 4 x Intel SSD DC S3700 (https://ark.intel.com/products/71916/Intel-SSD-DC-S3700-Series-800GB-2_5in-SATA-6Gbs-25nm-MLC)
When and where did you get those?
I wonder if they're available again, had 0 luck getting any last year.
It's actually disks that we have had "lying around", no clue where you
could get them today.
Post by Christian Balzer
Post by Patrik Martinsson
- 3 HDD's, which is uninteresting here. At the moment I'm only
interested in the performance of the SSD-pool.
Ceph-cluster is created with ceph-ansible with "default params" (ie.
have not added / changed anything except the necessary).
When ceph-cluster is up, we have 54 OSD's (36 SSD, 18HDD).
The min_size is 3 on the pool.
Any reason for that?
It will make any OSD failure result in a cluster lockup with a size of 3.
Unless you did set your size to 4, in which case you wrecked
performance.
Hm, sorry, what I meant was size=3. Reading the documentation, I'm not
sure I understand the difference between size and min_size.
Post by Christian Balzer
Post by Patrik Martinsson
Rules are created as follows,
$ > ceph osd crush rule create-replicated ssd-rule default host ssd
$ > ceph osd crush rule create-replicated hdd-rule default host hdd
Testing is done on a separate node (same nic and network though),
$ > ceph osd pool create ssd-bench 512 512 replicated ssd-rule
$ > ceph osd pool application enable ssd-bench rbd
$ > rbd create ssd-image --size 1T --pool ssd-pool
$ > rbd map ssd-image --pool ssd-bench
$ > mkfs.xfs /dev/rbd/ssd-bench/ssd-image
$ > mount /dev/rbd/ssd-bench/ssd-image /ssd-bench
Unless you're planning on using the Ceph cluster in this fashion (kernel
mounted images), you'd be better off testing in an environment that
matches the use case, i.e. from a VM.
Gotcha, thanks!
Post by Christian Balzer
Post by Patrik Martinsson
Fio is then run like this,
$ >
actions="read randread write randwrite"
blocksizes="4k 128k 8m"
tmp_dir="/tmp/"
for blocksize in ${blocksizes}; do
for action in ${actions}; do
rm -f ${tmp_dir}${action}_${blocksize}_${suffix}
fio --directory=/ssd-bench \
--time_based \
--direct=1 \
--rw=${action} \
--bs=$blocksize \
--size=1G \
--numjobs=100 \
--runtime=120 \
--group_reporting \
--name=testfile \
--output=${tmp_dir}${action}_${blocksize}_${suffix}
done
done
After running this, we end up with these numbers
read_4k iops : 159266 throughput : 622 MB / sec
randread_4k iops : 151887 throughput : 593 MB / sec
These are very nice numbers.
Too nice, in my book.
I have a test cluster with a cache-tier based on 2 nodes with 3 DC S3610s
400GB each, obviously with size 2 and min_size=1. So just based on that,
it will be faster than a size 3 pool, Jewel with Filestore.
Network is IPoIB (40Gb), so in that aspect similar to yours,
64k MTU though.
I've run the following fio (with different rw actions of course) from a
KVM/qemu VM and am also showing how busy the SSDs, OSD processes, qemu
"fio --size=4G --ioengine=libaio --invalidate=1 --direct=1 --
numjobs=1
--rw=read --name=fiojob --blocksize=4K --iodepth=64"
READ
read : io=4096.0MB, bw=81361KB/s, iops=20340, runt= 51552msec
SSDs: 0% (pagecache on the nodes), OSDs: 45%, qemu: 330%, fio_in_VM: 19%
RANDREAD
read : io=4096.0MB, bw=62760KB/s, iops=15689, runt= 66831msec
SSDs: 0% (pagecache on the nodes), OSDs: 50%, qemu: 550%!!,
fio_in_VM: 23%
WRITE
write: io=4096.0MB, bw=256972KB/s, iops=64243, runt= 16322msec
SSDs: 40%, OSDs: 20%, qemu: 150%, fio_in_VM: 45%
RANDWRITE
write: io=4096.0MB, bw=43981KB/s, iops=10995, runt= 95366msec
SSDs: 38%, OSDs: 250%!!, qemu: 480%, fio_in_VM: 23%
Note especially the OSD CPU usage in the randwrite fio, this is where
faster (and non-powersaving mode) CPUs will be significant.
I'm not seeing the same level of performance reductions with rand actions
in your results.
We can roughly compare the reads as the SSDs and pool size play little to
no part in it.
20k *6 (to compensate for your OSD numbers) is 120k, definitely the same
ball park as your 158k.
It doesn't explain the 282k with your old setup, unless the MTU is really
so significant (see below) or other things changed, like more
Thanks for all that, it makes sense. I'm not sure I will dig much deeper
into why I got those numbers to begin with. It is a bit annoying, but
since we have little knowledge about the disks and that previous setup,
it's impossible to compare (since, as you say at the beginning, a
"myriad of variables will make for a myriad of results").
Post by Christian Balzer
For nonrand writes your basically looking at latency (numjobs is
meaningless), so thats why my 62k (remember size 2) are comparable to your
50k or 80k respectively.
For randwrite the larger amount of OSDs in your case nicely explains the
difference seen.
Post by Patrik Martinsson
read_128k iops : 31705 throughput : 3963.3 MB / sec
randread_128k iops : 31664 throughput : 3958.5 MB / sec
read_8m iops : 470 throughput : 3765.5 MB / sec
randread_8m iops : 463 throughput : 3705.4 MB / sec
write_4k iops : 50486 throughput : 197 MB / sec
randwrite_4k iops : 42491 throughput : 165 MB / sec
write_128k iops : 15907 throughput : 1988.5 MB / sec
randwrite_128k iops : 15558 throughput : 1944.9 MB / sec
write_8m iops : 347 throughput : 2781.2 MB / sec
randwrite_8m iops : 347 throughput : 2777.2 MB / sec
Ok, if you read all way here, the million dollar question is of course
if the numbers above are in the ballpark of what to expect, or if they
are low.
The main reason I'm a bit uncertain on the numbers above are, and this
may sound fuzzy but, because we did POC a couple of months ago with (if
I remember the configuration correctly, unfortunately we only saved the
numbers, not the *exact* configuration *sigh* (networking still the
same though)) with fewer OSD's and those numbers were
Thanks again.
Post by Christian Balzer
Which unfortunately basically means that these results are...
questionable
when comparing them with your current setup.
Post by Patrik Martinsson
read 4k         iops : 282303   throughput : 1102.8 MB / sec  (b)
randread 4k     iops : 253453   throughput : 990.52 MB / sec  (b)
read 128k       iops : 31298    throughput : 3912 MB / sec    (w)
randread 128k   iops : 9013     throughput : 1126.8 MB / sec  (w)
read 8m         iops : 405      throughput : 3241.4 MB / sec  (w)
randread 8m     iops : 369      throughput : 2957.8 MB / sec  (w)
write 4k        iops : 80644    throughput : 315 MB / sec     (b)
randwrite 4k    iops : 53178    throughput : 207 MB / sec     (b)
write 128k      iops : 17126    throughput : 2140.8 MB / sec  (b)
randwrite 128k  iops : 11654    throughput : 2015.9 MB / sec  (b)
write 8m        iops : 258      throughput : 2067.1 MB / sec  (w)
randwrite 8m    iops : 251      throughput : 1456.9 MB / sec  (w)
Where (b) is higher number and (w) is lower. What I would expect since
adding more OSD's was an increase on *all* numbers. The read_4k_
throughput and iops number in current setup is not even close to the
POC which makes me wonder if these "new" numbers are what they "are
suppose to" or if I'm missing something obvious.
Ehm, in this new setup we are running with MTU 1500, I think we had the
POC to 9000, but the difference on the read_4k is roughly 400 MB/sec
and I wonder if the MTU will make up for that.
You're in the best position of everybody here to verify this by changing
your test cluster to use the other MTU and compare...
Yes, we will do some more benchmarks and monitor the results.
Post by Christian Balzer
Post by Patrik Martinsson
Is the above a good way of measuring our cluster, or is it better more
reliable ways of measuring it ?
See above.
A fio test is definitely a closer thing to reality compared to OSD or
RADOS benches.
Post by Patrik Martinsson
Is there a way to calculate this "theoretically" (ie with with 6 nodes
and 36 SSD's we should get these numbers) and then compare it to the
reality. Again, not a storage guy and haven't really done this before
so please excuse me for my laymen terms.
People have tried in the past and AFAIR nothing really conclusive came
about, it really is a game of too many variables.
Regards,
Christian
Again, thanks for everything, nicely explained.

Not sure if it could be of interest for anyone, but I took a screenshot
of our fio-diagrams generated in Confluence and put it here:
https://imgur.com/a/PaMLg

Basically the only interesting bars are the yellow (test #3) and the
purple (test #4), as those are the ones where I actually know the exact
configuration.

Interesting to see that enabling the raid controller and putting all
disks in raid 0 (disk cache disabled) vs. pass-through yielded quite a
big improvement in the write128k/write8m and randwrite128k/randwrite8m
areas, whereas in the other areas there isn't much of a difference. So
in my opinion raid 0 would be the way to go; however, I see some
differing opinions about this. Red Hat talks about it in the following
pdf:
https://www.redhat.com/cms/managed-files/st-rhcs-config-guide-technology-detail-inc0387897-201604-en.pdf

Any thoughts about this ?

Putting disks through the raid-controller also messes up the automatic
type-classification that ceph does, which is annoying but
"workaroundable". As I understand it, ceph determines the disk class by
looking at the value in /sys/block/<disk>/queue/rotational (1 is hdd, 0
is ssd). This value gets set correctly when using "pass through
(non-raid)", but when using raid 0 it gets set to 1 even though they are
SSD's.
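
The effect is easy to see on the node itself, e.g.:

$ > lsblk -d -o NAME,ROTA                 # ROTA=1 for anything the kernel thinks is rotational
$ > cat /sys/block/sda/queue/rotational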

We work around this by using the following udev rule, where "sd[a-c]"
would be the SSD disks.

$ > echo 'ACTION=="add|change", KERNEL=="sd[a-c]", ATTR{queue/rotational}="0"' >> /etc/udev/rules.d/10-ssd-persistent.rules
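
Another option, instead of (or in addition to) the udev rule, would be
to simply override the class in the CRUSH map; a sketch with a
hypothetical OSD id:

$ > ceph osd crush rm-device-class osd.12        # the class has to be removed before it can be changed
$ > ceph osd crush set-device-class ssd osd.12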

Maybe this has been mentioned before, but I'm curious why this happens;
does anyone know?

Again, thanks for all the great work.


Best regards,
Patrik
Sweden
Christian Balzer
2018-02-09 02:47:50 UTC
Hello,
Post by Patrik Martinsson
Hi Christian,
First of all, thanks for all the great answers and sorry for the late
reply.
You're welcome.
Post by Patrik Martinsson
Post by Christian Balzer
Hello,
Post by Patrik Martinsson
I'm not a "storage-guy" so please excuse me if I'm missing /
overlooking something obvious.
My question is in the area "what kind of performance am I to expect
with this setup". We have bought servers, disks and networking for our
future ceph-cluster and are now in our "testing-phase" and I simply
want to understand if our numbers line up, or if we are missing
something obvious.
A myriad of variables will make for a myriad of results, expected and
otherwise.
For example, you say nothing about the Ceph version, how the OSDs are
created (filestore, bluestore, details), OS and kernel (PTI!!) version.
Good catch, I totally forgot this.
$ > ceph version 12.2.1-40.el7cp
(c6d85fd953226c9e8168c9abe81f499d66cc2716) luminous (stable), deployed
via Red Hat Ceph Storage 3 (ceph-ansible). Bluestore is enabled, and
osd_scenario is set to collocated.
Given the (rather disconcerting) number of bugs in Luminous, you probably
want to go to 12.2.2 now and .3 when released.
Post by Patrik Martinsson
$ > cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.4 (Maipo)
$ > uname -r
3.10.0-693.11.6.el7.x86_64 (PTI *not* disabled at boot)
That's what I'd call an old kernel, if it weren't for the (insane level of)
RH backporting.

As for PTI, I'd disable it on pure Ceph nodes, the logic being that if
somebody can access those in the first place you have bigger problems
already.

Make sure to run a test/benchmark before and after and let the community
here know.
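
On RHEL 7 that would be something along these lines (reboot required;
worth double-checking against Red Hat's own guidance, and the debugfs
knob is specific to their kernels):

$ > grubby --update-kernel=ALL --args="nopti"
$ > reboot
$ > cat /sys/kernel/debug/x86/pti_enabled    # should read 0 afterwards (debugfs must be mounted)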
Post by Patrik Martinsson
Post by Christian Balzer
Post by Patrik Martinsson
Background,
- cephmon1, DELL R730, 1 x E5-2643, 64 GB
- cephosd1-6, DELL R730, 1 x E5-2697, 64 GB
Unless you're planning on having 16 SSDs per node, a CPU with less and
faster cores would be better (see archives).
In general, you will want to run atop or something similar on your ceph
and client nodes during these tests to see where and if any resources
(CPU, DISK, NET) are getting stressed.
Understood, thanks!
Post by Christian Balzer
Post by Patrik Martinsson
- each server is connected to a dedicated 50 Gbe network, with
Mellanox-4 Lx cards (teamed into one interface, team0).
In our test we only have one monitor. This will of course not be the
case later on.
Each OSD, has the following SSD's configured as pass-through (not raid
0 through the raid-controller),
- 2 x Dell 1.6TB 2.5" SATA MLC MU 6Gbs SSD (THNSF81D60CSE), only spec I
can find on Dell's homepage says "Data Transfer Rate 600 Mbps"
- 4 x Intel SSD DC S3700 (https://ark.intel.com/products/71916/Intel-SSD-DC-S3700-Series-800GB-2_5in-SATA-6Gbs-25nm-MLC)
When and where did you get those?
I wonder if they're available again, had 0 luck getting any last year.
It's actually disks that we have had "lying around", no clue where you
could get them today.
Consider yourself lucky.
Post by Patrik Martinsson
Post by Christian Balzer
Post by Patrik Martinsson
- 3 HDD's, which is uninteresting here. At the moment I'm only
interested in the performance of the SSD-pool.
Ceph-cluster is created with ceph-ansible with "default params" (ie.
have not added / changed anything except the necessary).
When ceph-cluster is up, we have 54 OSD's (36 SSD, 18HDD).
The min_size is 3 on the pool.
Any reason for that?
It will make any OSD failure result in a cluster lockup with a size of 3.
Unless you did set your size to 4, in which case you wrecked
performance.
Hm, sorry, what I meant was size=3. Reading the documentation, I'm not
sure I understand the difference between size and min_size.
Check the archives for this, lots of pertinent and moderately recent
discussions about this. 3 and 2 (defaults) are fine for most people.
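
For completeness, both are per-pool settings; with size=3 and min_size=2
the pool keeps serving I/O with one replica down and only blocks once
two are gone:

$ > ceph osd pool get ssd-bench size
$ > ceph osd pool get ssd-bench min_size
$ > ceph osd pool set ssd-bench min_size 2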
Post by Patrik Martinsson
Post by Christian Balzer
Post by Patrik Martinsson
Rules are created as follows,
$ > ceph osd crush rule create-replicated ssd-rule default host ssd
$ > ceph osd crush rule create-replicated hdd-rule default host hdd
Testing is done on a separate node (same nic and network though),
$ > ceph osd pool create ssd-bench 512 512 replicated ssd-rule
$ > ceph osd pool application enable ssd-bench rbd
$ > rbd create ssd-image --size 1T --pool ssd-pool
$ > rbd map ssd-image --pool ssd-bench
$ > mkfs.xfs /dev/rbd/ssd-bench/ssd-image
$ > mount /dev/rbd/ssd-bench/ssd-image /ssd-bench
Unless you're planning on using the Ceph cluster in this fashion (kernel
mounted images), you'd be better off testing in an environment that
matches the use case, i.e. from a VM.
Gotcha, thanks!
Post by Christian Balzer
Post by Patrik Martinsson
Fio is then run like this,
$ >
actions="read randread write randwrite"
blocksizes="4k 128k 8m"
tmp_dir="/tmp/"
for blocksize in ${blocksizes}; do
for action in ${actions}; do
rm -f ${tmp_dir}${action}_${blocksize}_${suffix}
fio --directory=/ssd-bench \
--time_based \
--direct=1 \
--rw=${action} \
--bs=$blocksize \
--size=1G \
--numjobs=100 \
--runtime=120 \
--group_reporting \
--name=testfile \
--output=${tmp_dir}${action}_${blocksize}_${suffix}
done
done
After running this, we end up with these numbers
read_4k iops : 159266 throughput : 622 MB / sec
randread_4k iops : 151887 throughput : 593 MB / sec
These are very nice numbers.
Too nice, in my book.
I have a test cluster with a cache-tier based on 2 nodes with 3 DC S3610s
400GB each, obviously with size 2 and min_size=1. So just based on that,
it will be faster than a size 3 pool, Jewel with Filestore.
Network is IPoIB (40Gb), so in that aspect similar to yours,
64k MTU though.
I've run the following fio (with different rw actions of course) from a
KVM/qemu VM and am also showing how busy the SSDs, OSD processes, qemu
"fio --size=4G --ioengine=libaio --invalidate=1 --direct=1 --
numjobs=1
--rw=read --name=fiojob --blocksize=4K --iodepth=64"
READ
read : io=4096.0MB, bw=81361KB/s, iops=20340, runt= 51552msec
SSDs: 0% (pagecache on the nodes), OSDs: 45%, qemu: 330%, fio_in_VM: 19%
RANDREAD
read : io=4096.0MB, bw=62760KB/s, iops=15689, runt= 66831msec
SSDs: 0% (pagecache on the nodes), OSDs: 50%, qemu: 550%!!,
fio_in_VM: 23%
WRITE
write: io=4096.0MB, bw=256972KB/s, iops=64243, runt= 16322msec
SSDs: 40%, OSDs: 20%, qemu: 150%, fio_in_VM: 45%
RANDWRITE
write: io=4096.0MB, bw=43981KB/s, iops=10995, runt= 95366msec
SSDs: 38%, OSDs: 250%!!, qemu: 480%, fio_in_VM: 23%
Note especially the OSD CPU usage in the randwrite fio, this is where
faster (and non-powersaving mode) CPUs will be significant.
I'm not seeing the same level of performance reductions with rand actions
in your results.
We can roughly compare the reads as the SSDs and pool size play little to
no part in it.
20k *6 (to compensate for your OSD numbers) is 120k, definitely the same
ball park as your 158k.
It doesn't explain the 282k with your old setup, unless the MTU is really
so significant (see below) or other things changed, like more
Thanks for all that- makes sense. I'm not sure I will dig so much
deeper into why I got those numbers to begin with - it is a bit
annoying though, but since we have little knowledge about the disks and
that previous setup, its impossible to compare (since as you say, in
the beginning "myriad of variables will make for a myriad of results").
[snip]
Post by Patrik Martinsson
Post by Christian Balzer
Post by Patrik Martinsson
Ehm, in this new setup we are running with MTU 1500, I think we had the
POC to 9000, but the difference on the read_4k is roughly 400 MB/sec
and I wonder if the MTU will make up for that.
You're in the best position of everybody here to verify this by changing
your test cluster to use the other MTU and compare...
Yes, we will do some more benchmarks and monitor the results.
Post by Christian Balzer
Post by Patrik Martinsson
Is the above a good way of measuring our cluster, or is it better more
reliable ways of measuring it ?
See above.
A fio test is definitely a closer thing to reality compared to OSD or
RADOS benches.
Post by Patrik Martinsson
Is there a way to calculate this "theoretically" (ie with with 6 nodes
and 36 SSD's we should get these numbers) and then compare it to the
reality. Again, not a storage guy and haven't really done this before
so please excuse me for my laymen terms.
People have tried in the past and AFAIR nothing really conclusive came
about, it really is a game of too many variables.
Again, thanks for everything, nicely explained.
Not sure if it could be of interest for anyone, but I took a screenshot
of our fio-diagrams generated in confluence and put it here:
https://imgur.com/a/PaMLg
Basically the only interesting bars are the yellow (test #3) and the
purple (test #4), as those are the ones where I actually know the exact
configuration.
Interesting to see that enabling the raid controller and putting all
disks in raid 0 (disk cache disabled) vs. pass through yielded a quite
big enhance in the write128k/write8m and randwrite128k/randwrite8m
areas. Whereas in the other areas, there aren't that much of a
difference. So in my opinion, raid 0 would be the way to go - however I
see some different opinions about this. Red Hat talks about this in the
following pdf:
https://www.redhat.com/cms/managed-files/st-rhcs-config-guide-technology-detail-inc0387897-201604-en.pdf
Any thoughts about this ?
If the controller cache is sizable it will of course help, and if you're
willing to live with the drawbacks you already discovered (SMART is often
also a PITA going through the controller) then it is indeed preferable.
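
For what it's worth, SMART data can usually still be read through a
PERC/MegaRAID controller with smartmontools, the N in megaraid,N being
the controller's device id for that drive:

$ > smartctl -a -d megaraid,0 /dev/sda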

Always keep in mind that this is masking the real device speeds though,
meaning that once the cache is overwhelmed it is back to the "slow"
speeds. Also, a failing battery backup unit will disable the cache and
leave you wondering why your machine suddenly got so slow.

Regards,

Christian
Post by Patrik Martinsson
Putting disks through the raid-controller also messes up the automatic
type-classification that ceph does, which is annoying - but
"workaroundable". As I understand it, ceph determines the disk class by
by looking at the value in /sys/block/<disk>/queue/rotational (1 is
hdd, 0 ssd). This value gets set correctly when using "pass through
(non-raid)", but when using raid 0, this gets set to 1 even though its
ssd's.
We workaround this by using the following udev-rule, where "sd[a-c]"
would be the ssd-disks.
$ > echo 'ACTION=="add|change", KERNEL=="sd[a-c]",
ATTR{queue/rotational}="0"' >> /etc/udev/rules.d/10-ssd-
persistent.rules
Maybe this has been mentioned, but I'm curious on why this happens,
anyone knows ?
Again, thanks for all the great work.
Best regards,
Patrik
Sweden
--
Christian Balzer Network/Systems Engineer
***@gol.com Rakuten Communications