Discussion:
[ceph-users] Poor ceph cluster performance
Cody
2018-11-27 00:06:34 UTC
Hello,

I have a Ceph cluster deployed together with OpenStack using TripleO.
While the Ceph cluster shows a healthy status, its performance is
painfully slow. After ruling out network issues, I have zeroed in on
the Ceph cluster itself, but I have no experience in further debugging
and tuning.

The Ceph OSD part of the cluster uses 3 identical servers with the
following specifications:

CPU: 2 x E5-2603 @1.8GHz
RAM: 16GB
Network: 1G port shared for Ceph public and cluster traffic
Journaling device: 1 x 120GB SSD (SATA3, consumer grade)
OSD device: 2 x 2TB 7200rpm spindle (SATA3, consumer grade)

This is not beefy by any means, but I am only running a PoC with
minimal utilization.

Ceph-mon and ceph-mgr daemons are hosted on the OpenStack Controller
nodes. Ceph-ansible version is 3.1 and is using Filestore with
non-colocated scenario (1 SSD for every 2 OSDs). Connection speed
among Controllers, Computes, and OSD nodes can reach ~900Mbps tested
using iperf.

I followed the Red Hat Ceph 3 benchmarking procedure [1] and received
the following results:
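
For reference, the rados bench commands from that procedure look
roughly like this (same testbench pool as used for the rbd test below;
the exact runtimes/options I used may have differed):

  rados bench -p testbench 10 write --no-cleanup
  rados bench -p testbench 10 seq
  rados bench -p testbench 10 rand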

Write Test:

Total time run: 80.313004
Total writes made: 17
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 0.846687
Stddev Bandwidth: 0.320051
Max bandwidth (MB/sec): 2
Min bandwidth (MB/sec): 0
Average IOPS: 0
Stddev IOPS: 0
Max IOPS: 0
Min IOPS: 0
Average Latency(s): 66.6582
Stddev Latency(s): 15.5529
Max latency(s): 80.3122
Min latency(s): 29.7059

Sequential Read Test:

Total time run: 25.951049
Total reads made: 17
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 2.62032
Average IOPS: 0
Stddev IOPS: 0
Max IOPS: 1
Min IOPS: 0
Average Latency(s): 24.4129
Max latency(s): 25.9492
Min latency(s): 0.117732

Random Read Test:

Total time run: 66.355433
Total reads made: 46
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 2.77295
Average IOPS: 0
Stddev IOPS: 3
Max IOPS: 27
Min IOPS: 0
Average Latency(s): 21.4531
Max latency(s): 66.1885
Min latency(s): 0.0395266

Apparently, the results are pathetic...

As I moved on to test block devices, I got the following error message:

# rbd map image01 --pool testbench --name client.admin
rbd: failed to add secret 'client.admin' to kernel

Any suggestions on the above error and/or debugging would be greatly
appreciated!

Thank you very much to all.

Cody

[1] https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/administration_guide/#benchmarking_performance
Stefan Kooman
2018-11-27 07:19:44 UTC
Post by Cody
The Ceph OSD part of the cluster uses 3 identical servers with the
RAM: 16GB
Network: 1G port shared for Ceph public and cluster traffic
This will hamper throughput a lot.
Post by Cody
Journaling device: 1 x 120GB SSD (SATA3, consumer grade)
OSD device: 2 x 2TB 7200rpm spindle (SATA3, consumer grade)
OK, let's stop here first: consumer-grade SSD. Percona did a nice
writeup about "fsync" speed on consumer-grade SSDs [1]. As I don't know
what drives you use, this might or might not be the issue.
Post by Cody
This is not beefy by any means, but I am only running a PoC with
minimal utilization.
Ceph-mon and ceph-mgr daemons are hosted on the OpenStack Controller
nodes. Ceph-ansible version is 3.1 and is using Filestore with
non-colocated scenario (1 SSD for every 2 OSDs). Connection speed
among Controllers, Computes, and OSD nodes can reach ~900Mbps tested
using iperf.
Why Filestore, if I may ask? I guess Bluestore with its journal
(WAL/DB) on SSD and data on SATA should give you better performance,
if the SSDs are suitable for the job at all.

What version of Ceph are you using? Metrics can give you a lot of
insight. Did you take a look at those? For example the Ceph mgr dashboard?
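
If it's not enabled yet, something along these lines should get you the
dashboard (assuming Luminous or later with ceph-mgr running):

  ceph mgr module enable dashboard
  ceph mgr services        # prints the URL the dashboard listens on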
Post by Cody
I followed the Red Hat Ceph 3 benchmarking procedure [1] and received
Total time run: 80.313004
Total writes made: 17
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 0.846687
Stddev Bandwidth: 0.320051
Max bandwidth (MB/sec): 2
Min bandwidth (MB/sec): 0
Average IOPS: 0
Stddev IOPS: 0
Max IOPS: 0
Min IOPS: 0
Average Latency(s): 66.6582
Stddev Latency(s): 15.5529
Max latency(s): 80.3122
Min latency(s): 29.7059
Total time run: 25.951049
Total reads made: 17
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 2.62032
Average IOPS: 0
Stddev IOPS: 0
Max IOPS: 1
Min IOPS: 0
Average Latency(s): 24.4129
Max latency(s): 25.9492
Min latency(s): 0.117732
Total time run: 66.355433
Total reads made: 46
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 2.77295
Average IOPS: 0
Stddev IOPS: 3
Max IOPS: 27
Min IOPS: 0
Average Latency(s): 21.4531
Max latency(s): 66.1885
Min latency(s): 0.0395266
Apparently, the results are pathetic...
# rbd map image01 --pool testbench --name client.admin
rbd: failed to add secret 'client.admin' to kernel
What replication factor are you using?

Make sure you have the client.admin keyring on the node where you are
issuing this command. If the keyring is present where Ceph expects it,
you can omit the --name client.admin. On a monitor node you can
extract the admin keyring: ceph auth export client.admin. Put the
output of that in /etc/ceph/ceph.client.admin.keyring and this should work.
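
Roughly, assuming default paths and that the map is run on the client
node:

  # on a monitor node
  ceph auth export client.admin > /etc/ceph/ceph.client.admin.keyring
  # copy that file to /etc/ceph/ on the client, then:
  rbd map image01 --pool testbench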
Post by Cody
Any suggestions on the above error and/or debugging would be greatly
appreciated!
Gr. Stefan

[1]:
https://www.percona.com/blog/2018/07/18/why-consumer-ssd-reviews-are-useless-for-database-performance-use-case/
Post by Cody
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/administration_guide/#benchmarking_performance
--
| BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / ***@bit.nl
Darius Kasparavičius
2018-11-27 08:14:38 UTC
Hi,


Most likely the issue is with your consumer-grade journal SSD. Run
this against the SSD to check how it performs:

  fio --filename=<SSD DEVICE> --direct=1 --sync=1 --rw=write --bs=4k \
      --numjobs=1 --iodepth=1 --runtime=60 --time_based \
      --group_reporting --name=journal-test
Vitaliy Filippov
2018-11-27 11:31:25 UTC
Post by Cody
RAM: 16GB
Network: 1G port shared for Ceph public and cluster traffic
Journaling device: 1 x 120GB SSD (SATA3, consumer grade)
OSD device: 2 x 2TB 7200rpm spindle (SATA3, consumer grade)
0.84 MB/s sequential write is impossibly bad; it's not normal with any
kind of device, even on a 1G network. You probably have some kind of
problem in your setup - maybe the network RTT is very high, maybe the
osd or mon nodes are shared with other running tasks and overloaded, or
maybe your disks are already dead... :))
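
A few quick checks for those suspects (standard tools; adjust
hostnames/devices to your setup):

  ping -c 10 <osd-node>    # RTT between client and OSD nodes
  ceph osd perf            # per-OSD commit/apply latency
  iostat -x 5              # look for disks pinned at 100% util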
Post by Cody
# rbd map image01 --pool testbench --name client.admin
You don't need to map it to run benchmarks, use `fio --ioengine=rbd`
(however you'll still need /etc/ceph/ceph.client.admin.keyring)
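
Something along these lines, assuming the testbench pool and image01
used above:

  fio --ioengine=rbd --clientname=admin --pool=testbench \
      --rbdname=image01 --name=rbd-bench --rw=write --bs=4M \
      --numjobs=1 --iodepth=16 --runtime=60 --time_based \
      --group_reporting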
--
With best regards,
Vitaliy Filippov
Cody
2018-11-27 18:21:38 UTC
Hi everyone,

Many, many thanks to all of you!

The root cause was a failed OS drive on one storage node. The server
was responsive to ping, but I was unable to log in. After a reboot via
IPMI, the docker daemon failed to start due to I/O errors and dmesg
complained about the failing OS disk. I failed to catch the problem
initially since 'ceph -s' kept showing a healthy status and the cluster
remained "functional" despite the slow performance.

I really appreciate all the tips and advice received from you all and
learned a lot. I will carry your advice (e.g. using Bluestore,
enterprise SSDs/HDDs, separating public and cluster traffic, etc.) into
my next round of PoC.

Thank you very much!

Best regards,
Cody
Paul Emmerich
2018-11-27 18:53:12 UTC
And this exact problem was one of the reasons why we migrated
everything to PXE boot where the OS runs from RAM.
That kind of failure is just the worst to debug...
Also, 1 GB of RAM is cheaper than a separate OS disk.
--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
Paul Emmerich
2018-11-28 23:08:48 UTC
Post by Cody
Post by Paul Emmerich
And this exact problem was one of the reasons why we migrated
everything to PXE boot where the OS runs from RAM.
Hi Paul,
I totally agree with and admire your diskless approach. If I may ask,
what kind of OS image do you use? 1GB footprint sounds really small.
It's based on Debian, because Debian makes live boot really easy with
squashfs + overlayfs.
We also have a half-finished CentOS/RHEL-based version somewhere, but
that requires way more RAM because it doesn't use overlayfs (or didn't
when we last checked; I guess we need to check RHEL 8 again).

Current image size is 400 MB + 30 MB for the kernel + initrd, and it
comes with everything you need for Ceph. We don't even run aggressive
compression on the squashfs; it's just LZO.
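
The boot mechanics are nothing exotic; a minimal sketch of the general
squashfs + overlayfs idea (paths made up for illustration, not our
actual scripts):

  # read-only squashfs image fetched over the network
  mount -t squashfs -o loop /run/root.squashfs /mnt/ro
  # the writable layer lives in RAM
  mount -t tmpfs tmpfs /mnt/rw
  mkdir -p /mnt/rw/upper /mnt/rw/work
  # overlayfs: tmpfs on top of the read-only image becomes the root fs
  mount -t overlay overlay \
      -o lowerdir=/mnt/ro,upperdir=/mnt/rw/upper,workdir=/mnt/rw/work \
      /mnt/root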

You can test it for yourself in a VM: https://croit.io/croit-virtual-demo

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90