Jean-Philippe Méthot
2018-11-27 16:47:42 UTC
Hi,
We're currently pushing a Ceph Mimic cluster into production, progressively, and we've noticed fairly strange behaviour. We use Ceph as the storage backend for OpenStack block devices. We've deployed a few VMs on this backend to test the waters. These VMs are practically empty, with only the usual cPanel services running and no actual websites set up. We notice that, roughly twice in a span of five minutes, iowait jumps to ~10% with no VM-side explanation: no specific service is taking any more I/O bandwidth than usual.
I must also add that the raw speed of the cluster is excellent; it's really the stability that bothers me here. I read the jump in iowait as the VM being unable to read from or write to the Ceph cluster for a second or so. I've considered deep scrubs, but those operations seem to complete in 0.1 s, as there's practically no data to scrub.
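For reference, here is roughly what I have been running to try to catch a stall in the act (a sketch; osd.0 is just a placeholder):

```
# Stream cluster log events (scrub starts, slow-request warnings) live
ceph -w

# Per-OSD commit/apply latency, sampled during a spike
ceph osd perf

# Recently completed slow ops on one OSD, via its admin socket
# (must run on the host that carries osd.0)
ceph daemon osd.0 dump_historic_ops

# Last scrub / deep-scrub timestamps per PG, to rule scrubbing in or out
ceph pg dump pgs
```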
The cluster pool configuration is as follows:
-RBD on an erasure-coded pool (a replicated metadata pool plus an erasure-coded data pool) with overwrites enabled
-The data pool is k=6, m=2 (so size 8), with 1024 PGs
-The metadata pool is size 3, with 64 PGs
Of course, this is all running on BlueStore; the creation commands are sketched below.
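Reconstructed, the pools were created essentially like this (the pool and profile names here are illustrative, not the real ones):

```
# EC profile: k=6 data chunks + m=2 coding chunks, one chunk per host
ceph osd erasure-code-profile set ec62 k=6 m=2 crush-failure-domain=host

# Erasure-coded data pool with overwrites enabled, so RBD can write to it directly
ceph osd pool create rbd-data 1024 1024 erasure ec62
ceph osd pool set rbd-data allow_ec_overwrites true

# Replicated pool (size 3) holding the image headers and other metadata
ceph osd pool create rbd-meta 64 64 replicated
ceph osd pool set rbd-meta size 3
rbd pool init rbd-meta

# Each image keeps its metadata in the replicated pool, its data in the EC pool
rbd create rbd-meta/test-image --size 50G --data-pool rbd-data
```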
As for the hardware, the configuration is as follows:
-10 hosts
-9 OSDs per host
-Each OSD is an Intel DC S3510 SSD
-CPUs are dual E5-2680 v2 (40 threads total @ 2.8 GHz)
-Each host has 128 GB of RAM
-Network is two bonded 10 Gbps links, one for storage and one for replication
I understand that I will eventually hit a throughput wall because of either the CPUs or the network, but maximum speed is not my current concern here, and those can be upgraded when needed. I've been wondering: could these hiccups be caused by data caching at the client level? If so, what could I do to fix it? The settings I have in mind are sketched below.
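To be clear about what I mean by client-side caching: qemu/librbd keeps a writeback cache on the hypervisor side, and these are the knobs I would be poking at (a sketch of the [client] section of ceph.conf on the compute nodes; the values shown are what I believe are the defaults):

```
# /etc/ceph/ceph.conf on the hypervisors (librbd/qemu client side)
[client]
rbd cache = true                            # writeback cache, on by default
rbd cache writethrough until flush = true   # writethrough until the guest issues a flush
rbd cache size = 33554432                   # 32 MiB cache per image
rbd cache max dirty = 25165824              # dirty bytes allowed before writes block
```

My first test would be setting rbd cache = false on one hypervisor and watching whether the hiccups disappear there, unless someone has a better idea.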
Jean-Philippe Méthot
Openstack system administrator
Administrateur système Openstack
PlanetHoster inc.