Discussion:
[ceph-users] suse_enterprise_storage3_rbd_LIO_vmware_performance_bad
mq
2016-07-01 05:04:45 UTC
Permalink
Hi list
I have tested SUSE Enterprise Storage 3 using 2 iSCSI gateways attached to VMware. The performance is bad. I have turned off VAAI following https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1033665.
My cluster:
3 Ceph nodes: 2*E5-2620, 64G mem, 2*1Gbps
(3*10K SAS, 1*480G SSD) per node, SSD as journal
1 VMware node: 2*E5-2620, 64G mem, 2*1Gbps

# ceph -s
cluster 0199f68d-a745-4da3-9670-15f2981e7a15
health HEALTH_OK
monmap e1: 3 mons at {node1=192.168.50.91:6789/0,node2=192.168.50.92:6789/0,node3=192.168.50.93:6789/0}
election epoch 22, quorum 0,1,2 node1,node2,node3
osdmap e200: 9 osds: 9 up, 9 in
flags sortbitwise
pgmap v1162: 448 pgs, 1 pools, 14337 MB data, 4935 objects
18339 MB used, 5005 GB / 5023 GB avail
448 active+clean
client io 87438 kB/s wr, 0 op/s rd, 213 op/s wr

sudo ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 4.90581 root default
-2 1.63527 host node1
0 0.54509 osd.0 up 1.00000 1.00000
1 0.54509 osd.1 up 1.00000 1.00000
2 0.54509 osd.2 up 1.00000 1.00000
-3 1.63527 host node2
3 0.54509 osd.3 up 1.00000 1.00000
4 0.54509 osd.4 up 1.00000 1.00000
5 0.54509 osd.5 up 1.00000 1.00000
-4 1.63527 host node3
6 0.54509 osd.6 up 1.00000 1.00000
7 0.54509 osd.7 up 1.00000 1.00000
8 0.54509 osd.8 up 1.00000 1.00000



A Linux VM in VMware, running fio. The 4k randwrite result is just 64 IOPS and latency is high; a dd test gives just 11 MB/s.

fio -ioengine=libaio -bs=4k -direct=1 -thread -rw=randwrite -size=100G -filename=/dev/sdb -name="EBS 4KB randwrite test" -iodepth=32 -runtime=60
EBS 4KB randwrite test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.0.13
Starting 1 thread
Jobs: 1 (f=1): [w] [100.0% done] [0K/131K/0K /s] [0 /32 /0 iops] [eta 00m:00s]
EBS 4KB randwrite test: (groupid=0, jobs=1): err= 0: pid=6766: Wed Jun 29 21:28:06 2016
write: io=15696KB, bw=264627 B/s, iops=64 , runt= 60737msec
slat (usec): min=10 , max=213 , avg=35.54, stdev=16.41
clat (msec): min=1 , max=31368 , avg=495.01, stdev=1862.52
lat (msec): min=2 , max=31368 , avg=495.04, stdev=1862.52
clat percentiles (msec):
| 1.00th=[ 7], 5.00th=[ 8], 10.00th=[ 8], 20.00th=[ 9],
| 30.00th=[ 9], 40.00th=[ 10], 50.00th=[ 198], 60.00th=[ 204],
| 70.00th=[ 208], 80.00th=[ 217], 90.00th=[ 799], 95.00th=[ 1795],
| 99.00th=[ 7177], 99.50th=[12649], 99.90th=[16712], 99.95th=[16712],
| 99.99th=[16712]
bw (KB/s) : min= 36, max=11960, per=100.00%, avg=264.77, stdev=1110.81
lat (msec) : 2=0.03%, 4=0.23%, 10=40.93%, 20=0.48%, 50=0.03%
lat (msec) : 100=0.08%, 250=39.55%, 500=5.63%, 750=2.91%, 1000=1.35%
lat (msec) : 2000=4.03%, >=2000=4.77%
cpu : usr=0.02%, sys=0.22%, ctx=2973, majf=0, minf=18446744073709538907
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=0.4%, 32=99.2%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=3924/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
WRITE: io=15696KB, aggrb=258KB/s, minb=258KB/s, maxb=258KB/s, mint=60737msec, maxt=60737msec

Disk stats (read/write):
sdb: ios=83/3921, merge=0/0, ticks=60/1903085, in_queue=1931694, util=100.00%

Can anyone give me some suggestions to improve the performance?

Regards

MQ
Christian Balzer
2016-07-01 08:18:19 UTC
Permalink
Hello,
Post by mq
Hi list
I have tested suse enterprise storage3 using 2 iscsi gateway attached
to vmware. The performance is bad.
First off, it's somewhat funny that you're testing the repackaged SUSE
Ceph, but asking for help here (with Ceph being owned by Red Hat).

Aside from that, you're not telling us what these 2 iSCSI gateways are
(SW, HW specs/configuration).

Having iSCSI on top of Ceph is by the very nature of things going to be
slower than native Ceph.

Use "rbd bench" or a VM client with RBD to get a base number of what your
Ceph cluster is capable of, this will help identifying where the slowdown
is.
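For example, something like this from one of the cluster nodes (the image name is an example, and the bench options differ a bit between Ceph releases; newer ones use "rbd bench --io-type write"):

rbd create rbd/benchvol --size 102400
rbd bench-write rbd/benchvol --io-size 4096 --io-threads 32 --io-pattern rand

Comparing that result with what the VM sees through the iSCSI gateway shows how much the gateway path itself is costing you.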
Post by mq
I have turn off VAAI following the
(https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1033665).
My cluster 3 ceph nodes :2*E5-2620 64G , mem 2*1Gbps (3*10K SAS, 1*480G
SSD) per node, SSD as journal 1 vmware node 2*E5-2620 64G , mem 2*1Gbps
That's a slow (latency wise) network, but not your problem.
What SSD model?
A 480GB size suggests a consumer model and that would explain a lot.

Check your storage nodes with atop during the fio runs and see if you can
spot a bottleneck.
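For example, on each OSD node while fio runs in the VM:

atop 2
# or, disks only:
iostat -xm 2

Look for the journal SSD or the SAS disks sitting near 100% busy/util, or the CPUs stuck in iowait.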

Christian
Post by mq
<snip>
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
mq
2016-07-01 09:28:06 UTC
Permalink
Hi,
1. The 2 SW iSCSI gateways (deployed on the OSD/monitor nodes) were created using lrbd; the iSCSI target is LIO.
configuration:
{
    "auth": [
        {
            "target": "iqn.2016-07.org.linux-iscsi.iscsi.x86:testvol",
            "authentication": "none"
        }
    ],
    "targets": [
        {
            "target": "iqn.2016-07.org.linux-iscsi.iscsi.x86:testvol",
            "hosts": [
                {
                    "host": "node2",
                    "portal": "east"
                },
                {
                    "host": "node3",
                    "portal": "west"
                }
            ]
        }
    ],
    "portals": [
        {
            "name": "east",
            "addresses": [
                "10.0.52.92"
            ]
        },
        {
            "name": "west",
            "addresses": [
                "10.0.52.93"
            ]
        }
    ],
    "pools": [
        {
            "pool": "rbd",
            "gateways": [
                {
                    "target": "iqn.2016-07.org.linux-iscsi.iscsi.x86:testvol",
                    "tpg": [
                        {
                            "image": "testvol"
                        }
                    ]
                }
            ]
        }
    ]
}

2. The Ceph cluster itself performs OK. I created an RBD image on one of the Ceph nodes and ran fio against it; the results are nice: 4K randwrite IOPS=3013, bw=100MB/s (see the sketch below for that kind of baseline test).
So I think the Ceph cluster has no bottleneck.

3. Intel S3510 SSD 480G, enterprise, not consumer.

New test: cloning a VM in VMware can reach 100MB/s, but fio and dd tests inside the VM are still poor.
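For reference, a baseline like that can be taken by mapping a scratch image with the kernel RBD client and pointing the same fio job at the block device (image name and /dev/rbd0 are examples; adjust to your pool):

rbd map rbd/benchvol
# the kernel client exposes it as e.g. /dev/rbd0
fio -ioengine=libaio -bs=4k -direct=1 -rw=randwrite -iodepth=32 -runtime=60 -filename=/dev/rbd0 -name=rbd-baseline
rbd unmap /dev/rbd0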
Post by Christian Balzer
<snip>
Lars Marowsky-Bree
2016-07-04 10:32:16 UTC
Permalink
Post by Christian Balzer
First off, it's somewhat funny that you're testing the repackaged SUSE
Ceph, but asking for help here (with Ceph being owned by Red Hat).
*cough* Ceph is not owned by RH. RH acquired the InkTank team and the
various trademarks, that's true (and, admittedly, I'm a bit envious
about that ;-), but Ceph itself is an Open Source project that is not
owned by a single company.

You may want to check out the growing contributions from other
companies and the active involvement by them in the Ceph community ;-)


Regards,
Lars
--
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
Oliver Dzombic
2016-07-01 08:27:27 UTC
Permalink
Hi,

my experience:

ceph + iscsi ( multipath ) + vmware == worst

Better you search for another solution.

ceph + nfs + vmware might have a much better performance.

--------

If you are able to get vmware to run with iscsi and ceph, I would be
>>very<< interested in what/how you did that.
--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:***@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107
Post by mq
<snip>
Nick Fisk
2016-07-01 18:11:34 UTC
Permalink
To summarise,

LIO is just not working very well at the moment because of the ABORT Tasks problem; this will hopefully be fixed at some point. I'm not sure if SUSE works around this, but see below for other pain points with RBD + ESXi + iSCSI.

TGT is easy to get going, but performance isn't the best and failover is an absolute pain as TGT won't stop if it has ongoing IO. You normally end up in a complete mess if you try and do HA, unless you can cover a number of different failure scenarios.

SCST probably works the best at the moment. Yes, you have to compile it into a new kernel, but it performs well, doesn't fall over, supports the VAAI extensions and can be configured HA in an ALUA or VIP failover modes. There might be a couple of corner cases with the ALUA mode with Active/Standby paths, with possible data corruption that need to be tested/explored.

However, there are a number of pain points with iSCSI + ESXi + RBD and they all mainly centre on write latency. It seems VMFS was designed around the fact that Enterprise storage arrays service writes in 10-100us, whereas Ceph will service them in 2-10ms.

1. Thin Provisioning makes things slow. I believe the main cause is that when growing and zeroing the new blocks, metadata needs to be updated and the block zero'd. Both issue small IO which would normally not be a problem, but with Ceph it becomes a bottleneck to overall IO on the datastore.

2. Snapshots effectively turn all IO into 64kb IO's. Again a traditional SAN will coalesce these back into a stream of larger IO's before committing to disk. However with Ceph each IO takes 2-10ms and so everything seems slow. The future feature of persistent RBD cache may go a long way to helping with this.

3. >2TB VMDK's with snapshots use a different allocation mode, which happens in 4kb chunks instead of 64kb ones. This makes the problem 16 times worse than above.

4. Any of the above will also apply when migrating machines around, so VM's can take hours/days to move.

5. If you use FILEIO, you can't use thin provisioning. If you use BLOCKIO, you get thin provisioning, but no pagecache or readahead, so performance can nose dive if this is needed.

6. iSCSI is very complicated (especially ALUA) and sensitive. Get used to seeing APD/PDL even when you think you have finally got everything working great.


Normal IO from eager-zeroed VM's with no snapshots, however, should perform ok. So it depends what your workload is.


And then comes NFS. It's very easy to setup, very easy to configure for HA, and works pretty well overall. You don't seem to get any of the IO size penalties when using snapshots. If you mount with discard, thin provisioning is done by Ceph. You can defragment the FS on the proxy node and several other things that you can't do with VMFS. Just make sure you run the server in sync mode to avoid data loss.
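A rough sketch of such an NFS proxy setup (image name, mount point and client subnet are only examples; the key bits are the discard mount option and the sync export):

rbd map rbd/nfsvol
mkfs.xfs /dev/rbd0
mkdir -p /export/datastore
# discard lets freed blocks be returned to Ceph, i.e. thin provisioning
mount -o discard /dev/rbd0 /export/datastore
echo '/export/datastore 10.0.52.0/24(rw,sync,no_root_squash)' >> /etc/exports
exportfs -ra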

The only downside is that every IO causes an IO to the FS and one to the FS journal, so you effectively double your IO. But if your Ceph backend can support it, then it shouldn't be too much of a problem.

Now to the original poster, assuming the iSCSI node is just kernel mounting the RBD, I would run iostat on it, to try and see what sort of latency you are seeing at that point. Also do the same with esxtop +u, and look at the write latency there, both whilst running the fio in the VM. This should hopefully let you see if there is just a gradual increase as you go from hop to hop or if there is an obvious culprit.

Can you also confirm your kernel version?

With 1GB networking I think you will struggle to get your write latency much below 10-15ms, but from your example ~30ms is still a bit high. I wonder if the default queue depths on your iSCSI target are too low as well?
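For example (the rbd device is whatever the gateway has mapped, and the LIO attribute below is just the first thing I'd check; verify the path against your lrbd/targetcli setup):

# on the iSCSI gateway, while fio runs in the VM - watch w_await/%util of the rbdX device:
iostat -xm 2
# on the ESXi host: run esxtop, press 'u', and compare DAVG/cmd vs KAVG/cmd for the LUN
# queue depth advertised by the LIO target portal group:
targetcli /iscsi/iqn.2016-07.org.linux-iscsi.iscsi.x86:testvol/tpg1 get attribute default_cmdsn_depth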

Nick
-----Original Message-----
Oliver Dzombic
Sent: 01 July 2016 09:27
Subject: Re: [ceph-users] suse_enterprise_storage3_rbd_LIO_vmware_performance_bad
<snip>
Lars Marowsky-Bree
2016-07-04 10:36:05 UTC
Permalink
Post by Nick Fisk
To summarise,
LIO is just not working very well at the moment because of the ABORT Tasks problem, this will hopefully be fixed at some point. I'm not sure if SUSE works around this, but see below for other pain points with RBD + ESXi + iSCSI
Yes, the SUSE kernel has recent backports that fix these bugs. And
there's obviously on-going work to improve the performance and code.

That's not to say that I'd advocate iSCSI as a primary access mechanism
for Ceph. But the need to interface from non-Linux systems to a Ceph
cluster is unfortunately very real.
Post by Nick Fisk
With 1GB networking I think you will struggle to get your write latency much below 10-15ms, but from your example ~30ms is still a bit high. I wonder if the default queue depths on your iSCSI target are too low as well?
Thanks for all the insights on the performance issues. You're really
quite spot on.

The main concern here obviously is that the same 2x1GbE network is
carrying both the client/ESX traffic, the iSCSI target to OSD traffic,
and the OSD backend traffic. That is not advisable.


Regards,
Lars
--
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
Nick Fisk
2016-07-04 17:59:10 UTC
Permalink
-----Original Message-----
Lars Marowsky-Bree
Sent: 04 July 2016 11:36
Subject: Re: [ceph-users] suse_enterprise_storage3_rbd_LIO_vmware_performance_bad
<snip>
Thanks for all the insights on the performance issues. You're really quite spot on.
Thanks, it's been a painful experience working through them all, but I have learnt a lot along the way.
Alex Gorbachev
2016-07-04 20:59:50 UTC
Permalink
Hi Nick,


On Fri, Jul 1, 2016 at 2:11 PM, Nick Fisk <***@fisk.me.uk> wrote:

<snip>
Post by Nick Fisk
However, there are a number of pain points with iSCSI + ESXi + RBD and they all mainly centre on write latency. It seems VMFS was designed around the fact that Enterprise storage arrays service writes in 10-100us, whereas Ceph will service them in 2-10ms.
1. Thin Provisioning makes things slow. I believe the main cause is that when growing and zeroing the new blocks, metadata needs to be updated and the block zero'd. Both issue small IO which would normally not be a problem, but with Ceph it becomes a bottleneck to overall IO on the datastore.
2. Snapshots effectively turn all IO into 64kb IO's. Again a traditional SAN will coalesce these back into a stream of larger IO's before committing to disk. However with Ceph each IO takes 2-10ms and so everything seems slow. The future feature of persistent RBD cache may go a long way to helping with this.
Are you referring to ESXi snapshots? Specifically, if a VM is running
off a snapshot (https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1015180),
its IO will drop to 64KB "grains"?
Post by Nick Fisk
3. >2TB VMDK's with snapshots use a different allocation mode, which happens in 4kb chunks instead of 64kb ones. This makes the problem 16 times worse than above.
4. Any of the above will also apply when migrating machines around, so VM's can takes hours/days to move.
5. If you use FILEIO, you can't use thin provisioning. If you use BLOCKIO, you get thin provisioning, but no pagecache or readahead, so performance can nose dive if this is needed.
Would not FILEIO also leverage the Linux scheduler to do IO coalescing
and help with (2) ? Since FILEIO also uses the dirty flush mechanism
in page cache (and makes IO somewhat crash-unsafe at the same time).
Post by Nick Fisk
6. iSCSI is very complicated (especially ALUA) and sensitive. Get used to seeing APD/PDL even when you think you have finally got everything working great.
We were used to seeing APD/PDL all the time with LIO, but pretty much
have not seen any with SCST > 3.1. Most of the ESXi problems are just
with high latency periods, which are not a problem for the hypervisor
itself, but rather for the databases or applications inside the VMs.

Thanks,
Alex
Post by Nick Fisk
<snip>
Nick Fisk
2016-07-05 08:54:12 UTC
Permalink
-----Original Message-----
Sent: 04 July 2016 22:00
Subject: Re: [ceph-users] suse_enterprise_storage3_rbd_LIO_vmware_performance_bad
HI Nick,
<snip>
Post by Nick Fisk
However, there are a number of pain points with iSCSI + ESXi + RBD and
they all mainly centre on write latency. It seems VMFS was designed around
the fact that Enterprise storage arrays service writes in 10-100us, whereas
Ceph will service them in 2-10ms.
Post by Nick Fisk
1. Thin Provisioning makes things slow. I believe the main cause is that
when growing and zeroing the new blocks, metadata needs to be updated
and the block zero'd. Both issue small IO which would normally not be a
problem, but with Ceph it becomes a bottleneck to overall IO on the
datastore.
Post by Nick Fisk
2. Snapshots effectively turn all IO into 64kb IO's. Again a traditional SAN
will coalesce these back into a stream of larger IO's before committing to
disk. However with Ceph each IO takes 2-10ms and so everything seems
slow. The future feature of persistent RBD cache may go a long way to
helping with this.
Are you referring to ESXi snapshots? Specifically, if a VM is running off a
snapshot
(https://kb.vmware.com/selfservice/microsites/search.do?language=en_US
&cmd=displayKC&externalId=1015180),
its IO will drop to 64KB "grains"?
Yep, that’s the one
Post by Nick Fisk
3. >2TB VMDK's with snapshots use a different allocation mode, which
happens in 4kb chunks instead of 64kb ones. This makes the problem 16
times worse than above.
Post by Nick Fisk
4. Any of the above will also apply when migrating machines around, so
VM's can takes hours/days to move.
Post by Nick Fisk
5. If you use FILEIO, you can't use thin provisioning. If you use BLOCKIO,
you get thin provisioning, but no pagecache or readahead, so performance
can nose dive if this is needed.
Would not FILEIO also leverage the Linux scheduler to do IO coalescing and
help with (2) ? Since FILEIO also uses the dirty flush mechanism in page cache
(and makes IO somewhat crash-unsafe at the same time).
Turning off nv_cache and enabling write_through should make this safe, but then you won't benefit from any writeback flushing.
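For reference, that corresponds to roughly the following in /etc/scst.conf (device name and backing path are made up; check the vdisk_fileio attribute names against your SCST version):

HANDLER vdisk_fileio {
        DEVICE esx_ds1 {
                filename /dev/rbd/rbd/testvol
                # do not advertise a non-volatile cache to the initiator
                nv_cache 0
                # complete writes only once they reach the backing device
                write_through 1
        }
}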
Post by Nick Fisk
6. iSCSI is very complicated (especially ALUA) and sensitive. Get used to
seeing APD/PDL even when you think you have finally got everything
working great.
We were used to seeing APD/PDL all the time with LIO, but pretty much have
not seen any with SCST > 3.1. Most of the ESXi problems are with just with
high latency periods, which are not a problem for the hypervisor itself, but
rather for the databases or applications inside VMs.
Yeah I think once you get SCST working, it's pretty stable. Certainly the best of the bunch. But I was more referring to "actually getting it working" :-)

Particularly once you start introducing pacemaker, there are so many corner cases you need to take into account, that I'm still not 100% satisfied by the stability. Eg. Spent a long time working on the resource agents to make sure all the LUNS and Targets could shut down cleanly on a node. Depending on load and number of iscsi connections, it would randomly hang and then go into a APD state. Not saying it can't work, but compared to NFS it seems a lot more complicated to get it stable.
Nick Fisk
2016-07-04 07:21:25 UTC
Permalink
-----Original Message-----
Sent: 04 July 2016 08:13
Subject: Re: [ceph-users] suse_enterprise_storage3_rbd_LIO_vmware_performance_bad
Hi Nick
I have tested NFS: since NFS cannot use Eager Zeroed Thick Provision mode, I used the default thin provisioning in vSphere.
First test: fio result: 4k randwrite iops 538, latency 59ms.
Second test: formatted the sdb, fio result: 4k randwrite iops 746, latency 48ms.
The NFS performance is half of LIO.
NFS will always have a penalty compared to VMFS on iSCSI because of the extra journal write, but as you saw in your LIO test, you have to conform to certain criteria; this may or may not be a problem.

Just one thing comes to mind though. How many NFS server threads are you running? By default I think most OS's only spin up 8, which is far too low. If you run fio at 32 depth against the defaults, you will see really low performance as IO's queue up. Try setting the NFS server threads to something like 128.
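For example, to check and raise it at runtime (128 is just the figure mentioned above; most distros also have a sysconfig/default setting to make it persistent across reboots):

cat /proc/fs/nfsd/threads
rpc.nfsd 128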

Another thing to keep in mind (as I have just been finding out): it's important to set an extent size hint on the XFS FS on the NFS server, otherwise you will get lots of fragmentation.

E.g.
xfs_io -c "extsize 16M" /mountpoint
Regards
MQ
Hi Nick
kernel v: 3.12.49-11-default
After changing the vSphere virtual disk configuration to Eager Zeroed Thick Provision mode, the performance in the VM is OK. fio result: 4k randwrite iops 1600, latency 8ms; 1M seq write bw 100MB/s. But cloning a 200G VM needs 30min.
By the way, I want to test bcache/flashcache+OSD or a cache tier; do you have any suggestions for me?
I will try NFS next day.
Regards
<snip>
Lars Marowsky-Bree
2016-07-04 10:29:52 UTC
Permalink
On 2016-07-01T13:04:45, mq <***@126.com> wrote:

Hi MQ,

perhaps the upstream list is not the best one to discuss this. SUSE
includes adjusted backports for the iSCSI functionality that upstream
does not; very few people here are going to be intimately familiar with
the code you're running. If you're evaluating SES3, you might as well
give our support team a call ;-)

That said:

First, let me start with the same others have pointed out: the iSCSI
gateway (via the LIO targets) will introduce an additional network hop
between your clients and the Ceph cluster. That's perfectly fine for
bandwidth-oriented workloads, but for latency/IOPS, it is quite
expensive. It also negates some of the benefits of Ceph (namely, that a
client can directly talk to the OSD holding the data without an
intermediary).

So, you need to check whether the iSCSI access method fits your use
case, and then the iSCSI gateways really need good network interfaces,
both facing to the clients and to the Ceph cluster (on its public
network).
Post by mq
My cluster
3 ceph nodes :2*E5-2620 64G , mem 2*1Gbps
(3*10K SAS, 1*480G SSD) per node, SSD as journal
1 vmware node 2*E5-2620 64G , mem 2*1Gbps
And here we are. 1 GbE NICs just aren't adequate for any reasonable
performance numbers. I'm assuming you're running the iSCSI GW on the
Ceph nodes, just like the MONs (since you didn't specify any additional
nodes and the node[123] names are kind of suspicious).

This environment lacks network performance. You barely have enough
network bandwidth to sustain a single one of those drives - and then add in
that you're replicating over the same NIC, and that the OSD traffic is
multiplexed on the same network as the iSCSI/client traffic.

You also lack scale out capacity - Ceph scales horizontally, but each of
your only three nodes only has 3 drives. That doesn't give Ceph a lot to
work with.
Post by mq
anyone can give me some suggestion to improve the performance ?
Yes. I'd start with ordering a lot more and faster hardware ;-) But even
then, you'll have to understand that iSCSI will not - and really,
really, cannot - deliver quite the same performance as native RBD.

So that'd make me look into replacing VMWare with an OpenStack cloud,
where you get native Ceph drivers, proper integration, and performance.

After all - if you're avoiding proprietary lock-in for the storage in
favor of Open Source / Ceph (which is a great choice!), why would you
accept this on the hypervisor/private cloud ;-)



Regards,
Lars
--
Architect SDS, Distinguished Engineer
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde