Vitaliy Filippov
2018-11-10 21:33:30 UTC
Hi
A weird thing happens in my test cluster made from desktop hardware.
The command `for i in /dev/sd?; do hdparm -W 0 $i; done` increases
single-threaded write iops (i.e. reduces latency) about 7 times!
It is a 3-node cluster with Ryzen 2700 CPUs, 3x SATA 7200 rpm HDDs + 1x
SATA desktop SSD for the system and ceph-mon + 1x SATA server SSD for
block.db/WAL in each host. Hosts are linked by 10 Gbit Ethernet (not the
fastest one, though; average RTT according to flood-ping is 0.098 ms).
Ceph and OpenNebula are installed on the same hosts, and the OSDs are
prepared with ceph-volume and bluestore with default options. The SSDs
have capacitors ('power-loss protection'), and their write cache has been
turned off from the very beginning (hdparm -W 0 /dev/sdb). They're quite
old, but each of them can deliver ~22000 iops in a journal-style workload
(sync 4k writes: fio -sync=1 -direct=1 -iodepth=1 -bs=4k -rw=write).
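For reference, a complete (and destructive!) invocation of that SSD test
might look like the following -- the job name, device path and runtime are
just placeholders I'm adding here, not the exact values from my setup:

  # WARNING: rw=write against a raw device overwrites its contents.
  # /dev/sdX is a stand-in for the DB/WAL SSD.
  fio -name=ssd-journal-test -filename=/dev/sdX \
      -sync=1 -direct=1 -iodepth=1 -bs=4k -rw=write -runtime=60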
However, the RBD single-threaded random-write benchmark originally gave
awful results: when testing with `fio -ioengine=libaio -size=10G -sync=1
-direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60
-filename=./testfile` from inside a VM, the result was only 58 iops on
average (17 ms latency). That was not what I expected from an HDD+SSD
setup.
But today I tried to play with the cache settings for the data disks, and
I was really surprised to discover that just disabling the HDD write cache
(hdparm -W 0 /dev/sdX for all HDD devices) increases single-threaded
performance ~7 times! The result from the same VM (without even rebooting
it) is iops=405, avg lat=2.47 ms. That's an order of magnitude faster, and
in fact 2.5 ms is roughly the number I would expect.
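By the way, since /dev/sd? also matches the SSDs (whose cache was already
off), here is a minimal sketch for anyone who wants to reproduce just the
HDD part, assuming the usual /sys/block/*/queue/rotational flag; note that
hdparm -W is not persistent across power cycles:

  # Disable the volatile write cache only on rotational (spinning) drives,
  # leaving the SSD cache settings untouched.
  for d in /dev/sd?; do
      if [ "$(cat /sys/block/$(basename "$d")/queue/rotational)" = "1" ]; then
          hdparm -W 0 "$d"
      fi
  done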
As I understand it, 4k writes are always deferred at the default setting
of prefer_deferred_size_hdd=32768, which means they should only be written
to the journal (WAL) device before the OSD acknowledges the write.
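In case anyone wants to double-check the deferred-write path on a running
OSD, something like this should work (I'm assuming the full option name
bluestore_prefer_deferred_size_hdd and access to the OSD admin socket on
the host; adjust the OSD id):

  # query the effective deferred-write threshold
  ceph daemon osd.0 config get bluestore_prefer_deferred_size_hdd
  # check that small writes are actually counted as deferred
  ceph daemon osd.0 perf dump | grep deferred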
So my question is WHY? Why does the HDD write cache affect commit latency
when the WAL is on an SSD?
I would also appreciate it if anybody with a similar setup (HDD+SSD with
desktop SATA controllers or an HBA) could test the same thing...
--
With best regards,
Vitaliy Filippov