Discussion:
[ceph-users] Disabling write cache on SATA HDDs reduces write latency 7 times
Vitaliy Filippov
2018-11-10 21:33:30 UTC
Hi

A weird thing happens in my test cluster made from desktop hardware.

The command `for i in /dev/sd?; do hdparm -W 0 $i; done` increases
single-thread write iops (reduces latency) 7 times!

It is a 3-node cluster with Ryzen 2700 CPUs, 3x SATA 7200rpm HDDs + 1x
SATA desktop SSD for the system and ceph-mon + 1x SATA server SSD for
block.db/wal in each host. Hosts are linked by 10gbit ethernet (not the
fastest one though; average RTT according to flood-ping is 0.098ms). Ceph
and OpenNebula are installed on the same hosts, and the OSDs are prepared
with ceph-volume and bluestore with default options. The SSDs have
capacitors ('power-loss protection'), and their write cache has been
turned off from the very beginning (hdparm -W 0 /dev/sdb). They're quite
old, but each of them is capable of delivering ~22000 iops in journal
mode (fio -sync=1 -direct=1 -iodepth=1 -bs=4k -rw=write).
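(Spelled out in full, that journal-mode test is something along these
lines; the device name is only an example, and note that it overwrites
data on the device:

   fio -sync=1 -direct=1 -iodepth=1 -bs=4k -rw=write -runtime=60 \
       -name=journal-test -filename=/dev/sdb
)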

However, an RBD single-threaded random-write benchmark originally gave
awful results: when testing with `fio -ioengine=libaio -size=10G -sync=1
-direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60
-filename=./testfile` from inside a VM, the result was only 58 iops on
average (17ms latency). This was not what I expected from an HDD+SSD
setup.

But today I tried playing with the cache settings for the data disks, and
I was really surprised to discover that just disabling the HDD write
cache (hdparm -W 0 /dev/sdX for all HDD devices) increases
single-threaded performance ~7 times! The result from the same VM
(without even rebooting it) is iops=405, avg lat=2.47ms. That's an order
of magnitude faster, and in fact 2.5ms seems like sort of an expected
number.

As I understand it, 4k writes are always deferred at the default setting
of prefer_deferred_size_hdd=32768, which means they should only be
written to the journal device before the OSD acks the write operation.
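(If I'm not mistaken, the full option name is
bluestore_prefer_deferred_size_hdd; the value on a running OSD can be
checked with something like:

   ceph daemon osd.0 config get bluestore_prefer_deferred_size_hdd
)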

So my question is: WHY? Why does the HDD write cache affect commit
latency when the WAL is on an SSD?

I would also appreciate it if anybody with a similar setup (HDD+SSD with
desktop SATA controllers or an HBA) could test the same thing...
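(To reproduce, checking and toggling the cache should be as simple as
something like this, with sdX standing for each data disk:

   hdparm -W /dev/sdX     # show current write cache state
   hdparm -W 0 /dev/sdX   # disable the volatile write cache
   hdparm -W 1 /dev/sdX   # re-enable it
)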
--
With best regards,
Vitaliy Filippov
Ashley Merrick
2018-11-11 05:24:29 UTC
I've just worked out that I had the same issue; I've been trying to work
out the cause for the past few days!

However, I am using brand new enterprise Toshiba drives with 256MB write
cache, and I was seeing I/O wait peaks of 40% even during small write
operations to Ceph, with commit/apply latencies of 40ms+.

I just went through and disabled the write cache on each drive, and after
a few tests I see the exact same write performance, but I/O wait is now
under 1% and commit/apply latencies are 1-3ms max.

Something somewhere definitely doesn't seem to like the write cache being
enabled on the disks. This is an EC pool on the latest Mimic version.
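(For anyone who wants to compare, the commit/apply latencies I mean are
the per-OSD numbers, which can be watched with something like:

   ceph osd perf
)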
Ashley Merrick
2018-11-11 10:43:09 UTC
I don't have any SSDs in the cluster to test with.

Also, without knowing the exact reason why having it enabled has such a
negative effect, I wouldn't be sure whether the same would apply to SSDs.
Post by Marc Roos
Does it make sense to test disabling this on an HDD-only cluster?
Marc Roos
2018-11-11 10:41:12 UTC
Does it make sense to test disabling this on an HDD-only cluster?


Marc Roos
2018-11-11 10:55:10 UTC
I just did a very, very short test and don't see any difference with this
cache on or off, so I am leaving it on for now.





Vitaliy Filippov
2018-11-11 11:19:54 UTC
It seems not; I've just tested it on another small cluster with HDDs
only, and there was no change.
Post by Marc Roos
Does it make sense to test disabling this on an HDD-only cluster?
--
With best regards,
Vitaliy Filippov
Ashley Merrick
2018-11-11 12:46:37 UTC
Even more weird then: what drives are in the other cluster?
Post by Vitaliy Filippov
It seems not; I've just tested it on another small cluster with HDDs
only, and there was no change.
Ashley Merrick
2018-11-11 14:48:03 UTC
A mixture of Toshiba drives here, all enterprise rated, with 128-256MB
cache.

I have tried turning the write cache on and off a few times across the
cluster using hdparm, and every time I can see a huge change from on
(40ms average) to off (1-3ms average).

Vitaliy, what drives are you using? Maybe it's a particular brand /
firmware?
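(If it helps, the drive model and current cache state can be pulled per
disk with something like:

   smartctl -i /dev/sdX | grep -i model
   hdparm -I /dev/sdX | grep -i 'write cache'
)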
Post by Marc Roos
WD Red here
Виталий Филиппов
2018-11-13 08:26:45 UTC
This may be the explanation:

https://serverfault.com/questions/857271/better-performance-when-hdd-write-cache-is-disabled-hgst-ultrastar-7k6000-and

Other manufacturers may have started to do the same, I suppose.
--
With best regards,
Vitaliy Filippov
Ashley Merrick
2018-11-13 08:41:43 UTC
Looks like it, as the Toshiba drives I use seem to have their own version
of that.

So that would explain the same kind of results.
Post by Виталий Филиппов
https://serverfault.com/questions/857271/better-performance-when-hdd-write-cache-is-disabled-hgst-ultrastar-7k6000-and
Other manufacturers may have started to do the same, I suppose.
Kevin Olbrich
2018-11-13 08:47:38 UTC
I read the whole thread, and it looks like the write cache should always
be disabled, since in the worst case the performance is the same(?).
This is based on this discussion.

I will test some WD4002FYYZ which don't mention "media cache".

Kevin

Post by Виталий Филиппов
https://serverfault.com/questions/857271/better-performance-when-hdd-write-cache-is-disabled-hgst-ultrastar-7k6000-and
Other manufacturers may have started to do the same, I suppose.
Vitaliy Filippov
2018-11-11 20:20:57 UTC
Post by Ashley Merrick
Even more weird then: what drives are in the other cluster?
Desktop Toshiba and Seagate Constellation 7200rpm drives

As I understand it by now, the main impact is on SSD+HDD clusters. An
enabled HDD write cache causes the kernel to send flush requests for that
drive (when the write cache is disabled it doesn't bother with them), and
this probably affects something else and causes some extra waits for the
SSD journal (although that's strange and looks like a bug to me). I
checked the latencies in `ceph daemon osd.xx perf dump`, and both
kv_commit_lat and commit_lat decreased ~10 times when I disabled the HDD
write cache (although both are SSD-related, as I understand it).
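(A rough way to pull just those two counters out of the dump, assuming jq
is available; replace osd.xx with a real OSD id:

   ceph daemon osd.xx perf dump | jq '.bluestore | {kv_commit_lat, commit_lat}'
)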

Maybe your HDDs are connected via some RAID controller, and when you
disable the cache it doesn't really get disabled; instead the kernel just
stops issuing flush requests, which makes some writes unsafe?
--
With best regards,
Vitaliy Filippov
Marc Roos
2018-11-11 12:54:47 UTC
WD Red here