Discussion:
[ceph-users] Slow rbd reads (fast writes) with luminous + bluestore
Emmanuel Lacour
2018-08-13 12:56:39 UTC
Dear ceph users,


I set up a new cluster:

- Debian stretch
- ceph 12.2.7
- 3 nodes with mixed mon/osd
- 4 HDD (4TB) OSDs per node
- 2 SSDs per node, shared among the OSDs for db/wal
- each OSD alone in a RAID0 volume with WriteBack cache

Inside a VM I get really good writes (200 MB/s, 5k IOPS for direct 4K random
writes), but with random reads the device sits at 100% I/O wait with only
~150 IOPS at an average size of 128K.

I tried the same workload using fio on an rbd volume, with the same results :(
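
For reference, the fio job I used inside the VM was along these lines (a
rough sketch; the exact flags are reconstructed from the output further down
the thread, and the test file path is a placeholder):

  fio --name=file1 --rw=randread --bs=4k --ioengine=libaio --iodepth=16 \
      --direct=1 --size=2G --runtime=20 --filename=/path/to/testfile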

I played with the VM read_ahead setting without any change. I also disabled
most of the ceph debug logging, no change.

Any hints to solve this?

Here is the ceph.conf used:
https://owncloud.home-dn.net/index.php/s/swZsgeFGF2ZfPB2
Jason Dillaman
2018-08-13 13:21:07 UTC
Is this a clean (new) cluster and RBD image you are using for your test or
has it been burned in? When possible (i.e. it has enough free space),
bluestore will essentially turn your random RBD image writes into
sequential writes. This optimization doesn't work for random reads unless
your read pattern matches your original random write pattern.

Note that with the default "stupid" allocator, this optimization will at
some point hit a massive performance cliff because the allocator will
aggressively try to re-use free slots that best match the IO size, even if
that means it will require massive seeking around the disk. Hopefully the
"bitmap" allocator will address this issue once it becomes the stable
default in a future release of Ceph.
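
If you want to experiment with it before then (on a test cluster only), the
switch is a single OSD option; a hypothetical snippet, assuming your build
exposes the bluestore_allocator setting:

  [osd]
  bluestore allocator = bitmap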
Post by Emmanuel Lacour
Dear ceph users,
- Debian stretch
- ceph 12.2.7
- 3 nodes with mixed mon/osd
- 4 hdd 4TB osd per nodes
- 2 SSDs per nodes shared among osds for db/wal
- each OSD alone in a raid0+WriteBack
Inside a VM I get really good writes(200MB/s, 5k iops for direct 4K rand
writes), but with rand reads, device is 100% io wait with only ~150 IOPS
of avg size 128K.
I tried same workload using fio on rbd volume, same results :(
I played with VM read_ahead without any changes. I also disable most of
ceph debug, no change.
Any hints to solve this?
https://owncloud.home-dn.net/index.php/s/swZsgeFGF2ZfPB2
--
Jason
Emmanuel Lacour
2018-08-13 13:32:30 UTC
Post by Jason Dillaman
Is this a clean (new) cluster and RBD image you are using for your
test or has it been burned in? When possible (i.e. it has enough free
space), bluestore will essentially turn your random RBD image writes
into sequential writes. This optimization doesn't work for random
reads unless your read patterns matches your original random write
pattern.
The cluster is new but already hosts some VM images. It is not yet used in
production, but it already has data and has seen writes and reads.
Post by Jason Dillaman
Note that with the default "stupid" allocator, this optimization will
at some point hit a massive performance cliff because the allocator
will aggressively try to re-use free slots that best match the IO
size, even if that means it will require massive seeking around the
disk. Hopefully the "bitmap" allocator will address this issue once it
becomes the stable default in a future release of Ceph.
Well, but surely not as bad as what I see here:

New cluster
=======


file1: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=16
fio-2.16
Starting 1 process
file1: Laying out IO file(s) (1 file(s) / 2048MB)
Jobs: 1 (f=1): [r(1)] [100.0% done] [876KB/0KB/0KB /s] [219/0/0 iops]
[eta 00m:00s]
file1: (groupid=0, jobs=1): err= 0: pid=3289045: Mon Aug 13 14:58:22 2018
  read : io=16072KB, bw=822516B/s, iops=200, runt= 20009msec

An old cluster with less disks and older hardware, running ceph hammer
============================================

file1: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=16
fio-2.16
Starting 1 process
file1: Laying out IO file(s) (1 file(s) / 2048MB)
Jobs: 1 (f=0): [f(1)] [100.0% done] [6350KB/0KB/0KB /s] [1587/0/0 iops]
[eta 00m:00s]
file1: (groupid=0, jobs=1): err= 0: pid=15596: Mon Aug 13 14:59:22 2018
  read : io=112540KB, bw=5626.8KB/s, iops=1406, runt= 20001msec



So around 7 times fewer IOPS :(

When using rados bench, the new cluster shows better results:

New:

Total time run:       10.080886
Total reads made:     3724
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1477.65
Average IOPS:         369
Stddev IOPS:          59
Max IOPS:             451
Min IOPS:             279
Average Latency(s):   0.0427141
Max latency(s):       0.320013
Min latency(s):       0.00142682


Old:

Total time run:       10.276202
Total reads made:     724
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   281.816
Average IOPS:         70
Stddev IOPS:          5
Max IOPS:             76
Min IOPS:             59
Average Latency(s):   0.226087
Max latency(s):       0.981571
Min latency(s):       0.00343391


So the problem seems to be on the "rbd" side ...
Jason Dillaman
2018-08-13 13:55:38 UTC
Post by Emmanuel Lacour
Post by Jason Dillaman
Is this a clean (new) cluster and RBD image you are using for your
test or has it been burned in? When possible (i.e. it has enough free
space), bluestore will essentially turn your random RBD image writes
into sequential writes. This optimization doesn't work for random
reads unless your read patterns matches your original random write
pattern.
Cluster is a new one but already hosts some VM images, not yet used on
production, but already has data and had writes/reads.
Post by Jason Dillaman
Note that with the default "stupid" allocator, this optimization will
at some point hit a massive performance cliff because the allocator
will aggressively try to re-use free slots that best match the IO
size, even if that means it will require massive seeking around the
disk. Hopefully the "bitmap" allocator will address this issue once it
becomes the stable default in a future release of Ceph.
New cluster
=======
file1: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=16
fio-2.16
Starting 1 process
file1: Laying out IO file(s) (1 file(s) / 2048MB)
Jobs: 1 (f=1): [r(1)] [100.0% done] [876KB/0KB/0KB /s] [219/0/0 iops]
[eta 00m:00s]
file1: (groupid=0, jobs=1): err= 0: pid=3289045: Mon Aug 13 14:58:22 2018
read : io=16072KB, bw=822516B/s, iops=200, runt= 20009msec
An old cluster with less disks and older hardware, running ceph hammer
============================================
file1: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=16
fio-2.16
Starting 1 process
file1: Laying out IO file(s) (1 file(s) / 2048MB)
Jobs: 1 (f=0): [f(1)] [100.0% done] [6350KB/0KB/0KB /s] [1587/0/0 iops]
[eta 00m:00s]
file1: (groupid=0, jobs=1): err= 0: pid=15596: Mon Aug 13 14:59:22 2018
read : io=112540KB, bw=5626.8KB/s, iops=1406, runt= 20001msec
So around 7 times less iops ::(
Total time run: 10.080886
Total reads made: 3724
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1477.65
Average IOPS: 369
Stddev IOPS: 59
Max IOPS: 451
Min IOPS: 279
Average Latency(s): 0.0427141
Max latency(s): 0.320013
Min latency(s): 0.00142682
Total time run: 10.276202
Total reads made: 724
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 281.816
Average IOPS: 70
Stddev IOPS: 5
Max IOPS: 76
Min IOPS: 59
Average Latency(s): 0.226087
Max latency(s): 0.981571
Min latency(s): 0.00343391
so problem seems located on "rbd" side ...
That's a pretty big apples-to-oranges comparison (4KiB random IO vs 4MiB
full-object IO). With your RBD workload, the OSDs will be seeking after
each 4KiB read, but with your RADOS bench workload they read a full 4MiB
object before seeking.
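
For a closer apples-to-apples RADOS comparison you could write small objects
and random-read those back instead, roughly like this (pool name, duration
and thread count are placeholders):

  rados -p <pool> bench 60 write -b 4096 -t 16 --no-cleanup
  rados -p <pool> bench 60 rand -t 16
  rados -p <pool> cleanup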
Post by Emmanuel Lacour
--
Jason
Emmanuel Lacour
2018-08-13 14:01:48 UTC
so problem seems located on "rbd" side  ...
That's a pretty big apples-to-oranges comparison (4KiB random IO to
4MiB full-object IO). With your RBD workload, the OSDs will be seeking
after each 4KiB read but w/ your RADOS bench workload, it's reading a
full 4MiB object before seeking.
Yes, you're right, but if we compare cluster to cluster, on the new cluster
rados bench is 2 times faster while the rbd fio test is 7 times slower.

That's why I suppose rbd is the problem here, but I really do not
understand how to fix it. I looked at 3 old hammer clusters and two new
luminous/bluestore clusters and those results are consistent. I do not
think ceph would have made bluestore the default over filestore in
luminous if random reads were 7 times slower ;)


(BTW: thanks for helping me Jason :) ).
Jason Dillaman
2018-08-13 14:29:52 UTC
Post by Jason Dillaman
Post by Emmanuel Lacour
so problem seems located on "rbd" side ...
That's a pretty big apples-to-oranges comparison (4KiB random IO to 4MiB
full-object IO). With your RBD workload, the OSDs will be seeking after
each 4KiB read but w/ your RADOS bench workload, it's reading a full 4MiB
object before seeking.
yes you're right, but if we compare cluster to cluser, on new cluster,
rados bench is faster (2 times) rbd fio is 7 times slower.
that's why I suppose rbd is th problem here, but I really do not
understand how to fix it. I looked at 3 old hammer cluster and two new
luminous/buestore clusters and those results are constants. I do not think
ceph decided to put bluestore
as default luminous filestore if random reads are 7 time slower ;)
For such a small benchmark (2 GiB), I wouldn't be surprised if you are
simply seeing the Filestore-backed OSDs hitting the page cache for the reads
whereas the Bluestore-backed OSDs need to actually hit the disk. Are the
two clusters similar in terms of the number of HDD-backed OSDs?
Post by Jason Dillaman
(BTW: thanks for helping me Jason :) ).
--
Jason
Emmanuel Lacour
2018-08-13 14:40:58 UTC
Post by Jason Dillaman
For such a small benchmark (2 GiB), I wouldn't be surprised if you are
not just seeing the Filestore-backed OSDs hitting the page cache for
the reads whereas the Bluestore-backed OSDs need to actually hit the
disk. Are the two clusters similar in terms of the numbers of
HDD-backed OSDs?
The new cluster has a few more OSDs and better hardware (RAID card with
cache, more memory, more CPU), and less workload.


Old:

# ceph osd tree
ID WEIGHT   TYPE NAME           UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 17.28993 root default                                         
-2  3.63998     host hyp-prs-01                                  
 0  1.81999         osd.0            up  1.00000          1.00000
 1  1.81999         osd.1            up  1.00000          1.00000
-3  3.63997     host hyp-prs-02                                  
 3  1.81999         osd.3            up  1.00000          1.00000
 2  1.81998         osd.2            up  1.00000          1.00000
-4  4.54999     host hyp-prs-03                                  
 4  1.81999         osd.4            up  1.00000          1.00000
 5  2.73000         osd.5            up  1.00000          1.00000
-5  5.45999     host hyp-prs-04                                  
 6  2.73000         osd.6            up  1.00000          1.00000
 7  2.73000         osd.7            up  1.00000          1.00000


New:

# ceph osd tree
ID CLASS WEIGHT   TYPE NAME       STATUS REWEIGHT PRI-AFF
-1       43.66919 root default                           
-3       14.55640     host osd-01                        
 0   hdd  3.63910         osd.0       up  1.00000 1.00000
 1   hdd  3.63910         osd.1       up  1.00000 1.00000
 2   hdd  3.63910         osd.2       up  1.00000 1.00000
 3   hdd  3.63910         osd.3       up  1.00000 1.00000
-5       14.55640     host osd-02                        
 4   hdd  3.63910         osd.4       up  1.00000 1.00000
 5   hdd  3.63910         osd.5       up  1.00000 1.00000
 6   hdd  3.63910         osd.6       up  1.00000 1.00000
 7   hdd  3.63910         osd.7       up  1.00000 1.00000
-7       14.55640     host osd-03                        
 8   hdd  3.63910         osd.8       up  1.00000 1.00000
 9   hdd  3.63910         osd.9       up  1.00000 1.00000
10   hdd  3.63910         osd.10      up  1.00000 1.00000
11   hdd  3.63910         osd.11      up  1.00000 1.00000


Do you mean that with bluestore there is no page cache involved?
Emmanuel Lacour
2018-08-13 14:44:30 UTC
Post by Jason Dillaman
For such a small benchmark (2 GiB), I wouldn't be surprised if you are
not just seeing the Filestore-backed OSDs hitting the page cache for
the reads whereas the Bluestore-backed OSDs need to actually hit the
disk. Are the two clusters similar in terms of the numbers of
HDD-backed OSDs?
I looked at iostat on both clusters while running fio and yes, on the new
cluster I see disk reads, but not on the old cluster, where everything comes
from the page cache.

So is there a way to simulate the page cache for bluestore, or on the rbd side?
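
On the rbd side, the only client-side knob I can think of is librbd's own
cache, which is separate from the kernel page cache and mostly helps writes
and small sequential reads rather than large random reads; a hypothetical
[client] snippet, with sizes as examples only:

  [client]
  rbd cache = true
  rbd cache size = 268435456   # 256 MB; the default is 32 MB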
Emmanuel Lacour
2018-08-13 14:58:16 UTC
Post by Emmanuel Lacour
Post by Jason Dillaman
For such a small benchmark (2 GiB), I wouldn't be surprised if you
are not just seeing the Filestore-backed OSDs hitting the page cache
for the reads whereas the Bluestore-backed OSDs need to actually hit
the disk. Are the two clusters similar in terms of the numbers of
HDD-backed OSDs?
I looked at iostat on both cluster when running fio and yes, on new
cluster I see disks reads, but not with old cluster, everything comes
from page cache.
So is there a way to simulate page cache for bluestore, or on rbd side?
It seems this has already been discussed here:
https://www.mail-archive.com/ceph-***@lists.ceph.com/msg45321.html

but one question remains ... what kind of memory tuning should we do on
bluestore to make better use of OSD memory?
Jason Dillaman
2018-08-13 14:58:51 UTC
Post by Jason Dillaman
For such a small benchmark (2 GiB), I wouldn't be surprised if you are not
just seeing the Filestore-backed OSDs hitting the page cache for the reads
whereas the Bluestore-backed OSDs need to actually hit the disk. Are the
two clusters similar in terms of the numbers of HDD-backed OSDs?
I looked at iostat on both cluster when running fio and yes, on new
cluster I see disks reads, but not with old cluster, everything comes from
page cache.
So is there a way to simulate page cache for bluestore, or on rbd side?
See [1] for ways to tweak the bluestore cache sizes. I believe that by
default, bluestore will not cache any data but instead will only attempt to
cache its key/value store and metadata. In general, however, I would think
that attempting to have bluestore cache data is just optimizing for the
test instead of actual workloads. Personally, I think it would be
more worthwhile to just run 'fio --ioengine=rbd' directly against a
pre-initialized image after you have dropped the cache on the OSD nodes.

[1]
http://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/
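
A rough sketch of that procedure (pool/image names are placeholders; the
image should have been fully written beforehand):

  # on each OSD node, drop the kernel page cache first
  sync && echo 3 > /proc/sys/vm/drop_caches

  # then benchmark through librbd from a client
  fio --name=rbdread --ioengine=rbd --clientname=admin --pool=rbd \
      --rbdname=testimage --rw=randread --bs=4k --iodepth=16 \
      --runtime=60 --time_based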
--
Jason
Emmanuel Lacour
2018-08-14 13:57:35 UTC
Post by Jason Dillaman
See [1] for ways to tweak the bluestore cache sizes. I believe that by
default, bluestore will not cache any data but instead will only
attempt to cache its key/value store and metadata.
I suppose so too, because the default ratio caches as much k/v as possible
(up to 512M) and the HDD cache size is 1G by default.

I tried increasing the HDD cache to 4G and it seems to be used; the 4 OSD
processes now use 20GB.
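
For reference, that change amounts to something like this in ceph.conf (a
sketch; the value is in bytes):

  [osd]
  bluestore cache size hdd = 4294967296   # 4 GB; the default is 1 GB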
Post by Jason Dillaman
In general, however, I would think that attempting to have bluestore
cache data is just an attempt to optimize to the test instead of
actual workloads. Personally, I think it would be more worthwhile to
just run 'fio --ioengine=rbd' directly against a pre-initialized image
after you have dropped the cache on the OSD nodes.
So with bluestore, I assume we need to think more about the client page
cache (at least when using a VM), whereas with the old filestore both the
OSD and client caches were used.

As for benchmarking, I ran a real benchmark here with the expected
application workload of this new cluster, and the results are OK for us :)


Thanks for your help Jason.
Florian Haas
2018-11-28 14:36:59 UTC
Post by Emmanuel Lacour
Post by Jason Dillaman
See [1] for ways to tweak the bluestore cache sizes. I believe that by
default, bluestore will not cache any data but instead will only
attempt to cache its key/value store and metadata.
I suppose too because default ratio is to cache as much as possible k/v
up to 512M and hdd cache is 1G by default.
I tried to increase hdd cache up to 4G and it seems to be used, 4 osd
processes uses 20GB now.
Post by Jason Dillaman
In general, however, I would think that attempting to have bluestore
cache data is just an attempt to optimize to the test instead of
actual workloads. Personally, I think it would be more worthwhile to
just run 'fio --ioengine=rbd' directly against a pre-initialized image
after you have dropped the cache on the OSD nodes.
So with bluestore, I assume that we need to think more of client page
cache (at least when using a VM)  when with old filestore both osd and
client cache where used.
 
For benchmark, I did real benchmark here for the expected app workload
of this new cluster and it's ok for us :)
Thanks for your help Jason.
Shifting over a discussion from IRC and taking the liberty to resurrect
an old thread, as I just ran into the same (?) issue. I see
*significantly* reduced performance on RBD reads, compared to writes
with the same parameters. "rbd bench --io-type read" gives me 8K IOPS
(with the default 4K I/O size), whereas "rbd bench --io-type write"
produces more than twice that.
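
Roughly, the invocations were of this shape (pool/image are placeholders;
only the 4K I/O size matters for the numbers above):

  rbd bench --io-type write --io-size 4096 --io-threads 16 \
      --io-total 1G <pool>/<image>
  rbd bench --io-type read --io-size 4096 --io-threads 16 \
      --io-total 1G <pool>/<image>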

I should probably add that while my end result of doing an "rbd bench
--io-type read" is about half of what I get from a write benchmark, the
intermediate ops/sec output fluctuates from > 30K IOPS (about twice the
write IOPS) to about 3K IOPS (about 1/6 of what I get for writes). So
really, my read IOPS are all over the map (and terrible on average),
whereas my write IOPS are not stellar, but consistent.

This is an all-bluestore cluster on spinning disks with Luminous, and
I've tried the following things:

- run rbd bench with --rbd_readahead_disable_after_bytes=0 and
--rbd_readahead_max_bytes=4194304 (per
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008271.html)

- configure OSDs with a larger bluestore_cache_size_hdd (4G; default is 1G)

- configure OSDs with bluestore_cache_kv_ratio = .49, so that rather
than using 1%/99%/0% for metadata/KV data/objects, the OSDs use 1%/49%/50%

None of the above produced any tangible improvement. Benchmark results
are at http://paste.openstack.org/show/736314/ if anyone wants to take a
look.
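
For completeness, the readahead variant from the first item above was
invoked roughly like this (config overrides passed directly to the rbd CLI;
pool/image are placeholders):

  rbd bench --io-type read --io-size 4096 \
      --rbd_readahead_disable_after_bytes=0 \
      --rbd_readahead_max_bytes=4194304 <pool>/<image>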

I'd be curious to see if anyone has a suggestion on what else to try.
Thanks in advance!

Cheers,
Florian
Mark Nelson
2018-11-28 14:52:03 UTC
Post by Florian Haas
Post by Emmanuel Lacour
Post by Jason Dillaman
See [1] for ways to tweak the bluestore cache sizes. I believe that by
default, bluestore will not cache any data but instead will only
attempt to cache its key/value store and metadata.
I suppose too because default ratio is to cache as much as possible k/v
up to 512M and hdd cache is 1G by default.
I tried to increase hdd cache up to 4G and it seems to be used, 4 osd
processes uses 20GB now.
Post by Jason Dillaman
In general, however, I would think that attempting to have bluestore
cache data is just an attempt to optimize to the test instead of
actual workloads. Personally, I think it would be more worthwhile to
just run 'fio --ioengine=rbd' directly against a pre-initialized image
after you have dropped the cache on the OSD nodes.
So with bluestore, I assume that we need to think more of client page
cache (at least when using a VM)  when with old filestore both osd and
client cache where used.
For benchmark, I did real benchmark here for the expected app workload
of this new cluster and it's ok for us :)
Thanks for your help Jason.
Shifting over a discussion from IRC and taking the liberty to resurrect
an old thread, as I just ran into the same (?) issue. I see
*significantly* reduced performance on RBD reads, compared to writes
with the same parameters. "rbd bench --io-type read" gives me 8K IOPS
(with the default 4K I/O size), whereas "rbd bench --io-type write"
produces more than twice that.
I should probably add that while my end result of doing an "rbd bench
--io-type read" is about half of what I get from a write benchmark, the
intermediate ops/sec output fluctuates from > 30K IOPS (about twice the
write IOPS) to about 3K IOPS (about 1/6 of what I get for writes). So
really, my read IOPS are all over the map (and terrible on average),
whereas my write IOPS are not stellar, but consistent.
This is an all-bluestore cluster on spinning disks with Luminous, and
- run rbd bench with --rbd_readahead_disable_after_bytes=0 and
--rbd_readahead_max_bytes=4194304 (per
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008271.html)
- configure OSDs with a larger bluestore_cache_size_hdd (4G; default is 1G)
- configure OSDs with bluestore_cache_kv_ratio = .49, so that rather
than using 1%/99%/0% for metadata/KV data/objects, the OSDs use 1%/49%/50%
None of the above produced any tangible improvement. Benchmark results
are at http://paste.openstack.org/show/736314/ if anyone wants to take a
look.
I'd be curious to see if anyone has a suggestion on what else to try.
Thanks in advance!
Hi Florian,


By default bluestore will cache buffers on reads but not on writes
(unless there are hints):


Option("bluestore_default_buffered_read", Option::TYPE_BOOL,
Option::LEVEL_ADVANCED)
    .set_default(true)
    .set_flag(Option::FLAG_RUNTIME)
    .set_description("Cache read results by default (unless hinted
NOCACHE or WONTNEED)"),

    Option("bluestore_default_buffered_write", Option::TYPE_BOOL,
Option::LEVEL_ADVANCED)
    .set_default(false)
    .set_flag(Option::FLAG_RUNTIME)
    .set_description("Cache writes by default (unless hinted NOCACHE or
WONTNEED)"),


This is one area where bluestore is a lot more confusing for users than
filestore was.  There was a lot of concern about enabling the buffer cache
on writes by default because there's some associated overhead
(potentially both during writes and in the mempool thread when trimming
the cache).  It might be worth enabling bluestore_default_buffered_write
and seeing if it helps reads.  You'll probably also want to pay attention
to writes, though.  I think we might want to consider enabling it by
default, but we should go through and do a lot of careful testing first.
FWIW I did have it enabled when testing the new memory target code (and
the not-yet-merged age-binned autotuning).  It was doing OK in my tests,
but I didn't do an apples-to-apples comparison with it off.
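
If you want to give it a try, a minimal sketch (the option is marked
FLAG_RUNTIME above, so it should be injectable on a live cluster; persist it
in ceph.conf if you decide to keep it):

  # runtime, all OSDs:
  ceph tell osd.* injectargs '--bluestore_default_buffered_write=true'

  # persistent, in the [osd] section of ceph.conf:
  bluestore default buffered write = true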


Mark
Post by Florian Haas
Cheers,
Florian
Florian Haas
2018-11-28 15:53:20 UTC
Post by Mark Nelson
Post by Florian Haas
Shifting over a discussion from IRC and taking the liberty to resurrect
an old thread, as I just ran into the same (?) issue. I see
*significantly* reduced performance on RBD reads, compared to writes
with the same parameters. "rbd bench --io-type read" gives me 8K IOPS
(with the default 4K I/O size), whereas "rbd bench --io-type write"
produces more than twice that.
I should probably add that while my end result of doing an "rbd bench
--io-type read" is about half of what I get from a write benchmark, the
intermediate ops/sec output fluctuates from > 30K IOPS (about twice the
write IOPS) to about 3K IOPS (about 1/6 of what I get for writes). So
really, my read IOPS are all over the map (and terrible on average),
whereas my write IOPS are not stellar, but consistent.
This is an all-bluestore cluster on spinning disks with Luminous, and
- run rbd bench with --rbd_readahead_disable_after_bytes=0 and
--rbd_readahead_max_bytes=4194304 (per
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008271.html)
- configure OSDs with a larger bluestore_cache_size_hdd (4G; default is 1G)
- configure OSDs with bluestore_cache_kv_ratio = .49, so that rather
than using 1%/99%/0% for metadata/KV data/objects, the OSDs use 1%/49%/50%
None of the above produced any tangible improvement. Benchmark results
are at http://paste.openstack.org/show/736314/ if anyone wants to take a
look.
I'd be curious to see if anyone has a suggestion on what else to try.
Thanks in advance!
Hi Florian,
Hi Mark, thanks for the speedy reply!
Post by Mark Nelson
By default bluestore will cache buffers on reads but not on writes
Option("bluestore_default_buffered_read", Option::TYPE_BOOL,
Option::LEVEL_ADVANCED)
    .set_default(true)
    .set_flag(Option::FLAG_RUNTIME)
    .set_description("Cache read results by default (unless hinted
NOCACHE or WONTNEED)"),
    Option("bluestore_default_buffered_write", Option::TYPE_BOOL,
Option::LEVEL_ADVANCED)
    .set_default(false)
    .set_flag(Option::FLAG_RUNTIME)
    .set_description("Cache writes by default (unless hinted NOCACHE or
WONTNEED)"),
This is one area where bluestore is a lot more confusing for users that
filestore was.  There was a lot of concern about enabling buffer cache
on writes by default because there's some associated overhead
(potentially both during writes and in the mempool thread when trimming
the cache).  It might be worth enabling bluestore_default_buffered_write
and see if it helps reads.
So yes this is rather counterintuitive, but I happily gave it a shot and
the results are... more head-scratching than before. :)

The output is here: http://paste.openstack.org/show/736324/

In summary:

1. Write benchmark is in the same ballpark as before (good).

2. Read benchmark *without* readahead is *way* better than before
(splendid!) but has a weird dip down to 9K IOPS that I find
inexplicable. Any ideas on that?

3. Read benchmark *with* readahead is still abysmal, which I also find
rather odd. What do you think about that one?

4. Rerunning the benchmark without readahead is slow at first and then
speeds up to where it was before, but is not nearly being as consistent
even towards the end of the benchmark run.

I do much appreciate your continued insight, thanks a lot!

Cheers,
Florian
Florian Haas
2018-12-02 18:48:33 UTC
Hi Mark,

just taking the liberty to follow up on this one, as I'd really like to
get to the bottom of this.
Post by Florian Haas
Post by Mark Nelson
Option("bluestore_default_buffered_read", Option::TYPE_BOOL,
Option::LEVEL_ADVANCED)
    .set_default(true)
    .set_flag(Option::FLAG_RUNTIME)
    .set_description("Cache read results by default (unless hinted
NOCACHE or WONTNEED)"),
    Option("bluestore_default_buffered_write", Option::TYPE_BOOL,
Option::LEVEL_ADVANCED)
    .set_default(false)
    .set_flag(Option::FLAG_RUNTIME)
    .set_description("Cache writes by default (unless hinted NOCACHE or
WONTNEED)"),
This is one area where bluestore is a lot more confusing for users that
filestore was.  There was a lot of concern about enabling buffer cache
on writes by default because there's some associated overhead
(potentially both during writes and in the mempool thread when trimming
the cache).  It might be worth enabling bluestore_default_buffered_write
and see if it helps reads.
So yes this is rather counterintuitive, but I happily gave it a shot and
the results are... more head-scratching than before. :)
The output is here: http://paste.openstack.org/show/736324/
1. Write benchmark is in the same ballpark as before (good).
2. Read benchmark *without* readahead is *way* better than before
(splendid!) but has a weird dip down to 9K IOPS that I find
inexplicable. Any ideas on that?
3. Read benchmark *with* readahead is still abysmal, which I also find
rather odd. What do you think about that one?
These two still confuse me.

And in addition, I'm curious as to what you think of the approach to
configure OSDs with bluestore_cache_kv_ratio = .49, so that rather
than using 1%/99%/0% of cache memory for metadata/KV data/objects, the
OSDs use 1%/49%/50%. Is this sensible? I assume the default of not using
any memory to actually cache object data is there for a reason, but I am
struggling to grasp what that reason would be. Particularly since in
filestore, we always got in-memory object caching for free, via the page
cache.
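
For reference, the ratio change I'm describing boils down to something like
this in the [osd] section (a sketch; the remaining ~50% of the cache is then
left for object data):

  bluestore cache meta ratio = 0.01
  bluestore cache kv ratio = 0.49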

Thanks again!

Cheers,
Florian
