[ceph-users] SSD sizing for Bluestore
Brendan Moloney
2018-11-13 02:19:57 UTC
Hi,

I have been reading up on this a bit, and found one particularly useful mailing list thread [1].

The fact that there is such a large jump when your DB fits into 3 levels (30GB) vs 4 levels (300GB) makes it hard to choose SSDs of an appropriate size. My workload is all RBD, so objects should be large, but I am also looking at purchasing rather large HDDs (12TB). It seems wasteful to spec out 300GB per OSD, but I am worried that I will barely cross the 30GB threshold when the disks get close to full.
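
My rough reading of that thread, taking RocksDB's default 10x growth per level and the oft-quoted approximate per-level sizes:

L0+L1 ~ 0.3 GB
L2    ~   3 GB   -> ~3.3 GB if the DB tops out at 2 levels
L3    ~  30 GB   -> ~33 GB  if it tops out at 3 levels
L4    ~ 300 GB   -> ~330 GB+ if it needs a 4th level

Since a level that can't fit entirely on the SSD apparently ends up spilling to the slow device, any DB partition size between those totals seems to be mostly wasted, which is where the ~30GB vs ~300GB figures come from.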

It would be nice if we could either enable "dynamic level sizing" (done here [2] for monitors, but not for bluestore?), or allow changing "max_bytes_for_level_base" to something that better suits our use case. For example, if it were set to 25% of the default (75MB L0 and L1, 750MB L2, 7.5GB L3, 75GB L4) then I could allocate ~85GB per OSD and feel confident there wouldn't be any spill over onto the slow HDDs. I am far from an expert on RocksDB, so I might be overlooking something important here.

[1] https://ceph-users.ceph.narkive.com/tGcDsnAB/slow-used-bytes-slowdb-being-used-despite-lots-of-space-free-in-blockdb-on-ssd
[2] https://tracker.ceph.com/issues/24361

Thanks,
Brendan
Igor Fedotov
2018-11-13 11:44:40 UTC
Hi Brendan

In fact you can alter the RocksDB settings via the bluestore_rocksdb_options config parameter, and hence change "max_bytes_for_level_base" and other options.

Not sure about dynamic level sizing though.


Current defaults are:

"compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152"


Thanks,
Igor
Brendan Moloney
2018-11-13 23:32:30 UTC
Hi Igor,

Thank you for that information. So I would have to reduce "write_buffer_size" to shrink the L0 size, in addition to reducing "max_bytes_for_level_base" so that the L1 size matches.
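
In other words, something along these lines is what I have in mind (just a sketch, 78643200 = 75MB, not tested):

bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=78643200,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_bytes_for_level_base=78643200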

Does anyone on the list have experience making these kinds of modifications? Or better yet some benchmarks?

I found a mailing list reference [1] saying RBD workloads need about 24KB per onode, and that the average object size is ~2.8MB. Even taking the advertised best-case throughput for an HDD, we only get ~70 objects per second, which would generate about 1.6MB/s of writes to RocksDB. If write_buffer_size were set to 75MB (25% of the default), that would take roughly 45 seconds to fill. With a more realistic number for sustained HDD write throughput, it would take well over a minute. That sounds like a rather large buffer to me...
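
To spell out that back-of-the-envelope math (the ~200MB/s advertised rate is my assumption, and on the generous side):

200 MB/s / 2.8 MB per object   ~= 70 objects/s
70 objects/s * 24 KB per onode ~= 1.6 MB/s of writes to RocksDB
75 MB buffer / 1.6 MB/s        ~= 45 seconds to fill

At a more realistic sustained rate of, say, 100MB/s it would be closer to 90 seconds.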

[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/024297.html

Brendan