-----Original Message-----
Sent: 19 October 2018 08:15
Subject: RE: [ceph-users] slow_used_bytes - SlowDB being used despite lots of space free in BlockDB on SSD?
-----Original Message-----
Sent: 19 October 2018 01:03
Subject: Re: [ceph-users] slow_used_bytes - SlowDB being used despite lots of space free in BlockDB on SSD?
Post by Nick Fisk
Hi,
Ceph Version = 12.2.8
8TB spinner with 20G SSD partition
"bluefs": {
"gift_bytes": 0,
"reclaim_bytes": 0,
"db_total_bytes": 21472731136,
"db_used_bytes": 3467640832,
"wal_total_bytes": 0,
"wal_used_bytes": 0,
"slow_total_bytes": 320063143936,
"slow_used_bytes": 4546625536,
"num_files": 124,
"log_bytes": 11833344,
"log_compactions": 4,
"logged_bytes": 316227584,
"files_written_wal": 2,
"files_written_sst": 4375,
"bytes_written_wal": 204427489105,
"bytes_written_sst": 248223463173
Am I reading that correctly, about 3.4GB used out of 20GB on the SSD, yet 4.5GB of DB is stored on the spinning disk?
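For reference, the byte counters above convert to roughly the figures quoted in the question. A small sketch in Python, with the values copied from the dump; the helper name to_gb is only for illustration and decimal gigabytes are assumed:

```python
# Counters copied from the "bluefs" section of the dump above.
bluefs = {
    "db_total_bytes": 21472731136,
    "db_used_bytes": 3467640832,
    "slow_total_bytes": 320063143936,
    "slow_used_bytes": 4546625536,
}


def to_gb(n_bytes):
    """Convert bytes to decimal gigabytes (10**9 bytes); an assumption about units."""
    return n_bytes / 1e9


db_used_gb = to_gb(bluefs["db_used_bytes"])
slow_used_gb = to_gb(bluefs["slow_used_bytes"])
# Fraction of the SSD DB partition that is actually in use.
db_fill = bluefs["db_used_bytes"] / bluefs["db_total_bytes"]

print(f"DB on SSD:  {db_used_gb:.2f} GB used ({db_fill:.0%} of the partition)")
print(f"DB on slow: {slow_used_gb:.2f} GB spilled over")
```

A non-zero slow_used_bytes while db_fill is well below 100% is the symptom being asked about.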
Correct. Most probably the rationale for this is the layered scheme
RocksDB uses to keep its SST files. For each level it has a maximum
size threshold (determined by the level number, a base value and the corresponding
multiplier - see max_bytes_for_level_base &
max_bytes_for_level_multiplier at
https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide)
If the next level (at its max size) doesn't fit into the space available on the DB volume, it is spilled over to the slow device in its entirety.
IIRC level_base is about 250MB and the multiplier is 10, so the third level needs 25GB and hence doesn't fit into your DB volume.
In fact a DB volume of 20GB is VERY small for an 8TB OSD - just 0.25% of the slow one. AFAIR the current recommendation is about 4%.
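The level arithmetic can be sketched as follows. This is only an illustration using the figures recalled in the message above (base of about 250MB, multiplier of 10); the real values depend on the RocksDB options of the OSD and are not read from a live system here.

```python
# Assumed values, as recalled in the message above (not read from a live OSD).
LEVEL_BASE_MB = 250
MULTIPLIER = 10


def level_target_mb(level):
    """Maximum size of level N (N >= 1): base * multiplier ** (N - 1)."""
    return LEVEL_BASE_MB * MULTIPLIER ** (level - 1)


for level in range(1, 5):
    print(f"L{level}: {level_target_mb(level) / 1000:g} GB")
```

With these inputs the third level comes to 25 GB, which is already larger than the whole 20GB DB partition.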
Thanks Igor, these nodes were designed back in the filestore days when small 10DWPD SSDs were all the rage. I might be able to
shrink the OS/swap partition and get each DB partition up to 25/26GB; they are not going to get any bigger than that, as that's the
NVMe completely filled. But I'm then going to have to effectively wipe all the disks I've done so far and re-backfill. ☹ Are there any
tunables to change this behaviour post OSD deployment to move data back onto SSD?
On a related note, does frequently accessed data move into the SSD, or is the overspill a one way ticket? I would assume writes would
cause data in rocksdb to be written back into L0 and work its way down, but I'm not sure about reads?
This is from a similar slightly newer node with 10TB spinners and 40G partition
"bluefs": {
"gift_bytes": 0,
"reclaim_bytes": 0,
"db_total_bytes": 53684985856,
"db_used_bytes": 10380902400,
"wal_total_bytes": 0,
"wal_used_bytes": 0,
"slow_total_bytes": 400033841152,
"slow_used_bytes": 0,
"num_files": 165,
"log_bytes": 15683584,
"log_compactions": 8,
"logged_bytes": 384712704,
"files_written_wal": 2,
"files_written_sst": 11317,
"bytes_written_wal": 564218701044,
"bytes_written_sst": 618268958848
So I see your point about the 25G level size making it overspill the partition, as it is obvious in this case that the 10G of DB used is
completely stored on the SSD. These OSDs are about 70% full, so I'm not expecting a massive increase in usage. If I move to EC
pools, though, I should expect maybe a doubling in objects, so that db_used might double, but it should hopefully still be within
the 40G.
The 4% rule would not be workable in my case: there are 12x10TB disks in these nodes, so I would need nearly 5TB worth of SSD, which would
likely cost a similar amount to the whole node+disks. I get the fact that any recommendation needs to take the worst case into
account, but I would imagine that for a lot of simple RBD-only use cases, this number is quite inflated.
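The "nearly 5TB" figure follows directly from the node layout. A quick check in Python, assuming decimal units and the roughly 4% recommendation mentioned earlier in the thread:

```python
DISKS_PER_NODE = 12
DISK_SIZE_TB = 10
DB_FRACTION = 0.04  # the "about 4%" recommendation quoted earlier in the thread

# SSD space for block.db per OSD and per node under the 4% rule.
db_per_osd_gb = DISK_SIZE_TB * 1000 * DB_FRACTION
db_per_node_tb = DISKS_PER_NODE * DISK_SIZE_TB * DB_FRACTION

print(f"per OSD:  {db_per_osd_gb:.0f} GB")
print(f"per node: {db_per_node_tb:.1f} TB")
```

That is 400 GB of SSD per OSD and 4.8 TB per node, against the 40G partitions actually in use.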
So I think the lesson from this is that whatever DB usage you think you may end up with, always make sure your SSD
partition is bigger than 26GB (L0+L1)?
Ok, so after some reading [1], a slight correction: block.db needs to be at least around 28G (L1+L2+L3) to make sure L3 fits on the SSD. For most RBD workloads (or any other workloads with largish objects), the metadata will likely fit well within this limit.
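The 28G figure is the sum of the first three level sizes. A rough sketch, assuming the 250MB base and multiplier of 10 recalled earlier in the thread, and a simplified model in which a level stays on the SSD only if it and all shallower levels fit at their maximum size (the function names are hypothetical, and real BlueFS behaviour may differ):

```python
LEVEL_BASE_GB = 0.25  # assumed: ~250MB base, as recalled in the thread
MULTIPLIER = 10       # assumed multiplier, as recalled in the thread


def cumulative_gb(max_level):
    """Space needed to hold levels L1..max_level at their maximum sizes."""
    return sum(LEVEL_BASE_GB * MULTIPLIER ** (n - 1)
               for n in range(1, max_level + 1))


def deepest_level_that_fits(db_gb):
    """Deepest level N such that L1..LN together fit in db_gb (simplified model)."""
    level = 0
    while cumulative_gb(level + 1) <= db_gb:
        level += 1
    return level


for size_gb in (20, 26, 28, 40):
    print(f"{size_gb}G partition: up to L{deepest_level_that_fits(size_gb)}")
```

Under this model a 20G or 26G partition only holds up to L2, while anything from about 28G upwards also holds L3, which matches the correction above.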