[ceph-users] slow_used_bytes - SlowDB being used despite lots of space free in BlockDB on SSD?
Nick Fisk
2018-10-18 16:49:51 UTC

Ceph Version = 12.2.8
8TB spinner with 20G SSD partition

Perf dump shows the following:

"bluefs": {
"gift_bytes": 0,
"reclaim_bytes": 0,
"db_total_bytes": 21472731136,
"db_used_bytes": 3467640832,
"wal_total_bytes": 0,
"wal_used_bytes": 0,
"slow_total_bytes": 320063143936,
"slow_used_bytes": 4546625536,
"num_files": 124,
"log_bytes": 11833344,
"log_compactions": 4,
"logged_bytes": 316227584,
"files_written_wal": 2,
"files_written_sst": 4375,
"bytes_written_wal": 204427489105,
"bytes_written_sst": 248223463173

Am I reading that correctly, about 3.4GB used out of 20GB on the SSD, yet 4.5GB of DB is stored on the spinning disk?
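
A quick way to sanity-check this reading is a small script over the "bluefs" section of the perf dump (typically obtained with `ceph daemon osd.<id> perf dump`). This is only a sketch: `summarise_bluefs` is a hypothetical helper, not part of Ceph, and the dictionary simply copies the values posted above. It converts the counters to decimal GB and flags any DB data sitting on the slow device.

```python
# Sketch: summarise the "bluefs" counters from an OSD perf dump.
# summarise_bluefs is a hypothetical helper, not a Ceph API.

def summarise_bluefs(bluefs):
    """Return DB/slow usage in decimal GB and whether data has spilled over."""
    gb = 1e9
    return {
        "db_total_gb": round(bluefs["db_total_bytes"] / gb, 2),
        "db_used_gb": round(bluefs["db_used_bytes"] / gb, 2),
        "slow_used_gb": round(bluefs["slow_used_bytes"] / gb, 2),
        # Any bytes on the slow device mean part of the DB lives on the spinner.
        "spilled_over": bluefs["slow_used_bytes"] > 0,
    }


# Values copied from the perf dump in this post.
bluefs = {
    "db_total_bytes": 21472731136,
    "db_used_bytes": 3467640832,
    "slow_total_bytes": 320063143936,
    "slow_used_bytes": 4546625536,
}

summary = summarise_bluefs(bluefs)
print(summary)
```

With the numbers above this reports roughly 3.5GB used on the SSD and 4.5GB on the spinner, with the spillover flag set.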

Am I also understanding correctly that BlueFS has reserved 300G of space on the spinning disk?

Found a previous bug tracker for something which looks exactly the same case, but should be fixed now:

Igor Fedotov
2018-10-19 00:02:51 UTC
Post by Nick Fisk
Am I reading that correctly, about 3.4GB used out of 20GB on the SSD, yet 4.5GB of DB is stored on the spinning disk?
Correct. Most probably the reason for this is the layered scheme
RocksDB uses to store its SST files. Each level has a maximum size
threshold (determined by the level number, a base value and a
corresponding multiplier; see max_bytes_for_level_base and
max_bytes_for_level_multiplier).
If the next level (at its maximum size) doesn't fit into the space
available on the DB volume, it is spilled over entirely to the slow device.
IIRC level_base is about 250MB and the multiplier is 10, so the third level
needs 25GB and hence doesn't fit into your DB volume.

In fact a DB volume of 20GB is VERY small for an 8TB OSD: just 0.25% of the
slow one. AFAIR the current recommendation is about 4%.
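
To make the level arithmetic concrete, here is a minimal sketch. It assumes the values recalled above (base of about 250MB, multiplier of 10); the real values depend on the RocksDB options the OSD runs with, and `level_max_mb` is a hypothetical helper name.

```python
# Sketch of RocksDB level target sizes under the assumed defaults.
BASE_MB = 250      # max_bytes_for_level_base (assumed, "IIRC" in the post)
MULTIPLIER = 10    # max_bytes_for_level_multiplier (assumed)


def level_max_mb(level, base_mb=BASE_MB, multiplier=MULTIPLIER):
    """Maximum size of a level (1-based) in MB: base * multiplier^(level - 1)."""
    if level < 1:
        raise ValueError("level must be >= 1")
    return base_mb * multiplier ** (level - 1)


for level in range(1, 5):
    print(f"L{level}: {level_max_mb(level)} MB")
```

Under these assumptions the third level alone is 25000MB, which is already larger than a 20GB DB volume.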
Nick Fisk
2018-10-19 07:14:36 UTC
Thanks Igor. These nodes were designed back in the filestore days, when small 10DWPD SSDs were all the rage. I might be able to shrink the OS/swap partition and get each DB partition up to 25/26GB; they are not going to get any bigger than that, as that's the NVMe completely filled. But I'm then going to have to effectively wipe all the disks I've done so far and re-backfill. ☹ Are there any tunables to change this behaviour post OSD deployment to move data back onto SSD?

On a related note, does frequently accessed data move into the SSD, or is the overspill a one way ticket? I would assume writes would cause data in rocksdb to be written back into L0 and work its way down, but I'm not sure about reads?

This is from a similar slightly newer node with 10TB spinners and 40G partition
"bluefs": {
"gift_bytes": 0,
"reclaim_bytes": 0,
"db_total_bytes": 53684985856,
"db_used_bytes": 10380902400,
"wal_total_bytes": 0,
"wal_used_bytes": 0,
"slow_total_bytes": 400033841152,
"slow_used_bytes": 0,
"num_files": 165,
"log_bytes": 15683584,
"log_compactions": 8,
"logged_bytes": 384712704,
"files_written_wal": 2,
"files_written_sst": 11317,
"bytes_written_wal": 564218701044,
"bytes_written_sst": 618268958848

So I see your point about the 25G level size making it spill over the partition, as it is obvious in this case that the 10G of DB used is completely stored on the SSD. These OSDs are about 70% full, so I'm not expecting a massive increase in usage. If I move to EC pools, though, I should expect maybe a doubling in objects, so db_used might double, but it should hopefully still be within the 40G.

The 4% rule would not be workable in my case: there are 12x10TB disks in these nodes, so I would need nearly 5TB worth of SSD, which would likely cost a similar amount to the whole node plus disks. I get that any recommendation needs to take the worst case into account, but I would imagine that for a lot of simple RBD-only use cases this number is quite inflated.
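
For reference, the sizing arithmetic behind these figures can be sketched as below. The helper names are made up for illustration, and decimal units (1TB = 1000GB) are assumed.

```python
# Sketch of the DB sizing arithmetic discussed in this thread.

def db_tb_for_ratio(osd_tb, ratio=0.04):
    """SSD capacity (TB) needed for one OSD at the given DB/data ratio."""
    return osd_tb * ratio


def db_ratio_percent(db_gb, osd_tb):
    """DB partition size as a percentage of the OSD's data device."""
    return db_gb / (osd_tb * 1000) * 100


# 12 x 10TB disks at the 4% recommendation.
node_ssd_tb = 12 * db_tb_for_ratio(10)
print(f"SSD needed per node at 4%: {node_ssd_tb:.1f} TB")

# 20GB DB partition in front of an 8TB spinner.
print(f"20GB on 8TB: {db_ratio_percent(20, 8):.2f}%")
```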

So I think the lesson from this is that, whatever DB usage you think you may end up with, you should always make sure your SSD partition is bigger than 26GB (L0+L1)?
Nick Fisk
2018-10-19 09:14:53 UTC
OK, so after some reading [1], a slight correction: block.db needs to be at least around 28G (L1+L2+L3) to make sure L3 fits on the SSD. For most RBD workloads (or any other workloads with largish objects) the metadata will likely fit well within this limit.

[1] https://www.spinics.net/lists/ceph-devel/msg39315.html
Florian Engelmann
2018-10-19 09:47:37 UTC

Our Ceph cluster is a 6 Node cluster each node having 8 disks. The
cluster is used for object storage only (right now). We do use EC 3+2 on
the buckets.data pool.

We had a problem with RadosGW segfaulting (12.2.5) until we upgraded to
12.2.8. We had almost 30,000 radosgw crashes, leading to millions of
unreferenced objects (failed multipart uploads?). It filled our cluster so
fast that we are now in danger of running out of space.

As you can see we are reweighting some OSDs right now. But the real
question is how "used" is calculated in ceph df.

Global: %RAW USED = 76.49%


x-1.rgw.buckets.data Used = 90.32%

Am I right that this is because we should still be "able" to lose one OSD node?

If that's true, can reweighting help only a little, by rebalancing the
capacity used on each node?

The only chance we have right now to survive until new HDDs arrive is to
delete objects, right?

ceph -s
id: a2222146-6561-307e-b032-xxxxxxxxxxxxx
3 nearfull osd(s)
13 pool(s) nearfull
1 large omap objects
766760/180478374 objects misplaced (0.425%)

mon: 3 daemons, quorum ceph1-mon3,ceph1-mon2,ceph1-mon1
mgr: ceph1-mon2(active), standbys: ceph1-mon1, ceph1-mon3
osd: 36 osds: 36 up, 36 in; 24 remapped pgs
rgw: 3 daemons active
rgw-nfs: 2 daemons active

pools: 13 pools, 1424 pgs
objects: 36.10M objects, 115TiB
usage: 200TiB used, 61.6TiB / 262TiB avail
pgs: 766760/180478374 objects misplaced (0.425%)
1400 active+clean
16 active+remapped+backfill_wait
8 active+remapped+backfilling

client: 3.05MiB/s rd, 0B/s wr, 1.12kop/s rd, 37op/s wr
recovery: 306MiB/s, 91objects/s

ceph df
GLOBAL:
    SIZE       AVAIL       RAW USED     %RAW USED
    262TiB     61.6TiB     200TiB       76.49
POOLS:
    NAME                       ID     USED        %USED     MAX AVAIL
    iscsi-images               1      35B         0         6.87TiB
    .rgw.root                  2      3.57KiB     0         6.87TiB
    x-1.rgw.buckets.data       6      115TiB      90.32     12.4TiB
    x-1.rgw.control            7      0B          0         6.87TiB
    x-1.rgw.meta               8      943KiB      0         6.87TiB
    x-1.rgw.log                9      0B          0         6.87TiB
    x-1.rgw.buckets.index      12     0B          0         6.87TiB
    x-1.rgw.buckets.non-ec     13     0B          0         6.87TiB
    default.rgw.meta           14     373B        0         6.87TiB
    default.rgw.control        15     0B          0         6.87TiB
    default.rgw.log            16     0B          0         6.87TiB
    scbench                    17     0B          0         6.87TiB
    rbdbench                   18     1.00GiB     0.01      6.87TiB

Jakub Jaszewski
2018-10-19 18:43:30 UTC
Hi, your question is more about MAX AVAIL value I think, see how Ceph
calculates it

One OSD getting full makes the pool full as well, so keep reweighting the
nearfull OSDs.
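
As a rough illustration of how the two percentages can differ, the sketch below assumes that the per-pool %USED is USED / (USED + MAX AVAIL) and that the global figure is RAW USED / SIZE. That reading is an assumption, not taken from the Ceph source, but it is consistent with the numbers posted in this thread to within rounding of the displayed values. The function names are hypothetical.

```python
# Sketch: reproduce the "ceph df" percentages from the posted figures.
# Assumption: pool %USED = USED / (USED + MAX AVAIL);
#             global %RAW USED = RAW USED / SIZE.

def pool_pct_used(used_tib, max_avail_tib):
    """Pool fullness relative to what the pool can still accept."""
    return 100.0 * used_tib / (used_tib + max_avail_tib)


def global_pct_raw_used(raw_used_tib, size_tib):
    """Cluster-wide raw fullness."""
    return 100.0 * raw_used_tib / size_tib


pool = pool_pct_used(115, 12.4)          # x-1.rgw.buckets.data
cluster = global_pct_raw_used(200, 262)  # GLOBAL line

print(f"pool %USED     ~ {pool:.2f}")
print(f"global %RAW    ~ {cluster:.2f}")
```

Under this reading the pool figure is driven by MAX AVAIL, which in turn is limited by the fullest OSD, so it can sit well above the global raw percentage.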



Igor Fedotov
2018-10-19 21:18:21 UTC
Hi Nick
Post by Nick Fisk
Are there any tunables to change this behaviour post OSD deployment to move data back onto SSD?
None I'm aware of.

However, I've just completed development of an offline BlueFS volume
migration feature within ceph-bluestore-tool. It allows allocating and
resizing DB/WAL volumes as well as moving BlueFS data between volumes
(with some limitations unrelated to your case). Hence one doesn't need
slow backfilling to adjust the BlueFS volume configuration.
Here is the PR (Nautilus only for now):
Post by Nick Fisk
On a related note, does frequently accessed data move into the SSD, or is the overspill a one way ticket? I would assume writes would cause data in rocksdb to be written back into L0 and work its way down, but I'm not sure about reads?
AFAIK reads don't trigger any data layout changes.
Post by Nick Fisk
So I think the lesson from this is that despite whatever DB usage you may think you may end up with, always make sure your SSD partition is bigger than 26GB (L0+L1)?
In fact that's
L0+L1 (2x250MB), L2 (2500MB) and L3 (25000MB), which is about 28GB.
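
The sum above can be turned into a rough check of which levels a given DB volume can hold. This is a simplified model under the assumed sizes quoted in this thread (decimal MB); it ignores compaction headroom and other BlueFS usage such as the log, so the real threshold may sit somewhat higher. `levels_that_fit` is a hypothetical helper.

```python
# Sketch: which RocksDB levels fit entirely on the DB volume, assuming
# the level sizes quoted above. Not a model of actual BlueFS allocation.
LEVELS_MB = [("L0+L1", 500), ("L2", 2500), ("L3", 25000), ("L4", 250000)]


def levels_that_fit(db_total_bytes):
    """Return the names of the levels whose cumulative size fits on the volume."""
    capacity_mb = db_total_bytes / 1e6
    fitting, cumulative = [], 0
    for name, size_mb in LEVELS_MB:
        cumulative += size_mb
        if cumulative > capacity_mb:
            break  # this level (and all deeper ones) would spill to the slow device
        fitting.append(name)
    return fitting


min_for_l3_mb = sum(size for _, size in LEVELS_MB[:3])
print("Minimum to hold up to L3:", min_for_l3_mb, "MB")
print("20G node:", levels_that_fit(21472731136))
print("40G node:", levels_that_fit(53684985856))
```

With the db_total_bytes values posted earlier, the 20G node holds only up to L2 while the 40G node also holds L3, which matches the spillover seen on one and not the other.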

One more observation from my side: RocksDB might additionally use up to
100% of the level maximum size during compaction, hence it might make
sense to have up to 25GB of additional spare space. Surely this spare
space wouldn't be fully used most of the time. And actually I don't have
any instructions or clear knowledge base for this aspect. Just some warning.
To track such an excess I used additional perf counters, see commit
2763c4de41ea55a97ed7400f54a2b2d841894bf5 in
Perhaps it makes sense to have a separate PR for this stuff and even
backport it...
Nick Fisk
2018-10-20 20:30:03 UTC
Post by Igor Fedotov
However I've just completed development for offline BlueFS volume migration feature within ceph-bluestore-tool. It allows DB/WAL
volumes allocation and resizing as well as moving BlueFS data between volumes (with some limitations unrelated to your case). Hence
one doesn't need slow backfilling to adjust BlueFS volume configuration.
That sounds awesome, I might look at leaving the current OSD's how they are and look to "fix" them when Nautilus comes out.
Post by Igor Fedotov
In fact that's
L0+L1 (2x250Mb), L2(2500MB), L3(25000MB) which is about 28GB.
Well, I upgraded a new node and, after shrinking the OS partition, I managed to assign 29GB to each DB partition. It has just finished backfilling and, disappointingly, it looks like the DB has spilled over onto the disks ☹ So the magic minimum number is going to be somewhere between 30GB and 40GB. I might be able to squeeze 30G partitions out if I go for a tiny OS disk and no swap. Will try that on the next one. Hoping that 30G does it.
Post by Igor Fedotov
One more observation from my side - RocksDB might additionally use up to 100% of the level maximum size during compaction -
hence it might make sense to have up to 25GB of additional spare space. Surely this spare space wouldn't be fully used most of the
time. And actually I don't have any instructions or clear knowledge base for this aspect. Just some warning.
To track such an excess I used additional perf counters, see commit
2763c4de41ea55a97ed7400f54a2b2d841894bf5 in
Perhaps makes sense to have a separare PR for this stuff and even backport it...
I think I'm starting to capture some of that data as I'm graphing all the "perf dump" values into graphite. The nodes with the 40GB DB partitions with all data on SSD currently have about 10GiB in the DB. During compactions the highest it has peaked over the last few days is up to 14GiB. In the nodes with the 20GB partitions, the SSD.DB sits at about 2.5GiB and peaks to just under 5GiB, the slow sits at 4.3GiB and peaks to about 6GiB.
Nick Fisk
2018-10-30 07:45:58 UTC
Mark, looping you in as we were talking about this last Thursday.

So it looks like the magic size is 30G. I re-created a single OSD with a 30G DB partition and after backfilling all data is now stored on the SSD. The perf dumps below show the difference between the 30G and 29G partitions:

30G partition:
"db_total_bytes": 32210149376,
"db_used_bytes": 7182745600,
"slow_total_bytes": 320063143936,
"slow_used_bytes": 0,

29G partition:
"db_total_bytes": 31136407552,
"db_used_bytes": 3696230400,
"slow_total_bytes": 320063143936,
"slow_used_bytes": 5875171328,

So it seems the minimum size for the SSD partition should be 30G, unless you have <1TB spinning disks, which might fit in a 3G partition. 30G should cover most RBD workloads up to pretty large disks (8TB in my example). RGW workloads, I'm guessing, are most at risk of having larger DB requirements, so the next minimum size would probably be just over 300G. I can't test that last one, as it would require a significant amount of test data and I don't currently use RGW. It might be good to hear whether anyone has db_used_bytes > 30G, and what your total block.db partition size is.
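
Expressed as a rule of thumb, the step sizes above could be sketched like this. The 3G and 30G steps come from the observations in this thread, while the 300G step is only a guess and untested, and the function name is made up for illustration.

```python
# Sketch of the rule of thumb: a block.db partition is only fully useful
# up to the largest step it reaches. The 300G step is an untested guess.
STEPS_GB = [3, 30, 300]


def effective_db_step_gb(partition_gb):
    """Largest step (in GB) that the partition reaches, or 0 if below all of them."""
    reached = [step for step in STEPS_GB if partition_gb >= step]
    return max(reached) if reached else 0


for size in (20, 29, 30, 40):
    print(f"{size}G partition -> behaves like {effective_db_step_gb(size)}G")
```

Under this rule a 20G or 29G partition holds no more levels than a 3G one, and a 40G partition holds no more than a 30G one, although the extra space may still be used during compaction.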

As discussed, Mark, I'm also considering whether there is a better way to adjust the RocksDB target sizes so that these minimum sizes fit better around the allocated space.
