[ceph-users] Bluestore OSD

Discussion:

[ceph-users] Bluestore OSD_DATA, WAL & DB

Lazuardi Nasution

2017-09-15 18:39:27 UTC

Hi,

1. Is it possible configure use osd_data not as small partition on OSD but
a folder (ex. on root disk)? If yes, how to do that with ceph-disk and any
pros/cons of doing that?
2. Is WAL & DB size calculated based on OSD size or expected throughput
like on journal device of filestore? If no, what is the default value and
pro/cons of adjusting that?
3. Is partition alignment matter on Bluestore, including WAL & DB if using
separate device for them?

Best regards,

Lazuardi Nasution

2017-09-21 05:56:04 UTC

Permalink

Hi,

I'm still looking for the answer of these questions. Maybe someone can
share their thought on these. Any comment will be helpful too.

Best regards,

On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution <***@gmail.com>
wrote:

> Hi,
>
> 1. Is it possible configure use osd_data not as small partition on OSD but
> a folder (ex. on root disk)? If yes, how to do that with ceph-disk and any
> pros/cons of doing that?
> 2. Is WAL & DB size calculated based on OSD size or expected throughput
> like on journal device of filestore? If no, what is the default value and
> pro/cons of adjusting that?
> 3. Is partition alignment matter on Bluestore, including WAL & DB if using
> separate device for them?
>
> Best regards,
>

Maged Mokhtar

2017-09-21 07:45:18 UTC

Permalink

On 2017-09-21 07:56, Lazuardi Nasution wrote:

> Hi,
>
> I'm still looking for the answer of these questions. Maybe someone can share their thought on these. Any comment will be helpful too.
>
> Best regards,
>
> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution <***@gmail.com> wrote:
>
>> Hi,
>>
>> 1. Is it possible configure use osd_data not as small partition on OSD but a folder (ex. on root disk)? If yes, how to do that with ceph-disk and any pros/cons of doing that?
>> 2. Is WAL & DB size calculated based on OSD size or expected throughput like on journal device of filestore? If no, what is the default value and pro/cons of adjusting that?
>> 3. Is partition alignment matter on Bluestore, including WAL & DB if using separate device for them?
>>
>> Best regards,
>
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

I am also looking for recommendations on wal/db partition sizes. Some
hints:

ceph-disk defaults used in case it does not find
bluestore_block_wal_size or bluestore_block_db_size in config file:

wal = 512MB

db = if bluestore_block_size (data size) is in config file it uses 1/100
of it else it uses 1G.

There is also a presentation by Sage back in March, see page 16:

https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in

wal: 512 MB

db: "a few" GB

the wal size is probably not debatable, it will be like a journal for
small block sizes which are constrained by iops hence 512 MB is more
than enough. Probably we will see more on the db size in the future.

Maged

Dietmar Rieder

2017-09-21 08:17:28 UTC

Permalink

On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
> On 2017-09-21 07:56, Lazuardi Nasution wrote:
>
>> Hi,
>> Â
>> I'm still looking for the answer of these questions. Maybe someone can
>> share their thought on these. Any comment will be helpful too.
>> Â
>> Best regards,
>>
>> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
>> <***@gmail.com <mailto:***@gmail.com>> wrote:
>>
>> Hi,
>> Â
>> 1. Is it possible configure use osd_data not as small partition on
>> OSD but a folder (ex. on root disk)? If yes, how to do that with
>> ceph-disk and any pros/cons of doing that?
>> 2. Is WAL & DB size calculated based on OSD size or expected
>> throughput like on journal device of filestore? If no, what is the
>> default value and pro/cons of adjusting that?
>> 3. Is partition alignment matter on Bluestore, including WAL & DB
>> if using separate device for them?
>> Â
>> Best regards,
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> Â
>
> I am also looking for recommendations on wal/db partition sizes.Â Some hints:
>
> ceph-disk defaults used in case it does not find
> bluestore_block_wal_size orÂ bluestore_block_db_size in config file:
>
> wal = Â 512MB
>
> db = ifÂ bluestore_block_size (data size) is in config file it uses 1/100
> of it else it uses 1G.
>
> There is also a presentation by Sage back in March,Â see page 16:
>
> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
>
> wal: 512 MB
>
> db: "a few" GBÂ
>
> the wal size is probably not debatable, it will be like a journal for
> small block sizes which are constrained by iops hence 512 MB is more
> than enough. Probably we will see more on the db size in the future.

This is what I understood so far.
I wonder if it makes sense to set the db size as big as possible and
divide entire db device is by the number of OSDs it will serve.

E.g. 10 OSDs / 1 NVME (800GB)

(800GB - 10x1GB wal ) / 10 = ~79Gb db size per OSD

Is this smart/stupid?

Dietmar
--
_________________________________________
D i e t m a r R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: ***@i-med.ac.at
Web: http://www.icbi.at

Mark Nelson

2017-09-21 15:03:07 UTC

Permalink

On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
>> On 2017-09-21 07:56, Lazuardi Nasution wrote:
>>
>>> Hi,
>>>
>>> I'm still looking for the answer of these questions. Maybe someone can
>>> share their thought on these. Any comment will be helpful too.
>>>
>>> Best regards,
>>>
>>> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
>>> <***@gmail.com <mailto:***@gmail.com>> wrote:
>>>
>>> Hi,
>>>
>>> 1. Is it possible configure use osd_data not as small partition on
>>> OSD but a folder (ex. on root disk)? If yes, how to do that with
>>> ceph-disk and any pros/cons of doing that?
>>> 2. Is WAL & DB size calculated based on OSD size or expected
>>> throughput like on journal device of filestore? If no, what is the
>>> default value and pro/cons of adjusting that?
>>> 3. Is partition alignment matter on Bluestore, including WAL & DB
>>> if using separate device for them?
>>>
>>> Best regards,
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> I am also looking for recommendations on wal/db partition sizes. Some hints:
>>
>> ceph-disk defaults used in case it does not find
>> bluestore_block_wal_size or bluestore_block_db_size in config file:
>>
>> wal = 512MB
>>
>> db = if bluestore_block_size (data size) is in config file it uses 1/100
>> of it else it uses 1G.
>>
>> There is also a presentation by Sage back in March, see page 16:
>>
>> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
>>
>> wal: 512 MB
>>
>> db: "a few" GB
>>
>> the wal size is probably not debatable, it will be like a journal for
>> small block sizes which are constrained by iops hence 512 MB is more
>> than enough. Probably we will see more on the db size in the future.
>
> This is what I understood so far.
> I wonder if it makes sense to set the db size as big as possible and
> divide entire db device is by the number of OSDs it will serve.
>
> E.g. 10 OSDs / 1 NVME (800GB)
>
> (800GB - 10x1GB wal ) / 10 = ~79Gb db size per OSD
>
> Is this smart/stupid?

Personally I'd use 512MB-2GB for the WAL (larger buffers reduce write
amp but mean larger memtables and potentially higher overhead scanning
through memtables). 4x256MB buffers works pretty well, but it means
memory overhead too. Beyond that, I'd devote the entire rest of the
device to DB partitions.

Mark

>
> Dietmar
> --
> _________________________________________
> D i e t m a r R i e d e r, Mag.Dr.
> Innsbruck Medical University
> Biocenter - Division for Bioinformatics
> Innrain 80, 6020 Innsbruck
> Phone: +43 512 9003 71402
> Fax: +43 512 9003 73100
> Email: ***@i-med.ac.at
> Web: http://www.icbi.at
>
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

Dietmar Rieder

2017-09-21 16:15:48 UTC

Permalink

On 09/21/2017 05:03 PM, Mark Nelson wrote:
>
>
> On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
>> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
>>> On 2017-09-21 07:56, Lazuardi Nasution wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm still looking for the answer of these questions. Maybe someone can
>>>> share their thought on these. Any comment will be helpful too.
>>>>
>>>> Best regards,
>>>>
>>>> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
>>>> <***@gmail.com <mailto:***@gmail.com>> wrote:
>>>>
>>>> Â Â Â Hi,
>>>>
>>>> Â Â Â 1. Is it possible configure use osd_data not as small partition on
>>>> Â Â Â OSD but a folder (ex. on root disk)? If yes, how to do that with
>>>> Â Â Â ceph-disk and any pros/cons of doing that?
>>>> Â Â Â 2. Is WAL & DB size calculated based on OSD size or expected
>>>> Â Â Â throughput like on journal device of filestore? If no, what is the
>>>> Â Â Â default value and pro/cons of adjusting that?
>>>> Â Â Â 3. Is partition alignment matter on Bluestore, including WAL & DB
>>>> Â Â Â if using separate device for them?
>>>>
>>>> Â Â Â Best regards,
>>>>
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>>
>>> I am also looking for recommendations on wal/db partition sizes. Some
>>> hints:
>>>
>>> ceph-disk defaults used in case it does not find
>>> bluestore_block_wal_size or bluestore_block_db_size in config file:
>>>
>>> wal =Â 512MB
>>>
>>> db = if bluestore_block_size (data size) is in config file it uses 1/100
>>> of it else it uses 1G.
>>>
>>> There is also a presentation by Sage back in March, see page 16:
>>>
>>> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
>>>
>>>
>>> wal: 512 MB
>>>
>>> db: "a few" GB
>>>
>>> the wal size is probably not debatable, it will be like a journal for
>>> small block sizes which are constrained by iops hence 512 MB is more
>>> than enough. Probably we will see more on the db size in the future.
>>
>> This is what I understood so far.
>> I wonder if it makes sense to set the db size as big as possible and
>> divide entire db device isÂ by the number of OSDs it will serve.
>>
>> E.g. 10 OSDs / 1 NVME (800GB)
>>
>> Â (800GB - 10x1GB wal ) / 10 = ~79Gb db size per OSD
>>
>> Is this smart/stupid?
>
> Personally I'd use 512MB-2GB for the WAL (larger buffers reduce write
> amp but mean larger memtables and potentially higher overhead scanning
> through memtables).Â 4x256MB buffers works pretty well, but it means
> memory overhead too.Â Beyond that, I'd devote the entire rest of the
> device to DB partitions.
>

thanks for your suggestion Mark!

So, just to make sure I understood this right:

You'd use a separeate 512MB-2GB WAL partition for each OSD and the
entire rest for DB partitions.

In the example case with 10xHDD OSD and 1 NVME it would then be 10 WAL
partitions with each 512MB-2GB and 10 equal sized DB partitions
consuming the rest of the NVME.

Thanks
Dietmar
--
_________________________________________
D i e t m a r R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics

Benjeman Meekhof

2017-09-21 19:37:32 UTC

Permalink

Some of this thread seems to contradict the documentation and confuses
me. Is the statement below correct?

"The BlueStore journal will always be placed on the fastest device
available, so using a DB device will provide the same benefit that the
WAL device would while also allowing additional metadata to be stored
there (if it will fix)."

http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices

it seems to be saying that there's no reason to create separate WAL
and DB partitions if they are on the same device. Specifying one
large DB partition per OSD will cover both uses.

thanks,
Ben

On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
<***@i-med.ac.at> wrote:
> On 09/21/2017 05:03 PM, Mark Nelson wrote:
>>
>>
>> On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
>>> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
>>>> On 2017-09-21 07:56, Lazuardi Nasution wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm still looking for the answer of these questions. Maybe someone can
>>>>> share their thought on these. Any comment will be helpful too.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
>>>>> <***@gmail.com <mailto:***@gmail.com>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> 1. Is it possible configure use osd_data not as small partition on
>>>>> OSD but a folder (ex. on root disk)? If yes, how to do that with
>>>>> ceph-disk and any pros/cons of doing that?
>>>>> 2. Is WAL & DB size calculated based on OSD size or expected
>>>>> throughput like on journal device of filestore? If no, what is the
>>>>> default value and pro/cons of adjusting that?
>>>>> 3. Is partition alignment matter on Bluestore, including WAL & DB
>>>>> if using separate device for them?
>>>>>
>>>>> Best regards,
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>>
>>>>
>>>> I am also looking for recommendations on wal/db partition sizes. Some
>>>> hints:
>>>>
>>>> ceph-disk defaults used in case it does not find
>>>> bluestore_block_wal_size or bluestore_block_db_size in config file:
>>>>
>>>> wal = 512MB
>>>>
>>>> db = if bluestore_block_size (data size) is in config file it uses 1/100
>>>> of it else it uses 1G.
>>>>
>>>> There is also a presentation by Sage back in March, see page 16:
>>>>
>>>> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
>>>>
>>>>
>>>> wal: 512 MB
>>>>
>>>> db: "a few" GB
>>>>
>>>> the wal size is probably not debatable, it will be like a journal for
>>>> small block sizes which are constrained by iops hence 512 MB is more
>>>> than enough. Probably we will see more on the db size in the future.
>>>
>>> This is what I understood so far.
>>> I wonder if it makes sense to set the db size as big as possible and
>>> divide entire db device is by the number of OSDs it will serve.
>>>
>>> E.g. 10 OSDs / 1 NVME (800GB)
>>>
>>> (800GB - 10x1GB wal ) / 10 = ~79Gb db size per OSD
>>>
>>> Is this smart/stupid?
>>
>> Personally I'd use 512MB-2GB for the WAL (larger buffers reduce write
>> amp but mean larger memtables and potentially higher overhead scanning
>> through memtables). 4x256MB buffers works pretty well, but it means
>> memory overhead too. Beyond that, I'd devote the entire rest of the
>> device to DB partitions.
>>
>
> thanks for your suggestion Mark!
>
> So, just to make sure I understood this right:
>
> You'd use a separeate 512MB-2GB WAL partition for each OSD and the
> entire rest for DB partitions.
>
> In the example case with 10xHDD OSD and 1 NVME it would then be 10 WAL
> partitions with each 512MB-2GB and 10 equal sized DB partitions
> consuming the rest of the NVME.
>
>
> Thanks
> Dietmar
> --
> _________________________________________
> D i e t m a r R i e d e r, Mag.Dr.
> Innsbruck Medical University
> Biocenter - Division for Bioinformatics
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

Richard Hesketh

2017-09-22 09:27:19 UTC

Permalink

I asked the same question a couple of weeks ago. No response I got
contradicted the documentation but nobody actively confirmed the
documentation was correct on this subject, either; my end state was that
I was relatively confident I wasn't making some horrible mistake by
simply specifying a big DB partition and letting bluestore work itself
out (in my case, I've just got HDDs and SSDs that were journals under
filestore), but I could not be sure there wasn't some sort of
performance tuning I was missing out on by not specifying them separately.

Rich

On 21/09/17 20:37, Benjeman Meekhof wrote:
> Some of this thread seems to contradict the documentation and confuses
> me. Is the statement below correct?
>
> "The BlueStore journal will always be placed on the fastest device
> available, so using a DB device will provide the same benefit that the
> WAL device would while also allowing additional metadata to be stored
> there (if it will fix)."
>
> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices
>
> it seems to be saying that there's no reason to create separate WAL
> and DB partitions if they are on the same device. Specifying one
> large DB partition per OSD will cover both uses.
>
> thanks,
> Ben
>
> On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
> <***@i-med.ac.at> wrote:
>> On 09/21/2017 05:03 PM, Mark Nelson wrote:
>>>
>>> On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
>>>> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
>>>>> On 2017-09-21 07:56, Lazuardi Nasution wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm still looking for the answer of these questions. Maybe someone can
>>>>>> share their thought on these. Any comment will be helpful too.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
>>>>>> <***@gmail.com <mailto:***@gmail.com>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> 1. Is it possible configure use osd_data not as small partition on
>>>>>> OSD but a folder (ex. on root disk)? If yes, how to do that with
>>>>>> ceph-disk and any pros/cons of doing that?
>>>>>> 2. Is WAL & DB size calculated based on OSD size or expected
>>>>>> throughput like on journal device of filestore? If no, what is the
>>>>>> default value and pro/cons of adjusting that?
>>>>>> 3. Is partition alignment matter on Bluestore, including WAL & DB
>>>>>> if using separate device for them?
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list
>>>>>> ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>
>>>>>
>>>>> I am also looking for recommendations on wal/db partition sizes. Some
>>>>> hints:
>>>>>
>>>>> ceph-disk defaults used in case it does not find
>>>>> bluestore_block_wal_size or bluestore_block_db_size in config file:
>>>>>
>>>>> wal = 512MB
>>>>>
>>>>> db = if bluestore_block_size (data size) is in config file it uses 1/100
>>>>> of it else it uses 1G.
>>>>>
>>>>> There is also a presentation by Sage back in March, see page 16:
>>>>>
>>>>> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
>>>>>
>>>>>
>>>>> wal: 512 MB
>>>>>
>>>>> db: "a few" GB
>>>>>
>>>>> the wal size is probably not debatable, it will be like a journal for
>>>>> small block sizes which are constrained by iops hence 512 MB is more
>>>>> than enough. Probably we will see more on the db size in the future.
>>>> This is what I understood so far.
>>>> I wonder if it makes sense to set the db size as big as possible and
>>>> divide entire db device is by the number of OSDs it will serve.
>>>>
>>>> E.g. 10 OSDs / 1 NVME (800GB)
>>>>
>>>> (800GB - 10x1GB wal ) / 10 = ~79Gb db size per OSD
>>>>
>>>> Is this smart/stupid?
>>> Personally I'd use 512MB-2GB for the WAL (larger buffers reduce write
>>> amp but mean larger memtables and potentially higher overhead scanning
>>> through memtables). 4x256MB buffers works pretty well, but it means
>>> memory overhead too. Beyond that, I'd devote the entire rest of the
>>> device to DB partitions.
>>>
>> thanks for your suggestion Mark!
>>
>> So, just to make sure I understood this right:
>>
>> You'd use a separeate 512MB-2GB WAL partition for each OSD and the
>> entire rest for DB partitions.
>>
>> In the example case with 10xHDD OSD and 1 NVME it would then be 10 WAL
>> partitions with each 512MB-2GB and 10 equal sized DB partitions
>> consuming the rest of the NVME.
>>
>>
>> Thanks
>> Dietmar
>> --
>> _________________________________________
>> D i e t m a r R i e d e r, Mag.Dr.
>> Innsbruck Medical University
>> Biocenter - Division for Bioinformatics
>>
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-***@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

TYLin

2017-09-25 08:31:50 UTC

Permalink

Hi,

To my understand, the bluestore write workflow is

For normal big write
1. Write data to block
2. Update metadata to rocksdb
3. Rocksdb write to memory and block.wal
4. Once reach threshold, flush entries in block.wal to block.db

For overwrite and small write
1. Write data and metadata to rocksdb
2. Apply the data to block

Seems we don’t have a formula or suggestion to the size of block.db. It depends on the object size and number of objects in your pool. You can just give big partition to block.db to ensure all the database files are on that fast partition. If block.db full, it will use block to put db files, however, this will slow down the db performance. So give db size as much as you can.

If you want to put wal and db on same ssd, you don’t need to create block.wal. It will implicitly use block.db to put wal. The only case you need block.wal is that you want to separate wal to another disk.

I’m also studying bluestore, this is what I know so far. Any correction is welcomed.

Thanks

> On Sep 22, 2017, at 5:27 PM, Richard Hesketh <***@rd.bbc.co.uk> wrote:
>
> I asked the same question a couple of weeks ago. No response I got contradicted the documentation but nobody actively confirmed the documentation was correct on this subject, either; my end state was that I was relatively confident I wasn't making some horrible mistake by simply specifying a big DB partition and letting bluestore work itself out (in my case, I've just got HDDs and SSDs that were journals under filestore), but I could not be sure there wasn't some sort of performance tuning I was missing out on by not specifying them separately.
>
> Rich
>
> On 21/09/17 20:37, Benjeman Meekhof wrote:
>> Some of this thread seems to contradict the documentation and confuses
>> me. Is the statement below correct?
>>
>> "The BlueStore journal will always be placed on the fastest device
>> available, so using a DB device will provide the same benefit that the
>> WAL device would while also allowing additional metadata to be stored
>> there (if it will fix)."
>>
>> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices
>>
>> it seems to be saying that there's no reason to create separate WAL
>> and DB partitions if they are on the same device. Specifying one
>> large DB partition per OSD will cover both uses.
>>
>> thanks,
>> Ben
>>
>> On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
>> <***@i-med.ac.at> wrote:
>>> On 09/21/2017 05:03 PM, Mark Nelson wrote:
>>>>
>>>> On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
>>>>> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
>>>>>> On 2017-09-21 07:56, Lazuardi Nasution wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm still looking for the answer of these questions. Maybe someone can
>>>>>>> share their thought on these. Any comment will be helpful too.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
>>>>>>> <***@gmail.com <mailto:***@gmail.com>> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> 1. Is it possible configure use osd_data not as small partition on
>>>>>>> OSD but a folder (ex. on root disk)? If yes, how to do that with
>>>>>>> ceph-disk and any pros/cons of doing that?
>>>>>>> 2. Is WAL & DB size calculated based on OSD size or expected
>>>>>>> throughput like on journal device of filestore? If no, what is the
>>>>>>> default value and pro/cons of adjusting that?
>>>>>>> 3. Is partition alignment matter on Bluestore, including WAL & DB
>>>>>>> if using separate device for them?
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> ceph-users mailing list
>>>>>>> ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>
>>>>>>
>>>>>> I am also looking for recommendations on wal/db partition sizes. Some
>>>>>> hints:
>>>>>>
>>>>>> ceph-disk defaults used in case it does not find
>>>>>> bluestore_block_wal_size or bluestore_block_db_size in config file:
>>>>>>
>>>>>> wal = 512MB
>>>>>>
>>>>>> db = if bluestore_block_size (data size) is in config file it uses 1/100
>>>>>> of it else it uses 1G.
>>>>>>
>>>>>> There is also a presentation by Sage back in March, see page 16:
>>>>>>
>>>>>> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
>>>>>>
>>>>>>
>>>>>> wal: 512 MB
>>>>>>
>>>>>> db: "a few" GB
>>>>>>
>>>>>> the wal size is probably not debatable, it will be like a journal for
>>>>>> small block sizes which are constrained by iops hence 512 MB is more
>>>>>> than enough. Probably we will see more on the db size in the future.
>>>>> This is what I understood so far.
>>>>> I wonder if it makes sense to set the db size as big as possible and
>>>>> divide entire db device is by the number of OSDs it will serve.
>>>>>
>>>>> E.g. 10 OSDs / 1 NVME (800GB)
>>>>>
>>>>> (800GB - 10x1GB wal ) / 10 = ~79Gb db size per OSD
>>>>>
>>>>> Is this smart/stupid?
>>>> Personally I'd use 512MB-2GB for the WAL (larger buffers reduce write
>>>> amp but mean larger memtables and potentially higher overhead scanning
>>>> through memtables). 4x256MB buffers works pretty well, but it means
>>>> memory overhead too. Beyond that, I'd devote the entire rest of the
>>>> device to DB partitions.
>>>>
>>> thanks for your suggestion Mark!
>>>
>>> So, just to make sure I understood this right:
>>>
>>> You'd use a separeate 512MB-2GB WAL partition for each OSD and the
>>> entire rest for DB partitions.
>>>
>>> In the example case with 10xHDD OSD and 1 NVME it would then be 10 WAL
>>> partitions with each 512MB-2GB and 10 equal sized DB partitions
>>> consuming the rest of the NVME.
>>>
>>>
>>> Thanks
>>> Dietmar
>>> --
>>> _________________________________________
>>> D i e t m a r R i e d e r, Mag.Dr.
>>> Innsbruck Medical University
>>> Biocenter - Division for Bioinformatics
>>>
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-***@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-***@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Mark Nelson

2017-09-25 12:59:06 UTC

Permalink

On 09/25/2017 03:31 AM, TYLin wrote:
> Hi,
>
> To my understand, the bluestore write workflow is
>
> For normal big write
> 1. Write data to block
> 2. Update metadata to rocksdb
> 3. Rocksdb write to memory and block.wal
> 4. Once reach threshold, flush entries in block.wal to block.db
>
> For overwrite and small write
> 1. Write data and metadata to rocksdb
> 2. Apply the data to block
>
> Seems we don’t have a formula or suggestion to the size of block.db. It depends on the object size and number of objects in your pool. You can just give big partition to block.db to ensure all the database files are on that fast partition. If block.db full, it will use block to put db files, however, this will slow down the db performance. So give db size as much as you can.

This is basically correct. What's more, it's not just the object size,
but the number of extents, checksums, RGW bucket indices, and
potentially other random stuff. I'm skeptical how well we can estimate
all of this in the long run. I wonder if we would be better served by
just focusing on making it easy to understand how the DB device is being
used, how much is spilling over to the block device, and make it easy to
upgrade to a new device once it gets full.

>
> If you want to put wal and db on same ssd, you don’t need to create block.wal. It will implicitly use block.db to put wal. The only case you need block.wal is that you want to separate wal to another disk.

I always make explicit partitions, but only because I (potentially
illogically) like it that way. There may actually be some benefits to
using a single partition for both if sharing a single device.

>
> I’m also studying bluestore, this is what I know so far. Any correction is welcomed.
>
> Thanks
>
>
>> On Sep 22, 2017, at 5:27 PM, Richard Hesketh <***@rd.bbc.co.uk> wrote:
>>
>> I asked the same question a couple of weeks ago. No response I got contradicted the documentation but nobody actively confirmed the documentation was correct on this subject, either; my end state was that I was relatively confident I wasn't making some horrible mistake by simply specifying a big DB partition and letting bluestore work itself out (in my case, I've just got HDDs and SSDs that were journals under filestore), but I could not be sure there wasn't some sort of performance tuning I was missing out on by not specifying them separately.
>>
>> Rich
>>
>> On 21/09/17 20:37, Benjeman Meekhof wrote:
>>> Some of this thread seems to contradict the documentation and confuses
>>> me. Is the statement below correct?
>>>
>>> "The BlueStore journal will always be placed on the fastest device
>>> available, so using a DB device will provide the same benefit that the
>>> WAL device would while also allowing additional metadata to be stored
>>> there (if it will fix)."
>>>
>>> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices
>>>
>>> it seems to be saying that there's no reason to create separate WAL
>>> and DB partitions if they are on the same device. Specifying one
>>> large DB partition per OSD will cover both uses.
>>>
>>> thanks,
>>> Ben
>>>
>>> On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
>>> <***@i-med.ac.at> wrote:
>>>> On 09/21/2017 05:03 PM, Mark Nelson wrote:
>>>>>
>>>>> On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
>>>>>> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
>>>>>>> On 2017-09-21 07:56, Lazuardi Nasution wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm still looking for the answer of these questions. Maybe someone can
>>>>>>>> share their thought on these. Any comment will be helpful too.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
>>>>>>>> <***@gmail.com <mailto:***@gmail.com>> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> 1. Is it possible configure use osd_data not as small partition on
>>>>>>>> OSD but a folder (ex. on root disk)? If yes, how to do that with
>>>>>>>> ceph-disk and any pros/cons of doing that?
>>>>>>>> 2. Is WAL & DB size calculated based on OSD size or expected
>>>>>>>> throughput like on journal device of filestore? If no, what is the
>>>>>>>> default value and pro/cons of adjusting that?
>>>>>>>> 3. Is partition alignment matter on Bluestore, including WAL & DB
>>>>>>>> if using separate device for them?
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> ceph-users mailing list
>>>>>>>> ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>
>>>>>>>
>>>>>>> I am also looking for recommendations on wal/db partition sizes. Some
>>>>>>> hints:
>>>>>>>
>>>>>>> ceph-disk defaults used in case it does not find
>>>>>>> bluestore_block_wal_size or bluestore_block_db_size in config file:
>>>>>>>
>>>>>>> wal = 512MB
>>>>>>>
>>>>>>> db = if bluestore_block_size (data size) is in config file it uses 1/100
>>>>>>> of it else it uses 1G.
>>>>>>>
>>>>>>> There is also a presentation by Sage back in March, see page 16:
>>>>>>>
>>>>>>> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
>>>>>>>
>>>>>>>
>>>>>>> wal: 512 MB
>>>>>>>
>>>>>>> db: "a few" GB
>>>>>>>
>>>>>>> the wal size is probably not debatable, it will be like a journal for
>>>>>>> small block sizes which are constrained by iops hence 512 MB is more
>>>>>>> than enough. Probably we will see more on the db size in the future.
>>>>>> This is what I understood so far.
>>>>>> I wonder if it makes sense to set the db size as big as possible and
>>>>>> divide entire db device is by the number of OSDs it will serve.
>>>>>>
>>>>>> E.g. 10 OSDs / 1 NVME (800GB)
>>>>>>
>>>>>> (800GB - 10x1GB wal ) / 10 = ~79Gb db size per OSD
>>>>>>
>>>>>> Is this smart/stupid?
>>>>> Personally I'd use 512MB-2GB for the WAL (larger buffers reduce write
>>>>> amp but mean larger memtables and potentially higher overhead scanning
>>>>> through memtables). 4x256MB buffers works pretty well, but it means
>>>>> memory overhead too. Beyond that, I'd devote the entire rest of the
>>>>> device to DB partitions.
>>>>>
>>>> thanks for your suggestion Mark!
>>>>
>>>> So, just to make sure I understood this right:
>>>>
>>>> You'd use a separeate 512MB-2GB WAL partition for each OSD and the
>>>> entire rest for DB partitions.
>>>>
>>>> In the example case with 10xHDD OSD and 1 NVME it would then be 10 WAL
>>>> partitions with each 512MB-2GB and 10 equal sized DB partitions
>>>> consuming the rest of the NVME.
>>>>
>>>>
>>>> Thanks
>>>> Dietmar
>>>> --
>>>> _________________________________________
>>>> D i e t m a r R i e d e r, Mag.Dr.
>>>> Innsbruck Medical University
>>>> Biocenter - Division for Bioinformatics
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-***@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-***@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-***@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

Dietmar Rieder

2017-09-25 14:44:28 UTC

Permalink

On 09/25/2017 02:59 PM, Mark Nelson wrote:
> On 09/25/2017 03:31 AM, TYLin wrote:
>> Hi,
>>
>> To my understand, the bluestore write workflow is
>>
>> For normal big write
>> 1. Write data to block
>> 2. Update metadata to rocksdb
>> 3. Rocksdb write to memory and block.wal
>> 4. Once reach threshold, flush entries in block.wal to block.db
>>
>> For overwrite and small write
>> 1. Write data and metadata to rocksdb
>> 2. Apply the data to block
>>
>> Seems we donât have a formula or suggestion to the size of block.db.
>> It depends on the object size and number of objects in your pool. You
>> can just give big partition to block.db to ensure all the database
>> files are on that fast partition. If block.db full, it will use block
>> to put db files, however, this will slow down the db performance. So
>> give db size as much as you can.
>
> This is basically correct.Â What's more, it's not just the object size,
> but the number of extents, checksums, RGW bucket indices, and
> potentially other random stuff.Â I'm skeptical how well we can estimate
> all of this in the long run.Â I wonder if we would be better served by
> just focusing on making it easy to understand how the DB device is being
> used, how much is spilling over to the block device, and make it easy to
> upgrade to a new device once it gets full.
>
>>
>> If you want to put wal and db on same ssd, you donât need to create
>> block.wal. It will implicitly use block.db to put wal. The only case
>> you need block.wal is that you want to separate wal to another disk.
>
> I always make explicit partitions, but only because I (potentially
> illogically) like it that way.Â There may actually be some benefits to
> using a single partition for both if sharing a single device.

is this "Single db/wal partition" then to be used for all OSDs on a node
or do you need to create a seperate "Single db/wal partition" for each
OSD on the node?

>
>>
>> Iâm also studying bluestore, this is what I know so far. Any
>> correction is welcomed.
>>
>> Thanks
>>
>>
>>> On Sep 22, 2017, at 5:27 PM, Richard Hesketh
>>> <***@rd.bbc.co.uk> wrote:
>>>
>>> I asked the same question a couple of weeks ago. No response I got
>>> contradicted the documentation but nobody actively confirmed the
>>> documentation was correct on this subject, either; my end state was
>>> that I was relatively confident I wasn't making some horrible mistake
>>> by simply specifying a big DB partition and letting bluestore work
>>> itself out (in my case, I've just got HDDs and SSDs that were
>>> journals under filestore), but I could not be sure there wasn't some
>>> sort of performance tuning I was missing out on by not specifying
>>> them separately.
>>>
>>> Rich
>>>
>>> On 21/09/17 20:37, Benjeman Meekhof wrote:
>>>> Some of this thread seems to contradict the documentation and confuses
>>>> me.Â Is the statement below correct?
>>>>
>>>> "The BlueStore journal will always be placed on the fastest device
>>>> available, so using a DB device will provide the same benefit that the
>>>> WAL device would while also allowing additional metadata to be stored
>>>> there (if it will fix)."
>>>>
>>>> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices
>>>>
>>>>
>>>> Â it seems to be saying that there's no reason to create separate WAL
>>>> and DB partitions if they are on the same device.Â Specifying one
>>>> large DB partition per OSD will cover both uses.
>>>>
>>>> thanks,
>>>> Ben
>>>>
>>>> On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
>>>> <***@i-med.ac.at> wrote:
>>>>> On 09/21/2017 05:03 PM, Mark Nelson wrote:
>>>>>>
>>>>>> On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
>>>>>>> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
>>>>>>>> On 2017-09-21 07:56, Lazuardi Nasution wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm still looking for the answer of these questions. Maybe
>>>>>>>>> someone can
>>>>>>>>> share their thought on these. Any comment will be helpful too.
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>>
>>>>>>>>> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
>>>>>>>>> <***@gmail.com <mailto:***@gmail.com>> wrote:
>>>>>>>>>
>>>>>>>>> Â Â Â Hi,
>>>>>>>>>
>>>>>>>>> Â Â Â 1. Is it possible configure use osd_data not as small
>>>>>>>>> partition on
>>>>>>>>> Â Â Â OSD but a folder (ex. on root disk)? If yes, how to do that
>>>>>>>>> with
>>>>>>>>> Â Â Â ceph-disk and any pros/cons of doing that?
>>>>>>>>> Â Â Â 2. Is WAL & DB size calculated based on OSD size or expected
>>>>>>>>> Â Â Â throughput like on journal device of filestore? If no, what
>>>>>>>>> is the
>>>>>>>>> Â Â Â default value and pro/cons of adjusting that?
>>>>>>>>> Â Â Â 3. Is partition alignment matter on Bluestore, including
>>>>>>>>> WAL & DB
>>>>>>>>> Â Â Â if using separate device for them?
>>>>>>>>>
>>>>>>>>> Â Â Â Best regards,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> ceph-users mailing list
>>>>>>>>> ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>
>>>>>>>>
>>>>>>>> I am also looking for recommendations on wal/db partition sizes.
>>>>>>>> Some
>>>>>>>> hints:
>>>>>>>>
>>>>>>>> ceph-disk defaults used in case it does not find
>>>>>>>> bluestore_block_wal_size or bluestore_block_db_size in config file:
>>>>>>>>
>>>>>>>> wal =Â 512MB
>>>>>>>>
>>>>>>>> db = if bluestore_block_size (data size) is in config file it
>>>>>>>> uses 1/100
>>>>>>>> of it else it uses 1G.
>>>>>>>>
>>>>>>>> There is also a presentation by Sage back in March, see page 16:
>>>>>>>>
>>>>>>>> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> wal: 512 MB
>>>>>>>>
>>>>>>>> db: "a few" GB
>>>>>>>>
>>>>>>>> the wal size is probably not debatable, it will be like a
>>>>>>>> journal for
>>>>>>>> small block sizes which are constrained by iops hence 512 MB is
>>>>>>>> more
>>>>>>>> than enough. Probably we will see more on the db size in the
>>>>>>>> future.
>>>>>>> This is what I understood so far.
>>>>>>> I wonder if it makes sense to set the db size as big as possible and
>>>>>>> divide entire db device isÂ by the number of OSDs it will serve.
>>>>>>>
>>>>>>> E.g. 10 OSDs / 1 NVME (800GB)
>>>>>>>
>>>>>>> Â (800GB - 10x1GB wal ) / 10 = ~79Gb db size per OSD
>>>>>>>
>>>>>>> Is this smart/stupid?
>>>>>> Personally I'd use 512MB-2GB for the WAL (larger buffers reduce write
>>>>>> amp but mean larger memtables and potentially higher overhead
>>>>>> scanning
>>>>>> through memtables).Â 4x256MB buffers works pretty well, but it means
>>>>>> memory overhead too.Â Beyond that, I'd devote the entire rest of the
>>>>>> device to DB partitions.
>>>>>>
>>>>> thanks for your suggestion Mark!
>>>>>
>>>>> So, just to make sure I understood this right:
>>>>>
>>>>> You'dÂ use a separeate 512MB-2GB WAL partition for each OSD and the
>>>>> entire rest for DB partitions.
>>>>>
>>>>> In the example case with 10xHDD OSD and 1 NVME it would then be 10 WAL
>>>>> partitions with each 512MB-2GB and 10 equal sized DB partitions
>>>>> consuming the rest of the NVME.
>>>>>
>>>>>
>>>>> Thanks
>>>>> Â Dietmar
>>>>> --
>>>>> _________________________________________
>>>>> D i e t m a rÂ R i e d e r, Mag.Dr.
>>>>> Innsbruck Medical University
>>>>> Biocenter - Division for Bioinformatics
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-***@lists.ceph.com
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-***@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-***@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-***@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
_________________________________________
D i e t m a r R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics

David Turner

2017-09-25 15:10:26 UTC

Permalink

db/wal partitions are per OSD. DB partitions need to be made as big as you
need them. If they run out of space, they will fall back to the block
device. If the DB and block are on the same device, then there's no reason
to partition them and figure out the best size. If they are on separate
devices, then you need to make it as big as you need to to ensure that it
won't spill over (or if it does that you're ok with the degraded
performance while the db partition is full). I haven't come across an
equation to judge what size should be used for either partition yet.

On Mon, Sep 25, 2017 at 10:53 AM Dietmar Rieder <***@i-med.ac.at>
wrote:

> On 09/25/2017 02:59 PM, Mark Nelson wrote:
> > On 09/25/2017 03:31 AM, TYLin wrote:
> >> Hi,
> >>
> >> To my understand, the bluestore write workflow is
> >>
> >> For normal big write
> >> 1. Write data to block
> >> 2. Update metadata to rocksdb
> >> 3. Rocksdb write to memory and block.wal
> >> 4. Once reach threshold, flush entries in block.wal to block.db
> >>
> >> For overwrite and small write
> >> 1. Write data and metadata to rocksdb
> >> 2. Apply the data to block
> >>
> >> Seems we donât have a formula or suggestion to the size of block.db.
> >> It depends on the object size and number of objects in your pool. You
> >> can just give big partition to block.db to ensure all the database
> >> files are on that fast partition. If block.db full, it will use block
> >> to put db files, however, this will slow down the db performance. So
> >> give db size as much as you can.
> >
> > This is basically correct. What's more, it's not just the object size,
> > but the number of extents, checksums, RGW bucket indices, and
> > potentially other random stuff. I'm skeptical how well we can estimate
> > all of this in the long run. I wonder if we would be better served by
> > just focusing on making it easy to understand how the DB device is being
> > used, how much is spilling over to the block device, and make it easy to
> > upgrade to a new device once it gets full.
> >
> >>
> >> If you want to put wal and db on same ssd, you donât need to create
> >> block.wal. It will implicitly use block.db to put wal. The only case
> >> you need block.wal is that you want to separate wal to another disk.
> >
> > I always make explicit partitions, but only because I (potentially
> > illogically) like it that way. There may actually be some benefits to
> > using a single partition for both if sharing a single device.
>
> is this "Single db/wal partition" then to be used for all OSDs on a node
> or do you need to create a seperate "Single db/wal partition" for each
> OSD on the node?
>
> >
> >>
> >> Iâm also studying bluestore, this is what I know so far. Any
> >> correction is welcomed.
> >>
> >> Thanks
> >>
> >>
> >>> On Sep 22, 2017, at 5:27 PM, Richard Hesketh
> >>> <***@rd.bbc.co.uk> wrote:
> >>>
> >>> I asked the same question a couple of weeks ago. No response I got
> >>> contradicted the documentation but nobody actively confirmed the
> >>> documentation was correct on this subject, either; my end state was
> >>> that I was relatively confident I wasn't making some horrible mistake
> >>> by simply specifying a big DB partition and letting bluestore work
> >>> itself out (in my case, I've just got HDDs and SSDs that were
> >>> journals under filestore), but I could not be sure there wasn't some
> >>> sort of performance tuning I was missing out on by not specifying
> >>> them separately.
> >>>
> >>> Rich
> >>>
> >>> On 21/09/17 20:37, Benjeman Meekhof wrote:
> >>>> Some of this thread seems to contradict the documentation and confuses
> >>>> me. Is the statement below correct?
> >>>>
> >>>> "The BlueStore journal will always be placed on the fastest device
> >>>> available, so using a DB device will provide the same benefit that the
> >>>> WAL device would while also allowing additional metadata to be stored
> >>>> there (if it will fix)."
> >>>>
> >>>>
> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices
> >>>>
> >>>>
> >>>> it seems to be saying that there's no reason to create separate WAL
> >>>> and DB partitions if they are on the same device. Specifying one
> >>>> large DB partition per OSD will cover both uses.
> >>>>
> >>>> thanks,
> >>>> Ben
> >>>>
> >>>> On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
> >>>> <***@i-med.ac.at> wrote:
> >>>>> On 09/21/2017 05:03 PM, Mark Nelson wrote:
> >>>>>>
> >>>>>> On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
> >>>>>>> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
> >>>>>>>> On 2017-09-21 07:56, Lazuardi Nasution wrote:
> >>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> I'm still looking for the answer of these questions. Maybe
> >>>>>>>>> someone can
> >>>>>>>>> share their thought on these. Any comment will be helpful too.
> >>>>>>>>>
> >>>>>>>>> Best regards,
> >>>>>>>>>
> >>>>>>>>> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
> >>>>>>>>> <***@gmail.com <mailto:***@gmail.com>> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> 1. Is it possible configure use osd_data not as small
> >>>>>>>>> partition on
> >>>>>>>>> OSD but a folder (ex. on root disk)? If yes, how to do that
> >>>>>>>>> with
> >>>>>>>>> ceph-disk and any pros/cons of doing that?
> >>>>>>>>> 2. Is WAL & DB size calculated based on OSD size or expected
> >>>>>>>>> throughput like on journal device of filestore? If no, what
> >>>>>>>>> is the
> >>>>>>>>> default value and pro/cons of adjusting that?
> >>>>>>>>> 3. Is partition alignment matter on Bluestore, including
> >>>>>>>>> WAL & DB
> >>>>>>>>> if using separate device for them?
> >>>>>>>>>
> >>>>>>>>> Best regards,
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> ceph-users mailing list
> >>>>>>>>> ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
> >>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> I am also looking for recommendations on wal/db partition sizes.
> >>>>>>>> Some
> >>>>>>>> hints:
> >>>>>>>>
> >>>>>>>> ceph-disk defaults used in case it does not find
> >>>>>>>> bluestore_block_wal_size or bluestore_block_db_size in config
> file:
> >>>>>>>>
> >>>>>>>> wal = 512MB
> >>>>>>>>
> >>>>>>>> db = if bluestore_block_size (data size) is in config file it
> >>>>>>>> uses 1/100
> >>>>>>>> of it else it uses 1G.
> >>>>>>>>
> >>>>>>>> There is also a presentation by Sage back in March, see page 16:
> >>>>>>>>
> >>>>>>>>
> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> wal: 512 MB
> >>>>>>>>
> >>>>>>>> db: "a few" GB
> >>>>>>>>
> >>>>>>>> the wal size is probably not debatable, it will be like a
> >>>>>>>> journal for
> >>>>>>>> small block sizes which are constrained by iops hence 512 MB is
> >>>>>>>> more
> >>>>>>>> than enough. Probably we will see more on the db size in the
> >>>>>>>> future.
> >>>>>>> This is what I understood so far.
> >>>>>>> I wonder if it makes sense to set the db size as big as possible
> and
> >>>>>>> divide entire db device is by the number of OSDs it will serve.
> >>>>>>>
> >>>>>>> E.g. 10 OSDs / 1 NVME (800GB)
> >>>>>>>
> >>>>>>> (800GB - 10x1GB wal ) / 10 = ~79Gb db size per OSD
> >>>>>>>
> >>>>>>> Is this smart/stupid?
> >>>>>> Personally I'd use 512MB-2GB for the WAL (larger buffers reduce
> write
> >>>>>> amp but mean larger memtables and potentially higher overhead
> >>>>>> scanning
> >>>>>> through memtables). 4x256MB buffers works pretty well, but it means
> >>>>>> memory overhead too. Beyond that, I'd devote the entire rest of the
> >>>>>> device to DB partitions.
> >>>>>>
> >>>>> thanks for your suggestion Mark!
> >>>>>
> >>>>> So, just to make sure I understood this right:
> >>>>>
> >>>>> You'd use a separeate 512MB-2GB WAL partition for each OSD and the
> >>>>> entire rest for DB partitions.
> >>>>>
> >>>>> In the example case with 10xHDD OSD and 1 NVME it would then be 10
> WAL
> >>>>> partitions with each 512MB-2GB and 10 equal sized DB partitions
> >>>>> consuming the rest of the NVME.
> >>>>>
> >>>>>
> >>>>> Thanks
> >>>>> Dietmar
> >>>>> --
> >>>>> _________________________________________
> >>>>> D i e t m a r R i e d e r, Mag.Dr.
> >>>>> Innsbruck Medical University
> >>>>> Biocenter - Division for Bioinformatics
> >>>>>
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> ceph-users mailing list
> >>>>> ceph-***@lists.ceph.com
> >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>>
> >>>> _______________________________________________
> >>>> ceph-users mailing list
> >>>> ceph-***@lists.ceph.com
> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >>> _______________________________________________
> >>> ceph-users mailing list
> >>> ceph-***@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-***@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> > _______________________________________________
> > ceph-users mailing list
> > ceph-***@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> --
> _________________________________________
> D i e t m a r R i e d e r, Mag.Dr.
> Innsbruck Medical University
> Biocenter - Division for Bioinformatics
>
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

Nigel Williams

2017-09-25 22:02:22 UTC

Permalink

On 26 September 2017 at 01:10, David Turner <***@gmail.com> wrote:
> If they are on separate
> devices, then you need to make it as big as you need to to ensure that it
> won't spill over (or if it does that you're ok with the degraded performance
> while the db partition is full). I haven't come across an equation to judge
> what size should be used for either partition yet.

Is it the case that only the WAL will spill if there is a backlog
clearing entries into the DB partition? so the WAL's fill-mark
oscillates but the DB is going to steadily grow (depending on the
previously mentioned factors of "...extents, checksums, RGW bucket
indices, and potentially other random stuff".

Is there an indicator that can be monitored to show that a spill is occurring?

Mark Nelson

2017-09-25 22:11:26 UTC

Permalink

On 09/25/2017 05:02 PM, Nigel Williams wrote:
> On 26 September 2017 at 01:10, David Turner <***@gmail.com> wrote:
>> If they are on separate
>> devices, then you need to make it as big as you need to to ensure that it
>> won't spill over (or if it does that you're ok with the degraded performance
>> while the db partition is full). I haven't come across an equation to judge
>> what size should be used for either partition yet.
>
> Is it the case that only the WAL will spill if there is a backlog
> clearing entries into the DB partition? so the WAL's fill-mark
> oscillates but the DB is going to steadily grow (depending on the
> previously mentioned factors of "...extents, checksums, RGW bucket
> indices, and potentially other random stuff".

The WAL should never grow larger than the size of the buffers you've
specified. It's the DB that can grow and is difficult to estimate both
because different workloads will cause different numbers of extents and
objects, but also because rocksdb itself causes a certain amount of
space-amplification due to a variety of factors.

>
> Is there an indicator that can be monitored to show that a spill is occurring?

I think there's a message in the logs, but beyond that I don't remember
if we added any kind of indication in the user tools. At one point I
think I remember Sage mentioning he wanted to add something to ceph df.

> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

Nigel Williams

2017-09-25 22:17:20 UTC

Permalink

On 26 September 2017 at 08:11, Mark Nelson <***@redhat.com> wrote:
> The WAL should never grow larger than the size of the buffers you've
> specified. It's the DB that can grow and is difficult to estimate both
> because different workloads will cause different numbers of extents and
> objects, but also because rocksdb itself causes a certain amount of
> space-amplification due to a variety of factors.

Ok, I was confused whether both types could spill. within Bluestore it
simply blocks if the WAL hits 100%?

Would a drastic (quick) action to correct a too-small-DB-partition
(impacting performance) is to destroy the OSD and rebuild it with a
larger DB partition?

Sage Weil

2017-09-25 22:26:41 UTC

Permalink

On Tue, 26 Sep 2017, Nigel Williams wrote:
> On 26 September 2017 at 08:11, Mark Nelson <***@redhat.com> wrote:
> > The WAL should never grow larger than the size of the buffers you've
> > specified. It's the DB that can grow and is difficult to estimate both
> > because different workloads will cause different numbers of extents and
> > objects, but also because rocksdb itself causes a certain amount of
> > space-amplification due to a variety of factors.
>
> Ok, I was confused whether both types could spill. within Bluestore it
> simply blocks if the WAL hits 100%?

It never blocks; it will always just spill over onto the next fastest
device (wal -> db -> main). Note that there is no value to a db partition
if it is on the same device as the main partition.

> Would a drastic (quick) action to correct a too-small-DB-partition
> (impacting performance) is to destroy the OSD and rebuild it with a
> larger DB partition?

That's the easiest!
sage

Dietmar Rieder

2017-09-26 06:10:27 UTC

Permalink

thanks David,

that's confirming what I was assuming. To bad that there is no
estimate/method to calculate the db partition size.

Dietmar

On 09/25/2017 05:10 PM, David Turner wrote:
> db/wal partitions are per OSD.Â DB partitions need to be made as big as
> you need them.Â If they run out of space, they will fall back to the
> block device.Â If the DB and block are on the same device, then there's
> no reason to partition them and figure out the best size.Â If they are
> on separate devices, then you need to make it as big as you need to to
> ensure that it won't spill over (or if it does that you're ok with the
> degraded performance while the db partition is full).Â I haven't come
> across an equation to judge what size should be used for either
> partition yet.
>
> On Mon, Sep 25, 2017 at 10:53 AM Dietmar Rieder
> <***@i-med.ac.at <mailto:***@i-med.ac.at>> wrote:
>
> On 09/25/2017 02:59 PM, Mark Nelson wrote:
> > On 09/25/2017 03:31 AM, TYLin wrote:
> >> Hi,
> >>
> >> To my understand, the bluestore write workflow is
> >>
> >> For normal big write
> >> 1. Write data to block
> >> 2. Update metadata to rocksdb
> >> 3. Rocksdb write to memory and block.wal
> >> 4. Once reach threshold, flush entries in block.wal to block.db
> >>
> >> For overwrite and small write
> >> 1. Write data and metadata to rocksdb
> >> 2. Apply the data to block
> >>
> >> Seems we donât have a formula or suggestion to the size of block.db.
> >> It depends on the object size and number of objects in your pool. You
> >> can just give big partition to block.db to ensure all the database
> >> files are on that fast partition. If block.db full, it will use block
> >> to put db files, however, this will slow down the db performance. So
> >> give db size as much as you can.
> >
> > This is basically correct.Â What's more, it's not just the object
> size,
> > but the number of extents, checksums, RGW bucket indices, and
> > potentially other random stuff.Â I'm skeptical how well we can
> estimate
> > all of this in the long run.Â I wonder if we would be better served by
> > just focusing on making it easy to understand how the DB device is
> being
> > used, how much is spilling over to the block device, and make it
> easy to
> > upgrade to a new device once it gets full.
> >
> >>
> >> If you want to put wal and db on same ssd, you donât need to create
> >> block.wal. It will implicitly use block.db to put wal. The only case
> >> you need block.wal is that you want to separate wal to another disk.
> >
> > I always make explicit partitions, but only because I (potentially
> > illogically) like it that way.Â There may actually be some benefits to
> > using a single partition for both if sharing a single device.
>
> is this "Single db/wal partition" then to be used for all OSDs on a node
> or do you need to create a seperate "SingleÂ db/wal partition" for each
> OSDÂ on the node?
>
> >
> >>
> >> Iâm also studying bluestore, this is what I know so far. Any
> >> correction is welcomed.
> >>
> >> Thanks
> >>
> >>
> >>> On Sep 22, 2017, at 5:27 PM, Richard Hesketh
> >>> <***@rd.bbc.co.uk
> <mailto:***@rd.bbc.co.uk>> wrote:
> >>>
> >>> I asked the same question a couple of weeks ago. No response I got
> >>> contradicted the documentation but nobody actively confirmed the
> >>> documentation was correct on this subject, either; my end state was
> >>> that I was relatively confident I wasn't making some horrible
> mistake
> >>> by simply specifying a big DB partition and letting bluestore work
> >>> itself out (in my case, I've just got HDDs and SSDs that were
> >>> journals under filestore), but I could not be sure there wasn't some
> >>> sort of performance tuning I was missing out on by not specifying
> >>> them separately.
> >>>
> >>> Rich
> >>>
> >>> On 21/09/17 20:37, Benjeman Meekhof wrote:
> >>>> Some of this thread seems to contradict the documentation and
> confuses
> >>>> me.Â Is the statement below correct?
> >>>>
> >>>> "The BlueStore journal will always be placed on the fastest device
> >>>> available, so using a DB device will provide the same benefit
> that the
> >>>> WAL device would while also allowing additional metadata to be
> stored
> >>>> there (if it will fix)."
> >>>>
> >>>>
> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices
> >>>>
> >>>>
> >>>> Â it seems to be saying that there's no reason to create
> separate WAL
> >>>> and DB partitions if they are on the same device.Â Specifying one
> >>>> large DB partition per OSD will cover both uses.
> >>>>
> >>>> thanks,
> >>>> Ben
> >>>>
> >>>> On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
> >>>> <***@i-med.ac.at
> <mailto:***@i-med.ac.at>> wrote:
> >>>>> On 09/21/2017 05:03 PM, Mark Nelson wrote:
> >>>>>>
> >>>>>> On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
> >>>>>>> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
> >>>>>>>> On 2017-09-21 07:56, Lazuardi Nasution wrote:
> >>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> I'm still looking for the answer of these questions. Maybe
> >>>>>>>>> someone can
> >>>>>>>>> share their thought on these. Any comment will be helpful too.
> >>>>>>>>>
> >>>>>>>>> Best regards,
> >>>>>>>>>
> >>>>>>>>> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
> >>>>>>>>> <***@gmail.com <mailto:***@gmail.com>
> <mailto:***@gmail.com <mailto:***@gmail.com>>> wrote:
> >>>>>>>>>
> >>>>>>>>> Â Â Â Hi,
> >>>>>>>>>
> >>>>>>>>> Â Â Â 1. Is it possible configure use osd_data not as small
> >>>>>>>>> partition on
> >>>>>>>>> Â Â Â OSD but a folder (ex. on root disk)? If yes, how to do
> that
> >>>>>>>>> with
> >>>>>>>>> Â Â Â ceph-disk and any pros/cons of doing that?
> >>>>>>>>> Â Â Â 2. Is WAL & DB size calculated based on OSD size or
> expected
> >>>>>>>>> Â Â Â throughput like on journal device of filestore? If no,
> what
> >>>>>>>>> is the
> >>>>>>>>> Â Â Â default value and pro/cons of adjusting that?
> >>>>>>>>> Â Â Â 3. Is partition alignment matter on Bluestore, including
> >>>>>>>>> WAL & DB
> >>>>>>>>> Â Â Â if using separate device for them?
> >>>>>>>>>
> >>>>>>>>> Â Â Â Best regards,
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> ceph-users mailing list
> >>>>>>>>> ceph-***@lists.ceph.com
> <mailto:ceph-***@lists.ceph.com> <mailto:ceph-***@lists.ceph.com
> <mailto:ceph-***@lists.ceph.com>>
> >>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> I am also looking for recommendations on wal/db partition
> sizes.
> >>>>>>>> Some
> >>>>>>>> hints:
> >>>>>>>>
> >>>>>>>> ceph-disk defaults used in case it does not find
> >>>>>>>> bluestore_block_wal_size or bluestore_block_db_size in
> config file:
> >>>>>>>>
> >>>>>>>> wal =Â 512MB
> >>>>>>>>
> >>>>>>>> db = if bluestore_block_size (data size) is in config file it
> >>>>>>>> uses 1/100
> >>>>>>>> of it else it uses 1G.
> >>>>>>>>
> >>>>>>>> There is also a presentation by Sage back in March, see
> page 16:
> >>>>>>>>
> >>>>>>>>
> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> wal: 512 MB
> >>>>>>>>
> >>>>>>>> db: "a few" GB
> >>>>>>>>
> >>>>>>>> the wal size is probably not debatable, it will be like a
> >>>>>>>> journal for
> >>>>>>>> small block sizes which are constrained by iops hence 512 MB is
> >>>>>>>> more
> >>>>>>>> than enough. Probably we will see more on the db size in the
> >>>>>>>> future.
> >>>>>>> This is what I understood so far.
> >>>>>>> I wonder if it makes sense to set the db size as big as
> possible and
> >>>>>>> divide entire db device isÂ by the number of OSDs it will serve.
> >>>>>>>
> >>>>>>> E.g. 10 OSDs / 1 NVME (800GB)
> >>>>>>>
> >>>>>>> Â (800GB - 10x1GB wal ) / 10 = ~79Gb db size per OSD
> >>>>>>>
> >>>>>>> Is this smart/stupid?
> >>>>>> Personally I'd use 512MB-2GB for the WAL (larger buffers
> reduce write
> >>>>>> amp but mean larger memtables and potentially higher overhead
> >>>>>> scanning
> >>>>>> through memtables).Â 4x256MB buffers works pretty well, but
> it means
> >>>>>> memory overhead too.Â Beyond that, I'd devote the entire rest
> of the
> >>>>>> device to DB partitions.
> >>>>>>
> >>>>> thanks for your suggestion Mark!
> >>>>>
> >>>>> So, just to make sure I understood this right:
> >>>>>
> >>>>> You'dÂ use a separeate 512MB-2GB WAL partition for each OSD
> and the
> >>>>> entire rest for DB partitions.
> >>>>>
> >>>>> In the example case with 10xHDD OSD and 1 NVME it would then
> be 10 WAL
> >>>>> partitions with each 512MB-2GB and 10 equal sized DB partitions
> >>>>> consuming the rest of the NVME.
> >>>>>
> >>>>>
> >>>>> Thanks
> >>>>> Â Dietmar
> >>>>> --
> >>>>> _________________________________________
> >>>>> D i e t m a rÂ R i e d e r, Mag.Dr.
> >>>>> Innsbruck Medical University
> >>>>> Biocenter - Division for Bioinformatics
> >>>>>
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> ceph-users mailing list
> >>>>> ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
> >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>>
> >>>> _______________________________________________
> >>>> ceph-users mailing list
> >>>> ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >>> _______________________________________________
> >>> ceph-users mailing list
> >>> ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> > _______________________________________________
> > ceph-users mailing list
> > ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> --
> _________________________________________
> D i e t m a rÂ R i e d e r, Mag.Dr.
> Innsbruck Medical University
> Biocenter - Division for Bioinformatics
>
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

--
_________________________________________
D i e t m a r R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics

Mark Nelson

2017-09-26 14:39:43 UTC

Permalink

On 09/26/2017 01:10 AM, Dietmar Rieder wrote:
> thanks David,
>
> that's confirming what I was assuming. To bad that there is no
> estimate/method to calculate the db partition size.

It's possible that we might be able to get ranges for certain kinds of
scenarios. Maybe if you do lots of small random writes on RBD, you can
expect a typical metadata size of X per object. Or maybe if you do lots
of large sequential object writes in RGW, it's more like Y. I think
it's probably going to be tough to make it accurate for everyone though.

Mark

>
> Dietmar
>
> On 09/25/2017 05:10 PM, David Turner wrote:
>> db/wal partitions are per OSD. DB partitions need to be made as big as
>> you need them. If they run out of space, they will fall back to the
>> block device. If the DB and block are on the same device, then there's
>> no reason to partition them and figure out the best size. If they are
>> on separate devices, then you need to make it as big as you need to to
>> ensure that it won't spill over (or if it does that you're ok with the
>> degraded performance while the db partition is full). I haven't come
>> across an equation to judge what size should be used for either
>> partition yet.
>>
>> On Mon, Sep 25, 2017 at 10:53 AM Dietmar Rieder
>> <***@i-med.ac.at <mailto:***@i-med.ac.at>> wrote:
>>
>> On 09/25/2017 02:59 PM, Mark Nelson wrote:
>> > On 09/25/2017 03:31 AM, TYLin wrote:
>> >> Hi,
>> >>
>> >> To my understand, the bluestore write workflow is
>> >>
>> >> For normal big write
>> >> 1. Write data to block
>> >> 2. Update metadata to rocksdb
>> >> 3. Rocksdb write to memory and block.wal
>> >> 4. Once reach threshold, flush entries in block.wal to block.db
>> >>
>> >> For overwrite and small write
>> >> 1. Write data and metadata to rocksdb
>> >> 2. Apply the data to block
>> >>
>> >> Seems we don’t have a formula or suggestion to the size of block.db.
>> >> It depends on the object size and number of objects in your pool. You
>> >> can just give big partition to block.db to ensure all the database
>> >> files are on that fast partition. If block.db full, it will use block
>> >> to put db files, however, this will slow down the db performance. So
>> >> give db size as much as you can.
>> >
>> > This is basically correct. What's more, it's not just the object
>> size,
>> > but the number of extents, checksums, RGW bucket indices, and
>> > potentially other random stuff. I'm skeptical how well we can
>> estimate
>> > all of this in the long run. I wonder if we would be better served by
>> > just focusing on making it easy to understand how the DB device is
>> being
>> > used, how much is spilling over to the block device, and make it
>> easy to
>> > upgrade to a new device once it gets full.
>> >
>> >>
>> >> If you want to put wal and db on same ssd, you don’t need to create
>> >> block.wal. It will implicitly use block.db to put wal. The only case
>> >> you need block.wal is that you want to separate wal to another disk.
>> >
>> > I always make explicit partitions, but only because I (potentially
>> > illogically) like it that way. There may actually be some benefits to
>> > using a single partition for both if sharing a single device.
>>
>> is this "Single db/wal partition" then to be used for all OSDs on a node
>> or do you need to create a seperate "Single db/wal partition" for each
>> OSD on the node?
>>
>> >
>> >>
>> >> I’m also studying bluestore, this is what I know so far. Any
>> >> correction is welcomed.
>> >>
>> >> Thanks
>> >>
>> >>
>> >>> On Sep 22, 2017, at 5:27 PM, Richard Hesketh
>> >>> <***@rd.bbc.co.uk
>> <mailto:***@rd.bbc.co.uk>> wrote:
>> >>>
>> >>> I asked the same question a couple of weeks ago. No response I got
>> >>> contradicted the documentation but nobody actively confirmed the
>> >>> documentation was correct on this subject, either; my end state was
>> >>> that I was relatively confident I wasn't making some horrible
>> mistake
>> >>> by simply specifying a big DB partition and letting bluestore work
>> >>> itself out (in my case, I've just got HDDs and SSDs that were
>> >>> journals under filestore), but I could not be sure there wasn't some
>> >>> sort of performance tuning I was missing out on by not specifying
>> >>> them separately.
>> >>>
>> >>> Rich
>> >>>
>> >>> On 21/09/17 20:37, Benjeman Meekhof wrote:
>> >>>> Some of this thread seems to contradict the documentation and
>> confuses
>> >>>> me. Is the statement below correct?
>> >>>>
>> >>>> "The BlueStore journal will always be placed on the fastest device
>> >>>> available, so using a DB device will provide the same benefit
>> that the
>> >>>> WAL device would while also allowing additional metadata to be
>> stored
>> >>>> there (if it will fix)."
>> >>>>
>> >>>>
>> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices
>> >>>>
>> >>>>
>> >>>> it seems to be saying that there's no reason to create
>> separate WAL
>> >>>> and DB partitions if they are on the same device. Specifying one
>> >>>> large DB partition per OSD will cover both uses.
>> >>>>
>> >>>> thanks,
>> >>>> Ben
>> >>>>
>> >>>> On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
>> >>>> <***@i-med.ac.at
>> <mailto:***@i-med.ac.at>> wrote:
>> >>>>> On 09/21/2017 05:03 PM, Mark Nelson wrote:
>> >>>>>>
>> >>>>>> On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
>> >>>>>>> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
>> >>>>>>>> On 2017-09-21 07:56, Lazuardi Nasution wrote:
>> >>>>>>>>
>> >>>>>>>>> Hi,
>> >>>>>>>>>
>> >>>>>>>>> I'm still looking for the answer of these questions. Maybe
>> >>>>>>>>> someone can
>> >>>>>>>>> share their thought on these. Any comment will be helpful too.
>> >>>>>>>>>
>> >>>>>>>>> Best regards,
>> >>>>>>>>>
>> >>>>>>>>> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
>> >>>>>>>>> <***@gmail.com <mailto:***@gmail.com>
>> <mailto:***@gmail.com <mailto:***@gmail.com>>> wrote:
>> >>>>>>>>>
>> >>>>>>>>> Hi,
>> >>>>>>>>>
>> >>>>>>>>> 1. Is it possible configure use osd_data not as small
>> >>>>>>>>> partition on
>> >>>>>>>>> OSD but a folder (ex. on root disk)? If yes, how to do
>> that
>> >>>>>>>>> with
>> >>>>>>>>> ceph-disk and any pros/cons of doing that?
>> >>>>>>>>> 2. Is WAL & DB size calculated based on OSD size or
>> expected
>> >>>>>>>>> throughput like on journal device of filestore? If no,
>> what
>> >>>>>>>>> is the
>> >>>>>>>>> default value and pro/cons of adjusting that?
>> >>>>>>>>> 3. Is partition alignment matter on Bluestore, including
>> >>>>>>>>> WAL & DB
>> >>>>>>>>> if using separate device for them?
>> >>>>>>>>>
>> >>>>>>>>> Best regards,
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> _______________________________________________
>> >>>>>>>>> ceph-users mailing list
>> >>>>>>>>> ceph-***@lists.ceph.com
>> <mailto:ceph-***@lists.ceph.com> <mailto:ceph-***@lists.ceph.com
>> <mailto:ceph-***@lists.ceph.com>>
>> >>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> I am also looking for recommendations on wal/db partition
>> sizes.
>> >>>>>>>> Some
>> >>>>>>>> hints:
>> >>>>>>>>
>> >>>>>>>> ceph-disk defaults used in case it does not find
>> >>>>>>>> bluestore_block_wal_size or bluestore_block_db_size in
>> config file:
>> >>>>>>>>
>> >>>>>>>> wal = 512MB
>> >>>>>>>>
>> >>>>>>>> db = if bluestore_block_size (data size) is in config file it
>> >>>>>>>> uses 1/100
>> >>>>>>>> of it else it uses 1G.
>> >>>>>>>>
>> >>>>>>>> There is also a presentation by Sage back in March, see
>> page 16:
>> >>>>>>>>
>> >>>>>>>>
>> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> wal: 512 MB
>> >>>>>>>>
>> >>>>>>>> db: "a few" GB
>> >>>>>>>>
>> >>>>>>>> the wal size is probably not debatable, it will be like a
>> >>>>>>>> journal for
>> >>>>>>>> small block sizes which are constrained by iops hence 512 MB is
>> >>>>>>>> more
>> >>>>>>>> than enough. Probably we will see more on the db size in the
>> >>>>>>>> future.
>> >>>>>>> This is what I understood so far.
>> >>>>>>> I wonder if it makes sense to set the db size as big as
>> possible and
>> >>>>>>> divide entire db device is by the number of OSDs it will serve.
>> >>>>>>>
>> >>>>>>> E.g. 10 OSDs / 1 NVME (800GB)
>> >>>>>>>
>> >>>>>>> (800GB - 10x1GB wal ) / 10 = ~79Gb db size per OSD
>> >>>>>>>
>> >>>>>>> Is this smart/stupid?
>> >>>>>> Personally I'd use 512MB-2GB for the WAL (larger buffers
>> reduce write
>> >>>>>> amp but mean larger memtables and potentially higher overhead
>> >>>>>> scanning
>> >>>>>> through memtables). 4x256MB buffers works pretty well, but
>> it means
>> >>>>>> memory overhead too. Beyond that, I'd devote the entire rest
>> of the
>> >>>>>> device to DB partitions.
>> >>>>>>
>> >>>>> thanks for your suggestion Mark!
>> >>>>>
>> >>>>> So, just to make sure I understood this right:
>> >>>>>
>> >>>>> You'd use a separeate 512MB-2GB WAL partition for each OSD
>> and the
>> >>>>> entire rest for DB partitions.
>> >>>>>
>> >>>>> In the example case with 10xHDD OSD and 1 NVME it would then
>> be 10 WAL
>> >>>>> partitions with each 512MB-2GB and 10 equal sized DB partitions
>> >>>>> consuming the rest of the NVME.
>> >>>>>
>> >>>>>
>> >>>>> Thanks
>> >>>>> Dietmar
>> >>>>> --
>> >>>>> _________________________________________
>> >>>>> D i e t m a r R i e d e r, Mag.Dr.
>> >>>>> Innsbruck Medical University
>> >>>>> Biocenter - Division for Bioinformatics
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> _______________________________________________
>> >>>>> ceph-users mailing list
>> >>>>> ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
>> >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>>>>
>> >>>> _______________________________________________
>> >>>> ceph-users mailing list
>> >>>> ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
>> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>>
>> >>> _______________________________________________
>> >>> ceph-users mailing list
>> >>> ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
>> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>
>> >> _______________________________________________
>> >> ceph-users mailing list
>> >> ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>> --
>> _________________________________________
>> D i e t m a r R i e d e r, Mag.Dr.
>> Innsbruck Medical University
>> Biocenter - Division for Bioinformatics
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

Wido den Hollander

2017-10-16 12:45:13 UTC

Permalink

> Op 26 september 2017 om 16:39 schreef Mark Nelson <***@redhat.com>:
>
>
>
>
> On 09/26/2017 01:10 AM, Dietmar Rieder wrote:
> > thanks David,
> >
> > that's confirming what I was assuming. To bad that there is no
> > estimate/method to calculate the db partition size.
>
> It's possible that we might be able to get ranges for certain kinds of
> scenarios. Maybe if you do lots of small random writes on RBD, you can
> expect a typical metadata size of X per object. Or maybe if you do lots
> of large sequential object writes in RGW, it's more like Y. I think
> it's probably going to be tough to make it accurate for everyone though.
>

So I did a quick test. I wrote 75.000 objects to a BlueStore device:

***@alpha:~# ceph daemon osd.0 perf dump|jq '.bluestore.bluestore_onodes'
75085
***@alpha:~#

I then saw the RocksDB database was 450MB in size:

***@alpha:~# ceph daemon osd.0 perf dump|jq '.bluefs.db_used_bytes'
459276288
***@alpha:~#

459276288 / 75085 = 6116

So about 6kb of RocksDB data per object.

Let's say I want to store 1M objects in a single OSD I would need ~6GB of DB space.

Is this a safe assumption? Do you think that 6kb is normal? Low? High?

There aren't many of these numbers out there for BlueStore right now so I'm trying to gather some numbers.

Wido

> Mark
>
> >
> > Dietmar
> >
> > On 09/25/2017 05:10 PM, David Turner wrote:
> >> db/wal partitions are per OSD. DB partitions need to be made as big as
> >> you need them. If they run out of space, they will fall back to the
> >> block device. If the DB and block are on the same device, then there's
> >> no reason to partition them and figure out the best size. If they are
> >> on separate devices, then you need to make it as big as you need to to
> >> ensure that it won't spill over (or if it does that you're ok with the
> >> degraded performance while the db partition is full). I haven't come
> >> across an equation to judge what size should be used for either
> >> partition yet.
> >>
> >> On Mon, Sep 25, 2017 at 10:53 AM Dietmar Rieder
> >> <***@i-med.ac.at <mailto:***@i-med.ac.at>> wrote:
> >>
> >> On 09/25/2017 02:59 PM, Mark Nelson wrote:
> >> > On 09/25/2017 03:31 AM, TYLin wrote:
> >> >> Hi,
> >> >>
> >> >> To my understand, the bluestore write workflow is
> >> >>
> >> >> For normal big write
> >> >> 1. Write data to block
> >> >> 2. Update metadata to rocksdb
> >> >> 3. Rocksdb write to memory and block.wal
> >> >> 4. Once reach threshold, flush entries in block.wal to block.db
> >> >>
> >> >> For overwrite and small write
> >> >> 1. Write data and metadata to rocksdb
> >> >> 2. Apply the data to block
> >> >>
> >> >> Seems we don’t have a formula or suggestion to the size of block.db.
> >> >> It depends on the object size and number of objects in your pool. You
> >> >> can just give big partition to block.db to ensure all the database
> >> >> files are on that fast partition. If block.db full, it will use block
> >> >> to put db files, however, this will slow down the db performance. So
> >> >> give db size as much as you can.
> >> >
> >> > This is basically correct. What's more, it's not just the object
> >> size,
> >> > but the number of extents, checksums, RGW bucket indices, and
> >> > potentially other random stuff. I'm skeptical how well we can
> >> estimate
> >> > all of this in the long run. I wonder if we would be better served by
> >> > just focusing on making it easy to understand how the DB device is
> >> being
> >> > used, how much is spilling over to the block device, and make it
> >> easy to
> >> > upgrade to a new device once it gets full.
> >> >
> >> >>
> >> >> If you want to put wal and db on same ssd, you don’t need to create
> >> >> block.wal. It will implicitly use block.db to put wal. The only case
> >> >> you need block.wal is that you want to separate wal to another disk.
> >> >
> >> > I always make explicit partitions, but only because I (potentially
> >> > illogically) like it that way. There may actually be some benefits to
> >> > using a single partition for both if sharing a single device.
> >>
> >> is this "Single db/wal partition" then to be used for all OSDs on a node
> >> or do you need to create a seperate "Single db/wal partition" for each
> >> OSD on the node?
> >>
> >> >
> >> >>
> >> >> I’m also studying bluestore, this is what I know so far. Any
> >> >> correction is welcomed.
> >> >>
> >> >> Thanks
> >> >>
> >> >>
> >> >>> On Sep 22, 2017, at 5:27 PM, Richard Hesketh
> >> >>> <***@rd.bbc.co.uk
> >> <mailto:***@rd.bbc.co.uk>> wrote:
> >> >>>
> >> >>> I asked the same question a couple of weeks ago. No response I got
> >> >>> contradicted the documentation but nobody actively confirmed the
> >> >>> documentation was correct on this subject, either; my end state was
> >> >>> that I was relatively confident I wasn't making some horrible
> >> mistake
> >> >>> by simply specifying a big DB partition and letting bluestore work
> >> >>> itself out (in my case, I've just got HDDs and SSDs that were
> >> >>> journals under filestore), but I could not be sure there wasn't some
> >> >>> sort of performance tuning I was missing out on by not specifying
> >> >>> them separately.
> >> >>>
> >> >>> Rich
> >> >>>
> >> >>> On 21/09/17 20:37, Benjeman Meekhof wrote:
> >> >>>> Some of this thread seems to contradict the documentation and
> >> confuses
> >> >>>> me. Is the statement below correct?
> >> >>>>
> >> >>>> "The BlueStore journal will always be placed on the fastest device
> >> >>>> available, so using a DB device will provide the same benefit
> >> that the
> >> >>>> WAL device would while also allowing additional metadata to be
> >> stored
> >> >>>> there (if it will fix)."
> >> >>>>
> >> >>>>
> >> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices
> >> >>>>
> >> >>>>
> >> >>>> it seems to be saying that there's no reason to create
> >> separate WAL
> >> >>>> and DB partitions if they are on the same device. Specifying one
> >> >>>> large DB partition per OSD will cover both uses.
> >> >>>>
> >> >>>> thanks,
> >> >>>> Ben
> >> >>>>
> >> >>>> On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
> >> >>>> <***@i-med.ac.at
> >> <mailto:***@i-med.ac.at>> wrote:
> >> >>>>> On 09/21/2017 05:03 PM, Mark Nelson wrote:
> >> >>>>>>
> >> >>>>>> On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
> >> >>>>>>> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
> >> >>>>>>>> On 2017-09-21 07:56, Lazuardi Nasution wrote:
> >> >>>>>>>>
> >> >>>>>>>>> Hi,
> >> >>>>>>>>>
> >> >>>>>>>>> I'm still looking for the answer of these questions. Maybe
> >> >>>>>>>>> someone can
> >> >>>>>>>>> share their thought on these. Any comment will be helpful too.
> >> >>>>>>>>>
> >> >>>>>>>>> Best regards,
> >> >>>>>>>>>
> >> >>>>>>>>> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
> >> >>>>>>>>> <***@gmail.com <mailto:***@gmail.com>
> >> <mailto:***@gmail.com <mailto:***@gmail.com>>> wrote:
> >> >>>>>>>>>
> >> >>>>>>>>> Hi,
> >> >>>>>>>>>
> >> >>>>>>>>> 1. Is it possible configure use osd_data not as small
> >> >>>>>>>>> partition on
> >> >>>>>>>>> OSD but a folder (ex. on root disk)? If yes, how to do
> >> that
> >> >>>>>>>>> with
> >> >>>>>>>>> ceph-disk and any pros/cons of doing that?
> >> >>>>>>>>> 2. Is WAL & DB size calculated based on OSD size or
> >> expected
> >> >>>>>>>>> throughput like on journal device of filestore? If no,
> >> what
> >> >>>>>>>>> is the
> >> >>>>>>>>> default value and pro/cons of adjusting that?
> >> >>>>>>>>> 3. Is partition alignment matter on Bluestore, including
> >> >>>>>>>>> WAL & DB
> >> >>>>>>>>> if using separate device for them?
> >> >>>>>>>>>
> >> >>>>>>>>> Best regards,
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> _______________________________________________
> >> >>>>>>>>> ceph-users mailing list
> >> >>>>>>>>> ceph-***@lists.ceph.com
> >> <mailto:ceph-***@lists.ceph.com> <mailto:ceph-***@lists.ceph.com
> >> <mailto:ceph-***@lists.ceph.com>>
> >> >>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> I am also looking for recommendations on wal/db partition
> >> sizes.
> >> >>>>>>>> Some
> >> >>>>>>>> hints:
> >> >>>>>>>>
> >> >>>>>>>> ceph-disk defaults used in case it does not find
> >> >>>>>>>> bluestore_block_wal_size or bluestore_block_db_size in
> >> config file:
> >> >>>>>>>>
> >> >>>>>>>> wal = 512MB
> >> >>>>>>>>
> >> >>>>>>>> db = if bluestore_block_size (data size) is in config file it
> >> >>>>>>>> uses 1/100
> >> >>>>>>>> of it else it uses 1G.
> >> >>>>>>>>
> >> >>>>>>>> There is also a presentation by Sage back in March, see
> >> page 16:
> >> >>>>>>>>
> >> >>>>>>>>
> >> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> wal: 512 MB
> >> >>>>>>>>
> >> >>>>>>>> db: "a few" GB
> >> >>>>>>>>
> >> >>>>>>>> the wal size is probably not debatable, it will be like a
> >> >>>>>>>> journal for
> >> >>>>>>>> small block sizes which are constrained by iops hence 512 MB is
> >> >>>>>>>> more
> >> >>>>>>>> than enough. Probably we will see more on the db size in the
> >> >>>>>>>> future.
> >> >>>>>>> This is what I understood so far.
> >> >>>>>>> I wonder if it makes sense to set the db size as big as
> >> possible and
> >> >>>>>>> divide entire db device is by the number of OSDs it will serve.
> >> >>>>>>>
> >> >>>>>>> E.g. 10 OSDs / 1 NVME (800GB)
> >> >>>>>>>
> >> >>>>>>> (800GB - 10x1GB wal ) / 10 = ~79Gb db size per OSD
> >> >>>>>>>
> >> >>>>>>> Is this smart/stupid?
> >> >>>>>> Personally I'd use 512MB-2GB for the WAL (larger buffers
> >> reduce write
> >> >>>>>> amp but mean larger memtables and potentially higher overhead
> >> >>>>>> scanning
> >> >>>>>> through memtables). 4x256MB buffers works pretty well, but
> >> it means
> >> >>>>>> memory overhead too. Beyond that, I'd devote the entire rest
> >> of the
> >> >>>>>> device to DB partitions.
> >> >>>>>>
> >> >>>>> thanks for your suggestion Mark!
> >> >>>>>
> >> >>>>> So, just to make sure I understood this right:
> >> >>>>>
> >> >>>>> You'd use a separeate 512MB-2GB WAL partition for each OSD
> >> and the
> >> >>>>> entire rest for DB partitions.
> >> >>>>>
> >> >>>>> In the example case with 10xHDD OSD and 1 NVME it would then
> >> be 10 WAL
> >> >>>>> partitions with each 512MB-2GB and 10 equal sized DB partitions
> >> >>>>> consuming the rest of the NVME.
> >> >>>>>
> >> >>>>>
> >> >>>>> Thanks
> >> >>>>> Dietmar
> >> >>>>> --
> >> >>>>> _________________________________________
> >> >>>>> D i e t m a r R i e d e r, Mag.Dr.
> >> >>>>> Innsbruck Medical University
> >> >>>>> Biocenter - Division for Bioinformatics
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> _______________________________________________
> >> >>>>> ceph-users mailing list
> >> >>>>> ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
> >> >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >>>>>
> >> >>>> _______________________________________________
> >> >>>> ceph-users mailing list
> >> >>>> ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
> >> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >>>
> >> >>> _______________________________________________
> >> >>> ceph-users mailing list
> >> >>> ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
> >> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >>
> >> >> _______________________________________________
> >> >> ceph-users mailing list
> >> >> ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
> >> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >>
> >> > _______________________________________________
> >> > ceph-users mailing list
> >> > ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >>
> >> --
> >> _________________________________________
> >> D i e t m a r R i e d e r, Mag.Dr.
> >> Innsbruck Medical University
> >> Biocenter - Division for Bioinformatics
> >>
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-***@lists.ceph.com <mailto:ceph-***@lists.ceph.com>
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >
> >
> >
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-***@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Richard Hesketh

2017-10-16 16:14:12 UTC

Permalink

On 16/10/17 13:45, Wido den Hollander wrote:
>> Op 26 september 2017 om 16:39 schreef Mark Nelson <***@redhat.com>:
>> On 09/26/2017 01:10 AM, Dietmar Rieder wrote:
>>> thanks David,
>>>
>>> that's confirming what I was assuming. To bad that there is no
>>> estimate/method to calculate the db partition size.
>>
>> It's possible that we might be able to get ranges for certain kinds of
>> scenarios. Maybe if you do lots of small random writes on RBD, you can
>> expect a typical metadata size of X per object. Or maybe if you do lots
>> of large sequential object writes in RGW, it's more like Y. I think
>> it's probably going to be tough to make it accurate for everyone though.
>
> So I did a quick test. I wrote 75.000 objects to a BlueStore device:
>
> ***@alpha:~# ceph daemon osd.0 perf dump|jq '.bluestore.bluestore_onodes'
> 75085
> ***@alpha:~#
>
> I then saw the RocksDB database was 450MB in size:
>
> ***@alpha:~# ceph daemon osd.0 perf dump|jq '.bluefs.db_used_bytes'
> 459276288
> ***@alpha:~#
>
> 459276288 / 75085 = 6116
>
> So about 6kb of RocksDB data per object.
>
> Let's say I want to store 1M objects in a single OSD I would need ~6GB of DB space.
>
> Is this a safe assumption? Do you think that 6kb is normal? Low? High?
>
> There aren't many of these numbers out there for BlueStore right now so I'm trying to gather some numbers.
>
> Wido

If I check for the same stats on OSDs in my production cluster I see similar but variable values:

***@vm-ds-01:~/ceph-conf# for i in {0..9} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.0 db per object: 7490
osd.1 db per object: 7523
osd.2 db per object: 7378
osd.3 db per object: 7447
osd.4 db per object: 7233
osd.5 db per object: 7393
osd.6 db per object: 7074
osd.7 db per object: 7967
osd.8 db per object: 7253
osd.9 db per object: 7680

***@vm-ds-02:~# for i in {10..19} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.10 db per object: 5168
osd.11 db per object: 5291
osd.12 db per object: 5476
osd.13 db per object: 4978
osd.14 db per object: 5252
osd.15 db per object: 5461
osd.16 db per object: 5135
osd.17 db per object: 5126
osd.18 db per object: 9336
osd.19 db per object: 4986

***@vm-ds-03:~# for i in {20..29} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.20 db per object: 5115
osd.21 db per object: 4844
osd.22 db per object: 5063
osd.23 db per object: 5486
osd.24 db per object: 5228
osd.25 db per object: 4966
osd.26 db per object: 5047
osd.27 db per object: 5021
osd.28 db per object: 5321
osd.29 db per object: 5150

***@vm-ds-04:~# for i in {30..39} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.30 db per object: 6658
osd.31 db per object: 6445
osd.32 db per object: 6259
osd.33 db per object: 6691
osd.34 db per object: 6513
osd.35 db per object: 6628
osd.36 db per object: 6779
osd.37 db per object: 6819
osd.38 db per object: 6677
osd.39 db per object: 6689

***@vm-ds-05:~# for i in {40..49} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.40 db per object: 5335
osd.41 db per object: 5203
osd.42 db per object: 5552
osd.43 db per object: 5188
osd.44 db per object: 5218
osd.45 db per object: 5157
osd.46 db per object: 4956
osd.47 db per object: 5370
osd.48 db per object: 5117
osd.49 db per object: 5313

I'm not sure why so much variance (these nodes are basically identical) and I think that the db_used_bytes includes the WAL at least in my case, as I don't have a separate WAL device. I'm not sure how big the WAL is relative to metadata and hence how much this might be thrown off, but ~6kb/object seems like a reasonable value to take for back-of-envelope calculating.

[bonus hilarity]
On my all-in-one-SSD OSDs, because bluestore reports them entirely as db space, I get results like:

***@vm-hv-01:~# for i in {60..65} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.60 db per object: 80273
osd.61 db per object: 68859
osd.62 db per object: 45560
osd.63 db per object: 38209
osd.64 db per object: 48258
osd.65 db per object: 50525

Rich

Wido den Hollander

2017-10-17 06:54:26 UTC

Permalink

> Op 16 oktober 2017 om 18:14 schreef Richard Hesketh <***@rd.bbc.co.uk>:
>
>
> On 16/10/17 13:45, Wido den Hollander wrote:
> >> Op 26 september 2017 om 16:39 schreef Mark Nelson <***@redhat.com>:
> >> On 09/26/2017 01:10 AM, Dietmar Rieder wrote:
> >>> thanks David,
> >>>
> >>> that's confirming what I was assuming. To bad that there is no
> >>> estimate/method to calculate the db partition size.
> >>
> >> It's possible that we might be able to get ranges for certain kinds of
> >> scenarios. Maybe if you do lots of small random writes on RBD, you can
> >> expect a typical metadata size of X per object. Or maybe if you do lots
> >> of large sequential object writes in RGW, it's more like Y. I think
> >> it's probably going to be tough to make it accurate for everyone though.
> >
> > So I did a quick test. I wrote 75.000 objects to a BlueStore device:
> >
> > ***@alpha:~# ceph daemon osd.0 perf dump|jq '.bluestore.bluestore_onodes'
> > 75085
> > ***@alpha:~#
> >
> > I then saw the RocksDB database was 450MB in size:
> >
> > ***@alpha:~# ceph daemon osd.0 perf dump|jq '.bluefs.db_used_bytes'
> > 459276288
> > ***@alpha:~#
> >
> > 459276288 / 75085 = 6116
> >
> > So about 6kb of RocksDB data per object.
> >
> > Let's say I want to store 1M objects in a single OSD I would need ~6GB of DB space.
> >
> > Is this a safe assumption? Do you think that 6kb is normal? Low? High?
> >
> > There aren't many of these numbers out there for BlueStore right now so I'm trying to gather some numbers.
> >
> > Wido
>
> If I check for the same stats on OSDs in my production cluster I see similar but variable values:
>
> ***@vm-ds-01:~/ceph-conf# for i in {0..9} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> osd.0 db per object: 7490
> osd.1 db per object: 7523
> osd.2 db per object: 7378
> osd.3 db per object: 7447
> osd.4 db per object: 7233
> osd.5 db per object: 7393
> osd.6 db per object: 7074
> osd.7 db per object: 7967
> osd.8 db per object: 7253
> osd.9 db per object: 7680
>
> ***@vm-ds-02:~# for i in {10..19} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> osd.10 db per object: 5168
> osd.11 db per object: 5291
> osd.12 db per object: 5476
> osd.13 db per object: 4978
> osd.14 db per object: 5252
> osd.15 db per object: 5461
> osd.16 db per object: 5135
> osd.17 db per object: 5126
> osd.18 db per object: 9336
> osd.19 db per object: 4986
>
> ***@vm-ds-03:~# for i in {20..29} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> osd.20 db per object: 5115
> osd.21 db per object: 4844
> osd.22 db per object: 5063
> osd.23 db per object: 5486
> osd.24 db per object: 5228
> osd.25 db per object: 4966
> osd.26 db per object: 5047
> osd.27 db per object: 5021
> osd.28 db per object: 5321
> osd.29 db per object: 5150
>
> ***@vm-ds-04:~# for i in {30..39} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> osd.30 db per object: 6658
> osd.31 db per object: 6445
> osd.32 db per object: 6259
> osd.33 db per object: 6691
> osd.34 db per object: 6513
> osd.35 db per object: 6628
> osd.36 db per object: 6779
> osd.37 db per object: 6819
> osd.38 db per object: 6677
> osd.39 db per object: 6689
>
> ***@vm-ds-05:~# for i in {40..49} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> osd.40 db per object: 5335
> osd.41 db per object: 5203
> osd.42 db per object: 5552
> osd.43 db per object: 5188
> osd.44 db per object: 5218
> osd.45 db per object: 5157
> osd.46 db per object: 4956
> osd.47 db per object: 5370
> osd.48 db per object: 5117
> osd.49 db per object: 5313
>
> I'm not sure why so much variance (these nodes are basically identical) and I think that the db_used_bytes includes the WAL at least in my case, as I don't have a separate WAL device. I'm not sure how big the WAL is relative to metadata and hence how much this might be thrown off, but ~6kb/object seems like a reasonable value to take for back-of-envelope calculating.
>

Yes, judging from your numbers 6kb/object seems reasonable. More datapoints are welcome in this case.

Some input from a BlueStore dev might be helpful as well to see we are not drawing the wrong conclusions here.

Wido

> [bonus hilarity]
> On my all-in-one-SSD OSDs, because bluestore reports them entirely as db space, I get results like:
>
> ***@vm-hv-01:~# for i in {60..65} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> osd.60 db per object: 80273
> osd.61 db per object: 68859
> osd.62 db per object: 45560
> osd.63 db per object: 38209
> osd.64 db per object: 48258
> osd.65 db per object: 50525
>
> Rich
>

Marco Baldini - H.S. Amiata

2017-10-17 07:17:19 UTC

Permalink

Hello

Here my results

In this node, I have 3 OSDs (1TB HDD), osd.1 and osd.2 have blocks.db in
SSD partitions each of 90GB, osd.8 has no separate blocks.db

pve-hs-main[0]:~$ for i in {1,2,8} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.1 db per object: 20872
osd.2 db per object: 20416
osd.8 db per object: 16888

In this node, I have 3 OSDs (1TB HDD), each with a 60GB blocks.db on a
separate SSD

pve-hs-2[0]:/$ for i in {3..5} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.3 db per object: 19053
osd.4 db per object: 18742
osd.5 db per object: 14979

In this node I have 3 OSDs (1TB HDD) with no separate SSD

pve-hs-3[0]:~$ for i in {0,6,7} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.0 db per object: 27392
osd.6 db per object: 54065
osd.7 db per object: 69986

My ceph df and rados df, if they can be useful

pve-hs-3[0]:~$ ceph df detail
GLOBAL:
SIZE AVAIL RAW USED %RAW USED OBJECTS
8742G 6628G 2114G 24.19 187k
POOLS:
NAME ID QUOTA OBJECTS QUOTA BYTES USED %USED MAX AVAIL OBJECTS DIRTY READ WRITE RAW USED
cephbackup 9 N/A N/A 469G 7.38 2945G 120794 117k 759k 2899k 938G
cephwin 13 N/A N/A 73788M 1.21 1963G 18711 18711 1337k 1637k 216G
cephnix 14 N/A N/A 201G 3.31 1963G 52407 52407 791k 1781k 605G
pve-hs-3[0]:~$ rados df detail
POOL_NAME USED OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS RD WR_OPS WR
cephbackup 469G 120794 0 241588 0 0 0 777872 7286M 2968926 718G
cephnix 201G 52407 0 157221 0 0 0 810317 67057M 1824184 242G
cephwin 73788M 18711 0 56133 0 0 0 1369792 155G 1677060 136G

total_objects 191912
total_used 2114G
total_avail 6628G
total_space 8742G

Can someone see a pattern?

Il 17/10/2017 08:54, Wido den Hollander ha scritto:
>> Op 16 oktober 2017 om 18:14 schreef Richard Hesketh <***@rd.bbc.co.uk>:
>>
>>
>> On 16/10/17 13:45, Wido den Hollander wrote:
>>>> Op 26 september 2017 om 16:39 schreef Mark Nelson <***@redhat.com>:
>>>> On 09/26/2017 01:10 AM, Dietmar Rieder wrote:
>>>>> thanks David,
>>>>>
>>>>> that's confirming what I was assuming. To bad that there is no
>>>>> estimate/method to calculate the db partition size.
>>>> It's possible that we might be able to get ranges for certain kinds of
>>>> scenarios. Maybe if you do lots of small random writes on RBD, you can
>>>> expect a typical metadata size of X per object. Or maybe if you do lots
>>>> of large sequential object writes in RGW, it's more like Y. I think
>>>> it's probably going to be tough to make it accurate for everyone though.
>>> So I did a quick test. I wrote 75.000 objects to a BlueStore device:
>>>
>>> ***@alpha:~# ceph daemon osd.0 perf dump|jq '.bluestore.bluestore_onodes'
>>> 75085
>>> ***@alpha:~#
>>>
>>> I then saw the RocksDB database was 450MB in size:
>>>
>>> ***@alpha:~# ceph daemon osd.0 perf dump|jq '.bluefs.db_used_bytes'
>>> 459276288
>>> ***@alpha:~#
>>>
>>> 459276288 / 75085 = 6116
>>>
>>> So about 6kb of RocksDB data per object.
>>>
>>> Let's say I want to store 1M objects in a single OSD I would need ~6GB of DB space.
>>>
>>> Is this a safe assumption? Do you think that 6kb is normal? Low? High?
>>>
>>> There aren't many of these numbers out there for BlueStore right now so I'm trying to gather some numbers.
>>>
>>> Wido
>> If I check for the same stats on OSDs in my production cluster I see similar but variable values:
>>
>> ***@vm-ds-01:~/ceph-conf# for i in {0..9} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
>> osd.0 db per object: 7490
>> osd.1 db per object: 7523
>> osd.2 db per object: 7378
>> osd.3 db per object: 7447
>> osd.4 db per object: 7233
>> osd.5 db per object: 7393
>> osd.6 db per object: 7074
>> osd.7 db per object: 7967
>> osd.8 db per object: 7253
>> osd.9 db per object: 7680
>>
>> ***@vm-ds-02:~# for i in {10..19} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
>> osd.10 db per object: 5168
>> osd.11 db per object: 5291
>> osd.12 db per object: 5476
>> osd.13 db per object: 4978
>> osd.14 db per object: 5252
>> osd.15 db per object: 5461
>> osd.16 db per object: 5135
>> osd.17 db per object: 5126
>> osd.18 db per object: 9336
>> osd.19 db per object: 4986
>>
>> ***@vm-ds-03:~# for i in {20..29} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
>> osd.20 db per object: 5115
>> osd.21 db per object: 4844
>> osd.22 db per object: 5063
>> osd.23 db per object: 5486
>> osd.24 db per object: 5228
>> osd.25 db per object: 4966
>> osd.26 db per object: 5047
>> osd.27 db per object: 5021
>> osd.28 db per object: 5321
>> osd.29 db per object: 5150
>>
>> ***@vm-ds-04:~# for i in {30..39} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
>> osd.30 db per object: 6658
>> osd.31 db per object: 6445
>> osd.32 db per object: 6259
>> osd.33 db per object: 6691
>> osd.34 db per object: 6513
>> osd.35 db per object: 6628
>> osd.36 db per object: 6779
>> osd.37 db per object: 6819
>> osd.38 db per object: 6677
>> osd.39 db per object: 6689
>>
>> ***@vm-ds-05:~# for i in {40..49} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
>> osd.40 db per object: 5335
>> osd.41 db per object: 5203
>> osd.42 db per object: 5552
>> osd.43 db per object: 5188
>> osd.44 db per object: 5218
>> osd.45 db per object: 5157
>> osd.46 db per object: 4956
>> osd.47 db per object: 5370
>> osd.48 db per object: 5117
>> osd.49 db per object: 5313
>>
>> I'm not sure why so much variance (these nodes are basically identical) and I think that the db_used_bytes includes the WAL at least in my case, as I don't have a separate WAL device. I'm not sure how big the WAL is relative to metadata and hence how much this might be thrown off, but ~6kb/object seems like a reasonable value to take for back-of-envelope calculating.
>>
> Yes, judging from your numbers 6kb/object seems reasonable. More datapoints are welcome in this case.
>
> Some input from a BlueStore dev might be helpful as well to see we are not drawing the wrong conclusions here.
>
> Wido
>
>> [bonus hilarity]
>> On my all-in-one-SSD OSDs, because bluestore reports them entirely as db space, I get results like:
>>
>> ***@vm-hv-01:~# for i in {60..65} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
>> osd.60 db per object: 80273
>> osd.61 db per object: 68859
>> osd.62 db per object: 45560
>> osd.63 db per object: 38209
>> osd.64 db per object: 48258
>> osd.65 db per object: 50525
>>
>> Rich
>>
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
*Marco Baldini*
*H.S. Amiata Srl*
Ufficio: 0577-779396
Cellulare: 335-8765169
WEB: www.hsamiata.it <https://www.hsamiata.it>
EMAIL: ***@hsamiata.it <mailto:***@hsamiata.it>

Mark Nelson

2017-10-17 12:21:08 UTC

Permalink

On 10/17/2017 01:54 AM, Wido den Hollander wrote:
>
>> Op 16 oktober 2017 om 18:14 schreef Richard Hesketh <***@rd.bbc.co.uk>:
>>
>>
>> On 16/10/17 13:45, Wido den Hollander wrote:
>>>> Op 26 september 2017 om 16:39 schreef Mark Nelson <***@redhat.com>:
>>>> On 09/26/2017 01:10 AM, Dietmar Rieder wrote:
>>>>> thanks David,
>>>>>
>>>>> that's confirming what I was assuming. To bad that there is no
>>>>> estimate/method to calculate the db partition size.
>>>>
>>>> It's possible that we might be able to get ranges for certain kinds of
>>>> scenarios. Maybe if you do lots of small random writes on RBD, you can
>>>> expect a typical metadata size of X per object. Or maybe if you do lots
>>>> of large sequential object writes in RGW, it's more like Y. I think
>>>> it's probably going to be tough to make it accurate for everyone though.
>>>
>>> So I did a quick test. I wrote 75.000 objects to a BlueStore device:
>>>
>>> ***@alpha:~# ceph daemon osd.0 perf dump|jq '.bluestore.bluestore_onodes'
>>> 75085
>>> ***@alpha:~#
>>>
>>> I then saw the RocksDB database was 450MB in size:
>>>
>>> ***@alpha:~# ceph daemon osd.0 perf dump|jq '.bluefs.db_used_bytes'
>>> 459276288
>>> ***@alpha:~#
>>>
>>> 459276288 / 75085 = 6116
>>>
>>> So about 6kb of RocksDB data per object.
>>>
>>> Let's say I want to store 1M objects in a single OSD I would need ~6GB of DB space.
>>>
>>> Is this a safe assumption? Do you think that 6kb is normal? Low? High?
>>>
>>> There aren't many of these numbers out there for BlueStore right now so I'm trying to gather some numbers.
>>>
>>> Wido
>>
>> If I check for the same stats on OSDs in my production cluster I see similar but variable values:
>>
>> ***@vm-ds-01:~/ceph-conf# for i in {0..9} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
>> osd.0 db per object: 7490
>> osd.1 db per object: 7523
>> osd.2 db per object: 7378
>> osd.3 db per object: 7447
>> osd.4 db per object: 7233
>> osd.5 db per object: 7393
>> osd.6 db per object: 7074
>> osd.7 db per object: 7967
>> osd.8 db per object: 7253
>> osd.9 db per object: 7680
>>
>> ***@vm-ds-02:~# for i in {10..19} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
>> osd.10 db per object: 5168
>> osd.11 db per object: 5291
>> osd.12 db per object: 5476
>> osd.13 db per object: 4978
>> osd.14 db per object: 5252
>> osd.15 db per object: 5461
>> osd.16 db per object: 5135
>> osd.17 db per object: 5126
>> osd.18 db per object: 9336
>> osd.19 db per object: 4986
>>
>> ***@vm-ds-03:~# for i in {20..29} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
>> osd.20 db per object: 5115
>> osd.21 db per object: 4844
>> osd.22 db per object: 5063
>> osd.23 db per object: 5486
>> osd.24 db per object: 5228
>> osd.25 db per object: 4966
>> osd.26 db per object: 5047
>> osd.27 db per object: 5021
>> osd.28 db per object: 5321
>> osd.29 db per object: 5150
>>
>> ***@vm-ds-04:~# for i in {30..39} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
>> osd.30 db per object: 6658
>> osd.31 db per object: 6445
>> osd.32 db per object: 6259
>> osd.33 db per object: 6691
>> osd.34 db per object: 6513
>> osd.35 db per object: 6628
>> osd.36 db per object: 6779
>> osd.37 db per object: 6819
>> osd.38 db per object: 6677
>> osd.39 db per object: 6689
>>
>> ***@vm-ds-05:~# for i in {40..49} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
>> osd.40 db per object: 5335
>> osd.41 db per object: 5203
>> osd.42 db per object: 5552
>> osd.43 db per object: 5188
>> osd.44 db per object: 5218
>> osd.45 db per object: 5157
>> osd.46 db per object: 4956
>> osd.47 db per object: 5370
>> osd.48 db per object: 5117
>> osd.49 db per object: 5313
>>
>> I'm not sure why so much variance (these nodes are basically identical) and I think that the db_used_bytes includes the WAL at least in my case, as I don't have a separate WAL device. I'm not sure how big the WAL is relative to metadata and hence how much this might be thrown off, but ~6kb/object seems like a reasonable value to take for back-of-envelope calculating.
>>
>
> Yes, judging from your numbers 6kb/object seems reasonable. More datapoints are welcome in this case.
>
> Some input from a BlueStore dev might be helpful as well to see we are not drawing the wrong conclusions here.
>
> Wido

I would be very careful about drawing too many conclusions given a
single snapshot in time, especially if there haven't been a lot of
partial object rewrites yet. Just on the surface, 6KB/object feels low
(especially if you they are moderately large objects), but perhaps if
they've never been rewritten this is a reasonable lower bound. This is
important because things like 4MB RBD objects that are regularly
rewritten might behave a lot differently than RGW objects that are
written once and then never rewritten.

Also, note that Marco is seeing much different numbers in his recent
post to the thread.

Mark

>
>> [bonus hilarity]
>> On my all-in-one-SSD OSDs, because bluestore reports them entirely as db space, I get results like:
>>
>> ***@vm-hv-01:~# for i in {60..65} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
>> osd.60 db per object: 80273
>> osd.61 db per object: 68859
>> osd.62 db per object: 45560
>> osd.63 db per object: 38209
>> osd.64 db per object: 48258
>> osd.65 db per object: 50525
>>
>> Rich
>>
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

Wido den Hollander

2017-10-18 06:29:38 UTC

Permalink

> Op 17 oktober 2017 om 14:21 schreef Mark Nelson <***@redhat.com>:
>
>
>
>
> On 10/17/2017 01:54 AM, Wido den Hollander wrote:
> >
> >> Op 16 oktober 2017 om 18:14 schreef Richard Hesketh <***@rd.bbc.co.uk>:
> >>
> >>
> >> On 16/10/17 13:45, Wido den Hollander wrote:
> >>>> Op 26 september 2017 om 16:39 schreef Mark Nelson <***@redhat.com>:
> >>>> On 09/26/2017 01:10 AM, Dietmar Rieder wrote:
> >>>>> thanks David,
> >>>>>
> >>>>> that's confirming what I was assuming. To bad that there is no
> >>>>> estimate/method to calculate the db partition size.
> >>>>
> >>>> It's possible that we might be able to get ranges for certain kinds of
> >>>> scenarios. Maybe if you do lots of small random writes on RBD, you can
> >>>> expect a typical metadata size of X per object. Or maybe if you do lots
> >>>> of large sequential object writes in RGW, it's more like Y. I think
> >>>> it's probably going to be tough to make it accurate for everyone though.
> >>>
> >>> So I did a quick test. I wrote 75.000 objects to a BlueStore device:
> >>>
> >>> ***@alpha:~# ceph daemon osd.0 perf dump|jq '.bluestore.bluestore_onodes'
> >>> 75085
> >>> ***@alpha:~#
> >>>
> >>> I then saw the RocksDB database was 450MB in size:
> >>>
> >>> ***@alpha:~# ceph daemon osd.0 perf dump|jq '.bluefs.db_used_bytes'
> >>> 459276288
> >>> ***@alpha:~#
> >>>
> >>> 459276288 / 75085 = 6116
> >>>
> >>> So about 6kb of RocksDB data per object.
> >>>
> >>> Let's say I want to store 1M objects in a single OSD I would need ~6GB of DB space.
> >>>
> >>> Is this a safe assumption? Do you think that 6kb is normal? Low? High?
> >>>
> >>> There aren't many of these numbers out there for BlueStore right now so I'm trying to gather some numbers.
> >>>
> >>> Wido
> >>
> >> If I check for the same stats on OSDs in my production cluster I see similar but variable values:
> >>
> >> ***@vm-ds-01:~/ceph-conf# for i in {0..9} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> >> osd.0 db per object: 7490
> >> osd.1 db per object: 7523
> >> osd.2 db per object: 7378
> >> osd.3 db per object: 7447
> >> osd.4 db per object: 7233
> >> osd.5 db per object: 7393
> >> osd.6 db per object: 7074
> >> osd.7 db per object: 7967
> >> osd.8 db per object: 7253
> >> osd.9 db per object: 7680
> >>
> >> ***@vm-ds-02:~# for i in {10..19} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> >> osd.10 db per object: 5168
> >> osd.11 db per object: 5291
> >> osd.12 db per object: 5476
> >> osd.13 db per object: 4978
> >> osd.14 db per object: 5252
> >> osd.15 db per object: 5461
> >> osd.16 db per object: 5135
> >> osd.17 db per object: 5126
> >> osd.18 db per object: 9336
> >> osd.19 db per object: 4986
> >>
> >> ***@vm-ds-03:~# for i in {20..29} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> >> osd.20 db per object: 5115
> >> osd.21 db per object: 4844
> >> osd.22 db per object: 5063
> >> osd.23 db per object: 5486
> >> osd.24 db per object: 5228
> >> osd.25 db per object: 4966
> >> osd.26 db per object: 5047
> >> osd.27 db per object: 5021
> >> osd.28 db per object: 5321
> >> osd.29 db per object: 5150
> >>
> >> ***@vm-ds-04:~# for i in {30..39} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> >> osd.30 db per object: 6658
> >> osd.31 db per object: 6445
> >> osd.32 db per object: 6259
> >> osd.33 db per object: 6691
> >> osd.34 db per object: 6513
> >> osd.35 db per object: 6628
> >> osd.36 db per object: 6779
> >> osd.37 db per object: 6819
> >> osd.38 db per object: 6677
> >> osd.39 db per object: 6689
> >>
> >> ***@vm-ds-05:~# for i in {40..49} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> >> osd.40 db per object: 5335
> >> osd.41 db per object: 5203
> >> osd.42 db per object: 5552
> >> osd.43 db per object: 5188
> >> osd.44 db per object: 5218
> >> osd.45 db per object: 5157
> >> osd.46 db per object: 4956
> >> osd.47 db per object: 5370
> >> osd.48 db per object: 5117
> >> osd.49 db per object: 5313
> >>
> >> I'm not sure why so much variance (these nodes are basically identical) and I think that the db_used_bytes includes the WAL at least in my case, as I don't have a separate WAL device. I'm not sure how big the WAL is relative to metadata and hence how much this might be thrown off, but ~6kb/object seems like a reasonable value to take for back-of-envelope calculating.
> >>
> >
> > Yes, judging from your numbers 6kb/object seems reasonable. More datapoints are welcome in this case.
> >
> > Some input from a BlueStore dev might be helpful as well to see we are not drawing the wrong conclusions here.
> >
> > Wido
>
> I would be very careful about drawing too many conclusions given a
> single snapshot in time, especially if there haven't been a lot of
> partial object rewrites yet. Just on the surface, 6KB/object feels low
> (especially if you they are moderately large objects), but perhaps if
> they've never been rewritten this is a reasonable lower bound. This is
> important because things like 4MB RBD objects that are regularly
> rewritten might behave a lot differently than RGW objects that are
> written once and then never rewritten.
>

Thanks for the feedback. Indeed, we have to be cautious in this case. So 6kB/object feels low to you, so it's probably.

I'm testing with a 1GB WAL/50GB DB on a SSD with a 4TB disk which seems to hold out fine. It's not that space is a true issue, but "use as much as available" doesn't say much to people.

If I have a 1TB NVMe for 10 disks, should I give 100GB of DB to each OSD? It's those things people want to know. So we need numbers to figure these things out.

Wido

> Also, note that Marco is seeing much different numbers in his recent
> post to the thread.
>
> Mark
>
> >
> >> [bonus hilarity]
> >> On my all-in-one-SSD OSDs, because bluestore reports them entirely as db space, I get results like:
> >>
> >> ***@vm-hv-01:~# for i in {60..65} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> >> osd.60 db per object: 80273
> >> osd.61 db per object: 68859
> >> osd.62 db per object: 45560
> >> osd.63 db per object: 38209
> >> osd.64 db per object: 48258
> >> osd.65 db per object: 50525
> >>
> >> Rich
> >>
> > _______________________________________________
> > ceph-users mailing list
> > ceph-***@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Marco Baldini - H.S. Amiata

2017-10-18 07:43:56 UTC

Permalink

Hi

I'm about to change some SATA SSD disks to NVME disks and for CEPH I too
would like to know how to assign space. I have 3 1TB SATA OSDs so I'll
split the NVME disks in 3 partitions of equal size, I'm not going to
assign a different WAL partition because, if the docs are right, the WAL
is automatically put in the fastest device.

What I can't find is some indication of how much space WAL and blocks.db
are using, so I could tune them better.

Il 18/10/2017 08:29, Wido den Hollander ha scritto:
> Thanks for the feedback. Indeed, we have to be cautious in this case. So 6kB/object feels low to you, so it's probably.
>
> I'm testing with a 1GB WAL/50GB DB on a SSD with a 4TB disk which seems to hold out fine. It's not that space is a true issue, but "use as much as available" doesn't say much to people.
>
> If I have a 1TB NVMe for 10 disks, should I give 100GB of DB to each OSD? It's those things people want to know. So we need numbers to figure these things out.
>
> Wido

--
*Marco Baldini*
*H.S. Amiata Srl*
Ufficio: 0577-779396
Cellulare: 335-8765169
WEB: www.hsamiata.it <https://www.hsamiata.it>
EMAIL: ***@hsamiata.it <mailto:***@hsamiata.it>

Martin Overgaard Hansen

2017-11-02 20:45:20 UTC

Permalink

Hi, it seems like Iâm in the same boat as everyone else in this particular thread.

Iâm also unable to find any guidelines or recommendations regarding sizing of the wal and / or db.

I want to bring this subject back in the light and hope someone can provide insight regarding the issue, thanks.

Best Regards,
Martin Overgaard Hansen
MultiHouse IT Partner A/S

Nigel Williams

2017-11-02 23:09:55 UTC

Permalink

On 3 November 2017 at 07:45, Martin Overgaard Hansen <***@multihouse.dk> wrote:
> I want to bring this subject back in the light and hope someone can provide
> insight regarding the issue, thanks.

Thanks Martin, I was going to do the same.

Is it possible to make the DB partition (on the fastest device) too
big? in other words is there a point where for a given set of OSDs
(number + size) the DB partition is sized too large and is wasting
resources. I recall a comment by someone proposing to split up a
single large (fast) SSD into 100GB partitions for each OSD.

The answer could be couched as some intersection of pool type (RBD /
RADOS / CephFS), object change(update?) intensity, size of OSD etc and
rule-of-thumb.

An idea occurred to me that by monitoring for the logged spill message
(the event when the DB partition spills/overflows to the OSD), OSDs
could be (lazily) destroyed and recreated with a new DB partition
increased in size say by 10% each time.

Wido den Hollander

2017-11-03 07:44:00 UTC

Permalink

> Op 3 november 2017 om 0:09 schreef Nigel Williams <***@tpac.org.au>:
>
>
> On 3 November 2017 at 07:45, Martin Overgaard Hansen <***@multihouse.dk> wrote:
> > I want to bring this subject back in the light and hope someone can provide
> > insight regarding the issue, thanks.
>
> Thanks Martin, I was going to do the same.
>
> Is it possible to make the DB partition (on the fastest device) too
> big? in other words is there a point where for a given set of OSDs
> (number + size) the DB partition is sized too large and is wasting
> resources. I recall a comment by someone proposing to split up a
> single large (fast) SSD into 100GB partitions for each OSD.
>

It depends on the size of your backing disk. The DB will grow for the amount of Objects you have on your OSD.

A 4TB drive will hold more objects then a 1TB drive (usually), same goes for a 10TB vs 6TB.

From what I've seen now there is no such thing as a 'too big' DB.

The tests I've done for now seem to suggest that filling up a 50GB DB is rather hard to do. But if you have Billions of Objects and thus tens of millions object per OSD.

Let's say the avg overhead is 16k you would need a 150GB DB for 10M objects.

You could look into your current numbers and check how many objects you have per OSD.

I checked a couple of Ceph clusters I run and see about 1M objects per OSD, but other only have 250k OSDs.

In all those cases even with 32k you would need a 30GB DB with 1M objects in that OSD.

> The answer could be couched as some intersection of pool type (RBD /
> RADOS / CephFS), object change(update?) intensity, size of OSD etc and
> rule-of-thumb.
>

I would check your running Ceph clusters and calculate the amount of objects per OSD.

total objects / num osd * 3

Wido

> An idea occurred to me that by monitoring for the logged spill message
> (the event when the DB partition spills/overflows to the OSD), OSDs
> could be (lazily) destroyed and recreated with a new DB partition
> increased in size say by 10% each time.
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Mark Nelson

2017-11-03 12:33:25 UTC

Permalink

On 11/03/2017 02:44 AM, Wido den Hollander wrote:
>
>> Op 3 november 2017 om 0:09 schreef Nigel Williams <***@tpac.org.au>:
>>
>>
>> On 3 November 2017 at 07:45, Martin Overgaard Hansen <***@multihouse.dk> wrote:
>>> I want to bring this subject back in the light and hope someone can provide
>>> insight regarding the issue, thanks.
>>
>> Thanks Martin, I was going to do the same.
>>
>> Is it possible to make the DB partition (on the fastest device) too
>> big? in other words is there a point where for a given set of OSDs
>> (number + size) the DB partition is sized too large and is wasting
>> resources. I recall a comment by someone proposing to split up a
>> single large (fast) SSD into 100GB partitions for each OSD.
>>
>
> It depends on the size of your backing disk. The DB will grow for the amount of Objects you have on your OSD.
>
> A 4TB drive will hold more objects then a 1TB drive (usually), same goes for a 10TB vs 6TB.
>
> From what I've seen now there is no such thing as a 'too big' DB.
>
> The tests I've done for now seem to suggest that filling up a 50GB DB is rather hard to do. But if you have Billions of Objects and thus tens of millions object per OSD.

Are you doing RBD, RGW, or something else to test? What size are the
objets and are you fragmenting them?
>
> Let's say the avg overhead is 16k you would need a 150GB DB for 10M objects.
>
> You could look into your current numbers and check how many objects you have per OSD.
>
> I checked a couple of Ceph clusters I run and see about 1M objects per OSD, but other only have 250k OSDs.
>
> In all those cases even with 32k you would need a 30GB DB with 1M objects in that OSD.
>
>> The answer could be couched as some intersection of pool type (RBD /
>> RADOS / CephFS), object change(update?) intensity, size of OSD etc and
>> rule-of-thumb.
>>
>
> I would check your running Ceph clusters and calculate the amount of objects per OSD.
>
> total objects / num osd * 3

One nagging concern I have in the back of my mind is that the amount of
space amplification in rocksdb might grow with the number of levels (ie
the number of objects). The space used per object might be different at
10M objects and 50M objects.

>
> Wido
>
>> An idea occurred to me that by monitoring for the logged spill message
>> (the event when the DB partition spills/overflows to the OSD), OSDs
>> could be (lazily) destroyed and recreated with a new DB partition
>> increased in size say by 10% each time.
>> _______________________________________________
>> ceph-users mailing list
>> ceph-***@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

Wido den Hollander

2017-11-03 13:25:09 UTC

Permalink

> Op 3 november 2017 om 13:33 schreef Mark Nelson <***@redhat.com>:
>
>
>
>
> On 11/03/2017 02:44 AM, Wido den Hollander wrote:
> >
> >> Op 3 november 2017 om 0:09 schreef Nigel Williams <***@tpac.org.au>:
> >>
> >>
> >> On 3 November 2017 at 07:45, Martin Overgaard Hansen <***@multihouse.dk> wrote:
> >>> I want to bring this subject back in the light and hope someone can provide
> >>> insight regarding the issue, thanks.
> >>
> >> Thanks Martin, I was going to do the same.
> >>
> >> Is it possible to make the DB partition (on the fastest device) too
> >> big? in other words is there a point where for a given set of OSDs
> >> (number + size) the DB partition is sized too large and is wasting
> >> resources. I recall a comment by someone proposing to split up a
> >> single large (fast) SSD into 100GB partitions for each OSD.
> >>
> >
> > It depends on the size of your backing disk. The DB will grow for the amount of Objects you have on your OSD.
> >
> > A 4TB drive will hold more objects then a 1TB drive (usually), same goes for a 10TB vs 6TB.
> >
> > From what I've seen now there is no such thing as a 'too big' DB.
> >
> > The tests I've done for now seem to suggest that filling up a 50GB DB is rather hard to do. But if you have Billions of Objects and thus tens of millions object per OSD.
>
> Are you doing RBD, RGW, or something else to test? What size are the
> objets and are you fragmenting them?
> >
> > Let's say the avg overhead is 16k you would need a 150GB DB for 10M objects.
> >
> > You could look into your current numbers and check how many objects you have per OSD.
> >
> > I checked a couple of Ceph clusters I run and see about 1M objects per OSD, but other only have 250k OSDs.
> >
> > In all those cases even with 32k you would need a 30GB DB with 1M objects in that OSD.
> >
> >> The answer could be couched as some intersection of pool type (RBD /
> >> RADOS / CephFS), object change(update?) intensity, size of OSD etc and
> >> rule-of-thumb.
> >>
> >
> > I would check your running Ceph clusters and calculate the amount of objects per OSD.
> >
> > total objects / num osd * 3
>
> One nagging concern I have in the back of my mind is that the amount of
> space amplification in rocksdb might grow with the number of levels (ie
> the number of objects). The space used per object might be different at
> 10M objects and 50M objects.
>

True. But how many systems do we have out there with 10M objects in ONE OSD?

The systems I checked range from 250k to 1M objects per OSD. Ofcourse, but statistics aren't the golden rule, but users will want some guideline on how to size their DB.

WAL should be sufficient with 1GB~2GB, right?

Wido

> >
> > Wido
> >
> >> An idea occurred to me that by monitoring for the logged spill message
> >> (the event when the DB partition spills/overflows to the OSD), OSDs
> >> could be (lazily) destroyed and recreated with a new DB partition
> >> increased in size say by 10% each time.
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-***@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > _______________________________________________
> > ceph-users mailing list
> > ceph-***@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Mark Nelson

2017-11-03 13:43:39 UTC

Permalink

On 11/03/2017 08:25 AM, Wido den Hollander wrote:
>
>> Op 3 november 2017 om 13:33 schreef Mark Nelson <***@redhat.com>:
>>
>>
>>
>>
>> On 11/03/2017 02:44 AM, Wido den Hollander wrote:
>>>
>>>> Op 3 november 2017 om 0:09 schreef Nigel Williams <***@tpac.org.au>:
>>>>
>>>>
>>>> On 3 November 2017 at 07:45, Martin Overgaard Hansen <***@multihouse.dk> wrote:
>>>>> I want to bring this subject back in the light and hope someone can provide
>>>>> insight regarding the issue, thanks.
>>>>
>>>> Thanks Martin, I was going to do the same.
>>>>
>>>> Is it possible to make the DB partition (on the fastest device) too
>>>> big? in other words is there a point where for a given set of OSDs
>>>> (number + size) the DB partition is sized too large and is wasting
>>>> resources. I recall a comment by someone proposing to split up a
>>>> single large (fast) SSD into 100GB partitions for each OSD.
>>>>
>>>
>>> It depends on the size of your backing disk. The DB will grow for the amount of Objects you have on your OSD.
>>>
>>> A 4TB drive will hold more objects then a 1TB drive (usually), same goes for a 10TB vs 6TB.
>>>
>>> From what I've seen now there is no such thing as a 'too big' DB.
>>>
>>> The tests I've done for now seem to suggest that filling up a 50GB DB is rather hard to do. But if you have Billions of Objects and thus tens of millions object per OSD.
>>
>> Are you doing RBD, RGW, or something else to test? What size are the
>> objets and are you fragmenting them?
>>>
>>> Let's say the avg overhead is 16k you would need a 150GB DB for 10M objects.
>>>
>>> You could look into your current numbers and check how many objects you have per OSD.
>>>
>>> I checked a couple of Ceph clusters I run and see about 1M objects per OSD, but other only have 250k OSDs.
>>>
>>> In all those cases even with 32k you would need a 30GB DB with 1M objects in that OSD.
>>>
>>>> The answer could be couched as some intersection of pool type (RBD /
>>>> RADOS / CephFS), object change(update?) intensity, size of OSD etc and
>>>> rule-of-thumb.
>>>>
>>>
>>> I would check your running Ceph clusters and calculate the amount of objects per OSD.
>>>
>>> total objects / num osd * 3
>>
>> One nagging concern I have in the back of my mind is that the amount of
>> space amplification in rocksdb might grow with the number of levels (ie
>> the number of objects). The space used per object might be different at
>> 10M objects and 50M objects.
>>
>
> True. But how many systems do we have out there with 10M objects in ONE OSD?
>
> The systems I checked range from 250k to 1M objects per OSD. Ofcourse, but statistics aren't the golden rule, but users will want some guideline on how to size their DB.

That's actually something I would really like better insight into. I
don't feel like I have a sufficient understanding of how many
objects/OSD people are really deploying in the field. I figure 10M/OSD
is probably a reasonable "typical" upper limit for HDDs, but I could see
some use cases with flash backed SSDs pushing far more.

>
> WAL should be sufficient with 1GB~2GB, right?

Yep. On the surface this appears to be a simple question, but a much
deeper question is what are we actually doing with the WAL? How should
we be storing PG log and dup ops data? How can we get away from the
large WAL buffers and memtables we have now? These are questions we are
actively working on solving. For the moment though, having multiple (4)
256MB WAL buffers appears to give us the best performance despite
resulting in large memtables, so 1-2GB for the WAL is right.

Mark

>
> Wido
>
>>>
>>> Wido
>>>
>>>> An idea occurred to me that by monitoring for the logged spill message
>>>> (the event when the DB partition spills/overflows to the OSD), OSDs
>>>> could be (lazily) destroyed and recreated with a new DB partition
>>>> increased in size say by 10% each time.
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-***@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-***@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-***@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Wido den Hollander

2017-11-03 13:59:17 UTC

Permalink

> Op 3 november 2017 om 14:43 schreef Mark Nelson <***@redhat.com>:
>
>
>
>
> On 11/03/2017 08:25 AM, Wido den Hollander wrote:
> >
> >> Op 3 november 2017 om 13:33 schreef Mark Nelson <***@redhat.com>:
> >>
> >>
> >>
> >>
> >> On 11/03/2017 02:44 AM, Wido den Hollander wrote:
> >>>
> >>>> Op 3 november 2017 om 0:09 schreef Nigel Williams <***@tpac.org.au>:
> >>>>
> >>>>
> >>>> On 3 November 2017 at 07:45, Martin Overgaard Hansen <***@multihouse.dk> wrote:
> >>>>> I want to bring this subject back in the light and hope someone can provide
> >>>>> insight regarding the issue, thanks.
> >>>>
> >>>> Thanks Martin, I was going to do the same.
> >>>>
> >>>> Is it possible to make the DB partition (on the fastest device) too
> >>>> big? in other words is there a point where for a given set of OSDs
> >>>> (number + size) the DB partition is sized too large and is wasting
> >>>> resources. I recall a comment by someone proposing to split up a
> >>>> single large (fast) SSD into 100GB partitions for each OSD.
> >>>>
> >>>
> >>> It depends on the size of your backing disk. The DB will grow for the amount of Objects you have on your OSD.
> >>>
> >>> A 4TB drive will hold more objects then a 1TB drive (usually), same goes for a 10TB vs 6TB.
> >>>
> >>> From what I've seen now there is no such thing as a 'too big' DB.
> >>>
> >>> The tests I've done for now seem to suggest that filling up a 50GB DB is rather hard to do. But if you have Billions of Objects and thus tens of millions object per OSD.
> >>
> >> Are you doing RBD, RGW, or something else to test? What size are the
> >> objets and are you fragmenting them?
> >>>
> >>> Let's say the avg overhead is 16k you would need a 150GB DB for 10M objects.
> >>>
> >>> You could look into your current numbers and check how many objects you have per OSD.
> >>>
> >>> I checked a couple of Ceph clusters I run and see about 1M objects per OSD, but other only have 250k OSDs.
> >>>
> >>> In all those cases even with 32k you would need a 30GB DB with 1M objects in that OSD.
> >>>
> >>>> The answer could be couched as some intersection of pool type (RBD /
> >>>> RADOS / CephFS), object change(update?) intensity, size of OSD etc and
> >>>> rule-of-thumb.
> >>>>
> >>>
> >>> I would check your running Ceph clusters and calculate the amount of objects per OSD.
> >>>
> >>> total objects / num osd * 3
> >>
> >> One nagging concern I have in the back of my mind is that the amount of
> >> space amplification in rocksdb might grow with the number of levels (ie
> >> the number of objects). The space used per object might be different at
> >> 10M objects and 50M objects.
> >>
> >
> > True. But how many systems do we have out there with 10M objects in ONE OSD?
> >
> > The systems I checked range from 250k to 1M objects per OSD. Ofcourse, but statistics aren't the golden rule, but users will want some guideline on how to size their DB.
>
> That's actually something I would really like better insight into. I
> don't feel like I have a sufficient understanding of how many
> objects/OSD people are really deploying in the field. I figure 10M/OSD
> is probably a reasonable "typical" upper limit for HDDs, but I could see
> some use cases with flash backed SSDs pushing far more.

Would a poll on the ceph-users list work? I understand that you require such feedback to make a proper judgement.

I know of one cluster which has 10M objects (heavy, heavy, heavy RGW user) in about 400TB of data.

All other clusters I've seen aren't that high on the amount of Objects. They are usually high on data since they have a RBD use-case which is a lot of 4M objects.

You could also ask users to use this tool: https://github.com/42on/ceph-collect

That tarball would give you a lot of information about the cluster and the amount of objects per OSD and PG.

Wido

>
> >
> > WAL should be sufficient with 1GB~2GB, right?
>
> Yep. On the surface this appears to be a simple question, but a much
> deeper question is what are we actually doing with the WAL? How should
> we be storing PG log and dup ops data? How can we get away from the
> large WAL buffers and memtables we have now? These are questions we are
> actively working on solving. For the moment though, having multiple (4)
> 256MB WAL buffers appears to give us the best performance despite
> resulting in large memtables, so 1-2GB for the WAL is right.
>
> Mark
>
> >
> > Wido
> >
> >>>
> >>> Wido
> >>>
> >>>> An idea occurred to me that by monitoring for the logged spill message
> >>>> (the event when the DB partition spills/overflows to the OSD), OSDs
> >>>> could be (lazily) destroyed and recreated with a new DB partition
> >>>> increased in size say by 10% each time.
> >>>> _______________________________________________
> >>>> ceph-users mailing list
> >>>> ceph-***@lists.ceph.com
> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>> _______________________________________________
> >>> ceph-users mailing list
> >>> ceph-***@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-***@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Maged Mokhtar

2017-11-03 14:26:47 UTC

Permalink

On 2017-11-03 15:59, Wido den Hollander wrote:

> Op 3 november 2017 om 14:43 schreef Mark Nelson <***@redhat.com>:
>
> On 11/03/2017 08:25 AM, Wido den Hollander wrote:
> Op 3 november 2017 om 13:33 schreef Mark Nelson <***@redhat.com>:
>
> On 11/03/2017 02:44 AM, Wido den Hollander wrote:
> Op 3 november 2017 om 0:09 schreef Nigel Williams <***@tpac.org.au>:
>
> On 3 November 2017 at 07:45, Martin Overgaard Hansen <***@multihouse.dk> wrote: I want to bring this subject back in the light and hope someone can provide
> insight regarding the issue, thanks.
> Thanks Martin, I was going to do the same.
>
> Is it possible to make the DB partition (on the fastest device) too
> big? in other words is there a point where for a given set of OSDs
> (number + size) the DB partition is sized too large and is wasting
> resources. I recall a comment by someone proposing to split up a
> single large (fast) SSD into 100GB partitions for each OSD.

It depends on the size of your backing disk. The DB will grow for the
amount of Objects you have on your OSD.

A 4TB drive will hold more objects then a 1TB drive (usually), same goes
for a 10TB vs 6TB.

From what I've seen now there is no such thing as a 'too big' DB.

The tests I've done for now seem to suggest that filling up a 50GB DB is
rather hard to do. But if you have Billions of Objects and thus tens of
millions object per OSD.
Are you doing RBD, RGW, or something else to test? What size are the
objets and are you fragmenting them?

> Let's say the avg overhead is 16k you would need a 150GB DB for 10M objects.
>
> You could look into your current numbers and check how many objects you have per OSD.
>
> I checked a couple of Ceph clusters I run and see about 1M objects per OSD, but other only have 250k OSDs.
>
> In all those cases even with 32k you would need a 30GB DB with 1M objects in that OSD.
>
>> The answer could be couched as some intersection of pool type (RBD /
>> RADOS / CephFS), object change(update?) intensity, size of OSD etc and
>> rule-of-thumb.
>
> I would check your running Ceph clusters and calculate the amount of objects per OSD.
>
> total objects / num osd * 3

One nagging concern I have in the back of my mind is that the amount of
space amplification in rocksdb might grow with the number of levels (ie
the number of objects). The space used per object might be different at
10M objects and 50M objects.

True. But how many systems do we have out there with 10M objects in ONE
OSD?

The systems I checked range from 250k to 1M objects per OSD. Ofcourse,
but statistics aren't the golden rule, but users will want some
guideline on how to size their DB.
That's actually something I would really like better insight into. I
don't feel like I have a sufficient understanding of how many
objects/OSD people are really deploying in the field. I figure 10M/OSD
is probably a reasonable "typical" upper limit for HDDs, but I could see

some use cases with flash backed SSDs pushing far more.
Would a poll on the ceph-users list work? I understand that you require
such feedback to make a proper judgement.

I know of one cluster which has 10M objects (heavy, heavy, heavy RGW
user) in about 400TB of data.

All other clusters I've seen aren't that high on the amount of Objects.
They are usually high on data since they have a RBD use-case which is a
lot of 4M objects.

You could also ask users to use this tool:
https://github.com/42on/ceph-collect

That tarball would give you a lot of information about the cluster and
the amount of objects per OSD and PG.

Wido

>> WAL should be sufficient with 1GB~2GB, right?
>
> Yep. On the surface this appears to be a simple question, but a much
> deeper question is what are we actually doing with the WAL? How should
> we be storing PG log and dup ops data? How can we get away from the
> large WAL buffers and memtables we have now? These are questions we are
> actively working on solving. For the moment though, having multiple (4)
> 256MB WAL buffers appears to give us the best performance despite
> resulting in large memtables, so 1-2GB for the WAL is right.
>
> Mark
>
> Wido
>
> Wido
>
> An idea occurred to me that by monitoring for the logged spill message
> (the event when the DB partition spills/overflows to the OSD), OSDs
> could be (lazily) destroyed and recreated with a new DB partition
> increased in size say by 10% each time.
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-***@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-***@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

I agree with Wido that rbd is an abundant use case. At least we can
start with recommendations for it. in this case:

Number of objects per OSD = 750k per TB of OSD disk capacity

So for an avg 16k per object: 12G per TB. For 32k per object: 24G per
TB

Maged

Wido den Hollander

2018-01-30 09:23:15 UTC

Permalink

On 11/03/2017 02:43 PM, Mark Nelson wrote:
>
>
> On 11/03/2017 08:25 AM, Wido den Hollander wrote:
>>
>>> Op 3 november 2017 om 13:33 schreef Mark Nelson <***@redhat.com>:
>>>
>>>
>>>
>>>
>>> On 11/03/2017 02:44 AM, Wido den Hollander wrote:
>>>>
>>>>> Op 3 november 2017 om 0:09 schreef Nigel Williams
>>>>> <***@tpac.org.au>:
>>>>>
>>>>>
>>>>> On 3 November 2017 at 07:45, Martin Overgaard Hansen
>>>>> <***@multihouse.dk> wrote:
>>>>>> I want to bring this subject back in the light and hope someone
>>>>>> can provide
>>>>>> insight regarding the issue, thanks.
>>>>>
>>>>> Thanks Martin, I was going to do the same.
>>>>>
>>>>> Is it possible to make the DB partition (on the fastest device) too
>>>>> big? in other words is there a point where for a given set of OSDs
>>>>> (number + size) the DB partition is sized too large and is wasting
>>>>> resources. I recall a comment by someone proposing to split up a
>>>>> single large (fast) SSD into 100GB partitions for each OSD.
>>>>>
>>>>
>>>> It depends on the size of your backing disk. The DB will grow for
>>>> the amount of Objects you have on your OSD.
>>>>
>>>> A 4TB drive will hold more objects then a 1TB drive (usually), same
>>>> goes for a 10TB vs 6TB.
>>>>
>>>> From what I've seen now there is no such thing as a 'too big' DB.
>>>>
>>>> The tests I've done for now seem to suggest that filling up a 50GB
>>>> DB is rather hard to do. But if you have Billions of Objects and
>>>> thus tens of millions object per OSD.
>>>
>>> Are you doing RBD, RGW, or something else to test? What size are the
>>> objets and are you fragmenting them?
>>>>
>>>> Let's say the avg overhead is 16k you would need a 150GB DB for 10M
>>>> objects.
>>>>
>>>> You could look into your current numbers and check how many objects
>>>> you have per OSD.
>>>>
>>>> I checked a couple of Ceph clusters I run and see about 1M objects
>>>> per OSD, but other only have 250k OSDs.
>>>>
>>>> In all those cases even with 32k you would need a 30GB DB with 1M
>>>> objects in that OSD.
>>>>
>>>>> The answer could be couched as some intersection of pool type (RBD /
>>>>> RADOS / CephFS), object change(update?) intensity, size of OSD etc and
>>>>> rule-of-thumb.
>>>>>
>>>>
>>>> I would check your running Ceph clusters and calculate the amount of
>>>> objects per OSD.
>>>>
>>>> total objects / num osd * 3
>>>
>>> One nagging concern I have in the back of my mind is that the amount of
>>> space amplification in rocksdb might grow with the number of levels (ie
>>> the number of objects). The space used per object might be different at
>>> 10M objects and 50M objects.
>>>
>>
>> True. But how many systems do we have out there with 10M objects in
>> ONE OSD?
>>
>> The systems I checked range from 250k to 1M objects per OSD. Ofcourse,
>> but statistics aren't the golden rule, but users will want some
>> guideline on how to size their DB.
>
> That's actually something I would really like better insight into. I
> don't feel like I have a sufficient understanding of how many
> objects/OSD people are really deploying in the field. I figure 10M/OSD
> is probably a reasonable "typical" upper limit for HDDs, but I could see
> some use cases with flash backed SSDs pushing far more.
>

A few months later I've gathered some more data and wrote a script to
quickly query it on OSDs:
https://gist.github.com/wido/875d531692a922d608b9392e1766405d

I fetched information from a few systems running with BlueStore.

So far the larged value I found on systems running with RBD is 24k per
onode.

This OSD reported 70k onodes in it's database with a total DB size of
about 1.5GB

As most deployments I see out there are RBD those are the ones I can get
the most information from.

The avg object size I saw was 2.8MB.

So let's say you would like to fill a OSD with 2TB of data. With a avg
object size of 2.8M you would have 714k objects on that OSD.

714k objects * 24k per onode = 16GB DB

The rule of thumb I've been using now is 10GB DB per 1TB of OSD storage.
For now this seems to work out for me in all the cases I have seen.

I'm not saying it applies to every case, but the cases I've seen so faw
seem to hold up.

If your average object size drops you will get more onodes per TB and
thus have a larger DB.

I'm just trying to gather information so people designing their system
have something to work with.

Wido

>>
>> WAL should be sufficient with 1GB~2GB, right?
>
> Yep. On the surface this appears to be a simple question, but a much
> deeper question is what are we actually doing with the WAL? How should
> we be storing PG log and dup ops data? How can we get away from the
> large WAL buffers and memtables we have now? These are questions we are
> actively working on solving. For the moment though, having multiple (4)
> 256MB WAL buffers appears to give us the best performance despite
> resulting in large memtables, so 1-2GB for the WAL is right.
>
> Mark
>
>>
>> Wido
>>
>>>>
>>>> Wido
>>>>
>>>>> An idea occurred to me that by monitoring for the logged spill message
>>>>> (the event when the DB partition spills/overflows to the OSD), OSDs
>>>>> could be (lazily) destroyed and recreated with a new DB partition
>>>>> increased in size say by 10% each time.
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-***@lists.ceph.com
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-***@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-***@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Willem Jan Withagen

2017-11-03 14:01:35 UTC

Permalink

On 3-11-2017 00:09, Nigel Williams wrote:
> On 3 November 2017 at 07:45, Martin Overgaard Hansen <***@multihouse.dk> wrote:
>> I want to bring this subject back in the light and hope someone can provide
>> insight regarding the issue, thanks.

> Is it possible to make the DB partition (on the fastest device) too
> big? in other words is there a point where for a given set of OSDs
> (number + size) the DB partition is sized too large and is wasting
> resources. I recall a comment by someone proposing to split up a
> single large (fast) SSD into 100GB partitions for each OSD.

Waisting resources is probably relative.

SSD have a limitted lifetime. And Ceph is a seriously hard (ab)user of
the wear for SSDs.

Now if you over dimension the allocated space, it looks like it is not
used. But onderneath in the SSD firmware writting is spread out over all
cells of the SSD. So the wear is evely distibuted over all componets of
the SSD.

And by overcommitting you have thus prolonged the life of you SSD.

So it is either buy more now, but less replacing.
Or allocate stricktly, and replace sooner.

--WjW

Jorge Pinilla López

2017-11-03 09:08:45 UTC

Permalink

well I haven't found any recomendation eitherÂ but I think that
sometimes the SSD space is being wasted.

I was thinking about making an OSD from the rest of my SSD space, but it
wouldnt scale in case more speed is needed.

Other option I asked was to use bcache or a mix between bcache and small
DB partitions but I was only reply with corruption problems so I decided
not to do it.

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021535.html

I think a good idea would be to use the space needed to store the Hot DB
and the rest use it as a cache (at least a read cache)

I dont really know a lot about this topic but I think that maybe giving
50GB of a really expensive SSD is pointless with its only using 10GB.

El 02/11/2017 a las 21:45, Martin Overgaard Hansen escribiÃ³:

> Hi, it seems like Iâm in the same boat as everyone else in
> thisÂ particular thread.
>
> Iâm also unable to find any guidelines or recommendations regarding
> sizing of the wal and / or db.
>
> I want to bring this subject back in the light and hope someone can
> provide insight regarding the issue, thanks. Â
>
> Best Regards,
> Martin Overgaard Hansen
>
> MultiHouse IT Partner A/S
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
------------------------------------------------------------------------
*Jorge Pinilla LÃ³pez*
***@unizar.es
Estudiante de ingenieria informÃ¡tica
Becario del area de sistemas (SICUZ)
Universidad de Zaragoza
PGP-KeyID: A34331932EBC715A
<http://pgp.rediris.es:11371/pks/lookup?op=get&search=0xA34331932EBC715A>
------------------------------------------------------------------------

Mark Nelson

2017-11-03 13:22:59 UTC

Permalink

On 11/03/2017 04:08 AM, Jorge Pinilla López wrote:
> well I haven't found any recomendation either but I think that
> sometimes the SSD space is being wasted.

If someone wanted to write it, you could have bluefs share some of the
space on the drive for hot object data and release space as needed for
the DB. I'd very much recommend keeping the promotion rate incredibly low.

>
> I was thinking about making an OSD from the rest of my SSD space, but it
> wouldnt scale in case more speed is needed.

I think there's a temptation to try to shove more stuff on the SSD, but
honestly I'm not sure it's a great idea. These drives are already
handling WAL and DB traffic, potentially for multiple OSDs. If you have
a very read centric workload or are using drives with high write
endurance that's one thing. From a monetary perspective, think
carefully about how much drive endurance and mttf matter to you.

>
> Other option I asked was to use bcache or a mix between bcache and small
> DB partitions but I was only reply with corruption problems so I decided
> not to do it.
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021535.html
>
> I think a good idea would be to use the space needed to store the Hot DB
> and the rest use it as a cache (at least a read cache)

Given that bluestore is already storing all of the metadata in rocksdb,
putting the DB partition on flash is already going to buy you a lot.
Having said that, something that could let the DB and a cache
share/reclaim space on the SSD could be interesting. It won't be a cure
all, but at least could provide a small improvement so long as the
promotion overhead is kept very low.

>
> I dont really know a lot about this topic but I think that maybe giving
> 50GB of a really expensive SSD is pointless with its only using 10GB.

Think of it less as "space" and more of it as cells of write endurance.
That's really what you are buying. Whether that's a small drive with
high write endurance or a big drive with low write endurance. Some may
have better properties for reads, some may have power-loss-protection
that allows O_DSYNC writes to go much faster. As far as the WAL and DB
goes, it's all about how many writes you can get out of the drive before
it goes kaput.

>
> El 02/11/2017 a las 21:45, Martin Overgaard Hansen escribió:
>
>> Hi, it seems like I’m in the same boat as everyone else in
>> this particular thread.
>>
>> I’m also unable to find any guidelines or recommendations regarding
>> sizing of the wal and / or db.
>>
>> I want to bring this subject back in the light and hope someone can
>> provide insight regarding the issue, thanks.
>>
>> Best Regards,
>> Martin Overgaard Hansen
>>
>> MultiHouse IT Partner A/S
>>
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-***@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> ------------------------------------------------------------------------
> *Jorge Pinilla López*
> ***@unizar.es
> Estudiante de ingenieria informática
> Becario del area de sistemas (SICUZ)
> Universidad de Zaragoza
> PGP-KeyID: A34331932EBC715A
> <http://pgp.rediris.es:11371/pks/lookup?op=get&search=0xA34331932EBC715A>
> ------------------------------------------------------------------------
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>