Discussion:
[ceph-users] Sizing your MON storage with a large cluster
Wido den Hollander
2018-02-03 15:50:46 UTC
Permalink
Hi,

I just wanted to inform people about the fact that Monitor databases can
grow quite big when you have a large cluster which is performing a very
long rebalance.

I'm posting this on ceph-users and ceph-large as it applies to both, but
you'll see this sooner on a cluster with a lot of OSDs.

Some information:

- Version: Luminous 12.2.2
- Number of OSDs: 2175
- Data used: ~2PB

We are in the middle of migrating from FileStore to BlueStore and this
is causing a lot of PGs to backfill at the moment:

33488 active+clean
4802 active+undersized+degraded+remapped+backfill_wait
1670 active+remapped+backfill_wait
263 active+undersized+degraded+remapped+backfilling
250 active+recovery_wait+degraded
54 active+recovery_wait+degraded+remapped
27 active+remapped+backfilling
13 active+recovery_wait+undersized+degraded+remapped
2 active+recovering+degraded

This has been running for a few days now and it has caused this warning:

MON_DISK_BIG mons
srv-zmb03-05,srv-zmb04-05,srv-zmb05-05,srv-zmb06-05,srv-zmb07-05 are
using a lot of disk space
mon.srv-zmb03-05 is 31666 MB >= mon_data_size_warn (15360 MB)
mon.srv-zmb04-05 is 31670 MB >= mon_data_size_warn (15360 MB)
mon.srv-zmb05-05 is 31670 MB >= mon_data_size_warn (15360 MB)
mon.srv-zmb06-05 is 31897 MB >= mon_data_size_warn (15360 MB)
mon.srv-zmb07-05 is 31891 MB >= mon_data_size_warn (15360 MB)
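
If you want to keep an eye on this yourself, a minimal sketch -- the paths
assume a default data dir and cluster name "ceph", and that the mon ID
matches the short hostname, which may not hold on your setup:

  # Size of the monitor store on disk
  du -sh /var/lib/ceph/mon/ceph-$(hostname -s)/store.db

  # The threshold behind MON_DISK_BIG (mon_data_size_warn, 15360 MB by
  # default); run this on the mon host, it talks to the admin socket
  ceph daemon mon.$(hostname -s) config get mon_data_size_warn

  # The warning text itself
  ceph health detail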

This is to be expected, as MONs do not trim their store while one or more
PGs are not active+clean.

In this case we expected this and the MONs are each running on a 1TB
Intel DC-series SSD to make sure we do not run out of space before the
backfill finishes.

The cluster is spread out over racks and in CRUSH we replicate over
racks. Rack by rack we are wiping/destroying the OSDs and bringing them
back as BlueStore OSDs and letting the backfill handle everything.

In between we wait for the cluster to become HEALTH_OK (all PGs
active+clean) so that the Monitors can trim their database before we
start with the next rack.
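
Roughly, that wait step looks like the sketch below; the polling interval
is arbitrary and this is not the exact tooling we run:

  #!/bin/bash
  # Poll until the cluster reports HEALTH_OK, i.e. all PGs are active+clean
  # again and the MONs can trim their stores, before starting the next rack.
  until ceph health | grep -q HEALTH_OK; do
      ceph pg stat      # show backfill/recovery progress while waiting
      sleep 300
  done
  echo "HEALTH_OK reached, safe to wipe the next rack"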

I just want to warn and inform people about this. Under normal
circumstances a MON database isn't that big, but if you have a very long
period of backfills/recoveries and also have a large number of OSDs
you'll see the DB grow quite big.

This has improved significantly with Jewel and Luminous, but it is still
something to watch out for.

Make sure your MONs have enough free space to handle this!

Wido
Sage Weil
2018-02-03 16:03:14 UTC
Permalink
Post by Wido den Hollander
Make sure your MONs have enough free space to handle this!
Yes!

Just a side note that Joao has an elegant fix for this that allows the mon
to trim most of the space-consuming full osdmaps. It's still work in
progress but is likely to get backported to luminous.

sage
Wes Dillingham
2018-02-05 15:54:47 UTC
Permalink
Good data point on not trimming when non-active+clean PGs are present. So
am I reading this correctly? It grew to 32GB? Did it end up growing beyond
that, and what was the max? Also, is only ~18 PGs per OSD a reasonable number
of PGs per OSD? I would think about quadruple that would be ideal. Is this an
artifact of a steadily growing cluster, or a design choice?
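
By my rough math ~18 is just the raw PG count divided by the OSD count; the
per-OSD placement count is higher once replicas are included. A sketch of
the arithmetic, assuming 3x replicated pools (the pool size isn't stated in
the thread):

  # 40,569 PGs (sum of the states above) across 2,175 OSDs:
  #   40569 / 2175      ~ 18.7  -> PGs per OSD, counting each PG once
  #   40569 * 3 / 2175  ~ 56    -> PG copies per OSD with 3x replication
  # The actual per-OSD count is in the PGS column of:
  ceph osd df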
--
Respectfully,

Wes Dillingham
***@harvard.edu
Research Computing | Senior CyberInfrastructure Storage Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 204
Wido den Hollander
2018-02-05 19:21:42 UTC
Permalink
Post by Wes Dillingham
Good data point on not trimming when non-active+clean PGs are present. So
am I reading this correctly? It grew to 32GB? Did it end up growing beyond
that, and what was the max? Also, is only ~18 PGs per OSD a reasonable
number of PGs per OSD? I would think about quadruple that would be ideal.
Is this an artifact of a steadily growing cluster, or a design choice?
The backfills are still busy and the MONs are at 39GB right now. We still
have plenty of space left.

Regarding the PGs, it's a long story, but two-sided:

1. This is an archive cluster running on 8-core Atom CPUs to keep power
consumption low, so we went low on the number of PGs.
2. The system is still growing, and after adding OSDs recently we haven't
increased the number of PGs yet (the usual pg_num bump, sketched below).
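
When we do get to it, a minimal sketch of that bump, with a placeholder pool
name and target (the real pool names and counts aren't in this thread), done
in small steps since every increase causes splitting and extra backfill:

  # Hypothetical pool name and target pg_num; substitute your own values.
  POOL=rbd
  TARGET=8192

  ceph osd pool set "$POOL" pg_num  "$TARGET"
  ceph osd pool set "$POOL" pgp_num "$TARGET"

  # Watch the resulting splitting/backfill
  ceph -s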
Matthew Vernon
2018-02-09 13:49:57 UTC
Permalink
Post by Wes Dillingham
Good data point on not trimming when non-active+clean PGs are present.
So am I reading this correctly? It grew to 32GB? Did it end up growing
beyond that, and what was the max?
The largest Mon store size I've seen (in a 3000-OSD cluster) was about 66GB.

Regards,

Matthew
--
The Wellcome Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
Anthony D'Atri
2018-02-09 21:21:41 UTC
Permalink
Thanks, Wido -- words to live by.

I had all kinds of problems with mon DBs not compacting under Firefly, which really pointed out the benefit of having ample space on the mons -- and the necessity of having those DBs live on something faster than an LFF HDD.

I've had this happen when using ceph-gentle-reweight to slowly bring in a large population of new OSDs. Breaking that into phases helps a bunch, or setting a large -i interval.
Dan van der Ster
2018-02-28 17:15:54 UTC
Permalink
Hi Wido,

Are your mons using rocksdb or still leveldb?

Are your mon stores trimming back to a small size after HEALTH_OK was restored?

One v12.2.2 cluster here just started showing the "is using a lot of
disk space" warning on one of our mons. In fact all three mons are now
using >16GB. I tried compacting and resyncing an empty mon but those
don't trim anything -- there really is 16GB of data in the mon store for
this healthy cluster.
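
By compacting I mean roughly the sketch below (the mon ID is a placeholder
and the paths assume the default data dir and cluster name "ceph"); none of
it shrinks the store here:

  # Which key/value backend the mon store uses (the file may be absent on
  # mons originally created with leveldb)
  cat /var/lib/ceph/mon/ceph-$(hostname -s)/kv_backend

  # Online compaction of a running monitor
  ceph tell mon.$(hostname -s) compact

  # Offline compaction with the mon stopped (rocksdb-backed store assumed)
  systemctl stop ceph-mon@$(hostname -s)
  ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-$(hostname -s)/store.db compact
  systemctl start ceph-mon@$(hostname -s)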

(The mons on this cluster were using ~560MB before updating to
Luminous back in December.)

Any thoughts?

Cheers, Dan