Discussion: [ceph-users] Increasing pg_num
Chris Dunlop
2016-05-16 05:56:48 UTC
Hi,

I'm trying to understand the potential impact on an active cluster of
increasing pg_num/pgp_num.

The conventional wisdom, as gleaned from the mailing lists and general
google fu, seems to be to increase pg_num followed by pgp_num, both in
small increments, to the target size, using "osd max backfills" (and
perhaps "osd recovery max active"?) to control the rate and thus
performance impact of data movement.

I'd really like to understand what's going on rather than "cargo culting"
it.

I'm currently on Hammer, but I'm hoping the answers are broadly applicable
across all versions for others following the trail.

Why do we have both pg_num and pgp_num? Given the docs say "The pgp_num
should be equal to the pg_num": under what circumstances might you want
these different, apart from when actively increasing pg_num first then
increasing pgp_num to match? (If they're supposed to be always the same, why
not have a single parameter and do the "increase pg_num, then pgp_num"
within ceph's internals?)

What do "osd backfill scan min" and "osd backfill scan max" actually
control? The docs say "The minimum/maximum number of objects per backfill
scan" but what does this actually mean and how does it affect the impact (if
at all)?

Is "osd recovery max active" actually relevant to this situation? It's
mentioned in various places related to increasing pg_num/pgp_num but my
understanding is it's related to recovery (e.g. an OSD falls out and comes
back again and needs to catch up) rather than backfilling (migrating
PGs misplaced due to increasing pg_num, CRUSH map changes, etc.).

Previously (back in Dumpling days):

----
http://article.gmane.org/gmane.comp.file-systems.ceph.user/11490
----
From: Gregory Farnum
Subject: Re: Throttle pool pg_num/pgp_num increase impact
Newsgroups: gmane.comp.file-systems.ceph.user
Date: 2014-07-08 17:01:30 GMT
Should we be worried that the pg/pgp num increase on the bigger pool will
have a 300X larger impact?
The impact won't be 300 times bigger, but it will be bigger. There are two
things impacting your cluster here

1) the initial "split" of the affected PGs into multiple child PGs. You can
mitigate this by stepping through pg_num at small multiples.
2) the movement of data to its new location (when you adjust pgp_num). This
can be adjusted by setting the "OSD max backfills" and related parameters;
check the docs.
-Greg
----

Am I correct in thinking "small multiples" in this context is along the lines
of "1.1" rather than "2" or "4"?

Is there really much impact when increasing pg_num in a single large step
e.g. 1024 to 4096? If so, what causes this impact? An initial trial of
increasing pg_num by 10% (1024 to 1126) on one of my pools showed it
completed in a matter of tens of seconds, too short to really measure any
performance impact. But I'm concerned this could be exponential to the size
of the step such that increasing by a large step (e.g. the rest of the way
from 1126 to 4096) could cause problems.

Given the use of "osd max backfills" to limit the impact of the data
movement associated with increasing pgp_num, is there any advantage or
disadvantage to increasing pgp_num in small increments (e.g. 10% at a time)
vs "all at once", apart from small increments likely moving some data
multiple times? E.g. with a large step is there a higher potential for
problems if something else happens to the cluster at the same time (e.g. an OSD
dies) because the current state of the system is further from the expected
state, or something like that?

If small increments of pgp_num are advisable, should the process be
"increase pg_num by a small increment, increase pgp_num to match, repeat
until target reached", or is that no advantage to increasing pg_num (in
multiple small increments or single large step) to the target, then
increasing pgp_num in small increments to the target - and why?

Given that increasing pg_num/pgp_num seems almost inevitable for a growing
cluster, and that increasing these can be one of the most
performance-impacting operations you can perform on a cluster, perhaps a
document going into these details would be appropriate?

Cheers,

Chris
Wido den Hollander
2016-05-16 20:40:47 UTC
Post by Chris Dunlop
Hi,
I'm trying to understand the potential impact on an active cluster of
increasing pg_num/pgp_num.
The conventional wisdom, as gleaned from the mailing lists and general
google fu, seems to be to increase pg_num followed by pgp_num, both in
small increments, to the target size, using "osd max backfills" (and
perhaps "osd recovery max active"?) to control the rate and thus
performance impact of data movement.
I'd really like to understand what's going on rather than "cargo culting"
it.
I'm currently on Hammer, but I'm hoping the answers are broadly applicable
across all versions for others following the trail.
Why do we have both pg_num and pgp_num? Given the docs say "The pgp_num
should be equal to the pg_num": under what circumstances might you want
these different, apart from when actively increasing pg_num first then
increasing pgp_num to match? (If they're supposed to be always the same, why
not have a single parameter and do the "increase pg_num, then pgp_num"
within ceph's internals?)
pg_num is the actual number of PGs. You can increase this without any actual data moving.

pgp_num is the number CRUSH uses in the calculations. pgp_num can't be greater than pg_num for that reason.

You can slowly increase pgp_num to make sure not all your data moves at the same time.
Post by Chris Dunlop
What do "osd backfill scan min" and "osd backfill scan max" actually
control? The docs say "The minimum/maximum number of objects per backfill
scan" but what does this actually mean and how does it affect the impact (if
at all)?
The fewer objects it scans at once, the less I/O it causes. I don't play with those values too much.
Post by Chris Dunlop
Is "osd recovery max active" actually relevant to this situation? It's
mentioned in various places related to increasing pg_num/pgp_num but my
understanding is it's related to recovery (e.g. osd falls out and comes
back again and needs to catch up) rather than back filling (migrating
pgs misplaced due to increasing pg_num, crush map changes etc.)
----
http://article.gmane.org/gmane.comp.file-systems.ceph.user/11490
----
From: Gregory Farnum
Subject: Re: Throttle pool pg_num/pgp_num increase impact
Newsgroups: gmane.comp.file-systems.ceph.user
Date: 2014-07-08 17:01:30 GMT
Should we be worried that the pg/pgp num increase on the bigger pool will
have a 300X larger impact?
The impact won't be 300 times bigger, but it will be bigger. There are two
things impacting your cluster here
1) the initial "split" of the affected PGs into multiple child PGs. You can
mitigate this by stepping through pg_num at small multiples.
2) the movement of data to its new location (when you adjust pgp_num). This
can be adjusted by setting the "OSD max backfills" and related parameters;
check the docs.
-Greg
----
Am I correct in thinking "small multiples" in this context is along the lines
of "1.1" rather than "2" or "4"?
Is there really much impact when increasing pg_num in a single large step
e.g. 1024 to 4096? If so, what causes this impact? An initial trial of
increasing pg_num by 10% (1024 to 1126) on one of my pools showed it
completed in a matter of tens of seconds, too short to really measure any
performance impact. But I'm concerned this could be exponential to the size
of the step such that increasing by a large step (e.g. the rest of the way
from 1126 to 4096) could cause problems.
Given the use of "osd max backfills" to limit the impact of the data
movement associated with increasing pgp_num, is there any advantage or
disadvantage to increasing pgp_num in small increments (e.g. 10% at a time)
vs "all at once", apart from small increments likely moving some data
multiple times? E.g. with a large step is there a higher potential for
problems if something else happens to the cluster at the same time (e.g. an OSD
dies) because the current state of the system is further from the expected
state, or something like that?
If small increments of pgp_num are advisable, should the process be
"increase pg_num by a small increment, increase pgp_num to match, repeat
until target reached", or is that no advantage to increasing pg_num (in
multiple small increments or single large step) to the target, then
increasing pgp_num in small increments to the target - and why?
Given that increasing pg_num/pgp_num seem almost inevitable for a growing
cluster, and that increasing these can be one of the most
performance-impacting operations you can perform on a cluster, perhaps a
document going into these details would be appropriate?
Cheers,
Chris
Christian Balzer
2016-05-16 23:21:48 UTC
Hello,
Post by Wido den Hollander
Post by Chris Dunlop
Hi,
I'm trying to understand the potential impact on an active cluster of
increasing pg_num/pgp_num.
The conventional wisdom, as gleaned from the mailing lists and general
google fu, seems to be to increase pg_num followed by pgp_num, both in
small increments, to the target size, using "osd max backfills" (and
perhaps "osd recovery max active"?) to control the rate and thus
performance impact of data movement.
I'd really like to understand what's going on rather than "cargo
culting" it.
I'm currently on Hammer, but I'm hoping the answers are broadly
applicable across all versions for others following the trail.
Why do we have both pg_num and pgp_num? Given the docs say "The pgp_num
should be equal to the pg_num": under what circumstances might you want
these different, apart from when actively increasing pg_num first then
increasing pgp_num to match? (If they're supposed to be always the
same, why not have a single parameter and do the "increase pg_num,
then pgp_num" within ceph's internals?)
pg_num is the actual amount of PGs. This you can increase without any actual data moving.
Yes and no.

Increasing the pg_num will split PGs, which causes potentially massive I/O.
Also AFAIK that I/O isn't regulated by the various recovery and backfill
parameters.
That's probably why recent Ceph versions will only let you increase pg_num
in smallish increments.

Moving data (as in redistributing amongst the OSDs based on CRUSH) will
indeed not happen until pgp_num is also increased.


Regards,

Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
Chris Dunlop
2016-05-17 00:47:15 UTC
Post by Christian Balzer
Post by Wido den Hollander
pg_num is the actual amount of PGs. This you can increase without any
actual data moving.
Yes and no.
Increasing the pg_num will split PGs, which causes potentially massive I/O.
Also AFAIK that I/O isn't regulated by the various recovery and backfill
parameters.
Where is this potentially massive I/O coming from? I have this naive concept
that the PGs are mathematically-calculated buckets, so splitting them would
involve little or no I/O, although I can imagine there are management
overheads (cpu, memory) involved in correctly maintaining state during the
splitting process.
Post by Christian Balzer
That's probably why recent Ceph versions will only let you increase pg_num
in smallish increments.
Oh, I wasn't aware of that!

OK, so it looks like it's mon_osd_max_split_count, introduced by commit
d8ccd73. Unfortunately it seems to be missing from the Ceph docs. It's
mentioned in the SUSE docs:

https://www.suse.com/documentation/ses-2/singlehtml/book_storage_admin/book_storage_admin.html#storage.bp.cluster_mntc.add_pgnum

...although, if I'm understanding "mon_osd_max_split_count" correctly, their
script for calculating the maximum to which you can increase pg_num is
incorrect in that it's calculating "current pg_num +
mon_osd_max_split_count" when it should be "current pg_num +
(mon_osd_max_split_count * number of pool OSDs)".

Hmmm, is there a generic command-line(ish) way of determining the number of
OSDs involved in a pool?
Post by Christian Balzer
Moving data (as in redistributing amongst the OSD based on CRUSH) will
indeed not happen until pgp_num is also increased.
Christian Balzer
2016-05-17 01:41:52 UTC
Hello,
Post by Chris Dunlop
Post by Christian Balzer
Post by Wido den Hollander
pg_num is the actual amount of PGs. This you can increase without any
actual data moving.
Yes and no.
Increasing the pg_num will split PGs, which causes potentially massive
I/O. Also AFAIK that I/O isn't regulated by the various recovery and
backfill parameters.
Where is this potentially massive I/O coming from? I have this naive
concept that the PGs are mathematically-calculated buckets, so splitting
them would involve little or no I/O, although I can imagine there are
management overheads (cpu, memory) involved in correctly maintaining
state during the splitting process.
I would have thought "splitting" to be pretty unambiguous, in that it
involves moving data.

That's on top, of course, of the CPU/RAM resources needed when creating
those new PGs and having them peer.

Most of your questions would be easily answered if you spent a few
minutes with even the crappiest test cluster, observing things (with
atop and the like).

To wit, this is a test pool (12) created with 32 PGs and slightly filled
with data via rados bench:
---
# ls -la /var/lib/ceph/osd/ceph-8/current/ |grep "12\."
drwxr-xr-x 2 root root 4096 May 17 10:04 12.13_head
drwxr-xr-x 2 root root 4096 May 17 10:04 12.1e_head
drwxr-xr-x 2 root root 4096 May 17 10:04 12.b_head
# du -h /var/lib/ceph/osd/ceph-8/current/12.13_head/
121M /var/lib/ceph/osd/ceph-8/current/12.13_head/
---

After increasing that to 128 PGs we get this:
---
# ls -la /var/lib/ceph/osd/ceph-8/current/ |grep "12\."
drwxr-xr-x 2 root root 4096 May 17 10:18 12.13_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.1e_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.2b_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.33_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.3e_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.4b_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.53_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.5e_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.6b_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.73_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.7e_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.b_head
# du -h /var/lib/ceph/osd/ceph-8/current/12.13_head/
25M /var/lib/ceph/osd/ceph-8/current/12.13_head/
---

Now this was fairly uneventful even on my crappy test cluster, given the
small amount of data (which was mostly cached) and the fact that it's idle.

However, consider this with hundreds of GB per PG and a busy cluster and you
get the idea of where the massive and very disruptive I/O comes from.
Post by Chris Dunlop
Post by Christian Balzer
That's probably why recent Ceph versions will only let you increase
pg_num in smallish increments.
Oh, I wasn't aware of that!
Ok, so it looks like it's mon_osd_max_split_count, introduced by commit
d8ccd73. Unfortunately it seems to be missing from the ceph docs. It's
https://www.suse.com/documentation/ses-2/singlehtml/book_storage_admin/book_storage_admin.html#storage.bp.cluster_mntc.add_pgnum
...although, if I'm understanding "mon_osd_max_split_count" correctly,
their script for calculating the maximum to which you can increase
pg_num is incorrect in that it's calculating "current pg_num +
mon_osd_max_split_count" when it should be "current pg_num +
(mon_osd_max_split_count * number of pool OSDs)".
Hmmm, is there a generic command-line(ish) way of determining the number
of OSDs involved in a pool?
Unless you have a pool with a very small pg_num and a very large cluster
the answer usually tends to be "all of them".

And google ("ceph number of osds per pool") is your friend:

http://cephnotes.ksperis.com/blog/2015/02/23/get-the-number-of-placement-groups-per-osd

Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
Chris Dunlop
2016-05-17 02:12:02 UTC
Hi Christian,
Post by Christian Balzer
Most your questions would be easily answered if you did spend a few
minutes with even the crappiest test cluster and observing things (with
atop and the likes).
You're right of course. I'll set up a test cluster and start experimenting,
which I should have done before asking questions here.
Post by Christian Balzer
To wit, this is a test pool (12) created with 32 PGs and slightly filled
with data via rados bench:
---
# ls -la /var/lib/ceph/osd/ceph-8/current/ |grep "12\."
drwxr-xr-x 2 root root 4096 May 17 10:04 12.13_head
drwxr-xr-x 2 root root 4096 May 17 10:04 12.1e_head
drwxr-xr-x 2 root root 4096 May 17 10:04 12.b_head
# du -h /var/lib/ceph/osd/ceph-8/current/12.13_head/
121M /var/lib/ceph/osd/ceph-8/current/12.13_head/
---
---
# ls -la /var/lib/ceph/osd/ceph-8/current/ |grep "12\."
drwxr-xr-x 2 root root 4096 May 17 10:18 12.13_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.1e_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.2b_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.33_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.3e_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.4b_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.53_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.5e_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.6b_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.73_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.7e_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.b_head
# du -h /var/lib/ceph/osd/ceph-8/current/12.13_head/
25M /var/lib/ceph/osd/ceph-8/current/12.13_head/
---
Now this was fairly uneventful even on my crappy test cluster, given the
small amount of data (which was mostly cached) and the fact that it's idle.
However consider this with 100's of GB per PG and a busy cluster and you
get the idea where massive and very disruptive I/O comes from.
Per above, I'll experiment with this, but my first thought is I suspect
that's moving object/data files around rather than copying data, so the
overheads are in directory operations rather than data copies - not that
directory operations are free either of course.
Post by Christian Balzer
Post by Chris Dunlop
Hmmm, is there a generic command-line(ish) way of determining the number
of OSDs involved in a pool?
Unless you have a pool with a very small pg_num and a very large cluster
the answer usually tends to be "all of them".
Or, as in my case, several completely independent pools (i.e. different
OSDs) in the one cluster.
Post by Christian Balzer
http://cephnotes.ksperis.com/blog/2015/02/23/get-the-number-of-placement-groups-per-osd
Crap. And I was just looking at that very page yesterday, in the context of
the distribution of the PGs, and completely forgot about the SUM part.

Thanks for taking the time to respond.

Chris.
Christian Balzer
2016-05-17 04:27:52 UTC
Hello,
Post by Chris Dunlop
Hi Christian,
Post by Christian Balzer
Most your questions would be easily answered if you did spend a few
minutes with even the crappiest test cluster and observing things (with
atop and the likes).
You're right of course. I'll set up a test cluster and start
experimenting, which I should have done before asking questions here.
Post by Christian Balzer
To wit, this is a test pool (12) created with 32 PGs and slightly filled
with data via rados bench:
---
# ls -la /var/lib/ceph/osd/ceph-8/current/ |grep "12\."
drwxr-xr-x 2 root root 4096 May 17 10:04 12.13_head
drwxr-xr-x 2 root root 4096 May 17 10:04 12.1e_head
drwxr-xr-x 2 root root 4096 May 17 10:04 12.b_head
# du -h /var/lib/ceph/osd/ceph-8/current/12.13_head/
121M /var/lib/ceph/osd/ceph-8/current/12.13_head/
---
---
# ls -la /var/lib/ceph/osd/ceph-8/current/ |grep "12\."
drwxr-xr-x 2 root root 4096 May 17 10:18 12.13_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.1e_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.2b_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.33_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.3e_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.4b_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.53_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.5e_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.6b_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.73_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.7e_head
drwxr-xr-x 2 root root 4096 May 17 10:18 12.b_head
# du -h /var/lib/ceph/osd/ceph-8/current/12.13_head/
25M /var/lib/ceph/osd/ceph-8/current/12.13_head/
---
Now this was fairly uneventful even on my crappy test cluster, given
the small amount of data (which was mostly cached) and the fact that
it's idle.
However consider this with 100's of GB per PG and a busy cluster and
you get the idea where massive and very disruptive I/O comes from.
Per above, I'll experiment with this, but my first thought is I suspect
that's moving object/data files around rather than copying data, so the
overheads are in directory operations rather than data copies - not that
directory operations are free either of course.
That's correct, but given enough objects (and thus directory depths) and
most of all I/O contention in a busy cluster the impact is quite
pronounced.

Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
Chris Dunlop
2016-05-17 00:58:03 UTC
Post by Wido den Hollander
Post by Chris Dunlop
Why do we have both pg_num and pgp_num? Given the docs say "The pgp_num
should be equal to the pg_num": under what circumstances might you want
these different, apart from when actively increasing pg_num first then
increasing pgp_num to match? (If they're supposed to be always the same, why
not have a single parameter and do the "increase pg_num, then pgp_num"
within ceph's internals?)
pg_num is the actual amount of PGs. This you can increase without any actual data moving.
pgp_num is the number CRUSH uses in the calculations. pgp_num can't be greater than pg_num for that reason.
OK, I understand that from the docs. But why are they two separate
parameters? E.g., why might you increase pg_num and not pgp_num? Or are the
two parameters purely to separate splitting the PGs (pg_num) from moving
data around (pgp_num)?
Post by Wido den Hollander
You can slowly increase pgp_num to make sure not all your data moves at the same time.
Why slowly increase pgp_num rather than rely on "osd max backfills"? I.e.
what downsides are there to setting "osd max backfills" as appropriate,
increasing pg_num in small steps to the target, then increasing pgp_num to
the target in one step?

If you're slowly increasing pgp_num, is the recommendation to "increase
pg_num a bit, increase pgp_num a bit, repeat till target is reached" (and
thus potentially moving some data multiple times), or is the recommendation
to "increase pg_num a bit step by step to the target, then increase pgp_num
bit by bit to the target"?