Vladimir Brik
2018-11-14 19:45:24 UTC
Hello,
I have a Ceph 13.2.2 cluster consisting of 5 hosts, each with 16 HDDs and
4 SSDs. HDD OSDs have about 50 PGs each, while SSD OSDs have about 400
PGs each (a lot more pools use SSDs than HDDs). Servers are fairly
powerful: 48 HT cores, 192GB of RAM, and 2x25Gbps Ethernet.
The impression I got from the docs is that having more than 200 PGs per
OSD is not a good thing, but justifications were vague (no concrete
numbers), like increased peering time, increased resource consumption,
and possibly decreased recovery performance. None of these appeared to
be a significant problem in my testing, but the tests were very basic
and done on a pretty empty cluster under minimal load, so I worry I'll
run into trouble down the road.
Here are the questions I have:
- In practice, is it a big deal that some OSDs have ~400 PGs?
- In what situations would our cluster most likely fare significantly
better if I went through the trouble of re-creating pools so that no OSD
would have more than, say, ~100 PGs?
- What performance metrics could I monitor to detect possible issues due
to having too many PGs?
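
For reference, the per-OSD figures above follow from simple arithmetic: each pool contributes pg_num times its replica count, spread over the OSDs it maps to. Here is a minimal sketch of that estimate. The pool list is hypothetical (the real pg_num and replica sizes are not given here), and it assumes an even spread across the 20 SSD OSDs (5 hosts x 4 SSDs).

```python
def pg_replicas_per_osd(pools, num_osds):
    """Average number of PG replicas hosted by each OSD.

    pools: iterable of (pg_num, replica_size) pairs, one per pool.
    num_osds: number of OSDs those pools are mapped to.
    """
    total_replicas = sum(pg_num * size for pg_num, size in pools)
    return total_replicas / num_osds


# 5 hosts x 4 SSDs each, as described above.
SSD_OSDS = 5 * 4

# Hypothetical pools as (pg_num, replica_size); chosen only so the
# result lands near the ~400 PGs per SSD OSD mentioned above.
example_pools = [(1024, 3), (1024, 3), (512, 3), (128, 2)]

current = pg_replicas_per_osd(example_pools, SSD_OSDS)

# Factor by which total pg_num would have to shrink to reach ~100 per OSD.
target = 100
shrink_factor = current / target

print(f"average PG replicas per SSD OSD: {current:.1f}")
print(f"reduction needed for ~{target} per OSD: about {shrink_factor:.1f}x")
```

With these made-up pools, getting down to ~100 PGs per SSD OSD would mean cutting the total pg_num roughly fourfold, which is why I would have to re-create the pools.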
Thanks,
Vlad