Discussion:
Help rebalancing OSD usage, Luminus 1.2.2
Bryan Banister
2018-01-30 16:24:05 UTC
Hi all,

We are still very new to running a Ceph cluster and have run an RGW cluster for a while now (6-ish months); it mainly holds large DB backups (write once, read once, delete after N days). The system is now warning us about an OSD that is nearfull, so we went to look at the usage across OSDs. We are somewhat surprised at how imbalanced the usage is across the OSDs, with the lowest usage at 22% full, the highest at nearly 90%, and an almost linear usage pattern across the OSDs (though it looks to step in roughly 5% increments):

[***@carf-ceph-osd01 ~]# ceph osd df | sort -nk8
ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
77 hdd 7.27730 1.00000 7451G 1718G 5733G 23.06 0.43 32
73 hdd 7.27730 1.00000 7451G 1719G 5732G 23.08 0.43 31
3 hdd 7.27730 1.00000 7451G 2059G 5392G 27.63 0.52 27
46 hdd 7.27730 1.00000 7451G 2060G 5391G 27.65 0.52 32
48 hdd 7.27730 1.00000 7451G 2061G 5390G 27.66 0.52 25
127 hdd 7.27730 1.00000 7451G 2066G 5385G 27.73 0.52 31
42 hdd 7.27730 1.00000 7451G 2067G 5384G 27.74 0.52 42
107 hdd 7.27730 1.00000 7451G 2402G 5049G 32.24 0.61 34
56 hdd 7.27730 1.00000 7451G 2405G 5046G 32.28 0.61 37
51 hdd 7.27730 1.00000 7451G 2406G 5045G 32.29 0.61 30
106 hdd 7.27730 1.00000 7451G 2408G 5043G 32.31 0.61 29
81 hdd 7.27730 1.00000 7451G 2408G 5043G 32.32 0.61 25
123 hdd 7.27730 1.00000 7451G 2411G 5040G 32.37 0.61 35
47 hdd 7.27730 1.00000 7451G 2412G 5039G 32.37 0.61 29
122 hdd 7.27730 1.00000 7451G 2749G 4702G 36.90 0.69 30
84 hdd 7.27730 1.00000 7451G 2750G 4701G 36.91 0.69 35
114 hdd 7.27730 1.00000 7451G 2751G 4700G 36.92 0.69 26
82 hdd 7.27730 1.00000 7451G 2751G 4700G 36.92 0.69 43
103 hdd 7.27730 1.00000 7451G 2753G 4698G 36.94 0.69 39
36 hdd 7.27730 1.00000 7451G 2752G 4699G 36.94 0.69 37
105 hdd 7.27730 1.00000 7451G 2754G 4697G 36.97 0.69 26
14 hdd 7.27730 1.00000 7451G 3091G 4360G 41.49 0.78 31
2 hdd 7.27730 1.00000 7451G 3091G 4360G 41.49 0.78 43
8 hdd 7.27730 1.00000 7451G 3091G 4360G 41.49 0.78 37
20 hdd 7.27730 1.00000 7451G 3092G 4359G 41.50 0.78 28
60 hdd 7.27730 1.00000 7451G 3092G 4359G 41.50 0.78 29
69 hdd 7.27730 1.00000 7451G 3092G 4359G 41.50 0.78 37
110 hdd 7.27730 1.00000 7451G 3093G 4358G 41.51 0.78 38
68 hdd 7.27730 1.00000 7451G 3092G 4358G 41.51 0.78 34
76 hdd 7.27730 1.00000 7451G 3093G 4358G 41.51 0.78 28
99 hdd 7.27730 1.00000 7451G 3092G 4358G 41.51 0.78 34
50 hdd 7.27730 1.00000 7451G 3095G 4356G 41.54 0.78 35
95 hdd 7.27730 1.00000 7451G 3095G 4356G 41.54 0.78 31
0 hdd 7.27730 1.00000 7451G 3096G 4355G 41.55 0.78 36
125 hdd 7.27730 1.00000 7451G 3096G 4355G 41.55 0.78 34
128 hdd 7.27730 1.00000 7451G 3095G 4355G 41.55 0.78 37
94 hdd 7.27730 1.00000 7451G 3096G 4355G 41.55 0.78 33
63 hdd 7.27730 1.00000 7451G 3096G 4355G 41.56 0.78 41
30 hdd 7.27730 1.00000 7451G 3100G 4351G 41.60 0.78 31
26 hdd 7.27730 1.00000 7451G 3435G 4015G 46.11 0.87 30
64 hdd 7.27730 1.00000 7451G 3435G 4016G 46.11 0.87 42
57 hdd 7.27730 1.00000 7451G 3437G 4014G 46.12 0.87 29
33 hdd 7.27730 1.00000 7451G 3437G 4014G 46.13 0.87 27
65 hdd 7.27730 1.00000 7451G 3439G 4012G 46.15 0.87 29
109 hdd 7.27730 1.00000 7451G 3439G 4012G 46.16 0.87 39
11 hdd 7.27730 1.00000 7451G 3441G 4010G 46.18 0.87 32
121 hdd 7.27730 1.00000 7451G 3441G 4010G 46.18 0.87 46
78 hdd 7.27730 1.00000 7451G 3441G 4010G 46.18 0.87 36
13 hdd 7.27730 1.00000 7451G 3442G 4009G 46.19 0.87 40
115 hdd 7.27730 1.00000 7451G 3443G 4008G 46.21 0.87 33
41 hdd 7.27730 1.00000 7451G 3444G 4007G 46.22 0.87 37
49 hdd 7.27730 1.00000 7451G 3776G 3674G 50.68 0.95 34
71 hdd 7.27730 1.00000 7451G 3776G 3675G 50.68 0.95 36
97 hdd 7.27730 1.00000 7451G 3776G 3675G 50.68 0.95 26
17 hdd 7.27730 1.00000 7451G 3777G 3674G 50.70 0.95 35
75 hdd 7.27730 1.00000 7451G 3778G 3673G 50.70 0.95 41
1 hdd 7.27730 1.00000 7451G 3779G 3672G 50.71 0.95 40
79 hdd 7.27730 1.00000 7451G 3778G 3672G 50.71 0.95 42
54 hdd 7.27730 1.00000 7451G 3779G 3672G 50.72 0.95 39
58 hdd 7.27730 1.00000 7451G 3780G 3670G 50.74 0.95 41
7 hdd 7.27730 1.00000 7451G 3781G 3670G 50.74 0.95 40
21 hdd 7.27730 1.00000 7451G 3783G 3668G 50.77 0.95 27
31 hdd 7.27730 1.00000 7451G 3783G 3668G 50.77 0.95 34
67 hdd 7.27730 1.00000 7451G 3784G 3667G 50.79 0.95 33
43 hdd 7.27730 1.00000 7451G 4119G 3332G 55.28 1.04 36
72 hdd 7.27730 1.00000 7451G 4120G 3331G 55.30 1.04 45
74 hdd 7.27730 1.00000 7451G 4121G 3330G 55.31 1.04 32
102 hdd 7.27730 1.00000 7451G 4123G 3328G 55.33 1.04 35
34 hdd 7.27730 1.00000 7451G 4123G 3328G 55.33 1.04 37
111 hdd 7.27730 1.00000 7451G 4123G 3327G 55.34 1.04 40
44 hdd 7.27730 1.00000 7451G 4123G 3328G 55.34 1.04 41
27 hdd 7.27730 1.00000 7451G 4124G 3327G 55.35 1.04 44
39 hdd 7.27730 1.00000 7451G 4124G 3327G 55.35 1.04 36
55 hdd 7.27730 1.00000 7451G 4124G 3327G 55.35 1.04 45
80 hdd 7.27730 1.00000 7451G 4125G 3326G 55.36 1.04 35
116 hdd 7.27730 1.00000 7451G 4125G 3326G 55.37 1.04 47
98 hdd 7.27730 1.00000 7451G 4126G 3325G 55.38 1.04 41
132 hdd 7.27730 1.00000 7451G 4128G 3323G 55.40 1.04 43
89 hdd 7.27730 1.00000 7451G 4130G 3321G 55.43 1.04 44
6 hdd 7.27730 1.00000 7451G 4461G 2990G 59.87 1.12 32
91 hdd 7.27730 1.00000 7451G 4462G 2989G 59.88 1.12 39
124 hdd 7.27730 1.00000 7451G 4465G 2986G 59.92 1.12 30
28 hdd 7.27730 1.00000 7451G 4465G 2985G 59.93 1.12 32
92 hdd 7.27730 1.00000 7451G 4465G 2986G 59.93 1.12 41
10 hdd 7.27730 1.00000 7451G 4466G 2985G 59.94 1.13 36
25 hdd 7.27730 1.00000 7451G 4467G 2984G 59.95 1.13 35
85 hdd 7.27730 1.00000 7451G 4467G 2984G 59.95 1.13 38
12 hdd 7.27730 1.00000 7451G 4467G 2984G 59.96 1.13 46
22 hdd 7.27730 1.00000 7451G 4468G 2983G 59.96 1.13 40
40 hdd 7.27730 1.00000 7451G 4469G 2982G 59.98 1.13 43
53 hdd 7.27730 1.00000 7451G 4469G 2982G 59.98 1.13 33
88 hdd 7.27730 1.00000 7451G 4469G 2982G 59.98 1.13 36
118 hdd 7.27730 1.00000 7451G 4470G 2981G 59.99 1.13 39
86 hdd 7.27730 1.00000 7451G 4470G 2981G 59.99 1.13 40
90 hdd 7.27730 1.00000 7451G 4471G 2980G 60.01 1.13 48
100 hdd 7.27730 1.00000 7451G 4473G 2978G 60.02 1.13 34
112 hdd 7.27730 1.00000 7451G 4473G 2978G 60.03 1.13 35
24 hdd 7.27730 1.00000 7451G 4475G 2976G 60.06 1.13 36
117 hdd 7.27730 1.00000 7451G 4806G 2645G 64.49 1.21 34
66 hdd 7.27730 1.00000 7451G 4805G 2646G 64.49 1.21 37
119 hdd 7.27730 1.00000 7451G 4806G 2645G 64.50 1.21 41
93 hdd 7.27730 1.00000 7451G 4807G 2644G 64.51 1.21 34
16 hdd 7.27730 1.00000 7451G 4809G 2642G 64.54 1.21 38
101 hdd 7.27730 1.00000 7451G 4812G 2639G 64.58 1.21 36
104 hdd 7.27730 1.00000 7451G 4812G 2639G 64.58 1.21 33
15 hdd 7.27730 1.00000 7451G 4812G 2639G 64.58 1.21 39
133 hdd 7.27730 1.00000 7451G 4814G 2637G 64.61 1.21 34
4 hdd 7.27730 1.00000 7451G 4814G 2637G 64.61 1.21 38
62 hdd 7.27730 1.00000 7451G 4815G 2636G 64.62 1.21 39
9 hdd 7.27730 1.00000 7451G 4816G 2635G 64.63 1.21 46
59 hdd 7.27730 1.00000 7451G 4816G 2635G 64.64 1.21 38
38 hdd 7.27730 1.00000 7451G 4817G 2634G 64.65 1.21 42
131 hdd 7.27730 1.00000 7451G 5150G 2301G 69.12 1.30 42
32 hdd 7.27730 1.00000 7451G 5157G 2294G 69.21 1.30 42
96 hdd 7.27730 1.00000 7451G 5158G 2293G 69.22 1.30 41
83 hdd 7.27730 1.00000 7451G 5158G 2293G 69.23 1.30 40
37 hdd 7.27730 1.00000 7451G 5492G 1959G 73.70 1.38 30
108 hdd 7.27730 1.00000 7451G 5492G 1959G 73.71 1.38 35
129 hdd 7.27730 1.00000 7451G 5496G 1955G 73.75 1.38 42
18 hdd 7.27730 1.00000 7451G 5499G 1952G 73.80 1.39 37
5 hdd 7.27730 1.00000 7451G 5499G 1952G 73.80 1.39 38
130 hdd 7.27730 1.00000 7451G 5501G 1950G 73.82 1.39 41
35 hdd 7.27730 1.00000 7451G 5502G 1949G 73.83 1.39 39
70 hdd 7.27730 1.00000 7451G 5502G 1949G 73.84 1.39 46
45 hdd 7.27730 1.00000 7451G 5503G 1948G 73.86 1.39 35
126 hdd 7.27730 1.00000 7451G 5505G 1946G 73.88 1.39 42
120 hdd 7.27730 1.00000 7451G 5840G 1611G 78.37 1.47 39
23 hdd 7.27730 1.00000 7451G 5841G 1610G 78.39 1.47 40
52 hdd 7.27730 1.00000 7451G 5842G 1609G 78.40 1.47 45
61 hdd 7.27730 1.00000 7451G 5841G 1609G 78.40 1.47 41
29 hdd 7.27730 1.00000 7451G 6185G 1266G 83.01 1.56 46
87 hdd 7.27730 1.00000 7451G 6190G 1260G 83.08 1.56 43
113 hdd 7.27730 1.00000 7451G 6527G 924G 87.59 1.64 45
TOTAL 967T 515T 452T 53.27

MIN/MAX VAR: 0.43/1.64 STDDEV: 14.15

We don't want to shoot ourselves in the foot here, so thought a quick email out to the list would be wise to get some guidance. What's the best option to get the OSD usage rebalanced closer to even on this cluster?

Is it reweighting the OSDs?

Weight the bottom 25% up and the top 25% down?
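For example, would manually pulling the fullest OSD (osd.113 above) down a bit be the way to go, e.g. something like this, if I understand the reweight command correctly?

ceph osd reweight 113 0.90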

How do we mitigate this issue going forward?

Thanks for all help in this regard!
-Bryan



Janne Johansson
2018-01-31 13:52:45 UTC
Post by Bryan Banister
Hi all,
We are still very new to running a Ceph cluster and have run a RGW cluster
for a while now (6-ish mo), it mainly holds large DB backups (Write once,
read once, delete after N days). The system is now warning us about an OSD
that is near_full and so we went to look at the usage across OSDs. We are
somewhat surprised at how imbalanced the usage is across the OSDs, with the
lowest usage at 22% full, the highest at nearly 90%, and an almost linear
usage pattern across the OSDs (though it looks to step in roughly 5% increments):
ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
77 hdd 7.27730 1.00000 7451G 1718G 5733G 23.06 0.43 32
73 hdd 7.27730 1.00000 7451G 1719G 5732G 23.08 0.43 31
I noticed that the PGs (the last column there, which counts PGs per OSD, I gather) were kind of even,
so perhaps the objects that get into the PGs are very unbalanced in size?

But yes, using reweight to compensate for this should work for you.

ceph osd test-reweight-by-utilization

should be worth testing.
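If the test output looks sane, the same command without the "test-" prefix should, as far as I recall, apply the change:

ceph osd reweight-by-utilization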
--
May the most significant bit of your life be positive.
Bryan Banister
2018-01-31 14:58:09 UTC
Thanks for the response, Janne!

Here is what test-reweight-by-utilization gives me:

[***@carf-ceph-osd01 ~]# ceph osd test-reweight-by-utilization
no change
moved 12 / 4872 (0.246305%)
avg 36.6316
stddev 5.37535 -> 5.29218 (expected baseline 6.02961)
min osd.48 with 25 -> 25 pgs (0.682471 -> 0.682471 * mean)
max osd.90 with 48 -> 48 pgs (1.31034 -> 1.31034 * mean)

oload 120
max_change 0.05
max_change_osds 4
average_utilization 0.5273
overload_utilization 0.6327
osd.113 weight 1.0000 -> 0.9500
osd.87 weight 1.0000 -> 0.9500
osd.29 weight 1.0000 -> 0.9500
osd.52 weight 1.0000 -> 0.9500

I tried looking for documentation on this command to see if there is a way to increase the max_change or max_change_osds, but can't find any docs on how to do this!

Man:
Subcommand reweight-by-utilization reweight OSDs by utilization [overload-percentage-for-consideration, default 120].

Usage:

ceph osd reweight-by-utilization {<int[100-]>}
{--no-increasing}

The `ceph -h` output:
osd reweight-by-utilization {<int>} {<float>} {<int>} {--no-increasing}

What do those optional parameters do (e.g. {<int>} {<float>} {<int>} {--no-increasing})?
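From the help output I am guessing the three positional parameters map to the oload percentage, max_change and max_osds values shown above, so something like the following might allow a larger adjustment in one pass, though I have not confirmed that:

ceph osd test-reweight-by-utilization 120 0.10 20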

We could keep running this multiple times, but it would be nice to just rebalance everything in one shot so that usage gets back to pretty even.

Yes, these backup images do vary greatly in size, but I expected that, just through random PG allocation, all OSDs would still have accumulated roughly the same mix of small and large objects, so that the usage would be much closer to even. This usage is way imbalanced! So I still need to know how to mitigate this going forward. Should we increase the number of PGs in this pool?

[***@carf-ceph-osd01 ~]# ceph osd pool ls detail
[snip]
pool 14 'carf01.rgw.buckets.data' erasure size 3 min_size 2 crush_rule 7 object_hash rjenkins pg_num 512 pgp_num 512 last_change 3187 lfor 0/1005 flags hashpspool,nearfull stripe_width 8192 application rgw
[snip]

Given that this will move data around (I think), should we increase the pg_num and pgp_num first and then see how it looks?

Thanks,
-Bryan

Janne Johansson
2018-01-31 15:34:13 UTC
Post by Bryan Banister
Given that this will move data around (I think), should we increase the
pg_num and pgp_num first and then see how it looks?
I guess adding pgs and pgps will move stuff around too, but if the PGCALC
formula says you should have more, then that would still be a good
start. Still, a few manual reweights first to take the 85-90% ones down
might be good: some move operations are going to refuse to add things
to too-full OSDs, so you would not want to get accidentally bumped above
such a limit by temporary data created during the moves.

Also, don't bump pg_num like crazy; you can never go back down. Aim for getting
~100 PGs per OSD at most, and perhaps even then in smaller steps, so
that the creation (and evening out of data into the new, empty PGs) doesn't
kill normal client I/O performance in the meantime.
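E.g. something along these lines, repeated in steps rather than one big jump (untested, and adjust the pool name to whatever yours is):

ceph osd pool set carf01.rgw.buckets.data pg_num 640
ceph osd pool set carf01.rgw.buckets.data pgp_num 640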
--
May the most significant bit of your life be positive.
Bryan Banister
2018-02-12 21:19:48 UTC
Hi Janne and others,

We used the "ceph osd reweight-by-utilization" command to move a small amount of data off of the top four OSDs by utilization. Then we updated the pg_num and pgp_num on the pool from 512 to 1024, which started moving roughly 50% of the objects around as a result. The unfortunate issue is that the weights on the OSDs are still roughly equivalent, and the OSDs that are nearfull were still getting allocated objects during the rebalance backfill operations.

At this point I have made some massive changes to the weights of the OSDs in an attempt to stop Ceph from allocating any more data to OSDs that are getting close to full. Basically, the OSD with the lowest utilization remains weighted at 1, and the rest of the OSDs are now reduced in weight based on their percent usage relative to the least-utilized OSD (which was at 21% at the time). This means the OSD that is currently the most full, at 86%, now has a weight of only 0.33 (it was at 89% when the reweight was applied). I'm not sure this is a good idea, but it seemed like the only option I had. Please let me know if I'm making a bad situation worse!

I still have the question of how this happened in the first place, and how to prevent it from happening going forward without a lot of monitoring and reweighting on weekends/etc. to keep things balanced. It sounds like Ceph really expects that objects stored into a pool will be roughly the same size; is that right?

Our backups going into this pool have very large variation in size, so would it be better to create multiple pools based on expected size of objects and then put backups of similar size into each pool?

The backups also have basically the same names, with the only difference being the date on which they were taken (e.g. the backup name difference between subsequent days can be a single digit at times), so does this mean that large backups with basically the same name will end up being placed in the same PGs, based on the CRUSH calculation using the object name?
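I suppose we could check where a given backup object actually maps, to see whether similarly named objects end up in the same PG, with something like (if I'm reading the CLI right):

ceph osd map carf01.rgw.buckets.data <object-name>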

Thanks,
-Bryan

Bryan Stillwell
2018-02-13 18:43:26 UTC
Bryan,

Based off the information you've provided so far, I would say that your largest pool still doesn't have enough PGs.

If you originally had only 512 PGs for your largest pool (I'm guessing .rgw.buckets has 99% of your data), then on a balanced cluster you would have just ~11.5 PGs per OSD (3*512/133). That's way lower than the recommended 100 PGs/OSD.

Based on the number of disks and assuming your .rgw.buckets pool has 99% of the data, you should have around 4,096 PGs for that pool. You'll still end up with an uneven distribution, but the outliers shouldn't be as far out.
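That's roughly the standard PGCalc arithmetic, if I have it right: 133 OSDs * 100 target PGs per OSD / 3 copies ~= 4433, which rounds to the nearest power of two, 4096.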

Sage recently wrote a new balancer plugin that makes balancing a cluster something that happens automatically. He gave a great talk at LinuxConf Australia that you should check out; here's a link into the video where he talks about the balancer and the need for it:

http://youtu.be/GrStE7XSKFE

Even though your objects are fairly large, they are getting broken up into chunks that are spread across the cluster. You can see how large each of your PGs are with a command like this:

ceph pg dump | grep '[0-9]*\.[0-9a-f]*' | awk '{ print $1 "\t" $7 }' |sort -n -k2

You'll see that within a pool the PG sizes are fairly close to one another, but in your cluster the PGs are fairly large (~200GB would be my guess).

Bryan

Bryan Banister
2018-02-13 20:16:23 UTC
Thanks for the response Bryan!

Would it be good to go ahead and do the increase up to 4096 PGs for the pool, given that it's only at 52% done with the rebalance backfilling operations?

Thanks in advance!!
-Bryan

Bryan Stillwell
2018-02-13 20:26:57 UTC
It may work fine, but I would suggest limiting the number of operations going on at the same time.

Bryan

Bryan Banister
2018-02-16 19:11:59 UTC
Well I decided to try the increase in PGs to 4096 and that seems to have caused some issues:

2018-02-16 12:38:35.798911 mon.carf-ceph-osd01 [ERR] overall HEALTH_ERR 61802168/241154376 objects misplaced (25.628%); Reduced data availability: 2081 pgs inactive, 322 pgs peering; Degraded data redundancy: 552/241154376 objects degraded (0.000%), 3099 pgs unclean, 38 pgs degraded; 163 stuck requests are blocked > 4096 sec

The cluster is actively backfilling misplaced objects, but not all PGs are active at this point and many are stuck peering, stuck unclean, or have a state of unknown:
PG_AVAILABILITY Reduced data availability: 2081 pgs inactive, 322 pgs peering
pg 14.fae is stuck inactive for 253360.025730, current state activating+remapped, last acting [85,12,41]
pg 14.faf is stuck inactive for 253368.511573, current state unknown, last acting []
pg 14.fb0 is stuck peering since forever, current state peering, last acting [94,30,38]
pg 14.fb1 is stuck inactive for 253362.605886, current state activating+remapped, last acting [6,74,34]
[snip]

The health also shows a large number of degraded data redundancy PGs:
PG_DEGRADED Degraded data redundancy: 552/241154376 objects degraded (0.000%), 3099 pgs unclean, 38 pgs degraded
pg 14.fc7 is stuck unclean for 253368.511573, current state unknown, last acting []
pg 14.fc8 is stuck unclean for 531622.531271, current state active+remapped+backfill_wait, last acting [73,132,71]
pg 14.fca is stuck unclean for 420540.396199, current state active+remapped+backfill_wait, last acting [0,80,61]
pg 14.fcb is stuck unclean for 531622.421855, current state activating+remapped, last acting [70,26,75]
[snip]

We also now have a number of stuck requests:
REQUEST_STUCK 163 stuck requests are blocked > 4096 sec
69 ops are blocked > 268435 sec
66 ops are blocked > 134218 sec
28 ops are blocked > 67108.9 sec
osds 7,39,60,103,133 have stuck requests > 67108.9 sec
osds 5,12,13,28,33,40,55,56,61,64,69,70,75,83,92,96,110,114,119,122,123,129,131 have stuck requests > 134218 sec
osds 4,8,10,15,16,20,27,29,30,31,34,37,38,42,43,44,47,48,49,51,52,57,66,68,73,81,84,85,87,90,95,97,99,100,102,105,106,107,108,111,112,113,121,124,127,130,132 have stuck requests > 268435 sec
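Presumably we could see what those ops are actually blocked on via the admin socket on one of the listed OSDs (if I have the command right), e.g.:

ceph daemon osd.7 dump_blocked_ops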

I tried looking through the mailing list archive on how to solve the stuck requests, and it seems that restarting the OSDs is the right way?

At this point we have just been watching the backfills running and see a steady but slow decrease of misplaced objects. When the cluster is idle, the overall OSD disk utilization is not too bad at roughly 40% on the physical disks running these backfills.

However we still have our backups trying to push new images to the cluster. This worked OK for the first few days, but yesterday we were getting failure alerts. I checked the status of the RGW service and noticed that 2 of the 3 RGW civetweb servers were not responsive. I restarted the RGWs on the ones that appeared hung and that got them working for a while, but then the same condition happened. The RGWs seem to have recovered on their own now, but again the cluster is idle and only backfills are currently doing anything (that I can tell). I did see these log entries:
2018-02-15 16:46:07.541542 7fffe6c56700 1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7fffcec26700' had timed out after 600
2018-02-15 16:46:12.541613 7fffe6c56700 1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7fffdbc40700' had timed out after 600
2018-02-15 16:46:12.541629 7fffe6c56700 1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7fffcec26700' had timed out after 600
2018-02-15 16:46:17.541701 7fffe6c56700 1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7fffdbc40700' had timed out after 600

At this point we do not know how to proceed with recovery efforts. I tried looking at the Ceph docs and mailing list archives but wasn't able to determine the right path forward here.

Any help is appreciated,
-Bryan


David Turner
2018-02-16 19:21:23 UTC
Your problem might have been creating too many PGs at once. I generally
increase pg_num and pgp_num by no more than 256 at a time, making sure
that all PGs are created, peered, and healthy (other than backfilling) before the next increase.

To help you get back to a healthy state, let's start off by getting all of
your PGs peered. Go ahead and put a stop to backfilling, recovery,
scrubbing, etc.; those are all hindering the peering effort right now. The
more clients you can disable, the better.

ceph osd set nobackfill
ceph osd set norecover
ceph osd set noscrub
ceph osd set nodeep-scrub

After that, look at your peering PGs and find out what is blocking their
peering. This is where you might need to use `ceph osd down 23`
(assuming you needed to kick osd.23) to mark them down in the cluster and
let them re-assert themselves. Once you have all PGs done with peering, go
ahead and unset nobackfill and norecover and let the cluster start moving
data around. Whether to also unset noscrub and nodeep-scrub is optional and
up to you. I'll never say it's better to leave scrubbing off, but scrubbing
does use a fair bit of spindle time while you're trying to backfill.
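A rough sketch of that loop, with the commands from memory (so double-check them) and using one of the PGs and OSDs from your output above as an example:

ceph pg dump_stuck inactive    # list the PGs stuck peering/activating
ceph pg 14.fb0 query           # see which OSD the PG is waiting on
ceph osd down 94               # kick the blocking OSD so it re-asserts itself
ceph osd unset nobackfill      # once everything has peered
ceph osd unset norecover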
Bryan Banister
2018-02-16 19:53:26 UTC
Reply
Permalink
Raw Message
Thanks David,

I have set the nobackfill, norecover, noscrub, and nodeep-scrub options at this point and the backfills have stopped. I’ll also stop the backups from pushing into ceph for now.

I don’t want to make things worse, so I’m asking for some more guidance now.


1) In looking at a PG that is still peering or one that is “unknown”, Ceph complains that it doesn’t have that pgid:
pg 14.fb0 is stuck peering since forever, current state peering, last acting [94,30,38]
[***@carf-ceph-osd03 ~]# ceph pg 14.fb0 query
Error ENOENT: i don't have pgid 14.fb0
[***@carf-ceph-osd03 ~]#


2) One that is activating shows this for the recovery_state:
[***@carf-ceph-osd03 ~]# ceph pg 14.fe1 query | less
[snip]
"recovery_state": [
{
"name": "Started/Primary/Active",
"enter_time": "2018-02-13 14:33:21.406919",
"might_have_unfound": [
{
"osd": "84(0)",
"status": "not queried"
}
],
"recovery_progress": {
"backfill_targets": [
"56(0)",
"87(1)",
"88(2)"
],
"waiting_on_backfill": [],
"last_backfill_started": "MIN",
"backfill_info": {
"begin": "MIN",
"end": "MIN",
"objects": []
},
"peer_backfill_info": [],
"backfills_in_flight": [],
"recovering": [],
"pg_backend": {
"recovery_ops": [],
"read_ops": []
}
},
"scrub": {
"scrubber.epoch_start": "0",
"scrubber.active": false,
"scrubber.state": "INACTIVE",
"scrubber.start": "MIN",
"scrubber.end": "MIN",
"scrubber.subset_last_update": "0'0",
"scrubber.deep": false,
"scrubber.seed": 0,
"scrubber.waiting_on": 0,
"scrubber.waiting_on_whom": []
}
},
{
"name": "Started",
"enter_time": "2018-02-13 14:33:17.491148"
}
],
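For reference, the same recovery_state section can presumably be pulled straight out of the query JSON instead of paging through less (assuming jq is installed):
ceph pg 14.fe1 query | jq '.recovery_state'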


Sorry for all the hand holding, but how do I determine if I need to set an OSD as ‘down’ to fix the issues, and how does it go about re-asserting itself?

I again tried looking at the Ceph docs on troubleshooting OSDs but didn’t find any details. The man page also has no details.

Thanks again,
-Bryan

David Turner
2018-02-16 20:51:20 UTC
Reply
Permalink
Raw Message
Let me answer the questions I definitely know the answer to first, and then
we'll continue from there. If an OSD is blocking peering but is online, when
you mark it as down in the cluster it receives a message in its log saying it
was wrongly marked down and tells the mons it is online. That gets it to
stop what it was doing and start talking again. I referred to that as
re-asserting. If the OSD that you marked down doesn't mark itself back up
within a couple of minutes, restarting the OSD might be a good idea. Then
again, actually restarting the daemon could be bad, because the daemon is
busy doing something. With so many other places to work with to get
things going, actually restarting the daemons is probably something I would
wait to do for now.
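As a rough sketch of that loop (osd.94 and the default log path are only
examples):

ceph osd down 94
# the OSD should note in its log that it was wrongly marked down and
# then mark itself back up:
grep 'wrongly marked' /var/log/ceph/ceph-osd.94.log
ceph osd tree | grep 'osd.94 '    # confirm it is back up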

The reason the cluster doesn't know anything about the PG is that it's
still being created and hasn't actually finished creating. Starting with
some of the OSDs that you see with blocked requests would be a good idea.
Eventually you'll down an OSD and, when it comes back up, things will start
looking much better as PGs begin peering. Below is the list of OSDs from
your previous email; if they still have stuck requests, they'll be good
ones to start doing this to. On closer review, it's almost all of them...
but you have to start somewhere. Another possible place to start is to
look at the list of all of the peering PGs and see if there are any common
OSDs when you look at all of them at once (a rough one-liner for that is
sketched after the lists below). Some patterns may emerge and would be
good options to try.

osds 7,39,60,103,133 have stuck requests > 67108.9 sec
osds 5,12,13,28,33,40,55,56,61,64,69,70,75,83,92,96,110,114,119,122,123,129,131 have stuck requests > 134218 sec
osds 4,8,10,15,16,20,27,29,30,31,34,37,38,42,43,44,47,48,49,51,52,57,66,68,73,81,84,85,87,90,95,97,99,100,102,105,106,107,108,111,112,113,121,124,127,130,132 have stuck requests > 268435 sec
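A quick way to look for those common OSDs, assuming the health detail
output looks like the snippets you pasted, is to tally the acting sets of
the stuck-peering PGs:

ceph health detail | grep 'stuck peering' | grep -o '\[[0-9,]*\]' | tr -d '[]' | tr ',' '\n' | grep -v '^$' | sort -n | uniq -c | sort -rn
# the OSDs with the highest counts are the first ones worth marking down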
Bryan Banister
2018-02-16 21:15:00 UTC
Reply
Permalink
Raw Message
Thanks David,

Taking the combined list of all OSDs with stuck requests shows that a little over 50% of all OSDs are in this condition. There isn’t any discernible pattern that I can find, and they are spread across the three servers. All of the OSDs are online as far as the service is concerned.

I have also taken all of the PGs reported in the health detail output and looked for any that report “peering_blocked_by”, but none do, so I can’t tell whether any OSD is actually blocking the peering operation.
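For reference, a loop along these lines (taking the pg ids straight from the health detail output; the exact form is approximate) is what that check amounts to:

for pg in $(ceph health detail | awk '/stuck peering/ {print $2}' | sort -u); do
    echo -n "$pg: "
    ceph pg "$pg" query 2>/dev/null | grep -c peering_blocked_by
done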

As suggested, I got a report of all peering PGs:
[***@carf-ceph-osd01 ~]# ceph health detail | grep "pg " | grep peering | sort -k13
pg 14.fe0 is stuck peering since forever, current state peering, last acting [104,94,108]
pg 14.fe0 is stuck unclean since forever, current state peering, last acting [104,94,108]
pg 14.fbc is stuck peering since forever, current state peering, last acting [110,91,0]
pg 14.fd1 is stuck peering since forever, current state peering, last acting [130,62,111]
pg 14.fd1 is stuck unclean since forever, current state peering, last acting [130,62,111]
pg 14.fed is stuck peering since forever, current state peering, last acting [32,33,82]
pg 14.fed is stuck unclean since forever, current state peering, last acting [32,33,82]
pg 14.fee is stuck peering since forever, current state peering, last acting [37,96,68]
pg 14.fee is stuck unclean since forever, current state peering, last acting [37,96,68]
pg 14.fe8 is stuck peering since forever, current state peering, last acting [45,31,107]
pg 14.fe8 is stuck unclean since forever, current state peering, last acting [45,31,107]
pg 14.fc1 is stuck peering since forever, current state peering, last acting [59,124,39]
pg 14.ff2 is stuck peering since forever, current state peering, last acting [62,117,7]
pg 14.ff2 is stuck unclean since forever, current state peering, last acting [62,117,7]
pg 14.fe4 is stuck peering since forever, current state peering, last acting [84,55,92]
pg 14.fe4 is stuck unclean since forever, current state peering, last acting [84,55,92]
pg 14.fb0 is stuck peering since forever, current state peering, last acting [94,30,38]
pg 14.ffc is stuck peering since forever, current state peering, last acting [96,53,70]
pg 14.ffc is stuck unclean since forever, current state peering, last acting [96,53,70]

Some share common OSDs, but some OSDs are only listed once.

Should I try just marking OSDs with stuck requests down to see if that will re-assert them?

Thanks!!
-Bryan

From: David Turner [mailto:***@gmail.com]
Sent: Friday, February 16, 2018 2:51 PM
To: Bryan Banister <***@jumptrading.com>
Cc: Bryan Stillwell <***@godaddy.com>; Janne Johansson <***@gmail.com>; Ceph Users <ceph-***@lists.ceph.com>
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

Note: External Email
________________________________
The questions I definitely know the answer to first, and then we'll continue from there. If an OSD is blocking peering but is online, when you mark it as down in the cluster it receives a message in it's log saying it was wrongly marked down and tells the mons it is online. That gets it to stop what it was doing and start talking again. I referred to that as re-asserting. If the OSD that you marked down doesn't mark itself back up within a couple minutes, restarting the OSD might be a good idea. Then again actually restarting the daemon could be bad because the daemon is doing something. With as much potential for places to work with to get things going, actually restarting the daemons is probably something I would wait to do for now.

The reason the cluster doesn't know anything about the PG is because it's still creating and hasn't actually been created. Starting with some of the OSDs that you see with blocked requests would be a good idea. Eventually you'll down an OSD that when it comes back up things start looking much better as things start peering and getting better. Below are the list of OSDs you had from a previous email that if they're still there with stuck requests then they'll be good to start doing this to. On closer review, it's almost all of them... but you have to start somewhere. Another possible place to start with these is to look at a list of all of the peering PGs and see if there are any common OSDs when you look at all of them at once. Some patterns may emerge and would be good options to try.

osds 7,39,60,103,133 have stuck requests > 67108.9 sec
osds 5,12,13,28,33,40,55,56,61,64,69,70,75,83,92,96,110,114,119,122,123,129,131 have stuck requests > 134218 sec
osds 4,8,10,15,16,20,27,29,30,31,34,37,38,42,43,44,47,48,49,51,52,57,66,68,73,81,84,85,87,90,95,97,99,100,102,105,106,107,108,111,112,113,121,124,127,130,132 have stuck requests > 268435 sec


On Fri, Feb 16, 2018 at 2:53 PM Bryan Banister <***@jumptrading.com<mailto:***@jumptrading.com>> wrote:
Thanks David,

I have set the nobackfill, norecover, noscrub, and nodeep-scrub options at this point and the backfills have stopped. I’ll also stop the backups from pushing into ceph for now.

I don’t want to make things worse, so ask for some more guidance now.


1) In looking at a PG that is still peering or one that is “unknown”, Ceph complains that it doesn’t have that pgid:
pg 14.fb0 is stuck peering since forever, current state peering, last acting [94,30,38]
[***@carf-ceph-osd03 ~]# ceph pg 14.fb0 query
Error ENOENT: i don't have pgid 14.fb0
[***@carf-ceph-osd03 ~]#


2) One that is activating shows this for the recovery_state:
[***@carf-ceph-osd03 ~]# ceph pg 14.fe1 query | less
[snip]
"recovery_state": [
{
"name": "Started/Primary/Active",
"enter_time": "2018-02-13 14:33:21.406919",
"might_have_unfound": [
{
"osd": "84(0)",
"status": "not queried"
}
],
"recovery_progress": {
"backfill_targets": [
"56(0)",
"87(1)",
"88(2)"
],
"waiting_on_backfill": [],
"last_backfill_started": "MIN",
"backfill_info": {
"begin": "MIN",
"end": "MIN",
"objects": []
},
"peer_backfill_info": [],
"backfills_in_flight": [],
"recovering": [],
"pg_backend": {
"recovery_ops": [],
"read_ops": []
}
},
"scrub": {
"scrubber.epoch_start": "0",
"scrubber.active": false,
"scrubber.state": "INACTIVE",
"scrubber.start": "MIN",
"scrubber.end": "MIN",
"scrubber.subset_last_update": "0'0",
"scrubber.deep": false,
"scrubber.seed": 0,
"scrubber.waiting_on": 0,
"scrubber.waiting_on_whom": []
}
},
{
"name": "Started",
"enter_time": "2018-02-13 14:33:17.491148"
}
],


Sorry for all the hand holding, but how do I determine if I need to set an OSD as ‘down’ to fix the issues, and how does it go about re-asserting itself?

I again tried looking at the ceph docs on troubleshooting OSDs but didn’t find any details. Man page also has no details.

Thanks again,
-Bryan

From: David Turner [mailto:***@gmail.com<mailto:***@gmail.com>]
Sent: Friday, February 16, 2018 1:21 PM
To: Bryan Banister <***@jumptrading.com<mailto:***@jumptrading.com>>
Cc: Bryan Stillwell <***@godaddy.com<mailto:***@godaddy.com>>; Janne Johansson <***@gmail.com<mailto:***@gmail.com>>; Ceph Users <ceph-***@lists.ceph.com<mailto:ceph-***@lists.ceph.com>>

Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

Note: External Email
________________________________
Your problem might have been creating too many PGs at once. I generally increase pg_num and pgp_num by no more than 256 at a time. Making sure that all PGs are creating, peered, and healthy (other than backfilling).

To help you get back to a healthy state, let's start off by getting all of your PGs peered. Go ahead and put a stop to backfilling, recovery, scrubbing, etc. Those are all hindering the peering effort right now. The more clients you can disable is also better.

ceph osd set nobackfill
ceph osd set norecover
ceph osd set noscrub
ceph osd set nodeep-scrub

After that look at your peering PGs and find out what is blocking their peering. This is where you might need to be using `ceph osd down 23` (assuming you needed to kick osd.23) to mark them down in the cluster and let them re-assert themselves. Once you have all PGs done with peering, go ahead and unset nobackfill and norecovery and let the cluster start moving data around. Leaving noscrubbing and nodeep-scrubbing off is optional and up to you. I'll never say it's better to leave them off, but scrubbing does use a fair bit of spindles while you're trying to backfill.

On Fri, Feb 16, 2018 at 2:12 PM Bryan Banister <***@jumptrading.com<mailto:***@jumptrading.com>> wrote:
Well I decided to try the increase in PGs to 4096 and that seems to have caused some issues:

2018-02-16 12:38:35.798911 mon.carf-ceph-osd01 [ERR] overall HEALTH_ERR 61802168/241154376 objects misplaced (25.628%); Reduced data availability: 2081 pgs inactive, 322 pgs peering; Degraded data redundancy: 552/241154376 objects degraded (0.000%), 3099 pgs unclean, 38 pgs degraded; 163 stuck requests are blocked > 4096 sec

The cluster is actively backfilling misplaced objects, but not all PGs are active at this point and may are stuck peering, stuck unclean, or have a state of unknown:
PG_AVAILABILITY Reduced data availability: 2081 pgs inactive, 322 pgs peering
pg 14.fae is stuck inactive for 253360.025730, current state activating+remapped, last acting [85,12,41]
pg 14.faf is stuck inactive for 253368.511573, current state unknown, last acting []
pg 14.fb0 is stuck peering since forever, current state peering, last acting [94,30,38]
pg 14.fb1 is stuck inactive for 253362.605886, current state activating+remapped, last acting [6,74,34]
[snip]

The health also shows a large number of degraded data redundancy PGs:
PG_DEGRADED Degraded data redundancy: 552/241154376 objects degraded (0.000%), 3099 pgs unclean, 38 pgs degraded
pg 14.fc7 is stuck unclean for 253368.511573, current state unknown, last acting []
pg 14.fc8 is stuck unclean for 531622.531271, current state active+remapped+backfill_wait, last acting [73,132,71]
pg 14.fca is stuck unclean for 420540.396199, current state active+remapped+backfill_wait, last acting [0,80,61]
pg 14.fcb is stuck unclean for 531622.421855, current state activating+remapped, last acting [70,26,75]
[snip]

We also now have a number of stuck requests:
REQUEST_STUCK 163 stuck requests are blocked > 4096 sec
69 ops are blocked > 268435 sec
66 ops are blocked > 134218 sec
28 ops are blocked > 67108.9 sec
osds 7,39,60,103,133 have stuck requests > 67108.9 sec
osds 5,12,13,28,33,40,55,56,61,64,69,70,75,83,92,96,110,114,119,122,123,129,131 have stuck requests > 134218 sec
osds 4,8,10,15,16,20,27,29,30,31,34,37,38,42,43,44,47,48,49,51,52,57,66,68,73,81,84,85,87,90,95,97,99,100,102,105,106,107,108,111,112,113,121,124,127,130,132 have stuck requests > 268435 sec

I tried looking through the mailing list archive on how to solve the stuck requests, and it seems that restarting the OSDs is the right way?

At this point we have just been watching the backfills running and see a steady but slow decrease of misplaced objects. When the cluster is idle, the overall OSD disk utilization is not too bad at roughly 40% on the physical disks running these backfills.

However we still have our backups trying to push new images to the cluster. This worked ok for the first few days, but yesterday we were getting failure alerts. I checked the status of the RGW service and noticed that 2 of the 3 RGW civetweb servers where not responsive. I restarted the RGWs on the ones that appeared hung and that got them working for a while, but then the same condition happened. The RGWs seem to have recovered on their own now, but again the cluster is idle and only backfills are currently doing anything (that I can tell). I did see these log entries:
2018-02-15 16:46:07.541542 7fffe6c56700 1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7fffcec26700' had timed out after 600
2018-02-15 16:46:12.541613 7fffe6c56700 1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7fffdbc40700' had timed out after 600
2018-02-15 16:46:12.541629 7fffe6c56700 1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7fffcec26700' had timed out after 600
2018-02-15 16:46:17.541701 7fffe6c56700 1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7fffdbc40700' had timed out after 600

At this point we do not know to proceed with recovery efforts. I tried looking at the ceph docs and mail list archives but wasn’t able to determine the right path forward here.

Any help is appreciated,
-Bryan


From: Bryan Stillwell [mailto:***@godaddy.com<mailto:***@godaddy.com>]
Sent: Tuesday, February 13, 2018 2:27 PM

To: Bryan Banister <***@jumptrading.com<mailto:***@jumptrading.com>>; Janne Johansson <***@gmail.com<mailto:***@gmail.com>>
Cc: Ceph Users <ceph-***@lists.ceph.com<mailto:ceph-***@lists.ceph.com>>
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

Note: External Email
________________________________
It may work fine, but I would suggest limiting the number of operations going on at the same time.
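For example, something like this caps concurrent backfill/recovery work per OSD (the values here are just illustrative, tune to taste):

ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'

You can raise them again once things settle down.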

Bryan

From: Bryan Banister <***@jumptrading.com<mailto:***@jumptrading.com>>
Date: Tuesday, February 13, 2018 at 1:16 PM
To: Bryan Stillwell <***@godaddy.com<mailto:***@godaddy.com>>, Janne Johansson <***@gmail.com<mailto:***@gmail.com>>
Cc: Ceph Users <ceph-***@lists.ceph.com<mailto:ceph-***@lists.ceph.com>>
Subject: RE: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

Thanks for the response Bryan!

Would it be good to go ahead and do the increase up to 4096 PGs for the pool, given that it's only at 52% done with the rebalance backfilling operations?

Thanks in advance!!
-Bryan

-----Original Message-----
From: Bryan Stillwell [mailto:***@godaddy.com]
Sent: Tuesday, February 13, 2018 12:43 PM
To: Bryan Banister <***@jumptrading.com<mailto:***@jumptrading.com>>; Janne Johansson <***@gmail.com<mailto:***@gmail.com>>
Cc: Ceph Users <ceph-***@lists.ceph.com<mailto:ceph-***@lists.ceph.com>>
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

Note: External Email
-------------------------------------------------

Bryan,

Based off the information you've provided so far, I would say that your largest pool still doesn't have enough PGs.

If you originally had only 512 PGs for your largest pool (I'm guessing .rgw.buckets has 99% of your data), then on a balanced cluster you would have just ~11.5 PGs per OSD (3*512/133). That's way lower than the recommended 100 PGs/OSD.

Based on the number of disks and assuming your .rgw.buckets pool has 99% of the data, you should have around 4,096 PGs for that pool. You'll still end up with an uneven distribution, but the outliers shouldn't be as far out.
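A quick way to reproduce those numbers from a shell:

echo '3 * 512 / 133' | bc -l    # ~11.5 PGs per OSD with 512 PGs
echo '133 * 100 / 3' | bc       # ~4433, rounded down to a power of two: 4096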

Sage recently wrote a new balancer plugin that makes balancing a cluster something that happens automatically. He gave a great talk at LinuxConf Australia that you should check out, here's a link into the video where he talks about the balancer and the need for it:

http://youtu.be/GrStE7XSKFE
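If you want to try it on Luminous, enabling the mgr module looks roughly like this (crush-compat mode shown since upmap requires all clients to be Luminous; treat this as a sketch and check the docs for your exact release):

ceph mgr module enable balancer
ceph balancer mode crush-compat
ceph balancer on
ceph balancer status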

Even though your objects are fairly large, they are getting broken up into chunks that are spread across the cluster. You can see how large each of your PGs are with a command like this:

ceph pg dump | grep '[0-9]*\.[0-9a-f]*' | awk '{ print $1 "\t" $7 }' |sort -n -k2

You'll see that within a pool the PG sizes are fairly close to the same size, but in your cluster the PGs are fairly large (~200GB would be my guess).

Bryan

From: ceph-users <ceph-users-***@lists.ceph.com<mailto:ceph-users-***@lists.ceph.com>> on behalf of Bryan Banister <***@jumptrading.com<mailto:***@jumptrading.com>>
Date: Monday, February 12, 2018 at 2:19 PM
To: Janne Johansson <***@gmail.com<mailto:***@gmail.com>>
Cc: Ceph Users <ceph-***@lists.ceph.com<mailto:ceph-***@lists.ceph.com>>
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

Hi Janne and others,

We used the “ceph osd reweight-by-utilization” command to move a small amount of data off of the top four OSDs by utilization. Then we updated the pg_num and pgp_num on the pool from 512 to 1024, which started moving roughly 50% of the objects around as a result. The unfortunate issue is that the weights on the OSDs are still roughly equivalent and the OSDs that are nearfull were still getting allocated objects during the rebalance backfill operations.

At this point I have made some massive changes to the weights of the OSDs in an attempt to stop Ceph from allocating any more data to OSDs that are getting close to full. Basically the OSD with the lowest utilization remains weighted at 1, and every other OSD is reduced in weight by the difference between its percent usage and the %usage of the OSD with the least data (21% at the time). This means the OSD that is currently the most full, at 86%, now has a weight of only 0.33 (it was at 89% when the reweight was applied). I’m not sure this is a good idea, but it seemed like the only option I had. Please let me know if I’m making a bad situation worse!
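Concretely, the scheme amounts to something like this (a sketch only: it prints the reweight commands for review instead of running them, and assumes %USE is the 8th field of ceph osd df):

min=$(ceph osd df | awk '$1 ~ /^[0-9]+$/ {print $8}' | sort -n | head -1)
# weight = 1 - (%USE - lowest %USE) / 100, printed for review rather than executed
ceph osd df | awk -v m="$min" '$1 ~ /^[0-9]+$/ {printf "ceph osd reweight %s %.2f\n", $1, 1 - ($8 - m) / 100}'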

I still have the question on how this happened in the first place and how to prevent it from happening going forward without a lot of monitoring and reweighting on weekends/etc to keep things balanced. It sounds like Ceph is really expecting that objects stored into a pool will roughly have the same size, is that right?

Our backups going into this pool have very large variation in size, so would it be better to create multiple pools based on expected size of objects and then put backups of similar size into each pool?

The backups also have basically the same names, with the only difference being the date on which they were taken (e.g. the backup name difference on subsequent days can be one digit at times), so does this mean that large backups with basically the same name will end up being placed in the same PGs based on the CRUSH calculation using the object name?

Thanks,
-Bryan

From: Janne Johansson [mailto:***@gmail.com]
Sent: Wednesday, January 31, 2018 9:34 AM
To: Bryan Banister <***@jumptrading.com<mailto:***@jumptrading.com>>
Cc: Ceph Users <ceph-***@lists.ceph.com<mailto:ceph-***@lists.ceph.com>>
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

Note: External Email



2018-01-31 15:58 GMT+01:00 Bryan Banister <mailto:***@jumptrading.com>:


Given that this will move data around (I think), should we increase the pg_num and pgp_num first and then see how it looks?


I guess adding PGs (and bumping pgp_num) will move stuff around too, but if the PGCALC formula says you should have more, then that would still be a good start. Still, a few manual reweights first to take the 85-90% ones down might be good; some move operations will refuse to add things to too-full OSDs, so you would not want to get accidentally bumped above such a limit due to some temp data being created during the moves.

Also, don't bump pg_num like crazy, since you can never move back down. Aim for getting ~100 PGs per OSD at most, and perhaps even then in smaller steps, so that the creation (and evening out of data onto the new empty PGs) doesn't kill normal client I/O performance in the meantime.
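E.g. for the worst offenders, something like this (the OSD id here is a placeholder; pick the ones at 85-90%):

ceph osd reweight 113 0.90
ceph osd df    # re-check the top of the usage list before doing more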
--
May the most significant bit of your life be positive.



_______________________________________________
ceph-users mailing list
ceph-***@lists.ceph.com<mailto:ceph-***@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

David Turner
2018-02-16 21:20:48 UTC
Reply
Permalink
Raw Message
That sounds like a good next step. Start with OSDs involved in the longest
blocked requests. Wait a couple of minutes after the OSD marks itself back up, then continue through them. Hopefully things will start clearing up so that you don't need to mark all of them down. There are usually only a couple of OSDs holding everything up.
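For example, picking osd.4 from the '> 268435 sec' group in your health output:

ceph osd down 4
ceph -w    # it should log that it was wrongly marked down and come back up within a minute or two

If it doesn't come back up on its own after a couple of minutes, that's when restarting the daemon is worth considering.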
Post by Bryan Banister
Thanks David,
Taking the list of all OSDs that are stuck reports that a little over 50% of all OSDs are in this condition. There isn’t any discernible pattern that I can find, and they are spread across the three servers. All of the OSDs are online as far as the service is concerned.
I have also taken all PGs that were reported in the health detail output and looked for any that report “peering_blocked_by”, but none do, so I can’t tell if any OSD is actually blocking the peering operation.
pg 14.fe0 is stuck peering since forever, current state peering, last
acting [104,94,108]
pg 14.fe0 is stuck unclean since forever, current state peering, last
acting [104,94,108]
pg 14.fbc is stuck peering since forever, current state peering, last acting [110,91,0]
pg 14.fd1 is stuck peering since forever, current state peering, last
acting [130,62,111]
pg 14.fd1 is stuck unclean since forever, current state peering, last
acting [130,62,111]
pg 14.fed is stuck peering since forever, current state peering, last acting [32,33,82]
pg 14.fed is stuck unclean since forever, current state peering, last acting [32,33,82]
pg 14.fee is stuck peering since forever, current state peering, last acting [37,96,68]
pg 14.fee is stuck unclean since forever, current state peering, last acting [37,96,68]
pg 14.fe8 is stuck peering since forever, current state peering, last
acting [45,31,107]
pg 14.fe8 is stuck unclean since forever, current state peering, last
acting [45,31,107]
pg 14.fc1 is stuck peering since forever, current state peering, last
acting [59,124,39]
pg 14.ff2 is stuck peering since forever, current state peering, last acting [62,117,7]
pg 14.ff2 is stuck unclean since forever, current state peering, last acting [62,117,7]
pg 14.fe4 is stuck peering since forever, current state peering, last acting [84,55,92]
pg 14.fe4 is stuck unclean since forever, current state peering, last acting [84,55,92]
pg 14.fb0 is stuck peering since forever, current state peering, last acting [94,30,38]
pg 14.ffc is stuck peering since forever, current state peering, last acting [96,53,70]
pg 14.ffc is stuck unclean since forever, current state peering, last acting [96,53,70]
Some share common OSDs, but some OSDs are listed only once.
Should I try just marking OSDs with stuck requests down to see if that will re-assert them?
Thanks!!
-Bryan
*Sent:* Friday, February 16, 2018 2:51 PM
*Subject:* Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2
*Note: External Email*
------------------------------
The questions I definitely know the answer to first, and then we'll
continue from there. If an OSD is blocking peering but is online, when you
mark it as down in the cluster, it receives a message in its log saying it was wrongly marked down and tells the mons it is online. That gets it to
stop what it was doing and start talking again. I referred to that as
re-asserting. If the OSD that you marked down doesn't mark itself back up
within a couple of minutes, restarting the OSD might be a good idea. Then again, actually restarting the daemon could be bad, because the daemon is in the middle of doing something. With so many other places to work on to get things going, actually restarting the daemons is probably something I would wait to do for now.
The reason the cluster doesn't know anything about the PG is because it's
still creating and hasn't actually been created. Starting with some of the
OSDs that you see with blocked requests would be a good idea. Eventually you'll down an OSD and, when it comes back up, things will start looking much better as PGs start peering again. Below is the list of OSDs you had from a previous email; if they still have stuck requests, they'll be good ones to start doing this to. On closer review, it's almost all of them... but you have to start somewhere. Another
possible place to start with these is to look at a list of all of the
peering PGs and see if there are any common OSDs when you look at all of
them at once. Some patterns may emerge and would be good options to try.
osds 7,39,60,103,133 have stuck requests > 67108.9 sec
osds
5,12,13,28,33,40,55,56,61,64,69,70,75,83,92,96,110,114,119,122,123,129,131
have stuck requests > 134218 sec
osds
4,8,10,15,16,20,27,29,30,31,34,37,38,42,43,44,47,48,49,51,52,57,66,68,73,81,84,85,87,90,95,97,99,100,102,105,106,107,108,111,112,113,121,124,127,130,132
have stuck requests > 268435 sec
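A quick way to spot common OSDs across the stuck peering PGs (assuming the health detail format above, where the acting set is the last field):

ceph health detail | awk '/stuck peering/ {gsub(/[][]/, "", $NF); n = split($NF, a, ","); for (i = 1; i <= n; i++) print a[i]}' | sort | uniq -c | sort -rn | head

OSDs near the top of that count are good ones to try marking down first.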
Thanks David,
I have set the nobackfill, norecover, noscrub, and nodeep-scrub options at
this point and the backfills have stopped. I’ll also stop the backups from
pushing into ceph for now.
I don’t want to make things worse, so I’m asking for some more guidance now.
1) In looking at a PG that is still peering or one that is
pg 14.fb0 is stuck peering since forever, current state peering, last acting [94,30,38]
Error ENOENT: i don't have pgid 14.fb0
[snip]
"recovery_state": [
{
"name": "Started/Primary/Active",
"enter_time": "2018-02-13 14:33:21.406919",
"might_have_unfound": [
{
"osd": "84(0)",
"status": "not queried"
}
],
"recovery_progress": {
"backfill_targets": [
"56(0)",
"87(1)",
"88(2)"
],
"waiting_on_backfill": [],
"last_backfill_started": "MIN",
"backfill_info": {
"begin": "MIN",
"end": "MIN",
"objects": []
},
"peer_backfill_info": [],
"backfills_in_flight": [],
"recovering": [],
"pg_backend": {
"recovery_ops": [],
"read_ops": []
}
},
"scrub": {
"scrubber.epoch_start": "0",
"scrubber.active": false,
"scrubber.state": "INACTIVE",
"scrubber.start": "MIN",
"scrubber.end": "MIN",
"scrubber.subset_last_update": "0'0",
"scrubber.deep": false,
"scrubber.seed": 0,
"scrubber.waiting_on": 0,
"scrubber.waiting_on_whom": []
}
},
{
"name": "Started",
"enter_time": "2018-02-13 14:33:17.491148"
}
],
Sorry for all the hand holding, but how do I determine if I need to set an
OSD as ‘down’ to fix the issues, and how does it go about re-asserting
itself?
I again tried looking at the ceph docs on troubleshooting OSDs but didn’t find any details. The man page also has no details.
Thanks again,
-Bryan
*Sent:* Friday, February 16, 2018 1:21 PM
*Subject:* Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2
*Note: External Email*
------------------------------
Your problem might have been creating too many PGs at once. I generally increase pg_num and pgp_num by no more than 256 at a time, making sure that all PGs are created, peered, and healthy (other than backfilling) before the next increase.
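When you do grow the pool further, something along these lines, one step at a time (pool name assumed to be .rgw.buckets here):

ceph osd pool set .rgw.buckets pg_num 1280
ceph osd pool set .rgw.buckets pgp_num 1280
ceph pg dump_stuck inactive    # wait for this to come back empty before the next step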
To help you get back to a healthy state, let's start off by getting all of
your PGs peered. Go ahead and put a stop to backfilling, recovery,
scrubbing, etc. Those are all hindering the peering effort right now. The more clients you can disable, the better.
ceph osd set nobackfill
ceph osd set norecover
ceph osd set noscrub
ceph osd set nodeep-scrub
After that, look at your peering PGs and find out what is blocking their peering. This is where you might need to use `ceph osd down 23` (assuming you needed to kick osd.23) to mark it down in the cluster and let it re-assert itself. Once all PGs are done peering, go ahead and unset nobackfill and norecover and let the cluster start moving data around. Leaving noscrub and nodeep-scrub set is optional and up to you. I'll never say it's better to leave scrubbing off, but scrubbing does use a fair bit of spindle time while you're trying to backfill.
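Once peering is done, that unset step is just:

ceph osd unset nobackfill
ceph osd unset norecover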