Discussion:
[ceph-users] [Ceph-community] Pool broke after increase pg_num
Joao Eduardo Luis
2018-11-08 11:30:06 UTC
Hello Gesiel,

Welcome to Ceph!

In the future, you may want to address the ceph-users list
Hi everyone,
I am a beginner with Ceph. I increased pg_num on a pool, and after
the cluster rebalanced I increased pgp_num (a confession: I had not
read the complete documentation about this operation :-( ). After
this my cluster broke and everything stopped. The cluster is not
rebalancing, and my impression is that everything has stopped.
Below is my "ceph -s". Can anyone help me?
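(For readers of the archive: the operation described above normally comes
down to the two commands below. The pool name "mypool" is a placeholder,
not taken from this thread; 1024 matches the pg count visible in the
status further down.

    ceph osd pool set mypool pg_num 1024    # create the new placement groups
    ceph osd pool set mypool pgp_num 1024   # then allow data to be remapped onto them
)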
You have two osds down. Depending on how your data is mapped, your pgs
may be waiting for those to come back up before they finish being
cleaned up.
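
You can see which OSDs those are and which PGs are stuck with something
like:

    ceph osd tree | grep down       # identify the down OSDs
    ceph pg dump_stuck inactive     # list the PGs that are not yet active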

-Joao
+++++++
  cluster:
    id:     ab5dcb0c-480d-419c-bcb8-013cbcce5c4d
    health: HEALTH_WARN
            14402/995493 objects misplaced (1.447%)
            Reduced data availability: 348 pgs inactive, 313 pgs peering

  services:
    mon: 3 daemons, quorum cmonitor,thanos,cmonitor2
    mgr: thanos(active), standbys: cmonitor
    osd: 19 osds: 17 up, 17 in; 221 remapped pgs

  data:
    pools:   1 pools, 1024 pgs
    objects: 331.8 k objects, 1.3 TiB
    usage:   3.8 TiB used, 7.4 TiB / 11 TiB avail
    pgs:     1.660% pgs unknown
             32.324% pgs not active
             14402/995493 objects misplaced (1.447%)
             676 active+clean
             186 remapped+peering
             127 peering
             18  activating+remapped
             17  unknown
Regards,
Gesiel
_______________________________________________
Ceph-community mailing list
http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
Gesiel Galvão Bernardes
2018-11-08 16:08:17 UTC
Post by Joao Eduardo Luis
Hello Gesiel,
Welcome to Ceph!
In the future, you may want to address the ceph-users list
Thank you, I will do that.
Post by Joao Eduardo Luis
Hi everyone,
I am a beginner with Ceph. I increased pg_num on a pool, and after
the cluster rebalanced I increased pgp_num (a confession: I had not
read the complete documentation about this operation :-( ). After
this my cluster broke and everything stopped. The cluster is not
rebalancing, and my impression is that everything has stopped.
Below is my "ceph -s". Can anyone help me?
You have two osds down. Depending on how your data is mapped, your pgs
may be waiting for those to come back up before they finish being
cleaned up.
After I removed the down OSDs, the cluster tried to rebalance, but it is
"frozen" again, in this status:

  cluster:
    id:     ab5dcb0c-480d-419c-bcb8-013cbcce5c4d
    health: HEALTH_WARN
            12840/988707 objects misplaced (1.299%)
            Reduced data availability: 358 pgs inactive, 325 pgs peering

  services:
    mon: 3 daemons, quorum cmonitor,thanos,cmonitor2
    mgr: thanos(active), standbys: cmonitor
    osd: 17 osds: 17 up, 17 in; 221 remapped pgs

  data:
    pools:   1 pools, 1024 pgs
    objects: 329.6 k objects, 1.3 TiB
    usage:   3.8 TiB used, 7.4 TiB / 11 TiB avail
    pgs:     1.660% pgs unknown
             33.301% pgs not active
             12840/988707 objects misplaced (1.299%)
             666 active+clean
             188 remapped+peering
             137 peering
             17  unknown
             16  activating+remapped
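
(For the record: "removed the down OSDs" usually corresponds to something
like the following on Luminous or later; the id 3 below is only a
placeholder, not taken from this thread.

    ceph osd out 3                            # stop mapping data to the OSD
    ceph osd purge 3 --yes-i-really-mean-it   # remove it from the CRUSH map, auth database and OSD map
)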

Any other idea?


Gesiel
Ashley Merrick
2018-11-09 05:38:42 UTC
Are you sure the down OSDs didn't happen to hold any data required for the
rebalance to complete? How long had the now-removed down OSDs been out?
Since before or after you increased the PG count?

If you do "ceph health detail" and then pick a stuck PG what does "ceph pg
PG query" output?

Has your ceph -s output changed at all since the last paste?

On Fri, Nov 9, 2018 at 12:08 AM Gesiel Galvão Bernardes <
Post by Gesiel Galvão Bernardes
Post by Joao Eduardo Luis
Hello Gesiel,
Welcome to Ceph!
In the future, you may want to address the ceph-users list
Thank you, I will do that.
Post by Joao Eduardo Luis
Hi everyone,
I am a beginner with Ceph. I increased pg_num on a pool, and after
the cluster rebalanced I increased pgp_num (a confession: I had not
read the complete documentation about this operation :-( ). After
this my cluster broke and everything stopped. The cluster is not
rebalancing, and my impression is that everything has stopped.
Below is my "ceph -s". Can anyone help me?
You have two osds down. Depending on how your data is mapped, your pgs
may be waiting for those to come back up before they finish being
cleaned up.
After I removed the down OSDs, the cluster tried to rebalance, but it is
"frozen" again, in this status:
id: ab5dcb0c-480d-419c-bcb8-013cbcce5c4d
health: HEALTH_WARN
12840/988707 objects misplaced (1.299%)
Reduced data availability: 358 pgs inactive, 325 pgs peering
mon: 3 daemons, quorum cmonitor,thanos,cmonitor2
mgr: thanos(active), standbys: cmonitor
osd: 17 osds: 17 up, 17 in; 221 remapped pgs
pools: 1 pools, 1024 pgs
objects: 329.6 k objects, 1.3 TiB
usage: 3.8 TiB used, 7.4 TiB / 11 TiB avail
pgs: 1.660% pgs unknown
33.301% pgs not active
12840/988707 objects misplaced (1.299%)
666 active+clean
188 remapped+peering
137 peering
17 unknown
16 activating+remapped
Any other idea?
Gesiel
Post by Joao Eduardo Luis
_______________________________________________
ceph-users mailing list
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Gesiel Galvão Bernardes
2018-11-09 11:11:52 UTC
Hi,

The pool is back up and running. I took these actions:

- Increased the max PGs per OSD (ceph tell mon.* injectargs
'--mon_max_pg_per_osd=400'). But it was still frozen. (Some OSDs already
had 251 PGs, so I am not sure whether this was my problem.)
- Restarted all daemons, including the OSDs (see the command sketch below).
On one specific host, restarting an OSD daemon took a long time, and after
that I saw that the pool started rebuilding.
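
(For readers of the archive: on a systemd-managed deployment those two
steps map to commands roughly like the following. The 400 value is the one
quoted above; the systemd unit name is an assumption about the deployment,
and a setting injected this way only lasts until the monitors restart.

    # raise the per-OSD PG limit on the running monitors
    ceph tell mon.* injectargs '--mon_max_pg_per_osd=400'
    # check how many PGs each OSD currently holds
    ceph osd df tree
    # restart the OSD daemons, one host at a time
    systemctl restart ceph-osd.target
)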

I don't have a firm conclusion about what happened, but at least it is
working. I will read the logs, now with more time, to understand exactly
what happened.

Thank you all for your help.


Gesiel
Post by Ashley Merrick
Are you sure the down OSDs didn't happen to hold any data required for the
rebalance to complete? How long had the now-removed down OSDs been out?
Since before or after you increased the PG count?
If you do "ceph health detail" and then pick a stuck PG what does "ceph pg
PG query" output?
Has your ceph -s output changed at all since the last paste?
On Fri, Nov 9, 2018 at 12:08 AM Gesiel Galvão Bernardes <
Post by Gesiel Galvão Bernardes
Post by Joao Eduardo Luis
Hello Gesiel,
Welcome to Ceph!
In the future, you may want to address the ceph-users list
Thank you, I will do that.
Post by Joao Eduardo Luis
Hi everyone,
I am a beginner with Ceph. I increased pg_num on a pool, and after
the cluster rebalanced I increased pgp_num (a confession: I had not
read the complete documentation about this operation :-( ). After
this my cluster broke and everything stopped. The cluster is not
rebalancing, and my impression is that everything has stopped.
Below is my "ceph -s". Can anyone help me?
You have two osds down. Depending on how your data is mapped, your pgs
may be waiting for those to come back up before they finish being
cleaned up.
After I removed the down OSDs, the cluster tried to rebalance, but it is
"frozen" again, in this status:
id: ab5dcb0c-480d-419c-bcb8-013cbcce5c4d
health: HEALTH_WARN
12840/988707 objects misplaced (1.299%)
Reduced data availability: 358 pgs inactive, 325 pgs peering
mon: 3 daemons, quorum cmonitor,thanos,cmonitor2
mgr: thanos(active), standbys: cmonitor
osd: 17 osds: 17 up, 17 in; 221 remapped pgs
pools: 1 pools, 1024 pgs
objects: 329.6 k objects, 1.3 TiB
usage: 3.8 TiB used, 7.4 TiB / 11 TiB avail
pgs: 1.660% pgs unknown
33.301% pgs not active
12840/988707 objects misplaced (1.299%)
666 active+clean
188 remapped+peering
137 peering
17 unknown
16 activating+remapped
Any other idea?
Gesiel
Post by Joao Eduardo Luis
_______________________________________________
ceph-users mailing list
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com