Discussion:
[ceph-users] [Ceph-community] Pool broke after increase pg_num
Joao Eduardo Luis
2018-11-08 11:30:06 UTC
Hello Gesiel,

Welcome to Ceph!

In the future, you may want to address the ceph-users list
Hi everyone,
I am a beginner with Ceph. I increased pg_num on a pool, and after
the cluster rebalanced I increased pgp_num (a confession: I had not
read the complete documentation about this operation :-( ). After
this my cluster broke and everything stopped. The cluster is not
rebalancing, and my impression is that everything has stopped.
Below is my "ceph -s". Can anyone help me?
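(For readers of the archive: the operation described above normally comes
down to the two commands below. The pool name "mypool" is a placeholder,
not taken from this thread; 1024 matches the pg count visible in the
status further down.

    ceph osd pool set mypool pg_num 1024    # create the new placement groups
    ceph osd pool set mypool pgp_num 1024   # then allow data to be remapped onto them
)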
You have two osds down. Depending on how your data is mapped, your pgs
may be waiting for those to come back up before they finish being
cleaned up.
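
You can see which OSDs those are and which PGs are stuck with something
like:

    ceph osd tree | grep down       # identify the down OSDs
    ceph pg dump_stuck inactive     # list the PGs that are not yet active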

-Joao
+++++++
  cluster:
    id:     ab5dcb0c-480d-419c-bcb8-013cbcce5c4d
    health: HEALTH_WARN
            14402/995493 objects misplaced (1.447%)
            Reduced data availability: 348 pgs inactive, 313 pgs peering

  services:
    mon: 3 daemons, quorum cmonitor,thanos,cmonitor2
    mgr: thanos(active), standbys: cmonitor
    osd: 19 osds: 17 up, 17 in; 221 remapped pgs

  data:
    pools:   1 pools, 1024 pgs
    objects: 331.8 k objects, 1.3 TiB
    usage:   3.8 TiB used, 7.4 TiB / 11 TiB avail
    pgs:     1.660% pgs unknown
             32.324% pgs not active
             14402/995493 objects misplaced (1.447%)
             676 active+clean
             186 remapped+peering
             127 peering
             18  activating+remapped
             17  unknown
Regards,
Gesiel
_______________________________________________
Ceph-community mailing list
http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
Gesiel Galvão Bernardes
2018-11-08 16:08:17 UTC
Post by Joao Eduardo Luis
Hello Gesiel,
Welcome to Ceph!
In the future, you may want to address the ceph-users list
Thank you, I will do that.
Post by Joao Eduardo Luis
Hi everyone,
I am a beginner with Ceph. I increased pg_num on a pool, and after
the cluster rebalanced I increased pgp_num (a confession: I had not
read the complete documentation about this operation :-( ). After
this my cluster broke and everything stopped. The cluster is not
rebalancing, and my impression is that everything has stopped.
Below is my "ceph -s". Can anyone help me?
You have two osds down. Depending on how your data is mapped, your pgs
may be waiting for those to come back up before they finish being
cleaned up.
After I removed the down OSDs, the cluster tried to rebalance, but it is
"frozen" again, in this status:

  cluster:
    id:     ab5dcb0c-480d-419c-bcb8-013cbcce5c4d
    health: HEALTH_WARN
            12840/988707 objects misplaced (1.299%)
            Reduced data availability: 358 pgs inactive, 325 pgs peering

  services:
    mon: 3 daemons, quorum cmonitor,thanos,cmonitor2
    mgr: thanos(active), standbys: cmonitor
    osd: 17 osds: 17 up, 17 in; 221 remapped pgs

  data:
    pools:   1 pools, 1024 pgs
    objects: 329.6 k objects, 1.3 TiB
    usage:   3.8 TiB used, 7.4 TiB / 11 TiB avail
    pgs:     1.660% pgs unknown
             33.301% pgs not active
             12840/988707 objects misplaced (1.299%)
             666 active+clean
             188 remapped+peering
             137 peering
             17  unknown
             16  activating+remapped
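
(For the record: "removed the down OSDs" usually corresponds to something
like the following on Luminous or later; the id 3 below is only a
placeholder, not taken from this thread.

    ceph osd out 3                            # stop mapping data to the OSD
    ceph osd purge 3 --yes-i-really-mean-it   # remove it from the CRUSH map, auth database and OSD map
)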

Any other idea?


Gesiel
Ashley Merrick
2018-11-09 05:38:42 UTC
Are you sure the down OSDs didn't happen to hold any data required for the
rebalance to complete? How long had the now-removed down OSDs been out?
Since before or after you increased the PG count?

If you do "ceph health detail" and then pick a stuck PG what does "ceph pg
PG query" output?

Has your ceph -s output changed at all since the last paste?

On Fri, Nov 9, 2018 at 12:08 AM Gesiel Galvão Bernardes <
Post by Gesiel Galvão Bernardes
Post by Joao Eduardo Luis
Hello Gesiel,
Welcome to Ceph!
In the future, you may want to address the ceph-users list
Thank you, I will do that.
Post by Joao Eduardo Luis
Hi everyone,
I am a beginner with Ceph. I increased pg_num on a pool, and after
the cluster rebalanced I increased pgp_num (a confession: I had not
read the complete documentation about this operation :-( ). After
this my cluster broke and everything stopped. The cluster is not
rebalancing, and my impression is that everything has stopped.
Below is my "ceph -s". Can anyone help me?
You have two osds down. Depending on how your data is mapped, your pgs
may be waiting for those to come back up before they finish being
cleaned up.
After I removed the down OSDs, the cluster tried to rebalance, but it is
"frozen" again, in this status:
id: ab5dcb0c-480d-419c-bcb8-013cbcce5c4d
health: HEALTH_WARN
12840/988707 objects misplaced (1.299%)
Reduced data availability: 358 pgs inactive, 325 pgs peering
mon: 3 daemons, quorum cmonitor,thanos,cmonitor2
mgr: thanos(active), standbys: cmonitor
osd: 17 osds: 17 up, 17 in; 221 remapped pgs
pools: 1 pools, 1024 pgs
objects: 329.6 k objects, 1.3 TiB
usage: 3.8 TiB used, 7.4 TiB / 11 TiB avail
pgs: 1.660% pgs unknown
33.301% pgs not active
12840/988707 objects misplaced (1.299%)
666 active+clean
188 remapped+peering
137 peering
17 unknown
16 activating+remapped
Any other idea?
Gesiel
Post by Joao Eduardo Luis
_______________________________________________
ceph-users mailing list
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Gesiel Galvão Bernardes
2018-11-09 11:11:52 UTC
Hi,

The pool is back up and running. I took these actions:

- Increased the max PGs per OSD (ceph tell mon.* injectargs
'--mon_max_pg_per_osd=400'). But it was still frozen. (Some OSDs already
had 251 PGs, so I am not sure whether this was my problem.)
- Restarted all daemons, including the OSDs (see the command sketch below).
On one specific host, restarting an OSD daemon took a long time, and after
that I saw that the pool started rebuilding.
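
(For readers of the archive: on a systemd-managed deployment those two
steps map to commands roughly like the following. The 400 value is the one
quoted above; the systemd unit name is an assumption about the deployment,
and a setting injected this way only lasts until the monitors restart.

    # raise the per-OSD PG limit on the running monitors
    ceph tell mon.* injectargs '--mon_max_pg_per_osd=400'
    # check how many PGs each OSD currently holds
    ceph osd df tree
    # restart the OSD daemons, one host at a time
    systemctl restart ceph-osd.target
)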

I don't have a firm conclusion about what happened, but at least it is
working. I will read the logs, now with more time, to understand exactly
what happened.

Thank you all for your help.


Gesiel
Post by Ashley Merrick
Are you sure the down OSDs didn't happen to hold any data required for the
rebalance to complete? How long had the now-removed down OSDs been out?
Since before or after you increased the PG count?
If you do "ceph health detail" and then pick a stuck PG what does "ceph pg
PG query" output?
Has your ceph -s output changed at all since the last paste?
On Fri, Nov 9, 2018 at 12:08 AM Gesiel Galvão Bernardes <
Post by Gesiel Galvão Bernardes
Post by Joao Eduardo Luis
Hello Gesiel,
Welcome to Ceph!
In the future, you may want to address the ceph-users list
Thank you, I will do that.
Post by Joao Eduardo Luis
Hi everyone,
I am a beginner with Ceph. I increased pg_num on a pool, and after
the cluster rebalanced I increased pgp_num (a confession: I had not
read the complete documentation about this operation :-( ). After
this my cluster broke and everything stopped. The cluster is not
rebalancing, and my impression is that everything has stopped.
Below is my "ceph -s". Can anyone help me?
You have two osds down. Depending on how your data is mapped, your pgs
may be waiting for those to come back up before they finish being
cleaned up.
After I removed the down OSDs, the cluster tried to rebalance, but it is
"frozen" again, in this status:
id: ab5dcb0c-480d-419c-bcb8-013cbcce5c4d
health: HEALTH_WARN
12840/988707 objects misplaced (1.299%)
Reduced data availability: 358 pgs inactive, 325 pgs peering
mon: 3 daemons, quorum cmonitor,thanos,cmonitor2
mgr: thanos(active), standbys: cmonitor
osd: 17 osds: 17 up, 17 in; 221 remapped pgs
pools: 1 pools, 1024 pgs
objects: 329.6 k objects, 1.3 TiB
usage: 3.8 TiB used, 7.4 TiB / 11 TiB avail
pgs: 1.660% pgs unknown
33.301% pgs not active
12840/988707 objects misplaced (1.299%)
666 active+clean
188 remapped+peering
137 peering
17 unknown
16 activating+remapped
Any other idea?
Gesiel
Post by Joao Eduardo Luis
_______________________________________________
ceph-users mailing list
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com