Discussion:
[ceph-users] ghost PG : "i don't have pgid xx"
Olivier Bonvalet
2018-06-05 07:25:49 UTC
Hi,

I have a cluster in a "stale" state: a lot of RBD images have been blocked
for ~10 hours. In the status I see PGs in stale or down state, but those
PGs don't seem to exist anymore:

root@stor00-sbg:~# ceph health detail | egrep '(stale|down)'
HEALTH_ERR noout,noscrub,nodeep-scrub flag(s) set; 1 nearfull osd(s); 16 pool(s) nearfull; 4645278/103969515 objects misplaced (4.468%); Reduced data availability: 643 pgs inactive, 12 pgs down, 2 pgs peering, 3 pgs stale; Degraded data redundancy: 2723173/103969515 objects degraded (2.619%), 387 pgs degraded, 297 pgs undersized; 229 slow requests are blocked > 32 sec; 4074 stuck requests are blocked > 4096 sec; too many PGs per OSD (202 > max 200); mons hyp01-sbg,hyp02-sbg,hyp03-sbg are using a lot of disk space
PG_AVAILABILITY Reduced data availability: 643 pgs inactive, 12 pgs down, 2 pgs peering, 3 pgs stale
pg 31.8b is down, acting [2147483647,16,36]
pg 31.8e is down, acting [2147483647,29,19]
pg 46.b8 is down, acting [2147483647,2147483647,13,17,47,28]
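
Side note: 2147483647 is 0x7fffffff, the CRUSH "NONE" placeholder (shown
as NONE in pg dump), meaning no OSD could be chosen for that slot. If I'm
not mistaken, the stuck PGs can also be listed with:

# ceph pg dump_stuck stale
# ceph pg dump_stuck inactive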

root@stor00-sbg:~# ceph pg 31.8b query
Error ENOENT: i don't have pgid 31.8b

root@stor00-sbg:~# ceph pg 31.8e query
Error ENOENT: i don't have pgid 31.8e

root@stor00-sbg:~# ceph pg 46.b8 query
Error ENOENT: i don't have pgid 46.b8


We just lost an HDD and marked the corresponding OSD as "lost".
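
Concretely, that was roughly the following, with X being the id of the
dead OSD:

# ceph osd out X
# ceph osd lost X --yes-i-really-mean-it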

Any idea what I should do?

Thanks,

Olivier
Olivier Bonvalet
2018-06-05 07:40:37 UTC
Some more information: the cluster was just upgraded from Jewel to
Luminous.

# ceph pg dump | egrep '(stale|creating)'
dumped all
15.32 10947 0 0 0 0 45870301184 3067 3067 stale+active+clean 2018-06-04 09:20:42.594317 387644'251008 437722:754803 [48,31,45] 48 [48,31,45] 48 213014'224196 2018-04-22 02:01:09.148152 200181'219150 2018-04-14 14:40:13.116285 0
19.77 4131 0 0 0 0 17326669824 3076 3076 stale+down 2018-06-05 07:28:33.968860 394478'58307 438699:736881 [NONE,20,76] 20 [NONE,20,76] 20 273736'49495 2018-05-17 01:05:35.523735 273736'49495 2018-05-17 01:05:35.523735 0
13.76 10730 0 0 0 0 44127133696 3011 3011 stale+down 2018-06-05 07:30:27.578512 397231'457143 438813:4600135 [NONE,21,76] 21 [NONE,21,76] 21 286462'438402 2018-05-20 18:06:12.443141 286462'438402 2018-05-20 18:06:12.443141 0
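
A more readable view of the same PGs, if I'm not mistaken:

# ceph pg dump pgs_brief | egrep '(stale|down)'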
Olivier Bonvalet
2018-06-05 09:53:24 UTC
Hi,

Good point! Changing this value *and* restarting ceph-mgr fixed the
issue. Now we have to find a way to reduce the PG count.
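
For the record, the ceph-mgr restart was simply (on the host running the
active mgr; the id is a placeholder):

# systemctl restart ceph-mgr@<id>

And since, as far as I know, pg_num cannot be decreased on Luminous,
reducing the PG count will probably mean moving data to new pools with
fewer PGs, or adding OSDs. To see where we stand:

# ceph osd df
# ceph osd pool ls detail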

Thanks Paul!

Olivier
Post by Paul
Hi,
looks like you are running into the PG overdose protection of
Luminous (you have > 200 PGs per OSD): try increasing
mon_max_pg_per_osd on the monitors to 300 or so to temporarily
resolve this.
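
Something along these lines should do it (from memory, so double-check;
the target names are generic):

# ceph tell mon.* injectargs '--mon_max_pg_per_osd 300'

or set it in the [global] section of ceph.conf on the mons:

mon_max_pg_per_osd = 300
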
Paul