[ceph-users] Some pgs stuck unclean in active+remapped state
Thomas Klute
2018-11-19 11:49:58 UTC
Hi,

we have a production cluster (3 nodes) that is stuck unclean after we
had to replace one OSD.
The cluster recovered fine, except for some PGs that have been stuck
unclean for about 2-3 days now:

[***@ceph1 ~]# ceph health detail
HEALTH_WARN 7 pgs stuck unclean; recovery 8/8565617 objects degraded
(0.000%); recovery 38790/8565617 objects misplaced (0.453%)
pg 3.19 is stuck unclean for 324141.349243, current state
active+remapped, last acting [8,1,12]
pg 3.17f is stuck unclean for 324093.413743, current state
active+remapped, last acting [7,10,14]
pg 3.15e is stuck unclean for 324072.637573, current state
active+remapped, last acting [9,11,12]
pg 3.1cc is stuck unclean for 324141.437666, current state
active+remapped, last acting [6,4,9]
pg 3.47 is stuck unclean for 324014.795713, current state
active+remapped, last acting [4,7,14]
pg 3.1d6 is stuck unclean for 324019.903078, current state
active+remapped, last acting [8,0,4]
pg 3.83 is stuck unclean for 324024.970570, current state
active+remapped, last acting [5,11,13]
recovery 8/8565617 objects degraded (0.000%)
recovery 38790/8565617 objects misplaced (0.453%)

Grep on pg dump shows:
[***@ceph1 ~]# fgrep remapp /tmp/pgdump.txt
3.83    5423    0       0       5423    0       22046870528     3065   
3065    active+remapped 2018-11-16 04:08:22.365825      85711'8469810  
85711:8067280   [5,11]  5       [5,11,13]       5       83827'8450839  
2018-11-14 14:01:20.330322   81079'8422114   2018-11-11 05:10:57.628147
3.47    5487    0       0       5487    0       22364503552     3010   
3010    active+remapped 2018-11-15 18:24:24.047889      85711'9511787  
85711:9975900   [4,7]   4       [4,7,14]        4       84165'9471676  
2018-11-14 23:46:23.149867   80988'9434392   2018-11-11 02:00:23.427834
3.1d6   5567    0       2       5567    0       22652505618     3093   
3093    active+remapped 2018-11-16 23:26:06.136037      85711'6730858  
85711:6042914   [8,0]   8       [8,0,4] 8       83682'6673939  
2018-11-14 09:15:37.810103  80664'6608489    2018-11-09 09:21:00.431783
3.1cc   5656    0       0       5656    0       22988533760     3088   
3088    active+remapped 2018-11-17 09:18:42.263108      85711'9795820  
85711:8040672   [6,4]   6       [6,4,9] 6       80670'9756755  
2018-11-10 13:07:35.097811  80664'9742234    2018-11-09 04:33:10.497507
3.15e   5564    0       6       5564    0       22675107328     3007   
3007    active+remapped 2018-11-17 02:47:44.282884      85711'9000186  
85711:8021053   [9,11]  9       [9,11,12]       9       83502'8957026  
2018-11-14 03:31:18.592781   80664'8920925   2018-11-09 22:15:54.478402
3.17f   5601    0       0       5601    0       22861908480     3077   
3077    active+remapped 2018-11-17 01:16:34.016231      85711'31880220 
85711:30659045  [7,10]  7       [7,10,14]       7       83668'31705772 
2018-11-14 08:35:10.952368   80664'31649045  2018-11-09 04:40:28.644421
3.19    5492    0       0       5492    0       22460691985     3016   
3016    active+remapped 2018-11-15 18:54:32.268758      85711'16782496 
85711:15483621  [8,1]   8       [8,1,12]        8       84542'16774356 
2018-11-15 09:40:41.713627   82163'16760520  2018-11-12 13:13:29.764191
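
The same PGs can also be listed and inspected directly; a minimal
sketch, using pg 3.83 as an example (the query output contains the
current "up" and "acting" OSD sets for the PG):

ceph pg dump_stuck unclean
ceph pg 3.83 query | less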

We are running Jewel (10.2.11) on CentOS 7:

rpm -qa |grep ceph
ceph-radosgw-10.2.11-0.el7.x86_64
libcephfs1-10.2.11-0.el7.x86_64
ceph-mds-10.2.11-0.el7.x86_64
ceph-release-1-1.el7.noarch
ceph-common-10.2.11-0.el7.x86_64
ceph-selinux-10.2.11-0.el7.x86_64
python-cephfs-10.2.11-0.el7.x86_64
ceph-base-10.2.11-0.el7.x86_64
ceph-osd-10.2.11-0.el7.x86_64
ceph-mon-10.2.11-0.el7.x86_64
ceph-deploy-1.5.39-0.noarch
ceph-10.2.11-0.el7.x86_64

Could someone please advise on how to proceed?

Thanks and kind regards,
Thomas
Burkhard Linke
2018-11-19 12:22:45 UTC
Hi,
Post by Thomas Klute
Hi,
we have a production cluster (3 nodes) that is stuck unclean after we
had to replace one OSD.
The cluster recovered fine, except for some PGs that have been stuck
*snipsnap*
Post by Thomas Klute
3.83    5423    0       0       5423    0       22046870528     3065
3065    active+remapped 2018-11-16 04:08:22.365825      85711'8469810
85711:8067280   [5,11]  5       [5,11,13]       5       83827'8450839
This PG currently resides on OSDs 5, 11 and 13, but the reshuffling
caused by replacing the OSD has led to a problem: CRUSH is no longer
able to find three OSDs that satisfy the crush rule. CRUSH came up with
OSDs 5 and 11 for this PG; a third OSD is missing.
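
This can be verified by comparing the "up" set (what CRUSH currently
maps) with the "acting" set (which still holds the data); a minimal
sketch, again using pg 3.83 as an example:

ceph pg 3.83 query | grep -A 4 '"up"'
ceph osd crush show-tunables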


You only have three nodes, so this is a corner case of the CRUSH
algorithm and its pseudo-random nature: with so few hosts to choose
from, CRUSH can exhaust its placement retries before it finds a valid
third OSD. To solve this you can either add more nodes or adjust some
of the crush tunables, e.g. the number of placement tries (see the
sketch below).
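
A minimal sketch of how that tunable could be raised, assuming the
parameter in question is choose_total_tries in the decompiled crush
map; test the edited map with crushtool before injecting it, and note
that changing the crush map can cause some data movement:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt and raise e.g. "tunable choose_total_tries 50"
# to a higher value such as 100
crushtool -c crushmap.txt -o crushmap-new.bin
crushtool -i crushmap-new.bin --test --show-bad-mappings --num-rep 3
ceph osd setcrushmap -i crushmap-new.bin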


Regards,

Burkhard
