Thomas Klute
2018-11-19 11:49:58 UTC
Hi,
we have a production cluster (3 nodes) that is stuck unclean after we had to
replace one OSD.
The cluster recovered fine except for a few PGs that have now been stuck
unclean for about 2-3 days:
[***@ceph1 ~]# ceph health detail
HEALTH_WARN 7 pgs stuck unclean; recovery 8/8565617 objects degraded (0.000%); recovery 38790/8565617 objects misplaced (0.453%)
pg 3.19 is stuck unclean for 324141.349243, current state active+remapped, last acting [8,1,12]
pg 3.17f is stuck unclean for 324093.413743, current state active+remapped, last acting [7,10,14]
pg 3.15e is stuck unclean for 324072.637573, current state active+remapped, last acting [9,11,12]
pg 3.1cc is stuck unclean for 324141.437666, current state active+remapped, last acting [6,4,9]
pg 3.47 is stuck unclean for 324014.795713, current state active+remapped, last acting [4,7,14]
pg 3.1d6 is stuck unclean for 324019.903078, current state active+remapped, last acting [8,0,4]
pg 3.83 is stuck unclean for 324024.970570, current state active+remapped, last acting [5,11,13]
recovery 8/8565617 objects degraded (0.000%)
recovery 38790/8565617 objects misplaced (0.453%)
Grepping the pg dump for the remapped PGs shows:
[***@ceph1 ~]# fgrep remapp /tmp/pgdump.txt
3.83   5423  0  0  5423  0  22046870528  3065  3065  active+remapped  2018-11-16 04:08:22.365825  85711'8469810   85711:8067280   [5,11]  5  [5,11,13]  5  83827'8450839   2018-11-14 14:01:20.330322  81079'8422114   2018-11-11 05:10:57.628147
3.47   5487  0  0  5487  0  22364503552  3010  3010  active+remapped  2018-11-15 18:24:24.047889  85711'9511787   85711:9975900   [4,7]   4  [4,7,14]   4  84165'9471676   2018-11-14 23:46:23.149867  80988'9434392   2018-11-11 02:00:23.427834
3.1d6  5567  0  2  5567  0  22652505618  3093  3093  active+remapped  2018-11-16 23:26:06.136037  85711'6730858   85711:6042914   [8,0]   8  [8,0,4]    8  83682'6673939   2018-11-14 09:15:37.810103  80664'6608489   2018-11-09 09:21:00.431783
3.1cc  5656  0  0  5656  0  22988533760  3088  3088  active+remapped  2018-11-17 09:18:42.263108  85711'9795820   85711:8040672   [6,4]   6  [6,4,9]    6  80670'9756755   2018-11-10 13:07:35.097811  80664'9742234   2018-11-09 04:33:10.497507
3.15e  5564  0  6  5564  0  22675107328  3007  3007  active+remapped  2018-11-17 02:47:44.282884  85711'9000186   85711:8021053   [9,11]  9  [9,11,12]  9  83502'8957026   2018-11-14 03:31:18.592781  80664'8920925   2018-11-09 22:15:54.478402
3.17f  5601  0  0  5601  0  22861908480  3077  3077  active+remapped  2018-11-17 01:16:34.016231  85711'31880220  85711:30659045  [7,10]  7  [7,10,14]  7  83668'31705772  2018-11-14 08:35:10.952368  80664'31649045  2018-11-09 04:40:28.644421
3.19   5492  0  0  5492  0  22460691985  3016  3016  active+remapped  2018-11-15 18:54:32.268758  85711'16782496  85711:15483621  [8,1]   8  [8,1,12]   8  84542'16774356  2018-11-15 09:40:41.713627  82163'16760520  2018-11-12 13:13:29.764191
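If I read the dump correctly, each of these PGs has only two OSDs in its up set (e.g. [5,11]) while the acting set still has three (e.g. [5,11,13]), so CRUSH does not seem to find a third OSD for the new mapping. In case the peering/recovery detail of one of them is useful, I can run something like the following (the output file name is just an example):

ceph pg 3.19 query > /tmp/pg_3.19_query.json   # peering/recovery state of one stuck PG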
We are running Jewel (10.2.11) on CentOS 7:
rpm -qa |grep ceph
ceph-radosgw-10.2.11-0.el7.x86_64
libcephfs1-10.2.11-0.el7.x86_64
ceph-mds-10.2.11-0.el7.x86_64
ceph-release-1-1.el7.noarch
ceph-common-10.2.11-0.el7.x86_64
ceph-selinux-10.2.11-0.el7.x86_64
python-cephfs-10.2.11-0.el7.x86_64
ceph-base-10.2.11-0.el7.x86_64
ceph-osd-10.2.11-0.el7.x86_64
ceph-mon-10.2.11-0.el7.x86_64
ceph-deploy-1.5.39-0.noarch
ceph-10.2.11-0.el7.x86_64
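I can also post the output of the following standard commands if that helps (nothing cluster-specific assumed here):

ceph osd tree                  # OSD/host layout and CRUSH weights
ceph osd df                    # per-OSD utilization and reweight values
ceph osd crush show-tunables   # current CRUSH tunables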
Could someone please advise on how to proceed?
Thanks and kind regards,
Thomas