Discussion:
[ceph-users] How to repair active+clean+inconsistent?
K.C. Wong
2018-11-11 19:10:18 UTC
Hi folks,

I would appreciate any pointers on how to resolve a PG stuck
in the “active+clean+inconsistent” state. It has left the cluster
in HEALTH_ERR status for the last 5 days with no end in sight.
The state was triggered when one of the drives backing the PG
returned an I/O error. I’ve since replaced the failed drive.

I’m running Jewel (from centos-release-ceph-jewel) on CentOS 7.
I’ve tried “ceph pg repair <pg>” and it didn’t seem to do anything.
I’ve also tried more drastic measures, such as comparing all the
files (we use filestore) under that PG’s _head directory across
all 3 replicas and then nuking the outlier. Nothing worked.
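
For what it’s worth, the comparison was along these lines (just a
sketch; the paths assume the default filestore layout under
/var/lib/ceph/osd/ceph-<id>/current/<pgid>_head, and the OSD ids
come from the PG’s acting set):

# on each host holding a replica of pg 1.65 ("ceph pg map 1.65" lists the OSDs)
cd /var/lib/ceph/osd/ceph-62/current/1.65_head
find . -type f -exec md5sum {} + | sort -k2 > /tmp/pg1.65.osd62.md5
# repeat on the other replicas, then compare the listings
diff /tmp/pg1.65.osd62.md5 /tmp/pg1.65.osd67.md5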

Many thanks,

-kc

K.C. Wong
***@verseon.com
M: +1 (408) 769-8235

Brad Hubbard
2018-11-12 01:43:57 UTC
What does "rados list-inconsistent-obj <pg>" say?

Note that you may have to do a deep scrub to populate the output.
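
Roughly (a sketch; substitute the pg id that "ceph health detail"
reports, and --format=json-pretty just makes the JSON readable):

ceph health detail                        # find the inconsistent pg
ceph pg deep-scrub <pgid>                 # repopulate the scrub results if needed
rados list-inconsistent-obj <pgid> --format=json-pretty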
--
Cheers,
Brad
K.C. Wong
2018-11-12 06:19:40 UTC
Hi Brad,

I got the following:

[***@mgmt01 ~]# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 1.65 is active+clean+inconsistent, acting [62,67,47]
1 scrub errors
[***@mgmt01 ~]# rados list-inconsistent-obj 1.65
No scrub information available for pg 1.65
error 2: (2) No such file or directory
[***@mgmt01 ~]# rados list-inconsistent-snapset 1.65
No scrub information available for pg 1.65
error 2: (2) No such file or directory

Rather odd output, I’d say; not that I understand what
it means. I also tried rados list-inconsistent-pg:

[***@mgmt01 ~]# rados lspools
rbd
cephfs_data
cephfs_metadata
.rgw.root
default.rgw.control
default.rgw.data.root
default.rgw.gc
default.rgw.log
ctrl-p
prod
corp
camp
dev
default.rgw.users.uid
default.rgw.users.keys
default.rgw.buckets.index
default.rgw.buckets.data
default.rgw.buckets.non-ec
[***@mgmt01 ~]# for i in $(rados lspools); do rados list-inconsistent-pg $i; done
[]
["1.65"]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]

So, that’d put the inconsistency in the cephfs_data pool.
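
In hindsight, a variant of that loop that prints the pool name next
to each result would have made the mapping clearer; a trivial sketch:

for p in $(rados lspools); do
    echo -n "$p: "
    rados list-inconsistent-pg "$p"
done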

Thank you for your help,

-kc

Ashley Merrick
2018-11-12 06:22:18 UTC
You’ll need to run "ceph pg deep-scrub 1.65" first
Brad Hubbard
2018-11-12 06:58:33 UTC
Post by Ashley Merrick
You’ll need to run "ceph pg deep-scrub 1.65" first
Right, thanks Ashley. That's what the "Note that you may have to do a
deep scrub to populate the output" part of my answer meant, but
perhaps I needed to spell it out further.

The system has a record of a scrub error from a previous scan, but
subsequent activity in the cluster has invalidated the specifics. You
need to run another scrub to get current information for this pg;
that information does not remain valid indefinitely and may need to
be regenerated depending on circumstances.
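
Concretely, something like this (a sketch; the timestamps in
"ceph pg 1.65 query" tell you whether the deep scrub has actually
completed before you re-query):

ceph pg deep-scrub 1.65
# wait for it to finish; the state and last_deep_scrub_stamp show progress
ceph pg 1.65 query | grep -E '"state"|last_deep_scrub_stamp'
rados list-inconsistent-obj 1.65 --format=json-pretty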
--
Cheers,
Brad
K.C. Wong
2018-11-14 18:07:08 UTC
So, I’ve issued the deep-scrub command (and the repair command)
and nothing seemed to happen.
Unrelated to this issue, I have to take down some OSDs to prepare
a host for RMA. One of them happens to be in the replication
group for this PG, so a scrub happened indirectly. I now have
this from “ceph -s”:

cluster 374aed9e-5fc1-47e1-8d29-4416f7425e76
health HEALTH_ERR
1 pgs inconsistent
18446 scrub errors
monmap e1: 3 mons at {mgmt01=10.0.1.1:6789/0,mgmt02=10.1.1.1:6789/0,mgmt03=10.2.1.1:6789/0}
election epoch 252, quorum 0,1,2 mgmt01,mgmt02,mgmt03
fsmap e346: 1/1/1 up {0=mgmt01=up:active}, 2 up:standby
osdmap e40248: 120 osds: 119 up, 119 in
flags sortbitwise,require_jewel_osds
pgmap v22025963: 3136 pgs, 18 pools, 18975 GB data, 214 Mobjects
59473 GB used, 287 TB / 345 TB avail
3120 active+clean
15 active+clean+scrubbing+deep
1 active+clean+inconsistent

That’s a lot of scrub errors:

HEALTH_ERR 1 pgs inconsistent; 18446 scrub errors
pg 1.65 is active+clean+inconsistent, acting [62,67,33]
18446 scrub errors

Now, “rados list-inconsistent-obj 1.65” returns a *very* long JSON
output. Here’s a small snippet; the errors look the same throughout:

{
  "object": {
    "name": "100000ea8bb.00000045",
    "nspace": "",
    "locator": "",
    "snap": "head",
    "version": 59538
  },
  "errors": ["attr_name_mismatch"],
  "union_shard_errors": ["oi_attr_missing"],
  "selected_object_info": "1:a70dc1cc:::100000ea8bb.00000045:head(2897'59538 client.4895965.0:462007 dirty|data_digest|omap_digest s 4194304 uv 59538 dd f437a612 od ffffffff alloc_hint [0 0])",
  "shards": [
    {
      "osd": 33,
      "errors": [],
      "size": 4194304,
      "omap_digest": "0xffffffff",
      "data_digest": "0xf437a612",
      "attrs": [
        {"name": "_",
         "value": "EAgNAQAABAM1AA...",
         "Base64": true},
        {"name": "snapset",
         "value": "AgIZAAAAAQAAAA...",
         "Base64": true}
      ]
    },
    {
      "osd": 62,
      "errors": [],
      "size": 4194304,
      "omap_digest": "0xffffffff",
      "data_digest": "0xf437a612",
      "attrs": [
        {"name": "_",
         "value": "EAgNAQAABAM1AA...",
         "Base64": true},
        {"name": "snapset",
         "value": "AgIZAAAAAQAAAA...",
         "Base64": true}
      ]
    },
    {
      "osd": 67,
      "errors": ["oi_attr_missing"],
      "size": 4194304,
      "omap_digest": "0xffffffff",
      "data_digest": "0xf437a612",
      "attrs": []
    }
  ]
}

Clearly, on osd.67, the “attrs” array is empty. The question is,
how do I fix this?
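
For completeness, this is how I’d expect to confirm the missing xattr
on the osd.67 copy directly (a sketch only, assuming filestore stores
the object_info as the "user.ceph._" xattr on the backing file, whose
on-disk name starts with the escaped object name):

# on the host for osd.67
cd /var/lib/ceph/osd/ceph-67/current/1.65_head
find . -name '100000ea8bb.00000045*' -exec getfattr -d -m - {} +
# healthy replicas show user.ceph._ and user.ceph.snapset; the bad copy
# presumably shows neither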

Many thanks in advance,

-kc

Brad Hubbard
2018-11-14 23:19:14 UTC
You could try a 'rados get' and then a 'rados put' on the object to start with.
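
i.e. something along these lines (a sketch; the pool is cephfs_data
going by your earlier output, and the idea is that rewriting the
object recreates its attributes on every replica):

rados -p cephfs_data get 100000ea8bb.00000045 /tmp/100000ea8bb.00000045
rados -p cephfs_data put 100000ea8bb.00000045 /tmp/100000ea8bb.00000045
# then deep-scrub again to see whether the inconsistency clears
ceph pg deep-scrub 1.65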
--
Cheers,
Brad
K.C. Wong
2018-11-12 07:02:34 UTC
Thanks, Ashley.

Should I expect the deep-scrubbing to start immediately?

[***@mgmt01 ~]# ceph pg deep-scrub 1.65
instructing pg 1.65 on osd.62 to deep-scrub
[***@mgmt01 ~]# ceph pg ls deep_scrub
pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
16.75 430657 0 0 0 0 30754735820 3007 3007 active+clean+scrubbing+deep 2018-11-11 11:05:11.572325 39934'549067 39934:1311893 [4,64,35] 4 [4,64,35] 4 28743'539264 2018-11-07 02:17:53.293336 28743'539264 2018-11-03 14:39:44.837702
16.86 430617 0 0 0 0 30316842298 3048 3048 active+clean+scrubbing+deep 2018-11-11 15:56:30.148527 39934'548012 39934:1038058 [18,2,62] 18 [18,2,62] 18 26347'529815 2018-10-28 01:06:55.526624 26347'529815 2018-10-28 01:06:55.526624
16.eb 432196 0 0 0 0 30612459543 3071 3071 active+clean+scrubbing+deep 2018-11-11 11:02:46.993022 39934'550340 39934:3662047 [56,44,42] 56 [56,44,42] 56 28507'540255 2018-11-02 03:28:28.013949 28507'540255 2018-11-02 03:28:28.013949
16.f3 431399 0 0 0 0 30672009253 3067 3067 active+clean+scrubbing+deep 2018-11-11 17:40:55.732162 39934'549240 39934:2212192 [69,82,6] 69 [69,82,6] 69 28743'539336 2018-11-02 17:22:05.745972 28743'539336 2018-11-02 17:22:05.745972
16.f7 430885 0 0 0 0 30796505272 3100 3100 active+clean+scrubbing+deep 2018-11-11 22:50:05.231599 39934'548910 39934:683169 [59,63,119] 59 [59,63,119] 59 28743'539167 2018-11-03 07:24:43.776341 26347'530830 2018-10-28 04:44:12.276982
16.14c 430565 0 0 0 0 31177011073 3042 3042 active+clean+scrubbing+deep 2018-11-11 20:11:31.107313 39934'550564 39934:1545200 [41,12,70] 41 [41,12,70] 41 28743'540758 2018-11-03 23:04:49.155741 28743'540758 2018-11-03 23:04:49.155741
16.156 430356 0 0 0 0 31021738479 3006 3006 active+clean+scrubbing+deep 2018-11-11 20:44:14.019537 39934'549241 39934:2958053 [83,47,1] 83 [83,47,1] 83 28743'539462 2018-11-04 14:46:56.890822 28743'539462 2018-11-04 14:46:56.890822
16.19f 431613 0 0 0 0 30746145827 3063 3063 active+clean+scrubbing+deep 2018-11-11 19:06:40.693002 39934'549429 39934:1189872 [14,54,37] 14 [14,54,37] 14 28743'539660 2018-11-04 18:25:13.225962 26347'531345 2018-10-28 20:08:45.286421
16.1b1 431225 0 0 0 0 30988996529 3048 3048 active+clean+scrubbing+deep 2018-11-11 20:12:35.367935 39934'549604 39934:778127 [34,106,11] 34 [34,106,11] 34 26347'531560 2018-10-27 16:49:46.944748 26347'531560 2018-10-27 16:49:46.944748
16.1e2 431724 0 0 0 0 30247732969 3070 3070 active+clean+scrubbing+deep 2018-11-11 20:55:17.591646 39934'550105 39934:1428341 [103,48,3] 103 [103,48,3] 103 28743'540270 2018-11-06 03:36:30.531106 28507'539840 2018-11-02 01:08:23.268409
16.1f3 430604 0 0 0 0 30633545866 3039 3039 active+clean+scrubbing+deep 2018-11-11 20:15:28.557464 39934'548804 39934:1354817 [66,102,33] 66 [66,102,33] 66 28743'538896 2018-11-04 04:59:33.118414 28743'538896 2018-11-04 04:59:33.118414
[***@mgmt01 ~]# ceph pg ls inconsistent
pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
1.65 12806 0 0 0 0 30010463024 3008 3008 active+clean+inconsistent 2018-11-10 00:16:43.965966 39934'184512 39934:388820 [62,67,47] 62 [62,67,47] 62 28743'183853 2018-11-04 01:31:27.042458 28743'183853 2018-11-04 01:31:27.042458

It’s similar to when I issued “ceph pg repair 1.65”: it reported that
it was instructing osd.62 to repair 1.65, and then nothing seemed to
happen.
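
One thing worth checking is whether the request is simply queued
behind other scrubs (a sketch; the "ceph daemon" call has to run on
the host for osd.62):

ceph pg 1.65 query | grep -E '"state"|scrub_stamp'
ceph pg ls deep_scrub | grep '^1\.65 '
ceph daemon osd.62 config get osd_max_scrubs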

-kc
