Discussion:
[ceph-users] Inconsistent PG could not be repaired
Arvydas Opulskis
2018-07-24 07:27:43 UTC
Hello, Cephers,

After trying different repair approaches, I am out of ideas on how to repair an
inconsistent PG. I hope someone's sharp eye will notice what I overlooked.

Some info about cluster:
Centos 7.4
Jewel 10.2.10
Pool size 2 (yes, I know it's a very bad choice)
Pool with inconsistent PG: .rgw.buckets

After a routine deep-scrub, I found PG 26.c3f in an inconsistent state. While
running the "ceph pg repair 26.c3f" command and monitoring the "ceph -w" log, I
noticed these errors:

2018-07-24 08:28:06.517042 osd.36 [ERR] 26.c3f shard 30: soid
26:fc32a1f1:::default.142609570.87_20180206.093111%2frepositories%2fnuget-local%2fApplication%2fCompany.Application.Api%2fCompany.Application.Api.1.1.1.nupkg.artifactory-metadata%2fproperties.xml:head
data_digest 0x540e4f8b != data_digest 0x49a34c1f from auth oi
26:e261561a:::default.168602061.10_team-xxx.xxx-jobs.H6.HADOOP.data-segmentation.application.131.xxx-jvm.cpu.load%2f2018-05-05T03%3a51%3a39+00%3a00.sha1:head(167828'216051
client.179334015.0:1847715760 dirty|data_digest|omap_digest s 40 uv 216051
dd 49a34c1f od ffffffff alloc_hint [0 0])

2018-07-24 08:28:06.517118 osd.36 [ERR] 26.c3f shard 36: soid
26:fc32a1f1:::default.142609570.87_20180206.093111%2frepositories%2fnuget-local%2fApplication%2fCompany.Application.Api%2fCompany.Application.Api.1.1.1.nupkg.artifactory-metadata%2fproperties.xml:head
data_digest 0x540e4f8b != data_digest 0x49a34c1f from auth oi
26:e261561a:::default.168602061.10_team-xxx.xxx-jobs.H6.HADOOP.data-segmentation.application.131.xxx-jvm.cpu.load%2f2018-05-05T03%3a51%3a39+00%3a00.sha1:head(167828'216051
client.179334015.0:1847715760 dirty|data_digest|omap_digest s 40 uv 216051
dd 49a34c1f od ffffffff alloc_hint [0 0])

2018-07-24 08:28:06.517122 osd.36 [ERR] 26.c3f soid
26:fc32a1f1:::default.142609570.87_20180206.093111%2frepositories%2fnuget-local%2fApplication%2fCompany.Application.Api%2fCompany.Application.Api.1.1.1.nupkg.artifactory-metadata%2fproperties.xml:head:
failed to pick suitable auth object

...and the same errors about another object in the same PG.

The repair failed, so I checked the inconsistencies with "rados list-inconsistent-obj
26.c3f --format=json-pretty":

{
"epoch": 178403,
"inconsistents": [
{
"object": {
"name":
"default.142609570.87_20180203.020047\/repositories\/docker-local\/yyy\/company.yyy.api.assets\/1.2.4\/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d",
"nspace": "",
"locator": "",
"snap": "head",
"version": 217749
},
"errors": [],
"union_shard_errors": [
"data_digest_mismatch_oi"
],
"selected_object_info":
"26:f4ce1748:::default.168602061.10_team-xxx.xxx-jobs.H6.HADOOP.data-segmentation.application.131.xxx-jvm.cpu.load%2f2018-05-08T03%3a45%3a15+00%3a00.sha1:head(167944'217749
client.177936559.0:1884719302 dirty|data_digest|omap_digest s 40 uv 217749
dd 422f251b od ffffffff alloc_hint [0 0])",
"shards": [
{
"osd": 30,
"errors": [
"data_digest_mismatch_oi"
],
"size": 40,
"omap_digest": "0xffffffff",
"data_digest": "0x551c282f"
},
{
"osd": 36,
"errors": [
"data_digest_mismatch_oi"
],
"size": 40,
"omap_digest": "0xffffffff",
"data_digest": "0x551c282f"
}
]
},
{
"object": {
"name":
"default.142609570.87_20180206.093111\/repositories\/nuget-local\/Application\/Company.Application.Api\/Company.Application.Api.1.1.1.nupkg.artifactory-metadata\/properties.xml",
"nspace": "",
"locator": "",
"snap": "head",
"version": 216051
},
"errors": [],
"union_shard_errors": [
"data_digest_mismatch_oi"
],
"selected_object_info":
"26:e261561a:::default.168602061.10_team-xxx.xxx-jobs.H6.HADOOP.data-segmentation.application.131.xxx-jvm.cpu.load%2f2018-05-05T03%3a51%3a39+00%3a00.sha1:head(167828'216051
client.179334015.0:1847715760 dirty|data_digest|omap_digest s 40 uv 216051
dd 49a34c1f od ffffffff alloc_hint [0 0])",
"shards": [
{
"osd": 30,
"errors": [
"data_digest_mismatch_oi"
],
"size": 40,
"omap_digest": "0xffffffff",
"data_digest": "0x540e4f8b"
},
{
"osd": 36,
"errors": [
"data_digest_mismatch_oi"
],
"size": 40,
"omap_digest": "0xffffffff",
"data_digest": "0x540e4f8b"
}
]
}
]
}


After some reading, I understood I needed the rados get/put trick to solve
this problem. I couldn't do a rados get, because I was getting a "no such file"
error even though the objects were listed by the "rados ls" command, so I got them
directly from the OSD. After putting them back via rados (the rados commands didn't
return any errors) and doing a deep-scrub on the same PG, the problem still
existed. The only thing that changed: when I try to get an object via rados now, I
get "(5) Input/output error".

I tried forcing the object size to 40 (the real size of both objects) by adding
the "-o 40" option to the "rados put" command, but with no luck.

Guys, maybe you have other ideas on what to try? Why doesn't overwriting the
object solve this problem?

Thanks a lot!

Arvydas
Arvydas Opulskis
2018-08-06 08:11:38 UTC
Hi again,

After two weeks I've got another inconsistent PG in the same cluster. The OSDs are
different from the first PG, and the object cannot be fetched with "rados get" either:

# rados list-inconsistent-obj 26.821 --format=json-pretty
{
"epoch": 178472,
"inconsistents": [
{
"object": {
"name":
"default.122888368.52__shadow_.3ubGZwLcz0oQ55-LTb7PCOTwKkv-nQf_7",
"nspace": "",
"locator": "",
"snap": "head",
"version": 118920
},
"errors": [],
"union_shard_errors": [
"data_digest_mismatch_oi"
],
"selected_object_info":
"26:8411bae4:::default.122888368.52__shadow_.3ubGZwLcz0oQ55-LTb7PCOTwKkv-nQf_7:head(126495'118920
client.142609570.0:41412640 dirty|data_digest|omap_digest s 4194304 uv
118920 dd cd142aaa od ffffffff alloc_hint [0 0])",
"shards": [
{
"osd": 20,
"errors": [
"data_digest_mismatch_oi"
],
"size": 4194304,
"omap_digest": "0xffffffff",
"data_digest": "0x6b102e59"
},
{
"osd": 44,
"errors": [
"data_digest_mismatch_oi"
],
"size": 4194304,
"omap_digest": "0xffffffff",
"data_digest": "0x6b102e59"
}
]
}
]
}


# rados -p .rgw.buckets get default.122888368.52__shadow_.3ubGZwLcz0oQ55-LTb7PCOTwKkv-nQf_7 test_2pg.file
error getting .rgw.buckets/default.122888368.52__shadow_.3ubGZwLcz0oQ55-LTb7PCOTwKkv-nQf_7: (5) Input/output error


Still struggling with how to solve it. Any ideas, guys?

Thank you
Brent Kennedy
2018-08-07 14:49:59 UTC
Last time I had an inconsistent PG that could not be repaired using the repair command, I looked at which OSDs hosted the PG, then restarted them one by one (usually stopping, waiting a few seconds, then starting them back up). You could also stop them, flush the journal, then start them back up.
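
On a Jewel cluster running under systemd that would look roughly like this; the PG
and OSD ids below are just the ones from your output, so adjust to your acting set:

# see which OSDs are in the acting set for the PG
ceph pg map 26.c3f

# restart them one at a time, letting the cluster settle in between
systemctl stop ceph-osd@30
ceph-osd -i 30 --flush-journal    # optional: flush the FileStore journal while it is stopped
systemctl start ceph-osd@30

# then kick off the repair again
ceph pg repair 26.c3f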



If that didn't work, it meant there was data loss and I had to use ceph-objectstore-tool to export the objects from a location that had the latest data and import them into the one that had no data. ceph-objectstore-tool is not a simple thing, though, and should not be used lightly. When I say data loss, I mean that Ceph thinks the last place written has the data, while that OSD doesn't actually have it (meaning the write there failed).
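
Very roughly, the ceph-objectstore-tool route looks like the sketch below. Both OSDs
have to be stopped while you run it, the paths are FileStore defaults, and this is
only an outline of the idea, not my full how-to:

# on the OSD that still has the good copy of the PG (OSD stopped)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-30 \
    --journal-path /var/lib/ceph/osd/ceph-30/journal \
    --op export --pgid 26.c3f --file /tmp/26.c3f.export

# on the OSD with the bad copy (OSD stopped): remove its copy of the PG, then import
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-36 \
    --journal-path /var/lib/ceph/osd/ceph-36/journal \
    --op remove --pgid 26.c3f
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-36 \
    --journal-path /var/lib/ceph/osd/ceph-36/journal \
    --op import --file /tmp/26.c3f.export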



If you want to go that route, let me know; I wrote a how-to on it. It should be the last resort, though. I also don't know your setup, so I would hate to recommend something so drastic.



-Brent



Arvydas Opulskis
2018-08-14 11:33:10 UTC
Thanks for the suggestion about restarting the OSDs, but that doesn't work either.

Anyway, I managed to fix the second unrepairable PG by getting the object from the OSD
and saving it again via rados, but still no luck with the first one.
I think I found the main reason why this doesn't work. It seems the object is not
overwritten, even though the rados command returns no errors. I tried to delete the
object, but it stays in the pool untouched. Here is an example of what I see:

# rados -p .rgw.buckets ls | grep -i "sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d"
default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d

# rados -p .rgw.buckets get default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d testfile
error getting .rgw.buckets/default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d: (2) No such file or directory

# rados -p .rgw.buckets rm default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d

# rados -p .rgw.buckets ls | grep -i "sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d"
default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d

I've never seen this in our Ceph clusters before. Should I report a bug
about it? If any of you need more diagnostic info, let me know.
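
For reference, a rough way to watch what actually happens on disk while doing this
(default FileStore paths; the escaped on-disk filename is illustrative):

# find the on-disk file for the object inside the PG directory on one of the acting OSDs
find /var/lib/ceph/osd/ceph-30/current/26.c3f_head/ -name '*ce41e5246ead*'

# record its size, mtime and checksum, then repeat after a "rados put" / "rados rm"
stat '/var/lib/ceph/osd/ceph-30/current/26.c3f_head/<escaped_object_name>'
md5sum '/var/lib/ceph/osd/ceph-30/current/26.c3f_head/<escaped_object_name>'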

Thanks,
Arvydas
Thomas White
2018-08-14 19:24:57 UTC
Hi Arvydas,



The error seems to suggest this is not an issue with your object data, but with the expected object digest data. I am unable to find where I stored my very hacky diagnosis process for this, but our eventual fix was to locate the affected bucket or files and then rename an object within it, forcing a recalculation of the digest. Depending on the size of the pool, perhaps you could rename a few files at random to trigger this recalculation and see if that remedies it?
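
Since S3 has no native rename, a rename is just a server-side copy plus delete; with
s3cmd, for example, it would look something like this (bucket and key are placeholders):

s3cmd mv s3://your-bucket/some/object s3://your-bucket/some/object.renamed
s3cmd mv s3://your-bucket/some/object.renamed s3://your-bucket/some/object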



Kind Regards,



Tom



Arvydas Opulskis
2018-08-16 13:06:43 UTC
Hi Thomas,

Thanks for the suggestion, but changing other objects, or even the object itself,
didn't help.

But I finally solved the problem with the following steps (a rough command sketch follows the list):

1. Backed up the problematic S3 object
2. Deleted it via S3
3. Stopped one OSD
4. Flushed its journal
5. Removed the object directly from the OSD
6. Started the OSD
7. Repeated steps 3-6 on the other OSD
8. Did a deep-scrub on the problematic PG (the inconsistency went away)
9. Checked the S3 bucket with the --fix option
10. Put the S3 object back via S3
11. Did a deep-scrub, checked for the object on the OSDs, etc., to make sure it exists
and can be accessed
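
Roughly, in commands. OSD ids, paths, bucket/key names and the escaped on-disk
filename below are illustrative; the S3 side was done with an ordinary S3 client
(s3cmd shown here as an example):

# 1-2: back up and delete the object through S3
s3cmd get s3://my-bucket/path/to/object /tmp/object.bak
s3cmd del s3://my-bucket/path/to/object

# 3-7: on each OSD in the acting set: stop it, flush the journal,
#      remove the leftover file from the PG directory, start it again
systemctl stop ceph-osd@30
ceph-osd -i 30 --flush-journal
rm '/var/lib/ceph/osd/ceph-30/current/26.c3f_head/<escaped_object_name>'
systemctl start ceph-osd@30

# 8-9: re-scrub the PG and check the bucket index
ceph pg deep-scrub 26.c3f
radosgw-admin bucket check --bucket=<bucket> --fix

# 10: upload the object again via S3
s3cmd put /tmp/object.bak s3://my-bucket/path/to/object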

Thanks for the ideas, guys!

Arvydas