Discussion:
Understanding/correcting sudden onslaught of unfound objects
Graham Allan
2018-02-13 00:26:24 UTC
Hi,

For the past few weeks I've been seeing a large number of pgs on our
main erasure coded pool being flagged inconsistent, followed by them
becoming active+recovery_wait+inconsistent with unfound objects. The
cluster is currently running luminous 12.2.2 but has in the past also
run its way through firefly, hammer and jewel.

Here's a sample object from "ceph pg list_missing" (there are 150
pg 70.467 is stuck unclean for 1004525.715896, current state active+recovery_wait+inconsistent, last acting [449,233,336,323,259,193]
{
"oid": {
"oid": "default.323253.6_20150226/Downloads/linux-nvme-HEAD-5aa2ffa/include/config/via/fir.h",
"key": "",
"snapid": -2,
"hash": 628294759,
"max": 0,
"pool": 70,
"namespace": ""
},
"need": "73222'132227",
"have": "0'0",
"flags": "none",
"locations": [
"193(5)",
"259(4)",
"449(0)"
]
},
When I trace through the filesystem on each OSD, I find the associated
file present on each OSD but with size 0 bytes.

Interestingly, for the 3 OSDs for which "list_missing" shows locations
above (193,259,449), the timestamp of the 0-byte file is recent (within
last few weeks). For the other 3 OSDs (233,336,323), it's in the distant
past (08/2015 and 02/2016).

All the unfound objects I've checked on this pg show the same pattern,
along with the "have" epoch showing as "0'0".
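For reference, a rough sketch of how I've been locating the shard files
on the filestore OSDs - the osd id and pg shard below are just taken from
the example above, and objects long enough to hit the long-filename
hashing get a mangled on-disk name, so a plain substring match like this
won't always find them:

# run on the host holding shard 0 of pg 70.467 (osd.449 in the acting set above)
find /var/lib/ceph/osd/ceph-449/current/70.467s0_head -name '*fir.h*' -exec ls -l {} +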

Other than the potential data loss being disturbing, I wonder why this
showed up so suddenly?

It seems to have been triggered by one OSD host failing over a long
weekend. By the time we looked at it on Monday, the cluster had
re-balanced enough data that I decided to simply leave it - we had long
wanted to evacuate a first host to convert to a newer OS release, as
well as Bluestore. Perhaps this was a bad choice, but the cluster
recovery appeared to be proceeding normally, and was apparently complete
a few days later. It was only around a week later that the unfound
objects started.

All the unfound object file fragments I've tracked down so far have
their older members with timestamps in the same mid-2015 to mid-2016
period. I could be wrong but this really seems like a long-standing
problem has just been unearthed. I wonder if it could be connected to
this thread from early 2016, concerning a problem on the same cluster:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008120.html

It's a long thread, but the 0-byte files sound very like the "orphaned
files" in that thread - related to performing a directory split while
handling links on a filename with the special long filename handling...

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008317.html

However unlike that thread, I'm not finding any other files with
duplicate names in the hierarchy.

I'm not sure there's much else I can do besides record the names of any
unfound objects before resorting to "mark_unfound_lost delete" - any
suggestions for further research?
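(What I have in mind is only a sketch along these lines - the output
filename is just an example:

ceph pg 70.467 list_missing > pg-70.467-missing.json   # keep a record of the unfound object names
ceph pg 70.467 mark_unfound_lost delete                # last resort - gives up on the unfound objects

so at least the object names survive for any later cleanup via radosgw.)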

Thanks,

Graham
--
Graham Allan
Minnesota Supercomputing Institute - ***@umn.edu
Graham Allan
2018-02-13 16:41:14 UTC
I'm replying to myself here, but it's probably worth mentioning that
after this started, I did bring back the failed host, though with "ceph
osd weight 0" to avoid more data movement.

For inconsistent pgs containing unfound objects, the output of "ceph pg
<n> query" does then show the original osd being queried for objects,
and indeed if I dig through the filesystem I find the same 0-byte files
dated from 2015-2016.

This strongly implies to me that data loss occurred a long time in the
past and is not related to the osd host going down - this only triggered
the problem being found.
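(For what it's worth, the way I'm checking which OSDs get queried is
just the pg query output - roughly:

ceph pg 70.467 query > pg-70.467-query.json
grep -A20 might_have_unfound pg-70.467-query.json   # shows which OSDs are still being probed

assuming I'm reading the "might_have_unfound" section correctly.)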

Graham
Post by Graham Allan
Hi,
For the past few weeks I've been seeing a large number of pgs on our
main erasure coded pool being flagged inconsistent, followed by them
becoming active+recovery_wait+inconsistent with unfound objects. The
cluster is currently running luminous 12.2.2 but has in the past also
run its way through firefly, hammer and jewel.
Here's a sample object from "ceph pg list_missing" (there are 150
    pg 70.467 is stuck unclean for 1004525.715896, current state
active+recovery_wait+inconsistent, last acting [449,233,336,323,259,193]
        {
            "oid": {
"default.323253.6_20150226/Downloads/linux-nvme-HEAD-5aa2ffa/include/config/via/fir.h",
                "key": "",
                "snapid": -2,
                "hash": 628294759,
                "max": 0,
                "pool": 70,
                "namespace": ""
            },
            "need": "73222'132227",
            "have": "0'0",
            "flags": "none",
            "locations": [
                "193(5)",
                "259(4)",
                "449(0)"
            ]
        },
When I trace through the filesystem on each OSD, I find the associated
file present on each OSD but with size 0 bytes.
Interestingly, for the 3 OSDs for which "list_missing" shows locations
above (193,259,449), the timestamp of the 0-byte file is recent (within
last few weeks). For the other 3 OSDs (233,336,323), it's in the distant
past (08/2015 and 02/2016).
All the unfound objects I've checked on this pg show the same pattern,
along with the "have" epoch showing as "0'0".
Other than the potential data loss being disturbing, I wonder why this
showed up so suddenly?
It seems to have been triggered by one OSD host failing over a long
weekend. By the time we looked at it on Monday, the cluster had
re-balanced enough data that I decided to simply leave it - we had long
wanted to evacuate a first host to convert to a newer OS release, as
well as Bluestore. Perhaps this was a bad choice, but the cluster
recovery appeared to be proceeding normally, and was apparently complete
a few days later. It was only around a week later that the unfound
objects started.
All the unfound object file fragments I've tracked down so far have
their older members with timestamps in the same mid-2015 to mid-2016
period. I could be wrong but this really seems like a long-standing
problem has just been unearthed. I wonder if it could be connected to
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008120.html
It's a long thread, but the 0-byte files sound very like the "orphaned
files" in that thread - related to performing a directory split while
handling links on a filename with the special long filename handling...
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008317.html
However unlike that thread, I'm not finding any other files with
duplicate names in the hierarchy.
I'm not sure there's much else I can do besides record the names of any
unfound objects before resorting to "mark_unfound_lost delete" - any
suggestions for further research?
Thanks,
Graham
--
Graham Allan
Minnesota Supercomputing Institute - ***@umn.edu
Gregory Farnum
2018-02-14 17:49:41 UTC
Post by Graham Allan
I'm replying to myself here, but it's probably worth mentioning that
after this started, I did bring back the failed host, though with "ceph
osd weight 0" to avoid more data movement.
For inconsistent pgs containing unfound objects, the output of "ceph pg
<n> query" does then show the original osd being queried for objects,
and indeed if I dig through the filesystem I find the same 0-byte files
dated from 2015-2016.
This strongly implies to me that data loss occurred a long time in the
past and is not related to the osd host going down - this only triggered
the problem being found.
I would assume that too, but unless you had scrubbing disabled then it
should have been discovered long ago; I don’t understand how it could have
stayed hidden. Did you change any other settings recently?

Or, what is this EC pool being used for, and what are the EC settings?
Having a bunch of empty files is not surprising if the objects are smaller
than the chunk/stripe size — then just the primary and the parity locations
would actually have data for them.
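If you don't have them handy, something like this should show the
profile (pool and profile names are placeholders):

ceph osd pool get <pool-name> erasure_code_profile
ceph osd erasure-code-profile get <profile-name>    # shows k, m, plugin etc.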
Post by Graham Allan
Graham
Post by Graham Allan
Hi,
For the past few weeks I've been seeing a large number of pgs on our
main erasure coded pool being flagged inconsistent, followed by them
becoming active+recovery_wait+inconsistent with unfound objects. The
cluster is currently running luminous 12.2.2 but has in the past also
run its way through firefly, hammer and jewel.
Here's a sample object from "ceph pg list_missing" (there are 150
pg 70.467 is stuck unclean for 1004525.715896, current state
active+recovery_wait+inconsistent, last acting [449,233,336,323,259,193]
{
"oid": {
"default.323253.6_20150226/Downloads/linux-nvme-HEAD-5aa2ffa/include/config/via/fir.h",
Post by Graham Allan
"key": "",
"snapid": -2,
"hash": 628294759,
"max": 0,
"pool": 70,
"namespace": ""
},
"need": "73222'132227",
"have": "0'0",
"flags": "none",
"locations": [
"193(5)",
"259(4)",
"449(0)"
]
},
When I trace through the filesystem on each OSD, I find the associated
file present on each OSD but with size 0 bytes.
Interestingly, for the 3 OSDs for which "list_missing" shows locations
above (193,259,449), the timestamp of the 0-byte file is recent (within
last few weeks). For the other 3 OSDs (233,336,323), it's in the distant
past (08/2015 and 02/2016).
All the unfound objects I've checked on this pg show the same pattern,
along with the "have" epoch showing as "0'0".
Other than the potential data loss being disturbing, I wonder why this
showed up so suddenly?
It seems to have been triggered by one OSD host failing over a long
weekend. By the time we looked at it on Monday, the cluster had
re-balanced enough data that I decided to simply leave it - we had long
wanted to evacuate a first host to convert to a newer OS release, as
well as Bluestore. Perhaps this was a bad choice, but the cluster
recovery appeared to be proceeding normally, and was apparently complete
a few days later. It was only around a week later that the unfound
objects started.
All the unfound object file fragments I've tracked down so far have
their older members with timestamps in the same mid-2015 to mid-2016
period. I could be wrong but this really seems like a long-standing
problem has just been unearthed. I wonder if it could be connected to
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008120.html
Post by Graham Allan
It's a long thread, but the 0-byte files sound very like the "orphaned
files" in that thread - related to performing a directory split while
handling links on a filename with the special long filename handling...
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008317.html
Post by Graham Allan
However unlike that thread, I'm not finding any other files with
duplicate names in the hierarchy.
I'm not sure there's much else I can do besides record the names of any
unfound objects before resorting to "mark_unfound_lost delete" - any
suggestions for further research?
Thanks,
Graham
--
Graham Allan
_______________________________________________
ceph-users mailing list
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Graham Allan
2018-02-15 17:41:39 UTC
Hi Greg,
Post by Graham Allan
I'm replying to myself here, but it's probably worth mentioning that
after this started, I did bring back the failed host, though with "ceph
osd weight 0" to avoid more data movement.
For inconsistent pgs containing unfound objects, the output of "ceph pg
<n> query" does then show the original osd being queried for objects,
and indeed if I dig through the filesystem I find the same 0-byte files
dated from 2015-2016.
This strongly implies to me that data loss occurred a long time in the
past and is not related to the osd host going down - this only triggered
the problem being found.
I would assume that too, but unless you had scrubbing disabled then it
should have been discovered long ago; I don’t understand how it could
have stayed hidden. Did you change any other settings recently?
Or, what is this EC pool being used for, and what are the EC settings?
Having a bunch of empty files is not surprising if the objects are
smaller than the chunk/stripe size — then just the primary and the
parity locations would actually have data for them.
We didn't have scrubbing disabled - it /was/ dialed back somewhat using
the "osd scrub sleep|priority|load_threshold" parameters but I would see
it running, and keeping an eye on the last scrub times for pgs looked
ok. (I guess it is appropriate to revisit and maybe remove "tuning
tweaks" such as the above, as ceph defaults change/improve over time.)

The (4+2) EC pool is used as backing for radosgw - the pool itself
contains ~400T data in ~300M objects.

Having examined some specific "unfound" objects in some detail, I am
less pessimistic but even more confused.

For a huge number of the objects I find they are also listed as 0-bytes
in size when listed via s3 - and these are grouped by "directory" (path)
so it really makes me question whether these were ever successful
uploads. It would explain the 0-byte shard files I see in filestore.

However some objects definitely do contain data. I have some identified
from last week as unfound, which I traced to 6x 0-bytes filestore files,
and attempting to download them from s3 would simply stall (I do need to
follow up on radosgw logs for this case).

However these same test objects now *do* download successfully. The pg
itself has reverted to active+clean+inconsistent state. The files
contain image data so I can load them and see they are as expected. If I
again trawl through all the pg filestore locations, I can still only
find the six 0-byte files - where is the data coming from!?
find /var/lib/ceph/osd/ceph-295/current/70.3d6s0_head -name '*1089213*' -exec ls -ltr {} +
-rw-r--r-- 1 root root 0 Jan 30 13:50 /var/lib/ceph/osd/ceph-295/current/70.3d6s0_head/DIR_6/DIR_D/DIR_3/DIR_0/DIR_0/DIR_4/default.325674.85\ubellplants\uimages\s1089213.jpg__head_794003D6__46_ffffffffffffffff_0
find /var/lib/ceph/osd/ceph-221/current/70.3d6s1_head -name '*1089213*' -exec ls -ltr {} +
-rw-r--r-- 1 root root 0 Aug 22 2015 /var/lib/ceph/osd/ceph-221/current/70.3d6s1_head/DIR_6/DIR_D/DIR_3/DIR_0/DIR_0/DIR_4/default.325674.85\ubellplants\uimages\s1089213.jpg__head_794003D6__46_ffffffffffffffff_1
find /var/lib/ceph/osd/ceph-357/current/70.3d6s2_head -name '*1089213*' -exec ls -ltr {} +
-rw-r--r-- 1 root root 0 Feb 17 2016 /var/lib/ceph/osd/ceph-357/current/70.3d6s2_head/DIR_6/DIR_D/DIR_3/DIR_0/DIR_0/DIR_4/default.325674.85\ubellplants\uimages\s1089213.jpg__head_794003D6__46_ffffffffffffffff_2
Thanks for any insights!

Graham
--
Graham Allan
Minnesota Supercomputing Institute - ***@umn.edu
Gregory Farnum
2018-02-15 17:58:27 UTC
Post by Graham Allan
Hi Greg,
Post by Graham Allan
I'm replying to myself here, but it's probably worth mentioning that
after this started, I did bring back the failed host, though with
"ceph
Post by Graham Allan
osd weight 0" to avoid more data movement.
For inconsistent pgs containing unfound objects, the output of "ceph
pg
Post by Graham Allan
<n> query" does then show the original osd being queried for objects,
and indeed if I dig through the filesystem I find the same 0-byte
files
Post by Graham Allan
dated from 2015-2016.
This strongly implies to me that data loss occurred a long time in
the
Post by Graham Allan
past and is not related to the osd host going down - this only
triggered
Post by Graham Allan
the problem being found.
I would assume that too, but unless you had scrubbing disabled then it
should have been discovered long ago; I don’t understand how it could
have stayed hidden. Did you change any other settings recently?
Or, what is this EC pool being used for, and what are the EC settings?
Having a bunch of empty files is not surprising if the objects are
smaller than the chunk/stripe size — then just the primary and the
parity locations would actually have data for them.
We didn't have scrubbing disabled -it /was/ dialed back somewhat using
the "osd scrub sleep|priority|load_threshold" parameters but I would see
it running, and keeping an eye on the last scrub times for pgs looked
ok. (I guess it is appropriate to revisit and maybe remove "tuning
tweaks" such as the above, as ceph defaults change/improve over time.)
The (4+2) EC pool is used as backing for radosgw - the pool itself
contains ~400T data in ~300M objects.
Having examined some specific "unfound" objects in some detail, I am
less pessimistic but even more confused.
For a huge number of the objects I find they are also listed as 0-bytes
in size when listed via s3 - and these are grouped by "directory" (path)
so it really makes me question whether these were ever successful
uploads. It would explain the 0-byte shard files I see in filestore.
However some objects definitely do contain data. I have some identified
from last week as unfound, which I traced to 6x 0-bytes filestore files,
and attempting to download them from s3 would simply stall (I do need to
follow up on radosgw logs for this case).
However these same test objects now *do* downloads successfully. The pg
itself has reverted to active+clean+inconsistent state. The files
contain image data so I can load them and see they are as expected. If I
again trawl through all the pg filestore locations, I can still only
find the six 0-byte files - where is the data coming from!?
Well, if the objects were uploaded using multi-part upload I believe the
objects you’re looking at here will only contain omap (or xattr?) entries
pointing to the part files, so the empty file data is to be expected. This
might also make slightly more sense in terms of the scrub inconsistencies
popping up, although I didn’t think any omap issues I remember should have
impacted rgw.

Other than that, I’m not sure how it would be turning 0 bytes of data into
the correct results.
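If you want to check, you should be able to pull the head object's
xattrs straight from rados rather than off the filestore files -
roughly this, with the data pool name left as a placeholder:

rados -p <rgw-data-pool> listxattr 'default.325674.85_bellplants_images/1089213.jpg'
rados -p <rgw-data-pool> getxattr 'default.325674.85_bellplants_images/1089213.jpg' user.rgw.manifest | hexdump -C

The manifest attr is where the pointers to the tail/part objects should
live, if my memory of the layout is right.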
Post by Graham Allan
find /var/lib/ceph/osd/ceph-295/current/70.3d6s0_head -name '*1089213*' -exec ls -ltr {} +
-rw-r--r-- 1 root root 0 Jan 30 13:50 /var/lib/ceph/osd/ceph-295/current/70.3d6s0_head/DIR_6/DIR_D/DIR_3/DIR_0/DIR_0/DIR_4/default.325674.85\ubellplants\uimages\s1089213.jpg__head_794003D6__46_ffffffffffffffff_0
find /var/lib/ceph/osd/ceph-221/current/70.3d6s1_head -name '*1089213*' -exec ls -ltr {} +
-rw-r--r-- 1 root root 0 Aug 22 2015 /var/lib/ceph/osd/ceph-221/current/70.3d6s1_head/DIR_6/DIR_D/DIR_3/DIR_0/DIR_0/DIR_4/default.325674.85\ubellplants\uimages\s1089213.jpg__head_794003D6__46_ffffffffffffffff_1
find /var/lib/ceph/osd/ceph-357/current/70.3d6s2_head -name '*1089213*' -exec ls -ltr {} +
-rw-r--r-- 1 root root 0 Feb 17 2016 /var/lib/ceph/osd/ceph-357/current/70.3d6s2_head/DIR_6/DIR_D/DIR_3/DIR_0/DIR_0/DIR_4/default.325674.85\ubellplants\uimages\s1089213.jpg__head_794003D6__46_ffffffffffffffff_2
Thanks for any insights!
Graham
--
Graham Allan
Graham Allan
2018-02-15 23:10:50 UTC
Post by Gregory Farnum
Well, if the objects were uploaded using multi-part upload I believe the
objects you’re looking at here will only contain omap (or xattr?)
entries pointing to the part files, so the empty file data is to be
expected. This might also make slightly more sense in terms of the scrub
inconsistencies popping up, although I didn’t think any omap issues I
remember should have impacted rgw.
Other than that, I’m not sure how it would be turning 0 bytes of data
into the correct results.
That makes a lot of sense! So the 0-byte file here is effectively just a
holder for xattr data.

Now I'm trying to figure out how I can associate it to files which might
contain the data.

I don't see a lot of data in omap (assuming I'm looking at the right place):
.dir.default.325674.85 bellplants_images/1089213.jpg
00000000 05 03 bb 00 00 00 1d 00 00 00 62 65 6c 6c 70 6c |..........bellpl|
00000010 61 6e 74 73 5f 69 6d 61 67 65 73 2f 31 30 38 39 |ants_images/1089|
00000020 32 31 33 2e 6a 70 67 71 70 1f 00 00 00 00 00 01 |213.jpgqp.......|
00000030 03 03 3a 00 00 00 01 00 a4 92 00 00 00 00 00 10 |..:.............|
00000040 40 c5 55 00 00 00 00 00 00 00 00 08 00 00 00 74 |@.U............t|
00000050 70 72 61 74 68 65 72 11 00 00 00 54 68 6f 6d 61 |prather....Thoma|
00000060 73 20 46 2e 20 50 72 61 74 68 65 72 00 00 00 00 |s F. Prather....|
00000070 00 00 00 00 1d 00 00 00 62 65 6c 6c 70 6c 61 6e |........bellplan|
00000080 74 73 5f 69 6d 61 67 65 73 2f 31 30 38 39 32 31 |ts_images/108921|
00000090 33 2e 6a 70 67 01 01 06 00 00 00 46 84 71 70 1f |3.jpg......F.qp.|
000000a0 00 84 14 ee 09 00 17 00 00 00 64 65 66 61 75 6c |..........defaul|
000000b0 74 2e 33 32 35 36 37 34 2e 31 31 35 39 33 36 33 |t.325674.1159363|
000000c0 37 |7|
000000c1
find /var/lib/ceph/osd/ceph-295/current/70.3d6s0_head -name '*1089213*' -exec xattr {} +
user.ceph._user.rgw.idtag
user.cephos.spill_out
user.ceph._
user.ceph.snapset
user.ceph._user.rgw.manifest
user.ceph.hinfo_key
user.ceph._user.rgw.acl
user.ceph._user.rgw.x-amz-acl
user.ceph._user.rgw.etag
user.ceph._user.rgw.x-amz-date
user.ceph._user.rgw.content_type
Not sure which among these would contain pointers to part files.

While I'm less pessimistic about data being lost I'm still really
uncertain that the cluster is making progress towards a clean state.

One example: yesterday pg 70.438 showed "has 169 objects unfound and
apparently lost". At that time the state was
active+recovery_wait+inconsistent. Today it's showing no unfound objects
but is active+clean+inconsistent, and objects which were inaccessible
via radosgw yesterday can now download. I'm not sure what changed. I
have asked ceph to perform another deep scrub and repair on the pg, but
it has yet to start. I'm really curious to see if it becomes consistent,
or discovers unfound objects again.
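(For reference this is roughly what I'm running, nothing exotic:

ceph pg deep-scrub 70.438   # queue another deep scrub
ceph pg repair 70.438       # then ask it to repair
ceph -w | grep 70.438       # watch the cluster log for the scrub/repair results

in case I'm doing something obviously wrong.)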

Actually now I notice that a pg reported as
active+recovery_wait+inconsistent by "ceph health detail" is shown as
active+recovering+inconsistent by "ceph pg list". That makes more sense
to me - "recovery_wait" implied to me that it was waiting for recovery
to start, while "recovering" explains why the problem might clear itself.

Graham
--
Graham Allan
Minnesota Supercomputing Institute - ***@umn.edu
Gregory Farnum
2018-02-15 23:33:07 UTC
Post by Graham Allan
Post by Gregory Farnum
Well, if the objects were uploaded using multi-part upload I believe the
objects you’re looking at here will only contain omap (or xattr?)
entries pointing to the part files, so the empty file data is to be
expected. This might also make slightly more sense in terms of the scrub
inconsistencies popping up, although I didn’t think any omap issues I
remember should have impacted rgw.
Other than that, I’m not sure how it would be turning 0 bytes of data
into the correct results.
That makes a lot of sense! So the 0-byte file here is effectivelyjust a
holder for xattr data.
Now I'm trying to figure out how I can associate it to files which might
contain the data.
I don't see a lot of data in omap (assuming I'm looking at the right place):
.dir.default.325674.85 bellplants_images/1089213.jpg
00000000 05 03 bb 00 00 00 1d 00 00 00 62 65 6c 6c 70 6c |..........bellpl|
00000010 61 6e 74 73 5f 69 6d 61 67 65 73 2f 31 30 38 39 |ants_images/1089|
00000020 32 31 33 2e 6a 70 67 71 70 1f 00 00 00 00 00 01 |213.jpgqp.......|
00000030 03 03 3a 00 00 00 01 00 a4 92 00 00 00 00 00 10 |..:.............|
00000040 40 c5 55 00 00 00 00 00 00 00 00 08 00 00 00 74 |@.U............t|
00000050 70 72 61 74 68 65 72 11 00 00 00 54 68 6f 6d 61 |prather....Thoma|
00000060 73 20 46 2e 20 50 72 61 74 68 65 72 00 00 00 00 |s F. Prather....|
00000070 00 00 00 00 1d 00 00 00 62 65 6c 6c 70 6c 61 6e |........bellplan|
00000080 74 73 5f 69 6d 61 67 65 73 2f 31 30 38 39 32 31 |ts_images/108921|
00000090 33 2e 6a 70 67 01 01 06 00 00 00 46 84 71 70 1f |3.jpg......F.qp.|
000000a0 00 84 14 ee 09 00 17 00 00 00 64 65 66 61 75 6c |..........defaul|
000000b0 74 2e 33 32 35 36 37 34 2e 31 31 35 39 33 36 33 |t.325674.1159363|
000000c0 37 |7|
000000c1
find /var/lib/ceph/osd/ceph-295/current/70.3d6s0_head -name '*1089213*' -exec xattr {} +
user.ceph._user.rgw.idtag
user.cephos.spill_out
user.ceph._
user.ceph.snapset
user.ceph._user.rgw.manifest
user.ceph.hinfo_key
user.ceph._user.rgw.acl
user.ceph._user.rgw.x-amz-acl
user.ceph._user.rgw.etag
user.ceph._user.rgw.x-amz-date
user.ceph._user.rgw.content_type
Not sure which among these would contain pointers to part files.
I believe it’s the manifest xattrs there.
Post by Graham Allan
While I'm less pessimistic about data being lost I'm still really
uncertain that the cluster is making progress towards a clean state.
One example, yesterday pg 70.438 showed "has 169 objects unfound and
apparently lost". At that time the state was
active+recovery_wait+inconsistent. Today it's showing no unfound objects
but is active+clean+inconsistent, and objects which were inaccessible
via radosgw yesterday can now download. I'm not sure what changed. I
have asked ceph to perform another deep scrub and repair on the pg, but
it has yet to start. I'm really curious to see if it becomes consistent,
or discovers unfound objects again.
Actually now I notice that a pg reported as
active+recovery_wait+inconsistent by "ceph health detail" is shown as
active+recovering+inconsistent by "ceph pg list". That makes more sense
to me - "recovery_wait" implied to me that it was waiting for recovery
to start, while "recovering" explains why the problem might clear itself.
Right, “recovery_wait” means that the pg needs to do log-based recovery but
(at least) one of the participating OSDs doesn’t have a slot available;
that will resolve itself eventually.
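If you want to nudge it along you can temporarily raise the recovery
limits - something like the following, values just as an example:

ceph tell osd.* injectargs '--osd_recovery_max_active 4 --osd_max_backfills 2'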

It sounds like the scrubbing has detected some inconsistencies but the
reason you weren’t getting data is just that it hit an object which needed
recovery but was blocked waiting on it.
-Greg
Post by Graham Allan
Graham
--
Graham Allan
Graham Allan
2018-02-16 18:31:46 UTC
Post by Graham Allan
find /var/lib/ceph/osd/ceph-295/current/70.3d6s0_head -name '*1089213*' -exec xattr {} +
user.ceph._user.rgw.idtag
user.cephos.spill_out
user.ceph._
user.ceph.snapset
user.ceph._user.rgw.manifest
user.ceph.hinfo_key
user.ceph._user.rgw.acl
user.ceph._user.rgw.x-amz-acl
user.ceph._user.rgw.etag
user.ceph._user.rgw.x-amz-date
user.ceph._user.rgw.content_type
Not sure which among these would contain pointers to part files.
Post by Gregory Farnum
I believe it’s the manifest xattrs there.
Thanks, I see better now how this goes together. Rather than dumping out
the manifest xattrs, I guess I should also be able to understand this
from the output of "radosgw-admin object stat".
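Roughly what I have in mind - the bucket name is a placeholder here,
the key is from the example above:

radosgw-admin object stat --bucket=<bucket> --object='bellplants_images/1089213.jpg'

which should dump the manifest and show which rados objects hold the
actual data.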
Post by Graham Allan
One example, yesterday pg 70.438 showed "has 169 objects unfound and
apparently lost". At that time the state was
active+recovery_wait+inconsistent. Today it's showing no unfound objects
but is active+clean+inconsistent, and objects which were inaccessible
via radosgw yesterday can now download. I'm not sure what changed. I
have asked ceph to perform another deep scrub and repair on the pg, but
it has yet to start. I'm really curious to see if it becomes consistent,
or discovers unfound objects again.
Actually now I notice that a pg reported as
active+recovery_wait+inconsistent by "ceph health detail" is shown as
active+recovering+inconsistent by "ceph pg list". That makes more sense
to me - "recovery_wait" implied to me that it was waiting for recovery
to start, while "recovering" explains why the problem might clear itself.
Right, “recovery_wait” means that the pg needs to do log-based recovery
but (at least) one of the participating OSDs doesn’t have a slot
available; that will resolve itself eventually.
It sounds like the scrubbing has detected some inconsistencies but the
reason you weren’t getting data is just that it hit an object which
needed recovery but was blocked waiting on it.
Yes, though it seems to be stuck in a cycle. This morning, that same pg
70.438 is back in active+recovery_wait+inconsistent state with the same
169 unfound objects - and from s3, the objects which would download
successfully yesterday now stall. Which probably makes sense while in
"unfound" state, but clearly the data is there, while ceph is not
successful in making the pg consistent. Each time it repairs again, it
claims to fix more errors but finds the same number of unfound objects
again:
/var/log/ceph/ceph.log.2.gz:2018-02-14 15:48:14.438357 osd.175 osd.175 10.31.0.71:6838/66301 2928 : cluster [ERR] 70.438s0 repair 0 missing, 169 inconsistent objects
/var/log/ceph/ceph.log.2.gz:2018-02-14 15:48:14.442875 osd.175 osd.175 10.31.0.71:6838/66301 2929 : cluster [ERR] 70.438 repair 169 errors, 845 fixed
/var/log/ceph/ceph.log.1.gz:2018-02-15 19:42:25.040196 osd.175 osd.175 10.31.0.71:6838/66301 2995 : cluster [ERR] 70.438s0 repair 0 missing, 169 inconsistent objects
/var/log/ceph/ceph.log.1.gz:2018-02-15 19:42:25.046028 osd.175 osd.175 10.31.0.71:6838/66301 2996 : cluster [ERR] 70.438 repair 169 errors, 685 fixed
I also now see that for these unfound objects "radosgw-admin object
stat" also hangs. Clearly makes sense since radosgw must perform this to
retrieve the object. Does it imply that ceph can't access the "head"
object in order to read the xattr data?
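For what it's worth, the per-pg scrub errors can also be listed
directly, e.g.:

rados list-inconsistent-obj 70.438 --format=json-pretty

though I haven't yet worked out how to map those records onto the
unfound objects.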

If I set debug rgw=1 and debug ms=1 before running the "object stat"
command, it seems to stall in a loop of trying to communicate with osds for
pool 96, which is .rgw.control:
10.32.16.93:0/2689814946 --> 10.31.0.68:6818/8969 -- osd_op(unknown.0.0:541 96.e 96:7759931f:::notify.3:head [watch ping cookie 139709246356176] snapc 0=[] ondisk+write+known_if_redirected e507695) v8 -- 0x7f10ac033610 con 0
10.32.16.93:0/2689814946 <== osd.38 10.31.0.68:6818/8969 59 ==== osd_op_reply(541 notify.3 [watch ping cookie 139709246356176] v0'0 uv3933745 ondisk = 0) v8 ==== 152+0+0 (2536111836 0 0) 0x7f1158003e20 con 0x7f117afd8390
Prior to that, probably more relevant, this was the only communication:
10.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 -- osd_op(unknown.0.0:96 70.438s0 70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head [getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e507695) v8 -- 0x7fab79889fa0 con 0
10.32.16.93:0/1552085932 <== osd.175 10.31.0.71:6838/66301 1 ==== osd_backoff(70.438s0 block id 1 [70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head) e507695) v1 ==== 209+0+0 (1958971312 0 0) 0x7fab5003d3c0 con 0x7fab79885980
10.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 -- osd_backoff(70.438s0 ack-block id 1 [70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head) e507695) v1 -- 0x7fab48065420 con 0
so I guess the backoff message above is saying the object is
unavailable. OK, that certainly makes sense. Not sure that it helps me
understand how to fix the inconsistencies....

Graham
--
Graham Allan
Minnesota Supercomputing Institute - ***@umn.edu
Graham Allan
2018-02-16 20:17:10 UTC
Post by Graham Allan
If I set debug rgw=1 and demug ms=1 before running the "object stat"
command, it seems to stall in a loop of trying communicate with osds for
pool 96, which is .rgw.control
Post by Graham Allan
10.32.16.93:0/2689814946 --> 10.31.0.68:6818/8969 --
osd_op(unknown.0.0:541 96.e 96:7759931f:::notify.3:head [watch ping
cookie 139709246356176] snapc 0=[] ondisk+write+known_if_redirected
e507695) v8 -- 0x7f10ac033610 con 0
10.32.16.93:0/2689814946 <== osd.38 10.31.0.68:6818/8969 59 ====
osd_op_reply(541 notify.3 [watch ping cookie 139709246356176] v0'0
uv3933745 ondisk = 0) v8 ==== 152+0+0 (2536111836 0 0) 0x7f1158003e20
con 0x7f117afd8390
Prior to that, probably more relevant, this was the only communication
Post by Graham Allan
10.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 --
osd_op(unknown.0.0:96 70.438s0
70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head
[getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e507695)
v8 -- 0x7fab79889fa0 con 0
10.32.16.93:0/1552085932 <== osd.175 10.31.0.71:6838/66301 1 ====
osd_backoff(70.438s0 block id 1
[70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head)
e507695) v1 ==== 209+0+0 (1958971312 0 0) 0x7fab5003d3c0 con
0x7fab79885980
210.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 --
osd_backoff(70.438s0 ack-block id 1
[70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head)
e507695) v1 -- 0x7fab48065420 con 0
so I guess the backoff message above is saying the object is
unavailable. OK, that certainly makes sense. Not sure that it helps me
understand how to fix the inconsistencies....
If I restart the primary osd for the pg, that makes it forget its state
and return to active+clean+inconsistent. I can then download the
previously-unfound objects again, as well as run "radosgw-admin object
stat".

So the interesting bit is probably figuring out why it decides these
objects are unfound, when clearly they aren't.

What would be the best place to enable additional logging to understand
this - perhaps the primary osd?
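I'm assuming something along these lines on the primary (osd.175 in
this case), unless there's a better target:

ceph tell osd.175 injectargs '--debug_osd 20 --debug_ms 1'   # revert to the usual levels afterwards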

Thanks for all your help,

Graham
--
Graham Allan
Minnesota Supercomputing Institute - ***@umn.edu
Gregory Farnum
2018-02-16 21:40:54 UTC
Post by Graham Allan
Post by Graham Allan
If I set debug rgw=1 and demug ms=1 before running the "object stat"
command, it seems to stall in a loop of trying communicate with osds for
pool 96, which is .rgw.control
Post by Graham Allan
10.32.16.93:0/2689814946 --> 10.31.0.68:6818/8969 --
osd_op(unknown.0.0:541 96.e 96:7759931f:::notify.3:head [watch ping
cookie 139709246356176] snapc 0=[] ondisk+write+known_if_redirected
e507695) v8 -- 0x7f10ac033610 con 0
10.32.16.93:0/2689814946 <== osd.38 10.31.0.68:6818/8969 59 ====
osd_op_reply(541 notify.3 [watch ping cookie 139709246356176] v0'0
uv3933745 ondisk = 0) v8 ==== 152+0+0 (2536111836 <(253)%20611-1836> 0
0) 0x7f1158003e20
Post by Graham Allan
Post by Graham Allan
con 0x7f117afd8390
Prior to that, probably more relevant, this was the only communication
Post by Graham Allan
10.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 --
osd_op(unknown.0.0:96 70.438s0
70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head
[getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e507695)
v8 -- 0x7fab79889fa0 con 0
10.32.16.93:0/1552085932 <== osd.175 10.31.0.71:6838/66301 1 ====
osd_backoff(70.438s0 block id 1
[70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head)
Post by Graham Allan
Post by Graham Allan
e507695) v1 ==== 209+0+0 (1958971312 0 0) 0x7fab5003d3c0 con
0x7fab79885980
210.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 --
osd_backoff(70.438s0 ack-block id 1
[70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head)
Post by Graham Allan
Post by Graham Allan
e507695) v1 -- 0x7fab48065420 con 0
so I guess the backoff message above is saying the object is
unavailable. OK, that certainly makes sense. Not sure that it helps me
understand how to fix the inconsistencies....
If I restart the primary osd for the pg, that makes it forget its state
and return to active+clean+inconsistent. I can then download the
previously-unfound objects again, as well as run "radosgw-admin object
stat".
So the interesting bit is probably figuring out why it decides these
objects are unfound, when clearly they aren't.
What would be the best place to enable additional logging to understand
this - perhaps the primary osd?
David, this sounds like one of the bugs where an OSD can mark objects as
inconsistent locally but then doesn't actually trigger recovery on them. Or
it doesn't like any copy but doesn't persist that.
Do any known issues around that apply to 12.2.2?
-Greg
Post by Graham Allan
Thanks for all your help,
Graham
--
Graham Allan
David Zafman
2018-02-17 18:48:29 UTC
The commits below came after v12.2.2 and may impact this issue. When a
pg is active+clean+inconsistent it means that scrub has detected issues
with 1 or more replicas of 1 or more objects. An unfound object is a
potentially temporary state in which the current set of available OSDs
doesn't allow an object to be recovered/backfilled/repaired. When the
primary OSD restarts, any unfound objects (an in-memory structure) are
reset so that the new set of peered OSDs can determine again which
objects are unfound.

I'm not clear in this scenario whether recovery failed to start,
recovery hung due to a bug, or recovery stopped (as designed)
because of the unfound objects. The new recovery_unfound and
backfill_unfound states indicate that recovery has stopped due to
unfound objects.
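Once on a release that includes those states you should be able to
list affected pgs directly, something like (assuming the new state
names are accepted as filters):

ceph pg ls recovery_unfound
ceph pg ls backfill_unfound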


commit 64047e1bac2e775a06423a03cfab69b88462538c
Author: David Zafman <***@redhat.com>
Date:   Wed Jan 10 13:30:41 2018 -0800

    osd: Don't start recovery for missing until active pg state set

    I was seeing recovery hang when it is started before
_activate_committed()
    The state machine passes into "Active" but this transitions to
activating
    pg state and after commmitted into "active" pg state.

    Signed-off-by: David Zafman <***@redhat.com>

commit 7f8b0ce9e681f727d8217e3ed74a1a3355f364f3
Author: David Zafman <***@redhat.com>
Date:   Mon Oct 9 08:19:21 2017 -0700

    osd, mon: Add new pg states recovery_unfound and backfill_unfound
Post by Gregory Farnum
Post by Graham Allan
Post by Graham Allan
If I set debug rgw=1 and demug ms=1 before running the "object stat"
command, it seems to stall in a loop of trying communicate with osds for
pool 96, which is .rgw.control
Post by Graham Allan
10.32.16.93:0/2689814946 --> 10.31.0.68:6818/8969 --
osd_op(unknown.0.0:541 96.e 96:7759931f:::notify.3:head [watch ping
cookie 139709246356176] snapc 0=[] ondisk+write+known_if_redirected
e507695) v8 -- 0x7f10ac033610 con 0
10.32.16.93:0/2689814946 <== osd.38 10.31.0.68:6818/8969 59 ====
osd_op_reply(541 notify.3 [watch ping cookie 139709246356176] v0'0
uv3933745 ondisk = 0) v8 ==== 152+0+0 (2536111836 <(253)%20611-1836> 0
0) 0x7f1158003e20
Post by Graham Allan
Post by Graham Allan
con 0x7f117afd8390
Prior to that, probably more relevant, this was the only communication
Post by Graham Allan
10.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 --
osd_op(unknown.0.0:96 70.438s0
70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head
[getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e507695)
v8 -- 0x7fab79889fa0 con 0
10.32.16.93:0/1552085932 <== osd.175 10.31.0.71:6838/66301 1 ====
osd_backoff(70.438s0 block id 1
[70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head)
Post by Graham Allan
Post by Graham Allan
e507695) v1 ==== 209+0+0 (1958971312 0 0) 0x7fab5003d3c0 con
0x7fab79885980
210.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 --
osd_backoff(70.438s0 ack-block id 1
[70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head)
Post by Graham Allan
Post by Graham Allan
e507695) v1 -- 0x7fab48065420 con 0
so I guess the backoff message above is saying the object is
unavailable. OK, that certainly makes sense. Not sure that it helps me
understand how to fix the inconsistencies....
If I restart the primary osd for the pg, that makes it forget its state
and return to active+clean+inconsistent. I can then download the
previously-unfound objects again, as well as run "radosgw-admin object
stat".
So the interesting bit is probably figuring out why it decides these
objects are unfound, when clearly they aren't.
What would be the best place to enable additional logging to understand
this - perhaps the primary osd?
David, this sounds like one of the bugs where an OSD can mark objects as
inconsistent locally but then doesn't actually trigger recovery on them. Or
it doesn't like any copy but doesn't persist that.
Do any known issues around that apply to 12.2.2?
-Greg
Post by Graham Allan
Thanks for all your help,
Graham
--
Graham Allan
Graham Allan
2018-02-19 21:38:28 UTC
Post by David Zafman
The commits below came after v12.2.2 and may impact this issue. When a
pg is active+clean+inconsistent means that scrub has detected issues
with 1 or more replicas of 1 or more objects .  An unfound object is a
potentially temporary state in which the current set of available OSDs
doesn't allow an object to be recovered/backfilled/repaired.  When the
primary OSD restarts, any unfound objects ( an in memory structure) are
reset so that the new set of peered OSDs can determine again what
objects are unfound.
I'm not clear in this scenario whether recovery failed to start,
recovery hung before due to a bug or if recovery stopped (as designed)
because of the unfound object.  The new recovery_unfound and
backfill_unfound states indicates that recovery has stopped due to
unfound objects.
Thanks for your comments David. I could certainly enable any additional
logging that might help to clarify what's going on here - perhaps on the
primary OSD for a given pg?

I am still having a hard time understanding why these objects repeatedly
get flagged as unfound, when they are downloadable and contain correct
data whenever they are not in this state. It is a 4+2 EC pool, so I
would think it possible to reconstruct any missing EC chunks.

It's an extensive problem; while I have been focusing on examining a
couple of specific pgs, the pool in general is showing 2410 pgs
inconsistent (out of 4096).
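For scale I'm just counting them with something like:

ceph pg ls inconsistent | wc -l
rados list-inconsistent-pg <pool>    # pool name omitted here; lists the inconsistent pg ids

so I can at least watch whether the number trends downward.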

Graham
--
Graham Allan
Minnesota Supercomputing Institute - ***@umn.edu
Graham Allan
2018-03-13 22:48:24 UTC
Updated cluster now to 12.2.4 and the cycle of
inconsistent->repair->unfound seems to continue, though possibly
slightly differently. A pg does pass through an "active+clean" phase
after repair, which might be new, but more likely I never observed it at
the right time before.

I see messages like this in the logs now, with "attr name mismatch 'hinfo_key'" errors:
2018-03-02 18:55:11.583850 osd.386 osd.386 10.31.0.72:6817/4057280 401 : cluster [ERR] 70.3dbs0 : soid 70:dbc6ed68:::default.325674.85_bellplants_images%2f1055211.jpg:head attr name mismatch 'hinfo_key'
2018-03-02 19:00:18.031929 osd.386 osd.386 10.31.0.72:6817/4057280 428 : cluster [ERR] 70.3dbs0 : soid 70:dbc97561:::default.325674.85_bellplants_images%2f1017818.jpg:head attr name mismatch 'hinfo_key'
2018-03-02 19:04:50.058477 osd.386 osd.386 10.31.0.72:6817/4057280 452 : cluster [ERR] 70.3dbs0 : soid 70:dbcbcb34:::default.325674.85_bellplants_images%2f1049756.jpg:head attr name mismatch 'hinfo_key'
2018-03-02 19:13:05.689136 osd.386 osd.386 10.31.0.72:6817/4057280 494 : cluster [ERR] 70.3dbs0 : soid 70:dbcfc7c9:::default.325674.85_bellplants_images%2f1021177.jpg:head attr name mismatch 'hinfo_key'
2018-03-02 19:13:30.883100 osd.386 osd.386 10.31.0.72:6817/4057280 495 : cluster [ERR] 70.3dbs0 repair 0 missing, 161 inconsistent objects
2018-03-02 19:13:30.888259 osd.386 osd.386 10.31.0.72:6817/4057280 496 : cluster [ERR] 70.3db repair 161 errors, 161 fixed
The only similar-sounding issue I could find is

http://tracker.ceph.com/issues/20089

When I look at src/osd/PGBackend.cc be_compare_scrubmaps in luminous, I
don't see the changes in the commit here:

https://github.com/ceph/ceph/pull/15368/files

of course a lot of other things have changed, but is it possible this
fix never made it into luminous?
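(The way I was checking is just against a git checkout of ceph, e.g.:

git log v12.2.4 --oneline -- src/osd/PGBackend.cc
git tag --contains <commit-sha>    # sha left as a placeholder - I wasn't sure which merge to look for

so I may simply be looking at the wrong commits.)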

Graham
The commits below came after v12.2.2 and may impact this issue. When a
pg is active+clean+inconsistent means that scrub has detected issues
with 1 or more replicas of 1 or more objects .  An unfound object is a
potentially temporary state in which the current set of available OSDs
doesn't allow an object to be recovered/backfilled/repaired.  When the
primary OSD restarts, any unfound objects ( an in memory structure) are
reset so that the new set of peered OSDs can determine again what
objects are unfound.
I'm not clear in this scenario whether recovery failed to start,
recovery hung before due to a bug or if recovery stopped (as designed)
because of the unfound object.  The new recovery_unfound and
backfill_unfound states indicates that recovery has stopped due to
unfound objects.
commit 64047e1bac2e775a06423a03cfab69b88462538c
Date:   Wed Jan 10 13:30:41 2018 -0800
    osd: Don't start recovery for missing until active pg state set
    I was seeing recovery hang when it is started before
_activate_committed()
    The state machine passes into "Active" but this transitions to
activating
    pg state and after commmitted into "active" pg state.
commit 7f8b0ce9e681f727d8217e3ed74a1a3355f364f3
Date:   Mon Oct 9 08:19:21 2017 -0700
    osd, mon: Add new pg states recovery_unfound and backfill_unfound
Post by Gregory Farnum
Post by Graham Allan
Post by Graham Allan
If I set debug rgw=1 and demug ms=1 before running the "object stat"
command, it seems to stall in a loop of trying communicate with osds for
pool 96, which is .rgw.control
Post by Graham Allan
10.32.16.93:0/2689814946 --> 10.31.0.68:6818/8969 --
osd_op(unknown.0.0:541 96.e 96:7759931f:::notify.3:head [watch ping
cookie 139709246356176] snapc 0=[] ondisk+write+known_if_redirected
e507695) v8 -- 0x7f10ac033610 con 0
10.32.16.93:0/2689814946 <== osd.38 10.31.0.68:6818/8969 59 ====
osd_op_reply(541 notify.3 [watch ping cookie 139709246356176] v0'0
uv3933745 ondisk = 0) v8 ==== 152+0+0 (2536111836 <(253)%20611-1836> 0
0) 0x7f1158003e20
Post by Graham Allan
Post by Graham Allan
con 0x7f117afd8390
Prior to that, probably more relevant, this was the only communication
Post by Graham Allan
10.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 --
osd_op(unknown.0.0:96 70.438s0
70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head
[getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e507695)
v8 -- 0x7fab79889fa0 con 0
10.32.16.93:0/1552085932 <== osd.175 10.31.0.71:6838/66301 1 ====
osd_backoff(70.438s0 block id 1
[70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head)
Post by Graham Allan
Post by Graham Allan
e507695) v1 ==== 209+0+0 (1958971312 0 0) 0x7fab5003d3c0 con
0x7fab79885980
210.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 --
osd_backoff(70.438s0 ack-block id 1
[70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head)
Post by Graham Allan
Post by Graham Allan
e507695) v1 -- 0x7fab48065420 con 0
so I guess the backoff message above is saying the object is
unavailable. OK, that certainly makes sense. Not sure that it helps me
understand how to fix the inconsistencies....
If I restart the primary osd for the pg, that makes it forget its state
and return to active+clean+inconsistent. I can then download the
previously-unfound objects again, as well as run "radosgw-admin object
stat".
So the interesting bit is probably figuring out why it decides these
objects are unfound, when clearly they aren't.
What would be the best place to enable additional logging to understand
this - perhaps the primary osd?
David, this sounds like one of the bugs where an OSD can mark objects as
inconsistent locally but then doesn't actually trigger recovery on them. Or
it doesn't like any copy but doesn't persist that.
Do any known issues around that apply to 12.2.2?
-Greg
Post by Graham Allan
Thanks for all your help,
Graham
--
Graham Allan
--
Graham Allan
Minnesota Supercomputing Institute - ***@umn.edu
David Zafman
2018-03-14 18:16:22 UTC
The fix for tracker 20089 undid the changes you're seeing in the 15368
pull request.  The attr name mismatch of 'hinfo_key'  means that key is
missing because every erasure coded object should have a key called
"hinfo_key."

You should try to determine why your extended attributes are getting
corrupted.  All the errors are on shard 0.  My testing shows that repair
will fix this scenario.
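A quick way to spot-check a shard's attrs on filestore is to read them
directly off the file, e.g. (path shortened to a placeholder):

getfattr -n user.ceph.hinfo_key <path-to-shard-0-file>
getfattr -d -e hex <path-to-shard-0-file>    # dumps all user.* xattrs

A healthy shard should show a user.ceph.hinfo_key entry; keep in mind
attrs can also spill out to omap when they get large.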

David
Post by Graham Allan
Updated cluster now to 12.2.4 and the cycle of
inconsistent->repair->unfound seems to continue, though possibly
slightly differently. A pg does pass through an "active+clean" phase
after repair, which might be new, but more likely I never observed it
at the right time before.
I see messages like this in the logs now "attr name mismatch
2018-03-02 18:55:11.583850 osd.386 osd.386 10.31.0.72:6817/4057280
401 : cluster [ERR] 70.3dbs0 : soid
70:dbc6ed68:::default.325674.85_bellplants_images%2f1055211.jpg:head
attr name mismatch 'hinfo_key'
2018-03-02 19:00:18.031929 osd.386 osd.386 10.31.0.72:6817/4057280
428 : cluster [ERR] 70.3dbs0 : soid
70:dbc97561:::default.325674.85_bellplants_images%2f1017818.jpg:head
attr name mismatch 'hinfo_key'
2018-03-02 19:04:50.058477 osd.386 osd.386 10.31.0.72:6817/4057280
452 : cluster [ERR] 70.3dbs0 : soid
70:dbcbcb34:::default.325674.85_bellplants_images%2f1049756.jpg:head
attr name mismatch 'hinfo_key'
2018-03-02 19:13:05.689136 osd.386 osd.386 10.31.0.72:6817/4057280
494 : cluster [ERR] 70.3dbs0 : soid
70:dbcfc7c9:::default.325674.85_bellplants_images%2f1021177.jpg:head
attr name mismatch 'hinfo_key'
2018-03-02 19:13:30.883100 osd.386 osd.386 10.31.0.72:6817/4057280
495 : cluster [ERR] 70.3dbs0 repair 0 missing, 161 inconsistent objects
2018-03-02 19:13:30.888259 osd.386 osd.386 10.31.0.72:6817/4057280
496 : cluster [ERR] 70.3db repair 161 errors, 161 fixed
The only similar-sounding issue I could find is
http://tracker.ceph.com/issues/20089
When I look at src/osd/PGBackend.cc be_compare_scrubmaps in luminous,
https://github.com/ceph/ceph/pull/15368/files
of course a lot of other things have changed, but is it possible this
fix never made it into luminous?
Graham
The commits below came after v12.2.2 and may impact this issue. When
a pg is active+clean+inconsistent means that scrub has detected
issues with 1 or more replicas of 1 or more objects . An unfound
object is a potentially temporary state in which the current set of
available OSDs doesn't allow an object to be
recovered/backfilled/repaired.  When the primary OSD restarts, any
unfound objects ( an in memory structure) are reset so that the new
set of peered OSDs can determine again what objects are unfound.
I'm not clear in this scenario whether recovery failed to start,
recovery hung before due to a bug or if recovery stopped (as
designed) because of the unfound object.  The new recovery_unfound
and backfill_unfound states indicates that recovery has stopped due
to unfound objects.
commit 64047e1bac2e775a06423a03cfab69b88462538c
Date:   Wed Jan 10 13:30:41 2018 -0800
     osd: Don't start recovery for missing until active pg state set
     I was seeing recovery hang when it is started before
_activate_committed()
     The state machine passes into "Active" but this transitions to
activating
     pg state and after commmitted into "active" pg state.
commit 7f8b0ce9e681f727d8217e3ed74a1a3355f364f3
Date:   Mon Oct 9 08:19:21 2017 -0700
     osd, mon: Add new pg states recovery_unfound and backfill_unfound
Post by Gregory Farnum
Post by Graham Allan
Post by Graham Allan
If I set debug rgw=1 and demug ms=1 before running the "object stat"
command, it seems to stall in a loop of trying communicate with osds for
pool 96, which is .rgw.control
Post by Graham Allan
10.32.16.93:0/2689814946 --> 10.31.0.68:6818/8969 --
osd_op(unknown.0.0:541 96.e 96:7759931f:::notify.3:head [watch ping
cookie 139709246356176] snapc 0=[] ondisk+write+known_if_redirected
e507695) v8 -- 0x7f10ac033610 con 0
10.32.16.93:0/2689814946 <== osd.38 10.31.0.68:6818/8969 59 ====
osd_op_reply(541 notify.3 [watch ping cookie 139709246356176] v0'0
uv3933745 ondisk = 0) v8 ==== 152+0+0 (2536111836
<(253)%20611-1836> 0
0) 0x7f1158003e20
Post by Graham Allan
Post by Graham Allan
con 0x7f117afd8390
Prior to that, probably more relevant, this was the only
communication
Post by Graham Allan
10.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 --
osd_op(unknown.0.0:96 70.438s0
70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head
[getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e507695)
v8 -- 0x7fab79889fa0 con 0
10.32.16.93:0/1552085932 <== osd.175 10.31.0.71:6838/66301 1 ====
osd_backoff(70.438s0 block id 1
[70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head)
Post by Graham Allan
Post by Graham Allan
e507695) v1 ==== 209+0+0 (1958971312 0 0) 0x7fab5003d3c0 con
0x7fab79885980
210.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 --
osd_backoff(70.438s0 ack-block id 1
[70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head)
Post by Graham Allan
Post by Graham Allan
e507695) v1 -- 0x7fab48065420 con 0
so I guess the backoff message above is saying the object is
unavailable. OK, that certainly makes sense. Not sure that it helps me
understand how to fix the inconsistencies....
If I restart the primary osd for the pg, that makes it forget its state
and return to active+clean+inconsistent. I can then download the
previously-unfound objects again, as well as run "radosgw-admin object
stat".
So the interesting bit is probably figuring out why it decides these
objects are unfound, when clearly they aren't.
What would be the best place to enable additional logging to understand
this - perhaps the primary osd?
David, this sounds like one of the bugs where an OSD can mark objects as
inconsistent locally but then doesn't actually trigger recovery on them. Or
it doesn't like any copy but doesn't persist that.
Do any known issues around that apply to 12.2.2?
-Greg
Post by Graham Allan
Thanks for all your help,
Graham
--
Graham Allan