Discussion:
[ceph-users] rocksdb: Corruption: missing start of fragmented record
Michael
2017-11-01 08:30:06 UTC
Hello everyone,

I've conducted some crash tests (unplugging drives, pulling power from the
machine, terminating and restarting ceph systemd services) with Ceph 12.2.0
on Ubuntu, and quite easily managed to corrupt what appears to be rocksdb's
log replay on a bluestore OSD:

# ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-2/
[...]
4 rocksdb:
[/build/ceph-pKGC1D/ceph-12.2.0/src/rocksdb/db/version_set.cc:2859]
Recovered from manifest file:db/MANIFEST-000975
succeeded,manifest_file_number is 975, next_file_number is 1008,
last_sequence is 51965907, log_number is 0,prev_log_number is
0,max_column_family is 0
4 rocksdb:
[/build/ceph-pKGC1D/ceph-12.2.0/src/rocksdb/db/version_set.cc:2867]
Column family [default] (ID 0), log number is 1005
4 rocksdb: EVENT_LOG_v1 {"time_micros": 1509298585082794, "job": 1,
"event": "recovery_started", "log_files": [1003, 1005]}
4 rocksdb:
[/build/ceph-pKGC1D/ceph-12.2.0/src/rocksdb/db/db_impl_open.cc:482]
Recovering log #1003 mode 0
4 rocksdb:
[/build/ceph-pKGC1D/ceph-12.2.0/src/rocksdb/db/db_impl_open.cc:482]
Recovering log #1005 mode 0
3 rocksdb:
[/build/ceph-pKGC1D/ceph-12.2.0/src/rocksdb/db/db_impl_open.cc:424]
db/001005.log: dropping 3225 bytes; Corruption: missing start of
fragmented record(2)
4 rocksdb:
[/build/ceph-pKGC1D/ceph-12.2.0/src/rocksdb/db/db_impl.cc:217] Shutdown:
canceling all background work
4 rocksdb:
[/build/ceph-pKGC1D/ceph-12.2.0/src/rocksdb/db/db_impl.cc:343] Shutdown
complete
-1 rocksdb: Corruption: missing start of fragmented record(2)
-1 bluestore(/var/lib/ceph/osd/ceph-2/) _open_db erroring opening db:
1 bluefs umount
1 bdev(0x557f5b6a4240 /var/lib/ceph/osd/ceph-2//block) close

If I understand this right, rocksdb is just trying to replay WAL-type
logs, of which "001005.log" is presumably the corrupted one. It then throws
an error that stops everything.

I did try to mount the bluestore, as I was assuming that would probably be
where I'd find rocksdb's files, but that also doesn't seem possible:

# ceph-objectstore-tool --op fsck --data-path /var/lib/ceph/osd/ceph-2/
--mountpoint /mnt/bluestore-repair/
fsck failed: (5) Input/output error
# ceph-objectstore-tool --op fuse --data-path /var/lib/ceph/osd/ceph-2
--mountpoint /mnt/bluestore-repair/
Mount failed with '(5) Input/output error'
# ceph-objectstore-tool --op fuse --force --skip-journal-replay
--data-path /var/lib/ceph/osd/ceph-2 --mountpoint /mnt/bluestore-repair/
Mount failed with '(5) Input/output error'

Adding --debug shows the ultimate culprit is just the above rocksdb
error again.

Q: Is there some way in which I can tell rocksdb to truncate, delete or
skip the offending log entries? Or can I get access to rocksdb('s files) in
some other way, to manipulate it directly or delete corrupted WAL files
manually?
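
(For reference, in case it helps anyone searching later: rocksdb itself has
a WAL recovery mode setting, and Ceph can pass rocksdb options through via
bluestore_rocksdb_options. A minimal, untested ceph.conf sketch, assuming
this release hands the option string through to rocksdb; note that
kSkipAnyCorruptedRecords explicitly trades data loss for an openable db,
and that setting this option replaces the default option string rather than
extending it:

[osd]
bluestore rocksdb options = "wal_recovery_mode=kSkipAnyCorruptedRecords"

The more conservative value would be kPointInTimeRecovery, which replays
only up to the first corruption.)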

-Michael
Christian Balzer
2017-11-01 08:53:30 UTC
Hello,
Post by Michael
Hello everyone,
I've conducted some crash tests (unplugging drives, the machine,
Your exact system configuration (HW, drives, controller, settings, etc.)
would be interesting, as I can think of plenty of scenarios on how to corrupt
things that normally shouldn't be affected by such actions.
Post by Michael
terminating and restarting ceph systemd services) with Ceph 12.2.0 on
Now that bit is quite disconcerting, though you're one release behind the
curve, and from what I read .2 has plenty more bug fixes coming.

Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Rakuten Communications
Michael
2017-11-01 21:54:53 UTC
Post by Christian Balzer
Your exact system configuration (HW, drives, controller, settings, etc.)
would be interesting, as I can think of plenty of scenarios on how to corrupt
things that normally shouldn't be affected by such actions.
Oh, the hardware in question is consumer grade and not new; some old i7
machine. But my current guess is that the specific hardware is mostly
unrelated.

I think there is probably just a WAL log or entry in there that wasn't
finished writing to disk, or was corrupted. But WALs aren't something that
should be assumed to be complete and correct after a failure, right? That's
the whole point of a WAL.

If this can't be fixed automatically with some command, I would simply
like to have a look at and trivially tinker with these DB files if
possible (which I still haven't figured out how to do).
Post by Christian Balzer
Now that bit is quite disconcerting, though you're one release behind the
curve, and from what I read .2 has plenty more bug fixes coming.
Fair point. I just tried with 12.2.1 (on pre-release Ubuntu bionic now).

Doesn't change anything - fsck doesn't fix rocksdb, the bluestore won't
mount, the OSD won't activate and the error is the same.
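
A hedged aside: the Luminous ceph-bluestore-tool man page also lists a
repair subcommand. It may well just re-run the same rocksdb open and fail
identically, but it is cheap to try:

# ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-2/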

Is there any fix in .2 that might address this, or do you just mean that
in general there will be bug fixes?


Thanks for your response!

- Michael
Michael
2017-11-13 11:05:23 UTC
I think Christian is talking about version 12.2.2, not 12.2.*
Which isn't released yet, yes. I could try building the development
repository if you think that has a chance of resolving the issue?

Although I'd still like to know how I could theoretically get my hands on
these rocksdb files manually, if anyone knows how to do that? I still have
no idea how.

I also reported this as a bug last week, in case anyone has information
or the same issue:
http://tracker.ceph.com/issues/22044
Konstantin Shalygin
2017-11-14 02:47:25 UTC
Post by Michael
Which isn't released yet, yes. I could try building the development
repository if you think that has a chance of resolving the issue?
For tests - yes...
This ML post <https://marc.info/?l=ceph-devel&m=151024277700487&w=2> says
that 12.2.2 should be based on commit
1071fdcf73faa387d0df18489ab7b0359a0c0afb.
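
If you want to test exactly that commit before packages appear, a rough
sketch of the usual from-source build (standard Ceph build steps; adjust
the -j value to your machine):

# git clone https://github.com/ceph/ceph.git && cd ceph
# git checkout 1071fdcf73faa387d0df18489ab7b0359a0c0afb
# git submodule update --init --recursive
# ./install-deps.sh
# ./do_cmake.sh && cd build && make -j4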
Gregory Farnum
2017-11-19 10:01:40 UTC
Post by Michael
Post by Christian Balzer
Your exact system configuration (HW, drives, controller, settings, etc.)
would be interesting, as I can think of plenty of scenarios on how to corrupt
things that normally shouldn't be affected by such actions.
Oh, the hardware in question is consumer grade and not new; some old i7
machine. But my current guess is that the specific hardware is mostly
unrelated.
I think there is probably just a WAL log or entry in there that wasn't
finished writing to disk, or was corrupted. But WALs aren't something that
should be assumed to be complete and correct after a failure, right? That's
the whole point of a WAL.
If this can't be fixed automatically with some command, I would simply
like to have a look at and trivially tinker with these DB files if
possible (which I still haven't figured out how to do).
Your hardware and configuration are very relevant. As you note, the WAL
should be able to handle being incompletely written, and both Ceph and
RocksDB are designed to handle failures mid-write. That RocksDB *isn't*
doing that here implies either
1) there's a fatal bug in rocksdb, or
2) your hardware configuration is not honoring the storage consistency APIs
correctly.

Given past experience, (2) is far more likely than (1). The RocksDB
community may be able to give you more immediate guidance on what exactly
went wrong, but I'd look at whether you have a writeback cache somewhere
that isn't reflecting ordering requirements, or whether your disk passes a
crash-consistency tester. (No, I don't know one off-hand. But many disks lie
horribly even about stuff like flushes.)
-Greg
Gregory Farnum
2017-11-20 15:16:52 UTC
Post by Gregory Farnum
Your hardware and configuration are very relevant.
[...]
I'd look at whether you have a writeback cache somewhere that isn't
reflecting ordering requirements, or whether your disk passes a
crash-consistency tester. (No, I don't know one off-hand. But many disks lie
horribly even about stuff like flushes.)
It may certainly have had something to do with how I managed to end up with
the broken rocksdb WAL log. Maybe this is not the best possible behavior
when one simulates a crash or drive disconnect. Perhaps, if I can get this
OSD back in action and the issue recurs predictably in another test, I'll
eventually start to see a pattern in where and how it happens, and maybe
even find out what hardware or configuration changes might be needed to
prevent the WAL from corrupting. Perhaps.
--
However, my actual and immediate Ceph + ceph-users relevant problem with
this is basically only that I cannot seem to figure out how one could deal
with such an already broken rocksdb WAL log.
1. Ceph's tooling and rocksdb don't *appear* to be able to deal with this
WAL file once it has been corrupted, certainly not with the commands that I
tried.
I initially had hoped for some tool to be able to do something - drop the
log, revert to an earlier backup of a consistent db - any option like that
that I might have missed. Judging by this ML so far, I'm going to guess
The error is pretty clear: "Corruption: missing start of fragmented
record(2)". What that says to me is that rocksdb has a journal entry saying
that the record *does* exist, but it's missing the opening block or
something. I.e., during an atomic write it (1) wrote down a lookaside block,
(2) flushed that block to disk, (3) journaled that it had written the block.
But now on restart, it's finding out that (2) apparently didn't happen.
2. I do not know how I can get manual, file-wise access to the rocksdb WAL
logs. This may be immensely simple, but I simply don't know how.
So far I don't have any indication that either 1. or 2. is failing due to
hardware or configuration specifics (...beyond having this broken WAL log in
the first place).
Rocksdb may offer repair tools for this (I have no idea), but the
fundamental issue is that, as far as the program can tell, the underlying
hardware lied, the disk state is corrupted, and it has no idea what data it
can trust at this point. Same with Ceph; the OSD has no desire to believe
anything a corrupted disk tells it, since that can break all of our
invariants.
BlueStore is a custom block-device-managing system; we have a way to mount
and poke at it via FUSE, but that assumes the data on disk makes sense. In
this case, it doesn't (RocksDB stores the disk layout metadata). Somebody
more familiar with bluestore development may know if there's a way to mount
only the "BlueFS" portion that RocksDB writes its own data to; if there is,
it's just a bunch of .ldb files or whatever, but those are again a custom
data format that you'll need rocksdb expertise to do anything with...
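
One hedged pointer on that: the Luminous ceph-bluestore-tool man page lists
a bluefs-export command that copies the BlueFS namespace (rocksdb's CURRENT,
MANIFEST, .sst and .log files) out to an ordinary directory, e.g. with
/mnt/bluefs-dump/ as an arbitrary target:

# ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-2/
--out-dir /mnt/bluefs-dump/

That is a one-way export for inspection only; there's no documented way to
write a modified db back into BlueFS, so it covers the "look at the files"
half of the question but not the "delete the WAL file in place" half.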

Toss this disk and let Ceph do its recovery thing. Look hard at what
your hardware configuration is doing to make sure it doesn't happen
again. *shrug*
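
For completeness, the toss-it path is roughly the following sketch, assuming
the cluster has enough healthy replicas to backfill from, and with /dev/sdX
again a placeholder for the OSD's device:

# ceph osd out 2
# systemctl stop ceph-osd@2
# ceph osd purge 2 --yes-i-really-mean-it
# ceph-volume lvm zap /dev/sdX
# ceph-volume lvm create --data /dev/sdX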
-Greg
Post by Gregory Farnum
As you note, the WAL should be able to handle being incompletely written
Yes, I'd also have thought so? But it apparently just isn't able to deal
with this log-file corruption. Maybe it is not an extremely specific bug;
maybe a lot of possible WAL corruptions throw a comparable error and
prevent replay.
Post by Gregory Farnum
and both Ceph and RocksDB are designed to handle failures mid-write.
As far as I can tell, not in this WAL log case, no. It would certainly be
really interesting to see at this point whether just moving or deleting that
WAL log lets everything continue and the OSD come online, and whether a
scrub then fixes the rest of the issue. Maybe everything is essentially fine
apart from JUST the WAL log replay, and maybe one or another bit of a page
on the OSD.
Post by Gregory Farnum
That RocksDB *isn't* doing that here implies either 1) there's a fatal bug
in rocksdb
Not so sure. Ultimately rocksdb does seem to throw a fairly indicative
error: "db/001005.log: dropping 3225 bytes; Corruption: missing start of
fragmented record(2)".
Maybe they intend for users to run a repair tool, per
https://github.com/facebook/rocksdb/wiki/RocksDB-Repairer . Or maybe it's a
case for manual interaction with the file.
But my point 2. (namely that I don't even understand how to get file-wise
access to rocksdb's files) has so far prevented me from trying either.
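
Should file access ever pan out (for instance via the bluefs-export route
mentioned above), the Repairer is exposed through rocksdb's ldb utility; a
hedged sketch against an exported copy, with /mnt/bluefs-dump/db as the
assumed export location:

# ldb repair --db=/mnt/bluefs-dump/db

Even if that salvages the db, getting the repaired files back under BlueFS
is the part with no obvious tooling.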
Thanks for your input!
-Michael