Discussion: OSD Segfaults after Bluestore conversion
Kyle Hutson
2018-02-06 21:53:42 UTC
We had a 26-node production Ceph cluster which we upgraded to Luminous a
little over a month ago. I added a 27th node with BlueStore and didn't
have any issues, so I began converting the others, one at a time. The
first two went off pretty smoothly, but the third is doing something
strange.
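
For reference, assuming the standard destroy-and-recreate procedure from
the BlueStore migration docs, each per-OSD conversion is roughly the
sketch below (the OSD id and device name are placeholders):

  ID=426           # placeholder - one of the OSD ids on the host
  DEVICE=/dev/sdX  # placeholder data device
  ceph osd out $ID
  while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done
  systemctl stop ceph-osd@$ID
  ceph osd destroy $ID --yes-i-really-mean-it
  ceph-volume lvm zap $DEVICE
  ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID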

Initially, all the OSDs came up fine, but then some started to segfault.
Out of curiosity more than anything else, I rebooted the server to see
if it would get better or worse, and it pretty much stayed the same - 12
of the 18 OSDs did not properly come up. Of those, 3 segfaulted again.
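
A quick way to see which OSDs on a host are down and pull their most
recent service output (the osd id below is just an example):

  ceph osd tree | grep down
  systemctl status ceph-osd@426
  journalctl -u ceph-osd@426 -n 100 --no-pager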

I picked one that didn't properly come up and copied the log to where
anybody can view it:
http://people.beocat.ksu.edu/~kylehutson/ceph-osd.426.log

You can contrast that with one that is up:
http://people.beocat.ksu.edu/~kylehutson/ceph-osd.428.log

(which is still showing segfaults in the logs, but seems to be recovering
from them OK?)

Any ideas?
Mike O'Connor
2018-02-08 09:02:51 UTC
Post by Kyle Hutson
We had a 26-node production ceph cluster which we upgraded to Luminous
a little over a month ago. I added a 27th-node with Bluestore and
didn't have any issues, so I began converting the others, one at a
time. The first two went off pretty smoothly, but the 3rd is doing
something strange.
Initially, all the OSDs came up fine, but then some started to
segfault. Out of curiosity more than anything else, I did reboot the
server to see if it would get better or worse, and it pretty much
stayed the same - 12 of the 18 OSDs did not properly come up. Of
those, 3 again segfaulted
I picked one that didn't properly come up and copied the log to where
anybody can view it:
http://people.beocat.ksu.edu/~kylehutson/ceph-osd.426.log
You can contrast that with one that is up:
http://people.beocat.ksu.edu/~kylehutson/ceph-osd.428.log
(which is still showing segfaults in the logs, but seems to be
recovering from them OK?)
Any ideas?
Ideas? Yes.

There is a bug which is hitting a small number of systems, and at this
time there is no solution. Issue details are at
http://tracker.ceph.com/issues/22102.

Please submit more details of your problem on the ticket.

Mike
Kyle Hutson
2018-02-28 20:46:11 UTC
I'm following up from a while ago. I don't think this is the same bug.
The bug referenced shows "abort: Corruption: block checksum mismatch",
and I'm not seeing that on mine.
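
For anyone comparing, a simple way to check a log for that signature
(the log path below is just an example):

  grep -c 'block checksum mismatch' /var/log/ceph/ceph-osd.414.log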

Now I've had 8 OSDs down on this one server for a couple of weeks, and I
just tried to start one of them back up. Here's a link to the log of
that OSD (which segfaulted right after starting up):
http://people.beocat.ksu.edu/~kylehutson/ceph-osd.414.log

To me, it looks like the logs are providing surprisingly few hints as to
where the problem lies. Is there a way I can turn up logging to see if I
can get any more info as to why this is happening?
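
To sketch what I mean by turning up logging - assuming the usual
ceph.conf debug settings on that host (the subsystems and levels below
are just examples), set before restarting the OSD since it dies right
after start:

  [osd]
      debug osd = 20
      debug bluestore = 20
      debug bluefs = 20
      debug bdev = 20
      debug rocksdb = 5

Then restart the OSD (systemctl restart ceph-osd@414) and watch
/var/log/ceph/ceph-osd.414.log.
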
Post by Mike O'Connor
Post by Kyle Hutson
We had a 26-node production ceph cluster which we upgraded to Luminous
a little over a month ago. I added a 27th-node with Bluestore and
didn't have any issues, so I began converting the others, one at a
time. The first two went off pretty smoothly, but the 3rd is doing
something strange.
Initially, all the OSDs came up fine, but then some started to
segfault. Out of curiosity more than anything else, I did reboot the
server to see if it would get better or worse, and it pretty much
stayed the same - 12 of the 18 OSDs did not properly come up. Of
those, 3 again segfaulted
I picked one that didn't properly come up and copied the log to where
anybody can view it:
http://people.beocat.ksu.edu/~kylehutson/ceph-osd.426.log
You can contrast that with one that is up:
http://people.beocat.ksu.edu/~kylehutson/ceph-osd.428.log
(which is still showing segfaults in the logs, but seems to be
recovering from them OK?)
Any ideas?
Ideas? Yes.
There is a bug which is hitting a small number of systems, and at this
time there is no solution. Issue details are at
http://tracker.ceph.com/issues/22102.
Please submit more details of your problem on the ticket.
Mike