Discussion:
OSD Segfaults after Bluestore conversion
Kyle Hutson
2018-02-06 21:53:42 UTC
We had a 26-node production Ceph cluster which we upgraded to Luminous a
little over a month ago. I added a 27th node with Bluestore and didn't have
any issues, so I began converting the others, one at a time. The first two
went off pretty smoothly, but the 3rd is doing something strange.

Initially, all the OSDs came up fine, but then some started to segfault.
Out of curiosity more than anything else, I did reboot the server to see if
it would get better or worse, and it pretty much stayed the same - 12 of
the 18 OSDs did not properly come up. Of those, 3 again segfaulted.

I picked one that didn't properly come up and copied the log to where
anybody can view it:
http://people.beocat.ksu.edu/~kylehutson/ceph-osd.426.log

You can contrast that with one that is up:
http://people.beocat.ksu.edu/~kylehutson/ceph-osd.428.log

(which is still showing segfaults in the logs, but seems to be recovering
from them OK?)

Any ideas?
Mike O'Connor
2018-02-08 09:02:51 UTC
Post by Kyle Hutson
We had a 26-node production Ceph cluster which we upgraded to Luminous
a little over a month ago. I added a 27th node with Bluestore and
didn't have any issues, so I began converting the others, one at a
time. The first two went off pretty smoothly, but the 3rd is doing
something strange.
Initially, all the OSDs came up fine, but then some started to
segfault. Out of curiosity more than anything else, I did reboot the
server to see if it would get better or worse, and it pretty much
stayed the same - 12 of the 18 OSDs did not properly come up. Of
those, 3 again segfaulted.
I picked one that didn't properly come up and copied the log to where
anybody can view it: http://people.beocat.ksu.edu/~kylehutson/ceph-osd.426.log
You can contrast that with one that is up:
http://people.beocat.ksu.edu/~kylehutson/ceph-osd.428.log
(which is still showing segfaults in the logs, but seems to be
recovering from them OK?)
Any ideas?
Ideas? Yes.

There is a bug which is hitting a small number of systems, and at this
time there is no solution. Issue details are at
http://tracker.ceph.com/issues/22102.

Please submit more details of your problem on the ticket.

Mike
Kyle Hutson
2018-02-28 20:46:11 UTC
I'm following up from a while ago. I don't think this is the same bug. The
bug referenced shows "abort: Corruption: block checksum mismatch", and I'm
not seeing that on mine.

Now I've had 8 OSDs down on this one server for a couple of weeks, and I
just tried to start it back up. Here's a link to the log of that OSD (which
segfaulted right after starting up):
http://people.beocat.ksu.edu/~kylehutson/ceph-osd.414.log

To me, it looks like the logs are providing surprisingly few hints as to
where the problem lies. Is there a way I can turn up logging to see if I
can get any more info as to why this is happening?
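
(For what it's worth, what I'm planning to try is just cranking the usual
debug settings in ceph.conf on that host before starting the OSD again -
assuming the standard debug_* options are the right knobs for a Bluestore
OSD on Luminous - something like:

    [osd]
        debug osd = 20
        debug bluestore = 20
        debug bluefs = 20
        debug rocksdb = 5

and, for the OSDs that do stay up, bumping it at runtime over the admin
socket, e.g. "ceph daemon osd.428 config set debug_bluestore 20". Level 20
is very verbose, so I'd turn it back down once a crash is captured.)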
Post by Mike O'Connor
Post by Kyle Hutson
We had a 26-node production Ceph cluster which we upgraded to Luminous
a little over a month ago. I added a 27th node with Bluestore and
didn't have any issues, so I began converting the others, one at a
time. The first two went off pretty smoothly, but the 3rd is doing
something strange.
Initially, all the OSDs came up fine, but then some started to
segfault. Out of curiosity more than anything else, I did reboot the
server to see if it would get better or worse, and it pretty much
stayed the same - 12 of the 18 OSDs did not properly come up. Of
those, 3 again segfaulted.
I picked one that didn't properly come up and copied the log to where
anybody can view it: http://people.beocat.ksu.edu/~kylehutson/ceph-osd.426.log
You can contrast that with one that is up:
http://people.beocat.ksu.edu/~kylehutson/ceph-osd.428.log
(which is still showing segfaults in the logs, but seems to be
recovering from them OK?)
Any ideas?
Ideas? Yes.
There is a bug which is hitting a small number of systems, and at this
time there is no solution. Issue details are at
http://tracker.ceph.com/issues/22102.
Please submit more details of your problem on the ticket.
Mike
Tyler Bishop
2018-08-28 02:51:37 UTC
Did you solve this? Similar issue.
_____________________________________________
Post by Kyle Hutson
I'm following up from a while ago. I don't think this is the same bug. The
bug referenced shows "abort: Corruption: block checksum mismatch", and I'm
not seeing that on mine.
Now I've had 8 OSDs down on this one server for a couple of weeks, and I
just tried to start it back up. Here's a link to the log of that OSD (which
segfaulted right after starting up):
http://people.beocat.ksu.edu/~kylehutson/ceph-osd.414.log
To me, it looks like the logs are providing surprisingly few hints as to
where the problem lies. Is there a way I can turn up logging to see if I
can get any more info as to why this is happening?
Post by Mike O'Connor
Post by Kyle Hutson
We had a 26-node production Ceph cluster which we upgraded to Luminous
a little over a month ago. I added a 27th node with Bluestore and
didn't have any issues, so I began converting the others, one at a
time. The first two went off pretty smoothly, but the 3rd is doing
something strange.
Initially, all the OSDs came up fine, but then some started to
segfault. Out of curiosity more than anything else, I did reboot the
server to see if it would get better or worse, and it pretty much
stayed the same - 12 of the 18 OSDs did not properly come up. Of
those, 3 again segfaulted.
I picked one that didn't properly come up and copied the log to where
anybody can view it: http://people.beocat.ksu.edu/~kylehutson/ceph-osd.426.log
You can contrast that with one that is up:
http://people.beocat.ksu.edu/~kylehutson/ceph-osd.428.log
(which is still showing segfaults in the logs, but seems to be
recovering from them OK?)
Any ideas?
Ideas? Yes.
There is a bug which is hitting a small number of systems, and at this
time there is no solution. Issue details are at
http://tracker.ceph.com/issues/22102.
Please submit more details of your problem on the ticket.
Mike
Adam Tygart
2018-08-28 03:05:43 UTC
This issue was related to using jemalloc. Jemalloc is not as well
tested with Bluestore and led to lots of segfaults. We moved back to
the default of tcmalloc with Bluestore and the segfaults stopped.

Check /etc/sysconfig/ceph under RHEL-based distros.
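
If jemalloc was enabled the usual way, it will be an LD_PRELOAD line in
that file; commenting it out and restarting the OSDs puts them back on
tcmalloc. Roughly like this (the library path is just an example and may
differ on your install):

    # /etc/sysconfig/ceph
    TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728
    # comment this out (or remove it) to go back to tcmalloc
    #LD_PRELOAD=/usr/lib64/libjemalloc.so.1

then restart the OSDs on that host, e.g. "systemctl restart ceph-osd.target".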

--
Adam
On Mon, Aug 27, 2018 at 9:51 PM Tyler Bishop
Post by Tyler Bishop
Did you solve this? Similar issue.
_____________________________________________
I'm following up from a while ago. I don't think this is the same bug. The bug referenced shows "abort: Corruption: block checksum mismatch", and I'm not seeing that on mine.
Now I've had 8 OSDs down on this one server for a couple of weeks, and I just tried to start it back up. Here's a link to the log of that OSD (which segfaulted right after starting up): http://people.beocat.ksu.edu/~kylehutson/ceph-osd.414.log
To me, it looks like the logs are providing surprisingly few hints as to where the problem lies. Is there a way I can turn up logging to see if I can get any more info as to why this is happening?
Post by Mike O'Connor
Post by Kyle Hutson
We had a 26-node production Ceph cluster which we upgraded to Luminous
a little over a month ago. I added a 27th node with Bluestore and
didn't have any issues, so I began converting the others, one at a
time. The first two went off pretty smoothly, but the 3rd is doing
something strange.
Initially, all the OSDs came up fine, but then some started to
segfault. Out of curiosity more than anything else, I did reboot the
server to see if it would get better or worse, and it pretty much
stayed the same - 12 of the 18 OSDs did not properly come up. Of
those, 3 again segfaulted.
I picked one that didn't properly come up and copied the log to where
anybody can view it: http://people.beocat.ksu.edu/~kylehutson/ceph-osd.426.log
You can contrast that with one that is up:
http://people.beocat.ksu.edu/~kylehutson/ceph-osd.428.log
(which is still showing segfaults in the logs, but seems to be
recovering from them OK?)
Any ideas?
Ideas? Yes.
There is a bug which is hitting a small number of systems, and at this
time there is no solution. Issue details are at
http://tracker.ceph.com/issues/22102.
Please submit more details of your problem on the ticket.
Mike
Tyler Bishop
2018-08-28 03:11:02 UTC
Okay, so far since switching back it looks more stable. I have around 2 GB/s
and 100k IOPS flowing with fio at the moment to test.
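
Nothing fancy on the fio side - roughly a job along these lines against a
test RBD device (the device path and numbers are just placeholders for my
setup, not the exact job file):

    fio --name=rand4k --ioengine=libaio --direct=1 --rw=randwrite \
        --bs=4k --iodepth=32 --numjobs=8 --time_based --runtime=300 \
        --group_reporting --filename=/dev/rbd0

with a second pass at larger block sizes (e.g. --bs=1M --rw=write) for the
throughput number.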
_____________________________________________
Post by Adam Tygart
This issue was related to using jemalloc. Jemalloc is not as well
tested with Bluestore and led to lots of segfaults. We moved back to
the default of tcmalloc with Bluestore and the segfaults stopped.
Check /etc/sysconfig/ceph under RHEL based distros.
--
Adam
On Mon, Aug 27, 2018 at 9:51 PM Tyler Bishop
Post by Tyler Bishop
Did you solve this? Similar issue.
_____________________________________________
Post by Kyle Hutson
I'm following up from a while ago. I don't think this is the same bug.
The bug referenced shows "abort: Corruption: block checksum mismatch", and
I'm not seeing that on mine.
Now I've had 8 OSDs down on this one server for a couple of weeks, and
I just tried to start it back up. Here's a link to the log of that OSD
(which segfaulted right after starting up):
http://people.beocat.ksu.edu/~kylehutson/ceph-osd.414.log
To me, it looks like the logs are providing surprisingly few hints as
to where the problem lies. Is there a way I can turn up logging to see if I
can get any more info as to why this is happening?
Post by Mike O'Connor
Post by Kyle Hutson
We had a 26-node production Ceph cluster which we upgraded to Luminous
a little over a month ago. I added a 27th node with Bluestore and
didn't have any issues, so I began converting the others, one at a
time. The first two went off pretty smoothly, but the 3rd is doing
something strange.
Initially, all the OSDs came up fine, but then some started to
segfault. Out of curiosity more than anything else, I did reboot the
server to see if it would get better or worse, and it pretty much
stayed the same - 12 of the 18 OSDs did not properly come up. Of
those, 3 again segfaulted.
I picked one that didn't properly come up and copied the log to where
anybody can view it: http://people.beocat.ksu.edu/~kylehutson/ceph-osd.426.log
You can contrast that with one that is up:
http://people.beocat.ksu.edu/~kylehutson/ceph-osd.428.log
(which is still showing segfaults in the logs, but seems to be
recovering from them OK?)
Any ideas?
Ideas? Yes.
There is a bug which is hitting a small number of systems, and at this
time there is no solution. Issue details are at
http://tracker.ceph.com/issues/22102.
Please submit more details of your problem on the ticket.
Mike