Discussion:
[ceph-users] rbd-nbd timeout and crash
Jan Pekař - Imatic
2017-12-06 08:46:22 UTC
Permalink
Hi,
I ran into an overloaded cluster (deep-scrub running) for a few seconds and
the rbd-nbd client timed out, and the device became unavailable.

block nbd0: Connection timed out
block nbd0: shutting down sockets
block nbd0: Connection timed out
print_req_error: I/O error, dev nbd0, sector 2131833856
print_req_error: I/O error, dev nbd0, sector 2131834112

Is there any way to extend the rbd-nbd timeout?

Also, getting mapped devices failed -

rbd-nbd list-mapped

/build/ceph-12.2.2/src/tools/rbd_nbd/rbd-nbd.cc: In function 'int
get_mapped_info(int, Config*)' thread 7f069d41ec40 time 2017-12-06
09:40:33.541426
/build/ceph-12.2.2/src/tools/rbd_nbd/rbd-nbd.cc: 841: FAILED
assert(ifs.is_open())
ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba)
luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x7f0693f567c2]
2: (()+0x14165) [0x559a8783d165]
3: (main()+0x9) [0x559a87838e59]
4: (__libc_start_main()+0xf1) [0x7f0691178561]
5: (()+0xff80) [0x559a87838f80]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
Aborted


Thank you
With regards
Jan Pekar
Jason Dillaman
2017-12-06 14:24:49 UTC
Permalink
Hi,
I ran into an overloaded cluster (deep-scrub running) for a few seconds and the
rbd-nbd client timed out, and the device became unavailable.
block nbd0: Connection timed out
block nbd0: shutting down sockets
block nbd0: Connection timed out
print_req_error: I/O error, dev nbd0, sector 2131833856
print_req_error: I/O error, dev nbd0, sector 2131834112
Is there any way to extend the rbd-nbd timeout?
Changing the default timeout of 30 seconds is supported by the
kernel [1], but it's not currently implemented in rbd-nbd. I
opened a new feature ticket for adding this option [2], but it may be
more constructive to figure out how to address a >30 second IO stall
on your cluster during deep-scrub.
Also, getting mapped devices failed -
rbd-nbd list-mapped
/build/ceph-12.2.2/src/tools/rbd_nbd/rbd-nbd.cc: In function 'int
get_mapped_info(int, Config*)' thread 7f069d41ec40 time 2017-12-06
09:40:33.541426
/build/ceph-12.2.2/src/tools/rbd_nbd/rbd-nbd.cc: 841: FAILED
assert(ifs.is_open())
ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous
(stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x7f0693f567c2]
2: (()+0x14165) [0x559a8783d165]
3: (main()+0x9) [0x559a87838e59]
4: (__libc_start_main()+0xf1) [0x7f0691178561]
5: (()+0xff80) [0x559a87838f80]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
Aborted
It's been fixed in the master branch and is awaiting backport to
Luminous [3] -- I'd expect it to be available in v12.2.3.
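(For context only: the assert fires because list-mapped treats a per-device
file it cannot open as fatal. Below is a minimal sketch of the defensive
pattern -- skip devices whose status file cannot be opened instead of
asserting -- the /sys path and device count are assumptions for
illustration, not the actual rbd-nbd code.)

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main() {
    // Illustrative only: walk a handful of nbd devices and read a
    // per-device status file, skipping any device whose file cannot
    // be opened instead of asserting on it. The path below is an
    // assumption for this sketch, not the path rbd-nbd actually uses.
    for (int i = 0; i < 16; ++i) {
        std::ostringstream path;
        path << "/sys/block/nbd" << i << "/pid";
        std::ifstream ifs(path.str());
        if (!ifs.is_open()) {
            continue;  // device not mapped (or file missing): skip it
        }
        std::string pid;
        std::getline(ifs, pid);
        std::cout << "nbd" << i << " is served by pid " << pid << std::endl;
    }
    return 0;
}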
Thank you
With regards
Jan Pekar
[1] https://github.com/torvalds/linux/blob/master/drivers/block/nbd.c#L1166
[2] http://tracker.ceph.com/issues/22333
[3] http://tracker.ceph.com/issues/22185
--
Jason
Jan Pekař - Imatic
2017-12-06 22:30:16 UTC
Permalink
Hi,
Post by Jason Dillaman
Hi,
I ran into an overloaded cluster (deep-scrub running) for a few seconds and the
rbd-nbd client timed out, and the device became unavailable.
block nbd0: Connection timed out
block nbd0: shutting down sockets
block nbd0: Connection timed out
print_req_error: I/O error, dev nbd0, sector 2131833856
print_req_error: I/O error, dev nbd0, sector 2131834112
Is there any way to extend the rbd-nbd timeout?
Changing the default timeout of 30 seconds is supported by
the kernel [1], but it's not currently implemented in rbd-nbd. I
opened a new feature ticket for adding this option [2] but it may be
more constructive to figure out how to address a >30 second IO stall
on your cluster during deep-scrub.
The kernel client does not support new image features, so I decided to use
rbd-nbd.
Now I tried to rm a 300GB folder, which is mounted with rbd-nbd from a COW
snapshot on my healthy and almost idle cluster with only 1 deep-scrub
running, and I also hit the 30s timeout and device disconnect. I'm mapping it
from a virtual server, so there may be some performance issue, but I'm not
after performance, I'm after stability.

Thank you
With regards
Jan Pekar
Post by Jason Dillaman
Also, getting mapped devices failed -
rbd-nbd list-mapped
/build/ceph-12.2.2/src/tools/rbd_nbd/rbd-nbd.cc: In function 'int
get_mapped_info(int, Config*)' thread 7f069d41ec40 time 2017-12-06
09:40:33.541426
/build/ceph-12.2.2/src/tools/rbd_nbd/rbd-nbd.cc: 841: FAILED
assert(ifs.is_open())
ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous
(stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x7f0693f567c2]
2: (()+0x14165) [0x559a8783d165]
3: (main()+0x9) [0x559a87838e59]
4: (__libc_start_main()+0xf1) [0x7f0691178561]
5: (()+0xff80) [0x559a87838f80]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
Aborted
It's been fixed in the master branch and is awaiting backport to
Luminous [3] -- I'd expect it to be available in v12.2.3.
Thank you
With regards
Jan Pekar
[1] https://github.com/torvalds/linux/blob/master/drivers/block/nbd.c#L1166
[2] http://tracker.ceph.com/issues/22333
[3] http://tracker.ceph.com/issues/22185
--
============
Ing. Jan Pekař
***@imatic.cz | +420603811737
----
Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz
============
--
David Turner
2017-12-06 22:58:54 UTC
Permalink
Do you have the FS mounted with trim (discard) enabled? What are your mount
options?
Post by Jan Pekař - Imatic
The kernel client does not support new image features, so I decided to use
rbd-nbd.
Now I tried to rm a 300GB folder, which is mounted with rbd-nbd from a COW
snapshot on my healthy and almost idle cluster with only 1 deep-scrub
running, and I also hit the 30s timeout and device disconnect. I'm mapping it
from a virtual server, so there may be some performance issue, but I'm not
after performance, I'm after stability.
Thank you
With regards
Jan Pekar
Jan Pekař - Imatic
2018-01-04 09:05:19 UTC
Permalink
Sorry for the late answer.
No, I'm not mounting with trim enabled, only noatime.
The problem is that the cluster was highly loaded, so there were timeouts.
I "solved" it by compiling
https://github.com/jerome-pouiller/ioctl
and setting the NBD_SET_TIMEOUT ioctl timeout after creating the device.
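For anyone else hitting this before rbd-nbd grows a timeout option, a rough
C++ equivalent of that workaround is sketched below. The device path and
timeout value are just example values; adjust them to your setup.

#include <fcntl.h>
#include <linux/nbd.h>   // NBD_SET_TIMEOUT
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const char *dev = "/dev/nbd0";        // example device node
    unsigned long timeout_secs = 120;     // example timeout in seconds

    int fd = open(dev, O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    // NBD_SET_TIMEOUT takes the timeout in seconds as its argument;
    // it has to be issued after the device has been set up (mapped).
    if (ioctl(fd, NBD_SET_TIMEOUT, timeout_secs) < 0) {
        perror("ioctl(NBD_SET_TIMEOUT)");
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}

This is effectively the jerome-pouiller/ioctl invocation spelled out in code;
the tool issues the same ioctl from the command line.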

With regards
Jan Pekar
Post by David Turner
Do you have the FS mounted with trim (discard) enabled? What are your mount
options?
--
============
Ing. Jan Pekař
***@imatic.cz | +420603811737
----
Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz
============
--