Discussion:
[ceph-users] rbd-nbd timeout and crash
Jan Pekař - Imatic
2017-12-06 08:46:22 UTC
Permalink
Hi,
I ran into an overloaded cluster (deep-scrub running) for a few seconds and
the rbd-nbd client timed out, and the device became unavailable.

block nbd0: Connection timed out
block nbd0: shutting down sockets
block nbd0: Connection timed out
print_req_error: I/O error, dev nbd0, sector 2131833856
print_req_error: I/O error, dev nbd0, sector 2131834112

Is there any way to extend the rbd-nbd timeout?

Also, getting mapped devices failed -

rbd-nbd list-mapped

/build/ceph-12.2.2/src/tools/rbd_nbd/rbd-nbd.cc: In function 'int
get_mapped_info(int, Config*)' thread 7f069d41ec40 time 2017-12-06
09:40:33.541426
/build/ceph-12.2.2/src/tools/rbd_nbd/rbd-nbd.cc: 841: FAILED
assert(ifs.is_open())
ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba)
luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x7f0693f567c2]
2: (()+0x14165) [0x559a8783d165]
3: (main()+0x9) [0x559a87838e59]
4: (__libc_start_main()+0xf1) [0x7f0691178561]
5: (()+0xff80) [0x559a87838f80]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
Aborted


Thank you
With regards
Jan Pekar
Jason Dillaman
2017-12-06 14:24:49 UTC
Permalink
Hi,
I ran into an overloaded cluster (deep-scrub running) for a few seconds and the
rbd-nbd client timed out, and the device became unavailable.
block nbd0: Connection timed out
block nbd0: shutting down sockets
block nbd0: Connection timed out
print_req_error: I/O error, dev nbd0, sector 2131833856
print_req_error: I/O error, dev nbd0, sector 2131834112
Is there any way to extend the rbd-nbd timeout?
Changing the default timeout of 30 seconds is supported by the
kernel [1], but it's not currently implemented in rbd-nbd. I
opened a new feature ticket for adding this option [2], but it may be
more constructive to figure out how to address a >30 second IO stall
on your cluster during deep-scrub.
Also, getting mapped devices failed -
rbd-nbd list-mapped
/build/ceph-12.2.2/src/tools/rbd_nbd/rbd-nbd.cc: In function 'int
get_mapped_info(int, Config*)' thread 7f069d41ec40 time 2017-12-06
09:40:33.541426
/build/ceph-12.2.2/src/tools/rbd_nbd/rbd-nbd.cc: 841: FAILED
assert(ifs.is_open())
ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous
(stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x7f0693f567c2]
2: (()+0x14165) [0x559a8783d165]
3: (main()+0x9) [0x559a87838e59]
4: (__libc_start_main()+0xf1) [0x7f0691178561]
5: (()+0xff80) [0x559a87838f80]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
Aborted
It's been fixed in the master branch and is awaiting backport to
Luminous [3] -- I'd expect it to be available in v12.2.3.
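(For context only: the assert fires because list-mapped treats a per-device
file it cannot open as fatal. Below is a minimal sketch of the defensive
pattern -- skip devices whose status file cannot be opened instead of
asserting -- the /sys path and device count are assumptions for
illustration, not the actual rbd-nbd code.)

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main() {
    // Illustrative only: walk a handful of nbd devices and read a
    // per-device status file, skipping any device whose file cannot
    // be opened instead of asserting on it. The path below is an
    // assumption for this sketch, not the path rbd-nbd actually uses.
    for (int i = 0; i < 16; ++i) {
        std::ostringstream path;
        path << "/sys/block/nbd" << i << "/pid";
        std::ifstream ifs(path.str());
        if (!ifs.is_open()) {
            continue;  // device not mapped (or file missing): skip it
        }
        std::string pid;
        std::getline(ifs, pid);
        std::cout << "nbd" << i << " is served by pid " << pid << std::endl;
    }
    return 0;
}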
Thank you
With regards
Jan Pekar
[1] https://github.com/torvalds/linux/blob/master/drivers/block/nbd.c#L1166
[2] http://tracker.ceph.com/issues/22333
[3] http://tracker.ceph.com/issues/22185
--
Jason
Jan Pekař - Imatic
2017-12-06 22:30:16 UTC
Permalink
Hi,
Post by Jason Dillaman
Hi,
I ran into an overloaded cluster (deep-scrub running) for a few seconds and the
rbd-nbd client timed out, and the device became unavailable.
block nbd0: Connection timed out
block nbd0: shutting down sockets
block nbd0: Connection timed out
print_req_error: I/O error, dev nbd0, sector 2131833856
print_req_error: I/O error, dev nbd0, sector 2131834112
Is there any way to extend the rbd-nbd timeout?
Changing the default timeout of 30 seconds is supported by
the kernel [1], but it's not currently implemented in rbd-nbd. I
opened a new feature ticket for adding this option [2] but it may be
more constructive to figure out how to address a >30 second IO stall
on your cluster during deep-scrub.
The kernel client does not support new image features, so I decided to use
rbd-nbd.
Now I tried to rm a 300GB folder, which is mounted with rbd-nbd from a COW
snapshot on my healthy and almost idle cluster with only 1 deep-scrub
running, and I also hit the 30s timeout and device disconnect. I'm mapping it
from a virtual server, so there may be some performance issue, but I'm not
after performance, I'm after stability.

Thank you
With regards
Jan Pekar
Post by Jason Dillaman
Also, getting mapped devices failed -
rbd-nbd list-mapped
/build/ceph-12.2.2/src/tools/rbd_nbd/rbd-nbd.cc: In function 'int
get_mapped_info(int, Config*)' thread 7f069d41ec40 time 2017-12-06
09:40:33.541426
/build/ceph-12.2.2/src/tools/rbd_nbd/rbd-nbd.cc: 841: FAILED
assert(ifs.is_open())
ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous
(stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x7f0693f567c2]
2: (()+0x14165) [0x559a8783d165]
3: (main()+0x9) [0x559a87838e59]
4: (__libc_start_main()+0xf1) [0x7f0691178561]
5: (()+0xff80) [0x559a87838f80]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
Aborted
It's been fixed in the master branch and is awaiting backport to
Luminous [3] -- I'd expect it to be available in v12.2.3.
Thank you
With regards
Jan Pekar
[1] https://github.com/torvalds/linux/blob/master/drivers/block/nbd.c#L1166
[2] http://tracker.ceph.com/issues/22333
[3] http://tracker.ceph.com/issues/22185
--
============
Ing. Jan Pekař
***@imatic.cz | +420603811737
----
Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz
============
--
David Turner
2017-12-06 22:58:54 UTC
Permalink
Do you have the FS mounted with trim (discard) enabled? What are your mount
options?
Post by Jan Pekař - Imatic
The kernel client does not support new image features, so I decided to use
rbd-nbd.
Now I tried to rm a 300GB folder, which is mounted with rbd-nbd from a COW
snapshot on my healthy and almost idle cluster with only 1 deep-scrub
running, and I also hit the 30s timeout and device disconnect. I'm mapping it
from a virtual server, so there may be some performance issue, but I'm not
after performance, I'm after stability.
Thank you
With regards
Jan Pekar
Jan Pekař - Imatic
2018-01-04 09:05:19 UTC
Permalink
Sorry for the late answer.
No, I'm not mounting with trim enabled, only noatime.
The problem is that the cluster was highly loaded, so there were timeouts.
I "solved" it by compiling
https://github.com/jerome-pouiller/ioctl
and setting the NBD_SET_TIMEOUT ioctl timeout after creating the device.
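For anyone else hitting this before rbd-nbd grows a timeout option, a rough
C++ equivalent of that workaround is sketched below. The device path and
timeout value are just example values; adjust them to your setup.

#include <fcntl.h>
#include <linux/nbd.h>   // NBD_SET_TIMEOUT
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const char *dev = "/dev/nbd0";        // example device node
    unsigned long timeout_secs = 120;     // example timeout in seconds

    int fd = open(dev, O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    // NBD_SET_TIMEOUT takes the timeout in seconds as its argument;
    // it has to be issued after the device has been set up (mapped).
    if (ioctl(fd, NBD_SET_TIMEOUT, timeout_secs) < 0) {
        perror("ioctl(NBD_SET_TIMEOUT)");
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}

This is effectively the jerome-pouiller/ioctl invocation spelled out in code;
the tool issues the same ioctl from the command line.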

With regards
Jan Pekar
Post by David Turner
Do you have the FS mounted with trim (discard) enabled? What are your mount
options?
--
============
Ing. Jan Pekař
***@imatic.cz | +420603811737
----
Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz
============
--