Discussion:
[ceph-users] cephfs kernel, hang with libceph: osdx X.X.X.X socket closed (con state OPEN)
Alexandre DERUMIER
2018-11-08 17:16:20 UTC
Hi,

we are currently testing cephfs with the kernel module (4.17 and 4.18) instead of fuse (which worked fine),

and we get hangs: iowait jumps like crazy for around 20 min.

The client is a qemu 2.12 VM with a virtio-net interface.
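
For reference, the kind of mounts we are comparing (the monitor address, client name and paths below are illustrative, not our real values):

# kernel client (the case that hangs):
mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs -o name=cephfs,secretfile=/etc/ceph/cephfs.secret

# ceph-fuse (the case that worked fine):
ceph-fuse -n client.cephfs -m 10.0.0.1:6789 /mnt/cephfs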


In the client logs, we are seeing this kind of message:

[Thu Nov 8 12:20:18 2018] libceph: osd14 x.x.x.x:6801 socket closed (con state OPEN)
[Thu Nov 8 12:42:03 2018] libceph: osd9 x.x.x.x:6821 socket closed (con state OPEN)


and in osd logs:

osd14:
2018-11-08 12:20:25.247 7f31ffac8700 0 -- x.x.x.x:6801/1745 >> x.x.x.x:0/3678871522 conn(0x558c430ec300 :6801 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)

osd9:
2018-11-08 12:42:09.820 7f7ca970e700 0 -- x.x.x.x:6821/1739 >> x.x.x.x:0/3678871522 conn(0x564fcbec5100 :6821 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)


The cluster is ceph 13.2.1.

Note that we have a physical firewall between the client and the servers; I'm not sure yet whether it could be dropping the sessions. (I haven't found any relevant logs on the firewall.)

Any ideas? I would like to know whether it's a network bug or a ceph bug (I'm not sure how to interpret the osd logs).

Regards,

Alexandre



client ceph.conf
----------------
[client]
fuse_disable_pagecache = true
client_reconnect_stale = true
Alexandre DERUMIER
2018-11-09 00:12:25 UTC
To be more precise,

the log messages appear when the hang ends.

I have looked at the stats for 10 different hangs, and the duration is always around 15 minutes.

Maybe related to:

ms tcp read timeout
Description: If a client or daemon makes a request to another Ceph daemon and does not drop an unused connection, the ms tcp read timeout defines the connection as idle after the specified number of seconds.
Type: Unsigned 64-bit Integer
Required: No
Default: 900 (15 minutes)

?
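
If it is that, I suppose we could make ceph give up on idle sessions before the firewall does. An untested sketch (the osd id is just an example):

# check the live value on an osd host, via the admin socket:
ceph daemon osd.14 config get ms_tcp_read_timeout

# or lower it below the firewall's idle timeout in ceph.conf:
[global]
ms tcp read timeout = 60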

I found a similar bug report involving a firewall too:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-October/013841.html


Alexandre DERUMIER
2018-11-09 01:06:43 UTC
OK,
it seems to come from the firewall:
I'm seeing dropped sessions exactly 15 min before the log messages.

The dropped sessions are the sessions to the osds; the sessions to the mon and the mds are fine.


It seems that keepalive2 is used to monitor the mon session:
https://patchwork.kernel.org/patch/7105641/

but I'm not sure about the osd sessions?
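
One way to check from the client side whether the osd sessions carry any keepalive at all (ss -o prints the socket timer; the ports are the osd ports from the logs above):

# on the client: established sessions to the osds, with timer info
ss -tno state established '( dport = :6801 or dport = :6821 )'

If the timer column shows keepalive, the socket has SO_KEEPALIVE set; if it's empty, nothing is keeping the session alive through the firewall.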

Linh Vu
2018-11-09 01:16:07 UTC
If you're using the kernel client for cephfs, I strongly advise having the client on the same subnet as the ceph public one, i.e. all traffic on the same subnet/VLAN. Even if your firewall situation is good, if you have to cross subnets or VLANs you will run into weird problems later. Fuse has much better tolerance for that scenario.

Alexandre DERUMIER
2018-11-09 06:08:55 UTC
>>If you're using kernel client for cephfs, I strongly advise to have the client on the same subnet as the ceph public one i.e all traffic should be on the same subnet/VLAN. Even if your firewall situation is good, if you have to cross subnets or VLANs, you will run into weird problems later.

Thanks.

Currently the clients are in a different vlan for security (multiple different customers; we don't want a customer to have direct access to another customer, or to ceph).
But, as they are VMs, I can manage to put them in the same vlan and do the firewalling on the hypervisor. (I'll need firewalling in all cases.)
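
Something like this on the hypervisor should be enough per VM interface (an iptables sketch; the interface name and the ceph public network are made up here):

# let the VM reach mon (6789) and osd/mds (6800-7300 by default) only:
iptables -A FORWARD -i vnet0 -d 10.0.0.0/24 -p tcp --dport 6789 -j ACCEPT
iptables -A FORWARD -i vnet0 -d 10.0.0.0/24 -p tcp --dport 6800:7300 -j ACCEPT
iptables -A FORWARD -i vnet0 -d 10.0.0.0/24 -j DROP
# and let established replies back in:
iptables -A FORWARD -o vnet0 -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT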


>>Fuse has much better tolerance for that scenario.

What's the difference?


