Alexandre DERUMIER
2018-11-08 17:16:20 UTC
Hi,
we are currently testing cephfs with the kernel module (4.17 and 4.18) instead of fuse (which worked fine),
and we are getting hangs; iowait jumps like crazy for around 20 minutes.
The client is a qemu 2.12 VM with a virtio-net interface.
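For reference, we mount with the in-kernel client roughly like this (a sketch; the monitor address, client name, secret path, and mount point are placeholders, not our exact values):

mount -t ceph x.x.x.x:6789:/ /mnt/cephfs -o name=cephfs,secretfile=/etc/ceph/cephfs.secret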
In the client logs, we are seeing this kind of message:
[Thu Nov 8 12:20:18 2018] libceph: osd14 x.x.x.x:6801 socket closed (con state OPEN)
[Thu Nov 8 12:42:03 2018] libceph: osd9 x.x.x.x:6821 socket closed (con state OPEN)
and in the osd logs:
osd14:
2018-11-08 12:20:25.247 7f31ffac8700 0 -- x.x.x.x:6801/1745 >> x.x.x.x:0/3678871522 conn(0x558c430ec300 :6801 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
osd9:
2018-11-08 12:42:09.820 7f7ca970e700 0 -- x.x.x.x:6821/1739 >> x.x.x.x:0/3678871522 conn(0x564fcbec5100 :6821 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
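To see on the client side why those sockets close, I suppose we can enable libceph dynamic debug (a sketch; it assumes the kernel was built with CONFIG_DYNAMIC_DEBUG and that debugfs is mounted, and the output in dmesg is very verbose):

echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control
echo 'module ceph +p' > /sys/kernel/debug/dynamic_debug/control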
The cluster is ceph 13.2.1.
Note that we have a physical firewall between client and server; I'm not sure yet whether the session could be dropped by it. (I haven't found any logs on the firewall.)
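If the firewall does drop idle sessions, TCP keepalive would be what keeps them alive; here is a sketch of what I can check on the client (standard Linux sysctls, plus the keepalive timers of the established connections to the OSD ports from the logs above):

sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
ss -tno state established '( dport = :6801 or dport = :6821 )'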
Any idea? I would like to know whether it's a network bug or a ceph bug (I'm not sure how to interpret the osd logs).
Regards,
Alexandre
client ceph.conf
----------------
[client]
fuse_disable_pagecache = true
client_reconnect_stale = true