Discussion:
[ceph-users] [cephfs] Kernel outage / timeout
c***@jack.fr.eu.org
2018-12-04 10:55:40 UTC
Hi,

I am seeing some wild freezes using CephFS with the kernel driver.
For instance:
[Tue Dec 4 10:57:48 2018] libceph: mon1 10.5.0.88:6789 session lost,
hunting for new mon
[Tue Dec 4 10:57:48 2018] libceph: mon2 10.5.0.89:6789 session established
[Tue Dec 4 10:58:20 2018] ceph: mds0 caps stale
[..] server is now frozen, filesystem accesses are stuck
[Tue Dec 4 11:13:02 2018] libceph: mds0 10.5.0.88:6804 socket closed
(con state OPEN)
[Tue Dec 4 11:13:03 2018] libceph: mds0 10.5.0.88:6804 connection reset
[Tue Dec 4 11:13:03 2018] libceph: reset on mds0
[Tue Dec 4 11:13:03 2018] ceph: mds0 closed our session
[Tue Dec 4 11:13:03 2018] ceph: mds0 reconnect start
[Tue Dec 4 11:13:04 2018] ceph: mds0 reconnect denied
[Tue Dec 4 11:13:04 2018] ceph: dropping dirty+flushing Fw state for
000000003f1ae609 1099692263746
[Tue Dec 4 11:13:04 2018] ceph: dropping dirty+flushing Fw state for
00000000ccd58b71 1099692263749
[Tue Dec 4 11:13:04 2018] ceph: dropping dirty+flushing Fw state for
00000000da5acf8f 1099692263750
[Tue Dec 4 11:13:04 2018] ceph: dropping dirty+flushing Fw state for
000000005ddc2fcf 1099692263751
[Tue Dec 4 11:13:04 2018] ceph: dropping dirty+flushing Fw state for
00000000469a70f4 1099692263754
[Tue Dec 4 11:13:04 2018] ceph: dropping dirty+flushing Fw state for
000000005c0038f9 1099692263757
[Tue Dec 4 11:13:04 2018] ceph: dropping dirty+flushing Fw state for
00000000e7288aa2 1099692263758
[Tue Dec 4 11:13:04 2018] ceph: dropping dirty+flushing Fw state for
00000000b431209a 1099692263759
[Tue Dec 4 11:13:04 2018] libceph: mds0 10.5.0.88:6804 socket closed
(con state NEGOTIATING)
[Tue Dec 4 11:13:31 2018] libceph: osd12 10.5.0.89:6805 socket closed
(con state OPEN)
[Tue Dec 4 11:13:35 2018] libceph: osd17 10.5.0.89:6800 socket closed
(con state OPEN)
[Tue Dec 4 11:13:35 2018] libceph: osd9 10.5.0.88:6813 socket closed
(con state OPEN)
[Tue Dec 4 11:13:41 2018] libceph: osd0 10.5.0.87:6800 socket closed
(con state OPEN)

Kernel 4.17 is used; we got the same issue with 4.18.
Ceph 13.2.1 is used.
From what I understand, the kernel client hangs for some reason (perhaps
it simply cannot handle some unexpected event).

Is there a fix for that?

Secondly, it seems that the kernel reconnects by itself after 15 minutes
every time.
Where is that tunable? Could I lower that variable, so that hangs have
less impact?
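
The only candidate I have found so far is not a Ceph setting at all:
~15 minutes is close to the kernel's default TCP retransmission give-up
time. A rough sketch of what I am looking at (just a guess, I have not
confirmed this is what libceph hits):

  # default is 15 retries, which works out to roughly 15-30 minutes
  sysctl net.ipv4.tcp_retries2
  # lowering it (e.g. to 8, ~100 seconds) should make dead connections
  # fail faster
  sysctl -w net.ipv4.tcp_retries2=8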


In ceph.log, I get "Health check failed: 1 MDSs report slow requests
(MDS_SLOW_REQUEST)", but this is probably the consequence, not the cause.

Any tips?

Best regards,
NingLi
2018-12-04 11:00:34 UTC
Hi, maybe this reference can help you:

http://docs.ceph.com/docs/master/cephfs/troubleshooting/#disconnected-remounted-fs
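
In short, once the MDS has denied the reconnect, the client cannot
recover the session and the mount has to be cycled. A minimal sketch,
assuming the filesystem is mounted at /mnt/cephfs and listed in fstab:

  # force-unmount the stale mount (add -l if processes are still stuck)
  umount -f /mnt/cephfs
  # remount using the existing fstab entry
  mount /mnt/cephfs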
Post by c***@jack.fr.eu.org
I am seeing some wild freezes using CephFS with the kernel driver.
[...]
Jack
2018-12-04 18:37:33 UTC
Thanks

However, I do not think this tip is related to my issue.

Best regards,
Post by NingLi
Hi, maybe this reference can help you:
http://docs.ceph.com/docs/master/cephfs/troubleshooting/#disconnected-remounted-fs
[...]
Gregory Farnum
2018-12-04 18:41:56 UTC
Yes, this is exactly it with the "reconnect denied".
-Greg
Himaybe this reference can help you
http://docs.ceph.com/docs/master/cephfs/troubleshooting/#disconnected-remounted-fs
Post by c***@jack.fr.eu.org
Hi,
I have some wild freeze using cephfs with the kernel driver
[Tue Dec 4 10:57:48 2018] libceph: mon1 10.5.0.88:6789 session lost,
hunting for new mon
[Tue Dec 4 10:57:48 2018] libceph: mon2 10.5.0.89:6789 session
established
Post by c***@jack.fr.eu.org
[Tue Dec 4 10:58:20 2018] ceph: mds0 caps stale
[..] server is now frozen, filesystem accesses are stuck
[Tue Dec 4 11:13:02 2018] libceph: mds0 10.5.0.88:6804 socket closed
(con state OPEN)
[Tue Dec 4 11:13:03 2018] libceph: mds0 10.5.0.88:6804 connection reset
[Tue Dec 4 11:13:03 2018] libceph: reset on mds0
[Tue Dec 4 11:13:03 2018] ceph: mds0 closed our session
[Tue Dec 4 11:13:03 2018] ceph: mds0 reconnect start
[Tue Dec 4 11:13:04 2018] ceph: mds0 reconnect denied
[Tue Dec 4 11:13:04 2018] ceph: dropping dirty+flushing Fw state for
000000003f1ae609 1099692263746
[Tue Dec 4 11:13:04 2018] ceph: dropping dirty+flushing Fw state for
00000000ccd58b71 1099692263749
[Tue Dec 4 11:13:04 2018] ceph: dropping dirty+flushing Fw state for
00000000da5acf8f 1099692263750
[Tue Dec 4 11:13:04 2018] ceph: dropping dirty+flushing Fw state for
000000005ddc2fcf 1099692263751
[Tue Dec 4 11:13:04 2018] ceph: dropping dirty+flushing Fw state for
00000000469a70f4 1099692263754
[Tue Dec 4 11:13:04 2018] ceph: dropping dirty+flushing Fw state for
000000005c0038f9 1099692263757
[Tue Dec 4 11:13:04 2018] ceph: dropping dirty+flushing Fw state for
00000000e7288aa2 1099692263758
[Tue Dec 4 11:13:04 2018] ceph: dropping dirty+flushing Fw state for
00000000b431209a 1099692263759
[Tue Dec 4 11:13:04 2018] libceph: mds0 10.5.0.88:6804 socket closed
(con state NEGOTIATING)
[Tue Dec 4 11:13:31 2018] libceph: osd12 10.5.0.89:6805 socket closed
(con state OPEN)
[Tue Dec 4 11:13:35 2018] libceph: osd17 10.5.0.89:6800 socket closed
(con state OPEN)
[Tue Dec 4 11:13:35 2018] libceph: osd9 10.5.0.88:6813 socket closed
(con state OPEN)
[Tue Dec 4 11:13:41 2018] libceph: osd0 10.5.0.87:6800 socket closed
(con state OPEN)
Kernel 4.17 is used, we got the same issue with 4.18
Ceph 13.2.1 is used
From what I understand, the kernel hang itself for some reason (perhaps
it simply cannot handle some wild event)
Is there a fix for that ?
Secondly, it seems that the kernel reconnect itself after 15 minutes
everytime
Where is that tunable ? Could I lower that variables, so that hang have
less impacts ?
On ceph.log, I get Health check failed: 1 MDSs report slow requests
(MDS_SLOW_REQUEST), but this is probably the consequence, not the cause
Any tip ?
Best regards,
_______________________________________________
ceph-users mailing list
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Jack
2018-12-04 18:50:10 UTC
Why is the client frozen in the first place?
Is this because it somehow lost the connection to the mon (I have not
found anything about this yet)?
How can I prevent this?
Can I make the client reconnect in less than 15 minutes, to lessen the
impact?
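
The only knobs I have found so far are the MDS session timings, though
I am not sure they control the client-side delay. A sketch, assuming
the Mimic per-filesystem interface and a filesystem named "cephfs":

  # show the current session settings
  ceph fs get cephfs | grep session
  # seconds of silence before the MDS marks a client's caps stale
  ceph fs set cephfs session_timeout 60
  # seconds of silence before the MDS evicts the session entirely
  ceph fs set cephfs session_autoclose 300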

Best regards,
Post by Gregory Farnum
Yes, this is exactly it with the "reconnect denied".
[...]
Yan, Zheng
2018-12-05 01:47:31 UTC
Post by c***@jack.fr.eu.org
I am seeing some wild freezes using CephFS with the kernel driver.
[...]
Secondly, it seems that the kernel reconnects by itself after 15 minutes
every time.
Where is that tunable? Could I lower that variable, so that hangs have
less impact?
This looks more like a network issue. Check whether there is a firewall
between the MDS and the client.
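
For example, check that the MDS port is reachable from the client and
whether a stateful firewall is tracking the flow (a quick sketch,
assuming conntrack-tools is installed; addresses taken from your log):

  # can the client still reach the MDS?
  nc -vz 10.5.0.88 6804
  # is conntrack still holding the client<->MDS flow?
  conntrack -L | grep 10.5.0.88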
Jack
2018-12-10 10:43:04 UTC
There is only a simple iptables conntrack setup there.

Could it be related to a timeout?

/proc/sys/net/netfilter/nf_conntrack_tcp_timeout_established is
currently 7875.
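
If that timeout is expiring the idle client<->MDS flow, two things I
could try (a sketch, untested here):

  # raise the conntrack timeout back toward the kernel default (5 days)
  sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=432000
  # or make idle TCP sessions send keepalives before the entry expires
  sysctl -w net.ipv4.tcp_keepalive_time=600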

Best regards,
Post by Yan, Zheng
This looks more like a network issue. Check whether there is a firewall
between the MDS and the client.