Discussion:
[ceph-users] mon:failed in thread_name:safe_timer
楼锴毅
2018-11-20 03:17:36 UTC
Hello,
Sorry to disturb, but recently while using Ceph (12.2.8) I found that the leader monitor always fails in thread_name:safe_timer.
Here is part of the log:

0> 2018-11-20 10:33:22.386543 7faf7d84f700 -1 *** Caught signal (Aborted) **
in thread 7faf7d84f700 thread_name:safe_timer

ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
1: (()+0x93f2d1) [0x55ef7319c2d1]
2: (()+0xf5e0) [0x7faf83fb55e0]
3: (gsignal()+0x37) [0x7faf810ee1f7]
4: (abort()+0x148) [0x7faf810ef8e8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7faf819f4ac5]
6: (()+0x5ea36) [0x7faf819f2a36]
7: (()+0x5ea63) [0x7faf819f2a63]
8: (()+0x5ec83) [0x7faf819f2c83]
9: (std::__throw_out_of_range(char const*)+0x77) [0x7faf81a47a97]
10: (FSMap::get_info_gid(mds_gid_t) const+0xfc) [0x55ef72e1dc0c]
11: (MDSMonitor::tick()+0x427) [0x55ef72e107d7]
12: (Monitor::tick()+0x128) [0x55ef72c48908]
13: (C_MonContext::finish(int)+0x37) [0x55ef72c1a7d7]
14: (Context::complete(int)+0x9) [0x55ef72c585c9]
15: (SafeTimer::timer_thread()+0x104) [0x55ef72e8dbc4]
16: (SafeTimerThread::entry()+0xd) [0x55ef72e8f5ed]
17: (()+0x7e25) [0x7faf83fade25]
18: (clone()+0x6d) [0x7faf811b134d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
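(A sketch for gathering more context before the next abort, assuming the standard Ceph CLI is available on the mon hosts; the debug levels below are a common choice, not something from this thread:)

    # raise monitor debug verbosity at runtime (not persisted across restarts)
    ceph tell mon.<id> injectargs '--debug-mon 20 --debug-ms 1'
    # then collect /var/log/ceph/ceph-mon.<id>.log from the leader after the next crash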


And my cluster's status is:

  cluster:
    id:     8c9bc910-c7f1-4b98-8c61-e18ee786e983
    health: HEALTH_OK

  services:
    mon: 2 daemons, quorum qbs-monitor-online010-hbaz1.qiyi.virtual,qbs-monitor-online009-hbaz1.qiyi.virtual
    mgr: qbs-monitor-online009-hbaz1(active, starting)
    osd: 164 osds: 164 up, 164 in
    rgw: 3 daemons active

  data:
    pools:   26 pools, 4832 pgs
    objects: 5.39k objects, 20.0GiB
    usage:   243GiB used, 1.07PiB / 1.07PiB avail
    pgs:     4832 active+clean

  io:
    client: 4.63KiB/s wr, 0op/s rd, 0op/s wr

What can I do to recover it? I am happy to provide more information if necessary.

Sincerely,
LouKaiyi
Patrick Donnelly
2018-11-20 03:29:14 UTC
Post by 楼锴毅
Sorry to disturb, but recently while using Ceph (12.2.8) I found that the leader monitor always fails in thread_name:safe_timer.
[...]
Try upgrading the mons to v12.2.9 (but see recent warnings concerning
upgrades to v12.2.9 for the OSDs):
https://tracker.ceph.com/issues/35848
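A mon-only upgrade would look roughly like this, one monitor host at a time (an RPM-based install and the stock ceph-mon systemd unit are assumptions here; adjust for your distro):

    # upgrade only the monitor package, then restart that mon
    yum update ceph-mon              # or your distro's equivalent
    systemctl restart ceph-mon@<id>
    ceph -s                          # wait for both mons to rejoin quorum before the next host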
--
Patrick Donnelly
楼锴毅
2018-11-20 05:48:55 UTC
Thanks for the reply. But I noticed that in https://tracker.ceph.com/issues/35848 the monitor failed in thread_name:fn_monstore, and safe_timer was not mentioned. Will the fix for issue 35848 also work for my problem?
When I went to upgrade the cluster to v12.2.9, I noticed that v12.2.9 is not recommended and v12.2.10 may be better, so I wonder when v12.2.10 will be released?
Thanks a lot !

LouKaiyi

楼锴毅
2018-11-21 02:18:27 UTC
Hello
Yesterday I upgraded my cluster to v12.2.9, but the mons still fail for the same reason. And when I run 'ceph versions', it returned
"
"mds": {
"ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 1,
"ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous (stable)": 4
},
"
But actually I only have four MDSs, and their versions are all v12.2.9. I am confused about this.

The cause of the mons' failure may be relevant. I had 2 MDSs up:active and 2 MDSs up:standby at the time. Then I stopped the two standby MDSs and the MDS holding rank 1, so only the MDS in rank 0 stayed active. After restarting the 3 MDSs they all became standby, and the cluster reported that rank 1 was damaged. At the same time, the monitor failed in thread_name:safe_timer. Though rank 1 has since been marked repaired, the mon still fails.
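For reference, the inspection and repair side of this is usually driven with commands along these lines ('cephfs' is an assumed filesystem name here, not taken from this thread):

    ceph fs ls                   # confirm the filesystem name
    ceph fs status               # show ranks, their states, and standbys
    ceph mds repaired cephfs:1   # clear the damaged flag on rank 1 once its metadata is believed fixed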

Is there anything I can do to repair the mons?

Loukaiyi

Patrick Donnelly
2018-11-21 18:16:56 UTC
Post by 楼锴毅
Hello
Yesterday I upgraded my cluster to v12.2.9, but the mons still fail for the same reason. And when I run 'ceph versions', it returned
"
"mds": {
"ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 1,
"ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous (stable)": 4
},
"
But actually I only have four MDSs, and their versions are all v12.2.9. I am confused about this.
How did you restart the MDSs? If you used `ceph mds fail` then the
executable version (v12.2.8) will not change.
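In other words, roughly (the MDS id 'mds-a' is hypothetical):

    ceph mds fail mds-a                # marks the daemon failed; the same process re-registers, still on the old binary
    systemctl restart ceph-mds@mds-a   # stops and restarts the process, so it picks up the upgraded binary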

Also, the monitor failure requires updating the monitor to v12.2.9.
What version are the mons?
--
Patrick Donnelly
楼锴毅
2018-11-22 02:22:14 UTC
Well, I used 'systemctl restart ceph-***@qbs-monitor-online009-hbaz1.service', and the 4 MDSs did update from v12.2.8 to v12.2.9.
The mons' version is v12.2.9. Actually, the mgr and the OSDs are also on v12.2.9 now. But the mon still fails.

What I am confused about is that when I run 'ceph-mds --version' on each MDS host it returns v12.2.9, but 'ceph versions' makes it look as if I have 5 MDSs, one of them on v12.2.8. (In fact I only have 4 MDSs.)
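For what it's worth, 'ceph-mds --version' reports the installed binary, while 'ceph versions' reports what each running daemon has registered with the mons, so a stale entry can linger there. A sketch for pinning down which daemon still advertises v12.2.8 (daemon names are assumed to follow the hostname):

    # on each MDS host, ask the running process (not the installed package) its version
    ceph daemon mds.$(hostname -s) version
    # or, from any node with admin credentials, per daemon:
    ceph tell mds.<name> version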

Thanks
Loukaiyi
