Discussion:
[ceph-users] Fwd: how to fix the mds damaged issue
Lihang
2016-07-03 07:06:21 UTC
***@BoreNode2:~# ceph -v
ceph version 10.2.0

From: lihang 12398 (RD)
Sent: July 3, 2016 14:47
To: ceph-***@lists.ceph.com
Cc: Ceph Development; '***@gmail.com'; zhengbin 08747 (RD); xusangdi 11976 (RD)
Subject: how to fix the mds damaged issue

Hi, my ceph cluster's MDS is damaged and the cluster is degraded after our machine room lost power suddenly. The cluster is "HEALTH_ERR" and cannot recover to health by itself after I reboot the storage node or restart the ceph cluster. After that I also used the following commands to remove the damaged MDS, but the removal failed and the issue still exists. The other two MDS daemons are standby. Can anyone tell me how to fix this issue and find out what happened in my cluster?
The process I used to remove the damaged MDS on my storage node is as follows.

1> Execute the "stop ceph-mds-all" command on the damaged MDS node

2> ceph mds rmfailed 0 --yes-i-really-mean-it

3> ***@BoreNode2:~# ceph mds rm 0

mds gid 0 dne

The detailed status of my cluster is as follows:
***@BoreNode2:~# ceph -s
cluster 98edd275-5df7-414f-a202-c3d4570f251c
health HEALTH_ERR
mds rank 0 is damaged
mds cluster is degraded
monmap e1: 3 mons at {BoreNode2=172.16.65.141:6789/0,BoreNode3=172.16.65.142:6789/0,BoreNode4=172.16.65.143:6789/0}
election epoch 1010, quorum 0,1,2 BoreNode2,BoreNode3,BoreNode4
fsmap e168: 0/1/1 up, 3 up:standby, 1 damaged
osdmap e338: 8 osds: 8 up, 8 in
flags sortbitwise
pgmap v17073: 1560 pgs, 5 pools, 218 kB data, 32 objects
423 MB used, 3018 GB / 3018 GB avail
1560 active+clean
***@BoreNode2:~# ceph mds dump
dumped fsmap epoch 168
fs_name TudouFS
epoch 156
flags 0
created 2016-04-02 02:48:11.150539
modified 2016-04-03 03:04:57.347064
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
last_failure 0
last_failure_osd_epoch 83
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
max_mds 1
in 0
up {}
failed
damaged 0
stopped
data_pools 4
metadata_pool 3
inline_data disabled
John Spray
2016-07-04 09:49:16 UTC
Post by Lihang
ceph version 10.2.0
From: lihang 12398 (RD)
Sent: July 3, 2016 14:47
11976 (RD)
Subject: how to fix the mds damaged issue
Hi, my ceph cluster's MDS is damaged and the cluster is degraded after our
machine room lost power suddenly. The cluster is "HEALTH_ERR" and cannot
recover to health by itself after I reboot the storage node or restart the
ceph cluster. After that I also used the following commands to remove the
damaged MDS, but the removal failed and the issue still exists. The other
two MDS daemons are standby. Can anyone tell me how to fix this issue and
find out what happened in my cluster?
The process I used to remove the damaged MDS on my storage node is as follows.
1> Execute the "stop ceph-mds-all" command on the damaged MDS node
2> ceph mds rmfailed 0 --yes-i-really-mean-it
rmfailed is not something you want to use in these circumstances.
Post by Lihang
mds gid 0 dne
cluster 98edd275-5df7-414f-a202-c3d4570f251c
health HEALTH_ERR
mds rank 0 is damaged
mds cluster is degraded
monmap e1: 3 mons at
{BoreNode2=172.16.65.141:6789/0,BoreNode3=172.16.65.142:6789/0,BoreNode4=172.16.65.143:6789/0}
election epoch 1010, quorum 0,1,2 BoreNode2,BoreNode3,BoreNode4
fsmap e168: 0/1/1 up, 3 up:standby, 1 damaged
osdmap e338: 8 osds: 8 up, 8 in
flags sortbitwise
pgmap v17073: 1560 pgs, 5 pools, 218 kB data, 32 objects
423 MB used, 3018 GB / 3018 GB avail
1560 active+clean
When an MDS rank is marked as damaged, that means something invalid
was found when reading from the pool storing metadata objects. The
next step is to find out what that was. Look in the MDS log and in
ceph.log from the time when it went damaged, to find the most specific
error message you can.
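For example, something along these lines (a rough sketch, assuming the default log locations under /var/log/ceph and that the MDS daemon id matches the hostname, e.g. BoreNode2):

  # cluster log on a monitor node -- find the moment rank 0 was marked damaged
  grep -i damaged /var/log/ceph/ceph.log

  # MDS log on the node that was holding rank 0
  grep -iE 'error|damaged|corrupt' /var/log/ceph/ceph-mds.BoreNode2.log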

If you do not have the logs and want to have the MDS try operating
again (to reproduce whatever condition caused it to be marked
damaged), you can enable it by using "ceph mds repaired 0", then start
the daemon and see how it is failing.
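In outline that looks something like this (a rough sketch, assuming an upstart-based node as in your "stop ceph-mds-all" step; the id BoreNode2 is only an example, substitute your own daemon id):

  # clear the damaged flag so an MDS is allowed to take rank 0 again
  ceph mds repaired 0

  # start one of the stopped daemons and watch it attempt replay
  start ceph-mds id=BoreNode2
  ceph -s
  ceph -w    # and/or tail the MDS log to catch the failure as it happens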

John
Post by Lihang
dumped fsmap epoch 168
fs_name TudouFS
epoch 156
flags 0
created 2016-04-02 02:48:11.150539
modified 2016-04-03 03:04:57.347064
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
last_failure 0
last_failure_osd_epoch 83
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
max_mds 1
in 0
up {}
failed
damaged 0
stopped
data_pools 4
metadata_pool 3
inline_data disabled
_______________________________________________
ceph-users mailing list
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Lihang
2016-07-04 12:42:40 UTC
Thank you very much for your advice. The command "ceph mds repaired 0" worked fine in my cluster; the cluster state became HEALTH_OK and the cephfs state returned to normal as well. However, the monitor and MDS log files only record the replay and recovery process without pointing out anything abnormal, and I do not have the logs from when this issue happened, so I have not found the root cause yet. I will try to reproduce this issue. Thank you very much again!
fisher

-----Original Message-----
From: John Spray [mailto:***@redhat.com]
Sent: July 4, 2016 17:49
To: lihang 12398 (RD)
Cc: ceph-***@lists.ceph.com
Subject: Re: [ceph-users] Fwd: how to fix the mds damaged issue
Post by Lihang
ceph version 10.2.0
From: lihang 12398 (RD)
Sent: July 3, 2016 14:47
11976 (RD)
Subject: how to fix the mds damaged issue
Hi, my ceph cluster's MDS is damaged and the cluster is degraded after
our machine room lost power suddenly. The cluster is "HEALTH_ERR" and
cannot recover to health by itself after I reboot the storage node or
restart the ceph cluster. After that I also used the following commands
to remove the damaged MDS, but the removal failed and the issue still
exists. The other two MDS daemons are standby. Can anyone tell me how to
fix this issue and find out what happened in my cluster?
The process I used to remove the damaged MDS on my storage node is as follows.
1> Execute the "stop ceph-mds-all" command on the damaged MDS node
2> ceph mds rmfailed 0 --yes-i-really-mean-it
rmfailed is not something you want to use in these circumstances.
Post by Lihang
mds gid 0 dne
cluster 98edd275-5df7-414f-a202-c3d4570f251c
health HEALTH_ERR
mds rank 0 is damaged
mds cluster is degraded
monmap e1: 3 mons at
{BoreNode2=172.16.65.141:6789/0,BoreNode3=172.16.65.142:6789/0,BoreNod
e4=172.16.65.143:6789/0}
election epoch 1010, quorum 0,1,2
BoreNode2,BoreNode3,BoreNode4
fsmap e168: 0/1/1 up, 3 up:standby, 1 damaged
osdmap e338: 8 osds: 8 up, 8 in
flags sortbitwise
pgmap v17073: 1560 pgs, 5 pools, 218 kB data, 32 objects
423 MB used, 3018 GB / 3018 GB avail
1560 active+clean
When an MDS rank is marked as damaged, that means something invalid was found when reading from the pool storing metadata objects. The next step is to find out what that was. Look in the MDS log and in ceph.log from the time when it went damaged, to find the most specific error message you can.

If you do not have the logs and want to have the MDS try operating again (to reproduce whatever condition caused it to be marked damaged), you can enable it by using "ceph mds repaired 0", then start the daemon and see how it is failing.

John
Post by Lihang
dumped fsmap epoch 168
fs_name TudouFS
epoch 156
flags 0
created 2016-04-02 02:48:11.150539
modified 2016-04-03 03:04:57.347064
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
last_failure 0
last_failure_osd_epoch 83
compat compat={},rocompat={},incompat={1=base v0.20,2=client
writeable ranges,3=default file layouts on dirs,4=dir inode in
separate object,5=mds uses versioned encoding,6=dirfrag is stored in
omap,8=file layout v2}
max_mds 1
in 0
up {}
failed
damaged 0
stopped
data_pools 4
metadata_pool 3
inline_data disabled
Shinobu Kinjo
2016-07-04 22:21:50 UTC
Reproduce with 'debug mds = 20' and 'debug ms = 20'.
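For instance (a minimal sketch; either set this in ceph.conf on the MDS node before restarting the daemon, or inject it into a running MDS -- the daemon id BoreNode2 is only a placeholder):

  # ceph.conf on the MDS node
  [mds]
      debug mds = 20
      debug ms = 20

  # or, on a running daemon:
  ceph tell mds.BoreNode2 injectargs '--debug-mds 20 --debug-ms 20'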

shinobu
Post by Lihang
Thank you very much for your advice. The command "ceph mds repaired 0"
worked fine in my cluster; the cluster state became HEALTH_OK and the
cephfs state returned to normal as well. However, the monitor and MDS log
files only record the replay and recovery process without pointing out
anything abnormal, and I do not have the logs from when this issue
happened, so I have not found the root cause yet. I will try to reproduce
this issue. Thank you very much again!
fisher
-----Original Message-----
Sent: July 4, 2016 17:49
To: lihang 12398 (RD)
Subject: Re: [ceph-users] Fwd: how to fix the mds damaged issue
Post by Lihang
ceph version 10.2.0
From: lihang 12398 (RD)
Sent: July 3, 2016 14:47
xusangdi 11976 (RD)
Subject: how to fix the mds damaged issue
Hi, my ceph cluster's MDS is damaged and the cluster is degraded after
our machine room lost power suddenly. The cluster is "HEALTH_ERR" and
cannot recover to health by itself after I reboot the storage node or
restart the ceph cluster. After that I also used the following commands
to remove the damaged MDS, but the removal failed and the issue still
exists. The other two MDS daemons are standby. Can anyone tell me how to
fix this issue and find out what happened in my cluster?
The process I used to remove the damaged MDS on my storage node is as follows.
1> Execute the "stop ceph-mds-all" command on the damaged MDS node
2> ceph mds rmfailed 0 --yes-i-really-mean-it
rmfailed is not something you want to use in these circumstances.
Post by Lihang
mds gid 0 dne
cluster 98edd275-5df7-414f-a202-c3d4570f251c
health HEALTH_ERR
mds rank 0 is damaged
mds cluster is degraded
monmap e1: 3 mons at
{BoreNode2=172.16.65.141:6789/0,BoreNode3=172.16.65.142:6789/0,BoreNod
e4=172.16.65.143:6789/0}
election epoch 1010, quorum 0,1,2
BoreNode2,BoreNode3,BoreNode4
fsmap e168: 0/1/1 up, 3 up:standby, 1 damaged
osdmap e338: 8 osds: 8 up, 8 in
flags sortbitwise
pgmap v17073: 1560 pgs, 5 pools, 218 kB data, 32 objects
423 MB used, 3018 GB / 3018 GB avail
1560 active+clean
When an MDS rank is marked as damaged, that means something invalid was
found when reading from the pool storing metadata objects. The next step
is to find out what that was. Look in the MDS log and in ceph.log from the
time when it went damaged, to find the most specific error message you can.
If you do not have the logs and want to have the MDS try operating again
(to reproduce whatever condition caused it to be marked damaged), you can
enable it by using "ceph mds repaired 0", then start the daemon and see how
it is failing.
John
Post by Lihang
dumped fsmap epoch 168
fs_name TudouFS
epoch 156
flags 0
created 2016-04-02 02:48:11.150539
modified 2016-04-03 03:04:57.347064
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
last_failure 0
last_failure_osd_epoch 83
compat compat={},rocompat={},incompat={1=base v0.20,2=client
writeable ranges,3=default file layouts on dirs,4=dir inode in
separate object,5=mds uses versioned encoding,6=dirfrag is stored in
omap,8=file layout v2}
max_mds 1
in 0
up {}
failed
damaged 0
stopped
data_pools 4
metadata_pool 3
inline_data disabled
--
Email:
***@linux.com
***@redhat.com