Discussion:
[ceph-users] all VMs cannot start up after rebooting all the Ceph hosts
linghucongsong
2018-12-04 08:48:57 UTC
Permalink
Hi all!

I have a Ceph test environment used with OpenStack, and there are some VMs running on it. It is just a test environment.

My Ceph version is 12.2.4. Yesterday I rebooted all the Ceph hosts without shutting down the VMs on OpenStack first.

When all the hosts came back up and Ceph became healthy again, I found that none of the VMs could start. All of the VMs show the XFS error below, and even xfs_repair cannot fix the problem.

It is just a test environment, so the data is not important to me. I know that Ceph 12.2.4 is not stable enough, but how can it have such serious problems? Just a heads-up for other people who may care about this. Thanks to all. :)
Janne Johansson
2018-12-04 09:30:13 UTC
Permalink
Post by linghucongsong
Hi all!
I have a Ceph test environment used with OpenStack, and there are some VMs running on it. It is just a test environment.
My Ceph version is 12.2.4. Yesterday I rebooted all the Ceph hosts without shutting down the VMs on OpenStack first.
When all the hosts came back up and Ceph became healthy again, I found that none of the VMs could start. All of the VMs show the XFS error below, and even xfs_repair cannot fix the problem.
So you removed the underlying storage while the machines were running. What did you expect would happen?
If you do this to a physical machine, or to guests running on some other kind of remote storage such as iSCSI, what do you think happens to the running machines?
--
May the most significant bit of your life be positive.
linghucongsong
2018-12-04 09:37:20 UTC
Permalink
Thank you for the reply!

But what about a sudden power loss on all of the hosts?

So is the best way to handle this to keep snapshots of the important VMs, or to mirror the images to another Ceph cluster?
Janne Johansson
2018-12-04 09:47:18 UTC
Permalink
Post by linghucongsong
Thank you for the reply!
But what about a sudden power loss on all of the hosts?
So is the best way to handle this to keep snapshots of the important VMs, or to mirror the images to another Ceph cluster?
The best way is probably to handle it just as you would handle power outages for physical machines: make sure you have working backups with tested restores, AND/OR scripts that can reinstall your guests if needed, and then avoid power outages as much as possible.

Ceph only knows that certain writes and reads are being made from the OpenStack compute hosts. It does not know the meaning of an individual write, so it cannot tell whether a write made just before an outage is half of an XFS update needed to create a new file or a complete transaction. If you remove the storage at random, random errors will occur on your filesystems, just like pulling out a USB stick while you are writing to it.

If power outages are very common, you might consider having lots of small filesystems on your guests, mounting them read-only as much as possible, and only making them read-write for short intervals, so that the chance of each partition getting broken during an outage is as small as possible.
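
For example (just a sketch; /data is an assumed mount point on the guest), a data partition could be kept read-only and only remounted read-write around the writes:

  mount -o remount,rw /data   # allow writes for a short interval
  # ... perform the writes ...
  mount -o remount,ro /data   # back to read-only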
--
May the most significant bit of your life be positive.
Simon Ironside
2018-12-04 13:55:55 UTC
Permalink
Post by linghucongsong
But what about a sudden power loss on all of the hosts?
I'm surprised you're seeing I/O errors inside the VMs once they're restarted.
Is the cluster healthy? What's the output of ceph status?
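For example:

  ceph status          # or the short form: ceph -s
  ceph health detail   # shows any warnings or errors in more detail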

Simon
Jason Dillaman
2018-12-04 14:33:58 UTC
Permalink
I would check to see if the images have an exclusive-lock still held
by a force-killed VM. librbd will generally automatically clear this
up unless it doesn't have the proper permissions to blacklist a dead
client from the Ceph cluster. Verify that your OpenStack Ceph user
caps are correct [1][2].

[1] http://docs.ceph.com/docs/master/releases/luminous/#upgrade-from-jewel-or-kraken
[2] http://docs.ceph.com/docs/luminous/rbd/rbd-openstack/#setup-ceph-client-authentication
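
As a rough illustration (the client and pool names are assumptions; adjust them to your deployment), the Luminous-style caps that allow librbd to blacklist a dead client look something like:

  ceph auth caps client.cinder mon 'profile rbd' \
    osd 'profile rbd pool=volumes, profile rbd pool=vms, profile rbd-ro pool=images'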
Post by linghucongsong
But what about a sudden power loss on all of the hosts?
I'm surprised you're seeing I/O errors inside the VMs once they're restarted.
Is the cluster healthy? What's the output of ceph status?
Simon
--
Jason
Ouyang Xu
2018-12-04 15:42:15 UTC
Permalink
Hi linghucongsong:

I have hit this issue before; you can try to fix it as below:

1. use rbd lock ls to find the lock held on the VM's image
2. use rbd lock rm to remove that lock
3. start the VM again
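
For example (the pool and image name are assumptions; use your own image spec):

  rbd lock ls vms/<image>_disk
  rbd lock rm vms/<image>_disk "<lock-id>" <locker>

where <lock-id> and <locker> are taken from the rbd lock ls output (the lock id often contains a space, so quote it).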

hope that can help you.

regards,

Ouyang
Post by linghucongsong
Hi all!
I have a Ceph test environment used with OpenStack, and there are some VMs running on it. It is just a test environment.
My Ceph version is 12.2.4. Yesterday I rebooted all the Ceph hosts without shutting down the VMs on OpenStack first.
When all the hosts came back up and Ceph became healthy again, I found that none of the VMs could start. All of the VMs show the XFS error below, and even xfs_repair cannot fix the problem.
It is just a test environment, so the data is not important to me. I know that Ceph 12.2.4 is not stable enough, but how can it have such serious problems? Just a heads-up for other people who may care about this. Thanks to all. :)
linghucongsong
2018-12-05 03:49:32 UTC
Permalink
Thanks to all! I might have found the reason.

It looks like it is related to the bug below:

https://bugs.launchpad.net/nova/+bug/1773449