Discussion:
[ceph-users] SSD-Cache Tier + RBD-Cache = Filesystem corruption?
Udo Waechter
2016-02-06 10:31:51 UTC
Hello,

I am experiencing totally weird filesystem corruptions with the
following setup:

* Ceph infernalis on Debian8
* 10 OSDs (5 hosts) with spinning disks
* 4 OSDs (1 host, with SSDs)

The SSDs are new in my setup and I am trying to set up a cache tier.

With the spinning disks, Ceph has been running for about a year without
any major issues. Replacing disks and all that went fine.

Ceph is used by rbd+libvirt+kvm with

rbd_cache = true
rbd_cache_writethrough_until_flush = true
rbd_cache_size = 128M
rbd_cache_max_dirty = 96M

Also, in libvirt, I have

cachemode=writeback enabled.
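
For reference, the relevant piece of the libvirt disk definition looks
roughly like this (pool, image and monitor names below are just
placeholders):

<disk type='network' device='disk'>
  <!-- cache='writeback' is the setting mentioned above -->
  <driver name='qemu' type='raw' cache='writeback'/>
  <source protocol='rbd' name='libvirt-pool/some-vm-disk'>
    <host name='ceph-mon1' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>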

So far so good.

Now, I've added the SSD cache tier to the picture, with cache-mode
"writeback".

The SSD machine also has the "deadline" I/O scheduler enabled.
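(set per device along the lines of: echo deadline > /sys/block/sdX/queue/scheduler)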

Suddenly, VMs start to corrupt their filesystems (all ext4) with "Journal
failed" errors.
Trying to reboot the machines ends in "No bootable drive".
Using parted and testdisk on the image mapped via rbd reveals that the
partition table is gone.

testdisk finds the proper partitions, but e2fsck then "repairs" the
filesystem beyond usability.
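
For the inspection I map the image on one of the hosts and point the
tools at the rbd block device, roughly like this (image name is just an
example):

rbd map libvirt-pool/some-vm-disk
parted /dev/rbd0 print
testdisk /dev/rbd0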

This does not happen to all machines; it happens to those that actually
do some or most of the IO:

elasticsearch, MariaDB+Galera, postgres, backup, GIT

Or so I thought: yesterday one of my LDAP servers died, and that one is
not doing much IO.

Could it be that rbd caching + qemu writeback cache + ceph cache tier
writeback are not playing well together?

I've read through some older mails on the list, where people had similar
problems and suspected something like that.

What are the proper/right settings for rbd/qemu/libvirt?

libvirt: cachemode=none (writeback?)
rbd: cache_mode = none
SSD-tier: cachemode: writeback

?

Thanks for any help,
udo.
Alexandre DERUMIER
2016-02-06 11:49:46 UTC
Post by Udo Waechter
Could it be that rbd caching + qemu writeback cache + ceph cache tier
writeback are not playing well together?
rbd caching=true is the same as qemu writeback.

Setting cache=writeback in qemu configures librbd with rbd cache=true.
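
For example, with a plain qemu command line (pool/image name is just an
example), the cache= option is what drives the librbd cache setting:

-drive format=raw,file=rbd:libvirt-pool/some-vm-disk,cache=writeback,if=virtio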



If you have fs corruption, it seems that flushes from the guest are not reaching the final storage correctly.
I have never had problems with rbd_cache=true.

Maybe it's a bug with the SSD cache tier...



Christian Balzer
2016-02-09 02:27:06 UTC
Hello,

I'm quite concerned by this (and the silence from the devs); however,
there are a number of people doing similar things (at least with Hammer),
and you'd think they would have been bitten by this if it were a systemic bug.

More below.
Post by Udo Waechter
Hello,
I am experiencing totally weird filesystem corruptions with the
* Ceph infernalis on Debian8
Hammer here, might be a regression.
Post by Udo Waechter
* 10 OSDs (5 hosts) with spinning disks
* 4 OSDs (1 host, with SSDs)
So you're running your cache tier host with a replication of 1, I presume?
What kind of SSDs/FS/other relevant configuration options are you using?
Could there simply be some corruption on the SSDs that is of course then
presented to the RBD clients eventually?
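For example, something like this (with your actual cache pool name) would
show the replication settings:

ceph osd pool get <cache-pool> size
ceph osd dump | grep pool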
Post by Udo Waechter
The SSDs are new in my setup and I am trying to set up a cache tier.
With the spinning disks, Ceph has been running for about a year without
any major issues. Replacing disks and all that went fine.
Ceph is used by rbd+libvirt+kvm with
rbd_cache = true
rbd_cache_writethrough_until_flush = true
rbd_cache_size = 128M
rbd_cache_max_dirty = 96M
Also, in libvirt, I have
cachemode=writeback enabled.
So far so good.
Now, I've added the SSD cache tier to the picture, with cache-mode
"writeback".
The SSD machine also has the "deadline" I/O scheduler enabled.
Suddenly, VMs start to corrupt their filesystems (all ext4) with "Journal
failed" errors.
Trying to reboot the machines ends in "No bootable drive".
Using parted and testdisk on the image mapped via rbd reveals that the
partition table is gone.
Did turning the cache explicitly off (both Ceph and qemu) fix this?
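As a test, that would be roughly cache='none' in the libvirt disk
definition plus something like this on the client side:

[client]
rbd cache = false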
Post by Udo Waechter
testdisk finds the proper partitions, but e2fsck then "repairs" the
filesystem beyond usability.
This does not happen to all machines; it happens to those that actually
do some or most of the IO:
elasticsearch, MariaDB+Galera, postgres, backup, GIT
Or so I thought: yesterday one of my LDAP servers died, and that one is
not doing much IO.
Could it be that rbd caching + qemu writeback cache + ceph cache tier
writeback are not playing well together?
I've read through some older mails on the list, where people had similar
problems and suspected something like that.
Any particular references (URLs, Message-IDs)?

Regards,

Christian
Post by Udo Waechter
What are the proper/right settings for rbd/qemu/libvirt?
libvirt: cachemode=none (writeback?)
rbd: cache_mode = none
SSD-tier: cachemode: writeback
?
Thanks for any help,
udo.
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
Jason Dillaman
2016-02-09 14:46:48 UTC
What release of Infernalis are you running? When you encounter this error, is the partition table zeroed out or does it appear to be random corruption?
--
Jason Dillaman

Udo Waechter
2016-02-10 17:04:41 UTC
Hi,
Post by Jason Dillaman
What release of Infernalis are you running? When you encounter this error, is the partition table zeroed out or does it appear to be random corruption?
It's
ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)

and dpkg -l ceph:
ceph 9.2.0-1~bpo80+1

from eu.ceph.com

The partition table is zeroed out. I have also found that all files
which are actively being written (DB files in the LDAP cluster, postgres
transaction logs, ...) are corrupted.

Even after restoring the partition table, running e2fsck corrupts the
filesystem beyond repair.
Some images are even empty afterwards :(

Thanks,
udo.
Jason Dillaman
2016-02-10 17:07:25 UTC
Can you provide the 'rbd info' dump from one of these corrupt images?
--
Jason Dillaman


Udo Waechter
2016-02-11 09:04:43 UTC
Post by Jason Dillaman
Can you provide the 'rbd info' dump from one of these corrupt images?
sure,

rbd image 'ldap01.root.borked':
size 20000 MB in 5000 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.18394b3d1b58ba
format: 2
features: layering
flags:
parent: libvirt-pool/debian7-***@installed.mini
overlap: 20000 MB

Thanks,
udo.
Jason Dillaman
2016-02-11 14:13:13 UTC
Assuming the partition table is still zeroed on that image, can you run:

# rados -p <pool name> get rbd_data.18394b3d1b58ba.0000000000000000 - | cut -b 512 | hexdump

Can you also provide your pool setup:

# ceph report --format xml 2>/dev/null | xmlstarlet sel -t -c "//osdmap/pools"
--
Jason Dillaman

Udo Waechter
2016-02-17 08:15:01 UTC
Hello, sorry for the delay. I was pretty busy with other things.
Post by Jason Dillaman
# rados -p <pool name> get rbd_data.18394b3d1b58ba.0000000000000000 - | cut -b 512 | hexdump
Here's the hexdump:


0000000 0a0a 0a0a 0a00 0a00 0a0a 0a0a 0a0a 0a0a
0000010 0a0a 0a0a 0a0a 0a0a 0a0a 0a0a 0a0a 0a0a
0000020 0a0a 0a0a 0a0a 000a 0a0a 0a0a 0a0a 0a0a
0000030 0a0a 0a00 0a0a 0a0a 0a0a 0a0a 000a 0a0a
0000040 0a0a 0a0a 0a0a 0a0a 0a00
000004a
Post by Jason Dillaman
# ceph report --format xml 2>/dev/null | xmlstarlet sel -t -c "//osdmap/pools"
Attached you'll find the pool information.

Thanks very much for looking into this.

udo.
Udo Waechter
2016-02-21 09:34:30 UTC
Hi,
Post by Jason Dillaman
That's a pretty strange and seemingly non-random corruption of your first block. Is that object in the cache pool right now? If so, is the backing pool object just as corrupt as the cache pool's object?
How do I see all that? Sorry, I'm new to this kind of ceph-debugging. If
there is no quick answer, I will start digging into this topic.
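
If I understand the layout correctly, something along these lines should
at least tell me whether that object currently sits in the cache pool
(pool name is a placeholder):

rados -p <cache-pool> ls | grep rbd_data.18394b3d1b58ba
rados -p <cache-pool> stat rbd_data.18394b3d1b58ba.0000000000000000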
Post by Jason Dillaman
I see that your cache pool is currently configured in forward mode. Did you switch to that mode in an attempt to stop any further issues or was it configured in forward mode before any corruption?
No, I switched to forward mode in order to stop the corruption. It was
in writeback initially.
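(Switched roughly via: ceph osd tier cache-mode <cache-pool> forward, with my cache pool name.)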


Cheers,
udo.
