Discussion:
[ceph-users] CephFS file contains garbage zero padding after an unclean cluster shutdown
Hector Martin
2018-11-23 15:54:42 UTC
Background: I'm running single-node Ceph with CephFS as an experimental
replacement for "traditional" filesystems. In this case I have 11 OSDs,
1 mon, and 1 MDS.

I just had an unclean shutdown (kernel panic) while a large (>1TB) file
was being copied to CephFS (via rsync). Upon bringing the system back
up, I noticed that the (incomplete) file has about 320MB worth of zeroes
at the end.
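
One way to measure a zero tail like this is to scan the file backwards
in chunks until the first non-zero byte; a minimal sketch in C
(hypothetical, not part of the original report):

    /* Sketch: count trailing zero bytes of a file by scanning backwards
     * in 64 KiB chunks until the first non-zero byte is found. */
    #define _FILE_OFFSET_BITS 64
    #include <stdio.h>
    #include <sys/types.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }
        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror("fopen"); return 1; }
        fseeko(f, 0, SEEK_END);
        off_t size = ftello(f), pos = size;
        long long zeros = 0;
        char buf[1 << 16];
        int found_nonzero = 0;
        while (pos > 0 && !found_nonzero) {
            size_t n = pos < (off_t)sizeof(buf) ? (size_t)pos : sizeof(buf);
            pos -= n;
            fseeko(f, pos, SEEK_SET);
            if (fread(buf, 1, n, f) != n) { perror("fread"); return 1; }
            for (size_t i = n; i > 0; i--) {   /* scan this chunk from its end */
                if (buf[i - 1] != 0) { found_nonzero = 1; break; }
                zeros++;
            }
        }
        printf("%lld trailing zero bytes (file size %lld)\n",
               zeros, (long long)size);
        fclose(f);
        return 0;
    }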

This is the kind of behavior I would expect of traditional local
filesystems, where file metadata is updated to reflect the new size of
a growing file before the disk extents are allocated and filled with
data, so an unclean shutdown results in files with tails of zeroes.
But I'm surprised to see it with Ceph: I expected the OSD side of
things to be atomic, with all the BlueStore goodness, checksums, etc.,
and I figured CephFS would build on those primitives in a way that
makes this kind of inconsistency impossible.

Is this expected behavior? It's not a huge dealbreaker, but I'd like to
understand how this kind of situation happens in CephFS (and how it
could affect a proper cluster, if at all - can this happen if e.g. a
client, or an MDS, or an OSD dies uncleanly? Or only if several things
go down at once?)
--
Hector Martin (***@marcansoft.com)
Public Key: https://mrcn.st/pub
Paul Emmerich
2018-11-25 15:16:29 UTC
Maybe rsync called fallocate() on the file?
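
For reference, if rsync was run with --preallocate it would reserve the
destination file's full size with posix_fallocate() before writing any
data, and the unwritten range would read back as zeroes. A minimal
sketch of that effect (hypothetical file name, 1 GiB chosen arbitrarily):

    /* Sketch: preallocate space the way rsync --preallocate does.  The
     * file size grows immediately, but the reserved range reads back as
     * zeroes until real data is written over it. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("destfile", O_CREAT | O_RDWR, 0644);   /* hypothetical name */
        if (fd < 0) { perror("open"); return 1; }

        int err = posix_fallocate(fd, 0, 1LL << 30);         /* reserve 1 GiB */
        if (err != 0) {
            fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
            return 1;
        }

        /* Only the first 4 KiB is actually written; an interrupted copy
         * leaves the rest of the gigabyte reading as zeroes. */
        char buf[4096];
        memset(buf, 'x', sizeof(buf));
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) perror("write");

        close(fd);
        return 0;
    }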

Paul
--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

Paul Emmerich
2018-11-25 15:19:34 UTC
No, wait. Which system had the kernel panic? Your CephFS client
running rsync? In that case this would be expected behavior, because
rsync doesn't sync on every block and you lost your filesystem cache.
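
As a rough illustration of that failure mode, here is a minimal
rsync-like copy loop (hypothetical file names): every write() lands in
the client's page cache and the file size grows right away, but nothing
forces the data out before the crash.

    /* Sketch: buffered copy loop with no fsync.  On a client kernel
     * panic, dirty pages that were never flushed are lost, even though
     * the destination file's size has already been extended. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int in  = open("srcfile", O_RDONLY);                 /* hypothetical names */
        int out = open("dstfile", O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (in < 0 || out < 0) { perror("open"); return 1; }

        char buf[128 * 1024];
        ssize_t n;
        while ((n = read(in, buf, sizeof(buf))) > 0) {
            if (write(out, buf, n) != n) { perror("write"); return 1; }
            /* No fsync here: the new size is visible immediately, the
             * data only once the kernel gets around to flushing it. */
        }
        if (n < 0) perror("read");

        close(in);
        close(out);
        return 0;
    }
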
--
Paul Emmerich

Hector Martin
2018-11-25 20:30:43 UTC
Post by Paul Emmerich
No, wait. Which system had the kernel panic? Your CephFS client
running rsync? In that case this would be expected behavior, because
rsync doesn't sync on every block and you lost your filesystem cache.
It was all on the same system. So is it expected behavior for size
metadata to be updated non-atomically with respect to file contents
being written when using the CephFS kernel client? I.e. after appending
data to the file, the metadata in CephFS is updated to reflect the new
size but the data remains in the page cache until those pages are flushed?
--
Hector Martin (***@marcansoft.com)
Public Key: https://mrcn.st/pub
Yan, Zheng
2018-11-26 02:05:26 UTC
Post by Hector Martin
Post by Paul Emmerich
No, wait. Which system had the kernel panic? Your CephFS client
running rsync? In that case this would be expected behavior, because
rsync doesn't sync on every block and you lost your filesystem cache.
It was all on the same system. So is it expected behavior for size
metadata to be updated non-atomically with respect to file contents
being written when using the CephFS kernel client? I.e. after appending
data to the file, the metadata in CephFS is updated to reflect the new
size but the data remains in the page cache until those pages are flushed?
Yes, it's expected behavior. We haven't implemented ordered writes
(the equivalent of ext4's data=ordered mount option).
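
In the absence of ordered-write semantics, an application that needs
the visible size and the stored data to stay consistent across a crash
has to flush explicitly; a minimal sketch (hypothetical helper, not
something rsync does by default):

    /* Sketch: append then fsync.  fsync() does not return until both
     * the dirty data and the updated metadata (including the new size)
     * are durable, which closes the window that produces zero tails. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int append_durably(int fd, const void *data, size_t len)
    {
        ssize_t n = write(fd, data, len);
        if (n < 0 || (size_t)n != len) { perror("write"); return -1; }
        if (fsync(fd) != 0) { perror("fsync"); return -1; }
        return 0;
    }

    int main(void)
    {
        int fd = open("dstfile", O_CREAT | O_WRONLY | O_APPEND, 0644);  /* hypothetical */
        if (fd < 0) { perror("open"); return 1; }
        const char *rec = "some data\n";
        if (append_durably(fd, rec, strlen(rec)) != 0) return 1;
        close(fd);
        return 0;
    }
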
Hector Martin
2018-11-26 03:58:01 UTC
Post by Yan, Zheng
Post by Hector Martin
Post by Paul Emmerich
No, wait. Which system had the kernel panic? Your CephFS client
running rsync? In that case this would be expected behavior, because
rsync doesn't sync on every block and you lost your filesystem cache.
It was all on the same system. So is it expected behavior for size
metadata to be updated non-atomically with respect to file contents
being written when using the CephFS kernel client? I.e. after appending
data to the file, the metadata in CephFS is updated to reflect the new
size but the data remains in the page cache until those pages are flushed?
Yes, it's expected behavior. We haven't implemented ordered writes
(the equivalent of ext4's data=ordered mount option).
Makes sense, thanks for confirming. Good to know it's a client issue,
then (I'd be more worried if this had been caused by the Ceph
server-side stack).
--
Hector Martin (***@marcansoft.com)
Public Key: https://mrcn.st/pub