Discussion:
[ceph-users] CephFS "move" operation
Oliver Freyermuth
2018-05-25 12:10:42 UTC
Permalink
Dear Cephalopodians,

I was wondering why a simple "mv" is taking extraordinarily long on CephFS and must note that,
at least with the fuse-client (12.2.5) and when moving a file from one directory to another,
the file appears to be copied first (byte by byte, traffic going through the client?) before the initial file is deleted.

Is this true, or am I missing something?

For large files, this might be rather time consuming,
and we should certainly advise all our users to not move files around needlessly if this is the case.

Cheers,
Oliver
John Spray
2018-05-25 12:50:43 UTC
Permalink
On Fri, May 25, 2018 at 1:10 PM, Oliver Freyermuth
Post by Oliver Freyermuth
Dear Cephalopodians,
I was wondering why a simple "mv" is taking extraordinarily long on CephFS and must note that,
at least with the fuse-client (12.2.5) and when moving a file from one directory to another,
the file appears to be copied first (byte by byte, traffic going through the client?) before the initial file is deleted.
Is this true, or am I missing something?
A mv should not involve copying a file through the client -- it's
implemented in the MDS as a rename from one location to another.
What's the observation that's making it seem like the data is going
through the client?

John
_______________________________________________
ceph-users mailing list
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Ric Wheeler
2018-05-25 12:57:15 UTC
Permalink
Is this move between directories on the same file system?

Rename as a system call only works within a file system.

The user-space mv command falls back to a copy when source and destination are not on the same file system.

Regards,

Ric
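
A minimal Python sketch of the fallback described above: try a rename first, and copy plus unlink only if the kernel answers with EXDEV. The function name `move` and its return values are made up for illustration; this mirrors the general behaviour of mv, not its actual source.

```python
import errno
import os
import shutil


def move(src: str, dst: str) -> str:
    """Move src to dst and report which path was taken ("rename" or "copy")."""
    try:
        # Metadata-only operation when both paths are on the same file system.
        os.rename(src, dst)
        return "rename"
    except OSError as exc:
        # Anything other than EXDEV is a real error and is passed on.
        if exc.errno != errno.EXDEV:
            raise
    # rename(2) was refused with EXDEV: all data now flows through the client.
    shutil.copy2(src, dst)
    os.unlink(src)
    return "copy"
```

For a large file, the "copy" branch is the slow case: every byte is read and written again by the client.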
Oliver Freyermuth
2018-05-25 13:04:08 UTC
Permalink
Post by Ric Wheeler
Is this move between directories on the same file system?
It is, we only have a single CephFS in use. There's also only a single ceph-fuse client running.

What's different, though, are different ACLs set for source and target directory, and owner / group,
but I hope that should not matter.

All the best,
Oliver
Ric Wheeler
2018-05-25 13:06:20 UTC
Permalink
We should look at what mv uses to see if it thinks the directories are on
different file systems.

If the fstat or whatever it looks at is confused, that might explain it.

Ric
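
One way to check from user space whether two paths report the same device is to compare `st_dev` from stat(2). The sketch below is only an illustration (the helper name `same_device` is hypothetical); it shows what stat reports, and a file system can still return EXDEV from rename(2) for reasons of its own even when the device numbers match.

```python
import os


def same_device(src: str, dst: str) -> bool:
    """Compare the device of src with the device of dst's parent directory.

    dst is the target file path; it may not exist yet, so its parent is used.
    """
    dst_dir = os.path.dirname(os.path.abspath(dst))
    return os.stat(src).st_dev == os.stat(dst_dir).st_dev
```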


Oliver Freyermuth
2018-05-25 13:15:20 UTC
Permalink
Mhhhm... that's funny, I checked an mv with an strace now. I get:
---------------------------------------------------------------------------------
access("/cephfs/some_folder/file", W_OK) = 0
rename("foo", "/cephfs/some_folder/file") = -1 EXDEV (Invalid cross-device link)
unlink("/cephfs/some_folder/file") = 0
lgetxattr("foo", "security.selinux", "system_u:object_r:fusefs_t:s0", 255) = 30
---------------------------------------------------------------------------------
But I can assure you it's only a single filesystem, and a single ceph-fuse client running.

Same happens when using absolute paths.

Cheers,
Oliver
Ric Wheeler
2018-05-25 13:18:42 UTC
Permalink
That seems to be the issue - we need to understand why rename sees them as
different.

Ric


Sage Weil
2018-05-25 13:21:33 UTC
Permalink
Can you paste the output of 'stat foo' and 'stat /cephfs/some_folder'?
(Maybe also the same with 'stat -f'.)

Thanks!
sage
Oliver Freyermuth
2018-05-25 13:29:51 UTC
Permalink
Dear Sage,

here you go, some_folder in reality is "/cephfs/group":

------------------------------------------------
# stat foo
File: ‘foo’
Size: 1048576000 Blocks: 2048000 IO Block: 4194304 regular file
Device: 27h/39d Inode: 1099515065517 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Context: system_u:object_r:fusefs_t:s0
Access: 2018-05-25 15:27:59.433279424 +0200
Modify: 2018-05-25 15:28:01.379754052 +0200
Change: 2018-05-25 15:28:01.379754052 +0200
Birth: -
------------------------------------------------
# stat -f foo
File: "foo"
ID: 0 Namelen: 255 Type: fuseblk
Block size: 4194304 Fundamental block size: 4194304
Blocks: Total: 104471885 Free: 79096968 Available: 79096968
Inodes: Total: 26258533 Free: -1
------------------------------------------------
------------------------------------------------
# stat -f /cephfs/group/
File: "/cephfs/group/"
ID: 0 Namelen: 255 Type: fuseblk
Block size: 4194304 Fundamental block size: 4194304
Blocks: Total: 104471835 Free: 79098264 Available: 79098264
Inodes: Total: 26257190 Free: -1
------------------------------------------------
# stat /cephfs/group/
File: ‘/cephfs/group/’
Size: 73167320986856 Blocks: 1 IO Block: 4096 directory
Device: 27h/39d Inode: 1099511627888 Links: 1
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Context: system_u:object_r:fusefs_t:s0
Access: 2018-03-09 18:22:47.061501906 +0100
Modify: 2018-05-25 15:18:02.164391701 +0200
Change: 2018-05-25 15:18:02.164391701 +0200
Birth: -
------------------------------------------------

Cheers,
Oliver
--
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax: +49 228 73 7869
--
Oliver Freyermuth
2018-05-25 13:26:32 UTC
Permalink
Dear Ric,

I played around a bit. The common denominator seems to be this: when moving a file within a directory subtree below a directory for which max_bytes / max_files quota settings are set,
things work fine.
When moving it to another directory tree without quota settings or with different quota settings, rename() returns EXDEV.

Cheers,
Oliver
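
The observation above can be turned into a rough check. This Python sketch walks up from a path to the nearest ancestor that has a quota set, reading the CephFS virtual xattrs `ceph.quota.max_bytes` and `ceph.quota.max_files`, and then compares the result for source and destination. The helper names are hypothetical, and the rule "different quota root means EXDEV" is only an assumption drawn from the behaviour reported in this thread for ceph-fuse 12.2.5.

```python
import os
from typing import Callable, Optional, Tuple

Quota = Optional[Tuple[int, int]]


def read_quota(path: str) -> Quota:
    """Return (max_bytes, max_files) if a non-zero quota is set on path, else None."""
    values = []
    for name in ("ceph.quota.max_bytes", "ceph.quota.max_files"):
        try:
            raw = os.getxattr(path, name)  # Linux only
            values.append(int(raw.decode().strip("\x00 ")))
        except (OSError, ValueError):
            # Attribute missing or unreadable: treat as "no quota".
            values.append(0)
    return (values[0], values[1]) if any(values) else None


def quota_root(path: str, get_quota: Callable[[str], Quota] = read_quota) -> str:
    """Return the nearest ancestor of path (or path itself) that has a quota, else "/"."""
    current = os.path.abspath(path)
    while True:
        if get_quota(current) is not None:
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the root without finding a quota
            return current
        current = parent


def rename_may_hit_exdev(src: str, dst_dir: str,
                         get_quota: Callable[[str], Quota] = read_quota) -> bool:
    """Assumed rule from this thread: differing quota roots lead to EXDEV."""
    return quota_root(src, get_quota) != quota_root(dst_dir, get_quota)
```

The `get_quota` parameter exists so the logic can be exercised without a CephFS mount, for example with a plain dictionary lookup.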
Sage Weil
2018-05-25 13:39:21 UTC
Permalink
Post by Oliver Freyermuth
Dear Ric,
I played around a bit - the common denominator seems to be: Moving it
within a directory subtree below a directory for which max_bytes /
max_files quota settings are set, things work fine. Moving it to another
directory tree without quota settings / with different quota settings,
rename() returns EXDEV.
Aha, yes, this is the issue.

When you set a quota you force subvolume-like behavior. This is done
because hard links across this quota boundary won't correctly account for
utilization (only one of the file links will accrue usage). The
expectation is that quotas are usually set in locations that aren't
frequently renamed across.

It might be possible to allow rename(2) to proceed in cases where
nlink==1, but the behavior will probably seem inconsistent (some files get
EXDEV, some don't).

sage
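
To get a feeling for how inconsistent an nlink==1 relaxation would look on a given tree, one could list which files have more than one hard link. The Python sketch below does that with `os.lstat`; the function name is hypothetical, and it says nothing about how the MDS or client would actually implement such a change.

```python
import os
from typing import List, Tuple


def split_by_link_count(root: str) -> Tuple[List[str], List[str]]:
    """Return (files with nlink == 1, files with nlink > 1) below root, sorted."""
    single, multi = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # lstat so that symlinks are not followed.
            if os.lstat(path).st_nlink == 1:
                single.append(path)
            else:
                multi.append(path)
    return sorted(single), sorted(multi)
```

Under the idea floated above, files in the first list could be renamed across a quota boundary, while files in the second list would still see EXDEV.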
Oliver Freyermuth
2018-05-25 13:46:33 UTC
Permalink
Post by Sage Weil
Post by Oliver Freyermuth
Dear Ric,
I played around a bit - the common denominator seems to be: Moving it
within a directory subtree below a directory for which max_bytes /
max_files quota settings are set, things work fine. Moving it to another
directory tree without quota settings / with different quota settings,
rename() returns EXDEV.
Aha, yes, this is the issue.
When you set a quota you force subvolume-like behavior. This is done
because hard links across this quota boundary won't correctly account for
utilization (only one of the file links will accrue usage). The
expectation is that quotas are usually set in locations that aren't
frequently renamed across.
Understood, that explains it. That's indeed also true for our application in most cases -
but sometimes, we have the case that users want to migrate their data to group storage, or vice-versa.
Post by Sage Weil
It might be possible to allow rename(2) to proceed in cases where
nlink==1, but the behavior will probably seem inconsistent (some files get
EXDEV, some don't).
I believe even this would be extremely helpful, performance-wise. At least in our case, hard links are seldom used;
it's more about data movement between user, group and scratch areas.
For files with nlink>1, it's more or less expected that a copy has to be performed when crossing quota boundaries (I think).

Cheers,
Oliver
Patrick Donnelly
2018-05-25 14:52:05 UTC
Permalink
On Fri, May 25, 2018 at 6:46 AM, Oliver Freyermuth
Post by Oliver Freyermuth
Post by Sage Weil
It might be possible to allow rename(2) to proceed in cases where
nlink==1, but the behavior will probably seem inconsistent (some files get
EXDEV, some don't).
I believe even this would be extremely helpful, performance-wise. At least in our case, hardlinks are seldom used;
it's more about data movement between user, group and scratch areas.
For files with nlink>1, it's more or less expected that a copy has to be performed when crossing quota boundaries (I think).
It may be possible to allow the rename in the MDS and check quotas
there. I've filed a tracker ticket here:
http://tracker.ceph.com/issues/24305
--
Patrick Donnelly
Luis Henriques
2018-05-25 13:26:47 UTC
Permalink
Post by Oliver Freyermuth
---------------------------------------------------------------------------------
access("/cephfs/some_folder/file", W_OK) = 0
rename("foo", "/cephfs/some_folder/file") = -1 EXDEV (Invalid cross-device link)
I believe this could happen if you have quotas set on any of the paths,
or different snapshot realms.
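
For illustration, the fallback that the strace shows can be sketched in Python as below. This is only an approximation of what mv does, not its actual implementation; move_file and the injectable rename parameter are names invented here so the EXDEV path can be exercised without a CephFS mount.

```python
import errno
import os
import shutil


def move_file(src, dst, rename=os.rename):
    """Move src to dst and report how it was done.

    Tries a rename first (a metadata-only operation). If that fails
    with EXDEV, copies the data and then removes the source, which is
    the slow path observed in this thread.
    Returns "renamed" or "copied".
    """
    try:
        rename(src, dst)
        return "renamed"
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise  # any other error is a real failure
    shutil.copy2(src, dst)  # all file data passes through the client
    os.unlink(src)          # source is removed only after the copy
    return "copied"
```

The "copied" branch is where the time goes for large files: every byte is read and written back by the client, whereas the "renamed" branch never touches the file data.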

Cheers,
--
Luis
Post by Oliver Freyermuth
unlink("/cephfs/some_folder/file") = 0
lgetxattr("foo", "security.selinux", "system_u:object_r:fusefs_t:s0", 255) = 30
---------------------------------------------------------------------------------
But I can assure you it's only a single filesystem, and a single ceph-fuse client running.
Same happens when using absolute paths.
Cheers,
Oliver
We should look at what mv uses to see if it thinks the directories are on different file systems.
If the fstat or whatever it looks at is confused, that might explain it.
Ric
Post by Ric Wheeler
Is this move between directories on the same file system?
It is, we only have a single CephFS in use. There's also only a single ceph-fuse client running.
What's different, though, are different ACLs set for source and target directory, and owner / group,
but I hope that should not matter.
All the best,
Oliver
Post by Ric Wheeler
Rename as a system call only works within a file system.
The user space mv command becomes a copy when not the same file system. 
Regards,
Ric
Oliver Freyermuth
2018-05-25 13:31:28 UTC
Permalink
Post by Luis Henriques
Post by Oliver Freyermuth
---------------------------------------------------------------------------------
access("/cephfs/some_folder/file", W_OK) = 0
rename("foo", "/cephfs/some_folder/file") = -1 EXDEV (Invalid cross-device link)
I believe this could happen if you have quotas set on any of the paths,
or different snapshot realms.
Wow - yes, this matches my observations!
So in this case, e.g. moving files from a "user" directory with quota to a "group" directory with different quota,
is it currently expected that files cannot be renamed across those boundaries?
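
A rough way to predict which moves will hit this, sketched in Python: walk up from each directory to the nearest ancestor that has a CephFS quota set (the ceph.quota.max_bytes or ceph.quota.max_files virtual xattr) and compare the two results. This assumes the boundary is exactly the nearest quota-carrying ancestor, and that an unset quota shows up as either a missing attribute or the value 0; both are assumptions on my side. The function names are invented for this sketch, and snapshot realms are not considered.

```python
import os

# CephFS quota attributes, as set with setfattr on a directory.
QUOTA_ATTRS = ("ceph.quota.max_bytes", "ceph.quota.max_files")


def has_quota(directory, getxattr=None):
    """True if the directory itself carries a non-zero quota xattr."""
    getxattr = getxattr or os.getxattr  # os.getxattr is Linux-only
    for name in QUOTA_ATTRS:
        try:
            value = getxattr(directory, name)
        except OSError:
            continue  # attribute missing or not supported here
        if value.strip() not in (b"", b"0"):
            return True
    return False


def quota_root(path, getxattr=None):
    """Nearest ancestor (including path) with a quota, or None."""
    current = os.path.abspath(path)
    while True:
        if has_quota(current, getxattr):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the root directory
            return None
        current = parent


def crosses_quota_boundary(src_dir, dst_dir, getxattr=None):
    """True if the two directories have different quota roots."""
    return quota_root(src_dir, getxattr) != quota_root(dst_dir, getxattr)
```

If crosses_quota_boundary returns True for a user directory and a group directory, a mv between them would be expected to fall back to copy plus delete under the behaviour described above.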

Cheers,
Oliver
Post by Luis Henriques
Cheers,
Oliver Freyermuth
2018-05-25 13:02:03 UTC
Permalink
Post by John Spray
On Fri, May 25, 2018 at 1:10 PM, Oliver Freyermuth
Post by Oliver Freyermuth
Dear Cephalopodians,
I was wondering why a simple "mv" is taking extraordinarily long on CephFS and must note that,
at least with the fuse-client (12.2.5) and when moving a file from one directory to another,
the file appears to be copied first (byte by byte, traffic going through the client?) before the initial file is deleted.
Is this true, or am I missing something?
A mv should not involve copying a file through the client -- it's
implemented in the MDS as a rename from one location to another.
What's the observation that's making it seem like the data is going
through the client?
The fact that it's happening at only about 1 Gbit/s, and that all OSDs are reading and writing.
I will also check the network interface of the client next time it occurs. Also, ceph-fuse was taking 50 % CPU load just from this.

Also, I observe the file at the source being kept during the copy,
and the file at the target growing slowly. So it's definitely a copy, and the source file is deleted only at the end.
Post by John Spray
John
Post by Oliver Freyermuth
For large files, this might be rather time consuming,
and we should certainly advise all our users to not move files around needlessly if this is the case.
Cheers,
Oliver