Discussion:
[ceph-users] Btrfs defragmentation
Lionel Bouton
2015-05-03 22:43:39 UTC
Hi,

we began testing one Btrfs OSD volume last week and for this first test
we disabled autodefrag and began to launch manual btrfs fi defrag.

During the tests, I monitored the number of extents of the journal
(10GB) and it went through the roof (it currently sits at 8000+ extents
for example).
I was tempted to defragment it, but after thinking about it a bit I
believe it might not be a good idea.
With Btrfs, by default the data written to the journal on disk isn't
copied to its final destination. Ceph is using a clone_range feature to
reference the same data instead of copying it.
So if you defragment both the journal and the final destination, you are
moving the data around trying to get both references down to a single
extent, but most of the time you can't achieve both at the same time
(unless the destination is a whole file instead of a fragment of one).

I assume the journal probably doesn't benefit at all from
defragmentation: it's overwritten constantly and as Btrfs uses CoW, the
previous extents won't be reused at all and new ones will be created for
the new data instead of overwriting the old in place. The final
destination files are reused (reread) and benefit from defragmentation.

Under these assumptions we excluded the journal file from
defragmentation, in fact we only defragment the "current" directory
(snapshot directories are probably only read from in rare cases and are
ephemeral so optimizing them is not interesting).

The filesystem is only one week old so we will have to wait a bit to see
if this strategy is better than the one used when mounting with
autodefrag (I couldn't find much about it but last year we had
unmanageable latencies).
We have a small Ruby script which triggers defragmentation based on the
number of extents and by default limits the rate of calls to btrfs fi
defrag to a negligible level to avoid thrashing the filesystem. If
someone is interested I can attach it or push it to GitHub after a bit
of cleanup.
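
To give an idea before the cleanup, the core of it is roughly the
following (a simplified sketch, not the actual script; path, threshold
and pause are placeholders):

  #!/usr/bin/env ruby
  # Simplified sketch: defragment the files with the most extents first,
  # pausing between calls so the added load stays negligible.
  require "shellwords"

  OSD_CURRENT   = "/var/lib/ceph/osd/ceph-17/current"  # example path
  MIN_EXTENTS   = 64    # ignore files with fewer extents than this
  PAUSE_SECONDS = 30    # crude rate limit between btrfs fi defrag calls

  # Extent count per file, as reported by filefrag.
  counts = {}
  Dir.glob(File.join(OSD_CURRENT, "**", "*")).each do |path|
    next unless File.file?(path)
    out = `filefrag #{path.shellescape} 2>/dev/null`
    counts[path] = $1.to_i if out =~ /(\d+) extents? found/
  end

  # Worst offenders first, defragmented slowly.
  counts.select { |_, n| n >= MIN_EXTENTS }.sort_by { |_, n| -n }.each do |path, n|
    puts "#{n} extents: #{path}"
    system("btrfs", "fi", "defrag", path)
    sleep PAUSE_SECONDS
  end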

Best regards,

Lionel
Sage Weil
2015-05-03 23:34:18 UTC
Post by Lionel Bouton
Hi,
we began testing one Btrfs OSD volume last week and for this first test
we disabled autodefrag and began to launch manual btrfs fi defrag.
During the tests, I monitored the number of extents of the journal
(10GB) and it went through the roof (it currently sits at 8000+ extents
for example).
I was tempted to defragment it but after thinking a bit about it I think
it might not be a good idea.
With Btrfs, by default the data written to the journal on disk isn't
copied to its final destination. Ceph is using a clone_range feature to
reference the same data instead of copying it.
We've discussed this possibility but have never implemented it. The data
is written twice: once to the journal and once to the object file.
Post by Lionel Bouton
So if you defragment both the journal and the final destination, you are
moving the data around to attempt to get both references to satisfy a
one extent goal but most of the time can't get both of them at the same
time (unless the destination is a whole file instead of a fragment of one).
I assume the journal probably doesn't benefit at all from
defragmentation: it's overwritten constantly and as Btrfs uses CoW, the
previous extents won't be reused at all and new ones will be created for
the new data instead of overwriting the old in place. The final
destination files are reused (reread) and benefit from defragmentation.
Yeah, I agree. It is probably best to let btrfs write the journal
anywhere since it is never read (except for replay after a failure
or restart).

There is also a newish 'journal discard' option that is false by default;
enabling this may let us throw out the previously allocated space so that
the new writes get written to fresh locations (instead of to the
previously written and fragmented positions). I expect this will make a
positive difference, but I'm not sure that anyone has tested it.
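
For reference, enabling it should just be a ceph.conf change along these
lines (untested, so consider it a sketch):

  [osd]
      # false by default; issue discards so new journal writes can land
      # in freshly allocated space
      journal discard = true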
Post by Lionel Bouton
Under these assumptions we excluded the journal file from
defragmentation, in fact we only defragment the "current" directory
(snapshot directories are probably only read from in rare cases and are
ephemeral so optimizing them is not interesting).
The filesystem is only one week old so we will have to wait a bit to see
if this strategy is better than the one used when mounting with
autodefrag (I couldn't find much about it but last year we had
unmanageable latencies).
Cool.. let us know how things look after it ages!

sage
Lionel Bouton
2015-05-04 00:33:30 UTC
Post by Lionel Bouton
Hi, we began testing one Btrfs OSD volume last week and for this
first test we disabled autodefrag and began to launch manual btrfs fi
defrag. During the tests, I monitored the number of extents of the
journal (10GB) and it went through the roof (it currently sits at
8000+ extents for example). I was tempted to defragment it but after
thinking a bit about it I think it might not be a good idea. With
Btrfs, by default the data written to the journal on disk isn't
copied to its final destination. Ceph is using a clone_range feature
to reference the same data instead of copying it.
Post by Sage Weil
We've discussed this possibility but have never implemented it. The
data is written twice: once to the journal and once to the object file.
That's odd. Here's an extract of filefrag output:

Filesystem type is: 9123683e
File size of /var/lib/ceph/osd/ceph-17/journal is 10485760000 (2560000 blocks of 4096 bytes)
 ext:   logical_offset:        physical_offset:  length:   expected:  flags:
   0:        0..       0:  155073097.. 155073097:      1:
   1:        1..    1254:  155068587.. 155069840:   1254:  155073098:  shared
   2:     1255..    2296:  155071149.. 155072190:   1042:  155069841:  shared
   3:     2297..    2344:  148124256.. 148124303:     48:  155072191:  shared
   4:     2345..    4396:  148129654.. 148131705:   2052:  148124304:  shared
   5:     4397..    6446:  148137117.. 148139166:   2050:  148131706:  shared
   6:     6447..    6451:  150414237.. 150414241:      5:  148139167:  shared
   7:     6452..   10552:  150432040.. 150436140:   4101:  150414242:  shared
   8:    10553..   12603:  150477824.. 150479874:   2051:  150436141:  shared

Almost all extents of the journal are shared with another file (on one
occasion I found 3 consecutive extents without the shared flag). I
thought they could be shared with a copy in a snapshot, but the
snapshots are of the "current" subvolume.

Lionel
Lionel Bouton
2015-05-05 00:24:22 UTC
Post by Sage Weil
Post by Lionel Bouton
Hi,
we began testing one Btrfs OSD volume last week and for this first test
we disabled autodefrag and began to launch manual btrfs fi defrag.
[...]
Cool.. let us know how things look after it ages!
We had the first signs of Btrfs aging yesterday morning. Latencies
went up noticeably. The journal was at ~3000 extents, down from a maximum
of ~13000 the day before. To verify my assumption that journal
fragmentation was not the cause of the latencies, I defragmented it. It
took more than 7 minutes (10GB journal), left it at ~2300 extents
(probably because it was heavily used during the defragmentation) and
didn't solve the high latencies at all.

The initial algorithm selected files to defragment based solely on the
number of extents (files with more extents were processed first). This
was a simple approach that I hoped would be enough, but I had to make it
more clever.

filefrag -v conveniently outputs each fragment's position on the
device and the total file size. So I changed the algorithm: it still
uses the result of a periodic find | xargs filefrag call (which is
relatively cheap and ends up fitting in a <100MB Ruby process) but
models the fragmentation cost better.

The new one computes the total cost of reading every file: an initial
seek, the total time based on sequential read speed, and the time
associated with each seek from one extent to the next (which can be 0
when Btrfs managed to put an extent just after another, or very small if
it is not far from the first on the same HDD track). This total cost is
compared with the ideal defragmented case to estimate the speedup
defragmentation could bring. Finally the result is normalized by dividing
it by the total size of each file. The normalization is done because
in the case of RBD (and probably most other uses) what is interesting is
how long a 128kB or 1MB read would take whatever the file and the offset
in the file, not how long a whole-file read would take (there's an
assumption that each file has the same probability of being read, which
might need to be revisited). There are approximations in the cost
computation and it's HDD-centric, but it's not very far from reality.
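
In simplified Ruby the model looks roughly like this (the constants are
rough assumptions for a 7200rpm drive, not measured values, and the
per-file-size normalization is left out):

  # Rough sketch of the cost model (illustrative constants, not the real tool).
  # "extents" is what we parse out of "filefrag -v":
  # an array of [physical_start_block, length_in_blocks] pairs.

  BLOCK       = 4096
  SEEK_FULL   = 0.008                  # ~8 ms average seek (assumption)
  SEEK_NEAR   = 0.001                  # short seek when the next extent is close
  NEAR_BLOCKS = 2048                   # "close enough" threshold (assumption)
  READ_BPS    = 100.0 * 1024 * 1024    # ~100 MB/s sequential read (assumption)

  def read_cost(extents)
    bytes = extents.map { |_, len| len }.inject(:+) * BLOCK
    cost  = SEEK_FULL + bytes / READ_BPS          # initial seek + streaming read
    extents.each_cons(2) do |(start_a, len_a), (start_b, _)|
      gap = start_b - (start_a + len_a)
      next if gap == 0                            # contiguous: no extra seek
      cost += gap.abs < NEAR_BLOCKS ? SEEK_NEAR : SEEK_FULL
    end
    cost
  end

  # >1.0 means reading the file costs more than in an ideal one-extent
  # layout; the real tool also normalizes by file size before ranking.
  def fragmentation_cost(extents)
    bytes = extents.map { |_, len| len }.inject(:+) * BLOCK
    ideal = SEEK_FULL + bytes / READ_BPS
    read_cost(extents) / ideal
  end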

The idea was that it would find the files where fragmentation is the
most painful more quickly, instead of wasting time on less interesting
files. This would make the defragmentation more efficient even if it
didn't process as many files (the less defragmentation takes place, the
less load we add).

It worked for the past day. Before the algorithm change the Btrfs OSD
disk was by a large margin the slowest on the system compared to the
three XFS ones. This was confirmed both by iostat %util (often at 90-100%)
and by monitoring the disks' average read/write latencies over time,
which often spiked an order of magnitude above the other disks (as high
as 3 seconds). Now the Btrfs OSD disk is at least comparable to the other
disks, if not a bit faster (comparing latencies).

This is still too early to tell, but very encouraging.

Best regards,

Lionel
Timofey Titovets
2015-05-05 04:30:09 UTC
Hi list,
Excuse me, this is a bit off topic.

@Lionel, if you use btrfs, did you already try btrfs compression for the OSDs?
If so, can you share your experience?
Post by Lionel Bouton
[...]
--
Have a nice day,
Timofey.
Lionel Bouton
2015-05-05 07:58:16 UTC
Post by Timofey Titovets
Hi list,
Excuse me, this is a bit off topic.
@Lionel, if you use btrfs, did you already try btrfs compression for the OSDs?
If so, can you share your experience?
Btrfs compresses with zlib by default; we force lzo compression instead
by using compress=lzo in fstab. The behaviour obviously depends on the
kind of data stored, but in our case, when we had more Btrfs OSDs to
compare with the XFS ones, we used between 10 and 15% less disk space on
average (on this Ceph instance most data in the RBD volumes is already in
a compressed format). It looks similar again now, but with only one OSD
out of 24 I can't confirm this right now.
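
For reference the fstab entry looks like this (device and mount point
changed for the example):

  /dev/sdb1  /var/lib/ceph/osd/ceph-17  btrfs  rw,noatime,compress=lzo  0 0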

Best regards,

Lionel.
Lionel Bouton
2015-05-06 17:51:09 UTC
Post by Lionel Bouton
Post by Sage Weil
Post by Lionel Bouton
Hi,
we began testing one Btrfs OSD volume last week and for this first test
we disabled autodefrag and began to launch manual btrfs fi defrag.
[...]
Cool.. let us know how things look after it ages!
[...]
It worked for the past day. Before the algorithm change the Btrfs OSD
disk was the slowest on the system compared to the three XFS ones by a
large margin. This was confirmed both by iostat %util (often at 90-100%)
and monitoring the disk average read/write latencies over time which
often spiked one order of magnitude above the other disks (as high as 3
seconds). Now the Btrfs OSD disk is at least comparable to the other
disks if not a bit faster (comparing latencies).
This is still too early to tell, but very encouraging.
Still going well, I added two new OSDs which are behaving correctly too.

The first of the two has finished catching up. There's a big difference
in the number of extents on XFS and on Btrfs. I've seen files backing
rbd (4MB files with rbd in their names) often have only 1 or 2 extents
on XFS.
On Btrfs they seem to start at 32 extents when they are created and
Btrfs doesn't seem to mind (ie: calling btrfs fi defrag <file> doesn't
reduce the number of extents, at least not in the following 30s where it
should go down). The extents aren't far from each other on disk though,
at least initially.

When my simple algorithm computes the fragmentation cost (the expected
overhead of reading a file vs its optimized version), it seems that just
after finishing catching up (between 3 hours and 1 day depending on the
cluster load and settings), the content is already heavily fragmented
(files are expected to take more than 6x the read time that their
optimized versions would). Then my defragmentation scheduler manages to
bring down the maximum fragmentation cost (according to its own
definition) by a factor of 0.66 (the very first OSD volume is currently
sitting at a ~4x cost and occasionally reaches the 3.25-3.5 range).

Is there something that would explain why Btrfs initially creates the
4MB files with 128k extents (32 extents per file)? Is it a bad thing for
performance?

During normal operation Btrfs OSD volumes continue to behave the same
way the XFS ones do on the same system (sometimes faster, sometimes
slower). What is really slow though is the OSD process startup. I've yet
to run serious tests (unmounting the filesystems to clear caches), but
I've already seen 3 minutes of delay reading the pgs. Example:

2015-05-05 16:01:24.854504 7f57c518b780 0 osd.17 22428 load_pgs
2015-05-05 16:01:24.936111 7f57ae7fc700 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-17) destroy_checkpoint: ioctl SNAP_DESTROY got (2) No such file or directory
2015-05-05 16:01:24.936137 7f57ae7fc700 -1 filestore(/var/lib/ceph/osd/ceph-17) unable to destroy snap 'snap_1671188' got (2) No such file or directory
2015-05-05 16:01:24.991629 7f57ae7fc700 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-17) destroy_checkpoint: ioctl SNAP_DESTROY got (2) No such file or directory
2015-05-05 16:01:24.991654 7f57ae7fc700 -1 filestore(/var/lib/ceph/osd/ceph-17) unable to destroy snap 'snap_1671189' got (2) No such file or directory
2015-05-05 16:04:25.413110 7f57c518b780 0 osd.17 22428 load_pgs opened 160 pgs

The filesystem might not have reached its equilibrium between
fragmentation and defragmentation rate at this time (so this may change),
but it mirrors our initial experience with Btrfs, where this was the
first symptom of bad performance.

Best regards,

Lionel
Mark Nelson
2015-05-06 18:04:30 UTC
Post by Lionel Bouton
[...]
Out of curiosity, do you see excessive memory usage during
defragmentation? Last time I spoke to josef it sounded like it wasn't
particularly safe yet and could make the machine go OOM, especially if
there are lots of snapshots.

I've also included some test results from emperor (ie quite old now)
showcasing how sequential read performance degrades on btrfs after
random writes are performed (on the 2nd tab you can see how even writes
are affected as well). Basically the first iteration of tests looks
great up until random writes are done, which causes excessive
fragmentation due to COW; subsequent tests are then quite bad compared
to the initial BTRFS tests (and XFS).

Your testing is thus quite interesting, especially if it means we can
reduce this effect. Keep it up!

Mark
Lionel Bouton
2015-05-06 18:21:34 UTC
Hi,
Post by Mark Nelson
[...]
Out of curiosity, do you see excessive memory usage during
defragmentation? Last time I spoke to josef it sounded like it wasn't
particularly safe yet and could make the machine go OOM, especially if
there are lots of snapshots.
We have large amounts of memory (80GB) so we might have missed this.
There was no problem with autodefrag though (we would have noticed,
because we had more limited amounts of free memory once the OSDs and
VMs were accounted for).
Snapshots shouldn't be a problem for OSDs as Ceph seems to maintain only
2 at any given time.
Post by Mark Nelson
I've also included some test results from emperor (ie quite old now)
showcasing how sequential read performance degrades on btrfs after
random writes are performed (on the 2nd tab you can see how even
writes are affected as well). Basically the first iteration of tests
look great up until random writes are done which causes excessive
fragmentation due to COW, then subsequent tests are quite bad compared
to initial BTRFS tests (and XFS).
Your testing is thus quite interesting, especially if it means we can
reduce this effect. Keep it up!
I'll have trouble getting good results from inside the VMs as there is a
constant load on this platform.

Currently I'm limited to average results on the total time spent accessing the disks.

Best regards,

Lionel
Timofey Titovets
2015-05-06 18:07:44 UTC
Post by Lionel Bouton
[...]
Is there something that would explain why initially Btrfs creates the
4MB files with 128k extents (32 extents / file) ? Is it a bad thing for
performance ?
This kind of behaviour is the reason why I asked you about compression:
"You can use filefrag to locate heavily fragmented files (may not work
correctly with compression)."
https://btrfs.wiki.kernel.org/index.php/Gotchas

filefrag shows each compressed chunk as a separate extent, but they can
be laid out linearly. This is a problem in filefrag =\
--
Have a nice day,
Timofey.
Lionel Bouton
2015-05-06 18:28:07 UTC
Hi,
Post by Timofey Titovets
Post by Lionel Bouton
Is there something that would explain why initially Btrfs creates the
4MB files with 128k extents (32 extents / file) ? Is it a bad thing for
performance ?
This kind of behaviour is the reason why I asked you about compression:
"You can use filefrag to locate heavily fragmented files (may not work
correctly with compression)."
https://btrfs.wiki.kernel.org/index.php/Gotchas
filefrag shows each compressed chunk as a separate extent, but they can
be laid out linearly. This is a problem in filefrag =\
Hum, I see. This could explain why we rarely see the number of extents
go down. When data is replaced with incompressible data Btrfs must
deactivate compression for it, and only then can the number of extents
be reduced.

This should not have much impact on the defragmentation process and
performance: we check for extents being written sequentially next to
each other and don't count this as a cost for file access. This is why
these files aren't defragmented even if we ask for it and our tool
reports a low overhead for them.

Best regards,

Lionel
Lionel Bouton
2015-05-12 10:17:09 UTC
Post by Lionel Bouton
[...]
Here's more information, especially about compression.

1/ filefrag behaviour.

I use our tool to trace the fragmentation evolution after launching
btrfs fi defrag on each file (it calls filefrag -v asynchronously every
5 seconds until the defragmentation seems done).
filefrag output doesn't understand compression and doesn't seem to have
access to the latest on-disk layout.

- for compression, the reported layout quite often has an extent that
begins in the middle of the previous one. So I assume the physical
offset of the extent start is correct but the end is computed from the
extent's decompressed length (it's always 32 4096-byte blocks, which
matches the compression block size). We had to compensate for that
because we erroneously considered this case as needing a seek although
it doesn't (see the small sketch after this list). This means you can't
trust the number of extents reported by filefrag -v (it is supposed to
merge consecutive extents when run with -v).
- for access to the layout, I assume Btrfs reports what is committed to
disk. I base this assumption on the fact that for all defragmented
files, filefrag -v output becomes stable in at most 30 seconds after the
"btrfs fi defrag" command returns (30 seconds is the default commit
interval for Btrfs).
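
The compensation boils down to something like this (simplified sketch,
extents given as [physical_start_block, reported_length_in_blocks]):

  # Count only the boundaries that need a real head movement: if the next
  # extent starts inside or right after the previous one's reported range
  # (which happens with compressed extents), it costs nothing.
  def real_seeks(extents)
    extents.each_cons(2).count do |(start_a, len_a), (start_b, _)|
      start_b > start_a + len_a
    end
  end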

There's something odd going on with the 'shared' flag reported by
filefrag too: I assumed this was linked to clone_range or snapshots and
most of the time it seems so but on other (non-OSD) filesystems I found
files with this flag on extents and I couldn't find any explanation for it.

2/ compression influence on fragmentation

Even after compensating for filefrag -v errors, Btrfs clearly has more
difficulties defragmenting compressed files. At least our model for
computing the cost associated with a particular layout reports fewer
gains when defragmenting a compressed file. In our configuration and
according to our model of disk latencies we seem to hit a limit where
file reads cost ~2.75x what they would if the files were in an ideal,
sequential layout. If we try to go lower, the majority of the files don't
benefit at all from defragmentation (the resulting layout isn't better
than the initial one).
Note that this doesn't account for NCQ/TCQ: we assume the read is
isolated. So in practice reading from multiple threads should be less
costly and the OSD might not suffer much from this.
In fact 2 out of the 3 BTRFS OSDs have lower latencies than most of the
rest of the cluster even with our tool slowly checking files and
triggering defragmentations in the background.

3/ History/heavy backfilling seems to have a large influence on performance

As I said 2 out of our 3 BTRFS OSDs have a very good behavior.
Unfortunately the third doesn't. This is the OSD where our tool was
deactivated during most of the initial backfilling process. It doesn't
have the most data, the most writes or the most reads of the group but
it had by far the worst latencies these last two days. I even checked
the disk for hardware problems and couldn't find any.
I don't have a clear explanation for the performance difference. Maybe
the 2.75x overhead target isn't low enough and this OSD has more
fragmented files than the others below this target (we don't compute
the average fragmentation yet). This would mean that we can expect the
performance of the 2 others to slowly degrade over time (so the test
isn't conclusive yet).

I've decided to remount this particular OSD without compression and let
our tool slowly bring down the maximum overhead to 1.5x (which should be
doable as without compression files are more easily defragmented) while
using primary-affinity = 0. I'll revert to primary-affinity 1 when the
defragmentation is done and see how the OSD/disk behave.

4/ autodefrag doesn't like Ceph OSDs

According to our previous experience, by now all our Btrfs OSDs should be
on their knees begging us to shoot them: there's clearly something
to gain by tuning the defragmentation process. I suspect that autodefrag
either takes too much time trying to defragment the journal and/or is
overwhelmed by the amount of fragmentation going on and skips
defragmentations randomly instead of focusing on the most fragmented files.

Best regards,

Lionel

Lionel Bouton
2015-05-07 10:04:26 UTC
Post by Lionel Bouton
During normal operation Btrfs OSD volumes continue to behave in the same
way XFS ones do on the same system (sometimes faster/sometimes slower).
What is really slow though is the OSD process startup.
[...]
We've seen progress on this front. Unfortunately for us we had 2 power
outages and they seem to have damaged the disk controller of the system
we are testing Btrfs on: we just had a system crash.
On the positive side this gives us an update on the OSD boot time.

With a freshly booted system and nothing in cache:
- the first Btrfs OSD we installed loaded its pgs in ~1min30s, which is
half of the previous time,
- the second Btrfs OSD, where defragmentation was disabled for some time
and which was considered more fragmented by our tool, took nearly 10
minutes to load its pgs (and even spent 1 minute before starting to load
them),
- the third Btrfs OSD, which was always defragmented, took 4min30s to
load its pgs (it was considered more fragmented than the first and less
than the second).

My current assumption is that the defragmentation process we use can't
keep up with large spikes of writes (at least when originally populating
the OSD with data through backfills) but can then at least partially
repair the damage they cause to performance (it's still slower to boot
than the 3 XFS OSDs on the same system, where loading pgs took 6-9
seconds). In the current setup the defragmentation progresses very
slowly because I set it up to generate very little load on the
filesystems it processes: there may be room for improvement.

Best regards,

Lionel
Burkhard Linke
2015-05-07 10:30:36 UTC
Hi,
*snipsnap*
Post by Lionel Bouton
[...]
Part of the OSD boot up process is also the handling of existing
snapshots and journal replay. I've also had several btrfs based OSDs
that took up to 20-30 minutes to start, especially after a crash. During
journal replay the OSD daemon creates a number of new snapshots for its
operations (newly created snap_XYZ directories that vanish after a short
time). This snapshotting probably also adds overhead to the OSD startup
time.
I have disabled snapshots in my setup now, since the stock ubuntu trusty
kernel had some stability problems with btrfs.

I also had to establish cron jobs for rebalancing the btrfs partitions.
It compacts the extents and may reduce the total amount of space taken.
Unfortunately this procedure is not a default in most distributions (it
definitely should be!). The problems associated with unbalanced extents
should have been solved in kernel 3.18, but I haven't had the time to
check that yet.
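
Something along these lines, run from cron (mount point and usage
filters are just examples, adjust to taste):

  #!/bin/sh
  # weekly rebalance of the OSD filesystem; only touches chunks that are
  # less than 50% full so it stays reasonably cheap
  btrfs balance start -dusage=50 -musage=50 /var/lib/ceph/osd/ceph-17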

As a side note: I had several OSDs with dangling snapshots (more than the
two usually handled by the OSD). They are probably due to crashed OSD
daemons. You have to remove them manually, otherwise they start to
consume disk space.
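
Something like this shows them and removes one by hand (OSD path and
snapshot name are examples; obviously leave the snapshots the OSD is
still using alone):

  btrfs subvolume list -o /var/lib/ceph/osd/ceph-17
  btrfs subvolume delete /var/lib/ceph/osd/ceph-17/snap_1234567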

Best regards,
Burkhard
Lionel Bouton
2015-05-07 11:21:59 UTC
Hi,
Post by Burkhard Linke
[...]
Part of the OSD boot up process is also the handling of existing
snapshots and journal replay. I've also had several btrfs based OSDs
that took up to 20-30 minutes to start, especially after a crash.
During journal replay the OSD daemon creates a number of new snapshot
for its operations (newly created snap_XYZ directories that vanish
after a short time). This snapshotting probably also adds overhead to
the OSD startup time.
I have disabled snapshots in my setup now, since the stock ubuntu
trusty kernel had some stability problems with btrfs.
I also had to establish cron jobs for rebalancing the btrfs
partitions. It compacts the extents and may reduce the total amount of
space taken.
I'm not sure what you mean by "compacting" extents. I'm sure balance
doesn't defragment or compress files. It moves extents and before 3.14
according to the Btrfs wiki it was used to reclaim allocated but unused
space.
This shouldn't affect performance and with modern kernels may not be
needed to reclaim unused space anymore.
Post by Burkhard Linke
Unfortunately this procedure is not a default in most distributions (it
definitely should be!). The problems associated with unbalanced
extents should have been solved in kernel 3.18, but I haven't had the
time to check that yet.
I don't have any btrfs filesystems running on 3.17 or earlier versions
anymore (with a notable exception, see below) so I can't comment. I have
old btrfs filesystems that were created on 3.14 and are now on 3.18.x or
3.19.x (by the way avoid 3.18.9 to 3.19.4 if you can have any sort of
power failure, there's a possibility of a mount deadlock which requires
btrfs-zero-log to solve...). btrfs fi usage doesn't show anything
suspicious on these old fs.
I have a Jolla Phone which comes with a btrfs filesystem and uses an old
heavily patched 3.4 kernel. It hasn't had any problems yet but I don't
stuff it with data (I've seen discussions about triggering a balance
before a SailfishOS upgrade).
I assume that you shouldn't have any problems with filesystems that
aren't heavily used, which should be the case with Ceph OSDs (for example
our current alert level is at 75% space usage).
Post by Burkhard Linke
As a side note: I had several OSDs with dangling snapshots (more than
the two usually handled by the OSD). They are probably due to crashed
OSD daemons. You have to remove them manually, otherwise they start to
consume disk space.
Thanks a lot, I didn't think it could happen. I'll configure an alert
for this case.

Best regards,

Lionel