Discussion:
Deprecating ext4 support
Allen Samuels
2016-04-11 21:42:13 UTC
Permalink
RIP ext4.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
-----Original Message-----
Sent: Monday, April 11, 2016 2:40 PM
Subject: Deprecating ext4 support
Hi,
ext4 has never been recommended, but we did test it. After Jewel is out,
we would like to explicitly recommend *against* ext4 and stop testing it.
Recently we discovered an issue with the long object name handling that is
not fixable without rewriting a significant chunk of FileStore's filename
handling. (There is a limit in the amount of xattr data ext4 can store in the
inode, which causes problems in LFNIndex.)
We *could* invest a ton of time rewriting this to fix, but it only affects ext4,
which we never recommended, and we plan to deprecate FileStore once
BlueStore is stable anyway, so it seems like a waste of time that would be
better spent elsewhere.
Also, by dropping ext4 test coverage in ceph-qa-suite, we can significantly
improve time/coverage for FileStore on XFS and on BlueStore.
The long file name handling is problematic anytime someone is storing rados
objects with long names. The primary user that does this is RGW, which
means any RGW cluster using ext4 should recreate their OSDs to use XFS.
Other librados users could be affected too, though, like users with very long
rbd image names (e.g., > 100 characters), or custom librados users.
To make this change as visible as possible, the plan is to make ceph-osd
refuse to start if the backend is unable to support the configured max
object name (osd_max_object_name_len). The OSD will complain that ext4
cannot store such an object and refuse to start. A user who is only using
RBD might decide they don't need long file names to work and can adjust
the osd_max_object_name_len setting to something small (say, 64) and run
successfully. They would be taking a risk, though, because we would like
to stop testing on ext4.
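(For illustration only, not a recommendation: that override is a one-line
ceph.conf entry on the OSDs, using the value from the example above --

    [osd]
    osd max object name len = 64

-- keeping in mind that spaces and underscores are interchangeable in
option names.)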
Is this reasonable? If there are significant ext4 users who are unwilling to
recreate their OSDs, now would be the time to speak up.
Thanks!
sage
Jan Schermer
2016-04-11 21:47:00 UTC
Permalink
RIP Ceph.
Post by Allen Samuels
RIP ext4.
Mark Nelson
2016-04-11 21:57:16 UTC
Permalink
Hi,
ext4 has never been recommended, but we did test it. After Jewel is out,
we would like to explicitly recommend *against* ext4 and stop testing it.
I should clarify that this is a proposal and solicitation of feedback--we
haven't made any decisions yet. Now is the time to weigh in.
To add to this on the performance side, we stopped doing regular
performance testing on ext4 (and btrfs) sometime back around when ICE
was released to focus specifically on filestore behavior on xfs. There
were some cases at the time where ext4 was faster than xfs, but not
consistently so. btrfs is often quite fast on a fresh fs, but degrades
quickly due to fragmentation induced by CoW with
small-writes-to-large-object workloads (i.e., RBD small writes). If btrfs
auto-defrag is now safe to use in production it might be worth looking
at again, but probably not ext4.
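(For anyone who does want to experiment with it: autodefrag is just a btrfs
mount option, e.g.

    mount -o noatime,autodefrag /dev/sdX1 /var/lib/ceph/osd/ceph-N

with the device and OSD path above being placeholders -- and per the above,
that's a test-cluster exercise, not production advice.)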

Set sail for bluestore!

Mark
Shinobu Kinjo
2016-04-11 22:49:09 UTC
Permalink
Just to clarify to prevent any confusion.

Honestly, I've never used ext4 as the underlying filesystem for a Ceph cluster, but according to the wiki [1], ext4 is recommended -;

[1] https://en.wikipedia.org/wiki/Ceph_%28software%29

Shinobu

----- Original Message -----
From: "Mark Nelson" <***@redhat.com>
To: "Sage Weil" <***@newdream.net>, ceph-***@vger.kernel.org, ceph-***@ceph.com, ceph-***@ceph.com, ceph-***@ceph.com
Sent: Tuesday, April 12, 2016 6:57:16 AM
Subject: Re: [ceph-users] Deprecating ext4 support
Robin H. Johnson
2016-04-11 23:54:32 UTC
Permalink
Post by Shinobu Kinjo
Just to clarify to prevent any confusion.
Honestly I've never used ext4 as underlying filesystem for the Ceph cluster, but according to wiki [1], ext4 is recommended -;
[1] https://en.wikipedia.org/wiki/Ceph_%28software%29
Clearly somebody made a copy&paste error from the actual documentation.

Here's the docs on master and the recent LTS releases.
http://docs.ceph.com/docs/firefly/rados/configuration/filesystem-recommendations/
http://docs.ceph.com/docs/hammer/rados/configuration/filesystem-recommendations/
http://docs.ceph.com/docs/master/rados/configuration/filesystem-recommendations/

The documentation has NEVER recommended ext4.
Here's a slice of all history for that file:
http://dev.gentoo.org/~robbat2/ceph-history-of-filesystem-recommendations.patch

Generated with
$ git log -C -C -M -p ceph/master -- \
doc/rados/configuration/filesystem-recommendations.rst \
doc/config-cluster/file-system-recommendations.rst \
doc/config-cluster/file_system_recommendations.rst
Post by Shinobu Kinjo
``ext4`` is a poor file system choice if you intend to deploy the
RADOS Gateway or use snapshots on versions earlier than 0.45.
--
Robin Hugh Johnson
Gentoo Linux: Developer, Infrastructure Lead, Foundation Trustee
E-Mail : ***@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
Lionel Bouton
2016-04-11 23:09:44 UTC
Permalink
Hi,
Post by Mark Nelson
[...]
To add to this on the performance side, we stopped doing regular
performance testing on ext4 (and btrfs) sometime back around when ICE
was released to focus specifically on filestore behavior on xfs.
There were some cases at the time where ext4 was faster than xfs, but
not consistently so. btrfs is often quite fast on fresh fs, but
degrades quickly due to fragmentation induced by cow with
small-writes-to-large-object workloads (IE RBD small writes). If
btrfs auto-defrag is now safe to use in production it might be worth
looking at again, but probably not ext4.
For BTRFS, autodefrag is probably not performance-safe (yet), at least
with RBD access patterns. At least it wasn't in 4.1.9 when we tested it
last time (the performance degraded slowly but surely over several weeks,
from an initially well-performing filesystem to the point where we
measured a 100% increase in average latencies plus large spikes, and we
stopped the experiment). I haven't seen any patches on linux-btrfs since
then (it might have benefited from other modifications, but the
autodefrag algorithm itself wasn't reworked, AFAIK).
That's not an inherent problem of BTRFS but of the autodefrag
implementation though. Deactivating autodefrag and reimplementing a
basic, cautious defragmentation scheduler gave us noticeably better
latencies with BTRFS vs XFS (~30% better) on the same hardware and
workload long term (as in almost a year and countless full-disk rewrites
on the same filesystems due to both normal writes and rebalancing with 3
to 4 months of XFS and BTRFS OSDs coexisting for comparison purposes).
I'll certainly remount a subset of our OSDs with autodefrag, as I did with
4.1.9, when we deploy 4.4.x or a later LTS kernel. So I might have
more up-to-date information in the coming months. I don't plan to
compare BTRFS to XFS anymore though: XFS only saves us from running our
defragmentation scheduler, BTRFS is far more suited to our workload, and
we've seen constant improvements in behavior along the 3.16.x to 4.1.x
road (arguably bumpy until the late 3.19 versions).

Other things:

* If the journal is not on a separate partition (SSD), it should
definitely be re-created NoCoW to avoid unnecessary fragmentation. From
memory (rough sketch below): stop the OSD, touch journal.new, chattr +C
journal.new, dd if=journal of=journal.new (your dd options here for best
perf/least amount of cache eviction), rm journal, mv journal.new journal,
start the OSD again.
* filestore btrfs snap = false
is mandatory if you want consistent performance (at least on HDDs). It
may not be felt with almost-empty OSDs, but performance hiccups appear if
any non-trivial amount of data is added to the filesystems.
IIRC, when we debugged this, surprisingly it wasn't the snapshot creation
that turned out to be the actual cause of the performance problems but the
snapshot deletion... It's so bad that the default should probably be false
and not true.
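To make the journal recreation above concrete, here's a rough sketch of what
I mean (purely from memory and illustrative only -- the OSD id, paths, dd
options and the init commands are placeholders to adapt, and I'd test this
on a non-critical OSD first):

    # stop the OSD so the journal is quiescent
    systemctl stop ceph-osd@12
    cd /var/lib/ceph/osd/ceph-12
    # create the replacement file and mark it NoCoW *before* writing to it
    touch journal.new
    chattr +C journal.new
    # copy the existing journal contents (pick dd options that limit cache eviction)
    dd if=journal of=journal.new bs=4M oflag=direct
    rm journal
    mv journal.new journal
    systemctl start ceph-osd@12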

Lionel
Lindsay Mathieson
2016-04-11 23:40:28 UTC
Permalink
Post by Lionel Bouton
* If the journal is not on a separate partition (SSD), it should
definitely be re-created NoCoW to avoid unnecessary fragmentation. From
memory : stop OSD, touch journal.new, chattr +C journal.new, dd
if=journal of=journal.new (your dd options here for best perf/least
amount of cache eviction), rm journal, mv journal.new journal, start OSD
again.
Flush the journal after stopping the OSD!
--
Lindsay Mathieson
Lionel Bouton
2016-04-11 23:46:40 UTC
Permalink
Post by Lindsay Mathieson
Post by Lionel Bouton
* If the journal is not on a separate partition (SSD), it should
definitely be re-created NoCoW to avoid unnecessary fragmentation. From
memory : stop OSD, touch journal.new, chattr +C journal.new, dd
if=journal of=journal.new (your dd options here for best perf/least
amount of cache eviction), rm journal, mv journal.new journal, start OSD
again.
Flush the journal after stopping the OSD !
No need to: dd makes an exact duplicate.

Lionel
Michael Hanscho
2016-04-11 22:27:47 UTC
Permalink
Hi!

How about these findings?

https://events.linuxfoundation.org/sites/events/files/slides/AFL%20filesystem%20fuzzing%2C%20Vault%202016.pdf

Ext4 seems to be the file system that held up best in that testing...
(although xfs also survived quite long...)

Gruesse
Michael
Hi,
ext4 has never been recommended, but we did test it. After Jewel is out,
we would like explicitly recommend *against* ext4 and stop testing it.
I should clarify that this is a proposal and solicitation of feedback--we
haven't made any decisions yet. Now is the time to weigh in.
sage
Christian Balzer
2016-04-11 23:39:25 UTC
Permalink
Hello,

What a lovely missive to start off my working day...
Hi,
ext4 has never been recommended, but we did test it.
Patently wrong, as Shinobu just pointed out.

Ext4 never was (especially recently) flogged as much as XFS, but it always
was a recommended, supported filestore filesystem, unlike the
experimental BTRFS or ZFS.
And for various reasons people, including me, deployed it instead of XFS.
After Jewel is
out, we would like explicitly recommend *against* ext4 and stop testing
it.
Changing your recommendations is fine, stopping testing/supporting it
isn't.
People deployed Ext4 in good faith and can be expected to use it at least
until their HW is up for replacement (4-5 years).
Recently we discovered an issue with the long object name handling that
is not fixable without rewriting a significant chunk of FileStores
filename handling. (There is a limit in the amount of xattr data ext4
can store in the inode, which causes problems in LFNIndex.)
Is that also true if the Ext4 inode size is larger than default?
We *could* invest a ton of time rewriting this to fix, but it only
affects ext4, which we never recommended, and we plan to deprecate
FileStore once BlueStore is stable anyway, so it seems like a waste of
time that would be better spent elsewhere.
If you (that is, RH) are going to declare bluestore stable this year, I
would be very surprised.
Either way, dropping support before the successor is truly ready doesn't
sit well with me.

Which brings me to the reasons why people would want to migrate (NOT
talking about starting freshly) to bluestore.

1. Will it be faster (IOPS) than filestore with SSD journals?
Don't think so, but feel free to prove me wrong.

2. Will it be bit-rot proof? Note the deafening silence from the devs in
this thread:
http://www.spinics.net/lists/ceph-users/msg26510.html
Also, by dropping ext4 test coverage in ceph-qa-suite, we can
significantly improve time/coverage for FileStore on XFS and on
BlueStore.
Really, isn't that fully automated?
The long file name handling is problematic anytime someone is storing
rados objects with long names. The primary user that does this is RGW,
which means any RGW cluster using ext4 should recreate their OSDs to use
XFS. Other librados users could be affected too, though, like users
with very long rbd image names (e.g., > 100 characters), or custom
librados users.
To make this change as visible as possible, the plan is to make ceph-osd
refuse to start if the backend is unable to support the configured max
object name (osd_max_object_name_len). The OSD will complain that ext4
cannot store such an object and refuse to start. A user who is only
using RBD might decide they don't need long file names to work and can
adjust the osd_max_object_name_len setting to something small (say, 64)
and run successfully. They would be taking a risk, though, because we
would like to stop testing on ext4.
Is this reasonable?
About as reasonable as dropping format 1 support, that is: not at all.
https://www.mail-archive.com/ceph-***@lists.ceph.com/msg28070.html

I'm officially only allowed to do (preventative) maintenance during weekend
nights on our main production cluster.
That would mean 13 ruined weekends at the realistic rate of 1 OSD per
night, so you can see where my lack of enthusiasm for OSD recreation comes
from.
If there significant ext4 users that are unwilling
to recreate their OSDs, now would be the time to speak up.
Consider that done.

Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
Sage Weil
2016-04-12 01:12:14 UTC
Permalink
Post by Christian Balzer
Hello,
What a lovely missive to start off my working day...
Hi,
ext4 has never been recommended, but we did test it.
Patently wrong, as Shinobu just pointed.
Ext4 never was (especially recently) flogged as much as XFS, but it always
was a recommended, supported filestorage filesystem, unlike the
experimental BTRFS of ZFS.
And for various reasons people, including me, deployed it instead of XFS.
Greg definitely wins the prize for raising this as a major issue, then
(and for naming you as one of the major ext4 users).

I was not aware that we were recommending ext4 anywhere. FWIW, here's
what the docs currently say:

Ceph OSD Daemons rely heavily upon the stability and performance of the
underlying filesystem.

Note: We currently recommend XFS for production deployments. We recommend
btrfs for testing, development, and any non-critical deployments. We
believe that btrfs has the correct feature set and roadmap to serve Ceph
in the long-term, but XFS and ext4 provide the necessary stability for
today¢s deployments. btrfs development is proceeding rapidly: users should
be comfortable installing the latest released upstream kernels and be able
to track development activity for critical bug fixes.

Ceph OSD Daemons depend on the Extended Attributes (XATTRs) of the
underlying file system for various forms of internal object state and
metadata. The underlying filesystem must provide sufficient capacity for
XATTRs. btrfs does not bound the total xattr metadata stored with a file.
XFS has a relatively large limit (64 KB) that most deployments won't
encounter, but the ext4 is too small to be usable.

(http://docs.ceph.com/docs/master/rados/configuration/filesystem-recommendations/?highlight=ext4)

Unfortunately that second paragraph, second sentence indirectly says ext4
is stable. :( :( I'll prepare a PR tomorrow to revise this whole section
based on the new information.

If anyone knows of other docs that recommend ext4, please let me know!
They need to be updated.
Post by Christian Balzer
After Jewel is out, we would like explicitly recommend *against* ext4
and stop testing it.
Changing your recommendations is fine, stopping testing/supporting it
isn't.
People deployed Ext4 in good faith and can be expected to use it at least
until their HW is up for replacement (4-5 years).
I agree, which is why I asked.

And part of it depends on what it's being used for. If there are major
users using ext4 for RGW then their deployments are at risk and they
should swap it out for data safety reasons alone. (Or, we need to figure
out how to fix long object name support on ext4.) On the other hand, if
the only ext4 users are using RBD only, then they can safely continue with
lower max object names, and upstream testing is important to let those
OSDs age out naturally.

Does your cluster support RBD, RGW, or something else?
Post by Christian Balzer
Recently we discovered an issue with the long object name handling that
is not fixable without rewriting a significant chunk of FileStores
filename handling. (There is a limit in the amount of xattr data ext4
can store in the inode, which causes problems in LFNIndex.)
Is that also true if the Ext4 inode size is larger than default?
I'm not sure... Sam, do you know? (It's somewhat academic, though, since
we can't change the inode size on existing file systems.)
Post by Christian Balzer
We *could* invest a ton of time rewriting this to fix, but it only
affects ext4, which we never recommended, and we plan to deprecate
FileStore once BlueStore is stable anyway, so it seems like a waste of
time that would be better spent elsewhere.
If you (that is RH) is going to declare bluestore stable this year, I
would be very surprised.
My hope is that it can be the *default* for L (next spring). But we'll
see.
Post by Christian Balzer
Either way, dropping support before the successor is truly ready doesn't
sit well with me.
Yeah, I misspoke. Once BlueStore is supported and the default, support
for FileStore won't be dropped immediately. But we'll want to communicate
that eventually it will lose support. How strongly that is messaged
probably depends on how confident we are in BlueStore at that point. And
I confess I haven't thought much about how long "long enough" is yet.
Post by Christian Balzer
Which brings me to the reasons why people would want to migrate (NOT
talking about starting freshly) to bluestore.
1. Will it be faster (IOPS) than filestore with SSD journals?
Don't think so, but feel free to prove me wrong.
It will absolutely be faster on the same hardware. Whether BlueStore on HDD
only is faster than FileStore HDD + SSD journal will depend on the
workload.
Post by Christian Balzer
2. Will it be bit-rot proof? Note the deafening silence from the devs in
http://www.spinics.net/lists/ceph-users/msg26510.html
I missed that thread, sorry.

We (Mirantis, SanDisk, Red Hat) are currently working on checksum support
in BlueStore. Part of the reason why BlueStore is the preferred path is
because we will probably never see full checksumming in ext4 or XFS.
Post by Christian Balzer
Also, by dropping ext4 test coverage in ceph-qa-suite, we can
significantly improve time/coverage for FileStore on XFS and on
BlueStore.
Really, isn't that fully automated?
It is, but hardware and time are finite. Fewer tests on FileStore+ext4
means more tests on FileStore+XFS or BlueStore. But this is a minor
point.
Post by Christian Balzer
The long file name handling is problematic anytime someone is storing
rados objects with long names. The primary user that does this is RGW,
which means any RGW cluster using ext4 should recreate their OSDs to use
XFS. Other librados users could be affected too, though, like users
with very long rbd image names (e.g., > 100 characters), or custom
librados users.
To make this change as visible as possible, the plan is to make ceph-osd
refuse to start if the backend is unable to support the configured max
object name (osd_max_object_name_len). The OSD will complain that ext4
cannot store such an object and refuse to start. A user who is only
using RBD might decide they don't need long file names to work and can
adjust the osd_max_object_name_len setting to something small (say, 64)
and run successfully. They would be taking a risk, though, because we
would like to stop testing on ext4.
Is this reasonable?
About as reasonable as dropping format 1 support, that is not at all.
Fortunately nobody (to my knowledge) has suggested dropping format 1
support. :)
Post by Christian Balzer
I'm officially only allowed to do (preventative) maintenance during weekend
nights on our main production cluster.
That would mean 13 ruined weekends at the realistic rate of 1 OSD per
night, so you can see where my lack of enthusiasm for OSD recreation comes
from.
Yeah. :(
Post by Christian Balzer
If there significant ext4 users that are unwilling
to recreate their OSDs, now would be the time to speak up.
Consider that done.
Thank you for the feedback!

sage
Shinobu Kinjo
2016-04-12 01:32:44 UTC
Permalink
Hi Sage,

It may be better to mention that we only update the master documentation, otherwise someone gets confused again [1].

[1] https://en.wikipedia.org/wiki/Ceph_%28software%29

Cheers,
Shinobu

----- Original Message -----
From: "Sage Weil" <***@redhat.com>
To: "Christian Balzer" <***@gol.com>
Cc: ceph-***@vger.kernel.org, ceph-***@ceph.com, ceph-***@ceph.com
Sent: Tuesday, April 12, 2016 10:12:14 AM
Subject: Re: [ceph-users] Deprecating ext4 support
hp cre
2016-04-12 02:05:29 UTC
Permalink
As far as I remember, the documentation did say that either filesystem
(ext4 or xfs) is OK, except for xattrs, which are better handled on xfs.

I would think the best move would be to make xfs the default osd creation
method and put in a warning about ext4 being deprecated in future
releases, but leave support for it until all users are weaned off of it in
favour of xfs and later, btrfs.
Christian Balzer
2016-04-12 02:43:41 UTC
Permalink
Hello,
Post by Sage Weil
Post by Christian Balzer
Hello,
What a lovely missive to start off my working day...
Hi,
ext4 has never been recommended, but we did test it.
Patently wrong, as Shinobu just pointed.
Ext4 never was (especially recently) flogged as much as XFS, but it
always was a recommended, supported filestorage filesystem, unlike the
experimental BTRFS of ZFS.
And for various reasons people, including me, deployed it instead of XFS.
Greg definitely wins the prize for raising this as a major issue, then
(and for naming you as one of the major ext4 users).
I'm sure there are others; it's often surprising how people will pipe
up on this ML for the first time with really massive deployments they've
been running for years w/o ever being on anybody's radar.
Post by Sage Weil
I was not aware that we were recommending ext4 anywhere. FWIW, here's
Ceph OSD Daemons rely heavily upon the stability and performance of the
underlying filesystem.
Note: We currently recommend XFS for production deployments. We
recommend btrfs for testing, development, and any non-critical
deployments. We believe that btrfs has the correct feature set and
roadmap to serve Ceph in the long-term, but XFS and ext4 provide the
necessary stability for today’s deployments. btrfs development is
proceeding rapidly: users should be comfortable installing the latest
released upstream kernels and be able to track development activity for
critical bug fixes.
Ceph OSD Daemons depend on the Extended Attributes (XATTRs) of the
underlying file system for various forms of internal object state and
metadata. The underlying filesystem must provide sufficient capacity
for XATTRs. btrfs does not bound the total xattr metadata stored with a
file. XFS has a relatively large limit (64 KB) that most deployments
won’t encounter, but the ext4 is too small to be usable.
(http://docs.ceph.com/docs/master/rados/configuration/filesystem-recommendations/?highlight=ext4)
Unfortunately that second paragraph, second sentence indirectly says
ext4 is stable. :( :( I'll prepare a PR tomorrow to revise this whole
section based on the new information.
Not only that, the "filestore xattr use omap" section afterwards
reinforces that by clearly suggesting that this is the official
work-around for the XATTR issue.
Post by Sage Weil
If anyone knows of other docs that recommend ext4, please let me know!
They need to be updated.
Not going to try to find any cached versions, but when I did my first
deployment with Dumpling I don't think the "Note" section was there, or as
prominent.
Not that it would have stopped me from using Ext4, mind.
Post by Sage Weil
Post by Christian Balzer
After Jewel is out, we would like explicitly recommend *against*
ext4 and stop testing it.
Changing your recommendations is fine, stopping testing/supporting it
isn't.
People deployed Ext4 in good faith and can be expected to use it at
least until their HW is up for replacement (4-5 years).
I agree, which is why I asked.
And part of it depends on what it's being used for. If there are major
users using ext4 for RGW then their deployments are at risk and they
should swap it out for data safety reasons alone. (Or, we need to
figure out how to fix long object name support on ext4.) On the other
hand, if the only ext4 users are using RBD only, then they can safely
continue with lower max object names, and upstream testing is important
to let those OSDs age out naturally.
Does your cluster support RBD, RGW, or something else?
Only RBD on all clusters so far and definitely no plans to change that for
the main, mission critical production cluster.
I might want to add CephFS to the other production cluster at some time,
though.

No RGW, but if/when RGW supports "listing objects quickly" (which is what I
vaguely remember from my conversation with Timo Sirainen, the Dovecot
author) we would be very interested in that particular piece of Ceph as
well. On a completely new cluster though, so no issue.
Post by Sage Weil
Post by Christian Balzer
Recently we discovered an issue with the long object name handling
that is not fixable without rewriting a significant chunk of
FileStores filename handling. (There is a limit in the amount of
xattr data ext4 can store in the inode, which causes problems in
LFNIndex.)
Is that also true if the Ext4 inode size is larger than default?
I'm not sure... Sam, do you know? (It's somewhat academic, though,
since we can't change the inode size on existing file systems.)
Yes and no.
Some people (and I think not just me) were perfectly capable of reading
between the lines and formatted their Ext4 FS accordingly:
"mkfs.ext4 -J size=1024 -I 2048 -i 65536 ... " (the -I bit)
Post by Sage Weil
Post by Christian Balzer
We *could* invest a ton of time rewriting this to fix, but it only
affects ext4, which we never recommended, and we plan to deprecate
FileStore once BlueStore is stable anyway, so it seems like a waste
of time that would be better spent elsewhere.
If you (that is RH) is going to declare bluestore stable this year, I
would be very surprised.
My hope is that it can be the *default* for L (next spring). But we'll
see.
Yeah, that's my most optimistic estimate as well.
Post by Sage Weil
Post by Christian Balzer
Either way, dropping support before the successor is truly ready
doesn't sit well with me.
Yeah, I misspoke. Once BlueStore is supported and the default, support
for FileStore won't be dropped immediately. But we'll want to
communicate that eventually it will lose support. How strongly that is
messaged probably depends on how confident we are in BlueStore at that
point. And I confess I haven't thought much about how long "long
enough" is yet.
Again, most people that deploy Ceph in a commercial environment (that is,
working for a company) will be under pressure from the penny-pinching
department to use their HW for 4-5 years (never mind the pace of
technology and Moore's law).

So you will want to:
a) Announce the end of FileStore ASAP, but then again you can't really
do that before BlueStore is stable.
b) Support FileStore for at least 4 years after BlueStore is the default.
This could be done by having a _real_ LTS release, instead of dragging
FileStore into newer versions.
Post by Sage Weil
Post by Christian Balzer
Which brings me to the reasons why people would want to migrate (NOT
talking about starting freshly) to bluestore.
1. Will it be faster (IOPS) than filestore with SSD journals?
Don't think so, but feel free to prove me wrong.
It will absolutely faster on the same hardware. Whether BlueStore on
HDD only is faster than FileStore HDD + SSD journal will depend on the
workload.
Where would the Journal SSDs enter the picture with BlueStore?
Not at all, AFAIK, right?

I'm thinking about people with existing HW again.
What do they do with those SSDs, which aren't necessarily sized in a
fashion to be sensible SSD pools/cache tiers?
Post by Sage Weil
Post by Christian Balzer
2. Will it be bit-rot proof? Note the deafening silence from the devs
http://www.spinics.net/lists/ceph-users/msg26510.html
I missed that thread, sorry.
We (Mirantis, SanDisk, Red Hat) are currently working on checksum
support in BlueStore. Part of the reason why BlueStore is the preferred
path is because we will probably never see full checksumming in ext4 or
XFS.
Now this (when done correctly) and BlueStore being a stable default will
be a much, MUCH higher motivation for people to migrate to it than
terminating support for something that works perfectly well (for my use
case at least).
Post by Sage Weil
Post by Christian Balzer
Also, by dropping ext4 test coverage in ceph-qa-suite, we can
significantly improve time/coverage for FileStore on XFS and on
BlueStore.
Really, isn't that fully automated?
It is, but hardware and time are finite. Fewer tests on FileStore+ext4
means more tests on FileStore+XFS or BlueStore. But this is a minor
point.
Post by Christian Balzer
The long file name handling is problematic anytime someone is
storing rados objects with long names. The primary user that does
this is RGW, which means any RGW cluster using ext4 should recreate
their OSDs to use XFS. Other librados users could be affected too,
though, like users with very long rbd image names (e.g., > 100
characters), or custom librados users.
To make this change as visible as possible, the plan is to make
ceph-osd refuse to start if the backend is unable to support the
configured max object name (osd_max_object_name_len). The OSD will
complain that ext4 cannot store such an object and refuse to start.
A user who is only using RBD might decide they don't need long file
names to work and can adjust the osd_max_object_name_len setting to
something small (say, 64) and run successfully. They would be
taking a risk, though, because we would like to stop testing on ext4.
Is this reasonable?
About as reasonable as dropping format 1 support, that is not at all.
Fortunately nobody (to my knowledge) has suggested dropping format 1
support. :)
I suggest you look at that thread and your official release notes:
---
* The rbd legacy image format (version 1) is deprecated with the Jewel release.
Attempting to create a new version 1 RBD image will result in a warning.
Future releases of Ceph will remove support for version 1 RBD images.
---
Post by Sage Weil
Post by Christian Balzer
I'm officially only allowed to do (preventative) maintenance during
weekend nights on our main production cluster.
That would mean 13 ruined weekends at the realistic rate of 1 OSD per
night, so you can see where my lack of enthusiasm for OSD recreation
comes from.
Yeah. :(
Post by Christian Balzer
If there significant ext4 users that are unwilling
to recreate their OSDs, now would be the time to speak up.
Consider that done.
Thank you for the feedback!
Thanks for getting back to me so quickly.

Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
Sage Weil
2016-04-12 13:56:32 UTC
Permalink
Hi all,

I've posted a pull request that updates any mention of ext4 in the docs:

https://github.com/ceph/ceph/pull/8556

In particular, I would appreciate any feedback on

https://github.com/ceph/ceph/pull/8556/commits/49604303124a2b546e66d6e130ad4fa296602b01

both on substance and delivery.

Given the previous lack of clarity around ext4, and that it works well
enough for RBD and other short object name workloads, I think the most we
can do now is deprecate it to steer any new OSDs away.

And at least in the non-RGW case, I mean deprecate in the "recommend
alternative" sense of the word, not that it won't be tested or that any
code will be removed.

https://en.wikipedia.org/wiki/Deprecation#Software_deprecation

If there are ext4 + RGW users, that is still a difficult issue, since it
is broken now, and expensive to fix.
Post by Christian Balzer
Only RBD on all clusters so far and definitely no plans to change that
for the main, mission critical production cluster. I might want to add
CephFS to the other production cluster at some time, though.
That's good to hear. If you continue to use ext4 (by adjusting down the
max object length), the only limitation you should hit is an indirect cap
on the max RBD image name length.
Post by Christian Balzer
No RGW, but if/when RGW supports "listing objects quickly" (is what I
vaguely remember from my conversation with Timo Sirainen, the Dovecot
author) we would be very interested in that particular piece of Ceph as
well. On a completely new cluster though, so no issue.
OT, but I suspect he was referring to something slightly different here.
Our conversations about object listing vs the dovecot backend surrounded
the *rados* listing semantics (hash-based, not prefix/name based). RGW
supports fast sorted/prefix name listings, but you pay for it by
maintaining an index (which slows down PUT). The latest RGW in Jewel has
experimental support for a non-indexed 'blind' bucket as well for users
that need some of the RGW features (ACLs, striping, etc.) but not the
ordered object listing and other index-dependent features.
Post by Christian Balzer
Again, most people that deploy Ceph in a commercial environment (that is
working for a company) will be under pressure by the penny-pinching
department to use their HW for 4-5 years (never mind the pace of
technology and Moore's law).
a) Announce the end of FileStore ASAP, but then again you can't really
do that before BlueStore is stable.
b) support FileStore for 4 years at least after BlueStore is the default.
This could be done by having a _real_ LTS release, instead of dragging
Filestore into newer version.
Right. Nothing can be done until the preferred alternative is completely
stable, and from then it will take quite some time to drop support or
remove it given the install base.
Post by Christian Balzer
Post by Sage Weil
Post by Christian Balzer
Which brings me to the reasons why people would want to migrate (NOT
talking about starting freshly) to bluestore.
1. Will it be faster (IOPS) than filestore with SSD journals?
Don't think so, but feel free to prove me wrong.
It will absolutely faster on the same hardware. Whether BlueStore on
HDD only is faster than FileStore HDD + SSD journal will depend on the
workload.
Where would the Journal SSDs enter the picture with BlueStore?
Not at all, AFAIK, right?
BlueStore can use as many as three devices: one for the WAL (journal,
though it can be much smaller than FileStore's, e.g., 128MB), one for
metadata (e.g., an SSD partition), and one for data.
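Just as a rough sketch of how that maps onto config options (the paths are
placeholders, and since BlueStore is still experimental the option names may
change):

    [osd]
    osd objectstore = bluestore
    # data device
    bluestore block path = /dev/sdb2
    # metadata / RocksDB device (e.g. an SSD partition)
    bluestore block db path = /dev/sdc1
    # write-ahead log, can be small (e.g. 128MB)
    bluestore block wal path = /dev/sdc2

I believe the provisioning tooling ends up expressing the same thing as
block/block.db/block.wal symlinks in the OSD data directory, so most people
won't write these by hand.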
Post by Christian Balzer
I'm thinking again about people with existing HW again.
What do they do with those SSDs, which aren't necessarily sized in a
fashion to be sensible SSD pools/cache tiers?
We can either use them for BlueStore wal and metadata, or as a cache for
the data device (e.g., dm-cache, bcache, FlashCache), or some combination
of the above. It will take some time to figure out which gives the
best performance (and for which workloads).
Post by Christian Balzer
Post by Sage Weil
Post by Christian Balzer
2. Will it be bit-rot proof? Note the deafening silence from the devs
http://www.spinics.net/lists/ceph-users/msg26510.html
I missed that thread, sorry.
We (Mirantis, SanDisk, Red Hat) are currently working on checksum
support in BlueStore. Part of the reason why BlueStore is the preferred
path is because we will probably never see full checksumming in ext4 or
XFS.
Now this (when done correctly) and BlueStore being a stable default will
be a much, MUCH higher motivation for people to migrate to it than
terminating support for something that works perfectly well (for my use
case at least).
Agreed.
Post by Christian Balzer
Post by Sage Weil
Post by Christian Balzer
To make this change as visible as possible, the plan is to make
ceph-osd refuse to start if the backend is unable to support the
configured max object name (osd_max_object_name_len). The OSD will
complain that ext4 cannot store such an object and refuse to start.
A user who is only using RBD might decide they don't need long file
names to work and can adjust the osd_max_object_name_len setting to
something small (say, 64) and run successfully. They would be
taking a risk, though, because we would like to stop testing on ext4.
Is this reasonable?
About as reasonable as dropping format 1 support, that is not at all.
Fortunately nobody (to my knowledge) has suggested dropping format 1
support. :)
---
* The rbd legacy image format (version 1) is deprecated with the Jewel release.
Attempting to create a new version 1 RBD image will result in a warning.
Future releases of Ceph will remove support for version 1 RBD images.
---
"Future releases of Ceph *may* remove support" might be more accurate, but
it doesn't make for as compelling a warning, and it's pretty likely that
*eventually* it will make sense to drop it. That won't happen without a
proper conversation about user impact and migration, though. There are
real problems with format 1 besides just the lack of new features (e.g.,
rename vs watchers).

This is what 'deprecation' means: we're not dropping support now (that
*would* be unreasonable), but we're warning users that at some future
point we (probably) will. If there is any reason why new images shouldn't
be created with v2, please let us know. Obviously v1 -> v2 image
conversion remains an open issue.

Thanks-
sage
Christian Balzer
2016-04-13 03:27:35 UTC
Permalink
Hello,
Post by Sage Weil
Hi all,
https://github.com/ceph/ceph/pull/8556
In particular, I would appreciate any feedback on
https://github.com/ceph/ceph/pull/8556/commits/49604303124a2b546e66d6e130ad4fa296602b01
both on substance and delivery.
Given the previous lack of clarity around ext4, and that it works well
enough for RBD and other short object name workloads, I think the most
we can do now is deprecate it to steer any new OSDs away.
A clear statement of what "short" means in this context and if this (in
general) applies to RBD and CephFS would probably be helpful.
Post by Sage Weil
And at least in the non-RGW case, I mean deprecate in the "recommend
alternative" sense of the word, not that it won't be tested or that any
code will be removed.
https://en.wikipedia.org/wiki/Deprecation#Software_deprecation
If there are ext4 + RGW users, that is still a difficult issue, since it
is broken now, and expensive to fix.
I'm wondering what the cross section of RGW (being "stable" a lot longer
than CephFS) and Ext4 users is for this to pop up so late in the game.

Also, since Sam didn't pipe up, I would still like to know if this is
"fixed" by having larger than the default 256-byte Ext4 inodes (2KB in my
case), as it isn't purely academic for me.
Or maybe other people like "Michael Metz-Martini" who need Ext4 for
performance reasons and obviously can't go to BlueStore yet.
Post by Sage Weil
Post by Christian Balzer
Only RBD on all clusters so far and definitely no plans to change that
for the main, mission critical production cluster. I might want to add
CephFS to the other production cluster at some time, though.
That's good to hear. If you continue to use ext4 (by adjusting down the
max object length), the only limitation you should hit is an indirect
cap on the max RBD image name length.
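In config terms that adjustment is a single knob; a minimal sketch, assuming
the osd_max_object_name_len option discussed earlier (the value is only an
example and must stay larger than your longest RBD image name):
---
[osd]
    osd max object name len = 64
---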
Just to parse this sentence correctly, is it the name of the object
(output of "rados ls"), the name of the image "rbd ls" or either?
Post by Sage Weil
Post by Christian Balzer
No RGW, but if/when RGW supports "listing objects quickly" (is what I
vaguely remember from my conversation with Timo Sirainen, the Dovecot
author) we would be very interested in that particular piece of Ceph as
well. On a completely new cluster though, so no issue.
OT, but I suspect he was referring to something slightly different
here. Our conversations about object listing vs the dovecot backend
surrounded the *rados* listing semantics (hash-based, not prefix/name
based). RGW supports fast sorted/prefix name listings, but you pay for
it by maintaining an index (which slows down PUT). The latest RGW in
Jewel has experimental support for a non-indexed 'blind' bucket as well
for users that need some of the RGW features (ACLs, striping, etc.) but
not the ordered object listing and other index-dependent features.
Sorry about the OT, but since the Dovecot (Pro) backend supports S3 I
would have thought that RGW would be a logical expansion from there, not
going for a completely new (but likely a lot faster) backend using rados.
Oh well, I shall go poke them.
Post by Sage Weil
Post by Christian Balzer
Again, most people that deploy Ceph in a commercial environment (that
is working for a company) will be under pressure by the penny-pinching
department to use their HW for 4-5 years (never mind the pace of
technology and Moore's law).
a) Announce the end of FileStore ASAP, but then again you can't really
do that before BlueStore is stable.
b) support FileStore for 4 years at least after BlueStore is the
default. This could be done by having a _real_ LTS release, instead of
dragging Filestore into newer version.
Right. Nothing can be done until the preferred alternative is
completely stable, and from then it will take quite some time to drop
support or remove it given the install base.
Post by Christian Balzer
Post by Sage Weil
Post by Christian Balzer
Which brings me to the reasons why people would want to migrate
(NOT talking about starting freshly) to bluestore.
1. Will it be faster (IOPS) than filestore with SSD journals?
Don't think so, but feel free to prove me wrong.
It will absolutely faster on the same hardware. Whether BlueStore on
HDD only is faster than FileStore HDD + SSD journal will depend on
the workload.
Where would the Journal SSDs enter the picture with BlueStore?
Not at all, AFAIK, right?
BlueStore can use as many as three devices: one for the WAL (journal,
though it can be much smaller than FileStore's, e.g., 128MB), one for
metadata (e.g., an SSD partition), and one for data.
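A rough sketch of how such a three-device split could be expressed in
ceph.conf; the bluestore_block_* option names below come from later BlueStore
code and are an assumption on my part, as are the device paths, so treat this
as illustration only:
---
[osd]
    # bulk data on the large, slow device
    bluestore block path = /dev/sdb
    # RocksDB metadata on an SSD partition
    bluestore block db path = /dev/sdc1
    # small WAL (on the order of 128MB) on NVRAM/NVMe
    bluestore block wal path = /dev/nvme0n1p1
---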
Right, I blanked on that, despite having read about the K/V store backends
back when they first showed up. Just didn't make the connection with BlueStore.

OK, so we have a small write-intent-log, probably even better hosted on
NVRAM with new installs.
The metadata is the same/similar to what lives in ...current/meta/... on
OSDs these days?
If so, that's 30MB per PG in my case, so not a lot either.
Post by Sage Weil
Post by Christian Balzer
I'm thinking again about people with existing HW again.
What do they do with those SSDs, which aren't necessarily sized in a
fashion to be sensible SSD pools/cache tiers?
We can either use them for BlueStore wal and metadata, or as a cache for
the data device (e.g., dm-cache, bcache, FlashCache), or some
combination of the above. It will take some time to figure out which
gives the best performance (and for which workloads).
Including finding out which sauce these caching layers prefer when eating
your data. ^_-
Given the current state of affairs and reports of people here I'll likely
take a comfy backseat there.
Post by Sage Weil
Post by Christian Balzer
Post by Sage Weil
Post by Christian Balzer
2. Will it be bit-rot proof? Note the deafening silence from the
http://www.spinics.net/lists/ceph-users/msg26510.html
I missed that thread, sorry.
We (Mirantis, SanDisk, Red Hat) are currently working on checksum
support in BlueStore. Part of the reason why BlueStore is the
preferred path is because we will probably never see full
checksumming in ext4 or XFS.
Now this (when done correctly) and BlueStore being a stable default
will be a much, MUCH higher motivation for people to migrate to it than
terminating support for something that works perfectly well (for my use
case at least).
Agreed.
Post by Christian Balzer
Post by Sage Weil
Post by Christian Balzer
To make this change as visible as possible, the plan is to make
ceph-osd refuse to start if the backend is unable to support the
configured max object name (osd_max_object_name_len). The OSD
will complain that ext4 cannot store such an object and refuse
to start. A user who is only using RBD might decide they don't
need long file names to work and can adjust the
osd_max_object_name_len setting to something small (say, 64) and
run successfully. They would be taking a risk, though, because
we would like to stop testing on ext4.
Is this reasonable?
About as reasonable as dropping format 1 support, that is not at all.
Fortunately nobody (to my knowledge) has suggested dropping format 1
support. :)
---
* The rbd legacy image format (version 1) is deprecated with the Jewel
release. Attempting to create a new version 1 RBD image will result in
a warning. Future releases of Ceph will remove support for version 1
RBD images. ---
"Future releases of Ceph *may* remove support" might be more accurate,
but it doesn't make for as compelling a warning, and it's pretty likely
that *eventually* it will make sense to drop it. That won't happen
without a proper conversation about user impact and migration, though.
There are real problems with format 1 besides just the lack of new
features (e.g., rename vs watchers).
This is what 'deprecation' means: we're not dropping support now (that
*would* be unreasonable), but we're warning users that at some future
point we (probably) will. If there is any reason why new images
shouldn't be created with v2, please let us know. Obviously v1 -> v2
image conversion remains an open issue.
Yup, I did change my default format on the other cluster early on to 2,
but the mission critical one is a lot older and at 1 with over 450
images/VMs.
So having something that will convert things with a light touch is very
much needed.

Thanks again,

Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
Steffen Weißgerber
2016-04-14 09:43:07 UTC
Permalink
Post by Christian Balzer
Hello,
Hi,
Post by Christian Balzer
I'm officially only allowed to do (preventative) maintenance during weekend
nights on our main production cluster.
That would mean 13 ruined weekends at the realistic rate of 1 OSD per
night, so you can see where my lack of enthusiasm for OSD recreation comes
from.
I'm really wondering about that. We introduced Ceph for VMs on RBD precisely
so that we would not have to move maintenance time to the night shift.

My understanding of ceph is that it was also made as reliable storage in case
of hardware failure.

So what's the difference, for the end user, between maintaining an OSD and its
failure? In both cases it should be none.

Maintaining OSDs should be routine, so that you're confident your application
stays safe while hardware fails within the amount of unused reserve one has configured.

In the end, what happens to your cluster when a complete node fails?

Regards

Steffen
Post by Christian Balzer
Christian
--
Christian Balzer Network/Systems Engineer
http://www.gol.com/
_______________________________________________
ceph-users mailing list
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Klinik-Service Neubrandenburg GmbH
Allendestr. 30, 17036 Neubrandenburg
Amtsgericht Neubrandenburg, HRB 2457
Geschaeftsfuehrerin: Gudrun Kappich
Christian Balzer
2016-04-14 15:00:44 UTC
Permalink
Steffen Weißgerber
2016-04-15 09:50:04 UTC
Permalink
Post by Christian Balzer
Hello,
[reduced to ceph-users]
Post by Christian Balzer
Post by Steffen Weißgerber
Post by Christian Balzer
Hello,
Hi,
Post by Christian Balzer
I'm officially only allowed to do (preventative) maintenance during
weekend nights on our main production cluster.
That would mean 13 ruined weekends at the realistic rate of 1 OSD per
night, so you can see where my lack of enthusiasm for OSD recreation
comes from.
Wondering extremely about that. We introduced ceph for VM's on RBD to not
have to move maintenance time to night shift.
This is Japan.
It makes the most anal-retentive people/rules in "der alten Heimat" (the old
homeland) look like a bunch of hippies on drugs.
Note the preventative and I should have put "officially" in quotes, like
that.
I can do whatever I feel comfortable with on our other production cluster,
since there aren't hundreds of customers with very, VERY tight SLAs on it.
So if I were to tell my boss that I want to renew all OSDs he'd say "Sure,
but at a time when, if anything goes wrong, it will not impact any customer
unexpectedly", meaning the official maintenance windows...
For "all OSD's" (at the same time), I would agree. But when we talk
about
changing one by one the effect to a cluster auf x OSD's on y nodes ...
Hmm.
Post by Christian Balzer
Post by Steffen Weißgerber
My understanding of ceph is that it was also made as reliable storage in
case of hardware failure.
Reliable, yes. With certain limitations, see below.
Post by Steffen Weißgerber
So what's the difference between maintaining an osd and its failure in
effect for the end user? In both cases it should be none.
Ideally, yes.
Note that an OSD failure can result in slow I/O (to the point of what
would be considered service interruption) depending on the failure mode
and the various timeout settings.
So planned and properly executed maintenance has less impact.
None (or at least not noticeable) IF your cluster has enough resources
and/or all the tuning has been done correctly.
Post by Steffen Weißgerber
Maintaining OSD's should be routine so that you're confident that your
application stays save while hardware fails in a amount one configured
unused reserve.
IO is a very fickle beast, it may perform splendidly at 2000ops/s just to
totally go down the drain at 2100.
Knowing your capacity and reserve isn't straightforward, especially not in
a live environment as compared to synthetic tests.
In short, could that cluster (now, after upgrades and adding a cache tier)
handle OSD renewals at any given time?
Absolutely.
Will I get an official blessing to do so?
No effing way.
Understood. A setup with cache tiering is more complex than simple OSDs
with journals on SSD.

But that reminds me of a keynote held by Kris Köhntopp at
the FFG of the GUUG in 2015, where he talked about restarting a huge
MySQL DB that is part of the backend of booking.com. He had the choice
of regularly restarting the DB, which took 10-15 minutes or so, or killing the
DB process, after which the DB recovery took only 1-2 minutes.

Having this knowledge, he said, is one thing, but being confident enough
to do it with a good feeling only comes from the experience of having done it
routinely.

Please don't get me wrong, I will not force you to be reckless.

Another interesting fact, Kris explained, was that the IT department was
equipped with a budget for loss of business due to IT unavailability, and
management only intervened when this budget was exhausted.

That's also a kind of reserve an IT administrator can work with. But having
such a budget surely depends on a corresponding management mentality.
Post by Christian Balzer
Post by Steffen Weißgerber
In the end what happens to your cluster, when a complete node
fails?
Post by Christian Balzer
Nothing much, in fact LESS than when an OSD should fail since it won't
trigger re-balancing (mon_osd_down_out_subtree_limit = host).
Yes, but can a single OSD change trigger this in your configuration, and
is the amount of data large enough to cause a relevant recovery load?

And you have the same problem when you extend your cluster, don't you?

For me, an operation that would cause such worries would be changing
crushmap-related things (e.g. our tunables are already on the bobtail profile),
but mainly because I have never done it.
Post by Christian Balzer
Regards,
Christian
Regards

Steffen
Post by Christian Balzer
--
Christian Balzer Network/Systems Engineer
http://www.gol.com/
--
Klinik-Service Neubrandenburg GmbH
Allendestr. 30, 17036 Neubrandenburg
Amtsgericht Neubrandenburg, HRB 2457
Geschaeftsfuehrerin: Gudrun Kappich
Loic Dachary
2016-04-12 06:39:38 UTC
Permalink
Hi Sage,

I suspect most people nowadays run tests and develop on ext4. Not supporting ext4 in the future means we'll need to find a convenient way for developers to run tests against the supported file systems.

My 2cts :-)
Hi,
ext4 has never been recommended, but we did test it. After Jewel is out,
we would like explicitly recommend *against* ext4 and stop testing it.
Recently we discovered an issue with the long object name handling that is
not fixable without rewriting a significant chunk of FileStores filename
handling. (There is a limit in the amount of xattr data ext4 can store in
the inode, which causes problems in LFNIndex.)
We *could* invest a ton of time rewriting this to fix, but it only affects
ext4, which we never recommended, and we plan to deprecate FileStore once
BlueStore is stable anyway, so it seems like a waste of time that would be
better spent elsewhere.
Also, by dropping ext4 test coverage in ceph-qa-suite, we can
significantly improve time/coverage for FileStore on XFS and on BlueStore.
The long file name handling is problematic anytime someone is storing
rados objects with long names. The primary user that does this is RGW,
which means any RGW cluster using ext4 should recreate their OSDs to use
XFS. Other librados users could be affected too, though, like users
with very long rbd image names (e.g., > 100 characters), or custom
librados users.
To make this change as visible as possible, the plan is to make ceph-osd
refuse to start if the backend is unable to support the configured max
object name (osd_max_object_name_len). The OSD will complain that ext4
cannot store such an object and refuse to start. A user who is only using
RBD might decide they don't need long file names to work and can adjust
the osd_max_object_name_len setting to something small (say, 64) and run
successfully. They would be taking a risk, though, because we would like
to stop testing on ext4.
Is this reasonable? If there significant ext4 users that are unwilling to
recreate their OSDs, now would be the time to speak up.
Thanks!
sage
_______________________________________________
Ceph-maintainers mailing list
http://lists.ceph.com/listinfo.cgi/ceph-maintainers-ceph.com
--
Loïc Dachary, Artisan Logiciel Libre
Michael Metz-Martini | SpeedPartner GmbH
2016-04-12 07:00:19 UTC
Permalink
Hi,
ext4 has never been recommended, but we did test it. After Jewel is out,
we would like explicitly recommend *against* ext4 and stop testing it.
Hmmm. We're currently migrating away from xfs as we had some strange
performance issues which were resolved / got better by switching to
ext4. We think this is related to our high number of objects (4358
Mobjects according to ceph -s).
Recently we discovered an issue with the long object name handling
that is not fixable without rewriting a significant chunk of
FileStores filename handling. (There is a limit in the amount of
xattr data ext4 can store in the inode, which causes problems in
LFNIndex.)
We're only using cephfs so we shouldn't be affected by your discovered
bug, right?
--
Kind regards
Michael
Christian Balzer
2016-04-13 02:29:30 UTC
Permalink
Hello,

On Tue, 12 Apr 2016 09:00:19 +0200 Michael Metz-Martini | SpeedPartner
Post by Michael Metz-Martini | SpeedPartner GmbH
Hi,
ext4 has never been recommended, but we did test it. After Jewel is
out, we would like explicitly recommend *against* ext4 and stop
testing it.
Hmmm. We're currently migrating away from xfs as we had some strange
performance-issues which were resolved / got better by switching to
ext4. We think this is related to our high number of objects (4358
Mobjects according to ceph -s).
It would be interesting to see how this maps out to the OSDs/PGs.
I'd guess loads and loads of subdirectories per PG, which is probably where
Ext4 performs better than XFS.
Post by Michael Metz-Martini | SpeedPartner GmbH
Recently we discovered an issue with the long object name handling
that is not fixable without rewriting a significant chunk of
FileStores filename handling. (There is a limit in the amount of
xattr data ext4 can store in the inode, which causes problems in
LFNIndex.)
We're only using cephfs so we shouldn't be affected by your discovered
bug, right?
I don't use CephFS, but you should be able to tell this yourself by doing
a "rados -p <poolname> ls" on your data and metadata pools and see the
resulting name lengths.
However since you have so many objects, I'd do that on a test cluster, if
you have one. ^o^
If CephFS is using the same/similar hashing to create object names as it
does with RBD images I'd imagine you're OK.
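A rough sketch of such a check, assuming the default pool names and standard
shell tools; against billions of objects this listing will of course take a while:
---
for pool in data metadata; do
    printf '%s: longest object name = ' "$pool"
    rados -p "$pool" ls | awk '{ if (length($0) > max) max = length($0) } END { print max }'
done
---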

Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
Sage Weil
2016-04-13 12:30:52 UTC
Permalink
Post by Christian Balzer
Post by Michael Metz-Martini | SpeedPartner GmbH
Recently we discovered an issue with the long object name handling
that is not fixable without rewriting a significant chunk of
FileStores filename handling. (There is a limit in the amount of
xattr data ext4 can store in the inode, which causes problems in
LFNIndex.)
We're only using cephfs so we shouldn't be affected by your discovered
bug, right?
I don't use CephFS, but you should be able to tell this yourself by doing
a "rados -p <poolname> ls" on your data and metadata pools and see the
resulting name lengths.
However since you have so many objects, I'd do that on a test cluster, if
you have one. ^o^
If CephFS is using the same/similar hashing to create object names as it
does with RBD images I'd imagine you're OK.
All of CephFS's object names are short, like RBD's.

For RBD, there is only one object per image that is long: rbd_id.$name.
As long as your RBD image names are "short" (a max length of 256 chars is
enough to make ext4 happy) you'll be fine.

sage
Christian Balzer
2016-04-14 00:57:00 UTC
Permalink
Post by Sage Weil
Post by Christian Balzer
Post by Michael Metz-Martini | SpeedPartner GmbH
Recently we discovered an issue with the long object name handling
that is not fixable without rewriting a significant chunk of
FileStores filename handling. (There is a limit in the amount of
xattr data ext4 can store in the inode, which causes problems in
LFNIndex.)
We're only using cephfs so we shouldn't be affected by your
discovered bug, right?
I don't use CephFS, but you should be able to tell this yourself by
doing a "rados -p <poolname> ls" on your data and metadata pools and
see the resulting name lengths.
However since you have so many objects, I'd do that on a test cluster,
if you have one. ^o^
If CephFS is using the same/similar hashing to create object names as
it does with RBD images I'd imagine you're OK.
All of CephFS's object names are short, like RBD's.
Sweet!
Post by Sage Weil
For RBD, there is only one object per image that is long: rbd_id.$name.
As long as your RBD image names are "short" (a max length of 256 chars
is enough to make ext4 happy) you'll be fine.
No worries there, ganeti definitely creates them way shorter than that and
IIRC so do Open(Stack/Nebula).
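For anyone who wants to double-check their own image names, a quick sketch
(pool name "rbd" assumed):
---
rbd ls rbd | awk '{ print length($0), $0 }' | sort -n | tail -1
---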

Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
Michael Metz-Martini | SpeedPartner GmbH
2016-04-13 12:51:58 UTC
Permalink
Hi,
Post by Christian Balzer
On Tue, 12 Apr 2016 09:00:19 +0200 Michael Metz-Martini | SpeedPartner
Post by Michael Metz-Martini | SpeedPartner GmbH
ext4 has never been recommended, but we did test it. After Jewel is
out, we would like explicitly recommend *against* ext4 and stop
testing it.
Hmmm. We're currently migrating away from xfs as we had some strange
performance-issues which were resolved / got better by switching to
ext4. We think this is related to our high number of objects (4358
Mobjects according to ceph -s).
It would be interesting to see on how this maps out to the OSDs/PGs.
I'd guess loads and loads of subdirectories per PG, which is probably where
Ext4 performs better than XFS.
A simple ls -l takes "ages" on XFS while ext4 lists a directory
immediately. According to our findings regarding XFS this seems to be
"normal" behavior.

pool name category KB objects
data - 3240 2265521646
document_root - 577364 10150
images - 96197462245 2256616709
metadata - 1150105 35903724
queue - 542967346 173865
raw - 36875247450 13095410

total of 4736 pgs, 6 pools, 124 TB data, 4359 Mobjects

What would you like to see?
tree? du per Directory?

As you can see we have one data object in pool "data" per file saved
somewhere else. I'm not sure what this is related to, but maybe it is
required by cephfs.
--
Kind regards
Michael Metz-Martini
Christian Balzer
2016-04-14 01:32:27 UTC
Permalink
Hello,

[reducing MLs to ceph-user]

On Wed, 13 Apr 2016 14:51:58 +0200 Michael Metz-Martini | SpeedPartner
Post by Michael Metz-Martini | SpeedPartner GmbH
Hi,
Post by Christian Balzer
On Tue, 12 Apr 2016 09:00:19 +0200 Michael Metz-Martini | SpeedPartner
Post by Michael Metz-Martini | SpeedPartner GmbH
ext4 has never been recommended, but we did test it. After Jewel is
out, we would like explicitly recommend *against* ext4 and stop
testing it.
Hmmm. We're currently migrating away from xfs as we had some strange
performance-issues which were resolved / got better by switching to
ext4. We think this is related to our high number of objects (4358
Mobjects according to ceph -s).
It would be interesting to see on how this maps out to the OSDs/PGs.
I'd guess loads and loads of subdirectories per PG, which is probably
where Ext4 performs better than XFS.
A simple ls -l takes "ages" on XFS while ext4 lists a directory
immediately. According to our findings regarding XFS this seems to be
"normal" behavior.
Just for the record, this is also influenced (for Ext4 at least) by how
much memory you have and the "vm/vfs_cache_pressure" setting.
Once Ext4 runs out of space in SLAB for dentry and ext4_inode_cache
(amongst others), it will become slower as well, since it has to go to the
disk.
Another thing to remember is that "ls" by itself is also a LOT faster than
"ls -l" since it accesses less data.
Post by Michael Metz-Martini | SpeedPartner GmbH
pool name category KB objects
data - 3240 2265521646
document_root - 577364 10150
images - 96197462245 2256616709
metadata - 1150105 35903724
queue - 542967346 173865
raw - 36875247450 13095410
total of 4736 pgs, 6 pools, 124 TB data, 4359 Mobjects
What would you like to see?
tree? du per Directory?
Just an example tree and typical size of the first "data layer".
For example on my very lightly loaded/filled test cluster (45000 objects)
the actual objects are in the "top" directory of the PG in question, like:
ls -lah /var/lib/ceph/osd/ceph-3/current/2.fa_head/
total 289M
drwxr-xr-x 2 root root 8.0K Mar 8 11:06 .
drwxr-xr-x 106 root root 8.0K Mar 30 12:08 ..
-rw-r--r-- 1 root root 4.0M Mar 8 11:05 benchmark\udata\uirt03\u16185\uobject586__head_60D672FA__2
-rw-r--r-- 1 root root 0 Mar 8 10:50 __head_000000FA__2
-rw-r--r-- 1 root root 4.0M Mar 8 11:06 rb.0.1034.74b0dc51.0000000000c6__head_C147A6FA__2
[79 further objects]
---

Whereas on my main production cluster I got 2000 Kobjects and it's nested a
lot more like this:
---
ls -lah /var/lib/ceph/osd/ceph-2/current/2.35e_head/DIR_E/DIR_5/DIR_3/DIR_0
total 128M
drwxr-xr-x 2 root root 4.0K Mar 10 09:20 .
drwxr-xr-x 18 root root 32K Dec 14 15:15 ..
-rw-r--r-- 1 root root 0 Feb 21 01:15 __head_0000035E__2
-rw-r--r-- 1 root root 4.0M Jun 3 2015 rb.0.11eb.238e1f29.000000010b1b__head_AD6E035E__2
[36 further 4MB objects)
---
Post by Michael Metz-Martini | SpeedPartner GmbH
As you can see we have one data-object in pool "data" per file saved
somewhere else. I'm not sure what's this related to, but maybe this is a
must by cephfs.
That's rather confusing (even more so since I don't use CephFS), but it
feels wrong.
From what little I know about CephFS, you can have only one FS per
cluster and the pools can be arbitrarily named (default data and metadata).

Looking at your output above I'm assuming that "metadata" is actually what
the name implies and that you have quite a few files (as in CephFS
files) at 35 million objects in there.
Furthermore the actual DATA for these files seems to reside in "images",
not in "data" (which nearly empty at 3.2MB).
My guess is that you somehow managed to create things in a way that
puts references (not the actual data) to everything in "images" into "data".

Hell, it might even be a bug where a "data" pool will always be used by
Ceph in that fashion even if the actual data holding pool is named
differently.

Don't think that's normal at all and I wonder if you could just remove
"data", after checking with more knowledgeable people than me of course.

Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
Michael Metz-Martini | SpeedPartner GmbH
2016-04-14 17:39:01 UTC
Permalink
Hi,
Post by Christian Balzer
Post by Michael Metz-Martini | SpeedPartner GmbH
Post by Christian Balzer
Post by Michael Metz-Martini | SpeedPartner GmbH
ext4 has never been recommended, but we did test it. After Jewel is
out, we would like explicitly recommend *against* ext4 and stop
testing it.
Hmmm. We're currently migrating away from xfs as we had some strange
performance-issues which were resolved / got better by switching to
ext4. We think this is related to our high number of objects (4358
Mobjects according to ceph -s).
It would be interesting to see on how this maps out to the OSDs/PGs.
I'd guess loads and loads of subdirectories per PG, which is probably
where Ext4 performs better than XFS.
A simple ls -l takes "ages" on XFS while ext4 lists a directory
immediately. According to our findings regarding XFS this seems to be
"normal" behavior.
Just for the record, this is also influenced (for Ext4 at least) on how
much memory you have and the "vm/vfs_cache_pressure" settings.
Once Ext4 runs out of space in SLAB for dentry and ext4_inode_cache
(amongst others), it will become slower as well, since it has to go to the
disk.
Another thing to remember is that "ls" by itself is also a LOT faster than
"ls -l" since it accesses less data.
128 GB RAM for 21 OSDs (each 4 TB in size). Kernel so far "untuned"
regarding cache-pressure / inode-cache.
Post by Christian Balzer
Post by Michael Metz-Martini | SpeedPartner GmbH
pool name category KB objects
data - 3240 2265521646
document_root - 577364 10150
images - 96197462245 2256616709
metadata - 1150105 35903724
queue - 542967346 173865
raw - 36875247450 13095410
total of 4736 pgs, 6 pools, 124 TB data, 4359 Mobjects
What would you like to see?
tree? du per Directory?
Just an example tree and typical size of the first "data layer".
[...]
First levels seem to be empty, so:
./DIR_3
./DIR_3/DIR_9
./DIR_3/DIR_9/DIR_0
./DIR_3/DIR_9/DIR_0/DIR_0
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_0
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_D
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_E
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_A
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_C
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_1
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_4
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_2
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_B
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_5
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_3
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_9
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_6
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_F
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_7
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_8
./DIR_3/DIR_9/DIR_0/DIR_D
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_0
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_D
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_E
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_A
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_C
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_1
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_4
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_2
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_B
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_5
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_3
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_9
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_6
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_F
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_7
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_8
...

/var/lib/ceph/osd/ceph-58/current/6.93_head/DIR_3/DIR_9/DIR_C/DIR_0$ du -ms *
99 DIR_0
102 DIR_1
105 DIR_2
102 DIR_3
101 DIR_4
105 DIR_5
106 DIR_6
102 DIR_7
105 DIR_8
98 DIR_9
99 DIR_A
105 DIR_B
103 DIR_C
100 DIR_D
103 DIR_E
104 DIR_F
Post by Christian Balzer
Post by Michael Metz-Martini | SpeedPartner GmbH
As you can see we have one data-object in pool "data" per file saved
somewhere else. I'm not sure what's this related to, but maybe this is a
must by cephfs.
That's rather confusing (even more so since I don't use CephFS), but it
feels wrong.
From what little I know about CephFS is that you can have only one FS per
cluster and the pools can be arbitrarily named (default data and metadata).
[...]
Post by Christian Balzer
My guess is that you somehow managed to create things in a way that
puts references (not the actual data) of everything in "images" to "data".
You can tune the pool by e.g.
cephfs /mnt/storage/docroot set_layout -p 4

We thought this was a good idea so that we could change the replication
size differently for doc_root and raw-data if we liked. Seems this was a
bad idea for all objects.
--
Kind regards
Michael Metz-Martini
Christian Balzer
2016-04-15 01:07:07 UTC
Permalink
On Thu, 14 Apr 2016 19:39:01 +0200 Michael Metz-Martini | SpeedPartner
Post by Michael Metz-Martini | SpeedPartner GmbH
Hi,
[massive snip]

Thanks for that tree/du output, it matches what I expected.
You'd think XFS wouldn't be that intimidated by directories of that size.
Post by Michael Metz-Martini | SpeedPartner GmbH
Post by Christian Balzer
Post by Michael Metz-Martini | SpeedPartner GmbH
As you can see we have one data-object in pool "data" per file saved
somewhere else. I'm not sure what's this related to, but maybe this
is a must by cephfs.
That's rather confusing (even more so since I don't use CephFS), but it
feels wrong.
From what little I know about CephFS is that you can have only one FS
per cluster and the pools can be arbitrarily named (default data and
metadata).
[...]
Post by Christian Balzer
My guess is that you somehow managed to create things in a way that
puts references (not the actual data) of everything in "images" to "data".
You can tune the pool by e.g.
cephfs /mnt/storage/docroot set_layout -p 4
Yesterday morning I wouldn't have known what that meant, but since then I
did a lot of reading and created a CephFS on the test cluster as well,
including a second data pool and layouts.
Post by Michael Metz-Martini | SpeedPartner GmbH
We thought this was a good idea so that we can change the replication
size different for doc_root and raw-data if we like. Seems this was a
bad idea for all objects.
I'm not sure how you managed to get into that state or if it's a bug after
all, but I can't replicate it on the latest hammer.

Firstly I created a "default" FS, with the classic metadata and data
pools, mounted it and put some files into the root.
Then I added a second pool (filegoats) and set the layout for a
subdirectory to use it. After re-mounting the FS and copying data to that
subdir I get this, exactly what one would expect:
---

NAME ID USED %USED MAX AVAIL OBJECTS
data 0 82043k 0 1181G 334
metadata 1 2845k 0 1181G 20
rbd 2 161G 2.84 787G 41914
filegoats 10 89034k 0 1181G 336
---
So no duplicate objects (or at least their headers) for me.
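For reference, the whole reproduction boils down to something like this; a
sketch only, with mount point and PG count as placeholders, and with the
caveat that on hammer the data pool is attached via "ceph mds add_data_pool"
while newer releases use "ceph fs add_data_pool":
---
ceph osd pool create filegoats 128
ceph mds add_data_pool filegoats
mkdir /mnt/cephfs/goats
setfattr -n ceph.dir.layout.pool -v filegoats /mnt/cephfs/goats
cp -a /usr /mnt/cephfs/goats/
ceph df    # compare object counts in "data" vs "filegoats"
---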

If nobody else has anything to say about this, I'd consider filing a bug
report.

Regards,

Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
Michael Metz-Martini | SpeedPartner GmbH
2016-04-15 05:02:13 UTC
Permalink
Hi,
Post by Christian Balzer
Post by Michael Metz-Martini | SpeedPartner GmbH
We thought this was a good idea so that we can change the replication
size different for doc_root and raw-data if we like. Seems this was a
bad idea for all objects.
I'm not sure how you managed to get into that state or if it's a bug after
all, but I can't replicate it on the latest hammer.
Firstly I created a "default" FS, with the classic metadata and data
pools, mounted it and put some files into the root.
Then I added a second pool (filegoats) and set the layout for a
subdirectory to use it. After re-mounting the FS and copying data to that
---
NAME ID USED %USED MAX AVAIL OBJECTS
data 0 82043k 0 1181G 334
metadata 1 2845k 0 1181G 20
rbd 2 161G 2.84 787G 41914
filegoats 10 89034k 0 1181G 336
---
So no duplicate objects (or at least their headers) for me.
If nobody else has anything to say about this, I'd consider filing a bug
report.
I must admit that we're currently using 0.87 (Giant) and haven't
upgraded so far. It would be nice to know whether an upgrade would "clean" this
state or whether we should rather start with a new cluster ... :(
--
Kind regards
Michael Metz-Martini
Christian Balzer
2016-04-15 05:43:17 UTC
Permalink
Hello,

On Fri, 15 Apr 2016 07:02:13 +0200 Michael Metz-Martini | SpeedPartner
Post by Michael Metz-Martini | SpeedPartner GmbH
Hi,
Post by Christian Balzer
Post by Michael Metz-Martini | SpeedPartner GmbH
We thought this was a good idea so that we can change the replication
size different for doc_root and raw-data if we like. Seems this was a
bad idea for all objects.
I'm not sure how you managed to get into that state or if it's a bug
after all, but I can't replicate it on the latest hammer.
Firstly I created a "default" FS, with the classic metadata and data
pools, mounted it and put some files into the root.
Then I added a second pool (filegoats) and set the layout for a
subdirectory to use it. After re-mounting the FS and copying data to
---
NAME ID USED %USED MAX AVAIL OBJECTS
data 0 82043k 0 1181G 334
metadata 1 2845k 0 1181G 20
rbd 2 161G 2.84 787G 41914
filegoats 10 89034k 0 1181G 336
---
So no duplicate objects (or at least their headers) for me.
If nobody else has anything to say about this, I'd consider filing a
bug report.
Im must admit that we're currently using 0.87 (Giant) and haven't
upgraded so far. Would be nice to know if upgrade would "clean" this
state or we should better start with a new cluster ... :(
I can't really comment on that, but you will probably want to wait for
Jewel, being an LTS release and having plenty of CephFS enhancements
including a fsck.

Have you verified what those objects in your data pool are?
And that they are actually there on disk?
If so, I'd expect them all to be zero length.
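A quick sketch of such a check on an OSD node, assuming the "data" pool has
pool id 0 so its PGs live in 0.*_head directories:
---
find /var/lib/ceph/osd/ceph-*/current/0.*_head -type f -size 0 | wc -l
find /var/lib/ceph/osd/ceph-*/current/0.*_head -type f -size 0 | head -3
---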

Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
Michael Metz-Martini | SpeedPartner GmbH
2016-04-15 06:20:45 UTC
Permalink
Hi,
Post by Christian Balzer
On Fri, 15 Apr 2016 07:02:13 +0200 Michael Metz-Martini | SpeedPartner
Post by Michael Metz-Martini | SpeedPartner GmbH
Post by Christian Balzer
Post by Michael Metz-Martini | SpeedPartner GmbH
We thought this was a good idea so that we can change the replication
size different for doc_root and raw-data if we like. Seems this was a
bad idea for all objects.
[...]
Post by Christian Balzer
Post by Michael Metz-Martini | SpeedPartner GmbH
Post by Christian Balzer
If nobody else has anything to say about this, I'd consider filing a
bug report.
Im must admit that we're currently using 0.87 (Giant) and haven't
upgraded so far. Would be nice to know if upgrade would "clean" this
state or we should better start with a new cluster ... :(
I can't really comment on that, but you will probably want to wait for
Jewel, being a LTS release and having plenty of CephFS enhancements
including a fsck.
Have you verified what those objects in your data pool are?
And that they are actually there on disk?
If so, I'd expect them all to be zero length.
They exist and are all of size 0 - right.

/var/lib/ceph/osd/ceph-21/current/0.179_head/DIR_9/DIR_7/DIR_1/DIR_0/DIR_0/DIR_0$
ls -l
total 492
-rw-r--r--. 1 root root 0 Oct 6 2015
10003aed5cb.00000000__head_AF000179__0
-rw-r--r--. 1 root root 0 Oct 6 2015
10003d09223.00000000__head_6D000179__0
[..]

$ getfattr -d 10003aed5cb.00000000__head_AF000179__0
# file: 10003aed5cb.00000000__head_AF000179__0
user.ceph._=0sDQjpAAAABAM1AAAAAAAAABQAAAAxMDAwM2FlZDVjYi4wMDAwMDAwMP7/////////eQEArwAAAAAAAAAAAAAAAAAGAxwAAAAAAAAAAAAAAP////8AAAAAAAAAAP//////////AAAAAHTfAwAAAAAA2hoAAAAAAAAAAAAAAAAAAAICFQAAAAIAAAAAAAAAAGScLgEAAAAADQAAAAAAAAAAAAAAY4zeU3D2EwgCAhUAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB03wMAAAAAAAAAAAAAAAAAAAQAAAA=
user.ceph._parent=0sBQTvAAAAy9WuAwABAAAGAAAAAgIbAAAAldSuAwABAAAHAAAAOF81LmpwZ0gCAAAAAAAAAgIWAAAA1NGuAwABAAACAAAAMTKhAwAAAAAAAAICNAAAAHwIgwMAAQAAIAAAADBlZjY3MTk5OGMzNGE5MjViYzdjZjQxZGYyOTM5NmFlWgAAAAAAAAACAhYAAADce3oDAAEAAAIAAABmNscPAAAAAAAAAgIWAAAAJvV3AwABAAACAAAAMGWGeA0AAAAAAAICGgAAAAEAAAAAAAAABgAAAGltYWdlc28yNQAAAAAABgAAAAAAAAABAAAAAAAAAAAAAAA=
user.cephos.spill_out=0sMQA=
--
Kind regards
Michael Metz-Martini
Christian Balzer
2016-04-18 04:05:49 UTC
Permalink
Hello,

On Fri, 15 Apr 2016 08:20:45 +0200 Michael Metz-Martini | SpeedPartner
Post by Michael Metz-Martini | SpeedPartner GmbH
Hi,
Post by Christian Balzer
On Fri, 15 Apr 2016 07:02:13 +0200 Michael Metz-Martini | SpeedPartner
Post by Michael Metz-Martini | SpeedPartner GmbH
Post by Christian Balzer
Post by Michael Metz-Martini | SpeedPartner GmbH
We thought this was a good idea so that we can change the
replication size different for doc_root and raw-data if we like.
Seems this was a bad idea for all objects.
[...]
Post by Christian Balzer
Post by Michael Metz-Martini | SpeedPartner GmbH
Post by Christian Balzer
If nobody else has anything to say about this, I'd consider filing a
bug report.
Im must admit that we're currently using 0.87 (Giant) and haven't
upgraded so far. Would be nice to know if upgrade would "clean" this
state or we should better start with a new cluster ... :(
Actually, I ran some more tests, with larger and differing data sets.

I can now replicate this behavior here, before:
---
NAME ID USED %USED MAX AVAIL OBJECTS
data 0 6224M 0.11 1175G 1870
metadata 1 18996k 0 1175G 24
filegoats 10 468M 0 1175G 1346
---

And after copying /usr/ from the client where that CephFS is mounted to the
directory mapped to "filegoats":
---
data 0 6224M 0.11 1173G 47274
metadata 1 42311k 0 1173G 4057
filegoats 10 1642M 0.03 1173G 43496
---

So not a "bug" per se, but not exactly elegant when considering the object
overhead.
This feels a lot like how cache-tiering is implemented as well (evicted
objects get zero'd, not deleted).

I guess the best strategy here is to have the vast majority of data in
"data" and only special cases in other pools (like SSD based ones).

It would be nice if somebody from the devs / RH could pipe up and the
documentation be updated to reflect this.

Christian
Post by Michael Metz-Martini | SpeedPartner GmbH
Post by Christian Balzer
I can't really comment on that, but you will probably want to wait for
Jewel, being a LTS release and having plenty of CephFS enhancements
including a fsck.
Have you verified what those objects in your data pool are?
And that they are actually there on disk?
If so, I'd expect them all to be zero length.
They exist and are all of size 0 - right.
/var/lib/ceph/osd/ceph-21/current/0.179_head/DIR_9/DIR_7/DIR_1/DIR_0/DIR_0/DIR_0$
ls -l
total 492
-rw-r--r--. 1 root root 0 Oct 6 2015
10003aed5cb.00000000__head_AF000179__0
-rw-r--r--. 1 root root 0 Oct 6 2015
10003d09223.00000000__head_6D000179__0
[..]
$ getfattr -d 10003aed5cb.00000000__head_AF000179__0
# file: 10003aed5cb.00000000__head_AF000179__0
user.ceph._=0sDQjpAAAABAM1AAAAAAAAABQAAAAxMDAwM2FlZDVjYi4wMDAwMDAwMP7/////////eQEArwAAAAAAAAAAAAAAAAAGAxwAAAAAAAAAAAAAAP////8AAAAAAAAAAP//////////AAAAAHTfAwAAAAAA2hoAAAAAAAAAAAAAAAAAAAICFQAAAAIAAAAAAAAAAGScLgEAAAAADQAAAAAAAAAAAAAAY4zeU3D2EwgCAhUAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB03wMAAAAAAAAAAAAAAAAAAAQAAAA=
user.ceph._parent=0sBQTvAAAAy9WuAwABAAAGAAAAAgIbAAAAldSuAwABAAAHAAAAOF81LmpwZ0gCAAAAAAAAAgIWAAAA1NGuAwABAAACAAAAMTKhAwAAAAAAAAICNAAAAHwIgwMAAQAAIAAAADBlZjY3MTk5OGMzNGE5MjViYzdjZjQxZGYyOTM5NmFlWgAAAAAAAAACAhYAAADce3oDAAEAAAIAAABmNscPAAAAAAAAAgIWAAAAJvV3AwABAAACAAAAMGWGeA0AAAAAAAICGgAAAAEAAAAAAAAABgAAAGltYWdlc28yNQAAAAAABgAAAAAAAAABAAAAAAAAAAAAAAA=
user.cephos.spill_out=0sMQA=
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
Gregory Farnum
2016-04-18 18:46:18 UTC
Permalink
Post by Christian Balzer
Hello,
On Fri, 15 Apr 2016 08:20:45 +0200 Michael Metz-Martini | SpeedPartner
Hi,
Post by Christian Balzer
On Fri, 15 Apr 2016 07:02:13 +0200 Michael Metz-Martini | SpeedPartner
Post by Michael Metz-Martini | SpeedPartner GmbH
Post by Christian Balzer
Post by Michael Metz-Martini | SpeedPartner GmbH
We thought this was a good idea so that we can change the
replication size different for doc_root and raw-data if we like.
Seems this was a bad idea for all objects.
[...]
Post by Christian Balzer
Post by Michael Metz-Martini | SpeedPartner GmbH
Post by Christian Balzer
If nobody else has anything to say about this, I'd consider filing a
bug report.
Im must admit that we're currently using 0.87 (Giant) and haven't
upgraded so far. Would be nice to know if upgrade would "clean" this
state or we should better start with a new cluster ... :(
Actually, I ran some more tests, with larger and differing data sets.
---
NAME ID USED %USED MAX AVAIL OBJECTS
data 0 6224M 0.11 1175G 1870
metadata 1 18996k 0 1175G 24
filegoats 10 468M 0 1175G 1346
---
And after copying /usr/ from the client were that CephFS is mounted to the
---
data 0 6224M 0.11 1173G 47274
metadata 1 42311k 0 1173G 4057
filegoats 10 1642M 0.03 1173G 43496
---
So not a "bug" per se, but not exactly elegant when considering the object
overhead.
This feels a lot like how cache-tiering is implemented as well (evicted
objects get zero'd, not deleted).
I guess the best strategy here is do to have the vast majority of data in
"data" and only special cases in other pools (like SSD based ones).
Would be nice if somebody from the devs, RH could pipe up and the
documentation updated to reflect this.
It's not really clear to me what test you're running here. But if
you're talking about lots of empty RADOS objects, you're probably
running into the backtraces. Objects store (often stale) backtraces of
their directory path in an xattr for disaster recovery and lookup. But
to facilitate that lookup, they need to be visible without knowing
anything about the data placement, so if you have a bunch of files
elsewhere it still puts a pointer backtrace in the default file data
pool.
Although I think we've talked about ways to avoid that and maybe did
something to improve it by Jewel, but I don't remember for certain.
-Greg
Christian Balzer
2016-04-19 00:00:30 UTC
Permalink
Post by Gregory Farnum
Post by Christian Balzer
Hello,
On Fri, 15 Apr 2016 08:20:45 +0200 Michael Metz-Martini | SpeedPartner
Hi,
Post by Christian Balzer
On Fri, 15 Apr 2016 07:02:13 +0200 Michael Metz-Martini |
Post by Michael Metz-Martini | SpeedPartner GmbH
Post by Christian Balzer
Post by Michael Metz-Martini | SpeedPartner GmbH
We thought this was a good idea so that we can change the
replication size different for doc_root and raw-data if we like.
Seems this was a bad idea for all objects.
[...]
Post by Christian Balzer
Post by Michael Metz-Martini | SpeedPartner GmbH
Post by Christian Balzer
If nobody else has anything to say about this, I'd consider
filing a bug report.
Im must admit that we're currently using 0.87 (Giant) and haven't
upgraded so far. Would be nice to know if upgrade would "clean"
this state or we should better start with a new cluster ... :(
Actually, I ran some more tests, with larger and differing data sets.
---
NAME ID USED %USED MAX AVAIL OBJECTS
data 0 6224M 0.11 1175G 1870
metadata 1 18996k 0 1175G 24
filegoats 10 468M 0 1175G 1346
---
And after copying /usr/ from the client were that CephFS is mounted to
---
data 0 6224M 0.11 1173G 47274
metadata 1 42311k 0 1173G 4057
filegoats 10 1642M 0.03 1173G 43496
---
So not a "bug" per se, but not exactly elegant when considering the
object overhead.
This feels a lot like how cache-tiering is implemented as well (evicted
objects get zero'd, not deleted).
I guess the best strategy here is do to have the vast majority of data
in "data" and only special cases in other pools (like SSD based ones).
Would be nice if somebody from the devs, RH could pipe up and the
documentation updated to reflect this.
It's not really clear to me what test you're running here.
Create FS, with default metadata and data pool.
Add another data pool (filegoats).
Map (set layout) a subdirectory to that data pool.
Copy lots of data (files) there.
Find all those empty objects in "data", matching up with the actual data
holding objects in "filegoats".
Post by Gregory Farnum
But if
you're talking about lots of empty RADOS objects, you're probably
running into the backtraces. Objects store (often stale) backtraces of
their directory path in an xattr for disaster recovery and lookup. But
to facilitate that lookup, they need to be visible without knowing
anything about the data placement, so if you have a bunch of files
elsewhere it still puts a pointer backtrace in the default file data
pool.
That's obviously what's happening here.
Post by Gregory Farnum
Although I think we've talked about ways to avoid that and maybe did
something to improve it by Jewel, but I don't remember for certain.
Michael would probably be most interested in that, with 2.2 billion of
those empty objects significantly impacting performance.

Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
Jan Schermer
2016-04-12 07:45:02 UTC
Permalink
I'd like to raise these points, then

1) some people (like me) will never ever use XFS if they have a choice
given no choice, we will not use something that depends on XFS

2) choice is always good

3) doesn't the majority of Ceph users only care about RBD?

(Angry rant coming)
Even our last performance testing of Ceph (Infernalis) showed abysmal performance. The most damning sign is the consumption of CPU time at an unprecedented rate. Was it faster than Dumpling? Slightly, but it ate more CPU also, so in effect it was not really "faster".

It would make *some* sense to only support ZFS or BTRFS because you can offload things like clones/snapshots and consistency to the filesystem - which would make the architecture much simpler and everything much faster.
Instead you insist on XFS and reimplement everything in software. I always dismissed this because CPU time was usually cheap, but in practice it simply doesn't work.
You duplicate things that filesystems had solved for years now (namely crash consistency - though we have seen that fail as well), instead of letting them do their work and stripping the IO path to the bare necessity and letting someone smarter and faster handle that.

IMO, If Ceph was moving in the right direction there would be no "supported filesystem" debate, instead we'd be free to choose whatever is there that provides the guarantees we need from filesystem (which is usually every filesystem in the kernel) and Ceph would simply distribute our IO around with CRUSH.

Right now CRUSH (and in effect what it allows us to do with data) is _the_ reason people use Ceph, as there simply wasn't much else to use for distributed storage. This isn't true anymore and the alternatives are orders of magnitude faster and smaller.

Jan

P.S. If anybody needs a way out I think I found it, with no need to trust a higher power :P
Hi,
ext4 has never been recommended, but we did test it. After Jewel is out,
we would like explicitly recommend *against* ext4 and stop testing it.
I should clarify that this is a proposal and solicitation of feedback--we
haven't made any decisions yet. Now is the time to weigh in.
sage
_______________________________________________
ceph-users mailing list
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Sage Weil
2016-04-12 18:00:31 UTC
Permalink
Post by Jan Schermer
I'd like to raise these points, then
1) some people (like me) will never ever use XFS if they have a choice
given no choice, we will not use something that depends on XFS
2) choice is always good
Okay!
Post by Jan Schermer
3) doesn't majority of Ceph users only care about RBD?
Probably that's true now. We shouldn't recommend something that prevents
them from adding RGW to an existing cluster in the future, though.
Post by Jan Schermer
(Angry rant coming)
Even our last performance testing of Ceph (Infernalis) showed abysmal
performance. The most damning sign is the consumption of CPU time at
unprecedented rate. Was it faster than Dumpling? Slightly, but it ate
more CPU also, so in effect it was not really "faster".
It would make *some* sense to only support ZFS or BTRFS because you can
offload things like clones/snapshots and consistency to the filesystem -
which would make the architecture much simpler and everything much
faster. Instead you insist on XFS and reimplement everything in
software. I always dismissed this because CPU time was usually cheap,
but in practice it simply doesn't work. You duplicate things that
filesystems had solved for years now (namely crash consistency - though
we have seen that fail as well), instead of letting them do their work
and stripping the IO path to the bare necessity and letting someone
smarter and faster handle that.
IMO, If Ceph was moving in the right direction there would be no
"supported filesystem" debate, instead we'd be free to choose whatever
is there that provides the guarantees we need from filesystem (which is
usually every filesystem in the kernel) and Ceph would simply distribute
our IO around with CRUSH.
Right now CRUSH (and in effect what it allows us to do with data) is
_the_ reason people use Ceph, as there simply wasn't much else to use
for distributed storage. This isn't true anymore and the alternatives
are orders of magnitude faster and smaller.
This touched on pretty much every reason why we are ditching file
systems entirely and moving toward BlueStore.

Local kernel file systems maintain their own internal consistency, but
they only provide whatever consistency promises the POSIX interface
makes--which is almost nothing. That's why every complicated data
structure (e.g., database) stored on a file system ever includes its own
journal. In our case, what POSIX provides isn't enough. We can't even
update a file and its xattr atomically, let alone the much more
complicated transitions we need to do. We could "wing it" and hope for
the best, then do an expensive crawl and rsync of data on recovery, but we
chose very early on not to do that. If you want a system that "just"
layers over an existing filesystem, you can try Gluster (although note
that they have a different sort of pain with the ordering of xattr
updates, and are moving toward a model that looks more like Ceph's backend
in their next version).
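To make the write-plus-xattr point concrete, a sketch in plain shell terms
(file and payload names are placeholders, and this is of course not how
FileStore actually issues its I/O):
---
# an object write is at least a data write plus an xattr update; POSIX offers
# no way to make the two a single atomic unit, so a crash between these two
# commands leaves data and metadata out of step -- exactly the gap the
# FileStore journal exists to cover
dd if=payload.bin of=obj__head_C147A6FA__2 bs=4M conv=fsync
setfattr -n user.ceph._ -v "<re-encoded object metadata>" obj__head_C147A6FA__2
---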

Offloading stuff to the file system doesn't save you CPU--it just makes
someone else responsible. What does save you CPU is avoiding the
complexity you don't need (i.e., half of what the kernel file system is
doing, and everything we have to do to work around an ill-suited
interface) and instead implementing exactly the set of features that we need
to get the job done.

FileStore is slow, mostly because of the above, but also because it is an
old and not-very-enlightened design. BlueStore is roughly 2x faster in
early testing.

Finally, remember you *are* completely free to run Ceph on whatever file
system you want--and many do. We just aren't going to test them all for
you and promise they will all work. Remember that we have hit different
bugs in every single one we've tried. It's not as simple as saying they
just have to "provide the guarantees we need" given the complexity of the
interface, and almost every time we've tried to use "supported" APIs that
are remotely unusual (fallocate, zeroing extents... even xattrs) we've
hit bugs or undocumented limits and idiosyncrasies on one fs or another.

Cheers-
sage
Jan Schermer
2016-04-12 19:19:07 UTC
Permalink
Post by Sage Weil
Post by Jan Schermer
I'd like to raise these points, then
1) some people (like me) will never ever use XFS if they have a choice
given no choice, we will not use something that depends on XFS
2) choice is always good
Okay!
Post by Jan Schermer
3) doesn't majority of Ceph users only care about RBD?
Probably that's true now. We shouldn't recommend something that prevents
them from adding RGW to an existing cluster in the future, though.
Post by Jan Schermer
(Angry rant coming)
Even our last performance testing of Ceph (Infernalis) showed abysmal
performance. The most damning sign is the consumption of CPU time at
unprecedented rate. Was it faster than Dumpling? Slightly, but it ate
more CPU also, so in effect it was not really "faster".
It would make *some* sense to only support ZFS or BTRFS because you can
offload things like clones/snapshots and consistency to the filesystem -
which would make the architecture much simpler and everything much
faster. Instead you insist on XFS and reimplement everything in
software. I always dismissed this because CPU time was usually cheap,
but in practice it simply doesn't work. You duplicate things that
filesystems had solved for years now (namely crash consistency - though
we have seen that fail as well), instead of letting them do their work
and stripping the IO path to the bare necessity and letting someone
smarter and faster handle that.
IMO, If Ceph was moving in the right direction there would be no
"supported filesystem" debate, instead we'd be free to choose whatever
is there that provides the guarantees we need from filesystem (which is
usually every filesystem in the kernel) and Ceph would simply distribute
our IO around with CRUSH.
Right now CRUSH (and in effect what it allows us to do with data) is
_the_ reason people use Ceph, as there simply wasn't much else to use
for distributed storage. This isn't true anymore and the alternatives
are orders of magnitude faster and smaller.
This touched on pretty much every reason why we are ditching file
systems entirely and moving toward BlueStore.
Nooooooooooooooo!
Post by Sage Weil
Local kernel file systems maintain their own internal consistency, but
they only provide what consistency promises the POSIX interface
does--which is almost nothing.
... which is exactly what everyone expects
... which is everything any app needs
Post by Sage Weil
That's why every complicated data
structure (e.g., database) stored on a file system ever includes its own
journal.
... see?
Post by Sage Weil
In our case, what POSIX provides isn't enough. We can't even
update a file and its xattr atomically, let alone the much more
complicated transitions we need to do.
... have you thought that maybe xattrs weren't meant to be abused this way? Filesystems usually aren't designed to be performant key=value stores.
btw at least i_version should be atomic?

And I still feel (ironically) that you don't understand what journals and commits/flushes are for if you make this argument...

Btw I think at least i_version xattr could be atomic.
Post by Sage Weil
We coudl "wing it" and hope for
the best, then do an expensive crawl and rsync of data on recovery, but we
chose very early on not to do that. If you want a system that "just"
layers over an existing filesystem, you can try Gluster (although note
that they have a different sort of pain with the ordering of xattr
updates, and are moving toward a model that looks more like Ceph's backend
in their next version).
True, which is why we dismissed it.
Post by Sage Weil
Offloading stuff to the file system doesn't save you CPU--it just makes
someone else responsible. What does save you CPU is avoiding the
complexity you don't need (i.e., half of what the kernel file system is
doing, and everything we have to do to work around an ill-suited
interface) and instead implement exactly the set of features that we need
to get the job done.
In theory you are right.
In practice in-kernel filesystems are fast, and fuse filesystems are slow.
Ceph is like that - slow. And you want to be fast by writing more code :)
Post by Sage Weil
FileStore is slow, mostly because of the above, but also because it is an
old and not-very-enlightened design. BlueStore is roughly 2x faster in
early testing.
... which is still literally orders of magnitude slower than a filesystem.
I dug into bluestore and how you want to implement it, and from what I understood you are reimplementing what the filesystem journal does...
It makes sense it will be 2x faster if you avoid the double-journalling, but I'd be very much surprised if it helped with CPU usage one bit - I certainly don't see my filesystems consuming significant amount of CPU time on any of my machines, and I seriously doubt you're going to do that better, sorry.
Post by Sage Weil
Finally, remember you *are* completely free to run Ceph on whatever file
system you want--and many do. We just aren't going to test them all for
you and promise they will all work. Remember that we have hit different
bugs in every single one we've tried. It's not as simple as saying they
just have to "provide the guarantees we need" given the complexity of the
interface, and almost every time we've tried to use "supported" APIs that
are remotely unusual (fallocate, zeroing extents... even xattrs) we've
hit bugs or undocumented limits and idiosyncrasies on one fs or another.
This can be a valid point: those are features people either don't use, or use quite differently. But just because you can stress the filesystems until they break doesn't mean you should go write a new one. What makes you think you will do a better job than all the people who made xfs/ext4/...?

Anyway, I don't know how else to debunk the "insufficient guarantees in POSIX filesystem transactions" myth that you insist on fixing, so I guess I'll have to wait until you rewrite everything up to the drive firmware to appreciate it :)

Jan


P.S. A joke for you
How many syscalls does it take for Ceph to write "lightbulb" to the disk?
10 000
ha ha?
Post by Sage Weil
Cheers-
sage
c***@jack.fr.eu.org
2016-04-12 19:40:04 UTC
Permalink
Post by Jan Schermer
Post by Sage Weil
Post by Jan Schermer
I'd like to raise these points, then
1) some people (like me) will never ever use XFS if they have a choice
given no choice, we will not use something that depends on XFS
2) choice is always good
Okay!
Post by Jan Schermer
3) doesn't majority of Ceph users only care about RBD?
Probably that's true now. We shouldn't recommend something that prevents
them from adding RGW to an existing cluster in the future, though.
Post by Jan Schermer
(Angry rant coming)
Even our last performance testing of Ceph (Infernalis) showed abysmal
performance. The most damning sign is the consumption of CPU time at
unprecedented rate. Was it faster than Dumpling? Slightly, but it ate
more CPU also, so in effect it was not really "faster".
It would make *some* sense to only support ZFS or BTRFS because you can
offload things like clones/snapshots and consistency to the filesystem -
which would make the architecture much simpler and everything much
faster. Instead you insist on XFS and reimplement everything in
software. I always dismissed this because CPU time was usually cheap,
but in practice it simply doesn't work. You duplicate things that
filesystems had solved for years now (namely crash consistency - though
we have seen that fail as well), instead of letting them do their work
and stripping the IO path to the bare necessity and letting someone
smarter and faster handle that.
IMO, If Ceph was moving in the right direction there would be no
"supported filesystem" debate, instead we'd be free to choose whatever
is there that provides the guarantees we need from filesystem (which is
usually every filesystem in the kernel) and Ceph would simply distribute
our IO around with CRUSH.
Right now CRUSH (and in effect what it allows us to do with data) is
_the_ reason people use Ceph, as there simply wasn't much else to use
for distributed storage. This isn't true anymore and the alternatives
are orders of magnitude faster and smaller.
This touched on pretty much every reason why we are ditching file
systems entirely and moving toward BlueStore.
Nooooooooooooooo!
Post by Sage Weil
Local kernel file systems maintain their own internal consistency, but
they only provide what consistency promises the POSIX interface
does--which is almost nothing.
... which is exactly what everyone expects
... which is everything any app needs
Correction: this is what every non-storage-related app needs.
mdadm is an app, and it does run over block storage (an extreme comparison)
ext4 is an app, same results

Ceph is there to store the data; it is much more "an FS" than "a regular app"
Post by Jan Schermer
Post by Sage Weil
That's why every complicated data
structure (e.g., database) stored on a file system ever includes its own
journal.
... see?
Post by Sage Weil
In our case, what POSIX provides isn't enough. We can't even
update a file and it's xattr atomically, let alone the much more
complicated transitions we need to do.
... have you thought that maybe xattrs weren't meant to be abused this way? Filesystems usually aren't designed to be performant key=value stores.
btw at least i_version should be atomic?
And I still feel (ironically) that you don't understand what journals and commits/flushes are for if you make this argument...
Btw I think at least i_version xattr could be atomic.
Post by Sage Weil
We could "wing it" and hope for
the best, then do an expensive crawl and rsync of data on recovery, but we
chose very early on not to do that. If you want a system that "just"
layers over an existing filesystem, you can try Gluster (although note
that they have a different sort of pain with the ordering of xattr
updates, and are moving toward a model that looks more like Ceph's backend
in their next version).
True, which is why we dismissed it.
Post by Sage Weil
Offloading stuff to the file system doesn't save you CPU--it just makes
someone else responsible. What does save you CPU is avoiding the
complexity you don't need (i.e., half of what the kernel file system is
doing, and everything we have to do to work around an ill-suited
interface) and instead implement exactly the set of features that we need
to get the job done.
In theory you are right.
In practice in-kernel filesystems are fast, and fuse filesystems are slow.
Ceph is like that - slow. And you want to be fast by writing more code :)
Yep, let's push ceph near butterfs, where it belongs
Would be awesome
Post by Jan Schermer
Post by Sage Weil
FileStore is slow, mostly because of the above, but also because it is an
old and not-very-enlightened design. BlueStore is roughly 2x faster in
early testing.
... which is still literally orders of magnitude slower than a filesystem.
I dug into bluestore and how you want to implement it, and from what I understood you are reimplementing what the filesystem journal does...
It makes sense it will be 2x faster if you avoid the double-journalling, but I'd be very much surprised if it helped with CPU usage one bit - I certainly don't see my filesystems consuming significant amount of CPU time on any of my machines, and I seriously doubt you're going to do that better, sorry.
Well, order of magnitude slower than a FS ?
I do have a cluster.
I do use it.
Ceph (over 7200rpm drives, no ssd journal) brings me better latency than a raid1
of Cheetah 15k drives

So, ceph is orders of magnitude *faster* than FS.
Post by Jan Schermer
Post by Sage Weil
Finally, remember you *are* completely free to run Ceph on whatever file
system you want--and many do. We just aren't going to test them all for
you and promise they will all work. Remember that we have hit different
bugs in every single one we've tried. It's not as simple as saying they
just have to "provide the guarantees we need" given the complexity of the
interface, and almost every time we've tried to use "supported" APIs that
are remotely unusual (fallocate, zeroing extents... even xattrs) we've
hit bugs or undocumented limits and idiosyncrasies on one fs or another.
This can be a valid point: those are features people either don't use, or use quite differently. But just because you can stress the filesystems until they break doesn't mean you should go write a new one. What makes you think you will do a better job than all the people who made xfs/ext4/...?
Not the same needs = not the same solution ?
Post by Jan Schermer
Anyway, I don't know how else to debunk the "insufficient guarantees in POSIX filesystem transactions" myth that you insist on fixing, so I guess I'll have to wait until you rewrite everything up to the drive firmware to appreciate it :)
Jan
P.S. A joke for you
How many syscalls does it take for Ceph to write "lightbulb" to the disk?
10 000
ha ha?
What is the point ?
Do you have an alternative ?
Is syscall count a good representation of the complexity / CPU usage of
something ?
You can write a large pile of shitty in-kernel code that will be used with a
single syscall
Means nothing to me
Sage Weil
2016-04-12 19:58:42 UTC
Permalink
Okay, I'll bite.
Post by Jan Schermer
Post by Sage Weil
Local kernel file systems maintain their own internal consistency, but
they only provide what consistency promises the POSIX interface
does--which is almost nothing.
... which is exactly what everyone expects
... which is everything any app needs
Post by Sage Weil
That's why every complicated data
structure (e.g., database) stored on a file system ever includes its own
journal.
... see?
They do this because POSIX doesn't give them what they want. They
implement a *second* journal on top. The result is that you get the
overhead from both--the fs journal keeping its data structures consistent,
the database keeping its own data structures consistent. If you're not careful, that means
the db has to do something like file write, fsync, db journal append,
fsync. And both fsyncs turn into a *fs* journal io and flush. (Smart
databases often avoid most of the fs overhead by putting everything in a
single large file, but at that point the file system isn't actually doing
anything except passing IO to the block layer).
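
Spelled out as syscalls, the naive version of that sequence looks something
like this; a rough C sketch only (the file names and record format are made
up, and a real database batches and orders this far more carefully):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char record[] = "row42=new-value\n";

    int data = open("data.db", O_WRONLY | O_CREAT, 0644);
    int jrnl = open("db.journal", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (data < 0 || jrnl < 0) { perror("open"); return 1; }

    /* 1. write the data itself ... */
    pwrite(data, record, sizeof(record) - 1, 4096);
    /* 2. ... and make it durable (this also commits the *fs* journal) */
    fsync(data);

    /* 3. append a commit record to the database's own journal ... */
    write(jrnl, "COMMIT row42\n", 13);
    /* 4. ... and make that durable too (a second fs journal commit/flush) */
    fsync(jrnl);

    close(data);
    close(jrnl);
    return 0;
}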

There is nothing wrong with POSIX file systems. They have the unenviable
task of catering to a huge variety of workloads and applications, but are
truly optimal for very few. And that's fine. If you want a local file
system, you should use ext4 or XFS, not Ceph.

But it turns out ceph-osd isn't a generic application--it has a pretty
specific workload pattern, and POSIX doesn't give us the interfaces we
want (mainly, atomic transactions or ordered object/file enumeration).
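
To make the "atomic transactions" point concrete, here is a minimal Linux
sketch of the gap between writing a file and updating its xattr (the object
name, xattr name and version scheme are invented; FileStore's real layout is
more involved):

#include <fcntl.h>
#include <stdio.h>
#include <sys/xattr.h>
#include <unistd.h>

int main(void)
{
    int fd = open("object.data", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char payload[] = "new object contents";
    const char version[] = "v2";

    /* step 1: overwrite the object data in place */
    pwrite(fd, payload, sizeof(payload) - 1, 0);

    /* <-- a crash here leaves new data paired with the *old* xattr: POSIX
           has no primitive to make the next call part of the same
           transaction as the write above */

    /* step 2: record the new version in an xattr */
    fsetxattr(fd, "user.version", version, sizeof(version) - 1, 0);

    fsync(fd);
    close(fd);
    return 0;
}
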
Post by Jan Schermer
Post by Sage Weil
We could "wing it" and hope for
the best, then do an expensive crawl and rsync of data on recovery, but we
chose very early on not to do that. If you want a system that "just"
layers over an existing filesystem, you can try Gluster (although note
that they have a different sort of pain with the ordering of xattr
updates, and are moving toward a model that looks more like Ceph's backend
in their next version).
True, which is why we dismissed it.
Post by Sage Weil
IMO, If Ceph was moving in the right direction [...] Ceph would
simply distribute our IO around with CRUSH.
You want ceph to "just use a file system." That's what gluster does--it
just layers the distributed namespace right on top of a local namespace.
If you didn't care about correctness or data safety, it would be
beautiful, and just as fast as the local file system (modulo network).
But if you want your data safe, you immediately realize that local POSIX
file systems don't get you what you need: the atomic update of two files
on different servers so that you can keep your replicas in sync. Gluster
originally took the minimal path to accomplish this: a "simple"
prepare/write/commit, using xattrs as transaction markers. We took a
heavyweight approach to support arbitrary transactions. And both of us
have independently concluded that the local fs is the wrong tool for the
job.
Post by Jan Schermer
Post by Sage Weil
Offloading stuff to the file system doesn't save you CPU--it just makes
someone else responsible. What does save you CPU is avoiding the
complexity you don't need (i.e., half of what the kernel file system is
doing, and everything we have to do to work around an ill-suited
interface) and instead implement exactly the set of features that we need
to get the job done.
In theory you are right.
In practice in-kernel filesystems are fast, and fuse filesystems are slow.
Ceph is like that - slow. And you want to be fast by writing more code :)
You get fast by writing the *right* code, and eliminating layers of the
stack (the local file system, in this case) that are providing
functionality you don't want (or more functionality than you need at too
high a price).
Post by Jan Schermer
I dug into bluestore and how you want to implement it, and from what I
understood you are reimplementing what the filesystem journal does...
Yes. The difference is that a single journal manages all of the metadata
and data consistency in the system, instead of a local fs journal managing
just block allocation and a second ceph journal managing ceph's data
structures.

The main benefit, though, is that we can choose a different set of
semantics, like the ability to overwrite data in a file/object and update
metadata atomically. You can't do that with POSIX without building a
write-ahead journal and double-writing.
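
For illustration, that write-ahead pattern looks roughly like this; a sketch
only (file names and the text record format are invented, and FileStore's
actual journal is binary and batched, but the double-write is the same):

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    const char data[]  = "new block contents";
    const off_t offset = 8192;

    int jrnl = open("store.journal", O_WRONLY | O_CREAT | O_APPEND, 0644);
    int obj  = open("object.data",  O_WRONLY | O_CREAT, 0644);
    if (jrnl < 0 || obj < 0) { perror("open"); return 1; }

    /* 1. describe the whole transaction (data + metadata) in the journal */
    char rec[256];
    int n = snprintf(rec, sizeof(rec),
                     "TXN write object.data off=%lld len=%zu ver=2 data=%s\n",
                     (long long)offset, sizeof(data) - 1, data);
    write(jrnl, rec, (size_t)n);
    fsync(jrnl);                 /* the transaction is now durable */

    /* 2. apply it in place (a second copy of the same bytes) */
    pwrite(obj, data, sizeof(data) - 1, offset);
    fsync(obj);

    /* 3. the journal entry can be trimmed once the apply is durable;
          after a crash, replaying the journal simply redoes step 2 */
    close(obj);
    close(jrnl);
    return 0;
}
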
Post by Jan Schermer
Btw I think at least i_version xattr could be atomic.
Nope. All major file systems (other than btrfs) overwrite data in place,
which means it is impossible for any piece of metadata to accurately
indicate whether you have the old data or the new data (or perhaps a bit
of both).
Post by Jan Schermer
It makes sense it will be 2x faster if you avoid the double-journalling,
but I'd be very much surprised if it helped with CPU usage one bit - I
certainly don't see my filesystems consuming significant amount of CPU
time on any of my machines, and I seriously doubt you're going to do
that better, sorry.
Apples and oranges. The file systems aren't doing what we're doing. But
once you combine what we spend now in FileStore + a local fs,
BlueStore will absolutely spend less CPU time.
Post by Jan Schermer
What makes you think you will do a better job than all the people who
made xfs/ext4/...?
I don't. XFS et al are great file systems and for the most part I have no
complaints about them. The problem is that Ceph doesn't need a file
system: it needs a transactional object store with a different set of
features. So that's what we're building.

sage
Jan Schermer
2016-04-12 20:33:00 UTC
Permalink
Still the answer to most of your points from me is "but who needs that?"
Who needs to have exactly the same data in two separate objects (replicas)? Ceph needs it because "consistency"?, but the app (VM filesystem) is fine with whatever version because the flush didn't happen (if it did the contents would be the same).

You say "Ceph needs", but I say "the guest VM needs" - there's the problem.
Post by Sage Weil
Okay, I'll bite.
Post by Jan Schermer
Post by Sage Weil
Local kernel file systems maintain their own internal consistency, but
they only provide what consistency promises the POSIX interface
does--which is almost nothing.
... which is exactly what everyone expects
... which is everything any app needs
Post by Sage Weil
That's why every complicated data
structure (e.g., database) stored on a file system ever includes its own
journal.
... see?
They do this because POSIX doesn't give them what they want. They
implement a *second* journal on top. The result is that you get the
overhead from both--the fs journal keeping its data structures consistent,
the database keeping its own data structures consistent. If you're not careful, that means
the db has to do something like file write, fsync, db journal append,
fsync.
It's more like
transaction log write, flush
data write
That's simply because most filesystems don't journal data, but some do.
Post by Sage Weil
And both fsyncs turn into a *fs* journal io and flush. (Smart
databases often avoid most of the fs overhead by putting everything in a
single large file, but at that point the file system isn't actually doing
anything except passing IO to the block layer).
There is nothing wrong with POSIX file systems. They have the unenviable
task of catering to a huge variety of workloads and applications, but are
truly optimal for very few. And that's fine. If you want a local file
system, you should use ext4 or XFS, not Ceph.
But it turns out ceph-osd isn't a generic application--it has a pretty
specific workload pattern, and POSIX doesn't give us the interfaces we
want (mainly, atomic transactions or ordered object/file enumeration).
The workload (with RBD) is inevitably expecting POSIX. Who needs more than that? To me that indicates unnecessary guarantees.
Post by Sage Weil
Post by Jan Schermer
Post by Sage Weil
We could "wing it" and hope for
the best, then do an expensive crawl and rsync of data on recovery, but we
chose very early on not to do that. If you want a system that "just"
layers over an existing filesystem, you can try Gluster (although note
that they have a different sort of pain with the ordering of xattr
updates, and are moving toward a model that looks more like Ceph's backend
in their next version).
True, which is why we dismissed it.
I was implying it suffers the same flaws. In any case it wasn't really fast and it seemed overly complex.
To be fair it was some while ago when I tried it.
Can't talk about consistency - I don't think I ever used it in production as more than a PoC.
Post by Sage Weil
Post by Jan Schermer
Post by Sage Weil
IMO, If Ceph was moving in the right direction [...] Ceph would
simply distribute our IO around with CRUSH.
You want ceph to "just use a file system." That's what gluster does--it
just layers the distributed namespace right on top of a local namespace.
If you didn't care about correctness or data safety, it would be
beautiful, and just as fast as the local file system (modulo network).
But if you want your data safe, you immediately realize that local POSIX
file systems don't get you what you need: the atomic update of two files
on different servers so that you can keep your replicas in sync. Gluster
originally took the minimal path to accomplish this: a "simple"
prepare/write/commit, using xattrs as transaction markers. We took a
heavyweight approach to support arbitrary transactions. And both of us
have independently concluded that the local fs is the wrong tool for the
job.
Post by Jan Schermer
Post by Sage Weil
Offloading stuff to the file system doesn't save you CPU--it just makes
someone else responsible. What does save you CPU is avoiding the
complexity you don't need (i.e., half of what the kernel file system is
doing, and everything we have to do to work around an ill-suited
interface) and instead implement exactly the set of features that we need
to get the job done.
In theory you are right.
In practice in-kernel filesystems are fast, and fuse filesystems are slow.
Ceph is like that - slow. And you want to be fast by writing more code :)
You get fast by writing the *right* code, and eliminating layers of the
stack (the local file system, in this case) that are providing
functionality you don't want (or more functionality than you need at too
high a price).
Post by Jan Schermer
I dug into bluestore and how you want to implement it, and from what I
understood you are reimplementing what the filesystem journal does...
Yes. The difference is that a single journal manages all of the metadata
and data consistency in the system, instead of a local fs journal managing
just block allocation and a second ceph journal managing ceph's data
structures.
The main benefit, though, is that we can choose a different set of
semantics, like the ability to overwrite data in a file/object and update
metadata atomically. You can't do that with POSIX without building a
write-ahead journal and double-writing.
Post by Jan Schermer
Btw I think at least i_version xattr could be atomic.
Nope. All major file systems (other than btrfs) overwrite data in place,
which means it is impossible for any piece of metadata to accurately
indicate whether you have the old data or the new data (or perhaps a bit
of both).
Post by Jan Schermer
It makes sense it will be 2x faster if you avoid the double-journalling,
but I'd be very much surprised if it helped with CPU usage one bit - I
certainly don't see my filesystems consuming significant amount of CPU
time on any of my machines, and I seriously doubt you're going to do
that better, sorry.
Apples and oranges. The file systems aren't doing what we're doing. But
once you combine what we spend now in FileStore + a local fs,
BlueStore will absolutely spend less CPU time.
I don't think it's apples and oranges.
If I export two files via losetup over iSCSI and build a raid1 (mdraid) out of them in a guest VM, I bet it will still be faster than ceph with bluestore.
And yet it will provide the same guarantees and do the same job without eating significant CPU time.
True or false?
Yes, the filesystem is unnecessary in this scenario, but the performance impact is negligible if you use it right.
Post by Sage Weil
Post by Jan Schermer
What makes you think you will do a better job than all the people who
made xfs/ext4/...?
I don't. XFS et al are great file systems and for the most part I have no
complaints about them. The problem is that Ceph doesn't need a file
system: it needs a transactional object store with a different set of
features. So that's what we're building.
sage
Sage Weil
2016-04-12 20:47:48 UTC
Permalink
Post by Jan Schermer
Still the answer to most of your points from me is "but who needs that?"
Who needs to have exactly the same data in two separate objects
(replicas)? Ceph needs it because "consistency"?, but the app (VM
filesystem) is fine with whatever version because the flush didn't
happen (if it did the contents would be the same).
If you want a replicated VM store that isn't picky about consistency,
try Sheepdog. Or your mdraid over iSCSI proposal.

We care about these things because VMs are just one of many users of
rados, and because even if we could get away with being sloppy in some (or
even most) cases with VMs, we need the strong consistency to build other
features people want, like RBD journaling for multi-site async
replication.

Then there's the CephFS MDS, RGW, and a pile of out-of-tree users that
chose rados for a reason.

And we want to make sense of an inconsistency when we find one on scrub.
(Does it mean the disk is returning bad data, or we just crashed during a
write a while back?)

...

Cheers-
sage
Nick Fisk
2016-04-12 21:08:55 UTC
Permalink
Jan,

I would like to echo Sage's response here. It seems you only want a subset
of what Ceph offers, whereas RADOS is designed to offer a whole lot more,
which requires a lot more intelligence at the lower levels.

I must say I have found your attitude to both Sage and the Ceph project as a
whole over the last few emails quite disrespectful. I spend a lot of my time
trying to sell the benefits of open source, which centre on the openness of
the idea/code and not around the fact that you can get it for free. One of
the things that I like about open source is the constructive, albeit
sometimes abrupt, criticism that results in a better product.
Simply shouting Ceph is slow and it's because dev's don't understand
filesystems is not constructive.

I've just come back from an expo at ExCel London where many providers are
passionately talking about Ceph. There seems to be a lot of big money
sloshing about for something that is inherently "wrong"

Sage and the core Ceph team seem like very clever people to me and I trust
that over the years of development, that if they have decided that standard
FS's are not the ideal backing store for Ceph, that this is probably correct
decision. However I am also aware that the human condition "Can't see the
wood for the trees" is everywhere and I'm sure if you have any clever
insights into filesystem behaviour, the Ceph Dev team would be more than
open to suggestions.

Personally I wish I could contribute more to the project as I feel that I
(and my company) get more from Ceph than we put in, but it strikes a nerve
when there is such negative criticism for what effectively is a free
product.

Yes, I also suffer from the problem of slow sync writes, but the benefit of
being able to shift 1U servers around a Rack/DC compared to a SAS tethered
4U jbod somewhat outweighs that as well as several other advantages. A new
cluster that we are deploying has several hardware choices which go a long
way to improve this performance as well. Coupled with the coming Bluestore,
the future looks bright.
-----Original Message-----
Sage Weil
Sent: 12 April 2016 21:48
Subject: Re: [ceph-users] Deprecating ext4 support
Post by Jan Schermer
Still the answer to most of your points from me is "but who needs that?"
Who needs to have exactly the same data in two separate objects
(replicas)? Ceph needs it because "consistency"?, but the app (VM
filesystem) is fine with whatever version because the flush didn't
happen (if it did the contents would be the same).
If you want a replicated VM store that isn't picky about consistency, try
Sheepdog. Or your mdraid over iSCSI proposal.
We care about these things because VMs are just one of many users of
rados, and because even if we could get away with being sloppy in some (or
even most) cases with VMs, we need the strong consistency to build other
features people want, like RBD journaling for multi-site async
replication.
Then there's the CephFS MDS, RGW, and a pile of out-of-tree users that
chose rados for a reason.
And we want to make sense of an inconsistency when we find one on scrub.
(Does it mean the disk is returning bad data, or we just crashed during a
write
a while back?)
...
Cheers-
sage
_______________________________________________
ceph-users mailing list
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
w***@42on.com
2016-04-12 21:22:17 UTC
Permalink
Post by Nick Fisk
Jan,
I would like to echo Sage's response here. It seems you only want a subset
of what Ceph offers, whereas RADOS is designed to offer a whole lot more,
which requires a lot more intelligence at the lower levels.
I fully agree with your e-mail. I think the Ceph devs have earned their respect over the years and they know what they are talking about.

For years I have been wondering why there even was a POSIX filesystem underneath Ceph.
Post by Nick Fisk
I must say I have found your attitude to both Sage and the Ceph project as a
whole over the last few emails quite disrespectful. I spend a lot of my time
trying to sell the benefits of open source, which centre on the openness of
the idea/code and not around the fact that you can get it for free. One of
the things that I like about open source is the constructive, albeit
sometimes abrupt, criticism that results in a better product.
Simply shouting Ceph is slow and it's because dev's don't understand
filesystems is not constructive.
I've just come back from an expo at ExCel London where many providers are
passionately talking about Ceph. There seems to be a lot of big money
sloshing about for something that is inherently "wrong"
Sage and the core Ceph team seem like very clever people to me and I trust
that over the years of development, that if they have decided that standard
FS's are not the ideal backing store for Ceph, that this is probably correct
decision. However I am also aware that the human condition "Can't see the
wood for the trees" is everywhere and I'm sure if you have any clever
insights into filesystem behaviour, the Ceph Dev team would be more than
open to suggestions.
Personally I wish I could contribute more to the project as I feel that I
(and my company) get more from Ceph than we put in, but it strikes a nerve
when there is such negative criticism for what effectively is a free
product.
Yes, I also suffer from the problem of slow sync writes, but the benefit of
being able to shift 1U servers around a Rack/DC compared to a SAS tethered
4U jbod somewhat outweighs that as well as several other advantages. A new
cluster that we are deploying has several hardware choices which go a long
way to improve this performance as well. Coupled with the coming Bluestore,
the future looks bright.
-----Original Message-----
Sage Weil
Sent: 12 April 2016 21:48
Subject: Re: [ceph-users] Deprecating ext4 support
Post by Jan Schermer
Still the answer to most of your points from me is "but who needs that?"
Who needs to have exactly the same data in two separate objects
(replicas)? Ceph needs it because "consistency"?, but the app (VM
filesystem) is fine with whatever version because the flush didn't
happen (if it did the contents would be the same).
If you want a replicated VM store that isn't picky about consistency, try
Sheepdog. Or your mdraid over iSCSI proposal.
We care about these things because VMs are just one of many users of
rados, and because even if we could get away with being sloppy in some (or
even most) cases with VMs, we need the strong consistency to build other
features people want, like RBD journaling for multi-site async
replication.
Then there's the CephFS MDS, RGW, and a pile of out-of-tree users that
chose rados for a reason.
And we want to make sense of an inconsistency when we find one on scrub.
(Does it mean the disk is returning bad data, or we just crashed during a
write
a while back?)
...
Cheers-
sage
_______________________________________________
ceph-users mailing list
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Jan Schermer
2016-04-12 23:12:43 UTC
Permalink
I apologise, I probably should have dialed down a bit.
I'd like to personally apologise to Sage, who has been so patient with my ranting.

To be clear: We are so lucky to have Ceph. It was something we sorely needed and for the right price (free).
It was a dream come true for cloud providers - and it still is.

However, working with it in production, spending much time getting to know how ceph works, what it does, and also seeing how and where it fails prompted my interest in where it's going, because big public clouds are one thing, traditional SMB/small enterprise needs are another, and that's where I feel it fails hard. So I tried prodding here on the ML, watched performance talks (which, frankly, reinforced my confirmation bias) and hoped to see some hint of it getting better. That for me equals simpler, faster, not reinventing the wheel. I truly don't see that and it makes me sad.

You are talking about the big picture - Ceph for storing anything, new architecture - and it sounds cool. Given enough money and time it can materialise, I won't elaborate on that. I just hope you don't forget about the measly RBD users like me (I'd guesstimate a silent 90%+ majority, but no idea, hopefully the product manager has a better one) who are frustrated with the current design. I'd like to think I represent those users who used to solve HA with DRBD 10 years ago, who had to battle NFS shares with rsync and inotify scripts, who were the only people on-call every morning at 3AM when logrotate killed their IO, all while having to work with rotting hardware and no budget. We are still out there and there's nothing for us - RBD is not as fast, simple or reliable as DRBD, the filesystem is not as simple nor as fast as rsync, scrubbing still wakes us at 3AM...

I'd very much like Ceph to be my storage system of choice in the future again, which is why I am so vocal with my opinions, and maybe truly selfish with my needs. I have not yet been convinced of the bright future, and - being the sceptical^Wcynical monster I turned into - I expect everything which makes my spidey sense tingle to fail, as it usually does. But that's called confirmation bias, which can make my whole point moot I guess :)

Jan
Post by Nick Fisk
Jan,
I would like to echo Sage's response here. It seems you only want a subset
of what Ceph offers, whereas RADOS is designed to offer a whole lot more,
which requires a lot more intelligence at the lower levels.
I must say I have found your attitude to both Sage and the Ceph project as a
whole over the last few emails quite disrespectful. I spend a lot of my time
trying to sell the benefits of open source, which centre on the openness of
the idea/code and not around the fact that you can get it for free. One of
the things that I like about open source is the constructive, albeit
sometimes abrupt, criticism that results in a better product.
Simply shouting Ceph is slow and it's because dev's don't understand
filesystems is not constructive.
I've just come back from an expo at ExCel London where many providers are
passionately talking about Ceph. There seems to be a lot of big money
sloshing about for something that is inherently "wrong"
Sage and the core Ceph team seem like very clever people to me and I trust
that over the years of development, that if they have decided that standard
FS's are not the ideal backing store for Ceph, that this is probably correct
decision. However I am also aware that the human condition "Can't see the
wood for the trees" is everywhere and I'm sure if you have any clever
insights into filesystem behaviour, the Ceph Dev team would be more than
open to suggestions.
Personally I wish I could contribute more to the project as I feel that I
(and my company) get more from Ceph than we put in, but it strikes a nerve
when there is such negative criticism for what effectively is a free
product.
Yes, I also suffer from the problem of slow sync writes, but the benefit of
being able to shift 1U servers around a Rack/DC compared to a SAS tethered
4U jbod somewhat outweighs that as well as several other advantages. A new
cluster that we are deploying has several hardware choices which go a long
way to improve this performance as well. Coupled with the coming Bluestore,
the future looks bright.
-----Original Message-----
Sage Weil
Sent: 12 April 2016 21:48
Subject: Re: [ceph-users] Deprecating ext4 support
Post by Jan Schermer
Still the answer to most of your points from me is "but who needs that?"
Who needs to have exactly the same data in two separate objects
(replicas)? Ceph needs it because "consistency"?, but the app (VM
filesystem) is fine with whatever version because the flush didn't
happen (if it did the contents would be the same).
If you want a replicated VM store that isn't picky about consistency, try
Sheepdog. Or your mdraid over iSCSI proposal.
We care about these things because VMs are just one of many users of
rados, and because even if we could get away with being sloppy in some (or
even most) cases with VMs, we need the strong consistency to build other
features people want, like RBD journaling for multi-site async
replication.
Then there's the CephFS MDS, RGW, and a pile of out-of-tree users that
chose rados for a reason.
And we want to make sense of an inconsistency when we find one on scrub.
(Does it mean the disk is returning bad data, or we just crashed during a
write
a while back?)
...
Cheers-
sage
_______________________________________________
ceph-users mailing list
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Sage Weil
2016-04-13 13:13:41 UTC
Permalink
Post by Jan Schermer
I apologise, I probably should have dialed down a bit.
I'd like to personally apologise to Sage, for being so patient with my ranting.
No worries :)
Post by Jan Schermer
I just hope you don't forget about the measly RBD users like me (I'd
guesstimate a silent 90%+ majority, but no idea, hopefully the product
manager has a better one) who are frustrated from the current design.
Don't worry: RBD users are a pretty clear #1 as far as where our current
priorities are, and driving most of the decisions we make in RADOS.
They're just not the only priorities.

Cheers-
sage
Sage Weil
2016-04-13 13:06:12 UTC
Permalink
Post by Jan Schermer
Who needs to have exactly the same data in two separate objects
(replicas)? Ceph needs it because "consistency"?, but the app (VM
filesystem) is fine with whatever version because the flush didn't
happen (if it did the contents would be the same).
While we're talking/thinking about this, here's a simple example of why
the simple solution (let the replicas be out of sync), which seems
reasonable at first, can blow up in your face.

If a disk block contains A and you write B over the top of it and then
there is a failure (e.g. power loss before you issue a flush), it's okay
for the disk to contain either A or B. In a replicated system, let's say
2x mirroring (call them R1 and R2), you might end up with B on R1 and A
on R2. If you don't immediately clean it up, then at some point down the
line you might switch from reading R1 to reading R2 and the disk block
will go "back in time" (previously you read B, now you read A). A
single disk/replica will never do that, and applications can break.

For example, if the block in question is a journal block, we might see B
the first time (valid journal!), then do a bunch of work and
journal/write new stuff to the blocks that follow. Then we lose
power again, lose R1, replay the journal, read A from R2, and stop journal
replay early... missing out on all the new stuff. This can easily corrupt
a file system or database or whatever else.

It might sound unlikely, but keep in mind that writes to these
all-important metadata and commit blocks are extremely frequent. It's the
kind of thing you can usually get away with, until you don't, and then you
have a very bad day...
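
If it helps to see it laid out, here is a toy version of that sequence in C
(everything - the block values, the failover - is invented purely to show the
"back in time" effect described above):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char r1[16] = "A";   /* replica 1's copy of the block */
    char r2[16] = "A";   /* replica 2's copy of the block */

    /* the client writes B; it reaches R1, then power is lost before R2
       is updated and before any cleanup/resync runs */
    strcpy(r1, "B");

    /* the first read is served by R1: the client observes B and may now
       journal further work that depends on having seen B */
    printf("read #1 (from R1): %s\n", r1);

    /* later R1 fails and reads fail over to R2 ... */
    printf("read #2 (from R2): %s\n", r2);

    /* ... and the block has gone "back in time" from B to A, something a
       single disk would never do, which can truncate a journal replay
       exactly as described above */
    return 0;
}
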

sage
Samuel Just
2016-04-14 18:30:23 UTC
Permalink
It doesn't seem like it would be wise to run such systems on top of rbd.
-Sam
Post by Sage Weil
Post by Jan Schermer
Who needs to have exactly the same data in two separate objects
(replicas)? Ceph needs it because "consistency"?, but the app (VM
filesystem) is fine with whatever version because the flush didn't
happen (if it did the contents would be the same).
While we're talking/thinking about this, here's a simple example of why
the simple solution (let the replicas be out of sync), which seems
reasonable at first, can blow up in your face.
If a disk block contains A and you write B over the top of it and then
there is a failure (e.g. power loss before you issue a flush), it's okay
for the disk to contain either A or B. In a replicated system, let's say
2x mirroring (call them R1 and R2), you might end up with B on R1 and A
on R2. If you don't immediately clean it up, then at some point down the
line you might switch from reading R1 to reading R2 and the disk block
will go "back in time" (previously you read B, now you read A). A
single disk/replica will never do that, and applications can break.
For example, if the block in question is a journal block, we might see B
the first time (valid journal!), then do a bunch of work and
journal/write new stuff to the blocks that follow. Then we lose
power again, lose R1, replay the journal, read A from R2, and stop journal
replay early... missing out on all the new stuff. This can easily corrupt
a file system or database or whatever else.
If data is critical, applications use their own replicas (MySQL,
Cassandra, MongoDB...); if the above scenario happens and one replica is out
of sync, they use a quorum-like protocol to guarantee reading the latest
data, and repair those out-of-sync replicas. So is eventual consistency
in storage acceptable for them?
Jianjian
Post by Sage Weil
It might sound unlikely, but keep in mind that writes to these
all-important metadata and commit blocks are extremely frequent. It's the
kind of thing you can usually get away with, until you don't, and then you
have a very bad day...
sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
c***@jack.fr.eu.org
2016-04-12 21:14:45 UTC
Permalink
Post by Jan Schermer
I don't think it's apples and oranges.
If I export two files via losetup over iSCSI and build a raid1 (mdraid) out of them in a guest VM, I bet it will still be faster than ceph with bluestore.
And yet it will provide the same guarantees and do the same job without eating significant CPU time.
True or false?
False
First, your iSCSI server will be a SPOF
Second, you won't aggregate many things (limited by the network, at least)

Saying you don't care about consistency made me laugh ..
You are using xfs/ext4 options, like nobarrier etc, in production, right
? They can really improve performance, and only provide that so-useless
consistency that nobody cares about :)
Oliver Dzombic
2016-04-12 21:27:23 UTC
Permalink
Hi Jan,

I can answer your question very quickly: We.

We need that!

We need and want a stable, self-healing, scalable, robust, reliable
storage system which can talk to our infrastructure in different languages.

I fully understand that people whose infrastructure is going to lose
support from a piece of software are not too amused.

I don't understand your strict refusal to look at this matter from
different points of view.

And if you just think about it for a moment, you will remind yourself
that this software is not designed for a single purpose.

It's designed for multiple purposes, where "purpose" means the different
flavours/ways in which different people are trying to use the software.

I am very thankful when software designers try to make their product
better and better. If that means they have to drop support for a
filesystem type, then so be it.

You will not die from that, and neither will anyone else.

I am waiting for the upcoming Jewel to build a new cluster and migrate
the old Hammer cluster into it.

Jewel will have a new feature that will allow migrating clusters.

So what's your problem ? For now I don't see any drawback for you.

If the software is able to provide your rbd VMs, then you should not
care whether it's ext2,3,4,200 or xfs or $what_ever_new.

As long as it's working, and maybe even providing more features than
before, then what's the problem ?

That YOU don't need those features ? That you don't want your running
system to be changed ? That you are not the only ceph user and the
software is not privately developed for your needs ?

Seriously ?

So, let me welcome you to this world, where you are not alone, and where
there are other people who also have wishes and wants.

I am sure that the people who so badly need/want to keep ext4 support
are in the minority. Otherwise the ceph developers wouldn't drop it;
they are not stupid enough to drop a feature which is wanted/needed by
a majority of people.

So please, try to open your eyes a bit for the rest of the ceph users.

And, if you manage that, try to open your eyes for the ceph developers
who made a product here that enables you to manage your stuff and
whatever else you use ceph for.

And if that is all not ok/right from your side, then become a ceph
developer and code contributor. Keep up the ext4 support and try to
convince the other developers to maintain a feature which is technically
not needed, technically in the way of better software design, and used by
a minority of users. Good luck with that !
--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:***@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107
Post by Jan Schermer
Still the answer to most of your points from me is "but who needs that?"
Who needs to have exactly the same data in two separate objects (replicas)? Ceph needs it because "consistency"?, but the app (VM filesystem) is fine with whatever version because the flush didn't happen (if it did the contents would be the same).
You say "Ceph needs", but I say "the guest VM needs" - there's the problem.
Post by Sage Weil
Okay, I'll bite.
Post by Jan Schermer
Post by Sage Weil
Local kernel file systems maintain their own internal consistency, but
they only provide what consistency promises the POSIX interface
does--which is almost nothing.
... which is exactly what everyone expects
... which is everything any app needs
Post by Sage Weil
That's why every complicated data
structure (e.g., database) stored on a file system ever includes its own
journal.
... see?
They do this because POSIX doesn't give them what they want. They
implement a *second* journal on top. The result is that you get the
overhead from both--the fs journal keeping its data structures consistent,
the database keeping its own data structures consistent. If you're not careful, that means
the db has to do something like file write, fsync, db journal append,
fsync.
It's more like
transaction log write, flush
data write
That's simply because most filesystems don't journal data, but some do.
Post by Sage Weil
And both fsyncs turn into a *fs* journal io and flush. (Smart
databases often avoid most of the fs overhead by putting everything in a
single large file, but at that point the file system isn't actually doing
anything except passing IO to the block layer).
There is nothing wrong with POSIX file systems. They have the unenviable
task of catering to a huge variety of workloads and applications, but are
truly optimal for very few. And that's fine. If you want a local file
system, you should use ext4 or XFS, not Ceph.
But it turns out ceph-osd isn't a generic application--it has a pretty
specific workload pattern, and POSIX doesn't give us the interfaces we
want (mainly, atomic transactions or ordered object/file enumeration).
The workload (with RBD) is inevitably expecting POSIX. Who needs more than that? To me that indicates unnecessary guarantees.
Post by Sage Weil
Post by Jan Schermer
Post by Sage Weil
We could "wing it" and hope for
the best, then do an expensive crawl and rsync of data on recovery, but we
chose very early on not to do that. If you want a system that "just"
layers over an existing filesystem, you can try Gluster (although note
that they have a different sort of pain with the ordering of xattr
updates, and are moving toward a model that looks more like Ceph's backend
in their next version).
True, which is why we dismissed it.
I was implying it suffers the same flaws. In any case it wasn't really fast and it seemed overly complex.
To be fair it was some while ago when I tried it.
Can't talk about consistency - I don't think I ever used it in production as more than a PoC.
Post by Sage Weil
Post by Jan Schermer
Post by Sage Weil
IMO, If Ceph was moving in the right direction [...] Ceph would
simply distribute our IO around with CRUSH.
You want ceph to "just use a file system." That's what gluster does--it
just layers the distributed namespace right on top of a local namespace.
If you didn't care about correctness or data safety, it would be
beautiful, and just as fast as the local file system (modulo network).
But if you want your data safe, you immediately realize that local POSIX
file systems don't get you what you need: the atomic update of two files
on different servers so that you can keep your replicas in sync. Gluster
originally took the minimal path to accomplish this: a "simple"
prepare/write/commit, using xattrs as transaction markers. We took a
heavyweight approach to support arbitrary transactions. And both of us
have independently concluded that the local fs is the wrong tool for the
job.
Post by Jan Schermer
Post by Sage Weil
Offloading stuff to the file system doesn't save you CPU--it just makes
someone else responsible. What does save you CPU is avoiding the
complexity you don't need (i.e., half of what the kernel file system is
doing, and everything we have to do to work around an ill-suited
interface) and instead implement exactly the set of features that we need
to get the job done.
In theory you are right.
In practice in-kernel filesystems are fast, and fuse filesystems are slow.
Ceph is like that - slow. And you want to be fast by writing more code :)
You get fast by writing the *right* code, and eliminating layers of the
stack (the local file system, in this case) that are providing
functionality you don't want (or more functionality than you need at too
high a price).
Post by Jan Schermer
I dug into bluestore and how you want to implement it, and from what I
understood you are reimplementing what the filesystem journal does...
Yes. The difference is that a single journal manages all of the metadata
and data consistency in the system, instead of a local fs journal managing
just block allocation and a second ceph journal managing ceph's data
structures.
The main benefit, though, is that we can choose a different set of
semantics, like the ability to overwrite data in a file/object and update
metadata atomically. You can't do that with POSIX without building a
write-ahead journal and double-writing.
Post by Jan Schermer
Btw I think at least i_version xattr could be atomic.
Nope. All major file systems (other than btrfs) overwrite data in place,
which means it is impossible for any piece of metadata to accurately
indicate whether you have the old data or the new data (or perhaps a bit
of both).
Post by Jan Schermer
It makes sense it will be 2x faster if you avoid the double-journalling,
but I'd be very much surprised if it helped with CPU usage one bit - I
certainly don't see my filesystems consuming significant amount of CPU
time on any of my machines, and I seriously doubt you're going to do
that better, sorry.
Apples and oranges. The file systems aren't doing what we're doing. But
once you combine what we spend now in FileStore + a local fs,
BlueStore will absolutely spend less CPU time.
I don't think it's apples and oranges.
If I export two files via losetup over iSCSI and build a raid1 (mdraid) out of them in a guest VM, I bet it will still be faster than ceph with bluestore.
And yet it will provide the same guarantees and do the same job without eating significant CPU time.
True or false?
Yes, the filesystem is unnecessary in this scenario, but the performance impact is negligible if you use it right.
Post by Sage Weil
Post by Jan Schermer
What makes you think you will do a better job than all the people who
made xfs/ext4/...?
I don't. XFS et al are great file systems and for the most part I have no
complaints about them. The problem is that Ceph doesn't need a file
system: it needs a transactional object store with a different set of
features. So that's what we're building.
sage
_______________________________________________
ceph-users mailing list
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Gregory Farnum
2016-04-12 21:43:32 UTC
Permalink
Thank you for the votes of confidence, everybody. :)
It would be good if we could keep this thread focused on who is harmed
by retiring ext4 as a tested configuration at what speed, and break
out other threads for other issues. (I'm about to do that for one of
them!)
-Greg
c***@jack.fr.eu.org
2016-04-12 19:20:47 UTC
Permalink
Post by Jan Schermer
I'd like to raise these points, then
1) some people (like me) will never ever use XFS if they have a choice
given no choice, we will not use something that depends on XFS
Huh ?
Post by Jan Schermer
3) doesn't majority of Ceph users only care about RBD?
Well, half of the users do
The other half, including myself, are using radosgw
Post by Jan Schermer
Finally, remember you *are* completely free to run Ceph on whatever file
system you want--and many do.
Yep


About the "ext4 support" stuff, the wiki was pretty clear : you *can*
use ext4, but you *should* use xfs
This is why, although I mostly run ext4, my OSDs are built upon xfs.

So, I think it is a good idea to disable ext4 testing, and make the wiki
more explicit about that.
Beyond that point, as Sage said, people can use whatever FS they want
Udo Lembke
2016-04-12 07:56:13 UTC
Permalink
Hi Sage,
we run ext4 only on our 8node-cluster with 110 OSDs and are quite happy
with ext4.
We started with xfs but the latency was much higher compared to ext4...

But we use RBD only with "short" filenames like
rbd_data.335986e2ae8944a.00000000000761e1.
If we can switch from Jewel to K* and, during the update, change the
filestore for each OSD to BlueStore, it will be OK for us.
I hope we will then get better performance with BlueStore??
Will BlueStore be production ready during the Jewel lifetime, so that we
can switch to BlueStore before the next big upgrade?


Udo
Hi,
ext4 has never been recommended, but we did test it. After Jewel is out,
we would like explicitly recommend *against* ext4 and stop testing it.
Recently we discovered an issue with the long object name handling that is
not fixable without rewriting a significant chunk of FileStores filename
handling. (There is a limit in the amount of xattr data ext4 can store in
the inode, which causes problems in LFNIndex.)
We *could* invest a ton of time rewriting this to fix, but it only affects
ext4, which we never recommended, and we plan to deprecate FileStore once
BlueStore is stable anyway, so it seems like a waste of time that would be
better spent elsewhere.
Also, by dropping ext4 test coverage in ceph-qa-suite, we can
significantly improve time/coverage for FileStore on XFS and on BlueStore.
The long file name handling is problematic anytime someone is storing
rados objects with long names. The primary user that does this is RGW,
which means any RGW cluster using ext4 should recreate their OSDs to use
XFS. Other librados users could be affected too, though, like users
with very long rbd image names (e.g., > 100 characters), or custom
librados users.
To make this change as visible as possible, the plan is to make ceph-osd
refuse to start if the backend is unable to support the configured max
object name (osd_max_object_name_len). The OSD will complain that ext4
cannot store such an object and refuse to start. A user who is only using
RBD might decide they don't need long file names to work and can adjust
the osd_max_object_name_len setting to something small (say, 64) and run
successfully. They would be taking a risk, though, because we would like
to stop testing on ext4.
Is this reasonable? If there significant ext4 users that are unwilling to
recreate their OSDs, now would be the time to speak up.
Thanks!
sage
_______________________________________________
ceph-users mailing list
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Christian Balzer
2016-04-13 02:15:44 UTC
Permalink
Hello,
Post by Shinobu Kinjo
Hi Sage,
Not Sage, but since he hasn't piped up yet...
Post by Shinobu Kinjo
we run ext4 only on our 8node-cluster with 110 OSDs and are quite happy
with ext4.
We started with xfs but the latency was much higher compared to ext4...
Welcome to the club. ^o^
Post by Shinobu Kinjo
But we use RBD only with "short" filenames like
rbd_data.335986e2ae8944a.00000000000761e1.
If we can switch from Jewel to K* and, during the update, change the
filestore for each OSD to BlueStore, it will be OK for us.
I don't think K* will be a truly, fully stable BlueStore platform, but
I'll be happy to be proven wrong.
Also, would you really want to upgrade to a non-LTS version?
Post by Shinobu Kinjo
I hope we will then get better performance with BlueStore??
That seems to be a given, after having read up on it last night.
Post by Shinobu Kinjo
Will BlueStore be production ready during the Jewel lifetime, so that we
can switch to BlueStore before the next big upgrade?
Again doubtful from my perspective.

For example, cache-tiering was introduced in Firefly, and not as a technology
preview requiring "will eat your data" flags to be set in ceph.conf.

It seemingly worked well enough, but was broken in certain situations.
And in the latest Hammer release it is again dangerously broken by a
backport from Infernalis/Jewel.

Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
Max A. Krasilnikov
2016-04-12 08:42:06 UTC
Permalink
Hello!
Hi,
ext4 has never been recommended, but we did test it. After Jewel is out,
we would like explicitly recommend *against* ext4 and stop testing it.
1. Does filestore_xattr_use_omap fix the issues with ext4? That is, can I keep
using ext4 for a cluster with RBD && CephFS with this option set to true?
(See the sketch after this list.)
2. I agree with Christian: it would be better to warn rather than drop support
for a legacy fs until the old HW is out of service, 4-5 years.
3. Also, if BlueStore turns out to be that good, people will prefer it over
FileStore anyway, so the fs deprecation would not be so painful.
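
As a concrete illustration for question 1, this is a minimal sketch of the
[osd] section such a workaround would touch; whether these settings are
actually sufficient on ext4 is exactly the open question, and the values are
assumptions, not recommendations:

    # Hedged sketch of the ceph.conf knobs discussed in this thread; values are
    # illustrative only, not tested guidance for running FileStore on ext4.
    [osd]
    filestore_xattr_use_omap = true        # the legacy option asked about above
    osd_max_object_name_len = 256          # only safe if no client uses longer names
    osd_max_object_namespace_len = 64      # if supported by the release in use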

I'm not such an experienced ceph user, but I have limitations like Christian,
and changing fs would cost me 24 nights for now :(
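
For reference, the per-OSD replacement behind those 24 nights is roughly the
following; a minimal, hedged sketch with an assumed OSD id (osd.12), where the
final re-provisioning step depends on the release and tooling in use:

    # Hedged sketch: replace one OSD's backing filesystem, one OSD at a time.
    # osd.12 is an assumed example id; wait for HEALTH_OK between every step.
    ceph osd out 12                      # let its PGs remap elsewhere
    # ... wait until "ceph -s" shows all PGs active+clean ...
    systemctl stop ceph-osd@12
    ceph osd crush remove osd.12
    ceph auth del osd.12
    ceph osd rm 12
    # Re-provision the disk with the desired backend (e.g. xfs), recreate the
    # OSD with the tooling for your release, then let it backfill.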
--
WBR, Max A. Krasilnikov
Francois Lafont
2016-04-13 14:19:05 UTC
Permalink
Hello,
[...] Is this reasonable? [...]
Warning: I'm just a ceph user and definitely not an expert.

1. Personally, if you look at the documentation and read the mailing list
and/or IRC a little, it seems _clear_ to me that ext4 is not recommended, even
if the opposite is sometimes mentioned (personally I don't use ext4 in my ceph
cluster; I use xfs as the doc says).

2. I'm not a ceph expert, but I can imagine the monstrous amount of work that
developing software such as ceph represents, and I think it is sometimes
reasonable to limit that work where possible.

So deprecating ext4 seems reasonable to me. The comfort of users is important,
but in the _long_ term it seems more important that the developers can
concentrate their work on the important things.
--
François Lafont