Discussion:
[ceph-users] OCFS2 or GFS2 for cluster filesystem?
Tom Verdaat
2013-07-11 10:08:07 UTC
Permalink
Hi guys,

We want to use our Ceph cluster to create a shared disk file system to host
VM's. Our preference would be to use CephFS but since it is not considered
stable I'm looking into alternatives.

The most appealing alternative seems to be to create a RBD volume, format
it with a cluster file system and mount it on all the VM host machines.

Obvious file system candidates would be OCFS2 and GFS2 but I'm having
trouble finding recent and reliable documentation on the performance,
features and reliability of these file systems, especially related to our
specific use case. The specifics I'm trying to keep in mind are:

- Using it to host VM ephemeral disks means the file system needs to
perform well with few but very large files and usually machines don't try
to compete for access to the same file, except for during live migration.
- Needs to handle scale well (large number of nodes, manage a volume of
tens of terabytes and file sizes of tens or hundreds of gigabytes) and
handle online operations like increasing the volume size.
- Since the cluster FS is already running on a distributed storage
system (Ceph), the file system does not need to concern itself with things
like replication. Just needs to not get corrupted and be fast of course.


Anybody here that can help me shed some light on the following questions:

1. Are there other cluster file systems to consider besides OCFS2 and
GFS2?
2. Which one would yield the best performance for our use case?
3. Is anybody doing this already and willing to share their experience?
4. Is there anything important that you think we might have missed?


Your help is very much appreciated!

Thanks!

Tom
Gilles Mocellin
2013-07-11 18:25:11 UTC
Permalink
On 11/07/2013 12:08, Tom Verdaat wrote:
> Hi guys,
>
> We want to use our Ceph cluster to create a shared disk file system to
> host VM's. Our preference would be to use CephFS but since it is not
> considered stable I'm looking into alternatives.
>
> The most appealing alternative seems to be to create a RBD volume,
> format it with a cluster file system and mount it on all the VM host
> machines.
>
> Obvious file system candidates would be OCFS2 and GFS2 but I'm having
> trouble finding recent and reliable documentation on the performance,
> features and reliability of these file systems, especially related to
> our specific use case. The specifics I'm trying to keep in mind are:
>
> * Using it to host VM ephemeral disks means the file system needs to
> perform well with few but very large files and usually machines
> don't try to compete for access to the same file, except for
> during live migration.
> * Needs to handle scale well (large number of nodes, manage a volume
> of tens of terabytes and file sizes of tens or hundreds of
> gigabytes) and handle online operations like increasing the volume
> size.
> * Since the cluster FS is already running on a distributed storage
> system (Ceph), the file system does not need to concern itself
> with things like replication. Just needs to not get corrupted and
> be fast of course.
>
>
> Anybody here that can help me shed some light on the following questions:
>
> 1. Are there other cluster file systems to consider besides OCFS2 and
> GFS2?
> 2. Which one would yield the best performance for our use case?
> 3. Is anybody doing this already and willing to share their experience?
> 4. Is there anything important that you think we might have missed?
>

Hello,

Yes, you missed that qemu can use a RADOS volume directly.
Look here:
http://ceph.com/docs/master/rbd/qemu-rbd/

Create :
qemu-img create -f rbd rbd:data/squeeze 10G

Use :

qemu -m 1024 -drive format=raw,file=rbd:data/squeeze
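
For a libvirt-managed guest the same image can also be attached as a network disk. A minimal sketch using the data/squeeze image from above (the monitor host, domain name and target device here are made up, and a cephx setup would additionally need an <auth> element):

cat > rbd-disk.xml <<'EOF'
<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='rbd' name='data/squeeze'>
    <host name='mon1.example.com' port='6789'/>
  </source>
  <target dev='vdb' bus='virtio'/>
</disk>
EOF

virsh attach-device squeeze-vm rbd-disk.xml --persistent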
Alex Bligh
2013-07-11 21:03:21 UTC
Permalink
On 11 Jul 2013, at 19:25, Gilles Mocellin wrote:

> Hello,
>
> Yes, you missed that qemu can use directly RADOS volume.
> Look here :
> http://ceph.com/docs/master/rbd/qemu-rbd/
>
> Create :
> qemu-img create -f rbd rbd:data/squeeze 10G
>
> Use :
>
> qemu -m 1024 -drive format=raw,file=rbd:data/squeeze

I don't think he did. As I read it he wants his VMs to all access the same filing system, and doesn't want to use cephfs.

OCFS2 on RBD I suppose is a reasonable choice for that.

--
Alex Bligh
McNamara, Bradley
2013-07-11 22:18:47 UTC
Permalink
Correct me if I'm wrong, I'm new to this, but I think the distinction between the two methods is that using 'qemu-img create -f rbd' creates an RBD for either a VM to boot from, or for mounting within a VM. Whereas, the OP wants a single RBD, formatted with a cluster file system, to use as a place for multiple VM image files to reside.

I've often contemplated this same scenario, and would be quite interested in different ways people have implemented their VM infrastructure using RBD. I guess one of the advantages of using 'qemu-img create -f rbd' is that a snapshot of a single RBD would capture just the changed RBD data for that VM, whereas a snapshot of a larger RBD with OCFS2 and multiple VM images on it, would capture changes of all the VM's, not just one. It might provide more administrative agility to use the former.

Also, I guess another question would be: when an RBD is expanded, does the underlying VM that was created using 'qemu-img create -f rbd' need to be rebooted to "see" the additional space? My guess would be yes.
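
For what it's worth, the resize path would presumably look something like the sketch below (image, domain and device names are made up; whether the running guest picks the new size up without a reboot depends on the qemu/libvirt/virtio versions involved):

# grow the image on the Ceph side (rbd resize takes the size in MB)
rbd resize --size 20480 data/squeeze

# ask libvirt/qemu to propagate the new size to the running guest
# (older virsh versions expect the size in KiB rather than with a suffix)
virsh blockresize squeeze-vm vda 20G

# then grow the partition/filesystem inside the guest as usual, e.g. resize2fs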

Brad

-----Original Message-----
From: ceph-users-bounces at lists.ceph.com [mailto:ceph-users-bounces at lists.ceph.com] On Behalf Of Alex Bligh
Sent: Thursday, July 11, 2013 2:03 PM
To: Gilles Mocellin
Cc: ceph-users at lists.ceph.com
Subject: Re: [ceph-users] OCFS2 or GFS2 for cluster filesystem?


On 11 Jul 2013, at 19:25, Gilles Mocellin wrote:

> Hello,
>
> Yes, you missed that qemu can use directly RADOS volume.
> Look here :
> http://ceph.com/docs/master/rbd/qemu-rbd/
>
> Create :
> qemu-img create -f rbd rbd:data/squeeze 10G
>
> Use :
>
> qemu -m 1024 -drive format=raw,file=rbd:data/squeeze

I don't think he did. As I read it he wants his VMs to all access the same filing system, and doesn't want to use cephfs.

OCFS2 on RBD I suppose is a reasonable choice for that.

--
Alex Bligh
Tom Verdaat
2013-07-11 23:40:34 UTC
Permalink
You are right, I do want a single RBD, formatted with a cluster file
system, to use as a place for multiple VM image files to reside.

Doing everything straight from volumes would be more effective with regard to
snapshots, using CoW etc., but unfortunately for now OpenStack nova
insists on having an ephemeral disk and copying it to its local
/var/lib/nova/instances directory. If you want to be able to do live
migrations and such you need to mount a cluster filesystem at that path on
every host machine.
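
For reference, a minimal sketch of that setup on the compute nodes, assuming the kernel rbd client and an already configured OCFS2/o2cb cluster (pool, image name, size and slot count below are made up):

# once, from any node: create the shared image (size in MB) and format it
rbd create nova-instances --size 10485760
rbd map nova-instances                        # e.g. /dev/rbd0; exact path depends on udev rules
mkfs.ocfs2 -L nova-instances -N 8 /dev/rbd0   # -N = number of node slots

# on every compute node: map the image and mount it
rbd map nova-instances
mount -t ocfs2 /dev/rbd0 /var/lib/nova/instances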

And that's what my questions were about!

Tom



2013/7/12 McNamara, Bradley <Bradley.McNamara at seattle.gov>

> Correct me if I'm wrong, I'm new to this, but I think the distinction
> between the two methods is that using 'qemu-img create -f rbd' creates an
> RBD for either a VM to boot from, or for mounting within a VM. Whereas,
> the OP wants a single RBD, formatted with a cluster file system, to use as
> a place for multiple VM image files to reside.
>
> I've often contemplated this same scenario, and would be quite interested
> in different ways people have implemented their VM infrastructure using
> RBD. I guess one of the advantages of using 'qemu-img create -f rbd' is
> that a snapshot of a single RBD would capture just the changed RBD data for
> that VM, whereas a snapshot of a larger RBD with OCFS2 and multiple VM
> images on it, would capture changes of all the VM's, not just one. It
> might provide more administrative agility to use the former.
>
> Also, I guess another question would be, when a RBD is expanded, does the
> underlying VM that is created using 'qemu-img create -f rbd' need to be
> rebooted to "see" the additional space. My guess would be, yes.
>
> Brad
>
> -----Original Message-----
> From: ceph-users-bounces at lists.ceph.com [mailto:
> ceph-users-bounces at lists.ceph.com] On Behalf Of Alex Bligh
> Sent: Thursday, July 11, 2013 2:03 PM
> To: Gilles Mocellin
> Cc: ceph-users at lists.ceph.com
> Subject: Re: [ceph-users] OCFS2 or GFS2 for cluster filesystem?
>
>
> On 11 Jul 2013, at 19:25, Gilles Mocellin wrote:
>
> > Hello,
> >
> > Yes, you missed that qemu can use directly RADOS volume.
> > Look here :
> > http://ceph.com/docs/master/rbd/qemu-rbd/
> >
> > Create :
> > qemu-img create -f rbd rbd:data/squeeze 10G
> >
> > Use :
> >
> > qemu -m 1024 -drive format=raw,file=rbd:data/squeeze
>
> I don't think he did. As I read it he wants his VMs to all access the same
> filing system, and doesn't want to use cephfs.
>
> OCFS2 on RBD I suppose is a reasonable choice for that.
>
> --
> Alex Bligh
>
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
Youd, Douglas
2013-07-12 00:11:08 UTC
Permalink
Depending on which hypervisor he's using, it may not be possible to mount the RBD's natively.

For instance, the elephant in the room... ESXi.

I've pondered several architectures for presentation of Ceph to ESXi which may be related to this thread.

1) Large RBD's (2TB-512B), re-presented through an iSCSI gateway (hopefully in an HA config pair). VMFS, with VMDK's on top. (A rough sketch of the iSCSI export step follows below.)
* Seems to have been done a couple of times already, not sure of the success.
* Small number of RBD's required, so not a frequent task. Perhaps dev-time in doing the automation provisioning can be reduced.

2) Large CephFS volumes (20+ TB), re-presented through NFS gateways. VMDK's on top.
* Fewer abstraction layers, hopefully better pass-through of commands.
* Any improvements of CephFS should be available to vmware (de-dupe for instance).
* Easy to manage from a vmware perspective; NFS is pretty commonly deployed, large volumes.
* No multi-MDS means this is not viable... yet.

3) Small RBD's (10's-100's GB), re-presented through an iSCSI gateway, RDM to VM's directly.
* Possibly more appropriate for Ceph (lots of small RBDs)
* Harder to manage, more automation will be required for provisioning
* Cloning of templates etc. may be harder.

Just my 2c anyway....

Douglas Youd
Cloud Solution Architect
ZettaGrid
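
Relating to option 1 above, a minimal sketch of the iSCSI export step on a gateway node using LIO/targetcli. The image, IQN and portal address are made up, older targetcli releases call the block backstore 'iblock', and a real deployment would still need ACLs/auth plus the HA pairing:

rbd map data/esx-lun1                      # gives e.g. /dev/rbd0 on the gateway

targetcli /backstores/block create name=esx-lun1 dev=/dev/rbd0
targetcli /iscsi create iqn.2013-07.com.example:ceph-gw
targetcli /iscsi/iqn.2013-07.com.example:ceph-gw/tpg1/luns create /backstores/block/esx-lun1
targetcli /iscsi/iqn.2013-07.com.example:ceph-gw/tpg1/portals create 192.168.0.10
targetcli saveconfig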



-----Original Message-----
From: ceph-users-bounces at lists.ceph.com [mailto:ceph-users-bounces at lists.ceph.com] On Behalf Of McNamara, Bradley
Sent: Friday, 12 July 2013 8:19 AM
To: Alex Bligh; Gilles Mocellin
Cc: ceph-users at lists.ceph.com
Subject: Re: [ceph-users] OCFS2 or GFS2 for cluster filesystem?

Correct me if I'm wrong, I'm new to this, but I think the distinction between the two methods is that using 'qemu-img create -f rbd' creates an RBD for either a VM to boot from, or for mounting within a VM. Whereas, the OP wants a single RBD, formatted with a cluster file system, to use as a place for multiple VM image files to reside.

I've often contemplated this same scenario, and would be quite interested in different ways people have implemented their VM infrastructure using RBD. I guess one of the advantages of using 'qemu-img create -f rbd' is that a snapshot of a single RBD would capture just the changed RBD data for that VM, whereas a snapshot of a larger RBD with OCFS2 and multiple VM images on it, would capture changes of all the VM's, not just one. It might provide more administrative agility to use the former.

Also, I guess another question would be, when a RBD is expanded, does the underlying VM that is created using 'qemu-img create -f rbd' need to be rebooted to "see" the additional space. My guess would be, yes.

Brad

-----Original Message-----
From: ceph-users-bounces at lists.ceph.com [mailto:ceph-users-bounces at lists.ceph.com] On Behalf Of Alex Bligh
Sent: Thursday, July 11, 2013 2:03 PM
To: Gilles Mocellin
Cc: ceph-users at lists.ceph.com
Subject: Re: [ceph-users] OCFS2 or GFS2 for cluster filesystem?


On 11 Jul 2013, at 19:25, Gilles Mocellin wrote:

> Hello,
>
> Yes, you missed that qemu can use directly RADOS volume.
> Look here :
> http://ceph.com/docs/master/rbd/qemu-rbd/
>
> Create :
> qemu-img create -f rbd rbd:data/squeeze 10G
>
> Use :
>
> qemu -m 1024 -drive format=raw,file=rbd:data/squeeze

I don't think he did. As I read it he wants his VMs to all access the same filing system, and doesn't want to use cephfs.

OCFS2 on RBD I suppose is a reasonable choice for that.

--
Alex Bligh




_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Tom Verdaat
2013-07-11 23:41:18 UTC
Permalink
Hi Alex,

We're planning to deploy OpenStack Grizzly using KVM. I agree that running
every VM directly from RBD devices would be preferable, but booting from
volumes is not one of OpenStack's strengths and configuring nova to make
boot from volume the default method that works automatically is not really
feasible yet.

So the alternative is to mount a shared filesystem
on /var/lib/nova/instances of every compute node. Hence the RBD +
OCFS2/GFS2 question.
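
A sketch of the nova.conf fragment that would go with that on each compute node (flag names as I understand them for Grizzly; worth double-checking against the release docs):

# /etc/nova/nova.conf (fragment)
instances_path = /var/lib/nova/instances
live_migration_uri = qemu+ssh://%s/system
live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE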

Tom

p.s. yes I've read the rbd-openstack page
<http://ceph.com/docs/master/rbd/rbd-openstack/>, which covers images and
persistent volumes, not running instances, which is what my question is
about.


2013/7/12 Alex Bligh <alex at alex.org.uk>

> Tom,
>
> On 11 Jul 2013, at 22:28, Tom Verdaat wrote:
>
> > Actually I want my running VMs to all be stored on the same file system,
> so we can use live migration to move them between hosts.
> >
> > QEMU is not going to help because we're not using it in our
> virtualization solution.
>
> Out of interest, what are you using in your virtualization solution? Most
> things (including modern Xen) seem to use Qemu for the back end. If your
> virtualization solution does not use qemu as a back end, you can use kernel
> rbd devices straight which I think will give you better performance than
> OCFS2 on RBD devices.
>
> A
>
> >
> > 2013/7/11 Alex Bligh <alex at alex.org.uk>
> >
> > On 11 Jul 2013, at 19:25, Gilles Mocellin wrote:
> >
> > > Hello,
> > >
> > > Yes, you missed that qemu can use directly RADOS volume.
> > > Look here :
> > > http://ceph.com/docs/master/rbd/qemu-rbd/
> > >
> > > Create :
> > > qemu-img create -f rbd rbd:data/squeeze 10G
> > >
> > > Use :
> > >
> > > qemu -m 1024 -drive format=raw,file=rbd:data/squeeze
> >
> > I don't think he did. As I read it he wants his VMs to all access the
> same filing system, and doesn't want to use cephfs.
> >
> > OCFS2 on RBD I suppose is a reasonable choice for that.
> >
> > --
> > Alex Bligh
> >
> >
> >
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users at lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
> --
> Alex Bligh
>
>
>
>
>
Darryl Bond
2013-07-12 00:04:47 UTC
Permalink
An HTML attachment was scrubbed...
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20130712/e0c5d2e7/attachment.htm>
Tom Verdaat
2013-07-12 12:21:23 UTC
Permalink
Hi Darryl,

Would love to do that too but only if we can configure nova to do this
automatically. Any chance you could dig up and share how you guys
accomplished this?

From everything I've read so far Grizzly is not up for the task yet. If
I can't set it in nova.conf then it probably won't work with 3rd party
tools like Hostbill and break user self service functionality that we're
aiming for with a public cloud concept. I think we'll need this
<https://blueprints.launchpad.net/nova/+spec/improve-boot-from-volume> and this
<https://blueprints.launchpad.net/nova/+spec/bring-rbd-support-libvirt-images-type>
blueprint implemented to be able to achieve this, and of course this one
<https://blueprints.launchpad.net/horizon/+spec/improved-boot-from-volume>
for the dashboard would be nice too.

I'll do some more digging into Openstack and see how far we can get with
this.

In the mean time I've done some more research and figured out that:

* There are a bunch of other cluster file systems but GFS2 and
OCFS2 are the only open source ones I could find, and I believe
the only ones that are integrated into the Linux kernel.
* OCFS2 seems to have a lot more public information than GFS2. It
has more documentation and a living - though not very active -
mailing list.
* OCFS2 seems to be in active use by its sponsor Oracle, while I
can't find much on GFS2 from its sponsor RedHat.
* OCFS2 documentation indicates a node soft limit of 256 versus 16
for GFS2, and there are actual deployments of stable 45 TB+
production clusters.
* Performance tests from 2010 indicate OCFS2 clearly beating GFS2,
though of course newer versions have been released since.
* GFS2 has more fencing options than OCFS2.


There is not much info from the last 12 months so it's hard to get an
accurate picture. If we have to go with the shared storage approach
OCFS2 looks like the preferred option based on the info I've gathered so
far though.
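
As a footnote to the OCFS2 option: it needs an o2cb cluster definition on every node before anything can be mounted. A minimal sketch with made-up names and addresses (node names have to match the hosts' hostnames, and the indentation of the attribute lines matters to the parser):

cat > /etc/ocfs2/cluster.conf <<'EOF'
cluster:
        node_count = 2
        name = novafs

node:
        ip_port = 7777
        ip_address = 10.0.0.11
        number = 0
        name = compute1
        cluster = novafs

node:
        ip_port = 7777
        ip_address = 10.0.0.12
        number = 1
        name = compute2
        cluster = novafs
EOF

service o2cb online novafs    # init script and package names vary per distro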

Tom



Darryl Bond wrote on Fri 12-07-2013 at 10:04 [+1000]:

> Tom,
> I'm no expert as I didn't set it up, but we are using Openstack
> Grizzly with KVM/QEMU and RBD volumes for VM's.
> We boot the VMs from the RBD volumes and it all seems to work just
> fine.
> Migration works perfectly, although live - no break migration only
> works from the command line tools. The GUI uses the pause, migrate
> then un-pause mode.
> Layered snapshot/cloning works just fine through the GUI. I would say
> Grizzly has pretty good integration with CEPH.
>
> Regards
> Darryl
>
>
> On 07/12/13 09:41, Tom Verdaat wrote:
>
> > Hi Alex,
> >
> >
> >
> > We're planning to deploy OpenStack Grizzly using KVM. I agree that
> > running every VM directly from RBD devices would be preferable, but
> > booting from volumes is not one of OpenStack's strengths and
> > configuring nova to make boot from volume the default method that
> > works automatically is not really feasible yet.
> >
> >
> > So the alternative is to mount a shared filesystem
> > on /var/lib/nova/instances of every compute node. Hence the RBD +
> > OCFS2/GFS2 question.
> >
> >
> > Tom
> >
> >
> > p.s. yes I've read the rbd-openstack page which covers images and
> > persistent volumes, not running instances which is what my question
> > is about.
> >
> >
> >
> > 2013/7/12 Alex Bligh <alex at alex.org.uk>
> >
> > Tom,
> >
> >
> > On 11 Jul 2013, at 22:28, Tom Verdaat wrote:
> >
> > > Actually I want my running VMs to all be stored on the
> > same file system, so we can use live migration to move them
> > between hosts.
> > >
> > > QEMU is not going to help because we're not using it in
> > our virtualization solution.
> >
> >
> >
> > Out of interest, what are you using in your virtualization
> > solution? Most things (including modern Xen) seem to use
> > Qemu for the back end. If your virtualization solution does
> > not use qemu as a back end, you can use kernel rbd devices
> > straight which I think will give you better performance than
> > OCFS2 on RBD devices.
> >
> >
> > A
> >
> > >
> > > 2013/7/11 Alex Bligh <alex at alex.org.uk>
> > >
> > > On 11 Jul 2013, at 19:25, Gilles Mocellin wrote:
> > >
> > > > Hello,
> > > >
> > > > Yes, you missed that qemu can use directly RADOS volume.
> > > > Look here :
> > > > http://ceph.com/docs/master/rbd/qemu-rbd/
> > > >
> > > > Create :
> > > > qemu-img create -f rbd rbd:data/squeeze 10G
> > > >
> > > > Use :
> > > >
> > > > qemu -m 1024 -drive format=raw,file=rbd:data/squeeze
> > >
> > > I don't think he did. As I read it he wants his VMs to all
> > access the same filing system, and doesn't want to use
> > cephfs.
> > >
> > > OCFS2 on RBD I suppose is a reasonable choice for that.
> > >
> > > --
> > > Alex Bligh
> > >
> > >
> > >
> > >
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users at lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> >
> >
> >
> > --
> > Alex Bligh
> >
> >
> >
> >
> >
> >
> >
>
>
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Alex Bligh
2013-07-12 12:32:04 UTC
Permalink
On 12 Jul 2013, at 13:21, Tom Verdaat wrote:

> In the mean time I've done some more research and figured out that:
> * There is a bunch of other cluster file systems but GFS2 and OCFS2 are the only open source ones I could find, and I believe the only ones that are integrated in the Linux kernel.
> * OCFS2 seems to have a lot more public information than GFS2. It has more documentation and a living - though not very active - mailing list.
> * OCFS2 seems to be in active use by its sponsor Oracle, while I can't find much on GFS2 from its sponsor RedHat.
> * OCFS2 documentation indicates a node soft limit of 256 versus 16 for GFS2, and there are actual deployments of stable 45 TB+ production clusters.
> * Performance tests from 2010 indicate OCFS2 clearly beating GFS2, though of course newer versions have been released since.
> * GFS2 has more fencing options than OCFS2.

FWIW: For VM images (i.e. large files accessed by only one client at once) OCFS2 seems to perform better than GFS2. I seem to remember some performance issues with small files, and large directories with a lot of contention (multiple readers and writers of files or file metadata). You may need to forward port some of the more modern tools to your distro.

--
Alex Bligh
Wolfgang Hennerbichler
2013-07-12 12:58:22 UTC
Permalink
FYI: I'm using ocfs2 as you plan to (/var/lib/nova/instances/); it is stable, but performance isn't blasting.

--
Sent from my mobile device

On 12.07.2013, at 14:21, "Tom Verdaat" <tom at server.biz> wrote:

Hi Darryl,

Would love to do that too but only if we can configure nova to do this automatically. Any chance you could dig up and share how you guys accomplished this?

From everything I've read so far Grizzly is not up for the task yet. If I can't set it in nova.conf then it probably won't work with 3rd party tools like Hostbill and break user self service functionality that we're aiming for with a public cloud concept. I think we'll need this <https://blueprints.launchpad.net/nova/+spec/improve-boot-from-volume> and this <https://blueprints.launchpad.net/nova/+spec/bring-rbd-support-libvirt-images-type> blueprint implemented to be able to achieve this, and of course this one <https://blueprints.launchpad.net/horizon/+spec/improved-boot-from-volume> for the dashboard would be nice too.

I'll do some more digging into Openstack and see how far we can get with this.

In the mean time I've done some more research and figured out that:

* There is a bunch of other cluster file systems but GFS2 and OCFS2 are the only open source ones I could find, and I believe the only ones that are integrated in the Linux kernel.
* OCFS2 seems to have a lot more public information than GFS2. It has more documentation and a living - though not very active - mailing list.
* OCFS2 seems to be in active use by its sponsor Oracle, while I can't find much on GFS2 from its sponsor RedHat.
* OCFS2 documentation indicates a node soft limit of 256 versus 16 for GFS2, and there are actual deployments of stable 45 TB+ production clusters.
* Performance tests from 2010 indicate OCFS2 clearly beating GFS2, though of course newer versions have been released since.
* GFS2 has more fencing options than OCFS2.

There is not much info from the last 12 months so it's hard get an accurate picture. If we have to go with the shared storage approach OCFS2 looks like the preferred option based on the info I've gathered so far though.

Tom



Darryl Bond wrote on Fri 12-07-2013 at 10:04 [+1000]:
Tom,
I'm no expert as I didn't set it up, but we are using Openstack Grizzly with KVM/QEMU and RBD volumes for VM's.
We boot the VMs from the RBD volumes and it all seems to work just fine.
Migration works perfectly, although live - no break migration only works from the command line tools. The GUI uses the pause, migrate then un-pause mode.
Layered snapshot/cloning works just fine through the GUI. I would say Grizzly has pretty good integration with CEPH.

Regards
Darryl

On 07/12/13 09:41, Tom Verdaat wrote:

Hi Alex,


We're planning to deploy OpenStack Grizzly using KVM. I agree that running every VM directly from RBD devices would be preferable, but booting from volumes is not one of OpenStack's strengths and configuring nova to make boot from volume the default method that works automatically is not really feasible yet.


So the alternative is to mount a shared filesystem on /var/lib/nova/instances of every compute node. Hence the RBD + OCFS2/GFS2 question.


Tom


p.s. yes I've read the rbd-openstack page <http://ceph.com/docs/master/rbd/rbd-openstack/> which covers images and persistent volumes, not running instances which is what my question is about.


2013/7/12 Alex Bligh <alex at alex.org.uk>
Tom,

On 11 Jul 2013, at 22:28, Tom Verdaat wrote:

> Actually I want my running VMs to all be stored on the same file system, so we can use live migration to move them between hosts.
>
> QEMU is not going to help because we're not using it in our virtualization solution.


Out of interest, what are you using in your virtualization solution? Most things (including modern Xen) seem to use Qemu for the back end. If your virtualization solution does not use qemu as a back end, you can use kernel rbd devices straight which I think will give you better performance than OCFS2 on RBD devices.

A

>
> 2013/7/11 Alex Bligh <alex at alex.org.uk>
>
> On 11 Jul 2013, at 19:25, Gilles Mocellin wrote:
>
> > Hello,
> >
> > Yes, you missed that qemu can use directly RADOS volume.
> > Look here :
> > http://ceph.com/docs/master/rbd/qemu-rbd/
> >
> > Create :
> > qemu-img create -f rbd rbd:data/squeeze 10G
> >
> > Use :
> >
> > qemu -m 1024 -drive format=raw,file=rbd:data/squeeze
>
> I don't think he did. As I read it he wants his VMs to all access the same filing system, and doesn't want to use cephfs.
>
> OCFS2 on RBD I suppose is a reasonable choice for that.
>
> --
> Alex Bligh
>
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


--
Alex Bligh









_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Alex Bligh
2013-07-12 12:32:04 UTC
Permalink
On 12 Jul 2013, at 13:21, Tom Verdaat wrote:

> In the mean time I've done some more research and figured out that:
> ? There is a bunch of other cluster file systems but GFS2 and OCFS2 are the only open source ones I could find, and I believe the only ones that are integrated in the Linux kernel.
> ? OCFS2 seems to have a lot more public information than GFS2. It has more documentation and a living - though not very active - mailing list.
> ? OCFS2 seems to be in active use by its sponsor Oracle, while I can't find much on GFS2 from its sponsor RedHat.
> ? OCFS2 documentation indicates a node soft limit of 256 versus 16 for GFS2, and there are actual deployments of stable 45 TB+ production clusters.
> ? Performance tests from 2010 indicate OCFS2 clearly beating GFS2, though of course newer versions have been released since.
> ? GFS2 has more fencing options than OCFS2.

FWIW: For VM images (i.e. large files accessed by only one client at once) OCFS2 seems to perform better than GFS2. I seem to remember some performance issues with small files, and large directories with a lot of contention (multiple readers and writers of files or file metadata). You may need to forward port some of the more modern tools to your distro.

--
Alex Bligh
Wolfgang Hennerbichler
2013-07-12 12:58:22 UTC
Permalink
FYI: i'm using ocfs2 as you plan to (/var/Lib/nova/instances/) it is stable, but Performance isnt blasting.

--
Sent from my mobile device

Tom Verdaat
2013-07-12 12:21:23 UTC
Permalink
Hi Darryl,

Would love to do that too but only if we can configure nova to do this
automatically. Any chance you could dig up and share how you guys
accomplished this?

From everything I've read so far, Grizzly is not up for the task yet. If
I can't set it in nova.conf then it probably won't work with 3rd-party
tools like Hostbill, and it would break the user self-service functionality we're
aiming for with a public cloud concept. I think we'll need the
improve-boot-from-volume (https://blueprints.launchpad.net/nova/+spec/improve-boot-from-volume)
and bring-rbd-support-libvirt-images-type
(https://blueprints.launchpad.net/nova/+spec/bring-rbd-support-libvirt-images-type)
blueprints implemented to be able to achieve this, and of course the
improved-boot-from-volume one for the Horizon dashboard
(https://blueprints.launchpad.net/horizon/+spec/improved-boot-from-volume)
would be nice too.

I'll do some more digging into Openstack and see how far we can get with
this.
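
For reference, the bring-rbd-support-libvirt-images-type blueprint mentioned above ends up adding nova.conf options along these lines in a later release (option names can differ per version and the pool name is a placeholder, so treat this as a sketch rather than something that already works on Grizzly):

[DEFAULT]
libvirt_images_type = rbd
libvirt_images_rbd_pool = nova-instances
libvirt_images_rbd_ceph_conf = /etc/ceph/ceph.conf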

In the meantime I've done some more research and figured out that:

 * There are a bunch of other cluster file systems, but GFS2 and
 OCFS2 are the only open-source ones I could find, and I believe
 the only ones that are integrated into the Linux kernel.
 * OCFS2 seems to have a lot more public information than GFS2. It
 has more documentation and a living - though not very active -
 mailing list.
 * OCFS2 seems to be in active use by its sponsor Oracle, while I
 can't find much on GFS2 from its sponsor Red Hat.
 * OCFS2 documentation indicates a node soft limit of 256 versus 16
 for GFS2, and there are actual deployments of stable 45 TB+
 production clusters.
 * Performance tests from 2010 indicate OCFS2 clearly beating GFS2,
 though of course newer versions have been released since.
 * GFS2 has more fencing options than OCFS2.


There is not much info from the last 12 months, so it's hard to get an
accurate picture. If we have to go with the shared storage approach,
OCFS2 looks like the preferred option based on the info I've gathered so
far though.

Tom



Darryl Bond wrote on Fri 12-07-2013 at 10:04 [+1000]:

> Tom,
> I'm no expert as I didn't set it up, but we are using OpenStack
> Grizzly with KVM/QEMU and RBD volumes for VMs.
> We boot the VMs from the RBD volumes and it all seems to work just
> fine.
> Migration works perfectly, although live (no-break) migration only
> works from the command-line tools. The GUI uses the pause, migrate,
> then un-pause mode.
> Layered snapshot/cloning works just fine through the GUI. I would say
> Grizzly has pretty good integration with Ceph.
>
> Regards
> Darryl
>
>
> On 07/12/13 09:41, Tom Verdaat wrote:
>
> > Hi Alex,
> >
> >
> >
> > We're planning to deploy OpenStack Grizzly using KVM. I agree that
> > running every VM directly from RBD devices would be preferable, but
> > booting from volumes is not one of OpenStack's strengths and
> > configuring nova to make boot from volume the default method that
> > works automatically is not really feasible yet.
> >
> >
> > So the alternative is to mount a shared filesystem
> > on /var/lib/nova/instances of every compute node. Hence the RBD +
> > OCFS2/GFS2 question.
> >
> >
> > Tom
> >
> >
> > p.s. yes I've read the rbd-openstack page which covers images and
> > persistent volumes, not running instances which is what my question
> > is about.
> >
> >
> >
> > 2013/7/12 Alex Bligh <alex at alex.org.uk>
> >
> > Tom,
> >
> >
> > On 11 Jul 2013, at 22:28, Tom Verdaat wrote:
> >
> > > Actually I want my running VMs to all be stored on the
> > same file system, so we can use live migration to move them
> > between hosts.
> > >
> > > QEMU is not going to help because we're not using it in
> > our virtualization solution.
> >
> >
> >
> > Out of interest, what are you using in your virtualization
> > solution? Most things (including modern Xen) seem to use
> > Qemu for the back end. If your virtualization solution does
> > not use qemu as a back end, you can use kernel rbd devices
> > straight which I think will give you better performance than
> > OCFS2 on RBD devices.
> >
> >
> > A
> >
> > >
> > > 2013/7/11 Alex Bligh <alex at alex.org.uk>
> > >
> > > On 11 Jul 2013, at 19:25, Gilles Mocellin wrote:
> > >
> > > > Hello,
> > > >
> > > > Yes, you missed that qemu can use directly RADOS volume.
> > > > Look here :
> > > > http://ceph.com/docs/master/rbd/qemu-rbd/
> > > >
> > > > Create :
> > > > qemu-img create -f rbd rbd:data/squeeze 10G
> > > >
> > > > Use :
> > > >
> > > > qemu -m 1024 -drive format=raw,file=rbd:data/squeeze
> > >
> > > I don't think he did. As I read it he wants his VMs to all
> > access the same filing system, and doesn't want to use
> > cephfs.
> > >
> > > OCFS2 on RBD I suppose is a reasonable choice for that.
> > >
> > > --
> > > Alex Bligh
> > >
> > >
> > >
> > >
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users at lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> >
> >
> >
> > --
> > Alex Bligh
> >
> >
> >
> >
> >
> >
> >
>
>
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20130712/085746d3/attachment-0002.htm>
Darryl Bond
2013-07-12 00:04:47 UTC
Permalink
An HTML attachment was scrubbed...
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20130712/e0c5d2e7/attachment-0002.htm>
McNamara, Bradley
2013-07-11 22:18:47 UTC
Permalink
Correct me if I'm wrong, I'm new to this, but I think the distinction between the two methods is that using 'qemu-img create -f rbd' creates an RBD for either a VM to boot from, or for mounting within a VM, whereas the OP wants a single RBD, formatted with a cluster file system, to use as a place for multiple VM image files to reside.

I've often contemplated this same scenario, and would be quite interested in the different ways people have implemented their VM infrastructure using RBD. I guess one of the advantages of using 'qemu-img create -f rbd' is that a snapshot of a single RBD would capture just the changed RBD data for that VM, whereas a snapshot of a larger RBD with OCFS2 and multiple VM images on it would capture changes to all the VMs, not just one. It might provide more administrative agility to use the former.

Also, I guess another question would be: when an RBD is expanded, does a VM whose disk was created using 'qemu-img create -f rbd' need to be rebooted to "see" the additional space? My guess would be yes.

Brad
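
On the resize question: the RBD itself can be grown online, and whether the consumer notices without a reboot depends on how the image is attached. A rough sketch, with image names, sizes and the domain name as placeholders (the exact tunefs.ocfs2 resize syntax is worth double-checking against your ocfs2-tools version):

rbd resize --size 204800 rbd/nova-instances     # grow the image; size is in MB
tunefs.ocfs2 -S /dev/rbd0                       # OCFS2-on-RBD case: grow the filesystem to fill the device
virsh blockresize somevm vda 200G               # librbd-backed guest disk: may avoid a reboot on recent qemu/libvirt

Whether a mapped kernel device or a running guest picks up the new size without a remount or reboot depends on the kernel, qemu and libvirt versions involved, so it is worth testing before relying on it.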

-----Original Message-----
From: ceph-users-bounces at lists.ceph.com [mailto:ceph-users-bounces at lists.ceph.com] On Behalf Of Alex Bligh
Sent: Thursday, July 11, 2013 2:03 PM
To: Gilles Mocellin
Cc: ceph-users at lists.ceph.com
Subject: Re: [ceph-users] OCFS2 or GFS2 for cluster filesystem?


On 11 Jul 2013, at 19:25, Gilles Mocellin wrote:

> Hello,
>
> Yes, you missed that qemu can use directly RADOS volume.
> Look here :
> http://ceph.com/docs/master/rbd/qemu-rbd/
>
> Create :
> qemu-img create -f rbd rbd:data/squeeze 10G
>
> Use :
>
> qemu -m 1024 -drive format=raw,file=rbd:data/squeeze

I don't think he did. As I read it he wants his VMs to all access the same filing system, and doesn't want to use cephfs.

OCFS2 on RBD I suppose is a reasonable choice for that.

--
Alex Bligh
Tom Verdaat
2013-07-11 23:41:18 UTC
Permalink
Hi Alex,

We're planning to deploy OpenStack Grizzly using KVM. I agree that running
every VM directly from RBD devices would be preferable, but booting from
volumes is not one of OpenStack's strengths and configuring nova to make
boot from volume the default method that works automatically is not really
feasible yet.

So the alternative is to mount a shared filesystem
on /var/lib/nova/instances of every compute node. Hence the RBD +
OCFS2/GFS2 question.

Tom

p.s. yes, I've read the rbd-openstack page
(http://ceph.com/docs/master/rbd/rbd-openstack/), which covers images and
persistent volumes, not running instances, which is
what my question is about.
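
To make the /var/lib/nova/instances idea concrete, the mount on each compute node could be expressed as an fstab entry roughly like the following (the device path is a placeholder and depends on how the RBD is mapped; the RBD mapping and the O2CB cluster stack both have to be up before the mount can succeed):

/dev/rbd/rbd/nova-instances  /var/lib/nova/instances  ocfs2  _netdev,noatime  0 0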


2013/7/12 Alex Bligh <alex at alex.org.uk>

> Tom,
>
> On 11 Jul 2013, at 22:28, Tom Verdaat wrote:
>
> > Actually I want my running VMs to all be stored on the same file system,
> so we can use live migration to move them between hosts.
> >
> > QEMU is not going to help because we're not using it in our
> virtualization solution.
>
> Out of interest, what are you using in your virtualization solution? Most
> things (including modern Xen) seem to use Qemu for the back end. If your
> virtualization solution does not use qemu as a back end, you can use kernel
> rbd devices straight which I think will give you better performance than
> OCFS2 on RBD devices.
>
> A
>
> >
> > 2013/7/11 Alex Bligh <alex at alex.org.uk>
> >
> > On 11 Jul 2013, at 19:25, Gilles Mocellin wrote:
> >
> > > Hello,
> > >
> > > Yes, you missed that qemu can use directly RADOS volume.
> > > Look here :
> > > http://ceph.com/docs/master/rbd/qemu-rbd/
> > >
> > > Create :
> > > qemu-img create -f rbd rbd:data/squeeze 10G
> > >
> > > Use :
> > >
> > > qemu -m 1024 -drive format=raw,file=rbd:data/squeeze
> >
> > I don't think he did. As I read it he wants his VMs to all access the
> same filing system, and doesn't want to use cephfs.
> >
> > OCFS2 on RBD I suppose is a reasonable choice for that.
> >
> > --
> > Alex Bligh
> >
> >
> >
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users at lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
> --
> Alex Bligh
>
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20130712/464d2f4e/attachment-0002.htm>
Alex Bligh
2013-07-11 21:03:21 UTC
Permalink
On 11 Jul 2013, at 19:25, Gilles Mocellin wrote:

> Hello,
>
> Yes, you missed that qemu can use directly RADOS volume.
> Look here :
> http://ceph.com/docs/master/rbd/qemu-rbd/
>
> Create :
> qemu-img create -f rbd rbd:data/squeeze 10G
>
> Use :
>
> qemu -m 1024 -drive format=raw,file=rbd:data/squeeze

I don't think he did. As I read it he wants his VMs to all access the same filing system, and doesn't want to use cephfs.

OCFS2 on RBD I suppose is a reasonable choice for that.

--
Alex Bligh
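
For completeness: when qemu/KVM is driven through libvirt (as OpenStack does), the same kind of RBD-backed disk is normally expressed in the domain XML roughly as follows (monitor host, pool and image names are placeholders, and a cephx <auth> element would also be needed if authentication is enabled):

<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='rbd' name='data/squeeze'>
    <host name='mon1.example.com' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>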
Dzianis Kahanovich
2013-07-15 14:28:35 UTC
Permalink
Tom Verdaat wrote:

> 3. Is anybody doing this already and willing to share their experience?

Relatively, yes. Before Ceph I used drbd+ocfs2 (with the o2cb stack); now both of those
servers run inside VMs on the same ocfs2. It behaves much the same as it did over drbd, but
just remember to turn the rbd cache "off" (RTFM).
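
The cache knob is a client-side option; a minimal ceph.conf sketch for the qemu/librbd client side would be:

[client]
    rbd cache = false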

I have not yet solved a problem with VMs rebooting while Ceph is doing heavy work (one of the
nodes restarting, or just recovering; even with size=3 on 3 nodes the data stays available). IMHO
it is a problem with the OCFS2 internal heartbeat (heartbeat=local), but that is just
my opinion. I will run tests on this problem soon.

But even in the case of a reboot or a killed qemu, OCFS2 stays clean (usually fsck finds nothing
to fix), so data integrity is good.

I have not tried GFS2, but it is considered slower than OCFS2. Also, GFS2 is "too
Red Hat": it is absent from some distros and the userspace is hard to install yourself
because of the large number of dependencies.

OCFS2 also comes with two cluster stacks: the O2CB stack (internal) and the user stack. O2CB is
good and kernel-side, but has no byte-range locking feature. The user stack does have byte-range
locking via userspace support, but it also requires a lot of userspace components
(packaged at least in Oracle Linux and SUSE, but I use Gentoo with simple
heartbeat, no corosync, so I don't want to do too much extra work).

So, if you do not need byte-range locking, I suggest using OCFS2 with the simple O2CB
stack.

--
WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/
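
For anyone going the O2CB route, the static cluster definition lives in /etc/ocfs2/cluster.conf on every node. A minimal two-node sketch (cluster name, node names and IPs are placeholders, and the node names must match the machines' hostnames):

cluster:
        node_count = 2
        name = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.0.1
        number = 0
        name = node1
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.0.2
        number = 1
        name = node2
        cluster = ocfs2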
Dzianis Kahanovich
2013-07-15 14:28:35 UTC
Permalink
Tom Verdaat ?????:

> 3. Is anybody doing this already and willing to share their experience?

Relative yes. Before ceph I was use drbd+ocfs2 (with o2cb stack), now both this
servers are inside of VM's with same ocfs2. There are same like over drbd, but
just remeber to turn rbd cache to "off" (RTFM).

I just not solve (while?) problem with VMs reboot on ceph hard works (one of
nodes restart or just recovery, even if size=3 on 3 nodes - data is ready). IMHO
there are OCFS2 internal heartbeat (heartbeat=local) problem, but there are just
my IMHO. I will make tests about this problem soon.

But, even in case of reboot or killing qemu - OCFS2 still clean (usually nothing
fixed by fsck), so data integrity is good.

I not trying GFS2, but this considered slower then OCFS2. Also GFS2 is "too
RedHat's" - there absent in some of distros and hard to install [userspace] self
via large number of dependences.

OCFS2 also present in 2 ways: O2CB stack (internal) and user stack. O2CB is
good, kernel-side, but have no byte range locking feature. User stack have byte
range locking - over userspace support, but required many userspace stuff too
(present at least in Oracle linux or SuSE, but I use Gentoo and simple
heartbeat, no corosync, so I don't want to work too much).

So, if you need no byte-range locking, I suggest to use OCFS2 with simple O2CB
stack.

--
WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/