Discussion:
[ceph-users] Unhelpful behaviour of ceph-volume lvm batch with >1 NVME card for block.db
Matthew Vernon
2018-11-14 14:10:06 UTC
Hi,

We currently deploy our filestore OSDs with ceph-disk (via
ceph-ansible), and I was looking at using ceph-volume as we migrate to
bluestore.

Our servers have 60 OSDs and 2 NVME cards; each OSD is made up of a
single hdd, and an NVME partition for the journal.

If, however, I do:
ceph-volume lvm batch /dev/sda /dev/sdb [...] /dev/nvme0n1 /dev/nvme1n1
then I get (inter alia):

Solid State VG:
Targets: block.db Total size: 1.82 TB
Total LVs: 2 Size per LV: 931.51 GB

Devices: /dev/nvme0n1, /dev/nvme1n1

i.e. ceph-volume is going to make a single VG containing both NVME
devices, and split that up into LVs to use for block.db

It seems to me that this is straightforwardly the wrong answer - either
NVME failing will now take out *every* OSD on the host, whereas the
obvious alternative (one VG per NVME, divide those into LVs) would give
you just as good performance, but you'd only lose 1/2 the OSDs if an
NVME card failed.
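
For illustration, the layout I have in mind would be something like the
following (VG/LV names and sizes made up; roughly 30 block.db LVs per
card for our 60 OSDs):

vgcreate ceph-db-0 /dev/nvme0n1
vgcreate ceph-db-1 /dev/nvme1n1
# one block.db LV per OSD, split evenly across the two cards
for i in $(seq 0 29);  do lvcreate -L 30G -n db-$i ceph-db-0; done
for i in $(seq 30 59); do lvcreate -L 30G -n db-$i ceph-db-1; done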

Am I missing something obvious here?

I appreciate I /could/ do it all myself, but even using ceph-ansible
that's going to be very tiresome...

Regards,

Matthew
--
The Wellcome Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
Alfredo Deza
2018-11-14 15:17:48 UTC
Post by Matthew Vernon
Hi,
We currently deploy our filestore OSDs with ceph-disk (via
ceph-ansible), and I was looking at using ceph-volume as we migrate to
bluestore.
Our servers have 60 OSDs and 2 NVME cards; each OSD is made up of a
single hdd, and an NVME partition for the journal.
ceph-volume lvm batch /dev/sda /dev/sdb [...] /dev/nvme0n1 /dev/nvme1n1
Targets: block.db Total size: 1.82 TB
Total LVs: 2 Size per LV: 931.51 GB
Devices: /dev/nvme0n1, /dev/nvme1n1
i.e. ceph-volume is going to make a single VG containing both NVME
devices, and split that up into LVs to use for block.db
It seems to me that this is straightforwardly the wrong answer - either
NVME failing will now take out *every* OSD on the host, whereas the
obvious alternative (one VG per NVME, divide those into LVs) would give
you just as good performance, but you'd only lose 1/2 the OSDs if an
NVME card failed.
Am I missing something obvious here?
This is exactly the intended behavior. The `lvm batch` sub-command is
meant to simplify LV management, and by doing so, it has to adhere to
some constraints.

These constraints (making a single VG out of both NVMe devices) make
the implementation far easier and more robust, and allow us to
accommodate a lot of different scenarios, but I do see how this might
be unexpected.
Post by Matthew Vernon
I appreciate I /could/ do it all myself, but even using ceph-ansible
that's going to be very tiresome...
Right, so you are able to chop the devices up in any way you find more
acceptable (creating the VGs/LVs yourself and then passing them to
`lvm create`).
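
For example, something along these lines (VG/LV names and the LV size
here are just placeholders):

vgcreate ceph-db-0 /dev/nvme0n1
# one block.db LV per OSD that this NVMe will serve
lvcreate -L 30G -n db-0 ceph-db-0
# point the OSD's data at the whole HDD and block.db at that LV
ceph-volume lvm create --bluestore --data /dev/sda --block.db ceph-db-0/db-0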

There is a bit of wiggle room here though: you could deploy half of the
devices first, which would force `lvm batch` to use just one NVMe:

ceph-volume lvm batch /dev/sda [...] /dev/nvme0n1

And then the rest of the devices:

ceph-volume lvm batch /dev/sdb [...] /dev/nvme1n1
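
If you want to double check what each half will end up doing before
anything gets created, `lvm batch` should also let you pass --report to
just print the breakdown, e.g.:

ceph-volume lvm batch --report /dev/sda [...] /dev/nvme0n1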