Discussion:
Newbie question: stretch ceph cluster
ST Wong (ITSC)
2018-02-09 14:46:11 UTC
Hi, I'm new to Ceph and got a task to set up Ceph with a kind of DR feature. We have two data centers in the same campus connected by 10Gb links. I wonder if it's possible to set up a Ceph cluster with the following components in each data center:


3 x mon + mds + mgr

3 x OSD (replication factor = 2, between data centers)


So that any one of the following failures won't affect the cluster's operation and data availability:

* failure of any one component in either data center
* failure of either one of the data centers


Is it possible?

In the case of one data center failing, it seems replication can't occur any more. Is there any CRUSH rule that can achieve this purpose?


Sorry for the newbie question.


Thanks a lot.

Regards

/st wong
Kai Wagner
2018-02-09 14:59:37 UTC
Hi and welcome,
Post by ST Wong (ITSC)
Hi, I'm new to Ceph and got a task to set up Ceph with a kind of DR
feature. We have two data centers in the same campus connected by 10Gb
links. I wonder if it's possible to set up a Ceph cluster with the
following components in each data center:
3 x mon + mds + mgr
3 x OSD (replication factor = 2, between data centers)
So that any one of the following failures won't affect the cluster's
operation and data availability:
* failure of any one component in either data center
* failure of either one of the data centers
Is it possible?
In general this is possible, but I would say that replica=2 is not
a good idea. In a failure scenario, or just during maintenance when one
DC is powered off, a single disk failing in the other DC can already
lead to data loss. My advice here would be: if at all
possible, please don't do replica=2.
Post by ST Wong (ITSC)
In the case of one data center failing, it seems replication can't
occur any more. Is there any CRUSH rule that can achieve this purpose?
--
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
Luis Periquito
2018-02-09 15:34:05 UTC
Post by Kai Wagner
Hi and welcome,
Hi, I'm new to Ceph and got a task to set up Ceph with a kind of DR feature.
We have two data centers in the same campus connected by 10Gb links. I
wonder if it's possible to set up a Ceph cluster with the following
components in each data center:
3 x mon + mds + mgr
In this scenario you wouldn't be any better off, as losing a room means
losing half of your cluster. Can you run a MON somewhere else that
would be able to continue if you lose one of the rooms?

As for MGR and MDS, they're (recommended to be) active/passive, so one
per room would be enough.
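
A quick way to check which daemons are active and which are standby (a
sketch using standard ceph CLI commands; the mds part assumes you
already have a CephFS filesystem):

ceph mds stat    # shows the active MDS and any standby MDS daemons
ceph -s          # the "mgr:" line shows the active mgr and its standbys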
Post by Kai Wagner
3 x OSD (replication factor = 2, between data centers)
Replicated with size=2 is a bad idea. You can have size=4 and
min_size=2, and have a CRUSH map with a rule something like:

rule crosssite {
        id 0
        type replicated
        min_size 4
        max_size 4
        step take default
        step choose firstn 2 type room
        step chooseleaf firstn 2 type host
        step emit
}

This will store 4 copies: 2 on different hosts in each of 2 different rooms.
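
Roughly how that could be applied (a sketch; "mypool" is just a
placeholder pool name, and on pre-Luminous releases the pool setting is
called crush_ruleset instead of crush_rule):

# extract, edit and re-inject the CRUSH map with the rule above
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt   # add the crosssite rule here
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new
# point the pool at the rule: 4 copies, 2 required to keep serving I/O
ceph osd pool set mypool crush_rule crosssite
ceph osd pool set mypool size 4
ceph osd pool set mypool min_size 2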
ST Wong (ITSC)
2018-02-14 02:12:56 UTC
Hi,

Thanks for your advice,

-----Original Message-----
From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf Of Luis Periquito
Sent: Friday, February 09, 2018 11:34 PM
To: Kai Wagner
Cc: Ceph Users
Subject: Re: [ceph-users] Newbie question: stretch ceph cluster
Post by Kai Wagner
Hi and welcome,
Hi, I'm new to Ceph and got a task to set up Ceph with a kind of DR feature.
We have two data centers in the same campus connected by 10Gb links. I
wonder if it's possible to set up a Ceph cluster with the following
components in each data center:
3 x mon + mds + mgr
In this scenario you wouldn't be any better off, as losing a room means losing half of your cluster. Can you run a MON somewhere else that would be able to continue if you lose one of the rooms?
Will it be okay to have 3 x MON per DC, so that we still have 3 x MON in case of losing 1 DC? Or do we need more to cover a double fault, where losing 1 DC plus the failure of any MON in the remaining DC would make the cluster stop working?
Post by Kai Wagner
As for MGR and MDS, they're (recommended to be) active/passive, so one per room would be enough.
3 x OSD (replication factor = 2, between data centers)
Replicated with size=2 is a bad idea. You can have size=4 and
min_size=2, and have a CRUSH map with a rule something like:
rule crosssite {
        id 0
        type replicated
        min_size 4
        max_size 4
        step take default
        step choose firstn 2 type room
        step chooseleaf firstn 2 type host
        step emit
}
Post by Kai Wagner
This will store 4 copies: 2 on different hosts in each of 2 different rooms.
Does it mean that for new data written to hostA:roomA, replication will take place as follows?
1. from hostA:roomA to hostB:roomA
2. from hostA:roomA to hostA:roomB
3. from hostB:roomA to hostB:roomB

If it works this way, can the copy in step 3 be skipped, so that for each piece of data there are 3 replicas - the original one, a replica in the same room, and a replica in the other room - in order to save some space?

Besides, I would also like to ask: is it correct that the cluster will continue to work (degraded) if one room is lost?

Is there any better way to set up such a 'stretched' cluster between 2 DCs? They're an extension of each other rather than a real DR site...

Sorry for the newbie questions; we'll proceed to study and experiment more on this.

Thanks a lot.
Maged Mokhtar
2018-02-14 08:19:51 UTC
Hi,

You need to set min_size to 2 on the pool, so that the cluster keeps serving I/O when one room is down (this is the pool's min_size, not the min_size line inside the CRUSH rule, which only limits which pool sizes the rule applies to).

The exact location and replication flow when a client writes data
depend on the object name and the number of PGs. The CRUSH rule
determines which OSDs will serve a PG; the first one is the primary OSD
for that PG. The client computes the PG from the object name and writes
the object to the primary OSD for that PG; the primary OSD is then
responsible for replicating to the other OSDs serving this PG. So for
the same client, some objects will be sent to data center 1 and some to
data center 2, and the OSDs will do the rest.
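
You can see this mapping for any given object name (a sketch; "mypool"
and "myobject" are just placeholders):

ceph osd map mypool myobject
# prints the PG the object hashes to and its up/acting OSD set;
# the primary OSD is the first one listed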

The other point is how to set up monitors across 2 data centers and
still be able to function if one goes down. This is tricky, since the
monitors need to keep a majority quorum (hence the usual odd number).
This link is quite interesting; I am not sure if there are better ways
to do it:

https://www.sebastien-han.fr/blog/2013/01/28/ceph-geo-replication-sort-of/


Maged
Alex Gorbachev
2018-02-17 06:28:21 UTC
Post by Maged Mokhtar
The other point is how to set up monitors across 2 data centers and
still be able to function if one goes down. This is tricky, since the
monitors need to keep a majority quorum. This link is quite
interesting; I am not sure if there are better ways to do it:
https://www.sebastien-han.fr/blog/2013/01/28/ceph-geo-replication-sort-of/
FYI, I had this reply from Vincent Godin; you can search the ML for the
full thread:

Hello

We have a similar design: two data centers at short distance (sharing
the same level 2 network) and one data center at long range (more than
100 km) for our Ceph cluster. Let's call these sites A1, A2 and B.

We set 2 Mons on A1, 2 Mons on A2 and 1 Mon on B. A1 and A2 share the
same level 2 network. We need routing to connect to B.

We set up an HSRP gateway on A1 & A2 to reach the B site. Let's call
them GwA1 and GwA2, with the default being GwA1.

We set up an HSRP gateway on site B. Let's call them GwB1 and GwB2,
with the default being GwB1. GwB1 is connected to A1 and A2 via GwA1,
and GwB2 is connected to A1 and A2 via GwA2. We set a simple LACP
between the GwB1 and GwA1 ports and another between the GwB2 and GwA2
ports. (If the GwA1 port goes down, then the GwB1 port will go down
too.)

So if everything is OK, the Mon on site B can see all OSDs and Mons on
both sites A1 & A2 via GwB1, then GwA1. Quorum is reached and Ceph is
healthy.

If the A2 site is down, the Mon on site B can see all OSDs and Mons on
site A1 via GwB1, then GwA1. Quorum is reached and Ceph is available.

If the A1 site is down, both HSRP gateways will fail over. The Mon on
site B will see the Mons and OSDs of the A2 site via GwB2, then GwA2.
Quorum is reached and Ceph is still available.

If the L2 links between A1 & A2 are cut, the A2 site will be isolated.
The Mon on site B can see all OSDs and Mons on A1 via GwB1, then GwA1,
but cannot see the Mons and OSDs of the A2 site because of the link
failure. Quorum will be reached on the A1 side with 3 Mons (the 2 on A1
plus the one on B) and Ceph will still be available.
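
As a rough sketch of that 2+2+1 monitor layout in ceph.conf (hostnames
and addresses below are purely illustrative):

[global]
# 2 mons in site A1, 2 in site A2, 1 tie-breaker in site B:
# losing any single site still leaves 3 of 5 monitors, i.e. a quorum
mon_initial_members = mon-a1-1, mon-a1-2, mon-a2-1, mon-a2-2, mon-b-1
mon_host = 10.1.1.11, 10.1.1.12, 10.1.2.11, 10.1.2.12, 10.1.3.11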
--
Alex Gorbachev
Storcium
ST Wong (ITSC)
2018-02-20 02:16:05 UTC
Hi,

Thanks for your advice. Will try it out.

Best Regards,
/ST Wong

From: Maged Mokhtar [mailto:***@petasan.org]
Sent: Wednesday, February 14, 2018 4:20 PM
To: ST Wong (ITSC)
Cc: Luis Periquito; Kai Wagner; Ceph Users
Subject: Re: [ceph-users] Newbie question: stretch ceph cluster


ST Wong (ITSC)
2018-02-14 01:51:05 UTC
Hi,

Thanks a lot,

From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf Of Kai Wagner
Sent: Friday, February 09, 2018 11:00 PM
To: ceph-***@lists.ceph.com
Subject: Re: [ceph-users] Newbie question: stretch ceph cluster


Hi and welcome,

On 09.02.2018 15:46, ST Wong (ITSC) wrote:

Hi, I'm new to Ceph and got a task to set up Ceph with a kind of DR feature. We have two data centers in the same campus connected by 10Gb links. I wonder if it's possible to set up a Ceph cluster with the following components in each data center:

3 x mon + mds + mgr

3 x OSD (replication factor = 2, between data centers)

So that any one of the following failures won't affect the cluster's operation and data availability:

* failure of any one component in either data center
* failure of either one of the data centers

Is it possible?
In general this is possible, but I would say that replica=2 is not a good idea. In a failure scenario, or just during maintenance when one DC is powered off, a single disk failing in the other DC can already lead to data loss. My advice here would be: if at all possible, please don't do replica=2.
Then we have to use at least replica > 2, with replication between the DCs and also among OSDs in the same DC. Is that correct? Thanks again.


